Message boards : Number crunching : Validation error

Author Message
caffeineyellow5
Joined: 30 Jul 14 | Posts: 225 | Credit: 2,658,976,345 | RAC: 0
Message 44390 - Posted: 2 Sep 2016 | 7:04:31 UTC
Last modified: 2 Sep 2016 | 7:15:38 UTC

I got 2 validation errors on one system. The tasks completed fine: there were no known shutdown or power-loss issues. These 2 WUs represent 2 days, 2 hours, 45 minutes of work done and done well. What would cause a validation error if the WU itself showed no problems and finished cleanly?
https://www.gpugrid.net/results.php?userid=109388&offset=0&show_names=1&state=4&appid=

I will also let you know if more tasks on this same system fail to validate after finishing without errors.

Just voicing a concern and asking a question. If no feedback is given on this, I won't be upset. If you have a reason why a task would be unable to validate, please let me know.
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Richard Haselgrove
Joined: 11 Jul 09 | Posts: 1616 | Credit: 8,055,894,351 | RAC: 19,323,994
Message 44395 - Posted: 2 Sep 2016 | 9:59:36 UTC - in response to Message 44390.

Other users aren't allowed to see tasks from a UserID list - you have to link by HostID for that.

My guess would be a problem with one (or more) of the uploaded files.

Beyond
Joined: 23 Nov 08 | Posts: 1112 | Credit: 6,162,416,256 | RAC: 0
Message 44402 - Posted: 2 Sep 2016 | 15:10:12 UTC
Last modified: 2 Sep 2016 | 15:12:11 UTC

Richard, here's his machine. BOINC says it has 2 x GTX 980 GPUs:

http://www.gpugrid.net/results.php?hostid=367847&offset=0&show_names=1&state=0&appid=

That box shows a lot of errors, as does another one that shows: 3 NVIDIA GeForce GTX 980 Ti (4095MB) GPUs. Most of the errors on the 2nd box (6) are from the long GIANNI WUs. (not counting all the bad ADRIAs)

http://www.gpugrid.net/results.php?hostid=335350&offset=0&show_names=1&state=5&appid=

caffeineyellow5
Message 44418 - Posted: 4 Sep 2016 | 12:01:46 UTC - in response to Message 44402.
Last modified: 4 Sep 2016 | 12:11:30 UTC

Beyond wrote:
Richard, here's his machine. BOINC says it has 2 x GTX 980 GPUs:
http://www.gpugrid.net/results.php?hostid=367847&offset=0&show_names=1&state=0&appid=
I had a system crash on the 24th. All 4 of the tasks in progress failed. It downloaded 4 more and then crashed again within 12 hours; those timed out. I am not sure how or why 2 tasks were marked "abandoned" a day after those 4 failures, since the 4 that timed out did so 12 hours after the abandoned ones were downloaded. Maybe the second crash emptied the queue in memory and the client downloaded new tasks without reporting anything about the old ones to the project servers, leaving the server to think they were still out there?

The 2 invalid tasks were the ones sent to replace the abandoned ones on that system. As far as I know or can tell, there were no crashes while either the abandoned or the invalid tasks were running. I do have a rule set for Exclusive Applications so the heat in that little booth goes down for the guy who has to be in there a few times a week for a few hours. The heat is not an issue for the system, but it is for the person in front of it. So when he turns on the browser, the media player, Audacity, or VNC, BOINC pauses all work. Given the date range on the invalid WUs, they were running over a Wednesday night when he would have been there. The stderr outputs show both of them were paused twice, and the system was rebooted once while they were running as well.
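For reference, that Exclusive Applications rule is just a list in BOINC's cc_config.xml; mine looks roughly like this (the executable names here are only examples, not my actual list):

```xml
<cc_config>
  <options>
    <!-- BOINC suspends all computation while any listed program is running -->
    <exclusive_app>firefox.exe</exclusive_app>
    <exclusive_app>audacity.exe</exclusive_app>
    <exclusive_app>vlc.exe</exclusive_app>
  </options>
</cc_config>
```

(There is also an exclusive_gpu_app variant if you only want GPU work suspended rather than everything.)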

This system runs at factory clock settings, with MSI Afterburner only to keep an eye on the cards: no overclock on GPU clocks, voltages, memory clocks, etc. It is set to a max temp of 83 on both cards, but only reaches 80 or 81 in mid-day heat in the non-air-conditioned building. I don't think this is causing an issue, since the errors are all crash-related, which I now suspect is my own dumb fault and not heat anyway.

I think the only legitimately failed tasks on this one are the old NOELLIA at the bottom and one GIANNI on the 27th. I am also not sure whether that GIANNI is related to the crashes that caused the other errors, since it was downloaded after the timed-out ones and before the abandoned ones were sent. It's all an odd timeline of errors, successes, abandons, and invalids.

Mainly my question was aimed at getting an answer to why and how a task would be considered invalid by the validation server after the client lets it run to completion and thinks there were no errors.

There seem to have been only successful tasks since these 2 invalids as well.

Beyond wrote:
That box shows a lot of errors, as does another one that shows: 3 NVIDIA GeForce GTX 980 Ti (4095MB) GPUs. Most of the errors on the 2nd box (6) are from the long GIANNI WUs. (not counting all the bad ADRIAs)
http://www.gpugrid.net/results.php?hostid=335350&offset=0&show_names=1&state=5&appid=
This system's GPUs are overclocked. When I see a batch of tasks throwing errors I clock it back to factory, though I have seen them continue to error even then (I still do it for a while). The middle card does run at 83 all the time, since it has little ability to draw fresh air between the other 2 cards, which run cooler. The WUs that fail on this box do so across all 3 cards, so I am not sure the heat on that one card is the issue, but I think it is worth noting. A factory overclock is certainly in place, these being 980 Ti Classifieds, which come pre-clocked above the normal 980 Ti cards.

It has no power issues and is rarely rebooted, paused, or even used actively. I remote into it, and its main daily usage is sharing its hard drive content (10TB of drives) over the network via SMB. Although I do have errored tasks on this system, it is in a different location than the system in question and is fairly isolated from users and usage, sits in a much cooler place with airflow available, has good power (which I do not trust on the system in question), and is obviously the more powerful machine. Neither system has overclocked RAM or CPU, and both have water cooling on the CPU anyway.

And yes, since the start of August, the failed WUs other than GIANNIN_D3 and ADRIA total 3, and those failed fairly quickly after starting. As you note elsewhere, the GIANNIs are indeed very fragile and will run for a while and then fail. I am not sure the reason for that is always power-related, though it certainly can be. As noted, this system has no power purity or interruption issues and still fails GIANNIs, and the GPUs have been clocked at zero OC above factory since the GIANNIs started to fail, almost 2 weeks ago now.
(Side note on this system: its host RAC would be about 2.5-2.6 million a day or better, just on time run, if the failed GIANNI WUs had been credited for their run time, or if I had gotten GERRARD WUs instead and they had run to completion. As it is, since the GIANNI batch started, the RAC on this host has been between 2.05 and 2.25 million a day.)

And again, the main and really only reason I posted this was to ask how and why a WU would get marked invalid by the validation server after the client finishes and uploads it. I could not find a thread on the forums that explains this, though I think I found one that mentions it, and I think you actually started that one. :-)

caffeineyellow5
Message 44450 - Posted: 7 Sep 2016 | 4:52:24 UTC

Trying the write-cache thing Zoltan mentioned on this system, after a third validation error today!

Retvari Zoltan
Joined: 20 Jan 09 | Posts: 2343 | Credit: 16,201,255,749 | RAC: 0
Message 44468 - Posted: 9 Sep 2016 | 14:31:58 UTC - in response to Message 44450.

The logs of all your invalid tasks show traces of dirty shutdowns:

# GPU 0 : 82C
# GPU 1 : 79C
# GPU [GeForce GTX 980] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980
# ECC : Disabled
# Global mem : 4095MB

while a normal shutdown looks like this:

# GPU 0 : 83C
# GPU 1 : 80C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 980] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980
# ECC : Disabled
# Global mem : 4095MB

Notice the missing (3rd) line explaining the reason for exit in the first case.
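If you want to scan your own stderr logs for this pattern, here is a quick sketch (not an official tool; the marker strings are taken from the snippets above, and the regex for a restart banner is my assumption about how each start block begins):

```python
import re

# A clean BOINC exit writes "# BOINC suspending at user request (exit)"
# before the next startup banner; a crash or power loss leaves no such line.
CLEAN_EXIT_MARKER = "BOINC suspending at user request (exit)"
RESTART_BANNER = re.compile(r"#\s*GPU \[")  # e.g. "# GPU [GeForce GTX 980] ..."

def count_dirty_restarts(stderr_text: str) -> int:
    """Count restarts that were not preceded by a clean-exit marker."""
    dirty = 0
    saw_clean_exit = False
    first_banner = True  # the very first banner is the initial start, not a restart
    for line in stderr_text.splitlines():
        if CLEAN_EXIT_MARKER in line:
            saw_clean_exit = True
        elif RESTART_BANNER.search(line):
            if first_banner:
                first_banner = False
            elif not saw_clean_exit:
                dirty += 1
            saw_clean_exit = False  # reset for the next run segment
    return dirty
```

Paste a task's full stderr into a file and call count_dirty_restarts() on its contents; a result above zero means at least one dirty shutdown happened mid-task.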

caffeineyellow5
Message 44469 - Posted: 9 Sep 2016 | 15:31:10 UTC - in response to Message 44468.

Ahh, I see! I was reading the difference as: a system shutdown or closure of the BOINC Manager versus a "Suspend" in the BOINC Manager to pause the task. I saw "BOINC suspending at user request (exit)" and thought that was the Suspend command. I figured this because the task comes back at a higher temp, as if the system had kept running, while the other pattern looked like a shutdown, since the temp on restart is usually that of a cold start. Now, looking at the system that has almost no shutdowns or suspensions, I see that the occasional reboot does have this extra line, and that system has almost no errors either.

There has actually been another error on that main system, and another validation error on the system that previously had only 3. I really do need to find out what is wrong with these systems; I suspect it is not all, or even mostly, BOINC-related.

(Then there is the identical system that has not completed a successful task yet! But at least it is running a few hours at a time again. I have turned off GPUGRID on it and only have it doing WCG until it proves more stable. I updated the BIOS today, so there's that.)
