Revamping the failed task routine

Message boards : Wish list : Revamping the failed task routine

Author	Message
skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 18549 - Posted: 6 Sep 2010 \| 18:27:07 UTC Last modified: 6 Sep 2010 \| 18:57:56 UTC
	Perhaps at some stage the present task allocation and reporting system could be developed to incorporate the ability to use partially completed (failed) tasks? For example, if a task fails after 3h, perhaps the data up to the failure point could be made useful and the resend could start slightly before the failure point (with some overlap for confirmation). Such a system may not be feasible, due to server overhead or programming requirements, but if it could be done it would be a massive boost to crunchers and would retain the services of many would-be defunct CC1.1 cards, as well as deal with the random/inconspicuous errors that occur. This has been alluded to in the past (trickle feed results, for example), but I think the failed results could just be uploaded, if this approach were doable. Obviously if a task failed within a few minutes a full resend would be required, but after several hours it could really start to help.

liveonc Send message Joined: 1 Jan 10 Posts: 292 Credit: 41,567,650 RAC: 0 Level Scientific publications	Message 18551 - Posted: 6 Sep 2010 \| 19:06:29 UTC - in response to Message 18549. Last modified: 6 Sep 2010 \| 19:23:27 UTC
	Wouldn't that require a full list of all hardware components, drivers used, software installed & a log of every single process prior to the failure? Even after that, wouldn't the brand & revision, clock rates, voltages & temperatures, age & condition still be factors that also need to be known?

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 18555 - Posted: 7 Sep 2010 \| 9:20:37 UTC - in response to Message 18551.
	It's just a series of calculations, so I don't think so; it does not matter which calculator you use, just that you start from the correct place. So it may be possible to pick it up from the last checkpoint. If, for example, 42% of a task runs before failing, send the results back to the server, repackage say the last 60% and reissue the task (from just before the failure point). That way there would be a 2% overlap, rather than 42% wasted runtime. Also, the task could fail again at the same place, so it would be better not to crunch the first 42% again. That would be 84% of the time to crunch one task wasted, in total. When a task fails at around 99% it's a big loss. If it has to start from scratch again and then fails at the same place, it's a double blow.
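The arithmetic in the post above can be written out as a small sketch. This is only an illustration of the numbers quoted (42% failure point, 2% overlap, 84% wasted over two full restarts); the function names `resend_plan` and `wasted_by_full_restarts` are hypothetical and not part of any GPUGrid or BOINC code.

```python
def resend_plan(failed_at_pct: int, overlap_pct: int = 2) -> dict:
    """Plan a resend that starts slightly before the failure point.

    All values are whole percentages of the task's total runtime.
    The 2% default overlap is just the figure used in the post.
    """
    if not 0 <= failed_at_pct <= 100:
        raise ValueError("failed_at_pct must be between 0 and 100")
    if overlap_pct < 0:
        raise ValueError("overlap_pct must not be negative")
    # Never start before the beginning of the task.
    resume_from = max(0, failed_at_pct - overlap_pct)
    return {
        "resume_from_pct": resume_from,
        "reissued_pct": 100 - resume_from,
        "overlap_pct": failed_at_pct - resume_from,
    }


def wasted_by_full_restarts(failed_at_pct: int, failures: int) -> int:
    """Runtime thrown away when every failed attempt restarts from scratch."""
    return failed_at_pct * failures


plan = resend_plan(42)
print(plan)                            # resume at 40%, reissue the last 60%
print(wasted_by_full_restarts(42, 2))  # two failures at the same place
```

With a failure at 42%, the reissued task covers the last 60% and recomputes only 2%, whereas two full restarts failing at the same place throw away 84% of one task's runtime.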

liveonc Send message Joined: 1 Jan 10 Posts: 292 Credit: 41,567,650 RAC: 0 Level Scientific publications	Message 18557 - Posted: 7 Sep 2010 \| 14:45:58 UTC - in response to Message 18555.
	Oops, my bad! I totally misunderstood. I thought it was some kind of exaggerated log of everything that happened prior to failure. But I once noticed something curious: I aborted a WU at 7% that my GPU couldn't complete in time, and the new WU started from 7%. It failed, of course, but what made me curious was that it took 30 minutes to fail. Those were two totally different WUs. If it had run all the way, and I only got an error message after my GPU had run junk all the way up to 100%, that would be an even bigger waste. Would there be any chance of that?

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 386 Level Scientific publications	Message 18558 - Posted: 7 Sep 2010 \| 15:02:06 UTC - in response to Message 18555. Last modified: 7 Sep 2010 \| 15:03:34 UTC
	As far as I know, a lot of randomization takes place in this kind of simulation, so if you start a (failed) task over with any overlap (even from the last checkpoint), the overlapping part wouldn't be the same as the original one. Therefore this method cannot serve as a stability test between different PC+GPU+OS+driver systems. Maybe the same system wouldn't fail if the client restarts from the last checkpoint, or reboots the OS and then restarts the task from the checkpoint. So there is no point in having any bigger overlap than the last checkpoint. But sending back the partial result (with a detailed error report) and receiving proportional credit for it (without the time bonus) would be nice and useful though (rosetta@home works this way). The real problem is the further processing of the partial result. Only the project developers know how difficult it would be to make that work, or whether it's worth the effort at all.
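The idea of proportional credit without the time bonus could look roughly like the sketch below. It is an assumption-laden illustration, not the actual credit formula of GPUGrid or rosetta@home: the function name `task_credit`, its parameters, and the example numbers are all made up.

```python
def task_credit(base_credit: float, fraction_done: float,
                time_bonus: float = 0.0) -> float:
    """Credit for a task result (hypothetical scheme).

    A fully completed task earns the base credit plus the time bonus.
    A partial (failed) result earns only its share of the base credit,
    with no time bonus, as suggested in the post above.
    """
    if not 0.0 <= fraction_done <= 1.0:
        raise ValueError("fraction_done must be between 0 and 1")
    if fraction_done < 1.0:
        return base_credit * fraction_done
    return base_credit * (1.0 + time_bonus)


# Example with made-up numbers: 8000 base credit, 50% time bonus.
print(task_credit(8000, 0.5, time_bonus=0.5))  # partial result, no bonus
print(task_credit(8000, 1.0, time_bonus=0.5))  # full result, with bonus
```

Under these assumed numbers, a task that fails halfway earns half the base credit, while only the completed task gets the bonus on top.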

Fred J. Verster Send message Joined: 1 Apr 09 Posts: 58 Credit: 35,833,978 RAC: 0 Level Scientific publications	Message 18559 - Posted: 7 Sep 2010 \| 19:49:38 UTC - in response to Message 18558.
	Hi all, haven't seen a GPUGrid task fail yet, maybe 'cause I updated to CUDA 3.1 and took the 9800GTX+ and 8500GT out. I do run 2 other projects: SETI@home (SSSE3 CPU optimized, and CUDA too) and Einstein, which also uses CPU+GPU. Gonna add a GTX480 to boost performance a bit. Question: can you still use older NVidia cards like the above-mentioned 9800GTX+ and 8500GT?

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 18561 - Posted: 7 Sep 2010 \| 22:53:47 UTC - in response to Message 18559.
	An 8500GT is too slow; 16 shaders is not enough, especially for a Compute Capability 1.1 card. A 9800GTX+ may work, but there are two types: the older 65nm version and a 55nm version. Even with the 55nm version you are likely to get the odd error every now and then. These are usually due to driver bugs, and GPUGrid can do little about them (either NVidia fixes them or they don't). Sometimes the card is just not up to the task; it has degraded (been badly overclocked, run too hot for too long, or used with a poor power supply), but if you have one you could try it. My general opinion is that if you want to crunch, sell your old components on and get new ones. A GTX480 would make a significant contribution here. I hope you are using swan_sync = 0 for your GTX470, and that you leave a CPU core free (especially if the GTX480 ends up in the same system). I also hope you have raised the fan speed to keep the card cooler; doing so lengthens the life expectancy and slightly reduces power consumption. Thanks and good luck,

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 32986 - Posted: 16 Sep 2013 \| 9:46:46 UTC - in response to Message 18561.
	Thanks,

MJH Project administrator Project developer Project scientist Send message Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0 Level Scientific publications	Message 33010 - Posted: 16 Sep 2013 \| 19:30:51 UTC - in response to Message 18549.
	"Perhaps at some stage the present task allocation and reporting system could be developed to incorporate the ability to use partially completed (failed) tasks?" Probably won't do that, because it mucks up the post-simulation data analysis if the trajectory files are all of uneven lengths. MJH

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 33020 - Posted: 16 Sep 2013 \| 21:26:05 UTC - in response to Message 33010.
	Anyway, the new failure recovery should take care of most situations which the initial suggestion targeted. The main difference is that now the original host itself tries again from the last checkpoint, instead of another host as suggested. That doesn't help if the original host is in a strange state and fails again where another host would have completed the task. In this case the x% already done is lost. But even leaving the post-processing aside, I don't think it would be straightforward to reuse that work anyway, because simulations might rightfully fail (not the host's fault) or the last checkpoint might be corrupted. I don't expect these cases to be easily distinguishable. MrS
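The recovery behaviour described above (the same host retrying from its last checkpoint) can be sketched roughly as follows. This is a simplified model under stated assumptions, not the GPUGrid application's actual code: `StepFailed`, `run_with_recovery`, the retry limit, and the fixed checkpoint interval are all hypothetical.

```python
from typing import Callable, Tuple


class StepFailed(Exception):
    """Raised by the (hypothetical) step function when a step fails."""


def run_with_recovery(total_steps: int, checkpoint_every: int,
                      run_step: Callable[[int], None],
                      max_retries: int = 3) -> Tuple[bool, int]:
    """Run steps 0..total_steps-1, rolling back to the last checkpoint on failure.

    Returns (completed, redone_steps), where redone_steps counts finished
    steps that were discarded by rollbacks.
    """
    checkpoint = 0  # first step not yet covered by a checkpoint
    step = 0
    retries = 0
    redone = 0
    while step < total_steps:
        try:
            run_step(step)
        except StepFailed:
            retries += 1
            if retries > max_retries:
                return False, redone  # give up; the work done so far is lost
            redone += step - checkpoint  # work since the checkpoint is discarded
            step = checkpoint            # resume from the last checkpoint
            continue
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = step  # record progress
    return True, redone


# Usage: a step that fails once (a transient error) and then succeeds.
remaining_failures = {7: 1}

def flaky_step(step: int) -> None:
    if remaining_failures.get(step, 0) > 0:
        remaining_failures[step] -= 1
        raise StepFailed()

print(run_with_recovery(10, 5, flaky_step))
```

A transient error costs only the steps since the last checkpoint. A host that keeps failing at the same step exhausts its retries, which is the case where the work already done is lost.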
