Message boards : Wish list : Revamping the failed task routine

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,837,071,099
RAC: 365,113
Message 18549 - Posted: 6 Sep 2010 | 18:27:07 UTC
Last modified: 6 Sep 2010 | 18:57:56 UTC

Perhaps at some stage the present task allocation and reporting system could be developed to incorporate the ability to use partially completed (failed) tasks?
For example, if a task fails after 3h, perhaps the data up to the failure point could be made useful and the resend could start slightly before the failure point (with some overlap for confirmation).
Such a system may not be feasible, due to server overhead or programming requirements, but if it could be done it would be a massive boost to crunchers and would retain the services of many otherwise defunct CC1.1 cards, as well as deal with the random/inconspicuous errors that occur.
This has been alluded to in the past (trickle-feed results, for example), but I think the failed results could just be uploaded, if this approach were doable. Obviously if a task failed within a few minutes a full resend would be required, but after several hours it could really start to help.

Profile liveonc
Joined: 1 Jan 10
Posts: 292
Credit: 41,567,650
RAC: 0
Message 18551 - Posted: 6 Sep 2010 | 19:06:29 UTC - in response to Message 18549.
Last modified: 6 Sep 2010 | 19:23:27 UTC

Wouldn't that require a full list of all hardware components, drivers used, software installed & a log of every single process prior to the failure? Even after that, wouldn't the brand & revision, clock rates, voltages & temperatures, age & condition still be factors that also need to be known?

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,837,071,099
RAC: 365,113
Message 18555 - Posted: 7 Sep 2010 | 9:20:37 UTC - in response to Message 18551.

It's just a series of calculations, so I don't think so; it does not matter which calculator you use, just that you start from the correct place. So it may be possible to pick it up from the last checkpoint.
If, for example, 42% of a task runs before failing, send the results back to the server, repackage say the last 60% and reissue the task (from just before the failure point). That way there would be a 2% overlap, rather than 42% wasted runtime. The task could also fail again at the same place, so it would be better not to crunch the first 42% again; two failures from scratch would waste 84% of the time it takes to crunch one task. When a task fails at around 99% it's a big loss. If it has to start from scratch again and then fails at the same place, it's a double blow.
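
To make the arithmetic concrete, here is a minimal sketch of the numbers above. The function name `resend_plan` and its parameters are made up for illustration; nothing here reflects how the GPUGrid server actually packages tasks. Percentages are kept as whole numbers to avoid floating-point rounding.

```python
def resend_plan(failed_at_pct: int, overlap_pct: int = 2) -> dict:
    """Work out where a resend would start and how much runtime is lost.

    Hypothetical helper: all values are percentages of one full task.
    """
    if not 0 <= failed_at_pct <= 100:
        raise ValueError("failed_at_pct must be between 0 and 100")
    if overlap_pct < 0:
        raise ValueError("overlap_pct must not be negative")

    # Restart slightly before the failure point, but never before the start.
    restart_at_pct = max(0, failed_at_pct - overlap_pct)

    return {
        "restart_at_pct": restart_at_pct,
        # Portion of the task that has to be repackaged and reissued.
        "resend_length_pct": 100 - restart_at_pct,
        # Runtime repeated when resuming with an overlap.
        "repeated_if_resumed_pct": failed_at_pct - restart_at_pct,
        # Runtime thrown away when the task restarts from scratch.
        "wasted_if_restarted_pct": failed_at_pct,
        # Runtime thrown away if the restart fails at the same place again.
        "wasted_if_fails_twice_pct": 2 * failed_at_pct,
    }


plan = resend_plan(42, overlap_pct=2)
print(plan)
```

For the 42% example this gives a restart at 40%, a 60% resend, 2% of repeated work when resuming, and 42% (or 84% after a second identical failure) lost when starting from scratch.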

Profile liveonc
Joined: 1 Jan 10
Posts: 292
Credit: 41,567,650
RAC: 0
Message 18557 - Posted: 7 Sep 2010 | 14:45:58 UTC - in response to Message 18555.

Oops, my bad! I totally misunderstood. I thought it was some kind of exaggerated log of everything that happened prior to failure. But something curious I noticed once: I aborted a WU at 7% that my GPU couldn't complete in time, and the new WU then started from 7%. It failed, of course, but what made me curious was that it took 30 minutes to fail. Those were two totally different WUs. If it had run all the way, and I only got an error message after my GPU had run junk all the way up to 100%, that would be an even bigger waste. Would there be any chance of that?

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,689,124,144
RAC: 10,114,775
Message 18558 - Posted: 7 Sep 2010 | 15:02:06 UTC - in response to Message 18555.
Last modified: 7 Sep 2010 | 15:03:34 UTC

As far as I know, a lot of randomization takes place in this kind of simulation, so if you restart a (failed) task with any overlap (even from the last checkpoint), the overlapping part wouldn't be the same as the original one. Therefore this method cannot serve as a stability test between different PC+GPU+OS+driver systems. Maybe the same system wouldn't fail if the client restarts from the last checkpoint, or reboots the OS and then restarts the task from the checkpoint. So there is no point in having any bigger overlap than the last checkpoint. But sending back the partial result (with a detailed error report), and receiving proportional credit for it (without the time bonus), would be nice and useful (Rosetta@home works this way). The real problem is the further processing of the partial result. Only the project developers know how difficult it would be to make that work, or whether it's worth the effort at all.

Profile Fred J. Verster
Joined: 1 Apr 09
Posts: 58
Credit: 35,833,978
RAC: 0
Message 18559 - Posted: 7 Sep 2010 | 19:49:38 UTC - in response to Message 18558.

Hi all, I haven't seen a GPUGrid task fail yet, maybe because I updated to CUDA 3.1 and took the 9800GTX+ and 8500GT out.
I do run 2 other projects: SETI@home (SSSE3 CPU-optimized and CUDA, too) and Einstein, which also uses CPU+GPU.
Gonna add a GTX480 to boost performance a bit.
Question: can you still use older NVidia cards like the above-mentioned 9800GTX+ and 8500GT?

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,837,071,099
RAC: 365,113
Message 18561 - Posted: 7 Sep 2010 | 22:53:47 UTC - in response to Message 18559.

An 8500GT is too slow; 16 shaders is not enough, especially for a Compute Capability 1.1 card.
A 9800GTX+ may work, but there are two types: the older 65nm version and a 55nm version. Even with the 55nm version you are likely to get the odd error every now and then. These are usually due to driver bugs, and GPUGrid can do little about them (either NVidia fixes them or it doesn't). Sometimes the card is just not up to the task because it has degraded (been badly overclocked, run too hot for too long, or used with a poor power supply), but if you have one you could try it.

My general opinion is that if you want to crunch, sell your old components on and get new ones. A GTX480 would make a significant contribution here.

I hope you are using swan_sync = 0 for your GTX470, and you leave a CPU core free (especially if the GTX480 ends up in the same system). Also hope you have raised the fan speed, to keep the card cooler; doing so lengthens the life expectancy and slightly reduces power consumption.

Thanks and good luck,

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,837,071,099
RAC: 365,113
Message 32986 - Posted: 16 Sep 2013 | 9:46:46 UTC - in response to Message 18561.

Thanks,

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 33010 - Posted: 16 Sep 2013 | 19:30:51 UTC - in response to Message 18549.


Perhaps at some stage the present task allocation and reporting system could be developed to incorporate the ability to use partially completed (failed) tasks?


Probably won't do that because it mucks up the post-simulation data analysis if trajectory files are all uneven lengths.

MJH

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 753,908,224
RAC: 504,143
Message 33020 - Posted: 16 Sep 2013 | 21:26:05 UTC - in response to Message 33010.

Anyway, the new failure recovery should take care of most situations which the initial suggestion targeted. The main difference is that now the original host itself tries again from the last checkpoint, instead of another host as suggested.

That doesn't help if the original host is in a strange state and fails again, where another host would have completed the task. In this case the x% already done is gone. But I don't think it would be straightforward to reuse it anyway, even leaving aside the post-processing, because simulations might rightfully fail (not the host's fault) or the last checkpoint might be corrupted. I don't expect these cases to be easily distinguishable.
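
As a rough illustration of the "last checkpoint might be corrupted" case, here is a minimal sketch of a host that only resumes from a checkpoint whose checksum still matches, and otherwise starts from step 0. The file layout and the names `save_checkpoint`, `load_checkpoint` and `resume_step` are assumptions made up for this example; this is not the actual GPUGrid or BOINC checkpoint format. Note that a checksum only catches damage to the file after it was written. It cannot tell whether the state was already bad when it was saved, or whether the simulation failed rightfully.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, step: int, state: list) -> None:
    """Write the state together with a SHA-256 digest of its serialized form."""
    payload = json.dumps({"step": step, "state": state}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path.write_text(json.dumps({"payload": payload, "sha256": digest}))


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the checkpoint contents, or None if missing, unreadable or altered."""
    try:
        record = json.loads(path.read_text())
        payload = record["payload"]
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if digest != record["sha256"]:
            return None  # file changed after it was written
        return json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None


def resume_step(path: Path) -> int:
    """Step to resume from: the checkpointed step if it verifies, else 0."""
    checkpoint = load_checkpoint(path)
    return checkpoint["step"] if checkpoint is not None else 0


with tempfile.TemporaryDirectory() as tmp:
    cp_path = Path(tmp) / "task.checkpoint"
    save_checkpoint(cp_path, step=4200, state=[0.1, 0.2, 0.3])
    print("resume from step", resume_step(cp_path))

    # Simulate a truncated write: the file no longer parses, so start over.
    cp_path.write_text(cp_path.read_text()[:-10])
    print("resume from step", resume_step(cp_path))
```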

MrS
____________
Scanning for our furry friends since Jan 2002
