
Message boards : Number crunching : Dozens of Failed Tasks

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 23764 - Posted: 5 Mar 2012 | 9:12:13 UTC

My systems are currently failing more tasks than ever. It looks like most of the work units failed on other systems as well.

Do we have a large number of corrupt work units or have I done something wrong on my setups?

I just put a new GTX 580 in my farm this weekend and was disappointed to see all of the failed WUs. Again, when I look at the failed WUs, they usually failed 3 or 4 times on other systems as well.

Any help is appreciated.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 23765 - Posted: 5 Mar 2012 | 10:15:41 UTC - in response to Message 23764.

Hi, from a quick look it seems that only one of your hosts http://www.gpugrid.net/results.php?hostid=119703 has recent failed tasks.

New driver? Have you checked the thread on monitor-off corruption?

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 23771 - Posted: 5 Mar 2012 | 13:08:21 UTC - in response to Message 23765.

Thank you for the quick reply. When I spot-check the tasks that failed, most of them look like they failed on other computers as well.

http://www.gpugrid.net/workunit.php?wuid=3231807
http://www.gpugrid.net/workunit.php?wuid=3231830
http://www.gpugrid.net/workunit.php?wuid=3228376
http://www.gpugrid.net/workunit.php?wuid=3227851
http://www.gpugrid.net/workunit.php?wuid=3231792
http://www.gpugrid.net/workunit.php?wuid=3231127

There was a discussion of monitor-off issues with the 295 drivers, but I did not see a resolution.

I am happy to make changes to help.

thx

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 23775 - Posted: 5 Mar 2012 | 13:58:13 UTC - in response to Message 23765.
Last modified: 5 Mar 2012 | 15:00:39 UTC

Your issue is most likely with 295.73.
On W7 systems we are recommending that people avoid the 295.x drivers.
Most drivers from 260 to 285 should work. I would recommend downloading one, fully uninstalling 295, restarting, and then installing the downloaded driver.

http://www.gpugrid.net/workunit.php?wuid=3231830
All errors on W7 with 295 drivers, except one (258 - too old)!
That seems to be the case for most of that list of errors.
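
The driver advice in this post can be sketched as a small check. This is only an illustration of the rule of thumb stated in this thread for Windows 7 hosts (avoid 295.x, treat 260 to 285 as the working range, 258 as too old). The function name `driver_advice` and the exact cutoffs are assumptions made for the example, not anything GPUGRID or NVIDIA provides.

```python
def driver_advice(version: str) -> str:
    """Classify an NVIDIA driver version string such as '295.73'.

    Hypothetical helper: the cutoffs mirror the rule of thumb in this
    thread for Windows 7 hosts, not an official compatibility list.
    """
    major = int(version.split(".")[0])
    if major == 295:
        return "avoid"      # 295.x reported here as failing tasks on W7
    if 260 <= major <= 285:
        return "ok"         # range recommended in this thread
    if major < 260:
        return "too old"    # e.g. the 258 host mentioned above
    return "unknown"        # anything else is not discussed in the thread


# Versions taken from this thread.
for v in ("295.73", "275.33", "258"):
    print(v, "->", driver_advice(v))
```

Anything outside the versions discussed here comes back as "unknown" on purpose, since the thread gives no evidence either way.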
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 23789 - Posted: 6 Mar 2012 | 2:25:47 UTC - in response to Message 23775.

It looks like 275.33 fixed everything. Now back to crunching!!

Thank you

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Level
Glu
Message 23791 - Posted: 6 Mar 2012 | 7:10:34 UTC

Never touch a running system =) Good to have a cruncher for real science back :)
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 23796 - Posted: 6 Mar 2012 | 12:45:11 UTC - in response to Message 23791.

I am back to the 275.33 drivers, but downclocking has become an issue. I never saw this problem before I upgraded and then downgraded the drivers. 295.73 appears to fix the downclocking but does not work with GPUGRID WUs. 275.33 will downclock, but WUs continue to run.

Does anyone have a resolution for the downclocking issue on Windows 7? I saw a batch file for Linux but nothing for Windows.

thank you

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 23802 - Posted: 6 Mar 2012 | 15:28:29 UTC - in response to Message 23796.
Last modified: 6 Mar 2012 | 15:32:00 UTC

Some sort of 'threadsafe' exit code might do it, though it might also change the code to the extent that the research methods alter, which is something reviewers might not be so keen on. Obviously the recent app updates, which added instructions to enable some task types, didn't include this 'threadsafe' code. Maybe next time.

This thread might be useful.

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 23922 - Posted: 13 Mar 2012 | 7:49:42 UTC - in response to Message 23802.

Can someone look at this WU http://www.gpugrid.net/workunit.php?wuid=3261113 and provide some insight into the failure? It really hurts when they run for 13,000 seconds and fail.

Everyone else failed the task as well, but they have the 295.73 drivers.

thank you.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 23923 - Posted: 13 Mar 2012 | 8:24:16 UTC - in response to Message 23922.

The failure was: ERROR: # Energies have become nan

This means the energy value being calculated is 'Not A Number'.
I think this may mean the value was zero or went outside some stipulated check range, and as a result is being reported as nan.
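
For context, here is a minimal Python sketch of how a 'not a number' value behaves in IEEE 754 floating point. It is a general illustration, not the application's actual energy check: a NaN normally comes from an invalid operation such as infinity minus infinity rather than from a plain zero, it propagates through later arithmetic, and it compares unequal to everything, including itself. The `first_nan_step` helper and the `energies` values are hypothetical and made up for the example.

```python
import math


def first_nan_step(energies):
    """Return the index of the first NaN in a sequence, or None if there is none."""
    for step, value in enumerate(energies):
        if math.isnan(value):
            return step
    return None


inf = float("inf")
nan = inf - inf                  # invalid operation, the result is NaN

print(nan == nan)                # False: NaN never equals itself
print(math.isnan(nan + 1.0))     # True: NaN propagates through arithmetic

# Made-up energy trace, for illustration only.
energies = [-1520.4, -1519.8, nan, nan]
print(first_nan_step(energies))  # 2
```

Because every later value computed from a NaN is also NaN, a run cannot recover once one appears, which is presumably why the task is aborted at that point.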

Many of BOINC's 'Error codes' are not actually errors, and few elucidate the issue, let alone suggest a solution to crunchers.

Consider reducing your shader clocks. See the 'Energies have become nan' thread.



Stealth Eagle*
Joined: 19 Apr 09
Posts: 2
Credit: 11,426,878
RAC: 0
Level
Pro
Message 23941 - Posted: 13 Mar 2012 | 21:25:16 UTC

I am suddenly having a bunch of acemd2 tasks fail. This is the first time this has happened since I started running GPUGRID.
http://www.gpugrid.net/results.php?userid=21556&offset=0&show_names=0&state=5&appid=

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level
Gln
Message 23947 - Posted: 14 Mar 2012 | 1:28:06 UTC - in response to Message 23941.

This computer http://www.gpugrid.net/show_host_detail.php?hostid=119703 failed a few WUs in a row, followed by at least 2 successful WUs. I did not change anything. My GPU typically never exceeds 71C. Is my GPU just running a little too fast? I had about 10 successful WUs prior to these errors.

I read the thread on nan and I don't have a heat issue, but maybe I just need to pull the performance down a little to avoid the nan condition.

Thank you
