Message boards : Number crunching : Problem - Tasks error when exiting/resuming using 334.67 drivers
Author | Message |
---|---|
MJH / Admins: Tasks are failing with: <core_client_version>7.2.39</core_client_version> <![CDATA[ <message> The file exists. (0x50) - exit code 80 (0x50) </message> and the last line in the stderr.txt file is: # BOINC suspending at user request (exit) I think suspending/resuming tasks isn't working very well; tasks are erroring out when being resumed. http://www.gpugrid.net/result.php?resultid=7747671 http://www.gpugrid.net/result.php?resultid=7749480 http://www.gpugrid.net/result.php?resultid=7750550 http://www.gpugrid.net/result.php?resultid=7751319 Can you please look into this? I'm not sure whether it's the application, the new BETA drivers, or an issue that has always been there, but I would like it fixed! Hoping you agree and are available to help, Jacob PS: I originally posted this in the 8.15 app thread, but decided to create a new thread here. Also, I'm not the only one having this problem. | |
ID: 34967 | Rating: 0 | rate: / Reply Quote | |
MJH: | |
ID: 35019 | Rating: 0 | rate: / Reply Quote | |
Confirm same Problems here with 332.21 Driver | |
ID: 35105 | Rating: 0 | rate: / Reply Quote | |
This happened again: suspending the task and then closing BOINC resulted in the task erroring. | |
ID: 35242 | Rating: 0 | rate: / Reply Quote | |
Same issue here too... | |
ID: 35314 | Rating: 0 | rate: / Reply Quote | |
Snap! GTX660, Win7-64, Driver 311.06 | |
ID: 35315 | Rating: 0 | rate: / Reply Quote | |
Anyone at GPUGrid care to fix this, like we did the previous suspend/resume problems? I'm willing to help test. | |
ID: 35336 | Rating: 0 | rate: / Reply Quote | |
Have we any more complete idea of the cause yet? I've recently upgraded to the WHQL version of the driver (334.89) for my GTX 670: no crashes yet, but then I don't routinely suspend tasks once they've started. What I have noticed is the reduced CPU demand, and a welcome reduction in the runtime of the SIMAP tasks running at the same time. The reported error is: The file exists. (0x50) - exit code 80 (0x50) but MJH's FAQ says: * -80 Failed to recover after an access violation (Win32) Any signs of an access violation from Windows, Jacob? I'd be interested if the problem could be narrowed down to a more immediate cause. Candidates are: Windows (I see Jacob is using v8.1 - I have 7 here), the driver, the BOINC client (I see Jacob is using alpha client v7.3.2), the BOINC API (linked into the application), and the application itself - and of course any combination of the above, plus probably more besides. My instinctive reaction on seeing the thread title was 'API', but I'm not so sure having looked at the full error messages. | |
ID: 35347 | Rating: 0 | rate: / Reply Quote | |
I was able to get a task to fail by: | |
ID: 35348 | Rating: 0 | rate: / Reply Quote | |
They don't fail all the time, but if you try those exact steps over and over, eventually you might get a failure. I have caused GPUGrid tasks to fail on restart by stopping and restarting BOINC quickly 3 or 4 times in a row on Linux, but that was last year, not with the current app and drivers. If I think of it I'll try to replicate it on a newly started task, but I'm not going to try it on a task I've put an hour into. If a single stop-and-restart cycle of BOINC is causing crashes then that's worth fixing, but if it happens only after several stop-and-restart cycles in quick succession then I wonder if it's worth fixing, as that is not a likely operating scenario. ____________ BOINC <<--- credit whores, pedants, alien hunters | |
ID: 35352 | Rating: 0 | rate: / Reply Quote | |
I run applications that I have setup as "exclusive applications" in BOINC. And sometimes I shut down BOINC. | |
ID: 35353 | Rating: 0 | rate: / Reply Quote | |
I've scheduled some time to sort this out in a week or so, when I'll also be putting out Maxwell support. | |
ID: 35354 | Rating: 0 | rate: / Reply Quote | |
Thank you. | |
ID: 35356 | Rating: 0 | rate: / Reply Quote | |
I don't think this is a driver issue. I'd been error-free for a long time, but in the last 5 days I have been seeing errors in SANTI_MAR WUs only. Some of them occur whenever BOINC is exited (gracefully, via the exit dialog) for any reason. No other WU types are affected. At first I thought the exit error occurred only on 1GB cards, but now I see from other users that it's happening on 660 Ti cards also. The SANTI_MAR WUs also seem to be particularly sensitive to other conditions and are failing at too high a rate, IMO. | |
ID: 35457 | Rating: 0 | rate: / Reply Quote | |
I've had 10 SANTI_MAR failures on the same Linux system in the past 3 weeks,
| |
ID: 35478 | Rating: 0 | rate: / Reply Quote | |
Almost every other day I get an error on a Santi WU on my 660. No errors (yet) on the 770 or the 780 Ti. I agree with Beyond (nice new picture of the dog) that it is not the drivers. The Santi WUs seem to be "special". | |
ID: 35497 | Rating: 0 | rate: / Reply Quote | |
Matt: I've scheduled some time to sort this out in a week or so, when I'll also be putting out Maxwell support. | |
ID: 35566 | Rating: 0 | rate: / Reply Quote | |
I just had another one fail. I had 19 hours invested into it, and needed to restart my machine. I had suspended the task, I had closed BOINC, I restarted the machine, I resumed the task, and poof, Computation Error.
Stderr output
<core_client_version>7.3.10</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 78C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 68C
# GPU 1 : 69C
# GPU 1 : 70C
# GPU 0 : 69C
# GPU 2 : 79C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r334_00 : 33489
# GPU 0 : 66C
# GPU 1 : 65C
# GPU 2 : 73C
# GPU 1 : 66C
# GPU 2 : 75C
# GPU 0 : 67C
# BOINC suspending at user request (exit)
</stderr_txt>
]]> | |
ID: 35583 | Rating: 0 | rate: / Reply Quote | |
And another one today. | |
ID: 35709 | Rating: 0 | rate: / Reply Quote | |
I just had another one fail. I had 19 hours invested into it, and needed to restart my machine. I had suspended the task, I had closed BOINC, I restarted the machine, I resumed the task, and poof, Computation Error. This same thing happens here on every SANTI_MAR WU when I have to exit BOINC and reboot for an update or whatever. 100% chance of error. Frustrating is the word. | |
ID: 35792 | Rating: 0 | rate: / Reply Quote | |
MJH: Please please please help. Name 1211-GIANNI_ntl-1-4-RND3734_0 Workunit 5485267 Created 26 Mar 2014 | 21:32:15 UTC Sent 27 Mar 2014 | 0:06:01 UTC Received 27 Mar 2014 | 11:46:13 UTC Server state Over Outcome Computation error Client state Compute error Exit status 80 (0x50) Unknown error number Computer ID 153764 Report deadline 1 Apr 2014 | 0:06:01 UTC Run time 0.00 CPU time 0.00 Validate state Invalid Credit 0.00 Application version Long runs (8-12 hours on fastest card) v8.15 (cuda42) Stderr output <core_client_version>7.3.11</core_client_version> <![CDATA[ <message> The file exists. (0x50) - exit code 80 (0x50) </message> <stderr_txt> # GPU [GeForce GTX 460] Platform [Windows] Rev [3203M] VERSION [42] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r334_89 : 33523 # GPU 0 : 58C # GPU 1 : 47C # GPU 2 : 67C # GPU 0 : 60C # GPU 1 : 50C # GPU 2 : 69C # GPU 0 : 61C # GPU 1 : 52C # GPU 0 : 62C # GPU 1 : 55C # GPU 2 : 70C # GPU 0 : 63C # GPU 1 : 56C # GPU 2 : 71C # GPU 1 : 57C # GPU 0 : 64C # GPU 1 : 59C # GPU 2 : 72C # GPU 0 : 65C # GPU 1 : 61C # GPU 1 : 62C # GPU 1 : 63C # GPU 2 : 73C # GPU 0 : 66C # GPU 1 : 64C # GPU 2 : 74C # GPU 1 : 65C # GPU 0 : 67C # GPU 1 : 66C # GPU 2 : 75C # GPU 0 : 68C # GPU 2 : 76C # GPU 1 : 67C # BOINC suspending at user request (exit) </stderr_txt> ]]> Name 1733-GIANNI_ntl-3-4-RND9094_0 Workunit 5485140 Created 26 Mar 2014 | 21:06:01 UTC Sent 27 Mar 2014 | 6:35:37 UTC Received 27 Mar 2014 | 11:46:13 UTC Server state Over Outcome Computation error Client state Compute error Exit status 80 (0x50) Unknown error number Computer ID 153764 Report deadline 1 Apr 2014 | 6:35:37 UTC Run time 0.00 CPU time 0.00 Validate state Invalid Credit 0.00 Application version Long runs (8-12 hours on fastest card) v8.15 (cuda55) Stderr output <core_client_version>7.3.11</core_client_version> <![CDATA[ <message> The file exists. (0x50) - exit code 80 (0x50) </message> <stderr_txt> # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203M] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 3072MB # Capability : 3.0 # PCI ID : 0000:09:00.0 # Device clock : 1124MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : r334_89 : 33523 # GPU 0 : 64C # GPU 1 : 65C # GPU 2 : 74C # GPU 2 : 75C # GPU 0 : 65C # GPU 1 : 66C # GPU 0 : 66C # GPU 2 : 76C # GPU 1 : 67C # BOINC suspending at user request (exit) </stderr_txt> ]]> | |
ID: 35927 | Rating: 0 | rate: / Reply Quote | |
Jacob | |
ID: 35928 | Rating: 0 | rate: / Reply Quote | |
What was the problem, and what was the fix? When do you think it will land on the Long queue? | |
ID: 35934 | Rating: 0 | rate: / Reply Quote | |
The problem, I think, is a false positive from the test to see if the WU has got stuck in a crash loop, which was introduced in 8.15. | |
ID: 35941 | Rating: 0 | rate: / Reply Quote | |
When do you plan on deploying the 8.20 app to the Long queue? People are still getting the "File already exists" error and losing tons of work, daily. If you were still testing it, why was it not confined to the Beta queue? Since it's already on Short, I think it should already be on Long too. | |
ID: 36010 | Rating: 0 | rate: / Reply Quote | |
Bugs get through beta-queue testing from time to time, so it's obviously better if we only lose the work on the short queue and not the work from both queues. But I guess at this point 8.20 looks stable enough, so I will suggest to Matt that he push it to long. | |
ID: 36016 | Rating: 0 | rate: / Reply Quote | |
Jacob, | |
ID: 36017 | Rating: 0 | rate: / Reply Quote | |
Jacob, Have you been able to find a way of preventing the server from allocating cuda55 or cuda42 to Maxwell (CC 5.0) cards yet? Doesn't waste any actual computing time, but the downloads are a bit of a pain - and having several hours of expected crunching suddenly disappear rather confuses BOINC's scheduler. :-D | |
ID: 36018 | Rating: 0 | rate: / Reply Quote | |
No idea, although haven't looked deeply into it yet. Matt | |
ID: 36019 | Rating: 0 | rate: / Reply Quote | |
It should be possible, by setting a maximum compute_capability for the two unwanted plan_classes. | |
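For illustration only, a sketch of how such a cap might look in the BOINC server's plan_class_spec.xml. The field names (min_nvidia_compcap / max_nvidia_compcap) and the compute-capability encoding (major*100 + minor, so CC 5.0 = 500) are my recollection of the server's plan-class spec, not the project's actual configuration, and should be checked against the project's server version before use.

<plan_classes>
    <!-- Hypothetical sketch: keep the cuda42/cuda55 classes off CC 5.0 (Maxwell) cards -->
    <plan_class>
        <name>cuda42</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <min_nvidia_compcap>200</min_nvidia_compcap>
        <max_nvidia_compcap>399</max_nvidia_compcap>
    </plan_class>
    <plan_class>
        <name>cuda55</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <min_nvidia_compcap>200</min_nvidia_compcap>
        <max_nvidia_compcap>399</max_nvidia_compcap>
    </plan_class>
</plan_classes>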
ID: 36021 | Rating: 0 | rate: / Reply Quote | |
Jacob, Finally!! I noticed that it was only deployed for the cuda6 plan classes; are there any plans to update the app for the other plan classes? Also, please continue to make stability a priority. It is so very frustrating to lose progress. Some of the tasks that fail say they only had a couple seconds of run-time, where I believe they may have actually had several hours invested. Perhaps that masked the severity of the issue to you guys, not sure. But I hope bug-fixing becomes a high(er) priority. Regards, Jacob | |
ID: 36022 | Rating: 0 | rate: / Reply Quote | |
Had to chime in again to say THANK YOU for fixing this. BOINC Task Stability is obviously very important to me, and this bug had been plaguing me for weeks. The new 8.20 app seems to be suspending/exiting/resuming much better for me thus far. | |
ID: 36083 | Rating: 0 | rate: / Reply Quote | |
This has not been fixed. I have all CUDA 55 WUs, and if the power goes out, the work units get lost. | |
ID: 36143 | Rating: 0 | rate: / Reply Quote | |
It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, not sure what's going on. This is the output from the last one: Stderr output | |
ID: 36145 | Rating: 0 | rate: / Reply Quote | |
It looks like I've started getting some errors on my machine as well over the last few days. It's not running overly hot, not sure what's going on. I have been seeing that too recently on one of my previously stable GTX 660s. But the other one that I had previously underclocked from 993 MHz to 967 MHz has been stable. So it appears that the work units have just gotten a little harder, and now I am underclocking both of them. I would suggest reducing your GPU clock to 1000 MHz or so. (It is not a heat issue; mine were around 66 C). | |
ID: 36146 | Rating: 0 | rate: / Reply Quote | |
I have the same issue on two different GPUs with different drivers. On the first: <core_client_version>7.2.39</core_client_version> <![CDATA[ <message> (unknown error) - exit code -59 (0xffffffc5) On a Quadro FX 3800: <core_client_version>6.10.18</core_client_version> <![CDATA[ <message> The file exists. (0x50) - exit code 80 (0x50) On both I'm running short tasks. Please fix this! | |
ID: 36219 | Rating: 0 | rate: / Reply Quote | |
Perhaps a little help: yesterday I needed to reboot all my systems for the necessary Windows updates, after they had been running for 26 days. | |
ID: 36220 | Rating: 0 | rate: / Reply Quote | |
Thank you too, for your help in diagnosing it. On to the next problem! Matt | |
ID: 36223 | Rating: 0 | rate: / Reply Quote | |
I thought this problem was fixed -- why are we still receiving 8.15 tasks? I just had 2 more fail, losing several hours of work, presumably because they were 8.15 instead of 8.20. Upsetting. | |
ID: 36252 | Rating: 0 | rate: / Reply Quote | |
I downclocked my card slightly (~50MHz), or more precisely reduced the overclock, and haven't gotten any more errors since. Not sure if that's causal or coincidental since I haven't bumped it back up yet to test. | |
ID: 36279 | Rating: 0 | rate: / Reply Quote | |
Variable: Your issue(s) are different than the one posted in this thread (see post 1). If you continue to have problems, please create a new thread. | |
ID: 36280 | Rating: 0 | rate: / Reply Quote | |
And... another 8.15 task crashed just now, losing tons of work. Why are we still using 8.15?!?
| |
ID: 36427 | Rating: 0 | rate: / Reply Quote | |
Power went out yesterday, I lost work units. Power went out today, I lost work units. This needs to get fixed!!!!! | |
ID: 36439 | Rating: 0 | rate: / Reply Quote | |
MJH: | |
ID: 36707 | Rating: 0 | rate: / Reply Quote | |
I want to highlight a problem with crunching on my GTX 680 that has been happening for about a month now. Every day the tasks collapse in a weird way: the Windows system slows down, and according to GPU-Z the card stops computing. The entire system behaves as if in slow motion. The only thing that helps is to suspend computation on the graphics card, abort each task, and fetch new work; after aborting roughly 6-12 tasks, the next 3 start and run normally. These are weird errors and they affect only my 600-series NVIDIA card; on the 700-series card crunching goes perfectly. | |
ID: 36835 | Rating: 0 | rate: / Reply Quote | |
MJH: Name A2ART4Ex05x95-GERARD_A2ART4E-13-14-RND0991_0 Workunit 7496762 Created 14 May 2014 | 5:52:04 UTC Sent 16 May 2014 | 13:57:32 UTC Received 17 May 2014 | 3:24:11 UTC Server state Over Outcome Computation error Client state Compute error Exit status 80 (0x50) Unknown error number Computer ID 153764 Report deadline 21 May 2014 | 13:57:32 UTC Run time 24,161.19 CPU time 6,302.88 Validate state Invalid Credit 0.00 Application version Long runs (8-12 hours on fastest card) v8.41 (cuda60) Stderr output <core_client_version>7.3.19</core_client_version> <![CDATA[ <message> The file exists. (0x50) - exit code 80 (0x50) </message> <stderr_txt> # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 3072MB # Capability : 3.0 # PCI ID : 0000:09:00.0 # Device clock : 1124MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : DM337_50 : 33761 # GPU 0 : 67C # GPU 1 : 75C # GPU 2 : 74C # GPU 0 : 68C # GPU 1 : 76C # GPU 0 : 69C # GPU 0 : 70C # GPU 1 : 77C # GPU 0 : 71C # GPU 0 : 72C # GPU 2 : 75C # BOINC suspending at user request (exit) # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 3072MB # Capability : 3.0 # PCI ID : 0000:09:00.0 # Device clock : 1124MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : DM337_50 : 33761 # GPU 0 : 66C # GPU 1 : 71C # GPU 2 : 58C # GPU 0 : 67C # GPU 2 : 62C # GPU 2 : 66C # GPU 2 : 67C # GPU 0 : 68C # GPU 1 : 72C # GPU 2 : 68C # GPU 2 : 69C # GPU 2 : 70C # GPU 0 : 69C # GPU 1 : 73C # GPU 2 : 71C # BOINC suspending at user request (exit) # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 3072MB # Capability : 3.0 # PCI ID : 0000:09:00.0 # Device clock : 1124MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : DM337_50 : 33761 # GPU 0 : 66C # GPU 1 : 71C # GPU 2 : 65C # GPU 0 : 67C # GPU 1 : 72C # GPU 2 : 67C # GPU 2 : 68C # GPU 0 : 68C # GPU 2 : 69C # GPU 1 : 73C # GPU 0 : 69C # GPU 2 : 70C # GPU 2 : 71C # GPU 1 : 74C # BOINC suspending at user request (exit) # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 0 : # Name : GeForce GTX 660 Ti # ECC : Disabled # Global mem : 3072MB # Capability : 3.0 # PCI ID : 0000:09:00.0 # Device clock : 1124MHz # Memory clock : 3004MHz # Memory width : 192bit # Driver version : DM337_50 : 33761 # GPU 0 : 68C # GPU 1 : 73C # GPU 2 : 68C # GPU 2 : 69C # GPU 2 : 70C # GPU 0 : 69C # GPU 1 : 74C # GPU 2 : 71C # GPU 0 : 70C # GPU 1 : 75C # GPU 2 : 72C # GPU 2 : 73C # GPU 1 : 76C # BOINC suspending at user request (exit) # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : DM337_50 : 33761 # GPU 0 : 57C # GPU 1 : 68C # GPU 2 : 61C # GPU 0 : 61C # GPU 1 : 69C # GPU 0 : 64C # GPU 1 : 70C # GPU 0 : 65C # GPU 1 : 71C # GPU 0 : 66C # GPU 1 : 72C # GPU 0 : 67C # GPU 1 : 73C # GPU 0 : 69C # GPU 0 : 70C # GPU 2 : 67C # BOINC suspending at user request (exit) # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 
0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : DM337_50 : 33761 # BOINC suspending at user request (exit) # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : DM337_50 : 33761 # GPU 0 : 61C # GPU 1 : 53C # GPU 2 : 67C # GPU 0 : 64C # GPU 1 : 58C # GPU 2 : 69C # GPU 0 : 66C # GPU 1 : 61C # GPU 0 : 67C # GPU 1 : 64C # GPU 2 : 70C # GPU 0 : 68C # GPU 1 : 65C # GPU 1 : 67C # GPU 2 : 71C # GPU 0 : 69C # GPU 1 : 69C # GPU 0 : 70C # GPU 1 : 70C # GPU 1 : 71C # GPU 2 : 72C # GPU 0 : 71C # GPU 1 : 72C # GPU 2 : 73C # BOINC suspending at user request (exit) # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : DM337_50 : 33761 # GPU 0 : 61C # GPU 1 : 53C # GPU 2 : 67C # GPU 0 : 64C # GPU 1 : 57C # GPU 2 : 68C # GPU 0 : 66C # GPU 1 : 60C # GPU 2 : 69C # GPU 0 : 67C # GPU 1 : 63C # GPU 2 : 70C # GPU 0 : 68C # GPU 1 : 64C # GPU 0 : 69C # GPU 1 : 67C # GPU 1 : 68C # GPU 2 : 71C # GPU 1 : 69C # GPU 0 : 70C # GPU 1 : 70C # GPU 1 : 72C # GPU 2 : 72C # BOINC suspending at user request (exit) # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : DM337_50 : 33761 # GPU 0 : 54C # GPU 1 : 58C # GPU 2 : 59C # GPU 1 : 62C # GPU 1 : 64C # GPU 0 : 60C # GPU 1 : 66C # GPU 0 : 62C # BOINC suspending at user request (exit) # GPU [GeForce GTX 460] Platform [Windows] Rev [3301M] VERSION [60] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : DM337_50 : 33761 # GPU 0 : 58C # GPU 1 : 53C # GPU 2 : 58C # GPU 0 : 60C # GPU 1 : 58C # GPU 0 : 63C # GPU 1 : 62C </stderr_txt> ]]> | |
ID: 36856 | Rating: 0 | rate: / Reply Quote | |
MJH: +2 http://www.gpugrid.net/result.php?resultid=10328606 http://www.gpugrid.net/result.php?resultid=10328572 These failed after a simple system restart. | |
ID: 36857 | Rating: 0 | rate: / Reply Quote | |
Every time the power goes out I lose all the units that are being worked on. If I restart the system using the proper procedures, there is no problem. This has been going on for months and I am really getting sick of it. I bought a UPS; now let's see. | |
ID: 36984 | Rating: 0 | rate: / Reply Quote | |
Is your error "The file exists. (0x50) - exit code 80 (0x50)"?
If not, then please create a new thread; this thread is about that error. | |
ID: 36985 | Rating: 0 | rate: / Reply Quote | |
This is *STILL* an issue. When can we finally get it fully fixed? :(
Stderr output
<core_client_version>7.4.8</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
...
...
...
# BOINC suspending at user request (exit)
</stderr_txt>
]]>
http://www.gpugrid.net/result.php?resultid=12796113
Outcome Computation error
Client state Compute error
Exit status 80 (0x50) Unknown error number
Run time 2,221.71
Stderr output
<core_client_version>7.4.8</core_client_version>
<![CDATA[
<message>
The file exists. (0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
...
...
...
# BOINC suspending at user request (exit)
</stderr_txt>
]]> | |
ID: 37242 | Rating: 0 | rate: / Reply Quote | |
Jacob, | |
ID: 37325 | Rating: 0 | rate: / Reply Quote | |
I'm on the road, but will be home tonight. I'll try to re-review, probably tomorrow. Thanks! | |
ID: 37327 | Rating: 0 | rate: / Reply Quote | |
Jacob, Hi Matt, I don't know if we need to make a new post for this, but I have a request. Is it possible, in the stderr output file, to show only the temperature of the GPU that did the job? Right now the temperature changes from every card are shown. Thank you. ____________ Greetings from TJ | |
ID: 37338 | Rating: 0 | rate: / Reply Quote | |
Tricky - the GPU ordering from the temperature query interface doesn't correspond to the CUDA ordering. | |
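For illustration of the point, one common way to reconcile the two orderings is to match devices by PCI location. This is not the application's actual code, just a minimal standalone sketch assuming NVML (nvml.h) and the CUDA runtime (cuda_runtime.h) are both available and linked; error handling is omitted.

#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

// Sketch: for each CUDA-ordered device, find the NVML device at the same
// PCI domain/bus/device location and report its temperature.
int main() {
    nvmlInit();

    unsigned int nvmlCount = 0;
    nvmlDeviceGetCount(&nvmlCount);

    int cudaCount = 0;
    cudaGetDeviceCount(&cudaCount);

    for (int c = 0; c < cudaCount; ++c) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, c);

        for (unsigned int n = 0; n < nvmlCount; ++n) {
            nvmlDevice_t dev;
            nvmlDeviceGetHandleByIndex(n, &dev);

            nvmlPciInfo_t pci;
            nvmlDeviceGetPciInfo(dev, &pci);

            // Same physical card if the PCI location matches.
            if (pci.domain == (unsigned int)prop.pciDomainID &&
                pci.bus    == (unsigned int)prop.pciBusID &&
                pci.device == (unsigned int)prop.pciDeviceID) {
                unsigned int tempC = 0;
                nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
                printf("# CUDA device %d (%s) : %uC\n", c, prop.name, tempC);
            }
        }
    }

    nvmlShutdown();
    return 0;
}

With a mapping like this, the stderr log could print only the temperature of the CUDA device the task is actually running on, which is what TJ is asking for above.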
ID: 37339 | Rating: 0 | rate: / Reply Quote | |
MJH: | |
ID: 37353 | Rating: 0 | rate: / Reply Quote | |
That exit circumstance is the failsafe exit that stops a WU getting stuck in an endless cycle of abort - resume, without making any progress. It should only trigger if the machine has been up for a few minutes (from which we infer that the WU crashed the machine). | |
ID: 37357 | Rating: 0 | rate: / Reply Quote | |
That exit circumstance is the failsafe exit that stops a WU getting stuck in an endless cycle of abort - resume, without making any progress. It should only trigger if the machine has been up for a few minutes (from which we infer that the WU crashed the machine). | |
ID: 37358 | Rating: 0 | rate: / Reply Quote | |
Perhaps you could give me even more clues on how to reproduce the error on demand? It seems that it is currently too stringent, causing otherwise-healthy tasks to fail when starting BOINC. | |
ID: 37359 | Rating: 0 | rate: / Reply Quote | |
He said: "It should only trigger if the machine has been up for a few minutes." So, you could try suspending / closing BOINC and then resuming it, both without shutting down the machine in between and with shutting it down. ____________ | |
ID: 37361 | Rating: 0 | rate: / Reply Quote | |
Matt, | |
ID: 37362 | Rating: 0 | rate: / Reply Quote | |
Matt, I was able to get another task to error for that reason... so it is still possible, if enough testing is done. Again, could you provide details on the exit algorithm? | |
ID: 37387 | Rating: 0 | rate: / Reply Quote | |
Jacob, | |
ID: 37388 | Rating: 0 | rate: / Reply Quote | |
Alright.... So, it looks like the slot directory does get the canary file when the tasks are started within the session. And, by utilizing the <checkpoint_debug> flag in cc_config.xml, I believe I see the file being removed whenever the task's first checkpoint of the session is performed. | |
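Putting the observations in this thread together (a canary file created in the slot directory when the task starts, removed at the first checkpoint of the session, plus a failsafe that aborts the task on restart if the canary is still present and an uptime condition is met), a minimal sketch of that lifecycle might look like the code below. This is not the application's real code: the file name, the helper names, the 10-minute threshold, and the direction of the uptime test are all assumptions for illustration, since the thread does not spell them out.

#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <windows.h>   // GetTickCount64() gives milliseconds since boot

namespace fs = std::filesystem;

// Hypothetical names; the real file name and threshold are not given in the thread.
static const char* kCanaryFile = "restart_canary";
static const unsigned long long kUptimeLimitMs = 10ull * 60ull * 1000ull;  // the "10-minute limit"

// Called once when the science application starts (slot directory is the CWD).
void check_crash_loop_failsafe() {
    if (fs::exists(kCanaryFile)) {
        // The previous run never reached its first checkpoint. If the uptime
        // condition is also met, assume a crash loop and bail out. Whether the
        // real test is "more than" or "less than" the limit is an assumption.
        if (GetTickCount64() > kUptimeLimitMs) {
            std::fprintf(stderr, "# Possible crash loop detected, exiting\n");
            std::exit(80);  // surfaces as "The file exists. (0x50) - exit code 80 (0x50)"
        }
    }
    // Mark that a run is in progress but has not yet checkpointed.
    std::FILE* f = std::fopen(kCanaryFile, "w");
    if (f) std::fclose(f);
}

// Called when the first checkpoint of the session completes successfully.
void on_first_checkpoint() {
    fs::remove(kCanaryFile);
}

// One of the fixes proposed below: also remove the canary on a normal
// "suspending at user request (exit)" shutdown, so a clean BOINC exit before
// the first checkpoint cannot trigger the failsafe on the next start.
void on_graceful_exit() {
    fs::remove(kCanaryFile);
}

In that shape, the first fix proposed below amounts to calling the graceful-exit removal from the normal shutdown path, and the second to loosening or dropping the uptime test.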
ID: 37389 | Rating: 0 | rate: / Reply Quote | |
Hurray! I've been able to make all 3 of my tasks fail, essentially on-demand! All of them with error: "The file exists. (0x50) - exit code 80 (0x50)" ... This genuinely excites me! | |
ID: 37391 | Rating: 0 | rate: / Reply Quote | |
Either way, though... this algorithm doesn't sit well with me. Are you able to make changes to it? Perhaps we could work together to develop a better algorithm that still accomplishes your goals without killing healthy tasks? It might be a matter of: 1) removing the canary file on a normal shutdown of BOINC (this could solve the majority of the issues!), and 2) removing the 10-minute limit, since the machine may have restarted and sat at a login screen for several hours before the user logged in and started BOINC. Thoughts? | |
ID: 37392 | Rating: 0 | rate: / Reply Quote | |
Jacob, | |
ID: 37393 | Rating: 0 | rate: / Reply Quote | |
Matt, | |
ID: 37394 | Rating: 0 | rate: / Reply Quote | |
I've noticed that GPUGrid tasks fail with the "file exists" error when I'm restarting my PC immediately after a restart. I thought I should wait for the workunits to make their first checkpoint to avoid this error, but I didn't realize that it's a protective algorithm. | |
ID: 37395 | Rating: 0 | rate: / Reply Quote | |
Despite having a primary SSD and a secondary BOINC data drive on my main Win7 system, I still use a 30-second cc_config start delay: <options> <start_delay>30</start_delay> </options>
| |
ID: 37396 | Rating: 0 | rate: / Reply Quote | |
Any progress? | |
ID: 37418 | Rating: 0 | rate: / Reply Quote | |
Matt, | |
ID: 37585 | Rating: 0 | rate: / Reply Quote | |
Jacob, | |
ID: 37588 | Rating: 0 | rate: / Reply Quote | |
Sure hope this gets fixed. Updating my machines from 7.4.8 to 7.4.18, carefully shutting down 7.4.8 before installing the new client yielded 3 aborted GPUGrid WUs out of 7. This happens only with GPUGrid WUs, no other projects that I run (many) behave in this way. | |
ID: 37867 | Rating: 0 | rate: / Reply Quote | |
Jacob, Early September? 2014? | |
ID: 38134 | Rating: 0 | rate: / Reply Quote | |
Coming with the 6.5 app, which is under testing on beta now. | |
ID: 38137 | Rating: 0 | rate: / Reply Quote | |
Thank you. Are there minimum requirements for getting tasks on that beta app? | |
ID: 38139 | Rating: 0 | rate: / Reply Quote | |