
Message boards : Number crunching : Abrupt computer restart - Tasks stuck - Kernel not found

Author Message
Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33274 - Posted: 30 Sep 2013 | 4:09:25 UTC
Last modified: 30 Sep 2013 | 4:11:43 UTC

I recently had a power outage here, where the computer lost power while it had been working on BOINC.

When I turned the computer on again, the 2 Long-run GPUGrid tasks got stuck in a continual "Driver reset" loop: I kept getting Windows balloons saying the driver had successfully recovered, over and over. I looked at the stderr.txt file in the slots directory, and remember seeing:
Kernel not found# SWAN swan_assert 0
... over and over, along with each retry to start the task.

The only way I could see to get out of the loop, was to abort the work units. So I did.
The tasks are below.
Curiously, there was also a beta task that I had worked on (which errored out and was reported way before the power outage), where it also said:
Kernel not found# SWAN swan_assert 0

1) Why did the full stderr.txt not get included in my aborted task logs?
2) Why did the app continually try to restart this unresumable situation?
3) Was the error in the beta task intentionally set (to test the retry logic)?


Thanks,
Jacob


Name I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1
Workunit 4795185
Created 29 Sep 2013 | 9:39:42 UTC
Sent 29 Sep 2013 | 9:56:59 UTC
Received 30 Sep 2013 | 4:01:08 UTC
Server state Over
Outcome Computation error
Client state Aborted by user
Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID 153764
Report deadline 4 Oct 2013 | 9:56:59 UTC
Run time 48,589.21
CPU time 48,108.94
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output

<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>



Name 17x6-SANTI_RAP74wtCUBIC-13-34-RND0681_0
Workunit 4807187
Created 29 Sep 2013 | 13:06:23 UTC
Sent 29 Sep 2013 | 17:32:54 UTC
Received 30 Sep 2013 | 4:01:08 UTC
Server state Over
Outcome Computation error
Client state Aborted by user
Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID 153764
Report deadline 4 Oct 2013 | 17:32:54 UTC
Run time 17,822.88
CPU time 3,669.02
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Stderr output

<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
]]>



Name 112-MJHARVEY_CRASH3-14-25-RND0090_2
Workunit 4807215
Created 29 Sep 2013 | 17:32:12 UTC
Sent 29 Sep 2013 | 17:32:54 UTC
Received 29 Sep 2013 | 19:04:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -226 (0xffffffffffffff1e) ERR_TOO_MANY_EXITS
Computer ID 153764
Report deadline 4 Oct 2013 | 17:32:54 UTC
Run time 4,020.13
CPU time 1,062.94
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.14 (cuda55)

Stderr output

<core_client_version>7.2.16</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
# GPU 0 : 68C
# GPU 1 : 61C
# GPU 2 : 83C
# GPU 1 : 63C
# GPU 1 : 64C
# GPU 1 : 65C
# GPU 1 : 66C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 69C
# GPU 1 : 69C
# GPU 1 : 70C
# GPU 0 : 70C
# GPU 1 : 71C
# GPU 0 : 71C
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
14:56:38 (1696): Can't acquire lockfile (32) - waiting 35s
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0
...
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
Kernel not found# SWAN swan_assert 0

</stderr_txt>
]]>

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1425
Credit: 3,520,440,451
RAC: 596,293
Message 33277 - Posted: 30 Sep 2013 | 10:31:53 UTC - in response to Message 33274.

Interesting that you and Matt - no, not Matt Harvey, the guy in GPUGrid Start Up/Recovery Issues - should both post about similar issues on the same day.

I've also had the problem of the continual "Driver reset" loop after an abnormal shutdown, also mostly with NATHAN_KIDKIXc22 tasks. The problem appears to be a failure to restart the tasks from a (possibly damaged or corrupt) checkpoint file - maybe the project team could look into that?

My workaround has been to restart Windows in safe mode (which prevents BOINC from loading), and edit client_state.xml to add the line

<suspended_via_gui/>

to the <result> block for the suspect task.

As the name suggests, that's the same as clicking 'suspend' for the task while BOINC is running, and gets control of the machine back so you can investigate on the next normal restart. By convention, the line goes just under <plan_class> in client_state, but I think anywhere at the first indent level will do.
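To make the placement concrete, here is a sketch of what the edited <result> block in client_state.xml might look like. The task name is taken from this thread; the surrounding fields are illustrative, and only the <suspended_via_gui/> line is new:

```xml
<result>
    <name>I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1</name>
    <!-- ...other result fields left unchanged... -->
    <plan_class>cuda55</plan_class>
    <suspended_via_gui/>
    <!-- ...remainder of the result block... -->
</result>
```

On the next normal start, BOINC should treat the task as user-suspended, exactly as if 'Suspend' had been clicked in the Manager.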

Interesting point about stderr.txt - I hadn't looked that far into it.

The process for stderr is:

1) It gets written as a file in the slot directory.
2) On task completion, the contents of that file get copied into the same <result> block in client_state.xml.
3) The <result> data is copied into a sched_request file for the project's server.
4) The scheduler result handler copies it into the database for display on the web.

So, which of those gets skipped if a task gets aborted? Next time it happens, I'll follow the process through and see where it goes missing. Any which way, it's probably a BOINC problem, and I agree it would be better if partial information was available for aborted tasks. You and I both know where and how to get that changed once we've narrowed down the problem ;)
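Step 2 of that chain can be spot-checked directly: if the abort path skipped the copy, the stderr field for the result in client_state.xml would be empty. A minimal Python sketch, assuming the field is named <stderr_out> as in recent BOINC clients (the real file carries many more fields than this handles):

```python
# Sketch: look up the captured stderr for a named task in
# client_state.xml. Field names are an assumption based on the
# BOINC client's state-file layout; adjust if your version differs.
import xml.etree.ElementTree as ET

def stderr_for_result(client_state_path, task_name):
    """Return the <stderr_out> text for the named result,
    '' if the field is present but empty, or None if the
    result is not found at all."""
    tree = ET.parse(client_state_path)
    for result in tree.getroot().iter("result"):
        if result.findtext("name") == task_name:
            return result.findtext("stderr_out") or ""
    return None
```

Running this against a copy of client_state.xml taken just after an abort would show whether the slot-directory stderr.txt ever made it into the state file.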

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 33278 - Posted: 30 Sep 2013 | 10:42:22 UTC - in response to Message 33277.
Last modified: 30 Sep 2013 | 10:56:32 UTC

I have the same problem with Nathan units on a GTX460 but I didn't have power outages.


ADDED

I wish they would beta test full-size WUs before releasing them on an unsuspecting public. It's little wonder there's only a small bunch of hardcore GPUGrid crunchers: it's just too much hassle for the ordinary user and causes too many problems. They join and quickly leave... a shame, because a little more full beta testing would catch these problems.
____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 33279 - Posted: 30 Sep 2013 | 11:03:35 UTC

Also had my GTX660TI throw a wobbly on a Noelia WU here

http://www.gpugrid.net/result.php?resultid=7310174
____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 33284 - Posted: 30 Sep 2013 | 13:18:27 UTC - in response to Message 33274.

I recently had a power outage here, where the computer lost power while it had been working on BOINC.

When I turned the computer on again, the 2 Long-run GPUGrid tasks got stuck in a continual "Driver reset" loop: I kept getting Windows balloons saying the driver had successfully recovered, over and over. I looked at the stderr.txt file in the slots directory, and remember seeing:
Kernel not found# SWAN swan_assert 0
... over and over, along with each retry to start the task.

The only way I could see to get out of the loop, was to abort the work units. So I did.
The tasks are below.
Curiously, there was also a beta task that I had worked on (which errored out and was reported way before the power outage), where it also said:
Kernel not found# SWAN swan_assert 0

1) Why did the full stderr.txt not get included in my aborted task logs?
2) Why did the app continually try to restart this unresumable situation?
3) Was the error in the beta task intentionally set (to test the retry logic)?


Thanks,
Jacob



Jacob, this has been my life with my GTX 590 box for the last month.

I usually just end up resetting the whole project because the apps will not continue. It may run for a day or two or it may just run for two hours before BSOD.

I'm fighting the nvlddmkm.sys thing right now and will probably end up reinstalling as a last-ditch effort. This system does not normally crash unless BOINC is running GPUGrid WUs. It is not overclocked and is water-cooled. All timings and specs are as from the Dell factory for this T7500.

But yeah..I completely understand what you're going through.

Operator
____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33327 - Posted: 2 Oct 2013 | 10:32:48 UTC - in response to Message 33274.

MJH:
Can you try to reproduce this problem (in my report in the first post) and fix it?

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 33337 - Posted: 3 Oct 2013 | 0:43:30 UTC

I did reinstall the OS on my GTX 590 box and have not installed any updates.

I am using 314.22 right now and it's been running for two days without any errors at all.

I am now convinced that there was a "third party software" issue or possibly the Microsoft WDDM 1.1, 1.2, 1.3 update package that caused my problems.

I'm using Win7 x64, so I really don't think I need the update to my Windows display driver model to work with Win 8 or 8.1 if I'm not using that OS.

Regardless, it's working now!

Operator
____________

wiyosaya
Send message
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 33367 - Posted: 5 Oct 2013 | 23:23:06 UTC

I had the same problem with this WU on my GTX 580 machine - http://www.gpugrid.net/workunit.php?wuid=4819239
Only I did not have a power outage.

My symptoms were finding my computer frozen, with no choice other than to hit the reset switch. When the computer came back up, I kept getting Windows balloons saying that there were driver problems and that the driver had failed to start, followed by blue screens.

I booted into safe mode, then downloaded and installed the latest WHQL NVidia driver. I then rebooted and got exactly the same thing again. I figured it was the GPU Grid WU, so I again booted into safe mode, brought up BOINC Manager and aborted the task. Now my computer comes back up and is running; however, I got a computation error on this WU - http://www.gpugrid.net/workunit.php?wuid=4820870 - which also caused a blue screen.

I've set my GPU Grid project to not get new tasks for the time being.

Interestingly enough, my GTX 460 machine seems to be having no problems at the moment.
____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33368 - Posted: 5 Oct 2013 | 23:36:07 UTC - in response to Message 33367.

MJH: Any response?
I tried to provide as much detail as possible.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 33369 - Posted: 6 Oct 2013 | 8:49:51 UTC - in response to Message 33368.

Jacob,

Next time this happens, please email me the contents of the slot directory and the task files.

Mjh

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33373 - Posted: 6 Oct 2013 | 12:52:45 UTC - in response to Message 33369.

Ha! Considering it seems like it should be easy to reproduce (turn off the PC via the switch, not via a normal shutdown, in the middle of a GPUGrid task)... Challenge accepted.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33374 - Posted: 6 Oct 2013 | 13:09:33 UTC - in response to Message 33373.

MJH:

If I'm able to reproduce the issue, where should I email the requested files? Can you please PM me your email address?

Also... For my first test, the issue did not occur on my Long-run SANTI-baxbim tasks. I wonder if it is task-type-specific? I'll try to test an "abrupt computer restart" against other task types.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33375 - Posted: 6 Oct 2013 | 13:26:37 UTC - in response to Message 33374.
Last modified: 6 Oct 2013 | 13:28:13 UTC

MJH:

I have been able to reproduce the problem with a SANTI_MAR422dim task.
Can you please PM me your email address?

Thanks,
Jacob

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33376 - Posted: 6 Oct 2013 | 16:37:58 UTC - in response to Message 33375.

Matt,
I have received your PM, and have sent you the files.
Please let me know if you need anything or find anything!

Thanks,
Jacob

wiyosaya
Send message
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 33377 - Posted: 6 Oct 2013 | 16:47:48 UTC - in response to Message 33374.
Last modified: 6 Oct 2013 | 17:00:03 UTC

MJH:

If I'm able to reproduce the issue, where should I email the requested files? Can you please PM me your email address?

Also... For my first test, the issue did not occur on my Long-run SANTI-baxbim tasks. I wonder if it is task-type-specific? I'll try to test an "abrupt computer restart" against other task types.

FWIW - My GTX 460 machine finished the task that I posted about. Although it took longer than 24 hours, it was a SANTI-baxbim task - http://www.gpugrid.net/workunit.php?wuid=4818983

Also, I have to say that I somewhat agree with the above post about people who run this project really needing to know what they are doing. I'm a software developer / computer scientist by trade, and I build my own PCs when I need them.

One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends.

In general, I have found this project to be relatively stable with this, perhaps, the only serious fault I have encountered so far. However, when faults like this arise, it would almost certainly take skilled people to get out of the situation created.

Unfortunately, though, this and other similar projects, at least as I see it, are on the bleeding edge. As in my job where the software that I work with is also on the bleeding edge (a custom FEA program), it is sometimes extraordinarily difficult to catch a bug like this since it sounds like it occurs only under limited circumstances that may not be caught in a test of the software unless, in this case, the PC were shut down abnormally.

Just my $0.02.
____________

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 33391 - Posted: 7 Oct 2013 | 12:49:50 UTC - in response to Message 33377.
Last modified: 7 Oct 2013 | 12:57:19 UTC

One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends.


That's another thing that is a "trap" and confusing. While the deadline is 5 days, if you don't return the WU within 2 days it is resent to another host, and if that host returns first (likely), your computing time, if you return a result, has been wasted, because the first valid result returned is canonical and yours is binned.
____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline

Profile Retvari Zoltan
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 19
Message 33392 - Posted: 7 Oct 2013 | 12:57:18 UTC - in response to Message 33391.

One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends.


That's another thing that is a "trap" and confusing. While the deadline is 5 days if you don't return within 2 days the WU is resent to another host and if that host returns first (likely) you don't get any credit and your computing time if you return a result has been wasted because the first valid result returned is canonical and yours is binned.

The 2-day resend was discontinued long ago.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 33393 - Posted: 7 Oct 2013 | 12:59:48 UTC - in response to Message 33392.

One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends.


That's another thing that is a "trap" and confusing. While the deadline is 5 days if you don't return within 2 days the WU is resent to another host and if that host returns first (likely) you don't get any credit and your computing time if you return a result has been wasted because the first valid result returned is canonical and yours is binned.

The 2-day resend was discontinued long ago.


I type corrected :-)


____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33394 - Posted: 7 Oct 2013 | 13:05:11 UTC

Alright, back on topic here...
I'm waiting for MJH to analyze the files that I sent him.

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 33397 - Posted: 7 Oct 2013 | 17:22:31 UTC - in response to Message 33394.
Last modified: 7 Oct 2013 | 17:23:33 UTC

Alright, back on topic here...
I'm waiting for MJH to analyze the files that I sent him.



Jacob;

Are you talking about how, when one of the GPUs TDRs, it screws up all the tasks running on the other GPUs as well?

That happens to me on my GTX590 box all the time (mostly power outages). If one messes up and ends up causing a TDR or complete dump and reboot, when I start BOINC again all the remaining WUs in progress on the other GPUs also cause more TDRs unless I abort them.

Sometimes even that doesn't help and I have to completely reset the project.

Example: I had a TDR the other day. Three WUs were uploading at the time. Only one was actually processing. Fine. So I reboot and catch BOINC before it starts processing the problem WU and suspend processing so the three that did complete can upload for credit.

Now, I abort the problem WU and let the system download 4 new WUs.

As soon as processing starts, Wham! Another TDR.

So I do a reset of the project and 4 more WUs download and start processing without any problem at all.

So the point is, unless I reset the project when I get a TDR I'm just wasting my time downloading new WUs because they are all going to continue to crash until I do a complete reset.

So I'm not sure which leftover file in the BOINC or GPUGrid project folder(s) is causing the TDRs after the original event.

Is that the same issue you are talking about here or am I way off?

Operator
____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33398 - Posted: 7 Oct 2013 | 17:26:16 UTC - in response to Message 33397.
Last modified: 7 Oct 2013 | 17:27:09 UTC

I believe your issue is a separate issue.

Mine occurs as outlined in the first post within this thread:
If a GPUGrid task is in the middle of being processed, and BOINC is shut down abnormally (like a power outage, or the computer freezing without the user issuing the shutdown command)...

Then when the computer/BOINC/task restarts, it can get into a loop where it crashes the driver, tries to start again (I see the "elapsed" time back off a few seconds, indicating it is retrying), crashes the driver again, and so on. It keeps crashing the driver until I abort the task. It does not affect other tasks.

I've captured a copy of the data directory when this was happening, and submitted some files to MJH, to hopefully figure out what is happening.

If you have a different issue, please consider opening a separate thread.

Thanks,
Jacob

wiyosaya
Send message
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 33501 - Posted: 14 Oct 2013 | 18:54:09 UTC

Was there a resolution to this?

I ran several WUs this past weekend on my 580 machine, which is the one that had the problem, and I did not see this issue again.
____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33502 - Posted: 14 Oct 2013 | 19:52:40 UTC - in response to Message 33501.

There has been no recent contact from MJH, and so no resolution.

I believe the issue only happens when the computer (running a GPUGrid.net task) is interrupted (or freezes completely) without being able to shut down cleanly. I haven't seen it happen recently, because I usually shut down/restart normally instead of with an abrupt power shutoff.

Regards,
Jacob

wiyosaya
Send message
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 33504 - Posted: 15 Oct 2013 | 2:20:52 UTC

Thanks. I had no problems this past weekend. However, I did not experience any abnormal shutdowns or freezes.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1425
Credit: 3,520,440,451
RAC: 596,293
Message 33520 - Posted: 16 Oct 2013 | 18:19:43 UTC

Are we still collecting these? I had a sticking task - multiple driver restarts after a forced reboot - with 23x6-SANTI_RAP74wtCUBIC-18-34-RND6543_0

The stderr.txt follows: I'll preserve the rest of the slot contents before aborting the task, in case anyone wants them.

# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
# GPU 0 : 74C
# GPU 1 : 55C
# GPU 0 : 75C
# GPU 1 : 56C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 1 : 57C
# GPU 0 : 80C
# GPU 0 : 81C
# GPU 1 : 58C
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33521 - Posted: 16 Oct 2013 | 18:22:29 UTC - in response to Message 33520.

I sent MJH some files, but haven't heard from him :/

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1425
Credit: 3,520,440,451
RAC: 596,293
Message 33524 - Posted: 16 Oct 2013 | 21:37:25 UTC

And it's just happened again, this time with potx108-NOELIA_INS1P-0-14-RND5839_0

# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:08:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
# GPU 0 : 76C
# GPU 1 : 56C
# GPU 1 : 57C
# GPU 1 : 58C
# GPU 1 : 59C
# GPU 1 : 60C
# GPU 1 : 61C
# GPU 0 : 77C
# GPU 1 : 62C
# GPU 1 : 63C
# GPU 0 : 78C
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 670
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1084MHz
# Memory clock : 3054MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]
(... same device report repeated before every retry; full block omitted ...)
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
(... device report ...)
SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
22:21:16 (5824): Can't acquire lockfile (32) - waiting 35s
(... five further retries, each printing the device report followed by "Cuda driver error 702" and "# SWAN swan_assert 0"; a final retry ends after the driver-version line with no FATAL message ...)

I seem to see similarities in

SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963.
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.

in both reports. And in both cases, the first error occurs after the first restart.

Interestingly, this was running in the same slot directory as the previous one (slot 4), and part of my bug report to BOINC (apart from the failure to report stderr_txt) was that the slot directory wasn't cleaned after an abort. I'll make sure that's done properly before I risk another one.
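As an aside, the looping pattern described above can be spotted mechanically. A minimal sketch (my own heuristic, not part of any GPUGrid or BOINC tooling) that counts the failed restarts recorded in a task's stderr.txt:

```python
def failed_restarts(stderr_text):
    """Count restart attempts that ended in a SWAN assert.

    Each '# SWAN swan_assert 0' line in a task's stderr.txt marks one
    aborted start; several of them suggest the task is stuck in the
    driver-reset loop discussed in this thread.
    """
    return sum(1 for line in stderr_text.splitlines()
               if line.strip() == "# SWAN swan_assert 0")


def looks_stuck(stderr_text, threshold=3):
    """Heuristic only: three or more failed restarts means 'probably looping'."""
    return failed_restarts(stderr_text) >= threshold
```

On the log excerpt quoted above this would count roughly ten failed restarts; the threshold of three is an arbitrary choice.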

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33527 - Posted: 16 Oct 2013 | 22:11:09 UTC - in response to Message 33521.

Sorry guys, I've been (and still am) very busy. Jacob, thanks for the files, they were useful and I know how to fix the problem. Unfortunately, I'll not have opportunity to do any more work on the application for a while. Will keep you posted.

MJH

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 194
Credit: 537,220,015
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33543 - Posted: 18 Oct 2013 | 15:31:38 UTC

I have also been experiencing this problem. Over the past several weeks at least. Also glad to see the cause has been identified by the project. Now just waiting for a fix.

FWIW, this is the trick I use to be able to get to the BOINC GUI controls before crashing. I add this line to the cc_config.xml:

<cc_config>
<options>
<start_delay>60</start_delay>
</options>
</cc_config>

"Specify a number of seconds to delay running applications after client startup. New in 6.1.6"

No fiddling with safe mode or any of that.

http://boinc.berkeley.edu/wiki/Client_configuration

____________
Reno, NV
Team: SETI.USA

Profile JStateson
Avatar
Send message
Joined: 31 Oct 08
Posts: 183
Credit: 3,326,949,677
RAC: 40,024
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33621 - Posted: 26 Oct 2013 | 7:41:06 UTC
Last modified: 26 Oct 2013 | 7:44:08 UTC

Suspect I had the same problem: the driver resetting in a loop, eventually a blue screen and memory dump. I managed to stop the GPU and spotted an MD5 checksum error message associated with a GPUGrid logo PNG file. Probably more to it than a bad logo-file download, so I reset the project and stopped fetching new work. The problem disappeared on this GTX 570 system. Other systems are running GPUGrid OK.

Upgraded from 327 to 331 drivers before deciding to reset the project.


EDIT - JUST REALIZED I HAD A POWER OUTAGE RIGHT BEFORE THE PROBLEM.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33623 - Posted: 26 Oct 2013 | 11:22:15 UTC - in response to Message 33621.

The cause:
Happens when Windows is shut down unexpectedly, i.e. from freezing up, from the user pulling the plug, or from a power outage.

The problem:
The driver resets continuously, GPUGrid tasks do not progress normally, and sometimes Windows will BSOD because of the driver resets.

The solution:
Find a way to abort any GPUGrid tasks that are causing the problem. If Windows gives you enough time to stop BOINC when you log in, do that: stop/suspend BOINC, abort the GPUGrid tasks, then restart/resume BOINC. If Windows doesn't give you enough time, the <start_delay> option in cc_config.xml is a good choice, but you would have to boot into safe mode (to prevent BOINC from starting) in order to create/edit that file, then boot normally, and while BOINC is in the startup delay, stop/suspend BOINC, abort the GPUGrid tasks, and restart/resume BOINC.

This is a GPUGrid problem, and I hope MJH fixes it!
He says he knows how to, it's just a matter of his limited time.

Regards,
Jacob
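For what it's worth, the stop/abort/resume sequence Jacob describes can also be scripted with boinccmd, the command-line tool shipped with the BOINC client. A sketch only: the task name is a placeholder, and the BOINCCMD override exists purely so the sequence can be dry-run with echo instead of a live client.

```shell
#!/bin/sh
# Sketch of the workaround sequence: suspend the project, abort the
# stuck task, then resume. Assumes a local, reachable BOINC client.
BOINCCMD="${BOINCCMD:-boinccmd}"        # set BOINCCMD=echo for a dry run
PROJECT_URL="http://www.gpugrid.net/"

abort_stuck_task() {
    task_name="$1"                      # e.g. the looping task's full name
    "$BOINCCMD" --project "$PROJECT_URL" suspend    # stop the crash loop
    "$BOINCCMD" --task "$PROJECT_URL" "$task_name" abort
    "$BOINCCMD" --project "$PROJECT_URL" resume
}
```

Running this during the <start_delay> window means the stuck task never gets a chance to start.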

Operator
Send message
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33624 - Posted: 26 Oct 2013 | 12:59:18 UTC - in response to Message 33623.


The solution:
If Windows doesn't give you enough time, then utilizing the <start_delay> option in cc_config.xml is a good choice, but you would have to start in safe mode (to prevent BOINC from starting) in order to create/edit that file, then start in regular mode, and while BOINC is in the startup delay, stop/suspend BOINC, abort the GPUGrid tasks, restart/resume BOINC.



I have edited the cc_config file to include the startup delay, and now that delay (60 seconds in my case) is initiated every time I start BOINC up, whether or not there was a problem before it was shut down.

So now I don't have to try and 'catch' BOINC to abort tasks, or go into safe mode or anything else. I can just abort tasks that I know will fail due to the power-interruption issues I occasionally have to deal with here (mostly on my GTX 590 box).

Operator
____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33705 - Posted: 1 Nov 2013 | 14:47:57 UTC
Last modified: 1 Nov 2013 | 14:52:41 UTC

My computer abruptly restarted a couple times today, and I had to deal with this problem again.

A "I505-SANTI_baxbim2-18-32" task got stuck into an infinite driver reset loop, and I had to suspend GPU to get to that task to abort it. A "23x5-SANTI_RAP74wtCUPIC-20-34" task did not get stuck in the loop, and so I didn't have to abort that one.

So... this is still an ongoing problem for me.
MJH?

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33769 - Posted: 4 Nov 2013 | 10:58:50 UTC - in response to Message 33705.

Jacob,

I'll probably get a fix for this problem out next week.

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33772 - Posted: 4 Nov 2013 | 14:00:46 UTC - in response to Message 33769.

Jacob,
I'll probably get a fix for this problem out next week.
Matt


Thanks. I'm looking forward to the fix. And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!)

mwgiii
Send message
Joined: 22 Jan 09
Posts: 8
Credit: 958,533,971
RAC: 6,470
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33784 - Posted: 5 Nov 2013 | 19:59:20 UTC

Please hurry MJH.
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33820 - Posted: 10 Nov 2013 | 13:52:46 UTC - in response to Message 33772.

And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!)

LOL! I recommend using a switch instead (power switch or at the PSU) as these are "debounced" (not sure this is the correct electrical engineering term.. sounds wrong).

It could also work to just kill BOINC via task manager - maybe try this before the fix is out :)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Chilean
Avatar
Send message
Joined: 8 Oct 12
Posts: 98
Credit: 385,652,461
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33841 - Posted: 12 Nov 2013 | 6:50:25 UTC - in response to Message 33820.

And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!)

LOL! I recommend using a switch instead (power switch or at the PSU) as these are "debounced" (not sure this is the correct electrical engineering term.. sounds wrong).

It could also work to just kill BOINC via task manager - maybe try this before the fix is out :)

MrS


You got it right.
____________

Profile [PUGLIA] Riccardo
Send message
Joined: 27 Feb 12
Posts: 2
Credit: 3,410,838
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwat
Message 33845 - Posted: 12 Nov 2013 | 16:58:27 UTC
Last modified: 12 Nov 2013 | 16:59:49 UTC

Exactly the same for me.

3 SANTI WUs corrupted after a power outage (and a PSU that's about to be retired!!!)

They are 7443155, 7456552 and 7457465 among my current WUs: http://www.gpugrid.net/results.php?hostid=155107

Drivers kept crashing and Win7 kept rebooting until I was quick enough to suspend work and abort GPUGRID's WUs
____________
Mio Dio, è pieno di stelle! [My God, it's full of stars!]

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33903 - Posted: 16 Nov 2013 | 11:25:06 UTC - in response to Message 33845.

I didn't have a power outage, but the computer did restart (the WU's caused the system to restart).
On reboot the driver kept crashing while trying to run the same tasks.

43x1-SANTI_RAP74wtCUBIC-22-34-RND5480_0

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1425
Credit: 3,520,440,451
RAC: 596,293
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33904 - Posted: 16 Nov 2013 | 11:44:06 UTC - in response to Message 33903.

(the WU's caused the system to restart).

That's a bold statement.

Have you opted IN to the current beta test of the v8.15 application designed to prevent the endless driver crash loop on restart, however the original problem came about?

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33905 - Posted: 16 Nov 2013 | 12:10:01 UTC

I also had an error with a SANTI_RAP task after 3 "stop and starts".
I also had an error earlier this week on a NOELIA task with a fatal CUDA driver error, but that was the first time the GPU clock had not been down-clocked.
I also ran a SANTI long run last week; although it finished without error, it had down-clocked the GPU.

I have opted in for the betas but got only two of them, and they were quite fast.
____________
Greetings from TJ

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33908 - Posted: 16 Nov 2013 | 15:12:13 UTC - in response to Message 33904.

Have you opted IN to the current beta test of the v8.15 application...

I didn't bother yesterday, as I saw that only ~10 test WUs had been released and none were available.
Since selecting betas today, none have come my way so far:

    16/11/2013 14:32:50 | GPUGRID | No tasks are available for ACEMD beta version
    16/11/2013 14:48:03 | GPUGRID | No tasks are available for ACEMD beta version
    16/11/2013 15:01:34 | GPUGRID | No tasks are available for ACEMD beta version
    16/11/2013 15:10:52 | GPUGRID | No tasks are available for ACEMD beta version


Server says,

    ACEMD beta version 0 9 0.43 (0.15 - 2.74) 8


BTW. I do usually participate in the Betas:

    hours and percentage of runtime for 3 systems since the end of July 2013,
    GPUGRID ACEMD beta version 1140.53 (5.24%)
    GPUGRID ACEMD beta version 47.20 (1.12%)
    GPUGRID ACEMD beta version 544.78 (5.51%)


____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33909 - Posted: 16 Nov 2013 | 15:17:44 UTC - in response to Message 33908.
Last modified: 16 Nov 2013 | 15:20:59 UTC

Right.

I too have not yet been able to get a Beta task in order to do testing with.

But I think the point was:
The 8.15 beta application supposedly fixes the problem that you (and I and others) had, which is caused by [an abrupt computer restart, or power outage, or BOINC being killed in TaskManager without closing gracefully], and results in a loop of driver resets, and which can only be resolved by aborting the GPUGrid task(s) causing the loop.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1425
Credit: 3,520,440,451
RAC: 596,293
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33923 - Posted: 17 Nov 2013 | 18:14:55 UTC - in response to Message 33909.

Right.

I too have not yet been able to get a Beta task in order to do testing with.

But I think the point was:
The 8.15 beta application supposedly fixes the problem that you (and I and others) had, which is caused by [an abrupt computer restart, or power outage, or BOINC being killed in TaskManager without closing gracefully], and results in a loop of driver resets, and which can only be resolved by aborting the GPUGrid task(s) causing the loop.

And looking at the stderr for the individual task in question, I could see no sign that GPUGrid had crashed or otherwise caused the initial problem, only that it had entered the 'looping driver' state on the first restart.

There seem to be more Beta tasks available for testing this afternoon - I have some flagged 'KLAUDE' which look to be heading towards 6-7 hours on my GTX 670s. That should be long enough to trigger a crash for testing purposes :)

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33924 - Posted: 17 Nov 2013 | 19:52:48 UTC - in response to Message 33923.

The Betas might fix the driver restarts, but that doesn't address the cause of the system crash/restart, if it is related to the task/app. This seemed to happen in the past with certain types of WU: when you ran those WUs, the system crashed and the drivers restarted on reboot; when you didn't run them, there were no restarts or driver failures. There probably wouldn't be anything in the BOINC logs if the app/WU triggered an immediate system Stop.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33932 - Posted: 18 Nov 2013 | 22:37:17 UTC - in response to Message 33924.
Last modified: 18 Nov 2013 | 22:37:56 UTC

MJH:

I just got my first Beta unit that lasted more than a couple seconds.

But, when I tested it (killing process trees using Process Explorer), it restarted fine 2 times, but on the 3rd time, the task itself error'd out, with:
Exit status: 80 (0x50) Unknown error number
Message: The file exists. (0x50) - exit code 80 (0x50)

See:
http://www.gpugrid.net/result.php?resultid=7474144

Was this expected? Or is this a new bug?
Also, if you would like us to test the Beta units by doing abnormal actions, please give us a set of steps to perform. I had just been letting it run for a bit, then killing Tree in Process Explorer, but that was just a guess as to what testing steps might be necessary.

So, let me know what you think about this one?

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33935 - Posted: 19 Nov 2013 | 0:11:58 UTC - in response to Message 33932.
Last modified: 19 Nov 2013 | 0:13:12 UTC

Jacob -

That's expected behaviour now, but not entirely desired. The app misinterpreted the rapid restart as an indication that it was stuck in a restart loop and so aborted. I expect I'll need a more sensitive test.

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33936 - Posted: 19 Nov 2013 | 0:14:28 UTC - in response to Message 33935.

So, for testing the current app, should I have waited several checkpoints between Tree Kills?

Profile JStateson
Avatar
Send message
Joined: 31 Oct 08
Posts: 183
Credit: 3,326,949,677
RAC: 40,024
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34087 - Posted: 30 Nov 2013 | 16:09:58 UTC

I hope this gets fixed, because cold weather here is causing more frequent power outages and I have a farm of GPUGrid systems.

Jozef J
Send message
Joined: 7 Jun 12
Posts: 112
Credit: 1,035,595,172
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 34103 - Posted: 2 Dec 2013 | 21:42:36 UTC


The GPUGRID project seems no longer under control; errors and various problems with tasks keep increasing as the number of users grows... I am completely done with this project.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34132 - Posted: 5 Dec 2013 | 20:40:29 UTC - in response to Message 34103.

Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors.

MrS
____________
Scanning for our furry friends since Jan 2002

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1425
Credit: 3,520,440,451
RAC: 596,293
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34133 - Posted: 5 Dec 2013 | 21:41:07 UTC - in response to Message 34132.

Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors.

MrS

And I suspect it will be even better when they have enough confidence to promote the restart-fix v8.15 from Beta to stock application.

FoldingNator
Send message
Joined: 1 Dec 12
Posts: 24
Credit: 60,122,950
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwat
Message 34137 - Posted: 6 Dec 2013 | 12:53:01 UTC - in response to Message 34132.

Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors.

MrS

Yeah, maybe, but my computer also had a few BSODs yesterday, with multiple long-run WUs. Nothing worked after that; I had to delete the whole BOINC folder and the driver and clean-install everything, and after that it finally worked again. A lot of work for a few long runs, in my opinion. I'm glad I only have 1 PC. haha :P

The BSOD error had something to do with kernel issues and corrupted the installed NVIDIA driver. So when the computer booted, the screens froze, the driver crashed within a few seconds, and after that I had a BSOD again, again and again.

I don't know whether it's a coincidence, or whether more people have had the same kind of problems.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34138 - Posted: 6 Dec 2013 | 12:56:07 UTC - in response to Message 34137.
Last modified: 6 Dec 2013 | 13:16:38 UTC

FoldingNator:

I don't think it was corrupting the drivers.
I'm betting what you were experiencing exactly what was reported by me in post 1 of this thread.

Specifically, here are the steps that create the problem:
- v8.14 GPUGrid tasks are abruptly-interrupted (by power outage, or BSOD, or improper Windows shutdown, or BOINC client killed via Task Manager)
- Windows or user starts BOINC
- BOINC tries to run one or more of the v8.14 abruptly-interrupted GPUGrid tasks
- Running those abruptly-interrupted tasks resets the NVIDIA drivers continuously (with either continual "Display driver stopped working" notifications or BSODs)

The workaround (as previously recommended) is to:
- Abort those v8.14 abruptly-interrupted GPUGrid tasks (Try to suspend BOINC at earliest opportunity, to stop the crashing, so that you can abort these tasks)
- Restart the computer

The solution (which prevents these tasks from getting stuck in a crashing loop) is:
- GPUGrid needs to release the v8.15 application (which fixes the issue, but is currently still only on the Beta queue)

Regards,
Jacob Klein

MJH:
We can haz 8.15?

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34144 - Posted: 7 Dec 2013 | 1:03:55 UTC - in response to Message 34138.

I again had this problem today and yesterday. It impacts Windows systems only.
No power outage, no improper shutdowns. The app/tasks cause the drivers to fail. On reboot one or more GPUGrid WU's fail. Some GPUGrid WU's can continue however.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

FoldingNator
Send message
Joined: 1 Dec 12
Posts: 24
Credit: 60,122,950
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwat
Message 34146 - Posted: 7 Dec 2013 | 3:36:35 UTC - in response to Message 34138.
Last modified: 7 Dec 2013 | 3:38:08 UTC

Hi Jacob/skgiven, thanks for your messages. It sounds the same, but after the driver crash the Windows log files said that part of the driver was corrupted. Though I also doubt it.

I restarted my computer multiple times, but it didn't help.
Actually, these were my steps:
1.) restart Windows in safe mode
2.) set BOINC so the manager doesn't start automatically after system start-up
3.) aborted the SANTI and NATHAN runs
4.) restart again and test a new run -> again a driver crash and BSOD
5.) delete the driver with Driver Sweeper + CCleaner for the registry entries, restart
6.) install new driver, restart, start up BOINC and the runs from point 4 -> again BSOD
7.) set everything in MSI Afterburner to stock -> new tasks -> BSOD
8.) restart again, start up BOINC and continue with the tasks from point 7; finally it takes... it runs again
9.) after 30 minutes I paused it, brought the OC back, and the WUs are running fine... very strange IMO


The solution (which prevents these tasks from getting stuck in a crashing loop) is:
- GPUGrid needs to release the v8.15 application (which fixes the issue, but is currently still only on the Beta queue)

Regards,
Jacob Klein

MJH:
We can haz 8.15?

Hmmm, sounds like a great idea. ;-)

Profile skgiven
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34148 - Posted: 7 Dec 2013 | 11:18:36 UTC - in response to Message 34146.

This morning I again found that my computer had restarted (3 days in a row), and when I logged in the NVidia driver repeatedly restarted. One GPUGrid task had completed and wanted to upload (which I also saw yesterday), so it's likely that the new task is causing this.

When you have 3 GPUs in one system and have to abort 3 tasks, 3 days in a row, things aren't working out!
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Send message
Joined: 28 Jul 12
Posts: 816
Credit: 1,568,413,471
RAC: 178,877
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34150 - Posted: 7 Dec 2013 | 22:40:28 UTC

I have never had the reboot problem on my dedicated BOINC-PC with two GTX 660s running the longs (331.65 drivers, Win7 64-bit). But that PC has an uninterruptable power supply (UPS), and never suffers from power outages. Also, the cards are now stable, after some effort as explained elsewhere, to the point where they now don't have "The simulation has become unstable. Terminating to avoid lock-up" problem.

However my main PC is a different story. That one also has a UPS, but I put the weakest of my GTX 660's there, and found that the stability of a card depends on the motherboard; what works in an Ivy Bridge board does not work in an older P45 Core2Duo board. Before I got it stable, that card would crash on its own, sometimes producing a BSOD, which then initiated the reboot problem. Curiously, the BSODs often don't even produce a minidump file; unless you are there to see it, you might miss that it happened at all. So I had to reduce the GPU clock still more, and even reduce the memory clock to get it stable.

So the bottom line is that until they come out with the 8.15 fix, the best thing to do is to make your cards as stable as possible, so that you never get the "The simulation has become unstable. Terminating to avoid lock-up" messages in stderr.txt, which I now consider a canary in the coal mine for the reboot problem. As discussed elsewhere, what worked for me was increasing the power limit (to 110%), reducing the GPU clocks, and increasing the GPU core voltage as necessary; Nvidia Inspector worked for me. Of course, the cooling should be sufficient also; it is worth doing what it takes.
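Jim's "canary" can also be watched for automatically. A rough sketch (the slots path is an assumption; BOINC's data directory varies by install) that lists slot directories whose stderr.txt contains the instability warning:

```python
import pathlib

CANARY = "The simulation has become unstable"

def unstable_slots(slots_dir):
    """Return names of slot dirs whose stderr.txt mentions the canary string.

    slots_dir is the BOINC 'slots' directory, e.g. C:/ProgramData/BOINC/slots
    on a default Windows install (an assumption; check your own setup).
    """
    hits = []
    for stderr in sorted(pathlib.Path(slots_dir).glob("*/stderr.txt")):
        if CANARY in stderr.read_text(errors="replace"):
            hits.append(stderr.parent.name)
    return hits
```

Run periodically, this gives early warning that a card's clocks or voltage may need adjusting before the reboot problem bites.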

Profile (retired account)
Send message
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 34153 - Posted: 8 Dec 2013 | 14:17:50 UTC

Also have a massive problem with reboots now which might be related. The GTX Titan (Win 7 SP1 64bit, driver 331.40) received the following two long runs and then a short run (which I did load on purpose for testing) today:

I72R1-NATHAN_KIDKIXc22_6-34-50-RND4048
I35R3-NATHAN_KIDKIXc22_6-41-50-RND0098
I259-SANTI_baxbimSPW2-8-62-RND4721

With all three workunits the PC suddenly crashed and rebooted (I did not *see* a BSOD; the screen only went black, then an immediate reboot). I don't have BOINC in autostart; when I started it manually it took a few seconds, then the nvidia driver crashed and was restarted by Windows (no reboot or BSOD here, BOINC kept running). The GPUGRID workunits crashed; in the first two cases the long runs also took the WUProp workunits down with them.

http://wuprop.boinc-af.org/result.php?resultid=36443848
http://wuprop.boinc-af.org/result.php?resultid=36448588

All other projects running concurrently remained unharmed afaics (including Einstein FGRP2, S6CasA, Test4Theory and POEM++ OpenCl, the latter running on a HD 7950).

<core_client_version>7.2.31</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX TITAN
# ECC : Disabled
# Global mem : 4095MB
# Capability : 3.5
# PCI ID : 0000:01:00.0
# Device clock : 875MHz
# Memory clock : 3004MHz
# Memory width : 384bit
# Driver version : r331_00 : 33140
# GPU 0 : 49C

(...)

# GPU 0 : 81C
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX TITAN

(see above)

# Driver version : r331_00 : 33140
SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)

</stderr_txt>
]]>


The card ran fine the last week with Milkyway in DP mode. I will try to run some other projects now in SP mode on the Titan to see if the card and the nvidia driver installation is still fine. I will also test if the same problem occurs with the GT 650M card.
____________
Mark my words and remember me. - 11th Hour, Lamb of God

Profile (retired account)
Send message
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 34155 - Posted: 8 Dec 2013 | 16:25:11 UTC - in response to Message 34153.
Last modified: 8 Dec 2013 | 17:04:44 UTC

I will try to run some other projects now in SP mode on the Titan to see if the card and the nvidia driver installation is still fine.


Collatz seems to run fine on the Titan with heavy load through config file. Nothing validated yet, but no obvious errors. Will try to catch a new v8.15 short run.

EDIT: Got one. Looks good so far, now at 25%. http://www.gpugrid.net/workunit.php?wuid=4978432

I will also test if the same problem occurs with the GT 650M card.


Not yet. A v8.14 short runs fine and is at 25% now.

Profile (retired account)
Send message
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 34156 - Posted: 8 Dec 2013 | 17:49:51 UTC - in response to Message 34155.
Last modified: 8 Dec 2013 | 18:23:49 UTC

Will try to catch a new v8.15 short run.

EDIT: Got one. Looks good so far, now at 25%. http://www.gpugrid.net/workunit.php?wuid=4978432


I699-SANTI_baxbimSPW2-12-62-RND9134

Nope, another failure. Sudden reboot at 43%. After the restart some POEM OpenCL tasks kicked in, so the nvidia driver and the GPUGRID workunit had no chance to crash this time, and I could suspend the workunit in question. The WUProp workunit was killed again too. This shows at least that WUProp is killed by the sudden reboot, not by the video driver crashing (makes sense to me).

If you would like to receive the contents of the slot directory, please PM me an email address.

EDIT: The two POEM workunits finished ok. No indication of a hardware fault or faulty driver, yet.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 34159 - Posted: 8 Dec 2013 | 20:11:18 UTC - in response to Message 34156.

I'm also using 331.40 (which is a Beta). Probably worth updating to 331.82 (the most recent WHQL driver), but for me this wasn't happening at the beginning of last week or before that (same drivers).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile (retired account)
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Message 34161 - Posted: 8 Dec 2013 | 22:35:28 UTC - in response to Message 34159.

but for me this wasn't happening at the beginning of last week or before that (same drivers).


Yes, same here. Might consider an update, though...

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 34163 - Posted: 9 Dec 2013 | 0:45:05 UTC - in response to Message 34161.

I did the suggested update and I'm still getting stung.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

FoldingNator
Joined: 1 Dec 12
Posts: 24
Credit: 60,122,950
RAC: 0
Message 34164 - Posted: 9 Dec 2013 | 1:42:18 UTC

Something else, or maybe the same...
I was watching my task list, and I noticed that after the BSOD and the install of the older driver, all new tasks have a shorter CPU time.

Before, I mostly had CPU runtimes of 9,000-10,000 seconds; now it is more like normal again: 2,000-3,000 seconds.

=======================================

I'm back at my computer now, and I have searched the logfiles for the driver crash. From 5 December it says:

Cannot find the description of Event ID 1 from source NvStreamSvc. The component that raised the event may not be installed on the local computer, or the installation is corrupted. You can install or repair the component on the local computer.

The following information is included in the event:
NvStreamSvc
NvVAD initialization failed [6]


The computer restarted after a bug check. The bug check is 0x00000116 (0xfffffa801270d010, 0xfffff88006940010, 0x0000000000000000, 0x000000000000000d). A dump was saved in: C:\Windows\Minidump\120513-6286-01.dmp. Report ID: 120513-6286-01.

Profile (retired account)
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Message 34166 - Posted: 9 Dec 2013 | 2:30:19 UTC - in response to Message 34164.
Last modified: 9 Dec 2013 | 2:31:01 UTC

Thanks, skgiven, for sharing the info. I guess I will refrain from it then, at least for the time being.

after the BSOD/install older driver all new tasks do have a shorter CPU time.


Is this v314.22 you are using now, FoldingNator? Unfortunately, we Titan/GTX 780 users have to stick with v331.40 or higher, I'm afraid.

FoldingNator
Joined: 1 Dec 12
Posts: 24
Credit: 60,122,950
RAC: 0
Message 34169 - Posted: 9 Dec 2013 | 10:51:37 UTC - in response to Message 34166.

Yes, you're right!

Before, I had v320.18 installed; after the clean install of a new driver I chose the always-stable driver (for GeForce Fermi 400/500 cards), v314.22. I don't know whether the lower CPU times come from the other driver, or whether it's only a coincidence.

Maybe bad news for users with recent high-end cards. :(

TheFiend
Joined: 26 Aug 11
Posts: 99
Credit: 2,500,112,138
RAC: 1
Message 34172 - Posted: 9 Dec 2013 | 17:05:34 UTC

Just suffered the effects of this bug today after a power failure, but it only affected my Win 7 Pro 64 machine; my Win XP Home 32 machine restarted the tasks OK.

Profile JStateson
Joined: 31 Oct 08
Posts: 183
Credit: 3,326,949,677
RAC: 40,024
Message 34216 - Posted: 11 Dec 2013 | 22:52:11 UTC

Power glitch caused fatal GPUGrid restarts on 5 systems a few hours ago. This is a PITA. These systems are headless, and when I bring over a monitor to fix the problem (resetting GPUGrid), the BOINC Manager window can be off the edge of the screen; by the time I get it down to where I can select GPUGrid and reset the project, the damn display has reset 3 times and hung, and I have to start the process all over again.

I tried setting BAM! so that all the GPUGrid projects are suspended, but there is a timing problem, and there are 2 systems that start GPUGrid before BOINC Manager realizes it was supposed to suspend the project.

This is really crappy, but rather than complain any more I am just going to switch to prime and check this thread occasionally to see if the problem has been fixed. Once I get through the cold spell, probably March, there should not be any more power glitches and I can put GPUGrid back online.

Maybe someone here can come up with a script to reset the project following a reboot after a power outage. Windows knows when the power goes out, so there should be some API or other mechanism that GPUGrid could use.
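For what it's worth, a sketch of such a recovery script using BOINC's stock boinccmd tool is below. This is hypothetical, not an official fix: it assumes boinccmd is on PATH, the client allows local RPC, and that you are willing to lose the stuck tasks (reset discards all work in progress for the project). It prints the commands by default so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical post-reboot recovery sketch for the stuck-task scenario:
# suspend GPUGrid before its tasks can restart and crash the driver,
# reset the project to discard the unresumable tasks, then resume.
# Assumes boinccmd (BOINC's stock CLI) is available on this host.

PROJECT_URL="http://www.gpugrid.net/"

# Build the command list so it can be reviewed (dry run) before running.
recovery_commands() {
    for action in suspend reset resume; do
        echo "boinccmd --project $PROJECT_URL $action"
    done
}

# Dry run by default: print what would be executed.
# To execute for real:  recovery_commands | sh
recovery_commands
```

On Windows the equivalent boinccmd.exe commands could be put in a batch file triggered by a Task Scheduler "at startup" task, which would cover the reboot-after-power-out case without a monitor attached.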

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 34217 - Posted: 11 Dec 2013 | 22:58:20 UTC - in response to Message 34216.

Beemer:

The problem is fixed on the Short Queue, with v8.15, I believe.
It has not yet been moved to the Long Queue.

You might be able to edit your GPUGrid preferences, to only do Short-Queue, for now.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 34219 - Posted: 11 Dec 2013 | 23:18:14 UTC - in response to Message 34216.

If you put a not-too-expensive UPS behind it, you can safely shut the rigs down on a power outage (if you are near the PCs)?
____________
Greetings from TJ

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1425
Credit: 3,520,440,451
RAC: 596,293
Message 34220 - Posted: 11 Dec 2013 | 23:41:51 UTC - in response to Message 34219.

If you put a not-too-expensive UPS behind it, you can safely shut the rigs down on a power outage (if you are near the PCs)?

If you put a quality UPS behind it, it will come with software that can switch the rigs down safely whether you are nearby or not.

Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Message 34222 - Posted: 12 Dec 2013 | 4:20:46 UTC - in response to Message 34220.

Check out Cyber Power brand UPSes. They're more reasonably priced than the APC brand, and they even provide an app for Linux (I assume their Windows app is even better) with options to:

1) on power failure, send apps the shutdown signal, wait a few seconds, then shut down

2) not shut down immediately, because the power might return very soon; instead wait until the backup battery approaches its minimum operational voltage, then shut down gracefully

3) not shut down immediately, but rather run a user-designated script that might, for example, suspend power-hungry apps like BOINC, then wait until the battery approaches minimum operational voltage before shutting down; send the admin an email, send the power company a nasty email, whatever you want the script to do

4) when power returns, run a second script that could, for example, resume power-hungry apps like BOINC and send the admin an email saying the power has returned

A UPS saves a lot of grief; highly recommended.
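Options 3) and 4) amount to a pair of event hooks. A minimal sketch of what such a hook could look like is below; the event names are illustrative (not taken from any particular UPS package), while boinccmd --set_run_mode is BOINC's standard switch for pausing and resuming computation. It prints the commands rather than executing them, so it is safe to try anywhere.

```shell
#!/bin/sh
# Hypothetical UPS event hook in the spirit of options 3) and 4) above.
# A UPS daemon would call this with the event name as the first argument.

ups_event() {
    case "$1" in
        powerfail)
            # Pause power-hungry compute before the battery drains.
            echo "boinccmd --set_run_mode never" ;;
        powerrestore)
            # Mains is back: let BOINC resume under its normal rules.
            echo "boinccmd --set_run_mode auto" ;;
        *)
            echo "usage: ups_event {powerfail|powerrestore}" >&2
            return 2 ;;
    esac
}

# Dry run: print the command a real hook would execute for each event.
ups_event powerfail
ups_event powerrestore
```

To make it act for real, pipe the printed command to sh (or call boinccmd directly); the same two-mode structure extends naturally to sending the admin emails as described above.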
____________
BOINC <<--- credit whores, pedants, alien hunters

Jim1348
Joined: 28 Jul 12
Posts: 816
Credit: 1,568,413,471
RAC: 178,877
Message 34228 - Posted: 12 Dec 2013 | 9:07:10 UTC - in response to Message 34222.

Check out Cyber Power brand UPS. They're more reasonably priced than the APC brand and they even provide an app for Linux that has options to (I assume their Windows app is even better):

I like the CyberPower "Pure sine wave" series, which gives a better sine wave output than the others, which give only a stepped-sine wave approximation. The latter can cause trouble with some of the new high-efficiency PC power supplies that have power-factor correction (PFC).

I have just replaced an APC UPS 750 with a "CyberPower CP1350PFCLCD UPS 1350VA/810W PFC compatible Pure sine wave" (my second one), since the APC causes an occasional alarm fault with the 90% efficient power supply in that PC.

Profile JStateson
Joined: 31 Oct 08
Posts: 183
Credit: 3,326,949,677
RAC: 40,024
Message 34236 - Posted: 12 Dec 2013 | 13:34:11 UTC
Last modified: 12 Dec 2013 | 14:11:42 UTC

I have a small APC that runs the cable modem, switch, and WiFi, but it cannot be used with the AC powerline Ethernet adapter, as it filters out the Ethernet signal. My systems are not all in one place where they could be serviced by a single backup unit.

I am in the middle of installing Splashtop Streamer on all systems. Supposedly there is a limit of 5 systems, but I have gone past that and it has not complained. While that gives me access to the desktop while CUDA is running, it does not provide a solution for resetting the project before the work unit causes a crash. If all the systems had honored the suspension, I could easily have commanded BoincTasks to reset and resume this project on all of them.

I think boinc.exe should have noticed the suspension request that I set at BAM! and applied it before starting the project. I will ask about this on the BOINC forum.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 36445 - Posted: 19 Apr 2014 | 16:49:05 UTC - in response to Message 36419.
Last modified: 19 Apr 2014 | 16:49:44 UTC

This thread was regarding a specific problem, as detailed in the first post.
The problem was a bug in the 8.14 version of the app.
The problem was fixed with the 8.15 version of the app, so the thread has been closed.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

