Errors resuming after power outage

Message boards : Number crunching : Errors resuming after power outage

Author	Message
Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 40492 - Posted: 17 Mar 2015 | 16:54:40 UTC Last modified: 17 Mar 2015 | 17:01:00 UTC
	My computer recently restarted unexpectedly. It may have been a brief power outage, though I am not 100% sure. When it restarted and BOINC tried to load its tasks, the GPUGrid tasks ran into problems. Loading each task resulted in a TDR and then a task failure, for all 6 of my in-progress tasks.

	They all resulted in:
	Server state: Over
	Outcome: Computation error
	Client state: Compute error
	Exit status: -52 (0xffffffffffffffcc) Unknown error number
	Validate state: Invalid

	And they all had the following at the bottom of their stderr.txt:
	SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)

	Can anything be done so that GPUGrid GPU tasks can be restarted and resumed in this scenario?

	e13s16_e1s33f90-NOELIA_1mgx1-2-4-RND0021_0
	http://www.gpugrid.net/result.php?resultid=13982550
	e26s10_e20s4f232-SDOERR_villinpub2-0-1-RND0381_3
	http://www.gpugrid.net/result.php?resultid=13983070
	e15s46_e1s400f24-NOELIA_1mgx2-1-4-RND5323_0
	http://www.gpugrid.net/result.php?resultid=13983199
	2Mgx471-NOELIA_INSP-11-12-RND1315_0
	http://www.gpugrid.net/result.php?resultid=13983283
	e12s13_e4s36f65-NOELIA_1mgx1-3-4-RND7924_0
	http://www.gpugrid.net/result.php?resultid=13983393
	e15s46_e1s386f84-NOELIA_1mgx1-1-4-RND3709_0
	http://www.gpugrid.net/result.php?resultid=13983801

	Note: On this computer, I load my 3 GPUs with 2 tasks per GPU.
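	A note on the two numbers in the exit status: -52 and 0xffffffffffffffcc are the same value, the hex form being the 64-bit two's-complement bit pattern of the signed code. A minimal Python sketch of that arithmetic follows. The helper names are made up for illustration and are not part of BOINC or the GPUGrid application.

```python
def as_unsigned_hex(code: int, bits: int) -> str:
    """Render a signed exit code as its two's-complement hex pattern."""
    mask = (1 << bits) - 1          # e.g. 0xFFFFFFFF for 32 bits
    return hex(code & mask)


def as_signed(pattern: int, bits: int) -> int:
    """Interpret an unsigned bit pattern as a signed two's-complement integer."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit


if __name__ == "__main__":
    print(as_unsigned_hex(-52, 64))      # 64-bit pattern of -52
    print(as_unsigned_hex(-52, 32))      # 32-bit pattern of -52
    print(as_signed(0xFFFFFFCC, 32))     # back to the signed value
```

	The same code printed at 32-bit width is 0xffffffcc, so a report showing "-52 (0xffffffcc)" describes the same exit status.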

Dayle Diamond Joined: 5 Dec 12 Posts: 84 Credit: 1,637,088,415 RAC: 608,284	Message 40507 - Posted: 18 Mar 2015 | 15:22:50 UTC - in response to Message 40492.
	Seconded. My neighborhood is a little, uh, neglected. During warm weather, the AC drain becomes too much on the system and the whole block shuts off. I just lost about six hours of crunching yesterday due to periodic power outages. Would hate for this to be a regular issue all summer.

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 40536 - Posted: 20 Mar 2015 | 2:05:49 UTC
	MJH: Any chance you might look at this problem?

Trotador Joined: 25 Mar 12 Posts: 103 Credit: 10,028,139,893 RAC: 11,003,956	Message 40729 - Posted: 31 Mar 2015 | 18:15:04 UTC
	I have two validation errors after an abrupt power-off of a host. The WUs resumed from checkpoints and completed, but ended in validation errors (it is a two-GPU host). It was my fault: I unplugged it unintentionally while tinkering around. How dumb! Two of those 255 Kpoints ones! It hurts! Just reporting.

Bedrich Hajek Joined: 28 Mar 09 Posts: 468 Credit: 8,471,772,716 RAC: 11,134,686	Message 40730 - Posted: 31 Mar 2015 | 22:17:37 UTC - in response to Message 40729.
	I had the same problem 2 days ago. The WUs either crash immediately, or they continue normally and then you get the validation error when they upload. Crashing immediately is not a new problem, but the validation error is.

Duane Bong Joined: 21 Feb 10 Posts: 16 Credit: 746,750,284 RAC: 726,288	Message 47003 - Posted: 18 Apr 2017 | 7:39:20 UTC Last modified: 18 Apr 2017 | 7:40:48 UTC
	I just had a WU that gave a Computation error after 23 hours of crunching because of a power failure. It happens to me every now and then, especially during rainy seasons when thunder causes the power in my house to trip. Over the years, I've probably lost 30-40 half-completed WUs this way. It is unfortunate that GPUGrid doesn't resume from the last checkpoint and instead errors out, so everything is lost. All the other projects I do, like P95 or WGC, simply resume after power failures from the last saved checkpoint. Is this something that the developers can improve on?
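	For context on what "resume from the last saved checkpoint" usually relies on: a common pattern is to write each checkpoint to a temporary file, flush it to disk, and only then rename it over the previous one, so a power cut leaves either the old or the new checkpoint intact. The Python sketch below illustrates that general pattern only. It is not taken from the GPUGrid application, and the function names and the JSON state format are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to path so that a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # ask the OS to push the data to disk
        os.replace(tmp_path, path)   # swap in the new checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or default if none exists or it is unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

	A worker would call load_checkpoint once at startup and save_checkpoint every so many steps. Whether a resumed GPU task then runs cleanly also depends on the driver and device state after the reboot, which this sketch does not address.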

Erich56 Joined: 1 Jan 15 Posts: 1091 Credit: 6,813,157,676 RAC: 15,458,972	Message 47005 - Posted: 18 Apr 2017 | 9:01:03 UTC
	Duane, somewhere in the PC settings you see "disc caching" - this should be unchecked.

Duane Bong Joined: 21 Feb 10 Posts: 16 Credit: 746,750,284 RAC: 726,288	Message 47006 - Posted: 18 Apr 2017 | 12:17:35 UTC - in response to Message 47005. Last modified: 18 Apr 2017 | 12:19:38 UTC
	Erich56 wrote: Duane, somewhere in the PC settings you see "disc caching" - this should be unchecked.

	Thanks for the suggestion. I checked in my Device Manager under the drive > Policies and found that the Write Caching box is already unchecked, yet I still lost the WU after the power outage.

	But I don't think data corruption of the checkpoint is the issue. This is the report I see for the WU:
	SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)

	It seems that after the power failure and reboot it hits some kernel error? It is the exact same problem that the starter of this thread reported. Yet the next WU in the queue starts crunching fine after that.

	<core_client_version>7.6.33</core_client_version>
	<![CDATA[
	<message>
	(unknown error) - exit code -52 (0xffffffcc)
	</message>
	<stderr_txt>
	# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
	# SWAN Device 0 :
	# Name : GeForce GTX 960
	# ECC : Disabled
	# Global mem : 2048MB
	# Capability : 5.2
	# PCI ID : 0000:28:00.0
	# Device clock : 1291MHz
	# Memory clock : 3600MHz
	# Memory width : 128bit
	# Driver version : r381_64 : 38165
	# GPU 0 : 59C
	# GPU 0 : 60C
	# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
	# SWAN Device 0 :
	# Name : GeForce GTX 960
	# ECC : Disabled
	# Global mem : 2048MB
	# Capability : 5.2
	# PCI ID : 0000:28:00.0
	# Device clock : 1291MHz
	# Memory clock : 3600MHz
	# Memory width : 128bit
	# Driver version : r381_64 : 38165
	# GPU 0 : 57C
	Can't acquire lockfile - exiting
	No heartbeat from core client for 30 sec - exiting
	# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
	# SWAN Device 0 :
	# Name : GeForce GTX 960
	# ECC : Disabled
	# Global mem : 2048MB
	# Capability : 5.2
	# PCI ID : 0000:28:00.0
	# Device clock : 1291MHz
	# Memory clock : 3600MHz
	# Memory width : 128bit
	# Driver version : r381_64 : 38165
	# GPU 0 : 58C
	# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
	# SWAN Device 0 :
	# Name : GeForce GTX 960
	# ECC : Disabled
	# Global mem : 2048MB
	# Capability : 5.2
	# PCI ID : 0000:28:00.0
	# Device clock : 1291MHz
	# Memory clock : 3600MHz
	# Memory width : 128bit
	# Driver version : r381_64 : 38165
	# GPU 0 : 57C
	# GPU 0 : 58C
	# GPU 0 : 59C
	# GPU 0 : 60C
	# GPU 0 : 61C
	# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
	# SWAN Device 0 :
	# Name : GeForce GTX 960
	# ECC : Disabled
	# Global mem : 2048MB
	# Capability : 5.2
	# PCI ID : 0000:28:00.0
	# Device clock : 1291MHz
	# Memory clock : 3600MHz
	# Memory width : 128bit
	# Driver version : r381_64 : 38165
	SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
	</stderr_txt>
	]]>
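	One way to read the numbers in these reports, offered as an assumption rather than something the thread confirms: if the value in parentheses after the SWAN message is a CUDA driver API result code, then 702 would be CUDA_ERROR_LAUNCH_TIMEOUT, which fits the TDR described in the first post, and 719 would be CUDA_ERROR_LAUNCH_FAILED. The Python sketch below pulls that code out of a stderr excerpt and counts the "# GPU [" header lines, on the further assumption that each header marks one start of the application. The function name and the returned dictionary layout are hypothetical.

```python
import re

# Assumption: the parenthesised number is a CUDA driver API (CUresult) value.
# Only the two codes seen in this thread are listed.
CUDA_RESULT_NAMES = {
    702: "CUDA_ERROR_LAUNCH_TIMEOUT",
    719: "CUDA_ERROR_LAUNCH_FAILED",
}

# Matches e.g. "SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)"
FATAL_RE = re.compile(r"SWAN : FATAL (?P<msg>.*?)\s*\((?P<code>\d+)\)")


def summarize(stderr_text: str) -> dict:
    """Summarize a stderr excerpt: start count, fatal code, and its assumed name."""
    # Assumption: every "# GPU [" header line corresponds to one application start.
    starts = stderr_text.count("# GPU [")
    match = FATAL_RE.search(stderr_text)
    code = int(match.group("code")) if match else None
    name = CUDA_RESULT_NAMES.get(code, "unknown")
    return {"starts": starts, "code": code, "name": name}


if __name__ == "__main__":
    excerpt = (
        "# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]\n"
        "# GPU 0 : 59C\n"
        "# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]\n"
        "SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)\n"
    )
    print(summarize(excerpt))
```

	Run against the full stderr above, the header count would be five, with the fatal error appearing only after the last one.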
