Message boards : Number crunching : Unable to load module .mshake_kernel.cu. (702)

Profile ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Level
Leu
Message 38064 - Posted: 25 Sep 2014 | 2:38:43 UTC
Last modified: 25 Sep 2014 | 3:22:10 UTC

My normally rock solid GTX570 host kicked out a couple of compute errors on short tasks:

glumetx5-NOELIA_SH2-13-50-RND5814 (Others seemed to have problems with this one)

prolysx8-NOELIA_SH2-13-50-RND2399_0

Both stderr outputs include:

SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)

Both occurrences resulted in a driver crash and system reboot.

Possibly related question/issue... Are those GPU temps in the stderr output? Could that be part of the problem? I checked other successful tasks and have seen higher values than those in the recently crashed tasks.
____________

Profile ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Level
Leu
Message 38356 - Posted: 8 Oct 2014 | 2:01:27 UTC - in response to Message 38064.

My normally rock solid GTX570 host kicked out a couple of compute errors on short tasks...

...Probably because the GPU suffered a partial failure. Since this happened, the host would run fine under little or no load. As soon as the GPU got stressed running any BOINC tasks I threw at it, the machine would eventually crash and reboot.

The fan was getting a little noisy and there were signs of some kind of oily liquid on the enclosure. Fortunately, it was still under warranty and EVGA sent me a refurbished GTX 570 under RMA. A virtually painless process -- thanks, EVGA.

Maybe I should wait until the replacement GPU runs a few tasks successfully, but everything looks good, so far.

____________

Dayle Diamond
Joined: 5 Dec 12
Posts: 70
Credit: 1,069,142,015
RAC: 932,618
Level
Met
Message 38814 - Posted: 5 Nov 2014 | 2:17:18 UTC

Well, I'm getting it on long tasks too.
This was from a brand new GTX 970 SSC from EVGA.

Name 20mgx1069-NOELIA_20MG2-14-50-RND0261_0
Workunit 10253503
Created 4 Nov 2014 | 3:29:32 UTC
Sent 4 Nov 2014 | 4:29:42 UTC
Received 4 Nov 2014 | 14:45:46 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -52 (0xffffffffffffffcc) Unknown error number
Computer ID 140554
Report deadline 9 Nov 2014 | 4:29:42 UTC
Run time 21,616.47
CPU time 4,273.91
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.47 (cuda65)
Stderr output
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
# GPU 0 : 46C
# GPU 0 : 47C
# GPU 0 : 50C
# GPU 0 : 51C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
# GPU 0 : 45C
# GPU 0 : 46C
# GPU 0 : 48C
# GPU 0 : 50C
# GPU 0 : 52C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
# GPU 0 : 62C
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r344_32 : 34448
SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)

</stderr_txt>
]]>
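
As a side note on the two forms of the exit status shown above: -52, 0xffffffcc and 0xffffffffffffffcc are the same value, printed as a signed number and as its 32-bit and 64-bit two's-complement bit patterns. A minimal Python sketch of that arithmetic (the helper name `as_unsigned` is made up for illustration and is not part of BOINC):

```python
def as_unsigned(value: int, bits: int) -> str:
    """Return the two's-complement bit pattern of `value` as a hex string."""
    # Masking with 2**bits - 1 keeps only the low `bits` bits of the number.
    return hex(value & ((1 << bits) - 1))


print(as_unsigned(-52, 32))  # 0xffffffcc
print(as_unsigned(-52, 64))  # 0xffffffffffffffcc
```

So the hex value carries no information beyond "-52"; the useful detail is the SWAN line at the end of the stderr text.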

Dayle Diamond
Joined: 5 Dec 12
Posts: 70
Credit: 1,069,142,015
RAC: 932,618
Level
Met
Message 38934 - Posted: 17 Nov 2014 | 2:37:15 UTC

Now I got five "Unable to load module" crashes in a day.
Some crashed a few seconds into their run, some of them crashed after many hours of computation.

Last ones caused a blue screen and restart.

I replaced my old GTX 460 with a 1200 watt PSU and two 970's to make a big impact with BOINC GPU projects, but the frequent crashes are erasing much of my gains.


ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Level
Glu
Message 38945 - Posted: 17 Nov 2014 | 21:28:54 UTC

Dayle, I was getting roughly similar problems: sometimes "the simulation has become unstable", sometimes "failed to load *.cu". Also system crashes and blue screens.

I had bought a new GTX970 and it was running fine for a week. I then added my previous GTX660Ti back into the case, running 2 big GPUs for the 1st time. I've got both cards "eco-tuned", running power-limited. Yet it seems the increased case temperatures pushed my OC'ed CPU over the stability boundary. Since I lowered the CPU clock speed a notch there have been no more failures. Well, that's only been 1.5 days by now, but it's still a record.

Moral: maybe the heat output from those GPUs is also stressing some other component of your system too much.

MrS
____________
Scanning for our furry friends since Jan 2002

Dayle Diamond
Joined: 5 Dec 12
Posts: 70
Credit: 1,069,142,015
RAC: 932,618
Level
Met
Message 38947 - Posted: 18 Nov 2014 | 3:15:02 UTC

That was a very interesting idea, so I went ahead and looked at the work unit logs for all five crashes.

The ambient temperature in the room fluctuates depending on the time of day, but here is EACH GPU's temp whenever one OR the other failed.

All numbers in C

1. 64 & 58
2. 71 & 77
3. 58 & 63
4. 58 & 46
5. 71 & 77

77 degrees is much hotter than I'm hoping they'd run at, and I wonder if you're right. If so, it's time for a new case. I've got both the right and left panels of my tower disconnected, plus a HEPA filter in the room to keep dust from getting in. But maybe my airflow isn't directed enough? But that doesn't seem to be all of the problem, because they're crashing at much lower temperatures too.
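
For anyone who wants to pull these numbers out of a task's stderr without reading it by eye, here is a minimal Python sketch. It assumes the temperature lines always look like "# GPU 0 : 62C", as in the logs posted in this thread; the function name `peak_temps` and the sample text are only illustrative.

```python
import re

# Matches lines such as "# GPU 0 : 62C"; the device header "# GPU [GeForce ...]"
# does not match because of the bracket after "GPU".
TEMP_LINE = re.compile(r"^#\s*GPU\s+(\d+)\s*:\s*(\d+)C\s*$")


def peak_temps(stderr_text: str) -> dict[int, int]:
    """Return the highest temperature reported for each GPU index."""
    peaks: dict[int, int] = {}
    for line in stderr_text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            gpu, temp = int(match.group(1)), int(match.group(2))
            peaks[gpu] = max(temp, peaks.get(gpu, temp))
    return peaks


# Abbreviated sample in the same format as the stderr posted earlier.
sample = """\
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
# GPU 0 : 62C
SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)
"""
print(peak_temps(sample))  # {0: 62}
```

This only reports what the application logged; it says nothing about CPU, memory or case temperatures.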

Simba123
Joined: 5 Dec 11
Posts: 147
Credit: 69,970,684
RAC: 0
Level
Thr
Message 38950 - Posted: 18 Nov 2014 | 4:54:50 UTC - in response to Message 38947.

That was a very interesting idea, so I went ahead and looked at the work unit logs for all five crashes.

The ambient temperature in the room fluctuates depending on the time of day, but here is EACH GPU's temp whenever one OR the other failed.

All numbers in C

1. 64 & 58
2. 71 & 77
3. 58 & 63
4. 58 & 46
5. 71 & 77

77 degrees is much hotter than I'm hoping they'd run at, and I wonder if you're right. If so, it's time for a new case. I've got both the right and left panels of my tower disconnected, plus a HEPA filter in the room to keep dust from getting in. But maybe my airflow isn't directed enough? But that doesn't seem to be all of the problem, because they're crashing at much lower temperatures too.



Those are your GPU temps, which are within the range for your GPU, just. IIRC your GPU thermal throttles at 80C. It may be worth either reducing your clocks or employing a more aggressive fan profile.

What Apes was referring to was CPU temps. If your GPU is dumping enough hot air into your case, it could be making your cpu unstable. Check those temps and adjust accordingly.

Dayle Diamond
Joined: 5 Dec 12
Posts: 70
Credit: 1,069,142,015
RAC: 932,618
Level
Met
Message 38965 - Posted: 19 Nov 2014 | 9:25:12 UTC - in response to Message 38950.

Okay, done. I've just recovered from another crash.
CPU is about 53 Celsius.

I'm mystified. I've let it run a little longer, and we're down to 52 C.

I did manage to see the blue screen for a split second, but it went away too quickly to take a photo. Something like "IRQL not less or equal".

Internet says that's usually a driver issue.
As I have the latest GPU drivers, latest motherboard drivers, etc, I am running "WhoCrashed" on my system and waiting for another crash.

Hopefully this is related to the Unable to Load Module issue.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Level
Glu
Message 38980 - Posted: 20 Nov 2014 | 21:47:54 UTC - in response to Message 38965.

Hopefully this is related to the Unable to Load Module issue.

Very probably. Your temperatures seem fine. That blue screen message you got can be a software (driver) problem. Did you already try to do a clean driver install of the current 344.75?

I think that message can also mean just a general hardware failure. The PSU could be another candidate, but a new 1.2 kW unit sounds good. Is it, by any chance, a cheap Chinese model?

MrS
____________
Scanning for our furry friends since Jan 2002

Killersocke
Joined: 18 Oct 13
Posts: 41
Credit: 134,973,970
RAC: 0
Level
Cys
Message 39025 - Posted: 25 Nov 2014 | 15:55:40 UTC - in response to Message 38980.

It is NOT a driver problem.
The previous NVIDIA driver crashed as well.

see also
http://www.gpugrid.net/forum_thread.php?id=3932

regards

Dayle Diamond
Joined: 5 Dec 12
Posts: 70
Credit: 1,069,142,015
RAC: 932,618
Level
Met
Message 39156 - Posted: 16 Dec 2014 | 15:00:55 UTC

Hi Mr. S.

I don't know if it's a cheap Chinese model, it's the Platimax 1200w, which is discontinued. Picked up the last one Fry's had in their system, then special ordered the cables, because some tosser returned theirs and kept all the cords.

I've attached my GTX 970s to a new motherboard that I was able to afford during a Black Friday sale. I'll post elsewhere about that, because I'm still not getting the speed I'm expecting. Anyway, same drivers, same GPUs, same PSU, but better fans and motherboard. No more errors.

If anyone is still getting this error, I hope that helps narrow down your issue.



Dayle Diamond
Joined: 5 Dec 12
Posts: 70
Credit: 1,069,142,015
RAC: 932,618
Level
Met
Message 43363 - Posted: 11 May 2016 | 12:23:35 UTC - in response to Message 39156.

Huh. Well after a few years, this error is back, and swallowed 21 hours worth of Maxwell crunching.

Two years ago it was happening on an older motherboard, with different drivers, running different tasks, and on a different OS.

https://www.gpugrid.net/result.php?resultid=15094951

Various PC temps still look fine.

Name 2d9wR8-SDOERR_opm996-0-1-RND7930_0
Workunit 11595346
Created 9 May 2016 | 9:59:05 UTC
Sent 10 May 2016 | 7:14:49 UTC
Received 11 May 2016 | 7:15:14 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -52 (0xffffffffffffffcc) Unknown error number
Computer ID 191317
Report deadline 15 May 2016 | 7:14:49 UTC
Run time 78,815.07
CPU time 28,671.50
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)

Stderr output
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r364_69 : 36510
# GPU 0 : 66C
# GPU 1 : 72C
# GPU 0 : 67C
# GPU 1 : 73C
# GPU 0 : 68C
# GPU 1 : 74C
# GPU 0 : 69C
# GPU 0 : 70C
# GPU 0 : 71C
# GPU 0 : 72C
# GPU 1 : 75C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 1 : 76C
# GPU 0 : 76C
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r364_69 : 36510
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)

</stderr_txt>
]]>

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Level
His
Message 43387 - Posted: 12 May 2016 | 15:41:57 UTC - in response to Message 43363.

# GPU 1 : 76C
# GPU 0 : 76C


76C is too hot!

Use NVIDIA Inspector to Prioritize Temperature and set it to 69C.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Joined: 11 Oct 08
Posts: 1068
Credit: 1,149,280,989
RAC: 1,048,193
Level
Met
Message 43410 - Posted: 14 May 2016 | 13:21:39 UTC
Last modified: 14 May 2016 | 13:26:35 UTC

From my experience, GPUs can run at 70-85°C, no problem, so long as the clocks are stable. See whether removing any GPU overclocks entirely fixes the issue.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Level
His
Message 43411 - Posted: 14 May 2016 | 13:39:14 UTC - in response to Message 43410.

The issue isn't with the GPU core temperature, it's with the heat generated by it; that increases the ambient temperature inside the GPU case and the computer chassis in general. Sometimes it causes failures when the GDDR heats up too much for example, sometimes system memory can become too hot, sometimes other components such as the disk drives. Generally when temps are over 50C they can cause problems.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 45242 - Posted: 10 Nov 2016 | 8:46:54 UTC

I just had this error on a fairly new system. It is not new by technological standards, but it is new as in it was bought brand new from the store as of less than 6 months ago. I find it interesting that the heat on the GPU core tops out at 58C and had this issue. The card itself has gone to 66C recently with no issue and when it was doing long tasks would flatten out at around 59-61C. Being a GT 730 2GB card, I have it running only short tasks like my laptop is doing now. (I set my fast cards to only do long tasks as well, as I think that is the polite thing to do for weaker cards so they can get short ones in to run.)

AFAIK, this PC is not in an area that is hot or cold, but maintains a steady(ish) air temp, although it is next to a door and can get bursts of cooler air as people come in and out the front door during this fall/winter weather. It certainly hasn't been hot recently here. This is the only error task on the PC's history and this has been a very stable system for its total uptime.

I'll keep an eye on it and see if this is a pattern. I'll also have to check on the CPU temps to see if they remain steady or go through spikes. I don't think heat is an issue though unless the card is just faulty. It has done 2 tasks successfully since this error. I also see an extra task in the In Progress list that is not on the system, so I know there will be another error task on the list that will read Timed Out after the 12th.

https://www.gpugrid.net/result.php?resultid=15586143
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,662,131,944
RAC: 10,093,688
Level
Trp
Message 45243 - Posted: 10 Nov 2016 | 9:20:27 UTC - in response to Message 45242.

I just had this error on a fairly new system.
...
https://www.gpugrid.net/result.php?resultid=15586143

Here's an excerpt from the task's stderr.txt:
# GPU 0 : 58C
# GPU [GeForce GT 730] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GT 730
Note the missing
# BOINC suspending at user request (exit)
(or similar) message explaining the reason for the task shutdown between lines 1 and 2. This is the sign of a dirty task shutdown. Its cause is unknown, but it could be a dirty system shutdown, or an unattended (automatic) driver update by Windows Update or NVidia Update.
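
The same check can be scripted. The following is a minimal Python sketch, assuming that every restart writes a header line starting with "# GPU [" and that a clean stop writes a line starting with "# BOINC suspending" directly before the next header; both are inferred from the logs in this thread, not from the application's documentation. The function name `unclean_restarts` is made up for illustration.

```python
HEADER_PREFIX = "# GPU ["               # first line written each time the task (re)starts
CLEAN_EXIT_PREFIX = "# BOINC suspending"  # assumed marker of an orderly stop


def unclean_restarts(stderr_text: str) -> list[int]:
    """Return 1-based positions (counting non-blank lines) of restart headers
    that are not directly preceded by a clean-exit message."""
    lines = [ln.strip() for ln in stderr_text.splitlines() if ln.strip()]
    flagged = []
    for i, line in enumerate(lines):
        if i == 0 or not line.startswith(HEADER_PREFIX):
            continue  # the very first header is the normal task start
        if not lines[i - 1].startswith(CLEAN_EXIT_PREFIX):
            flagged.append(i + 1)
    return flagged


excerpt = """\
# GPU 0 : 58C
# GPU [GeForce GT 730] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GT 730
"""
print(unclean_restarts(excerpt))  # [2]
```

A flagged header only says the previous run segment ended without the usual message; it does not say whether the cause was a crash, a power loss or a driver update.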

Profile caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 45249 - Posted: 11 Nov 2016 | 19:37:27 UTC - in response to Message 45243.

Perhaps this was a power loss. We have had 2 in the past few weeks. I just think this is the first time I have seen this particular error and when I looked it up, it brought me to this thread.

pandemonium
Joined: 2 Oct 17
Posts: 2
Credit: 7,883,850
RAC: 212,541
Level
Ser
Message 48226 - Posted: 23 Nov 2017 | 7:47:14 UTC

In case this hasn't been resolved, I've also run into GPUGRID tasks erring and found my solution to be increasing the virtual memory size. See here.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,662,131,944
RAC: 10,093,688
Level
Trp
Message 48227 - Posted: 23 Nov 2017 | 14:11:16 UTC - in response to Message 48226.
Last modified: 23 Nov 2017 | 14:15:44 UTC

In case this hasn't been resolved, I've also run into GPUGRID tasks erring and found my solution to be increasing the virtual memory size. See here.
I don't think that increasing virtual memory size could fix such problems (perhaps indirectly by accident). Your PC has 32GB RAM. I can't imagine that even if you run 12 CPU tasks simultaneously it will run out of 32GB (+1GB virtual). If it does, then some of the apps you run have a memory leak, and it will run out even if you set +4GB or +8GB virtual memory.
These
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1965.
and
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
errors are the result of the GPUGrid task being suspended too frequently and/or too many times (or of a failing GPU).
EDIT: SLI is not recommended for GPU crunching. You should try to turn it off for a test period (even remove the SLI bridge).
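
To compare the codes across several failed tasks, a small Python sketch can collect them from the stderr text. It assumes the number in parentheses after "Unable to load module" is a CUDA driver error code, as the "Cuda driver error 999" wording suggests; that is an inference, not something the application documents here. The names in the table are the CUDA driver API constants for those three values; the function name `fatal_codes` is hypothetical.

```python
import re

# CUDA driver API names for the three codes seen in this thread.
# Assumption: SWAN prints the raw CUDA error number in its FATAL messages.
CUDA_ERROR_NAMES = {
    702: "CUDA_ERROR_LAUNCH_TIMEOUT",
    719: "CUDA_ERROR_LAUNCH_FAILED",
    999: "CUDA_ERROR_UNKNOWN",
}

MODULE_ERR = re.compile(r"SWAN : FATAL Unable to load module \S+ \((\d+)\)")
DRIVER_ERR = re.compile(r"SWAN : FATAL : Cuda driver error (\d+)")


def fatal_codes(stderr_text: str) -> list[int]:
    """Return the numeric codes of all SWAN FATAL lines, in order of appearance."""
    codes = []
    for line in stderr_text.splitlines():
        match = MODULE_ERR.search(line) or DRIVER_ERR.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


sample = """\
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1965.
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
"""
for code in fatal_codes(sample):
    print(code, CUDA_ERROR_NAMES.get(code, "not in table"))
```

If the assumption holds, 702 (launch timeout) and 719 (launch failed) describe a kernel that hung or crashed on the GPU, which fits the driver resets reported above better than a missing file would.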

pandemonium
Joined: 2 Oct 17
Posts: 2
Credit: 7,883,850
RAC: 212,541
Level
Ser
Message 48248 - Posted: 25 Nov 2017 | 10:02:31 UTC - in response to Message 48227.

You seem to be correct. I've since had a few more instances of the display driver crashing during crunching and trying to reinitialize the displays. Increasing the virtual memory paging file seemed to alleviate the issue, but not completely fix it.

I will try disabling SLI and then removing the bridge to see if that helps fix the issue once and for all.

Thanks for the input!
