Message boards : Number crunching : acemdlong_6.15_windows_intelx86__cuda31 – Application Error NATHAN_CB1 swanMemcpyDtoH failed

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Message 23199 - Posted: 31 Jan 2012 | 13:38:58 UTC

Had an Application Error popup: acemdlong_6.15_windows_intelx86__cuda31

The exception unknown software exception (0x40000015) occurred in the application at location 0x0043af9a.
Click OK to terminate the program
Click on CANCEL to debug the program

This task was running,
I18R15-NATHAN_CB1_1-37-125-RND0672_6
http://www.gpugrid.net/result.php?resultid=4868236

The task 'ran' for 8h (overnight) but did not move from 0% complete. I exited Boinc and restarted it without closing the error popup (this often recovers a task rather than killing it), but the same error appeared. I then did a system restart, and the same error appeared again.

I'm writing this one off as a bad work unit. Others experienced a similar problem. http://www.gpugrid.net/workunit.php?wuid=3089778

Name I18R15-NATHAN_CB1_1-37-125-RND0672_6
Workunit 3089778
Created 31 Jan 2012 | 0:00:42 UTC
Sent 31 Jan 2012 | 0:03:34 UTC
Received 31 Jan 2012 | 13:16:28 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 3 (0x3)
Computer ID 91249
Report deadline 5 Feb 2012 | 0:03:34 UTC
Run time 115.94
CPU time 3.55
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v6.15 (cuda31)
Stderr output

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.31 GHz
# Total amount of global memory: 1341718528 bytes
# Number of multiprocessors: 14
# Number of cores: 112
SWAN: Using synchronization method 0
MDIO: cannot open file "restart.coor"
SWAN: FATAL : swanMemcpyDtoH failed

Assertion failed: 0, file swanlib_nv.c, line 390

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.31 GHz
# Total amount of global memory: 1341718528 bytes
# Number of multiprocessors: 14
# Number of cores: 112
SWAN: Using synchronization method 0
MDIO: cannot open file "restart.coor"
SWAN: FATAL : swanMemcpyDtoH failed

Assertion failed: 0, file swanlib_nv.c, line 390

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.31 GHz
# Total amount of global memory: 1341718528 bytes
# Number of multiprocessors: 14
# Number of cores: 112
SWAN: Using synchronization method 0
MDIO: cannot open file "restart.coor"
SWAN: FATAL : swanMemcpyDtoH failed

Assertion failed: 0, file swanlib_nv.c, line 390

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>

____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,645,982,644
RAC: 9,980,171
Message 23202 - Posted: 31 Jan 2012 | 20:59:30 UTC - in response to Message 23199.
Last modified: 31 Jan 2012 | 21:04:27 UTC

I had similar experiences with this workunit.
It ran for 45 minutes, but the progress indicator didn't move from 0%. I saw from the name of the workunit that I was the 4th to try to crunch it, so I checked the history of this WU. I thought it was corrupted, so I decided to abort it without giving it another chance.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,645,982,644
RAC: 9,980,171
Message 23400 - Posted: 11 Feb 2012 | 20:32:41 UTC

I have a modded GTX480 which regularly causes an application error in acemd, and the GPU "downclocks" to 405MHz and stays at this frequency even if nothing is using the GPU. If I restart the task, it generates the same error within a couple of seconds. If I restart the PC, the "downclocking" is gone and the workunit completes successfully. So I wrote a little batch program which checks for the presence of the error message popup window and, if it's there, automatically restarts the PC. I'm sharing it here, as it could be very useful for my fellow crunchers. It's made for Windows.

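rem List any open error popup windows (long and short applications) in tasklist.txt
tasklist /fi "WINDOWTITLE eq acemdlong_6.15_windows_intelx86__cuda31 - Application Error" >tasklist.txt
tasklist /fi "WINDOWTITLE eq acemd2_6.15_windows_intelx86__cuda31 - Application Error" >>tasklist.txt
rem Look for csrss.exe in the list; if it is not found, no popup is open, so skip the restart
find "csrss.exe" tasklist.txt
if errorlevel 1 goto end
rem Log the date and time of the restart, then restart the PC
echo =========== >>check.log
date /t >>check.log
time /t >>check.log
shutdown /r /d 4:5 /c "Acemdlong Application Error"
:end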


It generates a check.log file containing the date and time of every restart, and places a record in the system event log.
Usage: the commands above should be placed in a batch file named, for example, check.bat, and a scheduled task should then be created which runs check.bat every 5 minutes, all day long.
The batch file checks for both the long and the short tasks. If the application changes in the future, the application file names (beginning with acemd) in the two tasklist lines should be changed to the new file names.
If an automatic restart takes place while someone is using the PC, the user has 30 seconds to abort the restart by issuing the shutdown /a command.
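
The scheduled task described above could be created from a command prompt along these lines. This is only a sketch: the task name "GPUGrid error check" and the path C:\BOINC\check.bat are made-up examples, and creating the task may require an administrator command prompt.

```batch
rem Hypothetical task name and path; adjust both to your own setup.
rem Runs check.bat every 5 minutes.
schtasks /create /tn "GPUGrid error check" /tr "C:\BOINC\check.bat" /sc minute /mo 5

rem To cancel a restart that is already counting down:
shutdown /a
```

Since tasklist.txt and check.log are written with relative names, they end up in whatever directory the task starts in. Adding a line such as cd /d "%~dp0" at the top of check.bat is one way to keep them next to the script.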

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Message 23417 - Posted: 12 Feb 2012 | 14:17:44 UTC - in response to Message 23400.
Last modified: 12 Feb 2012 | 14:25:12 UTC

That's excellent. Good thinking!

Probably only useful for those with this/similar problems, but it does pop up now and then :)

It would only work when not using a username and password (though you might be able to add that into the script).
You could of course turn on automatic logon, http://support.microsoft.com/kb/324737
People doing this could create a standard account, basically for use with Boinc, and a separate administrator account for maintenance (good practice anyway). Make sure Boinc can be controlled by all users; if it can't, just run the installation again.

You might want to add the /f switch, just in case some stupid programs start asking 'are you sure you want to close this program'!
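
As a sketch, the restart line in the batch file might then read as follows. Note that /f forces running applications to close without warning, so unsaved work in other programs would be lost.

```batch
rem Same restart command as in check.bat, with /f added to force applications to close
shutdown /r /f /d 4:5 /c "Acemdlong Application Error"
```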

Thanks,

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,645,982,644
RAC: 9,980,171
Message 23422 - Posted: 12 Feb 2012 | 18:17:17 UTC - in response to Message 23417.

> Probably only useful for those with this/similar problems, but it does pop up now and then :)

I will apply this batch file to all of my GPUGrid cruncher PCs. It's quite annoying to notice an error message like this after a weekend. :)

> It would only work when not using a username and password (though you might be able to add that into the script).
> You could of course turn on automatic logon, http://support.microsoft.com/kb/324737

I prefer the control userpasswords2 command (basically it does the same, through the good old Windows 2000 style user setup window).

raTTan
Joined: 17 Mar 11
Posts: 7
Credit: 28,985,881
RAC: 0
Message 23425 - Posted: 12 Feb 2012 | 21:38:11 UTC - in response to Message 23400.

I have noticed this problem, where the card gets stuck at 405 MHz, when I sometimes try to multitask with the GPU, like watching video while computing. GPUs seem to be much worse at multitasking, as I never notice problems when doing the same with CPU tasks. Anyway, I have found a way to fix this without restarting, in case you have things open that you don't want to close. You can go to Device Manager and quickly disable and re-enable the driver. Just a quick tip.

Profile dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 810,073,458
RAC: 0
Message 23439 - Posted: 13 Feb 2012 | 7:11:00 UTC
Last modified: 13 Feb 2012 | 7:12:11 UTC

Oh, nice thread. I was thinking my dual-GPU GPUGrid computer was damaged in some way, so I took out one of the cards and put it in a "new" computer to look for the error. But the crashes (in my case KASHIF and TONI tasks) are still present, now on both computers. Because this hurts on the cost side, I have switched these cards to Seti/Einstein after aborting crashed tasks with run times like 235000 sec and so on O.o. So only the remaining 560 Ti runs GPUGrid from time to time, and I don't have the crashes there. Strange. Perhaps the 8800GT and 9800GTX are finally ready for a GPUGrid pension ^^
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Crunching for my deceased Dog who had "good" Braincancer..

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,645,982,644
RAC: 9,980,171
Message 23760 - Posted: 4 Mar 2012 | 21:32:06 UTC - in response to Message 23400.
Last modified: 4 Mar 2012 | 21:32:37 UTC

I had to insert a third tasklist line into my little batch file because of the recent application upgrade.
tasklist /fi "WINDOWTITLE eq acemd.win.2352 - Application Error" >>tasklist.txt
When the 6.15 client is phased out, the first two tasklist lines can be removed. At the same time, one ">" sign should be removed in front of the "tasklist.txt" filename in the remaining line, so that it overwrites the file instead of appending to it.
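
Put together, the batch file after that change would look roughly like this. It is only a sketch assembled from the lines posted in this thread, and it assumes the popup window title stays "acemd.win.2352 - Application Error"; a later application version would need its own title here.

```batch
rem check.bat once the 6.15 tasklist lines are removed (sketch)
rem A single ">" overwrites tasklist.txt on every run
tasklist /fi "WINDOWTITLE eq acemd.win.2352 - Application Error" >tasklist.txt
rem No csrss.exe entry means no error popup is open, so skip the restart
find "csrss.exe" tasklist.txt
if errorlevel 1 goto end
echo =========== >>check.log
date /t >>check.log
time /t >>check.log
shutdown /r /d 4:5 /c "Acemdlong Application Error"
:end
```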
