I have a host with 2xGTX590 and 2xGTX690, and tasks exit with the error below at various stages of processing:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]>
As the app is poor at logging which device it's using or failing on, I'm unable to see which devices are being used.
As a comparison, this host is one of the top hosts on Seti@Home (now down, which is the reason I'm crunching here) with no error tasks, so there is no hardware-related issue or Nvidia driver issue.
Regards
Morten
mikey:
I have a host with 2xGTX590 and 2xGTX690, and tasks exit with the error below at various stages of processing:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]>
As the app is poor at logging which device it's using or failing on, I'm unable to see which devices are being used.
As a comparison, this host is one of the top hosts on Seti@Home (now down, which is the reason I'm crunching here) with no error tasks, so there is no hardware-related issue or Nvidia driver issue.
Regards
Morten
If you look at the top of the Event Log it will tell you which GPUs loaded, and which is #0 and which is #1. It should also tell you, for instance, that it found GPU #1 but failed to load the drivers. Here is mine from this machine:
11/30/2012 10:51:03 AM | | ATI GPU 0: ATI Radeon HD 5700 series (Juniper) (CAL version 1.4.1741, 1024MB, 991MB available, 2720 GFLOPS peak)
11/30/2012 10:51:03 AM | | ATI GPU 1: ATI Radeon HD 5700 series (Juniper) (CAL version 1.4.1741, 1024MB, 991MB available, 2800 GFLOPS peak)
11/30/2012 10:51:03 AM | | OpenCL: ATI GPU 0: ATI Radeon HD 5700 series (Juniper) (driver version CAL 1.4.1741 (VM), device version OpenCL 1.2 AMD-APP (938.2), 1024MB, 991MB available)
11/30/2012 10:51:03 AM | | OpenCL: ATI GPU 1: ATI Radeon HD 5700 series (Juniper) (driver version CAL 1.4.1741 (VM), device version OpenCL 1.2 AMD-APP (938.2), 1024MB, 991MB available)
As you can see I have two AMD 5770s in the machine. The first lines show what it found, while the last lines show that the drivers loaded properly. I also see you are using Windows; IF a GPU crashes in Windows, the ONLY way to get it back to normal is to reboot the machine, so you might try that first and see if it helps.
Hi,
The top of the BOINC Manager event log is of no relevance to the loading of the [science] app and crunching. In other projects the app logs which GPU it loads the task on - not so here. What you are referring to is simply the listing of which GPU devices BOINC has found available on the system.
In the meantime I also see this crash behavior on my GTX690-only host, so this is also occurring on a non-mixed-version GPU host. Also, again, there is no NVIDIA driver crash/restart involved in this scenario - only the app.
Morten
____________
Hi!
The MDIO: cannot open file "restart.coor" message is a false error (it appears in every task); however, the following one is a real error.
The CUDA 4.2 tasks are faster than the CUDA 3.1 tasks, and they are more demanding at the same time, so they tolerate less overclocking. Therefore, if you are overclocking your cards, you should recalibrate them for the CUDA 4.2 tasks. For example, I had to set my GTX 590 to 625MHz for the CUDA 4.2 client; it was running fine at 725MHz with the CUDA 3.1 client. However, the CUDA 4.2 client is 40% faster than the CUDA 3.1 client, so it can do about 20% more work even at the lower frequency.
Maybe your GPU temperatures are too high (below 80°C is ok, above 90°C is dangerous).
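To put rough numbers on that claim (assuming throughput scales linearly with the core clock, which is only an approximation): $\frac{625}{725} \times 1.40 \approx 0.86 \times 1.40 \approx 1.21$, i.e. roughly 20% more work per unit time even at the lower frequency.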
Hi!
The MDIO: cannot open file "restart.coor" message is a false error (it appears in every task); however, the following one is a real error.
The CUDA 4.2 tasks are faster than the CUDA 3.1 tasks, and they are more demanding at the same time, so they tolerate less overclocking. Therefore, if you are overclocking your cards, you should recalibrate them for the CUDA 4.2 tasks. For example, I had to set my GTX 590 to 625MHz for the CUDA 4.2 client; it was running fine at 725MHz with the CUDA 3.1 client. However, the CUDA 4.2 client is 40% faster than the CUDA 3.1 client, so it can do about 20% more work even at the lower frequency.
Maybe your GPU temperatures are too high (below 80°C is ok, above 90°C is dangerous).
Hi Retvari,
Temp is not an issue - none reach above 70°C.
Overclocking is also not an issue, as all cards are at stock settings.
skgiven (volunteer moderator):
This post, or other posts in the same FAQ thread, might point you in the right direction: 'FAQ - Why does my run fail? Some answers'.
What's your CPU usage?
BTW, from the BOINC Tasks tab you can click on a task and select Properties to see which GPU it's running on - device 0 or device 1, for example.
This, however, isn't listed in the BOINC logs, neither at the top of the page nor in the reams of messages. If you select a GPUGrid line and then 'Show only this project', it makes it a little easier to find some info, so long as you don't have lots of flags set, but what you'll find is something like this:
03/12/2012 08:06:43 | GPUGRID | Starting task p039_r2-TONI_AGGd4-5-100-RND1877_0 using acemdlong version 616 (cuda42) in slot 0
Slot 0 isn't a physical slot (a PCIe slot), and it doesn't correspond to the device either. It's just a logical slot that BOINC allocates the task to - a bit of a misnomer, and probably another relic from the CPU-based beginnings of BOINC.
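If you want to pull those 'Starting task' lines out of the client log yourself, here is a minimal C++ sketch; it assumes the client's messages are also written to stdoutdae.txt in the BOINC data directory, and the path below is only an example - adjust it for your own install:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Example path only - point this at your own BOINC data directory.
    std::ifstream log("C:\\ProgramData\\BOINC\\stdoutdae.txt");
    if (!log) {
        std::cerr << "Could not open the BOINC message log\n";
        return 1;
    }
    std::string line;
    while (std::getline(log, line)) {
        // Keep only the GPUGRID task-start messages, which name the app version and slot.
        if (line.find("GPUGRID") != std::string::npos &&
            line.find("Starting task") != std::string::npos)
            std::cout << line << '\n';
    }
    return 0;
}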
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
Hi,
CPU utilization is stable around 35%, so plenty of idle cores.
Other projects have science apps that actually log what device they run on, and thus when a task fails you can see if it's the same device each time.
This project gives no help in finding such a pattern, as the app logs absolutely no such info.
As a comparison, the Milkyway@Home app:
<search_application> milkyway_separation 1.02 Windows x86_64 double OpenCL </search_application>
Unrecognized XML in project preferences: max_gfx_cpu_pct
Skipping: 100
Skipping: /max_gfx_cpu_pct
Unrecognized XML in project preferences: allow_non_preferred_apps
Skipping: 1
Skipping: /allow_non_preferred_apps
Unrecognized XML in project preferences: nbody_graphics_poll_period
Skipping: 30
Skipping: /nbody_graphics_poll_period
Unrecognized XML in project preferences: nbody_graphics_float_speed
Skipping: 5
Skipping: /nbody_graphics_float_speed
Unrecognized XML in project preferences: nbody_graphics_textured_point_size
Skipping: 250
Skipping: /nbody_graphics_textured_point_size
Unrecognized XML in project preferences: nbody_graphics_point_point_size
Skipping: 40
Skipping: /nbody_graphics_point_point_size
Guessing preferred OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE4.1 path
Found 1 platform
Platform 0 information:
Name: AMD Accelerated Parallel Processing
Version: OpenCL 1.2 AMD-APP (1016.4)
Vendor: Advanced Micro Devices, Inc.
Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing
Profile: FULL_PROFILE
Using device 0 on platform 0
Found 2 CL devices
Device 'Cayman' (Advanced Micro Devices, Inc.:0x1002) (CL_DEVICE_TYPE_GPU)
Driver version: 1016.4 (VM)
Version: OpenCL 1.2 AMD-APP (1016.4)
Compute capability: 0.0
Max compute units: 24
Clock frequency: 890 Mhz
Global mem size: 2147483648
Local mem size: 32768
Max const buf size: 65536
Double extension: cl_khr_fp64
Build log:
--------------------------------------------------------------------------------
"D:\Users\EXCHTE~1\AppData\Local\Temp\OCL798C.tmp.cl", line 30: warning:
OpenCL extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^
LOOP UNROLL: pragma unroll (line 288)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 280)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 273)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 244)
Unrolled as requested!
LOOP UNROLL: pragma unroll (line 202)
Unrolled as requested!
--------------------------------------------------------------------------------
Build log:
--------------------------------------------------------------------------------
"D:\Users\EXCHTE~1\AppData\Local\Temp\OCL7A49.tmp.cl", line 27: warning:
OpenCL extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^
--------------------------------------------------------------------------------
Using AMD IL kernel
Binary status (0): CL_SUCCESS
Estimated AMD GPU GFLOP/s: 2734 SP GFLOP/s, 684 DP FLOP/s
Using a target frequency of 60.0
Using a block size of 6144 with 72 blocks/chunk
Using clWaitForEvents() for polling (mode -1)
Range: { nu_steps = 960, mu_steps = 1600, r_steps = 1400 }
Iteration area: 2240000
Chunk estimate: 5
Num chunks: 6
Chunk size: 442368
Added area: 414208
Effective area: 2654208
Initial wait: 12 ms
Integration time: 78.151557 s. Average time per iteration = 81.407872 ms
Integral 0 time = 78.963098 s
Running likelihood with 94785 stars
Likelihood time = 1.212664 s
<background_integral> 0.000692186091274 </background_integral>
<stream_integral> 186.259197021827840 1439.041456655146900 </stream_integral>
<background_likelihood> -3.216576106093333 </background_likelihood>
<stream_only_likelihood> -3.733580537591124 -4.441329969686433 </stream_only_likelihood>
<search_likelihood> -2.935394639475085 </search_likelihood>
16:02:13 (10500): called boinc_finish
Seti@Home app:
setiathome_CUDA: Found 6 CUDA device(s):
Device 1: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 3, pciSlotID = 0
clockRate = 1215 MHz
Device 2: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 4, pciSlotID = 0
clockRate = 1215 MHz
Device 3: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 8, pciSlotID = 0
clockRate = 1225 MHz
Device 4: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 12, pciSlotID = 0
clockRate = 1225 MHz
Device 5: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 13, pciSlotID = 0
clockRate = 1225 MHz
Device 6: GeForce GTX 590, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 16
pciBusID = 9, pciSlotID = 0
clockRate = 1225 MHz
In cudaAcc_initializeDevice(): Boinc passed DevPref 3
setiathome_CUDA: CUDA Device 3 specified, checking...
Device 3: GeForce GTX 590 is okay
SETI@home using CUDA accelerated device GeForce GTX 590
mbcuda.cfg, processpriority key detected
Priority of process set to ABOVE_NORMAL successfully
Priority of worker thread set successfully
Multibeam x41x Preview, Cuda 4.20
Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is : 0.431954
VRAM: cudaMalloc((void**) &dev_cx_DataArray, 1048576x 8bytes = 8388608bytes, offs256=0, rtotal= 8388608bytes
VRAM: cudaMalloc((void**) &dev_cx_ChirpDataArray, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 17825792bytes
VRAM: cudaMalloc((void**) &dev_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 17825800bytes
VRAM: cudaMalloc((void**) &dev_WorkData, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 27262984bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrum, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 31457288bytes
VRAM: cudaMalloc((void**) &dev_t_PowerSpectrum, 1048584x 4bytes = 1048608bytes, offs256=0, rtotal= 32505896bytes
VRAM: cudaMalloc((void**) &dev_GaussFitResults, 1048576x 16bytes = 16777216bytes, offs256=0, rtotal= 49283112bytes
VRAM: cudaMalloc((void**) &dev_PoT, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 55574568bytes
VRAM: cudaMalloc((void**) &dev_PoTPrefixSum, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 61866024bytes
VRAM: cudaMalloc((void**) &dev_NormMaxPower, 16384x 4bytes = 65536bytes, offs256=0, rtotal= 61931560bytes
VRAM: cudaMalloc((void**) &dev_flagged, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 66125864bytes
VRAM: cudaMalloc((void**) &dev_outputposition, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 70320168bytes
VRAM: cudaMalloc((void**) &dev_PowerSpectrumSumMax, 262144x 12bytes = 3145728bytes, offs256=0, rtotal= 73465896bytes
VRAM: cudaMallocArray( &dev_gauss_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=248, rtotal= 73474088bytes
VRAM: cudaMallocArray( &dev_null_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=144, rtotal= 73482280bytes
VRAM: cudaMalloc((void**) &dev_find_pulse_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 73482288bytes
VRAM: cudaMalloc((void**) &dev_t_funct_cache, 1966081x 4bytes = 7864324bytes, offs256=0, rtotal= 81346612bytes
Thread call stack limit is: 1k
boinc_exit(): requesting safe worker shutdown ->
Worker Acknowledging exit request, spinning-> boinc_exit(): received safe worker shutdown acknowledge ->
Checking which GPU a task is running on is not very helpful when what I need is the device it has failed on.
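For comparison, the per-device logging the SETI app does above takes only a handful of CUDA runtime calls. This is not the GPUGRID/ACEMD code, just a minimal sketch of what such logging could look like:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed\n");
        return 1;
    }
    std::printf("Found %d CUDA device(s)\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Report name, memory, compute capability and PCI bus ID per device.
        std::printf("Device %d: %s, %zu MiB, compute %d.%d, pciBusID %d\n",
                    i, prop.name, prop.totalGlobalMem / (1024 * 1024),
                    prop.major, prop.minor, prop.pciBusID);
    }
    int active = -1;
    cudaGetDevice(&active);  // the device this process is currently bound to
    std::printf("Using device %d\n", active);
    return 0;
}

If the science app printed the pciBusID the way SETI does, matching a failure to a physical card would be trivial.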
I've found a workaround, which of course is only for those with nothing else in life to do: restart BOINC Manager. The task that has not yet been terminated will then restart from its checkpoint and successfully get past the point of the app crash.
Just got back from work and now 3 tasks have terminated on one host, while they are hanging on the other - waiting to be killed by 'time limit exceeded'. Restarting BOINC Manager restarted the tasks and saved them from a computation error.
It's really strange that there is no app developer here to aid in the root cause analysis...
The same system is rock stable in all the other projects I participate in, so without input from a developer I guess the current state of the project is not for me.
I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan".
____________
Team Belgium
mikey:
I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan".
It might help if you upgraded your version of BOINC from 6.12.34 to a more current one. There have been a TON of changes, and your problem could be one of the things fixed. I have two 560 Tis, which is different from yours but should be similar enough, and mine run just fine. Another difference is that I am on Windows, but other Linux folks don't have the same problems, or one would expect them to be here too.
I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan".
It might help if you upgraded your version of BOINC from 6.12.34 to a more current one. There have been a TON of changes, and your problem could be one of the things fixed. I have two 560 Tis, which is different from yours but should be similar enough, and mine run just fine. Another difference is that I am on Windows, but other Linux folks don't have the same problems, or one would expect them to be here too.
I can't at the moment, as the recent 7.0.28 version for Linux is compiled against a higher libc version than my current openSUSE 12.1 system provides, so 6.12.34 is the highest version I can run right now, and I'm really not up to compiling BOINC from source.
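For reference, a quick way to check which glibc a system actually provides (just a tiny sketch - running ldd --version from a shell reports the same thing):

#include <cstdio>
#include <gnu/libc-version.h>

int main() {
    // Prints the version of the glibc this program is running against.
    std::printf("glibc %s\n", gnu_get_libc_version());
    return 0;
}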
____________
Team Belgium
mikey:
I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan".
It might help if you upgraded your version of BOINC from 6.12.34 to a more current one. There have been a TON of changes, and your problem could be one of the things fixed. I have two 560 Tis, which is different from yours but should be similar enough, and mine run just fine. Another difference is that I am on Windows, but other Linux folks don't have the same problems, or one would expect them to be here too.
I can't at the moment, as the recent 7.0.28 version for Linux is compiled against a higher libc version than my current openSUSE 12.1 system provides, so 6.12.34 is the highest version I can run right now, and I'm really not up to compiling BOINC from source.
That is a good reason NOT to update then!
I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan".
My problem was related to heat output. I had reduced the fan speed, so my GPU got hotter while crunching. After restoring the defaults, short WUs complete with no issues. However, long WUs still error out, so I've disabled them for now.
____________
Team Belgium
I have a similar problem here on a GTX 560, so I can't crunch for GPUGRID anymore until this is resolved. All my WUs error out with failures in swanlibnv2.cpp, some with "energies have become nan".
I had this problem on my 560 Ti (384 cores) too, until I downclocked the GPU memory and increased the GPU voltage to 1.025V. Since then it has run without errors.
____________
DSKAG Austria Research Team: http://www.research.dskag.at