Lots of errors

Message boards : Number crunching : Lots of errors

Author	Message
nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41152 - Posted: 27 May 2015 | 0:00:10 UTC
	I'm starting to get a high number of short tasks that error out. Can someone explain why this is happening and how I can fix it? I have changed no settings. Here is the log from one of the tasks. WinXP SP3, dual 750Ti. http://www.gpugrid.net/result.php?resultid=14202446

Bedrich Hajek Joined: 28 Mar 09 Posts: 467 Credit: 8,393,847,716 RAC: 10,046,336	Message 41154 - Posted: 27 May 2015 | 0:59:50 UTC - in response to Message 41152.
	I am getting errors too, but mine are with GERARD_EQUI WU's. Three had errors, two finished OK. It seems to be a bad batch.
	https://www.gpugrid.net/result.php?resultid=14210451
	895456x4-GERARD_EQUI_26Apr_CXCL-0-1-RND0321_4
	Workunit: 10949024
	Created: 26 May 2015 | 23:34:38 UTC
	Sent: 26 May 2015 | 23:34:54 UTC
	Received: 26 May 2015 | 23:50:42 UTC
	Server state: Over
	Outcome: Computation error
	Client state: Compute error
	Exit status: -97 (0xffffffffffffff9f) Unknown error number
	Computer ID: 30790
	Report deadline: 31 May 2015 | 23:34:54 UTC
	Run time: 87.09
	CPU time: 77.31
	Validate state: Invalid
	Credit: 0.00
	Application version: Short runs (2-3 hours on fastest card) v8.47 (cuda65)
	Stderr output:
	<core_client_version>7.4.42</core_client_version>
	<![CDATA[
	<message>
	(unknown error) - exit code -97 (0xffffff9f)
	</message>
	<stderr_txt>
	# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
	# SWAN Device 1 :
	# Name : GeForce GTX 690
	# ECC : Disabled
	# Global mem : 2047MB
	# Capability : 3.0
	# PCI ID : 0000:05:00.0
	# Device clock : 1019MHz
	# Memory clock : 3004MHz
	# Memory width : 256bit
	# Driver version : r343_98 : 34411
	# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
	# SWAN Device 1 :
	# Name : GeForce GTX 690
	# ECC : Disabled
	# Global mem : 2047MB
	# Capability : 3.0
	# PCI ID : 0000:05:00.0
	# Device clock : 1019MHz
	# Memory clock : 3004MHz
	# Memory width : 256bit
	# Driver version : r343_98 : 34411
	# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
	# SWAN Device 1 :
	# Name : GeForce GTX 690
	# ECC : Disabled
	# Global mem : 2047MB
	# Capability : 3.0
	# PCI ID : 0000:05:00.0
	# Device clock : 1019MHz
	# Memory clock : 3004MHz
	# Memory width : 256bit
	# Driver version : r343_98 : 34411
	# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
	# SWAN Device 0 :
	# Name : GeForce GTX 690
	# ECC : Disabled
	# Global mem : 2047MB
	# Capability : 3.0
	# PCI ID : 0000:04:00.0
	# Device clock : 1019MHz
	# Memory clock : 3004MHz
	# Memory width : 256bit
	# Driver version : r343_98 : 34411
	# GPU 0 : 63C
	# GPU 1 : 73C
	# The simulation has become unstable. Terminating to avoid lock-up (1)
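	Editor's sketch: when comparing several failed results like the one above, it can help to pull the two most telling items out of each stderr text, namely whether the "simulation has become unstable" line is present and what the per-GPU temperature lines say. The snippet below is only an illustration under stated assumptions: the function name summarize_stderr is hypothetical, and the line formats are assumed to match the log quoted in this post.

```python
import re

# Marker text copied from the stderr output quoted in this thread.
UNSTABLE_MARKER = "The simulation has become unstable"

# Temperature lines are assumed to look like "# GPU 0 : 63C".
TEMP_PATTERN = re.compile(r"# GPU (\d+) : (\d+)C")


def summarize_stderr(stderr_text):
    """Return (unstable, temps) for one task's stderr text.

    unstable is True when the instability marker is present;
    temps maps GPU index to the last temperature reported for it.
    """
    unstable = UNSTABLE_MARKER in stderr_text
    temps = {int(gpu): int(temp) for gpu, temp in TEMP_PATTERN.findall(stderr_text)}
    return unstable, temps


sample = (
    "# Driver version : r343_98 : 34411 # GPU 0 : 63C # GPU 1 : 73C "
    "# The simulation has become unstable. Terminating to avoid lock-up (1)"
)
print(summarize_stderr(sample))
```

	The header lines of the form "# GPU [GeForce GTX 690] ..." are not matched by the temperature pattern, because it requires a digit directly after "GPU ".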

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41155 - Posted: 27 May 2015 | 1:07:33 UTC Last modified: 27 May 2015 | 1:10:10 UTC
	I'm getting some nasty errors too, with the GERARD_EQUI_26Apr_CXCL tasks. They're causing major TDRs, which in turn then make the computer have hardware acceleration problems in other tasks (like web browsing, or gaming), and also cause driver problems where the clocks never go back to 3d-mode clocks. Admins: Please look into which batches need to be revoked, to prevent these problems. It's a major headache, for me at least. 1154144x3-GERARD_EQUI_26Apr_CXCL-0-1-RND9216_7 http://www.gpugrid.net/result.php?resultid=14210052 895456x5-GERARD_EQUI_26Apr_CXCL-0-1-RND9089_5 http://www.gpugrid.net/result.php?resultid=14210507

nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41156 - Posted: 27 May 2015 | 2:06:03 UTC
	All my tasks are now erroring out. Suspending this project for now until this issue is resolved.

Eric Joined: 12 Apr 15 Posts: 1 Credit: 49,381,475 RAC: 0	Message 41157 - Posted: 27 May 2015 | 4:52:30 UTC
	I've actually been having issues with the graphics drivers themselves crashing and Windows having to recover.

Stoneageman Joined: 25 May 09 Posts: 224 Credit: 34,057,224,498 RAC: 26	Message 41158 - Posted: 27 May 2015 | 5:46:47 UTC
	Same here. Now have five GERARD_EQUI_26Apr_CXCL tasks crashed.

Gerard Joined: 26 Mar 14 Posts: 101 Credit: 0 RAC: 0	Message 41160 - Posted: 27 May 2015 | 9:05:55 UTC
	Could you please post your errors in this thread? I will cancel the batch if they persist. Thanks for your patience...

tito Joined: 21 May 09 Posts: 16 Credit: 1,057,958,678 RAC: 0	Message 41161 - Posted: 27 May 2015 | 10:54:28 UTC
	https://www.gpugrid.net/result.php?resultid=14210324 Short WU errored after 80 sec on a 750Ti.

nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41162 - Posted: 27 May 2015 | 11:10:02 UTC Last modified: 27 May 2015 | 11:10:52 UTC
	[quote]Could you please post your errors in this thread? I will cancel the batch if they persist. Thanks for your patience...[/quote]
	https://www.gpugrid.net/result.php?resultid=14210504

Gerard Joined: 26 Mar 14 Posts: 101 Credit: 0 RAC: 0	Message 41164 - Posted: 27 May 2015 | 12:50:55 UTC Last modified: 27 May 2015 | 12:52:26 UTC
	We detected an unexpected parameterization error in some of the simulations and we have just cancelled them. Sorry for any inconvenience caused, and thank you for reporting it to us! If you find any other errors, please do not hesitate to tell us (hopefully this particular issue is already resolved).

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41166 - Posted: 27 May 2015 | 12:56:58 UTC
	Excellent. Thank you!!

nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41168 - Posted: 27 May 2015 | 15:10:08 UTC
	All short run tasks still failing here. Links to last 4 https://www.gpugrid.net/result.php?resultid=14213349 https://www.gpugrid.net/result.php?resultid=14211957 https://www.gpugrid.net/result.php?resultid=14211914 https://www.gpugrid.net/result.php?resultid=14211712

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41169 - Posted: 27 May 2015 | 16:07:07 UTC - in response to Message 41168. Last modified: 27 May 2015 | 16:08:05 UTC
	nanoprobe: What is the exact make/model of your GPU? Do the tasks still fail when the Boost clock is set to the reference clock? My hunch is that your GPU is overclocked too much, either by the factory or by you. "The simulation has become unstable. Terminating to avoid lock-up" ... generally means that you are overclocking too much, or have a hardware problem... from my experience.

[CSF] Thomas H.V. DUPONT Joined: 20 Jul 14 Posts: 732 Credit: 100,630,366 RAC: 0	Message 41170 - Posted: 27 May 2015 | 17:14:22 UTC - in response to Message 41166.
	[quote]Excellent. Thank you!![/quote]
	+1 :)
	____________
	[CSF] Thomas H.V. Dupont Founder of the team CRUNCHERS SANS FRONTIERES 2.0 www.crunchersansfrontieres

nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41171 - Posted: 27 May 2015 | 17:37:37 UTC - in response to Message 41169. Last modified: 27 May 2015 | 17:39:06 UTC
	[quote]nanoprobe: What is the exact make/model of your GPU? Do the tasks still fail when the Boost clock is set to the reference clock? My hunch is that your GPU is overclocked too much, either by the factory or by you. "The simulation has become unstable. Terminating to avoid lock-up" ... generally means that you are overclocking too much, or have a hardware problem... from my experience.[/quote]
	Cards are PNY 750Ti. No factory O/C. No six-pin PCI-E power plugs. 60 Watt load @99%. They've been running stock out of the box since I bought them, I've been running the short tasks on these cards since I got them, and I have never had the failure rate I've been experiencing lately. If it were one card producing all/most of the errors then I would suspect the card, but the tasks are failing on both cards.

zdnko Joined: 17 Jan 09 Posts: 2 Credit: 19,488,157 RAC: 0	Message 41172 - Posted: 27 May 2015 | 17:52:14 UTC
	1232906x8-GERARD_EQUI_26Apr_CXCL-0-1-RND1418_4 causes a lot of GPU driver crashes. Stopped!

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41173 - Posted: 27 May 2015 | 18:13:46 UTC - in response to Message 41171. Last modified: 27 May 2015 | 18:21:17 UTC
	nanoprobe: Can you supply the exact model of the GPU, to confirm that it's not factory-overclocked? Alternatively, could you use GPU-Z to confirm that the GPU Clock and Default Clock say 1020 MHz (which is the stock speed of a GTX 750 Ti, per http://en.wikipedia.org/wiki/GeForce_700_series)? If it's anything above 1020, then it is in fact overclocked, and I recommend using EVGA Precision X to downclock it back to the reference 1020 MHz, to see if it helps. I'm getting frustrated trying to help by offering advice that gets ignored.
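	Editor's sketch: the check described in this post is a simple comparison against the 1020 MHz reference clock it cites. The snippet below only illustrates that arithmetic. The clock value has to be read off GPU-Z by hand; the function names and the 1085 MHz sample reading are hypothetical and do not come from any card in this thread.

```python
# Reference clock for a GTX 750 Ti as cited in the post (assumption: this is
# the value to compare against; other card models need their own figure).
REFERENCE_CLOCK_MHZ = 1020


def is_overclocked(gpu_clock_mhz, reference_mhz=REFERENCE_CLOCK_MHZ):
    """True when the reported clock is above the reference clock."""
    return gpu_clock_mhz > reference_mhz


def offset_to_reference(gpu_clock_mhz, reference_mhz=REFERENCE_CLOCK_MHZ):
    """MHz offset that brings the clock back to reference (0 or negative)."""
    return min(0, reference_mhz - gpu_clock_mhz)


reading = 1085  # hypothetical value typed in from GPU-Z
print(is_overclocked(reading), offset_to_reference(reading))
```

	A card already at or below the reference clock gets an offset of 0, so the sketch never suggests raising clocks.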

nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41190 - Posted: 29 May 2015 | 1:00:26 UTC - in response to Message 41173.
	[quote]nanoprobe: I'm getting frustrated trying to help by offering advice that gets ignored.[/quote]
	WOW! Let me offer you some advice. If it doesn't concern life, death or health, then it surely isn't worth getting frustrated over. FWIW the issue seems to have cleared up. The faulty WUs have been taken care of. Thanks for your help.

John C MacAlister Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0	Message 41191 - Posted: 29 May 2015 | 3:05:39 UTC
	I have had over 20 WUs fail... on my GTX 660 Ti devices. I will stop getting tasks and now go to bed....

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41192 - Posted: 29 May 2015 | 3:49:35 UTC - in response to Message 41190. Last modified: 29 May 2015 | 3:52:39 UTC
	[quote][quote]nanoprobe: I'm getting frustrated trying to help by offering advice that gets ignored.[/quote] WOW! Let me offer you some advice. If it doesn't concern life, death or health, then it surely isn't worth getting frustrated over. FWIW the issue seems to have cleared up. The faulty WUs have been taken care of. Thanks for your help.[/quote]
	There were some faulty WUs, but they have nothing to do with tasks erroring out with "Simulation has become unstable." messages and no other error messages. Errors like yours are usually a result of overclocking too much. Please keep my advice (lower clocks to reference clocks) in mind the next time you try to troubleshoot those errors. Good luck, Jacob

nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41194 - Posted: 29 May 2015 | 9:54:12 UTC
	[quote]There were some faulty WUs, but they have nothing to do with tasks erroring out with "Simulation has become unstable." messages and no other error messages.[/quote]
	There is no way you could know this for sure.
	[quote]Errors like yours are usually a result of overclocking too much.[/quote]
	As I stated before, these cards are not overclocked. The problem came and left without me changing anything on my setup. Therefore my conclusion is that the WUs were the problem. Moving on.

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41197 - Posted: 29 May 2015 | 12:54:28 UTC - in response to Message 41194. Last modified: 29 May 2015 | 12:56:30 UTC
	The problem is still ongoing, for you. See: https://www.gpugrid.net/result.php?resultid=14213909 # The simulation has become unstable. Terminating to avoid lock-up (1) I know you said there is no factory overclock, but I still would love to know what GPU-Z says for your "GPU Clock" and "Default Clock". If you refuse to share, so be it. And if it's anything above 1020, then it is in fact overclocked.

Killersocke Joined: 18 Oct 13 Posts: 53 Credit: 406,647,419 RAC: 0	Message 41198 - Posted: 29 May 2015 | 14:20:26 UTC
	Jacob: We are all very well aware of the idea of overclocking. But, honestly: do you really think I will, and need to, shut down my system to get the GPU stuff running? Really? Let's put it this way: if the GPU stuff is not running properly on a majority of systems, and several users have the same experience, and we (the users) do this for free, should we shut down an overall properly running system for that? I really don't think so. In my opinion, if the GPU stuff does not work, make it work instead of harping on the overclocking stuff. If the GPU stuff will not work properly, I simply switch to something else. You know: MY PC, MY time, MY decision. Or to use a proverb: not my circus, not my monkeys. Kind regards,

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41200 - Posted: 29 May 2015 | 15:31:45 UTC - in response to Message 41198. Last modified: 29 May 2015 | 15:33:21 UTC
	I have definitely had certain GPU things, such as games and GPUGrid tasks, crash or error ("Simulation has become unstable"), as a direct result of a factory-overclock that was too aggressive. If the GPU is overclocked at all, and you are trying to resolve any GPU problem, you should see if lowering the clocks resolves the problem. Yes, honestly. I have 3 factory-overclocked GPUs. GPU 1 was factory-overclocked way too aggressively, and I've had to dial it back quite a bit to be completely stable in my games and with GPUGrid. GPU 2 was factory-overclocked too little, and I could push it even farther before noticing problems. And GPU 3 was factory-overclocked just right. Forgive me for trying to help.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 41201 - Posted: 29 May 2015 | 15:35:07 UTC - in response to Message 41198. Last modified: 29 May 2015 | 15:36:18 UTC
	Nanoprobe, it may be the case that these Noelia WU's tax the card more than other WU's, and it does appear (from what I've seen) to affect the smaller/older cards more; the same WU's that fail on your GTX750Ti complete on other systems, but some also fail on the older/smaller cards. While GPU temps look OK, the GDDR5 memory temps might be quite high. I would suggest reducing the GDDR5 clocks and the GPU clocks by 10% to see if that prevents the errors recurring, or just crunch the long WU's, which are a different type (and are now fixed). I recently used XP with a couple of GPU's and found the drivers to be not great. I would also suggest a regular cold-start, just in case of runaway errors, which appears to have been the case back on the 20th. Good luck,
	____________
	FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help
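	Editor's sketch: the 10% reduction suggested in this post is plain arithmetic, shown here for two sample clocks. The 1020 MHz figure is the GTX 750 Ti reference clock mentioned earlier in the thread; the 3004 MHz figure is the memory clock reported in the GTX 690 log earlier in the thread and is used purely as sample input. The function name reduce_clock is hypothetical.

```python
def reduce_clock(clock_mhz, percent=10):
    """Return the clock lowered by `percent`, rounded to the nearest MHz."""
    return round(clock_mhz * (100 - percent) / 100)


# Sample inputs taken from figures quoted elsewhere in this thread.
print(reduce_clock(1020))  # GPU clock lowered by 10%
print(reduce_clock(3004))  # memory clock lowered by 10%
```

	The resulting values would then be entered by hand in whatever clock-adjustment tool is in use; the sketch does not change any hardware setting itself.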

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851	Message 41202 - Posted: 29 May 2015 | 16:03:47 UTC - in response to Message 41198. Last modified: 29 May 2015 | 16:16:14 UTC
	[quote]Jacob: We are all very well aware of the idea of overclocking. But, honestly: do you really think I will, and need to, shut down my system to get the GPU stuff running? Really?[/quote]
	Sometimes this is the only (and the fastest) way to fix a malfunctioning system. I had some GPUGrid app crashes in the past on one of my dual-GPU systems which caused the other GPU to fail tasks too. In my opinion it's good practice to restart a Windows-based system once a week (by a scheduled task), regardless of whether it's running error-free, to maintain its stability (especially when running GPU and CPU tasks simultaneously).
	[quote]Let's put it this way: if the GPU stuff is not running properly on a majority of systems, and several users have the same experience, and we (the users) do this for free, should we shut down an overall properly running system for that?[/quote]
	This is more of a rhetorical question, but, as you probably know, there's no warranty that any software (free or commercial) will work on every existing piece of hardware. Besides, your question takes a set of other software as the reference that qualifies a system as properly running, but it follows from the "no warranty" point that no such set of software exists. To put it another way: I wouldn't call a system properly running if GPUGrid tasks produce "The simulation has become unstable. Terminating to avoid lock-up" messages on that particular system only, while these tasks run fine on the next host they are assigned to.
	[quote]I really don't think so. In my opinion, if the GPU stuff does not work, make it work instead of harping on the overclocking stuff.[/quote]
	If someone asks for help, it's because they can't figure out the reason for the error, so it might be useful to try things which don't make sense at first sight. I have a GTX780Ti on which I had to reduce the GDDR5 clock to 2900MHz (from 3500MHz) to make it work with GPUGrid (it was brand new). GPUs (and other components) age, so they might not perform as well as before, and different tasks tax the GPU differently. You can't step in the same river twice.
	[quote]If the GPU stuff will not work properly, I simply switch to something else. You know: MY PC, MY time, MY decision. Or to use a proverb: not my circus, not my monkeys.[/quote]
	If all else fails, or you're tired of trying different workarounds, you can do that. Still, fixing the errors on a given system is not the project's responsibility.

nanoprobe Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0	Message 41203 - Posted: 29 May 2015 | 17:23:50 UTC - in response to Message 41197.
	[quote]The problem is still ongoing, for you. See: https://www.gpugrid.net/result.php?resultid=14213909 # The simulation has become unstable. Terminating to avoid lock-up (1) I know you said there is no factory overclock, but I still would love to know what GPU-Z says for your "GPU Clock" and "Default Clock". If you refuse to share, so be it. And if it's anything above 1020, then it is in fact overclocked.[/quote]
	I think 1 error out of 20 tasks is about the same as I was experiencing before the problem WUs arrived. And just for the record, the task you linked to completed and validated. The last failed one was more than 2 days ago. https://www.gpugrid.net/result.php?resultid=14213349

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 41204 - Posted: 29 May 2015 | 17:29:54 UTC Last modified: 29 May 2015 | 17:30:24 UTC
	Please keep my advice (use lower clocks) in mind, the next time you try to troubleshoot any GPU errors.

[CSF] Thomas H.V. DUPONT Joined: 20 Jul 14 Posts: 732 Credit: 100,630,366 RAC: 0	Message 41211 - Posted: 30 May 2015 | 6:33:42 UTC - in response to Message 41204.
	[quote]Please keep my advice (use lower clocks) in mind, the next time you try to troubleshoot any GPU errors.[/quote]
	Jacob, please keep this point in mind: your tips are always appreciated, and fortunately we have you! :) I will also make adjustments on my GTX 760 via Precision X, because I also get errors with LONG RUNS (Gerard). I will publish the settings and the results in this thread. Of course, I am also thinking of Retvari Zoltan* and skgiven, whose advice is also very valuable :) Thanks guys!
	____________
	[CSF] Thomas H.V. Dupont Founder of the team CRUNCHERS SANS FRONTIERES 2.0 www.crunchersansfrontieres
