Computation error failures and screen savers


Author	Message
wiyosaya Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0	Message 24877 - Posted: 10 May 2012 | 15:14:42 UTC Last modified: 10 May 2012 | 15:15:14 UTC
	It looks like there is more than one thread on computation error failures, so I thought I would start another thread as I am reasonably sure the solution lies in turning off screen savers. Rather than duplicate what I posted elsewhere, please see this post for a possible solution. If you try what I suggest, please post your results to this thread. Thanks.

shdbcamping Joined: 2 May 12 Posts: 22 Credit: 145,756,579 RAC: 0	Message 25007 - Posted: 12 May 2012 | 17:59:05 UTC - in response to Message 24877.
	I have screen savers set to (none) on both my rigs and I get failures anyway. If you look at the work units, you'll usually see others failing them as well. I believe that it's just the nature of the science. Each WU is released a number of times to make sure that a 'rogue' HW failure does not get the WU dismissed without being sure that it is bad. Most fail very quickly, so it's not a big deal to me.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 25008 - Posted: 12 May 2012 | 18:24:33 UTC - in response to Message 25007.
	Might be useful to read some of the FAQs and in particular use fan regulating software to control your temperatures. Also, free up a CPU core/thread from Boinc Manager... ____________ FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25039 - Posted: 13 May 2012 | 17:00:00 UTC
	Hi. First post to the threads, so hello to everyone on GPUGrid.
	Reading the first post regarding ACEMD, I noticed that although my GPU seems to be, or is supposed to be, a good one for this project, ERROR with ACEMD is cropping up. I am a very new GPU shrubber and have had this one in place for a shade over a week, so please forgive me if it seems I need a little hand holding at present.
	This is the overview of WUs downloaded so far where there is information: State: All (24) | In progress (2) | Valid (5) | Error (17). ACEMD2: GPU molecular dynamics (24).
	Shortly after installing the GPU and starting the GPUGrid project I had to replace my MB. Grrr. However, at the time of replacing there were only 2 WUs on my machine; Valid was standing at 5 and Error was standing at 2. Today, currently in progress: 2, 1 running and 1 waiting to start. Earlier today, while finalising the install of my new MB, the ACEMD error turned up while I only had 2 WUs on the machine. Following this, BOINC downloaded a further 11 WUs, which all produced errors very rapidly.
	Think my questions are:
	1. How can I ensure a stable platform to run GPUGrid tasks?
	2. Would the ACEMD error normally cause BOINC to download a long series of WUs which would fail after the error shows up?
	3. Is there any other information you need to help me resolve this?
	Personally I don't like running projects where I get a high error rate, and like the error rate to be 1:75 or less where possible. Receiving high error rates puts me off projects if they can't be resolved, because I see it as wasted resources.
	____________ Don't put limits on your imagination, there is no telling.........?
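For scale, the counts quoted in this post can be turned into an error rate and set against the 1:75 target. This is only a rough sketch: it assumes the two in-progress tasks are left out of the denominator, and the function name is made up for illustration.

```python
# Task counts as quoted in the post: In progress (2), Valid (5), Error (17).
counts = {"in_progress": 2, "valid": 5, "error": 17}


def error_rate(valid: int, error: int) -> float:
    """Fraction of finished tasks that ended in an error.

    Assumption: tasks still in progress are not counted either way.
    """
    finished = valid + error
    if finished == 0:
        raise ValueError("no finished tasks yet")
    return error / finished


rate = error_rate(counts["valid"], counts["error"])
target = 1 / 75  # the poster's preferred ceiling: 1 error per 75 tasks

print(f"observed error rate: {rate:.1%}")
print(f"target error rate:   {target:.1%}")
print("within target" if rate <= target else "above target")
```

With 17 errors out of 22 finished tasks, the observed rate is far above the target, which is consistent with something systematic (setup, driver, clocks) rather than the odd bad WU.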

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 25046 - Posted: 13 May 2012 | 19:52:23 UTC - in response to Message 25039.
	"Think my questions are: 1. How can I ensure a stable platform to run GPUGrid tasks? 2. Would the ACEMD error normally cause BOINC to download a long series of WUs which would fail after the error shows up? 3. Is there any other information you need to help me resolve this? Personally I don't like running projects where I get a high error rate, and like the error rate to be 1:75 or less where possible. Receiving high error rates puts me off projects if they can't be resolved, because I see it as wasted resources."
	I'm not keen on failing tasks either! You can see my tasks here. The only failures are Beta apps that I ran using a driver that was incapable of running the tasks (which is what I was testing). Failures are due to bad setups. It's up to the cruncher to configure their system and BOINC correctly to participate successfully here. The FAQs have lots of good tips on how to do this. The key is not overusing the CPU, keeping the GPUs cool and keeping the system stable.

TheFiend Joined: 26 Aug 11 Posts: 99 Credit: 2,500,112,138 RAC: 0	Message 25047 - Posted: 13 May 2012 | 19:56:08 UTC
	All those failed tasks had errored out on other crunchers... he could have just been unlucky to get some bad WUs.

5pot Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0	Message 25048 - Posted: 13 May 2012 | 20:09:15 UTC
	I thought "Energies have become nan" can SOMETIMES be caused by an overactive CPU. Say he's crunching 6 CPU tasks + 1 GPU task. If this is the case, tell BOINC to use one less core (83.34%?), I think.
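The 83.34 figure can be reproduced with a little arithmetic: 5 of 6 cores is 83.33...%, rounded up to two decimals so that the result does not fall just short of 5 cores. A minimal sketch follows. Both function names are made up for illustration, and the idea that the client truncates the computed core count to a whole number is an assumption here, not something stated in this thread.

```python
def cpu_percent_for(cores_to_use: int, total_cores: int) -> float:
    """Smallest percentage, to two decimals, that is not below
    cores_to_use / total_cores."""
    if not 0 < cores_to_use <= total_cores:
        raise ValueError("cores_to_use must be between 1 and total_cores")
    # Ceiling division in integer hundredths of a percent avoids float surprises.
    hundredths = -(-10000 * cores_to_use // total_cores)
    return hundredths / 100


def cores_after_truncation(percent: float, total_cores: int) -> int:
    """ASSUMPTION: the client truncates to whole cores, with a minimum of 1."""
    return max(1, int(total_cores * percent / 100))


pct = cpu_percent_for(5, 6)
print(pct)                                # percentage to enter for 5 of 6 cores
print(cores_after_truncation(pct, 6))     # cores left for CPU tasks
print(cores_after_truncation(83.33, 6))   # rounding down instead loses a core
```

Under that truncation assumption, 83.33 would give only 4 cores on a 6-core CPU, which would explain why 83.34 is the value usually quoted.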

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 25049 - Posted: 13 May 2012 | 20:10:30 UTC - in response to Message 25047. Last modified: 13 May 2012 | 20:12:07 UTC
	5pot on ;) Most, but not all, failures are for MJHARVEY_MJHXA tasks, with exit code 98 (0x62) and ERROR: # Energies have become nan. My guess is the GPU is too hot (not using fan controlling software), or the system is trying to use all 6 CPU cores for other projects. So, free up a CPU core and use fan controlling software such as EVGA Precision or MSI Afterburner. This is common sense for most GPU projects. Unfortunately the number of people using 295.x/296 drivers is still very high, despite hundreds of warnings across the BOINC community, so going by other people's failures is difficult!

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25051 - Posted: 13 May 2012 | 21:56:05 UTC
	Thanks everyone. Read your posts and see that there are concerns over the driver version, number of cores being used, GPU temp and overclocking. Yes, this system is overclocked, to within very reasonable margins as you can tell from the temperatures on the cores and GPU. The PC has been OC'ed and running for almost 18 months now with very, very few errors on any of the projects I run. Been running PrimeGrid regularly on this PC and have a superbly low error rate on a project which is highly sensitive to problems on CPU cores. My worst sub project currently stands at an average of 375 WUs between errors. The best is at 3897. RNAWorld is another project I run. My normal error rate is about 1 in every 300 WUs. I know there are many who run their systems much hotter. This is not an option for me as this is a business PC that I use for my work. Here is the system information. You may be able to see something that I am not able to see at this time which could be causing WUs to error. So if you can work with me to solve this issue I would really appreciate it :) Running 5 cores of the 6 available.
Core temps steady 39C, GPU steady 62C

GPU:
Name: NVIDIA GeForce GTX 560 Ti
PNP Device ID: PCI\VEN_10DE&DEV_1087&SUBSYS_00000000&REV_A1\4&26B22A24&0&0010
Adapter Type: GeForce GTX 560 Ti, NVIDIA compatible
Adapter Description: NVIDIA GeForce GTX 560 Ti
Adapter RAM: 1.25 GB (1,342,177,280 bytes)
Installed Drivers: nv4_disp.dll
Driver Version: 6.14.12.8566
INF File: oem42.inf (Section005 section)
Color Planes: 1
Color Table Entries: 4294967296
Resolution: 1680 x 1050 x 60 hertz
Bits/Pixel: 32
Memory Address: 0xFD000000-0xFE0FFFFF
Memory Address: 0xF0000000-0xF9FFFFFF
Memory Address: 0xF8000000-0xF9FFFFFF
I/O Port: 0x0000E000-0x0000EFFF
IRQ Channel: IRQ 24
I/O Port: 0x000003B0-0x000003DF
I/O Port: 0x000003C0-0x000003DF
Memory Address: 0xA0000-0xBFFFF
Driver: c:\windows\system32\drivers\nv4_mini.sys (6.14.12.8566, 12.20 MB (12,792,576 bytes), 9/22/2006 3:39 AM)

System Summary:
OS Name: Microsoft Windows XP Professional
Version: 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer: Microsoft Corporation
System Name: FASTERMACHINE
System Manufacturer: MSI
System Model: MS-7693
System Type: X86-based PC
Processor: x86 Family 16 Model 10 Stepping 0 AuthenticAMD ~3712 Mhz (Standard setting 3333MhZ)
BIOS Version/Date: American Megatrends Inc. V1.4, 12/15/2011
SMBIOS Version: 2.7
Windows Directory: C:\WINDOWS
System Directory: C:\WINDOWS\system32
Boot Device: \Device\HarddiskVolume1
Locale: United States
Hardware Abstraction Layer: Version = "5.1.2600.5512 (xpsp.080413-2111)"
User Name: FASTERMACHINE\Deleted my name :)
Time Zone: GMT Daylight Time
Total Physical Memory: 4,096.00 MB
Available Physical Memory: 2.28 GB
Total Virtual Memory: 2.00 GB
Available Virtual Memory: 1.96 GB
Page File Space: 4.83 GB
Page File: C:\pagefile.sys

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 25065 - Posted: 14 May 2012 | 10:13:34 UTC - in response to Message 25051. Last modified: 14 May 2012 | 10:17:21 UTC
	Your system setup looks fine! Possibilities that spring to mind are:
	- The CPU OC is causing the issue. Reduce to stock.
	- Some system change was made that has caused continuous failures. Restart system. Look into what was changed (if anything).
	- There is a corruption/problem with the driver that comes to the fore when crunching GPUGrid tasks. Reinstall or upgrade the driver (clean/fresh install).
	- The tasks are failing due to overuse of the system. CPU is maxed out, the disk is writing a lot/continuously, or at times all the system memory gets used (heavy office applications). Use fewer CPU's, write to disk (checkpoint) less.
	- The GPU is not stable when running these apps due to the frequency at given voltage. First try reducing the memory clocks (try 10% and then 20%). Then the core if need be. Finally try increasing the GPU Voltage very slightly, should all else fail.
	Reducing the CPU to stock and then the GDDR5 memory clock are the easiest places to start.
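The memory-clock advice above (try 10% and then 20% lower) is simple arithmetic. A minimal sketch, assuming a hypothetical stock memory clock of 2000 MHz; the real figure should be read from whatever tuning tool is in use before changing anything.

```python
def reduced_clock(stock_mhz: float, reduction_pct: float) -> float:
    """Clock in MHz after lowering the stock value by reduction_pct percent."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return stock_mhz * (100 - reduction_pct) / 100


# Hypothetical stock value for illustration only, not taken from the thread.
STOCK_MEMORY_MHZ = 2000.0

for pct in (10, 20):
    target = reduced_clock(STOCK_MEMORY_MHZ, pct)
    print(f"{pct}% lower: {target:.0f} MHz")
```

Testing one step at a time, with a task or two run at each setting, keeps it clear which change made the difference.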

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25069 - Posted: 14 May 2012 | 13:00:51 UTC - in response to Message 25065. Last modified: 14 May 2012 | 13:02:27 UTC
	"Your system setup looks fine! Possibilities that spring to mind are: The CPU OC is causing the issue. Reduce to stock. Some system change was made that has caused continuous failures. Restart system. Look into what was changed (if anything). There is a corruption/problem with the driver that comes to the fore when crunching GPUGrid tasks. Reinstall or upgrade the driver (clean/fresh install). The tasks are failing due to overuse of the system. CPU is maxed out, the disk is writing a lot/continuously, or at times all the system memory gets used (heavy office applications). Use fewer CPU's, write to disk (checkpoint) less. The GPU is not stable when running these apps due to the frequency at given voltage. First try reducing the memory clocks (try 10% and then 20%). Then the core if need be. Finally try increasing the GPU Voltage very slightly, should all else fail. Reducing the CPU to stock and then the GDDR5 memory clock are the easiest places to start."
	Thanks sjgiven. I'll start by doing one of the quick, easy things: tick each of the items you mention above off and try one WU at a time to see what happens. If it completes correctly then I'll get another one. If it errors out I'll try the next item on the list, until the system is stable and shrubbing properly. Checked build stability on a PrimeGrid GPU sub project overnight. It completed 61 tasks in succession with no errors. Made one of the suggested changes during this time as well. So here goes on a GPUGrid WU. Wish me luck :)

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25076 - Posted: 14 May 2012 | 18:09:26 UTC
	Oops, should have written skgiven. Apologies for the spelling mistake. Well, that one finished correctly so here goes for the next :)

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25081 - Posted: 14 May 2012 | 20:49:55 UTC Last modified: 14 May 2012 | 20:51:09 UTC
	Next WU in progress. Seems to be stable at present and is one of the MJHARVEY_ variations. Currently just over an hour into processing with 2.5 hours left to completion. Let's see what happens. http://www.gpugrid.net/result.php?resultid=5370445 Which variants of the MJHARVEY_ WUs have people had problems with so far? Is this WU one of those variants?

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 25082 - Posted: 14 May 2012 | 21:01:54 UTC - in response to Message 25081.
	That looks similar. Most of your failures were after a few seconds, so it sounds like you passed the first hurdle.

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25083 - Posted: 14 May 2012 | 21:34:26 UTC - in response to Message 25082.
	"That looks similar. Most of your failures were after a few seconds, so it sounds like you passed the first hurdle."
	:) Brilliant. Let's make this work for a very stable platform which will shrub any WU, and maybe set standard builds for the Phenom II CPUs. This PC is used for between 14 and 20 hours daily. The two ways it is used are below.
	1. Business use with high activity in Windows packages: Excel, Word etc. and Skype (those are what I use daily for several hours on end), normally 8-12 hours.
	2. Gaming. Have to admit in my spare time I am a gamer :)

Dagorath Joined: 16 Mar 11 Posts: 509 Credit: 179,005,236 RAC: 0	Message 25088 - Posted: 15 May 2012 | 0:44:52 UTC - in response to Message 25083.
	Word: only Superman can type fast enough to cause Word to bog down anything.
	Excel: might cause problems, but you would have to have a lot of big spreadsheets open, and they would have to be recalculating very often and contain some very complex formulae. Not a prime suspect.
	Skype: voice calls should be no problem; voice + video calls might be a problem.
	Games: games can definitely cause a problem if they use the GPU, and many do. You might see anything from sluggish game performance to game or GPUGrid task crashes. If so, look at the Client configuration section of the official BOINC wiki and create a cc_config.xml file in the BOINC data directory. The cc_config.xml file should contain the <exclusive_gpu_app> option described on the Client configuration page. If you already have a cc_config.xml file then add the <exclusive_gpu_app> option to it. Restart BOINC to make the cc_config.xml options take effect.
	You can designate as many game or other executables as you want with the <exclusive_gpu_app> option. Whenever a designated app runs, BOINC will suspend GPU tasks. If GPU tasks do not suspend then you have designated the wrong executable(s) or else there is a syntax error in your cc_config.xml.
	As for determining the names of the executables to be designated in cc_config.xml, look in Task Manager while the app/game is running; the executable should be named there.
	If you need help, please put the entire content of your cc_config.xml in a message here and someone will check it for you to see if it's syntactically correct.
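As a rough illustration of the file described above, here is a small sketch that builds a cc_config.xml with one <exclusive_gpu_app> entry per executable and then parses it back to catch syntax errors. The executable names are hypothetical placeholders (use the names shown in Task Manager), and `build_cc_config` is a helper made up for this example, not part of BOINC.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_cc_config(exclusive_gpu_apps):
    """Return cc_config.xml text with one <exclusive_gpu_app> element per
    executable, nested under <cc_config><options>."""
    lines = ["<cc_config>", "  <options>"]
    for exe in exclusive_gpu_apps:
        # escape() protects characters such as & that would break the XML.
        lines.append(f"    <exclusive_gpu_app>{escape(exe)}</exclusive_gpu_app>")
    lines += ["  </options>", "</cc_config>"]
    return "\n".join(lines) + "\n"


# Hypothetical executable names, for illustration only.
config_text = build_cc_config(["bf3.exe", "Skype.exe"])
print(config_text)

# Parsing the text back raises ParseError if it is not well-formed XML.
root = ET.fromstring(config_text)
designated = [el.text for el in root.iter("exclusive_gpu_app")]
print("designated executables:", designated)
```

The printed text would be saved as cc_config.xml in the BOINC data directory, followed by a BOINC restart as described in the post. The parse step only checks that the XML is well-formed; it cannot tell whether the executable names are the right ones.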

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25093 - Posted: 15 May 2012 | 7:41:06 UTC
	Thanks Dagorath. Have noticed Skype grumbling from time to time when the PC is in full use. Set up exclusive application for it and everything shut down as soon as Skype is opened. Had to find a better way of doing it, so was freeing up one CPU on work days. Now one CPU is free all the time. I'll tackle each of the games I play individually, as some of them are what I call light games in terms of the resources they need to run, e.g. Diablo II in single player mode. Been running that game for over 8 years and it used to run very happily on my old 400MHz box :)
	Overnight, a 50/50 success rate: 2 completed properly and 2 errored. Noticed that other people have had problems with them as well: http://www.gpugrid.net/workunit.php?wuid=3418351 http://www.gpugrid.net/workunit.php?wuid=3403967
	OK, next step in adjustments, as 50/50 is not good enough: 100MHz knocked off the GPU memory clock setting.

5pot Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0	Message 25096 - Posted: 15 May 2012 | 13:03:29 UTC
	You could just hit the suspend button when gaming or when using Skype, as long as you have the Leave Applications In Memory box ticked in BOINC preferences. Whenever I play BF3, I simply suspend BOINC and then hit resume when I'm done playing.

Dagorath Joined: 16 Mar 11 Posts: 509 Credit: 179,005,236 RAC: 0	Message 25102 - Posted: 16 May 2012 | 0:21:28 UTC - in response to Message 25093. Last modified: 16 May 2012 | 0:24:42 UTC
	I think you have the idea, you just need time to experiment. BOINC has lots of options so there are often several ways to skin any cat. 5pot's suggestion works too, for example.
	Freeing up a core just for GPUGrid tasks helps finish GPUGrid tasks faster. When I have several Climateprediction tasks running I find I need to free up a second core, or else I see performance issues and weird errors in other apps, not just in BOINC. Climateprediction tasks are very CPU intense. If you run other projects known to be CPU intense you might try freeing up a second core too.
	One thing I strongly suggest is that if you are OCing anything, GPU or CPU, then the first thing you ought to do is drop back to stock settings. Once you get things stable at stock settings then you can start bumping up the clocks.
	Regarding the tasks that are failing on other people's systems too... Don't assume that means there is something wrong with those tasks. There are many hosts on this project that are not configured correctly and they crash tasks regularly. You would think the owners would notice and do something but they don't. Perhaps it's because they just assume they got a bad task. I get plenty of tasks that have crashed on several other machines but they run fine on mine. There is no magic in my machine, so those "bad" tasks aren't bad as often as you might think.

The Knighty Ni Joined: 6 May 12 Posts: 8 Credit: 195,800 RAC: 0	Message 25103 - Posted: 16 May 2012 | 2:00:18 UTC - in response to Message 25102. Last modified: 16 May 2012 | 2:04:30 UTC
	"Regarding the tasks that are failing on other people's systems too... Don't assume that means there is something wrong with those tasks. There are many hosts on this project that are not configured correctly and they crash tasks regularly. You would think the owners would notice and do something but they don't. Perhaps it's because they just assume they got a bad task. I get plenty of tasks that have crashed on several other machines but they run fine on mine. There is no magic in my machine, so those "bad" tasks aren't bad as often as you might think."
	Very good point Dagorath. (Edited) Forgot to say 5 successive WUs have completed, which is really pleasing. Only going to let one in progress and one spare on the machine for the time being, until I am sure that I have a very low error rate. (edit end)
	I have always monitored my machines closely, even from the early days when it was just Seti@Home pre-BOINC, because I don't want errors cropping up. I have always felt this ultimately slows you down. Not only that, it's a complete waste of resources.
	One memorable WU I did on Climate Prediction in 2007 stood at 1625hrs when it downloaded. At the time I had a very slow 2 core machine and was flabbergasted that it was going to take almost 70 days to complete. The deadline was over 1 year. So did other stuff in between and only just got it finished by the deadline. What really upset me about that one is that although I received credit for it, it ended with a computation error. :(
	Since then I strive to ensure the machine is as stable as it can be, and regularly stress/torture test to ensure everything is working properly, especially when components have been replaced for some reason. Just before joining this project I torture tested the CPU with GIMPS, AKA Prime95, for 72 hours at its current OC setting. If there is anything wrong with the CPU or OC settings GIMPS will fail in a trice. It didn't murmur and stayed around the 42C temperature mark, passing all the tests thrown at it.
	The next torture test will include memory to ensure that is working fine, as I forgot to include it in the last test. When running the full-on stress tests including memory, the machine is very, very sluggish, so it is not ideal to carry them out if you need to use the machine for other things.
	If you are not familiar with GIMPS, here is a link to them. The software is free to use. Just remember to unload it if you switch your machines off at night, otherwise it will boot straight in and it will take a long time to boot up. :) They are like GPUGrid except they search for prime numbers: http://www.mersenne.org/freesoft/

Dagorath Joined: 16 Mar 11 Posts: 509 Credit: 179,005,236 RAC: 0	Message 25109 - Posted: 16 May 2012 | 16:35:20 UTC - in response to Message 25103.
	Thanks for the link. I'll give it a try.
