Message boards : Graphics cards (GPUs) : Computation error failures and screen savers

Wiyosaya
Joined: 22 Nov 09
Posts: 82
Credit: 64,725,953
RAC: 115,718
Message 24877 - Posted: 10 May 2012 | 15:14:42 UTC
Last modified: 10 May 2012 | 15:15:14 UTC

It looks like there is more than one thread on computation error failures, so I thought I would start another thread as I am reasonably sure the solution lies in turning off screen savers.

Rather than duplicate what I posted elsewhere, please see this post for a possible solution.

If you try what I suggest, please post your results to this thread.

Thanks.
____________

shdbcamping
Joined: 2 May 12
Posts: 22
Credit: 145,756,579
RAC: 0
Message 25007 - Posted: 12 May 2012 | 17:59:05 UTC - in response to Message 24877.

I have screen savers as (none) on both my rigs and I get failures anyway. If you look at the work units, you'll usually see others failing them as well. I believe that it's just the nature of the science. Each WU is just released a number of times to make sure that 'rogue' HW failures do not get it dismissed without being sure that it is bad. Most fail very quickly, so it's not a big deal to me.

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3417
Credit: 689,452,234
RAC: 1,566,783
Message 25008 - Posted: 12 May 2012 | 18:24:33 UTC - in response to Message 25007.

Might be useful to read some of the FAQs and in particular use fan regulating software to control your temperatures. Also, free up a CPU core/thread from Boinc Manager...
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

The Knighty Ni
Joined: 6 May 12
Posts: 8
Credit: 195,800
RAC: 0
Message 25039 - Posted: 13 May 2012 | 17:00:00 UTC

Hi

First post to the threads so Hello to everyone on GPUGrid

Reading the first post regarding ACEMD, I noticed that although my GPU seems to be or is supposed to be a good one for this project, ERROR with ACEMD is cropping up.

I am a very new GPU shrubber and have had this one in place for a shade over a week, so please forgive me if it seems I need a little hand holding at present.

This is the overview of WU's downloaded so far where there is information:

State: All (24) | In progress (2) | Valid (5) | Error (17)
ACEMD2: GPU molecular dynamics (24)

Shortly after installing the GPU and starting the GPUGrid project I had to replace my MB. Grrr. However, at the time of replacing there were only 2 WU on my machine; Valid was standing at 5 and Error was standing at 2. Today currently in progress: 2, 1 running and 1 waiting to start.

Earlier today while finalising the install of my new MB the ACEMD error turned up while I only had 2 WU's on the machine. Following this BOINC downloaded a further 11 WU's which all produced errors very rapidly.

Think my questions are:

1. How can I ensure a stable platform to run GPUGrid tasks?
2. Would the ACEMD error normally cause BOINC to download a long series of WU's which would fail after the error shows up?
3. Is there any other information you need to help me resolve this?

Personally I don't like running projects where I get a high error rate, and I like the error rate to be 1:75 or lower where possible. Receiving high error rates puts me off projects if they can't be resolved because I see it as wasted resources.
____________
Don't put limits on your imagination, there is no telling.........?

skgiven
Message 25046 - Posted: 13 May 2012 | 19:52:23 UTC - in response to Message 25039.


Think my questions are:

1. How can I ensure a stable platform to run GPUGrid tasks?
2. Would the ACEMD error normally cause BOINC to download a long series of WU's which would fail after the error shows up?
3. Is there any other information you need to help me resolve this?

Personally I don't like running projects where I get a high error rate, and I like the error rate to be 1:75 or lower where possible. Receiving high error rates puts me off projects if they can't be resolved because I see it as wasted resources.


I'm not keen on failing tasks either!

You can see my tasks here. The only failures are Beta apps that I ran using a driver that was incapable of running the tasks (which is what I was testing).

Failures are due to bad setups.
It's up to the cruncher to configure their system and BOINC correctly to participate successfully here. The FAQ's have lots of good tips on how to do this. The key is not overusing the CPU, keeping the GPUs cool and the system stable.


TheFiend
Joined: 26 Aug 11
Posts: 93
Credit: 658,218,104
RAC: 988,032
Message 25047 - Posted: 13 May 2012 | 19:56:08 UTC

All those failed tasks had errored out on other crunchers..... he could have just been unlucky to get some bad WU's

5pot
Joined: 8 Mar 12
Posts: 397
Credit: 1,014,523,718
RAC: 2,040,563
Message 25048 - Posted: 13 May 2012 | 20:09:15 UTC

Thought "energies have become nan" can SOMETIMES be caused by an overactive CPU. Say he's crunching 6 CPU tasks + 1 GPU task. If this is the case, tell BOINC to use one less core (83.34%?), I think.
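
For anyone wondering where a figure like 83.34 comes from, here is a rough Python sketch. It assumes (my assumption, not something confirmed in this thread) that the client multiplies the core count by the "use at most N% of the processors" preference and rounds down, so an exact 5/6 = 83.33% could land just under 5 cores. The function name min_cpu_percent is made up for illustration.

```python
def min_cpu_percent(total_cores: int, cores_for_boinc: int) -> float:
    """Smallest percentage (2 decimal places) that still yields the wanted
    core count, ASSUMING the client truncates total_cores * percent / 100."""
    if not 0 < cores_for_boinc <= total_cores:
        raise ValueError("cores_for_boinc must be between 1 and total_cores")
    # Work in hundredths of a percent with integers to avoid float rounding.
    # Ceiling division: -(-a // b) rounds a / b up to the next whole number.
    hundredths = -(-cores_for_boinc * 10000 // total_cores)
    return hundredths / 100


if __name__ == "__main__":
    # Six-core CPU, leaving one core free for the GPU task.
    print(min_cpu_percent(6, 5))
```

For a 6-core CPU with one core left free this gives 83.34, which matches the number above. If the client rounds differently, plain 83.33 may work just as well.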

skgiven
Message 25049 - Posted: 13 May 2012 | 20:10:30 UTC - in response to Message 25047.
Last modified: 13 May 2012 | 20:12:07 UTC

5pot on ;)

Most, but not all, failures are for MJHARVEY_MJHXA tasks, with exit code 98 (0x62) and ERROR: # Energies have become nan.

My guess is the GPU is too hot (not using fan controlling software), or trying to use all 6 CPU cores for other projects. So, free up a CPU core and use fan controlling software such as EVGA Precision or MSI Afterburner. This is common sense for most GPU projects.

Unfortunately the number of people using 295.x/296 drivers is still very high, despite hundreds of warnings across the Boinc community, so going by other failures is difficult!

The Knighty Ni
Message 25051 - Posted: 13 May 2012 | 21:56:05 UTC

Thanks everyone.

Read your posts and see that there are concerns over the driver version, number of cores being used, GPU temp and overclocking.

Yes this system is overclocked to within very reasonable margins as you can tell from the temperatures on the Cores and GPU. The PC has been OC'ed and running for almost 18 months now with very, very few errors on any of the projects I run. Been running PrimeGrid regularly on this PC and have a superbly low error rate on a project which is highly sensitive to problems on CPU cores. My worst sub project currently stands at an average of 375 WU between errors. The best is at 3897. RNAWorld is another project I run. My normal error rate is about 1 in every 300 WU's. I know there are many who run their systems much hotter. This is not an option for me as this is a business PC that I use for my work.

Here is the system information. You may be able to see something that I am not able at this time which could be causing WU to error.

So if you can work with me to solve this issue I would really appreciate it :)

Running 5 cores of the 6 available. Core temps steady 39C, GPU Steady 62C

GPU:
Name NVIDIA GeForce GTX 560 Ti
PNP Device ID PCI\VEN_10DE&DEV_1087&SUBSYS_00000000&REV_A1\4&26B22A24&0&0010
Adapter Type GeForce GTX 560 Ti, NVIDIA compatible
Adapter Description NVIDIA GeForce GTX 560 Ti
Adapter RAM 1.25 GB (1,342,177,280 bytes)
Installed Drivers nv4_disp.dll
Driver Version 6.14.12.8566
INF File oem42.inf (Section005 section)
Color Planes 1
Color Table Entries 4294967296
Resolution 1680 x 1050 x 60 hertz
Bits/Pixel 32
Memory Address 0xFD000000-0xFE0FFFFF
Memory Address 0xF0000000-0xF9FFFFFF
Memory Address 0xF8000000-0xF9FFFFFF
I/O Port 0x0000E000-0x0000EFFF
IRQ Channel IRQ 24
I/O Port 0x000003B0-0x000003DF
I/O Port 0x000003C0-0x000003DF
Memory Address 0xA0000-0xBFFFF
Driver c:\windows\system32\drivers\nv4_mini.sys (6.14.12.8566, 12.20 MB (12,792,576 bytes), 9/22/2006 3:39 AM)


System Summary:

OS Name Microsoft Windows XP Professional
Version 5.1.2600 Service Pack 3 Build 2600
OS Manufacturer Microsoft Corporation
System Name FASTERMACHINE
System Manufacturer MSI
System Model MS-7693
System Type X86-based PC
Processor x86 Family 16 Model 10 Stepping 0 AuthenticAMD ~3712 Mhz (Standard setting 3333MhZ)
BIOS Version/Date American Megatrends Inc. V1.4, 12/15/2011
SMBIOS Version 2.7
Windows Directory C:\WINDOWS
System Directory C:\WINDOWS\system32
Boot Device \Device\HarddiskVolume1
Locale United States
Hardware Abstraction Layer Version = "5.1.2600.5512 (xpsp.080413-2111)"
User Name FASTERMACHINE\Deleted my name :)
Time Zone GMT Daylight Time
Total Physical Memory 4,096.00 MB
Available Physical Memory 2.28 GB
Total Virtual Memory 2.00 GB
Available Virtual Memory 1.96 GB
Page File Space 4.83 GB
Page File C:\pagefile.sys


skgiven
Message 25065 - Posted: 14 May 2012 | 10:13:34 UTC - in response to Message 25051.
Last modified: 14 May 2012 | 10:17:21 UTC

Your system setup looks fine!

Possibilities that spring to mind are:
The CPU OC is causing the issue. Reduce to stock.
Some system change was made that has caused continuous failures. Restart system. Look into what was changed (if anything).
There is a corruption/problem with the driver that comes to the fore when crunching GPUGrid tasks. Reinstall or upgrade the driver (clean/fresh install).
The tasks are failing due to overuse of the system. CPU is maxed out, the disk is writing a lot/continuously, or at times all the system memory gets used (heavy office applications). Use fewer CPU's, write to disk (checkpoint) less.
The GPU is not stable when running these apps due to the frequency at given voltage. First try reducing the memory clocks (try 10% and then 20%). Then the core if need be. Finally try increasing the GPU Voltage very slightly, should all else fail.

Reducing the CPU to stock and then the GDDR5 memory clock are the easiest places to start.

The Knighty Ni
Message 25069 - Posted: 14 May 2012 | 13:00:51 UTC - in response to Message 25065.
Last modified: 14 May 2012 | 13:02:27 UTC

Your system setup looks fine!

Possibilities that spring to mind are:
The CPU OC is causing the issue. Reduce to stock.
Some system change was made that has caused continuous failures. Restart system. Look into what was changed (if anything).
There is a corruption/problem with the driver that comes to the fore when crunching GPUGrid tasks. Reinstall or upgrade the driver (clean/fresh install).
The tasks are failing due to overuse of the system. CPU is maxed out, the disk is writing a lot/continuously, or at times all the system memory gets used (heavy office applications). Use fewer CPU's, write to disk (checkpoint) less.
The GPU is not stable when running these apps due to the frequency at given voltage. First try reducing the memory clocks (try 10% and then 20%). Then the core if need be. Finally try increasing the GPU Voltage very slightly, should all else fail.

Reducing the CPU to stock and then the GDDR5 memory clock are the easiest places to start.


Thanks sjgiven

I'll start by doing one of the quick easy things. Tick each of the items you mention above off and try one WU at a time to see what happens. If it completes correctly then I'll get another one.

If it errors out I'll try the next item on the list until system is stable and shrubbing properly.

Checked build stability on a PrimeGrid GPU sub project overnight. It completed 61 tasks in succession with no errors. Made one of the suggested changes during this time as well.

So here goes on a GPUGrid WU. Wish me luck :)

The Knighty Ni
Message 25076 - Posted: 14 May 2012 | 18:09:26 UTC

Oops should have written skgiven. Apologies for the spelling mistake.

Well that one finished correctly so here goes for the next :)

The Knighty Ni
Message 25081 - Posted: 14 May 2012 | 20:49:55 UTC
Last modified: 14 May 2012 | 20:51:09 UTC

Next WU in progress. Seems to be stable at present and is one of the MJHARVEY_ variations. Currently just over an hour into processing with 2.5 hours left to completion. Let's see what happens.

http://www.gpugrid.net/result.php?resultid=5370445

Which variants of the MJHARVEY_ WU's have people had problems with so far?

Is this WU one of those variants?

skgiven
Message 25082 - Posted: 14 May 2012 | 21:01:54 UTC - in response to Message 25081.

That looks similar. Most of your failures were after a few seconds, so it sounds like you've passed the first hurdle.

The Knighty Ni
Message 25083 - Posted: 14 May 2012 | 21:34:26 UTC - in response to Message 25082.

That looks similar. Most of your failures were after a few seconds, so it sounds like you've passed the first hurdle.


:) Brilliant

Let's make this work for a very stable platform which will shrub any WU and maybe set standard builds for the Phenom II CPU's.

This PC is used for between 14-20 hours daily. The two ways it is used are below.

1. Business use with high activity with Windows packages: Excel, Word Etc and Skype (Those are what I use daily for several hours on end) normally 8-12 hours.

2. Gamers environments. Have to admit in my spare time I am a gamer :)

Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 41,004
Message 25088 - Posted: 15 May 2012 | 0:44:52 UTC - in response to Message 25083.

Word: only Superman can type fast enough to cause Word to bog down anything

Excel: might cause problems but you would have to have a lot of big spreadsheets open and they would have to be recalculating very often and contain some very complex formulae, not a prime suspect

Skype: voice calls should be no problem, voice + video calls might be a problem

Games: Games can definitely cause a problem if they use the GPU and many do. You might see anything from sluggish game performance to game or GPUgrid task crashes. If so then look at the Client configuration section of the official BOINC wiki and create a cc_config.xml file in the BOINC data directory. The cc_config.xml file should contain the <exclusive_gpu_app> option described on the Client configuration page. If you already have a cc_config.xml file then add the <exclusive_gpu_app> option to it. Restart BOINC to make the cc_config.xml options take effect

You can designate as many game or other executables as you want with the <exclusive_gpu_app> option. Whenever a designated app runs, BOINC will suspend GPU tasks. If GPU tasks do not suspend then you have designated the wrong executable(s) or else there is a syntax error in your cc_config.xml. As for determining the names of the executables to be designated in cc_config.xml, look in Task Manager while the app/game is running, the executable should be named there. If you need help please put the entire content of your cc_config.xml in a message here and someone will check it for you to see if it's syntactically correct.
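
As a starting point, here is a minimal Python sketch that builds a cc_config.xml containing <exclusive_gpu_app> entries and then parses the text back, so a syntax error shows up before BOINC ever reads the file. The executable names (bf3.exe, Game.exe) are placeholders I have assumed; use whatever Task Manager shows on your system. The helper names build_cc_config and listed_gpu_apps are also made up for this example. The script only prints the result, because the location of the BOINC data directory varies between installs.

```python
import xml.etree.ElementTree as ET


def build_cc_config(executables):
    """Return cc_config.xml text with one <exclusive_gpu_app> per executable."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for exe in executables:
        ET.SubElement(options, "exclusive_gpu_app").text = exe
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode") + "\n"


def listed_gpu_apps(xml_text):
    """Parse the text back; raises ET.ParseError if it is not well-formed."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./options/exclusive_gpu_app")]


if __name__ == "__main__":
    # Placeholder names: replace with what Task Manager shows on your system.
    text = build_cc_config(["bf3.exe", "Game.exe"])
    print(text)
    print(listed_gpu_apps(text))
```

If you already have a cc_config.xml, do not overwrite it with this output; copy the <exclusive_gpu_app> lines into the existing <options> section instead. You can also paste your own file's contents into listed_gpu_apps to check that it parses and lists the executables you expect.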

The Knighty Ni
Message 25093 - Posted: 15 May 2012 | 7:41:06 UTC

Thanks Dagorath.

Have noticed Skype grumbling from time to time when the PC is in full use. Set up exclusive application for it and everything shut down as soon as Skype is opened. Had to find a better way of doing it, so was freeing up one CPU on work days.

Now one CPU is free all the time.

I'll tackle each of the games I play individually as some of them are what I call light games in terms of the resources they need to run. E.g. Diablo II in single player mode. Been running that game for over 8 years and it used to run very happily on my old 400MHz box :)

Overnight, a 50/50 success rate: 2 completed properly and 2 errored.

Noticed that other people have had problems with them as well:
http://www.gpugrid.net/workunit.php?wuid=3418351
http://www.gpugrid.net/workunit.php?wuid=3403967

OK next step in adjustments as 50/50 is not good enough.

100MHz knocked off GPU memory clock setting.


5pot
Message 25096 - Posted: 15 May 2012 | 13:03:29 UTC

You could just hit the suspend button when gaming or when using Skype, as long as you have the Leave Applications In Memory box ticked in BOINC preferences.

Whenever I play BF3, I simply suspend BOINC and then hit resume whenever I'm done playing.

Dagorath
Message 25102 - Posted: 16 May 2012 | 0:21:28 UTC - in response to Message 25093.
Last modified: 16 May 2012 | 0:24:42 UTC

I think you have the idea, you just need time to experiment. BOINC has lots of options so there are often several ways to skin any cat. 5pot's suggestion works too for example.

Freeing up a core just for GPUgrid tasks helps finish GPUgrid tasks faster. When I have several Climateprediction tasks running I find I need to free up a second core or else I see performance issues and weird errors in other apps, not just in BOINC. Climateprediction tasks are very CPU intense. If you run other projects known to be CPU intense you might try freeing up a second core too.

One thing I strongly suggest is that if you are OCing anything, GPU or CPU, then the first thing you ought to do is drop back to stock settings. Once you get things stable at stock settings then you can start bumping up the clocks.

Regarding the tasks that are failing on other peoples' systems too... Don't assume that means there is something wrong with those tasks. There are many hosts on this project that are not configured correctly and they crash tasks regularly. You would think the owners would notice and do something but they don't. Perhaps it's because they just assume they got a bad task. I get plenty of tasks that have crashed on several other machines but they run fine on mine. There is no magic in my machine so those "bad" tasks aren't bad as often as you might think.

The Knighty Ni
Message 25103 - Posted: 16 May 2012 | 2:00:18 UTC - in response to Message 25102.
Last modified: 16 May 2012 | 2:04:30 UTC


Regarding the tasks that are failing on other peoples' systems too... Don't assume that means there is something wrong with those tasks. There are many hosts on this project that are not configured correctly and they crash tasks regularly. You would think the owners would notice and do something but they don't. Perhaps it's because they just assume they got a bad task. I get plenty of tasks that have crashed on several other machines but they run fine on mine. There is no magic in my machine so those "bad" tasks aren't bad as often as you might think.


Very good point Dagorath.
(Edited) Forgot to say 5 successive WU's have completed, which is really pleasing. Only going to let one in progress and one spare on the machine for the time being until I am sure that I have a very low error rate. (edit end)

I have always monitored my machines closely, even from the early days when it was just Seti@Home pre-BOINC, because I don't want errors cropping up. I have always felt this ultimately slows you down. Not only that, it's a complete waste of resources.

One memorable WU I did on Climate Prediction in 2007 stood at 1625hrs when it downloaded. At the time I had a very slow 2 core machine and was flabbergasted that it was going to take almost 70 days to complete. The deadline was over 1 year. So did other stuff in between and only just got it finished by the deadline. What really upset me about that one is that although I received credit for it, it ended with a computation error. :(

Since then I strive to ensure the machine is as stable as it can be and regularly stress/torture test to ensure everything is working properly, especially when components have been replaced for some reason. Just before joining this project I torture tested the CPU with GIMPS AKA Prime95 for 72 hours at its current OC setting. If there is anything wrong with the CPU or OC settings GIMPS will fail in a trice. It didn't murmur and stayed around the 42C temperature mark, passing all the tests thrown at it. The next torture test will include memory to ensure that is working fine, as I forgot to include it in the last test. When running the full-on stress tests including memory, the machine is very very sluggish, so it is not ideal to carry them out if you need to use the machine for other things.

If you are not familiar with GIMPS here is a link to them. The software is free to use. Just remember to unload it if you switch your machines off at night, otherwise it will boot straight in and it will take a long time to boot up. :) They are like GPUGrid except they search for prime numbers:
http://www.mersenne.org/freesoft/

Dagorath
Message 25109 - Posted: 16 May 2012 | 16:35:20 UTC - in response to Message 25103.

Thanks for the link. I'll give it a try.
