
Message boards : News : WU: NOELIA_INS1P

noelia
Joined: 5 Jul 12
Posts: 35
Credit: 393,375
RAC: 0
Message 32663 - Posted: 3 Sep 2013 | 16:31:37 UTC

Hi all,

New WUs in the long queue, big-box size, 120000 credits each. The batch has been tested previously and should not produce any issues, but please report any problems you encounter.

Noelia

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,887,629
RAC: 414,517
Message 32798 - Posted: 6 Sep 2013 | 13:50:22 UTC

These slow a GTX 460/768 MB GPU to a crawl. Santis and Nathans run fine.

Operator
Joined: 15 May 11
Posts: 108
Credit: 297,176,099
RAC: 0
Message 32809 - Posted: 6 Sep 2013 | 17:38:17 UTC

I run these on both the machine with the Titans and the one with the 2GB GTX 650Ti.

The Titan box runs these (as well as Nathans and Santis) roughly three times as fast as the 650Ti does.

I'm thinking about firing up my old box with the 590s in it to see what that will do in comparison.

Operator.
____________

werdwerdus
Joined: 15 Apr 10
Posts: 123
Credit: 1,004,473,861
RAC: 0
Message 32866 - Posted: 10 Sep 2013 | 1:09:20 UTC

Running at 95% GPU load on a GTX 660 Ti in Windows 7! Really nice!
____________
XtremeSystems.org - #1 Team in GPUGrid

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Message 32869 - Posted: 10 Sep 2013 | 12:28:42 UTC - in response to Message 32866.

Running at 95% GPU load on a GTX 660 Ti in Windows 7! Really nice!

Yes, I finally got one too. Running 92% steady on my 770 at 66°C.
____________
Greetings from TJ

Jim1348
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Message 32872 - Posted: 10 Sep 2013 | 14:53:25 UTC

It is running at 96% on my GTX 650 Ti (63 C with a side fan). At 60 percent complete, it looks like it will take 18 hours 15 minutes to complete. And no problems with the memory (774 MB used).

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Message 32928 - Posted: 13 Sep 2013 | 11:05:39 UTC

Just got 80-NOELIA_INS1P-5-15-RND4120_0. It really is putting my 650Ti through its paces!

vagelis@vgserver:~$ gpuinfo
Fan Speed : 54 %
Gpu : 67 C
Memory Usage
Total : 1023 MB
Used : 938 MB
Free : 85 MB

The WU is only at the start (1.76%) and is estimated to take 22:44. I expect this to drop significantly.
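The `gpuinfo` command above is presumably a local wrapper script. As a rough sketch (assuming it simply reformats the CSV emitted by `nvidia-smi`'s query mode; `parse_gpuinfo` is a hypothetical helper, not the poster's actual code), it might look like:

```python
# Hypothetical stand-in for the poster's local `gpuinfo` wrapper. It would
# shell out to:
#   nvidia-smi --query-gpu=fan.speed,temperature.gpu,memory.total,memory.used,memory.free \
#              --format=csv,noheader,nounits
# and reformat the one CSV line produced per GPU.
def parse_gpuinfo(csv_line):
    fan, temp, total, used, free = (int(v) for v in csv_line.split(","))
    return {"fan_pct": fan, "gpu_c": temp,
            "mem_total_mb": total, "mem_used_mb": used, "mem_free_mb": free}

# Sample line matching the figures reported above.
info = parse_gpuinfo("54, 67, 1023, 938, 85")
print("Fan Speed : %d %%" % info["fan_pct"])
print("Gpu       : %d C" % info["gpu_c"])
print("Used      : %d MB of %d MB" % (info["mem_used_mb"], info["mem_total_mb"]))
```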
____________

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,887,629
RAC: 414,517
Message 32929 - Posted: 13 Sep 2013 | 12:44:21 UTC - in response to Message 32872.

memory (774 MB used)

Undoubtedly why they won't run on the GTX 460/768 cards. They work fine on my 1GB GPUs, but I have to abort them on the 460 in favor of Santi and Nathan WUs.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Message 32952 - Posted: 14 Sep 2013 | 16:10:33 UTC

Finally I had another Noelia WU. It ran steadily on my 660 with a GPU load of 97-98%; Nathan's do only 88-89%. And it ran in one go, meaning no termination and restart because of the simulation becoming unstable.
I wish I could opt for Noelia WUs only.
____________
Greetings from TJ

Jim1348
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Message 32953 - Posted: 14 Sep 2013 | 17:34:37 UTC - in response to Message 32929.

memory (774 MB used)

Undoubtedly why they won't run on the GTX 460/768 cards. They work fine on my 1GB GPUs, but I have to abort them on the 460 in favor of Santi and Nathan WUs.

I noticed a lot of failures on the 400 series cards and almost posted about it, but wasn't sure why. I think you have explained it.

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Message 32956 - Posted: 14 Sep 2013 | 18:29:58 UTC - in response to Message 32928.

Just got 80-NOELIA_INS1P-5-15-RND4120_0. It really is putting my 650Ti through its paces!
vagelis@vgserver:~$ gpuinfo
Fan Speed : 54 %
Gpu : 67 C
Memory Usage
Total : 1023 MB
Used : 938 MB
Free : 85 MB

The WU is only at the start (1.76%) and is estimated to take 22:44. I expect this to drop significantly.

Finished successfully in 81,572.91 sec (22.7 h) on the 650Ti (running on Linux). It didn't take 18-19 hours like previous NOELIAs did... it must be a more complex WU.

180k is sweet! :)
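For what it's worth, the client's early estimate held up well here; converting the reported run time:

```python
# Quick conversion of the reported run time, using only numbers from this thread.
actual_s = 81_572.91            # reported run time in seconds
hours = actual_s / 3600
print(hours)                    # about 22.66 h, close to the 22:44 (22.73 h) estimated at 1.76%
```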
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Message 32966 - Posted: 15 Sep 2013 | 11:36:12 UTC - in response to Message 32798.

These slow a GTX 460/768 MB GPU to a crawl. Santis and Nathans run fine.

Sounds like the minimum GPU memory requirement is set too low for these, otherwise BOINC would refuse to run them on such cards.

MrS
____________
Scanning for our furry friends since Jan 2002

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,887,629
RAC: 414,517
Message 32969 - Posted: 15 Sep 2013 | 12:14:01 UTC - in response to Message 32966.

These slow a GTX 460/768 MB GPU to a crawl. Santis and Nathans run fine.

Sounds like the minimum GPU memory requirement is set too low for these, otherwise BOINC would refuse to run them on such cards.

MrS, can the minimum memory requirement be set for specific WUs or just for the app in general?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Message 32971 - Posted: 15 Sep 2013 | 12:18:51 UTC - in response to Message 32969.

I don't know; I've never set up a BOINC server or created WUs myself. But if I had programmed BOINC, this would be a setting tagged to each WU, because that's the only way it makes sense. The entire credit system was based on the idea that different WUs can contain different contents.

So I'm not certain, but I expect it to be possible. And if not, a feature request is in order, IMO.

MrS
____________
Scanning for our furry friends since Jan 2002

Richard Haselgrove
Joined: 11 Jul 09
Posts: 791
Credit: 1,425,102,570
RAC: 1,358,780
Message 32972 - Posted: 15 Sep 2013 | 12:22:56 UTC - in response to Message 32969.

These slow a GTX 460/768 MB GPU to a crawl. Santis and Nathans run fine.

Sounds like the minimum GPU memory requirement is set too low for these, otherwise BOINC would refuse to run them on such cards.

MrS, can the minimum memory requirement be set for specific WUs or just for the app in general?

I believe it might need to be set at the plan_class level, which is between app and WU.

So we might need to enable something like cuda55_himem, and create _INS1P WUs for that class only.
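If the project runs a stock BOINC scheduler, such a class would be declared in the scheduler's plan_class_spec.xml. This is only a sketch: `cuda55_himem` is the hypothetical name suggested above, `min_gpu_ram_mb` is the memory-floor field as I recall it from the BOINC plan-class spec, and the values are illustrative:

```xml
<plan_class>
    <name>cuda55_himem</name>
    <gpu_type>nvidia</gpu_type>
    <cuda/>
    <!-- BOINC encodes CUDA 5.5 as 5050; value illustrative -->
    <min_cuda_version>5050</min_cuda_version>
    <!-- refuse hosts whose GPU has less than 1 GB of memory -->
    <min_gpu_ram_mb>1024</min_gpu_ram_mb>
</plan_class>
```

WUs for the _INS1P batch would then be created against an app version tied to this class, so 768 MB cards never receive them.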

Jacob Klein
Joined: 11 Oct 08
Posts: 1068
Credit: 1,149,280,989
RAC: 1,048,193
Message 33050 - Posted: 18 Sep 2013 | 1:28:19 UTC

I just had a work unit fail (on an otherwise completely-stable system):
http://www.gpugrid.net/result.php?resultid=7285706

The error seems to indicate, to me, that it's likely a problem with the work unit. Can you (NOELIA) confirm that? Also, if there's anything else I can provide, let me know. Thanks, Jacob

Name pnitrox118-NOELIA_INS1P-7-12-RND7320_4
Workunit 4777876
Created 16 Sep 2013 | 19:44:18 UTC
Sent 17 Sep 2013 | 4:01:45 UTC
Received 17 Sep 2013 | 10:08:54 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 153764
Report deadline 22 Sep 2013 | 4:01:45 UTC
Run time 2.31
CPU time 2.13
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.14 (cuda42)

Stderr output

<core_client_version>7.2.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203] VERSION [42]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32680
# Simulation unstable. Flag 9 value 992
# Simulation unstable. Flag 10 value 909
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)

</stderr_txt>
]]>
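A side note on the two spellings of that exit status: 0xffffff9f (and its 64-bit sign extension 0xffffffffffffff9f) is just -97 viewed as an unsigned value. A quick check in Python:

```python
# -97 stored in an unsigned 32-bit word reads back as 0xffffff9f;
# sign-extended to 64 bits it becomes 0xffffffffffffff9f.
print(hex(-97 & 0xFFFFFFFF))            # 0xffffff9f
print(hex(-97 & 0xFFFFFFFFFFFFFFFF))    # 0xffffffffffffff9f
print(0xFFFFFF9F - 2**32)               # -97
```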

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,662,131,944
RAC: 10,093,688
Message 33055 - Posted: 18 Sep 2013 | 15:23:49 UTC - in response to Message 32972.

It's failed on every other host too, so it's a bad workunit.

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Message 33057 - Posted: 18 Sep 2013 | 16:23:19 UTC - in response to Message 33055.

Just spotted a NOELIA_INS1P at 123h into a run, and only at 43% complete!

I suspended it and tried to get it to run on the other card. It started but the system became unresponsive (mouse stopped moving then started, but couldn't click on anything). Then CPU WU's started to fail and the system became totally unresponsive (to keyboard and mouse).

The WU has already timed out on the server,

pnitrox120-NOELIA_INS1P-3-12-RND7171_0

Would have been good to have spotted this earlier, yesterday, the day before, the day before that... Oh well, it's your loss too I guess.

After hard powering down and cold starting the system, the WU resumed on the other card but says it had only run for 6 h 45 min. This suggests it went wonky around that time and ran cold thereafter (the GPU was at 45C). It's running on an 8.03 app (Ubuntu 13.04, cuda55) and looks to be about 6 or 7 hours from completion, so I will let it run and keep an eye on it.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Message 33058 - Posted: 18 Sep 2013 | 16:27:35 UTC - in response to Message 33057.

Just spotted a NOELIA_INS1P at 123h into a run, and only at 43% complete!

I suspended it and tried to get it to run on the other card. It started but the system became unresponsive (mouse stopped moving then started, but couldn't click on anything). Then CPU WU's started to fail and the system became totally unresponsive (to keyboard and mouse).

That is what still concerns me. I can take errors, but stalling a machine may be even worse than a crash.

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,887,629
RAC: 414,517
Message 33060 - Posted: 18 Sep 2013 | 16:56:05 UTC - in response to Message 33057.

Just spotted a NOELIA_INS1P at 123h into a run, and only at 43% complete!

I suspended it and tried to get it to run on the other card. It started but the system became unresponsive (mouse stopped moving then started, but couldn't click on anything). Then CPU WU's started to fail and the system became totally unresponsive (to keyboard and mouse).

The WU has already timed out on the server,

pnitrox120-NOELIA_INS1P-3-12-RND7171_0

Would have been good to have spotted this earlier, yesterday, the day before, the day before that...

Oops, dskagcommunity has it now:

http://www.gpugrid.net/workunit.php?wuid=4771472

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Message 33064 - Posted: 18 Sep 2013 | 17:40:03 UTC - in response to Message 33060.
Last modified: 18 Sep 2013 | 17:41:10 UTC

I wonder if it will behave itself for dskagcommunity? He's running it with v8.14 (cuda42).
If my WU completes (<5h now) limited credits might go to both of us, or just not to me. Something else that's still broken!
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 810,073,458
RAC: 0
Message 33067 - Posted: 18 Sep 2013 | 18:07:24 UTC

Oh.

I looked at the machine, but the WU seems to be running normally. I think it will need slightly more than your 5h :( But I will survive one WU with reduced credits.
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Crunching for my deceased Dog who had "good" Braincancer..

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,662,131,944
RAC: 10,093,688
Message 33068 - Posted: 18 Sep 2013 | 18:37:46 UTC - in response to Message 33067.
Last modified: 18 Sep 2013 | 18:42:01 UTC

Oh.

I looked at the machine, but the WU seems to be running normally. I think it will need slightly more than your 5h :( But I will survive one WU with reduced credits.

While I was reading your words, this video just snapped into my mind.
Sorry for being off topic.

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Message 33070 - Posted: 18 Sep 2013 | 19:29:08 UTC - in response to Message 33068.

Yeah, I can see/hear why that popped into your head!

This is how I felt,


____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Message 33090 - Posted: 19 Sep 2013 | 8:49:40 UTC - in response to Message 33070.
Last modified: 19 Sep 2013 | 8:54:59 UTC

The WU completed on both systems and both systems got partial credit,

pnitrox120-NOELIA_INS1P-3-12-RND7171

Task 7273360, computer 154384: sent 13 Sep 2013 3:43:04 UTC, reported 18 Sep 2013 21:17:36 UTC, Completed and validated, run time 42,071.90 s, CPU time 41,448.82 s, credit 101,000.00, Long runs (8-12 hours on fastest card) v8.03 (cuda55)
Task 7289613, computer 117426: sent 18 Sep 2013 7:52:19 UTC, reported 18 Sep 2013 23:23:28 UTC, Completed and validated, run time 46,687.89 s, CPU time 2,714.13 s, credit 101,000.00, Long runs (8-12 hours on fastest card) v8.14 (cuda42)

The 8.14 app produced a more informative stderr output,

Stderr output

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 560 Ti] Platform [Windows] Rev [3203] VERSION [42]
# SWAN Device 0 :
# Name : GeForce GTX 560 Ti
# ECC : Disabled
# Global mem : 1279MB
# Capability : 2.0
# PCI ID : 0000:04:00.0
# Device clock : 1520MHz
# Memory clock : 1700MHz
# Memory width : 320bit
# Driver version : r301_07 : 30142
# GPU 0 : 60C
# GPU 0 : 62C
# GPU 0 : 63C
# GPU 0 : 64C
# GPU 0 : 65C
# GPU 0 : 66C
# GPU 0 : 67C
# GPU 0 : 68C
# GPU 0 : 69C
# GPU 0 : 70C
# Time per step (avg over 4200000 steps): 11.114 ms
# Approximate elapsed time for entire WU: 46680.219 s
01:15:25 (2124): called boinc_finish

</stderr_txt>
]]>

Wouldn't it be better to include a timestamp with each GPU temperature change?
BTW, the "# GPU 0" prefix isn't needed every time a temperature change is reported. If there is only one GPU, reporting its name once is sufficient; if there is more than one, report which device the WU is running on, and then again only if that changes.
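Incidentally, the two timing lines in that stderr are mutually consistent (4,200,000 steps at 11.114 ms per step):

```python
# Cross-check the two figures reported by the 8.14 app's stderr.
steps = 4_200_000
ms_per_step = 11.114
elapsed_s = steps * ms_per_step / 1000.0
print(elapsed_s)  # roughly 46,679 s, close to the reported 46680.219 s
```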
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 33136 - Posted: 22 Sep 2013 | 4:21:06 UTC

Just wanted to say Noelia's recent WUs are beating Nathan's by a long shot on the 780s. Average GPU usage and memory usage is 80%/20% for Nathan vs. 90%/30% for Noelia. Her tasks also get a lot fewer access violations.

Nicely done.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Message 33145 - Posted: 22 Sep 2013 | 12:14:17 UTC - in response to Message 33136.

Just wanted to say Noelia's recent WUs are beating Nathan's by a long shot on the 780s. Average GPU usage and memory usage is 80%/20% for Nathan vs. 90%/30% for Noelia. Her tasks also get a lot fewer access violations.

Nicely done.

Yes, I see the same on my 770 as well. I even got a Noelia beta on my 660 and it performed better than the other betas; Santi's, I think they were.
____________
Greetings from TJ

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,887,629
RAC: 414,517
Message 33169 - Posted: 23 Sep 2013 | 14:22:36 UTC - in response to Message 33136.

Just wanted to say Noelia's recent WUs are beating Nathan's by a long shot on the 780s. Average GPU usage and memory usage is 80%/20% for Nathan vs. 90%/30% for Noelia.

The other side of the coin is that Noelia WUs cause CPU processes to slow somewhat and can bring WUs on the AMD GPUs to their knees (my systems all have 1 NV and 1 AMD each). These problems are not seen with either Nathan or Santi WUs.

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 33170 - Posted: 23 Sep 2013 | 16:00:28 UTC

Personally, I have no issues with GPUs stealing CPU resources if needed. I'd rather feed the roaring lion than the grasshopper.

ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Message 33594 - Posted: 23 Oct 2013 | 18:46:31 UTC

Arrgh... potx21-NOELIA_INS1P-1-14-RND1061_1. Another one with this in the stderr output:

The simulation has become unstable. Terminating to avoid lock-up

No other failures for this WU. Let's see how the next guy does on it.
____________

ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Message 33600 - Posted: 24 Oct 2013 | 3:31:50 UTC - in response to Message 33594.
Last modified: 24 Oct 2013 | 3:32:23 UTC

Let's see how the next guy does on it.

Just fine, I see... Never mind. Move along, nothing to see here.
____________

Betting Slip
Joined: 5 Jan 09
Posts: 589
Credit: 2,041,855,275
RAC: 1,485,885
Message 33612 - Posted: 25 Oct 2013 | 9:47:07 UTC - in response to Message 33594.

Arrgh... potx21-NOELIA_INS1P-1-14-RND1061_1. Another one with this in the stderr output:

The simulation has become unstable. Terminating to avoid lock-up

No other failures for this WU. Let's see how the next guy does on it.


A large part could be due to the fact that these WUs consume a gig of vRAM, on top of any OC.

Trotador
Joined: 25 Mar 12
Posts: 83
Credit: 1,071,125,199
RAC: 151,785
Message 33628 - Posted: 26 Oct 2013 | 18:51:01 UTC

My failure rate on these units is getting quite bad: three of them in the last couple of days, and wingmen are completing them fine. No problem with any other type of WU. Any advice?

2x GTX 660 Ti in Linux; the driver is 304.88, but it has been rock solid so far.

Not much info in the stderr, at least to my knowledge.

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 255 (0xff, -1)
</message>
<stderr_txt>

</stderr_txt>
]]>

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Message 33634 - Posted: 27 Oct 2013 | 11:22:44 UTC - in response to Message 33628.

4 errors from 67 WU's isn't very bad, but they are all NOELIA WU's, so there is a trend. You are completing some though.

I had one fail on a Linux system with 304.88 drivers, but my system is prone to failures due to the GTX650TiBoost which has been somewhat troublesome in every system and setup I've used (the card operates too close to the edge).

I also have two GPU's in my system and I got the same output,

    Compute error, process exited with code 255 (0xff, -1).


____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Joined: 11 Oct 08
Posts: 1068
Credit: 1,149,280,989
RAC: 1,048,193
Message 33637 - Posted: 27 Oct 2013 | 11:41:59 UTC - in response to Message 33634.
Last modified: 27 Oct 2013 | 11:43:10 UTC

I have had a long run of success (61 straight valid GPUGrid tasks over the past 2 weeks!), including 4 successful NOELIA_INS1P tasks, on my multi-GPU Windows 8.1 x64 machine.

Trotador
Joined: 25 Mar 12
Posts: 83
Credit: 1,071,125,199
RAC: 151,785
Message 33642 - Posted: 27 Oct 2013 | 17:16:56 UTC - in response to Message 33634.

4 errors from 67 WU's isn't very bad, but they are all NOELIA WU's, so there is a trend. You are completing some though.

I had one fail on a Linux system with 304.88 drivers, but my system is prone to failures due to the GTX650TiBoost which has been somewhat troublesome in every system and setup I've used (the card operates too close to the edge).

I also have two GPU's in my system and I got the same output,
    Compute error, process exited with code 255 (0xff, -1).



4 out of 67 is OK, I agree, but it's around 50% for these NOELIA_INS1P, so the trend is there as you say. No other type has failed in the last few months, including other NOELIA types.

The two following units of the same type after the last failure completed fine... maybe the Moon's influence :)

ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Message 33649 - Posted: 27 Oct 2013 | 23:58:43 UTC - in response to Message 33612.
Last modified: 28 Oct 2013 | 0:01:39 UTC

Betting Slip wrote:
A large part could be due to the fact that these WUs consume a gig of vRAM, on top of any OC.

Yep, good call. I've got another one of these running. Afterburner shows memory usage at slightly more than 1.1 GB... I suppose that's stressing my GTX 570 (1280 MB), isn't it?
____________

Betting Slip
Joined: 5 Jan 09
Posts: 589
Credit: 2,041,855,275
RAC: 1,485,885
Message 33651 - Posted: 28 Oct 2013 | 7:48:08 UTC - in response to Message 33649.

Betting Slip wrote:
A large part could be due to the fact that these WUs consume a gig of vRAM, on top of any OC.

Yep, good call. I've got another one of these running. Afterburner shows memory usage at slightly more than 1.1GB...I suppose that's stressing my GTX 570 (1280MB), isn't it?


I had one fail from becoming unstable on a GTX 560 Ti with the same amount of memory. They should be OK with that amount, but they're still failing.

It's this sort of problem that scares away contributors and annoys the hell out of me.

Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Message 33652 - Posted: 28 Oct 2013 | 9:11:24 UTC
Last modified: 30 Oct 2013 | 12:43:02 UTC

Apparently they have quite a small error rate (<10%), so there's nothing systematic to worry about.

I guess it's the same memory problem (WUs being too large) that has been troubling Noelia's WUs lately. These large ones should be finishing soon, so things will get better. As for why they cause problems for some people, I don't know :/

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Message 33670 - Posted: 30 Oct 2013 | 9:11:50 UTC - in response to Message 33652.

I've only had 12 failures this month (that are still in the database), but 3 of the last 4 were NOELIA_INS1P tasks. If I include 2 recent NOELIA_FXArep failures, that's 5 of my last 6 failures. Of course, I've been running more of Noelia's work recently, as there have been more tasks available.
I've also been running some short SANTI_MAR tasks. These shorter runs have less chance of failing, so there is little chance of seeing a trend in their failures.
I suspect most of my failures occur on the mid-range cards (GTX650TiBoost and GTX660) rather than the slightly bigger cards. Again, these mid-range cards tend to run closer to their power targets, so there is more chance of failure.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Message 33677 - Posted: 30 Oct 2013 | 12:57:33 UTC
Last modified: 30 Oct 2013 | 12:59:56 UTC

On my GTX 660's (Win7 64-bit with 331.58 drivers):

Concerning slow running: that happened once in a while to me, though only on the 660s and never on my GTX 650 Ti. But someone mentioned the old trick of setting the Power Management Mode to "Prefer Maximum Performance" in the Nvidia control panel, and I have not had a problem since. It is a little inconvenient to get to that setting now, since I normally connect my display to the internal Intel graphics adapter, but I used to always set it that way when I was running the monitor directly from the Nvidia card.

Concerning failures: I was getting occasional failures on various Noelias (not necessarily just INS1P), but only on one of my two cards, which was curious since they are supposedly identical. It turns out that the GPU core voltage setting on the one that failed was a little lower than the other, apparently because it was running into a power limit. So using MSI Afterburner, I raised the power limit (to 105%) and raised the voltage a little. That fixed it, and I have had no failures since. It is a truism that the Noelias work your card hard, and if there are any weaknesses, they will find them.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Message 33715 - Posted: 2 Nov 2013 | 11:33:37 UTC - in response to Message 33677.

If you want to avoid the reduced power efficiency which comes along with the increased voltage you could also scale GPU clock back by 13 or 26 MHz - should have the same stabilizing effect (but be a little slower and a little more power efficient).

MrS
____________
Scanning for our furry friends since Jan 2002

Jim1348
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Message 33721 - Posted: 2 Nov 2013 | 13:05:34 UTC - in response to Message 33715.
Last modified: 2 Nov 2013 | 13:29:46 UTC

If you want to avoid the reduced power efficiency which comes along with the increased voltage you could also scale GPU clock back by 13 or 26 MHz - should have the same stabilizing effect (but be a little slower and a little more power efficient).

MrS

Actually, I do set back both cards by 10 MHz, but for a different reason. I found that the problem card still had the slowdown on a subsequent Noelia work unit. Then I remembered another old trick that sometimes works to keep the clocks going - let MSI Afterburner control them. It doesn't seem to matter whether you increase or decrease the clock rate from the default, or by what amount. My guess is that it takes control away from the Nvidia software, or whatever they use. At least it has been working for six days now, which is encouraging, if not proof.

But such a small change in clock rate (it is very close to the Nvidia default of 980 MHz anyway) does not make any discernible change in temperature or power consumption as measured by GPU-Z. I would have to make a much larger change than that, which I will do if necessary. I think the chip on that particular card was just weak; when they test them, I am sure they don't run them through anything as rigorous as what we do here.

I also had to bump up the voltage a little more - I started at 25 mV, but that wasn't quite enough, so now it is 37 mV. It has been error-free for a couple of days and three Noelias, but I need more Noelias to test it.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Message 33724 - Posted: 2 Nov 2013 | 16:01:32 UTC - in response to Message 33721.

The clock granularity of Keplers is 13 MHz, so you might want to stick to multiples of it. If you don't, the value gets rounded - no problem, unless your change is small enough that it rounds to the same clock speed and changes nothing.
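That snapping behaviour can be sketched as follows (13 MHz per the figure above; whether the driver rounds to nearest or truncates is my assumption):

```python
def effective_offset(requested_mhz, step=13):
    """Snap a requested clock offset to the card's granularity.
    Assumes round-to-nearest; the driver may truncate instead."""
    return round(requested_mhz / step) * step

print(effective_offset(-13))  # -13: lands exactly on a step
print(effective_offset(-5))   # 0: rounds back to no change at all
```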

And you're right, +/-10 MHz has a negligible effect on power consumption. What I was referring to was the increased power consumption from the voltage increase. It's not dramatic either (larger than what the frequency change causes), but it's something you might not want.

And don't try to shoot for a 0% error rate at GPU-Grid - I'm not sure that's actually possible, or what it would depend on (OS, drivers, etc.). If you do get occasional errors, it's always a good idea to check whether the WU also fails for your wingmen (who should have crunched it a few days after your attempt).

MrS
____________
Scanning for our furry friends since Jan 2002

Jim1348
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Message 33729 - Posted: 2 Nov 2013 | 17:43:01 UTC - in response to Message 33724.

And you're right, +/-10 MHz has a negligible effect on power consumption. What I was referring to was the increased power consumption from the voltage increase. It's not dramatic either (larger than what the frequency change causes), but it's something you might not want.

And don't try to shoot for 0% error rate at GPU-Grid - I'm not sure this is actually possible and what it would depend on (OS, drivers etc.). If you do get occasional errors it's always a good idea to check whether the WU is also failing for your wingmen (which should have crunched it a few days after your attempt).

MrS

There is a small effect from the voltage increase thus far, but not that much. The problem card (0) is in the top slot and runs a couple of degrees hotter than the bottom card (1) even without the boost: typically 68 and 66 degrees C, probably due to the air flow from the side fans (I have one of the few motherboards that puts the top card in the very top slot, which also raises the position of the lower card). When I raise the voltage, it adds a degree (or less) to that on average. I probably should swap their slot positions, but it is not that important yet.

But the bottom card has done quite well - no errors in over a week; only the top card has had the errors.
http://www.gpugrid.net/results.php?hostid=159002&offset=0&show_names=1&state=0&appid=
I normally would buy Asus cards for better cooling (the non-overclocked versions), but needed the space-saving of these Zotac cards at the time. Now they are in a larger case, and I can replace them with anything if need be.

The main point for me is that the big problems of a few months ago are past, for the moment.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33743 - Posted: 3 Nov 2013 | 13:28:23 UTC - in response to Message 33729.

The project's tasks have been quite stable of late, the only recent exception being a small batch of WUs that failed quickly. So it's a good time to find out whether you have a stable system or not.

In a system with 2 GPU's the top GPU is more likely to be the warmest because it's sandwiched between the CPU and the other GPU.

If you have exhaust cooling GPU's then the side fans would be better blowing into the case. If not then these fans might be better blowing out (but it depends on the case and other fans).

I have two GPUs in the one open case. Despite both having triple fans, the top card's temperature was 72 to 73°C (with fans at 95%). I propped up 2 case fans to blow the air out from their sides (as the cards vent into the case). This dropped the top card's temperature to around 63°C. It also raised the temperature of the bottom card, but only up to 55°C :)
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Send message
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 33744 - Posted: 3 Nov 2013 | 14:36:43 UTC - in response to Message 33743.

If you have exhaust cooling GPU's then the side fans would be better blowing into the case. If not then these fans might be better blowing out (but it depends on the case and other fans).

I have two 120 mm side fans blowing in, a 120 mm rear fan blowing out, and a top 140 mm fan blowing out (the power supply is bottom-mounted). I think that establishes the airflow over the GPUs pretty well, but you never know until you try it another way. As you point out, it can do strange things. However, the top temperature for the top card is about 70 C, which is reasonable enough. The real limitation on temperature now is probably just the heatsink/fans on the GPUs themselves.

But my theory of why that card had errors has more to do with the power limit rather than temperature per se. It would bump up against the power limit (as shown by GPU-Z), and so the voltage (and/or current) to the GPU core could not increase any more when the Noelias needed it. By increasing the power limit to 105% and raising the base voltage, it can supply the current when it needs it. That particular chip just fell on the wrong end of the speed/power yield curve for number crunching use, though it would be fine for other purposes. And I can re-purpose it for other use if need be; it just needs to last until the Maxwells come out.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,149,280,989
RAC: 1,048,193
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33746 - Posted: 3 Nov 2013 | 15:07:24 UTC - in response to Message 33744.
Last modified: 3 Nov 2013 | 15:12:26 UTC

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

For me, I have:
- 3 GPUs (eVGA GTX 660 Ti FTW 3GB, eVGA GTX 460 SC, NVIDIA/DELL GTS 240); the 660 Ti and the 460 are both factory overclocked, which I haven't touched
- Intel i7 965XE quad-core hyperthreaded CPU, factory overclocked to 3742 Mhz
- 1000-watt power supply (Dell XPS 730X case/system)
- The GPUs run tasks 24/7 (GPUGrid only runs on the 660 Ti and the GTX 460)... alongside CPU fully loaded with CPU tasks
- Precision-X setting the GTX 660 Ti to 140% Power Target (so it can upclock to max boost 1241 MHz without ever being limited by the 100% power limitation)
- Precision-X fan curves set up so that max GTX 660 Ti fan speed (80%) occurs before the 70*C mark (so I can keep max boost), and then max non-660-Ti speed (100%) occurs at 85*C (I've had no problems with GPUs running that hot)
- System fans set up to assist in keeping the 660 Ti nearly always below 70*C (whereas the default system fan settings would have the 660 Ti climb to 82*C even at its max 80% fan speed, as the system runs quite hot)
- Normal temps for my "fully-loaded 24/7" system:
CPU cores: 77-85*C
GTX 660 Ti: 67-71*C
GTX 460: 66-75*C
GTS 240: 75-80*C
- BOINC/GPUGrid errors that appear to be caused by any sort of hardware problem: NONE

Jim1348
Send message
Joined: 28 Jul 12
Posts: 462
Credit: 1,130,762,168
RAC: 13,533
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 33748 - Posted: 3 Nov 2013 | 15:21:53 UTC - in response to Message 33746.

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

All semiconductor manufacturers create yield curves for their production lots. These show how much voltage/current it takes to achieve a given speed. In general, the more power you supply to the chip, the faster it can be clocked. Of course, it also gets hotter, which can eventually destroy the chip. That is why a power limit is also specified (e.g., 95 watts for some Intel CPUs). But the chips vary: some can run fast at lower power, and some require higher power to achieve the same speeds. You can get errors for a variety of reasons, with temperature being just one. But I have seen errors even below 70 C, so some other limitation may get you first.
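As a toy illustration of that yield-curve trade-off (the numbers are invented; the P ~ C·V²·f scaling is a generic CMOS rule of thumb, not measured data):

```python
# CMOS dynamic power scales roughly as P ~ C * V^2 * f, while the
# maximum stable frequency grows roughly linearly with voltage.
# A chip from the weak end of the yield curve needs more voltage for
# the same clock and therefore burns noticeably more power.
def dynamic_power(c_eff, voltage, freq_mhz):
    """Relative dynamic power (arbitrary units)."""
    return c_eff * voltage**2 * freq_mhz

# Same 1100 MHz target: a good chip at 1.05 V vs a weak chip at 1.15 V.
good = dynamic_power(1.0, 1.05, 1100)
weak = dynamic_power(1.0, 1.15, 1100)
print(f"weak chip draws {weak / good:.0%} of the good chip's power")
```

With these made-up voltages the weak chip draws about 20% more power at the same clock, which is the kind of gap that pushes a card into its power limit.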

GoodFodder
Send message
Joined: 4 Oct 12
Posts: 53
Credit: 333,467,496
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwat
Message 33786 - Posted: 6 Nov 2013 | 11:31:02 UTC

'New' (old?) 94x4-NOELIA_1MG_RUN4 very long-running (over 24 hrs).

Hope GPUGrid is not returning to these ridiculously large WUs again?

If so, I suspect the volunteer base is going to head downwards - can't they be split up?

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 589
Credit: 2,041,855,275
RAC: 1,485,885
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33787 - Posted: 6 Nov 2013 | 12:13:13 UTC - in response to Message 33786.
Last modified: 6 Nov 2013 | 12:25:26 UTC

'New' (old?) 94x4-NOELIA_1MG_RUN4 very long-running (over 24 hrs).

Hope GPUGrid is not returning to these ridiculously large WUs again?

If so, I suspect the volunteer base is going to head downwards - can't they be split up?




You will struggle with this type of WU because one of your cards only has 1 GIG of memory and this Noelia unit uses 1.3 GIG (though it doesn't use much CPU). It will probably make any computer with a 1 GIG card unresponsive.

I agree that the project is shooting itself in the foot by just dumping these WUs on machines that can't chew them.

http://www.gpugrid.net/forum_thread.php?id=3523

wdiz
Send message
Joined: 4 Nov 08
Posts: 20
Credit: 871,871,594
RAC: 2
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33795 - Posted: 8 Nov 2013 | 16:39:16 UTC - in response to Message 33786.

'New' (old?) 94x4-NOELIA_1MG_RUN4 very long-running (over 24 hrs).

Hope GPUGrid is not returning to these ridiculously large WUs again?

If so, I suspect the volunteer base is going to head downwards - can't they be split up?


Same here, with a GTX 680 or GTX 580 - very long crunch!!!

Profile dskagcommunity
Avatar
Send message
Joined: 28 Apr 11
Posts: 456
Credit: 810,073,458
RAC: 0
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwat
Message 33796 - Posted: 8 Nov 2013 | 17:02:52 UTC
Last modified: 8 Nov 2013 | 17:05:43 UTC

Oh, it's not only me again... 32 hours... 560 Ti 448-core, 1.28 GB -_-
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Crunching for my deceased Dog who had "good" Braincancer..

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33797 - Posted: 8 Nov 2013 | 23:49:57 UTC - in response to Message 33795.
Last modified: 8 Nov 2013 | 23:51:15 UTC

My 35x5-NOELIA_1MG_RUN4-2-4-RND8673_0 running on a GTX660Ti is at 34% and took 4h22min. So it should complete in about 13h (Win7x64).
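That "about 13h" figure is just a linear extrapolation from percent-complete and elapsed time; a quick sketch (helper name made up):

```python
# Linear extrapolation of total runtime from percent-complete.
def eta_hours(percent_done, elapsed_hours):
    """Estimate total runtime assuming a constant progress rate."""
    return elapsed_hours / (percent_done / 100.0)

elapsed = 4 + 22 / 60                     # 4 h 22 min
print(f"{eta_hours(34, elapsed):.1f} h")  # ~12.8 h, i.e. "about 13h"
```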

If you get a task that has been running too long, check the temps, GPU usage etc., and do a system shutdown and restart if something looks wrong (e.g. temps too low).

Note that NOELIA_1MG tasks may not be similar to NOELIA_INS1p tasks.
PS. Noticed that some of NOELIA's tasks now use a full CPU core/thread (but others still don't).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jeremy Zimmerman
Send message
Joined: 13 Apr 13
Posts: 61
Credit: 726,605,417
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwat
Message 33799 - Posted: 9 Nov 2013 | 0:21:19 UTC - in response to Message 33797.

These NOELIA_1MG are about

8-9 hours on GTX680 with 2Gb Memory
http://www.gpugrid.net/result.php?resultid=7444885
http://www.gpugrid.net/result.php?resultid=7443692

and around 34 hours on GTX460 with 1Gb Memory
http://www.gpugrid.net/result.php?resultid=7440928

Same thing happened at the 768 MB / 1024 MB division in the past. Now we move past the 1024 MB minimum for some WUs.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 589
Credit: 2,041,855,275
RAC: 1,485,885
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33801 - Posted: 9 Nov 2013 | 0:51:48 UTC - in response to Message 33797.

My 35x5-NOELIA_1MG_RUN4-2-4-RND8673_0 running on a GTX660Ti is at 34% and took 4h22min. So it should complete in about 13h (Win7x64).

If you get a task that has been running too long. Check the temps, GPU usage... and do a system shut down and restart if something looks wrong (temps too low).

Note that NOELIA_1MG tasks may not be similar to NOELIA_INS1p tasks.
PS. Noticed that some of NOELIA's tasks now use a full CPU core/thread (but others still don't).



On a GTX 660 Ti with 2 GB memory, NO PROBLEM - but this post is all about cards with less than 2 GB.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,835,617,224
RAC: 310,037
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33806 - Posted: 9 Nov 2013 | 10:15:54 UTC - in response to Message 33801.
Last modified: 9 Nov 2013 | 10:17:29 UTC

The NOELIA_1MG WU I'm presently running is using 1.2GB GDDR5, so it wouldn't do well on a 1GB card.

Cards impacted by this would be anything at or below 1GB, and possibly other cards under some conditions. This includes:
Most versions of the GT 440 and GTS450, all versions of the GTX460 and GTX465.
The GT 545 (GDDR5 version), some GTX550Ti’s, some GTX560’s and some GTX560Ti's
Some GT 640's, the GTX 645, some GTX650's and GTX650Ti's.

The 1280MB cards that might be impacted are the GTX470, GTX560Ti448 and GTX570.
Would be interesting to know how much GDDR was being used on the different operating systems (XP, Linux, Vista, W7, W8).

Note that where larger memory versions exist, they tend to be more expensive so not many people buy them.
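A quick pre-flight check along these lines (the ~1.2 GB figure comes from the observation above; the card list, margin, and function are illustrative):

```python
# Compare a card's memory against the ~1.2 GB this NOELIA_1MG batch
# was observed using. Cards with little headroom (e.g. 1280 MB) may
# still be impacted under some conditions, so flag them as marginal.
WU_MEM_MB = 1200

def verdict(mem_mb, wu_mb=WU_MEM_MB):
    if mem_mb < wu_mb:
        return "too small"
    if mem_mb < wu_mb * 1.2:    # less than 20% headroom
        return "marginal"
    return "ok"

cards_mb = {
    "GTX 460 768MB": 768,
    "GTX 460 1GB": 1024,
    "GTX 570 1280MB": 1280,
    "GTX 660 Ti 2GB": 2048,
}

for name, mem in cards_mb.items():
    print(f"{name}: {verdict(mem)}")
```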
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Dagorath
Send message
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33812 - Posted: 9 Nov 2013 | 13:18:58 UTC - in response to Message 33806.

Would be interesting to know how much GDDR was being used on the different operating sytsems (XP, Linux, Vista, W7, W8).


I'm not sure if we're talking the same error here but I had potx234-NOELIA_INS1P-12-14-RND6963_0 fail on my 660Ti with 3 gig mem on Linux, driver 331.17, more details here.

That task also failed on this host (1 gig, Linux, 560Ti, driver unknown) but succeeded on this host (2 gig, Win7, 2 x 680).

I've had 4 other Noelia run successfully on my 660Ti on Linux.

____________
BOINC <<--- credit whores, pedants, alien hunters

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33821 - Posted: 10 Nov 2013 | 14:38:07 UTC - in response to Message 33748.

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

All semiconductor manufacturers create yield curves for their production lots. They show how much voltage/current it takes to achieve a given speed. In general, the more power you supply to the chip, the faster it can be clocked. Of course, it also gets hotter, which can eventually destroy the chip. That is why a power limit is also specified (e.g., 95 watts for some Intel CPUs, etc.). But the chips vary, with some being able to run fast at lower power, and some requiring higher power to achieve the same speeds. You can get errors due to a variety of reasons, with temperature being just one. But I have seen errors even below 70 C, so some other limitation may get you first.


Hi Jacob.. I suppose you wouldn't mind going a bit deeper?

To make a transistor switch (at a very high level) you apply a voltage which in turn pulls electrons through the channel (or "missing electrons" aka holes in the other direction). This physical movement of charge carriers is needed to make it switch. And it takes some time, which ultimately limits the clock speeds a chip can reach. This is where temperature and voltage must be considered.

The voltage is a measure of how hard the electrons are pulled, or how quickly they're accelerated. That's why the maximum achievable (error-free) frequency scales approximately linearly with voltage.

Temperature is a measure for the vibrations of the atomic lattice. Without any vibrations the electrons wouldn't "see" the lattice at all. The atoms (in a single crystal) are forming a perfectly periodic potential landscape, through which the electrons move as waves. If this periodic structure is disturbed (e.g. by random fluctuations caused by temperature > 0 K), the electrons scatter with these perturbations. This slows their movement down and heats the lattice up (like in a regular resistor).

In a real chip there are chains of transistors which all have to switch within each clock cycle; in CPUs each stage of the pipeline is such a domain. If individual transistors switch too slowly, the computation result will not have reached the output stage of that domain by the time the next clock cycle is triggered. The old result (or something in between, depending on how the result is composed) will be used as the input for the next stage, and a computation error will have occurred. That's why timing analysis is so important when designing a chip - the slowest path limits the overall clock speed the chip can achieve.
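A numeric sketch of that critical-path argument (the delays are invented; real timing analysis is done per-gate by EDA tools):

```python
# Each pipeline stage is a chain of gate delays; the slowest stage
# (the critical path) sets the highest clock the chip can run at.
def max_clock_mhz(stage_delays_ns):
    critical_path_ns = max(stage_delays_ns)
    return 1000.0 / critical_path_ns   # period in ns -> frequency in MHz

stages_ns = [0.62, 0.71, 0.93, 0.55]
print(f"max stable clock ~{max_clock_mhz(stages_ns):.0f} MHz")

# Heat or low voltage slows every transistor, stretching these delays;
# running above the new (lower) safe clock risks the stale-result
# errors described above.
hot_ns = [d * 1.10 for d in stages_ns]
print(f"when 10% slower: ~{max_clock_mhz(hot_ns):.0f} MHz")
```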

And putting it all together, it should be clearer now how increased temperature and too low a voltage can lead to errors. And to get a bit closer to reality: the real switching speed of each transistor is affected by many more factors, including fabrication tolerances, non-fatal defects (which also scatter electrons and hence slow them down as well), and defects developed from operating the chip under prolonged load (at high temperature and voltage).

At this point I can hand over to Jim: the manufacturer profiles their chips and determines proper working points (clock speed & voltage at the maximum allowed temperature). Depending on how carefully they do this (e.g. Intel usually allows for plenty of headroom, whereas factory-OC'ed GPUs have occasionally been set too aggressively), things work out just normally... or the end user could see calculation errors. Mostly these will only appear under unusual workloads (which weren't tested for) or after significant chip degradation. Or just due to bad luck which wasn't caught by the initial IC error testing (which is rare, luckily). Hope this helps :)

MrS
____________
Scanning for our furry friends since Jan 2002

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,149,280,989
RAC: 1,048,193
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33823 - Posted: 10 Nov 2013 | 16:18:22 UTC
Last modified: 10 Nov 2013 | 16:19:50 UTC

It does help, thank you very much for the detailed explanations. I've read through it once, and I'll have to read through it again a few more times for it to sink in. I actually studied Computer Engineering for a few years before switching over to a Bachelor's degree in Computer Information Systems.

But, I still don't quite understand one other thing.

When you overclock a CPU too far, you usually get a BSOD (presumably because the execution pointer is off in no-man's land, or because the data got jacked up in the pipeline, or both), right? But what about going "too far" on a GPU?

The scenario I'm looking to better define is: overclocking or overheating a GPU far enough to cause GPUGrid problems, but not far enough to cause Windows problems or driver resets. For these BOINC Computation Errors to occur, that scenario must exist, right? Why doesn't Windows catch this and explain the error to the user?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,855,924
RAC: 450,091
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33847 - Posted: 12 Nov 2013 | 21:34:58 UTC - in response to Message 33823.

Windows can't catch these calculation errors because, frankly, it doesn't see them. The GPU-Grid app sends some commands to the GPU, the GPU processes something and returns results to the app. Unless the GPU behaves differently in some way (doesn't respond any more etc.), there's no way for the OS to tell whether the returned data is correct or garbage. Not even GPU-Grid can know this, unless they already know the result... but they can check their results for sanity, and, luckily for us, errors often have either no effect (on the long-term simulation result) or catastrophic ones.

I suppose molecular dynamics is comparatively tolerant of single calculation errors. Imagine it this way: if a force is calculated too large in one time step, and as a result an atom is moved further than it should be in timestep n, then it will likely get too close to other atoms in timestep n+1 and hence receive a greater repelling force than it would have in the correct position. Thus small errors don't build up over time. Not sure it really works like this... but I think Matt once said something which sounded to me like this :)
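That hand-wavy argument can be mimicked with a toy model: a single particle in a harmonic well, integrated with semi-implicit Euler (nothing to do with the actual ACEMD integrator; all numbers invented):

```python
# One particle held near equilibrium by a restoring force. A single
# miscalculated force (the "error") perturbs the trajectory, but the
# restoring force keeps it bounded instead of letting it run away.
def simulate(steps, dt=0.01, k=50.0, error_at=None):
    x, v = 1.0, 0.0
    for i in range(steps):
        f = -k * x                    # force pulling back toward x = 0
        if i == error_at:
            f *= 2.0                  # one-step calculation error
        v += f * dt                   # semi-implicit Euler update
        x += v * dt
    return x

clean = simulate(2000)
glitched = simulate(2000, error_at=500)
# Both trajectories stay bounded; the error shifts the path slightly
# but does not blow up the simulation.
print(abs(clean), abs(glitched))
```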

MrS
____________
Scanning for our furry friends since Jan 2002

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 402
Credit: 169,933,246
RAC: 309,519
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33975 - Posted: 22 Nov 2013 | 1:26:42 UTC - in response to Message 33821.

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

All semiconductor manufacturers create yield curves for their production lots. They show how much voltage/current it takes to achieve a given speed. In general, the more power you supply to the chip, the faster it can be clocked. Of course, it also gets hotter, which can eventually destroy the chip. That is why a power limit is also specified (e.g., 95 watts for some Intel CPUs, etc.). But the chips vary, with some being able to run fast at lower power, and some requiring higher power to achieve the same speeds. You can get errors due to a variety of reasons, with temperature being just one. But I have seen errors even below 70 C, so some other limitation may get you first.


[snip]

MrS


Something I've read that seems relevant to this explanation:

Today's CPU chips are approaching the lower limit of the voltages at which the transistors work properly, so the power used by each CPU core can't get much lower. Instead, the companies are increasing total throughput by putting more CPU cores in each CPU package. Intel is also using another method - hyperthreading. It gives each CPU core two sets of registers, so that while memory operations are pending for the program using one set, the core can use the other set to run a second program. This makes the CPU act as if it had twice as many cores as it actually does.

If a programmer wants to use more than one of these CPU cores at the same time for the same program, that programmer must first study parallel programming, in order to handle the communication between the different CPU cores properly.

I used to be an electronic engineer, specializing in logic simulation, often including timing analysis.

MrJo
Send message
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 309
Level
Met
Scientific publications
watwatwatwatwat
Message 36876 - Posted: 20 May 2014 | 10:25:04 UTC
Last modified: 20 May 2014 | 10:26:16 UTC

Just crunched my first one on a GTX 770 at 76° in 31,145.32 s. Nice 153,150.00 points ;-)
____________
Regards, Josef

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 36877 - Posted: 20 May 2014 | 11:22:37 UTC - in response to Message 36876.

Just crunched my fist one on a GTX 770 at 76° in 31,145.32. Nice 153,150.00 Points ;-)

You did indeed finish a Noelia WU, but not this one - it was the new type: NOELIA_BI.
More importantly, your 770 can do better: mine finishes these new Noelias in about 27,000 seconds, but its temperature is only 66-67°C. The cooler a GPU runs, the faster (and more error-free) it runs. So perhaps you can experiment with some settings to get the temperature a few degrees lower.
____________
Greetings from TJ

MrJo
Send message
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 309
Level
Met
Scientific publications
watwatwatwatwat
Message 36880 - Posted: 20 May 2014 | 13:31:13 UTC - in response to Message 36877.

your 770 can do better, mine finishes these new Noelia's in about 27000 seconds, but temperature is only 66-67°C.


Thanks for your advice. To lower the temperature, I'm using NVIDIA Inspector with the following settings:

I unchecked Auto-Fan and set the fan to 60%, which speeds it up from 1300 to 1770 rpm - still ear-friendly. But that reduces the temperature by only 3 degrees, so I also have to check the Prioritize Temperature box and put the slider to 68°, which slows down the GPU clock a little bit. Is there a better approach?




____________
Regards, Josef

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,887,629
RAC: 414,517
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 36883 - Posted: 21 May 2014 | 14:49:46 UTC - in response to Message 36877.
Last modified: 21 May 2014 | 14:50:16 UTC

But more important, your 770 can do better, mine finishes these new Noelia's in about 27000 seconds, but temperature is only 66-67°C. And the colder a GPU runs, the faster (and more error free) it does. So perhaps you can experiment with some settings to get the temperature a few degrees lower.

You might have accidentally been looking at your 780 Ti. Here are your 3 Noelia results from the 770 so far:

# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29643.715 s

# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29572.861 s

# GPU [GeForce GTX 770] Platform [Windows] Rev [3301M] VERSION [42]
# Approximate elapsed time for entire WU: 29676.489 s

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 36886 - Posted: 21 May 2014 | 17:26:50 UTC - in response to Message 36883.

You are absolutely correct Beyond, my mistake.
Sorry for that MrJo.

Still a difference of 2000 seconds. I had never seen NVIDIA Inspector before; I use PrecisionX from EVGA or MSI's Afterburner. I have set a fan curve that goes to 100% at 70°C, but the card is allowed to go to 75°C before the program must throttle the GPU clock. Power target is set to 100%. Currently, with an ambient temperature of 32.6°C, the 770 runs at 68°C and 1149 MHz. It sits in the second slot; the first is occupied by the 780 Ti.
Hope this helps a bit.
____________
Greetings from TJ

MrJo
Send message
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 309
Level
Met
Scientific publications
watwatwatwatwat
Message 36888 - Posted: 22 May 2014 | 5:22:01 UTC

Now I have tested MSI Afterburner. There you can set a custom fan curve. However, I have a problem with that: in order to lower the temperature by 3-4°C, the fan speed increases to 3300 rpm, which is unpleasant. With my GTX 680 I was able to reduce the temperature by 8 degrees after I dismounted the cooler and renewed the thermal paste ;-) Unfortunately, the same procedure on the GTX 770 achieved nothing, since its thermal paste was not dried out. Too new ;-) So I will reduce the GPU clock a little bit to remain below 70 degrees. Reducing from 1150 MHz to 1080-1100 reduces the temperature by 5 degrees.
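The custom fan curve being tuned here is just piecewise-linear interpolation between (temperature, fan %) breakpoints; a sketch with invented breakpoints:

```python
# (temp °C, fan %) breakpoints, roughly the shape Afterburner exposes.
CURVE = [(40, 30), (60, 45), (70, 80), (80, 100)]

def fan_percent(temp_c, curve=CURVE):
    """Linearly interpolate fan speed between breakpoints."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    for (t0, f0), (t1, f1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
    return curve[-1][1]

print(fan_percent(65))   # halfway up the 60-70 °C segment
```

Steepening the segment around the target temperature (here 60-70 °C) is what trades noise for the last few degrees.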
____________
Regards, Josef

GoodFodder
Send message
Joined: 4 Oct 12
Posts: 53
Credit: 333,467,496
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwat
Message 37368 - Posted: 23 Jul 2014 | 8:06:38 UTC

Hi,

potx1x225-NOELIA_INSP-5-13-RND8250_1:

Had an odd error - the task failed within 3 secs. Hopefully it is a one-off and not a bad batch; however, in case it is not:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

23:01:01 (3684): called boinc_finish


http://www.gpugrid.net/result.php?resultid=12864293

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 340
Credit: 3,823,142,609
RAC: 948,763
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 37369 - Posted: 23 Jul 2014 | 10:01:55 UTC - in response to Message 37368.

Hi,

potx1x225-NOELIA_INSP-5-13-RND8250_1:

Had an odd error - the task failed within 3 secs. Hopefully it is a one-off and not a bad batch; however, in case it is not:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

23:01:01 (3684): called boinc_finish


http://www.gpugrid.net/result.php?resultid=12864293




I had the same error in 4 units so far. Here is an example of one:


potx1x492-NOELIA_INSP-3-13-RND4560_6
Workunit 9908013
Created 22 Jul 2014 | 19:40:28 UTC
Sent 22 Jul 2014 | 21:46:12 UTC
Received 22 Jul 2014 | 23:03:18 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number
Computer ID 127986
Report deadline 27 Jul 2014 | 21:46:12 UTC
Run time 4.05
CPU time 2.06
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.41 (cuda60)
Stderr output

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r337_00 : 33788
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

19:03:38 (5576): called boinc_finish

</stderr_txt>
]]>



http://www.gpugrid.net/result.php?resultid=12864314



Profile Grubix
Send message
Joined: 26 Sep 08
Posts: 4
Credit: 321,147,075
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 37370 - Posted: 23 Jul 2014 | 10:14:07 UTC

Same error here:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile


potx1x284-NOELIA_INSP-2-13-RND0923 : WU 9908067

potx1x225-NOELIA_INSP-5-13-RND8250 : WU 9907982

Bye, Grubix.

Vagelis Giannadakis
Send message
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwat
Message 37372 - Posted: 23 Jul 2014 | 10:20:25 UTC

This error does not affect NOELIAs only, I had a SANTI_p53final fail on me the other day with the exact same error:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

____________

Post to thread

Message boards : News : WU: NOELIA_INS1P