
Message boards : News : WU: OPM simulations

Stefan (Volunteer moderator, Project developer, Project scientist)
Message 43305 - Posted: 3 May 2016 | 15:08:28 UTC
Last modified: 3 May 2016 | 15:54:17 UTC

Hi everyone.
Once again after a few years I have decided to simulate ;)

This is for a new project where I try to automatically build and simulate hundreds of membrane proteins from the OPM database http://opm.phar.umich.edu/ as a proof of concept.
I will also keep an eye out for anything fancy happening in those simulations, which I am quite sure it will.

Membrane proteins are very interesting in general as they are involved in many diseases and are simulated quite often. So building a protocol to simplify this process would help lots of biologists.

For the moment I sent out just 100 simulations as a quick test, but in the next few days I will send out around 3000 equilibrations.

Beware! The simulations are very, very different, so there is no point in comparing them to each other. I have some systems which run on the short queue in an hour and some monstrous ones which run on long for 10 hours. The credits are scaled according to simulation complexity, so no one will be cheated of their hard-earned credits :)

Also keep in mind that the first batch will be minimization/equilibration simulations, which start with a CPU minimization of maybe 1 minute and then should switch to the GPU for the equilibration.

I might post here some pictures of the systems as well because some of these proteins are just gorgeous!

Edit: I made some gifs for you here ;)
http://imgur.com/a/YbjOd

Erich56
Message 43307 - Posted: 3 May 2016 | 16:51:47 UTC - in response to Message 43305.


> For the moment I sent out just 100 simulations as a quick test

ha, I was lucky to get 2 of them for crunching

Trotador
Message 43308 - Posted: 3 May 2016 | 18:43:54 UTC

Nice, impressive structures to simulate.

What specific aspects of the proteins will the simulation address?

Stefan (Volunteer moderator, Project developer, Project scientist)
Message 43311 - Posted: 4 May 2016 | 8:59:47 UTC - in response to Message 43308.

For the moment we are just checking whether the automatic preparation can produce working simulations. Once the simulations are completed I will check for any big changes that happened (changes in protein structure, detachment from the membrane) and, if possible, try to determine what caused those changes.

Richard Haselgrove
Message 43319 - Posted: 6 May 2016 | 10:15:03 UTC

I see tasks ready to send on the server status page :-)

Stefan (Volunteer moderator, Project developer, Project scientist)
Message 43328 - Posted: 9 May 2016 | 8:36:44 UTC

I am sending them now. Please report any serious issues here.
I had to double the runtime of the simulations due to their tricky nature so now we have some which are quite long (up to 24 hours).

What we decided is that runs which are under 6 hours go to short with normal credits. Simulations with 6-18 hours runtime get 2x normal credits. Simulations with 18-24 hours runtime get 4x normal credits, since you will likely miss the day-return bonus.
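The tiering described above can be sketched as a small helper. This is just my reading of the post, not actual GPUGrid server code; the function name and exact boundary handling are assumptions:

```python
def credit_multiplier(projected_hours: float) -> float:
    """Credit scaling by projected runtime on the reference GPU.

    Hypothetical sketch of the tiers described in this post:
      under 6 h -> short queue, normal credit (1x)
      6-18 h    -> 2x normal credit
      over 18 h -> 4x normal credit (day-return bonus likely missed)
    """
    if projected_hours < 6:
        return 1.0
    if projected_hours < 18:
        return 2.0
    return 4.0
```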

Betting Slip
Message 43329 - Posted: 9 May 2016 | 8:57:56 UTC - in response to Message 43328.
Last modified: 9 May 2016 | 9:15:46 UTC

OPM996 is classed as a Short Run, but on a 980ti it seems it will take around 7 hours.

Yet I have it on 2 other machines as a Long Run

Only completed 6% in 30 minutes at 61% GPU utilization

frederikhk
Message 43330 - Posted: 9 May 2016 | 9:19:32 UTC

Server Status calls it Short Run but in BOINC under Application it's Long Run. Got 2 already running fine.

Betting Slip
Message 43331 - Posted: 9 May 2016 | 9:22:25 UTC - in response to Message 43330.

On my 980ti it is running as a short run and will take 7 hours or longer to complete at the current rate.

Stefan (Volunteer moderator, Project developer, Project scientist)
Message 43332 - Posted: 9 May 2016 | 9:54:06 UTC

The runs were tested on 780s. Simulations that run under 6 hours on a 780 are labeled as short.

Are you sure you are not running two simulations at once on the 980s?

Stefan (Volunteer moderator, Project developer, Project scientist)
Message 43333 - Posted: 9 May 2016 | 9:55:13 UTC
Last modified: 9 May 2016 | 9:56:31 UTC

I am sending both short and long runs in this project, as I mentioned. They are split accordingly depending on runtime.

OPM996 is the project code. If you want to ask me about a specific job, please refer to the job name, like 2id4R1, and then I can tell you specifics about that protein simulation.

Betting Slip
Message 43334 - Posted: 9 May 2016 | 9:58:28 UTC - in response to Message 43332.
Last modified: 9 May 2016 | 10:00:04 UTC

No, I am only running one at a time, and it has only completed 15.57% after 1hr 11min on a 980ti at 60% utilization.

The projected time is increasing, and it is running under the short run app.

2K1aR7
2K1aR5

Stefan (Volunteer moderator, Project developer, Project scientist)
Message 43336 - Posted: 9 May 2016 | 10:09:42 UTC - in response to Message 43334.

You got one of the edge cases, apparently. 2k1a has a projected runtime of exactly 6.0 hours on my 780, so it got put in the short queue, since my condition was ">6 hours goes to long".
These simulations have a short (maybe 1 minute) CPU minimization at the beginning, so maybe that is throwing off the estimated completion time?
I am not totally sure.

Betting Slip
Message 43337 - Posted: 9 May 2016 | 10:15:08 UTC - in response to Message 43336.

I am not working from what BOINC estimates but from real time, and it will take between 7 and 8 hrs.

But never mind HeyHo

skgiven (Volunteer moderator, Project tester, Volunteer tester)
Message 43338 - Posted: 9 May 2016 | 17:24:29 UTC - in response to Message 43337.
Last modified: 10 May 2016 | 7:28:03 UTC

BOINC will estimate completion time based on the GFlops you put against the task, but if previous GFlops were not accurate, the estimate will not be accurate either.
You have 5M GFlops against these tasks.
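To first order, BOINC derives its initial estimate by dividing the task's declared operation count by the speed it credits the device with. A simplified sketch (the real client also applies per-app correction factors; the 100 GFLOPS figure below is an illustrative assumption, not a measured value):

```python
def estimated_runtime_hours(rsc_fpops_est: float, device_flops: float) -> float:
    """First-order BOINC-style estimate: task size divided by device speed.

    rsc_fpops_est -- declared floating-point operations for the task
    device_flops  -- sustained FLOPS the client credits the device with
    """
    return rsc_fpops_est / device_flops / 3600.0

# 5 million GFlops (5e15 ops) against a device credited with 100 GFLOPS
# sustained works out to roughly 13.9 hours.
hours = estimated_runtime_hours(5e15, 1e11)
```

This is why an inflated or deflated GFlops figure propagates straight into a wrong runtime estimate.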

I have 2 WU's running, one on each of my GTX970's.
My estimated run time is 32h based on 9% completion and almost 3h of runtime.

Most clocks are reasonable but the power is questionable on one card:
GPU0: 10% Power, 1291MHz/1304MHz, 7GHz GDDR5, MCU 40% (slightly high), Power 71% (P2 state), 1.131V, Temp 69C. CPU (i7-3770K) at ~75% usage.
GPU1: 59% GPU Power (MSI Afterburner), 69C, GPU Usage 75%, 1075MHz.

GPU memory usage is 1.287GB and 1.458GB (quite big).
My RAM and page file usage are quite big too (might be other apps though).

I don't see how the 1 min of initial CPU usage would drastically affect the estimated runtime.

Will drop the CPU usage to ~60% and give it a spring clean before starting up to see if that improves things.
- update - that allowed the cards to run at slightly higher clocks (and normally) but it's now only looking like 31.5h. That might yet drop a few hours but these are still extra long tasks.
- update - still looking like about 31.5h (55% after 17h and 51% after ~16h).

2b5fR8-SDOERR_opm996-0-1-RND0199_0
2b5fR3-SDOERR_opm996-0-1-RND0810_0
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Erich56
Message 43341 - Posted: 9 May 2016 | 20:10:05 UTC

On my GTX980Ti's, the former long runs were running about 20,000 seconds, yielding 255,000 points.

Today's 2dmgR0-SDOERR_opm996-0-1-RND3256_0 was running 39,000 seconds, yielding 247,050 points.

How come?

Bedrich Hajek
Message 43342 - Posted: 9 May 2016 | 21:39:49 UTC - in response to Message 43341.

> On my GTX980Ti's, the former long runs were running about 20,000 seconds, yielding 255,000 points.
>
> Today's 2dmgR0-SDOERR_opm996-0-1-RND3256_0 was running 39,000 seconds, yielding 247,050 points.
>
> How come?


This is what I got:

2jp3R4-SDOERR_opm996-0-1-RND2694_0 11593350 9 May 2016 | 9:19:35 UTC 9 May 2016 | 21:02:28 UTC Completed and validated 40,147.05 39,779.98 205,350.00 Long runs (8-12 hours on fastest card) v8.48 (cuda65)

1cbrR7-SDOERR_opm996-0-1-RND5441_0 11593203 9 May 2016 | 9:19:35 UTC 9 May 2016 | 21:32:36 UTC Completed and validated 41,906.29 41,465.53 232,200.00 Long runs (8-12 hours on fastest card) v8.48 (cuda65)




Skyler Baker
Message 43344 - Posted: 9 May 2016 | 23:46:45 UTC

Yeah, they're pretty long, my 980ti is looking like it'll take about 20 hours at 70-75% usage. Don't mind the long run, I'm glad to have work to crunch, though I do hope the credit is equivalently good.

Erich56
Message 43345 - Posted: 10 May 2016 | 3:10:38 UTC - in response to Message 43344.

> ... I'm glad to have work to crunch

me too :-)

zioriga
Message 43347 - Posted: 10 May 2016 | 9:26:02 UTC

I crunched a WU lasting 18.1 hours for 203,850 credits on my GTX970.

This credit ratio is lower than previous "Long run" WUs.

Previously I completed "Long run" WUs in 12 hours for 255,000 credits.

Richard Haselgrove
Message 43348 - Posted: 10 May 2016 | 10:10:24 UTC - in response to Message 43347.

> I crunched a WU lasting 18.1 hours for 203,850 credits on my GTX970.
>
> This credit ratio is lower than previous "Long run" WUs.
>
> Previously I completed "Long run" WUs in 12 hours for 255,000 credits.

Similar experience here with

4gbrR2-SDOERR_opm996-0-1-RND1953
3oaxR8-SDOERR_opm996-0-1-RND1378

Stefan (Volunteer moderator, Project developer, Project scientist)
Message 43349 - Posted: 10 May 2016 | 10:33:14 UTC
Last modified: 10 May 2016 | 10:36:38 UTC

Seems like I might have underestimated the real runtime. We use a script that calculates the projected runtime on my 780s, so it seems it's a bit too optimistic in its estimates :/ Longer WUs (projected time over 18 hours) should give 4x credits though.

I am sorry if it's not comparable to previous WUs; it's the best we can do given the tools. But I think it's not a huge issue, since the WU group should be finishing in a day or two. I will take the underestimation into account next time I send out WUs. Thanks for pointing it out, and I hope it's not too much of a bother.

Edit: Gerard might be right. He mentioned that since these are equilibrations they also use the CPU, so the difference between estimated time and real time could be due to that. I think only Noelia and Nate had some experience with equilibrations here on GPUGrid, but they are not around anymore. I will keep it in mind when I send more jobs.

zioriga
Message 43351 - Posted: 10 May 2016 | 12:10:50 UTC

OK, Thanks Stefan

nanoprobe
Message 43352 - Posted: 10 May 2016 | 12:44:54 UTC - in response to Message 43329.
Last modified: 10 May 2016 | 12:47:28 UTC

> OPM996 is classed as a Short Run, but on a 980ti it seems it will take around 7 hours.
>
> Yet I have it on 2 other machines as a Long Run.
>
> Only completed 6% in 30 minutes at 61% GPU utilization.

I received several "short" tasks on a machine set to run only short tasks. They took 14 hours on a 750Ti; short tasks in the past only took 4-5 hours on the 750. May I suggest that these be reclassified as long tasks and not sent to short-task-only machines.

manalog
Message 43353 - Posted: 10 May 2016 | 13:18:31 UTC - in response to Message 43352.

I'm computing the WU 2kogR2-SDOERR_opm996-0-1-RND7448, but after 24hrs it is still at 15% (750ti)... Please don't tell me I have to abort it!

Skyler Baker
Message 43354 - Posted: 10 May 2016 | 17:31:08 UTC

Some of them are pretty big. Had one task take 20 hours. Long tasks usually finish in around 6.50-7 hours, for reference.

eXaPower
Message 43355 - Posted: 10 May 2016 | 17:57:34 UTC

> [...] I have 2 WU's (2b5fR8 & 2b5fR3) running, one on each of my GTX970's. [...]
> Will drop the CPU usage to ~60% and give it a spring clean before starting up to see if that improves things.
> - update - that allowed the cards to run at slightly higher clocks (and normally) but it's now only looking like 31.5h. That might yet drop a few hours but these are still extra long tasks.
> - update - still looking like about 31.5h (55% after 17h and 51% after ~16h).

With no other CPU WU's running, I've recorded ~10% CPU usage for each of the two GTX970 WUs, while a GTX750 WU uses ~6% on my quad-core 3.2GHz Haswell system (3 WUs = ~26% total CPU usage).

My GPUs' current estimated total runtimes for the opm996 long WUs, with completion rates based on 12~24 hours of real-time crunching:

2r83R1 (GTX970) = 27hr 45min (3.600% per 1hr @ 70% GPU usage / 1501MHz)
1bakR0 (GTX970) = 23hr 30min (4.320% per 1hr @ 65% / 1501MHz)
1u27R2 (GTX750) = 40hr (2.520% per 1hr @ 80% / 1401MHz)
2I35R5 (GT650m) = 70hr (1.433% per 1hr @ 75% / 790MHz)

Newer (beta) BOINC clients introduced an accurate per-minute or per-hour progress rate feature, available in the advanced view under task properties.
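The per-hour rates listed above translate into total runtimes by simple extrapolation. A rough sketch that assumes the progress rate stays constant (which it roughly does for these WUs, judging by the updates in this thread):

```python
def projected_total_hours(pct_per_hour: float) -> float:
    """Total runtime implied by a steady progress rate (% per hour)."""
    return 100.0 / pct_per_hour

def hours_remaining(pct_done: float, pct_per_hour: float) -> float:
    """Time left at the current progress rate."""
    return (100.0 - pct_done) / pct_per_hour

# e.g. 3.600% per hour, as on the first GTX970 above, gives ~27.8 h total,
# matching the quoted 27hr 45min estimate.
total = projected_total_hours(3.6)
```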

Richard Haselgrove
Message 43356 - Posted: 10 May 2016 | 18:05:46 UTC - in response to Message 43349.
Last modified: 10 May 2016 | 18:27:45 UTC

> Seems like I might have underestimated the real runtime. We use a script that calculates the projected runtime on my 780s, so it seems it's a bit too optimistic in its estimates :/ Longer WUs (projected time over 18 hours) should give 4x credits though.

It would be hugely appreciated if you could find a way of hooking up the projections of that script to the <rsc_fpops_est> field of the associated workunits. With the BOINC server version in use here, a single mis-estimated task (I have one which has been running for 29 hours already) can mess up the BOINC client's scheduling - for other projects, as well as this one - for the next couple of weeks.
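That hookup can be written down directly: take the script's projected runtime on the reference GTX 780 and convert it back into the operation count the workunit's `<rsc_fpops_est>` field carries. The sustained-FLOPS constant below is a placeholder, not a measured value for the 780:

```python
# Hypothetical helper: convert a projected runtime on the reference GPU
# into the value that would go into the workunit's <rsc_fpops_est> field.
REF_SUSTAINED_FLOPS = 4.0e11  # placeholder sustained rate for the 780

def rsc_fpops_est(projected_hours: float,
                  sustained_flops: float = REF_SUSTAINED_FLOPS) -> float:
    """Total floating-point operations implied by the projected runtime."""
    return projected_hours * 3600.0 * sustained_flops
```

With accurate per-workunit values here, the client's estimate (size divided by device speed) would track the script's projection instead of a batch-wide constant.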

TyphooN [Gridcoin]
Message 43357 - Posted: 10 May 2016 | 19:16:55 UTC

I have noticed that the last batch of WUs is worth a lot less credit per time spent crunching, but I also wanted to report that we might have some bad WUs going out. I spent quite a lot of time on this workunit and noticed that the only returned tasks for this WU are "Error while computing." The workunit in question: https://gpugrid.net/workunit.php?wuid=11593942

It is possible that my GPU is unstable, but considering that I was crunching WUs to completion on GPUgrid before, as well as asteroids/milkyway without error, I believe there is a chance that the WU that was sent out was corrupt or bad. I will adjust my overclock if need be, but I have been crunching for weeks at these clocks with no problems until now.

Jacob Klein
Message 43358 - Posted: 10 May 2016 | 19:24:57 UTC - in response to Message 43357.
Last modified: 10 May 2016 | 19:28:30 UTC

> I have noticed that the last batch of WUs are worth a lot less credit per time spent crunching, but I also wanted to report that we might have some bad WUs going out. I spent quite a lot of time on this workunit, and noticed that the only returned tasks for this WU is "Error while computing." The workunit in question: https://gpugrid.net/workunit.php?wuid=11593942
>
> It is possible that my GPU is unstable, but considering that I was crunching WUs to completion on GPUgrid before, as well as asteroids/milkyway without error, I believe the chance that the WU that was sent out was corrupt or bad. I will adjust my overclock if need be, but I have been crunching for weeks at these clocks with no problems until now.


You've been crashing tasks for a couple weeks, including the older 'BestUmbrella_chalcone' ones. See below.
You need to lower your GPU overclock, if you want to be stable with GPUGrid tasks!
Downclock it until you never see "The simulation has become unstable."

https://gpugrid.net/results.php?hostid=319330

https://gpugrid.net/result.php?resultid=15096591
2b6oR6-SDOERR_opm996-0-1-RND0942_1
10 May 2016 | 7:15:45 UTC

https://gpugrid.net/result.php?resultid=15086372
28 Apr 2016 | 11:34:16 UTC
e45s20_e17s22p1f138-GERARD_CXCL12_BestUmbrella_chalcone3441-0-1-RND8729_0

https://gpugrid.net/result.php?resultid=15084884
27 Apr 2016 | 23:52:34 UTC
e44s17_e43s20p1f173-GERARD_CXCL12_BestUmbrella_chalcone2212-0-1-RND5557_0

https://gpugrid.net/result.php?resultid=15084715
26 Apr 2016 | 23:30:01 UTC
e43s16_e31s21p1f321-GERARD_CXCL12_BestUmbrella_chalcone4131-0-1-RND6256_0

https://gpugrid.net/result.php?resultid=15084712
26 Apr 2016 | 23:22:44 UTC
e43s13_e20s7p1f45-GERARD_CXCL12_BestUmbrella_chalcone4131-0-1-RND8139_0

https://gpugrid.net/result.php?resultid=15082560
25 Apr 2016 | 18:49:58 UTC
e42s11_e31s18p1f391-GERARD_CXCL12_BestUmbrella_chalcone2731-0-1-RND8654_0

TyphooN [Gridcoin]
Message 43359 - Posted: 10 May 2016 | 19:44:24 UTC - in response to Message 43358.

>> I have noticed that the last batch of WUs are worth a lot less credit per time spent crunching, but I also wanted to report that we might have some bad WUs going out. I spent quite a lot of time on this workunit, and noticed that the only returned tasks for this WU is "Error while computing." The workunit in question: https://gpugrid.net/workunit.php?wuid=11593942
>>
>> It is possible that my GPU is unstable, but considering that I was crunching WUs to completion on GPUgrid before, as well as asteroids/milkyway without error, I believe the chance that the WU that was sent out was corrupt or bad. I will adjust my overclock if need be, but I have been crunching for weeks at these clocks with no problems until now.
>
> You've been crashing tasks for a couple weeks, including the older 'BestUmbrella_chalcone' ones. See below.
> You need to lower your GPU overclock, if you want to be stable with GPUGrid tasks!
> Downclock it until you never see "The simulation has become unstable."
>
> https://gpugrid.net/results.php?hostid=319330
>
> https://gpugrid.net/result.php?resultid=15096591
> 2b6oR6-SDOERR_opm996-0-1-RND0942_1
> 10 May 2016 | 7:15:45 UTC
>
> https://gpugrid.net/result.php?resultid=15086372
> 28 Apr 2016 | 11:34:16 UTC
> e45s20_e17s22p1f138-GERARD_CXCL12_BestUmbrella_chalcone3441-0-1-RND8729_0
>
> https://gpugrid.net/result.php?resultid=15084884
> 27 Apr 2016 | 23:52:34 UTC
> e44s17_e43s20p1f173-GERARD_CXCL12_BestUmbrella_chalcone2212-0-1-RND5557_0
>
> https://gpugrid.net/result.php?resultid=15084715
> 26 Apr 2016 | 23:30:01 UTC
> e43s16_e31s21p1f321-GERARD_CXCL12_BestUmbrella_chalcone4131-0-1-RND6256_0
>
> https://gpugrid.net/result.php?resultid=15084712
> 26 Apr 2016 | 23:22:44 UTC
> e43s13_e20s7p1f45-GERARD_CXCL12_BestUmbrella_chalcone4131-0-1-RND8139_0
>
> https://gpugrid.net/result.php?resultid=15082560
> 25 Apr 2016 | 18:49:58 UTC
> e42s11_e31s18p1f391-GERARD_CXCL12_BestUmbrella_chalcone2731-0-1-RND8654_0


Yeah I was crashing for weeks until I found what I thought were stable clocks. I ran a few WUs without crashing and then GPUgrid was out of work temporarily. From that point I was crunching asteroids/milkyway without any error. I'll lower my GPU clocks or up the voltage and report back. Thanks!

Skyler Baker
Message 43360 - Posted: 10 May 2016 | 22:46:48 UTC


I will agree with the credit being much lower, but I'm pretty sure they can't change what tasks are worth after the fact, so I'm okay with it.

Bedrich Hajek
Message 43361 - Posted: 10 May 2016 | 23:47:45 UTC - in response to Message 43349.
Last modified: 10 May 2016 | 23:50:00 UTC

For future reference, I have a couple of suggestions:

For WUs running 18 hours +, there should be a separate category: "super long runs". I believe this was mentioned in past posts.

The future WU application version should be made less CPU dependent. The WUs are getting longer, GPUs are getting faster, but CPU speed is stagnant. Something has to give, and with the Pascal cards coming out soon, you have to put out a new version anyway. So why not do both? My GPU usage on these latest SDOERR WUs is between 70% to 80%, compared to 85% to 95% for the GERARD BestUmbrella units. I don't think this is reinventing the wheel, just updating it. This does have to be done sooner or later. The Volta cards are coming out in only a few short years.

skgiven (Volunteer moderator, Project tester, Volunteer tester)
Message 43362 - Posted: 11 May 2016 | 9:58:59 UTC - in response to Message 43361.
Last modified: 11 May 2016 | 10:01:55 UTC

> For future reference, I have a couple of suggestions:
>
> For WUs running 18 hours +, there should be a separate category: "super long runs". I believe this was mentioned in past posts.

This was previously suggested, however this issue was just an underestimate of the runtimes by the researchers, and they would normally increase the credit ratio awarded for extra-long tasks. The real issue would be another queue on the server and facilitating and maintaining that; the short queue is often empty, never mind having an extra-long queue.

> The future WU application version should be made less CPU dependent.

Generally that's the case, but ultimately this is a different type of research and it simply requires that some work be performed on the CPU (you can't do it on the GPU).

> The WUs are getting longer, GPUs are getting faster, but CPU speed is stagnant. Something has to give, and with the Pascal cards coming out soon, you have to put out a new version anyway.

Over the years WU's have remained about the same length overall. Occasionally there are extra-long tasks, but such batches are rare.
GPU's are getting faster and more adept, the number of shaders (cuda cores here) is increasing, and CUDA development continues. The problem isn't just CPU frequency (though an AMD Athlon 64 X2 5000+ is going to struggle a bit); WDDM is an 11%+ bottleneck (increasing with GPU performance), and when the GPUGrid app needs the CPU to perform a calculation it's down to the PCIE bridge and perhaps to some extent not being multi-threaded on the CPU.
Note that it's the same ACEMD app (not updated since Jan), just a different batch of tasks (batches vary depending on the research type).

> My GPU usage on these latest SDOERR WUs is between 70% to 80%, compared to 85% to 95% for the GERARD BestUmbrella units. I don't think this is reinventing the wheel, just updating it. This does have to be done sooner or later. The Volta cards are coming out in only a few short years.

I'm seeing ~75% GPU usage on my 970's too. On Linux and XP I would expect it to be around 90%. I've increased my memory to 3505MHz to reduce the MCU (~41%) @1345MHz [power 85%], ~37% @1253MHz. I've watched the GPU load and it varies continuously with these WU's, as does the power.

eXaPower
Message 43364 - Posted: 11 May 2016 | 15:29:22 UTC

The current (opm996 long) batch are 10-million-step simulations with a large variation in atom count ("nAtoms", roughly 30k to 90k).
The per-step compute time varies with atom count, which creates runtime differences on the same GPUs, so expect variable results. The WU's atom count is shown in the <stderr> file upon task completion.

Generally the number of atoms determines GPU usage: the more atoms a simulation has, the higher the GPU usage.
So far the OPM996 WUs on my 970's have shown a 10% GPU-usage difference (60~70%) between 63k and 88k atoms, for a ~5hr completion-time variation.
My current two opm996 tasks will differ by nearly ~7hr (19 & 26hr) in total runtime (69% & 62% core usage) on the 970's. The GPU with lower usage (fewer atoms) finishes its WU faster.

Completed OPM996 long WUs with >90k atoms (>24hr runtime) are rewarded with over 630,000 credits, while an 88,478-atom task (<24hr runtime) is within the 336,000 range; a 63,016-atom task is around 294,000 credits; 52,602 atoms = 236,700.
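The pattern eXaPower describes can be turned into a rough predictor by scaling a measured runtime to a different system size. This is a simplification I am adding for illustration: it assumes per-step cost grows roughly linearly with atom count, whereas PME electrostatics actually scale closer to N log N, so treat the output as a ballpark figure:

```python
def scaled_runtime_hours(natoms: int, ref_natoms: int, ref_hours: float) -> float:
    """Scale a measured runtime to a different system size, assuming
    per-step cost grows roughly linearly with atom count (ballpark only)."""
    return ref_hours * natoms / ref_natoms

# If an 88,478-atom system takes ~24 h on a given card, a 63,016-atom
# system on the same card would be predicted at roughly 17 h.
estimate = scaled_runtime_hours(63016, 88478, 24.0)
```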



Retvari Zoltan
Message 43365 - Posted: 11 May 2016 | 16:17:46 UTC - in response to Message 43364.
Last modified: 11 May 2016 | 16:20:53 UTC

4n6hR2-SDOERR_opm996-0-1-RND7021_0: 100,440 atoms, 16h 36m 46s (59,806s), 903,900 credits (GTX980Ti)
I see 85%~89%~90% GPU usage on my WinXPx64 hosts.
BTW there's no distinction of these workunits on the performance page, that makes me pretty sad.

skgiven (Volunteer moderator, Project tester, Volunteer tester)
Message 43366 - Posted: 11 May 2016 | 17:22:08 UTC - in response to Message 43365.
Last modified: 11 May 2016 | 17:27:44 UTC

If you have a small GPU, it's likely you will not finish some of these tasks inside 5 days, which 'I think' is still the cut-off for credit [correct me if I'm wrong]! I suggest people with small cards look at the progress % and time taken in BOINC Manager and work out from that whether the tasks will finish inside 5 days or not. If it's going to take over 120h, consider aborting.

Also worth taking some extra measures to ensure these extra long tasks complete. For me that means configuring MSI afterburner/NVIDIA Inspector to Prioritise Temperature and set it to something sensible such as 69C. I also freed up another CPU thread to help it along and set the memory to 3505MHz.
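The abort check suggested above can be written down directly. This assumes the cutoff really is 5 days (120 h), as hedged in the post; check the project's current rules before relying on it:

```python
def will_miss_cutoff(pct_done: float, hours_elapsed: float,
                     cutoff_hours: float = 120.0) -> bool:
    """True if, at the current progress rate, the task would finish after
    the credit cutoff (~5 days assumed here, per the post above)."""
    projected_total = hours_elapsed * 100.0 / pct_done
    return projected_total > cutoff_hours

# e.g. 6% done after 30 h projects to 500 h total -> abort candidate.
```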

frederikhk
Message 43367 - Posted: 11 May 2016 | 17:49:05 UTC

This one looks like it's going to take 30hrs :) 44% after 13 hours.

https://www.gpugrid.net/result.php?resultid=15094957

skgiven (Volunteer moderator, Project tester, Volunteer tester)
Message 43368 - Posted: 11 May 2016 | 21:05:14 UTC - in response to Message 43367.

My last 2 valid tasks (970's) took around 31h also.
Presently running 2 tasks that should take 25.5h and 29h on 970's.

MrJo
Message 43369 - Posted: 11 May 2016 | 21:10:50 UTC - in response to Message 43349.

> though should give 4x credits.

Where is my 4x credit? I only see at least twice the running time with less credit.

____________
Regards, Josef

Bedrich Hajek
Message 43370 - Posted: 11 May 2016 | 22:34:34 UTC - in response to Message 43362.

For future reference, I have a couple of suggestions:

For WUs running 18 hours +, there should be separate category : "super long runs". I believe this was mentioned in past posts.

This was previously suggested, however this issue was just an underestimate of the runtimes by the researchers and they would normally increase the credit ratio awarded for extra-long tasks. The real issue would be another queue on the server and facilitating and maintaining that; the short queue is often empty never mind having an extra long queue.



Fine, they underestimated the runtimes, but I stand by my statement for super long runs category.



The future WU application version should be made less CPU dependent.
Generally that's the case but ultimately this is a different type of research and it simply requires that some work be performed on the CPU (you can't do it on the GPU).

The WUs are getting longer, GPUs are getting faster, but CPU speed is stagnant. Something's got to give, and with the Pascal cards coming out soon, you'll have to put out a new version anyway.
Over the years WU's have remained about the same length overall. Occasionally there are extra-long tasks, but such batches are rare.
GPU's are getting faster and more adept, the number of shaders (CUDA cores here) is increasing, and CUDA development continues. The problem isn't just CPU frequency (though an AMD Athlon 64 X2 5000+ is going to struggle a bit): WDDM is an 11%+ bottleneck (increasing with GPU performance), and when the GPUGrid app needs the CPU to perform a calculation it comes down to the PCIe bridge and, perhaps to some extent, to the app not being multi-threaded on the CPU.
Note that it's the same ACEMD app (not updated since Jan) just a different batch of tasks (batches vary depending on the research type).

My GPU usage on these latest SDOERR WUs is between 70% to 80%, compared to 85% to 95% for the GERARD BestUmbrella units. I don't think this is reinventing the wheel, just updating it. This does have to be done sooner or later. The Volta cards are coming out in only a few short years.
I'm seeing ~75% GPU usage on my 970's too. On Linux and XP I would expect it to be around 90%. I've increased my Memory to 3505MHz to reduce the MCU (~41%) @1345MHz [power 85%], ~37% @1253MHz. I've watched the GPU load and it varies continuously with these WU's, as does the Power.


As for making the WUs less CPU dependent: there are some things that are easy to move to GPU calculation, and some that are difficult; maybe not impossible, but definitely impractical. Okay, I get this point, and I am not asking for the moon, but something more must be done to avoid or minimize this bottleneck from a programming standpoint. That's all I'm saying.


Stefan
Volunteer moderator
Project developer
Project scientist
Send message
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 43372 - Posted: 12 May 2016 | 8:13:00 UTC

@Retvari @MrJo I mentioned the credit algorithm in this post https://www.gpugrid.net/forum_thread.php?id=4299&nowrap=true#43328

I understand the runtime was underestimated, but given the knowledge we had (the projected runtime) it was our best guess. I only sent WUs that were projected to run under 24 hours on a 780. If that ends up being more than 5 days on some GPUs, I am really sorry; we didn't consider that possibility.

The problem with equilibrations is that we cannot split them into multiple steps like the normal simulations, so we just have to push through this batch and then we are done. I expect it to be over by the end of the week. No more new simulations will be sent out for a few days now, so only the ones that get cancelled or fail will be resent automatically.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43373 - Posted: 12 May 2016 | 8:49:36 UTC - in response to Message 43372.
Last modified: 12 May 2016 | 8:58:11 UTC

Don't worry about that, Stefan; it's a curveball that you've thrown us, that's all.

"I expect it to be over by the end of the week"
I like your optimism given the users who hold on to a WU for 5 days and never return and those that continually error even after long run times.
Surely there must be a way to suspend their ability to get WU's completely until they Log In.
They seriously impact our ability to complete a batch in good time.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43374 - Posted: 12 May 2016 | 9:07:31 UTC - in response to Message 43373.
Last modified: 12 May 2016 | 9:35:47 UTC

I like your optimism given the users who hold on to a WU for 5 days and never return and those that continually error even after long run times.
Surely there must be a way to suspend their ability to get WU's completely until they Log In.
They seriously impact our ability to complete a batch in good time.


5 days does sound optimistic. Maybe you can ensure resends go to top cards with a low failure rate?

It would be useful if a batch like this were better controlled and only sent to known-good cards; see the RACs of the 'participants' this WU was sent to:
https://www.gpugrid.net/workunit.php?wuid=11595345

I suspect that none of these cards could ever return extra-long tasks within 5 days (not enough GFlops), even if properly set up:
NVIDIA GeForce GT 520M (1024MB) driver: 358.87 [too slow and not enough RAM]
NVIDIA GeForce GT 640 (2048MB) driver: 340.52 [too slow, only cuda6.0 driver]
NVIDIA Quadro K4000 (3072MB) driver: 340.62 [too slow, only cuda6.0 driver]
NVIDIA GeForce GTX 560 Ti (2048MB) driver: 365.10 [best chance to finish but runs too hot, all tasks fail on this card when it hits ~85C, needs to cool it properly {prioritise temperature}]

I like the idea that WU's don't get sent to systems with high error rates until they log in (and are directed to a recommended settings page).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43375 - Posted: 12 May 2016 | 9:23:38 UTC - in response to Message 43374.
Last modified: 12 May 2016 | 9:24:26 UTC

Hi SK,
At least the only thing that was wasted there was bandwidth, because they errored immediately. The biggest problem (especially, but not only, in a WU drought) is the person who never returns: the WU is in limbo for 5 days before it gets resent, and then it may go to another machine that holds it for another 5 days, while good machines sit waiting for work.

It's NOT good for users
It's NOT good for scientists
It's NOT good for this project

About time we got the broom out.

Stefan
Volunteer moderator
Project developer
Project scientist
Send message
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 43376 - Posted: 12 May 2016 | 9:45:35 UTC

Actually, there is a way to send WUs only to top users, by increasing the job priority to over 1000. I thought I had it at max priority, but I just found out that it's only set to 'high', which is 600.

I guess most of these problems would have been avoided given a high enough priority.

In any case, we are at 1721 completed, 1582 running, so I think my estimate might roughly hold :P

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43377 - Posted: 12 May 2016 | 10:20:06 UTC - in response to Message 43376.
Last modified: 12 May 2016 | 10:40:53 UTC

Thanks Stefan,

Finally just an example of a different scenario which is frustrating,

https://www.gpugrid.net/workunit.php?wuid=11591312

User exceeds 5 days, so the WU gets sent to my slowest machine, which completes it in less than 24hrs.

However, the original user then sends back their result after 5 days, pipping my machine to the post and making my effort a waste of time.

This was of course because the original user downloaded 2 WUs at a time on a 750 Ti, which struggles to complete one WU in 24hrs running 24/7.

On my 980ti I recently aborted a unit that was 40% complete for exactly the same reason.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43378 - Posted: 12 May 2016 | 10:40:38 UTC - in response to Message 43376.

Actually there is a way to send WUs only to top users by increasing the job priority over 1000. I thought I had it at max priority but I just found out that it's only set to 'high' which is 600.

I guess most of these problems would have been avoided given a high enough priority.
Priority usually does not cause exclusion, only a lower probability.
Does a priority over 1000 really exclude the less reliable hosts from task scheduling when there are no lower-priority tasks in the queue?
I think it works only when lower-priority tasks are also queued, so those can go to the less reliable hosts.
Is it a documented feature?

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43379 - Posted: 12 May 2016 | 10:44:11 UTC - in response to Message 43375.
Last modified: 12 May 2016 | 10:46:06 UTC

Hi SK,
At least the only thing that was wasted there was bandwidth, because they errored immediately. The biggest problem (especially, but not only, in a WU drought) is the person who never returns: the WU is in limbo for 5 days before it gets resent, and then it may go to another machine that holds it for another 5 days, while good machines sit waiting for work.

It's NOT good for users
It's NOT good for scientists
It's NOT good for this project

About time we got the broom out.


Exactly; send a WU to 6 clients in succession that don't run it or return it for 5 days each, and after a month the WU still isn't complete.

IMO, if client systems report a cache of over 1 day they shouldn't get work from this project.

I highlighted a best case scenario (where several systems were involved); the WU runs soon after receipt, fails immediately, gets reported quickly and can be sent out again. Even in this situation it went to 4 clients that failed to complete the WU. To send and receive replies from 4 systems took 7h.

My second point was that the WU never really had a chance on any of those systems. They weren't capable of running these tasks (not enough RAM, too slow, oldish driver [slower or more buggy], too hot, or just not powerful enough to return an extra long WU inside 5days).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43380 - Posted: 12 May 2016 | 10:59:37 UTC - in response to Message 43379.
Last modified: 12 May 2016 | 11:04:21 UTC



I highlighted a best case scenario (where several systems were involved); the WU runs soon after receipt, fails immediately, gets reported quickly and can be sent out again. Even in this situation it went to 4 clients that failed to complete the WU. To send and receive replies from 4 systems took 7h.

My second point was that the WU never really had a chance on any of those systems. They weren't capable of running these tasks (not enough RAM, too slow, oldish driver [slower or more buggy], too hot, or just not powerful enough to return an extra long WU inside 5days).


I take your point. In the 7 hours that WU was being bounced from bad client to bad client a fast card would have completed it. Totally agree.

In fact Retvari could have completed it and taken an hour out for lunch. HaHa

MrJo
Send message
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 3,678
Level
Met
Scientific publications
watwatwatwatwat
Message 43381 - Posted: 12 May 2016 | 11:01:54 UTC - in response to Message 43372.
Last modified: 12 May 2016 | 11:05:56 UTC

I understand the runtime was underestimated

The principle in the law "In dubio pro reo" should also apply here. When in doubt, more points.

I only sent WUs that were projected to run under 24 hours on a 780

Not everyone has a 780. I run several 970's, 770's, 760's, 680's and 950's (I don't use my 750 any longer). A WU should be such that it can be completed by a midrange card within 24 hours. A 680 or a 770 is still a good card. Alternatively, one could run a fundraising campaign so poor crunchers get some 980Ti's ;-)
____________
Regards, Josef

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43382 - Posted: 12 May 2016 | 11:02:28 UTC - in response to Message 43379.
Last modified: 12 May 2016 | 11:09:34 UTC

https://boinc.berkeley.edu/trac/wiki/ProjectOptions

Accelerating retries. The goal of this mechanism is to send timeout-generated retries to hosts that are likely to finish them fast. Here's how it works:
• Hosts are deemed "reliable" (a slight misnomer) if they satisfy turnaround time and error rate criteria.
• A job instance is deemed "need-reliable" if its priority is above a threshold.
• The scheduler tries to send need-reliable jobs to reliable hosts. When it does, it reduces the delay bound of the job.
• When job replicas are created in response to errors or timeouts, their priority is raised relative to the job's base priority.

The configurable parameters are:

<reliable_on_priority>X</reliable_on_priority>
Results with priority at least reliable_on_priority are treated as "need-reliable". They'll be sent preferentially to reliable hosts.

<reliable_max_avg_turnaround>secs</reliable_max_avg_turnaround>
Hosts whose average turnaround is at most reliable_max_avg_turnaround and that have at least 10 consecutive valid results are considered 'reliable'. Make sure you set this low enough that a significant fraction (e.g. 25%) of your hosts qualify.

<reliable_reduced_delay_bound>X</reliable_reduced_delay_bound>
When a need-reliable result is sent to a reliable host, multiply the delay bound by reliable_reduced_delay_bound (typically 0.5 or so).

<reliable_priority_on_over>X</reliable_priority_on_over>
<reliable_priority_on_over_except_error>X</reliable_priority_on_over_except_error>
If reliable_priority_on_over is nonzero, increase the priority of duplicate jobs by that amount over the job's base priority. Otherwise, if reliable_priority_on_over_except_error is nonzero, increase the priority of duplicates caused by timeout (not error) by that amount. (Typically only one of these is nonzero, and is equal to reliable_on_priority.)

NOTE: this mechanism can be used to preferentially send ANY job, not just retries, to fast/reliable hosts. To do so, set the workunit's priority to reliable_on_priority or greater.
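Put concretely, a hypothetical server-side config fragment using these options might look like the sketch below. The values are illustrative guesses, not GPUGrid's actual settings:

```xml
<!-- Hypothetical BOINC server config fragment: route high-priority
     jobs and retries to fast, reliable hosts. Values are examples only. -->
<config>
  <!-- jobs with priority >= 1000 are treated as "need-reliable" -->
  <reliable_on_priority>1000</reliable_on_priority>
  <!-- hosts averaging under 1 day turnaround (86400 s) count as reliable -->
  <reliable_max_avg_turnaround>86400</reliable_max_avg_turnaround>
  <!-- reliable hosts get half the normal deadline for these jobs -->
  <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>
  <!-- retries get bumped above the need-reliable threshold -->
  <reliable_priority_on_over>1000</reliable_priority_on_over>
</config>
```

This lines up with Stefan's observation that the batch priority was set to 'high' (600) rather than over 1000: below the threshold, resends were not steered to reliable hosts.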

____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43383 - Posted: 12 May 2016 | 11:13:50 UTC - in response to Message 43381.

I understand the runtime was underestimated

The principle in the law "In dubio pro reo" should also apply here. When in doubt, more points.

I only sent WUs that were projected to run under 24 hours on a 780

Not everyone has a 780. I run several 970's, 770's, 760's, 680's and 950's (I don't use my 750 any longer). A WU should be such that it can be completed by a midrange card within 24 hours. A 680 or a 770 is still a good card. Alternatively, one could run a fundraising campaign so poor crunchers get some 980Ti's ;-)


When WU runtimes vary, fixing credit based on anticipated runtime is always going to be hit and miss on a task-by-task basis.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43384 - Posted: 12 May 2016 | 11:42:47 UTC

It's not only low end cards

https://www.gpugrid.net/workunit.php?wuid=11595159

Continual errors, sometimes after running a long time before failing.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43385 - Posted: 12 May 2016 | 13:08:08 UTC - in response to Message 43384.

It's not only low end cards

https://www.gpugrid.net/workunit.php?wuid=11595159

Continual errors and sometimes run a long time before error.

https://www.gpugrid.net/workunit.php?wuid=11594288

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43386 - Posted: 12 May 2016 | 15:12:36 UTC - in response to Message 43385.
Last modified: 12 May 2016 | 16:59:07 UTC

https://www.gpugrid.net/workunit.php?wuid=11595159
Card overheating, erroring repeatedly but restarting before eventually failing.

https://www.gpugrid.net/workunit.php?wuid=11594288

287647 NVIDIA GeForce GT 520 (1023MB) driver: 352.63
201720 NVIDIA Tesla K20m (4095MB) driver: 340.29 (cuda6.0 - might be an issue)
125384 Error while downloading, also NVIDIA GeForce GT 640 (1024MB) driver: 361.91
321762 looks like a GTX980Ti but actually tried to run on the 2nd card, a GTX560Ti which only had 1GB GDDR5
54461 another 560Ti with 1GB GDDR5
329196 NVIDIA GeForce GTX 550 Ti (1023MB) driver: 361.42

Looks like 4 out of 6 fails were due to the cards only having 1GB GDDR; one failed to download but only had 1GB GDDR anyway, and the other might be due to using an older driver with these WU's.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

(Ryle)
Send message
Joined: 7 Jun 09
Posts: 17
Credit: 653,530,862
RAC: 1,861,385
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 43389 - Posted: 12 May 2016 | 16:59:19 UTC

Would it make sense if a moderator or staff contacted these repeat offenders with a PM, asking them if they would consider detaching or try a less strenuous project?

I think they've done this procedure on CPDN, when users find hosts that burn workunits over and over.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43390 - Posted: 12 May 2016 | 17:13:13 UTC - in response to Message 43389.

In theory the server can send messages to hosts following repeated failures, if these are logged:
<msg_to_host/>
If present, check the msg_to_host table on each RPC, and send the client any messages queued for it.

Not sure if this appears in Notices (pop-up, if enabled) or just the event log.

I've tried contacting some people by PM in the past, but if they don't use the forum they won't see it, unless they have it going through to their emails too (and read those). Basically, not a great response and would need to be automated IMO.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

eXaPower
Send message
Joined: 25 Sep 13
Posts: 263
Credit: 1,002,411,717
RAC: 2,085,028
Level
Met
Scientific publications
watwatwatwatwatwat
Message 43391 - Posted: 12 May 2016 | 22:44:16 UTC - in response to Message 43372.

[...] The problem with equilibrations is that we cannot split them into multiple steps like the normal simulations so we just have to push through this batch and then we are done. I expect it to be over by the end of the week. No more simulations are being sent out a few days now so only the ones that cancel or fail will be resent automatically.

Will there be any future OPM batches, or is this the end of OPM? I've enjoyed crunching the OPM996 (non-fixed credit) WUs. The unpredictable-runtime simulations (a batch with varying Natom counts) are an exciting type of WU; the variable Natom for each task creates an allure of mystery. If viable, OPM would be a choice WU for summertime crunching, given its lower power requirement compared to some other WU's. (Umbrella-type WUs would also help contend with the summer heat.)

100,440 Natoms, 16h 36m 46s (59,806 s), 903,900 credits (GTX980Ti)

Is 903,900 the most credit ever given to an ACEMD WU?

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43392 - Posted: 12 May 2016 | 22:48:22 UTC - in response to Message 43390.

Or just deny them new tasks until they log in, as per your previous idea. This would solve the problem for the most part. I just can't understand the project not adopting this approach, as they would like WUs returned ASAP, and users don't want to sit idle while these hosts hold us up.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43393 - Posted: 13 May 2016 | 0:11:09 UTC - in response to Message 43391.

100,440 Natoms, 16h 36m 46s (59,806 s), 903,900 credits (GTX980Ti)
Is 903,900 the most credit ever given to an ACEMD WU?
I think it is so far.
There is a larger model which contains 101,237 atoms, and this could generate 911,100 credits with the +50% bonus. Unfortunately, one of my slower (GTX980) hosts received one such task, and it was processed in 90,589 seconds (25h 9m 49s) (plus the time it spent queued), so it earned "only" the +25% bonus, resulting in "only" 759,250 credits.
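The figures in this thread are consistent with a base award of roughly 6 credits per atom, multiplied by the return-time bonus (+50% inside 24h, +25% inside 48h). A small sketch of that inference; this is my reading of the posted numbers, not an official GPUGrid formula:

```python
def opm_credit(natoms, hours_to_return, credits_per_atom=6.0):
    """Estimate OPM credit from atom count and return time.

    Inferred from figures posted in this thread (base award looks like
    roughly 6 credits per atom); NOT the project's published formula.
    """
    if hours_to_return <= 24:
        bonus = 1.50   # returned within 24h: +50%
    elif hours_to_return <= 48:
        bonus = 1.25   # returned within 48h: +25%
    else:
        bonus = 1.00   # no bonus
    return natoms * credits_per_atom * bonus

# 101,237 atoms returned in ~25h -> ~759,278 (thread reports 759,250)
print(round(opm_credit(101_237, 25.2)))
```

For example, 101,237 atoms at ~6 credits/atom gives ~607,400 base credit, and the +25% bonus brings it to the 759,250 figure quoted above; the small residual comes from whatever rounding the server applies.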

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 335
Credit: 3,800,087,309
RAC: 894,669
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43394 - Posted: 13 May 2016 | 2:06:50 UTC - in response to Message 43393.

100,440 Natoms, 16h 36m 46s (59,806 s), 903,900 credits (GTX980Ti)
Is 903,900 the most credit ever given to an ACEMD WU?
I think it is so far.
There is a larger model which contains 101,237 atoms, and this could generate 911,100 credits with the +50% bonus. Unfortunately, one of my slower (GTX980) hosts received one such task, and it was processed in 90,589 seconds (25h 9m 49s) (plus the time it spent queued), so it earned "only" the +25% bonus, resulting in "only" 759,250 credits.



I got that beat with 113536 Natoms. It took my computer over 30 hours to complete, so I lost the 50% bonus. I actually got 2 of them. See links below:


https://www.gpugrid.net/result.php?resultid=15092125

https://www.gpugrid.net/result.php?resultid=15092126


Anyone have anything bigger?


This was actually fun. Let's do it again, but put it into the super long category.



Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43395 - Posted: 13 May 2016 | 6:52:59 UTC - in response to Message 43394.

I got that beat with 113536 Natoms. It took my computer over 30 hours to complete, so I lost the 50% bonus.
Wow, that would get 1,021,800 credits with the +50% bonus. I hope that one of my GTX980Ti hosts will receive such a workunit :)

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43396 - Posted: 13 May 2016 | 7:58:31 UTC - in response to Message 43395.

717,000 credits (with the 25% bonus) was the highest I received. It would have been 860,700 if returned inside 24h, but that would require a bigger card.
On Linux or Win XP I'm sure a GTX970 could return some of these inside 24h.

Might have got the lowest credit though ;p
73318 Natoms 9.184 ns/day:

15097992 10 May 2016 | 20:17:32 UTC 12 May 2016 | 0:29:05 UTC 91,659.53 91,074.92 275,000.00
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Skyler Baker
Send message
Joined: 19 Feb 16
Posts: 19
Credit: 136,574,536
RAC: 20,687
Level
Cys
Scientific publications
wat
Message 43399 - Posted: 13 May 2016 | 14:18:02 UTC

Mine's bigger: 117,122 Natoms took around 24 hours; 878,500 credits with the 25% bonus.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43403 - Posted: 13 May 2016 | 19:23:20 UTC - in response to Message 43396.

717,000 credits (with the 25% bonus) was the highest I received. It would have been 860,700 if returned inside 24h, but that would require a bigger card.
On Linux or Win XP I'm sure a GTX970 could return some of these inside 24h.

Might have got the lowest credit though ;p
73318 Natoms 9.184 ns/day:

15097992 10 May 2016 | 20:17:32 UTC 12 May 2016 | 0:29:05 UTC 91,659.53 91,074.92 275,000.00

Have completed 14 of the OPM, including 6 of the 91848 Natoms size with credit of only 275,500 each (no bonuses). They took well over twice the time of earlier WUs that yielded >200,000 credits. Haven't seen any of the huge credit variety being discussed.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43404 - Posted: 13 May 2016 | 20:15:50 UTC - in response to Message 43399.
Last modified: 13 May 2016 | 20:37:18 UTC

Mine's bigger: 117,122 Natoms took around 24 hours; 878,500 credits with the 25% bonus.
That would get 1,054,200 credits if returned under 24h.
As I see it, these workunits are running out; only those being processed by slow hosts remain.
We may receive some when the 5-day deadline expires.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43405 - Posted: 13 May 2016 | 20:21:19 UTC - in response to Message 43396.
Last modified: 13 May 2016 | 20:35:35 UTC

Might have got the lowest credit though ;p
73318 Natoms 9.184 ns/day:

15097992 10 May 2016 | 20:17:32 UTC 12 May 2016 | 0:29:05 UTC 91,659.53 91,074.92 275,000.00
You're not even close :)
2k72R5-SDOERR_opm996-0-1-RND1700_0 137,850 credits (including 50% bonus) 30637 atoms 3.017 ns/day 10M steps
2lbgR1-SDOERR_opm996-0-1-RND1419_0 130,350 credits (including 50% bonus) 28956 atoms 2.002 ns/day 10M steps
2lbgR0-SDOERR_opm996-0-1-RND9460_0 130,350 credits (including 50% bonus) 28956 atoms 2.027 ns/day 10M steps
2kbvR0-SDOERR_opm996-0-1-RND1815_1 129,900 credits (including 50% bonus) 28872 Natoms 1.937 ns/day 10M steps
1vf6R5-SDOERR_opm998-0-1-RND7439_0 122,550 credits (including 50% bonus) 54487 Natoms 5.002 ns/day 5M steps
1kqwR7-SDOERR_opm998-0-1-RND9523_1 106,650 credits (including 50% bonus) 47409 atoms 4.533 ns/day 5M steps

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43406 - Posted: 14 May 2016 | 4:33:50 UTC - in response to Message 43355.

My GPU(s) current (opm996 long WU) estimated total Runtime. Completion rate (based on) 12~24hours of real-time crunching.

2r83R1 (GTX970) = 27hr 45min (3.600% per 1hr @ 70% GPU usage / 1501MHz)
1bakR0 (GTX970) = 23hr 30min (4.320% per 1hr @ 65% / 1501MHz)
1u27R2 (GTX750) = 40hr (2.520% per 1hr @ 80% / 1401MHz)
2I35R5 (GT650m) = 70hr (1.433% per 1hr @ 75% / 790MHz)

Newer (Beta) BOINC clients introduced an (accurate) per min or hour progress rate feature - available in advanced view (task proprieties) commands bar.

The OPM WUs spelled the death knell for my last 2 super-clocked 650 Ti GPUs. They weren't too bad with the earlier WUs but were ridiculously slow with the OPMs. Probably due to having only 1GB of memory. Anyway, pulled them out of the machines. Down to a flock (perhaps: gaggle, pack, herd, swarm, pod?) of 2GB 750 Ti cards and a 670. Also noticed that machines with only 1 NV GPU processed the OPMs faster. This wasn't the case for earlier WUs.
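The estimated totals in the quoted list follow directly from BOINC's progress-rate figure: total hours ≈ 100 / (percent per hour). A trivial sketch:

```python
def total_runtime_hours(percent_per_hour):
    """Estimated total runtime implied by BOINC's progress rate (% per hour)."""
    return 100.0 / percent_per_hour

# 3.600 %/hr on the GTX970 above -> ~27.8 h (quoted as "27hr 45min")
print(f"{total_runtime_hours(3.6):.1f} h")
```

So the 2.520 %/hr on the GTX750 implies ~39.7 h, matching the ~40 hr estimate quoted.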

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43409 - Posted: 14 May 2016 | 9:33:37 UTC - in response to Message 43406.
Last modified: 14 May 2016 | 9:35:42 UTC

My GPU(s) current (opm996 long WU) estimated total Runtime. Completion rate (based on) 12~24hours of real-time crunching.

2r83R1 (GTX970) = 27hr 45min (3.600% per 1hr @ 70% GPU usage / 1501MHz)
1bakR0 (GTX970) = 23hr 30min (4.320% per 1hr @ 65% / 1501MHz)
1u27R2 (GTX750) = 40hr (2.520% per 1hr @ 80% / 1401MHz)
2I35R5 (GT650m) = 70hr (1.433% per 1hr @ 75% / 790MHz)

Newer (Beta) BOINC clients introduced an (accurate) per min or hour progress rate feature - available in advanced view (task proprieties) commands bar.

It's there in 7.6.22 (non-beta). For the GERARD_FXCXCL tasks it's about 8% on my 970's.

The OPM WUs spelled the death knell for my last 2 super-clocked 650 Ti GPUs. They weren't too bad with the earlier WUs but were ridiculously slow with the OPMs. Probably due to having only 1GB of memory. Anyway, pulled them out of the machines. Down to a flock (perhaps: gaggle, pack, herd, swarm, pod?) of 2GB 750 Ti cards and a 670.

Clutch, brood, rookery, skulk, crash, congregation, sleuth, school, shoal, army, quiver, gang, pride, bank, bouquet.

Also noticed that machines with only 1 NV GPU processed the OPMs faster. This wasn't the case for earlier WUs.

Probably due to the increased bus and CPU usage. I tried to alleviate this by freeing up more CPU and increasing the GDDR5 freq., but the GPU clocks could run higher too, due to slightly less GPU and power usage.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43415 - Posted: 14 May 2016 | 23:55:08 UTC - in response to Message 43395.
Last modified: 14 May 2016 | 23:55:51 UTC

I got that beat with 113536 Natoms. It took my computer over 30 hours to complete, so I lost the 50% bonus.
Wow, that would get 1.021.800 credits with the +50% bonus. I hope that one of my GTX980Ti hosts will receive such a workunit :)
I was lucky to get one such workunit on a GTX980Ti host.
It's 64.5% processed under 12 hours, so it will take ~18h 30m to complete.
I keep my fingers crossed :) We'll see the result in the morning.

eXaPower
Send message
Joined: 25 Sep 13
Posts: 263
Credit: 1,002,411,717
RAC: 2,085,028
Level
Met
Scientific publications
watwatwatwatwatwat
Message 43417 - Posted: 15 May 2016 | 0:37:37 UTC - in response to Message 43409.

My GPU(s) current (opm996 long WU) estimated total Runtime. Completion rate (based on) 12~24hours of real-time crunching.

2r83R1 (GTX970) = 27hr 45min (3.600% per 1hr @ 70% GPU usage / 1501MHz)
1bakR0 (GTX970) = 23hr 30min (4.320% per 1hr @ 65% / 1501MHz)
1u27R2 (GTX750) = 40hr (2.520% per 1hr @ 80% / 1401MHz)
2I35R5 (GT650m) = 70hr (1.433% per 1hr @ 75% / 790MHz)

Newer (Beta) BOINC clients introduced an (accurate) per min or hour progress rate feature - available in advanced view (task proprieties) commands bar.

It's there in 7.6.22 (non-beta). For the GERARD_FXCXCL tasks it's about 8% on my 970's.

The OPM WUs spelled the death knell for my last 2 super-clocked 650 Ti GPUs. They weren't too bad with the earlier WUs but were ridiculously slow with the OPMs. Probably due to having only 1GB of memory. Anyway, pulled them out of the machines. Down to a flock (perhaps: gaggle, pack, herd, swarm, pod?) of 2GB 750 Ti cards and a 670.

Clutch, brood, rookery, skulk, crash, congregation, sleuth, school, shoal, army, quiver, gang, pride, bank, bouquet.

Also noticed that machines with only 1 NV GPU processed the OPMs faster. This wasn't the case for earlier WUs.

Probably due to the increased bus and CPU usage. I tried to alleviate this by freeing up more CPU and increasing the GDDR5 freq., but the GPU clocks could run higher too, due to slightly less GPU and power usage.

Waiting on (2) formerly Timed out OPM to finish up on my 970's (23hr & 25hr estimated runtime). OPM WU was sent after hot day then cool evening ocean breeze -97 error GERARD_FXCXCL (50C GTX970) bent the knee for May sun.

I was lucky to get one such workunit.
It's 64.5% processed under 12 hours, so it will take ~18h 30m to complete.
I keep my fingers crossed :) We'll see the result in the morning.

960MB & 1020MB GDDR5 for each of my current OPM (60~80k Natoms guess).



Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43419 - Posted: 15 May 2016 | 7:30:25 UTC - in response to Message 43415.

I got that beat with 113536 Natoms. It took my computer over 30 hours to complete, so I lost the 50% bonus.
Wow, that would get 1.021.800 credits with the +50% bonus. I hope that one of my GTX980Ti hosts will receive such a workunit :)
I was lucky to get one such workunit on a GTX980Ti host.
It's 64.5% processed under 12 hours, so it will take ~18h 30m to complete.
I keep my fingers crossed :) We'll see the result in the morning.
I'm happy to report that the workunit finished fine:
1sujR0-SDOERR_opm996-0-1-RND0758_1 1.021.800 credits (including 50% bonus), 67.146 sec (18h 39m 6s), 113536 atoms 6.707 ns/day

Erich56
Send message
Joined: 1 Jan 15
Posts: 369
Credit: 1,606,755,102
RAC: 2,771,359
Level
His
Scientific publications
watwatwat
Message 43421 - Posted: 15 May 2016 | 7:55:40 UTC - in response to Message 43419.

1.021.800 credits ...

you lucky one :-)

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43422 - Posted: 15 May 2016 | 8:07:45 UTC - in response to Message 43417.
Last modified: 15 May 2016 | 8:10:44 UTC

Waiting on (2) formerly Timed out OPM to finish up on my 970's (23hr & 25hr estimated runtime). OPM WU was sent after hot day then cool evening ocean breeze -97 error GERARD_FXCXCL (50C GTX970) bent the knee for May sun.
Your GPUs are too hot. Your GT 630 reaches 80°C (176°F), while in your laptop your GT650M reaches 93°C (199°F) which is crazy.
Your host with 4 GPUs has two GTX970s, a GTX 750 and a GT630.
There's no point in risking the stability of the simulations running on your fast GPUs by putting low-end GPUs in the same host.
Packing 4 GPU to a single PC for 24/7 crunching requires water cooling, (or PCIe riser cards to make breathing space between the cards).
Crunching on laptops is not recommended. But if you do, you should place your laptop on its side while not in use, to make the air outlet facing up and the bottom of the laptop vertical (so the fan could take more air in). You should also regularly clean the fan & the fins with compressed air.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43423 - Posted: 15 May 2016 | 8:24:54 UTC - in response to Message 43419.
Last modified: 15 May 2016 | 10:25:10 UTC

Finally received an opm task that should finish inside 24h, 4fkdR3-SDOERR_opm996-0-1-RND2483_1

15093814 301215 9 May 2016 | 13:45:43 UTC 14 May 2016 | 13:42:45 UTC Not started by deadline - canceled

15103320 139265 14 May 2016 | 15:52:05 UTC 19 May 2016 | 15:52:05 UTC In progress...

Yes, it had been hiding on a system that appears to be designed to impede this project:

15093841 11594330 9 May 2016 | 13:45:43 UTC 14 May 2016 | 13:42:45 UTC Not started by deadline - canceled
15093816 11594305 9 May 2016 | 13:45:43 UTC 14 May 2016 | 13:42:45 UTC Not started by deadline - canceled
15093815 11594304 9 May 2016 | 13:45:43 UTC 14 May 2016 | 13:42:45 UTC Not started by deadline - canceled
15093814 11594303 9 May 2016 | 13:45:43 UTC 14 May 2016 | 13:42:45 UTC Not started by deadline - canceled
15083618 11586584 25 Apr 2016 | 23:06:12 UTC 30 Apr 2016 | 23:02:34 UTC Not started by deadline - canceled

If you don't learn from history...

... Completed and validated 59,953.27 59,504.36 182,100.00 - oh well.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43424 - Posted: 15 May 2016 | 8:35:58 UTC - in response to Message 43422.

Waiting on (2) formerly Timed out OPM to finish up on my 970's (23hr & 25hr estimated runtime). OPM WU was sent after hot day then cool evening ocean breeze -97 error GERARD_FXCXCL (50C GTX970) bent the knee for May sun.
Your GPUs are too hot. Your GT 630 reaches 80°C (176°F), while in your laptop your GT650M reaches 93°C (199°F) which is crazy.
Your host with 4 GPUs has two GTX970s, a GTX 750 and a GT630.
There's no point in risking the stability of the simulations running on your fast GPUs by putting low-end GPUs in the same host.
Packing 4 GPU to a single PC for 24/7 crunching requires water cooling, (or PCIe riser cards to make breathing space between the cards).
Crunching on laptops is not recommended. But if you do, you should place your laptop on its side while not in use, to make the air outlet facing up and the bottom of the laptop vertical (so the fan could take more air in). You should also regularly clean the fan & the fins with compressed air.


Heed the good advice!

Note that 93C is the GPU's temperature cut-off point. The GPU self-throttles to protect itself because it's dangerously hot. It doesn't have a cut-off point to protect the rest of the system and GPU's are Not designed to run at high temps continuously. Use temperature and fan controlling apps such as NVIDIA Inspector and MSI Afterburner to protect your hardware.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 335
Credit: 3,800,087,309
RAC: 894,669
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43425 - Posted: 15 May 2016 | 9:33:23 UTC - in response to Message 43419.

I got that beat with 113536 Natoms. It took my computer over 30 hours to complete, so I lost the 50% bonus.
Wow, that would get 1.021.800 credits with the +50% bonus. I hope that one of my GTX980Ti hosts will receive such a workunit :)
I was lucky to get one such workunit on a GTX980Ti host.
It's 64.5% processed under 12 hours, so it will take ~18h 30m to complete.
I keep my fingers crossed :) We'll see the result in the morning.
I'm happy to report that the workunit finished fine:
1sujR0-SDOERR_opm996-0-1-RND0758_1 1.021.800 credits (including 50% bonus), 67.146 sec (18h 39m 6s), 113536 atoms 6.707 ns/day



I also got another big one: 3pp2R9-SDOERR_opm996-0-1-RND0623_1 Run time 80,841.79 CPU time 80,414.03 120040 Natoms Credit 1,080,300.00

https://www.gpugrid.net/result.php?resultid=15102827


This was done on my windows 10 machine in under 24 hours with 50% bonus while working on this unit on the other card: 3um7R7-SDOERR_opm996-0-1-RND5030_1 Run time 67,202.94 CPU time 66,809.66 91343 Natoms Credit 822,000.00

https://www.gpugrid.net/result.php?resultid=15102779


Though, this did cost me my number 1 position in the performance tab.


Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43426 - Posted: 15 May 2016 | 10:03:30 UTC - in response to Message 43425.
Last modified: 15 May 2016 | 10:08:16 UTC

Though, this did cost me my number 1 position in the performance tab.
Never mind! :D

MrJo
Send message
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 3,678
Level
Met
Scientific publications
watwatwatwatwat
Message 43428 - Posted: 15 May 2016 | 11:25:52 UTC

@Retvari

Congrats to No. 1 of the TOP Crunchers ;-)
By the way: does anybody know what happened to Stoneageman? Sunk silently into the ground?

Got one of the loser files again: https://www.gpugrid.net/result.php?resultid=15103135. Only 171,150 credits for a runtime of 64,394.64.

Now I've got this one: https://www.gpugrid.net/result.php?resultid=15104507. How can I see if it's a good or a bad one?
____________
Regards, Josef

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43429 - Posted: 15 May 2016 | 12:40:31 UTC - in response to Message 43428.

Top Average Performers is a very misleading and ill-conceived chart because it is based on the average performance of a user across ALL his hosts, rather than on a particular host.
Retvari has a lot of hosts with a mixture of cards, and arguably the hosts with the fastest return and throughput on GPUGrid. This mixture of hosts/cards puts him well in front on WUs completed but, because times are averaged, behind on performance in hours.
Bedrich has only 2 hosts with at least 2 980Ti's and possibly 3; because he doesn't have any slower cards, when his return time in hours is averaged over all his hosts/cards he ends up at the top of the chart despite producing less than half as many completed WUs as Retvari.

Got one of the loser files again: https://www.gpugrid.net/result.php?resultid=15103135. Only 171,150 credits for a runtime of 64,394.64.

Now I've got this one: https://www.gpugrid.net/result.php?resultid=15104507. How can I see if it's a good or a bad one?


There are no good or bad ones, there are just some you get more or less credit for.

eXaPower
Send message
Joined: 25 Sep 13
Posts: 263
Credit: 1,002,411,717
RAC: 2,085,028
Level
Met
Scientific publications
watwatwatwatwatwat
Message 43430 - Posted: 15 May 2016 | 12:43:17 UTC - in response to Message 43424.

@Retvari

Congrats to No. 1 of the TOP Crunchers ;-)
By the way: Does anybody knows what happened to Stoneageman? Sunk silently in the ground?

Got one of the loser files again: https://www.gpugrid.net/result.php?resultid=15103135. Only 171,150 credits for a runtime of 64,394.64.

Now I've got this one: https://www.gpugrid.net/result.php?resultid=15104507. How can I see if it's a good or a bad one?

The Natom amount is only known after a WU validates cleanly (in the <stderr> file). One way to gauge Natom size is the GPU memory usage: the really big models (credit-wise, and the Natom counts seem to confirm some OPMs are the largest ACEMD systems ever crunched) near 1.5GB, while smaller models use <1.2GB or less. Long OPM WU Natom counts (29k to 120k) vary to the point where the cruncher doesn't really know what credit to expect. (I like this new feature, since no credit amount is fixed.)

Waiting on (2) formerly Timed out OPM to finish up on my 970's (23hr & 25hr estimated runtime). OPM WU was sent after hot day then cool evening ocean breeze -97 error GERARD_FXCXCL (50C GTX970) bent the knee for May sun.
Your GPUs are too hot. Your GT 630 reaches 80°C (176°F), while in your laptop your GT650M reaches 93°C (199°F) which is crazy.
Your host with 4 GPUs has two GTX970s, a GTX 750 and a GT630.
There's no point in risking the stability of the simulations running on your fast GPUs by putting low-end GPUs in the same host.
Packing 4 GPU to a single PC for 24/7 crunching requires water cooling, (or PCIe riser cards to make breathing space between the cards).
Crunching on laptops is not recommended. But if you do, you should place your laptop on its side while not in use, to make the air outlet facing up and the bottom of the laptop vertical (so the fan could take more air in). You should also regularly clean the fan & the fins with compressed air.


Heed the good advice!

Note that 93C is the GPU's temperature cut-off point. The GPU self-throttles to protect itself because it's dangerously hot. It doesn't have a cut-off point to protect the rest of the system and GPU's are Not designed to run at high temps continuously. Use temperature and fan controlling apps such as NVIDIA Inspector and MSI Afterburner to protect your hardware.

Heeding the GPU advice - I will reconfigure. I've found a WinXP Home Edition ULCPC key plus its SATA1 hard drive - will a ULCPC key copied onto a USB drive work with a desktop system? I also have a Linux Debian (Tails 2.3) OS on a USB drive, as well as Parrot 3.0, that I could set up for a Win8.1 dual-boot. Though the grapevine birds chirp that graphics-card performance there is non-existent compared to mainline 4.* Linux.

I'd really like to lose the WDDM choke point so my future Pascal cards are as efficient as possible.

sis651
Send message
Joined: 25 Nov 13
Posts: 65
Credit: 58,939,404
RAC: 293,889
Level
Thr
Scientific publications
watwatwat
Message 43432 - Posted: 15 May 2016 | 19:06:11 UTC

I got one of these long runs on my notebook with a GT740M. As it was slow, I stopped all other CPU projects; it was still slow. After about 60-70 hours it was still only at about 50%. Anyway, it ended up with errors and I'm not getting any long runs anymore. It seems they won't finish on time...
Waiting for short runs now.

Erich56
Send message
Joined: 1 Jan 15
Posts: 369
Credit: 1,606,755,102
RAC: 2,771,359
Level
His
Scientific publications
watwatwat
Message 43433 - Posted: 15 May 2016 | 19:24:26 UTC

both of the below WUs crunched with a GTX980Ti:

e4s15_e2s1p0f633-GERARD_FXCXCL12R_2189739_1-0-1-RND7197_0
22,635.45 / 22,538.23 / 249,600.00

1bakR6-SDOERR_opm996-0-1-RND6740_2
41,181.13 / 40,910.27 / 236,250.00

the second one took almost double the crunching time, but earned fewer points.
What explains this big difference?

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43434 - Posted: 15 May 2016 | 20:26:08 UTC - in response to Message 43433.
Last modified: 15 May 2016 | 20:31:30 UTC

GERARD_FXCXCL12R is a typical work unit in terms of credits awarded.
The SDOERR_opm tasks vary unpredictably in size/runtime, so the credits awarded were guesstimates. However, these are probably one-off primer work units that will hopefully feed future runs (where potentially interesting results have been observed). Another way to look at it is that you are doing cutting-edge theoretical/proof-of-concept science, never done before - it's bumpy.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43435 - Posted: 15 May 2016 | 21:57:54 UTC - in response to Message 43428.
Last modified: 15 May 2016 | 21:58:26 UTC

By the way: Does anybody knows what happened to Stoneageman?
He has been crunching Einstein@home for some time now. He is ranked #8 in total credit earned, and #4 in RAC at the moment.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43436 - Posted: 15 May 2016 | 22:09:33 UTC - in response to Message 43430.
Last modified: 15 May 2016 | 22:10:09 UTC

The Natom amount only known after a WU validates cleanly in (<stderr> file).
The number of atoms of a running task can be found in the project's folder, in a file named after the task with _0 appended to the end.
Though it has no .txt extension, this is a plain text file, so if you open it with Notepad you will find a line (the 5th) which contains this number:
# Topology reports 32227 atoms
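If you'd rather not open the file by hand, a few lines of script can pull the number out - a sketch that assumes the "# Topology reports N atoms" line format shown above (the function name is mine):

```python
import re

def topology_atom_count(path):
    """Return the atom count from a task file, or None if the line is absent.
    Assumes a line of the form '# Topology reports 32227 atoms'."""
    with open(path) as f:
        for line in f:
            m = re.match(r"#\s*Topology reports\s+(\d+)\s+atoms", line)
            if m:
                return int(m.group(1))
    return None
```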

MrJo
Send message
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 3,678
Level
Met
Scientific publications
watwatwatwatwat
Message 43438 - Posted: 16 May 2016 | 7:30:34 UTC - in response to Message 43435.
Last modified: 16 May 2016 | 7:32:39 UTC

He is crunching Einstein@home for some time now.


The Natom amount only known after a WU validates cleanly in (<stderr> file)
The number of atoms of a running task can be found in the project's folder, in a file named as the task plus a _0 attached to the end.
Though it has no .txt extension this is a clear text file, so if you open it with notepad you will find a line (5th) which contains this number:
# Topology reports 32227 atoms


Thank you for the explanation.
____________
Regards, Josef

MrJo
Send message
Joined: 18 Apr 14
Posts: 43
Credit: 1,192,135,172
RAC: 3,678
Level
Met
Scientific publications
watwatwatwatwat
Message 43439 - Posted: 16 May 2016 | 7:34:29 UTC - in response to Message 43434.

Another way to look at it is that you are doing cutting edge theoretical/proof of concept science, never done before - it's bumpy.

I'll look at it from this angle. ;-)

____________
Regards, Josef

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43440 - Posted: 16 May 2016 | 12:14:15 UTC
Last modified: 16 May 2016 | 12:21:42 UTC

This one took over 5 days to get to me https://www.gpugrid.net/workunit.php?wuid=11595181

Completed in just under 24hrs for 1,095,000

Come on admins, do something about the "5 Day Timeout" and continual-error machines. The next WU took over 6 days to get to me: https://www.gpugrid.net/workunit.php?wuid=11595161. Also, stop people caching WUs for more than an hour.

Stefan
Volunteer moderator
Project developer
Project scientist
Send message
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 43441 - Posted: 16 May 2016 | 12:29:12 UTC

Ah, you guys actually reminded me of the obvious fact that the credit calculations might be off with respect to the runtime if the system does not fit into GPU memory. AFAIK, if the system does not fully fit in the GPU (which might happen with quite a few of the OPM systems) it will simulate quite a bit slower.
I think this is not accounted for in the credit calculation.

On the other hand, the exact same credit calculation was used for my WUs as for Gerard's. The difference is that Gerard's are just one system and not 350 different ones like mine, so it's easy to be consistent in credits when the number of atoms doesn't change ;)

In any case I would like to thank you all for pushing through with this. It's nearly finished now so I can get to looking at the results.

Many thanks for the great work :)

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43442 - Posted: 16 May 2016 | 22:10:20 UTC - in response to Message 43441.
Last modified: 16 May 2016 | 22:17:31 UTC

I expect the problem was predominantly the varying number of atoms - the more atoms, the longer the runtime. You would have needed to factor the atom-count variable into the credit model for it to work perfectly. As any subsequent runs will likely have fixed atom counts (varying per batch), I expect they can be calibrated as normal. If further primer runs are needed, it would be good to factor the atom count into the credits.
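A credit model that factors in the atom count could be as simple as scaling linearly with Natoms and simulated time - purely an illustrative sketch; the rate constant is hypothetical, not the project's actual calibration:

```python
def estimate_credit(natoms, ns_simulated, rate=0.05):
    """Hypothetical credit estimate: linear in atom count and simulated time.
    rate (credits per atom per ns) is an illustrative constant only."""
    return natoms * ns_simulated * rate
```

Under such a model, a WU with twice the atoms would earn twice the credit for the same simulated length, which would remove the mismatch between runtime and award seen in this batch.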

The largest amount of GDDR I've seen being used is 1.5GB but based on reported atom counts some tasks might have been a little higher. Not all of the tasks use as much, many used <1GB so this was only a problem for some tasks that tried to run on GPU's with small amounts of GDDR (1GB mostly, but possibly a few [rare] 1.5GB cards [GT640's, 660 OEM's, 670M/670MX, the 192-bit GTX760 or some of the even rarer 400/500 series cards], or people trying to run 2 tasks on a 2GB card simultaneously).
Most cards have 2GB or more GDDR and most of the 1GB cards failed immediately when the tasks required more than 1GB GDDR. The 1GB cards that did complete tasks probably finished tasks that didn't require >1GB GDDR, otherwise they would have been heavily restricted as you suggest and experienced even greater PCIE bus usage which was already higher with this batch.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 335
Credit: 3,800,087,309
RAC: 894,669
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43443 - Posted: 16 May 2016 | 23:00:23 UTC - in response to Message 43442.

I expect the problem was predominantly the varying number of atoms - the more atoms the longer the runtime. You would have needed to factor the atom count variable into the credit model for it to work perfectly. As any subsequent runs will likely have fixed atom counts (but varying per batch) I expect they can be calibrated as normal. If further primer runs are needed it would be good to factor the atom count into the credits.

The largest amount of GDDR I've seen being used is 1.5GB but based on reported atom counts some tasks might have been a little higher. Not all of the tasks use as much, many used <1GB so this was only a problem for some tasks that tried to run on GPU's with small amounts of GDDR (1GB mostly, but possibly a few [rare] 1.5GB cards [GT640's, 660 OEM's, 670M/670MX, the 192-bit GTX760 or some of the even rarer 400/500 series cards], or people trying to run 2 tasks on a 2GB card simultaneously).
Most cards have 2GB or more GDDR and most of the 1GB cards failed immediately when the tasks required more than 1GB GDDR. The 1GB cards that did complete tasks probably finished tasks that didn't require >1GB GDDR, otherwise they would have been heavily restricted as you suggest and experienced even greater PCIE bus usage which was already higher with this batch.



More atoms also mean higher GPU usage. I am currently crunching a WU with 107,436 atoms; my GPU usage is 83%, compared to 71% for the low-atom WUs in this batch - and this is on a Windows 10 computer with WDDM lag. My current GPU memory usage is 1692 MB.


By comparison, the GERARD_FXCXCL WU that I am running concurrently on this machine's other card shows 80% GPU usage and 514 MB GPU memory usage with 31,718 atoms.

The power usage is the same, 75%, for each WU, each running on a 980Ti card.


Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43449 - Posted: 19 May 2016 | 16:17:44 UTC - in response to Message 43396.

717000 credits (with 25% bonus) was the highest I received. Would have been 860700 if returned inside 24h, but would require a bigger card.
On Linux or Win XP I'm sure a GTX970 could return some of these inside 24h.

The OPMs were hopeless on all but the fastest cards. Even the Gerards lately seem to be sized to cut out the large base of super-clocked 750 Ti cards, at least on the dominant WDDM-based machines (the 750 Tis are still some of the most efficient GPUs that NV has ever produced). In the meantime file sizes have increased and much time is spent just in the upload process. I wonder just how important it is to keep the bonus deadlines so tight, considering the larger file sizes and the fact that the admins don't even seem able to follow up on the WUs we're crunching by keeping new ones in the queues. It wasn't long ago that the WU times doubled, not sure why.

Seems a few are gaining a bit of speed by running XP. Is that safe, considering the lack of support from MS? I've also been wanting to try running a Linux image (perhaps even from USB), but the image here hasn't been updated in years. Even sent one of the users a new GPU so he could work on a new Linux image for GPUGrid but nothing ever came of it. Any of the Linux experts up to this job?

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43450 - Posted: 19 May 2016 | 17:07:18 UTC - in response to Message 43449.

Most of the issues are due to lack of personnel at GPUGrid. The research is mostly performed by the research students and several have just finished.

If you only use XP to crunch then you are limiting the risk. Anti virus packages and firewalls still work on XP.

Ubuntu 16.04 has been released recently. I'm looking to try it soon and see if there is a simple way to get it up and running for here; repository drivers + Boinc from the repository. If I can I will write it up. Alas, with every version so many commands change and new problems pop up that it's always a learning process.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43451 - Posted: 19 May 2016 | 18:06:02 UTC - in response to Message 43450.

Most of the issues are due to lack of personnel at GPUGrid. The research is mostly performed by the research students and several have just finished.

If you only use XP to crunch then you are limiting the risk. Anti virus packages and firewalls still work on XP.

Ubuntu 16.04 has been released recently. I'm looking to try it soon and see if there is a simple way to get it up and running for here; repository drivers + Boinc from the repository. If I can I will write it up. Alas, with every version so many commands change and new problems pop up that it's always a learning process.

Thanks SK. Hope that you can get the Linux info updated. It would be much appreciated. I'm leery about XP at this point. Please keep us posted.

I've been doing a little research into the 1- and 2-day bonus deadlines, mostly by looking at a lot of different hosts. It's interesting. By moving WUs just past the 1-day deadline for a large number of GPUs, the work return may actually be getting slower. The users with the very fast GPUs generally cache as many as allowed, so return times end up being close to 1 day anyway. On the other hand, most of my GPUs are factory-OCed 750 Tis (very popular on this project). When they were making the 1-day deadline, I set this as the only NV project and at 0 project priority; the new WU would be fetched when the old WU was returned - zero lag. Now, since I can't quite make the 1-day cutoff anyway, I set the queue for 1/2 day. Thus the turnaround time is much slower (but still well inside the 2-day limit) and I actually get significantly more credit (especially when WUs are scarce). This too-tight turnaround strategy by the project can actually be harmful to their overall throughput.

Skyler Baker
Send message
Joined: 19 Feb 16
Posts: 19
Credit: 136,574,536
RAC: 20,687
Level
Cys
Scientific publications
wat
Message 43452 - Posted: 20 May 2016 | 1:59:06 UTC

Some of the new Gerards are definitely a bit long as well; they seem to run at about 12.240% per hour, which wouldn't be so bad except that's with an overclocked 980ti - nearly the best possible scenario until Pascal arrives later this month. Like others have said, it doesn't affect me, but it would be a long time with a slower card.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43454 - Posted: 20 May 2016 | 11:26:24 UTC
Last modified: 20 May 2016 | 11:48:31 UTC

This one took 10 days 8 hours to get to me: https://www.gpugrid.net/workunit.php?wuid=11595052

This work and all other work could be done much more quickly and efficiently if the project addressed this problem.

I imagine it would also increase the amount of work GPUGrid could accomplish and scientists might have higher confidence in the results.

TO ADD

One of Gerards took 3 and a 1/2 days to get to my slowest machine https://www.gpugrid.net/workunit.php?wuid=11600399

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43456 - Posted: 20 May 2016 | 15:16:17 UTC - in response to Message 43454.

This one took 10 Days 8 hours to get to me https://www.gpugrid.net/workunit.php?wuid=11595052

This work and all other work could be done much more quickly and efficiently if the project addressed this problem.

I imagine it would also increase the amount of work GPUGrid could accomplish and scientists might have higher confidence in the results.

TO ADD

One of Gerards took 3 and a 1/2 days to get to my slowest machine https://www.gpugrid.net/workunit.php?wuid=11600399

Interesting that most of the failures were from fast GPUs, even 3x 980Ti and a Titan among others. Are people OCing too much? In the "research" I mentioned above I've noticed MANY 980Ti, Titan and Titan X cards throwing constant failures. Surprised me, to say the least.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43457 - Posted: 20 May 2016 | 15:21:49 UTC - in response to Message 43454.

I have some similar experiences:
e5s22_e1s14p0f264-GERARD_FXCXCL12R_2189739_2-0-1-RND1099, 5 days:
1. Jonny's desktop with i7-3930K and two GTX 780s - 4 successive timeouts

e5s7_e3s79p0f564-GERARD_FXCXCL12R_1406742_1-0-1-RND7782, 5 days:
1. Jozef J's desktop with i7-5960X and GTX 980 Ti - a lot of errors
2. i-kami's desktop with i7-3770K and GTX 650 - 1 timeout, and the other GERARD WU took 2 days

2kytR9-SDOERR_opm996-0-1-RND3899, 10 days and 6 hours:
1. Remix's laptop with a GeForce 610M - only 1 task, which timed out (probably the user realized that this GPU is insufficient)
2. John C MacAlister's desktop with AMD FX-8350 and GTX 660 Ti - errors & user aborts
3. Alexander Knerlein's laptop with GTX780M - only 1 task, which timed out (probably the user realized that this GPU is insufficient)

1hh4R8-SDOERR_opm996-0-1-RND5553, 10 days and 2 hours:
1. mcilfone's brand-new i7-6700K with a very hot GTX 980 Ti - errors, timeouts and some successful tasks
2. MintberryCrunch's desktop with Core2 Quad 8300 and GTX 560 Ti (1024MB) - a timeout and a successful task

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43458 - Posted: 20 May 2016 | 15:35:00 UTC - in response to Message 43456.
Last modified: 20 May 2016 | 15:35:22 UTC

Interesting that most of the failures were from fast GPUs, even 3x 980Ti and a Titan among others. Are people OCing to much? In the "research" I mentioned above I've noticed MANY 980Ti, Titan and Titan X cards throwing constant failures. Surprised me to say the least.
There are different reasons for those failures, such as missing libraries, overclocking, or a wrong driver installation.
The reasons for timeouts are: a card that's too slow, and/or too many GPU tasks queued from different projects.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43461 - Posted: 20 May 2016 | 20:29:12 UTC - in response to Message 43457.

I have some similar experiences:
e5s22_e1s14p0f264-GERARD_FXCXCL12R_2189739_2-0-1-RND1099 (5 days):
1. Jonny's desktop with an i7-3930K and two GTX 780s: 4 successive timeouts

Here's an interesting one:

https://www.gpugrid.net/workunit.php?wuid=11593078

I'm the 8th user to receive this "SHORT" OPM WU originally issued on May 9. The closest to success was by a GTX970 (until the user aborted it). Now it's running on one of my factory OCed 750 Ti cards. That card finishes the GERARD LONG WUs in 25-25.5 hours (yeah, cry me a river). This "SHORT" WU is 60% done and should complete with a total time of about 27 hours.

Show me a GPU that can finish this WU in anywhere near 2-3 hours and I'll show you a fantasy world where unicorns romp through the streets.

KSUMatt
Avatar
Send message
Joined: 11 Jan 13
Posts: 214
Credit: 831,004,493
RAC: 0
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwat
Message 43462 - Posted: 20 May 2016 | 21:23:33 UTC - in response to Message 43450.

Ubuntu 16.04 has been released recently. I'm looking to try it soon and see if there is a simple way to get it up and running for here; repository drivers + Boinc from the repository. If I can I will write it up. Alas, with every version so many commands change and new problems pop up that it's always a learning process.


Straying a bit off topic again, I'll risk posting this. I consider myself fairly computer literate, having built several PCs and having a little coding experience. However, I have nearly always used Windows. I've been very interested in Linux, but every time I've tried to set up a Linux host for BOINC I've been defeated. Either I couldn't get GPU drivers installed correctly or BOINC was somehow not set up correctly within Linux. If anyone would be willing to put together a step-by-step "Idiot's Guide" it would be HUGELY appreciated.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43464 - Posted: 20 May 2016 | 22:44:28 UTC
Last modified: 20 May 2016 | 22:46:54 UTC

2lugR6-SDOERR_opm996-0-1-RND3712 (2 days, but 6 failures):
1. NTNU's laptop with an i7-3720QM and an NVS 5200M: 58 successive errors while downloading
2. Jens' desktop with an i7-4790K and a GTX 970: a timeout and an error
3. Anonymous' desktop with an i7-2700K and a GTX 660 Ti: 40 (instant) errors and 1 timeout
4. [VENETO] sabayonino's desktop with an i7-4770 and a GTX 980: 1 success, 4 timeouts, 4 user aborts and 1 error
5. Evan's desktop with an i5-3570K and a GTX 480: 54 successive errors
6. Jordi Prat's desktop with an i7-4770 and a GTX 760: 62 successive failures - errors (6) and "Simulation became unstable" errors (56)

Jim1348
Send message
Joined: 28 Jul 12
Posts: 455
Credit: 1,130,760,908
RAC: 182,632
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 43465 - Posted: 21 May 2016 | 1:47:55 UTC - in response to Message 43462.

I've been very interested in Linux, but every time I've tried to set up a Linux host for BOINC I've been defeated. Either I couldn't get GPU drivers installed correctly or BOINC was somehow not set up correctly within Linux.

Something always goes wrong for me too, and I question my judgement for trying it once again. But I think when Mint 18 comes out, it will be worth another go. It should be simple enough (right?).

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43466 - Posted: 21 May 2016 | 8:54:07 UTC - in response to Message 43465.
Last modified: 21 May 2016 | 9:39:50 UTC

The error rate for the latest GERARD_FX tasks is high, and for the OPM simulations it was even higher. Perhaps this should be looked into.

Application                 unsent   In progress   Success   Error rate
Short runs (2-3 hours on fastest card)
SDOERR_opm99                     0            60      2412       48.26%
Long runs (8-12 hours on fastest card)
GERARD_FXCXCL12R_1406742_        0            33       573       38.12%
GERARD_FXCXCL12R_1480490_        0            31       624       35.34%
GERARD_FXCXCL12R_1507586_        0            25       581       33.14%
GERARD_FXCXCL12R_2189739_        0            42       560       31.79%
GERARD_FXCXCL12R_50141_          0            35       565       35.06%
GERARD_FXCXCL12R_611559_         0            31       565       32.09%
GERARD_FXCXCL12R_630477_         0            34       561       34.31%
GERARD_FXCXCL12R_630478_         0            44       599       34.75%
GERARD_FXCXCL12R_678501_         0            30       564       40.57%
GERARD_FXCXCL12R_747791_         0            32       568       36.89%
GERARD_FXCXCL12R_780273_         0            42       538       39.28%
GERARD_FXCXCL12R_791302_         0            37       497       34.78%

Two or three weeks ago the error rate was ~25% to 35%; it's now ~35% to 40%. Maybe this varies with release stage: early in a run, tasks go to everyone, so error rates are higher; later, more go to the most successful cards, so the rate drops?

Selecting more reliable systems might have helped with the OPM batch, but that would also have masked the problems. GPUGrid has always been faced with users' bad-setup problems. If you have 2 or 3 GPUs in a box and don't use temperature/fan-controlling software, or if you overclock the GPU or GDDR too much, there is little the project can do about that (at least for now). It's incredibly simple to install a program such as NVIDIA Inspector and set it to prioritise temperature, yet so few do this. IMO the GPUGrid app should set the temperature control by default. However, that's an app-dev issue and probably isn't something Stefan has the time to work on, even if he could.

I've noticed some new/rarely seen before errors with these WU's, so perhaps that could be looked at too?

On the side-show to this thread, 'Linux' (as it might get those 25/26h runs below 24h): the problem is that lots of things change with each version, and that makes instructions for previous versions obsolete. Try to follow instructions tested under Ubuntu 11/12/13 while working with 15.04/15.10 and you will probably not succeed - the short-cuts have changed, the commands have changed, the security rights have changed, the repo drivers are too old...
I've recently tried, without success, to get Boinc on an Ubuntu 15.10 system to see an NV GPU that I popped in. Spent ~2 days at this on and off. The system sees the card and the X-server works fine, but Boinc just seems oblivious. Probably some folder security issue. Tried to upgrade to 16.04, only to be told (after downloading) that the (default sized) boot partition is too small... I would probably need to boot into Grub to repartition - too close to brain surgery to go down that route. Thought it would be faster and easier to format and install 16.04. Downloaded an image onto a W10 system, but it took half a day to find an external DVD-writer and I still can't find a DVD I can write the image to (~1.4GB)...
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43467 - Posted: 21 May 2016 | 9:21:10 UTC - in response to Message 43466.
Last modified: 21 May 2016 | 9:42:02 UTC

If the project denied WUs to machines that continually errored and timed out, we could get that error rate below 5%.

And here's another one https://www.gpugrid.net/workunit.php?wuid=11600492

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43469 - Posted: 21 May 2016 | 9:49:09 UTC - in response to Message 43467.
Last modified: 21 May 2016 | 9:56:03 UTC

I agree on that, no doubt, but where do you draw the line? Systems with 50% failures? 30%? 10%? I still think a better system needs to be introduced to exclude bad systems until the user responds to a PM/Notice/Email... It could accommodate resolution of such problems and facilitate crunching again once resolved (which helps the cruncher and the project). Sometimes you just get an unstable system on which every task fails until it is restarted, and then it works fine again - even that could and should be accommodated. More often it's a bad setup: wrong drivers, heavy OC/bad cooling, wrong config/ill-advised use; but occasionally the card is a dud, or it's something odd that's difficult to work out.

Some time ago I suggested having a test app which could be sent to such systems, say after a reboot/user reconfiguration. The purpose would be to test that the card/system is actually capable of running a task. A 10-minute test task would be sufficient to assess the system's basic capabilities. After that, 1 task could be sent to the system, and if it succeeded in completing that task its status could go back to normal, or say 50% successful.

IMO the Boinc system for this was devised for CPUs and isn't good enough for GPUs, so this should be done either in some sort of GPU module or by the project.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

sis651
Send message
Joined: 25 Nov 13
Posts: 65
Credit: 58,939,404
RAC: 293,889
Level
Thr
Scientific publications
watwatwat
Message 43470 - Posted: 21 May 2016 | 9:53:32 UTC - in response to Message 43462.

I've been using Kubuntu for 2-3 years.
I downloaded Boinc from here, development version 7.4.22:
https://boinc.berkeley.edu/download_all.php

I just install the nvidia drivers from the driver manager page of System Settings; those are the drivers in the Ubuntu package repository. Sometimes they're not up to date, but they work fine. Or sometimes I use the Muon package manager to install some more nvidia-related packages.

I use an Nvidia Optimus supported notebook, which means the Nvidia GPU is secondary: it just renders an image and sends it to the Intel GPU to be displayed on the screen. Thus I use the Prime and Bumblebee packages. Configuring them can sometimes be problematic, but usually there are no issues, and once done it works for months until the next Kubuntu version. In fact the issue is that Boinc runs on the Nvidia GPU but CUDA detection doesn't happen. By installing some other packages and running some more commands, everything works well...
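For anyone attempting a similar Optimus setup, the steps described above might look roughly like this (a hedged sketch only; the exact package names such as bumblebee-nvidia and primus, and the BOINC data path, are assumptions that vary by (K)ubuntu release):

```shell
# Hypothetical sketch of an Optimus/Bumblebee setup on (K)ubuntu.
# Package and path names are assumptions; check your release's repository.

# Install Bumblebee plus the Primus bridge and NVIDIA support.
sudo apt-get update
sudo apt-get install bumblebee bumblebee-nvidia primus

# Verify that the discrete GPU is reachable through Bumblebee.
optirun nvidia-smi

# Run the BOINC client on the NVIDIA GPU so CUDA detection succeeds.
optirun boinc --dir /var/lib/boinc-client
```

If CUDA still isn't detected, the usual suspects are a missing libcuda package or the client starting before the Bumblebee daemon.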

I can try to help in case you try with Ubuntu/Kubuntu.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43471 - Posted: 21 May 2016 | 9:58:08 UTC - in response to Message 43469.
Last modified: 21 May 2016 | 10:05:53 UTC

I agree on that, no doubt, but where do you draw the line? Systems with 50% failures? 30%? 10%? I still think a better system needs to be introduced to exclude bad systems until the user responds to a PM/Notice/Email... It could accommodate resolution of such problems and facilitate crunching again once resolved (which helps the cruncher and the project). Sometimes you just get an unstable system on which every task fails until it is restarted, and then it works fine again - even that could and should be accommodated. More often it's a bad setup: wrong drivers, heavy OC/bad cooling, wrong config/ill-advised use; but occasionally the card is a dud, or it's something odd that's difficult to work out.


I think you would have to do an impact assessment of what level of denial produces benefits for the project, and at what point that plateaus and turns into a negative impact. With the data this project already has, that shouldn't be difficult.

Totally agree with the idea of a test unit. Very good idea.

If it is only to last 10 minutes then it must be rigorous enough to make a bad card/system fail very quickly and it must have a short completion deadline.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43474 - Posted: 21 May 2016 | 10:44:35 UTC - in response to Message 43471.
Last modified: 21 May 2016 | 11:08:32 UTC

The majority of failures tend to be almost immediate (<1 min). If the system could deal with those it would be of great benefit, even if it can't do much more.

Maybe with a two-test system you could set the 1st task with high priority (run ASAP) to test actual functionality. With the second test task, set a deadline of 2 days but send a server abort after 1 day to exclude people who keep a long queue? They would never run the 2nd task, and that would prevent many bad systems from hogging tasks, which is the second biggest problem IMO. A Notice/email/PM & FAQ recommendation would give them the opportunity to reduce their queue/task cache.

Heat/cooling/OC related failures would take a bit longer to identify, but cards heat up quickly if they are not properly cooled. How long they run before failing is a bit random but would increase with time. Unfortunately you also get half-configured systems: 3 cards, two set up properly, one a cooker. Whatever else is running would also impact the temperature, but dealing with situations like that isn't a priority.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43475 - Posted: 21 May 2016 | 11:16:40 UTC - in response to Message 43474.
Last modified: 21 May 2016 | 11:17:39 UTC


Maybe with a two-test system you could set the 1st task with high priority to test actual functionality. With the second test task, set a deadline of 2 days but send a server abort after 1 day to exclude people who keep a long queue? They would never run the 2nd task, but that would prevent people hogging tasks, which is the second biggest problem IMO. A Notice/email/PM would give them the opportunity to reduce their cache.


Once again I totally agree and to address your other question

I agree on that, no doubt, but where do you draw the line? Systems with 50% failures? 30%? 10%?


I think you have to be brutal in your approach and give this project "high standards" instead of "come one, come all". This project is already an elite one, based on the core contributors and the money, time and effort they put into it.

Bad hosts hog WUs, slow results and deprive good hosts of work, which frustrates good hosts (whom you may lose) and may keep new ones from joining. So "raise the bar" and turn this into a truly elite project; we all know people want to go to TOP clubs, restaurants, universities, etc.

Heat/cooling/OC related failures would take a bit longer to identify, but cards heat up quickly if they are not properly cooled. How long they run before failing is a bit random but would increase with time. Unfortunately you also get half-configured systems: 3 cards, two set up properly, one a cooker. Whatever else is running would also impact the temperature, but dealing with situations like that isn't a priority.


They can get help via the forums as usual but as far as the project is concerned you can't make their problem your problem.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43476 - Posted: 21 May 2016 | 12:21:47 UTC - in response to Message 43475.

Some kind of test is a very good idea, but it has to be done on a regular basis on every host, even the reliable ones, as I think this test should watch GPU temperatures as well; if GPU temps are too high (above 85°C), the given host should be excluded.

I agree on that, no doubt, but where do you draw the line? Systems with 50% failures? 30%? 10%?

I think you have to be brutal in your approach and give this project "high standards" instead of "come one, come all". This project is already an elite one, based on the core contributors and the money, time and effort they put into it.
The actual percentage could be set by scientific means based on the data available to the project, but there should be a time limit for the ban and a manual override for the user (and a regular re-evaluation of the banned hosts). I would set it to 10%.

Bad hosts hog WUs, slow results and deprive good hosts of work, which frustrates good hosts (whom you may lose) and may keep new ones from joining. So "raise the bar" and turn this into a truly elite project; we all know people want to go to TOP clubs, restaurants, universities, etc.
I agree partly. I think there should be a queue available only for "elite" (reliable & fast) users or hosts, but it should basically contain the same type of work as the "normal" queue, with the batches kept separate. In this way part of the batches would finish earlier, or they could be single-step workunits with very long (24h+ on a GTX 980 Ti) processing times.

Heat/cooling/OC related failures would take a bit longer to identify, but cards heat up quickly if they are not properly cooled. How long they run before failing is a bit random but would increase with time. Unfortunately you also get half-configured systems: 3 cards, two set up properly, one a cooker. Whatever else is running would also impact the temperature, but dealing with situations like that isn't a priority.

They can get help via the forums as usual but as far as the project is concerned you can't make their problem your problem.

Until our reliability-assessment dreams come true (~never), we should find other means to reach the problematic contributors.
The project's minimum requirements should be made very clear right at the start (on the project's homepage, in the BOINC manager when a user tries to join the project, in the FAQ, etc.):
1. A decent NVidia GPU (GTX 760+ or GTX 960+)
2. No overclocking (later you can try, but read the forums)
3. Other GPU projects are allowed only as a backup (0 resource share) project.
Some tips about the above 3 points should be broadcast by the project as a notice on a regular basis. Also, there should be someone/something who could send an email to users who have unreliable host(s), or perhaps their username/hostname should be broadcast as a notice.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43477 - Posted: 21 May 2016 | 12:55:51 UTC - in response to Message 43476.
Last modified: 21 May 2016 | 13:18:47 UTC

Communications would need to be automated IMO; it's too big a task for admins and mods to perform manually - it would be hundreds of messages daily. There is also a limit on how many PMs you and I can send. It's about 30/day for me; it might be less for others? I trialled contacting people directly who were failing all workunits. From ~100 PMs I think I got about 3 replies, 1 of which was 6 months later IIRC. That suggests ~97% of people attached don't read their PMs/check their email, or they can't be bothered/don't understand how to fix their issues.

If the app could be configured to create a default temperature preference of say 69C, that would save a lot of pain. If lots of the errors were down to cards simply not having enough memory to run the OPMs (which might be the case), that's another issue only the app can fix.

I like the idea where tips are sent to the Notices on a regular basis. Ideally this could be personalised, but that would be down to Boinc central to introduce. IMO log messages would be of almost zero use - most users rarely, if ever, read the Boinc log files.

Perhaps a project requirement to log into the forums every month would help? This is not a set-and-forget project.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1060
Credit: 1,123,119,589
RAC: 1,357,910
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43478 - Posted: 21 May 2016 | 14:14:48 UTC

It would be very helpful, to some (especially to me!), to see a notice returned from the project's server.

It could say: "In the past week, this host has had x failed tasks, and y tasks with instability warnings. Reliability is a factor that this project uses to determine which hosts get tasks. Click here for more info." I believe it's relatively easy for a project to do that.

Also, it would be nice if the project had a way to communicate this via email. A web preference, let's say, defaulting to being on. And it could evaluate all the hosts for a user, and send an email weekly or monthly, with a way to turn it off in the web preferences. I know I'd use it!

Regarding my particular scenario, I have 2 PCs with NVIDIA GPUs - Speed Racer has 2 GTX 980 Tis, and Racer X has a GTX 970 and 2 GTX 660 Tis. I overclock all 5 of the GPUs to the maximum stable clock, regardless of temps, such that I never see "Simulation has become unstable" (I check regularly). I run GPUGrid tasks 2-per-GPU as my primary project, but have several other backup projects. GPU temps are usually in the range of 65*C to 85*C, with a custom fan curve that hits max fan set at 70*C for GPU Boost v1 and 90*C for GPU Boost v2, with no problems completing tasks. So, I certainly don't want the notices to be based on temperature at all. :)

Until this notification scheme happens, I'll routinely monitor my own results to make sure my overclocks are good. If I ever see "Unstable" in a result, I downclock the GPU another 10 MHz. Note: Some of my recent GPUGrid failures are due to me testing the CPU overclocking limits of Speed Racer, he's only a few weeks old :)

That's my take!

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43479 - Posted: 21 May 2016 | 14:31:00 UTC - in response to Message 43477.
Last modified: 21 May 2016 | 14:47:23 UTC

Everything is getting complicated again and unfortunately that's where people tune out and NOTHING gets done.

Use the principle "KISS" "Keep It Simple Stupid"

Exclude the hosts that need excluding and send them a PM and/or email; if they don't respond, they stay excluded. Bear in mind that if a PM or email does not garner a response, they are probably not interested and couldn't care less, so they stay excluded, FULL STOP.

When you start getting "creative" with methodologies for re-interesting, educating or informing these people, you introduce problems and complications that need not be there.

Please remember there are HOT cards that produce perfectly good results, and there are SLOW cards that are reliable and fast enough.

Unreliable hosts stick out like a sore thumb and can be dealt with easily, without recourse to changing BOINC or the GPUGrid app; and if we keep it simple we MAY be able to convince the administrators of GPUGrid to make the changes.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43483 - Posted: 21 May 2016 | 15:30:20 UTC - in response to Message 43479.

From personal experience it's usually the smaller and older cards that are less capable of running at higher temps. It's also the case that their safe temp limit is affected by task type; some batches run hotter. I've seen cards that are not stable even at reasonable temps (65 to 70C) but run fine if the temps are reduced to say 59C (which, while not reasonable [it requires downclocking], is still achievable). There were several threads here about factory-overclocked cards not working out of the box, but they worked well when set to reference clocks, or with their voltage nudged up a bit.

IF a default setting for temperature prioritization were linked to a Test app, that could correct settings for people who don't really know what they are doing. The people who do can change what they like. In fact, if your settings are saved in something like MSI Afterburner, they are likely to change automatically, certainly on a restart if you have saved your settings. If you just use NVidia Inspector, you can save a file and get it to start automatically when a user logs in (if you know what you are doing).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Send message
Joined: 28 Jul 12
Posts: 455
Credit: 1,130,760,908
RAC: 182,632
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 43485 - Posted: 21 May 2016 | 16:34:57 UTC - in response to Message 43466.
Last modified: 21 May 2016 | 17:13:28 UTC

On the side-show to this thread, 'Linux' (as it might get those 25/26h runs below 24h): the problem is that lots of things change with each version, and that makes instructions for previous versions obsolete. Try to follow instructions tested under Ubuntu 11/12/13 while working with 15.04/15.10 and you will probably not succeed - the short-cuts have changed, the commands have changed, the security rights have changed, the repo drivers are too old...
I've recently tried, without success, to get Boinc on an Ubuntu 15.10 system to see an NV GPU that I popped in. Spent ~2 days at this on and off. The system sees the card and the X-server works fine, but Boinc just seems oblivious. Probably some folder security issue. Tried to upgrade to 16.04, only to be told (after downloading) that the (default sized) boot partition is too small... I would probably need to boot into Grub to repartition - too close to brain surgery to go down that route. Thought it would be faster and easier to format and install 16.04. Downloaded an image onto a W10 system, but it took half a day to find an external DVD-writer and I still can't find a DVD I can write the image to (~1.4GB)...

Lots of luck. By fortuitous (?) coincidence, my SSD failed yesterday, and I tried Ubuntu 16.04 this morning. The good news is that after figuring out the partitioning, I was able to get it installed without incident, except that you have to use a wired connection at first; the WiFi would not connect.

Even installing the Nvidia drivers for my GTX 960 was easy enough with the "System Setting" icon and then "Software Updates/Additional Drivers". That I thought would be the hardest part. Then, I went to "Ubuntu Software" and searched for BOINC. Wonder of wonders, it found it (I don't know which version), and it installed without incident. I could even attach to POEM, GPUGrid and Universe. We are home free, right?

Not quite. None of them show any work available, which is not possible. So we are back to square one, and I will re-install Win7 when a new (larger) SSD arrives.

EDIT: Maybe I spoke too soon. The POEM website does show one work unit completed under Linux at 2,757 seconds, which is faster than the 3,400 seconds that I get for that series (1vii) under Windows. So maybe it will work, but it appears that you have to manage BOINC through the website; I don't see much in the way of local settings or information available yet. We will see.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43490 - Posted: 21 May 2016 | 19:05:51 UTC - in response to Message 43485.

Thanks Jim,
Great to know you can get Ubuntu 16.04 up and running for here (and other GPU based Boinc projects) easily.

There is a dwindling number of tasks available here. Only 373 are in progress, and that will keep falling to zero/1 until a new batch of tasks is released (possibly next week, but unlikely beforehand).

Einstein should have work if you can't pick up any at the other projects. Note however that it can take some time to get work as your system will not have a history of completing work and the tasks being sent out might be prioritised towards known good systems with fast turnaround times.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Send message
Joined: 28 Jul 12
Posts: 455
Credit: 1,130,760,908
RAC: 182,632
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 43491 - Posted: 21 May 2016 | 19:09:20 UTC - in response to Message 43485.
Last modified: 21 May 2016 | 19:44:54 UTC

More good news: BOINC downloaded a lot of Universe work units too.
More bad news: the one POEM work unit was the only one it ran. It would not process any more of them, or any of the Universe ones either. But Ubuntu did pop up a useful notice to the effect that Nvidia cards using CUDA 6.5 or later drivers won't work on CUDA or OpenCL projects. Thanks a lot. I wonder how it completed the one POEM unit?

Finally, I was able to remote in using the X11VNC server. One time that is. After that, it refused all further connections.

I will leave Linux to the experts and retire to Windows. Maybe Mint 18 will be more useful for me. One can always hope.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 455
Credit: 1,130,760,908
RAC: 182,632
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 43492 - Posted: 22 May 2016 | 7:07:59 UTC - in response to Message 43491.

The basic problem appears to be that there is a conflict between the X11VNC server and BOINC. I can do one or the other, but not both. I will just uninstall X11VNC and maybe I can make do with BoincTasks for monitoring this machine, which is a dedicated machine anyway. Hopefully, a future Linux version will fix it.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43493 - Posted: 22 May 2016 | 9:20:30 UTC - in response to Message 43491.
Last modified: 22 May 2016 | 9:41:56 UTC

Got Ubuntu 16.04-x64 LTS up and running last night via a USB stick installation.
After 200MB of system updates, restarts and switching to the 361.42 binary drivers (which might not have been necessary - maybe a restart would have sufficed?), I configured Coolbits, restarted again and installed Boinc. Restarted again, then opened Boinc and attached to here. Work here is sparse, so I'm running POEM tasks. Configured thermal settings (GPU fan speed to 80%).
For comparison/reference, most POEM tasks take ~775 sec (13 min) to complete (range 750 to 830 sec), but some longer runs take ~1975 sec (33 min). Credit is either 5500 or 9100. Temps range from 69C to 73C; GPU clock is ~1278MHz. It's an older AMD system and only PCIE2 x16 (5GT/s), but it works fine with 2 CPU tasks and one GPU task running. Seems faster than W10 x64. Memory is mostly @ 6008MHz, but it occasionally jumped to 7010 of its own accord (which I've never seen before on a 970). 30 valid GPU tasks since last night, no invalids or errors.

Overall I found 16.04 easier to set up for GPU crunching than previous distributions. Many previous distributions didn't have the GPU drivers in the repositories for ages. Hopefully, with this being an LTS version, the repository drivers will be maintained/updated reasonably frequently.
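The setup described above boils down to a handful of commands. This is a hedged sketch for Ubuntu 16.04: the nvidia-361 package name, the Coolbits value, and the nvidia-settings fan attributes are assumptions based on the post, not a verified recipe.

```shell
# Hypothetical sketch of the Ubuntu 16.04 GPU-crunching setup described above.

# Pull in system updates, then switch to the 361.42 binary driver.
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install nvidia-361

# Enable Coolbits so nvidia-settings exposes fan and clock controls,
# then reboot for the new X configuration to take effect.
sudo nvidia-xconfig --cool-bits=4
sudo reboot

# After the reboot: install the BOINC client and manager, then set the
# GPU fan to 80% (as in the post above).
sudo apt-get install boinc-client boinc-manager
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=80"
```

Note that the fan setting only holds for the current X session; it has to be reapplied (e.g. from a login script) after each restart.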
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43494 - Posted: 22 May 2016 | 12:48:18 UTC

3d3lR4-SDOERR_opm996-0-1-RND0292 - I received it after 10 days, 20 hours and 34 minutes:
1. death's (hell yeah, death is International) desktop with an i7-3770K and a GTX 670: 41 successive errors
2. Alen's desktop with a Core2 Quad Q6700 and a GTS 450: 30 errors, 1 valid and 1 too-late task (at least the errors have been fixed)
3. Robert's desktop with an i7-5820K and a GTX 770: 1 not started by deadline, 1 error, 2 user aborts and 4 successful tasks (seems OK now)
4. Megacruncher TSBT's Xeon E5-2650v3 with a GTX 780: 52 successive errors
5. ServicEnginIC's Pentium Dual-Core E6300 with a GTX 750: 1 error and 5 successful tasks
6. Jonathan's desktop with an i7-5820K and a GTX 960: 7 successful and 3 timed-out tasks

Jim1348
Send message
Joined: 28 Jul 12
Posts: 455
Credit: 1,130,760,908
RAC: 182,632
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 43495 - Posted: 22 May 2016 | 20:58:09 UTC - in response to Message 43485.

The POEM website does show one work unit completed under Linux at 2,757 seconds, which is faster than the 3,400 seconds that I get for that series (1vii) under Windows.

Not that it matters much, but I must have misread BOINCTasks and was comparing a 1vii to a 2k39, which always runs faster. So the Linux advantage is not quite that large. Comparing the same type of work units (this time 2dx3d) shows about 20.5 minutes for Win7 and 17 minutes for Linux, or about a 20% improvement (all on GTX 960s). That may be about what we see here.

By the way, BOINCTasks is working nicely on Win7 to monitor the Linux machine, though you have to jump through some hoops to set the permissions on the folders in order to copy the app_config, gui_rpc_auth.cfg and remote_hosts.cfg. And that is after you find where Linux puts them; they are a bit spread out as compared to the BOINC Data folder in Windows. It is a learning experience.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43496 - Posted: 23 May 2016 | 8:33:12 UTC

How does a host with 2 cards have 6 WUs in progress at one time? https://www.gpugrid.net/results.php?hostid=326161 (Monday 23 May, 8:36 UTC)

Vagelis Giannadakis
Send message
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwat
Message 43498 - Posted: 23 May 2016 | 10:49:15 UTC

On the topic of WU timeouts, while all the issues raised in this discussion can cause them, let me point to the most probable (IMO) cause, thoroughly reported by affected users, but not resolved as yet:

GPUGRID's network issues

Most of you know the discussions about these issues, with people reporting they can't upload results, downloads taking forever, etc. One other way these issues manifest themselves is through "phantom" WU assignments, whereby a host requests work, the server grants it, but the HTTP request times out for the host and it never receives the positive response. The WU is assigned to the host, but the host has no knowledge of this, does not download it, and the WU remains there, waiting to time out!

This has happened to me two or three times. I wanted to post the errored-out tasks, but they have been deleted.
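The race described above can be sketched in a few lines. This is a toy model of the failure mode, not BOINC's actual scheduler code; the names, the 5-day deadline and the WU id are purely illustrative:

```python
# Toy model of the "phantom WU" race: the server commits the
# assignment before sending the HTTP reply, so if the reply is
# lost, the WU sits "in progress" until its deadline passes.

class Server:
    def __init__(self):
        self.assignments = {}  # wu_id -> (host, deadline)

    def grant_work(self, host, wu_id, now, deadline_s):
        # The assignment is recorded server-side first...
        self.assignments[wu_id] = (host, now + deadline_s)
        return {"wu_id": wu_id}  # ...and this reply may never arrive

    def timed_out(self, wu_id, now):
        _, deadline = self.assignments[wu_id]
        return now >= deadline

DAY = 86400
server = Server()
server.grant_work("host-1", "wu-42", now=0, deadline_s=5 * DAY)
# HTTP timeout: the host never sees the grant, so it never downloads.
assert "wu-42" in server.assignments           # server: "in progress"
assert not server.timed_out("wu-42", now=DAY)  # day 1: still waiting
assert server.timed_out("wu-42", now=5 * DAY)  # day 5: finally times out
```

The point of the sketch is that nothing on the host side ever errors out; only the server-side deadline eventually clears the phantom.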
____________

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43500 - Posted: 23 May 2016 | 13:58:33 UTC - in response to Message 43498.

On the topic of WU timeouts, while all the issues raised in this discussion can cause them, let me point to the most probable (IMO) cause, thoroughly reported by affected users, but not resolved as yet:

GPUGRID's network issues

Most of you know the discussions about these issues, with people reporting they can't upload results, downloads taking for ever, etc. One other way these issues manifest themselves is by "phantom" WU assignments, whereby a host requests for work, the server grants it work, but the HTTP request times out for the host and it never receives the positive response. The WU is assigned to the host, but the host has no knowledge of this, does not download it and the WU remains there, waiting to timeout!

This has happened for me two or three times. I wanted to post the errored-out tasks, but they have been deleted.

"GPUGRID's network issues"

The GPUGrid network issues are a problem and they never seem to be addressed. Just looked at the WUs supposedly assigned to my machines and there are 2 phantom WUs that the server thinks I have, but I don't:

https://www.gpugrid.net/workunit.php?wuid=11594782

https://www.gpugrid.net/workunit.php?wuid=11602422

As you allude, some of the timeout issues here are due to poor network setup/performance or perhaps BOINC misconfiguration. Maybe someone from one of the other projects could help them out. Haven't seen issues like this anywhere else and have been running BOINC extensively since its inception. Some of (and perhaps a lot of) the timeouts complained about in this thread are due to this poor BOINC/network setup/performance (take your pick).

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43501 - Posted: 23 May 2016 | 14:00:28 UTC - in response to Message 43496.
Last modified: 23 May 2016 | 14:04:52 UTC

How does a host with 2 cards have 6 WUs in progress https://www.gpugrid.net/results.php?hostid=326161 at one time Monday 23 May 8:36 UTC

GPUGrid issues 'up to' 2 tasks per GPU and that system has 3 GPUs, though only 2 are NVidia GPUs!

CPU type AuthenticAMD
AMD A10-7700K Radeon R7, 10 Compute Cores 4C+6G [Family 21 Model 48 Stepping 1]

Coprocessors [2] NVIDIA GeForce GTX 980 (4095MB) driver: 365.10, AMD Spectre (765MB)

It's losing the 50% credit bonus but at least it's a reliable system.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43511 - Posted: 24 May 2016 | 1:05:38 UTC - in response to Message 43501.
Last modified: 24 May 2016 | 1:19:14 UTC

How does a host with 2 cards have 6 WUs in progress https://www.gpugrid.net/results.php?hostid=326161 at one time Monday 23 May 8:36 UTC

GPUGrid issues 'up to' 2 tasks per GPU and that system has 3 GPU's, though only 2 are NVidia GPU's!

CPU type AuthenticAMD
AMD A10-7700K Radeon R7, 10 Compute Cores 4C+6G [Family 21 Model 48 Stepping 1]

Coprocessors [2] NVIDIA GeForce GTX 980 (4095MB) driver: 365.10, AMD Spectre (765MB)

It's losing the 50% credit bonus but at least it's a reliable system.


If it has only 2 CUDA GPUs, it should get only 2 WUs per CUDA GPU, since this project does NOT send WUs to non-CUDA cards.

Card switching is the answer: basically, you put 3 cards into one host and get 6 tasks, then you take a card out, put it into another host and get more tasks.

And if you are asking whether I am accusing Caffeine of doing that....YES. I am.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43516 - Posted: 24 May 2016 | 8:42:59 UTC - in response to Message 43511.
Last modified: 24 May 2016 | 8:48:09 UTC

How does a host with 2 cards have 6 WUs in progress https://www.gpugrid.net/results.php?hostid=326161 at one time Monday 23 May 8:36 UTC

GPUGrid issues 'up to' 2 tasks per GPU and that system has 3 GPU's, though only 2 are NVidia GPU's!

CPU type AuthenticAMD
AMD A10-7700K Radeon R7, 10 Compute Cores 4C+6G [Family 21 Model 48 Stepping 1]

Coprocessors [2] NVIDIA GeForce GTX 980 (4095MB) driver: 365.10, AMD Spectre (765MB)

It's losing the 50% credit bonus but at least it's a reliable system.


If it has only 2 CUDA GPU's it should only get 2 WU's per CUDA GPU since this project does NOT send WU's to NON Cuda cards.

Card switching is the answer, basically you put 3 cards into one host and get 6 tasks you then take a card out and put it into another host and get more tasks.

And if you are asking whether I am accusing Caffeine of doing that....YES. I am.


Work fetch is where the Boinc Manager comes into play and confuses the matter. GPUGrid would need to put in more server-side configuration and routines to deal with that better, or possibly remove the AMD app, but this problem just happened upon GPUGrid.
Setting 1 WU per GPU would be simpler, and more fair (especially with so few tasks available), and go a long way to rectifying the situation.

While GPUGrid doesn't presently have an active AMD/ATI app, it sort-of does have an AMD/ATI app - the MT app for CPU's+AMD's:
https://www.gpugrid.net/apps.php
Maybe somewhere on the GPUGrid server they can set something up so as not to send so many tasks, but I don't keep up with all the development of Boinc these days.

It's not physical card switching (inserting and removing cards) because that's an integrated AMD/ATI GPU.
Ideally GPUGrid's server would recognise that there are only 2 NVidia GPUs and send out tasks based on that number, but it's a stock Boinc server that's used. While there is a way for the user/cruncher to exclude a GPU type for a project (Client_configuration), it's very hands-on, and if AMD work did turn up here they wouldn't get any.
http://boinc.berkeley.edu/wiki/Client_configuration
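For reference, the per-project GPU exclusion mentioned above goes in BOINC's cc_config.xml. A minimal sketch (the URL and type values here are just for illustration; see the linked Client_configuration page for the full element):

```xml
<cc_config>
  <options>
    <!-- Don't use the AMD/ATI GPU for this project -->
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <type>ATI</type>
    </exclude_gpu>
  </options>
</cc_config>
```

As the post says, this is very hands-on: if AMD work ever did appear here, the excluded GPU would get none of it.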

My guess is that having 3 GPUs (even though one isn't useful) inflates the status of the system; as the system returns positive results (no failures) it's rated highly, but more so because there are 3 GPUs in it. So it's even more likely to get work than a system with 2 GPUs and an identical yield.
Not sure anything is being done deliberately. It's likely the integrated ATI GPU is being used exclusively for display purposes, and why would you want to get 25% less credit for the same amount of work? From experience 'playing' with various integrated and mixed GPU types, they are a pain to set up, and once you get them working you don't want to change anything. That might be the case here.

Tomas Brada
Send message
Joined: 3 Nov 15
Posts: 37
Credit: 1,775,725
RAC: 0
Level
Ala
Scientific publications
wat
Message 43517 - Posted: 24 May 2016 | 10:37:23 UTC
Last modified: 24 May 2016 | 10:39:24 UTC

x.

Tomas Brada
Send message
Joined: 3 Nov 15
Posts: 37
Credit: 1,775,725
RAC: 0
Level
Ala
Scientific publications
wat
Message 43518 - Posted: 24 May 2016 | 10:37:29 UTC

I notice a lot of users have great trouble installing Linux+BOINC.
I am toying with the idea of writing a short guide to setting up a basic Debian install and configuring BOINC. Would you appreciate it?
It should be on-line this or the next week.

About that WU problem: the PrimeGrid project utilizes "tickles". Large tasks are sent out with a short deadline, and if the tickle is successful and your computer is actively working on the task, the deadline is extended. The GPUGrid project could benefit from this.
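The deadline-extension idea sketched above is simple to model. This is a toy illustration, not PrimeGrid's actual server code; the class name and the 2-day intervals are made up:

```python
# Toy model of trickle-based deadline extension: tasks ship with a
# short deadline, and each trickle from an actively crunching host
# pushes the deadline out again.

DAY = 86400

class Task:
    def __init__(self, issued_at, initial_deadline_s=2 * DAY):
        self.deadline = issued_at + initial_deadline_s

    def on_trickle(self, now, extension_s=2 * DAY):
        # Only an active host sends trickles, so extend from "now".
        self.deadline = max(self.deadline, now + extension_s)

    def timed_out(self, now):
        return now > self.deadline

task = Task(issued_at=0)
assert not task.timed_out(1 * DAY)
task.on_trickle(now=1 * DAY)          # host reports progress on day 1
assert not task.timed_out(2.5 * DAY)  # deadline pushed out to day 3
assert task.timed_out(4 * DAY)        # no further trickles: times out
```

The effect is that inactive hosts lose the task quickly, while active ones are never cut off mid-run.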

____________

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1060
Credit: 1,123,119,589
RAC: 1,357,910
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43519 - Posted: 24 May 2016 | 12:20:12 UTC

"Trickles", not "tickles".

And I happen to think that GPUGrid's current deadlines are sufficient for most of its users to get done on time; I believe we don't need trickles. Interesting idea, though! NOTE: RNA World also uses trickles to auto-extend task deadlines on the server. Some of my tasks are approaching 300 days of compute time already :)

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43520 - Posted: 24 May 2016 | 12:48:59 UTC - in response to Message 43518.
Last modified: 24 May 2016 | 12:51:26 UTC

How to - install Ubuntu 16.04 x64 Linux & setup for GPUGrid

Discussion of Ubuntu 16.04-x64 LTS installation and configuration

ClimatePrediction uses trickle uploads too. Was suggested for here years ago but wasn't suitable then and probably still isn't.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43527 - Posted: 24 May 2016 | 18:02:14 UTC - in response to Message 43494.

3d3lR4-SDOERR_opm996-0-1-RND0292 I've received it after 10 days and 20 hours and 34 minutes
1. death's (hell yeah, death is International) desktop with i7-3770K and GTX 670 it has 41 successive errors
2. Alen's desktop with Core2 Quad Q6700 and GTS 450 it has 30 errors, 1 valid and 1 too late tasks (at least the errors have fixed)
3. Robert's desktop with i7-5820K and GTX 770 it has 1 not started by deadline, 1 error, 2 user aborts and 4 successful tasks (seems ok now)
4. Megacruncher TSBT's Xeon E5-2650v3 with GTX 780 it has 52 successive errors
5. ServicEnginIC's Pentium Dual-Core E6300 with GTX 750 it has 1 error and 5 successful tasks
6. Jonathan's desktop with i7-5820K and GTX 960 it has 7 successful and 3 timed out tasks

Here's the kind of thing that I find most mystifying: running a 980Ti GPU, then holding the WU for 5 days until it gets sent out again, negating the usefulness of the next user's contribution and missing all bonuses. A big waste of time and resources:

https://www.gpugrid.net/workunit.php?wuid=11602161

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43528 - Posted: 24 May 2016 | 18:16:01 UTC - in response to Message 43527.
Last modified: 24 May 2016 | 18:27:51 UTC

Doesn't bother cooling the $650 card either!

# GPU 0 : 55C
# GPU 0 : 59C
# GPU 0 : 64C
# GPU 0 : 68C
# GPU 0 : 71C
# GPU 0 : 73C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 0 : 81C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1076MHz
# Memory clock : 3805MHz
# Memory width : 384bit
# Driver version : r364_69 : 36510

https://www.gpugrid.net/result.php?resultid=15108452

It's also clear that the user keeps a high cache level and frequently gets wonky credits:

https://www.gpugrid.net/results.php?hostid=331964

PS. Running a GERARD_FXCX... task on my Linux system (GTX970), and two on my W10 system. Linux looks ~16% faster than W10, and that's with the W10 system being slightly overclocked and supported by a faster CPU. As observed before, the difference is likely larger for bigger cards: with a GTX980Ti it's probably greater (maybe ~20%) and with a GTX750Ti probably less (maybe ~11%).

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1060
Credit: 1,123,119,589
RAC: 1,357,910
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43529 - Posted: 24 May 2016 | 18:23:34 UTC
Last modified: 24 May 2016 | 18:24:14 UTC

I have 2 GTX 980 Ti's in my new rig. I have an aggressive MSI Afterburner fan profile, that goes 0% fan @ 50*C, to 100% fan @ 90*C.

One of my GPUs sees temps up-to-85*C. Another up-to-75*C. I consider the cooling adequate, so long as the clocks are stable. I'm working on finding the max stable overclocks, presently.
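The fan profile described above amounts to a clamped linear ramp between two points. A minimal sketch, assuming straight-line interpolation (Afterburner profiles can have more points, so this is just the two-point case):

```python
# Linear fan curve: 0% at 50C, 100% at 90C, clamped at both ends.

def fan_percent(temp_c, t_low=50.0, t_high=90.0):
    if temp_c <= t_low:
        return 0.0
    if temp_c >= t_high:
        return 100.0
    return (temp_c - t_low) / (t_high - t_low) * 100.0

assert fan_percent(45) == 0.0     # below the ramp: fan off
assert fan_percent(70) == 50.0    # midpoint of the 50-90C ramp
assert fan_percent(85) == 87.5    # the "runs up to 85C" case
assert fan_percent(95) == 100.0   # clamped at full speed
```

At the reported 75-85C running temps, such a curve sits in the upper half of the ramp, which matches the "stable but hot" behaviour described.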

Example result:
https://www.gpugrid.net/result.php?resultid=15106295

So ... sometimes, a system just runs hot. Stable, but hot. All my systems are hot, overclocked to max stable clocks, CPU and GPU. They refuse to take their shirts off, and deliver a rockstar performance every time.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43530 - Posted: 24 May 2016 | 18:24:08 UTC - in response to Message 43528.

Possibly running other NV projects and not getting back to the GPUGrid WU until BOINC goes into panic mode.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43531 - Posted: 24 May 2016 | 18:34:15 UTC - in response to Message 43529.

I have 2 GTX 980 Ti's in my new rig. I have an aggressive MSI Afterburner fan profile, that goes 0% fan @ 50*C, to 100% fan @ 90*C.

One of my GPUs sees temps up-to-85*C. Another up-to-75*C. I consider the cooling adequate, so long as the clocks are stable. I'm working on finding the max stable overclocks, presently.

Example result:
https://www.gpugrid.net/result.php?resultid=15106295

So ... sometimes, a system just runs hot. Stable, but hot. All my systems are hot, overclocked to max stable clocks, CPU and GPU. They refuse to take their shirts off, and deliver a rockstar performance every time.


# GPU 1 : 74C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 1 : 75C
# GPU 0 : 81C
# GPU 0 : 82C
# GPU 0 : 83C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]

That suggests to me that the GPU was running too hot, the task became unstable and the app suspended crunching for a bit and recovered (recoverable errors). Matt added that suspend-recover feature some time ago IIRC.


Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1060
Credit: 1,123,119,589
RAC: 1,357,910
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43532 - Posted: 24 May 2016 | 18:42:16 UTC - in response to Message 43531.
Last modified: 24 May 2016 | 18:43:06 UTC

I have 2 GTX 980 Ti's in my new rig. I have an aggressive MSI Afterburner fan profile, that goes 0% fan @ 50*C, to 100% fan @ 90*C.

One of my GPUs sees temps up-to-85*C. Another up-to-75*C. I consider the cooling adequate, so long as the clocks are stable. I'm working on finding the max stable overclocks, presently.

Example result:
https://www.gpugrid.net/result.php?resultid=15106295

So ... sometimes, a system just runs hot. Stable, but hot. All my systems are hot, overclocked to max stable clocks, CPU and GPU. They refuse to take their shirts off, and deliver a rockstar performance every time.


# GPU 1 : 74C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 1 : 75C
# GPU 0 : 81C
# GPU 0 : 82C
# GPU 0 : 83C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]

That suggests to me that the GPU was running too hot, the task became unstable and the app suspended crunching for a bit and recovered (recoverable errors). Matt added that suspend-recover feature some time ago IIRC.

No. I believe you are incorrect.

The simulation will only terminate/retry when it says:
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart

Try not to jump to conclusions about hot machines :)

I routinely search my results for "stab", and if I find an instability message matching that text, I know to downclock my overclock a bit more.
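That manual search can be scripted. A minimal sketch, assuming you have the stderr texts of your results at hand (the result IDs and log lines below are illustrative, not real data):

```python
# Scan saved stderr outputs for the instability message quoted above
# and flag the results whose host should be downclocked.

UNSTABLE = "The simulation has become unstable"

def flag_unstable(results):
    """results: dict of result_id -> stderr text."""
    return [rid for rid, text in results.items() if UNSTABLE in text]

logs = {
    "15106295": "# GPU 0 : 83C\n# Time per step: 2.1 ms\n",
    "15108452": "# The simulation has become unstable. "
                "Terminating to avoid lock-up (1)\n# Attempting restart\n",
}
assert flag_unstable(logs) == ["15108452"]
```

The same idea works as a plain text search over downloaded result pages; matching on the full message avoids false hits on the word "stable".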

Skyler Baker
Send message
Joined: 19 Feb 16
Posts: 19
Credit: 136,574,536
RAC: 20,687
Level
Cys
Scientific publications
wat
Message 43533 - Posted: 24 May 2016 | 18:53:12 UTC

Truthfully, I'm not entirely sure a rig with multiple GPUs can even be kept very cool. I keep my 980ti fan profile at 50% and it tends to run at about 60-65C, but the heat makes my case and even CPU cooler fans crank up; I'd imagine a pair of them would get pretty toasty. I'm quite sure my fan setup could keep 2 below 70C, but it would sound like a jet engine and literally could not be kept in living quarters.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43542 - Posted: 25 May 2016 | 8:30:25 UTC
Last modified: 25 May 2016 | 8:40:57 UTC

The workunit with the worst history I've ever received:
I received 2wf1R8-SDOERR_opm996-0-1-RND6142 after 15 days, 8 hours and 22 minutes. Its previous hosts:
1. Gagiman's desktop with i5-4670 and GTX 970: 24 errors in a row and 1 successful task (probably it's OK now)
2. [PUGLIA] kidkidkid3's desktop with Core2 Quad Q9450 and GTX 750 Ti: 29 user aborts and 4 successful tasks
3. Ralph M. Fay III's laptop with i7-4600M and GT 730M (1GB): 7 errors and 1 successful task (probably it's OK now)
4. Evan's desktop with i5-3570K and GTX 480: 35 successive immediate errors
5. Sean's desktop with i7-4770K and GTX 750 Ti: 10 errors and 2 successful tasks (probably it's OK now)
6. Megacruncher TSBT's desktop with AMD FX-6300 and GTX 580: 1 aborted, 6 errors (sim. unstable) and 3 not-started-by-deadline tasks
7. Car a carn's desktop with i7-3770 and GTX 980 Ti: 2 successful tasks
The task actually succeeded on this host. It spent 10 days and 2 hours on this host alone.
8. shuras' desktop with i5-2500K and GTX 670: 1 user abort, 1 successful and 1 timed-out task
9. My desktop with i7-980x and two GTX 980: 41 successful, 2 user-aborted, 1 ghost (timed out) and 12 error tasks
I aborted this task after 5 hours, when I checked its history and noticed that it had already succeeded.
Note on the 12 errors on my host: these are leftovers from August & September 2013, March & September 2014 and March 2015 which should have been removed from the server long ago. All 12 errors are the result of bad batches.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43543 - Posted: 25 May 2016 | 8:51:53 UTC - in response to Message 43476.

... we should find other means to reach the problematic contributors.
It should be made very clear right at the start (on the project's homepage, in the BOINC manager when a user tries to join the project, in the FAQ etc) the project's minimum requirements:
1. A decent NVidia GPU (GTX 760+ or GTX 960+)
2. No overclocking (later you can try, but read the forums)
3. Other GPU projects are allowed only as a backup (0 resource share) project.
Some tips should be broadcast by the project as a notice on a regular basis about the above 3 points. Also, there should be someone/something to send an email to users who have unreliable hosts, or perhaps their username/hostname should be broadcast as a notice.
I'd like to add the following:
4. Don't suspend the GPU tasks while your computer is in use, or at least set its timeout to 30 minutes. It's better to set up the list of exclusive apps (games, etc) in BOINC manager.
5. If you don't have a high-end GPU & you switch your computer off daily then GPUGrid is not for you.

I strongly recommend that the GPUGrid staff broadcast the list of worst hosts & the tips above every month (while needed).

Stefan
Volunteer moderator
Project developer
Project scientist
Send message
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 43545 - Posted: 25 May 2016 | 9:08:38 UTC

Wow ok, this thread derailed. We are supposed to keep discussions related just to the specific WUs here, even though I am sure it's a very productive discussion in general :)
I am a bit out of time right now, so I won't split threads and will just open a new one, because I will resend the OPM simulations soon.

Right now I am trying to look into the discrepancies between projected runtimes and real runtimes as well as credits to hopefully do it better this time.

The thing with excluding bad hosts is unfortunately not doable as the queuing system of BOINC apparently is pretty stupid and would exclude all of Gerard's WUs until all of mine finished if I send them with high priority :(

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43546 - Posted: 25 May 2016 | 9:23:33 UTC - in response to Message 43545.
Last modified: 25 May 2016 | 9:25:07 UTC

Wow ok, this thread derailed.
Sorry, but that's the way it goes :)

We are supposed to keep discussions related just to the specific WUs here, even though I am sure it's a very productive discussions in general :)
It's good to have a confirmation that you are reading this :)

I am a bit out of time right now so I won't split threads and will just open a new one because I will resend OPM simulations soon.
Will there be very long ones (~18-20 hours on a GTX980Ti)? If so, I will reduce my cache to 0.03 days.

Right now I am trying to look into the discrepancies between projected runtimes and real runtimes as well as credits to hopefully do it better this time.
We'll see. :) I keep my fingers crossed.

The thing with excluding bad hosts is unfortunately not doable as the queuing system of BOINC apparently is pretty stupid and would exclude all of Gerard's WUs until all of mine finished if I send them with high priority :(
This leaves us with broadcasting as the only way to make things better.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43547 - Posted: 25 May 2016 | 9:30:27 UTC - in response to Message 43545.
Last modified: 25 May 2016 | 9:33:12 UTC

There are things within your control that would mitigate the problem.

Reduce the baseline WUs available per GPU per day from the present 50 to 10, and reduce WUs per GPU to one at a time.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43548 - Posted: 25 May 2016 | 9:43:25 UTC - in response to Message 43545.
Last modified: 25 May 2016 | 9:45:37 UTC

The thing with excluding bad hosts is unfortunately not doable as the queuing system of BOINC apparently is pretty stupid and would exclude all of Gerard's WUs until all of mine finished if I send them with high priority :(

The only complete/long-term way around that might be to separate the research types using different apps & queues. Things like that were explored in the past and the biggest obstacle was the time-intensive maintenance for everyone; crunchers would have to select different queues and be up to speed with what's going on and you would have to spend more time on project maintenance (which isn't science). There might also be subsequent server issues.
If the OPMs were released in the beta queue, would that server priority still apply (is priority applied per queue, per app or per project)?
Given how hungry GPUGrid crunchers are these days, how long would it take to clear the prioritised tasks, and could they be drip-fed into the queue (in small batches)?

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43549 - Posted: 25 May 2016 | 9:45:55 UTC - in response to Message 43547.
Last modified: 25 May 2016 | 9:46:40 UTC

Reduce baseline WUs available per GPU per day from the present 50 to 10
That's a good idea. It could even be reduced to 5.
... and reduce WUs per gpu to one at a time.
I'm ambivalent about this.
Perhaps a 1-hour delay between WU downloads would be enough to spread the available workunits evenly among the hosts.

I see different max-tasks-per-day numbers on my different hosts with the same GPUs; is this how it should be?

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43550 - Posted: 25 May 2016 | 9:54:28 UTC - in response to Message 43549.
Last modified: 25 May 2016 | 9:55:33 UTC


I see different max task per day numbers on my different hosts with same GPUs, is this how it should be?


I believe so. You start at 50; when you return a valid result it goes up by 1, and when you send an error, abort a WU, or the server cancels, it goes back down to 50.

50 is a ridiculously high number anyway and, as you have said, could be reduced to 5 to the benefit of both the user and the project.
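The quota behaviour as described in this post can be sketched as follows. Note this follows the description above (baseline, +1 per valid result, reset to baseline on error); the real BOINC server logic may differ in detail:

```python
# Toy per-host daily quota: starts at the baseline, earns +1 per
# valid result, and drops back to the baseline on any error,
# abort, or server cancel.

class HostQuota:
    def __init__(self, baseline=50):
        self.baseline = baseline
        self.max_per_day = baseline

    def on_valid(self):
        self.max_per_day += 1

    def on_error(self):
        # Never drop below the baseline, never keep earned credit.
        self.max_per_day = min(self.max_per_day, self.baseline)

q = HostQuota(baseline=50)
for _ in range(5):
    q.on_valid()
assert q.max_per_day == 55  # 5 valid results earned +5
q.on_error()
assert q.max_per_day == 50  # one error resets to the baseline
```

Under this scheme, lowering the baseline from 50 to 10 only caps how much work a freshly erroring host can pull per day; reliable hosts still grow their quota with each valid result.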

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43551 - Posted: 25 May 2016 | 9:57:47 UTC - in response to Message 43545.

The thing with excluding bad hosts is unfortunately not doable as the queuing system of BOINC apparently is pretty stupid and would exclude all of Gerard's WUs until all of mine finished if I send them with high priority :(
I recall that there was a "blacklist" of hosts in the GTX480-GTX580 era. My host once got blacklisted upon the release of the CUDA 4.2 app: that app was much faster than the previous CUDA 3.1 one, so the cards tolerated less overclocking, and my host began to throw errors until I reduced its clock frequency. It could not get tasks for 24 hours, IIRC. However, it seems that later, when the BOINC server software at GPUGrid was updated, this "blacklist" feature disappeared. It would be nice to have this feature again.

Stefan
Volunteer moderator
Project developer
Project scientist
Send message
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 43554 - Posted: 25 May 2016 | 10:58:10 UTC

Ok, Gianni changed the baseline WUs available per GPU per day from 50 to 10

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43556 - Posted: 25 May 2016 | 11:20:51 UTC - in response to Message 43554.
Last modified: 25 May 2016 | 11:23:22 UTC

Ok, Gianni changed the baseline WUs available per GPU per day from 50 to 10
Thanks!
EDIT: I don't see any change yet on my hosts' max number of tasks per day...

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 43557 - Posted: 25 May 2016 | 11:31:13 UTC - in response to Message 43554.
Last modified: 25 May 2016 | 11:55:36 UTC

Ok, Gianni changed the baseline WUs available per GPU per day from 50 to 10


I don't want to sound in any way disrespectful with this post, so please don't take offence; here goes.

WOOHOO! A sign that things CAN be done, instead of "can't do that, not doable".

Thank you, Stefan and Gianni, for taking this first important step towards making this project more efficient. When you witness the decline in error rates on the server status page (which, while small, should be evident), perhaps you will consider reducing the baseline to 5 and employing other initiatives to reduce errors/timeouts and to ensure work is spread evenly/fairly over the GPUGrid userbase. That will make this project faster and more efficient, with a happier and hopefully growing core userbase.

Jacob Klein
Joined: 11 Oct 08
Posts: 1060
Credit: 1,123,119,589
RAC: 1,357,910
Message 43558 - Posted: 25 May 2016 | 12:12:51 UTC - in response to Message 43543.

... we should find other means to reach the problematic contributors.
It should be made very clear right at the start (on the project's homepage, in the BOINC manager when a user tries to join the project, in the FAQ etc) the project's minimum requirements:
1. A decent NVidia GPU (GTX 760+ or GTX 960+)
2. No overclocking (later you can try, but read the forums)
3. Other GPU projects are allowed only as a backup (0 resource share) project.
Some tips about the above 3 points should be broadcast by the project as a notice on a regular basis. Also, someone (or something) should be able to email users who have unreliable hosts, or perhaps their username/hostname should be broadcast as a notice.
I'd like to add the following:
4. Don't suspend the GPU tasks while your computer is in use, or at least set its timeout to 30 minutes. It's better to set up the list of exclusive apps (games, etc) in BOINC manager.
5. If you don't have a high-end GPU & you switch your computer off daily then GPUGrid is not for you.

I strongly recommend that the GPUGrid staff broadcast the list of the worst hosts, plus the tips above, every month (for as long as needed).


I know my opinions aren't liked very much here, but I wanted to express my response to these 5 proposed "minimum requirements".

1. A decent NVidia GPU (GTX 760+ or GTX 960+)
--- I disagree. The minimum GPU should be one that is supported by the toolset the devs release apps for, and one that can return results within the timeline they define. If they want results returned in a 6-week-time-period, and a GTS 250 fits the toolset, I see no reason why it should be excluded.

2. No overclocking (later you can try, but read the forums)
--- I disagree. Overclocking can provide tangible performance results, when done correctly. It would be better if the task's final results could be verified by another GPU, for consistency, as it seems currently that an overclock that is too high can still result in a successful completion of the task. I wish I could verify that the results were correct, even for my own overclocked GPUs. Right now, the only tool I have is to look at stderr results for "Simulation has become unstable", and downclock when I see it. GPUGrid should improve on this somehow.

3. Other GPU projects are allowed only as a backup (0 resource share) project.
--- I disagree. Who are you to define what I'm allowed to use? I am attached to 58 projects. Some have GPU work, some have CPU work, some have ASIC work, and some have non-CPU-intensive work. I routinely get "non-backup" work from about 15 of them, all on the same PC.

4. Don't suspend the GPU tasks while your computer is in use, or at least set its timeout to 30 minutes. It's better to set up the list of exclusive apps (games, etc) in BOINC manager.
--- I disagree. I am at my computer during all waking hours, and I routinely suspend BOINC, and even shut down BOINC, because I have some very-long-running-tasks (300 days!) that I don't want to possibly get messed up, as I do things like install/uninstall software or update Windows. Suspending and shutting down should be completely supported by GPUGrid, and to my knowledge, they are.

5. If you don't have a high-end GPU & you switch your computer off daily then GPUGrid is not for you.
--- I disagree. GPUGrid tasks have a 5-day-deadline, currently, to my knowledge. So, if your GPU isn't on enough to complete any GPUGrid task within their deadline, then maybe GPUGrid is not for you.

These "minimum requirements" are... not great suggestions, for someone like me at least. I realize I'm an edge case. But I'd imagine that lots of people would take issue with at least a couple of the 5.

I feel that any project can define great minimum requirements by:
- setting up their apps appropriately
- massaging their deadlines appropriately
- restricting bad hosts from wasting time
- continually looking for ways to improve throughput

I'm glad the project is now (finally?) looking for ways to continuously improve.

Betting Slip
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Message 43559 - Posted: 25 May 2016 | 12:24:38 UTC - in response to Message 43558.
Last modified: 25 May 2016 | 12:25:30 UTC


I feel that any project can define great minimum requirements by:
- setting up their apps appropriately
- massaging their deadlines appropriately
- restricting bad hosts from wasting time
- continually looking for ways to improve throughput

I'm glad the project is now (finally?) looking for ways to continuously improve.


I don't know whether you're right about your opinions not being liked, but you are entitled to them. Opinions are just that; you have a right to espouse and defend them. I for one can see nothing wrong with the list you posted.

If a host is reliable and returns WUs within the deadline period, I don't think it matters whether it's a 750 Ti or a 980 Ti, or whether it runs 24/7 or 12/7. I myself have a running and working 660 Ti which is reliable and does just that.

Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Message 43562 - Posted: 25 May 2016 | 13:01:39 UTC - in response to Message 43557.
Last modified: 25 May 2016 | 13:01:55 UTC

If you like success stories Betting Slip then you can have another one :D
We found the reason for the underestimation of the OPM runtimes (and of all further equilibrations we will ever send to GPUGRID).

When we calculate the projected runtime, we run the first 500 steps of the simulation. However, our equilibrations actually do some faster calculations during the first 500 steps and then switch to slower ones, so the projection was underestimating the runtime by quite a bit (one example: 17 vs 24 hours).

This has now been fixed, so the credits should better reflect the real runtime. I am nearly feeling confident enough to submit the rest of the OPM now, hehe.
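(The bias Stefan describes is easy to reproduce on paper. A sketch with made-up per-step timings — the 500-step benchmark is from the post, but the rates and step count below are invented for illustration:)

```python
def project_runtime(benchmark_steps, benchmark_seconds, total_steps):
    """Naive extrapolation: assume every step costs what the
    benchmarked steps cost on average."""
    return benchmark_seconds * total_steps / benchmark_steps

# Invented timings: the first 500 (minimization-like) steps are cheaper
# than the remaining (equilibration) steps.
fast_rate = 0.01    # seconds/step during the first 500 steps (assumed)
slow_rate = 0.014   # seconds/step afterwards (assumed)
total_steps = 5_000_000

naive = project_runtime(500, 500 * fast_rate, total_steps)
true = 500 * fast_rate + (total_steps - 500) * slow_rate
print(f"projected {naive / 3600:.1f} h, actual {true / 3600:.1f} h")
# projected 13.9 h, actual 19.4 h
```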

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Message 43563 - Posted: 25 May 2016 | 13:03:13 UTC - in response to Message 43558.
Last modified: 25 May 2016 | 13:11:51 UTC

I know my opinions aren't liked very much here
That should never make you refrain from expressing your opinion.

but I wanted to express my response to these 5 proposed "minimum requirements".
It was a mistake to call these "minimum requirements"; they were intended as advice for novices, and perhaps that's what makes them unavailing.

These "minimum requirements" are... not great suggestions, for someone like me at least. I realize I'm an edge case. But I'd imagine that lots of people would take issue with at least a couple of the 5.
If you keep an eye on your results, you can safely skip these "recommendations". We can, and should, refine them to make them more appropriate and less offensive; I made their wording harsh on purpose to provoke a debate. But I can show you results and hosts which validate my 5 points (just browse the links in my post about the workunit with the worst history I've ever received, and the other similar ones).
The recommended minimum GPU should be better than the recently suggested ones (~GTX 750-GTX 660), as the release of the new GTX 10x0 series will result in longer workunits by the end of this year, and the project should not lure in new users with lesser cards only to frustrate them in 6 months.

Betting Slip
Joined: 5 Jan 09
Posts: 584
Credit: 2,004,846,200
RAC: 1,662,786
Message 43566 - Posted: 25 May 2016 | 13:22:52 UTC - in response to Message 43562.

If you like success stories Betting Slip then you can have another one :D
We found the reason for the underestimation of the OPM runtime (and all further equilibrations we ever send to GPUGRID).

When we calculate the projected runtime we try the first 500 steps of the simulation. However our equilibrations actually do some faster calculations during the first 500 steps and then switch to some slower ones, so they were underestimating the runtime by quite a bit (one example: 17 vs 24 hours).

This has now been fixed, so the credits should reflect better the real runtime. I am nearly feeling confident enough to submit the rest of the OPM now, hehe.


Thanks Stefan and let em rip.

Hope these simulations are producing the results you expected.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Message 43567 - Posted: 25 May 2016 | 13:36:34 UTC - in response to Message 43356.
Last modified: 25 May 2016 | 13:37:06 UTC

It would be hugely appreciated if you could find a way of hooking up the projections of that script to the <rsc_fpops_est> field of the associated workunits. With the BOINC server version in use here, a single mis-estimated task (I have one which has been running for 29 hours already) can mess up the BOINC client's scheduling - for other projects, as well as this one - for the next couple of weeks.
+1
Could you please set the <rsc_fpops_est> field and the <rsc_disk_bound> field correctly for the new tasks?
The <rsc_disk_bound> is set to 8*10^9 bytes (7.45GB), which is at least one order of magnitude higher than necessary.
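(For anyone curious how <rsc_fpops_est> relates to the client's scheduling: to first order, the client's initial duration estimate is rsc_fpops_est divided by the speed it attributes to the host, so a reasonable value can be derived from the intended runtime. A back-of-envelope sketch — the formula is a simplification of the real client logic, and the numbers are hypothetical:)

```python
def fpops_est(target_seconds: float, assumed_flops: float) -> float:
    """Rough <rsc_fpops_est>: BOINC's first runtime estimate is
    approximately rsc_fpops_est / host_speed, so deriving the value
    from the intended runtime keeps client scheduling sane."""
    return target_seconds * assumed_flops

# Hypothetical: aim for ~12 h on a card the scheduler rates at 1.5 TFLOPS.
est = fpops_est(12 * 3600, 1.5e12)
print(f"<rsc_fpops_est>{est:.4g}</rsc_fpops_est>")  # <rsc_fpops_est>6.48e+16</rsc_fpops_est>
```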

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Message 43568 - Posted: 25 May 2016 | 14:04:21 UTC - in response to Message 43558.

I know my opinions aren't liked very much here, but I wanted to express my response to these 5 proposed "minimum requirements".

I like your opinions. Whether or not I agree with them, they're always well thought out.

These "minimum requirements" are... not great suggestions, for someone like me at least. I realize I'm an edge case. But I'd imagine that lots of people would take issue with at least a couple of the 5.

I feel that any project can define great minimum requirements by:
- setting up their apps appropriately
- massaging their deadlines appropriately
- restricting bad hosts from wasting time
- continually looking for ways to improve throughput

I'm glad the project is now (finally?) looking for ways to continuously improve.

Thumbs up and +1.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 335
Credit: 3,800,087,309
RAC: 894,669
Message 43580 - Posted: 26 May 2016 | 0:09:26 UTC - in response to Message 43545.

Wow ok, this thread derailed. We are supposed to keep discussions related just to the specific WUs here, even though I am sure it's a very productive discussion in general :)


That's what happens when you allow the lunatics to run the asylum.


I am a bit out of time right now so I won't split threads and will just open a new one because I will resend OPM simulations soon.


Ok, bring them on. I'm ready.



Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Message 43581 - Posted: 26 May 2016 | 9:31:57 UTC - in response to Message 43580.
Last modified: 26 May 2016 | 9:33:57 UTC

How I imagine your GPUs after OPM:

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Message 43583 - Posted: 26 May 2016 | 11:32:00 UTC - in response to Message 43581.



A simulation containing only 35632 atoms is a piece of cake.

Erich56
Joined: 1 Jan 15
Posts: 369
Credit: 1,606,755,102
RAC: 2,771,359
Message 43584 - Posted: 26 May 2016 | 16:49:45 UTC - in response to Message 43567.

... The <rsc_disk_bound> is set to 8*10^9 (7.45GB) which is at least one order of magnitude higher then necessary.

when I temporarily ran BOINC on a RAMDisk some weeks ago, I was harshly confronted with this problem.
There was only limited disk space available for BOINC, and each time the free RAMDisk space went below 7,629 MB (7.45 GB), the BOINC manager did not download new GPUGRID tasks (the event log complained about too little free disk space).

I contacted the GPUGRID people, and they told me they will look into this at some point; it can't be done right now, though, as Matt is not available for some reason (and he seems to be the only one who could change/fix it).
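(The client-side behaviour described here is straightforward: work fetch is refused whenever a task's declared disk bound does not fit in the space BOINC may use. A simplified sketch — the check is a simplification of the real client logic; the 8*10^9 figure is the one quoted above:)

```python
RSC_DISK_BOUND = 8_000_000_000  # bytes: the 8*10^9 (~7.45 GB) quoted above

def can_fetch_task(free_boinc_disk_bytes: int) -> bool:
    """Simplified version of the client's pre-download check: a task
    is only fetched if its declared worst-case disk use fits."""
    return free_boinc_disk_bytes >= RSC_DISK_BOUND

print(can_fetch_task(7_629_000_000))   # False: a 7,629 MB RAM disk is refused
print(can_fetch_task(40_000_000_000))  # True: a roomier partition is accepted
```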

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,826,562,314
RAC: 416,629
Message 43587 - Posted: 26 May 2016 | 20:19:57 UTC - in response to Message 43584.

Are the GERARD_CXCL12VOLK_ Work Units step 2 of the OPM simulations or extensions of the GERARD_FCCXCL work - or something else?

PS Nice to see plenty of tasks over the long weekend:
Tasks ready to send 2,413
Tasks in progress 2,089

Will these auto-generate new work?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Richard Haselgrove
Joined: 11 Jul 09
Posts: 783
Credit: 1,391,041,045
RAC: 1,248,198
Message 43589 - Posted: 26 May 2016 | 20:41:42 UTC - in response to Message 43588.

I haven't tried this, but theoretically it should work.

What theory is that? It isn't a defined field, according to the Application configuration documentation.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1836
Credit: 10,417,826,194
RAC: 8,674,762
Message 43590 - Posted: 26 May 2016 | 21:31:25 UTC - in response to Message 43589.
Last modified: 26 May 2016 | 21:33:32 UTC

I haven't tried this, but theoretically it should work.

What theory is that? It isn't a defined field, according to the Application configuration documentation.

Oh, my bad!
That won't work...
I read a couple of post about this somewhere, but I've clearly messed it up.
Sorry!
Sk, Could you hide that post please?

Bedrich Hajek
Joined: 28 Mar 09
Posts: 335
Credit: 3,800,087,309
RAC: 894,669
Message 43592 - Posted: 26 May 2016 | 22:15:01 UTC - in response to Message 43581.

How I imagine your GPUs after OPM:


For the past few days, while there was little work here, I was crunching at a tough backup project (Einstein), where my computers were able to crunch 2 GPU WUs per card simultaneously, with GPU usage of 99% max on my XP computer and 91% max on my Windows 10 computer. So anything you have should be a walk in the park, even if you come with a 200,000+ atom simulation with 90%+ GPU usage.


Good luck!!


Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,487,550,429
RAC: 409,504
Message 43593 - Posted: 27 May 2016 | 0:46:51 UTC - in response to Message 43584.

when I temporarily ran BOINC on a RAMDisk some weeks ago, I was harshly confronted with this problem.
There was only limited disk space available for BOINC, and each time the free RAMDisk space went below 7,629 MB (7.45 GB), the BOINC manager did not download new GPUGRID tasks (the event log complained about too little free disk space).

I contacted the GPUGRID people, and they told me they will look into this at some point; it can't be done right now, though, as Matt is not available for some reason (and he seems to be the only one who could change/fix it).

I had this happen recently when the disk partitions on which BOINC was installed went below that level. Thought it was strange; I wasn't sure if it was a GPUGrid or a BOINC thing. Anyway, I resized the partitions with a disk manager and started getting work on those machines again.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,824,715
RAC: 0
Message 43594 - Posted: 27 May 2016 | 2:19:39 UTC - in response to Message 43466.
Last modified: 27 May 2016 | 2:21:03 UTC

The Error Rate for the latest GERARD_FX tasks is high and the OPM simulations were higher. Perhaps this should be looked into.
Short runs (2-3 hours on fastest card)
Application                  Unsent  In progress  Success  Error rate
SDOERR_opm99                      0           60     2412      48.26%

Long runs (8-12 hours on fastest card)
Application                  Unsent  In progress  Success  Error rate
GERARD_FXCXCL12R_1406742_         0           33      573      38.12%
GERARD_FXCXCL12R_1480490_         0           31      624      35.34%
GERARD_FXCXCL12R_1507586_         0           25      581      33.14%
GERARD_FXCXCL12R_2189739_         0           42      560      31.79%
GERARD_FXCXCL12R_50141_           0           35      565      35.06%
GERARD_FXCXCL12R_611559_          0           31      565      32.09%
GERARD_FXCXCL12R_630477_          0           34      561      34.31%
GERARD_FXCXCL12R_630478_          0           44      599      34.75%
GERARD_FXCXCL12R_678501_          0           30      564      40.57%
GERARD_FXCXCL12R_747791_          0           32      568      36.89%
GERARD_FXCXCL12R_780273_          0           42      538      39.28%
GERARD_FXCXCL12R_791302_          0           37      497      34.78%

2 or 3 weeks ago the error rate was ~25% to 35%; it's now ~35% to 40%. Maybe this varies with the release stage: early in a run, tasks go to everyone, so error rates are higher; later, more go to the most successful cards, so the error rate drops?
...

FWIW the ever increasing error rate is why I no longer crunch here. Hours of wasted time and electricity could be better put to use elsewhere like POEM. My 970s are pretty much useless here nowadays and the 750TIs are completely useless. JMHO
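(As a sanity check on those figures: the status page does not show the raw error count, but it can be back-calculated from the success count and the rate, assuming the rate means errored results as a share of all finished results — an assumption, not something the status page documents:)

```python
def error_rate(successes: int, errors: int) -> float:
    # Assumed definition: errored results / all finished results.
    return errors / (successes + errors)

# SDOERR_opm99 figures: 2412 successes at a 48.26% error rate imply
# roughly this many errored results (hypothetical back-calculation).
implied_errors = round(2412 * 0.4826 / (1 - 0.4826))
print(implied_errors, f"{error_rate(2412, implied_errors):.2%}")  # 2250 48.26%
```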

Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Message 43598 - Posted: 27 May 2016 | 8:39:32 UTC - in response to Message 43594.

These error rates are a bit exaggerated since AFAIK they include instantaneous errors which don't really bother much.

nanoprobe
Joined: 26 Feb 12
Posts: 181
Credit: 221,824,715
RAC: 0
Message 43607 - Posted: 27 May 2016 | 14:31:45 UTC - in response to Message 43598.
Last modified: 27 May 2016 | 14:43:50 UTC

These error rates are a bit exaggerated since AFAIK they include instantaneous errors which don't really bother much.

Unfortunately that is not true for me. I almost never have a task that errors out immediately; they're thousands of seconds in before they puke, especially so in the last few months. FWIW I'm not a points hog, but if we got some kind of credit for tasks that error out before finishing, like other projects give, I'd be more inclined to run them. 6-10 hours of run time for nada just irks me when that run time could be productive somewhere else. And yes, I understand that errors still provide useful info; at least I'm assuming they do, and if they supply useful info we should get some credit. JMHO

Richard Haselgrove
Joined: 11 Jul 09
Posts: 783
Credit: 1,391,041,045
RAC: 1,248,198
Message 43608 - Posted: 27 May 2016 | 15:32:27 UTC - in response to Message 43607.

These error rates are a bit exaggerated since AFAIK they include instantaneous errors which don't really bother much.

Unfortunately that is not true for me. I almost never have a task that errors out immediately; they're thousands of seconds in before they puke, especially so in the last few months. FWIW I'm not a points hog, but if we got some kind of credit for tasks that error out before finishing, like other projects give, I'd be more inclined to run them. 6-10 hours of run time for nada just irks me when that run time could be productive somewhere else. And yes, I understand that errors still provide useful info; at least I'm assuming they do, and if they supply useful info we should get some credit. JMHO

On the other hand, I can barely remember a task which errored out here for an unexplained reason. I've certainly had some since the last ones showing, which were for October/November 2013.

I think my most recent failures were because of improper computer shutdowns/restarts - power outages due to the winter storms. I don't see any reason why the project should reward me for those; my bad for not investing in a UPS. The machine I'm posting from - 45218 - has no "Unrecoverable error" events for GPUGrid as far back as the logs go (13 January 2016), and it runs GPUGrid constantly when tasks are available.

If you are seeing a much higher error rate, I think you should look closer to home. I don't think the project's applications and tasks are inherently unstable.

Erich56
Joined: 1 Jan 15
Posts: 369
Credit: 1,606,755,102
RAC: 2,771,359
Message 43609 - Posted: 27 May 2016 | 16:44:35 UTC - in response to Message 43590.

Oh, my bad!
That won't work...
I read a couple of post about this somewhere, but I've clearly messed it up.
Sorry!


Yes, indeed it won't work :-(

One of the comments, a few weeks ago, in the forum was:

The disk space requirement is set in the workunit meta-data. ...

If disk usage was associated with the application, you could re-define it in an app_info.xml: but because it's data, it's correctly assigned to the researcher to configure.


Meanwhile it doesn't bother me any more, since I gave up running BOINC on a RamDisk.
Nevertheless, I think the GPUGRID people should look into and question this.

Jim1348
Joined: 28 Jul 12
Posts: 455
Credit: 1,130,760,908
RAC: 182,632
Message 43610 - Posted: 27 May 2016 | 18:45:26 UTC - in response to Message 43594.

My 970s are pretty much useless here nowadays and the 750TIs are completely useless. JMHO

This latest batch might be better, though I have just started. But at 3 hours into the run, it looks like a GERARD_CXCL12VOLK will take 12.5 hours to complete on a GTX 970 running at 1365 MHz (Win7 64-bit).

Erich56
Joined: 1 Jan 15
Posts: 369
Credit: 1,606,755,102
RAC: 2,771,359
Message 43612 - Posted: 27 May 2016 | 19:47:46 UTC - in response to Message 43610.

This latest batch might be better, though I have just started. But at 3 hours into the run, it looks like a GERARD_CXCL12VOLK will take 12.5 hours to complete on a GTX 970 running at 1365 MHz (Win7 64-bit).

here it took 12.7 hrs on a GTX970 (running at 1367 MHz) - Win10 64-bit.