
Message boards : News : WU: OPM995 simulations

Author Message
Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Message 43600 - Posted: 27 May 2016 | 8:54:11 UTC

Here we go again :) This time with 33% more credits plus corrected runtimes, which means an additional 2x credit for WUs which take more than 18 hours on a 780. The batch only includes WUs which take up to a maximum of 24 hours on a 780. I hope I don't seriously overshoot on credits this time, but it's really a bit hit and miss.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,630,096,894
RAC: 9,819,606
Level
Trp
Message 43602 - Posted: 27 May 2016 | 9:19:12 UTC - in response to Message 43600.

Thanks Stefan!
As there are plenty of workunits queued (7920 atm), and some of these are very long, I suggest everyone reduce their work cache to 0.03 days to maximize throughput and the credits earned.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43604 - Posted: 27 May 2016 | 13:51:39 UTC - in response to Message 43602.
Last modified: 27 May 2016 | 13:52:17 UTC

Thanks Stefan!
As there are plenty of workunits queued (7920 atm), and some of these are very long, I suggest everyone reduce their work cache to 0.03 days to maximize throughput and the credits earned.

Good suggestion. Given the length of these tasks (extra-long or at least some of them), and so many being available, there is no point in people hoarding tasks - they will just miss bonus deadlines and get less credit.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

WPrion
Joined: 30 Apr 13
Posts: 56
Credit: 594,374,919
RAC: 830,704
Level
Lys
Message 43628 - Posted: 29 May 2016 | 2:05:23 UTC - in response to Message 43602.

Thanks Stefan!
As there are plenty of workunits queued (7920 atm), and some of these are very long, I suggest everyone reduce their work cache to 0.03 days to maximize throughput and the credits earned.


Are you referring to the setting:

"Maintain enough work for an additional"

I set mine to 0.03 several hours ago and updated my client. Yet it downloaded another WU shortly after one finished, just as the running WU had barely started.

Is there something else to tweak?

Thanks,

Win

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43632 - Posted: 29 May 2016 | 13:14:27 UTC - in response to Message 43628.
Last modified: 29 May 2016 | 13:19:19 UTC

Yes, in Boinc Manager (advanced view) go to Options, then Computing preferences; on the Computing tab you need to set two values:

    Store at least [0.02] days of work
    Store up to an additional [0.01] days of work


If the combined values add up to anything less than 0.10 then the settings should work reasonably well.
It's likely that the second value was something like 0.25 or 0.5 and that caused you to download additional work (a second task).
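
For anyone who prefers editing files to clicking through the Manager: as far as I know, the same two values can also be set in BOINC's global_prefs_override.xml through the work_buf_min_days and work_buf_additional_days elements. Below is a minimal Python sketch that builds such a fragment and checks the 0.10-day rule of thumb. The helper name and the threshold constant are made up for illustration, and where the file lives depends on your install.

```python
import xml.etree.ElementTree as ET

# Rule of thumb from this thread: keep the combined buffer below 0.10 days.
MAX_COMBINED_DAYS = 0.10


def build_cache_override(min_days: float, additional_days: float) -> str:
    """Return a global_prefs_override.xml fragment for the two cache values.

    build_cache_override is a hypothetical helper, not part of BOINC.
    """
    if min_days < 0 or additional_days < 0:
        raise ValueError("buffer sizes must not be negative")
    if min_days + additional_days >= MAX_COMBINED_DAYS:
        raise ValueError("combined buffer should stay below 0.10 days")
    root = ET.Element("global_preferences")
    # "Store at least X days of work"
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:.2f}"
    # "Store up to an additional X days of work"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:.2f}"
    return ET.tostring(root, encoding="unicode")


print(build_cache_override(0.02, 0.01))
```

After changing the file the client has to re-read its preferences before the new values take effect.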

Jacob Klein
Joined: 11 Oct 08
Posts: 1067
Credit: 1,146,403,839
RAC: 1,089,717
Level
Met
Message 43633 - Posted: 29 May 2016 | 13:30:23 UTC
Last modified: 29 May 2016 | 13:32:05 UTC

Please note that really low buffer settings cause increased stress on project scheduler servers, for all projects you are attached to.

I personally leave my buffers at something like "store at least 1 day, store up to 0.5 days more", since I don't care about the GPUGrid credit bonus, and short buffers don't really help GPUGrid throughput unless very few work units are available, and I don't want to add increased stress to my attached projects' scheduler servers.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 788
Credit: 1,422,060,845
RAC: 1,410,932
Level
Met
Message 43634 - Posted: 29 May 2016 | 14:20:12 UTC - in response to Message 43633.

... short buffers don't really help GPUGrid throughput ...

Not necessarily true. I'm not speaking specifically about the OPM simulations here, but I think most GPUGrid work is run as a sort of relay race - you hold the baton for a short while, complete your lap of the track, and then hand it back in for somebody else to take over.

If you sit at the side of the track for a day and a half before you even start running, that particular baton - series of linked tasks, each generated from the result of the previous lap - is permanently delayed, and the final results aren't available for the scientists to study until that much later.
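
To put rough numbers on the relay-race picture: if every task in a chain waits in someone's cache before it starts, the waits add up over the whole chain. A small Python sketch with made-up numbers (chain length, run time and waits are all illustrative assumptions, not GPUGrid figures):

```python
def chain_completion_days(laps: int, run_days: float, cache_wait_days: float) -> float:
    """Days until the last lap of a chain of dependent tasks is returned.

    Each lap can only be generated once the previous one is back, so the
    per-lap queue wait is paid once per lap, not once per chain.
    """
    return laps * (cache_wait_days + run_days)


LAPS = 10        # hypothetical number of linked tasks in one chain
RUN_DAYS = 0.5   # hypothetical run time per task

for wait in (0.03, 1.5):
    total = chain_completion_days(LAPS, RUN_DAYS, wait)
    print(f"cache wait {wait:4.2f} d -> chain done after {total:5.2f} d")
```

The model ignores upload, validation and scheduling delays, so it only shows the direction of the effect, not its real size.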

Jacob Klein
Joined: 11 Oct 08
Posts: 1067
Credit: 1,146,403,839
RAC: 1,089,717
Level
Met
Message 43635 - Posted: 29 May 2016 | 14:26:29 UTC - in response to Message 43634.
Last modified: 29 May 2016 | 14:28:33 UTC

That had slipped my mind. But, if GPUGrid was having a problem getting the batons back for the next runners, and they wanted to ensure that the race kept running smoothly, they could tighten the deadlines on the relay chunks if need be.

So, I'm just going to stick with the deadlines they give me, and not micro-manage BOINC, and not add stress to my attached projects' servers. I actually have GPUGrid set to 99999 resource share, and GPUs crunching 2-at-a-time, so ... :) When I get tasks from this project, they are usually firing on all cylinders, top priority.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43639 - Posted: 29 May 2016 | 19:25:46 UTC - in response to Message 43635.
Last modified: 29 May 2016 | 19:28:57 UTC

Until the scheduler is re-written at a per device/device-specific level there will be issues with attaching to multiple projects (when using multiple devices). However, these have been addressed as far as reasonably feasible with the existing manager.

Would add that many CPU projects have long tasks; some Einstein and WCG tasks for example take ~20h to complete, ClimatePrediction several days to weeks. If you have a low cache and are running a GPUGrid task(s) on your GPU(s) and WCG tasks on your CPU then you won't badger the server for new work until you are almost out of work which probably won't be very often (a few times per day, which isn't an issue).

Granted, there are/were some projects with very short run-times, but that does not mean it's better to have a long queue/big cache of tasks. There are substantial issues with having hundreds/thousands of tasks in your queue too. For example, if you crunch for BU and your Internet goes down, all queued tasks will fail - not exactly great news for their server.

My opinion for here: a low cache is good for the project and for user/team credits; a higher (but still reasonably low) cache is not as good for either but still fine, and it's your choice. A high cache (3+ days) is bad news.
The bonus system is designed to reflect this project's need for a quick return. It can't take into account what else you crunch.

Jacob Klein
Joined: 11 Oct 08
Posts: 1067
Credit: 1,146,403,839
RAC: 1,089,717
Level
Met
Message 43640 - Posted: 29 May 2016 | 19:53:28 UTC - in response to Message 43639.
Last modified: 29 May 2016 | 20:00:50 UTC

It can't take into account what else you crunch.


That's exactly the reason that you shouldn't make blanket suggestions on suggested cache settings that benefit GPUGrid most, without also specifying some of the drawbacks :) I digress.

For my particular scenario, I have modified my cache settings a bit, in order to try to keep all my GPUs sustained at 2-GPUGrid-tasks-per-GPU without taking on additional work from other attached GPU projects. I'm using 0.9d+0.9d on the PC that has GTX970+GTX660Ti+GTX660Ti, and 0.5d+0.5d on the PC that has GTX980Ti+GTX980Ti. To each their own.

Profile Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,496,456,504
RAC: 414,796
Level
Arg
Message 43641 - Posted: 29 May 2016 | 21:02:47 UTC

For years many have asked for per project work buffer settings or at LEAST separate settings for GPUs and CPUs. All to no avail, while a lot of effort has been spent on less important (IMO) issues.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43642 - Posted: 29 May 2016 | 22:04:57 UTC - in response to Message 43640.

It can't take into account what else you crunch.


That's exactly the reason that you shouldn't make blanket suggestions on suggested cache settings that benefit GPUGrid most, without also specifying some of the drawbacks :) I digress.

For my particular scenario, I have modified my cache settings a bit, in order to try to keep all my GPUs sustained at 2-GPUGrid-tasks-per-GPU without taking on additional work from other attached GPU projects. I'm using 0.9d+0.9d on the PC that has GTX970+GTX660Ti+GTX660Ti, and 0.5d+0.5d on the PC that has GTX980Ti+GTX980Ti. To each their own.

My suggestions are predominantly for GPUGrid only and are typically optimisations for GPUGrid throughput and user/team credit. I don't make suggestions at GPUGrid to facilitate every conceivable combination of Boinc-wide project admix, nor could I - it can't be done.
You have different views, values, opinions and objectives which you are quite entitled to express and implement for yourself and to your own ends.
My advice is mostly aimed at new, novice or just GPUGrid-new crunchers, or people with a problem specific to here. Usually they need a setup to facilitate crunching here, and often changes just to make it work.
Occasionally I digress too, to advise on an experience crunching elsewhere, or to pass on some observations or knowledge, but there is no catch-all super setup for Boinc.
I enjoy the fact that people crunch for a diversity of reasons with different setups and takes on crunching. Highlighting different circumstances and experiences adds to my knowledge and to crunchers' knowledge as a whole, but one shoe doesn't fit all, and this is a GPUGrid forum, not the Boinc central forum where generic advice might better be propagated.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43643 - Posted: 29 May 2016 | 22:10:02 UTC - in response to Message 43641.

For years many have asked for per project work buffer settings or at LEAST separate settings for GPUs and CPUs. All to no avail, while a lot of effort has been spent on less important (IMO) issues.

I don't bother any more. IMO it is what it is and that's just about all it will ever be.


Profile Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,496,456,504
RAC: 414,796
Level
Arg
Message 43644 - Posted: 29 May 2016 | 23:04:59 UTC - in response to Message 43643.

For years many have asked for per project work buffer settings or at LEAST separate settings for GPUs and CPUs. All to no avail, while a lot of effort has been spent on less important (IMO) issues.

I don't bother any more. IMO it is what it is and that's just about all it will ever be.

Gave up too. However it is supremely important to devise more ways for people to burn up their phones while doing nothing useful.

klepel
Joined: 23 Dec 09
Posts: 136
Credit: 1,830,142,470
RAC: 1,457,984
Level
His
Message 43654 - Posted: 30 May 2016 | 14:21:55 UTC

Stefan,

Two of my computers have received SDOERR_opm995 tasks which are being processed by another computer at the same time. They were sent out at more or less the same time.
https://www.gpugrid.net/workunit.php?wuid=11614785
https://www.gpugrid.net/workunit.php?wuid=11614829

Is this intentional, because these SDOERR WUs have been so error prone, or is it a fault of the scheduler? Please advise as soon as possible so I can abort them if needed. I do not like to do double work if it is not required.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43655 - Posted: 30 May 2016 | 15:17:12 UTC - in response to Message 43654.

initial replication 2

https://www.gpugrid.net/workunit.php?wuid=11614785

That means two tasks are sent out, by design.

One of the OPM995's I'm running also has an initial replication of 2:
https://www.gpugrid.net/workunit.php?wuid=11614838

Jacob Klein
Joined: 11 Oct 08
Posts: 1067
Credit: 1,146,403,839
RAC: 1,089,717
Level
Met
Message 43656 - Posted: 30 May 2016 | 15:25:56 UTC - in response to Message 43655.

Perhaps the question is:
Why was it set up with initial replication set to 2?

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43659 - Posted: 30 May 2016 | 20:48:32 UTC - in response to Message 43656.
Last modified: 30 May 2016 | 22:10:00 UTC

Probably validation; any proof of concept experiment to demonstrate ability needs to contain appropriate verification for it to be accepted as a model/framework for performing experiments.

WPrion
Joined: 30 Apr 13
Posts: 56
Credit: 594,374,919
RAC: 830,704
Level
Lys
Message 43661 - Posted: 31 May 2016 | 0:56:38 UTC - in response to Message 43632.

Thanks!

Jacob Klein
Joined: 11 Oct 08
Posts: 1067
Credit: 1,146,403,839
RAC: 1,089,717
Level
Met
Message 43662 - Posted: 31 May 2016 | 1:14:43 UTC - in response to Message 43659.

Hmm... validation deals with quorum though, and also, I thought the way these GPUGrid tasks worked was that the results couldn't really be validated against each other. I might be mistaken though.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43663 - Posted: 31 May 2016 | 7:11:54 UTC - in response to Message 43662.

Wasn't thinking about task validation in the Boinc sense but rather validation of the experimental procedure - does it hold any weight? If we consider an experiment as a batch of work, validation of the experiment (and procedures) in scientific terms usually requires that the whole experiment be replicated, and perhaps many times before the results/methods are accepted. Of course Stefan might be doing this for different reasons.

Jacob Klein
Joined: 11 Oct 08
Posts: 1067
Credit: 1,146,403,839
RAC: 1,089,717
Level
Met
Message 43666 - Posted: 31 May 2016 | 12:52:59 UTC - in response to Message 43663.

I see what you mean now. I hope he has another reason.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43671 - Posted: 31 May 2016 | 16:40:16 UTC
Last modified: 31 May 2016 | 16:40:32 UTC

GTX970 on W10 24h and 41min with a bit of upload time too (118MB).

http://www.gpugrid.net/result.php?resultid=15125538

Run time 88,881.18
CPU time 88,253.09
Validate state Valid
Credit 788,690.00

I expect that a better set-up system could complete it within 24h, but I have a second GPU, the room has been at 24C to 28C, I'm using the CPU quite a bit, and my system is set to drop the clocks to keep the temperature down. This GPU was clocked at ~1300MHz; the second has dropped down to 1088. GDDR5 is @7GHz.

Haven't managed to get an OPM on my Linux system yet. The point of installing Ubuntu 16.04 was to see if I could set up a GTX970 system to return these long OPM's inside 24h!

Bedrich Hajek
Joined: 28 Mar 09
Posts: 340
Credit: 3,819,818,009
RAC: 929,462
Level
Arg
Message 43678 - Posted: 1 Jun 2016 | 1:53:30 UTC

I was fortunate enough to get and complete successfully 2 of these units:

Task 5f1c-SDOERR_opm995-0-1-RND8074_2
Work unit 11614800
Sent 30 May 2016 | 13:52:39 UTC
Time reported 31 May 2016 | 6:23:14 UTC
Status Completed and validated
Run time 56,458.02
CPU time 56,161.20
Credit 940,443.00
Application Long runs (8-12 hours on fastest card) v8.48 (cuda65)
# Time per step (avg over 5000000 steps): 11.257 ms
# Approximate elapsed time for entire WU: 56284.859 s
# PERFORMANCE: 157144 Natoms 11.257 ns/day 0.000 ms/step 0.000 us/step/atom
02:17:56 (7792): called boinc_finish

http://www.gpugrid.net/result.php?resultid=15124495


Task 3jw8R0-SDOERR_opm995-0-1-RND9612_2
Work unit 11614181
Sent 30 May 2016 | 8:49:32 UTC
Time reported 31 May 2016 | 0:50:29 UTC
Status Completed and validated
Run time 55,859.07
CPU time 55,499.59
Credit 956,403.00
Application Long runs (8-12 hours on fastest card) v8.48 (cuda65)
# Time per step (avg over 10000000 steps): 5.578 ms
# Approximate elapsed time for entire WU: 55780.416 s
# PERFORMANCE: 79913 Natoms 5.578 ns/day 0.000 ms/step 0.000 us/step/atom
20:45:10 (7740): called boinc_finish

http://www.gpugrid.net/result.php?resultid=15124201
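
As a quick sanity check on the stderr lines above, the "Approximate elapsed time" is just the average time per step multiplied by the number of steps. A minimal Python sketch using the figures from the two results (the function name is only for illustration):

```python
def elapsed_seconds(ms_per_step: float, steps: int) -> float:
    """Estimate wall time from the average time per step (ms) and step count."""
    return ms_per_step / 1000.0 * steps


# (label, ms per step, steps, elapsed time reported in stderr)
results = [
    ("5f1c", 11.257, 5_000_000, 56284.859),
    ("3jw8R0", 5.578, 10_000_000, 55780.416),
]

for label, ms_per_step, steps, reported in results:
    estimate = elapsed_seconds(ms_per_step, steps)
    hours = estimate / 3600.0
    print(f"{label}: {estimate:.0f} s (~{hours:.1f} h), reported {reported:.0f} s")
```

Both estimates land within a second of the reported values. The number labelled ns/day in the PERFORMANCE line matches the ms-per-step figure, so that label looks wrong rather than the run being that slow.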


With 5f1c-SDOERR_opm995-0-1-RND8074_2, my Windows 10 computer reached a maximum GPU usage of 87% while using 1950 MB of memory. 3jw8R0-SDOERR_opm995-0-1-RND9612_2, on the same computer, reached a maximum GPU usage of 80% while using 1100 MB of memory.

I can't wait to get a few more of these!


(Ryle)
Joined: 7 Jun 09
Posts: 17
Credit: 671,880,212
RAC: 1,201,007
Level
Lys
Message 43681 - Posted: 1 Jun 2016 | 14:38:34 UTC

When the new students arrive, would you consider creating more short tasks?
I think it is a pity that you mostly cater to the very high-end cards here. I'd like to continue supporting this project, but as it is I just can't afford to buy the faster cards.

I do own a 970, and it is still a fast card. I would just hate to see it go over that 24H limit in the near future. I understand it is eventually inevitable, but it's barely a year old.

Sadly, the high-end cards also crunch the short units when the long unit pool is dry, so they quickly eat up the short pool too. A WU tier system would be nice, however. I think it's been suggested somewhere else in these forums that you could make short, medium and long unit pools. That would be cool: the small cards would have the short pool, the somewhat faster cards the medium pool, and the high-end cards the top-tier long pool.

Still, in the past the short units also gave fewer points per day overall, even for the same time on the same card, but I don't know the reason for that. (Maybe the bonus isn't added to those?)

Well, just my 2 cents worth of opinion :)

John C MacAlister
Joined: 17 Feb 13
Posts: 178
Credit: 132,357,411
RAC: 2,487
Level
Cys
Message 43683 - Posted: 2 Jun 2016 | 10:40:59 UTC

Agreed: pity there are so few shorts.....

My 650 Tis are too slow and the 660 Tis are looking pretty slow compared to many others.

I can't afford newer cards and now with electricity costing me 18 cents (Canadian) per kWh, my contribution to GPUGrid will be very low.

:(

eXaPower
Joined: 25 Sep 13
Posts: 265
Credit: 1,043,270,117
RAC: 1,772,186
Level
Met
Message 43685 - Posted: 2 Jun 2016 | 16:05:42 UTC
Last modified: 2 Jun 2016 | 16:19:22 UTC

Before I received 2m59_SDOERR_opm994 (short WU), three prior hosts (GT640 / GTX950 / GTX970, r361 and r364 drivers) failed with exit code -55 (0xffffffffffffffc9), unknown error, and zero runtime.

GTX970 (2m59 WU) compute 6.45hr estimated runtime (15.480% per 1hr).

2m59 WU status: 11-14% CPU usage (3.2GHz) / 54% GPU usage (1511MHz) / 24% MCU (7200MHz) / 25% BUS (PCIe3.0 x4) / GPU temp 39C / 33% GPU power (108W) / 550MB memory usage (no display connected)

Topology reports 27558 atoms
4344 waters in system

Thank you Zoltan for sharing the helpful tip (in the previous OPM thread) on where to find the file listing a WU's atom count.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 340
Credit: 3,819,818,009
RAC: 929,462
Level
Arg
Message 43695 - Posted: 3 Jun 2016 | 5:47:27 UTC
Last modified: 3 Jun 2016 | 5:48:04 UTC

I had one of these WUs fail with this error message:

upload failure: <file_xfer_error>
<file_name>4mt6-SDOERR_opm994-0-1-RND0442_0_11</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>


http://www.gpugrid.net/result.php?resultid=15127701


Has this happened to anyone else with these WUs?

I remember this happened in the past, and there is a fix to this posted, in the threads somewhere, but I can't remember where.

I think this WU would have been otherwise good.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,630,096,894
RAC: 9,819,606
Level
Trp
Message 43696 - Posted: 3 Jun 2016 | 7:35:49 UTC - in response to Message 43695.

I had one of these WUs fail with this error message:

upload failure: <file_xfer_error>
<file_name>4mt6-SDOERR_opm994-0-1-RND0442_0_11</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>


http://www.gpugrid.net/result.php?resultid=15127701


Has this happened to anyone else with these WUs?

I remember this happened in the past, and there is a fix to this posted, in the threads somewhere, but I can't remember where.

I think this WU would have been otherwise good.

See the WARNING/CHALLENGE: VERY LONG WU (VERYLONG_CXCL12_confAna) thread.
It's embarrassing that we've run into this again.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 788
Credit: 1,422,060,845
RAC: 1,410,932
Level
Met
Message 43697 - Posted: 3 Jun 2016 | 8:02:15 UTC

I've got 2d57-SDOERR_opm994-0-1-RND4399_1 running. The file description in client_state.xml is

<file>
<name>2d57-SDOERR_opm994-0-1-RND4399_1_11</name>
<nbytes>0.000000</nbytes>
<max_nbytes>5000000.000000</max_nbytes>
<status>0</status>
<upload_url>http://www.gpugrid.org/PS3GRID_cgi/file_upload_handler</upload_url>
</file>

- so the maximum size allowed is 5,000,000 bytes.

So far, it's reached 852 KB at about 80% progress - which sounds like plenty of headroom, and perhaps not a widespread problem. But I'll keep an eye on it as it approaches completion.
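
A rough way to judge the headroom is to read max_nbytes from the file description and extrapolate the current size linearly to 100% progress. The Python sketch below rests on two assumptions that are mine, not the project's: the output file grows roughly linearly with progress, and 1 KB means 1024 bytes.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the file description quoted above.
FILE_XML = """
<file>
    <name>2d57-SDOERR_opm994-0-1-RND4399_1_11</name>
    <nbytes>0.000000</nbytes>
    <max_nbytes>5000000.000000</max_nbytes>
    <status>0</status>
</file>
"""


def projected_bytes(current_kb: float, progress: float) -> float:
    """Linearly extrapolate the final file size in bytes (1 KB = 1024 bytes)."""
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    return current_kb * 1024.0 / progress


max_nbytes = float(ET.fromstring(FILE_XML.strip()).findtext("max_nbytes"))
final = projected_bytes(852, 0.80)
print(f"projected {final:,.0f} of {max_nbytes:,.0f} bytes allowed "
      f"({final / max_nbytes:.0%} of the limit)")
```

For this task the projection comes out at roughly a fifth of the limit, which fits the "plenty of headroom" reading. Larger systems may write more per step, so the same check is worth repeating per task.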

Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Message 43698 - Posted: 3 Jun 2016 | 8:09:50 UTC

I apologize for not answering in a while, I have been a bit busy with writing my thesis.

Job replication 2 was my desperate attempt to get my results back faster while competing with the mass of simulations sent out by Gerard, and also to reduce my failure rates a bit. I hope you don't mind too much, since there were only around 300 WUs. If both copies arrive on the same host, of course, it's quite pointless.

On the subject of short runs, I am unfortunately unable to help you because the equilibration runs cannot be split into smaller chunks. But as Gianni mentioned we are getting new students soon so it is possible that they have something for short.

John C MacAlister
Joined: 17 Feb 13
Posts: 178
Credit: 132,357,411
RAC: 2,487
Level
Cys
Message 43699 - Posted: 3 Jun 2016 | 10:08:34 UTC

Hi, Stefan:

Thank you for this-


On the subject of short runs, I am unfortunately unable to help you because the equilibration runs cannot be split into smaller chunks. But as Gianni mentioned we are getting new students soon so it is possible that they have something for short.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 788
Credit: 1,422,060,845
RAC: 1,410,932
Level
Met
Message 43700 - Posted: 3 Jun 2016 | 10:46:15 UTC - in response to Message 43697.

2d57-SDOERR_opm994-0-1-RND4399_1 uploaded cleanly, so it's not a universal problem.

4azpR0-SDOERR_opm995-0-1-RND6483_1 might get closer to the limit - I'll keep an eye on it.

eXaPower
Joined: 25 Sep 13
Posts: 265
Credit: 1,043,270,117
RAC: 1,772,186
Level
Met
Message 43703 - Posted: 3 Jun 2016 | 17:18:48 UTC - in response to Message 43685.

Before I received 2m59_SDOERR_opm994 (short WU), three prior hosts (GT640 / GTX950 / GTX970, r361 and r364 drivers) failed with exit code -55 (0xffffffffffffffc9), unknown error, and zero runtime.

GTX970 (2m59 WU) compute 6.45hr estimated runtime (15.480% per 1hr).

2m59 WU status: 11-14% CPU usage (3.2GHz) / 54% GPU usage (1511MHz) / 24% MCU (7200MHz) / 25% BUS (PCIe3.0 x4) / GPU temp 39C / 33% GPU power (108W) / 550MB memory usage (no display connected)

Topology reports 27558 atoms
4344 waters in system

Thank you Zoltan for sharing the helpful tip (in the previous OPM thread) on where to find the file listing a WU's atom count.

WUid=11616186 (1a0r OPM994) crashed my system multiple times. This WU showed 100% GPU usage / 1% MCU / 20% power (65W) before the first driver resets I've encountered in three years of computing ACEMD. Once I noticed the first couple of driver recoveries while OCed, I ran it at reference stock clocks, where the (1a0r) WU ended with a -97 (0xffffffffffffff9f) unknown error after 102sec. (FATAL : Cuda driver error 719 in file 'swanlibnv2.cpp' in line 1965)
A few other stable, high-end RAC wingman systems (980ti / (2) 970's; 6 in total) also had errors (<100sec) with the (1a0r) WU.
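
The long hexadecimal values next to the exit codes above are simply the same negative numbers written as unsigned 64-bit integers (two's complement). A short Python sketch to convert them back; the helper name is just for illustration:

```python
def to_signed64(value: int) -> int:
    """Interpret an unsigned 64-bit integer as a two's-complement signed value."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value >= 2**63 else value


# The two codes reported in this thread, grouped in 16-bit chunks for readability.
for raw in (0xFFFF_FFFF_FFFF_FFC9, 0xFFFF_FFFF_FFFF_FF9F):
    print(f"{raw:#018x} -> {to_signed64(raw)}")
```

So the hex form carries no extra information; the "Unknown error" text only means the code has no description attached.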

As of now (2) OPM995 are without issue on my 970's at very high OC's:

(WUid=11614432) 4a6fRO (50479 atoms with 9411 waters in system) 20.25hr estimated runtime at 12-15% CPU usage (3.2GHz) / 63% GPU usage (1511MHz) / 31% MCU (7200MHz) / 27% BUS (PCIe3.0 x4) / 34% power (110W) / 42C core / 820MB memory usage

(WUid=11614365) 4u15RO (51270 atoms with 8255 waters in system) 20.5hr estimated runtime at 12-15% CPU usage (3.2GHz) / 65% GPU usage (1511MHz) / 34% MCU (7010MHz) / 22% BUS (PCIe3.0 x8) / 60% power (120W) / 45C core / 843MB memory usage





Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43704 - Posted: 3 Jun 2016 | 20:41:09 UTC

Task 1s4wR0-SDOERR_opm995-0-1-RND5214_0
Work unit 11614436
Sent 3 Jun 2016 | 6:47:02 UTC
Time reported 3 Jun 2016 | 20:01:33 UTC
Status Completed and validated
Run time 45,293.51
CPU time 20,015.48
Credit 147,829.50

Finally got an OPM on my Ubuntu 16.04 rig. Alas it didn't turn out to be an extra-long run and completed in 12h 35min at stock.
Based on the run time of other long WU's the credit is about half what it should be. Was hoping to get an extra-long task and to finish inside 24h - c'est la vie...

Richard Haselgrove
Joined: 11 Jul 09
Posts: 788
Credit: 1,422,060,845
RAC: 1,410,932
Level
Met
Message 43705 - Posted: 3 Jun 2016 | 21:56:01 UTC - in response to Message 43700.

4azpR0-SDOERR_opm995-0-1-RND6483_1 looks safe as well - 1,283 KB at 61%.

# Topology reports 50432 atoms

eXaPower
Joined: 25 Sep 13
Posts: 265
Credit: 1,043,270,117
RAC: 1,772,186
Level
Met
Message 43706 - Posted: 3 Jun 2016 | 22:20:36 UTC

Too many errors (may have bug) 1a0r-SDOERR_opm994-0-1-RND9594

https://www.gpugrid.net/workunit.php?wuid=11616186

eXaPower
Joined: 25 Sep 13
Posts: 265
Credit: 1,043,270,117
RAC: 1,772,186
Level
Met
Message 43713 - Posted: 4 Jun 2016 | 14:51:49 UTC - in response to Message 43703.

(2) new OPM995 WUs that should test the maximum allowed file_xfer size of 5,000,000 bytes:

3nce WU#11614771 (126091 atoms with 25796 waters) status: 20hr estimated runtime at 12-16% CPU usage (3.2GHz) / 76% GPU usage (1511MHz) / 40% MCU (7200MHz) / 33% BUS (PCIe3.0 x4) / 40% power (130W) / 44C temp / 1559MB memory usage

2b6p WU#11614758 (129818 atoms with 23308 waters) status: 21hr estimated runtime at 12-16% CPU usage (3.2GHz) / 75% GPU usage (1511MHz) / 45% MCU (7010MHz) / 24% BUS (PCIe3.0 x8) / 70% power (140W) / 47C temp / 1662MB memory usage

Before I received 2m59_SDOERR_opm994 (short WU) - Three prior hosts (GT640 / GTX950 / GTX970 r361&r364 driver) produced outcome -55 exit code (0xffffffffffffffc9) Unknown error zero runtime's.

GTX970 (2m59 WU) compute 6.45hr estimated runtime (15.480% per 1hr).

2m59 WU status: 11-14% CPU usage (3.2GHz) / 54% GPU usage (1511MHz) / 24% MCU (7200MHz) / 25% BUS (PCIe3.0 x4) / GPU temp 39C / 33% GPU power (108W) / 550MB memory usage (no display connected)

Topology reports 27558 atoms
4344 waters in system

Thank you Zoltan for sharing helpful tip (in previous OPM thread) on where to locate a WU's atom amount file.

WUid=11616186 (1a0r OPM994) crashed my system multiple times - this WU had 100% GPU usage / 1% MCU / 20% power (65W) before the (first ever driver reset(s) I've encountered computing ACEMD in three years.) The (1a0r) WU ended with a -97 (0xffffffffffffff9f) Unknown error number after 102sec at reference stock clock once I noticed the first couple of driver recoveries OCed. (FATAL : Cuda driver error 719 in file 'swanlibnv2.cpp' in line 1965)
A few other stable wingman (980ti / (2) 970's) high-end RAC systems (6 total) have error(s) (<100sec) with (1a0r) WU.
Too many errors (may have bug) 1a0r-SDOERR_opm994-0-1-RND9594
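The long hex strings next to the -55 and -97 exit codes are the same numbers shown as unsigned 64-bit values (two's complement). A small sketch of the conversion in both directions; the helper names are hypothetical and not part of BOINC or ACEMD:

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def as_unsigned_64(code: int) -> str:
    """Render a signed exit code the way the task page prints it (64-bit hex)."""
    return f"0x{code & MASK_64:016x}"


def as_signed_64(hex_text: str) -> int:
    """Turn the 64-bit hex form back into the signed exit code."""
    value = int(hex_text, 16)
    return value - (1 << 64) if value >= (1 << 63) else value


print(as_unsigned_64(-55))                   # 0xffffffffffffffc9
print(as_unsigned_64(-97))                   # 0xffffffffffffff9f
print(as_signed_64("0xffffffffffffff9f"))    # -97
```

So the hex value carries no extra information beyond the signed code; "Unknown error" only means the client has no text for that number.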

As of now (2) OPM995 are without issue on my 970's at very high OC's:

(WUid=11614432) 4a6fRO (50479 atoms with 9411 waters in system) 20.25hr estimated runtime at 12-15% CPU usage (3.2GHz) / 63% GPU usage (1511MHz) / 31% MCU (7200MHz) / 27% BUS (PCIe3.0 x4) / 34% power (110W) / 42C core / 820MB memory usage

(WUid=11614365) 4u15RO (51270 atoms with 8255 waters in system) 20.5hr estimated runtime at 12-15% CPU usage (3.2GHz) / 65% GPU usage (1511MHz) / 34% MCU (7010MHz) / 22% BUS (PCIe3.0 x8) / 60% power (120W) / 45C core / 843MB memory usage

eXaPower
Joined: 25 Sep 13
Posts: 265
Credit: 1,043,270,117
RAC: 1,772,186
Level
Met
Message 43724 - Posted: 5 Jun 2016 | 14:38:34 UTC

Any TX/980ti/980/970 (Present batch) SDOERR_opm99 grant 1,000,000 credit?
My -+ (runtime) Credit:
23,912.30 GPU / 11,332.23 CPU / 41,296.50 credits (27588 atoms) / 5mil step
74,154.80 / 16,389.80 / 377,254.50 credits (126091 atoms) / 5mil step

An odd short run 5mil step (~27k atoms) WU cropped up.

0 unsent
271 in progress
1155 success
47.62% error rate


Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,630,096,894
RAC: 9,819,606
Level
Trp
Message 43725 - Posted: 5 Jun 2016 | 15:54:02 UTC - in response to Message 43724.
Last modified: 5 Jun 2016 | 15:57:25 UTC

Any TX/980ti/980/970 (Present batch) SDOERR_opm99 grant 1,000,000 credit?
4by0-SDOERR_opm994-0-1-RND5591_1 58.472s (16h 14m 26s) 1.023.036 credits 170941 atoms 11.696 ns/day 5M steps
This workunit is very interesting: the initial replication was 2, and the other host that received it also got the +50% bonus, even though it returned the result only after 1d 14h.

Profile Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,496,456,504
RAC: 414,796
Level
Arg
Message 43727 - Posted: 5 Jun 2016 | 20:11:34 UTC - in response to Message 43725.

This workunit is very interesting: the initial replication was 2, and the other host that received it also got the +50% bonus, even though it returned the result only after 1d 14h.

AFAIK that's the way it's always worked here. The first reported WU sets the credit for everyone.

Profile Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,496,456,504
RAC: 414,796
Level
Arg
Message 43728 - Posted: 5 Jun 2016 | 20:15:08 UTC - in response to Message 43704.

Finally got an OPM on my Ubuntu 16.04 rig. Alas it didn't turn out to be an extra-long run and completed in 12h 35min at stock.
Based on the run time of other long WU's the credit is about half what it should be.

Had 4 OPMs finish today. The credit on all of them is 1/2 or less per hour compared to any other long WUs. Guess the credit wasn't fixed after all.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43729 - Posted: 5 Jun 2016 | 20:30:58 UTC - in response to Message 43728.
Last modified: 5 Jun 2016 | 20:36:10 UTC

Got 2 real extra-long tasks on my Win10 system and one 'fake' extra-long task on my Linux system. The real extra-long tasks got 900K Boinc credits whereas the normal-long task only received 147K credits (or there about).

Profile Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,496,456,504
RAC: 414,796
Level
Arg
Message 43730 - Posted: 5 Jun 2016 | 20:33:19 UTC - in response to Message 43729.
Last modified: 5 Jun 2016 | 20:34:08 UTC

I got 2 really long tasks on my Win10 system and one fake long task on my Linux system. The real long tasks got |900K Boinc credits whereas the not-really-long task (normal-ling) only received 147K credits (or there about).

Remedial math is a good post graduate course... ;-)

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,518,624
RAC: 292,156
Level
His
Message 43731 - Posted: 5 Jun 2016 | 20:37:01 UTC - in response to Message 43730.
Last modified: 5 Jun 2016 | 20:46:09 UTC

Just after correcting my remedial English :)

PS. Looks like it's backup-project time again 🕒

Profile Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,496,456,504
RAC: 414,796
Level
Arg
Message 43733 - Posted: 5 Jun 2016 | 20:59:45 UTC - in response to Message 43731.

Just after correcting my remedial English :)

PS. Looks like it's backup-project time again 🕒

I don't think it's you that needs the remedial math, and yep it's that time again.

Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Message 43745 - Posted: 7 Jun 2016 | 10:04:03 UTC

Really, I am out of ideas on how to fix the credits any further. I did everything I could imagine being wrong. I could blindly multiply the credits by whatever factor you guys tell me, but right now I have to base it off our usual credit calculation script.

Profile Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,496,456,504
RAC: 414,796
Level
Arg
Message 43749 - Posted: 8 Jun 2016 | 18:36:19 UTC - in response to Message 43745.

Really, I am out of ideas on how to fix the credits any further. I did everything I could imagine being wrong. I could blindly multiply the credits by whatever factor you guys tell me, but right now I have to base it off our usual credit calculation script.

Recent comparisons, OPM vs. CXCL12VOLK. Example from one of my machines:

1gzmR0-SDOERR_opm995-0-1-RND1802_0 (WU 11614349)
Sent: 3 Jun 2016 | 6:36:43 UTC
Reported: 5 Jun 2016 | 9:02:02 UTC
Status: Completed and validated
Run time: 162,200.44 s / CPU time: 47,231.66 s / Credit: 237,804.00

e6s24_e1s9p0f524-GERARD_CXCL12VOLK_15782120_2-0-1-RND1978_0 (WU 11613059)
Sent: 28 May 2016 | 21:23:31 UTC
Reported: 30 May 2016 | 4:49:33 UTC
Status: Completed and validated
Run time: 96,473.03 s / CPU time: 31,352.66 s / Credit: 233,875.00

Here's another one of my computers. This WU had 131548 Natoms:

2w61-SDOERR_opm994-0-1-RND7728_0 (WU 11616211)
Sent: 2 Jun 2016 | 17:30:20 UTC
Reported: 5 Jun 2016 | 18:09:42 UTC
Status: Completed and validated
Run time: 243,192.27 s / CPU time: 35,757.08 s / Credit: 262,409.00

e4s9_e1s18p0f473-GERARD_CXCL12VOLK_15782120_2-0-1-RND7513_1 (WU 11609049)
Sent: 1 Jun 2016 | 14:32:18 UTC
Reported: 2 Jun 2016 | 22:47:34 UTC
Status: Completed and validated
Run time: 98,520.08 s / CPU time: 29,575.68 s / Credit: 233,875.00

From the OPM WUs I've been running lately, it seems that the credit per hour is about 45% - 60% of what other/previous long WUs pay. On top of that there is a greater chance of failure with these long WUs. I would suggest erring on the high side rather than the low side when estimating credit, as it costs you nothing and it's one of the few tokens of appreciation that we receive for our small contribution to the great science that you guys are doing. Whining aside, keep up the excellent work. For a lot of us this is a small way that we can contribute to science.
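The 45% - 60% figure can be checked from the run times and credits of the four tasks listed above. A minimal sketch, assuming run time is in seconds (as on the task pages) and comparing each OPM task with the CXCL12VOLK task from the same machine; the function names and the "host 1" / "host 2" labels are only for illustration:

```python
def credit_per_hour(run_time_s: float, credit: float) -> float:
    """Credits earned per hour of run time."""
    return credit / (run_time_s / 3600.0)


def opm_to_cxcl_ratio(opm: tuple, cxcl: tuple) -> float:
    """Credit rate of the OPM task relative to the CXCL12VOLK task."""
    return credit_per_hour(*opm) / credit_per_hour(*cxcl)


# (run time in seconds, credit), copied from the task lines above
pairs = [
    ("host 1", (162_200.44, 237_804.00), (96_473.03, 233_875.00)),
    ("host 2", (243_192.27, 262_409.00), (98_520.08, 233_875.00)),
]

for label, opm, cxcl in pairs:
    print(
        f"{label}: OPM {credit_per_hour(*opm):,.0f}/hr, "
        f"CXCL12VOLK {credit_per_hour(*cxcl):,.0f}/hr, "
        f"ratio {opm_to_cxcl_ratio(opm, cxcl):.0%}"
    )
```

On these numbers the OPM tasks earn roughly 60% and 45% of the CXCL12VOLK rate on the two machines.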



Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Level

Message 43777 - Posted: 14 Jun 2016 | 12:29:45 UTC
Last modified: 14 Jun 2016 | 13:59:24 UTC

I thought you guys might appreciate seeing what can go wrong in a simulation ;) I always love these mistakes. Still, only 1 out of 600+ systems managed to break like this so I'm quite impressed.

http://imgur.com/qcvaMyq

Essentially because of some water between the protein and the lower membrane layer (whose upper side is hydrophobic, hence hates water), the membrane starts bending and when it bends it suddenly interacts with the periodic image* of the protein and decides that it likes it more than staying with the other membrane layer. And then it goes pop :D


* MD simulations are typically done using periodic interactions
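For readers unfamiliar with periodic images: under periodic boundaries every atom also "sees" the copies of the system in the neighbouring boxes, and distances are usually taken to the nearest copy (the minimum image convention). A rough sketch, assuming a simple rectangular box; this only illustrates the idea and is not taken from ACEMD, and the function names are made up:

```python
import math


def minimum_image_delta(a: float, b: float, box: float) -> float:
    """Shortest displacement from a to b along one axis of a periodic box."""
    d = (b - a) % box        # wrapped into [0, box)
    if d > box / 2:
        d -= box             # the copy in the neighbouring box is closer
    return d


def periodic_distance(p, q, box_lengths) -> float:
    """Distance between p and the nearest periodic image of q."""
    return math.sqrt(
        sum(minimum_image_delta(a, b, length) ** 2
            for a, b, length in zip(p, q, box_lengths))
    )


box = (10.0, 10.0, 10.0)
p = (1.0, 0.0, 0.0)
q = (9.0, 0.0, 0.0)
print(math.dist(p, q))              # 8.0 inside the box
print(periodic_distance(p, q, box)) # 2.0 to the nearest image
```

That is how a bent membrane near the edge of the box can end up much closer to the protein's periodic image than it looks inside a single box.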

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 43778 - Posted: 14 Jun 2016 | 14:25:28 UTC - in response to Message 43777.

Cool animation, Stefan! Thanks for sharing! :)