Message boards : Number crunching : Aborted by server

Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Message 46125 - Posted: 10 Jan 2017 | 18:46:55 UTC

I want to apologize to everyone who lost work on the BNBS simulations a moment ago.

It was an unfortunate step I had to take. There was a small but important mistake in the BNBS simulations which caused them to not chain correctly together. We will still use the simulations which came back but we have a deadline for the publication and I had to make use of all the resources for the fixed BNBS2 simulations.

I am really sorry for this. I considered it well before doing it but I believe that it made sense given the circumstances.

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,656,004
RAC: 407,908
Message 46127 - Posted: 10 Jan 2017 | 19:02:11 UTC
Last modified: 10 Jan 2017 | 19:28:29 UTC

Was wondering, just had 4 BNBS WUs aborted while running.

Edit: after investigating further, it seems there were 6 running. Strange that 2 were actually uploading when they were aborted, yet the website reports 0 time for them, as if they hadn't started. Only 4 of the 8 show time on the website, but 6 were running and 2 hadn't yet started.

PappaLitto
Joined: 21 Mar 16
Posts: 271
Credit: 1,323,887,081
RAC: 5,505,188
Message 46128 - Posted: 10 Jan 2017 | 19:04:56 UTC

Thank you for informing us

Erich56
Joined: 1 Jan 15
Posts: 372
Credit: 1,679,211,402
RAC: 2,917,345
Message 46133 - Posted: 10 Jan 2017 | 20:53:57 UTC - in response to Message 46128.

Thank you for informing us

+ 1

Erich56
Joined: 1 Jan 15
Posts: 372
Credit: 1,679,211,402
RAC: 2,917,345
Message 46134 - Posted: 10 Jan 2017 | 20:56:25 UTC

the newly downloaded WU

WT_S3F9_C2-SDOERR_BNBS2-0-4-RND5326_0

errored out after some 7,000 seconds. This has not happened before with any BNBS.
Was this just a coincidence, or could anything be wrong with the new WUs?

Dave Peachey
Joined: 16 May 09
Posts: 11
Credit: 130,663,120
RAC: 0
Message 46135 - Posted: 10 Jan 2017 | 21:36:24 UTC - in response to Message 46134.
Last modified: 10 Jan 2017 | 21:53:20 UTC

the newly downloaded WU WT_S3F9_C2-SDOERR_BNBS2-0-4-RND5326_0 errored out after some 7,000 seconds. This has not happened before with any BNBS.
Was this just a coincidence, or could anything be wrong with the new WUs?


I notice that the table at the bottom of the "Server status" page shows a worrying statement of zero successes (which could be attributed to a lack of returned WU results) and also a 100% error rate (which suggests all returns to date have failed).

Whilst this may be caused by just a few failed WUs thus far, I've not seen that rate of errors since last October's early beta-testing of the PASCAL version of the application so maybe there is, indeed, something amiss within this new batch of WUs.

Or else I'm imagining a trend where none exists based on too small a sample size ... which is always possible.

Dave

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,659,932,044
RAC: 9,994,938
Message 46136 - Posted: 10 Jan 2017 | 22:06:55 UTC - in response to Message 46135.

I notice that the table at the bottom of the "Server status" page shows a worrying statement of zero successes (which could be attributed to a lack of returned WU results) and also a 100% error rate (which suggests all returns to date have failed).
This is normal, as not enough time has passed since the release of this batch to have any successful tasks.

Whilst this may be caused by just a few failed WUs thus far, I've not seen that rate of errors since last October's early beta-testing of the PASCAL version of the application so maybe there is, indeed, something amiss within this new batch of WUs.
I have had 4 running for 3 hours 20 minutes without any failures. They need another 6 hours to finish, so we'll have a normalized error rate after that.
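
The early bias described above can be illustrated with a minimal Python sketch. It is only an illustration: the batch composition, the finish times and the function name `observed_error_rate` are hypothetical, and it assumes the status page simply divides failed returns by all returns so far. Failures come back within a few hours while successes need roughly nine, so the ratio starts at 100% and only settles once the long tasks report.

```python
def observed_error_rate(tasks, now_hours):
    """tasks: list of (finish_hours, succeeded) tuples.

    Returns the fraction of returned results that are errors,
    or None if nothing has been returned yet.
    """
    returned = [ok for finish, ok in tasks if finish <= now_hours]
    if not returned:
        return None
    return returned.count(False) / len(returned)

# Hypothetical batch: 2 tasks fail early, 18 succeed after about 9 hours or more.
batch = [(2.0, False), (3.5, False)] + [(9.0 + 0.1 * i, True) for i in range(18)]

early = observed_error_rate(batch, 4.0)    # only the two failures are back
late = observed_error_rate(batch, 12.0)    # every task has reported

print(f"after 4 h: {early:.0%}, after 12 h: {late:.0%}")
```

With these made-up numbers the reported rate is 100% after 4 hours and 10% once all tasks are in, even though nothing about the batch changed.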

Betting Slip
Joined: 5 Jan 09
Posts: 589
Credit: 2,041,743,975
RAC: 1,505,145
Message 46137 - Posted: 10 Jan 2017 | 22:08:47 UTC - in response to Message 46134.

the newly downloaded WU

WT_S3F9_C2-SDOERR_BNBS2-0-4-RND5326_0

errored out after some 7,000 seconds. This has not happened before with any BNBS.
Was this just a coincidence, or could anything be wrong with the new WUs?



More likely your card is overclocked too much for this particular WU as you have "Simulation has become unstable" in your output file.

Stefan
Volunteer moderator
Project developer
Project scientist
Joined: 5 Mar 13
Posts: 258
Credit: 0
RAC: 0
Message 46139 - Posted: 10 Jan 2017 | 22:52:25 UTC

Thanks for the understanding.

In terms of performance they should be identical. I only changed the input configuration and did a single initial simulation step to generate the needed files.

I would say that if it works for one (i.e. Zoltan) it should work for all, in the sense that the only thing I can imagine breaking them would be a broken input file. But since they all share the same input files, they should work.

But I will check the results tomorrow anyway to see if anything is going wrong, as we don't have the luxury of repeating that mistake.

Erich56
Joined: 1 Jan 15
Posts: 372
Credit: 1,679,211,402
RAC: 2,917,345
Message 46143 - Posted: 11 Jan 2017 | 5:29:46 UTC - in response to Message 46137.

the newly downloaded WU

WT_S3F9_C2-SDOERR_BNBS2-0-4-RND5326_0

errored out after some 7,000 seconds. This has not happened before with any BNBS.
Was this just a coincidence, or could anything be wrong with the new WUs?


More likely your card is overclocked too much for this particular WU as you have "Simulation has become unstable" in your output file.

hm, I forgot to look at the output file, thanks for the hint.
However, I wonder if we can really talk about overclocking at a rate of 1240MHz. On my other GTX980ti (same model as this one here) the clock is 1340MHz - without any problem.
But, as you said, it may have had to do with that particular WU.

BTW, also yesterday, another BNBS failed on a GTX750ti at 1137MHz.

So, maybe these BNBS are somewhat more susceptible to overclocking, in comparison to other WUs.

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,656,004
RAC: 407,908
Message 46145 - Posted: 11 Jan 2017 | 17:13:33 UTC
Last modified: 11 Jan 2017 | 17:16:27 UTC

I've had a couple of the new BNBS2 WUs finish on my 1060 cards now. Seem to run fine. Other than the cancelled ones I've seen no failures at all with the BNBS2 WUs and I run 15 750Ti factory OCed cards.

Dave Peachey
Joined: 16 May 09
Posts: 11
Credit: 130,663,120
RAC: 0
Message 46146 - Posted: 11 Jan 2017 | 19:31:59 UTC - in response to Message 46145.
Last modified: 11 Jan 2017 | 19:39:51 UTC

I've had a couple of the new BNBS2 WUs finish on my 1060 cards now. Seem to run fine. Other than the cancelled ones I've seen no failures at all with the BNBS2 WUs and I run 15 750Ti factory OCed cards.

Mixed success from me:
- my 1050ti choked on a BNBS2 yesterday evening after 3.5 hours, then happily crunched a GERRARD_MO_TRV2 without issue and is now 2.5 hours into another BNBS2 (albeit threatening a >31 hour turnaround)
- my 1060 has recently finished one in just under 21 hours and is embarking on the next one (with a similar predicted turnaround)

Both cards are (now) in the same box with no system o/c and only the factory-set o/c on the cards so my first failure remains a mystery unless these WUs really are particularly sensitive in some circumstances.

The error rate for these WUs on the "Server status" page is coming down so that's a good sign ... albeit the turnaround time for some of these WUs seems to be on the high side (so more chances of failure, perhaps).

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,659,932,044
RAC: 9,994,938
Message 46147 - Posted: 11 Jan 2017 | 20:10:40 UTC

I had one failure on BNBS2: GLU73ALA_S5F20_C2-SDOERR_BNBS2-0-4-RND2615_0
This was on an ASUS GTX 980Ti Strix, which has 3600MHz memory clock, and its GPU is boosted to 1401MHz (which is a bit optimistic under Windows XP, so I shaved off 11MHz now).
Due to very low outside temperatures (-11°C, 12°F) I've reanimated my GTX680 (@1189MHz) in a DQ45CB motherboard with a Core2Duo E8500 (3.16GHz).
It has successfully crunched a WT_S3F4_C27-SDOERR_BNBS2-0-4-RND5274_0 in 22h 55m 10s :)

Beyond
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,656,004
RAC: 407,908
Message 46163 - Posted: 12 Jan 2017 | 19:41:20 UTC

So far 15 of the new BNBS2 WUs completed:

9 on 750Ti 2GB cards
4 on 1060 3GB
1 on 1050Ti 4GB
1 on 670 2GB

4 more are at 92% - 99% done

No errors on any...

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,659,932,044
RAC: 9,994,938
Message 46165 - Posted: 12 Jan 2017 | 20:31:33 UTC
Last modified: 12 Jan 2017 | 20:36:22 UTC

My ancient host finished another of these BNBS2 workunits: GLU73ALA_S14F19_C2-SDOERR_BNBS2-0-4-RND8567_0 in 23h 56m 53s.
It has missed the 24h bonus by 13 minutes, because it was downloaded earlier, but I've reduced my work cache to 0 days since then.
I don't have any failures in the past 24 hours.
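
The bonus arithmetic above can be checked with a short Python sketch. It rests on a labeled assumption: that the 24h bonus is judged on the time from download to return, so any time a task waits in the work cache counts against it. The name `bonus_margin` and the 16-minute cache wait are hypothetical values chosen for illustration; only the 23h 56m 53s runtime comes from the post.

```python
from datetime import timedelta

# Assumption: the bonus applies if the result is returned within 24 h of download.
BONUS_WINDOW = timedelta(hours=24)

def bonus_margin(wait_in_cache, runtime):
    """Positive: time to spare. Negative: the window was missed by that much."""
    return BONUS_WINDOW - (wait_in_cache + runtime)

runtime = timedelta(hours=23, minutes=56, seconds=53)  # runtime quoted in the post

# With a work cache of 0 days the task starts as soon as it is downloaded.
margin_zero_cache = bonus_margin(timedelta(0), runtime)

# Hypothetical: the task sat in the cache for 16 minutes before starting.
margin_cached = bonus_margin(timedelta(minutes=16), runtime)

print("zero cache margin:", margin_zero_cache)
print("cached margin is negative:", margin_cached < timedelta(0))
```

On a host this close to the limit, a runtime of 23h 56m 53s leaves only about three minutes of slack, so even a short wait in the cache is enough to miss the window, which is consistent with reducing the cache to 0 days.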

Dave Peachey
Joined: 16 May 09
Posts: 11
Credit: 130,663,120
RAC: 0
Message 46167 - Posted: 13 Jan 2017 | 0:20:29 UTC

The situation for me is similar:
- three WUs completed successfully on my 1060 in approx. 20-21 hrs
- one early failure and one recent, successful completion on the 1050ti in just under 30 hrs
- two new WUs are currently on the go, one on each card; predicted run-times in each case are similar to the previous ones

I too am keeping my work cache at a low level to avoid downloading WUs too soon and missing the 24hr deadline but not right down to zero days as I need some headroom to ensure the CPU tasks on that machine are kept trickling in.
