Message boards : Number crunching : Stalled WUs?

Author Message
lohphat
Joined: 21 Jan 10
Posts: 20
Credit: 434,877,169
RAC: 144,520
Level
Gln
Message 48295 - Posted: 7 Dec 2017 | 20:26:58 UTC

I have a WU which has been running for several days and seems to get stuck at a fixed percentage complete.

Then I shut down BOINC and relaunch, and the accumulated work disappears and it restarts from a much lower percentage.

Rinse repeat.

The WU is crunching but with no percentage progress.

PappaLitto
Joined: 21 Mar 16
Posts: 350
Credit: 2,454,272,877
RAC: 7,249,180
Level
Phe
Message 48296 - Posted: 7 Dec 2017 | 20:31:45 UTC - in response to Message 48295.

Just abort it

BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 928,366,503
RAC: 3,482,438
Level
Glu
Message 48306 - Posted: 8 Dec 2017 | 16:52:13 UTC

I had transfers stalled for weeks on a system that had "no more work" on GPUGRID (board too slow). Aborting worked only until the next reboot; I only got rid of them by detaching and reattaching. They may have been stuck for months, as I rarely check that feature. Maybe this was the "1" task the server status shows as ready to be sent. I finally got rid of it a few minutes ago.

lohphat
Joined: 21 Jan 10
Posts: 20
Credit: 434,877,169
RAC: 144,520
Level
Gln
Message 48309 - Posted: 8 Dec 2017 | 19:12:16 UTC

Two more work units stalled which I had to abort.

Methinks there's a systemic problem managing WUs.

lohphat
Joined: 21 Jan 10
Posts: 20
Credit: 434,877,169
RAC: 144,520
Level
Gln
Message 48718 - Posted: 22 Jan 2018 | 18:41:37 UTC

It seems related to running Firefox (knowing it has h/w acceleration options) -- I'm still playing with the settings, but I can get GPUGRID work units to stall simply by opening up YouTube and playing a video.

The WU percentage stops increasing but it still shows active.

After 10 hours I exit BOINC and restart, and the hours worked drop back down to the point where it stalled.

So it's still happening, and I can recreate the failure consistently.

I've restarted the project to refresh resources.
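
For anyone who wants to confirm a stall like this from logged numbers rather than by watching the manager, here is a minimal sketch. It assumes you have already collected (elapsed seconds, fraction done) samples for one task, for example by periodically noting the progress shown by BOINC. The function name `find_stall`, the one-hour window, and the minimum progress step are all illustrative assumptions, not anything GPUGRID or BOINC defines.

```python
from typing import List, Optional, Tuple

# One sample: (seconds since monitoring began, fraction done between 0.0 and 1.0).
Sample = Tuple[float, float]


def find_stall(
    samples: List[Sample],
    window_s: float = 3600.0,   # assumed: one hour without progress counts as a stall
    min_delta: float = 1e-4,    # assumed: smallest change treated as real progress
) -> Optional[float]:
    """Return the time progress last advanced if the task then went at least
    window_s seconds without gaining min_delta; return None otherwise."""
    if not samples:
        return None
    last_t, last_f = samples[0]
    for t, f in samples[1:]:
        if f - last_f >= min_delta:
            # Progress moved forward: remember when and where.
            last_t, last_f = t, f
        elif f < last_f - min_delta:
            # Progress went backwards (e.g. a restart from an older checkpoint):
            # start measuring again from the new, lower value.
            last_t, last_f = t, f
        elif t - last_t >= window_s:
            # Still active but no progress for the whole window.
            return last_t
    return None
```

With samples such as `[(0, 0.10), (600, 0.12), (1200, 0.12), (4800, 0.12)]` the function reports that progress last advanced at 600 seconds, which matches the "still shows active but the percentage stops increasing" symptom described above.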

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1898
Credit: 12,079,961,419
RAC: 2,420,809
Level
Trp
Message 48723 - Posted: 22 Jan 2018 | 23:37:15 UTC - in response to Message 48718.

It seems related to running Firefox (knowing it has h/w acceleration options) -- I'm still playing with the settings, but I can get GPUGRID work units to stall simply by opening up YouTube and playing a video.
I suppose that this card is your GTX 980Ti. If it's overclocked, then you should reduce its clock speed by 100MHz to see if that makes it more stable. Your card reaches 78°C (172°F), which could be too much while using it for crunching and other purposes simultaneously. It is also recommended to dust off its fins with compressed air.
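
To keep an eye on temperature and clocks while testing a lower overclock, something like the following could help. It is only a sketch: it assumes an NVIDIA driver with `nvidia-smi` on the PATH and that every queried field comes back as a plain integer (some cards report `[N/A]` for certain fields, which this does not handle). The helper names `parse_reading`, `read_gpus`, and `c_to_f` are made up for illustration.

```python
import subprocess
from typing import List, NamedTuple


class GpuReading(NamedTuple):
    temp_c: int    # GPU core temperature in degrees Celsius
    sm_mhz: int    # current SM (core) clock in MHz
    mem_mhz: int   # current memory clock in MHz


# nvidia-smi query for temperature plus core and memory clocks, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,clocks.sm,clocks.mem",
    "--format=csv,noheader,nounits",
]


def parse_reading(line: str) -> GpuReading:
    """Parse one CSV line such as '78, 1190, 3505' into a GpuReading."""
    temp, sm, mem = (int(field.strip()) for field in line.split(","))
    return GpuReading(temp, sm, mem)


def c_to_f(temp_c: float) -> float:
    """Convert Celsius to Fahrenheit."""
    return temp_c * 9 / 5 + 32


def read_gpus() -> List[GpuReading]:
    """Run nvidia-smi once and return one reading per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_reading(line) for line in out.splitlines() if line.strip()]
```

Calling `read_gpus()` in a loop while a work unit runs and a video plays would show whether the card is sitting near the 78°C mentioned above; `c_to_f(78)` gives roughly 172°F, matching the figure quoted.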

Betting Slip
Joined: 5 Jan 09
Posts: 635
Credit: 2,433,764,450
RAC: 2,303,272
Level
Phe
Message 48725 - Posted: 23 Jan 2018 | 11:40:22 UTC - in response to Message 48718.

It seems related to running Firefox (knowing it has h/w acceleration options) -- I'm still playing with the settings, but I can get GPUGRID work units to stall simply by opening up YouTube and playing a video.

The WU percentage stops increasing but it still shows active.

After 10 hours I exit BOINC and restart, and the hours worked drop back down to the point where it stalled.

So it's still happening, and I can recreate the failure consistently.

I've restarted the project to refresh resources.


I had to roll back the driver to 385.41, which is the latest driver that doesn't have issues with the Firefox browser. It's discussed on the Nvidia forums; I had the driver "stopped responding and recovered" error while browsing with Firefox.

lohphat
Joined: 21 Jan 10
Posts: 20
Credit: 434,877,169
RAC: 144,520
Level
Gln
Message 48732 - Posted: 24 Jan 2018 | 7:45:57 UTC - in response to Message 48725.

I had to roll back the driver to 385.41, which is the latest driver that doesn't have issues with the Firefox browser. It's discussed on the Nvidia forums; I had the driver "stopped responding and recovered" error while browsing with Firefox.


That did it.

However, I have in my notes that 385.41 caused WU errors with Einstein@Home -- I'm waiting for the project to issue me new WUs to verify.

But as for GPUGRID, it fixed the problem.

FWIW, it never crashed the driver or FFox -- it just caused GPUGRID WUs to stall but not error out.

Erico
Joined: 19 Apr 18
Posts: 1
Credit: 149,850
RAC: 50
Level

Message 49355 - Posted: 24 Apr 2018 | 20:51:28 UTC

Just wanted to say I had this problem too. The project stalled three times from 0 to 50%, then I reduced my GTX 970's memory clock from 3800 MHz (which never had any issues running another project, Milkyway@Home) to the default of 3500 MHz and the last 50% didn't stall. I don't know if it's just a coincidence or not.

I stopped running GPUGRID because of this, so if the admins want to know what project it was, just look at the last one I turned over.

Betting Slip
Joined: 5 Jan 09
Posts: 635
Credit: 2,433,764,450
RAC: 2,303,272
Level
Phe
Message 49356 - Posted: 24 Apr 2018 | 21:23:26 UTC - in response to Message 49355.
Last modified: 24 Apr 2018 | 21:40:14 UTC


I stopped running GPUGRID because of this, so if the admins want to know what project it was, just look at the last one I turned over.


A lot of the work on this project is far more demanding of video RAM and GPUs than the projects you mention. Overclocking your VRAM was your problem, not this project's.

When you see "The simulation has become unstable. Terminating to avoid lock-up" it's almost always due to overclocking.
____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline
