
Message boards : News : acemdbeta application - discussion

Profile MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32645 - Posted: 2 Sep 2013 | 22:34:54 UTC

The Beta application may be somewhat volatile for the next few days, as we try to understand and fix the remaining common failure modes. This will ultimately lead to a more stable production application, so please do continue to take WUs from there. Your help's appreciated.

Thanks,

Matt

Bedrich Hajek
Joined: 28 Mar 09
Posts: 340
Credit: 3,821,859,209
RAC: 942,358
Message 32652 - Posted: 3 Sep 2013 | 6:40:28 UTC - in response to Message 32645.

These new betas reduced the speed of the 690 video cards on my Windows 7 computer from 1097 MHz to 914 MHz while they were running on that GPU. If I had a non-beta running alongside the beta on another GPU, the non-beta ran at the higher speed. When the beta finished and a non-beta started running on that GPU, the speed returned to 1097 MHz. This did not happen on Windows XP with the 690 video card. The driver on the Windows 7 computer is the beta 326.80, with EVGA Precision X 4.0, while the Windows XP computer is running 314.22, with EVGA Precision X 4.0.




juan BFP
Joined: 11 Dec 11
Posts: 21
Credit: 145,256,218
RAC: 19,356
Message 32653 - Posted: 3 Sep 2013 | 13:21:22 UTC

How do we get the beta apps? I didn't receive any. My settings:

ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2: no
ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: no
ACEMD beta: yes
ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2: yes

zombie67 [MM]
Joined: 16 Jul 07
Posts: 165
Credit: 260,948,154
RAC: 247
Message 32654 - Posted: 3 Sep 2013 | 13:22:36 UTC - in response to Message 32653.

How do we get the beta apps? I didn't receive any. My settings:

ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2: no
ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: no
ACEMD beta: yes
ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2: yes


Also select "Run test applications?"
____________
Dublin, California
Team: SETI.USA

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 32655 - Posted: 3 Sep 2013 | 13:24:16 UTC

I hope my massive quantity of failed beta wus is helping ;)

juan BFP
Message 32657 - Posted: 3 Sep 2013 | 14:04:47 UTC - in response to Message 32654.
Last modified: 3 Sep 2013 | 14:43:03 UTC

How do we get the beta apps? I didn't receive any. My settings:

ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2: no
ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: no
ACEMD beta: yes
ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2: yes


Also select "Run test applications?"

OK, I selected it now.

May I ask something else? I see a lot of beta WUs with running times of about 60 seconds that paid 1,500 credits. Is that right? Normally a WU takes a long time to crunch here.

All my GPUs are CUDA 5.5 capable; do I need to change anything else in the settings?

Thanks for the help

Profile MJH
Message 32660 - Posted: 3 Sep 2013 | 15:10:31 UTC - in response to Message 32655.


I hope my massive quantity of failed beta wus is helping ;)



5pot - that's exactly the problem that I am trying to fix. Your machine seems one of the worst affected. Could you PM me more details about its setup, please? In particular, whether you have any AV or GPU-related utilities installed.

MJH

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Message 32661 - Posted: 3 Sep 2013 | 15:56:51 UTC - in response to Message 32598.
Last modified: 3 Sep 2013 | 16:01:33 UTC

# GPU 0 Current Temp: 66 C Target Temp 1
# GPU 1 Current Temp: 57 C Target Temp 1
6653000 1116.8328 2531.7110 3339.7214 -271975.1685 33633.0608 -231353.8424 46210.2440 0.0000 -185143.5984 296.9487 0.0000 0.0000
# GPU 0 Current Temp: 66 C Target Temp 1
# GPU 1 Current Temp: 56 C Target Temp 1
6654000 1.#QNB 1.#QNB 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 1.#QNB 1.#QNB 0.0000 0.0000
# The simulation has become unstable. Terminating to avoid lock-up (1)

Snippet from an ACEMD beta version v8.10 (cuda55) WU that failed when I started using the system. It ran for 8.8h before becoming unstable :(
- Using MSI Afterburner to control GPU temps.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Message 32662 - Posted: 3 Sep 2013 | 16:01:46 UTC

I got 3 ACEMD v8.07 betas overnight and all failed. I got 3 short ones with v8.09 that ran longer. And now I have a NOELIA_KLEBEbeta-2-3 running v8.09 that has done 76% in 16h32m on my 660. This is way longer than on versions 8.00 to 8.04.
____________
Greetings from TJ

Carlos Augusto Engel
Joined: 5 Jun 09
Posts: 34
Credit: 1,755,736,429
RAC: 968,004
Message 32664 - Posted: 3 Sep 2013 | 17:21:59 UTC - in response to Message 32657.

May I ask something else? I see a lot of beta WUs with running times of about 60 seconds that paid 1,500 credits. Is that right? Normally a WU takes a long time to crunch here.

All my GPUs are CUDA 5.5 capable; do I need to change anything else in the settings?

Thanks for the help


That is ok. I received a lot of small ones too:

7242845 4748357 3 Sep 2013 | 4:51:19 UTC 3 Sep 2013 | 9:11:28 UTC Completed and validated 14,906.95 7,437.96 15,000.00 ACEMD beta version v8.10 (cuda55)
7242760 4748284 3 Sep 2013 | 4:35:37 UTC 3 Sep 2013 | 4:48:40 UTC Completed and validated 152.31 76.89 1,500.00 ACEMD beta version v8.10 (cuda55)
7242759 4748283 3 Sep 2013 | 4:35:37 UTC 3 Sep 2013 | 4:42:40 UTC Completed and validated 152.00 74.30 1,500.00 ACEMD beta version v8.10 (cuda55)
7242751 4748277 3 Sep 2013 | 4:35:37 UTC 3 Sep 2013 | 4:51:19 UTC Completed and validated 151.85 74.69 1,500.00 ACEMD beta version v8.10 (cuda55)

juan BFP
Message 32665 - Posted: 3 Sep 2013 | 19:46:47 UTC

But that kind of WU produces a lot of:

197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

Their initial estimate is very short, but some run for a long time; maybe a bug, who knows?

I.E. http://www.gpugrid.net/result.php?resultid=7244185

Richard Haselgrove
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Message 32666 - Posted: 3 Sep 2013 | 21:16:58 UTC - in response to Message 32665.

But that kind of WU produces a lot of:

197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

Their initial estimate is very short, but some run for a long time; maybe a bug, who knows?

Number crunching knows.

juan BFP
Message 32667 - Posted: 3 Sep 2013 | 21:25:45 UTC - in response to Message 32666.

But that kind of WU produces a lot of:

197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

Their initial estimate is very short, but some run for a long time; maybe a bug, who knows?

Number crunching knows.

Thanks for the info, but editing client_state.xml is dangerous territory, at least for me; I'll wait for the fix.
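For context, the 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED aborts discussed above come from a simple per-task budget the BOINC client computes from the workunit's FLOP bound and the host's estimated speed. A rough sketch of the idea (field names as in client_state.xml; all numbers hypothetical):

```python
# Rough sketch of how the BOINC client derives a task's time limit.
# A task is aborted with 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED once its
# elapsed time exceeds this budget. Values below are made up.

def max_elapsed_time(rsc_fpops_bound, flops_estimate):
    """Time budget in seconds: the workunit's FLOP bound divided by the
    client's estimate of the device's speed."""
    return rsc_fpops_bound / flops_estimate

# A bound sized for a ~60 s test task...
short_budget = max_elapsed_time(rsc_fpops_bound=5e13, flops_estimate=5e11)
# ...leaves only ~100 s before the client kills the task, so a beta WU
# that actually needs hours under the same estimate gets aborted.
```

So a batch whose rsc_fpops_bound was sized for 60-second runs will kill any task in the batch that turns out to run long.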

Bedrich Hajek
Message 32668 - Posted: 4 Sep 2013 | 1:18:27 UTC

On my computers, the beta versions 8.05 to 8.10 are significantly slower than version 8.04. The MJHARVEY_TEST14, for example, ran about 70 seconds with version 8.04, but takes about 4 minutes to complete on versions 8.05 through 8.10. I ran NOELIA_KLEBEbeta WUs in 10 to 12 hours on version 8.04, while currently I am running 4 of these NOELIA units on versions 8.05 and 8.10, and it looks like they will finish in about 16 to 20 hours. These results are typical for Windows 7 and XP, CUDA 4.2 and 5.5, Nvidia drivers 314.22 and 326.80. The only difference is that Windows 7 downclocks but XP doesn't. Please don't cancel the units; they seem to be running okay, and I want to finish them to prove that point. I hope the next beta version is faster and better.


Profile skgiven
Message 32675 - Posted: 4 Sep 2013 | 10:08:39 UTC - in response to Message 32661.

Had another Blue Screen crash!
The culprit was a NOELIA_KLEBEbeta. It had run for 16h 30min on a GTX660Ti (which sounds a bit too long).

Cold started the system and the same WU restarted from zero. I aborted the WU, 063px68-NOELIA_KLEBEbeta-0-3-RND7563_1.

Unfortunately, 3 WUs from other projects errored out: a WUProp task and two climate models (330h lost)!

The same WU had already completed on a GTX 560 Ti using v8.02:
7221577 114293 29 Aug 2013 | 15:39:46 UTC 3 Sep 2013 | 18:13:38 UTC Completed and validated 64,854.86 2,489.65 95,200.00 ACEMD beta version v8.02 (cuda42)
7244112 139265 3 Sep 2013 | 15:40:26 UTC 4 Sep 2013 | 9:25:31 UTC Aborted by user 60,387.27 15,443.53 --- ACEMD beta version v8.10 (cuda55)

Obviously the changes have made the WU run slower; a GTX660Ti should be much faster than a GTX 560 Ti.

Are you trying to stabilize WUs by temperature-targeted control of the GPU, or do you just want to see if there is a temp issue?

The NOELIA_KLEBE WU's are still causing driver restarts and occasionally blue screen crashing the system, which kills other work. The WU below might not have been running properly/using the GPU (seen previously with GPU load at 0).


# GPU 0 Current Temp: 32 C Target Temp 1
# GPU 1 Current Temp: 54 C Target Temp 1
4269000 1119.7984 2489.9390 3374.5652 -270800.3207 33500.5416 -230315.4765 46055.2623 0.0000 -184260.2142 295.9528 20749.7732 20749.7732
#SWAN : Running in DEBUG mode
# CUDA Synchronisation mode: BLOCKING
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1110MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32641
#SWAN NVAPI Version: NVidia Complete Version 1.10
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3192M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1110MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32641
# SWAN : Attempt to malloc 1234688. 2049331200 free
# SWAN : Attempt to malloc 144640. 2048020480 free
# SWAN : Attempt to malloc 317056. 2046971904 free
# SWAN : Attempt to malloc 80128. 2045923328 free
# SWAN : Attempt to malloc 1152. 2045923328 free
# SWAN : Attempt to malloc 1152. 2044874752 free
# SWAN : Attempt to malloc 1152. 2044874752 free
# SWAN : Attempt to malloc 1152. 2044874752 free
# SWAN : Attempt to malloc 2816. 2044874752 free
# SWAN : Attempt to malloc 1792. 2044874752 free
# SWAN : Attempt to malloc 1152. 2044874752 free
...
# swanReallocHost: new allocation of 4
# swanRealloc: new allocation of 2580480
# SWAN : Attempt to malloc 2581504. 1747210240 free
...

Unfortunately, the problems I've experienced with the NOELIA_KLEBE WU's have been too severe. While I don't mind testing a WU, and getting WU failures, testing to the point of self-destruction isn't for me.

Jacob Klein
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Message 32680 - Posted: 4 Sep 2013 | 10:32:58 UTC - in response to Message 32675.
Last modified: 4 Sep 2013 | 10:35:40 UTC

Oh man, skgiven! You lost 330 hours of work? That's very unfortunate.
It sounds like the beta app needs even more critical sections added to avoid more driver crashes and BSODs.

MJH:
Are there any other sections within the app that are missing the BOINC critical section logic? We need you to solve this.

Profile skgiven
Message 32686 - Posted: 4 Sep 2013 | 12:57:22 UTC - in response to Message 32680.

It's not the first time, or the second - Maybe I'll never learn!

Fortunately only 2 climate WU's were running this time (I had others suspended, but was hoping I could get a couple done without any issues; my system has been very stable of late - close, but no cigar). Hopefully I won't see too many more...

It seems WUProp and Climate have tasks that are particularly vulnerable to such failures. C'est la vie,




Profile MJH
Message 32690 - Posted: 4 Sep 2013 | 13:42:53 UTC - in response to Message 32686.


It seems WUProp and Climate have tasks that are particularly vulnerable to such failures. C'est la vie,


Highly surprising that the driver reset would cause two non-GPU-using apps (I presume) to crash. Were their graphical screen-savers enabled? Are you sure the BOINC client didn't get confused and terminate them?

MJH

Profile MJH
Message 32691 - Posted: 4 Sep 2013 | 13:44:11 UTC - in response to Message 32680.


Are there any other sections within the app that are missing the BOINC critical section logic? We need you to solve this.


8.11 (now on beta and short) represents the best that can be done using that method.

MJH

Jacob Klein
Message 32696 - Posted: 4 Sep 2013 | 14:29:53 UTC - in response to Message 32690.

Note: A blue screen is worse than a driver reset. And I have sometimes seen GPUGrid tasks, when being suspended (looking at the NOELIA ones again), give me blue screens in the past, with error DPC_WATCHDOG_VIOLATION.

My experience has been:

- A driver reset can cause GPU tasks to fail, even if they are on other GPUs working on other projects.
- A BSOD can cause any task to fail, including CPU ones, if they aren't robust enough to handle resuming after restarting Windows from the abrupt BSOD.

Make sense?




It seems WUProp and Climate have tasks that are particularly vulnerable to such failures. C'est la vie,


Highly surprising that the driver reset would cause two non-GPU-using apps (I presume) to crash. Were their graphical screen-savers enabled? Are you sure the BOINC client didn't get confused and terminate them?

MJH

Profile MJH
Message 32697 - Posted: 4 Sep 2013 | 14:42:04 UTC - in response to Message 32696.


Note: A Blue-Screen is worse than a driver reset. And I have sometimes seen GPUGrid tasks, when being suspended (looking at the NOELIA ones again)... give me blue screens in the past, with error DPC_WATCHDOG_VIOLATION.


Yes - I read "blue screen" but heard "driver reset". BSODs are by definition a driver bug. It's axiomatic that no user program should be able to crash the kernel.
"DPC_WATCHDOG_VIOLATION" is the event that the driver is supposed to trap and recover from, triggering a driver reset. Evidently that's not a perfect process.

MJH

Jacob Klein
Message 32699 - Posted: 4 Sep 2013 | 15:01:26 UTC - in response to Message 32697.

I think if a driver encounters too many TDRs in a short period of time, the OS issues the DPC_WATCHDOG_VIOLATION BSOD.

I believe it is not a driver issue.
I believe it is a result of getting too many TDRs (from GPUGrid apps).

Profile Stoneageman
Joined: 25 May 09
Posts: 211
Credit: 12,252,435,346
RAC: 8,631,734
Message 32707 - Posted: 4 Sep 2013 | 18:03:34 UTC
Last modified: 4 Sep 2013 | 18:11:59 UTC

5/5 of the 8.11 NOELIA beta WUs failed on time exceeded. Sample
Previously, 8.11 MJHarvey betas ran OK, as did 8.05 NOELIA betas.
XP32, GTX 570 stock, 314.22, 7.0.64

Profile MJH
Message 32709 - Posted: 4 Sep 2013 | 18:06:57 UTC - in response to Message 32707.

Just killing off the remaining beta WUs now.

MJH

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,648,138,594
RAC: 9,965,763
Message 32716 - Posted: 4 Sep 2013 | 21:40:58 UTC - in response to Message 32709.

Just killing off the remaining beta WUs now.

MJH

I had a beta WU from the TEST18 series; my host reported it successfully, and it didn't receive an abort request from the GPUGrid scheduler.
Is there anything we should do? (for example manually abort all beta tasks, including NOELIA_KLEBEbeta tasks?)

Bedrich Hajek
Message 32718 - Posted: 5 Sep 2013 | 1:51:46 UTC - in response to Message 32709.
Last modified: 5 Sep 2013 | 1:54:36 UTC

I guess we did enough testing for now, and we can forget beta versions 8.05 to 8.10. Here is the proof; look at the finishing times.

Versions 8.05 & 8.10

7245172 4749995 3 Sep 2013 | 23:26:39 UTC 4 Sep 2013 | 19:10:37 UTC Completed and validated 68,637.53 21,605.83 142,800.00 ACEMD beta version v8.10 (cuda55)

7245072 4732574 3 Sep 2013 | 22:31:54 UTC 4 Sep 2013 | 20:20:27 UTC Completed and validated 75,759.22 21,670.35 142,800.00 ACEMD beta version v8.05 (cuda42)

Versus version 8.11

7247558 4731877 4 Sep 2013 | 13:32:52 UTC 5 Sep 2013 | 1:24:54 UTC Completed and validated 35,208.78 7,022.73 142,800.00 ACEMD beta version v8.11 (cuda55)

7247095 4751418 4 Sep 2013 | 10:00:08 UTC 5 Sep 2013 | 1:03:53 UTC Completed and validated 34,651.15 7,074.61 142,800.00 ACEMD beta version v8.11 (cuda55)


Not to mention the downclocking and low GPU usage that happened in Windows 7!

Profile MJH
Message 32722 - Posted: 5 Sep 2013 | 11:02:48 UTC

The "simulation unstable" (err -97) failure mode can be quite painful in terms of lost credit. In some circumstances this can be a recoverable error, so aborting the WU is unnecessarily wasteful.

There'll be a new beta out in a few hours putting this recovery into practice. It will also be accompanied by a new batch of beta WUs, MJHARVEY-CRASHY.

If you have been encountering this error a lot please start taking these WUs - I need to see err -97 failures, and lots of them.

MJH

juan BFP
Message 32724 - Posted: 5 Sep 2013 | 12:15:05 UTC
Last modified: 5 Sep 2013 | 12:20:19 UTC

Sorry if I put this in the wrong thread.

It may be just curiosity, but could somebody explain why these WUs received such different credits, on the same host, when they are the same kind of WU (NOELIA)?

WU 1 - http://www.gpugrid.net/result.php?resultid=7238706

WU 2 - http://www.gpugrid.net/result.php?resultid=7247939

WU1 is cuda42 and WU2 is cuda55; WU1 ran for less time than WU2 and received about 20% more credit. Both WUs were reported within the 24-hour limit.

Profile Retvari Zoltan
Message 32725 - Posted: 5 Sep 2013 | 12:20:44 UTC - in response to Message 32724.

It may be just curiosity, but could somebody explain why these WUs received such different credits, if they used about the same GPU/CPU time on similar hosts and are the same kind of WU (NOELIA)?


These workunits are from the same scientist (Noelia), but they are not in the same batch.

The first workunit is a NOELIA_FRAG041p.
The second workunit is a NOELIA_INS1P.

juan BFP
Message 32726 - Posted: 5 Sep 2013 | 13:11:25 UTC - in response to Message 32725.
Last modified: 5 Sep 2013 | 13:15:48 UTC

It may be just curiosity, but could somebody explain why these WUs received such different credits, if they used about the same GPU/CPU time on similar hosts and are the same kind of WU (NOELIA)?


These workunits are from the same scientist (Noelia), but they are not in the same batch.

The first workunit is a NOELIA_FRAG041p.
The second workunit is a NOELIA_INS1P.

By that I understand that the credit "paid" is not related to the processing power used to crunch the WU; it is related to the WU's batch, where somebody decides the credit paid for that batch's WUs. That's different from most other BOINC projects, which is why it bugs my mind. Initially I expected the same number of credits for the same processing time on the same host (or approximately so).

Please understand me, I'm not questioning the method; I just wanted to find out why. That's OK now.

Thanks for the answer and happy crunching.

Profile Retvari Zoltan
Message 32727 - Posted: 5 Sep 2013 | 13:56:08 UTC - in response to Message 32726.

By that I understand that the credit "paid" is not related to the processing power used to crunch the WU; it is related to the WU's batch, where somebody decides the credit paid for that batch's WUs. That's different from most other BOINC projects, which is why it bugs my mind. Initially I expected the same number of credits for the same processing time on the same host (or approximately so).

Please understand me, I'm not questioning the method; I just wanted to find out why. That's OK now.

Thanks for the answer and happy crunching.

On reading your previous post a second time, it occurred to me that your problem could be that a shorter WU received more credit than a longer WU. Well, that's a paradox. It happens from time to time; later on you will get used to it. It's probably caused by the method used for approximating the processing power needed for the given WU (based on the complexity of the model and the number of steps).
The shorter (~30ksec) WU had 6.25 million steps, and received 180k credit. (6 credit/sec)
The longer (~31.4ksec) WU had 4.2 million steps, and received 148.5k credit. (4.73 credit/sec)
There is 27% difference between the credit/sec rate of the two workunits. It's significant, but not unusual.
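The arithmetic above can be checked directly (runtimes, step counts, and credits taken from the figures quoted in the post):

```python
# Credit-per-second rates for the two workunits quoted above.
short_runtime, short_credit = 30_000, 180_000   # ~30 ksec, 6.25 M steps
long_runtime, long_credit = 31_400, 148_500     # ~31.4 ksec, 4.2 M steps

rate_short = short_credit / short_runtime   # 6.0 credit/sec
rate_long = long_credit / long_runtime      # ~4.73 credit/sec

# Relative gap between the two rates, as a percentage (~27%).
diff_pct = (rate_short / rate_long - 1) * 100
```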

juan BFP
Message 32728 - Posted: 5 Sep 2013 | 14:09:42 UTC - in response to Message 32727.

On reading your previous post a second time, it occurred to me that your problem could be that a shorter WU received more credit than a longer WU. Well, that's a paradox... It's significant, but not unusual.

That's exactly what I mean: the paradox (less time, more credit; more time, less credit). But if it's normal and not a bug... then I'll crunch both. Thanks for your time and the explanations.


Profile MJH
Message 32729 - Posted: 5 Sep 2013 | 14:11:15 UTC

Ok chaps, there's a new beta 8.12.
This comes along with a bunch of WUs, MJHARVEY-CRASH1.

If you have suffered error -97 "simulation unstable" errors, please take some of these WUs.

The new app implements a recovery mechanism that should see unstable simulations automatically restarted from an earlier checkpoint. Recoveries should be accompanied by a message to the BOINC console, and in the stderr when the job is complete.

I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong.

MJH
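The recovery mechanism described above, restarting an unstable run from its last checkpoint rather than failing with err -97, can be sketched as follows. The function and exception names are illustrative only, not ACEMD's actual code:

```python
class SimulationUnstable(Exception):
    """Raised when the integrator produces non-finite energies (cf. the
    1.#QNB values in the failed-WU log earlier in the thread)."""

def run_with_recovery(simulate, restore_checkpoint, max_retries=3):
    """Run `simulate` from the last checkpoint, retrying on instability.

    simulate(start_step) -> final step; raises SimulationUnstable on blow-up.
    restore_checkpoint() -> step number of the last saved checkpoint.
    """
    retries = 0
    while True:
        start = restore_checkpoint()
        try:
            return simulate(start)
        except SimulationUnstable:
            retries += 1
            if retries > max_retries:
                raise  # give up: report err -97 to the client
            print(f"# Attempting restart (step {start})")
```

Each retry is announced with a restart message, matching the behaviour MJH describes of logging recoveries to the BOINC console and stderr.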

Richard Haselgrove
Message 32730 - Posted: 5 Sep 2013 | 14:54:14 UTC - in response to Message 32729.

I have had to update the BOINC client library to implement this, so expect everything to go hilariously wrong.

ROFL - now that's a good way to attract testers! We have been duly warned, and I'm off to try and download some now.

I've seen a few exits with 'error -97', but not any great number. If I get any CRASH1 tasks now, they will run on device 0 - the hotter of my two GTX 670s - hopefully that will generate some material for you to work with.

Jacob Klein
Message 32731 - Posted: 5 Sep 2013 | 14:57:52 UTC - in response to Message 32729.
Last modified: 5 Sep 2013 | 15:00:34 UTC

I grabbed 4 of them too. It'll be a couple hours before my 2 GPUs can begin work on them.

Note: It looks like the 8.11 app "floods" the stderr.txt file with tons of lines of GPU temperature readings. This makes it impossible for me to see all the "GPU start blocks" on the results page.

Is there any way to either not flood it with temps, or maybe put continuous temp readings on a single line?

Basically, if possible, for any of my completed tasks, I'd prefer to see ALL of the blocks that looks like this:

# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:09:00.0
# Device clock : 1124MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r325_00 : 32680

... but, instead, all I'd get to see in the "truncated" web result is:

# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C
# GPU 0 Current Temp: 67 C
# GPU 1 Current Temp: 71 C
# GPU 2 Current Temp: 80 C

Richard Haselgrove
Message 32732 - Posted: 5 Sep 2013 | 15:07:47 UTC - in response to Message 32731.

I hadn't reset my DCF, so I only got one job, and it started immediately in high priority - task 7250880. Initial indications are that it will run for roughly 2 hours, if it doesn't spend too much time rewinding.

@ MJH - those temperature readings.

BOINC has a limit on the amount of text it will return via stderr - 64KB, IIRC. Depending on the client version in use, you might get the first 64KB (with that startup block Jacob wanted), or the last 64KB (which is more likely to contain a crash dump analysis, of interest to debuggers). We could look up which version does what, if you wish.

Of course, if you could shrink the bit in the middle, you might be able to fit both ends into 64KB.
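Fitting both ends of stderr into the cap is a classic head-and-tail truncation; a minimal sketch (the 64 KB figure is from the post above; the marker text is made up):

```python
MAX_STDERR = 64 * 1024  # BOINC's stderr cap, per the post above

def truncate_middle(text, limit=MAX_STDERR, marker="\n...[truncated]...\n"):
    """Keep the head (startup blocks) and the tail (crash dumps),
    dropping the repetitive middle so both ends fit under the limit."""
    if len(text) <= limit:
        return text
    keep = (limit - len(marker)) // 2
    return text[:keep] + marker + text[-keep:]
```

This preserves both the GPU identification blocks at the start and any crash-dump analysis at the end, at the cost of the middle of the log.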

Profile MJH
Message 32733 - Posted: 5 Sep 2013 | 15:10:38 UTC - in response to Message 32732.

The next version will emit temperatures only when they change.

MJH
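Emitting a reading only when it changes is a one-line filter per sensor; a minimal sketch of the idea (sample values taken from the flooded log above):

```python
def changes_only(readings):
    """Yield (gpu, temp) pairs only when a GPU's temperature differs
    from the last value reported for that GPU."""
    last = {}
    for gpu, temp in readings:
        if last.get(gpu) != temp:
            last[gpu] = temp
            yield gpu, temp

# Repeated identical readings collapse to one line per change.
samples = [(0, 67), (1, 71), (0, 67), (1, 71), (0, 68), (1, 71)]
```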

Jacob Klein
Message 32734 - Posted: 5 Sep 2013 | 15:21:05 UTC - in response to Message 32733.
Last modified: 5 Sep 2013 | 15:25:39 UTC

Just to clarify... Most of my GPUGrid tasks get "started" about 5-10 times, as I'm suspending the GPU often (for exclusive apps), and restarting the machine often (troubleshooting nVidia driver problems).

What I'm MOST interested in, is seeing the "GPU identification block" for every restart. So, if it was restarted 10 times, I expect to see 10 blocks, without truncation, in the web result.

Hopefully that's not too much to ask.
Thanks.

Profile MJH
Message 32736 - Posted: 5 Sep 2013 | 15:50:32 UTC - in response to Message 32734.

8.13: reduces temperature verbosity, and traps and recovers from access violations.

MJH

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32738 - Posted: 5 Sep 2013 | 17:47:26 UTC - in response to Message 32732.
Last modified: 5 Sep 2013 | 18:03:23 UTC

task 7250880.

I think we got what you were hoping for:

# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 2561000)
# GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55]

Edit - on the other hand, task 7250828 wasn't so lucky. And I can't see any special messages in the BOINC event log either: these are all ones I would have expected to see anyway.

05/09/2013 18:52:21 | GPUGRID | Finished download of 147-MJHARVEY_CRASH1-0-xsc_file
05/09/2013 18:52:21 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:21 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:21 | SETI@home | [coproc] NVIDIA device 0 already assigned: task 29mr08ag.26459.13160.3.12.250_0
05/09/2013 18:52:21 | SETI@home | [coproc] NVIDIA device 0 already assigned: task 29mr08ag.26459.13160.3.12.74_0
05/09/2013 18:52:21 | SETI@home | [cpu_sched] Preempting 29mr08ag.26459.13160.3.12.74_0 (removed from memory)
05/09/2013 18:52:21 | SETI@home | [cpu_sched] Preempting 29mr08ag.26459.13160.3.12.250_0 (removed from memory)
05/09/2013 18:52:21 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:21 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:22 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:22 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:22 | GPUGRID | Restarting task 119-MJHARVEY_CRASH1-0-25-RND2510_0 using acemdbeta version 812 (cuda55) in slot 10
05/09/2013 18:52:23 | GPUGRID | [sched_op] Deferring communication for 00:01:36
05/09/2013 18:52:23 | GPUGRID | [sched_op] Reason: Unrecoverable error for task 119-MJHARVEY_CRASH1-0-25-RND2510_0
05/09/2013 18:52:23 | GPUGRID | Computation for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 finished
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_1 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_2 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | Output file 119-MJHARVEY_CRASH1-0-25-RND2510_0_3 for task 119-MJHARVEY_CRASH1-0-25-RND2510_0 absent
05/09/2013 18:52:23 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:23 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 147-MJHARVEY_CRASH1-0-25-RND2539_0
05/09/2013 18:52:34 | GPUGRID | [coproc] NVIDIA instance 1: confirming for 170xlow-NOELIA_INS1P-0-12-RND0803_1
05/09/2013 18:52:34 | GPUGRID | [coproc] Assigning 0.990000 of NVIDIA free instance 0 to 147-MJHARVEY_CRASH1-0-25-RND2539_0
05/09/2013 18:52:34 | GPUGRID | Starting task 147-MJHARVEY_CRASH1-0-25-RND2539_0 using acemdbeta version 813 (cuda55) in slot 9
05/09/2013 18:52:35 | GPUGRID | Started upload of 119-MJHARVEY_CRASH1-0-25-RND2510_0_0

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32739 - Posted: 5 Sep 2013 | 17:59:11 UTC - in response to Message 32738.

Grand. Worked as designed in both cases - in the first it was able to restart and continue; in the second the restart led to immediate failure, so it correctly aborted rather than getting stuck in a loop.

MJH

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,577,974
RAC: 453,655
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32740 - Posted: 5 Sep 2013 | 20:19:33 UTC - in response to Message 32739.

That's some really nice progress!

Regarding the temperature output: you could keep track of min, max and mean temperature and only output these at the end of a WU or in case of a crash / instability event. In the latter case the current value would also be of interest.

Regarding the crash-recovery: let's assume I'd be pushing my GPU too far and produce calculation errors occasionally. In earlier apps I'd see occasional crashes for WUs which others return just fine. That's a clear indicator of something going wrong, and relatively easy to spot by watching the number of errors and invalids in the host stats.

If, however, the new app is used with the same host and the recovery works well, then I might not notice the problem at all. The WU runtimes would suffer a bit due to the restarts, but apart from that I wouldn't see any difference from a regular host, until I browse the actual result outputs, right?

I think the recovery is a great feature which will hopefully save us from a lot of lost computation time. But it would be even better if we'd have some easy indicator of it being needed. Not sure what this could be, though, without changing the BOINC server software.

MrS
____________
Scanning for our furry friends since Jan 2002

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32741 - Posted: 5 Sep 2013 | 20:27:52 UTC - in response to Message 32740.
Last modified: 5 Sep 2013 | 20:30:19 UTC

Right -- I think this begs the question: Is it normal or possible for the program to become unstable due to a problem in the program? ie: If the hardware isn't overclocked and is calculating perfectly, is it normal or possible to encounter a recoverable program instability?

If so:
Then I can see why you're adding recovery, though successful results could mask programming errors (like memory leaks, etc.)

If not:
Then... all you're doing is masking hardware calculation errors I think. Which might be bad, because successful results could mask erroneous data.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32742 - Posted: 5 Sep 2013 | 20:44:14 UTC - in response to Message 32736.

8.13: reduce temperature verbosity and trap access violations and recover.

MJH

First v8.13 returned - task 7250857

Stderr compact enough to fit inside the 64KB limit, but no restart event to bulk it up.

I note the workunits still have the old 5 PetaFpop estimate:

<name>147-MJHARVEY_CRASH1-0-25-RND2539</name>
<app_name>acemdbeta</app_name>
<version_num>813</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>

That has led to an APR of 546 for host 132158. That's not a problem while the jobs are relatively long (nearer 2.5 hours than my initial guess of 2 hours) - that APR is high, but nowhere near high enough to cause any problems, so no immediate remedial action is needed. But it would be good to get into the habit of dialling it down as a matter of routine.
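As a sanity check (my own back-of-envelope sketch, not BOINC code - the client actually averages the rate over recent validated tasks), an APR of ~546 is just what the 5 PetaFpop estimate implies for a ~2.5 hour task:

```python
# BOINC's APR (average processing rate, in GFLOPS) is essentially the
# workunit's flops estimate divided by the elapsed time.
rsc_fpops_est = 5_000_000_000_000_000  # 5 PetaFpops, from the workunit XML

elapsed_s = 9158          # ~2.5 hours; illustrative value, not a measurement
apr_gflops = rsc_fpops_est / elapsed_s / 1e9
print(round(apr_gflops))  # ~546, matching the APR shown for the host
```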

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32744 - Posted: 5 Sep 2013 | 21:31:02 UTC - in response to Message 32741.


Right -- I think this begs the question: Is it normal or possible for the program to become unstable due to a problem in the program?


This fix addresses two specific failure modes: 1) an access violation and 2) transient instability of the simulation. As you know, the application regularly checkpoints so that it can resume if suspended. We use that mechanism to simply restart the simulation at the last known good point when a failure occurs. (If no checkpoint state exists, or it is itself corrupted, the WU will be aborted as before.)

What we are not doing is ignoring the error and ploughing on regardless (which wouldn't be possible anyway, because the sim is dead by that point). Because of the nature of our simulations and analytical methods, we can treat any transient error that perturbs but does not kill the simulation as an ordinary source of random experimental error.
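As an illustration only - a toy Python sketch of the recover-or-abort policy described above, with invented names and numbers; the real application is native code:

```python
# Toy model: trap a transient failure, rewind to the last good
# checkpoint and continue; if restarting leads straight back to
# failure, abort the WU as before.
class SimUnstable(Exception):
    pass

def run_wu(total_steps, fail_at=None, max_restarts=3):
    checkpoint = 0          # last known good step
    restarts = 0
    step = 0
    while step < total_steps:
        try:
            # stand-in for one simulation step; fails once at fail_at
            if step == fail_at and restarts == 0:
                raise SimUnstable("simulation has become unstable")
            step += 1
            if step % 100 == 0:
                checkpoint = step       # periodic checkpoint
        except SimUnstable:
            restarts += 1
            if restarts > max_restarts: # stuck in a restart loop
                return ("aborted", step, restarts)
            step = checkpoint           # resume from last good point
    return ("done", step, restarts)
```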

Concerning the problems themselves: the first is not so important in absolute terms, as only a few users are affected by it (though those that are suffer repeatedly), but the workaround is similar to that for 2), so it was no effort to include a fix for it also. This problem is almost certainly some peculiarity of their systems, whether an interaction with some other running software, a specific version of some DLLs, or the colour of the gonk sat on top of the monitor.

The second problem is our current major failure mode, largely because it is effectively a catch-all for problems that interfered somehow with the correct operation of the GPU, but which did not kill the application process outright.

When this type of error occurs on reliable computers of known quality and configuration I know that it strongly indicates either hardware or driver problems (and the boundary between those two blurs at times). In GPUGRID, where every client is unique, root-causing these failures is exceedingly difficult and can only really be approached statistically.[1]

The problem is compounded by having an app that really exercises the whole system (not just the CPU, but the GPU, PCIe and a whole heap of OS drivers). The opportunity for unexpected and unfavourable interactions with other system activities is increased, and the tractability of debugging decreased.

To summarise: my goal here is not to eliminate WU errors entirely (which is practically impossible), but to
1) mitigate them to a sufficiently low level that they do not impede our use of GPUGRID (or, put another way, maximise the effective throughput of an inherently unreliable system)
2) minimise the amount of wastage of your volunteered resources, in terms of lost contribution from failed partially-complete WUs

Hope that explains the situation.

MJH

[1] For a great example of this, see how Microsoft manages WER bug reports.
http://research.microsoft.com/pubs/81176/sosp153-glerum-web.pdf

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1844
Credit: 10,648,138,594
RAC: 9,965,763
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32758 - Posted: 5 Sep 2013 | 23:40:11 UTC - in response to Message 32744.

To summarise: my goal here is not to eliminate WU errors entirely (which is practically impossible), but to
1) mitigate them to a sufficiently low level that they do not impede our use of GPUGRID (or, put another way, maximise the effective throughput of an inherently unreliable system)
2) minimise the amount of wastage of your volunteered resources, in terms of lost contribution from failed partially-complete WUs

These are really nice goals, which meet all crunchers' expectations (or dreams if you like).
I have two down-to-earth suggestions to help you achieve your goals:
1. for the server side: do not send long workunits to unreliable, or slow hosts
2. for the client side: now that you can monitor the GPU temperature, you should throttle the client if the GPU it's running on becomes too hot (for example above 80°C), and a warning should be present in the stderr.out
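For what it's worth, the client-side decision could be as simple as something like this - a hypothetical sketch of suggestion 2, not anything in acemd; the 80°C limit is the value from the post and the pause length is invented:

```python
import sys

TEMP_LIMIT_C = 80   # threshold suggested above
PAUSE_S = 5         # assumed pause length, purely illustrative

_warned = False

def throttle_delay(temp_c):
    """Seconds to sleep before launching the next chunk of GPU work."""
    global _warned
    if temp_c <= TEMP_LIMIT_C:
        return 0
    if not _warned:                      # warn once; stderr is size-limited
        print(f"# WARNING: GPU at {temp_c} C, throttling", file=sys.stderr)
        _warned = True
    return PAUSE_S
```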

juan BFP
Send message
Joined: 11 Dec 11
Posts: 21
Credit: 145,256,218
RAC: 19,356
Level
Cys
Scientific publications
watwatwatwatwatwatwatwat
Message 32764 - Posted: 6 Sep 2013 | 2:02:01 UTC
Last modified: 6 Sep 2013 | 2:02:25 UTC

I don´t know if this is what you are looking for: I was forced to abort this WU after >10 hr of crunching with only 35% done (normal WU crunching total time is 8-9 hrs)

http://www.gpugrid.net/result.php?resultid=7250850
____________

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32768 - Posted: 6 Sep 2013 | 8:31:12 UTC - in response to Message 32758.
Last modified: 15 Sep 2013 | 22:00:33 UTC

for the server side: do not send long workunits to unreliable, or slow hosts

Indeed, this would be of great benefit to the researchers and entry-level crunchers alike.
A GT 620 isn't fast enough to crunch Long WU's, so ideally it wouldn't be offered any. Even if there is no other work, there would still be no point in sending Long work to a GF 605, GT 610, GT 620, GT 625...

For example, http://www.gpugrid.net/forum_thread.php?id=3463&nowrap=true#32750

Perhaps this could be done most easily based on the GFLOPS? If so, I suggest a cutoff point of 595, as this would still allow the GTS 450, GTX 550 Ti, GTX 560M, GT 640, and GT 745M to run long WU's, should people choose to (especially via 'If no work for selected applications is available, accept work from other applications?').
You might want to note that some of the smaller cards have seriously low bandwidth, so maybe that should be factored in too.

Is the app able to detect a downclock (say from 656MHz to 402MHz on a GF 400)? If so, could a message be sent to the user, either through BOINC or email, alerting them to the downclock? Ditto if, as Zoltan suggested, the temp goes over say 80°C, so the user can increase their fan speeds (or clean the dust out)? I like pro-active.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32769 - Posted: 6 Sep 2013 | 8:43:44 UTC - in response to Message 32736.
Last modified: 6 Sep 2013 | 8:48:00 UTC

8.13: reduce temperature verbosity and trap access violations and recover.

MJH

Hi Matt,

My 770 had a MJHARVEY_CRASH ACEMD beta 8.13 overnight. When I looked this morning it had done little more than 96%, but it was not running anymore. I saw that as I monitor the GPU behaviour: the temp was lower, the fan was lower and GPU use was zero. I waited about 10 minutes to see if it recovered, but no. So I suspended it, and another one started. I suspended that one and resumed the one that stood still. It finished okay. The log afterwards shows that it was restarted (a second block with info about the card), but there is no line in the output saying that I suspended/resumed it, and no reason why it stopped running. Here it is.
As you can see no error message.

Edit: I don´t know how long it stopped running for, but the time between sent and received is quite long. Run time and CPU time are almost the same. I have put a line in cc_config to report finished WU´s immediately (especially for Rosetta).
____________
Greetings from TJ

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 340
Credit: 3,821,859,209
RAC: 942,358
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32772 - Posted: 6 Sep 2013 | 10:21:40 UTC

My windows 7 computer was running 4 of these MJHARVEY_CRASH beta units on 2 690 video cards. Boinc manager decided to run a benchmark. When the units resumed, their status was listed as running, but they were not actually running (no progress was being made, and the video cards were cooling off, not running anything). I had to suspend all the units and restart them one by one in order to get them going. They all finished successfully.


9/5/2013 9:53:15 PM | | Running CPU benchmarks
9/5/2013 9:53:15 PM | | Suspending computation - CPU benchmarks in progress
9/5/2013 9:53:46 PM | | Benchmark results:
9/5/2013 9:53:46 PM | | Number of CPUs: 5
9/5/2013 9:53:46 PM | | 2609 floating point MIPS (Whetstone) per CPU
9/5/2013 9:53:46 PM | | 8953 integer MIPS (Dhrystone) per CPU
9/5/2013 9:53:47 PM | | Resuming computation
9/5/2013 9:55:19 PM | Einstein@Home | task LATeah0041U_720.0_400180_0.0_3 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_445_0 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_443_1 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_442_0 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_441_0 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_440_1 suspended by user
9/5/2013 9:55:19 PM | Einstein@Home | task h1_0579.05_S6Directed__S6CasAf40a_579.2Hz_447_1 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 156-MJHARVEY_CRASH1-2-25-RND0162_0 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 179-MJHARVEY_CRASH1-0-25-RND6235_1 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 165-MJHARVEY_CRASH1-1-25-RND5861_0 suspended by user
9/5/2013 9:55:34 PM | GPUGRID | task 101-MJHARVEY_CRASH2-1-25-RND8176_1 suspended by user
9/5/2013 9:56:10 PM | GPUGRID | task 156-MJHARVEY_CRASH1-2-25-RND0162_0 resumed by user
9/5/2013 9:56:11 PM | GPUGRID | Restarting task 156-MJHARVEY_CRASH1-2-25-RND0162_0 using acemdbeta version 813 (cuda55) in slot 5
9/5/2013 9:56:19 PM | GPUGRID | task 179-MJHARVEY_CRASH1-0-25-RND6235_1 resumed by user
9/5/2013 9:56:20 PM | GPUGRID | Restarting task 179-MJHARVEY_CRASH1-0-25-RND6235_1 using acemdbeta version 813 (cuda55) in slot 1
9/5/2013 9:56:26 PM | GPUGRID | task 165-MJHARVEY_CRASH1-1-25-RND5861_0 resumed by user
9/5/2013 9:56:27 PM | GPUGRID | Restarting task 165-MJHARVEY_CRASH1-1-25-RND5861_0 using acemdbeta version 813 (cuda55) in slot 2
9/5/2013 9:56:35 PM | GPUGRID | task 101-MJHARVEY_CRASH2-1-25-RND8176_1 resumed by user
9/5/2013 9:56:36 PM | GPUGRID | Restarting task 101-MJHARVEY_CRASH2-1-25-RND8176_1 using acemdbeta version 813 (cuda42) in slot 4
9/5/2013 9:56:49 PM | Einstein@Home | task LATeah0041U_720.0_400180_0.0_3 resumed by user
9/5/2013 9:56:50 PM | Einstein@Home | Restarting task LATeah0041U_720.0_400180_0.0_3 using hsgamma_FGRP2 version 112 in slot 3
9/5/2013 9:56:51 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_445_0 resumed by user
9/5/2013 9:56:51 PM | Einstein@Home | Restarting task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_445_0 using einstein_S6CasA version 105 (SSE2) in slot 0
9/5/2013 9:56:56 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_443_1 resumed by user
9/5/2013 9:56:58 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_442_0 resumed by user
9/5/2013 9:56:59 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_441_0 resumed by user
9/5/2013 9:57:01 PM | Einstein@Home | task h1_0579.00_S6Directed__S6CasAf40a_579.15Hz_440_1 resumed by user
9/5/2013 9:57:03 PM | Einstein@Home | task h1_0579.05_S6Directed__S6CasAf40a_579.2Hz_447_1 resumed by user




101-MJHARVEY_CRASH2-1-25-RND8176_1 4755731 6 Sep 2013 | 1:22:53 UTC 6 Sep 2013 | 4:20:26 UTC Completed and validated 9,896.12 9,203.11 18,750.00 ACEMD beta version v8.13 (cuda42)
165-MJHARVEY_CRASH1-1-25-RND5861_0 4755809 6 Sep 2013 | 0:50:33 UTC 6 Sep 2013 | 3:49:07 UTC Completed and validated 9,885.03 8,951.40 18,750.00 ACEMD beta version v8.13 (cuda55)
179-MJHARVEY_CRASH1-0-25-RND6235_1 4754443 6 Sep 2013 | 0:11:22 UTC 6 Sep 2013 | 3:11:34 UTC Completed and validated 10,086.93 8,842.96 18,750.00 ACEMD beta version v8.13 (cuda55)
156-MJHARVEY_CRASH1-2-25-RND0162_0 4755706 5 Sep 2013 | 23:43:22 UTC 6 Sep 2013 | 2:51:23 UTC Completed and validated 10,496.17 8,977.87 18,750.00 ACEMD beta version v8.13 (cuda55)



Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32773 - Posted: 6 Sep 2013 | 10:27:13 UTC - in response to Message 32769.

MJHARVEY_CRASH2 WU's are saying they have a Virtual memory size of 16.22GB on Linux, but on Windows it's 253MB.

155-MJHARVEY_CRASH2-2-25-RND0389_0 4756641 6 Sep 2013 | 9:53:18 UTC 11 Sep 2013 | 9:53:18 UTC In progress --- --- --- ACEMD beta version v8.00 (cuda55)
141-MJHARVEY_CRASH2-2-25-RND9742_0 4756635 6 Sep 2013 | 10:02:36 UTC 11 Sep 2013 | 10:02:36 UTC In progress --- --- --- ACEMD beta version v8.00 (cuda55)
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32774 - Posted: 6 Sep 2013 | 10:31:21 UTC - in response to Message 32773.


Virtual memory size of 16.22GB


That's normal, and nothing to worry about.

Matt

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32775 - Posted: 6 Sep 2013 | 10:32:31 UTC - in response to Message 32768.


for the server side: do not send long workunits to unreliable, or slow hosts


Understanding the crazy in the server is also on the Todo list.

MJH

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32776 - Posted: 6 Sep 2013 | 11:00:41 UTC - in response to Message 32772.

Thanks for your post Bedrich.
This made me look into the BOINC log as well, and indeed BOINC did a benchmark; after that no new work was requested for GPUGRID as work was still in progress. But actually it did nothing until manual intervention. Can you have a look at this please, Matt?
When no one is at the rigs, this will stop them from executing and is thus no help for your science project.
____________
Greetings from TJ

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32777 - Posted: 6 Sep 2013 | 11:26:07 UTC - in response to Message 32776.


This made me look into BOINC log as well and indeed BOINC did a benchmark and after that no new work was requested for GPUGRID as work was still in progress. But actually it did nothing until manual intervention. Can you have a look at this please Matt?


Sounds like a BOINC client problem. Why is it running benchmarks? Doesn't it only do that once?

MJH

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32778 - Posted: 6 Sep 2013 | 11:57:45 UTC - in response to Message 32777.
Last modified: 6 Sep 2013 | 11:59:28 UTC


This made me look into BOINC log as well and indeed BOINC did a benchmark and after that no new work was requested for GPUGRID as work was still in progress. But actually it did nothing until manual intervention. Can you have a look at this please Matt?


Sounds like a BOINC client problem. Why is it running benchmarks? Doesn't it only do that once?

MJH

Yes, indeed it does when BOINC starts, but also once in a while. When a system runs 24/7 it will do it "regularly": it suspends all work and then resumes again. But that didn´t work with MJHARVEY_CRASH.
So every rig that runs 24/7/365 will have this issue now; it wasn´t there before 8.13.
____________
Greetings from TJ

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32780 - Posted: 6 Sep 2013 | 12:12:59 UTC - in response to Message 32778.

After the benchmark, exactly what was the state of the machine?

Did the acemd processes exist but were suspended/not running, or were they not there at all?

What messages was the boinc gui showing?

MJH

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32786 - Posted: 6 Sep 2013 | 12:44:44 UTC - in response to Message 32780.
Last modified: 6 Sep 2013 | 12:47:54 UTC

I just did it as a test (it's very easy to manually run benchmarks you know, just click Advanced -> Run CPU benchmarks!)

Anyway, all the tasks (my 3 GPU tasks, my 6 CPU tasks) got suspended, benchmarks ran, and they all resumed... except the 8.13 MJHARVEY_CRASH1 task wasn't truly running anymore.

Process Explorer still showed the acemd.813-55.exe process, but its CPU usage was nearly 0 (not the normal 1-3% I usually see, but "<0.01", i.e. not using ANY CPU), and Precision-X showed that the GPU usage was 0. BOINC says the task is still Running, and Elapsed time is still going up. I presume it will sit there doing nothing until it hits the limit imposed by <rsc_fpops_bound> with maximum-time-exceeded.

Note: The 8.03 Long-run task that was also in the mix here, handled the CPU benchmarks just fine, and has resumed appropriately and is running.

So, ironically, after all those suspend fixes, something in the new app isn't "resuming" right anymore. At least it's easily reproducible - just click "Run CPU benchmarks" to see for yourself!

Hopefully you can fix it, Matt -- we're SO CLOSE to having this thing working much more smoothly!!

Profile Zarck
Send message
Joined: 16 Aug 08
Posts: 135
Credit: 238,817,093
RAC: 9,376
Level
Leu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32787 - Posted: 6 Sep 2013 | 12:45:47 UTC - in response to Message 32786.
Last modified: 6 Sep 2013 | 13:39:44 UTC

My PC crashed and rebooted.

I'll do a test with seti beta for my Titan.

See you later.

@+
*_*
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32788 - Posted: 6 Sep 2013 | 12:47:48 UTC - in response to Message 32780.

Benchmarks can be invoked at any time for testing - they are listed as 'Run CPU benchmarks' on the Advanced menu.

For newer BOINC clients, they are one of the very few - possibly only - times when a GPU task is 'suspended' but kept in VRAM, and can supposedly be 'resumed'. All other task exits, temporary or to allow another project to take a time-slice, require VRAM to be cleared and a full 'restart' done from the checkpoint file.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32790 - Posted: 6 Sep 2013 | 12:58:31 UTC - in response to Message 32788.

Thanks Richard,

By coincidence I was just looking into the suspend/resume mechanism. I'm going to put out a new beta shortly that should allow more graceful termination, and also make suspend/resume to memory safer.

MJH

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32791 - Posted: 6 Sep 2013 | 13:01:01 UTC - in response to Message 32780.

It was as Jacob explained in post 32786. No error message, and the ACEMD app didn't stop. BOINC manager said running and the time kept ticking, but no progress. It seems to have been in that state for almost 2 hours in my case.
Nice test from Jacob as well, showing that it happens from 8.13 onwards.
____________
Greetings from TJ

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32799 - Posted: 6 Sep 2013 | 15:09:32 UTC

My most recent one (task 7253807) shows a crash and recovery from a

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0

which is pretty impressive.

In case you're puzzled by the high frequency of restarts at the beginning of the task: at the moment, I'm restricting BOINC to running only one GPUGrid task at a time ('<max_concurrent>'). If the running task suffers a failure, the next in line gets called forward, and runs for a few seconds. But when the original task is ready to resume, 'high priority' (EDF) forces it to run immediately, and the second task to be swapped out. So, a rather stuttering start, but not the fault of the application.

The previous task (7253208) shows a number of

# The simulation has become unstable. Terminating to avoid lock-up (1)

which account for the false starts.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32800 - Posted: 6 Sep 2013 | 15:46:58 UTC
Last modified: 6 Sep 2013 | 15:47:14 UTC

The 8.13 app is still spitting out too much temperature data.

On this task, I can't see which GPU it started on :(
http://www.gpugrid.net/result.php?resultid=7253930

Are the temperature readings that important?
If so, then maybe only output temp changes for the currently-running GPU, and even then, condense the text to just say "67*C" instead of "# GPU 0 Current Temp: 67 C" each line? It may even be more ideal to not have each reading on its own line; instead, maybe have a single long line that has the temperature fluctuations for the current GPU?

I just want to be able to always see what GPU it started on, and which GPUs it was restarted on. The temps are irrelevant to me, but if you want/need them, please find a way to consolidate further.
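Something along these lines would keep stderr small - a hypothetical sketch, not the actual app code; names and output format are invented to resemble the existing stderr lines:

```python
# Remember each GPU's highest reported temperature and emit a line only
# when a new per-GPU maximum is seen, so stderr stays well under 64 KB.
_max_seen = {}

def report_temp(gpu, temp_c):
    """Return a log line for a new per-GPU maximum, else None."""
    if temp_c > _max_seen.get(gpu, -1):
        _max_seen[gpu] = temp_c
        return f"# GPU {gpu} : {temp_c}C"
    return None
```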

Thanks,
Jacob

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32801 - Posted: 6 Sep 2013 | 16:03:40 UTC

New beta 8.14. Suspend and resume, of either flavour, should now be working without problems.

MJH

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32802 - Posted: 6 Sep 2013 | 16:13:43 UTC - in response to Message 32800.


The 8.13 app is still spitting out too much temperature data.


Only maxima printed now

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32803 - Posted: 6 Sep 2013 | 16:15:24 UTC - in response to Message 32801.
Last modified: 6 Sep 2013 | 16:16:56 UTC

8.14 appears to be resuming appropriately from running CPU benchmarks.

And I think you should keep the "event notifications" that are in the stderr.txt, they are very very helpful.

# BOINC suspending at user request (thread suspend)
# BOINC resuming at user request (thread suspend)
# BOINC suspending at user request (exit)


Great job!

I also see you've done some work to condense the temp readings. Thanks for that.


The 8.13 app is still spitting out too much temperature data.


Only maxima printed now

If that means "Only printing a temperature reading if it has increased since the start of the run", then that is a GREAT compromise.
Do you think you need them for all GPUs? Or could you maybe just limit to the running GPU?

# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:08:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r325_00 : 32680
# GPU 0 : 67C
# GPU 1 : 66C
# GPU 2 : 76C
# GPU 1 : 67C
# GPU 2 : 77C

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32804 - Posted: 6 Sep 2013 | 16:17:56 UTC - in response to Message 32803.


Do you think you need them for all GPUs? Or could you maybe just limit to the running GPU?


The GPU numbering doesn't necessarily correspond to the numbering the rest of the app uses, so I'm going to leave them all in.

MJH

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32806 - Posted: 6 Sep 2013 | 16:19:18 UTC - in response to Message 32804.
Last modified: 6 Sep 2013 | 16:19:26 UTC

Thanks Matt.
The work you've done here, especially the suspend/resume work, will greatly improve the stability of people's machines, and the ability to diagnose problems. It is very much appreciated!

Profile Zarck
Send message
Joined: 16 Aug 08
Posts: 135
Credit: 238,817,093
RAC: 9,376
Level
Leu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32807 - Posted: 6 Sep 2013 | 17:01:08 UTC - in response to Message 32806.
Last modified: 6 Sep 2013 | 17:01:42 UTC

Despite the GPUGRID "CRASH" test units, my machine keeps producing blue screens and rebooting. I need this machine for work, so I'm forced to stop GPUGRID and replace it with SETI Beta.

Sorry.

@+
*_*
____________

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32808 - Posted: 6 Sep 2013 | 17:05:18 UTC - in response to Message 32807.

Zarck, hello! Don't give up now - I've been watching your tasks and 8.13 has a fix especially for you!

MJH

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32810 - Posted: 6 Sep 2013 | 18:18:42 UTC

Ok folks - last call for feature/mod requests for the beta.
Next week I'm moving on to other things.

MJH

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32811 - Posted: 6 Sep 2013 | 18:33:25 UTC - in response to Message 32808.

Performed a CPU benchmark (with LAIM off). The WU running on the 8.13 app stopped and didn't resume, but the WU on the 8.14 app resumed normally (it also resumed with LAIM on).

The WU resumed on the 8.13 app when I exited Boinc (and running tasks) and reopened it (not that it's an issue any more with 8.14).

BTW, I've experienced this 'task not resuming' issue before, so it wasn't a new one, and benchmarks run periodically (just not often enough to have associated them with a task issue, especially when the tasks had plenty of other issues).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Send message
Joined: 28 Jul 12
Posts: 460
Credit: 1,130,761,180
RAC: 15,358
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 32812 - Posted: 6 Sep 2013 | 18:34:28 UTC - in response to Message 32810.

Ok folks - last call for feature/mod requests for the beta.

Since you asked. There have been a number of comments about monitoring temperature, which is good. But I have found that cards can crash due to overclocking while still running relatively cool (less than 70C for example). I don't know if BOINC allows you to monitor the actual GPU core speed, but if so that would be worthwhile to report in some form. I don't know that it is high priority for this beta, but maybe the next one.
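I don't know the app internals, but even a helper script polling nvidia-smi could spot a downclock. A rough sketch (the nominal clock table, field list, and tolerance are my own assumptions):

```python
import csv, io

# Hypothetical downclock check: poll something like
#   nvidia-smi --query-gpu=index,clocks.sm,temperature.gpu --format=csv,noheader,nounits
# and flag a card whose core clock has dropped well below its usual value -
# a common symptom after an overclock-induced driver recovery.
NOMINAL_MHZ = {0: 1097, 1: 1097}  # expected per-GPU clocks (example values)

def check_clocks(csv_text, tolerance=0.9):
    warnings = []
    for row in csv.reader(io.StringIO(csv_text)):
        idx, clock, temp = (int(field.strip()) for field in row)
        if clock < NOMINAL_MHZ.get(idx, clock) * tolerance:
            warnings.append(f"GPU {idx}: core clock {clock} MHz "
                            f"(expected ~{NOMINAL_MHZ[idx]} MHz), {temp}C")
    return warnings

# GPU 0 downclocked to 914 MHz while still running cool at 68C.
sample = "0, 914, 68\n1, 1097, 71"
print(check_clocks(sample))
```

Note that in this example the downclocked card is only at 68C, which is exactly the point: temperature alone wouldn't have flagged it.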

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32813 - Posted: 6 Sep 2013 | 18:37:49 UTC - in response to Message 32811.
Last modified: 6 Sep 2013 | 18:41:53 UTC


BTW. I've experienced this 'task not resuming' issue before, so it wasn't a new one, and benchmarks run periodically (just not often enough to have associate it with a task issue, especially when the tasks had plenty of other issues).


Unsurprising - it's an inevitable consequence of the way the BOINC client library (which we build into our application) goes about doing suspend/resume [1]. I've re-plumbed the whole thing entirely, using a much more reliable method.

MJH

[1] To paraphrase the old saying - 'Some people, when confronted with a problem, think "I'll use threads". Now they have two problems'.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32814 - Posted: 6 Sep 2013 | 19:06:53 UTC - in response to Message 32810.
Last modified: 6 Sep 2013 | 19:07:07 UTC

Ok folks - last call for feature/mod requests for the beta.
Next week I'm moving on to other things.

MJH

Can you make it print an ascii rainbow at the end of a successful task?

Seriously, though, can't think of much, except maybe
- Format the driver version to say 326.80 instead of 32680
- Add a timestamp with every start/restart block
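For the first item, the conversion is trivial if the raw value really is major*100 + minor (my assumption from the examples in the stderr):

```python
# Hypothetical formatter: turn NVIDIA's integer driver version (e.g. 32680)
# into the familiar dotted form (326.80), assuming major*100 + minor encoding.
def format_driver_version(raw):
    return f"{raw // 100}.{raw % 100:02d}"

print(format_driver_version(32680))  # → 326.80
print(format_driver_version(33140))  # → 331.40
```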

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 32875 - Posted: 10 Sep 2013 | 22:39:45 UTC
Last modified: 10 Sep 2013 | 22:54:41 UTC

There's a new batch of Beta WUs - "MJHARVEY-CRASHNPT". These test an important feature of the application that we've not been able to use much in the past because it seemed to be contributing towards crashes. The last series of CRASH units has given me good stats on the underlying error rates for control, so this batch should reveal whether there is in fact a bug with the feature.

Please report here, particularly if you have a failure mode unlike any you have seen with 8.14 and earlier CRASH units.

MJH

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32887 - Posted: 11 Sep 2013 | 13:37:37 UTC - in response to Message 32875.

First NPT processed with no errors at all - task 7269244. If I get any more, I'll try running them on the 'hot' GPU.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32947 - Posted: 14 Sep 2013 | 1:04:21 UTC
Last modified: 14 Sep 2013 | 2:03:07 UTC

I have not had any problems processing the "MJHARVEY-CRASHNPT" units on my stable machine that runs GPUGrid on my GTX 660 Ti and GTX 460.

:) I kinda wish the server would stop sending me beta units, but alas, I'm going to keep my settings set at "Give me any unit you think I should do" (aka: all apps checked). It just seems that lately it wants me to do beta!

Just wanted to report that it is running smoothly for me.

Profile Carlesa25
Avatar
Send message
Joined: 13 Nov 10
Posts: 324
Credit: 72,394,453
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32951 - Posted: 14 Sep 2013 | 14:00:08 UTC
Last modified: 14 Sep 2013 | 14:23:36 UTC

Hello: The Beta "102-MJHARVEY_CRASHNPT-7-25-RND3270_0" is about to finish without problems, and what I've noticed is different CPU usage behaviour, at least on Linux.

The four cores I have enabled in BOINC show an average load of 23-25% (with no other processes running), even though the task indicates the use of 1 CPU + 1 NVIDIA GPU. The task is clearly running multi-threaded on the CPU, even with app_config.xml set to 1 CPU and 1 GPU per task.

Note: Completed without problem.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,577,974
RAC: 453,655
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32974 - Posted: 15 Sep 2013 | 13:43:34 UTC

Matt, thanks for the explanations some weeks ago! Going back to my original question: is there a better way to communicate the number of task restarts than "when in doubt, take a close look at the task outputs"?

This is the best idea I have so far: it would be nice if the BOINC server software could be changed (1) to show additional "custom" columns in the task overview. Projects could configure what this custom column should actually display (2). If this was used for e.g. the number of task recoveries we could just browse the task lists for anomalies (which I think many of us are already doing anyway).

(1) I know this isn't your job. It's just a suggestion I'm throwing into the discussion, which could be implemented in the future if enough projects wanted it.

(2) I'm sure there would be some use of this to output trouble-shooting info, performance data, results (pulsars or aliens found etc.) or other new ideas.

Zoltan wrote:
2. for the client side: now that you can monitor the GPU temperature, you should throttle the client if the GPU it's running on became too hot (for example above 80°C, and a warning should be present in the stderr.out)

I don't think throttling by GPU-Grid itself would be a good idea. Titans and newer cards with Boost 2 are set for a target temperature of 80°C, which can be changed by the user. Older cards' fan control often targets "<90°C" under load. And GPU-Grid would only have one lever available: switching calculations on or off. That is not efficient at all if a GPU boosts into a high-voltage state (because its boost settings say that's OK), which then triggers an app-internal throttle, pausing computations for a moment.. only to run again into the inefficient high-performance state. In this case it would be much better if the user simply adjusted the temperature target to a good value, so that the card could choose a more efficient lower-voltage state which allows sustained load at the target temperature.

I agree, though, that a notification system could help users who're unfamiliar with these things. On the other hand, those users would probably not look into the stderr.out either.

SK suggests using e-mail or the BOINC notification system for this. I'd caution against overusing these - users could easily feel spammed if they read the warning but have reasons to ignore it, yet keep receiving it. Also the notifications are pushed out to all of a user's machines connected to the project (or could this be changed? I don't think it's intended for individual messages), which could be badgering. I'm already getting quite a few messages through this system repeatedly which.. ehm, I don't like getting. Let's leave it at that ;)

MJH wrote:
Unsurprising - it's an inevitable consequence of the way the BOINC client library (which we build into our application) goes about doing suspend -resume[1] I've re-plumbed the whole thing entirely, using a much more reliable method.

Sounds like an improvement which should find its way back into the main BOINC branch :)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 32979 - Posted: 15 Sep 2013 | 22:39:44 UTC - in response to Message 32974.
Last modified: 15 Sep 2013 | 22:41:16 UTC

Communication is a conundrum. Some want it, some don't. In the BOINC Announcements system (which isn't a per-user system) you can opt out: Tools, Options, Notice Reminder Interval (Never). You can also opt out of project messages (again not per-user, but the option would be good) or into PMs, so they go straight to email - which I like and would favour in the case of critical announcements (your card fails every WU because it has memory issues/is too hot/the clocks have dropped/a fan has failed...).

If we had a Personal Notices area, the server could post warning messages to alert users and make suggestions. An opt-out and auto-delete after x days would keep it controlled. Maybe in MyAccount, but ideally just a link from there to a new page (which could be linked from a warnings button in BOINC Manager, under project web pages). It's more of a web dev's area than an app dev's, and it would be nice to see such things added by the BOINC server and site devs, but Matt's rather capable and could easily add a web page and a button to BOINC Manager.

While most newbies wouldn't know to look in the BOINC logs, some would, and a message there (in bold red) would help many non-newbies too.

When advertising, you don't put up just one sign!
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Carlesa25
Avatar
Send message
Joined: 13 Nov 10
Posts: 324
Credit: 72,394,453
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33000 - Posted: 16 Sep 2013 | 16:00:34 UTC

Hello: The Beta task MJHARVEY_CRASH2-110-18-25-RND5104_0 is running without problems on Linux (Ubuntu 13.10), and the behaviour is normal: a CPU load of 90-99% on the dedicated core.

The CRASHNPT Betas (two completed) loaded all four CPUs at under 25%.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,577,974
RAC: 453,655
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33012 - Posted: 16 Sep 2013 | 20:33:42 UTC - in response to Message 32979.

Thanks SK, that's certainly worth considering and could take care of all my concerns against automated messages. An implementation would require some serious work. I think it would be best done in the BOINC main code base, so that projects could benefit from it, but setting it up here could be seen as a demonstration / showcase, which could motivate the main BOINC developers to include it.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33059 - Posted: 18 Sep 2013 | 16:44:04 UTC - in response to Message 33012.
Last modified: 18 Sep 2013 | 16:48:46 UTC

I would add that it would be nice to know if a WU is taking exceptionally long to complete - say 2 or 3 times what is normal for any given type of WU.

- just spotted a WU that had been pretending to run for the last 5 days!
The WU cache on that system is set to only 0.2 days.

- Maybe a Danger message to alert you when you are running a NOELIA WU!
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,577,974
RAC: 453,655
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33071 - Posted: 18 Sep 2013 | 19:34:38 UTC - in response to Message 33059.

We may already have a mechanism for that. I can't remember the exact wording.. time limit exceeded? It mostly triggers unintentionally and results in a straight error, though.

Assuming this is not actually a time limit, as the naming suggests, but something related to the amount of flops expected for a WU, it might be possible to fine-tune and use this limit to catch hanging WUs. One would have to be very careful about not generating false positives; the harm done would probably outweigh the gain.

Another thought regarding a hanging app: there's this error message "no heartbeat from client" that sometimes pops up, which implies some heartbeat checking is going on. I assume BOINC would throw an error if it no longer received heartbeats from an app. If this is true, then your hanging WU was still generating a heartbeat and hence was not totally stuck. At that point the GPU-Grid app could monitor itself and trigger Matt's new recovery mode if it detects no progress from the GPU after some time.
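The self-monitoring idea might look like this (a hypothetical sketch of a progress watchdog, nothing to do with the actual ACEMD internals):

```python
import threading, time

# Watchdog sketch: the compute loop calls beat() after every completed step;
# a background thread triggers a recovery callback if the step counter stops
# advancing for too long, even though the process is still "alive".
class ProgressWatchdog:
    def __init__(self, timeout_s, on_stall):
        self._timeout = timeout_s
        self._on_stall = on_stall
        self._last_beat = time.monotonic()
        self._stop = threading.Event()

    def beat(self):
        # Record that the simulation made progress.
        self._last_beat = time.monotonic()

    def _run(self):
        # Poll a few times per timeout period; fire once on a stall.
        while not self._stop.wait(self._timeout / 4):
            if time.monotonic() - self._last_beat > self._timeout:
                self._on_stall()
                return

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def stop(self):
        self._stop.set()
```

Usage would be something like `wd = ProgressWatchdog(300, recover); wd.start()` with `wd.beat()` in the step loop, where `recover` is whatever restart-from-checkpoint routine the app already has.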

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33082 - Posted: 18 Sep 2013 | 21:39:41 UTC - in response to Message 33071.

The "no heartbeat from client" problem is a bit of an anchor for Boinc; it trawls around the seabeds ripping up reefs - as old-school as my ideas on task timeouts. I presume I experienced the consequences of such murmurs today when my CPU apps failed, compliments of a N* WU.
Is the recovery mechanism confined to the WDDM timeout limits (which can be changed)?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33871 - Posted: 14 Nov 2013 | 9:55:37 UTC - in response to Message 32974.


Matt, thanks for the explanations some week ago! Going back to my original question, if there was a better way to communicate the number of task restarts instead of "when in doubt take a close look at the task outputs".


Unfortunately not. I agree, it would be really nice to be able to push a message to the BOINC Mangler from the client. I should make a request to DA.

Matt

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33872 - Posted: 14 Nov 2013 | 9:56:45 UTC

New BETA coming later today.
Will include a fix for the repeated crashing on restart that some of you have seen.

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33875 - Posted: 14 Nov 2013 | 14:05:13 UTC
Last modified: 14 Nov 2013 | 14:06:00 UTC

Huzzah! [Thanks, looking forward to testing it!]

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 33876 - Posted: 14 Nov 2013 | 14:39:48 UTC - in response to Message 33875.

815 is live now.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33880 - Posted: 14 Nov 2013 | 16:16:45 UTC - in response to Message 33876.

815 is live now.

Got one. I think you might have sent out 10-minute jobs with a full-length runtime estimate again.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33927 - Posted: 18 Nov 2013 | 19:57:51 UTC

Just had a batch of 'KLAUDE' beta tasks all error with

ERROR: file pme.cpp line 85: PME NX too small

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33928 - Posted: 18 Nov 2013 | 20:16:11 UTC - in response to Message 33927.
Last modified: 18 Nov 2013 | 20:18:22 UTC

ditto, all failed in 2 or 3 seconds:

Name 8-KLAUDE_6426-0-3-RND4620_2
Workunit 4932250
Created 18 Nov 2013 | 20:05:37 UTC
Sent 18 Nov 2013 | 20:08:06 UTC
Received 18 Nov 2013 | 20:12:46 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -98 (0xffffffffffffff9e) Unknown error number

Computer ID 139265
Report deadline 23 Nov 2013 | 20:08:06 UTC
Run time 2.52
CPU time 0.45
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.15 (cuda55)
Stderr output

<core_client_version>7.2.28</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -98 (0xffffff9e)
</message>
<stderr_txt>
# GPU [GeForce GTX 770] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 770
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r331_00 : 33140
ERROR: file pme.cpp line 85: PME NX too small
20:10:02 (1592): called boinc_finish

</stderr_txt>
]]>
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile (retired account)
Send message
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 33929 - Posted: 18 Nov 2013 | 20:44:39 UTC
Last modified: 18 Nov 2013 | 20:54:06 UTC

Yep, all trashed within seconds, no matter whether GTX Titan or GT 650M... Hope it helps, as I'm only consuming up-/download bandwidth here and it won't even heat my room. ;)

EDIT: There's a good one, at least for the 10 and a half minutes it ran so far.
____________
Mark my words and remember me. - 11th Hour, Lamb of God

Profile (retired account)
Send message
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 33930 - Posted: 18 Nov 2013 | 21:53:55 UTC - in response to Message 33929.


EDIT: There's a good one, at least for the 10 and a half minutes it ran so far.


Still running, but the estimate of remaining time is way off: 18.730% done, runtime 01:07:40, remaining time 00:10:18.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33931 - Posted: 18 Nov 2013 | 22:06:32 UTC - in response to Message 33930.
Last modified: 18 Nov 2013 | 22:08:24 UTC


EDIT: There's a good one, at least for the 10 and a half minutes it ran so far.


Still running, but estimation of remaining time is way off: 18.730%, runtime 01:07:40, remaining time 00:10:18.

The Run Time and % Complete are accurate, so you can estimate the overall time from those; 18.73% in 67 2/3 min suggests it will take a total of about 6 h 2 min (+/- a couple of minutes) to complete.

I have two 8.15 Betas running on a GTX660Ti and a GTX770 (W7) that look like taking 9h 12min and 6h 32min respectively.
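That back-of-the-envelope estimate is just elapsed time divided by fraction complete (the helper name is my own):

```python
# Completion estimate from BOINC's accurate fields:
#   total time = elapsed time / fraction complete
def estimate_total_minutes(percent_done, elapsed_min):
    return elapsed_min / (percent_done / 100.0)

total = estimate_total_minutes(18.73, 67 + 40 / 60)  # 18.73% done after 01:07:40
print(f"{total // 60:.0f} h {total % 60:.0f} min")   # → 6 h 1 min
```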
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile (retired account)
Send message
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 33937 - Posted: 19 Nov 2013 | 6:37:49 UTC

Yes, it took 5 hrs. and 58 min., validated ok.

Profile Damaraland
Send message
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwat
Message 33945 - Posted: 20 Nov 2013 | 19:52:28 UTC - in response to Message 33937.

Not very sure if you still want this info; maybe you could be more precise about what you need.

CUDA: NVIDIA GPU 0: GeForce GTX 260 (driver version 331.65, CUDA version 6.0, compute capability 1.3, 896MB, 818MB available, 912 GFLOPS peak)

ACEMD beta version v8.15 (cuda55)
77-KLAUDE_6429-0-2-RND1641_1: expected to finish in 22 h; 83% processed correctly so far.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33949 - Posted: 20 Nov 2013 | 23:58:27 UTC

I've finally had a crash while beta tasks were running ;)

I would need to examine the logs more carefully to be certain of the sequence of events, but it seems likely that these two tasks were running (one on each GTX 670) at around 16:50 tonight when the computer froze: I restarted it (hard power off) some 15 minutes later.

1-KLAUDE_6429-1-2-RND1937_0 did not survive the experience.

95-KLAUDE_6429-0-2-RND2489_1 was luckier with its restart.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 33952 - Posted: 21 Nov 2013 | 12:19:38 UTC - in response to Message 33949.

OK, now I'm awake, I've checked the logs for those two tasks, and the sequence of events is as I surmised.

20-Nov-2013 16:50:15 [---] NOTICES::write: sending notice 6
20-Nov-2013 17:03:31 [---] Starting BOINC client version 7.2.28 for windows_x86_64

That was the host crash interval

20-Nov-2013 16:43:15 [GPUGRID] Starting task 1-KLAUDE_6429-1-2-RND1937_0 using acemdbeta version 815 (cuda55) in slot 7
20-Nov-2013 17:04:06 [GPUGRID] Restarting task 1-KLAUDE_6429-1-2-RND1937_0 using acemdbeta version 815 (cuda55) in slot 7
20-Nov-2013 17:04:14 [GPUGRID] Task 1-KLAUDE_6429-1-2-RND1937_0 exited with zero status but no 'finished' file
20-Nov-2013 17:04:14 [GPUGRID] Restarting task 1-KLAUDE_6429-1-2-RND1937_0 using acemdbeta version 815 (cuda55) in slot 7
20-Nov-2013 17:04:16 [GPUGRID] [sched_op] Deferring communication for 00:01:39
20-Nov-2013 17:04:16 [GPUGRID] [sched_op] Reason: Unrecoverable error for task 1-KLAUDE_6429-1-2-RND1937_0
20-Nov-2013 17:04:16 [GPUGRID] Computation for task 1-KLAUDE_6429-1-2-RND1937_0 finished

That task crashed, but in a 'benign' way (it didn't take the driver down with it)

20-Nov-2013 16:35:47 [GPUGRID] Starting task 95-KLAUDE_6429-0-2-RND2489_1 using acemdbeta version 815 (cuda55) in slot 3
20-Nov-2013 17:04:06 [GPUGRID] Restarting task 95-KLAUDE_6429-0-2-RND2489_1 using acemdbeta version 815 (cuda55) in slot 3
20-Nov-2013 23:43:46 [GPUGRID] Computation for task 95-KLAUDE_6429-0-2-RND2489_1 finished

And that task validated.

Profile Damaraland
Send message
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwat
Message 33999 - Posted: 24 Nov 2013 | 0:05:08 UTC

This unit 3-KLAUDE_6429-0-2-RND6465 worked fine for me after it failed on 6 computers before

Maybe this can help???

If you need any further configuration details, just ask.

nanoprobe
Send message
Joined: 26 Feb 12
Posts: 181
Credit: 221,824,715
RAC: 0
Level
Leu
Scientific publications
watwatwatwatwatwatwatwatwat
Message 34005 - Posted: 24 Nov 2013 | 2:33:02 UTC

All KLAUDE failed here after 2 seconds. Found this in the BOINC event log.

11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_1 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_2 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_3 for task 91-KLAUDE_6444-0-3-RND9028_1 absent

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 340
Credit: 3,821,859,209
RAC: 942,358
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34006 - Posted: 24 Nov 2013 | 2:55:07 UTC - in response to Message 34005.

All KLAUDE failed here after 2 seconds. Found this in the BOINC event log.

11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_1 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_2 for task 91-KLAUDE_6444-0-3-RND9028_1 absent
11/23/2013 9:28:11 PM | GPUGRID | Output file 91-KLAUDE_6444-0-3-RND9028_1_3 for task 91-KLAUDE_6444-0-3-RND9028_1 absent



Same here!

Dagorath
Send message
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34008 - Posted: 24 Nov 2013 | 8:14:55 UTC

I had several KLAUDE tasks fail after I configured my system to crunch 2 GPUgrid tasks simultaneously on my single 660Ti. They failed ~2 secs after starting but were reported as "error while computing" and did not verify.

I then turned off beta tasks in my website prefs and received a NATHAN which ran fine alongside the KLAUDE I had run to ~50% completion before trying 2 simultaneous tasks.

I do understand GPUGrid does not support 2 tasks on 1 GPU and I'm not expecting a fix for that; I'm just passing along what I saw, FWIW, as I find it interesting that 2 KLAUDEs would not run together but 1 KLAUDE + 1 NATHAN were OK, and now 1 NATHAN + 1 SANTI are crunching fine.

I expect I'll witness what others have reported (that 2 tasks don't give much of a production increase) but I wanna try it for a day or 2 just to say I tried it. At that point I'll likely turn beta tasks on again and revert to just 1 task at a time.

____________
BOINC <<--- credit whores, pedants, alien hunters

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34009 - Posted: 24 Nov 2013 | 8:57:58 UTC

It looks as if there was a bad batch of KLAUDE workunits overnight, all of which failed with

ERROR: file mdioload.cpp line 209: Error reading parmtop file

That includes yours, Dagorath - I don't think you can draw a conclusion that running two at a time had anything to do with the failures.

Profile (retired account)
Send message
Joined: 22 Dec 11
Posts: 38
Credit: 28,606,255
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 34010 - Posted: 24 Nov 2013 | 9:34:12 UTC

Yes, same here, lots of errors overnight. Luckily enough, the Titan had already hit the max-per-day limit of 15.
____________
Mark my words and remember me. - 11th Hour, Lamb of God

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34011 - Posted: 24 Nov 2013 | 10:17:10 UTC - in response to Message 34010.
Last modified: 24 Nov 2013 | 10:20:18 UTC

Yeah, bad batch.

57-KLAUDE_6444-0-3-RND0265
ACEMD beta version

7485150 129153 24 Nov 2013 | 0:53:30 UTC 24 Nov 2013 | 0:55:10 UTC Error while computing 2.14 0.08 --- ACEMD beta version v8.15 (cuda42)
7487860 159186 24 Nov 2013 | 2:18:25 UTC 24 Nov 2013 | 5:58:30 UTC Error while computing 2.06 0.04 --- ACEMD beta version v8.14 (cuda55)
7488724 99934 24 Nov 2013 | 6:00:24 UTC 24 Nov 2013 | 6:06:35 UTC Error while computing 1.30 0.13 --- ACEMD beta version v8.15 (cuda55)
7488745 160877 24 Nov 2013 | 6:08:47 UTC 24 Nov 2013 | 6:14:58 UTC Error while computing 2.08 0.25 --- ACEMD beta version v8.15 (cuda55)
7488772 161748 24 Nov 2013 | 6:17:42 UTC 29 Nov 2013 | 6:17:42 UTC In progress --- --- --- ACEMD beta version v8.15 (cuda55)...

Its the same on Windows and Linux and the errors occur on different generations of GPU from GTX400's to GTX700's.

Exit status 98 (0x62) Unknown error number
process exited with code 98 (0x62, -158)

ERROR: file mdioload.cpp line 209: Error reading parmtop file
05:54:54 (21170): called boinc_finish

I expect this batch just wasn't built correctly.

-
I see that some WUs have already failed 8 times - the cutoff failure point - so they won't be resent. Given that they fail after 2 seconds, and there are only ~130 of these Betas, the batch probably won't be around too long.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Dagorath
Send message
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34012 - Posted: 24 Nov 2013 | 10:18:35 UTC - in response to Message 34009.

It looks as if there was a bad batch of KLAUDE workunits overnight, all of which failed with

ERROR: file mdioload.cpp line 209: Error reading parmtop file

That includes yours, Dagorath - I don't think you can draw a conclusion that running two at a time had anything to do with the failures.


That's good to know, thanks.

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 34014 - Posted: 24 Nov 2013 | 14:04:16 UTC - in response to Message 34012.

Yes, my fault. I put out some broken WUs on the beta channel.

Matt

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34015 - Posted: 24 Nov 2013 | 16:25:39 UTC - in response to Message 34014.

And not only KLAUDE: a trypsin task has resulted in yet another fatal CUDA driver error, and the card downclocked to as little as a third of its normal clock speed. I know trypsin is nasty stuff, as its purpose is to "break down" molecules, but that it can also "break down" a GPU clock is new to me :)
So I have now opted for LR only on the 660.
____________
Greetings from TJ

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34016 - Posted: 24 Nov 2013 | 19:57:57 UTC - in response to Message 34015.

The Trp tasks are not part of the bad KLAUDE batch:

trypsin_lig_75x2-NOELIA_RCDOS-0-1-RND0557_1 4940668 159186 24 Nov 2013 | 8:14:31 UTC 24 Nov 2013 | 15:31:51 UTC Completed and validated 6,504.01 2,200.87 30,000.00 ACEMD beta version v8.14 (cuda55)

...and yes, Trypsin is a very useful digestive enzyme.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 340
Credit: 3,821,859,209
RAC: 942,358
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34036 - Posted: 27 Nov 2013 | 11:04:52 UTC
Last modified: 27 Nov 2013 | 11:08:11 UTC

The latest NOELIA_RCDOSequ betas run fine, but they do downclock the video card to 914 MHz, not the 1019 MHz recorded in the stderr output below. Notice the temperature readings: they are in the 50s, not the 60s to low 70s I see when running the long tasks. This is true for both Windows XP and 7. Below is a typical output for these betas. The cards return to normal speed when they run the long runs.

Stderr output

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r325_00 : 32723
# GPU 0 : 52C
# GPU 1 : 51C
# GPU 2 : 51C
# GPU 3 : 49C
# GPU 1 : 52C
# GPU 1 : 53C
# GPU 1 : 54C
# GPU 1 : 55C
# GPU 1 : 56C
# GPU 1 : 57C
# GPU 1 : 58C
# GPU 3 : 51C
# GPU 3 : 52C
# GPU 3 : 53C
# GPU 3 : 54C
# GPU 3 : 55C
# GPU 3 : 56C
# GPU 2 : 53C
# GPU 2 : 54C
# GPU 2 : 55C
# GPU 2 : 56C
# GPU 2 : 57C
# GPU 2 : 58C
# GPU 0 : 53C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# Time per step (avg over 525000 steps): 5.742 ms
# Approximate elapsed time for entire WU: 3014.682 s
02:08:35 (2912): called boinc_finish

</stderr_txt>
]]>
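The elapsed-time estimate in logs like this is consistent with the per-step average: a quick sanity check using the numbers above (nothing project-specific, just multiplication).

```python
# "Time per step (avg over 525000 steps): 5.742 ms" times the step
# count should roughly reproduce the reported elapsed time (~3014.7 s).
ms_per_step = 5.742
steps = 525_000
elapsed_s = ms_per_step * steps / 1000.0
print(f"{elapsed_s:.1f} s")
```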

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34037 - Posted: 27 Nov 2013 | 15:52:51 UTC

I see the latest GPU-Z gives a reason for performance capping:

[GPU-Z screenshot showing the PerfCap Reason sensor]
That GTX 670 was capped for reliability voltage and operating voltage (it was running a SANTI_MAR423cap from the long queue at the time, not a Beta - just illustrating the point).

Snow Crash
Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34046 - Posted: 27 Nov 2013 | 21:34:00 UTC - in response to Message 34037.

I see the latest GPU-Z gives a reason for performance capping:

That GTX 670 was capped for reliability voltage and operating voltage (it was running a SANTI_MAR423cap from the long queue at the time, not a Beta - just illustrating the point).


<ot>

I have GPU-Z 0.7.4 (which says it is the latest version) and I do not see that entry (I have a 670 also). Is that a beta version, and can you provide a link to where you got it?

</ot>

____________
Thanks - Steve

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34047 - Posted: 27 Nov 2013 | 21:48:57 UTC - in response to Message 34046.

GPU-Z has been showing this for months. I'm still using 0.7.3 and it shows it, as did the previous version, and probably the one before that. I thought I posted about this several months ago?!? Anyway, it's a useful tool, but it only works on Windows. My GTX 660 Ti (which is hanging out of the case against a wall, courtesy of a riser) is limited by VRel and VOp (Reliability Voltage and Operating Voltage, respectively). My GTX 660 is limited by Power and Reliability Voltage. My GTX 770 is limited by Reliability Voltage and Operating Voltage. All in the same system. Of note is that only the GTX 660 is limited by Power!
Just saying...

____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34048 - Posted: 27 Nov 2013 | 22:07:55 UTC
Last modified: 27 Nov 2013 | 22:08:24 UTC

Yes, I have already seen that from GPU-Z in previous versions, just as skgiven says.
My 660 and 770 are limited by reliability voltage. When I changed the power limit, nothing changed, and the PSUs are powerful enough for the cards.
I saw/see it with beta, SR and LR tasks, from all scientists.
____________
Greetings from TJ

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,577,974
RAC: 453,655
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 34123 - Posted: 4 Dec 2013 | 20:22:09 UTC

If GPU-Z reports "VRel, VOp" as the throttling reason, this actually means the card is running at full throttle and has reached its highest boost bin. Since it would need a higher voltage for the next bin, it's reported as being throttled by voltage. Unless the power limit is set tightly or cooling is poor, this should be the default state of a GPU-Grid-crunching Kepler.

Bedrich wrote:
but they do down clock the video card speed to 914 Mhz, not the 1019 Mhz speed as recorded on the Stderr output below. Notice the temperature readings, they are in the 50's, not the 60's to low 70's when I run the long tasks.

This sounds as if GPU utilization was low, in which case the driver doesn't consider it necessary to push to full boost. In that case GPU-Z reports "Util" as the throttle reason, meaning "GPU utilization too low". This mostly happens with small/short WUs; those are also the ones where running 2 concurrent WUs actually provides some throughput gain.
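For anyone without GPU-Z (e.g. on Linux), NVIDIA's NVML library exposes similar information as a throttle-reason bitmask (via `nvidia-smi -q` or pynvml's `nvmlDeviceGetCurrentClocksThrottleReasons`). A minimal decoder sketch; the bit values come from the NVML headers, while the mapping to GPU-Z's labels and the helper name are my own guesses (NVML has no separate VRel/VOp reason — that indication is GPU-Z specific):

```python
# Decode NVML's clocksThrottleReasons bitmask (bit values from nvml.h).
# On a live system the mask would come from
# pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle);
# here we only decode a given mask.
REASONS = {
    0x1: "GpuIdle (utilization too low)",
    0x2: "ApplicationsClocksSetting",
    0x4: "SwPowerCap (software power limit)",
    0x8: "HwSlowdown (thermal or power-brake event)",
}

def decode_throttle(mask):
    """Return the list of active throttle reasons for a bitmask."""
    return [name for bit, name in sorted(REASONS.items()) if mask & bit]

print(decode_throttle(0x5))  # idle + power cap set together
```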

MrS
____________
Scanning for our furry friends since Jan 2002

Post to thread

Message boards : News : acemdbeta application - discussion