Message boards : News : New CUDA65 beta app

Author Message
Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38148 - Posted: 29 Sep 2014 | 9:49:51 UTC

Dear all, please give the new acemdbeta app, ver 845, a workout. This supports all GPUs now.
It's Windows only - if you don't get WUs, you'll need to update your driver.

Matt

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38150 - Posted: 29 Sep 2014 | 10:02:06 UTC
Last modified: 29 Sep 2014 | 10:10:54 UTC

Matt, is the 343.98 driver accepted? I've been trying to get Beta tasks.

14/09/29 06:12:36 | GPUGRID | No tasks are available for ACEMD beta version

My preferences are configured correctly: the run testing app/Beta app option is checked, and I am not accepting other short or long tasks. I never update to WHQL drivers, because they are limited in certain functional areas, unlike the Beta or Developer drivers.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38152 - Posted: 29 Sep 2014 | 10:23:10 UTC - in response to Message 38150.
Last modified: 29 Sep 2014 | 10:26:35 UTC

Huh, yes. You should be getting something...
According to the logs your host #159309 got given work at 12:15 CEST.

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38153 - Posted: 29 Sep 2014 | 10:49:17 UTC - in response to Message 38152.

14/09/29 06:48:02 | GPUGRID | No tasks are available for ACEMD beta version

Strange, I see no Beta tasks running on Boinc Manager. I just tried again. If driver is accepted, I will continue to try. Thanks for the help.

14/09/29 06:50:55 | GPUGRID | No tasks are available for ACEMD beta version

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38154 - Posted: 29 Sep 2014 | 10:52:33 UTC - in response to Message 38153.

*Now* you should get something..

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38157 - Posted: 29 Sep 2014 | 11:03:29 UTC - in response to Message 38156.
Last modified: 29 Sep 2014 | 11:33:21 UTC

I did, indeed.


Update: (unknown error) - exit code -97 (0xffffff9f) after 8s

The simulation has become unstable. Terminating to avoid lock-up (1)

This is the first time I've had this during my time at GPUGRID. GPU1 temp was 58C.
If you don't mind errors, I will try again.

Update2: same error. GPU usage goes to 90% for a few seconds, then drops to 0%, and then the task crashes.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 340
Credit: 3,821,471,309
RAC: 941,032
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38158 - Posted: 29 Sep 2014 | 11:14:37 UTC

On the first test unit, I got an error.

9/29/2014 7:13:41 AM | GPUGRID | Computation for task 21-MJHARVEY_TEST4000-0-10-RND0794_0 finished
9/29/2014 7:13:41 AM | GPUGRID | Output file 21-MJHARVEY_TEST4000-0-10-RND0794_0_1 for task 21-MJHARVEY_TEST4000-0-10-RND0794_0 absent
9/29/2014 7:13:41 AM | GPUGRID | Output file 21-MJHARVEY_TEST4000-0-10-RND0794_0_2 for task 21-MJHARVEY_TEST4000-0-10-RND0794_0 absent
9/29/2014 7:13:41 AM | GPUGRID | Output file 21-MJHARVEY_TEST4000-0-10-RND0794_0_3 for task 21-MJHARVEY_TEST4000-0-10-RND0794_0 absent



Name 21-MJHARVEY_TEST4000-0-10-RND0794_0
Workunit 10123268
Created 29 Sep 2014 | 9:50:11 UTC
Sent 29 Sep 2014 | 11:10:35 UTC
Received 29 Sep 2014 | 11:13:11 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 127986
Report deadline 4 Oct 2014 | 11:10:35 UTC
Run time 4.10
CPU time 3.48
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.45 (cuda65)
Stderr output
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 2 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:07:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU 0 : 67C
# GPU 1 : 42C
# GPU 2 : 69C
# GPU 3 : 70C
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>
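
The report above shows the same exit status in two hexadecimal forms, 0xffffff9f and 0xffffffffffffff9f. Both are -97 written in two's complement, at 32 and 64 bits respectively. A minimal sketch of the conversion follows; the helper names are illustrative and are not part of BOINC or ACEMD.

```python
def twos_complement_hex(value: int, bits: int) -> str:
    """Encode a signed integer as a two's-complement hex string of the given width."""
    mask = (1 << bits) - 1
    return hex(value & mask)


def from_twos_complement(raw: int, bits: int) -> int:
    """Decode an unsigned two's-complement value of the given width back to a signed integer."""
    sign_bit = 1 << (bits - 1)
    return raw - (1 << bits) if raw & sign_bit else raw


if __name__ == "__main__":
    print(twos_complement_hex(-97, 32))          # 32-bit form, as in the <message> block
    print(twos_complement_hex(-97, 64))          # 64-bit form, as in the "Exit status" row
    print(from_twos_complement(0xFFFFFF9F, 32))  # back to the signed exit code
```

So the two rows do not disagree; they only differ in the integer width used when printing.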


eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38163 - Posted: 29 Sep 2014 | 12:46:15 UTC
Last modified: 29 Sep 2014 | 13:30:12 UTC

Update #3: I've received 5 Beta tasks. All have failed, and two caused a system hang (no error files).

FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1965/ Simulation unstable. Flag 11 value 1
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)

Simulation unstable. Flag 11 value 1
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)

Update#4 Still failing on both cards with same error--

(unknown error) - exit code -97 (0xffffff9f)


http://www.gpugrid.net/workunit.php?wuid=10099983

This work unit has 3 Linux failures (all with GTX 780) and 2 Win8.1 failures.

Update #5: Received 5 more beta tasks, for a total of ten; all failed with the same error number. All tasks started fine (90%+ GPU usage, 14% MCU) and progressed in .016 intervals before failing. Wingmen with a Tesla K20c and a GTX 780 (C.C. 3.5), along with a C.C. 3.0 wingman, failed also.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38164 - Posted: 29 Sep 2014 | 12:50:59 UTC - in response to Message 38163.

Yes, looks like CUDA65 is bad on everything but GM204s. Ho hum.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38166 - Posted: 29 Sep 2014 | 13:18:43 UTC
Last modified: 29 Sep 2014 | 13:19:08 UTC

-97 error here, on my GTX 460
# Simulation unstable. Flag 11 value 1
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)

=========================

http://www.gpugrid.net/result.php?resultid=13149151
Name 43-MJHARVEY_TEST1999-1-10-RND5744_2
Workunit 10123176
Created 29 Sep 2014 | 11:40:28 UTC
Sent 29 Sep 2014 | 12:54:10 UTC
Received 29 Sep 2014 | 13:17:01 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 153764
Report deadline 4 Oct 2014 | 12:54:10 UTC
Run time 2.56
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.45 (cuda65)
Stderr output

<core_client_version>7.4.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 460] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 460
# ECC : Disabled
# Global mem : 1024MB
# Capability : 2.1
# PCI ID : 0000:07:00.0
# Device clock : 1526MHz
# Memory clock : 1900MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# Simulation unstable. Flag 11 value 1
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)

</stderr_txt>
]]>

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38169 - Posted: 29 Sep 2014 | 13:30:10 UTC

Yikes, I'm seeing these same errors on the Short Run queue -- I guess the Cuda65 app has been deployed there too?

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38171 - Posted: 29 Sep 2014 | 13:49:11 UTC - in response to Message 38169.

It was on acemdshort briefly. It is no longer.

Matt

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38172 - Posted: 29 Sep 2014 | 14:10:59 UTC

14/09/29 09:48:39 | GPUGRID | No tasks are available for ACEMD beta version

Has the beta app been pulled for non-C.C. 5.2 cards?

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38173 - Posted: 29 Sep 2014 | 14:39:10 UTC - in response to Message 38172.


Has the beta app been pulled for non-C.C. 5.2 cards?


Yes, it's served its purpose there. The CUDA65 build is broken on non-5.2

Matt

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38175 - Posted: 29 Sep 2014 | 17:37:26 UTC

846 on acemdbeta now. CUDA65 for sm 3.0 and higher.

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38176 - Posted: 29 Sep 2014 | 18:20:38 UTC

My 2 GTX 660 Tis, and my GTX 460, in my main rig, are now successfully simultaneously crunching 3 ACEMD beta version 8.46 (cuda65) tasks.

Thank you!

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38177 - Posted: 29 Sep 2014 | 18:27:58 UTC - in response to Message 38175.

So far, so good. .004% progress intervals-- 1.000% in four minutes. 24,000s est. time to complete.
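
The 24,000 s estimate follows from linear extrapolation of the progress figure: 1.000% in four minutes (240 s) scales to 100% in 24,000 s. A small sketch of that arithmetic, with an illustrative function name and assuming progress stays linear for the whole task:

```python
def estimate_total_seconds(percent_done: float, elapsed_seconds: float) -> float:
    """Linearly extrapolate total run time from the progress reported so far."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return elapsed_seconds * 100.0 / percent_done


if __name__ == "__main__":
    # 1.000% after four minutes, as reported in the post above
    print(estimate_total_seconds(1.0, 4 * 60))
```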

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38178 - Posted: 29 Sep 2014 | 18:32:29 UTC

Matt:

I even think the canary behavior works better for me now. I tried the scenario where it was failing on the 8.41 app, and now it worked fine without failure on the 8.46 beta app.

Can you please explain, in detail, how the canary behavior was changed? How exactly does it behave in 8.46?

Thanks,
Jacob

Jim1348
Send message
Joined: 28 Jul 12
Posts: 460
Credit: 1,130,761,180
RAC: 15,358
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 38179 - Posted: 29 Sep 2014 | 18:46:49 UTC

Running fine after 25 minutes on a GTX 650 Ti. It will complete in 3 hours 16 minutes (344.11 driver, Win7 64-bit).

Jim1348
Send message
Joined: 28 Jul 12
Posts: 460
Credit: 1,130,761,180
RAC: 15,358
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 38180 - Posted: 29 Sep 2014 | 22:17:47 UTC
Last modified: 29 Sep 2014 | 22:19:25 UTC

It completed OK on the GTX 650 Ti, but seems to be causing problems on some higher-end cards. But their versions of ACEMD probably have more changes than the one I got (8.46).
http://www.gpugrid.net/workunit.php?wuid=10123336

I will be trying my GTX 660 Ti next on the same machine to see what happens.

biodoc
Send message
Joined: 26 Aug 08
Posts: 89
Credit: 656,130,328
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38181 - Posted: 29 Sep 2014 | 22:21:54 UTC

All WU's completed & validated thus far on my GTX980 with beta app versions 8.44, 8.45 and 8.46. I'm running windows 8.1 and nvidia drivers v. 344.16.

http://www.gpugrid.net/results.php?hostid=142719

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 340
Credit: 3,821,471,309
RAC: 941,032
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38182 - Posted: 29 Sep 2014 | 23:05:55 UTC

All the beta units are finishing valid. Though, the output files are rather large, 44 Megabytes.

52-MJHARVEY_TEST4000-0-10-RND4601_3
Workunit 10123299
Created 29 Sep 2014 | 12:46:43 UTC
Sent 29 Sep 2014 | 19:41:27 UTC
Received 29 Sep 2014 | 22:41:52 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 127986
Report deadline 4 Oct 2014 | 19:41:27 UTC
Run time 6,363.92
CPU time 6,033.14
Validate state Valid
Credit 1,500.00
Application version ACEMD beta version v8.46 (cuda65)
Stderr output
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU 0 : 63C
# GPU 1 : 73C
# GPU 2 : 74C
# GPU 3 : 74C
# GPU 0 : 64C
# GPU 0 : 65C
# GPU 0 : 66C
# GPU 0 : 67C
# GPU 0 : 68C
# GPU 0 : 69C
# GPU 0 : 70C
# GPU 0 : 71C
# Time per step (avg over 2500000 steps): 2.549 ms
# Approximate elapsed time for entire WU: 6371.977 s
# PERFORMANCE: 23558 Natoms 2.549 ns/day 0.000 ms/step 0.000 us/step/atom
18:24:40 (5228): called boinc_finish

</stderr_txt>
]]>


eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38183 - Posted: 29 Sep 2014 | 23:25:59 UTC

What's the meaning of the ns/day performance figure? The number is the same as the time (ms) per step.

23558 Natoms 4.726 ns/day-GTX650Ti
23558 Natoms 2.549 ns/day-GTX690
23558 Natoms 1.633 ns/day-GTX980
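
For reference, ns/day conventionally means nanoseconds of simulated time produced per day of wall-clock time, so a faster card should show a larger number, not a smaller one. Since the values above match the ms-per-step figures from the logs, the PERFORMANCE line may simply be printing ms/step in the ns/day column, though that is a guess. The sketch below shows how the two quantities relate. The integration timestep is not stated anywhere in this thread, so the 4 fs used here is purely a placeholder assumption, and the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def elapsed_seconds(steps: int, ms_per_step: float) -> float:
    """Wall-clock time for a run of `steps` steps at `ms_per_step` milliseconds each."""
    return steps * ms_per_step / 1000.0


def ns_per_day(ms_per_step: float, timestep_fs: float) -> float:
    """Simulated nanoseconds per wall-clock day.

    timestep_fs is the integration timestep in femtoseconds (an assumed input;
    the thread does not say what these workunits use).
    """
    steps_per_day = SECONDS_PER_DAY * 1000.0 / ms_per_step
    return steps_per_day * timestep_fs * 1e-6  # 1 fs = 1e-6 ns


if __name__ == "__main__":
    # GTX 690 figures from the stderr output earlier in the thread:
    # 2,500,000 steps at 2.549 ms/step, reported elapsed time 6371.977 s.
    print(elapsed_seconds(2_500_000, 2.549))
    # Placeholder timestep of 4 fs, for illustration only.
    print(ns_per_day(2.549, timestep_fs=4.0))
```

The first figure lands within a second of the elapsed time in the log, which supports reading 2.549 as ms/step rather than ns/day.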

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38184 - Posted: 30 Sep 2014 | 8:31:12 UTC - in response to Message 38180.

It completed OK on the GTX 650 Ti, but seems to be causing problems on some higher-end cards. But their versions of ACEMD probably have more changes than the one I got (8.46).
http://www.gpugrid.net/workunit.php?wuid=10123336

I will be trying my GTX 660 Ti next on the same machine to see what happens.

I think the errors on the higher-end cards were caused by drivers that were too old, Jim. I had a lot of errors on my 780Ti's yesterday, but when I updated to the latest driver, they ran as smoothly as usual again.
The beta did okay on my 660, so your 660Ti will do great as well.
____________
Greetings from TJ

Jim1348
Send message
Joined: 28 Jul 12
Posts: 460
Credit: 1,130,761,180
RAC: 15,358
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 38186 - Posted: 30 Sep 2014 | 10:28:53 UTC - in response to Message 38184.

TJ,

Thanks, that is probably it. My GTX 660 Ti did finish fine; I will be trying a couple of GTX 750 Ti's now just for fun.

KSUMatt
Avatar
Send message
Joined: 11 Jan 13
Posts: 214
Credit: 831,004,493
RAC: 0
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwat
Message 38208 - Posted: 1 Oct 2014 | 1:46:11 UTC
Last modified: 1 Oct 2014 | 1:49:43 UTC

Just enabled Test Apps for my GTX 680 and GTX 780Ti cards. I'll check back in a while to see how they're doing.

Edit: I saw that TJ recommended updating to the latest drivers. Is this the latest Beta or WHQL driver? I'm currently running 344.11. Thanks.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 460
Credit: 1,130,761,180
RAC: 15,358
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwat
Message 38210 - Posted: 1 Oct 2014 | 2:03:50 UTC - in response to Message 38208.

Edit: I saw that TJ recommended updating to the latest drivers. Is this the latest Beta or WHQL driver? I'm currently running 344.11. Thanks.

344.11 works fine on my GTX 650 Ti and 660 Ti on the test apps. I am running it on my GTX 750 Ti also, but haven't picked up the new apps yet

KSUMatt
Avatar
Send message
Joined: 11 Jan 13
Posts: 214
Credit: 831,004,493
RAC: 0
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwat
Message 38213 - Posted: 1 Oct 2014 | 2:59:31 UTC

Thanks, Jim1348. I'll stick with 344.11 for now, then.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38214 - Posted: 1 Oct 2014 | 3:45:25 UTC - in response to Message 38213.

Thanks, Jim1348. I'll stick with 344.11 for now, then.


What other options are there? :)

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38216 - Posted: 1 Oct 2014 | 7:23:02 UTC - in response to Message 38208.

Just enabled Test Apps for my GTX 680 and GTX 780Ti cards. I'll check back in a while to see how they're doing.

Edit: I saw that TJ recommended updating to the latest drivers. Is this the latest Beta or WHQL driver? I'm currently running 344.11. Thanks.

Hello Matt, yes, I am running 344.11, the latest WHQL driver. But to be clear, it was recommended by Matt from the project.
The older driver I was using was a bit faster on Win7, as WDDM was introduced with Vista and cannot be switched off, but that is beyond the scope of this thread.
____________
Greetings from TJ

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38217 - Posted: 1 Oct 2014 | 8:46:10 UTC - in response to Message 38216.
Last modified: 1 Oct 2014 | 8:47:07 UTC

If I've got things right, the 65 apps shouldn't be sent any driver older than 343.00. The exception to that will be the Linux app, when that finally exists. That will give the WU out to any client that reports CUDA 6.5 capability, as only our patched client reports the driver version.

Matt
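
The rule described here (no CUDA 6.5 work for drivers older than 343.00) amounts to a version comparison. The sketch below is only an illustration of that comparison under stated assumptions; it is not the project's scheduler code, and the names `MIN_DRIVER`, `parse_driver`, and `eligible_for_cuda65` are hypothetical.

```python
from typing import Tuple

# Minimum driver version quoted in the post above.
MIN_DRIVER: Tuple[int, int] = (343, 0)


def parse_driver(version: str) -> Tuple[int, int]:
    """Split a driver string such as '344.11' into (major, minor) integers."""
    major, minor = version.strip().split(".")
    return int(major), int(minor)


def eligible_for_cuda65(version: str, minimum: Tuple[int, int] = MIN_DRIVER) -> bool:
    """True if the reported driver is at least the minimum version."""
    return parse_driver(version) >= minimum


if __name__ == "__main__":
    for driver in ("331.00", "343.00", "343.98", "344.11"):
        print(driver, eligible_for_cuda65(driver))
```

Comparing (major, minor) tuples avoids the pitfalls of comparing the strings or floats directly.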

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38219 - Posted: 1 Oct 2014 | 8:58:34 UTC - in response to Message 38217.
Last modified: 1 Oct 2014 | 9:00:13 UTC

If I've got things right, the 65 apps shouldn't be sent any driver older than 343.00. The exception to that will be the Linux app, when that finally exists. That will give the WU out to any client that reports CUDA 6.5 capability, as only our patched client reports the driver version.

Matt

Well Matt, with driver 331 my 780Ti's on Win7 were a bit faster, but then I got cuda65 tasks and they errored out. On your advice I updated the driver and have had no more errors (one yesterday, but that had another cause).
But if you have made changes yesterday or today, then you are probably right.
____________
Greetings from TJ

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1844
Credit: 10,645,982,644
RAC: 9,980,171
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38220 - Posted: 1 Oct 2014 | 8:58:46 UTC

I think it's safe to promote the CUDA6.5 application to the long queue.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38221 - Posted: 1 Oct 2014 | 8:59:36 UTC - in response to Message 38220.

Not just yet...

biodoc
Send message
Joined: 26 Aug 08
Posts: 89
Credit: 656,130,328
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38222 - Posted: 1 Oct 2014 | 9:28:39 UTC - in response to Message 38217.

If I've got things right, the 65 apps shouldn't be sent any driver older than 343.00. The exception to that will be the Linux app, when that finally exists. That will give the WU out to any client that reports CUDA 6.5 capability, as only our patched client reports the driver version.

Matt


boinc 7.4.22 (development version) now reports the driver version:

Starting BOINC client version 7.4.22 for x86_64-pc-linux-gnu
CUDA: NVIDIA GPU 0: GeForce GTX 780 Ti (driver version 343.22, CUDA version 6.5, compute capability 3.5, 3072MB, 2814MB available, 5345 GFLOPS peak)

Shows up here too:

http://www.gpugrid.net/show_host_detail.php?hostid=183991
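
For anyone who wants to pull the driver version out of a client log, the startup line quoted above can be parsed with a regular expression. This is a sketch that assumes the line layout exactly as quoted in this post; other client versions may format it differently, and `parse_gpu_line` is an illustrative name.

```python
import re

# Matches the layout of the startup line quoted above (assumed, not guaranteed
# to hold for other BOINC client versions).
GPU_LINE = re.compile(
    r"GPU (?P<index>\d+): (?P<name>.+?) \("
    r"driver version (?P<driver>[\d.]+), "
    r"CUDA version (?P<cuda>[\d.]+), "
    r"compute capability (?P<cc>[\d.]+)"
)


def parse_gpu_line(line: str):
    """Return the GPU fields as a dict of strings, or None if the line does not match."""
    match = GPU_LINE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = (
        "CUDA: NVIDIA GPU 0: GeForce GTX 780 Ti (driver version 343.22, "
        "CUDA version 6.5, compute capability 3.5, 3072MB, 2814MB available, "
        "5345 GFLOPS peak)"
    )
    print(parse_gpu_line(line))
```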

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38224 - Posted: 1 Oct 2014 | 10:18:09 UTC
Last modified: 1 Oct 2014 | 10:19:40 UTC

MJH:

I've been processing Beta tasks, and although nearly all are successful for me on the 8.46 app, I did have a failure last night. This is on a completely-stable Windows 8.1 Update 1 x64 machine, on one of my GTX 660 Ti GPUs, using 344.11 driver.

Any ideas?

http://www.gpugrid.net/result.php?resultid=13154266

Name 79-MJHARVEY_TEST4001-2-10-RND8149_0
Workunit 10126844
Created 30 Sep 2014 | 18:55:59 UTC
Sent 1 Oct 2014 | 3:25:13 UTC
Received 1 Oct 2014 | 4:35:24 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 153764
Report deadline 6 Oct 2014 | 3:25:13 UTC
Run time 1,431.05
CPU time 384.63
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.46 (cuda65)
Stderr output

<core_client_version>7.4.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 2 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:08:00.0
# Device clock : 1045MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r343_98 : 34411
# GPU 0 : 69C
# GPU 1 : 64C
# GPU 2 : 69C
# GPU 1 : 65C
# GPU 1 : 66C
# GPU 1 : 67C
# GPU 0 : 70C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5505000)

# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 2 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:08:00.0
# Device clock : 1045MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r343_98 : 34411
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38225 - Posted: 1 Oct 2014 | 11:05:22 UTC
Last modified: 1 Oct 2014 | 11:06:43 UTC

Question: would a CUDA 4.2 long task running on one card slow down CUDA 6.5 short or Beta tasks running on the other, or vice versa? I just noticed a CUDA 4.2 Noelia Long task running alongside 6.5 Beta and Short tasks. The runtime for the Long task is longer than normal. It normally takes ~40 hr to complete, but at ~40 hr the 4.2 task is at 80%.

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38226 - Posted: 1 Oct 2014 | 14:55:20 UTC - in response to Message 38225.

Maybe, if the processes are competing for CPU.

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38227 - Posted: 1 Oct 2014 | 15:09:47 UTC
Last modified: 1 Oct 2014 | 15:10:05 UTC

Any idea why my task failed, 3 posts up?

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38228 - Posted: 1 Oct 2014 | 16:12:36 UTC - in response to Message 38227.
Last modified: 1 Oct 2014 | 16:13:52 UTC

Any idea why my task failed, 3 posts up?


Have you checked Event Viewer for anything that occurred at the time the task failed? Any kernel failures? Or database instances? If you have automatic Windows updates or automatic maintenance enabled, this can trigger random failures in other processes (or sometimes fault any heavy-usage process). Also, a security "audit" can trigger failures in background tasks (GPUGRID).

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38229 - Posted: 1 Oct 2014 | 16:28:57 UTC - in response to Message 38228.

Thanks, but it was a couple of "simulation became unstable" errors, which I believe to be a problem with the GPUGrid application.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 1844
Credit: 10,645,982,644
RAC: 9,980,171
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38234 - Posted: 1 Oct 2014 | 17:08:17 UTC - in response to Message 38221.

I think it's safe to promote the CUDA6.5 application to the long queue.
Not just yet...

Are we waiting for your GTX980 to arrive?

biodoc
Send message
Joined: 26 Aug 08
Posts: 89
Credit: 656,130,328
RAC: 0
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38248 - Posted: 2 Oct 2014 | 8:59:52 UTC - in response to Message 38234.

I think it's safe to promote the CUDA6.5 application to the long queue.
Not just yet...

Are we waiting for your GTX980 to arrive?


It's probably due to me overclocking my GTX980. I had several tasks fail while I was at work yesterday. Since I clocked back, I've had 4 short run tasks complete successfully.

My apologies for messing up the beta test.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38267 - Posted: 3 Oct 2014 | 1:47:09 UTC

I had another failure, where the simulation became unstable on an 8.46 CUDA 6.5 beta task. http://www.gpugrid.net/result.php?resultid=13161365

I am not entirely convinced that the error is the fault of the task or the application. Perhaps the new 344.11 drivers push the GPUs even harder than previous drivers. I will do additional testing, with Heaven, to attempt to confirm.

Thanks,
Jacob

Name 30-MJHARVEY_TEST1999-5-10-RND7983_0
Workunit 10132352
Created 2 Oct 2014 | 16:49:51 UTC
Sent 2 Oct 2014 | 21:36:23 UTC
Received 2 Oct 2014 | 22:20:50 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 153764
Report deadline 7 Oct 2014 | 21:36:23 UTC
Run time 386.44
CPU time 100.75
Validate state Invalid
Credit 0.00
Application version ACEMD beta version v8.46 (cuda65)
Stderr output

<core_client_version>7.4.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 2 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:08:00.0
# Device clock : 1045MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r343_98 : 34411
# GPU 0 : 69C
# GPU 1 : 64C
# GPU 2 : 70C
# GPU 1 : 65C
# GPU 1 : 66C
# GPU 1 : 67C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 2 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:08:00.0
# Device clock : 1045MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r343_98 : 34411
# GPU 0 : 62C
# GPU 1 : 58C
# GPU 2 : 54C
# GPU 0 : 63C
# GPU 1 : 59C
# GPU 2 : 55C
# GPU 0 : 64C
# GPU 1 : 60C
# GPU 2 : 56C
# GPU 0 : 65C
# GPU 2 : 57C
# GPU 0 : 66C
# GPU 1 : 61C
# GPU 2 : 58C
# GPU 2 : 59C
# GPU 0 : 67C
# GPU 1 : 62C
# GPU 2 : 60C
# GPU 2 : 61C
# GPU 0 : 68C
# GPU 1 : 63C
# GPU 2 : 62C
# GPU 2 : 63C
# GPU 0 : 69C
# GPU 1 : 64C
# GPU 2 : 64C
# GPU 2 : 65C
# GPU 0 : 70C
# GPU 0 : 71C
# GPU 1 : 65C
# GPU 2 : 66C
# GPU 1 : 66C
# GPU 2 : 67C
# GPU 0 : 72C
# GPU 1 : 67C
# GPU 2 : 68C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 12630000)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 2 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:08:00.0
# Device clock : 1045MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r343_98 : 34411
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>
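
To check whether temperature plays a part in failures like this one, the per-GPU readings can be extracted from the stderr text. In the logs in this thread the app appears to print a line each time a GPU reaches a new high, so the last reading per GPU is its peak; the sketch below takes the maximum anyway, so it does not depend on that. The function name and the sample text are illustrative.

```python
import re

# Matches lines such as "# GPU 0 : 69C" but not "# GPU [GeForce ...]" headers.
TEMP_LINE = re.compile(r"#\s*GPU\s+(\d+)\s*:\s*(\d+)C")


def peak_temperatures(stderr_text: str) -> dict:
    """Return {gpu_index: highest temperature in degrees C} found in the stderr text."""
    peaks = {}
    for gpu, temp in TEMP_LINE.findall(stderr_text):
        gpu, temp = int(gpu), int(temp)
        peaks[gpu] = max(temp, peaks.get(gpu, temp))
    return peaks


if __name__ == "__main__":
    sample = """\
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# Memory clock : 3004MHz
# GPU 0 : 69C
# GPU 1 : 64C
# GPU 2 : 70C
# GPU 1 : 65C
"""
    print(peak_temperatures(sample))
```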

TJ
Send message
Joined: 26 Jun 09
Posts: 815
Credit: 1,469,215,105
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38272 - Posted: 3 Oct 2014 | 8:43:55 UTC - in response to Message 38267.

I don't think the 344.11 driver is pushing the cards harder as with this driver my 780Ti's are around 700 seconds slower than with the 331 driver I used until 309 September when I was forced to update as I got errors with the new app. See below in this thread.

____________
Greetings from TJ

Rion Family
Send message
Joined: 13 Jan 14
Posts: 18
Credit: 6,127,606,365
RAC: 10,374,889
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwat
Message 38278 - Posted: 3 Oct 2014 | 15:08:16 UTC

Hello - I have noticed that my dual GTX 780 machine has been getting mostly beta tasks lately: only 2 short runs and no long runs over the past few days. I even set my prefs to no beta and no other apps, but it is still pulling only beta tasks?

My other 3 systems on the account - gtx 770 & 660 do not show any beta tasks?

Just curious

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38279 - Posted: 3 Oct 2014 | 15:44:46 UTC

I too seem to only be getting Beta, even though I've re-enabled all applications. Is the scheduler prioritizing the Beta application somehow?

eXaPower
Send message
Joined: 25 Sep 13
Posts: 265
Credit: 1,045,896,167
RAC: 1,737,032
Level
Met
Scientific publications
watwatwatwatwatwat
Message 38281 - Posted: 3 Oct 2014 | 16:12:58 UTC - in response to Message 38272.

I don't think the 344.11 driver is pushing the cards harder as with this driver my 780Ti's are around 700 seconds slower than with the 331 driver I used until 309 September when I was forced to update as I got errors with the new app. See below in this thread.


New technologies are being added to the 343 branch driver for second-generation Maxwell (GM204): Dynamic Super Resolution, Third Generation Delta Color Compression, Multi-Pixel Programming Sampling, NVIDIA VXGI (Real-Time Voxel Global Illumination), VR Direct, Multi-Projection Acceleration, and Multi-Frame Sampled Anti-Aliasing (MFAA), with support for CSAA removed. HDMI 2.0 support was also added.

I'd say this driver branch is not fully complete yet. A couple more releases should bring out the driver's full potential. Considering that support for pre-Fermi cards was dropped, and the number of differences between SM/SMX/SMM, these first 343 branch drivers have room to be refined.

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38310 - Posted: 5 Oct 2014 | 16:13:08 UTC - in response to Message 38178.

Matt:

I even think the canary behavior works better for me now. I tried the scenario where it was failing on the 8.41 app, and now it worked fine without failure on the 8.46 beta app.

Can you please explain, in detail, how the canary behavior was changed? How exactly does it behave in 8.46?

Thanks,
Jacob



Any answer on this?

Profile MJH
Project administrator
Project developer
Project scientist
Send message
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Level
Val
Scientific publications
watwat
Message 38311 - Posted: 5 Oct 2014 | 17:40:18 UTC - in response to Message 38310.


Can you please explain, in detail, how the canary behavior was changed? How exactly does it behave in 8.46?


It doesn't. I've disabled it altogether. I'm counting on the newer drivers to do a better job at recovering from deadlocks.

Matt

Jacob Klein
Send message
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,372,614
RAC: 1,071,233
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38313 - Posted: 5 Oct 2014 | 18:01:12 UTC - in response to Message 38311.

Sounds good to me. If you ever decide to re-add it, or modify its functionality, please be sure to let us know.

Thanks,
Jacob

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1073
Credit: 4,497,101,754
RAC: 415,156
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 38323 - Posted: 6 Oct 2014 | 15:40:56 UTC - in response to Message 38311.

Can you please explain, in detail, how the canary behavior was changed? How exactly does it behave in 8.46?

It doesn't. I've disabled it altogether. I'm counting on the newer drivers to do a better job at recovering from deadlocks. Matt

Thanks much for disabling that feature, I've lost a lot of WUs to it :-)
