Message boards : Number crunching : GERARD_CXCL12LOCKMONO

Author Message
Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 43808 - Posted: 22 Jun 2016 | 6:07:39 UTC

Just received 5 of these:

GERARD_CXCL12LOCKMONO

Haven't seen this type before. All failed in about 1 second.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 43809 - Posted: 22 Jun 2016 | 8:04:06 UTC - in response to Message 43808.

Most failing on Linux too. Errors:

Stderr output

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 176 (0xb0, -80)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)

</stderr_txt>
]]>

Stderr output

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 199 (0xc7, -57)
</message>
<stderr_txt>
# SWAN Device 0 :
# Name : GeForce GTX 970
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1215MHz
# Memory clock : 3505MHz
# Memory width : 256bit
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert -57

</stderr_txt>
]]>

Maybe these are designed to 'fail early' if they are likely to fail at all?

Task 15166493 has reached 5.5% after 1h on my Linux system, so the odd one appears to be running normally.

1x39-GERARD_CXCL12LOCKMONO-0-3-RND8941_0
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
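
The exit lines in the stderr output above show each code three ways, for example `176 (0xb0, -80)`. A minimal sketch of how those numbers relate, assuming the last value is simply the 8-bit two's-complement reading of the exit code (`as_signed` is a hypothetical helper, not part of BOINC or the GPUGRID application):

```python
def as_signed(value: int, bits: int = 8) -> int:
    """Interpret an unsigned integer as a two's-complement signed value."""
    mask = (1 << bits) - 1
    value &= mask                      # keep only the low `bits` bits
    sign_bit = 1 << (bits - 1)
    # If the top bit is set, the signed reading is value - 2**bits.
    return value - (1 << bits) if value & sign_bit else value


# Exit codes taken from the two stderr outputs quoted in this post.
for code in (176, 199):
    print(f"{code} (0x{code:02x}, {as_signed(code)})")
```

Under that assumption, 199 reads as -57, which is the same number reported by `# SWAN swan_assert -57` in the second output.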

Profile skgiven
Message 43816 - Posted: 22 Jun 2016 | 8:58:04 UTC - in response to Message 43809.

Server says:

Detailed computing status

Application           Unsent  In progress  Success  Error rate
GERARD_CXCL12LOCKMON  234     170          0        100%


http://www.gpugrid.net/server_status.php - scroll down

If it's a new batch this is to be expected: lots of immediate failures will return before any completed tasks. The first completed task might not report until ~7h after it was sent.
The first healthy-looking LOCKMONO task I'm running is now at 10%, and I've another on W10 at 1.3% after 11 mins. GPU usage is 80% at 73% power with temp throttling enabled, using 1GB GDDR at present; that all looks quite normal.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 10,924,098,466
RAC: 15,880,938
Level
Trp
Message 43817 - Posted: 22 Jun 2016 | 10:41:19 UTC - in response to Message 43816.

I had a couple of these WUs fail as well before getting a couple of good ones, which are crunching well right now. I hope they finish successfully.


Name 0x20-GERARD_CXCL12LOCKMONO-0-3-RND0335_2
Workunit 11645809
Created 22 Jun 2016 | 6:51:54 UTC
Sent 22 Jun 2016 | 8:20:34 UTC
Received 22 Jun 2016 | 9:06:57 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 27 Jun 2016 | 8:20:34 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 63C
# GPU 1 : 48C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:02:00.0
# Device clock : 1190MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>


Name 0x23-GERARD_CXCL12LOCKMONO-0-3-RND1112_0
Workunit 11645812
Created 21 Jun 2016 | 15:15:19 UTC
Sent 22 Jun 2016 | 5:09:24 UTC
Received 22 Jun 2016 | 5:53:14 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 27 Jun 2016 | 5:09:24 UTC
Run time 0.00
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)
Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# GPU 0 : 61C
# GPU 1 : 57C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 5000)
# GPU [GeForce GTX 980 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980 Ti
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:01:00.0
# Device clock : 1266MHz
# Memory clock : 3505MHz
# Memory width : 384bit
# Driver version : r358_00 : 35906
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>
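
The same exit status appears above with two different hex strings: 16 hex digits on the server's "Exit status" line and 8 in the client message. A minimal sketch, assuming both are just -97 written as two's complement at 64 and 32 bits respectively (`twos_complement_hex` is a hypothetical helper, not a BOINC function):

```python
def twos_complement_hex(value: int, bits: int) -> str:
    """Render a (possibly negative) integer as two's-complement hex of a given width."""
    mask = (1 << bits) - 1
    # Masking a negative Python int yields its two's-complement bit pattern.
    return hex(value & mask)


print(twos_complement_hex(-97, 32))  # 32-bit width, as in the client <message>
print(twos_complement_hex(-97, 64))  # 64-bit width, as on the "Exit status" line
```

If that assumption holds, the two strings describe one error code rather than two different failures.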

Profile Beyond
Message 43818 - Posted: 22 Jun 2016 | 15:56:13 UTC - in response to Message 43816.
Last modified: 22 Jun 2016 | 15:59:26 UTC

On all our machines (you, Bedrich and me), the failed ones all begin with "0x". Some are up to 5 errors, no successes. I have 4 more GERARD_CXCL12LOCKMONO running now that begin with "1x" that seem to be progressing normally (one's over 30%).

Profile Beyond
Message 43820 - Posted: 22 Jun 2016 | 17:34:40 UTC - in response to Message 43816.

Server says: Detailed computing status

Application           Unsent  In progress  Success  Error rate
GERARD_CXCL12LOCKMON  234     170          0        100%

http://www.gpugrid.net/server_status.php - scroll down

Looks like 4 have now been completed. Bet they all have the "1x" prefix, not "0x":

Application           Unsent  In progress  Success  Error rate
GERARD_CXCL12LOCKMON  36      368          4        98.46%

At least we know that some of them are OK.
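
The status page does not show the raw error count, but it can be estimated from the success count and the error rate. A rough sketch, assuming the server computes error rate as errors / (errors + successes); that formula is an assumption, and `errors_from_rate` is a hypothetical helper:

```python
def errors_from_rate(successes: int, error_rate_pct: float) -> int:
    """Estimate the error count, ASSUMING rate = errors / (errors + successes)."""
    if not 0.0 <= error_rate_pct < 100.0:
        # At exactly 100% (zero successes) the count cannot be recovered.
        raise ValueError("error rate must be in [0, 100)")
    success_fraction = 1.0 - error_rate_pct / 100.0
    total = successes / success_fraction   # estimated errors + successes
    return round(total - successes)


errors = errors_from_rate(4, 98.46)
# Cross-check: recompute the rate from the estimate.
print(errors, round(100 * errors / (errors + 4), 2))
```

With 4 successes and a 98.46% rate this gives an estimate of 256 errors, and 256 / 260 rounds back to 98.46%. The earlier snapshot (0 successes, 100%) cannot be inverted this way.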

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 43821 - Posted: 22 Jun 2016 | 18:01:42 UTC - in response to Message 43818.

My host #208061:

0x83-GERARD_CXCL12LOCKMONO-0-3-RND0285_1
-97 (0xffffffffffffff9f) Unknown error number
Attempting restart (step 5000)
Run time 1.00
CPU time 0.00

Paradoxically, a -97 error is thought of as an overclocking problem.

3x96-GERARD_CXCL12LOCKMONO-0-3-RND9182_0
-97 (0xffffffffffffff9f) Unknown error number
Attempting restart (step 3845000)
Run time 10,433.28
CPU time 3,276.63

A note about e6s8_e5s7p0f230-GIANNI_MORC36bCHL1-0-1-RND7755_0:
It faulted at 99.992% WU completion a couple of days ago on my host.
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
-98 (0xffffffffffffff9e) Unknown error number
Run time 68,745.34
CPU time 20,137.64




Bedrich Hajek
Message 43822 - Posted: 22 Jun 2016 | 21:51:12 UTC

I had 2 of these WUs complete successfully. They were 1x and 2x WUs.


Name 1x28-GERARD_CXCL12LOCKMONO-0-3-RND5689_0
Workunit 11645918
Created 21 Jun 2016 | 15:17:47 UTC
Sent 22 Jun 2016 | 5:59:44 UTC
Received 22 Jun 2016 | 14:21:20 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 27 Jun 2016 | 5:59:44 UTC
Run time 29,436.96
CPU time 29,315.81
Validate state Valid
Credit 294,750.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)


Name 2x67-GERARD_CXCL12LOCKMONO-0-3-RND9322_0
Workunit 11646058
Created 21 Jun 2016 | 15:21:10 UTC
Sent 22 Jun 2016 | 9:12:36 UTC
Received 22 Jun 2016 | 17:48:52 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 263612
Report deadline 27 Jun 2016 | 9:12:36 UTC
Run time 30,274.28
CPU time 30,138.36
Validate state Valid
Credit 294,750.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)


I also had another 0x WU fail, which makes 3 for me.


Name 0x54-GERARD_CXCL12LOCKMONO-0-3-RND3534_2
Workunit 11645843
Created 22 Jun 2016 | 10:05:17 UTC
Sent 22 Jun 2016 | 13:11:49 UTC
Received 22 Jun 2016 | 13:33:01 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 27 Jun 2016 | 13:11:49 UTC
Run time 1.31
CPU time 1.31
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)


Looks like the 0x WUs are all bad, and should be canceled.


Bedrich Hajek
Message 43828 - Posted: 24 Jun 2016 | 10:34:21 UTC

Enough with these 0x WUs already; I had 2 more fail on me.


Name 0x0-GERARD_CXCL12LOCKMONO-0-3-RND6293_6
Workunit 11645789
Created 23 Jun 2016 | 17:22:43 UTC
Sent 23 Jun 2016 | 18:54:17 UTC
Received 23 Jun 2016 | 19:40:13 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 263612
Report deadline 28 Jun 2016 | 18:54:17 UTC
Run time 13.19
CPU time 13.19
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)


Name 0x47-GERARD_CXCL12LOCKMONO-0-3-RND0003_4
Workunit 11645836
Created 24 Jun 2016 | 5:30:48 UTC
Sent 24 Jun 2016 | 7:04:56 UTC
Received 24 Jun 2016 | 7:31:29 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 29 Jun 2016 | 7:04:56 UTC
Run time 1.22
CPU time 1.22
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v8.48 (cuda65)


These WUs are bad. Please cancel them.
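
The "0x fails, 1x/2x works" pattern reported in this thread can be checked by tallying outcomes per name prefix. A small sketch using only the task names and outcomes posted above; the meaning of the name fields is not stated anywhere in the thread, so the regex just takes the digits before the first "x" as the prefix, and `tally_by_prefix` is a hypothetical helper:

```python
import re
from collections import Counter, defaultdict

# (task name, outcome) pairs as reported by posters in this thread.
REPORTS = [
    ("0x20-GERARD_CXCL12LOCKMONO-0-3-RND0335_2", "error"),
    ("0x23-GERARD_CXCL12LOCKMONO-0-3-RND1112_0", "error"),
    ("0x83-GERARD_CXCL12LOCKMONO-0-3-RND0285_1", "error"),
    ("3x96-GERARD_CXCL12LOCKMONO-0-3-RND9182_0", "error"),
    ("1x28-GERARD_CXCL12LOCKMONO-0-3-RND5689_0", "success"),
    ("2x67-GERARD_CXCL12LOCKMONO-0-3-RND9322_0", "success"),
    ("0x54-GERARD_CXCL12LOCKMONO-0-3-RND3534_2", "error"),
    ("0x0-GERARD_CXCL12LOCKMONO-0-3-RND6293_6", "error"),
    ("0x47-GERARD_CXCL12LOCKMONO-0-3-RND0003_4", "error"),
]

# Capture the leading "<digits>x" part of the name, e.g. "0x" or "1x".
NAME_RE = re.compile(r"^(\d+x)\d+-GERARD_CXCL12LOCKMONO-")


def tally_by_prefix(reports):
    """Count outcomes per name prefix; names that do not match are skipped."""
    tally = defaultdict(Counter)
    for name, outcome in reports:
        match = NAME_RE.match(name)
        if match is None:
            continue
        tally[match.group(1)][outcome] += 1
    return tally


for prefix, counts in sorted(tally_by_prefix(REPORTS).items()):
    print(prefix, dict(counts))
```

On this small sample every 0x task errored, though the one 3x task reported above also failed with -97 after running for 10,433 seconds, so the prefix is not the only way these can fail.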

