Message boards : Number crunching : What am I doing wrong

EdwardPF
Message 40296 - Posted: 1 Mar 2015 | 21:17:10 UTC

I now have about 15 WUs that spontaneously abort at the 1-hour mark.

For example:

Name: e10s3_e1s67f460-GERARD_CXCL12_LIG11_CGENFF2-1-2-RND2153_0
Workunit: 10707801
Created: 1 Mar 2015 | 1:31:51 UTC
Sent: 1 Mar 2015 | 8:34:50 UTC
Received: 1 Mar 2015 | 9:34:55 UTC
Server state: Over
Outcome: Abandoned
Client state: New
Exit status: 0 (0x0)
Computer ID: 191787
Report deadline: 6 Mar 2015 | 8:34:50 UTC
Run time: 0.00
CPU time: 0.00
Validate state: Initial
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.47 (cuda65)

They are running on an NVIDIA GTX 770.

Any ideas? This is on Windows 7.

What info do "you" need to help diagnose this?

Ed F


EdwardPF
Message 40297 - Posted: 1 Mar 2015 | 22:46:39 UTC
Last modified: 1 Mar 2015 | 23:03:02 UTC

(I don't see an "edit" option)

The most recent WU is now at 1:45 ... must have been a bad batch??

Ed F

Edit

nope ... this one died at 2:00

Ed F

Dayle Diamond
Message 40300 - Posted: 1 Mar 2015 | 23:57:31 UTC

They all show a status of "Abandoned". That's so weird!

Jacob Klein
Message 40301 - Posted: 1 Mar 2015 | 23:59:48 UTC
Last modified: 2 Mar 2015 | 0:00:15 UTC

I see that your computer's details show:
BOINC version 6.12.34
....

That is ANCIENT.

Please try the latest release, BOINC v7.4.36.
http://boinc.berkeley.edu/download.php

Retvari Zoltan
Message 40307 - Posted: 2 Mar 2015 | 11:28:05 UTC - in response to Message 40300.

Dayle Diamond wrote:
They all show a status of "Abandoned". That's so weird!

The root cause is probably some kind of access-rights problem with the BOINC work folder.
Is there more than one user on this PC?
Is BOINC installed as a system service (protected execution mode)?

Jacob Klein wrote:
I see that your computer's details show:
BOINC version 6.12.34
....

That is ANCIENT.

That's true, but it should still work under Windows 7 x64.
Until recently, I used 6.10.60 on my hosts. The only reason I updated was so I could run spare projects that use OpenCL.

Jacob Klein wrote:
Please try the latest release, BOINC v7.4.36.
http://boinc.berkeley.edu/download.php

Updating to this version is still a good idea, as this will update the folder access rights.

Richard Haselgrove
Message 40308 - Posted: 2 Mar 2015 | 12:11:13 UTC - in response to Message 40307.

No, 'Abandoned' is a server-only phenomenon - the tasks are marked thus in the server database record, but as the OP stated in the first post, the BOINC client locally knows nothing about this, and carries on processing - there are no permission problems locally.

In general, when this happens, local tasks continue running until the user notices or all tasks are completed. I'm wondering (and this is pure speculation) whether the 'spontaneous abort' is actually the regular once-per-hour scheduler request 'requested by project', which is specific to this project. If the scheduler reply says the work is no longer viable, that could trigger the abort. That could be checked in the Event Log.
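
If anyone wants to test that guess against their own Event Log, here is a minimal Python sketch. It only assumes that each log line looks like "DD-Mon-YYYY HH:MM:SS [Project] message" and that the hourly contact is logged with the words "Requested by project". The function name and the sample lines are invented for illustration and are not taken from the OP's log. Month abbreviations are parsed with the default English locale.

```python
import re
from datetime import datetime

# Assumed line layout: "DD-Mon-YYYY HH:MM:SS [Project] message"
LINE_RE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def project_requested_times(lines, project):
    """Return timestamps of scheduler requests logged as 'Requested by project'."""
    times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not fit the assumed layout
        stamp, proj, message = match.groups()
        if proj == project and "requested by project" in message.lower():
            times.append(datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S"))
    return times


# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    "01-Mar-2015 08:34:50 [GPUGRID] Sending scheduler request: To fetch work.",
    "01-Mar-2015 09:34:52 [GPUGRID] Sending scheduler request: Requested by project.",
    "01-Mar-2015 09:34:55 [GPUGRID] Scheduler request completed",
]

for when in project_requested_times(SAMPLE_LOG, "GPUGRID"):
    print(when.isoformat(sep=" "))
```

If the printed times line up with the moments the tasks disappeared, that would support the idea that the scheduler reply is what triggers the local abort.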

As to why the server is marking the tasks as abandoned - nobody really knows, and I'd appreciate more help in tracking it down. It's done by the function mark_results_over(), which is called in two places in sched/handle_request.cpp (and nowhere else). It's supposed to happen "when there's evidence that the host has detached.", or "If the [RPC] seqno from the host is less than what we expect, the user must have copied the state file to a different host". But it seems to happen more than that, and the finger of suspicion seems to point at communication problems between host and server resulting in RPC requests being processed out of order on the server.
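
To make the out-of-order theory concrete, here is a toy Python model of the sequence-number check. It is not the real scheduler code (that is C++ in sched/handle_request.cpp). The HostRecord class, the handle_request signature and the way the stored seqno is updated are simplifications I made up for the sketch. Only the rule "a seqno lower than expected means mark the results over" comes from the comments quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class HostRecord:
    """Hypothetical stand-in for the server's database row for one host."""
    rpc_seqno: int = 0
    in_progress: dict = field(default_factory=dict)  # task name -> outcome


def mark_results_over(host):
    """Mark every task the server believes this host holds as abandoned."""
    for name in host.in_progress:
        host.in_progress[name] = "abandoned"


def handle_request(host, request_seqno):
    """Process one scheduler request and apply the sequence-number check."""
    if request_seqno < host.rpc_seqno:
        # Lower than expected: the server concludes that the state file
        # was copied to a different host and gives up on the tasks.
        mark_results_over(host)
    # Simplification: remember the seqno of the request just processed.
    host.rpc_seqno = request_seqno


# The client sent requests 1, 2, 3 in order, but the server handled 3 before 2.
host = HostRecord(in_progress={"task_a": "in progress"})
for seqno in (1, 3, 2):
    handle_request(host, seqno)
print(host.in_progress)
```

In this model a single late-arriving request is enough to flip every in-progress task to abandoned, while the client, which sent everything in order, has no reason to notice.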

As to running BOINC v6.12.34, that's fine. I run it here too, because it's the last version allowed to run GPUs in Service Mode under Windows XP. Works fine.

EdwardPF
Message 40309 - Posted: 2 Mar 2015 | 15:41:17 UTC - in response to Message 40308.
Last modified: 2 Mar 2015 | 15:44:30 UTC

Well ... I have no idea ... but I removed the project and reconnected ...

I have completed 1 WU and am 1:45 into the next ... However WU 10709889 is nowhere to be seen ... must have fallen through the cracks during the disconnect??

Anyway ... all SEEMS to be well now ... I have no idea what went wrong ... but ... Thanks for the response!

Ed F

Retvari Zoltan
Message 40311 - Posted: 2 Mar 2015 | 16:20:55 UTC - in response to Message 40308.

Richard Haselgrove wrote:
As to why the server is marking the tasks as abandoned - nobody really knows, and I'd appreciate more help in tracking it down. It's done by the function mark_results_over(), which is called in two places in sched/handle_request.cpp (and nowhere else). It's supposed to happen "when there's evidence that the host has detached.", or "If the [RPC] seqno from the host is less than what we expect, the user must have copied the state file to a different host". But it seems to happen more than that, and the finger of suspicion seems to point at communication problems between host and server resulting in RPC requests being processed out of order on the server.

It happened on one of my dual-boot (WinXP/Win7) hosts when I tried to make the BOINC manager use the same working folder on both OSes. I managed to do it on my other, similar host by setting the proper access rights for the BOINC work folder (which is located on the Win7 partition on that host), but on the first host the ongoing GPUGrid workunits get abandoned whenever I boot into Win7 (the BOINC working folder is located on the WinXP partition on this host).
