Message boards : Number crunching : Problem with PABLO tasks

robertmiles
Joined: 16 Apr 09
Posts: 402
Credit: 169,545,796
RAC: 311,149
Level: Ile
Message 47698 - Posted: 27 Jul 2017 | 21:22:02 UTC
Last modified: 27 Jul 2017 | 21:23:12 UTC

I'm seeing a problem with tasks that have PABLO in their names. More than half complete without problems, but the remainder seem to go into an endless loop that stops them from writing any more checkpoints or making any more progress. Estimated remaining time eventually drops to zero without changing the progress percentage.

Workaround that lets progress resume - suspend the task for at least one minute. Then resume the task. Expect most of the elapsed time to be lost when this is done, but progress then resumes.

A task where this happened:
http://www.gpugrid.net/result.php?resultid=16421650

Computer where this happened:
http://www.gpugrid.net/show_host_detail.php?hostid=422382

A wingmate for this workunit got this error:
The simulation has become unstable. Terminating to avoid lock-up (1)

I've seen the problem on one or two tasks before, but did not save enough information about those tasks to tell you which ones.
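
The suspend-and-resume workaround described in this post can also be scripted. The following is a minimal sketch in Python, assuming the `boinccmd` command-line tool that ships with BOINC is on the PATH and that its `--task <project_url> <task_name> suspend|resume` form applies to the installed version. The function names and the one-minute default are illustrative choices, not something taken from the project.

```python
import subprocess
import time


def boinccmd_task_op(project_url: str, task_name: str, op: str) -> list[str]:
    """Build the boinccmd command line for one task operation."""
    if op not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def suspend_then_resume(project_url, task_name, pause_seconds=60,
                        run=subprocess.run, sleep=time.sleep):
    """Suspend a task, wait, then resume it.

    `run` and `sleep` are injectable so the sequence can be checked
    without a BOINC client present (an assumption made for this sketch).
    """
    run(boinccmd_task_op(project_url, task_name, "suspend"), check=True)
    sleep(pause_seconds)  # the post suggests waiting at least one minute
    run(boinccmd_task_op(project_url, task_name, "resume"), check=True)
```

The project URL passed in has to match the one the client is attached to, and the task name is the one shown in the BOINC Manager task list. As the post notes, most of the elapsed time since the last checkpoint is likely to be lost.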

Richard Haselgrove
Joined: 11 Jul 09
Posts: 790
Credit: 1,424,289,095
RAC: 1,369,661
Level: Met
Message 47700 - Posted: 27 Jul 2017 | 21:34:29 UTC - in response to Message 47698.

I think this is more common on Windows 10.

I haven't encountered it on Windows 7 yet (or on my single Windows 10 machine, come to think of it).

robertmiles
Message 47737 - Posted: 4 Aug 2017 | 0:19:50 UTC

Another PABLO task that seems to go into an endless loop, unless suspended, losing nearly a full day of compute time:

http://www.gpugrid.net/result.php?resultid=16435516

http://www.gpugrid.net/workunit.php?wuid=12651837

Do these tasks have enough debugging enabled to show the cause of the endless loop? The slot directory does not appear to contain a text file showing anything relevant.

Running under 64-bit Windows 10. Problem does not happen on all of the PABLO tasks.

Jacob Klein
Joined: 11 Oct 08
Posts: 1068
Credit: 1,147,847,239
RAC: 1,038,161
Level: Met
Message 47745 - Posted: 6 Aug 2017 | 1:24:15 UTC
Last modified: 6 Aug 2017 | 1:28:54 UTC

This may be a shot in the dark, but ...

Do you have anything BOINC-related, in the following folder:
C:\Users\{username}\AppData\Local\VirtualStore\

I have seen some GPUGrid strangeness at one time, where I was playing with compatibility modes, and Windows created files in that "VirtualStore" folder that ... get used, instead of the normal (C:\Program Files\BOINC\) files. Worse yet, BOINC-related "VirtualStore" folders won't get properly cleaned by BOINC!

In my case, at that time, my tasks were erroneously insta-completing.

Anyway .. So, do you have anything in that "VirtualStore" folder?
If you do have BOINC-related stuff in there, try closing BOINC then removing the BOINC-related stuff then restarting BOINC.
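
The VirtualStore check suggested above can be done by hand in Explorer, or with a short script. This is a minimal sketch in Python, assuming the folder lives under the `LOCALAPPDATA` environment variable (which normally points at `C:\Users\{username}\AppData\Local`) and that "BOINC-related" can be approximated by a file or folder name containing "boinc". Both the function name and that matching rule are assumptions made for illustration.

```python
import os
from pathlib import Path


def find_boinc_entries(virtual_store: Path) -> list[Path]:
    """Return files and folders under virtual_store whose name contains 'boinc'."""
    if not virtual_store.is_dir():
        return []
    return sorted(
        entry for entry in virtual_store.rglob("*")
        if "boinc" in entry.name.lower()  # case-insensitive name match
    )


if __name__ == "__main__":
    local_app_data = os.environ.get("LOCALAPPDATA")
    if local_app_data:
        for entry in find_boinc_entries(Path(local_app_data) / "VirtualStore"):
            print(entry)
```

The script only lists candidates. Removing anything it finds is still best done manually with BOINC closed, as described in the post.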

robertmiles
Message 47746 - Posted: 6 Aug 2017 | 3:22:43 UTC - in response to Message 47745.

> This may be a shot in the dark, but ...
>
> Do you have anything BOINC-related, in the following folder:
> C:\Users\{username}\AppData\Local\VirtualStore\
>
> [snip]

There are a number of files there, but none appear to be BOINC-related.

Jacob Klein
Message 47747 - Posted: 6 Aug 2017 | 3:54:02 UTC - in response to Message 47746.

What setting are you using for:
"Use at most X% of CPU time"

If you're not using 100%, can you try it and see if it fixes the problem?

robertmiles
Message 47748 - Posted: 6 Aug 2017 | 4:39:06 UTC - in response to Message 47747.

> What setting are you using for:
> "Use at most X% of CPU time"
>
> If you're not using 100%, can you try it and see if it fixes the problem?

100%

robertmiles
Message 47750 - Posted: 6 Aug 2017 | 22:13:48 UTC
Last modified: 6 Aug 2017 | 22:15:21 UTC

Another task with the problem:

http://www.gpugrid.net/result.php?resultid=16442636
http://www.gpugrid.net/workunit.php?wuid=12657605

I've bought a GTX 1080, and have told BOINC not to download any more GPU workunits for now so all of them can finish before I install the new graphics board tomorrow.

robertmiles
Message 47797 - Posted: 21 Aug 2017 | 14:42:31 UTC

Now using the GTX 1080, which appears to have stopped the problem.

Some PABLO tasks run for an unexpectedly long time now, but they finish and verify properly.

robertmiles
Message 47855 - Posted: 8 Sep 2017 | 21:04:11 UTC
Last modified: 8 Sep 2017 | 21:08:52 UTC

Another PABLO task that apparently went into an endless loop:

http://www.gpugrid.net/workunit.php?wuid=12708240
http://www.gpugrid.net/result.php?resultid=16504434

It reached:
55.160% progress, 1d 10:29:13 elapsed, --- remaining
no change to these numbers other than elapsed for at least 12 hours

using GTX 1080 with 385.28 driver, and i7-5980X CPU, BOINC 7.6.33

I suspended it and installed the 385.41 driver (without the 3D sections).

It now indicates 55.160% progress, 03:38:48 elapsed, 04:03:15 remaining. Running has not resumed - BOINC appears to be catching up on other GPU work.

This suggests that the problem is soon after writing a checkpoint, but before anything that does the next progress increase. Resuming from a checkpoint instead does not appear to give the problem.

NO error messages on the screen, or in any file in the slot that looked likely to be a non-empty text file.
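
The symptom reported here (elapsed time keeps growing while the progress percentage stays fixed) can be expressed as a simple check over periodic readings. This is a minimal sketch in Python, assuming the readings are collected by some other means, for example noted by hand from BOINC Manager. The function names and the sample values in the usage note are illustrative, not data from the affected task.

```python
def stalled_seconds(samples: list[tuple[float, float]]) -> float:
    """Return how long the latest progress value has been unchanged.

    samples: (elapsed_seconds, progress_percent) pairs in chronological order.
    """
    if not samples:
        return 0.0
    last_time, last_progress = samples[-1]
    unchanged_since = last_time
    # Walk backwards until the progress value differs from the latest one.
    for elapsed, progress in reversed(samples):
        if progress != last_progress:
            break
        unchanged_since = elapsed
    return last_time - unchanged_since


def is_stalled(samples, threshold_seconds=3600.0) -> bool:
    """True when progress has not moved for at least threshold_seconds."""
    return stalled_seconds(samples) >= threshold_seconds
```

A call such as `is_stalled([(0, 50.0), (600, 55.16), (8000, 55.16)])` would flag a task whose percentage has not moved for more than an hour. The one-hour default threshold is arbitrary and should be longer than the normal gap between progress updates for the task type.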

robertmiles
Message 47859 - Posted: 9 Sep 2017 | 14:53:44 UTC - in response to Message 47855.

> Another PABLO task that apparently went into an endless loop:
>
> http://www.gpugrid.net/workunit.php?wuid=12708240
> http://www.gpugrid.net/result.php?resultid=16504434
>
> [snip]

It finally resumed from the checkpoint, and started updating the progress percentage about once a second. The task completed within a few hours, was reported, and has now validated.

This suggests that also enabling debug output between resuming from the checkpoint file and first update of the progress percentage would allow comparing the failed first try to the second try that worked better.

Variable
Joined: 20 Nov 13
Posts: 20
Credit: 151,699,255
RAC: 188
Level: Ile
Message 47860 - Posted: 11 Sep 2017 | 15:36:38 UTC

I have also had this problem using my 1070, on Win7. GPUgrid tasks will compute to halfway or so and then stop for hours. The card is dedicated to GPUgrid only so no other projects are competing with it for compute time. The core load of the card indicates it is just sitting idle. Exiting BOINC and restarting it will resume computation on the task, as will suspending it and then resuming. It's happening pretty frequently, every 1-2 days.

I have the most recent BOINC version, but I have not updated graphics drivers in a while so my next step is to try that.

Wiyosaya
Joined: 22 Nov 09
Posts: 111
Credit: 171,819,453
RAC: 392,567
Level: Ile
Message 47861 - Posted: 12 Sep 2017 | 17:27:51 UTC - in response to Message 47748.
Last modified: 12 Sep 2017 | 17:33:45 UTC

> > What setting are you using for:
> > "Use at most X% of CPU time"
> >
> > If you're not using 100%, can you try it and see if it fixes the problem?
>
> 100%

Not sure whether this will help, but I just got a 1060 up last night on Win 10 x64, and noted that while the elapsed time kept incrementing, the percent done stopped. This was after it had crunched about 2.5 percent of the WU. I looked into the BOINC client log and there was a message in there that said "CPU busy, suspending work" or something like that. I was using my computer at the time for non-CPU-intensive things like running my web browser.

I checked "Suspend work when non-BOINC CPU usage is above" in "When and how BOINC uses your computer" under "Preferences" on my account page and noted it was set to 80%. I then set it to 0 which means to run BOINC projects all the time regardless of host CPU usage.

I suggest checking that setting.

I then did an Update on GPUGrid, and it still did not restart. So I exited BOINC and restarted, and the problem seemed to disappear - that is, I was still using my computer, however, the task did not suspend and ran to completion in a timely manner. Perhaps this is coincidental, IDK. However, I have a 6-core processor and there is no way that non-BOINC total core/thread usage was 80% or above at the time, unless the browser briefly spun up 11 or 12 threads.

The next time I get a PABLO, I will watch for this again. If it is not coincidental and the job of checking that setting is in each individual client's code, then perhaps there is a bug in that code with PABLO units. If the code is in BOINC, then perhaps there is a bug there.

robertmiles
Message 47862 - Posted: 12 Sep 2017 | 21:07:06 UTC - in response to Message 47861.

> [snip]
>
> I checked "Suspend work when non-BOINC CPU usage is above" in "When and how BOINC uses your computer" under "Preferences" on my account page and noted it was set to 80%. I then set it to 0 which means to run BOINC projects all the time regardless of host CPU usage.
>
> I suggest checking that setting.
>
> [snip]

I don't see a "Suspend work when non-BOINC CPU usage is above" setting, but I would have set it to off. My computer has 8 physical cores plus hyperthreading, which makes it behave like it has 16 cores, and I've found that limiting the number of cores BOINC can use gives better results than limiting the percentage of CPU time it can use on each core. 14 cores are allowed for CPU tasks, leaving one for GPU tasks and one for non-BOINC programs.


Using BOINC 7.6.33 under Windows 10.

Wiyosaya
Message 47863 - Posted: 12 Sep 2017 | 21:24:01 UTC - in response to Message 47862.

Interesting to know of your experience, and it is also interesting that you do not see this setting.

For me, it is under -
1. My Account
2. Preferences section
3. "When and how BOINC uses your computer"
4. Click on "Computing Preferences" which is on the same line as "When and how BOINC uses your computer"
5. "Processor Usage"
6. In the "Processor Usage" section there is an entry "Suspend work when non-BOINC CPU usage is above"

If you also observe that this problem happens again, I suggest checking BOINC's Activity Log for a message similar to the one I found. To me, finding the same message would be a strong indicator that the same thing is happening on your system.
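
The setting located through the steps above can also be checked in the preferences XML that the client keeps locally. This is a minimal sketch in Python, assuming the preference is stored as a `suspend_cpu_usage` element (to my understanding, the name BOINC uses in its global preferences file) and that a value of 0 means the client never suspends because of non-BOINC CPU load. The sample XML in the usage note is hand-written for illustration and was not captured from any machine in this thread.

```python
import xml.etree.ElementTree as ET


def cpu_busy_threshold(prefs_xml: str) -> float:
    """Return the non-BOINC CPU usage threshold in percent (0.0 if absent)."""
    root = ET.fromstring(prefs_xml)
    node = root.find(".//suspend_cpu_usage")  # assumed element name
    if node is None or not node.text:
        return 0.0
    return float(node.text)


def suspends_when_busy(prefs_xml: str) -> bool:
    """True when the client is configured to pause work under outside CPU load."""
    return cpu_busy_threshold(prefs_xml) > 0.0
```

For a document like `<global_preferences><suspend_cpu_usage>80</suspend_cpu_usage></global_preferences>`, `suspends_when_busy` is true, which matches the 80% value reported earlier in the thread. Local overrides set in BOINC Manager may take precedence over the website value, so both places are worth checking.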

robertmiles
Message 47864 - Posted: 12 Sep 2017 | 21:27:44 UTC - in response to Message 47863.

> Interesting to know of your experience, and it is also interesting that you do not see this setting.
>
> [snip]

I finally found it. It was turned on, so I turned it off.

Wiyosaya
Message 47865 - Posted: 12 Sep 2017 | 21:43:47 UTC - in response to Message 47864.
Last modified: 12 Sep 2017 | 21:44:26 UTC

> > Interesting to know of your experience, and it is also interesting that you do not see this setting.
> >
> > [snip]
>
> I finally found it. It was turned on, so I turned it off.

Awesome! So we may have found the problem where these tasks seem to suspend and then not resume. I was reading another thread, and it seems like it may not be specific to PABLO tasks.

It will be interesting to know if you see it again. If I do, I will post to this thread.

hsdecalc
Joined: 5 Jul 15
Posts: 2
Credit: 52,962,550
RAC: 49,577
Level: Thr
Message 48162 - Posted: 12 Nov 2017 | 21:00:53 UTC
Last modified: 12 Nov 2017 | 21:05:05 UTC

I have had this problem many times. Win 10, latest Nvidia driver etc.
The GPU runs out of work while the process is still active.
I ticked the task_debug option in the message options...
Output is:

12.11.2017 16:11:47 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:11:54 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:01 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:08 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:15 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:12:24 | | Suspending computation - CPU is busy
12.11.2017 16:12:24 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
12.11.2017 16:12:24 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
12.11.2017 16:12:44 | | Resuming computation
12.11.2017 16:12:44 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:12:44 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
12.11.2017 16:12:48 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:04 | | Suspending computation - CPU is busy
12.11.2017 16:13:04 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (left in memory)
12.11.2017 16:13:04 | GPUGRID | [task] task_state=SUSPENDED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from suspend
12.11.2017 16:13:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:12 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:13:14 | | Resuming computation
12.11.2017 16:13:14 | GPUGRID | [cpu_sched] Resuming e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:13:14 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from unsuspend
12.11.2017 16:19:54 | | Suspending GPU computation - user request
12.11.2017 16:19:54 | GPUGRID | [cpu_sched] Preempting e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 (removed from memory)
12.11.2017 16:19:54 | GPUGRID | [task] task_state=QUIT_PENDING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from request_exit()
12.11.2017 16:19:54 | | request_exit(): PID 13004 has 1 descendants
12.11.2017 16:19:54 | | PID 5096
12.11.2017 16:20:02 | | Resuming GPU computation
12.11.2017 16:20:55 | GPUGRID | [task] quit request timed out, killing task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0
12.11.2017 16:20:56 | GPUGRID | [task] Process for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 exited, exit code 0, task state 8
12.11.2017 16:20:56 | GPUGRID | [task] task_state=UNINITIALIZED for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from handle_exited_app
12.11.2017 16:20:56 | GPUGRID | [task] task_state=EXECUTING for e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 from start
12.11.2017 16:20:56 | GPUGRID | [cpu_sched] Restarting task e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 using acemdlong version 918 (cuda80) in slot 0
12.11.2017 16:21:06 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:13 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:20 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:27 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:35 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed
12.11.2017 16:21:42 | GPUGRID | [task] result e212s6_e132s33p0f417-PABLO_P61244_0_IDP-0-1-RND8388_0 checkpointed

At the second resume at 16:13 the GPU performance was lost.
I stopped GPU usage at 16:19 and resumed at 16:20.
The job worked again.
The difference I found was that the task was removed from memory (16:19).
Too bad that there is no solution.

There is also a checkpoint every 15 seconds. I think it's usually 5-15 min.
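
The log excerpt above can be summarized with a small parser, which makes it easier to compare checkpoint spacing and "CPU is busy" pauses across runs. This is a minimal sketch in Python, assuming every line follows the `DD.MM.YYYY HH:MM:SS | project | message` layout seen in the excerpt. The function names are illustrative, and other locales or BOINC versions may format the timestamp differently.

```python
from datetime import datetime


def parse_line(line: str):
    """Split one event log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), project, message


def checkpoint_intervals(lines) -> list[float]:
    """Seconds between consecutive 'checkpointed' messages."""
    times = [t for t, _, msg in map(parse_line, lines)
             if msg.endswith("checkpointed")]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


def cpu_busy_pauses(lines) -> list[float]:
    """Duration in seconds of each 'CPU is busy' suspension."""
    pauses, started = [], None
    for stamp, _, msg in map(parse_line, lines):
        if msg == "Suspending computation - CPU is busy":
            started = stamp
        elif msg == "Resuming computation" and started is not None:
            pauses.append((stamp - started).total_seconds())
            started = None
    return pauses
```

Feeding the function the lines from 16:12:15 to 16:12:48 of the excerpt gives one pause of 20 seconds. Only whole-client "Resuming computation" lines close a pause; the separate "Resuming GPU computation" message is deliberately not matched.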

robertmiles
Message 48164 - Posted: 13 Nov 2017 | 4:16:05 UTC

hsdecalc, I haven't seen the problem lately. My recent changes include installing BOINC 7.8.3, and setting the number of CPU cores BOINC is allowed to use to two less than the total number present.

robertmiles
Message 48166 - Posted: 13 Nov 2017 | 5:36:33 UTC - in response to Message 47863.

hsdecalc, another recent change, probably made after the last time I saw the problem, was turning off the setting described in the directions quoted here.

Note that you must be in the advanced view, not the simple view, to follow the directions below. Click on View to start changing which view you have.

> Interesting to know of your experience, and it is also interesting that you do not see this setting.
>
> For me, it is under -
> 1. My Account
> 2. Preferences section
> 3. "When and how BOINC uses your computer"
> 4. Click on "Computing Preferences" which is on the same line as "When and how BOINC uses your computer"
> 5. "Processor Usage"
> 6. In the "Processor Usage" section there is an entry "Suspend work when non-BOINC CPU usage is above"
>
> If you also observe that this problem happens again, I suggest checking BOINC's Activity Log for a message similar to the one I found. To me, finding the same message would be a strong indicator that the same thing is happening on your system.

hsdecalc
Message 48186 - Posted: 14 Nov 2017 | 9:10:56 UTC - in response to Message 48166.

Thanks for the reply.
The above procedure is a workaround, not a solution, for me. I sometimes have high CPU usage from video playback, so I need the pause option.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 1844
Credit: 10,648,138,594
RAC: 9,965,763
Level: Trp
Message 48224 - Posted: 21 Nov 2017 | 15:04:21 UTC - in response to Message 48186.

> Thanks for the reply.
> The above procedure is a workaround, not a solution, for me. I sometimes have high CPU usage from video playback, so I need the pause option.

Then you should add those games and applications to the exclusive applications list in the settings.
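
The exclusive applications list mentioned above is normally edited in BOINC Manager, but to my understanding the same entries can be placed in the client's `cc_config.xml` as `<exclusive_app>` and `<exclusive_gpu_app>` options. This is a minimal sketch in Python that builds such a fragment. The executable name in the usage note is a hypothetical example, and the generated XML should be compared against the client configuration documentation before use.

```python
import xml.etree.ElementTree as ET


def exclusive_app_config(cpu_apps, gpu_apps) -> str:
    """Build a cc_config.xml document listing exclusive applications.

    cpu_apps: executables that should pause all BOINC work while running.
    gpu_apps: executables that should pause only GPU work while running.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for name in cpu_apps:
        ET.SubElement(options, "exclusive_app").text = name
    for name in gpu_apps:
        ET.SubElement(options, "exclusive_gpu_app").text = name
    return ET.tostring(root, encoding="unicode")
```

Calling `exclusive_app_config([], ["videoplayer.exe"])` would produce a fragment that asks the client to pause GPU work while that (hypothetical) player is running. The design difference from the CPU-usage threshold is that the client reacts to a named program being present rather than to a load percentage.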
