Message boards : Number crunching : ATM work units "bomb out"

ARC3670
Joined: 25 Feb 22
Posts: 5
Credit: 33,447,735
RAC: 54,994
Message 61381 - Posted: 6 Mar 2024 | 23:18:45 UTC
Last modified: 6 Mar 2024 | 23:28:32 UTC

Below is an event log excerpt from a recent attempt to run a work unit. I enrolled my PC a long time ago and I'm pretty sure not one work unit has ever completed successfully, unless they're supposed to run in under two minutes. Over that time I've run different versions of Windows, different Nvidia hardware and driver versions, and different security software. None of that seems to make any difference. I have been reading anything that seemed relevant on this forum and haven't found anyone reporting this problem who actually found a solution. My PC does run asteroids@home GPU units without any issues.

3/6/2024 3:32:09 PM | GPUGRID | Scheduler request completed: got 1 new tasks
3/6/2024 3:32:09 PM | GPUGRID | Project requested delay of 11 seconds
3/6/2024 3:32:11 PM | GPUGRID | Started download of Bace_m26_m15_2-QUICO_ATM_XFF-2-input
3/6/2024 3:32:11 PM | GPUGRID | Started download of Bace_m26_m15_2-QUICO_ATM_XFF-2-Bace_m26_m15_2-QUICO_ATM_XFF-1-7-RND5798_1
3/6/2024 3:33:53 PM | GPUGRID | Finished download of Bace_m26_m15_2-QUICO_ATM_XFF-2-input (5629576 bytes)
3/6/2024 3:44:40 PM | GPUGRID | Finished download of Bace_m26_m15_2-QUICO_ATM_XFF-2-Bace_m26_m15_2-QUICO_ATM_XFF-1-7-RND5798_1 (72249701 bytes)
3/6/2024 3:44:46 PM | GPUGRID | Starting task Bace_m26_m15_2-QUICO_ATM_XFF-2-7-RND5798_1
3/6/2024 3:46:15 PM | GPUGRID | Computation for task Bace_m26_m15_2-QUICO_ATM_XFF-2-7-RND5798_1 finished
3/6/2024 3:46:15 PM | GPUGRID | Output file Bace_m26_m15_2-QUICO_ATM_XFF-2-7-RND5798_1_0 for task Bace_m26_m15_2-QUICO_ATM_XFF-2-7-RND5798_1 absent
3/6/2024 3:46:16 PM | GPUGRID | Started upload of Bace_m26_m15_2-QUICO_ATM_XFF-2-7-RND5798_1_1
3/6/2024 3:48:01 PM | GPUGRID | Sending scheduler request: To fetch work.
3/6/2024 3:48:01 PM | GPUGRID | Reporting 1 completed tasks
3/6/2024 3:48:01 PM | GPUGRID | Requesting new tasks for NVIDIA GPU
3/6/2024 3:48:07 PM | GPUGRID | Scheduler request completed: got 0 new tasks
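
For anyone wanting to check how long a task actually ran before it failed, here is a minimal Python sketch that reads event log lines like the ones above and reports the time between "Starting task" and "Computation for task ... finished". It assumes the `M/D/YYYY h:mm:ss AM/PM` timestamp format seen in this excerpt (other Windows locales will format dates differently), and the function names `parse_line` and `task_runtimes` are made up for illustration.

```python
from datetime import datetime

# Two lines copied from the event log excerpt above.
LOG = """\
3/6/2024 3:44:46 PM | GPUGRID | Starting task Bace_m26_m15_2-QUICO_ATM_XFF-2-7-RND5798_1
3/6/2024 3:46:15 PM | GPUGRID | Computation for task Bace_m26_m15_2-QUICO_ATM_XFF-2-7-RND5798_1 finished
"""

def parse_line(line):
    """Split 'timestamp | project | message' into (datetime, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    # Assumption: US-style timestamps, as in the excerpt.
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"), project, message

def task_runtimes(log_text):
    """Return {task_name: seconds between 'Starting task' and 'finished'}."""
    started, runtimes = {}, {}
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, _project, message = parse_line(line)
        if message.startswith("Starting task "):
            started[message[len("Starting task "):]] = when
        elif message.startswith("Computation for task ") and message.endswith(" finished"):
            name = message[len("Computation for task "):-len(" finished")]
            if name in started:
                runtimes[name] = (when - started[name]).total_seconds()
    return runtimes

for name, seconds in task_runtimes(LOG).items():
    print(f"{name}: {seconds:.0f} s")
```

For the excerpt above this gives 89 seconds, which matches the "under two minutes" observation.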

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1065
Credit: 40,231,533,983
RAC: 22,690
Message 61382 - Posted: 7 Mar 2024 | 0:01:49 UTC - in response to Message 61381.

unhide your system so we can properly look at the host details and the task errors.

the BOINC event log messages you've posted don't tell you anything.

Keith Myers
Joined: 13 Dec 17
Posts: 1332
Credit: 7,157,017,459
RAC: 14,603,110
Message 61383 - Posted: 7 Mar 2024 | 0:46:50 UTC

The output file absent message likely means an antivirus app is restricting access to the task slots.

But agree you need to unhide your computers so we can read the stderr.txt task result.

ARC3670
Message 61384 - Posted: 7 Mar 2024 | 1:16:37 UTC

I corrected the hidden PC issue and updated the project, but I don't know where I can view that information.

ARC3670
Message 61385 - Posted: 7 Mar 2024 | 1:22:29 UTC

I think I found the area here:

https://www.gpugrid.net/results.php?hostid=613766

ARC3670
Message 61386 - Posted: 7 Mar 2024 | 2:17:56 UTC - in response to Message 61385.

After reviewing the tasks that are being processed and returned with what looks like a computation error, it seems that it only appears that they failed. This task (https://www.gpugrid.net/result.php?resultid=34275043) appears to have been successful.

Keith Myers
Message 61389 - Posted: 7 Mar 2024 | 17:35:09 UTC - in response to Message 61386.
Last modified: 7 Mar 2024 | 17:37:48 UTC

There's not much helpful information in the stderr.txt output, since this is a Windows host.

You are getting the typical error message for Windows:

<message>
The operating system cannot run %1.
(0xc3) - exit code 195 (0xc3)</message>


So the task is unable to properly set up your task environment.

Your gpu also is on the weak side with its VRAM limit of only 8GB.

The tasks are very spiky in memory utilization, often exceeding 12GB, which will produce a more usable error message of "out of memory" on Linux hosts.

You got lucky with that one successful task.

I'd disable the ATM and QC tasks and wait for the acemd tasks to make an appearance again someday.

Or put the project on Suspend and move onto other projects where your gpu is more capable of running.

Ian&Steve C.
Message 61391 - Posted: 7 Mar 2024 | 18:06:48 UTC - in response to Message 61389.

he's running the ATM tasks, which don't use much VRAM, not like the QChem tasks. so he's fine on VRAM.

but that error is the common issue many others are having with the Windows application. I don't think anyone has narrowed down exactly why some hosts work on Windows and others don't.

[BAT] Svennemans
Joined: 27 May 21
Posts: 50
Credit: 701,342,017
RAC: 4,539,492
Message 61392 - Posted: 7 Mar 2024 | 19:48:52 UTC - in response to Message 61391.

Anybody having this error should try to capture run.log. I know of several ways to achieve that, but unfortunately I'm one of the lucky (or unlucky in this case) users who never get this error on Windows.

ARC3670
Message 61393 - Posted: 8 Mar 2024 | 0:48:51 UTC - in response to Message 61389.

It turns out I was wrong about one task completing. I clicked a link somewhere which showed the result of a different computer, not mine. When I click the properties button for the project in the BOINC manager there is a statistic: tasks completed: 0 - tasks failed: 141. So, batting .000 I guess. I have suspended for now. I'll watch this thread for a while and see if anyone suggests anything to try.

[BAT] Svennemans
Message 61394 - Posted: 8 Mar 2024 | 7:53:35 UTC - in response to Message 61393.


I suggest trying this:
1) suspend all your projects
2) edit or create cc_config.xml in the main BOINC directory (probably C:\ProgramData\BOINC - but your config may vary)
3) set 'exit_after_finish' switch to 1:

<cc_config>  <!-- mandatory section, so add it if you're creating this from scratch -->
  <log_flags>
    <!-- this section may or may not be there; ignore it -->
  </log_flags>
  <options>  <!-- this section may or may not be there; if it's not, CREATE IT -->
    <exit_after_finish>1</exit_after_finish>
    <!-- ignore all other flags that are already there -->
  </options>
</cc_config>

4) save and close
5) in boinc manager, select menu "options"=>"read config files" or just restart the boinc SERVICE (not just the manager)
6) unsuspend GPUGRID, wait for an ATM unit to run and finish. BOINC will exit immediately after it finishes
7) in the 'slots' directory (probably C:\ProgramData\BOINC\slots) there should be one or more subdirectories 0, 1, 2 etc. Locate the one that contains the ATM workunit. Copy the "run.log" file to some personal directory. Look for error messages at the end of this file and post them here
8) edit cc_config.xml and set the 'exit_after_finish' flag back to 0
9) restart BOINC
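
If editing the XML by hand feels error-prone, steps 2, 3, 7 and 8 could be scripted roughly as below. This is only a sketch, not an official BOINC tool: it assumes the default data directory mentioned above (C:\ProgramData\BOINC), the helper names `set_exit_after_finish` and `tail_run_logs` are hypothetical, and it needs to run with enough rights to write into that directory. Back up cc_config.xml first, because Python's ElementTree drops XML comments when it rewrites a file.

```python
import xml.etree.ElementTree as ET
from collections import deque
from pathlib import Path

# Assumed default data directory from the post above; yours may differ.
BOINC_DIR = Path(r"C:\ProgramData\BOINC")

def set_exit_after_finish(config_path, enabled):
    """Create or update cc_config.xml so <options><exit_after_finish> is 1 or 0."""
    config_path = Path(config_path)
    if config_path.exists():
        tree = ET.parse(config_path)
        root = tree.getroot()
    else:
        # No file yet: start from an empty <cc_config> root element.
        root = ET.Element("cc_config")
        tree = ET.ElementTree(root)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("exit_after_finish")
    if flag is None:
        flag = ET.SubElement(options, "exit_after_finish")
    flag.text = "1" if enabled else "0"
    tree.write(config_path)

def tail_run_logs(slots_dir, lines=20):
    """Return {path: last lines} for each run.log found directly in a slot directory."""
    result = {}
    for log in sorted(Path(slots_dir).glob("*/run.log")):
        with open(log, errors="replace") as fh:
            result[log] = list(deque(fh, maxlen=lines))
    return result
```

A possible way to use it: call `set_exit_after_finish(BOINC_DIR / "cc_config.xml", True)`, have BOINC re-read its config files, let an ATM unit run until BOINC exits, then print what `tail_run_logs(BOINC_DIR / "slots")` returns and post the relevant lines. Afterwards call `set_exit_after_finish(..., False)` and restart BOINC.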
