Message boards : Number crunching : Lost 8 hours of work on a long WU why?

Profile BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 932,632,653
RAC: 3,578,513
Message 48801 - Posted: 3 Feb 2018 | 15:23:50 UTC

While looking at BoincTasks, I noticed that BOINC was not running on one of my systems. I logged into that Win10 x64 system and checked the Event Viewer, but did not see a problem in any of its logs.

I brought up boinc and then checked back with Boinctasks.

GPUGrid was at 0%, but I noticed the elapsed time was over 8 hours, of which 2 were CPU time. Something seems wrong. Here is a snapshot I just took, about an hour after restarting. This system has a GTX 1070, and a GPUGrid task typically finishes after 10 or 11 hours.


Note the 9:21 is elapsed time and the 10:09 is remaining time. There should be only about 2 hours remaining.

I would like to debug this (I am retired with obviously nothing else to do) and started poking around and had a few questions.

1. Who actually does the checkpoint?

Is it the running GPUGrid task, or does the running task pass this off to BOINC to handle? Possibly BOINC died 9 hours earlier and GPUGrid requested a boatload of checkpoints, but its requests were not serviced.

2. If BOINC died, surely the Event Viewer would have some record of it. This happens very rarely, but I have seen it occasionally. I would hate to enable debugging in cc_config to try to catch something so rare, and I would not know which flag to enable.

3. While poking around, I happened to look at service request and service reply and spotted something strange. I cannot account for this and will post over at boinc to get an answer if no one here knows.

here are 3 lines of the service request xml from gpugrid. Note that grcpool is identified as a project. It is actually a manager.


    <global_preferences>
    <source_project>https://grcpool.com/</source_project>
    <mod_time>0.000000</mod_time>
    <battery_charge_min_pct>90.000000</battery_charge_min_pct>



here is the reply from grcpool back to gpugrid with the corresponding pieces from the xml


    <source_project>http://www.worldcommunitygrid.org/</source_project>
    <source_scheduler>https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi</source_scheduler>

    <mod_time>1503442910</mod_time>
    <run_on_batteries>0</run_on_batteries>



Note that grcpool thinks it is replying back to WCG!!!!!

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 1898
Credit: 12,083,448,769
RAC: 2,537,221
Message 48802 - Posted: 3 Feb 2018 | 16:13:40 UTC - in response to Message 48801.

1. Who actually does the checkpoint?
The GPUGrid app does the checkpoint. When you have a very fast GPU, it checkpoints so frequently that the (Windows) disk cache never writes the actual data to the drive (this could be resolved by disabling write caching on the BOINC drive). This behavior results in an error only if the processing of a GPUGrid task is interrupted unexpectedly (by a power failure or a system hang).
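
To illustrate the difference between a checkpoint that only reaches the disk cache and one that is forced onto the drive, here is a minimal Python sketch. It is not the GPUGrid app's code; the function name write_checkpoint and the file name restart.chk are made up for the example. It writes to a temporary file, asks the OS to flush it to the device, and only then swaps it in place of the old checkpoint.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` with `data`, flushing to disk first."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to the drive
        # Swap the new checkpoint in; the old one stays intact until this point.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    write_checkpoint("restart.chk", b"step=12345")
```

Without the flush and fsync calls, the data may sit in the write cache for a while, which is the situation described above: the app believes it has checkpointed, but a sudden power loss or hang can leave nothing usable on the drive.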

here are 3 lines of the service request xml from gpugrid. Note that grcpool is identified as a project. It is actually a manager.

    <global_preferences>
    <source_project>https://grcpool.com/</source_project>
    <mod_time>0.000000</mod_time>
    <battery_charge_min_pct>90.000000</battery_charge_min_pct>



here is the reply from grcpool back to gpugrid with the corresponding pieces from the xml


    <source_project>http://www.worldcommunitygrid.org/</source_project>
    <source_scheduler>https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi</source_scheduler>

    <mod_time>1503442910</mod_time>
    <run_on_batteries>0</run_on_batteries>



Note that grcpool thinks it is replying back to WCG!!!!!

No, it says that your most recent computing preferences are on WCG.
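
The two mod_time values in those snippets can be compared directly. A small Python sketch, assuming mod_time is a Unix timestamp in seconds and that the preference set with the larger mod_time is the one that wins (the helper names below are made up for illustration):

```python
from datetime import datetime, timezone


def mod_time_to_utc(mod_time: float) -> datetime:
    """Interpret a mod_time value as seconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(mod_time, tz=timezone.utc)


def newer_source(request_mod_time: float, reply_mod_time: float) -> str:
    """Say which side carries the more recently modified preferences."""
    return "reply" if reply_mod_time > request_mod_time else "request"


if __name__ == "__main__":
    request_mod_time = 0.000000    # from the request snippet
    reply_mod_time = 1503442910    # from the reply snippet
    print(mod_time_to_utc(reply_mod_time).isoformat())
    print(newer_source(request_mod_time, reply_mod_time))
```

The request carries a mod_time of 0, so any preferences with a real timestamp, such as the WCG ones dated August 2017, count as more recent.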

Profile BeemerBiker
Joined: 31 Oct 08
Posts: 102
Credit: 932,632,653
RAC: 3,578,513
Message 48803 - Posted: 3 Feb 2018 | 16:19:45 UTC - in response to Message 48802.
Last modified: 3 Feb 2018 | 16:42:31 UTC

WCG is not on any of my systems, nor has it been since I switched to grcpool. Possibly grcpool uses WCG for all its settings, which would not surprise me, as I have no control over any project parameters, unlike with BAM!, where I can log into the project and change things. That cannot be done on grcpool.

[EDIT] Assuming there is no hardware problem, why did GPUGrid not pick up at a recent checkpoint? I clearly saw it start at 0% progress and slowly work its way up, even with over 8 hours of existing "progress". OTOH, it is conceivable that a hardware problem, such as the GTX 1070 clock dropping to 300 MHz, could show very little progress even after 8 hours. I have not seen a problem like that on this system. Surely, in 8 hours, at least one checkpoint was flushed out of the cache, I would think.

Also, I went back over several days in the event log and did not see any restart. About the time that I noticed that BOINC was not running, there was a minor Windows update. The update did not require a reboot.
