Author |
Message |
|
Well, I have one of the new CELLGA_SHORT WUs running for over one day and 5 hours now, still showing 0% done. Is it safe to assume the CPU time is wasted and the WU will never complete? ;-) Should I abort it? It's this one - http://www.gpugrid.net/result.php?resultid=555615
____________
pixelicious.at - my little photoblog |
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08, Posts: 1006, Credit: 5,068,599, RAC: 0
|
Yes, please, abort it. It was expected to run for roughly 2 hours. Thanks for reporting the problem! |
|
|
|
Ok, thanks! Aborted it now after 1 day and 16 hours... My nice credits... ;-)
____________
pixelicious.at - my little photoblog |
|
|
|
Maybe it wasn't a problem with the task but with my PS3.
Now I got a TONI_CELLGA task that also seemed to hang because it showed no progress after 20 minutes, but after a reboot it is running fine now...
____________
pixelicious.at - my little photoblog |
|
|
|
Hmmm, after 20 minutes the next task was hanging, this time a TONI_CELLGA_MED - http://www.gpugrid.net/result.php?resultid=565176. I aborted it, and will try another task...
____________
pixelicious.at - my little photoblog |
|
|
GDF (Volunteer moderator, Project administrator, Project developer, Project tester, Volunteer developer, Volunteer tester, Project scientist)
Joined: 14 Mar 07, Posts: 1957, Credit: 629,356, RAC: 0
|
If a workunit hangs, you can just try restarting the machine.
The problem is usually not with the WU, but with the fact that your processor is reserved by another application.
gdf |
|
|
|
Thanks Gianni, but I think I just found the problem...
I just started my PS3 again, started BOINC and got a hanging task again. It ran for 2:37 minutes and then the CPU time stopped advancing, with 0% progress.
When I started BOINC it was running CPU benchmarks, but after a few minutes it was running the benchmarks again. The problem is boincmgr showed "suspending computation - running CPU benchmarks", but "top" still showed 154% CPU for cellmd2_5.03_po. Another thing I noticed was the time shown in boincmgr. First it was wrong and when it started the second benchmarks it showed the right time (see screenshot). So I don't know if the task was hanging because it wasn't suspended during benchmarks or because the system time has changed. Weird thing is NTP is turned on, and the system time should always be right...
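For anyone who wants to repeat the check described above without watching "top", here is a minimal sketch that looks for a science application still using CPU while BOINC claims it is suspended. It assumes a Linux system with the procps `ps` command; the function names, the 5% threshold and the sample output are made up for illustration. Note that `ps` reports CPU usage averaged over the process lifetime, not the instantaneous value that "top" shows, so it is only a rough indicator.

```python
import subprocess


def busy_processes(ps_output, name_prefix, threshold=5.0):
    """Return (cpu_percent, command) pairs for processes whose command
    starts with name_prefix and whose CPU usage exceeds threshold."""
    hits = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        try:
            cpu = float(parts[0])
        except ValueError:
            continue  # skips the "%CPU COMMAND" header line
        command = parts[1].strip()
        if command.startswith(name_prefix) and cpu > threshold:
            hits.append((cpu, command))
    return hits


def snapshot():
    """Capture the current process list (assumes procps `ps` is installed)."""
    result = subprocess.run(
        ["ps", "-eo", "pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample resembling the situation described in the post.
sample = "%CPU COMMAND\n 0.3 boinc\n 154 cellmd2_5.03_po\n"
still_running = busy_processes(sample, "cellmd2")
```

On a live system, `busy_processes(snapshot(), "cellmd2")` returning a non-empty list while boincmgr shows "suspending computation" would match the symptom reported here.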
I think I'll try to delete the whole boinc dir and redownload BOINC and if that doesn't help I'll try a new YDL 6.1 install...
____________
pixelicious.at - my little photoblog |
|
|
|
I have the same problem:
http://www.gpugrid.net/workunit.php?wuid=399536
http://www.gpugrid.net/workunit.php?wuid=405292
Lots of hours and energy wasted again.
I won't request new WUs for a while and will only do work for yoyo.
____________
|
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08, Posts: 1006, Credit: 5,068,599, RAC: 0
|
I'm trying to figure out the pattern behind those failures.
These new "short" WUs, called "SHORT" and "MED", are expected to run for <3 hours and grant approximately 300 credits. Another difference with respect to other WUs is that they generate a largish result file (*_4, approx 16M).
The WUs crunch correctly on most machines - hopefully we'll be able to reproduce the problem. |
|
|
|
Thanks for looking into the problem Toni!
I previously deleted the BOINC directory, redownloaded BOINC and got a GRA_US task. It seems this one is running OK now. It's almost at 10% after 4 hours of crunching...
____________
pixelicious.at - my little photoblog |
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08, Posts: 1006, Credit: 5,068,599, RAC: 0
|
Thanks to you for reporting the symptoms: "runaway" processes may in fact be related to anomalous task suspend/resume, triggered by the CPU benchmarks.
To relieve the problem, I'm postponing some WUs and lowering the FP bound, which will hopefully halt "runaway" jobs before they hog the CPU for days (the changes will take time to propagate, though). |
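To illustrate why lowering the FP bound limits how long a "runaway" job can hog the CPU, here is a rough sketch. It assumes the BOINC client aborts a task once its elapsed time exceeds the workunit's floating-point-operations bound divided by the host's estimated speed. The function names and all numbers below are hypothetical and are not the project's actual settings.

```python
def max_elapsed_seconds(fpops_bound, host_flops):
    """Longest run time allowed before the client gives up on a task,
    assuming the limit is fpops_bound / host_flops."""
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return fpops_bound / host_flops


def should_abort(elapsed_seconds, fpops_bound, host_flops):
    """True once a task has run longer than its bound allows."""
    return elapsed_seconds > max_elapsed_seconds(fpops_bound, host_flops)


# Hypothetical host speed of 1e9 FLOPS.
HOST_FLOPS = 1e9

# Hypothetical generous bound: the task may run for 30 hours.
loose_limit = max_elapsed_seconds(1.08e14, HOST_FLOPS)

# Hypothetical lowered bound: the task is stopped after 6 hours.
tight_limit = max_elapsed_seconds(2.16e13, HOST_FLOPS)

# A task stuck for 1 day and 16 hours, as reported earlier in the thread.
stuck_for = (24 + 16) * 3600
```

With the lowered bound in this example, a task stuck for 40 hours would have been cut off after 6 hours instead of running until someone noticed.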
|
|
|
Hi TG,
No benchmark was running at the start or at the time I cancelled the WU. Below I have inserted the log from the moment I started it until I cancelled it.
I hope you can fix it, or at least build in a maximum running time for certain WUs.
I will follow the forums to see if things get better, and I'll be back then.
do 23 apr 2009 18:27:40 MDT|GPUGRID|Starting A8-TONI_CELLGA_MED_4-6-40-RND6539_1
do 23 apr 2009 18:27:40 MDT|GPUGRID|Starting task A8-TONI_CELLGA_MED_4-6-40-RND6539_1 using cellmd2 version 503
do 23 apr 2009 23:27:22 MDT|GPUGRID|Sending scheduler request: Requested by project
do 23 apr 2009 23:27:22 MDT|GPUGRID|(not requesting new work or reporting completed tasks)
do 23 apr 2009 23:27:27 MDT|GPUGRID|Scheduler RPC succeeded [server version 607]
do 23 apr 2009 23:27:27 MDT|GPUGRID|Deferring communication for 31 sec
do 23 apr 2009 23:27:27 MDT|GPUGRID|Reason: requested by project
vr 24 apr 2009 04:27:28 MDT|GPUGRID|Sending scheduler request: Requested by project
vr 24 apr 2009 04:27:28 MDT|GPUGRID|(not requesting new work or reporting completed tasks)
vr 24 apr 2009 04:27:34 MDT|GPUGRID|Scheduler RPC succeeded [server version 607]
vr 24 apr 2009 04:27:34 MDT|GPUGRID|Deferring communication for 31 sec
vr 24 apr 2009 04:27:34 MDT|GPUGRID|Reason: requested by project
vr 24 apr 2009 09:27:34 MDT|GPUGRID|Sending scheduler request: Requested by project
vr 24 apr 2009 09:27:34 MDT|GPUGRID|(not requesting new work or reporting completed tasks)
vr 24 apr 2009 09:27:44 MDT|GPUGRID|Scheduler RPC succeeded [server version 607]
vr 24 apr 2009 09:27:44 MDT|GPUGRID|Deferring communication for 31 sec
vr 24 apr 2009 09:27:44 MDT|GPUGRID|Reason: requested by project
vr 24 apr 2009 11:05:05 MDT|GPUGRID|Deferring communication for 1 min 0 sec
vr 24 apr 2009 11:05:05 MDT|GPUGRID|Reason: Unrecoverable error for result A8-TONI_CELLGA_MED_4-6-40-RND6539_1 (aborted by user)
vr 24 apr 2009 11:05:05 MDT|GPUGRID|Computation for task A8-TONI_CELLGA_MED_4-6-40-RND6539_1 finished
vr 24 apr 2009 11:05:05 MDT|yoyo@home|Resuming task ogr_090407202003_15_0 using crunch version 211
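A log like the one above can be checked mechanically to see how long a task ran before it ended. The sketch below is only an illustration: it assumes the pipe-separated layout shown in the log (timestamp, project, message), ignores the time zone field, and assumes Dutch month abbreviations because the weekday names are Dutch (only "apr" actually appears in the log). The function names are made up.

```python
from datetime import datetime

# Assumed Dutch month abbreviations; only "apr" occurs in the log above.
MONTHS = {"jan": 1, "feb": 2, "mrt": 3, "apr": 4, "mei": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dec": 12}


def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    stamp, project, message = line.split("|", 2)
    _weekday, day, month, year, clock, _zone = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    when = datetime(int(year), MONTHS[month.lower()], int(day),
                    hour, minute, second)
    return when, project, message


def task_runtime(lines, task):
    """Time between 'Starting task' and 'Computation ... finished',
    or None if either message is missing."""
    start = end = None
    for line in lines:
        when, _project, message = parse_line(line)
        if message.startswith("Starting task " + task):
            start = when
        elif message == "Computation for task " + task + " finished":
            end = when
    if start is None or end is None:
        return None
    return end - start


log = [
    "do 23 apr 2009 18:27:40 MDT|GPUGRID|Starting task A8-TONI_CELLGA_MED_4-6-40-RND6539_1 using cellmd2 version 503",
    "vr 24 apr 2009 11:05:05 MDT|GPUGRID|Reason: Unrecoverable error for result A8-TONI_CELLGA_MED_4-6-40-RND6539_1 (aborted by user)",
    "vr 24 apr 2009 11:05:05 MDT|GPUGRID|Computation for task A8-TONI_CELLGA_MED_4-6-40-RND6539_1 finished",
]
runtime = task_runtime(log, "A8-TONI_CELLGA_MED_4-6-40-RND6539_1")
```

For the three lines taken from the log above this gives 16 hours, 37 minutes and 25 seconds between start and abort.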
____________
|
|
|
Quasar
Joined: 17 Dec 08, Posts: 6, Credit: 1,943,937, RAC: 0
|
Could it be that this problem occurs when more than one project is using the PS3 (maybe it's related to task switching)? I've been crunching on my PS3 for a couple of months without problems, but I ran into this problem yesterday once I attached BOINC to yoyo@home. I aborted the WU and got another one, but this one seems to be hanging too. Here are the links to the WUs:
http://www.gpugrid.net/workunit.php?wuid=407696
http://www.gpugrid.net/workunit.php?wuid=407774 |
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08, Posts: 1006, Credit: 5,068,599, RAC: 0
|
Yes, we believe that task switching is an issue: the accelerated processors are somehow not properly "freed" upon process termination, probably a shortcoming of the platform. :-( Do just the PS3GRID WUs hang, or also those from other projects? |
|
|
Quasar
Joined: 17 Dec 08, Posts: 6, Credit: 1,943,937, RAC: 0
|
I think all the WUs, from PS3Grid and others, hang; at least that's what happened with me. I was looking around and found a newer BOINC client optimized for the PS3 (http://www.dotsch.de/boinc/boinc6219_10.linux-ps3.tar.gz). Unfortunately it's a command-line version, and I couldn't get it to work with BOINC Manager. Maybe an update to the current PS3Grid BOINC client (whether using the one mentioned above or otherwise) would fix the problem? |
|
|
jboese
Joined: 30 Jul 08, Posts: 21, Credit: 31,229, RAC: 0
|
Just to be helpful to another BOINCer: at least on my machine, the hanging WUs seem to be specific to gpugrid. If you set the resource share for yoyo to, say, 3E+38 (a huge number) and gpugrid to 1, then your PS3 will happily crunch only yoyo WUs and will not hang. The problem only seems to occur when the PS3 starts working on a gpugrid WU. I am not sure, but the problem also seems more pronounced with the memory stick version (what I run) than with YDL, though I think it occurs with both at times. |
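To show why such a lopsided resource share effectively switches one project off, here is a small sketch. It assumes BOINC divides processing time between attached projects roughly in proportion to their resource shares; the function name is made up, and the real scheduler considers more than this simple ratio.

```python
def share_fractions(resource_shares):
    """Fraction of processing time each project would get, assuming a
    purely proportional split of the resource shares."""
    total = sum(resource_shares.values())
    if total <= 0:
        raise ValueError("resource shares must sum to a positive number")
    return {name: share / total for name, share in resource_shares.items()}


# The values suggested in the post above.
fractions = share_fractions({"yoyo@home": 3e38, "GPUGRID": 1.0})
```

Under this assumption GPUGRID's fraction is vanishingly small, so in practice the client only asks yoyo@home for work.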
|
|
sam
Joined: 30 Apr 09, Posts: 15, Credit: 228,425, RAC: 0
|
just joined the gpugrid; first wu is hung 0.0%
Thu 30 Apr 2009 04:48:38 PM EDT|GPUGRID|Starting task 5-TONI_CELLGA_MED_9-0-2-RND0374_0 using cellmd2 version 503
posted because the admin mentioned trying to discern a pattern. I have not tried another ps3 based project. I just setup the machine with ydl6.1, basically vanilla.
|
|
|
sam
Joined: 30 Apr 09, Posts: 15, Credit: 228,425, RAC: 0
|
I guess I picked a bad day to join. I booted; no worky. aborted. started another, no worky. then I see a message suggesting I joined the project with the wrong link, the one from the website. so I tried again, new wu: Thu 30 Apr 2009 05:20:10 PM EDT|GPUGRID|Starting task 638000-IBUCH_GRAUS-1-100-RND2278_1 using cellmd2 version 503
no worky. 0.0% I just suspended the wu as I will my effort for today.
I am open to recommendations regarding this project and/or ydl6.1; I am new to both.
thanks. |
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08, Posts: 1006, Credit: 5,068,599, RAC: 0
|
Hi sam - the progress bar does not show the actual progression for PS3 WUs. All WUs last circa 12 hours each. |
|
|
|
I've crunched 1 WU for gpugrid in the past week and it hung: 16+ hours at 0%.
http://www.gpugrid.net/result.php?resultid=598295 |
|
|
jboese
Joined: 30 Jul 08, Posts: 21, Credit: 31,229, RAC: 0
|
I have promised to be political and will simply say the PS3 portion of this project is experiencing "growing" pains and gpugrid WU often hang. The problem is not with your setup (wish it was more clear so people don't waste their time debugging a universal project problem). |
|
|
|
After 7 finished WUs, this one ran for 39+ hours with 0% progress, so I cancelled it.
http://www.gpugrid.net/result.php?resultid=683455
____________
|
|
|
|
Maybe the following can help with hanging WUs:
This one, http://www.gpugrid.net/result.php?resultid=694578, crunched for the first 24 hours with 0% progress. Then I restarted Linux from the menu, the way where you see the blue screen but without needing to enter a password, with ROOT as user. Within an hour after the reboot the progress started counting, and the WU finished successfully.
There was another donor system that took the normal number of hours for this WU.
____________
|
|
|
|
http://www.gpugrid.net/result.php?resultid=819492
it's getting boring that things aren't getting fixed
____________
|
|
|
|
Could it be heat-related that my WUs often hang?
I notice that with a higher room temperature (above 21°C) I do get this hanging.
When it's 20°C or below it can crunch for weeks.
I do NOT have this problem with yoyo!!!!
BTW, another one wasted 2 full days of energy... so I have stopped requesting work for the moment (26°C inside).
http://www.gpugrid.net/result.php?resultid=863967
____________
|
|
|
|
Since I last started ps3grid with yoyo suspended, I've had no problems... so far it looks good... but if it's multi-project related, it isn't nice of this project not to fix it. I'd be MORE satisfied if I could get yoyo back at 10%, so the PS3 could ask for work from that project if gpugrid runs out, the server hangs, ...
____________
|
|
|