
Message boards : Number crunching : BAD PABLO_p53 WUs

Author Message
Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 46695 - Posted: 20 Mar 2017 | 20:42:47 UTC

So far I've had 23 of these bad PABLO_p53 WUs today. Think maybe they should be cancelled?

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 46696 - Posted: 20 Mar 2017 | 21:32:08 UTC - in response to Message 46695.

Same for me.
http://www.gpugrid.net/forum_thread.php?id=4513&nowrap=true#46692

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,105,021,966
RAC: 12,822,236
Level
Tyr
Message 46697 - Posted: 20 Mar 2017 | 21:43:59 UTC

I also had 2 bad units from this bunch.

The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should disappear in time.
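
For anyone wondering what "run its course" means mechanically: the server keeps counting error results per workunit and retires it once the error limit is reached. Here is a minimal sketch of that idea in Python; the 8-error limit is taken from the post above, and the real BOINC transitioner (C++) is considerably more involved, so treat this only as an illustration.

    # Sketch of how a workunit gets retired after too many errors.
    # Assumption: the project allows 8 error results per workunit, as
    # mentioned above; the real BOINC server logic is more complex.
    MAX_ERROR_RESULTS = 8

    class Workunit:
        def __init__(self, name):
            self.name = name
            self.error_count = 0
            self.retired = False

        def report_result(self, success):
            """Record one returned result for this workunit."""
            if success:
                return  # a valid result ends the workunit normally
            self.error_count += 1
            if self.error_count >= MAX_ERROR_RESULTS:
                # Flagged "too many errors (may have a bug)"; no more copies are sent.
                self.retired = True

    wu = Workunit("e1s1_x-PABLO_p53_example-0-1-RND0000")  # hypothetical name
    for _ in range(MAX_ERROR_RESULTS):
        wu.report_result(success=False)
    print(wu.retired)  # True: the unit drops out of the queue on its own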



Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 46698 - Posted: 20 Mar 2017 | 21:47:27 UTC - in response to Message 46697.
Last modified: 20 Mar 2017 | 21:47:40 UTC

The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should disappear in time.

That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly.

Profile koschi
Avatar
Send message
Joined: 14 Aug 08
Posts: 124
Credit: 466,579,198
RAC: 801,030
Level
Gln
Message 46699 - Posted: 20 Mar 2017 | 22:43:27 UTC

Same here under Linux:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46701 - Posted: 21 Mar 2017 | 1:42:13 UTC

My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch).
Luckily there are short runs (and one other long run), so my main host is not completely shut out of this project.
This is really annoying.
This batch was working fine previously.

WPrion
Send message
Joined: 30 Apr 13
Posts: 87
Credit: 1,065,409,111
RAC: 0
Level
Met
Message 46702 - Posted: 21 Mar 2017 | 4:20:17 UTC - in response to Message 46698.

That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly.


They do error quickly, but it kicked me into my daily quota limit right at the beginning of a new day. $#%@^!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46703 - Posted: 21 Mar 2017 | 4:23:57 UTC

also bad here:
for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-(((

This means that no new task can be downloaded before March 22, right?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46704 - Posted: 21 Mar 2017 | 7:27:43 UTC - in response to Message 46703.

also bad here:
for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-(((

This means that no new tasks can be downloaded before March 22, right?


The incident early this morning shows that the policy of the daily quota should be revisited quickly.
In the specific case, it results in total nonsense.

No idea how many people (I think: many) are now unable to download any new tasks for the whole of March 21. A rather bad thing.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46705 - Posted: 21 Mar 2017 | 9:49:54 UTC
Last modified: 21 Mar 2017 | 9:50:23 UTC

This is a generic error: all long workunits failed on all of my hosts overnight too, so all of my hosts are processing short runs now, but the short queue has already run dry.

The incident early this morning shows that the policy of the daily quota should be revisited quickly.
In the specific case, it results in total nonsense.
The daily quota would decrease to 1 in any case if the project supplied only failing workunits; there's no problem with that. The problem is the waaay too high ratio of bad workunits in the queue.

Lluis
Send message
Joined: 22 Feb 14
Posts: 26
Credit: 672,639,304
RAC: 0
Level
Lys
Message 46706 - Posted: 21 Mar 2017 | 10:52:52 UTC - in response to Message 46705.

Since yesterday I have 10 Pablo long work units with an "unknown" error, and now I don't have any work units to process.
Does anyone have an idea of what to do? Any advice (other than processing short units)?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46707 - Posted: 21 Mar 2017 | 11:32:47 UTC - in response to Message 46705.

... so all of my hosts are processing short runs now, but the short queue has already run dry.

I switched to short runs in the early morning when, according to the Project Status Page, some were still available.
However, the download of those was again refused, citing the "daily quota of 3 tasks" :-(

Matt
Avatar
Send message
Joined: 11 Jan 13
Posts: 216
Credit: 846,538,252
RAC: 0
Level
Glu
Message 46708 - Posted: 21 Mar 2017 | 11:54:32 UTC

I've had 9 Pablos fail on me. I'm only receiving short runs now. Returning to Einstein until this is sorted.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 46709 - Posted: 21 Mar 2017 | 11:54:40 UTC - in response to Message 46706.

Since yesterday I have 10 Pablo long work units with an "unknown" error, and now I don't have any work units to process.
Does anyone have an idea of what to do? Any advice (other than processing short units)?


There is nothing you can do but wait 24 hrs. There is not a problem with the daily quota; there is a massive problem with the dumping of faulty WUs into the queue.

The system does not appear to be monitored to any great effect; if it were, somebody would have noticed and cancelled the WUs before this problem occurred.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46711 - Posted: 21 Mar 2017 | 12:09:04 UTC - in response to Message 46709.

The system does not appear to be monitored to any great effect; if it were, somebody would have noticed and cancelled the WUs before this problem occurred.
It seems that the workunits went wrong at a certain point, but it wasn't clear that it would affect every batch. It took a couple of hours for the error to spread widely; now the situation is clear. It's very easy to be wise in retrospect.
BTW I've sent a notification email to a member of the staff.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46714 - Posted: 21 Mar 2017 | 12:42:22 UTC - in response to Message 46709.

There is nothing you can do but wait 24 hrs.

The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock.
I guess quite a number of crunchers would be pleased if this were possible.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46716 - Posted: 21 Mar 2017 | 12:50:43 UTC - in response to Message 46714.

There is nothing you can do but wait 24 hrs.

The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock.
I guess quite a number of crunchers would be pleased if this were possible.

They have to cure the underlying problem first - that's the priority when something like this happens.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 46718 - Posted: 21 Mar 2017 | 13:15:49 UTC

Oh crappity crap. I think I messed up the adaptive. Will check now. Thanks for pointing it out!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46719 - Posted: 21 Mar 2017 | 14:13:04 UTC - in response to Message 46716.

They have to cure the underlying problem first - that's the priority when something like this happens.

That's clear to me - first the problem must be fixed, and then, if possible, some kind of "reset" should be made so that all crunchers whose downloads were stopped can download new tasks again.

Although I am afraid that this will not be possible.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 46720 - Posted: 21 Mar 2017 | 14:21:55 UTC

Well, these broken tasks will have to run their course. But they crash on start, so they should be gone very quickly now. I fixed the bugs and we will restart the adaptives in a moment.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46721 - Posted: 21 Mar 2017 | 14:48:12 UTC

People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully.

BOINC rebuilds the quota quickly when tasks are returned successfully, but if you're restricted to one task per day, and that one turns out to be a faulty one, you're stuck for another 24 hours.
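
A simplified picture of the recovery Richard is describing, as a sketch only: assume a valid result doubles the host's remaining daily allowance up to a project cap, and each errored task knocks it back down towards a floor of 1. The real BOINC scheduler's exact rules differ in detail, and the cap below is a made-up number.

    # Toy model of a host's daily task quota (per application type).
    # Assumptions: a good result doubles the allowance up to a project cap,
    # a bad result halves it down to a floor of 1. The real BOINC scheduler
    # uses different exact rules; this only shows the shape of the recovery.
    PROJECT_CAP = 50  # hypothetical per-host daily cap

    def update_quota(quota, good_result):
        if good_result:
            return min(PROJECT_CAP, quota * 2)
        return max(1, quota // 2)

    quota = PROJECT_CAP
    for _ in range(10):                   # a run of faulty tasks drags it to 1
        quota = update_quota(quota, good_result=False)
    print("after the bad batch:", quota)  # 1

    for day in range(1, 7):               # successful returns rebuild it quickly
        quota = update_quota(quota, good_result=True)
        print("good day", day, "->", quota)  # 2, 4, 8, 16, 32, 50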

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 46722 - Posted: 21 Mar 2017 | 15:10:19 UTC - in response to Message 46721.

People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully.

BOINC rebuilds the quota quickly when tasks are returned successfully, but if you're restricted to one task per day, and that one turns out to be a faulty one, you're stuck for another 24 hours.


Nobody on this project is restricted to one task a day; they are restricted to 2 a day because of the way computers count: 0 = 1, 1 = 2, etc.

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 46723 - Posted: 21 Mar 2017 | 15:19:15 UTC - in response to Message 46701.
Last modified: 21 Mar 2017 | 15:19:43 UTC

My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch).
Luckily there are short runs (and one other long run), so my main host is not completely shut out of this project.
This is really annoying.

It's beyond annoying. I now have 6 hosts that won't get tasks because of these bad WUs. Two of the hosts not getting tasks are the fastest ones with 1060 GPUs. Irritating.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46725 - Posted: 21 Mar 2017 | 17:38:16 UTC - in response to Message 46721.

People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully.

BOINC rebuilds the quota quickly when tasks are returned successfully, but if you're restricted to one task per day, and that one turns out to be a faulty one, you're stuck for another 24 hours.

I have tried this, however, without success.
The only difference from before is that the BOINC notice no longer refers to the task limit per day, but simply says

21/03/2017 18:36:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)

Why so?

Greger
Send message
Joined: 6 Jan 15
Posts: 74
Credit: 14,035,489,249
RAC: 31,001,750
Level
Trp
Message 46726 - Posted: 21 Mar 2017 | 17:43:52 UTC - in response to Message 46723.

Even worse, all my Linux hosts got a coproc error, which means the bad batch crashed the drivers, so other projects failed too.

A restart has now been done, and it might crash again if there are still bad tasks out there.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46727 - Posted: 21 Mar 2017 | 17:48:55 UTC
Last modified: 21 Mar 2017 | 17:57:19 UTC

For some reason, 2 of my PCs now received new tasks; one of them was a
PABLO_contact_goal_KIX_CMYB

and even this one failed after a few seconds.
Until now I thought that only PABLO_p53 tasks were affected.

Edit: Only now do I realize that others of my PCs had the same problem during the day with all kinds of different WUs, not only PABLO_p53.

Can it be that all recent WUs, regardless of type, were faulty?

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 46729 - Posted: 21 Mar 2017 | 21:24:25 UTC

It's ridiculous that these bad tasks weren't canceled. How many machines have been denied work because of this laziness on the admins' part? I've personally received 137 of these bad WUs so far and now have 7 machines not accepting long WUs. Multiply this by how many users? This kind of thing can also happen at other projects, but they cancel the bad WUs when informed of the problem. Why not here?

Tom Miller
Send message
Joined: 21 Nov 14
Posts: 5
Credit: 1,081,640,766
RAC: 0
Level
Met
Message 46730 - Posted: 21 Mar 2017 | 23:00:49 UTC - in response to Message 46720.

And still, for hours, the junk keeps rolling out.

If we volunteers are donating our GPUs and electrons to help in what we're told is real science, I would hope the people using our resources would have a somewhat better way of administering them.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,105,021,966
RAC: 12,822,236
Level
Tyr
Message 46731 - Posted: 22 Mar 2017 | 0:51:49 UTC

They should eliminate the "daily quota" for this particular situation and let us crunch!



Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46734 - Posted: 22 Mar 2017 | 4:00:31 UTC - in response to Message 46720.
Last modified: 22 Mar 2017 | 4:10:57 UTC

Well, these broken tasks will have to run their course.
That will be a long and frustrating process, as every host can have only one workunit per day, but right now 9 out of 10 workunits are broken (so the daily quota of the hosts won't rise for a while), and every workunit has to fail 7 times before it's cleared from the queue.
To speed this up, I've created dummy hosts with my inactive host, and I've "killed" about 100 of these broken workunits. I had to abort some working units, but those are in the minority right now.
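
To see why that matters, here is an illustrative back-of-the-envelope calculation; none of these numbers come from the project, they are assumptions just to show the scale.

    # Purely illustrative arithmetic; every number here is a made-up assumption.
    broken_wus = 1000          # hypothetical size of the bad batch
    failures_needed = 7        # failures each broken WU still needs before retirement
    hosts_drawing_long = 500   # hypothetical hosts still allowed 1 long task per day

    failures_required = broken_wus * failures_needed
    days_to_drain = failures_required / hosts_drawing_long
    print(days_to_drain)       # 14.0 days at one failed long task per host per day

Which is why a handful of dummy hosts burning through failures can shorten the wait considerably.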

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46735 - Posted: 22 Mar 2017 | 4:18:50 UTC

The situation here is still unchanged.
One of my 4 hosts luckily got a "good" WU some time last night and is crunching it.
On all other hosts BOINC still tells me

22/03/2017 05:14:41 | GPUGRID | This computer has finished a daily quota of 1 tasks

What I don't understand is why all these broken WUs cannot be removed from the queue, and why GPUGRID cannot somehow reset this daily quota junk.

By now, my frustration has reached quite a level :-(

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 46736 - Posted: 22 Mar 2017 | 10:16:56 UTC

Relax everyone, we are where we are. I'm sure the admins are as frustrated as we are and are working to correct the situation.

On the bright side, short WUs just got a boost in computation.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46737 - Posted: 22 Mar 2017 | 10:27:01 UTC - in response to Message 46736.

short WUs just got a boost in computation.

What does it help if they cannot be downloaded?

Loohi
Send message
Joined: 27 Aug 16
Posts: 16
Credit: 43,745,875
RAC: 0
Level
Val
Message 46738 - Posted: 22 Mar 2017 | 10:50:47 UTC - in response to Message 46737.


what does it help if they cannot be downloaded?


They can be downloaded, actually.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46739 - Posted: 22 Mar 2017 | 11:21:11 UTC - in response to Message 46738.
Last modified: 22 Mar 2017 | 11:22:26 UTC

They can be downloaded, actually.

NOT on my machines. I get the same notice about the "daily quota of 1 task" ... :-(

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46740 - Posted: 22 Mar 2017 | 11:28:33 UTC - in response to Message 46739.
Last modified: 22 Mar 2017 | 11:30:15 UTC

They can be downloaded, actually.

NOT on my machine. I get the same notice about the "daily quota of 1 task" ... :-(

The quota is applied per task type. You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated - I have two machines running them at the moment, because of that.

Here are the log entries from one of the affected machines:

22/03/2017 09:51:04 | GPUGRID | This computer has finished a daily quota of 1 tasks
22/03/2017 10:13:27 | GPUGRID | Scheduler request completed: got 2 new tasks
22/03/2017 10:13:27 | GPUGRID | No tasks are available for the applications you have selected
22/03/2017 10:13:27 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 10:13:27 | GPUGRID | Your preferences allow tasks from applications other than those selected
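
If you want a quick count of how often a host is being turned away, the event-log lines above have a regular "date time | project | message" shape, so a few lines of Python can tally them. This is a throwaway sketch; the log file name and path are only an example (on Windows the client writes stdoutdae.txt in the BOINC data directory), so adjust it to wherever your event log actually lives.

    # Tally GPUGRID scheduler refusals in a BOINC event log.
    # Assumptions: log lines look like "date time | project | message", as in
    # the excerpts above, and the file path below is only an example.
    from collections import Counter

    LOGFILE = "stdoutdae.txt"  # hypothetical path to the saved event log

    counts = Counter()
    with open(LOGFILE, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = [p.strip() for p in line.split("|", 2)]
            if len(parts) == 3 and parts[1] == "GPUGRID":
                message = parts[2]
                if "daily quota" in message or "No tasks are available" in message:
                    counts[message] += 1

    for message, n in counts.most_common():
        print(f"{n:5d}  {message}")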

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 46741 - Posted: 22 Mar 2017 | 11:35:05 UTC - in response to Message 46739.
Last modified: 22 Mar 2017 | 11:37:36 UTC

They can be downloaded, actually.

NOT on my machines. I get the same notice about the "daily quota of 1 task" ... :-(


In addition to Richard's response: you have long WUs running on three out of four of your machines. What more exactly do you want?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46742 - Posted: 22 Mar 2017 | 11:44:07 UTC - in response to Message 46741.

What is shown in my log is unfortunately wrong.

There is a total of 2 tasks running now. One is on the slow GTX750Ti, which was obviously not affected the same way as the faster machines.
And one, to my surprise, is on the GTX970.

The log erroneously shows 2 tasks on the PC with the two GTX980Tis; however, no tasks are being crunched there.

Then there is another PC with a GTX750Ti, which still shows the "quota of 1 task per day" notice.

It would of course be great if I could finally run tasks on the two GTX980Tis.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46743 - Posted: 22 Mar 2017 | 11:47:29 UTC

Further, the log shows that a

PABLO_contact_goal_KIX_CMYB-0-4-RND2705_5

was downloaded at 10:11 UTC this morning, and it also errored out after a few seconds.
Can these faulty WUs really not be eliminated from the queue?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46744 - Posted: 22 Mar 2017 | 12:15:53 UTC - in response to Message 46740.

You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated

that's what BOINC is showing me:

22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks

So I doubt that I could get short runs.
(Your assumption is correct: I should be suffering from a long-runs quota only, since no short runs were selected when the "accident" happened.)

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46745 - Posted: 22 Mar 2017 | 12:22:34 UTC

To my surprise, the faulty/working ratio is much better than I expected.
I did a test with my dummy host again, and only 18 of 48 workunits were faulty.
I've received some of the new (working) workunits on my live hosts too, so the daily quota will be recovered in a couple of days.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46746 - Posted: 22 Mar 2017 | 12:24:56 UTC - in response to Message 46745.

... so the daily quota will be recovered in a couple of days.

Still, it's a shame that there is no other mechanism in place for cases like the present one :-(

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46747 - Posted: 22 Mar 2017 | 12:26:17 UTC - in response to Message 46744.

You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated

that's what BOINC is showing me:

22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks

So I doubt that I could get short runs.
(Your assumption is correct: I should be suffering from a long-runs quota only, since no short runs were selected when the "accident" happened.)
The short queue is empty, and the scheduler won't send you anything from the long queue because of the host's decreased daily quota. You should wait for a couple of hours.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46748 - Posted: 22 Mar 2017 | 12:50:53 UTC - in response to Message 46746.

... so the daily quota will be recovered in a couple of days.

Still, it's a shame that there is no other mechanism in place for cases like the present one :-(
You can't prepare a system for every abnormal situation. BTW you'll still receive workunits while your daily quota is lower than its maximum. The only important factor is that a host should not receive many faulty workunits in a row, because that will "blacklist" the host for a day. This is a pretty good automatic mechanism for minimizing the effects of a faulty host, as such a host would exhaust the queues in a very short time if there were nothing to limit the work assigned to it. Too bad that this generic error combined with this self-defense got all of our hosts blacklisted, but there's no defense against this self-defense. I've realized that we are the "device" that keeps this project running in such regrettable situations.

WPrion
Send message
Joined: 30 Apr 13
Posts: 87
Credit: 1,065,409,111
RAC: 0
Level
Met
Message 46749 - Posted: 22 Mar 2017 | 12:56:02 UTC

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46750 - Posted: 22 Mar 2017 | 13:12:20 UTC - in response to Message 46749.

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)
Indeed. This should be a special one, with a special design. I'm thinking of a crashed bug. :)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46751 - Posted: 22 Mar 2017 | 13:35:28 UTC - in response to Message 46747.

The short queue is empty, and the scheduler won't send you anything from the long queue because of the host's decreased daily quota. You should wait for a couple of hours.

Sometimes you get a working long task, sometimes you get a faulty long task, sometimes you get a short task - it's very much the luck of the draw at the moment. I've had all three outcomes within the last hour.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46752 - Posted: 22 Mar 2017 | 13:53:18 UTC - in response to Message 46751.

... sometimes you get a faulty long task

This leads me to repeat my question: why were/are the faulty ones not eliminated from the queue?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46753 - Posted: 22 Mar 2017 | 14:13:09 UTC - in response to Message 46752.

why were/are the faulty ones not eliminated from the queue?

My guess - and it is only a guess - is that the currently-available staff are all biochemical researchers, rather than specialist database repairers. BOINC server code provides tools for researchers to submit jobs directly, but identifying faulty (and only faulty) workunits for cancellation is a tricky business. We've had cases in the past when batches of tasks have been cancelled en bloc, including tasks in the middle of an apparently viable run. That caused even more vociferous complaints (of wasted electricity) than the current forced diversion of BOINC resources to other backup projects.

Amateur meddling in technical matters (anything outside your personal professional skill) can cause more problems than it's worth. Stefan has owned up to making a mistake in preparing the workunit parameters: he has corrected that error, but he seems to have decided - wisely, in my opinion - not to risk dabbling in areas where he doesn't feel comfortable about his own level of expertise.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46754 - Posted: 22 Mar 2017 | 14:29:14 UTC

@Richard: what you are saying sounds logical

Ken Florian
Send message
Joined: 4 May 12
Posts: 56
Credit: 1,832,989,878
RAC: 0
Level
His
Message 46755 - Posted: 22 Mar 2017 | 20:24:35 UTC

Though I once posted some good numbers to the project, I've been away for a while and lost track of how BOINC ought to work.

I still do not have new tasks after my own set of failed tasks.

Is there anything I need to do to "clear my name" so that I get tasks?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46756 - Posted: 22 Mar 2017 | 22:50:18 UTC

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.

As to what to do about it - just allow/encourage your computer to request work once each day. Perhaps you will be lucky and get a good one at the next attempt, or you may end up with several more days' wait. It'll work out in the end.
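
For readers who haven't decoded these names before: the trailing _N on a task name is the replication number, i.e. how many copies of the same workunit were issued before this one; everything before that suffix is the workunit name. A small sketch of pulling a name apart, assuming the usual BOINC "workunitname_N" convention (the example name below is hypothetical):

    # Split a BOINC task name into its workunit name and replication index.
    # Assumption: task names follow the common "<workunit>_<replication>"
    # pattern seen in this thread; this is a convenience sketch, not project code.
    def split_task_name(task_name):
        wu_name, _, suffix = task_name.rpartition("_")
        if suffix.isdigit():
            return wu_name, int(suffix)
        return task_name, 0  # no replication suffix found

    name = "e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386_4"  # hypothetical
    wu, replication = split_task_name(name)
    print(wu)           # e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386
    print(replication)  # 4 -> four earlier copies of this workunit already went out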

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46757 - Posted: 23 Mar 2017 | 8:12:20 UTC - in response to Message 46756.

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.
If there's the
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
message in many of the previous tasks' stderr.txt output files, then it's a faulty task.
The one you've received has failed 4 times, for 3 different reasons (but none of them is the one above):

1st & 3rd:
<message> process exited with code 201 (0xc9, -55) </message> <stderr_txt> # Unable to initialise. Check permissions on /dev/nvidia* (err=100) </stderr_txt>

2nd (that's the most mysterious):
<message> process exited with code 212 (0xd4, -44) </message> <stderr_txt> </stderr_txt>

4th:
<message> (unknown error) - exit code -80 (0xffffffb0) </message> <stderr_txt> ... # Access violation : progress made, try to restart called boinc_finish </stderr_txt>

BTW things are now back to normal (almost); some faulty workunits are still floating around.
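
If you want to check your own returned tasks for the same signature, the error string is constant, so a few lines of Python will flag it. This is a hypothetical helper: it assumes you have saved each task's stderr output into a local folder of .txt files (BOINC does not keep them in this layout by itself), so adjust the folder name to whatever you actually use.

    # Flag saved stderr files that contain the "bad batch" signature.
    # Assumption: each task's stderr output was copied into a local folder of
    # .txt files; the folder name below is hypothetical.
    from pathlib import Path

    SIGNATURE = "ERROR: file mdioload.cpp line 81: Unable to read bincoordfile"
    STDERR_DIR = Path("saved_stderr")

    for path in sorted(STDERR_DIR.glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if SIGNATURE in text:
            print("bad-batch signature found in:", path.name)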

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 511
Credit: 4,617,042,755
RAC: 0
Level
Arg
Message 46760 - Posted: 23 Mar 2017 | 18:07:10 UTC

Has the problem been fixed?

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 16,606
Level
Trp
Message 46761 - Posted: 23 Mar 2017 | 18:54:14 UTC - in response to Message 46760.

Has the problem been fixed?
Yes.
There still could be some faulty workunits in the long queue, but those are not threatening the daily quota.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,105,021,966
RAC: 12,822,236
Level
Tyr
Message 46799 - Posted: 31 Mar 2017 | 10:54:38 UTC

These error units are starting to disappear from the tasks pages. Soon, they will all be gone, nothing more than a memory.


Good bye!!!!


Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46803 - Posted: 31 Mar 2017 | 17:32:53 UTC

Trouble is, I'm starting to see a new bad batch, like

e1s2_ubiquitin_50ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_50_ubiquitin_1-0-1-RND7532

I've seen failures for each of contacts_20, contacts_50, and contacts_100.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 46804 - Posted: 31 Mar 2017 | 17:54:16 UTC - in response to Message 46803.

I just got one an hour ago that failed after two seconds:
e1s2_ubiquitin_20ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_20_ubiquitin_6-0-1-RND9359

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,518,461,851
RAC: 8,592,356
Level
Tyr
Message 46805 - Posted: 31 Mar 2017 | 21:30:23 UTC

e1s9_ubiquitin_100ns_8-ADRIA_FOLDGREED10_crystal_ss_contacts_100_ubiquitin_4-0-2-RND2702 is running OK, so they're not all bad.

Loohi
Send message
Joined: 27 Aug 16
Posts: 16
Credit: 43,745,875
RAC: 0
Level
Val
Message 46806 - Posted: 1 Apr 2017 | 3:58:45 UTC

Same here: 6 broken ADRIA WUs out of 8 in 12 hours so far, failing immediately.

Killersocke
Send message
Joined: 18 Oct 13
Posts: 53
Credit: 406,647,419
RAC: 0
Level
Gln
Message 46811 - Posted: 1 Apr 2017 | 23:59:56 UTC
Last modified: 2 Apr 2017 | 0:24:44 UTC

...and here too:
02.04.2017 01:58:39 | GPUGRID | Started download of e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_8-0-psf_file

SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)

Next WU:
e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_8-0-2-RND0956_3

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 46812 - Posted: 2 Apr 2017 | 2:17:59 UTC - in response to Message 46811.


SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)



That's more likely to have been your host's fault rather than the WU's, maybe due to a power failure or hard shutdown.

The other one says "(simulation unstable)", maybe due to overclocking.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46839 - Posted: 8 Apr 2017 | 8:03:25 UTC

Over the past few days, there have still been WUs which fail after a few seconds.
Latest example:

e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_1-0-2-RND7346_8

ending after 4.38 seconds.

What I notice is that these are particularly the "...ubiquitin..." tasks.
Any explanation why this happens?

Loohi
Send message
Joined: 27 Aug 16
Posts: 16
Credit: 43,745,875
RAC: 0
Level
Val
Message 46865 - Posted: 14 Apr 2017 | 16:35:23 UTC - in response to Message 46839.

Over the past few days, there have still been WUs which fail after a few seconds.
Latest example:

e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_1-0-2-RND7346_8

ending after 4.38 seconds.

What I notice is that these are particularly the "...ubiquitin..." tasks.
Any explanation why this happens?


This is still happening; I just had 2 in a row.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1087
Credit: 6,446,781,926
RAC: 26,532,609
Level
Tyr
Message 46866 - Posted: 14 Apr 2017 | 16:39:13 UTC - in response to Message 46865.

This is still happening; I just had 2 in a row.

Yes, this new problem is being discussed here:
http://www.gpugrid.net/forum_thread.php?id=4545
