
Message boards : Number crunching : failing tasks lately

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52174 - Posted: 3 Jul 2019 | 16:56:32 UTC
Last modified: 3 Jul 2019 | 16:59:27 UTC

This afternoon, I had 4 tasks in a row which failed after a few seconds; see here: http://www.gpugrid.net/results.php?userid=125700&offset=0&show_names=1&state=0&appid=

-97 (0xffffffffffffff9f) Unknown error number

The simulation has become unstable. Terminating to avoid lock-up
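Side note: the hex value in the line above is simply the exit code -97 viewed as an unsigned 64-bit two's-complement integer. A minimal Python illustration, nothing project-specific:

def as_unsigned_hex(code, bits):
    # a negative exit code reinterpreted as an unsigned two's-complement value
    return hex(code & ((1 << bits) - 1))

print(as_unsigned_hex(-97, 64))  # 0xffffffffffffff9f, as reported above
print(as_unsigned_hex(-44, 32))  # 0xffffffd4, an exit code that turns up later in this thread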


I've never had that before, and I didn't change anything in my settings.
Is anyone else experiencing the same problem?
For now, I've stopped downloading new tasks.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1618
Credit: 8,606,094,351
RAC: 16,317,055
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52176 - Posted: 3 Jul 2019 | 18:48:47 UTC

I've had three failed tasks over the last two days, but all the others have run normally. All the failed tasks had PABLO_V3_p27_sj403_IDP in their name.

But I'm currently uploading e10s21_e4s18p1f211-PABLO_V3_p27_sj403_IDP-0-2-RND5679_0 - which fits that name pattern, but has run normally. By the time you read this, it will probably have reported and you can read the outcome for yourselves. If it's valid, I think you can assume that Pablo has found the problem and corrected it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52177 - Posted: 3 Jul 2019 | 19:33:05 UTC

Yes, part of the PABLO_V3_p27_sj403_ID series seems to be erroneous.
Within the past few days, some of them have worked well here, but others haven't, as can be seen.
The server status page shows an error rate of 56.37% for them, which is high, isn't it?

I'll switch off my air conditioning overnight and will try to download the next task tomorrow morning.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52179 - Posted: 4 Jul 2019 | 4:46:06 UTC - in response to Message 52177.

The server status page shows an error rate of 56.37% for them, which is high, isn't it?

Overnight, the failure rate has risen to 57.98%.

The remaining tasks from this series should be cancelled from the queue.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52182 - Posted: 4 Jul 2019 | 15:41:34 UTC - in response to Message 52179.

The server status page shows an error rate of 56.37% for them, which is high, isn't it?

Overnight, the failure rate has risen to 57.98%.

The remaining tasks from this series should be cancelled from the queue.

Meanwhile, the failure rate has passed the 60% mark. It's 60.12%, to be exact.

And these faulty tasks are still in the download queue. WHY???

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1618
Credit: 8,606,094,351
RAC: 16,317,055
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52189 - Posted: 5 Jul 2019 | 16:33:40 UTC

I thought we'd got rid of these, but I've just sent back e15s24_e1s258p1f302-PABLO_V3_p27_sj403_IDP-0-2-RND4645_0 - note the _0 replication. I was the first victim since the job was created at 11:25:23 UTC today; seven more to go.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52194 - Posted: 5 Jul 2019 | 19:55:48 UTC

The failure rate is now close to 64%, so it's still climbing.
From the looks of it, none of the tasks from this series are succeeding.

Can anyone from the GPUGRID people explain the rationale behind leaving these faulty tasks in the download queue?

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,499,301,065
RAC: 9,540,197
Level
Tyr
Scientific publications
watwatwatwatwat
Message 52195 - Posted: 5 Jul 2019 | 21:55:24 UTC - in response to Message 52194.

The failure rate is now close to 64%, so it's still climbing.
From the looks of it, none of the tasks from this series are succeeding.

Can anyone from the GPUGRID people explain the rationale behind leaving these faulty tasks in the download queue?


A holiday. Some admins won't cancel tasks like that even when they're around; some will just let them error out the max number of times.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52197 - Posted: 6 Jul 2019 | 4:44:48 UTC - in response to Message 52195.

Some will just let them error out the max number of times.

The bad thing is that once a host has more than 2 or 3 such faulty tasks in a row, it is considered unreliable and will no longer receive tasks for the next 24 hours.
So the host is penalized for something that is not its fault.

What surprises me even more is that the GPUGRID people don't seem to care :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52204 - Posted: 7 Jul 2019 | 5:04:44 UTC

The failure rate has now passed the 70% mark. Great!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52208 - Posted: 8 Jul 2019 | 18:50:22 UTC

Meanwhile, the failure rate has passed the 75% mark. It is now 75.18%, to be exact.
And still, these faulty tasks are in the download queue.
Does anybody understand this?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1335
Credit: 7,516,867,459
RAC: 13,855,815
Level
Tyr
Scientific publications
watwatwatwatwat
Message 52210 - Posted: 9 Jul 2019 | 4:34:10 UTC

If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now.

I don't have any issues with the project and I haven't had any normal work since February when the Linux app was decommissioned.

I trust Toni will eventually figure out the new wrapper apps and we will get work again. Don't PANIC!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52211 - Posted: 9 Jul 2019 | 5:04:46 UTC - in response to Message 52210.

If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now.

I don't have any issues with the project and I haven't had any normal work since February when the Linux app was decommissioned.

I trust Toni will eventually figure out the new wrapper apps and we will get work again. Don't PANIC!

The question isn't whether or not I am unhappy. The question is rather what makes sense and what doesn't.
Don't you think the only real solution to the problem would logically be to simply withdraw the remaining tasks of this faulty series from the download queue?
Or can you explain the rationale for leaving them in the download queue?
In a few more weeks, when all these tasks have been used up, the error rate will be 100%. How does this serve the project?

As I explained before: once a host happens to download such faulty tasks 2 or 3 times in a row, it is blocked for 24 hours. So what sense does this make?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1618
Credit: 8,606,094,351
RAC: 16,317,055
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52215 - Posted: 9 Jul 2019 | 13:17:11 UTC

So far as I can tell from my account pages, my machines are processing GPUGrid tasks just fine and at the normal rate.

It's just one sub-type which is failing, and it's only wasting a few seconds when it does so. For some people on metered internet connections, there might be an additional cost, but I think it's unlikely that many people are running a high-bandwidth project that way.

The rationale for letting them time out naturally? It saves staff time, better spent doing the analysis and debugging behind the scenes. Let them get on with that, and I'm sure the research will be re-run when they find and solve the problem.

BTW, "No, it doesn't work" is a valid research outcome.

Redirect Left
Send message
Joined: 8 Dec 12
Posts: 23
Credit: 181,940,893
RAC: 27
Level
Ile
Scientific publications
watwatwatwatwatwatwatwat
Message 52216 - Posted: 9 Jul 2019 | 15:12:21 UTC

My machine has also failed numerous GPUGrid tasks lately, running on 2 GTX 1070 cards (individual, not SLI'd).

The failed ones usually have PABLO or NOELIA in their names.

Here are four examples of recent failures on my machine; hopefully you can determine from the output what needs to be resolved.

http://www.gpugrid.net/result.php?resultid=7412820
http://www.gpugrid.net/result.php?resultid=21094782
http://www.gpugrid.net/result.php?resultid=7412829
http://www.gpugrid.net/result.php?resultid=21075338

I'll be skipping GPUGrid tasks from now on until it is resolved, as it is wasting CPU/GPU time that I can use for other projects on the machine. I'll refer back to these forums to check for updates, though, so I know when to restart GPUGRID tasks.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52217 - Posted: 9 Jul 2019 | 22:28:06 UTC - in response to Message 52216.
Last modified: 9 Jul 2019 | 22:31:17 UTC

http://www.gpugrid.net/result.php?resultid=7412820 This WU is from 2013.
http://www.gpugrid.net/result.php?resultid=21094782 This WU is from the present bad batch. It took 6 seconds to error out.
http://www.gpugrid.net/result.php?resultid=7412829 This WU is from 2013.
http://www.gpugrid.net/result.php?resultid=21075338 This WU is from the present bad batch. It took 5 seconds to error out.

http://www.gpugrid.net/result.php?resultid=21094816 This WU is from the present bad batch. It took 6 seconds to error out.

I'll be skipping GPUGrid tasks from now on until it is resolved, as it is wasting CPU/GPU time that I can use for other projects on the machine.
The 3 recent errors wasted 17 seconds on your host in the past 4 days, so there's no reason to panic (even though your host didn't receive work for 3 days).

I'll refer back to these forums to check for updates, though, so I know when to restart GPUGRID tasks.
The project is running fine apart from this one bad batch, so you can do it right away.

The number of resends may increase as this bad batch runs out, and that may cause a host to be "blacklisted" for 24 hours, but it takes many failing workunits in a row (so it is unlikely to happen, as the maximum number of daily workunits gets reduced by 1 after an error).
The max number of daily tasks of the "Long runs (8-12 hours on fastest card) 9.22 windows_intelx86 (cuda80)" app for your host is currently 28, so this host would have to be extremely unlucky to receive 28 bad workunits in a row and get "banned" for 24 hours.
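For illustration, the daily-quota mechanism described above boils down to something like this simplified Python sketch (the restore-on-success rule and the cap value are assumptions, not the actual server code):

DAILY_QUOTA_CAP = 28  # the per-host daily limit quoted above

def update_quota(quota, task_succeeded):
    if task_succeeded:
        # assumption: successes grow the quota back towards the cap
        return min(DAILY_QUOTA_CAP, quota * 2)
    # as described above: each error reduces the daily quota by 1
    return max(1, quota - 1)

# Starting from 28, a host needs a long unbroken run of errors before the
# quota bottoms out and it stops receiving work for the rest of the day.
quota = DAILY_QUOTA_CAP
for _ in range(27):
    quota = update_quota(quota, task_succeeded=False)
print(quota)  # 1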

Redirect Left
Send message
Joined: 8 Dec 12
Posts: 23
Credit: 181,940,893
RAC: 27
Level
Ile
Scientific publications
watwatwatwatwatwatwatwat
Message 52218 - Posted: 9 Jul 2019 | 23:09:11 UTC - in response to Message 52217.

Oops, my bad. I sorted the tasks by 'errored' and mixed up the ones I pasted.

The results in their entirety are below: 10 errored ones in total, of which only 4 are recent; none of the others have errored (or are showing there) apart from one in 2015 and the other 5 in 2013.
http://www.gpugrid.net/results.php?userid=93721&offset=0&show_names=0&state=5&appid=

On your advice I'll restart GPUGrid task fetching, and hopefully the coin tosses go my way and it fetches a wide enough spread of tasks not to get itself blacklisted. Interesting that it is allowed up to 28 per day, given it only ever stores 4, and that only if 2 are running actively on the GPUs with 2 spare. But I guess that is down to the work buffer settings in BOINC.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52342 - Posted: 24 Jul 2019 | 8:42:59 UTC
Last modified: 24 Jul 2019 | 8:46:04 UTC

There are two more 'bad' batches at the moment in the 'long' queue:
PABLO_V4_UCB_p27_isolated_005_salt_ID
PABLO_V4_UCB_p27_sj403_short_005_salt_ID

Don't be surprised if the tasks from these two batches fail on your host after a couple of seconds - there's nothing wrong with your host.
The safety check in these batches is too sensitive, so it concludes that "the simulation became unstable" when it probably hasn't.
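ACEMD itself is closed source, so the exact check isn't visible; an instability guard in MD codes is typically a simple threshold test, roughly like the sketch below (the quantity and the limit are illustrative assumptions only):

MAX_ATOM_DISPLACEMENT_NM = 0.2  # assumed per-step limit, not ACEMD's real value

def step_is_stable(max_displacement_nm):
    # a threshold set too tight aborts perfectly healthy runs within seconds,
    # which matches the symptom seen with these batches
    return max_displacement_nm <= MAX_ATOM_DISPLACEMENT_NM

if not step_is_stable(0.25):
    print("The simulation has become unstable. Terminating to avoid lock-up")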

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52385 - Posted: 7 Aug 2019 | 12:10:27 UTC

any idea why all tasks downloaded within the last few hours fail immediately?

gemini8
Avatar
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 2,150,987,676
RAC: 1,090,850
Level
Phe
Scientific publications
watwat
Message 52386 - Posted: 7 Aug 2019 | 12:51:31 UTC - in response to Message 52385.

any idea why all tasks downloaded within the last few hours fail immediately?

No idea, but it's the same for others.

I'm using Win7 Pro; work units crash at once:

Stderr output
<core_client_version>7.10.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)</message>
]]>


07.08.2019 14:17:11 | GPUGRID | Sending scheduler request: To fetch work.
07.08.2019 14:17:11 | GPUGRID | Requesting new tasks for NVIDIA GPU
07.08.2019 14:17:13 | GPUGRID | Scheduler request completed: got 1 new tasks
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:21 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:21 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:30 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:30 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:19:22 | GPUGRID | Starting task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4
07.08.2019 14:19:29 | GPUGRID | Computation for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 finished
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_0 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_1 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_2 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_3 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:37 | GPUGRID | Started upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7
07.08.2019 14:19:39 | GPUGRID | Finished upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7


Another member of our team has the same problem on Win10.
I'd really like to compare this with Linux, but I haven't got any work units on my Debian machine for weeks.
____________
- - - - - - - - - -
Greetings, Jens

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52390 - Posted: 7 Aug 2019 | 14:29:10 UTC - in response to Message 52386.

any idea why all tasks downloaded within the last few hours fail immediately?

No idea, but it's the same for others.

Yes, I had checked that before I wrote my post above.

I wonder whether the GPUGRID team has noticed this problem yet.

Killersocke
Send message
Joined: 18 Oct 13
Posts: 53
Credit: 406,647,419
RAC: 0
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52392 - Posted: 7 Aug 2019 | 16:22:35 UTC - in response to Message 52174.

Same here, all WUs fail with the same error code:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)</message>
]]>

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52400 - Posted: 7 Aug 2019 | 19:24:31 UTC

It seems that the licence for Windows 10 (and maybe for Windows 7/8, too) has expired.

Why do I think so? My Windows XP host downloaded a new task a few minutes ago, and it works well.

Profile JStateson
Avatar
Send message
Joined: 31 Oct 08
Posts: 186
Credit: 3,363,027,550
RAC: 61,060
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52405 - Posted: 7 Aug 2019 | 19:52:18 UTC - in response to Message 52390.

any idea why all tasks downloaded within the last few hours fail immediately?

No idea, but it's the same for others.

Yes, I had checked that before I wrote my post above.

I wonder whether the GPUGRID team has noticed this problem yet.


Things left to themselves tend to go from bad to worse.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 753,570,933
RAC: 234,641
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52407 - Posted: 7 Aug 2019 | 22:13:07 UTC
Last modified: 7 Aug 2019 | 22:14:49 UTC

Several more tasks with computation errors, but nothing definite about just what kind of error.

At least they didn't use much CPU or GPU time.

http://www.gpugrid.net/result.php?resultid=21242466

http://www.gpugrid.net/result.php?resultid=21242065

http://www.gpugrid.net/result.php?resultid=21241863

http://www.gpugrid.net/result.php?resultid=21233480

And so on.

Could more diagnostics be added to v9.22 (cuda80) to show what caused this error, if you can't fix it instead? This appears for both short and long runs.

Moises Cardona
Send message
Joined: 7 Jun 10
Posts: 3
Credit: 208,405,467
RAC: 0
Level
Leu
Scientific publications
watwatwatwat
Message 52410 - Posted: 7 Aug 2019 | 23:48:39 UTC

Same here...

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 10,924,398,466
RAC: 15,842,573
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52411 - Posted: 8 Aug 2019 | 0:24:44 UTC

I actually got one to finish successfully:

http://www.gpugrid.net/workunit.php?wuid=16709219


I changed the date to before the license expired, right after the WU started crunching and before it crashed, and then changed it back. It's actually tricky to do, because BOINC acts strangely when the date is moved back. My two other attempts failed, so I've had enough of this.

BTW, the video card I used was a GTX 980 Ti, not the RTX 2080 Ti.
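The fact that the date trick works suggests the app compares the system clock against a built-in expiry date at startup, roughly like the Python sketch below (the expiry date shown is a placeholder assumption; the real one is not published):

from datetime import date

ASSUMED_BUILD_EXPIRY = date(2019, 8, 6)  # hypothetical expiry, for illustration only

def license_valid(today):
    # setting the clock back makes 'today' fall before the expiry again,
    # so the check passes and the task is allowed to start
    return today <= ASSUMED_BUILD_EXPIRY

print(license_valid(date.today()))      # False once the build has expired
print(license_valid(date(2019, 8, 1)))  # True with the clock set back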






Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52415 - Posted: 8 Aug 2019 | 5:39:06 UTC - in response to Message 52411.

I actually got one to finish successfully:

http://www.gpugrid.net/workunit.php?wuid=16709219

I changed the date to before the license expired, right after the WU started crunching and before it crashed, and then changed it back. It's actually tricky to do, because BOINC acts strangely when the date is moved back.

So it's clear that the license has expired.

Changing the date on the host can indeed be tricky, even more so if other BOINC projects are also running, which can get totally confused by it. That happened to me the last time the license expired; it all ended up in a total mess.

Let's hope that it won't take too long until there is a new acemd with a valid license.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,499,301,065
RAC: 9,540,197
Level
Tyr
Scientific publications
watwatwatwatwat
Message 52416 - Posted: 8 Aug 2019 | 11:58:10 UTC - in response to Message 52415.

I actually got one to finish successfully:

http://www.gpugrid.net/workunit.php?wuid=16709219

I changed the date to before the license expired, right after the WU started crunching and before it crashed, and then changed it back. It's actually tricky to do, because BOINC acts strangely when the date is moved back.

So it's clear that the license has expired.

Changing the date on the host can indeed be tricky, even more so if other BOINC projects are also running, which can get totally confused by it. That happened to me the last time the license expired; it all ended up in a total mess.

Let's hope that it won't take too long until there is a new acemd with a valid license.


I thought one of the reasons for the new app was to no longer need the license that keeps expiring, plus Turing support, in a BOINC wrapper that separates the science part from the BOINC part.

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 52418 - Posted: 8 Aug 2019 | 12:25:59 UTC

They are not using the new app yet; the reason the license expired is that it's still the old app.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,499,301,065
RAC: 9,540,197
Level
Tyr
Scientific publications
watwatwatwatwat
Message 52419 - Posted: 8 Aug 2019 | 12:43:39 UTC - in response to Message 52418.

They are not using the new app yet; the reason the license expired is that it's still the old app.


And?

I was replying to this part
"new acemd with a valid license."

The new app won't need a license from what I recall.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 753,570,933
RAC: 234,641
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52432 - Posted: 9 Aug 2019 | 12:23:31 UTC
Last modified: 9 Aug 2019 | 12:25:13 UTC

I've seen some mentions of tasks still completing properly on some rather old versions of Windows, such as Windows XP. Could some people with at least one computer with such a version give more details?

Perhaps the older versions don't include an expiration check, and therefore assume that the licence has not expired.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52433 - Posted: 9 Aug 2019 | 12:59:21 UTC

The "older versions" also include an expiration check.

However, for XP a different acemd.exe is used (running with CUDA 6.5), the license for which seems to expire at a later date. No idea exactly when; it could be tomorrow, or in a week, or next month...

GPUGRID
Send message
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwat
Message 52436 - Posted: 9 Aug 2019 | 18:40:01 UTC - in response to Message 52433.

I'm using Win XP 64 and having just errors as well.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52437 - Posted: 9 Aug 2019 | 19:32:25 UTC - in response to Message 52436.

No, you are using Windows 7 x64.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 52472 - Posted: 12 Aug 2019 | 11:20:32 UTC

Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)</message>
]]>

name e18s22_e7s95p0f111-PABLO_V4_UCB_p27_sj403_no_salt_IDP-0-2-RND0646
application Long runs (8-12 hours on fastest card)
created 8 Aug 2019 | 21:02:41 UTC
minimum quorum 1
initial replication 1
max # of error/total/success tasks 7, 10, 6
errors Too many errors (may have bug)

100% failure rate for the last three days.

marsinph
Send message
Joined: 11 Feb 18
Posts: 41
Credit: 579,891,424
RAC: 0
Level
Lys
Scientific publications
wat
Message 52474 - Posted: 12 Aug 2019 | 11:34:31 UTC

Hello everyone,
Please read the post in "news" about "expired licence".
It is not on our side, but on the server side.

The admins have already known about it for two days.

GPUGRID
Send message
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwat
Message 52500 - Posted: 13 Aug 2019 | 16:09:35 UTC - in response to Message 52437.
Last modified: 13 Aug 2019 | 16:11:10 UTC

No, you are using Windows 7 x64.

You are right, my bad. But I was having errors with the new drivers. Then I rolled back to the 378.94 driver and it's running fine now.

http://www.gpugrid.net/show_host_detail.php?hostid=413063

http://www.gpugrid.net/workunit.php?wuid=16717273

mikey
Send message
Joined: 2 Jan 09
Posts: 297
Credit: 5,835,111,115
RAC: 31,173,272
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52503 - Posted: 13 Aug 2019 | 19:49:08 UTC - in response to Message 52474.
Last modified: 13 Aug 2019 | 19:51:14 UTC

Hello everyone,
Please read the post in "news" about "expired licence".
It is not on our side, but on the server side.

The admins have already known about it for two days.


That's fixed now, but the errors continue: 2 seconds into a Pablo unit and poof, they error out. I turned off the long-run units, and it seems there aren't any short-run units available for the GPUs.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 52504 - Posted: 13 Aug 2019 | 23:42:12 UTC - in response to Message 52503.
Last modified: 14 Aug 2019 | 0:27:35 UTC

But the errors continue: 2 seconds into a Pablo unit and poof, they error out

mikey, the tasks with errors were run on a Turing-based card (GTX 1660 Ti). These GPUs are not currently supported by the ACEMD2 app.
Admins are working on the ACEMD3 app, which will support Turing-based GPUs. Hopefully it will be released soon.
There are currently no short tasks in the queue.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52534 - Posted: 27 Aug 2019 | 5:04:15 UTC

The faulty tasks seem to be back (erroring out after a few seconds):

http://www.gpugrid.net/result.php?resultid=21331546

:-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52807 - Posted: 8 Oct 2019 | 8:49:00 UTC

I had a task fail after a few seconds.

Stderr says: ERROR: file pme.cpp line 91: PME NX too small

Here is the URL: http://www.gpugrid.net/result.php?resultid=21429528

Anyone any idea what went wrong?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1618
Credit: 8,606,094,351
RAC: 16,317,055
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52808 - Posted: 8 Oct 2019 | 10:30:21 UTC - in response to Message 52807.

At least it went wrong for everyone, not just for you. A bad workunit.

WU 16799014

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52819 - Posted: 9 Oct 2019 | 11:08:06 UTC

Here is another one, from this morning, with this error message:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

http://www.gpugrid.net/result.php?resultid=21431713

Killersocke
Send message
Joined: 18 Oct 13
Posts: 53
Credit: 406,647,419
RAC: 0
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52820 - Posted: 9 Oct 2019 | 11:47:45 UTC - in response to Message 52819.
Last modified: 9 Oct 2019 | 11:48:55 UTC

Same here
http://www.gpugrid.net/result.php?resultid=21432948
http://www.gpugrid.net/result.php?resultid=21432946
http://www.gpugrid.net/result.php?resultid=21431340
http://www.gpugrid.net/result.php?resultid=21431266
http://www.gpugrid.net/result.php?resultid=21430771
...and more, all CUDA 80

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52822 - Posted: 9 Oct 2019 | 12:05:46 UTC - in response to Message 52820.
Last modified: 9 Oct 2019 | 12:06:57 UTC

Same here
http://www.gpugrid.net/result.php?resultid=21432948
http://www.gpugrid.net/result.php?resultid=21432946
http://www.gpugrid.net/result.php?resultid=21431340
http://www.gpugrid.net/result.php?resultid=21431266
http://www.gpugrid.net/result.php?resultid=21430771
...and more, all CUDA 80
Until the new app (ACEMD3) is released, you should assign this host to a venue which receives work only from the ACEMD3 queue, as the other two queues have the old client, which is incompatible with the Turing cards.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52829 - Posted: 9 Oct 2019 | 18:45:15 UTC

Obviously, the faulty tasks are back; here is the next one, from a minute ago:
http://www.gpugrid.net/result.php?resultid=21433016

This is even worse at a time when new tasks are very rare anyway :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52892 - Posted: 24 Oct 2019 | 18:57:45 UTC

The next ones:

http://www.gpugrid.net/result.php?resultid=21462742

http://www.gpugrid.net/result.php?resultid=21462460

http://www.gpugrid.net/result.php?resultid=21462682

http://www.gpugrid.net/result.php?resultid=21462715

None of them ran for even one second :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52893 - Posted: 24 Oct 2019 | 19:21:45 UTC

And here are some more:

http://www.gpugrid.net/result.php?resultid=21463119

http://www.gpugrid.net/result.php?resultid=21463047

http://www.gpugrid.net/result.php?resultid=21463010

http://www.gpugrid.net/result.php?resultid=21462974

http://www.gpugrid.net/result.php?resultid=21463183

http://www.gpugrid.net/result.php?resultid=21463207

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52894 - Posted: 24 Oct 2019 | 22:24:57 UTC - in response to Message 52893.

I think the license of the v9.22 app has expired this time.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52895 - Posted: 25 Oct 2019 | 2:58:20 UTC - in response to Message 52894.

I think the license of the v9.22 app has expired this time.

That's what I'm now suspecting, too :-(

BelgianEnthousiast
Send message
Joined: 7 Apr 15
Posts: 33
Credit: 1,201,157,375
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwat
Message 52896 - Posted: 25 Oct 2019 | 14:25:47 UTC

Any prediction when a continuous supply of new WUs will become available again?
It has been nearly a full month of very intermittent and small numbers of WUs.

Einstein is a happy project in the meantime :-)

Are all efforts being put into supporting the new 20XX cards to the detriment of the current 10XX cards? (Limited staff available maybe, or lack of funding?)

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52923 - Posted: 31 Oct 2019 | 15:48:16 UTC

This is an increasingly annoying situation:

While there are no tasks available most of the time, some of the few that do get downloaded fail after 5 seconds:

http://www.gpugrid.net/result.php?resultid=21481323

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

:-( :-( :-(

Clive
Send message
Joined: 2 Jul 19
Posts: 21
Credit: 90,744,164
RAC: 0
Level
Thr
Scientific publications
wat
Message 52928 - Posted: 4 Nov 2019 | 4:41:35 UTC

Hi:

I see this is a well-used section of the forum.

I would like to contribute some useful results here with my Alienware laptop, but I have a high failure rate which I would like to resolve. The GPU in my laptop is a GeForce 660M. The OS I am using is up-to-date Windows 10.

I would appreciate it if a tech person could narrow down the reason or reasons why I am experiencing such a high failure rate.

Clive Hunt
Canada

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52929 - Posted: 4 Nov 2019 | 5:24:31 UTC - in response to Message 52928.

I would like to contribute some useful results here with my Alienware laptop

I am afraid that laptop GPUs are not made for this kind of load :-(

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 0
Level
Lys
Scientific publications
watwat
Message 52932 - Posted: 4 Nov 2019 | 7:06:53 UTC

My Dell G7 15 laptop is happily crunching. It's another matter that I have to give it a blast of air every day to get the dust out.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 52933 - Posted: 4 Nov 2019 | 7:59:05 UTC - in response to Message 52928.
Last modified: 4 Nov 2019 | 8:02:22 UTC

Hi:

I see this is a well-used section of the forum.

I would like to contribute some useful results here with my Alienware laptop, but I have a high failure rate which I would like to resolve. The GPU in my laptop is a GeForce 660M. The OS I am using is up-to-date Windows 10.

I would appreciate it if a tech person could narrow down the reason or reasons why I am experiencing such a high failure rate.

Clive Hunt
Canada


The issue is with the scheduler on the GPUGRID servers. The scheduler is sending CUDA65 tasks to your laptop, all of which will fail due to an expired license (server end).
Your laptop can process CUDA80 tasks, but you are at the mercy of the scheduler. For most hosts it sends the correct tasks; for a handful of hosts it is sending the wrong ones.
This issue tends to affect Kepler GPUs (600-series), even though they are still supported.
Some relevant posts discussing this issue are here:
http://www.gpugrid.net/forum_thread.php?id=5000&nowrap=true#52924
http://www.gpugrid.net/forum_thread.php?id=5000&nowrap=true#52920

The project is in the middle of changing the application to a newer version; hopefully when the new application (ACEMD3) is released, these issues will be smoothed out.
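The behaviour described above is, in effect, a plan-class style decision on the scheduler; a rough Python illustration of the idea (the function, names and thresholds are assumptions, not GPUGRID's actual scheduler code):

def pick_app_build(compute_capability, driver_cuda):
    # which build a host is offered depends on its GPU generation and driver;
    # Turing cards are only handled by the new ACEMD3 app
    if compute_capability >= 7.5:
        return "acemd3"
    if driver_cuda >= 8.0:
        return "acemd2 (cuda80)"
    # older drivers fall back to the cuda65 build, whose licence had expired
    return "acemd2 (cuda65)"

print(pick_app_build(3.0, 9.1))  # a GeForce 660M host should be offered the cuda80 build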

Erich56
Send message
Joined: 1 Jan 15
Posts: 1131
Credit: 9,914,107,676
RAC: 32,603,125
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 52934 - Posted: 4 Nov 2019 | 8:34:40 UTC - in response to Message 52933.

...hopefully when the new application (ACEMD3) is released...

I am curious WHEN this will be the case.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 52935 - Posted: 4 Nov 2019 | 13:23:47 UTC - in response to Message 52934.

...hopefully when the new application (ACEMD3) is released...

I am curious WHEN this will be the case.

I think you speak for all of us on this point....

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1335
Credit: 7,516,867,459
RAC: 13,855,815
Level
Tyr
Scientific publications
watwatwatwatwat
Message 52936 - Posted: 4 Nov 2019 | 15:58:39 UTC - in response to Message 52935.

I thought at one point, when I saw the acemd2 long-task buffer dwindle, that it was in preparation for the project deprecating the acemd2 applications and moving on to the new acemd3 applications.

But then they added a lot more acemd2 tasks to the buffer and now the acemd3 tasks have dwindled down to nothing.

Just the opposite of what I expected. Who knows what is up with the project? It seems like a lot of wasted effort developing and testing the new acemd3 app, which finally removes the yearly aggravation of expired licenses, when no significant amount of acemd3 work has appeared to show the project is back in gear.

chenshaoju
Send message
Joined: 28 Dec 18
Posts: 3
Credit: 19,316,371
RAC: 0
Level
Pro
Scientific publications
watwat
Message 52943 - Posted: 7 Nov 2019 | 3:45:04 UTC

Sorry for my English.

I don't know why most of my tasks have failed for about a month:
http://www.gpugrid.net/results.php?hostid=495250

Looking into the tasks, some of them failed for other users too.
http://www.gpugrid.net/workunit.php?wuid=16845665
http://www.gpugrid.net/workunit.php?wuid=16845047
http://www.gpugrid.net/workunit.php?wuid=16842720
http://www.gpugrid.net/workunit.php?wuid=16837588
http://www.gpugrid.net/workunit.php?wuid=16835265
http://www.gpugrid.net/workunit.php?wuid=16833172

IMHO, the program may have some issue.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 52944 - Posted: 7 Nov 2019 | 4:48:40 UTC - in response to Message 52943.

Sorry for my English.

I don't know why most of my tasks have failed for about a month:
http://www.gpugrid.net/results.php?hostid=495250

Looking into the tasks, some of them failed for other users too.
http://www.gpugrid.net/workunit.php?wuid=16845665
http://www.gpugrid.net/workunit.php?wuid=16845047
http://www.gpugrid.net/workunit.php?wuid=16842720
http://www.gpugrid.net/workunit.php?wuid=16837588
http://www.gpugrid.net/workunit.php?wuid=16835265
http://www.gpugrid.net/workunit.php?wuid=16833172

IMHO, the program may have some issue.


This post here applies to your issues as well: http://www.gpugrid.net/forum_thread.php?id=4954&nowrap=true#52933

chenshaoju
Send message
Joined: 28 Dec 18
Posts: 3
Credit: 19,316,371
RAC: 0
Level
Pro
Scientific publications
watwat
Message 52952 - Posted: 9 Nov 2019 | 7:25:05 UTC - in response to Message 52944.

Thank you.

chenshaoju
Send message
Joined: 28 Dec 18
Posts: 3
Credit: 19,316,371
RAC: 0
Level
Pro
Scientific publications
watwat
Message 53135 - Posted: 27 Nov 2019 | 2:53:49 UTC

Sorry for my English.

After updating to "New version of ACEMD v2.10 (cuda101)", my first task still failed.

http://www.gpugrid.net/result.php?resultid=21504825

Is my graphics card too old for this? :\

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53145 - Posted: 27 Nov 2019 | 7:42:34 UTC - in response to Message 53135.

Is my graphics card too old for this? :\
Yes.
