Message boards : Number crunching : Almost every GPUGRID task failing at 100%

Kaddaman
Joined: 22 Mar 20
Posts: 4
Credit: 27,100,055
RAC: 330,517
Message 54102 - Posted: 26 Mar 2020 | 23:51:47 UTC

Hello fellow crunchers!

I am pretty new to crunching, started 4 days ago with Rosetta@Home and GPUGRID, contributing to grcpool. I mainly left the PC alone and it crunched almost 24h every day and I didn't really pay attention to BOINC.

Recently I noticed that a GPUGRID task put out a computation error at 100%, so I looked further into it. I went through the log and, to my surprise, almost every WU was being cancelled and reported a computation error.


The logs look like this:

26.03.2020 13:09:09 | GPUGRID | Computation for task 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0 finished

26.03.2020 13:09:09 | GPUGRID | Output file 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0_0 for task 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0 absent

26.03.2020 13:09:09 | GPUGRID | Output file 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0_9 for task 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0 absent

26.03.2020 13:10:40 | GPUGRID | Sending scheduler request: To report completed tasks.

26.03.2020 13:10:40 | GPUGRID | Reporting 1 completed tasks

26.03.2020 13:10:40 | GPUGRID | Requesting new tasks for NVIDIA GPU


Another example:

26.03.2020 14:28:29 | GPUGRID | Computation for task 3vu1B03_348_1-TONI_MDADpr4sv-2-10-RND7340_0 finished

26.03.2020 14:28:29 | GPUGRID | Output file 3vu1B03_348_1-TONI_MDADpr4sv-2-10-RND7340_0_0 for task 3vu1B03_348_1-TONI_MDADpr4sv-2-10-RND7340_0 absent

26.03.2020 14:28:29 | GPUGRID | Output file 3vu1B03_348_1-TONI_MDADpr4sv-2-10-RND7340_0_9 for task 3vu1B03_348_1-TONI_MDADpr4sv-2-10-RND7340_0 absent

26.03.2020 14:28:29 | GPUGRID | Starting task 3ncvA01_379_0-TONI_MDADpr4sn-2-10-RND1169_0


This occurs with almost every WU; finding one in the log that (seemingly) finished correctly was harder than finding one that failed. For the failed ones, the log goes from "Starting task" to "Computation for task ... finished" to the upload, with the "Output file ... absent" messages in between.
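For anyone who wants to check their own event log the same way, here is a minimal sketch in Python. It assumes only the line format quoted above (`date time | project | message`) and the two message texts shown there; the names `summarize` and `SAMPLE_LOG` are made up for this illustration and are not part of BOINC. An "absent" message is treated here as a sign of a failed task because that is how it appeared in this log, not because BOINC guarantees it.

```python
import re
from collections import defaultdict

# Event-log line format as quoted above: "26.03.2020 13:09:09 | GPUGRID | <message>"
LINE_RE = re.compile(r"^(\S+ \S+) \| ([^|]+?) \| (.*)$")
FINISHED_RE = re.compile(r"^Computation for task (\S+) finished$")
ABSENT_RE = re.compile(r"^Output file \S+ for task (\S+) absent$")

# Sample taken from the log lines posted in this thread.
SAMPLE_LOG = """\
26.03.2020 13:09:09 | GPUGRID | Computation for task 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0 finished
26.03.2020 13:09:09 | GPUGRID | Output file 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0_0 for task 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0 absent
26.03.2020 13:09:09 | GPUGRID | Output file 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0_9 for task 3e3xA02_320_1-TONI_MDADpr4se-2-10-RND2806_0 absent
26.03.2020 13:10:40 | GPUGRID | Sending scheduler request: To report completed tasks.
26.03.2020 13:10:40 | GPUGRID | Reporting 1 completed tasks
"""


def summarize(log_text, project="GPUGRID"):
    """Return {task_name: number_of_absent_output_files} for finished tasks."""
    finished = []
    absent = defaultdict(int)
    for raw in log_text.splitlines():
        line = LINE_RE.match(raw.strip())
        if not line or line.group(2).strip() != project:
            continue  # not an event-log line for the project we care about
        message = line.group(3).strip()
        done = FINISHED_RE.match(message)
        if done:
            finished.append(done.group(1))
            continue
        missing = ABSENT_RE.match(message)
        if missing:
            absent[missing.group(1)] += 1
    # Tasks with a count of 0 finished without any "absent" message.
    return {task: absent.get(task, 0) for task in finished}


if __name__ == "__main__":
    for task, missing_files in summarize(SAMPLE_LOG).items():
        status = "suspect" if missing_files else "looks ok"
        print(f"{task}: {missing_files} output file(s) absent ({status})")
```

To use it on a real log, copy the event log text out of BOINC Manager into a file and pass the file's contents to `summarize`.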

Maybe it is helpful to mention that every GPUGRID WU takes about 1 to 1.5 hours to complete, while Rosetta's WUs take 5 to 10 hours.

This PC has an AMD Ryzen 5 3600 and an NVIDIA GeForce RTX 2070, so pretty recent and decent hardware. I have 16 GB RAM, almost 1 TB of SSD space free and am running Windows 10 Education 64 bit.

At grcpool, I assigned this host to Rosetta@Home and GPUGRID; both are at 100% resource share. In the hosts section, I have a link at Rosetta@Home which redirects me to "task details"; this link is not present at GPUGRID.

At the hosts overview, on the left of the PC's name there is a dropdown-menu with a "1", clicking it displays my contribution for Rosetta@Home; no GPUGRID there.

I have another host, my notebook with an i7 4710HQ and a GTX 850M, here the "task details"-link is present at both projects and both projects are shown at the overview.


I don't want my GPU to constantly compute at 100% power while just producing errors. Am I doing something wrong? Is my hardware not supported? Did I just configure something wrong?



Kaddaman
Message 54103 - Posted: 27 Mar 2020 | 1:14:01 UTC - in response to Message 54102.

I think I may have fixed it. I noticed a message that the GPUGRID host already exists in the pool. I figured it had to do with the name of my computer. I removed my host in grcpool, cleaned my PC of everything BOINC-related, renamed my PC and restarted. I set up BOINC again, connected my PC with its new name to grcpool, attached the projects, and this time didn't get the message that the host already exists in the pool. In grcpool, I now also have the link to the task details at GPUGRID, so I assume it is attached correctly now.
I will leave it working overnight; tomorrow will show whether everything is set up right.

Aurum
Joined: 12 Jul 17
Posts: 247
Credit: 9,673,632,266
RAC: 6,482,620
Message 54113 - Posted: 27 Mar 2020 | 14:48:41 UTC
Last modified: 27 Mar 2020 | 15:39:21 UTC

Good work Kaddaman. My 1080s & 2080s don't seem to be having any problems. I assume you have the latest Nvidia graphics driver circa 444.75.

Keith Myers
Joined: 13 Dec 17
Posts: 500
Credit: 477,670,253
RAC: 1,496,188
Message 54118 - Posted: 27 Mar 2020 | 15:16:59 UTC

Some of your errors are caused by BOINC's "finish file present" error.

This is because of your old client. You would need to get a newer client, where the bug has been fixed, to eliminate those errors.

The error is caused by your computer being too busy so BOINC can't clean up its files when stopping itself or when you shut the computer down.

You could try the latest BOINC client.
https://boinc.berkeley.edu/dl/boinc_7.16.5_windows_x86_64.exe

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 947
Credit: 4,353,973
RAC: 930
Message 54119 - Posted: 27 Mar 2020 | 16:08:40 UTC - in response to Message 54118.

Remarkable detective-ing, people

Kaddaman
Message 54270 - Posted: 7 Apr 2020 | 0:01:59 UTC - in response to Message 54118.

Even after setting everything up, I got some computing errors. I am thinking the WUs were somehow bad since I absolutely can't imagine what should have been wrong with my PC.

Anyway, I gathered some GRC and started solo crunching. Everything works perfectly now; I haven't seen a computing error since I started.

Some of your errors are caused by BOINC's "finish file present" error.

This is because of your old client. You would need to get a newer client, where the bug has been fixed, to eliminate those errors.

The error is caused by your computer being too busy so BOINC can't clean up its files when stopping itself or when you shut the computer down.

You could try the latest BOINC client.
https://boinc.berkeley.edu/dl/boinc_7.16.5_windows_x86_64.exe

I don't know... This version is a development version and, according to the BOINC website, may be unstable and should only be used for testing. I've read some posts which recommend installing the version that the BOINC website lists as recommended (7.14.2).


Since I started crunching by myself, I haven't seen any errors, so it seems to be working now. If something hits the fan again, I will try the development version.

Keith Myers
Message 54273 - Posted: 7 Apr 2020 | 3:10:01 UTC - in response to Message 54270.

That is just the usual disclaimer to cover their butt. It means there may be as yet undiscovered bugs in the "test" version. The "use at your own discretion" disclaimer applies.

If it has made it to the download page, there are no "showstopper" bugs or they would have pulled it immediately.

They did in fact pull the 7.16.4 version immediately because of a showstopper bug. The 7.16.5 is the bugfixed version.

The "test" versions are normally perfectly fine and usable. Desirable in fact because they have the most current fixes in place.

Like the fix for the "finish file present" bug that affects 7.14.2.

I run the latest 7.17.0 code branch with no issues. I have always run the latest code branch with no issues. I like testing for the developers, and I always seem to find the rare corner cases that are worth raising a bug issue for.

If you want to see the current state of development, read the issues tab at:

https://github.com/BOINC/boinc/issues

Keith Myers
Message 54274 - Posted: 7 Apr 2020 | 3:13:58 UTC - in response to Message 54270.

I am thinking the WUs were somehow bad

Yes, there have been a few badly formatted tasks lately.
If in doubt over whether your hardware is acting up, just look at the WU on the tasks page and see how many times the task has been resent because of errors on other hosts. If the task has been resent 7 times and errored out, it will be pulled from distribution because it is a bad task.

https://www.gpugrid.net/workunit.php?wuid=18610278
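
The check described above can be written down as a small rule of thumb. This is only a sketch: the function name `looks_like_bad_workunit`, the host IDs, and the threshold of two failures on other hosts are hypothetical choices for illustration, and the results are assumed to be copied by hand from the workunit page rather than fetched from any GPUGRID API.

```python
def looks_like_bad_workunit(results, my_host_id, min_other_failures=2):
    """Guess whether errors on a workunit point at the task rather than at my PC.

    results: list of (host_id, succeeded) pairs, one per result listed on the
             workunit page (entered by hand; no web request is made here).
    Returns True when at least `min_other_failures` other hosts also failed
    and no other host succeeded. The threshold is an arbitrary assumption.
    """
    others = [succeeded for host_id, succeeded in results if host_id != my_host_id]
    other_failures = sum(1 for succeeded in others if not succeeded)
    return other_failures >= min_other_failures and not any(others)


if __name__ == "__main__":
    # Hypothetical host IDs: my host is 1; hosts 2 and 3 also errored out.
    print(looks_like_bad_workunit([(1, False), (2, False), (3, False)], my_host_id=1))
    # Here host 2 completed the task, so my own machine is the likelier suspect.
    print(looks_like_bad_workunit([(1, False), (2, True)], my_host_id=1))
```

If the function returns False because another host finished the same task successfully, that is a hint to look at your own driver, client version, or hardware instead.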
