Message boards : Number crunching : extremely high error rates

Erich56
Joined: 1 Jan 15
Posts: 1090
Credit: 6,603,906,926
RAC: 3,160,162
Message 47028 - Posted: 19 Apr 2017 | 5:35:45 UTC

I am just noticing on the Project Status Page that, except for GERARD_PLAYMOL_4B80IC6U, all tasks - long runs as well as short runs - have extremely high error rates, some of them even close to 70%.

I guess this may have to do with the recent problem where for about 2 days, all tasks failed right away, and hence were reported back as invalid - or are there any other reasons?

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 1,038
Message 47038 - Posted: 19 Apr 2017 | 21:58:51 UTC - in response to Message 47028.
Last modified: 19 Apr 2017 | 22:00:42 UTC

I am just noticing on the Project Status Page that, except for GERARD_PLAYMOL_4B80IC6U, all tasks - long runs as well as short runs - have extremely high error rates, some of them even close to 70%.
These error rates are slowly decreasing now, as the new applications are working fine.

I guess this may have to do with the recent problem where for about 2 days, all tasks failed right away, and hence were reported back as invalid
That is the only reason for these high error rates.

- or are there any other reasons?
Note that error rates always start out high, because failed workunits are returned much faster than successful ones.
Another consequence of this is that error rates always rise much faster than they fall.
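
To illustrate the point about return times, here is a minimal Python sketch. Every number in it (a true error rate of 10%, failures coming back after 0.1 hours, successes after 8 hours, a constant issue rate) is a made-up assumption for illustration, not a GPUGrid figure, and the function name is hypothetical.

```python
def observed_error_rate(t_hours, true_error_rate=0.10,
                        fail_hours=0.1, success_hours=8.0,
                        issue_rate=100.0):
    """Share of *returned* tasks that are errors, t_hours after a batch starts.

    Assumptions (illustrative only): tasks are issued at a constant rate from
    t = 0, a failing task is reported after fail_hours, and a successful one
    after success_hours.
    """
    # Tasks issued early enough to have been reported back by now.
    failed = true_error_rate * issue_rate * max(0.0, t_hours - fail_hours)
    succeeded = (1.0 - true_error_rate) * issue_rate * max(0.0, t_hours - success_hours)
    returned = failed + succeeded
    return failed / returned if returned else 0.0


if __name__ == "__main__":
    for t in (1, 4, 8, 16, 48, 200, 800):
        print(f"after {t:>3} h: observed error rate = {observed_error_rate(t):.1%}")
```

Under these assumptions the reported rate sits at 100% until the first successful tasks come back, and only then drifts down toward the true 10%.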

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 47041 - Posted: 20 Apr 2017 | 1:06:11 UTC - in response to Message 47038.
Last modified: 20 Apr 2017 | 1:07:47 UTC

the new applications are working fine.

... Well ... Since I have a PC that has GTX 970 alongside 2x GTX 660 Ti (SM3.0).... that means that I'm still failing a lot of tasks, until the app is fixed.

So, the "918" app is not running fine... for me.
I imagine I'm not the only one.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 1,038
Message 47058 - Posted: 21 Apr 2017 | 0:24:40 UTC - in response to Message 47041.
Last modified: 21 Apr 2017 | 0:26:45 UTC

Since I have a PC that has GTX 970 alongside 2x GTX 660 Ti (SM3.0).... that means that I'm still failing a lot of tasks, until the app is fixed.
Who knows how long that could take. Perhaps you should exclude those GPUs in your cc_config.xml in the meantime.

So, the "918" app is not running fine... for me.
I imagine I'm not the only one.
I copy my method here for you and everybody else:
Copy the following to your clipboard:
notepad c:\ProgramData\BOINC\cc_config.xml
Press Windows key + R, then paste and press enter.
If you see an empty file, copy and paste the following text:
<cc_config>
  <options>
    <exclude_gpu>
      <url>www.gpugrid.net</url>
      <device_num>1</device_num>
      <type>NVIDIA</type>
    </exclude_gpu>
  </options>
</cc_config>
The value in the <device_num> section should be adapted to the given system.
You can have as many <exclude_gpu> sections in your cc_config.xml as you have GPUs to disable.
If your cc_config.xml already has an <options> section, insert the whole block from <exclude_gpu> through </exclude_gpu> (both tags included) right after the <options> tag.
Click file -> save and click [save].
If your BOINC manager is running, you should click Options -> read config files.
You may also need to restart BOINC manager (choose to stop the scientific applications upon exiting).
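
For anyone who would rather script the edit than do it by hand, the insertion step can be sketched in Python with the standard xml.etree.ElementTree module. This is only a sketch under some assumptions: the function name add_exclude_gpu is my own, it works on the text of the file rather than touching c:\ProgramData\BOINC\cc_config.xml directly, and it does not preserve comments or the original indentation.

```python
import xml.etree.ElementTree as ET


def add_exclude_gpu(config_xml, url, device_num, gpu_type="NVIDIA"):
    """Return the cc_config.xml text with one more <exclude_gpu> block.

    add_exclude_gpu is a hypothetical helper, not part of BOINC.
    An empty input is treated like an empty file.
    """
    if config_xml.strip():
        root = ET.fromstring(config_xml)
    else:
        root = ET.Element("cc_config")

    # Reuse the existing <options> section, or create one if it is missing.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    block = ET.Element("exclude_gpu")
    ET.SubElement(block, "url").text = url
    ET.SubElement(block, "device_num").text = str(device_num)
    ET.SubElement(block, "type").text = gpu_type

    # Place the new block right after the opening <options> tag.
    options.insert(0, block)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_exclude_gpu("<cc_config><options/></cc_config>", "www.gpugrid.net", 1))
```

Call it once per GPU you want to exclude, feeding the output of one call into the next, then save the result and re-read the config files as described above.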

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 47064 - Posted: 22 Apr 2017 | 0:01:24 UTC - in response to Message 47058.

Thanks. I'm well aware of <exclude_gpu>. I am the one that requested that David A implement it into BOINC :) I'm directly responsible for its existence, originally requested to prevent certain apps from running on the primary GPU because they made my display laggy!

However, I'm not going to use it as a workaround to fix this server issue.

Instead, the tasks will continue to error on my GTX 660 Ti GPUs, until MJH and staff step up to better identify and then fix the issues.

They've hinted at some bug, but did not give appropriate info for anybody to do anything to fix it... So what is the nature of the problem?

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 1,038
Message 47068 - Posted: 22 Apr 2017 | 11:53:34 UTC - in response to Message 47064.
Last modified: 22 Apr 2017 | 11:56:15 UTC

Thanks. I'm well aware of <exclude_gpu>. I am the one that requested that David A implement it into BOINC :) I'm directly responsible for its existence, originally requested to prevent certain apps from running on the primary GPU because they made my display laggy!
I know; I intended this for the others you mentioned who have the same problem.

However, I'm not going to use it as a workaround to fix this server issue.
This is not a server issue, this is a compiler issue.
However, the server could avoid it by not sending work to hosts equipped with GTX 660 Ti cards. That policy wouldn't filter out your host, though: it also has a GTX 970, so the server doesn't know about the lesser cards in it.
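
A rough Python sketch of why such a filter would miss mixed hosts. It assumes the server only sees the most capable card of each host, as described above; the function names and the relative speed figures are hypothetical and exist only to rank the cards.

```python
BLOCKED_MODEL = "GTX 660 Ti"


def reported_model(gpus):
    """gpus is a list of (model, relative_speed) pairs for one host.

    Assumption: only the fastest card is what the server gets to see.
    """
    return max(gpus, key=lambda gpu: gpu[1])[0]


def server_would_send_work(gpus):
    """Hypothetical server-side filter keyed on the reported card only."""
    return reported_model(gpus) != BLOCKED_MODEL


# Relative speeds are illustrative placeholders, not benchmarks.
single_host = [("GTX 660 Ti", 1.0)]
mixed_host = [("GTX 970", 1.6), ("GTX 660 Ti", 1.0), ("GTX 660 Ti", 1.0)]

if __name__ == "__main__":
    print("single 660 Ti host gets work:", server_would_send_work(single_host))
    print("mixed 970 + 660 Ti host gets work:", server_would_send_work(mixed_host))
```

The mixed host passes the filter, so its tasks can still land on the GTX 660 Ti cards unless they are excluded on the client side.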

Instead, the tasks will continue to error on my GTX 660 Ti GPUs, until MJH and staff step up to better identify and then fix the issues.
I think it's an unnecessary display of protest, as there are enough unsupervised hosts to make the statistics worse.

They've hinted at some bug, but did not give appropriate info for anybody to do anything to fix it... So what is the nature of the problem?
I don't know, but it must be a nasty one, as the GTX 670 and GTX 680 (both CC3.0) are working fine with the new app.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 47070 - Posted: 22 Apr 2017 | 17:47:57 UTC - in response to Message 47068.
Last modified: 22 Apr 2017 | 17:50:43 UTC

I am seeking clarification about the CC3.0 SM3 GTX 660 Ti problem.

- Is it a problem with NVIDIA code that NVIDIA needs to fix?
- Is it a problem with GPUGrid code that GPUGrid needs to fix?
- Is it something else?

Those are the questions that, to my knowledge, MJH did not clearly answer.
I'm still waiting on answers.

I have NVIDIA contacts who could help solve it, if we get pre-confirmation from MJH that the problem is NVIDIA's problem to fix.

MJH?
