
Message boards : Number crunching : Lots of failures

Author Message
Peter Schon
Joined: 12 Sep 10
Posts: 2
Credit: 1,137,445
RAC: 0
Message 18935 - Posted: 15 Oct 2010 | 4:35:46 UTC

Hi there

Maybe somebody has an answer,

I get a lot of failures with this message



MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [transpose_float2] [999]
Assertion failed: 0, file swanlib_nv.cpp, line 121

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
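
When many tasks fail, it can help to check whether they all show the same signature. The sketch below is only an illustration: it looks for the two messages quoted above in a task's stderr text and pulls out the file name, kernel name and bracketed number. The patterns are copied from this one log and the function name `summarize_stderr` is made up for the example; other failures may look different, and the meaning of the bracketed number is not explained in this thread.

```python
import re

# Patterns copied from the error text quoted in this thread; other failure
# modes of the application may be worded differently.
MDIO_RE = re.compile(r'MDIO ERROR: cannot open file "([^"]+)"')
SWAN_RE = re.compile(
    r'SWAN : FATAL : Failure executing kernel sync \[([^\]]+)\] \[(\d+)\]'
)


def summarize_stderr(text):
    """Report which of the two known signatures appear in a stderr dump."""
    summary = {"missing_file": None, "kernel": None, "code": None}
    match = MDIO_RE.search(text)
    if match:
        summary["missing_file"] = match.group(1)
    match = SWAN_RE.search(text)
    if match:
        summary["kernel"] = match.group(1)
        # Kept as a plain number; its meaning is not documented here.
        summary["code"] = int(match.group(2))
    return summary


SAMPLE = '''MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [transpose_float2] [999]
Assertion failed: 0, file swanlib_nv.cpp, line 121
'''

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

Running this over the stderr of several failed tasks would show quickly whether they all die in the same kernel or in different places.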

Does anybody have an answer?

I am using the latest graphics card driver.

Hints are welcome.

Peter

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 18936 - Posted: 15 Oct 2010 | 8:20:02 UTC - in response to Message 18935.

Hi there

Maybe somebody has an answer,

I get a lot of failures with this message

MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [transpose_float2] [999]
Assertion failed: 0, file swanlib_nv.cpp, line 121


I see most of them are TONI_KKi4 work units. There is a separate message thread about them. There were some issues with them, and a lot fail immediately when they start; there is nothing you can do about that.

One thing you need to check is that you have disabled multi-GPU mode via the NVIDIA control panel. You could also run the CUDA memory checker to see if your card has issues; you'll need to test both GPUs of the GTX295.
____________
BOINC blog

Peter Schon
Joined: 12 Sep 10
Posts: 2
Credit: 1,137,445
RAC: 0
Message 18938 - Posted: 15 Oct 2010 | 15:29:48 UTC - in response to Message 18936.

Hi there

That's a good hint. I have SLI enabled. I will switch it off and see what happens.

Thx for the quick reply.

Will post the result.

Regards
Peter

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,275,294
RAC: 0
Message 19177 - Posted: 1 Nov 2010 | 17:39:35 UTC

I thought I might have tracked down one of the problem sources specific to GPUGrid computation errors.

I set up a new installation:

Windows 7 - 64 bit
Latest Nvidia driver
9800GT video card.
Latest version of the BOINC client

NO other GPU projects.

First 3 units processed fine. 4th unit failed after 8 or 9 hours with a computation error.

I figure I controlled for processing contention with other GPU projects, for the latest driver install, and for the most recent released BOINC client. Yet I still encountered a computation error on a GPUGrid work unit; worse, the error came more than 8 hours in.

I had previously encountered errors running 9800GT cards in Windows XP, with various versions of the NVIDIA drivers (including the current one) and various versions of the BOINC client, from 6.10.18 up to the current 6.10.56. In those cases other GPU-supporting projects were installed. So I thought the problem was perhaps not the BOINC client version, the NVIDIA driver version, or the OS (I saw this on XP, Vista and Win7), since I had controlled for those. I'd note that on none of these multi-GPU-project configurations did I see long-run computation errors on other projects (Collatz, SETI, Dnetc). I'd see some errors with other projects, but they were 'efficient' errors (i.e. within a few minutes of processing).

Based on my sampling, I really suspect that at least a piece of the computation error problem comes from the work units themselves. Either it is because they, by design, run long enough for computational errors to surface (other GPU projects don't run more than a few hours, often less than an hour, on the same GPU, while GPUGrid work units run 20 hours or so), or because there is some other work unit 'weakness' which makes them significantly more subject to 'long run' computation errors than ANY of the other GPU projects currently available in the BOINC world.
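
The first explanation (longer runs give errors more time to surface) can be put into rough numbers. The sketch below assumes, purely for illustration, that every hour of crunching carries the same small, independent chance of a fatal error; the 1% per hour figure is invented, not measured on any card.

```python
def failure_probability(per_hour_rate, hours):
    """Chance of at least one error in `hours` of runtime.

    Assumes each hour is independent and carries the same risk, which is
    a simplification made for this example.
    """
    return 1.0 - (1.0 - per_hour_rate) ** hours


if __name__ == "__main__":
    rate = 0.01  # hypothetical 1% chance of a fatal error per hour
    for hours in (1, 20):
        print(hours, "hours:", round(failure_probability(rate, hours), 3))
```

Under that assumption a 1-hour task fails about 1% of the time while a 20-hour task fails about 18% of the time, and each failure of the long task throws away far more work. That alone could make long work units look much worse on the same hardware, without any defect in the work units.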

I know that the GPUGrid project isn't *intentionally* doing this, and I would like to support the project with some of my available 9800GT processing power, but at this point, it seems rather an inefficient use of GPU processing power for me.

Given the traffic here, I must assume that very few other current users encounter computation errors, as I don't see a LOT of traffic regarding it. Perhaps the existing active base for GPUGrid has different (and faster) GPUs they are working with -- I don't know.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,862,711,851
RAC: 10,024,449
Message 19178 - Posted: 1 Nov 2010 | 18:00:57 UTC - in response to Message 19177.

First 3 units processed fine. 4th unit failed after 8 or 9 hours with a computation error.

I figure I controlled for processing contention with other GPU projects, for the latest driver install, and for the most recent released BOINC client. Yet I still encountered a computation error on a GPUGrid work unit; worse, the error came more than 8 hours in.

Are you looking at this task list? - seems to match that description, taken from one of your hosts running Windows 7 x64 (host ID: 83792).

You seem to have controlled for everything else, so how about the GPUGrid task sub-type?

I see one IBUCH_pYEEI, and two TONI_KKi4 (all successful), and one KASHIF_HIVPR_n1 - the failure, after several hours.

That exactly matches my experiences with two 9800GT and one 9800GTX+: I can run everything GPUGrid throws at me (from the current mix), except HIVPR_n1. I've reported it, several times, but I don't go banging on about it once the point has been made.

I'm afraid to say the 'abort' button is your friend, but only for that particular task subtype. Let somebody with a Fermi crunch them (they're fine on my GTX470).
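
For anyone who would rather not watch the task list by hand, the "abort only that subtype" rule can be sketched as a small filter. This is an illustration under assumptions: the task names in the usage note are made up, matching on the substring HIVPR_n1 is a guess at how the subtype appears in real names, and the `boinccmd --task URL NAME abort` form should be checked against the boinccmd documentation for your client version. The code only builds the commands; it does not run them.

```python
# Subtypes reported in this thread as failing on 9800-class cards.
# This reflects one poster's experience, not official project data.
PROBLEM_SUBTYPES = ("HIVPR_n1",)


def should_abort(task_name, problem_subtypes=PROBLEM_SUBTYPES):
    """True if the task name contains one of the problem subtype strings."""
    return any(subtype in task_name for subtype in problem_subtypes)


def abort_commands(task_names, project_url):
    """Build (but do not run) one boinccmd argument list per task to abort."""
    return [
        ["boinccmd", "--task", project_url, name, "abort"]
        for name in task_names
        if should_abort(name)
    ]
```

Called with hypothetical names such as `["a-IBUCH_pYEEI-1", "b-TONI_KKi4-2", "c-KASHIF_HIVPR_n1-3"]`, only the third is selected. Printing the commands and reviewing them before running anything is the safer way to use a filter like this.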

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,275,294
RAC: 0
Message 19190 - Posted: 2 Nov 2010 | 4:45:38 UTC - in response to Message 19178.

Richard, thanks for the explanation -- I suppose it would be a nice feature at the project configuration level to just say no to specific task subtypes.

The alternative I suppose is to look closely at any new downloaded work units, and abort the instant one of the 'GPUGrid doesn't like 9800's' specific work units drops on a given system.

That's a bit of a bother for me. In any event, for the workstation in my specific example, I've switched from the 9800GT to an HD 4850 and, since GPUGrid doesn't do ATI GPUs, switched that workstation to Collatz and MW.

I've posted about this sort of issue here over the months, but what you've suggested is the most responsive reply I've received. I think for me, rather than 'fight the good fight' and try to run (and manually filter) GPUGrid, I'll wait for the day when the project itself is positioned to handle these sorts of things a bit better.





You seem to have controlled for everything else, so how about the GPUGrid task sub-type?

I see one IBUCH_pYEEI, and two TONI_KKi4 (all successful), and one KASHIF_HIVPR_n1 - the failure, after several hours.

That exactly matches my experiences with two 9800GT and one 9800GTX+: I can run everything GPUGrid throws at me (from the current mix), except HIVPR_n1. I've reported it, several times, but I don't go banging on about it once the point has been made.

I'm afraid to say the 'abort' button is your friend, but only for that particular task subtype. Let somebody with a Fermi crunch them (they're fine on my GTX470).

Ivailo Bonev
Joined: 14 Jun 10
Posts: 5
Credit: 300,613
RAC: 0
Message 19198 - Posted: 2 Nov 2010 | 15:41:30 UTC

Richard, my 9800GT (btw it's oc-ed 10%) works fine on KASHIF_HIVPR_n1, with new app 6.11. See:
Task 3219845
Task 3214029

Do you have Swan_Sync=0 in your system variables?
____________

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,862,711,851
RAC: 10,024,449
Message 19243 - Posted: 4 Nov 2010 | 15:58:10 UTC - in response to Message 19198.

Richard, my 9800GT (btw it's oc-ed 10%) works fine on KASHIF_HIVPR_n1, with new app 6.11. See:
Task 3219845
Task 3214029

Well, one of mine has just blown away task 3239533, but since it only thought about it for one second, I won't lose much sleep over it.

Do you have Swan_Sync=0 in your system variables?

No, I prefer to keep my CPUs free for other work.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 19254 - Posted: 4 Nov 2010 | 19:40:13 UTC - in response to Message 19243.
Last modified: 4 Nov 2010 | 19:41:53 UTC

The optional, Windows-only variable Swan_Sync=0 is meant for Fermi cards. It does not have to be combined with a free CPU core, but doing so usually helps a lot. It will make little or no difference to the performance of a 9800GT. There is little need to leave a CPU core free unless you have 3 or 4 such cards in the same system, at which point your performance on CPU-only tasks will degrade so much that you might as well free up a core. On a high-end Fermi it is still optional, but generally recommended, to both set Swan_Sync=0 and leave a core/thread free; the performance difference is quite noticeable.
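
To see whether the variable is actually visible to a process, a small check along these lines may help. It is only a sketch: the helper name `find_swan_sync` is made up, and the name is compared case-insensitively because Windows treats environment variable names that way. A program normally reads its environment when it starts, so a newly set system variable would likely only be seen after the BOINC client is restarted.

```python
import os


def find_swan_sync(env=None):
    """Return the value of Swan_Sync if it is set, else None.

    The name is matched case-insensitively, since Windows does not
    distinguish Swan_Sync from SWAN_SYNC.
    """
    env = os.environ if env is None else env
    for name, value in env.items():
        if name.upper() == "SWAN_SYNC":
            return value
    return None


if __name__ == "__main__":
    value = find_swan_sync()
    if value is None:
        print("Swan_Sync is not set in this environment")
    else:
        print("Swan_Sync =", value)
```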

Ivailo Bonev
Joined: 14 Jun 10
Posts: 5
Credit: 300,613
RAC: 0
Message 19262 - Posted: 4 Nov 2010 | 23:24:53 UTC
Last modified: 4 Nov 2010 | 23:26:12 UTC

Thanks for the explanation, skgiven. I wondered why GPUGrid uses so much CPU time... Now I understand!
____________

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

David Cappello
Joined: 27 Sep 08
Posts: 3
Credit: 10,699,899
RAC: 0
Message 19797 - Posted: 7 Dec 2010 | 16:38:05 UTC

I have a host that has dual 9800GTXs and I can't get one valid result:

http://www.gpugrid.net/results.php?hostid=84907&offset=0&show_names=0&state=0

What is the suggested best fix for this machine?

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 19802 - Posted: 7 Dec 2010 | 20:25:05 UTC

Downgrade your drivers to 197.45, which will force GPUGrid to give you the 6.12 app (assuming you run Windows). I'm not sure whether your cards support SLI, but if they do, make sure it's disabled in the NVIDIA control panel.
____________
BOINC blog

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 20004 - Posted: 22 Dec 2010 | 22:39:34 UTC - in response to Message 19802.

Well, for a few weeks I have been getting no new tasks from your project and was advised to update my drivers. I did, and now I read that I have to downgrade them.
That's a lot of work.
____________
Greetings from TJ

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20006 - Posted: 22 Dec 2010 | 23:18:07 UTC - in response to Message 20004.

What driver did you update to?
