Lots of failures

Message boards : Number crunching : Lots of failures

Author	Message
Peter Schon Joined: 12 Sep 10 Posts: 2 Credit: 1,137,445 RAC: 0	Message 18935 - Posted: 15 Oct 2010 | 4:35:46 UTC
	Hi there,
	Maybe somebody has an answer. I get a lot of failures with this message:
	MDIO ERROR: cannot open file "restart.coor"
	SWAN : FATAL : Failure executing kernel sync [transpose_float2] [999]
	Assertion failed: 0, file swanlib_nv.cpp, line 121
	This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
	Does somebody have an answer? I am using the latest graphics card driver. Hints are welcome.
	Peter

MarkJ Volunteer moderator Volunteer tester Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0	Message 18936 - Posted: 15 Oct 2010 | 8:20:02 UTC - in response to Message 18935.
	> Hi there Maybe somebody has an answer, I get a lot of failures with this message
	> MDIO ERROR: cannot open file "restart.coor"
	> SWAN : FATAL : Failure executing kernel sync [transpose_float2] [999]
	> Assertion failed: 0, file swanlib_nv.cpp, line 121
	I see most of them are the TONI_KKi4 work units. There is a separate message thread about them. There were some issues with them and a lot fail immediately when they start; there is nothing you can do about that.
	One thing you need to check is that you have disabled multi-GPU mode via the Nvidia control panel. You could also run the CUDA memory checker to see if your card has issues; you'll need to test both GPUs of the GTX295.
	____________
	BOINC blog

Peter Schon Joined: 12 Sep 10 Posts: 2 Credit: 1,137,445 RAC: 0	Message 18938 - Posted: 15 Oct 2010 | 15:29:48 UTC - in response to Message 18936.
	Hi there,
	That's a good hint. I have SLI enabled. I will switch it off and see what happens. Thanks for the quick reply; I will post the result.
	Regards,
	Peter

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 920,275,294 RAC: 0	Message 19177 - Posted: 1 Nov 2010 | 17:39:35 UTC
	I thought I might have tracked down one of the problem sources specific to GPUGrid computation errors. I set up a new installation:
	Windows 7, 64-bit
	Latest Nvidia driver
	9800GT video card
	Latest version of the BOINC client
	NO other GPU projects
	The first 3 units processed fine. The 4th unit failed after 8 or 9 hours with a computation error.
	I figure I controlled for processing contention with other GPU projects, for the latest driver install, and for the most recently released BOINC client. Yet I still encountered a computation error on a GPUGrid work unit -- worse, the error was 8-plus hours in.
	I had previously encountered errors running 9800GT cards in Windows XP, with various versions of the Nvidia drivers including the current driver, and various versions of the BOINC client from 6.10.18 up to the current 6.10.56. In those cases other GPU-supporting projects were installed. So I thought perhaps the problem was not (since I controlled for it) the BOINC client version, the Nvidia driver version, or the OS (I saw this on XP, Vista and Win7).
	I'd note that on none of these multi-GPU-project configurations did I see long-run computation errors on other projects (Collatz, SETI, Dnetc). I'd see some errors with other projects, but they were 'efficient' errors (i.e. within a few minutes of processing).
	Based on my sampling, I really suspect that at least a piece of the computation error problem comes from the source of the work units. Either they, by design, run long enough for computational errors to surface (other GPU projects don't run more than a few hours, often less than an hour, on the same GPU, while GPUGrid work units run 20 hours or so), or there is some other work unit 'weakness' which makes them significantly more subject to 'long-run' computation errors than ANY of the other GPU projects currently available in the BOINC world.
	I know that the GPUGrid project isn't intentionally doing this, and I would like to support the project with some of my available 9800GT processing power, but at this point it seems a rather inefficient use of GPU processing power for me. Given the traffic here, I must assume that very few other current users encounter computation errors, as I don't see a LOT of traffic regarding it. Perhaps the existing active base for GPUGrid has different (and faster) GPUs they are working with -- I don't know.
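A rough way to see why run length alone could matter, as the post suggests: if errors struck independently at some fixed hourly rate, the chance of losing a task would grow with its duration. The sketch below is Python, and both the independence assumption and the 1% hourly rate are hypothetical values chosen for illustration; nothing in this thread measures the real rate.

```python
def failure_probability(hourly_error_rate: float, run_hours: float) -> float:
    """Chance of at least one error during a run.

    Assumes (hypothetically) that errors in different hours are independent
    and equally likely, so survival compounds hour by hour.
    """
    return 1.0 - (1.0 - hourly_error_rate) ** run_hours


# Hypothetical rate, picked only to make the comparison visible.
HYPOTHETICAL_HOURLY_RATE = 0.01

for hours in (1, 8, 20):
    p = failure_probability(HYPOTHETICAL_HOURLY_RATE, hours)
    print(f"{hours:>2} h run: {p:.1%} chance of at least one error")
```

Under these assumptions a 20-hour work unit is far more likely to be hit than a one-hour task on the same card, and it also discards more computed hours when it fails. This does not distinguish the 'long run' explanation from a subtype-specific weakness; it only shows that the first is plausible on arithmetic grounds.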

Richard Haselgrove Joined: 11 Jul 09 Posts: 1576 Credit: 5,862,711,851 RAC: 10,024,449	Message 19178 - Posted: 1 Nov 2010 | 18:00:57 UTC - in response to Message 19177.
	> First 3 units processed fine. 4th unit failed after 8 or 9 hours with a computation error. I figure I controlled for processing contention with other GPU projects, I controlled for latest driver install, I controlled for most recent released BOINC client. Yet still I encountered a computation error on a GPU Grid work unit -- worse, the error was 8 hours plus in.
	Are you looking at this task list? It seems to match that description, taken from one of your hosts running Windows 7 x64 (host ID: 83792).
	You seem to have controlled for everything else, so how about the GPUGrid task sub-type? I see one IBUCH_pYEEI and two TONI_KKi4 (all successful), and one KASHIF_HIVPR_n1: the failure, after several hours.
	That exactly matches my experience with two 9800GT cards and one 9800GTX+: I can run everything GPUGrid throws at me (from the current mix), except HIVPR_n1. I've reported it several times, but I don't go banging on about it once the point has been made.
	I'm afraid to say the 'abort' button is your friend, but only for that particular task subtype. Let somebody with a Fermi crunch them (they're fine on my GTX470).

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 920,275,294 RAC: 0	Message 19190 - Posted: 2 Nov 2010 | 4:45:38 UTC - in response to Message 19178.
	Richard, thanks for the explanation. I suppose it would be a nice feature at the project configuration level to just say no to specific task subtypes. The alternative, I suppose, is to look closely at any newly downloaded work units and abort the instant one of the 'GPUGrid doesn't like 9800s' work units drops on a given system. That's a bit of a bother for me.
	In any event, for the workstation in my specific example, I've switched from the 9800GT to an HD 4850 and (since GPUGrid doesn't do ATI GPUs) switched that workstation to Collatz and MW.
	I've posted about this sort of issue here over the months, but what you've suggested is the most responsive reply I've received. I think for me, rather than 'fight the good fight' and try to run (and manually filter) GPUGrid, I'll wait for the day when the project itself is positioned to handle these sorts of things a bit better.
	> You seem to have controlled for everything else, so how about the GPUGrid task sub-type? I see one IBUCH_pYEEI, and two TONI_KKi4 (all successful), and one KASHIF_HIVPR_n1 - the failure, after several hours. That exactly matches my experiences with two 9800GT and one 9800GTX+: I can run everything GPUGrid throws at me (from the current mix), except HIVPR_n1. I've reported it, several times, but I don't go banging on about it once the point has been made. I'm afraid to say the 'abort' button is your friend, but only for that particular task subtype. Let somebody with a Fermi crunch them (they're fine on my GTX470).
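The manual filtering described in this post could be partly scripted. The following Python sketch only builds the abort commands and does not run them. It assumes that the subtype string (for example HIVPR_n1) appears somewhere in the task name, and that the BOINC command-line tool boinccmd is installed and accepts its `--task URL task_name abort` operation; the task names in the example are made up for illustration.

```python
# Project URL as it appears elsewhere in this thread.
PROJECT_URL = "http://www.gpugrid.net/"

# Subtype reported in this thread as failing on 9800-class cards.
UNWANTED_SUBTYPES = ("HIVPR_n1",)


def tasks_to_abort(task_names, unwanted=UNWANTED_SUBTYPES):
    """Return the task names that contain any unwanted subtype string."""
    return [name for name in task_names if any(tag in name for tag in unwanted)]


def abort_commands(task_names, project_url=PROJECT_URL, unwanted=UNWANTED_SUBTYPES):
    """Build (but do not execute) one boinccmd abort command per unwanted task."""
    return [
        f"boinccmd --task {project_url} {name} abort"
        for name in tasks_to_abort(task_names, unwanted)
    ]


# Hypothetical task names; real GPUGrid names may be formatted differently.
example_tasks = [
    "example-IBUCH_pYEEI-0001",
    "example-TONI_KKi4-0002",
    "example-KASHIF_HIVPR_n1-0003",
]

for command in abort_commands(example_tasks):
    print(command)
```

Printing the commands rather than executing them leaves the final decision with the user, which matters here because the same subtype is reported as working on other cards.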

Ivailo Bonev Joined: 14 Jun 10 Posts: 5 Credit: 300,613 RAC: 0	Message 19198 - Posted: 2 Nov 2010 | 15:41:30 UTC
	Richard, my 9800GT (btw, it's overclocked by 10%) works fine on KASHIF_HIVPR_n1 with the new 6.11 app. See:
	Task 3219845
	Task 3214029
	Do you have Swan_Sync=0 in your system variables?
	____________
	Windows: "Where do you want to go today?"
	Linux: "Where do you want to go tomorrow?"
	FreeBSD: "Are you guys coming, or what?"

Richard Haselgrove Joined: 11 Jul 09 Posts: 1576 Credit: 5,862,711,851 RAC: 10,024,449	Message 19243 - Posted: 4 Nov 2010 | 15:58:10 UTC - in response to Message 19198.
	> Richard, my 9800GT (btw it's oc-ed 10%) works fine on KASHIF_HIVPR_n1, with new app 6.11. See: Task 3219845 Task 3214029
	Well, one of mine has just blown away task 3239533, but since it only thought about it for one second, I won't lose much sleep over it.
	> Do you have Swan_Sync=0 in your system variables?
	No, I prefer to keep my CPUs free for other work.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 19254 - Posted: 4 Nov 2010 | 19:40:13 UTC - in response to Message 19243. Last modified: 4 Nov 2010 | 19:41:53 UTC
	The Windows-only optional variable Swan_Sync=0 is for Fermis. It does not have to be used along with one free core, but doing so usually helps a lot. It will make little or no difference to the performance of a 9800GT. There is little need to leave a CPU core free unless you have 3 or 4 such cards in the same system, at which point your CPU performance for CPU-only tasks will degrade to the point that you might as well free up a CPU core. On a high-end Fermi it is still optional, but it is generally recommended both to use Swan_Sync=0 and to leave a core/thread free; the performance difference is quite noticeable.
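To confirm whether the variable is actually visible on a given machine, a small Python check like the one below can help. It is only a sketch: it reports what the current process sees, and the case-insensitive name match is an assumption made because the thread writes the name as Swan_Sync.

```python
import os


def swan_sync_setting(environ=os.environ):
    """Return the value of Swan_Sync if it is set (any letter case), else None."""
    for name, value in environ.items():
        if name.upper() == "SWAN_SYNC":
            return value
    return None


value = swan_sync_setting()
if value is None:
    print("Swan_Sync is not set in this environment")
else:
    print(f"Swan_Sync={value}")
```

The function takes the environment as a parameter so it can be tried against a plain dictionary, for example `swan_sync_setting({"Swan_Sync": "0"})`.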

Ivailo Bonev Joined: 14 Jun 10 Posts: 5 Credit: 300,613 RAC: 0	Message 19262 - Posted: 4 Nov 2010 | 23:24:53 UTC Last modified: 4 Nov 2010 | 23:26:12 UTC
	Thanks for the explanation, skgiven. I wondered why GPUGrid uses so much CPU time... Now I understand!
	____________
	Windows: "Where do you want to go today?"
	Linux: "Where do you want to go tomorrow?"
	FreeBSD: "Are you guys coming, or what?"

David Cappello Joined: 27 Sep 08 Posts: 3 Credit: 10,699,899 RAC: 0	Message 19797 - Posted: 7 Dec 2010 | 16:38:05 UTC
	I have a host that has dual 9800GTXs and I can't get one valid result:
	http://www.gpugrid.net/results.php?hostid=84907&offset=0&show_names=0&state=0
	What is the suggested best fix for this machine?

MarkJ Volunteer moderator Volunteer tester Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0	Message 19802 - Posted: 7 Dec 2010 | 20:25:05 UTC
	Downgrade your drivers to 197.45; this will force GPUGrid to give you the 6.12 app (assuming you run Windows). I'm not sure whether your cards support SLI, but if they do, make sure it's disabled in the Nvidia control panel.
	____________
	BOINC blog

TJ Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0	Message 20004 - Posted: 22 Dec 2010 | 22:39:34 UTC - in response to Message 19802.
	Well, for a few weeks I have been getting no new tasks from you, and I was advised to update my drivers. I did, and now I read that I have to downgrade them. That's a lot of work.
	____________
	Greetings from TJ

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 20006 - Posted: 22 Dec 2010 | 23:18:07 UTC - in response to Message 20004.
	What driver did you update to?
