Message boards : Number crunching : High Failure Rate of SANTI Tasks

Author Message
Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 36863 - Posted: 18 May 2014 | 20:47:59 UTC

I am creating this thread because the other SANTI thread has been closed.

Basically every SANTI task I've gotten has failed! ALL these tasks have also failed at least once on another host, although many of them have succeeded in the end.

I think the project people must take a look.
____________

Vagelis Giannadakis
Message 37032 - Posted: 9 Jun 2014 | 8:17:28 UTC

I keep having many SANTIs failing on my 750Ti on Linux. Some of them succeed, but most of them fail. They run completely OK on my 650Ti on the same computer. Many of them fail on other systems as well, not just on mine. Eventually, they seem to locate a system of their liking and succeed! Two recent examples (my computer being 171276):
http://www.gpugrid.net/workunit.php?wuid=8275005
http://www.gpugrid.net/workunit.php?wuid=8281764

The recently failed SANTIs on my 750Ti:
http://www.gpugrid.net/result.php?resultid=11483800
http://www.gpugrid.net/result.php?resultid=11256545
http://www.gpugrid.net/result.php?resultid=11106610
http://www.gpugrid.net/result.php?resultid=11103836
http://www.gpugrid.net/result.php?resultid=11103808

I know my card is not a dud, since I tried it with Einstein and it worked OK. Something else must be wrong. I'm using the newest driver for Linux, 337.25. The errors existed with older versions as well, 334.21 and 331.49.

Can a project researcher / engineer please take a look? I'll be glad to provide more information or try remedies.
____________

Vagelis Giannadakis
Message 37043 - Posted: 12 Jun 2014 | 9:42:53 UTC

I switched my 750Ti to Einstein for about a day and a half just to make sure the card is OK. I crunched 10 WUs successfully, so I am positive my card is OK.

I switched back to GPUGRID yesterday evening and crunched a NOELIA_MG1EC. Then I got another SANTI_p53final, which errored out about 2 hours in. And take a look at this guy: http://www.gpugrid.net/workunit.php?wuid=8259035

Something is definitely wrong with these WUs!
____________

ConflictingEmotions
Joined: 6 Jan 09
Posts: 4
Credit: 151,278,745
RAC: 0
Level
Ile
Message 37046 - Posted: 12 Jun 2014 | 14:27:21 UTC - in response to Message 37043.

I hope you realized that one of the failed WUs was completed by a 750Ti under Windows. There is some indication that GPU memory is a factor for some GPUGRID WUs. For the last one you linked, it will be interesting to see whether that 780Ti can complete it, since it has more memory available than the other cards.

Now, looking at your 750Ti host, it has 2 GPUs. Judging from the forum, such setups are not easy to troubleshoot, but you should look up the suggestions made for similar systems. However, you have a high failure rate, so I think the problem is with your system, probably related to overheating.

The main things that I would suggest are:
Upgrade the Linux kernel and GPU driver if you can.
Monitor GPU and system temps - a newer NVIDIA driver is better for that under Linux. If they get too hot, then you know you need better cooling or to downclock the GPUs.
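
One way to follow the temperature-monitoring suggestion is to poll `nvidia-smi` and flag any GPU at or above a limit. The following is only a minimal sketch: the 80C limit is an assumed value, not one given in this thread, the helper names are made up for illustration, and it assumes the installed driver's `nvidia-smi` supports the `--query-gpu` option and reports temperature for the cards in question.

```python
import subprocess

# Assumed alert limit in degrees Celsius; not a value taken from this thread.
TEMP_LIMIT_C = 80


def query_nvidia_smi():
    """Return one 'index, name, temperature' CSV line per GPU from nvidia-smi."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,name,temperature.gpu",
        "--format=csv,noheader,nounits",
    ]
    return subprocess.check_output(cmd, universal_newlines=True)


def parse_gpu_temps(csv_text):
    """Turn the CSV text into a list of (index, name, temp_c) tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # not a complete reading
        index, name, temp = parts
        if not (index.isdigit() and temp.isdigit()):
            continue  # e.g. "[Not Supported]" instead of a number
        readings.append((int(index), name, int(temp)))
    return readings


def hot_gpus(readings, limit_c=TEMP_LIMIT_C):
    """Return only the readings at or above the limit."""
    return [r for r in readings if r[2] >= limit_c]
```

On a host with the NVIDIA driver installed, `hot_gpus(parse_gpu_temps(query_nvidia_smi()))` would list the GPUs currently at or above the limit; running it every minute or so while a WU crunches shows whether failures line up with temperature peaks.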

Some other suggestions of things that you can experiment with:
Use only one GPU - one valid WU is better than 2 failed WUs. Also switch cards to ensure both work correctly.
Run under the short queue to see what happens (if it is overheating, then hopefully WUs complete before temps get too high).
Run only GPU work, so the CPU can feed the GPUs and does not add to system heat.

From that you should get some clue of what is happening with your system and what you can do about it.


Vagelis Giannadakis
Message 37048 - Posted: 12 Jun 2014 | 17:05:09 UTC - in response to Message 37046.

My setup handles heat quite well. To give you an idea, with ambient temperatures at ~30C my GPUs and CPU max out at ~70C. I've invested a lot of time in building my crunching rig for heat and I think it works fine. It's also pretty quiet; you can easily sit next to it. The heat it emits is not easy to tolerate though, that top exhaust fan is like a small oven!

I also don't think it's the memory size, since my 650Ti crunches everything like a boss and it has half the memory of the 750Ti.

I tried three different NVIDIA driver versions to no avail. Upgrading the kernel is a good idea and I think it's the next thing I will be trying.

Another suspect in my mind is the motherboard. The second PCIEx16 slot on my ASUS P7P55D-E runs at x4. Over at Einstein this made the 750Ti perform really slowly, and maybe it causes stability problems with GPUGRID as well.

I don't think I can avoid fiddling with the hardware again: swapping cards between slots, leaving just the 750Ti in there, etc.
____________

ConflictingEmotions
Message 37054 - Posted: 13 Jun 2014 | 16:18:24 UTC - in response to Message 37048.

Other systems are completing those failed WUs, so it is not the WUs. (Okay, there are occasionally bad batches, which usually get sorted out very quickly.) Hopefully you figure out what the problem is with your system.

Vagelis Giannadakis
Message 37055 - Posted: 13 Jun 2014 | 18:00:19 UTC

Today I upgraded the kernel to the latest one available for Ubuntu 12.04, 3.13. I then got two NOELIA_TRPS1S4 tasks; the one assigned to the 750Ti failed after ~1100 sec... The card then got a SANTI_p53final and is still crunching it ~10 hours later. Maybe it will complete it (knocking on wood!)

I also discovered today that the 4 lanes of my motherboard's second PCIEx16 slot come from the chipset, not from the CPU. I don't know if that could cause the errors, but I guess the next thing to test would be to take out the 650Ti and leave only the 750Ti in, on the primary PCIEx16 slot.
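
For anyone wanting to confirm what link width each card has actually negotiated, the output of `lspci -vv` (run as root) includes `LnkCap:` and `LnkSta:` lines for PCIe devices, giving the maximum and current link width. The following is a rough sketch of a parser for that output; the function names are made up for illustration. A card reported below its maximum width is only a hint about the slot it sits in, not proof that the slot causes the errors.

```python
import re

# Device header lines look like "01:00.0 VGA compatible controller: ...",
# optionally prefixed with a PCI domain such as "0000:".
DEVICE_RE = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]\s+(.*)$")
WIDTH_RE = re.compile(r"Width x(\d+)")


def parse_link_widths(lspci_text):
    """Collect the max (LnkCap) and current (LnkSta) link width per device."""
    devices = []
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = {"device": header.group(1),
                       "max_width": None,
                       "current_width": None}
            devices.append(current)
            continue
        if current is None:
            continue
        stripped = line.strip()
        width = WIDTH_RE.search(stripped)
        if width is None:
            continue
        if stripped.startswith("LnkCap:"):
            current["max_width"] = int(width.group(1))
        elif stripped.startswith("LnkSta:"):
            current["current_width"] = int(width.group(1))
    return devices


def reduced_width_devices(devices):
    """Return devices whose current link width is below their maximum."""
    return [d for d in devices
            if d["max_width"] and d["current_width"]
            and d["current_width"] < d["max_width"]]
```

Feeding it the saved output of `lspci -vv` and passing the result to `reduced_width_devices` would list any card running on fewer lanes than it supports, such as an x16 card in a slot wired for x4.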
____________

pvh
Joined: 17 Mar 10
Posts: 23
Credit: 1,170,567,145
RAC: 66,756
Level
Met
Message 37060 - Posted: 15 Jun 2014 | 21:16:34 UTC

I also had two WUs from SANTI that failed with the message

The simulation has become unstable. Terminating to avoid lock-up (1)


This card runs other WUs just fine, so I don't think it is a hardware or driver problem.

http://www.gpugrid.net/result.php?resultid=12188313
http://www.gpugrid.net/result.php?resultid=12529134

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 37073 - Posted: 17 Jun 2014 | 20:24:55 UTC

I'm running 7 EVGA & PNY factory OCed 750 Ti cards in Win7-64 and have had only 1 error:

http://www.gpugrid.net/workunit.php?wuid=8906204

which appears to be a bad WU. Many, many SANTI WUs have completed successfully on these cards. As I have asked twice before, can you try your card on a Windows box? I suspect the card has problems, but testing it in a different environment will give you some answers.

Vagelis Giannadakis
Message 37074 - Posted: 18 Jun 2014 | 7:52:19 UTC - in response to Message 37073.

Hey Beyond, thanks for your response! Yes, I have concluded it is not the SANTIs after all, but something with my system. The card has failed with all sorts of WUs, but I have made some interesting observations. Take a look at the 750TI-650TI Combo on Linux thread, where I am continuing this discussion, as it is not a matter of SANTIs any more and I don't want to keep this thread at the head of the Number crunching section. Your input is always welcome and appreciated!
____________
