Message boards : Graphics cards (GPUs) : TONI_KIDln issues

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 6,169
Message 18319 - Posted: 11 Aug 2010 | 9:57:11 UTC

I have two almost identical PCs. One of them began failing almost every TONI_KIDln WU, while the other did not. I tried updating the BOINC manager and the NVIDIA drivers, but it didn't help.

The 1st (failing) pc's config:
MB: Asus P5Q pro (Intel P45 chipset)
CPU: Core2 Quad 6600 @ 2.4GHz (stock)
RAM: 4GB 1066MHz DDR2 Kingston HyperX
VGA: Gigabyte GV-N480D5-15I-B (stock clocking, 72-76°C)
PSU: Chieftec A135-1000W
OS: WinXP SP3 x86
Boinc MGR 6.11.4
NVidia drivers 259.31
swan_sync=0

The 2nd pc's config:
MB: Asus P5Q Deluxe (Intel P45 chipset)
CPU: Core2 Quad 9550 @ 2.83GHz (stock)
RAM: 8GB 1066MHz DDR2 Kingston HyperX
VGA: Asus ENGTX480 (stock clocking, 72-76°C)
PSU: Gigabyte Superb 720W
OS: WinXP SP3 x86
Boinc MGR 6.11.4
NVidia drivers 259.12
swan_sync=0

Any ideas?

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 18320 - Posted: 11 Aug 2010 | 11:14:16 UTC - in response to Message 18319.

The errors are not time-specific. The TONI_KIDln tasks are failing at any point, from immediately after starting to more than about 4000 seconds in.

I did see the odd other task failure too, like this one, but almost all failures were for TONI_KIDln tasks.

Using 6.11.4 and more recent drivers seems to make no difference to the error messages:

<core_client_version>6.11.4</core_client_version>
<![CDATA[
<message>
Nem megfelelő funkció. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"


<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Nem megfelelő funkció. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"

There is the odd NaN error too, but I would expect to get one of these now and then anyway.

SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 855: # Energies have become nan
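
To tally which of these messages turn up across a batch of failed results, something like the following rough sketch could help. It is only an illustration: the two patterns are copied from the stderr excerpts above, while the label names and the helper functions `classify_stderr` and `tally` are made up for this example and are not part of BOINC or the GPUGRID application.

```python
import re
from collections import Counter

# Patterns copied from the stderr excerpts in this thread.
# The label names on the left are hypothetical, chosen for this sketch.
PATTERNS = {
    "missing_restart_file": re.compile(r'MDIO ERROR: cannot open file "restart\.coor"'),
    "nan_energies": re.compile(r"Energies have become nan"),
}


def classify_stderr(text):
    """Return the set of known signatures found in one stderr dump."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}


def tally(stderr_dumps):
    """Count how many dumps contain each signature."""
    counts = Counter()
    for dump in stderr_dumps:
        counts.update(classify_stderr(dump))
    return counts


sample = """SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 855: # Energies have become nan"""

print(tally([sample]))
```

Pasting the stderr text of each failed task into the list passed to `tally` would show whether the NaN error really is the odd one out.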

Perhaps these tasks are stressing your card and finding a weakness with it, or with your system.
I would be inclined to suspend any CPU tasks, close Boinc, restart and just run GPU tasks to see if the failures continue, just in case you are crunching a greedy CPU task that hogs system memory. By any chance, did you check how much system memory was being used, and are you crunching Lattice CPU tasks?

You might also want to make a RAM testing disk, and boot to it.
http://oca.microsoft.com/en/windiag.asp

You might want to consider swapping the GPUs between systems. That would tell you a lot, but I would be inclined to test the RAM first!

PS. There is no point having 8GB RAM in your other system, running XP x86!
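
The arithmetic behind that remark, as a minimal sketch: a 32-bit address can name at most 2^32 bytes, and part of that range is also taken up by device mappings (the GPU among them), so the usable figure is lower still. The function names below are hypothetical and exist only for this illustration.

```python
GIB = 2 ** 30  # bytes in one GiB


def max_addressable_gib(address_bits):
    """Upper bound on directly addressable memory for a given pointer width."""
    return 2 ** address_bits / GIB


def unreachable_gib(installed_gib, address_bits=32):
    """Installed RAM that lies beyond the address limit (never negative)."""
    return max(0.0, installed_gib - max_addressable_gib(address_bits))


print(max_addressable_gib(32))  # 4.0
print(unreachable_gib(8))       # 4.0
```

So at least half of the 8GB sits idle under XP x86.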

Profile liveonc
Joined: 1 Jan 10
Posts: 292
Credit: 41,567,650
RAC: 0
Message 18321 - Posted: 11 Aug 2010 | 11:36:34 UTC - in response to Message 18319.
Last modified: 11 Aug 2010 | 11:39:29 UTC

I'd make a wild guess: it's your RAM. Why? Kingston HyperX is the RAM I've had the most difficulties with. It needs such a high RAM voltage compared with so many others just to get stable, sometimes so much that the mainboard starts freaking out. I can't imagine 8GB (4x2GB) is fun or easy; I've had problems with 4GB (2x2GB) of that RAM on Asus, MSI, & XFX mainboards. They're "touchy" DDR2 at 1066MHz, so it's better to lower them to 800MHz or increase the RAM voltage, which "may" also require you to increase the NB voltage.

Profile Retvari Zoltan
Message 18324 - Posted: 11 Aug 2010 | 20:14:22 UTC - in response to Message 18320.

> I would be inclined to suspend any CPU tasks, close Boinc, restart and just run GPU tasks to see if the failures continue;

I'll try this.

> By any chance, did you check how much system memory was being used, and are you crunching Lattice CPU tasks?

About 1.5GB used. I run 3 Rosetta tasks simultaneously; they consume about 250~400MB each. But that's true for my other system too. I don't crunch Lattice at all.

> You might also want to make a RAM testing disk, and boot to it.
> http://oca.microsoft.com/en/windiag.asp

I have Vista x64 and Win 7 x64 on both systems, so I can run their RAM test, but that'll be the last thing :)

> You might want to consider swapping the GPUs between systems. That would tell you a lot

That came to my mind too. I was just hoping there would be some other (simpler) way to figure it out.

> PS. There is no point having 8GB RAM in your other system, running XP x86!

I know :) I play on the same PC sometimes, on Win 7 x64. This was my default operating system until I started crunching...

Profile Retvari Zoltan
Message 18325 - Posted: 11 Aug 2010 | 20:31:01 UTC - in response to Message 18321.
Last modified: 11 Aug 2010 | 20:31:29 UTC

> I'd make a wild guess: it's your RAM. Why? Kingston HyperX is the RAM I've had the most difficulties with. It needs such a high RAM voltage compared with so many others just to get stable, sometimes so much that the mainboard starts freaking out. I can't imagine 8GB (4x2GB) is fun or easy; I've had problems with 4GB (2x2GB) of that RAM on Asus, MSI, & XFX mainboards. They're "touchy" DDR2 at 1066MHz, so it's better to lower them to 800MHz or increase the RAM voltage, which "may" also require you to increase the NB voltage.

OK, I lowered it to 800MHz on both systems. (These are the "Tall Heatsink" modules, but even so I don't like to increase the RAM voltage.) BTW, I've never had any memory-related problem with these modules before.
But it didn't help. Two WUs have failed since then.
I will swap the GPUs...

Profile Retvari Zoltan
Message 18326 - Posted: 11 Aug 2010 | 20:39:34 UTC - in response to Message 18320.

> Perhaps these tasks are stressing your card and finding a weakness with it, or your system.

I agree with that, but the KASHIF_HIVPR tasks are also the GPU-stressing kind, and they don't tend to fail.

Profile skgiven
Message 18327 - Posted: 11 Aug 2010 | 21:43:18 UTC - in response to Message 18326.

Let a few tasks run when you swap the cards. Drivers should not need updating, so it's not too much of a task. Report back any findings.
Thanks,

Profile Retvari Zoltan
Message 18338 - Posted: 13 Aug 2010 | 12:38:40 UTC - in response to Message 18327.

It now seems I've solved the problem without swapping the GPUs.
I looked for further differences between my two systems and noticed that my failing GPU runs at 1.000V, while the other runs at 1.025V. So I raised the failing GPU's core voltage to 1.025V, and since then it has completed 4 TONI_KIDln tasks.
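
For anyone repeating this kind of side-by-side comparison, here is a minimal sketch of the idea. The values are the ones posted in this thread (the voltages are those before the change); the dictionary keys and the `diff_configs` helper are names invented for this example, not output from any tool.

```python
# Settings as posted in this thread; the key names are hypothetical labels.
failing_pc = {
    "MB": "Asus P5Q Pro",
    "CPU": "Core2 Quad 6600 @ 2.4GHz",
    "VGA": "Gigabyte GV-N480D5-15I-B",
    "PSU": "Chieftec A135-1000W",
    "OS": "WinXP SP3 x86",
    "BOINC": "6.11.4",
    "NVIDIA driver": "259.31",
    "swan_sync": "0",
    "GPU core voltage": "1.000V",
}
working_pc = {
    "MB": "Asus P5Q Deluxe",
    "CPU": "Core2 Quad 9550 @ 2.83GHz",
    "VGA": "Asus ENGTX480",
    "PSU": "Gigabyte Superb 720W",
    "OS": "WinXP SP3 x86",
    "BOINC": "6.11.4",
    "NVIDIA driver": "259.12",
    "swan_sync": "0",
    "GPU core voltage": "1.025V",
}


def diff_configs(a, b):
    """Return {setting: (value_in_a, value_in_b)} for settings that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


for setting, (bad, good) in diff_configs(failing_pc, working_pc).items():
    print(f"{setting}: failing={bad} working={good}")
```

Identical settings (OS, BOINC version, swan_sync) drop out of the listing, which leaves a short list of suspects to go through one at a time.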

Profile skgiven
Message 18340 - Posted: 13 Aug 2010 | 15:21:29 UTC - in response to Message 18338.

Well spotted. There have been one or two of these issues before with the Fermi cards. I think the last one was a Gigabyte as well; perhaps they are cutting it a bit too fine. I'm guessing ASUS got the pick of the bunch to work with. My Asus ENGTX470 has a voltage of 0.975V and crunches away quite happily at 704MHz. I paid a bit over the odds for it, but now I think it was definitely worth it.

Good Luck,
