Gigabyte GTX 780 Ti OC (Windforce 3x) problems

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Message 34426 - Posted: 22 Dec 2013 | 1:48:56 UTC
Last modified: 22 Dec 2013 | 1:55:10 UTC

I have serious problems with my new card, and after trying to make it work with GPUGrid for a whole day - quite frankly - I've run out of ideas.
Does anybody crunch with this type of card (GV-N780TOC-3GD) for GPUGrid?
I have a standard Gigabyte GTX 780 Ti in this host (beside an ASUS GTX 670 DC2 OC). These cards are doing their job very well. Originally I wanted a Gigabyte GTX 780 Ti OC, so when it became available after a couple of weeks of waiting, I ordered and received one. I put this card beside its little brother (the non-OC Gigabyte GTX 780 Ti), and my problems emerged immediately: the new card failed every task almost immediately.
Only one task ran for 2000s, because the card was downclocked to 450MHz, but after a restart the downclock was gone, and it failed too.
Since then I've tried the following, none of them helped:
1. uninstall NV driver, restart, install latest NV (331.93 beta) driver
2. install a fresh Win7 x64 on this host, with the latest beta driver
3. put the failing card in different PCIe slots of the GA-Z87X-OC motherboard, remove the original card, and put the failing card in its place to be the only card in the system
4. test the card with different applications: Heaven's benchmark, Furmark, Primegrid, GPU memory stress test, G80 memtest - all of them were OK
5. put the card to a different MB (Intel DH87RL), under different OS (Win8.1 x64)
6. Further tests: NVidia human face demo (it's quite stunning actually), 3DMark13, NVidia design garage, NVidia Islands - all of them OK.
7. downclock the card with MSI Afterburner
8. increase the GPU voltage to the maximum of 1212mV with Kepler BIOS tweaker 1.26
9. uninstall the 331.93 driver, restart, download and install the card's original driver (331.60) from Gigabyte
10. decrease the GPU clocks, power limits, boost clocks, etc. to the settings of the working (non-OC) card with Kepler BIOS tweaker 1.26
11. decrease the GPU clocks further down to 800MHz
12. decrease the GPU memory clock to 3.4 GHz
13. decrease the GPU memory clock to 3.3 GHz
14. decrease the GPU memory clock to 3.3 GHz and the GPU clocks etc. to the settings of the working card.

The really annoying aspect of this is that only the GPUGrid tasks are failing on this card - the tests, benchmarks, and demos are not. I had a faulty GTX 580 some time ago, but that card also showed some heavy artifacts with Heaven's benchmark. I would appreciate it if you could recommend any further testing tool that can detect a defect on this card (while showing no problem on the other GTX 780 Ti), because I'm not convinced that this card is faulty.
Tomorrow I'll try to test this card with folding@home (I couldn't reach Stanford's webpage today).

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,865,789
RAC: 0
Message 34429 - Posted: 22 Dec 2013 | 11:36:16 UTC
Last modified: 22 Dec 2013 | 11:37:05 UTC

ok already done on Point 4 sry ^^
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Message 34430 - Posted: 22 Dec 2013 | 11:48:44 UTC - in response to Message 34426.

I also doubt the card is defective but if it can't run GPUgrid tasks with the numerous tweaks you've tried then something is wrong somewhere.

I've Googled terms like "GV-N78TOC-3GD problem", "problem GV-N78TOC-3GD", "fubar GV-N78TOC-3GD", "GV-N78TOC-3GD fail", etc. and haven't found any reports or reviews speaking of problems.

Do any of the diagnostic programs you've run test with CUDA or in CUDA mode? I ask because if you can demonstrate that it consistently fails on CUDA but not OpenGL or alternatively fails OpenGL and CUDA but nothing else then you have something definite that you can present to Gigabyte and/or NVIDIA if you want to motivate them to investigate and perhaps issue a driver update or BIOS update.

The point is you've more or less eliminated clock and voltage settings, mobo and drivers as the problem. The only other possible things I can think of to experiment with are:

1) Win XP or Linux versus Win 7/8
2) CUDA vs. OpenGL vs. "game mode" or whatever the correct term is
3) localized over-heating which I explain below

We know there are different ways of measuring/detecting GPU core temperature as was discussed recently in another thread here. Maybe certain portions of the GPU core are not cooling properly due to one or more of the following causes:

1) a curved heat spreader on the GPU
2) a curved surface on the heat sink
3) improper application of thermal grease

Maybe high temps in affected areas are not showing up in temp reports and only GPUgrid tasks are exercising those poorly cooled areas. (Yes, that does require a number of factors to align properly but nothing happens by accident, there is always a cause, you have eliminated most of the causes already and this one is one of the few remaining reasons.)

If you have good vision you can spot badly curved/warped surfaces easily with what I call "the straight edge and light test" which is a very common test. You probably are familiar with how it works but don't rely on it unless you know your vision is good. A much better way of measuring flatness is to use a dial gauge on a pivot as it will easily show defects as small as .005 inches if used properly. A top quality dial gauge and pivot are expensive but there are less expensive models that are accurate enough for the job we're talking about. Or you can take the GPU and heatsink to a machine shop and pay them to check the flatness.

If it turns out one or both of the 2 surfaces are curved then you can correct it by lapping if the mismatch is small or by machining and lapping/polishing if the mismatch is large.

Some people think they can detect curved surface(s) by seeing if they can rock the heatsink side-to-side. That test can reveal a convex-to-flat or convex-to-convex situation but cannot reveal a concave-to-flat or concave-to-concave situation.


____________
BOINC <<--- credit whores, pedants, alien hunters

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 34435 - Posted: 22 Dec 2013 | 18:12:21 UTC - in response to Message 34426.

I am sorry to hear you have problems with your new card, Zoltan. I can't help you; if you don't know the answer yourself, then I certainly don't. The only thing I can think of is that these cards are too powerful, but that can't be true?
Good luck, I am sure over time you will find the answer.

But I have two questions.
1. The host you are showing has awesome time compared to mine and you have now Win7 installed as well? What have you done then to get these results.
2. I downloaded and installed Kepler BIOS Tweaker as you suggested in the thread I started, but all I see are empty fields and clicking in it does not work, I can not put any values in. What did I do wrong there?
____________
Greetings from TJ

Retvari Zoltan
Message 34450 - Posted: 23 Dec 2013 | 19:14:43 UTC - in response to Message 34430.
Last modified: 23 Dec 2013 | 19:34:23 UTC

I also doubt the card is defective but if it can't run GPUgrid tasks with the numerous tweaks you've tried then something is wrong somewhere.

Errrr.....

I've Googled terms like "GV-N78TOC-3GD problem", "problem GV-N78TOC-3GD", "fubar GV-N78TOC-3GD", "GV-N78TOC-3GD fail", etc. and haven't found any reports or reviews speaking of problems.

That's pioneer's fate.

Do any of the diagnostic programs you've run test with CUDA or in CUDA mode? I ask because if you can demonstrate that it consistently fails on CUDA but not OpenGL or alternatively fails OpenGL and CUDA but nothing else then you have something definite that you can present to Gigabyte and/or NVIDIA if you want to motivate them to investigate and perhaps issue a driver update or BIOS update.

PrimeGrid PPS Sieve & GeneFer are CUDA (I'm not sure about the version, though). As I tested further with PrimeGrid, the GeneFer CUDA client got stuck once, and another task failed - I'm not sure why.
FurMark is OpenGL.
Heaven's Benchmark is DX11 & tessellation is OpenGL4.0.

The point is you've more or less eliminated clock and voltage settings, mobo and drivers as the problem. The only other possible things I can think of to experiment with are:

1) Win XP or Linux versus Win 7/8

The card first failed under WinXPx64, I've switched to Win7x64 and Win8.1x64 only for further testing the card without interrupting the crunching.

2) CUDA vs. OpenGL vs. "game mode" or whatever the correct term is

My guess is that either GPU Boost 2.0 makes mistakes, or the algorithm in the GPUGrid client that detects when the simulation becomes unstable does, or I've got a very tricky error in my card.

3) localized over-heating which I explain below

We know there are different ways of measuring/detecting GPU core temperature as was discussed recently in another thread here. Maybe certain portions of the GPU core are not cooling properly due to one or more of the following causes:

1) a curved heat spreader on the GPU
2) a curved surface on the heat sink
3) improper application of thermal grease

Maybe high temps in affected areas are not showing up in temp reports and only GPUgrid tasks are exercising those poorly cooled areas. (Yes, that does require a number of factors to align properly but nothing happens by accident, there is always a cause, you have eliminated most of the causes already and this one is one of the few remaining reasons.)

I got that. As you say, it's very unlikely that this is the source of my problems. However, I'll check the heatsink if I can remove it without voiding warranty.

If you have good vision you can spot badly curved/warped surfaces easily with what I call "the straight edge and light test" which is a very common test. You probably are familiar with how it works but don't rely on it unless you know your vision is good. A much better way of measuring flatness is to use a dial gauge on a pivot as it will easily show defects as small as .005 inches if used properly. A top quality dial gauge and pivot are expensive but there are less expensive models that are accurate enough for the job we're talking about. Or you can take the GPU and heatsink to a machine shop and pay them to check the flatness.

If it turns out one or both of the 2 surfaces are curved then you can correct it by lapping if the mismatch is small or by machining and lapping/polishing if the mismatch is large.

Some people think they can detect curved surface(s) by seeing if they can rock the heatsink side-to-side. That test can reveal a convex-to-flat or convex-to-convex situation but cannot reveal a concave-to-flat or concave-to-concave situation.

From my experience it's a much more common error that one or two corners of the heatsink are not fastened well, so it touches the chip only on one (or two) edge(s), not on its entire surface.

Retvari Zoltan
Message 34451 - Posted: 23 Dec 2013 | 19:22:17 UTC - in response to Message 34435.

I have two questions.
1. The host you are showing has awesome time compared to mine and you have now Win7 installed as well? What have you done then to get these results.

I'm still crunching under WinXPx64 on that host. I've installed Win7 only to have DX11 and some other fancy stuff for the graphical tests.

2. I downloaded and installed Kepler BIOS Tweaker as you suggested in the thread I started, but all I see are empty fields and clicking in it does not work, I can not put any values in. What did I do wrong there?

The previous (1.25) version can extract / flash the BIOS from the card if nvflash.exe is located in its folder. The latest one (1.26) can't, it can manipulate the firmware image in a file, so you have to extract / flash it manually with nvflash.exe (GPU-Z can extract the BIOS from the card through GUI, but it also uses a built-in copy of nvflash.exe)
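
The manual extract / edit / flash round trip described above could look roughly like the sketch below. It only builds and prints the command lines and never runs them. The `--save` and `-6` flags are written from memory of common nvflash usage and should be treated as assumptions to check against the help output of your own nvflash.exe; the file names are made up for illustration.

```python
from typing import List

# Assumption: nvflash.exe sits in the current folder, as described for version 1.25.
NVFLASH = "nvflash.exe"


def backup_command(rom_path: str) -> List[str]:
    """Command line that would save the card's current BIOS to a file.

    Assumption: '--save' is the nvflash option for dumping the firmware.
    """
    return [NVFLASH, "--save", rom_path]


def flash_command(rom_path: str, allow_subsystem_mismatch: bool = False) -> List[str]:
    """Command line that would write a (tweaked) image back to the card.

    Assumption: '-6' lets nvflash accept an image whose PCI subsystem ID
    differs from the card's.
    """
    cmd = [NVFLASH]
    if allow_subsystem_mismatch:
        cmd.append("-6")
    cmd.append(rom_path)
    return cmd


# Dry run: print the commands instead of executing them.
# 'original.rom' and 'tweaked.rom' are hypothetical file names.
for cmd in (backup_command("original.rom"), flash_command("tweaked.rom")):
    print(" ".join(cmd))
```

Keeping an untouched copy of the saved image next to the edited one means the original firmware can be flashed back if a tweak goes wrong.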

Retvari Zoltan
Message 34452 - Posted: 23 Dec 2013 | 19:39:10 UTC
Last modified: 23 Dec 2013 | 19:40:18 UTC

I've flashed the working card's BIOS to the OC card (they have different vendor and PCI subsystem IDs, so I was a little concerned about doing it). The card was working ok, but the GPUGrid client still fails. I'm afraid that I have to sell this beautiful card to a gamer...

Dagorath
Message 34453 - Posted: 23 Dec 2013 | 19:46:54 UTC - in response to Message 34450.

From my experience it's a much more common error that one or two corners of the heatsink are not fastened well, so it touches the chip only on one (or two) edge(s), not on its entire surface.


Yep, that happens frequently and bit me once a few years ago on a CPU. The temperatures were a little high but still reasonable. It took me a long time to figure it out.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 34456 - Posted: 23 Dec 2013 | 22:50:54 UTC - in response to Message 34453.

Zoltan, you have already tried a lot, and as you have tried the GPU on different rigs and with different operating systems, my guess is that the card is a dud, most likely GDDR/some capacitor. Alternatively the GPUGrid app/s simply don't work with it due to some obscure oddity in the app/bespoke card (slim chance).
Suggest you check the GPU physically (loose anything, burnt smell, dodgy soldering...), sunk PCIE power connector pins (on the card), does the GPU seat fully?

Some loose/desperate suggestions (possibly already covered):
Try short tasks, if you haven't already.
What was installed with the NVidia drivers; all the 3D and sound crap or just the drivers?
Have you tried lowering the power target (MSI Afterburner...)?
What about System Power settings?
NVidia control panel settings? Prefer max performance, PhysX pointing where - specific GPU or CPU (not sure that even matters though)?
Have you tried to drop the GPU memory to 3000MHz?
Motherboard Bios upgrade (long shot).
CPU drivers from Intel (might be messing with the bus)?
Chipset update?
Different versions of Boinc (perhaps even completely uninstalling Boinc and then reinstalling).

Tried Linux?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 34457 - Posted: 23 Dec 2013 | 22:54:46 UTC
Last modified: 23 Dec 2013 | 22:55:18 UTC

Having just completed a new Haswell build, I have learned that a motherboard that is quite stable with its own internal graphics can go bananas once I insert a GTX 660 and try to run BOINC/GPUGrid. As soon as BOINC starts (barely having time to reach the desktop, if that), I get BSODs. That itself is not so unusual, but like you I had already underclocked/overvolted the card sufficiently that it should have worked fine.

The solution for me lay in the motherboard DDR3 memory; I had manually set the speed to 1600 MHz and 8-8-8-24, which is the rated speed of the Crucial Ballistix 2 GB modules (4 of them). But by returning to the default motherboard value of 1333 MHz and 9-9-9-24 timings (as set by SPD), the problems seem to have disappeared. Maybe you have something similar.
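
To put the two DDR3 settings side by side, the CAS latency can be converted from clock cycles to nanoseconds. This is only a rough worked example: it assumes the "1600 MHz" and "1333 MHz" figures are data rates in MT/s (so the I/O clock is half of that), and it looks at the first timing value only.

```python
def cas_latency_ns(data_rate_mt_s: float, cas_cycles: int) -> float:
    """Convert a CAS latency in clock cycles to nanoseconds.

    Assumption: DDR memory transfers twice per I/O clock, so the I/O clock
    in MHz is half the quoted data rate.
    """
    io_clock_mhz = data_rate_mt_s / 2
    return cas_cycles / io_clock_mhz * 1000


rated = cas_latency_ns(1600, 8)    # manually set 1600 / 8-8-8-24
default = cas_latency_ns(1333, 9)  # SPD default 1333 / 9-9-9-24

print(f"rated setting:   {rated:.2f} ns")
print(f"default setting: {default:.2f} ns")
```

Under these assumptions the rated setting works out to 10 ns and the SPD default to about 13.5 ns, so the default is the more relaxed of the two in both frequency and latency.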

Retvari Zoltan
Message 34461 - Posted: 24 Dec 2013 | 14:33:41 UTC - in response to Message 34430.
Last modified: 24 Dec 2013 | 14:40:11 UTC

1) a curved heat spreader on the GPU
2) a curved surface on the heat sink
3) improper application of thermal grease

Yesterday I had the nerve to remove the heatsink from the failing card. Two things surprised me:
1. There are only 7 screws fixing the whole heatsink assembly to the card (on the original one there are 4 bigger and 4 smaller screws just for the GPU).
2. The thermal grease and the heatsink surface for the GPU were ok, and the thermal pads for the RAM chips were also fine, but the one long thermal pad for the 8 chips of the GPU's power supply was too short, so it had been stretched to reach the 8th chip and became too thin between the 7th and 8th. I cut some strips from the unnecessary parts of the long side of the thermal pad and put them on the 8th chip. Unfortunately, this didn't help. I haven't been able to disassemble the card since then, but I'll do it on the 27th when I get home.

Retvari Zoltan
Message 34465 - Posted: 24 Dec 2013 | 17:28:24 UTC - in response to Message 34456.
Last modified: 24 Dec 2013 | 17:39:49 UTC

Zoltan, you have already tried a lot, and as you have tried the GPU on different rigs and with different operating systems, my guess is that the card is a dud, most likely GDDR/some capacitor. Alternatively the GPUGrid app/s simply don't work with it due to some obscure oddity in the app/bespoke card (slim chance).
Suggest you check the GPU physically (loose anything, burnt smell, dodgy soldering...), sunk PCIE power connector pins (on the card), does the GPU seat fully?

Yesterday I checked both sides of the card (I had checked the back side before). There are a lot of components on the front side of the board, so it's impossible to check every solder joint (not to mention the BGA packaging of the GPU and the RAM chips, which covers the actual soldering). But I didn't notice any sloppy soldering or loose or missing components (there are unused component spaces on the board, but that's normal).
The PCIe power connectors (2x8 pin) are a little different on this card than usual: the latch(?) (I don't know how we call it even in my native language) in which the clamp of the PCIe connector clicks in is much shallower than usual (about 1/5th of the normal), so it's much easier to remove the PCIe power connectors from the card.

The most interesting part is that the card is running PrimeGrid tasks just fine. (I know that they're not comparable to GPUGrid tasks)

I could make the card consume more power with FurMark than when crunching GPUGrid tasks, and it was running for about an hour without errors.
I let Heaven Benchmark run for a whole night, and it didn't produce any artifacts.

Some loose/desperate suggestions (possibly already covered):
Try short tasks, if you haven't already.

I hadn't tried short runs before you asked, but the short run also failed.

What was installed with the NVidia drivers; all the 3D and sound crap or just the drivers?

On the WinXPx64 host only the graphics drivers are installed; on the same hardware I've installed everything under Win7, and on the other MB with Win8.1 I've also installed everything.

Have you tried lowering the power target (MSI Afterburner...)?

Yep. Lowering, increasing.

What about System Power settings?

Never go to sleep, never turn off monitor (never give up, never surrender :))

NVidia control panel settings? Prefer max performance,

That's why I've installed everything under Win7. For a moment I thought it helped, but after 10 secs the WU crashed.

PhysX pointing where - specific GPU or CPU (not sure that even matters though)?

It points to the GPU, but the WinXPx64 doesn't have PhysX, so that's irrelevant I guess.

Have you tried to drop the GPU memory to 3000MHz?

I've tried 3400MHz and 3300MHz.
I just did something new: flashed the BIOS of the Graphic card in a RealVNC session :). It's now down to 3000MHz.

Motherboard Bios upgrade (long shot).

Both MB have the latest BIOS installed (before it all began), GA-Z87X-OC: F6, DH87RL: 0323

CPU drivers from Intel (might be messing with the bus)?

Do such drivers exist? Could you please give me a link?

Chipset update?

The latest chipset drivers are installed (9.4.0.1027)

Different versions of Boinc (perhaps even completely uninstalling Boinc and then reinstalling).

I don't like the 7 series of the BOINC manager, but after the first couple of failures I upgraded from 6.10.60 to 7.2.33. The only difference I've noticed is that now I can see status messages such as "Trying to restart unstable simulation" instead of "waiting for GPU memory" (the latter made me upgrade to the latest BOINC manager). But it didn't help.

Tried Linux?

I'm kind of a Windows guy. :) I even hate power shell (namely the concept that there is a lot of things you can't do through GUI). I don't know Linux. I think it's not a good idea trying something unknown to fix a tricky error. Besides, I don't believe that the source is the OS, because another GTX 780 Ti is working fine on my system.

Retvari Zoltan
Message 34467 - Posted: 24 Dec 2013 | 18:26:16 UTC - in response to Message 34456.
Last modified: 24 Dec 2013 | 18:26:31 UTC

Have you tried to drop the GPU memory to 3000MHz?

Wow!
A short run finished at 3000MHz memory clock. There were two restarts, so I'm lowering the RAM frequency to 2900MHz and trying a long run.

skgiven
Message 34476 - Posted: 25 Dec 2013 | 12:25:41 UTC - in response to Message 34467.
Last modified: 25 Dec 2013 | 13:32:10 UTC

I see you have had 2 more successful runs and one failure,

Task: 188x-SANTI_MAR422cap310-13-32-RND9223_1
Work unit: 5020610
Sent: 24 Dec 2013 | 18:39:10 UTC
Reported: 25 Dec 2013 | 2:47:24 UTC
Status: Completed and validated
Run time (sec): 21,301.04
CPU time (sec): 21,173.92
Credit: 115,650.00
Application: Long runs (8-12 hours on fastest card) v8.14 (cuda55)

Task: I161-SANTI_bax2-9-32-RND0580_0
Work unit: 5021047
Sent: 24 Dec 2013 | 20:13:13 UTC
Reported: 25 Dec 2013 | 6:49:52 UTC
Status: Completed and validated
Run time (sec): 21,901.13
CPU time (sec): 21,750.64
Credit: 154,050.00
Application: Long runs (8-12 hours on fastest card) v8.14 (cuda55)

Task: 76x-SANTI_MAR422cap310-13-32-RND3988_0
Work unit: 5022310
Sent: 25 Dec 2013 | 6:40:37 UTC
Reported: 25 Dec 2013 | 10:29:10 UTC
Status: Error while computing
Run time (sec): 5,981.45
CPU time (sec): 5,943.67
Credit: ---
Application: Long runs (8-12 hours on fastest card) v8.14 (cuda55)

Despite the memory drop that's still around 28% faster than my GTX770.

For each task, the logs show the GPU temps going up to 64C and then the card sometimes stops working for a while. This suggests to me that there is something not right with the cooling. While the GPU is fine, I suspect the GDDR5 is not (or something related to the GDDR). Perhaps a bad module that might not get used by other types of work.
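
The idea that a bad memory region might never be touched by other kinds of work can be illustrated with a toy model. The sketch below is not how any of the tools mentioned in this thread work, and the real fault here looks clock-dependent rather than a permanently stuck bit; `FaultyMemory` and `pattern_test` are made-up names for illustration. It only shows that a pattern test passes or fails depending on whether it covers the faulty address.

```python
class FaultyMemory:
    """Toy memory model: one cell always reads back with one bit stuck at 0."""

    def __init__(self, size: int, bad_address: int, stuck_mask: int = 0x01):
        self.cells = bytearray(size)
        self.bad_address = bad_address
        self.stuck_mask = stuck_mask

    def write(self, addr: int, value: int) -> None:
        self.cells[addr] = value

    def read(self, addr: int) -> int:
        value = self.cells[addr]
        if addr == self.bad_address:
            # Clear the stuck bit(s) on every read of the faulty cell.
            value &= ~self.stuck_mask & 0xFF
        return value


def pattern_test(mem: FaultyMemory, start: int, end: int,
                 patterns=(0x55, 0xAA, 0xFF, 0x00)) -> list:
    """Write each pattern over [start, end) and return addresses that read back wrong."""
    bad = set()
    for pattern in patterns:
        for addr in range(start, end):
            mem.write(addr, pattern)
        for addr in range(start, end):
            if mem.read(addr) != pattern:
                bad.add(addr)
    return sorted(bad)


mem = FaultyMemory(size=4096, bad_address=3000)
print(pattern_test(mem, 0, 1024))  # workload touching only the first quarter
print(pattern_test(mem, 0, 4096))  # workload touching the whole memory
```

In this model the partial sweep reports nothing and the full sweep reports address 3000, which is the kind of situation that would let benchmarks pass while a workload using more of the memory fails.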


* I would say "plastic clip" on the PCIE power connector.

Retvari Zoltan
Message 34477 - Posted: 25 Dec 2013 | 14:14:03 UTC - in response to Message 34476.

I see you have had 2 more successful runs and one failure,

Yep, the error is still there, even though the RAM now runs at 2800MHz. Sometimes lowering the frequency makes things worse, and I think that lowering the RAM frequency even further won't completely solve the problem with my card.

Despite the memory drop that's still around 28% faster than my GTX770.

This card runs under Win8.1, but I'll put it in my WinXPx64 host (if I can fix it), and it will be even faster :).

For each task, the logs show the GPU temps going up to 64C and then the card sometimes stops working for a while. This suggests to me that there is something not right with the cooling. While the GPU is fine, I suspect the GDDR5 is not (or something related to the GDDR). Perhaps a bad module that might not get used by other types of work.

I'll dismount the cooler assembly once again when I get home, and check the thermal pads again, but I think this is either a memory power line failure or a RAM chip failure. I guess that some of the capacitors have insufficient capacity, or sloppily soldered (or missing). First I'll check it with my naked eye (through my reading glasses), but if I don't find something suspicious I'll take a couple of macro photos from different parts of the card, and check the photos. Now that I know what part of the card malfunctions, it's not a mysterious error anymore, and I have ideas about finding and fixing the card. If I can't fix it, I still can RMA the card, as now I'm confident that this card is bad, also I can prove it to the RMA guys.

Retvari Zoltan
Message 34514 - Posted: 30 Dec 2013 | 14:17:49 UTC - in response to Message 34477.

I see you have had 2 more successful runs and one failure,

Yep, the error is still there, even though the RAM now runs at 2800MHz. Sometimes lowering the frequency makes things worse, and I think that lowering the RAM frequency even further won't completely solve the problem with my card.

Well, fortunately I wasn't right about that: I've put this card in my WinXPx64 host's PCIe 2.0 x4 slot, and it had a couple of errors, so I've lowered the RAM frequency to 2700MHz, and now it's running smoothly. There is a 2000 sec loss compared to the (standard and OC-ed) card in the PCIe 3.0 x16 slot. I'll try the OC card in the PCIe 3.0 x8 slot (which will make the standard card run at x8 as well) to see how much of the loss is caused by the lowered RAM frequency.
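
The trial-and-error used in this thread (3400, 3300, 3000, 2900, 2800, 2700MHz) amounts to a simple step-down search, which can be sketched as below. `find_stable_clock` and `pretend_card` are hypothetical names; in practice each "is it stable?" check means flashing or setting the clock and running real tasks for hours. The sketch also assumes that a lower clock is never less stable, which the thread itself suggests is not always true.

```python
from typing import Callable, Optional


def find_stable_clock(start_mhz: int, floor_mhz: int, step_mhz: int,
                      is_stable: Callable[[int], bool]) -> Optional[int]:
    """Step the memory clock down until is_stable() accepts it.

    Returns the first (highest) clock that passes, or None if nothing
    at or above floor_mhz passes.
    """
    clock = start_mhz
    while clock >= floor_mhz:
        if is_stable(clock):
            return clock
        clock -= step_mhz
    return None


def pretend_card(mhz: int) -> bool:
    """Stand-in for 'set the clock, run long tasks, check for errors'."""
    return mhz <= 2700


print(find_stable_clock(3500, 2500, 100, pretend_card))
```

With the stand-in check above, the search starting at the stock 3500MHz ends at 2700MHz; a real run would replace `pretend_card` with manual testing and would probably want more than one successful task before accepting a clock.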

a1kabear
Joined: 19 Oct 13
Posts: 15
Credit: 578,770,199
RAC: 0
Message 34553 - Posted: 2 Jan 2014 | 16:11:19 UTC

I just got my new 780ti OC windforce 3x and have the same problems. My other card is fine, but this card fails almost instantly with a computation error. I tried on both Ubuntu (319.76 and 331.20 drivers) and Windows with no joy so far. I haven't tried lowering the clocks yet; I am still troubleshooting it :(

Did you ever make any progress on this?

Retvari Zoltan
Message 34558 - Posted: 2 Jan 2014 | 17:50:49 UTC - in response to Message 34553.
Last modified: 2 Jan 2014 | 17:58:04 UTC

Yes, this card is working fine since I've lowered its memory clock to 2700MHz.
For example: Task 7614006, Task 7612456, Task 7611911
You've answered my unasked question: is this problem by design, or is just my card faulty?
It seems to be some problem with the design of the card.

TJ
Message 34562 - Posted: 2 Jan 2014 | 23:18:57 UTC - in response to Message 34558.
Last modified: 2 Jan 2014 | 23:19:29 UTC

Yes, this card is working fine since I've lowered its memory clock to 2700MHz.
For example: Task 7614006, Task 7612456, Task 7611911
You've answered my unasked question: is this problem by design, or is just my card faulty?
It seems to be some problem with the design of the card.

Yes, it would be great if someone with the OC version from another brand could see how that goes. I had planned to buy an EVGA 780Ti OC, but as my "normal" 780Ti heavily underperforms yours under Win7, I decided to wait for Maxwell.
But will try Linux first in the coming days.

Dagorath
Message 34565 - Posted: 3 Jan 2014 | 0:32:45 UTC - in response to Message 34558.

You've answered my unasked question: is this problem by design, or is just my card faulty?
It seems to be some problem with the design of the card.


From your perspective it is the design of the card and I agree with your perspective. If you ask the manufacturer and present all the evidence you have uncovered, their response might be like "It's a problem with the application, that card is for gaming applications where a few errors won't be noticed. It's not for data crunching that requires high precision and reliability."

Do you plan to RMA it? IIUC, you have had to downclock the memory below the frequency used on the standard model (not OC), yes?


Retvari Zoltan
Message 34569 - Posted: 3 Jan 2014 | 16:23:53 UTC - in response to Message 34565.

From your perspective it is the design of the card and I agree with your perspective. If you ask the manufacturer and present all the evidence you have uncovered, their response might be like "It's a problem with the application, that card is for gaming applications where a few errors won't be noticed. It's not for data crunching that requires high precision and reliability."

I'm aware of (and accept) the manufacturer's perspective. However, none of my previous OC cards showed such a flaw, including theirs. It's very strange that the memory clock is the one which had to be reduced to 77% of its original frequency to fix this problem. However, it is much harder to make the RMA guys at the shop where I bought this card accept this error condition, since I'm sure that they test graphics cards only with games and 3D accelerator tests (which show no problem at all).

Do you plan to RMA it?

No, as it is working now, and probably the replacement card would have the same flaw. It is a better option to sell this card to a gamer, and buy a different OC card (from a different manufacturer, or the new version of this card).

IIUC, you have had to downclock the memory below the frequency used on the standard model (not OC), yes?

Yes, since the OC and the non-OC card originally have the same memory frequency (3500MHz).

288larsson
Joined: 15 Apr 10
Posts: 2
Credit: 674,542,975
RAC: 0
Level
Lys
Message 34805 - Posted: 24 Jan 2014 | 16:28:16 UTC

hello Retvari Zoltan* thanks for the problem solution. Can run mine at 3100MHz

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 34810 - Posted: 24 Jan 2014 | 23:34:46 UTC - in response to Message 34805.

hello Retvari Zoltan* thanks for the problem solution.

You're welcome!
The thanks goes to skgiven as well.

Can run mine at 3100MHz

You've had more luck with your card than I've had with mine, but that could be because you run it under Win8.1. If you ran it under WinXP, you would probably have to reduce the memory clock frequency a little further.
If I have the time and the guts, I'll try replacing the power buffering capacitors around the RAM chips on my card with higher-capacity ones.

a1kabear
Joined: 19 Oct 13
Posts: 15
Credit: 578,770,199
RAC: 0
Level
Lys
Message 34811 - Posted: 25 Jan 2014 | 1:27:16 UTC
Last modified: 25 Jan 2014 | 1:27:43 UTC

I finally got around to flashing my card down to 2700MHz memory and now it works fine under Linux. It's a little slower than yours (about 1500 seconds), but I am also running World Community Grid on every other thread, and the ambient temperature has been about 25°C here recently.

I am on Linux with it.

So thanks for the solution :)

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34830 - Posted: 27 Jan 2014 | 1:00:32 UTC - in response to Message 34811.
Last modified: 27 Jan 2014 | 9:46:43 UTC

When you had your GPU stripped, did you notice what type of GDDR5 your Gigabyte Windforce 3X card was using?
If it uses Hynix GDDR5 it should be R2C, but there are other types it might be: R0C, T2C, or T0C.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 34833 - Posted: 27 Jan 2014 | 17:05:40 UTC - in response to Message 34830.
Last modified: 27 Jan 2014 | 17:13:09 UTC

When you had your GPU stripped, did you notice what type of GDDR5 your Gigabyte Windforce 3X card was using?
If it uses Hynix GDDR5 it should be R2C, but there are other types it might be: R0C, T2C, or T0C.

It's using 12 pieces of Hynix H5GQ2H24AFA R2C. There are 6 groups of 2 (adjacent) RAM chips. Each group has a FET, 5 resistors, and 2 capacitors between the RAM chips it belongs to. One of the capacitors is bigger. I suspect that either this bigger capacitor is not big enough for 2 RAM chips, or the capacitors around the whole memory array are not big enough for the array. It would be nice to have the schematic of this board, or at least a recommended circuit diagram from Hynix.

Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Message 34834 - Posted: 27 Jan 2014 | 18:37:08 UTC - in response to Message 34833.

Have you checked Gigabyte's website for an errata sheet and/or correction sheet dealing with this issue? I suspect many hundreds of their customers are having the same issue for exactly the same reason (wrong component or failed component such as a cap or resistor) and I would think by now Gigabyte is aware of the problem. If not then the more people who report it and the workaround (downclocking the memory) the sooner they will become aware they have a big problem and a potential big blow to their reputation and move on the issue.

____________
BOINC <<--- credit whores, pedants, alien hunters

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34849 - Posted: 30 Jan 2014 | 21:55:49 UTC - in response to Message 34833.
Last modified: 1 Feb 2014 | 19:17:41 UTC

H5GQ2H24AFA R2C or,
H5GQ2H24AFR R2C ?

Assuming H5GQ2H24AFR R2C, these require 1.6V to support 3.5GHz.

Excluding the possibility of bad GDDR5 and bad circuitry (which we can do nothing about anyway), my guess is that the card isn't supplying the necessary 1.6V, and is either supplying 1.5V or 1.35V - possibly 1.5V for some people and 1.35V for others; with 1.35V perhaps being sufficient for 2.7GHz and 1.5V sufficient for 3.1GHz. This ties in with what has been reported here and suggests a firmware, driver or OS issue.
This might be related to the performance levels of the GPU (0, 1, 2, 3, 4) which might behave differently for GK110 than GK104.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Gattorantolo [Ticino]
Joined: 29 Dec 11
Posts: 44
Credit: 251,211,525
RAC: 0
Level
Asn
Message 34909 - Posted: 5 Feb 2014 | 6:11:23 UTC - in response to Message 34849.
Last modified: 5 Feb 2014 | 6:12:18 UTC

How is it possible to have a crunching time like yours, Zoltan (about 16,000 sec), on the last WU? I have 4 GPUs like yours (GTX 780 Ti), and my crunching time is 24,000 sec. Why?
____________
Member of Boinc Italy.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34911 - Posted: 5 Feb 2014 | 11:56:26 UTC - in response to Message 34909.
Last modified: 5 Feb 2014 | 11:57:30 UTC

Your GTX780Ti temperatures look very cool - too cool. If you are not using water cooling, your GPU's may be downclocking.

Win XP vs Win8.1 - XP and Linux were ~12.5% faster for a GTX770, last time I looked, which might increase for a GTX780Ti.

48 CPU threads vs 8 threads - less resource conflict; HT doesn't scale really well for some CPU/GPU project combinations.

i7-4770K CPU @ 3.50GHz vs E5-2695 @ 2.4GHz - 46% faster stock cores (and might be overclocked).

2 GPU's vs 4 GPU's - less demand on the PCIE and CPU, likely to run cooler and have higher clocks. Might be overclocked too.

Your use of CPU is somewhat unknown, as are your settings.

Probably has faster RAM too.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 34916 - Posted: 5 Feb 2014 | 15:51:49 UTC - in response to Message 34909.

How is it possible to have a crunching time like yours, Zoltan (about 16,000 sec), on the last WU? I have 4 GPUs like yours (GTX 780 Ti), and my crunching time is 24,000 sec. Why?

You have the same times as I have. To get Zoltan's time you need XP or Linux.
There is a thread about it: http://www.gpugrid.net/forum_thread.php?id=3580
____________
Greetings from TJ

Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Message 34918 - Posted: 5 Feb 2014 | 16:22:37 UTC - in response to Message 34911.
Last modified: 5 Feb 2014 | 16:26:11 UTC

48 CPU threads vs 8 threads - less resource conflict; HT doesn't scale really well for some CPU/GPU project combinations.


I gasped when I saw 48 too but it turns out his CPU has 12 real cores (24 virtual) and I suspect he might have 2 CPUs. Also, that's a socket 2011 CPU with 40 PCIe lanes so I doubt there is congestion on the PCIe bus unless his mobo is lane restricted.

BTW, I saw ads for that CPU... $2,500 US!!!

Gattorantolo... get with the penguin :-)
____________
BOINC <<--- credit whores, pedants, alien hunters

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 34923 - Posted: 5 Feb 2014 | 17:44:49 UTC - in response to Message 34918.

48 CPU threads vs 8 threads - less resource conflict; HT doesn't scale really well for some CPU/GPU project combinations.


I gasped when I saw 48 too but it turns out his CPU has 12 real cores (24 virtual) and I suspect he might have 2 CPUs. Also, that's a socket 2011 CPU with 40 PCIe lanes so I doubt there is congestion on the PCIe bus unless his mobo is lane restricted.

BTW, I saw ads for that CPU... $2,500 US!!!

Gattorantolo... get with the penguin :-)

The memory bandwidth could limit the performance of the GPU tasks, as the Xeon E5-2695v2 and E5-2697v2 processors are basically two i7-4960X processors within a single package: they have a lowered clock frequency (to stay within the 130W TDP) and some (safety) features turned on, but they have only the same 4-channel memory interface as the i7-4960X, capable of 59.7GB/s. So the two CPU chips inside the physical CPU share its memory interface (and its bandwidth).
If the two physical processors shared this physical memory interface, that could reduce the bandwidth further. But I think the two physical processors have separate physical memory, yet can access each other's physical memory through QPI or something similar, and this method can also reduce the throughput of the memory interface. If CPU tasks are running on all cores (virtual + real) of both physical CPUs, this impact could be big enough to reduce the performance of the GPU tasks by 10-20%. These hosts run Windows 8(.1), which is not a server OS, so I'm sure it's not aware of aligning an application's memory allocation to the physical memory of the CPU it's running on. The application can even be switched over to the other physical CPU, which is a time-consuming process and forces the CPU to handle the application's data transfers through (and with the help of) the other physical CPU. I think that only the Datacenter versions of the MS server OSes can handle this complex task. I don't know Linux, so there may be such an edition of that OS as well.

Profile Gattorantolo [Ticino]
Joined: 29 Dec 11
Posts: 44
Credit: 251,211,525
RAC: 0
Level
Asn
Message 34926 - Posted: 5 Feb 2014 | 21:33:49 UTC - in response to Message 34911.

Your GTX780Ti temperatures look very cool - too cool. If you are not using water cooling, your GPU's may be downclocking.

Water cooling of course :-)
Thank you for your help, GPUGRID crunchers ;-), now I know the "Problem"!

____________
Member of Boinc Italy.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34927 - Posted: 5 Feb 2014 | 21:57:36 UTC - in response to Message 34923.
Last modified: 5 Feb 2014 | 22:01:43 UTC

Gattorantolo, what are your Boinc processor usage settings, GPU clocks, and the RAM frequency?

The PCIE controller is on-die but if there are only 40 PCIE lanes 'in total' that's 20 per CPU or at most 8 per GPU, and massive contention. How much contention there is could be assessed if all CPU tasks were suspended for a full GPUGrid run.

Generally, the best CPU's for GPUGrid crunching have enough cores to support the number of GPU's, and PCIE lanes, but also have high clocks (at least until Maxwell arrives). I expect a 4th generation i7 has something over the LGA2011 processors.

The 2695v2 has a TDP of 115W, which is really sweet for CPU crunching.

As an aside, Linux scales VERY well, which is why it's used in data centers, including Microsoft's.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Dagorath
Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Level
Ile
Message 34929 - Posted: 5 Feb 2014 | 23:06:03 UTC - in response to Message 34927.

As an aside, Linux scales VERY well, which is why it's used in data centers, including Microsoft's.


Oh quit pulling my leg.

____________
BOINC <<--- credit whores, pedants, alien hunters

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,118,845,172
RAC: 0
Level
Met
Message 34932 - Posted: 6 Feb 2014 | 18:43:49 UTC

So I was able to complete this one task: gluilex2x33-NOELIA_DIPEPT1-0-2-RND9057_2
GPU [GeForce GTX 780 Ti] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 780 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.5
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3500MHz
# Memory width : 384bit
# Driver version : r331_82 : 33221
# GPU 0 : 52C
# GPU 0 : 53C
# GPU 0 : 54C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
# GPU 0 : 62C
# Time per step (avg over 12500000 steps): 1.657 ms
# Approximate elapsed time for entire WU: 20717.930 s
18:44:22 (7780): called boinc_finish

Unfortunately, the task right after this successful one failed:
970x-SANTI_MAR420cap310-0-32-RND2207_1
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 780 Ti] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : GeForce GTX 780 Ti
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.5
# PCI ID : 0000:01:00.0
# Device clock : 1084MHz
# Memory clock : 3500MHz
# Memory width : 384bit
# Driver version : r331_82 : 33221
# GPU 0 : 52C
# The simulation has become unstable. Terminating to avoid lock-up (1)

220x-SANTI_MAR423cap310-0-84-RND2313_1
(unknown error) - exit code -97 (0xffffff9f)

201x-SANTI_MARwtcap310-6-32-RND4585_0
(unknown error) - exit code -97 (0xffffff9f)

883x-SANTI_MARwtcap310-2-32-RND9284_0
same sh**

I cannot download and try other tasks... GPUGRID has blocked my client.
Why did this one task run with no problems?
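
A minimal sketch of how the stderr output quoted in this post could be summarized, assuming the log lines keep the format shown above (`# GPU 0 : 52C`, `# Memory clock : 3500MHz`, `# Time per step (avg over N steps): X ms`). The function name `summarize_stderr` and the returned keys are made up for illustration; this is not a GPUGrid or BOINC tool.

```python
import re


def summarize_stderr(text):
    """Summarize a stderr log in the format quoted in this thread (assumed format)."""
    # Temperature samples look like "# GPU 0 : 52C"; the "# GPU [GeForce ...]" banner does not match.
    temps = [int(t) for t in re.findall(r"#\s*GPU\s+\d+\s*:\s*(\d+)C", text)]
    mem = re.search(r"#\s*Memory clock\s*:\s*(\d+)MHz", text)
    step = re.search(r"Time per step \(avg over (\d+) steps\):\s*([\d.]+)\s*ms", text)
    return {
        "max_temp_c": max(temps) if temps else None,
        "memory_clock_mhz": int(mem.group(1)) if mem else None,
        # Only the exact wording from the failed task above is checked here.
        "unstable": "simulation has become unstable" in text.lower(),
        # steps * milliseconds per step, converted to seconds
        "estimated_runtime_s": (
            int(step.group(1)) * float(step.group(2)) / 1000 if step else None
        ),
    }
```

For the successful task, 12,500,000 steps at 1.657 ms per step gives about 20,712 s, which is in line with the roughly 20,718 s the application reported.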


Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 34944 - Posted: 8 Feb 2014 | 11:50:36 UTC - in response to Message 34932.

The successful task was a NOELIA_DIPEPT Work Unit. Typically these WU's utilize the GPU to a lesser extent. The task is quite different than the SANTI_MAR tasks.

Can I suggest you shut down your system, start it up, and underclock your GPU. Start by underclocking the GDDR5 memory to 3000MHz. Should that fail try 2700MHz and then 2600MHz. If you are still unsuccessful try to reduce the GPU clocks too.

When you continuously fail tasks the server stops sending them. This is to protect the server's available Internet bandwidth - it's essential.
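
The step-down procedure suggested in this post can be sketched as a simple loop. This is only an illustration: `is_stable` is a hypothetical callback standing in for the manual work of setting the clock (for example with MSI Afterburner) and crunching tasks long enough to trust the result, and the candidate list is the one from this post, starting from the stock 3500MHz.

```python
def find_stable_memory_clock(candidates_mhz, is_stable):
    """Return the first clock in candidates_mhz that is_stable accepts, else None.

    candidates_mhz should be ordered from highest to lowest, so the
    result is the highest clock that passed the (hypothetical) check.
    """
    for clock in candidates_mhz:
        if is_stable(clock):
            return clock
    return None


# Candidate memory clocks from the post above, highest first.
CANDIDATES_MHZ = [3500, 3000, 2700, 2600]
```

If the function returns `None`, every memory clock in the list failed, which corresponds to the last step of the advice: start reducing the GPU clocks as well.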
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

valterc
Joined: 21 Jun 10
Posts: 21
Credit: 6,161,484,672
RAC: 4,196,238
Level
Tyr
Message 35031 - Posted: 14 Feb 2014 | 10:59:56 UTC - in response to Message 34944.

Just for information, I went to the Gigabyte site and they are now selling a card called GV-N78TOC-3GD (Rev. 1.0). Is it the same as the one you described having problems? Other question, does this problem occur also with the similar GV-N78TGHZ-3GD?

a1kabear
Joined: 19 Oct 13
Posts: 15
Credit: 578,770,199
RAC: 0
Level
Lys
Message 35032 - Posted: 14 Feb 2014 | 11:08:38 UTC

GV-N78TOC-3GD (Rev. 1.0) is the same card as I got in Thailand which has the problems. It works ok now with the memory downclocked but runs at the same speed as my 780 oc'd which is disappointing (which is also the same speed as my titan which is sitting unused :/)

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 35033 - Posted: 14 Feb 2014 | 11:54:24 UTC - in response to Message 35032.

GV-N78TOC-3GD (Rev. 1.0) is the same card as I got in Thailand which has the problems. It works ok now with the memory downclocked but runs at the same speed as my 780 oc'd which is disappointing (which is also the same speed as my titan which is sitting unused :/)

Oh dear, such a waste. Just send me that unused Titan, and I'll put it in one of my hosts. :)

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 35034 - Posted: 14 Feb 2014 | 11:57:08 UTC - in response to Message 35031.

Just for information, I went to the Gigabyte site and they are now selling a card called GV-N78TOC-3GD (Rev. 1.0). Is it the same as the one you described having problems?

Yes, it's the same; my card is rev 1.0.

Other question, does this problem occur also with the similar GV-N78TGHZ-3GD?

Good question, I hope someone will answer that, as I don't plan to buy one just to find out.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 35035 - Posted: 14 Feb 2014 | 12:05:09 UTC - in response to Message 35034.

I still think their memory voltages are wrong.
http://www.gpugrid.net/forum_thread.php?id=3584&nowrap=true#34849
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37058 - Posted: 15 Jun 2014 | 14:14:38 UTC

Gigabyte released a new firmware (ver. F3) for this card in April.
Its description begins with this: "Release for HYNIX Memory", so I thought that it could fix my memory clock issue.
I didn't have the time or motivation to upgrade back then, and forgot about it, but today I upgraded the card's BIOS to F3.
The good news is that the card has been crunching fine for about 15 minutes now (at RAM clock 3500MHz, GPU clock 1137MHz).
I'll report the card's crunching status when I have something important to report :)

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37061 - Posted: 15 Jun 2014 | 21:48:57 UTC - in response to Message 37058.

No luck at 3500MHz - the task I referred to in my previous post failed after 5168 sec.
Two consecutive NOELIA tasks failed almost instantly at start.
Tried 3400MHz - it was worse than at 3500MHz.
Now at 3300MHz a NOELIA_BI has been running for 15 minutes without a "simulation became unstable" message.

a1kabear
Joined: 19 Oct 13
Posts: 15
Credit: 578,770,199
RAC: 0
Level
Lys
Message 37062 - Posted: 16 Jun 2014 | 1:59:53 UTC

Thanks for the update! Oh well :/

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37063 - Posted: 16 Jun 2014 | 2:57:27 UTC

Forgive me, as I'm late to this thread, but... have you tried setting the GPU fans manually, via Precision-X or MSI Afterburner, to the maximum fan % allowed for that GPU, just to see if keeping the GPU cooler will have an effect? Set it to maximum for 2 days, to test, maybe?

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 37064 - Posted: 16 Jun 2014 | 9:05:57 UTC - in response to Message 37061.

I sympathize with you, Retvari; having problems with a new, shiny piece of kit is a frustrating experience... I'm in a similar situation to yours, with a 750Ti acting in a psychotic manner.

I am refusing to RMA it just yet and am trying to make it work (with mixed results), but it won't be long now: if it keeps getting many errors I will go ahead and return it.

I feel you should do the same, especially with an expensive card like yours. Seriously, if something (anything) is wrong in the hardware, how feasible is it to repair it yourself? Maybe it would have been realistic 15-20 years ago, when many things were soldered by hand, but today's PCBs are multi-layered and full of tiny components soldered by robot hands to amazing precision. It's far more likely that you will break something if you try to fix it, and void your warranty along the way: injury and insult at the same time! I say, just RMA it and keep your peace of mind!

On the other hand, you may be very experienced with such labor, in which case I wish the best of luck to you!
____________

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37065 - Posted: 16 Jun 2014 | 11:07:38 UTC - in response to Message 37063.

Forgive me, as I'm late to this thread, but... have you tried setting the GPU fans manually, via Precision-X or MSI Afterburner, to the maximum fan % allowed for that GPU, just to see if keeping the GPU cooler will have an effect? Set it to maximum for 2 days, to test, maybe?

I had set a manual fan 'curve' in MSI Afterburner before I got this card: 20°C:40% -> 80°C:100%
I've tried every trick in the book on this card; none of them helped except reducing the RAM clock to 2700MHz.
This card rarely goes above 70°C:
This workunit was processed at 33°C ambient temperature, and GPU 1 max temp was 70°C.
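
As an illustration of the fan 'curve' mentioned in this post (20°C:40% -> 80°C:100%), here is a small sketch that assumes the fan speed is interpolated linearly between the two points and held constant outside them. How MSI Afterburner actually interpolates is not stated in the thread, and `fan_percent` is a made-up helper name.

```python
def fan_percent(temp_c, low=(20.0, 40.0), high=(80.0, 100.0)):
    """Fan speed in percent for a GPU temperature in °C.

    low and high are (temperature °C, fan %) points; values between them
    are linearly interpolated (an assumption), values outside are clamped.
    """
    (t0, f0), (t1, f1) = low, high
    if temp_c <= t0:
        return f0
    if temp_c >= t1:
        return f1
    return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
```

Under that assumption, the 70°C maximum reported for this card would correspond to a 90% fan setting.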

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37066 - Posted: 16 Jun 2014 | 11:47:19 UTC

There was a successful I770-SANTI_p53final at 3300MHz, but an e2s948_e1s373f83-SANTI_marsalWTbound2 and a 2x118-NOELIA_TRPS1S4 have failed.
I've set the GPU RAM to 3200MHz, but this I1072-SANTI_p53final had some "Simulation unstable" messages, so now the card is down at 3100MHz, and processing this 15x30-NOELIA_BI_3.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37071 - Posted: 17 Jun 2014 | 16:12:40 UTC

My Gigabyte GTX 780Ti OC has been crunching fine (i.e. without "Simulation unstable" messages) at 3100MHz RAM clock for more than 1 day now.
The source of the original problem could be a memory voltage / timing (latency) problem, which was addressed by the new BIOS release (F3) but wasn't solved completely. With the new BIOS I could achieve a 400MHz higher clock speed, though it's still 400MHz lower than nominal.
Is there any tool to tweak a Kepler GPU's memory settings besides the clock? (Voltage, latency etc.)

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 37120 - Posted: 21 Jun 2014 | 21:30:55 UTC - in response to Message 37071.

Your 3.1GHz ties in well with what I thought the situation might be:

Assuming H5GQ2H24AFR R2C, these require 1.6V to support 3.5GHz.

Excluding the possibility of bad GDDR5 and bad circuitry (which we can do nothing about anyway), my guess is that the card isn't supplying the necessary 1.6V, and is either supplying 1.5V or 1.35V - possibly 1.5V for some people and 1.35V for others; with 1.35V perhaps being sufficient for 2.7GHz and 1.5V sufficient for 3.1GHz. This ties in with what has been reported here and suggests a firmware, driver or OS issue.

My solution would be to stick with it at 3.1GHz, if it proves to be stable, or sell the card and get an equivalent second hand card that does run at 3.5GHz.

288larsson who posted in this thread also has a Gigabyte GTX 780 Ti OC (Windforce 3x) GPU.

Alas I don't know how to change the GDDR5 voltage.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37125 - Posted: 22 Jun 2014 | 17:58:57 UTC - in response to Message 37120.

I wrote to Gigabyte support; they replied that I should test my card the way I did back in December.
(Which I'm sure will end with the same results, so I haven't redone my tests yet.)
I'll dismount the cooler once again and check the type of every RAM chip individually. I'll also try to measure the operating voltage on the buffering capacitors.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37138 - Posted: 23 Jun 2014 | 22:44:34 UTC - in response to Message 34849.

H5GQ2H24AFA R2C or,
H5GQ2H24AFR R2C ?

Assuming H5GQ2H24AFR R2C, these require 1.6V to support 3.5GHz.

Excluding the possibility of bad GDDR5 and bad circuitry (which we can do nothing about anyway), my guess is that the card isn't supplying the necessary 1.6V, and is either supplying 1.5V or 1.35V - possibly 1.5V for some people and 1.35V for others; with 1.35V perhaps being sufficient for 2.7GHz and 1.5V sufficient for 3.1GHz. This ties in with what has been reported here and suggests a firmware, driver or OS issue.
This might be related to the performance levels of the GPU (0, 1, 2, 3, 4) which might behave differently for GK110 than GK104.

You were right: my card is built on 8 pieces of H5GQ2H24AFR R2C.
The only excuse for my mistake is that the font they use makes it very hard to tell apart the "A" from the "R", especially when the oil from the thermal pads is covering the chip's package.

I've got two more failures (NOELIA_THROMBIN units) on this card, so now my card is down at 3.0GHz.

I've found a capacitor near the RAM chips (on the other side of the PCB) on which I measured 1.58-1.633 volts.
I think I should check it with an oscilloscope to find out whether the capacitor is undersized. But first I have to find the right spot, and I couldn't find any info on this board's wiring.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37967 - Posted: 21 Sep 2014 | 20:11:46 UTC

There's a new BIOS version (F4) for this card on Gigabyte's website.
I've flashed it to my card, but a task immediately failed on default clocks (3.5GHz GDDR5).
I have to find the right GDDR5 frequency again with this new BIOS.
Now it's down by 100MHz (at 3.4GHz).

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37968 - Posted: 21 Sep 2014 | 20:35:39 UTC - in response to Message 37967.
Last modified: 21 Sep 2014 | 20:38:18 UTC

The task running on this card got a couple of "Simulation became unstable" messages in the stderr.txt file, so I took down the GDDR5 clock by another 100MHz (now it runs at 3.3GHz).

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37971 - Posted: 22 Sep 2014 | 8:30:29 UTC - in response to Message 37968.

There were "Simulation became unstable" messages, and a failed WU at 3.3GHz.
Now I'm testing the card at 3.2GHz.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37973 - Posted: 22 Sep 2014 | 9:45:48 UTC

Have you considered running Heaven, to determine how far you may need to downclock? If you can get Heaven to run at max settings overnight with no issues, then I'd consider it stable.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 37977 - Posted: 22 Sep 2014 | 11:17:22 UTC - in response to Message 37973.

Have you considered running Heaven, to determine how far you may need to downclock? If you can get Heaven to run at max settings overnight with no issues, then I'd consider it stable.

When I first tested the card, I did. The only application that failed is GPUGrid's.
See the first post of this thread.
BTW the card seems to be stable at 3.2GHz, but different workunit batches could use different parts of the GPU.
I suspect that something is messed up with the GDDR5 voltage, or with the PSU of the memory subsystem, on this card series.

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37980 - Posted: 22 Sep 2014 | 12:19:04 UTC - in response to Message 37977.

I apologize - although I did read most of the thread, I did miss the part where you said you tested with Heaven.

Nevertheless... did you just run a single benchmark with it? That's often not enough. Usually, it takes several hours to confirm stability.

I'd be curious to see if you can get through an *overnight* test, running at [DirectX 11, Ultra, Extreme, x8 AA, Full Screen, 1920x1080]... with:
- no application crashes
- no TDRs (as evidenced by dmp files in C:\Windows\LiveKernelReports\WATCHDOG)
- no strange display glitching (which would indicate memory corruption possibly due to memory being clocked too high)

Sorry if you've done this, or you feel this is an inappropriate test. But I do recommend that you try it, if you haven't already. Just trying to help.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 37987 - Posted: 22 Sep 2014 | 16:12:24 UTC - in response to Message 37980.

How can one read these dmp files Jacob?
____________
Greetings from TJ

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Level
His
Message 37988 - Posted: 22 Sep 2014 | 16:21:17 UTC
Last modified: 22 Sep 2014 | 16:23:25 UTC

I don't know. I think, if you have the Windows SDK and development tools installed, and have Windows symbols available, you might be able to step through them. But that's all beyond my ability.

The short story is: If you have a .dmp file in your C:\Windows\LiveKernelReports\WATCHDOG directory, it means your GPU had a TDR... and either an application was faulty, or the GPU was faulty, or (the most common case) you are pushing your GPU too hard in terms of Core Clock or Memory Clock.
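
For anyone who wants to check that directory without browsing to it, here is a minimal Python sketch that lists the .dmp files, newest first. The function name `list_dump_files` is made up for this example, and it assumes the script is run from an elevated prompt, since the folder is normally not readable otherwise.

```python
from datetime import datetime
from pathlib import Path

# Directory named in the post above; reading it usually needs administrator rights.
WATCHDOG_DIR = Path(r"C:\Windows\LiveKernelReports\WATCHDOG")


def list_dump_files(directory):
    """Return (file name, last-modified time) for each .dmp file, newest first."""
    directory = Path(directory)
    if not directory.is_dir():
        # No such folder (for example on a non-Windows host): nothing to report.
        return []
    dumps = [
        (path.name, datetime.fromtimestamp(path.stat().st_mtime))
        for path in directory.iterdir()
        if path.is_file() and path.suffix.lower() == ".dmp"
    ]
    return sorted(dumps, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, modified in list_dump_files(WATCHDOG_DIR):
        print(f"{modified:%Y-%m-%d %H:%M:%S}  {name}")
```

An empty listing after an overnight run would be consistent with no TDRs; a new file with a fresh timestamp points at the time of the reset.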

I still fully recommend Heaven 4.0, on the maximum settings I described a couple of posts up, running overnight, to confirm stability. I once did that for my two GTX 660 Ti cards and found that I had to decrease the clock on one and could increase the clock on the other; since then I have had zero problems with GPUGrid and with iRacing.

Not trying to spam this thread. Retvari, I hope you can get your issue figured out, and if my suggestion doesn't help you, then I apologize.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38007 - Posted: 23 Sep 2014 | 13:38:04 UTC - in response to Message 37987.
Last modified: 23 Sep 2014 | 13:55:12 UTC

How can one read these dmp files Jacob?


Having Visual Studio 2013 helps with reading certain files, but most can be read with Notepad if text is involved. (The WATCHDOG files are mostly text.)

If you have a .dmp file in your C:\Windows\LiveKernelReports\WATCHDOG directory

You can read these files with Notepad. Run it as administrator; in non-admin mode you'll see a "user doesn't have access" prompt.

If you have a game that is hard on a GPU (BF4, Metro 2033/Last Light), that works as a stress test. If you don't have any games on your disk, Heaven is a great stress tool, and the 3DMark Vantage benchmark has looping TMU, ROP and memory tests that strain cards to their limits. The extreme Fire Strike benchmark loops, and a card that is overclocked too far will fail it. If you have the NVIDIA CUDA samples, the n-body test can be looped, and a card that is overclocked too high will also fail the random number samples. This is how I found my cards' best temps and voltage.

A custom BIOS and NVIDIA Inspector .bat files, as Jacob has shown for setting "Max boost", work wonders once you know the card's limits for core/memory speeds and voltage. The new GM204 can run at 1.000 V at 1.2 GHz, and at 1.025 V as well. Overclocking records have passed 2 GHz with LN2 (GM204 is the first card ever to break 2 GHz). Many 1.5 GHz speeds have been reached with air cooling and stock voltage. GM204 is truly an engineering feat. The number of features added, along with the new filtering tech, really raises Molecular Dynamics performance in single precision.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38008 - Posted: 23 Sep 2014 | 14:09:51 UTC - in response to Message 37977.
Last modified: 23 Sep 2014 | 14:15:51 UTC

Have you considered running Heaven, to determine how far you may need to downclock? If you can get Heaven to run at max settings overnight with no issues, then I'd consider it stable.

When I first tested the card, I did. The only application that failed is GPUGrid's.
See the first post of this thread.
BTW the card seems to be stable at 3.2GHz, but different workunit batches could use different parts of the GPU.
I suspect that something is messed up with the GDDR5 voltage, or with the power supply of the memory subsystem on this card series.


I see this card runs ~80C. Do you know Voltage control temps on card? The VRM runs over 100C on some GTX 780 Ti cards (it is rated for 110C on your card). Gigabyte has been the worst offender, judging from the 780 Ti owner boards. The VRMs on eVGA and Asus 780 Ti cards are rated at 120-125C.

Do you know temps for DDR memory? These temps run really hot on certain GTX780ti's

The new Zotac GTX 970/980 cards, along with eVGA's 900 series, have the highest rated core/boost speeds.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 38010 - Posted: 23 Sep 2014 | 15:15:25 UTC - in response to Message 38007.

Thanks for your help eXaPower, but I have tried Notepad and WordPad, and no normal reading is possible.
I don't game but I have a few dmp files, perhaps from when a driver crashed with GPUGRID? And now I am interested in what is in those files.
____________
Greetings from TJ

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38014 - Posted: 23 Sep 2014 | 16:36:52 UTC - in response to Message 38010.
Last modified: 23 Sep 2014 | 16:41:34 UTC

Thanks for your help eXaPower, but I have tried Notepad and WordPad, and no normal reading is possible.
I don't game but I have a few dmp files, perhaps from when a driver crashed with GPUGRID? And now I am interested in what is in those files.


I neglected to mention that Notepad on Win8.1 will open these types of files only if Visual Studio has been installed beforehand, whether the license has expired or is current. When you have time to tweak your host, try the programs in the Windows System32 folder to see if one of them will open the file, or...

You can get access to the newest Visual Studio version (to create and test custom programs or files) with the new CUDA 6.5.19 toolkit. If you want the full Visual VC++ redistributable, Microsoft has a developer account (you can block all info being sent to them; just do a custom install). The trial period is 60-90 days. Once the CUDA toolkit and Visual Studio are linked together, a world of learning through DIY programs opens up. NVIDIA has a debugging program with their Registered Developer program. Intel has a great AVX/FMA3 DIY programming tool. AMD is an HSA member.

Linux is intertwined with NVidia HSA. A lot of options are available. Freedom of choice.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 38015 - Posted: 23 Sep 2014 | 16:52:02 UTC - in response to Message 38008.

I see this card runs ~80C.

No. This card (GPU1) runs at 65-70°C.
The other card (GPU0) - which is a standard NVidia design - runs fine on 3.5GHz at 80°C.

Do you know Voltage control temps on card?

I don't know the voltage control temps, but I think they should be lower than the other card's, as this card has more phases on that VR and better cooling.

Do you know temps for DDR memory? These temps run really hot on certain GTX780ti's

I don't know that either, but the same reasoning applies to the RAM chips as for the VRM chips.

eXaPower
Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Level
His
Message 38021 - Posted: 23 Sep 2014 | 18:07:03 UTC - in response to Message 38015.
Last modified: 23 Sep 2014 | 18:08:00 UTC

I see this card runs ~80C.

No. This card (GPU1) runs at 65-70°C.
The other card (GPU0) - which is a standard NVidia design - runs fine on 3.5GHz at 80°C.

Do you know Voltage control temps on card?

I don't know the voltage control temps, but I think they should be lower than the other card's, as this card has more phases on that VR and better cooling.

Do you know temps for DDR memory? These temps run really hot on certain GTX780ti's

I don't know that either, but the same reasoning applies to the RAM chips as for the VRM chips.


If time permits, you can read the temps manually with proper equipment: thin-gauge wires with the correct metal probes attached. Do you have tools for your Gigabyte Ti, or do you already have a kit for electrical or temperature readouts?

JugNut
Joined: 27 Nov 11
Posts: 11
Credit: 1,021,749,297
RAC: 0
Level
Met
Message 38063 - Posted: 25 Sep 2014 | 2:08:21 UTC - in response to Message 37987.

How can one read these dmp files Jacob?


Hi TJ, you can read those .dmp files with BlueScreenView:
http://www.nirsoft.net/utils/blue_screen_view.html#DownloadLinks

Just download the app, unzip it, and run the resulting BlueScreenView.exe. Then go to the Options menu and click "Advanced Options", select the radio button that says "Load a single MiniDump file", point it to the folder that was mentioned, C:\Windows\LiveKernelReports\WATCHDOG, and pick the .dmp file you want.

I hope the results give you what you're looking for.
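
As a side note on why a text editor shows nothing readable: Windows dump files are binary, and the first bytes identify the type. The Python sketch below is only an illustration; the signatures it checks (`PAGEDU64` and `PAGEDUMP` for kernel dumps, `MDMP` for user-mode minidumps) are my assumption about the usual formats, and the function names are invented for this example.

```python
# Assumed header signatures of common Windows dump formats.
KERNEL_DUMP_64 = b"PAGEDU64"
KERNEL_DUMP_32 = b"PAGEDUMP"
USER_MINIDUMP = b"MDMP"


def describe_dump_header(header):
    """Classify a dump from its first bytes (pass at least 8 bytes)."""
    if header[:8] == KERNEL_DUMP_64:
        return "64-bit kernel dump (binary)"
    if header[:8] == KERNEL_DUMP_32:
        return "32-bit kernel dump (binary)"
    if header[:4] == USER_MINIDUMP:
        return "user-mode minidump (binary)"
    return "unrecognized format"


def describe_dump_file(path):
    """Read only the first 8 bytes of a file and classify it."""
    with open(path, "rb") as handle:
        return describe_dump_header(handle.read(8))
```

If a file is reported as a kernel dump, a dump viewer or debugger is the right tool for it, not Notepad.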

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 38072 - Posted: 25 Sep 2014 | 18:29:24 UTC
Last modified: 25 Sep 2014 | 18:29:48 UTC

I had another failed workunit on this card, so I took another 100MHz off, it's now running at 3.1GHz GDDR5 clock.
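
The procedure is simply "drop 100MHz after every failure until it holds". A rough Python sketch of that loop, for illustration only: `is_stable` is a hypothetical callback standing in for "ran workunits at this clock without an error", and the 2700MHz floor is an assumed stopping point, not a measured limit.

```python
def find_stable_clock(start_mhz, is_stable, step_mhz=100, floor_mhz=2700):
    """Lower the memory clock in fixed steps until is_stable(clock) is True.

    Returns the first clock that passes, or None if nothing at or above
    floor_mhz passes.
    """
    clock = start_mhz
    while clock >= floor_mhz:
        if is_stable(clock):
            return clock
        clock -= step_mhz
    return None


if __name__ == "__main__":
    # Pretend the card only holds at 3100MHz or below.
    print(find_stable_clock(3500, lambda mhz: mhz <= 3100))
```

In practice each `is_stable` check means hours or days of crunching, so the loop is run by hand, one step per failed workunit.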

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 38078 - Posted: 26 Sep 2014 | 8:36:32 UTC - in response to Message 38063.

Thanks JugNut I will try it over the weekend.
____________
Greetings from TJ

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38105 - Posted: 27 Sep 2014 | 13:25:23 UTC
Last modified: 27 Sep 2014 | 13:57:32 UTC

Hello,

Wanted to try GPUGRID today, but UNFORTUNATELY, still ONLY ERRORS within 15 seconds :/

Tried 1 GPU only (0 and then 1) but same result.

Temp 65°

What's happening ?

What can we do ?

EDIT : Have decreased the Power target down to 60 % + short WU's, but same problem. All tasks = errors.

388-NOELIA_20MGK36I-2-5-RND2053_0 10116890 160926 27 Sep 2014 | 9:40:02 UTC 27 Sep 2014 | 9:50:20 UTC Error while computing 8.27 2.54 --- Long runs (8-12 hours on fastest card) v8.41 (cuda60)
14-NOELIA_20MGK36I-2-5-RND3099_0 10116873 160926 27 Sep 2014 | 9:33:35 UTC 27 Sep 2014 | 9:41:41 UTC Error while computing 67.69 12.40 --- Long runs (8-12 hours on fastest card) v8.41 (cuda60)
522-NOELIA_20MGWT-2-5-RND4515_0 10116837 160926 27 Sep 2014 | 9:33:35 UTC 27 Sep 2014 | 9:50:20 UTC Error while computing 2.40 0.00 --- Long runs (8-12 hours on fastest card) v8.41 (cuda60)
249-NOELIA_20MGK36I-2-5-RND5686_0 10116835 160926 27 Sep 2014 | 9:33:35 UTC 27 Sep 2014 | 9:41:41 UTC Error while computing 10.39 2.96 --- Long runs (8-12 hours on fastest card) v8.41 (cuda60)
747-NOELIA_20MGK36I-2-5-RND8713_0 10116750 160926 27 Sep 2014 | 9:41:42 UTC 27 Sep 2014 | 9:58:09 UTC Error while computing 7.78 2.95 --- Long runs (8-12 hours on fastest card) v8.41 (cuda60)
232-NOELIA_20MGK36I-2-5-RND3755_0 10116603 160926 27 Sep 2014 | 9:33:35 UTC 27 Sep 2014 | 9:40:02 UTC Error while computing 7.30 2.34 --- Long runs (8-12 hours on fastest card) v8.41 (cuda60)
406-NOELIA_20MGWT-2-5-RND6078_0 10116559 160926 27 Sep 2014 | 9:41:42 UTC 27 Sep 2014 | 9:58:09 UTC Error while computing 4.18 2.22 --- Long runs (8-12 hours on fastest card) v8.41 (cuda60)
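
To see how quickly tasks like the ones listed above fail, the run time and CPU time can be pulled out of each row. This is a small Python sketch under one assumption: that the two numbers directly before the `---` credit placeholder are run time and CPU time in seconds, as on the project's task pages. `parse_task_row` is a name invented for this example, and the status wording does not matter to it.

```python
def parse_task_row(row):
    """Extract task name, run time and CPU time (seconds) from one task row."""
    tokens = row.split()
    marker = tokens.index("---")  # credit column, shown as --- for failed tasks
    return {
        "name": tokens[0],
        "run_time_s": float(tokens[marker - 2]),
        "cpu_time_s": float(tokens[marker - 1]),
    }


# Two rows based on the list above.
ROWS = [
    "388-NOELIA_20MGK36I-2-5-RND2053_0 10116890 160926 27 Sep 2014 | 9:40:02 UTC "
    "27 Sep 2014 | 9:50:20 UTC Error while computing 8.27 2.54 --- "
    "Long runs (8-12 hours on fastest card) v8.41 (cuda60)",
    "14-NOELIA_20MGK36I-2-5-RND3099_0 10116873 160926 27 Sep 2014 | 9:33:35 UTC "
    "27 Sep 2014 | 9:41:41 UTC Error while computing 67.69 12.40 --- "
    "Long runs (8-12 hours on fastest card) v8.41 (cuda60)",
]

tasks = [parse_task_row(row) for row in ROWS]
longest = max(task["run_time_s"] for task in tasks)
print(f"{len(tasks)} tasks parsed, longest run time {longest} s")
```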

Best Regards

Philippe

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38107 - Posted: 27 Sep 2014 | 14:06:15 UTC - in response to Message 37120.

Your 3.1GHz ties in well with what I thought the situation might be,

Assuming H5GQ2H24AFR R2C, these require 1.6V to support 3.5GHz.

Excluding the possibility of bad GDDR5 and bad circuitry (which we can do nothing about anyway), my guess is that the card isn't supplying the necessary 1.6V, and is either supplying 1.5V or 1.35V - possibly 1.5V for some people and 1.35V for others; with 1.35V perhaps being sufficient for 2.7GHz and 1.5V sufficient for 3.1GHz. This ties in with what has been reported here and suggests a firmware, driver or OS issue.

My solution would be to stick with it at 3.1GHz, if it proves to be stable, or sell the card and get an equivalent second hand card that does run at 3.5GHz.

288larsson who posted in this thread also has a Gigabyte GTX 780 Ti OC (Windforce 3x) GPU.

Alas I don't know how to change the GDDR5 voltage.



Hello !

How come these 780 Ti cards are doing OK on all other projects but GPUGRID ?

If it were a hardware issue, one should have problems on all projects, I guess ?

Thank You

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 38114 - Posted: 27 Sep 2014 | 21:19:04 UTC - in response to Message 38107.

Hello !

How come these 780 Ti cards are doing OK on all other projects but GPUGRID ?

If it were a hardware issue, one should have problems on all projects, I guess ?

Thank You

Hello Philippe,

It's because the GPUGrid app is the most advanced one. It's compiled with the latest CUDA version, so it can utilize the card like no other project's app can. The "GPU usage" measurement is misleading.

Could you please specify all details of your GTX780Ti (Manufacturer, model, clocks), and your PSU (Manufacturer, model, wattage, efficiency)?

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38119 - Posted: 28 Sep 2014 | 7:23:30 UTC
Last modified: 28 Sep 2014 | 8:02:00 UTC

Hello Zoltan,

Thank you for your message.

the 2 * GTX780Ti = Gigabyte Windforce GV-N78TWF3 - 3GD

http://www.gigabyte.fr/products/product-page.aspx?pid=4912#sp

The PSU is a brand new CORSAIR RM1000 / 1000W 80+ Gold

MB = ASUS Z87PRO / CPU i7-4770K / WC AIO NEPTON 140 XL / RAM DDR3 Corsair Vengeance 2 x 8 GB 1600 MHz CL10 LP

------------------------------------------------------------------------------


NB Should one allow 1 CPU core / WU or is 0.5 still OK ?

I see that your CPU time = GPU time, and on the only WU I finished, the CPU use is about 25 % of GPU time ? http://www.gpugrid.net/result.php?resultid=13111203

I have finished only 1 WU since I plugged in these cards last week ...

Thank You very much for your help !

Best Regards

Philippe

EDIT : I have modified app_config to 1 CPU / 1 GPU and decreased the "MEM CLOCK" from 3500 down to 3100, and it looks like the WU won't crash. At least, there was no error during the first 6 minutes.

Should one keep this 3100 MHz as the standard for GPUGRID, or can I increase it step by step, and up to what value ?

Does this method extend the time it takes to complete the WUs ?
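
For reference, the "1 CPU / 1 GPU" setting mentioned above lives in an app_config.xml file in the project folder. The Python sketch below only generates such a file as text. The application name `acemdlong` is an assumption used for illustration; the real short name should be taken from client_state.xml on the host. `build_app_config` is a helper made up for this example.

```python
from xml.sax.saxutils import escape

# Layout of a BOINC app_config.xml with one <app> entry.
TEMPLATE = """<app_config>
  <app>
    <name>{name}</name>
    <gpu_versions>
      <gpu_usage>{gpu}</gpu_usage>
      <cpu_usage>{cpu}</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""


def build_app_config(app_name, cpu_usage=1.0, gpu_usage=1.0):
    """Return app_config.xml text reserving cpu_usage cores per GPU task."""
    return TEMPLATE.format(name=escape(app_name), gpu=gpu_usage, cpu=cpu_usage)


if __name__ == "__main__":
    # "acemdlong" is an assumed application name, see the note above.
    print(build_app_config("acemdlong"))
```

With `gpu_usage` at 1.0 each task gets a whole GPU, and `cpu_usage` at 1.0 makes the client budget a full core for it. The client has to re-read its config files (or be restarted) before the change applies.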

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38120 - Posted: 28 Sep 2014 | 9:23:05 UTC

All WUs are crashing with errors, so I am switching to Collatz (GPU use = 99 % thanks to an ad hoc .config file) until PrimeGrid recovers.

I hope to be able to run GPUGRID one day without any problem, as I bought these cards with the aim of increasing my participation in this BIO project.

Thank You

Best,

Philippe

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 6,493,864,375
RAC: 2,796,812
Level
Tyr
Message 38127 - Posted: 28 Sep 2014 | 11:55:03 UTC - in response to Message 38015.

I see this card runs ~80C.

No. This card (GPU1) runs at 65-70°C.
The other card (GPU0) - which is a standard NVidia design - runs fine on 3.5GHz at 80°C.

Do you know Voltage control temps on card?

I don't know the voltage control temps, but I think they should be lower than the other card's, as this card has more phases on that VR and better cooling.

Do you know temps for DDR memory? These temps run really hot on certain GTX780ti's

I don't know that either, but the same reasoning applies to the RAM chips as for the VRM chips.


In this review of your card (right one?), they measured temps using thermal imaging and found the VRM is running quite hot (87C @ load)

http://www.guru3d.com/articles_pages/gigabyte_geforce_gtx_780_ti_windforce_3x_review,9.html

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38129 - Posted: 28 Sep 2014 | 14:05:58 UTC - in response to Message 38127.
Last modified: 28 Sep 2014 | 14:14:34 UTC

Hello !

Thank you for your message.

These Gigabyte 780Ti are in an open case, with 3 extra fans helping cooling.

Or supposed to help cooling.

They don't exceed (GPU) 65° C, but no idea about the VRM temp ...

Have decreased the MEM CLOCK, unsuccessfully :/


On the other hand, they run just fine on Collatz, with an extra .config file
that utilizes the card at 99 %.


verbose=1
items_per_kernel=22
kernels_per_reduction=9
threads=9
sleep=1


They also run OK on PPS Sieve (PrimeGrid) ...

I will probably build a 100 % water-cooled crunchbox, though not with the 780 Ti; I will wait until the 980 is accepted by GPUGRID.

In the meantime, any idea what I can do in order to be able to crunch on GPUGRID ?

I can use EVGA Precision X in order to decrease power or temp ...

Thank You

Philippe

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 38130 - Posted: 28 Sep 2014 | 14:14:05 UTC - in response to Message 38127.

Interesting, I remember having to cool the back of a GPU to keep it stable (might have been a ref GTX660 or 650TiBoost). I just used a case fan angled up at the bottom of the card.

Not sure about the VRM but the memory (H5GQ2H24AFR-R2C) is only rated to 70℃,
http://component.iiic.cc/index.php?main_page=product_info&products_id=1198893
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38131 - Posted: 28 Sep 2014 | 14:19:06 UTC - in response to Message 38130.
Last modified: 28 Sep 2014 | 14:22:43 UTC

The GPU temp is monitored using EVGA Precision X, and the fan speed (in %) = Temp + 10 => the 3 "Windforce" fans are running already very fast + the 3 external fans are helping with heat dispersal ...
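
To make the fan rule above concrete (fan % = temperature + 10), here is a tiny Python sketch. The clamping to the 0-100 % range is my assumption about how any fan tool has to behave at the extremes; `fan_speed_percent` is a made-up name, not a Precision X function.

```python
def fan_speed_percent(gpu_temp_c, offset=10, minimum=0, maximum=100):
    """Fan duty in percent for the rule 'temperature in C plus offset', clamped."""
    return max(minimum, min(maximum, gpu_temp_c + offset))


if __name__ == "__main__":
    for temp in (62, 65, 95):
        print(temp, "C ->", fan_speed_percent(temp), "%")
```

At the 62-65 °C reported in this thread the rule gives 72-75 % fan speed, so there is still some headroom before the fans reach 100 %.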

Do you think the origin of the problem could be heat ?

NB : http://www.gpugrid.net/forum_thread.php?id=2507&nowrap=true#21073


This WU crashed after 1 hour only, GPU Temp about 62 ° ... :
http://www.gpugrid.net/result.php?resultid=13143650

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 38132 - Posted: 28 Sep 2014 | 14:33:14 UTC - in response to Message 38131.
Last modified: 28 Sep 2014 | 14:33:38 UTC

My point is that the GPU fans cool the top of the card, but not the back.
GDDR5 temps are not easy to measure and this type is only rated to work up to 70℃.
There seem to be issues with some Gigabyte Windforce GTX780Ti versions, but only when crunching here. We think it's related to the GDDR5, its voltage or the VRM, and we've identified that the card uses H5GQ2H24AFR-R2C (at least in Zoltan's case), which requires 1.6V (rather than 1.5V for some 7GHz modules or 1.35V for some 6GHz GDDR modules). Obviously 1.6V generates more heat than 1.5V...
So while there might be a design fault, there is nothing we can do to fix that, but we can try to treat the symptoms.
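
For illustration, the guess quoted earlier in this thread (1.35V being enough for about 2.7GHz, 1.5V for about 3.1GHz, and 1.6V needed for 3.5GHz) can be written as a small lookup. These pairs are a guess from the discussion, not measured or datasheet values, and `guessed_supply_voltage` is a name invented for this Python sketch.

```python
# (supply voltage in V, highest memory clock in MHz it is guessed to sustain)
GUESSED_LIMITS = [(1.35, 2700), (1.50, 3100), (1.60, 3500)]


def guessed_supply_voltage(stable_clock_mhz):
    """Return the lowest guessed voltage whose limit covers the observed clock.

    Returns None when the clock is above every limit in the table.
    """
    for voltage, limit_mhz in GUESSED_LIMITS:
        if stable_clock_mhz <= limit_mhz:
            return voltage
    return None


if __name__ == "__main__":
    # A card that is only stable up to 3100MHz fits the 1.5V row.
    print(guessed_supply_voltage(3100))
```

Read this way, a card that only holds at 3.1GHz would be consistent with the memory getting 1.5V instead of 1.6V, which is the hypothesis, not a proof.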
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38133 - Posted: 28 Sep 2014 | 15:07:39 UTC - in response to Message 38132.

Thank you for your message !

I am currently trying to run short WUs.

I have decreased the MEM CLOCK by 300 MHz.

The voltage shown in EVGA Precision X is 1.2 for card 0 and 1.175 for card 1.

Have added extra external fans, (= 4 external fans) in order to cool the back of the cards +
have now the PSU outside the case (it's a real mess ;) )

Will see if it works.

Thank You

Philippe

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38143 - Posted: 29 Sep 2014 | 5:54:21 UTC

Would GPU WaterCooling be the ideal solution ?

Or better to change for 970/980 ?

Thank You

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38161 - Posted: 29 Sep 2014 | 12:31:58 UTC
Last modified: 29 Sep 2014 | 12:51:07 UTC

EDIT : It seems that today not only 780Ti have problems :/

EDIT 2 : I give up. All WU's crashing.

Hello !

Currently running other tests.

MB = ASUS Z87 Pro / PSU CORSAIR RM 1000

When GPUGRID runs on both GPU, the GPU Load =

94 % on card 0

65 % on card 1

?

=> WU on card 0 crashes first, and after a few seconds only.

This afternoon, I will try to run GPUGRID on card 1 (looks to be slow but TBC)
while running Collatz on card 0 (GPU Load = 99 %)

NB : The cards are 100 % stable running Collatz with high temp + GPU Load 99 %, even when both GPU @ work.


Do you think that GPU Watercooling might help ?

NB2 : 2 small external fans directed on the GPU's back ...

Thank You

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 38174 - Posted: 29 Sep 2014 | 17:11:07 UTC - in response to Message 38161.

I do not think that water cooling is the answer as all tasks are failing on your card.
If cooling the back failed then I can only suggest reducing the GDDR5 down to 3000MHz or less.
I suggest you get rid of the GTX780Ti as your Gigabyte Windforce doesn't work well here.
A GTX970 should be a good card when the apps work well with the GTX900 range.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile [AF>Amis des Lapins] Phil...
Joined: 16 Jul 13
Posts: 56
Credit: 1,626,354,890
RAC: 0
Level
His
Message 38244 - Posted: 2 Oct 2014 | 6:10:43 UTC - in response to Message 38174.
Last modified: 2 Oct 2014 | 7:06:13 UTC

Thank you for your message.

However, when I read all the posts about errors with the same message,
"simulation has become unstable", on different types of cards,
I wonder whether buying a 970 or 980 will really "solve the problem" ...


EDIT : Bought 2 * GTX970 Gigabyte Gaming G1 WindForce 4 GB.

Hope it will work ;)
