
Message boards : Graphics cards (GPUs) : Video Card Longevity

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 5638 - Posted: 15 Jan 2009 | 13:21:21 UTC

Does anyone have any first-hand experience with failures related to 24/7 crunching? Overclocked or stock?

I ask because a posting on another forum claimed failures from the stress of crunching 24/7. No specific information was given, so I don't really think it was a valid statement.

Anyone?
____________
mike

Profile X1900AIW
Joined: 12 Sep 08
Posts: 74
Credit: 23,566,124
RAC: 0
Message 5642 - Posted: 15 Jan 2009 | 14:43:16 UTC - in response to Message 5638.

Of course I got failures with overclocking, immediately after the stress-test run (which should be done in any case before working on WUs). In my opinion this is not a matter of 24/7 operation, but of stability and of the time you invest in testing and adjusting your clock rates and fans (don't forget the case fans!).

I tested both of my GTX 260s with aggressive overclocking in RivaTuner; afterwards I flashed the BIOS, including the fan settings. If cooling and temperature can be controlled, only different WUs (for example the upcoming "big" WUs in folding@home) can compromise your OC settings. Hardware issues can never be ruled out, whether you overclock or not. No one can guarantee 24/7 operation. It's risky, whatever you do in this crunching business, especially in beta projects.

My new 9800GX2 runs at stock for now, because I have no experience with that monster. It went straight onto GPUgrid, but cooling seems to be fine. [I swapped the GTX 260 for the 9800GX2 in the middle of a WU, and it kept working.]

Find your best settings (stock or overclocked, with a fixed fan speed) and reduce the clock a bit to get some tolerance. In my opinion, don't count on automatic fan control by the driver settings; I would fix the fan speed in any case if you are thinking about 24/7 operation, both to control temperature and to prevent your fans from being damaged by periodically spinning up and down.

Good luck.

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 5643 - Posted: 15 Jan 2009 | 14:55:40 UTC

Thanks for the response; I guess I should have worded my query differently.

I am interested in complete failure of the video card from crunching.
____________
mike

Profile X1900AIW
Joined: 12 Sep 08
Posts: 74
Credit: 23,566,124
RAC: 0
Message 5645 - Posted: 15 Jan 2009 | 15:11:29 UTC - in response to Message 5643.

You mean an irreparable failure? Or a temporary malfunction? Just crunching (shader usage), or all the way up to a collapse of the 2D function?

I've heard about some cases in the Folding forum. See
http://foldingforum.org/viewforum.php?f=49
http://foldingforum.org/viewforum.php?f=38

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 5646 - Posted: 15 Jan 2009 | 15:14:14 UTC - in response to Message 5645.

Ruin of the card to the point of being unusable.

There are those who contend that 24/7 crunching will destroy a video card.
____________
mike

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 5647 - Posted: 15 Jan 2009 | 16:22:55 UTC

The trouble is the lack of context.

I have run computers 24/7 for years with no problems. There is also a group that thinks that when you run computers you must run them right up to the edge of stable performance. I have long taken the stance that over-clocking is not a good thing for scientific computing. That does not mean that I think those who over-clock are evil ...

All that being said, it is certainly possible that if you take the computing equipment to the edge, and are not that skilled in the maintenance of machines tweaked to that performance level, you can experience machine failures due to heat (primarily) or voltage (because of mis-adjustments) ...

And some cases are not configured to remove the heat when you add several, or even one, high-performance GPUs and then run them at full speed 24/7 ...

Oh, well, just my thoughts ...
____________

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 5650 - Posted: 15 Jan 2009 | 19:02:30 UTC - in response to Message 5647.

The trouble is the lack of context.

I have run computers 24/7 for years with no problems. There is also a group that thinks that when you run computers you must run them right up to the edge of stable performance. I have long taken the stance that over-clocking is not a good thing for scientific computing. That does not mean that I think those who over-clock are evil ...

All that being said, it is certainly possible that if you take the computing equipment to the edge, and are not that skilled in the maintenance of machines tweaked to that performance level, you can experience machine failures due to heat (primarily) or voltage (because of mis-adjustments) ...

And some cases are not configured to remove the heat when you add several, or even one, high-performance GPUs and then run them at full speed 24/7 ...

Oh, well, just my thoughts ...



Can I assume that you have no failures to discuss?
____________
mike

Profile Nightlord
Joined: 22 Jul 08
Posts: 61
Credit: 5,461,041
RAC: 0
Message 5651 - Posted: 15 Jan 2009 | 19:03:26 UTC

If it helps, I have several cards here that have run 24/7 on GPUGrid since July last year with no failures.

I have also never lost a CPU, RAM, or hard drive due to 24/7 crunching. I damaged a mobo some years ago, but that was my stupidity coupled with a live PSU and a screwdriver.

Your mileage may vary.
____________

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 5653 - Posted: 15 Jan 2009 | 19:21:43 UTC - in response to Message 5650.

Can I assume that you have no failures to discuss?


No I do not. Do you?

What I was saying is that some report failures and blame GPU Grid, BOINC, etc., when the problem is that these programs run the system at full speed for long periods of time, which will, in fact, stress the system upon which they are run.

If there is a problem, or a weakness, in the system, the use of a program such as BOINC is probably going to push the system over the brink ... is that the fault of BOINC? Not really ...

Just as race cars lose engines through explosions and other catastrophic events because they are pushed to the edge, where any minor flaw or event will cause failure, so it is with BOINC ...
____________

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 5655 - Posted: 15 Jan 2009 | 20:53:59 UTC - in response to Message 5651.

If it helps, I have several cards here that have run 24/7 on GPUGrid since July last year with no failures.

I have also never lost a CPU, RAM, or hard drive due to 24/7 crunching. I damaged a mobo some years ago, but that was my stupidity coupled with a live PSU and a screwdriver.

Your mileage may vary.


This is what I find everywhere I have asked. I had assumed that there would be no big issues and had said so ... but was told [without foundation] that the card's longevity would be severely shortened by crunching.

Thanks for your input.
____________
mike

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 5656 - Posted: 15 Jan 2009 | 20:55:03 UTC - in response to Message 5653.

Can I assume that you have no failures to discuss?


No I do not. Do you?

What I was saying is that some report failures and blame GPU Grid, BOINC, etc., when the problem is that these programs run the system at full speed for long periods of time, which will, in fact, stress the system upon which they are run.

If there is a problem, or a weakness, in the system, the use of a program such as BOINC is probably going to push the system over the brink ... is that the fault of BOINC? Not really ...

Just as race cars lose engines through explosions and other catastrophic events because they are pushed to the edge, where any minor flaw or event will cause failure, so it is with BOINC ...


Thank you for your input.
____________
mike

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Message 5657 - Posted: 15 Jan 2009 | 21:08:17 UTC
Last modified: 15 Jan 2009 | 21:09:53 UTC

Guys, this is a serious topic. I could talk a lot about this, but will try to stay focussed. Feel free to ask further questions!

Basically there are 3 kinds of chip failures:

1. Transient errors: you push your system to its clock speed limits and it fails (or just fails occasionally). You reboot and everything is fine again. We're not concerned about these, just back off a few MHz and you're good to go.

2. Catastrophic failures: a chip suddenly fails and refuses to work; it's broken. This just happens, and could only be avoided by not running your machines at all, or never running them under load.. luckily such chip failures are very rare. I think power supply circuitry breaks much more often than the chips themselves.

3. Decay of chips: this is something to be concerned about and what I'll talk about a bit more.

What does this decay look like?

At a given voltage any chip can run up to a certain frequency; if pushed higher, some transistors (actually entire data paths) cannot switch fast enough and the operation fails. This maximum frequency is determined by the slowest element. During operation, current flows through the transistors in the form of electrons. This current causes microscopic changes in the atomic structure, which ultimately degrade transistor performance.

Thus, over time the transistors become worse and the chip can no longer reach as high a frequency as it did in the beginning. Or, equivalently, it needs a higher voltage to maintain a certain speed.

Usually we don't notice this decay, because the manufacturers build enough headroom into the chips that they'll long since be retired before the effect kicks in. It's only when you push your chip to its limit that you notice the change. Ever wondered why your OC fails at the beginning of a new summer, when it worked perfectly last year? That's the decay. Usually it's not dramatic: at 24/7 load, stock voltage and adequate cooling I'd estimate 10 - 50 MHz per year.

So what can make this decay matter?

In short: temperature and voltage. Temperature does increase the "decay rate" (or, if you just watch components until they finally break, the failure rate) a little bit. An old rule of thumb is "half the lifetime for every 10 degrees more". I'm not sure how appropriate this still is.. the laws of physics tend to be rather time-independent, but our manufacturing processes are changing.
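That "10 degrees" rule of thumb is easy to play with numerically. A minimal sketch, where the reference temperature and baseline lifetime are illustrative assumptions, not datasheet values:

```python
def relative_lifetime(temp_c, base_temp_c=70.0, base_life_years=10.0):
    """Estimated lifetime if every +10 degrees C halves it (rule of thumb).

    base_temp_c and base_life_years are made-up reference values."""
    return base_life_years * 2.0 ** ((base_temp_c - temp_c) / 10.0)

print(relative_lifetime(70.0))  # 10.0 (baseline)
print(relative_lifetime(90.0))  # 2.5  (two halvings for +20 degrees C)
print(relative_lifetime(60.0))  # 20.0 (one doubling for -10 degrees C)
```

Whatever the exact constants, the dependence is exponential, so better cooling pays off disproportionately.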

So temperature is to be avoided, but now comes the kicker: voltage is a real killer! Its effect on chip lifetime is much more severe. I can't give precise numbers, but as long as you stay within the range of stock voltages you're surely fine. Example: 65 nm C2D / C2Q are rated up to 1.35 V. So increasing your voltage from 1.25 V to 1.30 V does hurt your chip, but the effect is not dramatic - you'll still be able to use the chip for a very long time. But going to 1.45 V or any higher.. I really wouldn't recommend it for 24/7. Personally my OC'ed 65 nm C2Q is set to 1.31 V, which amounts to 1.22 V under load and I'm fine with that.

Some consequences:

If people push their chips to high voltages and they fail "suddenly" this is actually a rapid decay due to voltage. The time scale is different, but the mechanism is the same.

The common wisdom of "just increase voltage & clock as long as temps are fine" is not true. If it weren't for the large safety margins built into these chips, people would kill many more chips with such OC.

"Overclocking can kill your chip!" - true, because it actually can do so.. but it's very very unlikely unless you apply high voltages or totally forget about cooling. Overclocking itself means increasing the frequency. Note that this does not necessarily include raising the voltage! A higher OC is not always better.. something which most people are not aware of. I overclock at reasonable voltages (and with good cooling), which doesn't give me the highest numbers but enables me to run BOINC 24/7 on these systems without problems.

So what does this mean for GPU crunching?

Usually we only OC our GPUs a little, but we don't raise the voltage (due to a lack of means to do so easily). That means OCing GPUs does not have a dramatic impact on their lifetime. Power consumption increases linearly with frequency, which is not that much. Temperatures increase a little and the power supply circuitry on the card is stressed a bit more.. but we're not drawing as much power as in games, so that should be fine and well within specs.
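The linear frequency scaling mentioned above, and the much harsher effect of voltage, both fall out of the usual dynamic-power approximation P ≈ C·V²·f. A sketch with made-up numbers (no real card is being modeled):

```python
def dynamic_power(v, f_mhz, c=1.0):
    """Dynamic switching power, roughly C * V^2 * f (illustrative units)."""
    return c * v ** 2 * f_mhz

stock = dynamic_power(1.10, 600.0)          # assumed stock voltage and clock
oc_clock = dynamic_power(1.10, 660.0)       # +10% clock, same voltage
oc_clock_volt = dynamic_power(1.21, 660.0)  # +10% clock and +10% voltage

print(round(oc_clock / stock, 3))       # 1.1   -> linear in frequency
print(round(oc_clock_volt / stock, 3))  # 1.331 -> voltage enters squared
```

A 10% clock bump alone costs about 10% more power, while adding a 10% voltage bump compounds to roughly 33%, which is why voltage is the lever to treat with respect.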

What we're not safe from, however, is temperature. Compare CPUs and GPUs and you'll see that GPUs usually run much hotter, due to the limited cooling available in a 1- or 2-slot form factor. Stock fan settings keep most cards between 70 and 90°C. One can argue that "GPUs are designed for such temperatures". Well, they're not. TSMC cannot disable the laws of physics just because it's ATI or NV who asks them to make a GPU. That's really trying to make a fortune out of a mishap (1). GPUs run so hot because it's damn inconvenient to cool them any better. It's not that they couldn't stand 90°C.. they just don't have to do it for too long. Nobody's going to game 24/7 for years.

So I sincerely think heat is our main enemy in GPU crunching and OC isn't. Let me put another reference to my cooling solution here.

I can even go a bit further and argue that OC is somewhat beneficial if you're interested in longevity. Let me explain: if you push your card to its limit, back off a bit for safety, and at some point see it fail, you can back off a few more MHz and you're likely good again for some time. If these cycles accelerate, you know you've reached the end of the (crunching) life of your chip. Now you could still give it away to some gamer on a budget who can use it at stock frequency for quite some time to come. At that point degradation will slow down, as the card is no longer running 24/7 or at 100% load.
On the upside, you know when you should retire your GPU from active crunching. On the downside, you'll have to watch things more closely or you'll produce errors.
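That back-off cycle amounts to a tiny control policy. A sketch, with hypothetical clock values and step size (a real setup would apply the clock via your OC tool):

```python
def next_clock(current_mhz, errors_seen, step_mhz=15.0):
    """Keep the clock if the period was error-free, otherwise back off a step."""
    return current_mhz - step_mhz if errors_seen else current_mhz

clock = 700.0  # hypothetical starting overclock
for errored in [False, False, True, False, True]:  # simulated crunching periods
    clock = next_clock(clock, errored)
print(clock)  # 670.0 after two back-offs
```

When the back-offs start coming faster and faster, that is the signal to retire the card from crunching.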

Some personal experience:

My OC'ed 24/7 chips | approximate time of use | degradation | comment
Celeron 600@900 | 1 year | yes | retired due to failed OC, very high voltage (1.9)
Athlon XP 1700+ | 1.5 years | yes | 1.53 to 1.47 GHz at slightly higher voltage
Athlon XP 2400+ | 1.5 years | yes | 2.18 to 2.12 GHz at slightly higher voltage
Athlon 64 3000+ | 0.5 years | no | 2.7 / 2.5 GHz
Athlon 64 X2 3800+ | 2 years | yes | higher voltage for 2.50 GHz
Core 2 Quad Q6600 | 1.5 years | yes | 3.00 GHz at slightly higher voltage
Radeon X1950Pro | 4 months | yes | failed OC after 3 months, failed stock after another, 24/7 folding@home at ~70°C
Radeon X1950Pro | 3 months | yes (?) | crunched at ~50°C and stopped after some errors, never really checked
Geforce 9800GTX+ | 5 months | no | 50 - 55°C, OCed

That's all for now!
MrS


(1) I know there's some proper English saying for this..
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 5663 - Posted: 16 Jan 2009 | 7:54:56 UTC
Last modified: 16 Jan 2009 | 7:55:25 UTC

Good explanation ...

And since I could not figure out what you were trying to say, I cannot come to the rescue with an appropriate saying ... re: your note (1).

There is one other cause of failure that you did not mention: "latent defects". That is a defect that is present on the chip but not significant enough to cause immediate failure of the component. The US military used to test chips to prove that they met specifications ... then they would get a part number in the 5400 series ... the same logic element manufactured to commercial specification would be labeled with a 7400 part number ... and the testing was usually the only difference ... the testing to qualify the part ...

The problem was that the testing ran the part up to its limits ... and the stress almost always caused the beginning of a defect that would grow over time ... so, paradoxically, the mil-spec parts were less reliable than the commercial equivalents ... I had a commander hit the roof once when he asked me how we were repairing a test bench ... and I told him we were putting in parts I had bought at Radio Shack before deployment ... when he ordered me to stop, I told him I could do that and the tester would be off-line for the rest of the deployment while we shipped the card back to the States, or we could repair the tester as we had on other cards earlier in the cruise ... and we never had a failure of the "less qualified" parts ...

When a new airplane comes off the assembly line, the test pilots fly the beast and confirm the calculated "flight envelope" before regular pilots fly the darn thing ... before we had computers that allowed simulations and calculations of the flight envelopes, many times these were guesstimates, and they were only confirmed in early flights with some attrition of aircraft and pilots ... the P-38 had an interesting defect where, in a steep dive, parts of the control surfaces were locked into position by the air moving across the surfaces ... thus the dive ended with the aircraft and pilot pointing out of the dirt ... a minor hidden, latent, defect ... now called shock compressibility (or just compressibility), and it was solved with dive brakes and moving one of the surfaces up a few inches. (See Richard Bong as the USA's highest-scoring ace, P-38 Lightning, and for contrast Erich Alfred "Bubi" Hartmann, whose record is not likely to be equaled soon ...)

I only give these examples as a contrast in that they may be easier to understand ...

But, I agree with ETA that the "problem" with OC is not directly the speed, it is the heat ...

The quibble section ... :)

The highest failure-rate components are those that have mechanical actuation: fans (also because they are made cheaply to keep the cost down, which means their life is expected to be short) and disk drives (CD/DVD too). ALL OTHER THINGS BEING EQUAL ...

Failures are most common on cold starts because of the effect of "inrush" currents (the article discusses this only in some contexts, but the problem is true for all electrical devices; inside the chips we have transistors, capacitors, and resistors ...)

Which is one of the reasons some of us like to leave our PCs on at all times ... :)

The age problem is the balance between "infant mortality" and the natural death of devices at the end of their normal lifetime, as shown by the "bathtub curve", which in our context is relevant because running components hot pulls the end-of-life portion of the curve to the left ... See "Thermal management of electronic devices and systems" (or, for more, google "electronic failure heat")

For those REALLY dedicated, google "gamma ray electronics failure". Though most articles discuss high-altitude events, where this is a serious problem, it is little known that most of the packaging material for chips emits radioactivity, soft gamma and beta, all of which can impinge on the chips causing a "soft event", and which in the presence of a stressed part can be the straw that broke the camel's back ...

Oh, and my mind is a very cluttered attic ... and this is as focused as I get ...
____________

Profile dataman
Joined: 18 Sep 08
Posts: 36
Credit: 100,352,867
RAC: 0
Message 5673 - Posted: 16 Jan 2009 | 16:04:17 UTC

Thanks ETA ... that was very interesting.
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Message 5690 - Posted: 16 Jan 2009 | 22:23:41 UTC

Hi Paul,

I actually meant to include the latent defects into "2. Catastrophic failures" without actually mentioning them. Your explanation is much better than "some just fail at some point due to some reason".

And an interesting note about transient errors caused by radiation: Intel uses a "hardened" design and I guess all other major players too. I don't know how they do it, but single bit errors due to radiation should not make the chips fail.

Regarding note (1): it means spinning something negative into something positive. Still I have no idea which English saying I'm looking for..

And dataman, thanks for the flowers :)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 5692 - Posted: 16 Jan 2009 | 22:48:53 UTC - in response to Message 5690.

Hi Paul,

I actually meant to include the latent defects into "2. Catastrophic failures" without actually mentioning them. Your explanation is much better than "some just fail at some point due to some reason".

And an interesting note about transient errors caused by radiation: Intel uses a "hardened" design and I guess all other major players too. I don't know how they do it, but single bit errors due to radiation should not make the chips fail.

Regarding note (1): it means spinning something negative into something positive. Still I have no idea which English saying I'm looking for..

And dataman, thanks for the flowers :)

MrS


I have the opposite problem: I can't include things without mentioning them ... which is why lots of my posts tend to run long ...

Hardening can be a combination of technologies, from the design of the structures (so that an impinging ray cannot create enough of a charge change to flip an internal state) to coatings that absorb or negate the ray. But what I was trying to get at is that not only can a ray cause a soft error, it can also create a local voltage "spike" that causes a catastrophic failure due to the presence of a latent defect ... which, sans the event, would have caused the failure in the future due to the normal wear and tear we had been discussing.

But you are correct that I was not attempting to make the point that a bit flipped by a cosmic/gamma-ray soft error will cause a failure ...

There are several; the most common is "turning lemons into lemonade" ... or "If life hands you lemons, make lemonade" ...

Thinking about that, life usually hands me onions and I am not sure that learning how to cry really makes it as an aphorism ... but that is just me ...
____________

Scott Brown
Joined: 21 Oct 08
Posts: 144
Credit: 2,973,555
RAC: 0
Message 5693 - Posted: 17 Jan 2009 | 0:37:38 UTC - in response to Message 5692.

...life usually hands me onions and I am not sure that learning how to cry really makes it as an aphorism ... but that is just me ...


Hopefully, at least sometimes they are sweet Vidalia onions. :)

And thanks to both you and MrS for the excellent discussion of this topic. It gives me something to think about with my 9600GSO (it tends to run constantly in the low 70s Celsius)...



Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 5694 - Posted: 17 Jan 2009 | 2:45:27 UTC - in response to Message 5693.

...life usually hands me onions and I am not sure that learning how to cry really makes it as an aphorism ... but that is just me ...


Hopefully, at least sometimes they are sweet Vidalia onions. :)

And thanks to both you and MrS for the excellent discussion of this topic. It gives me something to think about with my 9600GSO (it tends to run constantly in the low 70s Celsius)...


Except I hate onions ...

All kinds of onions ...

And your temperature, as I recall, is in the nominal zone as we figure these things ... mine is at 78; of course I let the room get warm, so I am sure that drove it up some ...

Making me even happier is Virtual Prairie has just issued some new work!!! :)

And I am on track to have Cosmology at goal on the 25th ... and my Mac Pro is raising ABC on its own (while still doing other projects) nicely, so it looks like I should easily be able to make that goal by mid-to-late February, even with the detour to SIMAP at the end of the month ... which I am going to make a real focus for that one week ...

and new applications promised here ... things are really looking up ...
____________

Profile bloodrain
Joined: 11 Dec 08
Posts: 32
Credit: 748,159
RAC: 0
Message 6086 - Posted: 28 Jan 2009 | 8:24:16 UTC - in response to Message 5694.

One main thing is to watch how hot it gets; that can kill a system by overheating the parts. But really, on this topic: no, it won't happen.

But there is a very, very small chance it could happen. Like 1 in a billion.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Message 6136 - Posted: 28 Jan 2009 | 21:26:45 UTC - in response to Message 6086.

May I kindly redirect your attention to this post? What you're talking about is failure type number (2), which is indeed not our main concern.

MrS
____________
Scanning for our furry friends since Jan 2002

Jeremy
Joined: 15 Feb 09
Posts: 55
Credit: 3,542,733
RAC: 0
Message 9286 - Posted: 4 May 2009 | 2:39:25 UTC - in response to Message 6136.

Actually, number 2 might be more of a problem than you might think. The GT200 is a very hardy chip and can take some serious heat. However, I'll be receiving my THIRD GTX 260 (192) tomorrow, if UPS cooperates. Two have failed on me so far under warranty.

The first failure was unexplainable; it happened while I was away at work. The second, however, I caught. The fan on the video card stopped. Completely. 0 rpm. I noticed artifacts in Crysis and closed out of the game. The GPU was at 115°C and climbing. I monitor all system temps via SpeedFan. When I discovered the video card fan wasn't spinning, I immediately shut the system down. I let it sit for an hour, then restarted it. The fan spun up properly and everything worked, but the video card failed and refused to POST two days later.

I now use SpeedFan's event monitor to automatically watch temps and shut the system down should any of them go higher than what I deem allowable; hopefully this will prevent any future RMA requests. It's not difficult to set up.

The point is, if you're running your system under high load for a sustained period of time, as you do with BOINC or FAH, you really need to keep as close an eye on things as possible. Automated is best, IMHO. If my system had shut down in the middle of my Crysis session I would've been annoyed, but at least my video card wouldn't have cooked itself. I leave my system unattended so often (sleep and work) that it just makes sense to have something in place in case something goes awry while I'm away.
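An automated guard like the SpeedFan event above boils down to a small polling loop. A sketch, where read_temp_c is a hypothetical callable you would wire to your platform's sensor interface, and on_overheat would trigger the actual shutdown:

```python
import time

def monitor(read_temp_c, limit_c=95.0, poll_s=5.0, on_overheat=None):
    """Poll a temperature sensor; fire on_overheat once the limit is exceeded.

    read_temp_c is a placeholder callable returning degrees C, or None on
    sensor failure (which also stops the loop rather than guessing)."""
    while True:
        t = read_temp_c()
        if t is None:
            return None            # sensor gone: bail out
        if t > limit_c:
            if on_overheat is not None:
                on_overheat(t)     # e.g. invoke the OS shutdown command here
            return t
        time.sleep(poll_s)
```

In a real deployment on_overheat would call your platform's shutdown command; the sketch only shows the control flow.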

Jeremy

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9293 - Posted: 4 May 2009 | 10:53:18 UTC

Heat is a common problem for the lifespan of any part, and failures are not, as many people think, seldom or rare; they are actually common.
I had such a problem with a Gigabyte mainboard, which I never got support for; most problems are related to the overheating of one or more parts.
The board I have is cooled by those fanless blocks, but what happened with this board is that the manufacturer did not mount the cooling block correctly, causing the case of the northbridge chip to melt.
The cooling block only had partial contact with the chip's case, but nobody can or will check these parts until failures appear.
I can show you pictures of this if asked. What most people don't know is that many parts on a mainboard are only cooled by the airflow in the case, or are not cooled at all.
Hence the super-high temps on some of these parts; some actually reach 128°C or hotter. I remember a guy who took pictures with a thermal camera, showing the hot spots on most mainboards at very high temperatures.
Another discussion is about hard drives and their temps. We think the lower the better, but I have read documents from Google suggesting that it is actually at moderate temps (between 45 and 65) that they do better and live longer. I also found out myself, when I still worked as an IT person at medium/large companies, that frequently spinning a drive up does more damage than letting it run 24/7. Of course, we are talking about enterprise drives here, and not everybody wants to run their PC 24/7. But I always tell people to leave the machine on when they know they will need it again in a few hours.

Anyway, we could go on forever on this topic; I have read hundreds of reports and tests made by friends working with the same huge systems.
In general, a computer can always fail, especially when operated outside its parameters, such as temperature, frequency and/or voltage.
This can occur by intent (OC, test, benchmark) or by failure of parts
(breakdown of a fan or the cooling system).
And of course, nowadays manufacturers make parts not to last long, but to last just the expected lifespan they want.

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9304 - Posted: 4 May 2009 | 18:48:58 UTC - in response to Message 9293.

Another discussion is about hard drives and their temps. We think the lower the better, but I have read documents from Google suggesting that it is actually at moderate temps (between 45 and 65) that they do better and live longer. I also found out myself, when I still worked as an IT person at medium/large companies, that frequently spinning a drive up does more damage than letting it run 24/7.

You are seeing the effects of two factors: thermal cycling, which leads to expansion and contraction effects that can induce failures, and inrush currents, which cause other failure modes (I talked about this in my part of the failure-mode discussion in the other referenced thread).

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9379 - Posted: 6 May 2009 | 14:13:15 UTC

Yes, and it is nice to see such very handy info, because we are always concerned about our hardware. In fact, sometimes a little bit too much :)

Profile Edboard
Avatar
Send message
Joined: 24 Sep 08
Posts: 72
Credit: 12,410,275
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 9383 - Posted: 6 May 2009 | 17:53:39 UTC
Last modified: 6 May 2009 | 17:56:20 UTC

Two of my GTX280s died crunching 24/7 (GPUGrid/Folding/SETI) with a 16% OC (core clock and shaders only, not memory). They lasted approx. two months each. Since then, I do not OC my GPUs and only crunch about 10 hours/day.

Profile JockMacMad TSBT
Send message
Joined: 26 Jan 09
Posts: 31
Credit: 3,577,572
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwat
Message 9508 - Posted: 9 May 2009 | 9:05:01 UTC

I lost an ATI HD4850x2 at stock clocks due to excessive temperatures.

Since then I have bought an AC unit for the room, which cost less than the card.
____________

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 9511 - Posted: 9 May 2009 | 9:53:33 UTC
Last modified: 9 May 2009 | 9:56:15 UTC

Well, you said it right: excessive temps over longer periods are a disaster, and some cards need better cooling than the manufacturer provides.
So, to be honest, if I buy a card I always try to find one which blows the air out of the case.
And I always look at the cooling on it, and at the reviews, to see if the card runs cooler than its competition,
or whether I can find a better solution to cool it, like watercooling or a better heatsink.
Take my previous card, an Nvidia 6600 GT, which is a notorious hothead (up to 160 C): I tweaked it with a watercooler and got it to 68 C under stress, and believe me, not many are able to get it that low.
So in both your cases the card probably ran too long at full power without enough airflow (cooling), but to be honest it is almost a science in itself to get optimum airflow in your machine.
Nevertheless I always make sure mine does, and of course I always have huge PC cases; even my HTPC has the biggest case I could find and is full of 12" cooling fans :D
And all fan slots have the best fans available (sounding like a little airplane ;)), meaning the best airflow with the least noise.
My main case tops it with a radial fan blowing on my mainboard; in my Stacker case I have my watercooling fans, 3 x 12", on top, then 2 x 12" on the backside blowing out, and on the front another 2 x 12" fans blowing inwards onto the 4 drives.
Then of course there is the single fan on the mainboard chipset, the 2 fans from my power supply and my VC fan, so all in all enough to drive some people crazy ;) (the woman)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 9587 - Posted: 10 May 2009 | 12:15:17 UTC

Jeremy wrote:
Actually, number 2 might be more of a problem than you might think.


Yes and no.. well, it depends.
The "normal" failure rate of graphics cards seems to be higher than I expected, even without BOINC. Your first card seems to be one of them.

What I mean by type 2 is a sudden failure with no apparent reason (overclock at stock voltage is not a proper reason, as I explained above). What you describe for your 2nd card (fan failed) could well be attributed to this type, but I'd tend to assign the mechanism to type 3: it's heat damage, greatly accelerated compared to the normal decay. Admittedly, when I talk about type 3 I have normal temperatures in mind, i.e. 50 - 80 °C.

uBronan,

a heat sink which is not mounted properly is rather similar to a fan suddenly failing. It's obviously bad and has to be avoided, but what I'm talking about is what happens in the absence of such failures, in a system where the cooling system works as expected.

Regarding the hot spots on mainboards: not all of these components are silicon chips. For example, a dumb inductor coil can tolerate a much higher temperature. And power electronics can actually be manufactured with much cruder structures, which are therefore much less prone to damage than the very fine structures of current CPUs and GPUs. So the fact that a component is 125°C hot does not necessarily mean that something is wrong or too bad.

Regarding hdds: sorry, but temperatures up to 65°C are likely going to kill the disk!! Take a look: most are specified up to 60°C. The German c't magazine once showed a photo of an old 10k rpm SCSI disk after its fan failed.. some plastic had melted and the entire hdd had turned into an unshapely something [slightly exaggerated].

I also read about this Google study, and while their conclusion "we see lower failure rates at mid 40°C than at mid 30°C" is right, it is not so clear what this means. The advantage of their study is that they average over many different systems, so they can gather lots of data. The drawback, however, is that they average over many different systems. There's at least one question which cannot be answered easily: are the drives running at lower temperatures mounted in server chassis with cooling.. in critical systems, which put a much higher load on their hdds? There could be more such factors which influence the result and persuade us to draw wrong conclusions when we ignore the heterogeneous landscape of Google's server farm.

What I think is happening: the *old* rule of "lower temp is better" still applies, but in the mid-40°C range we are relatively safe from thermally induced hdd failures. Thus other factors start to dominate the failure rates, factors which may coincidentally seem linked to hdd temperature but which may actually be linked to hdd type / class / usage patterns.

But I always tell people to leave the machine on when they know they will need it again in a few hours.


But don't forget that nowadays all hdds have fluid-dynamic bearings (I imagine it is quite difficult to do permanent damage to a fluid), and that PC component costs have gone down whereas power costs have gone up, as has PC power consumption. However, thermal cycling is of course still a factor.

Take my previous card, an Nvidia 6600 GT, which is a notorious hothead (up to 160 C): I tweaked it with a watercooler and got it to 68 C under stress, and believe me, not many are able to get it that low.


Well, mine ran at ~50°C idle and ~70°C load with a silent "NV Silencer" style cooler. And the emergency shutdown is set by the NV driver somewhere around 120-130°C. Maybe you saw 160 F mentioned somewhere?

MrS
____________
Scanning for our furry friends since Jan 2002

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 9591 - Posted: 10 May 2009 | 15:06:15 UTC
Last modified: 10 May 2009 | 15:07:38 UTC

The card, which is a real hothead, was running at temps exceeding 110 C
(not F) with the stock fan from the manufacturer, just gaming .. hell, I must not think of that card doing a DC project.
For your information it is no longer in use, but it still runs fine even after 5 years of use.
And no, the manufacturer sent me an answer to my question about the card running at 110 C while gaming, saying it could take much higher temps without really failing; if they lied about the temps, then of course I can't help that.
If they made up a story to stop me sending emails about these temps, then I can't help it.
And yes, you are partially right about drive temps, but then again I was not talking about a 10k or 15k drive but a plain 7k drive.
Read the Google documents about drive temps in large clusters and the failure rates. It is a proven fact, and they also run plain SATA 7.2k drives.
Since the high-speed drives fail much faster, and Google needs storage rather than speed, they run SATA/SAS instead of iSCSI or other solutions.
My 2 Seagate boot drives have been running constantly at 65 C since I bought them about 2 years ago; they have never been cooler. And yes, there is a huge fan blowing cold air on them, which doesn't cool them down much.
The other drives are Samsungs, which run much cooler (37 C) when not together with the Seagates. But since they share the same drive cage they are at 43 now.
Hence the Samsungs help to cool the Seagates.

Andrew
Send message
Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 9596 - Posted: 10 May 2009 | 16:58:02 UTC - in response to Message 9508.

@ JockMacMad TSBT

Are you aware that using AC in the same room as your crunching machines may mean you are paying several times over for that power? I'm not entirely sure about my figures, but basically, if you're dumping, say, 100 W into an air-conditioned room, then I believe the AC unit will require a significant fraction of that power again to remove the heat (since the heat is being moved to a hotter place, as is usual).

Perhaps someone else can provide numbers - I live in the UK where we sadly have no need for AC!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 9601 - Posted: 10 May 2009 | 18:50:55 UTC - in response to Message 9596.

Andrew,
you're right, an AC increases the power bill considerably. For supercomputers and clusters they usually factor in a factor of 2 for the cooling. So in your example a 100W PC in a room with AC will cost 200W in the end.
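The extra cost of cooling can be sketched with a quick back-of-the-envelope calculation. This is my own illustration, not from the thread: the `cop` value (coefficient of performance, watts of heat moved per watt of electricity) is an assumed typical figure for a room AC unit, while the factor-of-2 rule mentioned above corresponds to treating the cooling overhead as equal to the IT load itself.

```python
# Rough estimate of total wall power when a PC's waste heat must be
# removed by an air conditioner.
# Assumption (mine, not from the thread): the AC has a coefficient of
# performance (COP) around 3, i.e. it moves ~3 W of heat per 1 W of
# electricity it consumes.

def total_power(pc_watts, cop=3.0):
    """PC power plus the AC power needed to pump the PC's heat outside."""
    return pc_watts * (1.0 + 1.0 / cop)

print(total_power(100))        # ~133 W with a reasonably efficient AC
print(total_power(100, 1.0))   # 200 W -- the factor-of-2 rule used for clusters
```

The factor of 2 quoted for supercomputers is thus a conservative figure: it effectively assumes the whole cooling chain (chillers, pumps, air handling) consumes as much as the computers themselves.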

uBronan,
OK, I actually read the paper again.. damn, what am I doing here? Isn't it supposed to be Sunday?! :p
I think we were both partly wrong. I remembered that they gathered stats from all their drives, which would be mainly desktop drives (IDE, SATA) mixed with some enterprise-class drives. That could, for example, have led to a situation where the cheap desktop drives run around 35°C and fail more often than the enterprise drives, which run hotter due to their higher spindle speeds.

However, this is not the case: they include only 5.4k and 7.2k rpm desktop drives, IDE and SATA. The very important passage:

Overall our experiments confirm previously reported temperature effects only for the high end of our temperature range(*) and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates.
...
We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.


I think that's quite what I've been saying :)

(*) Note that the high end of the temperature spectrum starts at 45°C for them and ends at 50°C. There the error rate rises, but the data quickly becomes noisy due to low statistics (large error bars).

Regarding that 6600GT.. well, I can't accuse them of lying without further knowledge. They may very well have had some reason to state that you could have seen even higher temps without immediate chip failure. I think those chips were produced on the 110 nm node, which means much larger and more robust structures, i.e. if you move one atom it causes less of an effect.

Here's some nice information: most 6600GTs running in the 60 - 80°C range under load and a statement that 127°C is the limit where the NV driver does the emergency shutdown.

Do you know what? "You could have seen higher temps" means "emergency shut down happens later". Which is not lying, but totally different from "110°C is fine" :D

MrS
____________
Scanning for our furry friends since Jan 2002

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 9660 - Posted: 11 May 2009 | 23:34:47 UTC

Well, lol, yeah, I have been reading it several times as well.
In my old job I had a cluster of 2100 drives, divided over many cabinets, running the huge databases we had; those SCSI drives were also kept at temps near 15 C.
In fact I also ran some tests for the company on workstations, to see what was the best temperature to keep them at, but yes, those are all enterprise drives, which are sturdier than normal desktop drives.
And I must add that the newer drives seem to be much weaker than the older drives, probably related to the much finer surface of the platters.
Except the new glass platters, which seem to be able to operate at higher temps.

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 9704 - Posted: 13 May 2009 | 11:59:14 UTC
Last modified: 13 May 2009 | 12:00:52 UTC

Sorry ET, after I started up the old beast I saw that I gave wrong information about the old card: it is a 6800 GT Nvidia from Gigabyte.
The 6600 came out a bit later.

Thamir Ghaslan
Send message
Joined: 26 Aug 08
Posts: 55
Credit: 1,475,857
RAC: 0
Level
Ala
Scientific publications
watwatwat
Message 10318 - Posted: 30 May 2009 | 7:22:01 UTC - in response to Message 5638.

Does anyone have any first hand experience with failures related to 24/7 crunching? Overclocked or stock?

I ask because a posting on another forum indicated failures from stress of crunching 24/7. There was no specific information given, so I don't really think that it was a valid statement.

Anyone?


I bought a GTX 280 in August 2008 and burned it out in March 2009.

So that's 6 months of stock 24/7 crunching on GPUGrid. The fan was set to automatic; I don't know if it would have made a difference if I had set it to a higher manual fan speed. I remember the temperatures were below the tolerable limits.

The relevant thread is here:

http://www.gpugrid.net/forum_thread.php?id=829&nowrap=true#7338

So yes, GPU failures are well and truly real; I've seen enough posts from other GPU owners. However, it is very rare to hear of CPU failures. I guess there are safeguards on CPUs that are more advanced than on GPUs.

Profile Bigred
Send message
Joined: 24 Nov 08
Posts: 10
Credit: 25,447,456
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 10319 - Posted: 30 May 2009 | 7:50:56 UTC - in response to Message 10318.

So far, I've had 1 GTX260 out of 10 fail after 4 months of crunching. The fan bearings were totally worn out. It took 5 weeks to get its replacement. As always, my stuff runs at stock speeds.
____________

pharrg
Send message
Joined: 12 Jan 09
Posts: 36
Credit: 1,075,543
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 10381 - Posted: 2 Jun 2009 | 15:57:53 UTC
Last modified: 2 Jun 2009 | 16:00:07 UTC

I use XFX brand cards since they give a lifetime warranty if you register them. If mine burns out, I just do a replacement with them, though I've yet to have one die anyway. Keeping the video card cool is the other major factor. Just like your CPU, the cooler you keep your GPU, the less likely you are to see failures or errors.

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 10433 - Posted: 5 Jun 2009 | 20:29:48 UTC

Looks like my 9600 GT is showing signs of breaking down as well.
I get random errors and saw some weird pixels when booting.
I even tried folding@home, which ran for a couple of units, and none finished; all errored out.
So I guess video cards die from DC projects.

popandbob
Send message
Joined: 18 Jul 07
Posts: 65
Credit: 10,972,900
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10436 - Posted: 6 Jun 2009 | 7:36:12 UTC - in response to Message 10433.

So I guess video cards die from DC projects.


Saying that will scare others away.
They don't die from doing DC; they would have failed anyway.
If there is a problem with a card it will show up faster if the card is stressed harder, yes,
but to claim that projects like GPUGrid kill cards is wrong.

The best safeguard is to buy from good companies who will help solve problems. I've only dealt with EVGA and they've been good to me, but I can't comment on other places.

Bob

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10520 - Posted: 12 Jun 2009 | 22:51:54 UTC - in response to Message 10436.
Last modified: 12 Jun 2009 | 22:52:45 UTC

They don't die from doing DC; they would have failed anyway.
If there is a problem with a card it will show up faster if the card is stressed harder, yes,
but to claim that projects like GPUGrid kill cards is wrong.


I don't think it's that simple.

Running under load instead of idle will typically increase GPU temps by 20 - 30°C, which means a lifetime reduction by a factor of 4 to 8. So if a card fails after half a year of DC, we could have expected it to last 2 to 4 years otherwise.

And if it had gone into a lower-voltage 2D mode, degradation would have been reduced even further without DC. I can't give precise numbers, but I'd go as far as saying "at a significantly reduced voltage, degradation almost doesn't matter any more". So you can kill a card in 6 months which might otherwise have lasted 10 years, most of that time spent in 2D mode.

So, yes, DC only accelerated the failure. However, it turned the card from "quasi-infinite lifetime" into "short lifetime", which is for all practical purposes equivalent to killing it. I sincerely think this is what we have to admit in order to be honest, to ourselves and to our fellow crunchers.
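The factor of 4 to 8 for a 20 - 30°C rise matches the common rule of thumb that semiconductor lifetime roughly halves for every 10°C of temperature increase (an Arrhenius-type approximation). A minimal sketch of that rule, where the 10°C doubling step is my assumed constant and not a measured value for any particular GPU:

```python
# Rule-of-thumb model: expected lifetime halves for every `doubling_step`
# degrees Celsius of temperature increase (Arrhenius-type approximation).
# The 10 degC step is an assumption, not a datasheet figure.

def lifetime_reduction(delta_t_celsius, doubling_step=10.0):
    """Factor by which expected lifetime shrinks for a given temperature rise."""
    return 2.0 ** (delta_t_celsius / doubling_step)

print(lifetime_reduction(20))  # 4.0 -- lower bound of the load-vs-idle range
print(lifetime_reduction(30))  # 8.0 -- upper bound of the load-vs-idle range
```

By the same arithmetic, a card that dies after 6 months at load would have been expected to last 2 to 4 years at idle temperatures, which is the figure quoted above.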

Why are GPUs seeing higher failure rates than CPUs? Easy: CPUs must be specified to withstand 24/7 load. Up to now the GPU manufacturers didn't have to deal with such loads.. after a few days even the hardest gamer needs some sleep. Due to these relaxed load conditions they specify higher temperatures than the CPU guys. Furthermore, the CPU guys have more space for efficient cooling solutions, so their chips don't have to run as hot. GPUs have long since hit the power wall, where the maximum clock speed is actually determined by the noise the user can stand with the best cooling solution the manufacturer can fit into 2 slots.

As a long term solution the manufacturers would have to offer special 24/7-versions of their cards: slightly lower clocks, slightly lower voltages, maybe a better cooling solution and the fan setting biased towards cooling rather than noise. Such cards could be used for 24/7 crunching.. but who would buy them? More expensive, slower and likely louder!

MrS
____________
Scanning for our furry friends since Jan 2002

Daniel Neely
Send message
Joined: 21 Feb 09
Posts: 5
Credit: 7,632,261
RAC: 0
Level
Ser
Scientific publications
watwatwatwatwatwatwatwatwat
Message 10521 - Posted: 12 Jun 2009 | 23:38:27 UTC - in response to Message 10520.

As a long term solution the manufacturers would have to offer special 24/7-versions of their cards: slightly lower clocks, slightly lower voltages, maybe a better cooling solution and the fan setting biased towards cooling rather than noise. Such cards could be used for 24/7 crunching.. but who would buy them? More expensive, slower and likely louder!



Isn't that called the nVidia Tesla?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10522 - Posted: 13 Jun 2009 | 1:54:21 UTC - in response to Message 10521.

Almost. The Teslas cost $1000 to $2000 more, whereas I'm talking about $10 to $20 more. I suppose what makes the Teslas really expensive is the extensive testing and "guaranteed" functionality (if there is such a thing for chips at all). That wouldn't necessarily be needed for "heavy duty GP-GPUs".

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10759 - Posted: 21 Jun 2009 | 17:17:24 UTC - in response to Message 10522.

Radiation
GPUs, like everything else, suffer continuous radiation bombardment: ionising radiation, neutrons, protons, muons and pions, and even cosmic radiation. These all cause random system errors. Most impacts do not cause permanent damage, but sometimes RAM has to be replaced or a BIOS reset. They might cause the most damage to hard disk drives, and it is why your CDs of precious memories won't be readable in 15 or 20 years! You can't hide from a particle that can go through 10 feet of lead. It's also why aircraft are grounded when there is an increase in solar flares. So the next time your system restarts you have something to blame: it could always be solar radiation!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10765 - Posted: 21 Jun 2009 | 20:18:09 UTC - in response to Message 10759.
Last modified: 21 Jun 2009 | 20:18:33 UTC

High-energy particles cause transient errors by impact ionization. They don't generally cause permanent errors: the mass/energy difference between these particles and the atoms of your chip is too large for significant momentum transfer. Therefore they can (temporarily) kick electrons out of their bonds, but they can hardly move atoms.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10804 - Posted: 23 Jun 2009 | 23:01:43 UTC - in response to Message 10765.

I agree that most (radiation) impacts do not cause permanent damage: a restart and you are up and running again.
Theoretically speaking, however (and only in very rare circumstances), neutrons can cause permanent damage (and yes, they can move atoms, and sometimes not just one). Unfortunately, radiation does not have to move atoms to cause permanent damage, just the odd atomic bond, causing material degradation.
Humans have a protein called telomerase that repairs such damage when it occurs in DNA, but computers don't have an equivalent just yet.
We might all live much longer if the telomerase gene were not stuck at the end of a chromosome (which shrinks with age)! Cancers would probably not be such a problem either. It's such a pity nobody is studying this Cure for All Cancers Solution.

Jonathan Figdor
Send message
Joined: 8 Sep 08
Posts: 14
Credit: 425,295,955
RAC: 85,073
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10811 - Posted: 24 Jun 2009 | 6:37:16 UTC - in response to Message 10804.

Well, get to work then.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10833 - Posted: 24 Jun 2009 | 21:16:43 UTC - in response to Message 10804.

Theoretically speaking however (and only in very rare circumstances) Neutrons can cause Permanent damage (and yes they can move atoms, and sometimes not just one). Unfortunately however, radiation does not have to move atoms to cause permanent damage, just the odd atomic bond; causing material degradation.


You're right, neutrons can move atoms. Good that there aren't too many of them in the cosmic ray mix :)

And you wouldn't have to reboot on every transient error: the fault might not lead to any disturbing consequences. Furthermore, I heard a talk about 3 years ago where the professor said Intel's core logic is entirely "radiation hardened" to the point where they can detect 2-bit errors and correct 1-bit errors. Don't quote me on these numbers, though.. it's been quite some time.

Interesting that you mention breaking bonds. This is actually what causes the slow degradation of chips over time; it's just not mainly caused by cosmic radiation. The defects at the Si-SiO2 interface (or now Si-HfOx) are passivated by hydrogen atoms. Over time, the occasional highly energetic electron (from the Boltzmann tail, or from the substrate) kicks these light hydrogen atoms out and a dangling bond is created. This is a "trap state" for charge carriers. Once such a trap contains charge, the transistor's operation is influenced (the threshold voltage shifts), which can only be bad in either direction.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10844 - Posted: 25 Jun 2009 | 1:16:46 UTC - in response to Message 10833.

I was not implying that neutrons were part of the cosmic radiation! Most sources are quite terrestrial. They are also very rare, and are usually produced by other rare particle bombardments. The biggest single radiation concern for humans is Radon, as it is in the stone of many buildings, work benches, ornaments and the rocks beneath us. I’m sure the ionising radiation that directly or indirectly results from Radon causes computer problems too.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10861 - Posted: 25 Jun 2009 | 21:24:48 UTC - in response to Message 10844.

My bad, I read too much into that! Regarding computer problems due to radioactivity: fortunately we don't need to worry about the alphas here, as they cannot even penetrate paper. Betas and gammas, on the other hand, can cause transient errors by ionization if the design is not radiation hardened or there are too many of them (though in that case you likely wouldn't care much, as you'd be sitting in the middle of a fission reactor :D). And at a few MeV they can probably create dangling bonds and thus lead to component decay.

MrS
____________
Scanning for our furry friends since Jan 2002

Scott Brown
Send message
Joined: 21 Oct 08
Posts: 144
Credit: 2,973,555
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwatwat
Message 10989 - Posted: 6 Jul 2009 | 16:50:02 UTC - in response to Message 10804.


Humans have a protein called telomerase that repairs such damage when it occurs in DNA, but Computers don’t have an equivalent just yet.
We might all live for much longer if the Telomerase gene was not stuck at the end of a Chromosome (which shrinks with age)! Cancers would probably not be such the problem either. It’s such a pity nobody is studying this Cure for All Cancers Solution.


The "shrinking with age" regarding telomeres probably has little effect on how long we currently live, given upper population life expectancies of around 85 years for Japanese women. Essentially, this repair process and shrinking is related to the "Hayflick limit" in cell division, which places a finite limit on natural human lifespan at around 250 years (when "shrinking" results in lengths too short for division to occur properly). Research on telomerase (and related issues) has been ongoing for three or more decades, including some work on cell division in some cancers.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11049 - Posted: 8 Jul 2009 | 23:15:29 UTC - in response to Message 10989.

Telomerase repairs damaged DNA, but the telomerase gene resides close to the ends of chromosomes (the end region is called a telomere). So my point was that when chromosomes shrink, overall telomerase production is reduced in the body, and non-existent in some cells. Without telomerase, DNA stays damaged, so there is a greater risk of cancer and other illnesses.

Just because someone is researching something does not mean they are looking for a cure. They might just be looking!

Anyway back to the topic – Video Card Longevity

It's a good idea to use a fine-mesh fan filter on your system's inlet fans. To clean the system, all you have to do is point the vacuum cleaner at it for about 2 seconds every other week. But filters don't just keep the dust off your components. I looked at a system about 2 years ago and was told it had made a loud pop/bang noise and stopped suddenly. There was a bluebottle fly lying at the bottom of the case.

Scott Brown
Send message
Joined: 21 Oct 08
Posts: 144
Credit: 2,973,555
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwatwat
Message 11204 - Posted: 20 Jul 2009 | 14:26:17 UTC - in response to Message 11049.

...Without Telomerase DNA stays damaged, so there is a greater risk of Cancer and other illnesses.

Just because someone is researching something does not mean they are looking for a cure. They might just be looking!


Just an FYI...see here and here for example.


Anyway back to the topic – Video Card Longevity

It’s a good idea to use a fine mesh fan filter...But they don’t just keep the dust off your components. I looked at a system about 2 years ago and was told it made a loud pop/bang noise and stopped suddenly. There was a bluebottle fly lying at the bottom of the case.




Dust is definitely not the only problem. I saw a system about 5 years ago that had the same "pop/bang" noise problem...opened the case only to find a nice colony of ants (some rather toasted)! I doubt even the fine mesh would have kept them out.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11242 - Posted: 22 Jul 2009 | 10:20:16 UTC - in response to Message 11204.

Both these research teams took the inverted-smart approach: if something is essential for life, they want to kill it. They know they will take out a few cancer cells on the way, be able to publish in a few obscure journals, and further their careers. If they get really lucky, a drug company will develop some sort of anti-telomerase to slowly kill people with, and they will get a bit of money out of it. Drug companies don't do cures! Unfortunately this sort of research undermines science and interferes with the work of decent scientists who are really trying to do something positive.

Betting Slip
Send message
Joined: 5 Jan 09
Posts: 589
Credit: 2,039,762,925
RAC: 1,511,935
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11244 - Posted: 22 Jul 2009 | 11:01:18 UTC - in response to Message 11242.
Last modified: 22 Jul 2009 | 11:01:57 UTC

Hear, Hear.

It's the cholesterol argument.

Your quack will tell you your cholesterol is too high and give you drugs to reduce it: statins (Lipitor etc.).

What your quack never tells you is that most (up to 90%) of the cholesterol in your body is MADE by your own liver. From that you could assume your liver is trying to kill you by making an excessive amount of cholesterol. Actually, cholesterol is needed by every cell in your body for life, and if your liver is making more of it, it's doing it for a reason.

Atherosclerosis (hardening of the arteries) is caused by plaques made of cholesterol (soft) and calcium (hard), so the drug companies came up with 'let's make a drug which reduces the liver's ability to produce this terrible substance', and they came up with statins (billions of $$$ are spent on statins every year), despite the fact that Lipitor is known to cause memory loss and is in the dock for causing cancers.

So why does your liver begin producing excessive cholesterol? To repair you, that's why. Your arteries get damaged in use, and your body would normally repair them using substances like collagen; however, if you lack sufficient quantities of collagen, your body uses... YES, cholesterol, which acts like a sticking plaster the body places over the damaged area in your arteries. Your liver is NOT trying to kill you; it's trying to save your LIFE.
____________

Scott Brown
Send message
Joined: 21 Oct 08
Posts: 144
Credit: 2,973,555
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwatwat
Message 11254 - Posted: 22 Jul 2009 | 14:10:36 UTC - in response to Message 11242.

Both these research teams took the inverted smart approach. If something is essential for life, they want to kill it.


Neither of these were research teams doing work on telomerase and cancer. Both were review articles (from 1996 and 2001) demonstrating that at least therapeutic research in this area has been going on for quite some time, and were provided by me to counter your statement that "It’s such a pity nobody is studying this Cure for All Cancers Solution".

...publish in a few obscure journals...


Though I probably wouldn't really call "Scientific American" a journal (the first review piece), it is hardly obscure. "Human Molecular Genetics" (the second review piece) is a prominent journal in the area.

Unfortunately this sort of research undermines science and interferes with the work of decent scientists that are really trying to do something positive.


I am really at a loss with this kind of statement. Are you really suggesting that the U.S. and Japanese researchers from the second article are not decent scientists?


Anyway, this has gone way off topic, so I apologize for Hijacking the thread.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11265 - Posted: 23 Jul 2009 | 1:34:56 UTC - in response to Message 11254.

For Scott Brown only, everyone else skip to the last paragraph!

Quote from your most recent abstract, 2001, link in your post.
“By reverse transcription, the telomerase RNP maintains telomere length stability in almost all cancer cells”.
Why, however, did they work that out?
Perhaps it’s because telomerase maintains telomere length and stability in ALL cells – yeah, that’s the cells without cancer too! Without telomerase you would age very rapidly, your DNA would fall off the ends of your telomeres, the cells would stop working and die. There are still no magic bullets. You can’t turn it off in one cell without turning it off in another. So, fundamentally, researching how to stop telomerase working won’t find a cure for cancer. They might be able to squeeze a drug out of it that manipulates telomerase, and make some money when someone spends an extra few weeks at death's door, but that’s about it. Just because it’s a Professor arguing doesn’t mean the black crow is white. The skewed reasoning behind their research was exposed and compared to the drug-manufacturing industry's Cholesterol con. I expect they were playing the game; wanted some new shiny microscopes, so they drew an improbable link to a cancer treatment to get the drug companies interested in their research, and who knows, maybe a few post-docs to do some of that tedious teaching.
Scientific American is a sensationalistic rag. Human Molecular Genetics, well OK if that’s your thing, I don’t read it, it’s not exactly Nature or Cell. Perhaps some other researchers are trying to aid telomerase functionality; looking for cancer prevention by trying to find out what is interfering with telomerase to stop it repairing the DNA correctly in the first place (it’s DNA damage that results in cancer) - but none of this is relevant to this thread, unless they are using CUDA or at least some sort of processor modelling, and you never mentioned that they are. There was no need for you to draw the conversation away from the theme. So again, and in a less subtle attempt to get back to the subject, my point was that GPUs don’t have telomerase, they don’t repair themselves, so their life expectancy is more limited!

Is anyone working on a processor (CPU or GPU) that can perform a self diagnostic test and do an instruction set work around, like a Bios patch? I’m sure space agencies and aircraft manufacturers would be very interested.
Anyone want to pick up on that (and not Cell Biology)?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11394 - Posted: 27 Jul 2009 | 19:35:07 UTC - in response to Message 11265.

Is anyone working on a processor (CPU or GPU) that can perform a self diagnostic test and do an instruction set work around, like a Bios patch? I’m sure space agencies and aircraft manufacturers would be very interested.
Anyone want to pick up on that (and not Cell Biology)?


There are transient and permanent errors. Transient ones are errors which happen due to whatever reason (e.g. ionizing radiation) and disappear shortly afterwards. As I stated before, I believe Intel's designs are radiation hardened to the point where they can detect 2-bit transient errors and correct 1-bit errors, in both the core and the cache. For regular use this is quite good already.
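The "correct 1 bit, detect 2 bits" behaviour described above is classic SECDED (single-error-correct, double-error-detect) coding. As a rough illustration of how a parity syndrome pinpoints and repairs a flipped bit, here is a toy Hamming(7,4)-plus-overall-parity sketch in Python - this is just the textbook scheme, not Intel's actual circuit:

```python
# Toy SECDED (single-error-correct, double-error-detect) demo using a
# Hamming(7,4) code plus one overall parity bit. Illustrative only.

def encode(data4):
    """Encode 4 data bits (list of 0/1) into 8 code bits."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4                    # parity over code positions 3,5,7
    p2 = d1 ^ d3 ^ d4                    # parity over code positions 3,6,7
    p3 = d2 ^ d3 ^ d4                    # parity over code positions 5,6,7
    code = [p1, p2, d1, p3, d2, d3, d4]  # positions 1..7
    overall = 0
    for b in code:
        overall ^= b                     # overall parity enables 2-bit detect
    return code + [overall]

def decode(code8):
    """Return (data4, status); status is 'ok', 'corrected' or 'double'."""
    code = code8[:7]
    overall = 0
    for b in code8:
        overall ^= b
    # Recompute the three Hamming checks; the syndrome is the 1-based
    # position of a single flipped bit, or 0 if all checks pass.
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome == 0 and overall == 0:
        status = 'ok'
    elif overall == 1:
        # Overall parity failed -> single-bit error, correctable.
        if syndrome:
            code[syndrome - 1] ^= 1
        status = 'corrected'
    else:
        # Syndrome nonzero but overall parity holds -> two bits flipped.
        status = 'double'
    return [code[2], code[4], code[5], code[6]], status
```

Flipping any one of the 8 stored bits still yields the original 4 data bits; flipping two is flagged as uncorrectable, which is exactly the distinction between a survivable transient error and one that must abort the computation.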

The permanent errors are more challenging. If they appear in the cache you can disable the affected cache line. The fat CPUs (IBM Power, Itanium, Sparc) can certainly do this, whereas for desktop chips I think it's a one-time action - before the chip leaves the factory.

Permanent errors within the logic parts of the chip are currently unrepairable. One could think about disabling certain blocks after failures, but there's not much redundancy in CPUs, so you can't take much away and still have them work. It's different for GPUs: disabling individual shader clusters should be possible via software / BIOS, maybe requiring little tweaks.

Another option is to use redundant hardware from the outset. This is fine for safety-critical markets (space, military, airplanes, cars etc.), but wouldn't work in the consumer sector. Who'd buy a dual core for the price of a quad, just so he can still have 2 working cores even if 2 of them fail? We'd want to go 4-3-2 instead.

An interesting option are FPGAs, reconfigurable logic. With this stuff you could build chips which can adapt to the situation and which could repair themselves. The problem is that you need 10 times the transistors and you can only run the design at about 1/10th the frequency. To put this into perspective: with 130 nm tech you could build a regular Athlon XP at 2 GHz. Or you could build a Pentium 1 at 200 MHz, something already available at the 350 nm node. It's a very interesting research area, but no option for the consumer market.

Otherwise.. IBM is researching such stuff, but I don't know how far they've got by now. And you can be sure Intel's in the boat as well ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Ross*
Send message
Joined: 6 May 09
Posts: 34
Credit: 442,860,201
RAC: 0
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11461 - Posted: 29 Jul 2009 | 10:58:02 UTC - in response to Message 11394.

A question for you:
I have an in-progress WU, 19-KASHIF_HIVPR_dim_ba4-28-100-RND7953_0, running but not in my tasks. How do I get rid of it?
It is putting a strain on my 295 to the extent that I have had to go back to 1 WU at a time. Temperatures have been going over 65 C.
Thanks
Ross
____________

Alain Maes
Send message
Joined: 8 Sep 08
Posts: 62
Credit: 849,080,684
RAC: 62,018
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11465 - Posted: 29 Jul 2009 | 11:40:36 UTC - in response to Message 11461.
Last modified: 29 Jul 2009 | 11:43:21 UTC

Hi Ross

actually the WU you mentioned is in your list, but way down.

In your task list select the "Show: in progress" instead of "Show: all", just above the top row, and you will see it.

kind regards

Alain

Edit - and BTW 65 C is by no means (too) hot, but well within limits. My GTX260 runs at 77 C, which is still cool.

Ross*
Send message
Joined: 6 May 09
Posts: 34
Credit: 442,860,201
RAC: 0
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11466 - Posted: 29 Jul 2009 | 11:56:04 UTC - in response to Message 11465.

Hi
That WU is not showing up in my task list. It and others were the cause of all the problems early this week. It has a July 30th expiry date on it.
Normally I can crunch a WU in 7.5 hrs but now it takes 9 hrs.
As it does not show up in my tasks, how do I abort or kill it?
Thanks
Ross
My 295 freaks out at over 73 C; the SmartDoctor alarm goes off, new fan etc.
I'm using good airflow.
Ross
____________

Alain Maes
Send message
Joined: 8 Sep 08
Posts: 62
Credit: 849,080,684
RAC: 62,018
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11470 - Posted: 29 Jul 2009 | 13:21:02 UTC - in response to Message 11466.

OK, let us try to get this right now.
Looking again in more detail at that WU, it indeed is cancelled and crunching it further serves no purpose and yields no credit.
So, are you saying that it is still on your machine?
In that case, select the WU in your BOINC task list and abort it. Not fun, I know, but part of the game I am afraid.

Hope this solves your issue now.

Kind regards

Alain.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11479 - Posted: 29 Jul 2009 | 18:45:21 UTC

Let's try not to turn this sticky thread on "Video Card Longevity" into a "I need help with WU xyz" thread.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11567 - Posted: 1 Aug 2009 | 13:49:16 UTC - in response to Message 11394.

Thanks for the on-topic reply!


Permanent errors within the logic parts of the chip are currently unrepairable. One could think about disabling certain blocks after failures, but there's not much redundancy in cpus, so you can't take much away so that they still work. It's different for GPUs: disabling individual shader clusters should be possible by software / bios, maybe requiring little tweaks.

MrS


I would not like to lose 2 cores either (in some sort of mirror failover solution), but I think it might be possible for the consumer market to have a dead core work around - AMD do this in the factory, making their quad cores triple cores, or dual cores, when they are not quite up to scratch. We know that their approach is not quite a permanent one; people have been able to re-enable the cores on some motherboards. So whatever AMD did could in theory be used subsequent to shipping when a core fails.

For business this could be a great advantage. From experience, replacing a failed system can be a logistical nightmare, particularly for small businesses. Usually lost hours = lost income. Losses would be reduced if a CPU replacement could be planned and scheduled.
When 6 and 8 cores become more commonplace for CPUs, the need to replace the CPU might not actually be so urgent, and the CPU would still hold some value; a CPU with 5 working cores is better than a similar quad-core CPU with all 4 cores working!

I was also thinking that if you could set/reduce the clock speeds of cores independently it could offer some sort of fallback advantage. For example, if one of my Phenom II 940 cores struggled for reliability at its native 3GHz, and I could reduce it to 1800MHz, or even 800MHz – just by setting its multiplier separately – it would be better than having to underclock all 4 cores, or immediately having to replace the CPU.
I like the idea of a software work around / solution for erroneous shaders.

NVidia would do us all a big favour if they developed a proper diagnostic utility, never mind the work around!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11638 - Posted: 3 Aug 2009 | 20:34:20 UTC - in response to Message 11567.
Last modified: 3 Aug 2009 | 20:35:04 UTC

Hi,

I would not like to lose 2 cores either (in some sort of mirror failover solution), but I think it might be possible for the consumer market to have a dead core work around - AMD do this in the factory, making their quad cores triple cores, or dual cores, when they are not quite up to scratch. We know that their approach is not quite a permanent one; people have been able to re-enable the cores on some motherboards. So whatever AMD did could in theory be used subsequent to shipping when a core fails.


Testing at the factory is done with external probe stations prior to packaging (so as not to waste money on defective chips). This cannot be repeated directly at home ;) (BTW, these tests are expensive although each one only requires several seconds. I suppose the cost is mainly due to the time it takes that expensive device to measure the chip.. which wouldn't matter for us.)
Therefore such a test would have to be software based. I see at least 2 major problems with that:

1. Whatever you put into the chip, you have to test it. Such software could reveal the chip architecture completely, just due to the way it's doing the tests. Software can be hacked and / or reverse engineered, and that's something no chip maker would want to risk. It would open up the door for all sorts of things: full or partial copies, bad press due to discovered design errors, software deliberately targeted to be slow on your hardware (hint: compiler).

2. You'd be executing code on your CPU to test your CPU. How could you know the results are reliable? It would be a shame to get the message "3 of 4 cores defective" due to a minor fault somewhere else. Possible solution: dedicate some specialized logic with self-diagnostic functions and error checking for such tests.

For business this could be a great advantage.


That's why the "big iron" servers have RAS features, hot swap of almost everything and such :)

I like the idea of a software work around / solution for erroneous shaders.
NVidia would do us all a big favour if they developed a proper diagnostic utility, never mind the work around!


Yes, that would be very nice. However, seeing how their software struggles with driver bugs, I'm not very confident anything like that is going to happen anytime soon. The problem of "revealing the architecture" would likely be less severe in this case, as communication with the GPU is done by the driver anyway. If such a tool were released I'd imagine them to be careful, i.e. "If you get errors there's a problem [not necessarily caused by defective hardware] and you may get wrong results under CUDA. But we don't know your exact code and therefore we cannot guarantee you that there is not a hardware error just because we didn't find any."

I was also thinking that if you could set/reduce the clock speeds of cores independently it could offer some sort of fallback advantage. For example, if one of my Phenom II 940 cores struggled for reliability at it’s native 3GHz, and I could reduce it to 1800MHz, or even 800MHz – just by setting it’s multiplier separately – it would be better than having to underclock all 4 cores, or immediately having to replace the CPU.


Let's take this one step further: the clock speed of chips is limited by the slowest parts, or more exactly by the longest paths signals must travel within one clock cycle. If they arrive too late, an error is likely produced. It's really tough to guess what the slowest paths through all your hundreds of millions of transistors will be, given the vast number of possible instruction combinations, states, error handling, interrupts etc. But the manufacturers do have some idea.

So why not design a chip with some test circuitry with deliberately long signal run times and sophisticated error detection, somewhere near the known hot spots. Now you could lower the operating voltage just to the point where you're starting to see errors (and increase it again just above the threshold). That would reduce average power consumption a lot and would help to choose proper turbo modes for i7-like designs. It wouldn't help against permanent errors, but in the case of your 940 the BIOS could have raised the voltage of that core a little (within the safety margin).
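The control loop proposed above - step the voltage down until the canary circuit reports errors, then settle just above that threshold - can be sketched in a few lines. The `errors_at()` function standing in for the on-die test circuitry is purely hypothetical, as is the error threshold it pretends to measure:

```python
# Sketch of the proposed voltage-margining loop: step the core voltage
# down until the (hypothetical) canary circuit reports timing errors,
# then settle on the lowest error-free step.

STEP_MV = 25  # adjustment granularity, an assumed value

def errors_at(voltage_mv):
    """Hypothetical stand-in for the on-die canary circuit.

    Here we simply pretend the chip starts producing timing errors
    below 1100 mV; real silicon would measure this."""
    return voltage_mv < 1100

def settle_voltage(start_mv, floor_mv=800):
    """Walk the voltage down until errors would appear, then stop
    at the last step that was still error-free."""
    v = start_mv
    while v - STEP_MV >= floor_mv and not errors_at(v - STEP_MV):
        v -= STEP_MV
    return v
```

With the toy `errors_at()` above, `settle_voltage(1300)` walks down in 25 mV steps and returns 1100 mV - the lowest setting that never trips the detector, which is exactly the margin-free operating point the post is after.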

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11650 - Posted: 4 Aug 2009 | 9:37:55 UTC - in response to Message 11638.

Therefore such a test would have to be software based.

As you said later,
So why not design a chip with some test circuitry

Perhaps an on-die instruction set for testing, and if required automatically modifying voltages, frequencies or even disabling cache banks or a core? A small program could receive reports, analyse them and calculate ideal frequencies automatically. These could be saved to the system drive or BIOS and reloaded on restart. A sort of built-in CPU optimization kit.

I still like the idea of independent frequencies and voltages for CPU cores.
Most of the time people don't actually use all 4 cores of a quad, so if the CPU could raise and lower the frequencies independently, or even turn one or more cores off altogether, it would save energy, and therefore the overall cost of the system during its life. Unless you are crunching, playing games or using some serious software, there are few times when you would notice the difference between a quad core at 3.3GHz or 800MHz (8MB Cache). I often forget and have to check and see what my clock is set at – if the system gets loud, I turn it down.

If the cores could independently rise to the occasion, even when you are using intensive CPU applications, you would be saving on electricity (temperatures would be lower, as would the noise)!
I’m not sure Intel would go for this, as their cores are paired and it might reveal some underlying limitation (until 8 or more cores are mainstream, then it would be less obvious and less of an issue).

If these ideas were applied to graphics cards, it would save a small fortune in electricity. Even GPUGRID does not always use all the processing power of the graphics cards. I think Folding@home probably comes a lot closer, but some GPU crunching clients such as Aqua often use substantially less (it seems to vary with different tasks – similar to a computer game). GPUs are far from green!
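The per-core scheme being asked for here is conceptually simple: each core steps up or down through its own table of speeds based only on its own utilisation. A toy governor sketch (the thresholds and the Phenom-like step table are illustrative, not any vendor's actual algorithm):

```python
# Toy per-core frequency governor: each core is stepped through a table
# of P-states based only on its own utilisation, independently of its
# siblings. Thresholds and steps are made up for illustration.

FREQ_STEPS_MHZ = [800, 1800, 3000]   # e.g. the Phenom II 940's three steps

def next_step(current_idx, utilisation):
    """Pick the next P-state index for one core from its utilisation (0-1)."""
    if utilisation > 0.80 and current_idx < len(FREQ_STEPS_MHZ) - 1:
        return current_idx + 1   # busy core: clock up
    if utilisation < 0.20 and current_idx > 0:
        return current_idx - 1   # near-idle core: clock down, save power
    return current_idx           # otherwise hold

def govern(core_states, utilisations):
    """Advance every core one scheduling interval, each independently."""
    return [next_step(s, u) for s, u in zip(core_states, utilisations)]

# One crunching core and two near-idle cores each go their own way:
states = govern([2, 2, 2, 2], [0.95, 0.05, 0.10, 0.50])
print([FREQ_STEPS_MHZ[s] for s in states])
```

Run for a few intervals, the idle cores sink to 800MHz while the loaded core stays at 3GHz - which is roughly what the later Cool'n'Quiet / SpeedStep generations discussed downthread ended up doing.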

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11693 - Posted: 6 Aug 2009 | 20:36:29 UTC - in response to Message 11650.

I'm a bit confused by your post. That's actually just what "Power Now!", Cool'n'Quiet, SpeedStep etc. are doing. They're not perfect yet, but they do adjust clock speeds and voltages on the fly according to demand, and in the newest incarnations also independently for individual cores. Intel heavily uses the thermal headroom under single / low-threaded load for their turbo mode. So we're getting there. And now that almost all high-performance chips (CPUs, GPUs) are power limited, these power management features are quickly becoming ever more important.

Talking about power management for GPUs: I've been complaining about this wasted power for a decade. Why can the same chips used in laptops be power efficient, downclocked and everything, whereas as soon as they're used in desktops they have to waste 10 - 60 W, even if they're doing nothing?! The answer is simple: because people don't care (as long as it doesn't hurt them too much) and because added hardware or driver features would cost more - and that's something people do care about.

A few days ago I read about some chip, I think it was the GPU integrated into the new 785 chipset. Here they adjust clock speed and voltage to target 60 fps in 3D. Really, that's the way it should have been from the beginning!

Oh, and the problem I have with all these features: the manufacturer has to set the clock speeds and voltages regardless of chip quality, temperature (well, they could factor that in to some extent) and chip aging / degradation. So they have to use generous safety margins (which is what overclockers exploit). What I propose is to add circuitry to measure the current chip condition in a reliable way and to adjust voltage accordingly (clock speed is determined by load anyway). That way the hardware could be used in the most efficient way. I'm sure it will be coming.. just not anytime *soon*.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Nognlite
Send message
Joined: 9 Nov 08
Posts: 69
Credit: 25,106,923
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 11772 - Posted: 10 Aug 2009 | 11:44:35 UTC

Well Ladies and Gents,

To talk about card longevity I am adding my 2 cents. I have been using two systems to run GPUGRID for about 2 years now. On one system, 2x XFX 8800GTs in SLI. On the other, 2x XFX GTX280s in SLI. While my 8800s have been rock solid since bought (other than a fan replacement, but that's another pissy story about XFX), my 280s have been replaced a total of three times, possibly with a fourth coming. Thank the Lord they are XFX with a double lifetime warranty, but this is ridiculous. The cards lasted a year before they had to be replaced the first time and about 6 months before the second replacement. Makes me wonder if XFX sends out refurbished cards as replacements?

I only run GPUGRID on all my cards and they use automatic fans when they get hot, controlled by the driver. What I have noticed over the two years is that when the driver doesn't load properly on startup or goes corrupt, the thermal solution does not function correctly, and a few times I found my cards at 105 Celsius. Again, thank the Lord I was at the computer, but how many times have I not been there while the system ran at 105?

I don't believe that I should have any issues with my 280s, and I don't think that GPUGRID is so taxing that it should be killing GPUs. XFX says that there might be power issues on my system that are killing cards, but I have a PC P&C 1200 with all voltages right on spec, and the cards were on an OCZ PSU the first time they died.

So this leaves me wondering. Should I stop GPUGRID to save the GPUs, or are they faulty GPUs, or is it a faulty GPU design to start with, and should I just get them replaced as they break?

Just my 2 cents.

STE\/E
Send message
Joined: 18 Sep 08
Posts: 360
Credit: 251,941,635
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 11773 - Posted: 10 Aug 2009 | 13:12:53 UTC
Last modified: 10 Aug 2009 | 13:16:20 UTC

In less than 1 year's time I've RMA'ed 3 GTX 260's already, and right now I'm looking at RMA'ing 5 more GTX 260's (4 BFG's & 1 EVGA) plus 1, possibly 2, GTX 295's. Oh, and for good measure throw in a Sapphire 4850 X2 & Sapphire 4870 that are going to need to be RMA'ed.

From what I'm hearing about Sapphire that could be a nightmare trying to get them to RMA 1 Card let alone 2 Cards. BFG is good about it and already told me when I get ready to RMA the Cards to let them know & they would set it up. EVGA I haven't had any dealings with but I'll find out I guess.

Personally, having this number of video cards go all at once tells me they're just not made to run 24/7 @ full load, and you're going to have trouble with some of them if you do.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11790 - Posted: 10 Aug 2009 | 20:54:34 UTC

Hey Nognlite,

are you sure you've been here for 2 years? Your account says "joined 9 Nov 2008", which is just 3/4 of a year if I'm not totally mistaken ;)

Not that this makes your card failures any better. What I can tell you, though, is that it would be better to set your fan speeds manually - as high as you're still comfortable with. As I wrote somewhere up there, this increases your GPU lifetime considerably.

Not sure if I wrote it here, but I'm convinced: current GPUs are not made for 24/7 operation. It's not that the chips are much different, it's that the tolerances and priorities are set differently. The temperatures which the manufacturers allow are OK for occasional gaming, but not really for 24/7 operation. Sure, some chips / cards can take it for quite some time and some fail anyway, regardless of temperature.. but this is a statistical process after all.

BTW: I'm sure they are sending out refurbished units as replacements (even for HDDs). Just think of all the people who have some whatever-so-nasty software or compatibility problem, RMA their product, and then the hardware actually proves stable under different conditions. They wouldn't want to throw these things away ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11793 - Posted: 10 Aug 2009 | 22:04:47 UTC - in response to Message 11693.

I'm a bit confused by your post. That's actually just what "Power Now!", Cool & Quiet, Speed Step etc. are doing. MrS

Well yes, up to a point, but as you went on to say, they are not perfect! I was trying to be general, as there are so many energy saving variations used in different CPUs, but very few are combined sufficiently. Perhaps Intel's Enhanced SpeedStep is the closest to what I am suggesting, but in itself it does not offer everything.
Many CPUs only have 2 speeds. Why not 10 or 30? If motherboards can be clocked in 1MHz steps, why not CPUs? Why develop so many different technologies separately, rather than combining, centralising, streamlining and reducing manufacturing costs? If the technology does not significantly increase production costs and is worthwhile having, have it in all the CPUs rather than hundreds of slightly different CPU types. In many areas Intel’s biggest rival is Intel; they make so many chips that many are competing directly against each other. Flooding the market with every possible combination of technology is just plain thick.

Why only reduce the multiplier and voltage? Why not the FSB as well? If the CPU is built to support it, the motherboard designs will follow, as there is decent competition there.
Why send power to the Cache when it’s doing nothing?
Why send power to all the CPU cores when only one is in use?
Why charge a small fortune for a slightly more energy efficient CPU (SLARP, L5420 vs SLANV, E5420)? Especially when manufacturing costs are the same.
Why use one energy saving feature in one CPU but a different feature in another CPU when both could be used? In many ways it’s not so much about being clever, just not being so stupid.

To be fair to both Intel and AMD, there have been excellent improvements over the last 5 years:
My Phenom II 940 offers three steps (3GHz, 1800MHz and 800MHz), which is one of the main reasons I purchased it. This was a big improvement over my previous Phenom 9750 (2.4GHz and 1.8GHz). The E2160 (and similar) only uses 8 watts when idle, and many of the systems they inhabit typically operate at about 50 watts – much less than top GPU cards!

Mind you, these are exceptions rather than the rule. Many speed steps were none too special – stepping down from 2.13GHz to 1.8GHz was a bit of a lame gesture by Intel!
My opinion is that if it’s not in use, it does not need power. So if it is using power that it does not need, it has been poorly designed.

they do adjust clock speeds and voltages on the fly according to demand, and in the newest incarnations also independently for individual cores.


OK, I was not aware the latest server cores could be independently stepped down in speed.
I hope the motherboard manufacturers keep up; I recently worked on several desktop systems that boasted energy efficient CPUs such as the E2160 (with C1E & EIST), only to see that the motherboard did not support speed stepping! Again this just smells of mismatched hardware/a stupid design flaw, but I do think the motherboard manufacturers need to make more of an effort - perhaps they are more to blame than AMD and Intel.

And now that almost all high performance chips 8CPUs, GPUs) are power limited these power management features are quickly becoming ever more important.


I agree; server farms are using more and more of the grid's energy each year, so they must look towards energy efficiency. Hopefully many of these server advancements will become readily available to the general consumer in the near future. Some of these advances come at a shocking price though, and new CPU designs often seem to drop existing energy-efficiency features to incorporate the new ones rather than adding the new energy-efficient technology on top. Presumably so they can compete against each other! Reminds me of the second wave of Intel quad cores – clocked faster, but with less cache, so there was only a slight improvement with some chips and it was difficult to choose which one was actually faster! Ditto for Hyper-Threading, which competed against faster-clocked non-HT cores.

Talking about power management for GPUs: I've been complaining about this wasted power for a decade. Why can the same chips used in laptops be power efficient, downclocked and everything, whereas as soon as they're used in desktops they have to waste 10 - 60 W, even if they're doing nothing?! The answer is simply: because people don't care (as long as it doesn't hurt them too much) and because added hardware or driver features would cost more - and that's something people do care about.


The general public probably don’t think about the running costs as much as IT pros do, but they really should. The lack of ‘green’ desktop GPUs is a serious problem. Neither ATI nor NVIDIA has bothered to produce a really green desktop GPU. It’s as though there is some sort of unspoken agreement not to compete on this front!

Sooner or later ATI or NVIDIA will realise that people like me would rather go on a 2-week holiday with a new netbook than pay for two power-greedy cards that cost almost as much to run as they do to buy!

Profile Nognlite
Send message
Joined: 9 Nov 08
Posts: 69
Credit: 25,106,923
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwat
Message 11804 - Posted: 11 Aug 2009 | 13:36:59 UTC - in response to Message 11790.
Last modified: 11 Aug 2009 | 13:37:43 UTC

You are in fact correct. I had to look at my records. My bad!

However this new information compounds my statement, as it's only been 3/4 of a year and two sets of GPUs have been replaced.

I wonder if other people have had the same issue, and as bad.

Cheers

I built the systems two years ago. That's why that sticks in my head.

RalphEllis
Send message
Joined: 11 Dec 08
Posts: 43
Credit: 2,216,617
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwat
Message 11819 - Posted: 12 Aug 2009 | 6:10:57 UTC - in response to Message 11772.

You may wish to set the fan speed manually, either with the EVGA utility or nTune in Windows, or NVClock-GTK in Linux. This would cut down on the heat issues.

STE\/E
Send message
Joined: 18 Sep 08
Posts: 360
Credit: 251,941,635
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 11821 - Posted: 12 Aug 2009 | 11:11:01 UTC

Looks like I'll be RMA'ing 5 GTX 260's with the Clock Down Bug either today or tomorrow, I have a GTX 295 that will do the same thing off and on but hasn't for a few days so I'll keep it for now and see if the Proposed Fix GDF mentioned later this month fixes it permanently or not. As long as it doesn't get any worse I can live with it for a few days more ... :)

Just so ATI doesn't feel left out I RMA'ed 2 of them yesterday, 1 4850 X2 & 1 4870, both were used at the MWay Project but quit working, the 4850 X2 in about a months time, the 4870 took about 6 month's before going bad.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 11858 - Posted: 13 Aug 2009 | 20:31:10 UTC - in response to Message 11793.

Hey SKGiven,

it's really a matter of cost. First and foremost the buyer has to care about power. If buying decisions are not influenced by power consumption in any way, then every single cent a company spends on power saving is lost (short term). That's why they're not going to do it until the power requirements start to hurt them (laptops, the Pentium 4, etc.).

And it's really not a matter of cents. First there are hardware modifications. The power management in i7 CPUs needs about 3 million transistors - as many as an entire Pentium 1! You couldn't possibly have implemented something like that in a Pentium 1 without ruining the company. That's why power-saving features develop gradually, in an evolutionary process.

Next there's software. A lot can go wrong when implementing proper power saving: degraded performance or even bugs and crashes. Testing, debugging and certifying such code is expensive, and the more complex the system gets, the more expensive it becomes. That's why manufacturers only implement small improvements at a time - as much as they feel they can still handle before product introduction.

An example: at work I've got a Phenom 9850. It eats so much power that it quickly overheats at 2.5 GHz (stock cooler, some case cooling). I usually run it at 2.1 GHz and 1.10 V, which prevents it from crashing and keeps the noise acceptable. However, if I want to speed up some single-threaded Matlab simulation, switch BOINC off and allow the CPU to go to the full 2.5 GHz... something almost funny happens. Windows keeps bouncing the task between cores, and after each move the app lands on a core which was set to 1.2 GHz by Cool'n'Quiet. The core has to adapt and speed up; shortly afterwards the cycle repeats. Overall the system uses more power but becomes slower than at a constant 2.1 GHz.

The reason for the slow switches is that the CPU draws so much current that the motherboard circuitry would be overloaded if the speeds were switched instantaneously, so AMD chose some delay time. All in all, that's an example of a power-saving feature going ridiculously wrong. Cool'n'Quiet on the Athlon 64, on the other hand, worked fine. They only got into trouble because they wanted to make it even better, offer more fine-grained control and make the system more complex.

Finally, you also have to consider proportionality. If you could spend 1 million in development costs to cut GPU idle power consumption from 60 W to 1 W, you'd be foolish not to do so. However, investing another million to cut power further to 0.9 W wouldn't help your company at all on the desktop.

BTW, I'm not saying this money-focused approach is the best way to go. But that's how it works as long as money rules the world.

Oh, there's more:

Why develop so many different technologies separately

They build upon each other. Each new generation of power-saving technologies generally incorporates and supersedes the previous one; it does not suddenly replace it with something different. And some companies are licensing this stuff, but the big players are basically all developing the same things on their own - adapted to their special needs, of course.

Why not the FSB as well?

That's being done on notebooks. You wouldn't notice the difference on a desktop.

Why send power to the Cache when it’s doing nothing?

That's been done for some time; minor savings.

Why send power to all the CPU cores when only one is in use?

The i7 is the first to really shut them off.

Why charge a small fortune for a slightly more energy efficient CPU (SLARP, L5420 vs SLANV, E5420)? Especially when manufacturing costs are the same.

Because the costs are not the same. Energy-efficient CPUs run at lower voltages, which not all CPUs can do. To a first approximation, you can decide to sell a chip as a normal 3 GHz CPU or as a 2.5 GHz EE chip; a regular 2.5 GHz chip might not reach 3 GHz at all.

Why use one energy saving feature in one CPU but a different feature in another CPU when both could be used?

I don't think this is being done. The features mainly build upon each other. Exceptions are mobile Celerons, where Intel just removed power-saving features (without adding others), which I really dislike. And mobile chips generally get more refined power management. I think this is mainly due to cost.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 12525 - Posted: 17 Sep 2009 | 3:31:57 UTC
Last modified: 17 Sep 2009 | 3:32:47 UTC

After roughly 8 months of life my GTX 280 card (EVGA) died. The good news is that I am within the 1-year warranty; the bad news is I missed the fine print. If you have EVGA cards you have to register them ON THEIR SITE to get the long-term warranty conversion (within 90 days of purchase; save the receipt - you also need that for an RMA).

Word to the wise ... you also need the S/N and P/N off the card or box ... though if you got a rebate, they will have to come off the card.

I suppose my only good takeaway is that, with luck, I will be in replacement mode if the other cards start to fail and are not covered ... of course, with next-generation cards on the verge now, it is also possible I can get replacement cards for a whole lot less than I spent on the originals if I just want to stay at (or near) current production levels ...

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 12986 - Posted: 2 Oct 2009 | 20:16:16 UTC - in response to Message 12525.

The thing is, manufacturers don’t want broken anything back - they want to keep their money! So they set up so many obstacles for you to negotiate that the majority of people will give up, or spend more time than it is worth trying to get some sort of partial refund or refurbished item (say after 3 months). Basically, the law says you can return an item that malfunctions for up to one year. Unfortunately, dubious politicians with unclear financial interests have sought to undermine this with grey legislation. So you are left wading through all sorts of dodgy terms and conditions – many of which are just meant to deter you; they have no legal grounds, but serve to hold up the proceedings long enough for them to get away with it. By the time you (or say 20 percent of people like you) get through their many hoops, there is a fair chance they will have been bought out, merged, renamed, re-launched or gone under, and you will have another layer of it to go through.

If you buy an item in a shop, hang onto the receipt and the packaging. If it breaks within a year, take it back and get a replacement or refund. If you buy online, you may have to deal with their terms and conditions, RMAs, and of course the outfit possibly not being around for long. To me it is worth the extra 5 or 10 percent to buy an expensive item in a local store with a good reputation.

STE\/E
Send message
Joined: 18 Sep 08
Posts: 360
Credit: 251,941,635
RAC: 0
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 13001 - Posted: 3 Oct 2009 | 22:00:58 UTC

I've RMA'ed probably 10 GTX 200 Series Cards & 2 ATI 48xx Cards this year alone & haven't had a bit of a problem getting the Manufacturers to back their Card & send me a Replacement ASAP ... Of course, as Paul said, you have to read the Fine Print and Register them as soon as you get them or you may be SOL & have to eat the Costs. Most of my GTX are BFG's which have a Lifetime Warranty so those Cards are good to go for a long time if I choose to continue to run them.

The ATI Cards only have a 1 year Warranty which is due to run out soon so I'll have to eat the costs there but with the new cards coming out I'll be ready to move up anyway ... :)

zpm
Avatar
Send message
Joined: 2 Mar 09
Posts: 159
Credit: 13,639,818
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 13006 - Posted: 4 Oct 2009 | 2:14:43 UTC - in response to Message 13001.

Most of my GTX are BFG's which have a Lifetime Warranty so those Cards are good to go for a long time if I choose to continue to run them.



that's why I'm going with BFG from now on...


Another tip to cool the beast..
if you live in a climate with big day-night temperature swings, say like the southern US, fall and spring are perfect times to bring in the cool air at night....

I've seen 10 C temp drops just by letting 52 F air into my room... which is normally 80 F.
____________

I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 13147 - Posted: 11 Oct 2009 | 23:51:29 UTC - in response to Message 13006.

You need to watch that trick.
Jnr. Frio might breeze in, and nick your computer!

Profile Argus
Avatar
Send message
Joined: 14 Mar 09
Posts: 6
Credit: 5,143,945
RAC: 0
Level
Ser
Scientific publications
watwatwatwatwatwat
Message 13160 - Posted: 13 Oct 2009 | 12:51:54 UTC
Last modified: 13 Oct 2009 | 13:46:22 UTC

I have 2 Leadtek Winfast GTX280's that died on 14 September '09, precisely after 6 months of crunching GPUGrid more or less 24/7 (less because my rigs are gaming rigs, so whenever me and my son were playing we would disable GPUGrid).

Cooling issues are out of the question, as my cases are Tt Xaser VI, with 3x120mm + 3x140mm case fans (soon to be 5x140mm + 1x120mm), plus another 120mm fan on the CPU cooler, and 1x135mm + 1x80mm in PSU (Gallaxy DXX 1000W). Not to mention A/C in every room where I have a PC.

Edit: forgot to mention, no OC. I've never OC'ed, I'm for rock solid stability. I prefer to buy components with high stock (factory) performance instead of low performance components to OC later.

Edit 2: I'm blaming crunching, specifically GPUGrid, because I sleep better knowing I have identified the culprit :))
____________
Semper ubi sub ubi.

zpm
Avatar
Send message
Joined: 2 Mar 09
Posts: 159
Credit: 13,639,818
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 13168 - Posted: 13 Oct 2009 | 21:38:18 UTC - in response to Message 13160.

I have 2 Leadtek Winfast GTX280's that died on 14 September '09, precisely after 6 months of crunching GPUGrid more or less 24/7 (less because my rigs are gaming rigs, so whenever me and my son were playing we would disable GPUGrid).

Cooling issues are out of the question, as my cases are Tt Xaser VI, with 3x120mm + 3x140mm case fans (soon to be 5x140mm + 1x120mm), plus another 120mm fan on the CPU cooler, and 1x135mm + 1x80mm in PSU (Gallaxy DXX 1000W). Not to mention A/C in every room where I have a PC.

Edit: forgot to mention, no OC. I've never OC'ed, I'm for rock solid stability. I prefer to buy components with high stock (factory) performance instead of low performance components to OC later.

Edit 2: I'm blaming crunching, specifically GPUGrid, because I sleep better knowing I have identified the culprit :))


could it be that the manufacturers cards don't pass the 24/7 full throttle test!!!! bfg GTX260, still going like the energizer bunny.

Profile Argus
Avatar
Send message
Joined: 14 Mar 09
Posts: 6
Credit: 5,143,945
RAC: 0
Level
Ser
Scientific publications
watwatwatwatwatwat
Message 13170 - Posted: 14 Oct 2009 | 7:45:53 UTC - in response to Message 13168.


could it be that the manufacturers cards don't pass the 24/7 full throttle test!!!!


Yeah, pretty sure that's actually the issue.


bfg GTX260, still going like the energizer bunny.


Hmmm, probably the reason I bought 3 BFG GTX285's a week ago :)))
____________
Semper ubi sub ubi.

zpm
Avatar
Send message
Joined: 2 Mar 09
Posts: 159
Credit: 13,639,818
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 13191 - Posted: 16 Oct 2009 | 18:17:08 UTC - in response to Message 13170.


could it be that the manufacturers cards don't pass the 24/7 full throttle test!!!!


Yeah, pretty sure that's actually the issue.


bfg GTX260, still going like the energizer bunny.


Hmmm, probably the reason I bought 3 BFG GTX285's a week ago :)))


I know that when my BFG GTX 260 FOC 216 SP gets to 75 C, it shuts down the computer because of overheating...

Maybe a BFG safety feature or a faulty sensor, but I actually like that about my card; it means it's safe up to about 70 C...

=Lupus=
Send message
Joined: 10 Nov 07
Posts: 9
Credit: 1,083,415
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 13235 - Posted: 20 Oct 2009 | 2:32:50 UTC

Hi,
I'm using a Palit GTX260 Sonic 216SP graphics card.
Yeah I know, that is the one that is OC'ed by the manufacturer.
I know everyone says they are b###sh#t and not even worth buying.

I will NEVER OC anything in my computer myself. For a good lifetime of your computer, buy good equipment and UNDERclock it (or leave it on stock ratings).

Funny thing #1: The NVIDIA nTune utility they had on disk keeps my GPU at ca. 65 °C, while running the two fans at 40-45% speed.
Funny thing #2: When manually setting the GPU's fan speed to 100%, I can cool the GPU down to 45 °C under full GPUGRID load! On the downside, it is loud as hell then.
Funny thing #3: Playing AION (my I-really-love-it MMORPG), which uses a modified CryEngine 1 as its graphics engine, my GPU yawns and goes to bed while still giving me 60+ fps...

=Lupus=

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 13271 - Posted: 24 Oct 2009 | 18:59:08 UTC - in response to Message 13235.

I have a Palit GTX 260 and managed to OC it quite well. I found the sweet spot and kept it there for a while, but I now just run at stock because the noise was a little too high for me to work at the system.

For those that keep their fan at 100% - don't! It will drastically reduce the life expectancy of the fan, and a failed fan could take out the GPU.
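A middle ground between auto control and a flat 100% is a fixed-but-modest curve. A hypothetical sketch of the idea in Python - the temperature breakpoints and the 40-80% clamp are assumptions, and the real setting would be applied through a tool like RivaTuner or NVClock, not code like this:

```python
def fan_duty(temp_c, floor=40, ceiling=80):
    """Map GPU temperature to a fan duty cycle, clamped to a
    modest 40-80% band so the fan is never flat out at 100%."""
    # Ramp linearly from the floor at 50 C to the ceiling at 85 C.
    if temp_c <= 50:
        return floor
    if temp_c >= 85:
        return ceiling
    span = (temp_c - 50) / (85 - 50)
    return round(floor + span * (ceiling - floor))

for t in (45, 60, 75, 90):
    print(t, fan_duty(t))
```

The point of the clamp is exactly the trade-off discussed here: enough airflow to keep the GPU well below its cut-off, without running the fan bearing at its limit around the clock.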

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2671
Credit: 751,466,674
RAC: 481,751
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 13277 - Posted: 25 Oct 2009 | 13:01:07 UTC - in response to Message 13271.

For those that keep their fan at 100% - don't! It will drastically reduce the life expectancy of the fan, and a failed fan could take out the GPU.


.. but it helps the GPU and the rest of the card ;)

I will NEVER OC anything in my computer myself. For a good lifetime of your computer, buy good equipment and UNDERclock it (or leave it on stock ratings).


Overclocked or underclocked doesn't mean much if you leave the fan on auto. OC'ed at 70°C (stock voltage) will be much better for the card than underclocked at 90°C (stock voltage).

MrS
____________
Scanning for our furry friends since Jan 2002
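MrS's point fits the common rule of thumb that semiconductor lifetime roughly halves for every 10 °C rise in operating temperature (a crude simplification of the Arrhenius model - the exact halving step and reference temperature below are assumptions, not measured card data):

```python
def relative_lifetime(temp_c, ref_temp_c=70.0, halving_step_c=10.0):
    """Rule-of-thumb lifetime relative to a reference temperature:
    lifetime halves for every `halving_step_c` degrees above it."""
    return 2.0 ** ((ref_temp_c - temp_c) / halving_step_c)

# An overclocked card held at 70 C vs an underclocked one left at 90 C:
print(relative_lifetime(70))  # 1.0  (reference)
print(relative_lifetime(90))  # 0.25 (a quarter of the reference lifetime)
```

By that heuristic the underclocked card idling its fan at 90 °C would last around a quarter as long as the overclocked one held at 70 °C, which is why the fan setting matters more than the clocks.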

zpm
Avatar
Send message
Joined: 2 Mar 09
Posts: 159
Credit: 13,639,818
RAC: 0
Level
Pro
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 13278 - Posted: 25 Oct 2009 | 14:43:30 UTC - in response to Message 13277.

i run stock settings for my FOC 216 260 from BFG...

70% fan with window closed = 63 C
70% fan with window open = 55 C.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Avatar
Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,834,906,524
RAC: 276,970
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 13279 - Posted: 25 Oct 2009 | 19:46:14 UTC - in response to Message 13277.

You are quite right to say that it reduces the heat and increases the life expectancy of the card's other parts - right up to the time when the fan fails; then it is a bit hit and miss how things turn out.
If you have a GPU sitting at 60 degrees and the fan fails, and the card shuts down - either because it detects the fan failure, or after the card heats up to 85 degrees (or whatever the cut-off point is) - all well and good. Your system might not stay booted for long, but at least you will be able to work out why, and twiddle your thumbs for a few days while you wait for the replacement fan or RMA replacement.
But if your OC'd card is sitting at 79 degrees and there is no automatic shutdown, you are in hot trouble. This is the reason house insurance companies put clauses in about modified electronic equipment, and why I only use metal system cases.
If you don't already have them, good system fans will help, and might let you get away with a modestly raised (40-80%) graphics fan speed. Sometimes they can even make the noise bearable - when the window is open!

I feel a similar effect when I open the window. The room temp drops to comfortable, but in another month it will be well below comfortable. Last summer I sometimes sat a fan on the open window's frame to suck in the slightly cooler air. I went PC-green: hooked up a USB fan to a Solio.

Post to thread

Message boards : Graphics cards (GPUs) : Video Card Longevity