1) Message boards : Number crunching : Dangers of crunching (Message 38586)
Posted 22 hours ago by Profile Retvari Zoltan*
We had a power failure late at night last Thursday. This has happened before, but never quite like this.
We have LED bulbs, which are more sensitive to power surges and flicker much more frequently than incandescent and fluorescent lighting. They did flicker this time, just before the electricity went out completely. The power came back for half a second, but then it went out for good.

I saw that the lighting in our staircase was on (we live in a four-story apartment building), so I thought the source of the failure was nearby. I checked all of our fuses, but they were all OK. I was puzzled for a minute, but then I heard another resident from the floor below come out to check their fuses in the staircase. It turned out that half of the apartments in the building had lost electricity. I went to the ground floor to search for the blown fuse. I met a guy from the ground floor who had heard a bang before the power went out in their apartment, which is a pretty bad sign. We found two fuse boxes, both with a large three-phase emergency power switch, but their handles were missing (to prevent stupid pranks...). However, the spindle of the bigger switch felt lukewarm, so I knew that something had burned inside and we couldn't fix it on our own. At that moment I decided to call an electrician... It was after midnight, so it took a while to find one who answered our call.
So the electrician found the burnt fuse panel, the two blown and burnt fuses, and a wire whose insulation had burned/melted away completely.

The burnt/blown fuses, the fuse panel, and the bare wire:

The wire (and its termination) is made of aluminium, which is a worse conductor than copper, has a fairly high contact resistance, and tends to corrode the contact when connected to other metals like brass. That probably led to this meltdown. (The electrician said those screws weren't fastened well enough; they could have become loose over time from the vibration caused by traffic, especially the tramway near our building.) Aluminium was used as a replacement for copper wiring during and after WWII (when our block was built).
I know that this failure wasn't caused solely by my constant power consumption (10~12A @ 230V, i.e. roughly 2.3-2.8kW), but it certainly played the biggest part. The lesson: those who draw that much electric power should not let the building's wiring go without maintenance for 70 years...
2) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38582)
Posted 1 day ago by Profile Retvari Zoltan*
It's a Palit NE5X970014G2-2041F (1569) GM204-A Rev A1 with a default core clock of 1051MHz.
It uses an exhaust fan (blower), so while it's a Palit shell, it's basically the reference design. I don't know of any board alterations from the reference design.
My understanding is that Palit supports GDDR5 from Elpida, Hynix and Samsung. This model has the Samsung GDDR5 and, like other Palit models, is supposed to operate at 3505MHz (7000MHz effective). However, it seems fixed at 3005MHz: while I can set the clock to 3555MHz, the current clock remains at 3005MHz, and raising or lowering it does not change the MCL (so my settings appear to be ignored).

The same applies to my Gigabyte GTX-980.

So while it can run at ~110% power @ 1.212V (1406MHz) @ 64°C with the fan at 75%, I cannot reduce the MCL bottleneck (53% @ 1406MHz), which I would prefer to do.

Is 53% MCL really a bottleneck? Shouldn't this bottleneck lower the GPU usage? Did you try to lower the memory clock to measure the effect of this 'bottleneck'?

I've tried Furmark, and it seems to be limited by memory bandwidth, while GPUGrid seems to be limited by GPU speed:

The history of the graph is:
GPUGrid -> Furmark (1600x900) -> Furmark (1920x1200 fullscreen) -> GPUGrid

biodoc, thanks for letting us know you are experiencing the same GDDR5 issue. Anyone else seeing this (or not)?

It's hard to spot (3005MHz instead of 3505MHz), but my GTX980 does the same; however, I don't think this is an error.
3) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38566)
Posted 2 days ago by Profile Retvari Zoltan*
A new application (v8.47) has been distributed since yesterday.
I'd like to know what has changed since the previous version.
It's not faster than the previous one.
4) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38515)
Posted 6 days ago by Profile Retvari Zoltan*
BTW: your GTX780Ti is (factory-)overclocked as well, isn't it?

I have two GTX780Ti's: one standard, and one factory overclocked. I had to lower the memory clock of the overclocked one to 3.1GHz...
5) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38507)
Posted 6 days ago by Profile Retvari Zoltan*
Looking at the performance tab, someone has finally equaled RZ's GTX780Ti host time. Host 168841 [3], a GTX980 with the same OS as RZ (WinXP), is completing tasks as fast. (RZ's GTX780Ti has been the fastest card for a while.)

That GTX980 is an overclocked one, so its performance/power ratio must be lower than a standard GTX980's. However, it's still better than a GTX780Ti's.

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 980] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:04:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU 0 : 79C
# GPU 1 : 74C
# GPU 2 : 78C
# GPU 1 : 75C
# GPU 1 : 76C
# GPU 1 : 77C
# GPU 1 : 78C
# GPU 1 : 79C
# GPU 1 : 80C
# GPU 0 : 80C
# Time per step (avg over 3750000 steps): 4.088 ms
# Approximate elapsed time for entire WU: 15331.500 s
# PERFORMANCE: 87466 Natoms 4.088 ns/day 0.000 ms/step 0.000 us/step/atom
00:19:43 (3276): called boinc_finish
</stderr_txt>
]]>

1342/1240 = 1.082258, so this card is overclocked by 8.2%, which equals the performance gap between a GTX780Ti and the GTX980.
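To make the arithmetic explicit, here is a quick check in Python (1240MHz is the standard GTX980 clock used as the divisor above; 1342MHz is the "Device clock" from the stderr output):

```python
reference_clock = 1240  # MHz, standard GTX980 clock (the divisor used above)
reported_clock = 1342   # MHz, "Device clock" reported in the stderr output

# A ratio above 1.0 means the card runs faster than the reference clock.
overclock = reported_clock / reference_clock - 1
print(f"Overclocked by {overclock:.1%}")  # roughly the GTX780Ti vs GTX980 gap
```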
6) Message boards : Server and website : GPU Results ready to send - number dwindling (Message 38438)
Posted 8 days ago by Profile Retvari Zoltan*
There are only 394 unsent workunits in the long queue, while there are 1926 in progress.
I think these 394 workunits consist mostly of SDOERR_BARNA5s, and we're at ~35% of that batch, so it won't run out very soon, but I think we'll need new batches within a week.
Is there new work in preparation for the long queue?
7) Message boards : News : Changes to scheduling policy (Message 38415)
Posted 9 days ago by Profile Retvari Zoltan*
unfortunately I'm getting computation errors most of the time.

If you take a look at your tasks' details, you can see the reason for those errors:
# The simulation has become unstable. Terminating to avoid lock-up (1)

This error is a sign of an unstable GPU. The instability can have various root causes:
- Too high GPU temperature (above 80°C, so this doesn't apply to you)
- Too low GPU voltage for the given GPU clock
- Too high GPU clock for the given GPU voltage (e.g. an aging GPU could not run even at factory settings)
- Too high GDDR5 frequency
- Insufficient, low-quality, or (nearly) broken PSU
- Too high contact resistance on the PCIe power connectors (usually caused by Molex->PCIe adapters), or on the two 12V pins of the 24-pin motherboard power connector

I've got two GTX 570 with 2.5 GB VRAM each, newest driver 344.11.

This card has twice as many memory chips as a standard GTX570, so perhaps the GPU can't drive the memory data lanes that fast.

Doesn't matter if I'm in SLI or not.

SLI is usually a source of random errors.

Other GPU projects like SETI, Einstein or Asteroids run fine.

Other GPU projects have obsolete GPU applications built on older CUDA versions, while GPUGrid uses the latest (CUDA 6.5 at the moment), so other projects can't stress the GPU as much as the GPUGrid client does.
The "GPU usage" measurement is misleading.

Is there anything I can do?

Check all power connectors in your PC for burnt ones.
Lower the GPU clock in 100MHz steps until it becomes stable; if that doesn't help, try again by lowering the GDDR5 frequency in 100MHz steps.
If your GPU becomes stable after lowering the GPU clock to some point, you can try raising the GPU clock in 10-20MHz steps as long as it doesn't cause these "simulation became unstable" messages; then increase the GPU voltage by 12.5mV, and keep raising the clock as long as the GPU doesn't get too hot.
Beware that different GPUGrid batches stress the GPU differently, so if there's no stability headroom in your settings, some harder workunits could fail.
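A minimal sketch of that trial-and-error procedure, assuming a hypothetical is_stable(clock, voltage) stand-in for an actual stress test; every name, step size and limit here is illustrative, not a real overclocking API:

```python
def tune(clock, voltage, is_stable, v_max, down=100, up=20, dv=12.5):
    """Sketch of the stepwise tuning described above (units: MHz and mV).

    is_stable(clock, voltage) stands in for running real workunits or a
    stress test at those settings; v_max is a safe voltage ceiling.
    """
    # 1) Lower the GPU clock in 100MHz steps until the card gets stable.
    while not is_stable(clock, voltage):
        clock -= down

    # 2) Raise the clock back in small steps while it stays stable; when a
    #    step fails, bump the voltage by 12.5mV and retry, up to the limit.
    while True:
        if is_stable(clock + up, voltage):
            clock += up
        elif voltage + dv <= v_max:
            voltage += dv
        else:
            break
    return clock, voltage
```

In practice each is_stable check means crunching real workunits for hours, so this loop is executed by hand over days rather than by a script.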
8) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38401)
Posted 10 days ago by Profile Retvari Zoltan*
Yes, I can see that now, looking at individual runs on your two machines. That is rather surprising; my testing in more controlled circumstances shows the opposite.

I'd like to have a pair of those circumstance controllers you use. :)
9) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38400)
Posted 10 days ago by Profile Retvari Zoltan*
Wow, great information. The 980 looks like a winner. Question, are the above power draw figures for the GPU alone or for the system as a whole?

The heading of that column reads "Delta of PC power consumption": the difference in the whole PC's power consumption between when the GPU is crunching and when it is not.

If it's for the system, were there any CPU WUs running? Thanks for the info!

There were 6 SIMAP CPU workunits running on that host; its total power consumption was 321W with the GTX-980.
10) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38398)
Posted 10 days ago by Profile Retvari Zoltan*
The GTX780Ti is 8-10% faster than the GTX980, but the GTX980 consumes only about two thirds of the power of the GTX780Ti.
RZ, what is your metric for performance?

My metric for performance is the data found under the "performance" tab, which is based on the time it takes different GPUs (hosts) to complete a WU from the same batch.
                GTX-980      GTX-780Ti    GTX-780TiOC
SDOERR_BARNA5   15713~15843  14915~15019  14892~14928
NOELIA_5MG      18026~18165  16601~16713  16826~16924
NOELIA_20MGWT   18085~18099  16849        17034
NOELIA_20MGK36I              16617~16779  16844~17049
NOELIA_20MG2                 16674~16831
NOELIA_UNFOLD   16533        15602

As it takes more time for the GTX-980 to complete similar workunits than it takes for the GTX780Ti, I consider the GTX-980 slower (the motherboard, CPU and RAM are similar; actually my host with the GTX 980 has a slightly faster CPU and RAM).
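As an illustration of how such a comparison works, here is the SDOERR_BARNA5 row turned into a relative-speed figure, using midpoints of the reported completion-time ranges (a rough estimate only):

```python
# Per-WU completion times in seconds, SDOERR_BARNA5 batch, from the table above.
def midpoint(lo, hi):
    return (lo + hi) / 2

gtx980 = midpoint(15713, 15843)    # 15778.0 s
gtx780ti = midpoint(14915, 15019)  # 14967.0 s

# Lower completion time means a faster card; the time ratio is the speed gap.
gap = gtx980 / gtx780ti - 1
print(f"GTX780Ti is ~{gap:.1%} faster than the GTX-980 on this batch")
```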
