1) Message boards : Number crunching : Dangers of crunching (Message 38636)
Posted 2 days ago by Profile Retvari Zoltan*
Great to hear you had no damage to your computer equipment.

It's interesting to see that bit of copper wire twisted around one of the fuse (contacts?). I wonder why that was there? A poor fix for a bad contact/connection?

We were wondering about that too. There could be two reasons for it:
1. The fuse had blown at some point (a load greater than 50A for a short time), and the wire was a makeshift bridge.
2. Someone wanted to dodge the electricity meter by bypassing it.
2) Message boards : Number crunching : Dangers of crunching (Message 38621)
Posted 2 days ago by Profile Retvari Zoltan*
Hi Zoltan,

Quite a story. I hope you don't have any damage to your hardware, I know you have quite a nice farm of rigs.
All the best.

Thanks TJ!

I was afraid of that, but luckily nothing was damaged except my RAC, as some workunits failed and one was stuck running for 44 hours before I noticed.
3) Message boards : Number crunching : Dangers of crunching (Message 38586)
Posted 4 days ago by Profile Retvari Zoltan*
We had a power failure late at night last Thursday. It has happened before, but never quite like this.
We have LED bulbs, which are more sensitive to power surges and flicker much more frequently than incandescent and fluorescent lighting. They did flicker this time, just before the electricity went out completely. The power came back for half a second, but then it went out permanently. I saw that the lighting in our staircase was still on (we live in a four-story apartment building), so I figured the source of the failure was nearby. I checked all of our fuses, but they were all OK. I was puzzled for a minute, but then I heard the resident from the floor below come out to check their fuses in the staircase. It turned out that half of the apartments in the building had lost electricity. I went to the ground floor to look for the blown fuse. There I met a guy who had heard a bang before the power went out in their apartment, which is a pretty bad sign. We found two fuse boxes, each with a large three-phase emergency power switch, but their handles were missing (to prevent stupid pranks...). However, the spindle of the bigger switch felt lukewarm, so I knew that something had burned inside and we couldn't fix it on our own. At that point I decided to call an electrician... It was after midnight, so it took a while to find one who answered our call.
The electrician found the burnt fuse panel, the two blown and burnt fuses, and a wire whose insulation had burned and melted away completely.

The burnt / blown fuses, the fuse panel, and the bare wire:

The wire (and its termination) is made of aluminium, which is a pretty bad conductor with a pretty high contact resistance, and it tends to corrode the contact when connected to other metals like brass. That probably led to this meltdown. (The electrician said those screws weren't fastened well enough; they could have worked loose over time from the vibration caused by the traffic, especially the tramway near our building.) Aluminium was used as a replacement for copper wiring during and after WWII (when our block was built).
I know that this failure wasn't caused solely by my constant power consumption (10~12A @ 230V), but it certainly played the biggest part in it. The lesson: those who draw that much electric power should not let the building's wiring go without maintenance for 70 years...
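For a sense of scale, the figures mentioned in this thread (a constant 10~12A draw at 230V, behind a 50A fuse) can be turned into watts with a quick back-of-the-envelope calculation; this is only a rough estimate that ignores power factor:

```python
# Rough load estimate for a constant crunching draw, using the figures
# mentioned in this thread (10-12 A at 230 V, behind a 50 A fuse).
VOLTAGE = 230.0        # mains voltage in volts
FUSE_RATING = 50.0     # fuse rating in amperes

def apparent_power_watts(current_a: float, voltage_v: float = VOLTAGE) -> float:
    """P = V * I, ignoring power factor for a rough estimate."""
    return voltage_v * current_a

low = apparent_power_watts(10.0)   # 2300 W
high = apparent_power_watts(12.0)  # 2760 W
fuse_share = 12.0 / FUSE_RATING    # fraction of the fuse rating used

print(f"Constant load: {low:.0f}-{high:.0f} W "
      f"({fuse_share:.0%} of the 50 A fuse rating)")
```

So a single cruncher's steady draw is only about a quarter of the fuse rating; it's the 70-year-old aluminium terminations, not the fuse itself, that couldn't take the continuous current.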
4) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38582)
Posted 4 days ago by Profile Retvari Zoltan*
It's a Palit NE5X970014G2-2041F (1569) GM204-A Rev A1 with a default core clock of 1051MHz.
It uses an exhaust fan (blower), so while it's a Palit shell, it's basically a reference design. I don't know of any board alterations from the reference design.
My understanding is that Palit supports GDDR5 from Elpida, Hynix and Samsung. This model has the Samsung GDDR5 and, like other Palit models, is supposed to operate at 3505MHz (7000MHz effective). However, it seems fixed at 3005MHz. While I can set the clock to 3555MHz, the current clock remains at 3005MHz. Raising or lowering it does not change the MCL (so it appears that my settings are being ignored).

The same applies to my Gigabyte GTX-980.

So while it can run at ~110% power @ 1.212V (1406MHz) @ 64°C with the fan at 75%, I cannot reduce the MCL bottleneck (53% @ 1406MHz), which I would prefer to do.

Is 53% MCL really a bottleneck? Shouldn't such a bottleneck lower the GPU usage? Did you try lowering the memory clock to measure the effect of this 'bottleneck'?

I've tried Furmark, and it seems to be limited by memory bandwidth, while GPUGrid seems to be limited by GPU speed:

The history of the graph is:
GPUGrid -> Furmark (1600x900) -> Furmark (1920x1200 fullscreen) -> GPUGrid

biodoc, thanks for letting us know you are experiencing the same GDDR5 issue. Anyone else seeing this (or not)?

It's hard to spot (3005MHz instead of 3505MHz), but my GTX980 does the same. I don't think this is an error, though.
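To put the hard-to-spot 3005 vs 3505 MHz difference in perspective, the two reported clocks can be converted into theoretical peak bandwidth, assuming the 256-bit bus from the stderr output quoted below and the usual convention that tools report half the effective GDDR5 data rate:

```python
# Theoretical GDDR5 bandwidth on a 256-bit bus for the two memory
# clocks discussed above. Clocks are in MHz as reported by the tools;
# the effective data rate is twice that (e.g. 3505 MHz -> ~7000 MT/s).
BUS_WIDTH_BITS = 256

def bandwidth_gb_s(mem_clock_mhz: float, bus_bits: int = BUS_WIDTH_BITS) -> float:
    """Peak bandwidth in GB/s: clock * 2 (data rate) * bus width / 8 bits per byte."""
    transfers_per_s = mem_clock_mhz * 1e6 * 2
    return transfers_per_s * bus_bits / 8 / 1e9

print(f"3005 MHz: {bandwidth_gb_s(3005):.1f} GB/s")  # 192.3 GB/s
print(f"3505 MHz: {bandwidth_gb_s(3505):.1f} GB/s")  # 224.3 GB/s
print(f"deficit : {1 - bandwidth_gb_s(3005) / bandwidth_gb_s(3505):.1%}")
```

That's roughly a 14% bandwidth deficit, which matters for bandwidth-bound loads like Furmark but much less for GPU-bound GPUGrid workunits.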
5) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38566)
Posted 5 days ago by Profile Retvari Zoltan*
A new application (v8.47) has been distributed since yesterday.
I'd like to have some information about the changes since the previous version.
It's not faster than the previous one.
6) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38515)
Posted 9 days ago by Profile Retvari Zoltan*
BTW: your GTX780Ti is (factory-)overclocked as well, isn't it?

I have two GTX780Ti's: one standard, and one factory overclocked. I had to lower the memory clock of the overclocked one to 3.1GHz...
7) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38507)
Posted 9 days ago by Profile Retvari Zoltan*
Looking at the performance tab - someone has finally equaled RZ's GTX780Ti host time. Host 168841 [3], a GTX980 with the same OS as RZ (WinXP), is completing tasks as fast. (RZ's GTX780Ti has been the fastest card for a while.)

That GTX980 is an overclocked one, so its performance/power ratio must be lower than a standard GTX980's. However, it's still better than a GTX780Ti's.

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 980] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 980
# ECC : Disabled
# Global mem : 4095MB
# Capability : 5.2
# PCI ID : 0000:04:00.0
# Device clock : 1342MHz
# Memory clock : 3505MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU 0 : 79C
# GPU 1 : 74C
# GPU 2 : 78C
# GPU 1 : 75C
# GPU 1 : 76C
# GPU 1 : 77C
# GPU 1 : 78C
# GPU 1 : 79C
# GPU 1 : 80C
# GPU 0 : 80C
# Time per step (avg over 3750000 steps): 4.088 ms
# Approximate elapsed time for entire WU: 15331.500 s
# PERFORMANCE: 87466 Natoms 4.088 ns/day 0.000 ms/step 0.000 us/step/atom
00:19:43 (3276): called boinc_finish
</stderr_txt>
]]>

1342/1240 = 1.0823, so this card is overclocked by 8.2%, which equals the performance gap between a GTX780Ti and the GTX980.
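The ratio above is easy to verify (1342MHz is the device clock from the stderr output, 1240MHz the reference clock assumed in the post):

```python
# Ratio of the reported device clock to the reference boost clock
# (1342 MHz vs 1240 MHz, both figures taken from the post above).
reference_mhz = 1240
reported_mhz = 1342

overclock = reported_mhz / reference_mhz - 1
print(f"Overclocked by {overclock:.1%}")  # about 8.2%
```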
8) Message boards : Server and website : GPU Results ready to send - number dwindling (Message 38438)
Posted 11 days ago by Profile Retvari Zoltan*
There are only 394 unsent workunits in the long queue, while 1926 are in progress.
I think these 394 workunits consist mostly of SDOERR_BARNA5s, and we're at ~35% of that batch, so it won't run out very soon, but I think we'll need new batches within a week.
Is there new work in preparation for the long queue?
9) Message boards : News : Changes to scheduling policy (Message 38415)
Posted 12 days ago by Profile Retvari Zoltan*
unfortunately I'm getting computation errors most of the time.

If you take a look at your tasks' details, you can see the reason for those errors:
# The simulation has become unstable. Terminating to avoid lock-up (1)

This error is a sign of an unstable GPU. The root of this instability can be any of several things:
- Too high a GPU temperature (above 80°C - so this is not your case)
- Too low a GPU voltage for the given GPU clock
- Too high a GPU clock for the given GPU voltage (e.g. an aging GPU may not run even at factory settings)
- Too high a GDDR5 frequency
- An insufficient, low-quality or (nearly) broken PSU
- Too high a contact resistance on the PCIe power connectors (usually caused by Molex->PCIe converters), or on the two 12V pins of the 24-pin motherboard power connector

I've got two GTX 570 with 2.5 GB VRAM each, newest driver 344.11.

This card has twice as many memory chips as a standard GTX570, so perhaps the GPU can't drive the memory data lanes that fast.

Doesn't matter if I'm in SLI or not.

SLI is usually a source of random errors.

Other GPU projects like SETI, Einstein or Asteroids run fine.

Other GPU projects have obsolete GPU applications built on older CUDA versions, while GPUGrid uses the latest (CUDA 6.5 at the moment); therefore other projects can't stress the GPU as much as the GPUGrid client does.
The "GPU usage" measurement is misleading.

Is there anything I can do?

Check all power connectors in your PC for burnt ones.
Lower the GPU clock in 100MHz steps until it becomes stable; if that doesn't work, then also try lowering the GDDR5 frequency in 100MHz steps.
If your GPU becomes stable after lowering the GPU clock to some point, you can try raising the GPU clock in 10-20MHz steps as long as it doesn't cause these "the simulation has become unstable" messages; then increase the GPU voltage by 12.5mV and repeat raising the clock, as long as the GPU doesn't get too hot.
Be aware that different GPUGrid batches stress the GPU differently, so if there's no stability headroom in your settings, some harder workunits could still fail.
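The tuning procedure above can be sketched as a simple loop. This is only an illustration: `run_stress_test` is a hypothetical stand-in for running a GPUGrid workunit (or another stress test) at a given clock and observing whether it errors out; here it is stubbed with a fixed stability limit.

```python
# Sketch of the clock-tuning procedure described above.
# run_stress_test(mhz) is a HYPOTHETICAL callback: it should run a
# demanding workload at the given clock and return True if no
# "simulation has become unstable" errors occurred.
from typing import Callable

def tune_gpu_clock(start_mhz: int,
                   run_stress_test: Callable[[int], bool],
                   coarse_step: int = 100,
                   fine_step: int = 20,
                   min_mhz: int = 500) -> int:
    """Lower the clock in coarse steps until stable, then creep back up."""
    clock = start_mhz
    # Step 1: lower the clock in 100 MHz steps until the card is stable.
    while clock > min_mhz and not run_stress_test(clock):
        clock -= coarse_step
    # Step 2: raise it in small steps while it stays stable.
    # (Raising the GPU voltage by 12.5 mV would allow further steps,
    # as long as temperatures stay in check.)
    while run_stress_test(clock + fine_step):
        clock += fine_step
    return clock

# Stub: pretend this card happens to be stable at 1250 MHz and below.
print(tune_gpu_clock(1400, lambda mhz: mhz <= 1250))  # -> 1240
```

Leaving some headroom below the last stable clock found this way is what absorbs the batch-to-batch variation mentioned above.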
10) Message boards : Graphics cards (GPUs) : Maxwell now (Message 38401)
Posted 13 days ago by Profile Retvari Zoltan*
Yes, I can see that now looking at individual runs on your two machines. That is rather surprising, my testing in more controlled circumstances shows the opposite.

I'd like to have a pair of those circumstance controllers you use. :)

Next 10