
Message boards : Number crunching : About the GERARD_A2AR batch

Author Message
Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Message 42683 - Posted: 25 Jan 2016 | 9:18:45 UTC
Last modified: 25 Jan 2016 | 9:24:26 UTC

I accidentally did a test with my i7-4790K + GTX 980 Ti + WinXP x64 host.
The CPU's PCIe 3.0 bus was running at only x4, so its speed was like PCIe 1.0 x16.
I experienced the following:
1. The GPU usage & temperature were slightly lower than normal (so it ran unnoticed for a couple of days)
2. The GPU & memory clocks were normal (1404MHz & 3505MHz)
3. The workunits' runtime went up by 123% (yes, more than doubled: 6h5m -> 14h14m)

These results confirm that an old motherboard & CPU should not be paired with a high-end GPU, to avoid this kind of frustration with similar workunits in the future.
(Also, the performance decrease caused by WDDM is higher than usual on these workunits.)
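For reference, the rough numbers behind that x4 observation can be sketched in a few lines. The per-lane figures below are the usual nominal effective rates after encoding overhead, not measurements from this host:

```python
# Rough one-way PCIe bandwidth per generation. Per-lane figures are the
# commonly quoted nominal effective rates (GB/s) after encoding overhead:
# 8b/10b for gen 1.0/2.0, 128b/130b for gen 3.0.
PER_LANE_GBPS = {"1.0": 0.25, "2.0": 0.5, "3.0": 0.985}

def link_bandwidth(gen: str, lanes: int) -> float:
    """Theoretical one-way bandwidth of a link, in GB/s."""
    return PER_LANE_GBPS[gen] * lanes

# A PCIe 3.0 slot running at x4 moves roughly as much data as PCIe 1.0 x16:
print(f"PCIe 3.0 x4 : {link_bandwidth('3.0', 4):.2f} GB/s")   # ~3.94 GB/s
print(f"PCIe 1.0 x16: {link_bandwidth('1.0', 16):.2f} GB/s")  # 4.00 GB/s
```

That near-equality is why a misconfigured x4 link on a modern board behaves like a decade-old x16 slot.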

Dayle Diamond
Joined: 5 Dec 12
Posts: 84
Credit: 1,629,213,415
RAC: 672,941
Message 42687 - Posted: 25 Jan 2016 | 23:13:43 UTC - in response to Message 42683.

I had a few WUs take much longer than average a few days ago, with no system changes. I got the regular amount of cobblestones, so I assume the extra length was unintentional. Back to normal now.

Would you mind offering a few more details:

How old was the motherboard running the 4790k?

In my estimation, the 4790k isn't a very old CPU; what sort of motherboard were you plugging it into?

Bedrich Hajek
Joined: 28 Mar 09
Posts: 467
Credit: 8,189,996,966
RAC: 10,555,478
Message 42688 - Posted: 25 Jan 2016 | 23:44:02 UTC - in response to Message 42683.

I accidentally did a test with my i7-4790K + GTX 980 Ti + WinXP x64 host.
The CPU's PCIe 3.0 bus was running at only x4, so its speed was like PCIe 1.0 x16.
I experienced the following:
1. The GPU usage & temperature were slightly lower than normal (so it ran unnoticed for a couple of days)
2. The GPU & memory clocks were normal (1404MHz & 3505MHz)
3. The workunits' runtime went up by 123% (yes, more than doubled: 6h5m -> 14h14m)



Comparing it to my AMD Athlon(tm) 64 X2 5000+ with a GTX 980 Ti and Windows XP x32, my results are as follows:

1. GPU usage is under 70% and the temperature is about 50C, compared to about 95% usage at about 60C on other GERARD WUs.

2. Device clock: 1190MHz, memory clock: 3505MHz

3. Workunit runtime is over 12 hours, compared to about 6 hours (plus or minus) for the other GERARD WUs.

Even with several of these slow WUs in my average, my average completion time is currently in 2nd place:

Rank User name Average (h) Total crunched (WU) GPU description of fastest setup
1 BurningToad 6.23333 12 NVIDIA GeForce GTX 980 Ti (4095MB) driver: 355.98
2 Bedrich Hajek 6.37091 55 NVIDIA GeForce GTX 980 Ti (4095MB) driver: 355.82
3 Xeaon 6.77059 17 NVIDIA GeForce GTX 980 Ti (4095MB) driver: 361.43
4 Streetlight 6.95152 33 [2] NVIDIA GeForce GTX TITAN X (4095MB) driver: 358.50
5 syntech 6.98824 17 NVIDIA GeForce GTX 980 Ti (4095MB) driver: 358.91
6 Retvari Zoltan 7.02595 185 NVIDIA GeForce GTX 980 Ti (4095MB) driver: 358.50
7 Gamekiller 7.03636 11 NVIDIA GeForce GTX 980 Ti (4095MB) driver: 361.43
8 whizbang 7.81600 25 [2] NVIDIA GeForce GTX 980 (4095MB) driver: 361.43
9 Kagura Kagami@jisaku 7.81818 11 NVIDIA GeForce GTX 980 Ti (4095MB) driver: 361.43
10 Andree Jacobson 7.89167 12 NVIDIA GeForce GTX 980 (4095MB)


These results confirm that an old motherboard & CPU should not be paired with a high-end GPU, to avoid this kind of frustration with similar workunits in the future.
(Also, the performance decrease caused by WDDM is higher than usual on these workunits.)


Yes and no. Yes, it is frustrating. But why shouldn't high-end cards be used in older computers? I am getting runtimes on the other GERARD WUs that are comparable to my new Windows 10 computer.

This brings up another question about future WUs. Should they be more or less dependent on CPUs? Even fast computers pay a time penalty with the A2AR WUs, though it is much smaller than on older CPUs. If we want more efficient (faster) crunching, the WUs should be made less CPU dependent, where possible.

As for the WDDM penalty, which happens (correct me if I am wrong) when the GPU has to access the CPU after each step: would it be possible to have this access only every other step, or even every third? That should reduce the lag. I am not sure I have the right wording for this, but I hope you understand what I am saying.
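A toy cost model shows why syncing every Nth step would help. All numbers here are invented purely for illustration; real per-step GPU and sync costs are workload dependent:

```python
# Toy cost model for the idea above: pay the host<->GPU round trip
# (the WDDM-style overhead) only once every N steps instead of every step.
# gpu_ms and sync_ms are made-up numbers, purely for illustration.

def total_time_ms(steps: int, gpu_ms: float, sync_ms: float, sync_every: int) -> float:
    """Estimated wall time if the CPU round trip happens every `sync_every` steps."""
    syncs = steps // sync_every
    return steps * gpu_ms + syncs * sync_ms

steps = 100_000
every_step = total_time_ms(steps, gpu_ms=2.0, sync_ms=0.5, sync_every=1)
every_3rd  = total_time_ms(steps, gpu_ms=2.0, sync_ms=0.5, sync_every=3)
print(f"sync every step: {every_step / 1000:.1f} s")  # 250.0 s
print(f"sync every 3rd : {every_3rd / 1000:.1f} s")   # ~216.7 s
```

The saving is just the amortized sync cost; whether the science allows applying the CPU-side forces less often is a separate question.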




Profile Retvari Zoltan
Message 42689 - Posted: 26 Jan 2016 | 0:17:23 UTC - in response to Message 42687.
Last modified: 26 Jan 2016 | 0:17:50 UTC

How old was the motherboard running the 4790k?
In my estimation, the 4790k isn't a very old CPU; what sort of motherboard were you plugging it into?
It's in a Gigabyte GA-Z87X-OC motherboard. I don't know the exact age of this board, but the Z87 chipset is almost 3 years old.
That's not too old for a GTX 980 Ti. But when its PCIe bus was limited to x4, it acted like a really old motherboard with PCIe 1.0 x16.

Profile Retvari Zoltan
Message 42690 - Posted: 26 Jan 2016 | 1:41:40 UTC - in response to Message 42688.

Even having several of these slow WUs in my average, my average completion time is currently in 2nd place
The present mix of workunits is easier on the CPU, but previously there were large batches (for example the NOELIA tasks) which were like the GERARD_A2AR batch is now.

These results confirm that an old motherboard & CPU should not be paired with a high-end GPU, to avoid this kind of frustration with similar workunits in the future.
(Also, the performance decrease caused by WDDM is higher than usual on these workunits.)
Yes and no. Yes, it is frustrating. But why shouldn't high end cards be used in older computer?
This is merely a forewarning, to avoid the frustration that could be caused by a large, CPU-demanding batch.

I am getting run times on the other GERARD WUs that are comparable to my new windows 10 computer.
True. But this thread is about the GERARD_A2AR batch, whose runtimes are ~70% longer on your older Athlon/WinXP host than on your i7-5820K/Win10 host. To put it in an even worse perspective: your older host's GERARD_A2AR runtimes are ~100% longer than those of my WinXP/i3-4130 host.

This brings up another question about future WUs. Should they be more or less dependent on CPUs?
Now that's the million dollar question.

Even fast computers pay a time penalty with the A2AR WUs, though it is much smaller than on older CPUs. If we want more efficient (faster) crunching, the WUs should be made less CPU dependent, where possible.
I think it's impossible from a computing point of view. (To do that, we would have to use double precision enabled GPUs, which are very, very expensive.)

disturber
Joined: 11 Jan 15
Posts: 11
Credit: 62,705,704
RAC: 0
Message 42709 - Posted: 28 Jan 2016 | 20:35:46 UTC - in response to Message 42690.

(Then we should use Double Precision enabled GPUs, which are very very expensive.)


If you can tolerate 1/4 DP performance, which is significantly better than the 1/24 or 1/32 of crippled cards, then the only reasonable choice is AMD's 7970 or 280X. It is the card that produces high output in Milkyway, which is specifically programmed for double precision floating point. These cards go for less than $200 on the used market.

My computer recently downloaded a work unit that normally takes 9-10 hours on my GTX 970, and to my astonishment BOINC tells me that it will finish in 1d 03:16:51. It is the new chalcone229x2-GERARD_CXCL12_DCKCHALK. Is anyone else seeing this? It's not a slow machine: PCIe 3.0 is running at x8 and the CPU is an i7-3770K overclocked to 4.4GHz.

Is the credit commensurate with the long compute time?

Not trying to hijack this thread.

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 6,493,864,375
RAC: 2,796,812
Message 42710 - Posted: 28 Jan 2016 | 21:29:58 UTC

If the work units are truly CPU limited on a 4790K, then you should see an increase in performance if you disable hyperthreading in the BIOS. The app would then have access to one full core rather than a single logical core.

Profile Retvari Zoltan
Message 42711 - Posted: 28 Jan 2016 | 23:53:44 UTC - in response to Message 42710.

If the work units are truly CPU limited on a 4790K, then you should see an increase in performance if you disable hyperthreading in the BIOS. The app would then have access to one full core rather than a single logical core.
It's not CPU limited. It was PCIe bandwidth limited, as the CPU's PCIe bus was accidentally running at x4 speed.

Profile Retvari Zoltan
Message 42712 - Posted: 29 Jan 2016 | 0:12:33 UTC
Last modified: 29 Jan 2016 | 0:14:49 UTC

Actually there's a complete thread for a similar batch in the news topic, started by Gerard himself :)
There is an explanation for the higher CPU usage of this batch:

Gerard wrote:
I forgot to note that due to the nature of these simulations, some small forces have to be added externally and, unfortunately, these have to be calculated on the CPU instead of the GPU. Therefore you may notice some amount of CPU usage, which in my case never surpassed 10%.
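A minimal sketch of the pattern Gerard describes: the GPU supplies most of the forces, while a small externally-added force is computed on the CPU each step and mixed in. The harmonic bias, function names, and all numbers below are invented for illustration; this is not ACEMD's actual code:

```python
# Sketch: most forces come from the GPU, but a small external bias force
# is computed on the CPU every step and added in. The harmonic restraint
# and all constants are illustrative assumptions, not ACEMD internals.

def external_bias(positions, k=0.1, center=0.0):
    """Small CPU-side force: harmonic restraint pulling toward `center`."""
    return [-k * (x - center) for x in positions]

def step(positions, gpu_forces, dt=0.01):
    cpu_forces = external_bias(positions)            # the extra CPU work per step
    total = [g + c for g, c in zip(gpu_forces, cpu_forces)]
    return [x + dt * f for x, f in zip(positions, total)]

pos = [1.0, -2.0, 0.5]
pos = step(pos, gpu_forces=[0.0, 0.0, 0.0])          # bias-only step, for clarity
```

Because the CPU-side term is evaluated every step, the GPU has to wait on the CPU each iteration, which is why this batch is more sensitive to CPU speed and PCIe latency than usual.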

kingcarcas
Joined: 27 Oct 09
Posts: 18
Credit: 378,626,631
RAC: 0
Message 42714 - Posted: 29 Jan 2016 | 8:53:22 UTC
Last modified: 29 Jan 2016 | 8:54:49 UTC

Good stuff. This is the sort of thing LinusTechTips over on YouTube does once in a while. How does it fare at x8?

fractal
Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 42717 - Posted: 29 Jan 2016 | 18:18:35 UTC - in response to Message 42709.
Last modified: 29 Jan 2016 | 18:19:12 UTC

My computer recently downloaded a work unit that normally takes 9-10 hours on my GTX 970, and to my astonishment BOINC tells me that it will finish in 1d 03:16:51. It is the new chalcone229x2-GERARD_CXCL12_DCKCHALK. Is anyone else seeing this?

I got a couple of those earlier this week. They started off with an estimate of a day and a half but finished in 10 hrs. How long did yours take?

The other question raised is one I have been pondering lately: how does PCIe bandwidth affect performance? The general "hive consensus" to date has been that projects need no more than what is provided by PCIe 1.0 x16. This was extensively tested by the bitcoin community, but it is highly dependent on the project and its workloads, so different projects will have different requirements.

My current generation of builds assumes that a PCIe 2.0 x8 slot can keep a GPU happy. This thread is starting to make me wonder if that is true. The cost of a system with 32 PCIe lanes and enough power to run two modern GPUs exceeds the cost of two basic systems with 16 PCIe lanes each and a modest PSU by a significant amount. A hundred-dollar bundle with a thirty-dollar PSU will easily handle any single GPU; a system with 32 PCIe lanes on two x16 slots means easily a two-hundred-dollar motherboard plus memory and CPU and a hundred-dollar PSU. Add to that the issues of cooling a system with dual 300-watt GPUs.

So, it used to be that we shopped for motherboards that ran a single slot at x16 but dropped to x8/x8 with a second card, instead of motherboards that ran x16 single-slot and dropped to x16/x4 with a second card. Have we entered an era where x8/x8 isn't fast enough any more?

Profile Retvari Zoltan
Message 42719 - Posted: 29 Jan 2016 | 20:43:38 UTC - in response to Message 42717.

Have we entered an era where x8/x8 isn't fast enough any more?
I wouldn't say that. WU batches come and go; some of them (for example the one this thread is about) are more CPU/PCIe bandwidth dependent. As a performance enthusiast I don't like to make compromises, so I wouldn't build a dual- (or multi-) GPU host for GPUGrid (though I have some). There's no point in spending the extra bucks on a more capable (s2011) motherboard and CPU (and cooling) if you have the space (and the will) to build two (or more) hosts, and you don't insist on seeing a single host of yours high on the hosts' overall & RAC toplist.

Bedrich Hajek
Message 42727 - Posted: 31 Jan 2016 | 11:48:40 UTC - in response to Message 42690.

This brings up another question about future WUs. Should they be more or less dependent on CPUs?
Now that's the million dollar question.



If WUs do become more CPU dependent, then having 2 or more CPUs feeding 1 GPU should offset this lag, unless this increases the PCIe bus traffic dramatically.

This should also reduce the WDDM lag.

Can this be done? If yes, how?



Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 42758 - Posted: 5 Feb 2016 | 18:12:16 UTC - in response to Message 42727.

The science will determine the reliance on the CPU. That said, this is a GPU project: it tries to do most of the work on the GPU, and it is designed to utilize gaming GPUs.

The problem identified in the OP wasn't the CPU; it was (mostly) a PCIe x4 bottleneck. That said, the PCIe controller is on the CPU, and the CPU does have to do some work. So there isn't an inherent need for another CPU (socket 2 on a dual-CPU board), and if there were, the PCIe controllers would need to be separated across both CPUs.

Perhaps the only way for WUs to avoid the WDDM overhead in recent MS OSes would be to use on-GPU CPUs. If this ever becomes a reality, the co-processor would get a dedicated 'main' processor (a developmental flip). For now there is still XP, 2003/2003R2 Server and Linux.
