
Message boards : Number crunching : HyperThreading (HT) and Simultaneous Multi-Threading (SMT)

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Message 54584 - Posted: 5 May 2020 | 14:29:45 UTC
Last modified: 5 May 2020 | 14:36:39 UTC

Intel's HT (Hyper-Threading) and AMD's SMT (Simultaneous Multi-Threading) were introduced to maximize CPU core utilization by duplicating the parts which feed the execution units, but they do not duplicate the parts which do the floating point math. Moreover, they work best with apps coded to be multi-threaded. Running many single-threaded apps (like Rosetta@home or Einstein@home) may be counterproductive if the apps (including the data they actually process) don't fit into the CPU's (L3) cache.
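
To make the cache argument concrete, here is a minimal back-of-the-envelope sketch (Python; the 128 MB L3 figure matches a Threadripper 3970X, the per-task working set is an illustrative assumption, and the L3 is treated as one shared pool for simplicity):

    # Rough estimate of the per-task L3 share as the task count grows.
    # Both figures below are illustrative assumptions, not measurements.
    L3_CACHE_MB = 128      # total L3 of a Threadripper 3970X
    WORKING_SET_MB = 8     # assumed hot data per task; varies by app/workunit

    for tasks in (16, 32, 64):
        share = L3_CACHE_MB / tasks
        verdict = "fits in L3" if share >= WORKING_SET_MB else "spills to RAM"
        print(f"{tasks} tasks: {share:.1f} MB of L3 each -> {verdict}")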

If you have a single gaming PC with a high-end CPU and a high-end GPU, and you intend to use it for crunching on both of them, there will be inevitable performance loss as you push the hardware toward its limits (i.e. running independent CPU tasks on every thread of a multi-threaded CPU), especially in the performance of the GPUGrid app, as it already pushes the GPU to its limits by itself.

I've posted about the performance of the AMD Threadripper 3970X on Einstein@home:
___________________________________________________________________________________________________________________

I recently built two identical HEDT PCs for a friend, and I tested them with Einstein@home CPU apps (among other tests).

Host 12807190, and 12807391.

CPU: AMD Ryzen Threadripper 3970X (32 cores/64 threads) (@4074MHz)
RAM: 4x32GB Kingston HyperX DDR4 3200MHz CL16 (HX432C16FB3/32) @3200MHz CL16-20-20
GPU: NVIDIA GeForce GTX 1650 (4095MB)
OS: Windows 10 Pro x64
I expected that running 64 tasks simultaneously would greatly reduce the performance of the app, so I disabled SMT in the BIOS. Even with "only" 32 tasks running, the run times were quite high: 47,000~52,000 secs (13h~14h30m). I decided to reduce the number of simultaneous tasks further, so I set "use at most 50% of the processors". I also wrote a little batch program to periodically set the CPU affinity of each task to the even-numbered cores, to spread the running tasks across the CPU chiplets (see the sketch after this quoted post). The run times dropped to 19,200~19,600 secs (5h20m~5h30m), while the power consumption rose by 30W (the CPU temperature also went up by 7°C).

32 tasks on 32 cores: 47,000~52,000 secs (13h~14h30m)  265% (32.65% performance loss)
16 tasks on 32 cores: 19,200~19,600 secs (5h20m~5h30m)  100%
My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.
__________________________________________________________________________________________________________________
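
For reference, a minimal sketch of what such an affinity-pinning helper could look like. The original was a Windows batch file; this version uses Python with the psutil library, and the process-name filter is a placeholder assumption:

    # Periodically pin each science-app task to an even-numbered logical CPU,
    # spreading the tasks across the CPU chiplets (sketch of the idea only).
    import time
    import psutil

    TASK_NAME = "einstein"  # placeholder: substring of the app's process name
    EVEN_CPUS = list(range(0, psutil.cpu_count(), 2))

    while True:
        tasks = [p for p in psutil.process_iter(["name"])
                 if TASK_NAME in (p.info["name"] or "").lower()]
        for i, p in enumerate(tasks):
            try:
                # Give each task its own even-numbered core, round-robin.
                p.cpu_affinity([EVEN_CPUS[i % len(EVEN_CPUS)]])
            except psutil.NoSuchProcess:
                pass  # task finished between listing and pinning
        time.sleep(60)  # re-apply periodically as new tasks start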

Recent AMD CPUs are built from 8-core CPU chiplets, plus a much larger chip that connects the CPU chiplets to each other and to the outside world (RAM, PCIe bus, etc.). As the (L2+L3) cache resides on the CPU chiplets, their sizes don't add up into a single shared pool. Intel CPUs have a single chip inside (this may change in the future), so the L3 cache is shared among all cores (and threads). Both architectures have pros and cons. For running many single-threaded tasks, Intel's architecture may be a little better (I haven't compared them, as I don't have a 10+ core Intel CPU).

However, it makes sense to reduce the number of simultaneous Rosetta@home (and other) tasks to the number of cores the CPU has, both for troubleshooting and to leave some memory bandwidth free for other apps.
That's why I suggested:
Try to limit the number of usable CPUs in BOINC manager to 50% (options -> computing preferences -> use at most 50% of the processors).
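
For reference, the same limit can also be set without the GUI by writing BOINC's local override file (global_prefs_override.xml in the BOINC data directory); a sketch, with 50% as the example value:

    <global_preferences>
       <max_ncpus_pct>50.0</max_ncpus_pct>
    </global_preferences>

After saving it, have the running client re-read it with "boinccmd --read_global_prefs_override".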

Zoltan you lost me on this one. Why 50%??? I would think 92% (11/12).
The actual percentage depends on many factors:
1. The number of GPUs in the system: the highest recommended number of CPU tasks is the maximum number of threads minus the number of GPUs in the system.
2. The CPU app itself: Rosetta uses up to 2 GB of RAM (typically up to 1 GB), so it consumes a lot of DDR bandwidth, making it more likely to be counterproductive to run many of them because of the next factor.
3. The ratio of memory bandwidth to the number of processor cores: high core count (10+) processors have "only" 4-channel memory, which results in a lower memory-bandwidth-per-core ratio than a 2-core CPU with 2-channel memory has. To compensate, the L3 cache is made larger in high core count processors, but it may still be insufficient for many separate apps running simultaneously (depending on the app); see the arithmetic sketch below.
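
A rough arithmetic sketch of that third factor (Python; peak bandwidth = channels × transfer rate × 8 bytes, which ignores real-world overheads, and the configurations are illustrative):

    # Peak DDR4 bandwidth per core for a few illustrative configurations.
    def gb_per_s_per_core(channels, mt_per_s, cores):
        peak_gb_s = channels * mt_per_s * 8 / 1000  # 8 bytes per transfer
        return peak_gb_s / cores

    configs = {
        "2-core desktop, 2-channel DDR4-3200": (2, 3200, 2),
        "Threadripper 3970X, 4-channel":       (4, 3200, 32),
        "Epyc-class, 8-channel":               (8, 3200, 32),
    }
    for name, (ch, mts, cores) in configs.items():
        print(f"{name}: {gb_per_s_per_core(ch, mts, cores):.1f} GB/s per core")

The 32-core part ends up with roughly an eighth of the per-core bandwidth of the small desktop chip; that gap is what the larger L3 cache is meant to hide.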

Let's evaluate the CPU-Z benchmark of my 4-core 8-thread i7-4770K:
[Screenshot of four CPU-Z benchmark windows: the upper row is the older (non-AVX) test, the lower row the newer (AVX) test; the left column was run with 8 threads, the right column with 6 threads. Look for the "Multi-thread ratio" value.]

Multi-thread ratio     8 threads      6 threads      4 threads (not in the picture)
older test (non-AVX)   5.13 (-2.87)   4.39 (-1.61)   3.77 (-0.23)
newer test (AVX)       5.52 (-2.48)   4.64 (-1.36)   3.82 (-0.18)
The most important outcome is that the multi-thread ratio is nowhere near 8 for 8 threads, and also well below 6 for 6 threads.
You can read the above as: "if I can squeeze another 0.8 core's worth of performance out of the CPU by running 8 threads, then I will". But real-world apps (especially those that use a lot of memory) may not scale that well with the number of threads, and single-threaded apps in particular, as their memory usage and memory bandwidth needs also scale up with the task count. The tasks degrade each other's performance, and also the performance of parts of the PC thought to be independent of the CPU (such as GPU apps), proving that they are not independent.
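
A small sketch of how to turn those multi-thread ratios into per-thread scaling efficiency (Python; the ratios are the CPU-Z values quoted above):

    # Scaling efficiency = multi-thread ratio / thread count.
    # Perfect scaling would stay at 100%; SMT threads add far less than cores.
    measurements = {
        ("non-AVX", 4): 3.77, ("non-AVX", 6): 4.39, ("non-AVX", 8): 5.13,
        ("AVX", 4): 3.82, ("AVX", 6): 4.64, ("AVX", 8): 5.52,
    }
    for (test, threads), ratio in sorted(measurements.items()):
        print(f"{test:8s} {threads} threads: ratio {ratio:.2f} "
              f"-> {ratio / threads:.0%} efficiency")

On this 4-core part the efficiency stays around 94-95% up to 4 threads, then falls to roughly 65-69% at 8 threads; that knee is what the next paragraphs discuss.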

For example, GPUGrid's performance will suffer more as you increase the number of simultaneous CPU tasks, so you'll lose more credit through the degradation of GPU performance than you gain on the CPU side. According to my earlier measurements, running more than 1 Rosetta task causes more credit loss on GPUGrid than it gains (if you use a high-end GPU for GPUGrid). That's one of the reasons I build PCs for GPU crunching from cheap (i3) CPUs and cheap motherboards (with only a single PCIe 3.0 x16 slot), with a single GPU each. This way it doesn't bother me that I don't use their CPUs for crunching. Recent CPUs may perform a little better, but not enough to change my mind.

If you test a recent HT or SMT CPU, you'll get similar results: the multi-thread ratio scales (almost) 1:1 up to the number of physical cores your CPU has; above that it scales at 1:4 or worse. So there's not much to gain from running more CPU tasks than the number of CPU cores (depending on how demanding the app is), though it could be worth running 1 more task than that (on high core count CPUs, 2 or 3 more). You have to find the exact number by experimenting; it should be near or equal to the number of cores if you use the computer while it's crunching.

The Rosetta@home app is a difficult one to benchmark: it runs for a given period of time, so a workunit has no fixed length (measured in floating point operations). Therefore we can't compare completion times to see the actual performance degradation caused by overcommitting the CPU. We could compare the credits awarded per day, but those aren't fixed either: if a given workunit gets lucky (= the protein is folded well), it receives more credit. The most credit I receive for 24h workunits is about 1,400 credits per core, as I run at most as many simultaneous Rosetta@home tasks as my CPU has cores (or even fewer).

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,493,857,483
RAC: 71,175,505
Message 54586 - Posted: 5 May 2020 | 14:53:15 UTC - in response to Message 54584.
Last modified: 5 May 2020 | 14:55:08 UTC

[...] I also wrote a little batch program to periodically set the CPU affinity of each task to the even-numbered cores, to spread the running tasks across the CPU chiplets. The run times dropped to 19,200~19,600 secs (5h20m~5h30m), while the power consumption rose by 30W.
32 tasks on 32 cores: 47,000~52,000 secs (13h~14h30m)  265% (32.65% performance loss)
16 tasks on 32 cores: 19,200~19,600 secs (5h20m~5h30m)  100%
My conclusion: perhaps it's not worth buying this CPU for Einstein@home due to L3 cache / memory bandwidth limitations.


Did you take into account the increase in core clock speed you'd get by doing this? Reducing the load (the number of active cores) allows the CPU to boost its clock frequency higher.

Did you try testing the middle of the spectrum, and not just the extremes? It's well known that having the CPU at 100% load causes contention for resources and slow run times, but I would like to see the results at 80-90% load. At that point the extra performance from more cores working should outweigh the clock speed benefit of running at only 50% load.
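
A sketch of how such a sweep could be automated, reusing the override file mentioned above (Python; the data-directory path is an assumption about a stock Linux install, and each setting is simply left to run for a day):

    # Sweep "use at most N% of the processors" so run times at
    # 50/80/90/100% load can be compared afterwards.
    import subprocess, time

    OVERRIDE = "/var/lib/boinc-client/global_prefs_override.xml"  # assumed path

    for pct in (50, 80, 90, 100):
        with open(OVERRIDE, "w") as f:
            f.write("<global_preferences>\n"
                    f"  <max_ncpus_pct>{pct}</max_ncpus_pct>\n"
                    "</global_preferences>\n")
        # Tell the running client to re-read the override file.
        subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)
        print(f"Now running at {pct}% of processors")
        time.sleep(24 * 3600)  # let a batch of tasks complete at this setting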

Turning off HT/SMT likely helps if you don't have enough system memory. Rosetta seems to use up to 2 GB per task, meaning that if you wanted to run 60+ tasks simultaneously, you should probably have more than 128 GB of system RAM (64 tasks × 2 GB = 128 GB before the OS takes its share). Faster RAM with low latency will also help here.

For these kinds of CPU/RAM-intensive workloads, I would probably go with the Epyc line of processors instead, giving you 8 memory channels rather than the 4 you get with Threadripper.
____________

Aurum
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,100,382
RAC: 766,238
Message 54587 - Posted: 5 May 2020 | 18:26:08 UTC - in response to Message 54584.

Zoltan, that was beautiful. Thank you so much for taking the time to run those tests and explain them in detail. I've been wondering for 2 years, ever since we discovered the problem with WCG's MIP, whether it hampers Rosetta as well.

Since last night I've been running all Rosetta and GG with one CPU of headroom on my Storj nodes. I'm thinking I could pair computers with the same CPU, GPU and RAM into HT versus non-HT configurations and observe the slope change on the BOINC Manager Statistics graph.
One problem is that as soon as OpenPandemics kicks off, I'll drop Rosetta like a hot potato.

[AF] fansyl
Joined: 26 Sep 13
Posts: 20
Credit: 1,713,956,441
RAC: 466,197
Message 55040 - Posted: 4 Jun 2020 | 17:39:33 UTC

More information about bottlenecks on x86/GPU systems: https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,42399_offset,0#627426

