1) Message boards : Multicore CPUs : New batch of QC tasks (QMML) (Message 48454)
Posted 2315 days ago by Sebastian M. Bobrecki
After about 10h and reaching 69.568%, the app started to use only one core. What's worse, it has stayed in that state for another 10h, and perf indicates that it's in an OMP spinlock:

83.49% python libiomp5.so [.] __kmp_wait_yield_4
6.76% python libiomp5.so [.] __kmp_eq_4
5.74% python libiomp5.so [.] __kmp_yield
0.66% python [kernel.vmlinux] [k] entry_SYSCALL_64
...

Edit: On a second machine it looks similar, but after 6h and 78.698% it has stayed in that state for about 11h now. Perf:

84.60% python libiomp5.so [.] __kmp_wait_yield_4
6.80% python libiomp5.so [.] __kmp_eq_4
5.77% python libiomp5.so [.] __kmp_yield
0.59% python [kernel.vmlinux] [k] entry_SYSCALL_64
0.37% python [kernel.vmlinux] [k] __schedule
...
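For anyone who wants to cross-check the symptom without perf, here is a minimal sketch that samples per-thread CPU time from /proc/&lt;pid&gt;/task/*/stat on Linux. The target PID would be the stuck python process; the threshold and interval values are assumptions for illustration, not anything from the app itself.

```python
# Sketch: confirm the "only one busy core" symptom by sampling per-thread
# CPU time from /proc/<pid>/task/*/stat (Linux only). Spinning OpenMP
# worker threads still burn CPU, so a truly idle thread set vs. a spinlock
# looks different here than in perf.
import os
import time

def thread_cpu_seconds(pid):
    """Return {tid: cpu_seconds consumed so far} for every thread of pid."""
    tick = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    usage = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        with open(f"/proc/{pid}/task/{tid}/stat") as f:
            # strip "pid (comm) " prefix; comm may contain spaces
            fields = f.read().rsplit(")", 1)[1].split()
            # after comm: fields[11] = utime, fields[12] = stime (in ticks)
            usage[int(tid)] = (int(fields[11]) + int(fields[12])) / tick
    return usage

def busy_threads(pid, interval=1.0, threshold=0.5):
    """Threads that used more than `threshold` of one core over `interval`."""
    before = thread_cpu_seconds(pid)
    time.sleep(interval)
    after = thread_cpu_seconds(pid)
    return [tid for tid in after
            if after[tid] - before.get(tid, 0.0) > threshold * interval]

# Example: sample our own (mostly idle) process.
print(busy_threads(os.getpid(), interval=0.2))
```

If all worker threads still show up as busy while progress is frozen, that matches the __kmp_wait_yield_4 spin-wait in the perf output above.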
2) Message boards : Multicore CPUs : Error with more than 32 threads (Message 39138)
Posted 3419 days ago by Sebastian M. Bobrecki
Limited to 32 now.


Does this affect what we are talking about here:
"Increasing the maximum number of coprocessor devices to 64."

That is one of the changes in the latest version of Boinc that was just released for testing.
http://boinc.berkeley.edu/dev/forum_thread.php?id=8649

No, it is about a specific feature of the GROMACS application, which is used by GPUGRID for these tasks.
3) Message boards : Multicore CPUs : Message about suboptimal build (Message 39129)
Posted 3420 days ago by Sebastian M. Bobrecki
Nothing important.

MJH
However, I think it is important, because performance is very low. My notebook, which has 8 threads clocked at 1.86GHz, achieves 7.603 ns/day. A system with 64 threads (32 used by the application) clocked at 2.5GHz is only about 2 times faster, reaching just 15.829 ns/day.

According to the study by Professor Agner Fog from the Technical University of Denmark, on processors with the Bulldozer and Piledriver architectures, quote:
"- The throughput of 256-bit store instructions is less than half the throughput of 128-bit store instructions on Bulldozer and Piledriver. It is particularly bad on the Piledriver, which has a throughput of one 256-bit store per 17 - 20 clock cycles.
- 128-bit register-to-register moves have zero latency, while 256-bit register-to-register moves have a latency of 2 clocks plus a penalty of 2-3 clocks for using a different domain (see below) on Bulldozer and Piledriver."
and:
"Therefore, there is no advantage in using 256-bit instructions on Bulldozer and Piledriver when the bottleneck is execution unit throughput or instruction decoding. The poor throughput of 256-bit stores makes it a disadvantage to use 256-bit registers on the Piledriver."

This is probably the reason why the GROMACS developers took the time to develop an appropriate optimization. Quote from the GROMACS project site:
"Currently the supported acceleration options are: none, SSE2, SSE4.1, AVX-128-FMA (AMD Bulldozer + Piledriver) and AVX-256 (Intel Sandy+Ivy Bridge)."
and:
"On x86, the performance difference between SSE2 and SSE4.1 is minor. All other, higher acceleration differences are significant."

Therefore, I think it would be good to also have an application version built with such optimizations. I'd certainly be delighted.
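To make the request concrete, here is a hedged sketch of how a build-selection check could route Bulldozer/Piledriver hosts to an AVX_128_FMA binary. It parses /proc/cpuinfo fields as the Linux kernel reports them; the mapping itself is illustrative and is not GPUGRID's actual plan-class logic.

```python
# Sketch (assumption, not project code): choose a GROMACS acceleration
# build from CPU vendor, family, and feature flags, following the logic
# in the quoted GROMACS docs and Agner Fog's measurements.
def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style text into a {field: value} dict."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            info.setdefault(key.strip(), val.strip())
    return info

def pick_acceleration(info):
    flags = info.get("flags", "").split()
    if "avx" not in flags:
        return "SSE4.1" if "sse4_1" in flags else "SSE2"
    # AMD family 21 (0x15) covers Bulldozer and Piledriver: the 128-bit
    # FMA4 path is faster there despite AVX support, because 256-bit
    # stores and register moves are slow on these cores.
    if (info.get("vendor_id") == "AuthenticAMD"
            and info.get("cpu family") == "21"
            and "fma4" in flags):
        return "AVX_128_FMA"
    return "AVX_256"
```

For example, a Piledriver host (`vendor_id: AuthenticAMD`, `cpu family: 21`, flags including `avx fma4`) would get AVX_128_FMA, while a Sandy Bridge host would still get AVX_256.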
4) Message boards : Multicore CPUs : Error with more than 32 threads (Message 39118)
Posted 3420 days ago by Sebastian M. Bobrecki
To make this work on your machine I think you could limit the number of cpu's to 50% on that machine, get work, set it no new work, and then reset the cpu's back to 100%.
Yes, I had already figured that out.

This should force the cpu units to ONLY use the cpu's it saw when the units were downloaded, yet let you run other tasks on the other cores. It will be a pain to remember the steps every time you need more work, but after a few times it could become routine.
In my case it didn't work that way. I mean, when I set the CPU count back to 100%, the other tasks didn't start.

Setting the limit WITHIN the app though would be a MUCH better option, but that is the projects job and nothing a cruncher can do about it.
I think that setting this in the application, or perhaps somehow in the project's server configuration, is the only sensible solution.
5) Message boards : Multicore CPUs : Message about suboptimal build (Message 39115)
Posted 3420 days ago by Sebastian M. Bobrecki
I got this message:
Compiled SIMD instructions: AVX_256 (Gromacs could use AVX_128_FMA on this machine, which is better)
and
The current CPU can measure timings more accurately than the code in
mdrun_mtavx.901 was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding mdrun_mtavx.901 with the GMX_USE_RDTSCP=OFF CMake option.
6) Message boards : Multicore CPUs : Error with more than 32 threads (Message 39114)
Posted 3420 days ago by Sebastian M. Bobrecki
I got this error:
Fatal error:
64 OpenMP threads were requested. Since the non-bonded force buffer reduction is prohibitively slow with more than 32 threads, we do not allow this. Use 32 or less OpenMP threads.
In that case, the maximum number of threads should be limited either in the application or on the server side.
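As an illustration of the application-side fix, here is a minimal sketch of a launcher clamp. The cap value comes straight from the mdrun error message above; the wrapper itself (function name, `MAX_OMP_THREADS` constant) is a hypothetical example, not the project's actual code.

```python
# Sketch: clamp the OpenMP thread request before launching mdrun, so a
# 64-thread host never trips the "Use 32 or less OpenMP threads" fatal
# error. MAX_OMP_THREADS mirrors the limit stated by GROMACS.
import os

MAX_OMP_THREADS = 32

def clamped_omp_env(requested_threads):
    """Return a copy of the environment with OMP_NUM_THREADS capped."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(min(requested_threads, MAX_OMP_THREADS))
    return env
```

A BOINC wrapper could then pass `env=clamped_omp_env(ncpus)` to `subprocess.Popen` when starting the science app, leaving the remaining cores free for other tasks.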

