Message boards : Number crunching : ATM free energy calculation -> GPU overheated and kicked off bus
Author | Message |
---|---|
The NVIDIA GPU was running at 93C and was then suddenly kicked off the bus, due to overheating, I am guessing:

    Mon Mar 20 06:00:36 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
    | N/A   93C    P0    N/A /  N/A |    170MiB /  4096MiB |     98%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      2706      G   /usr/lib/xorg/Xorg                  4MiB |
    |    0   N/A  N/A      5626      C   python                            164MiB |
    +-----------------------------------------------------------------------------+

    Mon Mar 20 06:00:41 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  ERR! | 00000000:02:00.0 Off |                  N/A |
    |ERR!  ERR! ERR!    N/A /  ERR! |     GPU is lost      |     ERR!        ERR! |
    |                               |                      |                 ERR! |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
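For anyone who wants to catch this before the card drops off the bus, here is a rough sketch of polling the temperature in a script instead of reading the full table. It relies on `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`; the function names and the 85 C warning limit are my own assumptions, not anything from the project or the driver.

```python
import subprocess
from typing import Optional, Tuple

# Ask nvidia-smi for temperature (C) and utilization (%) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> Optional[Tuple[int, int]]:
    """Parse one CSV line such as '93, 98' into (temp_c, util_pct).

    Returns None when a field is not an integer, e.g. '[N/A]' or an
    error message printed after the GPU is lost.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


def too_hot(temp_c: int, limit_c: int = 85) -> bool:
    """True if the temperature is at or above the (assumed) limit."""
    return temp_c >= limit_c


def poll_once() -> Optional[Tuple[int, int]]:
    """Run nvidia-smi once; None if it fails or the output is unusable."""
    try:
        out = subprocess.run(
            QUERY, capture_output=True, text=True, timeout=10, check=True
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    lines = out.strip().splitlines()
    return parse_sample(lines[0]) if lines else None
```

Calling `poll_once()` every few seconds and suspending GPU work when `too_hot()` returns True would be one way to use it; a `None` result is worth treating as a warning too, since that is roughly what the second snapshot above looks like.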
ID: 60114 | Rating: 0 | rate:
Yes, fan not running will do that. Fix the fan so that it runs and try again. | |
ID: 60115 | Rating: 0 | rate:
What brand and model GPU? E.g., MSI and Gigabyte fans don't last very long but EVGA fans do. Easy to replace. | |
ID: 60116 | Rating: 0 | rate:
"The NVIDIA GPU running at 93C, then suddenly kicked off from the bus due to overheating" This symptom would also fit an overheated crunching laptop. More details are given in my Message #52937 | |
ID: 60117 | Rating: 0 | rate:
It's a laptop with a very old GPU. Just look at his profile and hosts. This is the same problem he had with the acemd3 tasks, where the GPU overheated and dropped off the bus. | |
ID: 60119 | Rating: 0 | rate:
I just saw the reported 0 rpm fan speed in the nvidia-smi output and commented. Didn't look into the actual hardware. | |
ID: 60122 | Rating: 0 | rate:
"... laptops in general aren't good candidates for BOINC due to the limited cooling under 24/7 use. laptop cooling systems really aren't designed for that." Recently I was toying with the idea of buying a new laptop with an RTX 3070 or even a 3080 inside, but exactly what you are saying kept me from doing it. | |
ID: 60124 | Rating: 0 | rate:
Run only the selected applications ACEMD 3: no | |
ID: 60323 | Rating: 0 | rate:
You missed one critical setting in Project Preferences. | |
ID: 60324 | Rating: 0 | rate:
I am using a gaming laptop with an RTX 3060 to crunch ATM, which works fine. As mentioned before, the heat needs to be controlled more carefully in a laptop. When I let the GPU crunch, the CPU stays in idle, so the only heat is generated by the GPU. Since Python tasks also use a few CPU threads at the same time, I manually set the CPU frequency to 1300 MHz. That speeds up the calculation, because otherwise the CPU would stay at 400 MHz, but it doesn't add much heat compared with leaving it at idle. The GPU runs at 80 degrees C. System is a Ryzen 7 5800H from Asus. | |
ID: 60531 | Rating: 0 | rate:
"I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU." How can the CPU stay idle while crunching an ATM task? | |
ID: 60532 | Rating: 0 | rate:
"I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU." All of these Python-based applications are very 'bursty'. IOW, they use the CPU and GPU only intermittently, flipping back and forth between the two processing elements. | |
ID: 60533 | Rating: 0 | rate:
Exactly! That's why I wrote "otherwise in idle", meaning only the ATM task is being crunched. | |
ID: 60534 | Rating: 0 | rate:
If you throttle the CPU speed down to idle to save watts and heat, the only consequence is longer-running tasks, which may risk missing the credit bonuses. | |
ID: 60535 | Rating: 0 | rate:
Not necessarily. With this setup my ATM beta tasks finish after around 6 hours on this Linux host, awarding 1.1 million credit, and neither the CPU nor the GPU overheats. If I were to let them loose the way they were programmed, the CPU would stay at 400 MHz even when the ATM task needs it. So manually raising it to 1300 MHz decreases CPU calculation times. That way the CPU temp doesn't exceed 75 degrees and the GPU stays at around 80. It depends on the project you run and your personal boldness what temps you're willing to accept. I like the CPU to peak at 75 degrees and the GPU a little over 80. That's why I usually run only GPU or CPU work. Both together may be too much. An exception is Milkyway, which can be run in parallel if the CPU gets throttled to 1000 MHz. This is just an example to show that ATM tasks can be run on a laptop. | |
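The post does not say which tool was used to set the frequency. As one possible approach on Linux, here is a minimal sketch that writes the cpufreq sysfs files (`scaling_min_freq` / `scaling_max_freq`, values in kHz). The helper names are hypothetical, it needs root on a real system, and whether a given value is accepted depends on the cpufreq driver and the current limits.

```python
from pathlib import Path
from typing import List

# Standard location of the Linux cpufreq sysfs tree.
CPU_ROOT = Path("/sys/devices/system/cpu")

ALLOWED = ("scaling_min_freq", "scaling_max_freq")


def mhz_to_khz(mhz: int) -> int:
    """cpufreq sysfs files take kHz, forum posts usually quote MHz."""
    return mhz * 1000


def write_cpufreq(setting: str, mhz: int, cpu_root: Path = CPU_ROOT) -> List[Path]:
    """Write the frequency to `setting` for every cpuN under cpu_root.

    Returns the list of files written. Raises ValueError for any
    setting other than scaling_min_freq / scaling_max_freq.
    """
    if setting not in ALLOWED:
        raise ValueError(f"unsupported setting: {setting}")
    written = []
    # cpu[0-9]* skips sibling directories such as cpufreq/ and cpuidle/.
    for f in sorted(cpu_root.glob(f"cpu[0-9]*/cpufreq/{setting}")):
        f.write_text(f"{mhz_to_khz(mhz)}\n")
        written.append(f)
    return written
```

For example, `write_cpufreq("scaling_max_freq", 1000)` would cap all cores at 1000 MHz, like the Milkyway case above. Holding the CPU at 1300 MHz would mean writing both `scaling_max_freq` and `scaling_min_freq` with 1300; the order of the two writes can matter depending on the current limits, so treat this as a starting point rather than a recipe.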
ID: 60536 | Rating: 0 | rate: