Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU

Author Message
Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60963 - Posted: 12 Jan 2024 | 13:03:21 UTC

Hello GPUGRID!

We are deploying a new app "PYSCFbeta: Quantum chemistry calculations on GPU". It is currently in testing/beta stage. It is only on Linux at the moment.

The app performs quantum chemistry calculations. At the moment we are using it specifically for Density Functional Theory calculations: http://en.wikipedia.org/wiki/Density_functional_theory

These types of calculations allow us to accurately compute specific properties of small molecules.
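For readers curious what such a calculation looks like in code, here is a minimal sketch using the open-source PySCF/gpu4pyscf packages; the molecule, basis set and functional below are illustrative assumptions, not the project's actual inputs.

# Minimal GPU DFT single-point energy sketch (illustrative only; assumes gpu4pyscf is installed)
from pyscf import gto
from gpu4pyscf.dft import rks

# a small example molecule (water); the project's molecules and settings differ
mol = gto.M(
    atom="O 0.000 0.000 0.000; H 0.757 0.586 0.000; H -0.757 0.586 0.000",
    basis="def2-tzvpp",
)
mf = rks.RKS(mol, xc="B3LYP")   # restricted Kohn-Sham DFT, runs on the GPU
energy = mf.kernel()            # run the self-consistent field iterations
print("Total energy (Hartree):", energy)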


The current test work units have a runtime on the order of 1 hour (very much dependent on the GPU speed and the size of the molecule). Each work unit currently contains 1 molecule with ~10 configurations.

The app will not work on GPUs with compute capability less than 6.0. The scheduler should not be sending work to these cards, but I think at the moment this check is not working properly.

The work units require a lot of GPU memory. They work best when a work unit is the only thing running on the GPU; if other programs are using significant GPU memory, the work unit might fail.

Looking forward to hearing feedback from you.

Steve

Skillz
Joined: 6 Jun 17 | Posts: 4 | Credit: 8,090,535,479 | RAC: 49,348,086
Message 60964 - Posted: 12 Jan 2024 | 13:32:48 UTC

When can we expect to start getting these new tasks?

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60965 - Posted: 12 Jan 2024 | 13:56:16 UTC

Now, if you are using Linux and have "Run test applications?" selected.

roundup
Joined: 11 May 10 | Posts: 57 | Credit: 1,724,820,193 | RAC: 13,910,294
Message 60966 - Posted: 12 Jan 2024 | 13:57:24 UTC - in response to Message 60964.
Last modified: 12 Jan 2024 | 14:19:26 UTC

When can we expect to start getting these new tasks?

They are being distributed RIGHT NOW.
The first 6 WUs have arrived here.

bormolino
Joined: 16 May 13 | Posts: 41 | Credit: 79,726,864 | RAC: 274
Message 60967 - Posted: 12 Jan 2024 | 14:00:40 UTC

I only get "No tasks sent".

Test applications are allowed and I have compute capability 8.6 with 12 GB of GPU memory, running Ubuntu.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60968 - Posted: 12 Jan 2024 | 15:14:43 UTC

Steve,

There is an issue with this application that will only be apparent on multi-GPU systems.

The application seems to be hard-coded in some way to always use GPU 0, or the BOINC device assignment is somehow not being correctly communicated to the app.

This results in all tasks running on the same GPU when they should be split across different GPUs. Due to the high VRAM use, this fills the VRAM on most GPUs and causes errors.

See here:

GLaDOS:~$ nvidia-smi
Fri Jan 12 10:05:59 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN V On | 00000000:21:00.0 On | N/A |
| 80% 55C P2 88W / 150W | 9453MiB / 12288MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN V On | 00000000:22:00.0 Off | N/A |
| 80% 34C P2 36W / 150W | 42MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN V On | 00000000:42:00.0 Off | N/A |
| 80% 42C P2 39W / 150W | 42MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN V On | 00000000:61:00.0 Off | N/A |
| 80% 35C P2 36W / 150W | 42MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1612 G /usr/lib/xorg/Xorg 94MiB |
| 0 N/A N/A 1961 C+G ...libexec/gnome-remote-desktop-daemon 311MiB |
| 0 N/A N/A 2000 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 223543 M+C python 4490MiB |
| 0 N/A N/A 223769 M+C python 4462MiB |

| 1 N/A N/A 1612 G /usr/lib/xorg/Xorg 6MiB |
| 1 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
| 2 N/A N/A 1612 G /usr/lib/xorg/Xorg 6MiB |
| 2 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
| 3 N/A N/A 1612 G /usr/lib/xorg/Xorg 6MiB |
| 3 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
+---------------------------------------------------------------------------------------+


As highlighted above, both python processes (PIDs 223543 and 223769) are running on the same GPU (GPU 0).
____________

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60969 - Posted: 12 Jan 2024 | 15:19:15 UTC

Also, could you please add an explicit QChem for GPU selection on the project preferences page? Currently it is only possible to get this app if you have ALL apps selected plus test apps. I want to exclude some apps but still get this one.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60970 - Posted: 12 Jan 2024 | 15:21:46 UTC - in response to Message 60968.

Ah yes, thank you for confirming this! This is an omission in the scripts on my end. My test machine has one GPU, so I missed it. This can be fixed, thank you.

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60971 - Posted: 12 Jan 2024 | 15:23:29 UTC

I will try to get the web interface updated, but this will take longer due to my unfamiliarity with it. Thanks.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60972 - Posted: 12 Jan 2024 | 16:20:22 UTC

just a hunch but I think the problem is with your export command in the run.sh

you have:

export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE


which, if I'm reading it right, will set all visible devices to just one GPU. This will have a bad impact on any other tasks running in the BOINC environment, I think.

Normally on my 4x GPU system I have CUDA_VISIBLE_DEVICES=0,1,2,3, and if you override that to just the single CUDA device it seems to shuffle all tasks there instead.
____________

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60973 - Posted: 12 Jan 2024 | 17:45:08 UTC - in response to Message 60972.

just a hunch but I think the problem is with your export command in the run.sh

you have:
export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE


which if I'm reading it right, will set all visible devices to just one GPU. this will have a bad impact for any other tasks running in the BOINC environment i think.

normally on my 4x GPU system, I have CUDA_VISIBLE_DEVICES=0,1,2,3, and if you override that to just the single CUDA device it seems to shuffle all tasks there instead.


I guess this wasn't the problem after all :) I see a new small batch went out; I downloaded some and they are working fine now.

____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60974 - Posted: 12 Jan 2024 | 18:02:58 UTC - in response to Message 60973.

just a hunch but I think the problem is with your export command in the run.sh

you have:
export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE


which if I'm reading it right, will set all visible devices to just one GPU. this will have a bad impact for any other tasks running in the BOINC environment i think.

normally on my 4x GPU system, I have CUDA_VISIBLE_DEVICES=0,1,2,3, and if you override that to just the single CUDA device it seems to shuffle all tasks there instead.


I guess this wasn't the problem after all :) I see a new small batch went out; I downloaded some and they are working fine now.


Hello, can you confirm the latest WUs are getting assigned to different GPUs in the way you would expect?


The line in the script you mentioned is actually the fix I just made. In the first round I had forgotten to include it.

When the BOINC client runs the app via the wrapper mechanism, it specifies the GPU device, which we capture in the variable CUDA_DEVICE. The Python CUDA code in our app uses the CUDA_VISIBLE_DEVICES variable to choose the GPU. When it is not set (as in the first round of jobs) it defaults to zero, so all jobs end up on GPU zero. With this fix the WUs will run on the device specified by the BOINC client.
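For illustration, a minimal Python sketch of the mechanism described above (variable handling is an assumption for illustration, not the project's actual code): the wrapper's export makes only the BOINC-assigned GPU visible, so CUDA code that would otherwise default to device 0 lands on the correct card.

import os

# run.sh does the equivalent of: export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE
# If CUDA_VISIBLE_DEVICES is left unset, CUDA enumerates all GPUs and the code
# defaults to device 0, which is why the first batch piled every task onto GPU 0.
boinc_device = os.environ.get("CUDA_DEVICE", "0")          # device index passed by the wrapper
os.environ.setdefault("CUDA_VISIBLE_DEVICES", boinc_device)

import cupy as cp  # import after masking, so only the assigned GPU is visible
print("visible CUDA devices:", cp.cuda.runtime.getDeviceCount())  # prints 1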

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60975 - Posted: 12 Jan 2024 | 18:09:30 UTC - in response to Message 60974.

yup. I just ran 4 tasks on the same 4-GPU system and each one went to a different GPU as it should.

I see in the stderr that the device was selected properly.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60976 - Posted: 12 Jan 2024 | 18:12:10 UTC - in response to Message 60975.

Thanks very much for the help!

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60977 - Posted: 12 Jan 2024 | 18:13:10 UTC - in response to Message 60975.
Last modified: 12 Jan 2024 | 18:31:33 UTC

Also, does this app make much use of FP64? I'm noticing very fast runtimes on a Titan V, even faster than something like an RTX 3090. The Titan V is slower in FP32, but roughly 14x faster in FP64.

It's hard to follow the code, but I did see that you use CuPy a lot, and maybe something in CuPy is able to accelerate the Titan V in some way.

Or maybe it's a Tensor core difference? Does this QChem app use the tensor cores?
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60978 - Posted: 12 Jan 2024 | 19:07:31 UTC - in response to Message 60977.

Yes, this app does make use of some double-precision arithmetic; high precision is needed in QM calculations. The bulk of the crunching is done by NVIDIA's cuSOLVER library, which I believe uses tensor cores when available.
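As an illustration of the kind of FP64 work involved (a sketch, not the app's actual code): diagonalizing a symmetric matrix is a core step of an SCF calculation, and CuPy dispatches it to cuSOLVER in double precision.

import cupy as cp

n = 2000
a = cp.random.rand(n, n, dtype=cp.float64)   # random FP64 matrix (illustrative size)
a = (a + a.T) / 2                            # symmetrize it
w, v = cp.linalg.eigh(a)                     # eigendecomposition via cuSOLVER, all in FP64
print("largest eigenvalue:", float(w[-1]))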

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60979 - Posted: 12 Jan 2024 | 19:10:08 UTC - in response to Message 60978.

Awesome, thanks for that info.

Looking forward to you re-releasing all the tasks you had to pull back earlier :)
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60980 - Posted: 12 Jan 2024 | 19:16:45 UTC - in response to Message 60979.

Yes we will restart the large scale test next week!

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 60981 - Posted: 12 Jan 2024 | 20:52:58 UTC - in response to Message 60980.

+1

GWGeorge007
Joined: 4 Mar 23 | Posts: 10 | Credit: 1,804,577,500 | RAC: 7,184,174
Message 60982 - Posted: 13 Jan 2024 | 11:51:35 UTC

+1
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60984 - Posted: 15 Jan 2024 | 13:07:46 UTC

Sending out work for this app today. The work units take about an hour (very approximately). They should be using different GPUs on multi-GPU systems. Please let me know if you see anything not working as you would normally expect.

ServicEnginIC
Joined: 24 Sep 10 | Posts: 566 | Credit: 6,321,177,024 | RAC: 17,362,450
Message 60985 - Posted: 15 Jan 2024 | 13:50:37 UTC

Everything working as expected at my hosts.
Well done!
👍️

Freewill
Joined: 18 Mar 10 | Posts: 13 | Credit: 7,093,713,894 | RAC: 40,189,547
Message 60986 - Posted: 15 Jan 2024 | 13:57:02 UTC - in response to Message 60984.

Steve, so far the first few tasks are completing and being validated for me on single and multi-GPU systems.

Drago
Joined: 3 May 20 | Posts: 12 | Credit: 342,481,560 | RAC: 449,239
Message 60987 - Posted: 15 Jan 2024 | 15:45:24 UTC

My host is an R9-3900X with an RTX 3070 Ti running Ubuntu 20.04.6 LTS, but it doesn't receive Quantum chemistry work units. I selected it in the preferences, along with test work and "ok to send work of other subprojects". Did I miss anything?

Richard Haselgrove
Joined: 11 Jul 09 | Posts: 1576 | Credit: 5,887,311,851 | RAC: 10,471,281
Message 60988 - Posted: 15 Jan 2024 | 16:18:22 UTC - in response to Message 60987.

My host is an R9-3900X, RTX 3070-Ti running ubuntu 20.04.06 LTS but it doesn't receive Quantum chemistry work units. I selected it in the preferences, test work and "ok to send work of other subprojects". Did I miss anything?

I had the same problem until I ticked every available application for the venue, resulting in "(all applications)" showing on the confirmation page.

Having cleared that hurdle, I note that the tasks are estimated to run for 1 minute 36 seconds (slower device) and 20 seconds (fastest device). The machines have most recently been running ATMbeta (Python) tasks, and have been left with "Duration Correction Factors" of 0.0148 and 0.0100 as a result. The target value should be 1.0000 in all cases. Could you please keep an eye on the <rsc_fpops_est> value for each workunit type, to try to minimise these large fluctuations when new applications are deployed?
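For context, a rough sketch of the arithmetic behind those estimates (a simplification of the BOINC client's logic, with made-up numbers): the client scales the work estimate by the device's estimated speed and then by the leftover DCF, so a DCF of ~0.01 shrinks an hour-long task to seconds.

# all numbers below are hypothetical, for illustration only
rsc_fpops_est = 1.0e15       # server-side estimate of the work in the task (FLOPs)
est_device_flops = 2.0e11    # client-side speed estimate for the GPU app (FLOPS)
dcf = 0.0148                 # Duration Correction Factor left over from ATMbeta

estimated_runtime_s = rsc_fpops_est / est_device_flops * dcf
print(f"estimated runtime: {estimated_runtime_s:.0f} s")   # ~74 s, versus ~1 h of real runtime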

Freewill
Joined: 18 Mar 10 | Posts: 13 | Credit: 7,093,713,894 | RAC: 40,189,547
Message 60989 - Posted: 15 Jan 2024 | 16:27:47 UTC - in response to Message 60988.

Drago,
You also need to check the "Run test applications?" box.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60990 - Posted: 15 Jan 2024 | 16:39:22 UTC - in response to Message 60984.

Sending out work for this app today. The work units take an hour (very approximately). They should be using different GPUs on multigpu systems. Please let me know if you see anything not working as you would normally expect


At least one of my computers is unable to get any tasks; the scheduler just reports that no tasks were sent.

It's inexplicable, since it has the exact same configuration as a system that is receiving tasks just fine.

They are both on the same venue, and that venue has ALL applications selected, test/beta apps allowed, and "allow other apps" selected. Not sure what's going on here.

The only difference is one has 4 GPUs and the other has 7.

will get work: https://gpugrid.net/show_host_detail.php?hostid=582493
will not get work: https://gpugrid.net/show_host_detail.php?hostid=605892
____________

Drago
Joined: 3 May 20 | Posts: 12 | Credit: 342,481,560 | RAC: 449,239
Message 60991 - Posted: 15 Jan 2024 | 16:43:15 UTC

Yeah! I have all boxes checked but I still don't get work. Maybe it is a problem with the driver? I have version 470 installed, which has worked fine for me so far...

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60992 - Posted: 15 Jan 2024 | 17:17:29 UTC - in response to Message 60990.



it's inexplicable since it is the exact same configuration as a system that is receiving tasks just fine.


OK, thanks for this information. There must be something unexpected going on with the scheduler.

ServicEnginIC
Joined: 24 Sep 10 | Posts: 566 | Credit: 6,321,177,024 | RAC: 17,362,450
Message 60993 - Posted: 15 Jan 2024 | 17:34:11 UTC

I made a couple of tests with these new PYSCFbeta tasks.
I tried stopping two of them, and they restarted without erroring. This is good... but both of them had their execution times reset and restarted from the beginning. This is not so good...
The tests were made on a mixed dual-GPU system (GTX 1660 Ti + GTX 1650). Unlike ACEMD tasks, both tasks were restarted on a different GPU model than the one they started on, and they did not crash. This is good!

Also, I've noticed a considerable reduction in power draw (roughly halved) compared to ACEMD tasks.
GPU power draw on the GTX 1660 Ti with PYSCFbeta tasks is about half of what I'm used to seeing with ACEMD tasks.
And the same happens on the GTX 1650.
Consequently, although 100% GPU usage is shown, working temperatures are much lower...

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60994 - Posted: 15 Jan 2024 | 17:41:13 UTC - in response to Message 60993.

Stopping and resuming is not currently implemented. It will just restart from the beginning.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60995 - Posted: 15 Jan 2024 | 17:49:44 UTC - in response to Message 60992.



it's inexplicable since it is the exact same configuration as a system that is receiving tasks just fine.


Ok thanks for this information. There must be something unexpected going on with the scheduler.


Are you able to inspect the scheduler log for this host? Can you see more detail about the specific reason it was not sent any work?

The only thing I see on my end is "no tasks sent", with no reason given.
____________

Sasa Jovicic
Joined: 22 Oct 09 | Posts: 2 | Credit: 274,077,500 | RAC: 1,051,937
Message 60996 - Posted: 15 Jan 2024 | 18:21:18 UTC

I have the same problem too: no tasks sent!

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 60997 - Posted: 15 Jan 2024 | 19:51:33 UTC
Last modified: 15 Jan 2024 | 19:59:07 UTC

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60998 - Posted: 15 Jan 2024 | 20:24:11 UTC

Another observation: keep an eye on your CPU use.

These look to be another MT+CUDA setup that BOINC is not prepared to handle, much like the PythonGPU work. I saw upwards of 30 threads utilized per task, but it wasn't sustained; it came in bursts.

On average, reported cpu_time and runtime were about 4x the actual time (15 min actual would be reported as about an hour of runtime).
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61000 - Posted: 15 Jan 2024 | 20:58:03 UTC

Thanks for listing the host IDs that are not receiving work. I can see them in the scheduler logs, so hopefully I can pinpoint why they are not getting work.

And yes, I missed a setting to limit the multi-threading, thanks for catching that!
(All the modern libraries try very hard to multi-thread without telling you they are going to, haha.)
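For anyone curious, a generic sketch of how that kind of hidden multi-threading is usually capped (not necessarily the exact fix used here): set the well-known threading environment variables before the numerical libraries are imported.

import os

# cap BLAS/OpenMP thread pools to one thread per task; this must happen before
# the libraries that read these variables are imported/initialized
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ.setdefault(var, "1")

import numpy as np  # libraries imported after this point pick up the limits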

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61001 - Posted: 15 Jan 2024 | 21:01:32 UTC - in response to Message 61000.

I think if you add a discrete checkbox selection for QChem on GPU in the project preferences, that will solve the issues with requesting work for this app.
____________

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61002 - Posted: 17 Jan 2024 | 0:35:42 UTC - in response to Message 61001.

Thank you to whoever got the discrete checkbox implemented in the settings :). This should make getting work much less of a hassle.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61003 - Posted: 17 Jan 2024 | 9:43:41 UTC

The app will now appear in the GPUGRID preferences as "Quantum chemistry on GPU (beta)".

The previous scheduler problems should be fixed. (I can see that https://gpugrid.net/results.php?hostid=605892 is now getting jobs when before it was not.)

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 61004 - Posted: 17 Jan 2024 | 12:23:36 UTC - in response to Message 60997.

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61005 - Posted: 17 Jan 2024 | 12:41:37 UTC

Yes, thanks! This works much better now.

More observed behavior: this batch seems to use less VRAM, and is also more limited in CPU use; it's pretty much stuck to just 1 thread per process now. Not sure if it's a consequence of the CPU limiting or some other change that reduced the VRAM use, but these tasks run a bit slower than the last batch.

If it's bottlenecked by the CPU limiting, maybe there's a middle ground? Like letting it use up to 4 cores?

In the last batch 2 days ago, on a Titan V I was running 2x in about 15 mins total (7.5 mins per task). This new batch was doing about 25 mins for two tasks (12.5 mins per task). Since VRAM use has gone down, I'm doing 3x in about 25-30 mins (8.5-10 mins per task), and I'm experimenting with 4x now as well. But the old tasks were undoubtedly faster for some reason.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61006 - Posted: 17 Jan 2024 | 13:02:49 UTC - in response to Message 61005.

The most recent WUs are just twice the size of the previous test set: there are 100 molecules in each WU now, whereas previously there were 50.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61007 - Posted: 17 Jan 2024 | 13:38:39 UTC - in response to Message 61006.

oh ok, that explains it!
____________

Freewill
Joined: 18 Mar 10 | Posts: 13 | Credit: 7,093,713,894 | RAC: 40,189,547
Message 61008 - Posted: 17 Jan 2024 | 13:52:04 UTC - in response to Message 61006.

The most recent WUs are just twice the size of the previous test set: there are 100 molecules in each WU now, whereas previously there were 50.

I wouldn't complain if the credit per task was also doubled. ;)

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61009 - Posted: 17 Jan 2024 | 15:08:44 UTC - in response to Message 61006.

Steve,

Can you see if you can lift the task download limits? Currently it looks like each scheduler request will only send one task, instead of a few at a time.

With multiple computers at the same location, coupled with the DoS protection on your network preventing multiple requests from the same IP, I get scheduler request failures pretty often, which limits how many tasks I can download and keeps some of the GPUs from working.

I know you probably can't do anything about the network DoS protections, but can you allow multiple tasks to download in a single request?
____________

Sasa Jovicic
Joined: 22 Oct 09 | Posts: 2 | Credit: 274,077,500 | RAC: 1,051,937
Message 61010 - Posted: 17 Jan 2024 | 16:22:01 UTC

I made a fresh Linux Mint installation and it is OK for me now. Now I can download new WUs.

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61011 - Posted: 17 Jan 2024 | 17:45:33 UTC - in response to Message 60963.

The app will not work on GPUs with compute capability less than 6.0. It should not be sending them to these cards but I think at the moment this functionality is not working properly.

WUs are being sent to GPUs like GTX 960 (cc=5.2, 2 GB VRAM) and they fail. E.g.,
https://www.gpugrid.net/show_host_detail.php?hostid=550055
https://developer.nvidia.com/cuda-gpus

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61012 - Posted: 17 Jan 2024 | 17:51:36 UTC - in response to Message 61011.
Last modified: 17 Jan 2024 | 17:53:02 UTC

The app will not work on GPUs with compute capability less than 6.0. It should not be sending them to these cards but I think at the moment this functionality is not working properly.

WUs are being sent to GPUs like GTX 960 (cc=5.2, 2 GB VRAM) and they fail. E.g.,
https://www.gpugrid.net/show_host_detail.php?hostid=550055
https://developer.nvidia.com/cuda-gpus


Steve mentioned that the scheduler block on low-CC cards wasn't working properly.

It's best to uncheck QChem for GPU in your project preferences for those hosts.

Edit:
Sorry, disregard; I thought you were talking about your own host. Since that host is anonymous, there's not really anything to be done at the moment. We will just have to deal with the resends.
____________

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61013 - Posted: 17 Jan 2024 | 18:18:22 UTC
Last modified: 17 Jan 2024 | 18:22:16 UTC

When you send out WUs with 0.991C + 1NV, BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my "Use at most N-2 CPUs" setting to accommodate these tasks. If I don't, they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may belong to a Science United member, so many failures can be expected. But with 7 errors allowed, the WUs will probably find a qualified cruncher before they die.

bormolino
Joined: 16 May 13 | Posts: 41 | Credit: 79,726,864 | RAC: 274
Message 61014 - Posted: 17 Jan 2024 | 18:33:49 UTC

The 1.92 GB file downloads at only ~19.75 KB/s.

There's no chance of getting the file within the deadline.

I have mentioned this problem multiple times in multiple threads. It seems like nobody cares, even though the problem affects multiple users.

Skillz
Joined: 6 Jun 17 | Posts: 4 | Credit: 8,090,535,479 | RAC: 49,348,086
Message 61015 - Posted: 17 Jan 2024 | 18:47:36 UTC - in response to Message 61009.

Steve,

Can you see if you can lift the task download limits? Currently it looks like each scheduler request will only send one task, instead of a few at a time.

With multiple computers at the same location, coupled with the DoS protection on your network preventing multiple requests from the same IP, I get scheduler request failures pretty often, which limits how many tasks I can download and keeps some of the GPUs from working.

I know you probably can't do anything about the network DoS protections, but can you allow multiple tasks to download in a single request?


I've got this issue also. We need to be able to download multiple tasks in one request; otherwise the GPU sits idle or grabs a backup-project task, and then misses multiple requests until that task completes.

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61016 - Posted: 17 Jan 2024 | 19:33:59 UTC - in response to Message 61013.

When you send out WUs with 0.991C + 1NV BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my Use At Most N-2 CPUs to accommodate these tasks. If not they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may be a Science United member so many failures can be expected. But with 7 errors allowed they'll probably find a qualified cruncher before they die.

You can always override that with an app_config.xml file in the project folder and assign 1.0 CPU threads to the task.

gemini8
Joined: 3 Jul 16 | Posts: 31 | Credit: 1,329,100,176 | RAC: 4,926,985
Message 61017 - Posted: 17 Jan 2024 | 21:25:46 UTC - in response to Message 61014.

Hello.

The 1.92 GB file downloads at only ~19.75 KB/s.

I'm also encountering the issue of slow downloads on several hosts.
It would be nice if the project infrastructure worked a little bit faster on our downloads.
Thank you.
____________
Greetings, Jens

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61018 - Posted: 17 Jan 2024 | 22:04:02 UTC - in response to Message 61017.

Hello.
The 1.92 GB file downloads at only ~19.75 KB/s.

I'm also encountering the issue of slow downloads on several hosts.
It would be nice if the project infrastructure worked a little bit faster on our downloads.
Thank you.


Once this file is downloaded, you don't need to download it again; it's re-used for every task. The input files sent for each task are very small and will download quickly.
____________

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 61019 - Posted: 18 Jan 2024 | 2:56:16 UTC - in response to Message 61004.
Last modified: 18 Jan 2024 | 2:57:20 UTC

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.



After running these tasks successfully for almost a day on both of my computers, the "Remaining (estimated)" time in the BOINC Manager tasks tab now shows approximately 24 days to complete on one computer and 62 days on the other at the start of a task, and counts down incrementally from there. The task actually completes successfully in a little over an hour. A few hours ago, they were showing the correct times to complete.

Everything else is working fine, but this is definitely unusual. Has anyone else observed this?

[AF>Libristes] alain65
Joined: 30 May 14 | Posts: 9 | Credit: 1,666,048,820 | RAC: 7,748,754
Message 61020 - Posted: 18 Jan 2024 | 3:13:07 UTC

Good morning.
The PYSCFbeta: Quantum Chemistry Calculations on GPU WUs work well on my 1080 Ti and 1650 Ti.
Unfortunately, on my GTX 970 with 4 GB VRAM I receive many WUs but they quickly error out.
I run Debian 11 with the NVIDIA 470 driver.
Is this hardware too old?
For the moment I have deselected Quantum chemistry on GPU (beta) on this machine, because I quickly reached the daily maximum and would still like to do other GPUGRID WUs if there are any.
____________
PC are like air conditioning, they becomes useless when you open Windows (L.T)

In a world without walls and fences, who needs windows and gates?

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61021 - Posted: 18 Jan 2024 | 4:25:39 UTC - in response to Message 61020.

Good morning.
Wu Pyscfbeta: Quantum Chemistry Calculations On GPU work well on my 1080TI and 1650TI.
Unfortunately on my GTX 970 with 4 GB VRAM, I receive many WUs but they quikly go in error.
I run with Debian 11 and the Nvidia 470 driver.
Is this material too old?
For the moment I have removed Quantum Chemistry on GPU (BETA) on this machine. Because I quickly arrived a daily maximum and would still want to do other wu gpugrid if there are any.


The project admin said at the beginning that the application will only work for cards with compute capability of 6.0 or greater. This equates to cards of Pascal generation and newer.

Your GTX 970 is Maxwell with a compute capability of 5.2. It is too old for this app.
____________

[AF>Libristes] alain65
Joined: 30 May 14 | Posts: 9 | Credit: 1,666,048,820 | RAC: 7,748,754
Message 61022 - Posted: 18 Jan 2024 | 7:17:09 UTC - in response to Message 60963.

Okay ... the answer was in the first message:



The app will not work on GPUs with compute capability less than 6.0. It should not be sending them to these cards but I think at the moment this functionality is not working properly.



Sorry for the trouble ;)
____________
PC are like air conditioning, they becomes useless when you open Windows (L.T)

In a world without walls and fences, who needs windows and gates?

Skip Da Shu
Joined: 13 Jul 09 | Posts: 61 | Credit: 827,525,165 | RAC: 10,449,053
Message 61023 - Posted: 18 Jan 2024 | 15:21:10 UTC - in response to Message 61022.

OMG, LOL, I love this and must go abuse it...

PC are like air conditioning, they becomes useless when you open Windows (L.T)

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61026 - Posted: 18 Jan 2024 | 15:48:02 UTC - in response to Message 61016.

When you send out WUs with 0.991C + 1NV BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my Use At Most N-2 CPUs to accommodate these tasks. If not they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may be a Science United member so many failures can be expected. But with 7 errors allowed they'll probably find a qualified cruncher before they die.

You can always override that with an app_config.xml file in the project folder and assign 1.0 cpu threads to the task.

I know I can. What about the many people who leave BOINC on autopilot?
I've seen multiple instances of 5 errors before a WU got to me. Fixing this is in Steve's best interest.

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61028 - Posted: 18 Jan 2024 | 15:51:15 UTC - in response to Message 61019.

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.



After running these tasks successfully for almost a day on both of my computers, now my BOINC manager, task tab, Remaining (estimated) "time" is telling approximately 24 days to complete on one computer and 62 days on the other, at the task's beginning, and incrementally counts down from there. The task actually completes successfully in a little over an hour. A few hours ago, they were showing the correct times to complete.

Everything else is working fine, but this is definitely unusual. Did anyone else observed this?

At first I did. But including <fraction_done_exact/> seems to heal that fairly quickly.
<app>
<name>PYSCFbeta</name>
<!-- Quantum chemistry calculations on GPU -->
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1</cpu_usage>
<gpu_usage>1</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61029 - Posted: 18 Jan 2024 | 16:11:41 UTC - in response to Message 61026.

When you send out WUs with 0.991C + 1NV BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my Use At Most N-2 CPUs to accommodate these tasks. If not they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may be a Science United member so many failures can be expected. But with 7 errors allowed they'll probably find a qualified cruncher before they die.

You can always override that with an app_config.xml file in the project folder and assign 1.0 cpu threads to the task.

I know I can. What about the many people that leave BOINC on autopilot?
I've seen multiple instances of 5 errors before a WU got to me. It's in Steve's best interest.


The errors have nothing to do with the CPU resource allocation setting. They all errored because they ran on GPUs that are too old; the app needs cards with a CC of at least 6.0 (Pascal and up).

At worst, if someone is running the CPU flat out at 100% and not leaving spare CPU cycles available (as they should), the GPU task might run a little more slowly, but it won't fail.

I believe that the issue of "0.991" CPUs or whatever is a byproduct of the BOINC server-side software. From what I've read elsewhere, this value is not intentionally set by the researchers; it is automatically selected by the BOINC server somewhere along the way, and the researchers here have previously commented that they are not aware of any way to override this server-side. So competent users can just override it themselves if they prefer. Setting your CPU use in BOINC to something like 99 or 98% has the same overall effect, though.
____________

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 61032 - Posted: 18 Jan 2024 | 23:46:55 UTC - in response to Message 61028.

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.



After running these tasks successfully for almost a day on both of my computers, now my BOINC manager, task tab, Remaining (estimated) "time" is telling approximately 24 days to complete on one computer and 62 days on the other, at the task's beginning, and incrementally counts down from there. The task actually completes successfully in a little over an hour. A few hours ago, they were showing the correct times to complete.

Everything else is working fine, but this is definitely unusual. Did anyone else observed this?

At first I did. But including <fraction_done_exact/> seems to heal that fairly quickly.
<app>
<name>PYSCFbeta</name>
<!-- Quantum chemistry calculations on GPU -->
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1</cpu_usage>
<gpu_usage>1</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>




Thanks for this information. I updated my computers.

Now, I remember this <fraction_done_exact/> from a post several years ago; I can't remember the thread. In the past I didn't need it, because the tasks would correct themselves eventually, even the ATMbetas.

The Quantum chemistry on GPU tasks do the complete opposite. I wonder if this is connected to the observation of "upwards of 30 threads utilized per task" posted by Ian&Steve C.?

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61033 - Posted: 19 Jan 2024 | 0:07:25 UTC - in response to Message 61032.

Nah, the multi-threading issue has already been fixed; the app only uses a single thread now.
____________

zombie67 [MM]
Joined: 16 Jul 07 | Posts: 207 | Credit: 1,736,801,456 | RAC: 4,977,645
Message 61034 - Posted: 19 Jan 2024 | 2:30:50 UTC - in response to Message 60963.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a Pascal card, so it meets the compute capability requirement, but it has only 2 GB of VRAM. Without knowing the amount of VRAM required, I am not sure if it will work.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61035 - Posted: 19 Jan 2024 | 3:41:52 UTC - in response to Message 61034.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a Pascal card, so it meets the compute capability requirement, but it has only 2 GB of VRAM. Without knowing the amount of VRAM required, I am not sure if it will work.


It requires more than 2 GB.
____________

zombie67 [MM]
Joined: 16 Jul 07 | Posts: 207 | Credit: 1,736,801,456 | RAC: 4,977,645
Message 61036 - Posted: 19 Jan 2024 | 4:24:35 UTC - in response to Message 61035.

It requires more than 2GB


Good to know. Thanks!
____________
Reno, NV
Team: SETI.USA

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61037 - Posted: 19 Jan 2024 | 14:07:54 UTC - in response to Message 61029.



the errors have nothing to do with the CPU resource allocation setting. they all errored because of running on GPUs that are too old, the app needs cards with at least CC of 6.0+ (Pascal and up).

at worst, if someone is running the CPU full out 100% and not leaving space CPU cycles available (as they should), the worst that happens is that the GPU task might run a little more slowly. but it wont fail.

I believe that the issue of "0.991" CPUs or whatever is a byproduct of the BOINC serverside software. from what I've read elsewhere, this value is not intentionally set by the researchers, it is automatically selected by the BOINC server somewhere along the way, and the researchers here have previously commented that they are not aware of any way to override this serverside. so competent users can just override it themselves if they prefer. setting your CPU use in BOINC to like 99 or 98% has the same effect overall though.


This is all correct, I believe.

It seems that the jobs have enough retry attempts that all work units eventually end up succeeding. The scheduler has an inbuilt mechanism to classify hosts as "reliable"; it also has a mechanism to send workunits that have failed a few times only to hosts that are "reliable". This is not ideal, of course. We will try to get the CC requirements honoured, but these are project-wide scheduler settings which are rather complex to fix without breaking everything else that is currently working.

The download limitation is something I will not be able to change easily. A potential reason I can guess for the current settings is to stop a failing host acting as a black hole for failed jobs, or something similar.

The large file download should happen just once. The app is deployed in the same way as the ATM app: it is a 2 GB zip file that contains a Python environment and some CUDA libraries. Each work unit only requires downloading a small file (<1 MB, I think).


This last large-scale run has been rather impressive. The throughput was very high! Especially considering that it is only on Linux hosts and not Windows. We will be sending some similar batches over the next few weeks.

[AF>Libristes] alain65
Joined: 30 May 14 | Posts: 9 | Credit: 1,666,048,820 | RAC: 7,748,754
Message 61039 - Posted: 20 Jan 2024 | 3:25:58 UTC - in response to Message 61037.

Hello Steve.

The throughput was very high! Especially considering that it is only on Linux hosts and not Windows.


I would say: that is probably exactly why! :D
____________
PC are like air conditioning, they becomes useless when you open Windows (L.T)

In a world without walls and fences, who needs windows and gates?

Erich56
Joined: 1 Jan 15 | Posts: 1091 | Credit: 6,854,782,676 | RAC: 17,245,093
Message 61040 - Posted: 21 Jan 2024 | 8:59:42 UTC - in response to Message 61037.

... Especially considering that it is only on Linux hosts and not Windows. We will be sending some similar batches over the next few weeks.

Is there a plan to come up with a Windows version too?

Philip Nicholson
Joined: 23 Feb 22 | Posts: 1 | Credit: 518,814,968 | RAC: 203,660
Message 61041 - Posted: 21 Jan 2024 | 19:01:49 UTC

Still no work for Windows 11 operating systems?
I see the occasional task that failed, but nothing processed.
It worked well for months and then just stopped before Christmas.
All my software is up to date.
I have a dedicated GPU for this project.
Where is the best place to find an update on GPUGRID's software migration?

Tasks completed: 134
Tasks failed: 55
Credit (user): 491,814,968 total, 13,657.85 average
Credit (host): 150,562,500 total, 13,650.92 average
Scheduling priority: -0.93
Don't request tasks for CPU: project has no apps for CPU
NVIDIA GPU task request deferred for: 00:03:35
NVIDIA GPU task request deferral interval: 00:10:00
Last scheduler reply: 2024-01-21 1:55:15 PM

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61042 - Posted: 21 Jan 2024 | 21:53:29 UTC

Most of the work released lately has been the Quantum Chemistry tasks. The researcher said that since most educational and research labs run Linux, Windows applications are a secondary concern.

The only tasks with a Windows app that have appeared somewhat regularly are the ACEMD tasks.

You will have to try and snag one of those when they show up.

Erich56
Joined: 1 Jan 15 | Posts: 1091 | Credit: 6,854,782,676 | RAC: 17,245,093
Message 61043 - Posted: 22 Jan 2024 | 7:52:56 UTC - in response to Message 61042.
Last modified: 22 Jan 2024 | 7:56:36 UTC

The researcher said that since most educational and research labs run Linux OS', that Windows applications are a second thought.

It's really too bad that GPUGRID obviously tends more and more to exclude Windows crunchers :-(
When I joined this project 8 years ago, and for many years thereafter, there was no lack of Windows tasks.
On the other hand, with the few tasks available since last year, it might be that the number of Linux crunchers is sufficient to process them, and the Windows crunchers from before are not needed any longer :-(
At least, that is the impression one is bound to get.

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61044 - Posted: 22 Jan 2024 | 17:27:39 UTC

The lack of current Windows applications has more to do with the type of applications and APIs currently being used.

The latest sub-projects are all Python-based. Python runs much better on Linux than on Windows, since most development is done on Linux to begin with.

Even Microsoft advises that Python application development should be done on Linux rather than Windows.

Erich56
Joined: 1 Jan 15 | Posts: 1091 | Credit: 6,854,782,676 | RAC: 17,245,093
Message 61045 - Posted: 23 Jan 2024 | 8:06:09 UTC

So - in short - bad times for Windows crunchers. Now and in the future :-(

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61046 - Posted: 23 Jan 2024 | 9:05:47 UTC - in response to Message 61045.

So - in short - bad times for Windows crunchers. Now and in the future :-(

Pretty much so.

Windows had it best back with the original release of the acemd app. Remember, it was a simple, single executable file of modest size, derived from source code that could be compiled for Windows or for Linux.

But if you were paying attention lately, the recent acemd tasks no longer use a single executable; they are using Python.

The Python-based tasks are NOT a single executable; they comprise a complete packaged Python environment of many gigabytes.

The nature of the project's tasks has changed to complex, state-of-the-art discovery calculations using cutting-edge technology.

The QChem tasks are even using the Tensor cores of our NVIDIA cards now. This is something we asked about several years ago in the forum and were told: maybe, in the future.

The future has come and our wishes have been answered.

But the hardware and software of our hosts now have to rise to meet those challenges. Sadly, the Windows environment is still waiting in the wings.

ServicEnginIC
Joined: 24 Sep 10 | Posts: 566 | Credit: 6,321,177,024 | RAC: 17,362,450
Message 61082 - Posted: 25 Jan 2024 | 18:16:11 UTC - in response to Message 61028.

...But including <fraction_done_exact/> seems to heal that fairly quickly.

Nice advice, thank you!
It quickly resulted in an accurate remaining-time estimate, so I applied it to ATMbeta tasks as well.

[BAT] Svennemans
Joined: 27 May 21 | Posts: 50 | Credit: 289,422,017 | RAC: 2,616,469
Message 61084 - Posted: 25 Jan 2024 | 19:24:22 UTC - in response to Message 61046.

Choosing not to release Windows apps is a choice they are free to make, obviously. And maybe their use cases warrant the tradeoff inherent in that.

If there are often large volumes of work to process in a short time (i.e. you would ideally need something like a supercomputer if it didn't cost so much), then you would want to design your apps for what BOINC was intended for all along, meaning you try to get them ported to as many platforms as you possibly can in order to reach maximum compute power. Or you leverage the power of VirtualBox for non-native platforms.

If, however, the volumes are never going to be that large, so that basically any single-platform user group can easily provide the necessary compute power, then indeed why bother.

Although it would be nice of them to make that choice public and explicit, so all non-Linux users can gracefully detach instead of posting frustrated "why no work" messages around the forums.
Or indeed spend hours trying to help fix Windows apps ;-)

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61096 - Posted: 26 Jan 2024 | 15:51:20 UTC - in response to Message 61029.

I believe that the issue of "0.991" CPUs or whatever is a byproduct of the BOINC serverside software. from what I've read elsewhere, this value is not intentionally set by the researchers, it is automatically selected by the BOINC server somewhere along the way, and the researchers here have previously commented that they are not aware of any way to override this serverside.

I didn't know that. It's probably sloppy BOINC design, like using a percentage to determine the number of CPU threads to use instead of an integer.

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61097 - Posted: 26 Jan 2024 | 15:56:14 UTC - in response to Message 61034.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a pacal card, so it meets the compute capability requirement. But it has only 2gb of VRAM. But without knowing the amount of VRAM required, I am not sure if it will work.

The highest being used today on my Pascal cards is 795 MB.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61098 - Posted: 26 Jan 2024 | 16:00:09 UTC - in response to Message 61097.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a pacal card, so it meets the compute capability requirement. But it has only 2gb of VRAM. But without knowing the amount of VRAM required, I am not sure if it will work.

The highest being used today on my Pascal cards is 795 MB.


You might want to watch that on a longer time scale; the VRAM use is not static, it fluctuates up and down.
____________

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61099 - Posted: 26 Jan 2024 | 16:18:23 UTC
Last modified: 26 Jan 2024 | 16:32:42 UTC

Retraction: I was monitoring with BoincTasks Js 2.4.2.2 and it has bugs.
I loaded nvitop and the task does use 2 GB VRAM at 100% GPU utilization.

BTW, if anyone wants to try nvitop, here are my notes for installing it on Ubuntu 22.04:
sudo apt update
sudo apt upgrade -y
sudo apt install python3-pip -y
python3 -m pip install --user pipx
python3 -m pip install --user --upgrade pipx
python3 -m pipx ensurepath
# if requested: sudo apt install python3.8-venv -y
# for Linux Mint 21.3: sudo apt install python3.10-venv -y
# open a new terminal, then:
pip3 install --upgrade nvitop
pipx run nvitop --colorful -m full

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61100 - Posted: 26 Jan 2024 | 16:26:34 UTC - in response to Message 61099.

I'm not seeing any different behavior on my titan Vs. the VRAM use still exceeds 3GB at times. but it's spikey. you have to watch it for a few mins. instantaneous measurements might not catch it.
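
if anyone wants to log the peak themselves, here's a minimal polling sketch (assuming the nvidia-ml-py package, imported as pynvml, is installed; this is not part of the GPUGRID app):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust on multi-GPU hosts
peak = 0
try:
    while True:
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes currently allocated on the card
        peak = max(peak, used)
        print(f"used {used / 2**20:7.0f} MiB   peak {peak / 2**20:7.0f} MiB", end="\r")
        time.sleep(0.5)  # sample twice per second to catch short spikes
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
    print(f"\npeak VRAM observed: {peak / 2**20:.0f} MiB")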
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61101 - Posted: 26 Jan 2024 | 17:04:26 UTC - in response to Message 61100.

I am seeing spikes to ~7.6 GB with these. Not long lasting (in the context of the whole work unit) but consistently elevated during that part of the work unit. I want to say that I saw that spike at about 5% complete and then at 95% complete, but that also could be somewhat coincidental versus factual.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61102 - Posted: 26 Jan 2024 | 17:11:14 UTC - in response to Message 61101.
Last modified: 26 Jan 2024 | 17:14:12 UTC

I am seeing spikes to ~7.6 GB with these. Not long lasting (in the context of the whole work unit) but consistently elevated during that part of the work unit. I want to say that I saw that spike at about 5% complete and then at 95% complete, but that also could be somewhat coincidental versus factual.


to add on to this, for everyone's info.

these tasks (and a lot of CUDA applications in general) do not require any set absolute value of VRAM. VRAM will scale to the GPU individually. generally, the more SMs you have, the more VRAM will be used. it's not linear, but there is some portion of the allocated VRAM that scales directly with how many SMs are being used.

to put it simply, different GPUs with different core counts, will have different amounts of VRAM utilization.

so even if a powerful GPU like an RTX 4090 with 100+ SMs on the die might need 7+GB, that doesn't mean that something much smaller like a GTX 1070 needs that much. it needs to be evaluated on a case-by-case basis.
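
if you want to see what your own card reports, here's a quick query sketch using CuPy (which the app environment appears to bundle, judging by the stderr paths in this thread; the attribute key is taken from CuPy's device-attribute dictionary):

import cupy

dev = cupy.cuda.Device(0)  # device 0 as seen by this process
with dev:
    sm_count = dev.attributes["MultiProcessorCount"]   # number of SMs on the die
    free_b, total_b = cupy.cuda.runtime.memGetInfo()   # free/total VRAM in bytes
print(f"SMs: {sm_count}  VRAM: {free_b / 2**30:.1f} GiB free of {total_b / 2**30:.1f} GiB")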
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61103 - Posted: 26 Jan 2024 | 17:20:59 UTC - in response to Message 61102.

I am seeing spikes to ~7.6 GB with these. Not long lasting (in the context of the whole work unit) but consistently elevated during that part of the work unit. I want to say that I saw that spike at about 5% complete and then at 95% complete, but that also could be somewhat coincidental versus factual.


to add on to this, for everyone's info.

these tasks (and a lot of CUDA applications in general) do not require any set absolute value of VRAM. VRAM will scale to the GPU individually. generally, the more SMs you have, the more VRAM will be used. it's not linear, but there is some portion of the allocated VRAM that scales directly with how many SMs are being used.

to put it simply, different GPUs with different core counts, will have different amounts of VRAM utilization.

so even if a powerful GPU like an RTX 4090 with 100+ SMs on the die might need 7+GB, that doesn't mean that something much smaller like a GTX 1070 needs that much. it needs to be evaluated on a case-by-case basis.



Thanks for this! I did not know about the scaling and I don't think this is something I ever thought about (the correlation between SMs and VRAM usage).

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,872,824,643
RAC: 15,619,873
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61108 - Posted: 29 Jan 2024 | 13:55:27 UTC

Why do I always get a segmentation fault
on Windows / WSL2 / Ubuntu 22.04.3 LTS?
12 processors, 28 GB memory, 16GB swap, GPU RTX 4070 Ti Super with 16 GB, driver version 551.23

https://www.gpugrid.net/result.php?resultid=33759912
https://www.gpugrid.net/result.php?resultid=33758940
https://www.gpugrid.net/result.php?resultid=33759139
https://www.gpugrid.net/result.php?resultid=33759328

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61109 - Posted: 29 Jan 2024 | 14:01:46 UTC - in response to Message 61108.

Why do I always get a segmentation fault
on Windows / WSL2 / Ubuntu 22.04.3 LTS?
12 processors, 28 GB memory, 16GB swap, GPU RTX 4070 Ti Super with 16 GB, driver version 551.23

https://www.gpugrid.net/result.php?resultid=33759912
https://www.gpugrid.net/result.php?resultid=33758940
https://www.gpugrid.net/result.php?resultid=33759139
https://www.gpugrid.net/result.php?resultid=33759328


something wrong with your environment or drivers likely.

try running a native Linux OS install, WSL might not be well supported
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61117 - Posted: 30 Jan 2024 | 12:54:49 UTC - in response to Message 61109.
Last modified: 30 Jan 2024 | 13:15:58 UTC

Steve,

these TEST units you have out right now seem to be using a ton of reserved memory. one process right now is using 30+GB, which seems much higher than usual, and i even have another one reserving 64GB of memory. that's way too high.


____________

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 7,093,713,894
RAC: 40,189,547
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61118 - Posted: 30 Jan 2024 | 13:17:30 UTC
Last modified: 30 Jan 2024 | 13:19:20 UTC

Here's one that died on my Ubuntu system which has 32 GB RAM:
https://www.gpugrid.net/result.php?resultid=33764282

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61119 - Posted: 30 Jan 2024 | 14:33:26 UTC - in response to Message 61117.
Last modified: 30 Jan 2024 | 15:13:20 UTC

i see v3 being deployed now

the memory limiting you're trying isn't working. I'm seeing it spike to near 100%

i see you put export CUPY_GPU_MEMORY_LIMIT=50%

a quick google seems to indicate that you need to put the percentage in quotes, like this: export CUPY_GPU_MEMORY_LIMIT="50%". alternatively you can set a discrete memory amount (in bytes) as the limit, for example export CUPY_GPU_MEMORY_LIMIT="1073741824" to limit it to 1GB.
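
for reference, the same cap can also be set from inside Python on CuPy's default memory pool (a sketch, not the project's actual code; note the cap only applies to CuPy's own pool, so allocations made directly by other CUDA libraries aren't counted against it):

import cupy

pool = cupy.get_default_memory_pool()
# hard cap in bytes (1 GiB here)...
pool.set_limit(size=1 * 1024**3)
# ...or, alternatively, a fraction of the card's total VRAM (this second call overrides the first)
pool.set_limit(fraction=0.5)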

and the system memory use is still a little high, around 10GB each. EDIT - system memory use still climbed to ~30GB by the end
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61120 - Posted: 30 Jan 2024 | 16:01:04 UTC - in response to Message 61119.
Last modified: 30 Jan 2024 | 16:01:30 UTC

v4 report.

i see you attempted to add some additional VRAM limiting. but the task is still trying to allocate more VRAM, and instead of using more VRAM, the process gets killed for trying to allocate more than the limit.

https://gpugrid.net/result.php?resultid=33764464
https://gpugrid.net/result.php?resultid=33764469
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61121 - Posted: 30 Jan 2024 | 16:11:32 UTC

Yes, I was doing some testing to see how large a molecule we can compute properties for.

The previous batches have been for small molecules which all work very well.

The memory use scales very quickly with increased molecule size.
This test today had molecules 3 to 4 times the size of the previous batches. As you can see, I have not solved the memory limiting issue yet. It should be possible to limit instantaneous GPU memory use (at the cost of runtime performance and increased CPU memory use), but due to the different levels of CUDA libraries in play in this code it is rather complicated. I will work on this locally for now and resume sending out the batches that were working well tomorrow!
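
As an illustration of the kind of trade-off involved (a sketch only, not what the app necessarily does or will do): CuPy allocations can be routed through CUDA managed memory, which lets the driver spill to host RAM instead of failing outright, at a significant speed and CPU-memory cost.

import cupy

# back CuPy's pool with managed (unified) memory -- slower, and it consumes host RAM when oversubscribed
managed_pool = cupy.cuda.MemoryPool(cupy.cuda.malloc_managed)
cupy.cuda.set_allocator(managed_pool.malloc)

x = cupy.zeros((4096, 4096))  # subsequent allocations now come from the managed pool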

Thank you for the assistance and compute availability, it is much appreciated!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61122 - Posted: 30 Jan 2024 | 16:13:47 UTC - in response to Message 61121.

no problem! glad to see you were monitoring my feedback and making changes.

looking forward to another stable batch tomorrow :) should be similar to previous runs like yesterday right?
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61123 - Posted: 30 Jan 2024 | 16:18:55 UTC - in response to Message 61122.

Yes, it will be the same as yesterday but with roughly 10x the work units released.

Each workunit contains 100 small molecules.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61124 - Posted: 30 Jan 2024 | 16:19:50 UTC - in response to Message 61123.

looking forward to it :)

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61126 - Posted: 31 Jan 2024 | 12:38:25 UTC

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61127 - Posted: 31 Jan 2024 | 12:40:53 UTC - in response to Message 61124.
Last modified: 31 Jan 2024 | 13:08:03 UTC

Steve,

this new batch, right off the bat, is loading up the GPU VRAM nearly full again.

edit: that's for a v1 task, will check out the v2s
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61128 - Posted: 31 Jan 2024 | 13:12:40 UTC - in response to Message 61127.

OK. looks like the v2 tasks are back to normal. it was only that v1 task that was using lots of vram

____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61129 - Posted: 31 Jan 2024 | 13:19:52 UTC - in response to Message 61127.

Ok my previous post was incorrect.

It turns out the previous large batch was not a representative test set. It only contained very small molecules, which is why the GPU RAM usage was low. As per my previous post, these tasks use a lot of GPU memory. You can see more detail in this post: http://gpugrid.org/forum_thread.php?id=5428&nowrap=true#60945

The work units are now just 10 molecules. They vary in size from 10 to 20 atoms per molecule. All molecules in a WU are the same size. Test WUs (smallest and largest sized molecules) pass on my GTX 1080 (8GB) test machine without failing.

The CPU fallback part was left over from testing; it should have been removed but appears it was not.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61130 - Posted: 31 Jan 2024 | 13:33:51 UTC - in response to Message 61129.

Thanks for the info Steve.

In general, I don't have much problem with using a large amount of VRAM, if that's what you require for your science goals. Personally I just wish to have expectations set so that I can setup my hosts accordingly. If VRAM use is low, I can set my host to run multiple tasks at a time for better efficiency. if VRAM use is high, I'll need to cut it back to only 2 or 1 tasks per GPU, which hurts overall efficiency on my end and requires me to reconfigure some things, but it's fine if that's how they will be. I just prefer to know which way it will be so that I don't leave it in a bad configuration and cause errors.

the bigger problem for me (and maybe many others) was the batch yesterday with VERY high system memory use per task. when system ram filled up it would crash the system, which requires some more manual intervention to get it running again. anyone with multi-GPU would be at risk there. just something to consider.

for overall VRAM use, again you can require whatever you need for your science goals. but you might consider making sure you can at least keep them under 8GB. I'd say many people on GPUGRID these days have a GPU with at least 8GB, all of mine have 12GB, and fewer people have 16GB+. if you can keep them below 8GB I think you'll be able to maintain a large pool of users rather than dealing with the tasks running out of memory and having to be resent multiple times to land on a host with enough VRAM.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61131 - Posted: 31 Jan 2024 | 13:58:20 UTC - in response to Message 61126.

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU


I'm getting several of these also. this is a problem too. you can always tell when the task basically stalls with almost no progress.


____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61132 - Posted: 31 Jan 2024 | 14:15:17 UTC

My CPU fallback task has now completed and validated, in not much longer than is usual for tasks on that host. I assume it was a shortened test task, running on a slower device? I now have just completed what looks like a similar task, with similarly large jumps in progress %age, but much more quickly. Task 33765553

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61133 - Posted: 31 Jan 2024 | 14:44:43 UTC - in response to Message 61132.

This is still very much a beta app.

We will continue to explore different WU sizes and application settings (with better local testing on our internal hardware before sending them out).

This app is the first time it has been possible to run QM calculations on GPUs. The underlying software was primarily designed for the latest generation of professional cards, e.g. the A100s that are used in HPC centres. It is proving challenging for us to port the code to GPUGRID consumer hardware. We are also looking into how a Windows port can be done.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61134 - Posted: 31 Jan 2024 | 14:51:32 UTC - in response to Message 61133.

No problem Steve. I definitely understand the beta aspect of this and the need to test things. I’m just giving honest feedback from my POV. Sometimes it’s hard to tell if a radical change in behavior is intended or a sign of some problem or misconfiguration.

Maybe it’s not possible for all the various molecules you want to test, but I feel the size of the previous large batch last week was very appropriate: moderate VRAM use and consistent size/runtimes. Those worked well with the consumer hardware.

Oh if everyone had A100s with 40-80GB of VRAM life would be nice LOL.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61135 - Posted: 31 Jan 2024 | 16:33:38 UTC

I had an odd work unit come through (and just abandoned). I have not had any issues with these work units so thought I would mention this one specifically.

https://www.gpugrid.net/result.php?resultid=33764946

I think there was a memory error with it but I am not very skilled at reading the results. It hung at ~75% but I let it work for 5 hours (honestly, I just didn't notice that it was hung up...).

When looking at properties of the work unit:
Virtual memory: 56GB
Working set size: 3.59GB

I thought this was an odd one so thought I would post.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61136 - Posted: 31 Jan 2024 | 17:03:21 UTC - in response to Message 61135.

I had an odd work unit come through (and just abandoned). I have not had any issues with these work units so thought I would mention this one specifically.

https://www.gpugrid.net/result.php?resultid=33764946

I think there was a memory error with it but I am not very skilled at reading the results. It hung at ~75% but I let it work for 5 hours (honestly, I just didn't notice that it was hung up...).

When looking at properties of the work unit:
Virtual memory: 56GB
Working set size: 3.59GB

I thought this was an odd one so thought I would post.


Yeah you can see several out of memory errors. Are you running more than one at a time?

I’ve had many like this. And many that seem to just fall back to CPU without any reason and get stuck for a long time. I’ve been aborting them when I notice. But it is troublesome :(
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61137 - Posted: 31 Jan 2024 | 17:19:14 UTC - in response to Message 61136.



Yeah you can see several out of memory errors. Are you running more than one at a time?

I’ve had many like this. And many that seem to just fall back to CPU without any reason and get stuck for a long time. I’ve been aborting them when I notice. But it is troublesome :(



I have been running 2x for these (I can't get them to run 3x or 4x via the app_config file, but it doesn't look like there are any queued tasks waiting to start).

Good to know that others have seen this too! I have seen a MASSIVE reduction in time these tasks take today.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61138 - Posted: 31 Jan 2024 | 18:15:03 UTC

I’m now getting a 3rd type of error across all of my hosts.

“AssertionError”

https://www.gpugrid.net/result.php?resultid=33766654
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61139 - Posted: 31 Jan 2024 | 18:25:02 UTC - in response to Message 61138.

I've had a few of those too, mainly of the form

File "/hdd/boinc-client/slots/6/lib/python3.11/site-packages/gpu4pyscf/df/grad/rhf.py", line 163, in get_jk
assert k1-k0 <= block_size

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 349
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61140 - Posted: 1 Feb 2024 | 4:17:42 UTC

150,000 credits for a few hundred seconds? I'm in! ;)
https://www.gpugrid.net/result.php?resultid=33771102
https://www.gpugrid.net/result.php?resultid=33771333
https://www.gpugrid.net/result.php?resultid=33771431
https://www.gpugrid.net/result.php?resultid=33771446
https://www.gpugrid.net/result.php?resultid=33771539

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61141 - Posted: 1 Feb 2024 | 7:48:55 UTC - in response to Message 61131.

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU


I'm getting several of these also. this is a problem too. you can always tell when the task basically stalls with almost no progress.

I had only those on one of my machines.
Apparently it had lost sight of the GPU for crunching.
Rebooting brought back the Nvidia driver to the BOINC client.

Apart from this, I found out that I can't run these tasks alongside Private GFN Server's tasks on a 6 GB GPU. So I turned the PYSCFbeta tasks off for this machine, as I often have to wait for tasks to download from GPUGrid, and I don't want my GPUs to run idle.
____________
Greetings, Jens

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61145 - Posted: 1 Feb 2024 | 21:47:40 UTC

Did we encounter this one already?

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:21:03 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper: running bin/python (bin/conda-unpack)
14:21:38 (24335): bin/python exited; CPU time 0.223114
14:21:38 (24335): wrapper: running bin/tar (xjvf input.tar.bz2)
14:21:39 (24335): bin/tar exited; CPU time 0.005282
14:21:39 (24335): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/4/bin
+++ dirname /var/lib/boinc-client/slots/4/bin
++ local full_path_env=/var/lib/boinc-client/slots/4
+++ basename /var/lib/boinc-client/slots/4
++ local env_name=4
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(4) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/4/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/4/etc/conda/activate.d ']'
+ export PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/4/tmp
+ TMP=/var/lib/boinc-client/slots/4/tmp
+ mkdir -p /var/lib/boinc-client/slots/4/tmp
+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ echo 'Running PySCF'
+ python compute_dft.py
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Traceback (most recent call last):
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 253, in _jitify_prep
name, options, headers, include_names = jitify.jitify(source, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/cuda/jitify.pyx", line 63, in cupy.cuda.jitify.jitify
File "cupy/cuda/jitify.pyx", line 88, in cupy.cuda.jitify.jitify
RuntimeError: Runtime compilation failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/var/lib/boinc-client/slots/4/compute_dft.py", line 125, in <module>
e,f,dip,q = compute_gpu(mol)
^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/compute_dft.py", line 32, in compute_gpu
e_dft = mf.kernel() # compute total energy
^^^^^^^^^^^
File "<string>", line 2, in kernel
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 586, in scf
_kernel(self, self.conv_tol, self.conv_tol_grad,
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 393, in _kernel
mf.init_workflow(dm0=dm)
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 63, in init_workflow
rks.initialize_grids(mf, mf.mol, dm0)
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 80, in initialize_grids
ks.grids = prune_small_rho_grids_(ks, ks.mol, dm, ks.grids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 49, in prune_small_rho_grids_
logger.debug(grids, 'Drop grids %d', grids.weights.size - cupy.count_nonzero(idx))
^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/_sorting/count.py", line 24, in count_nonzero
return _count_nonzero(a, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/_core/_reduction.pyx", line 608, in cupy._core._reduction._SimpleReductionKernel.__call__
File "cupy/_core/_reduction.pyx", line 364, in cupy._core._reduction._AbstractReductionKernel._call
File "cupy/_core/_cub_reduction.pyx", line 701, in cupy._core._cub_reduction._try_to_call_cub_reduction
File "cupy/_core/_cub_reduction.pyx", line 538, in cupy._core._cub_reduction._launch_cub
File "cupy/_core/_cub_reduction.pyx", line 473, in cupy._core._cub_reduction._cub_two_pass_launch
File "cupy/_util.pyx", line 64, in cupy._util.memoize.decorator.ret
File "cupy/_core/_cub_reduction.pyx", line 246, in cupy._core._cub_reduction._SimpleCubReductionKernel_get_cached_function
File "cupy/_core/_cub_reduction.pyx", line 231, in cupy._core._cub_reduction._create_cub_reduction_function
File "cupy/_core/core.pyx", line 2251, in cupy._core.core.compile_with_cache
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 496, in _compile_module_with_cache
return _compile_with_cache_cuda(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 574, in _compile_with_cache_cuda
ptx, mapping = compile_using_nvrtc(
^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 322, in compile_using_nvrtc
return _compile(source, options, cu_path,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 287, in _compile
options, headers, include_names = _jitify_prep(
^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 260, in _jitify_prep
raise JitifyException(str(cex))
cupy.cuda.compiler.JitifyException: Runtime compilation failed
14:23:34 (24335): bin/bash exited; CPU time 14.043607
14:23:34 (24335): app exit status: 0x1
14:23:34 (24335): called boinc_finish(195)

</stderr_txt>
]]>

____________
Greetings, Jens

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61146 - Posted: 1 Feb 2024 | 21:54:29 UTC - in response to Message 61145.

that looks like a driver issue.

but something else I noticed is that these tasks for the most part are having a very high failure rate. 30-50% on most hosts.

there are a few hosts that have few or no errors however, and all of them are hosts with 24-48GB of VRAM. so it seems something like 30-50% of the tasks require more than 12-16GB.

I'm sure the project has a very large error percentage to sort through, as there aren't enough 24-48GB GPUs to catch all the resends.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 349
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61147 - Posted: 1 Feb 2024 | 22:22:48 UTC
Last modified: 1 Feb 2024 | 22:26:22 UTC

The present batch has a far worse failure ratio than the previous one.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61149 - Posted: 2 Feb 2024 | 2:21:12 UTC - in response to Message 61146.
Last modified: 2 Feb 2024 | 2:21:54 UTC

that looks like a driver issue.

but something else I noticed is that these tasks for the most part are having a very high failure rate. 30-50% on most hosts.

there are a few hosts that have few or no errors however, and all of them are hosts with 24-48GB of VRAM. so it seems something like 30-50% of the tasks require more than 12-16GB.

I'm sure the project has a very large error percentage to sort through, as there aren't enough 24-48GB GPUs to catch all the resends.


This is 100% correct.

Our system with 2x RTX A6000 (48GB of VRAM) has had 500 valid results and no errors. They are running tasks at 2x and they seem to run really well (https://www.gpugrid.net/results.php?hostid=616410).

In one of our systems with 3x RTX A4500 GPUs (20GB), as soon as I changed from running these tasks at 2x to 1x, the error rate greatly improved (https://www.gpugrid.net/results.php?hostid=616409). Since making the change I have had 14 tasks in a row without errors.

When I am back in the classroom I think I will be changing anything equal to, or less than, 24GB to only run one task in order to improve the valid rate.

Has anyone tried running MPS with these tasks, and would it make a difference in the allocation of resources to successfully run 2x? Just curious about thoughts.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61150 - Posted: 2 Feb 2024 | 2:57:47 UTC

Last week, I had a 100% success rate. This week, it's a different story. Maybe it's time to step back and dial it down a bit. You have to work with the resources that you have, not the ones that you wish you had.


Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61151 - Posted: 2 Feb 2024 | 4:07:19 UTC - in response to Message 61149.

Boca,

How much VRAM do you see actually being used on some of these tasks? Mind watching a few? You’ll have to run a watch command to see continuous output of VRAM utilization since the usage isn’t constant. It spikes up and down. I’m just curious how much is actually needed. Most of the tasks I was running I would see spike up to about 8GB. But i assume the tasks that needed more just failed instead so I can’t know how much they are trying to use. Even though these Titan Vs are great DP performers they only have 12GB VRAM. Even most of the 16GB cards like V100 and P100 are seeing very high error rates.

MPS helps. But not enough with this current batch. I was getting good throughput with running 3x tasks at once on the batches last week.
____________

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61152 - Posted: 2 Feb 2024 | 6:43:20 UTC - in response to Message 61146.

that looks like a driver issue.

That's what Pascal (?) wrote in the Q&A as well.

Had three tasks on that host, and two of them failed.
____________
Greetings, Jens

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61153 - Posted: 2 Feb 2024 | 8:33:53 UTC

Not everyone has a 5000-euro graphics card with 24 GB of VRAM or more; you should think of the more modest among us.
I have an RTX 4060 and a GTX 1650, but I get nothing but errors, for example.
I think most of the people who compute for GPUGRID and are eagerly waiting for work for their GPUs are like me.

I keep thinking the problem is my system installation, so I reformat and do a clean install again, hoping it will work properly. In vain, because the problem comes from your faulty work units.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 349
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61154 - Posted: 2 Feb 2024 | 10:54:03 UTC
Last modified: 2 Feb 2024 | 11:32:19 UTC

I've disabled getting new GPUGrid tasks on my hosts with a "small" amount (below 24GB) of GPU memory.
This gigantic memory requirement is ridiculous in my opinion.
This is not a user error; if the workunits can't be changed, then the project should not send these tasks to hosts that have less than ~20GB of GPU memory.
There could be another solution, if the workunits allocated memory in a less careless way.
I've started a task on my RTX 4090 (it has 24GiB RAM), and I've monitored the memory usage:

idle: 305 MiB
task starting: 895 MiB
GPU usage rises: 6115 MiB
GPU usage drops: 7105 MiB
GPU usage 100%: 7205 MiB
GPU usage drops: 8495 MiB
GPU usage rises: 9961 MiB
GPU usage drops: 14327 MiB (it would have failed on my GTX 1080 Ti at this point)
GPU usage rises: 6323 MiB
GPU usage drops: 15945 MiB
GPU usage 100%: 6205 MiB
...and so on
So the memory usage doubles at some points of processing for a short while, and this causes the workunits to fail on GPUs that have a "small" amount of memory. If this behaviour could be eliminated, many more hosts could process these workunits.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61155 - Posted: 2 Feb 2024 | 11:59:29 UTC - in response to Message 61154.

Nothing to be done at this time for my currently working GPUs with PYSCFbeta tasks:
5x GTX 1650 4GB, 1x GTX 1650 SUPER 4GB, 1x GTX 1660 Ti 6GB.
100% errors with the current PYSCFbeta tasks, and now I can see why...
I've disabled Quantum chemistry on GPU (beta) in my project preferences while waiting for a correction, if any.
Conversely, they are performing fine with ATMbeta tasks.

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 7,093,713,894
RAC: 40,189,547
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61156 - Posted: 2 Feb 2024 | 12:09:55 UTC

I agree it does seem these tasks have a spike in memory usage. I "rented" an RTX A5000 GPU which also has 24 GB memory, and running 1 task at a time, at least the first task completed:
https://www.gpugrid.net/workunit.php?wuid=27678500
I will try a few more

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61157 - Posted: 2 Feb 2024 | 12:16:07 UTC - in response to Message 61155.
Last modified: 2 Feb 2024 | 12:17:30 UTC


I've disabled Quantum chemistry on GPU (beta) at my project preferences in the wait for a correction, if any.
Conversely, they are performing right with ATMbeta tasks.

Exactly the same here. After 29 consecutive errors on a RTX4070Ti, I have disabled 'Quantum chemistry on GPU (beta)'.

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61158 - Posted: 2 Feb 2024 | 12:25:43 UTC

I have one machine still taking on GPUGrid tasks.
The others are using their GPUs for the Tour de Primes over at PrimeGrid only.
If there really is a driver issue (see earlier post and answers) with this machine I'd like to know which, as its GPU is running fine on other BOINC projects apart from SRBase. Not being able to run SRBase is related to libc, not the GPU driver.
____________
Greetings, Jens

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61159 - Posted: 2 Feb 2024 | 12:37:34 UTC
Last modified: 2 Feb 2024 | 12:38:16 UTC

Hello,
is there a way to simulate VRAM for the GPU using RAM or an SSD under Linux?
That would avoid the compute errors.
I increased the swap file to 50 GB, as under Windows, but it does not work.
Thanks
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61160 - Posted: 2 Feb 2024 | 13:36:47 UTC - in response to Message 61151.
Last modified: 2 Feb 2024 | 13:42:31 UTC

Boca,

How much VRAM do you see actually being used on some of these tasks? Mind watching a few? You’ll have to run a watch command to see continuous output of VRAM utilization since the usage isn’t constant. It spikes up and down. I’m just curious how much is actually needed. Most of the tasks I was running I would see spike up to about 8GB. But i assume the tasks that needed more just failed instead so I can’t know how much they are trying to use. Even though these Titan Vs are great DP performers they only have 12GB VRAM. Even most of the 16GB cards like V100 and P100 are seeing very high error rates.

MPS helps. But not enough with this current batch. I was getting good throughput with running 3x tasks at once on the batches last week.


This was wild...

For a single work unit:

Hovers around 3-4GB
Rises to 8-9GB
Spikes to ~11GB regularly.

Highest Spike (seen): 12.5GB
Highest spike (estimated based on Psensor): ~20GB. Additionally, Psensor caught a peak memory usage spike of 76% of the 48GB of the RTX A6000 for one work unit, but I did not see when this happened or whether it happened at all.

I graphically captured the VRAM usage for one work unit. I have no idea how to embed images here, so here is a Google Doc:

https://docs.google.com/document/d/1xpOpNJ93finciJQW7U07dMHOycSVlbYq9G6h0Xg7GtA/edit?usp=sharing

EDIT: I think they just purged these work units from the server?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61161 - Posted: 2 Feb 2024 | 14:02:10 UTC - in response to Message 61160.
Last modified: 2 Feb 2024 | 14:06:34 UTC

thanks. that's kind of what I expected was happening.

and yeah, they must have seen the problems and just abandoned the remainder of this run to reassess how to tweak them.

it seemed like they tweaked the input files to give the assertion error instead of just hanging like the earlier ones (index numbers below ~1000). the early tasks would hang with the fallback-to-CPU issue, and after that it changed to the assertion error if it ran out of vram. that was better behavior for the user since a quick failure is better than hanging for hours on end doing nothing. but they were probably getting back a majority of errors as the VRAM requirements grew beyond what most people have for available hardware.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61162 - Posted: 2 Feb 2024 | 15:30:46 UTC

New batch just come through- seeing the same VRAM spikes and patterns.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61163 - Posted: 2 Feb 2024 | 15:32:14 UTC - in response to Message 61162.
Last modified: 2 Feb 2024 | 15:39:40 UTC

I'm seeing the same spikes, but so far so good. biggest spike i saw was ~9GB

no errors ...yet.

spoke too soon. did get one failure

https://gpugrid.net/result.php?resultid=33801391
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61164 - Posted: 2 Feb 2024 | 15:37:15 UTC - in response to Message 61163.

Hi. I have been tweaking settings. All WUs I have tried now work on my 1080 (8GB).


Sending a new batch of smaller WUs out now. From our end we will need to see how to assign WUs based on GPU memory. (Previous apps have been compute-bound rather than GPU-memory-bound and have only been assigned based on driver version.)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61165 - Posted: 2 Feb 2024 | 16:07:04 UTC - in response to Message 61164.

seeing some errors on Titan V (12GB). not a huge amount, but certainly a noteworthy amount. maybe you can correlate these specific WUs and see why this kind (number of atoms or molecules?) might be requesting more VRAM than the ones you tried on your 1080.

most of the ones i've observed running will hover around ~3-4GB constant VRAM use, with spikes to the 8-11GB range.

https://gpugrid.net/result.php?resultid=33802055
https://gpugrid.net/result.php?resultid=33801492
https://gpugrid.net/result.php?resultid=33801447
https://gpugrid.net/result.php?resultid=33801391
https://gpugrid.net/result.php?resultid=33801238
____________

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61166 - Posted: 2 Feb 2024 | 16:08:36 UTC

Still seeing a vram spike above 8GB

2024/02/02 08:07:08.774, 71, 100 %, 40 %, 8997 MiB
2024/02/02 08:07:09.774, 71, 100 %, 34 %, 8999 MiB
2024/02/02 08:07:10.775, 71, 22 %, 1 %, 8989 MiB
2024/02/02 08:07:11.775, 70, 96 %, 2 %, 10209 MiB
2024/02/02 08:07:12.775, 71, 98 %, 7 %, 10721 MiB
2024/02/02 08:07:13.775, 71, 93 %, 8 %, 5023 MiB
2024/02/02 08:07:14.775, 72, 96 %, 24 %, 5019 MiB
2024/02/02 08:07:15.776, 72, 100 %, 0 %, 5019 MiB
2024/02/02 08:07:16.776, 72, 100 %, 0 %, 5019 MiB

Seems like credit has gone down from 150K to 15K.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61167 - Posted: 2 Feb 2024 | 16:20:20 UTC - in response to Message 61166.

Agreed- it seems that there are fewer spikes and most of them are in the 8-9GB range. A few higher but it seems less frequent? Difficult to quantify an actual difference since the work units can be so different. Is there a difference in VRAM usage or does the actual work unit just happen to need less VRAM?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61168 - Posted: 2 Feb 2024 | 16:40:21 UTC

Seems like credit has gone down from 150K to 15K.
____________

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61169 - Posted: 2 Feb 2024 | 17:33:47 UTC
Last modified: 2 Feb 2024 | 17:34:29 UTC

Occasionally an 8GB VRAM card is not sufficient. Still seeing errors on these cards.

Example: two of the hosts below have 8GB of VRAM, while the one that returned successfully has 16GB.
http://gpugrid.net/workunit.php?wuid=27683202

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61171 - Posted: 2 Feb 2024 | 17:55:00 UTC - in response to Message 61169.

Even that 16GB GPU had one failure with the new v3 batch

http://gpugrid.net/result.php?resultid=33802340
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61172 - Posted: 2 Feb 2024 | 18:47:46 UTC - in response to Message 61171.

Even that 16GB GPU had one failure with the new v3 batch

http://gpugrid.net/result.php?resultid=33802340



Based on the times of tasks, it looks like those were running at 1x?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61173 - Posted: 2 Feb 2024 | 18:52:03 UTC
Last modified: 2 Feb 2024 | 18:55:03 UTC

Good evening. On my end it works well now.
I just finished 5 work units without problems with my GTX 1650 and my RTX 4060.
Let's hope this continues.
I reformatted my PC today and reinstalled Linux Mint 21.3 once again.

https://www.gpugrid.net/results.php?userid=563937
____________

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61174 - Posted: 2 Feb 2024 | 19:00:05 UTC - in response to Message 61168.

14 tasks of the latest batch completed successfully without any error.
Great progress!

Seems like credit has gone down from 150K to 15K.

Perhaps 150k was a little too generous. But 15k is not on par with other GPU projects. I expect there will be fairer credits again soon - with the next batch?

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61175 - Posted: 2 Feb 2024 | 19:04:15 UTC - in response to Message 61174.

14 tasks of the latest batch completed successfully without any error.
Great progress!

Seems like credit has gone down from 150K to 15K.

Perhaps 150k was a little too generous. But 15k is not on par with other GPU projects. I expect there will be fairer credits again soon - with the next batch?


Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61176 - Posted: 2 Feb 2024 | 19:13:14 UTC - in response to Message 61175.

sometimes more than 12GB as about 4% (16 out of 372) of my tasks failed all on GPUs with 12GB, all running at 1x only for the v3 batch. not sure how much VRAM is needed to be 100% successful. I did have one success that was a resend of one of your errors from a 4090 24GB. so i'm guessing you were running that one at 2x and got unlucky with two big tasks at the same time.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61177 - Posted: 2 Feb 2024 | 19:30:11 UTC - in response to Message 61176.

sometimes more than 12GB as about 4% (16 out of 372) of my tasks failed all on GPUs with 12GB, all running at 1x only for the v3 batch. not sure how much VRAM is needed to be 100% successful. I did have one success that was a resend of one of your errors from a 4090 24GB. so i'm guessing you were running that one at 2x and got unlucky with two big tasks at the same time.


Correct- I was playing around with the two 4090 systems running these to make some comparisons. And you are also correct- it seems that even with 24GB, running 2x is still not really ideal. Those random, huge spikes seem to find each other when running 2x.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61178 - Posted: 2 Feb 2024 | 19:40:54 UTC - in response to Message 61175.

Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

The GPU is an MSI 4070 Ti GAMING X SLIM with 12GB GDDR6X, run at 1x. Obviously sufficient for the latest batch to run flawlessly.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61179 - Posted: 2 Feb 2024 | 19:43:15 UTC - in response to Message 61174.

14 tasks of the latest batch completed successfully without any error.
Great progress!

Seems like credit has gone down from 150K to 15K.

Perhaps 150k was a little too generous. But 15k is not on par with other GPU projects. I expect there will be fairer credits again soon - with the next batch?


For someone with a 3080 Ti card, it would be better to run ATMbeta tasks first and then Quantum chemistry (if the former has no available tasks), if granted credit is an important factor.

For me, I have a 3080 Ti and a P100, so I will likely run ATMbeta on the 3080 Ti and Quantum chemistry on the P100, if both tasks are available.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61180 - Posted: 2 Feb 2024 | 19:49:01 UTC - in response to Message 61178.

Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

The GPU is an MSI 4070 Ti GAMING X SLIM with 12GB GDDR6X, run at 1x. Obviously sufficient for the latest batch to run flawlessly.



Thanks for the info. If you don't mind me asking- how many ran (in a row) without any errors?

CallMeFoxie
Send message
Joined: 6 Jan 21
Posts: 2
Credit: 24,835,750
RAC: 2,865
Level
Pro
Scientific publications
wat
Message 61181 - Posted: 2 Feb 2024 | 22:28:30 UTC

I have got a rig with 9 pieces of P106, which are slightly modified GTX1060 6GB used for Ethereum mining back in the day. I can run only two GPUgrid tasks at once (main CPU is only a dual core Celeron) but so far I have had one error and several tasks finish and validate. Hoping for good results for the rest!

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61182 - Posted: 3 Feb 2024 | 0:00:22 UTC - in response to Message 61180.
Last modified: 3 Feb 2024 | 0:02:16 UTC

Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

The GPU is an MSI 4070 Ti GAMING X SLIM with 12GB GDDR6X, run at 1x. Obviously sufficient for the latest batch to run flawlessly.



Thanks for the info. If you don't mind me asking- how many ran (in a row) without any errors?

14 consecutive tasks without any error.

CallMeFoxie
Send message
Joined: 6 Jan 21
Posts: 2
Credit: 24,835,750
RAC: 2,865
Level
Pro
Scientific publications
wat
Message 61183 - Posted: 3 Feb 2024 | 11:04:43 UTC - in response to Message 61181.

I have got a rig with 9 pieces of P106, which are slightly modified GTX1060 6GB used for Ethereum mining back in the day. I can run only two GPUgrid tasks at once (main CPU is only a dual core Celeron) but so far I have had one error and several tasks finish and validate. Hoping for good results for the rest!


So I managed to get 11 tasks, of which 9 passed and validated and 2 failed some time into the process.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61184 - Posted: 3 Feb 2024 | 13:59:03 UTC - in response to Message 61164.

...From our end we will need to see how to assign WU's based on GPU memory. (Previous apps have been compute bound rather than GPU memory bound and have only been assigned based on driver version)

Perhaps (I don't know whether it is viable) a better solution would be to include some code that limits peak VRAM according to the device actually assigned.
The reason, illustrated with an example:
my host #482132 is shown by BOINC as [2] NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
This is true for Device 0: NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
But the other device in this host, Device 1, would have to be reported as NVIDIA NVIDIA GeForce GTX 1650 SUPER (3895MB) driver: 550.40.
Tasks sent according to Device 0's VRAM (6 GB) would likely run out of memory when they land on Device 1 (4 GB VRAM).

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61185 - Posted: 3 Feb 2024 | 14:24:40 UTC - in response to Message 61184.
Last modified: 3 Feb 2024 | 14:25:45 UTC

...From our end we will need to see how to assign WU's based on GPU memory. (Previous apps have been compute bound rather than GPU memory bound and have only been assigned based on driver version)

Perhaps (I don't know whether it is viable) a better solution would be to include some code that limits peak VRAM according to the device actually assigned.
The reason, illustrated with an example:
my host #482132 is shown by BOINC as [2] NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
This is true for Device 0: NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
But the other device in this host, Device 1, would have to be reported as NVIDIA NVIDIA GeForce GTX 1650 SUPER (3895MB) driver: 550.40.
Tasks sent according to Device 0's VRAM (6 GB) would likely run out of memory when they land on Device 1 (4 GB VRAM).


The only caveat is that neither the application nor the project has any ability to select which GPUs you have or which GPU will run a task. In your example, if a task requiring more than 4 GB were sent, the project would have no idea that GPU 1 only has 4 GB. The project can only see the "first/best" GPU in the system, which is what your BOINC client reports, and the BOINC client is the one that decides which tasks go to which GPU; the science application is launched after the GPU selection has already been made. Similarly, BOINC has no mechanism for assigning tasks based on GPU VRAM use.

You will have to manage things yourself after observing behavior. If you notice that one GPU consistently has too little VRAM, you can exclude it from running the QChem app by adding an <exclude_gpu> statement to the cc_config.xml file, for example:

<cc_config>
  <options>
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <app>PYSCFbeta</app>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61186 - Posted: 3 Feb 2024 | 18:49:12 UTC - in response to Message 61185.
Last modified: 3 Feb 2024 | 18:53:24 UTC

you will have to manage things yourself after observing behavior.

Certainly.
Your advice is always very appreciated.
An update of the minimum requirements would be welcome when PYSCF tasks reach the production stage, as a help for excluding hosts/GPUs that cannot meet them.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61187 - Posted: 3 Feb 2024 | 21:04:18 UTC - in response to Message 61186.

you will have to manage things yourself after observing behavior.

Certainly.
Your advice is always very appreciated.
An update of the minimum requirements would be welcome when PYSCF tasks reach the production stage, as a help for excluding hosts/GPUs that cannot meet them.


I would imagine something like what WCG posted may be useful, showing system requirements such as memory, disk space, one-time download file size, etc.: https://www.worldcommunitygrid.org/help/topic.s?shortName=minimumreq.
Apart from WCG not running smoothly since the IBM migration, I notice that the WCG system requirements are outdated. I guess it takes effort to maintain such information and keep it up to date.

So far, this is my limited knowledge about the quantum chemistry tasks, as I'm still learning. Anyone is welcome to chime in on the system requirements.
1) The one-time download is about 2 GB. Be prepared to wait for hours if you have a very slow internet connection.
2) The more GPU VRAM the better. Cards with 24 GB or more seem to perform best.
3) GPUs with faster memory bandwidth and faster FP64 have the advantage of shorter run times; typically these are datacenter/server/workstation cards.

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61188 - Posted: 4 Feb 2024 | 8:18:18 UTC

Implementing a way to choose work with particular hardware demands through the project preferences would be nice as well.
After lots of problems with the ECM subproject claiming too much system memory, yoyo@home divided that subproject into smaller and bigger tasks, which can each be ticked (or left unticked) in the project preferences.
So my suggestion is to hand out work in 4, 6, 8, 12, 16 and 24 GB flavours which the user can choose from.
Since the machine's own system also claims GPU memory, it should naturally be considered to leave about half a gigabyte untouched by the GPUGRID tasks.
____________
Greetings, Jens

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61189 - Posted: 5 Feb 2024 | 11:04:22 UTC - in response to Message 61188.

OK, so it seems like things have improved with the latest settings.

I am keeping the WUs short (10 molecule configurations per WU) to minimize the effect of errors.

I am going to send out some batches of WUs to get through a large dataset we have.

I think this

After lots of problems with the ECM subproject claiming too much system memory, yoyo@home divided that subproject into smaller and bigger tasks, which can each be ticked (or left unticked) in the project preferences.
So my suggestion is to hand out work in 4, 6, 8, 12, 16 and 24 GB flavours which the user can choose from.

might be the most workable solution for the future once the current batch of work is done.

The memory use is mainly determined by the size of the molecule and the number of heavy elements, so before WUs are sent out we can make a rough estimate of the memory use. There is an element of randomness that comes from high memory use for specific physical configurations that are harder to converge; we cannot estimate this before sending, and it only shows up during the calculation.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61190 - Posted: 5 Feb 2024 | 11:35:20 UTC - in response to Message 61189.

Seems like credit has gone down from 150K to 15K?
____________

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 7,093,713,894
RAC: 40,189,547
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61191 - Posted: 5 Feb 2024 | 11:42:50 UTC - in response to Message 61190.

Seems like credit has gone down from 150K to 15K?

Yes, and the memory use this morning seems to require running 1 at a time on GPUs with less than 16 GB, which hurts performance even more.

Steve, what determines point value for a task?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61192 - Posted: 5 Feb 2024 | 12:03:37 UTC

For the moment it's going fairly well with the new work units: about one error in four.





Name inputs_v3_ace_pch_ms_gc_filt_af05_index_64000_to_64000-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND0521_0
Workunit 27684102
Created 5 Feb 2024 | 10:40:37 UTC
Sent 5 Feb 2024 | 10:47:37 UTC
Received 5 Feb 2024 | 10:49:50 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 617458
Report deadline 10 Feb 2024 | 10:47:37 UTC
Run time 45.93
CPU time 9.59
Validate state Invalid
Credit 0.00
Application version Quantum chemistry calculations on GPU v1.04 (cuda1121)
Stderr output

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
11:47:47 (5931): wrapper (7.7.26016): starting
11:48:16 (5931): wrapper (7.7.26016): starting
11:48:16 (5931): wrapper: running bin/python (bin/conda-unpack)
11:48:17 (5931): bin/python exited; CPU time 0.157053
11:48:17 (5931): wrapper: running bin/tar (xjvf input.tar.bz2)
11:48:18 (5931): bin/tar exited; CPU time 0.002953
11:48:18 (5931): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/home/pascal/slots/3/bin
+++ dirname /home/pascal/slots/3/bin
++ local full_path_env=/home/pascal/slots/3
+++ basename /home/pascal/slots/3
++ local env_name=3
++ '[' -n '' ']'
++ export CONDA_PREFIX=/home/pascal/slots/3
++ CONDA_PREFIX=/home/pascal/slots/3
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(3) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/home/pascal/slots/3/etc/conda/activate.d
++ '[' -d /home/pascal/slots/3/etc/conda/activate.d ']'
+ export PATH=/home/pascal/slots/3:/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/home/pascal/slots/3:/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/home/pascal/slots/3/tmp
+ TMP=/home/pascal/slots/3/tmp
+ mkdir -p /home/pascal/slots/3/tmp
+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ export CUDA_VISIBLE_DEVICES=1
+ CUDA_VISIBLE_DEVICES=1
+ export CUPY_CUDA_LIB_PATH=/home/pascal/slots/3/cupy
+ CUPY_CUDA_LIB_PATH=/home/pascal/slots/3/cupy
+ echo 'Running PySCF'
+ python compute_dft.py
/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
nao = 570
/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Traceback (most recent call last):
File "/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/lib/misc.py", line 1094, in __exit__
handler.result()
File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 52, in build_df
rsh_df.build(omega=omega)
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df.py", line 102, in build
self._cderi = cholesky_eri_gpu(intopt, mol, auxmol, self.cd_low, omega=omega)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df.py", line 256, in cholesky_eri_gpu
if lj>1: ints_slices = cart2sph(ints_slices, axis=1, ang=lj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cupy_helper.py", line 333, in cart2sph
t_sph = contract('min,ip->mpn', t_cart, c2s, out=out)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py", line 177, in contract
return cupy.asarray(einsum(pattern, a, b), order='C')
^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 676, in einsum
arr_out, sub_out = reduced_binary_einsum(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 418, in reduced_binary_einsum
tmp1, shapes1 = _flatten_transpose(arr1, [bs1, cs1, ts1])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 298, in _flatten_transpose
a.transpose(transpose_axes).reshape(
File "cupy/_core/core.pyx", line 752, in cupy._core.core._ndarray_base.reshape
File "cupy/_core/_routines_manipulation.pyx", line 81, in cupy._core._routines_manipulation._ndarray_reshape
File "cupy/_core/_routines_manipulation.pyx", line 357, in cupy._core._routines_manipulation._reshape
File "cupy/_core/core.pyx", line 611, in cupy._core.core._ndarray_base.copy
File "cupy/_core/core.pyx", line 570, in cupy._core.core._ndarray_base.astype
File "cupy/_core/core.pyx", line 132, in cupy._core.core.ndarray.__new__
File "cupy/_core/core.pyx", line 220, in cupy._core.core._ndarray_base._init
File "cupy/cuda/memory.pyx", line 740, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1426, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1447, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1118, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1139, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 1346, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
File "cupy/cuda/memory.pyx", line 1358, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 595,413,504 bytes (allocated so far: 3,207,694,336 bytes, limit set to: 3,684,158,668 bytes).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/pascal/slots/3/compute_dft.py", line 121, in <module>
e,f,dip,q = compute_gpu(mol)
^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/compute_dft.py", line 24, in compute_gpu
e_dft = mf.kernel() # compute total energy
^^^^^^^^^^^
File "<string>", line 2, in kernel
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 586, in scf
_kernel(self, self.conv_tol, self.conv_tol_grad,
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 393, in _kernel
mf.init_workflow(dm0=dm)
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 56, in init_workflow
with lib.call_in_background(build_df) as build:
File "/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/lib/misc.py", line 1096, in __exit__
raise ThreadRuntimeError('Error on thread %s:\n%s' % (self, e))
pyscf.lib.misc.ThreadRuntimeError: Error on thread <pyscf.lib.misc.call_in_background object at 0x7fec06934850>:
Out of memory allocating 595,413,504 bytes (allocated so far: 3,207,694,336 bytes, limit set to: 3,684,158,668 bytes).
11:48:31 (5931): bin/bash exited; CPU time 11.139443
11:48:31 (5931): app exit status: 0x1
11:48:31 (5931): called boinc_finish(195)

</stderr_txt>
]]>

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61193 - Posted: 5 Feb 2024 | 12:32:37 UTC

I'm seeing about a 10% failure rate with 12 GB cards.
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61194 - Posted: 5 Feb 2024 | 12:55:51 UTC - in response to Message 61193.

Credits should now be at 75k for the rest of the batch. They are meant to be consistent with the other apps, based on comparisons of runtimes on our test machines, but this is complicated with this new memory-intensive app. I will investigate before sending the next batch.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61195 - Posted: 5 Feb 2024 | 15:09:41 UTC

There are some tasks that spike over 10 GB. It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround? The momentary spikes could well be higher than the 10 GB recorded.

2024/02/05 07:06:39.675, 88 %, 1328 MHz, 5147 MiB, 115.28 W, 65
2024/02/05 07:06:40.678, 96 %, 1278 MHz, 5147 MiB, 117.58 W, 65
2024/02/05 07:06:41.688, 100 %, 1328 MHz, 5177 MiB, 111.94 W, 65
2024/02/05 07:06:42.691, 100 %, 1328 MHz, 6647 MiB, 70.23 W, 64
2024/02/05 07:06:43.694, 30 %, 1328 MHz, 8475 MiB, 69.65 W, 64
2024/02/05 07:06:44.697, 100 %, 1328 MHz, 9015 MiB, 81.81 W, 64
2024/02/05 07:06:45.700, 100 %, 1328 MHz, 9007 MiB, 46.32 W, 63
2024/02/05 07:06:46.705, 98 %, 1278 MHz, 9941 MiB, 46.08 W, 63
2024/02/05 07:06:47.708, 99 %, 1328 MHz, 10251 MiB, 57.06 W, 63
2024/02/05 07:06:48.711, 97 %, 1088 MHz, 4553 MiB, 133.72 W, 65
2024/02/05 07:06:49.714, 95 %, 1075 MHz, 4553 MiB, 132.99 W, 65

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61196 - Posted: 5 Feb 2024 | 16:21:57 UTC - in response to Message 61195.

Got a biggie. This one is 14.6 GB. I'm running a 16 GB card, one task per GPU.

2024/02/05 08:20:03.043, 100 %, 1328 MHz, 9604 MiB, 107.19 W, 71
2024/02/05 08:20:04.046, 94 %, 1328 MHz, 11970 MiB, 97.69 W, 71
2024/02/05 08:20:05.049, 99 %, 1328 MHz, 12130 MiB, 123.24 W, 70
2024/02/05 08:20:06.052, 100 %, 1316 MHz, 12130 MiB, 122.21 W, 71
2024/02/05 08:20:07.055, 100 %, 1328 MHz, 12130 MiB, 121.26 W, 71
2024/02/05 08:20:08.058, 100 %, 1328 MHz, 12130 MiB, 118.64 W, 71
2024/02/05 08:20:09.061, 17 %, 1328 MHz, 12116 MiB, 56.48 W, 70
2024/02/05 08:20:10.064, 95 %, 1189 MHz, 14646 MiB, 73.99 W, 71
2024/02/05 08:20:11.071, 99 %, 1139 MHz, 14646 MiB, 194.84 W, 71
2024/02/05 08:20:12.078, 96 %, 1316 MHz, 14650 MiB, 65.82 W, 70
2024/02/05 08:20:13.081, 85 %, 1328 MHz, 8952 MiB, 84.32 W, 70
2024/02/05 08:20:14.084, 100 %, 1075 MHz, 8952 MiB, 130.53 W, 71

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61197 - Posted: 5 Feb 2024 | 16:35:36 UTC - in response to Message 61196.
Last modified: 5 Feb 2024 | 16:36:34 UTC

Yeah, I think you'll only ever see the spike if you actually have the VRAM for it. If you don't have enough, the task will error out before hitting it and you'll never see it.

I'm just going to deal with the errors; cost of doing business, lol. I have my system set to a 70% active thread percentage (ATP) through MPS.

QChem gpu_usage set to 0.55
ATMbeta gpu_usage set to 0.44

This way, when both types of task are available, a GPU will run either ATMbeta+ATMbeta or ATMbeta+QChem, but never 2x QChem on the same GPU. I do this because ATMbeta uses a really small amount of GPU VRAM and can soak up some of the spare compute cycles without hurting QChem's VRAM headroom much. When a GPU is running only a single QChem task it isn't using all the compute it could (only 70%), so it may be a little slower, but Titan Vs are fast enough anyway: most tasks finish in about 6 minutes, with some outliers around 18 minutes.
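
For anyone wanting to copy this setup: those per-app shares are set with gpu_usage entries in an app_config.xml in the GPUGRID project directory. A minimal sketch, assuming the app short names are PYSCFbeta and ATMbeta and a packaged-client data path; check the <app> names in your own client_state.xml and adjust the path, shares and cpu_usage to taste:

sudo tee /var/lib/boinc/projects/www.gpugrid.net/app_config.xml >/dev/null <<'EOF'
<app_config>
  <!-- 0.55 + 0.55 > 1, so two QChem tasks never share a GPU;
       0.44 + 0.55 and 0.44 + 0.44 both fit on one GPU -->
  <app>
    <name>PYSCFbeta</name>
    <gpu_versions>
      <gpu_usage>0.55</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>ATMbeta</name>
    <gpu_versions>
      <gpu_usage>0.44</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
EOF

Then restart the BOINC client, or use the manager's Options > Read config files, so it picks up the new app_config.xml.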
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61198 - Posted: 5 Feb 2024 | 16:37:51 UTC - in response to Message 61196.

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61199 - Posted: 5 Feb 2024 | 16:53:51 UTC - in response to Message 61197.



QChem gpu_usage set to 0.55
ATMbeta gpu_usage set to 0.44




We did this as well this morning for the 4090 GPUs, since they have 24 GB, but paired with E@H work. Too little VRAM to run QChem at 2x, but too much compute power left on the table running it at 1x.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61200 - Posted: 5 Feb 2024 | 17:02:55 UTC - in response to Message 61198.
Last modified: 5 Feb 2024 | 17:20:10 UTC

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.


0 failures after 19 completed tasks on one P100 with 16 GB.

So far 14.6 GB is the highest I've seen with 1-second interval monitoring.

More than half of the tasks processed momentarily hit 8 GB or more. I didn't record any actual data, just watched nvidia-smi from time to time.

Edit: another task with more than 12 GB, but with an ominous 6666 MiB, lol
2024/02/05 09:17:58.869, 99 %, 1328 MHz, 10712 MiB, 131.69 W, 70
2024/02/05 09:17:59.872, 100 %, 1328 MHz, 10712 MiB, 101.87 W, 70
2024/02/05 09:18:00.877, 100 %, 1328 MHz, 10700 MiB, 50.15 W, 69
2024/02/05 09:18:01.880, 92 %, 1240 MHz, 11790 MiB, 54.34 W, 69
2024/02/05 09:18:02.883, 95 %, 1240 MHz, 12364 MiB, 53.20 W, 69
2024/02/05 09:18:03.886, 83 %, 1126 MHz, 6666 MiB, 137.77 W, 70
2024/02/05 09:18:04.889, 100 %, 1075 MHz, 6666 MiB, 130.53 W, 71
2024/02/05 09:18:05.892, 92 %, 1164 MHz, 6666 MiB, 129.84 W, 71
2024/02/05 09:18:06.902, 100 %, 1063 MHz, 6666 MiB, 129.82 W, 71

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61201 - Posted: 6 Feb 2024 | 2:51:01 UTC - in response to Message 61198.

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.


I've been running all day across my 18x Titan Vs. The effective error rate is right around 5%, so 5% of the tasks needed more than 12 GB, running only 1 task per GPU.

I rented an A100 40GB for the day. Running 3x on this GPU with MPS set to 40%, it's done about 300 tasks and only 1 task failed from out of memory. The highest spike I saw was 39 GB, but it usually stays around 20 GB utilized.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61202 - Posted: 6 Feb 2024 | 5:02:10 UTC - in response to Message 61201.

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.


I've been running all day across my 18x Titan Vs. The effective error rate is right around 5%, so 5% of the tasks needed more than 12 GB, running only 1 task per GPU.

I rented an A100 40GB for the day. Running 3x on this GPU with MPS set to 40%, it's done about 300 tasks and only 1 task failed from out of memory. The highest spike I saw was 39 GB, but it usually stays around 20 GB utilized.



Wow, the A100 is powerful. I can't believe how fast it can chew through these (well, I can believe it, but it's still amazing). I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61203 - Posted: 6 Feb 2024 | 8:44:37 UTC

Well, I gave up; too many errors.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61204 - Posted: 6 Feb 2024 | 11:39:54 UTC - in response to Message 61202.

I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?


CUDA MPS has a setting called active thread percentage. It basically limits how many SMs of the GPU get used for each process. Without MPS, each process will call for all available SMs all the time, in separate contexts (MPS also shares a single context). I set that to 40%, so each task is only using 40% of the available SMs. With 3x running that’s slightly over provisioning the GPU, but it usually works well and runs faster than 3x without MPS. It also has the benefit of reducing VRAM use most of the time, but it doesn’t seem to limit these tasks much. The only caveat is that when you run low on work, the remaining one or two tasks won’t use all the GPU, instead using only the 40% and none of the rest of the idle GPU.
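
For reference, a rough sketch of how that cap can be set on Linux, in case anyone wants to experiment; the 40% value, device index and directories are just examples, and BOINC (and the science apps it launches) must see the same pipe directory in their environment so they attach to this daemon:

# Start the MPS control daemon for GPU 0 and cap each client process
# to roughly 40% of the SMs (values and paths are examples only).
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d
echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control

# To stop the daemon later:
echo quit | nvidia-cuda-mps-control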
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,238,627,382
RAC: 14,211,365
Level
Trp
Scientific publications
watwatwat
Message 61205 - Posted: 6 Feb 2024 | 13:11:48 UTC - in response to Message 61195.

It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround?

Have you tried NVITOP?
https://github.com/XuehaiPan/nvitop

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61206 - Posted: 6 Feb 2024 | 17:00:06 UTC - in response to Message 61205.
Last modified: 6 Feb 2024 | 17:00:33 UTC

It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround?

Have you tried NVITOP?
https://github.com/XuehaiPan/nvitop


No. A quick search seems to indicate that it uses the nvidia-smi command, so it likely has a similar limitation.

Anyway, after a day of running (more than 100 tasks) I didn't see any failures on the 16 GB card, so I'm good, at least for now.
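
On the logging interval: newer nvidia-smi builds do seem to accept a millisecond loop for queries, so a rough, untested-here sketch of higher-resolution VRAM logging would be something like the following (whether --loop-ms is available depends on your driver, and a spike shorter than the polling interval can still slip through):

# Log VRAM use roughly every 200 ms until Ctrl+C.
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu \
           --format=csv,noheader --loop-ms=200 | tee vram_log.csv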

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61207 - Posted: 7 Feb 2024 | 15:04:51 UTC - in response to Message 61204.

I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?


CUDA MPS has a setting called active thread percentage. It basically limits how many SMs of the GPU get used for each process. Without MPS, each process will call for all available SMs all the time, in separate contexts (MPS also shares a single context). I set that to 40%, so each task is only using 40% of the available SMs. With 3x running that’s slightly over provisioning the GPU, but it usually works well and runs faster than 3x without MPS. It also has the benefit of reducing VRAM use most of the time, but it doesn’t seem to limit these tasks much. The only caveat is that when you run low on work, the remaining one or two tasks won’t use all the GPU, instead using only the 40% and none of the rest of the idle GPU.



Thank you for the explanation!

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61208 - Posted: 7 Feb 2024 | 20:38:28 UTC

Good evening,
Are there Windows work units to calculate or do I have to go back to linux?
Thanks
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61210 - Posted: 7 Feb 2024 | 21:10:58 UTC - in response to Message 61208.

Good evening,
Are there Windows work units to calculate or do I have to go back to linux?
Thanks


Only Linux still.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,854,782,676
RAC: 17,245,093
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61212 - Posted: 8 Feb 2024 | 13:05:33 UTC - in response to Message 61210.

Good evening,
Are there Windows work units to calculate or do I have to go back to linux?
Thanks


Only Linux still.

:-( :-( :-(

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61213 - Posted: 8 Feb 2024 | 16:12:49 UTC

I've just switched back to Linux and it's up and running again. Bye bye Windows 10.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61214 - Posted: 8 Feb 2024 | 16:43:11 UTC

We have definitely noticed a sharp decrease in "errors" with these tasks. Steve (or anyone), can you offer some insight into the filenames? For example:


inputs_v3_ace_pch_ms_gc_filt_af05_index_263591_to_263591-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND5514_2

Are there two different references to version? I see a "_v3_" and then a "_v4-0-1".

Then, the app version: v1.04

I thought that "_v4-0-1" would equate to the app version, but it doesn't look like it does.

Thanks!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61215 - Posted: 8 Feb 2024 | 17:35:30 UTC - in response to Message 61214.
Last modified: 8 Feb 2024 | 17:41:34 UTC

The "0-1" notation in GPUGRID task names seems to indicate which segment you are on and how many total segments there are.

So here, 0 = which segment you are on
1 = how many segments there are in total
The segment index always seems to be zero-based.

We see/saw the same behavior with ATM, where you get tasks like 0-5, 1-5, 2-5, etc., stopping at 4-5; there was also a batch with ten segments, 0-10 through 9-10.

They likely have some kind of process on the server side which stitches the results together based on these (and other) numbers.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61216 - Posted: 8 Feb 2024 | 17:36:34 UTC - in response to Message 61214.

Looks like they transitioned from v3-0-1 on Feb 2 to a test result on Feb 3 and then started the v4-0-1 run on Feb 5

That was looking back through 360 validated tasks.

I had two errors on the v4-0-1 tasks right at their beginning; they have all validated since then.

All run on two 2080 Ti cards.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61217 - Posted: 9 Feb 2024 | 0:41:33 UTC - in response to Message 61109.

Why do I allways get segmentation fault
on Windows/wsl2/Ubuntu 22.04.3 LTS
12 processors, 28 GB memory, 16GB swap, GPU RTX 4070 Ti Super with 16 GB, driver version 551.23

https://www.gpugrid.net/result.php?resultid=33759912
https://www.gpugrid.net/result.php?resultid=33758940
https://www.gpugrid.net/result.php?resultid=33759139
https://www.gpugrid.net/result.php?resultid=33759328


something wrong with your environment or drivers likely.

try running a native Linux OS install, WSL might not be well supported



I'm getting the same issue running through WSL2: an immediate segmentation fault.
https://www.gpugrid.net/result.php?resultid=33853832
https://www.gpugrid.net/result.php?resultid=33853734

The environment & drivers should be OK, since the machine runs other projects' GPU tasks just fine! Unless GPUGRID has some specific prerequisites?

Working project tasks:
https://moowrap.net/result.php?resultid=201144661

Installing a native Linux OS is simply not an option for most regular users who don't have dedicated compute farms...

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61218 - Posted: 9 Feb 2024 | 1:58:41 UTC - in response to Message 61217.

Then I guess you'll just have to wait for the native Windows app. It seems apparent that something doesn't work with these tasks under WSL, so there is indeed some kind of problem or incompatibility related to WSL. The fact that some other app works isn't really relevant; a key difference is probably how these apps are distributed. Moo! Wrapper uses a compiled binary, while the QChem work is supplied as an entire Python environment designed for a native Linux install (it sets up a lot of things, such as environment variables, that might not be correct under WSL, for example). These tasks also use CuPy, which might not be well supported under WSL, or the way CuPy is being called might not be right for WSL. Either way, I don't think there's going to be a solution for WSL: switch to Linux, or wait for the Windows version.
____________

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61219 - Posted: 9 Feb 2024 | 9:04:14 UTC

hello
I noticed that you are losing users.
Not many, but the number of GPUGRID users is decreasing.
Maybe the hardware requirements are too high, and the credits are no longer what they used to be.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61220 - Posted: 9 Feb 2024 | 10:15:45 UTC - in response to Message 61219.

hello
I noticed that you are losing users.
Not many, but the number of GPUGRID users is decreasing.
Maybe the hardware requirements are too high, and the credits are no longer what they used to be.



That's hardly surprising given this stat:
https://www.boincstats.com/stats/45/host/breakdown/os/

2500+ Windows hosts
688 Linux hosts

Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to the research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working.
So it's only logical that people start leaving - certainly the set-it-and-forget-it crowd.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,854,782,676
RAC: 17,245,093
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61221 - Posted: 9 Feb 2024 | 13:35:02 UTC - in response to Message 61220.

Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to the research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working.

When I joined GPUGRID about 9 years ago, all subprojects were available for Linux and Windows as well.
At that time and even several years later, my hosts were working for GPUGRID almost 365 days/year.

Somehow, it makes me sad that I am less and less able to contribute to this valuable project.

Recently, someone here explained the reason: scientific projects are primarily done by Linux, not by Windows.
Why so, all of a sudden ???

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61222 - Posted: 9 Feb 2024 | 14:25:44 UTC - in response to Message 61218.

Then I guess you'll just have to wait for the native Windows app. It seems apparent that something doesn't work with these tasks under WSL, so there is indeed some kind of problem or incompatibility related to WSL. The fact that some other app works isn't really relevant; a key difference is probably how these apps are distributed. Moo! Wrapper uses a compiled binary, while the QChem work is supplied as an entire Python environment designed for a native Linux install (it sets up a lot of things, such as environment variables, that might not be correct under WSL, for example). These tasks also use CuPy, which might not be well supported under WSL, or the way CuPy is being called might not be right for WSL. Either way, I don't think there's going to be a solution for WSL: switch to Linux, or wait for the Windows version.


It could be that, yes. But it could also be a memory overflow.
I'm running a GTX 1080 Ti with 11 GB VRAM.
Running it from the command line with nvidia-smi logging, I see memory going up to 8 GB allocated and then a segmentation fault - which could be caused by a single allocation pushing it over the 11 GB limit?

monitoring output:
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk pviol tviol fb bar1 ccpm sbecc dbecc pci rxpci txpci
# Idx W C C % % % % % % MHz MHz % bool MB MB MB errs errs errs MB/s MB/s
0 15 30 - 2 8 0 0 - - 405 607 0 0 1915 2 - - - 0 0 0
0 17 30 - 2 8 0 0 - - 405 607 0 0 1915 2 - - - 0 0 0
0 74 33 - 2 1 0 0 - - 5005 1569 0 0 2179 2 - - - 0 0 0
0 133 39 - 77 5 0 0 - - 5005 1987 0 0 4797 2 - - - 0 0 0
0 167 49 - 63 16 0 0 - - 5005 1974 0 0 6393 2 - - - 0 0 0
0 119 54 - 74 4 0 0 - - 5005 1974 0 0 8329 2 - - - 0 0 0
0 87 47 - 0 0 0 0 - - 5508 1974 0 0 1915 2 - - - 0 0 0


commandline run output:

/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
nao = 590
reading molecules in current dir
mol_130305284_conf_0.xyz
mol_130305284_conf_1.xyz
mol_130305284_conf_2.xyz
mol_130305284_conf_3.xyz
mol_130305284_conf_4.xyz
mol_130305284_conf_5.xyz
mol_130305284_conf_6.xyz
mol_130305284_conf_7.xyz
mol_130305284_conf_8.xyz
mol_130305284_conf_9.xyz
['mol_130305284_conf_0.xyz', 'mol_130305284_conf_1.xyz', 'mol_130305284_conf_2.xyz', 'mol_130305284_conf_3.xyz', 'mol_130305284_conf_4.xyz', 'mol_130305284_conf_5.xyz', 'mol_130305284_conf_6.xyz', 'mol_130305284_conf_7.xyz', 'mol_130305284_conf_8.xyz', 'mol_130305284_conf_9.xyz']
Computing energy and forces for molecule 1 of 10
charge = 0
Structure:
('I', [-9.750986802755719, 0.9391938839088357, 0.1768783652592898])
('C', [-5.895945508642993, 0.12453295160883758, 0.05083363275080016])
('C', [-4.856596140132209, -2.2109795657411224, -0.2513335745671532])
('C', [-2.2109795657411224, -2.0220069532846163, -0.24377467006889297])
('O', [-0.304245906054975, -3.7227604653931716, -0.46865207889213534])
('C', [1.8519316020737606, -2.3621576557063273, -0.3080253583041051])
('C', [4.440856392727896, -2.9668700155671472, -0.4006219384077931])
('C', [5.839253724906041, -0.8163616858121067, -0.1379500070932495])
('I', [9.769884064001369, -0.6368377039784259, -0.13889487015553204])
('S', [4.100705690306184, 1.9464179083020137, 0.22298768269867728])
('C', [1.3587130835622794, 0.22298768269867728, 0.02022006953284616])
('C', [-1.2925726692025024, 0.43463700864996424, 0.06254993472310354])
('S', [-3.7227604653931716, 2.5700275294084842, 0.3477096069199714])
('H', [-5.914842769888644, -3.9306303390953286, -0.46298290051844015])
('H', [5.19674684255392, -4.818801617640907, -0.640617156227556])


******** <class 'gpu4pyscf.df.df_jk.DFRKS'> ********
method = DFRKS
initial guess = minao
damping factor = 0
level_shift factor = 0
DIIS = <class 'gpu4pyscf.scf.diis.CDIIS'>
diis_start_cycle = 1
diis_space = 8
SCF conv_tol = 1e-09
SCF conv_tol_grad = None
SCF max_cycles = 50
direct_scf = False
chkfile to save SCF result = /var/lib/boinc/projects/www.gpugrid.net/bck/tmp/tmpd03fogee
max_memory 4000 MB (current use 345 MB)
XC library pyscf.dft.libxc version 6.2.2
unable to decode the reference due to https://github.com/NVIDIA/cuda-python/issues/29
XC functionals = wB97M-V
N. Mardirossian and M. Head-Gordon., J. Chem. Phys. 144, 214110 (2016)
radial grids:
Treutler-Ahlrichs [JCP 102, 346 (1995); DOI:10.1063/1.469408] (M4) radial grids

becke partition: Becke, JCP 88, 2547 (1988); DOI:10.1063/1.454033
pruning grids: <function nwchem_prune at 0x7f29529356c0>
grids dens level: 3
symmetrized grids: False
atomic radii adjust function: <function treutler_atomic_radii_adjust at 0x7f2952935580>
** Following is NLC and NLC Grids **
NLC functional = wB97M-V
radial grids:
Treutler-Ahlrichs [JCP 102, 346 (1995); DOI:10.1063/1.469408] (M4) radial grids

becke partition: Becke, JCP 88, 2547 (1988); DOI:10.1063/1.454033
pruning grids: <function nwchem_prune at 0x7f29529356c0>
grids dens level: 3
symmetrized grids: False
atomic radii adjust function: <function treutler_atomic_radii_adjust at 0x7f2952935580>
small_rho_cutoff = 1e-07
Set gradient conv threshold to 3.16228e-05
Initial guess from minao.
Default auxbasis def2-tzvpp-jkfit is used for H def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for C def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for S def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for O def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for I def2-tzvppd
/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
tot grids = 225920
tot grids = 225920
segmentation fault

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61223 - Posted: 9 Feb 2024 | 17:11:05 UTC - in response to Message 61222.

First, it's well known at this point that these tasks require a lot of VRAM, so some failures are to be expected from that. The VRAM utilization is not constant, but spikes up and down. For the tasks running on my systems, loading up to 5-6 GB and staying around that amount is pretty normal, with intermittent spikes into the 9-12 GB+ range. Judging by the failure rates of different GPUs, I estimate that most tasks (>70%) need more than 8 GB, a small fraction (~5%) need more than 12 GB, and very few (<1%) need even more than 16 GB. A teammate of mine is running a couple of 2080 Tis (11 GB) and has had some failures but mostly success.

When tasks hit the memory limit they fail, but not with a segfault: you always get some kind of memory allocation error printed in the stderr. With an 11 GB GPU you should be seeing a majority of successes. Since your tasks all fail in the same way, with a segfault, that tells me it's not a memory allocation problem but something else. And now, with two people having the same problem and both using WSL, it's clear that WSL is the root of the problem. The tasks were not set up to run in that environment.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61224 - Posted: 9 Feb 2024 | 18:14:24 UTC - in response to Message 61221.

Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to the research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working.

When I joined GPUGRID about 9 years ago, all subprojects were available for Linux and Windows as well.
At that time and even several years later, my hosts were working for GPUGRID almost 365 days/year.

Somehow, it makes me sad that I am less and less able to contribute to this valuable project.

Recently, someone here explained the reason: scientific projects are primarily done by Linux, not by Windows.
Why so, all of a sudden ???

I posed this question to Google and their AI engine came up with this response

"how long has most scientific research projects used linux compared to windows"

Linux is a popular choice for research companies because it offers flexibility, security, stability, and cost-effectiveness. Linux is also used in technical disciplines at universities and research centers because it's free and includes a large amount of free and open-source software.

en.wikipedia.org
List of Linux adopters - Wikipedia
Linux is often used in technical disciplines at universities and research centres. This is due to several factors, including that Linux is available free of charge and includes a large body of free/open-source software.

brainly.com
Why might a large research company use the Linux operating system?
Sep 20, 2022 — Overall, the Linux operating system provides research companies with flexibility, stability, security, and cost-effectiveness, making it a popular choice in the research community.
Linux is known for its reliability, security, and breadth of open source tools available. It's also known for its stability and reliability, and can run for months or even years without any issues.
Linux is an open-source operating system, whereas Microsoft is a commercial operating system. Linux users have access to the source code of the operating system and can make amendments as per their choices. Windows users don't have such privileges.
Linux is also used by biologists in various domains of research. In the field of biology, where data analysis, computational modeling, and scientific exploration are essential, Linux offers numerous advantages.


Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61225 - Posted: 9 Feb 2024 | 19:12:20 UTC

One thing is sure: there will not be enough users to compute everything. There are 50,462 tasks for 106 computers as I write these lines, and they are arriving faster than they are being processed. I think GPUGRID is heading straight into a wall if they do nothing.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61226 - Posted: 9 Feb 2024 | 19:46:36 UTC - in response to Message 61225.

We are processing about 12,000 tasks per day, so there's a little more than 4 days' worth of work right now, but the amount of available work is still climbing.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61227 - Posted: 9 Feb 2024 | 20:46:44 UTC - in response to Message 61224.



Somehow, it makes me sad that I am less and less able to contribute to this valuable project.

Recently, someone here explained the reason: scientific projects are primarily done by Linux, not by Windows.
Why so, all of a sudden ???

I posed this question to Google and their AI engine came up with this response

"how long has most scientific research projects used linux compared to windows"

Linux is a popular choice for research companies because it offers flexibility, security, stability, and cost-effectiveness. Linux is also used in technical disciplines at universities and research centers because it's free and includes a large amount of free and open-source software.

<truncated>




The choice of Linux as a research OS in an academic context is clear, but it really has no bearing on which platforms a BOINC project chooses to support.
BOINC as a platform was always a 'supercomputer for peanuts' proposition - you invest a fraction of what a real supercomputer costs but can get similar processing power, which is exactly what many low-budget academic research groups were looking for.
Part of that investment is the choice of which platforms to support, and it is primarily driven by the amount of processing power needed, with the match to your native development OS only a secondary consideration.

As I already said in my previous post, it all depends on what type of project you want to be:
1) You need all the power and/or turnaround you can get? Support all the platforms you can handle, with Windows your #1 priority, because that's where the majority of the FLOPS are.
2) You don't really need that much power, and your focus is more on developing/researching algorithms? Stay on your native OS.
3) You need some of both? Prioritize your native OS for your beta apps, but keep driving a steady stream of stable work to #1 Windows and #2 Linux to keep your supercomputer 'providers' engaged.
Because that's the last part of the 'small investment' needed for your FLOPS: keeping your users happy and engaged.

So I see no issue at all with new betas being on Linux first, but I am also concerned, or sad, that there has been only beta/Linux work lately, as opposed to the earlier days of GPUGRID.

Unless of course the decision is made to go full-on as a type 2) project?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61228 - Posted: 9 Feb 2024 | 21:14:00 UTC - in response to Message 61227.

There has been a bunch of ATM work intermittently, which does run on Windows. They had to fix the Windows and Linux versions of that application at different times, so there were periods when Linux worked and Windows didn't, and periods when Windows worked and Linux didn't. For the most recent batch I believe both applications were working. It is still classified as "beta" for both Linux and Windows.

The project admins/researchers have already mentioned a few times that a Windows app is in the pipeline, but it takes time. They obviously don't have a lot of expertise with Windows and are more comfortable in the Linux environment, so it makes sense that it will take more time and effort for them to get up to speed and get the Windows version working. They likely also need to sort out other parts of their workflow on the backend (work generation, task sizes, task configurations, batch sizes, etc.), and Linux users are the guinea pigs for that. They had many weeks of "false starts" with this QChem project where they generated a bunch of work, it caused errors, and they ended up cancelling the whole batch and trying again the following week. It's a lot easier for the researchers to iron out these problems with one version of the code rather than juggling two versions with different code changes to each, and then, once most issues are sorted, port it to Windows. I think they are still figuring out which configurations work best for them on the backend and for the hardware available on GPUGRID. Steve previously mentioned that he originally based things on high-end datacenter GPUs like the A100 with lots of VRAM, but changes were necessary to get the same results from our pool of users with much lower-end GPUs.

When the Windows app comes, I imagine it will still be "beta" in the BOINC sense, but it will be a more polished setup than what Linux started with.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61229 - Posted: 9 Feb 2024 | 21:14:03 UTC

The researcher earlier stated there were NO Windows computers in the lab.

Are you going to buy some for them or fund them?

How many of you have actually donated monetarily to the project?

MentalFS
Send message
Joined: 10 Feb 24
Posts: 1
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61230 - Posted: 10 Feb 2024 | 16:35:11 UTC - in response to Message 61223.
Last modified: 10 Feb 2024 | 16:35:50 UTC

I'm using Docker for Windows, which uses WSL2 as its backend, and I'm having the same problems, so that's another hint that WSL is the problem. Other projects that use my NVIDIA card work fine, though.

For now I've disabled "Quantum chemistry on GPU (beta)" and "If no work for selected applications is available, accept work from other applications" in my project settings to avoid this.

Currently there's no other work available for me but I'll keep an eye on the other tasks coming in.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61231 - Posted: 10 Feb 2024 | 20:29:30 UTC

There is an obvious solution, which no one has mentioned, for Windows users who wish to contribute to this project, and at the risk of starting a proverbial firestorm, I will mention it: you could install Linux on your machine(s). I did it last year and it has worked out fine for me.

I did the installation on a separate SSD, leaving the Windows disk intact. The default boot is Linux, with the option to boot into Windows when the need arises.

The process of installing Linux itself was not difficult. I did have an issue attaching my existing project accounts to BOINC, but some of the Linux users crunching here helped me solve it. Thank you again for the help.

It is an option you might want to consider.


[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61232 - Posted: 10 Feb 2024 | 21:25:45 UTC - in response to Message 61231.

Sure, that's an option, and no need to fear a firestorm - at least not from me; I've worked with Linux and other Unix flavors a lot over the years, both professionally and personally. And besides, I hate forum flame wars or any kind of tech-solution holy war. ;-)

The problem with that solution, for me and for many Windows users like me, is that it's an either/or solution: you boot either Linux or Windows.

I have a single computer that I need for both work and personal use, and it requires Windows because the software stack is Microsoft-based; not all of it has Linux alternatives that I have the time, patience or skills to explore.
I also run BOINC on that machine using 50% CPU + 100% GPU, 24/7.
When participating in Linux-only projects, I just spin up a VMware VM with 25% of the CPU and let that run in parallel - or, more recently, WSL.

I did just install Linux bare-metal on a partition of my data drive to confirm that WSL is the issue rather than the system, but for the reasons mentioned above I cannot let this run 24/7.

FYI - Ian&Steve, you're right. PYSCFbeta on bare-metal Linux runs just fine. So it must indeed be some incompatibility with WSL.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61233 - Posted: 10 Feb 2024 | 23:03:27 UTC - in response to Message 61232.

Why not run Linux as your primary OS, and then virtualize (or maybe even use WINE for) your Windows-only software?
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61234 - Posted: 11 Feb 2024 | 1:10:21 UTC - in response to Message 61233.
Last modified: 11 Feb 2024 | 1:11:01 UTC

Why not run Linux as your primary OS, and then virtualize (or maybe even use WINE for) your Windows-only software?


Because I need Windows all the time, whereas in the last 15 years this is the only time I couldn't get something to work through a virtual Linux. And BOINC is just a hobby, after all...
Would you switch your primary OS in such a case?

On another note - DCF is going crazy again. Average runtimes are consistently around 30 minutes, yet DCF keeps climbing - the estimated runtime of new WUs is now 76 days!

On a positive note: not a single failure yet!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61235 - Posted: 11 Feb 2024 | 1:17:14 UTC - in response to Message 61234.

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61236 - Posted: 11 Feb 2024 | 3:14:48 UTC - in response to Message 61235.

Ian, are you saying that even after you've set DCF to a low value in the client_state file that it is still escalating?

I set mine to 0.02 a month ago and it is still hanging around there now that I looked at the hosts here.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61237 - Posted: 11 Feb 2024 | 3:30:21 UTC - in response to Message 61236.

Ian, are you saying that even after you've set DCF to a low value in the client_state file that it is still escalating?

I set mine to 0.02 a month ago and it is still hanging around there now that I looked at the hosts here.


my DCF was set to about 0.01, and my tasks were estimating that they would take 27hrs each to complete.

i changed the DCF to 0.0001, and that changed the estimate to about 16mins each.

then after a short time i noticed that the time to completion estimate was going up again, reaching back to 27hrs again. i checked DCF and it's back to 0.01.
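
For what it's worth, that behaviour is consistent with how the client uses DCF: it is stored per project in client_state.xml as <duration_correction_factor>, the runtime estimate shown is roughly the raw project estimate multiplied by it, and the client raises DCF again quickly whenever a task overruns its estimate. A minimal sketch of the scaling (not the actual BOINC client code; the fpops numbers below are made-up values chosen only to reproduce the 16-minute vs 27-hour figures above):

# Hedged sketch: the estimated runtime scales linearly with DCF.
# rsc_fpops_est and projected_flops are assumed illustrative values,
# not numbers taken from any real work unit or host.
rsc_fpops_est = 9.6e18        # assumed <rsc_fpops_est> of a work unit
projected_flops = 1.0e12      # assumed speed BOINC projects for the GPU

def estimated_runtime_hours(dcf: float) -> float:
    return rsc_fpops_est / projected_flops * dcf / 3600.0

for dcf in (0.0001, 0.01):
    print(f"DCF={dcf:g}: ~{estimated_runtime_hours(dcf):.1f} h")
# DCF=0.0001: ~0.3 h (about 16 minutes); DCF=0.01: ~26.7 h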
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61238 - Posted: 11 Feb 2024 | 15:35:28 UTC - in response to Message 61223.
Last modified: 11 Feb 2024 | 15:46:18 UTC

First, it’s well known at this point that these tasks require a lot of VRAM. So some failures are to be expected from that. The VRAM utilization is not constant, but spikes up and down. From the tasks running on my systems, loading up to 5-6GB and staying around that amount is pretty normal, with intermittent spikes to the 9-12GB+ range occasionally. Just by looking at the failure rate of different GPUs, I’m estimating that most tasks need more than 8GB (>70%), a small amount of tasks need more than 12GB (~5%), and a very small number of them need even more than 16GB (<1%). A teammate of mine is running on a couple 2080Tis (11GB) and has had some failures but mostly success.

As you suggested in a previous post, VRAM utilization seems to depend on the particular model of graphics card / GPU.
GPUs with fewer CUDA cores available seem to use less VRAM.
My GTX 1650 GPUs have 896 CUDA cores and 4 GB VRAM.
My GTX 1650 SUPER GPU has 1280 CUDA cores and 4 GB VRAM.
My GTX 1660 Ti GPU has 1536 CUDA cores and 6 GB VRAM.
These cards are currently achieving an overall success rate of 44% on PYSCFbeta (676 valid versus 856 errored tasks at the time of writing).
Not all the errors were due to memory overflows; some were due to non-viable WUs or other reasons, but digging into this would take too much time...
Processing ATMbeta tasks, success was pretty close to 100%.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61239 - Posted: 11 Feb 2024 | 15:58:36 UTC - in response to Message 61235.
Last modified: 11 Feb 2024 | 16:04:45 UTC

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.

On a more germane note...

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488


And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61240 - Posted: 11 Feb 2024 | 16:23:18 UTC - in response to Message 61239.

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.

On a more germane note...

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488


And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip

Seems to me that your 3080 is the 10G version instead of 12G?

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61241 - Posted: 11 Feb 2024 | 16:52:25 UTC - in response to Message 61239.

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.


I'm clearly in the presence of passionate Linux believers here... :-)


Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488



And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip


It does refer to video memory, but the limit each WU sets possibly doesn't take into account other processes allocating video memory. That would especially be an issue I think if you run multiple WU's in parallel.
Try executing nvidia-smi to see which processes allocate how much video memory:


svennemans@PCSLLINUX01:~$ nvidia-smi
Sun Feb 11 17:29:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:01:00.0 On | N/A |
| 47% 71C P2 179W / 275W | 6449MiB / 11264MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1611 G /usr/lib/xorg/Xorg 534MiB |
| 0 N/A N/A 1801 G /usr/bin/gnome-shell 75MiB |
| 0 N/A N/A 9616 G boincmgr 2MiB |
| 0 N/A N/A 9665 G ...gnu/webkit2gtk-4.0/WebKitWebProcess 12MiB |
| 0 N/A N/A 27480 G ...38,262144 --variations-seed-version 125MiB |
| 0 N/A N/A 46332 G gnome-control-center 2MiB |
| 0 N/A N/A 47110 C python 5562MiB |
+---------------------------------------------------------------------------------------+


My one running WU has allocated 5.5G but with the other running processes, total allocated is 6.4G.
It would depend on the implementation whether the limit is calculated from the total CUDA memory or from the actually free CUDA memory, and whether that limit is set only once at the start or updated multiple times.
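
For reference, CuPy does allow an application to cap its own memory pool, which would explain an out-of-memory error being reported below the card's full capacity. A minimal sketch of how a "limit set to" value like the one above could arise, assuming the cap is derived from total rather than free VRAM - an assumption for illustration only, not the actual PYSCFbeta code:

# Hedged sketch: capping a CuPy memory pool. The 0.9 fraction and the use
# of *total* (rather than free) VRAM are assumptions for illustration.
import cupy

free_b, total_b = cupy.cuda.Device(0).mem_info           # free and total VRAM in bytes
pool = cupy.get_default_memory_pool()
pool.set_limit(size=int(total_b * 0.9))                  # cap this process's pool

print(f"free={free_b:,}  total={total_b:,}  limit={pool.get_limit():,}")
# Allocations that would push the pool past the cap raise
# cupy.cuda.memory.OutOfMemoryError, which reports "limit set to: <cap>"
# even though the GPU itself may still have unallocated VRAM.

If the limit really is computed once from total VRAM at startup, then Xorg, a browser, or a second WU eating into the free memory would push a task over the edge exactly as described.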

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61242 - Posted: 11 Feb 2024 | 17:38:59 UTC - in response to Message 61241.
Last modified: 11 Feb 2024 | 17:59:48 UTC

Good point about the other stuff on the card... right this minute it's taking a break from GPUGRID to do a Meerkat Burp7...

I usually have "watch -t -n 8 nvidia-smi" running on this box if I'm poking around. I'll capture a shot of it as soon as GPUGRID comes back up, if anything listed below changes significantly. I don't think it will.

While the 'cuda_1222' is running I see a total ~286MB of 'other stuff' if my 'ciphering' is right:


/usr/lib/xorg/Xorg 153MiB
cinnamon 18MiB
...gnu/webkit2gtk-4.0/WebKitWebProcess 12MiB Boincmgr
/usr/lib/firefox/firefox 103MiB because I'm reading/posting
...inary_x86_64-pc-linux-gnu__cuda1222 776MiB the only Compute task


Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61243 - Posted: 11 Feb 2024 | 19:14:30 UTC - in response to Message 61237.

I believe the lowest value that DCF can be in the client_state file is 0.01

Found that in the code someplace, sometime

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61244 - Posted: 12 Feb 2024 | 9:39:40 UTC

bonjour apparemment maintenant ça fonctionne sur mes 2 gpu-gtx 1650 et rtx 4060.
Je n'ai pas eu d'erreur de calcul.


hello, apparently it now works on my 2 GPUs, the GTX 1650 and RTX 4060.
I have not had any computation errors.
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61245 - Posted: 12 Feb 2024 | 11:02:45 UTC - in response to Message 61244.

Hello,

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".

Thank you for the feedback regarding the failure rate. As I mentioned, different WUs require different amounts of memory, which is hard to check before they start crunching. From my viewpoint the failure rates are low enough that all WUs seem to succeed within a few retries. This is still a "Beta" app.

We definitely want a Windows app and it is in the pipeline. However, as I mentioned before, the development of this is time-consuming. Several of the underlying code bases are Linux-only at the moment, so a Windows app requires a Windows port of some code.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61246 - Posted: 12 Feb 2024 | 12:25:22 UTC - in response to Message 61245.

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".



Actually, it *should* work, since WSL2 is being sold as a native Linux kernel running in a virtual environment with full system call compatibility.
So one could reasonably expect any native linux libraries to work as expected.

However there are obviously still a few issues to iron out.

Not by gpugrid to be clear - by microsoft.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61252 - Posted: 13 Feb 2024 | 11:23:35 UTC

I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits seem to all run fine on a subsequent host.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61260 - Posted: 13 Feb 2024 | 14:13:19 UTC

bonjour,
quand les taches windows seront elles pretes pour essais?
franchement,Linux ,c'est pourri.
apres une mise a jour le lhc@home ne fonctionne plus.Je reste sous linux pour vous mais j'ai hate de repasser sous un bon vieux windows.
Merci


Good afternoon,
when will the Windows tasks be ready for testing?
Frankly, Linux is rotten.
After an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to switch back to good old Windows.
Thanks
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61261 - Posted: 13 Feb 2024 | 16:15:31 UTC - in response to Message 61260.

bonjour,
quand les taches windows seront elles pretes pour essais?
franchement,Linux ,c'est pourri.
apres une mise a jour le lhc@home ne fonctionne plus.Je reste sous linux pour vous mais j'ai hate de repasser sous un bon vieux windows.
Merci


Good afternoon,
when will the Windows tasks be ready for testing?
Frankly, Linux is rotten.
After an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to switch back to good old Windows.
Thanks


Maybe try a different distribution. I have always used Windows (and still do on some systems) but use Linux Mint on others. It's really user-friendly and has a very similar feel to Windows.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,238,627,382
RAC: 14,211,365
Level
Trp
Scientific publications
watwatwat
Message 61269 - Posted: 14 Feb 2024 | 15:59:06 UTC - in response to Message 61239.
Last modified: 14 Feb 2024 | 16:00:55 UTC

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488

Sometimes I get the same error on my 3080 10 GB Card. E.g., https://www.gpugrid.net/result.php?resultid=33960422
Headless computer with a single 3080 running 1C + 1N.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,238,627,382
RAC: 14,211,365
Level
Trp
Scientific publications
watwatwat
Message 61270 - Posted: 14 Feb 2024 | 16:04:55 UTC - in response to Message 61243.

I believe the lowest value that DCF can be in the client_state file is 0.01

Found that in the code someplace, sometime

Zoltan posted long ago that BOINC does not understand zero and 0.01 is as close as it can get. I wonder if that was someone's approach to fixing a division-by-zero problem in antiquity.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61272 - Posted: 14 Feb 2024 | 16:20:02 UTC - in response to Message 61242.
Last modified: 14 Feb 2024 | 16:21:50 UTC

...
Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.


After logging error rates for a few days across 5 boxes w/ Nvidia cards (all RTX30x0, all Linux Mint v2x.3), trying to be aware of what I was doing on the main desktop while 'python' was running, and making some sclk / mclk cutbacks, the avg error rate is dropping. The last cut shows it at 23.44% across the 5 boxes averaged over 28 hours.

No longer any segfault 0x8b errors, all 0x1. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61273 - Posted: 14 Feb 2024 | 16:29:14 UTC - in response to Message 61272.

...
Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.


After logging error rates for a few days across 5 boxes w/ Nvidia cards (all RTX30x0, all Linux Mint v2x.3), trying to be aware of what I was doing on the main desktop while 'python' was running, and making some sclk / mclk cutbacks, the avg error rate is dropping. The last cut shows it at 23.44% across the 5 boxes averaged over 28 hours.

No longer any segfault 0x8b errors, all 0x1. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip


its still an out of memory error. a little further up in the error log shows this:
"CUDA Error of GINTint2e_jk_kernel: out of memory"

so it's probably just running out of memory at a different stage of the task, producing a slightly different error, but still an issue with not enough memory.

____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61274 - Posted: 14 Feb 2024 | 16:44:55 UTC - in response to Message 61252.

I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits seem to all run fine on a subsequent host.


I didn't find any of these in the 10GB 3080 errors that occurred so far today. Will check the 3070 cards shortly.

Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61275 - Posted: 14 Feb 2024 | 16:53:42 UTC - in response to Message 61273.


its still an out of memory error. a little further up in the error log shows this:
"CUDA Error of GINTint2e_jk_kernel: out of memory"

so it's probably just running out of memory at a different stage of the task, producing a slightly different error, but still an issue with not enough memory.


Thanx... as I suspected and this is my most common error now.

Along with these, which I think are also memory-related, from a different point in the process... same situation w/o having reached the cap limit shown.

https://www.gpugrid.net/result.php?resultid=33962293

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61276 - Posted: 14 Feb 2024 | 16:58:25 UTC - in response to Message 61275.
Last modified: 14 Feb 2024 | 16:58:55 UTC

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.
____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61277 - Posted: 14 Feb 2024 | 17:01:02 UTC - in response to Message 61274.

I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits seem to all run fine on a subsequent host.


I didn't find any of these in the 10GB 3080 errors that occurred so far today. Will check the 3070 cards shortly.

Skip


8GB 3070 card errors today checked were all:

CUDA Error of GINTint2e_jk_kernel: out of memory


Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61279 - Posted: 14 Feb 2024 | 17:22:50 UTC - in response to Message 61276.

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.


Thanx for info. As is right now the only cards I have w/ 16GB are my RX6800/6800xt cards.

https://ibb.co/hKZtR0q

Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61280 - Posted: 14 Feb 2024 | 17:29:26 UTC - in response to Message 61279.



Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip


a $600 12GB Titan V is like 4x faster though.

other projects are a consideration of course.
____________

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61281 - Posted: 14 Feb 2024 | 18:03:43 UTC - in response to Message 61280.

If this quantum chemistry project is going to last for more than a year, perhaps a $170 (via ebay) investment in a Tesla P100 16GB may be worth it? If you look at my gpugrid output via boincstat, I'm doing around 20M PPD over the past 4 days running on a single card with a power limit of 130W. I've processed more than 1000 tasks and I think I have had 2 failures with its 16GB memory.

The only drawback is that there aren't many projects that do benefit from high FP64 and/or memory bandwidth performance. Originally bought it for MilkyWay. However if you have extra cash, the Titan V is a great option for such projects.

The project admin can change the granted credit and/or the task run time, but as long as the high FP64 and memory bandwidth requirements remain unchanged, the P100 should perform relatively better than most consumer cards for such applications.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61282 - Posted: 14 Feb 2024 | 18:27:15 UTC - in response to Message 61270.

My DCF is set to 0.02

So that is not considered zero by BOINC apparently.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61283 - Posted: 14 Feb 2024 | 19:35:56 UTC - in response to Message 61280.



Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip


a $600 12GB Titan V is like 4x faster though.

other projects are a consideration of course.


Can you point me to someplace I can educate myself a bit on using Titan V cards for BOINC. I see some for $600 used on ebay. As u know there is no used market for used 'Super' cards yet. Did u mean 4x faster than a 4070 Super or than the 3070 I would replace with it?

Thanx, Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61284 - Posted: 14 Feb 2024 | 20:07:36 UTC - in response to Message 61283.



Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip


a $600 12GB Titan V is like 4x faster though.

other projects are a consideration of course.


Can you point me to someplace I can educate myself a bit on using Titan V cards for BOINC. I see some for $600 used on ebay. As u know there is no used market for used 'Super' cards yet. Did u mean 4x faster than a 4070 Super or than the 3070 I would replace with it?

Thanx, Skip



Ah, it's an FP64 thing. Any other projects doing heavy FP64 lifting since the demise of MW GPU WUs?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61285 - Posted: 14 Feb 2024 | 21:16:14 UTC - in response to Message 61284.

ATMbeta tasks here have some small element of FP64. (integration)

BRP7 tasks at Einstein also use FP64 a little bit.

Asteroids@home GPU apps are also primarily FP64, but they have a massive GPU memory bandwidth bottleneck that slows things down more than the FP64 does anyway, so you don't realize the benefit there. And the CPUs give better production per watt at Asteroids.

Not sure if any other projects use it.


____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61288 - Posted: 16 Feb 2024 | 5:30:39 UTC - in response to Message 61276.
Last modified: 16 Feb 2024 | 5:31:43 UTC

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.


Not sure why but...

Error rates seemed to start dropping after 5pm (23:00 Zulu) today. Overall error average since 2/11 across my 5 Nvid cards was 26.7% with it slowly creeping down over time. Early on a little bit of this was the result of lowering clocks to eliminate the occasional segfault (0x8b).

The average of the last two captures today across the 5 cards was 20.5%

For the last 6 hour period I just checked, my 10GB card average error rate dropped to 17.3% (15.92 & 18.7) and the 8GB card error rate was at 21.3%.

Skip

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61291 - Posted: 17 Feb 2024 | 9:38:28 UTC

les unites de calcul pour windows sont elles arrivées?


Have the work units for Windows arrived?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61292 - Posted: 17 Feb 2024 | 11:44:20 UTC - in response to Message 61291.

It isn't the tasks which need to be released, it's the application programs needed to run them.

You can read the list of applications at https://www.gpugrid.net/apps.php

The newest ones tend to be towards the bottom of the page - and no, there isn't one for 'Quantum chemistry calculations on GPU' yet.

Bookmark that page - there isn't a direct link to it on this site, although it's a standard feature of BOINC projects.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61293 - Posted: 17 Feb 2024 | 12:18:37 UTC

Looking at the Stderr output report of a given PYSCFbeta task, a line like this can be found:

.
+ CUDA_VISIBLE_DEVICES=N
.

Where "N" corresponds to the device number (GPU) that the task ran on.
This is very much appreciated on multi-GPU hosts when trying to identify reliable or unreliable devices.
It allows, if desired, excluding unreliable devices, following Ian&Steve C.'s kind advice.

A similar feature would be useful in other apps, such as ATMbeta.
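
For anyone curious, the effect of that stderr line can be reproduced in a few lines of Python. This is only a sketch of the general technique (pinning a process to the BOINC-assigned GPU via CUDA_VISIBLE_DEVICES); the command-line handling is an assumption for illustration, not the project's actual wrapper:

import os
import sys

# Assumed for illustration: the wrapper passes the BOINC device number
# as the first command-line argument.
boinc_device = sys.argv[1] if len(sys.argv) > 1 else "0"

# Must be set before any CUDA library (CuPy, PyTorch, ...) is initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = boinc_device

# From here on, the process only sees that physical GPU and addresses it
# internally as device 0 - which is why the stderr line identifies the card.
print(f"CUDA_VISIBLE_DEVICES={os.environ['CUDA_VISIBLE_DEVICES']}")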

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61295 - Posted: 17 Feb 2024 | 17:15:35 UTC - in response to Message 61288.
Last modified: 17 Feb 2024 | 17:15:50 UTC

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.


Not sure why but...

Error rates seemed to start dropping after 5pm (23:00 Zulu) today. Overall error average since 2/11 across my 5 Nvid cards was 26.7% with it slowly creeping down over time. Early on a little bit of this was the result of lowering clocks to eliminate the occasional segfault (0x8b).

The average of the last two captures today across the 5 cards was 20.5%

For the last 6 hour period I just checked, my 10GB card average error rate dropped to 17.3% (15.92 & 18.7) and the 8GB card error rate was at 21.3%.

Skip


IGNORE... all went to crap the next day (today)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61296 - Posted: 17 Feb 2024 | 17:17:48 UTC - in response to Message 61295.
Last modified: 17 Feb 2024 | 17:18:03 UTC

yeah i've been seeing higher error rates on my 12GB cards too.

still very low on my 16GB cards though.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61305 - Posted: 20 Feb 2024 | 23:04:35 UTC

My preferences are set to receive work from all apps, including beta ones, but none of my 4 GB VRAM graphics cards has received PYSCFbeta tasks lately.
Coincidence, or scheduler-driven behavior?
In the meantime, they are performing ATMbeta tasks without a single processing error so far.
And unsent PYSCFbeta tasks seem to keep growing, 39K+ at this moment.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61306 - Posted: 20 Feb 2024 | 23:23:36 UTC - in response to Message 61305.

My GPUs are all on the smaller-memory side, too. Since ATMbeta tasks became available again, I haven't picked up a single Quantum chemistry task.

I think it's either a cunning project plan, or (more likely) some subtle BOINC behaviour concerning our hosts' "reliability" rating on particular task types.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61307 - Posted: 21 Feb 2024 | 1:26:26 UTC - in response to Message 61306.

My GPUs are all on the smaller-memory side, too. Since ATMbeta tasks became available again, I haven't picked up a single Quantum chemistry task.

I think it's either a cunning project plan, or (more likely) some subtle BOINC behaviour concerning our hosts' "reliability" rating on particular task types.


it's because you have test tasks enabled. with that, it's giving preferential treatment for ATM tasks which are classified in the scheduler as beta/test.

QChem seems to not be classified in the scheduler as "test" or beta. despite being treated as such by the staff and the app name literally has the word beta in it. if you disable test tasks, and enable only QChem, you will get them still.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61308 - Posted: 21 Feb 2024 | 6:28:06 UTC - in response to Message 61307.

it's because you have test tasks enabled. with that, it's giving preferential treatment for ATM tasks which are classified in the scheduler as beta/test.

Thank you, that fully explains the fact.
In the dilemma of choosing between my 50%-erroring PYSCFbeta tasks and my 100%-succeeding ATMbeta tasks, I'll keep the latter.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61311 - Posted: 21 Feb 2024 | 13:09:41 UTC

bonjour,
j'aimerais calculer pour atmbeta avec ma gtx 1650 et pour quantum chemistry avec ma rtx 4060.
Je ne parviens pas a modifier le config.xml pour cela.
Je n'ai que des unités atmbeta a calculer et aucune unités quantum chemistry.
voici ce que j'ai mis dans le fichier config.xml de boinc.
Quelqu'un pourrait il m'aider.Merci d'avance.

Good afternoon,
I would like to compute for ATMbeta with my GTX 1650 and for Quantum chemistry with my RTX 4060.
I can't manage to modify the config.xml to do this.
I only get ATMbeta units to compute and no Quantum chemistry units.
Here is what I put in BOINC's config.xml file.
Could someone help me? Thanks in advance.


<cc_config>
<options>
<exclude_gpu>
<url>https://www.gpugrid.net/</url>
<device_num>0</device_num>
<type>NVIDIA</type>
<app>ATMbeta</app>
</exclude_gpu>
<exclude_gpu>
<url>https://www.gpugrid.net/</url>
<device_num>1</device_num>
<type>NVIDIA</type>
<app>PYSCFbeta</app>
</exclude_gpu>
<exclude_gpu>
<url>http://asteroidsathome.net/boinc/</url>
<device_num>0</device_num>
<type>NVIDIA</type>
</exclude_gpu>
<exclude_gpu>
<url>https://einstein.phys.uwm.edu/</url>
<device_num>0</device_num>
<type>NVIDIA</type>
</exclude_gpu>
<use_all_gpus>1</use_all_gpus>
<ncpus>-1</ncpus>
</options>
</cc_config>

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61312 - Posted: 21 Feb 2024 | 13:22:20 UTC - in response to Message 61311.

Scheduler requests from your host are not specific about what they're asking for. The host just asks for work for "Nvidia" and the scheduler on the project side decides what you need and what to send based on your preferences. The way the scheduler is set up right now, you won't be sent both types of work when both are available, only ATM.

You will need to move the GPUs to different hosts and set up the project preferences to be different for each of them, or run two clients on one host with one GPU attached to each,

or just stay with ATM on both cards.


____________

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61313 - Posted: 21 Feb 2024 | 13:32:31 UTC - in response to Message 61312.

OK, thanks.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61319 - Posted: 21 Feb 2024 | 22:59:26 UTC - in response to Message 61307.

QChem seems to not be classified in the scheduler as "test" or beta. despite being treated as such by the staff and the app name literally has the word beta in it. if you disable test tasks, and enable only QChem, you will get them still.

Adding a bit more variety to the current GPUGRID app spectrum: I happened to be watching the Server status page when a limited number (about 215) of "ATM: Free energy calculations of protein-ligand binding" tasks appeared, to be distinguished from the previously existing ATMbeta branch.
I managed to configure a venue on the GPUGRID preferences page to catch one of them before the unsent tasks vanished.
Task: tnks2_m5f_m5l_1_RE-QUICO_ATM_GAFF2_1fs-0-5-RND3367_1
To achieve this, I disabled getting test apps and enabled only the (somewhat paradoxically ;-) "ATM (beta)" app.
That task is currently running on my GTX 1660 Ti GPU, at an estimated rate of 9.72% per hour.

And quickly returning to the PYSCFbeta (QChem) topic: ready-to-send tasks for this app grew today to a noticeable 80K+.
After peaking, QChem unsent tasks are now decreasing again.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61320 - Posted: 22 Feb 2024 | 9:50:55 UTC

Bonjour
y a t il des unités de calcul pour windows disponible?

Hello
Are there work units available for Windows?
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61321 - Posted: 22 Feb 2024 | 11:05:33 UTC - in response to Message 61320.
Last modified: 22 Feb 2024 | 11:28:25 UTC

Yes, ATM and ATMbeta apps have both Windows and Linux versions currently available.

Edit.
Regarding Quantum chemistry, there is still no Windows version.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,854,782,676
RAC: 17,245,093
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61322 - Posted: 22 Feb 2024 | 16:47:17 UTC - in response to Message 61321.

Regarding Quantum chemistry, there is still no Windows version.

:-( :-( :-(

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61353 - Posted: 2 Mar 2024 | 1:00:26 UTC

This one barely made it:

https://www.gpugrid.net/workunit.php?wuid=27943603


Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61360 - Posted: 3 Mar 2024 | 0:58:43 UTC

https://imgur.com/evCBB73

GPUGRID error rate across 2x 3070 8GB, 2x 3080 10GB & 1x 4070 Super 12GB (the early part is with 3x 3070 8GB, one of which was replaced by the 4070S on 2/20).

Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61363 - Posted: 4 Mar 2024 | 13:00:29 UTC - in response to Message 61360.
Last modified: 4 Mar 2024 | 13:04:08 UTC




Going the wrong direction :-(

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61364 - Posted: 4 Mar 2024 | 13:07:01 UTC - in response to Message 61363.

to be expected with 8-10GB cards.

might get better context if you split the graphs up by card type. so you can see the relative error rate vs different VRAM sizes. I'm guessing most errors come from the 8GB cards.

____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61370 - Posted: 4 Mar 2024 | 16:02:45 UTC

On my GTX1080ti 11GB, I've only got about 1% error rate due to memory.

But watching 'nvidia-smi dmon' there are a lot of close shaves, where I'm only a couple of MB's below the limit...

So from a 10GB card, I'd already expect a non-trivial error rate.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61401 - Posted: 9 Mar 2024 | 20:10:53 UTC - in response to Message 61364.
Last modified: 9 Mar 2024 | 20:13:28 UTC

to be expected with 8-10GB cards.

might get better context if you split the graphs up by card type. so you can see the relative error rate vs different VRAM sizes. I'm guessing most errors come from the 8GB cards.


They do:

8GB – last 2 checks of 2 cards 44.07
10GB – last 2 checks of 2 cards 30.80
12GB – last 2 checks of 1 card 7.62

But I need to look at the last day or two as rates have been going up.
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61402 - Posted: 9 Mar 2024 | 20:22:16 UTC
Last modified: 9 Mar 2024 | 20:23:07 UTC

Anyone have insight into this error:

<stderr_txt>
09:06:00 (130033): wrapper (7.7.26016): starting
[x86_64-pc-linux-gnu__cuda1121.zip]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of x86_64-pc-linux-gnu__cuda1121.zip or
x86_64-pc-linux-gnu__cuda1121.zip.zip, and cannot find x86_64-pc-linux-gnu__cuda1121.zip.ZIP, period.
boinc_unzip() error: 9

It looks like every WU since the afternoon of the 7th (Zulu) is getting this but only on my single 12GB 4070S

Skip

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61403 - Posted: 9 Mar 2024 | 20:35:22 UTC - in response to Message 61402.

A download error caused the zip file to be corrupted: it is missing the end-of-central-directory signature.

I was getting that on a Google Drive zip archive a couple of days ago. Switching browsers let me download the archive correctly so it would unpack.
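
If you want to confirm that diagnosis by hand, Python's standard zipfile module can detect both failure modes seen in this thread: the missing end-of-central-directory signature above and the earlier per-file "bad CRC" errors. A small sketch; the file name is just the example from the log, point it at the archive in your own project/slot directory:

import zipfile

path = "x86_64-pc-linux-gnu__cuda1121.zip"   # example name, adjust to your download

if not zipfile.is_zipfile(path):
    # Truncated/corrupted download: the end-of-central-directory record is missing.
    print("Not a valid zip archive (end-of-central-directory signature not found)")
else:
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()                   # first member with a bad CRC, or None
        print("CRC check:", "OK" if bad is None else f"bad CRC in {bad}")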

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61404 - Posted: 9 Mar 2024 | 23:13:30 UTC - in response to Message 61403.

Download error causing the zip file to be corrupted because it is missing the end of file signature.

I was getting that on a Google Drive zip archive a couple of days ago. Switching browsers let me download the archive correctly so it would unpack.


Well after 100+ of these errors I finally got 3 good ones out of that box after a reboot for a different reason.

Thanx, Skip

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61405 - Posted: 11 Mar 2024 | 8:33:44 UTC

Bonjour
y a t il des unités de calcul pour windows disponible?

Hello
Are there work units available for Windows?
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61406 - Posted: 11 Mar 2024 | 11:59:23 UTC - in response to Message 61405.
Last modified: 11 Mar 2024 | 12:00:00 UTC

There are none for this project (at this time).

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61454 - Posted: 10 Apr 2024 | 11:35:30 UTC

Error rates skyrocketed on me for this app... even on the 10GB cards (12GB card will be back on Thursday). This started late on April 7th.

Error rate now over 50% so I will have to NNW till I can figure it out.

Skip
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61455 - Posted: 10 Apr 2024 | 13:17:01 UTC - in response to Message 61454.

Error rates skyrocketed on me for this app... even on the 10GB cards (12GB card will be back on Thursday). This started late on April 7th.

Error rate now over 50% so I will have to NNW till I can figure it out.

Skip


It's not you. It's that the new v4 tasks require more VRAM. I asked about this on their Discord.

I asked:
it seems the newer "v4" tasks on average require a bit more VRAM than the previous v3 tasks. I'm seeing a higher error percentage on 12GB cards.

v3 had about 5% failure from OOM on 12GB VRAM
v4 is more like 15% failure from OOM on 12GB VRAM
no failures with 16GB VRAM

what changed in V4?


Steve replied:
yes this make sense unfortunately. In the previous round of "inputs_v3**" it was calculating things incorrectly for any molecule containing Iodine. This is heaviest element in our dataset. The computational cost of this QM method scales with the size of the elements (it depends on the number of electrons). We are resending the incorrect calculations for Iodine containing molecules in this round of "v4" work units. Therefore the v4 set is a subset of the previous v3 WUs containing heavier elements, hence there are more OOM errors.

____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61456 - Posted: 10 Apr 2024 | 16:15:25 UTC - in response to Message 61455.
Last modified: 10 Apr 2024 | 16:15:57 UTC

Thank you. You probably just saved me hours of wasted time.

Error %
AVG ALL: 29.1
AVG – last 3: 59.0

8GB – last 2 72.76
10GB – last 2 66.52
12GB – last 2 3.55 (card out for a week)

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61470 - Posted: 16 Apr 2024 | 16:32:16 UTC - in response to Message 61455.
Last modified: 16 Apr 2024 | 16:39:06 UTC

Steve replied:

yes this make sense unfortunately. In the previous round of "inputs_v3**" it was calculating things incorrectly for any molecule containing Iodine. This is heaviest element in our dataset. The computational cost of this QM method scales with the size of the elements (it depends on the number of electrons). We are resending the incorrect calculations for Iodine containing molecules in this round of "v4" work units. Therefore the v4 set is a subset of the previous v3 WUs containing heavier elements, hence there are more OOM errors.


Any change in this situation?

I got my 12GB card back and my haphazard data collection seems to have it under a 9% error rate and with the very last grab showing 5.85%.

The 8GB & 10GB cards are still on NNW (other than 3 WUs I let through on the 10GB cards; they completed).

Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61471 - Posted: 21 Apr 2024 | 12:20:15 UTC - in response to Message 61470.
Last modified: 21 Apr 2024 | 12:29:46 UTC

Steve replied:
yes this make sense unfortunately. In the previous round of "inputs_v3**" it was calculating things incorrectly for any molecule containing Iodine. This is heaviest element in our dataset. The computational cost of this QM method scales with the size of the elements (it depends on the number of electrons). We are resending the incorrect calculations for Iodine containing molecules in this round of "v4" work units. Therefore the v4 set is a subset of the previous v3 WUs containing heavier elements, hence there are more OOM errors.


Any change in this situation?

I got my 12GB card back and my haphazard data collection seems to have it under a 9% error rate and with the very last grab showing 5.85%.


Something's coming around... error rates for the 10GB cards are now under 13% and the 12GB card is ~3%.

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61472 - Posted: 21 Apr 2024 | 12:54:27 UTC

I also see about 3% on my 12GB cards.

I think it will vary depending on what kind of molecules are being processed.
____________

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61473 - Posted: 21 Apr 2024 | 13:47:39 UTC - in response to Message 61472.

Right now, I am seeing less than a 2% error rate on my computers, each of which has an 11 GB card. This does vary over time.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61474 - Posted: 22 Apr 2024 | 21:47:03 UTC

I'm only seeing a single memory error in the last 300 results for my GTX 1080 Ti (11GB), so 0.33%.

Something I do get quite often is CRC errors when unzipping the input files, so the tasks fail within the first 30 seconds.

Anybody else seeing this?

https://www.gpugrid.net/results.php?userid=571263&offset=0&show_names=0&state=5&appid=47

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61475 - Posted: 23 Apr 2024 | 2:11:14 UTC - in response to Message 61474.
Last modified: 23 Apr 2024 | 2:12:16 UTC

No, I've not had any CRC errors unzipping the tar archives.

Sounds like a machine problem. Memory, heat, high workload latency??
