Message boards : News : PYSCFbeta: Quantum chemistry calculations on GPU

Author Message
Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60963 - Posted: 12 Jan 2024 | 13:03:21 UTC

Hello GPUGRID!

We are deploying a new app "PYSCFbeta: Quantum chemistry calculations on GPU". It is currently in testing/beta stage. It is only on Linux at the moment.

The app performs quantum chemistry calculations. At the moment we are using it specifically for Density Functional Theory calculations: http://en.wikipedia.org/wiki/Density_functional_theory

These types of calculations allow us to accurately compute specific properties of small molecules.
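For readers curious what such a calculation looks like in code, here is a minimal sketch using the open-source PySCF/gpu4pyscf packages; the molecule, basis set and functional below are illustrative assumptions, not the project's actual inputs.

# Minimal GPU DFT single-point energy sketch (illustrative only; assumes gpu4pyscf is installed)
from pyscf import gto
from gpu4pyscf.dft import rks

# a small example molecule (water); the project's molecules and settings differ
mol = gto.M(
    atom="O 0.000 0.000 0.000; H 0.757 0.586 0.000; H -0.757 0.586 0.000",
    basis="def2-tzvpp",
)
mf = rks.RKS(mol, xc="B3LYP")   # restricted Kohn-Sham DFT, runs on the GPU
energy = mf.kernel()            # run the self-consistent field iterations
print("Total energy (Hartree):", energy)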


The current test work units have a runtime on the order of 1 hour (very much dependent on the GPU speed and the size of the molecule). Each work unit currently contains 1 molecule with ~10 configurations.

The app will not work on GPUs with compute capability less than 6.0. The scheduler should not be sending work to these cards, but I think at the moment this check is not working properly.

The work units require a lot of GPU memory. They work best when a work unit is the only thing running on the GPU; if other programs are using significant GPU memory, the work unit might fail.

Looking forward to hearing feedback from you.

Steve

Skillz
Joined: 6 Jun 17 | Posts: 4 | Credit: 8,090,535,479 | RAC: 49,348,086
Message 60964 - Posted: 12 Jan 2024 | 13:32:48 UTC

When can we expect to start getting these new tasks?

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60965 - Posted: 12 Jan 2024 | 13:56:16 UTC

Now, if you are using Linux and have "Run test applications?" selected.

roundup
Joined: 11 May 10 | Posts: 57 | Credit: 1,724,820,193 | RAC: 13,910,294
Message 60966 - Posted: 12 Jan 2024 | 13:57:24 UTC - in response to Message 60964.
Last modified: 12 Jan 2024 | 14:19:26 UTC

When can we expect to start getting these new tasks?

They are being distributed RIGHT NOW.
The first 6 WUs have arrived here.

bormolino
Joined: 16 May 13 | Posts: 41 | Credit: 79,726,864 | RAC: 274
Message 60967 - Posted: 12 Jan 2024 | 14:00:40 UTC

I only get "No tasks sent".

Test applications are allowed and I have compute capability 8.6 with 12 GB of GPU memory, running Ubuntu.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60968 - Posted: 12 Jan 2024 | 15:14:43 UTC

Steve,

There is an issue with this application that will only be apparent on multi-GPU systems.

The application seems to be hard-coded in some way to always use GPU 0, or the BOINC device assignment is somehow not being correctly communicated to the app.

This results in all tasks running on the same GPU when they should be split across different GPUs. Due to the high VRAM use, this fills the VRAM on most GPUs and causes errors.

See here:

GLaDOS:~$ nvidia-smi
Fri Jan 12 10:05:59 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN V On | 00000000:21:00.0 On | N/A |
| 80% 55C P2 88W / 150W | 9453MiB / 12288MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN V On | 00000000:22:00.0 Off | N/A |
| 80% 34C P2 36W / 150W | 42MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN V On | 00000000:42:00.0 Off | N/A |
| 80% 42C P2 39W / 150W | 42MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN V On | 00000000:61:00.0 Off | N/A |
| 80% 35C P2 36W / 150W | 42MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1612 G /usr/lib/xorg/Xorg 94MiB |
| 0 N/A N/A 1961 C+G ...libexec/gnome-remote-desktop-daemon 311MiB |
| 0 N/A N/A 2000 G /usr/bin/gnome-shell 67MiB |
| 0 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 223543 M+C python 4490MiB |
| 0 N/A N/A 223769 M+C python 4462MiB |

| 1 N/A N/A 1612 G /usr/lib/xorg/Xorg 6MiB |
| 1 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
| 2 N/A N/A 1612 G /usr/lib/xorg/Xorg 6MiB |
| 2 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
| 3 N/A N/A 1612 G /usr/lib/xorg/Xorg 6MiB |
| 3 N/A N/A 5931 C nvidia-cuda-mps-server 30MiB |
+---------------------------------------------------------------------------------------+


As highlighted above, both python processes (PIDs 223543 and 223769) are running on the same GPU (GPU 0).
____________

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60969 - Posted: 12 Jan 2024 | 15:19:15 UTC

Also, could you please add an explicit QChem for GPU selection on the project preferences page? Currently it is only possible to get this app if you have ALL apps selected plus test apps. I want to exclude some apps but still get this one.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60970 - Posted: 12 Jan 2024 | 15:21:46 UTC - in response to Message 60968.

Ah yes, thank you for confirming this! This is an omission in the scripts on my end. My test machine has one GPU, so I missed it. This can be fixed, thank you.

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60971 - Posted: 12 Jan 2024 | 15:23:29 UTC

I will try to get the web interface updated, but this will take longer due to my unfamiliarity with it. Thanks.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60972 - Posted: 12 Jan 2024 | 16:20:22 UTC

just a hunch but I think the problem is with your export command in the run.sh

you have:

export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE


which, if I'm reading it right, will set all visible devices to just one GPU. This will have a bad impact on any other tasks running in the BOINC environment, I think.

Normally on my 4x GPU system I have CUDA_VISIBLE_DEVICES=0,1,2,3, and if you override that to just the single CUDA device it seems to shuffle all tasks there instead.
____________

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60973 - Posted: 12 Jan 2024 | 17:45:08 UTC - in response to Message 60972.

just a hunch but I think the problem is with your export command in the run.sh

you have:
export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE


which if I'm reading it right, will set all visible devices to just one GPU. this will have a bad impact for any other tasks running in the BOINC environment i think.

normally on my 4x GPU system, I have CUDA_VISIBLE_DEVICES=0,1,2,3, and if you override that to just the single CUDA device it seems to shuffle all tasks there instead.


I guess this wasn't the problem after all :) I see a new small batch went out; I downloaded some and they are working fine now.

____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60974 - Posted: 12 Jan 2024 | 18:02:58 UTC - in response to Message 60973.

just a hunch but I think the problem is with your export command in the run.sh

you have:
export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE


which if I'm reading it right, will set all visible devices to just one GPU. this will have a bad impact for any other tasks running in the BOINC environment i think.

normally on my 4x GPU system, I have CUDA_VISIBLE_DEVICES=0,1,2,3, and if you override that to just the single CUDA device it seems to shuffle all tasks there instead.


I guess this wasn't the problem after all :) I see a new small batch went out; I downloaded some and they are working fine now.


Hello, can you confirm the latest WUs are getting assigned to different GPUs in the way you would expect?


The line in the script you mentioned is actually the fix I just made. In the first round I had forgotten to include it.

When the BOINC client runs the app via the wrapper mechanism, it specifies the GPU device, which we capture in the variable CUDA_DEVICE. The Python CUDA code in our app uses the CUDA_VISIBLE_DEVICES variable to choose the GPU. When it is not set (as in the first round of jobs) it defaults to zero, so all jobs end up on GPU zero. With this fix the WUs will run on the device specified by the BOINC client.
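For illustration, a minimal Python sketch of the mechanism described above (variable handling is an assumption for illustration, not the project's actual code): the wrapper's export makes only the BOINC-assigned GPU visible, so CUDA code that would otherwise default to device 0 lands on the correct card.

import os

# run.sh does the equivalent of: export CUDA_VISIBLE_DEVICES=$CUDA_DEVICE
# If CUDA_VISIBLE_DEVICES is left unset, CUDA enumerates all GPUs and the code
# defaults to device 0, which is why the first batch piled every task onto GPU 0.
boinc_device = os.environ.get("CUDA_DEVICE", "0")          # device index passed by the wrapper
os.environ.setdefault("CUDA_VISIBLE_DEVICES", boinc_device)

import cupy as cp  # import after masking, so only the assigned GPU is visible
print("visible CUDA devices:", cp.cuda.runtime.getDeviceCount())  # prints 1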

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60975 - Posted: 12 Jan 2024 | 18:09:30 UTC - in response to Message 60974.

yup. I just ran 4 tasks on the same 4-GPU system and each one went to a different GPU as it should.

I see in the stderr that the device was selected properly.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60976 - Posted: 12 Jan 2024 | 18:12:10 UTC - in response to Message 60975.

Thanks very much for the help!

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60977 - Posted: 12 Jan 2024 | 18:13:10 UTC - in response to Message 60975.
Last modified: 12 Jan 2024 | 18:31:33 UTC

Also, does this app make much use of FP64? I'm noticing very fast runtimes on a Titan V, even faster than something like an RTX 3090. The Titan V is slower in FP32, but roughly 14x faster in FP64.

It's hard to follow the code, but I did see that you use CuPy a lot, and maybe something in CuPy is able to accelerate the Titan V in some way.

Or maybe it's a Tensor core difference? Does this QChem app use the tensor cores?
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60978 - Posted: 12 Jan 2024 | 19:07:31 UTC - in response to Message 60977.

Yes, this app does make use of some double-precision arithmetic; high precision is needed in QM calculations. The bulk of the crunching is done by NVIDIA's cuSOLVER library, which I believe uses tensor cores when available.
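As an illustration of the kind of FP64 work involved (a sketch, not the app's actual code): diagonalizing a symmetric matrix is a core step of an SCF calculation, and CuPy dispatches it to cuSOLVER in double precision.

import cupy as cp

n = 2000
a = cp.random.rand(n, n, dtype=cp.float64)   # random FP64 matrix (illustrative size)
a = (a + a.T) / 2                            # symmetrize it
w, v = cp.linalg.eigh(a)                     # eigendecomposition via cuSOLVER, all in FP64
print("largest eigenvalue:", float(w[-1]))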

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60979 - Posted: 12 Jan 2024 | 19:10:08 UTC - in response to Message 60978.

Awesome, thanks for that info.

Looking forward to you re-releasing all the tasks you had to pull back earlier :)
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60980 - Posted: 12 Jan 2024 | 19:16:45 UTC - in response to Message 60979.

Yes we will restart the large scale test next week!

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 60981 - Posted: 12 Jan 2024 | 20:52:58 UTC - in response to Message 60980.

+1

GWGeorge007
Joined: 4 Mar 23 | Posts: 10 | Credit: 1,804,577,500 | RAC: 7,184,174
Message 60982 - Posted: 13 Jan 2024 | 11:51:35 UTC

+1
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60984 - Posted: 15 Jan 2024 | 13:07:46 UTC

Sending out work for this app today. The work units take about an hour (very approximately). They should be using different GPUs on multi-GPU systems. Please let me know if you see anything not working as you would normally expect.

ServicEnginIC
Joined: 24 Sep 10 | Posts: 566 | Credit: 6,321,177,024 | RAC: 17,362,450
Message 60985 - Posted: 15 Jan 2024 | 13:50:37 UTC

Everything working as expected at my hosts.
Well done!
👍️

Freewill
Joined: 18 Mar 10 | Posts: 13 | Credit: 7,093,713,894 | RAC: 40,189,547
Message 60986 - Posted: 15 Jan 2024 | 13:57:02 UTC - in response to Message 60984.

Steve, so far the first few tasks are completing and being validated for me on single and multi-GPU systems.

Drago
Joined: 3 May 20 | Posts: 12 | Credit: 342,481,560 | RAC: 449,239
Message 60987 - Posted: 15 Jan 2024 | 15:45:24 UTC

My host is an R9-3900X with an RTX 3070 Ti running Ubuntu 20.04.6 LTS, but it doesn't receive Quantum chemistry work units. I selected it in the preferences, along with test work and "ok to send work of other subprojects". Did I miss anything?

Richard Haselgrove
Joined: 11 Jul 09 | Posts: 1576 | Credit: 5,887,311,851 | RAC: 10,471,281
Message 60988 - Posted: 15 Jan 2024 | 16:18:22 UTC - in response to Message 60987.

My host is an R9-3900X, RTX 3070-Ti running ubuntu 20.04.06 LTS but it doesn't receive Quantum chemistry work units. I selected it in the preferences, test work and "ok to send work of other subprojects". Did I miss anything?

I had the same problem until I ticked every available application for the venue, resulting in "(all applications)" showing on the confirmation page.

Having cleared that hurdle, I note that the tasks are estimated to run for 1 minute 36 seconds (slower device) and 20 seconds (fastest device). The machines have most recently been running ATMbeta (Python) tasks, and have been left with "Duration Correction Factors" of 0.0148 and 0.0100 as a result. The target value should be 1.0000 in all cases. Could you please keep an eye on the <rsc_fpops_est> value for each workunit type, to try to minimise these large fluctuations when new applications are deployed?
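For context, a rough sketch of the arithmetic behind those estimates (a simplification of the BOINC client's logic, with made-up numbers): the client scales the work estimate by the device's estimated speed and then by the leftover DCF, so a DCF of ~0.01 shrinks an hour-long task to seconds.

# all numbers below are hypothetical, for illustration only
rsc_fpops_est = 1.0e15       # server-side estimate of the work in the task (FLOPs)
est_device_flops = 2.0e11    # client-side speed estimate for the GPU app (FLOPS)
dcf = 0.0148                 # Duration Correction Factor left over from ATMbeta

estimated_runtime_s = rsc_fpops_est / est_device_flops * dcf
print(f"estimated runtime: {estimated_runtime_s:.0f} s")   # ~74 s, versus ~1 h of real runtime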

Freewill
Joined: 18 Mar 10 | Posts: 13 | Credit: 7,093,713,894 | RAC: 40,189,547
Message 60989 - Posted: 15 Jan 2024 | 16:27:47 UTC - in response to Message 60988.

Drago,
You also need to check the "Run test applications?" box.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60990 - Posted: 15 Jan 2024 | 16:39:22 UTC - in response to Message 60984.

Sending out work for this app today. The work units take an hour (very approximately). They should be using different GPUs on multigpu systems. Please let me know if you see anything not working as you would normally expect


At least one of my computers is unable to get any tasks; the scheduler just reports that no tasks were sent.

It's inexplicable, since it has the exact same configuration as a system that is receiving tasks just fine.

They are both on the same venue, and that venue has ALL applications selected, test/beta apps allowed, and "allow other apps" selected. Not sure what's going on here.

The only difference is one has 4 GPUs and the other has 7.

will get work: https://gpugrid.net/show_host_detail.php?hostid=582493
will not get work: https://gpugrid.net/show_host_detail.php?hostid=605892
____________

Drago
Joined: 3 May 20 | Posts: 12 | Credit: 342,481,560 | RAC: 449,239
Message 60991 - Posted: 15 Jan 2024 | 16:43:15 UTC

Yeah! I have all boxes checked but I still don't get work. Maybe it is a problem with the driver? I have version 470 installed, which has worked fine for me so far...

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60992 - Posted: 15 Jan 2024 | 17:17:29 UTC - in response to Message 60990.



it's inexplicable since it is the exact same configuration as a system that is receiving tasks just fine.


OK, thanks for this information. There must be something unexpected going on with the scheduler.

ServicEnginIC
Joined: 24 Sep 10 | Posts: 566 | Credit: 6,321,177,024 | RAC: 17,362,450
Message 60993 - Posted: 15 Jan 2024 | 17:34:11 UTC

I made a couple of tests with these new PYSCFbeta tasks.
I tried stopping two of them, and they restarted without erroring. This is good... but both of them had their execution times reset and restarted from the beginning. This is not so good...
The tests were made on a mixed dual-GPU system (GTX 1660 Ti + GTX 1650). Unlike ACEMD tasks, both tasks were restarted on a different GPU model than the one they started on, and they did not crash. This is good!

Also, I've noticed a considerable reduction in power draw (roughly halved) compared to ACEMD tasks.
GPU power draw on the GTX 1660 Ti with PYSCFbeta tasks is about half of what I'm used to seeing with ACEMD tasks.
And the same happens on the GTX 1650.
Consequently, although 100% GPU usage is shown, working temperatures are much lower...

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 60994 - Posted: 15 Jan 2024 | 17:41:13 UTC - in response to Message 60993.

Stopping and resuming is not currently implemented. It will just restart from the beginning.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60995 - Posted: 15 Jan 2024 | 17:49:44 UTC - in response to Message 60992.



it's inexplicable since it is the exact same configuration as a system that is receiving tasks just fine.


Ok thanks for this information. There must be something unexpected going on with the scheduler.


Are you able to inspect the scheduler log for this host? Can you see more detail about the specific reason it was not sent any work?

The only thing I see on my end is "no tasks sent", with no reason given.
____________

Sasa Jovicic
Joined: 22 Oct 09 | Posts: 2 | Credit: 274,077,500 | RAC: 1,051,937
Message 60996 - Posted: 15 Jan 2024 | 18:21:18 UTC

I have the same problem too: no tasks sent!

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 60997 - Posted: 15 Jan 2024 | 19:51:33 UTC
Last modified: 15 Jan 2024 | 19:59:07 UTC

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 60998 - Posted: 15 Jan 2024 | 20:24:11 UTC

Another observation: keep an eye on your CPU use.

These look to be another MT+CUDA setup that BOINC is not prepared to handle, much like the PythonGPU work. I saw upwards of 30 threads utilized per task, but it wasn't sustained; it came in bursts.

On average, reported cpu_time and runtime were about 4x the actual time (15 min actual would be reported as about an hour of runtime).
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61000 - Posted: 15 Jan 2024 | 20:58:03 UTC

Thanks for listing the host IDs that are not receiving work. I can see them in the scheduler logs, so hopefully I can pinpoint why they are not getting work.

And yes, I missed a setting to limit the multi-threading, thanks for catching that!
(All the modern libraries try very hard to multi-thread without telling you they are going to, haha.)
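For anyone curious, a generic sketch of how that kind of hidden multi-threading is usually capped (not necessarily the exact fix used here): set the well-known threading environment variables before the numerical libraries are imported.

import os

# cap BLAS/OpenMP thread pools to one thread per task; this must happen before
# the libraries that read these variables are imported/initialized
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
            "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ.setdefault(var, "1")

import numpy as np  # libraries imported after this point pick up the limits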

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61001 - Posted: 15 Jan 2024 | 21:01:32 UTC - in response to Message 61000.

I think if you add a discrete checkbox selection for QChem on GPU in the project preferences, that will solve the issues with requesting work for this app.
____________

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61002 - Posted: 17 Jan 2024 | 0:35:42 UTC - in response to Message 61001.

Thank you to whoever got the discrete checkbox implemented in the settings :). This should make getting work much less of a hassle.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61003 - Posted: 17 Jan 2024 | 9:43:41 UTC

The app will now appear in the GPUGRID preferences as "Quantum chemistry on GPU (beta)".

The previous scheduler problems should be fixed. (I can see that https://gpugrid.net/results.php?hostid=605892 is now getting jobs when before it was not.)

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 61004 - Posted: 17 Jan 2024 | 12:23:36 UTC - in response to Message 60997.

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61005 - Posted: 17 Jan 2024 | 12:41:37 UTC

Yes, thanks! This works much better now.

More observed behavior: this batch seems to use less VRAM, and is also more limited in CPU use; it's pretty much stuck to just 1 thread per process now. Not sure if it's a consequence of the CPU limiting or some other change that reduced the VRAM use, but these tasks run a bit slower than the last batch.

If it's bottlenecked by the CPU limiting, maybe there's a middle ground? Like letting it use up to 4 cores?

In the last batch 2 days ago, on a Titan V I was running 2x in about 15 mins total (7.5 mins per task). This new batch was doing about 25 mins for two tasks (12.5 mins per task). Since VRAM use has gone down, I'm doing 3x in about 25-30 mins (8.5-10 mins per task), and I'm experimenting with 4x now as well. But the old tasks were undoubtedly faster for some reason.
____________

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61006 - Posted: 17 Jan 2024 | 13:02:49 UTC - in response to Message 61005.

The most recent WUs are just twice the size of the previous test set: there are 100 molecules in each WU now, whereas previously there were 50.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61007 - Posted: 17 Jan 2024 | 13:38:39 UTC - in response to Message 61006.

oh ok, that explains it!
____________

Freewill
Joined: 18 Mar 10 | Posts: 13 | Credit: 7,093,713,894 | RAC: 40,189,547
Message 61008 - Posted: 17 Jan 2024 | 13:52:04 UTC - in response to Message 61006.

The most recent WUs are just twice the size of the previous test set: there are 100 molecules in each WU now, whereas previously there were 50.

I wouldn't complain if the credit per task was also doubled. ;)

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61009 - Posted: 17 Jan 2024 | 15:08:44 UTC - in response to Message 61006.

Steve,

Can you see if you can lift the task download limits? Currently it looks like each scheduler request will only send one task, instead of a few at a time.

With multiple computers at the same location, coupled with the DoS protection on your network preventing multiple requests from the same IP, I get scheduler request failures pretty often, which limits how many tasks I can download and keeps some of the GPUs from working.

I know you probably can't do anything about the network DoS protections, but can you allow multiple tasks to download in a single request?
____________

Sasa Jovicic
Joined: 22 Oct 09 | Posts: 2 | Credit: 274,077,500 | RAC: 1,051,937
Message 61010 - Posted: 17 Jan 2024 | 16:22:01 UTC

I made a fresh Linux Mint installation and it is OK for me now. Now I can download new WUs.

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61011 - Posted: 17 Jan 2024 | 17:45:33 UTC - in response to Message 60963.

The app will not work on GPUs with compute capability less than 6.0. It should not be sending them to these cards but I think at the moment this functionality is not working properly.

WUs are being sent to GPUs like GTX 960 (cc=5.2, 2 GB VRAM) and they fail. E.g.,
https://www.gpugrid.net/show_host_detail.php?hostid=550055
https://developer.nvidia.com/cuda-gpus

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61012 - Posted: 17 Jan 2024 | 17:51:36 UTC - in response to Message 61011.
Last modified: 17 Jan 2024 | 17:53:02 UTC

The app will not work on GPUs with compute capability less than 6.0. It should not be sending them to these cards but I think at the moment this functionality is not working properly.

WUs are being sent to GPUs like GTX 960 (cc=5.2, 2 GB VRAM) and they fail. E.g.,
https://www.gpugrid.net/show_host_detail.php?hostid=550055
https://developer.nvidia.com/cuda-gpus


Steve mentioned that the scheduler block on low-CC cards wasn't working properly.

It's best to uncheck QChem for GPU in your project preferences for those hosts.

Edit:
Sorry, disregard; I thought you were talking about your own host. Since that host is anonymous, there's not really anything to be done at the moment. We will just have to deal with the resends.
____________

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61013 - Posted: 17 Jan 2024 | 18:18:22 UTC
Last modified: 17 Jan 2024 | 18:22:16 UTC

When you send out WUs with 0.991C + 1NV, BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my "Use at most N-2 CPUs" setting to accommodate these tasks. If I don't, they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may belong to a Science United member, so many failures can be expected. But with 7 errors allowed, the WUs will probably find a qualified cruncher before they die.

bormolino
Joined: 16 May 13 | Posts: 41 | Credit: 79,726,864 | RAC: 274
Message 61014 - Posted: 17 Jan 2024 | 18:33:49 UTC

The 1.92 GB file downloads at only ~19.75 KB/s.

There's no chance of getting the file within the deadline.

I have mentioned this problem multiple times in multiple threads. It seems like nobody cares, even though the problem affects multiple users.

Skillz
Joined: 6 Jun 17 | Posts: 4 | Credit: 8,090,535,479 | RAC: 49,348,086
Message 61015 - Posted: 17 Jan 2024 | 18:47:36 UTC - in response to Message 61009.

Steve,

Can you see if you can lift the task download limits? Currently it looks like each scheduler request will only send one task, instead of a few at a time.

With multiple computers at the same location, coupled with the DoS protection on your network preventing multiple requests from the same IP, I get scheduler request failures pretty often, which limits how many tasks I can download and keeps some of the GPUs from working.

I know you probably can't do anything about the network DoS protections, but can you allow multiple tasks to download in a single request?


I've got this issue also. We need to be able to download multiple tasks in one request; otherwise the GPU sits idle or grabs a backup-project task, and then misses multiple requests until that task completes.

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61016 - Posted: 17 Jan 2024 | 19:33:59 UTC - in response to Message 61013.

When you send out WUs with 0.991C + 1NV BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my Use At Most N-2 CPUs to accommodate these tasks. If not they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may be a Science United member so many failures can be expected. But with 7 errors allowed they'll probably find a qualified cruncher before they die.

You can always override that with an app_config.xml file in the project folder and assign 1.0 CPU threads to the task.

gemini8
Joined: 3 Jul 16 | Posts: 31 | Credit: 1,329,100,176 | RAC: 4,926,985
Message 61017 - Posted: 17 Jan 2024 | 21:25:46 UTC - in response to Message 61014.

Hello.

The 1.92 GB file downloads at only ~19.75 KB/s.

I'm also encountering the issue of slow downloads on several hosts.
It would be nice if the project infrastructure worked a little bit faster on our downloads.
Thank you.
____________
Greetings, Jens

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61018 - Posted: 17 Jan 2024 | 22:04:02 UTC - in response to Message 61017.

Hello.
The 1.92 GB file downloads at only ~19.75 KB/s.

I'm also encountering the issue of slow downloads on several hosts.
It would be nice if the project infrastructure worked a little bit faster on our downloads.
Thank you.


Once this file is downloaded, you don't need to download it again; it's re-used for every task. The input files sent for each task are very small and will download quickly.
____________

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 61019 - Posted: 18 Jan 2024 | 2:56:16 UTC - in response to Message 61004.
Last modified: 18 Jan 2024 | 2:57:20 UTC

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.



After running these tasks successfully for almost a day on both of my computers, the "Remaining (estimated)" time in the BOINC Manager tasks tab now shows approximately 24 days to complete on one computer and 62 days on the other at the start of a task, and counts down incrementally from there. The task actually completes successfully in a little over an hour. A few hours ago, they were showing the correct times to complete.

Everything else is working fine, but this is definitely unusual. Has anyone else observed this?

[AF>Libristes] alain65
Joined: 30 May 14 | Posts: 9 | Credit: 1,666,048,820 | RAC: 7,748,754
Message 61020 - Posted: 18 Jan 2024 | 3:13:07 UTC

Good morning.
The PYSCFbeta: Quantum Chemistry Calculations on GPU WUs work well on my 1080 Ti and 1650 Ti.
Unfortunately, on my GTX 970 with 4 GB VRAM I receive many WUs but they quickly error out.
I run Debian 11 with the NVIDIA 470 driver.
Is this hardware too old?
For the moment I have deselected Quantum chemistry on GPU (beta) on this machine, because I quickly reached the daily maximum and would still like to do other GPUGRID WUs if there are any.
____________
PC are like air conditioning, they becomes useless when you open Windows (L.T)

In a world without walls and fences, who needs windows and gates?

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61021 - Posted: 18 Jan 2024 | 4:25:39 UTC - in response to Message 61020.

Good morning.
Wu Pyscfbeta: Quantum Chemistry Calculations On GPU work well on my 1080TI and 1650TI.
Unfortunately on my GTX 970 with 4 GB VRAM, I receive many WUs but they quikly go in error.
I run with Debian 11 and the Nvidia 470 driver.
Is this material too old?
For the moment I have removed Quantum Chemistry on GPU (BETA) on this machine. Because I quickly arrived a daily maximum and would still want to do other wu gpugrid if there are any.


The project admin said at the beginning that the application will only work for cards with compute capability of 6.0 or greater. This equates to cards of Pascal generation and newer.

Your GTX 970 is Maxwell with a compute capability of 5.2. It is too old for this app.
____________

[AF>Libristes] alain65
Joined: 30 May 14 | Posts: 9 | Credit: 1,666,048,820 | RAC: 7,748,754
Message 61022 - Posted: 18 Jan 2024 | 7:17:09 UTC - in response to Message 60963.

Okay ... the answer was in the first message:



The app will not work on GPUs with compute capability less than 6.0. It should not be sending them to these cards but I think at the moment this functionality is not working properly.



Sorry for the trouble ;)
____________
PC are like air conditioning, they becomes useless when you open Windows (L.T)

In a world without walls and fences, who needs windows and gates?

Skip Da Shu
Joined: 13 Jul 09 | Posts: 61 | Credit: 827,525,165 | RAC: 10,449,053
Message 61023 - Posted: 18 Jan 2024 | 15:21:10 UTC - in response to Message 61022.

OMG, LOL, I love this and must go abuse it...

PC are like air conditioning, they becomes useless when you open Windows (L.T)

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61026 - Posted: 18 Jan 2024 | 15:48:02 UTC - in response to Message 61016.

When you send out WUs with 0.991C + 1NV BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my Use At Most N-2 CPUs to accommodate these tasks. If not they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may be a Science United member so many failures can be expected. But with 7 errors allowed they'll probably find a qualified cruncher before they die.

You can always override that with an app_config.xml file in the project folder and assign 1.0 cpu threads to the task.

I know I can. What about the many people who leave BOINC on autopilot?
I've seen multiple instances of 5 errors before a WU got to me. Fixing this is in Steve's best interest.

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61028 - Posted: 18 Jan 2024 | 15:51:15 UTC - in response to Message 61019.

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.



After running these tasks successfully for almost a day on both of my computers, now my BOINC manager, task tab, Remaining (estimated) "time" is telling approximately 24 days to complete on one computer and 62 days on the other, at the task's beginning, and incrementally counts down from there. The task actually completes successfully in a little over an hour. A few hours ago, they were showing the correct times to complete.

Everything else is working fine, but this is definitely unusual. Did anyone else observed this?

At first I did. But including <fraction_done_exact/> seems to heal that fairly quickly.
<app>
<name>PYSCFbeta</name>
<!-- Quantum chemistry calculations on GPU -->
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1</cpu_usage>
<gpu_usage>1</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61029 - Posted: 18 Jan 2024 | 16:11:41 UTC - in response to Message 61026.

When you send out WUs with 0.991C + 1NV BOINC does not assign a CPU core to that task. You should designate them 1C.
I've been changing my Use At Most N-2 CPUs to accommodate these tasks. If not they slow down significantly.

That GTX 960 I pointed out also has BOINC 7.7 installed and may be a Science United member so many failures can be expected. But with 7 errors allowed they'll probably find a qualified cruncher before they die.

You can always override that with an app_config.xml file in the project folder and assign 1.0 cpu threads to the task.

I know I can. What about the many people that leave BOINC on autopilot?
I've seen multiple instances of 5 errors before a WU got to me. It's in Steve's best interest.


The errors have nothing to do with the CPU resource allocation setting. They all errored because they ran on GPUs that are too old; the app needs cards with a CC of at least 6.0 (Pascal and up).

At worst, if someone is running the CPU flat out at 100% and not leaving spare CPU cycles available (as they should), the GPU task might run a little more slowly, but it won't fail.

I believe that the issue of "0.991" CPUs or whatever is a byproduct of the BOINC server-side software. From what I've read elsewhere, this value is not intentionally set by the researchers; it is automatically selected by the BOINC server somewhere along the way, and the researchers here have previously commented that they are not aware of any way to override this server-side. So competent users can just override it themselves if they prefer. Setting your CPU use in BOINC to something like 99 or 98% has the same overall effect, though.
____________

Bedrich Hajek
Joined: 28 Mar 09 | Posts: 468 | Credit: 8,486,022,716 | RAC: 10,942,361
Message 61032 - Posted: 18 Jan 2024 | 23:46:55 UTC - in response to Message 61028.

Here is a tale of 2 computers, one that was getting units, and the other was not.

https://www.gpugrid.net/hosts_user.php?userid=19626

They both have the same GPUGRID preferences.



I am getting tasks on both computers, now. So far, all tasks are completing successfully.



After running these tasks successfully for almost a day on both of my computers, now my BOINC manager, task tab, Remaining (estimated) "time" is telling approximately 24 days to complete on one computer and 62 days on the other, at the task's beginning, and incrementally counts down from there. The task actually completes successfully in a little over an hour. A few hours ago, they were showing the correct times to complete.

Everything else is working fine, but this is definitely unusual. Did anyone else observed this?

At first I did. But including <fraction_done_exact/> seems to heal that fairly quickly.
<app>
<name>PYSCFbeta</name>
<!-- Quantum chemistry calculations on GPU -->
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1</cpu_usage>
<gpu_usage>1</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>




Thanks for this information. I updated my computers.

Now, I remember this <fraction_done_exact/> from a post several years ago; I can't remember the thread. In the past I didn't need it, because the tasks would correct themselves eventually, even the ATMbetas.

The Quantum chemistry on GPU tasks do the complete opposite. I wonder if this is connected to the observation of "upwards of 30 threads utilized per task" posted by Ian&Steve C.?

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61033 - Posted: 19 Jan 2024 | 0:07:25 UTC - in response to Message 61032.

Nah, the multi-threading issue has already been fixed; the app only uses a single thread now.
____________

zombie67 [MM]
Joined: 16 Jul 07 | Posts: 207 | Credit: 1,736,801,456 | RAC: 4,977,645
Message 61034 - Posted: 19 Jan 2024 | 2:30:50 UTC - in response to Message 60963.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a Pascal card, so it meets the compute capability requirement, but it has only 2 GB of VRAM. Without knowing the amount of VRAM required, I am not sure if it will work.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61035 - Posted: 19 Jan 2024 | 3:41:52 UTC - in response to Message 61034.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a Pascal card, so it meets the compute capability requirement, but it has only 2 GB of VRAM. Without knowing the amount of VRAM required, I am not sure if it will work.


It requires more than 2 GB.
____________

zombie67 [MM]
Joined: 16 Jul 07 | Posts: 207 | Credit: 1,736,801,456 | RAC: 4,977,645
Message 61036 - Posted: 19 Jan 2024 | 4:24:35 UTC - in response to Message 61035.

It requires more than 2GB


Good to know. Thanks!
____________
Reno, NV
Team: SETI.USA

Steve
Volunteer moderator | Project administrator | Project developer | Project tester | Volunteer developer | Volunteer tester | Project scientist
Joined: 21 Dec 23 | Posts: 32 | Credit: 0 | RAC: 0
Message 61037 - Posted: 19 Jan 2024 | 14:07:54 UTC - in response to Message 61029.



the errors have nothing to do with the CPU resource allocation setting. they all errored because of running on GPUs that are too old, the app needs cards with at least CC of 6.0+ (Pascal and up).

at worst, if someone is running the CPU full out 100% and not leaving space CPU cycles available (as they should), the worst that happens is that the GPU task might run a little more slowly. but it wont fail.

I believe that the issue of "0.991" CPUs or whatever is a byproduct of the BOINC serverside software. from what I've read elsewhere, this value is not intentionally set by the researchers, it is automatically selected by the BOINC server somewhere along the way, and the researchers here have previously commented that they are not aware of any way to override this serverside. so competent users can just override it themselves if they prefer. setting your CPU use in BOINC to like 99 or 98% has the same effect overall though.


This is all correct, I believe.

It seems that the jobs have enough retry attempts that all work units eventually end up succeeding. The scheduler has an inbuilt mechanism to classify hosts as "reliable"; it also has a mechanism to send workunits that have failed a few times only to hosts that are "reliable". This is not ideal, of course. We will try to get the CC requirements honoured, but these are project-wide scheduler settings which are rather complex to fix without breaking everything else that is currently working.

The download limitation is something I will not be able to change easily. A potential reason I can guess for the current settings is to stop a failing host acting as a black hole for failed jobs, or something similar.

The large file download should happen just once. The app is deployed in the same way as the ATM app: it is a 2 GB zip file that contains a Python environment and some CUDA libraries. Each work unit only requires downloading a small file (<1 MB, I think).


This last large-scale run has been rather impressive. The throughput was very high! Especially considering that it is only on Linux hosts and not Windows. We will be sending some similar batches over the next few weeks.

[AF>Libristes] alain65
Joined: 30 May 14 | Posts: 9 | Credit: 1,666,048,820 | RAC: 7,748,754
Message 61039 - Posted: 20 Jan 2024 | 3:25:58 UTC - in response to Message 61037.

Hello Steve.

The throughput was very high! Especially considering that it is only on Linux hosts and not Windows.


I would say: that is probably exactly why! :D
____________
PC are like air conditioning, they becomes useless when you open Windows (L.T)

In a world without walls and fences, who needs windows and gates?

Erich56
Joined: 1 Jan 15 | Posts: 1091 | Credit: 6,854,782,676 | RAC: 17,245,093
Message 61040 - Posted: 21 Jan 2024 | 8:59:42 UTC - in response to Message 61037.

... Especially considering that it is only on Linux hosts and not Windows. We will be sending some similar batches over the next few weeks.

Is there a plan to come up with a Windows version too?

Philip Nicholson
Joined: 23 Feb 22 | Posts: 1 | Credit: 518,814,968 | RAC: 203,660
Message 61041 - Posted: 21 Jan 2024 | 19:01:49 UTC

Still no work for Windows 11 operating systems?
I see the occasional task that failed, but nothing processed.
It worked well for months and then just stopped before Christmas.
All my software is up to date.
I have a dedicated GPU for this project.
Where is the best place to find an update on GPUGRID's software migration?

Tasks completed: 134
Tasks failed: 55
Credit (user): 491,814,968 total, 13,657.85 average
Credit (host): 150,562,500 total, 13,650.92 average
Scheduling priority: -0.93
Don't request tasks for CPU: project has no apps for CPU
NVIDIA GPU task request deferred for: 00:03:35
NVIDIA GPU task request deferral interval: 00:10:00
Last scheduler reply: 2024-01-21 1:55:15 PM

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61042 - Posted: 21 Jan 2024 | 21:53:29 UTC

Most of the work released lately has been the Quantum Chemistry tasks. The researcher said that since most educational and research labs run Linux, Windows applications are a secondary concern.

The only tasks with a Windows app that have appeared somewhat regularly are the ACEMD tasks.

You will have to try and snag one of those when they show up.

Erich56
Joined: 1 Jan 15 | Posts: 1091 | Credit: 6,854,782,676 | RAC: 17,245,093
Message 61043 - Posted: 22 Jan 2024 | 7:52:56 UTC - in response to Message 61042.
Last modified: 22 Jan 2024 | 7:56:36 UTC

The researcher said that since most educational and research labs run Linux OS', that Windows applications are a second thought.

It's really too bad that GPUGRID obviously tends more and more to exclude Windows crunchers :-(
When I joined this project 8 years ago, and for many years thereafter, there was no lack of Windows tasks.
On the other hand, with the few tasks available since last year, it might be that the number of Linux crunchers is sufficient to process them, and the Windows crunchers from before are not needed any longer :-(
At least, that is the impression one is bound to get.

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61044 - Posted: 22 Jan 2024 | 17:27:39 UTC

The lack of current Windows applications has more to do with the type of applications and APIs currently being used.

The latest sub-projects are all Python-based. Python runs much better on Linux than on Windows, since most development is done on Linux to begin with.

Even Microsoft advises that Python application development should be done on Linux rather than Windows.

Erich56
Joined: 1 Jan 15 | Posts: 1091 | Credit: 6,854,782,676 | RAC: 17,245,093
Message 61045 - Posted: 23 Jan 2024 | 8:06:09 UTC

So - in short - bad times for Windows crunchers. Now and in the future :-(

Keith Myers
Joined: 13 Dec 17 | Posts: 1289 | Credit: 5,219,281,959 | RAC: 10,592,914
Message 61046 - Posted: 23 Jan 2024 | 9:05:47 UTC - in response to Message 61045.

So - in short - bad times for Windows crunchers. Now and in the future :-(

Pretty much so.

Windows had it best back with the original release of the acemd app. Remember, it was a simple, single executable file of modest size, derived from source code that could be compiled for Windows or for Linux.

But if you were paying attention lately, the recent acemd tasks no longer use a single executable; they are using Python.

The Python-based tasks are NOT a single executable; they comprise a complete packaged Python environment of many gigabytes.

The nature of the project's tasks has changed to complex, state-of-the-art discovery calculations using cutting-edge technology.

The QChem tasks are even using the Tensor cores of our NVIDIA cards now. This is something we asked about several years ago in the forum and were told: maybe, in the future.

The future has come and our wishes have been answered.

But the hardware and software of our hosts now have to rise to meet those challenges. Sadly, the Windows environment is still waiting in the wings.

ServicEnginIC
Joined: 24 Sep 10 | Posts: 566 | Credit: 6,321,177,024 | RAC: 17,362,450
Message 61082 - Posted: 25 Jan 2024 | 18:16:11 UTC - in response to Message 61028.

...But including <fraction_done_exact/> seems to heal that fairly quickly.

Nice advice, thank you!
It quickly resulted in an accurate remaining-time estimate, so I applied it to ATMbeta tasks as well.

[BAT] Svennemans
Joined: 27 May 21 | Posts: 50 | Credit: 289,422,017 | RAC: 2,616,469
Message 61084 - Posted: 25 Jan 2024 | 19:24:22 UTC - in response to Message 61046.

Choosing not to release Windows apps is a choice they are free to make, obviously. And maybe their use cases warrant the tradeoff inherent in that.

If there are often large volumes of work to process in a short time (i.e. you would ideally need something like a supercomputer if it didn't cost so much), then you would want to design your apps for what BOINC was intended for all along, meaning you try to get them ported to as many platforms as you possibly can in order to reach maximum compute power. Or you leverage the power of VirtualBox for non-native platforms.

If, however, the volumes are never going to be that large, so that basically any single-platform user group can easily provide the necessary compute power, then indeed why bother.

Although it would be nice of them to make that choice public and explicit, so all non-Linux users can gracefully detach instead of posting frustrated "why no work" messages around the forums.
Or indeed spend hours trying to help fix Windows apps ;-)

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61096 - Posted: 26 Jan 2024 | 15:51:20 UTC - in response to Message 61029.

I believe that the issue of "0.991" CPUs or whatever is a byproduct of the BOINC serverside software. from what I've read elsewhere, this value is not intentionally set by the researchers, it is automatically selected by the BOINC server somewhere along the way, and the researchers here have previously commented that they are not aware of any way to override this serverside.

I didn't know that. It's probably sloppy BOINC design, like using a percentage to determine the number of CPU threads to use instead of an integer.

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61097 - Posted: 26 Jan 2024 | 15:56:14 UTC - in response to Message 61034.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a pacal card, so it meets the compute capability requirement. But it has only 2gb of VRAM. But without knowing the amount of VRAM required, I am not sure if it will work.

The highest being used today on my Pascal cards is 795 MB.

Ian&Steve C.
Joined: 21 Feb 20 | Posts: 1036 | Credit: 39,372,107,483 | RAC: 162,570,079
Message 61098 - Posted: 26 Jan 2024 | 16:00:09 UTC - in response to Message 61097.

The work-units require a lot of GPU memory.


How much is "a lot" exactly? I have a pacal card, so it meets the compute capability requirement. But it has only 2gb of VRAM. But without knowing the amount of VRAM required, I am not sure if it will work.

The highest being used today on my Pascal cards is 795 MB.


You might want to watch that on a longer time scale; the VRAM use is not static, it fluctuates up and down.
____________

Aurum
Joined: 12 Jul 17 | Posts: 399 | Credit: 13,238,627,382 | RAC: 14,211,365
Message 61099 - Posted: 26 Jan 2024 | 16:18:23 UTC
Last modified: 26 Jan 2024 | 16:32:42 UTC

Retraction: I was monitoring with BoincTasks Js 2.4.2.2 and it has bugs.
I loaded nvitop and the task does use 2 GB VRAM at 100% GPU utilization.

BTW, if anyone wants to try nvitop, here are my notes for installing it on Ubuntu 22.04:
sudo apt update
sudo apt upgrade -y
sudo apt install python3-pip -y
python3 -m pip install --user pipx
python3 -m pip install --user --upgrade pipx
python3 -m pipx ensurepath
# if requested: sudo apt install python3.8-venv -y
# for Linux Mint 21.3: sudo apt install python3.10-venv -y
# open a new terminal, then:
pip3 install --upgrade nvitop
pipx run nvitop --colorful -m full

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61100 - Posted: 26 Jan 2024 | 16:26:34 UTC - in response to Message 61099.

I'm not seeing any different behavior on my titan Vs. the VRAM use still exceeds 3GB at times. but it's spikey. you have to watch it for a few mins. instantaneous measurements might not catch it.
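
if anyone wants to log the peak themselves, here's a minimal polling sketch (assuming the nvidia-ml-py package, imported as pynvml, is installed; this is not part of the GPUGRID app):

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust on multi-GPU hosts
peak = 0
try:
    while True:
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used  # bytes currently allocated on the card
        peak = max(peak, used)
        print(f"used {used / 2**20:7.0f} MiB   peak {peak / 2**20:7.0f} MiB", end="\r")
        time.sleep(0.5)  # sample twice per second to catch short spikes
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
    print(f"\npeak VRAM observed: {peak / 2**20:.0f} MiB")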
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61101 - Posted: 26 Jan 2024 | 17:04:26 UTC - in response to Message 61100.

I am seeing spikes to ~7.6 GB with these. Not long lasting (in the context of the whole work unit) but consistently elevated during that part of the work unit. I want to say that I saw that spike at about 5% complete and then at 95% complete, but that also could be somewhat coincidental versus factual.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61102 - Posted: 26 Jan 2024 | 17:11:14 UTC - in response to Message 61101.
Last modified: 26 Jan 2024 | 17:14:12 UTC

I am seeing spikes to ~7.6 GB with these. Not long lasting (in the context of the whole work unit) but consistently elevated during that part of the work unit. I want to say that I saw that spike at about 5% complete and then at 95% complete, but that also could be somewhat coincidental versus factual.


to add on to this, for everyone's info.

these tasks (and a lot of CUDA applications in general) do not require any set absolute value of VRAM. VRAM will scale to the GPU individually. generally, the more SMs you have, the more VRAM will be used. it's not linear, but there is some portion of the allocated VRAM that scales directly with how many SMs are being used.

to put it simply, different GPUs with different core counts, will have different amounts of VRAM utilization.

so even if a powerful GPU like an RTX 4090 with 100+ SMs on the die might need 7+GB, that doesn't mean that something much smaller like a GTX 1070 needs that much. it needs to be evaluated on a case-by-case basis.
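
if you want to see what your own card reports, here's a quick query sketch using CuPy (which the app environment appears to bundle, judging by the stderr paths in this thread; the attribute key is taken from CuPy's device-attribute dictionary):

import cupy

dev = cupy.cuda.Device(0)  # device 0 as seen by this process
with dev:
    sm_count = dev.attributes["MultiProcessorCount"]   # number of SMs on the die
    free_b, total_b = cupy.cuda.runtime.memGetInfo()   # free/total VRAM in bytes
print(f"SMs: {sm_count}  VRAM: {free_b / 2**30:.1f} GiB free of {total_b / 2**30:.1f} GiB")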
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61103 - Posted: 26 Jan 2024 | 17:20:59 UTC - in response to Message 61102.

I am seeing spikes to ~7.6 GB with these. Not long lasting (in the context of the whole work unit) but consistently elevated during that part of the work unit. I want to say that I saw that spike at about 5% complete and then at 95% complete, but that also could be somewhat coincidental versus factual.


to add on to this, for everyone's info.

these tasks (and a lot of CUDA applications in general) do not require any set absolute value of VRAM. VRAM will scale to the GPU individually. generally, the more SMs you have, the more VRAM will be used. it's not linear, but there is some portion of the allocated VRAM that scales directly with how many SMs are being used.

to put it simply, different GPUs with different core counts, will have different amounts of VRAM utilization.

so even if a powerful GPU like an RTX 4090 with 100+ SMs on the die might need 7+GB, that doesn't mean that something much smaller like a GTX 1070 needs that much. it needs to be evaluated on a case-by-case basis.



Thanks for this! I did not know about the scaling and I don't think this is something I ever thought about (the correlation between SMs and VRAM usage).

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,872,824,643
RAC: 15,619,873
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61108 - Posted: 29 Jan 2024 | 13:55:27 UTC

Why do I always get a segmentation fault
on Windows / WSL2 / Ubuntu 22.04.3 LTS?
12 processors, 28 GB memory, 16GB swap, GPU RTX 4070 Ti Super with 16 GB, driver version 551.23

https://www.gpugrid.net/result.php?resultid=33759912
https://www.gpugrid.net/result.php?resultid=33758940
https://www.gpugrid.net/result.php?resultid=33759139
https://www.gpugrid.net/result.php?resultid=33759328

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61109 - Posted: 29 Jan 2024 | 14:01:46 UTC - in response to Message 61108.

Why do I always get a segmentation fault
on Windows / WSL2 / Ubuntu 22.04.3 LTS?
12 processors, 28 GB memory, 16GB swap, GPU RTX 4070 Ti Super with 16 GB, driver version 551.23

https://www.gpugrid.net/result.php?resultid=33759912
https://www.gpugrid.net/result.php?resultid=33758940
https://www.gpugrid.net/result.php?resultid=33759139
https://www.gpugrid.net/result.php?resultid=33759328


something wrong with your environment or drivers likely.

try running a native Linux OS install, WSL might not be well supported
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61117 - Posted: 30 Jan 2024 | 12:54:49 UTC - in response to Message 61109.
Last modified: 30 Jan 2024 | 13:15:58 UTC

Steve,

these TEST units you have out right now seem to be using a ton of reserved memory. one process right now is using 30+GB, which seems much higher than usual, and i even have another one reserving 64GB of memory. that's way too high.


____________

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 7,093,713,894
RAC: 40,189,547
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61118 - Posted: 30 Jan 2024 | 13:17:30 UTC
Last modified: 30 Jan 2024 | 13:19:20 UTC

Here's one that died on my Ubuntu system which has 32 GB RAM:
https://www.gpugrid.net/result.php?resultid=33764282

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61119 - Posted: 30 Jan 2024 | 14:33:26 UTC - in response to Message 61117.
Last modified: 30 Jan 2024 | 15:13:20 UTC

i see v3 being deployed now

the memory limiting you're trying isn't working. I'm seeing it spike to near 100%

i see you put export CUPY_GPU_MEMORY_LIMIT=50%

a quick google seems to indicate that you need to put the percentage in quotes, like this: export CUPY_GPU_MEMORY_LIMIT="50%". alternatively you can set a discrete memory amount (in bytes) as the limit, for example export CUPY_GPU_MEMORY_LIMIT="1073741824" to limit it to 1GB.
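
for reference, the same cap can also be set from inside Python on CuPy's default memory pool (a sketch, not the project's actual code; note the cap only applies to CuPy's own pool, so allocations made directly by other CUDA libraries aren't counted against it):

import cupy

pool = cupy.get_default_memory_pool()
# hard cap in bytes (1 GiB here)...
pool.set_limit(size=1 * 1024**3)
# ...or, alternatively, a fraction of the card's total VRAM (this second call overrides the first)
pool.set_limit(fraction=0.5)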

and the system memory use is still a little high, around 10GB each. EDIT - system memory use still climbed to ~30GB by the end
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61120 - Posted: 30 Jan 2024 | 16:01:04 UTC - in response to Message 61119.
Last modified: 30 Jan 2024 | 16:01:30 UTC

v4 report.

i see you attempted to add some additional VRAM limiting. but the task is still trying to allocate more VRAM, and instead of using more VRAM, the process gets killed for trying to allocate more than the limit.

https://gpugrid.net/result.php?resultid=33764464
https://gpugrid.net/result.php?resultid=33764469
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61121 - Posted: 30 Jan 2024 | 16:11:32 UTC

Yes, I was doing some testing to see how large a molecule we can compute properties for.

The previous batches have been for small molecules which all work very well.

The memory use scales very quickly with increased molecule size.
This test today had molecules 3 to 4 times the size of the previous batches. As you can see, I have not solved the memory limiting issue yet. It should be possible to limit instantaneous GPU memory use (at the cost of runtime performance and increased CPU memory use), but due to the different levels of CUDA libraries in play in this code it is rather complicated. I will work on this locally for now and resume sending out the batches that were working well tomorrow!
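
As an illustration of the kind of trade-off involved (a sketch only, not what the app necessarily does or will do): CuPy allocations can be routed through CUDA managed memory, which lets the driver spill to host RAM instead of failing outright, at a significant speed and CPU-memory cost.

import cupy

# back CuPy's pool with managed (unified) memory -- slower, and it consumes host RAM when oversubscribed
managed_pool = cupy.cuda.MemoryPool(cupy.cuda.malloc_managed)
cupy.cuda.set_allocator(managed_pool.malloc)

x = cupy.zeros((4096, 4096))  # subsequent allocations now come from the managed pool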

Thank you for the assistance and compute availability, it is much appreciated!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61122 - Posted: 30 Jan 2024 | 16:13:47 UTC - in response to Message 61121.

no problem! glad to see you were monitoring my feedback and making changes.

looking forward to another stable batch tomorrow :) should be similar to previous runs like yesterday right?
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61123 - Posted: 30 Jan 2024 | 16:18:55 UTC - in response to Message 61122.

Yes, it will be the same as yesterday but with roughly 10x the work units released.

Each workunit contains 100 small molecules.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61124 - Posted: 30 Jan 2024 | 16:19:50 UTC - in response to Message 61123.

looking forward to it :)

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61126 - Posted: 31 Jan 2024 | 12:38:25 UTC

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61127 - Posted: 31 Jan 2024 | 12:40:53 UTC - in response to Message 61124.
Last modified: 31 Jan 2024 | 13:08:03 UTC

Steve,

this new batch, right off the bat, is loading up the GPU VRAM nearly full again.

edit: that's for a v1 task, will check out the v2s
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61128 - Posted: 31 Jan 2024 | 13:12:40 UTC - in response to Message 61127.

OK. looks like the v2 tasks are back to normal. it was only that v1 task that was using lots of vram

____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61129 - Posted: 31 Jan 2024 | 13:19:52 UTC - in response to Message 61127.

Ok my previous post was incorrect.

It turns out the previous large batch was not a representative test set. It only contained very small molecules, which is why the GPU RAM usage was low. As per my previous post, these tasks use a lot of GPU memory. You can see more detail in this post: http://gpugrid.org/forum_thread.php?id=5428&nowrap=true#60945

The work units are now just 10 molecules. They vary in size from 10 to 20 atoms per molecule. All molecules in a WU are the same size. Test WUs (smallest and largest sized molecules) pass on my GTX 1080 (8GB) test machine without failing.

The CPU fallback part was left over from testing; it should have been removed but appears it was not.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61130 - Posted: 31 Jan 2024 | 13:33:51 UTC - in response to Message 61129.

Thanks for the info Steve.

In general, I don't have much problem with using a large amount of VRAM, if that's what you require for your science goals. Personally I just wish to have expectations set so that I can setup my hosts accordingly. If VRAM use is low, I can set my host to run multiple tasks at a time for better efficiency. if VRAM use is high, I'll need to cut it back to only 2 or 1 tasks per GPU, which hurts overall efficiency on my end and requires me to reconfigure some things, but it's fine if that's how they will be. I just prefer to know which way it will be so that I don't leave it in a bad configuration and cause errors.

the bigger problem for me (and maybe many others) was the batch yesterday with VERY high system memory use per task. when system ram filled up it would crash the system, which requires some more manual intervention to get it running again. anyone with multi-GPU would be at risk there. just something to consider.

for overall VRAM use, again you can require whatever you need for your science goals. but you might consider making sure you can at least keep them under 8GB. I'd say many people on GPUGRID these days have a GPU with at least 8GB, all of mine have 12GB, and fewer people have 16GB+. if you can keep them below 8GB I think you'll be able to maintain a large pool of users rather than dealing with the tasks running out of memory and having to be resent multiple times to land on a host with enough VRAM.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61131 - Posted: 31 Jan 2024 | 13:58:20 UTC - in response to Message 61126.

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU


I'm getting several of these also. this is a problem too. you can always tell when the task basically stalls with almost no progress.


____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61132 - Posted: 31 Jan 2024 | 14:15:17 UTC

My CPU fallback task has now completed and validated, in not much longer than is usual for tasks on that host. I assume it was a shortened test task, running on a slower device? I now have just completed what looks like a similar task, with similarly large jumps in progress %age, but much more quickly. Task 33765553

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61133 - Posted: 31 Jan 2024 | 14:44:43 UTC - in response to Message 61132.

This is still very much a beta app.

We will continue to explore different WU sizes and application settings (with better local testing on our internal hardware before sending them out).

This app is the first time it has been possible to run QM calculations on GPUs. The underlying software was primarily designed for the latest generation of professional cards, e.g. the A100s that are used in HPC centres. It is proving challenging for us to port the code to GPUGRID consumer hardware. We are also looking into how a Windows port can be done.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61134 - Posted: 31 Jan 2024 | 14:51:32 UTC - in response to Message 61133.

No problem Steve. I definitely understand the beta aspect of this and the need to test things. I’m just giving honest feedback from my POV. Sometimes it’s hard to tell if a radical change in behavior is intended or a sign of some problem or misconfiguration.

Maybe it’s not possible for all the various molecules you want to test, but I feel the size of the previous large batch last week was very appropriate: moderate VRAM use and consistent size/runtimes. Those worked well with the consumer hardware.

Oh if everyone had A100s with 40-80GB of VRAM life would be nice LOL.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61135 - Posted: 31 Jan 2024 | 16:33:38 UTC

I had an odd work unit come through (and just abandoned). I have not had any issues with these work units so thought I would mention this one specifically.

https://www.gpugrid.net/result.php?resultid=33764946

I think there was a memory error with it but I am not very skilled at reading the results. It hung at ~75% but I let it work for 5 hours (honestly, I just didn't notice that it was hung up...).

When looking at properties of the work unit:
Virtual memory: 56GB
Working set size: 3.59GB

I thought this was an odd one so thought I would post.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61136 - Posted: 31 Jan 2024 | 17:03:21 UTC - in response to Message 61135.

I had an odd work unit come through (and just abandoned). I have not had any issues with these work units so thought I would mention this one specifically.

https://www.gpugrid.net/result.php?resultid=33764946

I think there was a memory error with it but I am not very skilled at reading the results. It hung at ~75% but I let it work for 5 hours (honestly, I just didn't notice that it was hung up...).

When looking at properties of the work unit:
Virtual memory: 56GB
Working set size: 3.59GB

I thought this was an odd one so thought I would post.


Yeah you can see several out of memory errors. Are you running more than one at a time?

I’ve had many like this. And many that seem to just fall back to CPU without any reason and get stuck for a long time. I’ve been aborting them when I notice. But it is troublesome :(
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61137 - Posted: 31 Jan 2024 | 17:19:14 UTC - in response to Message 61136.



Yeah you can see several out of memory errors. Are you running more than one at a time?

I’ve had many like this. And many that seem to just fall back to CPU without any reason and get stuck for a long time. I’ve been aborting them when I notice. But it is troublesome :(



I have been running 2x for these (I can't get them to run 3x or 4x via the app_config file, but it doesn't look like there are any queued tasks waiting to start).

Good to know that others have seen this too! I have seen a MASSIVE reduction in time these tasks take today.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61138 - Posted: 31 Jan 2024 | 18:15:03 UTC

I’m now getting a 3rd type of error across all of my hosts.

“AssertionError”

https://www.gpugrid.net/result.php?resultid=33766654
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61139 - Posted: 31 Jan 2024 | 18:25:02 UTC - in response to Message 61138.

I've had a few of those too, mainly of the form

File "/hdd/boinc-client/slots/6/lib/python3.11/site-packages/gpu4pyscf/df/grad/rhf.py", line 163, in get_jk
assert k1-k0 <= block_size

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 349
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61140 - Posted: 1 Feb 2024 | 4:17:42 UTC

150,000 credits for a few hundred seconds? I'm in! ;)
https://www.gpugrid.net/result.php?resultid=33771102
https://www.gpugrid.net/result.php?resultid=33771333
https://www.gpugrid.net/result.php?resultid=33771431
https://www.gpugrid.net/result.php?resultid=33771446
https://www.gpugrid.net/result.php?resultid=33771539

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61141 - Posted: 1 Feb 2024 | 7:48:55 UTC - in response to Message 61131.

I have Task 33765246 running on a RTX 3060 Ti under Linux Mint 21.3

It's running incredibly slowly, and with zero GPU usage. I've found this in stderr.txt:

+ python compute_dft.py
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/hdd/boinc-client/slots/5/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Exception:
Fallback to CPU
Exception:
Fallback to CPU


I'm getting several of these also. this is a problem too. you can always tell when the task basically stalls with almost no progress.

I had only those on one of my machines.
Apparently it had lost sight of the GPU for crunching.
Rebooting brought back the Nvidia driver to the BOINC client.

Apart from this, I found out that I can't run these tasks alongside Private GFN Server's tasks on a 6 GB GPU. So I turned the PYSCFbeta tasks off for this machine, as I often have to wait for tasks to download from GPUGrid, and I don't want my GPUs to run idle.
____________
Greetings, Jens

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61145 - Posted: 1 Feb 2024 | 21:47:40 UTC

Did we encounter this one already?

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:21:03 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper (7.7.26016): starting
14:21:36 (24335): wrapper: running bin/python (bin/conda-unpack)
14:21:38 (24335): bin/python exited; CPU time 0.223114
14:21:38 (24335): wrapper: running bin/tar (xjvf input.tar.bz2)
14:21:39 (24335): bin/tar exited; CPU time 0.005282
14:21:39 (24335): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/4/bin
+++ dirname /var/lib/boinc-client/slots/4/bin
++ local full_path_env=/var/lib/boinc-client/slots/4
+++ basename /var/lib/boinc-client/slots/4
++ local env_name=4
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ CONDA_PREFIX=/var/lib/boinc-client/slots/4
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(4) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/4/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/4/etc/conda/activate.d ']'
+ export PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/4:/var/lib/boinc-client/slots/4/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/4/tmp
+ TMP=/var/lib/boinc-client/slots/4/tmp
+ mkdir -p /var/lib/boinc-client/slots/4/tmp
+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ export CUDA_VISIBLE_DEVICES=0
+ CUDA_VISIBLE_DEVICES=0
+ export CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ CUPY_CUDA_LIB_PATH=/var/lib/boinc-client/slots/4/cupy
+ echo 'Running PySCF'
+ python compute_dft.py
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Traceback (most recent call last):
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 253, in _jitify_prep
name, options, headers, include_names = jitify.jitify(source, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/cuda/jitify.pyx", line 63, in cupy.cuda.jitify.jitify
File "cupy/cuda/jitify.pyx", line 88, in cupy.cuda.jitify.jitify
RuntimeError: Runtime compilation failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/var/lib/boinc-client/slots/4/compute_dft.py", line 125, in <module>
e,f,dip,q = compute_gpu(mol)
^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/compute_dft.py", line 32, in compute_gpu
e_dft = mf.kernel() # compute total energy
^^^^^^^^^^^
File "<string>", line 2, in kernel
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 586, in scf
_kernel(self, self.conv_tol, self.conv_tol_grad,
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 393, in _kernel
mf.init_workflow(dm0=dm)
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 63, in init_workflow
rks.initialize_grids(mf, mf.mol, dm0)
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 80, in initialize_grids
ks.grids = prune_small_rho_grids_(ks, ks.mol, dm, ks.grids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/gpu4pyscf/dft/rks.py", line 49, in prune_small_rho_grids_
logger.debug(grids, 'Drop grids %d', grids.weights.size - cupy.count_nonzero(idx))
^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/_sorting/count.py", line 24, in count_nonzero
return _count_nonzero(a, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/_core/_reduction.pyx", line 608, in cupy._core._reduction._SimpleReductionKernel.__call__
File "cupy/_core/_reduction.pyx", line 364, in cupy._core._reduction._AbstractReductionKernel._call
File "cupy/_core/_cub_reduction.pyx", line 701, in cupy._core._cub_reduction._try_to_call_cub_reduction
File "cupy/_core/_cub_reduction.pyx", line 538, in cupy._core._cub_reduction._launch_cub
File "cupy/_core/_cub_reduction.pyx", line 473, in cupy._core._cub_reduction._cub_two_pass_launch
File "cupy/_util.pyx", line 64, in cupy._util.memoize.decorator.ret
File "cupy/_core/_cub_reduction.pyx", line 246, in cupy._core._cub_reduction._SimpleCubReductionKernel_get_cached_function
File "cupy/_core/_cub_reduction.pyx", line 231, in cupy._core._cub_reduction._create_cub_reduction_function
File "cupy/_core/core.pyx", line 2251, in cupy._core.core.compile_with_cache
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 496, in _compile_module_with_cache
return _compile_with_cache_cuda(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 574, in _compile_with_cache_cuda
ptx, mapping = compile_using_nvrtc(
^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 322, in compile_using_nvrtc
return _compile(source, options, cu_path,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 287, in _compile
options, headers, include_names = _jitify_prep(
^^^^^^^^^^^^^
File "/var/lib/boinc-client/slots/4/lib/python3.11/site-packages/cupy/cuda/compiler.py", line 260, in _jitify_prep
raise JitifyException(str(cex))
cupy.cuda.compiler.JitifyException: Runtime compilation failed
14:23:34 (24335): bin/bash exited; CPU time 14.043607
14:23:34 (24335): app exit status: 0x1
14:23:34 (24335): called boinc_finish(195)

</stderr_txt>
]]>

____________
Greetings, Jens

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61146 - Posted: 1 Feb 2024 | 21:54:29 UTC - in response to Message 61145.

that looks like a driver issue.

but something else I noticed is that these tasks for the most part are having a very high failure rate. 30-50% on most hosts.

there are a few hosts that have few or no errors however, and all of them are hosts with 24-48GB of VRAM. so it seems something like 30-50% of the tasks require more than 12-16GB.

I'm sure the project has a very large error percentage to sort through, as there aren't enough 24-48GB GPUs to catch all the resends.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 349
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61147 - Posted: 1 Feb 2024 | 22:22:48 UTC
Last modified: 1 Feb 2024 | 22:26:22 UTC

The present batch has a far worse failure ratio than the previous one.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61149 - Posted: 2 Feb 2024 | 2:21:12 UTC - in response to Message 61146.
Last modified: 2 Feb 2024 | 2:21:54 UTC

that looks like a driver issue.

but something else I noticed is that these tasks for the most part are having a very high failure rate. 30-50% on most hosts.

there are a few hosts that have few or no errors however, and all of them are hosts with 24-48GB of VRAM. so it seems something like 30-50% of the tasks require more than 12-16GB.

I'm sure the project has a very large error percentage to sort through, as there aren't enough 24-48GB GPUs to catch all the resends.


This is 100% correct.

Our system with 2x RTX A6000 (48GB of VRAM) has had 500 valid results and no errors. They are running tasks at 2x and they seem to run really well (https://www.gpugrid.net/results.php?hostid=616410).

In one of our systems with 3x RTX A4500 GPUs (20GB), as soon as I changed from running these tasks at 2x to 1x, the error rate greatly improved (https://www.gpugrid.net/results.php?hostid=616409). Since making the change I have had 14 tasks in a row without errors.

When I am back in the classroom I think I will be changing anything equal to, or less than, 24GB to only run one task in order to improve the valid rate.

Has anyone tried running MPS with these tasks, and would it make a difference in the allocation of resources to successfully run 2x? Just curious about thoughts.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61150 - Posted: 2 Feb 2024 | 2:57:47 UTC

Last week, I had a 100% success rate. This week, it's a different story. Maybe it's time to step back and dial it down a bit. You have to work with the resources that you have, not the ones that you wish you had.


Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61151 - Posted: 2 Feb 2024 | 4:07:19 UTC - in response to Message 61149.

Boca,

How much VRAM do you see actually being used on some of these tasks? Mind watching a few? You’ll have to run a watch command to see continuous output of VRAM utilization since the usage isn’t constant. It spikes up and down. I’m just curious how much is actually needed. Most of the tasks I was running I would see spike up to about 8GB. But i assume the tasks that needed more just failed instead so I can’t know how much they are trying to use. Even though these Titan Vs are great DP performers they only have 12GB VRAM. Even most of the 16GB cards like V100 and P100 are seeing very high error rates.

MPS helps. But not enough with this current batch. I was getting good throughput with running 3x tasks at once on the batches last week.
____________

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61152 - Posted: 2 Feb 2024 | 6:43:20 UTC - in response to Message 61146.

that looks like a driver issue.

That's what Pascal (?) wrote in the Q&A as well.

Had three tasks on that host, and two of them failed.
____________
Greetings, Jens

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61153 - Posted: 2 Feb 2024 | 8:33:53 UTC

Not everyone has a 5000-euro graphics card with 24 GB of VRAM or more; you should think of the more modest among us.
I have an RTX 4060 and a GTX 1650, but I get nothing but errors, for example.
I think most of the people who compute for GPUGRID and are eagerly waiting for work for their GPUs are like me.

I keep thinking the problem is my system installation, so I reformat and do a clean install again, hoping it will work properly. In vain, because the problem comes from your faulty work units.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 349
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61154 - Posted: 2 Feb 2024 | 10:54:03 UTC
Last modified: 2 Feb 2024 | 11:32:19 UTC

I've disabled getting new GPUGrid tasks on my hosts with a "small" amount (below 24GB) of GPU memory.
This gigantic memory requirement is ridiculous in my opinion.
This is not a user error; if the workunits can't be changed, then the project should not send these tasks to hosts that have less than ~20GB of GPU memory.
There could be another solution, if the workunits allocated memory in a less careless way.
I've started a task on my RTX 4090 (it has 24GiB RAM), and I've monitored the memory usage:

idle: 305 MiB
task starting: 895 MiB
GPU usage rises: 6115 MiB
GPU usage drops: 7105 MiB
GPU usage 100%: 7205 MiB
GPU usage drops: 8495 MiB
GPU usage rises: 9961 MiB
GPU usage drops: 14327 MiB (it would have failed on my GTX 1080 Ti at this point)
GPU usage rises: 6323 MiB
GPU usage drops: 15945 MiB
GPU usage 100%: 6205 MiB
...and so on
So the memory usage doubles at some points of processing for a short while, and this causes the workunits to fail on GPUs that have a "small" amount of memory. If this behaviour could be eliminated, many more hosts could process these workunits.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61155 - Posted: 2 Feb 2024 | 11:59:29 UTC - in response to Message 61154.

Nothing to be done at this time for my currently working GPUs with PYSCFbeta tasks:
5x GTX 1650 4GB, 1x GTX 1650 SUPER 4GB, 1x GTX 1660 Ti 6GB.
100% errors with the current PYSCFbeta tasks, and now I can see why...
I've disabled Quantum chemistry on GPU (beta) in my project preferences while waiting for a correction, if any.
Conversely, they are performing fine with ATMbeta tasks.

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 7,093,713,894
RAC: 40,189,547
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61156 - Posted: 2 Feb 2024 | 12:09:55 UTC

I agree it does seem these tasks have a spike in memory usage. I "rented" an RTX A5000 GPU which also has 24 GB memory, and running 1 task at a time, at least the first task completed:
https://www.gpugrid.net/workunit.php?wuid=27678500
I will try a few more

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61157 - Posted: 2 Feb 2024 | 12:16:07 UTC - in response to Message 61155.
Last modified: 2 Feb 2024 | 12:17:30 UTC


I've disabled Quantum chemistry on GPU (beta) at my project preferences in the wait for a correction, if any.
Conversely, they are performing right with ATMbeta tasks.

Exactly the same here. After 29 consecutive errors on a RTX4070Ti, I have disabled 'Quantum chemistry on GPU (beta)'.

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61158 - Posted: 2 Feb 2024 | 12:25:43 UTC

I have one machine still taking on GPUGrid tasks.
The others are using their GPUs for the Tour de Primes over at PrimeGrid only.
If there really is a driver issue (see earlier post and answers) with this machine I'd like to know which, as its GPU is running fine on other BOINC projects apart from SRBase. Not being able to run SRBase is related to libc, not the GPU driver.
____________
Greetings, Jens

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61159 - Posted: 2 Feb 2024 | 12:37:34 UTC
Last modified: 2 Feb 2024 | 12:38:16 UTC

Hello,
is there a way to simulate VRAM for the GPU using RAM or an SSD under Linux?
That would avoid the compute errors.
I increased the swap file to 50 GB, as under Windows, but it does not work.
Thanks
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61160 - Posted: 2 Feb 2024 | 13:36:47 UTC - in response to Message 61151.
Last modified: 2 Feb 2024 | 13:42:31 UTC

Boca,

How much VRAM do you see actually being used on some of these tasks? Mind watching a few? You’ll have to run a watch command to see continuous output of VRAM utilization since the usage isn’t constant. It spikes up and down. I’m just curious how much is actually needed. Most of the tasks I was running I would see spike up to about 8GB. But i assume the tasks that needed more just failed instead so I can’t know how much they are trying to use. Even though these Titan Vs are great DP performers they only have 12GB VRAM. Even most of the 16GB cards like V100 and P100 are seeing very high error rates.

MPS helps. But not enough with this current batch. I was getting good throughput with running 3x tasks at once on the batches last week.


This was wild...

For a single work unit:

Hovers around 3-4GB
Rises to 8-9GB
Spikes to ~11GB regularly.

Highest Spike (seen): 12.5GB
Highest spike (estimated based on Psensor): ~20GB. Additionally, Psensor caught a peak memory usage spike of 76% of the 48GB of the RTX A6000 for one work unit, but I did not see when this happened or whether it happened at all.

I graphically captured the VRAM usage for one work unit. I have no idea how to embed images here, so here is a Google Doc:

https://docs.google.com/document/d/1xpOpNJ93finciJQW7U07dMHOycSVlbYq9G6h0Xg7GtA/edit?usp=sharing

EDIT: I think they just purged these work units from the server?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61161 - Posted: 2 Feb 2024 | 14:02:10 UTC - in response to Message 61160.
Last modified: 2 Feb 2024 | 14:06:34 UTC

thanks. that's kind of what I expected was happening.

and yeah, they must have seen the problems and just abandoned the remainder of this run to reassess how to tweak them.

it seemed like they tweaked the input files to give the assertion error instead of just hanging like the earlier ones (index numbers below ~1000). the early tasks would hang with the fallback-to-CPU issue, and after that it changed to the assertion error if it ran out of vram. that was better behavior for the user since a quick failure is better than hanging for hours on end doing nothing. but they were probably getting back a majority of errors as the VRAM requirements grew beyond what most people have for available hardware.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61162 - Posted: 2 Feb 2024 | 15:30:46 UTC

New batch just come through- seeing the same VRAM spikes and patterns.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61163 - Posted: 2 Feb 2024 | 15:32:14 UTC - in response to Message 61162.
Last modified: 2 Feb 2024 | 15:39:40 UTC

I'm seeing the same spikes, but so far so good. biggest spike i saw was ~9GB

no errors ...yet.

spoke too soon. did get one failure

https://gpugrid.net/result.php?resultid=33801391
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61164 - Posted: 2 Feb 2024 | 15:37:15 UTC - in response to Message 61163.

Hi. I have been tweaking settings. All WUs I have tried now work on my 1080 (8GB).


Sending a new batch of smaller WUs out now. From our end we will need to see how to assign WUs based on GPU memory. (Previous apps have been compute-bound rather than GPU-memory-bound and have only been assigned based on driver version.)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61165 - Posted: 2 Feb 2024 | 16:07:04 UTC - in response to Message 61164.

seeing some errors on Titan V (12GB). not a huge amount, but certainly a noteworthy amount. maybe you can correlate these specific WUs and see why this kind (number of atoms or molecules?) might be requesting more VRAM than the ones you tried on your 1080.

most of the ones i've observed running will hover around ~3-4GB constant VRAM use, with spikes to the 8-11GB range.

https://gpugrid.net/result.php?resultid=33802055
https://gpugrid.net/result.php?resultid=33801492
https://gpugrid.net/result.php?resultid=33801447
https://gpugrid.net/result.php?resultid=33801391
https://gpugrid.net/result.php?resultid=33801238
____________

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61166 - Posted: 2 Feb 2024 | 16:08:36 UTC

Still seeing a vram spike above 8GB

2024/02/02 08:07:08.774, 71, 100 %, 40 %, 8997 MiB
2024/02/02 08:07:09.774, 71, 100 %, 34 %, 8999 MiB
2024/02/02 08:07:10.775, 71, 22 %, 1 %, 8989 MiB
2024/02/02 08:07:11.775, 70, 96 %, 2 %, 10209 MiB
2024/02/02 08:07:12.775, 71, 98 %, 7 %, 10721 MiB
2024/02/02 08:07:13.775, 71, 93 %, 8 %, 5023 MiB
2024/02/02 08:07:14.775, 72, 96 %, 24 %, 5019 MiB
2024/02/02 08:07:15.776, 72, 100 %, 0 %, 5019 MiB
2024/02/02 08:07:16.776, 72, 100 %, 0 %, 5019 MiB

Seems like credit has gone down from 150K to 15K.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61167 - Posted: 2 Feb 2024 | 16:20:20 UTC - in response to Message 61166.

Agreed- it seems that there are fewer spikes and most of them are in the 8-9GB range. A few higher but it seems less frequent? Difficult to quantify an actual difference since the work units can be so different. Is there a difference in VRAM usage or does the actual work unit just happen to need less VRAM?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61168 - Posted: 2 Feb 2024 | 16:40:21 UTC

Seems like credit has gone down from 150K to 15K.
____________

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61169 - Posted: 2 Feb 2024 | 17:33:47 UTC
Last modified: 2 Feb 2024 | 17:34:29 UTC

Occasionally an 8GB VRAM card is not sufficient. Still seeing errors on these cards.

Example: two of the hosts below have 8GB of VRAM, while the one that returned successfully has 16GB.
http://gpugrid.net/workunit.php?wuid=27683202

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61171 - Posted: 2 Feb 2024 | 17:55:00 UTC - in response to Message 61169.

Even that 16GB GPU had one failure with the new v3 batch

http://gpugrid.net/result.php?resultid=33802340
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61172 - Posted: 2 Feb 2024 | 18:47:46 UTC - in response to Message 61171.

Even that 16GB GPU had one failure with the new v3 batch

http://gpugrid.net/result.php?resultid=33802340



Based on the times of tasks, it looks like those were running at 1x?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61173 - Posted: 2 Feb 2024 | 18:52:03 UTC
Last modified: 2 Feb 2024 | 18:55:03 UTC

Good evening. On my end it works well now.
I just finished 5 work units without problems with my GTX 1650 and my RTX 4060.
Let's hope this continues.
I reformatted my PC today and reinstalled Linux Mint 21.3 once again.

https://www.gpugrid.net/results.php?userid=563937
____________

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61174 - Posted: 2 Feb 2024 | 19:00:05 UTC - in response to Message 61168.

14 tasks of the latest batch completed successfully without any error.
Great progress!

Seems like credit has gone down from 150K to 15K.

Perhaps 150k was a little too generous. But 15k is not on par with other GPU projects. I expect there will be fairer credits again soon - with the next batch?

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61175 - Posted: 2 Feb 2024 | 19:04:15 UTC - in response to Message 61174.

14 tasks of the latest batch completed successfully without any error.
Great progress!

Seems like credit has gone down from 150K to 15K.

Perhaps 150k was a little too generous. But 15k is not on par with other GPU projects. I expect there will be fairer credits again soon - with the next batch?


Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61176 - Posted: 2 Feb 2024 | 19:13:14 UTC - in response to Message 61175.

sometimes more than 12GB as about 4% (16 out of 372) of my tasks failed all on GPUs with 12GB, all running at 1x only for the v3 batch. not sure how much VRAM is needed to be 100% successful. I did have one success that was a resend of one of your errors from a 4090 24GB. so i'm guessing you were running that one at 2x and got unlucky with two big tasks at the same time.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61177 - Posted: 2 Feb 2024 | 19:30:11 UTC - in response to Message 61176.

sometimes more than 12GB as about 4% (16 out of 372) of my tasks failed all on GPUs with 12GB, all running at 1x only for the v3 batch. not sure how much VRAM is needed to be 100% successful. I did have one success that was a resend of one of your errors from a 4090 24GB. so i'm guessing you were running that one at 2x and got unlucky with two big tasks at the same time.


Correct- I was playing around with the two 4090 systems running these to make some comparisons. And you are also correct- it seems that even with 24GB, running 2x is still not really ideal. Those random, huge spikes seem to find each other when running 2x.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61178 - Posted: 2 Feb 2024 | 19:40:54 UTC - in response to Message 61175.

Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

The GPU is an MSI 4070 Ti GAMING X SLIM with 12GB GDDR6X, run at 1x. Obviously sufficient for the latest batch to run flawlessly.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61179 - Posted: 2 Feb 2024 | 19:43:15 UTC - in response to Message 61174.

14 tasks of the latest batch completed successfully without any error.
Great progress!

Seems like credit has gone down from 150K to 15K.

Perhaps 150k was a little too generous. But 15k is not on par with other GPU projects. I expect there will be fairer credits again soon - with the next batch?


For someone with a 3080 Ti card, it would be better to run ATMbeta tasks first and then Quantum chemistry (if the former has no available tasks), if granted credit is an important factor.

For me, I have a 3080 Ti and a P100, so I will likely run ATMbeta on the 3080 Ti and Quantum chemistry on the P100, if both tasks are available.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61180 - Posted: 2 Feb 2024 | 19:49:01 UTC - in response to Message 61178.

Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

The GPU is an MSI 4070 Ti GAMING X SLIM with 12GB GDDR6X, run at 1x. Obviously sufficient for the latest batch to run flawlessly.



Thanks for the info. If you don't mind me asking- how many ran (in a row) without any errors?

CallMeFoxie
Send message
Joined: 6 Jan 21
Posts: 2
Credit: 24,835,750
RAC: 2,865
Level
Pro
Scientific publications
wat
Message 61181 - Posted: 2 Feb 2024 | 22:28:30 UTC

I have got a rig with 9 pieces of P106, which are slightly modified GTX1060 6GB used for Ethereum mining back in the day. I can run only two GPUgrid tasks at once (main CPU is only a dual core Celeron) but so far I have had one error and several tasks finish and validate. Hoping for good results for the rest!

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,724,820,193
RAC: 13,910,294
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61182 - Posted: 3 Feb 2024 | 0:00:22 UTC - in response to Message 61180.
Last modified: 3 Feb 2024 | 0:02:16 UTC

Are you running them at 1x and with how much VRAM? Trying to get a feel for what the actual "cutoff" is for these tasks right now. I am still feeling 24GB VRAM is needed for the success running 1x and double that for 2x.

The GPU is an MSI 4070 Ti GAMING X SLIM with 12GB GDDR6X, run at 1x. Obviously sufficient for the latest batch to run flawlessly.



Thanks for the info. If you don't mind me asking- how many ran (in a row) without any errors?

14 consecutive tasks without any error.

CallMeFoxie
Send message
Joined: 6 Jan 21
Posts: 2
Credit: 24,835,750
RAC: 2,865
Level
Pro
Scientific publications
wat
Message 61183 - Posted: 3 Feb 2024 | 11:04:43 UTC - in response to Message 61181.

I have got a rig with 9 pieces of P106, which are slightly modified GTX1060 6GB used for Ethereum mining back in the day. I can run only two GPUgrid tasks at once (main CPU is only a dual core Celeron) but so far I have had one error and several tasks finish and validate. Hoping for good results for the rest!


So I managed to get 11 tasks, of which 9 passed and validated and 2 failed some time into the process.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61184 - Posted: 3 Feb 2024 | 13:59:03 UTC - in response to Message 61164.

...From our end we will need to see how to assign WU's based on GPU memory. (Previous apps have been compute bound rather than GPU memory bound and have only been assigned based on driver version)

Perhaps (I don't know whether it is viable) a better solution would be to include some code that limits peak VRAM according to the device actually assigned.
The reason, illustrated with an example:
my host #482132 is shown by BOINC as [2] NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
This is true for Device 0: NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
But the other device in this host, Device 1, would have to be reported as NVIDIA NVIDIA GeForce GTX 1650 SUPER (3895MB) driver: 550.40.
Tasks sent according to Device 0's VRAM (6 GB) would likely run out of memory when they land on Device 1 (4 GB VRAM).

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61185 - Posted: 3 Feb 2024 | 14:24:40 UTC - in response to Message 61184.
Last modified: 3 Feb 2024 | 14:25:45 UTC

...From our end we will need to see how to assign WU's based on GPU memory. (Previous apps have been compute bound rather than GPU memory bound and have only been assigned based on driver version)

Perhaps (I don't know whether it is viable) a better solution would be to include some code that limits peak VRAM according to the device actually assigned.
The reason, illustrated with an example:
my host #482132 is shown by BOINC as [2] NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
This is true for Device 0: NVIDIA NVIDIA GeForce GTX 1660 Ti (5928MB) driver: 550.40.
But the other device in this host, Device 1, would have to be reported as NVIDIA NVIDIA GeForce GTX 1650 SUPER (3895MB) driver: 550.40.
Tasks sent according to Device 0's VRAM (6 GB) would likely run out of memory when they land on Device 1 (4 GB VRAM).


The only caveat is that neither the application nor the project has any ability to select which GPUs you have or which GPU will run a task. In your example, if a task requiring more than 4 GB were sent, the project would have no idea that GPU 1 only has 4 GB. The project can only see the "first/best" GPU in the system, which is what your BOINC client reports, and the BOINC client is the one that decides which tasks go to which GPU; the science application is launched after the GPU selection has already been made. Similarly, BOINC has no mechanism for assigning tasks based on GPU VRAM use.

You will have to manage things yourself after observing behavior. If you notice that one GPU consistently has too little VRAM, you can exclude it from running the QChem app by adding an <exclude_gpu> statement to the cc_config.xml file, for example:

<cc_config>
  <options>
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <app>PYSCFbeta</app>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61186 - Posted: 3 Feb 2024 | 18:49:12 UTC - in response to Message 61185.
Last modified: 3 Feb 2024 | 18:53:24 UTC

you will have to manage things yourself after observing behavior.

Certainly.
Your advice is always very appreciated.
An update of the minimum requirements would be welcome when PYSCF tasks reach the production stage, as a help for excluding hosts/GPUs that cannot meet them.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61187 - Posted: 3 Feb 2024 | 21:04:18 UTC - in response to Message 61186.

you will have to manage things yourself after observing behavior.

Certainly.
Your advice is always very appreciated.
An update of the minimum requirements would be welcome when PYSCF tasks reach the production stage, as a help for excluding hosts/GPUs that cannot meet them.


I would imagine something like what WCG posted may be useful, showing system requirements such as memory, disk space, one-time download file size, etc.: https://www.worldcommunitygrid.org/help/topic.s?shortName=minimumreq.
Apart from WCG not running smoothly since the IBM migration, I notice that the WCG system requirements are outdated. I guess it takes effort to maintain such information and keep it up to date.

So far, this is my limited knowledge about the quantum chemistry tasks, as I'm still learning. Anyone is welcome to chime in on the system requirements.
1) The one-time download is about 2 GB. Be prepared to wait for hours if you have a very slow internet connection.
2) The more GPU VRAM the better. Cards with 24 GB or more seem to perform best.
3) GPUs with faster memory bandwidth and faster FP64 have the advantage of shorter run times; typically these are datacenter/server/workstation cards.

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,329,100,176
RAC: 4,926,985
Level
Met
Scientific publications
watwat
Message 61188 - Posted: 4 Feb 2024 | 8:18:18 UTC

Implementing a way to choose work with particular hardware demands through the project preferences would be nice as well.
After lots of problems with the ECM subproject claiming too much system memory, yoyo@home divided that subproject into smaller and bigger tasks, which can each be ticked (or left unticked) in the project preferences.
So my suggestion is to hand out work in 4, 6, 8, 12, 16 and 24 GB flavours which the user can choose from.
Since the machine's own system also claims GPU memory, it should naturally be considered to leave about half a gigabyte untouched by the GPUGRID tasks.
____________
Greetings, Jens

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61189 - Posted: 5 Feb 2024 | 11:04:22 UTC - in response to Message 61188.

OK, so it seems like things have improved with the latest settings.

I am keeping the WUs short (10 molecule configurations per WU) to minimize the effect of errors.

I am going to send out some batches of WUs to get through a large dataset we have.

I think this

After lots of problems with the ECM subproject claiming too much system memory, yoyo@home divided that subproject into smaller and bigger tasks, which can each be ticked (or left unticked) in the project preferences.
So my suggestion is to hand out work in 4, 6, 8, 12, 16 and 24 GB flavours which the user can choose from.

might be the most workable solution for the future once the current batch of work is done.

The memory use is mainly determined by the size of the molecule and the number of heavy elements, so before WUs are sent out we can make a rough estimate of the memory use. There is an element of randomness that comes from high memory use for specific physical configurations that are harder to converge; we cannot estimate this before sending, and it only shows up during the calculation.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61190 - Posted: 5 Feb 2024 | 11:35:20 UTC - in response to Message 61189.

Seems like credit has gone down from 150K to 15K?
____________

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 7,093,713,894
RAC: 40,189,547
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61191 - Posted: 5 Feb 2024 | 11:42:50 UTC - in response to Message 61190.

Seems like credit has gone down from 150K to 15K?

Yes, and the memory use this morning seems to require running 1 at a time on GPUs with less than 16 GB, which hurts performance even more.

Steve, what determines point value for a task?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61192 - Posted: 5 Feb 2024 | 12:03:37 UTC

For the moment it's going fairly well with the new work units: about one error in four.





Name inputs_v3_ace_pch_ms_gc_filt_af05_index_64000_to_64000-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND0521_0
Workunit 27684102
Created 5 Feb 2024 | 10:40:37 UTC
Sent 5 Feb 2024 | 10:47:37 UTC
Received 5 Feb 2024 | 10:49:50 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 617458
Report deadline 10 Feb 2024 | 10:47:37 UTC
Run time 45.93
CPU time 9.59
Validate state Invalid
Credit 0.00
Application version Quantum chemistry calculations on GPU v1.04 (cuda1121)
Stderr output

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
11:47:47 (5931): wrapper (7.7.26016): starting
11:48:16 (5931): wrapper (7.7.26016): starting
11:48:16 (5931): wrapper: running bin/python (bin/conda-unpack)
11:48:17 (5931): bin/python exited; CPU time 0.157053
11:48:17 (5931): wrapper: running bin/tar (xjvf input.tar.bz2)
11:48:18 (5931): bin/tar exited; CPU time 0.002953
11:48:18 (5931): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/home/pascal/slots/3/bin
+++ dirname /home/pascal/slots/3/bin
++ local full_path_env=/home/pascal/slots/3
+++ basename /home/pascal/slots/3
++ local env_name=3
++ '[' -n '' ']'
++ export CONDA_PREFIX=/home/pascal/slots/3
++ CONDA_PREFIX=/home/pascal/slots/3
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(3) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/home/pascal/slots/3/etc/conda/activate.d
++ '[' -d /home/pascal/slots/3/etc/conda/activate.d ']'
+ export PATH=/home/pascal/slots/3:/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/home/pascal/slots/3:/home/pascal/slots/3/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/home/pascal/slots/3/tmp
+ TMP=/home/pascal/slots/3/tmp
+ mkdir -p /home/pascal/slots/3/tmp
+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ export CUDA_VISIBLE_DEVICES=1
+ CUDA_VISIBLE_DEVICES=1
+ export CUPY_CUDA_LIB_PATH=/home/pascal/slots/3/cupy
+ CUPY_CUDA_LIB_PATH=/home/pascal/slots/3/cupy
+ echo 'Running PySCF'
+ python compute_dft.py
/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
nao = 570
/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
Traceback (most recent call last):
File "/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/lib/misc.py", line 1094, in __exit__
handler.result()
File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/home/pascal/slots/3/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 52, in build_df
rsh_df.build(omega=omega)
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df.py", line 102, in build
self._cderi = cholesky_eri_gpu(intopt, mol, auxmol, self.cd_low, omega=omega)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df.py", line 256, in cholesky_eri_gpu
if lj>1: ints_slices = cart2sph(ints_slices, axis=1, ang=lj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cupy_helper.py", line 333, in cart2sph
t_sph = contract('min,ip->mpn', t_cart, c2s, out=out)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py", line 177, in contract
return cupy.asarray(einsum(pattern, a, b), order='C')
^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 676, in einsum
arr_out, sub_out = reduced_binary_einsum(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 418, in reduced_binary_einsum
tmp1, shapes1 = _flatten_transpose(arr1, [bs1, cs1, ts1])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/lib/python3.11/site-packages/cupy/linalg/_einsum.py", line 298, in _flatten_transpose
a.transpose(transpose_axes).reshape(
File "cupy/_core/core.pyx", line 752, in cupy._core.core._ndarray_base.reshape
File "cupy/_core/_routines_manipulation.pyx", line 81, in cupy._core._routines_manipulation._ndarray_reshape
File "cupy/_core/_routines_manipulation.pyx", line 357, in cupy._core._routines_manipulation._reshape
File "cupy/_core/core.pyx", line 611, in cupy._core.core._ndarray_base.copy
File "cupy/_core/core.pyx", line 570, in cupy._core.core._ndarray_base.astype
File "cupy/_core/core.pyx", line 132, in cupy._core.core.ndarray.__new__
File "cupy/_core/core.pyx", line 220, in cupy._core.core._ndarray_base._init
File "cupy/cuda/memory.pyx", line 740, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1426, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1447, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1118, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1139, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 1346, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
File "cupy/cuda/memory.pyx", line 1358, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 595,413,504 bytes (allocated so far: 3,207,694,336 bytes, limit set to: 3,684,158,668 bytes).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/pascal/slots/3/compute_dft.py", line 121, in <module>
e,f,dip,q = compute_gpu(mol)
^^^^^^^^^^^^^^^^
File "/home/pascal/slots/3/compute_dft.py", line 24, in compute_gpu
e_dft = mf.kernel() # compute total energy
^^^^^^^^^^^
File "<string>", line 2, in kernel
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 586, in scf
_kernel(self, self.conv_tol, self.conv_tol_grad,
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/scf/hf.py", line 393, in _kernel
mf.init_workflow(dm0=dm)
File "/home/pascal/slots/3/lib/python3.11/site-packages/gpu4pyscf/df/df_jk.py", line 56, in init_workflow
with lib.call_in_background(build_df) as build:
File "/home/pascal/slots/3/lib/python3.11/site-packages/pyscf/lib/misc.py", line 1096, in __exit__
raise ThreadRuntimeError('Error on thread %s:\n%s' % (self, e))
pyscf.lib.misc.ThreadRuntimeError: Error on thread <pyscf.lib.misc.call_in_background object at 0x7fec06934850>:
Out of memory allocating 595,413,504 bytes (allocated so far: 3,207,694,336 bytes, limit set to: 3,684,158,668 bytes).
11:48:31 (5931): bin/bash exited; CPU time 11.139443
11:48:31 (5931): app exit status: 0x1
11:48:31 (5931): called boinc_finish(195)

</stderr_txt>
]]>

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61193 - Posted: 5 Feb 2024 | 12:32:37 UTC

I'm seeing about a 10% failure rate with 12 GB cards.
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61194 - Posted: 5 Feb 2024 | 12:55:51 UTC - in response to Message 61193.

Credits should now be at 75k for the rest of the batch. They are meant to be consistent with the other apps, based on comparisons of runtimes on our test machines, but this is complicated with this new memory-intensive app. I will investigate before sending the next batch.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61195 - Posted: 5 Feb 2024 | 15:09:41 UTC

There are some tasks that spike over 10 GB. It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround? The momentary spikes could well be higher than the 10 GB recorded.

2024/02/05 07:06:39.675, 88 %, 1328 MHz, 5147 MiB, 115.28 W, 65
2024/02/05 07:06:40.678, 96 %, 1278 MHz, 5147 MiB, 117.58 W, 65
2024/02/05 07:06:41.688, 100 %, 1328 MHz, 5177 MiB, 111.94 W, 65
2024/02/05 07:06:42.691, 100 %, 1328 MHz, 6647 MiB, 70.23 W, 64
2024/02/05 07:06:43.694, 30 %, 1328 MHz, 8475 MiB, 69.65 W, 64
2024/02/05 07:06:44.697, 100 %, 1328 MHz, 9015 MiB, 81.81 W, 64
2024/02/05 07:06:45.700, 100 %, 1328 MHz, 9007 MiB, 46.32 W, 63
2024/02/05 07:06:46.705, 98 %, 1278 MHz, 9941 MiB, 46.08 W, 63
2024/02/05 07:06:47.708, 99 %, 1328 MHz, 10251 MiB, 57.06 W, 63
2024/02/05 07:06:48.711, 97 %, 1088 MHz, 4553 MiB, 133.72 W, 65
2024/02/05 07:06:49.714, 95 %, 1075 MHz, 4553 MiB, 132.99 W, 65

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61196 - Posted: 5 Feb 2024 | 16:21:57 UTC - in response to Message 61195.

Got a biggie. This one is 14.6 GB. I'm running a 16 GB card, one task per GPU.

2024/02/05 08:20:03.043, 100 %, 1328 MHz, 9604 MiB, 107.19 W, 71
2024/02/05 08:20:04.046, 94 %, 1328 MHz, 11970 MiB, 97.69 W, 71
2024/02/05 08:20:05.049, 99 %, 1328 MHz, 12130 MiB, 123.24 W, 70
2024/02/05 08:20:06.052, 100 %, 1316 MHz, 12130 MiB, 122.21 W, 71
2024/02/05 08:20:07.055, 100 %, 1328 MHz, 12130 MiB, 121.26 W, 71
2024/02/05 08:20:08.058, 100 %, 1328 MHz, 12130 MiB, 118.64 W, 71
2024/02/05 08:20:09.061, 17 %, 1328 MHz, 12116 MiB, 56.48 W, 70
2024/02/05 08:20:10.064, 95 %, 1189 MHz, 14646 MiB, 73.99 W, 71
2024/02/05 08:20:11.071, 99 %, 1139 MHz, 14646 MiB, 194.84 W, 71
2024/02/05 08:20:12.078, 96 %, 1316 MHz, 14650 MiB, 65.82 W, 70
2024/02/05 08:20:13.081, 85 %, 1328 MHz, 8952 MiB, 84.32 W, 70
2024/02/05 08:20:14.084, 100 %, 1075 MHz, 8952 MiB, 130.53 W, 71

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61197 - Posted: 5 Feb 2024 | 16:35:36 UTC - in response to Message 61196.
Last modified: 5 Feb 2024 | 16:36:34 UTC

Yeah, I think you'll only ever see the spike if you actually have the VRAM for it. If you don't have enough, the task will error out before hitting it and you'll never see it.

I'm just going to deal with the errors; cost of doing business, lol. I have my system set to a 70% active thread percentage (ATP) through MPS.

QChem gpu_usage set to 0.55
ATMbeta gpu_usage set to 0.44

This way, when both types of task are available, a GPU will run either ATMbeta+ATMbeta or ATMbeta+QChem, but never 2x QChem on the same GPU. I do this because ATMbeta uses a really small amount of GPU VRAM and can soak up some of the spare compute cycles without hurting QChem's VRAM headroom much. When a GPU is running only a single QChem task it isn't using all the compute it could (only 70%), so it may be a little slower, but Titan Vs are fast enough anyway: most tasks finish in about 6 minutes, with some outliers around 18 minutes.
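
For anyone wanting to copy this setup: those per-app shares are set with gpu_usage entries in an app_config.xml in the GPUGRID project directory. A minimal sketch, assuming the app short names are PYSCFbeta and ATMbeta and a packaged-client data path; check the <app> names in your own client_state.xml and adjust the path, shares and cpu_usage to taste:

sudo tee /var/lib/boinc/projects/www.gpugrid.net/app_config.xml >/dev/null <<'EOF'
<app_config>
  <!-- 0.55 + 0.55 > 1, so two QChem tasks never share a GPU;
       0.44 + 0.55 and 0.44 + 0.44 both fit on one GPU -->
  <app>
    <name>PYSCFbeta</name>
    <gpu_versions>
      <gpu_usage>0.55</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>ATMbeta</name>
    <gpu_versions>
      <gpu_usage>0.44</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
EOF

Then restart the BOINC client, or use the manager's Options > Read config files, so it picks up the new app_config.xml.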
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61198 - Posted: 5 Feb 2024 | 16:37:51 UTC - in response to Message 61196.

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61199 - Posted: 5 Feb 2024 | 16:53:51 UTC - in response to Message 61197.



QChem gpu_usage set to 0.55
ATMbeta gpu_usage set to 0.44




We did this as well this morning for the 4090 GPUs, since they have 24 GB, but paired with E@H work. Too little VRAM to run QChem at 2x, but too much compute power left on the table running it at 1x.

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61200 - Posted: 5 Feb 2024 | 17:02:55 UTC - in response to Message 61198.
Last modified: 5 Feb 2024 | 17:20:10 UTC

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.


0 failures after 19 completed tasks on one P100 with 16 GB.

So far 14.6 GB is the highest I've seen with 1-second interval monitoring.

More than half of the tasks processed momentarily hit 8 GB or more. I didn't record any actual data, just watched nvidia-smi from time to time.

Edit: another task with more than 12 GB, but with an ominous 6666 MiB, lol
2024/02/05 09:17:58.869, 99 %, 1328 MHz, 10712 MiB, 131.69 W, 70
2024/02/05 09:17:59.872, 100 %, 1328 MHz, 10712 MiB, 101.87 W, 70
2024/02/05 09:18:00.877, 100 %, 1328 MHz, 10700 MiB, 50.15 W, 69
2024/02/05 09:18:01.880, 92 %, 1240 MHz, 11790 MiB, 54.34 W, 69
2024/02/05 09:18:02.883, 95 %, 1240 MHz, 12364 MiB, 53.20 W, 69
2024/02/05 09:18:03.886, 83 %, 1126 MHz, 6666 MiB, 137.77 W, 70
2024/02/05 09:18:04.889, 100 %, 1075 MHz, 6666 MiB, 130.53 W, 71
2024/02/05 09:18:05.892, 92 %, 1164 MHz, 6666 MiB, 129.84 W, 71
2024/02/05 09:18:06.902, 100 %, 1063 MHz, 6666 MiB, 129.82 W, 71

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61201 - Posted: 6 Feb 2024 | 2:51:01 UTC - in response to Message 61198.

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.


I've been running all day across my 18x Titan Vs. The effective error rate is right around 5%, so 5% of the tasks needed more than 12 GB, running only 1 task per GPU.

I rented an A100 40GB for the day. Running 3x on this GPU with MPS set to 40%, it's done about 300 tasks and only 1 task failed from out of memory. The highest spike I saw was 39 GB, but it usually stays around 20 GB utilized.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61202 - Posted: 6 Feb 2024 | 5:02:10 UTC - in response to Message 61201.

pututu, have you had any failed tasks? Ian&Steve C. reports ~10% failure rate with 12GB so I am curious about 16GB. I am guessing this is about the minimum for error-free (related to memory limitations) processing of the current work.


I've been running all day across my 18x Titan Vs. The effective error rate is right around 5%, so 5% of the tasks needed more than 12 GB, running only 1 task per GPU.

I rented an A100 40GB for the day. Running 3x on this GPU with MPS set to 40%, it's done about 300 tasks and only 1 task failed from out of memory. The highest spike I saw was 39 GB, but it usually stays around 20 GB utilized.



Wow, the A100 is powerful. I can't believe how fast it can chew through these (well, I can believe it, but it's still amazing). I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61203 - Posted: 6 Feb 2024 | 8:44:37 UTC

Well, I gave up; too many errors.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61204 - Posted: 6 Feb 2024 | 11:39:54 UTC - in response to Message 61202.

I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?


CUDA MPS has a setting called active thread percentage. It basically limits how many SMs of the GPU get used for each process. Without MPS, each process will call for all available SMs all the time, in separate contexts (MPS also shares a single context). I set that to 40%, so each task is only using 40% of the available SMs. With 3x running that’s slightly over provisioning the GPU, but it usually works well and runs faster than 3x without MPS. It also has the benefit of reducing VRAM use most of the time, but it doesn’t seem to limit these tasks much. The only caveat is that when you run low on work, the remaining one or two tasks won’t use all the GPU, instead using only the 40% and none of the rest of the idle GPU.
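
For reference, a rough sketch of how that cap can be set on Linux, in case anyone wants to experiment; the 40% value, device index and directories are just examples, and BOINC (and the science apps it launches) must see the same pipe directory in their environment so they attach to this daemon:

# Start the MPS control daemon for GPU 0 and cap each client process
# to roughly 40% of the SMs (values and paths are examples only).
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d
echo "set_default_active_thread_percentage 40" | nvidia-cuda-mps-control

# To stop the daemon later:
echo quit | nvidia-cuda-mps-control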
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,238,627,382
RAC: 14,211,365
Level
Trp
Scientific publications
watwatwat
Message 61205 - Posted: 6 Feb 2024 | 13:11:48 UTC - in response to Message 61195.

It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround?

Have you tried NVITOP?
https://github.com/XuehaiPan/nvitop

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61206 - Posted: 6 Feb 2024 | 17:00:06 UTC - in response to Message 61205.
Last modified: 6 Feb 2024 | 17:00:33 UTC

It seems nvidia-smi doesn't allow a logging interval shorter than 1 s. Does anyone have a workaround?

Have you tried NVITOP?
https://github.com/XuehaiPan/nvitop


No. A quick search seems to indicate that it uses the nvidia-smi command, so it likely has a similar limitation.

Anyway, after a day of running (more than 100 tasks) I didn't see any failures on the 16 GB card, so I'm good, at least for now.
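
On the logging interval: newer nvidia-smi builds do seem to accept a millisecond loop for queries, so a rough, untested-here sketch of higher-resolution VRAM logging would be something like the following (whether --loop-ms is available depends on your driver, and a spike shorter than the polling interval can still slip through):

# Log VRAM use roughly every 200 ms until Ctrl+C.
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu \
           --format=csv,noheader --loop-ms=200 | tee vram_log.csv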

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61207 - Posted: 7 Feb 2024 | 15:04:51 UTC - in response to Message 61204.

I am somewhat new to MPS and I understand the general concept, but what do you mean when you say it is set to 40%?


CUDA MPS has a setting called active thread percentage. It basically limits how many SMs of the GPU get used for each process. Without MPS, each process will call for all available SMs all the time, in separate contexts (MPS also shares a single context). I set that to 40%, so each task is only using 40% of the available SMs. With 3x running that’s slightly over provisioning the GPU, but it usually works well and runs faster than 3x without MPS. It also has the benefit of reducing VRAM use most of the time, but it doesn’t seem to limit these tasks much. The only caveat is that when you run low on work, the remaining one or two tasks won’t use all the GPU, instead using only the 40% and none of the rest of the idle GPU.



Thank you for the explanation!

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61208 - Posted: 7 Feb 2024 | 20:38:28 UTC

Good evening,
Are there Windows work units to calculate or do I have to go back to linux?
Thanks
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61210 - Posted: 7 Feb 2024 | 21:10:58 UTC - in response to Message 61208.

Good evening,
Are there Windows work units to calculate or do I have to go back to linux?
Thanks


Only Linux still.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,854,782,676
RAC: 17,245,093
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61212 - Posted: 8 Feb 2024 | 13:05:33 UTC - in response to Message 61210.

Good evening,
Are there Windows work units to calculate or do I have to go back to linux?
Thanks


Only Linux still.

:-( :-( :-(

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61213 - Posted: 8 Feb 2024 | 16:12:49 UTC

I've just switched back to Linux and it's up and running again. Bye bye Windows 10.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61214 - Posted: 8 Feb 2024 | 16:43:11 UTC

We have definitely noticed a sharp decrease in "errors" with these tasks. Steve (or anyone), can you offer some insight into the filenames? For example:


inputs_v3_ace_pch_ms_gc_filt_af05_index_263591_to_263591-SFARR_PYSCF_ace_pch_ms_gc_filt_af05_v4-0-1-RND5514_2

Are there two different references to version? I see a "_v3_" and then a "_v4-0-1".

Then, the app version: v1.04

I thought that "_v4-0-1" would equate to the app version, but it doesn't look like it does.

Thanks!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61215 - Posted: 8 Feb 2024 | 17:35:30 UTC - in response to Message 61214.
Last modified: 8 Feb 2024 | 17:41:34 UTC

The "0-1" notation in GPUGRID task names seems to indicate which segment you are on and how many total segments there are.

So here, 0 = which segment you are on
1 = how many segments there are in total
The segment index always seems to be zero-based.

We see/saw the same behavior with ATM, where you get tasks like 0-5, 1-5, 2-5, etc., stopping at 4-5; there was also a batch with ten segments, 0-10 through 9-10.

They likely have some kind of process on the server side which stitches the results together based on these (and other) numbers.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61216 - Posted: 8 Feb 2024 | 17:36:34 UTC - in response to Message 61214.

Looks like they transitioned from v3-0-1 on Feb 2 to a test result on Feb 3 and then started the v4-0-1 run on Feb 5

That was looking back through 360 validated tasks.

I had two errors on the v4-0-1 tasks right at their beginning; they have all validated since then.

All run on two 2080 Ti cards.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61217 - Posted: 9 Feb 2024 | 0:41:33 UTC - in response to Message 61109.

Why do I allways get segmentation fault
on Windows/wsl2/Ubuntu 22.04.3 LTS
12 processors, 28 GB memory, 16GB swap, GPU RTX 4070 Ti Super with 16 GB, driver version 551.23

https://www.gpugrid.net/result.php?resultid=33759912
https://www.gpugrid.net/result.php?resultid=33758940
https://www.gpugrid.net/result.php?resultid=33759139
https://www.gpugrid.net/result.php?resultid=33759328


something wrong with your environment or drivers likely.

try running a native Linux OS install, WSL might not be well supported



I'm getting the same issue running through WSL2: an immediate segmentation fault.
https://www.gpugrid.net/result.php?resultid=33853832
https://www.gpugrid.net/result.php?resultid=33853734

The environment & drivers should be OK, since the machine runs other projects' GPU tasks just fine! Unless GPUGRID has some specific prerequisites?

Working project tasks:
https://moowrap.net/result.php?resultid=201144661

Installing a native Linux OS is simply not an option for most regular users who don't have dedicated compute farms...

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61218 - Posted: 9 Feb 2024 | 1:58:41 UTC - in response to Message 61217.

Then I guess you'll just have to wait for the native Windows app. It seems apparent that something doesn't work with these tasks under WSL, so there is indeed some kind of problem or incompatibility related to WSL. The fact that some other app works isn't really relevant; a key difference is probably how these apps are distributed. Moo! Wrapper uses a compiled binary, while the QChem work is supplied as an entire Python environment designed for a native Linux install (it sets up a lot of things, such as environment variables, that might not be correct under WSL, for example). These tasks also use CuPy, which might not be well supported under WSL, or the way CuPy is being called might not be right for WSL. Either way, I don't think there's going to be a solution for WSL: switch to Linux, or wait for the Windows version.
____________

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61219 - Posted: 9 Feb 2024 | 9:04:14 UTC

hello
I noticed that you are losing users.
Not many, but the number of GPUGRID users is decreasing.
Maybe the hardware requirements are too high, and the credits are no longer what they used to be.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61220 - Posted: 9 Feb 2024 | 10:15:45 UTC - in response to Message 61219.

hello
I noticed that you are losing users.
Not many, but the number of GPUGRID users is decreasing.
Maybe the hardware requirements are too high, and the credits are no longer what they used to be.



That's hardly surprising given this stat:
https://www.boincstats.com/stats/45/host/breakdown/os/

2500+ Windows hosts
688 Linux hosts

Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to the research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working.
So it's only logical that people start leaving - certainly the set-it-and-forget-it crowd.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,854,782,676
RAC: 17,245,093
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61221 - Posted: 9 Feb 2024 | 13:35:02 UTC - in response to Message 61220.

Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to the research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working.

When I joined GPUGRID about 9 years ago, all subprojects were available for Linux and Windows as well.
At that time and even several years later, my hosts were working for GPUGRID almost 365 days/year.

Somehow, it makes me sad that I am less and less able to contribute to this valuable project.

Recently, someone here explained the reason: scientific projects are primarily done by Linux, not by Windows.
Why so, all of a sudden ???

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61222 - Posted: 9 Feb 2024 | 14:25:44 UTC - in response to Message 61218.

Then I guess you'll just have to wait for the native Windows app. It seems apparent that something doesn't work with these tasks under WSL, so there is indeed some kind of problem or incompatibility related to WSL. The fact that some other app works isn't really relevant; a key difference is probably how these apps are distributed. Moo! Wrapper uses a compiled binary, while the QChem work is supplied as an entire Python environment designed for a native Linux install (it sets up a lot of things, such as environment variables, that might not be correct under WSL, for example). These tasks also use CuPy, which might not be well supported under WSL, or the way CuPy is being called might not be right for WSL. Either way, I don't think there's going to be a solution for WSL: switch to Linux, or wait for the Windows version.


It could be that, yes. But it could also be a memory overflow.
I'm running a GTX 1080 Ti with 11 GB VRAM.
Running it from the command line with nvidia-smi logging, I see memory going up to 8 GB allocated and then a segmentation fault - which could be caused by a single allocation pushing it over the 11 GB limit?

monitoring output:
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk pviol tviol fb bar1 ccpm sbecc dbecc pci rxpci txpci
# Idx W C C % % % % % % MHz MHz % bool MB MB MB errs errs errs MB/s MB/s
0 15 30 - 2 8 0 0 - - 405 607 0 0 1915 2 - - - 0 0 0
0 17 30 - 2 8 0 0 - - 405 607 0 0 1915 2 - - - 0 0 0
0 74 33 - 2 1 0 0 - - 5005 1569 0 0 2179 2 - - - 0 0 0
0 133 39 - 77 5 0 0 - - 5005 1987 0 0 4797 2 - - - 0 0 0
0 167 49 - 63 16 0 0 - - 5005 1974 0 0 6393 2 - - - 0 0 0
0 119 54 - 74 4 0 0 - - 5005 1974 0 0 8329 2 - - - 0 0 0
0 87 47 - 0 0 0 0 - - 5508 1974 0 0 1915 2 - - - 0 0 0


commandline run output:

/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/gpu4pyscf/lib/cutensor.py:174: UserWarning: using cupy as the tensor contraction engine.
warnings.warn(f'using {contract_engine} as the tensor contraction engine.')
/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/pyscf/dft/libxc.py:771: UserWarning: Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, corresponding to the original definition by Stephens et al. (issue 1480) and the same as the B3LYP functional in Gaussian. To restore the VWN5 definition, you can put the setting "B3LYP_WITH_VWN5 = True" in pyscf_conf.py
warnings.warn('Since PySCF-2.3, B3LYP (and B3P86) are changed to the VWN-RPA variant, '
nao = 590
reading molecules in current dir
mol_130305284_conf_0.xyz
mol_130305284_conf_1.xyz
mol_130305284_conf_2.xyz
mol_130305284_conf_3.xyz
mol_130305284_conf_4.xyz
mol_130305284_conf_5.xyz
mol_130305284_conf_6.xyz
mol_130305284_conf_7.xyz
mol_130305284_conf_8.xyz
mol_130305284_conf_9.xyz
['mol_130305284_conf_0.xyz', 'mol_130305284_conf_1.xyz', 'mol_130305284_conf_2.xyz', 'mol_130305284_conf_3.xyz', 'mol_130305284_conf_4.xyz', 'mol_130305284_conf_5.xyz', 'mol_130305284_conf_6.xyz', 'mol_130305284_conf_7.xyz', 'mol_130305284_conf_8.xyz', 'mol_130305284_conf_9.xyz']
Computing energy and forces for molecule 1 of 10
charge = 0
Structure:
('I', [-9.750986802755719, 0.9391938839088357, 0.1768783652592898])
('C', [-5.895945508642993, 0.12453295160883758, 0.05083363275080016])
('C', [-4.856596140132209, -2.2109795657411224, -0.2513335745671532])
('C', [-2.2109795657411224, -2.0220069532846163, -0.24377467006889297])
('O', [-0.304245906054975, -3.7227604653931716, -0.46865207889213534])
('C', [1.8519316020737606, -2.3621576557063273, -0.3080253583041051])
('C', [4.440856392727896, -2.9668700155671472, -0.4006219384077931])
('C', [5.839253724906041, -0.8163616858121067, -0.1379500070932495])
('I', [9.769884064001369, -0.6368377039784259, -0.13889487015553204])
('S', [4.100705690306184, 1.9464179083020137, 0.22298768269867728])
('C', [1.3587130835622794, 0.22298768269867728, 0.02022006953284616])
('C', [-1.2925726692025024, 0.43463700864996424, 0.06254993472310354])
('S', [-3.7227604653931716, 2.5700275294084842, 0.3477096069199714])
('H', [-5.914842769888644, -3.9306303390953286, -0.46298290051844015])
('H', [5.19674684255392, -4.818801617640907, -0.640617156227556])


******** <class 'gpu4pyscf.df.df_jk.DFRKS'> ********
method = DFRKS
initial guess = minao
damping factor = 0
level_shift factor = 0
DIIS = <class 'gpu4pyscf.scf.diis.CDIIS'>
diis_start_cycle = 1
diis_space = 8
SCF conv_tol = 1e-09
SCF conv_tol_grad = None
SCF max_cycles = 50
direct_scf = False
chkfile to save SCF result = /var/lib/boinc/projects/www.gpugrid.net/bck/tmp/tmpd03fogee
max_memory 4000 MB (current use 345 MB)
XC library pyscf.dft.libxc version 6.2.2
unable to decode the reference due to https://github.com/NVIDIA/cuda-python/issues/29
XC functionals = wB97M-V
N. Mardirossian and M. Head-Gordon., J. Chem. Phys. 144, 214110 (2016)
radial grids:
Treutler-Ahlrichs [JCP 102, 346 (1995); DOI:10.1063/1.469408] (M4) radial grids

becke partition: Becke, JCP 88, 2547 (1988); DOI:10.1063/1.454033
pruning grids: <function nwchem_prune at 0x7f29529356c0>
grids dens level: 3
symmetrized grids: False
atomic radii adjust function: <function treutler_atomic_radii_adjust at 0x7f2952935580>
** Following is NLC and NLC Grids **
NLC functional = wB97M-V
radial grids:
Treutler-Ahlrichs [JCP 102, 346 (1995); DOI:10.1063/1.469408] (M4) radial grids

becke partition: Becke, JCP 88, 2547 (1988); DOI:10.1063/1.454033
pruning grids: <function nwchem_prune at 0x7f29529356c0>
grids dens level: 3
symmetrized grids: False
atomic radii adjust function: <function treutler_atomic_radii_adjust at 0x7f2952935580>
small_rho_cutoff = 1e-07
Set gradient conv threshold to 3.16228e-05
Initial guess from minao.
Default auxbasis def2-tzvpp-jkfit is used for H def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for C def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for S def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for O def2-tzvppd
Default auxbasis def2-tzvpp-jkfit is used for I def2-tzvppd
/var/lib/boinc/projects/www.gpugrid.net/bck/lib/python3.11/site-packages/pyscf/gto/mole.py:1280: UserWarning: Function mol.dumps drops attribute charge because it is not JSON-serializable
warnings.warn(msg)
tot grids = 225920
tot grids = 225920
segmentation fault

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61223 - Posted: 9 Feb 2024 | 17:11:05 UTC - in response to Message 61222.

First, it's well known at this point that these tasks require a lot of VRAM, so some failures are to be expected from that. The VRAM utilization is not constant, but spikes up and down. For the tasks running on my systems, loading up to 5-6 GB and staying around that amount is pretty normal, with intermittent spikes into the 9-12 GB+ range. Judging by the failure rates of different GPUs, I estimate that most tasks (>70%) need more than 8 GB, a small fraction (~5%) need more than 12 GB, and very few (<1%) need even more than 16 GB. A teammate of mine is running a couple of 2080 Tis (11 GB) and has had some failures but mostly success.

When tasks hit the memory limit they fail, but not with a segfault: you always get some kind of memory allocation error printed in the stderr. With an 11 GB GPU you should be seeing a majority of successes. Since your tasks all fail in the same way, with a segfault, that tells me it's not a memory allocation problem but something else. And now, with two people having the same problem and both using WSL, it's clear that WSL is the root of the problem. The tasks were not set up to run in that environment.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61224 - Posted: 9 Feb 2024 | 18:14:24 UTC - in response to Message 61221.

Yet Windows hosts are not getting any work, so they are not given the opportunity to contribute to the research or to beta testing, even if they're prepared to go the extra mile to get experimental apps working.

When I joined GPUGRID about 9 years ago, all subprojects were available for Linux and Windows as well.
At that time and even several years later, my hosts were working for GPUGRID almost 365 days/year.

Somehow, it makes me sad that I am less and less able to contribute to this valuable project.

Recently, someone here explained the reason: scientific projects are primarily done by Linux, not by Windows.
Why so, all of a sudden ???

I posed this question to Google and their AI engine came up with this response

"how long has most scientific research projects used linux compared to windows"

Linux is a popular choice for research companies because it offers flexibility, security, stability, and cost-effectiveness. Linux is also used in technical disciplines at universities and research centers because it's free and includes a large amount of free and open-source software.

en.wikipedia.org
List of Linux adopters - Wikipedia
Linux is often used in technical disciplines at universities and research centres. This is due to several factors, including that Linux is available free of charge and includes a large body of free/open-source software.

brainly.com
Why might a large research company use the Linux operating system?
Sep 20, 2022 — Overall, the Linux operating system provides research companies with flexibility, stability, security, and cost-effectiveness, making it a popular choice in the research community.
Linux is known for its reliability, security, and breadth of open source tools available. It's also known for its stability and reliability, and can run for months or even years without any issues.
Linux is an open-source operating system, whereas Microsoft is a commercial operating system. Linux users have access to the source code of the operating system and can make amendments as per their choices. Windows users don't have such privileges.
Linux is also used by biologists in various domains of research. In the field of biology, where data analysis, computational modeling, and scientific exploration are essential, Linux offers numerous advantages.


Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61225 - Posted: 9 Feb 2024 | 19:12:20 UTC

One thing is sure: there will not be enough users to compute everything. There are 50,462 tasks for 106 computers as I write these lines, and they are arriving faster than they are being processed. I think GPUGRID is heading straight into a wall if they do nothing.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61226 - Posted: 9 Feb 2024 | 19:46:36 UTC - in response to Message 61225.

We are processing about 12,000 tasks per day, so there's a little more than 4 days' worth of work right now, but the amount of available work is still climbing.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61227 - Posted: 9 Feb 2024 | 20:46:44 UTC - in response to Message 61224.



Somehow, it makes me sad that I am less and less able to contribute to this valuable project.

Recently, someone here explained the reason: scientific projects are primarily done by Linux, not by Windows.
Why so, all of a sudden ???

I posed this question to Google and their AI engine came up with this response

"how long has most scientific research projects used linux compared to windows"

Linux is a popular choice for research companies because it offers flexibility, security, stability, and cost-effectiveness. Linux is also used in technical disciplines at universities and research centers because it's free and includes a large amount of free and open-source software.

<truncated>




The choice of Linux as a research OS in an academic context is clear, but it really has no bearing on which platforms a BOINC project chooses to support.
BOINC as a platform was always a 'supercomputer for peanuts' proposition - you invest a fraction of what a real supercomputer costs but can get similar processing power, which is exactly what many low-budget academic research groups were looking for.
Part of that investment is the choice of which platforms to support, and it is primarily driven by the amount of processing power needed, with the match to your native development OS only a secondary consideration.

As I already said in my previous post, it all depends on what type of project you want to be:
1) You need all the power and/or turnaround you can get? Support all the platforms you can handle, with Windows your #1 priority, because that's where the majority of the FLOPS are.
2) You don't really need that much power, and your focus is more on developing/researching algorithms? Stay on your native OS.
3) You need some of both? Prioritize your native OS for your beta apps, but keep driving a steady stream of stable work to #1 Windows and #2 Linux to keep your supercomputer 'providers' engaged.
Because that's the last part of the 'small investment' needed for your FLOPS: keeping your users happy and engaged.

So I see no issue at all with new betas being on Linux first, but I am also concerned, or sad, that there has been only beta/Linux work lately, as opposed to the earlier days of GPUGRID.

Unless of course the decision is made to go full-on as a type 2) project?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61228 - Posted: 9 Feb 2024 | 21:14:00 UTC - in response to Message 61227.

There has been a bunch of ATM work intermittently, which does run on Windows. They had to fix the Windows and Linux versions of that application at different times, so there were periods when Linux worked and Windows didn't, and periods when Windows worked and Linux didn't. For the most recent batch I believe both applications were working. It is still classified as "beta" for both Linux and Windows.

The project admins/researchers have already mentioned a few times that a Windows app is in the pipeline, but it takes time. They obviously don't have a lot of expertise with Windows and are more comfortable in the Linux environment, so it makes sense that it will take more time and effort for them to get up to speed and get the Windows version working. They likely also need to sort out other parts of their workflow on the backend (work generation, task sizes, task configurations, batch sizes, etc.), and Linux users are the guinea pigs for that. They had many weeks of "false starts" with this QChem project where they generated a bunch of work, it caused errors, and they ended up cancelling the whole batch and trying again the following week. It's a lot easier for the researchers to iron out these problems with one version of the code rather than juggling two versions with different code changes to each, and then, once most issues are sorted, port it to Windows. I think they are still figuring out which configurations work best for them on the backend and for the hardware available on GPUGRID. Steve previously mentioned that he originally based things on high-end datacenter GPUs like the A100 with lots of VRAM, but changes were necessary to get the same results from our pool of users with much lower-end GPUs.

When the Windows app comes, I imagine it will still be "beta" in the BOINC sense, but it will be a more polished setup than what Linux started with.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61229 - Posted: 9 Feb 2024 | 21:14:03 UTC

The researcher earlier stated there were NO Windows computers in the lab.

Are you going to buy some for them or fund them?

How many of you have actually donated monetarily to the project?

MentalFS
Send message
Joined: 10 Feb 24
Posts: 1
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61230 - Posted: 10 Feb 2024 | 16:35:11 UTC - in response to Message 61223.
Last modified: 10 Feb 2024 | 16:35:50 UTC

I'm using Docker for Windows, which uses WSL2 as its backend, and I'm having the same problems, so that's another hint that WSL is the problem. Other projects that use my NVIDIA card work fine, though.

For now I've disabled "Quantum chemistry on GPU (beta)" and "If no work for selected applications is available, accept work from other applications" in my project settings to avoid this.

Currently there's no other work available for me but I'll keep an eye on the other tasks coming in.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61231 - Posted: 10 Feb 2024 | 20:29:30 UTC

There is an obvious solution, which no one has mentioned, for Windows users who wish to contribute to this project, and at the risk of starting a proverbial firestorm, I will mention it: you could install Linux on your machine(s). I did it last year and it has worked out fine for me.

I did the installation on a separate SSD, leaving the Windows disk intact. The default boot is Linux, with the option to boot into Windows when the need arises.

The process of installing Linux itself was not difficult. I did have an issue attaching my existing project accounts to BOINC, but some of the Linux users crunching here helped me solve it. Thank you again for the help.

It is an option you might want to consider.


[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61232 - Posted: 10 Feb 2024 | 21:25:45 UTC - in response to Message 61231.

Sure, that's an option, and no need to fear a firestorm - at least not from me; I've worked with Linux and other Unix flavors a lot over the years, both professionally and personally. And besides, I hate forum flame wars or any kind of tech-solution holy war. ;-)

The problem with that solution, for me and for many Windows users like me, is that it's an either/or solution: you boot either Linux or Windows.

I have a single computer that I need for both work and personal use, and it requires Windows because the software stack is Microsoft-based; not all of it has Linux alternatives that I have the time, patience or skills to explore.
I also run BOINC on that machine using 50% CPU + 100% GPU, 24/7.
When participating in Linux-only projects, I just spin up a VMware VM with 25% of the CPU and let that run in parallel - or, more recently, WSL.

I did just install Linux bare-metal on a partition of my data drive to confirm that WSL is the issue rather than the system, but for the reasons mentioned above I cannot let this run 24/7.

FYI - Ian&Steve, you're right. PYSCFbeta on bare-metal Linux runs just fine. So it must indeed be some incompatibility with WSL.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61233 - Posted: 10 Feb 2024 | 23:03:27 UTC - in response to Message 61232.

Why not run Linux as your primary OS, and then virtualize (or maybe even use WINE for) your Windows-only software?
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61234 - Posted: 11 Feb 2024 | 1:10:21 UTC - in response to Message 61233.
Last modified: 11 Feb 2024 | 1:11:01 UTC

Why not run Linux as your primary OS, and then virtualize (or maybe even use WINE for) your Windows-only software?


Because I need Windows all the time, whereas in the last 15 years this is the only time I couldn't get something to work through a virtual Linux. And BOINC is just a hobby, after all...
Would you switch your primary OS in such a case?

On another note - DCF is going crazy again. Average runtimes are consistently around 30 minutes, yet DCF keeps climbing - the estimated runtime of new WUs is now 76 days!

On a positive note: not a single failure yet!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61235 - Posted: 11 Feb 2024 | 1:17:14 UTC - in response to Message 61234.

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61236 - Posted: 11 Feb 2024 | 3:14:48 UTC - in response to Message 61235.

Ian, are you saying that even after you've set DCF to a low value in the client_state file that it is still escalating?

I set mine to 0.02 a month ago and it is still hanging around there now that I looked at the hosts here.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61237 - Posted: 11 Feb 2024 | 3:30:21 UTC - in response to Message 61236.

Ian, are you saying that even after you've set DCF to a low value in the client_state file that it is still escalating?

I set mine to 0.02 a month ago and it is still hanging around there now that I looked at the hosts here.


my DCF was set to about 0.01, and my tasks were estimating that they would take 27hrs each to complete.

i changed the DCF to 0.0001, and that changed the estimate to about 16mins each.

then after a short time i noticed that the time to completion estimate was going up again, reaching back to 27hrs again. i checked DCF and it's back to 0.01.
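
For what it's worth, that behaviour is consistent with how the client uses DCF: it is stored per project in client_state.xml as <duration_correction_factor>, the runtime estimate shown is roughly the raw project estimate multiplied by it, and the client raises DCF again quickly whenever a task overruns its estimate. A minimal sketch of the scaling (not the actual BOINC client code; the fpops numbers below are made-up values chosen only to reproduce the 16-minute vs 27-hour figures above):

# Hedged sketch: the estimated runtime scales linearly with DCF.
# rsc_fpops_est and projected_flops are assumed illustrative values,
# not numbers taken from any real work unit or host.
rsc_fpops_est = 9.6e18        # assumed <rsc_fpops_est> of a work unit
projected_flops = 1.0e12      # assumed speed BOINC projects for the GPU

def estimated_runtime_hours(dcf: float) -> float:
    return rsc_fpops_est / projected_flops * dcf / 3600.0

for dcf in (0.0001, 0.01):
    print(f"DCF={dcf:g}: ~{estimated_runtime_hours(dcf):.1f} h")
# DCF=0.0001: ~0.3 h (about 16 minutes); DCF=0.01: ~26.7 h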
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61238 - Posted: 11 Feb 2024 | 15:35:28 UTC - in response to Message 61223.
Last modified: 11 Feb 2024 | 15:46:18 UTC

First, it’s well known at this point that these tasks require a lot of VRAM. So some failures are to be expected from that. The VRAM utilization is not constant, but spikes up and down. From the tasks running on my systems, loading up to 5-6GB and staying around that amount is pretty normal, with intermittent spikes to the 9-12GB+ range occasionally. Just by looking at the failure rate of different GPUs, I’m estimating that most tasks need more than 8GB (>70%), a small amount of tasks need more than 12GB (~5%), and a very small number of them need even more than 16GB (<1%). A teammate of mine is running on a couple 2080Tis (11GB) and has had some failures but mostly success.

As you suggested in a previous post, VRAM utilization seems to depend on the particular model of graphics card / GPU.
GPUs with fewer CUDA cores available seem to use less VRAM.
My GTX 1650 GPUs have 896 CUDA cores and 4 GB VRAM.
My GTX 1650 SUPER GPU has 1280 CUDA cores and 4 GB VRAM.
My GTX 1660 Ti GPU has 1536 CUDA cores and 6 GB VRAM.
These cards are currently achieving an overall success rate of 44% on PYSCFbeta (676 valid versus 856 errored tasks at the time of writing).
Not all the errors were due to memory overflows; some were due to non-viable WUs or other reasons, but digging into this would take too much time...
Processing ATMbeta tasks, success was pretty close to 100%.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61239 - Posted: 11 Feb 2024 | 15:58:36 UTC - in response to Message 61235.
Last modified: 11 Feb 2024 | 16:04:45 UTC

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.

On a more germane note...

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488


And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61240 - Posted: 11 Feb 2024 | 16:23:18 UTC - in response to Message 61239.

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.

yeah i don't know what's wrong with DCF. mine goes crazy shortly after i fix it also. says my tasks will take like 27 days even though most are done in 5-10 mins.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.

On a more germane note...

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488


And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip

Seems to me that your 3080 is the 10G version instead of 12G?

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61241 - Posted: 11 Feb 2024 | 16:52:25 UTC - in response to Message 61239.

well i did switch all my computers to linux. even personal ones. the only windows system I have is my work provided laptop. but i could do everything i need on a linux laptop. WINE runs a lot of things these days.


I'm trying to suppress my grinning at this upside down world... having retired my last Windoze box some time ago.


I'm clearly in the presence of passionate Linux believers here... :-)


Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488



And for what it's worth best I can tell I'm getting a lower error % on my RTX3070 8GB cards once I backed off the sclk/mclk clocks.

Skip


It does refer to video memory, but the limit each WU sets possibly doesn't take into account other processes allocating video memory. That would especially be an issue I think if you run multiple WU's in parallel.
Try executing nvidia-smi to see which processes allocate how much video memory:


svennemans@PCSLLINUX01:~$ nvidia-smi
Sun Feb 11 17:29:48 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off | 00000000:01:00.0 On | N/A |
| 47% 71C P2 179W / 275W | 6449MiB / 11264MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1611 G /usr/lib/xorg/Xorg 534MiB |
| 0 N/A N/A 1801 G /usr/bin/gnome-shell 75MiB |
| 0 N/A N/A 9616 G boincmgr 2MiB |
| 0 N/A N/A 9665 G ...gnu/webkit2gtk-4.0/WebKitWebProcess 12MiB |
| 0 N/A N/A 27480 G ...38,262144 --variations-seed-version 125MiB |
| 0 N/A N/A 46332 G gnome-control-center 2MiB |
| 0 N/A N/A 47110 C python 5562MiB |
+---------------------------------------------------------------------------------------+


My one running WU has allocated 5.5G but with the other running processes, total allocated is 6.4G.
It would depend on the implementation whether the limit is calculated from the total CUDA memory or from the actually free CUDA memory, and whether that limit is set only once at the start or updated multiple times.
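
For reference, CuPy does allow an application to cap its own memory pool, which would explain an out-of-memory error being reported below the card's full capacity. A minimal sketch of how a "limit set to" value like the one above could arise, assuming the cap is derived from total rather than free VRAM - an assumption for illustration only, not the actual PYSCFbeta code:

# Hedged sketch: capping a CuPy memory pool. The 0.9 fraction and the use
# of *total* (rather than free) VRAM are assumptions for illustration.
import cupy

free_b, total_b = cupy.cuda.Device(0).mem_info           # free and total VRAM in bytes
pool = cupy.get_default_memory_pool()
pool.set_limit(size=int(total_b * 0.9))                  # cap this process's pool

print(f"free={free_b:,}  total={total_b:,}  limit={pool.get_limit():,}")
# Allocations that would push the pool past the cap raise
# cupy.cuda.memory.OutOfMemoryError, which reports "limit set to: <cap>"
# even though the GPU itself may still have unallocated VRAM.

If the limit really is computed once from total VRAM at startup, then Xorg, a browser, or a second WU eating into the free memory would push a task over the edge exactly as described.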

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61242 - Posted: 11 Feb 2024 | 17:38:59 UTC - in response to Message 61241.
Last modified: 11 Feb 2024 | 17:59:48 UTC

Good point about the other stuff on the card... right this minute it's taking a break from GPUGRID to do a Meerkat Burp7...

I usually have "watch -t -n 8 nvidia-smi" running on this box if I'm poking around. I'll capture a shot of it as soon as GPUGRID comes back up, if anything listed below changes significantly. I don't think it will.

While the 'cuda_1222' is running I see a total ~286MB of 'other stuff' if my 'ciphering' is right:


/usr/lib/xorg/Xorg 153MiB
cinnamon 18MiB
...gnu/webkit2gtk-4.0/WebKitWebProcess 12MiB Boincmgr
/usr/lib/firefox/firefox 103MiB because I'm reading/posting
...inary_x86_64-pc-linux-gnu__cuda1222 776MiB the only Compute task


Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61243 - Posted: 11 Feb 2024 | 19:14:30 UTC - in response to Message 61237.

I believe the lowest value that DCF can be in the client_state file is 0.01

Found that in the code someplace, sometime

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61244 - Posted: 12 Feb 2024 | 9:39:40 UTC

bonjour apparemment maintenant ça fonctionne sur mes 2 gpu-gtx 1650 et rtx 4060.
Je n'ai pas eu d'erreur de calcul.


hello, apparently it now works on my 2 GPUs, the GTX 1650 and RTX 4060.
I have not had any computation errors.
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61245 - Posted: 12 Feb 2024 | 11:02:45 UTC - in response to Message 61244.

Hello,

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".

Thank you for the feedback regarding the failure rate. As I mentioned, different WUs require different amounts of memory, which is hard to check before they start crunching. From my viewpoint the failure rates are low enough that all WUs seem to succeed within a few retries. This is still a "Beta" app.

We definitely want a Windows app and it is in the pipeline. However, as I mentioned before, the development of this is time-consuming. Several of the underlying code bases are Linux-only at the moment, so a Windows app requires a Windows port of some code.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61246 - Posted: 12 Feb 2024 | 12:25:22 UTC - in response to Message 61245.

Yes, I would not expect the app to work on WSL. There are many Linux-specific libraries in the packaged Python environment that is the "app".



Actually, it *should* work, since WSL2 is being sold as a native Linux kernel running in a virtual environment with full system call compatibility.
So one could reasonably expect any native linux libraries to work as expected.

However there are obviously still a few issues to iron out.

Not by gpugrid to be clear - by microsoft.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61252 - Posted: 13 Feb 2024 | 11:23:35 UTC

I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits seem to all run fine on a subsequent host.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61260 - Posted: 13 Feb 2024 | 14:13:19 UTC

bonjour,
quand les taches windows seront elles pretes pour essais?
franchement,Linux ,c'est pourri.
apres une mise a jour le lhc@home ne fonctionne plus.Je reste sous linux pour vous mais j'ai hate de repasser sous un bon vieux windows.
Merci


Good afternoon,
when will the Windows tasks be ready for testing?
Frankly, Linux is rotten.
After an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to switch back to good old Windows.
Thanks
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61261 - Posted: 13 Feb 2024 | 16:15:31 UTC - in response to Message 61260.

bonjour,
quand les taches windows seront elles pretes pour essais?
franchement,Linux ,c'est pourri.
apres une mise a jour le lhc@home ne fonctionne plus.Je reste sous linux pour vous mais j'ai hate de repasser sous un bon vieux windows.
Merci


Good afternoon,
when will the Windows tasks be ready for testing?
Frankly, Linux is rotten.
After an update, LHC@home no longer works. I'm staying on Linux for you, but I can't wait to switch back to good old Windows.
Thanks


Maybe try a different distribution. I have always used Windows (and still do on some systems) but use Linux Mint on others. It's really user-friendly and has a very similar feel to Windows.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,238,627,382
RAC: 14,211,365
Level
Trp
Scientific publications
watwatwat
Message 61269 - Posted: 14 Feb 2024 | 15:59:06 UTC - in response to Message 61239.
Last modified: 14 Feb 2024 | 16:00:55 UTC

Between this
CUDA Error of GINTint2e_jk_kernel: out of memory

https://www.gpugrid.net/result.php?resultid=33956113

and this...

Does the fact that the reported error (my most common error on RTX3080 12G cards) seems to say that it didn't really get to the imposed limit but still failed mean anything to anyone?

I am ASSUMING that this is referring to the memory on the 12G vid card?

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 535,127,040 bytes (allocated so far: 4,278,332,416 bytes, limit set to: 9,443,495,116 bytes).


https://www.gpugrid.net/result.php?resultid=33955488

Sometimes I get the same error on my 3080 10 GB Card. E.g., https://www.gpugrid.net/result.php?resultid=33960422
Headless computer with a single 3080 running 1C + 1N.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,238,627,382
RAC: 14,211,365
Level
Trp
Scientific publications
watwatwat
Message 61270 - Posted: 14 Feb 2024 | 16:04:55 UTC - in response to Message 61243.

I believe the lowest value that DCF can be in the client_state file is 0.01

Found that in the code someplace, sometime

Zoltan posted long ago that BOINC does not understand zero and 0.01 is as close as it can get. I wonder if that was someone's approach to fixing a division-by-zero problem in antiquity.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61272 - Posted: 14 Feb 2024 | 16:20:02 UTC - in response to Message 61242.
Last modified: 14 Feb 2024 | 16:21:50 UTC

...
Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.


After logging error rates for a few days across 5 boxes w/ Nvidia cards (all RTX30x0, all Linux Mint v2x.3), trying to be aware of what I was doing on the main desktop while 'python' was running, and making some sclk / mclk cutbacks, the avg error rate is dropping. The last cut shows it at 23.44% across the 5 boxes averaged over 28 hours.

No longer any segfault 0x8b errors, all 0x1. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61273 - Posted: 14 Feb 2024 | 16:29:14 UTC - in response to Message 61272.

...
Skip

PS: GPUGRID WUs are all 1x here.
PPS: Yes, it's the 10G version!
PPPS: Also my adhoc perception of error rates was wrong... working on that.


After logging error rates for a few days across 5 boxes w/ Nvidia cards (all RTX30x0, all Linux Mint v2x.3), trying to be aware of what I was doing on the main desktop while 'python' was running, and making some sclk / mclk cutbacks, the avg error rate is dropping. The last cut shows it at 23.44% across the 5 boxes averaged over 28 hours.

No longer any segfault 0x8b errors, all 0x1. The last one was on the most troublesome of the 3070 cards.

https://www.gpugrid.net/result.php?resultid=33950656

Anything I can do to help with this type of error?

Skip


its still an out of memory error. a little further up in the error log shows this:
"CUDA Error of GINTint2e_jk_kernel: out of memory"

so it's probably just running out of memory at a different stage of the task, producing a slightly different error, but still an issue with not enough memory.

____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61274 - Posted: 14 Feb 2024 | 16:44:55 UTC - in response to Message 61252.

I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits seem to all run fine on a subsequent host.


I didn't find any of these in the 10GB 3080 errors that occurred so far today. Will check the 3070 cards shortly.

Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61275 - Posted: 14 Feb 2024 | 16:53:42 UTC - in response to Message 61273.


its still an out of memory error. a little further up in the error log shows this:
"CUDA Error of GINTint2e_jk_kernel: out of memory"

so it's probably just running out of memory at a different stage of the task, producing a slightly different error, but still an issue with not enough memory.


Thanx... as I suspected and this is my most common error now.

Along with these, which I think are also memory-related, from a different point in the process... same situation w/o having reached the cap limit shown.

https://www.gpugrid.net/result.php?resultid=33962293

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61276 - Posted: 14 Feb 2024 | 16:58:25 UTC - in response to Message 61275.
Last modified: 14 Feb 2024 | 16:58:55 UTC

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.
____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61277 - Posted: 14 Feb 2024 | 17:01:02 UTC - in response to Message 61274.

I'm seeing a bunch of checksum errors during unzip, anyone else have this problem?

https://www.gpugrid.net/results.php?hostid=617834&offset=0&show_names=0&state=5&appid=

Stderr output
<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
11:26:18 (177385): wrapper (7.7.26016): starting
lib/libcufft.so.10.9.0.58 bad CRC e458474a (should be 0a867ac2)
boinc_unzip() error: 2

</stderr_txt>
]]>


The workunits seem to all run fine on a subsequent host.


I didn't find any of these in the 10GB 3080 errors that occurred so far today. Will check the 3070 cards shortly.

Skip


8GB 3070 card errors today checked were all:

CUDA Error of GINTint2e_jk_kernel: out of memory


Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61279 - Posted: 14 Feb 2024 | 17:22:50 UTC - in response to Message 61276.

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.


Thanx for info. As is right now the only cards I have w/ 16GB are my RX6800/6800xt cards.

https://ibb.co/hKZtR0q

Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61280 - Posted: 14 Feb 2024 | 17:29:26 UTC - in response to Message 61279.



Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip


a $600 12GB Titan V is like 4x faster though.

other projects are a consideration of course.
____________

pututu
Send message
Joined: 8 Oct 16
Posts: 14
Credit: 613,876,869
RAC: 11,505
Level
Lys
Scientific publications
watwatwatwat
Message 61281 - Posted: 14 Feb 2024 | 18:03:43 UTC - in response to Message 61280.

If this quantum chemistry project is going to last for more than a year, perhaps a $170 (via ebay) investment in a Tesla P100 16GB may be worth it? If you look at my gpugrid output via boincstat, I'm doing around 20M PPD over the past 4 days running on a single card with a power limit of 130W. I've processed more than 1000 tasks and I think I have had 2 failures with its 16GB memory.

The only drawback is that there aren't many projects that do benefit from high FP64 and/or memory bandwidth performance. Originally bought it for MilkyWay. However if you have extra cash, the Titan V is a great option for such projects.

The project admin can change the granted credit and/or the task run time, but as long as the high FP64 and memory bandwidth requirements remain unchanged, the P100 should perform relatively better than most consumer cards for such applications.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61282 - Posted: 14 Feb 2024 | 18:27:15 UTC - in response to Message 61270.

My DCF is set to 0.02

So that is not considered zero by BOINC apparently.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61283 - Posted: 14 Feb 2024 | 19:35:56 UTC - in response to Message 61280.



Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip


a $600 12GB Titan V is like 4x faster though.

other projects are a consideration of course.


Can you point me to someplace I can educate myself a bit on using Titan V cards for BOINC. I see some for $600 used on ebay. As u know there is no used market for used 'Super' cards yet. Did u mean 4x faster than a 4070 Super or than the 3070 I would replace with it?

Thanx, Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61284 - Posted: 14 Feb 2024 | 20:07:36 UTC - in response to Message 61283.



Guess I need to start a go-fund-me for some $600 12GB 4070 Super cards that I've been eyeing up ;-)

Skip


a $600 12GB Titan V is like 4x faster though.

other projects are a consideration of course.


Can you point me to someplace I can educate myself a bit on using Titan V cards for BOINC. I see some for $600 used on ebay. As u know there is no used market for used 'Super' cards yet. Did u mean 4x faster than a 4070 Super or than the 3070 I would replace with it?

Thanx, Skip



Ah, it's an FP64 thing. Any other projects doing heavy FP64 lifting since the demise of MW GPU WUs?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61285 - Posted: 14 Feb 2024 | 21:16:14 UTC - in response to Message 61284.

ATMbeta tasks here have some small element of FP64. (integration)

BRP7 tasks at Einstein also use FP64 a little bit.

Asteroids@home GPU apps are also primarily FP64, but they have a massive GPU memory bandwidth bottleneck that slows things down more than the FP64 does anyway, so you don't realize the benefit there. And the CPUs give better production per watt at Asteroids.

Not sure if any other projects use it.


____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61288 - Posted: 16 Feb 2024 | 5:30:39 UTC - in response to Message 61276.
Last modified: 16 Feb 2024 | 5:31:43 UTC

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.


Not sure why but...

Error rates seemed to start dropping after 5pm (23:00 Zulu) today. Overall error average since 2/11 across my 5 Nvid cards was 26.7% with it slowly creeping down over time. Early on a little bit of this was the result of lowering clocks to eliminate the occasional segfault (0x8b).

The average of the last two captures today across the 5 cards was 20.5%

For the last 6 hour period I just checked, my 10GB card average error rate dropped to 17.3% (15.92 & 18.7) and the 8GB card error rate was at 21.3%.

Skip

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61291 - Posted: 17 Feb 2024 | 9:38:28 UTC

les unites de calcul pour windows sont elles arrivées?


Have the work units for Windows arrived?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61292 - Posted: 17 Feb 2024 | 11:44:20 UTC - in response to Message 61291.

It isn't the tasks which need to be released, it's the application programs needed to run them.

You can read the list of applications at https://www.gpugrid.net/apps.php

The newest ones tend to be towards the bottom of the page - and no, there isn't one for 'Quantum chemistry calculations on GPU' yet.

Bookmark that page - there isn't a direct link to it on this site, although it's a standard feature of BOINC projects.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61293 - Posted: 17 Feb 2024 | 12:18:37 UTC

Looking at the Stderr output report of a given PYSCFbeta task, a line like this can be found:

.
+ CUDA_VISIBLE_DEVICES=N
.

Where "N" corresponds to the device number (GPU) that the task ran on.
This is very much appreciated on multi-GPU hosts when trying to identify reliable or unreliable devices.
It allows, if desired, excluding unreliable devices, following Ian&Steve C.'s kind advice.

A similar feature would be useful in other apps, such as ATMbeta.
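
For anyone curious, the effect of that stderr line can be reproduced in a few lines of Python. This is only a sketch of the general technique (pinning a process to the BOINC-assigned GPU via CUDA_VISIBLE_DEVICES); the command-line handling is an assumption for illustration, not the project's actual wrapper:

import os
import sys

# Assumed for illustration: the wrapper passes the BOINC device number
# as the first command-line argument.
boinc_device = sys.argv[1] if len(sys.argv) > 1 else "0"

# Must be set before any CUDA library (CuPy, PyTorch, ...) is initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = boinc_device

# From here on, the process only sees that physical GPU and addresses it
# internally as device 0 - which is why the stderr line identifies the card.
print(f"CUDA_VISIBLE_DEVICES={os.environ['CUDA_VISIBLE_DEVICES']}")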

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61295 - Posted: 17 Feb 2024 | 17:15:35 UTC - in response to Message 61288.
Last modified: 17 Feb 2024 | 17:15:50 UTC

between your systems and mine, looking at the error rates;

~23% of tasks need more than 8GB
~17% of tasks need more than 10GB
~4% of tasks need more than 12GB
<1% of tasks need more than 16GB

me personally, i wouldn't run these (as they are now) with less than 12GB VRAM.


Not sure why but...

Error rates seemed to start dropping after 5pm (23:00 Zulu) today. Overall error average since 2/11 across my 5 Nvid cards was 26.7% with it slowly creeping down over time. Early on a little bit of this was the result of lowering clocks to eliminate the occasional segfault (0x8b).

The average of the last two captures today across the 5 cards was 20.5%

For the last 6 hour period I just checked, my 10GB card average error rate dropped to 17.3% (15.92 & 18.7) and the 8GB card error rate was at 21.3%.

Skip


IGNORE... all went to crap the next day (today)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61296 - Posted: 17 Feb 2024 | 17:17:48 UTC - in response to Message 61295.
Last modified: 17 Feb 2024 | 17:18:03 UTC

yeah i've been seeing higher error rates on my 12GB cards too.

still very low on my 16GB cards though.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61305 - Posted: 20 Feb 2024 | 23:04:35 UTC

My preferences are set to receive work from all apps, including beta ones, but none of my 4 GB VRAM graphics cards has received PYSCFbeta tasks lately.
Coincidence, or scheduler-driven behavior?
In the meantime, they are performing ATMbeta tasks without a single processing error so far.
And unsent PYSCFbeta tasks seem to keep growing, 39K+ at this moment.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,887,311,851
RAC: 10,471,281
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61306 - Posted: 20 Feb 2024 | 23:23:36 UTC - in response to Message 61305.

My GPUs are all on the smaller-memory side, too. Since ATMbeta tasks became available again, I haven't picked up a single Quantum chemistry task.

I think it's either a cunning project plan, or (more likely) some subtle BOINC behaviour concerning our hosts' "reliability" rating on particular task types.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61307 - Posted: 21 Feb 2024 | 1:26:26 UTC - in response to Message 61306.

My GPUs are all on the smaller-memory side, too. Since ATMbeta tasks became available again, I haven't picked up a single Quantum chemistry task.

I think it's either a cunning project plan, or (more likely) some subtle BOINC behaviour concerning our hosts' "reliability" rating on particular task types.


it's because you have test tasks enabled. with that, it's giving preferential treatment for ATM tasks which are classified in the scheduler as beta/test.

QChem seems to not be classified in the scheduler as "test" or beta. despite being treated as such by the staff and the app name literally has the word beta in it. if you disable test tasks, and enable only QChem, you will get them still.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61308 - Posted: 21 Feb 2024 | 6:28:06 UTC - in response to Message 61307.

it's because you have test tasks enabled. with that, it's giving preferential treatment for ATM tasks which are classified in the scheduler as beta/test.

Thank you, that fully explains the fact.
In the dilemma of choosing between my 50%-erroring PYSCFbeta tasks and my 100%-succeeding ATMbeta tasks, I'll keep the latter.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61311 - Posted: 21 Feb 2024 | 13:09:41 UTC

bonjour,
j'aimerais calculer pour atmbeta avec ma gtx 1650 et pour quantum chemistry avec ma rtx 4060.
Je ne parviens pas a modifier le config.xml pour cela.
Je n'ai que des unités atmbeta a calculer et aucune unités quantum chemistry.
voici ce que j'ai mis dans le fichier config.xml de boinc.
Quelqu'un pourrait il m'aider.Merci d'avance.

Good afternoon,
I would like to compute for ATMbeta with my GTX 1650 and for Quantum chemistry with my RTX 4060.
I can't manage to modify the config.xml to do this.
I only get ATMbeta units to compute and no Quantum chemistry units.
Here is what I put in BOINC's config.xml file.
Could someone help me? Thanks in advance.


<cc_config>
<options>
<exclude_gpu>
<url>https://www.gpugrid.net/</url>
<device_num>0</device_num>
<type>NVIDIA</type>
<app>ATMbeta</app>
</exclude_gpu>
<exclude_gpu>
<url>https://www.gpugrid.net/</url>
<device_num>1</device_num>
<type>NVIDIA</type>
<app>PYSCFbeta</app>
</exclude_gpu>
<exclude_gpu>
<url>http://asteroidsathome.net/boinc/</url>
<device_num>0</device_num>
<type>NVIDIA</type>
</exclude_gpu>
<exclude_gpu>
<url>https://einstein.phys.uwm.edu/</url>
<device_num>0</device_num>
<type>NVIDIA</type>
</exclude_gpu>
<use_all_gpus>1</use_all_gpus>
<ncpus>-1</ncpus>
</options>
</cc_config>

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61312 - Posted: 21 Feb 2024 | 13:22:20 UTC - in response to Message 61311.

Scheduler requests from your host are not specific about what they're asking for. The host just asks for work for "Nvidia" and the scheduler on the project side decides what you need and what to send based on your preferences. The way the scheduler is set up right now, you won't be sent both types of work when both are available, only ATM.

You will need to move the GPUs to different hosts and set up the project preferences to be different for each of them, or run two clients on one host with one GPU attached to each,

or just stay with ATM on both cards.


____________

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61313 - Posted: 21 Feb 2024 | 13:32:31 UTC - in response to Message 61312.

OK, thanks.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61319 - Posted: 21 Feb 2024 | 22:59:26 UTC - in response to Message 61307.

QChem seems to not be classified in the scheduler as "test" or beta. despite being treated as such by the staff and the app name literally has the word beta in it. if you disable test tasks, and enable only QChem, you will get them still.

Adding a bit more variety to the current GPUGRID app spectrum: I happened to be watching the Server status page when a limited number (about 215) of "ATM: Free energy calculations of protein-ligand binding" tasks appeared, to be distinguished from the previously existing ATMbeta branch.
I managed to configure a venue on the GPUGRID preferences page to catch one of them before the unsent tasks vanished.
Task: tnks2_m5f_m5l_1_RE-QUICO_ATM_GAFF2_1fs-0-5-RND3367_1
To achieve this, I disabled getting test apps and enabled only the (somewhat paradoxically ;-) "ATM (beta)" app.
That task is currently running on my GTX 1660 Ti GPU, at an estimated rate of 9.72% per hour.

And quickly returning to the PYSCFbeta (QChem) topic: ready-to-send tasks for this app grew today to a noticeable 80K+.
After peaking, QChem unsent tasks are now decreasing again.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61320 - Posted: 22 Feb 2024 | 9:50:55 UTC

Bonjour
y a t il des unités de calcul pour windows disponible?

Hello
Are there work units available for Windows?
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,321,177,024
RAC: 17,362,450
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61321 - Posted: 22 Feb 2024 | 11:05:33 UTC - in response to Message 61320.
Last modified: 22 Feb 2024 | 11:28:25 UTC

Yes, ATM and ATMbeta apps have both Windows and Linux versions currently available.

Edit.
Regarding Quantum chemistry, there is still no Windows version.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,854,782,676
RAC: 17,245,093
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61322 - Posted: 22 Feb 2024 | 16:47:17 UTC - in response to Message 61321.

Regarding Quantum chemistry, there is still no Windows version.

:-( :-( :-(

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61353 - Posted: 2 Mar 2024 | 1:00:26 UTC

This one barely made it:

https://www.gpugrid.net/workunit.php?wuid=27943603


Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61360 - Posted: 3 Mar 2024 | 0:58:43 UTC

https://imgur.com/evCBB73

GPUGRID error rate across 2x 3070 8GB, 2x 3080 10GB & 1x 4070 Super 12GB (the early part is with 3x 3070 8GB, one of which was replaced by the 4070S on 2/20).

Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61363 - Posted: 4 Mar 2024 | 13:00:29 UTC - in response to Message 61360.
Last modified: 4 Mar 2024 | 13:04:08 UTC




Going the wrong direction :-(

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61364 - Posted: 4 Mar 2024 | 13:07:01 UTC - in response to Message 61363.

to be expected with 8-10GB cards.

might get better context if you split the graphs up by card type. so you can see the relative error rate vs different VRAM sizes. I'm guessing most errors come from the 8GB cards.

____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61370 - Posted: 4 Mar 2024 | 16:02:45 UTC

On my GTX1080ti 11GB, I've only got about 1% error rate due to memory.

But watching 'nvidia-smi dmon' there are a lot of close shaves, where I'm only a couple of MB's below the limit...

So from a 10GB card, I'd already expect a non-trivial error rate.

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61401 - Posted: 9 Mar 2024 | 20:10:53 UTC - in response to Message 61364.
Last modified: 9 Mar 2024 | 20:13:28 UTC

to be expected with 8-10GB cards.

might get better context if you split the graphs up by card type. so you can see the relative error rate vs different VRAM sizes. I'm guessing most errors come from the 8GB cards.


They do:

8GB – last 2 checks of 2 cards 44.07
10GB – last 2 checks of 2 cards 30.80
12GB – last 2 checks of 1 card 7.62

But I need to look at the last day or two as rates have been going up.
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61402 - Posted: 9 Mar 2024 | 20:22:16 UTC
Last modified: 9 Mar 2024 | 20:23:07 UTC

Anyone have insight into this error:

<stderr_txt>
09:06:00 (130033): wrapper (7.7.26016): starting
[x86_64-pc-linux-gnu__cuda1121.zip]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of x86_64-pc-linux-gnu__cuda1121.zip or
x86_64-pc-linux-gnu__cuda1121.zip.zip, and cannot find x86_64-pc-linux-gnu__cuda1121.zip.ZIP, period.
boinc_unzip() error: 9

It looks like every WU since the afternoon of the 7th (Zulu) is getting this but only on my single 12GB 4070S

Skip

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61403 - Posted: 9 Mar 2024 | 20:35:22 UTC - in response to Message 61402.

A download error caused the zip file to be corrupted: it is missing the end-of-central-directory signature.

I was getting that on a Google Drive zip archive a couple of days ago. Switching browsers let me download the archive correctly so it would unpack.
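
If you want to confirm that diagnosis by hand, Python's standard zipfile module can detect both failure modes seen in this thread: the missing end-of-central-directory signature above and the earlier per-file "bad CRC" errors. A small sketch; the file name is just the example from the log, point it at the archive in your own project/slot directory:

import zipfile

path = "x86_64-pc-linux-gnu__cuda1121.zip"   # example name, adjust to your download

if not zipfile.is_zipfile(path):
    # Truncated/corrupted download: the end-of-central-directory record is missing.
    print("Not a valid zip archive (end-of-central-directory signature not found)")
else:
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()                   # first member with a bad CRC, or None
        print("CRC check:", "OK" if bad is None else f"bad CRC in {bad}")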

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61404 - Posted: 9 Mar 2024 | 23:13:30 UTC - in response to Message 61403.

Download error causing the zip file to be corrupted because it is missing the end of file signature.

I was getting that on a Google Drive zip archive a couple of days ago. Switching browsers let me download the archive correctly so it would unpack.


Well after 100+ of these errors I finally got 3 good ones out of that box after a reboot for a different reason.

Thanx, Skip

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 331,799,934
RAC: 3,481,654
Level
Asp
Scientific publications
wat
Message 61405 - Posted: 11 Mar 2024 | 8:33:44 UTC

Bonjour
y a t il des unités de calcul pour windows disponible?

Hello
Are there work units available for Windows?
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 3,066,631,809
RAC: 15,873,927
Level
Arg
Scientific publications
wat
Message 61406 - Posted: 11 Mar 2024 | 11:59:23 UTC - in response to Message 61405.
Last modified: 11 Mar 2024 | 12:00:00 UTC

There are none for this project (at this time).

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61454 - Posted: 10 Apr 2024 | 11:35:30 UTC

Error rates skyrocketed on me for this app... even on the 10GB cards (12GB card will be back on Thursday). This started late on April 7th.

Error rate now over 50% so I will have to NNW till I can figure it out.

Skip
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61455 - Posted: 10 Apr 2024 | 13:17:01 UTC - in response to Message 61454.

Error rates skyrocketed on me for this app... even on the 10GB cards (12GB card will be back on Thursday). This started late on April 7th.

Error rate now over 50% so I will have to NNW till I can figure it out.

Skip


It's not you. It's that the new v4 tasks require more VRAM. I asked about this on their Discord.

I asked:
it seems the newer "v4" tasks on average require a bit more VRAM than the previous v3 tasks. I'm seeing a higher error percentage on 12GB cards.

v3 had about 5% failure from OOM on 12GB VRAM
v4 is more like 15% failure from OOM on 12GB VRAM
no failures with 16GB VRAM

what changed in V4?


Steve replied:
yes this make sense unfortunately. In the previous round of "inputs_v3**" it was calculating things incorrectly for any molecule containing Iodine. This is heaviest element in our dataset. The computational cost of this QM method scales with the size of the elements (it depends on the number of electrons). We are resending the incorrect calculations for Iodine containing molecules in this round of "v4" work units. Therefore the v4 set is a subset of the previous v3 WUs containing heavier elements, hence there are more OOM errors.

____________

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61456 - Posted: 10 Apr 2024 | 16:15:25 UTC - in response to Message 61455.
Last modified: 10 Apr 2024 | 16:15:57 UTC

Thank you. You probably just saved me hours of wasted time.

Error %
AVG ALL: 29.1
AVG – last 3: 59.0

8GB – last 2 72.76
10GB – last 2 66.52
12GB – last 2 3.55 (card out for a week)

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61470 - Posted: 16 Apr 2024 | 16:32:16 UTC - in response to Message 61455.
Last modified: 16 Apr 2024 | 16:39:06 UTC

Steve replied:

yes this make sense unfortunately. In the previous round of "inputs_v3**" it was calculating things incorrectly for any molecule containing Iodine. This is heaviest element in our dataset. The computational cost of this QM method scales with the size of the elements (it depends on the number of electrons). We are resending the incorrect calculations for Iodine containing molecules in this round of "v4" work units. Therefore the v4 set is a subset of the previous v3 WUs containing heavier elements, hence there are more OOM errors.


Any change in this situation?

I got my 12GB card back and my haphazard data collection seems to have it under a 9% error rate and with the very last grab showing 5.85%.

The 8GB & 10GB cards are still on NNW (other than 3 WUs I let through on the 10GB cards; they completed).

Skip

Skip Da Shu
Send message
Joined: 13 Jul 09
Posts: 61
Credit: 827,525,165
RAC: 10,449,053
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61471 - Posted: 21 Apr 2024 | 12:20:15 UTC - in response to Message 61470.
Last modified: 21 Apr 2024 | 12:29:46 UTC

Steve replied:
yes this make sense unfortunately. In the previous round of "inputs_v3**" it was calculating things incorrectly for any molecule containing Iodine. This is heaviest element in our dataset. The computational cost of this QM method scales with the size of the elements (it depends on the number of electrons). We are resending the incorrect calculations for Iodine containing molecules in this round of "v4" work units. Therefore the v4 set is a subset of the previous v3 WUs containing heavier elements, hence there are more OOM errors.


Any change in this situation?

I got my 12GB card back and my haphazard data collection seems to have it under a 9% error rate and with the very last grab showing 5.85%.


Something's coming around... error rates for the 10GB cards are now under 13% and the 12GB card is ~3%.

Skip

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1036
Credit: 39,372,107,483
RAC: 162,570,079
Level
Trp
Scientific publications
wat
Message 61472 - Posted: 21 Apr 2024 | 12:54:27 UTC

I also see about 3% on my 12GB cards.

I think it will vary depending on what kind of molecules are being processed.
____________

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 468
Credit: 8,486,022,716
RAC: 10,942,361
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61473 - Posted: 21 Apr 2024 | 13:47:39 UTC - in response to Message 61472.

Right now, I am seeing less than a 2% error rate on my computers, each of which has an 11 GB card. This does vary over time.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 50
Credit: 289,422,017
RAC: 2,616,469
Level
Asn
Scientific publications
wat
Message 61474 - Posted: 22 Apr 2024 | 21:47:03 UTC

I'm only seeing a single memory error in the last 300 results for my GTX 1080 Ti (11GB), so 0.33%.

Something I do get quite often is CRC errors when unzipping the input files, so the tasks fail within the first 30 seconds.

Anybody else seeing this?

https://www.gpugrid.net/results.php?userid=571263&offset=0&show_names=0&state=5&appid=47

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1289
Credit: 5,219,281,959
RAC: 10,592,914
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61475 - Posted: 23 Apr 2024 | 2:11:14 UTC - in response to Message 61474.
Last modified: 23 Apr 2024 | 2:12:16 UTC

No, I've not had any CRC errors unzipping the tar archives.

Sounds like a machine problem. Memory, heat, high workload latency??
