Message boards : News : Experimental Python tasks (beta) - task description
| Author | Message |
|---|---|
|
Hello everyone, just wanted to give some updates about the machine learning Python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta)" thread. | |
| ID: 56977 | Rating: 0 | rate:
| |
|
Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier on the computational abilities of the client software. Looking forward to contributing in the future, hopefully with more than I have at hand right now. "problems [so far] unattainable in smaller scale settings"? 5. What is the ultimate goal of this ML project? Have only one latest-gen trained agent group at the end, the result of the continuous reinforced learning iterations? Or have several and test/benchmark them against each other? Thx! Keep up the great work! | |
| ID: 56978 | Rating: 0 | rate:
| |
|
will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload. | |
| ID: 56979 | Rating: 0 | rate:
| |
|
This is a welcome advance. Looking forward to contributing. | |
| ID: 56989 | Rating: 0 | rate:
| |
|
Thank you very much for this advance. | |
| ID: 56990 | Rating: 0 | rate:
| |
|
Wish you success. | |
| ID: 56994 | Rating: 0 | rate:
| |
|
Ian&Steve C. wrote on June 17th: will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload. I am curious what the answer will be | |
| ID: 56996 | Rating: 0 | rate:
| |
|
Also, can the team comment on not just GPU "under"-utilization: these tasks have NO GPU utilization. | |
| ID: 57000 | Rating: 0 | rate:
| |
|
I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly and it seems to be successful based on first trials, I'd suggest setting it up as a separate project. | |
| ID: 57009 | Rating: 0 | rate:
| |
|
This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. But in my opinion and understanding, this is an ML project specifically set up to be trained on biomedical data sets. Thus, I'd argue that the science being done is still bio-related nonetheless. Would highly appreciate feedback on the many great questions posted in this thread so far. | |
| ID: 57014 | Rating: 0 | rate:
| |
| ID: 57020 | Rating: 0 | rate:
| |
|
I noticed some python tasks in my task history. All failed for me and failed so far for everyone else. Has anyone completed any? | |
| ID: 58044 | Rating: 0 | rate:
| |
|
Host 132158 is getting some. The first failed with:

File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run
    sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd)))
NameError: name 'cmd' is not defined
----------------------------------------
ERROR: Failed building wheel for atari-py
ERROR: Command errored out with exit status 1:
command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py
cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/

Looks like a typo. | |
| ID: 58045 | Rating: 0 | rate:
| |
|
Shame the tasks are misconfigured. I ran through a dozen of them on one host, all with errors. With the scarcity of work, every little bit is appreciated and can be used. | |
| ID: 58058 | Rating: 0 | rate:
| |
|
@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed on Linux, and even when it is, it is not always on the PATH. | |
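As an aside, a quick way for a volunteer to check whether those build tools are even visible on their host's PATH is a small sketch like this (the tool list below is just the ones mentioned above, not an official requirement list):

# Check whether the build tools mentioned above are visible on the PATH.
# Purely illustrative; adjust the tuple to whatever the failing build asks for.
import shutil

for tool in ("make", "cmake"):
    path = shutil.which(tool)
    print(f"{tool}: {'found at ' + path if path else 'NOT found on PATH'}")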
| ID: 58061 | Rating: 0 | rate:
| |
|
Hello everyone, sorry for the late reply. | |
| ID: 58104 | Rating: 0 | rate:
| |
|
Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466. | |
| ID: 58112 | Rating: 0 | rate:
| |
|
The error reported in the job with result ID 32730901 is due to a conda environment error detected and solved during previous testing bouts. | |
| ID: 58114 | Rating: 0 | rate:
| |
|
OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download... | |
| ID: 58115 | Rating: 0 | rate:
| |
|
Task e1a15-ABOU_rnd_ppod_3-0-1-RND2976_3 was the first to run after the reset, but unfortunately it failed too. | |
| ID: 58116 | Rating: 0 | rate:
| |
|
I reset the project on my host. still failed. | |
| ID: 58117 | Rating: 0 | rate:
| |
|
I couldn't get your imgur image to load, just a spinner. | |
| ID: 58118 | Rating: 0 | rate:
| |
|
Yeah I get a message that Imgur is over capacity (first time I’ve ever seen that). Their site must be having maintenance or getting hammered. It was working earlier. I guess just try again a little later. | |
| ID: 58119 | Rating: 0 | rate:
| |
|
I've had two tasks complete on a host that was previously erroring out: | |
| ID: 58120 | Rating: 0 | rate:
| |
|
Hello everyone, | |
| ID: 58123 | Rating: 0 | rate:
| |
|
Yes, I was progressively testing for how many steps the Agents could be trained, and I forgot to increase the credits proportionally to the training steps. I will correct that in the immediate next batch; sorry, and thanks for bringing it to our attention. | |
| ID: 58124 | Rating: 0 | rate:
| |
|
On mine, free memory (as reported in top) dropped from approximately 25,500 (when running an ACEMD task) to 7,000. | |
| ID: 58125 | Rating: 0 | rate:
| |
|
thanks for the clarification. | |
| ID: 58127 | Rating: 0 | rate:
| |
I agree with PDW that running work on all CPUs threads when BOINC expects at most that 1 CPU thread will be used will be problematic for most users who run CPU work from other projects. The normal way of handling that is to use the [MT] (multi-threaded) plan class mechanism in BOINC - these trial apps are being issued using the same [cuda1121] plan class as the current ACEMD production work. Having said that, it might be quite tricky to devise a combined [CUDA + MT] plan class. BOINC code usually expects a simple-minded either/or solution, not a combination. And I don't really like the standard MT implementation, which defaults to using every possible CPU core in the volunteer's computer. Not polite. MT can be tamed by using an app_config.xml or app_info.xml file, but you may need to tweak both <cpu_usage> (for BOINC scheduling purposes) and something like a command line parameter to control the spawning behaviour of the app. | |
| ID: 58132 | Rating: 0 | rate:
| |
|
given the current state of these beta tasks, I have done the following on my 7xGPU 48-thread system: allowed only 3x Python Beta tasks to run, since the systems only have 64GB RAM and each process is using ~20GB.

<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <cpu_usage>5.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>

Will see how it works out when more Python beta tasks flow, and adjust as the project adjusts settings. abouh, before you start releasing more beta tasks, could you give us a heads up on what we should expect and/or what you changed about them? ____________ | |
| ID: 58134 | Rating: 0 | rate:
| |
|
I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem. | |
| ID: 58135 | Rating: 0 | rate:
| |
I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem. Good to know Keith. Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns? ____________ | |
| ID: 58136 | Rating: 0 | rate:
| |
I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem. Gpu utilization was at 3%. Each spawn used up about 170MB of memory and fluctuated around 13-17% cpu utilization. | |
| ID: 58137 | Rating: 0 | rate:
| |
|
good to know. so what I experienced was pretty similar. | |
| ID: 58138 | Rating: 0 | rate:
| |
|
Yes primarily Universe and a few TN-Grid tasks were running also. | |
| ID: 58140 | Rating: 0 | rate:
| |
|
I will send some more tasks later today with similar requirements to the last ones, with 32 multithreaded reinforcement learning environments running in parallel for the agent to interact with. | |
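For anyone curious what "32 environments running in parallel" looks like in practice, here is a purely illustrative sketch (a toy environment and made-up message protocol, not the project's actual code) of one agent driving many environment worker processes:

# Illustrative sketch (not GPUGrid's code): one agent interacting with many
# environment instances, each living in its own worker process.
import multiprocessing as mp

NUM_ENVS = 32  # matches the "32 environments" mentioned above, purely illustrative

def env_worker(conn, env_id):
    """Toy stand-in for a reinforcement learning environment process."""
    state = 0
    while True:
        action = conn.recv()          # wait for the agent's action
        if action is None:            # shutdown signal
            break
        state += action               # pretend to step the environment
        conn.send((state, 1.0))       # return (observation, reward)
    conn.close()

if __name__ == "__main__":
    pipes, procs = [], []
    for i in range(NUM_ENVS):
        parent, child = mp.Pipe()
        p = mp.Process(target=env_worker, args=(child, i), daemon=True)
        p.start()
        pipes.append(parent)
        procs.append(p)

    # One interaction step across all environments.
    for pipe in pipes:
        pipe.send(1)                  # broadcast a dummy action
    transitions = [pipe.recv() for pipe in pipes]
    print(f"collected {len(transitions)} transitions")

    for pipe in pipes:
        pipe.send(None)               # tell the workers to exit
    for p in procs:
        p.join()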
| ID: 58141 | Rating: 0 | rate:
| |
|
I got 3 of them just now. All failed with tracebacks after several minutes of run time. Seems like there are still some coding bugs in the application. All wingmen are failing similarly: | |
| ID: 58143 | Rating: 0 | rate:
| |
|
the new one I just got seems to be doing better. less CPU use, and it looks like i'm seeing the mentioned 60-80% spikes on the GPU occasionally. | |
| ID: 58144 | Rating: 0 | rate:
| |
|
I normally test the jobs locally first, then run a couple of small batches of tasks on GPUGrid in case some error occurs that did not appear locally. The first small batch failed, so I could fix the error in the second one. Now that the second batch has succeeded, I will send a bigger batch of tasks. | |
| ID: 58145 | Rating: 0 | rate:
| |
|
I must be crunching one of the fixed second batch currently on this daily driver. Seems to be progressing nicely. | |
| ID: 58146 | Rating: 0 | rate:
| |
|
these new ones must be pretty long. | |
| ID: 58147 | Rating: 0 | rate:
| |
|
I got the first one of the Python WUs for me, and am a little concerned. After 3.25 hours it is only 10% complete. GPU usage seems to be about what you all are saying, and same with CPU. However, I also only have 8 cores/16 threads, with 6 other CPU work units running (TN Grid and Rosetta 4.2). Should I be limiting the other work to let these run? (16 GB RAM). | |
| ID: 58148 | Rating: 0 | rate:
| |
|
I don't think BOINC knows how to handle interpreting the estimated run_times of these Python tasks. I wouldn't worry about it. | |
| ID: 58149 | Rating: 0 | rate:
| |
|
I had the same feeling, Keith | |
| ID: 58150 | Rating: 0 | rate:
| |
|
also those of us running these, should probably prepare for VERY low credit reward. | |
| ID: 58151 | Rating: 0 | rate:
| |
|
I got one task early on that rewarded more than reasonable credit. | |
| ID: 58152 | Rating: 0 | rate:
| |
|
That task was short though. The threshold is around 2 million credits, if I remember correctly. | |
| ID: 58153 | Rating: 0 | rate:
| |
|
confirmed. Peak FLOP Count One-time cheats ____________ | |
| ID: 58154 | Rating: 0 | rate:
| |
|
Yep, I saw that. Same credit as before and now I remember this bit of code being brought up before back in the old Seti days. | |
| ID: 58155 | Rating: 0 | rate:
| |
|
Awoke to find 4 PythonGPU WUs running on 3 computers. All had OPN & TN-Grid WUs running with CPU use flat-lined at 100%. Suspended all other CPU WUs to see what PG was using and got a band mostly contained in the range 20 to 40%. Then I tried a couple of scenarios. | |
| ID: 58157 | Rating: 0 | rate:
| |
|
I did something similar with my two 7xGPU systems. | |
| ID: 58158 | Rating: 0 | rate:
| |
|
Hello everyone, | |
| ID: 58161 | Rating: 0 | rate:
| |
|
thanks! | |
| ID: 58162 | Rating: 0 | rate:
| |
1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail. I've tried to set preferences at all my hosts with GPUs of less than 6GB RAM so as not to receive the Python Runtime (GPU, beta) app: "Run only the selected applications - ACEMD3: yes". But I've still received one more Python GPU task at one of them. This makes me doubt whether GPUGRID preferences are currently working as intended or not... Task e1a1-ABOU_rnd_ppod_8-0-1-RND5560_0 RuntimeError: CUDA out of memory. | |
| ID: 58163 | Rating: 0 | rate:
| |
This makes me doubt whether GPUGRID preferences are currently working as intended or not... my question is a different one: now that the GPUGRID team concentrates on Python, will no more ACEMD tasks come? | |
| ID: 58164 | Rating: 0 | rate:
| |
But I've still received one more Python GPU task at one of them. I had the same problem; you need to set 'Run test applications' to No. It looks like having that set to Yes will override any specific application setting you set. | |
| ID: 58166 | Rating: 0 | rate:
| |
|
Thanks, I'll try | |
| ID: 58167 | Rating: 0 | rate:
| |
This makes me to get in doubt whether GPUGRID preferences are currently working as intended or not... Hard to say. Toni and Gianni both stated the work would be very limited and infrequent until they can fill the new PhD positions. But there have been occasional "drive-by" drops of cryptic scout work I've noticed along with the occasional standard research acemd3 resend. Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks. | |
| ID: 58168 | Rating: 0 | rate:
| |
Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks. Would be great if they work on Windows, too :-) | |
| ID: 58169 | Rating: 0 | rate:
| |
|
Today I will send a couple of batches with short tasks for some final debugging of the scripts and then later I will send a big batch of debugged tasks. | |
| ID: 58170 | Rating: 0 | rate:
| |
|
The idea is to make it work for Windows in the future as well, once it works smoothly on linux. | |
| ID: 58171 | Rating: 0 | rate:
| |
|
Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB. | |
| ID: 58172 | Rating: 0 | rate:
| |
Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB. not sure what happened to it. take a look. https://gpugrid.net/result.php?resultid=32731651 ____________ | |
| ID: 58173 | Rating: 0 | rate:
| |
|
Looks like a needed package was not retrieved properly with a "deadline exceeded" error. | |
| ID: 58174 | Rating: 0 | rate:
| |
Looks like a needed package was not retrieved properly with a "deadline exceeded" error. It's interesting, looking at the stderr output: it appears that this app is communicating over the internet to send and receive data outside of BOINC, and to servers that do not belong to the project. (I think the issue is that I was connected to my VPN checking something else, I left the connection active, and it might have had an issue reaching the site it was trying to access.) Not sure how kosher that is. I think BOINC devs don't intend/desire this kind of behavior, and some people might have security concerns about the app doing these things outside of BOINC. It might be a little smoother to do all communication only between the host and the project, and only via the BOINC framework. If data needs to be uploaded elsewhere, it might be better for the project to do that on the backend. Just my .02 ____________ | |
| ID: 58175 | Rating: 0 | rate:
| |
1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail. I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on. I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it. I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it. | |
| ID: 58176 | Rating: 0 | rate:
| |
The idea is to make it work for Windows in the future as well, once it works smoothly on linux. okay, sounds good; thanks for the information | |
| ID: 58177 | Rating: 0 | rate:
| |
|
I'm running one of the new batch and at first the task was only using 2.2GB of GPU memory, but now it has climbed back up to 6.6GB of GPU memory. | |
| ID: 58178 | Rating: 0 | rate:
| |
|
Just had one that's listed as "aborted by user." I didn't abort it. | |
| ID: 58179 | Rating: 0 | rate:
| |
|
RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF | |
| ID: 58180 | Rating: 0 | rate:
| |
|
The ray errors are normal and can be ignored. | |
| ID: 58181 | Rating: 0 | rate:
| |
1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running in GPUs with less memory should fail. I'm not doing anything at all in mitigation for the Python on GPU tasks other than to only run one at a time. I've been successful in almost all cases other than the very first trial ones in each evolution. | |
| ID: 58182 | Rating: 0 | rate:
| |
|
What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it. | |
| ID: 58183 | Rating: 0 | rate:
| |
|
During the task, the performance of the Agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. It immensely helps to understand the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way. | |
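For context, intermittent logging to wandb from a training loop generally looks something like the sketch below; the project and metric names are invented, not the ones the real task uses, and offline mode is used so the sketch does not need an account:

# Minimal sketch of periodic wandb logging (names are invented; offline mode
# avoids needing wandb credentials).
import wandb

run = wandb.init(project="example-rl-project", name="example-run", mode="offline")

for update in range(100):
    mean_reward = float(update)          # placeholder for the agent's real score
    if update % 10 == 0:                 # log intermittently, not every step
        wandb.log({"mean_reward": mean_reward, "update": update})

wandb.finish()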
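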
| ID: 58184 | Rating: 0 | rate:
| |
|
Pinocchio probably only caused problems in a subset of hosts, as it was due to one of the first test batches having a wrong conda environment requirements file. It was a small batch. | |
| ID: 58185 | Rating: 0 | rate:
| |
|
My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660. | |
| ID: 58186 | Rating: 0 | rate:
| |
What was halved was the amount of Agent training per task, and therefore the total amount of time required to completed it. Halved? I've got one at nearly 21.5 hours on a 3080Ti and still going | |
| ID: 58187 | Rating: 0 | rate:
| |
|
This shows the timing discrepancy, a few minutes before task 32731655 completed. | |
| ID: 58188 | Rating: 0 | rate:
| |
|
I still think the 5,000,000 GFLOPs count is far too low, since these run for 12-24 hrs depending on host (GPU speed does not seem to be a factor in this, since GPU utilization is so low; they are most likely CPU/memory bound), and there seems to be a bit of a discrepancy in run time per task. I had a task run for 9 hrs on my 3080Ti, while another user claims 21+ hrs on his 3080Ti. And I've had several tasks get killed around 12 hrs for exceeding the time limit, while others ran for longer. Lots of inconsistencies here. | |
| ID: 58189 | Rating: 0 | rate:
| |
|
Because this project still uses DCF, the 'exceeded time limit' problem should go away as soon as you can get a single task to complete. Both my machines with finished tasks are now showing realistic estimates, but with DCFs of 5+ and 10+ - I agree, the FLOPs estimate should be increased by that sort of multiplier to keep estimates balanced against other researchers' work for the project. | |
| ID: 58190 | Rating: 0 | rate:
| |
|
my system that completed a few tasks had a DCF of 36+ | |
| ID: 58191 | Rating: 0 | rate:
| |
checkpointing also still isn't working. See my screenshot. "CPU time since checkpoint: 16:24:44" | |
| ID: 58192 | Rating: 0 | rate:
| |
|
I've checked a sched_request when reporting.

<result>
    <name>e1a26-ABOU_rnd_ppod_11-0-1-RND6936_0</name>
    <final_cpu_time>55983.300000</final_cpu_time>
    <final_elapsed_time>36202.136027</final_elapsed_time>

That's task 32731632. So it's the server applying the 'sanity(?) check' "elapsed time not less than CPU time". That's right for a single core GPU task, but not right for a task with multithreaded CPU elements. | |
| ID: 58193 | Rating: 0 | rate:
| |
|
As mentioned by Ian&Steve C., GPU speed influences only partially task completion time. | |
| ID: 58194 | Rating: 0 | rate:
| |
|
I will look into the reported issues before sending the next batch, to see if I can find a solution for both the problem of jobs being killed due to “exceeded time limit” and the progress and checkpointing problems. | |
| ID: 58195 | Rating: 0 | rate:
| |
From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?

The jobs reach us with a workunit description:

<workunit>
    <name>e1a24-ABOU_rnd_ppod_11-0-1-RND1891</name>
    <app_name>PythonGPU</app_name>
    <version_num>401</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>4000000000.000000</rsc_memory_bound>
    <rsc_disk_bound>10000000000.000000</rsc_disk_bound>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-run</file_name>
        <open_name>run.py</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-data</file_name>
        <open_name>input.zip</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-requirements</file_name>
        <open_name>requirements.txt</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>e1a24-ABOU_rnd_ppod_11-0-input_enc</file_name>
        <open_name>input</open_name>
        <copy_file/>
    </file_ref>
</workunit>

It's the fourth line, '<rsc_fpops_est>', which causes the problem. The job size is given as the estimated number of floating point operations to be calculated, in total. BOINC uses this, along with the estimated speed of the device it's running on, to estimate how long the task will take. For a GPU app, it's usually the speed of the GPU that counts, but in this case - although it's described as a GPU app - the dominant factor might be the speed of the CPU. BOINC doesn't take any direct notice of that. The jobs are killed when they reach the duration calculated from the next line, '<rsc_fpops_bound>'. A quick and dirty fix while testing might be to increase that value even above the current 50x the original estimate, but that removes a valuable safeguard during normal running. | |
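A rough back-of-the-envelope version of that calculation (heavily simplified; a real BOINC client also applies a duration correction factor and other scaling, and the device speed below is an arbitrary assumption):

# Simplified illustration of how the numbers above turn into durations.
# Real BOINC also folds in the duration correction factor (DCF) and other
# scaling, so treat these as order-of-magnitude figures only.
rsc_fpops_est = 5_000_000_000_000_000       # 5e15 flops, from the workunit above
rsc_fpops_bound = 250_000_000_000_000_000   # 50x the estimate

device_flops = 1e12   # assumed ~1 TFLOPS effective speed, purely illustrative

estimated_seconds = rsc_fpops_est / device_flops
kill_after_seconds = rsc_fpops_bound / device_flops

print(f"estimated runtime: {estimated_seconds / 3600:.1f} h")
print(f"killed after:      {kill_after_seconds / 3600:.1f} h")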
| ID: 58196 | Rating: 0 | rate:
| |
|
I see, thank you very much for the info. I asked Toni to help me adjust the "rsc_fpops_est" parameter. Hopefully the next jobs won't be aborted by the server. | |
| ID: 58197 | Rating: 0 | rate:
| |
|
Thanks @abouh for working with us in debugging your application and work units. | |
| ID: 58198 | Rating: 0 | rate:
| |
|
Thank you for your kind support. During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on. This behavior can be seen in some tests described in my Managing non-high-end hosts thread. | |
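That alternating pattern, in schematic form (toy stand-ins only, not the project's actual training loop):

# Schematic of the alternating collect/learn cycle described above
# (toy stand-ins only; none of this is the real task's code).
import time

def collect_rollouts(num_steps=512):
    """CPU phase: pretend to step the environments and gather experience."""
    return [0.0] * num_steps          # placeholder batch of transitions

def learn_from(batch):
    """GPU phase: pretend to run gradient updates on the collected batch."""
    time.sleep(0.01)                  # stands in for the burst of GPU work

for update in range(5):
    batch = collect_rollouts()        # long, mostly CPU-bound stretch
    learn_from(batch)                 # short, GPU-heavy burst
    print(f"update {update}: learned from {len(batch)} transitions")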
| ID: 58200 | Rating: 0 | rate:
| |
|
I just sent another batch of tasks. | |
| ID: 58201 | Rating: 0 | rate:
| |
I just sent another batch of tasks. Thank you very much for this kind of Christmas present! Merry Christmas to everyone crunchers worldwide 🎄✨ | |
| ID: 58202 | Rating: 0 | rate:
| |
|
1,000,000,000 GFLOPs - initial estimate 1690d 21:37:58. That should be enough! | |
| ID: 58203 | Rating: 0 | rate:
| |
I tested locally and the progress and the restart.chk files are correctly generated and updated.

On a preliminary look at one new Python GPU task received today:
- Progress estimation is now working properly, updating in 0.9% increments.
- Estimated computation size has been raised to 1,000,000,000 GFLOPs, as also confirmed by Richard Haselgrove.
- Checkpointing seems to be working also, and a checkpoint is being stored about every two minutes.
- Learning cycle period (as watched with sudo nvidia-smi dmon) has been reduced to 11 seconds from the 21 seconds observed in the previous task.
- GPU dedicated RAM usage seems to have been reduced, but I don't know if it is enough for running on 4 GB RAM GPUs (?)
- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes on my Host #569442.

Well done! | |
| ID: 58204 | Rating: 0 | rate:
| |
|
Same observed behavior. Gpu memory halved, progress indicator normal and GFLOPS in line with actual usage. | |
| ID: 58208 | Rating: 0 | rate:
| |
- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?) I'm answering myself: I enabled requesting Python GPU tasks on my GTX 1650 SUPER 4 GB system, and I happened to catch the previously failed task e1a21-ABOU_rnd_ppod_13-0-1-RND2308_1. This task has passed the initial processing steps and has reached the learning cycle phase. At this point, memory usage is just at the limit of the 4 GB of available GPU RAM. Waiting to see whether this task will succeed or not. System RAM usage remains very high: 99% of the 16 GB available RAM in this system is currently in use. | |
| ID: 58209 | Rating: 0 | rate:
| |
- Currrent progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28,9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes at my Host #569442 That's roughly the figure I got in the early stages of today's tasks. But task 32731884 has just finished with <result> <name>e1a17-ABOU_rnd_ppod_13-0-1-RND0389_3</name> <final_cpu_time>59637.190000</final_cpu_time> <final_elapsed_time>39080.805144</final_elapsed_time> That's very similar (and on the same machine) as the one I reported in message 58193. So I don't think the task duration has changed much: maybe the progress %age isn't quite linear (but not enough to worry about). | |
| ID: 58210 | Rating: 0 | rate:
| |
|
Hello,

21:28:07 (152316): wrapper (7.7.26016): starting

I have found an issue from Richard Haselgrove talking about this error: https://github.com/BOINC/boinc/issues/4125 It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that? ____________ | |
| ID: 58218 | Rating: 0 | rate:
| |
It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that? Right. I gave a step-by-step solution based on Richard Haselgrove's finding at my Message #55986. It worked fine for all my hosts. | |
| ID: 58219 | Rating: 0 | rate:
| |
|
Thank you! | |
| ID: 58220 | Rating: 0 | rate:
| |
|
Some new (to me) errors in https://www.gpugrid.net/result.php?resultid=32732017 | |
| ID: 58221 | Rating: 0 | rate:
| |
|
It seems checkpointing still isn't working correctly. | |
| ID: 58222 | Rating: 0 | rate:
| |
|
I saw the same issue on my last task which was checkpointed past 20% yet reset to 10% upon restart. | |
| ID: 58223 | Rating: 0 | rate:
| |
- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?) Two of my hosts with 4 GB dedicated RAM GPUs have succeeded on their latest Python GPU tasks so far. If GPU RAM requirements are planned to be kept this way, it widens the app to a much greater number of hosts. Also, I happened to catch two simultaneous Python tasks at my triple GTX 1650 GPU host. I then urgently suspended requesting Gpugrid tasks in BOINC Manager... Why? This host's system RAM size is 32 GB. When the second Python task started, free system RAM decreased to 1% (!). I roughly estimate that the environment for each Python task takes about 16 GB of system RAM. I guess that an eventual third concurrent task might have crashed itself, or even crashed all three Python tasks, due to lack of system RAM. I was watching the Psensor readings when the first of the two Python tasks finished, and the free system memory drastically increased again from 1% to 38%. I also took an nvidia-smi screenshot, where it can be seen that the Python tasks were running on GPU 0 and GPU 1 respectively, while GPU 2 was processing a PrimeGrid CUDA GPU task. | |
| ID: 58225 | Rating: 0 | rate:
| |
|
now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol. | |
| ID: 58226 | Rating: 0 | rate:
| |
|
Regarding the checkpointing problem, the approach I follow is to check the progress file (if it exists) at the beginning of the Python script and then continue the job from there. | |
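A minimal sketch of that resume-from-progress-file approach (the file name and format here are illustrative, not necessarily what the task really writes):

# Minimal sketch of resuming from a progress file, as described above.
# File name and format are assumptions for illustration only.
import json
import os

PROGRESS_FILE = "progress"   # hypothetical file name
TOTAL_UPDATES = 100

start_update = 0
if os.path.exists(PROGRESS_FILE):                  # continue where we left off
    with open(PROGRESS_FILE) as f:
        start_update = json.load(f).get("update", 0)

for update in range(start_update, TOTAL_UPDATES):
    # ... one training update would run here ...
    with open(PROGRESS_FILE, "w") as f:            # record progress regularly
        json.dump({"update": update + 1}, f)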
| ID: 58227 | Rating: 0 | rate:
| |
now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol. The last two tasks on my system with a 3080Ti ran concurrently and completed successfully. https://www.gpugrid.net/results.php?hostid=477247 | |
| ID: 58228 | Rating: 0 | rate:
| |
|
Errors in e6a12-ABOU_rnd_ppod_15-0-1-RND6167_2 (created today): | |
| ID: 58248 | Rating: 0 | rate:
| |
|
One user mentioned that he could not solve the error INTERNAL ERROR: cannot create temporary directory! This is the configuration he is using: ### Editing /etc/systemd/system/boinc-client.service.d/override.conf I was just wondering if there is any possible reason why it should not work. ____________ | |
| ID: 58249 | Rating: 0 | rate:
| |
|
I am using a systemd file generated from a PPA maintained by Gianfranco Costamagna. It's automatically generated from Debian sources, and kept up-to-date with new releases automatically. It's currently supplying a BOINC suite labelled v7.16.17

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target

That has the 'PrivateTmp=true' line in the [Service] section of the file, rather than isolated at the top as in your example. I don't know Linux well enough to know how critical the positioning is. We had long discussions in the BOINC development community a couple of years ago, when it was discovered that the 'PrivateTmp=true' setting blocked access to BOINC's X-server based idle detection. The default setting was reversed for a while, until it was discovered that the reverse 'PrivateTmp=false' setting caused the problem creating temporary directories that we observe here. I think that the default setting was reverted to true, but the discussion moved into the darker reaches of the Linux package maintenance managers, and the BOINC development cycle became somewhat disjointed. I'm no longer fully up-to-date with the state of play. | |
| ID: 58250 | Rating: 0 | rate:
| |
|
A simpler answer might be: "### Lines below this comment will be discarded" - so the file as posted won't do anything at all; in particular, it won't run BOINC! | |
| ID: 58251 | Rating: 0 | rate:
| |
|
Thank you! I reviewed the code and detected the source of the error. I am currently working to solve it. | |
| ID: 58253 | Rating: 0 | rate:
| |
|
Everybody seems to be getting the same error in today's tasks: | |
| ID: 58254 | Rating: 0 | rate:
| |
|
I believe I got one of the test, fixed tasks this morning based on the short crunch time and valid report. | |
| ID: 58255 | Rating: 0 | rate:
| |
|
Yes, your workunit was "created 7 Jan 2022 | 17:50:07 UTC" - that's a couple of hours after the ones I saw. | |
| ID: 58256 | Rating: 0 | rate:
| |
|
I just sent a batch that seems to fail with

File "/var/lib/boinc-client/slots/30/python_dependencies/ppod_buffer_v2.py", line 325, in before_gradients

For some reason it did not crash locally. "Fortunately" it will crash after only a few minutes, and it is easy to solve. I am very sorry for the inconvenience... I will also send a corrected batch with tasks of normal duration. I have tried to reduce the GPU memory requirements a bit in the new tasks. ____________ | |
| ID: 58263 | Rating: 0 | rate:
| |
|
Got one of those - failed as you describe. | |
| ID: 58264 | Rating: 0 | rate:
| |
|
I got 20 bad WUs today on this host: https://www.gpugrid.net/results.php?hostid=520456

Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda && /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
0%| | 0/45 [00:00<?, ?it/s]
concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
  File "concurrent/futures/process.py", line 368, in _queue_management_worker
  File "multiprocessing/connection.py", line 251, in recv
TypeError: __init__() missing 1 required positional argument: 'msg'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "entry_point.py", line 69, in <module>
  File "concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
  File "concurrent/futures/_base.py", line 611, in result_iterator
  File "concurrent/futures/_base.py", line 439, in result
  File "concurrent/futures/_base.py", line 388, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[6689] Failed to execute script entry_point
13:25:58 (6392): /usr/bin/flock exited; CPU time 3.906269
13:25:58 (6392): app exit status: 0x1
13:25:58 (6392): called boinc_finish(195)
</stderr_txt>
]]> | |
| ID: 58265 | Rating: 0 | rate:
| |
|
I errored out 12 tasks created from 10:09:55 to 10:40:06. | |
| ID: 58266 | Rating: 0 | rate:
| |
|
And two of those were the batch error resends that now have failed. | |
| ID: 58268 | Rating: 0 | rate:
| |
|
You need to look at the creation time of the master WU, not of the individual tasks (which will vary, even within a WU, let alone a batch of WUs). | |
| ID: 58269 | Rating: 0 | rate:
| |
|
I have seen this error a few times. concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. Do you think it could be due to a lack of resources? I think Linux starts killing processes if you are over capacity. ____________ | |
| ID: 58270 | Rating: 0 | rate:
| |
|
Might be the OOM-Killer kicking in. You would need to grep -i kill /var/log/messages* to check if processes were killed by the OOM-Killer. If that is the case you would have to configure /etc/sysctl.conf to let the system be less sensitive to brief out of memory conditions. | |
| ID: 58271 | Rating: 0 | rate:
| |
|
I Googled the error message, and came up with this stackoverflow thread. "The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock." Other search results may provide further clues. | |
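For reference, the quoted constraint boils down to launching the workers from a guarded, importable main module, roughly like this (illustrative only; note that BrokenProcessPool is also what appears when the OS, e.g. the OOM killer, terminates a worker while a future is still pending):

# Illustrative pattern only: workers launched from an importable module
# behind a __main__ guard, as the quoted documentation requires.
from concurrent.futures import ProcessPoolExecutor

def step_env(env_id):
    """Toy stand-in for work done in a subprocess."""
    return env_id * env_id

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(step_env, range(32)))
    print(results)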
| ID: 58272 | Rating: 0 | rate:
| |
|
Thanks! Out of the possible explanations for the error listed in the thread, I suspect it could be the OS killing the threads due to a lack of resources. Could be not enough RAM, or maybe Python raises this error if the ratio of processes to cores is high? (I have seen some machines with 4 CPUs, and the task spawns 32 reinforcement learning environments). | |
| ID: 58273 | Rating: 0 | rate:
| |
|
What version of Python are the hosts that have the errors running? | |
| ID: 58274 | Rating: 0 | rate:
| |
What version of Python are the hosts that have the errors running? Same Python version as my current one. In case of doubt about conflicting Python versions, I published the solution that I applied to my hosts at Message #57833. It worked for my Ubuntu 20.04.3 LTS Linux distribution, but user mmonnin replied that this didn't work for him. mmonnin kindly published an alternative way at his Message #57840 | |
| ID: 58275 | Rating: 0 | rate:
| |
|
I saw the prior post and was about to mention the same thing. Not sure which one works as the PC has been able to run tasks. | |
| ID: 58276 | Rating: 0 | rate:
| |
|
All jobs should use the same python version (3.8.10), I define it in the requirements.txt file of the conda environment. | |
| ID: 58277 | Rating: 0 | rate:
| |
|
I have a failed task today involving pickle. | |
| ID: 58278 | Rating: 0 | rate:
| |
|
The tasks run on my Tesla K20 for a while, but then fail when they need to use PyTorch, which requires a higher CUDA capability. Oh well. Guess I'll stick to the ACEMD tasks. The error output doesn't list the requirements properly, but from a little Googling, it was updated to require 3.7 within the past couple of years. The only Kepler card that has 3.7 is the Tesla K80.

[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/var/lib/boinc-client/slots/2/gpugridpy/lib/python3.8/site-packages/torch/cuda/__init__.py:120: UserWarning: Found GPU%d %s which is of cuda capability %d.%d. PyTorch no longer supports this GPU because it is too old. The minimum cuda capability supported by this library is %d.%d.

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla. | |
| ID: 58279 | Rating: 0 | rate:
| |
While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla. This is a problem (feature?) of BOINC, not the project. The project only knows what hardware you have based on what BOINC communicates to the project. With cards from the same vendor (Nvidia/AMD/Intel), BOINC only lists the "best" card and then appends a number that corresponds to how many total devices you have from that vendor. It will only list different models if they are from different vendors. Within the Nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that I can't remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5), even though the K20 is arguably the better card with more/faster memory and more cores. All in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code. ____________ | |
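In rough terms, that ranking rule can be pictured like this (a simplification for illustration, not BOINC's actual source; the memory figures are the cards' nominal sizes):

# Simplified picture of the ranking described above (not BOINC's real code):
# higher compute capability wins first, then more memory, regardless of which
# card is actually faster for real workloads.
gpus = [
    {"name": "Quadro K620", "compute_capability": (5, 0), "mem_gb": 2},
    {"name": "Tesla K20", "compute_capability": (3, 5), "mem_gb": 5},
]

best = max(gpus, key=lambda g: (g["compute_capability"], g["mem_gb"]))
print(f"BOINC would report: {best['name']} (x{len(gpus)})")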
| ID: 58280 | Rating: 0 | rate:
| |
While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla. It's often said to be the "best" card but it's just the 1st. https://www.gpugrid.net/show_host_detail.php?hostid=475308 This host has a 1070 and 1080 but just shows 2x 1070s, as the 1070 is in the 1st slot. Any way of checking for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070. | |
| ID: 58281 | Rating: 0 | rate:
| |
In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot. ____________ | |
| ID: 58282 | Rating: 0 | rate:
| |
|
Ah, I get it. I thought it was just stuck, because it did have two K620s before. I didn't realize BOINC was just incapable of acknowledging different cards from the same vendor. Does this affect project statistics? The Milkyway@home folks are gonna have real inflated opinions of the K620 next time they check the numbers haha | |
| ID: 58283 | Rating: 0 | rate:
| |
|
Interesting I had seen this error once before locally, and I assumed it was due to a corrupted input file. | |
| ID: 58284 | Rating: 0 | rate:
| |
|
This is the document I had found about fixing the BrokenProcessPool error. | |
| ID: 58285 | Rating: 0 | rate:
| |
|
@abouh: Thank you for PM me twice! | |
| ID: 58286 | Rating: 0 | rate:
| |
Also I happened to catch two simultaneous Python tasks at my triple GTX 1650 GPU host.

After upgrading system RAM from 32 GB to 64 GB at the above-mentioned host, it has successfully processed three concurrent ABOU Python GPU tasks:
e2a43-ABOU_rnd_ppod_baseline_rnn-0-1-RND6933_3 - Link: https://www.gpugrid.net/result.php?resultid=32733458
e2a21-ABOU_rnd_ppod_baseline_rnn-0-1-RND3351_3 - Link: https://www.gpugrid.net/result.php?resultid=32733477
e2a27-ABOU_rnd_ppod_baseline_rnn-0-1-RND5112_1 - Link: https://www.gpugrid.net/result.php?resultid=32733441
More details regarding this at Message #58287 | |
| ID: 58288 | Rating: 0 | rate:
| |
|
Hello everyone,

Traceback (most recent call last):

It seems like the task is not allowed to create new dirs inside its working directory. Just wondering if it could be some kind of configuration problem, just like the "INTERNAL ERROR: cannot create temporary directory!" for which a solution was already shared. ____________ | |
| ID: 58289 | Rating: 0 | rate:
| |
|
My question would be: what is the working directory? /home/boinc-client/slots/1/... but the final failure concerns /var/lib/boinc-client That sounds like a mixed-up installation of BOINC: 'home' sounds like a location for a user-mode installation of BOINC, but '/var/lib/' would be normal for a service mode installation. It's reasonable for the two different locations to have different write permissions. What app is doing the writing in each case, and what account are they running under? Could the final write location be hard-coded, but the others dependent on locations supplied by the local BOINC installation? | |
| ID: 58290 | Rating: 0 | rate:
| |
|
Hi | |
| ID: 58291 | Rating: 0 | rate:
| |
|
Right, so the working directory is /home/boinc-client/slots/1/... to which the script has full access. The script tries to create a directory to save the logs, but I guess it should not do it in /var/lib/boinc-client. So I think the problem is just that the package I am using to log results saves them outside the working directory by default. Should be easy to fix. ____________ | |
| ID: 58292 | Rating: 0 | rate:
| |
|
BOINC has the concept of a "data directory". Absolutely everything that has to be written should be written somewhere in that directory or its sub-directories. Everything else must be assumed to be sandboxed and inaccessible. | |
| ID: 58293 | Rating: 0 | rate:
| |
The PC now has a 1080 and 1080Ti with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi and so have the other BOINC-displayed GPUs. The Ti is in the physical 1st slot. This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%. | |
| ID: 58294 | Rating: 0 | rate:
| |
What motherboard? And what version of BOINC? Your hosts are hidden so I cannot inspect them myself. PCIe enumeration and ordering can be inconsistent on consumer boards. My server boards seem to enumerate starting from the slot furthest from the CPU socket, while most consumer boards are the opposite, with device0 at the slot closest to the CPU socket. Or do you perhaps run a locked coproc_info.xml file? This would prevent any GPU changes from being picked up by BOINC if it can't write to the coproc file. Edit: also, I forgot that most versions of BOINC incorrectly detect Nvidia GPU memory. They will all max out at 4GB due to a bug in BOINC. So to BOINC your 1080Ti has the same amount of memory as your 1080. And since the 1080Ti is still a Pascal card like the 1080, it has the same compute capability, so you're running into the same specs between them all. Still, to get it to sort properly, you need to fix the BOINC code, or use a GPU with a higher or lower compute capability. Put a Turing card in the system, not in the first slot, and BOINC will pick it up as GPU0 ____________ | |
| ID: 58295 | Rating: 0 | rate:
| |
|
The tests continue. Just reported e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1, with final stats

<result>
    <name>e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1</name>
    <final_cpu_time>107668.100000</final_cpu_time>
    <final_elapsed_time>46186.399529</final_elapsed_time>

That's an average CPU core count of 2.33 over the entire run - that's high for what is planned to be a GPU application. We can manage with that - I'm sure we all want to help develop and test the application for the coming research run - but I think it would be helpful to put more realistic usage values into the BOINC scheduler. | |
| ID: 58296 | Rating: 0 | rate:
| |
|
It's not a GPU application. It uses both CPU and GPU. | |
| ID: 58297 | Rating: 0 | rate:
| |
|
Do you mean changing some of the BOINC parameters like it was done in the case of <rsc_fpops_est>? | |
| ID: 58298 | Rating: 0 | rate:
| |
|
It would need to be done in the plan class definition. Toni said that you define your plan classes in C++ code, so there are some examples in Specifying plan classes in C++. | |
| ID: 58299 | Rating: 0 | rate:
| |
|
It seems to work better now, but I've reached the time limit after 1800 sec:

19:39:23 (6124): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing | |
| ID: 58300 | Rating: 0 | rate:
| |
|
I'd like to hear what others are using for ncpus for their Python tasks in their app_config files. | |
| ID: 58301 | Rating: 0 | rate:
| |
|
I'm still running them at 1 CPU plus 1 GPU. They run fine, but when they are busy on the CPU-only sections, they steal time from the CPU tasks that are running at the same time - most obviously from CPDN. | |
| ID: 58302 | Rating: 0 | rate:
| |
|
You could employ ProcessLasso on the apps and up their priority I suppose. | |
| ID: 58303 | Rating: 0 | rate:
| |
I'd like to hear what others are using for ncpus for their Python tasks in their app_config files. I think that the Python GPU app is very efficient in adapting to any number of CPU cores, and in taking advantage of available CPU resources. This seems to be somewhat independent of the ncpus parameter in the Gpugrid app_config.xml.

Setup at my twin GPU system is as follows:

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>0.49</cpu_usage>
  </gpu_versions>
</app>

And setup for my triple GPU system is as follows:

<app>
  <name>PythonGPU</name>
  <gpu_versions>
    <gpu_usage>1.0</gpu_usage>
    <cpu_usage>0.33</cpu_usage>
  </gpu_versions>
</app>

The aim of this is being able to run two or three concurrent Python GPU tasks respectively without reaching a full "1" CPU core (2 x 0.49 = 0.98; 3 x 0.33 = 0.99). Then, I manually control CPU usage by setting "Use at most XX % of the CPUs" in BOINC Manager for each system, according to its number of CPU cores. This allows me to run "N" Python GPU tasks concurrently and a fixed number of other CPU tasks as desired. But as said, the Gpugrid Python GPU app seems to take CPU resources as needed for successfully processing its tasks... at the cost of slowing down the other CPU applications. | |
| ID: 58304 | Rating: 0 | rate:
| |
|
Yes, I use Process Lasso on all my Windows machines, but I haven't explored its use under Linux. | |
| ID: 58305 | Rating: 0 | rate:
| |
|
This message

19:39:23 (6124): task /usr/bin/flock reached time limit 1800

indicates that, after 30 minutes, the installation of miniconda and the task environment setup have not finished. Consequently, python is not found later on to execute the task, since it is one of the requirements of the miniconda environment.

application ./gpugridpy/bin/python missing

Therefore, it is not an error in itself; it just means that the miniconda setup went too slowly for some reason (in theory 30 minutes should be enough time). Maybe the machine is slower than usual for some reason. Or the connection is slow and dependencies are not being downloaded. We could extend this timeout, but normally if 30 minutes is not enough for the miniconda setup another underlying problem could exist. ____________ | |
| ID: 58306 | Rating: 0 | rate:
| |
|
it seems to be a reasonably fast system. my guess is another type of permissions issue which is blocking the python install and it hits the timeout, or the CPUs are being too heavily used and not giving enough resources to the extraction process. | |
| ID: 58307 | Rating: 0 | rate:
| |
|
There is no Linux equivalent of Process Lasso. | |
| ID: 58308 | Rating: 0 | rate:
| |
|
Well, that got me a long way.

E: Unable to locate package python-qwt5-qt4
E: Unable to locate package python-configobj

Unsurprisingly, the next step returns

Traceback (most recent call last):
  File "./procexp.py", line 27, in <module>
    from PyQt5 import QtCore, QtGui, QtWidgets, uic
ModuleNotFoundError: No module named 'PyQt5'

htop, however, shows about 30 multitasking processes spawned from main, each using around 2% of a CPU core (varying by the second) at nice 19. At the time of inspection, that is. I'll go away and think about that. | |
| ID: 58309 | Rating: 0 | rate:
| |
|
I've one task now that had the same timeout issue getting python. The host was running fine on these tasks before and I don't know what has changed. | |
| ID: 58310 | Rating: 0 | rate:
| |
|
You might look into schedtool as an alternative. | |
| ID: 58311 | Rating: 0 | rate:
| |
I'd like to hear what others are using for ncpus for their Python tasks in their app_config files. Very interesting. Does this actually limit PythonGPU to using at most 5 CPU threads? Does it work better than:

<app_config>
  <!-- i9-7980XE 18c36t 32 GB L3 Cache 24.75 MB -->
  <app>
    <name>PythonGPU</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <avg_ncpus>5</avg_ncpus>
    <cmdline>--nthreads 5</cmdline>
    <fraction_done_exact/>
  </app>
</app_config>

Edit 1: To answer my own question, I changed cpu_usage to 5 and am running a single PythonGPU WU with nothing else going on. The System Monitor shows 5 CPUs running in the 60 to 80% range, with all other CPUs running in the 10 to 40% range. Is there any way to stop it from taking over one's entire computer?

Edit 2: I turned on WCG and the group of 5 went up to 100%, and all the rest went to OPN in the 80 to 95% range. | |
| ID: 58317 | Rating: 0 | rate:
| |
|
No. Setting that value won't change how much CPU is actually used. It just tells BOINC how much of the CPU is being used, so that it can properly account for resources. | |
| ID: 58318 | Rating: 0 | rate:
| |
|
This morning, in a routine system update, I noticed that BOINC Client / Manager was updated from Version 7.16.17 to Version 7.18.1. | |
| ID: 58320 | Rating: 0 | rate:
| |
|
Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet. | |
| ID: 58321 | Rating: 0 | rate:
| |
Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet. It bombed out on the Rosetta pythons; they did not run at all (a VBox problem undoubtedly). And it failed all the validations on QuChemPedIA, which does not use VirtualBox on the Linux version. But it works OK on CPDN, WCG/ARP and Einstein/FGRBP (GPU). All were on Ubuntu 20.04.3. So be prepared to bail out if you have to. | |
| ID: 58322 | Rating: 0 | rate:
| |
Which distro/repository are you using? I'm using the regular repository for Ubuntu 20.04.3 LTS I took screenshot of offered updates before updating. | |
| ID: 58324 | Rating: 0 | rate:
| |
|
My PPA gives slightly more information on the available update: | |
| ID: 58325 | Rating: 0 | rate:
| |
|
OK, I've taken a deep breath and enough coffee - applied all updates.

[Unit]

Note the line I've picked out. That starts with a # sign, for comment, so it has no effect: PrivateTmp is undefined in this file. New work became available just as I was preparing to update, so I downloaded a task and immediately suspended it. After the updates, and enough reboots to get my NVidia drivers functional again (it took three this time), I restarted BOINC and allowed the task to run. Task 32736884. Our old enemy "INTERNAL ERROR: cannot create temporary directory!" is back. Time for a systemd over-ride file, and to go fishing for another task.

Edit - updated the file, as described in message 58312, and got task 32736938. That seems to be running OK, having passed the 10% danger point. Result will be in sometime after midnight. | |
| ID: 58327 | Rating: 0 | rate:
| |
|
I see your task completed normally with the PrivateTmp=true uncommented in the service file. | |
| ID: 58328 | Rating: 0 | rate:
| |
|
No, that's the first time I've seen that particular warning. The general structure is right for this machine, but it doesn't usually reach as high as 11 - GPUGrid normally gets slot 7. Whatever - there were some tasks left waiting after the updates and restarts. | |
| ID: 58329 | Rating: 0 | rate:
| |
|
Oh, I was not aware of this warning. | |
| ID: 58330 | Rating: 0 | rate:
| |
|
Yes, this experiment is with a slightly modified version of the algorithm, which should be faster. It runs the same number of interactions with the reinforcement learning environment, so the credit amount is the same. | |
| ID: 58331 | Rating: 0 | rate:
| |
|
I'll take a look at the contents of the slot directory, next time I see a task running. You're right - the entire '/var/lib/boinc-client/slots/n/...' structure should be writable, to any depth, by any program running under the boinc user account. | |
| ID: 58332 | Rating: 0 | rate:
| |
|
The directory paths are defined as environment variables in the python script.

# Set wandb paths

Then the directories are created by the wandb python package (which handles logging of relevant training data). I suspect it could be in the creation that the permissions are defined. So it is not a BOINC problem.

I will change the paths in future jobs to:

# Set wandb paths

Note that "os.getcwd()" is the working directory, so "/var/lib/boinc-client/slots/11/" in this case ____________ | |
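Presumably the change looks something like the sketch below, using wandb's documented directory environment variables; the exact set of variables in the real script is not shown above, so treat this as an assumption:

# Sketch of the described change: point wandb's output directories at the
# BOINC slot (working) directory instead of a default location outside it.
# The exact variables the real script sets are not shown above.
import os

workdir = os.getcwd()                     # e.g. /var/lib/boinc-client/slots/11
os.environ["WANDB_DIR"] = workdir         # run files
os.environ["WANDB_CONFIG_DIR"] = workdir  # config files
os.environ["WANDB_CACHE_DIR"] = workdir   # cache / artifacts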
| ID: 58333 | Rating: 0 | rate:
| |
Oh, I was not aware of this warning. What happens if that directory doesn't exist? Several of us run BOINC in a different location. Since it's in /var/lib/, the process won't have permissions to create the directory, unless maybe BOINC is run as root. ____________ | |
| ID: 58334 | Rating: 0 | rate:
| |
|
'/var/lib/boinc-client/' is the default BOINC data directory for Ubuntu BOINC service (systemd) installations. It most certainly exists, and is writable, on my machine, which is where Keith first noticed the error message in the report of a successful run. During that run, much will have been written to .../slots/11 | |
| ID: 58335 | Rating: 0 | rate:
| |
|
I'm aware it's the default location on YOUR computer, and on others running the standard Ubuntu repository installer. But the message from abouh sounded like this directory was hard coded, since he gave the entire path - and for folks running BOINC in another location, this directory will not be the same. If it uses a relative file path, then it's fine, but I was seeking clarification. | |
| ID: 58336 | Rating: 0 | rate:
| |
|
Hard path coding was removed before this most recent test batch. | |
| ID: 58337 | Rating: 0 | rate:
| |
/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected; creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.
Yes. I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
I also do these to allow monitoring by BoincTasks over the LAN on my Win10 machine:
• Copy “cc_config.xml” to /etc/boinc-client folder
• Copy “gui_rpc_auth.cfg” to /etc/boinc-client folder
• Reboot | |
| ID: 58338 | Rating: 0 | rate:
| |
|
The directory should be created wherever you run BOINC, that is not a problem. | |
| ID: 58339 | Rating: 0 | rate:
| |
I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
By doing so, you nullify the security your system gets from its different access-rights levels. This practice should be avoided at all costs. | |
| ID: 58340 | Rating: 0 | rate:
| |
|
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February. | |
| ID: 58341 | Rating: 0 | rate:
| |
I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
By doing so, you nullify your system's security provided by different access rights levels.
I am on an isolated network behind a firewall/router. No problem at all. | |
| ID: 58342 | Rating: 0 | rate:
| |
I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words. | |
| ID: 58343 | Rating: 0 | rate:
| |
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February. All I know is that the new build does not work at all on Cosmology with VirtualBox 6.1.32. A work unit just suspends immediately on startup. | |
| ID: 58344 | Rating: 0 | rate:
| |
I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.
It has lasted for many years. EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now. | |
| ID: 58345 | Rating: 0 | rate:
| |
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time. (available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1) | |
| ID: 58346 | Rating: 0 | rate:
| |
In your scenario, it's not a problem.
I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario. https://pimylifeup.com/chmod-777/ | |
| ID: 58347 | Rating: 0 | rate:
| |
In your scenario, it's not a problem.
I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all. | |
| ID: 58348 | Rating: 0 | rate:
| |
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me? | |
| ID: 58349 | Rating: 0 | rate:
| |
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?
What comparable isolation do you get in Windows from one program to another? Or what security are you talking about? Port security from external sources? | |
| ID: 58350 | Rating: 0 | rate:
| |
Security descriptors were introduced in the NTFS 1.2 file system, released in 1996 with Windows NT 4.0. The access control lists in NTFS are more complex in some aspects than in Linux. All modern Windows versions use NTFS by default.
What comparable isolation do you get in Windows from one program to another?
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?
User Account Control was introduced in 2007 with Windows Vista (= apps don't run as administrator, even if the user has administrative privileges, until the user elevates them through an annoying popup).
Or what security are you talking about? Port security from external sources?
The Windows firewall was introduced with Windows XP SP2 in 2004.
This is my last post in this thread about (undermining) filesystem security. | |
| ID: 58351 | Rating: 0 | rate:
| |
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
Updated my second machine. It appears that this re-release is NOT related to the systemd problem: the PrivateTmp=true line is still commented out. Re-apply the fix (#1) from message 58312 after applying this update, if you wish to continue running the Python test apps. | |
| ID: 58352 | Rating: 0 | rate:
| |
|
I think you are correct, except in the term "undermining", which is not appropriate for isolated crunching machines. There is a billion-dollar AV industry for Windows. Apparently someone has figured out how to undermine it there. But I agree that no more posts are necessary. | |
| ID: 58353 | Rating: 0 | rate:
| |
|
While chmod 777-ing is bad practice in general, there's little harm in blowing up the BOINC directory like that. The worst that can happen is that you modify or delete a necessary file by accident and break BOINC. Just reinstall and learn the lesson. Not the end of the world in this instance. | |
| ID: 58354 | Rating: 0 | rate:
| |
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
Ubuntu 20.04.3 LTS is still on the older 7.16.6 version.
apt list boinc-client
Listing... Done
boinc-client/focal 7.16.6+dfsg-1 amd64 | |
| ID: 58355 | Rating: 0 | rate:
| |
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time. Curious how your Ubuntu release got this newer version. I did a sudo apt update and apt list boinc-client and apt show boinc-client and still come up with older 7.16.6 version. | |
| ID: 58356 | Rating: 0 | rate:
| |
|
I think they use a different PPA, not the standard Ubuntu version. | |
| ID: 58357 | Rating: 0 | rate:
| |
It's from http://ppa.launchpad.net/costamagnagianfranco/boinc/ubuntu
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
Sorry for the confusion. | |
| ID: 58358 | Rating: 0 | rate:
| |
I think they use a different PPA, not the standard Ubuntu version.
You're right. I've checked, and this is my complete repository listing. There are new pending updates for the BOINC package, but I've recently caught a new ACEMD3 ADRIA task, and I'm not updating until it is finished and reported. My experience warns that these tasks are highly prone to fail if something is changed while they are processing. | |
| ID: 58359 | Rating: 0 | rate:
| |
Which distro/repository are you using? Ah. Your reply here gave me a different impression. Slight egg on face, but both our Linux update manager screenshots fail to give source information in their consolidated update lists. Maybe we should put in a feature request? | |
| ID: 58360 | Rating: 0 | rate:
| |
|
ACEMD3 task finished on my original machine, so I updated BOINC from PPA 2022-01-30 to 2022-02-04. | |
| ID: 58361 | Rating: 0 | rate:
| |
|
Got a new task (task 32738148). Running normally, confirms the override to systemd is preserved.
wandb: WARNING Path /var/lib/boinc-client/slots/7/.config/wandb/wandb/ wasn't writable, using system temp directory
(we're back in slot 7 as usual)
There are six folders created in slot 7: agent_demos, gpugridpy, int_demos, monitor_logs, python_dependencies, ROMS. There are no hidden folders, and certainly no .config.
wandb data is in: /tmp/systemd-private-f670b90d460b4095a25c37b7348c6b93-boinc-client.service-7Jvpgh/tmp
There are 138 folders in there, including one called simply wandb. wandb contains: debug-internal.log, debug.log, latest-run, run-20220206_163543-1wmmcgi5. The first two are files, the last two are folders. There is no subfolder called wandb - so no recursion, such as the warning message suggests.
Hope that helps. | |
| ID: 58362 | Rating: 0 | rate:
| |
|
Thanks! the content of the slot directory is correct. | |
| ID: 58363 | Rating: 0 | rate:
| |
|
wandb: Run data is saved locally in /var/lib/boinc-client/slots/7/wandb/run-20220209_082943-1pdoxrzo | |
| ID: 58364 | Rating: 0 | rate:
| |
|
Great, thanks a lot for the confirmation. So now it seems the directory is the appropriate one. | |
| ID: 58365 | Rating: 0 | rate:
| |
|
Pretty happy to see that my little Quadro K620s could actually handle one of the ABOU work units. Successfully ran one in under 31 hours. It didn't hit the memory too hard, which helps. The K620 has a DDR3 memory bus so the bandwidth is pretty limited. Traceback (most recent call last): File "run.py", line 40, in <module> assert os.path.exists('output.coor') AssertionError 11:22:33 (1966061): ./gpugridpy/bin/python exited; CPU time 0.295254 11:22:33 (1966061): app exit status: 0x1 11:22:33 (1966061): called boinc_finish(195) | |
| ID: 58367 | Rating: 0 | rate:
| |
|
All tasks are erroring out on this machine: https://www.gpugrid.net/results.php?hostid=591484 | |
| ID: 58368 | Rating: 0 | rate:
| |
|
I got two of those yesterday as well. They are described as "Anaconda Python 3 Environment v4.01 (mt)" - declared to run as multi-threaded CPU tasks. I do have working GPUs (on host 508381), but I don't think these tasks actually need a GPU. | |
| ID: 58369 | Rating: 0 | rate:
| |
|
We were running those kind of tasks a year ago. Looks like the researcher has made an appearance again. | |
| ID: 58370 | Rating: 0 | rate:
| |
|
I just downloaded one, but it errored out before I could even catch it starting. It ran for 3 seconds, required four cores of a Ryzen 3950X on Ubuntu 20.04.3, and had an estimated time of 2 days. I think they have some work to do. | |
| ID: 58371 | Rating: 0 | rate:
| |
|
PPS - It ran for two minutes on an equivalent Ryzen 3950X running BOINC 7.16.6, and then errored out. | |
| ID: 58372 | Rating: 0 | rate:
| |
|
I just ran 4 of the Python CPU work units on my Ryzen 7 5800H, Ubuntu 20.04.3 LTS, 16 GB RAM. Each ran on 4 CPU threads at the same time. The first 0.6% took over 10 minutes, then they jumped to 10%, continued a while longer until about 17 minutes had passed, and then all errored out at more or less the same point in the task. Here is one example: 32743954 | |
| ID: 58373 | Rating: 0 | rate:
| |
|
A RAIMIS MT task - which accounts for the 4 threads. Run NVIDIA GeForce RTX 3060 Laptop GPU (4095MB) Traceback (most recent call last): | |
| ID: 58374 | Rating: 0 | rate:
| |
|
I am running two of the Anacondas now. They each reserve four threads, but are apparently only using one of them, since BoincTasks shows 25% CPU usage. | |
| ID: 58380 | Rating: 0 | rate:
| |
|
Hey Richard. To what extent is my GPU's memory involved in a CPU task? | |
| ID: 58381 | Rating: 0 | rate:
| |
Hey Richard. In how far is my GPU's memory involved in a CPU task? It shouldn't be - that's why I drew attention to it. I think both AbouH and RAIMIS are experimenting with different applications, which exploit both GPUs and multiple CPUs. It isn't at all obvious how best to manage a combination like that under BOINC - the BOINC developers only got as far as thinking about either/or, not both together. So far, Abou seems to have got further down the road, but I'm not sure how much further development is required. We watch and wait, and help where we can. | |
| ID: 58382 | Rating: 0 | rate:
| |
|
My first two Anacondas ended OK after 31 hours. But they were _2 and _3. | |
| ID: 58383 | Rating: 0 | rate:
| |
I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high. It stopped making progress after running for a day and reaching 26% complete, so I aborted it. I will wait until they fix things before jumping in again. But my results were different than the others, so maybe it will do them some good. | |
| ID: 58384 | Rating: 0 | rate:
| |
|
Hello everyone! I am sorry for the late reply. | |
| ID: 58417 | Rating: 0 | rate:
| |
|
Is this a record? 08/03/2022 17:57:22 | GPUGRID | Started download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325 08/03/2022 18:35:03 | GPUGRID | Finished download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325 08/03/2022 18:35:26 | GPUGRID | Starting task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 08/03/2022 18:36:21 | GPUGRID | [sched_op] Reason: Unrecoverable error for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 08/03/2022 18:36:21 | GPUGRID | Computation for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 finished Edit 2 - "application C:\Windows\System32\tar.exe missing". I can deal with that. Download from https://sourceforge.net/projects/gnuwin32/files/tar/ NO - that wasn't what it said it was. Looking again. | |
| ID: 58464 | Rating: 0 | rate:
| |
|
No, this isn't working. Apparently, tar.exe is included in Windows 10 - but I'm still running Windows 7/64, and a copy from a W10 machine won't run ("Not a valid Win32 application"). Giving up for tonight - I've got too much waiting to juggle. I'll try again with a clearer head tomorrow. | |
| ID: 58465 | Rating: 0 | rate:
| |
|
Yeah, the estimates must be astronomical, as I am at over 2 months of time left at 3/4 completion on 2 tasks. | |
| ID: 58466 | Rating: 0 | rate:
| |
|
No need to go back to the drawing board, in principle. Here is what is happening: | |
| ID: 58467 | Rating: 0 | rate:
| |
|
In this new version of the app, we send the whole conda environment in a compressed file ONLY ONCE, and unpack it on the machine. The conda environment is what weighs around 2.5 GB (depending on whether the machine has cuda10 or cuda11). However, as long as the environment remains the same, there will be no need to re-download it for every job. This is how the acemd app works. | |
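As a sketch of the idea (the file and directory names are assumptions taken from names seen in the logs; this is not the actual app code): the archive stays in the BOINC project directory and is only downloaded once, and the extraction step can be skipped whenever the environment is already in place.

import os
import tarfile

ENV_ARCHIVE = "windows_x86_64__cuda1131.tar.bz2"   # ~2.5 GB compressed conda environment
ENV_DIR = "gpugridpy"                              # where the environment lives once unpacked

if not os.path.isdir(ENV_DIR):
    # Pay the extraction cost only when the environment is not already present
    with tarfile.open(ENV_ARCHIVE, "r:bz2") as tar:
        tar.extractall(".")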
| ID: 58468 | Rating: 0 | rate:
| |
|
Some problems we are facing, as Richard mentioned: before W10 there is no tar.exe, and we also see
tar.exe: Error opening archive: Can't initialize filter; unable to run program "bzip2 -d"
In theory tar.exe is able to handle bzip2 files. We suspect it could be a problem with the PATH env variable (which we will test). Also, tar.gz could be a more compatible format for Windows. ____________ | |
| ID: 58469 | Rating: 0 | rate:
| |
|
Don't worry, it's only my own personal drawing board that I'm going back to! | |
| ID: 58470 | Rating: 0 | rate:
| |
|
Thank you very much! I will send a small batch of test jobs as soon as I can to check if for windows 10 the bzip2 error is caused by an erroneous PATH variable. And the next step will be trying with tar.gz as mentioned. | |
| ID: 58471 | Rating: 0 | rate:
| |
|
How about some checkpoints? I have a python task that was nearly completed; an ACEMD4 task downloaded next with something like an 8 billion day ETA. It interrupted the python task: 14 hours of work, and it went back to 10%. I only have a 0.05 day work queue on that client, so the python app was at least 95% complete. | |
| ID: 58472 | Rating: 0 | rate:
| |
|
Was it a PythonGPU task for Linux, mmonnin? I have checked your recent jobs; they seemed to be successful. | |
| ID: 58473 | Rating: 0 | rate:
| |
|
I have a python task for Linux running, recently started. CPU time 00:33:10 CPU time since checkpoint 00:01:33 Elapsed time 00:33:27 but that isn't the acid test: the question is whether it can read back the checkpoint data files when restarted. I'll pause it after a checkpoint, let the machine finish the last 20 minutes of the task it booted aside, and see what happens on restart. Sometimes BOINC takes a little while to update progress after a pause - you have to watch it, not just take the first figure you see. Results will be reported in task 32773760 overnight, but I'll post here before that. Edit - looks good so far: restart.chk, progress, run.log all present with a good timestamp. | |
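For reference, the sort of checkpoint/resume logic being exercised here looks roughly like this (a generic sketch, not the project's actual code or file formats):

import os
import pickle

CHECKPOINT = "restart.chk"

def save_checkpoint(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)        # atomic rename: a crash never leaves a half-written checkpoint

def load_checkpoint():
    if os.path.exists(CHECKPOINT):     # restart case: pick up where we left off
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0}            # fresh start

state = load_checkpoint()
for i in range(state["iteration"], 100):
    # ... one unit of work ...
    state["iteration"] = i + 1
    save_checkpoint(state)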
| ID: 58474 | Rating: 0 | rate:
| |
|
Perfect thanks! That it takes a little while to update progress after a pause, can happen. | |
| ID: 58475 | Rating: 0 | rate:
| |
However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing. Well, it was the only one I had in a suitable state for testing. And it's a good thing we checked. It appears that ACEMD4 in its current state (v1.03) does NOT handle checkpointing correctly. I suspended it manually at just after 10% complete: on restart, it wound back to 1% and started counting again from there. It's reached 2.980% as I type - four increments of 0.495. The run.log file (which we don't normally get a chance to see) has the ominous line # WARNING: removed an old file: output.xtc after a second set of startup details. Perhaps you could pass a message to the appropriate team? | |
| ID: 58476 | Rating: 0 | rate:
| |
|
I will. Thanks a lot for the feedback. | |
| ID: 58477 | Rating: 0 | rate:
| |
Perfect thanks! That it takes a little while to update progress after a pause, can happen. Yes it was linux. The % complete I saw was 100%, then a bit later 10% per BOINCTasks. Looking at the history on that PC it finished in 14:14 run time, just 11 minutes after the ACEMD4 tasks so it looks like it resumed properly. Thanks for checking. | |
| ID: 58478 | Rating: 0 | rate:
| |
|
OK, back on topic. Another of my Windows 7 machines has been allocated a genuine ABOU_pythonGPU_beta2 task (task 32779476), and I was able to suspend it before it even tried to run. I've been able to copy all the downloaded files into a sandbox to play with.
<task>
<application>C:\Windows\System32\tar.exe</application>
<command_line>-xvf windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
You don't need both a path statement and a hard-coded executable location. That may fail on a machine with non-standard drive assignments. It will certainly fail on this machine, because I still haven't been able to locate a viable tar.exe for Windows 7 (the Windows 10 executable won't run under Windows 7 - at least, I haven't found a way to make it run yet). I (and many other volunteers here) do have a freeware application called 7-Zip, and I've seen a suggestion that this may be able to handle the required decompression. I'll test that offline first, and if it works, I'll try to modify the job.xml file to use that instead. That's not a complete solution, of course, but it might give a pointer to the way forward. | |
| ID: 58479 | Rating: 0 | rate:
| |
|
OK, that works in principle. The 2.48 GB gz download decompresses to a single 4.91 GB tar file, and that in turn unpacks to 13,449 files in 632 folders. 7-Zip can handle both operations. | |
| ID: 58480 | Rating: 0 | rate:
| |
|
And it's worth a try. I'm going to split that task into two:
<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
I could have piped them, but - baby steps! I'm going to need to increase the disk allowance: 10 (decimal) GB isn't enough. | |
| ID: 58481 | Rating: 0 | rate:
| |
|
I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to system32 folder. | |
| ID: 58483 | Rating: 0 | rate:
| |
|
Damn. Where did that go wrong?
application C:\Windows\System32\tar.exe missing
Anyone else who wants to try this experiment can try https://www.7-zip.org/ - looks as if the license would even allow the project to distribute it.
Edit - I edited the job.xml file while the previous task was finishing, and then stopped BOINC to increase the disk limit. On restart, BOINC must have noticed that the file had changed, and it downloaded a fresh copy. Near miss. | |
| ID: 58484 | Rating: 0 | rate:
| |
|
application "C:\Program Files\7-Zip\7z" missing Make that "C:\Program Files\7-Zip\7z.exe" Or maybe not. application "C:\Program Files\7-Zip\7z.exe" missing Isn't the damn wrapper clever enough to remove the quotes I put in there to protect the space in "Program Files"? | |
| ID: 58485 | Rating: 0 | rate:
| |
|
Using tar.exe in W10 and W11 seems to work now. | |
| ID: 58486 | Rating: 0 | rate:
| |
|
On this particular Windows 7 machine, I have: | |
| ID: 58487 | Rating: 0 | rate:
| |
|
Yay! That's what I wanted to see: 17:49:09 (21360): wrapper: running C:\Program Files\7-Zip\7z.exe (x windows_x86_64__cuda1131.tar.gz) 7-Zip [64] 15.14 : Copyright (c) 1999-2015 Igor Pavlov : 2015-12-31 Scanning the drive for archives: 1 file, 2666937516 bytes (2544 MiB) Extracting archive: windows_x86_64__cuda1131.tar.gz And I've got v1.04 in my sandbox... | |
| ID: 58488 | Rating: 0 | rate:
| |
|
But not much more than that. After half an hour, it's got as far as: Everything is Ok Files: 13722 Size: 5270733721 Compressed: 5281648640 18:02:00 (21360): C:\Program Files\7-Zip\7z.exe exited; CPU time 6.567642 18:02:00 (21360): wrapper: running python.exe (run.py) WARNING: The script shortuuid.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. WARNING: The script normalizer.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. WARNING: The scripts wandb.exe and wb.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts. We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default. pytest 0.0.0 requires atomicwrites>=1.0, which is not installed. pytest 0.0.0 requires attrs>=17.4.0, which is not installed. pytest 0.0.0 requires iniconfig, which is not installed. pytest 0.0.0 requires packaging, which is not installed. pytest 0.0.0 requires py>=1.8.2, which is not installed. pytest 0.0.0 requires toml, which is not installed. aiohttp 3.7.4.post0 requires attrs>=17.3.0, which is not installed. WARNING: The scripts pyrsa-decrypt.exe, pyrsa-encrypt.exe, pyrsa-keygen.exe, pyrsa-priv2pub.exe, pyrsa-sign.exe and pyrsa-verify.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. WARNING: The script jsonschema.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. WARNING: The script gpustat.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. WARNING: The scripts ray-operator.exe, ray.exe, rllib.exe, serve.exe and tune.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts. We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default. pytest 0.0.0 requires atomicwrites>=1.0, which is not installed. pytest 0.0.0 requires iniconfig, which is not installed. pytest 0.0.0 requires py>=1.8.2, which is not installed. pytest 0.0.0 requires toml, which is not installed. WARNING: The script f2py.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 
wandb: W&B API key is configured (use `wandb login --relogin` to force relogin) wandb: Appending key for api.wandb.ai to your netrc file: D:\BOINCdata\slots\5/.netrc wandb: Currently logged in as: rl-team-upf (use `wandb login --relogin` to force relogin) wandb: Tracking run with wandb version 0.12.11 wandb: Run data is saved locally in D:\BOINCdata\slots\5\wandb\run-20220310_181709-mxbeog6d wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run MontezumaAgent_e1a12 wandb: View project at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta wandb: View run at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta/runs/mxbeog6d and doesn't seem to be getting any further. I'll see if it's moved on after dinner, might might abort it if it hasn't. Task is 32782603 | |
| ID: 58489 | Rating: 0 | rate:
| |
|
Then, lots of iterations of: OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\cudnn_cnn_train64_8.dll" or one of its dependencies. Traceback (most recent call last): File "<string>", line 1, in <module> File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 105, in spawn_main exitcode = _main(fd) File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 114, in _main prepare(preparation_data) File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 225, in prepare _fixup_main_from_path(data['init_main_from_path']) File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path run_name="__mp_main__") File "D:\BOINCdata\slots\5\lib\runpy.py", line 263, in run_path pkg_name=pkg_name, script_name=fname) File "D:\BOINCdata\slots\5\lib\runpy.py", line 96, in _run_module_code mod_name, mod_spec, pkg_name, script_name) File "D:\BOINCdata\slots\5\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "D:\BOINCdata\slots\5\run.py", line 23, in <module> import torch File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module> raise err I've increased it ten-fold, but that requires a reboot - and the task didn't survive. Trying one last time, then it's 'No new Tasks' for tonight. | |
| ID: 58490 | Rating: 0 | rate:
| |
|
BTW, yes - the wrapper really is that dumb. | |
| ID: 58491 | Rating: 0 | rate:
| |
|
I managed to complete 2 of these WUs successfully. They still need a lot of work done. You have low GPU usage, and they cause the boinc manager to be slow and sluggish and unresponsive. | |
| ID: 58492 | Rating: 0 | rate:
| |
I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to the system32 folder.
Disabling python beta on this W10 PC - another 11+ hours gone: https://www.gpugrid.net/result.php?resultid=32780319 | |
| ID: 58493 | Rating: 0 | rate:
| |
|
Yes, I have seen this error on some other machines that could unpack the file with tar.exe (just a few of them), so it is an issue in the python script. Today I will be looking into it. It does not happen on Linux with the same code. | |
| ID: 58494 | Rating: 0 | rate:
| |
|
Yes, regarding the workload, I have been testing the tasks with low GPU/CPU usage. I was interested in checking whether the conda environment was successfully unpacked and the python script was able to complete a few iterations. The workload will be increased as soon as this part works, as will the points. | |
| ID: 58495 | Rating: 0 | rate:
| |
|
Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter? | |
| ID: 58496 | Rating: 0 | rate:
| |
Yes, I have seen this error in some other machines that could unpack the file with tar.exe. In just a few of them. So it is an issue in the python script. Today I will be looking into it. It does not happen in linux with the same code. I was a bit suspicious about the 'paging file too small' error - I didn't even think Windows applications could get information about what the current setting was. I'd suggest correlating the machines with this error, with their reported physical memory. Mine is 'only' 8 GB - small by modern standards. It looks like there may be some useful clues in https://discuss.pytorch.org/t/winerror-1455-the-paging-file-is-too-small-for-this-operation-to-complete/131233 | |
| ID: 58497 | Rating: 0 | rate:
| |
Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter? That's certainly a part of it, but it's a very long, complicated, and historical story. It will affect any and all platforms, not just Windows, and other data as well as rsc_fpops_est. And it's also related to historical decisions by both BOINC and GPUGrid. I'll try and write up some bedtime reading for you, but don't waste time on it in the meantime - there won't be an easy 'magic bullet' to fix it. | |
| ID: 58498 | Rating: 0 | rate:
| |
|
Yes I was looking at the same link. Seems related to limited memory. I might try to run the suggested script before running the job, which seems to mitigate the problem. | |
| ID: 58499 | Rating: 0 | rate:
| |
|
Runtime estimation – and where it goes wrong | |
| ID: 58500 | Rating: 0 | rate:
| |
|
Thank you very much for the explanation Richard, very helpful actually. | |
| ID: 58501 | Rating: 0 | rate:
| |
Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
This approach is wrong. The rsc_fpops_est should be set accordingly for the actual batch of workunits, not for the app. As test batches are much shorter than production batches, they should have a much lower rsc_fpops_est value, regardless of the fact that the same app processes them. | |
| ID: 58502 | Rating: 0 | rate:
| |
Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
This approach is wrong.
Correct. Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from. In the meantime, we're working through a glut of ACEMD3 tasks, and here's how they arrive:
12/03/2022 08:23:29 | GPUGRID | [sched_op] NVIDIA GPU work request: 11906.64 seconds; 0.00 devices
So, I'm asking for a few hours of work, and getting several days. Or so BOINC says. This is Windows host 45218, which is currently showing "Task duration correction factor 13.714405". (It was higher a few minutes ago, when that work was fetched - over 13.84.)
I forgot to mention yesterday that in the first phase of BOINC's life, both your server and our clients took account of DCF, so the 'request' and 'estimated' figures would have been much closer. But when the APR code was added in 2010, the DCF code was removed from the servers. So your server knows what my DCF is, but it doesn't use that information.
So the server probably assessed that each task would last about 11,055 seconds. That's why it added the second task to the allocation: it thought the first one didn't quite fill my request for 11,906 seconds. In reality, this is a short-running batch - although not marked as such - and the last one finished in 4,289 seconds. That's why DCF is falling after every task, though slowly. | |
| ID: 58505 | Rating: 0 | rate:
| |
Yes, I have seen this error in some other machines that could unpack the file with tar.exe. In just a few of them. So it is an issue in the python script. Today I will be looking into it. It does not happen in linux with the same code. Having tar.exe wasn't enough. I later saw a popup in W10 saying archieveint.dll was missing. I had two python tasks in linux error out in ~30min with 15:33:14 (26820): task /usr/bin/flock reached time limit 1800 application ./gpugridpy/bin/python missing That PC has python 2.7.17 and 3.6.8 installed. | |
| ID: 58506 | Rating: 0 | rate:
| |
Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.
Caught one! Task e1a5-ABOU_pythonGPU_beta2_test16-0-1-RND7314_1
Host is 43404. Windows 7. It has two GPUs, and GPUGrid is set to run on the other one, not as shown. The important bits are
CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 472.12, CUDA version 11.4, compute capability 7.5, 4096MB, 3032MB available, 5622 GFLOPS peak)
DCF is 8.882342, and the task shows up as:
Why? This is what I got from the server, in the sched_reply file:
<app_version>
<app_name>PythonGPUbeta</app_name>
<version_num>104</version_num>
...
<flops>47361236228.648697</flops>
...
<workunit>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
...
1,000,000,000,000,000,000 fpops, at 47 GFLOPS, would take 21,114,313 seconds, or 244 days. Multiply in the DCF, and you get the 2170 days shown.
According to the application details page, this host has completed one 'Python apps for GPU hosts beta 1.04 windows_x86_64 (cuda1131)' task (new apps always go right down to the bottom of that page). It recorded an APR of 1279539, which is bonkers the other way - these are GFlops, remember. It must have been task 32782603, which completed in 781 seconds.
So, lessons to be learned:
1) A shortened test task, described as running for the full-run number of fpops, will register an astronomical speed. If anyone completes 11 tasks like that, that speed will get locked into the system for that host, and will cause the 'runtime limit exceeded' error.
2) BOINC is extremely bad - stupidly bad - at generating a first guess for the speed of a 'new application, new host' combination. It's actually taken precisely one-tenth of the speed of the acemd3 application on this machine, which might be taken as a "safe working assumption" for the time being. I'll try to check that in the server code.
Oooh - I've let it run, and BOINC has remembered how I set up 7-Zip decompression last week. That's nice. | |
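The arithmetic checks out; reproducing it with the values copied from the sched_reply above:

rsc_fpops_est = 1_000_000_000_000_000_000   # from the workunit
flops = 47_361_236_228.648697               # server's speed estimate for this app version
dcf = 8.882342                              # this host's duration correction factor

seconds = rsc_fpops_est / flops             # about 21,114,313 s
days = seconds / 86_400                     # about 244 days
print(round(seconds), round(days), round(days * dcf))   # 21114313 244 2171 - the ~2170 days shown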
| ID: 58508 | Rating: 0 | rate:
| |
|
But it hasn't remembered the increased disk limit. Never mind - nor did I. | |
| ID: 58509 | Rating: 0 | rate:
| |
|
Right now, the PythonGPU app works by dividing the job into 2 subtasks: first setting up the conda environment, then running the python script. The error
15:33:14 (26820): task /usr/bin/flock reached time limit 1800
means that after 1800 seconds, the conda environment had still not been created for some reason. This could be because the conda dependencies could not be downloaded in time, or because the machine was running the installation process more slowly than expected. We set this time limit of 30 mins because in theory it is plenty of time to create the environment.
However, in the new version (the current PythonGPUBeta), we send the whole conda environment compressed and simply unpack it on the machine. Therefore this error, which indeed still happens every now and then, should disappear. ____________ | |
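A rough sketch of that two-step structure, just for illustration (the command and file names below are assumptions, not the real job description; the point is the 1800-second cap on the setup step):

import subprocess

SETUP_CMD = ["bash", "install_environment.sh"]   # hypothetical environment-setup step
RUN_CMD = ["./gpugridpy/bin/python", "run.py"]   # main python payload, as seen in the task logs

try:
    # Give the environment setup at most 30 minutes, mirroring the flock limit above
    subprocess.run(SETUP_CMD, check=True, timeout=1800)
except subprocess.TimeoutExpired:
    raise SystemExit("environment setup did not finish within 30 minutes")

subprocess.run(RUN_CMD, check=True)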
| ID: 58510 | Rating: 0 | rate:
| |
|
ok, so my plan was to run at least a few more batches of test jobs. Then start the real tasks. | |
| ID: 58511 | Rating: 0 | rate:
| |
|
My gut feeling is that it would be better to deploy the finished app (after all testing seems to be complete) as a new app_version. We would have to go through the training process for APR one last time, but then it should settle down.
<flops>707593666701.291382</flops>
<flops>70759366670.129135</flops>
That must be deliberate. | |
| ID: 58512 | Rating: 0 | rate:
| |
Would it be better to create a new app for real jobs once the testing is finished?
Based on the last few days' discussion here, I've understood the purpose of the former short and long queues from GPUGrid's perspective: by separating the tasks into two queues based on their length, the project's staff didn't have to bother setting the rsc_fpops_est value for each and every batch (note that the same app was assigned to each queue). The two queues used different (but constant across batches) rsc_fpops_est values, so BOINC's runtime estimation could not drift so far off in either queue that it would trigger the "won't finish on time" or the "run time exceeded" situation. Perhaps this practice should be put into operation again, even at a finer level of granularity (S, M, L tasks, or even XS and XL tasks). | |
| ID: 58513 | Rating: 0 | rate:
| |
|
I am getting "Disk usage limit exceeded" error. | |
| ID: 58518 | Rating: 0 | rate:
| |
|
I believe the "Disk usage limit exceeded" error is not related to the machine resources, is defined by an adjustable parameter of the app. The conda environment + all the other files might be over this limit.I will review the current value, we might have to increase it. Thanks for pointing it out! | |
| ID: 58519 | Rating: 0 | rate:
| |
|
After a day out running a long acemd3 task, there's good news and bad news.
<flops>336636264786015.625000</flops>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
That ends up with an estimated runtime of about 9 hours - but at the cost of a speed estimate of 336,636 GFlops. That's way beyond a marketing department's dream. Either somebody has done open-heart surgery on the project's database (unlikely and unwise), or BOINC now has enough completed tasks for v1.05 to start taking notice of the reported values.
The bad news: I'm getting errors again.
ModuleNotFoundError: No module named 'gym' | |
| ID: 58524 | Rating: 0 | rate:
| |
|
v1.06 is released and working (very short test tasks only). | |
| ID: 58527 | Rating: 0 | rate:
| |
|
The latest version should fix this error. ModuleNotFoundError: No module named 'gym' ____________ | |
| ID: 58528 | Rating: 0 | rate:
| |
|
I have task 32836015 running - showing 50% after 30 minutes. That looks like it's giving the maths a good work-out. | |
| ID: 58529 | Rating: 0 | rate:
| |
|
For now I am just trying to see the jobs finish.. I am not even trying to make them run for a long time. Jobs should not even need checkpoints, should last less than 15 mins. | |
| ID: 58534 | Rating: 0 | rate:
| |
|
Err, this particular task is running on Linux - specifically, Mint v20.3 | |
| ID: 58536 | Rating: 0 | rate:
| |
|
This task https://www.gpugrid.net/result.php?resultid=32841161 has been running for nearly 26 hours now. It is the first Python beta task I have received that appears to be working. Green-With-Envy shows intermittent low activity on my 1080 GPU and BoincTasks shows 100% CPU usage. It checkpointed only once several minutes after it started and has shown 50% complete ever since. | |
| ID: 58537 | Rating: 0 | rate:
| |
|
Sounds just like mine, including the 100% CPU usage - that'll be the wrapper app, rather than the main Python app. | |
| ID: 58538 | Rating: 0 | rate:
| |
|
Well, after a suspend and allowing it to run, it went back to its checkpoint and has shown no progress since. I will abort it. Keep on learning.... | |
| ID: 58539 | Rating: 0 | rate:
| |
|
ok so it gets stuck at 50%. I will be reviewing it today. Thanks for the feedback. | |
| ID: 58540 | Rating: 0 | rate:
| |
|
Got a new one - the other Linux machine, but very similar. Looks like you've put some debug text into stderr.txt: 12:28:16 (482274): wrapper (7.7.26016): starting but nothing new has been added in the last five minutes. Showing 50% progress, no GPU activity. I'll give it another ten minutes or so, then try stop-start and abort if nothing new. Edit - no, no progress. Same result on two further tasks. All the quoted lines are written within about 5 seconds, then nothing. I'll let the machine do something else while I go shopping... Tasks for host 132158 | |
| ID: 58541 | Rating: 0 | rate:
| |
|
Ok so I have seen 3 main errors in the last batches: | |
| ID: 58549 | Rating: 0 | rate:
| |
|
We have updated to a new app version for windows that solves the following error:
application C:\Windows\System32\tar.exe missing
Now we send the 7z.exe (576 KB) file with the app, which allows the other files to be unpacked without relying on the host machine having tar.exe (which is only in Windows 11 and the latest builds of Windows 10). I just sent a small batch of short tasks this morning to test, and so far it seems to work. ____________ | |
| ID: 58550 | Rating: 0 | rate:
| |
|
Task 32868822 (Linux Mint GPU beta) | |
| ID: 58551 | Rating: 0 | rate:
| |
|
Do you know by chance if this same machine works fine with PythonGPU tasks even if it fails in the PythonGPUBeta ones? | |
| ID: 58552 | Rating: 0 | rate:
| |
|
Yes, it does. Most recent was: | |
| ID: 58553 | Rating: 0 | rate:
| |
|
I have also changed the approach a bit. | |
| ID: 58561 | Rating: 0 | rate:
| |
|
I've grabbed one. Will run within the hour. | |
| ID: 58562 | Rating: 0 | rate:
| |
|
I sent 2 batches, | |
| ID: 58563 | Rating: 0 | rate:
| |
|
Yes, I got the testing2. It's been running for about 23 minutes now, but I'm seeing the same as yesterday - nothing written to stderr.txt since: 09:29:18 (51456): wrapper (7.7.26016): starting and machine usage shows (full-screen version of that at https://i.imgur.com/Ly9Aabd.png) I've preserved the control information for that task, and I'll try to re-run it interactively in terminal later today - you can sometimes catch additional error messages that way. | |
| ID: 58564 | Rating: 0 | rate:
| |
|
Ok thanks a lot. Maybe then it is not the python script but some of the dependencies. | |
| ID: 58565 | Rating: 0 | rate:
| |
|
OK, I've aborted that task to get my GPU back. I'll see what I can pick out of the preserved entrails, and let you know. | |
| ID: 58566 | Rating: 0 | rate:
| |
|
Sorry, ebcak. I copied all the files, but when I came to work on them, several turned out to be BOINC softlinks back to the project directory, where the original file had been deleted. So the fine detail had gone. | |
| ID: 58568 | Rating: 0 | rate:
| |
|
The past several tasks have gotten stuck at 50% for me as well. Today one has made it past to 57.7% now in 8hours. 1-2% GPU util on 3070Ti. 2.5 CPU threads per BOINCTasks. 3063mb memory per nvidia-smi and 4.4GB per BOINCTasks. | |
| ID: 58569 | Rating: 0 | rate:
| |
|
I updated the app. Tested it locally and works fine on Linux. | |
| ID: 58571 | Rating: 0 | rate:
| |
|
Got a couple on one of my Windows 7 machines. The first - task 32875836 - completed successfully, the second is running now. | |
| ID: 58572 | Rating: 0 | rate:
| |
|
nice to hear it! lets see what happens on linux.. so weird if it only works in some machines and gets stuck in others... | |
| ID: 58573 | Rating: 0 | rate:
| |
nice to hear it! lets see what happens on linux.. so weird if it only works in some machines and gets stuck in others... Worse is to follow, I'm afraid. task 32875988 started immediately after the first one (same machine, but a different slot directory), but seems to have got stuck. I now seem to have two separate slot directories: Slot 0, where the original task ran. It has 31 items (3 folders, 28 files) at the top level, but the folder properties says the total (presumably expanding the site-packages) is 49 folders, 257 files, 3.62 GB Slot 5, allocated to the new task. It has 93 items at the top level (12 folders, including monitor_logs, and the rest files). This one looks the same as the first one did, while it was actively running the first task. This one has 14 files in the train directory - I think the first only had 4. This slot also has a stderr file, which ends with multiple repetitions of Traceback (most recent call last): File "<string>", line 1, in <module> File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 116, in spawn_main exitcode = _main(fd, parent_sentinel) File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 126, in _main self = reduction.pickle.load(from_parent) File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\__init__.py", line 1, in <module> from pytorchrl.agent.env.vec_env import VecEnv File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\vec_env.py", line 1, in <module> import torch File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module> raise err OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\shm.dll" or one of its dependencies. I'm going to try variations on a theme of - clear the old slot manually - pause and restart the task - stop and restart BOINC - stop and retsart Windows I'll report back what works and what doesn't. | |
| ID: 58574 | Rating: 0 | rate:
| |
|
Well, that was interesting. The files in slot 0 couldn't be deleted - they were locked by a running app 'python' - which is presumably why BOINC hadn't cleaned the folder when the first task finished. | |
| ID: 58575 | Rating: 0 | rate:
| |
|
Well this beta WU was a weird one: | |
| ID: 58576 | Rating: 0 | rate:
| |
|
Interesting that sometimes jobs work and sometimes get stuck in the same machine. | |
| ID: 58577 | Rating: 0 | rate:
| |
|
I've just had task 32876361 fail on a different, but identical, Windows machine. This time, it seems to be explicitly, and simply, a "not enough memory" error - these machines only have 8 GB, which was fine when I bought them. I've suspended the beta programme for the time being, and I'll try to upgrade them. | |
| ID: 58578 | Rating: 0 | rate:
| |
|
Another "Disk usage limit exceeded" error: | |
| ID: 58581 | Rating: 0 | rate:
| |
|
After having some errors with recent python app betas, task 32876819 ran without error on a RTX3070 Mobile under Win 11. | |
| ID: 58582 | Rating: 0 | rate:
| |
|
These tasks seem to run much better on my machines if I allocate 6 CPUs (threads) to each task. I managed to run one by itself and watched the performance monitor for CPU usage. During the initiation phase (about 5 minutes), the task used ~6 CPUs (threads). After the initiation phase, the CPU usage oscillated between ~2 and ~5 threads. The task ran very quickly and has been validated. Please let me know if you have questions. | |
| ID: 58588 | Rating: 0 | rate:
| |
|
Thanks a lot for the feedback: | |
| ID: 58590 | Rating: 0 | rate:
| |
|
Last batches seem to be working successfully both in Linux and Windows, and also for GPUs with cuda 10 and cuda 11. | |
| ID: 58591 | Rating: 0 | rate:
| |
It was reported that the reason was that the Python was not finishing correctly between jobs so I added a few changes in the code to try to solve this issue. Well, that was one report of one task on one machine with limited memory. It seemed be a case that, if it happened, caused problems for the following task. It's certainly worth looking at, and if it prevents some tasks failing - great. But I'd be cautious about assuming that it was the problem in all cases. | |
| ID: 58592 | Rating: 0 | rate:
| |
I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round. I haven't gotten a new beta yet so I will shut off all GPU work with other projects to hopefully get some and help resolve this issue. | |
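One simple way to do that kind of clean-up (a sketch under the assumption that the workers are plain multiprocessing children; not necessarily how the app will do it):

import atexit
import multiprocessing as mp

def _kill_children():
    # Terminate any worker processes still alive when the main program exits,
    # so no orphaned python processes are left holding the BOINC slot.
    for child in mp.active_children():
        child.terminate()
        child.join(timeout=10)

atexit.register(_kill_children)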
| ID: 58593 | Rating: 0 | rate:
| |
|
One other afterthought re that WU. I had checked my status page here prior to aborting the task. It indicated the task was still in progress, so no disposition was assigned to the files that I presume were sent back sometime in the past (since the slot was empty). Wonder where they went? | |
| ID: 58594 | Rating: 0 | rate:
| |
|
Can anybody explain the credit policy, please? | |
| ID: 58597 | Rating: 0 | rate:
| |
|
Please note that other users can't see your entire task list by userid - that's a privacy policy common to all BOINC projects. | |
| ID: 58598 | Rating: 0 | rate:
| |
|
For some reason I haven't been able to snag any of the Python beta tasks lately. | |
| ID: 58599 | Rating: 0 | rate:
| |
|
The credit awarded is proportional to the amount of compute required to complete each task, as with acemd3. | |
| ID: 58600 | Rating: 0 | rate:
| |
|
Batches of both pythonGPU and pythonGPUBeta are being sent out this week. Hopefully pythonGPUBeta task will run without issues. | |
| ID: 58601 | Rating: 0 | rate:
| |
|
So far some run well, while others have run for 2 or 3 days. | |
| ID: 58602 | Rating: 0 | rate:
| |
|
Looks like the standard BOINC mechanism of complain in a post on the forums on some topic and the BOINC genies grant your wish. | |
| ID: 58603 | Rating: 0 | rate:
| |
|
I have serious problems with my other machine running 1080Ti. | |
| ID: 58604 | Rating: 0 | rate:
| |
I have serious problems with my other machine running 1080Ti.
You can try changing the driver back and see; that's an easy troubleshooting step. It's definitely possible that it's the driver. But you seem to be having an issue with the ACEMD3 tasks, and this thread is about the Python tasks. ____________ | |
| ID: 58605 | Rating: 0 | rate:
| |
|
Sorry for posting wrong thread. | |
| ID: 58606 | Rating: 0 | rate:
| |
|
I've had no problems with their CUDA ACEMD3 app. it's been very stable across many data sets. all of the issues raised in this thread are in regards to the Python app that's still in testing/beta. problems are to be expected. | |
| ID: 58607 | Rating: 0 | rate:
| |
|
bcavnaugh wrote: ... For now I am waiting for new 3 & 4 on two of my hosts, it is a real bummer that our hosts have to sit for days on end without getting any tasks.
You say it, indeed :-( Obviously, ACEMD has very low priority at GPUGRID these days :-( | |
| ID: 58608 | Rating: 0 | rate:
| |
|
Beta is still having issues with establishing the correct Python environment. | |
| ID: 58609 | Rating: 0 | rate:
| |
|
thanks, this is solved now. A new batch is running without this issue. | |
| ID: 58613 | Rating: 0 | rate:
| |
|
There are still a few old tasks around. I got the _9 (and hopefully final) issue of WU 27184379 from 19 March. It registered the 51% mark but hasn't moved on in over 3 hours: I'm afraid it's going the same way as all previous attempts. | |
| ID: 58614 | Rating: 0 | rate:
| |
|
Yes, I am still getting the bad work unit resends. | |
| ID: 58615 | Rating: 0 | rate:
| |
|
New tasks today. | |
| ID: 58616 | Rating: 0 | rate:
| |
|
Same here today. | |
| ID: 58617 | Rating: 0 | rate:
| |
|
Same. | |
| ID: 58618 | Rating: 0 | rate:
| |
|
Thanks for the feedback. I will look into it today. | |
| ID: 58619 | Rating: 0 | rate:
| |
In which OS? These were "Python apps for GPU hosts v4.01 (cuda1121)", which is Linux only. | |
| ID: 58621 | Rating: 0 | rate:
| |
|
Right, I just saw it browsing through the failed jobs. It seems the issue is in the PythonGPU app, not in PythonGPUBeta. | |
| ID: 58622 | Rating: 0 | rate:
| |
|
The current version of PythonGPUBeta has been copied to PythonGPU | |
| ID: 58624 | Rating: 0 | rate:
| |
|
Well this is interesting to read. | |
| ID: 58625 | Rating: 0 | rate:
| |
|
The size for all the app files (including the compressed environment) are: | |
| ID: 58634 | Rating: 0 | rate:
| |
|
Also, the PythonGPU app version used in the new jobs should be 402 (or 4.02). | |
| ID: 58635 | Rating: 0 | rate:
| |
|
I have e1a46-ABOU_rnd_ppod_avoid_cnn_3-0-1-RND3588_4 running under Linux. I can confirm that my task (and its four predecessors) are running with the v4.02 app. | |
| ID: 58636 | Rating: 0 | rate:
| |
|
Thanks a lot for the info Richard! | |
| ID: 58637 | Rating: 0 | rate:
| |
|
I'd say 1%::99%, but thanks. | |
| ID: 58638 | Rating: 0 | rate:
| |
|
Uploaded and reported with no problem at all. | |
| ID: 58639 | Rating: 0 | rate:
| |
|
has the allowed limit changed to 30,000,000,000 bytes? | |
| ID: 58640 | Rating: 0 | rate:
| |
|
Appears so. | |
| ID: 58641 | Rating: 0 | rate:
| |
The size for all the app files (including the compressed environment) are:
Note: I was commenting on the Rosetta@home CPU pythons. What yours do, I don't know. I guess I had better add your project and see what happens. I re-added your project to my system, so if I am home when a task is sent out, I'll have a look. | |
| ID: 58642 | Rating: 0 | rate:
| |
|
Thank you! | |
| ID: 58643 | Rating: 0 | rate:
| |
|
Testing was successful, so we can add the weights to the PythonGPU app job.xml file | |
| ID: 58644 | Rating: 0 | rate:
| |
|
abouh, | |
| ID: 58655 | Rating: 0 | rate:
| |
You can delete the previous post about ACEMD3. I posted that incorrectly here.
Some forums let you put in a double space or a double period to delete your own post, but you must still do it within the editing time. | |
| ID: 58666 | Rating: 0 | rate:
| |
|
Mikey, I know. But the time limit expired on that post to edit it. I came back days later not within the 30-60 minutes allowed. | |
| ID: 58669 | Rating: 0 | rate:
| |
|
I am now running a Python task. It has a very low usage of my GPU most often around 5 to 10%, occasionally getting up to 20%. Is this normal? Should I wait until I move my GPU from an old 3770K to a 12500 computer for better CPU capabilities to do these tasks? | |
| ID: 58672 | Rating: 0 | rate:
| |
|
This is normal for Python on GPU tasks. The tasks run on both the CPU and GPU during different parts of the computation, for the inferencing and machine learning segments.
Cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase. It is correct. | |
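For anyone curious why the load alternates, here is a bare-bones sketch of a reinforcement-learning iteration (illustrative only, not the project's code): data collection is CPU-bound, the gradient updates are GPU-bound, and the two phases repeat.

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def collect_rollout(steps=2048):
    # CPU phase: stepping the environment; random tensors stand in for env.step() output
    obs = torch.randn(steps, 8)
    returns = torch.randn(steps, 4)
    return obs, returns

for iteration in range(10):
    obs, returns = collect_rollout()                 # GPU mostly idle here
    obs, returns = obs.to(device), returns.to(device)
    for epoch in range(4):                           # GPU busy here
        loss = ((policy(obs) - returns) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()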
| ID: 58673 | Rating: 0 | rate:
| |
|
Sorry for the late reply Greg _BE, I hid the ACEMD3 posts. | |
| ID: 58674 | Rating: 0 | rate:
| |
|
New tasks being issued this morning, allocated to the old Linux v4.01 'Python app for GPU hosts' issued in October 2021. | |
| ID: 58675 | Rating: 0 | rate:
| |
I am not sure, but maybe one problem is that we ask only for 0.987 CPUs, since that was ideal for ACEMD jobs. In reality Python tasks use more. I will look into it. Asking for 1.00 CPUs (or above) would make a significant difference, because that would prompt the BOINC client to reduce the number of tasks being run for other projects. It would be problematic to increase the CPU demand above 1.00, because the CPU loading is dynamic - BOINC has no provision for allowing another project to utilise the cycles available during periods when the GPUGrid app is quiescent. Normally, a GPU app is given a higher process priority for CPU usage than a pure CPU app, so the operating system should allocate resources to your advantage, but that can be problematic when the wrapper app is in use. That was changed recently: I'll look into the situation with your server version and our current client versions. | |
| ID: 58676 | Rating: 0 | rate:
| |
|
Definitely only the latest version 403 should be sent. Thanks for letting us know. | |
| ID: 58677 | Rating: 0 | rate:
| |
|
BOINC GPU apps, wrapper apps, and process priority | |
| ID: 58678 | Rating: 0 | rate:
| |
|
We have deprecated v4.01.
All are failing with "ModuleNotFoundError: No module named 'yaml'".
That should not happen any more, and all jobs should use v4.03. ____________ | |
| ID: 58696 | Rating: 0 | rate:
| |
|
abouh, | |
| ID: 58752 | Rating: 0 | rate:
| |
But here is something interesting, the CPU value according to BOINC Tasks is 221%!
Because the task was actually using a little more than two cores to process the work. That is why I have set Python tasks to allocate 3 CPU threads for BOINC scheduling. | |
| ID: 58753 | Rating: 0 | rate:
| |
But here is something interesting, the CPU value according to BOINC Tasks is 221%! OK... interesting, but what accounts for the lack of progress in 30 minutes on this task that I just killed, and the exit child error and blow-up on the previous Python? I mean really... stuck at 7.88%, to two decimal places, for more than 30 minutes? I don't know of any project that can't advance even 1/100th of a percent in 30 minutes. I've seen my share of slow tasks in other projects, but this one... wow.... And how do you go about setting just python to 3 cpu cores? That's beyond my knowledge level. | |
| ID: 58754 | Rating: 0 | rate:
| |
|
You use an app_config.xml file in the project like this: | |
| ID: 58755 | Rating: 0 | rate:
| |
You use an app_config.xml file in the project like this: Ok thanks. I will make that file tomorrow or this weekend. Too tired to try that tonight. | |
| ID: 58762 | Rating: 0 | rate:
| |
We have deprecated v4.01 I've recently reset the Gpugrid project on every one of my hosts, but I've still received v4.01 on several of them, and they failed with the mentioned error. Some subsequent v4.03 resends for the same tasks have eventually succeeded on other hosts. | |
| ID: 58767 | Rating: 0 | rate:
| |
|
Unfortunately the admins never yanked the malformed tasks from distribution. | |
| ID: 58768 | Rating: 0 | rate:
| |
|
Sorry for the late reply Greg _BE, I was away for the last 5 days. Thank you very much for the detailed report.

1. Exit status 195 (0xc3) EXIT_CHILD_FAILED

Seems like the process failed after raising the exception "The wandb backend process has shutdown". wandb is the python package we use to send out logs about the agent training process. It provides useful information to better understand the task results. Seems like the process failed and then the whole task got stuck; that is why no progress was being made. Since it reached 7.88% progress, I assume it worked well until then. I need to review other jobs to see why this could be happening and if it happened on other machines. We had not detected this issue before. Thanks for bringing it up.

----------

2. Time estimation is not right for now due to the way BOINC calculates it. Richard provided a very complete explanation in a previous post. We hope it will improve over time... for now, be aware that it is completely wrong.

----------

3. Regarding this error:

OSError: [WinError 1455] The paging file is too small for this operation to complete

It is related to using pytorch on windows. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback

We are applying this solution to mitigate the error, but for now it cannot be eliminated completely. ____________ | |
| ID: 58770 | Rating: 0 | rate:
| |
|
Seems like deprecating the version v4.01 did not work then... I will check if there is anything else we can do to enforce usage of v4.03 over the old one. | |
| ID: 58771 | Rating: 0 | rate:
| |
|
You need to send a message to all hosts when they connect to the scheduler to delete the 4.01 application from the host physically and to delete the entry in the client_state.xml file | |
| ID: 58772 | Rating: 0 | rate:
| |
|
I sent a batch which will fail with

yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy.core.multiarray.scalar'

It is just an error with the experiment configuration. I immediately cancelled the experiment and fixed the configuration, but the tasks were already sent. I am very sorry for the inconvenience. Fortunately the jobs will fail right after starting, so there is no need to kill them. Another batch contains jobs with the fixed configuration. ____________ | |
| ID: 58773 | Rating: 0 | rate:
| |
|
I was not getting too many of the python work units, but I recently received/completed one. I know they take... a while to complete. | |
| ID: 58774 | Rating: 0 | rate:
| |
|
I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours. | |
| ID: 58775 | Rating: 0 | rate:
| |
I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours. Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks? | |
| ID: 58776 | Rating: 0 | rate:
| |
I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours. these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are then it's probably starved on CPU resources trying to run the Python task. ____________ | |
| ID: 58777 | Rating: 0 | rate:
| |
these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are then it's probably starved on CPU resources trying to run the Python task. The critical point being that they aren't declared to BOINC as needing multiple cores, so BOINC doesn't automatically clear extra CPU space for them to run in. | |
| ID: 58778 | Rating: 0 | rate:
| |
|
Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side | |
| ID: 58779 | Rating: 0 | rate:
| |
|
yes, the tasks run 32 agent environments in parallel python processes. Definitely the bottleneck could be the CPU because BOINC is not aware of it. | |
| ID: 58780 | Rating: 0 | rate:
| |
|
Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory? | |
| ID: 58781 | Rating: 0 | rate:
| |
Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory? Only if you have more than 64 threads per GPU available and you stop processing of any existing CPU work. ____________ | |
| ID: 58782 | Rating: 0 | rate:
| |
|
abouh asked: Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side

I tried that, but the boinc manager on my pc will overallocate CPUs. I am currently running multicore atlas cpu tasks from lhc alongside the python tasks from gpugrid. The atlas tasks are set to use 8 CPUs and the python tasks are set to use 10 CPUs. The example for this response is on an AMD cpu with 8 cores/16 threads. BOINC is set to use 15 threads. It will run one gpugrid python 10-thread task and one lhc 8-thread task at the same time. That is 18 threads running on a 15-thread cpu.

Here is my app_config for gpugrid:

<app_config>
  <app>
    <name>acemd3</name>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>PythonGPU</name>
    <cpu_usage>10</cpu_usage>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>10</cpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <plan_class>cuda1121</plan_class>
      <avg_ncpus>10</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 10</cmdline>
    </app_version>
  </app>
  <app>
    <name>PythonGPUbeta</name>
    <cpu_usage>10</cpu_usage>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>10</cpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <plan_class>cuda1121</plan_class>
      <avg_ncpus>10</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 10</cmdline>
    </app_version>
  </app>
  <app>
    <name>Python</name>
    <cpu_usage>10</cpu_usage>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>10</cpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <plan_class>cuda1121</plan_class>
      <avg_ncpus>10</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 10</cmdline>
    </app_version>
  </app>
  <app>
    <name>acemd4</name>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

And here is my app_config for lhc:

<app_config>
  <app>
    <name>ATLAS</name>
    <cpu_usage>8</cpu_usage>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>8</avg_ncpus>
    <cmdline>--nthreads 8</cmdline>
  </app_version>
</app_config>

If anyone has any suggestions for changes to the app_config files, please let me know. | |
| ID: 58783 | Rating: 0 | rate:
| |
|
I can run 2 jobs manually on my machine with 12 CPUs, in parallel. They are slower than a single job, but much faster than running them sequentially. | |
| ID: 58785 | Rating: 0 | rate:
| |
However, I think currently GPUGrid automatically assigns one job per GPU, with the environment variable GPU_DEVICE_NUM. Normally, the user's BOINC client will assign the GPU device number, and this will be conveyed to the job by the wrapper. You can easily run two jobs per GPU (both with the same device number), and give them both two full CPU cores each, by using an app_config.xml file including

...
<gpu_versions>
  <gpu_usage>0.5</gpu_usage>
  <cpu_usage>2.0</cpu_usage>
</gpu_versions>
...

(full details in the user manual) | |
| ID: 58786 | Rating: 0 | rate:
| |
|
I see, thanks for the clarification | |
| ID: 58788 | Rating: 0 | rate:
| |
|
I guess I am going to have to give up on this project. | |
| ID: 58789 | Rating: 0 | rate:
| |
|
This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs were already sent and could not be cancelled. They crash after starting to run, but it is just this batch. The following batches work normally. | |
| ID: 58790 | Rating: 0 | rate:
| |
This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs were already sent and could not be cancelled. They crash after starting to run, but it is just this batch. The following batches work normally. OK... waiting in line for the next batch. | |
| ID: 58791 | Rating: 0 | rate:
| |
|
I am still attempting to diagnose why these tasks are taking the system so long to complete. I changed the config to "reserve" 32 cores for these tasks. I did also make a change so I have two of these tasks running simultaneously- I am not clear on these tasks and multithreading. The system running them has 56 physical cores across two CPUs (112 logical). Are the "32" cores used for one of these tasks physical or logical? Also, I am relatively confident the GPUs can handle this (RTX A6000) but let me know if I am missing something. | |
| ID: 58830 | Rating: 0 | rate:
| |
|
Why do you think the tasks are running abnormally long? | |
| ID: 58831 | Rating: 0 | rate:
| |
Why do you think the tasks are running abnormally long? They should be put back into the beta category. They still have too many bugs and need more work. It looks like someone was in a hurry to leave for summer vacation. I decided to stop crunching them, for now. Of course, there isn't much to crunch here anyway, right now. There is always next fall to fix this..................... | |
| ID: 58832 | Rating: 0 | rate:
| |
Are you being confused by the cpu and gpu runtimes on the task? They are declared to use less than 1 CPU (and that's all BOINC knows about), but in reality they use much more. This website confuses matters by mis-reporting the total (summed over all cores) CPU time as the elapsed time. The only way to be exactly sure what has happened is to examine the job_log_[GPUGrid] file on your local machine. The third numeric column ('ct ...') is the total CPU time, summed over all cores: the penultimate column ('et ...') is the elapsed - wall clock - time for the task as a whole. Locally, ct will be above et for the task as a whole, but on this website, they will be reported as the same. | |
| ID: 58833 | Rating: 0 | rate:
| |
|
I'm not having any issues with them on Linux. I don't know how that compares to Windows hosts. | |
| ID: 58834 | Rating: 0 | rate:
| |
|
The 32 "cores" are logical: they are python processes running in parallel. I can run them locally on a 12 CPU machine. The GPU should be fine as well, so you are correct about that. | |
| ID: 58844 | Rating: 0 | rate:
| |
|
We decided to remove the beta flag from the current version of the python app when we found it to work without errors on a reasonable number of hosts. We are aware that, even though we test it on our local linux and windows machines, there is a vast variety of configurations, versions and resource capabilities among the hosts, and it will not work on all of them. | |
| ID: 58845 | Rating: 0 | rate:
| |
|
I'm away from my machines at the moment, but can confirm that's the case. | |
| ID: 58846 | Rating: 0 | rate:
| |
|
I am not sure about the acemd tasks, but for python tasks, I will increase the amount of tasks progressively. | |
| ID: 58847 | Rating: 0 | rate:
| |
|
Thanks for this info. Here is the log file for a recently completed task: | |
| ID: 58848 | Rating: 0 | rate:
| |
Thanks for this info. Here is the log file for a recently completed task: No. That is incorrect. You cannot use the clocktime reported in the task. That will accumulate over however many cpu threads the task is allowed to show to BOINC. Blame BOINC for this issue not the application. Look at the sent time and the returned time to calculate how long the task actually took to process. Returned time minus the sent time = length of time to process. | |
| ID: 58853 | Rating: 0 | rate:
| |
|
BOINC just does not know how to account for these Python tasks which act "sorta" like an MT task. | |
| ID: 58855 | Rating: 0 | rate:
| |
1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

Actually, that line (from the client job log) is a useful source of information. It contains both

ct 3544023.000000

which is the CPU or core time - as you say, it dates back to the days when CPUs only had one core, but now it comprises the sum over however many cores are used - and

et 117973.295733

That's the elapsed time (wallclock measure), which was added when GPU computing was first introduced and cpu time was no longer a reliable indicator of work done. I agree that many outdated legacy assumptions remain active in BOINC, but I think it's got beyond the point where mere tinkering could fix it - we really need a full Mark 2 rewrite. But that seems unlikely under the current management. | |
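To save anyone doing the comparison by hand, here is a short Python sketch (my own, not part of the project or of BOINC) that parses such job_log lines and compares ct with et. The file name is an assumption - point it at whatever your BOINC data directory calls the GPUGrid job log (typically job_log_www.gpugrid.net.txt).

    # Minimal sketch: compare CPU time (ct) with elapsed wall-clock time (et)
    # for each task recorded in a BOINC client job_log file.
    from pathlib import Path

    def parse_job_log(path):
        """Yield one dict per job_log line (keys: ue, ct, fe, nm, et, es)."""
        for line in Path(path).read_text().splitlines():
            fields = line.split()
            if len(fields) < 3:
                continue
            rec = {"time": int(fields[0])}
            # after the timestamp, the line is a sequence of key/value pairs
            rec.update(dict(zip(fields[1::2], fields[2::2])))
            yield rec

    if __name__ == "__main__":
        for rec in parse_job_log("job_log_www.gpugrid.net.txt"):
            ct = float(rec["ct"])   # CPU time summed over all cores
            et = float(rec["et"])   # elapsed (wall clock) time
            print(f"{rec['nm']}: elapsed {et/3600:.1f} h, "
                  f"CPU {ct/3600:.1f} h, ~{ct/et:.1f} cores on average")

For the line quoted above, that gives roughly 32.8 hours elapsed and about 30 cores' worth of CPU time on average.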
| ID: 58856 | Rating: 0 | rate:
| |
|
OK, so here is a back of the napkin calculation on how long the task actually took to crunch | |
| ID: 58858 | Rating: 0 | rate:
| |
|
Well, since there's also a 'nm' (name) field in the client job log, we can find the rest:

04:44:21 (34948): .\7za.exe exited; CPU time 9.890625
13:32:28 (7456): wrapper (7.9.26016): starting

(that looks like a restart)

Then some more of the same, and finally

14:41:51 (28304): python.exe exited; CPU time 2816214.046875 | |
| ID: 58859 | Rating: 0 | rate:
| |
|
| |
| ID: 58860 | Rating: 0 | rate:
| |
That is what I am confused about. I can tell you that these calculations of time seem accurate- it was somewhere around 24 hours that it was actually running. Also, the CPU was running closer to 3.1Ghz (boost). It barely pushed the GPU when running. Nothing changed with time when I reserved 32 cores for these tasks. I really can't nail down the issue. | |
| ID: 58861 | Rating: 0 | rate:
| |
|
As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase." (message 58590). Any instantaneous observation won't reveal the full situation: either CPU will be high, and GPU low, or vice-versa. | |
| ID: 58862 | Rating: 0 | rate:
| |
|
Yep - I observe the alternation. When I suspend all other work units, I can see that just one of these tasks will use a little more than half of the logical processors. I know it has been discussed that although it says it uses 1 processor (or 0.996, to be exact), it actually uses more. I am running E@H work units and I think that running both is choking the CPU. Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit. | |
| ID: 58863 | Rating: 0 | rate:
| |
As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase."

This can be very well noticed graphically in the following two images.

Higher CPU - lower GPU usage cycle:

Higher GPU - lower CPU usage cycle:

CPU and GPU usage graphs follow an anti-cyclical pattern. | |
| ID: 58864 | Rating: 0 | rate:
| |
Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit. No, there isn't, as a user. These are not real MT tasks, or any form that BOINC recognizes and provides configuration options for. Your only solution is to run one at a time via a max_concurrent statement in an app_config.xml file, and then also restrict the number of cores allowed to be used by your other projects. That said, I don't know why you are having such difficulties. Maybe chalk it up to Windows, I don't know. I run 3 other cpu projects at the same time as I run the GPUGrid Python on GPU tasks, with 28-46 cpu cores being occupied by Universe, TN-Grid or yoyo depending on the host. Every host primarily runs Universe as the major cpu project. No impact on the python tasks while running the other cpu apps. | |
| ID: 58865 | Rating: 0 | rate:
| |
No impact on the python tasks while running the other cpu apps. Conversely, I notice a performance loss on other CPU tasks when python tasks are in execution. Yesterday I processed python task e7a30-ABOU_rnd_ppod_demo_sharing_large-0-1-RND2847_2 on my host #186626. It was received at 11:33 UTC, and the result was returned at 22:50 UTC. During the same period, PrimeGrid PPS-MEGA CPU tasks were also being processed. The average processing time for eighteen (18) PPS-MEGA CPU tasks was 3098.81 seconds. The average processing time for 18 other PPS-MEGA CPU tasks processed outside that period was 2699.11 seconds. This represents an extra processing time of about 400 seconds per task, or about a 12.9% performance loss. There is no such noticeable difference when running Gpugrid ACEMD tasks. | |
| ID: 58866 | Rating: 0 | rate:
| |
|
I also notice an impact on my running Universe tasks. Generally adds 300 seconds to the normal computation times when running in conjunction with a python task. | |
| ID: 58867 | Rating: 0 | rate:
| |
|
Windows 10 machine running task 32899765. Had a power outage. When the power came back on, the task was restarted but just sat there doing nothing. The stderr.txt file showed a prompt about the file pythongpu_windows_x86_64__cuda102.tar, and the task was stalled waiting on a response.

BOINC was stopped and the pythongpu_windows_x86_64__cuda102.tar file was removed from the slots folder. The computer was restarted, then the task was restarted. Then the following error message appeared several times in the stderr.txt file:

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.

Page file size was increased to 64000MB and the machine rebooted. Started the task again and still got the error message about the page file size being too small. Then the task abended. If you need more info about this task, please let me know. | |
| ID: 58871 | Rating: 0 | rate:
| |
|
Thank you captainjack for the info. (Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit? The job command line is the following: 7za.exe pythongpu_windows_x86_64__cuda102.tar -y and I got from the application documentation (https://info.nrao.edu/computing/guide/file-access-and-archiving/7zip/7z-7za-command-line-guide): 7-Zip will prompt the user before overwriting existing files unless the user specifies the -y So essentially -y assumes "Yes" on all Queries. Honestly I am confused by this behaviour, thanks for pointing it out. Maybe I am missing the x, as in 7za.exe x pythongpu_windows_x86_64__cuda102.tar -y I will test it on the beta app. 2. Regarding the other error OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies. is related to pytorch and nvidia and it only affects some windows machines. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback TL;DR: Windows and Linux treat multiprocessing in python differently, and in windows each process commits much more memory, especially when using pytorch. We use the script suggested in the link to mitigate the problem, but it could be that for some machines memory is still insufficient. Does that make sense in your case? ____________ | |
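For anyone wondering why Windows in particular needs such a big pagefile for these tasks, here is a minimal, generic illustration of the start-method difference (my own sketch, not the project's code): on Linux the default multiprocessing start method is "fork", so the 32 child processes share the parent's already-loaded torch pages copy-on-write, while on Windows it is "spawn", so every child is a fresh interpreter that re-imports torch and commits its own memory.

    import multiprocessing as mp
    import torch  # under "spawn" this import is re-executed in every child process

    def worker(i):
        # stand-in for the per-process work done by each agent environment
        return i, torch.randn(4).sum().item()

    if __name__ == "__main__":
        # "fork" on Linux (children inherit the parent's memory copy-on-write),
        # "spawn" on Windows (each child re-imports torch and commits its own
        # memory, which is where the pagefile pressure comes from).
        print("default start method:", mp.get_start_method())
        with mp.Pool(processes=4) as pool:
            print(pool.map(worker, range(4)))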
| ID: 58876 | Rating: 0 | rate:
| |
|
Thank you abouh for responding, | |
| ID: 58878 | Rating: 0 | rate:
| |
|
Seems like there are some possible workarounds:
and If it's of any value, I ended up setting the values into manual and some ridiculous amount of 360GB as the minimum and 512GB for the maximum. I also added an extra SSD and allocated all of it to Virtual memory. This solved the problem and now I can run up to 128 processes using pytorch and CUDA. Maybe it can be helpful for someone ____________ | |
| ID: 58879 | Rating: 0 | rate:
| |
|
Hi abouh, | |
| ID: 58880 | Rating: 0 | rate:
| |
|
So what's going on here? | |
| ID: 58881 | Rating: 0 | rate:
| |
|
The command line 7za.exe pythongpu_windows_x86_64__cuda102.tar.gz works fine if the job is executed without interruptions. However, in case the job is interrupted and restarted later, the command is executed again. Then, 7za needs to know whether or not to replace the already existing files with the new ones. The flag -y is just to make sure the script does not get stuck in that command prompt waiting for an answer. ____________ | |
| ID: 58883 | Rating: 0 | rate:
| |
|
Unfortunately recent versions of PyTorch do not support all GPUs; older ones might not be compatible...

RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does it happen recurrently on the same machine, or does it depend on the job? ____________ | |
| ID: 58884 | Rating: 0 | rate:
| |
So what's going on here? The problem is not with the card but with the Windows environment. I have no issues running the Python on GPU tasks in Linux on my 1080 Ti card. https://www.gpugrid.net/results.php?hostid=456812 | |
| ID: 58886 | Rating: 0 | rate:
| |
|
Well so far, these new python WU's have been consistently completing and even surviving multiple reboots, OS kernel upgrades, and OS upgrades: | |
| ID: 58906 | Rating: 0 | rate:
| |
|
Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring. | |
| ID: 58907 | Rating: 0 | rate:
| |
Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring. Good to know, as I did not try a driver update or using a different GPU on a WU in progress. I do think BOINC needs to patch their estimated time to completion. XXX days remaining makes it impossible to have any in a cache. | |
| ID: 58915 | Rating: 0 | rate:
| |
|
I haven't had any reason to carry a cache. I have my cache level set at only one task for each host as I don't want GPUGrid to monopolize my hosts and compete with my other projects. | |
| ID: 58919 | Rating: 0 | rate:
| |
Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring. BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the cpu and gpu makes them impossible to decipher by BOINC. The closest mechanism is the MT or multi-task category but that only knows about cpu tasks which run solely on the cpu. | |
| ID: 58920 | Rating: 0 | rate:
| |
BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the cpu and gpu makes them impossible to decipher by BOINC. I think BOINC uses the CPU exclusively in its Estimated Time to Completion algorithm for all WUs, including those using a GPU, which makes sense since the job cannot complete until both processors' work is complete. Observing GPU work with E@H, it appears that the GPU finishes first and the CPU continues for a period of time to do what is necessary to wrap the job up for return, and those BOINC ETCs are fairly accurate. It is the multi-thread WUs mentioned that appear to be throwing a monkey wrench into the ETC, like these python jobs. From my observations, the python WUs use 32 processes regardless of the actual system configuration. I have two 16-core Ryzens and my old 8-core FX-8350, and they each run 32 processes per WU. It seems to me that the existing algorithm could be used in a modular fashion: assume a single-thread CPU job for the MT WU, calculate the estimated time, and then, knowing the number of processes the WU is requesting compared with those available on the system, perform a simple division to produce a more accurate result for MT WUs as well. Don't know for sure, just speculating, but I do have the BOINC source code and might take a look and see if I can find the ETC stuff. Might be interesting. | |
| ID: 58936 | Rating: 0 | rate:
| |
|
The server code for determining the ETC for MT tasks also has to account for task scheduling. | |
| ID: 58937 | Rating: 0 | rate:
| |
|
You make a good point regarding the server-side issues. Perhaps the projects themselves, if they don't already, could submit the desired resources to allow the server to compare them with those available on clients, similar to submitting in-house cluster jobs. I also agree that it is probably best to go through the BOINC git and open a request for a potential fix, but I also want to see their ETC algorithms just out of curiosity, both server and client. Nice interesting discussion. | |
| ID: 58943 | Rating: 0 | rate:
| |
|
You need to review the code in the /client/work_fetch.cpp module and any of the old closed issues pertaining to use of max_concurrent statements in app_config.xml. | |
| ID: 58944 | Rating: 0 | rate:
| |
|
Thank you Keith, much appreciated background and starting points. | |
| ID: 58949 | Rating: 0 | rate:
| |
|
need advice with regard to running Python on one of my Windows machines: | |
| ID: 58961 | Rating: 0 | rate:
| |
BOINC event log says that some 22GB more RAM are needed. Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it. | |
| ID: 58962 | Rating: 0 | rate:
| |
BOINC event log says that some 22GB more RAM are needed. Here is the text of the log message:

26.06.2022 09:20:35 | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
26.06.2022 09:20:37 | GPUGRID | Scheduler request completed: got 0 new tasks
26.06.2022 09:20:37 | GPUGRID | No tasks sent
26.06.2022 09:20:37 | GPUGRID | No tasks are available for ACEMD 3: molecular dynamics simulations for GPUs
26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.
26.06.2022 09:20:37 | GPUGRID | Project requested delay of 31 seconds

The reason why at this point it says I have 10,982 MB available is because I currently have some LHC projects running which use some RAM. However, it also says I need 33,378 MB RAM; so my 32GB RAM are not enough anyway (as seen on the other machine, on which I also have 32GB RAM, and there is no problem with downloading and crunching Python). What I am surprised about is that the project requests so much free RAM, although while in operation it uses only between 1.3 and 5GB. | |
| ID: 58963 | Rating: 0 | rate:
| |
26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB. Disk, not RAM. Probably one or other of your disk settings is blocking it. | |
| ID: 58964 | Rating: 0 | rate:
| |
26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB. Oh sorry, you are perfectly right. My mistake, how dumb :-( So, with my 32GB Ramdisk it does not work when it says that it needs 33378MB. What I could do, theoretically, is to shift BOINC from the Ramdisk to the 1 GB SSD. However, the reason why I installed BOINC on the Ramdisk was that the LHC Atlas tasks, which I am crunching permanently, have an enormous disk usage, and I don't want ATLAS to kill the SSD too early. I guess that there might be ways to install a second instance of BOINC on the SSD - I tried this on another PC years ago, but somehow I did not get it done properly :-( | |
| ID: 58965 | Rating: 0 | rate:
| |
|
You'll need to decide which copy of BOINC is going to be your 'primary' installation (default settings, autorun stuff in the registry, etc.), and which is going to be the 'secondary'. Add <allow_multiple_clients>1</allow_multiple_clients> to the options section of cc_config.xml (or set the value to 1 if the line is already present). That needs a client restart if BOINC's already running. Then, these two batch files work for me. Adapt program and data locations as needed.

To run the client:
D:\BOINC\rh_boinc_test --allow_multiple_clients --allow_remote_gui_rpc --redirectio --detach_console --gui_rpc_port 31418 --dir D:\BOINCdata2\

To run a Manager to control the second client:
start D:\BOINC\boincmgr.exe /m /n 127.0.0.1 /g 31418 /p password

Note that I've set this up to run test clients alongside my main working installation - you can probably ignore that bit. | |
| ID: 58966 | Rating: 0 | rate:
| |
We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real walltime calculation should be much less than reported. Are you still in need of that? My first Python ran for 12 hours 55 minutes according to BoincTasks, but the website reported 156,269.60 seconds (over 43 hours). It got 75,000 credits. http://www.gpugrid.net/results.php?hostid=593715 | |
| ID: 58968 | Rating: 0 | rate:
| |
|
Thanks for the feedback Jim1348! It is useful for us to confirm that jobs run in a reasonable time despite the wrong estimation issue. Maybe that can be solved somehow in the future. Seems like at least it did not estimate dozens of days like I have seen on other occasions. | |
| ID: 58969 | Rating: 0 | rate:
| |
|
it's because the app is using the CPU time instead of runtime. since it uses so many threads, it adds up the time spent on all the threads. 2 threads working for 1hr total would be 2hrs reported CPU time. you need to track wall clock time. the app seems to have this capability since it reports timestamps of start and stop in the stderr.txt file. | |
| ID: 58970 | Rating: 0 | rate:
| |
|
There are two separate problems with timing. | |
| ID: 58971 | Rating: 0 | rate:
| |
that may be true, NOW. however, if they move to a dynamic credit scheme (as they should) that awards credit based on flops and runtime (like ACEMD3 does), then the runtime will not be just cosmetic. ____________ | |
| ID: 58972 | Rating: 0 | rate:
| |
|
OK, I got one on host 508381. Initial estimate is 752d 05:26:18, task is 32940037 | |
| ID: 58973 | Rating: 0 | rate:
| |
|
Yesterday's task is just in the final stages - it'll finish after about 13 hours - and the next is ready to start. So here are the figures for the next in the cycle. | |
| ID: 58974 | Rating: 0 | rate:
| |
|
The credits per runtime for cuda1131 really look strange sometimes: | |
| ID: 58975 | Rating: 0 | rate:
| |
|
Yes, you are right about that. There are 2 types of experiments I run now: | |
| ID: 58977 | Rating: 0 | rate:
| |
|
The credit system gives 50,000 credits per task. However, completion before a certain amount of time multiplies this value by 1.5, then by 1.25 for a while, and finally by 1.0 indefinitely. That explains why sometimes you see 75,000 and sometimes 62,500 credits. | |
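In other words, roughly this (a sketch based on the numbers in this thread; the base value changes between batches - a later batch pays 70,000 - and the cut-offs assumed here are the usual 24 h / 48 h return bonuses):

    # Rough sketch of the bonus scheme as described in this thread.
    def task_credit(turnaround_hours, base=50_000):
        if turnaround_hours < 24:
            return base * 1.5    # returned within 24 hours
        if turnaround_hours < 48:
            return base * 1.25   # returned within 48 hours
        return base * 1.0

    print(task_credit(8))    # 75000.0
    print(task_credit(30))   # 62500.0
    print(task_credit(60))   # 50000.0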
| ID: 58978 | Rating: 0 | rate:
| |
|
I had an idea after reading some of the posts about utilisation of resources. | |
| ID: 58979 | Rating: 0 | rate:
| |
|
The reason Reinforcement Learning agents do not currently use the full potential of the cards is that the interactions between the AI agent and the simulated environment are performed on the CPU, while the agent "learning" process is the one that uses the GPU intermittently. | |
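For anyone curious what that alternation looks like in code, here is a much simplified, generic sketch of the pattern (not the project's actual training script): the rollout phase is CPU-bound, and only the periodic update touches the GPU, which is why GPU load comes in bursts.

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4)).to(device)
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

    def env_step(action):
        # stand-in for a CPU-bound simulator step (the real tasks run 32 of
        # these environments in parallel worker processes)
        return torch.randn(8), torch.rand(()).item()

    for iteration in range(10):
        # --- rollout phase: CPU-heavy, GPU mostly idle ---
        obs_buf, rew_buf = [], []
        obs = torch.randn(8)
        for t in range(128):
            with torch.no_grad():
                action = policy(obs.to(device)).argmax().item()
            obs, reward = env_step(action)
            obs_buf.append(obs)
            rew_buf.append(reward)

        # --- learning phase: GPU-heavy, CPU mostly idle ---
        batch = torch.stack(obs_buf).to(device)
        returns = torch.tensor(rew_buf, device=device)
        loss = -(policy(batch).logsumexp(dim=1) * returns).mean()  # placeholder objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()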
| ID: 58980 | Rating: 0 | rate:
| |
|
Thanks for the comments. What about using a large quantity of VRAM if available? The latest BOINC finally allows correct reporting of VRAM on NVidia cards, so you can tailor the WUs based on VRAM to protect the contributions from users with lower-specification computers. | |
| ID: 58981 | Rating: 0 | rate:
| |
|
Sorry for the OT, but some people need admin help and I've seen one being active here :) | |
| ID: 58995 | Rating: 0 | rate:
| |
|
Hi Fritz! Apparently the problem is that sending emails from server no longer works. I will mention the problem to the server admin. | |
| ID: 59002 | Rating: 0 | rate:
| |
|
I talked to the server admin and he explained to me the problem in more detail. | |
| ID: 59003 | Rating: 0 | rate:
| |
|
Hello Toby, | |
| ID: 59004 | Rating: 0 | rate:
| |
|
BOINC can detect the quantity of GPU memory; it was bugged in the older BOINC versions for nVidia cards, but in 7.20 it's fixed, so there would be no need to detect it in Python as it's already in the project database. | |
| ID: 59006 | Rating: 0 | rate:
| |
|
Even video cards with 6GiB crash with insufficient VRAM. | |
| ID: 59007 | Rating: 0 | rate:
| |
|
From what we are finding right now the 6GB GPUs would have sufficient VRAM to run the current Python tasks. Refer to this thread noting between 2.5 and 3.2 GB being used: https://www.gpugrid.net/forum_thread.php?id=5327 | |
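If you want to check what a running task actually uses on your own card, something like this works (a host-side helper of mine, assuming nvidia-smi is on the PATH; watching nvidia-smi directly does the same job):

    # Quick check of per-GPU VRAM usage while a task runs.
    import subprocess

    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in out.strip().splitlines():
        idx, used, total = (x.strip() for x in line.split(","))
        print(f"GPU {idx}: {int(used)} / {int(total)} MiB VRAM in use")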
| ID: 59008 | Rating: 0 | rate:
| |
|
New generic error on multiple tasks this morning:

TypeError: create_factory() got an unexpected keyword argument 'recurrent_nets'

Seems to affect the entire batch currently being generated. | |
| ID: 59039 | Rating: 0 | rate:
| |
|
Thanks for letting us know Richard. It is a minor error, sorry for the inconvenience, I am fixing it right now. Unfortunately the remaining jobs of the batch will crash but then will be replaced with correct ones. | |
| ID: 59040 | Rating: 0 | rate:
| |
|
No worries - these things happen. The machine which alerted me to the problem now has a task 'created 28 Jul 2022 | 10:33:04 UTC' which seems to be running normally. | |
| ID: 59042 | Rating: 0 | rate:
| |
|
Yes exactly, it has to fail 8 times... the only good part is that the bugged tasks fail at the beginning of the script so almost no computation is wasted. I have checked and some of the tasks in the newest batch have already finished successfully. | |
| ID: 59043 | Rating: 0 | rate:
| |
|
A peculiarity of Python apps for GPU hosts 4.03 (cuda1131): | |
| ID: 59071 | Rating: 0 | rate:
| |
|
I've been monitoring and playing with the initial runtime estimates for these tasks. | |
| ID: 59076 | Rating: 0 | rate:
| |
|
or just use the flops reported by BOINC for the GPU. since it is recorded and communicated to the project. and from my experience (with ACEMD tasks) does get used in the credit reward for the non-static award scheme. so the project is certainly getting it and able to use that value. | |
| ID: 59077 | Rating: 0 | rate:
| |
|
Except: | |
| ID: 59078 | Rating: 0 | rate:
| |
|
personally I'm a big fan of just standardizing the task computational size and assigning static credit. no matter the device used or how long it takes. just take flops out of the equation completely. that way faster devices get more credit/RAC based on the rate in which valid tasks are returned. | |
| ID: 59099 | Rating: 0 | rate:
| |
|
The latest Python tasks I've done today have awarded 105,000 credits as compared to all the previous tasks at 75,000 credits. | |
| ID: 59101 | Rating: 0 | rate:
| |
Anyone notice this new award level? I just got my first one. http://www.gpugrid.net/workunit.php?wuid=27270757 But not all the new ones receive that. A subsequent one received the usual 75,000 credit. | |
| ID: 59102 | Rating: 0 | rate:
| |
|
Thanks for your report. It doesn't really track with scaling now that I examine my tasks. | |
| ID: 59104 | Rating: 0 | rate:
| |
|
My first 'high rate' task (105K credits) was a workunit created at 10 Aug 2022 | 2:03:51 UTC. | |
| ID: 59105 | Rating: 0 | rate:
| |
|
That implies the current release candidates are being assigned 105K credit based I assume on harder to crunch datasets. | |
| ID: 59107 | Rating: 0 | rate:
| |
|
Which apps are running these days? The apps page is missing the column that shows how much is running: https://www.gpugrid.net/apps.php

<app_config>
  <!-- i9-10980XE 18c36t 32 GB L3 Cache 24.75 MB -->
  <app>
    <name>acemd3</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <fraction_done_exact/>
  </app>
  <app>
    <name>acemd4</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <fraction_done_exact/>
  </app>
  <app>
    <name>PythonGPU</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>4.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <avg_ncpus>4</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 4</cmdline>
    </app_version>
    <fraction_done_exact/>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>PythonGPUbeta</name>
    <plan_class>cuda1121</plan_class>
    <gpu_versions>
      <cpu_usage>4.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <avg_ncpus>4</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 4</cmdline>
    </app_version>
    <fraction_done_exact/>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>Python</name>
    <plan_class>cuda1121</plan_class>
    <cpu_usage>4</cpu_usage>
    <gpu_versions>
      <cpu_usage>4</cpu_usage>
      <gpu_usage>1</gpu_usage>
    </gpu_versions>
    <app_version>
      <app_name>PythonGPU</app_name>
      <avg_ncpus>4</avg_ncpus>
      <ngpus>1</ngpus>
      <cmdline>--nthreads 4</cmdline>
    </app_version>
    <fraction_done_exact/>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config> | |
| ID: 59109 | Rating: 0 | rate:
| |
|
I get away with only reserving 3 cpu threads. That does not impact or affect what the actual task does when it runs. Just BOINC cpu scheduling for other projects. | |
| ID: 59110 | Rating: 0 | rate:
| |
|
Hi, guys! | |
| ID: 59111 | Rating: 0 | rate:
| |
Hi, guys! Yes, because of flaws in Windows memory management, that effect cannot be gotten around. You need to increase the size of your pagefile to the 50GB range to be safe. Linux does not have the problem and no changes are necessary to run the tasks. The project primarily develops Linux applications first as the development process is simpler. Then they tackle the difficulties of developing a Windows application with all the necessary workarounds. Just the way it is. For the reason why read this post. https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908 | |
| ID: 59112 | Rating: 0 | rate:
| |
|
Thank you for clarification. | |
| ID: 59113 | Rating: 0 | rate:
| |
|
Task credits are fixed. Pay no attention to the running times. BOINC completely mishandles that since it has no recognition of the dual nature of these cpu-gpu application tasks. | |
| ID: 59114 | Rating: 0 | rate:
| |
|
Can anyone tell me what happened to this task: | |
| ID: 59115 | Rating: 0 | rate:
| |
|
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes. | |
| ID: 59116 | Rating: 0 | rate:
| |
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes. thanks Richard for the quick reply. I now changed the page file size to max. 65GB. I did it on both drives: system drive C:/ and drive F:/ (on a separate SSD) on which BOINC is running. Probably changing it for only one drive would have been okay, right? If so, which one? | |
| ID: 59117 | Rating: 0 | rate:
| |
|
The Windows one. | |
| ID: 59118 | Rating: 0 | rate:
| |
|
I am a bit surprised that I am able to run the pythons without problem under Ubuntu 20.04.4 on a GTX 1060. It has 3GB of video memory, and uses 2.8GB thus far. And the CPU is currently running two cores (down from the previous four cores), using about 3.7GB of memory, though reserving 19 GB. | |
| ID: 59119 | Rating: 0 | rate:
| |
The Windows one. thx :-) | |
| ID: 59120 | Rating: 0 | rate:
| |
|
Can the CPU usage be adjusted correctly? It's fine for it to use a number of cores, but currently it says less than one and uses more than one. | |
| ID: 59141 | Rating: 0 | rate:
| |
|
Hello! sorry for the late reply | |
| ID: 59143 | Rating: 0 | rate:
| |
|
The current value of rsc_fpops_est is 1e18, with 10e18 as the limit. I remember we had to increase it because otherwise it produced false “task aborted by host” errors on some users' side. Do you think we should change it again? | |
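For context, BOINC's very first duration estimate is essentially rsc_fpops_est divided by the speed it assumes for the device, so a fixed 1e18 inevitably produces huge initial estimates on hosts it assumes to be slow. A rough illustration (the assumed speed below is made up for the example):

    # Back-of-the-envelope sketch of how the initial runtime estimate comes out
    # of rsc_fpops_est (estimate ~= fpops_est / assumed device speed).
    rsc_fpops_est = 1e18          # value quoted above for these tasks
    assumed_flops = 1.5e10        # example: ~15 GFLOPS assumed for the host/app

    seconds = rsc_fpops_est / assumed_flops
    print(f"initial estimate: {seconds / 86400:.0f} days")   # ~770 days

which is in the same ballpark as the 752-day initial estimate reported earlier in this thread.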
| ID: 59144 | Rating: 0 | rate:
| |
Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? sorry it is a bit outside of my knowledge zone... This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is calculated by some obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of the number of flops that will be performed on the CPU and the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of the theoretical peak speed. In short, it's useless. I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of over-riding and fixing it - great. More power to your elbow. The current issue arises because the Python app is neither a pure GPU app, nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that. I think you need to create a special, new, plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research. I'm running with a manual override which devotes the whole GPU power, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running. | |
| ID: 59145 | Rating: 0 | rate:
| |
Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? sorry it is a bit outside of my knowledge zone... Could you tell us a bit more about this manual override? Just now it is sprawled over five cores, ten threads. If it sees the sixth core free, it grabs that one also. | |
| ID: 59152 | Rating: 0 | rate:
| |
|
If you run other projects concurrently, then it is adviseable to limit the number of cores the Python tasks occupies for scheduling. I am not talking about the number of threads each task uses since that is fixed. | |
| ID: 59153 | Rating: 0 | rate:
| |
If you run other projects concurrently, then it is adviseable to limit the number of cores the Python tasks occupies for scheduling. I am not talking about the number of threads each task uses since that is fixed. Thank you Keith. Why is it using so many cores plus is it something like OpenIFS on CPDN? | |
| ID: 59154 | Rating: 0 | rate:
| |
Thank you Keith. Why is it using so many cores plus is it something like OpenIFS on CPDN? Yes - or nbody at MilkyWay. This Python task shares characteristics of a cuda (GPU) plan class, and a MT (multithreaded) plan class, and works best if treated as such. | |
| ID: 59155 | Rating: 0 | rate:
| |
|
Possible bad workunit: 27278732

ValueError: Expected value argument (Tensor of shape (1024,)) to be within the support (IntegerInterval(lower_bound=0, upper_bound=17)) of the distribution Categorical(logits: torch.Size([1024, 18])), but found invalid values: | |
| ID: 59163 | Rating: 0 | rate:
| |
|
Interesting I had never seen this error before, thank you! | |
| ID: 59178 | Rating: 0 | rate:
| |
|
Thanks Richard, is 3 CPU cores enough to not slow down the GPU? | |
| ID: 59192 | Rating: 0 | rate:
| |
|
I'm noticing an interesting difference in application behavior between different systems. abouh, can you help explain the reason? | |
| ID: 59203 | Rating: 0 | rate:
| |
|
or perhaps the Broadwell based Intel CPU is able to hardware accelerate some tasks that the EPYC has to do in software, leading to higher CPU use? | |
| ID: 59204 | Rating: 0 | rate:
| |
|
The application is not coded in any specific way to force more work to be done on more modern processors. | |
| ID: 59205 | Rating: 0 | rate:
| |
Maybe python handles it under the hood somehow? it might be related to pytorch actually. I did some more digging and it seems like AMD has worse performance due to some kind of CPU detection issue in the MKL (or maybe deliberate by Intel). do you know what version of MKL your package uses? and are you able to set specific env variables in your package? if your MKL is version <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. but it looks like this will not be effective if you are on a newer version of the MKL as Intel has since removed this variable. ____________ | |
| ID: 59206 | Rating: 0 | rate:
| |
to add: I was able to inspect your MKL version as 2019.0.4, and I tried setting the env variable by adding os.environ["MKL_DEBUG_CPU_TYPE"] = "5" to the run.py main program, but it had no effect. either I didn't put the command in the right place (I inserted it below line 433 in the run.py script), or the issue is something else entirely. edit: you also might consider compiling your scripts into binaries to prevent inquisitive minds from messing about in your program ;) ____________ | |
| ID: 59207 | Rating: 0 | rate:
| |
|
Should the environment variable for fixing AMD computation in the MKL library be in the task package or just in the host environment? Or both? | |
| ID: 59208 | Rating: 0 | rate:
| |
|
I didn’t explicitly state it in my previous reply. But I tried all that already and it didn’t make any difference. I even ran run.py standalone outside of BOINC to be sure that the env variable was set. Neither the env variable being set nor the fake Intel library made any difference at all. | |
| ID: 59209 | Rating: 0 | rate:
| |
|
Ohh . . . . OK. Didn't know you had tried all the previous existing fixes. | |
| ID: 59210 | Rating: 0 | rate:
| |
|
I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster. | |
| ID: 59211 | Rating: 0 | rate:
| |
|
Don't know if the math functions being used by the Python libraries are any higher than SSE2 or not. | |
| ID: 59212 | Rating: 0 | rate:
| |
I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster. Was my location for the variable in the script right or appropriate? inserted below line 433. Does the script inherit the OS variables already? Just wanted to make sure I had it set properly. I figured the script runs in its own environment outside of BOINC (in Python). That’s why I tried adding it to the script. ____________ | |
| ID: 59213 | Rating: 0 | rate:
| |
It’s hard to say whether it’s faster or not since it’s not a true apples to apples comparison. So far it feels not faster, but that’s against different CPUs and different GPUs. Maybe my EPYC system seems similarly fast because the EPYC is just brute forcing it. It had much higher IPC than the old Broadwell based Intel. ____________ | |
| ID: 59214 | Rating: 0 | rate:
| |
|
One of my machines started a Python task yesterday evening and finished it after about 24-1/ 2hours. | |
| ID: 59215 | Rating: 0 | rate:
| |
One of my machines started a Python task yesterday evening and finished it after about 24-1/ 2hours. The calculated runtime is using the cpu time. Has been mentioned many times. It’s because more than one core was being used. So the sum of each core’s cpu time is what’s shown. You did get 48hr bonus of 25%. Base credit is 70,000. You got 87,500 (+25%). Less than 24hrs gets +50% for 105,000. ____________ | |
| ID: 59216 | Rating: 0 | rate:
| |
|
GPUGRID seems to have problems with figures, at least what concerns Python :-( | |
| ID: 59217 | Rating: 0 | rate:
| |
GPUGRID seems to have problems with figures, at least what concerns Python :-( probably due to your allocation of disk usage in BOINC. go into the compute preferences and allow BOINC to use more disk space. by default I think it is set to 50% of the disk drive. you might need to increase that. Options-> Computing Preferences... Disk and Memory tab and set whatever limits you think are appropriate. it will use the most restrictive of the 3 types of limits. The Python tasks take up a lot of space. ____________ | |
| ID: 59218 | Rating: 0 | rate:
| |
no, it isn't that. I am aware of these settings. Since nothing other than BOINC is being done on this computer, disk and RAM usage are set to 90% for BOINC. So, when I have some 58GB free on a 128GB RAM disk (with some 60GB free system RAM), it should normally be no problem for Python to download and be processed. On another machine, I have a lot fewer resources, and it works. So no idea what the problem is in this case ... :-( | |
| ID: 59221 | Rating: 0 | rate:
| |
|
Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there. | |
| ID: 59222 | Rating: 0 | rate:
| |
Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there. no, I have BOINC running on another PC with Ramdisk - in that case a much smaller one: 32GB | |
| ID: 59223 | Rating: 0 | rate:
| |
|
another question - | |
| ID: 59224 | Rating: 0 | rate:
| |
|
No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation. | |
| ID: 59225 | Rating: 0 | rate:
| |
... thanks, Keith, for your explanation. Well, I actually would not need to put in this app_config.xml, as in my case the other BOINC tasks don't just assign any number of CPU cores by themselves; I tell each of these projects via a separate app_config.xml how many cores to use (which I was, in fact, also hoping to do for Python). So I have no other choice than to live with the situation as is :-( What is too bad though is that obviously there are no longer any ACEMD tasks being sent out (where it is basically clear: 1 task = 1 CPU core [unless changed by an app_config.xml]). | |
| ID: 59226 | Rating: 0 | rate:
| |
Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there. Now I tried once more to download a Python on my system with a 128GB Ramdisk (plus 128GB system RAM). The BOINC event log says: Python apps for GPU hosts needs 4590.46MB more disk space. You currently have 28788.14 MB available and it needs 33378.60 MB. Somehow though all this does not fit together: in reality, the Ramdisk is filled with 73GB and has 55GB available. Further, I am questioning whether Python indeed needs 33.378 MB free disk space for downloading? I am really frustrated that this does not work :-( | |
| ID: 59228 | Rating: 0 | rate:
| |
... You are not understanding the nature of the Python tasks. They are not using all your cores. They are not using 32 cores. They are using 32 spawned processes. A process is NOT a core. The Python tasks use from 100-300% of a cpu core depending on the speed of the host and the number of cores in the host. That is why I offered the app_config.xml file to allot 3 cpu cores to each Python task for BOINC scheduling purposes. And you can have many app_config.xml files in play among all your projects, as an app_config file is specific to each project and is placed into the project's folder. You certainly can use one for scheduling help for GPUGrid. An app_config file does not control the number of cores a task uses. That is dependent solely on the science application. A task will use as many or as few cores as needed. The only exception to that fact is in the special case of plan_class MT, like the cpu tasks at Milkyway. Then BOINC has an actual control parameter --nthreads that can specifically set the number of cores allowed in the MT plan_class task. That cannot be used here because the Python tasks are not a simple cpu-only MT type task. They are something completely different and something that BOINC does not know how to handle. They are a dual cpu-gpu combination task where the majority of computation is done on a cpu with bursts of activity on a gpu, and then the computation repeats that cycle. It would take a major rewrite of core BOINC code to properly handle this type of machine-learning, reinforcement-learning combo task. Unless BOINC attracts new developers that are willing to tackle this major development hurdle, the best we can do is just accommodate these tasks through other host controls. | |
| ID: 59229 | Rating: 0 | rate:
| |
|
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page. | |
| ID: 59230 | Rating: 0 | rate:
| |
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page. I had removed these checkmarks already before. What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it? | |
| ID: 59231 | Rating: 0 | rate:
| |
The reason Reinforcement Learning agents do not currently use the full potential of the cards is that the interactions between the AI agent and the simulated environment are performed on the CPU, while the agent "learning" process is the one that uses the GPU intermittently. A suggestion for whenever you're able to move to pure GPU work: PLEASE look into and enable "automatic mixed precision" in your code. https://pytorch.org/docs/stable/notes/amp_examples.html This should greatly benefit those devices which have Tensor cores, to speed things up. ____________ | |
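For reference, the basic pattern from that page looks roughly like this (a sketch against the torch.cuda.amp API, not the project's code): autocast for the forward pass, GradScaler for the backward pass, so FP16/Tensor-core paths are used where it is safe.

    import torch
    import torch.nn as nn

    device = "cuda"
    model = nn.Linear(512, 512).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()

    for step in range(100):
        data = torch.randn(64, 512, device=device)
        target = torch.randn(64, 512, device=device)

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():          # ops run in reduced precision where safe
            loss = nn.functional.mse_loss(model(data), target)

        scaler.scale(loss).backward()            # scale loss to avoid FP16 underflow
        scaler.step(optimizer)
        scaler.update()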
| ID: 59232 | Rating: 0 | rate:
| |
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page. I think your issue is your use of a fixed ram disk size instead of a dynamic pagefile that is allowed to grow larger as needed. | |
| ID: 59233 | Rating: 0 | rate:
| |
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page. I just noticed the same problem with Rosetta Python tasks, so this may be related in some way to the Python architecture. Also in the Rosetta case, the actual disk space available was significantly higher than Rosetta said it would need. So I don't believe that this has anything to do with the fixed ram disk size. What is the logic behind your assumption? | |
| ID: 59234 | Rating: 0 | rate:
| |
|
If you read through the various posts, including mine, or investigate the issues with PyTorch on Windows, it is because of the nature of how Windows handles reservation of memory addresses compared to how Linux handles it. | |
| ID: 59235 | Rating: 0 | rate:
| |
So the best method to satisfy this fact on Windows is to start with a 35GB minimum size pagefile with a 50GB maximum size and allow the pagefile to size dynamically between that range. Your fixed ram disk size just isn't flexible enough or large enough apparently. That pagefile size seems to be sufficient for the other Windows users I have assisted with these tasks. thanks for the hint, I will adapt the page file size accordingly and see what happens. | |
| ID: 59236 | Rating: 0 | rate:
| |
|
Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys | |
| ID: 59237 | Rating: 0 | rate:
| |
Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys thanks :) I'll try anyway edit - nope, no different. ____________ | |
| ID: 59238 | Rating: 0 | rate:
| |
|
Really unfortunate that it uses so much more CPU on AMD than on Intel. It's something about the multithreaded nature of the main run.py process itself. On Intel it uses about 2-5% per process, and more run.py processes spin up the more cores you have. With AMD it uses more like 20-40% per process, so with high core count CPUs that makes total CPU utilization crazy high. | |
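If you want to check this on your own host, a generic Linux one-liner (nothing GPUGRID-specific) that lists the per-process CPU share of the spawned run.py workers:
ps -eo pcpu,pid,cmd --sort=-pcpu | grep "[r]un.py"
The bracketed pattern just keeps grep from matching itself.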
| ID: 59239 | Rating: 0 | rate:
| |
No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation. does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35% | |
| ID: 59240 | Rating: 0 | rate:
| |
does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35% not directly. but if your GPU is being bottlenecked by not enough CPU resources then it could help. the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU. ____________ | |
| ID: 59241 | Rating: 0 | rate:
| |
|
Hi everyone. the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU. I'm thinking about putting all other BOINC CPU work into a VM instead of running it directly on the host. You could limit the VM to using only 90 per cent of processing power through the VM settings. This would leave the rest for the Python stuff, so on a sixteen-thread CPU it could use 160% of one thread's power, or 10% of the CPU. If this wasn't enough, the VM could be adjusted to use only eighty per cent (320% of one thread's power, or 20% of the CPU, for the Python work) and so on. Repeat (adjust and try) until the machine runs fine. Plus, you could run other GPU stuff on your GPU to have it fully utilized, which should prevent the big temperature swings that I see as unnecessary stress for a GPU. MilkyWay has a small VRAM footprint and doesn't use a full GPU, and maybe I'll try WCG OPNG as well. ____________ Greetings, Jens | |
| ID: 59248 | Rating: 0 | rate:
| |
... and maybe I'll try WCG OPNG as well. forget about WCG OPNG for the time being. Most of the time no tasks available; and if tasks are available for a short period of time, it's extremely hard to get them downloaded. The downloads get stuck most of the time, and only manual intervention helps. | |
| ID: 59251 | Rating: 0 | rate:
| |
|
Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task? | |
| ID: 59254 | Rating: 0 | rate:
| |
Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task? Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty. They checkpoint well, and the checkpoints are replayed to get the task back to the point of progress it was at before the interruption. Just be advised that the replay process takes a few minutes after restart. The task will show 2% completion upon restart but will eventually jump back to the progress point it was at and continue calculating to the end. Just be patient and let the task run. | |
| ID: 59255 | Rating: 0 | rate:
| |
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty. I have a problem that they fail on reboot however. Is that common? http://www.gpugrid.net/results.php?hostid=583702 That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there. | |
| ID: 59259 | Rating: 0 | rate:
| |
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty. Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu. | |
| ID: 59260 | Rating: 0 | rate:
| |
|
The restart is supposed to work fine on Windows as well. Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task? | |
| ID: 59261 | Rating: 0 | rate:
| |
Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task? I can pause and restart them with no problem. The error occurred only on reboot. But I think I have found it. I was using a large write cache, PrimoCache, set with an 8 GB cache size and 1 hour latency. By disabling that, I am able to reboot without a problem. So there was probably a delay in flushing the cache on reboot that caused the error. But I used the write cache to protect my SSD, since I was seeing writes of around 370 GB a day, too much for me. But this time I am seeing only 200 GB/day. That is still a lot, but not fatal for some time. It seems that the work units vary in how much they will write. I will monitor it. I use SsdReady to monitor the writes to disk; the free version is OK. PS - I can set PrimoCache to only a 1 GB write-cache size with a 5 minute latency, and it reboots without a problem. Whether that is good enough to protect the SSD will have to be determined by monitoring the actual writes to disk. PrimoCache gives a measure of that. (SsdReady gives the OS writes, but not the actual writes to disk.) PPS: I should point out that the reason a write cache can cut down on the writes to disk is because of the nature of scientific algorithms. They invariably read from a location, do a calculation, and then write back to the same location much of the time. The cache can store that, and only write to the disk the changes that remain at the end of the flush period. If you have a large enough cache, and set the write-delay to infinite, you essentially have a ramdisk. But the cache can be good enough, with less memory than a ramdisk would require. (And now it seems that 2 GB and 10 minutes works OK.) | |
| ID: 59262 | Rating: 0 | rate:
| |
|
Question for the experts here: | |
| ID: 59265 | Rating: 0 | rate:
| |
|
Sorry. There is no way to configure an app_config to differentiate between devices. | |
| ID: 59266 | Rating: 0 | rate:
| |
Sorry. There is no way to configure an app_config to differentiate between devices. In fact, I have 2 BOINC clients on this PC; I had to establish the second one with the BOINC DataDir on the SSD, since the first one is on the 32GB Ramdisk which would not let me download Python tasks ("not enough disk space"). However, next week I will double the RAM on this PC, from 64 to 128GB, and then I will increase the Ramdisk size to at least 64GB; this should make it possible to download Python - at least that's what I hope. So then I could run 1 Python on each of the 2 GPUs on the SSD client, and a third Python on the Ramdisk client. The only two questions now are: how do I tell the Ramdisk client to run only 1 Python (although 2 GPUs are available)? And how do I tell the Ramdisk client to choose the GPU with the lower amount of VRAM usage (i.e. the one that's NOT running the display)? In fact, I would prefer to run 2 Pythons on the Ramdisk client and 1 Python on the SSD client; however, the question is whether I could download 2 Pythons on the 64GB Ramdisk - the only thing I can do is try. | |
| ID: 59267 | Rating: 0 | rate:
| |
|
please read the BOINC documentation for client configuration. all of the options and what they do are in there. | |
| ID: 59268 | Rating: 0 | rate:
| |
|
personally I would stop running the ram disk. it's just extra complication and eats up ram space that the Python tasks crave. your biggest benefit will be moving to linux, it's easily 2x faster, maybe more. I don't know how you have your systems set up, but i see your longest runtimes on your 3070 are like 24hrs. that's crazy long. are you not leaving enough CPU available? are you running other CPU work at the same time? | |
| ID: 59269 | Rating: 0 | rate:
| |
... thanks very much for your hints:-) One other thing that I now noticed when reading the stderr of the 3 Pythons that failed a short time after start: "RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes" So the reason why the tasks crashed after a few seconds was not too little VRAM (this would probably have come up a little later), but the lack of system RAM. In fact, I remember that right after the start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed thus releasing memory). Any idea how much system RAM, roughly, a Python task takes? | |
| ID: 59270 | Rating: 0 | rate:
| |
From what I can see in the Windows Task Manager on this PC and on others running Python tasks, RAM usage of a Python can be from about 1GB to 6GB (!) How come that it varies that much? | |
| ID: 59271 | Rating: 0 | rate:
| |
|
you should figure 7-8GB per python task. that's what it seems to use on my linux system. i would imagine it uses a little when the task starts up, then slowly increases once it gets to running full out. that might be the reason for the variance of 1GB in the beginning, and 6+GB by the time it gets to running the main program. | |
| ID: 59272 | Rating: 0 | rate:
| |
|
Erich56 asked: Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task? I tried it now - the two tasks running on a RTX3070 each - on Windows - did not survive a reboot :-( | |
| ID: 59280 | Rating: 0 | rate:
| |
|
Since I upgraded the RAM of one of my PCs yesterday from 64GB to 128GB (so now I have a 64GB Ramdisk plus 64GB system RAM; before it was half that each), every GPUGRID Python fails on this PC with 2 RTX3070 inside. | |
| ID: 59281 | Rating: 0 | rate:
| |
I'm new to config editing :) A few more questions. Do I need to be more specific in the <name> tag and put the full application name, like Python apps for GPU hosts 4.03 (cuda1131) from the task properties? Because I don't see 3 CPUs being given to the task after a client restart:
Application: Python apps for GPU hosts 4.03 (cuda1131)
Name: e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State: Running
Received: Tue 20 Sep 2022 10:48:34 PM +05
Report deadline: Sun 25 Sep 2022 10:48:34 PM +05
Resources: 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size: 1,000,000,000 GFLOPs
CPU time: 00:48:32
CPU time since checkpoint: 00:00:07
Elapsed time: 00:11:37
Estimated time remaining: 50d 21:42:09
Fraction done: 1.990%
Virtual memory size: 18.16 GB
Working set size: 5.88 GB
Directory: slots/8
Process ID: 5555
Progress rate: 6.840% per hour
Executable: wrapper_26198_x86_64-pc-linux-gnu | |
| ID: 59285 | Rating: 0 | rate:
| |
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty. The restart works fine on Windows. It might be the five-minute pause at 2% that is causing the confusion. | |
| ID: 59286 | Rating: 0 | rate:
| |
Get rid of the ram disk. | |
| ID: 59287 | Rating: 0 | rate:
| |
Any already downloaded task will see the original cpu-gpu resource assignment. Any newly downloaded task will show the NEW task assignment. The name for the tasks is PythonGPU as you show. You should always refer to the client_state.xml file as it is the final arbiter of the correct naming and task configuration. | |
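For reference, the relevant fragment in client_state.xml looks roughly like this (a sketch; the surrounding fields are omitted) - the short <name> is the one an app_config.xml has to match:
<app>
    <name>PythonGPU</name>
    <user_friendly_name>Python apps for GPU hosts</user_friendly_name>
</app>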
| ID: 59288 | Rating: 0 | rate:
| |
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty. If you interrupt the task in its Stage 1, while it is downloading and unpacking the required support files, it may fail on Windows upon restart. It normally shows the failure for this reason in the stderr.txt. Best to interrupt the task once it is actually calculating, after its setup, and once it has produced at least one checkpoint. | |
| ID: 59289 | Rating: 0 | rate:
| |
on the other hand, ramdisk works perfectly on this machine: https://www.gpugrid.net/show_host_detail.php?hostid=599484 | |
| ID: 59290 | Rating: 0 | rate:
| |
Then you need to investigate the differences between the two hosts. All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks. Basic troubleshooting. Reduce to the most basic, absolute needed configuration for the tasks to complete correctly and then add back in one extra superfluous element at a time until the tasks fail again. Then you have identified why the tasks fail. | |
| ID: 59291 | Rating: 0 | rate:
| |
|
Keith Myers thanks! | |
| ID: 59292 | Rating: 0 | rate:
| |
|
In my case the config didn't want to work until I added <max_concurrent>:
<app_config>
   <app>
      <name>PythonGPU</name>
      <max_concurrent>1</max_concurrent>
      <gpu_versions>
         <gpu_usage>1.0</gpu_usage>
         <cpu_usage>3.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
Now I see the expected status: Running (3 CPUs + 1 NVIDIA GPU). Unfortunately it doesn't help to get high GPU utilization. Completion time looks like it's going to be slightly better, though. | |
| ID: 59293 | Rating: 0 | rate:
| |
In my case config didn't want to work until I added <max_concurrent> If you have enough cpu for support and enough VRAM on the card, you can get better gpu utilization by moving to 2X tasks on the card. Just change the gpu_usage to 0.5 | |
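In other words, the same app_config.xml as above with only the gpu_usage line changed (a sketch; the max_concurrent line would also need raising or removing so that two tasks can actually run):
<app_config>
   <app>
      <name>PythonGPU</name>
      <gpu_versions>
         <gpu_usage>0.5</gpu_usage>
         <cpu_usage>3.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>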
| ID: 59294 | Rating: 0 | rate:
| |
I installed a RAMdisk because quite often I am crunching tasks which write many GB of data to the disk. E.g. LHC-Atlas, the GPU tasks from WCG, the Pythons from Rosetta, and last but not least the Pythons from GPUGRID: about 200GB within 24 hours, which is a lot (so for my two RTX3070, this would be 400GB/day). So, if the machines are running 24/7, in my opinion this is simply not good for an SSD's lifetime. Over the years, my experience with the RAMdisk has been a good one. No idea what kind of problem the GPUGRID Pythons have with this particular RAMDisk - or vice versa. As said, on another machine with a RAMDisk I also have 2 Pythons running concurrently, even on one GPU, and it works fine. So what I did yesterday evening was to let only one of the two RTX3070 crunch a Python. On the other GPU, I sometimes crunched WCG or nothing at all. This evening, after about 22-1/2 hours, the Python finished successfully :-) BTW - beside the Python, 3 ATLAS tasks with 3 cores each were also running all the time. Which means: what I know so far is that obviously I can run Pythons at least on one of the two RTX3070, and other projects on the other one. Still, I will try to further investigate why GPUGRID Pythons don't run on both RTX3070. | |
| ID: 59297 | Rating: 0 | rate:
| |
|
I do not know how to properly mention the project administrators in this topic in order to draw attention to the problem of non-optimal use of disk space by this application.
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar -o"X:\BOINC\slots\0\"
Project tar.gz >> app files (5.46 GiB) = 5.46 GiB !
Moreover, if you use the 7z format for the archive instead of tar.gz (LZMA2 with the "5 - Normal" profile, which is the default in recent 7-Zip versions), then you not only significantly reduce the amount of data downloaded by each user (and consequently the load on the project's bandwidth), but also speed up unpacking the archive. The saving is more than one GiB. On my computer, unpacking by pipelining (as shown above) with the current (12-year-old) 7za version (9.20) takes ~100 seconds. With the recent version of 7za (22.01) it takes only ~45-50 seconds. And with the 7z format the unpacking command becomes a single step:
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.7z" -o"X:\BOINC\slots\0\"
I believe the described changes are worth implementing (even if not all of them and/or not at once). Moreover, the changes amount only to updating one executable file, repacking the archive and changing the command used to unpack it. | |
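For the repacking side, something along these lines would produce such a 7z archive (a sketch with an assumed directory name based on the slot layout; 'a' adds to an archive, -mx=5 is the default "Normal" level and -mmt=on enables multi-threaded compression):
7za a -t7z -mx=5 -mmt=on pythongpu_windows_x86_64__cuda1131.7z gpugridpy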
| ID: 59307 | Rating: 0 | rate:
| |
|
I believe the researcher has already been down this road with Windows not natively supporting the compression/decompression algorithms you mention. | |
| ID: 59308 | Rating: 0 | rate:
| |
It requires each volunteer to add support manually to their hosts. No. Unfortunately, you have not read what I wrote above very carefully. It was already mentioned there that the current Windows app already comes with 7za.exe version 9.20 (you can find it in the project folder). So nothing changes for volunteers. | |
| ID: 59309 | Rating: 0 | rate:
| |
|
Yes, I do have GPUGrid installed on my Win10 machine after all. | |
| ID: 59310 | Rating: 0 | rate:
| |
It requires each volunteer to add support manually to their hosts. OK, so you can thank Richard Haselgrove for the application to now package that utility. Originally, the tasks failed because Windows does not come with that utility and Richard helped debug the issue with the developer. If you think the application is not using the utility correctly you should inform the developer of your analysis and code fix so that other Windows users can benefit. | |
| ID: 59311 | Rating: 0 | rate:
| |
you should inform the developer of your analysis and code fix so that other Windows users can benefit. I have already sent abouh a PM about this thread, just in case. | |
| ID: 59312 | Rating: 0 | rate:
| |
|
Hello, thank you very much for your help. I would like to implement the changes if they help optimise the tasks, but let me try to summarise your ideas to see if I got them right:
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar
Change C --> Finally, you suggest using .7z compression instead of .tar.gz to save space and unpacking time with a more recent version of 7za.
Is all the above correct? I believe these changes are worth implementing, thank you very much. I will try to start with Change A and Change B and roll them out to PythonGPUbeta first to test them this week. ____________ | |
| ID: 59335 | Rating: 0 | rate:
| |
|
Looks good to me. Just one question - are there any 'minimum Windows version' constraints on the later versions of 7za? I think it's unlikely to affect us, but it would be good to check, just in case. | |
| ID: 59336 | Rating: 0 | rate:
| |
|
Hi, abouh!
Of course, if you launch 7za from the working directory (/slots/X), then the output flag is not necessary.
Change C: You are correct. Using the 7z format (LZMA2 compression) significantly reduces the archive size, saves your bandwidth and some time in the unpacking process ; ) As I wrote above, the 7za command will also be simplified, since pipelining will no longer be required. NB! It is important to update the supplied 7za to the current version: since version 9.20, a lot of optimizations have been made for compression/decompression of 7z (LZMA) archives.
As mentioned on the 7-Zip homepage, the app supports all Windows versions since Windows 2000:
| |
| ID: 59337 | Rating: 0 | rate:
| |
|
As a very first step I am trying to remove the .tar.gz file. I am encountering a first issue. The steps of the jobs are specified in the job.xml file in the following way: <job_desc> Essentially I need to execute a task that removes the pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17 file after the very first task. When I try in the Windows command prompt: cmd.exe /C "del pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" it works. However when I add to the job.xml file <task> The wrapper seems to ignore it. Doesn't the wrapper have cmd.exe? I need to run more tests to figure out the exact command to delete files ____________ | |
| ID: 59340 | Rating: 0 | rate:
| |
<task> Try to use %COMSPEC% variable as alias to %SystemRoot%\system32\cmd.exe If this doesn't work, then I'm sure specifying the full path(C:\Windows\system32\cmd.exe) should work. | |
| ID: 59341 | Rating: 0 | rate:
| |
|
in other news. looks like we've finally crunched through all the tasks ready to send. all that remains are the ones in progress and the resends that will come from those. | |
| ID: 59343 | Rating: 0 | rate:
| |
|
True! Specifying the whole path works: <job_desc> I have deployed this Change A into the PythonGPUbeta app, just to test that it works on all Windows machines. I just sent a few (32) jobs. If it works fine, I will move on to introduce the other changes. ____________ | |
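For illustration, the cleanup step is a wrapper task of roughly this shape (a sketch only -- the full job.xml is not reproduced here; the file name is the one from the earlier commands):
<task>
    <application>C:\Windows\system32\cmd.exe</application>
    <command_line>/C del pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17</command_line>
</task>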
| ID: 59347 | Rating: 0 | rate:
| |
|
I will be running new experiments shortly. My idea is to use the whole capacity of the grid. I have already noticed that a few months ago it could absorb around 800 tasks and now it goes up to 1000! Thank you for all the support :) | |
| ID: 59348 | Rating: 0 | rate:
| |
|
The first batch I sent to PythonGPUbeta yesterday failed, but I figured out the problem this morning. I just sent another batch an hour ago to the PythonGPUbeta app. This time seems to be working. It has Change A implemented, so memory usage is more optimised. | |
| ID: 59354 | Rating: 0 | rate:
| |
|
Hello Aleksey! | |
| ID: 59356 | Rating: 0 | rate:
| |
|
more tasks? I'm running dry ;) | |
| ID: 59357 | Rating: 0 | rate:
| |
|
More tasks please, also. | |
| ID: 59358 | Rating: 0 | rate:
| |
|
Hi, | |
| ID: 59359 | Rating: 0 | rate:
| |
|
Good day, abouh! This time seems to be working. It has Change A implemented. It's nice to hear that! Maybe tbz2 or txz? As I understand it, tbz2/txz are just aliases for the tar.bz2/tar.xz file extensions. So in fact these formats are tar containers compressed with bz2 or xz. Therefore, this will still require the pipelining process, which, however, has practically no effect on unpacking speed and only lengthens the command string. In my test, unpacking a tar.xz was done in ~40 seconds. It seems we can unpack these ones in a single step as well, if recent versions of 7za.exe can handle this format. The xz format has been supported since version 9.04 beta, but more recent versions support multi-threaded (de)compression, which is crucial for fast unpacking. The txz file is substantially smaller but took forever (30 mins) to compress. This format uses the LZMA2 algorithm, the same one 7z uses by default. So the space saving should be the same with the same settings (--compress-level). It's highly likely you forgot to use the flag --n-threads <n>, -j <n> to set the number of threads to use for compression. By default conda-pack uses only 1 thread! Also check --compress-level: levels higher than 5 are not so effective in terms of compression time vs. archive size. Considering that the PythonGPU app file rarely changes, it's not a big deal. As far as I remember, this (practically) does not affect unpacking speed. In my test (32 threads / Threadripper 2950X), it took ~2.5 minutes with compress-level 5 (archive size 1.55 GiB). | |
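A sketch of the kind of conda-pack call I mean (the environment name gpugridpy and the output file name are just assumptions based on the slot layout; -1 means "use all cores"):
conda pack -n gpugridpy -o pythongpu_windows_x86_64__cuda1131.tar.xz --n-threads -1 --compress-level 5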
| ID: 59360 | Rating: 0 | rate:
| |
why not producing a zip file, because the boinc client can unzip such file direct from the project folder to the slot like with acemd3. You're probably right. I somehow didn't pay attention to the acemd3 archives in the project directory. Is there any info on how BOINC works with archives? I suppose the boinc-client uses a built-in library to handle archives (zlib?), rather than some OS functions/tools. There's still a dilemma: 1) On the one hand, using the zip format would simplify the application launch process and reduce the amount of disk space the application requires (no need to copy the archive to the working directory). The amount of data written to disk is reduced accordingly. 2) On the other hand, the xz format reduces the archive size by a whole GiB, which helps save the project's network bandwidth and the time needed to download the files on a user's first access to the project. | |
| ID: 59361 | Rating: 0 | rate:
| |
On my test(32 threads / Threadripper 2950X), it took ~2,5 minutes with compress-level 5(archive size 1,55 GiB). It's about compression* | |
| ID: 59362 | Rating: 0 | rate:
| |
|
We tried to pack the files with zip at first but encountered problems on Windows. Not sure if it was some kind of strange quirk in the wrapper or in conda-pack (the tool for creating, packing and unpacking conda environments, https://conda.github.io/conda-pack/), but the process failed for compressed environment files above a certain size. | |
| ID: 59364 | Rating: 0 | rate:
| |
|
You were absolutely right, I forgot the number of threads! I could now reproduce a much faster compression as well. | |
| ID: 59365 | Rating: 0 | rate:
| |
|
Hi abouh, | |
| ID: 59366 | Rating: 0 | rate:
| |
|
7z.exe calls the DLL; 7za.exe stands alone. You can find it in the "7-Zip Extra" package at https://7-zip.org/download.html | |
| ID: 59367 | Rating: 0 | rate:
| |
All this has already been discussed several posts above. If you had read before writing...
I think this is not a good idea. Some antivirus software may treat an attempt to launch a cmd.exe that is not in the system directory as suspicious/malicious activity. | |
| ID: 59368 | Rating: 0 | rate:
| |
|
I added the discussed changes and deployed them to the PythonGPUbeta app. More specifically: | |
| ID: 59369 | Rating: 0 | rate:
| |
|
No, I haven't been lucky enough yet to snag any of the beta tasks. | |
| ID: 59370 | Rating: 0 | rate:
| |
|
One of my Linux machines has just crashed two tasks in succession with UnboundLocalError: local variable 'features' referenced before assignment https://www.gpugrid.net/results.php?hostid=508381 Edit - make that three. And a fourth looks to be heading in the same direction - many other users have tried it already. | |
| ID: 59371 | Rating: 0 | rate:
| |
|
Thanks for the warning Richard, I have just fixed the error. Should not be present in the jobs starting a few minutes from now. | |
| ID: 59372 | Rating: 0 | rate:
| |
|
Yes, the next one has got well into the work zone - 1.99%. Thank you. | |
| ID: 59373 | Rating: 0 | rate:
| |
|
Just an observation. | |
| ID: 59374 | Rating: 0 | rate:
| |
|
I tried to run 1 Python on a second BOINC instance. | |
| ID: 59375 | Rating: 0 | rate:
| |
|
My question is, how can 13 tasks run on a 12-thread machine? Is it a good idea to run other tasks? Also, why was Boinc not taking into account the GPUGrid task? | |
| ID: 59376 | Rating: 0 | rate:
| |
|
If the 13th task is assessed - by the project and BOINC in conjunction - to require less than 1.0000 of a CPU, it will be allowed to run in parallel with a fully occupied CPU. For a GPU task, it will run at a slightly higher CPU priority, so it will steal CPU cycles from the pure CPU tasks - but on a modern multitasking OS, they won't notice the difference. | |
| ID: 59377 | Rating: 0 | rate:
| |
I tried to run 1 Python on a second BOINC instance. i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed). if you want any kind of reasonable performance, you need to stop processing other projects on the same system. or at the very least, adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects. switch to Linux for even better performance. ____________ | |
| ID: 59379 | Rating: 0 | rate:
| |
|
Erich56 | |
| ID: 59380 | Rating: 0 | rate:
| |
|
Ian&Steve C. wrote: i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed). I agree, at the moment it may be "too much at once" :-) FYI, I recently bought another PC with 2 CPUs (8-c/8-HT each) and 1 GPU, upgraded the RAM from 128GB to 256GB and created a 128GB Ramdisk; and on an existing PC with a 10-c/10-HT CPU plus 2 RTX3070 I upgraded the RAM from 64GB to 128GB (= the maximum possible on this MoBo). So no surprise that I am now just testing what's possible. And by doing this, I keep finding out, of course, that sometimes I am expecting too much. As for the (low) speed of my two RTX3070: I have always been very conservative about GPU temperatures, which means I have them run at about 60/61°C, not higher. With two such GPUs inside the same box, heat is of course an issue. Despite good airflow, in order to keep the GPUs at the above-mentioned temperature I need to throttle them down to about 50-65% (different for each GPU). So this explains the longer runtimes of the Pythons. If I had two boxes with 1 RTX3070 inside each, I am sure there would be no need for throttling. | |
| ID: 59381 | Rating: 0 | rate:
| |
|
jjch wrote: Erich56 thanks for taking your time for dealing with my problem. well, by now it's become clear to me what the cause for failure was: obviously, running a Primegrid GPU task and Python on the same GPU does not work for the Python. After a Primegrid got finished, I started another Python, and it runs well. What concerns memory, you may have misunderstood: when I mentioned the 8GB, I meant to say that I could see in the Windows Task Manager that Python was using 8GB. Total RAM on this machine is 64GB, so more than enough. Also what concerns the swap space: I had set this manually to 100GB min. and 150 GB max., so also more than enough. Again - the problem has been detected anyway. Whereas I had no problem to run two Pythons on the same GPU (even 3 might work), it is NOT possible to have a Python run along with a Primegrid task. So for me, this was a good learning process :-) Again, thanks anyway for your time investigating my failed tasks. | |
| ID: 59382 | Rating: 0 | rate:
| |
|
I just discovered the following problem on the PC which consists of: | |
| ID: 59383 | Rating: 0 | rate:
| |
I just discovered the following problem on the PC which consists of: Meanwhile, the problem has become even worse: After downloading 1 Python, it starts and in the BOINC manager it shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all other tasks before on this machine. Hence, nothing else can be downloaded. When trying to download tasks from other projects, it shows not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full). When I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts", which is not correct; there are some 150 available for download at the moment. Can anyone give me advice on how to get this problem solved? | |
| ID: 59386 | Rating: 0 | rate:
| |
|
It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion. | |
| ID: 59387 | Rating: 0 | rate:
| |
It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion. But how come that on three other systems of mine, on which I have been running Pythons for a while, the "remaining runtimes" are shown pretty correctly (+/- 24 hours)? And also on the machine in question, until recently the time was indicated okay. Something must have happened yesterday, but I do not know what. If your assumption were right, no BOINC instance could run more than 1 Python in parallel. Didn't you say somewhere here in the forum that you are running 3 Pythons in parallel? How can a second and a third task be downloaded if the first one shows a remaining runtime of 30 or 60 days? What are the remaining runtimes shown for your Pythons once they get started? | |
| ID: 59388 | Rating: 0 | rate:
| |
|
Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to set my Resource Share much, much higher for GPUGrid for it to effectively share with other projects. I originally had Resource Shares of 160 for GPUGrid vs 10 for Einstein and 40 for TN-Grid. Since the Python tasks 'use' so much CPU time in particular (at least reported CPU time), it seems to affect the Resource Share calculations as well. I had to move my Resource Share of GPUGrid (for example) to 2,000 to get it both to do two at once and to get BOINC to share with Einstein and TN-Grid roughly the way I wanted. (Nothing magic about my Resource Share ratios; just providing an example of how extreme I went to get it to balance the way I wanted.) | |
| ID: 59389 | Rating: 0 | rate:
| |
|
No, that was my teammate who is running 3X concurrent on his gpus. | |
| ID: 59390 | Rating: 0 | rate:
| |
Regarding the estimated time to completion, I have not seen them correct on my system yet, though it is getting better. At first Python tasks were starting at 1338 days (!) and now are at 23 days to start. Interesting to hear some of yours are showing correct! What setup are you using in the hosts showing correct times? On one of my hosts a new Python started some 25 minutes ago. "Remaining time" is shown as 13 hrs. No particular setup. In past years, this host crunched numerous ACEMD tasks. Since a few weeks ago, it's been crunching Pythons. GTX980Ti. Besides that, 2 "Theory" tasks from LHC are running. | |
| ID: 59391 | Rating: 0 | rate:
| |
|
kksplace wrote: Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to set my Resource Share much, much higher for GPUGrid for it to effectively share with other projects. ... well, my target on this machine, in fact, is not to share Pythons with other projects. It would simply make me happy if I could run 2 (or perhaps 3) Pythons simultaneously. The hardware requirements should be sufficient. So, that said, I guess in this case the resource share would not play any role. BTW: as mentioned before, until some time early last week I did run two Pythons simultaneously on this PC. I have no idea, though, what the indicated remaining runtimes were. Most probably not as high as now, otherwise I could not have downloaded and started two Pythons in parallel. So any idea what I can do to make this machine run at least 2 Pythons (if not 3) ??? | |
| ID: 59392 | Rating: 0 | rate:
| |
|
I am limited on any technical knowledge and can only speak how I got mine to work with 2 tasks. Sorry I can't help anymore. As to getting 3 tasks, my understanding from other posts and my own attempt is that you can't without a custom client or some other behind-the-scenes work. The '2 tasks at one time' limit is a GPUGrid restriction somewhere. | |
| ID: 59393 | Rating: 0 | rate:
| |
|
Yes, the project has a max 2 tasks per gpu limit with project max of 16 tasks. | |
| ID: 59394 | Rating: 0 | rate:
| |
... Keith, just for my understanding: what exactly does the entry <cpu_usage>3.0</cpu_usage> do? | |
| ID: 59395 | Rating: 0 | rate:
| |
... Exactly what I said in my previous message. adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects. What Keith suggested would tell BOINC to reserve 3 whole CPU threads for each running PythonGPU task. ____________ | |
| ID: 59396 | Rating: 0 | rate:
| |
|
Hello! | |
| ID: 59397 | Rating: 0 | rate:
| |
It tells BOINC to take 3 cpus away from the available resources that BOINC thinks it has to work with. That tells BOINC not to commit resources it doesn't have to other projects, so that you aren't running the cpu overcommitted. It is only for BOINC scheduling of available resources. It does not impact the running of the Python task in any way directly. Only the scientific application itself determines how much cpu the task and application will use. You should never run a cpu in an overcommitted state because that means that EVERY application including internal housekeeping is constantly fighting for available resources and NONE are running optimally. IOW's . . . . slooooowwwly. You can check your average cpu loading or utilization with the uptime command in the terminal. You should strive to get numbers that are less than the number of cores available to the operating system. If you have a cpu that has 16 cores/32 threads available to the OS, you should strive to use only up to 32 threads over the averaging periods. The uptime command, besides printing out how long the system has been up and running, also prints out the 1 minute / 5 minute / 15 minute system load averages. As an example, on the AMD 5950X cpu in this daily driver, this is my uptime report:
keith@Pipsqueek:~$ uptime
 00:15:16 up 7 days, 14:41,  1 user,  load average: 30.16, 31.76, 32.03
The cpu is right at the limit of maximum utilization of its 32 threads. So I am running it at 100% utilization most of the time. If the averages were higher than 32, then that shows that the cpu is overcommitted, trying to do too much all the time and not running applications efficiently. | |
| ID: 59398 | Rating: 0 | rate:
| |
|
Thanks for the notice, abouh. Should make the Windows users a bit happier with the experience of crunching your work. | |
| ID: 59399 | Rating: 0 | rate:
| |
thanks, Keith, for the thorough explanation. Now everything is clear to me. As for CPU loading/utilization, so far I have been taking a look at the Windows Task Manager, which shows a (rough?) percentage at the top of the "CPU" column. However, for me the question still is how I could get my host with the vast hardware resources (as described here: https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59383) to run at least 2 Pythons concurrently - as was the case before ??? Isn't there a way to get these much too high "remaining time" figures back to realistic values? Or any other way to get more than 1 Python downloaded despite these high figures? | |
| ID: 59400 | Rating: 0 | rate:
| |
There isn't any way to get the estimated time remaining down to reasonable values as far as we know without a complete rewrite of the BOINC client code. Or ask @kksplace how he managed to do it. Try increasing your cache to 10 days of work and see if you pick up the second task. Are you running with 0.5 gpu_usage via the app_config.xml file example I posted? You can spoof 2 gpus being detected by BOINC which would automatically increase your gpu task allowance to 4 tasks. You need to modify the coproc_info.xml file and then lock it down to immutable state so BOINC can't rewrite it. Google spoofing gpus in the Seti and BOINC forums on how to do that. | |
| ID: 59403 | Rating: 0 | rate:
| |
Try to increase your amount of day's cache to 10 and see if you pick up the second task. Counterintuitively, this can actually cause the opposite reaction on a lot of projects. if you ask for "too much" work, some projects will just shut you out and tell you that no work is available, even when it is. I don't know why, I just know it happens. this is probably why he can't download work. I would actually recommend keeping this value no larger than 2 days. ____________ | |
| ID: 59404 | Rating: 0 | rate:
| |
|
I was assuming that GPUGrid was the only project on his host. | |
| ID: 59405 | Rating: 0 | rate:
| |
|
I think GPUGRID is one of the projects that reacts negatively to having the value too high. | |
| ID: 59406 | Rating: 0 | rate:
| |
I was assuming that GPUGrid was the only project on his host. at the time I was trying to download and crunch 2 Pythons: YES - no other projects running at that time. Meanwhile, until the problem get's solved, I have running 1 CPU and 1 GPU project on this host. | |
| ID: 59407 | Rating: 0 | rate:
| |