
Message boards : News : Experimental Python tasks (beta) - task description

abouh
Project administrator
Project developer
Project tester
Project scientist
Message 56977 - Posted: 17 Jun 2021 | 10:40:32 UTC

Hello everyone, just wanted to give some updates about the machine learning Python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta)" thread.

What are we trying to accomplish?
We are trying to train populations of intelligent agents in a distributed computational setting to solve reinforcement learning problems. The idea is inspired by the fact that human societies are knowledgeable as a whole, while individual agents have limited information. Every new generation of individuals attempts to expand and refine the knowledge inherited from previous ones, and the most interesting discoveries become part of a corpus of common knowledge. The idea is that small groups of agents will train on GPUGRID machines and report their discoveries and findings. Information from multiple agents can then be pooled and conveyed to new generations of machine learning agents. To the best of our knowledge this is the first time something of this sort has been attempted on a GPUGRID-like platform, and it has the potential to scale to solve problems unattainable in smaller-scale settings.
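The generational scheme described above can be sketched in a few lines. This is not the project's actual code (which isn't shown in the thread); it is a stdlib-only toy in which `evaluate` stands in for an episode return and each generation inherits and perturbs the best parameters found so far. All names here are hypothetical.

```python
import random

def evaluate(params):
    # Toy stand-in for an episode return: higher is better, optimum at 0.5 each.
    return -sum((p - 0.5) ** 2 for p in params)

def next_generation(parents, children_per_parent=4, noise=0.1):
    # Each group of agents perturbs the inherited parameters ("knowledge"),
    # and the best discoveries are reported back into the shared corpus.
    population = list(parents)  # keep the parents so knowledge is never lost
    for parent in parents:
        for _ in range(children_per_parent):
            population.append([p + random.gauss(0, noise) for p in parent])
    population.sort(key=evaluate, reverse=True)
    return population[:2]

random.seed(0)
corpus = [[random.random() for _ in range(4)] for _ in range(2)]
initial_best = evaluate(corpus[0])
for _ in range(20):  # twenty "generations" of distributed training
    corpus = next_generation(corpus)
final_best = evaluate(corpus[0])
```

Because parents are carried over, the best score in the corpus can only improve from one generation to the next.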

Why were most jobs failing a few weeks ago?
It took us some time and testing to make simple agents work, but we managed to solve the problems over the previous weeks. Now, almost all agents train successfully.

Why are GPUs being underutilized, and what are CPUs used for?
In the previous weeks we were running small-scale tests, with small neural network models that occupied little GPU memory. Also, some reinforcement learning environments, especially simple ones like those used in the tests, run on the CPU. Our plan is to scale to more complex models and environments to exploit the GPU capacity of the grid.

More information:
We mainly use PyTorch to train our neural networks. We use Tensorboard only because it is convenient for logging; we might remove that dependency in the future.
____________

bozz4science
Message 56978 - Posted: 17 Jun 2021 | 11:46:18 UTC
Last modified: 17 Jun 2021 | 12:08:24 UTC

Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier of the client software's computational abilities. Looking forward to contributing in the future, hopefully with more than I have at hand right now.

A couple of questions though:

1. As the main ML technique used for training the individual agents is neural networks, I wonder about the specifics of the whole setup. What does the learning data set look like? What activation functions do you use? Any optimisation or regularisation used?
2. Is it mainly about getting this kind of framework to work and then testing for its accuracy? How did you determine the model's base parameters to get started? How can you be sure that the initial model setup is getting you anywhere/is optimal? Or do you ultimately want to tune the final model and compare the accuracy of various reinforcement learning approaches?
3. Is there a way to gauge the future complexity of those prospective WUs at this stage? Similar runtimes to the current Bandit tasks?
4. What do you want to use the trained networks for? What are you trying to predict? Or, rephrased, what main use cases/fields of research are currently imagined for the final model? What do you envision to be "problems [so far] unattainable in smaller scale settings"?
5. What is the ultimate goal of this ML project? To have only one latest-generation trained agent group at the end, as the result of the continuous reinforcement learning iterations? Or to have several and test/benchmark them against each other?

Thx! Keep up the great work!

Ian&Steve C.
Message 56979 - Posted: 17 Jun 2021 | 13:26:58 UTC - in response to Message 56977.

Will you be utilizing the tensor cores present in the NVIDIA RTX cards? The tensor cores are designed for this kind of workload.
____________

phi1258
Message 56989 - Posted: 18 Jun 2021 | 11:21:31 UTC - in response to Message 56977.

This is a welcome advance. Looking forward to contributing.



ServicEnginIC
Message 56990 - Posted: 18 Jun 2021 | 12:04:08 UTC - in response to Message 56977.

Thank you very much for this advance.
I understand that for this kind of "singular" research only limited general guidelines can be given, or there is a risk of it not being singular any more...
Best wishes.

_heinz
Message 56994 - Posted: 20 Jun 2021 | 5:39:42 UTC
Last modified: 20 Jun 2021 | 5:43:47 UTC

Wish you success.
Regards, _heinz
____________

Erich56
Message 56996 - Posted: 21 Jun 2021 | 11:28:16 UTC - in response to Message 56979.

Ian&Steve C. wrote on June 17th:

will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.

I am curious what the answer will be.

Ian&Steve C.
Message 57000 - Posted: 22 Jun 2021 | 12:17:47 UTC

Also, can the team comment on not just GPU "under"utilization? These tasks have NO GPU utilization.

When will you start releasing tasks that do more than just CPU calculation? Are you aware that only CPU calculation is occurring and nothing happens on the GPU at all? I have never observed these new tasks to use the GPU, ever, even the tasks that take ~1 hour to crunch; it all happens on the single CPU thread allocated for the WU. 0% GPU utilization, and no gpugrid processes reported in nvidia-smi.
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 57009 - Posted: 23 Jun 2021 | 20:09:29 UTC

I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly, and it seems to be successful based on first trials, I'd suggest setting it up as a separate project.

MrS
____________
Scanning for our furry friends since Jan 2002

bozz4science
Message 57014 - Posted: 24 Jun 2021 | 10:32:37 UTC

This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. In my understanding, though, this is an ML project specifically set up to be trained on biomedical data sets, so I'd argue that the science being done is still bio-related nonetheless. I would highly appreciate feedback on the many great questions in this thread so far.

Retvari Zoltan
Message 57020 - Posted: 26 Jun 2021 | 7:53:10 UTC

https://www.youtube.com/watch?v=yhJWAdZl-Ck

mmonnin
Message 58044 - Posted: 10 Dec 2021 | 11:32:51 UTC

I noticed some Python tasks in my task history. All failed for me, and so far for everyone else. Has anyone completed any?

Example:
https://www.gpugrid.net/workunit.php?wuid=27100605

Richard Haselgrove
Message 58045 - Posted: 10 Dec 2021 | 11:56:26 UTC - in response to Message 58044.

Host 132158 is getting some. The first failed with:

File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run
sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd)))
NameError: name 'cmd' is not defined
----------------------------------------
ERROR: Failed building wheel for atari-py
ERROR: Command errored out with exit status 1:
command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py
cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/

Looks like a typo.

Keith Myers
Message 58058 - Posted: 11 Dec 2021 | 0:23:09 UTC

Shame the tasks are misconfigured. I ran through a dozen of them on one host, all with errors. With the scarcity of work, every little bit is appreciated and can be used.

We just got put back in good graces with a whitelist at Gridcoin, too.

Keith Myers
Message 58061 - Posted: 11 Dec 2021 | 2:16:29 UTC

@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed on Linux, and when it is, it is often not on the PATH; it probably needs to be exported into the user's environment.
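For anyone who wants to check whether the build tools that atari-py's setup.py shells out to are visible in their client's environment, a quick stdlib check (the tool list here is illustrative, and the BOINC wrapper may see a different PATH than your login shell):

```python
import shutil

# shutil.which() resolves a command against PATH, like `command -v` in a shell.
status = {tool: shutil.which(tool) for tool in ("make", "cmake")}
for tool, path in status.items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

Running this from the same environment the BOINC client uses (not just a terminal) is what matters.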

abouh
Message 58104 - Posted: 14 Dec 2021 | 16:55:30 UTC - in response to Message 58045.

Hello everyone, sorry for the late reply.

We detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.

The error was related to atari_py, as some users reported. More specifically, it occurred while installing this Python package from GitHub (https://github.com/openai/atari-py), which allows us to use some Atari2600 games as a test bench for reinforcement learning (RL) agents.

Sorry for the inconvenience. Even though the AI-agent part of the code has been tested and works, every time we need to test our agents in a new environment we need to modify the environment-initialisation part of the code to include the new environment, in this case atari_py.

I just sent another batch of 5 test jobs; 3 have already finished, and the others seem to be working without problems but have not yet finished.

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762


____________

Richard Haselgrove
Message 58112 - Posted: 15 Dec 2021 | 15:31:49 UTC - in response to Message 58104.

Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466.

abouh
Message 58114 - Posted: 15 Dec 2021 | 16:12:09 UTC - in response to Message 58112.

The error reported in the job with result ID 32730901 is due to a conda environment error detected and solved during previous testing rounds.

It is the one that talks about a dependency called "pinocchio" and detects conflicts with it.

It seems the conda misconfiguration persisted on some machines. To solve this error, it should be enough to click "reset" to reset the app.



____________

Richard Haselgrove
Message 58115 - Posted: 15 Dec 2021 | 16:56:36 UTC - in response to Message 58114.

OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download...

Richard Haselgrove
Message 58116 - Posted: 15 Dec 2021 | 19:29:54 UTC
Last modified: 15 Dec 2021 | 19:48:28 UTC

Task e1a15-ABOU_rnd_ppod_3-0-1-RND2976_3 was the first to run after the reset, but unfortunately it failed too.

Edit - so did e1a14-ABOU_rnd_ppod_3-0-1-RND3383_2, on the same machine.

This host also has 16 GB system RAM; the GPU is a GTX 1660 Ti.

Ian&Steve C.
Message 58117 - Posted: 15 Dec 2021 | 19:40:45 UTC - in response to Message 58114.
Last modified: 15 Dec 2021 | 19:43:12 UTC

I reset the project on my host. It still failed.

WU: http://gpugrid.net/workunit.php?wuid=27102456

I see that ServicEnginIC and I both had the same error; we also both have only 16GB of system memory on our hosts.

Aurum previously reported very high system memory use, but didn't elaborate on whether it was real or virtual.

However, I can confirm that it's real.

https://i.imgur.com/XwAj4s3.png

A lot of it seems to stem from the ~4GB used by the python run.py process, plus ~184MB for each of the 32 multiproc spawns that appear to be running. Not sure if these are intended to run, or if they are an artifact of setup that never got cleaned up.

I'm not certain, but it's possible that the task ultimately failed due to lack of resources, with both RAM and swap maxed out. Maybe the next system that gets it, with its 64GB, will succeed.

abouh, is it intended for these tasks to keep this much system memory in use, or is this just something leftover that was supposed to be cleaned up? It would be helpful to know the exact system requirements so that people with unsupported hardware do not try to run these tasks. If these tasks are going to use this much memory and all of the CPU cores, we should be prepared for that ahead of time.
____________

Keith Myers
Message 58118 - Posted: 15 Dec 2021 | 23:25:46 UTC - in response to Message 58117.

I couldn't get your imgur image to load, just a spinner.

Ian&Steve C.
Message 58119 - Posted: 16 Dec 2021 | 0:13:31 UTC - in response to Message 58118.

Yeah I get a message that Imgur is over capacity (first time I’ve ever seen that). Their site must be having maintenance or getting hammered. It was working earlier. I guess just try again a little later.
____________

mmonnin
Message 58120 - Posted: 16 Dec 2021 | 0:26:37 UTC

I've had two tasks complete on a host that was previously erroring out:

https://www.gpugrid.net/workunit.php?wuid=27102460
https://www.gpugrid.net/workunit.php?wuid=27101116

Between 12:45:58 UTC and 19:44:33 UTC, a task failed and then completed without any changes, resets, or anything from me.

Wildly different runtime/credit ratios; I would expect something in between.

Run time (s)   Credit       Credit/sec
 3,389.26      264,786.85   78/s
49,311.35       34,722.22   0.70/s

CUDA
26,635.40      420,000.00   15.77/s

abouh
Message 58123 - Posted: 16 Dec 2021 | 9:44:51 UTC - in response to Message 58117.

Hello everyone,

The reset was only meant to solve the error reported in e1a12-ABOU_rnd_ppod_3-0-1-RND1575_0 and other jobs, related to a dependency called "pinocchio". I have checked the jobs reported to have errors after resetting, and this error is not present in them.

Regarding the memory usage, it is real, as you report. The ~4GB comes from the main script containing the AI agent and the training process. The 32 multiproc spawns are intended; each one contains an instance of the environment the agent interacts with to learn. Some RL environments run on the GPU, but unfortunately the one we are working with at the moment does not. I get a total of 15GB locally when running one job, which could probably explain some job failures. Running all these environments in parallel is also more CPU-intensive, as mentioned. The process of training the AI interleaves phases of data collection from interactions with the environment instances (CPU-intensive) with phases of learning (GPU-intensive).

I will test locally whether the AI agent still learns while interacting with fewer instances of the environment at a time; that could help reduce the memory requirements of future jobs. For now, however, the most immediate jobs will have similar requirements.
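The collect/learn interleaving described here can be sketched as follows. This is not the project's code: it is a stdlib-only illustration in which a thread pool stands in for the 32 environment worker processes, and `env_step`, `collect_phase`, and `learn_phase` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_ENVS = 32          # one environment instance per worker, as in the jobs
STEPS_PER_PHASE = 8    # illustrative rollout length

def env_step(env_id, action):
    # Hypothetical stand-in for one environment step; the real app runs
    # Atari2600 environments in separate processes, not threads.
    return 1.0 if action == env_id % 2 else 0.0

def collect_phase(policy):
    # CPU-intensive phase: all environment instances step in parallel.
    batch = []
    with ThreadPoolExecutor(max_workers=NUM_ENVS) as pool:
        for _ in range(STEPS_PER_PHASE):
            actions = [policy(i) for i in range(NUM_ENVS)]
            batch.extend(pool.map(env_step, range(NUM_ENVS), actions))
    return batch

def learn_phase(state, batch):
    # GPU-intensive phase in the real app (a PyTorch update); here we just
    # aggregate the collected rewards.
    state["mean_reward"] = sum(batch) / len(batch)
    return state

state = learn_phase({}, collect_phase(lambda env_id: env_id % 2))
```

The alternation between a parallel, CPU-bound `collect_phase` and a serial `learn_phase` is what produces the observed pattern of low GPU use with short high-utilisation peaks.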


____________

abouh
Message 58124 - Posted: 16 Dec 2021 | 10:15:12 UTC - in response to Message 58120.

Yes, I was progressively testing how many steps the agents could be trained for, and I forgot to increase the credits proportionally to the training steps. I will correct that in the next batch; sorry, and thanks for letting us know.
____________

PDW
Message 58125 - Posted: 16 Dec 2021 | 10:23:45 UTC - in response to Message 58123.

On mine, free memory (as reported in top) dropped from approximately 25,500 (when running an ACEMD task) to 7,000.
That I can manage.

However, the task also spawns one process per machine thread (x), and from 1 to x of these processes can be running at any one time. The value of x is based on the machine's threads, not on what BOINC is configured to use; in addition, BOINC has no idea they exist, so it cannot take them into account for scheduling purposes. The result is that the machine can at times load the CPU up to twice as much as expected. That I can't manage, unless I run only one of these tasks with the machine doing nothing else, which isn't going to happen.
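A plausible cause (an assumption on my part, since the app's code isn't public): Python's multiprocessing sizes a default worker pool from the machine's hardware thread count, with no knowledge of BOINC's configured limit.

```python
import os

# A multiprocessing.Pool created with no argument, e.g. mp.Pool(), sizes
# itself from os.cpu_count(): one worker per hardware thread, regardless of
# how many threads BOINC has been told it may use.
machine_threads = os.cpu_count()
print(f"a default Pool on this machine would spawn {machine_threads} workers")
```

If that is what the app does, the spawn count will always track the hardware, exactly as observed.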

Ian&Steve C.
Message 58127 - Posted: 16 Dec 2021 | 14:18:23 UTC - in response to Message 58123.

Thanks for the clarification.

I agree with PDW that running work on all CPU threads, when BOINC expects at most 1 CPU thread to be used, will be problematic for most users who run CPU work from other projects.

In my case, I noticed that each spawn used only a little CPU, but I'm not sure if this is the case for everyone. You could in theory tell BOINC how much CPU these tasks use by setting a value over 1 in an app_config for the Python tasks. For example, it looks like only ~10% of a thread was being used, so for my 32-thread CPU that equates to about 4 threads' worth (rounding up from 3.2). So maybe something like:

<app>
    <name>PythonGPU</name>
    <gpu_versions>
        <cpu_usage>4</cpu_usage>
        <gpu_usage>1</gpu_usage>
    </gpu_versions>
</app>

You'd have to pick a cpu_usage value appropriate for your CPU use, and test to see if it works as desired.
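The arithmetic in that rule of thumb can be wrapped in a tiny helper (hypothetical, just formalising the estimate above): total CPU use is the number of spawns times the fraction of a thread each one uses, rounded up.

```python
import math

def cpu_usage_estimate(n_spawns, per_spawn_util, main_threads=0.0):
    """Suggest an app_config <cpu_usage> value.

    per_spawn_util is the fraction of one thread each spawn uses (e.g. 0.10
    for ~10%); main_threads optionally accounts for the main run.py process.
    """
    return math.ceil(main_threads + n_spawns * per_spawn_util)

print(cpu_usage_estimate(32, 0.10))      # 32 spawns at ~10% each -> 4
print(cpu_usage_estimate(32, 0.10, 1))   # plus one full main thread -> 5
```

Measure per_spawn_util on your own host (e.g. in top) rather than reusing someone else's figure, since it varies with clock speed and concurrent load.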
____________

Richard Haselgrove
Message 58132 - Posted: 16 Dec 2021 | 16:56:20 UTC - in response to Message 58127.

I agree with PDW that running work on all CPUs threads when BOINC expects at most that 1 CPU thread will be used will be problematic for most users who run CPU work from other projects.

The normal way of handling that is to use the [MT] (multi-threaded) plan class mechanism in BOINC - these trial apps are being issued using the same [cuda1121] plan class as the current ACEMD production work.

Having said that, it might be quite tricky to devise a combined [CUDA + MT] plan class. BOINC code usually expects a simple-minded either/or solution, not a combination. And I don't really like the standard MT implementation, which defaults to using every possible CPU core in the volunteer's computer. Not polite.

MT can be tamed by using an app_config.xml or app_info.xml file, but you may need to tweak both <cpu_usage> (for BOINC scheduling purposes) and something like a command line parameter to control the spawning behaviour of the app.

Ian&Steve C.
Message 58134 - Posted: 16 Dec 2021 | 18:20:00 UTC

Given the current state of these beta tasks, I have done the following on my 7-GPU, 48-thread system: allowed only 3 Python beta tasks to run, since the systems only have 64GB of RAM and each process uses ~20GB.

app_config.xml

<app_config>
    <app>
        <name>acemd3</name>
        <gpu_versions>
            <cpu_usage>1.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>PythonGPU</name>
        <gpu_versions>
            <cpu_usage>5.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
        <max_concurrent>3</max_concurrent>
    </app>
</app_config>


We'll see how it works out when more Python beta tasks flow, and I'll adjust as the project adjusts its settings.

abouh, before you start releasing more beta tasks, could you give us a heads-up on what we should expect and/or what you changed about them?
____________

Keith Myers
Message 58135 - Posted: 16 Dec 2021 | 18:22:58 UTC

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.

Ian&Steve C.
Message 58136 - Posted: 16 Dec 2021 | 18:52:22 UTC - in response to Message 58135.

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?
____________

Keith Myers
Message 58137 - Posted: 16 Dec 2021 | 19:14:26 UTC - in response to Message 58136.

I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.


Good to know Keith.

Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?

GPU utilization was at 3%. Each spawn used about 170MB of memory and fluctuated around 13-17% CPU utilization.

Ian&Steve C.
Message 58138 - Posted: 16 Dec 2021 | 19:18:43 UTC - in response to Message 58137.

Good to know. So what I experienced was pretty similar.

I'm sure you also had some other CPU tasks running too. I wonder if CPU utilization of the spawns would be higher if no other CPU tasks were running.
____________

Keith Myers
Message 58140 - Posted: 16 Dec 2021 | 21:00:08 UTC - in response to Message 58138.

Yes primarily Universe and a few TN-Grid tasks were running also.

abouh
Message 58141 - Posted: 17 Dec 2021 | 10:17:36 UTC - in response to Message 58134.

I will send some more tasks later today with similar requirements to the last ones: 32 reinforcement learning environments running in parallel for the agent to interact with.

For one job, locally I get around 15GB of system memory use, and 13%-17% utilisation of each CPU, as mentioned. For the GPU, usage fluctuates between low use (5%-10%) during the phases in which the agent collects data from the environments, and short high-utilisation peaks of a few seconds when the agent uses the data to learn (I get between 50% and 80%).

I will try to train the agents for a bit longer than in the last tasks. I have already corrected the credits of the tasks, in proportion to the number of interactions between the agent and the environments occurring in the tasks.

____________

Ian&Steve C.
Message 58143 - Posted: 17 Dec 2021 | 16:48:28 UTC - in response to Message 58141.

I got 3 of them just now. All failed with tracebacks after several minutes of run time; it seems there are still some bugs in the application. All wingmen are failing similarly:

https://gpugrid.net/workunit.php?wuid=27102526
https://gpugrid.net/workunit.php?wuid=27102527
https://gpugrid.net/workunit.php?wuid=27102525


The GPU (2080 Ti) was loaded at ~10-13% utilization, but at base clocks (1350MHz) and only ~65W power draw. GPU memory use was 2-4GB. System memory reached ~25GB while 2 tasks were running at the same time. CPU thread utilization was ~25-30% across all 48 threads (EPYC 7402P); it didn't cap at 32, about twice as much CPU utilization as expected, but maybe that's due to the relatively low clock speed of 3.35GHz. (I paused other CPU processing during this time.)
____________

Ian&Steve C.
Message 58144 - Posted: 17 Dec 2021 | 16:54:43 UTC - in response to Message 58143.
Last modified: 17 Dec 2021 | 16:58:05 UTC

The new one I just got seems to be doing better: less CPU use, and it looks like I'm seeing the mentioned 60-80% spikes on the GPU occasionally.

This one succeeded on the same host as the above three.

https://gpugrid.net/workunit.php?wuid=27102535
____________

abouh
Message 58145 - Posted: 17 Dec 2021 | 17:21:35 UTC - in response to Message 58144.
Last modified: 17 Dec 2021 | 17:26:54 UTC

I normally test the jobs locally first, and then run a couple of small batches of tasks on GPUGrid in case some error occurs that did not appear locally. The first small batch failed, so I could fix the error in the second one. Now that the second batch has succeeded, I will send a bigger batch of tasks.
____________

Keith Myers
Message 58146 - Posted: 17 Dec 2021 | 18:11:26 UTC

I must be crunching one of the fixed second batch currently on this daily driver. It seems to be progressing nicely.

It is using about 17GB of system memory, and GPU utilization spikes up to 97% every once in a while, with periods mostly spent around 12-17% and some brief spikes around 42%.

I got one of the first batch on another host; it failed fast with similar errors, along with all the wingmen.

Ian&Steve C.
Message 58147 - Posted: 17 Dec 2021 | 19:29:02 UTC

These new ones must be pretty long.

They have been running almost 2 hours now, with a lot higher VRAM use: over 6GB per task. Will GPUs with less than 6GB have issues?

But it also seems that some of the system memory used can be shared: running 1 task shows ~17GB of system memory in use, but running 5 tasks shows about 53GB. That's as far as I'll push it on my 64GB machines.
____________

kksplace
Message 58148 - Posted: 17 Dec 2021 | 21:08:46 UTC
Last modified: 17 Dec 2021 | 21:09:41 UTC

I got my first Python WU and am a little concerned. After 3.25 hours it is only 10% complete. GPU usage seems to be about what you all are saying, and same with CPU. However, I only have 8 cores/16 threads, with 6 other CPU work units running (TN-Grid and Rosetta 4.2). Should I be limiting the other work to let these run? (16GB RAM.)

Keith Myers
Message 58149 - Posted: 17 Dec 2021 | 23:27:43 UTC - in response to Message 58148.

I don't think BOINC knows how to interpret the estimated run times of these Python tasks. I wouldn't worry about it.

I am over 6.5 hours now on this daily driver, with 10% still showing. I bet they never show anything BUT 10% done until they finish.

Ian&Steve C.
Message 58150 - Posted: 18 Dec 2021 | 0:09:18 UTC - in response to Message 58149.

I had the same feeling, Keith
____________

Ian&Steve C.
Message 58151 - Posted: 18 Dec 2021 | 0:14:15 UTC
Last modified: 18 Dec 2021 | 0:15:02 UTC

also those of us running these, should probably prepare for VERY low credit reward.

This is something I have observed for a long time with beta tasks here. There seems to be some kind of anti-cheat mechanism (or bug) built into BOINC when using the default credit reward scheme (based on flops): if the calculated credit reward is over some value, the reward gets defaulted to some very low value. Since these are so long running, and beta, I fully expect to see this happen. I've reported this behavior in the past.

would be a nice surprise if not, but I have a strong feeling it'll happen.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58152 - Posted: 18 Dec 2021 | 1:14:41 UTC - in response to Message 58151.

I got one task early on that rewarded more than reasonable credit.
But the last one was way low but I thought I read a post from @abouh that he had made a mistake in the credit award algorithm and had corrected for that.
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#58124

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58153 - Posted: 18 Dec 2021 | 2:36:47 UTC - in response to Message 58152.
Last modified: 18 Dec 2021 | 3:02:51 UTC

That task was short though. The threshold is around 2 million credit reward if I remember correctly.

I posted about it in the team forum almost exactly a year ago. Don’t want to post some details publicly because it could encourage cheating. But for a long time credit reward of the beta tasks has been inconsistent and not calculated fairly IMO. Because the credit reward was so high, I noticed a trend that when the credit reward was supposed to be high enough (extrapolating the runtime with expected reward) it triggered a very low value. This only happened on long running (and hence potential high reward) tasks. Since these tasks are so long, I just think there’s a possibility we’ll see that again.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58154 - Posted: 18 Dec 2021 | 4:53:29 UTC - in response to Message 58151.
Last modified: 18 Dec 2021 | 5:23:09 UTC

confirmed.

Keith you just reported this one.

http://www.gpugrid.net/result.php?resultid=32731284

that value of 34,722.22 is the exact same "penalty value" I noticed a year ago, for 11hrs worth of work (clock time) and 28hrs of "cpu time". Interesting that the multithreaded nature of these tasks inflates the run time so much.

extrapolating from your successful run that did not hit a penalty, I'd guess that any task longer than about 2.5hrs is gonna hit the penalty value for these tasks. they really should just use the same credit scheme as acemd3. or assign static credit scaled for expected runtime, as long as all of the tasks are about the same size.

BOINC documentation confirms my suspicions on what's happening.

https://boinc.berkeley.edu/trac/wiki/CreditNew

Peak FLOP Count

This system uses the Peak-FLOPS-based approach, but addresses its problems in a new way.

When a job J is issued to a host, the scheduler computes peak_flops(J) based on the resources used by the job and their peak speeds.

When a client finishes a job and reports its elapsed time T, we define peak_flop_count(J), or PFC(J) as

PFC(J) = T * peak_flops(J)

The credit for a job J is typically proportional to PFC(J), but is limited and normalized in various ways.

Notes:

PFC(J) is not reliable; cheaters can falsify elapsed time or device attributes.
We use elapsed time instead of actual device time (e.g., CPU time). If a job uses a resource inefficiently (e.g., a CPU job that does lots of disk I/O) PFC() won't reflect this. That's OK. The key thing is that BOINC allocated the device to the job, whether or not the job used it efficiently.
peak_flops(J) may not be accurate; e.g., a GPU job may take more or less CPU than the scheduler thinks it will. Eventually we may switch to a scheme where the client dynamically determines the CPU usage. For now, though, we'll just use the scheduler's estimate.


One-time cheats

For example, claiming a PFC of 1e304.

This is handled by the sanity check mechanism, which grants a default amount of credit and treats the host with suspicion for a while.
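To put rough numbers on the quoted Peak FLOP Count formula, here is a simplified sketch of the accounting (my own simplification of CreditNew, which also normalizes across hosts and apps; the 13 TFLOPS peak rating is just an assumed figure for illustration):

```python
# Simplified sketch of BOINC's CreditNew peak-FLOP accounting.
# Cobblestone benchmark: a 1 GFLOPS host running a full day earns 200 credits.
COBBLESTONE_SCALE = 200 / (86400 * 1e9)

def pfc(elapsed_seconds, peak_flops):
    """Peak FLOP count for job J: PFC(J) = T * peak_flops(J)."""
    return elapsed_seconds * peak_flops

def claimed_credit(elapsed_seconds, peak_flops):
    # credit is typically proportional to PFC(J), before limits/normalization
    return pfc(elapsed_seconds, peak_flops) * COBBLESTONE_SCALE

# An 11-hour task on a GPU the scheduler rates at an assumed 13 TFLOPS peak
# claims on the order of a million credits -- in the region of the ~2M
# threshold discussed above, where the one-time-cheat fallback kicks in:
print(round(claimed_credit(11 * 3600, 13e12)))  # → 1191667
```

This is why only the long-running tasks trip the fallback: the claimed credit scales linearly with elapsed time, so short tasks stay safely below the sanity-check cutoff.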

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58155 - Posted: 18 Dec 2021 | 6:29:56 UTC

Yep, I saw that. Same credit as before and now I remember this bit of code being brought up before back in the old Seti days.

@Abouh needs to be made aware of this and assign fixed credit as what they do with acemd3.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,358,716,035
RAC: 855,775
Level
Trp
Scientific publications
watwatwat
Message 58157 - Posted: 18 Dec 2021 | 16:30:01 UTC
Last modified: 18 Dec 2021 | 16:45:56 UTC

Awoke to find 4 PythonGPU WUs running on 3 computers. All had OPN & TN-Grid WUs running with CPU use flat-lined at 100%. Suspended all other CPU WUs to see what PG was using and got a band mostly contained in the range 20 to 40%. Then I tried a couple of scenarios.
1. Rig-44 has an i9-9980XE 18c36t 32 GB with 16 GB swap file, SSD, and 2 x 2080 Ti's. The GPU use is so low I switched GPU usage to 0.5 for both OPNG and PG and reread config files. OPNG WUs started running and have all been reported fine. PG WUs kept running. Then I started adding back in gene_pcim WUs. When I exceeded 4 gene_pcim WUs the CPU use bands changed shape in a similar way to Rig-24 with a tight band around 30% and a number of curves bouncing off 100%.

2. Rig-26 has an E5-2699 22c44t 32 GB with 16 GB swap (yet to be used), SSD, and a 2080 Ti. I've added back 24 gene_pcim WUs and the CPU use band has moved up to 40-80% with no peaks hitting 100%. Next I changed GPU usage to 0.5 for both OPNG and PG and reread config files. Both seem to be running fine.

3. Rig-24 has an i7-6980X 10c20t 32 GB with a 16 GB swap file, SSD, and a 2080 Ti. This one has been running for 17 hours so far with the last 2 hours having all other CPU work suspended. Its CPU usage graph looks different. There's a tight band oscillating about 20% with a single band oscillating from 60 to 90%. Since PG wants 32 CPUs and this CPU only has 20 there's a constant queue for hyperthreading to feed in. I'll let this one run by itself hoping it finishes soon.

Note: TN-Grid usually runs great in Resource Zero Mode where it rarely ever sends more than one extra WU. With PG running and app_config reducing the max running WUs TN-Grid just keeps sending more WUs. Up to 280 now.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58158 - Posted: 18 Dec 2021 | 17:03:32 UTC - in response to Message 58157.
Last modified: 18 Dec 2021 | 17:11:37 UTC

I did something similar with my two 7xGPU systems.

limited to 5 tasks concurrently.

and set the app_config files up such that it would run either 3x Einstein per GPU, OR 1x Einstein + 1x GPUGRID, since the resources used by both are complementary.

set GPUGRID to 0.6 for GPU use (prevents two from running on the same GPU, 0.6+0.6 >1.0)
set Einstein to 0.33 for GPU use (allows three to run on a single GPU or one GPUGRID + one Einstein, 0.33+0.33+0.33<1.0, 0.6+0.33<1.0)
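For reference, the GPUGRID side of the fractions above translates to an app_config.xml along these lines (a sketch; `PythonGPU` is the app name these tasks report, but check your client_state.xml if unsure, and the Einstein project gets a matching file with `<gpu_usage>0.33</gpu_usage>`):

```xml
<!-- projects/www.gpugrid.net/app_config.xml (sketch of the setup above).
     0.6 + 0.6 > 1.0 keeps two PythonGPU tasks off the same GPU,
     while 0.6 + 0.33 < 1.0 still lets one Einstein task share it. -->
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>0.6</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```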

but running 5 tasks on a system with 64GB of system memory was too ambitious: RAM use was initially OK, but grew to fill system RAM and swap (default 2GB).

if these tasks become more common and plentiful, I might consider upgrading these 7xGPU systems to 128GB RAM so that they can handle running on all GPUs at the same time, but not going to bother if the project decides to reduce the system requirements or these pop up very infrequently.

the low credit reward per unit time due to the BOINC credit fail-safe default value should be fixed though. Not many people will have much incentive to test out the beta tasks at 10-20x less credit per unit time.

oh and these don't checkpoint properly (they checkpoint once very early on). if you pause a task that's been running for 20hrs, it restarts from that first checkpoint 20hrs ago.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58161 - Posted: 20 Dec 2021 | 10:29:54 UTC
Last modified: 20 Dec 2021 | 13:55:24 UTC

Hello everyone,

The batch I sent on Friday was successfully completed, even though some jobs initially failed several times and got reassigned.

I went through all failed jobs. Here I summarise some errors I have seen:

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.
2. Conda environment conflicts with package pinocchio. This one I talked about in a previous post. It requires resetting the app.
3. 'INTERNAL ERROR: cannot create temporary directory!' - I understand this one could be due to a full disk.

Also, based on the feedback I will work on fixing the following things before the next batch:

1. Checkpoints will be created more often during training. So jobs can be restarted and won’t go back to the beginning.
2. Credits assigned. The idea is to progressively increase the credits until the credit return becomes similar to that of the acemd jobs. However, devising a general formula to calculate them is more complex in this case. For now it is based on the total amount of data samples gathered from the environments and used to train the AI agent, but that does not take into account the size of the agent's neural networks. For now we will keep them fixed, but it might be necessary to adjust them to solve other problems.

Finally, I think I was a bit too ambitious regarding the total amount of training per job. I will break jobs down into two, so they don't take as long to complete.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58162 - Posted: 20 Dec 2021 | 14:55:18 UTC - in response to Message 58161.

thanks!

I did notice that all of mine failed with exceeded time limit.

might be a good idea to increase the estimated flops size of these tasks so BOINC knows that they are large and will run for a long time.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58163 - Posted: 20 Dec 2021 | 16:44:12 UTC - in response to Message 58161.

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.

I've tried to set preferences at all my less than 6GB RAM GPU hosts for not receiving Python Runtime (GPU, beta) app:

Run only the selected applications
ACEMD3: yes
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): yes
Python Runtime (CPU, beta): yes
Python Runtime (GPU, beta): no

If no work for selected applications is available, accept work from other applications?: no

But I've still received one more Python GPU task at one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...

Task e1a1-ABOU_rnd_ppod_8-0-1-RND5560_0

RuntimeError: CUDA out of memory.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58164 - Posted: 20 Dec 2021 | 17:12:00 UTC - in response to Message 58163.

This makes me doubt whether GPUGRID preferences are currently working as intended...

my question is a different one: now that the GPUGRID team is concentrating on Python, will there be no more ACEMD tasks?

Profile PDW
Send message
Joined: 7 Mar 14
Posts: 15
Credit: 1,000,002,525
RAC: 0
Level
Met
Scientific publications
watwatwatwatwat
Message 58166 - Posted: 20 Dec 2021 | 18:21:34 UTC - in response to Message 58163.

But I've still received one more Python GPU task at one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...


I had the same problem; you need to set 'Run test applications' to No.
It looks like having that set to Yes will override any specific application settings you make.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58167 - Posted: 20 Dec 2021 | 19:26:34 UTC - in response to Message 58166.

Thanks, I'll try

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58168 - Posted: 20 Dec 2021 | 19:53:57 UTC - in response to Message 58164.

This makes me doubt whether GPUGRID preferences are currently working as intended...

my question is a different one: now that the GPUGRID team is concentrating on Python, will there be no more ACEMD tasks?

Hard to say. Toni and Gianni both stated the work would be very limited and infrequent until they can fill the new PhD positions.

But there have been occasional "drive-by" drops of cryptic scout work I've noticed along with the occasional standard research acemd3 resend.

Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58169 - Posted: 21 Dec 2021 | 5:52:18 UTC - in response to Message 58168.

Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks.

Would be great if they work on Windows, too :-)

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58170 - Posted: 21 Dec 2021 | 9:56:28 UTC - in response to Message 58168.

Today I will send a couple of batches with short tasks for some final debugging of the scripts and then later I will send a big batch of debugged tasks.

____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58171 - Posted: 21 Dec 2021 | 9:57:51 UTC - in response to Message 58169.

The idea is to make it work for Windows in the future as well, once it works smoothly on linux.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58172 - Posted: 21 Dec 2021 | 15:44:20 UTC - in response to Message 58170.

Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58173 - Posted: 21 Dec 2021 | 16:47:02 UTC - in response to Message 58172.

Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB.


not sure what happened to it. take a look.

https://gpugrid.net/result.php?resultid=32731651
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58174 - Posted: 21 Dec 2021 | 17:16:54 UTC - in response to Message 58173.

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58175 - Posted: 21 Dec 2021 | 18:15:03 UTC - in response to Message 58174.

Looks like a needed package was not retrieved properly with a "deadline exceeded" error.


It's interesting, looking at the stderr output: it appears that this app communicates over the internet to send and receive data outside of BOINC, and to servers that do not belong to the project.

(i think the issue is that I was connected to my VPN checking something else and I left the connection active and it might have had an issue reaching the site it was trying to access)

Not sure how kosher that is. I think BOINC devs don't intend/desire this kind of behavior, and some people might have security concerns about the app doing these things outside of BOINC. It might be a little smoother to do all communication only between the host and the project, and only via the BOINC framework; if data needs to be uploaded elsewhere, it might be better for the project to do that on the backend.

just my .02
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,358,716,035
RAC: 855,775
Level
Trp
Scientific publications
watwatwat
Message 58176 - Posted: 21 Dec 2021 | 18:44:13 UTC - in response to Message 58161.

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.


I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58177 - Posted: 21 Dec 2021 | 18:58:56 UTC - in response to Message 58171.

The idea is to make it work for Windows in the future as well, once it works smoothly on linux.

okay, sounds good; thanks for the information

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58178 - Posted: 21 Dec 2021 | 19:12:20 UTC

I'm running one of the new batch and at first the task was only using 2.2GB of GPU memory, but now it has climbed back up to 6.6GB of GPU memory.

Much as the previous ones. I thought the memory requirements were going to be cut in half.

Consuming the same amount of system memory as before . . . maybe a couple of GB more in fact. Up to 20GB now.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,358,716,035
RAC: 855,775
Level
Trp
Scientific publications
watwatwat
Message 58179 - Posted: 21 Dec 2021 | 21:21:09 UTC

Just had one that's listed as "aborted by user." I didn't abort it.
https://www.gpugrid.net/result.php?resultid=32731704

It also says "Please update your install command." I've kept my computer updated. Is this something I need to do?

What's this? Something I need to do or not?
"FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`"

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58180 - Posted: 21 Dec 2021 | 23:12:16 UTC

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):

That error on 4 tasks right around 55 minutes on 3080Ti

The same PC/GPU has completed Python tasks before, one earlier that ran for 1900 seconds, and is running one now at 9hr. Util is around 2-3% and 6.5GB memory in nvidia-smi, 6.1GB in BOINC.

3070Ti has been running for 7:45 with 8% Util and same memory usage.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58181 - Posted: 22 Dec 2021 | 1:34:01 UTC - in response to Message 58179.

The ray errors are normal and can be ignored.
I completed one of the new tasks successfully. The one I commented on before.
14 hours of compute time.

I had another one that completed successfully but the stderr.txt was truncated and does not show the normal summary and boinc finish statements. Feels similar to the truncation that Einstein stderr.txt outputs have.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58182 - Posted: 22 Dec 2021 | 1:40:18 UTC - in response to Message 58176.

1. Detected multiple CUDA out of memory errors. Locally the jobs use 6GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.


I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.

I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.

I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.

I'm not doing anything at all in mitigation for the Python on GPU tasks other than to only run one at a time. I've been successful in almost all cases other than the very first trial ones in each evolution.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58183 - Posted: 22 Dec 2021 | 9:29:54 UTC - in response to Message 58178.
Last modified: 22 Dec 2021 | 9:30:08 UTC

What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

The GPU memory and system memory will remain the same in the next batches.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58184 - Posted: 22 Dec 2021 | 9:37:48 UTC - in response to Message 58175.
Last modified: 22 Dec 2021 | 9:43:47 UTC

During the task, the performance of the Agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. It immensely helps to understand the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way.

wandb has a python package extensively used in machine learning research, which we import in our scripts for this purpose.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58185 - Posted: 22 Dec 2021 | 9:43:04 UTC - in response to Message 58176.

Pinocchio probably only caused problems in a subset of hosts, as it was due to one of the first test batches having a wrong conda environment requirements file. It was a small batch.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58186 - Posted: 22 Dec 2021 | 10:07:45 UTC

My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660.

They've both completed and validated their first task, in around 10.5 / 11 hours.

But there's something odd about the result display in the task listing on this website - both the Run time and CPU time columns show the exact same value, and it's too large to be feasible: task 32731629, for example, shows 926 minutes of run time, but only 626 minutes between issue and return.

Tasks currently running locally show CPU time so far about 50% above elapsed time, which is to be expected from the description of how these tasks are designed to run. I suspect that something is triggering an anti-cheat mechanism: a task specified to use a single CPU core couldn't possibly use the CPU for longer than the run time, could it? But if so, it seems odd to 'correct' the elapsed time rather than the CPU time.

I'll take a look at the sched_request file after the next one reports, to see if the 'correction' is being applied locally by the BOINC client, or on the server.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58187 - Posted: 22 Dec 2021 | 11:25:13 UTC - in response to Message 58183.

What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.

The GPU memory and system memory will remain the same in the next batches.


Halved? I've got one at nearly 21.5 hours on a 3080Ti and still going

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58188 - Posted: 22 Dec 2021 | 15:39:07 UTC

This shows the timing discrepancy, a few minutes before task 32731655 completed.



The two valid tasks on host 508381 ran in sequence on the same GPU: there's no way they could have both finished within 24 hours if the displayed elapsed time was accurate.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58189 - Posted: 22 Dec 2021 | 15:47:48 UTC - in response to Message 58188.

I still think the 5,000,000 GFLOPs count is far too low, since these run for 12-24hrs depending on host (GPU speed does not seem to be a factor here, since GPU utilization is so low; they are most likely CPU/memory bound), and there seems to be a bit of a discrepancy in run time per task. I had a task run for 9hrs on my 3080Ti, while another user claims 21+ hrs on his 3080Ti. And I've had several tasks get killed around 12hrs for exceeded time limit, while others ran for longer. Lots of inconsistencies here.

the low flops count is causing a lot of tasks to prematurely get killed by BOINC for exceeded time limit when they would have completed eventually. the fact that they do not proceed past 10% completion until the end probably doesn't help.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58190 - Posted: 22 Dec 2021 | 16:27:52 UTC - in response to Message 58189.

Because this project still uses DCF, the 'exceeded time limit' problem should go away as soon as you can get a single task to complete. Both my machines with finished tasks are now showing realistic estimates, but with DCFs of 5+ and 10+ - I agree, the FLOPs estimate should be increased by that sort of multiplier to keep estimates balanced against other researchers' work for the project.

The screen shot also shows how the 'remaining time' estimate gets screwed up when the running value reaches something like 10 hours at 10%. Roll on intermediate progress reports and checkpoints.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58191 - Posted: 22 Dec 2021 | 17:05:06 UTC
Last modified: 22 Dec 2021 | 17:05:49 UTC

my system that completed a few tasks had a DCF of 36+

checkpointing also still isn't working. I had some tasks running for ~3hrs; restarted BOINC and they restarted at 5mins.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58192 - Posted: 22 Dec 2021 | 18:52:57 UTC - in response to Message 58191.

checkpointing also still isn't working.

See my screenshot.

"CPU time since checkpoint: 16:24:44"

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58193 - Posted: 22 Dec 2021 | 18:59:00 UTC

I've checked a sched_request when reporting.

<result>
<name>e1a26-ABOU_rnd_ppod_11-0-1-RND6936_0</name>
<final_cpu_time>55983.300000</final_cpu_time>
<final_elapsed_time>36202.136027</final_elapsed_time>

That's task 32731632. So it's the server applying the 'sanity(?) check' "elapsed time not less than CPU time". That's right for a single-core GPU task, but not right for a task with multithreaded CPU elements.
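The adjustment being inferred here can be sketched like this (an assumption reconstructed from the observed numbers, not the actual server code):

```python
# Hypothetical server-side sanity check, inferred from the observed results:
# for a task declared as 1 CPU + 1 GPU, CPU time exceeding elapsed time looks
# impossible, so the server raises the elapsed time to match the CPU time.
def sanitize_times(final_cpu_time, final_elapsed_time):
    if final_cpu_time > final_elapsed_time:
        final_elapsed_time = final_cpu_time  # "corrects" the wrong field
    return final_cpu_time, final_elapsed_time

# Values from the sched_request above: 55983 s CPU, 36202 s elapsed.
# Both website columns then show the same (inflated) run time.
print(sanitize_times(55983.3, 36202.136027))
```

For a genuinely multithreaded task the correct fix would be the reverse: trust the elapsed time and accept that CPU time can legitimately exceed it.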

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58194 - Posted: 23 Dec 2021 | 10:07:59 UTC - in response to Message 58187.

As mentioned by Ian&Steve C., GPU speed influences only partially task completion time.

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

In the last batch, I reduced the total amount of agent-environment interactions gathered and processed before ending the task with respect to the previous batch, which should have reduced the completion time.
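The alternating collect/learn loop described above can be sketched as follows (a toy illustration; all class and method names are invented, and a real job would hold PyTorch networks and run the update step on the GPU):

```python
import random

class ToyEnv:
    """Stand-in environment (illustration only)."""
    def reset(self):
        self.state = 0.0
        return self.state
    def step(self, action):
        self.state += action
        reward = -abs(self.state)  # reward staying near zero
        return self.state, reward

class ToyAgent:
    """Stand-in agent; a real job would hold neural networks here."""
    def act(self, obs):
        return random.choice([-1.0, 1.0])
    def update(self, batch):
        # placeholder for the GPU-bound learning step
        return sum(r for _, r in batch) / len(batch)

def train(agent, env, total_interactions, rollout_len):
    # alternate: 1) CPU-bound environment interaction, 2) GPU-bound learning
    obs, collected = env.reset(), 0
    while collected < total_interactions:
        batch = []
        for _ in range(rollout_len):      # interact with the environment
            obs, reward = env.step(agent.act(obs))
            batch.append((obs, reward))
        collected += len(batch)
        agent.update(batch)               # learn from the collected data
    return collected

print(train(ToyAgent(), ToyEnv(), 1000, 100))  # → 1000
```

Halving `total_interactions`, as described for the last batch, halves the number of loop iterations and hence the completion time, without changing per-iteration GPU or RAM demands.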
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58195 - Posted: 23 Dec 2021 | 10:09:32 UTC
Last modified: 23 Dec 2021 | 10:19:03 UTC

I will look into the reported issues before sending the next batch, to see if I can find a solution for both the problem of jobs being killed due to “exceeded time limit” and the progress and checkpointing problems.

From what Ian&Steve C. mentioned, I understand that increasing the estimated computation size (however BOINC calculates that) could solve the problem of jobs being killed?

Thank you very much for your feedback. Happy holidays to everyone!
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58196 - Posted: 23 Dec 2021 | 13:16:56 UTC - in response to Message 58195.

From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?

The jobs reach us with a workunit description:

<workunit>
<name>e1a24-ABOU_rnd_ppod_11-0-1-RND1891</name>
<app_name>PythonGPU</app_name>
<version_num>401</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>4000000000.000000</rsc_memory_bound>
<rsc_disk_bound>10000000000.000000</rsc_disk_bound>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-run</file_name>
<open_name>run.py</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-data</file_name>
<open_name>input.zip</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-requirements</file_name>
<open_name>requirements.txt</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-input_enc</file_name>
<open_name>input</open_name>
<copy_file/>
</file_ref>
</workunit>

It's the fourth line, '<rsc_fpops_est>', which causes the problem. The job size is given as the estimated number of floating point operations to be calculated, in total. BOINC uses this, along with the estimated speed of the device it's running on, to estimate how long the task will take. For a GPU app, it's usually the speed of the GPU that counts, but in this case - although it's described as a GPU app - the dominant factor might be the speed of the CPU. BOINC doesn't take any direct notice of that.

The jobs are killed when they reach the duration calculated from the next line, '<rsc_fpops_bound>'. A quick and dirty fix while testing might be to increase that value even above the current 50x the original estimate, but that removes a valuable safeguard during normal running.
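The arithmetic Richard describes can be sketched as follows. This is a hypothetical illustration of the relationship between the two workunit fields, not the BOINC client's actual code (which lives in its C++ sources); the function names are mine.

```python
# Hypothetical sketch of how BOINC turns the workunit fields above into a
# duration estimate and a kill threshold. Real client logic is in BOINC's
# C++ sources; these names are illustrative only.

def estimated_runtime_s(rsc_fpops_est: float, device_flops: float) -> float:
    """Initial runtime estimate: total operations / estimated device speed."""
    return rsc_fpops_est / device_flops

def abort_limit_s(rsc_fpops_bound: float, device_flops: float) -> float:
    """The task is killed once elapsed time reaches this bound."""
    return rsc_fpops_bound / device_flops

# Values from the workunit above, against a nominal 10 TFLOPS GPU:
gpu_flops = 10e12
est = estimated_runtime_s(5e15, gpu_flops)    # 500 s initial estimate
limit = abort_limit_s(2.5e17, gpu_flops)      # 25,000 s kill threshold

# The bound is 50x the estimate, as noted above:
print(limit / est)  # 50.0
```

If the real work is CPU-bound, as Richard suggests, the GPU-based estimate comes out far too low, so the elapsed time reaches the bound long before the task finishes.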

abouh
Message 58197 - Posted: 23 Dec 2021 | 15:57:01 UTC - in response to Message 58196.
Last modified: 23 Dec 2021 | 21:34:36 UTC

I see, thank you very much for the info. I asked Toni to help me adjust the "rsc_fpops_est" parameter. Hopefully the next jobs won't be aborted by the server.

Also, I checked the progress and the checkpointing problems. They were caused by format errors.

The python scripts were logging the progress into a "progress.txt" file but apparently BOINC wants just a file "progress" without extension.

Similarly, checkpoints were being generated, but were not identified correctly since they were not called "restart.chk".
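The naming convention described above can be sketched minimally. The file names ("progress" with no extension, "restart.chk") come from the post; the helper function and the atomic-rename detail are my assumptions, not the project's actual script.

```python
# Minimal sketch of reporting progress under the naming convention described
# above: the wrapper looks for a file literally named "progress" (no
# extension) containing the fraction done. The helper name is mine.
import os

def report_progress(fraction_done: float, slot_dir: str = ".") -> None:
    """Write the fraction done (0.0-1.0) to the 'progress' file."""
    tmp_path = os.path.join(slot_dir, "progress.tmp")
    final_path = os.path.join(slot_dir, "progress")  # not "progress.txt"
    with open(tmp_path, "w") as f:
        f.write(f"{fraction_done:.4f}")
    os.replace(tmp_path, final_path)  # atomic rename avoids partial reads

report_progress(0.109)
with open("progress") as f:
    print(f.read())  # 0.1090
```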

I will work on fixing these issues before the next batch of tasks.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58198 - Posted: 23 Dec 2021 | 19:35:37 UTC - in response to Message 58197.

Thanks @abouh for working with us in debugging your application and work units.

Nice to have an attentive researcher who is easy to work with.

Looking forward to the next batch.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58200 - Posted: 23 Dec 2021 | 21:20:01 UTC - in response to Message 58194.

Thank you for your kind support.

During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.

This behavior can be seen in some tests described in my Managing non-high-end hosts thread.

abouh
Message 58201 - Posted: 24 Dec 2021 | 10:02:52 UTC

I just sent another batch of tasks.

I tested locally and the progress and the restart.chk files are correctly generated and updated.

rsc_fpops_est job parameter should be higher too now.

Please let us know if you think the success rate of tasks can be improved in any other way. Thanks a lot for your help.
____________

ServicEnginIC
Message 58202 - Posted: 24 Dec 2021 | 10:35:31 UTC - in response to Message 58201.

I just sent another batch of tasks.

Thank you very much for this kind of Christmas present!

Merry Christmas to everyone crunchers worldwide 🎄✨

Richard Haselgrove
Message 58203 - Posted: 24 Dec 2021 | 11:38:42 UTC
Last modified: 24 Dec 2021 | 12:09:40 UTC

1,000,000,000 GFLOPs - initial estimate 1690d 21:37:58. That should be enough!

I'll watch this one through, but after that I'll be away for a few days - happy holidays, and we'll pick up again on the other side.

Edit: Progress %age jumps to 10% after the initial unpacking phase, then increments every 0.9%. That'll do.

ServicEnginIC
Message 58204 - Posted: 24 Dec 2021 | 12:51:06 UTC - in response to Message 58201.

I tested locally and the progress and the restart.chk files are correctly generated and updated.
rsc_fpops_est job parameter should be higher too now.

At a preliminary glance at one new Python GPU task received today:
- Progress estimation is now working properly, updating in 0.9% increments.
- Estimated computation size has risen to 1,000,000,000 GFLOPs, as also confirmed by Richard Haselgrove.
- Checkpointing also seems to be working, with checkpoints stored about every two minutes.
- The learning cycle period has dropped to 11 seconds, from the 21 seconds observed in the previous task (measured with sudo nvidia-smi dmon).
- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running on 4 GB RAM GPUs (?)
- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This extrapolates to a total execution time of about 7 hours and 41 minutes on my Host #569442

Well done!

Keith Myers
Message 58208 - Posted: 24 Dec 2021 | 16:43:12 UTC

Same observed behavior. Gpu memory halved, progress indicator normal and GFLOPS in line with actual usage.

Well done.

ServicEnginIC
Message 58209 - Posted: 24 Dec 2021 | 17:38:21 UTC - in response to Message 58204.

- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)

I'm answering myself: I enabled Python GPU task requests on my GTX 1650 SUPER 4 GB system, and I happened to catch this previously failed task e1a21-ABOU_rnd_ppod_13-0-1-RND2308_1
This task has passed the initial processing steps and has reached the learning cycle phase.
At this point, memory usage is right at the limit of the GPU's 4 GB of available RAM.
Waiting to see whether this task succeeds or not.
System RAM usage remains very high:
99% of the 16 GB available RAM on this system is currently in use.

Richard Haselgrove
Message 58210 - Posted: 24 Dec 2021 | 22:56:33 UTC - in response to Message 58204.

- Currrent progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28,9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes at my Host #569442

That's roughly the figure I got in the early stages of today's tasks. But task 32731884 has just finished with

<result>
<name>e1a17-ABOU_rnd_ppod_13-0-1-RND0389_3</name>
<final_cpu_time>59637.190000</final_cpu_time>
<final_elapsed_time>39080.805144</final_elapsed_time>

That's very similar to the one I reported in message 58193 (and on the same machine). So I don't think the task duration has changed much: maybe the progress %age isn't quite linear (but not enough to worry about).

abouh
Message 58218 - Posted: 29 Dec 2021 | 8:31:14 UTC

Hello,

Reviewing which jobs failed in the last batches, I have seen this error several times:

21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[152341] INTERNAL ERROR: cannot create temporary directory!
[152345] INTERNAL ERROR: cannot create temporary directory!
21:28:08 (152316): /usr/bin/flock exited; CPU time 0.147100
21:28:08 (152316): app exit status: 0x1
21:28:08 (152316): called boinc_finish(195


I have found an issue from Richard Haselgrove talking about this error: https://github.com/BOINC/boinc/issues/4125

It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?
____________

ServicEnginIC
Message 58219 - Posted: 29 Dec 2021 | 9:15:02 UTC - in response to Message 58218.

It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?

Right.
I gave a step-by-step solution based on Richard Haselgrove's finding in my Message #55986
It worked fine for all my hosts.

abouh
Message 58220 - Posted: 29 Dec 2021 | 9:26:29 UTC - in response to Message 58219.

Thank you!
____________

Richard Haselgrove
Message 58221 - Posted: 29 Dec 2021 | 10:38:21 UTC

Some new (to me) errors in https://www.gpugrid.net/result.php?resultid=32732017

"During handling of the above exception, another exception occurred:"

"ValueError: probabilities are not non-negative"

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58222 - Posted: 29 Dec 2021 | 16:57:53 UTC

It seems checkpointing still isn't working correctly.

Despite BOINC "claiming" that it checkpointed X seconds ago, stopping BOINC and restarting shows that it's not restarting from the checkpoint.

The task I currently have in progress was ~20% completed. I stopped BOINC and restarted, and it retained the time (elapsed and CPU time), but progress reset to 10%.
____________

Keith Myers
Message 58223 - Posted: 29 Dec 2021 | 17:40:37 UTC - in response to Message 58222.

I saw the same issue on my last task which was checkpointed past 20% yet reset to 10% upon restart.

ServicEnginIC
Message 58225 - Posted: 29 Dec 2021 | 23:05:12 UTC

- GPU dedicated RAM usage seems to have been reduced, but I don't know if enough for running at 4 GB RAM GPUs (?)

Two of my hosts with 4 GB dedicated RAM GPUs have succeeded their latest Python GPU tasks so far.
If the GPU RAM requirements are kept this way, it opens the app to a considerably greater number of hosts.

Also, I happened to catch two simultaneous Python tasks on my triple GTX 1650 GPU host.
I then urgently suspended requesting GPUGrid tasks in BOINC Manager... Why?
This host's system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).
I roughly estimate that the environment for each Python task takes about 16 GB of system RAM.
I guess that an eventual third concurrent task might have crashed itself, or even crashed all three Python tasks, due to lack of system RAM.
I was watching the Psensor readings when the first of the two Python tasks finished: the free system memory then drastically increased again, from 1% to 38%.

I also took an nvidia-smi screenshot, where it can be seen that the two Python tasks were running on GPU 0 and GPU 1 respectively, while GPU 2 was processing a PrimeGrid CUDA GPU task.

Ian&Steve C.
Message 58226 - Posted: 29 Dec 2021 | 23:24:23 UTC - in response to Message 58225.

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.
____________

abouh
Message 58227 - Posted: 30 Dec 2021 | 14:40:09 UTC - in response to Message 58222.

Regarding the checkpointing problem, the approach I follow is to check the progress file (if it exists) at the beginning of the Python script and then continue the job from there.


I have tested locally that stopping the task and executing the Python script again makes it continue from the same point where it stopped. So the script seems correct.


However, I think that right after setting up the conda environment, the progress is automatically set to 10% before my script runs, so I am guessing this is what is causing the problem. I have modified my code not to rely only on the progress file, since it might be overwritten to 10% after every conda setup.
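The resume logic described above can be sketched as follows. The file name restart.chk comes from the earlier posts; the checkpoint contents, helper names, and use of pickle are my illustrative assumptions, not the project's actual code.

```python
# Sketch of resume logic that treats the checkpoint file itself as the
# source of truth (since the "progress" file may be reset to 10% by the
# wrapper after each conda setup). Structure and names are illustrative.
import os
import pickle

def load_checkpoint(path: str = "restart.chk") -> dict:
    """Resume from the checkpoint if present, otherwise start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)   # e.g. {"iter": 1200, ...model state...}
    return {"iter": 0}

def save_checkpoint(state: dict, path: str = "restart.chk") -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)           # never leave a half-written checkpoint

state = load_checkpoint()           # {"iter": 0} on the first run
state["iter"] += 100                # ...do 100 iterations of work...
save_checkpoint(state)
print(load_checkpoint()["iter"])    # 100
```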
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58228 - Posted: 30 Dec 2021 | 22:35:23 UTC - in response to Message 58226.

now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.


The last two tasks on my system with a 3080Ti ran concurrently and completed successfully.
https://www.gpugrid.net/results.php?hostid=477247

Richard Haselgrove
Message 58248 - Posted: 6 Jan 2022 | 9:01:57 UTC

Errors in e6a12-ABOU_rnd_ppod_15-0-1-RND6167_2 (created today):

"wandb: Waiting for W&B process to finish, PID 334655... (failed 1). Press ctrl-c to abort syncing."

"ValueError: demo dir contains more than &#194;&#180;total_buffer_demo_capacity&#194;&#180;"

abouh
Message 58249 - Posted: 6 Jan 2022 | 10:01:11 UTC
Last modified: 6 Jan 2022 | 10:20:07 UTC

One user mentioned that he could not solve the error

INTERNAL ERROR: cannot create temporary directory!


This is the configuration he is using:

### Editing /etc/systemd/system/boinc-client.service.d/override.conf
### Anything between here and the comment below will become the new
contents of the file

PrivateTmp=true

### Lines below this comment will be discarded

### /lib/systemd/system/boinc-client.service
# [Unit]
# Description=Berkeley Open Infrastructure Network Computing Client
# Documentation=man:boinc(1)
# After=network-online.target
#
# [Service]
# Type=simple
# ProtectHome=true
# ProtectSystem=strict
# ProtectControlGroups=true
# ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
# Nice=10
# User=boinc
# WorkingDirectory=/var/lib/boinc
# ExecStart=/usr/bin/boinc
# ExecStop=/usr/bin/boinccmd --quit
# ExecReload=/usr/bin/boinccmd --read_cc_config
# ExecStopPost=/bin/rm -f lockfile
# IOSchedulingClass=idle
# # The following options prevent setuid root as they imply
NoNewPrivileges=true
# # Since Atlas requires setuid root, they break Atlas
# # In order to improve security, if you're not using Atlas,
# # Add these options to the [Service] section of an override file using
# # sudo systemctl edit boinc-client.service
# #NoNewPrivileges=true
# #ProtectKernelModules=true
# #ProtectKernelTunables=true
# #RestrictRealtime=true
# #RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# #RestrictNamespaces=true
# #PrivateUsers=true
# #CapabilityBoundingSet=
# #MemoryDenyWriteExecute=true
# #PrivateTmp=true #Block X11 idle detection
#
# [Install]
# WantedBy=multi-user.target


I was just wondering if there is any possible reason why it should not work
____________

Richard Haselgrove
Message 58250 - Posted: 6 Jan 2022 | 12:01:13 UTC - in response to Message 58249.

I am using a systemd file generated from a PPA maintained by Gianfranco Costamagna. It's automatically generated from Debian sources, and kept up-to-date with new releases automatically. It's currently supplying a BOINC suite labelled v7.16.17

The full, unmodified, contents of the file are

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target

That has the 'PrivateTmp=true' line in the [Service] section of the file, rather than isolated at the top as in your example. I don't know Linux well enough to know how critical the positioning is.

We had long discussions in the BOINC development community a couple of years ago, when it was discovered that the 'PrivateTmp=true' setting blocked access to BOINC's X-server based idle detection. The default setting was reversed for a while, until it was discovered that the reverse 'PrivateTmp=false' setting caused the problem creating temporary directories that we observe here. I think that the default setting was reverted to true, but the discussion moved into the darker reaches of the Linux package maintenance managers, and the BOINC development cycle became somewhat disjointed. I'm no longer fully up-to-date with the state of play.

Richard Haselgrove
Message 58251 - Posted: 6 Jan 2022 | 12:08:17 UTC - in response to Message 58249.

A simpler answer might be

### Lines below this comment will be discarded

so the file as posted won't do anything at all - in particular, it won't run BOINC!

abouh
Message 58253 - Posted: 7 Jan 2022 | 10:27:24 UTC - in response to Message 58248.

Thank you! I reviewed the code and detected the source of the error. I am currently working to solve it.

I will do local tests and then send a small batch of short tasks to GPUGrid to test the fixed version of the scripts before sending the next big batch.
____________

Richard Haselgrove
Message 58254 - Posted: 7 Jan 2022 | 18:13:15 UTC

Everybody seems to be getting the same error in today's tasks:

"AttributeError: 'PPODBuffer' object has no attribute 'num_loaded_agent_demos'"

Keith Myers
Message 58255 - Posted: 7 Jan 2022 | 19:48:11 UTC

I believe I got one of the fixed test tasks this morning, based on the short crunch time and valid report.

No sign of the previous error.

https://www.gpugrid.net/result.php?resultid=32732671

Richard Haselgrove
Message 58256 - Posted: 7 Jan 2022 | 19:56:15 UTC - in response to Message 58255.

Yes, your workunit was "created 7 Jan 2022 | 17:50:07 UTC" - that's a couple of hours after the ones I saw.

abouh
Message 58263 - Posted: 10 Jan 2022 | 10:26:02 UTC
Last modified: 10 Jan 2022 | 10:28:12 UTC

I just sent a batch that seems to fail with

File "/var/lib/boinc-client/slots/30/python_dependencies/ppod_buffer_v2.py", line 325, in before_gradients
if self.iter % self.save_demos_every == 0:
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'


For some reason it did not crash locally. "Fortunately" it will crash after only a few minutes, and it is easy to solve. I am very sorry for the inconvenience...

I will send also a corrected batch with tasks of normal duration. I have tried to reduce the GPU memory requirements a bit in the new tasks.
____________

Richard Haselgrove
Message 58264 - Posted: 10 Jan 2022 | 10:38:35 UTC - in response to Message 58263.
Last modified: 10 Jan 2022 | 10:58:56 UTC

Got one of those - failed as you describe.

Also has the error message "AttributeError: 'GWorker' object has no attribute 'batches'".

Edit - had a couple more of the broken ones, but one created at 10:40:34 UTC seems to be running OK. We'll know later!

FritzB
Send message
Joined: 7 Apr 15
Posts: 4
Credit: 50,436,830
RAC: 13,308
Level
Thr
Scientific publications
wat
Message 58265 - Posted: 10 Jan 2022 | 14:09:55 UTC - in response to Message 58264.

I got 20 bad WU's today on this host: https://www.gpugrid.net/results.php?hostid=520456


Stderr Ausgabe

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")

0%| | 0/45 [00:00<?, ?it/s]

concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
File "concurrent/futures/process.py", line 368, in _queue_management_worker
File "multiprocessing/connection.py", line 251, in recv
TypeError: __init__() missing 1 required positional argument: 'msg'
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "entry_point.py", line 69, in <module>
File "concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
File "concurrent/futures/_base.py", line 611, in result_iterator
File "concurrent/futures/_base.py", line 439, in result
File "concurrent/futures/_base.py", line 388, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[6689] Failed to execute script entry_point
13:25:58 (6392): /usr/bin/flock exited; CPU time 3.906269
13:25:58 (6392): app exit status: 0x1
13:25:58 (6392): called boinc_finish(195)

</stderr_txt>
]]>

Keith Myers
Message 58266 - Posted: 10 Jan 2022 | 16:33:22 UTC - in response to Message 58264.

I errored out 12 tasks created from 10:09:55 to 10:40:06.

Those all have the batch error.

But have 3 tasks created from 10:41:01 to 11:01:56 still running normally

Keith Myers
Message 58268 - Posted: 10 Jan 2022 | 19:39:01 UTC

And two of those were the batch error resends that now have failed.

Only 1 still processing that I assume is of the fixed variety. 8 hours elapsed currently.

https://www.gpugrid.net/result.php?resultid=32732855

Richard Haselgrove
Message 58269 - Posted: 10 Jan 2022 | 21:31:54 UTC - in response to Message 58268.

You need to look at the creation time of the master WU, not of the individual tasks (which will vary, even within a WU, let alone a batch of WUs).

abouh
Message 58270 - Posted: 11 Jan 2022 | 8:11:13 UTC - in response to Message 58265.
Last modified: 11 Jan 2022 | 8:11:37 UTC

I have seen this error a few times.

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.


Do you think it could be due to a lack of resources? I think Linux starts killing processes if you are over capacity.
____________

Keith Myers
Message 58271 - Posted: 12 Jan 2022 | 1:15:57 UTC

Might be the OOM-Killer kicking in. You would need to

grep -i kill /var/log/messages*

to check if processes were killed by the OOM-Killer.

If that is the case you would have to configure /etc/sysctl.conf to let the system be less sensitive to brief out of memory conditions.

Richard Haselgrove
Message 58272 - Posted: 12 Jan 2022 | 8:56:21 UTC

I Googled the error message, and came up with this stackoverflow thread.

The problem seems to be specific to Python, and arises when running concurrent modules. There's a quote from the Python manual:

"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."

Other search results may provide further clues.
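The constraint quoted from the Python manual boils down to two rules: the worker function must be defined at module top level (so it can be pickled by reference), and the pool must be created under a `__main__` guard. A minimal illustration, not GPUGrid's actual code:

```python
# Minimal illustration of the ProcessPoolExecutor constraint quoted above.
# The callable is pickled and sent to worker processes, so it must live at
# module top level; the pool must start under the __main__ guard.
from concurrent.futures import ProcessPoolExecutor

def step_environment(seed: int) -> int:
    """Stand-in for one reinforcement-learning environment step."""
    return seed * 2

if __name__ == "__main__":
    # Without this guard, each spawned worker would re-execute the module
    # and try to create its own pool, deadlocking or raising
    # BrokenProcessPool.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(step_environment, range(4)))
    print(results)  # [0, 2, 4, 6]
```

A worker killed externally (e.g. by the OOM killer) produces the same BrokenProcessPool exception, which is why both explanations come up in the search results.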

abouh
Message 58273 - Posted: 12 Jan 2022 | 15:11:50 UTC - in response to Message 58272.
Last modified: 12 Jan 2022 | 15:24:12 UTC

Thanks! Out of the possible explanations listed in the thread, I suspect it could be the OS killing the threads due to a lack of resources. It could be not enough RAM, or maybe Python raises this error if the ratio of processes to cores is high? (I have seen some machines with 4 CPUs, and each task spawns 32 reinforcement learning environments.)

All tasks run the same code, and on the majority of GPUGrid machines this error does not occur. Also, I have reviewed the failed jobs, and this error always occurs on the same hosts. So it is something specific to those machines. I will check if I can find a common pattern among all the hosts that get this error.
____________

Keith Myers
Message 58274 - Posted: 12 Jan 2022 | 16:46:57 UTC
Last modified: 12 Jan 2022 | 16:55:04 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

What kernel and OS?

Linux 5.11.0-46-generic x86_64
Ubuntu 20.04.3 LTS

I've had the errors on hosts with 32GB and 128GB. I would assume the hosts with 128GB to be in the clear with no memory pressures.

ServicEnginIC
Message 58275 - Posted: 12 Jan 2022 | 20:47:57 UTC

What version of Python are the hosts that have the errors running?

Mine for example is:

python3 --version
Python 3.8.10

Same Python version as current mine.

In case of doubt about conflicting Python versions, I published the solution that I applied to my hosts at Message #57833
It worked for my Ubuntu 20.04.3 LTS Linux distribution, but user mmonnin replied that this didn't work for him.
mmonnin kindly published an alternative way at his Message #57840

mmonnin
Message 58276 - Posted: 13 Jan 2022 | 2:31:57 UTC

I saw the prior post and was about to mention the same thing. Not sure which one works, as the PC has been able to run tasks.

The recent tasks are taking a really long time
2d13h 62,2% 1070 and 1080 GPU system
2d15h 60.4% 1070 and 1080 GPU system

2x concurrently on 3080Ti
2d12h 61.3%
2d14h 60.4%

abouh
Message 58277 - Posted: 13 Jan 2022 | 10:45:46 UTC - in response to Message 58274.

All jobs should use the same Python version (3.8.10); I define it in the requirements.txt file of the conda environment.

Here are the specs from 3 hosts that failed with the BrokenProcessPool error:

OS:
Linux Debian Debian GNU/Linux 11 (bullseye) [5.10.0-10-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u2)]
Linux Ubuntu Ubuntu 20.04.3 LTS [5.4.0-94-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.3)]
Linux Linuxmint Linux Mint 20.2 [5.4.0-91-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]

Memory:
32081.92 MB
32092.04 MB
9954.41 MB

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58278 - Posted: 13 Jan 2022 | 19:55:11 UTC

I have a failed task today involving pickle.

magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

When I was investigating the brokenprocesspool error I saw posts that involved the word pickle and the fixes for that error.

https://www.gpugrid.net/result.php?resultid=32733573

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 15,435,732
RAC: 23,739
Level
Pro
Scientific publications
wat
Message 58279 - Posted: 13 Jan 2022 | 21:18:41 UTC

The tasks run on my Tesla K20 for a while, but then fail when they need to use PyTorch, which requires a higher CUDA compute capability. Oh well. Guess I'll stick to the ACEMD tasks. The error output doesn't list the requirements properly, but from a little Googling, PyTorch was updated to require compute capability 3.7 within the past couple of years. The only Kepler card with 3.7 is the Tesla K80.

From this task:


[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/var/lib/boinc-client/slots/2/gpugridpy/lib/python3.8/site-packages/torch/cuda/__init__.py:120: UserWarning:
Found GPU%d %s which is of cuda capability %d.%d.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is %d.%d.
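For anyone checking their own card: PyTorch compares compute capabilities as (major, minor) pairs, and Python tuple comparison mirrors that ordering. A minimal sketch — the 3.7 minimum is taken from the Googling above, and the torch call shown in the comment is for machines that actually have PyTorch installed:

```python
# Hedged sketch: compare CUDA compute capabilities as (major, minor) tuples,
# the same ordering PyTorch uses when it rejects old GPUs.
# On a machine with PyTorch installed, the device tuple would come from:
#   import torch; cap = torch.cuda.get_device_capability(0)

MIN_SUPPORTED = (3, 7)  # minimum for recent PyTorch builds (assumption from the post above)

def meets_minimum(cap, minimum=MIN_SUPPORTED):
    """Python compares tuples lexicographically, so (3, 5) < (3, 7) < (5, 0)."""
    return cap >= minimum

print(meets_minimum((3, 5)))  # Tesla K20   -> False
print(meets_minimum((3, 7)))  # Tesla K80   -> True
print(meets_minimum((5, 0)))  # Quadro K620 -> True
```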


While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58280 - Posted: 13 Jan 2022 | 21:51:08 UTC - in response to Message 58279.

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.

with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.

within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that I can't remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5), even though the K20 is arguably the better card, with more/faster memory and more cores.

all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.
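The selection logic described above lives in BOINC's C++ client, but its effect can be sketched in a few lines of Python (the field values below are illustrative, not queried from real hardware, and the third tie-breaker metric is omitted since it isn't named here):

```python
# Hypothetical sketch of BOINC-style "best GPU" selection: rank by compute
# capability first, then memory capacity. Tuple keys compare left to right.
gpus = [
    {"model": "Tesla K20",   "cc": (3, 5), "mem_gb": 5},
    {"model": "Quadro K620", "cc": (5, 0), "mem_gb": 2},
]

best = max(gpus, key=lambda g: (g["cc"], g["mem_gb"]))
print(best["model"])  # -> Quadro K620: higher compute capability wins,
# despite the K20 having more memory and more cores.
```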
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58281 - Posted: 13 Jan 2022 | 22:58:05 UTC - in response to Message 58280.

While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.


this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.

with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.

within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that i cant remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5) even though the K20 is arguably the better card with more/faster memory and more cores.

all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.


It's often said to be the "best" card, but it's just the 1st.
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58282 - Posted: 13 Jan 2022 | 23:23:11 UTC - in response to Message 58281.
Last modified: 13 Jan 2022 | 23:23:48 UTC



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.
____________

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 15,435,732
RAC: 23,739
Level
Pro
Scientific publications
wat
Message 58283 - Posted: 14 Jan 2022 | 2:21:35 UTC - in response to Message 58280.

Ah, I get it. I thought it was just stuck, because it did have two K620s before. I didn't realize BOINC was just incapable of acknowledging different cards from the same vendor. Does this affect project statistics? The Milkyway@home folks are gonna have real inflated opinions of the K620 next time they check the numbers haha

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58284 - Posted: 14 Jan 2022 | 9:41:19 UTC - in response to Message 58278.

Interesting, I had seen this error once before locally, and I assumed it was due to a corrupted input file.

I have reviewed the task; it was solved by another host, but only after multiple failed attempts with this pickle error.

Thank you for bringing it up! I will review the code to see if I can find any bug related to that.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58285 - Posted: 14 Jan 2022 | 20:12:28 UTC - in response to Message 58284.

This is the document I had found about fixing the BrokenProcessPool error.

https://stackoverflow.com/questions/57031253/how-to-fix-brokenprocesspool-error-for-concurrent-futures-processpoolexecutor

I was reading it and stumbled upon the word "pickle" and the adjective "picklable", and thought it funny; I had never heard that word associated with computing before.

When the latest failed task mentioned pickle in the output, it tied it right back to all the previous BrokenProcessPool errors.

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 3,532,700,748
RAC: 145,659
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58286 - Posted: 14 Jan 2022 | 20:25:49 UTC

@abouh: Thank you for PM me twice!
The Experimental Python tasks (beta) succeed miraculously on my two Linux computers (which previously produced only errors) after several restarts of the GPUGRID.net project and the latest distro update this week.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58288 - Posted: 15 Jan 2022 | 22:24:17 UTC - in response to Message 58225.

Also, I happened to catch two simultaneous Python tasks on my triple GTX 1650 GPU host.
I then urgently suspended requesting GPUGRID tasks in BOINC Manager... Why?
This host's system RAM size was 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).

After upgrading system RAM from 32 GB to 64 GB on the above-mentioned host, it has successfully processed three concurrent ABOU Python GPU tasks:
e2a43-ABOU_rnd_ppod_baseline_rnn-0-1-RND6933_3 - Link: https://www.gpugrid.net/result.php?resultid=32733458
e2a21-ABOU_rnd_ppod_baseline_rnn-0-1-RND3351_3 - Link: https://www.gpugrid.net/result.php?resultid=32733477
e2a27-ABOU_rnd_ppod_baseline_rnn-0-1-RND5112_1 - Link: https://www.gpugrid.net/result.php?resultid=32733441

More details in Message #58287

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58289 - Posted: 17 Jan 2022 | 8:36:42 UTC

Hello everyone,

I have seen a new error in some jobs:


Traceback (most recent call last):
File "run.py", line 444, in <module>
main()
File "run.py", line 62, in main
wandb.login(key=str(args.wandb_key))
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 65, in login
configured = _login(**kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 268, in _login
wlogin.configure_api_key(key)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 154, in configure_api_key
apikey.write_key(self._settings, key)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/lib/apikey.py", line 223, in write_key
api.clear_setting("anonymous", globally=True, persist=True)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 75, in clear_setting
return self.api.clear_setting(*args, **kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 19, in api
self._api = InternalApi(*self._api_args, **self._api_kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 78, in __init__
self._settings = Settings(
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 23, in __init__
self._global_settings.read([Settings._global_path()])
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 110, in _global_path
util.mkdir_exists_ok(config_dir)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/util.py", line 793, in mkdir_exists_ok
os.makedirs(path)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/var/lib/boinc-client'
18:56:50 (54609): ./gpugridpy/bin/python exited; CPU time 42.541031
18:56:50 (54609): app exit status: 0x1
18:56:50 (54609): called boinc_finish(195)

</stderr_txt>


It seems like the task is not allowed to create new dirs inside its working directory. Just wondering if it could be some kind of configuration problem, like the "INTERNAL ERROR: cannot create temporary directory!" issue, for which a solution was already shared.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58290 - Posted: 17 Jan 2022 | 9:36:10 UTC - in response to Message 58289.

My question would be: what is the working directory?

The individual line errors concern

/home/boinc-client/slots/1/...

but the final failure concerns

/var/lib/boinc-client

That sounds like a mixed-up installation of BOINC: 'home' sounds like a location for a user-mode installation of BOINC, but '/var/lib/' would be normal for a service mode installation. It's reasonable for the two different locations to have different write permissions.

What app is doing the writing in each case, and what account are they running under?

Could the final write location be hard-coded, but the others dependent on locations supplied by the local BOINC installation?

Profile [VENETO] sabayonino
Send message
Joined: 4 Apr 10
Posts: 50
Credit: 631,849,096
RAC: 43
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58291 - Posted: 17 Jan 2022 | 12:51:27 UTC

Hi

I have the same issue regarding the BOINC directory (my BOINC dir is set up as ~/boinc).

So I cleaned up the ~/.conda directory and reinstalled the GPUGRID project in the BOINC client.

Now flock detects the right running BOINC directory, but I get this task error:

https://www.gpugrid.net/result.php?resultid=32734225

./gpugridpy/bin/python (I think this is in the boinc/slots/<N>/ folder)

The WU is running and 0.43% completed, but /home/<user>/boinc/slots/11/gpugridpy is still empty. No data are written.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58292 - Posted: 17 Jan 2022 | 15:28:21 UTC - in response to Message 58290.
Last modified: 17 Jan 2022 | 15:55:31 UTC

Right, so the working directory is

/home/boinc-client/slots/1/...


to which the script has full access. The script tries to create a directory to save the logs, but I guess it should not do it in

/var/lib/boinc-client


So I think the problem is just that the package I am using to log results by default saves them outside the working directory. Should be easy to fix.
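One possible fix, sketched under the assumption (from the traceback) that the logging package is Weights & Biases: wandb honours environment variables such as WANDB_DIR and WANDB_CONFIG_DIR, so pointing them at the slot/working directory before calling wandb.login() should keep all its writes inside the sandbox:

```python
import os

# Point wandb's data and config locations at the BOINC slot (working)
# directory *before* using wandb, so it never touches /var/lib/boinc-client.
workdir = os.getcwd()  # e.g. /home/boinc-client/slots/1
os.environ["WANDB_DIR"] = workdir          # run logs
os.environ["WANDB_CONFIG_DIR"] = workdir   # settings dir that triggered the OSError
os.environ["WANDB_CACHE_DIR"] = workdir    # artifact cache

# import wandb
# wandb.login(key=...)  # would now write under workdir (untested assumption)
print(os.environ["WANDB_CONFIG_DIR"])
```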
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58293 - Posted: 17 Jan 2022 | 15:55:05 UTC - in response to Message 58292.

BOINC has the concept of a "data directory". Absolutely everything that has to be written should be written somewhere in that directory or its sub-directories. Everything else must be assumed to be sandboxed and inaccessible.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58294 - Posted: 17 Jan 2022 | 16:17:56 UTC - in response to Message 58282.



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.


The PC now has a 1080 and a 1080 Ti, with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi, as has been the case with the other BOINC-displayed GPUs. The Ti is in the physical 1st slot.

This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58295 - Posted: 17 Jan 2022 | 21:07:22 UTC - in response to Message 58294.
Last modified: 17 Jan 2022 | 21:52:59 UTC



Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308

This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.


In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.


The PC now as 1080 and 1080Ti with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi and so have the other BOINC displayed GPUs. The Ti is in the physical 1st slot.

This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%.


What motherboard, and what version of BOINC? Your hosts are hidden, so I cannot inspect them myself. PCIe enumeration and ordering can be inconsistent on consumer boards. My server boards seem to enumerate starting from the slot furthest from the CPU socket, while most consumer boards are the opposite, with device0 at the slot closest to the CPU socket.

Or do you perhaps run a locked coproc_info.xml file? This would prevent any GPU changes from being picked up by BOINC, since it can't write to the coproc file.

edit:

Also, I forgot that most versions of BOINC incorrectly detect Nvidia GPU memory: they will all max out at 4GB due to a bug in BOINC. So to BOINC, your 1080Ti has the same amount of memory as your 1080. And since the 1080Ti is still a Pascal card like the 1080, it has the same compute capability, so you're still running into the same specs between them all.

To get it to sort properly, you need to fix the BOINC code, or use a GPU with a higher or lower compute capability. Put a Turing card in the system, not in the first slot, and BOINC will pick it up as GPU0.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58296 - Posted: 18 Jan 2022 | 19:03:55 UTC

The tests continue. Just reported e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1, with final stats

<result>
<name>e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1</name>
<final_cpu_time>107668.100000</final_cpu_time>
<final_elapsed_time>46186.399529</final_elapsed_time>

That's an average CPU core count of 2.33 over the entire run - that's high for what is planned to be a GPU application. We can manage with that - I'm sure we all want to help develop and test the application for the coming research run - but I think it would be helpful to put more realistic usage values into the BOINC scheduler.
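The 2.33 figure comes straight from the two numbers in the stats above; as a quick check:

```python
# Average CPU cores used = total CPU time / elapsed wall-clock time,
# using the final stats reported for this task.
final_cpu_time = 107668.100000      # seconds of CPU time across all threads
final_elapsed_time = 46186.399529   # wall-clock seconds

avg_cores = final_cpu_time / final_elapsed_time
print(round(avg_cores, 2))  # -> 2.33
```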

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 58297 - Posted: 19 Jan 2022 | 9:17:03 UTC - in response to Message 58296.

It's not a GPU application. It uses both CPU and GPU.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58298 - Posted: 19 Jan 2022 | 9:49:39 UTC - in response to Message 58296.

Do you mean changing some of the BOINC parameters like it was done in the case of <rsc_fpops_est>?

Is that to better define the resources required by the tasks?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58299 - Posted: 19 Jan 2022 | 11:03:54 UTC - in response to Message 58298.

It would need to be done in the plan class definition. Toni said that you define your plan classes in C++ code, so there are some examples in Specifying plan classes in C++.

Unfortunately, the BOINC developers didn't consider your use-case of mixing CPU elements and GPU elements in the same task, so none of the examples really match - your app is a mixture of MT and CUDA classes. What we need (or at least, would like to see) at this end are realistic values for <avg_ncpus> and <coproc><count>.

FritzB
Send message
Joined: 7 Apr 15
Posts: 4
Credit: 50,436,830
RAC: 13,308
Level
Thr
Scientific publications
wat
Message 58300 - Posted: 19 Jan 2022 | 19:00:18 UTC

It seems to work better now, but I've reached the time limit after 1800 sec:
https://www.gpugrid.net/result.php?resultid=32734648


19:39:23 (6124): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58301 - Posted: 19 Jan 2022 | 20:55:08 UTC

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I'm using:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>5.0</cpu_usage>
</gpu_versions>
</app>

for all my hosts and they seem to like that. Haven't had any issues.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58302 - Posted: 19 Jan 2022 | 22:28:41 UTC - in response to Message 58301.

I'm still running them at 1 CPU plus 1 GPU. They run fine, but when they are busy on the CPU-only sections, they steal time from the CPU tasks that are running at the same time - most obviously from CPDN.

Because these tasks are defined as GPU tasks, and GPU tasks are given a higher run priority than CPU tasks by BOINC ('below normal' against 'idle'), the real CPU project will always come off worst.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58303 - Posted: 20 Jan 2022 | 0:27:39 UTC - in response to Message 58302.
Last modified: 20 Jan 2022 | 0:28:14 UTC

You could employ ProcessLasso on the apps and up their priority I suppose.

When I ran Windows, I really utilized that utility to make the apps run the way I wanted them to, and not how BOINC sets them up on its own agenda.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58304 - Posted: 20 Jan 2022 | 6:46:45 UTC - in response to Message 58301.

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I think that the Python GPU app is very efficient at adapting to any number of CPU cores and taking advantage of available CPU resources.
This seems to be somewhat independent of the ncpus parameter in the GPUGRID app_config.xml.

Setup at my twin GPU system is as follows:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>0.49</cpu_usage>
</gpu_versions>
</app>

And setup for my triple GPU system is as follows:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>0.33</cpu_usage>
</gpu_versions>
</app>

The purpose of this is to be able to run, respectively, two or three concurrent Python GPU tasks without reaching a full "1" CPU core (2 x 0.49 = 0.98; 3 x 0.33 = 0.99). Then, I manually control CPU usage by setting "Use at most XX % of the CPUs" in BOINC Manager for each system, according to its number of CPU cores.
This allows me to run "N" concurrent Python GPU tasks and a fixed number of other CPU tasks, as desired.
But as said, the GPUGRID Python GPU app seems to take CPU resources as needed to successfully process its tasks... at the cost of slowing down the other CPU applications.
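The fractions above follow a simple rule: pick a per-task cpu_usage just under 1/N, so that N concurrent tasks still sum to less than one whole core. A small sketch of that arithmetic — the two-decimal rounding is my assumption about how the values were chosen:

```python
import math

def cpu_usage_for(n_tasks):
    """Largest two-decimal cpu_usage such that n_tasks * cpu_usage < 1.0,
    so BOINC never reserves a full CPU core for the group of tasks."""
    usage = math.floor(100 / n_tasks) / 100
    if usage * n_tasks >= 1.0:          # e.g. 2 x 0.50 would hit exactly 1.0
        usage = round(usage - 0.01, 2)  # step down one hundredth
    return usage

print(cpu_usage_for(2))  # -> 0.49  (2 x 0.49 = 0.98)
print(cpu_usage_for(3))  # -> 0.33  (3 x 0.33 = 0.99)
```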

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58305 - Posted: 20 Jan 2022 | 7:44:41 UTC

Yes, I use Process Lasso on all my Windows machines, but I haven't explored its use under Linux.

Remember that ncpus and similar has no effect whatsoever on the actual running of a BOINC project app - there is no 'control' element to its operation. The only effect it has is on BOINC's scheduling - how many tasks are allowed to run concurrently.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58306 - Posted: 20 Jan 2022 | 15:58:45 UTC - in response to Message 58300.

This message

19:39:23 (6124): task /usr/bin/flock reached time limit 1800


Indicates that, after 30 minutes, the installation of miniconda and the task environment setup have not been finished.

Consequently, python is not found later on to execute the task since it is one of the requirements of the miniconda environment.

application ./gpugridpy/bin/python missing


Therefore, it is not an error in itself, it just means that the miniconda setup went too slow for some reason (in theory 30 minutes should be enough time). Maybe the machine is slower than usual for some reason. Or the connection is slow and dependencies are not being downloaded.

We could extend this timeout, but normally, if 30 minutes is not enough for the miniconda setup, another underlying problem could exist.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58307 - Posted: 20 Jan 2022 | 16:18:58 UTC - in response to Message 58306.

It seems to be a reasonably fast system. My guess is another type of permissions issue that is blocking the Python install until it hits the timeout, or the CPUs are being too heavily used and not giving enough resources to the extraction process.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58308 - Posted: 20 Jan 2022 | 22:15:20 UTC - in response to Message 58305.

There is no Linux equivalent of Process Lasso.

But there is a Linux equivalent of Windows Process-Explorer

https://github.com/wolfc01/procexp

Screenshots of the application at the old SourceForge repo.

https://sourceforge.net/projects/procexp/

Can dynamically change the nice value of the application.

There is also the command line schedtool utility that can be easily implemented in a bash file. I used to run that all the time in my gpuoverclock.sh script for Seti cpu and gpu apps.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58309 - Posted: 21 Jan 2022 | 12:14:55 UTC - in response to Message 58308.

Well, that got me a long way.

There are dependencies listed for Mint 18.3 - I'm running Mint 20.2

The apt-get for the older version of Mint returns

E: Unable to locate package python-qwt5-qt4
E: Unable to locate package python-configobj

Unsurprisingly, the next step returns

Traceback (most recent call last):
File "./procexp.py", line 27, in <module>
from PyQt5 import QtCore, QtGui, QtWidgets, uic
ModuleNotFoundError: No module named 'PyQt5'

htop, however, shows about 30 multitasking processes spawned from main, each using around 2% of a CPU core (varying by the second) at nice 19. At the time of inspection, that is. I'll go away and think about that.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58310 - Posted: 21 Jan 2022 | 17:41:41 UTC - in response to Message 58300.

I have one task now that had the same timeout issue getting Python. The host was running fine on these tasks before, and I don't know what has changed.

I've aborted a couple tasks now that are not making any progress after 20 hours or so and are stuck at 13% completion. Similar series tasks are showing much more progress after only a few minutes. Most complete in 5-6 hours.

I reset the project thinking something got corrupted in the downloaded libraries but that has not fixed anything.

Need to figure out how to debug the tasks on this host.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58311 - Posted: 21 Jan 2022 | 17:42:23 UTC - in response to Message 58309.

You might look into schedtool as an alternative.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,358,716,035
RAC: 855,775
Level
Trp
Scientific publications
watwatwat
Message 58317 - Posted: 29 Jan 2022 | 21:23:39 UTC - in response to Message 58301.
Last modified: 29 Jan 2022 | 22:08:45 UTC

I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.

I'm using:

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>5.0</cpu_usage>
</gpu_versions>
</app>

for all my hosts and they seem to like that. Haven't had any issues.
Very interesting. Does this actually limit PythonGPU to using at most 5 CPU threads?
Does it work better than:
<app_config>
<!-- i9-7980XE 18c36t 32 GB L3 Cache 24.75 MB -->
<app>
<name>PythonGPU</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<avg_ncpus>5</avg_ncpus>
<cmdline>--nthreads 5</cmdline>
<fraction_done_exact/>
</app>
</app_config>
Edit 1: To answer my own question, I changed cpu_usage to 5 and am running a single PythonGPU WU with nothing else going on. The System Monitor shows 5 CPUs running in the 60 to 80% range, with all other CPUs running in the 10 to 40% range.
Is there any way to stop it from taking over one's entire computer?
Edit 2: I turned on WCG and the group of 5 went up to 100% and all the rest went to OPN in the 80 to 95% range.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58318 - Posted: 30 Jan 2022 | 5:24:25 UTC - in response to Message 58317.

No. Setting that value won't change how much CPU is actually used. It just tells BOINC how much of the CPU is being used, so that it can properly account for resources.

This app will use 32 threads and there’s nothing you can do in BOINC configuration to change that. This has always been the case though.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58320 - Posted: 2 Feb 2022 | 22:06:09 UTC

This morning, in a routine system update, I noticed that BOINC Client / Manager was updated from Version 7.16.17 to Version 7.18.1.
It would be interesting to know if PrivateTmp=true is set as a default at this new version, thus in some way helping for Python GPU task to succeed...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58321 - Posted: 2 Feb 2022 | 23:06:32 UTC - in response to Message 58320.

Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.

I'll check it out in the morning. v7.18.1 should be pretty good (it's been available for Android since August last year), but I don't yet know the answer to your specific question - there hasn't been any chatter about testing or new releases in the usual places.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58322 - Posted: 2 Feb 2022 | 23:47:29 UTC - in response to Message 58321.
Last modified: 2 Feb 2022 | 23:50:53 UTC

Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.

I'll check it out in the morning. v7.18.1 should be pretty good

It bombed out on the Rosetta pythons; they did not run at all (a VBox problem undoubtedly). And it failed all the validations on QuChemPedIA, which does not use VirtualBox on the Linux version. But it works OK on CPDN, WCG/ARP and Einstein/FGRBP (GPU). All were on Ubuntu 20.04.3.

So be prepared to bail out if you have to.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58324 - Posted: 3 Feb 2022 | 6:29:43 UTC - in response to Message 58321.

Which distro/repository are you using?

I'm using the regular repository for Ubuntu 20.04.3 LTS
I took screenshot of offered updates before updating.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58325 - Posted: 3 Feb 2022 | 9:25:23 UTC - in response to Message 58324.

My PPA gives slightly more information on the available update:



I know that it's auto-generated from the Debian package maintenance sources, which is probably the ultimate source of the Ubuntu LTS package as well. I've had a quick look round, but there's no sign so far that this release was originated by BOINC developers: in particular, no mention was made of it during the BOINC projects conference call on January 14th 2022. I'll keep digging.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58327 - Posted: 3 Feb 2022 | 12:13:36 UTC
Last modified: 3 Feb 2022 | 12:34:19 UTC

OK, I've taken a deep breath and enough coffee - applied all updates.

WARNING - the BOINC update appears to break things.

The new systemd file, in full, is

[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target

[Service]
Type=simple
ProtectHome=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true #Block X11 idle detection

[Install]
WantedBy=multi-user.target

Note the line I've picked out. That starts with a # sign, for comment, so it has no effect: PrivateTmp is undefined in this file.

New work became available just as I was preparing to update, so I downloaded a task and immediately suspended it. After the updates, and enough reboots to get my NVidia drivers functional again (it took three this time), I restarted BOINC and allowed the task to run.

Task 32736884

Our old enemy "INTERNAL ERROR: cannot create temporary directory!" is back. Time for a systemd over-ride file, and to go fishing for another task.

Edit - updated the file, as described in message 58312, and got task 32736938. That seems to be running OK, having passed the 10% danger point. Result will be in sometime after midnight.
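For anyone following along: the over-ride lives in its own drop-in file, which package updates leave alone. A minimal sketch of what `sudo systemctl edit boinc-client.service` ends up creating, assuming the PrivateTmp fix from message 58312:

```ini
# /etc/systemd/system/boinc-client.service.d/override.conf
# (created via: sudo systemctl edit boinc-client.service)
[Service]
PrivateTmp=true
```

Then restart the service (`sudo systemctl restart boinc-client`) to pick it up.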

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58328 - Posted: 3 Feb 2022 | 23:34:25 UTC

I see your task completed normally with the PrivateTmp=true uncommented in the service file.

But is the repeating warning:

wandb: WARNING Path /var/lib/boinc-client/slots/11/.config/wandb/wandb/ wasn't writable, using system temp directory

a normal entry for those using the standard BOINC location installation?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58329 - Posted: 4 Feb 2022 | 9:04:58 UTC - in response to Message 58328.

No, that's the first time I've seen that particular warning. The general structure is right for this machine, but it doesn't usually reach as high as 11 - GPUGrid normally gets slot 7. Whatever - there were some tasks left waiting after the updates and restarts.

I think this task must have run under a revised version of the app - the next stage in testing. The output is slightly different in other ways, and the task ran for a significantly shorter time than other recent tasks. My other machine, which hasn't been updated yet, got the same warnings in a task running at the same time.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58330 - Posted: 4 Feb 2022 | 9:14:25 UTC - in response to Message 58328.
Last modified: 4 Feb 2022 | 9:23:48 UTC

Oh, I was not aware of this warning.

"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.

I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.

Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58331 - Posted: 4 Feb 2022 | 9:25:40 UTC - in response to Message 58329.

Yes, this experiment uses a slightly modified version of the algorithm, which should be faster. It runs the same number of interactions with the reinforcement learning environment, so the credit amount is the same.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58332 - Posted: 4 Feb 2022 | 9:38:39 UTC - in response to Message 58330.

I'll take a look at the contents of the slot directory, next time I see a task running. You're right - the entire '/var/lib/boinc-client/slots/n/...' structure should be writable, to any depth, by any program running under the boinc user account.

How is the '.config/wandb/wandb/' component of the path created? The doubled '/wandb' looks unusual.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58333 - Posted: 4 Feb 2022 | 9:44:30 UTC - in response to Message 58332.
Last modified: 4 Feb 2022 | 9:55:30 UTC

The directory paths are defined as environment variables in the python script.

# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.path.join(os.getcwd(), ".config/wandb")


Then the directories are created by the wandb python package (which handles logging of relevant training data). I suspect the permissions are set when the directories are created, so it is not a BOINC problem. I will change the paths in future jobs to:

# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.getcwd()


Note that "os.getcwd()" is the working directory, so "/var/lib/boinc-client/slots/11/" in this case
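Since the warning came from wandb falling back when the path "wasn't writable", an up-front check can make that behaviour explicit. A sketch only; the `check_writable` helper is mine, not part of the task code:

```python
import os
import tempfile

def check_writable(path: str) -> bool:
    """Return True if a file can be created inside `path`."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating (and immediately discarding) a temp file proves writability
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False

# Mirror wandb's fallback: use the slot (working) directory if writable,
# otherwise the system temp directory.
wandb_dir = os.getcwd() if check_writable(os.getcwd()) else tempfile.gettempdir()
os.environ["WANDB_DIR"] = wandb_dir
```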
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58334 - Posted: 4 Feb 2022 | 13:32:42 UTC - in response to Message 58330.

Oh, I was not aware of this warning.

"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.

I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.

Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.


what happens if that directory doesn't exist? several of us run BOINC in a different location. since it's in /var/lib/ the process wont have permissions to create the directory, unless maybe if BOINC is run as root.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58335 - Posted: 4 Feb 2022 | 14:22:26 UTC - in response to Message 58334.

'/var/lib/boinc-client/' is the default BOINC data directory for Ubuntu BOINC service (systemd) installations. It most certainly exists, and is writable, on my machine, which is where Keith first noticed the error message in the report of a successful run. During that run, much will have been written to .../slots/11

Since abouh is using code to retrieve the working (i.e. BOINC slot) directory, the correct value should be returned for non-default data locations - otherwise BOINC wouldn't be able to run at all.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58336 - Posted: 4 Feb 2022 | 15:33:49 UTC - in response to Message 58335.
Last modified: 4 Feb 2022 | 15:39:39 UTC

I'm aware it's the default location on YOUR computer, and for others running the standard Ubuntu repository installer. But the message from abouh sounded like this directory was hard-coded, since he put in the entire path. And for folks running BOINC in another location, this directory will not be the same. If it uses a relative file path, then it's fine, but I was seeking clarification.

/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58337 - Posted: 4 Feb 2022 | 15:59:00 UTC - in response to Message 58336.
Last modified: 4 Feb 2022 | 16:21:03 UTC

Hard path coding was removed before this most recent test batch.

edit - see message 58292: "Should be easy to fix".

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58338 - Posted: 4 Feb 2022 | 22:13:21 UTC - in response to Message 58336.

/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.


Yes. I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client


I also do these to allow monitoring by BoincTasks over the LAN on my Win10 machine:
• Copy “cc_config.xml” to /etc/boinc-client folder
• Copy “gui_rpc_auth.cfg” to /etc/boinc-client folder
• Reboot

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58339 - Posted: 5 Feb 2022 | 9:10:09 UTC - in response to Message 58334.
Last modified: 5 Feb 2022 | 11:01:11 UTC

The directory should be created wherever you run BOINC, so that is not a problem.

It is created inside the /boinc-client directory, and it does not matter whether that directory is in /var/lib/ or somewhere else.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58340 - Posted: 5 Feb 2022 | 11:05:20 UTC - in response to Message 58338.
Last modified: 5 Feb 2022 | 11:05:38 UTC

I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
By doing so, you nullify your system's security provided by different access rights levels.
This practice should be avoided at all costs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58341 - Posted: 5 Feb 2022 | 11:50:02 UTC - in response to Message 58327.
Last modified: 5 Feb 2022 | 12:07:55 UTC

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

Edit - found the change log, but I'm none the wiser.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58342 - Posted: 5 Feb 2022 | 13:27:24 UTC - in response to Message 58340.

I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
By doing so, you nullify your system's security provided by different access rights levels.
This practice should be avoided at all costs.

I am on an isolated network behind a firewall/router. No problem at all.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58343 - Posted: 5 Feb 2022 | 13:28:42 UTC - in response to Message 58342.

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58344 - Posted: 5 Feb 2022 | 13:30:13 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

All I know is that the new build does not work at all on Cosmology with VirtualBox 6.1.32. A work unit just suspends immediately on startup.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58345 - Posted: 5 Feb 2022 | 13:30:54 UTC - in response to Message 58343.
Last modified: 5 Feb 2022 | 13:33:37 UTC

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58346 - Posted: 5 Feb 2022 | 13:34:08 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58347 - Posted: 5 Feb 2022 | 13:40:51 UTC - in response to Message 58345.
Last modified: 5 Feb 2022 | 13:41:07 UTC

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.
In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58348 - Posted: 5 Feb 2022 | 13:56:12 UTC - in response to Message 58347.

I am on an isolated network behind a firewall/router. No problem at all.
That qualifies as famous last words.

It has lasted for many years.

EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now.
In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58349 - Posted: 5 Feb 2022 | 14:08:17 UTC - in response to Message 58348.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58350 - Posted: 5 Feb 2022 | 14:11:10 UTC - in response to Message 58349.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?

What comparable isolation do you get in Windows from one program to another?
Or what security are you talking about? Port security from external sources?

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58351 - Posted: 5 Feb 2022 | 15:28:34 UTC - in response to Message 58350.

You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
Excuse me?
What comparable isolation do you get in Windows from one program to another?
Security descriptors were introduced in the NTFS 1.2 file system, released in 1996 with Windows NT 4.0. The access control lists in NTFS are more complex in some respects than those in Linux. All modern Windows versions use NTFS by default.
User Account Control was introduced in 2007 with Windows Vista (apps don't run as administrator, even if the user has administrative privileges, until the user elevates them through an annoying popup).
Or what security are you talking about? Port security from external sources?
The Windows firewall was introduced with Windows XP SP2 in 2004.

This is my last post in this thread about (undermining) filesystem security.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58352 - Posted: 5 Feb 2022 | 16:53:05 UTC - in response to Message 58346.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Updated my second machine. It appears that this re-release is NOT related to the systemd problem: the PrivateTmp=true line is still commented out.

Re-apply the fix (#1) from message 58312 after applying this update, if you wish to continue running the Python test apps.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58353 - Posted: 5 Feb 2022 | 16:54:05 UTC - in response to Message 58351.
Last modified: 5 Feb 2022 | 17:25:41 UTC

I think you are correct, except in the term "undermining", which is not appropriate for isolated crunching machines. There is a billion-dollar AV industry for Windows. Apparently someone has figured out how to undermine it there. But I agree that no more posts are necessary.

EDIT: I probably should have said that it was only for isolated crunching machines at the outset. If I were running a server, I would do it differently.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58354 - Posted: 5 Feb 2022 | 18:15:50 UTC
Last modified: 5 Feb 2022 | 18:16:08 UTC

While chmod 777-ing is a bad practice in general, there's little harm in opening up the BOINC directory like that. The worst that can happen is that you modify or delete a necessary file by accident and break BOINC. Just reinstall and learn the lesson. Not the end of the world in this instance.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58355 - Posted: 5 Feb 2022 | 19:20:07 UTC - in response to Message 58341.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.

Edit - found the change log, but I'm none the wiser.


Ubuntu 20.04.3 LTS is still on the older 7.16.6 version.

apt list boinc-client
Listing... Done
boinc-client/focal 7.16.6+dfsg-1 amd64

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58356 - Posted: 5 Feb 2022 | 19:26:13 UTC - in response to Message 58346.

I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.

Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Curious how your Ubuntu release got this newer version. I did a sudo apt update, then apt list boinc-client and apt show boinc-client, and still come up with the older 7.16.6 version.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58357 - Posted: 5 Feb 2022 | 22:22:11 UTC - in response to Message 58356.

I think they use a different PPA, not the standard Ubuntu version.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58358 - Posted: 5 Feb 2022 | 22:52:53 UTC - in response to Message 58356.

My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)

Curious how your Ubuntu release got this newer version. I did a sudo apt update and apt list boinc-client and apt show boinc-client and still come up with older 7.16.6 version.
It's from http://ppa.launchpad.net/costamagnagianfranco/boinc/ubuntu
Sorry for the confusion.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58359 - Posted: 5 Feb 2022 | 23:07:14 UTC - in response to Message 58357.

I think they use a different PPA, not the standard Ubuntu version.

You're right. I've checked, and this is my complete repository listing.
There are new pending updates for the BOINC package, but I've recently caught a new ACEMD3 ADRIA task, and I'm not updating until it is finished and reported.
My experience warns that these tasks are highly prone to failure if something is changed while they are processing.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58360 - Posted: 6 Feb 2022 | 8:10:43 UTC - in response to Message 58324.
Last modified: 6 Feb 2022 | 8:15:37 UTC

Which distro/repository are you using?

I'm using the regular repository for Ubuntu 20.04.3 LTS
I took screenshot of offered updates before updating.

Ah. Your reply here gave me a different impression. Slight egg on face, but both our Linux update manager screenshots fail to give source information in their consolidated update lists. Maybe we should put in a feature request?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58361 - Posted: 6 Feb 2022 | 12:39:46 UTC
Last modified: 6 Feb 2022 | 12:40:31 UTC

ACEMD3 task finished on my original machine, so I updated BOINC from PPA 2022-01-30 to 2022-02-04.

I can confirm that if you used systemctl edit to create a separate over-ride file, it remains in place - no need to re-edit every time. If you used a text editor to edit the raw systemd file in place, of course, it'll get over-written and will need editing again.

(final proof-of-the-pudding of that last statement awaits the release of the next test batch)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58362 - Posted: 6 Feb 2022 | 17:13:30 UTC

Got a new task (task 32738148). Running normally, confirms override to systemd is preserved.

Getting entries in stderr as before:

wandb: WARNING Path /var/lib/boinc-client/slots/7/.config/wandb/wandb/ wasn't writable, using system temp directory

(we're back in slot 7 as usual)

There are six folders created in slot 7:

agent_demos
gpugridpy
int_demos
monitor_logs
python_dependencies
ROMS

There are no hidden folders, and certainly no .config

wandb data is in:

/tmp/systemd-private-f670b90d460b4095a25c37b7348c6b93-boinc-client.service-7Jvpgh/tmp

There are 138 folders in there, including one called simply wandb

wandb contains:

debug-internal.log
debug.log
latest-run
run-20220206_163543-1wmmcgi5

The first two are files, the last two are folders. There is no subfolder called wandb - so no recursion of the kind the warning message's path suggests. Hope that helps.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58363 - Posted: 7 Feb 2022 | 8:13:08 UTC - in response to Message 58362.

Thanks! The content of the slot directory is correct.

The wandb directory will also be placed in the slot directory soon, in the next experiment. During the current experiment, which consists of multiple batches of tasks, the wandb directory will still be in /tmp, as a result of the warning.

That is not a problem per se, but I agree it will be cleaner to place it in the slot directory, so that all BOINC files are there.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58364 - Posted: 9 Feb 2022 | 9:56:19 UTC - in response to Message 58363.

wandb: Run data is saved locally in /var/lib/boinc-client/slots/7/wandb/run-20220209_082943-1pdoxrzo

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58365 - Posted: 10 Feb 2022 | 9:33:48 UTC - in response to Message 58364.
Last modified: 10 Feb 2022 | 9:34:28 UTC

Great, thanks a lot for the confirmation. So now it seems the directory is the appropriate one.
____________

SuperNanoCat
Send message
Joined: 3 Sep 21
Posts: 3
Credit: 15,435,732
RAC: 23,739
Level
Pro
Scientific publications
wat
Message 58367 - Posted: 17 Feb 2022 | 17:38:34 UTC

Pretty happy to see that my little Quadro K620s could actually handle one of the ABOU work units. Successfully ran one in under 31 hours. It didn't hit the memory too hard, which helps. The K620 has a DDR3 memory bus so the bandwidth is pretty limited.

http://www.gpugrid.net/result.php?resultid=32741283

Though, it did fail one of the Anaconda work units that went out. The error message doesn't mean much to me.

http://www.gpugrid.net/result.php?resultid=32741757


Traceback (most recent call last):
  File "run.py", line 40, in <module>
    assert os.path.exists('output.coor')
AssertionError
11:22:33 (1966061): ./gpugridpy/bin/python exited; CPU time 0.295254
11:22:33 (1966061): app exit status: 0x1
11:22:33 (1966061): called boinc_finish(195)

Profile [AF] fansyl
Send message
Joined: 26 Sep 13
Posts: 18
Credit: 1,638,641,441
RAC: 2,872
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58368 - Posted: 17 Feb 2022 | 20:12:35 UTC

All tasks error out on this machine: https://www.gpugrid.net/results.php?hostid=591484

Note that the machine does not have a GPU usable by BOINC.

Thanks for your help.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58369 - Posted: 18 Feb 2022 | 10:27:49 UTC - in response to Message 58368.

I got two of those yesterday as well. They are described as "Anaconda Python 3 Environment v4.01 (mt)" - declared to run as multi-threaded CPU tasks. I do have working GPUs (on host 508381), but I don't think these tasks actually need a GPU.

The task names refer to a different experimenter (RAIMIS) from the ones we've been discussing recently in this thread.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58370 - Posted: 18 Feb 2022 | 18:55:22 UTC

We were running those kinds of tasks a year ago. Looks like the researcher has made an appearance again.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58371 - Posted: 18 Feb 2022 | 21:12:05 UTC
Last modified: 18 Feb 2022 | 21:47:13 UTC

I just downloaded one, but it errored out before I could even catch it starting. It ran for 3 seconds, required four cores of a Ryzen 3950X on Ubuntu 20.04.3, and had an estimated time of 2 days. I think they have some work to do.
http://www.gpugrid.net/result.php?resultid=32742752

PS
- It probably does not help that that machine is running BOINC 7.18.1. I have had problems with it before. I will try 7.16.6 later.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58372 - Posted: 18 Feb 2022 | 22:14:30 UTC - in response to Message 58371.
Last modified: 18 Feb 2022 | 22:15:49 UTC

PPS - It ran for two minutes on an equivalent Ryzen 3950X running BOINC 7.16.6, and then errored out.

Drago
Send message
Joined: 3 May 20
Posts: 3
Credit: 17,596,560
RAC: 1,298
Level
Pro
Scientific publications
wat
Message 58373 - Posted: 22 Feb 2022 | 19:31:41 UTC - in response to Message 58372.

I just ran 4 of the Python CPU work units on my Ryzen 7 5800H, Ubuntu 20.04.3 LTS, 16 GB RAM. Each ran on 4 CPU threads at the same time. The first 0.6% took over 10 minutes, then they jumped to 10%, continued a while longer until 17 minutes were over, and then errored out all at more or less the same moment in the task. Here is one example: 32743954

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58374 - Posted: 23 Feb 2022 | 6:32:16 UTC - in response to Message 58373.

A RAIMIS MT task - which accounts for the 4 threads.

And yet -

Run
CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

NVIDIA GeForce RTX 3060 Laptop GPU (4095MB)

Traceback (most recent call last):
  File "/var/lib/boinc-client/slots/5/run.py", line 50, in <module>
    assert os.path.exists('output.coor')
AssertionError

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58380 - Posted: 24 Feb 2022 | 22:19:06 UTC

I am running two of the Anacondas now. They each reserve four threads, but are apparently only using one of them, since BoincTasks shows 25% CPU usage.

They have been running for two hours, and should complete in 14 hours total, though the estimates are way off and show 12 days. Therefore, they are running high priority even though they should complete with no problem.

Drago
Send message
Joined: 3 May 20
Posts: 3
Credit: 17,596,560
RAC: 1,298
Level
Pro
Scientific publications
wat
Message 58381 - Posted: 25 Feb 2022 | 18:29:21 UTC - in response to Message 58374.

Hey Richard. How is my GPU's memory involved in a CPU task at all?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58382 - Posted: 25 Feb 2022 | 19:21:40 UTC - in response to Message 58381.

Hey Richard. How is my GPU's memory involved in a CPU task at all?

It shouldn't be - that's why I drew attention to it. I think both AbouH and RAIMIS are experimenting with different applications, which exploit both GPUs and multiple CPUs.

It isn't at all obvious how best to manage a combination like that under BOINC - the BOINC developers only got as far as thinking about either/or, not both together.

So far, Abou seems to have got further down the road, but I'm not sure how much further development is required. We watch and wait, and help where we can.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58383 - Posted: 26 Feb 2022 | 15:08:37 UTC
Last modified: 26 Feb 2022 | 15:13:53 UTC

My first two Anacondas ended OK after 31 hours. But they were _2 and _3.
I am not sure what the error messages mean. Some ended after a couple of minutes, while others went longer.
http://www.gpugrid.net/results.php?hostid=593715

I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58384 - Posted: 27 Feb 2022 | 15:28:01 UTC - in response to Message 58383.

I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high.

It stopped making progress after running for a day and reaching 26% complete, so I aborted it. I will wait until they fix things before jumping in again. But my results were different than the others, so maybe it will do them some good.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58417 - Posted: 3 Mar 2022 | 16:29:10 UTC - in response to Message 58382.

Hello everyone! I am sorry for the late reply.

Now that most of my jobs seem to complete successfully, we decided to remove the "beta" flag from the app. I would like to thank you all for your help during the past months to reach this point. Obviously I will try to solve any further problems detected. In the future we will try to extend it to Windows, but we are not there yet.

Regarding the app requirements, from now on they will be similar to those in my last batches. In reinforcement learning there is in general no way around mixed CPU/GPU usage: most reinforcement learning environments run on the CPU, while the machine learning algorithms that teach agents to solve those environments use the GPU.
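That CPU-environment / GPU-learner split can be sketched with a toy example. This is illustrative only: `DummyEnv`, the policy callable, and all names here are hypothetical, not the project's actual code; in the real app the policy would be a neural network evaluated on the GPU.

```python
import random

class DummyEnv:
    """Stand-in for a CPU-bound RL environment (gym-style reset/step API)."""
    def reset(self):
        self.t = 0
        return 0.0                      # initial observation
    def step(self, action):
        self.t += 1
        obs, reward = float(self.t), random.random()
        done = self.t >= 10             # short episodes for illustration
        return obs, reward, done

def collect_rollout(env, policy, n_steps):
    """Environment stepping happens on the CPU; in the real app the
    policy call would be a neural-network forward pass on the GPU."""
    obs = env.reset()
    transitions = []
    for _ in range(n_steps):
        action = policy(obs)            # GPU work in the real app
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return transitions
```

This is why both resources are busy at once: the CPU threads keep stepping environments to feed the GPU with fresh experience.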

RAIMIS was experimenting with a different application. But the idea is that another beta app will be created for this purpose.

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58464 - Posted: 8 Mar 2022 | 18:06:03 UTC
Last modified: 8 Mar 2022 | 18:53:42 UTC

Is this a record?



Initial runtime estimate for:

e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
Python apps for GPU hosts beta v1.00 (cuda1131) for Windows

Task 32766826

Time to lie back and enjoy the popcorn for ... 11½ years ??!!

Edit - 36 minutes to download 2.52 GB, less than a minute to crash. Ah well, back to the drawing board.

08/03/2022 17:57:22 | GPUGRID | Started download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:03 | GPUGRID | Finished download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:26 | GPUGRID | Starting task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | [sched_op] Reason: Unrecoverable error for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | Computation for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 finished


Edit 2 - "application C:\Windows\System32\tar.exe missing". I can deal with that.

Download from https://sourceforge.net/projects/gnuwin32/files/tar/

NO - that wasn't what it said it was. Looking again.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58465 - Posted: 8 Mar 2022 | 19:37:16 UTC

No, this isn't working. Apparently, tar.exe is included in Windows 10 - but I'm still running Windows 7/64, and a copy from a W10 machine won't run ("Not a valid Win32 application"). Giving up for tonight - I've got too much waiting to juggle. I'll try again with a clearer head tomorrow.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58466 - Posted: 8 Mar 2022 | 23:51:15 UTC
Last modified: 8 Mar 2022 | 23:55:06 UTC

Yeah, estimates must have been astronomical, as I am at over 2 months time left at 3/4 completion on 2 tasks.

11:37 hr:min 79.3% 61d2h
10:04 hr:min 73.9% 77d2h

At 74.8% the 2nd task dropped down to 74d10h. Around 215d initial ETA?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58467 - Posted: 9 Mar 2022 | 8:54:44 UTC
Last modified: 9 Mar 2022 | 9:22:43 UTC

No need to go back to the drawing board, in principle. Here is what is happening:

1. The PythonGPU app should be stable now and remains available only for Linux (as until now). Jobs are being sent there and should work normally.

2. A new app, called PythonGPUbeta, has been deployed for both Linux and Windows. The idea is now to test the python jobs for Windows; this should be the main source of bugs to solve. Ultimately the idea is to have a common PythonGPU app for both OSes.

3. While PythonGPUbeta accepts Linux and Windows, I expect most errors to come from the Windows side.

Please, let me know if any of the above is not correct.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58468 - Posted: 9 Mar 2022 | 9:02:10 UTC - in response to Message 58464.
Last modified: 9 Mar 2022 | 9:28:51 UTC

In this new version of the app, we send the whole conda environment in a compressed file ONLY ONCE, and unpack it on the machine. The conda environment is what weighs around 2.5 GB (depending on whether the machine has cuda10 or cuda11). However, as long as the environment remains the same, there will be no need to re-download it for every job. This is how the acemd app works.

We are testing which compression format is best for our purpose. We tested first with a tar.bz2 file. For Linux there was no problem decompressing it.

For Windows, I tested locally on a Windows 10 laptop. I could decompress it successfully with tar.exe.

I am not sure what is happening with the estimates, but the estimation is obviously wrong. The test jobs should download the conda environment only in the first job, decompress it, and finally run a short python program using CPU and GPU. Are the Linux estimates also so exaggerated?
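Incidentally, Python's standard tarfile module auto-detects gzip and bzip2 compression, so the same unpack-once logic could work on any OS without an external tar.exe. A rough sketch (the marker-file name and function are made up for illustration, not the app's actual mechanism):

```python
import pathlib
import tarfile

def unpack_env_once(archive, dest):
    """Unpack the compressed conda environment only if not already present.

    Mode 'r:*' lets tarfile auto-detect gzip or bzip2, so the same code
    handles both .tar.gz and .tar.bz2 archives.
    """
    marker = pathlib.Path(dest) / ".env_unpacked"   # hypothetical marker file
    if marker.exists():
        return False                                # reuse previous unpack
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(dest)
    marker.touch()
    return True
```

The trade-off is speed: the bundled pure-Python extraction is slower than a native tar, but it removes the dependency on what each Windows version happens to ship.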
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58469 - Posted: 9 Mar 2022 | 9:07:09 UTC
Last modified: 9 Mar 2022 | 9:32:49 UTC

One problem we are facing, as Richard mentioned, is that Windows versions before W10 do not ship tar.exe.

Also, I have seen some jobs with the following error:

tar.exe: Error opening archive: Can't initialize filter; unable to run program "bzip2 -d"


In theory tar.exe is able to handle bzip2 files. We suspect it could be a problem with the PATH env variable (which we will test). Also, tar.gz could be a more compatible format for Windows.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58470 - Posted: 9 Mar 2022 | 9:35:49 UTC

Don't worry, it's only my own personal drawing board that I'm going back to!

Microsoft has form in this area. I remember buying a commercial copy of WinZip for use with Windows 3 - it arrived by post, on a single floppy disk. Later, they bought the company and incorporated it into Windows. Microsoft tend to do this very late in the day - hence my problems yesterday. I'll have a proper look round later today, and see if I can find a version which handles the bzip2 problem too.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58471 - Posted: 9 Mar 2022 | 9:52:48 UTC - in response to Message 58470.

Thank you very much! I will send a small batch of test jobs as soon as I can, to check whether the bzip2 error on Windows 10 is caused by an erroneous PATH variable. And the next step will be trying tar.gz as mentioned.

____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58472 - Posted: 9 Mar 2022 | 11:45:54 UTC

How about some checkpoints? I had a python task that was nearly completed; an ACEMD4 task downloaded next with like 8 billion days ETA. It interrupted the python task. 14 hours of work, and it went back to 10%. I only have a 0.05 day work queue on that client, so the python app was at least 95% complete.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58473 - Posted: 9 Mar 2022 | 14:20:01 UTC - in response to Message 58472.
Last modified: 9 Mar 2022 | 14:43:41 UTC

Was it a PythonGPU task for Linux, mmonnin? I have checked your recent jobs; they seemed to be successful.


PythonGPU task checkpointing was working before. It was discussed previously in the forum. I tested it locally back then and it worked fine. Has checkpointing failed for anyone else? Please let me know in that case.


I have sent a small batch of tasks for PythonGPUbeta, to test if some errors on Windows are now solved. Will keep iterating in small batches for the beta app.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58474 - Posted: 9 Mar 2022 | 15:29:20 UTC - in response to Message 58473.
Last modified: 9 Mar 2022 | 15:42:54 UTC

I have a python task for Linux running, recently started.

It's reported that it's checkpointing properly:

CPU time 00:33:10
CPU time since checkpoint 00:01:33
Elapsed time 00:33:27

but that isn't the acid test: the question is whether it can read back the checkpoint data files when restarted.

I'll pause it after a checkpoint, let the machine finish the last 20 minutes of the task it booted aside, and see what happens on restart. Sometimes BOINC takes a little while to update progress after a pause - you have to watch it, not just take the first figure you see.

Results will be reported in task 32773760 overnight, but I'll post here before that.

Edit - looks good so far: restart.chk, progress, run.log all present with a good timestamp.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58475 - Posted: 9 Mar 2022 | 15:37:59 UTC - in response to Message 58474.
Last modified: 9 Mar 2022 | 15:40:32 UTC

Perfect, thanks! It can indeed take a little while for the progress to update after a pause.

The PythonGPU tasks' progress is defined by a target number of interactions between the AI agent and the environment in which it is trained, generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have already been executed.

After resuming, the script looks for these progress and checkpoint files and continues counting from there.
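A minimal sketch of that progress/checkpoint scheme: the `progress` file name matches the one Richard saw in the slot directory, but the JSON layout and function names here are assumptions for illustration, not the task's actual code.

```python
import json
import os

PROGRESS_FILE = "progress"      # hypothetical layout; the real file may differ
TARGET_STEPS = 25_000_000       # ~25M agent-environment interactions per job

def load_progress():
    """Resume from the last recorded step count if one exists, else start at 0."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f)["steps_done"]
    return 0

def save_progress(steps_done):
    """Write progress atomically so a mid-write kill can't corrupt the file."""
    tmp = PROGRESS_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"steps_done": steps_done}, f)
    os.replace(tmp, PROGRESS_FILE)

# BOINC's fraction-done would then be steps_done / TARGET_STEPS.
fraction_done = load_progress() / TARGET_STEPS
```

The atomic rename matters here: BOINC can suspend or kill the task at any moment, and a half-written progress file would make the restart unreliable.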

However, Richard, note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how those do their checkpointing.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58476 - Posted: 9 Mar 2022 | 16:10:04 UTC - in response to Message 58475.

However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.

Well, it was the only one I had in a suitable state for testing.
And it's a good thing we checked. It appears that ACEMD4 in its current state (v1.03) does NOT handle checkpointing correctly. I suspended it manually just after 10% complete: on restart, it wound back to 1% and started counting again from there. It's reached 2.980% as I type - four increments of 0.495.

The run.log file (which we don't normally get a chance to see) has the ominous line

# WARNING: removed an old file: output.xtc

after a second set of startup details. Perhaps you could pass a message to the appropriate team?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58477 - Posted: 9 Mar 2022 | 16:18:28 UTC - in response to Message 58476.

I will. Thanks a lot for the feedback.
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58478 - Posted: 9 Mar 2022 | 23:18:59 UTC - in response to Message 58475.

Perfect, thanks! It can indeed take a little while for the progress to update after a pause.

The PythonGPU tasks' progress is defined by a target number of interactions between the AI agent and the environment in which it is trained, generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have already been executed.

After resuming, the script looks for these progress and checkpoint files and continues counting from there.

However, Richard, note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how those do their checkpointing.


Yes, it was Linux.
The % complete I saw was 100%, then a bit later 10% per BOINCTasks.
Looking at the history on that PC, it finished in 14:14 run time, just 11 minutes after the ACEMD4 tasks, so it looks like it resumed properly. Thanks for checking.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58479 - Posted: 10 Mar 2022 | 10:41:18 UTC

OK, back on topic. Another of my Windows 7 machines has been allocated a genuine ABOU_pythonGPU_beta2 task (task 32779476), and I was able to suspend it before it even tried to run. I've been able to copy all the downloaded files into a sandbox to play with.

The first task is:

<task>
<application>C:\Windows\System32\tar.exe</application>
<command_line>-xvf windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>

You don't need both a path statement and a hard-coded executable location. That may fail on a machine with non-standard drive assignments.

It will certainly fail on this machine, because I still haven't been able to locate a viable tar.exe for Windows 7 (the Windows 10 executable won't run under Windows 7 - at least, I haven't found a way to make it run yet).

I (and many other volunteers here) do have a freeware application called 7-Zip, and I've seen a suggestion that this may be able to handle the required decompression. I'll test that offline first, and if it works, I'll try to modify the job.xml file to use that instead. That's not a complete solution, of course, but it might give a pointer to the way forward.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58480 - Posted: 10 Mar 2022 | 10:54:35 UTC

OK, that works in principle. The 2.48 GB gz download decompresses to a single 4.91 GB tar file, and that in turn unpacks to 13,449 files in 632 folders. 7-Zip can handle both operations.

ToDo: go find the command line I saw yesterday for doing that in a script.
Check the disk usage limits to ensure all that can happen in the slot directory.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58481 - Posted: 10 Mar 2022 | 11:23:07 UTC

And it's worth a try. I'm going to split that task into two:

<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>

<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>

I could have piped them, but - baby steps!

I'm going to need to increase the disk allowance: 10 (decimal) GB isn't enough.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58483 - Posted: 10 Mar 2022 | 11:47:57 UTC
Last modified: 10 Mar 2022 | 11:49:40 UTC

I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to the system32 folder.
This morning I noticed a task running for 6.5 hours with no progress and no CPU usage.
https://www.gpugrid.net/result.php?resultid=32778132

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58484 - Posted: 10 Mar 2022 | 11:50:21 UTC
Last modified: 10 Mar 2022 | 11:56:21 UTC

Damn. Where did that go wrong?

application C:\Windows\System32\tar.exe missing

Anyone else who wants to try this experiment can try https://www.7-zip.org/ - looks as if the license would even allow the project to distribute it.

Edit - I edited the job.xml file while the previous task was finishing, and then stopped BOINC to increase the disk limit. On restart, BOINC must have noticed that the file had changed, and it downloaded a fresh copy. Near miss.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58485 - Posted: 10 Mar 2022 | 13:42:43 UTC
Last modified: 10 Mar 2022 | 14:19:37 UTC

application "C:\Program Files\7-Zip\7z" missing

Make that "C:\Program Files\7-Zip\7z.exe"

Or maybe not.

application "C:\Program Files\7-Zip\7z.exe" missing

Isn't the damn wrapper clever enough to remove the quotes I put in there to protect the space in "Program Files"?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58486 - Posted: 10 Mar 2022 | 15:02:08 UTC

Using tar.exe in W10 and W11 seems to work now.

However, it is true that:

a) some machines do not have tar.exe. My initial idea was that older versions of Windows could download tar.exe, but it seems that it does not work.

b) The C:\Windows\System32\tar.exe path is hardcoded. I understand that ideally we should add to PATH all possible paths where this executable could be found, right?
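One alternative to enumerating every possible PATH entry would be resolving the executable at run time. A sketch using Python's shutil.which, which honors the PATH and (on Windows) PATHEXT environment variables; the fallback list is illustrative only:

```python
import os
import shutil

def find_tar():
    """Resolve tar via PATH, falling back to a known Windows location.

    Returns an absolute path to the executable, or None if no tar is
    available (in which case Python's tarfile module could be used instead).
    """
    exe = shutil.which("tar")           # searches PATH, adds .exe via PATHEXT
    if exe:
        return exe
    for candidate in (r"C:\Windows\System32\tar.exe",):  # illustrative fallback
        if os.path.exists(candidate):
            return candidate
    return None
```

The actual wrapper is not Python, so this is only the lookup logic; the same PATH-then-fallback idea could be expressed in the wrapper's own language.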
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58487 - Posted: 10 Mar 2022 | 15:40:34 UTC - in response to Message 58486.

On this particular Windows 7 machine, I have:

PATH=
C:\Windows\system32;
C:\Windows;
C:\Windows\System32\Wbem;
C:\Windows\System32\WindowsPowerShell\v1.0\;;
C:\Program Files\Process Lasso\;

- I've split that into separate lines for clarity. but it's one single environment variable that has been added to by various installers over the years.

For a native Windows system component, I wouldn't have thought a path was necessary at all - Windows should handle all that. That's what path variables are for. But maybe the wrapper app is so dumb that it just throws the exact string it parses from job.xml at a file_open function? I'll have a look at the code.

I've got two remaining thoughts: try Program [space] Files without any quotes; or stick a copy of 7z.exe in Windows/system32 (although mine's a 64-bit version...), and call it explicitly from there. I don't think it'll have anywhere to hide from that...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58488 - Posted: 10 Mar 2022 | 17:53:57 UTC

Yay! That's what I wanted to see:

17:49:09 (21360): wrapper: running C:\Program Files\7-Zip\7z.exe (x windows_x86_64__cuda1131.tar.gz)

7-Zip [64] 15.14 : Copyright (c) 1999-2015 Igor Pavlov : 2015-12-31

Scanning the drive for archives:
1 file, 2666937516 bytes (2544 MiB)

Extracting archive: windows_x86_64__cuda1131.tar.gz

And I've got v1.04 in my sandbox...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58489 - Posted: 10 Mar 2022 | 18:27:31 UTC

But not much more than that. After half an hour, it's got as far as:

Everything is Ok

Files: 13722
Size: 5270733721
Compressed: 5281648640
18:02:00 (21360): C:\Program Files\7-Zip\7z.exe exited; CPU time 6.567642
18:02:00 (21360): wrapper: running python.exe (run.py)
WARNING: The script shortuuid.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script normalizer.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts wandb.exe and wb.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires attrs>=17.4.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires packaging, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
aiohttp 3.7.4.post0 requires attrs>=17.3.0, which is not installed.
WARNING: The scripts pyrsa-decrypt.exe, pyrsa-encrypt.exe, pyrsa-keygen.exe, pyrsa-priv2pub.exe, pyrsa-sign.exe and pyrsa-verify.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script jsonschema.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script gpustat.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts ray-operator.exe, ray.exe, rllib.exe, serve.exe and tune.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
WARNING: The script f2py.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
wandb: W&B API key is configured (use `wandb login --relogin` to force relogin)
wandb: Appending key for api.wandb.ai to your netrc file: D:\BOINCdata\slots\5/.netrc
wandb: Currently logged in as: rl-team-upf (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.11
wandb: Run data is saved locally in D:\BOINCdata\slots\5\wandb\run-20220310_181709-mxbeog6d
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run MontezumaAgent_e1a12
wandb: View project at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta
wandb: View run at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta/runs/mxbeog6d

and doesn't seem to be getting any further. I'll see if it's moved on after dinner; I might abort it if it hasn't.

Task is 32782603

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58490 - Posted: 10 Mar 2022 | 18:54:03 UTC

Then, lots of iterations of:

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\cudnn_cnn_train64_8.dll" or one of its dependencies.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "D:\BOINCdata\slots\5\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\BOINCdata\slots\5\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\BOINCdata\slots\5\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\BOINCdata\slots\5\run.py", line 23, in <module>
import torch
File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
raise err

I've increased it ten-fold, but that requires a reboot - and the task didn't survive. Trying one last time, then it's 'No new Tasks' for tonight.

Richard Haselgrove
Message 58491 - Posted: 10 Mar 2022 | 19:05:20 UTC

BTW, yes - the wrapper really is that dumb.

https://github.com/BOINC/boinc/blob/master/samples/wrapper/wrapper.cpp#L727

It just plods along, from beginning to end, copying it byte by byte. The only thing it considers is which way the slashes are pointing.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 411
Credit: 6,063,938,459
RAC: 3,526
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58492 - Posted: 11 Mar 2022 | 0:16:42 UTC

I managed to complete 2 of these WUs successfully. They still need a lot of work: GPU usage is low, and they make the BOINC manager slow, sluggish and unresponsive.

https://www.gpugrid.net/result.php?resultid=32784274

https://www.gpugrid.net/result.php?resultid=32783598

They were a pain to finish!

And what for? Only 3000 points for 882 days' worth of work per WU!



mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Scientific publications
watwatwatwatwat
Message 58493 - Posted: 11 Mar 2022 | 0:48:05 UTC - in response to Message 58483.

I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to system32 folder.
This morning I noticed a task running for 6.5 hours with no progress, no CPU usage.
https://www.gpugrid.net/result.php?resultid=32778132


I've disabled python beta on this W10 PC. Another task has 11+ hours gone:
https://www.gpugrid.net/result.php?resultid=32780319

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58494 - Posted: 11 Mar 2022 | 8:49:55 UTC - in response to Message 58490.
Last modified: 11 Mar 2022 | 8:59:43 UTC

Yes, I have seen this error on a few other machines that could unpack the file with tar.exe, so it is an issue in the Python script. I will be looking into it today. It does not happen on Linux with the same code.
____________

abouh
Message 58495 - Posted: 11 Mar 2022 | 8:58:52 UTC - in response to Message 58492.

Yes, regarding the workload: I have been testing the tasks with low GPU/CPU usage, as I was interested in checking whether the conda environment was successfully unpacked and the python script was able to complete a few iterations. The workload will be increased as soon as this part works, and so will the points.

As for the completely wrong duration estimation, I will look into what can be done. I am not sure how BOINC estimates it. Could someone please confirm whether it is also wrong on Linux, or if it is only a Windows issue?


____________

abouh
Message 58496 - Posted: 11 Mar 2022 | 9:15:30 UTC

Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter?
____________

Richard Haselgrove
Message 58497 - Posted: 11 Mar 2022 | 9:28:05 UTC - in response to Message 58494.
Last modified: 11 Mar 2022 | 9:46:21 UTC

Yes, I have seen this error on a few other machines that could unpack the file with tar.exe, so it is an issue in the Python script. I will be looking into it today. It does not happen on Linux with the same code.

I was a bit suspicious about the 'paging file too small' error - I didn't even think Windows applications could get information about what the current setting was. I'd suggest correlating the machines with this error, with their reported physical memory. Mine is 'only' 8 GB - small by modern standards.

It looks like there may be some useful clues in

https://discuss.pytorch.org/t/winerror-1455-the-paging-file-is-too-small-for-this-operation-to-complete/131233

Richard Haselgrove
Message 58498 - Posted: 11 Mar 2022 | 9:34:16 UTC - in response to Message 58496.

Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter?

That's certainly a part of it, but it's a very long, complicated, and historical story. It will affect any and all platforms, not just Windows, and other data as well as rsc_fpops_est. And it's also related to historical decisions by both BOINC and GPUGrid.

I'll try and write up some bedtime reading for you, but don't waste time on it in the meantime - there won't be an easy 'magic bullet' to fix it.

abouh
Message 58499 - Posted: 11 Mar 2022 | 10:21:10 UTC - in response to Message 58497.

Yes, I was looking at the same link. It seems related to limited memory. I might try to run the suggested script before running the job, which seems to mitigate the problem.
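The workaround discussed in that PyTorch thread amounts to loading the CUDA DLLs one at a time and retrying when Windows reports WinError 1455, instead of letting them all map at once. A minimal sketch of that retry pattern (the helper name, parameters and paths are illustrative, not the thread's actual script):

```python
import time

def load_with_retry(load, items, should_retry, retries=5, delay=2.0):
    """Load items one at a time, retrying an item when should_retry(err)
    says the failure is transient, e.g. WinError 1455 ('paging file is
    too small'), which can clear once Windows grows the page file."""
    loaded = []
    for item in items:
        for attempt in range(retries):
            try:
                loaded.append(load(item))
                break
            except OSError as err:
                if should_retry(err) and attempt < retries - 1:
                    time.sleep(delay)  # give the pager time to expand
                else:
                    raise
    return loaded

# On Windows one might call it roughly like this (illustrative only):
#   import ctypes, glob
#   load_with_retry(ctypes.WinDLL,
#                   glob.glob(r"...\torch\lib\*.dll"),
#                   lambda e: getattr(e, "winerror", None) == 1455)
```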
____________

Richard Haselgrove
Message 58500 - Posted: 11 Mar 2022 | 13:50:18 UTC - in response to Message 58498.

Runtime estimation – and where it goes wrong

The estimates we see on our home computers are made up of three elements. They are:
1) The SIZE of a task – rsc_fpops_est
2) The SPEED of the device that’s calculating the result
3) One or more correction tweaks, designed to smooth off the rough edges.

The original system

In the early days, all BOINC projects ran on CPUs, and almost all the CPUs in use were single-core. The speed of that CPU was measured by a derivation of the Whetstone benchmark: this was originally designed to measure hardware speeds only, and deliberately excluded software optimisation. For scientific research, careful optimisation is a valid technique (provided it isn’t done at the expense of accuracy).

There was a general (but unspoken) assumption that projects would be running a single type of research task, using a single application. So the rough edges were smoothed by something called DCF (duration correction factor). That kept track of that single application, running on that single CPU, and gently adjusted it until the estimates were pretty good. It worked. The adjustments were calculated by, and stored on, the local computer.
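As a rough illustration (my sketch, not the actual BOINC source), a DCF-style correction is a factor that jumps up immediately when a task overruns its estimate, but only eases down slowly when tasks finish early:

```python
def update_dcf(dcf, estimated_s, actual_s):
    """Simplified sketch of a duration-correction-factor update.
    BOINC's real rule differs in detail, but the shape is the same:
    correct pessimistically at once, optimistically only gradually."""
    ratio = actual_s / estimated_s
    if ratio > dcf:
        return ratio                      # task overran: jump up immediately
    return dcf + 0.1 * (ratio - dcf)      # finished early: ease down ~10% per task
```

With one application on one CPU, this converges to good estimates; the trouble described below starts when several apps share the single factor.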

The revised system

Starting in 2008, BOINC was adapted to support applications that ran on GPUs – GPUGrid and SETI@home first, others followed. There never was any attempt to benchmark GPUs, so the theoretical baseline speed of a GPU application was taken to be a figure derived from the hardware architecture, notably the number of shaders and the clock speed. This was known as “peak FLOPS”, or – to some of us – “marketing FLOPS”. No way has any programmer ever been able to write a scientific program which uses every clock cycle of every shader, with no overhead for synchronisation or data transfer. Whatever.

At the same time, projects kept their CPU apps running, and many developed multiple research streams using different apps. A single-valued DCF couldn’t smooth off all the different rough edges at the same time.

There’s nothing in principle to stop the BOINC client keeping track of multiple application+device combinations, and such a system was in fact developed by a volunteer. But it was rejected by David Anderson in Berkeley, who devised his own system of Runtime Estimation, keeping track of the necessary tweaks on the project server. This was intended to replace client-based DCFs entirely, although the old system was retained for historical compatibility.

The implications for GPUGrid

As I think we all know, GPUGrid uses rsc_fpops_est, but I don’t think it’s realised quite how fundamental it is to the whole inverted pyramid. If tasks run much faster than their declared fpops, the only conclusion that BOINC can draw is the application speed has suddenly become much faster, and it tries to adapt accordingly.

GPUGrid has kept both of the adjustment methods active. If you look at any of our computer details, you will see that it contains a link to show application details: the smoothed average of all our successful tasks with each application. The critical one here is APR, or ‘average processing rate’. That’s the device+application speed, in GFlops. But on the computer details page, you’ll also see the DCF listed. Nominally, this should be 1, replaced by APR – but here, usually it isn’t.

The implications? APR works adequately for long term, steady, production work. But it fails during periods of rapid change and testing.

1) APR is disregarded entirely when a new application version is activated on the server. It starts again from scratch, and the initial estimates are – questionable. In fact, I don’t have a clue what speed is assumed for the first few tasks allocated.

2) It kicks in in two stages. First, when 100 tasks have been completed for the whole ensemble, and again when each individual computer reaches 11 completed tasks. Note that ‘completed’ here means a normal end-of-run plus a validated result. Some app versions never achieve that!

Different GPUs run at very different speeds, and the first 100 tasks returned normally come back from the fastest cards. That skews the average speed. In the worst case, the first hundred back can set a standard which lesser cards can’t attain – so they are stopped by ‘run time exceeded’, can never achieve the necessary 11 validations to set their own, lower, bar, and are excluded for good. The same can happen if deliberately short test tasks are put through early on, without an adjusted rsc_fpops_est: again, an unfeasibly fast target is set, and no-one can complete full-length tasks.
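That skew is easy to see in a toy model (my own illustration, not project data): if the ensemble speed is set by the first 100 results, and the fast cards report first, the "average" lands far above what a typical card can reach:

```python
def apr_from_first_returns(host_gflops, n_first=100):
    """Toy model of the skew: the first n_first results tend to come
    from the fastest cards, so the ensemble average is computed over
    those alone rather than over all hosts."""
    fastest_first = sorted(host_gflops, reverse=True)[:n_first]
    return sum(fastest_first) / len(fastest_first)

# e.g. 200 hosts with speeds 1..200 GFLOPS: the true mean is 100.5,
# but the first-100 average is 150.5 - a bar the slower half can't reach.
```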

Sorry – I’ve been called out this afternoon, so I’ve dashed that off much quicker than I intended. I’ll leave it there for now, and we can all discuss the way forward later.

abouh
Message 58501 - Posted: 11 Mar 2022 | 21:25:14 UTC - in response to Message 58500.

Thank you very much for the explanation Richard, very helpful actually.

I have been using short test tasks to catch bugs in the early stages of the job. That might have caused problems, although I guess we can adjust rsc_fpops_est and reset the statistics later. The idea is to have long-term, steady production work after the tests.

However, I don't fully understand how that could cause estimates of hundreds of days. In any case, the most reliable information for the host is then the progress percentage, which should be correct.

I remember the ‘run time exceeded’ error was happening previously in the app and we had to adjust the rsc_fpops_est parameter. Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app? The idea is that PythonGPUbeta eventually becomes the sole Python app, running the same Linux jobs PythonGPU is running now plus Windows jobs.

____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2335
Credit: 16,178,080,749
RAC: 0
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58502 - Posted: 11 Mar 2022 | 21:58:48 UTC - in response to Message 58501.
Last modified: 11 Mar 2022 | 22:01:23 UTC

Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
This approach is wrong.
The rsc_fpops_est should be set according to the actual batch of workunits, not for the app.
As test batches are much shorter than production batches, they should have a much smaller rsc_fpops_est value, even though the same app processes them.

Richard Haselgrove
Message 58505 - Posted: 12 Mar 2022 | 8:56:44 UTC - in response to Message 58502.

Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app?
This approach is wrong.
The rsc_fpops_est should be set according to the actual batch of workunits, not for the app.
As test batches are much shorter than production batches, they should have a much smaller rsc_fpops_est value, even though the same app processes them.

Correct.

Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.

In the meantime, we're working through a glut of ACEMD3 tasks, and here's how they arrive:

12/03/2022 08:23:29 | GPUGRID | [sched_op] NVIDIA GPU work request: 11906.64 seconds; 0.00 devices
12/03/2022 08:23:30 | GPUGRID | Scheduler request completed: got 2 new tasks
12/03/2022 08:23:30 | GPUGRID | [sched_op] estimated total NVIDIA GPU task duration: 306007 seconds

So, I'm asking for a few hours of work, and getting several days. Or so BOINC says.

This is Windows host 45218, which is currently showing "Task duration correction factor 13.714405". (It was higher a few minutes ago, when that work was fetched - over 13.84)

I forgot to mention yesterday that in the first phase of BOINC's life, both your server and our clients took account of DCF, so the 'request' and 'estimated' figures would have been much closer. But when the APR code was added in 2010, the DCF code was removed from the servers. So your server knows what my DCF is, but it doesn't use that information.

So the server probably assessed that each task would last about 11,055 seconds. That's why it added the second task to the allocation: it thought the first one didn't quite fill my request for 11,906 seconds.

In reality, this is a short-running batch - although not marked as such - and the last one finished in 4,289 seconds. That's why DCF is falling after every task, though slowly.
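Those numbers are consistent with the DCF arithmetic (my reconstruction from the figures quoted in this post): the server's per-task estimate is the client-side total, divided over the two tasks and by the DCF in force when the work was fetched:

```python
# Values copied from the log above
client_total_s = 306007.0   # client-side estimate for the 2 tasks, seconds
dcf = 13.84                 # DCF at the moment the work was fetched

server_per_task_s = client_total_s / 2 / dcf
# ~11,055 s per task: just under the 11,906.64 s requested,
# which is why the scheduler topped the allocation up with a second task
```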

mmonnin
Message 58506 - Posted: 12 Mar 2022 | 21:01:37 UTC - in response to Message 58494.

Yes, I have seen this error on a few other machines that could unpack the file with tar.exe, so it is an issue in the Python script. I will be looking into it today. It does not happen on Linux with the same code.


Having tar.exe wasn't enough. I later saw a popup in W10 saying archieveint.dll was missing.

I had two python tasks in linux error out in ~30min with
15:33:14 (26820): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing

That PC has python 2.7.17 and 3.6.8 installed.

Richard Haselgrove
Message 58508 - Posted: 13 Mar 2022 | 17:19:19 UTC - in response to Message 58505.

Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.

Caught one!

Task e1a5-ABOU_pythonGPU_beta2_test16-0-1-RND7314_1

Host is 43404. Windows 7. It has two GPUs, and GPUGrid is set to run on the other one, not as shown. The important bits are

CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 472.12, CUDA version 11.4, compute capability 7.5, 4096MB, 3032MB available, 5622 GFLOPS peak)

DCF is 8.882342, and the task shows up as:



Why? This is what I got from the server, in the sched_reply file:

<app_version>
<app_name>PythonGPUbeta</app_name>
<version_num>104</version_num>
...
<flops>47361236228.648697</flops>
...
<workunit>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
...

1,000,000,000,000,000,000 fpops, at 47 GFLOPS, would take 21,114,313 seconds, or 244 days. Multiply in the DCF, and you get the 2170 days shown.
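Spelling that arithmetic out (values copied from the sched_reply above):

```python
rsc_fpops_est = 1e18                # 1,000,000,000,000,000,000 fpops
flops = 47361236228.648697          # ~47 GFLOPS assumed app speed
dcf = 8.882342                      # host's duration correction factor

raw_s = rsc_fpops_est / flops       # ~21,114,313 s, about 244 days
shown_days = raw_s / 86400 * dcf    # ~2170 days, as the client displays
```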

According to the application details page, this host has completed one 'Python apps for GPU hosts beta 1.04 windows_x86_64 (cuda1131)' task (new apps always go right down to the bottom of that page). It recorded an APR of 1279539, which is bonkers the other way - these are GFlops, remember. It must have been task 32782603, which completed in 781 seconds.

So, lessons to be learned:

1) A shortened test task, described as running for the full-run number of fpops, will register an astronomical speed. If anyone completes 11 tasks like that, that speed will get locked into the system for that host, and will cause the 'runtime limit exceeded' error.

2) BOINC is extremely bad - stupidly bad - at generating a first guess for the speed of a 'new application, new host' combination. It's actually taken precisely one-tenth of the speed of the acemd3 application on this machine, which might be taken as a "safe working assumption" for the time being. I'll try to check that in the server code.

Oooh - I've let it run, and BOINC has remembered how I set up 7-Zip decompression last week. That's nice.

Richard Haselgrove
Message 58509 - Posted: 13 Mar 2022 | 17:23:05 UTC

But it hasn't remembered the increased disk limit. Never mind - nor did I.

abouh
Message 58510 - Posted: 14 Mar 2022 | 8:42:00 UTC - in response to Message 58506.

Right now, the PythonGPU app works by dividing the job into 2 subtasks:
1- first, installing conda and creating the conda environment.
2- second, running the python script.

The error

15:33:14 (26820): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing


means that after 1800 seconds, the conda environment was not yet created for some reason. This could be because the conda dependencies could not be downloaded in time or because the machine was running the installation process more slowly than expected. We set this time limit of 30 mins because in theory it is plenty of time to create the environment.

However, in the new version (the current PythonGPUBeta), we send the whole conda environment compressed and simply unpack it on the machine. Therefore this error, which indeed still happens every now and then, should disappear.
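The two-step structure described above (a time-boxed setup step, then the Python script) can be sketched as follows. The 1800 s budget mirrors the log message, but the helper itself and the command names are illustrative, not the task's real command lines:

```python
import subprocess
import sys

def run_step(cmd, limit_s):
    """Run one wrapper subtask with a wall-clock budget; a timeout is
    treated as a failure, like '/usr/bin/flock reached time limit 1800'."""
    try:
        subprocess.run(cmd, timeout=limit_s, check=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

# Sketch of the old PythonGPU flow: environment setup first, science second.
# (The conda/run.py commands are placeholders.)
def run_task():
    if not run_step(["conda-env-setup"], limit_s=1800):       # subtask 1
        return "env setup failed or timed out"
    if not run_step(["bin/python", "run.py"], limit_s=None):  # subtask 2
        return "python script failed"
    return "done"
```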
____________

abouh
Message 58511 - Posted: 14 Mar 2022 | 8:55:03 UTC - in response to Message 58508.

OK, so my plan was to run at least a few more batches of test jobs, then start the real tasks.

I understand now that if some machines have run several test tasks by then, that will create an estimation problem. Does resetting the credit statistics help? Or would it be better to create a new app for the real jobs once the testing is finished, so that statistics are consistent and, in the long term, BOINC estimates the durations better?
____________

Richard Haselgrove
Message 58512 - Posted: 14 Mar 2022 | 10:29:52 UTC - in response to Message 58511.

My gut feeling is that it would be better to deploy the finished app (after all testing seems to be complete) as a new app_version. We would have to go through the training process for APR one last time, but then it should settle down.

I've seen the reference to resetting the credit statistics before, but only some years ago in scanning the documentation. I've never actually seen the console screen you use to control a BOINC server, let alone operated one for real, so I don't know whether you can control the reset to a single app_version, or whether you have to nuke the entire project - best not to find out the hard way.

You're right, of course - the whole Runtime Estimation (APR) structure is intimately bound up with the CreditNew tools, also introduced in 2010. So the credit reset is likely to include an APR reset - but I'd hold that back for now.

I see you've started sending out v1.05 betas. One has arrived on one of my Linux machines, and again, the estimated speed is exactly one-tenth of the acemd3 speed - with extreme precision, to the last decimal place:

<flops>707593666701.291382</flops>
<flops>70759366670.129135</flops>

That must be deliberate.

Profile Retvari Zoltan
Message 58513 - Posted: 14 Mar 2022 | 11:21:42 UTC - in response to Message 58511.
Last modified: 14 Mar 2022 | 11:22:20 UTC

Would it be better to create a new app for real jobs once the testing is finished?
Based on the last few days' discussion here, I've understood the purpose of the former short and long queue from GPUGrid's perspective:
By separating the tasks into two queues based on their length, the project's staff didn't have to bother setting the rsc_fpops_est value for each and every batch (note that the same app was assigned to each queue). The two queues used different (but constant across batches) rsc_fpops_est values, so BOINC's runtime estimation could not get so far off in either queue that it would trigger the "won't finish on time" or the "run time exceeded" situation.
Perhaps this practice should be put into operation again, even at a finer level of granularity (S, M, L tasks, or even XS and XL tasks).
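The point can be made concrete with a small sizing helper (my illustration; the reference speed is an arbitrary assumption, not a GPUGrid constant): rsc_fpops_est should scale with a batch's expected runtime, so a short test batch gets a proportionally small value:

```python
def rsc_fpops_est_for_batch(expected_runtime_s, ref_gflops=100.0):
    """Size a batch's rsc_fpops_est from its expected runtime on a
    reference-speed device (ref_gflops is an assumed figure)."""
    return expected_runtime_s * ref_gflops * 1e9

short_test = rsc_fpops_est_for_batch(15 * 60)     # ~15-min test batch
production = rsc_fpops_est_for_batch(9 * 3600)    # ~9-hour production batch
# a 36x difference - batches this dissimilar need different estimates,
# whether per-batch values or separate queues encode it
```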

Bedrich Hajek
Message 58518 - Posted: 14 Mar 2022 | 23:14:08 UTC

I am getting "Disk usage limit exceeded" error.

https://www.gpugrid.net/result.php?resultid=32808038

I do have 400 Gigs reserved for boincs.

abouh
Message 58519 - Posted: 15 Mar 2022 | 16:40:36 UTC - in response to Message 58518.

I believe the "Disk usage limit exceeded" error is not related to the machine's resources; it is defined by an adjustable parameter of the app. The conda environment plus all the other files might be over this limit. I will review the current value; we might have to increase it. Thanks for pointing it out!
____________

Richard Haselgrove
Message 58524 - Posted: 17 Mar 2022 | 9:59:07 UTC

After a day out running a long acemd3 task, there's good news and bad news.

The good news: runtime estimates have reached sanity. The magic numbers are now:

<flops>336636264786015.625000</flops>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>

That ends up with an estimated runtime of about 9 hours - but at the cost of a speed estimate of 336,636 GFlops. That's way beyond a marketing department's dream.

Either somebody has done open-heart surgery on the project's database (unlikely and unwise), or BOINC now has enough completed tasks for v1.05 to start taking notice of the reported values.

The bad news: I'm getting errors again.

ModuleNotFoundError: No module named 'gym'

Richard Haselgrove
Message 58527 - Posted: 18 Mar 2022 | 13:05:44 UTC

v1.06 is released and working (very short test tasks only).

Watch out for:
Another 2.46 GB download
Estimates are back up to multiple years

abouh
Message 58528 - Posted: 18 Mar 2022 | 13:43:25 UTC - in response to Message 58524.

The latest version should fix this error.

ModuleNotFoundError: No module named 'gym'


____________

Richard Haselgrove
Message 58529 - Posted: 18 Mar 2022 | 15:24:55 UTC - in response to Message 58528.
Last modified: 18 Mar 2022 | 16:19:46 UTC

I have task 32836015 running - showing 50% after 30 minutes. That looks like it's giving the maths a good work-out.

Edit - actually, it's not doing much at all.


You should be on NVidia device 1 - but cool, low power, 0% usage. No checkpoint, nothing written to stderr.txt in an hour and a half.

abouh
Message 58534 - Posted: 18 Mar 2022 | 16:53:01 UTC - in response to Message 58529.
Last modified: 18 Mar 2022 | 16:54:58 UTC

For now I am just trying to see the jobs finish; I am not even trying to make them run for a long time. The jobs should not even need checkpoints; they should last less than 15 mins.

So weird; some other jobs on Windows machines from the same batch managed to finish, for example those with result ids 32835825, 32836020 or 32835934.

I don't understand why it works on some Windows machines and fails on others, sometimes without complaining about anything, and works fine locally on my Windows laptop.

Does Windows have trouble with multiprocessing? I guess I need to add many more checkpoints to the scripts, pretty much after every line of code.
____________

Richard Haselgrove
Message 58536 - Posted: 18 Mar 2022 | 17:41:35 UTC - in response to Message 58534.

Err, this particular task is running on Linux - specifically, Mint v20.3

It ran the first short task OK at lunchtime - see Python apps for GPU hosts beta on host 508381. I think I'd better abort it while we think.

kksplace
Send message
Joined: 4 Mar 18
Posts: 48
Credit: 445,464,249
RAC: 426,354
Level
Gln
Scientific publications
wat
Message 58537 - Posted: 20 Mar 2022 | 12:08:10 UTC

This task https://www.gpugrid.net/result.php?resultid=32841161 has been running for nearly 26 hours now. It is the first Python beta task I have received that appears to be working. Green-With-Envy shows intermittent low activity on my 1080 GPU and BoincTasks shows 100% CPU usage. It checkpointed only once several minutes after it started and has shown 50% complete ever since.

Should I let this task continue or abort it?

(Linux Mint, 1080 driver is 510.47.03)

Richard Haselgrove
Message 58538 - Posted: 20 Mar 2022 | 12:35:55 UTC - in response to Message 58537.

Sounds just like mine, including the 100% CPU usage - that'll be the wrapper app, rather than the main Python app.

One thing I didn't try, but only thought about afterwards, is to suspend the task for a moment and then allow it to run again. That has re-vitalised some apps at other projects, but is not guaranteed to improve things: it might even cause it to fail. But if it goes back to 0% or 50%, and doesn't move further, it's probably not going anywhere. I'd abort it at that point.

kksplace
Send message
Joined: 4 Mar 18
Posts: 48
Credit: 445,464,249
RAC: 426,354
Level
Gln
Message 58539 - Posted: 20 Mar 2022 | 13:12:02 UTC - in response to Message 58538.

Well, after a suspend and allowing it to run, it went back to its checkpoint and has shown no progress since. I will abort it. Keep on learning....

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58540 - Posted: 21 Mar 2022 | 8:21:21 UTC - in response to Message 58538.

OK, so it gets stuck at 50%. I will review it today. Thanks for the feedback.

It also seems to fail in most Windows cases without reporting any error.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58541 - Posted: 21 Mar 2022 | 12:39:20 UTC
Last modified: 21 Mar 2022 | 13:03:38 UTC

Got a new one - the other Linux machine, but very similar. Looks like you've put some debug text into stderr.txt:

12:28:16 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
12:31:39 (482274): /bin/tar exited; CPU time 192.149659
12:31:39 (482274): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Sanity check, make sure that logging matches execution
Check if this is a restarted job
Define Train Vector of Envs
Define RL training algorithm
Look for available model checkpoint in log_dir - node failure case
Define RL Policy
Define rollouts storage
Define scheme

but nothing new has been added in the last five minutes. Showing 50% progress, no GPU activity. I'll give it another ten minutes or so, then try stop-start and abort if nothing new.

Edit - no, no progress. Same result on two further tasks. All the quoted lines are written within about 5 seconds, then nothing. I'll let the machine do something else while I go shopping...

Tasks for host 132158

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58549 - Posted: 21 Mar 2022 | 14:51:37 UTC - in response to Message 58541.
Last modified: 21 Mar 2022 | 15:45:30 UTC

Ok so I have seen 3 main errors in the last batches:


1. The one reported by Bedrich Hajek ("Disk usage limit exceeded"). We have now increased the amount of disk space allotted by BOINC to each task and I believe, based on the last batch I sent, that this error is gone now.


2. The "older" Windows machines do not have the tar.exe application and therefore cannot unpack the conda environment. I know Richard did some research into that, but had to download 7-Zip. Ideally I would like the app to be self-contained. Maybe we can send the 7-Zip program with the app; I will have to research whether that is possible.

3. The job getting stuck at 50%. I did add some debug messages in the last batches and I believe I know more or less where in the code the script gets stuck. I am still looking into it. I will also check recent results to see if there is any pattern when this error happens. Note that there is no checkpoint because it is a short task that gets stuck; since the training is not progressing, no new checkpoints are saved.
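For reference, the per-task disk allotment mentioned in point 1 is controlled server-side through the BOINC workunit template; a sketch of the relevant element (the value shown is purely illustrative, not GPUGrid's actual setting):

```xml
<workunit>
    <!-- maximum disk usage allowed for the task, in bytes (example value) -->
    <rsc_disk_bound>20000000000</rsc_disk_bound>
</workunit>
```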
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58550 - Posted: 24 Mar 2022 | 10:05:39 UTC - in response to Message 58549.
Last modified: 24 Mar 2022 | 10:09:32 UTC

We have updated to a new app version for windows that solves the following error:

application C:\Windows\System32\tar.exe missing


Now we send the 7z.exe (576 KB) file with the app, which allows it to unpack the other files without relying on the host machine having tar.exe (which ships only with Windows 11 and the latest builds of Windows 10).

I just sent a small batch of short tasks this morning to test and so far it seems to work.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58551 - Posted: 24 Mar 2022 | 10:14:38 UTC

Task 32868822 (Linux Mint GPU beta)

Still seems to be stalling at 50%, after "Define scheme". bin/python run.py is using 100% CPU, plus over 30 threads from multiprocessing.spawn with too little CPU usage to monitor (shows as 0.0%). No GPU app listed by nvidia-smi.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58552 - Posted: 24 Mar 2022 | 10:24:01 UTC - in response to Message 58551.
Last modified: 24 Mar 2022 | 10:26:18 UTC

Do you know, by chance, whether this same machine works fine with PythonGPU tasks even though it fails on the PythonGPUBeta ones?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58553 - Posted: 24 Mar 2022 | 11:01:02 UTC - in response to Message 58552.
Last modified: 24 Mar 2022 | 11:26:25 UTC

Yes, it does. Most recent was:

e1a5-ABOU_rnd_ppod_avoid_cnn13-0-1-RND6436_3

Three failed before me, but mine was OK.

Edit: In relation to that successful task, BOINC only returns the last 64 KB of stderr.txt - so that result starts in the middle of the file (that's the bit that's most likely to contain debug information after a crash). I'll try to capture the initial part of the file next time I run one of those tasks, for reference.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58561 - Posted: 25 Mar 2022 | 8:33:38 UTC
Last modified: 25 Mar 2022 | 8:34:20 UTC

I have also changed the approach a bit.

I have just sent a batch of short tasks much more similar to those in PythonGPU. If these work fine, I will slowly introduce changes to see what the problem was.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58562 - Posted: 25 Mar 2022 | 9:03:09 UTC - in response to Message 58561.

I've grabbed one. Will run within the hour.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58563 - Posted: 25 Mar 2022 | 9:20:46 UTC - in response to Message 58561.
Last modified: 25 Mar 2022 | 9:27:47 UTC

I sent 2 batches,

ABOU_rnd_ppod_avoid_cnn_testing

and

ABOU_rnd_ppod_avoid_cnn_testing2

Unfortunately the first batch will crash. I detected one bug already, which I have fixed in the second one. Seems like you got at least one from the second batch (e1a18-ABOU_rnd_ppod_avoid_cnn_testing2). Running it will give us the info we need.

On the bright side, the fix with 7z.exe seems to work in all machines so far.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58564 - Posted: 25 Mar 2022 | 9:52:52 UTC - in response to Message 58563.

Yes, I got the testing2. It's been running for about 23 minutes now, but I'm seeing the same as yesterday - nothing written to stderr.txt since:

09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
09:32:39 (51456): /bin/tar exited; CPU time 192.380796
09:32:39 (51456): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Define rollouts storage
Define scheme

and machine usage shows



(full-screen version of that at https://i.imgur.com/Ly9Aabd.png)

I've preserved the control information for that task, and I'll try to re-run it interactively in terminal later today - you can sometimes catch additional error messages that way.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58565 - Posted: 25 Mar 2022 | 10:06:50 UTC - in response to Message 58564.

OK, thanks a lot. Maybe then it is not the Python script but one of the dependencies.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58566 - Posted: 25 Mar 2022 | 10:27:08 UTC - in response to Message 58565.

OK, I've aborted that task to get my GPU back. I'll see what I can pick out of the preserved entrails, and let you know.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58568 - Posted: 25 Mar 2022 | 18:13:58 UTC

Sorry, ebcak. I copied all the files, but when I came to work on them, several turned out to be BOINC softlinks back to the project directory, where the original file had been deleted. So the fine detail had gone.

Memo to self - don't try to operate dangerous machinery too early in the morning.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 329
Credit: 1,449,138,982
RAC: 407,971
Level
Met
Message 58569 - Posted: 27 Mar 2022 | 15:49:31 UTC

The past several tasks have gotten stuck at 50% for me as well. Today one has made it past, to 57.7% now in 8 hours. 1-2% GPU utilization on a 3070 Ti. 2.5 CPU threads per BoincTasks. 3063 MB memory per nvidia-smi and 4.4 GB per BoincTasks.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58571 - Posted: 28 Mar 2022 | 16:09:03 UTC
Last modified: 28 Mar 2022 | 17:15:17 UTC

I updated the app. Tested it locally and works fine on Linux.

I sent a batch of test jobs (ABOU_rnd_ppod_avoid_cnn_testing3), which I have seen executed successfully in at least 1 Linux machine so far.

One way to check whether the job is actually progressing is to look for a directory called "monitor_logs/train" in the BOINC slot directory where the job is being executed. If logs are being written to the files inside this folder, it means the task is progressing.
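A small script along these lines can automate that check by looking at file modification times (the monitor_logs/train path is taken from the post; the slot directory path is an assumption you must adjust to your own setup):

```python
import os
import time

def recently_updated(log_dir, within_seconds=600):
    """Return True if any file in log_dir was modified within the last
    `within_seconds` seconds - a proxy for 'the task is progressing'."""
    try:
        newest = max(
            os.path.getmtime(os.path.join(log_dir, name))
            for name in os.listdir(log_dir)
        )
    except (OSError, ValueError):
        # directory missing/unreadable, or empty (max() of no files)
        return False
    return (time.time() - newest) < within_seconds

# example: replace the slot number with the task's actual slot directory
# print(recently_updated("/var/lib/boinc-client/slots/0/monitor_logs/train"))
```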
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58572 - Posted: 28 Mar 2022 | 17:20:54 UTC - in response to Message 58571.

Got a couple on one of my Windows 7 machines. The first - task 32875836 - completed successfully, the second is running now.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58573 - Posted: 28 Mar 2022 | 18:01:06 UTC - in response to Message 58572.

Nice to hear! Let's see what happens on Linux... so weird if it only works on some machines and gets stuck on others...
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58574 - Posted: 28 Mar 2022 | 18:55:31 UTC - in response to Message 58573.

Nice to hear! Let's see what happens on Linux... so weird if it only works on some machines and gets stuck on others...

Worse is to follow, I'm afraid. task 32875988 started immediately after the first one (same machine, but a different slot directory), but seems to have got stuck.

I now seem to have two separate slot directories:

Slot 0, where the original task ran. It has 31 items (3 folders, 28 files) at the top level, but the folder properties say the total (presumably expanding the site-packages) is 49 folders, 257 files, 3.62 GB.

Slot 5, allocated to the new task. It has 93 items at the top level (12 folders, including monitor_logs, and the rest files). This one looks the same as the first one did, while it was actively running the first task. This one has 14 files in the train directory - I think the first only had 4. This slot also has a stderr file, which ends with multiple repetitions of

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\__init__.py", line 1, in <module>
    from pytorchrl.agent.env.vec_env import VecEnv
  File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\vec_env.py", line 1, in <module>
    import torch
  File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.

I'm going to try variations on a theme of
- clear the old slot manually
- pause and restart the task
- stop and restart BOINC
- stop and restart Windows

I'll report back what works and what doesn't.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58575 - Posted: 28 Mar 2022 | 19:36:32 UTC

Well, that was interesting. The files in slot 0 couldn't be deleted - they were locked by a running app 'python' - which is presumably why BOINC hadn't cleaned the folder when the first task finished.

So I stopped the second task, and used Windows Task Manager to see what was running. Sure enough, there was still a Python image, and I still couldn't delete the old files. So I force-stopped that python image, and then I could - and did - delete them.

I restarted the second task, but nothing much happened. The wrapper app posted in stderr that it was restarting python, but nothing else.

So then I restarted BOINC, and all hell broke loose. In quick succession, I got



Then windows crashed a browser tab and two Einstein@Home tasks on the other GPU.

When I'd closed the Python app from the Windows error box, the BOINC task closed cleanly, uploaded some files, and reported a successful finish. It even validated!

Things all seem to be running quietly now, so I think I'll leave this machine alone for a while and think. At the moment, the take-home theory is that the whole sequence was triggered by the failure of the python app to close at the end of the first task's run. That might be the next thing to look at.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Message 58576 - Posted: 28 Mar 2022 | 20:38:27 UTC

Well this beta WU was a weird one:

https://www.gpugrid.net/workunit.php?wuid=27211744

It ran to 50% completion and hung there for 3.5 days, so I aborted it. BOINC properties showed it running in slot 10, except slot 10 was empty. top (Fedora 35) showed no activity from any GPUGrid WU. Some wrapper or something must have been kept alive in the background when the WU quit, because the elapsed-time counter was incrementing normally.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58577 - Posted: 29 Mar 2022 | 7:52:03 UTC
Last modified: 29 Mar 2022 | 8:01:53 UTC

Interesting that sometimes jobs work and sometimes get stuck in the same machine.

It also seems to me, based on your info, that something remains running at the end of the job and causes the next job to get stuck. Presumably some Python process.

I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.
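One common way to do that in Python (an illustrative sketch of the general technique, not necessarily what the GPUGrid app will use) is to register an exit handler that terminates any surviving child processes:

```python
import atexit
import multiprocessing as mp

def kill_remaining_children(timeout=5):
    """Terminate (and, failing that, kill) any child processes still
    alive when the main program exits."""
    for child in mp.active_children():
        child.terminate()
        child.join(timeout)
        if child.is_alive():
            child.kill()  # SIGKILL-style fallback, Python 3.7+

# run the cleanup automatically when the interpreter exits normally
atexit.register(kill_remaining_children)
```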

Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn on a Linux machine and to Richard on Windows.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58578 - Posted: 29 Mar 2022 | 8:45:27 UTC

I've just had task 32876361 fail on a different, but identical, Windows machine. This time, it seems to be explicitly, and simply, a "not enough memory" error - these machines only have 8 GB, which was fine when I bought them. I've suspended the beta programme for the time being, and I'll try to upgrade them.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 411
Credit: 6,063,938,459
RAC: 3,526
Level
Tyr
Message 58581 - Posted: 29 Mar 2022 | 20:42:53 UTC

Another "Disk usage limit exceeded" error:

https://www.gpugrid.net/result.php?resultid=32876568

And a successful one yesterday:

https://www.gpugrid.net/result.php?resultid=32876288


roundup
Send message
Joined: 11 May 10
Posts: 34
Credit: 257,866,755
RAC: 39,356
Level
Asn
Message 58582 - Posted: 30 Mar 2022 | 14:07:18 UTC
Last modified: 30 Mar 2022 | 14:11:23 UTC

After having some errors with recent python app betas, task 32876819 ran without error on a RTX3070 Mobile under Win 11.
A few observations:
- GPU load only between 4% and 8% with a peak between 50% and 70% every 12 seconds.
- The indicated time remaining in the BOINC Client was way off. It started with >7000 (seven thousand) days.
- 15,000 BOINC credits for 102,296 sec runtime. I assume that will be corrected once the Python app goes productive. EDIT: The runtime indicated on the GPUGrid site is not correct; it was actually less.

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 1,282,338,256
RAC: 99,065
Level
Met
Message 58588 - Posted: 31 Mar 2022 | 17:33:27 UTC

These tasks seem to run much better on my machines if I allocate 6 CPUs (threads) to each task. I managed to run one by itself and watched the performance monitor for CPU usage. During the initiation phase (about 5 minutes), the task used ~6 CPUs (threads). After the initiation phase, the CPU usage oscillated between ~2 and ~5 threads. The task ran very quickly and has been validated. Please let me know if you have questions.
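In BOINC, a per-task CPU reservation like that is normally done with an app_config.xml placed in the project directory. A sketch reserving 6 threads per Python GPU task (the app name below is an assumption; check client_state.xml for the exact name):

```xml
<app_config>
    <app>
        <name>PythonGPUbeta</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>
            <!-- reserve 6 CPU threads per task, per the observation above -->
            <cpu_usage>6.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```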

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58590 - Posted: 1 Apr 2022 | 8:59:15 UTC - in response to Message 58582.

Thanks a lot for the feedback:

- Cyclical GPU load is expected in reinforcement learning algorithms. Whenever GPU load is lower, CPU usage should increase. This is correct behavior.

- The incorrect time-remaining prediction is an issue... it will only be fixed over time, once the tasks become stable in duration. Maybe we will even be required to create a new app and use this one only for debugging.

- Also, yes, credits will be corrected; for now we will have something similar to what we have in the PythonGPU app.

Starting today I will send longer jobs, instead of the super short test jobs I was using just to verify the code was working on all OSes and machines.
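The cyclical load pattern in the first point is inherent to on-policy RL: a long CPU-bound phase collecting environment interactions alternates with a short GPU-bound policy update. A toy sketch of that loop (purely illustrative stand-ins, not the GPUGrid code):

```python
def collect_rollouts(num_steps):
    # CPU-bound phase: step the (simulated) environment; the GPU idles here
    transitions, state = [], 0.0
    for _ in range(num_steps):
        state = 0.99 * state + 0.01   # stand-in for env.step(action)
        transitions.append(state)
    return transitions

def update_policy(transitions):
    # in the real app this is the short GPU burst (backprop on the batch);
    # here just a stand-in statistic over the collected batch
    return sum(transitions) / len(transitions)

for iteration in range(3):
    batch = collect_rollouts(2048)   # long, low-GPU phase
    loss = update_policy(batch)      # brief GPU spike
```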
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58591 - Posted: 1 Apr 2022 | 9:05:50 UTC - in response to Message 58588.

Last batches seem to be working successfully both in Linux and Windows, and also for GPUs with cuda 10 and cuda 11.

My main worry now is whether the problem of some jobs getting "stuck" and never completing persists. It was reported that the cause was the Python process not finishing correctly between jobs, so I made a few changes in the code to try to solve this issue.

Please let me know if you detect this problem in one of your tasks, that would be very helpful!

Incidentally, once the PythonGPUBeta app is stable enough, it will replace the current PythonGPU app, which only works on Linux.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58592 - Posted: 1 Apr 2022 | 9:18:50 UTC - in response to Message 58591.

It was reported that the reason was that the Python was not finishing correctly between jobs so I added a few changes in the code to try to solve this issue.

Well, that was one report of one task on one machine with limited memory. It seemed to be a case that, if it happened, caused problems for the following task. It's certainly worth looking at, and if it prevents some tasks failing - great. But I'd be cautious about assuming that it was the problem in all cases.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Message 58593 - Posted: 1 Apr 2022 | 23:39:38 UTC

I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.

Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn on a Linux machine and to Richard on Windows.


I haven't gotten a new beta yet so I will shut off all GPU work with other projects to hopefully get some and help resolve this issue.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Message 58594 - Posted: 1 Apr 2022 | 23:50:44 UTC

One other afterthought re that WU. I had checked my status page here prior to aborting the task. It indicated the task was still in progress, so no disposition was assigned to the files that I presume were sent back sometime in the past (since the slot was empty). Wonder where they went?

Short Final
Send message
Joined: 26 May 20
Posts: 4
Credit: 47,639,432
RAC: 55,568
Level
Val
Message 58597 - Posted: 4 Apr 2022 | 10:56:50 UTC

Can anybody explain credits policy please.
My CPUs have been running the Python app relentlessly for up to 7 days for only 50,000 credits. Yet I have received 360,000 credits for ACEMD 3 after only 42,000 secs (11.6 hrs). A bit skew-whiff... see below:

https://www.gpugrid.net/results.php?userid=562496


Task | Work unit | Computer | Sent | Reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
32877811 | 27214361 | 590351 | 1 Apr 2022 9:34:34 UTC | 3 Apr 2022 9:57:48 UTC | Completed and validated | 309,332.50 | 309,332.50 | 50,000.00 | Python apps for GPU hosts beta v1.10 (cuda1131)
32877804 | 27214354 | 581235 | 1 Apr 2022 9:38:33 UTC | 3 Apr 2022 19:38:13 UTC | Completed and validated | 628,304.20 | 628,304.20 | 50,000.00 | Python apps for GPU hosts beta v1.10 (cuda1131)
32876508 | 27207895 | 581235 | 29 Mar 2022 9:50:08 UTC | 1 Apr 2022 4:52:45 UTC | Completed and validated | 101,951.50 | 100,984.90 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32876455 | 27213533 | 581235 | 29 Mar 2022 9:17:17 UTC | 29 Mar 2022 9:49:31 UTC | Completed and validated | 12,109.13 | 12,109.13 | 3,000.00 | Python apps for GPU hosts beta v1.09 (cuda1131)
32876341 | 27213457 | 590351 | 29 Mar 2022 4:33:52 UTC | 31 Mar 2022 6:41:54 UTC | Completed and validated | 42,830.17 | 41,435.17 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32875459 | 27212897 | 581235 | 27 Mar 2022 2:32:46 UTC | 29 Mar 2022 9:06:58 UTC | Completed and validated | 96,228.49 | 95,544.64 | 360,000.00 | ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
PS: How do I paste a neat image of the above??

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Message 58598 - Posted: 4 Apr 2022 | 13:12:34 UTC - in response to Message 58597.

Please note that other users can't see your entire task list by userid - that's a privacy policy common to all BOINC projects.

The ones you're worried about seem to be Results for host 581235

The one you're specifically asking about - the Python GPU beta v1.10 - was issued on Friday morning and returned on Sunday evening: it was only on your machine for about 58 hours. The run time of 628,304 seconds is misleading (a duplicate of the CPU time) and an error on this website.

Runtime and credit are still being adjusted, and errors are a common feature of beta testing. Sometimes you win, others (like this one) you lose. I'm sure your comments will be noted before testing is complete.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Message 58599 - Posted: 4 Apr 2022 | 18:32:01 UTC
Last modified: 4 Apr 2022 | 18:33:58 UTC

For some reason I haven't been able to snag any of the Python beta tasks lately.

Just the old stock Python tasks.

A couple of them failed at 30 minutes with the "no progress downloading the Python environment after 1800 seconds" error.

One of the reasons I would like to get the new beta tasks that overcome that issue.

Also found a task at 5 hours and counting at 100% completion and not reporting. Suspended the task and resumed in the hope that would nudge it to report but it just restarted at 10% progress.

[Edit] Looks like the suspend/resume was the trick after all. Uploading now.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58600 - Posted: 5 Apr 2022 | 7:54:43 UTC - in response to Message 58597.

The credit system awards credit proportional to the amount of compute required to complete each task, as in acemd3.

In acemd3, it is proportional to the complexity of the simulation. In Python tasks, which train reinforcement learning agents, it is proportional to the number of interactions between the agent and its simulated environment required for the agent to learn how to behave in it.

At the moment we give 2,000 credits per 1M interactions, and most tasks require 25M training interactions (except test tasks, which are shorter, normally just 1M). Therefore, completing a task gives 50,000 credits, or 75,000 if completed especially fast.

Note that we are in beta phase, and while the credit difference between acemd and pythonGPU jobs should not be huge, we might need to adjust the credits given per 1M interactions to make them equivalent.
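Under the stated numbers, the credit calculation is a simple linear formula; a sketch (the 1.5x fast-return bonus is inferred from the 50,000/75,000 pair, not an official constant):

```python
CREDITS_PER_MILLION = 2000  # credits per 1M agent-environment interactions

def task_credit(interactions_millions, fast_return=False):
    """Credit for a task, per the scheme described above."""
    base = CREDITS_PER_MILLION * interactions_millions
    return base * 1.5 if fast_return else base

print(task_credit(25))                    # standard 25M-interaction task -> 50000
print(task_credit(25, fast_return=True))  # fast return -> 75000
```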
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Message 58601 - Posted: 5 Apr 2022 | 8:04:42 UTC - in response to Message 58599.
Last modified: 5 Apr 2022 | 8:04:59 UTC

Batches of both PythonGPU and PythonGPUBeta tasks are being sent out this week. Hopefully the PythonGPUBeta tasks will run without issues.

We want to wait a bit more in case more bugs are detected, but we will soon update the pythonGPU app with the code from PythonGPUBeta, which seems to work well now. As mentioned, it does not have the problem of installing conda every time (instead downloads the packed environment only the first time). It also works for Linux and Windows.

At that point we will keep PythonGPUBeta only for testing.
____________

Profile bcavnaugh
Send message
Joined: 8 Nov 13
Posts: 56
Credit: 1,002,640,163
RAC: 0
Level
Met
Message 58602 - Posted: 5 Apr 2022 | 16:52:21 UTC
Last modified: 5 Apr 2022 | 16:53:43 UTC

So far some run well, while others ran for 2 and 3 days.
I did abort the ones that were still running after 3 days.
I will pick back up in the Fall and I hope to see good running tasks on my GPUs.
For now I am waiting for new 3 & 4 on two of my hosts; it is a real bummer that our hosts have to sit for days on end without getting any tasks.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Message 58603 - Posted: 5 Apr 2022 | 17:38:10 UTC

Looks like the standard BOINC mechanism of complaining in a post on the forums and having the BOINC genies grant your wish.

Been getting nothing but solid Python beta tasks now for the past couple of days.

WR-HW95
Send message
Joined: 16 Dec 08
Posts: 7
Credit: 1,270,152,118
RAC: 284,612
Level
Met
Message 58604 - Posted: 5 Apr 2022 | 18:04:48 UTC

I have serious problems with my other machine running 1080Ti.
So far, of 20 tasks in the past 2 weeks, the best one ran around 38 secs before erroring.
I tried to underpower + underclock core and mem, still same result around same time.
This one is result of last one.
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"

Is there something wrong with newer NVIDIA drivers?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900X vs 5900X), is the graphics driver version.
The machine that runs tasks has driver 496.49.
The machine that fails tasks has driver 511.79.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Message 58605 - Posted: 5 Apr 2022 | 19:06:38 UTC - in response to Message 58604.

I have serious problems with my other machine running 1080Ti.
So far, of 20 tasks in the past 2 weeks, the best one ran around 38 secs before erroring.
I tried to underpower + underclock core and mem, still same result around same time.
This one is result of last one.
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"

Is there something wrong with the newer Nvidia drivers?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900X vs 5900X), is the graphics driver version.
The machine that runs tasks has driver 496.49.
The machine that fails tasks has driver 511.79.


You can try changing the driver back and see; that's an easy troubleshooting step. It could definitely be the driver.

But you seem to be having an issue with the ACEMD3 tasks; this thread is about the Python tasks.

____________

WR-HW95
Send message
Joined: 16 Dec 08
Posts: 7
Credit: 1,270,152,118
RAC: 284,612
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58606 - Posted: 5 Apr 2022 | 21:38:04 UTC - in response to Message 58605.

Sorry for posting in the wrong thread.
Changed drivers to 496.49 on the other machine too... now I just have to wait to get some work to see if it works.

Personally, I was really hoping, when the new things were coming, that this project would finally ditch CUDA and move to OpenCL.

No project I have crunched with on OpenCL has had extended issues like this. And most of those projects run on AMD cards too.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58607 - Posted: 6 Apr 2022 | 0:41:00 UTC - in response to Message 58606.

I've had no problems with their CUDA ACEMD3 app. It's been very stable across many data sets. All of the issues raised in this thread are in regards to the Python app that's still in testing/beta. Problems are to be expected.

CUDA outperforms OpenCL. Even with identical code (as much as it can be), there is always the added overhead of needing to compile the OpenCL code at runtime, whereas CUDA runs natively on Nvidia. Most projects run OpenCL because it lets them more easily port the code to different devices, expanding their user base at the expense of some performance overhead.

There have been many problems with the 500+ series drivers though. If you still have issues with the older drivers then it's something else wrong with your setup. If you didn't totally purge the old drivers with DDU from Safe Mode and re-install from a fresh Nvidia package, that's a good first step; sometimes driver corruption can linger across many driver removals and upgrades and needs to be more forcefully removed.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58608 - Posted: 6 Apr 2022 | 5:23:39 UTC - in response to Message 58602.

bcavnaugh wrote:

... For now I am waiting for new 3 & 4 on two of my hosts, it is a real bummer that our hosts have to sit for days on end without getting any tasks.

you say it, indeed :-(
Obviously, ACEMD has very low priority at GPUGRID these days :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58609 - Posted: 7 Apr 2022 | 19:23:48 UTC

Beta is still having issues with establishing the correct Python environment.

Threw away around 27 tasks today with errors because of:

TypeError: object of type 'int' has no len()
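For context, this error means that a piece of code expecting a sequence received a bare int instead. A hypothetical minimal reproduction and guard (illustrative only, not the actual task code):

```python
# Illustration of the error class behind "object of type 'int' has no len()":
# a function that expects a list/sequence is handed a bare int instead.

def count_items(batch):
    return len(batch)  # raises TypeError when batch is an int

# Defensive variant: wrap scalars into a one-element list first.
def count_items_safe(batch):
    if not hasattr(batch, "__len__"):
        batch = [batch]
    return len(batch)
```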

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58613 - Posted: 8 Apr 2022 | 9:51:42 UTC - in response to Message 58609.

thanks, this is solved now. A new batch is running without this issue.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58614 - Posted: 8 Apr 2022 | 14:43:17 UTC

There are still a few old tasks around. I got the _9 (and hopefully final) issue of WU 27184379 from 19 March. It registered the 51% mark but hasn't moved on in over 3 hours: I'm afraid it's going the same way as all previous attempts.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58615 - Posted: 8 Apr 2022 | 17:04:36 UTC

Yes, I am still getting the bad work unit resends.

Too bad they couldn't be purged before hitting the _9 timeout.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58616 - Posted: 11 Apr 2022 | 10:26:59 UTC

New tasks today.

But: "ModuleNotFoundError: No module named 'yaml'"

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58617 - Posted: 11 Apr 2022 | 16:01:42 UTC

Same here today.

Azmodes
Send message
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 25,470
Level
Met
Scientific publications
watwatwat
Message 58618 - Posted: 11 Apr 2022 | 19:04:45 UTC

Same.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58619 - Posted: 12 Apr 2022 | 8:59:28 UTC
Last modified: 12 Apr 2022 | 9:01:22 UTC

Thanks for the feedback. I will look into it today.

In which OS?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58621 - Posted: 12 Apr 2022 | 9:08:46 UTC - in response to Message 58619.

In which OS?

These were "Python apps for GPU hosts v4.01 (cuda1121)", which is Linux only.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58622 - Posted: 12 Apr 2022 | 9:35:11 UTC - in response to Message 58621.
Last modified: 12 Apr 2022 | 9:36:30 UTC

Right, I just saw it browsing through the failed jobs. It seems the issue is in the PythonGPU app, not in PythonGPUBeta.

This is what I think happened: since in PythonGPU the conda environment is created every time, it could be that the dependencies of one or more required packages have changed recently. Therefore, the yaml package was not installed in the environments and was missing during execution.

This is one more reason to switch to the new approach (currently beta). The conda environment is created, packed and sent to the volunteer machine when executing the first job. There, the environment is simply unpacked, and there is no need to send a new one unless some fix is required.
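The caching scheme described here can be sketched in a few lines of Python (names and layout are illustrative, not the actual GPUGrid implementation): the packed environment is unpacked only once per archive, and later jobs reuse the unpacked copy.

```python
# Sketch: unpack a packed conda environment only if this exact archive
# (identified by its checksum) has not been unpacked before.
import hashlib
import os
import tarfile

def ensure_env(packed_env: str, install_dir: str) -> str:
    """Return a directory containing the unpacked environment,
    reusing a previously unpacked copy when available."""
    with open(packed_env, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()[:16]
    target = os.path.join(install_dir, f"env-{digest}")
    if not os.path.isdir(target):          # first job: unpack once
        with tarfile.open(packed_env) as tar:
            tar.extractall(target)
    return target                           # later jobs: reuse as-is
```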

We will move the PythonGPUBeta app to PythonGPU. PythonGPUBeta is now quite stable, and its approach avoids this kind of problem. I expect we can do it today, but I will post to confirm.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58624 - Posted: 12 Apr 2022 | 14:41:23 UTC
Last modified: 12 Apr 2022 | 15:49:53 UTC

The current version of PythonGPUBeta has been copied to PythonGPU.

Seems like the task DISK_LIMIT needs to be increased; I have seen some EXIT_DISK_LIMIT_EXCEEDED errors. We will adjust it.
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58625 - Posted: 12 Apr 2022 | 16:48:14 UTC

Well this is interesting to read.
Over at RAH they are using Python (cpu) and they are memory and disk space hogs.
I suggest once you get your GPU tasks working you make a FAQ on minimum memory and disk space needed to run these tasks.

One CPU task uses 7.8 GB compressed, 8.4 GB actual space on the drive.
Memory-wise it uses 2861 MB of physical RAM and 55 to 58 MB of virtual.
If your GPU tasks are anything like these... well, we will need a bit of free space.

Looking forward to reading about your success getting python running on GPU.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58634 - Posted: 13 Apr 2022 | 7:23:26 UTC - in response to Message 58625.
Last modified: 13 Apr 2022 | 7:42:40 UTC

The sizes of all the app files (including the compressed environment) are:

2.0G for windows with cuda102
2.7G for windows with cuda1131
1.8G for linux with cuda102
2.6G for linux with cuda1131

The additional task-specific data ranges from a few KB to a few MB. I did not expect 7.8 GB compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?

Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training AI agents to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent's environment happen on the CPU; learning happens on the GPU.
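A minimal sketch of why utilisation alternates between CPU and GPU, in pure Python with the GPU phase stood in by a plain function (illustrative only, not the project's code):

```python
# Reinforcement-learning loop shape: collect experience on the CPU,
# then hand a batch to the learner (the only GPU-heavy phase).
import random

def collect_rollout(n_steps):
    """CPU-bound phase: step the simulated environment."""
    return [(random.random(), random.choice([0, 1])) for _ in range(n_steps)]

def learn(batch):
    """GPU-bound phase in the real app; here just a stand-in update."""
    return sum(obs for obs, _ in batch) / len(batch)

def train(iterations, n_steps=128):
    stats = []
    for _ in range(iterations):
        batch = collect_rollout(n_steps)   # GPU mostly idle here
        stats.append(learn(batch))         # CPU mostly idle here
    return stats
```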
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58635 - Posted: 13 Apr 2022 | 9:10:08 UTC

Also, the PythonGPU app version used in the new jobs should be 402 (or 4.02).

If that is not the case, there is probably some problem. It should be automatically used, but if that is not the case resetting the app should help.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58636 - Posted: 13 Apr 2022 | 9:46:58 UTC - in response to Message 58635.
Last modified: 13 Apr 2022 | 9:47:41 UTC

I have e1a46-ABOU_rnd_ppod_avoid_cnn_3-0-1-RND3588_4 running under Linux. I can confirm that my task (and its four predecessors) are running with the v4.02 app.

Small point: can you apply a "weight" to the sub-tasks in job.xml, please? At the moment, the 'decompress' stage is estimated to take 50% of the runtime under Linux, and 66% under Windows. That throws out the estimate for the rest of the run.

Under Linux, my slot directory is occupying 9.8 GB, against an allowed limit of 10,000,000,000 bytes: that's tight, especially when you consider the divergence of binary and decimal representations for bigger files.

All my predecessors for this workunit were running Windows. Three failed on disk limits, and one on memory limits. If every Windows version is using the 7-zip decompressor, there's the extra 'de-archived, but still compressed' step to allow for in the disk limit.

Still awaiting the final hurdle - the upload file size limit. In about 4 hours' time, I reckon - currently at 85% after 10 hours.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58637 - Posted: 13 Apr 2022 | 10:09:20 UTC - in response to Message 58636.
Last modified: 13 Apr 2022 | 15:03:42 UTC

Thanks a lot for the info Richard!

You are right, I should adjust the weights of the subtasks in job.xml to 10% for 'decompress' and 90% for executing the python script. That maybe also explains why jobs were getting stuck at 50% when python was not closed properly between jobs: the new job could decompress the environment (50%), but the python script could not be executed.
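As a sketch, a BOINC wrapper job.xml with such weights might look like the fragment below (the application and file names here are illustrative, not the project's actual ones; the wrapper's <weight> element makes each sub-task's share of the reported progress proportional to its weight):

```xml
<job_desc>
    <task>
        <application>7za.exe</application>
        <command_line>x environment.tar.gz</command_line>
        <weight>10</weight>
    </task>
    <task>
        <application>python.exe</application>
        <command_line>run.py</command_line>
        <weight>90</weight>
    </task>
</job_desc>
```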

I have increased the allowed limit to 30,000,000,000 bytes. This should affect all new jobs (to be confirmed) and should solve the DISK LIMIT problems.

Finally, I was also thinking about sending the compressed environment as a tar.bz2 file instead of a tar.gz to make it smaller. But I have to test that 7-zip handles it correctly.
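The size trade-off can be checked quickly with the standard library (illustrative sketch; gzip corresponds to .tar.gz and bz2 to .tar.bz2 — bz2 usually compresses tighter but packs and unpacks more slowly):

```python
# Compress the same payload with gzip and bz2 and compare sizes.
import bz2
import gzip

def compare(data: bytes):
    gz = gzip.compress(data)
    bz = bz2.compress(data)
    # both must round-trip back to the original bytes
    assert gzip.decompress(gz) == data
    assert bz2.decompress(bz) == data
    return len(gz), len(bz)
```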

We will probably deploy these changes first in PythonGPUBeta; that is what it is for.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58638 - Posted: 13 Apr 2022 | 11:25:32 UTC - in response to Message 58637.

I'd say 1%::99%, but thanks.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58639 - Posted: 13 Apr 2022 | 13:59:28 UTC

Uploaded and reported with no problem at all.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58640 - Posted: 13 Apr 2022 | 15:16:35 UTC - in response to Message 58639.

Has the allowed limit changed to 30,000,000,000 bytes?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58641 - Posted: 13 Apr 2022 | 16:19:28 UTC

Appears so.

<rsc_disk_bound>30000000000.000000</rsc_disk_bound>

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58642 - Posted: 13 Apr 2022 | 19:28:33 UTC - in response to Message 58634.
Last modified: 13 Apr 2022 | 19:30:53 UTC

The sizes of all the app files (including the compressed environment) are:

2.0G for windows with cuda102
2.7G for windows with cuda1131
1.8G for linux with cuda102
2.6G for linux with cuda1131

The additional task-specific data ranges from a few KB to a few MB. I did not expect 7.8 GB compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?

Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training AI agents to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent's environment happen on the CPU; learning happens on the GPU.



Note: I was commenting on Rosetta@home CPU pythons.
What yours do, I don't know. I guess I had better add your project and see what happens.

I re-added your project to my system, so if I am home when a task is sent out, I'll have a look.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58643 - Posted: 14 Apr 2022 | 7:36:34 UTC - in response to Message 58642.

Thank you!

I have added the subtask weights to the PythonGPUbeta app. Currently testing it with a small batch of tasks.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58644 - Posted: 14 Apr 2022 | 8:42:41 UTC
Last modified: 14 Apr 2022 | 9:20:16 UTC

Testing was successful, so we can add the weights to the PythonGPU app job.xml file
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58655 - Posted: 15 Apr 2022 | 21:20:06 UTC

abouh,

can you have a look at my comments in a thread I created.
The 4.0 task was not increasing in percentage done after I watched it for 10 minutes. The time to completion kept jumping around, 1 second up, 1 second down.
40 minutes of run time vs CPU time? That's a hell of a lot of setup time!

Here are the local host task details
Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898
State Running
Received 4/15/2022 12:06:46 PM
Report deadline 4/20/2022 12:06:46 PM
Estimated app speed 53.74 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.987 CPUs + 1 NVIDIA GPU (GTX 1050)
CPU time at last checkpoint 06:44:35
CPU time 06:47:39
Elapsed time 06:05:04
Estimated time remaining 198d,09:49:25
Fraction done 7.880%
Virtual memory size 7,230.02 MB
Working set size 2,057.87 MB

mikey
Send message
Joined: 2 Jan 09
Posts: 286
Credit: 573,060,776
RAC: 161
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58666 - Posted: 17 Apr 2022 | 20:16:19 UTC - in response to Message 58652.

You can delete the previous post about ACMED3. I posted that incorrectly here.


Some forums let you put a double space or a double period to delete your own post, but you must still do it within the editing time

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58669 - Posted: 18 Apr 2022 | 12:27:00 UTC - in response to Message 58666.

Mikey, I know. But the time limit expired on that post to edit it. I came back days later not within the 30-60 minutes allowed.

Werinbert
Send message
Joined: 12 May 13
Posts: 5
Credit: 100,032,540
RAC: 0
Level
Cys
Scientific publications
wat
Message 58672 - Posted: 18 Apr 2022 | 19:31:43 UTC

I am now running a Python task. It has very low GPU usage, most often around 5 to 10%, occasionally getting up to 20%. Is this normal? Should I wait until I move my GPU from an old 3770K to a 12500 computer, for better CPU capabilities, to do these tasks?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58673 - Posted: 18 Apr 2022 | 23:12:34 UTC - in response to Message 58672.

This is normal for the Python on GPU tasks. The tasks run on both the CPU and GPU during different parts of the computation, for the inferencing and machine learning segments.

Read the posts by the admin developer explaining what the process involves.

- cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase. It is correct.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58674 - Posted: 19 Apr 2022 | 8:21:52 UTC - in response to Message 58655.
Last modified: 19 Apr 2022 | 8:24:36 UTC

Sorry for the late reply Greg _BE, I hid the ACEMD3 posts.

I checked your job e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898. Did the progress get stuck or was it just increasing slowly?

The job was finally completed by another Windows 10 host, but the CPU time is wrong because it says 668566.9 seconds.

I am not sure, but maybe one problem is that we ask for only 0.987 CPUs, since that was ideal for ACEMD jobs. In reality, Python tasks use more. I will look into it.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58675 - Posted: 19 Apr 2022 | 8:25:47 UTC

New tasks being issued this morning, allocated to the old Linux v4.01 'Python app for GPU hosts' issued in October 2021.

All are failing with "ModuleNotFoundError: No module named 'yaml'".

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58676 - Posted: 19 Apr 2022 | 8:38:26 UTC - in response to Message 58674.

I am not sure, but maybe one problem is that we ask only for 0.987 CPUs, since that was ideal for ACEMD jobs. In reality Python tasks use more. I will look into it.

Asking for 1.00 CPUs (or above) would make a significant difference, because that would prompt the BOINC client to reduce the number of tasks being run for other projects.

It would be problematic to increase the CPU demand above 1.00, because the CPU loading is dynamic - BOINC has no provision for allowing another project to utilise the cycles available during periods when the GPUGrid app is quiescent. Normally, a GPU app is given a higher process priority for CPU usage than a pure CPU app, so the operating system should allocate resources to your advantage, but that can be problematic when the wrapper app is in use. That was changed recently: I'll look into the situation with your server version and our current client versions.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58677 - Posted: 19 Apr 2022 | 9:23:32 UTC - in response to Message 58675.
Last modified: 19 Apr 2022 | 9:24:44 UTC

Definitely only the latest version 403 should be sent. Thanks for letting us know.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58678 - Posted: 19 Apr 2022 | 12:01:04 UTC

BOINC GPU apps, wrapper apps, and process priority

The basic rule for BOINC applications (originally CPU only) has been to run applications at idle priority, to avoid interfering with foreground use of the computer.

Since the introduction of GPU apps into BOINC around 2008, the CPU portion of a GPU app has been automatically run at a slightly higher process priority (below normal) - an attempt to avoid highly-productive GPU work being throttled by competition for CPU resources.

Normally, the BOINC client manages these two different process priorities directly. But when a wrapper app is interpolated between the client and a worker app, it's the wrapper which sets the priority for the worker app. It was a user on this project who first noticed (Issue 3764 - May 2020) that the process priority of a GPU app wasn't being set correctly when it was executing under the control of a wrapper app.

Many false starts later (PRs 3826, 3948, 3988, 3999), a fully consistent set of process priority tools was developed, effective from about 25 September 2020.

But in order for these tools to be useful, compatible versions of both the BOINC client and the wrapper application have to be used. So far as I can tell, BOINC client for Windows v7.16.20 (current) is compliant; Wrapper version 26203 is compliant; but no full public release versions of the BOINC client for Linux are yet compliant (Gianfranco Costamagna's prototyping PPA client should be).

This project appears to be using wrapper code 26016 for Windows, and wrapper code 26198 for Linux. Unless these have been patched locally, neither wrapper will yet allow full process control management.

It's not urgent, but with the new Python apps running in a mixed CPU/GPU environment, it might be helpful to update the project's wrapper codebase. Fortunately, the basic server platform is unaffected by all this.
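For illustration, the core of what a wrapper does here can be sketched with only the Python standard library: start the worker at a lower CPU priority so it yields to foreground work. This is a POSIX-only sketch (preexec_fn is not available on Windows) and not the BOINC wrapper's actual code:

```python
# Launch a child process with an increased nice value (lower priority).
import os
import subprocess

def run_worker_nice(argv, niceness=10):
    """Run argv as a child process at a higher nice value (POSIX only)."""
    return subprocess.run(
        argv,
        preexec_fn=lambda: os.nice(niceness),  # runs in the child, before exec
        capture_output=True,
        text=True,
    )
```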

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58696 - Posted: 21 Apr 2022 | 15:04:23 UTC - in response to Message 58675.

We have deprecated v4.01

Hopefully, if everything went fine, the error

All are failing with "ModuleNotFoundError: No module named 'yaml'".


should not happen any more. And all jobs should use v4.03
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58752 - Posted: 27 Apr 2022 | 18:52:49 UTC
Last modified: 27 Apr 2022 | 19:41:00 UTC

abouh,

I got another python finally.
But here is something interesting: the CPU value according to BoincTasks is 221%!
How can you get more than 100% of a single core?
Another observation: elapsed time vs CPU time. The two are off by about 5 hours,
4:01 vs 8:54 currently.
Progress is not moving very fast; in the time it has taken me to write this, it is stuck at 7.88%.
Now 4:16 vs 9:24 and still 7.88%!! 15 mins and no progress? If this hasn't changed in the next hour, I am aborting this task too.
BTW, 46 checkpoints in the 4 hrs of run time.

https://www.gpugrid.net/workunit.php?wuid=27219917

Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 589200
Exception: The wandb backend process has shutdown
GeForce GTX 1050 (2047MB) driver: 512.15

Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID 590211
Run time 241,306.00
CPU time 1,471.50
GeForce RTX 3080 Ti (4095MB) driver: 497.

The point of this information is:

1) I have a GTX 1050 and 1080. The previous python failed with the same exit error as the first person on this python task. What is EXIT_CHILD_FAILED? Something on your end or on ours?

2) Person 2 probably aborted because of the way BOINC reads the data to determine the time. I killed my first python because it showed 160+ days to completion.




***I give up. No progress in 30 minutes since I started this post***

Computer: DESKTOP-LFM92VN
Project GPUGRID

Name e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256_2

Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256
State Running
Received 4/27/2022 4:35:18 PM
Report deadline 5/2/2022 4:35:18 PM
Estimated app speed 3,171.20 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.987 CPUs + 1 NVIDIA GPU (device 1)
CPU time at last checkpoint 09:58:18
CPU time 10:08:59
Elapsed time 04:37:57
Estimated time remaining 161d,06:23:41
Fraction done 7.880%
Virtual memory size 6,429.20 MB
Working set size 1,072.13 MB
Directory slots/12
Process ID 16828

Debug State: 2 - Scheduler: 2

That's 4:01 to 4:38 and still at 7.88%
Checkpoints count up. CPU is 219%
This is all messed up.
I join the abort team.

------------

Something about the other task that failed with exit child.
A few extracts:

wandb: Network error (ReadTimeout), entering retry loop.

Exception in thread StatsThr:
Traceback (most recent call last):
File "D:\data\slots\13\lib\site-packages\psutil\_common.py", line 449, in wrapper
ret = self._cache[fun]
AttributeError: 'Process' object has no attribute '_cache'

During handling of the above exception, another exception occurred:
(followed by line this and line that, etc)

And then this:
OSError: [WinError 1455] The paging file is too small for this operation to complete

But the next person who got it has this kind of setup:

CPU type AuthenticAMD
AMD Ryzen 5 5600X 6-Core Processor [Family 25 Model 33 Stepping 0]
Number of processors 12
Coprocessors NVIDIA NVIDIA GeForce RTX 3080 (4095MB) driver: 512.15
Operating System Microsoft Windows 11
x64 Edition, (10.00.22000.00

I run GTX cards and Win10 with a Ryzen 7 2800 and BOINC 7.16.20.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58753 - Posted: 27 Apr 2022 | 19:35:21 UTC

But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?

Because the task was actually using a little more than two cores to process the work.

That's why I have set Python tasks to allocate 3 CPU threads for BOINC scheduling.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58754 - Posted: 27 Apr 2022 | 19:45:18 UTC - in response to Message 58753.
Last modified: 27 Apr 2022 | 19:46:26 UTC

But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?

Because the task was actually using a little more than two cores to process the work.

Why I have set Python task to allocate 3 cpu threads for BOINC scheduling.


Ok... interesting, but what accounts for the lack of progress in 30 minutes on this task that I just killed, and for the exit-child error and blow-up on the previous Python?

I mean, really... 0% movement, with 2 decimal points stuck at 7.88, for more than 30 minutes?
I don't know of any project that can't advance even 1/100th of a percent in 30 minutes.
I've seen my share of slow tasks in other projects, but this one... wow...

And how do you go about setting just python for 3 cpu cores? That's beyond my knowledge level.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58755 - Posted: 27 Apr 2022 | 22:31:48 UTC - in response to Message 58754.

You use an app_config.xml file in the project like this:

<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPUbeta</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
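As an optional extra (my suggestion, not part of BOINC): the file can be sanity-checked with a few lines of stdlib Python before restarting the client, so a typo shows up immediately rather than as a client parse error.

```python
# Parse app_config.xml and report the per-app cpu_usage values.
import xml.etree.ElementTree as ET

def check_app_config(path):
    tree = ET.parse(path)            # raises ParseError on malformed XML
    usages = {}
    for app in tree.getroot().findall("app"):
        name = app.findtext("name")
        cpu = app.findtext("gpu_versions/cpu_usage")
        usages[name] = float(cpu)
    return usages
```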

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58762 - Posted: 28 Apr 2022 | 19:40:48 UTC - in response to Message 58755.

You use an app_config.xml file in the project like this:

<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPUbeta</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>


Ok thanks. I will make that file tomorrow or this weekend. Too tired to try that tonight.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58767 - Posted: 30 Apr 2022 | 21:23:31 UTC - in response to Message 58696.

We have deprecated v4.01
Hopefully, if everything went fine, the error
All are failing with "ModuleNotFoundError: No module named 'yaml'".
should not happen any more. And all jobs should use v4.03

I've recently reset the Gpugrid project on every one of my hosts, but I've still received v4.01 at several of them, and it failed with the mentioned error.
Some subsequent v4.03 resends of the same tasks have eventually succeeded on other hosts.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58768 - Posted: 1 May 2022 | 1:18:29 UTC - in response to Message 58767.
Last modified: 1 May 2022 | 1:19:13 UTC

Unfortunately the admins never yanked the malformed tasks from distribution.

They will only disappear when they hit the 7th (_6) resend and it fails; then the workunit will be pulled from distribution ("Too many errors (may have bug)").

I've had a lot of the bad Python 4.01 tasks too, but thankfully a lot of them were at the tail end of distribution.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58770 - Posted: 3 May 2022 | 8:56:12 UTC - in response to Message 58752.
Last modified: 3 May 2022 | 9:23:28 UTC

Sorry for the late reply Greg _BE, I was away for the last 5 days. Thank you very much for the detailed report.

----------

1. Regarding this error:

Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 589200
Exception: The wandb backend process has shutdown
GeForce GTX 1050 (2047MB) driver: 512.15


Seems like the process failed after raising the exception "The wandb backend process has shutdown", and then the whole task got stuck, which is why no progress was being made. wandb is the python package we use to send out logs about the agent training process; it provides useful information to better understand the task results. Since the task reached 7.88% progress, I assume it worked well until then. I need to review other jobs to see why this could be happening and whether it has happened on other machines. We had not detected this issue before. Thanks for bringing it up.

----------

2. The time estimate is not right for now, due to the way BOINC computes it; Richard provided a very complete explanation in a previous post. We hope it will improve over time... for now, be aware that it is completely wrong.

----------

3. Regarding this error:

OSError: [WinError 1455] The paging file is too small for this operation to complete

It is related to using pytorch on Windows. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback
We are applying the solution given there to mitigate the error, but for now it cannot be eliminated completely.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58771 - Posted: 3 May 2022 | 8:59:20 UTC - in response to Message 58768.
Last modified: 3 May 2022 | 8:59:31 UTC

Seems like deprecating the version v4.01 did not work then... I will check if there is anything else we can do to enforce usage of v4.03 over the old one.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58772 - Posted: 3 May 2022 | 15:03:59 UTC - in response to Message 58771.
Last modified: 3 May 2022 | 15:05:20 UTC

You need to send a message to all hosts when they connect to the scheduler to physically delete the 4.01 application from the host and to delete its entry in the client_state.xml file.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58773 - Posted: 3 May 2022 | 15:36:10 UTC
Last modified: 3 May 2022 | 15:37:13 UTC

I sent a batch which will fail with

yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy.core.multiarray.scalar'


It is just an error in the experiment configuration. I immediately cancelled the experiment and fixed the configuration, but the tasks had already been sent.

I am very sorry for the inconvenience. Fortunately the jobs fail right after starting, so there is no need to kill them. Another batch, already submitted, contains jobs with the fixed configuration.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 11
Credit: 140,976,809
RAC: 355
Level
Cys
Scientific publications
wat
Message 58774 - Posted: 3 May 2022 | 16:28:10 UTC

I was not getting too many of the python work units, but I recently received/completed one. I know they take... a while to complete.

Specifically, I am looking at task 32892659, work unit 27222901.

I am glad it completed, but it was a long haul.

It was mentioned that "completing a task gives 50000 credits and 75000 if completed specially fast"

How fast do these need to complete for 75000? I am not saying I have the fastest processors but they are definitely not slow (they are running at ~3GHz with the boost) and the GPUs are definitely not slow.

Thanks!

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58775 - Posted: 3 May 2022 | 19:15:29 UTC - in response to Message 58774.

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a 50% boost if a task is returned within 1 day and a 25% credit boost if returned within 2 days.
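In code form, the bonus rule above is roughly the following sketch (the 50000 base credit is the figure quoted earlier in the thread; the exact server-side implementation is an assumption):

```python
def credit_with_bonus(base_credit: float, turnaround_hours: float) -> float:
    """Quick-return bonus as described above: +50% if the task is
    reported within 1 day, +25% if within 2 days, no bonus after that.
    (Sketch only; the actual server-side rule may differ in detail.)"""
    if turnaround_hours <= 24:
        return base_credit * 1.5
    if turnaround_hours <= 48:
        return base_credit * 1.25
    return base_credit

print(credit_with_bonus(50000, 8))    # same-day return -> 75000.0
print(credit_with_bonus(50000, 30))   # within 2 days   -> 62500.0
print(credit_with_bonus(50000, 60))   # no bonus        -> 50000.0
```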

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 11
Credit: 140,976,809
RAC: 355
Level
Cys
Scientific publications
wat
Message 58776 - Posted: 3 May 2022 | 19:22:50 UTC - in response to Message 58775.

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a 50% boost if a task is returned within 1 day and a 25% credit boost if returned within 2 days.



Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58777 - Posted: 3 May 2022 | 20:02:01 UTC - in response to Message 58776.

I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.

You took more than 2 days to report yours. You get a 50% boost if a task is returned within 1 day and a 25% credit boost if returned within 2 days.



Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks?


these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are then it's probably starved on CPU resources trying to run the Python task.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58778 - Posted: 3 May 2022 | 21:36:29 UTC - in response to Message 58777.

these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are then it's probably starved on CPU resources trying to run the Python task.

The critical point being that they aren't declared to BOINC as needing multiple cores, so BOINC doesn't automatically clear extra CPU space for them to run in.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58779 - Posted: 4 May 2022 | 8:20:54 UTC - in response to Message 58778.

Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58780 - Posted: 4 May 2022 | 8:22:06 UTC - in response to Message 58777.
Last modified: 4 May 2022 | 8:24:20 UTC

yes, the tasks run 32 agent environments in parallel python processes. The bottleneck could definitely be the CPU, because BOINC is not aware of it.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 11
Credit: 140,976,809
RAC: 355
Level
Cys
Scientific publications
wat
Message 58781 - Posted: 4 May 2022 | 11:57:25 UTC

Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58782 - Posted: 4 May 2022 | 12:17:13 UTC - in response to Message 58781.

Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory?


Only if you have more than 64 threads per GPU available and you stop processing of any existing CPU work.
____________

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 1,282,338,256
RAC: 99,065
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58783 - Posted: 4 May 2022 | 14:31:38 UTC

abouh asked

Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done of the user side


I tried that, but the boinc manager on my pc will overallocate CPUs. I am currently running multicore atlas cpu tasks from lhc alongside the python tasks from gpugrid. The atlas tasks are set to use 8 CPUs and the python tasks are set to use 10 CPUs. The example for this response is on an AMD cpu with 8 cores/16 threads. BOINC is set to use 15 threads. It will run one gpugrid python 10-thread task and one lhc 8-thread task at the same time. That is 18 threads running on a 15-thread cpu.

Here is my app_config for gpugrid:

<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>

<app>
<name>PythonGPUbeta</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>

<app>
<name>Python</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>

<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
</app_config>


And here is my app_config for lhc:

<app_config>
<app>
<name>ATLAS</name>
<cpu_usage>8</cpu_usage>
</app>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>8</avg_ncpus>
<cmdline>--nthreads 8</cmdline>
</app_version>
</app_config>


If anyone has any suggestions for changes to the app_config files, please let me know.
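One likely issue with the gpugrid file above, judging by the documented app_config.xml format (a suggestion, not a verified fix): <app_version> blocks belong directly under <app_config> as siblings of <app>, not nested inside it; <cpu_usage> is only honored inside <gpu_versions>; and the PythonGPUbeta and Python <app_version> blocks name PythonGPU in <app_name>, so they would apply to the wrong app. A corrected sketch for the PythonGPU entry alone:

```
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>10.0</cpu_usage>
    </gpu_versions>
  </app>
  <app_version>
    <app_name>PythonGPU</app_name>
    <plan_class>cuda1121</plan_class>
    <avg_ncpus>10</avg_ncpus>
    <ngpus>1</ngpus>
  </app_version>
</app_config>
```

The --nthreads cmdline is omitted here because it is not clear the python app accepts it; whether avg_ncpus alone prevents the overallocation described above is untested.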

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58785 - Posted: 4 May 2022 | 17:39:52 UTC - in response to Message 58781.
Last modified: 4 May 2022 | 17:41:36 UTC

I can run 2 jobs in parallel manually on my machine with 12 CPUs. They are slower than a single job, but much faster than running them sequentially.

Especially since the jobs alternate between using the CPU and using the GPU, 2 jobs won't be completely in sync, so this works as long as the GPU has enough memory.

However, I think currently GPUGrid automatically assigns one job per GPU, with the environment variable GPU_DEVICE_NUM.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58786 - Posted: 4 May 2022 | 19:09:06 UTC - in response to Message 58785.

However, I think currently GPUGrid automatically assigns one job per GPU, with the environment variable GPU_DEVICE_NUM.

Normally, the user's BOINC client will assign the GPU device number, and this will be conveyed to the job by the wrapper.

You can easily run two jobs per GPU (both with the same device number), and give them both two full CPU cores each, by using an app_config.xml file including

...
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>2.0</cpu_usage>
</gpu_versions>
...

(full details in the user manual)

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58788 - Posted: 5 May 2022 | 7:34:11 UTC - in response to Message 58786.
Last modified: 5 May 2022 | 7:34:33 UTC

I see, thanks for the clarification
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58789 - Posted: 5 May 2022 | 22:33:28 UTC
Last modified: 5 May 2022 | 22:34:23 UTC

I guess I am going to have to give up on this project.
All I get is exit child errors. Every single task.
For example: https://www.gpugrid.net/result.php?resultid=32894080

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58790 - Posted: 6 May 2022 | 7:14:02 UTC - in response to Message 58789.

This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs had already been sent and could not be cancelled. They crash right after starting to run, but it is just this batch. The following batches work normally.

I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.
____________

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 109
Credit: 80,546,939
RAC: 13,489
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58791 - Posted: 6 May 2022 | 15:52:36 UTC - in response to Message 58790.

This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs had already been sent and could not be cancelled. They crash right after starting to run, but it is just this batch. The following batches work normally.

I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.



ok...waiting in line for the next batch.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 11
Credit: 140,976,809
RAC: 355
Level
Cys
Scientific publications
wat
Message 58830 - Posted: 20 May 2022 | 16:42:10 UTC - in response to Message 58778.

I am still attempting to diagnose why these tasks take the system so long to complete. I changed the config to "reserve" 32 cores for these tasks. I also made a change so that two of these tasks run simultaneously; I am not clear on how these tasks handle multithreading. The system running them has 56 physical cores across two CPUs (112 logical). Are the "32" cores used by one of these tasks physical or logical? Also, I am relatively confident the GPUs can handle this (RTX A6000), but let me know if I am missing something.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58831 - Posted: 20 May 2022 | 19:47:55 UTC - in response to Message 58830.

Why do you think the tasks are running abnormally long?

Have you ever looked at the wall clock to see how long they take from start to finish?

You are running and finishing them well within the 5 day deadline.

You are finishing them in two days and get the 25% bonus credits.

Are you being confused by the cpu and gpu runtimes on the task?

That is the accumulated time across all 32 threads you appear to be running them on. That does not indicate the real walltime calculation. If you ran them on less threads, the accumulated time would be much less.

You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 411
Credit: 6,063,938,459
RAC: 3,526
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58832 - Posted: 20 May 2022 | 21:46:06 UTC - in response to Message 58831.

Why do you think the tasks are running abnormally long?

Have you ever looked at the wall clock to see how long they take from start to finish.

You are running and finishing them well within the 5 day deadline.

You are finishing them in two days and get the 25% bonus credits.

Are you being confused by the cpu and gpu runtimes on the task?

That is the accumulated time across all 32 threads you appear to be running them on. That does not indicate the real walltime calculation. If you ran them on less threads, the accumulated time would be much less.

You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.



They should be put back into the beta category. They still have too many bugs and need more work. It looks like someone was in a hurry to leave for summer vacation. I decided to stop crunching them, for now. Of course, there isn't much to crunch here anyway, right now.

There is always next fall to fix this.....................




Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58833 - Posted: 20 May 2022 | 21:53:38 UTC - in response to Message 58831.

Are you being confused by the cpu and gpu runtimes on the task?

That is the accumulated time across all 32 threads you appear to be running them on. That does not indicate the real walltime calculation. If you ran them on less threads, the accumulated time would be much less.

They are declared to use less than 1 CPU (and that's all BOINC knows about), but in reality they use much more.

This website confuses matters by mis-reporting the elapsed time as the total (summed over all cores) CPU time.

The only way to be exactly sure what has happened is to examine the job_log_[GPUGrid] file on your local machine. The third numeric column ('ct ...') is the total CPU time, summed over all cores: the penultimate column ('et ...') is the elapsed - wall clock - time for the task as a whole.

Locally, ct will be above et for the task as a whole, but on this website, they will be reported as the same.
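For anyone who wants to check their own tasks, a small python sketch that pulls those two fields out of a job_log line, using the field tags described above (the sample line is hypothetical):

```python
def parse_job_log_line(line: str) -> dict:
    """Extract the tagged fields from a BOINC job_log line.
    ct = total CPU time summed over all cores; et = elapsed (wall clock) time."""
    fields = line.split()
    out = {}
    # each tag is immediately followed by its value
    for key, value in zip(fields, fields[1:]):
        if key in ("ue", "ct", "fe", "et"):
            out[key] = float(value)
        elif key == "nm":
            out[key] = value
    return out

# hypothetical example line
line = ("1650000000 ue 150000.0 ct 3200000.0 fe 1000000000000000000 "
        "nm example-task_0 et 110000.0 es 0")
rec = parse_job_log_line(line)
print(rec["et"] / 3600)        # elapsed (wall clock) hours
print(rec["ct"] / rec["et"])   # average number of busy cores
```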

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58834 - Posted: 20 May 2022 | 23:09:30 UTC - in response to Message 58832.

I'm not having any issues with them on Linux. I don't know how that compares to Windows hosts.

I get at least a couple a day per host for the past several weeks.

Nothing like a month ago when there were a thousand or so available.

I doubt we ever return to the production of years ago unfortunately.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58844 - Posted: 23 May 2022 | 7:49:06 UTC - in response to Message 58830.
Last modified: 24 May 2022 | 7:30:27 UTC

The 32 cores are logical: python processes running in parallel. I can run them locally on a 12-CPU machine. The GPU should be fine as well, so you are correct about that.

We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real walltime should be much less than reported.

It would be very helpful if you could let us know if that is the case. In particular, if you are getting 75000 credits per job, it means the jobs are getting the extra credit for returning fast.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58845 - Posted: 23 May 2022 | 8:42:25 UTC - in response to Message 58832.

We decided to remove the beta flag from the current version of the python app when we found it to work without errors on a reasonable number of hosts. We are aware that, even though we test it on our local Linux and Windows machines, there is a vast variety of configurations, versions and resource capabilities among the hosts, and it will not work on all of them.

However, please note that in research, at some point we need to start running experiments (I want to talk more about that in my next post). Further testing and fixing is required and we are committed to doing it. This takes a long time, so we need to work on both things in parallel. We will still use the beta app to test new versions.

Please, if you are seeing a recurring, specific problem on your machines, let me know and I will look into it.

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58846 - Posted: 23 May 2022 | 8:44:31 UTC - in response to Message 58844.

I'm away from my machines at the moment, but can confirm that's the case.

Look at task 32897902. Reported time 108,075.00 seconds (well over a day), but got 75,000 credits. It was away from the server for about 11 hours. GTX 1660, Linux Mint.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58847 - Posted: 23 May 2022 | 9:21:48 UTC - in response to Message 58834.
Last modified: 23 May 2022 | 9:23:30 UTC

I am not sure about the acemd tasks, but for the python tasks, I will increase the number of tasks progressively.

To recap a bit about what we are doing: we are experimenting with populations of machine learning agents, trying to figure out how important social interactions and information sharing are for intelligent agents. More specifically, we train multiple agents for periods of time on different GPUGrid machines, which later return to the server to report their results. We are researching what kind of information they can share and how to build a common knowledge base, similar to what we humans do. Then, new generations of the population repeat the process, already equipped with the knowledge distilled by previous generations.

At the moment we have several experiments running with population sizes of 48 agents; that means a batch of 48 agents every 24-48h. We also have one experiment with 64 agents and one with 128. To my knowledge no recent paper has tried more than 80 agents, and we plan to keep increasing the population sizes to figure out how relevant population size is for intelligent agent behavior. Ideally I would like to reach population sizes of 256, 512 and 1024.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 11
Credit: 140,976,809
RAC: 355
Level
Cys
Scientific publications
wat
Message 58848 - Posted: 23 May 2022 | 14:00:55 UTC - in response to Message 58833.

Thanks for this info. Here is the log file for a recently completed task:

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

So, the clock time is 117973.295733? Which would be ~32 hours of actual runtime?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58853 - Posted: 23 May 2022 | 21:18:38 UTC - in response to Message 58848.

Thanks for this info. Here is the log file for a recently completed task:

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

So, the clock time is 117973.295733? Which would be ~32 hours of actual runtime?



No. That is incorrect. You cannot use the clocktime reported in the task. That will accumulate over however many cpu threads the task is allowed to show to BOINC. Blame BOINC for this issue not the application.

Look at the sent time and the returned time to calculate how long the task actually took to process. Returned time minus the sent time = length of time to process.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58855 - Posted: 23 May 2022 | 23:41:45 UTC

BOINC just does not know how to account for these Python tasks which act "sorta" like an MT task.

But BOINC does not handle MT tasks correctly either for that matter.

Blame it on the BOINC code which is old. Like it knows how to handle a task on a single cpu core and that is about all it gets right.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58856 - Posted: 24 May 2022 | 6:26:28 UTC - in response to Message 58853.

1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0

No. That is incorrect. You cannot use the clocktime reported in the task. That will accumulate over however many cpu threads the task is allowed to show to BOINC. Blame BOINC for this issue not the application.

Actually, that line (from the client job log) is a useful source of information. It contains both

ct 3544023.000000

which is the CPU or core time - as you say, it dates back to the days when CPUs only had one core. But now, it comprises the sum over however many cores are used.

and et 117973.295733

That's the elapsed time (wallclock measure), which was added when GPU computing was first introduced and CPU time was no longer a reliable indicator of work done.

I agree that many outdated legacy assumptions remain active in BOINC, but I think it's got beyond the point where mere tinkering could fix it - we really need a full Mark 2 rewrite. But that seems unlikely under the current management.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58858 - Posted: 24 May 2022 | 17:24:20 UTC

OK, so here is a back-of-the-napkin calculation of how long the task actually took to crunch.

Take the ct time from the job_log entry for the task and divide by 32, since the tasks spawn 32 processes on the cpu, to account for the way that BOINC accumulates cpu_time across all the cores crunching the task.

So 3544023.000000 / 32 = 110750.72 seconds

or in reality a little under 31 hours to crunch.

That agrees with the elapsed (et) value in the same job_log line, and with the wall clock (reported - sent) time for the task.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58859 - Posted: 24 May 2022 | 18:02:39 UTC - in response to Message 58858.

Well, since there's also a 'nm' (name) field in the client job log, we can find the rest:

Task 32897743, run on host 588658.

Because it's a Windows task, there's a lot to digest in the std_err log, but it includes

04:44:21 (34948): .\7za.exe exited; CPU time 9.890625
04:44:21 (34948): wrapper: running python.exe (run.py)

13:32:28 (7456): wrapper (7.9.26016): starting
13:32:28 (7456): wrapper: running python.exe (run.py)
(that looks like a restart)
Then some more of the same, and finally

14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0)

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58860 - Posted: 24 May 2022 | 18:32:40 UTC


14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0)

So 2816214 / 32 = 88006 seconds

88006 / 3600 seconds = 24.44 hours

That is close to matching the received time minus the sent time of a little over a day.

The task didn't get the full 50% credit bonus for returning within 24 hours, but it did get the 25% bonus.

I'm very surprised that the card is that slow when working with a cpu clocked to 2.7Ghz in Windows.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 11
Credit: 140,976,809
RAC: 355
Level
Cys
Scientific publications
wat
Message 58861 - Posted: 25 May 2022 | 16:55:51 UTC - in response to Message 58860.




I'm very surprised that that card is so slow or that the card is that slow when working with a cpu clocked to 2.7Ghz in Windows.


That is what I am confused about. I can tell you that these calculations of time seem accurate: it actually ran for somewhere around 24 hours. Also, the CPU was running closer to 3.1Ghz (boost). The task barely pushed the GPU when running, and nothing changed when I reserved 32 cores for these tasks. I really can't nail down the issue.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58862 - Posted: 25 May 2022 | 17:09:27 UTC

As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase." (message 58590). Any instantaneous observation won't reveal the full situation: either CPU will be high and GPU low, or vice versa.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 11
Credit: 140,976,809
RAC: 355
Level
Cys
Scientific publications
wat
Message 58863 - Posted: 25 May 2022 | 19:35:37 UTC - in response to Message 58862.

Yep, I observe the alternation. When I suspend all other work units, I can see that just one of these tasks will use a little more than half of the logical processors. I know it has been discussed that although it says it uses 1 processor (0.996, to be exact), it actually uses more. I am running E@H work units and I think that running both is choking the CPU. Is there a way to limit the processor count these python tasks use? In the past, I changed the app_config to use 32, but it did not seem to speed anything up, even though the cores were reserved for the work unit.

I am not sure there is a way to post images, but here are some links to show CPU and GPU usage when only running one python task. Is it supposed to use that much of the CPU?

https://i.postimg.cc/Kv8zcMGQ/CPU-Usage1.jpg
https://i.postimg.cc/LX4dkj0b/GPU-Usage-1.jpg
https://i.postimg.cc/tRM0PZdB/GPU-Usage-2.jpg

I am sorry for all of the questions.... just trying my best to understand.


Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58864 - Posted: 25 May 2022 | 19:37:27 UTC - in response to Message 58862.

As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load is lower, CPU usage should increase."

This can be seen very clearly in the following two images.

Higher CPU - Lower GPU usage cycle:


Higher GPU - Lower CPU usage cycle:


CPU and GPU usage graphs follow an anticyclical pattern.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58865 - Posted: 26 May 2022 | 1:17:49 UTC - in response to Message 58863.

Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit.

I am sorry for all of the questions.... just trying my best to understand.


No, there isn't, not as the user. These are not real MT tasks, nor any form that BOINC recognizes and provides configuration options for.

Your only solution is to run only one at a time via a max_concurrent statement in an app_config.xml file, and then also restrict the number of cores allowed to be used by your other projects.
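For reference, the max_concurrent approach looks like this. A minimal sketch of an app_config.xml (assuming the app name is PythonGPU, as the scheduler reports it); it goes in the projects/www.gpugrid.net directory, and you then re-read config files from the Manager:

```xml
<!-- Limit GPUGrid to one Python task at a time -->
<app_config>
  <app>
    <name>PythonGPU</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```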

That said, I don't know why you are having such difficulties. Maybe chalk it up to Windows, I don't know.

I run 3 other CPU projects at the same time as I run the GPUGrid Python on GPU tasks, with 28-46 CPU cores being occupied by Universe, TN-Grid or yoyo depending on the host. Every host primarily runs Universe as the major CPU project.

No impact on the python tasks while running the other cpu apps.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,243,465
RAC: 6,100
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58866 - Posted: 26 May 2022 | 12:51:52 UTC - in response to Message 58865.

No impact on the python tasks while running the other cpu apps.

Conversely, I notice a performance loss on other CPU tasks when python tasks are in execution.
I processed yesterday python task e7a30-ABOU_rnd_ppod_demo_sharing_large-0-1-RND2847_2 at my host #186626
It was received at 11:33 UTC, and result was returned on 22:50 UTC
At the same period, PrimeGrid PPS-MEGA CPU tasks were also being processed.
The mean processing time for eighteen (18) PPS-MEGA CPU tasks was 3,098.81 seconds.
The mean processing time for 18 other PPS-MEGA CPU tasks processed outside that period was 2,699.11 seconds.
This represents an extra processing time of about 400 seconds per task, or about a 12.9% performance loss.
There is not such a noticeable difference when running Gpugrid ACEMD tasks.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58867 - Posted: 26 May 2022 | 17:59:57 UTC

I also notice an impact on my running Universe tasks. Generally adds 300 seconds to the normal computation times when running in conjunction with a python task.

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 1,282,338,256
RAC: 99,065
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58871 - Posted: 28 May 2022 | 2:03:28 UTC

Windows 10 machine running task 32899765. Had a power outage. When the power came back on, the task was restarted but just sat there doing nothing. The stderr.txt file showed the following error:

file pythongpu_windows_x86_64__cuda102.tar
already exists. Overwrite with
pythongpu_windows_x86_64__cuda102.tar?
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?


Task was stalled waiting on a response.

BOINC was stopped and the pythongpu_windows_x86_64__cuda102.tar file was removed from the slots folder.

Computer was restarted then the task was restarted. Then the following error message appeared several times in the stderr.txt file.

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!


Page file size was increased to 64,000 MB and the computer was rebooted.

Started the task again and still got the error message about the page file being too small. Then the task abended.

If you need more info about this task, please let me know.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58876 - Posted: 28 May 2022 | 16:41:56 UTC - in response to Message 58871.

Thank you captainjack for the info.


1.

Interesting that the job gets stuck with:

(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?


The job command line is the following:

7za.exe pythongpu_windows_x86_64__cuda102.tar -y


and I got from the application documentation (https://info.nrao.edu/computing/guide/file-access-and-archiving/7zip/7z-7za-command-line-guide):

7-Zip will prompt the user before overwriting existing files unless the user specifies the -y


So essentially -y assumes "Yes" on all queries. Honestly I am confused by this behaviour; thanks for pointing it out. Maybe I am missing the x (extract) command, as in

7za.exe x pythongpu_windows_x86_64__cuda102.tar -y


I will test it on the beta app.




2.

Regarding the other error

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!


is related to pytorch and nvidia and it only affects some windows machines. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback

TL;DR: Windows and Linux treat multiprocessing in Python differently, and on Windows each process commits much more memory, especially when using pytorch.

We use the script suggested in the link to mitigate the problem, but it could be that for some machines memory is still insufficient. Does that make sense in your case?
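The platform difference behind this can be seen directly from Python. A tiny illustration (not the actual task code) of why each Windows worker commits so much more memory:

```python
import multiprocessing as mp

# Linux defaults to the "fork" start method: workers inherit the parent's
# memory pages copy-on-write, so libraries loaded once stay loaded once.
# Windows only supports "spawn": every worker is a fresh interpreter that
# re-imports everything, so large DLLs (like torch's CUDA libraries) are
# loaded and committed again in each process.
print(mp.get_start_method())  # "fork" on Linux, "spawn" on Windows
```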


____________

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 1,282,338,256
RAC: 99,065
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58878 - Posted: 29 May 2022 | 14:19:12 UTC

Thank you abouh for responding,

I looked through my saved messages from the task to see if there was anything else I could find that might be of value and couldn't find anything.

In regard to the "out of memory" error, I tried to read through the stackoverflow link about the memory error. It is way above my level of technical expertise at this point, but it seemed like the amount of nvidia memory might have something to do with it. I am using an antique GTX970 card. It's old but still works.

Good luck coming up with a solution. If you want me to do any more testing, please let me know.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58879 - Posted: 31 May 2022 | 8:20:29 UTC
Last modified: 31 May 2022 | 8:21:09 UTC

Seems like there are some possible workarounds:

https://github.com/Spandan-Madan/Pytorch_fine_tuning_Tutorial/issues/10

basically, two users mentioned


I think I managed to solve it (so far). Steps were:

1)- Windows + pause key
2)- Advanced system settings
3)- Advanced tab
4)- Performance - Settings button
5)- Advanced tab - Change button
6)- Uncheck the "Automatically... BLA BLA" checkbox
7)- Select the System managed size option box.


and

If it's of any value, I ended up setting the values into manual and some ridiculous amount of 360GB as the minimum and 512GB for the maximum. I also added an extra SSD and allocated all of it to Virtual memory. This solved the problem and now I can run up to 128 processes using pytorch and CUDA.
I did find out that every launch of Python and pytorch, loads some ridiculous amount of memory to the RAM and then when not used often goes into the virtual memory.


Maybe it can be helpful for someone
____________

bibi
Send message
Joined: 4 May 17
Posts: 6
Credit: 2,714,683,618
RAC: 551,798
Level
Phe
Scientific publications
watwatwatwatwat
Message 58880 - Posted: 1 Jun 2022 | 19:13:37 UTC - in response to Message 58876.

Hi abouh,

is there a commandline like
7za.exe pythongpu_windows_x86_64__cuda102.tar.gz
without -y to get pythongpu_windows_x86_64__cuda102.tar ?

WR-HW95
Send message
Joined: 16 Dec 08
Posts: 7
Credit: 1,270,152,118
RAC: 284,612
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58881 - Posted: 1 Jun 2022 | 20:23:47 UTC

So what's going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431
RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125

All kinds of errors on other tasks, from "card too old" (1080 Ti) to out of RAM.
Atm the commit charge is 70 GB and RAM usage is 22 GB of 64 GB.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58883 - Posted: 6 Jun 2022 | 9:49:06 UTC - in response to Message 58880.

The command line

7za.exe pythongpu_windows_x86_64__cuda102.tar.gz


works fine if the job is executed without interruptions.

However, in case the job is interrupted and restarted later, the command is executed again. Then, 7za needs to know whether or not to replace the already existing files with the new ones.

The flag -y is just to make sure the script does not get stuck in that command prompt waiting for an answer.

____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58884 - Posted: 6 Jun 2022 | 10:29:20 UTC - in response to Message 58881.

Unfortunately, recent versions of PyTorch do not support all GPUs; older ones might not be compatible...

Regarding this error

RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125


does it happen recurrently on the same machine, or does it depend on the job?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58886 - Posted: 6 Jun 2022 | 20:18:53 UTC - in response to Message 58881.

So whats going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431

All kinds of errors on other tasks from too old card (1080ti) to out of ram.
Atm. commit charge is 70Gb and ram usage is 22Gb of 64Gb.


The problem is not with the card but with the Windows environment.

I have no issues running the Python on GPU tasks in Linux on my 1080 Ti card.

https://www.gpugrid.net/results.php?hostid=456812

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58906 - Posted: 11 Jun 2022 | 19:18:07 UTC

Well so far, these new python WU's have been consistently completing and even surviving multiple reboots, OS kernel upgrades, and OS upgrades:

Kernels --> 5.17.13
OS Fedora35 --> Fedora36

3 machines w/GTX-1060 510.73.05

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58907 - Posted: 11 Jun 2022 | 20:31:43 UTC

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.

Very nice compared to the acemd3/4 tasks, which will error out under similar circumstances.

The Python tasks create and reread checkpoints very well. Upon restart the task will show 1% completion, but after a while it jumps forward to the point where the task was stopped, exited or suspended, and continues on to the finish.
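That restart behaviour is what the standard atomic-checkpoint pattern gives you. A generic sketch of the idea (file name and state layout are hypothetical, not the actual GPUGrid task code):

```python
import os
import pickle

CKPT = "checkpoint.pkl"  # hypothetical file in the task's slot directory

def save_checkpoint(state):
    # Write to a temp file, then rename: os.replace is atomic, so an
    # interruption mid-save can never leave a corrupt checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    # On restart, resume from the last saved step; this is why progress
    # briefly shows 1% and then jumps forward to where the task stopped.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

state = load_checkpoint()
```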

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58915 - Posted: 12 Jun 2022 | 15:36:57 UTC

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.


Good to know as I did not try a driver update or using a different GPU on a WU in progress.

I do think BOINC needs to fix its estimated time to completion. "XXX days remaining" makes it impossible to keep any in a cache.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58919 - Posted: 12 Jun 2022 | 18:37:36 UTC

I haven't had any reason to carry a cache. I have my cache level set at only one task for each host as I don't want GPUGrid to monopolize my hosts and compete with my other projects.

That said, I haven't gone 12 hours without a Python task on every host at all times.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58920 - Posted: 12 Jun 2022 | 18:40:52 UTC - in response to Message 58915.

Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.


Good to know as I did not try a driver update or using a different GPU on a WU in progress.

I do think BOINC needs to patch their estimated time to completion. XXXdays remaining makes it impossible to have any in a cache.


BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the CPU and GPU makes them impossible for BOINC to decipher.

The closest mechanism is the MT or multi-thread category, but that only knows about CPU tasks which run solely on the CPU.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58936 - Posted: 17 Jun 2022 | 16:55:57 UTC

BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the CPU and GPU makes them impossible for BOINC to decipher.

The closest mechanism is the MT or multi-thread category, but that only knows about CPU tasks which run solely on the CPU.


I think BOINC uses the CPU exclusively in its Estimated Time to Completion algorithm for all WU's, including those using a GPU, which makes sense since the job cannot complete until both processors' work is complete. Observing GPU work with E@H, it appears that the GPU finishes first and the CPU continues for a period of time to do what is necessary to wrap the job up for return, and those BOINC ETCs are fairly accurate.

It is the multi-thread WU's mentioned that appear to be throwing a monkey wrench into the ETC, like these python jobs. From my observations, the python WU's use 32 processes regardless of the actual system configuration. I have 2 Ryzen 16-core machines and my old FX-8350 8-core, and they each run 32 processes per WU. It seems to me that the existing algorithm could be used in a modular fashion: assume a single-thread CPU job for the MT WU, calculate the estimated time, and then, knowing the number of processes the WU is requesting compared with those available on the system, perform a simple division to produce a more accurate result for MT WU's as well. Don't know for sure, just speculating, but I do have the BOINC source code and might take a look and see if I can find the ETC stuff. Might be interesting.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58937 - Posted: 17 Jun 2022 | 17:57:47 UTC - in response to Message 58936.

The server code for determining the ETC for MT tasks also has to account for task scheduling.

If it were adjusted as you suggest, anytime a Python task ran on the host, the server would proclaim it severely overcommitted and prevent any other work from running; or worse, it would actually prevent the Python task itself from running, since it blocks work from other projects in accordance with the resource-share and round-robin scheduling algorithms in the server and client.

It is a mess already with MT work; I believe it would be even worse accounting for these mixed-platform CPU-GPU Python tasks.

But go ahead and look at the code. Also you should raise an issue on BOINC's Github repository so that the problem is logged and can be tracked for progress.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58943 - Posted: 18 Jun 2022 | 18:52:40 UTC

You make a good point regarding the server-side issues. Perhaps the projects themselves, if not already doing so, could submit desired resources to allow the server to compare them with those available on clients, similar to submitting in-house cluster jobs. I also agree that it is probably best to go through BOINC's GitHub and raise a request for a potential fix, but I also want to see their ETC algorithms just out of curiosity, both server and client. Nice, interesting discussion.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 58944 - Posted: 18 Jun 2022 | 20:04:43 UTC

You need to review the code in the /client/work_fetch.cpp module and any of the old closed issues pertaining to use of max_concurrent statements in app_config.xml.

I have posted many conversations on this issue and collaborated with David Anderson and Richard Haselgrove to understand it, and have seen at least six attempts to fix the issue once and for all.

A very complicated part of the code. You might also want to review many of the client emulator bug-fix runs done on this topic.

https://boinc.berkeley.edu/sim_web.php

The meat of the issue was in PR's #2918 #3001 #3065 #3076 #4117 and #4592

https://github.com/BOINC/boinc/pull/2918

Focus on the round-robin scheduling part of the code.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 87
Credit: 1,316,440,897
RAC: 814,068
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 58949 - Posted: 20 Jun 2022 | 1:14:28 UTC

Thank you Keith, much appreciated background and starting points.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58961 - Posted: 26 Jun 2022 | 6:44:33 UTC

need advice with regard to running Python on one of my Windows machines:

On one of the Windows systems, with a GTX 980 Ti, CPU Intel i7-4930K and 32 GB RAM, Python runs well.
GPU memory usage is almost constant at 2,679 MB; system memory usage varies between ~1,300 MB and ~5,000 MB. Task runtime is between ~510,000 and ~530,000 secs.

The other Windows system has two RTX 3070s, CPU Intel i9-10900KF, and 64 GB RAM, of which 32 GB are used for a Ramdisk, leaving 32 GB of system RAM.
When trying to download Python tasks, BOINC event log says that some 22GB more RAM are needed.
How come?
From what I see on the other machine, Python uses between 1.3 GB and 5 GB of RAM.

What can I do in order to get the machine with the two RTX 3070s to download and crunch Python tasks?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58962 - Posted: 26 Jun 2022 | 7:09:04 UTC - in response to Message 58961.

BOINC event log says that some 22GB more RAM are needed.

Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58963 - Posted: 26 Jun 2022 | 7:30:14 UTC - in response to Message 58962.

BOINC event log says that some 22GB more RAM are needed.

Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it.

here is the text of the log message:

26.06.2022 09:20:35 | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
26.06.2022 09:20:37 | GPUGRID | Scheduler request completed: got 0 new tasks
26.06.2022 09:20:37 | GPUGRID | No tasks sent
26.06.2022 09:20:37 | GPUGRID | No tasks are available for ACEMD 3: molecular dynamics simulations for GPUs
26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.
26.06.2022 09:20:37 | GPUGRID | Project requested delay of 31 seconds


The reason why at this point it says I have 10,982 MB available is that I currently have some LHC projects running which use some RAM.
However, it also says it needs 33,378 MB; so my 32 GB RAM are not enough anyway (unlike the other machine, on which I also have 32 GB RAM, and there is no problem with downloading and crunching Python).

What I am surprised about is that the project requests so much free RAM, although while in operation it uses only between 1.3 and 5 GB.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58964 - Posted: 26 Jun 2022 | 8:06:41 UTC - in response to Message 58963.

26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.

Disk, not RAM. Probably one or other of your disk settings is blocking it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 58965 - Posted: 26 Jun 2022 | 8:21:42 UTC - in response to Message 58964.

26.06.2022 09:20:37 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.

Disk, not RAM. Probably one or other of your disk settings is blocking it.

Oh sorry, you are perfectly right. My mistake, how dumb :-(

So, with my 32 GB Ramdisk it does not work, when it says that it needs 33,378 MB.

What I could do, theoretically, is to shift BOINC from the Ramdisk to the 1 GB SSD. However, the reason why I installed BOINC on the Ramdisk was that the LHC Atlas tasks which I am crunching permanently have an enormous disk usage, and I don't want ATLAS to kill the SSD too early.

I guess that there might be ways to install a second instance of BOINC on the SSD - I tried this on another PC years ago, but somehow I did not get it done properly :-(

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58966 - Posted: 26 Jun 2022 | 9:32:13 UTC - in response to Message 58965.

You'll need to decide which copy of BOINC is going to be your 'primary' installation (default settings, autorun stuff in the registry, etc.), and which is going to be the 'secondary'.

The primary one can be exactly what is set up by the installer, with one change. The easiest way is to add the line

<allow_multiple_clients>1</allow_multiple_clients>

to the options section of cc_config.xml (or set the value to 1 if the line is already present). That needs a client restart if BOINC's already running.

Then, these two batch files work for me. Adapt program and data locations as needed.

To run the client:
D:\BOINC\rh_boinc_test --allow_multiple_clients --allow_remote_gui_rpc --redirectio --detach_console --gui_rpc_port 31418 --dir D:\BOINCdata2\

To run a Manager to control the second client:
start D:\BOINC\boincmgr.exe /m /n 127.0.0.1 /g 31418 /p password

Note that I've set this up to run test clients alongside my main working installation - you can probably ignore that bit.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58968 - Posted: 30 Jun 2022 | 15:56:37 UTC - in response to Message 58844.

We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real walltime calculation should be much less than reported.

It would be very helpful if you could let us know if that is the case. In particular, if you are getting 75000 credits per job, it means the jobs are getting 25% extra credits for returning fast.

Are you still in need of that? My first Python ran for 12 hours 55 minutes according to BoincTasks, but the website reported 156,269.60 seconds (over 43 hours). It got 75,000 credits.
http://www.gpugrid.net/results.php?hostid=593715

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58969 - Posted: 1 Jul 2022 | 13:10:32 UTC - in response to Message 58968.
Last modified: 1 Jul 2022 | 13:11:38 UTC

Thanks for the feedback Jim1348! It is useful for us to confirm that jobs run in a reasonable time despite the wrong estimation issue. Maybe that can be solved somehow in the future. At least it seems it did not estimate dozens of days, as I have seen on other occasions.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58970 - Posted: 1 Jul 2022 | 13:33:55 UTC - in response to Message 58969.

It's because the app is using the CPU time instead of the runtime. Since it uses so many threads, it adds up the time spent on all the threads: 2 threads each working for 1 hr would be reported as 2 hrs of CPU time. You need to track wall-clock time. The app seems to have this capability, since it reports timestamps of start and stop in the stderr.txt file.
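The difference is easy to demonstrate. A generic illustration (not the GPUGrid app itself) of summed CPU time versus wall-clock time:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def burn(seconds):
    # Spin until this worker process has consumed `seconds` of CPU time.
    end = time.process_time() + seconds
    while time.process_time() < end:
        pass
    return seconds

def main():
    start = time.monotonic()
    with ProcessPoolExecutor(max_workers=2) as pool:
        cpu_total = sum(pool.map(burn, [1.0, 1.0]))  # 2 s of CPU time in total
    wall = time.monotonic() - start  # roughly 1 s of wall time on 2+ cores
    return cpu_total, wall

if __name__ == "__main__":
    cpu_total, wall = main()
    print(f"CPU time {cpu_total:.1f}s vs wall-clock {wall:.1f}s")
```

Reporting the summed CPU time as "runtime" roughly doubles the apparent duration here, and the inflation grows with the number of busy workers.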

Also, the credit reward is static, and should be a more dynamic scheme like the acemd3 tasks use. Look at Jim's tasks: there are tasks with 2,000 - 150,000 seconds (reported), all with the same 75,000 credit reward. A good reward for the 2,000 s runs, but painfully low for the longer ones (the majority).
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58971 - Posted: 1 Jul 2022 | 13:55:40 UTC - in response to Message 58970.

There are two separate problems with timing.

There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.

And there's the estimation of anticipated runtime when a task is first issued, before it's even started to run. I would have thought that would have started to correct itself by now: with the steady supply of work recently, we will have got well past all the trigger points for the server averaging algorithms.

Next time I see a task waiting to run, I'll trap the numbers and try to make sense of them.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 58972 - Posted: 1 Jul 2022 | 14:56:15 UTC - in response to Message 58971.



There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.



That may be true, NOW. However, if they move to a dynamic credit scheme (as they should) that awards credit based on flops and runtime (like ACEMD3 does), then the runtime will not be just cosmetic.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58973 - Posted: 1 Jul 2022 | 17:27:17 UTC - in response to Message 58971.

OK, I got one on host 508381. Initial estimate is 752d 05:26:18, task is 32940037

Size:
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>

Speed:
<flops>707646935000.048218</flops>

DCF:
<duration_correction_factor>45.991658</duration_correction_factor>

App_ver:
<app_name>PythonGPU</app_name>
<version_num>403</version_num>

Host details:
Number of tasks completed 80
Average processing rate 13025.358204684

Calculated time estimate (size / speed):
1413134.079355548 [seconds]
16.355718511 [days - raw]
752.226612105 [days - adjusted by DCF]

So my client is doing the calculations right.

The glaring difference is between flops and APR.

Re-doing the {size / speed} calculation with APR gives
76773.320494203 [seconds]
21.32592236 [hours]

which is a little high for this machine, but not bad. The last 'normal length' tasks ran in about 14 hours.
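The arithmetic can be reproduced in a few lines (assuming, as the server does, that APR is stored in GFLOPS):

```python
# Numbers taken from the task and host details quoted above.
rsc_fpops_est = 1e18           # <rsc_fpops_est>: task size in FP operations
flops = 707646935000.048218    # <flops>: speed from the <app_version> block
dcf = 45.991658                # <duration_correction_factor>
apr_gflops = 13025.358204684   # host's Average processing rate

raw = rsc_fpops_est / flops              # ~1,413,134 s, i.e. ~16.36 days raw
estimate_days = raw * dcf / 86400        # ~752.2 days: the client's estimate

# Re-doing {size / speed} with APR instead of flops:
apr_seconds = rsc_fpops_est / (apr_gflops * 1e9)
apr_hours = apr_seconds / 3600           # ~21.3 hours: close to reality
```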

So, the question is: why is the server tracking APR, but not using it in the <app_version> blocks sent to our machines?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58974 - Posted: 2 Jul 2022 | 9:04:06 UTC

Yesterday's task is just in the final stages - it'll finish after about 13 hours - and the next is ready to start. So here are the figures for the next in the cycle.

Initial estimate: 737d 06:19:25
<flops>707646935000.048218</flops>
<duration_correction_factor>45.076802</duration_correction_factor>
Average processing rate 13072.709605774

So, APR and DCF have both made a tiny movement in the right direction, but flops has remained stubbornly unchanged. And that's the one that controls the initial estimates.

(actually, a little short one crept in between the two I'm watching, so it's two cycles - but that doesn't change the principle)

roundup
Send message
Joined: 11 May 10
Posts: 34
Credit: 257,866,755
RAC: 39,356
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58975 - Posted: 3 Jul 2022 | 10:17:13 UTC

The credits per runtime for cuda1131 really look strange sometimes:

Task 27246643: 2 Jul 2022 | 8:13:32 UTC to 3 Jul 2022 | 8:20:56 UTC
Runtimes 445,161.60 / 445,161.60 - Credits 62,500.00

Compare to this one:
Task 27246622: 2 Jul 2022 | 7:55:03 UTC to 2 Jul 2022 | 8:05:39 UTC
Runtimes 2,770.92 / 2,770.92 - Credits 75,000.00

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58977 - Posted: 4 Jul 2022 | 13:54:08 UTC - in response to Message 58970.

Yes, you are right about that. There are 2 types of experiments I run now:

a) Normal experiments have tasks with a fixed target number of agent-environment interactions to process. The tasks finish once this number of interactions is reached. All tasks require the same amount of compute, so it makes sense (at least to me) to reward them with the same amount of credits, even if some tasks are completed in less time due to faster hardware.

b) I have recently introduced an "early stopping" mechanism in some experiments. The upper bound is the same as in the other type of experiments, a fixed amount of agent-environment interactions. However, if the agent discovers interesting results before that, the task returns early so this information can be shared with the other agents in the population of AI agents. Which agents will finish earlier, and how much earlier, is random, so it would be interesting to adjust the credits dynamically, yes. I will ask the acemd3 people how to do it.
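
Schematically, the early-stopping mechanism amounts to the loop below. This is an illustrative sketch, not the actual job script: the step budget, check interval and discovery test are all placeholders.

```python
import random

def run_task(max_steps=35_000_000, check_every=1_000_000):
    """Illustrative sketch of a type (b) task: train until either the
    interaction budget is exhausted or something 'interesting' is
    discovered, whichever comes first. The random draw below is a
    stand-in for the real novelty criterion."""
    steps = 0
    while steps < max_steps:
        steps += check_every              # agent-environment interactions
        if random.random() < 0.05:        # placeholder discovery test
            return steps, "early_exit"    # report back to the server early
    return steps, "budget_exhausted"      # upper bound reached
```
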
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58978 - Posted: 4 Jul 2022 | 16:51:10 UTC - in response to Message 58975.
Last modified: 5 Jul 2022 | 10:47:37 UTC

The credit system gives 50,000 credits per task. However, completion before a certain amount of time multiplies this value by 1.5, then by 1.25 for a while, and finally by 1.0 indefinitely. That explains why you sometimes see 75,000 and sometimes 62,500 credits.
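
A sketch of that tiered scheme in code; the time cutoffs are hypothetical parameters, since the post only states the multipliers:

```python
def credit_for(base_credit, hours_to_return, fast_cutoff, ok_cutoff):
    """Sketch of the bonus scheme described above: 1.5x for fast
    returns, 1.25x for moderately fast ones, 1.0x otherwise. The
    cutoff values are parameters because the post does not state
    them."""
    if hours_to_return <= fast_cutoff:
        return base_credit * 1.5
    if hours_to_return <= ok_cutoff:
        return base_credit * 1.25
    return base_credit

# With the 50,000-credit base, the two awards seen in this thread
# (cutoff values here are made up for illustration):
print(credit_for(50_000, 10, 24, 48))   # 75000.0
print(credit_for(50_000, 30, 24, 48))   # 62500.0
```
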
____________

Toby Broom
Send message
Joined: 11 Dec 08
Posts: 25
Credit: 197,737,443
RAC: 210
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58979 - Posted: 6 Jul 2022 | 22:59:17 UTC

I had an idea after reading some of the posts about utilisation of resources.

For the power users here, we tend to have high-end hardware on this project, so would it be possible to support our hardware fully? E.g. I imagine that if you have 10-24 GB of VRAM, the whole simulation could be loaded into VRAM, giving additional performance to the project.

Additionally, the more modern cards have more ML-focused hardware acceleration features, so are they well utilised?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58980 - Posted: 7 Jul 2022 | 11:10:44 UTC - in response to Message 58979.
Last modified: 7 Jul 2022 | 11:11:36 UTC

The reason Reinforcement Learning agents do not currently use the full potential of the cards is that the interactions between the AI agent and the simulated environment are performed on the CPU, while the agent's "learning" process is the part that uses the GPU, intermittently.

There are, however, environments that run entirely on the GPU. They are becoming more and more common, so I see it as a real possibility that in the future the most popular benchmarks in the field will use only the GPU. Then the jobs will be much more efficient, since pretty much only the GPU will be used. Unfortunately we are not there yet...
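
The split can be sketched as a generic actor-learner loop (pure-Python stand-ins; the real app uses an actual environment simulator and a neural-network library):

```python
def train(env_step, learner_update, total_steps, batch_size):
    """Illustrative actor-learner loop: env_step runs on the CPU for
    every single interaction, while learner_update would touch the
    GPU only once per collected batch, hence the intermittent GPU
    load described above."""
    buffer, done = [], 0
    while done < total_steps:
        buffer.append(env_step())     # CPU-bound: simulate the environment
        done += 1
        if len(buffer) == batch_size:
            learner_update(buffer)    # GPU-bound: one gradient update
            buffer.clear()
    return done

# Toy usage: count how often the "GPU" part is actually invoked.
gpu_calls = 0
def fake_update(batch):
    global gpu_calls
    gpu_calls += 1

train(lambda: 0, fake_update, total_steps=1000, batch_size=100)
print(gpu_calls)  # 10
```
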

I am not sure if I am answering your question, please let me know if I am not.
____________

Toby Broom
Send message
Joined: 11 Dec 08
Posts: 25
Credit: 197,737,443
RAC: 210
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58981 - Posted: 7 Jul 2022 | 19:40:48 UTC - in response to Message 58980.

Thanks for the comments. What about using a large quantity of VRAM if available? The latest BOINC finally allows correct reporting of VRAM on NVIDIA cards, so you could tailor the WUs based on VRAM to protect the contributions from users with lower-specification computers.

FritzB
Send message
Joined: 7 Apr 15
Posts: 4
Credit: 50,436,830
RAC: 13,308
Level
Thr
Scientific publications
wat
Message 58995 - Posted: 10 Jul 2022 | 8:22:33 UTC

Sorry for OT, but some people need admin help and I've seen one being active here :)

Password reset doesn't work, and there seems to have been an alternative method some years ago. Maybe this can be done again?

Please have a look in this thread: http://www.gpugrid.net/forum_thread.php?id=2587&nowrap=true#58958

Thanks!
Fritz

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59002 - Posted: 12 Jul 2022 | 7:38:42 UTC - in response to Message 58995.

Hi Fritz! Apparently the problem is that sending emails from server no longer works. I will mention the problem to the server admin.


____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59003 - Posted: 15 Jul 2022 | 9:26:40 UTC - in response to Message 58995.

I talked to the server admin and he explained to me the problem in more detail.

The issue comes from the fact that the GPUGrid server uses a public IP from the Universitat Pompeu Fabra, so we have to comply with the data protection and security policies of the university. Among other things this implies that we can not send emails from our web server.

Therefore, unfortunately that prevents us from fixing the password recovery problem.




____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59004 - Posted: 15 Jul 2022 | 9:46:45 UTC - in response to Message 58981.

Hello Toby,


For the python app, do you mean executing a script that automatically detects how much memory the GPU to which the task has been assigned has, and then flexibly defining an agent that uses all (or most) of it? In other words, flexibly adapting to the host machine's capacity.

The experiments we are running at the moment require training AI agents in a sequence of jobs (i.e. starting to train an agent in a GPUGrid job, then sending it back to the server to evaluate its capabilities, then sending another job that loads the same agent and continues its training, evaluating again, etc.).

Consequently, current jobs are designed to work with a fixed amount of GPU memory, and we cannot set it too high, since we want a high percentage of hosts to be able to run them.

However, it is true that by doing so we are sacrificing resources on GPUs with larger amounts of memory. You gave me something to think about: there could be situations in which it could make sense to use this approach, and it would indeed be a more efficient use of resources.
____________

Toby Broom
Send message
Joined: 11 Dec 08
Posts: 25
Credit: 197,737,443
RAC: 210
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59006 - Posted: 15 Jul 2022 | 16:46:27 UTC - in response to Message 59004.

BOINC can detect the quantity of GPU memory. It was bugged in older BOINC versions for NVIDIA cards, but in 7.20 it's fixed, so there would be no need to detect it in Python as it's already in the project database.

A variable job size, yes.

It's more work for you, but I can imagine there could be a performance boost? To keep it simple you could have S, M and L sizes with, say, <4, 4-8 and >8 GB? The jobs for GPUs with more than 8 GB could be larger in general, as only the top-tier GPUs have this much VRAM.

It seems BOINC knows how to allocate to suitable computers. Worst case, you could make it opt-in.
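
That bucketing could be as simple as the sketch below (purely illustrative; the thresholds are the ones suggested above, and the project does not currently do this):

```python
def job_size_for_vram(vram_gb):
    """Map reported GPU memory to a job-size tier, per the
    <4 / 4-8 / >8 GB split suggested above. Hypothetical sketch,
    not project behaviour."""
    if vram_gb < 4:
        return "S"
    if vram_gb <= 8:
        return "M"
    return "L"

print(job_size_for_vram(2))   # "S" - e.g. an MX150, too small for current tasks
print(job_size_for_vram(6))   # "M" - e.g. a GTX 1660 Ti
print(job_size_for_vram(24))  # "L" - e.g. a top-tier card
```
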

Profile JohnMD
Avatar
Send message
Joined: 4 Dec 10
Posts: 4
Credit: 3,672,356
RAC: 0
Level
Ala
Scientific publications
watwatwat
Message 59007 - Posted: 15 Jul 2022 | 20:20:42 UTC - in response to Message 58981.

Even video cards with 6GiB crash with insufficient VRAM.
The app is apparently not aware of available resources.
This ought to be the first priority before sending tasks to the world.

jjch
Send message
Joined: 10 Nov 13
Posts: 88
Credit: 14,970,000,871
RAC: 918,848
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59008 - Posted: 15 Jul 2022 | 20:47:00 UTC - in response to Message 59007.

From what we are finding right now, the 6GB GPUs would have sufficient VRAM to run the current Python tasks. Refer to this thread, noting between 2.5 and 3.2 GB being used: https://www.gpugrid.net/forum_thread.php?id=5327

If jobs running on GPUs with 4GB or more are crashing, then there is a different problem. Have to look at the logs to see what's going on.

It's more likely they are running out of system memory or swap space but there are a few that are failing from an unknown cause.

I took a quick look at the jobs you have which errored and I found the mx150 and mx350 GPUs only have 2GB VRAM. These are not sufficient to run the Python app.

Unfortunately, I would suggest you use these GPUs for another project that they are better suited to.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59039 - Posted: 28 Jul 2022 | 9:18:17 UTC
Last modified: 28 Jul 2022 | 9:29:40 UTC

New generic error on multiple tasks this morning:

TypeError: create_factory() got an unexpected keyword argument 'recurrent_nets'

Seems to affect the entire batch currently being generated.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59040 - Posted: 28 Jul 2022 | 9:41:25 UTC - in response to Message 59039.
Last modified: 28 Jul 2022 | 9:42:38 UTC

Thanks for letting us know Richard. It is a minor error, sorry for the inconvenience, I am fixing it right now. Unfortunately the remaining jobs of the batch will crash but then will be replaced with correct ones.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59042 - Posted: 28 Jul 2022 | 10:45:43 UTC

No worries - these things happen. The machine which alerted me to the problem now has a task 'created 28 Jul 2022 | 10:33:04 UTC' which seems to be running normally.

The earlier tasks will hang around until each of them has gone through 8 separate hosts, before your server will accept that there may have been a bug. But at least they don't waste much time.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59043 - Posted: 28 Jul 2022 | 13:38:06 UTC - in response to Message 59042.

Yes, exactly, it has to fail 8 times... The only good part is that the bugged tasks fail at the beginning of the script, so almost no computation is wasted. I have checked, and some of the tasks in the newest batch have already finished successfully.
____________

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 502
Credit: 586,513,433
RAC: 86,177
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59071 - Posted: 6 Aug 2022 | 19:47:50 UTC

A peculiarity of Python apps for GPU hosts 4.03 (cuda1131):

If BOINC is shut down while such a task is in progress, then restarted, the task will show 2% progress at first, even if it was well past this before the shutdown.

However, the progress may then jump past 98% the next time a checkpoint is written, which suggests the hidden progress is recovered.

Not a definite problem, but you should be aware of it.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59076 - Posted: 7 Aug 2022 | 14:08:51 UTC

I've been monitoring and playing with the initial runtime estimates for these tasks.



The Y-axis has been scaled by various factors of 10 to make the changes legible.

The initial estimates (750 days to 230 days) are clearly dominated by the DCF (real numbers, unscaled).

The <flops> - the speed of processing, 707 or 704 GigaFlops, assumed by the server. There's a tiny jump midway through the month, which correlates with a machine software update, including a new version of BOINC, and reboot. That will have triggered a CPU benchmark run.

The DCF (client controlled) has been falling very, very, slowly. It's so far distant from reality that BOINC moves it at an ultra-cautious 1% of the difference at the conclusion of each successful run. The changes in slope come about because of the varying mixture of short-running (early exit) tasks and full-length tasks.
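
A sketch of that adjustment, using this host's starting DCF; the 1% step is as described above, and the loop assumes (as a simplification) that every task reports the same true runtime ratio:

```python
def converge_dcf(dcf, true_ratio, tasks):
    """Sketch of the client's cautious downward DCF adjustment
    described above: after each successful task, move 1% of the way
    towards the observed runtime ratio."""
    for _ in range(tasks):
        dcf += 0.01 * (true_ratio - dcf)
    return dcf

# Starting from this host's DCF of 45.08, with a true ratio near 1.0,
# even 100 completed tasks leave the estimate far from reality:
print(converge_dcf(45.08, 1.0, 100))  # ~17.1
```
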

The APR has been wobbling about, again because of the varying mixture of tasks, but seems to be tracking the real world reasonably well. The values range from 13,000 to nearly 17,000 GigaFlops.

Conclusion:

The server seems to be estimating the speed of the client using some derivative of the reported benchmark for the machine. That's absurd for a GPU-based project: the variation in GPU speeds is far greater than the variation of CPU speeds. It would be far better to use the APR, but with some safeguards and greater regard to the actual numbers involved.

The chart was derived from host 508381, which has a measured CPU speed of 7.256 GigaFlops (roughly one-tenth of the speed assumed by the server), and all tasks were run on the same GTX 1660 Ti GPU, with a theoretical ('peak') speed of 5,530 GigaFlops. Congratulations to the GPUGrid programmers - you've exceeded three times the speed of light (according to APR)!

More seriously, that suggests that the 'size' setting for these tasks (fpops_est) - the only value that the project actually has to supply manually - is set too low. This may have been the point at which the estimates started to go wrong.

One further wrinkle: BOINC servers can't fully allow for varying runtimes and early task exits. Old hands will remember the problems we had with 'dash-9' (overflow) tasks at SETI@home. We overcame that one by adding an 'outlier' pathway to the server code: if the project validator marks the task as an outlier, its runtime is disregarded when tracking APR - that keeps things a lot more stable. Details at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59077 - Posted: 7 Aug 2022 | 16:05:07 UTC - in response to Message 59076.

Or just use the flops reported by BOINC for the GPU, since that is recorded and communicated to the project. And from my experience (with ACEMD tasks) it does get used in the credit award for the non-static scheme, so the project is certainly getting the value and able to use it.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59078 - Posted: 7 Aug 2022 | 17:08:51 UTC - in response to Message 59077.

Except:

1) A machine with two distinct GPUs only reports the peak flops of one of them. (The 'better' card, which is usually - but not always - the faster card).
2) Just as a GPU doesn't run at 10x the speed of the host CPU, it doesn't run realistic work at peak speed, either. That would involve yet another semi-realistic fiddle factor. And Ian will no doubt tell me that fancy modern cards, like Turing and Ampere, run closer to peak speed than earlier generations.

We need to avoid having too many moving parts - too many things to get wrong when the next rotation of researchers takes over.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59099 - Posted: 11 Aug 2022 | 22:40:09 UTC - in response to Message 59078.

personally I'm a big fan of just standardizing the task computational size and assigning static credit. no matter the device used or how long it takes. just take flops out of the equation completely. that way faster devices get more credit/RAC based on the rate in which valid tasks are returned.

the only caveat is the need to make all the tasks roughly the same "size" computationally. but that seems easier than all the hoops to jump through to accommodate all the idiosyncrasies of BOINC, various systems, and task differences.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59101 - Posted: 12 Aug 2022 | 1:45:16 UTC
Last modified: 12 Aug 2022 | 1:49:13 UTC

The latest Python tasks I've done today have awarded 105,000 credits as compared to all the previous tasks at 75,000 credits.

Looking back through the job_log, the estimated computation size has been at 1B GFLOPs for quite a while now.

Nothing has changed in the current task parameters as far as I can tell.

Estimated computation size
1,000,000,000 GFLOPs

So I assume that Abouh has decided to award more credits for the work done.

Anyone notice this new award level?

They are generally taking longer to crunch than the previous ones, so maybe it is just scaling.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59102 - Posted: 12 Aug 2022 | 1:50:33 UTC - in response to Message 59101.

Anyone notice this new award level?

I just got my first one.
http://www.gpugrid.net/workunit.php?wuid=27270757

But not all the new ones receive that. A subsequent one received the usual 75,000 credit.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59104 - Posted: 12 Aug 2022 | 3:31:18 UTC - in response to Message 59102.

Thanks for your report. It doesn't really track with scaling now that I examine my tasks.

Some are getting the new higher reward for 2 hours of computation but some are still getting the lower reward for 8 hours of computation.

I was getting what was the standard reward for tasks taking as little as 20 minutes of computation time. So the 75K was a little excessive in my opinion.

These new ones are trending at 2-3 hours of computation time. But I also had one take 11 hours and was still rewarded with only the 105K.

Maybe we are finally getting into the meat of the AI/ML investigation after all the initial training we have been doing.

Still sitting on 3 new acemd3 tasks that haven't been looked at for two days and will only get the standard reward, since the client scheduler feels no need to push them to the front: their APR and estimated completion times are correct and reasonable. I would really like the Python tasks to get realistic APRs and estimated completion times, but since they are predominantly a CPU task with a little bit of GPU computation, BOINC has no clue how to handle them.

Maybe Abouh can post some insight as to what the current investigation is doing.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59105 - Posted: 12 Aug 2022 | 6:30:00 UTC - in response to Message 59104.

My first 'high rate' task (105K credits) was a workunit created at 10 Aug 2022 | 2:03:51 UTC.

Since then, I've only received one 75K task: my copy was issued to me at 10 Aug 2022 | 21:15:47 UTC, but the underlying workunit was created at 9 Aug 2022 | 13:44:09 UTC - I got a resend after two previous failures by other crunchers.

My take is that the 'tariff' for GPUGrid tasks is set when the underlying workunit is created, and all subsequent tasks issued from that workunit inherit the same value.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59107 - Posted: 12 Aug 2022 | 15:28:04 UTC - in response to Message 59105.

That implies the current release candidates are being assigned 105K credit, based, I assume, on harder-to-crunch datasets.

I don't think it depends on a recent release date either. I just had a task created 12 August (_0) and it only awarded 75K after passing through one other host before I got it.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 343
Credit: 10,358,716,035
RAC: 855,775
Level
Trp
Scientific publications
watwatwat
Message 59109 - Posted: 13 Aug 2022 | 17:36:27 UTC
Last modified: 13 Aug 2022 | 17:40:02 UTC

Which apps are running these days? The apps page is missing the column that shows how many are running: https://www.gpugrid.net/apps.php
How many CPU threads do I need to run to finish Python WUs in a reasonable time for, say, an i9-9980XE?
Trying to update my app_config to give it a go. The last one I found was pretty old. Here's what I've cobbled together. Suggestions welcome.

<app_config>
<!-- i9-10980XE 18c36t 32 GB L3 Cache 24.75 MB -->
<app>
<name>acemd3</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>
<app>
<name>acemd4</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>
<app>
<name>PythonGPU</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>4.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<avg_ncpus>4</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 4</cmdline>
</app_version>
<fraction_done_exact/>
<max_concurrent>1</max_concurrent>
</app>
<app>
<name>PythonGPUbeta</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>4.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<avg_ncpus>4</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 4</cmdline>
</app_version>
<fraction_done_exact/>
<max_concurrent>1</max_concurrent>
</app>
<app>
<name>Python</name>
<plan_class>cuda1121</plan_class>
<cpu_usage>4</cpu_usage>
<gpu_versions>
<cpu_usage>4</cpu_usage>
<gpu_usage>1</gpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<avg_ncpus>4</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 4</cmdline>
</app_version>
<fraction_done_exact/>
<max_concurrent>1</max_concurrent>
</app>
</app_config>

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59110 - Posted: 13 Aug 2022 | 19:32:17 UTC

I get away with reserving only 3 CPU threads. That does not affect what the actual task does when it runs, just BOINC's CPU scheduling for other projects.

It will always spawn 32 independent python processes when running.

And you really should update or remove the plan class statements for Python on GPU since your plan_class is incorrect.

Current plan_class is cuda1131 NOT cuda1121

You can also clean up your app_config, as there is only a PythonGPU application; there is no Python or PythonGPUbeta application.
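
Putting those corrections together, a minimal app_config for the current Python app might look like the sketch below: only the PythonGPU application, the cuda1131 plan class, and the 3 reserved CPU threads mentioned above (the exact cpu_usage value is a matter of taste, not a requirement):

```xml
<app_config>
    <app>
        <name>PythonGPU</name>
        <max_concurrent>1</max_concurrent>
        <gpu_versions>
            <cpu_usage>3.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
        <fraction_done_exact/>
    </app>
    <app_version>
        <app_name>PythonGPU</app_name>
        <plan_class>cuda1131</plan_class>
        <avg_ncpus>3</avg_ncpus>
        <ngpus>1</ngpus>
    </app_version>
</app_config>
```
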

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 66
Credit: 906,439,522
RAC: 76,865
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59111 - Posted: 14 Aug 2022 | 22:20:08 UTC
Last modified: 14 Aug 2022 | 22:23:14 UTC

Hi, guys!
I have not particularly followed Python GPU app (for Windows) and this thread, so perhaps this issue has already been discussed somewhere on the forum.
It seems I only tried once, and all tasks I received crashed almost immediately after start.
I was surprised that at WU's starting, limit on Virtual memory(Commit Charge) in the system was reached.
Today I tried to understand the problem in more detail and was surprised again to find that application addresses ~ 42 GiB Virtual Memory in total!
At the same time, the total consumption of Physical Memory is about 4 times less (~ 10 GiB).
For example

So the question is - is that intended?..

I had to create a 30 GiB swap file to cover this difference so that I could run something else on the system besides one WU of Python GPU -_-

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59112 - Posted: 15 Aug 2022 | 2:29:50 UTC - in response to Message 59111.
Last modified: 15 Aug 2022 | 2:37:41 UTC

Hi, guys!
I have not particularly followed Python GPU app (for Windows) and this thread, so perhaps this issue has already been discussed somewhere on the forum.
It seems I only tried once, and all tasks I received crashed almost immediately after start.
I was surprised that at WU's starting, limit on Virtual memory(Commit Charge) in the system was reached.
Today I tried to understand the problem in more detail and was surprised again to find that application addresses ~ 42 GiB Virtual Memory in total!
At the same time, the total consumption of Physical Memory is about 4 times less (~ 10 GiB).
For example

So the question is - is that intended?..

I had to create a 30 GiB swap file to cover this difference so that I could run something else on the system besides one WU of Python GPU -_-


Yes, because of flaws in Windows memory management, that effect cannot be gotten around. You need to increase the size of your pagefile to the 50GB range to be safe.

Linux does not have the problem and no changes are necessary to run the tasks.
The project primarily develops Linux applications first as the development process is simpler. Then they tackle the difficulties of developing a Windows application with all the necessary workarounds.

Just the way it is. For the reason why read this post.
https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 66
Credit: 906,439,522
RAC: 76,865
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59113 - Posted: 15 Aug 2022 | 3:04:54 UTC - in response to Message 59112.
Last modified: 15 Aug 2022 | 3:21:22 UTC

Thank you for clarification.
I was not familiar with subtleties of the memory allocation mechanism in Windows.
That was useful.
And I have already increased the swap to the RAM value (64GB) to be sure ;)


Upd.
And the reward system for this app clearly begs for revision... : /

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59114 - Posted: 15 Aug 2022 | 3:27:39 UTC
Last modified: 15 Aug 2022 | 3:33:03 UTC

Task credits are fixed. Pay no attention to the running times. BOINC completely mishandles that since it has no recognition of the dual nature of these cpu-gpu application tasks.

They should be thought of as primarily a cpu application with a little gpu use thrown in occasionally.

[Edit] Look at the delta between sent time and returned time to determine the actual runtime that the task took.

In your example, the first listed task took only 20 minutes to finish, the second took 4 1/2 hours and the last took 4 hours. It all depends on the different parameter sets for each task, which are the criteria for the reinforcement learning on the GPU.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59115 - Posted: 15 Aug 2022 | 11:22:21 UTC

Can anyone tell me what happened to this task:
https://www.gpugrid.net/result.php?resultid=32997605

which failed after 301.281 seconds :-(((

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59116 - Posted: 15 Aug 2022 | 11:36:03 UTC - in response to Message 59115.

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.

It's possibly the Windows swap file settings, again.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59117 - Posted: 15 Aug 2022 | 14:51:53 UTC - in response to Message 59116.

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.

It's possibly the Windows swap file settings, again.

thanks Richard for the quick reply.
I have now changed the page file size to a maximum of 65 GB.
I did it on both drives: system drive C:/ and drive F:/ (on a separate SSD), on which BOINC is running.
It would probably have been okay to change it on only one drive, right? If so, which one?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59118 - Posted: 15 Aug 2022 | 15:45:00 UTC - in response to Message 59117.

The Windows one.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59119 - Posted: 15 Aug 2022 | 16:10:21 UTC

I am a bit surprised that I am able to run the pythons without problem under Ubuntu 20.04.4 on a GTX 1060. It has 3GB of video memory, and uses 2.8GB thus far. And the CPU is currently running two cores (down from the previous four cores), using about 3.7GB of memory, though reserving 19 GB.

Even on Win10, my GTX 1650 Super has had no problems, though it has 4GB of memory and uses 3.6GB. But I have 32GB system memory, and for once I let Windows manage the virtual memory itself. It is reserving 42GB. I usually set it to a lower value.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59120 - Posted: 15 Aug 2022 | 16:53:08 UTC - in response to Message 59118.

The Windows one.

thx :-)

Toby Broom
Send message
Joined: 11 Dec 08
Posts: 25
Credit: 197,737,443
RAC: 210
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59141 - Posted: 20 Aug 2022 | 15:08:53 UTC

Can the CPU usage be adjusted correctly? It's fine to use a number of cores, but currently it says less than one and uses more than one.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59143 - Posted: 22 Aug 2022 | 8:48:21 UTC - in response to Message 59107.
Last modified: 22 Aug 2022 | 10:44:01 UTC

Hello! sorry for the late reply

I adjusted the maximum length of some of the tasks and consequently also adjusted the credits for completing them. What I mean by that is that each one of my tasks contains an agent interacting with its environment and learning from a fixed number of total interaction steps. Previously I set that number to 25M steps. Now I increased it to 35M for some tasks and consequently also increased the reward.



This increase in the number of steps does not necessarily increase the completion time of the task, because if an agent discovers something relevant before reaching the maximum number of steps, the task ends and the “new information” is sent back to be shared with the other agents in the population. Whether that happens or not is random, but on average the task completion time will increase a bit due to the ones that reach 35M steps, so the reward has to increase as well. This change does not affect hardware requirements.

This randomness also explains why some tasks are shorter but still receive the same reward (credits per task are fixed). However, the average credit reward should be similar for all hosts as they solve more and more tasks. Also the average task completion time should remain stable.

As I have mentioned, I work with populations of AI agents that try to cooperatively solve a single complex problem. Note that as more things are discovered by agents in a population the harder it becomes to keep discovering new ones. In general, early tasks in an experiment return quite fast, while as the experiment progresses the 35M steps mark gets hit more and more often (and tasks take longer to complete).
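
For what it's worth, the new 105K award is exactly what a straight proportional scaling of the old one predicts (a back-of-envelope check, not a statement of the project's actual formula):

```python
# Back-of-envelope check: scaling the 75,000-credit award by the step
# increase (25M -> 35M) reproduces the new 105,000-credit award seen
# in this thread. The proportionality itself is an assumption.
old_credit, old_steps, new_steps = 75_000, 25_000_000, 35_000_000
new_credit = old_credit * new_steps // old_steps
print(new_credit)  # 105000
```
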
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59144 - Posted: 22 Aug 2022 | 10:16:58 UTC - in response to Message 59076.
Last modified: 22 Aug 2022 | 10:24:44 UTC

The current value of rsc_fpops_est is 1e18, with 10e18 as the limit. I remember we had to increase it because otherwise it produced false “task aborted by host” errors on some users' side. Do you think we should change it again?

Regarding cpu_usage, I remember having this discussion with Toni, and I think the reason we set the number of cores to that number is that the jobs can actually be executed with a single core, even if they create 32 threads; they definitely do not require 32 cores. Is there an advantage to setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? Sorry, it is a bit outside of my knowledge zone...
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59145 - Posted: 22 Aug 2022 | 10:56:47 UTC - in response to Message 59144.
Last modified: 22 Aug 2022 | 10:57:35 UTC

Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? sorry it is a bit outside of my knowledge zone...

This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is calculated by some obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of the number of flops that will be performed on the CPU and the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of the theoretical peak speed. In short, it's useless.

I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of overriding and fixing it - great. More power to your elbow.

The current issue arises because the Python app is neither a pure GPU app, nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that.

I think you need to create a special, new, plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research.

I'm running with a manual override which devotes the whole GPU power, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 39
Credit: 170,157,986
RAC: 147,226
Level
Ile
Scientific publications
watwat
Message 59152 - Posted: 23 Aug 2022 | 18:13:17 UTC - in response to Message 59145.
Last modified: 23 Aug 2022 | 18:20:18 UTC

Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? sorry it is a bit outside of my knowledge zone...

This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is calculated by some obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of the number of flops that will be performed on the CPU and the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of the theoretical peak speed. In short, it's useless.

I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of over-riding and fixing it - great. More power to your elbow.

The current issue arises because the Python app is neither a pure GPU app, nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that.

I think you need to create a special, new, plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research.

I'm running with a manual override which devotes the whole GPU power, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running.


Could you tell us a bit more about this manual override? Just now it is sprawled over five cores, ten threads. If it sees the sixth core free, it grabs that one also.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59153 - Posted: 23 Aug 2022 | 19:39:42 UTC - in response to Message 59152.

If you run other projects concurrently, then it is advisable to limit the number of cores the Python tasks occupy for scheduling. I am not talking about the number of threads each task uses, since that is fixed.

Just create an app_config.xml file, place it into the GPUGrid project directory, and either re-read config files from the Manager or just restart BOINC.

The file minimally just needs this:

<app_config>
    <app>
        <name>PythonGPU</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>
            <cpu_usage>3.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>

This will tell the BOINC client not to overcommit other projects' CPU usage, as the Python app gets 3 cores reserved for its use.

I have found that to be plenty, even when running 95% of all CPU cores on 3 other CPU projects along with 2 other GPU projects, each of which also uses some or all of a CPU core to process its GPU task.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 39
Credit: 170,157,986
RAC: 147,226
Level
Ile
Scientific publications
watwat
Message 59154 - Posted: 24 Aug 2022 | 7:15:27 UTC - in response to Message 59153.

If you run other projects concurrently, then it is adviseable to limit the number of cores the Python tasks occupies for scheduling. I am not talking about the number of threads each task uses since that is fixed.

Just create an app_config.xml file and place it into the GPUGrid projects directory and either re-read config files from the Manager or just restart BOINC.

The file minimally just needs this:

<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>

This will tell the BOINC client not to overcommit other projects cpu usage as the Python app gets 3 cores reserved for its use.

I have found that to be plenty even when running 95% of all cpu cores on 3 other cpu projects along with running 2 other gpu projects which also use some or all of a cpu core to process the gpu task.


Thank you Keith. Why is it using so many cores, and is it something like OpenIFS on CPDN?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59155 - Posted: 24 Aug 2022 | 8:04:26 UTC - in response to Message 59154.

Thank you Keith. Why is it using so many cores plus is it something like OpenIFS on CPDN?

Yes - or nbody at MilkyWay. This Python task shares characteristics of a CUDA (GPU) plan class and an MT (multithreaded) plan class, and works best if treated as such.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59163 - Posted: 25 Aug 2022 | 10:12:52 UTC

Possible bad workunit: 27278732

ValueError: Expected value argument (Tensor of shape (1024,)) to be within the support (IntegerInterval(lower_bound=0, upper_bound=17)) of the distribution Categorical(logits: torch.Size([1024, 18])), but found invalid values:
tensor([ 7, 9, 7, ..., 10, 9, 3], device='cuda:0')
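For context, that message reads like PyTorch's support validation on torch.distributions.Categorical: every sampled action must be an integer in [0, 17] for an 18-action space. A dependency-free sketch of that kind of check (hypothetical names, not the actual task code; the real validation lives inside PyTorch):

```python
def categorical_support_check(actions, n_actions=18):
    """Mimic the support check a categorical distribution performs:
    every action index must lie in [0, n_actions - 1]."""
    bad = [a for a in actions if not (0 <= a < n_actions)]
    if bad:
        raise ValueError(
            f"Expected values within the support [0, {n_actions - 1}], "
            f"but found invalid values: {bad}")
    return True

categorical_support_check([7, 9, 7, 10, 9, 3])   # passes: all within [0, 17]
```

Oddly, the values shown in the truncated tensor all look in-range, so the offending entries may be in the elided part of the tensor.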

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59178 - Posted: 30 Aug 2022 | 7:34:11 UTC - in response to Message 59163.

Interesting I had never seen this error before, thank you!
____________

Toby Broom
Send message
Joined: 11 Dec 08
Posts: 25
Credit: 197,737,443
RAC: 210
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59192 - Posted: 3 Sep 2022 | 10:27:16 UTC - in response to Message 59145.

Thanks Richard, is 3 CPU cores enough to not slow down the GPU?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59203 - Posted: 8 Sep 2022 | 16:53:31 UTC

I'm noticing an interesting difference in application behavior between different systems. abouh, can you help explain the reason?

I can see that each running task will spawn 32x processes (multiprocessing.spawn) as well as [number of cores]x processes for the main run.py application.

so on my 8-core/16-thread Intel system, a single running task spawns 8x run.py processes, and 32x multiprocessing.spawn threads.

and on my 24-core/48-thread AMD EPYC system, a single running task spawns 24x run.py processes, and 32x multiprocessing.spawn threads.


What is confusing is the utilization of each thread between these systems.

The EPYC system uses ~600-800% CPU for the run.py processes (~20-40% each thread), whereas the Intel system uses ~120% CPU (~2-5% each thread).

I replicated the same high CPU use on another EPYC system (in a VM) where I've constrained it to the same 8 cores/16 threads, and again it's using a much larger share of the CPU than the Intel system.

Is the application coded in some way that forces more work to be done on more modern processors? As far as I can tell, the increased CPU use isn't making the overall task run any faster; the Intel system is just as productive with far less CPU use.

I was trying to run some Python tasks on my Plex VM to let them use the GPU, since Plex doesn't use it very much, but the CPU use is making it troublesome.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59204 - Posted: 8 Sep 2022 | 17:33:16 UTC - in response to Message 59203.

Or perhaps the Broadwell-based Intel CPU is able to hardware-accelerate some operations that the EPYC has to do in software, leading to higher CPU use?
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59205 - Posted: 9 Sep 2022 | 6:40:53 UTC - in response to Message 59203.

The application is not coded in any specific way to force more work to be done on more modern processors.

Maybe python handles it under the hood somehow?
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59206 - Posted: 9 Sep 2022 | 12:37:59 UTC - in response to Message 59205.

Maybe python handles it under the hood somehow?


It might be related to PyTorch, actually. I did some more digging, and it seems like AMD has worse performance due to some kind of CPU detection issue in the MKL (or maybe deliberate on Intel's part). Do you know what version of MKL your package uses?

And are you able to set specific env variables in your package? If your MKL version is <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. But it looks like this will not be effective if you are on a newer version of the MKL, as Intel has since removed this variable.
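One caveat worth noting: MKL reads MKL_DEBUG_CPU_TYPE when the shared library is first loaded, so in Python the variable has to be in os.environ before the first import that pulls MKL in. A minimal sketch of the ordering (and, as noted, the variable has no effect on MKL newer than 2020.0):

```python
import os

# MKL reads this variable when the shared library is first loaded, so it
# must be set before the first `import numpy` / `import torch` statement.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

# import numpy as np  # <- MKL (if bundled) would be loaded here and see it
print(os.environ["MKL_DEBUG_CPU_TYPE"])
```

Setting it afterwards (or in a script that runs after the libraries are already imported) would have no effect on the current process.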


____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59207 - Posted: 9 Sep 2022 | 14:45:37 UTC - in response to Message 59206.
Last modified: 9 Sep 2022 | 14:59:12 UTC



and are you able to set specific env variables in your package? if your MKL is version <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. but it looks like this will not be effective if you are on a newer version of the MKL as Intel has since removed this variable.



To add: I was able to inspect your MKL version as 2019.0.4, and I tried setting the env variable by adding

os.environ["MKL_DEBUG_CPU_TYPE"] = "5"


to the run.py main program, but it had no effect. Either I didn't put the command in the right place (I inserted it below line 433 in the run.py script), or the issue is something else entirely.

edit: you also might consider compiling your scripts into binaries to prevent inquisitive minds from messing about in your program ;)
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59208 - Posted: 10 Sep 2022 | 3:15:17 UTC
Last modified: 10 Sep 2022 | 3:29:52 UTC

Should the environment variable for fixing AMD computation in the MKL library be in the task package or just in the host environment? Or both?

I would have thought the latter, as the system calls the MKL library makes eventually have to be passed through to the CPU.

export MKL_DEBUG_CPU_TYPE=5

and add it to your .bashrc script.

So you need to set the OS environment variable first, then pass it through to the Python code with os.environ["MKL_DEBUG_CPU_TYPE"].

Of course, if the embedded MKL package is a later version where the variable is ignored, it's a moot point to use the variable to fix the intentional hamstringing of AMD processors.

[Edit]

Looks like there is a workaround for the Intel MKL check of whether it is running on an Intel processor: https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html

So make the fake shared library and use LD_PRELOAD to load it.

That might be the easiest method to get the math libraries to use the advanced SIMD instructions like AVX2.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59209 - Posted: 10 Sep 2022 | 4:34:43 UTC - in response to Message 59208.
Last modified: 10 Sep 2022 | 4:40:03 UTC

I didn’t explicitly state it in my previous reply, but I tried all that already and it didn’t make any difference. I even ran run.py standalone outside of BOINC to be sure the env variable was set. Neither the env variable nor the fake Intel library made any difference at all.

But the embedded MKL version is actually an old one. It’s from 2019 as I mentioned before. So it should accept the debug variable. I just think now that it’s probably not the reason.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59210 - Posted: 10 Sep 2022 | 5:00:10 UTC

Ohh . . . . OK. Didn't know you had tried all the previous existing fixes.

So must be something else going on in the code I guess.

Just thought I would throw it out there in case you hadn't seen the other fixes.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59211 - Posted: 10 Sep 2022 | 6:49:18 UTC - in response to Message 59207.

I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster.

No need to create binaries. I am fine with any user that feels like it tinkering with the code, it always provides useful information. :)

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59212 - Posted: 10 Sep 2022 | 7:48:33 UTC

I don't know if the math functions being used by the Python libraries are any higher than SSE2 or not.

But if they are, the MKL library functions default to SSE2 whenever the MKL library is called and detects any non-Intel CPU.

Probably the only way to know for sure is to examine the code and see if it tries to run any SIMD instruction higher than SSE2, then implement the fix and see whether the computations on the CPU are sped up.

Depending on the math function being called, the speedup with the fix in place can be orders of magnitude.

Based on Ian's experiment running on his Intel host, the lower CPU usage didn't make the tasks run any faster.

But less CPU usage per task (when the tasks run the same with either high or low CPU usage) would be beneficial when also running other CPU tasks, since resources aren't being taken away from those processes.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59213 - Posted: 10 Sep 2022 | 11:46:05 UTC - in response to Message 59211.

I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster.

No need to create binaries. I am fine with any user that feels like it tinkering with the code, it always provides useful information. :)


Was my location for the variable in the script right or appropriate? I inserted it below line 433. Does the script inherit the OS variables already? Just wanted to make sure I had it set properly. I figured the script runs in its own environment outside of BOINC (in Python); that's why I tried adding it to the script.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59214 - Posted: 10 Sep 2022 | 11:51:57 UTC - in response to Message 59212.


Based on Ian's experiment running on his Intel host, the lower cpu usage didn't make the tasks run any faster.

But less cpu usage per task (when the tasks run the same with either hi or lo cpu usage) would be beneficial when also running other cpu tasks and aren't taking resources away from those processes.


It’s hard to say whether it’s faster or not, since it’s not a true apples-to-apples comparison. So far it feels no faster, but that’s against different CPUs and different GPUs. Maybe my EPYC system seems similarly fast because the EPYC is just brute-forcing it; it has much higher IPC than the old Broadwell-based Intel.

____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59215 - Posted: 10 Sep 2022 | 17:57:37 UTC

One of my machines started a Python task yesterday evening and finished it after about 24-1/2 hours.
How come a runtime (and CPU time) of 1,354,433.00 secs (= 376 hrs) is shown:

https://www.gpugrid.net/result.php?resultid=33030599

As a side effect, I did not get any credit bonus (in this case the one for finishing within 48 hrs).

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59216 - Posted: 10 Sep 2022 | 18:11:25 UTC - in response to Message 59215.

One of my machines started a Python task yesterday evening and finished it after about 24-1/ 2hours.
How come that a runtime (and CPU time) of 1,354,433.00 secs (=376 hrs) is shown:

https://www.gpugrid.net/result.php?resultid=33030599

As a side effect, I did not get any credit bonus (in this case the one for finishing within 48 hrs).


The calculated runtime is using the CPU time; this has been mentioned many times. It's because more than one core was being used, so the sum of each core's CPU time is what's shown.

You did get the 48-hr bonus of 25%: base credit is 70,000 and you got 87,500 (+25%). Finishing in less than 24 hrs gets +50%, for 105,000.
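The bonus tiers are just multipliers on the base credit; a quick check (a hypothetical helper reproducing the tiers described above, not the server's actual code):

```python
def credit(base, hours):
    """Bonus tiers as described: +50% if returned within 24 h,
    +25% within 48 h, plain base credit otherwise."""
    if hours < 24:
        return base * 1.50
    if hours < 48:
        return base * 1.25
    return base

print(credit(70_000, 24.5))   # 87500.0 -> the +25% tier for a ~24.5 h return
print(credit(70_000, 12))     # 105000.0 -> the +50% tier
```

So a ~24.5-hour return lands just past the 24-hour cutoff and earns the 25% bonus rather than 50%.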
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59217 - Posted: 10 Sep 2022 | 21:14:39 UTC

GPUGRID seems to have problems with figures, at least where Python is concerned :-(
I just wanted to download a new Python task. On my RAM disk there is about 59GB of free disk space, but the BOINC event log tells me that Python needs some 532MB more disk space. How come?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59218 - Posted: 10 Sep 2022 | 23:34:03 UTC - in response to Message 59217.
Last modified: 10 Sep 2022 | 23:36:01 UTC

GPUGRID seems to have problems with figures, at least what concerns Python :-(
I just wanted to download a new Python task. On my Ramdisk there is about 59GB free disk space, but the BOINC event log tells me that Python needs some 532MB more disk space. How come?


Probably due to your allocation of disk usage in BOINC. Go into the compute preferences and allow BOINC to use more disk space; by default I think it is set to 50% of the disk drive, so you might need to increase that.

Options-> Computing Preferences...
Disk and Memory tab

and set whatever limits you think are appropriate. It will use the most restrictive of the 3 types of limits. The Python tasks take up a lot of space.
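The "most restrictive of the 3" rule amounts to a min() over the three disk preferences. A sketch with hypothetical numbers (BOINC's actual accounting also subtracts space used by other projects and non-BOINC files, so this is only the general idea):

```python
def boinc_disk_allowance(total_gb, free_gb, max_used_gb, min_free_gb, max_pct):
    """How the three 'Disk' preferences combine: BOINC may use no more
    than the most restrictive of the three limits."""
    limit_abs = max_used_gb                   # "use no more than X GB"
    limit_free = free_gb - min_free_gb        # "leave at least Y GB free"
    limit_pct = total_gb * max_pct / 100.0    # "use at most Z% of total"
    return min(limit_abs, limit_free, limit_pct)

# e.g. a 128 GB disk with 59 GB free; limits: 100 GB, keep 1 GB free, 90%:
print(boinc_disk_allowance(128, 59, 100, 1, 90))   # 58 -> free-space term wins
```

This is why a generous percentage setting can still be overridden by the free-space term when the disk is already mostly full.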
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59221 - Posted: 11 Sep 2022 | 4:50:08 UTC - in response to Message 59218.


probably due to your allocation of disk usage in BOINC. go into the compute preferences and allow BOINC to use more disk space. by default I think it is set to 50% of the disk drive. you might need to increase that.

Options-> Computing Preferences...
Disk and Memory tab

and set whatever limits you think are appropriate. it will use the most restrictive of the 3 types of limits. The Python tasks take up a lot of space.

No, it isn't that.
I am aware of these settings. Since nothing other than BOINC is being done on this computer, disk and RAM usage are set to 90% for BOINC.
So, when I have some 58GB free on a 128GB RAM disk (with some 60GB of free system RAM), it should normally be no problem for Python to download and be processed.
On another machine I have far fewer resources, and it works.
So no idea what the problem is in this case ... :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59222 - Posted: 11 Sep 2022 | 6:12:13 UTC

Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.

Could be BOINC only considers physical storage to be valid.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59223 - Posted: 11 Sep 2022 | 6:42:55 UTC - in response to Message 59222.

Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.

Could be BOINC only considers physical storage to be valid.

no, I have BOINC running on another PC with Ramdisk - in that case a much smaller one: 32GB

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59224 - Posted: 11 Sep 2022 | 6:56:19 UTC

Another question -

I think I read something concerning this topic somewhere here, but I cannot find the posting any more (though maybe I am mistaken):

Is there a possibility to limit (via app_config.xml) the number of CPU cores Python uses?
The reason I am asking is that on the machine which can download Python tasks, I also have another (non-GPU) project running, and when Python fills up the available cores, the CPU is busy at 100%, which slows things down and also heats up the CPU much more.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59225 - Posted: 11 Sep 2022 | 8:05:35 UTC
Last modified: 11 Sep 2022 | 8:06:55 UTC

No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation.

If the task is interfering with your other CPU tasks then you have a choice: either stop the Python tasks or reduce your other CPU tasks.

All you can do to make the Python tasks run reasonably well is assign 3-5 CPU cores for BOINC scheduling, to keep other CPU work off the host.

You can do that through an app_config.xml file in the project directory.

Like this:

<app_config>
    <app>
        <name>PythonGPU</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>
            <cpu_usage>3.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59226 - Posted: 11 Sep 2022 | 12:21:35 UTC - in response to Message 59225.

...
All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.

You can do that through a app_config.xml file in the project directory.
Like this: ...

thanks, Keith, for your explanation.

Well, I actually would not need to put in this app_config.xml, because in my case the other BOINC tasks don't just assign any number of CPU cores by themselves; I tell each of these projects via a separate app_config.xml how many cores to use (which is what I was, in fact, also hoping for with Python).
So I have no other choice than to live with the situation as is :-(

What is too bad, though, is that obviously there are no longer any ACEMD tasks being sent out (where it is basically clear: 1 task = 1 CPU core, unless changed by an app_config.xml).

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59228 - Posted: 11 Sep 2022 | 15:27:14 UTC - in response to Message 59223.

Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.

Could be BOINC only considers physical storage to be valid.

no, I have BOINC running on another PC with Ramdisk - in that case a much smaller one: 32GB


Now I tried once more to download a Python on my system with a 128GB Ramdisk (plus 128GB system RAM).
The BOINC event log says:

Python apps for GPU hosts needs 4590.46MB more disk space. You currently have 28788.14 MB available and it needs 33378.60 MB.

Somehow all this does not fit together: in reality, the RAM disk is filled with 73GB and has 55GB available.
Further, I am questioning whether Python indeed needs 33,378 MB of free disk space for downloading.

I am really frustrated that this does not work :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59229 - Posted: 11 Sep 2022 | 15:30:21 UTC - in response to Message 59226.
Last modified: 11 Sep 2022 | 15:33:39 UTC

...
All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.

You can do that through a app_config.xml file in the project directory.
Like this: ...

thanks, Keith, for your explanation.

Well, I actually would not need to put in this app_config.xml as in my case; the other BOINC tasks don't just asign any number of CPU cores by themselves. I tell each of these projects by a seperate app_config.xml how many cores to use (which I was, in fact, also hoping for Python).
So I have no other choice than to live with the situation as is :-(

What is too bad though is that obviously there are no longer any ACEMD tasks being sent out (where it is basically clear: 1 task = 1 CPU core [unless changed by an app_config.xml]).


You are not understanding the nature of the Python tasks. They are not using all your cores. They are not using 32 cores. They are using 32 spawned processes.

A process is NOT a core.

The Python tasks use from 100-300% of a CPU core, depending on the speed of the host and the number of cores in the host.

That is why I offered the app_config.xml file to allot 3 CPU cores to each Python task for BOINC scheduling purposes. And you can have many app_config.xml files in play among all your projects, as an app_config file is specific to each project and is placed into that project's folder. You can certainly use one for scheduling help at GPUGrid.

An app_config file does not control the number of cores a task uses. That depends solely on the science application; a task will use as many or as few cores as needed.

The only exception to that is the special case of the MT plan_class, like the CPU tasks at Milkyway. There, BOINC has an actual control parameter, --nthreads, that can specifically set the number of cores allowed in an MT plan_class task.

That cannot be used here because the Python tasks are not simple CPU-only MT tasks. They are something completely different, something BOINC does not know how to handle: dual CPU-GPU combination tasks where the majority of computation is done on the CPU, with bursts of activity on the GPU, and that cycle repeats.

It would take a major rewrite of core BOINC code to properly handle this type of machine-learning / reinforcement-learning combo task. Unless BOINC attracts new developers willing to tackle this major development hurdle, the best we can do is accommodate these tasks through other host controls.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59230 - Posted: 11 Sep 2022 | 15:40:07 UTC

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59231 - Posted: 11 Sep 2022 | 16:47:19 UTC - in response to Message 59230.

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

I had removed those checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed to about 80GB - only then did the Python download work. Strange, isn't it?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59232 - Posted: 11 Sep 2022 | 17:19:36 UTC - in response to Message 58980.

The reason Reinforcement Learning agents do not currently use the whole potential of the cards is because the interactions between the AI agent and the simulated environment are performed on CPU while the agent "learning" process is the one that uses the GPU intermittently.

There are, however, environments that only use GPU. They are becoming more and more common, so I see it as a real possibility that in the future most popular benchmarks of the field use only GPU. Then the jobs will be much more efficient since pretty much only GPU will be used. Unfortunately we are not there yet...


a suggestion for whenever you're able to move to pure GPU work: PLEASE look into and enable "automatic mixed precision" in your code.

https://pytorch.org/docs/stable/notes/amp_examples.html

this should greatly benefit devices that have Tensor cores and speed things up.
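For reference, a minimal sketch of what enabling AMP in a PyTorch training loop looks like (the model, sizes, and data here are made up purely for illustration; the project's real training code would differ):

```python
import torch

# Toy model/optimizer just to show the AMP pattern.
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(64, 8).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# GradScaler guards against fp16 gradient underflow; only needed on CUDA.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(32, 64, device=device)
target = torch.randn(32, 8, device=device)

for _ in range(3):
    optimizer.zero_grad()
    # autocast runs eligible ops in reduced precision (float16 on
    # Tensor cores, bfloat16 on CPU), keeping the rest in float32.
    with torch.autocast(device_type=device,
                        dtype=torch.float16 if use_cuda else torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()  # scale loss before backward
    scaler.step(optimizer)         # unscales grads, then steps
    scaler.update()
```

On cards with Tensor cores (Volta and newer) the matmul-heavy parts of the learning step can run substantially faster this way, usually with no loss of training quality.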

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59233 - Posted: 11 Sep 2022 | 18:48:40 UTC - in response to Message 59231.

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it?

I think your issue is your use of a fixed ram disk size instead of a dynamic pagefile that is allowed to grow larger as needed.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59234 - Posted: 11 Sep 2022 | 20:06:29 UTC - in response to Message 59233.

Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.

That is what is limiting your Downloads.

I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it?

I think your issue is your use of a fixed ram disk size instead of a dynamic pagefile that is allowed to grow larger as needed.

I just noticed the same problem with Rosetta Python tasks, so this may be related to the Python architecture in general.
In the Rosetta case too, the actual disk space available was significantly higher than what Rosetta said it would need.
So I don't believe this has anything to do with the fixed ram disk size. What is the logic behind your assumption?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59235 - Posted: 12 Sep 2022 | 0:58:35 UTC - in response to Message 59234.

If you read through the various posts, including mine, or investigate the issues with Pytorch on Windows, it comes down to how Windows handles reservation of memory addresses compared to how Linux handles it.

The Pytorch libraries, when downloaded and expanded, ask for many gigabytes of memory. Windows has to commit every bit of memory space the application asks for, whether it will actually be needed or not. Linux does not, since it handles memory allocation dynamically on demand.

And since every Python task is likely different, the previously downloaded Pytorch libraries are probably not reused, so every new task has to reserve all of its configured resources again each time it is executed.

So the best method to satisfy this fact on Windows is to start with a 35GB minimum size pagefile with a 50GB maximum size and allow the pagefile to size dynamically between that range. Your fixed ram disk size just isn't flexible enough or large enough apparently. That pagefile size seems to be sufficient for the other Windows users I have assisted with these tasks.

Read this explanation please for the actual particulars of the problem with Windows. https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59236 - Posted: 12 Sep 2022 | 6:54:49 UTC - in response to Message 59235.

So the best method to satisfy this fact on Windows is to start with a 35GB minimum size pagefile with a 50GB maximum size and allow the pagefile to size dynamically between that range. Your fixed ram disk size just isn't flexible enough or large enough apparently. That pagefile size seems to be sufficient for the other Windows users I have assisted with these tasks.

thanks for the hint, I will adapt the page file size accordingly and see what happens.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59237 - Posted: 12 Sep 2022 | 14:43:46 UTC - in response to Message 59213.

Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys

"""
if __name__ == "__main__":

import sys
sys.stderr.write("Starting!!\n")
import os

os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import platform
"""


____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59238 - Posted: 12 Sep 2022 | 14:58:40 UTC - in response to Message 59237.
Last modified: 12 Sep 2022 | 15:33:37 UTC

Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys

"""
if __name__ == "__main__":

import sys
sys.stderr.write("Starting!!\n")
import os

os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import platform
"""



thanks :) I'll try anyway

edit - nope, no different.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59239 - Posted: 12 Sep 2022 | 15:35:31 UTC - in response to Message 59237.
Last modified: 12 Sep 2022 | 15:37:58 UTC

really unfortunate that these tasks use so many more resources on AMD than on Intel. It's something about the multithreaded nature of the main run.py process itself. On Intel it uses about 2-5% per process, and more run.py processes spin up the more cores you have. On AMD it uses more like 20-40% per process, so with high-core-count CPUs that makes total CPU utilization crazy high.

here is what it looks like running 4x Python tasks (2 GPUs, 2 tasks each) on an Intel 8-core, 16-thread system. What you're seeing is the 4 main run.py processes and their multithreaded components. Notice that the total CPU used by each main process is a little more than 100%, which equates to about one full thread per process.


now here is what it looks like running only 2x Python tasks (1 GPU, 2 tasks each) on an AMD EPYC system with 24 cores, 48 threads. You can see the main run.py multithreaded components each using 20-40%, and each main process cumulatively using 600-800% CPU. That's 6-8 whole threads occupied for a single process, making it roughly 6-8x more resource-intensive to run on AMD than on Intel.


I even swapped my 8c/16t Intel CPU for a 16c/32t one, and while it spun up more multithreaded components for the main run.py, each one was still only 2-5% used, making it only about 150% CPU from each main process. Something is definitely weird going on with these tasks between AMD and Intel.

the CPU used by the 32x multiprocessing.spawn processes is about the same between Intel and AMD. It's only the threads that stem from the main run.py process that show this huge difference.
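For anyone who wants to quantify this on their own host, a small Linux-only helper (hypothetical, it just reads /proc directly) that reports the kernel thread count of a process; point it at each run.py PID:

```python
import os

def thread_count(pid: int) -> int:
    """Return the number of kernel threads of a process (Linux /proc)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise ValueError(f"no Threads: line found for pid {pid}")

# Example: inspect the current process; substitute a run.py PID instead.
print(thread_count(os.getpid()))
```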
____________

Diplomat
Send message
Joined: 1 Sep 10
Posts: 15
Credit: 371,139,648
RAC: 37,057
Level
Asp
Scientific publications
watwatwat
Message 59240 - Posted: 12 Sep 2022 | 15:57:00 UTC - in response to Message 59225.

No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation.

If the task is interfering with your other cpu tasks then you have a choice, either stop the Python tasks or reduce your other cpu tasks.

All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.

You can do that through a app_config.xml file in the project directory.

Like this:

<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>


does it improve GPU utilization? On average I see barely 20%, with occasional spikes up to 35%.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59241 - Posted: 12 Sep 2022 | 16:01:23 UTC - in response to Message 59240.

does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35%


not directly. but if your GPU is being bottlenecked by not enough CPU resources then it could help.

the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.

____________

gemini8
Send message
Joined: 3 Jul 16
Posts: 16
Credit: 348,912,814
RAC: 90,708
Level
Asp
Scientific publications
watwat
Message 59248 - Posted: 13 Sep 2022 | 7:48:09 UTC - in response to Message 59241.

Hi everyone.

the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.

I'm thinking about putting all other BOINC CPU work into a VM instead of running it directly on the host.
Through the VM settings you could have the VM use only 90 per cent of processing power.
That would leave the rest for the Python stuff, so on a sixteen-thread CPU it could use 160% of one thread's power, or 10% of the CPU.
If this wasn't enough, the VM could be adjusted to use only eighty per cent (320% of one thread's power, or 20% of the CPU, for the Python work), and so on.
Repeat [adjust and try] until the machine runs fine.

Plus, you could run other GPU stuff on your GPU to have it fully utilized which should prevent high temperature variations which I see as unnecessary stress for a GPU.
MilkyWay has a small VRAM footprint and doesn't use a full GPU, and maybe I'll try WCG OPNG as well.
____________
Greetings, Jens

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59251 - Posted: 13 Sep 2022 | 19:24:52 UTC - in response to Message 59248.

... and maybe I'll try WCG OPNG as well.

forget about WCG OPNG for the time being. Most of the time no tasks available; and if tasks are available for a short period of time, it's extremely hard to get them downloaded. The downloads get stuck most of the time, and only manual intervention helps.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59254 - Posted: 14 Sep 2022 | 18:08:39 UTC

Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59255 - Posted: 14 Sep 2022 | 19:56:05 UTC - in response to Message 59254.
Last modified: 14 Sep 2022 | 20:18:46 UTC

Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

They save checkpoints well which are replayed to get the task back to the point in progress it was at before interruption.

Just be advised, that the replay process takes a few minutes after restart. The task will show 2% completion percentage upon restart but will eventually jump back to the progress point it was at and continue calculation until end.

Just be patient and let the task run.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59259 - Posted: 15 Sep 2022 | 11:42:38 UTC - in response to Message 59255.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59260 - Posted: 15 Sep 2022 | 15:59:48 UTC - in response to Message 59259.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59261 - Posted: 16 Sep 2022 | 8:03:30 UTC - in response to Message 59259.
Last modified: 16 Sep 2022 | 8:09:53 UTC

The restart is supposed to work fine on Windows as well. Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?

Is there anyone for which the Windows checkpointing works fine? I tested locally and it worked.
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,576,708,471
RAC: 221,898
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59262 - Posted: 16 Sep 2022 | 15:48:36 UTC - in response to Message 59261.
Last modified: 16 Sep 2022 | 16:36:09 UTC

Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?

I can pause and restart them with no problem. The error occurred only on reboot.
But I think I have found it. I was using a large write cache, PrimoCache, set with an 8 GB cache size and 1 hour latency. By disabling that, I am able to reboot without a problem. So there was probably a delay in flushing the cache on reboot that caused the error.

But I used the write cache to protect my SSD, since I was seeing writes of around 370 GB a day, too much for me. But this time I am seeing only 200 GB/day. That is still a lot, but not fatal for some time. It seems that the work units vary in how much they will write. I will monitor it.

I use SsdReady to monitor the writes to disk; the free version is OK.

PS - I can set PrimoCache to only a 1 GB write-cache size with a 5 minute latency, and it reboots without a problem. Whether that is good enough to protect the SSD will have to be determined by monitoring the actual writes to disk. PrimoCache gives a measure of that. (SsdReady gives the OS writes, but not the actual writes to disk.)

PPS: I should point out that the reason a write cache can cut down on the writes to disk is the nature of scientific algorithms. They invariably read from a location, do a calculation, and then, much of the time, write back to the same location. The cache can store that, and only write to the disk the changes that remain at the end of the flush period. If you have a large enough cache and set the write-delay to infinite, you essentially have a ramdisk. But the cache can be good enough with less memory than a ramdisk would require. (And now it seems that 2 GB and 10 minutes works OK.)
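The effect is easy to see in a toy model: count how many blocks actually reach the disk when repeated writes to the same blocks are coalesced between flushes (block ids and flush intervals below are made up purely for illustration):

```python
def simulate(writes, flush_every):
    """writes: sequence of block ids; returns blocks actually written to disk."""
    dirty, disk_writes = set(), 0
    for i, block in enumerate(writes, 1):
        dirty.add(block)              # overwrite in cache, no disk I/O yet
        if i % flush_every == 0:
            disk_writes += len(dirty)  # flush only the unique dirty blocks
            dirty.clear()
    return disk_writes + len(dirty)    # final flush of what remains

# A workload that rewrites the same 10 blocks 100 times each:
workload = [b for _ in range(100) for b in range(10)]
print(len(workload))               # 1000 logical writes
print(simulate(workload, 10**9))   # flush only at the end -> 10 disk writes
print(simulate(workload, 100))     # flush every 100 writes -> 100 disk writes
```

A longer flush latency coalesces more rewrites, which is why the 1-hour setting cut the disk writes so much, at the cost of more data at risk on an unclean shutdown.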

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59265 - Posted: 18 Sep 2022 | 13:41:51 UTC

Question for the experts here:

One of my PCs has 2 RTX3070 inside, Pythons are running quite well.
The interesting thing is that the VRAM usage of one GPU is always about 3.7GB, while that of the other is always about 4.3GB.
So with one of the GPUs I could (try to) process 2 Pythons simultaneously, with the other not (VRAM of the RTX3070 is 8GB).
Is it possible to arrange for such a setting via app_config.xml?

BTW, I know what the app_config.xml looks like for running 2 Pythons on both GPUs (<gpu_usage>0.5</gpu_usage>), but I have no idea how to configure the xml according to my wishes as outlined above.

Can anyone help?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59266 - Posted: 18 Sep 2022 | 13:54:46 UTC - in response to Message 59265.
Last modified: 18 Sep 2022 | 14:51:19 UTC

Sorry. There is no way to configure an app_config to differentiate between devices.

You can only have different settings for different applications.

The only option, which you might not want to do, is to run two different BOINC clients on the same system, to the project this will look like two different computers each having one GPU. Then you could configure one to run 2x and the other to run 1x.

But the amount of VRAM used by the Python app is likely the same on both of your cards. The first GPU will always show more VRAM used because it's also running your display; a second task won't use another 4.3GB, most likely only another ~3.6GB.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59267 - Posted: 18 Sep 2022 | 14:53:37 UTC - in response to Message 59266.

Sorry. There is no way to configure an app_config to differentiate between devices.

You can only have different settings for different applications.

The only option, which you might not want to do, is to run two different BOINC clients on the same system, to the project this will look like two different computers each having one GPU. Then you could configure one to run 2x and the other to run 1x.

But the amount of VRAM used by the Python app is likely the same between your cards. But the first GPU will always have more vram used because it’s running your display.


In fact, I have 2 BOINC clients on this PC; I had to set up the second one with its BOINC data directory on the SSD, since the first one is on the 32GB ramdisk, which would not let Python tasks download ("not enough disk space").
However, next week I will double the RAM on this PC from 64 to 128GB, and then I will increase the ramdisk size to at least 64GB; this should make it possible to download Python tasks - at least that's what I hope.

So then I could run 1 Python on each of the 2 GPUs on the SSD client, and a third Python on the Ramdisk client.
The only two questions now are: how do I tell the Ramdisk client to run only 1 Python (although 2 GPUs available)? And how do I tell the Ramdisk client to choose the GPU with the lower amount of VRAM usage (i.e. the one that's NOT running the display)?

In fact, I would prefer to run 2 Pythons on the Ramdisk client and 1 Python on the SSD client; however, the question is whether I could download 2 Pythons on the 64GB Ramdisk - the only thing I could do is to try.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59268 - Posted: 18 Sep 2022 | 15:22:30 UTC - in response to Message 59267.
Last modified: 18 Sep 2022 | 16:18:36 UTC

please read the BOINC documentation for client configuration. all of the options and what they do are in there.

https://boinc.berkeley.edu/wiki/Client_configuration

you will need to change several things to run multiple clients at the same time. you need to start them on different ports, as well as add several things to cc_config. you will also need to exclude the GPU you dont want to use from each client.

either use the <exclude_gpu> section (where BOINC can see the device but wont use it for a given project)
or use the <ignore_nvidia_dev> tag (where BOINC wont see this device at all for any project)
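As a sketch, and assuming device 0 happens to be the one to exclude (check your own event log for the actual device numbers), the two approaches look like this in cc_config.xml:

```xml
<cc_config>
  <options>
    <!-- Option A: BOINC sees the GPU but won't use it for this project -->
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <!-- Option B: hide NVIDIA device 0 from this client for all projects -->
    <ignore_nvidia_dev>0</ignore_nvidia_dev>
  </options>
</cc_config>
```

Use one or the other per client, not both, and restart the client after editing.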
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59269 - Posted: 18 Sep 2022 | 16:26:35 UTC - in response to Message 59267.
Last modified: 18 Sep 2022 | 16:30:24 UTC

personally I would stop running the ram disk. it's just extra complication and eats up ram space that the Python tasks crave. your biggest benefit will be moving to linux, it's easily 2x faster, maybe more. I don't know how you have your systems set up, but i see your longest runtimes on your 3070 are like 24hrs. that's crazy long. are you not leaving enough CPU available? are you running other CPU work at the same time?

for comparison, I built a Linux machine dedicated to these tasks. 2x RTX 3060 and a 24-core EPYC CPU and 128GB system ram. I am not running any other work on it, only PythonGPU. to give these tasks the optimum conditions to run as fast as possible.

with 12GB of VRAM, i can run 3x per GPU and it completes tasks in about 13hrs at the longest, for an effective longest completion time of about 1 task every 4.3hrs, which means at minimum, this system with 2x GPUs (6x tasks running) completes about 11 tasks per day (1,155,000 cred) + the bonus of some tasks completing earlier. you can see that my 3060 in this system is 6x more productive than your 3070. that's an insane difference

doing this uses about 80-90% of the CPU, and ~56GB of system ram. I have enough spare VRAM to add another GPU, but maybe not enough CPU power to support more than 1 more task. if I want another GPU i will probably need a more powerful (more cores) CPU.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59270 - Posted: 18 Sep 2022 | 17:01:28 UTC - in response to Message 59268.

...
either use the <exclude_gpu> section (where BOINC can see the device but wont use it for a given project)
or use the <ignore_nvidia_dev> tag (where BOINC wont see this device at all for any project)

thanks very much for your hints:-)

One other thing that I now noticed when reading the stderr of the 3 Pythons that failed short time after start:

"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"

So the reason the tasks crashed after a few seconds was not too little VRAM (that would probably have shown up a little later), but a lack of system RAM.
In fact, I remember that right after the start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM went up again (i.e. after 3 tasks had crashed, thus releasing memory).

Any idea how much system RAM, roughly, a Python task takes?

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59271 - Posted: 18 Sep 2022 | 17:23:23 UTC - in response to Message 59270.


One other thing that I now noticed when reading the stderr of the 3 Pythons that failed short time after start:

"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"

So the reason why the tasks crashed after a few seconds was not the too small VRAM (this would probably have come up a little later), but the lack of system RAM.
In fact, I remember that right after start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed thus releasing memory).

Any idea how much system RAM, roughly, a Python task takes?

From what I can see in the Windows Task Manager on this PC and on others running Python tasks, RAM usage of a Python can be from about 1GB to 6GB (!)
How come it varies that much?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59272 - Posted: 18 Sep 2022 | 17:34:44 UTC - in response to Message 59271.
Last modified: 18 Sep 2022 | 17:35:32 UTC

you should figure 7-8GB per python task. that's what it seems to use on my linux system. i would imagine it uses a little when the task starts up, then slowly increases once it gets to running full out. that might be the reason for the variance of 1GB in the beginning, and 6+GB by the time it gets to running the main program.

these tasks work in 3 phases from what i've seen

Phase 1: extraction phase. just extracting the compressed package. usually takes about 5 minutes, depending on CPU speed. uses only a single core.

Phase 2: pre-processing and/or pre-loading. uses a large % of CPU power, GPU gets intermittently used, and VRAM preloads to about 60% of what will be eventually used. (in my case, VRAM preloads about 2100MB). this also lasts about 5 mins.

Phase 3: main program. CPU use drops down, and VRAM use loads up to 100% of what is needed (in my case 3600MB per task).
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59280 - Posted: 20 Sep 2022 | 10:08:45 UTC - in response to Message 59254.

Erich56 asked:

Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?

I tried it now - the two tasks running on a RTX3070 each - on Windows - did not survive a reboot :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59281 - Posted: 20 Sep 2022 | 11:58:10 UTC

Since I upgraded the RAM of one of my PCs yesterday from 64GB to 128GB (so now I have a 64GB ramdisk plus 64GB of system RAM; before it was half each), every GPUGRID Python task fails on this PC with its 2 RTX3070s.

The task starts okay, RAM as well as VRAM is filling up continuously, also the CPU usage is close to 100%, and after a while (a few minutes up to half an hour) the task fails.
The BOINC manager says "aborted by the project", and the task description says "aufgegeben" (German for "abandoned").

Interestingly, no times are shown (neither runtime nor CPU time), and there is no stderr.

See this example:

https://www.gpugrid.net/result.php?resultid=33044774

on another machine, I have two tasks running simultaneously on one GPU - no problem at all.

I was of course thinking of a defective RAM module; however, all through the night I had 5 LHC ATLAS tasks (3 cores each) running simultaneously without any problem. So I guess that was enough of a RAM test.

Also hundreds of WCG GPU tasks were processed this morning for hours, also without any problem.

Anyone and ideas ?

Diplomat
Send message
Joined: 1 Sep 10
Posts: 15
Credit: 371,139,648
RAC: 37,057
Level
Asp
Scientific publications
watwatwat
Message 59285 - Posted: 20 Sep 2022 | 18:03:37 UTC - in response to Message 59153.
Last modified: 20 Sep 2022 | 18:09:57 UTC


<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>


I'm new to config editing :) a few more questions

Do I need to be more specific in the <name> tag and put the full application name, like Python apps for GPU hosts 4.03 (cuda1131) from the task properties?

Because I don't see 3 CPUs being given to the task after a client restart


Application Python apps for GPU hosts 4.03 (cuda1131)
Name e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State Running
Received Tue 20 Sep 2022 10:48:34 PM +05
Report deadline Sun 25 Sep 2022 10:48:34 PM +05
Resources 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size 1,000,000,000 GFLOPs
CPU time 00:48:32
CPU time since checkpoint 00:00:07
Elapsed time 00:11:37
Estimated time remaining 50d 21:42:09
Fraction done 1.990%
Virtual memory size 18.16 GB
Working set size 5.88 GB
Directory slots/8
Process ID 5555
Progress rate 6.840% per hour
Executable wrapper_26198_x86_64-pc-linux-gnu

KAMasud
Send message
Joined: 27 Jul 11
Posts: 39
Credit: 170,157,986
RAC: 147,226
Level
Ile
Scientific publications
watwat
Message 59286 - Posted: 20 Sep 2022 | 18:32:36 UTC - in response to Message 59260.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.



The restart works fine on Windows. Maybe it's the five-minute break at 2% that is causing the confusion.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59287 - Posted: 20 Sep 2022 | 19:22:29 UTC - in response to Message 59281.


Anyone and ideas ?

Get rid of the ram disk.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59288 - Posted: 20 Sep 2022 | 19:25:45 UTC - in response to Message 59285.
Last modified: 20 Sep 2022 | 19:26:22 UTC


<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>


I'm new to config editing :) a few more questions

Do I need to be more specific in <name> tag and put full application name like Python apps for GPU hosts 4.03 (cuda1131) from task properties?

Because I don't see 3 CPUs been given to the task after client restart


Application Python apps for GPU hosts 4.03 (cuda1131)
Name e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State Running
Received Tue 20 Sep 2022 10:48:34 PM +05
Report deadline Sun 25 Sep 2022 10:48:34 PM +05
Resources 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size 1,000,000,000 GFLOPs
CPU time 00:48:32
CPU time since checkpoint 00:00:07
Elapsed time 00:11:37
Estimated time remaining 50d 21:42:09
Fraction done 1.990%
Virtual memory size 18.16 GB
Working set size 5.88 GB
Directory slots/8
Process ID 5555
Progress rate 6.840% per hour
Executable wrapper_26198_x86_64-pc-linux-gnu



Any already downloaded task will see the original cpu-gpu resource assignment.

Any newly downloaded task will show the NEW task assignment.

The name for the tasks is PythonGPU as you show.

You should always refer to the client_state.xml file, as it is the final arbiter of the correct naming and task configuration.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59289 - Posted: 20 Sep 2022 | 19:29:24 UTC - in response to Message 59286.

Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.

I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702

That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.

Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu.



The restart works fine on Windows. Maybe, it might be the five-minute break at 2% which might be causing the confusion.


If you interrupt the task during Stage 1, while it is downloading and unpacking the required support files, it may fail on Windows upon restart.

The failure for this reason normally shows up in stderr.txt.

Best to interrupt the task only after its setup, once it is actually calculating and has produced at least one checkpoint.

Erich56
Message 59290 - Posted: 20 Sep 2022 | 19:52:08 UTC - in response to Message 59287.


Anyone any ideas?

Get rid of the ram disk.

on the other hand, ramdisk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Keith Myers
Message 59291 - Posted: 20 Sep 2022 | 20:40:28 UTC - in response to Message 59290.


Anyone any ideas?

Get rid of the ram disk.

on the other hand, ramdisk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Then you need to investigate the differences between the two hosts.

All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.

Basic troubleshooting. Reduce to the most basic, absolute needed configuration for the tasks to complete correctly and then add back in one extra superfluous element at a time until the tasks fail again.

Then you have identified why the tasks fail.

Diplomat
Message 59292 - Posted: 21 Sep 2022 | 16:42:56 UTC

Keith Myers thanks!

Diplomat
Message 59293 - Posted: 22 Sep 2022 | 2:53:36 UTC

In my case the config didn't want to work until I added <max_concurrent>


<app_config>

<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>


Now I see the expected status: Running (3 CPUs + 1 NVIDIA GPU)

Unfortunately it doesn't help to get higher GPU utilization. Completion time looks like it's going to be slightly better, though.

Keith Myers
Message 59294 - Posted: 22 Sep 2022 | 4:23:38 UTC - in response to Message 59293.

In my case the config didn't want to work until I added <max_concurrent>


<app_config>

<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>


Now I see the expected status: Running (3 CPUs + 1 NVIDIA GPU)

Unfortunately it doesn't help to get higher GPU utilization. Completion time looks like it's going to be slightly better, though.


If you have enough CPU for support and enough VRAM on the card, you can get better GPU utilization by running two tasks at once on the card (2X). Just change gpu_usage to 0.5.
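Applied to the app_config.xml shown earlier in the thread, a 2X setup might look like this (a sketch; whether max_concurrent should rise to 2 and how many CPUs to reserve depend on your host):

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <max_concurrent>2</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```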

Erich56
Message 59297 - Posted: 22 Sep 2022 | 18:44:15 UTC - in response to Message 59291.
Last modified: 22 Sep 2022 | 18:47:30 UTC


Anyone any ideas?

Get rid of the ram disk.

on the other hand, ramdisk works perfectly on this machine:

https://www.gpugrid.net/show_host_detail.php?hostid=599484

Then you need to investigate the differences between the two hosts.

All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.

Basic troubleshooting. Reduce to the most basic, absolute needed configuration for the tasks to complete correctly and then add back in one extra superfluous element at a time until the tasks fail again.

Then you have identified why the tasks fail.


I installed a RAMdisk because quite often I am crunching tasks which write many GB of data to disk. E.g. LHC-Atlas, the GPU tasks from WCG, the Pythons from Rosetta, and last but not least the Pythons from GPUGRID: about 200GB within 24 hours, which is a lot (so for my two RTX3070, this would be 400GB/day).
So, if the machines are running 24/7, in my opinion this is simply not good for an SSD's lifetime.

Over the years, my experience with the RAMdisk has been good. No idea what kind of problem the GPUGRID Pythons have with this particular RAMdisk - or vice versa. As said, on another machine with a RAMdisk I also have 2 Pythons running concurrently, even on one GPU, and it works fine.

So what I did yesterday evening was let only one of the two RTX3070 crunch a Python. On the other GPU, I sometimes crunched WCG or nothing at all.
This evening, after about 22-1/2 hours, the Python finished successfully :-)
BTW - besides the Python, 3 ATLAS tasks at 3 cores each were also running all the time.

Which means: what I know so far is that I can run Pythons at least on one of the two RTX3070, and other projects on the other one.
Still, I will try to further investigate why GPUGRID Pythons don't run on both RTX3070.

[CSF] Aleksey Belkov
Message 59307 - Posted: 24 Sep 2022 | 17:46:41 UTC - in response to Message 56977.
Last modified: 24 Sep 2022 | 17:49:29 UTC

I do not know how to properly mention the project administrators in this topic to draw their attention to the problem of non-optimal use of disk space by this application.
Only now did I notice what the slotX directory contains while a task is running.
I was very surprised to see there, in addition to the unpacked application files, the archive itself from which these files are unpacked. Moreover, the archive is present in two copies at once, apparently due to the suboptimal process of unpacking the tar.gz format.
Here you can see that the application's files themselves occupy only half of the working directory (slotX) volume.



Apparently, when the application starts, the following happens:
1) The source archive (pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17) of the application is copied from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\).
2) Then the archive is unzipped (tar.gz >> tar).
3) At the last stage, the application files are unpacked from the tar container.
At the end of the process, the now-unnecessary tar and tar.gz files are, for some reason, not deleted from the working directory.
Thus, not only does each instance of this WU occupy ~16 GiB of disk space at its peak, but that space stays occupied until the WU completes.

The whole process costs both extra time (copying and unpacking) and extra written data:
Project tar.gz >> slotX (2,66 GiB) >> tar (5,48 GiB) >> app files (5,46 GiB) = 13,6 GiB

Both parameters can be significantly reduced by unpacking the files directly into the working directory from the source archive, without any of the mentioned intermediate stages.
7za, which is used for unzipping/unpacking the archives, supports pipelining:


7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar -o"X:\BOINC\slots\0\"

Project tar.gz >> app files (5,46 GiB) = 5,46 GiB !


Moreover, if you use the 7z format for the archive instead of tar.gz (LZMA2 with the "5 - Normal" profile, which is the default in recent 7-Zip versions), then you can not only seriously reduce the amount of data downloaded by each user (and, as a consequence, the load on the project infrastructure's bandwidth), but also speed up the process of unpacking data from the archive.

Saving more than one GiB:



On my computer, unpacking by pipelining (as mentioned above) using the currently shipped, 12-year-old 7za version (9.20) takes ~100 seconds.
When using a recent version of 7za (22.01), it takes only ~45-50 seconds.

7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.7z" -o"X:\BOINC\slots\0\"


I believe the benefits of the described changes make them worth implementing (even if not all of them and/or not all at once).
Moreover, all the changes come down to updating one executable file, repacking the archive, and changing the command used to unpack it.

Keith Myers
Message 59308 - Posted: 24 Sep 2022 | 21:32:01 UTC

I believe the researcher has already been down this road with Windows not natively supporting the compression/decompression algorithms you mention.

It requires each volunteer to add support manually to their hosts.

In the quest for compatibility, a researcher tries to package applications for all attached hosts to run natively without jumping through hoops so that everyone can run the tasks.

[CSF] Aleksey Belkov
Message 59309 - Posted: 24 Sep 2022 | 21:53:22 UTC - in response to Message 59308.
Last modified: 24 Sep 2022 | 21:56:53 UTC

It requires each volunteer to add support manually to their hosts.

No.
Unfortunately, you have read what I wrote above inattentively.
It was already mentioned there that the Windows app currently comes with 7za.exe version 9.20 (you can find it in the project folder).
So nothing changes for volunteers.

Jim1348
Message 59310 - Posted: 25 Sep 2022 | 2:31:01 UTC
Last modified: 25 Sep 2022 | 2:42:26 UTC

Yes, I do have GPUGrid installed on my Win10 machine after all.
And 7za.exe is in the project folder, just not in the project folder on my Ubuntu machine.

Keith Myers
Message 59311 - Posted: 25 Sep 2022 | 3:07:56 UTC - in response to Message 59309.

It requires each volunteer to add support manually to their hosts.

No
Unfortunately, you have inattentively read what I wrote above.
It has already been mentioned there that is currently Windows app already comes with 7za.exe version 9.20(you can find it in project folder).
So nothing changing.

OK, so you can thank Richard Haselgrove that the application now packages that utility. Originally the tasks failed because Windows does not come with that utility, and Richard helped debug the issue with the developer.

If you think the application is not using the utility correctly, you should inform the developer of your analysis and code fix so that other Windows users can benefit.

[CSF] Aleksey Belkov
Message 59312 - Posted: 25 Sep 2022 | 10:39:15 UTC - in response to Message 59311.

you should inform the developer of your analysis and code fix so that other Windows users can benefit.

I have already sent abouh a PM pointing to this thread, just in case.

abouh
Message 59335 - Posted: 26 Sep 2022 | 7:33:56 UTC - in response to Message 59307.
Last modified: 26 Sep 2022 | 7:45:45 UTC

Hello, thank you very much for your help. I would like to implement the changes if they help optimise the tasks, but let me try to summarise your ideas to see if I got them right:



Change A --> As you say, the original .tar.gz file is first copied to the working directory and then unpacked in a 2-step process (tar.gz to tar, and tar to plain files), and the tar.gz and tar files are left lying around after that. You suggest that these files should be deleted to save space, and I agree, it makes sense. The sequence should probably be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to .tar
3) delete .tar.gz file
4) unpack .tar file to plain files
5) delete .tar file
This one is straightforward to implement.




Change B --> Additionally, you also suggest replacing the copying and the 2-step unpacking with a single-step process using the command line you propose. So the sequence would be further simplified to:
1) unpack .tar.gz to plain files
2) delete .tar.gz file
The only problem I see here is that I believe I cannot modify the step of first copying the files from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\). It is generic across all projects, even those that do not contain files to be unpacked later. So, not to mess with other GPUGrid apps, the sequence should be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to plain files
3) delete .tar.gz file

In this case, would the command line simply be this one, without the -o flag?

7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar





Change C --> Finally, you suggest using .7z compression instead of .tar.gz to save disk space and unpacking time, with a more recent version of 7za.


Is all the above correct?

I believe these changes are worth implementing, thank you very much. I will start with Change A and Change B and roll them out to PythonGPUbeta first for testing this week.
____________

Richard Haselgrove
Message 59336 - Posted: 26 Sep 2022 | 9:26:17 UTC - in response to Message 59335.
Last modified: 26 Sep 2022 | 9:26:45 UTC

Looks good to me. Just one question - are there any 'minimum Windows version' constraints on the later versions of 7za? I think it's unlikely to affect us, but it would be good to check, just in case.

I mention it, because the original trial runs used native Windows tar decompression (the same as the Linux implementation): but that was only introduced in later versions of Windows 10 and 11. Some of us (myself included) still use Windows 7, which supports 7z but not tar. A reasonable degree of backwards compatibility is desirable!

[CSF] Aleksey Belkov
Message 59337 - Posted: 26 Sep 2022 | 12:00:48 UTC - in response to Message 59335.
Last modified: 26 Sep 2022 | 12:05:18 UTC

Hi, abouh!

Change A:
You are correct.

Change B
You are correct.
2) If this can't be changed, or is too hard/time-consuming to implement - no big deal.
In any case, pipelining still saves some time and space : )


in this case, would the command line would be simply this one? without the -o flag?

Of course, if you launch 7za from the working directory (/slots/X), then the output flag is not necessary.

Change C
You are correct.
Using the 7z format (LZMA2 compression) significantly reduces the archive size, saving your bandwidth and some unpacking time ; )
As I wrote above, the 7za command will be simplified, since the pipelining process will no longer be required.
NB! It is important to update the supplied 7za to a current version: since version 9.20, a lot of optimizations have been made for compression/decompression of 7z (LZMA) archives.


Just one question - are there any 'minimum Windows version' constraints on the later versions of 7za?

As mentioned on the 7-Zip homepage, the app supports all Windows versions since 2000:

7-Zip works in Windows 10 / 8 / 7 / Vista / XP / 2019 / 2016 / 2012 / 2008 / 2003 / 2000.

abouh
Message 59340 - Posted: 26 Sep 2022 | 13:39:36 UTC
Last modified: 26 Sep 2022 | 13:42:42 UTC

As a very first step I am trying to remove the .tar.gz file. I am encountering a first issue. The steps of the jobs are specified in the job.xml file in the following way:

<job_desc>

<task>
<application>.\7za.exe</application>
<command_line>x .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17 -y</command_line>
</task>

<task>
<application>.\7za.exe</application>
<command_line>x .\pythongpu_windows_x86_64__cuda1131.tar -y</command_line>
</task>

....

</job_desc>


Essentially I need to execute a task that removes the pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17 file after the very first task.

When I try in the Windows command prompt:

cmd.exe /C "del pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"


it works. However when I add to the job.xml file

<task>
<application>cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>


The wrapper seems to ignore it. Doesn't the wrapper have access to cmd.exe? I need to run more tests to figure out the exact command to delete the file.
____________

[CSF] Aleksey Belkov
Message 59341 - Posted: 26 Sep 2022 | 14:09:41 UTC - in response to Message 59340.

<task>
<application>cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>

Try using the %COMSPEC% variable as an alias for %SystemRoot%\system32\cmd.exe.
If this doesn't work, then I'm sure specifying the full path (C:\Windows\system32\cmd.exe) should work.

Ian&Steve C.
Message 59343 - Posted: 26 Sep 2022 | 14:50:29 UTC

in other news. looks like we've finally crunched through all the tasks ready to send. all that remains are the ones in progress and the resends that will come from those.

any more coming soon?
____________

abouh
Message 59347 - Posted: 27 Sep 2022 | 13:53:43 UTC - in response to Message 59341.
Last modified: 27 Sep 2022 | 14:06:09 UTC

True! Specifying the whole path works:

<job_desc>

<task>
<application>C:\Windows\system32\cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>

</job_desc>


I have deployed Change A to the PythonGPUbeta app, just to test whether it works on all Windows machines. I've just sent a few (32) jobs. If it works fine, I will move on to introducing the other changes.
____________

abouh
Message 59348 - Posted: 27 Sep 2022 | 14:02:54 UTC - in response to Message 59343.
Last modified: 27 Sep 2022 | 14:05:21 UTC

I will be running new experiments shortly. My idea is to use the whole capacity of the grid. I have already noticed that a few months ago it could absorb around 800 tasks and now it goes up to 1000! Thank you for all the support :)
____________

abouh
Message 59354 - Posted: 28 Sep 2022 | 7:10:17 UTC

The first batch I sent to PythonGPUbeta yesterday failed, but I figured out the problem this morning. I sent another batch to the PythonGPUbeta app an hour ago. This time it seems to be working. It has Change A implemented, so disk usage is better optimised.
____________

abouh
Message 59356 - Posted: 28 Sep 2022 | 7:49:54 UTC - in response to Message 59337.
Last modified: 28 Sep 2022 | 8:50:11 UTC

Hello Aleksey!

I was looking at how to implement Change C, namely whether we can pack and unpack the task's conda-environment files using the 7z format and a recent version of 7za.exe.

We use conda-pack to compress the conda environment, which we later unpack on the GPUGrid Windows machines using 7za.exe.

However, looking at the documentation, it seems 7z is not a format conda-pack can deal with: https://conda.github.io/conda-pack/cli.html

Apparently the possible formats include: zip, tar.gz, tgz, tar.bz2, tbz2, tar.xz, txz, tar, parcel (?), squashfs (?)

So in case of switching from the current tar.gz, we could only go to one of these. Maybe tbz2 or txz? It seems these can be unpacked in a single step as well, if recent versions of 7za.exe can handle the format.

Any recommendation? :)

For tbz2 the file size is similar, slightly smaller. The txz file is substantially smaller but took forever (30 mins) to compress.
2.0G pythongpu_windows_x86_64__cuda102.tar.gz
1.9G pythongpu_windows_x86_64__cuda102.tbz2
1.2G pythongpu_windows_x86_64__cuda102.txz
____________

Ian&Steve C.
Message 59357 - Posted: 28 Sep 2022 | 12:25:25 UTC

more tasks? I'm running dry ;)
____________

Keith Myers
Message 59358 - Posted: 28 Sep 2022 | 15:39:18 UTC

More tasks please, also.

bibi
Message 59359 - Posted: 28 Sep 2022 | 16:13:06 UTC - in response to Message 59356.

Hi,

Why not produce a zip file? The BOINC client can unzip such a file directly from the project folder to the slot, like with acemd3.
If that works, 7za.exe and these extra tasks are not necessary.

pythongpu_windows_x86_64__cuda1131.zip has 2,58 GB
pythongpu_windows_x86_64__cuda1131.tar.gz has 2,66 GB

[CSF] Aleksey Belkov
Message 59360 - Posted: 28 Sep 2022 | 18:26:19 UTC - in response to Message 59356.
Last modified: 28 Sep 2022 | 18:42:56 UTC

Good day, abouh

This time seems to be working. It has Change A implemented,

It's nice to hear that!

Maybe tbz2 or txz?

As I understand it, tbz2/txz are alias file extensions for tar.bz2/tar.xz.
So in fact these formats are tar containers compressed with bz2 or xz.
Therefore this will still require the pipelining process, which, however, has practically no effect on the unpacking speed and only lengthens the command string.
In my test, unpacking of tar.xz was done in ~40 seconds.

seems like this ones we can unpacked in a single step as well, if recent versions 7za.exe allow to handle this format.

The xz format has been supported since version 9.04 beta, but more recent versions support multi-threaded (de)compression, which is crucial for fast unpacking.


The txz file is substantially smaller but took forever (30 mins) to compress.

This format uses the LZMA2 algorithm, the same one 7z uses by default. So the space saving must be the same with the same settings (--compress-level).
It's highly likely you forgot to use this flag:
--n-threads <n>, -j <n>

to set the number of threads to use for compression. By default conda-pack uses only 1 thread!
Also check --compress-level. Levels higher than 5 are not so effective in terms of compression time vs archive size.
Considering that the PythonGPU app file, I think, rarely changes, that's not a big deal.
As far as I remember, this has (practically) no effect on unpacking speed.
In my test (32 threads / Threadripper 2950X), compression took ~2,5 minutes with compress-level 5 (archive size 1,55 GiB).

[CSF] Aleksey Belkov
Message 59361 - Posted: 28 Sep 2022 | 18:56:09 UTC - in response to Message 59359.
Last modified: 28 Sep 2022 | 19:47:06 UTC

why not producing a zip file, because the boinc client can unzip such file direct from the project folder to the slot like with acemd3.

You're probably right.
I somehow didn't pay attention to the acemd3 archives in the project directory.
Is there some info on how BOINC works with archives?
I suppose the boinc-client uses a built-in library to work with archives (zlib?), rather than some OS functions/tools.

There's still a dilemma:
1) On the one hand, using the zip format will simplify the application launch process and reduce the amount of disk space required by the application (no need to copy the archive to the working directory), with the amount of data written to disk reduced accordingly.
2) On the other hand, the xz format reduces the archive size by a whole GiB, which helps to save the project's network bandwidth and the download time on a user's first access to the project.

[CSF] Aleksey Belkov
Message 59362 - Posted: 28 Sep 2022 | 19:56:28 UTC - in response to Message 59360.

On my test(32 threads / Threadripper 2950X), it took ~2,5 minutes with compress-level 5(archive size 1,55 GiB).

It's about compression*

abouh
Message 59364 - Posted: 29 Sep 2022 | 14:16:34 UTC - in response to Message 59359.
Last modified: 29 Sep 2022 | 14:49:18 UTC

We tried to pack the files with zip at first but encountered problems on Windows. Not sure if it was some strange quirk in the wrapper or in conda-pack (the tool for creating, packing and unpacking conda environments, https://conda.github.io/conda-pack/), but the process failed for compressed environment files above a certain size.

We then tried to use another format that could compress the files to a smaller size than .zip. We tried .tar, but not all Windows versions have tar.exe (old ones do not).

We finally settled on the solution of sending 7za.exe along with the conda-packed environment, to be able to unpack it as part of the job.

I am not 100% sure, but I suspect acemd3 does not use the PyTorch machine learning framework, which substantially increases the size of the packed environment. And I believe acemd4 does use PyTorch, and faces the same issue as the PythonGPU tasks.
____________

abouh
Message 59365 - Posted: 29 Sep 2022 | 14:54:43 UTC - in response to Message 59362.

You were absolutely right, I forgot the number of threads! I can now reproduce a much faster compression as well.

I will proceed to test if I can use the BOINC wrapper and a newer version of 7za.exe to unpack it locally in a reasonable amount of time and then will deploy it to PythonGPUbeta for testing.

Thank you very much!
____________

bibi
Message 59366 - Posted: 29 Sep 2022 | 18:21:05 UTC - in response to Message 59365.

Hi abouh,

The provided 7za.exe is version 9.20 from 2010. The latest version on 7-zip.org is 22.01 (now 7z.exe).
If you want to unpack in a pipe or delete the tar file, you need cmd. But the starter used, wrapper_6.1_windows_x86_64.exe (see the project folder), doesn't know about the environment, and the Windows folder isn't necessarily C:\Windows, so you should also provide cmd.exe.
Unpacking in a pipe:
<task>
<application>.\cmd.exe</application>
<command_line>/c .\7za.exe -so x pythongpu_windows_x86_64__cuda1131.tar.xz | .\7za.exe -y -sifile.txt.tar x & exit</command_line>
<weight>1</weight>
</task>

Why conda-pack with format zip is not working I don't know.

bibi
Message 59367 - Posted: 29 Sep 2022 | 18:56:26 UTC - in response to Message 59366.

7z.exe calls the dll, 7za.exe stands alone. You find it in 7-Zip Extra on https://7-zip.org/download.html
But your version works too.

[CSF] Aleksey Belkov
Message 59368 - Posted: 29 Sep 2022 | 21:01:20 UTC - in response to Message 59366.


the provided 7za.exe has version 9.20 from 2010. The last version on 7-zip.org is 22.01


7z.exe calls the dll, 7za.exe stands alone. You find it in 7-Zip Extra on https://7-zip.org/download.html

All this has already been discussed several posts above.
If you had read before writing...

so you also should provide cmd.exe.

I think this is not a good idea.
Some antiviruses may perceive an attempt to launch cmd.exe from outside the system directory as suspicious/malicious activity.

abouh
Message 59369 - Posted: 30 Sep 2022 | 5:44:38 UTC
Last modified: 30 Sep 2022 | 5:47:12 UTC

I added the discussed changes and deployed them to the PythonGPUbeta app. More specifically:

1. I changed the 7za.exe executable to (I believe) the latest version. In any case, a much newer one than the one used previously.

2. I now compress the conda-environment files to .txz. I use the default --compress-level (4), because I tried 9 and the compressed file size was the same.

As Aleksey mentioned, the unpacking still needs to be done in 2 steps, but at least the files sent are now smaller thanks to the more efficient compression.

Did anyone catch any of the PythonGPUbeta jobs? They seemed to work.

Regarding what bibi mentioned, \Windows\System32\cmd.exe seems to be present on all Windows machines so far; at least I have not seen any job fail because of it. I have sent 64 test jobs in total.
____________

Keith Myers
Message 59370 - Posted: 30 Sep 2022 | 5:48:53 UTC - in response to Message 59369.

No, I haven't been lucky enough yet to snag any of the beta tasks.

Richard Haselgrove
Message 59371 - Posted: 30 Sep 2022 | 7:38:17 UTC
Last modified: 30 Sep 2022 | 7:42:38 UTC

One of my Linux machines has just crashed two tasks in succession with

UnboundLocalError: local variable 'features' referenced before assignment

https://www.gpugrid.net/results.php?hostid=508381

Edit - make that three. And a fourth looks to be heading in the same direction - many other users have tried it already.

abouh
Message 59372 - Posted: 30 Sep 2022 | 8:25:10 UTC - in response to Message 59371.

Thanks for the warning Richard, I have just fixed the error. Should not be present in the jobs starting a few minutes from now.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59373 - Posted: 30 Sep 2022 | 9:11:53 UTC - in response to Message 59372.

Yes, the next one has got well into the work zone - 1.99%. Thank you.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 39
Credit: 170,157,986
RAC: 147,226
Level
Ile
Scientific publications
watwat
Message 59374 - Posted: 30 Sep 2022 | 9:28:33 UTC
Last modified: 30 Sep 2022 | 9:29:46 UTC

Just an observation.
BOINC does not count a GPUGrid task as a task. Yesterday my finger brushed against Moo's "allow new WUs" button and it promptly downloaded 12 WUs. All 12 were running while the GPUGrid task was also running - I had never seen that before. I took remedial action; none errored out.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59375 - Posted: 30 Sep 2022 | 10:37:03 UTC

I tried to run 1 Python on a second BOINC instance.
So far, they have run on the "regular" instance, 1 task ea. on 2 RTX3070, without problems. Runtime was about 22-23hours.

On the "regular" instance I now run 2 Primegrid tasks, such ones with GPU use only, no CPU use.
Hence, to run Pythons in addition would be a nice supplement - using a lot of CPU and only part of the GPU.

After I started a Python on the second BOINC instance, all ran normally for a short while: CPU usage climbed to close to 100%, VRAM usage was close to 4 GB, system RAM some 8 GB.
However, after a few minutes, CPU usage for the Python went down to about 15%. RAM and VRAM usage stayed at the same level as before.
The progress bar in the BOINC Manager showed some 2.980% after about 3 hours, so it was clear that something was going wrong, and I aborted the task.
Stderr can be seen here: https://www.gpugrid.net/result.php?resultid=33056430

I then started another task, just to rule out that the earlier problem was a one-off. However, the same problem occurred again.

What's going wrong?

FYI, I recently ran altogether 3 Pythons on 2 RTX 3070s, which means that on one of the cards two Pythons were crunched simultaneously. No problem at all; the total runtime for each of the two tasks was just a little longer than for 1 task per GPU.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 39
Credit: 170,157,986
RAC: 147,226
Level
Ile
Scientific publications
watwat
Message 59376 - Posted: 30 Sep 2022 | 14:54:53 UTC

My question is, how can 13 tasks run on a 12-thread machine? Is it a good idea to run other tasks? Also, why was Boinc not taking into account the GPUGrid task?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1434
Credit: 3,541,899,351
RAC: 420,027
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59377 - Posted: 30 Sep 2022 | 15:16:27 UTC - in response to Message 59376.

If the 13th task is assessed - by the project and BOINC in conjunction - to require less than 1.0000 of a CPU, it will be allowed to run in parallel with a fully occupied CPU. For a GPU task, it will run at a slightly higher CPU priority, so it will steal CPU cycles from the pure CPU tasks - but on a modern multitasking OS, they won't notice the difference.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59379 - Posted: 30 Sep 2022 | 16:22:35 UTC - in response to Message 59375.

I tried to run 1 Python on a second BOINC instance.
So far, they have run on the "regular" instance, 1 task ea. on 2 RTX3070, without problems. Runtime was about 22-23hours.

On the "regular" instance I now run 2 Primegrid tasks, such ones with GPU use only, no CPU use.
Hence, to run Pythons in addition would be a nice supplement - using a lot of CPU and only part of the GPU.

After I started a Python on the second BOINC instance, all ran normally for a short while: CPU usage climbed to close to 100%, VRAM usage was close to 4 GB, system RAM some 8 GB.
However, after a few minutes, CPU usage for the Python went down to about 15%. RAM and VRAM usage stayed at the same level as before.
The progress bar in the BOINC Manager showed some 2.980% after about 3 hours, so it was clear that something was going wrong, and I aborted the task.
Stderr can be seen here: https://www.gpugrid.net/result.php?resultid=33056430

I then started another task, just to rule out that the earlier problem was a one-off. However, the same problem occurred again.

What's going wrong?

FYI, I recently ran altogether 3 Pythons on 2 RTX 3070s, which means that on one of the cards two Pythons were crunched simultaneously. No problem at all; the total runtime for each of the two tasks was just a little longer than for 1 task per GPU.



i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed).

if you want any kind of reasonable performance, you need to stop processing other projects on the same system. or at the very least, adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.

switch to Linux for even better performance.

____________

jjch
Send message
Joined: 10 Nov 13
Posts: 88
Credit: 14,970,000,871
RAC: 918,848
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59380 - Posted: 1 Oct 2022 | 1:04:27 UTC - in response to Message 59379.

Erich56

For the first two tasks I checked, you didn't let them finish extracting. The others look a bit inconclusive; however, you restarted the tasks, so that could be it.

Leave them alone and let them run. If they stall at 2% for an extended time check the stderr file to see if there is an error that should be addressed.

Look to see whether they are actually running before you abort. If it's working, it should get to the "Created Learner." step and continue running from there.

There are some jobs that just fail with an unknown cause but these haven't gotten that far yet.

8 GB of system memory is on the low side to run the Python apps successfully. It can be done, but you really shouldn't be running anything else.

Also, the Python apps need up to 48 GB of swap space configured on Windows systems. If you haven't already done so, I would suggest increasing it.

Simplify your troubleshooting and cut down on the variables. Run only one Boinc instance and one Python task. See how that goes first.

After you confirm that's working you can possibly run an additional Python task or maybe a different GPU project at the same time.

While generally you do want to maximize the usage of your system it's not good to slam it to the ceiling either.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59381 - Posted: 1 Oct 2022 | 6:13:19 UTC - in response to Message 59379.

Ian&Steve C. wrote:

i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed).

if you want any kind of reasonable performance, you need to stop processing other projects on the same system. or at the very least, adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.

switch to Linux for even better performance.

I agree, at the moment it may be "too much at once" :-)

FYI, I recently bought another PC with 2 CPUs (8-core/8-HT each) and 1 GPU; I upgraded its RAM from 128 GB to 256 GB and created a 128 GB ramdisk.
And on an existing PC with a 10-core/10-HT CPU plus 2 RTX 3070s, I upgraded the RAM from 64 GB to 128 GB (the maximum possible on this MoBo).

So no surprise that now I am just testing what's possible. And by doing this, I keep finding out, of course, that sometimes I am expecting too much.

As concerns the (low) speed of my two RTX 3070s: I have always been on the very conservative side regarding GPU temperatures, which means I have them run at about 60-61 °C, not higher.
With two such GPUs inside the same box, heat is of course an issue. Despite good airflow, in order to keep the GPUs at the above-mentioned temperature, I need to throttle them down to about 50-65% (different for each GPU). This explains the longer runtimes of the Pythons.
If I had two boxes with 1 RTX 3070 inside each, I am sure there would be no need for throttling.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59382 - Posted: 1 Oct 2022 | 7:35:58 UTC - in response to Message 59380.

jjch wrote:

Erich56

For the first two tasks I checked, you didn't let them finish extracting. The others look a bit inconclusive; however, you restarted the tasks, so that could be it.

Leave them alone and let them run. If they stall at 2% for an extended time check the stderr file to see if there is an error that should be addressed.

Look to see whether they are actually running before you abort. If it's working, it should get to the "Created Learner." step and continue running from there.

There are some jobs that just fail with an unknown cause but these haven't gotten that far yet.

8 GB of system memory is on the low side to run the Python apps successfully. It can be done, but you really shouldn't be running anything else.

Also, the Python apps need up to 48 GB of swap space configured on Windows systems. If you haven't already done so, I would suggest increasing it.

Simplify your troubleshooting and cut down on the variables. Run only one Boinc instance and one Python task. See how that goes first.

After you confirm that's working you can possibly run an additional Python task or maybe a different GPU project at the same time.

While generally you do want to maximize the usage of your system it's not good to slam it to the ceiling either.


thanks for taking your time for dealing with my problem.

well, by now it's become clear to me what the cause of the failure was:
obviously, running a Primegrid GPU task and a Python on the same GPU does not work for the Python. After a Primegrid task finished, I started another Python, and it runs well.

As concerns memory, you may have misunderstood: when I mentioned the 8 GB, I meant that I could see in the Windows Task Manager that the Python was using 8 GB. Total RAM on this machine is 64 GB, so more than enough.

The same goes for the swap space: I had set it manually to 100 GB min. and 150 GB max., so that is also more than enough.

Again - the problem has been identified anyway. Whereas I had no problem running two Pythons on the same GPU (even 3 might work), it is NOT possible to have a Python run alongside a Primegrid task.
So for me, this was a good learning process :-)

Again, thanks anyway for your time investigating my failed tasks.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59383 - Posted: 1 Oct 2022 | 7:56:44 UTC

I just discovered the following problem on the PC which consists of:

2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory

until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).

Now, while only 1 Python is running, when I push the update button in the BOINC Manager to fetch another Python, the BOINC event log tells me that no Pythons are available. That is not the case, though: the server status page shows some 550 tasks available for download; besides, I just downloaded one on another PC.
BTW: the Python task uses only some 50% of the processor, which seems logical with 2 CPUs installed.

So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I now set the "store at least work" to 10 days, and under "store up to an additional" also 10 days. However, this did not solve the problem.

There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.

What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this entry be the cause why nothing else can be downloaded and I get the message "job cache full"?

Can anyone help me to get out of this problem?

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59386 - Posted: 2 Oct 2022 | 10:41:35 UTC - in response to Message 59383.

I just discovered the following problem on the PC which consists of:

2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory

until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).

Now, while only 1 Python is running, when I push the update button in the BOINC Manager to fetch another Python, the BOINC event log tells me that no Pythons are available. That is not the case, though: the server status page shows some 550 tasks available for download; besides, I just downloaded one on another PC.
BTW: the Python task uses only some 50% of the processor, which seems logical with 2 CPUs installed.

So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I now set the "store at least work" to 10 days, and under "store up to an additional" also 10 days. However, this did not solve the problem.

There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.

What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this entry be the cause why nothing else can be downloaded and I get the message "job cache full"?

Can anyone help me to get out of this problem?


Meanwhile, the problem has become even worse:

After downloading 1 Python, it starts, and in the BOINC Manager it shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all the other tasks before on this machine.

Hence, nothing else can be downloaded.
When trying to download tasks from other projects, it shows
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).

When I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts", which is not correct; there are some 150 available for download at the moment.

Can anyone give me advice on how to solve this problem?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59387 - Posted: 2 Oct 2022 | 17:30:49 UTC

It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion.

The tasks do not take the time shown to complete and can in fact be returned well within the standard 5 day deadline.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59388 - Posted: 2 Oct 2022 | 18:10:14 UTC - in response to Message 59387.

It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion.

The tasks do not take the time shown to complete and can in fact be returned well within the standard 5 day deadline.

But how come on three of my other systems, on which I have been running Pythons for a while, the "remaining runtimes" are shown pretty correctly (+/- 24 hours)?

And also on the machine in question, until recently the time was indicated correctly.
Something must have happened yesterday, but I do not know what.

If your assumption were right, no BOINC instance could run more than 1 Python in parallel.
Didn't you say somewhere here in the forum that you are running 3 Pythons in parallel? How can a second and a third task be downloaded if the first one shows a remaining runtime of 30 or 60 days?
What remaining runtimes are shown for your Pythons once they get started?

kksplace
Send message
Joined: 4 Mar 18
Posts: 48
Credit: 445,464,249
RAC: 426,354
Level
Gln
Scientific publications
wat
Message 59389 - Posted: 2 Oct 2022 | 22:03:50 UTC - in response to Message 59386.
Last modified: 2 Oct 2022 | 22:04:41 UTC

Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to set my Resource Share much, much higher for GPUGrid to effectively share with other projects. I originally had Resource Shares of 160 for GPUGrid vs. 10 for Einstein and 40 for TN-Grid. Since the Python tasks 'use' so much CPU time in particular (at least reported CPU time), that seems to affect the Resource Share calculations as well. I had to move my Resource Share for GPUGrid to 2,000 to get it both to do two at once and to get BOINC to share with Einstein and TN-Grid roughly the way I wanted. (Nothing magic about my ratios; I'm just providing an example of how extreme I went to get it to balance.)

Regarding the estimated time to completion, I have not seen it correct on my system yet, though it is getting better. At first, Python tasks were starting at 1338 days (!) and now start at 23 days. Interesting to hear that some of yours are showing correctly! What setup are you using on the hosts showing correct times?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59390 - Posted: 3 Oct 2022 | 0:49:02 UTC - in response to Message 59388.

No, that was my teammate who is running 3X concurrent on his gpus.

He runs nothing but GPUGrid on those hosts.

I OTOH run multiple projects at the same time on my hosts. So the GPUGrid tasks have to share resources. That is a balancing act.

I run a custom client that allows me to get around the normal BOINC client and project limitations. I can ask for as much or as little amount of work that I want on any host.

Currently, I am running a single task on half a gpu in each host. I tried to run 2X on the gpu but I don't have enough resources to support 2 tasks on the host and run all my other projects at the same time. But the task runs well sharing the gpu with my other gpu projects. Keeps the gpu utilization much higher than if running only the Python task.

The GPUGrid tasks start up with multiple hundreds of days expected before completion. That drops down to only a couple of days once the task gets over 90% completion.

This is what BoincTasks is showing for the 5 tasks I am currently running on my hosts for estimated completion times.

GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00014a06316-ABOU_rnd_ppod_expand_demos25-0-1-RND9172_3 01:05:30 (02:57:04) 90.11 3.970 157d,17:33:34 10/7/2022 4:27:00 PM 3C + 0.5NV (d1) Running High P. Darksider
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00005a00032-ABOU_rnd_ppod_expand_demos25_2-0-1-RND9669_0 13:30:26 (04d,00:21:21) 237.79 34.660 27d,12:31:49 10/7/2022 4:02:16 AM 3C + 0.5NV (d2) Running High P. Numbskull
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00012a04847-ABOU_rnd_ppod_expand_demos25-0-1-RND2344_4 13:27:51 (01d,09:45:50) 83.59 48.520 10d,20:41:45 10/7/2022 4:05:00 AM 3C + 0.5NV (d1) Running High P. Pipsqueek
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00015a05913-ABOU_rnd_ppod_expand_demos25-0-1-RND9942_0 21:04:49 (05d,14:22:40) 212.49 39.610 28d,03:53:33 10/6/2022 8:04:45 PM 3C + 0.5NV (d2) Running High P. Rocinante
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00008a00044-ABOU_rnd_ppod_expand_demos25_2-0-1-RND2891_2 01:23:31 (02:53:39) 69.30 3.970 22d,07:56:42 10/7/2022 4:09:00 PM 3C + 0.5NV (d0) Running High P. Serenity

I'll finish all of the tasks before 24 hours on the high clocked hosts for maximum credit awards. I'll miss out on the 24 hour bonus by a half hour or so on the server hosts because of their slower clocks.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59391 - Posted: 3 Oct 2022 | 6:34:19 UTC - in response to Message 59389.

Regarding the estimated time to completion, I have not seen it correct on my system yet, though it is getting better. At first, Python tasks were starting at 1338 days (!) and now start at 23 days. Interesting to hear that some of yours are showing correctly! What setup are you using on the hosts showing correct times?

On one of my hosts a new Python started some 25 minutes ago. "Remaining time" is shown as 13 hrs.
No particular setup. In the past years, this host had crunched numerous ACEMD tasks. Since a few weeks ago, it's crunching Pythons. GTX980Ti. Besides, 2 "Theory" tasks from LHC are running.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59392 - Posted: 3 Oct 2022 | 10:53:17 UTC - in response to Message 59389.

kksplace wrote:

Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to change my Resource Share much, much higher for GPUGrid to effectively share other projects. ...

well, my target on this machine, in fact, is not to share Pythons with other projects.
It would simply make me happy if I could run 2 (or perhaps 3) Pythons simultaneously. The hardware should be more than sufficient.

That said, I guess in this case the Resource Share would not play any role.

BTW: as mentioned before, until early last week I ran two Pythons simultaneously on this PC. I have no idea, though, what the indicated remaining runtimes were; most probably not as high as now, otherwise I could not have downloaded and started two Pythons in parallel.

So, any idea what I can do to make this machine run at least 2 Pythons (if not 3)?

kksplace
Send message
Joined: 4 Mar 18
Posts: 48
Credit: 445,464,249
RAC: 426,354
Level
Gln
Scientific publications
wat
Message 59393 - Posted: 3 Oct 2022 | 17:05:04 UTC - in response to Message 59392.

I have limited technical knowledge and can only speak to how I got mine to work with 2 tasks; sorry I can't help more. As for getting 3 tasks, my understanding from other posts and my own attempt is that you can't without a custom client or some other behind-the-scenes work. The "2 tasks at one time" limit is a GPUGrid restriction somewhere.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59394 - Posted: 3 Oct 2022 | 17:48:02 UTC - in response to Message 59393.

Yes, the project has a maximum of 2 tasks per GPU, with a project maximum of 16 tasks.

You would normally just implement an app_config.xml file to get two tasks running concurrently on a gpu.

<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>

That has been the same quota since project inception. The only way to get around it is to spoof the gpu count via locking down the coproc_info.xml file in the BOINC folder.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59395 - Posted: 3 Oct 2022 | 19:19:15 UTC - in response to Message 59394.

...
<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>
...

Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?


Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59396 - Posted: 3 Oct 2022 | 19:33:31 UTC - in response to Message 59395.
Last modified: 3 Oct 2022 | 19:34:47 UTC

...
<app_config>

<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>

</app_config>
...

Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?




Exactly what I said in my previous message.

adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.


What Keith suggested would tell BOINC to reserve 3 whole CPU threads for each running PythonGPU task.
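As a sketch of the bookkeeping this implies (not BOINC's actual scheduler code, just the arithmetic it performs), reserving CPUs per task simply shrinks the budget BOINC will fill with other work:

```python
def free_cpu_threads(total_threads, running_tasks):
    """Threads BOINC still considers available after honoring each task's cpu_usage.

    running_tasks is a list of dicts with a 'cpu_usage' entry, mirroring the
    <cpu_usage> value from app_config.xml for each running task.
    """
    reserved = sum(task["cpu_usage"] for task in running_tasks)
    return total_threads - reserved
```

For example, two Python tasks at 3.0 each on a 32-thread host leave BOINC a budget of 26 threads for other projects' work.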
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 155
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 59397 - Posted: 4 Oct 2022 | 7:09:44 UTC

Hello!

Today I will deploy the changes tested last week in PythonGPUbeta to the PythonGPU app. The changes only affect Windows machines, and should result in smaller initial downloads and slightly lower memory requirements.

As we discussed, for now the initial data unpacking still needs to be done in two steps, but using a more recent version of 7za.exe.

I did not detect any errors in the PythonGPUbeta tasks, so hopefully this change will not affect jobs in PythonGPU either.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59398 - Posted: 4 Oct 2022 | 7:24:25 UTC - in response to Message 59395.


Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?


It tells BOINC to take 3 cpus away from the available resources that BOINC thinks it has to work with.

That tells BOINC to not commit resources to other projects that it doesn't have so that you aren't running the cpu overcommitted.

It is only for BOINC's scheduling of available resources. It does not directly impact the running of the Python task in any way. Only the scientific application itself determines how much CPU the task and application will use.

You should never run a cpu in overcommitted state because that means that EVERY application including internal housekeeping is constantly fighting for available resources and NONE are running optimally. IOW's . . . . slooooowwwly.

You can check your average cpu loading or utilization with the uptime command in the terminal. You should strive to get numbers that are less than the number of cores available to the operating system.

If you have a cpu that has 16 cores/32 threads available to the OS, you should strive to use only up to 32 threads over the averaging periods.

The uptime command besides printing out how long the system has been up and running also prints out the 1 minute / 5 minute / 15 minute system average loadings.

As an example on this AMD 5950X cpu in this daily driver this is my uptime report.

keith@Pipsqueek:~$ uptime
00:15:16 up 7 days, 14:41, 1 user, load average: 30.16, 31.76, 32.03

The cpu is right at the limit of maximum utilization of its 32 threads.
So I am running it at 100% utilization most of the time.

If the averages were higher than 32, then that shows that the cpu is overcommitted and trying to do too much all the time and not running applications efficiently.
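Keith's rule of thumb can be expressed as a small check, assuming a POSIX host (the function name is mine, not a BOINC utility):

```python
import os

def overcommitted(loadavg=None, threads=None):
    """True if any of the 1/5/15-minute load averages exceeds the thread count."""
    if loadavg is None:
        loadavg = os.getloadavg()   # POSIX only; raises OSError elsewhere
    if threads is None:
        threads = os.cpu_count() or 1
    return any(avg > threads for avg in loadavg)
```

With the 5950X figures above, overcommitted((30.16, 31.76, 32.03), 32) returns True, since the 15-minute average sits fractionally above the 32 available threads - i.e. the machine is right at the limit.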

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59399 - Posted: 4 Oct 2022 | 7:28:03 UTC - in response to Message 59397.

Thanks for the notice, abouh. Should make the Windows users a bit happier with the experience of crunching your work.

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59400 - Posted: 4 Oct 2022 | 9:41:13 UTC - in response to Message 59398.


Keith, just for my understanding:

what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?


It tells BOINC to take 3 cpus away from the available resources that BOINC thinks it has to work with.

...

You can check your average cpu loading or utilization with the uptime command in the terminal. You should strive to get numbers that are less than the number of cores available to the operating system.
...

thanks, Keith, for the thorough explanation. Now everything is clear to me.
As concerns CPU loading/utilization, so far I have been looking at the Windows Task Manager, which shows a (rough?) percentage at the top of the "CPU" column.

However, for me the question still is how I could get my host with the vast hardware resources (as described here:
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59383) to run at least 2 Pythons concurrently, as it did before.

Isn't there a way to get these much-too-high "remaining time" figures back to realistic values?
Or any other way to get more than 1 Python downloaded despite these high figures?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59403 - Posted: 4 Oct 2022 | 16:50:27 UTC - in response to Message 59400.
Last modified: 4 Oct 2022 | 16:53:15 UTC


Isn't there a way to get these much-too-high "remaining time" figures back to realistic values?
Or any other way to get more than 1 Python downloaded despite these high figures?


There isn't any way to get the estimated time remaining down to reasonable values as far as we know without a complete rewrite of the BOINC client code.

Or ask @kksplace how he managed to do it.

Try increasing your days of cache to 10 and see if you pick up the second task.

Are you running with 0.5 gpu_usage via the app_config.xml example I posted?

You can spoof 2 GPUs being detected by BOINC, which would automatically increase your GPU task allowance to 4 tasks. You need to modify the coproc_info.xml file and then lock it down to an immutable state so BOINC can't rewrite it.

Google "spoofing gpus" in the SETI and BOINC forums for how to do that.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59404 - Posted: 4 Oct 2022 | 17:21:36 UTC - in response to Message 59403.

Try increasing your days of cache to 10 and see if you pick up the second task.


Counterintuitively, this can actually cause the opposite reaction on a lot of projects.

If you ask for "too much" work, some projects will just shut you out and tell you that no work is available, even when it is. I don't know why; I just know it happens. This is probably why he can't download work.

I would actually recommend keeping this value no larger than 2 days.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1050
Credit: 1,407,860,714
RAC: 1,014,249
Level
Met
Scientific publications
watwatwatwatwat
Message 59405 - Posted: 4 Oct 2022 | 19:52:23 UTC - in response to Message 59404.

I was assuming that GPUGrid was the only project on his host.

I agree that increasing the value with more than one single project on the host is often deleterious.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 719
Credit: 4,861,478,494
RAC: 1,361,967
Level
Arg
Scientific publications
wat
Message 59406 - Posted: 4 Oct 2022 | 20:01:57 UTC - in response to Message 59405.

I think GPUGRID is one of the projects that reacts negatively to having the value too high.

but no, based on his daily contributions for this host via FreeDC, he's contributing to several projects.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 902
Credit: 3,604,880,665
RAC: 330,779
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59407 - Posted: 4 Oct 2022 | 20:35:18 UTC - in response to Message 59405.

I was assuming that GPUGrid was the only project on his host.

At the time I was trying to download and crunch 2 Pythons: YES - no other projects were running at that time.

Meanwhile, until the problem gets solved, I am running 1 CPU and 1 GPU project on this host.
