Message boards : News : Python Runtime (GPU, beta)

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 57655 - Posted: 26 Oct 2021 | 10:57:36 UTC

If anybody wants to help debug a new application, please enable the above-mentioned app.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Message 57656 - Posted: 26 Oct 2021 | 12:09:24 UTC - in response to Message 57655.

I don't see anything new on https://www.gpugrid.net/apps.php yet?

Azmodes
Joined: 7 Jan 17
Posts: 34
Credit: 1,371,429,518
RAC: 0
Message 57657 - Posted: 26 Oct 2021 | 12:30:43 UTC

GPUGRID 10/26/2021 2:01:26 PM No tasks are available for Python apps for GPU hosts

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Message 57659 - Posted: 26 Oct 2021 | 15:46:17 UTC

One system queued up and waiting.

Azmodes
Message 57660 - Posted: 26 Oct 2021 | 17:31:20 UTC

Got a 2080 Ti and two 2070 Supers ready to roll.

Ian&Steve C.
Message 57663 - Posted: 26 Oct 2021 | 20:25:32 UTC - in response to Message 57656.

I don't see anything new on https://www.gpugrid.net/apps.php yet?


OT, but I've always found it weird that this page is not linked from anywhere on the main GPUGRID site. I can only find it via a Google search or a previous bookmark. If you're just browsing through the GPUGRID site, it doesn't exist.

mg13 [HWU]
Joined: 18 Nov 09
Posts: 7
Credit: 107,006
RAC: 0
Message 57665 - Posted: 26 Oct 2021 | 23:12:12 UTC - in response to Message 57663.

I don't see anything new on https://www.gpugrid.net/apps.php yet?


OT, but I've always found it weird that this page is not linked from anywhere on the main GPUGRID site. I can only find it via a Google search or a previous bookmark. If you're just browsing through the GPUGRID site, it doesn't exist.


Yes, the link is there: go to the home page and click on "Join us". On the page that opens, in point 2 of the "Configuring your participation" section, click on "apps" and you will find it.

Ian&Steve C.
Message 57666 - Posted: 27 Oct 2021 | 0:57:38 UTC - in response to Message 57665.

Thanks. You’re right, it’s there.

But I’ll follow up that it’s a very odd place for it. Nearly all other BOINC projects put a link near or with the credit statistics, directly on the main page, or at the bottom of every page.

Bill F
Joined: 21 Nov 16
Posts: 32
Credit: 86,638,150
RAC: 128,516
Message 57668 - Posted: 27 Oct 2021 | 1:41:30 UTC

Well, I am checked and enabled, including "run Test Apps". We will see if I get a task assigned.

Thanks
Bill F

dthonon
Joined: 26 Aug 21
Posts: 1
Credit: 184,282,567
RAC: 5,779
Message 57678 - Posted: 27 Oct 2021 | 13:58:31 UTC - in response to Message 57668.

This application is enabled in my preferences, and I accept test applications, but I am not getting any Python tasks:

mer. 27 oct. 2021 15:51:04 | GPUGRID | Scheduler request completed: got 0 new tasks

Server status shows 10 tasks waiting to be sent.

bozz4science
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Message 57680 - Posted: 27 Oct 2021 | 16:44:26 UTC

Why are you not running more test tasks for the new app? Almost all of the tasks ended on one of Ian’s hosts … or is that enough feedback for now?

Anyway, the credit calculation looks almost random to me, at least for these tasks. Any chance you will fix that before this gets into production? (Comparison: a 700 sec runtime was awarded ~100k, while an admittedly lower-end card with a 110k sec runtime got 565k credit.) That seems out of proportion.
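
To put numbers on that comparison, here is a quick sketch of credit per second of runtime. It uses only the two rounded figures quoted in this post, so the real per-task values may differ somewhat:

```python
# Figures as quoted in the post (approximate, rounded by the poster).
fast_runtime_s, fast_credit = 700, 100_000
slow_runtime_s, slow_credit = 110_000, 565_000

# Credit awarded per second of runtime for each task.
fast_rate = fast_credit / fast_runtime_s
slow_rate = slow_credit / slow_runtime_s
ratio = fast_rate / slow_rate

print(f"short task: {fast_rate:.1f} credits/s")
print(f"long task:  {slow_rate:.1f} credits/s")
print(f"ratio:      {ratio:.1f}x")
```

On these figures the short task earned roughly 28 times more credit per second than the long one.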

Ian&Steve C.
Message 57682 - Posted: 28 Oct 2021 | 4:00:11 UTC - in response to Message 57680.

Wow, I didn't even notice; I was out all day. I just set the system (7x GPU) to check for work every 100 s or so, and only for beta GPU work, so it doesn't surprise me that it got so many. It would ask for 7 at once, and I guess it got lucky and asked for work before anyone else.

Beta tasks have always paid a lot of credit here for some reason.

But as with previous beta tasks, I see no indication that these tasks actually did anything on the GPU. My guess is that they ran some stuff on the CPU and then finished. I've asked before what their intentions are with these tasks, and it's clear they are doing some type of machine learning, but they don't appear to be using the GPU at all, which is very strange when they are labelled as a CUDA app.

Richard Haselgrove
Message 57683 - Posted: 28 Oct 2021 | 8:28:16 UTC

The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs.

Ian&Steve C.
Message 57686 - Posted: 28 Oct 2021 | 15:39:24 UTC - in response to Message 57683.

The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs.


cuda100* but yeah, looks to be the same app as listed in the Anaconda Python 3 category, same versioning.


Azmodes
Message 57706 - Posted: 1 Nov 2021 | 10:25:58 UTC

So, uh, that was it?

Richard Haselgrove
Message 57711 - Posted: 2 Nov 2021 | 14:13:27 UTC - in response to Message 57706.

Not quite, but...

Got a new Python task. It failed:

14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

Not yet, but I'll install it before the replacement task I got on report has a chance to start.

We shouldn't need to do that.

(and now my second Linux machine has got one too)
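
For anyone who wants to check a host before the next task starts, a minimal sketch of the same lookup the error message describes (is an executable named git on PATH?). The helper name find_git is made up for this example and is not part of the project's run.py:

```python
import shutil


def find_git():
    """Return the full path of the git executable, or None if it is not on PATH."""
    return shutil.which("git")


git_path = find_git()
if git_path is None:
    print("git not found on PATH; pip cannot clone git-based requirements")
else:
    print(f"git found at {git_path}")
```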

Richard Haselgrove
Message 57712 - Posted: 2 Nov 2021 | 14:38:54 UTC

That looks better - I'd say the GPU is running:



But what's [ObstacleTower (as boinc)]? It's appeared on my task bar, and opens to a tiny, all black, window?

Richard Haselgrove
Message 57713 - Posted: 2 Nov 2021 | 14:53:23 UTC

Second machine has acquired an ObstacleTower, too.

Interesting snippet from stderr while running (repeated many times):

(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
(raylet) /var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet) warnings.warn(
(raylet) Traceback (most recent call last):
(raylet) File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet) import ray.new_dashboard.utils as dashboard_utils
(raylet) File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet) import aiohttp.signals
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
WARNING:gym_unity:New seed 57 will apply on next reset.
WARNING:gym_unity:New starting floor 0 will apply on next reset.

Toni
Message 57714 - Posted: 2 Nov 2021 | 16:02:12 UTC - in response to Message 57711.

This is being solved server-side; there is no need to install software, of course.

Not quite, but...

Got a new Python task. It failed:

14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

Not yet, but I'll install it before the replacement task I got on report has a chance to start.

We shouldn't need to do that.

(and now my second Linux machine has got one too)

abouh
Project administrator
Project developer
Project tester
Project scientist
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Message 57715 - Posted: 2 Nov 2021 | 16:22:51 UTC - in response to Message 57711.

The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world, it is common to work in 3D world simulations like this one. This is the GitHub page of the project: https://github.com/Unity-Technologies/obstacle-tower-env

We use it as a testbench within our efforts to train populations of interacting artificial intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on the GPU, and so do the Deep Learning models that learn how to solve the simulation.

Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issues are being solved on the server side, as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.

Ian&Steve C.
Message 57716 - Posted: 2 Nov 2021 | 16:45:03 UTC - in response to Message 57715.

The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world, it is common to work in 3D world simulations like this one. This is the GitHub page of the project: https://github.com/Unity-Technologies/obstacle-tower-env

We use it as a testbench within our efforts to train populations of interacting artificial intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on the GPU, and so do the Deep Learning models that learn how to solve the simulation.

Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issues are being solved on the server side, as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.


Do you have any plans to utilize the Tensor cores present on many newer Nvidia GPUs? These are designed for machine learning tasks.

Richard Haselgrove
Message 57718 - Posted: 2 Nov 2021 | 17:50:17 UTC

Thanks for the feedback - on that basis, I'll keep pushing them through.

Had an odd finish:

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/6/model.state_dict.3073'
(raylet) /var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet) warnings.warn(
(raylet) Traceback (most recent call last):
(raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet) import ray.new_dashboard.utils as dashboard_utils
(raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet) import aiohttp.signals
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
INFO:mlagents_envs.environment:Environment shut down with return code 0.
15:21:11 (827067): ./gpugridpy/bin/python exited; CPU time 1598.264794
15:21:11 (827067): app exit status: 0x1
15:21:11 (827067): called boinc_finish(195)

"Environment shut down with return code 0" sounds like a happy ending, but "called boinc_finish(195)" is 'Child failed'.
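
A small sketch for decoding the number passed to boinc_finish. The only entry in the table is the one reported in this thread (195, shown by the client as 0xc3 EXIT_CHILD_FAILED); it is not a complete list of BOINC exit codes, and the helper name describe is made up for this example:

```python
# Exit codes seen in this thread only; NOT a complete BOINC list.
KNOWN_CODES = {
    195: "EXIT_CHILD_FAILED (child failed)",
}


def describe(code):
    """Format an exit code in decimal and hex with a label, if one is known here."""
    label = KNOWN_CODES.get(code, "not listed in this sketch")
    return f"{code} ({code:#x}): {label}"


print(describe(195))
```

The point is that the wrapper reports the failure of its child process (the Python interpreter, which exited with status 0x1), regardless of the Unity environment's own return code 0.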

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Message 57719 - Posted: 3 Nov 2021 | 6:23:50 UTC

Tried a LOT of the PythonGPU tasks today. Still no joy for a successful run.

I think they are getting further along, though, since I see progress in how far they get before the environment collapses and errors out.

Richard Haselgrove
Message 57721 - Posted: 3 Nov 2021 | 9:30:00 UTC

The next round of testing has started.

e1a10-ABOU_PPOObstacle6-0-1-RND2533_0 - I was going to say 'is running', but it's crashed already. After only 20 seconds, I got an apparently normal finish, followed by

upload failure: <file_xfer_error>
<file_name>e1a10-ABOU_PPOObstacle6-0-1-RND2533_0_0</file_name>
<error_code>-131 (file size too big)</error_code>

Richard Haselgrove
Message 57722 - Posted: 3 Nov 2021 | 9:36:16 UTC
Last modified: 3 Nov 2021 | 10:01:49 UTC

Got another from what looks like the same batch. Limit is

<max_nbytes>100000000.000000</max_nbytes>

I'll catch the output and see how big it is.

Edit - couldn't catch it ('report immediately' operated too fast). But I watched the next one in the slot directory: the output file was created right at the end, but was cleaned up almost immediately. I read it as 169 MB, but can't be certain.
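
A rough sketch of the comparison, using the max_nbytes line quoted above and my uncertain 169 MB reading. Treating "MB" as decimal megabytes is an assumption; the conclusion is the same with binary megabytes:

```python
import xml.etree.ElementTree as ET

# The limit exactly as it appears in the task description.
limit_xml = "<max_nbytes>100000000.000000</max_nbytes>"
max_nbytes = float(ET.fromstring(limit_xml).text)

# Rough reading of the output file size; assumes decimal megabytes.
observed_mb = 169
observed_bytes = observed_mb * 1_000_000

print(f"limit:    {max_nbytes / 1e6:.0f} MB")
print(f"observed: {observed_mb} MB")
print("too big for upload" if observed_bytes > max_nbytes else "fits")
```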

abouh
Message 57723 - Posted: 3 Nov 2021 | 10:25:40 UTC - in response to Message 57722.
Last modified: 3 Nov 2021 | 10:28:53 UTC

Yes the file should be 170M approx.

Richard Haselgrove
Message 57725 - Posted: 3 Nov 2021 | 10:33:04 UTC
Last modified: 3 Nov 2021 | 10:45:31 UTC

Well, I got one for you to study:

e1a8-ABOU_PPOObstacle7-0-1-RND2466_3

That was done by manually increasing the maximum allowed size in BOINC. I think that's an internal setting in the BOINC system - specifically, the workunit generator or its template files - rather than the Python package.

I've suspended work fetch for now - please let us know when the next iteration is ready to test.

Edit - this is what the upload file contained:




It seems a bit odd to return the ObstacleTower zip back to you unchanged?

abouh
Message 57726 - Posted: 3 Nov 2021 | 10:55:09 UTC - in response to Message 57711.
Last modified: 3 Nov 2021 | 11:38:59 UTC

The git-related errors should be solved now.


ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?


We will study the errors related to downloading the Obstacle Tower environment. Thank you for the feedback.

Azmodes
Message 57727 - Posted: 3 Nov 2021 | 12:19:15 UTC

Got one that ended in 195 (0xc3) EXIT_CHILD_FAILED after 15 minutes:

==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.3

Please update conda by running

$ conda update -n base -c defaults conda


13:14:06 (11501): /usr/bin/flock exited; CPU time 470.306190
13:14:06 (11501): wrapper: running ./gpugridpy/bin/python (run.py)
path: ['/var/lib/boinc-client/slots/34', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git/ext/gitdb', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python38.zip', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/lib-dynload', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/gitdb/ext/smmap']
git path: /var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git
Traceback (most recent call last):
File "run.py", line 340, in <module>
main()
File "run.py", line 53, in main
print("GPU available: {}".format(torch.cuda.is_available()))
NameError: name 'torch' is not defined
13:14:10 (11501): ./gpugridpy/bin/python exited; CPU time 1.602758
13:14:10 (11501): app exit status: 0x1
13:14:10 (11501): called boinc_finish(195)

</stderr_txt>
]]>
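
The traceback shows run.py using the name torch without it being defined at that point. I can't see run.py itself, so the following is only a sketch of one defensive pattern, not the project's actual fix: check that the module can be found and imported before calling into it. The function name gpu_available is made up for this example:

```python
import importlib
import importlib.util


def gpu_available():
    """Return True/False from torch.cuda.is_available(), or None if torch is unusable."""
    # find_spec returns None when no top-level module named "torch" is installed.
    if importlib.util.find_spec("torch") is None:
        return None
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # Installed but broken (for example, missing shared libraries).
        return None
    return torch.cuda.is_available()


print("GPU available: {}".format(gpu_available()))
```

With this pattern a missing or broken PyTorch install is reported as None instead of ending the task with a NameError.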

Keith Myers
Message 57728 - Posted: 3 Nov 2021 | 14:24:08 UTC
Last modified: 3 Nov 2021 | 14:24:55 UTC

Got five PythonGPU tasks that finished and reported after about ten minutes and were validated.

bozz4science
Message 57732 - Posted: 3 Nov 2021 | 17:18:12 UTC

My machine is a dual boot machine (Win10/Ubuntu 20.04). Are there plans for a Windows app for these tasks or should I boot into Linux to get some of these tasks?

Keith Myers
Message 57733 - Posted: 3 Nov 2021 | 19:50:45 UTC - in response to Message 57732.

Haven't heard of any posts by admin types that Windows apps will be made.
That stated, often the new beta apps are tested first on Linux to get the bugs out and then the Windows apps are generated.

Keith Myers
Message 57735 - Posted: 3 Nov 2021 | 19:55:47 UTC

This task looks to have run through all of its parameter set to complete normally at around 3000 seconds and was validated for ~ 200K credits.
https://www.gpugrid.net/result.php?resultid=32660133

PDW
Joined: 7 Mar 14
Posts: 15
Credit: 1,000,002,525
RAC: 0
Message 57737 - Posted: 3 Nov 2021 | 20:05:30 UTC - in response to Message 57735.

Did you notice whether it used the GPU and, if it did, what percentage?

I had one that ran for about 3 hours before failing, never saw the fans running during that time.

Ian&Steve C.
Message 57740 - Posted: 3 Nov 2021 | 20:28:14 UTC
Last modified: 3 Nov 2021 | 20:41:52 UTC

just ran this one on my RTX 3080Ti: https://www.gpugrid.net/result.php?resultid=32660184

16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper: running /usr/bin/flock (/home/ian/BOINC/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /home/ian/BOINC/projects/www.gpugrid.net/miniconda &&
/home/ian/BOINC/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")

0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : urllib3-1.25.8-py37_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : libedit-3.1.20181209-hc058e9b_0.conda: 6%|5 | 2/35 [00:00<00:10, 3.04it/s]
Extracting : libgcc-ng-9.1.0-hdf63c60_0.conda: 9%|8 | 3/35 [00:00<00:10, 3.04it/s]
Extracting : ld_impl_linux-64-2.33.1-h53a641e_7.conda: 11%|#1 | 4/35 [00:00<00:10, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 14%|#4 | 5/35 [00:00<00:09, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : tqdm-4.46.0-py_0.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : ca-certificates-2020.1.1-0.conda: 20%|## | 7/35 [00:00<00:06, 4.16it/s]
Extracting : wheel-0.34.2-py37_0.conda: 23%|##2 | 8/35 [00:00<00:06, 4.16it/s]
Extracting : libstdcxx-ng-9.1.0-hdf63c60_0.conda: 26%|##5 | 9/35 [00:00<00:06, 4.16it/s]
Extracting : certifi-2020.4.5.1-py37_0.conda: 29%|##8 | 10/35 [00:00<00:06, 4.16it/s]
Extracting : readline-8.0-h7b6447c_0.conda: 31%|###1 | 11/35 [00:00<00:05, 4.16it/s]
Extracting : ncurses-6.2-he6710b0_1.conda: 34%|###4 | 12/35 [00:00<00:05, 4.16it/s]
Extracting : conda-package-handling-1.6.1-py37h7b6447c_0.conda: 37%|###7 | 13/35 [00:00<00:05, 4.16it/s]
Extracting : chardet-3.0.4-py37_1003.conda: 40%|#### | 14/35 [00:00<00:05, 4.16it/s]
Extracting : zlib-1.2.11-h7b6447c_3.conda: 43%|####2 | 15/35 [00:00<00:04, 4.16it/s]
Extracting : six-1.14.0-py37_0.conda: 46%|####5 | 16/35 [00:00<00:04, 4.16it/s]
Extracting : pycparser-2.20-py_0.conda: 49%|####8 | 17/35 [00:00<00:04, 4.16it/s]
Extracting : libffi-3.3-he6710b0_1.conda: 51%|#####1 | 18/35 [00:00<00:04, 4.16it/s]
Extracting : pycosat-0.6.3-py37h7b6447c_0.conda: 54%|#####4 | 19/35 [00:00<00:03, 4.16it/s]
Extracting : cffi-1.14.0-py37he30daa8_1.conda: 57%|#####7 | 20/35 [00:00<00:03, 4.16it/s]
Extracting : _libgcc_mutex-0.1-main.conda: 60%|###### | 21/35 [00:00<00:03, 4.16it/s]
Extracting : pyopenssl-19.1.0-py37_0.conda: 63%|######2 | 22/35 [00:00<00:03, 4.16it/s]
Extracting : idna-2.9-py_1.conda: 66%|######5 | 23/35 [00:00<00:02, 4.16it/s]
Extracting : pysocks-1.7.1-py37_0.conda: 69%|######8 | 24/35 [00:00<00:02, 4.16it/s]
Extracting : xz-5.2.5-h7b6447c_0.conda: 71%|#######1 | 25/35 [00:00<00:02, 4.16it/s]
Extracting : setuptools-46.4.0-py37_0.conda: 74%|#######4 | 26/35 [00:00<00:02, 4.16it/s]
Extracting : ruamel_yaml-0.15.87-py37h7b6447c_0.conda: 77%|#######7 | 27/35 [00:00<00:01, 4.16it/s]
Extracting : cryptography-2.9.2-py37h1ba5d50_0.conda: 80%|######## | 28/35 [00:00<00:01, 4.16it/s]
Extracting : openssl-1.1.1g-h7b6447c_0.conda: 83%|########2 | 29/35 [00:00<00:01, 4.16it/s]
Extracting : sqlite-3.31.1-h62c20be_1.conda: 86%|########5 | 30/35 [00:00<00:01, 4.16it/s]
Extracting : pip-20.0.2-py37_3.conda: 89%|########8 | 31/35 [00:00<00:00, 4.16it/s]
Extracting : yaml-0.1.7-had09818_2.conda: 91%|#########1| 32/35 [00:00<00:00, 4.16it/s]
Extracting : requests-2.23.0-py37_0.conda: 94%|#########4| 33/35 [00:00<00:00, 4.16it/s]
Extracting : conda-4.8.3-py37_0.tar.bz2: 97%|#########7| 34/35 [00:00<00:00, 4.16it/s]


==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.3

Please update conda by running

$ conda update -n base -c defaults conda


16:21:21 (1841951): /usr/bin/flock exited; CPU time 61.036800
16:21:21 (1841951): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-wwv7ghqo
/home/ian/BOINC/slots/15/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
warnings.warn(
Downloading...
From: https://storage.googleapis.com/obstacle-tower-build/v4.1/obstacletower_v4.1_linux.zip
To: /home/ian/BOINC/slots/15/obstacletower_v4.1_linux.zip

0%| | 0.00/170M [00:00<?, ?B/s]
1%| | 2.10M/170M [00:00<00:08, 19.9MB/s]
6%|▌ | 10.5M/170M [00:00<00:02, 56.2MB/s]
11%|█▏ | 19.4M/170M [00:00<00:02, 70.8MB/s]
16%|█▋ | 27.8M/170M [00:00<00:02, 70.6MB/s]
22%|██▏ | 37.7M/170M [00:00<00:01, 76.7MB/s]
28%|██▊ | 47.7M/170M [00:00<00:01, 79.0MB/s]
34%|███▎ | 57.1M/170M [00:00<00:01, 82.8MB/s]
38%|███▊ | 65.5M/170M [00:00<00:01, 80.4MB/s]
43%|████▎ | 73.9M/170M [00:00<00:01, 81.2MB/s]
49%|████▊ | 82.8M/170M [00:01<00:01, 83.4MB/s]
54%|█████▎ | 91.2M/170M [00:01<00:00, 80.8MB/s]
59%|█████▉ | 101M/170M [00:01<00:00, 81.3MB/s]
65%|██████▍ | 110M/170M [00:01<00:00, 83.7MB/s]
70%|██████▉ | 119M/170M [00:01<00:00, 79.0MB/s]
75%|███████▍ | 127M/170M [00:01<00:00, 80.2MB/s]
80%|████████ | 137M/170M [00:01<00:00, 79.2MB/s]
85%|████████▌ | 145M/170M [00:01<00:00, 80.1MB/s]
90%|█████████ | 154M/170M [00:01<00:00, 79.1MB/s]
96%|█████████▌| 163M/170M [00:02<00:00, 82.6MB/s]
100%|██████████| 170M/170M [00:02<00:00, 78.6MB/s]
16:21:54 (1841951): ./gpugridpy/bin/python exited; CPU time 22.798227
16:21:59 (1841951): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>e1a6-ABOU_PPOObstacle6-0-1-RND7771_2_0</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>


It ran for about 2 minutes and errored out. File size too big? How big could the file get in 2 minutes? lol. It looks like everyone in this WU chain is having the same issue, though. https://www.gpugrid.net/workunit.php?wuid=27085637 Bad WU?

And I saw no evidence that it ever touched the GPU; refreshing nvidia-smi every 2 seconds showed no process running on the GPU. It must still be using only the CPU.

Can an admin please directly comment if these are actually using the GPU or not? I know an admin mentioned that they were only doing CPU work "as a test". Is that still the case? Having GPU tasks that only use the CPU core is very confusing.
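
For anyone who wants to log the same check rather than eyeball it, a minimal sketch that polls nvidia-smi every 2 seconds. It assumes nvidia-smi is on PATH and that each output line looks like "31, 8000" (utilization in percent, memory in MiB) with the csv,noheader,nounits format; the helper names are made up for this example:

```python
import subprocess
import time

# Query flags for GPU utilization (%) and used memory (MiB), one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_line(line):
    """Parse one 'util, mem' CSV line into a pair of integers."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample():
    """Return a list of (util %, MiB used) per GPU, or None if nvidia-smi fails."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return [parse_line(line) for line in out.splitlines() if line.strip()]


def poll(samples=5, interval_s=2.0):
    """Print a few samples, interval_s seconds apart."""
    for _ in range(samples):
        print(sample())
        time.sleep(interval_s)
```

Calling poll() while a task is running prints one list per sample; readings that stay at 0% with no extra memory in use would back up the "CPU only" impression.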

Keith Myers
Message 57741 - Posted: 3 Nov 2021 | 20:33:04 UTC

The ones that partially ran and were validated only used 31% of the GPU in nvidia-smi.

The one task that appears to have run through to normal completion did so while I was out of the house, so unfortunately I did not see it run.

Will have to wait for more to observe.

Keith Myers
Message 57743 - Posted: 3 Nov 2021 | 21:41:16 UTC
Last modified: 3 Nov 2021 | 21:44:30 UTC

Looks like the tasks drop to 1% utilization for a few seconds before returning to hover around 10-13% utilization. I was watching one on a 2070, and it had been running for almost 60 minutes in nvidia-smi. They are marked as C+G type in that program.

I think I killed it when I pulled up htop to look at how much CPU it was using, because it finished with an error at the same instant that htop populated the screen.

abouh
Message 57748 - Posted: 4 Nov 2021 | 9:27:01 UTC - in response to Message 57725.

The contents of the obstacletower.zip downloaded file are necessary to generate the data required for the machine learning agent to train. That is why the file itself is not modified. Only used to generate the training data.

The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.

Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory.


____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57750 - Posted: 4 Nov 2021 | 9:31:48 UTC - in response to Message 57748.

The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.

That makes much more sense. Standing by for the next round of debugging... :-)

bozz4science
Send message
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Scientific publications
wat
Message 57751 - Posted: 4 Nov 2021 | 9:47:15 UTC
Last modified: 4 Nov 2021 | 9:47:32 UTC

That's the next bad news for me as my GPU is maxed out at 6GB. Without upgrading my GPU and that's not likely gonna be soon, I suppose I have to give up on these types of tasks - at least for the time being. Thanks for the update though

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57752 - Posted: 4 Nov 2021 | 12:31:04 UTC - in response to Message 57748.

Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory.



why such low GPU utilization? and 8000? or do you mean 800? 8GB? or 800MB?
____________

bozz4science
Send message
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Scientific publications
wat
Message 57753 - Posted: 4 Nov 2021 | 12:42:20 UTC
Last modified: 4 Nov 2021 | 12:44:21 UTC

I can only speculate in regards to the former one. But your latter question likely resolves to 8,000 MiB (Mebibyte) which is just another convention to count bytes – if he indeed meant to write 8,000.

While k (kilo), M (Mega), G (Giga) and T (Tera) are the SI-prefix units and are computed as base 10 by 10^3, 10^6, 10^9 and 10^12 respectively, the binary prefix units of Ki (Kibi), Mi (Mebi), Gi (Gibi) and Ti (Tebi) are computed as base 2 by 2^10, 2^20, 2^30 and 2^40. As such M/Mi = (10^6/2^20) ~ 95.37% or a difference of ~4.63% between the SI and binary prefix units.

1 kB = 1000 B
1 KiB = 1024 B
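
To put numbers on the figure under discussion, here is a small sketch of the conversion. The constant and function names are made up for illustration; only the prefix definitions above are assumed.

```python
# Binary and SI prefixes, as defined above.
MIB = 2**20   # 1 MiB in bytes
GIB = 2**30   # 1 GiB in bytes
GB = 10**9    # 1 GB in bytes


def mib_to_bytes(mib):
    """Convert a size given in MiB (the unit nvidia-smi reports) to bytes."""
    return mib * MIB


required = mib_to_bytes(8000)
print(required)          # 8388608000 bytes
print(required / GB)     # 8.388608 (SI gigabytes)
print(required / GIB)    # 7.8125 (binary gibibytes)
```

So 8,000 MiB is a little under 8 GiB, and a little over 8 GB.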

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57754 - Posted: 4 Nov 2021 | 12:52:03 UTC - in response to Message 57753.

yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB
____________

bozz4science
Send message
Joined: 22 May 20
Posts: 109
Credit: 68,936,176
RAC: 0
Level
Thr
Scientific publications
wat
Message 57755 - Posted: 4 Nov 2021 | 12:58:04 UTC
Last modified: 4 Nov 2021 | 12:58:33 UTC

ah, all right. didn't mean to offend you if that's what I did. still don't understand their beta testing procedure anyway. so far not many tasks have been run, only few of them successfully, but meanwhile nearly no information has been shared, rendering the whole procedure rather opaque and leaving others in the dark wondering about their piles of unsuccessful tasks. and the little information that is indeed shared seems to conflict a lot with the user experience and observations. for a ML task 8 GB isn't atypical though

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57756 - Posted: 4 Nov 2021 | 13:06:37 UTC - in response to Message 57755.

I agree that lots of memory use wouldn't be atypical for AI/ML work. and also agree that the admins should be a little more transparent about what these tasks are doing and the expected behaviors. it seems so far they have tons and tons of errors, then the admins come back and say they fixed the errors, then just more errors again. I'd also like to know if these are using the Tensor cores on RTX GPUs.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57757 - Posted: 4 Nov 2021 | 13:17:33 UTC

I think the Beta testing process is (as usual anywhere) very much an incremental process. It will have started with small test units, and as each little buglet surfaces and is solved, the process moves on to test a later segment that wasn't accessible until the previous problem had been overcome.

Thus - Abouh has confirmed that yesterday's upload file size problem was caused by including a source data file in the output - "Should not be returned".

I also noted that some of Keith's successful runs were resends of tasks which had failed on other machines - some of them generic problems which I would have expected to cause a failure on his machine too. So it seems that dynamic fixes may have been applied too. Normally, a new BOINC replication task is an exact copy of its predecessor, but I don't think that can be automatically assumed during this Beta phase.

In particular, Keith's observation that one test task only used 200 MB of GPU memory isn't necessarily a foolproof guide to the memory demand of later tests.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57758 - Posted: 4 Nov 2021 | 13:22:50 UTC - in response to Message 57757.

which is why I asked for clarification in light of the disparity between expected and observed behaviors.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57761 - Posted: 4 Nov 2021 | 17:05:37 UTC - in response to Message 57754.

yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB

Yes, I have watched tasks complete fully to a proper boinc finish end and I never saw more than 290MB of gpu memory reported in nvidia-smi at a max 13% utilization.

Unless nvidia-smi has an issue in reporting gpu RAM used, the 8GB of memory post is out of line. Or the tasks the scientist-developer mentioned haven't been released to us out of the laboratory yet.
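
For anyone who would rather log these readings than watch the screen, here is a minimal sketch. It assumes nvidia-smi is on the PATH and uses its CSV query output; the function names and dictionary keys are made up for illustration, and a field reported as "[N/A]" would make the integer conversion fail.

```python
import subprocess

# nvidia-smi query for one CSV line per GPU: index, utilization (%), memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines such as '0, 13, 290' into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling sample_gpus() in a loop with a one second sleep and keeping the largest mem_mib seen would give the peak memory use over a whole task, including the parts nobody was at the keyboard for.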

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57766 - Posted: 5 Nov 2021 | 10:37:31 UTC - in response to Message 57752.

We are progressing in our debugging and have managed to solve several errors, but as mentioned in a previous post, it is an incremental process.

We are trying to train an AI agent using reinforcement learning, which generally interleaves stages in which the agent collects data (a less GPU-intensive process) and stages in which the agent learns from that data. The nature of the problem, in which data is progressively generated, accounts for a lower GPU utilisation than in supervised machine learning, although we will work to progressively make it more efficient once debugging is completed.

Since the obstacle tower environment (https://github.com/Unity-Technologies/obstacle-tower-env), the source of data, also runs on the GPU, during the learning stage the neural network and the training data together with the environment occupy approximately 8,000 MiB (Mebibyte, was not a typo) of GPU memory when checked locally with nvidia-smi.

Basically, the python script has the following steps:
step 1: Defining the conda environment with all dependencies.
step 2: Downloading obstacletower.zip, a necessary file used to generate the data.
step 3: Initialising the data generator using the contents of obstacletower.zip.
step 4: Creating the AI agent and alternating data collection and data training stages.
step 5: Returning the trained AI agent, and not obstacletower.zip.

The GPU is used only after reaching steps 4 and 5. Some of the jobs that succeeded but barely used the GPU were sent to test that the problems in step 1 and step 2 had indeed been solved (most of them solved by Keith Myers).

We noticed that most recent failed jobs returned the following error at step 3:

mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.

We are working to solve it. If step 3 is completed without errors, jobs reaching steps 4 and 5 should be using GPU.

We hope that helped shed some light on our work and the recent results. We will try to solve any further doubts and inform about our progress.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57767 - Posted: 5 Nov 2021 | 12:10:16 UTC - in response to Message 57766.

Thanks for the more detailed answer.

regarding the 8GB of memory used.

-which step of the process does this happen?
-was Keith's nvidia-smi screenshot that he posted in another thread showing low memory use, from an earlier unit that did not require that much VRAM?
-will these units fail from too little VRAM?
-what will you do or are you doing about GPUs with less than 8GB VRAM, or even with 8GB?
-do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57768 - Posted: 5 Nov 2021 | 13:57:40 UTC - in response to Message 57767.

-do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB?

It's certainly possible to set such a filter: referring again to Specifying plan classes in C++, the scheduler can check a CUDA plan class specification like

if (!strcmp(plan_class, "cuda23")) {
    if (!cuda_check(c, hu,
        100,        // minimum compute capability (1.0)
        200,        // max compute capability (2.0)
        2030,       // min CUDA version (2.3)
        19500,      // min display driver version (195.00)
        384*MEGA,   // min video RAM
        1.,         // # of GPUs used (may be fractional, or an integer > 1)
        .01,        // fraction of FLOPS done by the CPU
        .21         // estimated GPU efficiency (actual/peak FLOPS)
    )) {
        return false;
    }
}

We last discussed that code in connection with compute capability, but I think we're still having problems implementing filters via tools like that.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57769 - Posted: 5 Nov 2021 | 14:50:31 UTC - in response to Message 57767.
Last modified: 5 Nov 2021 | 14:51:06 UTC

At step 3, initialising the environment requires a small amount of GPU memory (somewhere around 1GB). At step 4 the AI agent is initialised and trained, and a data storage class and a neural network are created and placed on the GPU. This is when more memory is required. However, in the next round of tests we will lower the GPU memory requirements of the script while debugging step 3. Eventually for steps 4 and 5 we expect it to require the 8G mentioned earlier.

Keith's nvidia-smi screenshot showing a job with low memory use was a job that returned after step 2, to verify problems in steps 1 and 2 had been solved.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57770 - Posted: 5 Nov 2021 | 17:08:45 UTC - in response to Message 57769.

So was this WU https://www.gpugrid.net/result.php?resultid=32660133 the one that was completed after steps 1 and 2?
Or after steps 4 and 5?

I never got to witness this one in realtime.

I had nvidia-smi polling update set at 1 second and I never saw the gpu memory usage go above 290MB for that screenshot. It was not taken from the task linked above.

The BOINC completion percentage just went to 10% and stayed there and never showed 100% completion when it finished. Think that is an issue with BOINC historically.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57771 - Posted: 5 Nov 2021 | 17:16:26 UTC

The environment and the Python interface have compatible versions.

Is the reason why I was able to complete a workunit properly because of having my local python environment match the zipped wrapper python interface?

I use several pypi applications that have probably set up the python environment variable.

Is there something I can dump out of the host that completed the workunit properly that will help you debug the application package?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57776 - Posted: 8 Nov 2021 | 14:54:09 UTC - in response to Message 57770.

This one completed the whole python script. Including steps 4 and 5. Should have used the GPU.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57777 - Posted: 8 Nov 2021 | 16:00:17 UTC - in response to Message 57776.

Thanks for confirming the one I completed used the gpu.

Greger
Send message
Joined: 6 Jan 15
Posts: 74
Credit: 14,342,526,749
RAC: 31,313,135
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 57778 - Posted: 9 Nov 2021 | 21:23:30 UTC

Did a check on one host running GPUGridpy units.

e4a6-ABOU_ppo_gym_demos3-0-1-RND1018_0
Run time 4,999.53
GPU Memory: nvidia-smi report 2027MiB

No check-pointing yet but works well.

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57779 - Posted: 10 Nov 2021 | 8:42:17 UTC - in response to Message 57778.
Last modified: 10 Nov 2021 | 8:45:13 UTC

We sent out some jobs yesterday and almost all finished successfully.

We are still working on avoiding the following error related to the Obstacle Tower environment:

mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.

However, to test the rest of the code we tried with another set of environments that are less problematic (https://gym.openai.com/). The successful jobs used these environments. While we find and test a solution for the Obstacle Tower locally we will continue to send jobs with these environments to test the rest of the code.

Note that reinforcement learning (RL) techniques are independent of the environment. The environment represents the world where the AI agent learns intelligent behaviours. Switching to another environment simply means applying the learning technique to a different problem that can be equally challenging (placing the agent in a different world). Thus, we will now finish debugging the app with these Gym environments, simply because they are less prone to errors, and, once we know the only possible source of problems is the environment, consider solving the others.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57782 - Posted: 10 Nov 2021 | 13:47:20 UTC - in response to Message 57779.
Last modified: 10 Nov 2021 | 14:08:41 UTC

I had a few failures:

https://www.gpugrid.net/result.php?resultid=32660680
and
https://www.gpugrid.net/result.php?resultid=32660448

seems to be a bad WU on both instances since all wingmen are erroring in the same way.

mainly used ~6-7% GPU utilization on my 3080Ti, with intermittent spikes to ~20% every 10s or so. power use near idle, GPU memory utilization around 2GB, and system memory use around 4.8GB. make sure your system has enough memory in multi-GPU systems.
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57785 - Posted: 10 Nov 2021 | 14:48:30 UTC - in response to Message 57782.

Thank you for the feedback. We had detected the error in

https://www.gpugrid.net/result.php?resultid=32660448

but not the one in

https://www.gpugrid.net/result.php?resultid=32660680


Having alternating phases of lower and higher GPU utilisation is normal in Reinforcement Learning, as the agent alternates between data collection (generally low GPU usage) and training (higher GPU memory and utilisation). Once we solve most of the errors we will focus on maximizing GPU efficiency during the training phases.
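
The alternation described above can be pictured with a toy loop. This is only a stand-in written with the standard library, not the project's training code: "collecting" draws random numbers in place of environment steps, and "learning" nudges a single estimate in place of a neural network update. All names are made up for illustration.

```python
import random


def collect(n_steps, rng):
    """Data collection phase: stands in for stepping the environment (low GPU use)."""
    return [rng.random() for _ in range(n_steps)]


def learn(batch, estimate, lr=0.1):
    """Learning phase: stands in for the network update (where the GPU would be busy)."""
    target = sum(batch) / len(batch)
    return estimate + lr * (target - estimate)


def train(n_iterations, n_steps, seed=0):
    """Alternate collection and learning, recording the order of the phases."""
    rng = random.Random(seed)
    estimate = 0.0
    phases = []
    for _ in range(n_iterations):
        batch = collect(n_steps, rng)
        phases.append("collect")
        estimate = learn(batch, estimate)
        phases.append("learn")
    return estimate, phases
```

In a loop of this shape the accelerator only has real work during the learn calls, which matches the low baseline utilisation with periodic spikes reported earlier in the thread.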
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57786 - Posted: 10 Nov 2021 | 15:04:20 UTC - in response to Message 57785.
Last modified: 10 Nov 2021 | 15:09:17 UTC

have you considered creating a modified app that will use the RTX (and other) GPU's onboard Tensor cores? it should speed up things considerably.

https://www.quora.com/Does-tensorflow-and-pytorch-automatically-use-the-tensor-cores-in-rtx-2080-ti-or-other-rtx-cards

I'm guessing in addition to making the needed configuration changes, you'd need to adjust your scheduler to only send to cards with Tensor cores (GeForce RTX cards, TitanV, Tesla/QuadroRTX cards from Volta forward)
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57789 - Posted: 10 Nov 2021 | 16:28:26 UTC - in response to Message 57786.
Last modified: 10 Nov 2021 | 16:30:20 UTC

information for pytorch here:

https://github.com/NVIDIA/apex
https://nvidia.github.io/apex/
____________

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57790 - Posted: 10 Nov 2021 | 17:04:13 UTC - in response to Message 57786.

We are using PyTorch to train our agents, and for now we have not considered using mixed precision, which seems to be required for the Tensor cores.

It could be an interesting possibility to reduce memory requirements and speed up training processes. I have to admit that I do not know how it affects performance in reinforcement learning algorithms, but it is an interesting option.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57792 - Posted: 10 Nov 2021 | 18:53:33 UTC
Last modified: 10 Nov 2021 | 19:48:25 UTC

Getting errors in the test5 run, like

e2a16-ABOU_ppod_gym_test5-0-1-RND0379_1
e2a10-ABOU_ppod_gym_test5-0-1-RND0874_1

And on the test6 run. This time, the error seems to be in placing the expected task files in the slot directory, prior to starting the main run.

e3a17-ABOU_ppod_gym_test6-0-1-RND2029_0
e3a11-ABOU_ppod_gym_test6-0-1-RND1260_4

Both have

File "run.py", line 393, in <module>
main()
File "run.py", line 106, in main
feature_extractor_network=get_feature_extractor(args.nn),
File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/pytorchrl/agent/actors/feature_extractors/__init__.py", line 19, in get_feature_extractor
raise ValueError("Specified model not found!")
ValueError: Specified model not found!

mmonnin
Send message
Joined: 2 Jul 16
Posts: 332
Credit: 3,772,896,065
RAC: 4,765,302
Level
Arg
Scientific publications
watwatwatwatwat
Message 57794 - Posted: 10 Nov 2021 | 23:56:58 UTC
Last modified: 10 Nov 2021 | 23:57:13 UTC

I got one that worked today. Then 6 more that didn't on the same PC
https://www.gpugrid.net/workunit.php?wuid=27086033

mmonnin
Send message
Joined: 2 Jul 16
Posts: 332
Credit: 3,772,896,065
RAC: 4,765,302
Level
Arg
Scientific publications
watwatwatwatwat
Message 57795 - Posted: 11 Nov 2021 | 2:04:09 UTC

I got another. So far it is running
Over 4 CPU threads at 1st then 1 thread for 1st 4min
13% completed back to 10% then no more progression
At 10% then GPU load at 3-5% 875MB VRAM
78min so far.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,552,024
RAC: 10,814,021
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57796 - Posted: 11 Nov 2021 | 6:49:43 UTC

I've got several GPU Python beta tasks at my triple GPU Host #480458
Several of them have succeeded after around 5000 seconds execution time.
But three of these tasks have exceeded this time.
Task e1a20-ABOU_ppod_gym_test-0-1-RND4563_6 failed after 11432 seconds.
Task e1a6-ABOU_ppod_gym_test-0-1-RND1186_1 failed after 18784 seconds.
Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.
This last task is theoretically running on device 1.
But it seems to be effectively running at device 0, sharing the same device with an ACEMD3 regular task e14s132_e10s98p1f905-ADRIA_AdB_KIXCMYB_HIP-0-2-RND7676_5.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57797 - Posted: 11 Nov 2021 | 7:13:49 UTC

I've got the same thing going on. BOINC says the task is running on Device2 while in reality it is sharing Device0 along with an Einstein GRP task.

This is the task https://www.gpugrid.net/result.php?resultid=32661276

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,552,024
RAC: 10,814,021
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57798 - Posted: 11 Nov 2021 | 9:30:27 UTC - in response to Message 57796.

Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.

The risk of beta testing: It finally failed after 42555 seconds.
I hope this is somehow useful for debugging...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57799 - Posted: 11 Nov 2021 | 9:50:17 UTC - in response to Message 57798.

Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'

The same for two of your predecessors on this workunit. Is there any way we could avoid re-inventing the wheel (slowly) for errors like this?

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57800 - Posted: 11 Nov 2021 | 13:44:16 UTC - in response to Message 57799.
Last modified: 11 Nov 2021 | 13:48:31 UTC

The excessively long training time problem and the problem related to

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'

have been fixed now. Most jobs sent today are being completed successfully. The reported issues were very helpful for debugging.




Progress:

The core research idea is to train populations of reinforcement learning agents that learn independently for a certain amount of time and, once they return to the server, pool their learned knowledge with other agents to create a new generation of agents equipped with the information acquired by previous generations. Each GPUgrid job is one of these agents doing some training independently. In that sense, the prefix of the job name identifies the generation and the number of the agent (e.g. e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 refers to the epoch or generation number 1 and the agent number 2 within that generation).
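
As a small illustration of that naming scheme, the generation and agent numbers can be pulled out of a job name like this. The regular expression and function name are made up for the example; only the "e<generation>a<agent>-" prefix described above is assumed.

```python
import re

# Prefix of the form e<generation>a<agent>- at the start of the job name.
JOB_PREFIX = re.compile(r"^e(?P<generation>\d+)a(?P<agent>\d+)-")


def parse_job_name(name):
    """Return (generation, agent) for a job name such as 'e1a2-ABOU_...'."""
    match = JOB_PREFIX.match(name)
    if match is None:
        raise ValueError(f"unexpected job name: {name!r}")
    return int(match["generation"]), int(match["agent"])


print(parse_job_name("e1a2-ABOU_ppod_gym_test-0-1-RND2391_3"))    # (1, 2)
print(parse_job_name("e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0"))  # (8, 16)
```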

The debugging done recently has allowed more and more of these jobs to finish. An experiment currently running has already reached a 3rd generation of agents.

As mentioned in an earlier post, we are working now with OpenAI gym environments (https://gym.openai.com/)
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57801 - Posted: 11 Nov 2021 | 15:48:48 UTC - in response to Message 57800.
Last modified: 11 Nov 2021 | 15:54:02 UTC

Are you working on fixing the issue that the tasks only run on Device#0 in BOINC?

Even when Device#0 is already occupied by another task from another project?

That leaves at least one device doing nothing because BOINC thinks it is occupied.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,552,024
RAC: 10,814,021
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57802 - Posted: 11 Nov 2021 | 17:15:48 UTC - in response to Message 57801.
Last modified: 11 Nov 2021 | 17:16:26 UTC

Are you working on fixing the issue that the tasks only run on Device#0 in BOINC?

+1

At this other example, Device 0 is running 1 Gpugrid ACEMD3 task and 2 Python GPU tasks.
Meanwhile, Device 1 and Device 2 remain idle.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 57803 - Posted: 11 Nov 2021 | 17:25:26 UTC - in response to Message 57802.

weird, I thought this problem had been fixed already. I guess I never realized since I've only been running the beta tasks on my single GPU system.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57804 - Posted: 11 Nov 2021 | 17:31:21 UTC
Last modified: 11 Nov 2021 | 17:57:23 UTC

Count me in on this, too.

My client is running e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0 on device 1.

I have GPUGrid excluded from device 0, so I can run tasks from other projects in the faster PCIe slot while testing. But ...



Well, despite running on the wrong card, it finished and passed the GPUGrid validation test. I've swapped over the exclusion, and BOINC and GPUGrid are now in agreement that card 0 is the card to use.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57805 - Posted: 11 Nov 2021 | 18:32:50 UTC

Hard to tell from the error code snippet whether the tasks are hardwired to run on Device#0 or whether the error snippet is just the result of where the task actually has run.

[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
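
To illustrate the difference between the two cases: a script that always builds the string 'cuda:0' is hardwired, while one that derives the string from the device number it was given is not. The sketch below assumes the device number arrives as a "--device N" command-line argument; that is an assumption for the example, not something confirmed in this thread, and the function name is made up.

```python
import argparse


def device_string(argv):
    """Build a PyTorch-style device string from an optional --device N argument."""
    parser = argparse.ArgumentParser()
    # Fall back to device 0 only when no device number is passed at all.
    parser.add_argument("--device", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)
    return f"cuda:{args.device}"


print(device_string(["--device", "2"]))  # cuda:2
print(device_string([]))                 # cuda:0
```

If the script ignored the argument, every task would report cuda:0 no matter which card BOINC had reserved for it.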

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57817 - Posted: 12 Nov 2021 | 19:33:08 UTC
Last modified: 12 Nov 2021 | 19:39:58 UTC

Well, I have a new python task running by itself now on Device#2.
So it may mean they have fixed the issue where the tasks always ran on Device#0.

See this new output in the stderr.txt that looks like it is allocating to Device#2
It hasn't been there in any other of my tasks till just now for this new task.

Found GPU: True, Number 2 - 2

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57818 - Posted: 12 Nov 2021 | 19:51:54 UTC - in response to Message 57817.

Yes, we have fixed the issue. It should be fine now. Please, let us know if you encounter any new device placement error. We just ran the tests and, as you mention, we print the device number in the stderr file.

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57819 - Posted: 12 Nov 2021 | 20:08:22 UTC - in response to Message 57818.

Thank you for fixing this issue. I don't know whether you test in a multi-gpu environment or not. I suspect a lot of projects don't.

But there are lots of us that run many multi-gpu hosts that have been bit by this bug often.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,552,024
RAC: 10,814,021
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57821 - Posted: 12 Nov 2021 | 20:15:16 UTC - in response to Message 57818.

Thank you very much for your continuous support.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,552,024
RAC: 10,814,021
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57824 - Posted: 13 Nov 2021 | 10:44:54 UTC
Last modified: 13 Nov 2021 | 10:46:38 UTC

Overnight, each of my 6 currently active Linux hosts received at least one task of the kind ...-ABOU_ppod_gym_test9-0-1-...
All the tasks gave a valid result, none of them errored. This is promising!

My triple-GPU host happened to receive several tasks in a short time, and three of them were executed concurrently.
It caught my attention that there was a drastic change in overall system temperatures when transitioning from executing highly GPU/CPU intensive PrimeGrid tasks to the Gpugrid tasks.

On the other hand, every GPU was effectively executing its own task, as shown at the following nvidia-smi screenshot:



This confirms the Keith Myers observation that the previous task-to-GPU assignment problem in multi-GPU systems is solved. Well done!

mmonnin
Send message
Joined: 2 Jul 16
Posts: 332
Credit: 3,772,896,065
RAC: 4,765,302
Level
Arg
Scientific publications
watwatwatwatwat
Message 57827 - Posted: 13 Nov 2021 | 13:27:09 UTC
Last modified: 13 Nov 2021 | 13:27:24 UTC

I enabled Python on a 2nd PC with a 1070 and 1080 and they all error out

https://www.gpugrid.net/result.php?resultid=32662330
Output in format: Requested package -> Available versions

Then it lists tons of packages and versions.

When I check python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.

I'm guessing there is some incompatibility between packages I have installed?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57828 - Posted: 13 Nov 2021 | 17:52:56 UTC

You needn't install any packages. The tasks are entirely packaged with everything they need in the work unit bundle.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 332
Credit: 3,772,896,065
RAC: 4,765,302
Level
Arg
Scientific publications
watwatwatwatwat
Message 57830 - Posted: 13 Nov 2021 | 23:15:03 UTC

Supposedly, but then they should work. Another PC of mine also with Ubuntu 18.04, driver 470 and Pascal arch works OK. These tasks were all completed by others.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 57831 - Posted: 14 Nov 2021 | 1:34:01 UTC

I can only guess that the tasks are getting confused between the locally installed old Python 2.7 and the Python 3.8 contained in the bundle.

Python 2.7 is deprecated in current Linux distributions with minimum Python 3.6 in the distros now.

You might want to either uninstall Python or upgrade it to the 3 series. I don't think uninstalling though is desired as I believe a lot of stock applications are Python based and you would lose those.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57832 - Posted: 14 Nov 2021 | 2:27:31 UTC - in response to Message 57831.

I think you can uninstall python2 without damage. At least I could on Ubuntu 20.04.3, though I had only BOINC and Folding installed on it.
But I then made the mistake of trying to purge all python versions. It made the system unbootable, and I had to re-install it.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,552,024
RAC: 10,814,021
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57833 - Posted: 14 Nov 2021 | 9:12:32 UTC - in response to Message 57827.

When I check python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.

To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands:

sudo apt install python-is-python3
sudo apt install python3-pip

And after that, you can uninstall unnecessary old packages with the command:

sudo apt autoremove

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57834 - Posted: 14 Nov 2021 | 10:24:54 UTC

I also have two closely similar Linux machines:

132158
508381

Don't be fooled by the host IDs: 132158 is an inherited ID from an earlier generation of hardware, and is actually slightly younger than 508381. Both run the same version of Linux Mint 20.2, installed from the same ISO download, and the same basic software environment - but I do make tweaks to the installed packages separately, as I encounter different testing needs.

Yesterday, I was away from home, but both machines downloaded tasks from the ppod_gym_test9 batch. 132158 failed to run them, 508381 succeeded.

The problem occurs during the learner.step in Python, with a ValueError raised at line 55 during initialisation:

File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (146, 8)) of distribution Normal(loc: torch.Size([146, 8]), scale: torch.Size([146, 8])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
grad_fn=<AddmmBackward0>)

The two 'file extraction' logs for the GPUGrid Python download seem to be different. I'll try to compare the software environment of the two machines and work out where the difference is coming from.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57838 - Posted: 14 Nov 2021 | 17:49:02 UTC - in response to Message 57834.

Well, I've looked through the software installations for both machines, but I can't see any significant differences. Both have Python 3.8 installed (probably with the operating system), and no sign of any Python 2.x; I've installed a few sundries from terminal (libboost, git, some 32-bit libs for CPDN), but the same list on both machines.

The 'file extraction' logs are different for every task, and sometimes the same filename appears more than once (is duplicated) in the list for a single task.

For the tasks I ran successfully on host 508381, that was the only host that attempted them. The tasks that failed on host 132158 were issued to the full limit of 8 hosts, and failed on all of them.

I can only assume that the difference between success and failure resulted from differences in the task data make-up, and not from differences in the installed software on my hosts.
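
Two different failure signatures have been quoted in this thread so far: the conda "Requested package -> Available versions" listing and the NaN ValueError from torch. A minimal sketch, assuming you have saved a task's stderr as text, of sorting failed tasks by which signature they show. The category names are made up for illustration, and the marker strings are simply copied from the posts above.

```python
# Marker strings copied from the error output quoted in this thread.
# The category names are hypothetical labels for this sketch only.
MARKERS = {
    "conda_package_conflict": "Requested package -> Available versions",
    "nan_in_model_output": "but found invalid values",
}


def classify_stderr(stderr_text):
    """Return the names of all known failure markers found in a task's stderr."""
    return [name for name, marker in MARKERS.items() if marker in stderr_text]


if __name__ == "__main__":
    sample = (
        "ValueError: Expected parameter loc ... to satisfy the constraint "
        "Real(), but found invalid values:"
    )
    print(classify_stderr(sample))
```

If the same marker shows up on every host that tried a given work unit, that points at the task rather than at the local software, which is the conclusion drawn above.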

mmonnin
Send message
Joined: 2 Jul 16
Posts: 332
Credit: 3,772,896,065
RAC: 4,765,302
Level
Arg
Scientific publications
watwatwatwatwat
Message 57840 - Posted: 15 Nov 2021 | 2:59:24 UTC - in response to Message 57833.
Last modified: 15 Nov 2021 | 3:00:56 UTC

When I check python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.

To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands:

sudo apt install python-is-python3
sudo apt install python3-pip

And after that, you can uninstall unnecessary old packages with the command:

sudo apt autoremove


The python-is command didn't work.

So I followed the instructions here starting with Option 1
https://phoenixnap.com/kb/how-to-install-python-3-ubuntu

At the end I did the python --version to check. Same 2.7.17 even though it seemed to complete.

So I tried option 2 from source. That worked OK too with 3.7.5
I get to the end and see the note about checking for specific versions. Uh, oh.

python --version = 2.7.17
python3 --version = 3.6.9
python3.7 --version = 3.7.5

So now I have 3 versions installed haha. Maybe one will work, dunno. But we'll need some more tasks to find out.
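
For anyone wanting to see at a glance which interpreters a host exposes on PATH, here is a small sketch that does the same checks as the three commands above in one go. The list of command names is only a guess at what is commonly installed; add python3.7 or similar as needed. It says nothing about which interpreter the GPUGRID task itself uses, since that one is shipped inside the work unit bundle.

```python
import shutil
import subprocess


def find_interpreters(names=("python", "python2", "python3")):
    """Map each command name found on PATH to (path, version string)."""
    found = {}
    for name in names:
        path = shutil.which(name)
        if path is None:
            continue
        # Python 2 prints its version to stderr, Python 3.4+ to stdout,
        # so merge both streams before reading.
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        found[name] = (path, result.stdout.strip())
    return found


if __name__ == "__main__":
    for name, (path, version) in find_interpreters().items():
        print(f"{name:10s} {path}  ->  {version}")
```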

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,100,382
RAC: 766,238
Level
Trp
Scientific publications
watwatwat
Message 58077 - Posted: 12 Dec 2021 | 14:10:45 UTC - in response to Message 57833.

sudo apt install python-is-python3
sudo apt install python3-pip

Thanks, that worked and I now have python 3.8.10 installed on my two GG computers with cuda 11.4.

I just noticed that one computer had previously attempted to run a python WU but it failed. https://www.gpugrid.net/result.php?resultid=32727968
The stderr said this among many other things:
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.11.0

Please update conda by running

$ conda update -n base -c defaults conda
I tried running that command but it said "conda: command not found."

The rig that didn't run a python WU installed many more lines of files. The rig that did run the failed python WU installed less than half of the files.

What are all of the prerequisites I need to run these python WUs?

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,552,024
RAC: 10,814,021
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58079 - Posted: 12 Dec 2021 | 14:50:51 UTC - in response to Message 58077.

What are all of the prerequisites I need to run these python WUs?

I read Keith Myers Message #58061
Then, I executed:

sudo apt install cmake

chance or not, the following Python task worked for me: e1a1-ABOU_rnd_ppod3-0-1-RND4818_5
The same WU had previously failed at five other hosts.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,100,382
RAC: 766,238
Level
Trp
Scientific publications
watwatwat
Message 58081 - Posted: 12 Dec 2021 | 15:28:52 UTC - in response to Message 58079.

sudo apt install cmake


Done. Fingers crossed. Thx

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 58083 - Posted: 12 Dec 2021 | 16:47:26 UTC - in response to Message 58079.

What are all of the prerequisites I need to run these python WUs?

I read Keith Myers Message #58061
Then, I executed:

sudo apt install cmake

chance or not, the following Python task worked for me: e1a1-ABOU_rnd_ppod3-0-1-RND4818_5
The same WU had previously failed at five other hosts.

I was hoping to get a response from the researcher before interfering with the process. Happy someone beat me to it.

So once again we crunchers need to help along the process by installing missing software on our hosts to properly crunch the work the researchers are sending out.

Would be nice if the researchers ran some of their work on some test systems of their own before releasing it to the public, or as we are also known as . . . "beta-testers"

abouh
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 31 May 21
Posts: 200
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 58105 - Posted: 14 Dec 2021 | 16:59:44 UTC - in response to Message 58083.

Hello everyone, sorry for the late reply.

We detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.

The error was related to atari_py, as some users reported. More specifically, it came from installing this Python package from GitHub (https://github.com/openai/atari-py), which allows using some Atari 2600 games as a test bench for reinforcement learning (RL) agents.

Sorry for the inconvenience. Even though the AI agent part of the code has been tested and works, every time we test our agents in a new environment we need to modify the environment initialisation part of the code to include the new environment, in this case atari_py.

I just sent another batch of 5 test jobs. 3 have already finished; the others seem to be working without problems but have not yet finished.

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762

____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,100,382
RAC: 766,238
Level
Trp
Scientific publications
watwatwat
Message 58106 - Posted: 14 Dec 2021 | 19:08:48 UTC - in response to Message 58105.
Last modified: 14 Dec 2021 | 19:12:24 UTC

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762

I cannot open these links. Please use the [url][/url] tags to make them linkable.

I have 2 running now and am surprised how much memory they report using. They finished and reported as I wrote this, so I can't say exactly how much, but I think it said 22 GB each, while my System Monitor reported much less, on the order of 17 GB, which has since been relinquished. How much RAM should we have to run pythonGPU?

https://www.gpugrid.net/result.php?resultid=32730780
https://www.gpugrid.net/result.php?resultid=32730783

BTW, I installed cmake and latest python 3.8. Should I uninstall cmake as a better test?

I recommend making its CPU use require 1 and not 0.963.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 58107 - Posted: 14 Dec 2021 | 19:24:18 UTC - in response to Message 58106.

Those are private links, but you can see the result ID.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 58108 - Posted: 14 Dec 2021 | 20:21:24 UTC - in response to Message 58106.
Last modified: 14 Dec 2021 | 20:21:53 UTC

http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762

I cannot open these links. Please use the [url][/url] tags to make them linkable.

I have 2 running now and am surprised how much memory they report using. They finished and reported as I wrote this, so I can't say exactly how much, but I think it said 22 GB each, while my System Monitor reported much less, on the order of 17 GB, which has since been relinquished. How much RAM should we have to run pythonGPU?

https://www.gpugrid.net/result.php?resultid=32730780
https://www.gpugrid.net/result.php?resultid=32730783

BTW, I installed cmake and latest python 3.8. Should I uninstall cmake as a better test?

I recommend making its CPU use require 1 and not 0.963.


real memory? or virtual memory allocation? high virt is normal, and on the order of tens of GB, even for acemd3 tasks.
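
To tell the two apart on Linux, one option is to read the kernel's own numbers for a process: VmSize is the virtual allocation and VmRSS is what actually sits in RAM. A minimal sketch, assuming a Linux host; it inspects its own process as a stand-in, so swap /proc/self for /proc/<pid> of the task you care about.

```python
import os


def parse_vm_fields(status_text):
    """Extract VmSize (virtual) and VmRSS (resident) in kB from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # Lines look like "VmRSS:    1234 kB"
            fields[key] = int(value.split()[0])
    return fields


if __name__ == "__main__":
    status_path = "/proc/self/status"  # Linux only; replace "self" with a PID
    if os.path.exists(status_path):
        with open(status_path) as fh:
            for name, kib in parse_vm_fields(fh.read()).items():
                print(f"{name}: {kib / 1024 ** 2:.2f} GiB")
```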

re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that.
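
For reference, a rough sketch of what such an app config could look like, generated from Python so it is easy to adapt. The element names follow the standard BOINC app_config.xml layout; the app name "PythonGPU" is an assumption on my part, so check the name your client actually reports (for example in client_state.xml) before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, cpu_usage=1.0, gpu_usage=1.0):
    """Return app_config.xml text that pins CPU/GPU usage for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "PythonGPU" is an assumed app name, not confirmed in this thread.
    print(build_app_config("PythonGPU"))
```

The output would be saved as app_config.xml in the GPUGRID project folder under the BOINC data directory, and picked up after telling the client to re-read its config files.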
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,100,382
RAC: 766,238
Level
Trp
Scientific publications
watwatwat
Message 58109 - Posted: 15 Dec 2021 | 12:25:37 UTC - in response to Message 58108.
Last modified: 15 Dec 2021 | 12:30:57 UTC

re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that.

I wasn't asking you for a trivial response. I'm asking the people that create these work units why they don't specify 1 instead of 0.963.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 58110 - Posted: 15 Dec 2021 | 15:20:14 UTC - in response to Message 58109.

a trivial question garners a trivial response :)

does it solve your problem?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,461,851
RAC: 8,706,681
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58111 - Posted: 15 Dec 2021 | 15:24:48 UTC - in response to Message 58109.

re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that.

I wasn't asking you for a trivial response. I'm asking the people that create these work units why they don't specify 1 instead of 0.963.

Because the GPUGrid staff don't set that figure. There's an algorithm in the (Berkeley written) BOINC server code which generates the figure to use from a range of outdated, stupid, data.

I discussed this at some length almost three years ago, in https://github.com/BOINC/boinc/issues/2949 - with examples drawn from GPUGrid, among other projects. I think that this was about the point that Berkeley stopped reading a single word of what I write. Someone else can get to grips with it this time.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 58113 - Posted: 15 Dec 2021 | 15:38:56 UTC - in response to Message 58111.

thanks Richard, I had a thought that it was likely the output of some automated function within BOINC since nearly all projects end up with something like this by default if they don't manually set the figures.
____________

Profile Bill F
Avatar
Send message
Joined: 21 Nov 16
Posts: 32
Credit: 86,638,150
RAC: 128,516
Level
Thr
Scientific publications
wat
Message 58121 - Posted: 16 Dec 2021 | 2:56:58 UTC - in response to Message 58113.

Too bad they never worked to implement all or part of Richard's suggestion / GitHub issue. While I don't claim to be able to see the bigger picture in BOINC it sounds like a good path for automated adjustments when new GPU hardware is released.

Bill F

____________
In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.


bormolino
Send message
Joined: 16 May 13
Posts: 41
Credit: 79,726,864
RAC: 4,842
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58205 - Posted: 24 Dec 2021 | 15:20:12 UTC

I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2).

Is it normal behaviour that the WU uses more than 7GB of RAM?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 58206 - Posted: 24 Dec 2021 | 15:38:40 UTC - in response to Message 58205.

I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2).

Is it normal behaviour that the WU uses more than 7GB of RAM?


Yes.
____________

bormolino
Send message
Joined: 16 May 13
Posts: 41
Credit: 79,726,864
RAC: 4,842
Level
Thr
Scientific publications
watwatwatwatwatwat
Message 58207 - Posted: 24 Dec 2021 | 15:40:09 UTC - in response to Message 58206.

I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2).

Is it normal behaviour that the WU uses more than 7GB of RAM?


Yes.


Thanks for answering.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 166,633
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 58699 - Posted: 22 Apr 2022 | 15:39:17 UTC
Last modified: 22 Apr 2022 | 15:40:29 UTC

What is typical run time for these tasks? I am at 1 day and x hours processing and only 34% of the way done. I have 2 days left before deadline. I am running a GTX 1080 plain, not TI that is OC'd a bit.

Profile Bill F
Avatar
Send message
Joined: 21 Nov 16
Posts: 32
Credit: 86,638,150
RAC: 128,516
Level
Thr
Scientific publications
wat
Message 58712 - Posted: 25 Apr 2022 | 14:05:30 UTC - in response to Message 58699.

What is typical run time for these tasks? I am at 1 day and x hours processing and only 34% of the way done. I have 2 days left before deadline. I am running a GTX 1080 plain, not TI that is OC'd a bit.


Your times may be about right. I have a GTX 1060 with 6GB and my times were similar.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 727,920,933
RAC: 155,858
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58948 - Posted: 19 Jun 2022 | 18:44:06 UTC

This task is giving wild results for its estimated time remaining.

This morning, it was saying over 400 days remaining.

Application Python apps for GPU hosts 4.03 (cuda1131)
Name e23a16-ABOU_rnd_ppod_demo_sharing_large-0-1-RND7660
State Running
Received 6/19/2022 6:33:56 AM
Report deadline 6/24/2022 6:34:02 AM
Resources 0.949 CPUs + 1 NVIDIA GPU
Estimated computation size 1,000,000,000 GFLOPs
CPU time 23:13:06
CPU time since checkpoint 00:02:47
Elapsed time 06:42:45
Estimated time remaining 357d 12:44:26
Fraction done 22.580%
Virtual memory size 5.81 GB
Working set size 1.05 GB
Directory slots/10
Process ID 6376
Progress rate 3.240% per hour
Executable wrapper_6.1_windows_x86_64.exe

I've seen other tasks start out claiming over 300 days remaining, and then finish in between 5 and 6 days.

Is there something wrong in the data sent as task input, or is it the wild first ten tasks for a new application version?

mmonnin
Send message
Joined: 2 Jul 16
Posts: 332
Credit: 3,772,896,065
RAC: 4,765,302
Level
Arg
Scientific publications
watwatwatwatwat
Message 58950 - Posted: 20 Jun 2022 | 18:38:06 UTC
Last modified: 20 Jun 2022 | 18:38:52 UTC

Yup, non beta task but I've seen over 3k day ETAs recently.


Name e23a60-ABOU_rnd_ppod_demo_sharing_large-0-1-RND1212_1

Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e23a60-ABOU_rnd_ppod_demo_sharing_large-0-1-RND1212
State Running High P.
Received 6/20/2022 12:17:53 PM
Report deadline 6/25/2022 12:17:53 PM
Estimated app speed 311.15 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.99 CPUs + 1 NVIDIA GPU
CPU time at last checkpoint 05:44:17
CPU time 05:47:46
Elapsed time 02:17:30
Estimated time remaining 2764d,01:55:33
Fraction done 11.890%
Virtual memory size 18,693.93 MB
Working set size 3,824.01 MB

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 58951 - Posted: 20 Jun 2022 | 20:50:03 UTC

Just ignore the ETA estimates. Garbage data.
The tasks finish fine and well within their deadlines.
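
If you want a rough number of your own instead, a simple sketch is to extrapolate from the elapsed time and fraction done that the client shows. This assumes progress is linear, which may well not hold for these tasks, so treat it as a sanity check only. The figures below are the ones posted in the two task listings above.

```python
def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def linear_remaining_hours(elapsed_hms, fraction_done):
    """Remaining hours if progress continues at the average rate so far."""
    elapsed = hms_to_seconds(elapsed_hms)
    return elapsed * (1.0 - fraction_done) / fraction_done / 3600.0


if __name__ == "__main__":
    # Elapsed time and fraction done as posted above.
    for label, elapsed, fraction in [
        ("RND7660", "06:42:45", 0.22580),
        ("RND1212_1", "02:17:30", 0.11890),
    ]:
        hours = linear_remaining_hours(elapsed, fraction)
        print(f"{label}: about {hours:.1f} h remaining at the current rate")
```

On those figures the linear estimate comes out at roughly 23 and 17 hours, against the hundreds or thousands of days the client displays.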

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 727,920,933
RAC: 155,858
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 58952 - Posted: 21 Jun 2022 | 0:45:31 UTC
Last modified: 21 Jun 2022 | 0:47:44 UTC

Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that.

Note there seems to be no thread for discussing non-beta Python tasks.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 58953 - Posted: 21 Jun 2022 | 2:53:45 UTC - in response to Message 58952.

Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that.

Note there seems to be no thread for discussing non-beta Python tasks.

No not at all. BOINC just has no mechanism for dealing with hybrid cpu-gpu tasks.

The Python on GPU tasks are the first of their kind.

It will take the BOINC devs a lot of time to accommodate them correctly.

If they are getting in the way of your other work, I suggest stopping them or limiting them to only a single task at any time by changing your cache size to absolute minimal values.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 58956 - Posted: 21 Jun 2022 | 17:31:43 UTC - in response to Message 58952.

Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that.

Note there seems to be no thread for discussing non-beta Python tasks.

The other threads are here:
https://www.gpugrid.net/forum_thread.php?id=5323

https://www.gpugrid.net/forum_thread.php?id=5319

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,781,959
RAC: 6,493,556
Level
Arg
Scientific publications
watwatwatwatwat
Message 58957 - Posted: 21 Jun 2022 | 18:29:22 UTC
Last modified: 21 Jun 2022 | 18:29:32 UTC

FYI, the Python on GPU tasks are the same as the beta Python tasks currently.

Both tasks are using the latest application code.

The devs said they would still keep the beta plan class available, just not in use, for whenever a new application might be developed.

So everyone is getting the standard Python work even if they have beta selected.

Drago
Send message
Joined: 3 May 20
Posts: 10
Credit: 318,606,560
RAC: 1,060,873
Level
Asp
Scientific publications
wat
Message 59715 - Posted: 12 Jan 2023 | 15:51:14 UTC

Does anybody know how many cpu threads would be ideal to run them efficiently? I gave them 12 threads exclusively but still the task is run primarily on the cpu with my 3070ti kicking in only sporadically for a second or two.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1031
Credit: 35,623,832,483
RAC: 73,867,984
Level
Trp
Scientific publications
wat
Message 59716 - Posted: 12 Jan 2023 | 16:09:32 UTC - in response to Message 59715.
Last modified: 12 Jan 2023 | 16:10:32 UTC

giving it more cores won't necessarily make it run faster. as few as 4 cores per task works fine on my EPYC system. but if you are running other projects on the CPU it will slow them down as the processes compete with each other for CPU time.

by default the program will use however many cores you have and you really can't change this with any BOINC settings.
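
As an illustration only (I don't know how the GPUGRID app itself decides), this is the usual way a Python program finds out how many cores it can use. The point is that the number comes from the operating system, not from the CPU figure BOINC shows for the task.

```python
import os


def usable_cores():
    """Cores this process may run on; falls back to the machine total."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours CPU affinity restrictions such as those set by taskset.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"machine total: {os.cpu_count()}, usable by this process: {usable_cores()}")
```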

also I would recommend putting Linux on that system instead of Windows. Linux runs much faster
____________

Drago
Send message
Joined: 3 May 20
Posts: 10
Credit: 318,606,560
RAC: 1,060,873
Level
Asp
Scientific publications
wat
Message 59719 - Posted: 13 Jan 2023 | 10:59:48 UTC - in response to Message 59716.

Gotcha! My Linux Laptop finishes them in 10 hours, my much faster Windows PC needs 18! Thanks again Ian & Steve
