Message boards : News : Python Runtime (GPU, beta)

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 1005
Credit: 5,068,599
RAC: 0
Message 57655 - Posted: 26 Oct 2021 | 10:57:36 UTC

If anybody wants to help debug a new application, please enable the above-mentioned app.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Message 57656 - Posted: 26 Oct 2021 | 12:09:24 UTC - in response to Message 57655.

I don't see anything new on https://www.gpugrid.net/apps.php yet?

Azmodes
Joined: 7 Jan 17
Posts: 27
Credit: 1,258,729,084
RAC: 61,499
Message 57657 - Posted: 26 Oct 2021 | 12:30:43 UTC

GPUGRID 10/26/2021 2:01:26 PM No tasks are available for Python apps for GPU hosts

Ian&Steve C.
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Message 57659 - Posted: 26 Oct 2021 | 15:46:17 UTC

One system queued up and waiting.
____________

Azmodes
Message 57660 - Posted: 26 Oct 2021 | 17:31:20 UTC

Got a 2080 Ti and two 2070 Supers ready to roll.

Ian&Steve C.
Message 57663 - Posted: 26 Oct 2021 | 20:25:32 UTC - in response to Message 57656.

I don't see anything new on https://www.gpugrid.net/apps.php yet?


OT, but I've always found it weird that this link is not linked from anywhere on the main GPUGRID site. can only find it via a google search or previous bookmark. if you're just browsing through the GPUGRID site it doesn't exist.
____________

mg13 [HWU]
Joined: 18 Nov 09
Posts: 5
Credit: 107,006
RAC: 0
Message 57665 - Posted: 26 Oct 2021 | 23:12:12 UTC - in response to Message 57663.

I don't see anything new on https://www.gpugrid.net/apps.php yet?


OT, but I've always found it weird that this link is not linked from anywhere on the main GPUGRID site. can only find it via a google search or previous bookmark. if you're just browsing through the GPUGRID site it doesn't exist.


Yes, the link is there. Go to the home page and click "Join us"; on the page that opens, in point 2 of the "Configuring your participation" section, click "apps" and you will find it.

Ian&Steve C.
Message 57666 - Posted: 27 Oct 2021 | 0:57:38 UTC - in response to Message 57665.

Thanks. You're right, it's there.

But I'll follow up that it's a very odd place for it. Nearly every other BOINC project puts a link near or with the credit statistics, directly on the main page, or at the bottom of every page.
____________

Bill F
Joined: 21 Nov 16
Posts: 18
Credit: 14,326,619
RAC: 0
Message 57668 - Posted: 27 Oct 2021 | 1:41:30 UTC

Well, I am checked and enabled, including "run Test Apps". We will see if I get a task assigned.

Thanks
Bill F

dthonon
Joined: 26 Aug 21
Posts: 1
Credit: 5,834,975
RAC: 170,823
Message 57678 - Posted: 27 Oct 2021 | 13:58:31 UTC - in response to Message 57668.

This application is enabled in my preferences, and I accept test applications, but I am not getting any Python tasks:

mer. 27 oct. 2021 15:51:04 | GPUGRID | Scheduler request completed: got 0 new tasks

Server status shows 10 tasks waiting to be sent.

bozz4science
Joined: 22 May 20
Posts: 104
Credit: 21,759,591
RAC: 77,228
Message 57680 - Posted: 27 Oct 2021 | 16:44:26 UTC

Why are you not running more test tasks for the new app? Almost all of the tasks ended on one of Ian’s hosts … or is that enough feedback for now?

Anyway, credit calculation looks almost random to me, at least for these tasks. Any chance you will fix that before this gets into production? (Comparison: a 700 sec runtime was awarded ~100k credit, while an admittedly lower-end card with a 110k sec runtime got 565k credit.) That seems out of proportion.
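
To put rough numbers on that comparison, the credit awarded per second of runtime can be worked out directly. This is only a sketch using the approximate figures quoted in the post (~100k credit for 700 s, 565k credit for 110k s); the helper name `credit_per_second` is made up for illustration and is not part of BOINC or GPUGRID.

```python
def credit_per_second(credit: float, runtime_s: float) -> float:
    """Return the credit awarded per second of runtime."""
    return credit / runtime_s


# Approximate figures quoted in the post above.
short_task = credit_per_second(100_000, 700)
long_task = credit_per_second(565_000, 110_000)

print(f"short task: {short_task:.1f} credit/s")
print(f"long task:  {long_task:.1f} credit/s")
print(f"ratio:      {short_task / long_task:.1f}x")
```

On those figures the short task earned roughly 143 credit/s against roughly 5 credit/s for the long one, a factor of about 28.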

Ian&Steve C.
Message 57682 - Posted: 28 Oct 2021 | 4:00:11 UTC - in response to Message 57680.

wow, I didn't even notice, I was out all day. I just set the system (7x GPU) to check for work every 100s or so, and only for beta GPU work, so it doesn't surprise me that it got so many. it would ask for 7 at once, and I guess it got lucky that it asked for some work before anyone else.

beta tasks have always paid a lot of credit here for some reason.

but as with previous beta tasks, I see no indication that these tasks actually did anything on the GPU. my guess is that they ran some stuff on the CPU then finished. I've asked before what their intentions are with these tasks, and it's clear they are doing some kind of machine learning, but they don't appear to be using the GPU at all, which is very strange when they are labelled as a cuda app.
____________

Richard Haselgrove
Message 57683 - Posted: 28 Oct 2021 | 8:28:16 UTC

The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs.

Ian&Steve C.
Message 57686 - Posted: 28 Oct 2021 | 15:39:24 UTC - in response to Message 57683.

The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs.


cuda100* but yeah, looks to be the same app as listed in the Anaconda Python 3 category, same versioning.

____________

Azmodes
Message 57706 - Posted: 1 Nov 2021 | 10:25:58 UTC

So, uh, that was it?

Richard Haselgrove
Message 57711 - Posted: 2 Nov 2021 | 14:13:27 UTC - in response to Message 57706.

Not quite, but...

Got a new Python task. It failed:

14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

Not yet, but I'll install it before the replacement task I got on report has a chance to start.

We shouldn't need to do that.

(and now my second Linux machine has got one too)
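
For anyone who wants to check their own host before the server-side fix lands, the failure above comes down to `git` not being on the PATH when pip tries to install a package from a git URL. A minimal sketch of that check, assuming nothing beyond the Python standard library (the function name `find_git` is hypothetical):

```python
import shutil


def find_git():
    """Return the full path to the git executable, or None if it is not on PATH."""
    return shutil.which("git")


git_path = find_git()
if git_path is None:
    print("git not found on PATH; pip cannot install packages from git URLs")
else:
    print(f"git found at {git_path}")
```

Note that the BOINC client may run under a different user and environment than an interactive shell, so a result from your own terminal is only a hint.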

Richard Haselgrove
Message 57712 - Posted: 2 Nov 2021 | 14:38:54 UTC

That looks better - I'd say the GPU is running:



But what's [ObstacleTower (as boinc)]? It's appeared on my task bar, and opens to a tiny, all black, window?

Richard Haselgrove
Message 57713 - Posted: 2 Nov 2021 | 14:53:23 UTC

Second machine has acquired an ObstacleTower, too.

Interesting snip from stderr in running (repeated many times):

(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
(raylet) /var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet) warnings.warn(
(raylet) Traceback (most recent call last):
(raylet) File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet) import ray.new_dashboard.utils as dashboard_utils
(raylet) File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet) import aiohttp.signals
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
WARNING:gym_unity:New seed 57 will apply on next reset.
WARNING:gym_unity:New starting floor 0 will apply on next reset.
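
The repeated traceback is an import that cannot be resolved inside the task's Python environment. As a rough illustration, assuming only the standard library, one can test whether a dotted module name is importable without letting the failure propagate (the helper name `module_available` is made up for this sketch):

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the named module can be located, False otherwise.

    For a dotted name the parent package is imported first; if the parent
    is missing, find_spec raises ModuleNotFoundError, which we treat as
    "not available".
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        return False


for candidate in ("json.decoder", "aiohttp.signals"):
    print(candidate, "->", module_available(candidate))
```

This has to be run with the same interpreter the task uses (here `./gpugridpy/bin/python`) to say anything about the task's environment.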

Toni
Message 57714 - Posted: 2 Nov 2021 | 16:02:12 UTC - in response to Message 57711.

This is being solved server-side, no need to install software of course.

Not quite, but...

Got a new Python task. It failed:

14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

Not yet, but I'll install it before the replacement task I got on report has a chance to start.

We shouldn't need to do that.

(and now my second Linux machine has got one too)

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Message 57715 - Posted: 2 Nov 2021 | 16:22:51 UTC - in response to Message 57711.

The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world, it is common to work in 3D world simulations like this one. This is the GitHub page of the project: https://github.com/Unity-Technologies/obstacle-tower-env

We use it as a testbench within our efforts to train populations of interacting artificial intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on the GPU, and so do the Deep Learning models that learn how to solve the simulation.

Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issue is being solved on the server side, as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.
____________

Ian&Steve C.
Message 57716 - Posted: 2 Nov 2021 | 16:45:03 UTC - in response to Message 57715.

The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world, it is common to work in 3D world simulations like this one. This is the GitHub page of the project: https://github.com/Unity-Technologies/obstacle-tower-env

We use it as a testbench within our efforts to train populations of interacting artificial intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on the GPU, and so do the Deep Learning models that learn how to solve the simulation.

Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issue is being solved on the server side, as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.


do you have any plans to utilize the Tensor cores present on many newer Nvidia GPUs? these are designed for machine learning tasks.
____________

Richard Haselgrove
Message 57718 - Posted: 2 Nov 2021 | 17:50:17 UTC

Thanks for the feedback - on that basis, I'll keep pushing them through.

Had an odd finish:

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/6/model.state_dict.3073'
(raylet) /var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet) warnings.warn(
(raylet) Traceback (most recent call last):
(raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet) import ray.new_dashboard.utils as dashboard_utils
(raylet) File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet) import aiohttp.signals
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
INFO:mlagents_envs.environment:Environment shut down with return code 0.
15:21:11 (827067): ./gpugridpy/bin/python exited; CPU time 1598.264794
15:21:11 (827067): app exit status: 0x1
15:21:11 (827067): called boinc_finish(195)

"Environment shut down with return code 0" sounds like a happy ending, but "called boinc_finish(195)" is 'Child failed'.

Keith Myers
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Message 57719 - Posted: 3 Nov 2021 | 6:23:50 UTC

Tried a LOT of the PythonGPU tasks today. Still no joy for a successful run.

Think they are getting further along though since I think I see progress in how far they get before the environment collapses and errors out.

Richard Haselgrove
Message 57721 - Posted: 3 Nov 2021 | 9:30:00 UTC

The next round of testing has started.

e1a10-ABOU_PPOObstacle6-0-1-RND2533_0 - I was going to say 'is running', but it's crashed already. After only 20 seconds, I got an apparently normal finish, followed by

upload failure: <file_xfer_error>
<file_name>e1a10-ABOU_PPOObstacle6-0-1-RND2533_0_0</file_name>
<error_code>-131 (file size too big)</error_code>

Richard Haselgrove
Message 57722 - Posted: 3 Nov 2021 | 9:36:16 UTC
Last modified: 3 Nov 2021 | 10:01:49 UTC

Got another from what looks like the same batch. Limit is

<max_nbytes>100000000.000000</max_nbytes>

I'll catch the output and see how big it is.

Edit - couldn't catch it ('report immediately' operated too fast). But I watched the next one in the slot directory: the output file was created right at the end, but was cleaned up almost immediately. I read it as 169 MB, but can't be certain.
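
The arithmetic behind the -131 error is simple enough to spell out. This sketch parses the `<max_nbytes>` line quoted above and compares it with the roughly 169 MB that was observed; treating "MB" as 10^6 bytes is an assumption, and the helper names are hypothetical:

```python
import xml.etree.ElementTree as ET


def upload_limit_bytes(xml_fragment: str) -> int:
    """Extract the byte limit from a <max_nbytes> element."""
    element = ET.fromstring(xml_fragment)
    return int(float(element.text))


def too_big(size_bytes: int, limit_bytes: int) -> bool:
    """Return True if a file of size_bytes exceeds the upload limit."""
    return size_bytes > limit_bytes


limit = upload_limit_bytes("<max_nbytes>100000000.000000</max_nbytes>")
observed = 169 * 1000 * 1000  # assumption: 169 MB read as decimal megabytes

print(f"limit:    {limit} bytes")
print(f"observed: {observed} bytes")
print(f"too big:  {too_big(observed, limit)}")
```

Either reading of "MB" puts the file well above the 100,000,000-byte limit, which matches the "file size too big" upload failure.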

abouh
Message 57723 - Posted: 3 Nov 2021 | 10:25:40 UTC - in response to Message 57722.
Last modified: 3 Nov 2021 | 10:28:53 UTC

Yes, the file should be approximately 170M.

Richard Haselgrove
Message 57725 - Posted: 3 Nov 2021 | 10:33:04 UTC
Last modified: 3 Nov 2021 | 10:45:31 UTC

Well, I got one for you to study:

e1a8-ABOU_PPOObstacle7-0-1-RND2466_3

That was done by manually increasing the maximum allowed size in BOINC. I think that's an internal setting in the BOINC system - specifically, the workunit generator or its template files - rather than the Python package.

I've suspended work fetch for now - please let us know when the next iteration is ready to test.

Edit - this is what the upload file contained:




It seems a bit odd to return the ObstacleTower zip back to you unchanged?

abouh
Message 57726 - Posted: 3 Nov 2021 | 10:55:09 UTC - in response to Message 57711.
Last modified: 3 Nov 2021 | 11:38:59 UTC

The git-related errors should be solved now.


ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?


We will study the errors related to downloading the Obstacle Tower environment. Thank you for the feedback.
____________

Azmodes
Message 57727 - Posted: 3 Nov 2021 | 12:19:15 UTC

Got one that ended in 195 (0xc3) EXIT_CHILD_FAILED after 15 minutes:

==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.3

Please update conda by running

$ conda update -n base -c defaults conda


13:14:06 (11501): /usr/bin/flock exited; CPU time 470.306190
13:14:06 (11501): wrapper: running ./gpugridpy/bin/python (run.py)
path: ['/var/lib/boinc-client/slots/34', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git/ext/gitdb', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python38.zip', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/lib-dynload', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/gitdb/ext/smmap']
git path: /var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git
Traceback (most recent call last):
File "run.py", line 340, in <module>
main()
File "run.py", line 53, in main
print("GPU available: {}".format(torch.cuda.is_available()))
NameError: name 'torch' is not defined
13:14:10 (11501): ./gpugridpy/bin/python exited; CPU time 1.602758
13:14:10 (11501): app exit status: 0x1
13:14:10 (11501): called boinc_finish(195)

</stderr_txt>
]]>
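
The traceback shows `torch` being used in `main()` without the name being defined at that point. What run.py actually does is not visible here, so the following is only a guess at a more defensive pattern: import torch up front, and report "no GPU" rather than crash if the import is unavailable. The function name `gpu_available` is made up for the sketch.

```python
try:
    import torch  # may be missing if the environment was not fully installed
except ImportError:
    torch = None


def gpu_available() -> bool:
    """Return True only if torch imported and reports a usable CUDA device."""
    return torch is not None and torch.cuda.is_available()


print("GPU available: {}".format(gpu_available()))
```

With this shape a missing or broken torch install shows up as `GPU available: False` instead of a NameError a few lines into `main()`.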

Keith Myers
Message 57728 - Posted: 3 Nov 2021 | 14:24:08 UTC
Last modified: 3 Nov 2021 | 14:24:55 UTC

Got five PythonGPU tasks to finish and report after about ten minutes that were valid.

bozz4science
Message 57732 - Posted: 3 Nov 2021 | 17:18:12 UTC

My machine is a dual boot machine (Win10/Ubuntu 20.04). Are there plans for a Windows app for these tasks or should I boot into Linux to get some of these tasks?

Keith Myers
Message 57733 - Posted: 3 Nov 2021 | 19:50:45 UTC - in response to Message 57732.

Haven't heard of any posts by admin types that Windows apps will be made.
That stated, often the new beta apps are tested first on Linux to get the bugs out and then the Windows apps are generated.

Keith Myers
Message 57735 - Posted: 3 Nov 2021 | 19:55:47 UTC

This task looks to have run through all of its parameter set to complete normally at around 3000 seconds and was validated for ~ 200K credits.
https://www.gpugrid.net/result.php?resultid=32660133

PDW
Joined: 7 Mar 14
Posts: 12
Credit: 906,394,286
RAC: 1,467,729
Message 57737 - Posted: 3 Nov 2021 | 20:05:30 UTC - in response to Message 57735.

Did you notice whether it used the GPU and, if it did, what percentage?

I had one that ran for about 3 hours before failing, never saw the fans running during that time.

Ian&Steve C.
Message 57740 - Posted: 3 Nov 2021 | 20:28:14 UTC
Last modified: 3 Nov 2021 | 20:41:52 UTC

just ran this one on my RTX 3080Ti: https://www.gpugrid.net/result.php?resultid=32660184

16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper: running /usr/bin/flock (/home/ian/BOINC/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /home/ian/BOINC/projects/www.gpugrid.net/miniconda &&
/home/ian/BOINC/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")

0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : urllib3-1.25.8-py37_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : libedit-3.1.20181209-hc058e9b_0.conda: 6%|5 | 2/35 [00:00<00:10, 3.04it/s]
Extracting : libgcc-ng-9.1.0-hdf63c60_0.conda: 9%|8 | 3/35 [00:00<00:10, 3.04it/s]
Extracting : ld_impl_linux-64-2.33.1-h53a641e_7.conda: 11%|#1 | 4/35 [00:00<00:10, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 14%|#4 | 5/35 [00:00<00:09, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : tqdm-4.46.0-py_0.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : ca-certificates-2020.1.1-0.conda: 20%|## | 7/35 [00:00<00:06, 4.16it/s]
Extracting : wheel-0.34.2-py37_0.conda: 23%|##2 | 8/35 [00:00<00:06, 4.16it/s]
Extracting : libstdcxx-ng-9.1.0-hdf63c60_0.conda: 26%|##5 | 9/35 [00:00<00:06, 4.16it/s]
Extracting : certifi-2020.4.5.1-py37_0.conda: 29%|##8 | 10/35 [00:00<00:06, 4.16it/s]
Extracting : readline-8.0-h7b6447c_0.conda: 31%|###1 | 11/35 [00:00<00:05, 4.16it/s]
Extracting : ncurses-6.2-he6710b0_1.conda: 34%|###4 | 12/35 [00:00<00:05, 4.16it/s]
Extracting : conda-package-handling-1.6.1-py37h7b6447c_0.conda: 37%|###7 | 13/35 [00:00<00:05, 4.16it/s]
Extracting : chardet-3.0.4-py37_1003.conda: 40%|#### | 14/35 [00:00<00:05, 4.16it/s]
Extracting : zlib-1.2.11-h7b6447c_3.conda: 43%|####2 | 15/35 [00:00<00:04, 4.16it/s]
Extracting : six-1.14.0-py37_0.conda: 46%|####5 | 16/35 [00:00<00:04, 4.16it/s]
Extracting : pycparser-2.20-py_0.conda: 49%|####8 | 17/35 [00:00<00:04, 4.16it/s]
Extracting : libffi-3.3-he6710b0_1.conda: 51%|#####1 | 18/35 [00:00<00:04, 4.16it/s]
Extracting : pycosat-0.6.3-py37h7b6447c_0.conda: 54%|#####4 | 19/35 [00:00<00:03, 4.16it/s]
Extracting : cffi-1.14.0-py37he30daa8_1.conda: 57%|#####7 | 20/35 [00:00<00:03, 4.16it/s]
Extracting : _libgcc_mutex-0.1-main.conda: 60%|###### | 21/35 [00:00<00:03, 4.16it/s]
Extracting : pyopenssl-19.1.0-py37_0.conda: 63%|######2 | 22/35 [00:00<00:03, 4.16it/s]
Extracting : idna-2.9-py_1.conda: 66%|######5 | 23/35 [00:00<00:02, 4.16it/s]
Extracting : pysocks-1.7.1-py37_0.conda: 69%|######8 | 24/35 [00:00<00:02, 4.16it/s]
Extracting : xz-5.2.5-h7b6447c_0.conda: 71%|#######1 | 25/35 [00:00<00:02, 4.16it/s]
Extracting : setuptools-46.4.0-py37_0.conda: 74%|#######4 | 26/35 [00:00<00:02, 4.16it/s]
Extracting : ruamel_yaml-0.15.87-py37h7b6447c_0.conda: 77%|#######7 | 27/35 [00:00<00:01, 4.16it/s]
Extracting : cryptography-2.9.2-py37h1ba5d50_0.conda: 80%|######## | 28/35 [00:00<00:01, 4.16it/s]
Extracting : openssl-1.1.1g-h7b6447c_0.conda: 83%|########2 | 29/35 [00:00<00:01, 4.16it/s]
Extracting : sqlite-3.31.1-h62c20be_1.conda: 86%|########5 | 30/35 [00:00<00:01, 4.16it/s]
Extracting : pip-20.0.2-py37_3.conda: 89%|########8 | 31/35 [00:00<00:00, 4.16it/s]
Extracting : yaml-0.1.7-had09818_2.conda: 91%|#########1| 32/35 [00:00<00:00, 4.16it/s]
Extracting : requests-2.23.0-py37_0.conda: 94%|#########4| 33/35 [00:00<00:00, 4.16it/s]
Extracting : conda-4.8.3-py37_0.tar.bz2: 97%|#########7| 34/35 [00:00<00:00, 4.16it/s]


==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.3

Please update conda by running

$ conda update -n base -c defaults conda


16:21:21 (1841951): /usr/bin/flock exited; CPU time 61.036800
16:21:21 (1841951): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-wwv7ghqo
/home/ian/BOINC/slots/15/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
warnings.warn(
Downloading...
From: https://storage.googleapis.com/obstacle-tower-build/v4.1/obstacletower_v4.1_linux.zip
To: /home/ian/BOINC/slots/15/obstacletower_v4.1_linux.zip

0%| | 0.00/170M [00:00<?, ?B/s]
1%| | 2.10M/170M [00:00<00:08, 19.9MB/s]
6%|▌ | 10.5M/170M [00:00<00:02, 56.2MB/s]
11%|█▏ | 19.4M/170M [00:00<00:02, 70.8MB/s]
16%|█▋ | 27.8M/170M [00:00<00:02, 70.6MB/s]
22%|██▏ | 37.7M/170M [00:00<00:01, 76.7MB/s]
28%|██▊ | 47.7M/170M [00:00<00:01, 79.0MB/s]
34%|███▎ | 57.1M/170M [00:00<00:01, 82.8MB/s]
38%|███▊ | 65.5M/170M [00:00<00:01, 80.4MB/s]
43%|████▎ | 73.9M/170M [00:00<00:01, 81.2MB/s]
49%|████▊ | 82.8M/170M [00:01<00:01, 83.4MB/s]
54%|█████▎ | 91.2M/170M [00:01<00:00, 80.8MB/s]
59%|█████▉ | 101M/170M [00:01<00:00, 81.3MB/s]
65%|██████▍ | 110M/170M [00:01<00:00, 83.7MB/s]
70%|██████▉ | 119M/170M [00:01<00:00, 79.0MB/s]
75%|███████▍ | 127M/170M [00:01<00:00, 80.2MB/s]
80%|████████ | 137M/170M [00:01<00:00, 79.2MB/s]
85%|████████▌ | 145M/170M [00:01<00:00, 80.1MB/s]
90%|█████████ | 154M/170M [00:01<00:00, 79.1MB/s]
96%|█████████▌| 163M/170M [00:02<00:00, 82.6MB/s]
100%|██████████| 170M/170M [00:02<00:00, 78.6MB/s]
16:21:54 (1841951): ./gpugridpy/bin/python exited; CPU time 22.798227
16:21:59 (1841951): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>e1a6-ABOU_PPOObstacle6-0-1-RND7771_2_0</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>


ran for about 2 mins and errored out. file size too big? how big could the file get in 2 minutes? lol. looks like everyone in this WU chain is having the same issue though. https://www.gpugrid.net/workunit.php?wuid=27085637 Bad WU?

and I saw no evidence that it ever touched the GPU, refreshing nvidia-smi every 2 seconds showed no process running on the GPU. must still be using only the CPU.

Can an admin please directly comment if these are actually using the GPU or not? I know an admin mentioned that they were only doing CPU work "as a test". Is that still the case? Having GPU tasks that only use the CPU core is very confusing.
____________

Keith Myers
Message 57741 - Posted: 3 Nov 2021 | 20:33:04 UTC

The ones that have partially run and were validated only used 31% of the GPU in nvidia-smi.

The one task that appears to have successfully run through to normal completion was done while I was out of the house and did not see it run unfortunately.

Will have to wait for more to observe.

Keith Myers
Message 57743 - Posted: 3 Nov 2021 | 21:41:16 UTC
Last modified: 3 Nov 2021 | 21:44:30 UTC

Looks like the tasks drop to 1% utilization for a few seconds before returning to hover around 10-13% utilization. I was watching one on a 2070 and it was running for almost 60 minutes in nvidia-smi. They are marked as C+G type in that program.

I think I killed it when I pulled up htop to look at how much cpu it was using because it finished with an error instantly at the same time as htop populated the screen.
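
Since utilization swings between readings, a single glance at nvidia-smi can mislead. A small sketch of sampling it from Python is shown below; it relies on nvidia-smi's `--query-gpu` CSV output, returns an empty list if the tool is missing or fails, and the helper names are invented for illustration.

```python
import subprocess

# Ask nvidia-smi for GPU utilization (%) and memory used (MiB) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str):
    """Parse one CSV line such as '31, 8000' into (utilization_pct, memory_mib)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def read_gpu_stats():
    """Return one (utilization_pct, memory_mib) tuple per GPU, or [] on failure."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    stats = []
    for line in out.splitlines():
        try:
            stats.append(parse_gpu_line(line))
        except ValueError:
            continue  # skip blank lines or fields reported as N/A
    return stats


print(read_gpu_stats())
```

Calling `read_gpu_stats()` in a loop with a short sleep and averaging the results gives a steadier picture than watching individual refreshes.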

abouh
Message 57748 - Posted: 4 Nov 2021 | 9:27:01 UTC - in response to Message 57725.

The contents of the obstacletower.zip downloaded file are necessary to generate the data required for the machine learning agent to train. That is why the file itself is not modified. Only used to generate the training data.

The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.

Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory.


____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57750 - Posted: 4 Nov 2021 | 9:31:48 UTC - in response to Message 57748.

The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.

That makes much more sense. Standing by for the next round of debugging... :-)

bozz4science
Send message
Joined: 22 May 20
Posts: 104
Credit: 21,759,591
RAC: 77,228
Level
Pro
Scientific publications
wat
Message 57751 - Posted: 4 Nov 2021 | 9:47:15 UTC
Last modified: 4 Nov 2021 | 9:47:32 UTC

That's the next bad news for me as my GPU is maxed out at 6GB. Without upgrading my GPU and that's not likely gonna be soon, I suppose I have to give up on these types of tasks - at least for the time being. Thanks for the update though

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57752 - Posted: 4 Nov 2021 | 12:31:04 UTC - in response to Message 57748.

Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory.



why such low GPU utilization? and 8000? or do you mean 800? 8GB? or 800MB?
____________

bozz4science
Send message
Joined: 22 May 20
Posts: 104
Credit: 21,759,591
RAC: 77,228
Level
Pro
Scientific publications
wat
Message 57753 - Posted: 4 Nov 2021 | 12:42:20 UTC
Last modified: 4 Nov 2021 | 12:44:21 UTC

I can only speculate in regards to the former one. But your latter question likely resolves to 8,000 MiB (mebibytes), which is just another convention for counting bytes, if he indeed meant to write 8,000.

While k (kilo), M (Mega), G (Giga) and T (Tera) are the SI-prefix units and are computed as base 10 by 10^3, 10^6, 10^9 and 10^12 respectively, the binary prefix units of Ki (Kibi), Mi (Mebi), Gi (Gibi) and Ti (Tebi) are computed as base 2 by 2^10, 2^20, 2^30 and 2^40. As such M/Mi = (10^6/2^20) ~ 95.37% or a difference of ~4.63% between the SI and binary prefix units.

1 kB = 1000 B
1 KiB = 1024 B
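
To put numbers on it, here is a quick Python sketch of the conversion. The function and constant names are made up for illustration; the arithmetic is just the definitions above.

```python
# Convert between binary (MiB) and decimal (MB/GB) units.
MIB = 2**20   # bytes in one mebibyte
MB = 10**6    # bytes in one megabyte
GB = 10**9    # bytes in one gigabyte


def mib_to_bytes(mib):
    """Number of bytes in `mib` mebibytes."""
    return mib * MIB


def mib_to_gb(mib):
    """Size of `mib` mebibytes expressed in decimal gigabytes."""
    return mib * MIB / GB


if __name__ == "__main__":
    print(mib_to_bytes(8000))           # 8388608000
    print(round(mib_to_gb(8000), 2))    # 8.39
    print(round(MB / MIB * 100, 2))     # 95.37
```

So 8,000 MiB is a little under 8.4 GB in decimal units, which is more than a card sold as "8 GB" usually reports as usable.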

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57754 - Posted: 4 Nov 2021 | 12:52:03 UTC - in response to Message 57753.

yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB
____________

bozz4science
Send message
Joined: 22 May 20
Posts: 104
Credit: 21,759,591
RAC: 77,228
Level
Pro
Scientific publications
wat
Message 57755 - Posted: 4 Nov 2021 | 12:58:04 UTC
Last modified: 4 Nov 2021 | 12:58:33 UTC

ah, all right. didn't mean to offend you if that's what I did. still don't understand their beta testing procedure anyway. so far not many tasks have been run, only few of them successfully, but meanwhile nearly no information has been shared rendering the whole procedure rather intransparent and leaving others in the dark wondering about their piles of unsuccessful tasks. and the little information that is indeed shared seems to conflict a lot with the user experience and observations. for a ML task 8 GB isn't untypical though

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57756 - Posted: 4 Nov 2021 | 13:06:37 UTC - in response to Message 57755.

I agree that lots of memory use wouldn't be atypical for AI/ML work. and also agree that the admins should be a little more transparent about what these tasks are doing and the expected behaviors. it seems so far they have tons and tons of errors, then the admins come back and say they fixed the errors, then just more errors again. I'd also like to know if these are using the Tensor cores on RTX GPUs.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57757 - Posted: 4 Nov 2021 | 13:17:33 UTC

I think the Beta testing process is (as usual anywhere) very much an incremental process. It will have started with small test units, and as each little buglet surfaces and is solved, the process moves on to test a later segment that wasn't accessible until the previous problem had been overcome.

Thus - Abouh has confirmed that yesterday's upload file size problem was caused by including a source data file in the output - "Should not be returned".

I also noted that some of Keith's successful runs were resends of tasks which had failed on other machines - some of them generic problems which I would have expected to cause a failure on his machine too. So it seems that dynamic fixes may have been applied too. Normally, a new BOINC replication task is an exact copy of its predecessor, but I don't think that can be automatically assumed during this Beta phase.

In particular, Keith's observation that one test task only used 200 MB of GPU memory isn't necessarily a foolproof guide to the memory demand of later tests.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57758 - Posted: 4 Nov 2021 | 13:22:50 UTC - in response to Message 57757.

which is why I asked for clarification in light of the disparity between expected and observed behaviors.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57761 - Posted: 4 Nov 2021 | 17:05:37 UTC - in response to Message 57754.

yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB

Yes, I have watched tasks complete fully to a proper boinc finish end and I never saw more than 290MB of gpu memory reported in nvidia-smi at a max 13% utilization.

Unless nvidia-smi has an issue in reporting gpu RAM used, the 8GB of memory post is out of line. Or the tasks the scientist-developer mentioned haven't been released to us out of the laboratory yet.
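
For anyone who would rather log the numbers than watch the screen, here is a rough Python sketch that reads utilization and memory from nvidia-smi. The `--query-gpu` and `--format` options are standard nvidia-smi options; the helper names are made up, and the parser assumes every field comes back as a plain integer (some drivers print `[N/A]` for unsupported fields, which this sketch does not handle).

```python
import subprocess

# Standard nvidia-smi query: one CSV line per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse 'index, util %, MiB used' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed readings (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    # Canned sample so the parser can be tried on a machine without a GPU.
    sample = "0, 13, 290\n1, 0, 5\n"
    for row in parse_gpu_csv(sample):
        print(row)
```

Note that `memory.used` is the total for the whole card, not for one process, so other tasks sharing the GPU are included in the figure.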

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57766 - Posted: 5 Nov 2021 | 10:37:31 UTC - in response to Message 57752.

We are progressing in our debugging and have managed to solve several errors, but as mentioned in a previous post, it is an incremental process.

We are trying to train an AI agent using reinforcement learning, which generally interleaves stages in which the agent collects data (a less GPU-intensive process) and stages in which the agent learns from that data. The nature of the problem, in which data is progressively generated, accounts for a lower GPU utilisation than in supervised machine learning, although we will work to progressively make it more efficient once debugging is completed.

Since the obstacle tower environment (https://github.com/Unity-Technologies/obstacle-tower-env), the source of data, also runs on the GPU, during the learning stage the neural network and the training data together with the environment occupy approximately 8,000 MiB (Mebibyte, was not a typo) of GPU memory when checked locally with nvidia-smi.

Basically, the python script has the following steps:
step 1: Defining the conda environment with all dependencies.
step 2: Downloading obstacletower.zip, a necessary file used to generate the data.
step 3: Initialising the data generator using the contents of obstacletower.zip.
step 4: Creating the AI agent and alternating data collection and data training stages.
step 5: Returning the trained AI agent, and not obstacletower.zip.

The GPU is used only after reaching steps 4 and 5. Some of the jobs that succeeded but barely used the GPU were sent to test that the problems in steps 1 and 2 had indeed been solved (most of them solved by Keith Myers).
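
The alternation in step 4 can be pictured roughly with the toy Python loop below. This is only a structural sketch and not the actual run.py: the "environment" is a random number generator, the "policy" is a single number, and every name in it is hypothetical.

```python
import random


def collect_rollout(policy_bias, num_steps, rng):
    """Data collection stage: act in a toy environment and record rewards."""
    return [rng.random() + policy_bias for _ in range(num_steps)]


def train_on(rollout, policy_bias, lr=0.1):
    """Learning stage: nudge the toy 'policy' using the mean observed reward."""
    mean_reward = sum(rollout) / len(rollout)
    return policy_bias + lr * (mean_reward - policy_bias)


def run(num_iterations=3, num_steps=8, seed=0):
    """Alternate collection and training, recording the order of the stages."""
    rng = random.Random(seed)
    policy_bias = 0.0
    stages = []
    for _ in range(num_iterations):
        rollout = collect_rollout(policy_bias, num_steps, rng)  # low GPU use in a real job
        stages.append("collect")
        policy_bias = train_on(rollout, policy_bias)            # where GPU use would peak
        stages.append("train")
    return stages, policy_bias


if __name__ == "__main__":
    stages, final_bias = run()
    print(stages)
    print(final_bias)
```

The point is only the shape of the loop: utilisation seen in nvidia-smi should rise during the "train" stages and fall back during "collect".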

We noticed that most recent failed jobs returned the following error at step 3:

mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.

We are working to solve it. If step 3 is completed without errors, jobs reaching steps 4 and 5 should be using GPU.

We hope that helped shed some light on our work and the recent results. We will try to solve any further doubts and inform about our progress.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57767 - Posted: 5 Nov 2021 | 12:10:16 UTC - in response to Message 57766.

Thanks for the more detailed answer.

regarding the 8GB of memory used.

-which step of the process does this happen?
-was Keith's nvidia-smi screenshot that he posted in another thread showing low memory use, from an earlier unit that did not require that much VRAM?
-will these units fail from too little VRAM?
-what will you do or are you doing about GPUs with less than 8GB VRAM, or even with 8GB?
-do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB?
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57768 - Posted: 5 Nov 2021 | 13:57:40 UTC - in response to Message 57767.

-do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB?

It's certainly possible to set such a filter: referring again to Specifying plan classes in C++, the scheduler can check a CUDA plan class specification like

if (!strcmp(plan_class, "cuda23")) {
    if (!cuda_check(c, hu,
        100,        // minimum compute capability (1.0)
        200,        // max compute capability (2.0)
        2030,       // min CUDA version (2.3)
        19500,      // min display driver version (195.00)
        384*MEGA,   // min video RAM
        1.,         // # of GPUs used (may be fractional, or an integer > 1)
        .01,        // fraction of FLOPS done by the CPU
        .21         // estimated GPU efficiency (actual/peak FLOPS)
    )) {
        return false;
    }
}

We last discussed that code in connection with compute capability, but I think we're still having problems implementing filters via tools like that.
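
In plain terms, the video RAM test boils down to something like the Python sketch below. This is not BOINC's scheduler code (that is the C++ above); the host records, field names and function name are invented purely for illustration.

```python
MIB = 2**20  # bytes in one mebibyte


def eligible_hosts(hosts, min_vram_mib):
    """Return the hosts whose reported GPU RAM meets the minimum."""
    threshold = min_vram_mib * MIB
    return [host for host in hosts if host["gpu_ram_bytes"] >= threshold]


if __name__ == "__main__":
    # Hypothetical hosts using nominal card sizes; real cards report slightly less.
    hosts = [
        {"name": "host-a", "gpu_ram_bytes": 6144 * MIB},   # nominal 6 GB card
        {"name": "host-b", "gpu_ram_bytes": 8192 * MIB},   # nominal 8 GB card
        {"name": "host-c", "gpu_ram_bytes": 11264 * MIB},  # nominal 11 GB card
    ]
    for host in eligible_hosts(hosts, min_vram_mib=8000):
        print(host["name"])
```

With a threshold of 8,000 MiB, an 8 GB card only just qualifies on paper, so a real filter would probably want some headroom for whatever else is using the card.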

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57769 - Posted: 5 Nov 2021 | 14:50:31 UTC - in response to Message 57767.
Last modified: 5 Nov 2021 | 14:51:06 UTC

At step 3, initialising the environment requires a small amount of GPU memory (somewhere around 1GB). At step 4 the AI agent is initialised and trained, and a data storage class and a neural network are created and placed on the GPU. This is when more memory is required. However, in the next round of tests we will lower the GPU memory requirements of the script while debugging step 3. Eventually for steps 4 and 5 we expect it to require the 8G mentioned earlier.

Keith's nvidia-smi screenshot showing a job with low memory use was a job that returned after step 2, to verify problems in steps 1 and 2 had been solved.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57770 - Posted: 5 Nov 2021 | 17:08:45 UTC - in response to Message 57769.

So was this WU https://www.gpugrid.net/result.php?resultid=32660133 the one that was completed after steps 1 and 2?
Or after steps 4 and 5?

I never got to witness this one in realtime.

I had nvidia-smi polling update set at 1 second and I never saw the gpu memory usage go above 290MB for that screenshot. It was not taken from the task linked above.

The BOINC completion percentage just went to 10% and stayed there and never showed 100% completion when it finished. Think that is an issue with BOINC historically.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57771 - Posted: 5 Nov 2021 | 17:16:26 UTC

The environment and the Python interface have compatible versions.

Is the reason why I was able to complete a workunit properly because of having my local python environment match the zipped wrapper python interface?

I use several pypi applications that have probably set up the python environment variable.

Is there something I can dump out of the host that completed the workunit properly that will help you debug the application package?

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57776 - Posted: 8 Nov 2021 | 14:54:09 UTC - in response to Message 57770.

This one completed the whole python script. Including steps 4 and 5. Should have used the GPU.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57777 - Posted: 8 Nov 2021 | 16:00:17 UTC - in response to Message 57776.

Thanks for confirming the one I completed used the gpu.

Greger
Send message
Joined: 6 Jan 15
Posts: 66
Credit: 6,353,476,182
RAC: 1,383,102
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 57778 - Posted: 9 Nov 2021 | 21:23:30 UTC

Did a check on one host running GPUGridpy units.

e4a6-ABOU_ppo_gym_demos3-0-1-RND1018_0
Run time 4,999.53
GPU Memory: nvidia-smi report 2027MiB

No check-pointing yet but works well.

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57779 - Posted: 10 Nov 2021 | 8:42:17 UTC - in response to Message 57778.
Last modified: 10 Nov 2021 | 8:45:13 UTC

We sent out some jobs yesterday and almost all finished successfully.

We are still working on avoiding the following error related to the Obstacle Tower environment:

mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.

However, to test the rest of the code we tried with another set of environments that are less problematic (https://gym.openai.com/). The successful jobs used these environments. While we find and test a solution for the Obstacle Tower locally we will continue to send jobs with these environments to test the rest of the code.

Note that reinforcement learning (RL) techniques are independent of the environment. The environment represents the world where the AI agent learns intelligent behaviours. Switching to another environment simply means applying the learning technique to a different problem that can be equally challenging (placing the agent in a different world). Thus, we will now finish debugging the app with these Gym environments simply because they are less prone to errors and, once we know the only possible source of problems is the environment, consider solving others.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57782 - Posted: 10 Nov 2021 | 13:47:20 UTC - in response to Message 57779.
Last modified: 10 Nov 2021 | 14:08:41 UTC

I had a few failures:

https://www.gpugrid.net/result.php?resultid=32660680
and
https://www.gpugrid.net/result.php?resultid=32660448

seems to be a bad WU on both instances since all wingmen are erroring in the same way.

mainly used ~6-7% GPU utilization on my 3080Ti, with intermittent spikes to ~20% every 10s or so. power use near idle, GPU memory utilization around 2GB, and system memory use around 4.8GB. make sure your system has enough memory in multi-GPU systems.
____________

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57785 - Posted: 10 Nov 2021 | 14:48:30 UTC - in response to Message 57782.

Thank you for the feedback. We had detected the error in

https://www.gpugrid.net/result.php?resultid=32660448

but not the one in

https://www.gpugrid.net/result.php?resultid=32660680


Having alternating phases of lower and higher GPU utilisation is normal in Reinforcement Learning, as the agent alternates between data collection (generally low GPU usage) and training (higher GPU memory and utilisation). Once we solve most of the errors we will focus on maximizing GPU efficiency during the training phases.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57786 - Posted: 10 Nov 2021 | 15:04:20 UTC - in response to Message 57785.
Last modified: 10 Nov 2021 | 15:09:17 UTC

have you considered creating a modified app that will use the RTX (and other) GPU's onboard Tensor cores? it should speed up things considerably.

https://www.quora.com/Does-tensorflow-and-pytorch-automatically-use-the-tensor-cores-in-rtx-2080-ti-or-other-rtx-cards

I'm guessing in addition to making the needed configuration changes, you'd need to adjust your scheduler to only send to cards with Tensor cores (GeForce RTX cards, TitanV, Tesla/QuadroRTX cards from Volta forward)
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57789 - Posted: 10 Nov 2021 | 16:28:26 UTC - in response to Message 57786.
Last modified: 10 Nov 2021 | 16:30:20 UTC

information for pytorch here:

https://github.com/NVIDIA/apex
https://nvidia.github.io/apex/
____________

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57790 - Posted: 10 Nov 2021 | 17:04:13 UTC - in response to Message 57786.

We are using PyTorch to train our agents, and for now we have not considered using mixed precision, which seems to be required for the Tensor cores.

It could be an interesting possibility to reduce memory requirements and speed up training processes. I have to admit that I do not know how it affects performance in reinforcement learning algorithms, but it is an interesting option.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57792 - Posted: 10 Nov 2021 | 18:53:33 UTC
Last modified: 10 Nov 2021 | 19:48:25 UTC

Getting errors in the test5 run, like

e2a16-ABOU_ppod_gym_test5-0-1-RND0379_1
e2a10-ABOU_ppod_gym_test5-0-1-RND0874_1

And on the test6 run. This time, the error seems to be in placing the expected task files in the slot directory, prior to starting the main run.

e3a17-ABOU_ppod_gym_test6-0-1-RND2029_0
e3a11-ABOU_ppod_gym_test6-0-1-RND1260_4

Both have

  File "run.py", line 393, in <module>
    main()
  File "run.py", line 106, in main
    feature_extractor_network=get_feature_extractor(args.nn),
  File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/pytorchrl/agent/actors/feature_extractors/__init__.py", line 19, in get_feature_extractor
    raise ValueError("Specified model not found!")
ValueError: Specified model not found!

mmonnin
Send message
Joined: 2 Jul 16
Posts: 305
Credit: 1,220,122,726
RAC: 822,766
Level
Met
Scientific publications
watwatwatwatwat
Message 57794 - Posted: 10 Nov 2021 | 23:56:58 UTC
Last modified: 10 Nov 2021 | 23:57:13 UTC

I got one that worked today. Then 6 more that didn't on the same PC
https://www.gpugrid.net/workunit.php?wuid=27086033

mmonnin
Send message
Joined: 2 Jul 16
Posts: 305
Credit: 1,220,122,726
RAC: 822,766
Level
Met
Scientific publications
watwatwatwatwat
Message 57795 - Posted: 11 Nov 2021 | 2:04:09 UTC

I got another. So far it is running
Over 4 CPU threads at 1st then 1 thread for 1st 4min
13% completed back to 10% then no more progression
At 10% then GPU load at 3-5%, 875 MB VRAM
78min so far.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 459
Credit: 2,120,359,742
RAC: 691,039
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57796 - Posted: 11 Nov 2021 | 6:49:43 UTC

I've got several GPU Python beta tasks at my triple GPU Host #480458
Several of them have succeeded after around 5000 seconds execution time.
But three of these tasks have exceeded this time.
Task e1a20-ABOU_ppod_gym_test-0-1-RND4563_6 failed after 11432 seconds.
Task e1a6-ABOU_ppod_gym_test-0-1-RND1186_1 failed after 18784 seconds.
Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.
This last task is theoretically running at device 1.
But it seems to be effectively running at device 0, sharing the same device with an ACEMD3 regular task e14s132_e10s98p1f905-ADRIA_AdB_KIXCMYB_HIP-0-2-RND7676_5.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57797 - Posted: 11 Nov 2021 | 7:13:49 UTC

I've got the same thing going on. BOINC says the task is running on Device2 while in reality it is sharing Device0 along with an Einstein GRP task.

This is the task https://www.gpugrid.net/result.php?resultid=32661276

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 459
Credit: 2,120,359,742
RAC: 691,039
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57798 - Posted: 11 Nov 2021 | 9:30:27 UTC - in response to Message 57796.

Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.

The risk of beta testing: It finally failed after 42555 seconds.
I hope this is somehow useful for debugging...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57799 - Posted: 11 Nov 2021 | 9:50:17 UTC - in response to Message 57798.

Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'

The same for two of your predecessors on this workunit. Is there any way we could avoid re-inventing the wheel (slowly) for errors like this?

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57800 - Posted: 11 Nov 2021 | 13:44:16 UTC - in response to Message 57799.
Last modified: 11 Nov 2021 | 13:48:31 UTC

The excessively long training time problem and the problem related to

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'

have been fixed now. Most jobs sent today are being completed successfully. The reported issues were very helpful for debugging.




Progress:

The core research idea is to train populations of reinforcement learning agents that learn independently for a certain amount of time and, once they return to the server, put their learned knowledge in common with other agents to create a new generation of agents equipped with the information acquired by previous generations. Each GPUgrid job is one of these agents doing some training independently. In that sense, the first 4 letters of the job name identify the generation and the number of the agent (i.e. e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 refers to the epoch or generation number 1 and the agent number 2 within that generation).
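
As an illustration of that naming scheme, a small Python sketch that pulls the generation and agent numbers out of a job name. The regular expression and function name are not part of the app; they only assume the `e<generation>a<agent>-` prefix described above, and they allow more than one digit in each number.

```python
import re

# Prefix of the form e<generation>a<agent>-, e.g. "e1a2-" or "e8a16-".
JOB_NAME = re.compile(r"^e(?P<epoch>\d+)a(?P<agent>\d+)-")


def parse_job_name(name):
    """Return (generation, agent) parsed from a job name."""
    match = JOB_NAME.match(name)
    if match is None:
        raise ValueError("unrecognised job name: %r" % name)
    return int(match.group("epoch")), int(match.group("agent"))


if __name__ == "__main__":
    print(parse_job_name("e1a2-ABOU_ppod_gym_test-0-1-RND2391_3"))    # (1, 2)
    print(parse_job_name("e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0"))  # (8, 16)
```

Names that do not follow the pattern, such as ACEMD3 task names, raise a ValueError rather than being misread.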

The debugging done recently has allowed more and more of these jobs to finish. An experiment currently running has already reached a 3rd generation of agents.

As mentioned in an earlier post, we are working now with OpenAI gym environments (https://gym.openai.com/)
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57801 - Posted: 11 Nov 2021 | 15:48:48 UTC - in response to Message 57800.
Last modified: 11 Nov 2021 | 15:54:02 UTC

Are you working on fixing the issue that the tasks only run on Device#0 in BOINC?

Even when Device#0 is already occupied by another task from another project?

That leaves at least one device doing nothing because BOINC thinks it is occupied.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 459
Credit: 2,120,359,742
RAC: 691,039
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57802 - Posted: 11 Nov 2021 | 17:15:48 UTC - in response to Message 57801.
Last modified: 11 Nov 2021 | 17:16:26 UTC

Are you working on fixing the issue that the tasks only run on Device#0 in BOINC?

+1

At this other example, Device 0 is running 1 Gpugrid ACEMD3 task and 2 Python GPU tasks.
Meanwhile, Device 1 and Device 2 remain idle.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 539
Credit: 4,446,686,357
RAC: 467,223
Level
Arg
Scientific publications
wat
Message 57803 - Posted: 11 Nov 2021 | 17:25:26 UTC - in response to Message 57802.

weird, I thought this problem had been fixed already. I guess I never realized since I've only been running the beta tasks on my single GPU system.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57804 - Posted: 11 Nov 2021 | 17:31:21 UTC
Last modified: 11 Nov 2021 | 17:57:23 UTC

Count me in on this, too.

My client is running e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0 on device 1.

I have GPUGrid excluded from device 0, so I can run tasks from other projects in the faster PCIe slot while testing. But ...



Well, despite running on the wrong card, it finished and passed the GPUGrid validation test. I've swapped over the exclusion, and BOINC and GPUGrid are now in agreement that card 0 is the card to use.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57805 - Posted: 11 Nov 2021 | 18:32:50 UTC

Hard to tell from the error code snippet whether the tasks are hardwired to run on Device#0 or whether the error snippet is just the result of where the task actually has run.

[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57817 - Posted: 12 Nov 2021 | 19:33:08 UTC
Last modified: 12 Nov 2021 | 19:39:58 UTC

Well, I have a new python task running by itself now on Device#2.
So it may mean they have fixed the issue where the tasks always ran on Device#0.

See this new output in the stderr.txt that looks like it is allocating to Device#2
It hasn't been there in any other of my tasks till just now for this new task.

Found GPU: True, Number 2 - 2
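
For illustration only, here is a Python sketch of one way an app could honour the device BOINC assigns. It assumes the client hands the app a `--device N` command-line argument, which is the usual BOINC convention for GPU apps; whether the GPUGRID wrapper forwards it this way is an assumption, and the function names are hypothetical. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable and has to be set before CUDA is initialised.

```python
import argparse
import os


def pick_device(argv):
    """Read the GPU index from a '--device N' argument, defaulting to 0."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)  # ignore any other arguments
    return args.device


def restrict_to_device(device_index, environ=None):
    """Expose only the assigned GPU, so 'cuda:0' inside the process maps to it."""
    environ = os.environ if environ is None else environ
    environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return environ


if __name__ == "__main__":
    index = pick_device(["--device", "2"])
    env = restrict_to_device(index, environ={})  # use a scratch dict for the demo
    print(index, env["CUDA_VISIBLE_DEVICES"])
```

One design note: CUDA's own device numbering does not always match the order shown by nvidia-smi, so the index BOINC reports and the card seen in nvidia-smi can differ unless the ordering is pinned.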

abouh
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 31 May 21
Posts: 15
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 57818 - Posted: 12 Nov 2021 | 19:51:54 UTC - in response to Message 57817.

Yes, we have fixed the issue. It should be fine now. Please, let us know if you encounter any new device placement error. We just ran the tests and, as you mention, we print the device number in the stderr file.

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57819 - Posted: 12 Nov 2021 | 20:08:22 UTC - in response to Message 57818.

Thank you for fixing this issue. I don't know whether you test in a multi-gpu environment or not. I suspect a lot of projects don't.

But there are lots of us that run many multi-gpu hosts that have been bit by this bug often.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 459
Credit: 2,120,359,742
RAC: 691,039
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57821 - Posted: 12 Nov 2021 | 20:15:16 UTC - in response to Message 57818.

Thank you very much for your continuous support.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 459
Credit: 2,120,359,742
RAC: 691,039
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57824 - Posted: 13 Nov 2021 | 10:44:54 UTC
Last modified: 13 Nov 2021 | 10:46:38 UTC

Overnight, every one of my 6 currently active Linux hosts received at least one task of the kind ...-ABOU_ppod_gym_test9-0-1-...
All the tasks gave a valid result, none of them errored. This is promising!

My triple-GPU host happened to receive several tasks in a short time, and three of them were executed concurrently.
It caught my attention that there was a drastic change in overall system temperatures when transitioning from executing highly GPU/CPU intensive PrimeGrid tasks to the Gpugrid tasks.

On the other hand, every GPU was effectively executing its own task, as shown at the following nvidia-smi screenshot:



This confirms the Keith Myers observation that the previous task-to-GPU assignment problem in multi-GPU systems is solved. Well done!

mmonnin
Send message
Joined: 2 Jul 16
Posts: 305
Credit: 1,220,122,726
RAC: 822,766
Level
Met
Scientific publications
watwatwatwatwat
Message 57827 - Posted: 13 Nov 2021 | 13:27:09 UTC
Last modified: 13 Nov 2021 | 13:27:24 UTC

I enabled Python on a 2nd PC with a 1070 and 1080 and they all error out

https://www.gpugrid.net/result.php?resultid=32662330
Output in format: Requested package -> Available versions

Then it lists tons of packages and versions.

When I check the Python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.

I'm guessing there is some incompatibility between packages I have installed?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57828 - Posted: 13 Nov 2021 | 17:52:56 UTC

You needn't install any packages. The tasks are entirely packaged with everything they need in the work unit bundle.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 305
Credit: 1,220,122,726
RAC: 822,766
Level
Met
Scientific publications
watwatwatwatwat
Message 57830 - Posted: 13 Nov 2021 | 23:15:03 UTC

Supposedly, but then they should work. Another PC of mine also with Ubuntu 18.04, driver 470 and Pascal arch works OK. These tasks were all completed by others.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 808
Credit: 1,073,789,831
RAC: 724,574
Level
Met
Scientific publications
watwatwatwatwat
Message 57831 - Posted: 14 Nov 2021 | 1:34:01 UTC

I can only guess the tasks are getting confused between the old, locally installed Python 2.7 libraries and the Python 3.8 contained in the bundle.

Python 2.7 is deprecated in current Linux distributions, which now ship Python 3.6 at minimum.

You might want to either uninstall Python or upgrade it to the 3 series. I don't think uninstalling is desirable, though, as I believe a lot of stock applications are Python-based and you would lose those.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 790
Credit: 1,561,693,721
RAC: 81,844
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57832 - Posted: 14 Nov 2021 | 2:27:31 UTC - in response to Message 57831.

I think you can uninstall python2 without damage. At least I could on Ubuntu 20.04.3, though I had only BOINC and Folding installed on it.
But I then made the mistake of trying to purge all python versions. It made the system unbootable, and I had to re-install it.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 459
Credit: 2,120,359,742
RAC: 691,039
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57833 - Posted: 14 Nov 2021 | 9:12:32 UTC - in response to Message 57827.

When I check the Python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.

To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands:

sudo apt install python-is-python3
sudo apt install python3-pip

And after that, you can uninstall unnecessary old packages with the command:

sudo apt autoremove

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57834 - Posted: 14 Nov 2021 | 10:24:54 UTC

I also have two closely similar Linux machines:

132158
508381

Don't be fooled by the host IDs: 132158 is an inherited ID from an earlier generation of hardware, and is actually slightly younger than 508381. Both run the same version of Linux Mint 20.2, installed from the same ISO download, and the same basic software environment - but I do make tweaks to the installed packages separately, as I encounter different testing needs.

Yesterday, I was away from home, but both machines downloaded tasks from the ppod_gym_test9 batch. 132158 failed to run them, 508381 succeeded.

The problem occurs during the learner.step in Python, with a ValueError raised at line 55 during initialisation:

File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (146, 8)) of distribution Normal(loc: torch.Size([146, 8]), scale: torch.Size([146, 8])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
grad_fn=<AddmmBackward0>)

The two 'file extraction' logs for the GPUGrid Python download seem to be different. I'll try to compare the software environment of the two machines and work out where the difference is coming from.
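
The traceback shows that loc was already all NaN before the Normal distribution was built, so the ValueError is only where the problem got noticed, not where it started. As a rough illustration of checking for that earlier, here is a pure-Python sketch working on nested lists. It is a stand-in, not the project's code; on real tensors one would use something like torch.isnan or torch.isfinite instead, and the function names here are hypothetical.

```python
import math


def first_non_finite(rows):
    """Return (row, col) of the first NaN/inf entry in a 2-D list, or None."""
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                return (r, c)
    return None


def check_loc(rows):
    """Fail early, with a location, instead of deep inside the distribution."""
    bad = first_non_finite(rows)
    if bad is not None:
        raise ValueError(f"non-finite value at row {bad[0]}, column {bad[1]}")
    return rows
```

Calling a check like this right after the network's forward pass would at least narrow down which step first produced the NaN values.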

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1244
Credit: 3,342,731,168
RAC: 763,242
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57838 - Posted: 14 Nov 2021 | 17:49:02 UTC - in response to Message 57834.

Well, I've looked through the software installations for both machines, but I can't see any significant differences. Both have Python 3.8 installed (probably with the operating system), and no sign of any Python 2.x; I've installed a few sundries from terminal (libboost, git, some 32-bit libs for CPDN), but the same list on both machines.

The 'file extraction' logs are different for every task, and sometimes the same filename appears more than once (is duplicated) in the list for a single task.

For the tasks I ran successfully on host 508381, that was the only host that attempted them. The tasks that failed on host 132158 were issued to the full limit of 8 hosts, and failed on all of them.

I can only assume that the difference between success and failure resulted from differences in the task data make-up, and not from differences in the installed software on my hosts.
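
The reasoning here can be written down as a small Python sketch: a task that fails on every host it was sent to points at the task, while a task that fails on some hosts and succeeds on others points at the hosts. The function name and the outcome format are made up for illustration, and any host IDs other than 132158 and 508381 would be placeholders.

```python
def classify_task(outcomes):
    """Classify one task from a dict of host_id -> True (valid) / False (error)."""
    if not outcomes:
        return "no data"
    results = set(outcomes.values())
    if results == {True}:
        return "succeeded on every host"
    if results == {False}:
        if len(outcomes) == 1:
            return "failed on its only host"  # cannot tell task from host
        return "failed on every host"         # suggests the task itself
    return "mixed results"                    # suggests host differences
```

A task that only one host ever attempted says little either way, which is why the tasks that went to 8 hosts and failed on all of them are the more telling ones.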

mmonnin
Send message
Joined: 2 Jul 16
Posts: 305
Credit: 1,220,122,726
RAC: 822,766
Level
Met
Scientific publications
watwatwatwatwat
Message 57840 - Posted: 15 Nov 2021 | 2:59:24 UTC - in response to Message 57833.
Last modified: 15 Nov 2021 | 3:00:56 UTC

When I check the Python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.

To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands:

sudo apt install python-is-python3
sudo apt install python3-pip

And after that, you can uninstall unnecessary old packages with the command:

sudo apt autoremove


The python-is command didn't work.

So I followed the instructions here starting with Option 1
https://phoenixnap.com/kb/how-to-install-python-3-ubuntu

At the end I ran python --version to check. Still 2.7.17, even though the install seemed to complete.

So I tried Option 2, installing from source. That worked OK too, with 3.7.5.
I got to the end and saw the note about checking for specific versions. Uh, oh.

python --version = 2.7.17
python3 --version = 3.6.9
python3.7 --version = 3.7.5

So now I have 3 versions installed haha. Maybe one will work, dunno. But we'll need some more tasks to find out.
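
To see at a glance which interpreter each command name actually resolves to, a short Python sketch like the following can help. The list of names is just the ones mentioned in this post plus python2, and the function name is made up. Keep in mind that the tasks bundle their own Python 3.8 (see the gpugridpy/lib/python3.8 path in the traceback earlier in the thread), so this only describes the system side.

```python
import shutil
import subprocess


def interpreter_report(names=("python", "python2", "python3", "python3.7")):
    """Map each command name to (path, version string), or None if not found."""
    report = {}
    for name in names:
        path = shutil.which(name)
        if path is None:
            report[name] = None
            continue
        proc = subprocess.run([path, "--version"], capture_output=True, text=True)
        # Python 2 prints its version to stderr, Python 3 to stdout.
        version = (proc.stdout or proc.stderr).strip()
        report[name] = (path, version)
    return report

# Usage:
#   for name, info in interpreter_report().items():
#       print(name, "->", info)
```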
