Message boards : News : ACEMD updated app

Author Message
Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 59700 - Posted: 10 Jan 2023 | 9:53:01 UTC

As I said, we are currently compiling the Windows version.

GDF

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59708 - Posted: 10 Jan 2023 | 15:40:06 UTC - in response to Message 59700.

might as well compile it for CUDA 11.8 to bring Ada (40-series) support.
____________

HZL
Send message
Joined: 23 Nov 08
Posts: 1
Credit: 612,500
RAC: 0
Level
Gly
Scientific publications
wat
Message 59720 - Posted: 15 Jan 2023 | 10:42:31 UTC - in response to Message 59700.

Hello everyone! I'm in Shanghai, China. How can I get the GPU to work at 100%? I've noticed that while tasks are running, GPU usage stays at around 30%!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59722 - Posted: 15 Jan 2023 | 17:31:49 UTC - in response to Message 59720.

Hello everyone! I'm in Shanghai, China. How can I get the GPU to work at 100%? I've noticed that while tasks are running, GPU usage stays at around 30%!


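This is normal for this Python app. The Python app uses the CPU more than the GPU, so GPU usage is limited by the CPU. If you run two tasks at the same time, you can raise GPU usage. But with this Python app you cannot get the GPU to 100%.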
____________

guoyeah
Send message
Joined: 17 Mar 10
Posts: 1
Credit: 5,362,500
RAC: 0
Level
Ser
Scientific publications
wat
Message 59725 - Posted: 17 Jan 2023 | 3:35:22 UTC - in response to Message 59720.

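My Nvidia card gets to 80%. I'm also running other projects at the same time, on the CPU (20%) and on the Intel GPU (97%). After setting the power plan to best performance, the CPU goes up to 50%. It's a 12th-gen Intel i7.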

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59734 - Posted: 18 Jan 2023 | 15:34:32 UTC
Last modified: 18 Jan 2023 | 15:52:09 UTC

Looking around I see the present batch of protein ligand sims are crashing... DARNIT!


process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
22:58:08 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper: running /bin/bash (run.sh)
/bin/bash: run.sh: No such file or directory
22:58:26 (3209098): /bin/bash exited; CPU time 0.001795
22:58:26 (3209098): app exit status: 0x7f
22:58:26 (3209098): called boinc_finish(195)

anything else found?
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"


Piasa Tribe - Illini Nation

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59735 - Posted: 18 Jan 2023 | 16:20:48 UTC - in response to Message 59734.

Looking around I see the present batch of protein ligand sims are crashing... DARNIT!


process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
22:58:08 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper: running /bin/bash (run.sh)
/bin/bash: run.sh: No such file or directory
22:58:26 (3209098): /bin/bash exited; CPU time 0.001795
22:58:26 (3209098): app exit status: 0x7f
22:58:26 (3209098): called boinc_finish(195)

anything else found?


if someone can preserve the data files and slot directory before it gets uploaded and subsequently wiped from your system, should be easy to figure out what's wrong.

my guess is they didn't name that run.sh file properly (via open_name probably), or didn't add a task to extract the file in the wrapper config file (job.xml), or something along those lines.
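
For anyone who wants to preserve a slot directory before the client wipes it, a minimal sketch follows. It only assumes that the slot is an ordinary directory that can be copied; the function name `snapshot_slot` and the paths in the usage comment are hypothetical and need to be adapted to the local BOINC installation.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir: Path, backup_root: Path) -> Path:
    """Copy a BOINC slot directory into a timestamped backup folder.

    Returns the path of the copy so it can be inspected later.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{slot_dir.name}-{stamp}"
    # symlinks=True copies links as links instead of failing on dangling ones.
    shutil.copytree(slot_dir, target, symlinks=True)
    return target


# Hypothetical usage (both paths are assumptions, not taken from this thread):
# snapshot_slot(Path("/var/lib/boinc-client/slots/0"), Path.home() / "slot-backups")
```

Running it while the task is still starting up (or with the client suspended) gives the best chance of catching the job.xml, run.sh and input files intact.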
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59736 - Posted: 18 Jan 2023 | 16:42:18 UTC - in response to Message 59735.

actually I have some on my system so i took a look.

there appear to be many things wrong.

the job.xml file is calling just tar, with no reference to what tar is. this should probably be /bin/tar to use the system tar.

the extracted run.sh script looks woefully lacking in detail. i can see it trying to call python and conda from 'bin/' but that is not included in the input package and will fail. the input tarball only includes some text/config files and not the whole python package.
____________

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 59738 - Posted: 18 Jan 2023 | 17:04:57 UTC - in response to Message 59736.

What app exactly?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59739 - Posted: 18 Jan 2023 | 17:08:38 UTC - in response to Message 59738.
Last modified: 18 Jan 2023 | 17:09:18 UTC

What app exactly?


the new free energy one ('ATM' moniker). using the wrapper to call the run.sh script.

also it would be a good idea to add a checkbox for this app in project preferences. this app showed up with no warning and no announcement from the project, and with no way to opt out of it, it seems. I'm not sure if it's marked as beta or not.
____________

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 59740 - Posted: 18 Jan 2023 | 17:11:01 UTC - in response to Message 59739.

Yes, we should have made a beta, but this app is not related to this thread.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59741 - Posted: 18 Jan 2023 | 17:12:16 UTC - in response to Message 59740.

Yes, we should have made a beta, but this app is not related to this thread.


you're right, but there is no announcement thread for this app, so there is nowhere else appropriate in the News section to get your attention about it.
____________

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 59746 - Posted: 18 Jan 2023 | 17:22:45 UTC - in response to Message 59741.

Soon we will announce it. This is just testing to see if it works which should have been done on a beta app.

I expect tons of workunits using this app. Soon I will introduce a new postdoc running the simulations.

g

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59749 - Posted: 18 Jan 2023 | 17:53:44 UTC - in response to Message 59746.
Last modified: 18 Jan 2023 | 17:54:51 UTC

interesting to see that Ada "should" run on the Ampere cubins. I know the app has an architecture compatibility check, and it may fail there even if it could otherwise work.

you could also consider compiling your apps with the PTX version for forward compatibility

like this:
-gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_86,code=compute_86

and the user can set the environment variable as needed. or you could set it in the wrapper config file
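
To make the flag pattern above concrete, here is a small sketch that generates such `-gencode` options for a list of compute capabilities. The helper name `gencode_flags` is made up for illustration; the only thing taken from the post is the flag syntax itself. The `code=sm_XX` entries embed native binaries, and the final `code=compute_XX` entry embeds PTX, which a newer driver can JIT-compile for a GPU that has no matching binary.

```python
def gencode_flags(capabilities, ptx_for_newest=True):
    """Build nvcc -gencode flags from compute capabilities such as "8.6".

    One native (sm_XX) entry is emitted per capability. If ptx_for_newest
    is set, a PTX (compute_XX) entry is added for the highest capability.
    """
    archs = sorted({cc.replace(".", "") for cc in capabilities}, key=int)
    flags = [f"-gencode=arch=compute_{a},code=sm_{a}" for a in archs]
    if ptx_for_newest and archs:
        newest = archs[-1]
        flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags


# Reproduces the two flags quoted in the post:
for flag in gencode_flags(["8.6"]):
    print(flag)
```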
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1100
Credit: 7,543,057,676
RAC: 6,910,786
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59758 - Posted: 19 Jan 2023 | 6:02:06 UTC
Last modified: 19 Jan 2023 | 6:33:00 UTC

I am successfully running the current ACEMD_3 tasks on a GTX980ti, on a Quadro P5000, and on two RTX3070.
However, they fail on a GTX1650 after a few seconds:

https://www.gpugrid.net/result.php?resultid=33263379
https://www.gpugrid.net/result.php?resultid=33263343

can anyone tell me what might be the reason?

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 568
Credit: 7,113,142,024
RAC: 9,535,198
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 59759 - Posted: 19 Jan 2023 | 6:32:46 UTC - in response to Message 59758.

As a first step, you can try resetting the GPUGRID project on the failing host.
But the likely reason is that 4 GB of RAM is too little to run these tasks.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1100
Credit: 7,543,057,676
RAC: 6,910,786
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59760 - Posted: 19 Jan 2023 | 6:40:38 UTC - in response to Message 59759.
Last modified: 19 Jan 2023 | 6:41:41 UTC

...
But probably the reason is 4GB RAM being too short for executing these tasks.

that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59763 - Posted: 19 Jan 2023 | 13:15:42 UTC - in response to Message 59760.

...
But probably the reason is 4GB RAM being too short for executing these tasks.

that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).


it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.
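
A quick way to check for AVX on a Linux host is to look at the `flags` line of `/proc/cpuinfo`. The sketch below parses text in that format; the sample strings in the usage comment are hypothetical and shortened, and on Windows a different source of CPU feature information would be needed.

```python
def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears in the first "flags" line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Split into whole words so that "avx" does not match "avx2".
            return flag in value.split()
    return False


# Hypothetical usage on Linux:
# with open("/proc/cpuinfo") as f:
#     print(has_cpu_flag(f.read(), "avx"))
```

This only tells whether the CPU advertises AVX; whether the ACEMD build actually requires it is still something only the project can confirm.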
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1100
Credit: 7,543,057,676
RAC: 6,910,786
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 59766 - Posted: 19 Jan 2023 | 15:27:18 UTC - in response to Message 59763.


it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.

perhaps one of the GPUGRID people could tell me if this is the case?

Ryan Munro
Send message
Joined: 6 Mar 18
Posts: 33
Credit: 548,272,577
RAC: 1,665,048
Level
Lys
Scientific publications
wat
Message 59767 - Posted: 19 Jan 2023 | 15:58:18 UTC

Just had one and it failed after 26 seconds on my 4090

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59768 - Posted: 19 Jan 2023 | 16:10:32 UTC - in response to Message 59767.

Just had one and it failed after 26 seconds on my 4090


are the Python tasks working on your 4090? or were those run on a different GPU?
____________

Ryan Munro
Send message
Joined: 6 Mar 18
Posts: 33
Credit: 548,272,577
RAC: 1,665,048
Level
Lys
Scientific publications
wat
Message 59769 - Posted: 19 Jan 2023 | 16:57:17 UTC - in response to Message 59768.

Python run fine on my 4090, though they don't do much at all, all the work seems to be on the CPU.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59770 - Posted: 19 Jan 2023 | 17:19:30 UTC - in response to Message 59769.

Python run fine on my 4090, though they don't do much at all, all the work seems to be on the CPU.


Thanks.

could you please report your failed task? click update on BOINC for GPUGRID to send back the result. I'd like to see the nature of the failure, to see if the architecture check is the reason for failure.
____________

Ryan Munro
Send message
Joined: 6 Mar 18
Posts: 33
Credit: 548,272,577
RAC: 1,665,048
Level
Lys
Scientific publications
wat
Message 59772 - Posted: 19 Jan 2023 | 18:16:15 UTC - in response to Message 59770.

Done :)

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 59774 - Posted: 19 Jan 2023 | 19:37:49 UTC

Looks like the application does not understand the 4090 architecture. Needs to be recompiled with the gencodes that Ian pointed out.

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59776 - Posted: 20 Jan 2023 | 1:13:30 UTC - in response to Message 59766.
Last modified: 20 Jan 2023 | 1:22:56 UTC


it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.

perhaps one of the GPUGRID people could tell me if this is the case?


Maybe you can tell (if you can run an ACEMD3 app on another host that is AVX-enabled) by setting the AVX offset in the BIOS of a capable host and then checking whether the processor speed corresponds while running the wrapper (with no other WU).

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 59777 - Posted: 20 Jan 2023 | 1:48:29 UTC - in response to Message 59760.
Last modified: 20 Jan 2023 | 1:59:17 UTC

...
But probably the reason is 4GB RAM being too short for executing these tasks.

that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).


interesting, larrywhitehead's 1060 3GB also does not seem to want to do these tasks

https://www.gpugrid.net/results.php?hostid=493191

only a vague stderr message

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
23:38:59 (9616): wrapper (7.9.26016): starting
23:38:59 (9616): wrapper: running bin/acemd3.exe (--boinc --device 0)
23:39:01 (9616): bin/acemd3.exe exited; CPU time 0.000000
23:39:01 (9616): app exit status: 0xc0000135
23:39:01 (9616): called boinc_finish(195)


Yet I only observe a little over 2GB graphics memory being utilized max so far on my hosts.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 59778 - Posted: 20 Jan 2023 | 5:26:53 UTC - in response to Message 59774.

Looks like the application does not understand the 4090 architecture. Needs to be recompiled with the gencodes that Ian pointed out.

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

That’s exactly what I thought would happen. I had the same experience with some other people trying to run the Einstein CUDA BRP7 app. Didn’t work on 11.7 but did work once I compiled it for 11.8 with gencode defined for CC 8.9
____________

catavalon21
Send message
Joined: 1 Feb 09
Posts: 4
Credit: 308,439,090
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwat
Message 60026 - Posted: 6 Mar 2023 | 21:57:50 UTC

Is ACEMD3 not yet supporting the NV 4k architecture on W10? This is a 4070 Ti with the CUDA 1121 app.

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Scientific publications
wat
Message 60027 - Posted: 6 Mar 2023 | 22:15:24 UTC - in response to Message 60026.

Is ACEMD3 not yet supporting the NV 4k architecture on W10? This is a 4070 Ti with the CUDA 1121 app.

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


That’s correct. The current CUDA 11.2.1 (cuda1121) app does not support the Ada 4000 series.
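
The version logic behind this answer can be sketched as a small lookup. The minimum-toolkit table below is an assumption written from memory of NVIDIA's release notes (the 11.8 requirement for compute capability 8.9 is the one mentioned earlier in this thread), and reading the plan class `cuda1121` as CUDA 11.2 patch 1 is likewise an assumption about the naming scheme. Both function names are hypothetical.

```python
# Minimum CUDA toolkit (major, minor) per compute capability.
# Assumed values; check NVIDIA's documentation before relying on them.
MIN_CUDA_FOR_CC = {
    (8, 6): (11, 1),  # Ampere consumer cards, e.g. RTX 3070
    (8, 9): (11, 8),  # Ada cards, e.g. RTX 4090 / 4070 Ti
}


def parse_plan_class(plan_class: str):
    """Turn a plan class like "cuda1121" into a (major, minor) tuple.

    Assumes the layout cuda<2-digit major><1-digit minor><patch>.
    """
    digits = plan_class.removeprefix("cuda")
    return int(digits[:2]), int(digits[2])


def toolkit_supports(cuda_version, compute_capability) -> bool:
    """Return True if the toolkit version is new enough for the GPU."""
    needed = MIN_CUDA_FOR_CC[compute_capability]
    return cuda_version >= needed


print(toolkit_supports(parse_plan_class("cuda1121"), (8, 9)))
```

A toolkit that predates an architecture cannot accept its name as a `--gpu-architecture` value, which matches the nvrtc error quoted above.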
____________

catavalon21
Send message
Joined: 1 Feb 09
Posts: 4
Credit: 308,439,090
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwat
Message 60033 - Posted: 8 Mar 2023 | 1:15:52 UTC - in response to Message 60027.

Is ACEMD3 not yet supporting the NV 4k architecture on W10? This is a 4070 Ti with the CUDA 1121 app.

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)


That’s correct. The current CUDA 11.2.1 (cuda1121) app does not support the Ada 4000 series.


Thanks for confirming.

oemuser
Send message
Joined: 18 Sep 16
Posts: 10
Credit: 1,291,979
RAC: 0
Level
Ala
Scientific publications
wat
Message 60109 - Posted: 17 Mar 2023 | 14:20:18 UTC

I got an ACEMD 3 task for my GTX 1080 Ti on Windows (2oiq-ADRIA_KDMD_1k_test_3809-0-1-RND9959).
The GPU stays at a very low clock speed of 750 MHz, with the VRAM at 800 MHz. I expected about 2x that GPU clock and 7x that VRAM clock. Or would higher clocks not bring any advantage?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60898 - Posted: 20 Dec 2023 | 21:25:27 UTC
Last modified: 20 Dec 2023 | 21:40:12 UTC

I see that a new acemd3 app was published yesterday for the Linux hosts in an attempt to fix the expired Acellera licensing issue.

Unfortunately, the app is still not working and new work is still failing, this time with more information: a problem with the Python packaging of the job files.

https://www.gpugrid.net/result.php?resultid=33722983

Looks like they've moved away from a standalone acemd3 binary which is what was used in the past work.

Looks like they tried to just use the Windows code and of course failed when it tried to use the Windows-only msvcrt Python module.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 568
Credit: 7,113,142,024
RAC: 9,535,198
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60904 - Posted: 26 Dec 2023 | 10:22:18 UTC - in response to Message 60898.
Last modified: 26 Dec 2023 | 10:26:09 UTC

Looks like they tried to just use the Windows code and of course failed when it tried to use the Windows-only msvcrt Python module.

It seems that you're right.
And this is still waiting to be addressed for Linux hosts:

Name 0_0-CRYPTICSCOUT_pocket_discovery_c82914d2_15b4_4300_b4db_cb72998e09bf-6-7-RND0445_6
Workunit 27641639
Created 26 Dec 2023 | 9:50:25 UTC
Sent 26 Dec 2023 | 9:50:26 UTC
Received 26 Dec 2023 | 9:57:14 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 186626
Report deadline 31 Dec 2023 | 9:50:26 UTC
Run time 23.07
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version ACEMD 3: molecular dynamics simulations for GPUs v2.21 (cuda1121)

Stderr output

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
09:55:04 (339849): wrapper (7.7.26016): starting
09:55:25 (339849): wrapper (7.7.26016): starting
09:55:25 (339849): wrapper: running bin/acemd (--boinc --device 0)
Traceback (most recent call last):
File "/usr/lib/python3.10/subprocess.py", line 69, in <module>
import msvcrt
ModuleNotFoundError: No module named 'msvcrt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "runtime.py", line 8, in init runtime
File "/usr/lib/python3.10/platform.py", line 119, in <module>
import subprocess
File "/usr/lib/python3.10/subprocess.py", line 74, in <module>
import _posixsubprocess
ModuleNotFoundError: No module named '_posixsubprocess'
Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/lib/python3.10/subprocess.py", line 69, in <module>
import msvcrt
ModuleNotFoundError: No module named 'msvcrt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 72, in apport_excepthook
from apport.fileutils import likely_packaged, get_recent_crashes
File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
from apport.report import Report
File "/usr/lib/python3/dist-packages/apport/report.py", line 12, in <module>
import subprocess, tempfile, os.path, re, pwd, grp, os, io
File "/usr/lib/python3.10/subprocess.py", line 74, in <module>
import _posixsubprocess
ModuleNotFoundError: No module named '_posixsubprocess'

Original exception was:
Traceback (most recent call last):
File "/usr/lib/python3.10/subprocess.py", line 69, in <module>
import msvcrt
ModuleNotFoundError: No module named 'msvcrt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "runtime.py", line 8, in init runtime
File "/usr/lib/python3.10/platform.py", line 119, in <module>
import subprocess
File "/usr/lib/python3.10/subprocess.py", line 74, in <module>
import _posixsubprocess
ModuleNotFoundError: No module named '_posixsubprocess'
Python error
09:55:26 (339849): bin/acemd exited; CPU time 0.032149
09:55:26 (339849): app exit status: 0x1
09:55:26 (339849): called boinc_finish(195)

</stderr_txt>
]]>

No hope for a solution in the short term, since universities usually shut down over Christmas...
Merry Xmas

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60905 - Posted: 27 Dec 2023 | 0:53:51 UTC

I'm waiting till after New Years before bugging Gianni again with the request to fix the acemd3 app properly.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1100
Credit: 7,543,057,676
RAC: 6,910,786
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60922 - Posted: 3 Jan 2024 | 17:01:02 UTC - in response to Message 60905.

I'm waiting till after New Years before bugging Gianni again with the request to fix the acemd3 app properly.

my Windows10 PCs were successfully crunching ACEMD 3 until this morning.

Within the past hour, some more ACEMD 3 tasks were downloaded and failed after about 1 minute.
See here: http://www.gpugrid.net/result.php?resultid=33725238

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60923 - Posted: 3 Jan 2024 | 17:50:23 UTC
Last modified: 3 Jan 2024 | 17:51:50 UTC

I'm shocked to discover that this morning I have an acemd3 task that has been running for 50 minutes so far.

All previous tasks insta-failed, first on the missing license issue and then, once the app got updated in December, on a missing Windows file.

All my hosts are Linux based and no Windows has ever been installed.

The slot that has the running task in it has all the normal and usual files in it along with checkpoint files that made running acemd3 tasks so wonderful because they could be stopped and started without failing.

Wish the other tasks at GPUGrid had that same capability.

I must assume that the app got updated again and now works. And after looking at the apps list, I see that that is the case. New app released today for acemd3.

Thank you Gianni!!

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60924 - Posted: 3 Jan 2024 | 17:54:54 UTC

But that is only one task out of about 20 so far today that is being successfully run. All the rest are ATMbeta and have failed due to bad configuration file inputs.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60925 - Posted: 3 Jan 2024 | 18:21:35 UTC
Last modified: 3 Jan 2024 | 18:22:00 UTC

New Linux acemd3 app has an expiration date 3649 days into the future. Should not be an issue for years now.

#
# ACEMD version 3.7.3
#
# Copyright (C) 2017-2024 Acellera (www.acellera.com)
#
# By using ACEMD, you accept the terms and conditions of the ACEMD licence
# Check the licence by running "acemd --licence"
# More details: https://software.acellera.com/acemd/licence.html
#
# When publishing, please cite:
# ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale
# M. J. Harvey, G. Giupponi and G. De Fabritiis,
# J Chem. Theory. Comput. 2009 5(6), pp1632-1639
# DOI: 10.1021/ct9000685
#
# Arguments:
# input: input
# platform:
# device: 2
# ncpus:
# precision: mixed
#
# ACEMD is running in Boinc mode!
#
# WARNING: This ACEMD version expires in 3649 days!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1100
Credit: 7,543,057,676
RAC: 6,910,786
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60926 - Posted: 3 Jan 2024 | 21:04:36 UTC - in response to Message 60925.

New Linux acemd3 app has an expiration date 3649 days into the future. Should not be an issue for years now.

good news for the Linux crunchers.
However, it would be great if they did the same for the Windows version, and until that is done, they should stop sending out Windows tasks which keep failing within a minute.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60927 - Posted: 3 Jan 2024 | 22:30:58 UTC - in response to Message 60926.

You need to look at a running task while it is still in its slot and capture the stderr.txt and progress files for later examination before the task errors out and clears the slot.

Your uploaded result files do not have any useful information about why the tasks are failing.

You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run.

acemd --licence would at least eliminate that as the issue. Or prove it.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1100
Credit: 7,543,057,676
RAC: 6,910,786
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60928 - Posted: 4 Jan 2024 | 6:08:43 UTC - in response to Message 60927.

... Your uploaded result files do not have any useful information about why the tasks are failing.

yes, you are right, the task from the link I posted before does not show any stderr.txt, for whatever reason (I did not check this before, sorry for that). I have noticed that this is the case with all tasks from this PC, regardless of whether they succeed or fail; no idea why.

However, the stderr from the other PC where ACEMD 3 tasks also failed does work, here is an example:
http://www.gpugrid.net/result.php?resultid=33725327

You should at least examine the acemd application for its license expiration as posted in my last post. Assuming the Windows application got the same license expiration, the tasks should run.

acemd --licence would at least eliminate that as the issue. Or prove it.

Since the ACEMD 3 tasks started failing yesterday at about the same time on both of my PCs (on a third PC, unfortunately, I cannot crunch ACEMD 3 because the app does not work with Ada Lovelace yet), my guess, of course, was that this is not due to any problem with my hardware or my software, but rather to a problem with the app itself, probably with the license.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1302
Credit: 5,707,321,959
RAC: 8,026,674
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60929 - Posted: 4 Jan 2024 | 7:34:07 UTC - in response to Message 60928.

The stderr.txt on Windows hosts never shows any reason for failing or succeeding.
I've never been able to decipher why all Windows tasks have the debug dump in their outputs.

You get the same dump output whether it succeeds or fails. They only ever display the generic error 195 BOINC catchall error code which does not explain anything.

The Linux stderr.txt output actually does show explicit reasons for why a task fails.

Your Quadro P5000 is NOT Ada generation, it's Pascal generation and Pascal cards have always worked with acemd tasks.

I've been trying to find the code path for these acemd tasks and haven't been able to deduce anything beyond the CONDA environment they set up in the job file and pass on the parameter file to the app.

They don't have the same layout the ATMbeta tasks use so you can follow along with the processing and figure out where they fail in the setup or processing flow.

There were a slug of acemd tasks released today that had the same issue with no Windows file found and all failed on all the Linux hosts. But they were not from the initial bad release but newly generated tasks today.

This task is an example of what I said.

https://www.gpugrid.net/result.php?resultid=33725353

It was attempted by 7 hosts of both Windows and Linux so the task itself is badly configured.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60930 - Posted: 4 Jan 2024 | 9:56:31 UTC - in response to Message 60929.

Actually, Erich's https://www.gpugrid.net/result.php?resultid=33725327 does contain a useful error code:

app exit status: 0xc0000135

That's a generic Windows NT code:

0xC0000135

STATUS_DLL_NOT_FOUND

{Unable To Locate Component} This application has failed to start because %hs was not found. Reinstalling the application might fix this problem.

You have to be careful and search Microsoft itself for that one: the general internet chatterbox will usually say that a specific component is at fault (usually the .NET framework), which is unlikely to be relevant for research applications. You might be able to get a name for the missing component by trying to launch the application manually in a terminal window - it should populate that %hs parameter.
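
As an illustration of reading these codes, the sketch below pulls the `app exit status` value out of a wrapper stderr dump and looks it up in a small table. The table is deliberately tiny and only covers values that appear in this thread; the function name and the wording of the descriptions are my own, not part of BOINC.

```python
import re

# Only the statuses seen in this thread; anything else is reported as unknown.
KNOWN_STATUSES = {
    0xC0000135: "STATUS_DLL_NOT_FOUND (Windows: a required DLL was not found)",
    0x7F: "127 (POSIX shell: command or script not found)",
    0x1: "1 (generic failure reported by the application)",
}


def explain_exit_status(stderr_text: str):
    """Return (code, description) for the first 'app exit status' line, or None."""
    match = re.search(r"app exit status: (0x[0-9a-fA-F]+)", stderr_text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return code, KNOWN_STATUSES.get(code, "unknown status")


sample = "23:39:01 (9616): app exit status: 0xc0000135"
print(explain_exit_status(sample))
```

The wrapper's own exit code 195 (EXIT_CHILD_FAILED) only says that the child process failed, so the `app exit status` line is the one worth decoding.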

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 568
Credit: 7,113,142,024
RAC: 9,535,198
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60935 - Posted: 8 Jan 2024 | 5:44:10 UTC - in response to Message 60904.

Four ACEMD tasks received at this Linux host on January 7-8 still failed after a few seconds.
One example:

Application: ACEMD 3: molecular dynamics simulations for GPUs 2.22 (cuda1121)
Name: 0_2-CRYPTICSCOUT_pocket_discovery_f279f6d5_5830_427a_b012_ee7935c48e7f-1-3-RND8942
State: Computation error
Received: Mon 08 Jan 2024 02:58:37 WET
Report deadline: Sat 13 Jan 2024 02:58:36 WET
Resources: 0.49 CPUs + 1 NVIDIA GPU
Estimated computation size: 5,000,000 GFLOPs
CPU time: 00:00:00
Elapsed time: 00:00:33
Executable: wrapper_26198_x86_64-pc-linux-gnu

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60936 - Posted: 8 Jan 2024 | 8:22:14 UTC - in response to Message 60935.

They have "ModuleNotFoundError: No module named 'msvcrt'".

I think that stands for "MicroSoft Visual C RunTime [module]" - which is odd to see in a Linux package.
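
One possible reading of the traceback quoted earlier in this thread, offered as an assumption rather than a confirmed diagnosis: the failed `import msvcrt` at line 69 of subprocess.py is followed by "During handling of the above exception" and an `import _posixsubprocess` at line 74, which looks like a try-Windows-first, fall-back-to-POSIX import. If so, the msvcrt failure would be expected on Linux, and the missing `_posixsubprocess` would be the decisive error. The sketch below checks which of the two modules an interpreter can locate; the function names are hypothetical.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the running interpreter can locate the module."""
    return importlib.util.find_spec(name) is not None


def diagnose(has_msvcrt: bool, has_posixsubprocess: bool) -> str:
    """Classify a runtime by which subprocess helper module it provides."""
    if has_msvcrt:
        return "Windows runtime: msvcrt found"
    if has_posixsubprocess:
        return "POSIX runtime: msvcrt absent as expected, _posixsubprocess found"
    return "broken runtime: neither msvcrt nor _posixsubprocess can be imported"


def diagnose_current_interpreter() -> str:
    """Apply the check to the interpreter that is running this script."""
    return diagnose(module_available("msvcrt"), module_available("_posixsubprocess"))


print(diagnose_current_interpreter())
```

Run with the Python runtime bundled in the task's slot directory, a "broken runtime" result would point at incomplete packaging rather than at Windows code being shipped to Linux hosts.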

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Message 61504 - Posted: 15 May 2024 | 10:39:23 UTC

Following on from the reported issue in the ATM thread ("exceeded elapsed time limit" error - message 61483):

I've finally caught one of these for inspection in daylight. It's on a Linux machine, so a slightly different version - v2.24, deployed 15 Apr 2024 - but it should be close enough.

Here are the vital statistics:

App speed: <flops> 6271039115434
Task size: <rsc_fpops_est> 1000000000000000000
Correction: <duration_correction_factor> 0.010000

That gives an estimated run time of 1594 seconds (26 minutes 34 seconds), as shown in BOINC Manager.

The time limit for the task is set by <rsc_fpops_bound>, which is 10 times larger than the estimate. So, 4 hours, 25 minutes, 40 seconds on this GeForce GTX 1660 Ti. I'll let you know how it gets on - or you can look it up yourself this afternoon, at task 35250069.

Or not.
ACEMD failed:
Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

Back to the drawing board, while it gets on with Quantum chemistry as usual!
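For reference, the run-time figures quoted earlier in this post can be reproduced with a few lines of Python. This is a sketch of the arithmetic as described here, not the BOINC client's actual code: it assumes the estimate is `rsc_fpops_est / flops` scaled by the correction factor, and that the limit is simply 10 times that estimate. The names `hms`, `estimate_s` and `limit_s` are made up for the example.

```python
FLOPS = 6_271_039_115_434               # <flops>
FPOPS_EST = 1_000_000_000_000_000_000   # <rsc_fpops_est>
DCF = 0.010000                          # <duration_correction_factor>
BOUND_RATIO = 10                        # <rsc_fpops_bound> / <rsc_fpops_est>, as stated in the post


def hms(seconds):
    """Split a duration in seconds into whole (hours, minutes, seconds)."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return h, m, s


# Estimated run time: work estimate divided by app speed, scaled by the DCF.
estimate_s = FPOPS_EST / FLOPS * DCF

# Assumption: the time limit scales the same way, 10 times the estimate.
limit_s = BOUND_RATIO * estimate_s

print("estimate:", hms(estimate_s))
print("limit:   ", hms(limit_s))
```

The limit comes out a few seconds off the figure quoted above only because the post multiplies the already-rounded 1594 seconds by 10.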

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Message 61505 - Posted: 15 May 2024 | 12:54:14 UTC - in response to Message 61504.

You may need to update your drivers.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Message 61509 - Posted: 16 May 2024 | 8:21:07 UTC - in response to Message 61505.

You may need to update your drivers.

It's a possibility - but the card/driver combo is accepted to run the cuda1121 version of QC. It's only the Python beta which needs cuda1131.

We'll see what happens when my other Linux machine catches a task - that does have a newer card and driver.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Message 61519 - Posted: 21 May 2024 | 16:28:22 UTC

OK, that's looking more plausible. My other machine (driver 535.99) has completed tasks on the primary RTX 3060 GPU, and is now running one on the secondary GTX 1660 GPU - no problems so far.

So I've upgraded the failing machine from driver 470.99 up to a matching 535.99: back to the long slow fishing game!

Meanwhile, I'll check the estimates for the task on the slower secondary card - that might be a (different) problem.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Message 61526 - Posted: 24 May 2024 | 14:25:12 UTC

I see we've been given a big new block of ACEMD tasks to chew on.

Here are my current estimates for host 132158, after 9 completed tasks:



nearly 12 days for Quantum Chemistry
5.5 hours for ACEMD 3

That's still pretty tight on maximum time, but I've already got two more tasks to run - and they're all running to completion for now. We'll take another look after 11 completed tasks, to see what effect that has.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Message 61527 - Posted: 25 May 2024 | 7:33:20 UTC

Yup, confirmed:



If you can get to 11 completed tasks, it's plain sailing from there on. The original 'time limit exceeded' problem was caused by the project's poor estimation of the work involved in completing the different work types - but it would be devilishly difficult for them to correct it at this late stage, without causing similar problems for other apps too. I suspect we'll have to live with it.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1042
Credit: 40,220,807,483
RAC: 3,892,747
Level
Trp
Message 61528 - Posted: 25 May 2024 | 15:18:24 UTC

I guess updating the drivers solved your previous problem.

The app may be labelled with an incorrect CUDA version requirement.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1584
Credit: 6,390,241,851
RAC: 7,427,825
Level
Tyr
Message 61529 - Posted: 25 May 2024 | 15:28:45 UTC - in response to Message 61528.

I guess updating the drivers solved your previous problem.

Yes, that machine is running fine now - 6 tasks completed, plus two running.

It's still in the danger zone for 'exceeded elapsed time limit', but it looks like it should pull through.
