Message boards : News : ACEMD updated app

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1955
Credit: 629,356
RAC: 0
Level
Gly
Message 59700 - Posted: 10 Jan 2023 | 9:53:01 UTC

As I said, we are currently compiling the Windows version.

GDF

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59708 - Posted: 10 Jan 2023 | 15:40:06 UTC - in response to Message 59700.

might as well compile it for CUDA 11.8 to bring Ada (40-series) support.
____________

HZL
Joined: 23 Nov 08
Posts: 1
Credit: 612,500
RAC: 11,506
Level
Gly
Message 59720 - Posted: 15 Jan 2023 | 10:42:31 UTC - in response to Message 59700.

Hello everyone! I am in Shanghai, China. How can I get the GPU to work at 100% load? I see that while a task is running, the GPU stays at around 30%.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59722 - Posted: 15 Jan 2023 | 17:31:49 UTC - in response to Message 59720.

Hello everyone! I am in Shanghai, China. How can I get the GPU to work at 100% load? I see that while a task is running, the GPU stays at around 30%.


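This is normal for the Python app. It uses more CPU than GPU, so GPU utilization is limited by the CPU. If you run two tasks at the same time, you can raise GPU utilization, but with this Python app you cannot get the GPU to 100%.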
____________

guoyeah
Joined: 17 Mar 10
Posts: 1
Credit: 3,535,000
RAC: 22,998
Level
Ala
Message 59725 - Posted: 17 Jan 2023 | 3:35:22 UTC - in response to Message 59720.

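My Nvidia GPU reaches 80%. I am also running other CPU (20%) and Intel GPU (97%) projects at the same time. After setting the power plan to best performance, the CPU reaches 50%. Intel i7, 12th generation.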

Pop Piasa
Joined: 8 Aug 19
Posts: 230
Credit: 391,859,251
RAC: 835,246
Level
Asp
Message 59734 - Posted: 18 Jan 2023 | 15:34:32 UTC
Last modified: 18 Jan 2023 | 15:52:09 UTC

Looking around I see the present batch of protein ligand sims are crashing... DARNIT!


process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
22:58:08 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper: running /bin/bash (run.sh)
/bin/bash: run.sh: No such file or directory
22:58:26 (3209098): /bin/bash exited; CPU time 0.001795
22:58:26 (3209098): app exit status: 0x7f
22:58:26 (3209098): called boinc_finish(195)

anything else found?
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"


Piasa Tribe - Illini Nation

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59735 - Posted: 18 Jan 2023 | 16:20:48 UTC - in response to Message 59734.

Looking around I see the present batch of protein ligand sims are crashing... DARNIT!


process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
22:58:08 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper (7.7.26016): starting
22:58:25 (3209098): wrapper: running /bin/bash (run.sh)
/bin/bash: run.sh: No such file or directory
22:58:26 (3209098): /bin/bash exited; CPU time 0.001795
22:58:26 (3209098): app exit status: 0x7f
22:58:26 (3209098): called boinc_finish(195)

anything else found?


if someone can preserve the data files and slot directory before they get uploaded and subsequently wiped from your system, it should be easy to figure out what's wrong.

my guess is they didn't name that run.sh file properly (via open_name probably), or didn't add a task to extract the file in the wrapper config file (job.xml), or something along those lines.
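
for reference, the wrapper's "app exit status: 0x7f" is decimal 127, which is the status bash returns when it cannot find the command or script it was asked to run. that matches the "No such file or directory" line. a minimal Python sketch of the decoding (the function names and the small lookup table are made up for illustration, they are not part of BOINC or the wrapper):

```python
# Illustrative helpers only; not part of BOINC or the wrapper.
# A few conventional shell exit statuses and their usual meaning.
SHELL_STATUS = {
    126: "command found but not executable",
    127: "command or script not found",
}

def describe_shell_status(status):
    """Return a human-readable guess for a shell exit status."""
    if status in SHELL_STATUS:
        return SHELL_STATUS[status]
    if status > 128:
        # By shell convention, 128 + N means "terminated by signal N".
        return f"terminated by signal {status - 128}"
    return "application-defined status"

def as_signed_byte(code):
    """Reinterpret an 8-bit exit code as signed, as in '195 (0xc3, -61)'."""
    return code - 256 if code > 127 else code

print(0x7f, describe_shell_status(0x7f))
print(as_signed_byte(195))
```

so the 195 reported by the wrapper is only the outer code; the 0x7f from bash is the part that says the script was never found.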
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59736 - Posted: 18 Jan 2023 | 16:42:18 UTC - in response to Message 59735.

actually I have some on my system so i took a look.

there appear to be many things wrong.

the job.xml file is calling just tar, with no path to the tar binary. this should probably be /bin/tar to use the system tar.

the extracted run.sh script looks woefully lacking in detail. i can see it trying to call python and conda from 'bin/' but that is not included in the input package and will fail. the input tarball only includes some text/config files and not the whole python package.
____________

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1955
Credit: 629,356
RAC: 0
Level
Gly
Message 59738 - Posted: 18 Jan 2023 | 17:04:57 UTC - in response to Message 59736.

What app exactly?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59739 - Posted: 18 Jan 2023 | 17:08:38 UTC - in response to Message 59738.
Last modified: 18 Jan 2023 | 17:09:18 UTC

What app exactly?


the new free energy one ('ATM' moniker). using the wrapper to call the run.sh script.

also it would be a good idea to add a checkbox for this app in project preferences. this app showed up with no warning and no announcement from the project, and with no way to prevent it, it seems. I'm not sure if it's marked as beta or not.
____________

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1955
Credit: 629,356
RAC: 0
Level
Gly
Message 59740 - Posted: 18 Jan 2023 | 17:11:01 UTC - in response to Message 59739.

Yes, we should have made a beta, but this app is not related to this thread.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59741 - Posted: 18 Jan 2023 | 17:12:16 UTC - in response to Message 59740.

Yes, we should have made a beta, but this app is not related to this thread.


you're right, but there is no announcement thread for this app, so there is nowhere else appropriate in the News section to get your attention about it.
____________

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1955
Credit: 629,356
RAC: 0
Level
Gly
Message 59746 - Posted: 18 Jan 2023 | 17:22:45 UTC - in response to Message 59741.

Soon we will announce it. This is just testing to see if it works, which should have been done on a beta app.

I expect tons of workunits using this app. Soon I will introduce a new postdoc running the simulations.

g

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59749 - Posted: 18 Jan 2023 | 17:53:44 UTC - in response to Message 59746.
Last modified: 18 Jan 2023 | 17:54:51 UTC

interesting to see that Ada "should" run on the Ampere cubins. I know the app has an architecture compatibility check, and it may fail there even if it could otherwise work.

you could also consider compiling your apps with the PTX version for forward compatibility

like this:
-gencode=arch=compute_86,code=sm_86
-gencode=arch=compute_86,code=compute_86

and the user can set the environment variable as needed. or you could set it in the wrapper config file
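
a rough sketch of how a flag list like the one above could be generated: native code (cubin) for each listed architecture, plus PTX for the newest one so that later GPUs can JIT-compile it. the helper name gencode_flags and the example architecture list are assumptions for illustration; only the -gencode=arch=...,code=... syntax comes from nvcc.

```python
def gencode_flags(archs):
    """Build nvcc -gencode flags: one cubin per arch, plus PTX for the newest.

    `archs` is a list of compute capabilities written as integers, e.g. [86].
    """
    if not archs:
        raise ValueError("need at least one compute capability, e.g. 86")
    # code=sm_XX embeds native machine code for that exact architecture.
    flags = [f"-gencode=arch=compute_{a},code=sm_{a}" for a in sorted(archs)]
    # code=compute_XX embeds PTX, which the driver can JIT for newer GPUs.
    newest = max(archs)
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags

for flag in gencode_flags([86]):
    print(flag)
```

with [86] as input this prints the same two flags shown above.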
____________

Erich56
Joined: 1 Jan 15
Posts: 958
Credit: 3,720,626,353
RAC: 587,188
Level
Arg
Message 59758 - Posted: 19 Jan 2023 | 6:02:06 UTC
Last modified: 19 Jan 2023 | 6:33:00 UTC

I am successfully running the current ACEMD_3 tasks on a GTX980ti, on a Quadro P5000, and on two RTX3070.
However, they fail on a GTX1650 after a few seconds:

https://www.gpugrid.net/result.php?resultid=33263379
https://www.gpugrid.net/result.php?resultid=33263343

can anyone tell me what might be the reason?

ServicEnginIC
Joined: 24 Sep 10
Posts: 522
Credit: 2,277,193,049
RAC: 93,238
Level
Phe
Message 59759 - Posted: 19 Jan 2023 | 6:32:46 UTC - in response to Message 59758.

As a first step, you can try resetting the GPUGRID project on the failing host.
But probably the reason is 4GB RAM being too short for executing these tasks.

Erich56
Joined: 1 Jan 15
Posts: 958
Credit: 3,720,626,353
RAC: 587,188
Level
Arg
Message 59760 - Posted: 19 Jan 2023 | 6:40:38 UTC - in response to Message 59759.
Last modified: 19 Jan 2023 | 6:41:41 UTC

...
But probably the reason is 4GB RAM being too short for executing these tasks.

that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59763 - Posted: 19 Jan 2023 | 13:15:42 UTC - in response to Message 59760.

...
But probably the reason is 4GB RAM being too short for executing these tasks.

that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).


it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.
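
on a Linux host, one way to check this is to look for the avx flag in /proc/cpuinfo. a minimal Python sketch, assuming the usual "flags : ..." line format found on x86 Linux (the function name is made up for illustration, and this does not apply directly to a Windows host):

```python
def has_cpu_flag(cpuinfo_text, flag):
    """Return True if `flag` appears in any 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole tokens so that "avx" does not match "avx2".
        if key.strip() == "flags" and flag in value.split():
            return True
    return False

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX supported:", has_cpu_flag(f.read(), "avx"))
    except FileNotFoundError:
        print("/proc/cpuinfo is not available on this system")
```

if the flag is missing and the app was built to require AVX, it would be expected to fail right at startup.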
____________

Erich56
Joined: 1 Jan 15
Posts: 958
Credit: 3,720,626,353
RAC: 587,188
Level
Arg
Message 59766 - Posted: 19 Jan 2023 | 15:27:18 UTC - in response to Message 59763.


it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.

perhaps one of the GPUGRID people could tell me if this is the case?

Ryan Munro
Joined: 6 Mar 18
Posts: 13
Credit: 78,849,663
RAC: 117,447
Level
Thr
Message 59767 - Posted: 19 Jan 2023 | 15:58:18 UTC

Just had one and it failed after 26 seconds on my 4090

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59768 - Posted: 19 Jan 2023 | 16:10:32 UTC - in response to Message 59767.

Just had one and it failed after 26 seconds on my 4090


are the Python tasks working on your 4090? or were those run on a different GPU?
____________

Ryan Munro
Joined: 6 Mar 18
Posts: 13
Credit: 78,849,663
RAC: 117,447
Level
Thr
Message 59769 - Posted: 19 Jan 2023 | 16:57:17 UTC - in response to Message 59768.

Python tasks run fine on my 4090, though they don't do much at all on the GPU; all the work seems to be on the CPU.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59770 - Posted: 19 Jan 2023 | 17:19:30 UTC - in response to Message 59769.

Python tasks run fine on my 4090, though they don't do much at all on the GPU; all the work seems to be on the CPU.


Thanks.

could you please report your failed task? click update on BOINC for GPUGRID to send back the result. I'd like to see the nature of the failure, to see if the architecture check is the reason for failure.
____________

Ryan Munro
Joined: 6 Mar 18
Posts: 13
Credit: 78,849,663
RAC: 117,447
Level
Thr
Message 59772 - Posted: 19 Jan 2023 | 18:16:15 UTC - in response to Message 59770.

Done :)

Keith Myers
Joined: 13 Dec 17
Posts: 1100
Credit: 1,468,861,541
RAC: 334,371
Level
Met
Message 59774 - Posted: 19 Jan 2023 | 19:37:49 UTC

Looks like the application does not understand the 4090 architecture. Needs to be recompiled with the gencodes that Ian pointed out.

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

Pop Piasa
Joined: 8 Aug 19
Posts: 230
Credit: 391,859,251
RAC: 835,246
Level
Asp
Message 59776 - Posted: 20 Jan 2023 | 1:13:30 UTC - in response to Message 59766.
Last modified: 20 Jan 2023 | 1:22:56 UTC


it could very well be that the CPU is too old. it does not support AVX extensions for example, and if the application is built with this requirement then that could be a reason.

perhaps one of the GPUGRID people could tell me if this is the case?


maybe you can tell (if you can run an ACEMD3 app on another host that is AVX enabled) by setting the AVX offset in the BIOS of a capable host and then checking whether the processor speed drops accordingly while running the wrapper (with no other WU).

Pop Piasa
Joined: 8 Aug 19
Posts: 230
Credit: 391,859,251
RAC: 835,246
Level
Asp
Message 59777 - Posted: 20 Jan 2023 | 1:48:29 UTC - in response to Message 59760.
Last modified: 20 Jan 2023 | 1:59:17 UTC

...
But probably the reason is 4GB RAM being too short for executing these tasks.

that's what I am guessing, too.
However, I was closely watching the RAM usage (via MemInfo) when the tasks started: at the moment the task crashed, about 2 GB were still free.
Further, for the tasks running on the other hosts mentioned above, the Windows Task Manager shows a RAM usage between 60MB and 400MB per task.
Maybe the CPU Intel Core2 Duo E7400 @ 2.80GHz is too old for these tasks?
(However, some other GPU projects like Einstein, WCG and Primegrid are running well).


interesting, larrywhitehead's 1060 3GB also does not seem to want to do these tasks

https://www.gpugrid.net/results.php?hostid=493191

only a vague stderr message

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
23:38:59 (9616): wrapper (7.9.26016): starting
23:38:59 (9616): wrapper: running bin/acemd3.exe (--boinc --device 0)
23:39:01 (9616): bin/acemd3.exe exited; CPU time 0.000000
23:39:01 (9616): app exit status: 0xc0000135
23:39:01 (9616): called boinc_finish(195)


Yet I only observe a little over 2GB graphics memory being utilized max so far on my hosts.
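
As a side note, the "app exit status: 0xc0000135" in that log is a Windows NTSTATUS value. 0xC0000135 is STATUS_DLL_NOT_FOUND, which would point to a missing DLL on that host rather than to running out of memory. A small Python sketch of decoding it (the lookup table is deliberately tiny and the helper name is only illustrative):

```python
# Tiny, hand-picked subset of NTSTATUS values; not a complete table.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
}

# The top two bits of an NTSTATUS value encode its severity.
SEVERITIES = ("success", "informational", "warning", "error")

def describe_ntstatus(code):
    """Format an NTSTATUS value with its name (if known) and severity."""
    severity = SEVERITIES[(code >> 30) & 0x3]
    name = NTSTATUS_NAMES.get(code, "unknown")
    return f"0x{code:08x}: {name} (severity: {severity})"

print(describe_ntstatus(0xC0000135))
```

That would also fit with the task dying after about two seconds with zero CPU time.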

Ian&Steve C.
Joined: 21 Feb 20
Posts: 783
Credit: 5,088,300,994
RAC: 3,117,086
Level
Tyr
Message 59778 - Posted: 20 Jan 2023 | 5:26:53 UTC - in response to Message 59774.

Looks like the application does not understand the 4090 architecture. Needs to be recompiled with the gencodes that Ian pointed out.

ACEMD failed:
Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

That’s exactly what I thought would happen. I had the same experience with some other people trying to run the Einstein CUDA BRP7 app. Didn’t work on 11.7 but did work once I compiled it for 11.8 with gencode defined for CC 8.9
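
To make the version dependency concrete: Ada GPUs report compute capability 8.9, and sm_89 is only a valid target from CUDA 11.8 onward, so an NVRTC from an older toolkit rejects it with this kind of --gpu-architecture error. A rough Python sketch, using a hand-written table that covers only a few architectures and should be checked against NVIDIA's release notes (the names here are made up for illustration):

```python
# First CUDA toolkit release that accepts each compute capability.
# Hand-written and intentionally incomplete.
FIRST_TOOLKIT_FOR_ARCH = {
    75: (10, 0),  # Turing
    80: (11, 0),  # Ampere (A100)
    86: (11, 1),  # Ampere (30-series)
    89: (11, 8),  # Ada (40-series)
    90: (11, 8),  # Hopper
}

def toolkit_supports(toolkit, arch):
    """Return True if a toolkit version such as (11, 7) can target `arch`."""
    first = FIRST_TOOLKIT_FOR_ARCH.get(arch)
    if first is None:
        raise KeyError(f"compute capability {arch} is not in this small table")
    # Tuples compare element by element, so (11, 8) >= (11, 7) is True.
    return toolkit >= first

print(toolkit_supports((11, 7), 89))
print(toolkit_supports((11, 8), 89))
```

The first call returns False and the second True, which lines up with the 11.7 versus 11.8 experience described above.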
____________
