Message boards : News : ATM

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 16
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60002 - Posted: 3 Mar 2023 | 10:39:46 UTC

Hello GPUGRID!

You may have already noticed that a new app called “ATM” has been deployed with some test runs. We are working on its validation and deployment, so expect more jobs for this app soon. Let me briefly explain what this new app is about.

The ATM application

ATM stands for Alchemical Transfer Method, a methodology designed by Emilio Gallicchio et al. for absolute and relative binding affinity predictions. ATM allows us to estimate binding affinities for molecules against a specific protein, that is, how strongly they bind to it. This methodology falls under the category of alchemical free energy calculation methods, where unphysical intermediate states are used to estimate the free energy of physical processes (such as protein-ligand binding). The benefits of ATM, when compared with other common free energy prediction methods (like the popular FEP), come from its simplicity: it can be used with any forcefield and does not require a lot of expertise to make it work properly.

Measuring experimental binding affinities between candidate molecules and the targeted protein is one of the first steps in drug discovery projects, but synthesizing molecules and performing experiments is expensive. Having the capacity to perform computational binding affinity predictions, particularly during drug lead optimization, is extremely beneficial. We are actively working on testing and validating the ATM method so that we can start applying it to real drug discovery projects as soon as possible. Additionally, since these methods are usually applied to hundreds of molecules, they benefit a lot from the parallelization capabilities of GPUGRID, so if everything goes as expected, this could potentially send lots of work units.

Like the PythonRL application, the ATM app is based on Python, and we ship it with its own specific Python environment.

Here are the two main references for the ATM method, for both absolute and relative binding affinity predictions:

Absolute binding free energy estimation with ATM: https://arxiv.org/pdf/2101.07894.pdf
Relative binding free energy estimation with ATM: https://pubs.acs.org/doi/10.1021/acs.jcim.1c01129

For now we are only able to send jobs to Linux machines, but we hope to have a Windows version soon.

Quico
Message 60003 - Posted: 3 Mar 2023 | 10:40:19 UTC

I’m brand new to GPUGRID, so apologies in advance if I make some mistakes. I’m looking forward to learning from you all and discussing this app :)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60005 - Posted: 3 Mar 2023 | 11:50:49 UTC

Welcome!

Let's start with some good news. I picked up one of your test tasks a couple of days ago.

T0_1-QUICO_TEST_ATM-0-1-RND8922_0

It ran right through without raising any red flags, and validated at the end. A good start.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 848
Credit: 5,739,355,781
RAC: 22,227,577
Level
Tyr
Scientific publications
wat
Message 60006 - Posted: 3 Mar 2023 | 13:21:48 UTC - in response to Message 60003.

Thanks for creating an official topic on these types of tasks.

The latest problem observed was uploads hanging because the result file was too big. It didn't cause an error; the file just never uploaded because it exceeded the size limit of your Apache server. The only resolution for the user was to abort the transfer and hope it didn't get marked as an error.

Have you already addressed this issue, either by raising the file size limit on the Apache server or by adjusting the tasks so they don't create such large result files?

Greger
Send message
Joined: 6 Jan 15
Posts: 72
Credit: 6,835,310,740
RAC: 1,264,724
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 60007 - Posted: 3 Mar 2023 | 20:20:09 UTC
Last modified: 3 Mar 2023 | 21:19:31 UTC

Welcome, and thanks for the info, Quico.

I did notice on the past batch that the upload got halted by the server. It refused to accept the result.
I checked the client_state file and the file size was below max_nbytes, but the upload still wasn't allowed.

In the past the maximum allowed file size has been 700 MB, and these files have been around 713-730 MB, so something else controls this cap. A change might help, but I don't see where the issue would be.

The event log for TL9_72-RAIMIS_TEST_ATM said:

<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>


https://ibb.co/4pYBfNS
parsing upload result response <data_server_reply> <status>0</status> <file_size>0</file_size
error code -224 (permanent HTTP error)
https://ibb.co/T40gFR9

I will test again on new units, but I would probably face the same issue if the server has not changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates
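
The size check described in this post can be sketched in a few lines of Python. This is only an illustration, assuming the two elements are copied out of client_state.xml exactly as quoted above; the wrapping `<file>` element and the function name are hypothetical, added so the fragment parses on its own. It confirms that the client-side limit was not the one being hit, which points at a separate limit on the server side.

```python
import xml.etree.ElementTree as ET

# Fragment as quoted in the post, wrapped in a hypothetical <file> element
# so that it is well-formed XML by itself.
FRAGMENT = """
<file>
    <nbytes>729766132.000000</nbytes>
    <max_nbytes>10000000000.000000</max_nbytes>
</file>
"""

def within_client_limit(xml_text):
    """Return (nbytes, max_nbytes, ok) for one file entry."""
    node = ET.fromstring(xml_text.strip())
    nbytes = float(node.findtext("nbytes"))
    max_nbytes = float(node.findtext("max_nbytes"))
    return nbytes, max_nbytes, nbytes <= max_nbytes

nbytes, max_nbytes, ok = within_client_limit(FRAGMENT)
print(f"{nbytes / 1e6:.1f} MB of {max_nbytes / 1e6:.0f} MB allowed, within limit: {ok}")
```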

Erich56
Send message
Joined: 1 Jan 15
Posts: 992
Credit: 3,778,551,353
RAC: 1,001,093
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60009 - Posted: 4 Mar 2023 | 6:05:52 UTC - in response to Message 60007.

File size in past history that max allowed have been 700mb

Greger, are you sure it was 700mb?
From what I remember, it was 500mb

Richard Haselgrove
Message 60011 - Posted: 4 Mar 2023 | 9:10:39 UTC

I have one which is looking a bit poorly. It's 'running' on host 132158 (Linux Mint 21.1, GTX 1660 Super, 64 GB RAM), but it's only showing 3% progress after 18 hours.


(image from remote monitoring on a Windows computer)

Are there any files I can examine, or which would be useful to you for debugging - or should I simply abort it?

Dirk Broer
Send message
Joined: 4 Oct 09
Posts: 2
Credit: 30,587,019
RAC: 246,196
Level
Val
Scientific publications
watwatwatwat
Message 60012 - Posted: 4 Mar 2023 | 9:50:47 UTC

I am trying to upload one, but can't get it to do the transfer:
Computer: MSI-B550-A-Pro
Project GPUGRID

Name TL9_82-RAIMIS_TEST_ATM-0-1-RND3943_1

Application ATM: Free energy calculations of protein-ligand binding 1.13 (cuda1121)
Workunit name TL9_82-RAIMIS_TEST_ATM-0-1-RND3943
State Uploading
Received 3/1/2023 4:46:17 PM
Report deadline 3/6/2023 4:46:16 PM
Estimated app speed 16.548,99 GFLOPs/sec
Estimated task size 1.000.000.000 GFLOPs
Resources 0,949 CPUs + 1 NVIDIA GPU
CPU time at last checkpoint 00:00:00
CPU time 05:27:34
Elapsed time 05:28:51
Estimated time remaining 00:00:00
Fraction done 100%
Virtual memory size 0,00 MB
Working set size 0,00 MB

Debug State: 4 - Scheduler: 0

Richard Haselgrove
Message 60013 - Posted: 4 Mar 2023 | 10:39:52 UTC

I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it.

NB a previous user also failed with a task from the same workunit: 27418556

Quico
Message 60022 - Posted: 6 Mar 2023 | 9:51:35 UTC - in response to Message 60013.

Thanks everyone for the replies!

From what I have seen of the single test job I personally sent, one replica finished without issues but the other two blew up (Particle coordinate is NaN). I find this strange, because I have seen this error in the preparation steps that I run locally but not during production, where the errors should be different. I changed a few things compared with my local runs, so I'll check those locally and then we'll try again, also with different inputs.

Welcome and thanks for info Quico

I did notice on past batch the upload got halted by server. It got rejected to download result.
I did a check on client_state file and it was below max_nbyte but still it didn´t allow to upload.

File size in past history that max allowed have been 700mb and these have been around 713-730mb in so something else control this cap and a change maybe help but i don´t see where issue would be.

event log for TL9_72-RAIMIS_TEST_ATM did say
<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>


https://ibb.co/4pYBfNS
parsing upload result response <data_server_reply> <status>0</status> <file_size>0</file_size
error code -224 (permanent HTTP error)
https://ibb.co/T40gFR9

I will do test new test on new units but would probably face same issue if server have not changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates


Thanks for this, I'll keep that in mind. From the successful run the file size is 498M, so it should be right at the limit that @Erich56 mentions. But that's useful information for when I run bigger systems.

I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it.

NB a previous user also failed with a task from the same workunit: 27418556


Hmmm, that's weird. It shouldn't softlock at that step. Although this warning pops up, it should keep running without issues. I'll ask around.

gemini8
Send message
Joined: 3 Jul 16
Posts: 25
Credit: 362,030,631
RAC: 534,692
Level
Asp
Scientific publications
watwat
Message 60029 - Posted: 7 Mar 2023 | 11:43:14 UTC

This task didn't want to upload, but neither would GPUGrid update when I aborted the upload.
Only got 24h time-outs.
____________
Greetings, Jens

STE\/E
Send message
Joined: 18 Sep 08
Posts: 368
Credit: 278,008,761
RAC: 102,066
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 60035 - Posted: 8 Mar 2023 | 12:14:20 UTC

I just aborted an ATM WU https://www.gpugrid.net/result.php?resultid=33338739 that had been running for over 7 days; it sat at 75% done the whole time. Got another one and it immediately jumped to 75% done. I'll probably just abort it and deselect any new ATM WUs ...
____________
STE\/E

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 371
Credit: 10,615,402,087
RAC: 11,814,016
Level
Trp
Scientific publications
watwatwat
Message 60036 - Posted: 8 Mar 2023 | 14:24:59 UTC
Last modified: 8 Mar 2023 | 14:40:30 UTC

Some still running, many failing.
Does ATM really just need one CPU?
I think I saw a new 1.1 GB executable DLing. Maybe the failures tried to run on the older version?
What are the VRAM and RAM minimum requirements for ATM?

Server Status shows both ATM and ATMbeta tasks but Tasks shows them all as ATM.
Strange, all my previously completed ATM WUs have vanished from my Tasks list?

Thanks for the papers, I'll read them later.

Richard Haselgrove
Message 60037 - Posted: 8 Mar 2023 | 15:23:40 UTC

Three successive errors on host 132158

All with "python: can't open file '/hdd/boinc-client/slots/2/Scripts/rbfe_explicit_sync.py': [Errno 2] No such file or directory"

Aurum
Message 60038 - Posted: 8 Mar 2023 | 16:56:45 UTC

I let some computers run off all other WUs so they were just running 2 ATM WUs. It appears they do only use one CPU each but that may just be a consequence of specifying a single CPU in the client_state.xml file. Might your ATM project benefit from using multiple CPUs?

<app_version>
<app_name>ATM</app_name>
<version_num>113</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>1.000000</avg_ncpus>
<flops>46211986880283.171875</flops>
<plan_class>cuda1121</plan_class>
<api_version>7.7.0</api_version>

nvidia-smi reports ATM 1.13 WUs are using 550 to 568 MB of VRAM so call it 0.6 GB VRAM. BOINCtasks reports all WUs are using less than 1.2 GB RAM. That means that my computers could easily run up to 20 ATM WUs simultaneously. Sadly GPUGRID does not allow us to control the number of WUs we DL like LHC or WCG do. So we're stuck with 2 set by the ACEMD project. I never run more than a single PYTHON WU on a computer so I get two and abort one and then have to uncheck PYTHON in my GPUGRID Preferences just in case ACEMD or ATM WUs materialize. I wonder how many years it's been since GG has improved the UI to make it more user-friendly? When one clicks their Preferences they still get 2 Warnings and 2 Strict Standards that have never been fixed.

Please add a link to your applications: https://www.gpugrid.net/apps.php

kksplace
Send message
Joined: 4 Mar 18
Posts: 51
Credit: 552,626,749
RAC: 1,096,890
Level
Lys
Scientific publications
wat
Message 60039 - Posted: 8 Mar 2023 | 19:52:31 UTC

Is there a way to tell if an ATM WU is progressing? I have had only one succeed so far over the last several weeks. However, all of the failures so far were one of two types: either a failure to upload (and the download aborted by me) or a simple "Error while computing", which happened very quickly.

However, I now have an ATM WU which has been processing for over seven hours. Looking at the WU properties, it shows the CPU time nearly equal to the elapsed time. The GPU shows processing spikes up to 99%, and the 'down' periods are short.

As others have reported, the Progress shows 75% steadily.

I am inclined to keep letting it compute, but want to know what behavior others have seen on successful ATM WUs.

Ian&Steve C.
Message 60041 - Posted: 8 Mar 2023 | 21:02:08 UTC
Last modified: 8 Mar 2023 | 21:03:20 UTC

Let me explain something about the 75%, since it seems many don't understand what's happening here. The 75% is in no way an indication of how much the task has progressed. It is purely a function of how BOINC interacts with the wrapper when the tasks are set up the way they are.

The wrapper uses a job.xml file to instruct BOINC on the different "subtasks" to perform over the course of a single task from the project. In the <task> element there is an option to add a <weight> argument, which tells BOINC how much of the total task completion that subtask is worth. Weights are relative, so when they add up to 100 a weight of 1 is equal to 1%, and so on. If the weight argument is not defined, each subtask gets equal weight.

In the case of the ATM tasks, the job.xml file has four subtasks and no weights defined. The first three subtasks are just extractions and unpacking and complete quickly, which is why the tasks jump to 75% straight away. If a task stays at 75% indefinitely, that's pretty indicative that it is stuck and probably won't make more progress.

By comparison, the PythonGPU tasks have two subtasks, but the first extraction subtask has a weight of 1 and the second run.py subtask has a weight of 99, which is why they don't show this kind of behavior. The acemd3 tasks only have one subtask in the file, so they don't need a weight at all and progress is pretty linear.
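
The arithmetic behind those percentages can be sketched as follows. This is a simplified illustration, not the wrapper's actual code: it assumes the reported fraction is just the weight of the completed subtasks divided by the total weight, and it ignores any progress reported from inside the subtask that is currently running. The function name is made up for the example.

```python
def fraction_done(weights, completed):
    """Fraction done once the first `completed` subtasks have finished."""
    total = sum(weights)
    return sum(weights[:completed]) / total

# Four subtasks with no weights defined: each counts equally.
atm = [1, 1, 1, 1]
print(fraction_done(atm, 3))         # 0.75

# Two subtasks weighted 1 and 99, as described for the PythonGPU tasks.
python_gpu = [1, 99]
print(fraction_done(python_gpu, 1))  # 0.01
```

With equal weights, the three quick unpacking steps already account for three quarters of the bar, so the long simulation step has only the last 25% to show for itself, and shows nothing until it ends unless it reports progress of its own.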

Richard Haselgrove
Message 60042 - Posted: 8 Mar 2023 | 21:59:23 UTC - in response to Message 60039.

I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

I looked at the task properties to see which slot directory it was running in (slot 2, in my case). Then I found the relevant directory, and poked about a bit.

I found our usual touchstone (stderr.txt) to be useless - it hadn't been touched in hours. But another file - run.log - is currently active. The most recent entries are current:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
(duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?
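
For anyone who wants to automate that check, here is a rough Python sketch. It assumes run.log lines look exactly like the two quoted above; the regular expression and the function name are my own and not part of the ATM app. It returns the timestamp, sample and replica of the most recent "Finished" entry.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - INFO - sync_re - "
    r"Finished: sample (\d+), replica (\d+)"
)

def last_finished(log_text):
    """Return (timestamp, sample, replica) of the last 'Finished' entry, or None."""
    latest = None
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            latest = (stamp, int(match.group(2)), int(match.group(3)))
    return latest

sample_log = """\
2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
(duration: 12.440164870815352 s)
"""
print(last_finished(sample_log))
```

In practice the text would be read from run.log in the task's slot directory, and a last timestamp that is many minutes old would suggest the task has stalled.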

kksplace
Message 60043 - Posted: 8 Mar 2023 | 22:39:10 UTC - in response to Message 60042.

Thanks for the idea. Sure enough, that file is showing activity (On sample 324, replica 3 for me.) OK. Just going to sit and wait.

Ian&Steve, thanks for the explanation. Just one thought: what if the fourth item is just "do everything else"? Couldn't that mean going straight from 75% to 100% at some point (assuming it is progressing)?

Quico
Message 60045 - Posted: 9 Mar 2023 | 9:26:00 UTC - in response to Message 60042.

I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

I looked at the task properties to see which slot directory it was running in (slot 2, in my case). Then I found the relevant directory, and poked about a bit.

I found our usual touchstone (stderr.txt) to be useless - it hadn't been touched in hours. But another file - run.log - is currently active. The most recent entries are current:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
(duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?


Thanks for this input (and everyone's). At least in the runs I sent recently we are expecting 341 samples.

I've seen that there were many crashes in the last batch of jobs I sent. I'll check if there were some issues on my end or it's just that the systems decided to blow up.

Richard Haselgrove
Message 60046 - Posted: 9 Mar 2023 | 10:43:01 UTC - in response to Message 60045.

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.
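
As a sketch of how such a prediction could be made, the Python below extrapolates linearly from two observations. It rests on several assumptions: that the sample counter advances at a steady rate, that the log timestamps and forum timestamps are in the same time zone, and that sample 266 was reached around the time of this post. The function name is made up for the example.

```python
from datetime import datetime, timedelta

def estimate_finish(t0, s0, t1, s1, total):
    """Linearly extrapolate a finish time from two (time, sample) observations."""
    seconds_per_sample = (t1 - t0).total_seconds() / (s1 - s0)
    remaining = total - s1
    return t1 + timedelta(seconds=remaining * seconds_per_sample)

# Observation 1: the log entry quoted earlier in the thread (sample 107).
# Observation 2: sample 266, taken as reached around the time of this post (an assumption).
eta = estimate_finish(
    datetime(2023, 3, 8, 21, 55, 17), 107,
    datetime(2023, 3, 9, 10, 43, 1), 266,
    total=341,
)
print("Estimated finish:", eta)
```

Under those assumptions the estimate lands in the late afternoon of 9 March.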

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.

BOINC doesn't think it's checkpointed since the beginning, even though checkpoints are listed at the end of each sample in the job.log

BOINC Manager shows that the fraction done is 75.000% - and has displayed that figure, unchanging, since a few minutes into the run.

I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

<file_ref>
<file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
<open_name>output.tar.bz2</open_name>
<copy_file/>
</file_ref>

More when it finishes.

Quico
Message 60047 - Posted: 9 Mar 2023 | 11:40:25 UTC - in response to Message 60046.
Last modified: 9 Mar 2023 | 11:56:09 UTC

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.



That's good to know, thanks. Next time I'll prepare them so they run for shorter periods and finish over subsequent submissions. Is there an approximate run time you'd suggest per task?


I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

<file_ref>
<file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
<open_name>output.tar.bz2</open_name>
<copy_file/>
</file_ref>

More when it finishes.


Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Richard Haselgrove
Message 60048 - Posted: 9 Mar 2023 | 11:57:55 UTC - in response to Message 60047.

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.

Quico
Message 60049 - Posted: 9 Mar 2023 | 14:37:12 UTC - in response to Message 60048.

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.


Ah, I see. From what I've seen, the final upload archive has been around 500 MB for these runs. Taking into account what was mentioned about file sizes at the beginning of the thread, I'll tweak some parameters in order to avoid heavier files.

Ian&Steve C.
Message 60050 - Posted: 9 Mar 2023 | 14:50:08 UTC - in response to Message 60049.

You should also add weights to the <task> elements in the job.xml file that's being used, as well as some kind of progress reporting for the main script. Jumping to 75% at the start and staying there for 12-24 hrs until it jumps to 100% at the end is counterintuitive for most users and causes confusion about whether the task is doing anything or not.
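
One possible shape for that progress reporting, sketched in Python: after each sample the main script writes its completed fraction (between 0 and 1) to a small file. This assumes the wrapper is configured to read such a file; the BOINC wrapper documents a `<fraction_done_filename>` option for this purpose, but the file name `progress.txt`, the function name and the sample counts below are all hypothetical, and none of this is the project's actual code.

```python
import os
import tempfile

def write_fraction_done(path, samples_done, total_samples):
    """Atomically write the completed fraction (0 to 1) to `path`."""
    fraction = min(max(samples_done / total_samples, 0.0), 1.0)
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a reader never sees a half-written value.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{fraction:.6f}\n")
    os.replace(tmp_path, path)
    return fraction

# Demonstration in a scratch directory; a real task would write into its slot directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "progress.txt")
print(write_fraction_done(demo_path, 266, 341))
```

Combined with a heavy weight on the main subtask, this would let the progress bar move steadily during the long simulation step instead of sitting at one value.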

Aurum
Message 60051 - Posted: 9 Mar 2023 | 14:51:15 UTC - in response to Message 60047.
Last modified: 9 Mar 2023 | 14:53:22 UTC

Next time I'll prepare them so they run for shorter amounts of time and finish over next submissions. Is there an aprox time you suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turn around time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.

Richard Haselgrove
Message 60052 - Posted: 9 Mar 2023 | 16:51:12 UTC
Last modified: 9 Mar 2023 | 17:01:11 UTC

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!
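
The difference between the two ways of counting is easy to check in Python. This is just the unit arithmetic; which convention each tool actually uses is as stated above, and the byte count in the first example is the `<nbytes>` value quoted earlier in the thread, not the size of this particular file.

```python
def to_mb(nbytes):
    """Megabytes counted in powers of 1000."""
    return nbytes / 1000**2

def to_mib(nbytes):
    """Mebibytes counted in powers of 1024."""
    return nbytes / 1024**2

# The <nbytes> value quoted earlier in the thread.
nbytes = 729766132
print(f"{to_mb(nbytes):.2f} MB (1000s) = {to_mib(nbytes):.2f} MiB (1024s)")

# A file shown as 500.28 when counted in 1024s, converted back to 1000s.
shown_in_1024s = 500.28
print(f"{to_mb(shown_in_1024s * 1024**2):.2f} MB (1000s)")
```

The gap is about 5% at this scale, which is enough to matter when a file sits right at a limit.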

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733

Quico
Message 60053 - Posted: 9 Mar 2023 | 18:29:11 UTC - in response to Message 60051.

Next time I'll prepare them so they run for shorter amounts of time and finish over next submissions. Is there an aprox time you suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turn around time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.


Once the Windows version is live my personal set-up will join the cause and will have more feedback :)

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733


Thanks for the insight. I'll make it save frames less frequently in order to avoid bigger file sizes.

Ian&Steve C.
Message 60068 - Posted: 13 Mar 2023 | 16:26:17 UTC

nothing but errors from the current ATM batch. run.sh is missing or misnamed/misreferenced.

Aurum
Message 60069 - Posted: 13 Mar 2023 | 17:49:34 UTC
Last modified: 13 Mar 2023 | 17:49:46 UTC

I vaguely recall GG had a rule something like a computer can only DL 200 WUs a day. If it's still in place it would be absurd since the overriding rule is that a computer can only hold 2 WUs at a time.
At the rate ATM WUs are failing I could hit that limit, so I halted GG DLs.
Please delete all your WUs until you fix the bug.

Richard Haselgrove
Message 60074 - Posted: 14 Mar 2023 | 12:35:21 UTC

Today's tasks are running OK - the run.sh script problem has been cured.

I'm running one that the previous user aborted before it even started - no need for that any more (WU 27426736).

Ian&Steve C.
Message 60075 - Posted: 14 Mar 2023 | 12:51:35 UTC - in response to Message 60074.
Last modified: 14 Mar 2023 | 12:52:47 UTC

I wouldn't say "cured", but newer tasks seem to be fine. I'm still getting a good number of resends with the same problem. I guess they'll make their way through the meat grinder before defaulting out.

example: http://www.gpugrid.net/result.php?resultid=33357435

Richard Haselgrove
Message 60076 - Posted: 14 Mar 2023 | 14:47:33 UTC - in response to Message 60075.

My point was: if you get one of these, let it run - it may be going to produce useful science. If it's one of the faulty ones, you waste about 20 seconds, and move on.

Aurum
Message 60084 - Posted: 15 Mar 2023 | 9:28:37 UTC
Last modified: 15 Mar 2023 | 9:30:08 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.

Quico
Message 60085 - Posted: 15 Mar 2023 | 10:12:31 UTC - in response to Message 60076.
Last modified: 15 Mar 2023 | 10:16:48 UTC

Sorry about the missing run.sh issue of the past few days. It slipped past me. Also, there were a few re-send tests that crashed as well, but it should be fixed now.


Is there a way I could delete the failed/crashed files from the server?

We're also trying to find alternatives to avoid the filesize issue. I hope we can find a nice solution in the next few days.

Do the last few runs take less time, making them less of a drag to run? I'm trying to find the sweet spot for everyone, or at least most of us.

Thanks everyone!

Quico
Message 60086 - Posted: 15 Mar 2023 | 10:13:40 UTC - in response to Message 60084.

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.


How low is it? It really shouldn't be low, at least going by the tests we performed internally.

Richard Haselgrove
Message 60087 - Posted: 15 Mar 2023 | 10:47:56 UTC - in response to Message 60085.

My host 508381 (GTX 1660 Ti) has finished a couple overnight, in about 9 hours. The last one finished just as I was reading your message, and I saw the upload size - 114 MB. Another failed with 'Energy is NaN', but that's another question.

The size and time figures are comfortable for me, but others will post their own views.

It would be helpful to work on the intermediate progress reports and checkpointing - at the moment, neither are reported to BOINC. This host (Linux Mint 20.3) spends the entire run reporting 75% progress: my other machine (Linux Mint 21.1) is stuck at 3%. Both run exactly the same build of BOINC.

Ian&Steve C.
Message 60091 - Posted: 15 Mar 2023 | 11:29:34 UTC - in response to Message 60086.
Last modified: 15 Mar 2023 | 11:45:09 UTC

My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

i think the current size of the ATM tasks is pretty good. about 4hrs on a 3080Ti and about 5hrs on a 2080Ti.

I'll second Richard's comment that you should put some effort into checkpointing and fixing the completion reporting (add weights to the job.xml file)
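To illustrate the job.xml weights idea: as far as I understand the BOINC wrapper, each task's share of the overall progress is proportional to its weight. A minimal sketch of that arithmetic follows. The function name and the example weights are hypothetical and are not taken from the ATM job.xml.

```python
def fraction_done(weights, completed, current_fraction=0.0):
    """Overall progress for a sequence of weighted wrapper tasks.

    weights          -- per-task weights, in execution order
    completed        -- number of tasks that have already finished
    current_fraction -- progress (0..1) of the task currently running
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    done = sum(weights[:completed])
    if completed < len(weights):
        done += weights[completed] * current_fraction
    return done / total


# Hypothetical weights: conda-unpack, tar extraction, the simulation itself.
example_weights = [1, 1, 98]
print(fraction_done(example_weights, completed=2, current_fraction=0.5))
```

With equal weights, two finished setup steps out of three would already count as two thirds of the run, so giving the simulation step most of the weight keeps the reported figure closer to reality.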
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 371
Credit: 10,615,402,087
RAC: 11,814,016
Level
Trp
Scientific publications
watwatwat
Message 60093 - Posted: 15 Mar 2023 | 14:47:42 UTC - in response to Message 60086.
Last modified: 15 Mar 2023 | 14:53:46 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.
How low is it? It really shouldn't be the case at least taking into account the tests we performed internally.

GPUgrid is set to only DL 2 WUs per computer.

It used to be higher but since ACEMD WUs take around 12ish hours and have approximately 50% GPU utilization a normal BOINC client couldn't really make efficient use of more than 2. The history of setting the limit may have had something to do with DDOS attacks and throttling server access as a defense.

But Python WUs with a very low GPU utilization and ATM with about 25% utilization could run more. I believe it's possible for the work server to designate how many WUs of a given kind to send based on the client's hardware.

Some use a custom BOINC client that tricks the server into thinking their computer is more than one computer.

I suspect 1080s & 2080s could run 3 and 3080s could run 4 ATM WUs. Be nice to give it a try.

Checkpointing should be high on your To-Do List followed closely by progress reporting. File size is not an issue on the client side since you DL files over a GB. But increasing the limit on your server side would make that problem vanish. Run times have shortened and run fine, maybe a little shorter would be nice but not a priority.

Profile Stephen Uitti
Send message
Joined: 17 Mar 14
Posts: 4
Credit: 75,778,522
RAC: 60,570
Level
Thr
Scientific publications
watwatwatwatwatwatwatwat
Message 60094 - Posted: 15 Mar 2023 | 14:58:08 UTC

I noticed "Free energy calculations of protein ligand binding" in WUProp. For example, today's time is 0.03 hours. I checked, and I have 68 of these with minimal total time, and they all get "Error while computing". I looked at a recent work unit, 27429650 T_CDK2_new_2_edit_1oiu_26_T2_2A_1-QUICO_TEST_ATM-0-1-RND4575_0
The log has this:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /t/boinclib/boinc-client/slots/8/tmp/pip-req-build-3qm67lb1
Running command git rev-parse -q --verify 'sha^5d7eac55295e8c6e777505c3ca7c998f1c85987d'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git checkout -q 5d7eac55295e8c6e777505c3ca7c998f1c85987d
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4


I'm running Linux Mint 19 (a bit out of date), git is git version 2.17.1
/usr/bin/python is Python 2.7.17 and /usr/bin/python3 is Python 3.6.9 -- this was common until recently
uname -a
Linux berfon 5.4.0-104-generic #118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
My machine has a gtx-950, so cuda tasks are OK.
It's having an issue writing to /t/boinclib/boinc-client/slots/8/tmp

sudo ls -ld /t/boinclib/boinc-client/slots/8/
drwxrwx--x 2 boinc boinc 4096 Mar 15 10:24 /t/boinclib/boinc-client/slots/8/
So it doesn't look like a permissions issue. The disk drive this is on has over 1 TB space free. It looks to me like git failed, and this is what is happening on all the work units.
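One possible reading of "exit code: -4", offered as an assumption rather than a diagnosis: pip launches setup.py as a child process, and Python reports a negative return code on POSIX when the child was killed by a signal with that number. The sketch below decodes such a code; the helper name is made up for illustration.

```python
import signal


def describe_exit_code(code):
    """Turn a subprocess-style return code into a readable description.

    On POSIX, Python reports a child killed by signal N as return code -N.
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(-4))
```

On Linux, signal 4 is SIGILL (illegal instruction). If that reading is right, the failure would come from a binary using CPU instructions the host does not support, rather than from git or from directory permissions.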
My machine is running "New version of ACEMD" routinely.
My preferences for GPUGrid are set to run everything. I'm not sure which category this is in, but it must be one of the beta apps.

I hope this helps.

Ian&Steve C.
Message 60095 - Posted: 15 Mar 2023 | 15:23:16 UTC - in response to Message 60093.

GPUgrid is set to only DL 2 WUs per computer.


it's actually 2 per GPU, for up to 8 GPUs. 16 per computer/host.
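The limit described here works out as a simple capped product. A small sketch, with the 2-per-GPU and 8-GPU figures taken from this post and the function name invented for illustration:

```python
def max_tasks_per_host(gpu_count, per_gpu=2, gpu_cap=8):
    """Tasks a host may hold: per_gpu for each GPU, counting at most gpu_cap GPUs."""
    counted_gpus = min(max(gpu_count, 0), gpu_cap)
    return per_gpu * counted_gpus


for gpus in (1, 2, 8, 10):
    print(gpus, "GPU(s) ->", max_tasks_per_host(gpus), "tasks")
```

A single-GPU host ends up with 2 tasks, which is probably why the limit looks like "2 per computer" on such a machine.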

ACEMD WUs take around 12ish hours and have approximately 50% GPU utilization


acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. if you're only seeing 50%, sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.

____________

Aurum
Message 60096 - Posted: 15 Mar 2023 | 17:53:15 UTC

I just started using nvitop for Linux and it gives a very different image of GPU utilization while running ATM: https://github.com/XuehaiPan/nvitop

Ian&Steve C.
Message 60097 - Posted: 15 Mar 2023 | 18:06:14 UTC - in response to Message 60096.
Last modified: 15 Mar 2023 | 18:09:46 UTC

i would probably give more trust to nvidia's own tools.

watch -n 1 nvidia-smi

or
watch -n 1 nvidia-smi --query-gpu=temperature.gpu,name,pci.bus_id,utilization.gpu,utilization.memory,clocks.current.sm,clocks.current.memory,power.draw,memory.used,pcie.link.gen.current,pcie.link.width.current --format=csv


but you said "acemd3" uses 50%. not ATM. overall I'd agree that ATM is closer to 50% effective or a little higher. it cycles between like 90 seconds @95+% and 30 seconds @0% and back and forth for the majority of the run.
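For anyone who wants a number instead of eyeballing the cycling, here is a rough sketch that averages one-second utilization samples. It assumes the nvidia-smi query shown above, reduced to the utilization.gpu field; the helper names are hypothetical, and the synthetic samples simply take the "90 seconds @95%, 30 seconds @0%" figures at face value.

```python
import subprocess


def parse_utilization(csv_text):
    """Parse the output of
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
    into a list of integers, one per GPU."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_first_gpu():
    """Take one utilization sample from the first GPU (requires nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)[0]


def mean_utilization(samples):
    """Average utilization over equally spaced samples."""
    return sum(samples) / len(samples)


# Synthetic one-second samples for a single cycle: 90 s busy, 30 s idle.
cycle = [95] * 90 + [0] * 30
print(mean_utilization(cycle))
```

On a real host, calling sample_first_gpu() once a second over several cycles and feeding the results to mean_utilization() would show how far the actual busy and idle periods are from these round numbers.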
____________

Aurum
Message 60098 - Posted: 15 Mar 2023 | 18:09:25 UTC - in response to Message 60094.
Last modified: 15 Mar 2023 | 18:10:23 UTC

I'm running Linux Mint 19 (a bit out of date)
I just retired my last Linux Mint 19 computer yesterday and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open BOINC folder as admin. I can't see any advantage to 21.1 so I'm going to do a fresh install and revert back to 20.3.

My machine has a gtx-950, so cuda tasks are OK.
Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.

Ian&Steve C.
Message 60099 - Posted: 15 Mar 2023 | 18:14:54 UTC - in response to Message 60098.

Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.


very likely the min CC is 5.0 (Maxwell) since Kepler cards seem to be erroring with the message that the card is too old.

all cuda 11.x apps are supported by CUDA 11.1+ drivers. with CUDA 11.1, Nvidia introduced forward compatibility of minor versions. so as long as you have 450+ drivers you should be able to run any CUDA app up to 11.8. CUDA 12+ will require moving to CUDA 12+ compatible drivers.
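The two conditions above can be written as simple comparisons. This is only a sketch: the minimum compute capability of 5.0 is the guess from this post, the 450 driver threshold is the figure quoted here, and both function names are made up.

```python
def supports_min_cc(compute_capability, minimum=(5, 0)):
    """True if a (major, minor) compute capability meets the assumed minimum."""
    return tuple(compute_capability) >= tuple(minimum)


def driver_ok_for_cuda11(driver_version, minimum=(450,)):
    """True if a driver version string such as '525.85.05' is at least the
    threshold quoted in the post for CUDA 11.x apps."""
    parts = tuple(int(p) for p in driver_version.split("."))
    return parts >= minimum


print(supports_min_cc((5, 2)))   # GTX 950, Maxwell
print(supports_min_cc((3, 5)))   # a Kepler-era compute capability
print(driver_ok_for_cuda11("525.85.05"))
```

By this check a GTX 950 at 5.2 would clear the bar, so an old card alone would not explain a failure on that host.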
____________

Aurum
Message 60100 - Posted: 15 Mar 2023 | 18:16:47 UTC - in response to Message 60095.

GPUgrid is set to only DL 2 WUs per computer.

it's actually 2 per GPU, for up to 8 GPUs. 16 per computer/host.
I'm sure you're right, it's been years since I put more than one GPU on a computer.

ACEMD WUs take around 12ish hours and have approximately 50% GPU utilization
acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. if you're only seeing 50%, sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.
Let me rephrase that since it's been a long time since there was a steady flow of ACEMD. I always run 2 ACEMD WUs per GPU with no other GPU projects running. I can't remember what ACEMD utilization was but I don't recall that they slowed down much by running 2 WUs together.

Ian&Steve C.
Message 60101 - Posted: 15 Mar 2023 | 18:19:02 UTC - in response to Message 60100.

maybe not much slower, but also not faster.
____________

Aurum
Message 60102 - Posted: 15 Mar 2023 | 18:20:10 UTC - in response to Message 60097.

i would probably give more trust to nvidia's own tools.

watch -n 1 nvidia-smi

nvitop does that but graphs it.

Aurum
Message 60103 - Posted: 15 Mar 2023 | 18:22:37 UTC - in response to Message 60101.

maybe not much slower, but also not faster.

But it has an advantage: running a single ACEMD WU while the second GG WU sits idle waiting for it to finish, and then missing the quick turnaround bonus, feels like getting robbed :-) But who's counting?

Ian&Steve C.
Message 60104 - Posted: 15 Mar 2023 | 18:26:29 UTC - in response to Message 60103.
Last modified: 15 Mar 2023 | 18:28:30 UTC

until your 12h task turns into two 25hr tasks when running two at once, and you get robbed anyway: robbed of the bonus for two tasks instead of just one.

you can set your machine to not download excess tasks by setting a smaller cache size or playing with resource share. that way it won't download the second task until the first one is nearly finished. there are lots of options you can tweak to get the desired behavior.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1129
Credit: 1,573,761,177
RAC: 3,192,556
Level
His
Scientific publications
watwatwatwatwat
Message 60105 - Posted: 15 Mar 2023 | 21:13:38 UTC
Last modified: 15 Mar 2023 | 21:13:58 UTC

Picked up another ATM task but not holding much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again.

Had hope since the task mentions new in the name.

T_CDK2_new_2_edit_26_1h1q_T4_2_1-QUICO_TEST_ATM-0-1-RND2833_2

[Errno 2] No such file or directory

openmm.OpenMMException: Illegal value for DeviceIndex: 1

Guess I will be the next guinea pig.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 203
Credit: 603,097,515
RAC: 4,139,899
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60106 - Posted: 16 Mar 2023 | 1:28:51 UTC

Does the ATM app work with RTX 4000 series?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Message 60107 - Posted: 16 Mar 2023 | 2:12:42 UTC - in response to Message 60106.

Does the ATM app work with RTX 4000 series?


Maybe. The Python app does, and the ATM is a similar kind of setup. You’ll have to try it and see.

Not sure how much progress the project has made for Windows though.
____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 93
Credit: 219,931,354
RAC: 586,729
Level
Leu
Scientific publications
watwat
Message 60108 - Posted: 16 Mar 2023 | 8:06:10 UTC - in response to Message 60098.

I'm running Linux Mint 19 (a bit out of date)
I just retired my last Linux Mint 19 computer yesterday and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open BOINC folder as admin. I can't see any advantage to 21.1 so I'm going to do a fresh install and revert back to 20.3.

My machine has a gtx-950, so cuda tasks are OK.
Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.



Glad to know someone else also has the same problem with Mint 21.1. I will shift to some other flavour.

KAMasud
Message 60111 - Posted: 18 Mar 2023 | 6:30:31 UTC

Got my first ATM Beta. Completed and validated.

Quico
Message 60120 - Posted: 20 Mar 2023 | 14:45:24 UTC - in response to Message 60091.

My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

i think the current size of the ATM tasks is pretty good. about 4hrs on a 3080Ti and about 5hrs on a 2080Ti.

I'll second Richard's comment that you should put some effort into checkpointing and fixing the completion reporting (add weights to the job.xml file)


That sounds like how ATM is intended to work for now. The idle GPU periods correspond to writing coordinates.

Happy to know that the size of the jobs is good!


Picked up another ATM task but not holding much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again.

Had hope since the task mentions new in the name.

T_CDK2_new_2_edit_26_1h1q_T4_2_1-QUICO_TEST_ATM-0-1-RND2833_2

[Errno 2] No such file or directory

openmm.OpenMMException: Illegal value for DeviceIndex: 1

Guess I will be the next guinea pig.


I have seen your errors but I'm not sure why it's happening since I got several jobs running smoothly right now. I'll ask around.

The "new" tag is a leftover on my end from receptor naming.

Quico
Message 60121 - Posted: 20 Mar 2023 | 14:46:25 UTC

Another heads-up: it seems that the Windows app will be available soon! That way we'll be able to look into the progress reporting issue.

Erich56
Send message
Joined: 1 Jan 15
Posts: 992
Credit: 3,778,551,353
RAC: 1,001,093
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60123 - Posted: 20 Mar 2023 | 19:54:13 UTC - in response to Message 60121.

...it seems that the Windows app will be available soon!

that's good news - I'm looking forward to receiving ATM tasks :-)

zombie67 [MM]
Message 60126 - Posted: 22 Mar 2023 | 6:52:36 UTC
Last modified: 22 Mar 2023 | 6:53:46 UTC

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

Quico
Message 60128 - Posted: 22 Mar 2023 | 11:15:48 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


As far as I know, we are doing the final tests.
I'll let you know once it's fully ready and I have the green light to send jobs through there.

Ian&Steve C.
Message 60129 - Posted: 22 Mar 2023 | 11:32:53 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?
____________

zombie67 [MM]
Message 60130 - Posted: 22 Mar 2023 | 14:37:45 UTC - in response to Message 60129.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Message 60132 - Posted: 22 Mar 2023 | 14:45:55 UTC - in response to Message 60130.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?


no I don't run windows. i was just asking if you had the beta box selected because that's necessary.

but looking at the server, some people did get them. someone else earlier in this thread reported that they got and processed one also. very few went out, so unless your system asked when they were available, it would be easy to miss. you can set up a script to ask for them regularly, since BOINC will stop asking after so many requests with no tasks sent.
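A minimal sketch of such an updater, assuming boinccmd is on the PATH and can talk to the local client. The project URL below is an assumption and must match the URL the client is actually attached with; the ten-minute interval and the function names are arbitrary choices for illustration.

```python
import subprocess
import time

# Assumed project URL; it has to match what the BOINC client shows for GPUGRID.
PROJECT_URL = "https://www.gpugrid.net/"


def update_command(project_url=PROJECT_URL):
    """Build the boinccmd call that asks the client to contact the project."""
    return ["boinccmd", "--project", project_url, "update"]


def request_work_forever(interval_s=600):
    """Trigger a project update every interval_s seconds until interrupted."""
    while True:
        # check=False: a failed update should not stop the loop.
        subprocess.run(update_command(), check=False)
        time.sleep(interval_s)
```

Calling request_work_forever() from a terminal keeps the requests going until interrupted. Keeping the interval fairly long avoids hammering the scheduler.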
____________

Aurum
Message 60134 - Posted: 22 Mar 2023 | 14:54:54 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

I've yet to get a Windoze ATMbeta. They've been available for a while this morning and still nothing. That GPU just sits with bated breath.
What's the trick?

zombie67 [MM]
Message 60135 - Posted: 22 Mar 2023 | 15:07:47 UTC - in response to Message 60132.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?


no I don't run windows. i was just asking if you had the beta box selected because that's necessary.

but looking at the server, some people did get them. someone else earlier in this thread reported that they got and processed one also. very few went out, so unless your system asked when they were available, it would be easy to miss. you can setup a script to ask for them regularly, BOINC will stop asking after so many requests with no tasks sent.


Yep. As I said, I have an updater script running as well.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Message 60136 - Posted: 22 Mar 2023 | 15:11:24 UTC - in response to Message 60135.

KAMasud got one on his Windows system. maybe he can share his settings.
____________

Aurum
Message 60137 - Posted: 22 Mar 2023 | 15:26:35 UTC
Last modified: 22 Mar 2023 | 15:38:55 UTC

Quico, Do you have some cryptic requirements specified for your Win ATMbeta WUs?

I've even had my Win computer set to only request ATMbeta WUs and still got nothing.

KAMasud
Message 60138 - Posted: 23 Mar 2023 | 8:57:10 UTC - in response to Message 60136.

KAMasud got one on his Windows system. maybe he can share his settings.

____________________

Yes, I did get an ATM task. Completed and validated with success. No, I do not have any special settings. The only thing I do is not run any other project with GPU Grid. I have a feeling that they interfere with each other. How? GPU Grid is all over my cores and threads. Lacks discipline. My take on the subject. Admin, sorry.
Even though resources are wasted, I am not after the credits.

Quico
Message 60139 - Posted: 23 Mar 2023 | 9:34:34 UTC
Last modified: 23 Mar 2023 | 13:06:14 UTC

I think it's just a matter of very few tests being submitted right now. Once I have the green light from Raimondas I'll start sending jobs through the windows app as well.
I have a complete system prepared just for you ;)

PS: You can now check the pre-print of our initial benchmark in the lab with ATM!
https://arxiv.org/abs/2303.11065

Aurum
Message 60140 - Posted: 23 Mar 2023 | 12:46:59 UTC

Still no checkpoints. Hopefully this is top of your priority list.

BTW, highlight your URL and click URL above and it'll be linkable:
https://arxiv.org/abs/2303.11065

Quico
Message 60141 - Posted: 23 Mar 2023 | 13:08:40 UTC - in response to Message 60140.

Done! Thanks for it.

Reporting should be live for the jobs I'll send later today. Please let me know if it works as expected, especially for the jobs with _BACE_ in their jobname.

I'll also start sending jobs through Windows today.

zombie67 [MM]
Message 60142 - Posted: 23 Mar 2023 | 14:05:09 UTC

There are two different ATM apps on the server stats page, and also on the apps.php page. But in project preferences, there is only one ATM app listed. We need a way to select both/either in our project preferences.
____________
Reno, NV
Team: SETI.USA

KAMasud
Message 60143 - Posted: 23 Mar 2023 | 16:09:27 UTC

Let it be. It is more fun this way. Never know what you will get next and adjust.

Aurum
Message 60144 - Posted: 23 Mar 2023 | 16:23:31 UTC
Last modified: 23 Mar 2023 | 16:26:46 UTC

My new WU behaves differently but I don't think checkpointing is working. It reported the first checkpoint after a minute and after an hour has yet to report a second one. Progress is stuck at 0.2 but time remaining has decreased from 1222 days to 22 days.

The Windoze WUs are all failing.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 1,667,176
Level
Gln
Scientific publications
watwat
Message 60145 - Posted: 23 Mar 2023 | 17:11:09 UTC
Last modified: 23 Mar 2023 | 17:24:02 UTC

I have started to get these ATM tasks on my windoze hosts.

All are failing like this:

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
11:28:53 (11872): wrapper (7.9.26016): starting
11:28:53 (11872): wrapper: running python.exe (bin/conda-unpack)
11:28:54 (11872): python.exe exited; CPU time 0.000000
11:28:54 (11872): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
analyze.sh
cntxt_0/
cntxt_0/PTP1B_new-23486-23479
p-0.dat
p-10.dat
p-11.dat
p-12.dat
p-13.dat
p-14.dat
p-15.dat
p-16.dat
p-17.dat
p-18.dat
p-19.dat
p-1.dat
p-20.dat
p-21.dat
p-2.dat
p-3.dat
p-4.dat
p-5.dat
p-6.dat
p-7.dat
p-8.dat
p-9.dat
PTP1B_new-23486-23479_0.xml
PTP1B_new-23486-23479_asyncre.cntl
PTP1B_new-23486-23479.inpcrd
PTP1B_new-23486-23479.prmtop
r0/
r0/PTP1B_new-23486-23479.dcd
r0/PTP1B_new-23486-23479_ckpt.xml
r0/PTP1B_new-23486-23479.out
r1/
r1/PTP1B_new-23486-23479.dcd
r1/PTP1B_new-23486-23479_ckpt.xml
r1/PTP1B_new-23486-23479.out
r10/
r10/PTP1B_new-23486-23479.dcd
r10/PTP1B_new-23486-23479_ckpt.xml
r10/PTP1B_new-23486-23479.out
r11/
r11/PTP1B_new-23486-23479.dcd
r11/PTP1B_new-23486-23479_ckpt.xml
r11/PTP1B_new-23486-23479.out
r12/
r12/PTP1B_new-23486-23479.dcd
r12/PTP1B_new-23486-23479_ckpt.xml
r12/PTP1B_new-23486-23479.out
r13/
r13/PTP1B_new-23486-23479.dcd
r13/PTP1B_new-23486-23479_ckpt.xml
r13/PTP1B_new-23486-23479.out
r14/
r14/PTP1B_new-23486-23479.dcd
r14/PTP1B_new-23486-23479_ckpt.xml
r14/PTP1B_new-23486-23479.out
r15/
r15/PTP1B_new-23486-23479.dcd
r15/PTP1B_new-23486-23479_ckpt.xml
r15/PTP1B_new-23486-23479.out
r16/
r16/PTP1B_new-23486-23479.dcd
r16/PTP1B_new-23486-23479_ckpt.xml
r16/PTP1B_new-23486-23479.out
r17/
r17/PTP1B_new-23486-23479.dcd
r17/PTP1B_new-23486-23479_ckpt.xml
r17/PTP1B_new-23486-23479.out
r18/
r18/PTP1B_new-23486-23479.dcd
r18/PTP1B_new-23486-23479_ckpt.xml
r18/PTP1B_new-23486-23479.out
r19/
r19/PTP1B_new-23486-23479.dcd
r19/PTP1B_new-23486-23479_ckpt.xml
r19/PTP1B_new-23486-23479.out
r2/
r2/PTP1B_new-23486-23479.dcd
r2/PTP1B_new-23486-23479_ckpt.xml
r2/PTP1B_new-23486-23479.out
r20/
r20/PTP1B_new-23486-23479.dcd
r20/PTP1B_new-23486-23479_ckpt.xml
r20/PTP1B_new-23486-23479.out
r21/
r21/PTP1B_new-23486-23479.dcd
r21/PTP1B_new-23486-23479_ckpt.xml
r21/PTP1B_new-23486-23479.out
r3/
r3/PTP1B_new-23486-23479.dcd
r3/PTP1B_new-23486-23479_ckpt.xml
r3/PTP1B_new-23486-23479.out
r4/
r4/PTP1B_new-23486-23479.dcd
r4/PTP1B_new-23486-23479_ckpt.xml
r4/PTP1B_new-23486-23479.out
r5/
r5/PTP1B_new-23486-23479.dcd
r5/PTP1B_new-23486-23479_ckpt.xml
r5/PTP1B_new-23486-23479.out
r6/
r6/PTP1B_new-23486-23479.dcd
r6/PTP1B_new-23486-23479_ckpt.xml
r6/PTP1B_new-23486-23479.out
r7/
r7/PTP1B_new-23486-23479.dcd
r7/PTP1B_new-23486-23479_ckpt.xml
r7/PTP1B_new-23486-23479.out
r8/
r8/PTP1B_new-23486-23479.dcd
r8/PTP1B_new-23486-23479_ckpt.xml
r8/PTP1B_new-23486-23479.out
r9/
r9/PTP1B_new-23486-23479.dcd
r9/PTP1B_new-23486-23479_ckpt.xml
r9/PTP1B_new-23486-23479.out
Rplots.pdf
run.log
run.sh
uwham_analysis.R
uwham_analysis.Rout
11:29:23 (11872): Library/usr/bin/tar.exe exited; CPU time 0.796875
11:29:23 (11872): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'run.bat' is not recognized as an internal or external command,
operable program or batch file.

11:29:24 (11872): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
11:29:24 (11872): app exit status: 0x1
11:29:24 (11872): called boinc_finish(195)


A script error?
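The extracted file list above contains run.sh but no run.bat, which would fit the "not recognized" message. As a sketch of how a job could be checked before it is started (not how the wrapper actually works), the snippet below lists which required entries are missing from an archive. The function names and the in-memory demo archive are made up for illustration.

```python
import io
import tarfile


def missing_entries(archive, required):
    """Return the names from `required` that are not present in the tar archive."""
    names = set(archive.getnames())
    return [name for name in required if name not in names]


def build_demo_archive():
    """Create a small in-memory .tar.bz2 holding only Linux-style scripts."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as tar:
        for name in ("run.sh", "analyze.sh"):
            data = b"#!/bin/sh\n"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r:bz2")


with build_demo_archive() as demo:
    print(missing_entries(demo, ["run.bat"]))
```

For a real input.tar.bz2, the same check would be tarfile.open("input.tar.bz2", "r:bz2") followed by missing_entries(..., ["run.bat"]).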

Quico
Message 60146 - Posted: 23 Mar 2023 | 17:46:00 UTC - in response to Message 60145.

I have started to get these ATM tasks on my windoze hosts.

All are failing like this:

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
11:28:53 (11872): wrapper (7.9.26016): starting
11:28:53 (11872): wrapper: running python.exe (bin/conda-unpack)
11:28:54 (11872): python.exe exited; CPU time 0.000000
11:28:54 (11872): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
11:29:23 (11872): Library/usr/bin/tar.exe exited; CPU time 0.796875
11:29:23 (11872): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'run.bat' is not recognized as an internal or external command,
operable program or batch file.

11:29:24 (11872): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
11:29:24 (11872): app exit status: 0x1
11:29:24 (11872): called boinc_finish(195)


A script error?


Hmmm, I did send those this morning. They probably entered the queue once my Windows app was live and went looking for run.bat.
If that's the case expect many crashes incoming :_(

The tests I'm monitoring seem to be still running so there's still hope

zombie67 [MM]
Joined: 16 Jul 07
Posts: 203
Credit: 603,097,515
RAC: 4,139,899
Level
Lys
Message 60147 - Posted: 23 Mar 2023 | 19:33:22 UTC

FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out. They have been issued many times over and failed every time, which looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:

http://www.gpugrid.net/result.php?resultid=33375372

So there is hope.
____________
Reno, NV
Team: SETI.USA

KAMasud
Joined: 27 Jul 11
Posts: 93
Credit: 219,931,354
RAC: 586,729
Level
Leu
Message 60148 - Posted: 23 Mar 2023 | 20:20:25 UTC - in response to Message 60147.

FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out. They have been issued many times over and failed every time, which looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:

http://www.gpugrid.net/result.php?resultid=33375372

So there is hope.

--------------
Welcome Zombie67. If you are looking for more excitement, Climate has implemented OpenIFS.

kotenok2000
Joined: 18 Jul 13
Posts: 64
Credit: 12,875,793
RAC: 14,743
Level
Pro
Message 60149 - Posted: 23 Mar 2023 | 20:23:04 UTC - in response to Message 60148.

All OpenIFS tasks have already been sent.

Pop Piasa
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 1,667,176
Level
Gln
Message 60150 - Posted: 24 Mar 2023 | 1:19:31 UTC - in response to Message 60147.

...But a few of the windows tasks I received today are actually working.


I have one that is working, but I had to add ATM to my app_config.xml file to get it to show the time remaining more accurately, due to what Ian pointed out way upthread.
https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60041
I now see realistic time remaining.

My current app_config.xml file:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd3</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>ATM</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<project_max_concurrent>1</project_max_concurrent>
<report_results_immediately/>
</app_config>


This task ran alongside a F@H task (project 18717) on a RTX3060 12GB card without any problem, in case anybody is interested.
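
A quick way to catch a malformed app_config.xml (for example a root tag that lost its opening bracket when pasted) is to parse it before restarting the client. The sketch below uses only Python's standard library; the helper name `app_names` is made up for illustration, and it checks well-formedness only, not whether BOINC accepts every option.

```python
import xml.etree.ElementTree as ET


def app_names(app_config_text):
    """Parse app_config.xml text and return the <name> of every <app> entry.

    Raises ET.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(app_config_text)
    if root.tag != "app_config":
        raise ValueError(f"unexpected root element: {root.tag}")
    return [app.findtext("name") for app in root.findall("app")]


# Trimmed-down example in the same shape as the file posted above.
GOOD = """<app_config>
<app>
<name>ATM</name>
<max_concurrent>1</max_concurrent>
</app>
</app_config>"""

print(app_names(GOOD))
```

Dropping the first character of the text (so it starts with `app_config>`) makes `ET.fromstring` raise `ParseError`, which is the kind of slip this check is meant to catch.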

kotenok2000
Joined: 18 Jul 13
Posts: 64
Credit: 12,875,793
RAC: 14,743
Level
Pro
Message 60151 - Posted: 24 Mar 2023 | 2:54:44 UTC - in response to Message 60150.
Last modified: 24 Mar 2023 | 2:55:02 UTC

Why not
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>4</cpu_usage>
</gpu_versions>
</app>

?

Bedrich Hajek
Joined: 28 Mar 09
Posts: 413
Credit: 6,086,086,565
RAC: 758,845
Level
Tyr
Message 60152 - Posted: 24 Mar 2023 | 9:48:54 UTC

So far, 2 WUs successfully completed, another one running.

https://www.gpugrid.net/workunit.php?wuid=27438037

https://www.gpugrid.net/workunit.php?wuid=27438416

https://www.gpugrid.net/workunit.php?wuid=27438497


kotenok2000
Joined: 18 Jul 13
Posts: 64
Credit: 12,875,793
RAC: 14,743
Level
Pro
Message 60153 - Posted: 24 Mar 2023 | 11:47:30 UTC - in response to Message 60152.
Last modified: 24 Mar 2023 | 12:05:55 UTC

it still can't run run.bat
http://www.gpugrid.net/result.php?resultid=33377536

Ian&Steve C.
Joined: 21 Feb 20
Posts: 848
Credit: 5,739,355,781
RAC: 22,227,577
Level
Tyr
Message 60154 - Posted: 24 Mar 2023 | 12:16:41 UTC
Last modified: 24 Mar 2023 | 12:26:33 UTC

progress reporting is still not working.

instead of halting progress at 75%, it now halts at 0.19%. the weights help prevent the task from jumping to 75%, but there is still something missing.

Python tasks are able to jump to about 1% after the extraction phase due to the weights, and then slowly creep up over time as the task progresses: 2%, 3%, 4%, etc., until they hit 100% in a natural and linear way. The ATM tasks do not do this at all. They sit at 0.19% for hours and hours with no indication of when they will complete. Is it 4 hrs? Is it 20 hrs? There's no feedback to the user. When it's done it just jumps to 100% without warning.

This makes it very difficult to tell if a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now. but the earlier tasks from yesterday ("T_p38") do not.
____________

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 28 Feb 23
Posts: 16
Credit: 0
RAC: 0
Level

Message 60155 - Posted: 24 Mar 2023 | 13:37:41 UTC - in response to Message 60154.

progress reporting is still not working.

instead of halting progress at 75%, it now halts at 0.19%. the weights help prevent the task from jumping to 75%, but there is still something missing.

Python tasks are able to jump to about 1% after the extraction phase due to the weights, and then slowly creep up over time as the task progresses: 2%, 3%, 4%, etc., until they hit 100% in a natural and linear way. The ATM tasks do not do this at all. They sit at 0.19% for hours and hours with no indication of when they will complete. Is it 4 hrs? Is it 20 hrs? There's no feedback to the user. When it's done it just jumps to 100% without warning.

This makes it very difficult to tell if a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now. but the earlier tasks from yesterday ("T_p38") do not.


T_p38 were sent before the update so I guess it makes sense that they don't show reporting yet. Is the progress report for the BACE runs good? Is it staying stuck?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 848
Credit: 5,739,355,781
RAC: 22,227,577
Level
Tyr
Message 60156 - Posted: 24 Mar 2023 | 13:50:20 UTC - in response to Message 60155.

Yes, BACE looks good.

But something is wrong with CDK2_new. It jumped to 100% but is still running.
____________

Emilio Gallicchio
New member
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 8,114
Level

Message 60157 - Posted: 24 Mar 2023 | 13:59:50 UTC - in response to Message 60140.

Hello Quico and everyone. Thank you for trying AToM-OpenMM on GPUGRID.

I am unsure if it is relevant to this issue, but AToM implements full checkpointing. Each replica's status is stored in a .xml file in the replica directory. We usually checkpoint every 10 mins, but this interval can be changed in the control file with the CHECKPOINT_TIME parameter (in seconds). Checkpointing is also triggered by SIGTERM or SIGINT signals sent to the main AToM process.

Launching the AToM job from the same folder reads the checkpoints and should restart the simulation as if it had kept running.
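
The checkpoint policy described above (a timer plus SIGTERM/SIGINT) can be sketched in a few lines of Python. This is not AToM's code: the class and method names are hypothetical, and the 600 s default only mirrors the "every 10 mins" figure mentioned for CHECKPOINT_TIME. The handler just sets a flag so that the main loop can write the checkpoint at a safe point.

```python
import signal
import time


class CheckpointController:
    """Tells a (hypothetical) simulation loop when it should checkpoint."""

    def __init__(self, checkpoint_time=600.0, clock=time.monotonic):
        self.checkpoint_time = checkpoint_time  # seconds between checkpoints
        self._clock = clock
        self._last = clock()
        self.stop_requested = False

    def handle_signal(self, signum, frame):
        # Only record the request; the main loop performs the checkpoint.
        self.stop_requested = True

    def install(self):
        # Route SIGTERM and SIGINT to the handler above.
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, self.handle_signal)

    def due(self):
        # A checkpoint is due on a termination request or when the timer expires.
        elapsed = self._clock() - self._last
        return self.stop_requested or elapsed >= self.checkpoint_time

    def mark_done(self):
        # Call after a checkpoint has been written to restart the timer.
        self._last = self._clock()
```

A loop would call `install()` once, then check `due()` after each sample, write its checkpoint, call `mark_done()`, and exit if `stop_requested` is set.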

bibi
Joined: 4 May 17
Posts: 7
Credit: 2,871,113,043
RAC: 2,192,424
Level
Phe
Message 60158 - Posted: 24 Mar 2023 | 14:08:25 UTC
Last modified: 24 Mar 2023 | 14:13:21 UTC

The Python task must tell the BOINC client how many ticks there are to calculate (MAX_SAMPLES = 341 from *_asyncre.cntl, times 22 replicas) and signal the end of each tick.

In addition, the elapsed time used starts counting again at 0 after each restart. I don't know what the current situation is.

If the progress indicator is now OK, forget my reply.
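
As a worked version of the tick count suggested above, taking the two numbers from that post (341 samples, 22 replicas) at face value: the per-tick fraction is an illustration of the idea only, not what the app actually reports, and `fraction_done` is a made-up helper name.

```python
# Numbers quoted in the post above; they may differ between work units.
max_samples = 341
replicas = 22

total_ticks = max_samples * replicas  # 341 * 22 = 7502


def fraction_done(ticks_finished):
    """Fraction of all ticks completed, capped at 1.0."""
    return min(1.0, ticks_finished / total_ticks)


print(total_ticks, fraction_done(3751))
```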

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60159 - Posted: 24 Mar 2023 | 14:09:29 UTC
Last modified: 24 Mar 2023 | 14:18:59 UTC

The ATM tasks also record that a task has checkpointed in the job.log file in the slot directory (or did so, a few debug iterations ago - see message 60046).

That file can be viewed while a task is running, but not after it's finished. It's written (I think) by the science app, but messages are passed to BOINC by the wrapper: that's probably where the problem is.

Edit: OK, I've downloaded a BACE task (resend _4) and a T_PTP1B_new task (resend _3). I'll watch them when the current pair of Abouh tasks have finished.

Emilio Gallicchio
New member
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 8,114
Level

Message 60160 - Posted: 24 Mar 2023 | 15:45:51 UTC - in response to Message 60158.

The GPUGRID version of AToM:

https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py

has this:


# Report progress on GPUGRID
progress = float(isample)/float(num_samples - last_sample)
open("progress", "w").write(str(progress))



which checks out as far as I can tell. last_sample is retrieved from checkpoints upon restart, so the progress % should be tracked correctly across restarts.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60161 - Posted: 24 Mar 2023 | 15:46:40 UTC

OK, the BACE task is running, and after 7 minutes or so, I see:

2023-03-24 15:40:33 - INFO - sync_re - Started: checkpointing
2023-03-24 15:40:49 - INFO - sync_re - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO - sync_re - Finished: sample 1 (duration: 303.5407383099664 s)

in the run.log file. So checkpointing is happening, but just not being reported through to BOINC.
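
For anyone who wants to watch these events without reading the whole file, here is a small sketch that pulls the "Finished" lines out of run.log text. The regular expression is written against the three lines quoted above only; other log lines may use formats it does not cover, and `finished_events` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# 2023-03-24 15:40:49 - INFO - sync_re - Finished: sample 1 (duration: 303.5 s)
LINE_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - INFO - sync_re - "
    r"Finished: (?P<what>checkpointing|sample \d+) "
    r"\(duration: (?P<secs>[\d.]+) s\)$"
)


def finished_events(lines):
    """Return (timestamp, event, duration in seconds) for each 'Finished' line."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.append(
                (match.group("stamp"), match.group("what"), float(match.group("secs")))
            )
    return events


SAMPLE = """2023-03-24 15:40:33 - INFO - sync_re - Started: checkpointing
2023-03-24 15:40:49 - INFO - sync_re - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO - sync_re - Finished: sample 1 (duration: 303.5407383099664 s)"""

for event in finished_events(SAMPLE.splitlines()):
    print(event)
```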

Progress is 3.582% after eleven minutes.

Emilio Gallicchio
New member
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 8,114
Level

Message 60162 - Posted: 24 Mar 2023 | 16:04:08 UTC - in response to Message 60157.

Actually, it is unclear if AToM's GPUGRID version checkpoints after catching termination signals. I'll ask Raimondas. Termination without checkpointing is usually okay, but progress since the checkpoint would be lost, and the number of samples recorded in the checkpoint file would not reflect the actual number of samples recorded.

Does anyone know if BOINC sends specific signals to terminate an app? Would the app pass the signal to AToM's main Python process?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60163 - Posted: 24 Mar 2023 | 16:20:44 UTC - in response to Message 60162.

The app seems to be both checkpointing and updating progress at the end of each sample. That will make re-alignment after a pause easier, but there's always some over-run, and data lost on restart. It's up to the application itself to record the data point reached, for use on restart, as an integral part of the checkpointing process.

I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.

Aurum
Joined: 12 Jul 17
Posts: 371
Credit: 10,615,402,087
RAC: 11,814,016
Level
Trp
Message 60164 - Posted: 24 Mar 2023 | 16:20:50 UTC
Last modified: 24 Mar 2023 | 16:20:58 UTC

Seriously? Only 14 tasks a day?

GPUGRID 3/24/2023 9:17:44 AM This computer has finished a daily quota of 14 tasks

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60165 - Posted: 24 Mar 2023 | 16:42:27 UTC - in response to Message 60164.

Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60166 - Posted: 24 Mar 2023 | 16:53:12 UTC

The T_PTP1B_new task, on the other hand, is not reporting progress, even though it's logging checkpoints in the run.log

A file is maintained in the slot folder, called 'boinc_task_state.xml' (it's probably written by the wrapper, though I'm not certain of that).

The current contents are:

<active_task>
<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>T_PTP1B_new_23484_23482_T3_2A_1-QUICO_TEST_ATM-0-1-RND3714_3</result_name>
<checkpoint_cpu_time>10.942300</checkpoint_cpu_time>
<checkpoint_elapsed_time>30.176729</checkpoint_elapsed_time>
<fraction_done>0.001996</fraction_done>
<peak_working_set_size>8318976</peak_working_set_size>
<peak_swap_size>16592896</peak_swap_size>
<peak_disk_usage>1318196036</peak_disk_usage>
</active_task>

The <fraction_done> value is reported as the 'progress%' figure - this one is reported as 0.199% by BOINC Manager (which truncates) and 0.200% by other tools (which round).

This task has been running for 43 minutes, and boinc_task_state.xml hasn't been re-written since the first minute.
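
The truncate-versus-round difference mentioned above is easy to reproduce. The sketch below reads `fraction_done` from a cut-down copy of the file contents and formats it both ways; three decimal places is an assumption that simply matches the two figures quoted.

```python
import math
import xml.etree.ElementTree as ET

# Cut-down copy of the boinc_task_state.xml contents shown above.
STATE = """<active_task>
<fraction_done>0.001996</fraction_done>
</active_task>"""

fraction = float(ET.fromstring(STATE).findtext("fraction_done"))
percent = fraction * 100  # about 0.1996

truncated = math.floor(percent * 1000) / 1000  # drop digits past the third decimal
rounded = round(percent, 3)  # round to the third decimal

print(f"truncated: {truncated:.3f}%  rounded: {rounded:.3f}%")
```

This prints 0.199% for the truncated value and 0.200% for the rounded one, matching the two displays.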

KAMasud
Joined: 27 Jul 11
Posts: 93
Credit: 219,931,354
RAC: 586,729
Level
Leu
Message 60167 - Posted: 24 Mar 2023 | 20:30:16 UTC


task 27438680
Completed and validated. While the following task had a failure after a re-start.
task 27438865

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60169 - Posted: 24 Mar 2023 | 20:49:28 UTC

My BACE task 33378091 finished successfully after 5 hours, under Linux Mint 21.1 with a GTX 1660 Super.

Four previous attempts failed, two of them under Windows with a 0xc0000135 error in Python.exe - that's a missing DLL.

KAMasud
Joined: 27 Jul 11
Posts: 93
Credit: 219,931,354
RAC: 586,729
Level
Leu
Message 60170 - Posted: 24 Mar 2023 | 21:46:07 UTC

Task 27438853
Completed and validated. Short one though.

Emilio Gallicchio
New member
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 8,114
Level

Message 60171 - Posted: 25 Mar 2023 | 2:28:51 UTC - in response to Message 60163.


I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.


Right, probably the wrapper should send a termination signal to AToM.

We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal.

However, I do not have access to the wrapper. Quico: please advise.

Landjunge
Joined: 2 Nov 08
Posts: 3
Credit: 226,554,031
RAC: 4,288,214
Level
Leu
Message 60172 - Posted: 25 Mar 2023 | 9:32:49 UTC
Last modified: 25 Mar 2023 | 9:33:49 UTC

Hi, I have some "new_2" ATMs that have been running for 14h+ already. Should I abort them?
Running linux with rtx3070 cards
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60173 - Posted: 25 Mar 2023 | 9:38:33 UTC - in response to Message 60171.
Last modified: 25 Mar 2023 | 9:50:33 UTC

The wrapper you're using at the moment is called "wrapper_26198_x86_64-pc-linux-gnu" (I haven't tried ATM under Windows yet, but can and will do so when I get a moment).

That wrapper name looks as if it was prepared from BOINC code dating to around February 2017. At that time, BOINC was working on versions of the wrapper specifically intended for use with VirtualBox.

BOINC makes pre-compiled versions of the wrapper available for projects to use "as is", but some projects customise the source code to suit their own needs. I don't know which path GPUGrid has taken.

Edit - I just looked at the file name the first time. In stderr.txt, I see

20:37:54 (115491): wrapper (7.7.26016): starting

That would put the date back to around November 2015, but I guess someone has made some local modifications.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60174 - Posted: 25 Mar 2023 | 9:45:14 UTC - in response to Message 60172.

Hi, I have some "new_2" ATMs that have been running for 14h+ already. Should I abort them?

I have one at the moment which has been running for 17.5 hours. The same machine completed one yesterday (task 33374928) which ran for 19 hours. I wouldn't abort it just yet.

Landjunge
Joined: 2 Nov 08
Posts: 3
Credit: 226,554,031
RAC: 4,288,214
Level
Leu
Message 60175 - Posted: 25 Mar 2023 | 9:46:50 UTC - in response to Message 60174.

Hi, I have some "new_2" ATMs that have been running for 14h+ already. Should I abort them?

I have one at the moment which has been running for 17.5 hours. The same machine completed one yesterday (task 33374928) which ran for 19 hours. I wouldn't abort it just yet.



Thank you. I will let them run =)

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60176 - Posted: 25 Mar 2023 | 11:32:54 UTC - in response to Message 60175.

And completed.

Aurum
Joined: 12 Jul 17
Posts: 371
Credit: 10,615,402,087
RAC: 11,814,016
Level
Trp
Message 60177 - Posted: 25 Mar 2023 | 13:06:20 UTC - in response to Message 60165.
Last modified: 25 Mar 2023 | 13:53:41 UTC

Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.

Quico, this behavior is intended to block misconfigured computers. In this case it's your Windows version that fails in seconds and is resent until it hits a Linux computer or fails 7 times. My Win computer was locked out of GG early yesterday but all my Linux computers donated until WUs ran out.
In this example the first 4 failures all went to Win7 & 11 computers and then Linux completed it successfully:
https://www.gpugrid.net/workunit.php?wuid=27438768

And the Win WUs are failing in seconds again with today's tranche.

Aurum
Joined: 12 Jul 17
Posts: 371
Credit: 10,615,402,087
RAC: 11,814,016
Level
Trp
Message 60183 - Posted: 25 Mar 2023 | 14:27:30 UTC

WUs failing on Linux computers:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4
fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4 did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

https://www.gpugrid.net/result.php?resultid=33379917

Landjunge
Joined: 2 Nov 08
Posts: 3
Credit: 226,554,031
RAC: 4,288,214
Level
Leu
Message 60184 - Posted: 25 Mar 2023 | 14:30:06 UTC

Any ideas why WUs are failing on a linux ubuntu machine with gtx1070?

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:01:49 (3551): wrapper (7.7.26016): starting
14:02:12 (3551): wrapper (7.7.26016): starting
14:02:12 (3551): wrapper: running bin/python (bin/conda-unpack)
14:02:13 (3551): bin/python exited; CPU time 0.280413
14:02:13 (3551): wrapper: running bin/tar (xjvf input.tar.bz2)
14:02:14 (3551): bin/tar exited; CPU time 0.840912
14:02:14 (3551): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/7/bin
+++ dirname /var/lib/boinc-client/slots/7/bin
++ local full_path_env=/var/lib/boinc-client/slots/7
+++ basename /var/lib/boinc-client/slots/7
++ local env_name=7
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/7
++ CONDA_PREFIX=/var/lib/boinc-client/slots/7
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(7) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/7/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/7/etc/conda/activate.d ']'
+++ ls -A /var/lib/boinc-client/slots/7/etc/conda/activate.d
++ '[' -n ocl-icd_activate.sh ']'
++ local _path
++ for _path in "$_script_dir"/*.sh
++ . /var/lib/boinc-client/slots/7/etc/conda/activate.d/ocl-icd_activate.sh
+++ conda_ocl_icd_activate
++++ ls /var/lib/boinc-client/slots/7/etc/OpenCL/vendors/
+++ [[ -z ocl-icd-system ]]
+ export PATH=/var/lib/boinc-client/slots/7:/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/7:/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/7/tmp
+ TMP=/var/lib/boinc-client/slots/7/tmp
+ mkdir -p /var/lib/boinc-client/slots/7/tmp
+ echo 'Install AToM'
+ REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/7/tmp/pip-req-build-0qwsbkqo
Running command git rev-parse -q --verify 'sha^172e6db924567cd0af1312d33f05b156b53e3d1c'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git checkout -q 172e6db924567cd0af1312d33f05b156b53e3d1c
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4
╰─> [0 lines of output]
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
14:02:22 (3551): bin/bash exited; CPU time 2.696428
14:02:22 (3551): app exit status: 0x1
14:02:22 (3551): called boinc_finish(195)

</stderr_txt>

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60185 - Posted: 25 Mar 2023 | 16:27:51 UTC - in response to Message 60173.

(I haven't tried ATM under Windows yet, but can and will do so when I get a moment).

Just downloaded a BACE task for Windows. There may be trouble ahead...

The job.xml file reads:

<job_desc>
<unzip_input>
<zipfilename>windows_x86_64__cuda1121.zip</zipfilename>
</unzip_input>
<task>
<application>python.exe</application>
<command_line>bin/conda-unpack</command_line>
<weight>1</weight>
</task>
<task>
<application>Library/usr/bin/tar.exe</application>
<command_line>xjvf input.tar.bz2</command_line>
<setenv>PATH=$PWD/Library/usr/bin</setenv>
<weight>1</weight>
</task>
<task>
<application>C:/Windows/system32/cmd.exe</application>
<command_line>/c call run.bat</command_line>
<setenv>CUDA_DEVICE=$GPU_DEVICE_NUM</setenv>
<stdout_filename>run.log</stdout_filename>
<weight>1000</weight>
<fraction_done_filename>progress</fraction_done_filename>
</task>
</job_desc>


1) We had problems with python.exe triggering a missing DLL error. I'll run Dependency Walker over this one, to see what the problem is.

2) It runs a private version of tar.exe: Microsoft included tar as a system utility from Windows 10 onwards - but I'm running Windows 7. The MS utility wouldn't run for me - I'll try this one.

3) I'm not totally convinced of the cmd.exe syntax either, but we'll cross that bridge when we get to it.
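
A side note on the <weight> values in that job.xml: if the wrapper apportions overall progress by weight (my understanding of how the BOINC wrapper uses this option, not something checked against GPUGRID's build), the two setup tasks account for 2 of 1002 weight units. The sketch below works that out from a cut-down copy of the file. The result, about 0.001996, is the same as the fraction_done value quoted earlier in the thread from a Linux task, which would fit if the Linux job file uses the same weights; that last part is an assumption.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the job.xml above, keeping only the fields used here.
JOB = """<job_desc>
<task><application>python.exe</application><weight>1</weight></task>
<task><application>Library/usr/bin/tar.exe</application><weight>1</weight></task>
<task><application>C:/Windows/system32/cmd.exe</application><weight>1000</weight></task>
</job_desc>"""

weights = [float(task.findtext("weight")) for task in ET.fromstring(JOB).findall("task")]
total = sum(weights)  # 1 + 1 + 1000 = 1002

# Share of progress covered once the two setup tasks have finished.
setup_share = sum(weights[:-1]) / total  # 2 / 1002

print(f"{setup_share:.6f}")
```

On that reading, a task that shows roughly 0.2% and never moves has finished unpacking, and the main step is not feeding any progress back.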

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60186 - Posted: 25 Mar 2023 | 17:16:04 UTC - in response to Message 60185.
Last modified: 25 Mar 2023 | 17:42:46 UTC

First reports from Dependency Walker:

"Error opening file: The system cannot find the file specified" for
API-MS-WIN-CORE-PATH-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-ERROR-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-ROBUFFER-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-STRING-L1-1-0.DLL
DCOMP.DLL
IESHIMS.DLL

The API-MS-WIN group and IESHIMS.DLL usually resolve when delay-load files are loaded during the run. But I can't find DCOMP.DLL in either the unpacked libraries, or the Windows system disk.

DCOMP.DLL seems to be called from MSHTML.DLL, which is a Windows system file. But I still can't find it from there.

Enough for now - my head is spinning!

Edit - DCOMP.DLL is present on my Windows 10 - now Windows 11 - laptop. Another fine example of Microsoft version control.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1496
Credit: 3,621,549,351
RAC: 1,380,716
Level
Arg
Message 60188 - Posted: 26 Mar 2023 | 8:24:32 UTC

Just a note of warning: one of my machines is running a JNK1 task - been running for 13 hours.

It's running fine - the run log has reached sample 287, and progress has reached 1.2654867256637168

But that's over 100%, and the BOINC display has reached (and is pegged at) 100% - probably has been for several hours. Ignore it.
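
One possible reading of that number, offered as a guess rather than a diagnosis: the progress formula quoted earlier in the thread divides isample by (num_samples - last_sample), and 1.2654867256637168 is exactly 286/226. That would fit 286 finished samples and a denominator shrunk by an earlier restart. The search below just lists the small integer ratios that reproduce the value; the upper bound of 341 is borrowed from the MAX_SAMPLES figure mentioned for another task and may not apply here.

```python
reported = 1.2654867256637168

# Look for integer pairs (isample, denominator) whose ratio matches the value.
# The bound of 341 is an assumption taken from an earlier post.
candidates = [
    (isample, denom)
    for denom in range(1, 342)
    for isample in range(1, 342)
    if abs(isample / denom - reported) < 1e-12
]
print(candidates)

# Capping the ratio keeps whatever is written to the progress file at or below 1.0.
capped = min(1.0, 286 / 226)
print(capped)
```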
