Message boards : Number crunching : New version of ACEMD 2.17 on multi GPU hosts
A new version of ACEMD, 2.17, was introduced on September 2nd, 2021.
ID: 57272
BOINC has had two different ways of signalling to a science app which device number it should utilise. The newer one is via init_data.xml in the slot directory, which contains, around 40 lines into the file:

<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
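Reading that assignment out of init_data.xml takes only a few lines; here is a minimal sketch, assuming the field names shown in the snippet above (the enclosing root tag and helper name are my own, not BOINC's code):

```python
import xml.etree.ElementTree as ET

def read_gpu_assignment(init_data_xml: str) -> dict:
    """Extract the GPU fields BOINC writes into a slot's init_data.xml."""
    root = ET.fromstring(init_data_xml)
    return {
        "gpu_type": root.findtext("gpu_type"),
        "gpu_device_num": int(root.findtext("gpu_device_num")),
        "gpu_opencl_dev_index": int(root.findtext("gpu_opencl_dev_index")),
    }

# Sample fragment shaped like the snippet quoted above:
sample = """<app_init_data>
  <gpu_type>NVIDIA</gpu_type>
  <gpu_device_num>1</gpu_device_num>
  <gpu_opencl_dev_index>1</gpu_opencl_dev_index>
</app_init_data>"""
print(read_gpu_assignment(sample))
```

A BOINC-aware app (or the wrapper on its behalf) is expected to act on gpu_device_num rather than taking device numbers from anywhere else.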
ID: 57273
The fact that BOINC "thinks" it's running on the wrong device exposes two issues.
ID: 57274
Thank you both for sharing your knowledge once more.
ID: 57275
Regarding the method of passing the arguments, it should depend on what server version is being used (Einstein, with its old server version, has no problem with this method). It does appear to be some kind of command-line argument, but it's not clear which step of the process (pre- or post-wrapper) the stderr file refers to. The command is being sent as --boinc input --device 0. This is the exact same command structure as used, with no issue, on previous apps/tasks.

We can discount the server theory - the snippet I posted from init_data.xml came from an Einstein task. What it actually depends on is the version of the BOINC API library linked into the science application at compile time. They used API_VERSION_7.15.0 for the autodock v7.17 app: I forget exactly when the transition took place, but it was much, much longer ago than that.

The mechanism is very crude: search the application with a hex editor for api_version; the version number follows that string. The number in the previous paragraph is a direct paste using that method.

Edit - the API version should also appear in the <app_version> tag in client_state.xml. If it's missing, the client is likely to revert to using the old, deprecated calling syntax.
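The crude hex-editor search described above can be scripted too. A minimal sketch, assuming the marker is embedded as an ASCII string of the form API_VERSION_x.y.z (the exact casing and layout inside a real binary are assumptions here; the blob below is synthetic):

```python
import re

def find_api_version(binary: bytes):
    """Search a binary blob for an embedded API_VERSION_x.y.z marker
    and return the version string, or None if absent."""
    m = re.search(rb"API_VERSION_(\d+\.\d+\.\d+)", binary)
    return m.group(1).decode() if m else None

# Synthetic bytes standing in for the real executable:
blob = b"\x7fELF...junk...API_VERSION_7.15.0\x00more junk"
print(find_api_version(blob))  # 7.15.0
```

The same search, run against a binary with no marker, returns None - which matches the later observation that the acemd3 binary itself carries no API_VERSION string.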
ID: 57276
Today when returning home I had a surprise waiting. The init_data.xml files in slots/5 and slots/6 both contain <gpu_type>NVIDIA</gpu_type>, with slots/6 holding the Device #0 task. In this situation, the nvidia-smi command shows that, although the assignments in the init_data.xml files look as expected, and BOINC Manager shows that devices #0 and #1 are each running their own task, really both tasks are running on device #0, and device #1 is idle.

I've copied the following three whole folders to a safe location for reference:
- /var/lib/boinc-client/slots/5
- /var/lib/boinc-client/slots/6
- /var/lib/boinc-client/projects/www.gpugrid.net

I can examine them in search of any other clue, but I need further advice for this...
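The nvidia-smi check can be scripted for situations like this. A minimal sketch, assuming the CSV output of nvidia-smi's query-compute-apps interface (the UUIDs and PIDs below are made up to mirror the situation described: two tasks on one GPU, the other GPU idle and therefore absent from the listing):

```python
import collections
import csv
import io

def tasks_per_gpu(smi_csv: str) -> dict:
    """Count compute processes per GPU from the output of
    `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv`."""
    counts = collections.Counter()
    for row in csv.DictReader(io.StringIO(smi_csv), skipinitialspace=True):
        counts[row["gpu_uuid"]] += 1
    return dict(counts)

# Sample output shaped like the faulty situation above:
sample = """gpu_uuid, pid
GPU-aaaa, 1234
GPU-aaaa, 5678
"""
print(tasks_per_gpu(sample))  # {'GPU-aaaa': 2}
```

Two entries under one UUID, and none under the second GPU's UUID, is exactly the "both tasks on device #0" symptom.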
ID: 57280
I can think of three things I'd really like to see:
ID: 57281
1) In the slot directories: is there an init_data.xml file, and does the content match the <device_number> reported by BOINC for that slot?

Yes, they did exist, and they matched the devices reported by BOINC Manager. But the task reported by BOINC Manager as running on device 1 differs from the physical device where it actually ran, as reported by the wrapper (and the nvidia-smi command). I'll upload images for future reference, given that tasks will vanish from the GPUGrid database shortly.

- Slots/6: Task e1s6_I6-ADRIA_test_acemd3_newapp_KIXCMYB-0-2-RND9347_2. Reported by BOINC Manager as running on device 0, the same device where it actually ran.
- Slots/5: Task e1s1_I2-ADRIA_test_acemd3_newapp_KIXCMYB-1-2-RND8042_0. Reported by BOINC Manager as running on device 1, but it actually ran on device 0.

As shown by the nvidia-smi command, device 1 was actually idle (P8 state, 1% overall activity), while the two mentioned tasks were running concurrently on device 0.

I (barely) rescued the last previous version 2.12 task that ran on a device other than 0 on any of my hosts.

- v2.12 task 1_3-CRYPTICSCOUT_pocket_discovery_6cacc905_1fa2_4ed0_98e6_1139c20e13df-0-2-RND1223_2: it ran on device 2 on this triple GTX 1650 GPU host.
- v2.17 task e1s5_I6-ADRIA_test_acemd3_newapp_KIXCMYB-0-2-RND2315_0: it ran on device 0 on the same host.

There are subtle differences between them:
- The wrapper version is different: the previous v2.12 wrapper version was 7.7.26016; the current v2.17 wrapper version is 7.5.25014 (older?).
- In the previous v2.12 version, the wrapper ran the acemd3 process directly. In the current v2.17 version, a preliminary decompressing stage seems to be executed, then the bin/acemd3 process is run.
- v2.12 wrapper device assignment: (--boinc input --device 2); v2.17 wrapper device assignment: (--boinc --device 0).
- Any other difference that I can't see?
ID: 57284
Thanks. I finally got my own task(s), so I can answer my own other questions.

bin/acemd3 --boinc --device 0

(which was right for this particular task - BOINC had assigned it to device 0)

3) There's no API_VERSION string in the acemd3 binary, so it doesn't have the ability to read the init_data.xml file natively. But it does have strings for the parameters --boinc and --device. '--boinc' doesn't appear in the wrapper documentation, but --device does. It appears to be correctly specified in the job.xml file as $GPU_DEVICE_NUM, per the documentation.

My task was 'Created 9 Sep 2021 | 6:49:00 UTC', and correctly specifies wrapper version 26014 - the latest available precompiled from BOINC. 25014 may have been a typo, since corrected. 26016 was precompiled for Windows only.

Now I just have to wait for a new task, and force it to run on device 1 to see what happens then.

[afterthought: the precompiled wrapper 26014 is very old - April 2015. I do hope they're recompiling from source]
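The documented behaviour is that the wrapper expands $GPU_DEVICE_NUM in the job.xml command line into the device number BOINC assigned. A toy sketch of that substitution (a simplified reimplementation for illustration, not the wrapper's actual code):

```python
def build_command(job_cmdline: str, gpu_device_num: int) -> str:
    """Expand the wrapper's $GPU_DEVICE_NUM macro from a job.xml
    <command_line> entry with the device number BOINC assigned."""
    return job_cmdline.replace("$GPU_DEVICE_NUM", str(gpu_device_num))

# Command-line entry as observed for this task:
cmdline = "--boinc --device $GPU_DEVICE_NUM"
print("bin/acemd3 " + build_command(cmdline, 0))
# bin/acemd3 --boinc --device 0
```

If that expansion works, a task BOINC assigns to device 1 should be launched with --device 1; the failure mode under discussion is acemd3 ending up with --device 0 regardless.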
ID: 57289
- In the previous v2.12 version, the wrapper ran the acemd3 process directly. In the current v2.17 version, a preliminary decompressing stage seems to be executed, then the bin/acemd3 process is run.

The decompressing stage produces two huge folders in the slot directory: bin and lib. The conda-pack.tar archive alone is 1 GB, and decompressing it, for each task, expands it to 1.8 GB. Watch out for your storage quotas and SSD write limits! But at least the libboost 1.74 filesystem library is in that lib folder, so hopefully that'll be an end to that class of error.
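If you want to estimate what a decompress stage like this will cost in disk space before it runs, you can sum member sizes without extracting. A minimal sketch (the archive below is a tiny in-memory stand-in for conda-pack.tar, with a hypothetical file name):

```python
import io
import tarfile

def unpacked_size(tar_bytes: bytes) -> int:
    """Sum the file sizes inside a tar archive without extracting it,
    to estimate the disk space the decompress stage will consume."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tf:
        return sum(m.size for m in tf.getmembers() if m.isfile())

# Build a tiny synthetic archive standing in for conda-pack.tar:
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    data = b"x" * 1024
    info = tarfile.TarInfo("lib/libboost_filesystem.so")  # hypothetical name
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))
print(unpacked_size(buf.getvalue()))  # 1024
```

At 1.8 GB per running task, two concurrent tasks already need 3.6 GB of scratch space on top of the 1 GB archives themselves.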
ID: 57290
There seems to be a difference in the command used for the device assignment compared with the past (I missed this before). Previous apps used:

--boinc input --device n

The 2.17 app uses this:

--boinc --device 0

It feels like it's hard-coded to 0 and not using what BOINC assigns.
ID: 57291
The command line for acemd3 is set by the wrapper.

Detected keyword --boinc

That might show up if I can make a task run standalone in a terminal window - though that'll be mightily fiddly with all these weird filenames around.
ID: 57292
Just noting the obvious difference between what worked in the past and what doesn't work now. It worked before, but not now; what's different? The lack of "input" in the command.
ID: 57293
And I'm trying to think like a computer, and trying to work out why the observed difference might cause the observed effect.
ID: 57294
That's based on the assumption that the string "input" relates to a filename. It could instead be a preprogrammed specific command, telling the app/wrapper to perform some function. Without full knowledge of the code, it's just speculation at this point.
ID: 57295
API_VERSION does appear in the wrapper binary too. You can see this by converting "API_VERSION" to a hex string and searching the binary with hexedit.
ID: 57296
Yes, I'd expect that. The whole point of the wrapper is to be 'BOINC-aware': it needs to read init_data.xml, and export the readings (like the device number) into an external format that the new version of acemd3 can act on.
ID: 57297
Finally, this problem has been corrected by the deployment of the new version, ACEMD 2.18, on Sep 14 2021 | 11:44:39 UTC.
ID: 57301
Have you actually received the new app with some new work?
ID: 57302
Have you actually received the new app with some new work?

Yes, I did.

e1s10_I18-ADRIA_test_acemd3_devicetest_KIXCMYB-0-2-RND3507_1 (link and screenshot on the GPUGrid webpage)

This task was actually processed on device 1 on my twin GTX 1650 GPU host. I had previously reset the GPUGrid project in BOINC Manager on that system.
ID: 57303
I have two ADRIA tasks running now on host 132158 - Linux Mint, driver v460.
ID: 57306
That's the best test we can hope for, the most apples-to-apples. I'd certainly be interested to know if one is significantly faster than the other.
ID: 57307
I got two new 2.18 tasks, one each on two hosts. Both CUDA_101 though.
ID: 57309
Here's a taster, after 4 hours elapsed:
ID: 57310
I received 4 GPUGrid WUs on my dual GPU system - RTX 3070 and RTX 2070...
ID: 57311
Has anyone else seen this?

It is an old, known problem. Please take a look at Toni's Message #52865, dated Oct 17 2019, especially the question "Can I use it on multi-GPU systems?". Your failed task started on device 0, then it restarted on device 1...
ID: 57312
I received 4 GPUGrid WUs on my dual GPU system - RTX 3070 and RTX 2070...

You need to extend the time period for task switching in computing preferences. Depending on how slow or fast your GPU is, and since these GPUGRID tasks can take 12-24+ hrs depending on GPU power, you might need to set this to a very high value. I have it set to 24 hrs (1440 minutes) on my hosts.

If you're running GPUGRID, a better option might be to set other projects to a resource share of 0, so that they only ask for work when no GPUGRID work is present.

FYI, this issue will also happen if you simply stop BOINC and/or reboot your system. You'll need to be fine with leaving your system on, potentially for days at a time.
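For reference, the task-switch interval mentioned above lives in BOINC's computing preferences; as a local override it would look roughly like this (a sketch assuming BOINC's global_prefs_override.xml format; 1440 is the 24-hour value from the post):

```xml
<global_preferences>
   <!-- "Switch between tasks every X minutes" -->
   <cpu_scheduling_period_minutes>1440</cpu_scheduling_period_minutes>
</global_preferences>
```

The same setting can be changed from BOINC Manager's computing preferences dialog without editing any file.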
ID: 57313
Thanks for the quick reply clarifying the problem. | |
ID: 57314
Wait a minute..... I thought I read in this thread that the restarting-on-a-different-device issue from the previous beta releases was solved???
ID: 57315
Wait a minute..... I thought I read in this thread that the restarting-on-a-different-device issue from the previous beta releases was solved???

That wasn't the problem seen in previous app versions. We were seeing all tasks running on the same GPU.
ID: 57316
I'd certainly be interested to know if one is significantly faster than the other.

The head-to-head speed comparison results are in. Both tasks completed and validated, and both were given the same credit score. Cards are GTX 1660 SUPER (ASUS TUF, if it matters).

Runtime: v1121 113,110.14 sec; v101 126,707.98 sec (12% longer)
Speed: v1121 3.18%/hour (12% faster); v101 2.84%/hour
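The percentages above follow directly from the two runtimes; a quick arithmetic check (numbers copied from the post):

```python
# Runtimes in seconds, from the head-to-head comparison:
v1121, v101 = 113110.14, 126707.98

# v101 took ~12% longer than v1121:
longer = (v101 / v1121 - 1) * 100

# Progress rates in percent per hour (100% over the full runtime):
rate_1121 = 100 / (v1121 / 3600)
rate_101 = 100 / (v101 / 3600)

print(round(longer), round(rate_1121, 2), round(rate_101, 2))
# 12 3.18 2.84
```

So both ways of stating it (12% longer runtime, 12% faster progress rate) are consistent.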
ID: 57329
If they keep both apps active, then the BOINC mechanism for choosing the most efficient application should become active once 10 valid tasks are completed on both apps.
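A toy sketch of that kind of selection rule, just to illustrate the idea: only compare app versions once each has enough validated results, then prefer the one with the lower average runtime. The threshold of 10 comes from the post above; everything else (function name, data shape, using runtime as the efficiency measure) is a simplification, not BOINC's actual server code:

```python
def pick_app_version(results: dict, min_samples: int = 10):
    """Pick the app version with the lowest mean runtime, considering
    only versions with at least `min_samples` validated tasks.
    Returns None if no version qualifies yet."""
    qualified = {v: times for v, times in results.items()
                 if len(times) >= min_samples}
    if not qualified:
        return None
    return min(qualified, key=lambda v: sum(qualified[v]) / len(qualified[v]))

# Runtimes (seconds) per app version, mirroring the comparison above:
history = {
    "cuda1121": [113110.14] * 10,
    "cuda101":  [126707.98] * 10,
}
print(pick_app_version(history))  # cuda1121
```

With the runtimes reported above, such a rule would start preferring the CUDA 11.2.1 build once both versions have enough completed tasks.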
ID: 57331
Hmmmm..... not enough tasks to draw a concrete conclusion, but on my daily driver with three identical RTX 2080 cards, the CUDA101 app was 2,000 seconds faster than the CUDA1121 app.
ID: 57335
Anybody seen any sign of your credits exported to 3rd party aggregation websites yet?
ID: 57337
Anybody seen any sign of your credits exported to 3rd party aggregation websites yet?

No. https://www.gpugrid.net/stats/ is accessible, but the files in it are dated September 16. Somebody needs to restart a script.
ID: 57338
Anybody seen any sign of your credits exported to 3rd party aggregation websites yet?

Good observation. My GPUGRID statistics at BOINCstats have also been blank since the new app v2.18 ADRIA tasks came out.
ID: 57340
Looking into this
ID: 57342
fixed
ID: 57344
Thanks, Gianni.
ID: 57346
fixed

Working again, thanks.
ID: 57349
By the way, and returning to the original topic of this thread to finish with some conclusion:
ID: 57358