Message boards : News : ATM
Hello GPUGRID! | |
ID: 60002 | Rating: 0 | rate:
I'm brand new to GPUGRID, so apologies in advance if I make some mistakes. I'm looking forward to learning from you all and discussing this app :)
ID: 60003 | Rating: 0 | rate:
Welcome! | |
ID: 60005 | Rating: 0 | rate:
Thanks for creating an official topic on these types of tasks. | |
ID: 60006 | Rating: 0 | rate:
Welcome, and thanks for the info, Quico.

From the task's templates and the upload attempt:

<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>
https://ibb.co/4pYBfNS

parsing upload result response <data_server_reply> <status>0</status> <file_size>0</file_size>
error code -224 (permanent HTTP error)
https://ibb.co/T40gFR9

I will do a new test on new units, but would probably face the same issue if the server has not changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates
ID: 60007 | Rating: 0 | rate:
File size in past history that max allowed have been 700mb

Greger, are you sure it was 700 MB? From what I remember, it was 500 MB.
ID: 60009 | Rating: 0 | rate:
I have one which is looking a bit poorly. It's 'running' on host 132158 (Linux Mint 21.1, GTX 1660 Super, 64 GB RAM), but it's only showing 3% progress after 18 hours.
ID: 60011 | Rating: 0 | rate:
I am trying to upload one, but can't get it to do the transfer: | |
ID: 60012 | Rating: 0 | rate:
I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it. NB a previous user also failed with a task from the same workunit: 27418556
ID: 60013 | Rating: 0 | rate:
Thanks everyone for the replies!

Welcome and thanks for info Quico

Thanks for this, I'll keep that in mind. From the successful run the file size is 498 MB, so it should be right at the limit @Erich56 mentions. But that's useful information for when I run bigger systems.

I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

Hmmm, that's weird. It shouldn't softlock at that step. Although this warning pops up, it should keep running without issues. I'll ask around.
ID: 60022 | Rating: 0 | rate:
This task didn't want to upload, but neither would GPUGrid update when I aborted the upload. | |
ID: 60029 | Rating: 0 | rate:
I just aborted one ATM WU, https://www.gpugrid.net/result.php?resultid=33338739, that had been running for over 7 days; it sat at 75% done the whole time. Got another one and it immediately jumped to 75% done. I'll probably just abort it and deselect any new ATM WUs...
ID: 60035 | Rating: 0 | rate:
Some still running, many failing. | |
ID: 60036 | Rating: 0 | rate:
Three successive errors on host 132158 | |
ID: 60037 | Rating: 0 | rate:
I let some computers run off all other WUs so they were just running 2 ATM WUs. It appears they only use one CPU each, but that may just be a consequence of specifying a single CPU in the client_state.xml file. Might your ATM project benefit from using multiple CPUs?

<app_version>
    <app_name>ATM</app_name>
    <version_num>113</version_num>
    <platform>x86_64-pc-linux-gnu</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <flops>46211986880283.171875</flops>
    <plan_class>cuda1121</plan_class>
    <api_version>7.7.0</api_version>
</app_version>

nvidia-smi reports ATM 1.13 WUs are using 550 to 568 MB of VRAM, so call it 0.6 GB of VRAM. BOINCtasks reports all WUs are using less than 1.2 GB of RAM. That means my computers could easily run up to 20 ATM WUs simultaneously. Sadly, GPUGRID does not allow us to control the number of WUs we DL like LHC or WCG do, so we're stuck with the 2 set by the ACEMD project. I never run more than a single PYTHON WU on a computer, so I get two, abort one, and then have to uncheck PYTHON in my GPUGRID preferences just in case ACEMD or ATM WUs materialize.

I wonder how many years it's been since GG improved the UI to make it more user-friendly? When one clicks their Preferences they still get 2 Warnings and 2 Strict Standards that have never been fixed.

Please add a link to your applications: https://www.gpugrid.net/apps.php
ID: 60038 | Rating: 0 | rate:
Is there a way to tell if an ATM WU is progressing? I have had only one succeed so far over the last several weeks. However, all of the failures so far were one of two types: either a failure to upload (and the download aborted by me) or a simple "Error while computing", which happened very quickly. | |
ID: 60039 | Rating: 0 | rate:
Let me explain something about the 75%, since it seems many don't understand what's happening here. The 75% is in no way an indication of how much the task has progressed; it is entirely a function of how BOINC interacts with the wrapper when the tasks are set up the way they are.
ID: 60041 | Rating: 0 | rate:
I have one that's running (?) much the same. I think I've found a way to confirm it's still alive:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12 (duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?
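For anyone who wants to check the same thing on their own host, a minimal sketch, assuming the default Linux BOINC data directory and that the log being quoted is the run.log in each task's slot (both assumptions - adjust for your setup):

# Show the most recent sync_re entries for every task currently in a slot
# (assumed paths: default Linux BOINC data dir, log file named run.log)
for log in /var/lib/boinc-client/slots/*/run.log; do
    echo "== $log =="
    grep 'sync_re' "$log" | tail -n 3
done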
ID: 60042 | Rating: 0 | rate:
Thanks for the idea. Sure enough, that file is showing activity (On sample 324, replica 3 for me.) OK. Just going to sit and wait. | |
ID: 60043 | Rating: 0 | rate:
I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

Thanks for this input (and everyone's). At least in the runs I sent recently, we are expecting 341 samples. I've seen that there were many crashes in the last batch of jobs I sent. I'll check whether there were some issues on my end or the systems just decided to blow up.
ID: 60045 | Rating: 0 | rate:
At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish. But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) show:

[screenshot of task properties]

This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards. BOINC doesn't think it has checkpointed since the beginning, even though checkpoints are listed at the end of each sample in the job.log. BOINC Manager shows that the fraction done is 75.000% - and has displayed that figure, unchanging, since a few minutes into the run.

I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

<file_ref>
    <file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
    <open_name>output.tar.bz2</open_name>
    <copy_file/>
</file_ref>

More when it finishes.
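As an aside, a back-of-the-envelope sketch of that prediction (341 is the sample total Quico quoted above; the elapsed hours and current sample are illustrative values you would read from BOINC Manager and the log):

# Rough remaining-time estimate: scale elapsed time by the samples left
ELAPSED_H=18   # hours the task has run so far (example value)
DONE=266       # sample reached so far, from the log
TOTAL=341      # samples expected in this batch
awk -v e="$ELAPSED_H" -v d="$DONE" -v t="$TOTAL" \
    'BEGIN { printf "roughly %.1f hours to go\n", e * (t - d) / d }'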
ID: 60046 | Rating: 0 | rate:
At least in the runs I sent recently we are expecting 341 samples.

That's good to know, thanks. Next time I'll prepare them so they run for shorter amounts of time and finish over subsequent submissions. Is there an approximate time you suggest per task?

I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.
ID: 60047 | Rating: 0 | rate:
Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.
ID: 60048 | Rating: 0 | rate:
Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Ah, I see. From what I've seen, the final upload archive has been around 500 MB for these runs. Taking into account what was mentioned filesize-wise at the beginning of the thread, I'll tweak some parameters in order to avoid heavier files.
ID: 60049 | Rating: 0 | rate:
You should also add weights to the <task> elements in the job.xml file that's being used, as well as adding some kind of progress reporting for the main script. Jumping to 75% at the start and staying there for 12-24 hours until it jumps to 100% at the end is counterintuitive for most users and causes confusion about whether the task is doing anything or not.
ID: 60050 | Rating: 0 | rate:
Next time I'll prepare them so they run for shorter amounts of time and finish over subsequent submissions. Is there an approximate time you suggest per task?

The sweet spot would be 0.5 to 4 hours; above 8 hours it starts to drag. Some climate projects take over a week to run. It really depends on your needs - we're here to serve :-) A quicker turnaround time while you're tweaking your project would be to your benefit. It would also help if you created your own BOINC account and ran your WUs the same way we do - get in the trenches with us and see what we see.
ID: 60051 | Rating: 0 | rate:
Well, here it is: | |
ID: 60052 | Rating: 0 | rate:
Next time I'll prepare them so they run for shorter amounts of time and finish over subsequent submissions. Is there an approximate time you suggest per task?

Once the Windows version is live, my personal set-up will join the cause and I'll have more feedback :)

Well, here it is:

Thanks for the insight. I'll make it save frames less frequently in order to avoid bigger file sizes.
ID: 60053 | Rating: 0 | rate:
nothing but errors from the current ATM batch. run.sh is missing or misnamed/misreferenced. | |
ID: 60068 | Rating: 0 | rate:
I vaguely recall GG had a rule something like a computer can only DL 200 WUs a day. If it's still in place it would be absurd since the overriding rule is that a computer can only hold 2 WUs at a time. | |
ID: 60069 | Rating: 0 | rate:
Today's tasks are running OK - the run.sh script problem has been cured. | |
ID: 60074 | Rating: 0 | rate:
I wouldn't say "cured", but newer tasks seem to be fine. I'm still getting a good number of resends with the same problem. I guess they'll make their way through the meat grinder before defaulting out.
ID: 60075 | Rating: 0 | rate:
My point was: if you get one of these, let it run - it may be going to produce useful science. If it's one of the faulty ones, you waste about 20 seconds, and move on. | |
ID: 60076 | Rating: 0 | rate:
Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously. | |
ID: 60084 | Rating: 0 | rate:
Sorry about the missing run.sh issue of the past few days - it slipped past me. There were also a few re-sent tests that crashed, but it should be fixed now.
ID: 60085 | Rating: 0 | rate:
Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.

How low is it? It really shouldn't be the case, at least going by the tests we performed internally.
ID: 60086 | Rating: 0 | rate:
My host 508381 (GTX 1660 Ti) has finished a couple overnight, in about 9 hours. The last one finished just as I was reading your message, and I saw the upload size - 114 MB. Another failed with 'Energy is NaN', but that's another question. | |
ID: 60087 | Rating: 0 | rate:
My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two. | |
ID: 60091 | Rating: 0 | rate:
Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.

How low is it? It really shouldn't be the case, at least going by the tests we performed internally.

GPUGRID is set to only DL 2 WUs per computer. It used to be higher, but since ACEMD WUs take around 12-ish hours and have approximately 50% GPU utilization, a normal BOINC client couldn't really make efficient use of more than 2. The history of setting the limit may have had something to do with DDoS attacks and throttling server access as a defense. But Python WUs, with very low GPU utilization, and ATM, with about 25% utilization, could run more. I believe it's possible for the work server to decide how many WUs of a given kind to send based on the client's hardware. Some use a custom BOINC client that tricks the server into thinking their computer is more than one computer. I suspect 1080s & 2080s could run 3, and 3080s could run 4, ATM WUs. It would be nice to give it a try.

Checkpointing should be high on your to-do list, followed closely by progress reporting. File size is not an issue on the client side, since you DL files over a GB, but increasing the limit on your server side would make that problem vanish. Run times have shortened and run fine; maybe a little shorter would be nice, but it's not a priority.
ID: 60093 | Rating: 0 | rate:
I noticed "Free energy calculations of protein ligand binding" in WUProp. For example, today's time is 0.03 hours. I checked, and I've got 68 of these with minimal total time. So I looked closer, and they all get "Error while computing". I looked at a recent work unit, 27429650, T_CDK2_new_2_edit_1oiu_26_T2_2A_1-QUICO_TEST_ATM-0-1-RND4575_0.
ID: 60094 | Rating: 0 | rate:
GPUGRID is set to only DL 2 WUs per computer.

It's actually 2 per GPU, for up to 8 GPUs - 16 per computer/host.

ACEMD WUs take around 12-ish hours and have approximately 50% GPU utilization

acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. If you're only seeing 50%, it sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.
ID: 60095 | Rating: 0 | rate:
I just started using nvitop for Linux and it gives a very different image of GPU utilization while running ATM: https://github.com/XuehaiPan/nvitop | |
ID: 60096 | Rating: 0 | rate:
I would probably give more trust to Nvidia's own tools:

watch -n 1 nvidia-smi

or

watch -n 1 nvidia-smi --query-gpu=temperature.gpu,name,pci.bus_id,utilization.gpu,utilization.memory,clocks.current.sm,clocks.current.memory,power.draw,memory.used,pcie.link.gen.current,pcie.link.width.current --format=csv

But you said "acemd3" uses 50%, not ATM. Overall I'd agree that ATM is closer to 50% effective, or a little higher - it cycles between roughly 90 seconds at 95+% and 30 seconds at 0%, back and forth, for the majority of the run.
ID: 60097 | Rating: 0 | rate:
I'm running Linux Mint 19 (a bit out of date)I just retired my last Linux Mint 19 computer yesterday and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open BOINC folder as admin. I can't see any advantage to 21.1 so I'm going to do a fresh install and revert back to 20.3. My machine has a gtx-950, so cuda tasks are OK.Is there a minimum requirement for CUDA and Compute Capability for ATM WUs? https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2. | |
ID: 60098 | Rating: 0 | rate:
Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?

Very likely the minimum CC is 5.0 (Maxwell), since Kepler cards seem to be erroring out with the message that the card is too old. All CUDA 11.x apps are supported by CUDA 11.1+ drivers: with CUDA 11.1, Nvidia introduced forward compatibility of minor versions, so as long as you have 450+ drivers you should be able to run any CUDA app up to 11.8. CUDA 12+ will require moving to CUDA 12+ compatible drivers.
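To check your own card against those requirements, a small sketch (the compute_cap query field only exists in newer nvidia-smi releases, so treat that second command as an assumption about your driver version):

# Driver version for every GPU
nvidia-smi --query-gpu=name,driver_version --format=csv
# On newer drivers, the compute capability can be queried directly as well
nvidia-smi --query-gpu=name,compute_cap --format=csv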
ID: 60099 | Rating: 0 | rate:
I'm sure you're right; it's been years since I put more than one GPU on a computer.

GPUGRID is set to only DL 2 WUs per computer.

ACEMD WUs take around 12-ish hours and have approximately 50% GPU utilization

acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. If you're only seeing 50%, it sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.

Let me rephrase that, since it's been a long time since there was a steady flow of ACEMD. I always run 2 ACEMD WUs per GPU with no other GPU projects running. I can't remember what ACEMD utilization was, but I don't recall that they slowed down much by running 2 WUs together.
ID: 60100 | Rating: 0 | rate:
maybe not much slower, but also not faster. | |
ID: 60101 | Rating: 0 | rate:
I would probably give more trust to Nvidia's own tools.

nvitop does that, but graphs it.
ID: 60102 | Rating: 0 | rate:
Maybe not much slower, but also not faster.

But it has the advantage over running a single ACEMD WU and letting the second GG task sit idle until the first finishes - missing the quick-turnaround bonus on the waiting one feels like getting robbed :-) But who's counting?
ID: 60103 | Rating: 0 | rate:
Until your 12 hr task turns into two 25 hr tasks when running two at once, and you get robbed anyway - robbed of the bonus for two tasks instead of just one.
ID: 60104 | Rating: 0 | rate:
Picked up another ATM task but not holding much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again. | |
ID: 60105 | Rating: 0 | rate:
Does the ATM app work with RTX 4000 series? | |
ID: 60106 | Rating: 0 | rate:
Does the ATM app work with RTX 4000 series?

Maybe. The Python app does, and ATM is a similar kind of setup. You'll have to try it and see. Not sure how much progress the project has made for Windows, though.
ID: 60107 | Rating: 0 | rate:
I'm running Linux Mint 19 (a bit out of date)

I just retired my last Linux Mint 19 computer yesterday and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open the BOINC folder as admin. I can't see any advantage to 21.1 so I'm going to do a fresh install and revert back to 20.3.

Glad to know someone else also has the same problem with Mint 21.1. I will shift to some other flavour.
ID: 60108 | Rating: 0 | rate:
Got my first ATM Beta. Completed and validated. | |
ID: 60111 | Rating: 0 | rate:
My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

That sounds like how ATM is intended to work for now. The idle GPU periods correspond to writing coordinates. Happy to know that the size of the jobs is good!

Picked up another ATM task but not holding much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again.

I have seen your errors, but I'm not sure why they're happening, since I have several jobs running smoothly right now. I'll ask around. The new tag is a legacy part on my end related to receptor naming.
ID: 60120 | Rating: 0 | rate:
Another heads-up: it seems that the Windows app will be available soon! That way we'll be able to look into the progress reporting issue.
ID: 60121 | Rating: 0 | rate:
...it seems that the Windows app will be available soon!

That's good news - I'm looking forward to receiving ATM tasks :-)
ID: 60123 | Rating: 0 | rate:
I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine? | |
ID: 60126 | Rating: 0 | rate:
I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

As far as I know, we are doing the final tests. I'll let you know once it's fully ready and I have the green light to send jobs through there.
ID: 60128 | Rating: 0 | rate:
I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

Do you have "allow beta/test applications" checked?
ID: 60129 | Rating: 0 | rate:
I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

Yep. Are you saying that you have received windows tasks for ATM?
____________
Reno, NV
Team: SETI.USA
ID: 60130 | Rating: 0 | rate:
I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

No, I don't run Windows; I was just asking whether you had the beta box selected, because that's necessary. But looking at the server, some people did get them - someone else earlier in this thread reported that they got and processed one as well. Very few went out, so unless your system asked while they were available, it would be easy to miss them. You can set up a script to ask for them regularly; BOINC will stop asking after so many requests with no tasks sent.
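A minimal sketch of such a nudge script, assuming boinccmd is on the PATH and pointed at a local client (the 10-minute interval is arbitrary):

# Ask GPUGRID for work periodically so the client keeps requesting
# even after its own request backoff kicks in
while true; do
    boinccmd --project https://www.gpugrid.net/ update
    sleep 600
done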
ID: 60132 | Rating: 0 | rate:
I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

I've yet to get a Windoze ATMbeta. They've been available for a while this morning and still nothing. That GPU just sits with bated breath. What's the trick?
ID: 60134 | Rating: 0 | rate:
I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

Yep. As I said, I have an updater script running as well.
____________
Reno, NV
Team: SETI.USA
ID: 60135 | Rating: 0 | rate:
KAMasud got one on his Windows system. maybe he can share his settings. | |
ID: 60136 | Rating: 0 | rate:
Quico, Do you have some cryptic requirements specified for your Win ATMbeta WUs? | |
ID: 60137 | Rating: 0 | rate:
KAMasud got one on his Windows system. maybe he can share his settings.

Yes, I did get an ATM task. Completed and validated with success. No, I do not have any special settings. The only thing I do is not run any other project with GPU Grid; I have a feeling that they interfere with each other. How? GPU Grid is all over my cores and threads - it lacks discipline. That's my take on the subject. Admin, sorry. Even though resources are wasted, I am not after the credits.
ID: 60138 | Rating: 0 | rate:
I think it's just a matter of very few tests being submitted right now. Once I have the green light from Raimondas I'll start sending jobs through the windows app as well. | |
ID: 60139 | Rating: 0 | rate:
Still no checkpoints. Hopefully this is top of your priority list. | |
ID: 60140 | Rating: 0 | rate:
Done! Thanks for it. | |
ID: 60141 | Rating: 0 | rate:
There are two different ATM apps on the server stats page, and also on the apps.php page. But in project preferences, there is only one ATM app listed. We need a way to select both/either in our project preferences.
ID: 60142 | Rating: 0 | rate:
Let it be. It is more fun this way. Never know what you will get next and adjust. | |
ID: 60143 | Rating: 0 | rate:
My new WU behaves differently but I don't think checkpointing is working. It reported the first checkpoint after a minute and after an hour has yet to report a second one. Progress is stuck at 0.2 but time remaining has decreased from 1222 days to 22 days. | |
ID: 60144 | Rating: 0 | rate:
I have started to get these ATM tasks on my windoze hosts.

(unknown error) - exit code 195 (0xc3)</message>

A script error?
ID: 60145 | Rating: 0 | rate:
I have started to get these ATM tasks on my windoze hosts.

Hmmm, I did send those this morning. They probably entered the queue once my Windows app was live and went looking for the run.bat. If that's the case, expect many crashes incoming :_( The tests I'm monitoring seem to still be running, so there's still hope.
ID: 60146 | Rating: 0 | rate:
FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out; they have been issued many times over, to too many hosts, and failed every time. Looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:
ID: 60147 | Rating: 0 | rate:
FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out; they have been issued many times over, to too many hosts, and failed every time. Looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:

Welcome, Zombie67. If you are looking for more excitement, Climate has implemented OpenIFS.
ID: 60148 | Rating: 0 | rate:
All openifs tasks are already sent. | |
ID: 60149 | Rating: 0 | rate:
...But a few of the Windows tasks I received today are actually working.

I have one that is working, but I had to add the ATM app to my appconfig file to get it to show the time remaining more accurately, due to what Ian pointed out way upthread: https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60041 - I now see a realistic time remaining. My current appconfig.xml:

<app_config>

This task ran alongside an F@H task (project 18717) on an RTX 3060 12 GB card without any problem, in case anybody is interested.
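The contents of that app_config.xml aren't shown above; as a hedged sketch of the kind of entry that changes the remaining-time estimate (the app names ATM/ATMbeta, the Linux project path, and the use of <fraction_done_exact/> are assumptions, not the poster's actual settings):

# Write an app_config.xml for GPUGRID and tell the running client to re-read it
# (path assumes a default Linux install; app names are guesses)
cat > /var/lib/boinc-client/projects/www.gpugrid.net/app_config.xml <<'EOF'
<app_config>
  <app>
    <name>ATM</name>
    <fraction_done_exact/>
  </app>
  <app>
    <name>ATMbeta</name>
    <fraction_done_exact/>
  </app>
</app_config>
EOF
boinccmd --read_cc_config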
ID: 60150 | Rating: 0 | rate:
Why not | |
ID: 60151 | Rating: 0 | rate:
So far, 2 WUs successfully completed, another one running. | |
ID: 60152 | Rating: 0 | rate:
it still can't run run.bat | |
ID: 60153 | Rating: 0 | rate:
progress reporting is still not working. | |
ID: 60154 | Rating: 0 | rate:
progress reporting is still not working.

The T_p38 tasks were sent before the update, so I guess it makes sense that they don't show reporting yet. Is the progress report good for the BACE runs, or is it staying stuck?
ID: 60155 | Rating: 0 | rate:
Yes, BACE looks good. | |
ID: 60156 | Rating: 0 | rate:
Hello Quico and everyone. Thank you for trying AToM-OpenMM on GPUGRID. | |
ID: 60157 | Rating: 0 | rate:
The Python task must tell the BOINC client how many ticks it has to calculate (MAX_SAMPLES = 341 from *_asyncre.cntl, times 22 replicas) and signal the end of each tick.
ID: 60158 | Rating: 0 | rate:
The ATM tasks also record that a task has checkpointed in the job.log file in the slot directory (or did so, a few debug iterations ago - see message 60046). | |
ID: 60159 | Rating: 0 | rate:
The GPUGRID version of AToM:

# Report progress on GPUGRID
progress = float(isample)/float(num_samples - last_sample)
open("progress", "w").write(str(progress))

which checks out as far as I can tell. last_sample is retrieved from checkpoints upon restart, so the progress % should be tracked correctly across restarts.
ID: 60160 | Rating: 0 | rate:
OK, the BACE task is running, and after 7 minutes or so, I see:

2023-03-24 15:40:33 - INFO - sync_re - Started: checkpointing
2023-03-24 15:40:49 - INFO - sync_re - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO - sync_re - Finished: sample 1 (duration: 303.5407383099664 s)

in the run.log file. So checkpointing is happening, but just not being reported through to BOINC. Progress is 3.582% after eleven minutes.
ID: 60161 | Rating: 0 | rate:
Actually, it is unclear if AToM's GPUGRID version checkpoints after catching termination signals. I'll ask Raimondas. Termination without checkpointing is usually okay, but progress since the checkpoint would be lost, and the number of samples recorded in the checkpoint file would not reflect the actual number of samples recorded. | |
ID: 60162 | Rating: 0 | rate:
The app seems to be both checkpointing, and updating progress, at the end of each sample. That will make re-alignment after a pause easier, but there's always some over-run, and data lost on restart. It's up to the application itself to record the data point reached, and to be used for the restart, as an integral part of the checkpointing process. | |
ID: 60163 | Rating: 0 | rate:
Seriously? Only 14 tasks a day?

GPUGRID  3/24/2023 9:17:44 AM  This computer has finished a daily quota of 14 tasks
ID: 60164 | Rating: 0 | rate:
Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.
ID: 60165 | Rating: 0 | rate:
The T_PTP1B_new task, on the other hand, is not reporting progress, even though it's logging checkpoints in the run.log.

<active_task>
    <project_master_url>https://www.gpugrid.net/</project_master_url>
    <result_name>T_PTP1B_new_23484_23482_T3_2A_1-QUICO_TEST_ATM-0-1-RND3714_3</result_name>
    <checkpoint_cpu_time>10.942300</checkpoint_cpu_time>
    <checkpoint_elapsed_time>30.176729</checkpoint_elapsed_time>
    <fraction_done>0.001996</fraction_done>
    <peak_working_set_size>8318976</peak_working_set_size>
    <peak_swap_size>16592896</peak_swap_size>
    <peak_disk_usage>1318196036</peak_disk_usage>
</active_task>

The <fraction_done> is reported as the 'progress %' figure - this one is reported as 0.199% by BOINC Manager (which truncates) and 0.200% by other tools (which round). This task has been running for 43 minutes, and boinc_task_state.xml hasn't been re-written since the first minute.
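For keeping an eye on exactly that file, a small sketch (default Linux slot path assumed; adjust for your data directory):

# Re-print the checkpoint and progress fields every 30 s to see whether the
# client ever rewrites boinc_task_state.xml (the path is an assumption)
watch -n 30 "grep -E 'fraction_done|checkpoint_elapsed_time' /var/lib/boinc-client/slots/*/boinc_task_state.xml"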
ID: 60166 | Rating: 0 | rate:
| |
ID: 60167 | Rating: 0 | rate:
| |
ID: 60168 | Rating: 0 | rate:
My BACE task 33378091 finished successfully after 5 hours, under Linux Mint 21.1 with a GTX 1660 Super. | |
ID: 60169 | Rating: 0 | rate:
Task 27438853 | |
ID: 60170 | Rating: 0 | rate:
Right, probably the wrapper should send a termination signal to AToM. We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal. However, I do not have access to the wrapper. Quico: please advise. | |
ID: 60171 | Rating: 0 | rate:
Hi, I have some "new_2" ATMs that have been running for 14+ hours now. Should I abort them?
ID: 60172 | Rating: 0 | rate:
The wrapper you're using at the moment is called "wrapper_26198_x86_64-pc-linux-gnu" (I haven't tried ATM under Windows yet, but can and will do so when I get a moment). It reports itself as:

20:37:54 (115491): wrapper (7.7.26016): starting

That would put the date back to around November 2015, but I guess someone has made some local modifications.
ID: 60173 | Rating: 0 | rate:
Hi, I have some "new_2" ATMs that have been running for 14+ hours now. Should I abort them?

I have one at the moment which has been running for 17.5 hours. The same machine completed one yesterday (task 33374928) which ran for 19 hours. I wouldn't abort it just yet.
ID: 60174 | Rating: 0 | rate:
Hi, I have some "new_2" ATMs that have been running for 14+ hours now. Should I abort them?

Thank you. I will leave them running =)
ID: 60175 | Rating: 0 | rate:
And completed. | |
ID: 60176 | Rating: 0 | rate:
Seriously? Only 14 tasks a day?

Quico, this behavior is intended to block misconfigured computers. In this case it's your Windows version that fails in seconds and gets resent until it hits a Linux computer or fails 7 times. My Win computer was locked out of GG early yesterday, but all my Linux computers donated until the WUs ran out. In this example the first 4 failures all went to Win 7 & 11 computers and then Linux completed it successfully: https://www.gpugrid.net/workunit.php?wuid=27438768

And the Win WUs are failing in seconds again with today's tranche.
ID: 60177 | Rating: 0 | rate:
WUs failing on Linux computers:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4
fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4 did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

https://www.gpugrid.net/result.php?resultid=33379917
ID: 60183 | Rating: 0 | rate:
Any ideas why WUs are failing on a Linux Ubuntu machine with a GTX 1070?

<core_client_version>7.20.5</core_client_version>
ID: 60184 | Rating: 0 | rate:
(I haven't tried ATM under Windows yet, but can and will do so when I get a moment).

Just downloaded a BACE task for Windows. There may be trouble ahead... The job.xml file reads:

<job_desc>
    <unzip_input>
        <zipfilename>windows_x86_64__cuda1121.zip</zipfilename>
    </unzip_input>
    <task>
        <application>python.exe</application>
        <command_line>bin/conda-unpack</command_line>
        <weight>1</weight>
    </task>
    <task>
        <application>Library/usr/bin/tar.exe</application>
        <command_line>xjvf input.tar.bz2</command_line>
        <setenv>PATH=$PWD/Library/usr/bin</setenv>
        <weight>1</weight>
    </task>
    <task>
        <application>C:/Windows/system32/cmd.exe</application>
        <command_line>/c call run.bat</command_line>
        <setenv>CUDA_DEVICE=$GPU_DEVICE_NUM</setenv>
        <stdout_filename>run.log</stdout_filename>
        <weight>1000</weight>
        <fraction_done_filename>progress</fraction_done_filename>
    </task>
</job_desc>

1) We had problems with python.exe triggering a missing DLL error. I'll run Dependency Walker over this one, to see what the problem is.
2) It runs a private version of tar.exe: Microsoft included tar as a system utility from Windows 10 onwards - but I'm running Windows 7. The MS utility wouldn't run for me - I'll try this one.
3) I'm not totally convinced of the cmd.exe syntax either, but we'll cross that bridge when we get to it.
ID: 60185 | Rating: 0 | rate:
First reports from Dependency Walker: | |
ID: 60186 | Rating: 0 | rate:
Just a note of warning: one of my machines is running a JNK1 task - been running for 13 hours. | |
ID: 60188 | Rating: 0 | rate: