Message boards : Number crunching : ATM: Free Energy Calculations new application
Author | Message |
---|---|
Just starting the thread for discussion of this new application. ATM = AToM. | |
ID: 59751 | Rating: 0 | rate:
about the restart failure: looks like it fails trying to create a directory that already exists ("mkdir: cannot create directory 'atm_tmp': File exists"). the app needs some work to allow for that. | |
ID: 59752 | Rating: 0 | rate:
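One common way to make a scratch-directory step safe across restarts is to create the directory idempotently (the shell equivalent is `mkdir -p`). The sketch below is only an illustration: the `ensure_dir` helper is hypothetical, and nothing here is taken from the actual ATM application scripts. Only the directory name `atm_tmp` comes from the error message above.

```python
import os
import tempfile


def ensure_dir(path):
    """Create `path` if it is missing; do nothing if the directory already exists.

    Hypothetical helper for illustration only, not part of the ATM app.
    """
    # exist_ok=True suppresses the "File exists" error on a restart.
    # It still raises FileExistsError if `path` exists but is a regular file.
    os.makedirs(path, exist_ok=True)
    return path


# Simulate a first start followed by a restart in a throwaway location.
with tempfile.TemporaryDirectory() as base:
    scratch = os.path.join(base, "atm_tmp")
    ensure_dir(scratch)  # first start: directory is created
    ensure_dir(scratch)  # restart: directory already exists, no error
    print(os.path.isdir(scratch))
```

Whether reusing the existing directory is actually safe depends on what the task left in it before the restart, which the thread does not say.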
another quality of life improvement would be adding a <weight> line to the main task in the job.xml file. right now, with 2 tasks in the file and no weights defined, I'm guessing it splits them 50/50 and thinks the task is 50% done once the extraction phase is complete. | |
ID: 59754 | Rating: 0 | rate:
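To illustrate why weights matter for the progress display, here is a minimal sketch that assumes the overall fraction done is a weighted average of the per-task fractions, with each task contributing in proportion to its weight. The function name and the 1/99 weights are hypothetical, and this is not the wrapper's actual code.

```python
def overall_fraction_done(weights, fractions):
    """Weighted average of per-task progress (each fraction is in 0.0..1.0).

    Assumption: every task contributes in proportion to its weight.
    """
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, fractions)) / total


# Two tasks: a short extraction step (finished) and the main run (not started).
finished_extraction = [1.0, 0.0]

# Equal weights: progress jumps to 50% as soon as extraction is done.
print(overall_fraction_done([1, 1], finished_extraction))

# Hypothetical weights of 1 and 99: extraction only accounts for 1%.
print(overall_fraction_done([1, 99], finished_extraction))
```

With a small weight on the short setup step, the reported percentage tracks the main computation much more closely.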
task ran to completion in about an hour. but hit an error and threw it all away because the file size is too big. upload failure: <file_xfer_error> <file_name>T11_4-RAIMIS_TEST_ATM-0-1-RND7054_2_0</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> what a waste. | |
ID: 59755 | Rating: 0 | rate:
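A failure like this could in principle be caught before an hour of computation is thrown away, by comparing output sizes against whatever upload limit applies. The sketch below is a generic illustration: `oversized_outputs`, the file name, and the 1 KiB limit are all hypothetical, and the real limit is set by the project and is not stated in this thread.

```python
import os
import tempfile


def oversized_outputs(paths, max_nbytes):
    """Return (path, size_in_bytes) pairs for files larger than max_nbytes."""
    flagged = []
    for path in paths:
        size = os.path.getsize(path)
        if size > max_nbytes:
            flagged.append((path, size))
    return flagged


# Demo with a throwaway file and a hypothetical 1 KiB limit.
with tempfile.TemporaryDirectory() as workdir:
    result_file = os.path.join(workdir, "result.bin")
    with open(result_file, "wb") as fh:
        fh.write(b"\0" * 2048)  # 2 KiB of dummy data
    for path, size in oversized_outputs([result_file], max_nbytes=1024):
        print(f"{os.path.basename(path)} is {size} bytes, over the limit")
```

A check like this only reports the problem. Actually fixing it means either producing smaller output files or raising the limit on the project side.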
I have over a dozen quick-failing ATM tasks. | |
ID: 59771 | Rating: 0 | rate:
looks like the small batch of tasks that went out today are better set up. ran for about an hour and completed successfully without the file size issue. | |
ID: 59792 | Rating: 0 | rate:
This one https://www.gpugrid.net/workunit.php?wuid=27399736 has been running for about 11 hours and has been stuck at 66.666% for at least 4 hours now. There is almost no load on the GPU, just a few percent (3-5) once in a while, but there is constantly some load on the memory controller (10-30). Hope it will finish some day :) | |
ID: 59899 | Rating: 0 | rate:
still no official communication from the project about these tasks. | |
ID: 59936 | Rating: 0 | rate:
Just been sent a TL4 from WU 27405970. I see you've aborted two previous tasks from the same WU, Ian, on two different machines. Did you get any CPU usage figures from previous runs? I think I'll start it up with the GTX 1660 plus one core, but I'll probably abort it myself if it doesn't show much response. | |
ID: 59937 | Rating: 0 | rate:
they spin up multiple processes like the Python tasks do. but I didn't catch them at the very beginning to see if they spike in use or anything like that. | |
ID: 59938 | Rating: 0 | rate:
OK, I've set 3 CPUs for continuity from the current Python task, and I've put weights of 1-1-1-97 in the job file so I can see what's happening. | |
ID: 59939 | Rating: 0 | rate:
I see what you mean. Nearly half an hour in, CPU usage is showing around 25% of a single core, and GPU usage spiked once, to 41%, after about a quarter of an hour. It's one way of saving electricity, but I'd rather be doing something useful. Aborting. | |
ID: 59941 | Rating: 0 | rate:
1.13 ATM running fine for me. | |
ID: 59961 | Rating: 0 | rate:
FWIW, the first task I received completed successfully. | |
ID: 59964 | Rating: 0 | rate:
I've also finished one: | |
ID: 59967 | Rating: 0 | rate:
Overnight, I had 4 of these tasks cancelled by the server. | |
ID: 59968 | Rating: 0 | rate:
1.13 ATM running fine for me. _______________ Same here. I quite enjoy completing these WUs. There should be a way to analyse why the failures happen only on certain machines. We are mostly running the same hardware and OS. It would be fun to see the results. | |
ID: 59969 | Rating: 0 | rate:
keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful. | |
ID: 59970 | Rating: 0 | rate:
keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful. _______________________ Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin on all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement. | |
ID: 59971 | Rating: 0 | rate:
Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin on all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement. Well, Abouh is the only one from the project team who actively communicates with us volunteers, which is great. All the others obviously don't care, and it has been like this over the years, unfortunately. For example: 9 days ago I asked in the ACEMD 4 thread when new ACEMD 4 tasks will be around, or whether this subproject is dead. No reply so far, whereas a reply could be very simple, no longer than a single line :-( You know what I want to say ... it's kind of disappointing at times :-( | |
ID: 59972 | Rating: 0 | rate:
keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful. that's great and all, but abouh is not the researcher working with this application. Abouh deals with the research with the Python RL tasks. These ATM tasks look to be run by Raimis. (the researcher names are in the filenames of the WUs) | |
ID: 59973 | Rating: 0 | rate:
https://gpugrid.net/result.php?resultid=33321222 | |
ID: 59974 | Rating: 0 | rate:
... failed due to file size limit I am just trying to remember with which other application we've had the same problem some time ago - last year or 2 years ago ??? | |
ID: 59975 | Rating: 0 | rate:
... failed due to file size limit it's happened a few times in the past with acemd3 tasks. see here from July 2021: https://www.gpugrid.net/forum_thread.php?id=5239#57117 | |
ID: 59976 | Rating: 0 | rate:
Yea, I got my first ATM checkpoint :-) | |
ID: 59977 | Rating: 0 | rate:
Yea, I got my first ATM checkpoint :-) the uploads are nearly 700MB in size, and likely the same problem from my link that we saw over a year ago. their server can't accept something that big. I don't think they ever figured out how to adjust the settings of their file server; they just tried to keep the file sizes below the limit, which they seem to have forgotten about. nothing you do will get them to upload. I've disabled ATM until they get it sorted out. | |
ID: 59978 | Rating: 0 | rate:
On past chance, I bet and lost. | |
ID: 59979 | Rating: 0 | rate:
GDF, Should I Abort these 12 completed ATM WUs that won't upload or is there a reasonable chance you'll fix it? | |
ID: 59980 | Rating: 0 | rate:
Well, I just achieved my 100 hours, which was my 1st priority. I will abort and reset (if necessary) the completed tasks I have. If/when the project gets its act together, I'll be back. | |
ID: 59981 | Rating: 0 | rate:
For me it's just this: So 26 Feb 2023 11:57:00 CET | GPUGRID | Started upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0 So 26 Feb 2023 11:57:02 CET | GPUGRID | Backing off 04:12:16 on upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0 So 26 Feb 2023 11:57:19 CET | GPUGRID | Started upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0 So 26 Feb 2023 11:57:22 CET | GPUGRID | Backing off 05:10:06 on upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0 No message about the size, just about backing off. Hooray! Greetings, Jens | |
ID: 59986 | Rating: 0 | rate:
I just aborted the upload (not the workunit) and then it was reported as valid. | |
ID: 59988 | Rating: 0 | rate:
I just aborted the upload (not the workunit) and then it was reported as valid. Indeed, this worked out for me as well. But is there a result that can be used? Greetings, Jens | |
ID: 59989 | Rating: 0 | rate:
For me it's just this: There won’t be any message about why it failed unless you enable debugging messages. See the previous link I posted about when this issue happened 1.5 years ago. | |
ID: 59990 | Rating: 0 | rate:
I just aborted the upload (not the workunit) and then it was reported as valid. Partially successful for me. I attempted with two of these and one ended up as "Upload failed" while the other "Completed and validated". | |
ID: 59991 | Rating: 0 | rate:
I just aborted the upload (not the workunit) and then it was reported as valid. Indeed, this worked out for me as well. | |
ID: 59992 | Rating: 0 | rate:
I just aborted the upload (not the workunit) and then it was reported as valid. It worked on multiple PCs for me too. | |
ID: 59993 | Rating: 0 | rate:
task ran to completion in about an hour. but hit an error and threw it all away because the file size is too big. No it's not a waste in my opinion because you found something out. You found that "the file size was too big" so it can be corrected so it doesn't happen again hopefully. :-) | |
ID: 60008 | Rating: 0 | rate:
this now is a topic also on this thread: | |
ID: 60010 | Rating: 0 | rate:
How can I get ATM? | |
ID: 60178 | Rating: 0 | rate:
you need to enable beta/test applications in your project preferences | |
ID: 60179 | Rating: 0 | rate:
Ah, thanks. I had missed the "test application" setting. | |
ID: 60180 | Rating: 0 | rate:
So far, I have noticed abnormal progress reporting on ATM tasks. | |
ID: 60313 | Rating: 0 | rate:
There is still a mix of old, broken progress tasks along with fixed progress tasks in rotation. | |
ID: 60314 | Rating: 0 | rate:
No, it's not the replication number. | |
ID: 60315 | Rating: 0 | rate:
Nice explanation. | |
ID: 60317 | Rating: 0 | rate: