Acemd3 restart on windows possibly fixed

Author	Message
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0	Message 52795 - Posted: 7 Oct 2019 \| 13:38:53 UTC
	The new acemd3 app should fix the issue. Thanks for all the reporting! Note that one still can't restart between different types of cards.

Aurum Joined: 12 Jul 17 Posts: 399 Credit: 13,269,002,382 RAC: 15,983,814	Message 52798 - Posted: 7 Oct 2019 \| 15:13:56 UTC - in response to Message 52795.
	That's great news, Toni. I hope you'll send a BOINC notice out when Linux is back in production so those of us on walkabout know to return.

Keith Myers Joined: 13 Dec 17 Posts: 1289 Credit: 5,227,606,959 RAC: 10,542,720	Message 52799 - Posted: 7 Oct 2019 \| 16:58:45 UTC - in response to Message 52795.
	Just change the "Switch between tasks every" setting in your computing preferences to something like 360 minutes, and the task should start and finish on the same card, avoiding the issue of restarting on a dissimilar card. If all your cards are the same brand and type (maybe only the same type), you can restart on a different card and finish with no errors.
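For anyone who would rather set this on a single host than on the website, here is a minimal sketch that writes the same setting as a local override. It assumes the client reads a `global_prefs_override.xml` file from its data directory and that `cpu_scheduling_period_minutes` is the element behind "Switch between tasks every"; check both against your BOINC version before relying on it. The function name `build_prefs_override` is made up for this example.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(switch_minutes: float) -> str:
    """Return the text of an override file that sets only the task-switch interval."""
    root = ET.Element("global_preferences")
    # Assumed tag name for "Switch between tasks every N minutes".
    period = ET.SubElement(root, "cpu_scheduling_period_minutes")
    period.text = f"{switch_minutes:.6f}"
    return ET.tostring(root, encoding="unicode")


# 360 minutes is the value suggested in the post above.
override_xml = build_prefs_override(360)
print(override_xml)
```

Save the printed text as `global_prefs_override.xml` in the BOINC data directory, then have the client re-read its local preferences (for example with `boinccmd --read_global_prefs_override`). A long switch interval only makes it less likely that a task is preempted and resumed on the other card; it does not pin a task to a device.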

zombie67 [MM] Joined: 16 Jul 07 Posts: 207 Credit: 1,763,426,456 RAC: 7,113,060	Message 52800 - Posted: 7 Oct 2019 \| 17:54:15 UTC
	Only two of the three Windows apps were updated: cuda92 and cuda101. Why not the cuda100 app too?
	____________
	Reno, NV Team: SETI.USA

Bedrich Hajek Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545	Message 52802 - Posted: 8 Oct 2019 \| 2:24:48 UTC
	I managed to successfully complete a task after suspending and restarting it on the Windows 7 computer with an RTX 2080 Ti card. When I suspended the task, the wrapper and acemd3 disappeared from Task Manager, and then reappeared when the task restarted: http://www.gpugrid.net/result.php?resultid=21429745
	On the Windows 10 computer, I ran a task on the GTX 980 Ti card, and I suspended and resumed it successfully. I suspended it again and rebooted the computer, and it restarted successfully. I received another task, which ran successfully on the RTX 2080 Ti card, side by side with the other task. I then suspended both tasks and restarted the task that had been running on the 980 Ti on the 2080 Ti instead, and it crashed: http://www.gpugrid.net/result.php?resultid=21429733
	I restarted the other task, which started on the 2080 Ti and is running well there right now: http://www.gpugrid.net/result.php?resultid=21429800
	You can't start a task on one card and restart it successfully on another card. I haven't yet tried rebooting the computer without first suspending the task.

Nick Name Joined: 3 Sep 13 Posts: 53 Credit: 1,533,531,731 RAC: 0	Message 52803 - Posted: 8 Oct 2019 \| 4:09:39 UTC - in response to Message 52802.
	"... On the windows 10 computer, I ran a task on the gtx 980ti card, I suspended and resumed it successfully. I suspended it again, reboot the computer, it restarted successfully. I received another task, which ran on the rtx 2080ti card successfully, side by side with the other task. I then suspended both tasks, and restarted the task, which ran on the 980ti, on the 2080ti, and it crashed. http://www.gpugrid.net/result.php?resultid=21429733"
	I have this task now, and it's not loading the GPU at all. It's the second task like that I've had in the last couple of days: http://www.gpugrid.net/result.php?resultid=21429805
	The other one failed on a suspend/restart, when I paused the client: http://www.gpugrid.net/result.php?resultid=21429344
	That one did validate on another machine, and it looks like the one I have now is slowly making progress, so I'll let it run at least overnight to give it a chance to complete.
	____________
	Team USA forum \| Team USA page Join us and #crunchforcures. We are now also folding: join team ID 236370!

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 52804 - Posted: 8 Oct 2019 \| 5:26:35 UTC Last modified: 8 Oct 2019 \| 5:58:58 UTC
	Received a New version of ACEMD v2.08 (cuda101) work unit on a Win8.1 (Update 1) host with a GTX 750 Ti GPU.
	Work unit name: e40s11_e37s6p1f279-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_4-0-2-RND1517_1
	- Suspended the work unit after 18 minutes (2.2% complete).
	- Wrapper and ACEMD tasks disappeared from Task Manager.
	- Resumed the work unit; the wrapper and ACEMD tasks reappeared and the work unit continued to process.
	Then rebooted the PC after the work unit had been running for 21 minutes (without suspending the WU). The work unit restarted successfully and continues to process.
	NOTES
	- The Remaining (estimated) time does not seem to change or indicate an accurate run time (only a small issue).
	- Checkpoint seems to be every 90 seconds.
	- The GTX 750 Ti is running at 98% utilization and 94% power according to nvidia-smi. This GPU does not reach these figures on the old tasks.
	This work unit may take another 13 hours to complete at the current rate. The work unit is here: http://www.gpugrid.net/result.php?resultid=21429814
	It has not completed yet, but these are still encouraging results!
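For others who want to compare utilization and power between the old and new apps, here is a rough sketch of reading the same two figures from `nvidia-smi`. It assumes the installed driver reports the `utilization.gpu`, `power.draw` and `power.limit` query fields for your card (some cards report power as unsupported), and it takes "power" to mean draw as a percentage of the limit. The sample line is hypothetical, chosen only to resemble the numbers in the post.

```python
import subprocess

# Query fields are assumed to be supported by the installed driver and card.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,utilization.gpu,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Turn one CSV line of nvidia-smi output into utilization and power percentages."""
    name, util, draw, limit = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "utilization_pct": float(util),
        "power_pct": 100.0 * float(draw) / float(limit),
    }


def read_gpus() -> list:
    """Run nvidia-smi and parse every GPU it lists (needs an NVIDIA driver installed)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Hypothetical sample line, not real output from the host in the post.
sample = "GeForce GTX 750 Ti, 98, 36.10, 38.50"
info = parse_gpu_line(sample)
print(f"{info['name']}: {info['utilization_pct']:.0f}% utilization, {info['power_pct']:.1f}% power")
```

On a real host, call `read_gpus()` once while an old task is running and once during an acemd3 task, then compare the two readings.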

[PUGLIA] kidkidkid3 Joined: 23 Feb 11 Posts: 81 Credit: 954,353,044 RAC: 109,369	Message 52810 - Posted: 8 Oct 2019 \| 20:05:49 UTC - in response to Message 52804. Last modified: 8 Oct 2019 \| 20:06:24 UTC
	Hi, Acemd3 WU in error at the end ... same GPU, no suspend/resume action ... http://www.gpugrid.net/result.php?resultid=21430068
	K.
	____________
	Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing. (Martin Luther King)

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 52814 - Posted: 8 Oct 2019 \| 23:34:27 UTC - in response to Message 52810. Last modified: 8 Oct 2019 \| 23:36:01 UTC
	"Acemd3 WU in error at the end ... same GPU, no suspend/resume action ... http://www.gpugrid.net/result.php?resultid=21430068"
	Your host has returned 2 "New Version ACEMD" work units that have both ended in "upload failure".
	Other WU: http://www.gpugrid.net/result.php?resultid=21428934
	Failure message:
	<message>
	upload failure: <file_xfer_error>
	  <file_name>e39s4_e33s7p1f250-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-1-2-RND0503_0_0</file_name>
	  <error_code>-240 (stat() failed)</error_code>
	</file_xfer_error>
	</message>
	The WUs appear to complete successfully (Exit State: 0). Very curious, as all the "old" work units upload fine. I think the key to the error is in the error_code: stat() failed.
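If you want to check several failed results for the same pattern, here is a small sketch that pulls the file name and error code out of a failure message like the one quoted above. The helper `parse_xfer_errors` is made up for this example and assumes the message text keeps the layout shown in the post; it is not part of BOINC.

```python
import re

# Failure message copied from the post above.
MESSAGE = """<message>
upload failure: <file_xfer_error>
  <file_name>e39s4_e33s7p1f250-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-1-2-RND0503_0_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>"""

# One match per <file_xfer_error> block: file name, numeric code, text in parentheses.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>.*?)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>.*?)\)</error_code>",
    re.DOTALL,
)


def parse_xfer_errors(text: str) -> list:
    """Return one dict per upload error found in a result's failure message."""
    return [
        {"file": m.group("name"), "code": int(m.group("code")), "reason": m.group("reason")}
        for m in XFER_ERROR.finditer(text)
    ]


errors = parse_xfer_errors(MESSAGE)
for err in errors:
    print(f"{err['file']}: error {err['code']} ({err['reason']})")
```

A failing `stat()` generally means the client could not find or read the file it was about to upload, which fits the hunch in the post, though nothing in this thread confirms the actual cause.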

Bedrich Hajek Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545	Message 52815 - Posted: 9 Oct 2019 \| 1:34:37 UTC
	I rebooted the computer (without suspending the tasks), and these 2 tasks were able to restart and finish successfully afterwards: http://www.gpugrid.net/result.php?resultid=21429991 http://www.gpugrid.net/result.php?resultid=21430924
	There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my Windows 10 machine, which has a Maxwell card and a Turing card. I wonder if there's a connection there? http://www.gpugrid.net/result.php?resultid=21429972 The error was due to suspend and restart.
	I also have an unexplained error: http://www.gpugrid.net/result.php?resultid=21429886 It was running on the 980 Ti card. I suspended and restarted it successfully. It was running fine when I left it. Next morning, I found that it had crashed. The 2080 Ti was running either Einstein or Milkyway tasks. Every once in a long while, the Einstein gamma-ray pulsar task will cause the NVIDIA driver to crash momentarily, and then it restarts. Maybe that, and the fact that it was running on a non-Turing card, are the reasons for this crash. After the task crashed, it caused Afterburner to crash. I had to restart that also.

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 52816 - Posted: 9 Oct 2019 \| 4:54:14 UTC - in response to Message 52815. Last modified: 9 Oct 2019 \| 5:53:34 UTC
	"There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my windows 10 machine, which has a Maxwell card and a Turing card. I wonder, if there's connection there? http://www.gpugrid.net/result.php?resultid=21429972 The error was due to suspend and restart."
	From what I understand, I think only ACEMD v2.08 survives the suspend/restart.
	"I also have an unexplained error: http://www.gpugrid.net/result.php?resultid=21429886"
	Assuming no issues with the host or other projects, this looks like a new error. (These assumptions would also need to be explored.) From the stderr output:
	# Engine failed: Error invoking kernel: CUDA_ERROR_UNKNOWN (999)
	The error appears after 7 minutes (2 minutes after the task was suspended and resumed).

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 52832 - Posted: 10 Oct 2019 \| 10:16:47 UTC
	Received another New version of ACEMD v2.08 work unit on a Win7 host with a GTX 750 GPU. Suspended and resumed the work unit. It proceeded fine after the suspend/resume and completed successfully. Work unit here: http://gpugrid.net/result.php?resultid=21433334

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 349	Message 52833 - Posted: 10 Oct 2019 \| 12:58:56 UTC - in response to Message 52795. Last modified: 10 Oct 2019 \| 12:59:14 UTC
	"The new acemd3 app should fix the issue. Thanks for all the reporting! Note that one still can't restart between different types of cards."
	When do you plan to release the ACEMD3 client for Linux? I thought you would do it after the Windows client was fixed. I know it still has (at least) one problem (restarting a task on a different card), but the Linux client probably has the same problem. Alternatively, you could put part of the long workunits into the ACEMD3 queue (it has the new client for both platforms).

STARBASEn Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0	Message 52836 - Posted: 11 Oct 2019 \| 16:37:07 UTC
	Not getting any work yet for any of my Linux machines, and I don't have Windows, so I have been out of luck since May, but I am sure E@H is pleased. I had been getting some of the beta tests; however, there has been nothing recently, since the restart problems appear to have been resolved on the Windows systems.

STARBASEn Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0	Message 52838 - Posted: 12 Oct 2019 \| 18:51:04 UTC
	Got a new ACEMD 2.06 (http://www.gpugrid.net/result.php?resultid=21443611) WU on one of my Linux machines. It is about 53% finished and was suspended and restarted once without issue. The wingman's WU failed on a Win10 machine with a little less than 2 min completed, reason unknown. Both systems used GTX 1060s. I'm using driver version 430.50 and BOINC 7.16.1 (Fedora distro). All is good with Linux so far.

Bedrich Hajek Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545	Message 52842 - Posted: 13 Oct 2019 \| 13:36:45 UTC
	I noticed that when I am running an acemd3 task on each of the cards (Maxwell and Turing), they run slower than when I am running one acemd3 task on one card and another type of task on the other card.
	Acemd3 tasks running on both cards simultaneously:
	Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
	21443009	16813137	12 Oct 2019 \| 12:46:46 UTC	12 Oct 2019 \| 17:53:01 UTC	Completed and validated	17,927.90	17,526.83	75,000.00	New version of ACEMD v2.06 (cuda100)
	21443008	16813136	12 Oct 2019 \| 12:46:46 UTC	12 Oct 2019 \| 17:08:35 UTC	Completed and validated	7,590.81	7,420.91	75,000.00	New version of ACEMD v2.06 (cuda100)
	21443007	16813135	12 Oct 2019 \| 12:46:46 UTC	12 Oct 2019 \| 19:14:35 UTC	Completed and validated	7,557.09	7,356.45	75,000.00	New version of ACEMD v2.06 (cuda100)
	21443006	16813134	12 Oct 2019 \| 12:46:46 UTC	12 Oct 2019 \| 15:01:54 UTC	Completed and validated	7,589.88	7,461.47	75,000.00	New version of ACEMD v2.06 (cuda100)
	Acemd3 tasks running side by side with non-acemd3 tasks:
	Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
	21442682	16812889	12 Oct 2019 \| 2:13:51 UTC	12 Oct 2019 \| 4:09:23 UTC	Completed and validated	6,604.02	6,413.02	75,000.00	New version of ACEMD v2.08 (cuda101)
	21430016	16803567	8 Oct 2019 \| 11:27:46 UTC	8 Oct 2019 \| 13:22:21 UTC	Completed and validated	6,498.40	6,246.75	75,000.00	New version of ACEMD v2.08 (cuda101)
	21429800	16803495	8 Oct 2019 \| 1:41:48 UTC	8 Oct 2019 \| 3:40:53 UTC	Completed and validated	6,519.76	6,327.67	75,000.00	New version of ACEMD v2.08 (cuda101)
	21429286	16803018	7 Oct 2019 \| 1:23:48 UTC	7 Oct 2019 \| 11:50:24 UTC	Completed and validated	16,540.09	16,367.14	75,000.00	New version of ACEMD v2.06 (cuda100)
	Either Einstein, Milkyway or Long Runs are running on the other card.
	I also noticed an issue with the scheduler not asking for GPU tasks when I had all these lines in my cc_config.xml file:
	<exclude_gpu>
	   <url>http://www.gpugrid.net/</url>
	   <device_num>0</device_num>
	   <app>acemdlong</app>
	</exclude_gpu>
	<exclude_gpu>
	   <url>http://www.gpugrid.net/</url>
	   <device_num>0</device_num>
	   <app>acemdshort</app>
	</exclude_gpu>
	<exclude_gpu>
	   <url>http://www.gpugrid.net/</url>
	   <device_num>1</device_num>
	   <app>acemd3</app>
	</exclude_gpu>
	What I am telling BOINC is to not run long and short tasks on the Turing card and not to run acemd3 tasks on the Maxwell card. The logic works, but again the scheduler doesn't ask for GPU tasks, no matter what I set the cache number to, even when I have fewer than 2 tasks per card downloaded. (I downloaded the tasks before I ran this test.) But if I delete this from the file and, of course, save it:
	<exclude_gpu>
	   <url>http://www.gpugrid.net/</url>
	   <device_num>1</device_num>
	   <app>acemd3</app>
	</exclude_gpu>
	Everything works fine. The scheduler asks for GPU tasks. Is this a BOINC problem or a GPUGRID problem?
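To make the intent of those three excludes easier to check, here is a minimal sketch that builds the same entries and lists which devices each app is still allowed on. It assumes the `<exclude_gpu>` blocks sit inside `<cc_config><options>`, the usual layout of cc_config.xml, since the post shows only the fragments. The helper `allowed_devices` is a made-up, simplified model of the exclusion rule, not the BOINC client's actual scheduling logic.

```python
import xml.etree.ElementTree as ET

PROJECT_URL = "http://www.gpugrid.net/"

# (device_num, app) pairs copied from the post: device 0 is the Turing card,
# device 1 is the Maxwell card, according to the poster's description.
EXCLUDES = [(0, "acemdlong"), (0, "acemdshort"), (1, "acemd3")]


def build_cc_config(excludes, url=PROJECT_URL):
    """Build a cc_config tree holding one <exclude_gpu> entry per (device, app) pair."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")  # assumed parent element
    for device_num, app in excludes:
        entry = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(entry, "url").text = url
        ET.SubElement(entry, "device_num").text = str(device_num)
        ET.SubElement(entry, "app").text = app
    return root


def allowed_devices(root, app, device_count):
    """Simplified model: devices that no <exclude_gpu> entry removes for this app."""
    excluded = {
        int(entry.findtext("device_num"))
        for entry in root.iter("exclude_gpu")
        if entry.findtext("app") == app
    }
    return [dev for dev in range(device_count) if dev not in excluded]


config = build_cc_config(EXCLUDES)
for app_name in ("acemdlong", "acemdshort", "acemd3"):
    print(app_name, "->", allowed_devices(config, app_name, device_count=2))
```

With all three entries, each app is left with exactly one card, which matches the stated intent. Dropping the acemd3 entry gives acemd3 both cards again, the configuration in which the scheduler resumed asking for work.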

Bedrich Hajek Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545	Message 52843 - Posted: 13 Oct 2019 \| 19:44:25 UTC - in response to Message 52842.
	"I noticed that when I am running acemd3 task on each of cards (Maxwell and Turing), they ran slower than if I was running one acemd3 on one card and another type of task on the other card. ..."
	Here is an observation that contradicts my previous observation:
	Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
	21444555	16814402	13 Oct 2019 \| 14:27:59 UTC	13 Oct 2019 \| 17:43:04 UTC	Completed and validated	7,623.39	7,471.55	75,000.00	New version of ACEMD v2.06 (cuda100)
	21444501	16814357	13 Oct 2019 \| 12:40:33 UTC	13 Oct 2019 \| 15:35:54 UTC	Completed and validated	7,577.33	7,431.94	75,000.00	New version of ACEMD v2.06 (cuda100)
	These two tasks ran on the Turing card while the Maxwell card was running a long task. They were running slower. So what is causing this: the tasks being v2.06, the tasks themselves, or something else? I don't know! OK, no more theories, at least for a while.
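The comparison can be put into numbers with a short sketch using the run times from the task lists in this post and Message 52842. It assumes the roughly 6,500 to 7,600 second results all come from the Turing card and leaves out the two results above 16,000 seconds (presumably the Maxwell card); the posts state the card only for the last two tasks.

```python
from statistics import mean

# Run times in seconds, copied from the task lists (assumed to be Turing-card results).
both_cards_v206 = [7590.81, 7557.09, 7589.88]  # v2.06, acemd3 running on both cards
mixed_v208 = [6604.02, 6498.40, 6519.76]       # v2.08, other card on a non-acemd3 task
mixed_v206 = [7623.39, 7577.33]                # v2.06, Maxwell card on a long task


def percent_slower(slow, fast):
    """How much longer the mean run time of `slow` is than that of `fast`, in percent."""
    return (mean(slow) / mean(fast) - 1.0) * 100.0


both_vs_v208 = percent_slower(both_cards_v206, mixed_v208)
mixed_vs_v208 = percent_slower(mixed_v206, mixed_v208)
mixed_vs_both = percent_slower(mixed_v206, both_cards_v206)

print(f"v2.06, both cards busy with acemd3, vs v2.08: {both_vs_v208:.1f}% slower")
print(f"v2.06, other card on a long task, vs v2.08:   {mixed_vs_v208:.1f}% slower")
print(f"v2.06 mixed vs v2.06 both cards:              {mixed_vs_both:.1f}% slower")
```

On these few samples the v2.06 tasks take about 16% longer than the v2.08 tasks regardless of what the other card is doing, while the two v2.06 groups differ by well under 1%. That is consistent with the app version or the tasks themselves being the cause, but it is far too little data to settle the question.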

Keith Myers Joined: 13 Dec 17 Posts: 1289 Credit: 5,227,606,959 RAC: 10,542,720	Message 52845 - Posted: 14 Oct 2019 \| 0:35:16 UTC - in response to Message 52842.
	"Everything works fine. The scheduler asks for GPU tasks. Is this a boinc problem or a GPUGRID problem?"
	Both. You are running a client that does not handle excludes well. The latest development version does much better. Also, the project runs very old server software that should be updated but has not been.

Post to thread

Message boards : News : Acemd3 restart on windows possibly fixed

	About	Science	Volunteers	Performance	Forum	Join us	Donate

Author	Message
Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 52795 - Posted: 7 Oct 2019 \| 13:38:53 UTC
	The new acemd3 app should fix the issue. Thanks for all the reporting! Note that one still can't restart between different types of cards.
	ID: 52795 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 399 Credit: 13,269,002,382 RAC: 15,983,814 Level Scientific publications	Message 52798 - Posted: 7 Oct 2019 \| 15:13:56 UTC - in response to Message 52795.
	That's great news Toni. I hope you'll send a BOINC notice out when Linux is back in production so those of us on walkabout know to return. ____________
	ID: 52798 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1289 Credit: 5,227,606,959 RAC: 10,542,720 Level Scientific publications	Message 52799 - Posted: 7 Oct 2019 \| 16:58:45 UTC - in response to Message 52795.
	Just change your Preferences for Computing to "Switch between tasks every" to something like 360 minutes and the task should start and finish on the same card avoiding the issue of restarting on a dissimilar card. If all your cards are the same brand and type, maybe only type, you can restart on a different card and finish with no errors.
	ID: 52799 \| Rating: 0 \| rate: / Reply Quote

zombie67 [MM] Send message Joined: 16 Jul 07 Posts: 207 Credit: 1,763,426,456 RAC: 7,113,060 Level Scientific publications	Message 52800 - Posted: 7 Oct 2019 \| 17:54:15 UTC
	Only two of the three windows apps were updated, cuda92 & cuda101 Why not the cuda100 app too? ____________ Reno, NV Team: SETI.USA
	ID: 52800 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545 Level Scientific publications	Message 52802 - Posted: 8 Oct 2019 \| 2:24:48 UTC
	I managed to successfully complete a task after suspending it and restarting it on the windows 7 computer with a rtx 2080ti card. When I suspend the task the wrapper and the acemd3 disappeared from the task manager, and then reappear when the task restarted: http://www.gpugrid.net/result.php?resultid=21429745 On the windows 10 computer, I ran a task on the gtx 980ti card, I suspended and resumed it successfully. I suspended it again, reboot the computer, it restarted successfully. I received another task, which ran on the rtx 2080ti card successfully, side by side with the other task. I then suspended both tasks, and restarted the task, which ran on the 980ti, on the 2080ti, and it crashed. http://www.gpugrid.net/result.php?resultid=21429733 I restarted the other task , and which started on the 2080ti and is running well on the 2080ti, right now: http://www.gpugrid.net/result.php?resultid=21429800 You can't start the tasks on one and restarted successfully on another card. I haven't tried reboot the computer without first suspend the task, yet.
	ID: 52802 \| Rating: 0 \| rate: / Reply Quote

Nick Name Send message Joined: 3 Sep 13 Posts: 53 Credit: 1,533,531,731 RAC: 0 Level Scientific publications	Message 52803 - Posted: 8 Oct 2019 \| 4:09:39 UTC - in response to Message 52802.
	... On the windows 10 computer, I ran a task on the gtx 980ti card, I suspended and resumed it successfully. I suspended it again, reboot the computer, it restarted successfully. I received another task, which ran on the rtx 2080ti card successfully, side by side with the other task. I then suspended both tasks, and restarted the task, which ran on the 980ti, on the 2080ti, and it crashed. http://www.gpugrid.net/result.php?resultid=21429733 I have this task now, and it's not loading the GPU at all. It's the second task like that I've had in the last couple days. http://www.gpugrid.net/result.php?resultid=21429805 The other one failed on a suspend / restart, when I paused the client. http://www.gpugrid.net/result.php?resultid=21429344 That one did validate on another machine, and it looks like the one I have now is slowly making progress, so I'll let it run at least overnight to give it a chance to complete. ____________ Team USA forum \| Team USA page Join us and #crunchforcures. We are now also folding:join team ID 236370!
	ID: 52803 \| Rating: 0 \| rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52804 - Posted: 8 Oct 2019 \| 5:26:35 UTC Last modified: 8 Oct 2019 \| 5:58:58 UTC
	Received a New version of ACMD v2.08 (cuda101) Work Unit on a Win8.1 (update 1) Host with GTX750ti GPU. Work Unit Name: e40s11_e37s6p1f279-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_4-0-2-RND1517_1 - Suspended Work Unit after 18 minutes (2.2% complete) - Wrapper and ACEMD tasks disappeared from Task Manager. - Resumed Work Unit, Wrapper and ACEMD tasks reappeared and Work Unit continued to process. Then rebooted PC after Work Unit had been running for 21 minutes. (without suspending WU) Work Unit successfully restarted and continues to process. NOTES - The Remaining (estimated) time does not seem to change or indicate an accurate run time. (only a small issue) - Checkpoint seems to be every 90 seconds - GTX750ti is running at 98% utilization and 94% power according to nvidia-smi. This GPU does not reach these figures on the old tasks. This Work unit may take another 13 hours to complete at current rate. Work Unit is here: http://www.gpugrid.net/result.php?resultid=21429814 It has not completed yet, but is still encouraging results!
	ID: 52804 \| Rating: 0 \| rate: / Reply Quote

[PUGLIA] kidkidkid3 Send message Joined: 23 Feb 11 Posts: 81 Credit: 954,353,044 RAC: 109,369 Level Scientific publications	Message 52810 - Posted: 8 Oct 2019 \| 20:05:49 UTC - in response to Message 52804. Last modified: 8 Oct 2019 \| 20:06:24 UTC
	Hi, Acemd3 WU in error at the end ... same GPU, no suspend/resume action ... http://www.gpugrid.net/result.php?resultid=21430068 K. ____________ Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing. (Martin Luther King)
	ID: 52810 \| Rating: 0 \| rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52814 - Posted: 8 Oct 2019 \| 23:34:27 UTC - in response to Message 52810. Last modified: 8 Oct 2019 \| 23:36:01 UTC
	Acemd3 WU in error at the end ... same GPU, no suspend/resume action ... http://www.gpugrid.net/result.php?resultid=21430068 Your host has returned 2 "New Version ACEMD" work units that have both ended in "upload failure" Other WU: http://www.gpugrid.net/result.php?resultid=21428934 Failure Message: <message> upload failure: <file_xfer_error> <file_name>e39s4_e33s7p1f250-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-1-2-RND0503_0_0</file_name> <error_code>-240 (stat() failed)</error_code> </file_xfer_error> </message> The WUs appear to complete successfully Exit State: 0 Very curious as all the "old" work units upload fine. I think the key to the error is in the error_code: stat() failed
	ID: 52814 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545 Level Scientific publications	Message 52815 - Posted: 9 Oct 2019 \| 1:34:37 UTC
	I rebooted the computer (without suspending the tasks), these 2 task were able to restarted and finish successfully afterwards: http://www.gpugrid.net/result.php?resultid=21429991 http://www.gpugrid.net/result.php?resultid=21430924 There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my windows 10 machine, which has a Maxwell card and a Turing card. I wonder, if there's connection there? http://www.gpugrid.net/result.php?resultid=21429972 The error was due to suspend and restart. I also have an unexplained error: http://www.gpugrid.net/result.php?resultid=21429886 It was running on the 980ti card. I suspended and restarted it successfully. It was running fine when I left it. Next morning, I found that it crashed. The 2080ti was running either Einstein or Milkyway tasks. Every once in a long while the Einstein gamma ray pulsar task will cause the NVIDIA driver to crash momentary, then it restarts. Maybe that and it was running on a non Turing card are the reasons for this crash. After the task crashed, it cause afterburner to crash. I had to restart that also.
	ID: 52815 \| Rating: 0 \| rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52816 - Posted: 9 Oct 2019 \| 4:54:14 UTC - in response to Message 52815. Last modified: 9 Oct 2019 \| 5:53:34 UTC
	There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my windows 10 machine, which has a Maxwell card and a Turing card. I wonder, if there's connection there? http://www.gpugrid.net/result.php?resultid=21429972 The error was due to suspend and restart. From what I understand, I think only ACEMD v2.08 survives the suspend/restart. I also have an unexplained error: http://www.gpugrid.net/result.php?resultid=21429886 Assuming no issues with the host/other projects, looks like a new error. (These assumptions would need to be explored also) From the Stderr output: # Engine failed: Error invoking kernel: CUDA_ERROR_UNKNOWN (999) Error appears after 7 minutes (2 minutes after Task was suspended and resumed).
	ID: 52816 \| Rating: 0 \| rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52832 - Posted: 10 Oct 2019 \| 10:16:47 UTC
	Received another New version ACEMD v2.08 work unit on a Win7 Host with GTX750 GPU. Suspended and resumed the work unit. Work unit proceeded fine after suspend/resume and completed successfully. Work unit here: http://gpugrid.net/result.php?resultid=21433334
	ID: 52832 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 349 Level Scientific publications	Message 52833 - Posted: 10 Oct 2019 \| 12:58:56 UTC - in response to Message 52795. Last modified: 10 Oct 2019 \| 12:59:14 UTC
	The new acemd3 app should fix the issue. Thanks for all the reporting! Note that one still can't restart between different types of cards. When do you plan to release the ACEMD3 client for Linux? I thought that you will do it after the Windows client is fixed. I know it still has (at least) one problem (restarting a task on a different card), but probably the Linux client has the same problem. Alternatively you could put part of the long workunits to the ACEMD3 queue (it has the new client for both platforms).
	ID: 52833 \| Rating: 0 \| rate: / Reply Quote

STARBASEn Send message Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0 Level Scientific publications	Message 52836 - Posted: 11 Oct 2019 \| 16:37:07 UTC
	Not getting any work yet for any of my Linux machines and I don't have Windows so I have been out of luck since May but I am sure E@H is pleased. I had been getting some of the beta tests however, just nothing recently since the restart problems appear to have been resolved with the Windows systems.
	ID: 52836 \| Rating: 0 \| rate: / Reply Quote

STARBASEn Send message Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0 Level Scientific publications	Message 52838 - Posted: 12 Oct 2019 \| 18:51:04 UTC
	Got a new ACEMD 2.06 (http://www.gpugrid.net/result.php?resultid=21443611) WU on one of my Linux machines. About 53% finished and was suspended and restarted once without issue. Wingman WU failed on a Win10 machine a with little less than 2 min completed, reason unknown. Both systems used GTX-1060's. I'm using driver version 430.50 and boinc 7.16.1 (Fedora distro). All is good with Linux so far.
	ID: 52838 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545 Level Scientific publications	Message 52842 - Posted: 13 Oct 2019 \| 13:36:45 UTC
	I noticed that when I am running acemd3 task on each of cards (Maxwell and Turing), they ran slower than if I was running one acemd3 on cone card and another type of task on the other card. Acemd3 tasks running on both cards simultaneously: 21443009 16813137 12 Oct 2019 \| 12:46:46 UTC 12 Oct 2019 \| 17:53:01 UTC Completed and validated 17,927.90 17,526.83 75,000.00 New version of ACEMD v2.06 (cuda100) 21443008 16813136 12 Oct 2019 \| 12:46:46 UTC 12 Oct 2019 \| 17:08:35 UTC Completed and validated 7,590.81 7,420.91 75,000.00 New version of ACEMD v2.06 (cuda100) 21443007 16813135 12 Oct 2019 \| 12:46:46 UTC 12 Oct 2019 \| 19:14:35 UTC Completed and validated 7,557.09 7,356.45 75,000.00 New version of ACEMD v2.06 (cuda100) 21443006 16813134 12 Oct 2019 \| 12:46:46 UTC 12 Oct 2019 \| 15:01:54 UTC Completed and validated 7,589.88 7,461.47 75,000.00 New version of ACEMD v2.06 (cuda100) Acemd3 tasks running side by side with non acmed3 tasks: 21442682 16812889 12 Oct 2019 \| 2:13:51 UTC 12 Oct 2019 \| 4:09:23 UTC Completed and validated 6,604.02 6,413.02 75,000.00 New version of ACEMD v2.08 (cuda101) 21430016 16803567 8 Oct 2019 \| 11:27:46 UTC 8 Oct 2019 \| 13:22:21 UTC Completed and validated 6,498.40 6,246.75 75,000.00 New version of ACEMD v2.08 (cuda101) 21429800 16803495 8 Oct 2019 \| 1:41:48 UTC 8 Oct 2019 \| 3:40:53 UTC Completed and validated 6,519.76 6,327.67 75,000.00 New version of ACEMD v2.08 (cuda101) 21429286 16803018 7 Oct 2019 \| 1:23:48 UTC 7 Oct 2019 \| 11:50:24 UTC Completed and validated 16,540.09 16,367.14 75,000.00 New version of ACEMD v2.06 (cuda100) Either Einstein, Milkyway or Long Runs are running on the other card. 
I also noticed an issue with the scheduler not asking for GPU tasks when I had all of these lines in my cc_config.xml file:
<exclude_gpu>
   <url>http://www.gpugrid.net/</url>
   <device_num>0</device_num>
   <app>acemdlong</app>
</exclude_gpu>
<exclude_gpu>
   <url>http://www.gpugrid.net/</url>
   <device_num>0</device_num>
   <app>acemdshort</app>
</exclude_gpu>
<exclude_gpu>
   <url>http://www.gpugrid.net/</url>
   <device_num>1</device_num>
   <app>acemd3</app>
</exclude_gpu>
What I am telling BOINC is to not run long and short tasks on the Turing card and to not run acemd3 tasks on the Maxwell card. The logic works, but the scheduler doesn't ask for GPU tasks, no matter what I set the cache size to, and I have fewer than 2 tasks per card downloaded. (I downloaded the tasks before I ran this test.) But if I delete this block from the file and, of course, save it:
<exclude_gpu>
   <url>http://www.gpugrid.net/</url>
   <device_num>1</device_num>
   <app>acemd3</app>
</exclude_gpu>
Everything works fine. The scheduler asks for GPU tasks. Is this a boinc problem or a GPUGRID problem?
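For anyone who wants to double-check what a set of exclusions like the one above actually says, here is a minimal Python sketch that parses the entries and lists them per device. It assumes the usual cc_config.xml layout with the exclude_gpu blocks inside an options section (the snippet in the post does not show the wrapper elements), and the helper name list_exclusions is hypothetical. It only reads the XML; it does not reproduce how the BOINC client or the scheduler interprets it.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the <cc_config>/<options> wrappers are the usual layout;
# the post only shows the <exclude_gpu> blocks themselves.
CC_CONFIG = """<cc_config>
  <options>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>0</device_num>
      <app>acemdlong</app>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>0</device_num>
      <app>acemdshort</app>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>1</device_num>
      <app>acemd3</app>
    </exclude_gpu>
  </options>
</cc_config>"""


def list_exclusions(xml_text):
    """Return (url, device_num, app) for every <exclude_gpu> entry.

    device_num and app come back as None when the element is absent.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.iter("exclude_gpu"):
        device = node.findtext("device_num")
        entries.append((
            node.findtext("url"),
            int(device) if device is not None else None,
            node.findtext("app"),
        ))
    return entries


for url, device, app in list_exclusions(CC_CONFIG):
    print(f"device {device}: app {app} excluded for {url}")
```

Listing the entries this way makes it easy to see that every GPUGRID app named in the file is excluded from exactly one device, which matches the intent described above.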
	ID: 52842 \| Rating: 0 \| rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 468 Credit: 8,498,022,716 RAC: 11,208,545 Level Scientific publications	Message 52843 - Posted: 13 Oct 2019 \| 19:44:25 UTC - in response to Message 52842.
Here is an observation that contradicts my previous observation:
Task  Work unit  Sent  Reported  Status  Run time (s)  CPU time (s)  Credit  Application
21444555  16814402  13 Oct 2019 \| 14:27:59 UTC  13 Oct 2019 \| 17:43:04 UTC  Completed and validated  7,623.39  7,471.55  75,000.00  New version of ACEMD v2.06 (cuda100)
21444501  16814357  13 Oct 2019 \| 12:40:33 UTC  13 Oct 2019 \| 15:35:54 UTC  Completed and validated  7,577.33  7,431.94  75,000.00  New version of ACEMD v2.06 (cuda100)
These two tasks ran on the Turing card while the Maxwell was running a long task, and they still ran slower. So what is causing this? The tasks being v2.06, the tasks themselves, or something else? I don't know! Ok, no more theories, at least for a while.
	ID: 52843 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1289 Credit: 5,227,606,959 RAC: 10,542,720 Level Scientific publications	Message 52845 - Posted: 14 Oct 2019 \| 0:35:16 UTC - in response to Message 52842.
	Everything works fine. The scheduler asks for GPU tasks. Is this a boinc problem or a GPUGRID problem?
Both. You are running a client that does not handle excludes well. The latest development version does much better. Also, the project runs very old server software that should be updated but has not been.
	ID: 52845 \| Rating: 0 \| rate: / Reply Quote