Message boards : News : Acemd3 restart on windows possibly fixed

Author Message
Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 819
Credit: 4,294,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 52795 - Posted: 7 Oct 2019 | 13:38:53 UTC

The new acemd3 app should fix the issue.

Thanks for all the reporting!

Note that one still can't restart between different types of cards.

Aurum
Joined: 12 Jul 17
Posts: 110
Credit: 7,368,016,843
RAC: 4,045,081
Level
Tyr
Scientific publications
wat
Message 52798 - Posted: 7 Oct 2019 | 15:13:56 UTC - in response to Message 52795.

That's great news, Toni. I hope you'll send out a BOINC notice when Linux is back in production so those of us on walkabout know to return.
____________

Keith Myers
Joined: 13 Dec 17
Posts: 288
Credit: 237,915,213
RAC: 128,101
Level
Leu
Scientific publications
wat
Message 52799 - Posted: 7 Oct 2019 | 16:58:45 UTC - in response to Message 52795.

Just change "Switch between tasks every" in your computing preferences to something like 360 minutes, and the task should start and finish on the same card, avoiding the issue of restarting on a dissimilar card. If all your cards are the same brand and type, or maybe just the same type, you can restart on a different card and finish with no errors.
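For anyone who would rather set this locally than on the website, here is a minimal Python sketch that builds an override file with the same setting. It assumes the client reads a global_prefs_override.xml file from its data directory and that the tag for this preference is cpu_scheduling_period_minutes; the function name build_override is made up for the example, so check your client's documentation before relying on it.

```python
import xml.etree.ElementTree as ET


def build_override(minutes: float) -> str:
    """Return the text of a global_prefs_override.xml that sets the
    "Switch between tasks every" interval (tag name is an assumption)."""
    root = ET.Element("global_preferences")
    period = ET.SubElement(root, "cpu_scheduling_period_minutes")
    period.text = f"{minutes:.6f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 360 minutes, as suggested above; write the result to
    # global_prefs_override.xml in the BOINC data directory.
    print(build_override(360))
```

After saving the file, the client has to re-read its preferences (for example with boinccmd --read_global_prefs_override, or by restarting it) before the new interval applies.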

zombie67 [MM]
Joined: 16 Jul 07
Posts: 174
Credit: 289,449,460
RAC: 441,346
Level
Asn
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52800 - Posted: 7 Oct 2019 | 17:54:15 UTC

Only two of the three Windows apps were updated: cuda92 and cuda101. Why not the cuda100 app too?
____________
Reno, NV
Team: SETI.USA

Bedrich Hajek
Joined: 28 Mar 09
Posts: 381
Credit: 4,777,137,589
RAC: 1,060,942
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52802 - Posted: 8 Oct 2019 | 2:24:48 UTC

I managed to complete a task successfully after suspending and restarting it on the Windows 7 computer with an rtx 2080ti card. When I suspended the task, the wrapper and acemd3 disappeared from Task Manager, and then reappeared when the task restarted:

http://www.gpugrid.net/result.php?resultid=21429745


On the Windows 10 computer, I ran a task on the gtx 980ti card and suspended and resumed it successfully. I suspended it again and rebooted the computer, and it restarted successfully. I then received another task, which ran successfully on the rtx 2080ti card, side by side with the first task. Finally, I suspended both tasks and restarted the task that had been running on the 980ti on the 2080ti instead, and it crashed.


http://www.gpugrid.net/result.php?resultid=21429733



I restarted the other task, which had started on the 2080ti; it is running well on the 2080ti right now:



http://www.gpugrid.net/result.php?resultid=21429800


You can't start a task on one card and restart it successfully on another. I haven't yet tried rebooting the computer without first suspending the task.






Nick Name
Joined: 3 Sep 13
Posts: 23
Credit: 965,342,244
RAC: 1,233,750
Level
Glu
Scientific publications
watwatwatwatwatwatwatwat
Message 52803 - Posted: 8 Oct 2019 | 4:09:39 UTC - in response to Message 52802.

...
On the Windows 10 computer, I ran a task on the gtx 980ti card and suspended and resumed it successfully. I suspended it again and rebooted the computer, and it restarted successfully. I then received another task, which ran successfully on the rtx 2080ti card, side by side with the first task. Finally, I suspended both tasks and restarted the task that had been running on the 980ti on the 2080ti instead, and it crashed.

http://www.gpugrid.net/result.php?resultid=21429733

I have this task now, and it's not loading the GPU at all. It's the second task like that I've had in the last couple days.

http://www.gpugrid.net/result.php?resultid=21429805

The other one failed on a suspend / restart, when I paused the client.

http://www.gpugrid.net/result.php?resultid=21429344

That one did validate on another machine, and it looks like the one I have now is slowly making progress, so I'll let it run at least overnight to give it a chance to complete.
____________
Team USA forum | Team USA page
Always crunching / Always recruiting

rod4x4
Joined: 4 Aug 14
Posts: 94
Credit: 1,594,652,169
RAC: 1,458,261
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 52804 - Posted: 8 Oct 2019 | 5:26:35 UTC
Last modified: 8 Oct 2019 | 5:58:58 UTC

Received a New version of ACEMD v2.08 (cuda101) Work Unit on a Win8.1 (Update 1) host with a GTX750ti GPU.

Work Unit Name: e40s11_e37s6p1f279-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_4-0-2-RND1517_1

- Suspended Work Unit after 18 minutes (2.2% complete)
- Wrapper and ACEMD tasks disappeared from Task Manager.
- Resumed Work Unit, Wrapper and ACEMD tasks reappeared and Work Unit continued to process.

Then rebooted the PC after the Work Unit had been running for 21 minutes (without suspending the WU).
The Work Unit restarted successfully and continues to process.

NOTES
- The Remaining (estimated) time does not seem to change or indicate an accurate run time. (only a small issue)
- Checkpoint seems to be every 90 seconds
- GTX750ti is running at 98% utilization and 94% power according to nvidia-smi. This GPU does not reach these figures on the old tasks.
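
To log figures like these over time rather than reading them off by hand, something along the following lines could work. It is only a sketch: it assumes nvidia-smi is on the PATH and that the card reports power draw and power limit as plain numbers (some cards print a "not available" marker instead, which this does not handle). The function names are made up for the example.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "utilization.gpu,power.draw,power.limit"


def parse_line(line: str) -> dict:
    """Turn one CSV line (utilization %, watts drawn, watt limit) into percentages."""
    util, draw, limit = (float(field) for field in line.split(","))
    return {"util_pct": util, "power_pct": 100.0 * draw / limit}


def read_gpus() -> list:
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_line(line) for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    for index, gpu in enumerate(read_gpus()):
        print(f"GPU {index}: {gpu['util_pct']:.0f}% utilization, "
              f"{gpu['power_pct']:.0f}% of power limit")
```

Here "power" is taken to mean draw as a percentage of the card's power limit, which is an interpretation of the 94% figure above rather than something the post states.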

This Work unit may take another 13 hours to complete at current rate.

Work Unit is here: http://www.gpugrid.net/result.php?resultid=21429814

It has not completed yet, but the results so far are encouraging!

[PUGLIA] kidkidkid3
Joined: 23 Feb 11
Posts: 64
Credit: 700,865,517
RAC: 828,243
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52810 - Posted: 8 Oct 2019 | 20:05:49 UTC - in response to Message 52804.
Last modified: 8 Oct 2019 | 20:06:24 UTC

Hi,

Acemd3 WU in error at the end ... same GPU, no suspend/resume action ...

http://www.gpugrid.net/result.php?resultid=21430068

K.
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)

rod4x4
Joined: 4 Aug 14
Posts: 94
Credit: 1,594,652,169
RAC: 1,458,261
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 52814 - Posted: 8 Oct 2019 | 23:34:27 UTC - in response to Message 52810.
Last modified: 8 Oct 2019 | 23:36:01 UTC

Acemd3 WU in error at the end ... same GPU, no suspend/resume action ...

http://www.gpugrid.net/result.php?resultid=21430068


Your host has returned 2 "New Version ACEMD" work units that have both ended in "upload failure"

Other WU: http://www.gpugrid.net/result.php?resultid=21428934

Failure Message:
<message>
upload failure: <file_xfer_error>
    <file_name>e39s4_e33s7p1f250-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-1-2-RND0503_0_0</file_name>
    <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>

The WUs appear to complete successfully
Exit State: 0

Very curious as all the "old" work units upload fine.
I think the key to the error is in the error_code: stat() failed
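
For comparing several failed results, the interesting fields can be pulled out of the failure message with a couple of regular expressions. This is a rough sketch, not a BOINC tool: parse_upload_failure is a made-up helper, and it assumes the message always has the shape quoted above. Since stat() is the call that looks up a file's metadata, a failure there would suggest the output file was not where the client expected it at upload time, but that is a guess until the developers confirm it.

```python
import re


def parse_upload_failure(message: str):
    """Extract file name, numeric code and reason from an upload-failure message.

    Returns None if the message does not have the expected shape.
    """
    name = re.search(r"<file_name>(.*?)</file_name>", message, re.S)
    code = re.search(
        r"<error_code>\s*(-?\d+)\s*\((.*?)\)\s*</error_code>", message, re.S
    )
    if not (name and code):
        return None
    return {
        "file_name": name.group(1).strip(),
        "code": int(code.group(1)),
        "reason": code.group(2).strip(),
    }


if __name__ == "__main__":
    # The message quoted in this post.
    text = """<message>
upload failure: <file_xfer_error>
<file_name>e39s4_e33s7p1f250-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-1-2-RND0503_0_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>"""
    print(parse_upload_failure(text))
```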

Bedrich Hajek
Joined: 28 Mar 09
Posts: 381
Credit: 4,777,137,589
RAC: 1,060,942
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52815 - Posted: 9 Oct 2019 | 1:34:37 UTC

I rebooted the computer (without suspending the tasks), and these 2 tasks were able to restart and finish successfully afterwards:


http://www.gpugrid.net/result.php?resultid=21429991


http://www.gpugrid.net/result.php?resultid=21430924


There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my Windows 10 machine, which has a Maxwell card and a Turing card. I wonder if there's a connection there?

http://www.gpugrid.net/result.php?resultid=21429972

The error was due to suspend and restart.


I also have an unexplained error:


http://www.gpugrid.net/result.php?resultid=21429886

It was running on the 980ti card. I suspended and restarted it successfully, and it was running fine when I left it. The next morning, I found that it had crashed. The 2080ti was running either Einstein or Milkyway tasks. Every once in a long while, an Einstein gamma-ray pulsar task will cause the NVIDIA driver to crash momentarily, and then the driver restarts. Maybe that, plus the fact that the task was running on a non-Turing card, is the reason for this crash. After the task crashed, it caused Afterburner to crash too, and I had to restart that as well.


rod4x4
Joined: 4 Aug 14
Posts: 94
Credit: 1,594,652,169
RAC: 1,458,261
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 52816 - Posted: 9 Oct 2019 | 4:54:14 UTC - in response to Message 52815.
Last modified: 9 Oct 2019 | 5:53:34 UTC

There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my Windows 10 machine, which has a Maxwell card and a Turing card. I wonder if there's a connection there?

http://www.gpugrid.net/result.php?resultid=21429972

The error was due to suspend and restart.

From what I understand, I think only ACEMD v2.08 survives the suspend/restart.

I also have an unexplained error:
http://www.gpugrid.net/result.php?resultid=21429886

Assuming no issues with the host or other projects, this looks like a new error. (These assumptions would also need to be explored.)
From the Stderr output:
# Engine failed: Error invoking kernel: CUDA_ERROR_UNKNOWN (999)
Error appears after 7 minutes (2 minutes after Task was suspended and resumed).

rod4x4
Joined: 4 Aug 14
Posts: 94
Credit: 1,594,652,169
RAC: 1,458,261
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 52832 - Posted: 10 Oct 2019 | 10:16:47 UTC

Received another New version ACEMD v2.08 work unit on a Win7 Host with GTX750 GPU.

Suspended and resumed the work unit.

Work unit proceeded fine after suspend/resume and completed successfully.

Work unit here:
http://gpugrid.net/result.php?resultid=21433334

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2047
Credit: 14,822,552,469
RAC: 2,352,078
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52833 - Posted: 10 Oct 2019 | 12:58:56 UTC - in response to Message 52795.
Last modified: 10 Oct 2019 | 12:59:14 UTC

The new acemd3 app should fix the issue.

Thanks for all the reporting!

Note that one still can't restart between different types of cards.
When do you plan to release the ACEMD3 client for Linux?
I thought that you would do it after the Windows client was fixed.
I know it still has (at least) one problem (restarting a task on a different card), but the Linux client probably has the same problem.
Alternatively, you could put part of the long workunits into the ACEMD3 queue (it has the new client for both platforms).

STARBASEn
Joined: 17 Feb 09
Posts: 70
Credit: 1,003,056,251
RAC: 40,666
Level
Met
Scientific publications
watwatwatwatwat
Message 52836 - Posted: 11 Oct 2019 | 16:37:07 UTC

Not getting any work yet for any of my Linux machines, and I don't have Windows, so I have been out of luck since May, but I am sure E@H is pleased. I had been getting some of the beta tests, however; just nothing recently, since the restart problems appear to have been resolved on the Windows systems.

STARBASEn
Joined: 17 Feb 09
Posts: 70
Credit: 1,003,056,251
RAC: 40,666
Level
Met
Scientific publications
watwatwatwatwat
Message 52838 - Posted: 12 Oct 2019 | 18:51:04 UTC

Got a new ACEMD 2.06 WU (http://www.gpugrid.net/result.php?resultid=21443611) on one of my Linux machines. It is about 53% finished and was suspended and restarted once without issue. The wingman's WU failed on a Win10 machine with a little less than 2 min completed, reason unknown. Both systems used GTX-1060s. I'm using driver version 430.50 and boinc 7.16.1 (Fedora distro). All is good with Linux so far.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 381
Credit: 4,777,137,589
RAC: 1,060,942
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52842 - Posted: 13 Oct 2019 | 13:36:45 UTC

I noticed that when I am running an acemd3 task on each of the cards (Maxwell and Turing), they run slower than if I am running one acemd3 task on one card and another type of task on the other card.

Acemd3 tasks running on both cards simultaneously:

Task      Work unit  Sent                        Time reported               Status                   Run time (sec)  CPU time (sec)     Credit  Application
21443009  16813137   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 17:53:01 UTC  Completed and validated       17,927.90       17,526.83  75,000.00  New version of ACEMD v2.06 (cuda100)
21443008  16813136   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 17:08:35 UTC  Completed and validated        7,590.81        7,420.91  75,000.00  New version of ACEMD v2.06 (cuda100)
21443007  16813135   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 19:14:35 UTC  Completed and validated        7,557.09        7,356.45  75,000.00  New version of ACEMD v2.06 (cuda100)
21443006  16813134   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 15:01:54 UTC  Completed and validated        7,589.88        7,461.47  75,000.00  New version of ACEMD v2.06 (cuda100)


Acemd3 tasks running side by side with non-acemd3 tasks:

Task      Work unit  Sent                        Time reported               Status                   Run time (sec)  CPU time (sec)     Credit  Application
21442682  16812889   12 Oct 2019 | 2:13:51 UTC   12 Oct 2019 | 4:09:23 UTC   Completed and validated        6,604.02        6,413.02  75,000.00  New version of ACEMD v2.08 (cuda101)
21430016  16803567   8 Oct 2019 | 11:27:46 UTC   8 Oct 2019 | 13:22:21 UTC   Completed and validated        6,498.40        6,246.75  75,000.00  New version of ACEMD v2.08 (cuda101)
21429800  16803495   8 Oct 2019 | 1:41:48 UTC    8 Oct 2019 | 3:40:53 UTC    Completed and validated        6,519.76        6,327.67  75,000.00  New version of ACEMD v2.08 (cuda101)
21429286  16803018   7 Oct 2019 | 1:23:48 UTC    7 Oct 2019 | 11:50:24 UTC   Completed and validated       16,540.09       16,367.14  75,000.00  New version of ACEMD v2.06 (cuda100)


Either Einstein, Milkyway or Long Runs are running on the other card.
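
To put a rough number on the difference, the run times from the two lists can be averaged and compared. The sketch below is only illustrative: it assumes the results of roughly 7,500 s and 6,500 s all came from the faster card and leaves out the two results of roughly 17,000 s, which is a guess based on the run times rather than something recorded in the lists. The variable and function names are made up for the example.

```python
from statistics import mean

# Run times in seconds, copied from the two lists above (faster card only,
# under the assumption described in the lead-in).
both_cards = [7590.81, 7557.09, 7589.88]   # acemd3 running on both cards
mixed = [6604.02, 6498.40, 6519.76]        # acemd3 next to a non-acemd3 task


def slowdown(slow, fast):
    """Ratio of the mean run times; 1.0 would mean no difference."""
    return mean(slow) / mean(fast)


ratio = slowdown(both_cards, mixed)
print(f"mean with acemd3 on both cards: {mean(both_cards):.2f} s")
print(f"mean next to other projects:    {mean(mixed):.2f} s")
print(f"slowdown factor:                {ratio:.3f}")
```

With these numbers the tasks take about 16% longer when both cards run acemd3. Note that the two groups also differ in application version (v2.06 cuda100 versus v2.08 cuda101), so this comparison alone cannot separate the two effects.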

I also noticed an issue with the scheduler not asking for GPU tasks when I had all these lines in my cc_config.xml file:


<exclude_gpu>
    <url>http://www.gpugrid.net/</url>
    <device_num>0</device_num>
    <app>acemdlong</app>
</exclude_gpu>

<exclude_gpu>
    <url>http://www.gpugrid.net/</url>
    <device_num>0</device_num>
    <app>acemdshort</app>
</exclude_gpu>

<exclude_gpu>
    <url>http://www.gpugrid.net/</url>
    <device_num>1</device_num>
    <app>acemd3</app>
</exclude_gpu>

What I am telling BOINC is not to run long and short tasks on the Turing card and not to run acemd3 tasks on the Maxwell card. The logic works, but again the scheduler doesn't ask for GPU tasks, no matter what I set the cache number to, even though I have fewer than 2 tasks per card downloaded. (I downloaded the tasks before I ran this test.)

But if I delete this from the file and of course save it:

<exclude_gpu>
    <url>http://www.gpugrid.net/</url>
    <device_num>1</device_num>
    <app>acemd3</app>
</exclude_gpu>

Everything works fine. The scheduler asks for GPU tasks.
Is this a boinc problem or a GPUGRID problem?
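
One way to double-check that the three excludes still leave every app a usable card is to read them back programmatically. This is a minimal Python sketch, not anything from BOINC itself: the function names are made up for illustration, it assumes the blocks sit inside the usual <cc_config><options> wrapper, and it ignores <url> and <type>. As far as I understand the client's documented behaviour, an exclude without <device_num> applies to all devices and one without <app> applies to all apps.

```python
import xml.etree.ElementTree as ET


def excluded_pairs(cc_config_xml: str) -> set:
    """Collect (device_num, app) pairs from every <exclude_gpu> element.

    A missing <device_num> or <app> is stored as None, meaning "all".
    """
    root = ET.fromstring(cc_config_xml)
    pairs = set()
    for exclude in root.iter("exclude_gpu"):
        dev = exclude.findtext("device_num")
        app = exclude.findtext("app")
        pairs.add((int(dev) if dev is not None else None, app))
    return pairs


def is_excluded(pairs: set, device: int, app: str) -> bool:
    """True if any exclude covers this device/app combination."""
    return any(
        (d is None or d == device) and (a is None or a == app)
        for d, a in pairs
    )


def usable_devices(pairs: set, app: str, devices: list) -> list:
    """Devices on which the given app is still allowed to run."""
    return [dev for dev in devices if not is_excluded(pairs, dev, app)]


if __name__ == "__main__":
    config = """<cc_config><options>
      <exclude_gpu><url>http://www.gpugrid.net/</url>
        <device_num>0</device_num><app>acemdlong</app></exclude_gpu>
      <exclude_gpu><url>http://www.gpugrid.net/</url>
        <device_num>0</device_num><app>acemdshort</app></exclude_gpu>
      <exclude_gpu><url>http://www.gpugrid.net/</url>
        <device_num>1</device_num><app>acemd3</app></exclude_gpu>
    </options></cc_config>"""
    pairs = excluded_pairs(config)
    for app in ("acemdlong", "acemdshort", "acemd3"):
        print(app, "can run on devices", usable_devices(pairs, app, [0, 1]))
```

With the configuration from this post, each app is left with exactly one device, so the excludes themselves are consistent; that points at the work-fetch side rather than at a mistake in the file.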



Bedrich Hajek
Joined: 28 Mar 09
Posts: 381
Credit: 4,777,137,589
RAC: 1,060,942
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 52843 - Posted: 13 Oct 2019 | 19:44:25 UTC - in response to Message 52842.

I noticed that when I am running an acemd3 task on each of the cards (Maxwell and Turing), they run slower than if I am running one acemd3 task on one card and another type of task on the other card.

Acemd3 tasks running on both cards simultaneously:

Task      Work unit  Sent                        Time reported               Status                   Run time (sec)  CPU time (sec)     Credit  Application
21443009  16813137   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 17:53:01 UTC  Completed and validated       17,927.90       17,526.83  75,000.00  New version of ACEMD v2.06 (cuda100)
21443008  16813136   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 17:08:35 UTC  Completed and validated        7,590.81        7,420.91  75,000.00  New version of ACEMD v2.06 (cuda100)
21443007  16813135   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 19:14:35 UTC  Completed and validated        7,557.09        7,356.45  75,000.00  New version of ACEMD v2.06 (cuda100)
21443006  16813134   12 Oct 2019 | 12:46:46 UTC  12 Oct 2019 | 15:01:54 UTC  Completed and validated        7,589.88        7,461.47  75,000.00  New version of ACEMD v2.06 (cuda100)


Acemd3 tasks running side by side with non-acemd3 tasks:

Task      Work unit  Sent                        Time reported               Status                   Run time (sec)  CPU time (sec)     Credit  Application
21442682  16812889   12 Oct 2019 | 2:13:51 UTC   12 Oct 2019 | 4:09:23 UTC   Completed and validated        6,604.02        6,413.02  75,000.00  New version of ACEMD v2.08 (cuda101)
21430016  16803567   8 Oct 2019 | 11:27:46 UTC   8 Oct 2019 | 13:22:21 UTC   Completed and validated        6,498.40        6,246.75  75,000.00  New version of ACEMD v2.08 (cuda101)
21429800  16803495   8 Oct 2019 | 1:41:48 UTC    8 Oct 2019 | 3:40:53 UTC    Completed and validated        6,519.76        6,327.67  75,000.00  New version of ACEMD v2.08 (cuda101)
21429286  16803018   7 Oct 2019 | 1:23:48 UTC    7 Oct 2019 | 11:50:24 UTC   Completed and validated       16,540.09       16,367.14  75,000.00  New version of ACEMD v2.06 (cuda100)


Either Einstein, Milkyway or Long Runs are running on the other card.







Here is an observation that contradicts my previous observation:


Task      Work unit  Sent                        Time reported               Status                   Run time (sec)  CPU time (sec)     Credit  Application
21444555  16814402   13 Oct 2019 | 14:27:59 UTC  13 Oct 2019 | 17:43:04 UTC  Completed and validated        7,623.39        7,471.55  75,000.00  New version of ACEMD v2.06 (cuda100)
21444501  16814357   13 Oct 2019 | 12:40:33 UTC  13 Oct 2019 | 15:35:54 UTC  Completed and validated        7,577.33        7,431.94  75,000.00  New version of ACEMD v2.06 (cuda100)

These two tasks ran on the Turing card while the Maxwell was running a long task, and they still ran slower. So what is causing this? The tasks being v2.06, the tasks themselves, or something else? I don't know! OK, no more theories, at least for a while.




Keith Myers
Joined: 13 Dec 17
Posts: 288
Credit: 237,915,213
RAC: 128,101
Level
Leu
Scientific publications
wat
Message 52845 - Posted: 14 Oct 2019 | 0:35:16 UTC - in response to Message 52842.

Everything works fine. The scheduler asks for GPU tasks.
Is this a boinc problem or a GPUGRID problem?

Both. You are running a client that does not handle excludes well. The latest development version does much better.

Also the project runs very old server software that should be updated but has not been.
