More Acemd3 tests

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 52582 - Posted: 6 Sep 2019 \| 12:23:10 UTC
	We've uploaded Windows and Linux apps named "acemd3". If things go as expected, they should be the new simulation engine. They should be an improvement in many respects, especially maintainability and compatibility with RTX cards. There were a few short test workunits (TONI_TEST). Larger ones should come soon. Please be patient as we iron out the details.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 52583 - Posted: 6 Sep 2019 \| 13:05:32 UTC - in response to Message 52582.
	By the way, things we'd need a comment on:
	1. Do PCs with multiple GPUs work as expected?
	2. Does suspend/restart work as expected?

eXaPower Send message Joined: 25 Sep 13 Posts: 293 Credit: 1,897,601,978 RAC: 0 Level Scientific publications	Message 52584 - Posted: 6 Sep 2019 \| 13:21:41 UTC - in response to Message 52582.
	"By the way: things we'd need a comment on: 1. do PCs with multiple GPUs work as expected? 2. does suspend/restart work as expected?"
	1. No. The app does not allow 2/3/4/5 GPUs to run concurrently: only one GPU runs at a time, while the other Turing cards error out.
	http://www.gpugrid.net/results.php?hostid=208061
	http://www.gpugrid.net/workunit.php?wuid=16748681
	<core_client_version>7.14.2</core_client_version>
	<![CDATA[
	<message>
	(unknown error) - exit code 195 (0xc3)</message>
	<stderr_txt>
	08:39:48 (1632): wrapper (7.9.26016): starting
	08:39:48 (1632): wrapper: running acemd3.exe (--boinc input --device 1)
	# Engine failed: Illegal value for DeviceIndex: 1
	08:39:49 (1632): acemd3.exe exited; CPU time 0.000000
	08:39:49 (1632): app exit status: 0x1
	08:39:49 (1632): called boinc_finish(195)
	2. Yes and no. Suspend/restart worked, but the WU errored out once it restarted.
	http://www.gpugrid.net/result.php?resultid=21350515
	<core_client_version>7.14.2</core_client_version>
	<![CDATA[
	<message>
	(unknown error) - exit code 195 (0xc3)</message>
	<stderr_txt>
	08:55:41 (4032): wrapper (7.9.26016): starting
	08:55:41 (4032): wrapper: running acemd3.exe (--boinc input --device 0)
	Detected memory leaks!
	Dumping objects ->
	..\api\boinc_api.cpp(309) : {1845} normal block at 0x0000005BE25C15C0, 8 bytes long.
	Data: < M [ > 00 00 4D E2 5B 00 00 00
	..\lib\diagnostics_win.cpp(417) : {203} normal block at 0x0000005BE25C43B0, 1080 bytes long.
	Data: < > 04 0C 00 00 CD CD CD CD EC 00 00 00 00 00 00 00
	Object dump complete.
	09:09:55 (3728): wrapper (7.9.26016): starting
	09:09:55 (3728): wrapper: running acemd3.exe (--boinc input --device 0)
	# Engine failed: The periodic box size has decreased to less than twice the nonbonded cutoff.
	09:09:58 (3728): acemd3.exe exited; CPU time 0.000000
	09:09:58 (3728): app exit status: 0x1
	09:09:58 (3728): called boinc_finish(195)
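For anyone comparing many failed results, the two stderr excerpts in this post could be triaged with a small script. This is only a minimal sketch: the function name `summarize_stderr` is hypothetical, the patterns are taken solely from the log lines quoted here, and it assumes each log entry sits on its own line as it does on the result page.

```python
import re

# Patterns based only on the stderr excerpts quoted in this post (assumption:
# other acemd3 failures use the same "# Engine failed:" prefix).
ENGINE_FAILED = re.compile(r"# Engine failed: (.+)")
DEVICE_ARG = re.compile(r"--device (\d+)")


def summarize_stderr(stderr_text):
    """Return a list of (device, reason) pairs, one per engine failure."""
    failures = []
    device = None  # last device index seen on a wrapper command line
    for line in stderr_text.splitlines():
        match = DEVICE_ARG.search(line)
        if match:
            device = int(match.group(1))
        match = ENGINE_FAILED.search(line)
        if match:
            failures.append((device, match.group(1).strip()))
    return failures


sample = """\
08:39:48 (1632): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Illegal value for DeviceIndex: 1
08:39:49 (1632): app exit status: 0x1
"""
print(summarize_stderr(sample))
```

Grouping the pairs by reason would separate the multi-GPU failure ("Illegal value for DeviceIndex") from the restart failure ("periodic box size"), which look like two different problems.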

Frank [NT] Send message Joined: 30 Sep 17 Posts: 2 Credit: 105,703,955 RAC: 1,370 Level Scientific publications	Message 52585 - Posted: 6 Sep 2019 \| 13:32:14 UTC - in response to Message 52583.
	Hi Toni, I got two of them. The first was at 32% when I suspended it, and the second started. I resumed the first WU, and when I suspended the second WU so the first could continue, it exited with an error. Then the second (still at 0.0%) started by itself. At 37% I suspended and restarted it, and it also exited with an error. You can find my WUs here (GTX 1660 Ti and Windows 10). I hope it helps you to improve the app.

PappaLitto Send message Joined: 21 Mar 16 Posts: 511 Credit: 4,672,242,755 RAC: 319 Level Scientific publications	Message 52586 - Posted: 6 Sep 2019 \| 15:10:05 UTC
	Hey Toni, the one I received errored out on Windows, with only one 2080 Ti and no other cards in the system: http://www.gpugrid.net/result.php?resultid=21344094

STARBASEn Send message Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0 Level Scientific publications	Message 52587 - Posted: 6 Sep 2019 \| 18:04:26 UTC Last modified: 6 Sep 2019 \| 18:05:52 UTC
	http://www.gpugrid.net/workunit.php?wuid=16749264 This WU ran fine concurrently with E@H (2x GTX1060s). I suspended it once with "leave WU in memory" set, and it restarted fine from where it left off. Same with another ACEMD 2.06 WU, but on another machine with only one GTX1060.

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 477 Credit: 9,341,372,716 RAC: 10,981,528 Level Scientific publications	Message 52589 - Posted: 6 Sep 2019 \| 23:25:22 UTC - in response to Message 52583.
	"By the way: things we'd need a comment on: 1. do PCs with multiple GPUs work as expected? 2. does suspend/restart work as expected?"
	I managed to get one of these units on my Windows 7 computer, with one RTX 2080 Ti card. It took nearly a minute from the time it started running for the "elapsed" time to start moving, and about another minute for the "progress" % to start moving. I let it run for about 5 minutes before suspending it (it was about 20% complete). It stopped within a couple of seconds. I waited about 30 seconds before resuming it, and it crashed within a few seconds. During its run time, the GPU usage was low (under 65%), and on all 6 of the CPU cores, usage was jumping up and down from 0 to 100%, according to Afterburner. I have never seen that before. I didn't get a chance to run it on a multiple-GPU computer, but send out more units and I will let you know what happens. See link: http://www.gpugrid.net/result.php?resultid=21352529

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 477 Credit: 9,341,372,716 RAC: 10,981,528 Level Scientific publications	Message 52592 - Posted: 7 Sep 2019 \| 2:46:54 UTC
	I ran 2 of the units on my Windows 10 machine. This machine has a GTX 980 Ti, which was running a long unit, while the RTX 2080 Ti was running the new version of ACEMD v2.06 (cuda100) unit. When I let the test unit run from start to finish without interruption, it finishes successfully, but when I suspend it and then resume it, it crashes within a few seconds. GPU usage on this machine was 80% maximum, compared to 90% usage for the long run, which was running on the 980 Ti. http://www.gpugrid.net/results.php?hostid=263612&offset=0&show_names=1&state=0&appid=32

clemmo Send message Joined: 24 Jun 12 Posts: 2 Credit: 63,396,146 RAC: 0 Level Scientific publications	Message 52596 - Posted: 7 Sep 2019 \| 12:39:31 UTC Last modified: 7 Sep 2019 \| 12:48:20 UTC
	I've also had a test task error out when suspended and then resumed. I currently have one running, and I'll let it go to see if it runs to completion. The workunit seems to be using 1 full CPU core and 92% GPU load. My CPU is an i7-4790 and my GPU is a GTX 1660.

kksplace Send message Joined: 4 Mar 18 Posts: 53 Credit: 1,821,981,749 RAC: 6,387,436 Level Scientific publications	Message 52598 - Posted: 7 Sep 2019 \| 13:32:54 UTC - in response to Message 52583.
	This test WU was suspended twice (once using Suspend, once using Suspend GPU in BOINC Manager) and successfully restarted and completed. http://www.gpugrid.net/result.php?resultid=21354832

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 477 Credit: 9,341,372,716 RAC: 10,981,528 Level Scientific publications	Message 52599 - Posted: 7 Sep 2019 \| 14:19:53 UTC - in response to Message 52598.
	"This test WU was suspended twice (once using Suspend, once using Suspend GPU in BOINC Manager) and successfully restarted and completed. http://www.gpugrid.net/result.php?resultid=21354832"
	You're running Linux with a GTX1080 card, while I am running Windows with an RTX card. This is either an OS problem or a card type problem. To determine which it is, we need to run these WUs on a non-RTX card with Windows and/or an RTX card on Linux.

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 477 Credit: 9,341,372,716 RAC: 10,981,528 Level Scientific publications	Message 52600 - Posted: 7 Sep 2019 \| 14:43:50 UTC - in response to Message 52599.
	"This test WU was suspended twice (once using Suspend, once using Suspend GPU in BOINC Manager) and successfully restarted and completed. http://www.gpugrid.net/result.php?resultid=21354832"
	"You're running Linux with a GTX1080 card, while I am running Windows with an RTX card. This is either an OS problem or a card type problem. To determine which it is, we need to run these WUs on a non-RTX card with Windows and/or an RTX card on Linux."
	It looks like it is a Windows problem. I ran this unit on a GTX 980 Ti on Windows 10. I suspended and resumed it. It crashed a few seconds after resuming. http://www.gpugrid.net/result.php?resultid=21355024

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 37 Credit: 1,066,973,674 RAC: 117,807 Level Scientific publications	Message 52601 - Posted: 7 Sep 2019 \| 15:14:24 UTC
	9/7/2019 9:44:40 AM | GPUGRID | task a70-TONI_TESTDHFR206b-9-30-RND0994_0
	This task was assigned to an i7 Windows 10 machine with an RTX 2080. It processed for about 1 minute, was suspended for 30 seconds, then resumed, and the task immediately failed.

mmonnin Send message Joined: 2 Jul 16 Posts: 333 Credit: 5,006,001,065 RAC: 19,655,720 Level Scientific publications	Message 52602 - Posted: 7 Sep 2019 \| 16:37:08 UTC Last modified: 7 Sep 2019 \| 16:37:22 UTC
	Running OK on a 2x GPU system: https://www.gpugrid.net/results.php?hostid=475308 Results show either --device 0 or --device 1.

clemmo Send message Joined: 24 Jun 12 Posts: 2 Credit: 63,396,146 RAC: 0 Level Scientific publications	Message 52603 - Posted: 8 Sep 2019 \| 0:15:42 UTC
	Just received another test task. Decided to check the suspend/resume. Computation error on resume still. GTX1660

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52604 - Posted: 8 Sep 2019 \| 0:34:18 UTC
	I continue to have no luck getting any of these new test tasks.

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 37 Credit: 1,066,973,674 RAC: 117,807 Level Scientific publications	Message 52605 - Posted: 9 Sep 2019 \| 16:55:58 UTC - in response to Message 52583. Last modified: 9 Sep 2019 \| 16:59:14 UTC
	TONI: The GPUGrid configuration (below) is set specifically to accommodate my i7, Windows 10 with RTX 2080. I momentarily selected both short and long runs ACEMD tasks and two immediately in sequence failed. Do you wish us to continue the pause-then-resume test on the ACEMD3 and other special test tasks for the RTX cards? My three other machines with Windows and GTX 750ti and 1060s set idle as far as GPUGrid is concerned.
	ACEMD short runs (2-3 hours on fastest card): no
	ACEMD long runs (8-12 hours on fastest GPU): no
	ACEMD3: yes
	Quantum Chemistry (CPU): no
	Quantum Chemistry (CPU, beta): no
	Python Runtime: no
	If no work for selected applications is available, accept work from other applications? no

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 0 Level Scientific publications	Message 52606 - Posted: 9 Sep 2019 \| 22:45:03 UTC - in response to Message 52605.
	"The GPUGrid configuration (below) is set specifically to accommodate my i7, Windows 10 with RTX 2080. I momentarily selected both short and long runs ACEMD tasks and two immediately in sequence failed."
	These were downloaded from the "long" queue, which has only the old client, which is not compatible with Turing (RTX + GTX 1660, 1650) cards. For now, you should select only the ACEMD3 queue for Turing cards.
	"My three other machines with Windows and GTX 750ti and 1060s set idle as far as GPUGrid is concerned."
	You should set up two different venues (one with ACEMD3 only for Turing, one with short+long for older cards), and assign your hosts to these venues according to their GPUs.
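The rule behind the two venues can be written down as a small lookup. This is a rough sketch under stated assumptions: the function name `queues_for` is hypothetical, and Turing cards are recognized by name alone (RTX 20xx and GTX 16xx), so a card reported without its "RTX"/"GTX" prefix or any other Turing product would need extra handling.

```python
import re

# Assumption: consumer Turing cards are identified by model name only.
TURING_NAME = re.compile(r"\b(RTX\s*20\d\d|GTX\s*16\d\d)", re.IGNORECASE)


def queues_for(gpu_name):
    """Return the application queues a host with this GPU should enable."""
    if TURING_NAME.search(gpu_name):
        # The old client does not work on Turing, so only the new app is selected.
        return ["ACEMD3"]
    return ["ACEMD short runs", "ACEMD long runs"]


for name in ["RTX 2080", "GTX 1660 Ti", "GTX 1060", "GTX 750ti"]:
    print(name, "->", queues_for(name))
```

Each distinct result corresponds to one venue in the project preferences; hosts are then assigned to the venue that matches their card.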

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52607 - Posted: 10 Sep 2019 \| 0:48:49 UTC
	Since I have been unable to get any of these new acemd3 tasks, is it valid to say that only the Windows hosts are having issues? And that the Linux hosts continue to not have any issues with the new app or tasks? I've only seen one post from a Linux user saying they had no issues. I was hoping to test for myself the new apps and higher rewarding tasks. I had no issues with the previous beta and tasks back in July. So far, no such luck retrieving the new apps and tasks.

klepel Send message Joined: 23 Dec 09 Posts: 189 Credit: 4,407,456,793 RAC: 2,294,847 Level Scientific publications	Message 52608 - Posted: 10 Sep 2019 \| 1:01:14 UTC - in response to Message 52607.
	"And that the Linux hosts continue to not have any issues with the new app or tasks? I've only seen one post from a Linux user saying they had no issues."
	It seems to me that Linux hosts do not have issues with the new app (Acemd3). My three hosts work just fine, if they receive WUs (once a day).
	"Since I have been unable to get any of these new acemd3 tasks, is it valid to say that only the Windows hosts are having issues?"
	Only one of my Windows hosts with a Turing card has received WUs. The first finished successfully. The second one I stopped at the one-minute mark; after restart it crashed within 2 seconds: http://www.gpugrid.net/result.php?resultid=21364023
	From my small sample size, I would think Linux works fine and we might start regular production (Toni?), while Windows does not work yet.

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 37 Credit: 1,066,973,674 RAC: 117,807 Level Scientific publications	Message 52609 - Posted: 10 Sep 2019 \| 2:55:56 UTC - in response to Message 52606.
	"My three other machines with Windows and GTX 750ti and 1060s set idle as far as GPUGrid is concerned."
	"You should set up two different venues (one for ACEMD3 only for Turing, one for short+long for older cards), and assign your hosts to these venues according their GPUs."
	Thank you for the instruction, but unfortunately I do not know how to accomplish the task you outline. When I access my GpuGrid account and select Preferences and subsequently GpuGrid Preferences, whatever I select as applications to run has always applied to all four of the computers I have attached to GpuGrid. If I change any preference, such as selecting only ACEMD3, then obviously only tasks designed for my Turing card will be downloaded to its computer. But if I additionally select ACEMD both Long and Short, then those will be downloaded not only to the three non-Turing computers but also to the Turing 2080, where immediate failure will occur. Your recommendation seems to be the perfect solution, and I am frustrated that I do not know how to accomplish the task. Most appreciative!

klepel Send message Joined: 23 Dec 09 Posts: 189 Credit: 4,407,456,793 RAC: 2,294,847 Level Scientific publications	Message 52610 - Posted: 10 Sep 2019 \| 4:30:08 UTC - in response to Message 52609.
	"When I access my GpuGrid account and select Preferences and subsequently GpuGrid Preferences, whatever I select as to applications to run has always applied to all of the four computers I have attached to GpuGrid."
	You are looking in the right direction: under "GPUGRID Preferences" you can set the preferences for 4 different locations: Default, Home, School, Work. After that you have to assign a location to each host: under "Computers on this account", select Details for the computer. The Location setting is at the bottom of that page. So you are able to assign one location for your Turing card and another for the other cards. Hope this helps!

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 37 Credit: 1,066,973,674 RAC: 117,807 Level Scientific publications	Message 52621 - Posted: 12 Sep 2019 \| 15:43:11 UTC - in response to Message 52583.
	"By the way: things we'd need a comment on: 2. does suspend/restart work as expected?"
	Toni (or others): do you still want the suspend/restart test to be applied? An interesting observation: yesterday, before I sorted out my GPUGrid preferences, the machine with my RTX 2080 downloaded two non-new ACEMD tasks. One of the "longer running" tasks processed for over 2 hours and 40 minutes on the Turing card before failing, but the second task apparently failed immediately.

Steve Jones Send message Joined: 28 Oct 18 Posts: 3 Credit: 70,648,040 RAC: 0 Level Scientific publications	Message 52631 - Posted: 14 Sep 2019 \| 11:34:02 UTC Last modified: 14 Sep 2019 \| 11:56:08 UTC
	http://gpugrid.net/result.php?resultid=21378124 Linux machine with a GeForce GTX 1050 Ti (ran GPUGrid task) and GeForce GTX 660 (running another project). Survived two suspends, including a machine reboot, and finished happily. Seems that others weren't so lucky with the same work unit though.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 52637 - Posted: 18 Sep 2019 \| 10:51:15 UTC - in response to Message 52631.
	Dears, thanks for the reports and patience. A small update thanks to your testing: we found a problem with the WINDOWS Cuda 10.1 app - it's slower than it should be. Possibly related: restart fails most of the time. We are working on it.

Aurum Send message Joined: 12 Jul 17 Posts: 400 Credit: 13,844,764,882 RAC: 31,380,351 Level Scientific publications	Message 52646 - Posted: 18 Sep 2019 \| 17:46:07 UTC Last modified: 18 Sep 2019 \| 17:46:28 UTC
	I lucked out and checked in just when a new tranche of acemd3 WUs popped up. My 2080 Ti caught two sets of two WUs and they ran fine. My 2080 Ti has no problem running 4 einstein WUs but I've yet to get 4 WUs at the same time to test this for acemd3. Is there a limitation set on the server side???

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52647 - Posted: 18 Sep 2019 \| 18:05:02 UTC
	Sigh . . . . still have never caught a single one of the new tasks or applications.

Aurum Send message Joined: 12 Jul 17 Posts: 400 Credit: 13,844,764,882 RAC: 31,380,351 Level Scientific publications	Message 52648 - Posted: 18 Sep 2019 \| 18:08:30 UTC - in response to Message 52647.
	Keith, it took me a while before I caught my first. Are you sure you have acemd3 checked in preferences and short & long unchecked???

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 52649 - Posted: 18 Sep 2019 \| 18:11:37 UTC - in response to Message 52648.
	I could send more. Unfortunately they also go to linux hosts (which we don't need to test). Please follow up in the "server" forum.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52651 - Posted: 18 Sep 2019 \| 18:27:55 UTC - in response to Message 52648.
	"Keith it took me a while before I caught my first. Are you sure you have acemd3 checked in preferences and short & long unchecked???"
	Yes. I still have the new acemd3 app checked from before in July and the acemd2 app unchecked. I see from Toni's comment that he does not want Linux hosts to participate. So I guess I can just forget about the project again.

klepel Send message Joined: 23 Dec 09 Posts: 189 Credit: 4,407,456,793 RAC: 2,294,847 Level Scientific publications	Message 52652 - Posted: 18 Sep 2019 \| 20:08:02 UTC
	I set my three LINUX hosts to "no new work". So, that Keith can pick one LINUX WU up;-) Be patient: It seems to me, that we are nearing to production status with the new app.

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 572 Credit: 7,519,297,024 RAC: 10,544,353 Level Scientific publications	Message 52654 - Posted: 18 Sep 2019 \| 20:31:48 UTC Last modified: 18 Sep 2019 \| 20:32:39 UTC
	"I set my three LINUX hosts to "no new work". So, that Keith can pick one LINUX WU up;-)"
	I've also just configured my Linux systems not to accept ACEMD3 tasks. My XP and W10 systems keep waiting for them...

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52655 - Posted: 18 Sep 2019 \| 22:07:58 UTC - in response to Message 52652.
	"I set my three LINUX hosts to "no new work". So, that Keith can pick one LINUX WU up;-) Be patient: It seems to me, that we are nearing to production status with the new app."
	I've been patient since February. But my patience is wearing thin. I see other Linux users being able to get some of the new work. I just wonder what miracle method they used so I can duplicate it. I hope that Toni can get the Windows app working correctly very soon so he will release enough work for ALL hosts to participate.

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,343,411,800 RAC: 18,455 Level Scientific publications	Message 52667 - Posted: 20 Sep 2019 \| 3:29:11 UTC Last modified: 20 Sep 2019 \| 3:29:54 UTC
	Just realized I successfully processed three of the "new" tasks on a Linux system with one 1660ti and five 1060s. They all completed successfully. I didn't realize they were running, so I failed to do a stop/start to test suspend. I also noticed that the GPU was not identified, so I don't know if they all ran on gpu0 or on any of the other 5 on this mining system, which uses risers for all cards. One of them was faster, so I suspect that was on the 1660ti. http://www.gpugrid.org/results.php?hostid=509037 It would be interesting to compare to other systems that have full 16x bandwidth instead of my 1x.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52668 - Posted: 20 Sep 2019 \| 6:16:35 UTC
	I believe other Linux users have already tested the new acemd3 app for stops, suspends and restarts with no issues. They processed through to completion, even on different cards I believe. That is what the Windows app needs to achieve.

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52672 - Posted: 20 Sep 2019 \| 11:15:19 UTC - in response to Message 52667.
	"One of them was faster, so I suspect that was on the 1660ti. http://www.gpugrid.org/results.php?hostid=509037 It would be interesting to compare to other systems that have full 16x bandwidth instead of my 1x."
	For comparison, another volunteer crunched this task on a Linux host with a GTX1660ti: http://www.gpugrid.net/result.php?resultid=21381326

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,343,411,800 RAC: 18,455 Level Scientific publications	Message 52674 - Posted: 20 Sep 2019 \| 12:33:11 UTC - in response to Message 52672. Last modified: 20 Sep 2019 \| 13:03:41 UTC
	"One of them was faster, so I suspect that was on the 1660ti. http://www.gpugrid.org/results.php?hostid=509037 It would be interesting to compare to other systems that have full 16x bandwidth instead of my 1x."
	"For comparison, another volunteer crunched this task on a Linux host with a GTX1660ti: http://www.gpugrid.net/result.php?resultid=21381326"
	From the above two links plus my GTX 1070 Ti system:
	PCIe  OS     GPU     Seconds  %Performance
	----  -----  ------  -------  ------------
	x16   18.04  1660Ti  1831.0   100
	x1    18.04  1660Ti  2189.79  84
	x16   Win10  1070Ti  2268.68  81
	There is a loss in performance of 16% due to x1, but on the other hand, Windows with a 1070Ti and a full x16 is slightly slower than the 1660Ti hanging on a 1x riser on Ubuntu! Both of my systems have swan_sync enabled and both run CUDA 10.0. Not sure about the other user.
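The %Performance column in this post can be reproduced from the run times. The sketch below assumes the column is the fastest run time divided by each run time, rounded to a whole percent; the function name `relative_performance` is only illustrative.

```python
# Run times in seconds, copied from the table in this post.
runs = [
    ("x16", "18.04", "1660Ti", 1831.0),
    ("x1", "18.04", "1660Ti", 2189.79),
    ("x16", "Win10", "1070Ti", 2268.68),
]


def relative_performance(runs):
    """Percent performance of each run relative to the fastest one."""
    fastest = min(seconds for *_, seconds in runs)
    return [round(100 * fastest / seconds) for *_, seconds in runs]


print(relative_performance(runs))  # [100, 84, 81]
```

The 16% loss quoted for the x1 riser is simply 100 minus the 84 in the second row.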

roryd Send message Joined: 9 Aug 11 Posts: 2 Credit: 57,018,037 RAC: 0 Level Scientific publications	Message 52676 - Posted: 20 Sep 2019 \| 14:08:29 UTC Last modified: 20 Sep 2019 \| 14:12:25 UTC
	Hi all, noob here :-) I have three 2080ti cards (on Windows) and I'm trying to get work units, but Boinc keeps saying there aren't any available since I started yesterday:
	20-Sep-2019 14:31:26 [GPUGRID] Sending scheduler request: Requested by project.
	20-Sep-2019 14:31:26 [GPUGRID] Requesting new tasks for NVIDIA GPU
	20-Sep-2019 14:31:27 [GPUGRID] Scheduler request completed: got 0 new tasks
	20-Sep-2019 14:31:27 [GPUGRID] No tasks sent
	20-Sep-2019 14:31:27 [GPUGRID] No tasks are available for New version of ACEMD
	20-Sep-2019 14:31:27 [GPUGRID] Project has no tasks available
	My project settings are:
	ACEMD short runs (2-3 hours on fastest card): no
	ACEMD long runs (8-12 hours on fastest GPU): no
	ACEMD3: yes
	Quantum Chemistry (CPU): no
	Quantum Chemistry (CPU, beta): no
	Python Runtime: no
	In my projects folder, the only executable is acemd-923-80.exe. Am I missing something here? TIA

Aurum Send message Joined: 12 Jul 17 Posts: 400 Credit: 13,844,764,882 RAC: 31,380,351 Level Scientific publications	Message 52677 - Posted: 20 Sep 2019 \| 14:44:15 UTC - in response to Message 52676.
	roryd, I just go ahead and check the acemd project as well. The test WUs come in small packs so it's catch as catch can. Keep an eye on the Server Status page.

roryd Send message Joined: 9 Aug 11 Posts: 2 Credit: 57,018,037 RAC: 0 Level Scientific publications	Message 52678 - Posted: 20 Sep 2019 \| 14:58:57 UTC - in response to Message 52677.
	"roryd, I just go ahead and check the acemd project as well. The test WUs come in small packs so it's catch as catch can. Keep an eye on the Server Status page."
	Hi Aurum, I tried that yesterday, but they all failed and then I got messages saying:
	19-Sep-2019 16:23:50 [GPUGRID] This computer has finished a daily quota of 4 tasks
	As all the GPUs are RTX, should I enable acemd anyway?

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 572 Credit: 7,519,297,024 RAC: 10,544,353 Level Scientific publications	Message 52679 - Posted: 20 Sep 2019 \| 15:30:36 UTC - in response to Message 52674.
	"There is a loss in performance of 16% due to x1, but on the other hand, Windows with a 1070Ti and a full x16 is slightly slower than the 1660Ti hanging on a 1x riser on Ubuntu! Both of my systems have swan_sync enabled and both run CUDA 10.0. Not sure about the other user."
	As seen in the table at the following link, a GTX1660TI with SWAN_SYNC enabled demands 33% of PCIe x16 bandwidth in my system. https://www.gpugrid.net/forum_thread.php?id=4987&nowrap=true#52633

Aurum Send message Joined: 12 Jul 17 Posts: 400 Credit: 13,844,764,882 RAC: 31,380,351 Level Scientific publications	Message 52680 - Posted: 20 Sep 2019 \| 18:34:19 UTC - in response to Message 52678.
	Sorry, my bad. I'm talking about 1080 Ti's and you're running 2080 Ti's. No, acemd does not work for Turing GPUs. I get confused as my single 2080 Ti is on a Linux computer.

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52684 - Posted: 21 Sep 2019 \| 1:14:58 UTC
	Received a TEST work unit a43-TONI_TESTDHFR207c-23-30-RND4156_0 on a Win10 host with a GTX1060 GPU.
	Applied the following test:
	Let the work unit run for 11 minutes 13 seconds.
	Suspended it for 1 minute 20 seconds (approx).
	Resumed the work unit.
	Results:
	The work unit had a computational error several seconds after resuming.
	Observations:
	The work unit predicted a run time of 36 minutes. This is an improvement on work unit a89-TONI_TESTDHFR206b-23-30-RND6008_0, which had a run time of 66 minutes. The speed issue seems to be improved.
	The ACEMD3 task and Wrapper task disappeared from Task Manager after suspending the task.
	After resumption / failure, the run time reverted to 2 minutes 12 seconds.
	The STDerr output timeline reflects the full run time of 11 minutes, but the Run Time summary only reflects 2 minutes 12 seconds.
	nvidia-smi reported 78% GPU utilization, which is in line with CUDA80 tasks on this host.
	nvidia-smi reported similar power usage to CUDA80 tasks on this host.
	Link to work unit here: http://gpugrid.net/result.php?resultid=21396885

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52686 - Posted: 21 Sep 2019 \| 2:07:53 UTC
	Received another TEST work unit, a6-TONI_TESTDHFR207-2-3-RND1704. Same testing method as in my last post, but this time I allowed the work unit to run 40 minutes 37 seconds before suspending (54% complete). The task failed after resuming 1 minute later. The run time may not have improved as indicated in my last post: after 40 minutes 37 seconds the task was only 54% completed. So Windows 10 tasks still seem to have a speed issue compared to Linux ACEMD3 tasks. Additionally, I did notice the ACEMD3 task and Wrapper task reappear in Task Manager for a few seconds before the task failed. All other observations are consistent with my last post. Failed task here: http://gpugrid.net/result.php?resultid=21397083
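The doubt about the improved run time follows from a simple extrapolation. This sketch assumes progress is linear in time, which the post does not confirm; the helper `projected_total_minutes` is a hypothetical name.

```python
def projected_total_minutes(elapsed_seconds, fraction_done):
    """Linear extrapolation of the total run time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_seconds / fraction_done / 60


elapsed = 40 * 60 + 37  # 40 minutes 37 seconds, from the post
total = projected_total_minutes(elapsed, 0.54)
print(f"projected total: {total:.1f} minutes")  # projected total: 75.2 minutes
```

Under that assumption the task would need roughly 75 minutes, which is above both the 36 minutes predicted for the previous work unit and the 66 minutes of the older one.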

Erich56 Send message Joined: 1 Jan 15 Posts: 1105 Credit: 7,822,620,176 RAC: 1,489,209 Level Scientific publications	Message 52687 - Posted: 21 Sep 2019 \| 5:09:34 UTC - in response to Message 52686.
	"Failed task here: http://gpugrid.net/result.php?resultid=21397083"
	What caught my eye: in line 8 of the stderr it says "Detected memory leaks!" - whatever this means.

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52689 - Posted: 21 Sep 2019 \| 10:24:03 UTC - in response to Message 52687.
	"what caught my eye: in line 8 of the stderr it says "Detected memory leaks!" - whatever this means."
	It is a programming error indicating memory is not allocated or de-allocated correctly. This is the suspend/resume bug they are looking to fix.

ServicEnginIC Send message Joined: 24 Sep 10 Posts: 572 Credit: 7,519,297,024 RAC: 10,544,353 Level Scientific publications	Message 52695 - Posted: 22 Sep 2019 \| 10:00:57 UTC Last modified: 22 Sep 2019 \| 10:02:22 UTC
	My following W10 computer, GTX1050TI graphics card: https://www.gpugrid.net/show_host_detail.php?hostid=105442 Got an ACEMD3 V2.07 test WU: https://www.gpugrid.net/result.php?resultid=21400000 It was processed till the end, no pauses, and then errored out with indication "195 (0xc3) EXIT_CHILD_FAILED". The same WU but V2.06 was processed successfully by a second computer: WU: https://www.gpugrid.net/result.php?resultid=21400083 Computer: https://www.gpugrid.net/show_host_detail.php?hostid=459450 Something is still to be polished in V2.07 application code and/or scheduler... I also miss previously available information in old ACEMD WUs about graphics card model, clocks, temperatures reached... Perhaps it is not possible in new WUs due to Wrapper's philosophy (?)
	ID: 52695 \| Rating: 0 \| rate: / Reply Quote

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 37 Credit: 1,066,973,674 RAC: 117,807 Level Scientific publications	Message 52704 - Posted: 23 Sep 2019 \| 19:49:16 UTC
	e1s20_ubiquitin_50ns_3-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_2-0-2-RND1315. This task had downloaded and failed without my immediate knowledge as I was doing some routine computer updating/restarting. Failure occurred at 52.31 seconds which is of course irrelevant. Machine: I7 Windows10 RTX2080. Do I understand that Toni still wants these failure reports?
	ID: 52704 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,343,411,800 RAC: 18,455 Level Scientific publications	Message 52709 - Posted: 23 Sep 2019 \| 23:58:38 UTC
	I lost a pair of those "new" tasks http://www.gpugrid.org/results.php?hostid=467730&offset=0&show_names=0&state=5&appid= I had to reboot to install a fix for a problem with an app. I did stop the BOINC client before rebooting, but the two "new" tasks did not survive the reboot. Looking at errors, my only other errors on this system were almost a year ago.
	ID: 52709 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52718 - Posted: 25 Sep 2019 \| 5:29:55 UTC
	Just had my first failure on a restarted CUDA100 task that obeyed the set 60 minute run per project setting. Restarted on a different device and failed. Stderr output <core_client_version>7.16.2</core_client_version> <![CDATA[ <message> process exited with code 195 (0xc3, -61)</message> <stderr_txt> 18:14:25 (25157): wrapper (7.7.26016): starting 18:14:25 (25157): wrapper (7.7.26016): starting 18:14:25 (25157): wrapper: running acemd3 (--boinc input --device 0) 21:09:51 (23108): wrapper (7.7.26016): starting 21:09:51 (23108): wrapper (7.7.26016): starting 21:09:51 (23108): wrapper: running acemd3 (--boinc input --device 1) ERROR: /home/user/conda/conda-bld/acemd3_1566914012210/work/src/mdsim/context.cpp line 324: Cannot use a restart file on a different device! 21:09:55 (23108): acemd3 exited; CPU time 3.826201 21:09:55 (23108): app exit status: 0x9e 21:09:55 (23108): called boinc_finish(195) </stderr_txt> ]]> So this has a failure in common with the Windows wrapper app. https://www.gpugrid.net/result.php?resultid=21408545
	ID: 52718 \| Rating: 0 \| rate: / Reply Quote

Toby Broom Send message Joined: 11 Dec 08 Posts: 25 Credit: 360,187,443 RAC: 9 Level Scientific publications	Message 52719 - Posted: 25 Sep 2019 \| 6:20:35 UTC
	I see the same as others on suspend. http://www.gpugrid.net/result.php?resultid=21408995 These WUs don't stress my Titan V much: the load is about 75% and power 100W. My 10xx GPUs show another 10% load and double the power draw.
	ID: 52719 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 0 Level Scientific publications	Message 52720 - Posted: 25 Sep 2019 \| 11:51:49 UTC Last modified: 25 Sep 2019 \| 11:53:35 UTC
	I noticed that the new ACEMD3 Windows app v2.06 does not update the boinc_task_state.xml file in the slot directory. It may be related to the checkpoint + "resuming does not work" issue. BTW I don't know why this host received v2.06 instead of the more recent v2.07. (NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 436.15)
	ID: 52720 \| Rating: 0 \| rate: / Reply Quote

KAMasud Send message Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level Scientific publications	Message 52721 - Posted: 25 Sep 2019 \| 17:49:47 UTC
	I restarted my machine and the WU crashed. e1s29_ubiquitin_50ns_5-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_1-1-2-RND8951_2 Stderr output <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code 195 (0xc3)</message> <stderr_txt> 19:49:47 (10992): wrapper (7.9.26016): starting 19:49:47 (10992): wrapper: running acemd3.exe (--boinc input --device 0) Detected memory leaks! Dumping objects -> ..\api\boinc_api.cpp(309) : {1755} normal block at 0x000001D0701589B0, 8 bytes long. Data: < p > 00 00 11 70 D0 01 00 00 ..\lib\diagnostics_win.cpp(417) : {202} normal block at 0x000001D07015C010, 1080 bytes long. Data: <@ > 40 02 00 00 CD CD CD CD E0 01 00 00 00 00 00 00 Object dump complete. 22:43:23 (11696): wrapper (7.9.26016): starting 22:43:23 (11696): wrapper: running acemd3.exe (--boinc input --device 0) # Engine failed: The periodic box size has decreased to less than twice the nonbonded cutoff. 22:43:27 (11696): acemd3.exe exited; CPU time 0.000000 22:43:27 (11696): app exit status: 0x1 22:43:27 (11696): called boinc_finish(195) 0 bytes in 0 Free Blocks. 506 bytes in 8 Normal Blocks. 1144 bytes in 1 CRT Blocks. 0 bytes in 0 Ignore Blocks. 0 bytes in 0 Client Blocks. Largest number used: 0 bytes. Total allocations: 141102 bytes. Dumping objects -> {1814} normal block at 0x0000016B06965620, 48 bytes long. Data: <ACEMD_PLUGIN_DIR> 41 43 45 4D 44 5F 50 4C 55 47 49 4E 5F 44 49 52 {1803} normal block at 0x0000016B069657E0, 48 bytes long. Data: <HOME=D:\ProgramD> 48 4F 4D 45 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 {1792} normal block at 0x0000016B06965BD0, 48 bytes long. Data: <TMP=D:\ProgramDa> 54 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 61 {1781} normal block at 0x0000016B06965690, 48 bytes long. Data: <TEMP=D:\ProgramD> 54 45 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 {1770} normal block at 0x0000016B06965150, 48 bytes long. 
Data: <TMPDIR=D:\Progra> 54 4D 50 44 49 52 3D 44 3A 5C 50 72 6F 67 72 61 {1759} normal block at 0x0000016B069460C0, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 ..\api\boinc_api.cpp(309) : {1756} normal block at 0x0000016B06966A10, 8 bytes long. Data: < k > 00 00 93 06 6B 01 00 00 {981} normal block at 0x0000016B06944D40, 141 bytes long. Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65 {204} normal block at 0x0000016B069661F0, 8 bytes long. Data: < k > 10 90 96 06 6B 01 00 00 {197} normal block at 0x0000016B06965460, 48 bytes long. Data: <--boinc input --> 2D 2D 62 6F 69 6E 63 20 69 6E 70 75 74 20 2D 2D {196} normal block at 0x0000016B06966BA0, 16 bytes long. Data: < k > A8 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {195} normal block at 0x0000016B069662E0, 16 bytes long. Data: < k > 80 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {194} normal block at 0x0000016B06966BF0, 16 bytes long. Data: <X k > 58 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {193} normal block at 0x0000016B06966010, 16 bytes long. Data: <0 k > 30 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {192} normal block at 0x0000016B06966290, 16 bytes long. Data: < k > 08 C4 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {191} normal block at 0x0000016B06966F60, 16 bytes long. Data: < k > E0 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {190} normal block at 0x0000016B06965540, 48 bytes long. Data: <ComSpec=C:\Windo> 43 6F 6D 53 70 65 63 3D 43 3A 5C 57 69 6E 64 6F {189} normal block at 0x0000016B06966150, 16 bytes long. Data: <@O k > 40 4F 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {188} normal block at 0x0000016B0695A5D0, 32 bytes long. Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69 {187} normal block at 0x0000016B06966420, 16 bytes long. Data: < O k > 18 4F 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {185} normal block at 0x0000016B069666A0, 16 bytes long. 
Data: < N k > F0 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {184} normal block at 0x0000016B06966650, 16 bytes long. Data: < N k > C8 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {183} normal block at 0x0000016B06966C40, 16 bytes long. Data: < N k > A0 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {182} normal block at 0x0000016B06966DD0, 16 bytes long. Data: <xN k > 78 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {181} normal block at 0x0000016B06966510, 16 bytes long. Data: <PN k > 50 4E 96 06 6B 01 00 00 00 00 00 00 00 00 00 00 {180} normal block at 0x0000016B06964E50, 280 bytes long. Data: < e k PQ k > 10 65 96 06 6B 01 00 00 50 51 96 06 6B 01 00 00 {179} normal block at 0x0000016B06966100, 16 bytes long. Data: < k > C0 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {178} normal block at 0x0000016B069661A0, 16 bytes long. Data: < k > 98 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {177} normal block at 0x0000016B06966560, 16 bytes long. Data: <p k > 70 C3 94 06 6B 01 00 00 00 00 00 00 00 00 00 00 {176} normal block at 0x0000016B0694C370, 496 bytes long. Data: <`e k acemd3.e> 60 65 96 06 6B 01 00 00 61 63 65 6D 64 33 2E 65 {65} normal block at 0x0000016B069590E0, 16 bytes long. Data: < > 80 EA 11 A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {64} normal block at 0x0000016B06959950, 16 bytes long. Data: <@ > 40 E9 11 A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {63} normal block at 0x0000016B06959770, 16 bytes long. Data: < W > F8 57 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {62} normal block at 0x0000016B06959040, 16 bytes long. Data: < W > D8 57 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {61} normal block at 0x0000016B06959720, 16 bytes long. Data: <P > 50 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {60} normal block at 0x0000016B06959680, 16 bytes long. Data: <0 > 30 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {59} normal block at 0x0000016B069595E0, 16 bytes long. 
Data: < > E0 02 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {58} normal block at 0x0000016B06958FF0, 16 bytes long. Data: < > 10 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {57} normal block at 0x0000016B06958F00, 16 bytes long. Data: <p > 70 04 0E A0 F6 7F 00 00 00 00 00 00 00 00 00 00 {56} normal block at 0x0000016B06959590, 16 bytes long. Data: < > 18 C0 0C A0 F6 7F 00 00 00 00 00 00 00 00 00 00 Object dump complete. </stderr_txt> ]]>
	ID: 52721 \| Rating: 0 \| rate: / Reply Quote

RFGuy_KCCO Send message Joined: 13 Feb 14 Posts: 6 Credit: 1,056,951,005 RAC: 120 Level Scientific publications	Message 52730 - Posted: 27 Sep 2019 \| 2:02:26 UTC Last modified: 27 Sep 2019 \| 2:02:36 UTC
	No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's).
	ID: 52730 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52731 - Posted: 27 Sep 2019 \| 7:09:48 UTC - in response to Message 52730.
	No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's). Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card.
	ID: 52731 \| Rating: 0 \| rate: / Reply Quote

zombie67 [MM] Send message Joined: 16 Jul 07 Posts: 207 Credit: 2,161,961,456 RAC: 9,510,708 Level Scientific publications	Message 52732 - Posted: 27 Sep 2019 \| 13:31:41 UTC
	FWIW, my single machine with two GPUs will successfully process CUDA 101 tasks, but fail on CUDA 100 tasks. My other three machines with a single GPU will successfully process both CUDA 101 and CUDA 100 tasks. ____________ Reno, NV Team: SETI.USA
	ID: 52732 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52733 - Posted: 27 Sep 2019 \| 19:20:10 UTC - in response to Message 52731.
	No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's). Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card. Here is a workunit that was run for three different periods on two different cards. But they were the same card type and the WU successfully finished. https://www.gpugrid.net/result.php?resultid=21411774 <core_client_version>7.16.2</core_client_version> <![CDATA[ <stderr_txt> 03:44:01 (19192): wrapper (7.7.26016): starting 03:44:01 (19192): wrapper (7.7.26016): starting 03:44:01 (19192): wrapper: running acemd3 (--boinc input --device 0) 14:06:30 (1677): wrapper (7.7.26016): starting 14:06:30 (1677): wrapper (7.7.26016): starting 14:06:30 (1677): wrapper: running acemd3 (--boinc input --device 2) 19:30:32 (12479): wrapper (7.7.26016): starting 19:30:32 (12479): wrapper (7.7.26016): starting 19:30:32 (12479): wrapper: running acemd3 (--boinc input --device 0) 20:16:14 (12479): acemd3 exited; CPU time 2012.925385 20:16:14 (12479): called boinc_finish(0) So the wrapper app can handle being stopped and restarted on different cards AS LONG as they are the same card type. Two examples now of this fact. But when the WU is restarted on a different card type, something about the previous configuration is kept and does not match up with the new configuration. Could be something as simple as card name or maybe CC capabilities.
	ID: 52733 \| Rating: 0 \| rate: / Reply Quote
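
The restart pattern described above can be checked mechanically. What follows is only a rough sketch, assuming the stderr text contains wrapper lines of the form shown in this thread ("wrapper: running acemd3 (--boinc input --device N)"). The function names are made up for illustration and are not part of BOINC or acemd3. Note that it only sees the device index, so it cannot tell whether two indices refer to the same card type.

```python
import re

# Matches wrapper lines such as:
#   "wrapper: running acemd3 (--boinc input --device 0)"
# The executable name may also be "acemd3.exe" on Windows.
DEVICE_RE = re.compile(r"wrapper: running \S+ \([^)]*?--device (\d+)\)")


def device_sequence(stderr_text):
    """Return the device index used at each (re)start, in log order."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


def changed_device(stderr_text):
    """True if the task was restarted on a different device index."""
    return len(set(device_sequence(stderr_text))) > 1


# Wrapper lines taken from the successful three-period run quoted above.
sample = (
    "03:44:01 (19192): wrapper: running acemd3 (--boinc input --device 0) "
    "14:06:30 (1677): wrapper: running acemd3 (--boinc input --device 2) "
    "19:30:32 (12479): wrapper: running acemd3 (--boinc input --device 0)"
)

print(device_sequence(sample))
print(changed_device(sample))
```

A task that reports a device change and still finishes would be consistent with the "same card type" observation; a change followed by "Cannot use a restart file on a different device!" would match the earlier failure.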

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52734 - Posted: 27 Sep 2019 \| 23:40:00 UTC
	I have a request for help from Windows users. Does anyone want to try a development branch of the client that may be able to handle the pause/suspend issues on the acemd3 wrapper apps? I was browsing through the latest commits and came upon PR#3307 which has the tantalizing description of Description of the Change On Windows, CreateProcess() is used to launch tasks, but this on its own does not handle child processes; if the parent task process exits, the workunit will be terminated. If <wait_for_children> is set in the job file, attach the task process to a job object instead, which can then be monitored to determine when all child processes are finished. Alternate Designs Release Notes Add <wait_for_children> option for tasks in job.xml This sounds like it may address some of the error messages I see in stderr.txt when a wrapper app is suspended or paused. And why Toni has asked whether the wrapper app and the child process acemd3 app are still in the Task Manager list. You can download the latest AppVeyor artifact here for the client. https://ci.appveyor.com/api/buildjobs/y4gd2lvbjjwoa54l/artifacts/deploy%2Fwin-client%2Fwin-client_PR3307_2019-09-26_8665946a.7z
	ID: 52734 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,343,411,800 RAC: 18,455 Level Scientific publications	Message 52735 - Posted: 27 Sep 2019 \| 23:57:42 UTC - in response to Message 52732. Last modified: 28 Sep 2019 \| 0:04:05 UTC
	FWIW, my single machine with two GPUs will successfully process CUDA 101 tasks, but fail on CUDA 100 tasks. My other three machines with a single GPU will successfully process both CUDA 101 and CUDA 100 tasks. I looked at the output of both failing and passing tasks on your system with a pair of 1030s. I did not see anything in the output identifying the type of coprocessor. However, that may be due to the BIOS missing code that identifies itself to BOINC or, more likely, the app does not bother to identify the device or report temperatures like the older apps here. From other projects, MW for example, I see that the work units make timing calculations and adjust parameters accordingly, so as to time out tasks that are hung, among other purposes. There are two different GT 1030s. One is significantly slower than the other; otherwise they are identical. The newer versions are crippled. I was wondering if the pair you have together are matched. Just a guess, as that could cause unexpected timing values if the app simply checks the name and does not bother to recalculate parameters.
	ID: 52735 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52736 - Posted: 28 Sep 2019 \| 0:11:09 UTC
	The one host that is getting the dominant amount of new acemd3 work just so happens to have three identical EVGA GTX 1070 Ti Black Edition cards and the tasks can apparently restart and run on any one of them after being switched off by the "switch between projects" standard 60 minute delimiter.
	ID: 52736 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,343,411,800 RAC: 18,455 Level Scientific publications	Message 52737 - Posted: 28 Sep 2019 \| 0:35:49 UTC - in response to Message 52734.
	I have a request You can download the latest AppVeyor artifact here for the client. https://ci.appveyor.com/api/buildjobs/y4gd2lvbjjwoa54l/artifacts/deploy%2Fwin-client%2Fwin-client_PR3307_2019-09-26_8665946a.7z that gave me 7.15.0 is it supposed to be 7.16.2? The systems I have that run gpugrid on windows are matched GPUs.
	ID: 52737 \| Rating: 0 \| rate: / Reply Quote

zombie67 [MM] Send message Joined: 16 Jul 07 Posts: 207 Credit: 2,161,961,456 RAC: 9,510,708 Level Scientific publications	Message 52738 - Posted: 28 Sep 2019 \| 0:38:44 UTC - in response to Message 52735.
	There are two different gt1030's. One is significantly slower than the other else they are identical. The newer versions are crippled. I was wondering if the pair you have together are matched. Just a guess as that could cause unexpected timing values if the apps simply checks the name and does not bother to recalculate parameters. These two 1030s are identical. Same brand and model, bought at the same time. It seems like a clue that only the CUDA 100 tasks fail, and not the CUDA 101. Note, another of my machines has a single, identical 1030 (also purchased at the same time). It does not fail either 101 or 100. Perhaps there is something about CUDA 100 and dual-card machines. Just a guess. ____________ Reno, NV Team: SETI.USA
	ID: 52738 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52739 - Posted: 28 Sep 2019 \| 1:32:03 UTC - in response to Message 52737. Last modified: 28 Sep 2019 \| 1:36:25 UTC
	It is still from the master branch which is the development version 7.15.0. Or at least it still has the versioning number from the master branch. It may have more commits from further upstream too. If the version.h and version.log files aren't updated, the compile will still show whatever the version in those files are set. But it has the commit I referenced in it with a fix for wrapper apps.
	ID: 52739 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1589 Credit: 6,680,094,351 RAC: 9,199,283 Level Scientific publications	Message 52740 - Posted: 28 Sep 2019 \| 8:59:45 UTC - in response to Message 52734.
	I have a request for help from Windows users. Does anyone want to try a development branch of the client that may be able to handle the pause/suspend issues on the acemd3 wrapper apps? I was browsing through the latest commits and came upon PR#3307 which has the tantalizing description of Description of the Change On Windows, CreateProcess() is used to launch tasks, but this on its own does not handle child processes; if the parent task process exits, the workunit will be terminated. If <wait_for_children> is set in the job file, attach the task process to a job object instead, which can then be monitored to determine when all child processes are finished. Alternate Designs Release Notes Add <wait_for_children> option for tasks in job.xml This sounds like it may address some of the error messages I see in stderr.txt when a wrapper app is suspended or paused. And why Toni has asked whether the wrapper app and the child process acemd3 app are still in the Task Manager list. You can download the latest AppVeyor artifact here for the client. https://ci.appveyor.com/api/buildjobs/y4gd2lvbjjwoa54l/artifacts/deploy%2Fwin-client%2Fwin-client_PR3307_2019-09-26_8665946a.7z I'm a little worried by that. The changes in PR #3307 were made in the wrapper app itself (only). You could indeed download the win-apps bundle from appveyor and extract wrapper_26014_windows_x86_64.exe, but it would be hard to deploy if Toni is issuing an earlier version from the server. If the client downloaded from that link has improvements, they'll come from the cumulative set of changes made both before and after the 7.16 branch was split. We urgently need to work out which the beneficial change was, and whether it happened before or after the fork. If it was made later, it needs to be cherrypicked into the new release.
	ID: 52740 \| Rating: 0 \| rate: / Reply Quote

HyperComputing Send message Joined: 15 Sep 19 Posts: 4 Credit: 485,304,520 RAC: 0 Level Scientific publications	Message 52741 - Posted: 28 Sep 2019 \| 10:43:56 UTC
	Hi, I'm new on this forum. Here is what I've got with my 1050ti on linux x64 : current task : ADRIA_FOLDUBQ_BANDIT_ss_contacts_50_ubiquitin_4-0-2 resources : 0.909 CPUs + 1 NVIDIA GPU task size : 5000000 GFLOPs elapsed time : 08:54:01 remaining time : 09:09:51 progress : 13,800 % 14% done after 50% elapsed time ???
	ID: 52741 \| Rating: 0 \| rate: / Reply Quote

JStateson Send message Joined: 31 Oct 08 Posts: 186 Credit: 3,343,411,800 RAC: 18,455 Level Scientific publications	Message 52742 - Posted: 28 Sep 2019 \| 15:07:13 UTC - in response to Message 52741.
	Here is what I've got with my 1050ti on linux x64 : remaining time : 09:09:51 progress : 13,800 % 14% done after 50% elapsed time ??? 7.2.42 is really old (but latest on berkeley download). very likely the client is estimating wrong in addition to mis-identifying the cpu. apt-get under ubuntu 18.04 got me version 7.16.1 boinc
	ID: 52742 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52743 - Posted: 28 Sep 2019 \| 15:30:18 UTC - in response to Message 52740. Last modified: 28 Sep 2019 \| 15:31:40 UTC
	I'm a little worried by that. The changes in PR #3307 were made in the wrapper app itself (only). You could indeed download the win-apps bundle from appveyor and extract wrapper_26014_windows_x86_64.exe, but it would be hard to deploy if Toni is issuing an earlier version from the server. If the client downloaded from that link has improvements, they'll come from the cumulative set of changes made both before and after the 7.16 branch was split. We urgently need to work out which the beneficial change was, and whether it happened before or after the fork. If it was made later, it needs to be cherrypicked into the new release. I never thought about where the wrapper app originated. If issued by the server, it still controls the show if the new one doesn't get put into play. I just thought the description of the fix dovetailed perfectly into what we are seeing with the Windows acemd3 app runs and their inability to be suspended without failing. I was hoping you might see this post and contribute Richard as you know far more about how releases are handled. Are you saying that the wrapper app needs to be updated in the server code? Like in the new 1.20 server release?
	ID: 52743 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1589 Credit: 6,680,094,351 RAC: 9,199,283 Level Scientific publications	Message 52744 - Posted: 28 Sep 2019 \| 15:52:11 UTC - in response to Message 52743.
	Are you saying that the wrapper app needs to be updated in the server code? Like in the new 1.20 server release? Not really either of those. The wrapper is a self-contained application, built from code in the \samples\ folder on Github. I would imagine that most projects who need to use it would compile their own copy from that source. I see from your most recent stderr.txt that your machine is using Toni's "wrapper (7.7.26016)". I'm not sure exactly how the version number is generated: that sounds like a combination of old-ish server source code (7.7) and a possibly auto-incrementing value seeded from the old SVN repository (26016). Given that the Appveyor version I downloaded from your link this morning was 26014, it looks like Toni has possibly been updating his own local copy along the way, and getting ahead of BOINC Central. If so, I hope he pushes back any useful changes to GitHub when he's got it all working. But that's all just guesswork. Only Toni could tell you for certain.
	ID: 52744 \| Rating: 0 \| rate: / Reply Quote

robertmiles Send message Joined: 16 Apr 09 Posts: 503 Credit: 740,445,933 RAC: 15,239 Level Scientific publications	Message 52745 - Posted: 28 Sep 2019 \| 18:36:02 UTC - in response to Message 52741. Last modified: 28 Sep 2019 \| 18:38:29 UTC
	Hi, I'm new on this forum. Here is what I've got with my 1050ti on linux x64 : current task : ADRIA_FOLDUBQ_BANDIT_ss_contacts_50_ubiquitin_4-0-2 resources : 0.909 CPUs + 1 NVIDIA GPU task size : 5000000 GFLOPs elapsed time : 08:54:01 remaining time : 09:09:51 progress : 13,800 % 14% done after 50% elapsed time ??? They've done at least a partial new version lately to handle the newest Nvidia cards. The calculations for estimated remaining time tend to give rather inaccurate values under new versions until at least ten other tasks with the new version have run on the same computer. I'm also seeing rather inaccurate values with my 1080 under Windows 10 x64.
	ID: 52745 \| Rating: 0 \| rate: / Reply Quote
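
For anyone wanting a sanity check on the client's estimate, a simple linear extrapolation from the reported figures is sketched below. It assumes progress grows linearly with elapsed time, which is only an approximation, and the function names are made up for illustration; this is not how the BOINC client computes its own estimate.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def extrapolate_total_hours(elapsed_hms, progress_percent):
    """Estimate total run time in hours, assuming linear progress."""
    if progress_percent <= 0:
        raise ValueError("progress must be positive")
    fraction_done = progress_percent / 100.0
    return to_seconds(elapsed_hms) / fraction_done / 3600.0


# Figures from the post above: 08:54:01 elapsed at 13.8% progress.
elapsed_hours = to_seconds("08:54:01") / 3600.0
total_hours = extrapolate_total_hours("08:54:01", 13.8)
remaining_hours = total_hours - elapsed_hours

print(f"estimated total: {total_hours:.1f} h")
print(f"estimated remaining: {remaining_hours:.1f} h")
```

With these inputs the extrapolated total comes out at roughly 64 hours, far above the roughly 18 hours implied by the client's "remaining time" figure.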

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52746 - Posted: 28 Sep 2019 \| 19:08:32 UTC - in response to Message 52744.
	Given that the Appveyor version I downloaded from your link this morning was 26014, it looks like Toni has possibly been updating his own local copy along the way, and getting ahead of BOINC Central. If so, I hope he pushes back any useful changes to GitHub when he's got it all working. But that's all just guesswork. Only Toni could tell you for certain. Yes, hope Toni reads the thread and finds something useful from PR #3307 to incorporate if he in fact is updating the wrapper app on his own. Thanks for the insight about the versioning.
	ID: 52746 \| Rating: 0 \| rate: / Reply Quote

HyperComputing Send message Joined: 15 Sep 19 Posts: 4 Credit: 485,304,520 RAC: 0 Level Scientific publications	Message 52747 - Posted: 28 Sep 2019 \| 19:26:38 UTC - in response to Message 52745.
	The calculations for estimated remaining time tend to give rather inaccurate values under new versions until at least ten other tasks with the new version have run on the same computer. Thank you. I see what you mean. Now the remaining time is growing by 1 second every 3 seconds. I estimate that this task will really take about 60 hours to finish.
	ID: 52747 \| Rating: 0 \| rate: / Reply Quote

Billy Ewell 1931 Send message Joined: 22 Oct 10 Posts: 37 Credit: 1,066,973,674 RAC: 117,807 Level Scientific publications	Message 52748 - Posted: 29 Sep 2019 \| 15:13:30 UTC
	9/29/2019 9:55:41 AM \| GPUGRID \| Computation for task e16s9_e14s4p0f17-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-0-2-RND4379_1 finished This 2.07 task failed 8.75 seconds after startup following a management activity on my part. I had NOT suspended boinc manager activity before the shutdown. The machine is I7, W10, RTX2080. Again frustrating that at this point the problem has not been solved. Again TONI do you want these individual reports?
	ID: 52748 \| Rating: 0 \| rate: / Reply Quote

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 52750 - Posted: 30 Sep 2019 \| 16:02:07 UTC - in response to Message 52748. Last modified: 30 Sep 2019 \| 16:03:43 UTC
	Dears, sorry for the slow progress but I determined (at least) a restart problem, and it is not related to the wrapper. It is Windows-only, CUDA 10 only, as far as I can tell from your reports, and manifests itself with the "The periodic box size has decreased to less than twice the nonbonded cutoff." message. Unfortunately the root cause is hard to identify (may be external to our code). I have compiled the wrapper myself (the binaries on the boinc page are old and had one important bug in variable substitution), but for now the failures seem unrelated. It's a bit frustrating because everything else seems to work nicely.
	ID: 52750 \| Rating: 0 \| rate: / Reply Quote
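
To help sort the failure reports in this thread, here is a small illustrative sketch that tags a stderr dump by the error messages quoted so far. The message strings are copied from posts in this thread; the labels and the function name are invented for illustration and do not come from the project.

```python
# Message substrings as they appear in stderr outputs posted in this thread.
# The labels on the right are informal and hypothetical.
KNOWN_FAILURES = {
    "Cannot use a restart file on a different device":
        "restart on a different device",
    "The periodic box size has decreased to less than twice the nonbonded cutoff":
        "restart failure (periodic box message)",
    "Illegal value for DeviceIndex":
        "invalid device index",
}


def classify(stderr_text):
    """Return the labels of all known failure messages found in the text."""
    labels = [
        label
        for message, label in KNOWN_FAILURES.items()
        if message in stderr_text
    ]
    return labels or ["unclassified"]


example = (
    "wrapper: running acemd3.exe (--boinc input --device 0) "
    "# Engine failed: The periodic box size has decreased to less than "
    "twice the nonbonded cutoff. app exit status: 0x1"
)
print(classify(example))
```

A report tagged with the periodic box label on Windows with a CUDA 10 task would fall under the restart problem described above; anything "unclassified" probably deserves a separate look.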

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 52751 - Posted: 30 Sep 2019 \| 16:06:28 UTC - in response to Message 52748.
	9/29/2019 9:55:41 AM \| GPUGRID \| Computation for task e16s9_e14s4p0f17-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-0-2-RND4379_1 finished This 2.07 task failed 8.75 seconds after startup following a management activity on my part. I had NOT suspended boinc manager activity before the shutdown. The machine is I7, W10, RTX2080. Again frustrating that at this point the problem has not been solved. Again TONI do you want these individual reports? That seems a faulty WU. Failed elsewhere.
	ID: 52751 \| Rating: 0 \| rate: / Reply Quote

HyperComputing Send message Joined: 15 Sep 19 Posts: 4 Credit: 485,304,520 RAC: 0 Level Scientific publications	Message 52753 - Posted: 1 Oct 2019 \| 10:10:58 UTC - in response to Message 52751.
	no task failed on linux. 1st unit : i7 with 1x 1050ti (cuda80 tasks) 2nd unit : i5 with 2x 1060 (cuda100 tasks)
	ID: 52753 \| Rating: 0 \| rate: / Reply Quote

RFGuy_KCCO Send message Joined: 13 Feb 14 Posts: 6 Credit: 1,056,951,005 RAC: 120 Level Scientific publications	Message 52759 - Posted: 2 Oct 2019 \| 6:38:45 UTC - in response to Message 52750. Last modified: 2 Oct 2019 \| 6:46:05 UTC
	Dears, sorry for the slow progress but I determined (at least) a restart problem, and it is not related to the wrapper. It is Windows-only, CUDA 10 only, as far as I can tell from your reports, and manifests itself with the "The periodic box size has decreased to less than twice the nonbonded cutoff." message. Unfortunately the root cause is hard to identify (may be external to our code). I have compiled the wrapper myself (the binaries on the boinc page are old and had one important bug in variable substitution), but for now the failures seem unrelated. It's a bit frustrating because everything else seems to work nicely. Any chance the Linux app could be released now, since the Linux community has been without steady work for months and the Linux app seems to be working fine? Please, please please. Edit - I forgot about the problem reported by Keith Myers involving suspend/resume on different types of cards. I guess this will need to be fixed before it can be released. No issues here with suspending and resuming tasks under Linux. Just suspended a WU and it resumed on the other GPU in that box without issue (both GPUs are RTX 2080's). Curious. I wonder if my Linux failure was because the paused task did not start back up on the same type of card.
	ID: 52759 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52760 - Posted: 2 Oct 2019 \| 7:11:20 UTC - in response to Message 52759.
	Edit - I forgot about the problem reported by Keith Myers involving suspend/resume on different types of cards. I guess this will need to be fixed before it can be released. I solved that issue by changing my Preferences to rotate between projects every 360 minutes instead of the stock 60 minutes. The task stays on the same card it starts on until it finishes. The longest task so far has only run for just shy of 3 hours.
	ID: 52760 \| Rating: 0 \| rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 0 Level Scientific publications	Message 52761 - Posted: 2 Oct 2019 \| 9:33:40 UTC - in response to Message 52759.
	Any chance the Linux app could be released now, since the Linux community has been without steady work for months and the Linux app seems to be working fine? Please, please please. There is not enough work even for the Windows based hosts in the past few months. There would be much more complaints for the lack of work if the Linux community could also crunch them. BTW I am in both groups, but I prefer Linux for the higher performance due to the lack of WDDM.
	ID: 52761 \| Rating: 0 \| rate: / Reply Quote

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 52762 - Posted: 2 Oct 2019 \| 14:33:25 UTC - in response to Message 52761.
	There is not enough work even for the Windows based hosts in the past few months. There would be much more complaints for the lack of work if the Linux community could also crunch them. BTW I am in both groups, but I prefer Linux for the higher performance due to the lack of WDDM. But that could be because all their new work is for Acemd3, and they are just letting the old stuff complete. I would state it the other way: They could do all the work they need to just with the Linux machines. They can work on the Windows app later, and have it working when they need it. Complaints? Have they ever stopped?
	ID: 52762 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1105 Credit: 7,822,620,176 RAC: 1,489,209 Level Scientific publications	Message 52763 - Posted: 2 Oct 2019 \| 16:40:21 UTC - in response to Message 52762.
	Complaints? Have they ever stopped? :-) :-) :-)
	ID: 52763 \| Rating: 0 \| rate: / Reply Quote

Killersocke Send message Joined: 18 Oct 13 Posts: 53 Credit: 406,647,419 RAC: 0 Level Scientific publications	Message 52764 - Posted: 2 Oct 2019 \| 17:58:55 UTC
	My GPU feels extremely neglected 😪
	ID: 52764 \| Rating: 0 \| rate: / Reply Quote

PappaLitto Send message Joined: 21 Mar 16 Posts: 511 Credit: 4,672,242,755 RAC: 319 Level Scientific publications	Message 52765 - Posted: 2 Oct 2019 \| 19:23:39 UTC - in response to Message 52764.
	If you want something to crunch, folding@home always has work and is always looking for more volunteers
	ID: 52765 \| Rating: 0 \| rate: / Reply Quote

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 52767 - Posted: 2 Oct 2019 \| 21:47:43 UTC - in response to Message 52765.
	If you want something to crunch, folding@home always has work and is always looking for more volunteers That is my standard line too. But at the moment, even they are having server problems in at least one of their locations, maybe two. No one seems to know exactly what is going on. Give them a week to figure it out.
	ID: 52767 \| Rating: 0 \| rate: / Reply Quote

KAMasud Send message Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level Scientific publications	Message 52769 - Posted: 3 Oct 2019 \| 7:22:06 UTC
	I have decided not to restart or fiddle around with my machine. Let us see if it finishes successfully. If I may say so, according to Afterburner my GPU is running four degrees centigrade hotter than with the old WUs or normal. As hot as Einstein@Home, which is my backup when GPU Grid becomes lazy. Resources set at zero for Einstein.
	ID: 52769 \| Rating: 0 \| rate: / Reply Quote

KAMasud Send message Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level Scientific publications	Message 52770 - Posted: 3 Oct 2019 \| 7:58:54 UTC
	Lost it. Power failure.
	ID: 52770 \| Rating: 0 \| rate: / Reply Quote

biodoc Send message Joined: 26 Aug 08 Posts: 183 Credit: 7,502,764,375 RAC: 51,713,006 Level Scientific publications	Message 52772 - Posted: 4 Oct 2019 \| 11:19:38 UTC - in response to Message 52750.
	Dears, sorry for the slow progress but I determined (at least) a restart problem, and it is not related to the wrapper. It is Windows-only, CUDA 10 only, as far as I can tell from your reports, and manifests itself with the "The periodic box size has decreased to less than twice the nonbonded cutoff." message. Unfortunately the root cause is hard to identify (may be external to our code). I have compiled the wrapper myself (the binaries on the boinc page are old and had one important bug in variable substitution), but for now the failures seem unrelated. It's a bit frustrating because everything else seems to work nicely. If you are using openmm, line 375 of the CudaNonbondedUtilities.cpp source code is the following. throw OpenMMException("The periodic box size has decreased to less than twice the nonbonded cutoff."); https://github.com/openmm/openmm/blob/master/platforms/cuda/src/CudaNonbondedUtilities.cpp Perhaps Peter Eastman can shed some light on this problem. https://github.com/peastman
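For readers unfamiliar with the message, here is a minimal Python sketch of the kind of check that produces it. It assumes a rectangular box and a single cutoff; the function name and the example values are hypothetical, and this is an illustration rather than OpenMM's own code.

```python
def box_too_small(box_lengths_nm, cutoff_nm):
    """Return True if any periodic box edge is shorter than twice the cutoff."""
    return any(length < 2.0 * cutoff_nm for length in box_lengths_nm)


# Hypothetical values: a 0.9 nm cutoff needs every box edge to be >= 1.8 nm.
print(box_too_small((6.0, 6.0, 6.0), 0.9))  # healthy box
print(box_too_small((6.0, 1.5, 6.0), 0.9))  # one edge is below 1.8 nm
```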
	ID: 52772 \| Rating: 0 \| rate: / Reply Quote

Killersocke Send message Joined: 18 Oct 13 Posts: 53 Credit: 406,647,419 RAC: 0 Level Scientific publications	Message 52773 - Posted: 4 Oct 2019 \| 12:29:26 UTC - in response to Message 52772. Last modified: 4 Oct 2019 \| 12:30:19 UTC
	Hi, it is not only CUDA 10. For example: http://www.gpugrid.net/result.php?resultid=21425993 http://www.gpugrid.net/result.php?resultid=21425617 http://www.gpugrid.net/result.php?resultid=21425545 #SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 750 Stderr output <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code -59 (0xffffffc5)</message> <stderr_txt> # GPU [GeForce RTX 2080 Ti] Platform [Windows] Rev [3212] VERSION [80] # SWAN Device 0 : # Name : GeForce RTX 2080 Ti # ECC : Disabled # Global mem : 11264MB # Capability : 7.5 # PCI ID : 0000:01:00.0 # Device clock : 1755MHz # Memory clock : 7000MHz # Memory width : 352bit # Driver version : r436_45 : 43648 #SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 750 </stderr_txt> ]]>
	ID: 52773 \| Rating: 0 \| rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52774 - Posted: 4 Oct 2019 \| 14:17:44 UTC - in response to Message 52773.
	Hi, it is not only CUDA 10. For example: http://www.gpugrid.net/result.php?resultid=21425993 http://www.gpugrid.net/result.php?resultid=21425617 http://www.gpugrid.net/result.php?resultid=21425545 #SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 750 Stderr output <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> (unknown error) - exit code -59 (0xffffffc5)</message> <stderr_txt> # GPU [GeForce RTX 2080 Ti] Platform [Windows] Rev [3212] VERSION [80] # SWAN Device 0 : # Name : GeForce RTX 2080 Ti # ECC : Disabled # Global mem : 11264MB # Capability : 7.5 # PCI ID : 0000:01:00.0 # Device clock : 1755MHz # Memory clock : 7000MHz # Memory width : 352bit # Driver version : r436_45 : 43648 #SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 750 </stderr_txt> ]]> Turing GPU cards are only able to run TEST Work Units at the moment. You will need to change your GPUGRID settings to ensure only TEST Work Units are accepted for your Turing GPU. The above errors occur for ACEMD2 Work Units on Turing-based cards.
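A rough way to see why: the log above reports VERSION [80] (a CUDA 8.0 build) and Capability 7.5. The sketch below assumes the commonly cited toolkit limits (CUDA 8.0 builds native code up to compute capability 6.2, CUDA 10.0 up to 7.5). The table and function names are illustrative only and are not part of any GPUGRID or BOINC tool.

```python
import re

# Assumed limits: highest compute capability each toolkit can build native code for.
MAX_CAPABILITY = {"8.0": (6, 2), "10.0": (7, 5)}


def parse_capability(stderr_text):
    """Extract the '# Capability : X.Y' value from a stderr dump, or None."""
    match = re.search(r"Capability\s*:\s*(\d+)\.(\d+)", stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def build_supports(toolkit, capability):
    """True if a build made with `toolkit` can include code for `capability`."""
    return capability <= MAX_CAPABILITY[toolkit]


log = "# Name : GeForce RTX 2080 Ti # ECC : Disabled # Capability : 7.5 # PCI ID : 0000:01:00.0"
cap = parse_capability(log)
print(cap, build_supports("8.0", cap), build_supports("10.0", cap))
```

Under these assumptions a CUDA 8.0 build has no image for a 7.5 device, which matches the "cannot find image ... for device version 750" failure.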
	ID: 52774 \| Rating: 0 \| rate: / Reply Quote

Erich56 Send message Joined: 1 Jan 15 Posts: 1105 Credit: 7,822,620,176 RAC: 1,489,209 Level Scientific publications	Message 52777 - Posted: 4 Oct 2019 \| 15:21:32 UTC - in response to Message 52774.
	Turing GPU cards are only able to do TEST Work Units at the moment. You will need to change your GPUGRID settings to ensure only TEST Work Units are accepted for your Turing GPU. The above errors occur for ACEMD2 Work units on Turing based cards. What about the other way round? Will the TEST Work Units work with cards prior to Turing?
	ID: 52777 \| Rating: 0 \| rate: / Reply Quote

Diplomat Send message Joined: 1 Sep 10 Posts: 15 Credit: 621,049,648 RAC: 554 Level Scientific publications	Message 52779 - Posted: 4 Oct 2019 \| 17:56:52 UTC - in response to Message 52742.
	7.2.42 is really old (but latest on berkeley download). very likely the client is estimating wrong in addition to mis-identifying the cpu. apt-get under ubuntu 18.04 got me version 7.16.1 boinc how did you get 7.16? I have same version of ubuntu but in repos see available only 7.9.3
	ID: 52779 \| Rating: 0 \| rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1313 Credit: 6,022,292,459 RAC: 10,175,539 Level Scientific publications	Message 52780 - Posted: 4 Oct 2019 \| 18:20:37 UTC - in response to Message 52779.
	He must have installed the PPA. The Ubuntu 18.04 distro only has BOINC 7.9.3. Or he went to the source and compiled it on his own.
	ID: 52780 \| Rating: 0 \| rate: / Reply Quote

rod4x4 Send message Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0 Level Scientific publications	Message 52782 - Posted: 4 Oct 2019 \| 23:32:08 UTC - in response to Message 52777. Last modified: 4 Oct 2019 \| 23:50:31 UTC
	Turing GPU cards are only able to do TEST Work Units at the moment. You will need to change your GPUGRID settings to ensure only TEST Work Units are accepted for your Turing GPU. The above errors occur for ACEMD2 Work units on Turing based cards. What about the other way round? Will the TEST Work Units work with cards prior to Turing? The TEST work units seem to be backward compatible. My Pascal cards are receiving TEST work units and processing them successfully. Interestingly, my Maxwell cards have not received a TEST work unit, but that could just be luck of the draw. EDIT: The drivers on my Maxwell cards are quite old (v388 - v391). This would explain why they are not receiving TEST work units. Nvidia driver version 418.39 or above is required for CUDA 10.1.
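The driver requirement can be checked with a small comparison. This sketch takes the 418.39 minimum from the post above (the minimum may differ by platform); the function names are hypothetical and the example version strings are only illustrative.

```python
MIN_DRIVER_CUDA_10_1 = (418, 39)  # minimum quoted in the post above


def parse_driver(version):
    """Turn '391.35' into (391, 35) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_ok(version, minimum=MIN_DRIVER_CUDA_10_1):
    """True if the driver version is at or above the assumed minimum."""
    return parse_driver(version) >= minimum


for v in ("388.13", "391.35", "418.39", "436.45"):
    print(v, driver_ok(v))
```

Comparing tuples of integers avoids the trap of comparing version strings as text or as floats.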
	ID: 52782 \| Rating: 0 \| rate: / Reply Quote

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 52783 - Posted: 5 Oct 2019 \| 1:26:59 UTC - in response to Message 52779.
	how did you get 7.16? https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc
	ID: 52783 \| Rating: 0 \| rate: / Reply Quote

[PUGLIA] kidkidkid3 Send message Joined: 23 Feb 11 Posts: 94 Credit: 1,004,605,544 RAC: 153,854 Level Scientific publications	Message 52784 - Posted: 5 Oct 2019 \| 8:42:36 UTC - in response to Message 52783. Last modified: 5 Oct 2019 \| 8:43:43 UTC
	Hi, this is my latest error on Windows after a suspend/resume action. http://www.gpugrid.net/result.php?resultid=21426852 I hope it helps. K. ____________ Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing. (Martin Luther King)
	ID: 52784 \| Rating: 0 \| rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1589 Credit: 6,680,094,351 RAC: 9,199,283 Level Scientific publications	Message 52785 - Posted: 5 Oct 2019 \| 11:13:50 UTC - in response to Message 52783.
	how did you get 7.16? https://launchpad.net/~costamagnagianfranco/+archive/ubuntu/boinc Be careful with that one. It has not yet passed full release testing, and several serious bugs have been found already. Gianfranco is good at updating the PPA as bugs are eliminated, but doesn't increment the version number independently. I think the current PPA numbered 7.16.3 has all except one of the fixes needed: there will probably be at least a 7.16.4 before this saga is finished.
	ID: 52785 \| Rating: 0 \| rate: / Reply Quote

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 52786 - Posted: 5 Oct 2019 \| 11:25:40 UTC - in response to Message 52785. Last modified: 5 Oct 2019 \| 11:35:02 UTC
	Thank you for the expert advice. So far 7.16.3 has worked on three other Ubuntu machines and one Win7. I did get a lot of extra downloads on WCG that I had never seen before, but expect that is a problem at their end(?). That is mainly because I have seen their "settings" reset every few months, and don't entirely trust their servers on that. Also, I just did a manual update of BOINC with no more extraneous work units downloaded, so it seems to be OK now.
	ID: 52786 \| Rating: 0 \| rate: / Reply Quote

w1hue Send message Joined: 28 Sep 09 Posts: 21 Credit: 373,977,011 RAC: 46,075 Level Scientific publications	Message 52793 - Posted: 6 Oct 2019 \| 23:01:48 UTC
	When will there be some WUs that will run without erroring out on Windows machines? And not suck up an entire CPU in addition to the GPU?
	ID: 52793 \| Rating: 0 \| rate: / Reply Quote

[PUGLIA] kidkidkid3 Send message Joined: 23 Feb 11 Posts: 94 Credit: 1,004,605,544 RAC: 153,854 Level Scientific publications	Message 52794 - Posted: 7 Oct 2019 \| 10:40:59 UTC - in response to Message 52784.
	Hi, this is my latest error on Windows after a suspend/resume action. http://www.gpugrid.net/result.php?resultid=21426852 I hope it helps. K. Another error on a Windows acemd3 test WU, this time without any suspend/resume action. http://www.gpugrid.net/result.php?resultid=21428934 K. ____________ Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing. (Martin Luther King)
	ID: 52794 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 400 Credit: 13,844,764,882 RAC: 31,380,351 Level Scientific publications	Message 52796 - Posted: 7 Oct 2019 \| 14:48:41 UTC - in response to Message 52779. Last modified: 7 Oct 2019 \| 14:48:59 UTC
	7.2.42 is really old (but latest on berkeley download). very likely the client is estimating wrong in addition to mis-identifying the cpu. apt-get under ubuntu 18.04 got me version 7.16.1 boinc how did you get 7.16? I have same version of ubuntu but in repos see available only 7.9.3 https://boinc.berkeley.edu/download_all.php
	ID: 52796 \| Rating: 0 \| rate: / Reply Quote

Aurum Send message Joined: 12 Jul 17 Posts: 400 Credit: 13,844,764,882 RAC: 31,380,351 Level Scientific publications	Message 52797 - Posted: 7 Oct 2019 \| 14:49:55 UTC - in response to Message 52793.
	When will there be some WUs that will run without erroring out on Windows machines? And not suck up an entire CPU in addition to the GPU? They will always need their own CPU.
	ID: 52797 \| Rating: 0 \| rate: / Reply Quote

w1hue Send message Joined: 28 Sep 09 Posts: 21 Credit: 373,977,011 RAC: 46,075 Level Scientific publications	Message 52801 - Posted: 7 Oct 2019 \| 22:37:08 UTC - in response to Message 52797.
	They will always need their own CPU. The "long runs" that I have been running did not require 100% of a CPU -- at least not on my system. I finally realized that I can change my preferences to not get ACEMD3 WUs -- which I did.
	ID: 52801 \| Rating: 0 \| rate: / Reply Quote

HyperComputing Send message Joined: 15 Sep 19 Posts: 4 Credit: 485,304,520 RAC: 0 Level Scientific publications	Message 52837 - Posted: 11 Oct 2019 \| 19:37:59 UTC Last modified: 11 Oct 2019 \| 19:42:38 UTC
	Some config changes: - the i5 now has just 1 GTX 1060 (cuda100) - replaced the GTX 1050 Ti with a GTX 1060 on the i7 (cuda80). The current ADRIA task failed on resume after I changed the GPU. Now everything works perfectly, even with multiple suspend/resume cycles. The GTX 1060 is definitely much faster than the GTX 1050 Ti (7h vs 95h for the same task).
	ID: 52837 \| Rating: 0 \| rate: / Reply Quote
