
Message boards : News : Update acemd3 app

Toni
Message 57041 - Posted: 1 Jul 2021 | 18:29:57 UTC

I deployed the new app, which now requires CUDA 11.2 and hopefully supports all the latest cards. Touching the CUDA versions is always a nightmare in the BOINC scheduler, so expect problems.

Ian&Steve C.
Message 57042 - Posted: 1 Jul 2021 | 18:36:19 UTC - in response to Message 57041.

YES! Thank you so much!

Ian&Steve C.
Message 57043 - Posted: 1 Jul 2021 | 18:58:49 UTC - in response to Message 57042.
Last modified: 1 Jul 2021 | 19:07:46 UTC

I noticed the plan class is listed as "cuda1121" on the Applications page. Is this a typo? Will it cause any issues with getting work or running the application?

Also, you might need to put a cap (maybe on compute capability or something similar) on the project server side to prevent the CUDA 10.0 app from being sent to Ampere hosts; we already saw many errors because the CUDA 10.0 app was still being sent to them. There should be a way to make sure Ampere hosts only get the 11.2 app and never try to use the CUDA 10.0 app.

ExtraTerrestrial Apes
Message 57045 - Posted: 1 Jul 2021 | 19:30:59 UTC
Last modified: 1 Jul 2021 | 19:31:09 UTC

Great news! So far it's only Linux, right?

MrS

Keith Myers
Message 57046 - Posted: 1 Jul 2021 | 19:53:00 UTC - in response to Message 57045.

So far.

Ian&Steve C.
Message 57047 - Posted: 1 Jul 2021 | 20:02:17 UTC

Just so people are aware, CUDA 11.2 (I assume the "1121" means CUDA 11.2.1 "update 1") means you need at least driver 460.32 on Linux.
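A quick way to check what a host will report to the scheduler (a sketch; the query fields are standard nvidia-smi, though output formatting varies by driver branch):

nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# e.g. NVIDIA GeForce RTX 3080 Ti, 460.84 - anything below 460.32 won't run the cuda1121 app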

Toni
Message 57048 - Posted: 1 Jul 2021 | 20:22:33 UTC - in response to Message 57047.

Can someone confirm that the Linux cuda100 app is still being sent out (and likely fails)?

T

Ian&Steve C.
Message 57049 - Posted: 1 Jul 2021 | 20:31:25 UTC - in response to Message 57048.

Can someone confirm that the Linux cuda100 app is still being sent out (and likely fails)?

T


Is this the reason the Linux tasks have been failing recently? Do they need this new app? Did you remove the Linux cuda100 app?

Ian&Steve C.
Message 57050 - Posted: 1 Jul 2021 | 20:56:08 UTC - in response to Message 57048.
Last modified: 1 Jul 2021 | 21:04:46 UTC

I just got a couple of tasks on my RTX 3080 Ti host, and it got the new app. It failed in 2 seconds. It looks like you're missing a file, or you forgot to statically link Boost into the app:

16:50:34 (15968): wrapper (7.7.26016): starting
16:50:34 (15968): wrapper (7.7.26016): starting
16:50:34 (15968): wrapper: running acemd3 (--boinc input --device 0)
acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory
16:50:35 (15968): acemd3 exited; CPU time 0.000360
16:50:35 (15968): app exit status: 0x7f
16:50:35 (15968): called boinc_finish(195)


https://www.gpugrid.net/result.php?resultid=32631384

But it's promising that I didn't get the "invalid architecture" error.
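For anyone wanting to confirm the same failure without burning a task, you can check the binary's unresolved libraries directly (a diagnostic sketch; the path to acemd3 under the BOINC data directory is illustrative):

ldd projects/www.gpugrid.net/acemd3 | grep 'not found'
# libboost_filesystem.so.1.74.0 => not found    <- the library the app expects but doesn't ship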

Keith Myers
Message 57051 - Posted: 1 Jul 2021 | 21:11:25 UTC
Last modified: 1 Jul 2021 | 21:41:43 UTC

Looks like Ubuntu 20.04.2 LTS has libboost-all-dev 1.71 installed.

I remember that Gridcoin also needs libboost-all-dev 1.74 now when building.

That version is in 21.04.

[Edit]
Theoretically yes, though AFAIK I don't know anything about wrapper containers.

I just wonder whether, if you installed the latest 1.74 libboost-all-dev environment, the tasks would stop failing.

https://www.boost.org/users/history/version_1_74_0.html

Ian&Steve C.
Message 57052 - Posted: 1 Jul 2021 | 21:31:01 UTC - in response to Message 57051.

I think these are sandboxed in the wrapper, so packages on the system in theory shouldn't matter, right?

Keith Myers
Message 57053 - Posted: 1 Jul 2021 | 21:45:12 UTC

Just failed a couple more acemd3 tasks. What a waste . . . . as hard as they are to snag.

Ian&Steve C.
Message 57054 - Posted: 1 Jul 2021 | 22:24:20 UTC - in response to Message 57053.

Just failed a couple more acemd3 tasks. What a waste . . . . as hard as they are to snag.


Did you get the new app? Do you have that newer version of Boost installed?

Keith Myers
Message 57055 - Posted: 1 Jul 2021 | 23:28:47 UTC - in response to Message 57054.

No, I just have the normal CUDA 10.0 app installed. I'm just investigating what would be needed to install the missing libraries.

Pop Piasa
Message 57056 - Posted: 1 Jul 2021 | 23:52:38 UTC

Great to see this progress, as GPU prices are beginning to fall and Ampere GPUs currently dominate market availability. I hope China's ban on mining becomes a budgetary boon for crunchers and gamers worldwide.

The Amperes should eventually expedite this project considerably.

Ian&Steve C.
Message 57057 - Posted: 2 Jul 2021 | 0:05:05 UTC - in response to Message 57055.

No, I just have the normal CUDA 10.0 app installed. I'm just investigating what would be needed to install the missing libraries.


Looks like you're actually getting the new app now: http://www.gpugrid.net/result.php?resultid=32631755

Keith Myers
Message 57058 - Posted: 2 Jul 2021 | 0:16:36 UTC - in response to Message 57057.

No, I just have the normal CUDA 10.0 app installed. I'm just investigating what would be needed to install the missing libraries.


Looks like you're actually getting the new app now: http://www.gpugrid.net/result.php?resultid=32631755

Huh, hadn't noticed.

So maybe the new ACEMD v2.12 (cuda1121) is going to be the default app even for the older cards.

Ian&Steve C.
Message 57059 - Posted: 2 Jul 2021 | 0:41:12 UTC - in response to Message 57058.
Last modified: 2 Jul 2021 | 0:49:30 UTC

I think you'll only get the 11.2 app if you have a compatible driver (460.32 or newer). Just my guess. I'll need to see if systems with an older driver will still get the cuda100 app.

Edit: answering my own question. I guess the driver being reported doesn't factor into app selection anymore; my systems reporting an older driver still received the new app. So it won't prevent the app from being sent to someone without a new enough driver.

Toni
Message 57060 - Posted: 2 Jul 2021 | 8:50:26 UTC - in response to Message 57059.
Last modified: 2 Jul 2021 | 8:57:08 UTC

I'm still trying to figure out the best way to distribute the app. The current way hard-codes minimum and maximum driver versions for each CUDA version, and it's too cumbersome to maintain.

Suggestions are welcome. The server knows the client's CUDA version and driver version, as well as the app's CUDA plan class.
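For what it's worth, BOINC's scheduler can express those limits declaratively in plan_class_spec.xml rather than in hard-coded version checks. A minimal sketch, written as a shell heredoc (the tag names follow BOINC's plan-class spec, but the cutoffs here are illustrative assumptions, not the project's actual settings):

cat >> plan_class_spec.xml << 'EOF'
<plan_class>
    <name>cuda1121</name>
    <gpu_type>nvidia</gpu_type>
    <cuda/>
    <min_cuda_version>11020</min_cuda_version>      <!-- CUDA 11.2, as reported by the client -->
    <min_driver_version>46032</min_driver_version>  <!-- 460.32, encoded as major*100 + minor -->
</plan_class>
EOF

A matching max_nvidia_compcap on the cuda100 class would keep that app away from compute-capability-8.x (Ampere) hosts, which is the resend problem described earlier in the thread.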

Bedrich Hajek
Message 57061 - Posted: 2 Jul 2021 | 12:07:52 UTC - in response to Message 57060.

I'm still trying to figure out the best way to distribute the app. The current way hard-codes minimum and maximum driver versions for each CUDA version, and it's too cumbersome to maintain.

Suggestions are welcome. The server knows the client's CUDA version and driver version, as well as the app's CUDA plan class.


Here is an idea:

How about distribution by card type? That would exclude the really slow cards, like the 740M.

BTW: What driver version do we need for this?

Ian&Steve C.
Message 57062 - Posted: 2 Jul 2021 | 12:54:25 UTC - in response to Message 57060.
Last modified: 2 Jul 2021 | 13:09:11 UTC

Toni, I think the first thing that needs fixing is that the Boost 1.74 library isn't included in the app distribution; the app fails right away because it's not there. You either need to distribute the .so file or statically link it into the acemd3 binary so it isn't needed separately.

Manually installing it seems to be a workaround, but that's a tall order to ask of every Linux user.
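Either fix is a build-time change; a sketch of the two options (the flags are illustrative, assuming a g++ link step):

# Option 1: force Boost to be linked statically into the binary
g++ ... -o acemd3 -Wl,-Bstatic -lboost_filesystem -Wl,-Bdynamic ...

# Option 2: keep it dynamic, and verify what then has to ship alongside the app
objdump -p acemd3 | grep NEEDED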

Ian&Steve C.
Message 57063 - Posted: 2 Jul 2021 | 14:49:02 UTC

After manually installing the required Boost to get past that error, I now get this error on my 3080 Ti system:

09:55:10 (4806): wrapper (7.7.26016): starting
09:55:10 (4806): wrapper (7.7.26016): starting
09:55:10 (4806): wrapper: running acemd3 (--boinc input --device 0)
ACEMD failed:
Error launching CUDA compiler: 32512
sh: 1: : Permission denied


09:55:11 (4806): acemd3 exited; CPU time 0.479062
09:55:11 (4806): app exit status: 0x1
09:55:11 (4806): called boinc_finish(195)


Task: https://www.gpugrid.net/result.php?resultid=32632410

I tried purging and reinstalling the nvidia drivers, but no change.

It looks like this same error popped up when you first released acemd3 two years ago: http://www.gpugrid.net/forum_thread.php?id=4935#51970

biodoc wrote:
Multiple failures of this task on both windows and linux

http://www.gpugrid.net/workunit.php?wuid=16517304

<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
15:19:27 (30109): wrapper (7.7.26016): starting
15:19:27 (30109): wrapper (7.7.26016): starting
15:19:27 (30109): wrapper: running acemd3 (--boinc input --device 0)
# Engine failed: Error launching CUDA compiler: 32512
sh: 1: : Permission denied

15:19:28 (30109): acemd3 exited; CPU time 0.186092
15:19:28 (30109): app exit status: 0x1
15:19:28 (30109): called boinc_finish(195)

</stderr_txt>


Why is the app launching CUDA compiler?


You then updated the app, which fixed the problem at the time, but you didn't post exactly what was changed: http://www.gpugrid.net/forum_thread.php?id=4935&nowrap=true#52022

Toni wrote:
It was a cryptic bug in the order loading shared libraries, or something like that. Otherwise unexplainably system-dependent.

I see VERY few failures now. The new app will be a huge step forward on several aspects, not least maintainability. We'll be transitioning gradually.


So whatever change you made between v2.02 and v2.03 seems to be what needs fixing again.
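One clue about the 32512: it looks like a raw status from a system()-style call, which packs the child's exit code into the high byte, and 32512 = 127 * 256, the shell's "command not found" code. Combined with the empty command name in "sh: 1: : Permission denied", it suggests the app is invoking its runtime compiler with an empty or wrong path. Easy to reproduce from a shell (a sketch):

sh -c 'no-such-compiler'; echo $?    # prints 127: command not found
echo $((127 * 256))                  # prints 32512: how wait() encodes that exit code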

ServicEnginIC
Message 57064 - Posted: 2 Jul 2021 | 15:26:32 UTC

I deployed the new app, which now requires CUDA 11.2 and hopefully supports all the latest cards. Touching the CUDA versions is always a nightmare in the BOINC scheduler, so expect problems.

Thank you so much.
Those efforts are for noble reasons.

Regarding persistent errors:
I also manually installed Boost, as a trial, on one of my Ubuntu 20.04 hosts, by means of the following commands:

sudo add-apt-repository ppa:mhier/libboost-latest
sudo apt-get update
sudo apt-get install boost1.74
reboot

But a new task downloaded after that still failed:
e3s644_e1s419p0f770-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND9285_4
Then I reset the GPUGrid project, and it seems that did the trick.
A new task is currently running on this host, instead of failing after a few seconds as before:
e4s126_e3s248p0f238-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-0-2-RND6347_7
49 minutes, 1.919% progress so far.

Ian&Steve C.
Message 57065 - Posted: 2 Jul 2021 | 15:33:05 UTC - in response to Message 57064.

I deployed the new app, which now requires CUDA 11.2 and hopefully supports all the latest cards. Touching the CUDA versions is always a nightmare in the BOINC scheduler, so expect problems.

Thank you so much.
Those efforts are for noble reasons.

Regarding persistent errors:
I also manually installed Boost, as a trial, on one of my Ubuntu 20.04 hosts, by means of the following commands:

sudo add-apt-repository ppa:mhier/libboost-latest
sudo apt-get update
sudo apt-get install boost1.74
reboot

But a new task downloaded after that still failed:
e3s644_e1s419p0f770-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND9285_4
Then I reset the GPUGrid project, and it seems that did the trick.
A new task is currently running on this host, instead of failing after a few seconds as before:
e4s126_e3s248p0f238-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-0-2-RND6347_7
49 minutes, 1.919% progress so far.


Thanks, I'll try a project reset, though I had already done one after the new app was announced. I guess it can't hurt.


Ian&Steve C.
Message 57066 - Posted: 2 Jul 2021 | 15:45:20 UTC - in response to Message 57065.

Nope, even after the project reset, still the same error:

process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
11:42:55 (5665): wrapper (7.7.26016): starting
11:42:55 (5665): wrapper (7.7.26016): starting
11:42:55 (5665): wrapper: running acemd3 (--boinc input --device 0)
ACEMD failed:
Error launching CUDA compiler: 32512
sh: 1: : Permission denied

11:42:56 (5665): acemd3 exited; CPU time 0.429069
11:42:56 (5665): app exit status: 0x1
11:42:56 (5665): called boinc_finish(195)


https://www.gpugrid.net/result.php?resultid=32632487

Ian&Steve C.
Message 57067 - Posted: 2 Jul 2021 | 16:00:16 UTC - in response to Message 57064.

sudo add-apt-repository ppa:mhier/libboost-latest
sudo apt-get update
sudo apt-get install libboost1.74
reboot


Small correction here: it's "libboost1.74", not just "boost1.74".

ServicEnginIC
Message 57068 - Posted: 2 Jul 2021 | 16:27:57 UTC - in response to Message 57066.

Maybe your problem is an Ampere-specific one(?).

I've caught a new task on another of my hosts after applying the same remedy, and it is also running as expected.
e3s263_e1s419p0f938-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND6959_2
25 minutes, 0.560% progress so far for this second task.
Turing GPUs and NVIDIA drivers 465.31 on both hosts.
Installing libboost1.74 didn't work for me by itself.
Resetting the project didn't work for me by itself.
Installing libboost1.74 and resetting the project afterwards did work, for both my hosts.
I've double-checked, and the commands I employed were the ones previously published in message #57064.
Watching Synaptic, this led to libboost1.74 and libboost1.74-dev being correctly installed.

Ian&Steve C.
Message 57069 - Posted: 2 Jul 2021 | 16:37:10 UTC - in response to Message 57068.
Last modified: 2 Jul 2021 | 16:41:30 UTC

Maybe your problem is an Ampere-specific one(?).

I've caught a new task on another of my hosts after applying the same remedy, and it is also running as expected.
e3s263_e1s419p0f938-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND6959_2
25 minutes, 0.560% progress so far for this second task.
Turing GPUs and NVIDIA drivers 465.31 on both hosts.
Installing libboost1.74 didn't work for me by itself.
Resetting the project didn't work for me by itself.
Installing libboost1.74 and resetting the project afterwards did work, for both my hosts.
I've double-checked, and the commands I employed were the ones previously published in message #57064.
Watching Synaptic, this led to libboost1.74 and libboost1.74-dev being correctly installed.


I had this thought. I put my old 2080 Ti into the problem host and will see if it starts processing, or if it's really a problem with the host-specific configuration. This isn't the first time this has happened, though, and Toni previously fixed it with an app update, so it looks like that will be needed again even if it's Ampere-specific.

I think the difference in install commands comes down to the use of apt vs. apt-get. Although apt-get still works, transitioning to plain apt is better in the long term: Difference between apt and apt-get.

Ian&Steve C.
Message 57070 - Posted: 2 Jul 2021 | 17:15:35 UTC - in response to Message 57069.
Last modified: 2 Jul 2021 | 17:22:47 UTC

Maybe your problem is an Ampere-specific one(?).

I've caught a new task on another of my hosts after applying the same remedy, and it is also running as expected.
e3s263_e1s419p0f938-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND6959_2
25 minutes, 0.560% progress so far for this second task.
Turing GPUs and NVIDIA drivers 465.31 on both hosts.
Installing libboost1.74 didn't work for me by itself.
Resetting the project didn't work for me by itself.
Installing libboost1.74 and resetting the project afterwards did work, for both my hosts.
I've double-checked, and the commands I employed were the ones previously published in message #57064.
Watching Synaptic, this led to libboost1.74 and libboost1.74-dev being correctly installed.


I had this thought. I put my old 2080 Ti into the problem host and will see if it starts processing, or if it's really a problem with the host-specific configuration. This isn't the first time this has happened, though, and Toni previously fixed it with an app update, so it looks like that will be needed again even if it's Ampere-specific.


Well, it seems it's not Ampere-specific. It failed in the same way on my 2080 Ti here: https://www.gpugrid.net/result.php?resultid=32632521

Still the CUDA compiler error.

Unfortunately I can't easily move the 3080 Ti to another system, since it's a watercooled model that requires a custom water loop.

Keith Myers
Message 57071 - Posted: 2 Jul 2021 | 18:08:24 UTC

I just used the PPA method on my other two hosts, but I did not reboot.
Picked up another task and it is running.
Still waiting on the luck of the draw for the other host without work.

Ian&Steve C.
Message 57072 - Posted: 2 Jul 2021 | 18:19:53 UTC - in response to Message 57070.


Well, it seems it's not Ampere-specific. It failed in the same way on my 2080 Ti here: https://www.gpugrid.net/result.php?resultid=32632521

Still the CUDA compiler error.

Unfortunately I can't easily move the 3080 Ti to another system, since it's a watercooled model that requires a custom water loop.


I think I finally solved the issue! It's running on the 3080 Ti at last!

First I removed the manual installation of Boost and installed the PPA version. I don't think this was the issue, though.

While poking around in my OS install, I discovered that I had the CUDA 11.1 toolkit installed (likely from my previous attempts at building some apps to run on Ampere). I removed this old toolkit, cleaned up any leftover files, rebooted, reset the project, and waited for a task to show up.

So now it's finally running. Now to see how long it'll take a 3080 Ti ;). It has over 10,000 CUDA cores, so I'm hoping for a fast time; a 2080 Ti runs about 12 hrs, so it'll be interesting to see how fast I can knock it out. It's using about 310 watts right now, but with the caveat that ever since I've had this card, I've noticed some weird power-limiting behavior. I'm waiting on an RMA now for a new card, and I'm hoping it can really stretch its legs, though I plan to still power-limit it to about 320 W.
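If anyone else suspects the same conflict, it's quick to check for leftover toolkits before resorting to a full purge (a sketch; directory and package names vary with how CUDA was installed):

which nvcc && nvcc --version        # any hit means a local toolkit is on the PATH
ls -d /usr/local/cuda* 2>/dev/null  # leftover toolkit directories from .run or apt installs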


ServicEnginIC
Message 57073 - Posted: 2 Jul 2021 | 18:35:05 UTC - in response to Message 57072.
Last modified: 2 Jul 2021 | 18:36:13 UTC

Congratulations!
Good news... anxious to see the performance on a 3080 Ti.

Vismed
Message 57074 - Posted: 2 Jul 2021 | 18:54:38 UTC - in response to Message 57041.

Well, it will be your problem, not mine. Even having decent hardware and software, I am pretty astonished at how folks like Cosmology and the like seemingly do not understand how VMs and such work. I am pretty pissed as an amateur, though...

Richard Haselgrove
Message 57075 - Posted: 2 Jul 2021 | 18:59:40 UTC

Just seen my first failures with libboost errors on Linux Mint 20.1, driver 460.80, GTX 1660 super.

Applied the PPA and reset the project - waiting on the next tasks now.

Ian&Steve C.
Message 57076 - Posted: 2 Jul 2021 | 19:02:55 UTC - in response to Message 57074.

Well, it will be your problem, not mine. Even having decent hardware and software, I am pretty astonished at how folks like Cosmology and the like seemingly do not understand how VMs and such work. I am pretty pissed as an amateur, though...


What problem are you having, specifically?

This project has nothing to do with Cosmology, and this project does not use VMs.

Retvari Zoltan
Message 57077 - Posted: 2 Jul 2021 | 22:47:18 UTC - in response to Message 57072.

I think I finally solved the issue! It's running on the 3080 Ti at last!

Now to see how long it'll take a 3080 Ti ;). It has over 10,000 CUDA cores, so I'm hoping for a fast time; a 2080 Ti runs about 12 hrs, so it'll be interesting to see how fast I can knock it out. It's using about 310 watts right now.

This is the moment of truth we're all waiting for.
My bet is 9h 15m.

Ian&Steve C.
Message 57078 - Posted: 3 Jul 2021 | 1:22:16 UTC - in response to Message 57077.

I think I finally solved the issue! It's running on the 3080 Ti at last!

Now to see how long it'll take a 3080 Ti ;). It has over 10,000 CUDA cores, so I'm hoping for a fast time; a 2080 Ti runs about 12 hrs, so it'll be interesting to see how fast I can knock it out. It's using about 310 watts right now.

This is the moment of truth we're all waiting for.
My bet is 9h 15m.


I’m not sure it’ll be so simple.

When I checked earlier, it was tracking a 12.5 hr completion time, but the 2080 Ti was tracking a 14.5 hr completion time.

Either the new run of tasks is longer, or the CUDA 11.2 app is slower. We'll have to see.

Keith Myers
Message 57079 - Posted: 3 Jul 2021 | 2:11:48 UTC - in response to Message 57078.

I'm curious how you have a real estimated-time-remaining figure for a brand new application.

AFAIK you JUST got the application working, and I don't believe you have validated ten tasks yet to get an accurate APR, which is what produces accurate estimated-time-remaining numbers.

All my tasks are in EDF mode with multi-day estimates, simply because I have returned exactly one valid task so far: a shorty Cryptic-Scout task.

Ian&Steve C.
Message 57080 - Posted: 3 Jul 2021 | 3:18:39 UTC - in response to Message 57079.

I didn’t use the time remaining estimate from BOINC. I estimated it myself based on % complete and elapsed time, assuming a linear completion rate.
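The same back-of-the-envelope estimate in shell form, for anyone who wants to reproduce it (a sketch assuming a linear completion rate; boinccmd ships with the client):

boinccmd --get_tasks | grep -E 'fraction done|elapsed'
# estimated total = elapsed / fraction_done, e.g. 4500 s at 10% done:
echo "4500 / 0.10" | bc    # ~45000 s estimated total runtime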

Retvari Zoltan
Message 57081 - Posted: 3 Jul 2021 | 8:08:18 UTC - in response to Message 57078.
Last modified: 3 Jul 2021 | 8:15:46 UTC

When I checked earlier, it was tracking a 12.5 hr completion time, but the 2080 Ti was tracking a 14.5 hr completion time.

Either the new run of tasks is longer, or the CUDA 11.2 app is slower. We'll have to see.

If the new tasks are longer, the awarded credit should be higher. The present ADRIA_New_KIXcMyb_HIP_AdaptiveBandit workunits are "worth" 675,000 credits, while the previous ADRIA_D3RBandit_batch_nmax5000 ones were "worth" 523,125 credits, so the present ones are longer.
My estimate was 12h / 1.3 = 9h 15m (based on my optimistic expectation of a 30% performance improvement).
Nevertheless, we can use the completion times to estimate the actual performance improvement (3080 Ti vs 2080 Ti): the 3080 Ti completed the task in 44,368 s (12h 19m 28s) and the 2080 Ti completed it in 52,642 s (14h 37m 22s), so the 3080 Ti is "only" 18.65% faster. The number of usable CUDA cores in the 30xx series therefore appears to be half the advertised number (just as I expected): 10240/2 = 5120, and 5120/4352 = 1.1765, so the 3080 Ti has 17.65% more usable CUDA cores than the 2080 Ti, and its CUDA cores are about 1.4% faster than the 2080 Ti's.

Richard Haselgrove
Message 57082 - Posted: 3 Jul 2021 | 10:09:52 UTC

The PPA-reset trick worked - I have a new task running now. Another satisfied customer.

The completion estimate at 10% was 43.5 hours, both by extrapolation and by setting <fraction_done_exact/> in app_config.xml

It's an ADRIA_New_KIXcMyb_HIP_AdaptiveBandit task. I ran a couple of these about 10 days ago, under the old app: they took about 33 hours - previous 'D3RBandit_batch*' tasks had averaged 28 hours. Cards are GTX 1660 Super.

So there's a possibility that the new app is slower, at least on 'before cutting edge' cards.
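For anyone who hasn't used it: <fraction_done_exact/> tells the client to trust the app's own progress figure when estimating the remaining time. A minimal app_config.xml sketch, written from the shell (the data directory path and the app name "acemd3" are assumptions; check the project's Applications page for the exact name):

cat > /var/lib/boinc-client/projects/www.gpugrid.net/app_config.xml << 'EOF'
<app_config>
    <app>
        <name>acemd3</name>
        <fraction_done_exact/>
    </app>
</app_config>
EOF
boinccmd --read_cc_config    # re-read config files without restarting the client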

Ian&Steve C.
Message 57083 - Posted: 3 Jul 2021 | 10:39:34 UTC - in response to Message 57081.

Some of the tasks were even longer. I have two more 2080 Ti reports that took 55,583 s and 56,560 s respectively, on identical GPUs running the same clocks, so there seems to be some variability. If you use the slower one, it puts the difference closer to 30%. This exposes the flaw of using a single sample to form a conclusion; more data is required.

Also note that I've been experiencing performance issues with this specific card. I believe it's underperforming due to some incorrect power-limiting behavior (I've done a lot of load testing and cross-referencing of benchmark results with others online). I have a replacement on the way to test.

These ADRIA tasks have a hard-coded reward; it isn't necessarily based on run time. They increased the reward from the D3RBandit to these KIXcMyb tasks, but since they stopped distributing the CUDA 10 app, we can't know for sure whether the tasks are just longer or there's some inefficiency in the new 11.2 app slowing it down. If the tasks aren't longer, then the new app is almost 30% slower than the old CUDA app.

Richard Haselgrove
Message 57084 - Posted: 3 Jul 2021 | 11:16:21 UTC

I looked back into the 'job_log_www.gpugrid.net.txt' in the BOINC data folder to get my comparison times. I haven't run many AdaptiveBandits yet, but I think the 'D3RBandit_batch*' time was a robust average over the many sub-types.

Ian&Steve C.
Message 57085 - Posted: 3 Jul 2021 | 11:35:35 UTC - in response to Message 57083.
Last modified: 3 Jul 2021 | 11:36:42 UTC

After cross-referencing runtimes for various Windows hosts, I think the new app is just slower. Windows hosts haven't experienced an app change (yet) and haven't shown any sudden or recent change in run time on the KIX AdaptiveBandit jobs. This suggests the tasks themselves haven't really changed, leaving a slower 11.2 app as the only other explanation for the longer run times.

I also noticed that the package distribution differs between the CUDA 10 and 11.2 apps. The CUDA 10 app included some library files that are not included with 11.2 (like the cudart and cufft libraries), so the app may have been compiled in a different way.

I hope Toni can bring the app back to par. It really shouldn’t be that much slower.

Greger
Message 57086 - Posted: 3 Jul 2021 | 12:09:49 UTC
Last modified: 3 Jul 2021 | 12:14:32 UTC

GTX 1080
# Speed: average 75.81 ns/day, current 75.71 ns/day

RTX 2070S
# Speed: average 134.99 ns/day, current 132.17 ns/day

RTX 3070
# Speed: average 159.15 ns/day, current 155.75 ns/day
https://www.gpugrid.net/result.php?resultid=32632515
https://www.gpugrid.net/result.php?resultid=32632513

The only task finished yet is on the 3070, and it ended after 18-19 hours.
The 3000 series looks slow with 11.2, but it works. The progress bar and estimate look close to the expected time, and the 2070 could probably finish after around 21 hours.

It would be great if Toni could make the application print the progress.log output.

I had to add the PPA for the needed libboost. I also tried updating one host to 21.04 to get the latest Boost, but that did not work.

Ian&Steve C.
Message 57087 - Posted: 3 Jul 2021 | 12:24:39 UTC - in response to Message 57086.

Where did you get the ns/day numbers from?

But it's not just the 3000 series being slow. All cards seem to be proportionally slower with 11.2 vs 10.0, by about 30%.

Greger
Message 57088 - Posted: 3 Jul 2021 | 12:50:10 UTC - in response to Message 57087.
Last modified: 3 Jul 2021 | 12:51:12 UTC

Go to the slot folder and cat progress.log; the "# Speed:" lines are what I quoted above.
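For example (a sketch, assuming the stock Linux client's data directory; the slot number varies per task):

grep 'Speed:' /var/lib/boinc-client/slots/*/progress.log
# Speed: average 134.99 ns/day, current 132.17 ns/day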

Yes, it looks like all cards are affected by the new application. I compared with the 1000 series as well, but I don't have ns/day numbers for them.

Where did you get the 469 driver? I can't see it on the NVIDIA site or in the PPA.

Ian&Steve C.
Message 57089 - Posted: 3 Jul 2021 | 14:43:35 UTC - in response to Message 57088.

It's not real. I've manipulated the coproc_info file to report what I want.

The actual driver in use is 460.84.

Aurum
Message 57090 - Posted: 3 Jul 2021 | 15:48:58 UTC - in response to Message 57086.

RTX 3070
# Speed: average 159.15 ns/day, current 155.75 ns/day
https://www.gpugrid.net/result.php?resultid=32632515
https://www.gpugrid.net/result.php?resultid=32632513

The only task finished yet is on the 3070, and it ended after 18-19 hours.
The 3000 series looks slow with 11.2, but it works. The progress bar and estimate look close to the expected time, and the 2070 could probably finish after around 21 hours.

The WU you linked had one wingman run it as cuda 10.1 and the other as 11.21, with 155,037 seconds versus 68,000. Isn't that faster?
https://www.gpugrid.net/workunit.php?wuid=27075862
What does ns mean?

ServicEnginIC
Message 57091 - Posted: 3 Jul 2021 | 16:16:33 UTC - in response to Message 57080.

I didn’t use the time remaining estimate from BOINC. I estimated it myself based on % complete and elapsed time, assuming a linear completion rate.

I usually employ the same method, since the progress % shown by BOINC Manager is quite linear.
On my low-end GPUs, I'm still waiting for the first task to complete :-)
Evaluating the small sample of tasks that I've received, tasks for this new version are taking longer to complete than previous ones (let's say "for the moment").
Estimated completion times for the 5 GPUs that I'm monitoring are as follows:

[screenshot: table of estimated completion times per GPU]

The last three GPUs are Turing GTX 1650 ones, but different graphics card models and clock frequencies.
An editable version of the spreadsheet used can be downloaded from this link

Greger
Message 57092 - Posted: 3 Jul 2021 | 16:21:35 UTC - in response to Message 57090.

RTX 3070
# Speed: average 159.15 ns/day, current 155.75 ns/day
https://www.gpugrid.net/result.php?resultid=32632515
https://www.gpugrid.net/result.php?resultid=32632513

The only task finished yet is on the 3070, and it ended after 18-19 hours.
The 3000 series looks slow with 11.2, but it works. The progress bar and estimate look close to the expected time, and the 2070 could probably finish after around 21 hours.
The WU you linked had one wingman run it as cuda 10.1 and the other as 11.21, with 155,037 seconds versus 68,000. Isn't that faster?
https://www.gpugrid.net/workunit.php?wuid=27075862
What does ns mean?


nanosecond
https://en.wikipedia.org/wiki/Nanosecond#:~:text=A%20nanosecond%20(ns)%20is%20an,or%201%E2%81%841000%20microsecond.


Yes, there's a big gap to the runtime on the other host, but it was also using a GeForce GTX 1070.

Greger
Message 57093 - Posted: 3 Jul 2021 | 16:30:06 UTC - in response to Message 57089.

It's not real. I've manipulated the coproc_info file to report what I want.

The actual driver in use is 460.84.


OK. The reason I ask is that the device name shows as unknown for my 3080 Ti, and I had some hope that the driver you used would fix that.
Then I could go into the coproc file and edit it instead.

Ian&Steve C.
Message 57094 - Posted: 3 Jul 2021 | 16:53:19 UTC - in response to Message 57093.
Last modified: 3 Jul 2021 | 16:54:38 UTC

It's not real. I've manipulated the coproc_info file to report what I want.

The actual driver in use is 460.84.


OK. The reason I ask is that the device name shows as unknown for my 3080 Ti, and I had some hope that the driver you used would fix that.
Then I could go into the coproc file and edit it instead.


What driver are you using? The 3080 Ti won't be detected until driver 460.84; anything older will not know what GPU that is.

Aurum
Message 57095 - Posted: 3 Jul 2021 | 17:05:39 UTC

Greger, I just can't get my head around what it means. So out of the 8.64E13 ns in a day, you only simulate 159 ns??? I'm not familiar with that figure of merit.

BTW, my 3080 is running 465.31. Still waiting to catch a WU after the PPA, reboot & reset.

Richard Haselgrove
Message 57096 - Posted: 3 Jul 2021 | 17:51:03 UTC - in response to Message 57095.

The nanoseconds will be the biochemical reaction time that we're modelling - very, very slowly - in a digital simulation.

Ian&Steve C.
Message 57097 - Posted: 3 Jul 2021 | 18:00:23 UTC - in response to Message 57095.

Greger, I just can't get my head around what it means. So out of the 8.64E13 ns in a day, you only simulate 159 ns??? I'm not familiar with that figure of merit.

BTW, my 3080 is running 465.31. Still waiting to catch a WU after the PPA, reboot & reset.


Aren’t you big into folding? ns/day is a very common metric for measuring computation speed in molecular modeling.


Greger
Message 57098 - Posted: 3 Jul 2021 | 18:47:55 UTC - in response to Message 57094.
Last modified: 3 Jul 2021 | 18:51:42 UTC

It's not real. I've manipulated the coproc_info file to report what I want.

The actual driver in use is 460.84.


OK. The reason I ask is that the device name shows as unknown for my 3080 Ti, and I had some hope that the driver you used would fix that.
Then I could go into the coproc file and edit it instead.


What driver are you using? The 3080 Ti won't be detected until driver 460.84; anything older will not know what GPU that is.


NVIDIA-SMI 465.27 Driver Version: 465.27 CUDA Version: 11.3

I could not use 460 for the 3080 Ti, so I had to move to the latest driver Ubuntu provided, which was this version.
boinc-client detects the name as:
Coprocessors NVIDIA NVIDIA Graphics Device (4095MB) driver: 465.27

I edited coproc_info.xml, but it doesn't change when I update the project, and if I restart boinc-client it wipes the file even if I change the driver version inside it.

Maybe I could lock the file to root only to take away BOINC's write permission, but I'd better not.

Ian&Steve C.
Message 57099 - Posted: 3 Jul 2021 | 18:53:41 UTC - in response to Message 57098.

You need driver 460.84 for the 3080 Ti. You can use that one.

You can also use 465.31, but that driver is about a month older; 460.84 will be better unless you absolutely need some feature from the 465 branch.

Greger
Message 57100 - Posted: 3 Jul 2021 | 18:56:34 UTC - in response to Message 57095.
Last modified: 3 Jul 2021 | 18:59:50 UTC

Greger, I just can't get my head around what it means. So out of the 8.64E13 ns in a day, you only simulate 159 ns??? I'm not familiar with that figure of merit.

BTW, my 3080 is running 465.31. Still waiting to catch a WU after the PPA, reboot & reset.


As mentioned before, it's the amount of simulated time the device can generate, but you need to take into account the complexity of the simulation: the number of atoms has a big effect on it, as possibly do other parameters of the modelling event.

Think of it as a box with x, y and z dimensions: it builds up the protein from atoms and then makes a fold of it. As a total result, it is a very, very short event.

There was a free tool before, possibly still available today, that you could use to open the resulting data directly after a run was done; users have done this at Folding@home and posted it in the forums.

Not sure if such a tool is free for ACEMD.

Greger
Message 57101 - Posted: 3 Jul 2021 | 19:22:23 UTC - in response to Message 57099.

You need driver 460.84 for the 3080 Ti. You can use that one.

You can also use 465.31, but that driver is about a month older; 460.84 will be better unless you absolutely need some feature from the 465 branch.


OK, thanks.

Aurum
Message 57102 - Posted: 3 Jul 2021 | 19:29:54 UTC

Yea, snagged a WU and it's running. My guesstimate is 19:44:13 on my 3080 dialed down to 230 watts. A record-breaking long heat wave here and summer peak time-of-use electric rates (8.5x higher) have started. Summer is not BOINC season in the Great Basin.

Rxn time, now that makes sense. Thx.

The Linux Mint repository offers 465.31 and 460.84. Is it actually worth reverting to 460.84??? I wouldn't do it until after this WU completes anyway.

Ian&Steve C.
Message 57103 - Posted: 3 Jul 2021 | 19:56:15 UTC - in response to Message 57102.

The Linux Mint repository offers 465.31 and 460.84. Is it actually worth reverting to 460.84??? I wouldn't do it until after this WU completes anyway.


Probably won't matter if the driver you have is working; I don't expect any performance difference between the two. I was just saying that I would use the more recent non-beta driver if I were updating, unless you need some feature in the 465 branch specifically.

Ian&Steve C.
Message 57104 - Posted: 3 Jul 2021 | 20:00:01 UTC

Second 3080 Ti task completed, in 11 hrs:

http://gpugrid.net/result.php?resultid=32632580

Greger
Message 57105 - Posted: 3 Jul 2021 | 20:59:16 UTC

A peak of 28.9°C here today, so I suspend during the daytime after 2 tasks are done.
I run evenings and nights these days when the temperature is high. Ambient temperature was above 35°C inside, and the fan went up to 80% on the GPU I checked.

So I managed to get to 460.84 after a few rounds of remove and --purge nvidia*. Apparently there was a libnvidia-compute package left over that held it back.

It got the name correct but detects the VRAM wrong (4095MB). Let's see if it works.

ServicEnginIC
Message 57106 - Posted: 3 Jul 2021 | 21:32:29 UTC

Just bear in mind that any change of NVIDIA driver version while a GPUGrid task is in progress will cause the task to fail when computing restarts.
Commented on in message #56909.

Ian&Steve C.
Message 57107 - Posted: 3 Jul 2021 | 21:58:34 UTC - in response to Message 57105.

A peak of 28.9°C here today, so I suspend during the daytime after 2 tasks are done.
I run evenings and nights these days when the temperature is high. Ambient temperature was above 35°C inside, and the fan went up to 80% on the GPU I checked.

So I managed to get to 460.84 after a few rounds of remove and --purge nvidia*. Apparently there was a libnvidia-compute package left over that held it back.

It got the name correct but detects the VRAM wrong (4095MB). Let's see if it works.


The VRAM being reported wrong is not because of the driver; it's a problem with BOINC. BOINC uses a detection technique that is only 32-bit, hence the 4 GB (4095 MB) cap. This can only be fixed by fixing the code in BOINC.

Greger
Message 57108 - Posted: 3 Jul 2021 | 23:18:13 UTC - in response to Message 57107.
Last modified: 3 Jul 2021 | 23:50:23 UTC

I went back to my host and the driver had crashed: nvidia-smi was unable to open, and a task failed on another project. I restarted it and it's back on track. A few minutes later it fetched a new task from GPUGrid. Let's hope it does not crash again.

https://www.gpugrid.net/result.php?resultid=32634065

# Speed: average 225.91 ns/day, current 226.09 ns/day
That's more like it; this is much better than my 3070 and 3060 Ti got.

Ian&Steve C.
Message 57109 - Posted: 3 Jul 2021 | 23:32:42 UTC - in response to Message 57108.

GPU detection is handled by BOINC, not any individual projects.

Driver updates always require a reboot to take effect.

ServicEnginIC
Message 57110 - Posted: 4 Jul 2021 | 9:09:53 UTC

Finally, my first result from a new version 2.12 task came out on my fastest card:
e4s126_e3s248p0f238-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-0-2-RND6347_7
It took 141,948 seconds of total processing time; that is, 1 day 15 hours 25 minutes and 48 seconds.
The predicted time in the table shown in message #57091 was 142,074 seconds, estimated at 61.439% done.
There is a slight difference of 126 seconds between estimated and true execution time: a 0.09% deviation.
For me, that is approximate enough, and it validates Ian&Steve C.'s theory that progress for these tasks is quite linear along their execution.

Greger
Message 57111 - Posted: 4 Jul 2021 | 10:08:18 UTC

Comparing the old and new app on a 2070S:

Old:
52,930.87 s - New version of ACEMD v2.11 (cuda100)
WU 27069210 e130s1888_e70s25p0f44-ADRIA_D3RBandit_batch_nmax5000-0-1-RND2852_1

New:
80,484.11 s - New version of ACEMD v2.12 (cuda1121)
WU 27077230 e5s177_e4s56p0f117-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND4081_4

Not sure if the size of the units has grown that much, to be able to compare them directly.

Aurum
Message 57112 - Posted: 4 Jul 2021 | 13:40:41 UTC - in response to Message 57102.
Last modified: 4 Jul 2021 | 13:46:54 UTC

My guesstimate is 19:44:13 on my 3080 dialed down to 230 watts.

16:06:54
https://www.gpugrid.net/workunit.php?wuid=27077289

ServicEnginIC
Message 57113 - Posted: 4 Jul 2021 | 14:23:15 UTC

At this moment, every one of my 7 currently working GPUs has a new version 2.12 task in process.
Two tasks received today completed the quota:
Task e4s120_e3s763p0f798-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND9850_3, from WU #27076712
Task e5s90_e4s138p0f962-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND6130_4, from WU #27077322
Something to remark: these two tasks are resends of previously failed tasks with the following known problem:

acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory

A chance to remember that there is a remedy for this problem, described in message #57064 in this same thread.

One last update of estimated times to completion on my GPUs:

[screenshot: table of estimated completion times per GPU]
An editable version of the spreadsheet used can be downloaded from this link
Changes since the previous version:
- Lines for two more GPUs added.
- A new cell added for seconds to D:H:M:S conversion.

Aurum
Message 57114 - Posted: 4 Jul 2021 | 14:50:02 UTC - in response to Message 57081.
Last modified: 4 Jul 2021 | 14:52:34 UTC

The number of usable CUDA cores in the 30xx series appears to be half the advertised number (just as I expected): 10240/2 = 5120, and 5120/4352 = 1.1765, so the 3080 Ti has 17.65% more usable CUDA cores than the 2080 Ti, and its CUDA cores are about 1.4% faster than the 2080 Ti's.


Does using half of the CUDA cores have implications for BOINCing?
GG+OPNG at <cpu_usage>1.0</cpu_usage> & <gpu_usage>0.5</gpu_usage> works fine.
GG+DaggerHashimoto crashes GG instantly.
I hope to try 2xGG today.

Retvari Zoltan
Message 57115 - Posted: 4 Jul 2021 | 15:06:42 UTC - in response to Message 57114.

Does using half of the CUDA cores have implications for BOINCing?
GG+OPNG at <cpu_usage>1.0</cpu_usage> & <gpu_usage>0.5</gpu_usage> works fine.
GG+DaggerHashimoto crashes GG instantly.
I hope to try 2xGG today.
You can't utilize the "extra" CUDA cores by running a second task (regardless of the project).
The 30xx series improved the gaming experience much more than crunching performance.

Ian&Steve C.
Message 57116 - Posted: 4 Jul 2021 | 15:10:22 UTC - in response to Message 57114.

The number of usable CUDA cores in the 30xx series appears to be half the advertised number (just as I expected): 10240/2 = 5120, and 5120/4352 = 1.1765, so the 3080 Ti has 17.65% more usable CUDA cores than the 2080 Ti, and its CUDA cores are about 1.4% faster than the 2080 Ti's.


Does using half of the CUDA cores have implications for BOINCing?
GG+OPNG at <cpu_usage>1.0</cpu_usage> & <gpu_usage>0.5</gpu_usage> works fine.
GG+DaggerHashimoto crashes GG instantly.
I hope to try 2xGG today.


I think you misunderstand what's happening.

Running 2x GPUGRID tasks concurrently won't make it "use more"; it'll just slow both down, probably to slower than half speed due to the constant resource fighting.

If GPUGRID isn't seeing the effective 2x FP32 benefit of Ampere over Turing, that tells me one of two things (or maybe some combination of both):
1. The app isn't as FP32-heavy as some have implied, and maybe has a decent number of INT32 instructions; the INT32 setup of Ampere is the same as Turing's.
2. There is some additional optimization that needs to be applied to the ACEMD3 app to take better advantage of the extra FP32 cores on Ampere.

WMD
Message 57118 - Posted: 4 Jul 2021 | 15:46:13 UTC - in response to Message 57116.

If GPUGRID isn't seeing the effective 2x FP32 benefit of Ampere over Turing, that tells me one of two things (or maybe some combination of both):
1. The app isn't as FP32-heavy as some have implied, and maybe has a decent number of INT32 instructions; the INT32 setup of Ampere is the same as Turing's.
2. There is some additional optimization that needs to be applied to the ACEMD3 app to take better advantage of the extra FP32 cores on Ampere.

The way Ampere works is that half the cores are FP32, and the other half are either FP32 or INT32 depending on need. On Turing (and older), the INT32 half was always INT32. So you're probably right - either GPUGRID has some INT32 load that is using the cores instead, or some kind of application change is required to get it to use the other half.

Ian&Steve C.
Message 57119 - Posted: 4 Jul 2021 | 16:30:28 UTC - in response to Message 57118.
Last modified: 4 Jul 2021 | 16:32:20 UTC

I'm not convinced that the extra cores "aren't being used" at all, i.e. that the cores are sitting idle 100% of the time as a direct result of the architecture or something like that. I think both the application and the hardware are fully aware of the available cores/SMs; it's just that the application is coded in such a way that it can't take advantage of the extra resources, whether through optimization or through the number of INT instructions required.

NVIDIA's press notes do seem to show a 1.5x improvement in molecular-modeling load for the A100 vs the V100, so maybe the amount of INT calls is inherent to this kind of load anyway (granted, the A100 is based on the GA100 core, which is a different architecture without the shared FP/INT cores that double the FP cores on GA102).

But in the case of GPUGRID, I think it's just their application. On Folding, Ampere performs much closer to the claims, a 3070 being only a bit slower than a 2080 Ti, which is what I would expect.

Aurum
Message 57120 - Posted: 4 Jul 2021 | 16:33:22 UTC - in response to Message 57115.
Last modified: 4 Jul 2021 | 16:57:55 UTC

The 30xx series improved the gaming experience much more than crunching performance.

I'm thoroughly unimpressed by my 3080. Its performance does not scale with price, making it much more expensive for doing calculations. I'll probably test it for a few more days and then sell it.

I like to use a metric that's proportional to calculations and optimize calcs/watt. In the past, my experience has been that reducing max power improves performance. But since NVIDIA eliminated the nvidia-settings options -a [gpu:0]/GPUGraphicsClockOffset and -a [gpu:0]/GPUMemoryTransferRateOffset that I used, I haven't found a good way to do it on Linux. (See nvidia-settings -q all.)

It seems NVIDIA chooses a performance level, but I can't see how to force it to a desired one:
sudo DISPLAY=:0 XAUTHORITY=/var/run/lightdm/root/:0 nvidia-settings -q '[gpu:0]/GPUPerfModes'
3080: 0, 1, 2, 3 & 4
Attribute 'GPUPerfModes' (Rig-05:0[gpu:0]):
perf=0, nvclock=210, nvclockmin=210, nvclockmax=420, nvclockeditable=1, memclock=405, memclockmin=405, memclockmax=405, memclockeditable=1, memTransferRate=810, memTransferRatemin=810, memTransferRatemax=810, memTransferRateeditable=1 ;
perf=1, nvclock=210, nvclockmin=210, nvclockmax=2100, nvclockeditable=1, memclock=810, memclockmin=810, memclockmax=810, memclockeditable=1, memTransferRate=1620, memTransferRatemin=1620, memTransferRatemax=1620, memTransferRateeditable=1 ;
perf=2, nvclock=240, nvclockmin=240, nvclockmax=2130, nvclockeditable=1, memclock=5001, memclockmin=5001, memclockmax=5001, memclockeditable=1, memTransferRate=10002, memTransferRatemin=10002, memTransferRatemax=10002, memTransferRateeditable=1 ;
perf=3, nvclock=240, nvclockmin=240, nvclockmax=2130, nvclockeditable=1, memclock=9251, memclockmin=9251, memclockmax=9251, memclockeditable=1, memTransferRate=18502, memTransferRatemin=18502, memTransferRatemax=18502, memTransferRateeditable=1 ;
perf=4, nvclock=240, nvclockmin=240, nvclockmax=2130, nvclockeditable=1, memclock=9501, memclockmin=9501, memclockmax=9501, memclockeditable=1, memTransferRate=19002, memTransferRatemin=19002, memTransferRatemax=19002, memTransferRateeditable=1

Nvidia has said, "The -a and -g arguments are now deprecated in favor of -q and -i, respectively. However, the old arguments still work for this release." It sounds like they're planning to reduce or eliminate customers' ability to control the products they buy.

Nvidia also eliminated GPULogoBrightness, so the baby-blinkie lights never turn off.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 158
Level
Trp
Scientific publications
watwatwat
Message 57121 - Posted: 4 Jul 2021 | 16:49:39 UTC - in response to Message 57116.

running 2x GPUGRID tasks concurrently won't make it "use more"; it'll just slow both down, probably to slower than half speed due to the constant resource fighting.

if GPUGRID isn't seeing the effective 2x benefit of Ampere over Turing, that tells me one of two things (or maybe some combination of both):
1. the app isn't as FP32-heavy as some have implied, and maybe has a decent amount of INT32 instructions. The INT32 setup of Ampere is the same as Turing's.
2. there is some additional optimization that needs to be applied to the ACEMD3 app to better take advantage of the extra FP32 cores on Ampere.

At less than 5% complete with two WUs running simultaneously and having started within minutes of each other:
WU1: 4840 sec at 4.7% implies 102978 sec total
WU2: 5409 sec at 4.6% implies 117587 sec total
From yesterday's singleton: 2 x 58014 sec = 116028 sec total if independent.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57122 - Posted: 4 Jul 2021 | 17:54:31 UTC - in response to Message 57121.

running 2x GPUGRID tasks concurrently won't make it "use more"; it'll just slow both down, probably to slower than half speed due to the constant resource fighting.

if GPUGRID isn't seeing the effective 2x benefit of Ampere over Turing, that tells me one of two things (or maybe some combination of both):
1. the app isn't as FP32-heavy as some have implied, and maybe has a decent amount of INT32 instructions. The INT32 setup of Ampere is the same as Turing's.
2. there is some additional optimization that needs to be applied to the ACEMD3 app to better take advantage of the extra FP32 cores on Ampere.

At less than 5% complete with two WUs running simultaneously and having started within minutes of each other:
WU1: 4840 sec at 4.7% implies 102978 sec total
WU2: 5409 sec at 4.6% implies 117587 sec total
From yesterday's singleton: 2 x 58014 sec = 116028 sec total if independent.


My point exactly: roughly half speed, with no real benefit to running multiples. Pushing your completion time to 32 hours will only reduce your credit reward, since you'll be bumped out of the +50% bonus for returning within 24 hours.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57123 - Posted: 4 Jul 2021 | 18:16:56 UTC - in response to Message 57120.

Aurum wrote:
But since Nvidia eliminated the nvidia-settings options -a [gpu:0]/GPUGraphicsClockOffset & -a [gpu:0]/GPUMemoryTransferRateOffset that I used, I haven't found a good way to do it on Linux.


These options still work; I use them on my 3080 Ti. Not sure what you mean?

This is exactly what I use for my 3080 Ti (same on my Turing hosts):


# enable persistence mode so settings survive when no client is attached
/usr/bin/nvidia-smi -pm 1
# allow non-root changes to application clocks
/usr/bin/nvidia-smi -acp UNRESTRICTED

# set the power limit for GPU 0 to 320 W
/usr/bin/nvidia-smi -i 0 -pl 320

# PowerMizer: prefer maximum performance
/usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"

# +500 memory transfer rate offset and +100 core clock offset at performance level 4
/usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[4]=500" -a "[gpu:0]/GPUGraphicsClockOffset[4]=100"


It works as desired.

Aurum wrote:
It seems Nvidia chooses a performance level but I can't see how to force it to a desired level:


What do you mean by "performance level"? If you mean forcing a certain P-state, no, you can't do that, and these cards will not go into the P0 state unless you're running a 3D application; any compute application gets P2 at best. This has been the case ever since Maxwell, and workarounds to force the P0 state stopped working with Pascal, so this isn't new.

If you mean the PowerMizer preferred mode (which is analogous to the power settings in Windows), you can select that easily on Linux too. I always run mine at "prefer maximum performance", set with the following command:

/usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"


I'm unsure if this really makes much difference, though, beyond increasing idle power consumption (by forcing higher clocks); the GPU seems to detect loads properly and clock up even when left on the default "Auto" selection.
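
For anyone who wants to verify this, nvidia-smi can report the current P-state directly; it's a read-only query, so it's safe to run while a task is crunching (query fields as listed in nvidia-smi --help-query-gpu):

/usr/bin/nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem --format=csv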

Aurum wrote:
Nvidia also eliminated GPULogoBrightness so the baby-blinkie lights never turn off.

I'm not sure this was intentional; it's probably something that fell through the cracks and that not enough people have complained about for them to dedicate resources to fixing. There's no gain for Nvidia in disabling this function. But again, this stopped working with Turing, so it's been this way for about 3 years; it isn't something new. I have mostly EVGA cards, so when I want to mess with the lighting, I just throw the card on my test bench, boot into Windows, change the LED settings there, and then put it back in the crunching rig. The settings are preserved inside the card (for my cards, at least), so it stays at whatever I left it as. You can probably do the same.

____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 158
Level
Trp
Scientific publications
watwatwat
Message 57124 - Posted: 4 Jul 2021 | 18:26:11 UTC

It sure does not look like running multiple GG WUs on the same GPU has any benefit.
My 3080 is stuck in P2. I'd like to try it at perf levels 3 and 4 (from the GPUPerfModes list above), but I can't make it change. I tried:
nvidia-smi -lmc 9251
Memory clocks set to "(memClkMin 9501, memClkMax 9501)" for GPU 00000000:65:00.0
All done.
nvidia-smi -lgc 240,2130
GPU clocks set to "(gpuClkMin 240, gpuClkMax 2130)" for GPU 00000000:65:00.0
All done.

But it's still in P2.
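
For reference, undoing those locks should just be the matching reset flags (assuming my driver version supports them; untested so far):

# release the clocks pinned with -lgc / -lmc back to driver control
sudo nvidia-smi -rgc
sudo nvidia-smi -rmc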

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 307
Credit: 10,279,318,231
RAC: 158
Level
Trp
Scientific publications
watwatwat
Message 57125 - Posted: 4 Jul 2021 | 18:34:34 UTC - in response to Message 57123.

Aurum wrote:
But since Nvidia eliminated the nvidia-settings options -a [gpu:0]/GPUGraphicsClockOffset & -a [gpu:0]/GPUMemoryTransferRateOffset that I used, I haven't found a good way to do it on Linux.
These options still work; I use them on my 3080 Ti. Not sure what you mean?

This is exactly what I use for my 3080 Ti (same on my Turing hosts):
/usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[4]=500" -a "[gpu:0]/GPUGraphicsClockOffset[4]=100"
It works as desired.

How do you prove to yourself that they work? They don't even exist any more. Run
nvidia-settings -q all | grep -C 10 -i GPUMemoryTransferRateOffset
and you will not find either of them.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57126 - Posted: 4 Jul 2021 | 18:44:18 UTC

But all the slightly off-topic discussion aside:

Getting the app working for Ampere was a great first step. It's been long awaited, the new app is much appreciated, and many more cards can now contribute to the project, especially with these newer long-running tasks. We need powerful cards to handle them.

I think the two priorities now should be:

1. Remedy the dependency on Boost: either include the necessary library in the package distributed to clients, or recompile the app with Boost statically linked. Otherwise, only hosts whose owners recognize the problem and know how to manually install the proper Boost package will be able to contribute. (A rough sketch follows below.)

2. Investigate the cause of, and provide a remedy for, the ~30% slowdown in application performance compared to the older cuda100 app. This isn't just affecting Ampere; it seems to affect all GPUs equally. Maybe some optimization flag was omitted, or some undesirable or unintended change was made to the code. Just moving from cuda100 to cuda1121 should not in itself have caused this if there were no other code changes. You sometimes see slight performance changes of 1-2%, but a 30% reduction is a sign that something is clearly wrong.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57127 - Posted: 4 Jul 2021 | 18:54:32 UTC - in response to Message 57125.
Last modified: 4 Jul 2021 | 18:56:04 UTC

Aurum wrote:
But since Nvidia eliminated the nvidia-settings options -a [gpu:0]/GPUGraphicsClockOffset & -a [gpu:0]/GPUMemoryTransferRateOffset that I used, I haven't found a good way to do it on Linux.
These options still work; I use them on my 3080 Ti. Not sure what you mean?

This is exactly what I use for my 3080 Ti (same on my Turing hosts):
/usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[4]=500" -a "[gpu:0]/GPUGraphicsClockOffset[4]=100"
It works as desired.

How do you prove to yourself that they work? They don't even exist any more. Run
nvidia-settings -q all | grep -C 10 -i GPUMemoryTransferRateOffset
and you will not find either of them.


I prove they work by opening Nvidia X Server Settings and observing that the clock speed offsets have been changed in accordance with the commands, which run without any error. The commands work 100%. I see you're referencing some other command; I don't know what the command you're trying to use does, but mine works.

see for yourself:
https://i.imgur.com/UFHbhNt.png
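
For anyone who'd rather check from the terminal, the same attributes can be queried directly (attribute names assumed to match the assignment syntax above; on some drivers -q all doesn't seem to enumerate the per-performance-level variants, which may be why the grep comes up empty):

nvidia-settings -q "[gpu:0]/GPUGraphicsClockOffset[4]"
nvidia-settings -q "[gpu:0]/GPUMemoryTransferRateOffset[4]"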
____________

888
Send message
Joined: 28 Jan 21
Posts: 5
Credit: 25,191,104
RAC: 175,241
Level
Val
Scientific publications
wat
Message 57139 - Posted: 5 Jul 2021 | 12:12:35 UTC

I'm still getting the CUDA compiler permission denied error. I've added the PPA and installed libboost1.74 as above, and reset the project multiple times. But every downloaded task fails after 2 seconds.

http://www.gpugrid.net/result.php?resultid=32636087

I'm running Mint 20.1, with RTX 2070 and RTX 3070 cards on 465.31 drivers.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57140 - Posted: 5 Jul 2021 | 12:31:15 UTC - in response to Message 57139.

I'm still getting the CUDA compiler permission denied error. I've added the PPA and installed libboost1.74 as above, and reset the project multiple times. But every downloaded task fails after 2 seconds.

http://www.gpugrid.net/result.php?resultid=32636087

I'm running Mint 20.1, with RTX 2070 and RTX 3070 cards on 465.31 drivers.


How did you install the drivers? Have you ever installed the CUDA toolkit? That was my problem: if you have a CUDA toolkit installed, remove it. To be safe, I would also totally purge the nvidia drivers and reinstall fresh.
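
On an Ubuntu-based distro like Mint, the purge-and-reinstall usually looks something like this (the driver version is just an example; install whatever your PPA offers):

sudo apt-get purge "nvidia-*" "libnvidia-*"
sudo apt-get autoremove
sudo apt-get install nvidia-driver-465
sudo reboot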
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 769
Credit: 3,402,889,227
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57142 - Posted: 5 Jul 2021 | 13:01:27 UTC - in response to Message 57126.

Ian&Steve C. wrote:

Getting the app working for Ampere was a great first step. It's been long awaited, the new app is much appreciated, and many more cards can now contribute to the project, especially with these newer long-running tasks. We need powerful cards to handle them.

I think the two priorities now should be:

1. Remedy the dependency on Boost: either include the necessary library in the package distributed to clients, or recompile the app with Boost statically linked. Otherwise, only hosts whose owners recognize the problem and know how to manually install the proper Boost package will be able to contribute.

2. Investigate the cause of, and provide a remedy for, the ~30% slowdown in application performance compared to the older cuda100 app. ...

And last, but not least: an app for Windows would be nice :-)

888
Send message
Joined: 28 Jan 21
Posts: 5
Credit: 25,191,104
RAC: 175,241
Level
Val
Scientific publications
wat
Message 57143 - Posted: 5 Jul 2021 | 13:31:53 UTC - in response to Message 57140.

I'm still getting the CUDA compiler permission denied error. I've added the PPA and installed libboost1.74 as above, and reset the project multiple times. But every downloaded task fails after 2 seconds.

http://www.gpugrid.net/result.php?resultid=32636087

I'm running Mint 20.1, with RTX 2070 and RTX 3070 cards on 465.31 drivers.


How did you install the drivers? Have you ever installed the CUDA toolkit? That was my problem: if you have a CUDA toolkit installed, remove it. To be safe, I would also totally purge the nvidia drivers and reinstall fresh.



Thanks for the quick reply. I had CUDA toolkit version 10 installed, but after seeing your previous post about your problem, I had already removed it. I'll try purging and reinstalling my nvidia drivers, thanks.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57145 - Posted: 5 Jul 2021 | 13:45:03 UTC - in response to Message 57143.

Did you use the included removal script to remove the toolkit, or did you manually delete some files? Definitely try the removal script if you haven't already. Good luck!
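
For reference, the runfile toolkit ships its own uninstaller; the name and path vary by release, so these are examples rather than exact commands:

sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl   # older runfile installs
sudo /usr/local/cuda/bin/cuda-uninstaller              # newer toolkits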
____________

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 25
Credit: 96,451,621
RAC: 38
Level
Thr
Scientific publications
watwatwatwatwatwatwatwat
Message 57147 - Posted: 5 Jul 2021 | 14:56:33 UTC - in response to Message 57126.

...
1. remedy the dependency on boost. either include the necessary library in the package distribution to clients, or recompile the app with boost statically linked. otherwise only those hosts who recognize the problem and know how to manually install the proper boost package will be able to contribute.
...

For those of us who are using the python app, the correct version is installed in the miniconda folder:
locate libboost_filesystem
/usr/lib64/libboost_filesystem-mt.so
/usr/lib64/libboost_filesystem.so
/usr/lib64/libboost_filesystem.so.1.76.0
/usr/lib64/cmake/boost_filesystem-1.76.0/libboost_filesystem-variant-shared.cmake
/var/lib/boinc/projects/www.gpugrid.net/miniconda/lib/libboost_filesystem.so
/var/lib/boinc/projects/www.gpugrid.net/miniconda/lib/libboost_filesystem.so.1.74.0
/var/lib/boinc/projects/www.gpugrid.net/miniconda/lib/cmake/boost_filesystem-1.74.0/libboost_filesystem-variant-shared.cmake
/var/lib/boinc/projects/www.gpugrid.net/miniconda/pkgs/boost-cpp-1.74.0-h312852a_4/lib/libboost_filesystem.so
/var/lib/boinc/projects/www.gpugrid.net/miniconda/pkgs/boost-cpp-1.74.0-h312852a_4/lib/libboost_filesystem.so.1.74.0
/var/lib/boinc/projects/www.gpugrid.net/miniconda/pkgs/boost-cpp-1.74.0-h312852a_4/lib/cmake/boost_filesystem-1.74.0/libboost_filesystem-variant-shared.cmake

I definitely don't want to downgrade my system version to run a project. Perhaps GPUGRID could include the libboost that they already supply for the other app.

Could the miniconda folder somehow be included in the app?
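
One untested sketch of that idea: expose the Boost copy the python app already ships to the dynamic linker, so the acemd3 binary can find it without touching the system Boost (path as in the locate output above; mixing conda libraries into the system search path can have side effects, so treat this as an experiment):

echo "/var/lib/boinc/projects/www.gpugrid.net/miniconda/lib" | sudo tee /etc/ld.so.conf.d/gpugrid-miniconda.conf
sudo ldconfig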

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 410
Credit: 2,022,240,642
RAC: 344,531
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57192 - Posted: 10 Jul 2021 | 8:02:28 UTC

Richard Haselgrove said in Message #57177:

Look at that timeout: host 528201. Oh, Mr. Kevvy, where art thou? 156 libboost errors? You can fix that...

Finally, Mr. Kevvy's host #537616 successfully processed these two tasks today:
e4s113_e1s796p0f577-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-1-2-RND7908_0
e5s9_e3s99p0f334-ADRIA_New_KIXcMyb_HIP_AdaptiveBandit-0-2-RND8007_4
If it was due to your fix, congratulations Mr. Kevvy, you've found the right way.

Or perhaps it was some fix to the tasks on the server side?
Hard to know until there are plenty of new tasks ready to send.
Currently (7:51:20 UTC) there are 0 tasks ready to send and 28 tasks in progress, as the Server status page shows.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1146
Credit: 3,283,608,315
RAC: 116,679
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57193 - Posted: 10 Jul 2021 | 8:11:36 UTC - in response to Message 57192.

I got a note back from Mr. K - he saw the errors and was going to check his machines. I imagine he's applied Ian's workaround.

Curing the world's diseases, one computer at a time. It would be better if that bug could be fixed at source, for a universal cure.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 410
Credit: 2,022,240,642
RAC: 344,531
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57222 - Posted: 22 Jul 2021 | 21:39:41 UTC

On July 3rd, 2021, Ian&Steve C. wrote in Message #57087:

But it’s not just 3000-series being slow. All cards seem to be proportionally slower with 11.2 vs 10.0, by about 30%

While organizing screenshots from one of my hosts, I happened to find comparable images for tasks of the old Linux APP V2.11 (CUDA 10.0) and the new APP V2.12 (CUDA 11.2).

* ACEMD V2.11 tasks on 14/06/2021:


* ACEMD V2.12 task on 20/07/2021:


Pay attention to device 0, the only comparable one.
- ACEMD V2.11 task: 08:10:18 = 29418 seconds elapsed to process 15.04%. Extrapolating, this leads to 195598 seconds of total processing time (2d 06:19:58).
- ACEMD V2.12 task: 3d 02:51:01 = 269461 seconds elapsed to process 96.48%. Extrapolating, this leads to 279292 seconds of total processing time (3d 05:34:52).
That is, about 42.8% excess processing time for this particular host and device 0 (a GTX 1650 GPU).
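
The extrapolation is easy to reproduce from the elapsed-time/percentage pairs (plain awk arithmetic):

awk 'BEGIN { printf "%.0f\n", 29418 / 0.1504 }'               # ≈ 195598 s, V2.11
awk 'BEGIN { printf "%.0f\n", 269461 / 0.9648 }'              # ≈ 279292 s, V2.12
awk 'BEGIN { printf "%.1f%%\n", (279292/195598 - 1) * 100 }'  # ≈ 42.8% excess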

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1146
Credit: 3,283,608,315
RAC: 116,679
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57223 - Posted: 23 Jul 2021 | 10:04:15 UTC - in response to Message 57222.

Also bear in mind that your first screenshot shows a D3RBandit task, and your second shows an AdaptiveBandit task.

They are different, and not directly comparable. How much of the observed slowdown is down to the data/algorithm, and how much is down to the new application, will need further examples to unravel.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 410
Credit: 2,022,240,642
RAC: 344,531
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57225 - Posted: 23 Jul 2021 | 13:12:06 UTC - in response to Message 57223.
Last modified: 23 Jul 2021 | 13:13:05 UTC

Also bear in mind that your first screenshot shows a D3RBandit task, and your second shows an AdaptiveBandit task.

A bright observer, and a sharp observation, as always.
I agree that the tasks probably aren't fully comparable, but they are the most comparable I could find: same host, same device, same ADRIA WU family, same base credit granted (450000)...
Now I'm waiting for the next move, and wondering what it will consist of: an amended V2.12 APP? A new V2.13 APP? A "superstition-proof" new V2.14 APP? ... ;-)

RJ The Bike Guy
Send message
Joined: 2 Apr 20
Posts: 18
Credit: 30,906,033
RAC: 48,725
Level
Val
Scientific publications
wat
Message 57230 - Posted: 4 Aug 2021 | 2:35:51 UTC

Is GPUGRID still doing anything? I haven't gotten any work in a month or more, and before that it was just sporadic. I used to always have work units; now, nothing.

Profile Bill F
Avatar
Send message
Joined: 21 Nov 16
Posts: 10
Credit: 14,326,619
RAC: 0
Level
Pro
Scientific publications
wat
Message 57231 - Posted: 4 Aug 2021 | 7:53:39 UTC

I am not receiving Windows tasks anymore. My configuration is:
BOINC 7.16.11, GenuineIntel Intel(R) Xeon(R) CPU E5620 @ 2.40GHz [Family 6 Model 44 Stepping 2] (4 processors)

NVIDIA GeForce GTX 1060 6GB (4095MB), driver 461.40

Microsoft Windows 10 Professional x64 Edition (10.00.19043.00)

Am I still within spec to get Windows acemd3 work?

Thanks
Bill F
____________
In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.


Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57232 - Posted: 4 Aug 2021 | 13:43:59 UTC - in response to Message 57231.

There hasn't been an appreciable amount of work available for over a month.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 769
Credit: 3,402,889,227
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57233 - Posted: 5 Aug 2021 | 12:32:19 UTC - in response to Message 57232.

There hasn't been an appreciable amount of work available for over a month.

:-( :-( :-(

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 410
Credit: 2,022,240,642
RAC: 344,531
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57234 - Posted: 5 Aug 2021 | 15:08:38 UTC - in response to Message 57233.

There hasn't been an appreciable amount of work available for over a month.

:-( :-( :-(

Currently it's as if the GPUGRID project were hibernating.
From time to time, when the tasks in progress reach zero, some automated process (?) launches 20 more CRYPTICSCOUT_pocket_discovery WUs. But lately only for Linux systems, and with the known problems still unsolved.

Waiting for everything to wake up again soon...

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 733
Credit: 1,008,957,258
RAC: 257,491
Level
Met
Scientific publications
watwatwatwatwat
Message 57235 - Posted: 5 Aug 2021 | 20:22:36 UTC

Yes, that is all I've been getting lately. I had 4 CRYPTICSCOUT_pocket_discovery tasks 4 days ago, and I got 2 more today.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 733
Credit: 1,008,957,258
RAC: 257,491
Level
Met
Scientific publications
watwatwatwatwat
Message 57236 - Posted: 6 Aug 2021 | 15:48:40 UTC

Another two today.

Erich56
Send message
Joined: 1 Jan 15
Posts: 769
Credit: 3,402,889,227
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwat
Message 57237 - Posted: 6 Aug 2021 | 16:24:25 UTC

What I don't understand is that there is no word whatsoever from the project team about even a tentative schedule :-(
Will there be new tasks available in, say, 1 week, 1 month, 3 months...?
Will there be a new app that covers Ampere cards too (for both Linux and Windows)?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 465
Credit: 4,437,299,830
RAC: 8,821
Level
Arg
Scientific publications
wat
Message 57238 - Posted: 6 Aug 2021 | 16:33:58 UTC

Still no tasks on the two hosts that have been having some issue getting work since the new app was released. I've set NNT on all other hosts in order to funnel any available work to these two.

It's like they've been shadow-banned or something: even when work is available, they get the message that no work is available. After an entire month, these hosts should have picked up at least one task.
____________

Profile Bill F
Avatar
Send message
Joined: 21 Nov 16
Posts: 10
Credit: 14,326,619
RAC: 0
Level
Pro
Scientific publications
wat
Message 57239 - Posted: 7 Aug 2021 | 1:08:56 UTC

Well, I stepped out on a limb and emailed the Principal Investigator listed for the project, and the University, directly regarding the lack of any communication.

If you see a puff of smoke in the Dallas, TX area and the user count goes down by one, you will know that I was hit by a lightning bolt or a small thermonuclear device.

Bill F
Dallas
____________
In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.


Jim1348
Send message
Joined: 28 Jul 12
Posts: 787
Credit: 1,560,573,721
RAC: 11
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57243 - Posted: 7 Aug 2021 | 15:04:28 UTC - in response to Message 57239.

It worked! And Texas is still there, the last time I checked.
http://www.gpugrid.net/forum_thread.php?id=5246

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 25
Credit: 96,451,621
RAC: 38
Level
Thr
Scientific publications
watwatwatwatwatwatwatwat
Message 57247 - Posted: 8 Aug 2021 | 19:49:07 UTC - in response to Message 57192.
Last modified: 8 Aug 2021 | 19:59:54 UTC

Moved to libboost thread.
