Message boards : News : New multicore app and WUs

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 585
Credit: 4,273,184
RAC: 0
Message 48127 - Posted: 10 Nov 2017 | 14:58:07 UTC
Last modified: 10 Nov 2017 | 15:10:10 UTC

Dears,

we would like to test our new CPU multicore application for quantum chemistry tasks ("QC"). Since it’s the first time we have a CPU app out, I’ll test the behavior of GPUGRID with a relatively large batch that you will see soon. Workunits are named "*QC309big*".

Here are some features of the app, in short (subject to change):

* Platform: Linux only for now, generic x64.
* Threads: as many as BOINC decides. I guess it depends on your machine, your preferences, and other running tasks in ways which are obscure to me…
* Run time: about 1 CPU hour per WU (so shorter in wall-clock time when multithreaded)
* Credit: computed with the default algorithm (tasks are short, don’t expect much). Bonus mechanism for fast turnaround is still on.
* Known bugs: restarts and checkpoints. This should be mitigated with the “keep in memory when suspended” option. Sorry about that, it’s outside of our control.
* Network behavior: the first time you get a WU of this kind it downloads a Python interpreter (miniconda) and then some open-source packages, and installs them in the project directory. The installation is reused whenever possible.
* Disk usage: could go around 1 GB, perhaps more when tasks are running. Resetting the project should remove everything.
* Memory usage: should be around 1 GB when running.

Depending on the results of this test, we’ll start thinking about other platforms.

Thanks and nice crunching!

Toni

Sergey Kovalchuk
Joined: 18 Feb 16
Posts: 5
Credit: 1,009,912
RAC: 228
Message 48130 - Posted: 10 Nov 2017 | 15:37:26 UTC - in response to Message 48127.

The client does not receive WUs, although there are almost a thousand of them and the host meets the requirements (Linux x64). Earlier, this host was able to receive test tasks for QC and Python.

Please post the exact requirements (memory, disk, OS) specified when generating the tasks.

Toni
Message 48131 - Posted: 10 Nov 2017 | 15:41:06 UTC - in response to Message 48130.
Last modified: 10 Nov 2017 | 15:44:30 UTC

Can you check which applications you are accepting in your preferences?

By the way, the workunit requirements are currently as follows:


<rsc_fpops_est>3e12</rsc_fpops_est>
<rsc_fpops_bound>250e15</rsc_fpops_bound>
<rsc_disk_bound>4e9</rsc_disk_bound>
<rsc_memory_bound>1e9</rsc_memory_bound>

Sergey Kovalchuk
Message 48132 - Posted: 10 Nov 2017 | 16:05:30 UTC - in response to Message 48131.

All apps selected & "accept work from other"


Preferences:
max memory usage when active: 1900.76MB
max memory usage when idle: 1980.80MB
max disk usage: 6.71GB (4.47 free)
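
A quick sanity check of these numbers against the bounds posted in Message 48131, as a rough Python sketch. It assumes that only memory and free disk are compared against rsc_memory_bound and rsc_disk_bound, which is surely a simplification of what the real scheduler does. The helper names `parse_bounds` and `host_fits` are made up for illustration, and the unit conversion (1 MB = 1e6 bytes, 1 GB = 1e9 bytes) is an assumption too.

```python
import xml.etree.ElementTree as ET

# Bounds as posted in Message 48131, wrapped in a root element so they parse.
WU_XML = """<workunit>
<rsc_fpops_est>3e12</rsc_fpops_est>
<rsc_fpops_bound>250e15</rsc_fpops_bound>
<rsc_disk_bound>4e9</rsc_disk_bound>
<rsc_memory_bound>1e9</rsc_memory_bound>
</workunit>"""


def parse_bounds(xml_text):
    """Return the rsc_* values as floats, keyed by tag name."""
    root = ET.fromstring(xml_text)
    return {child.tag: float(child.text) for child in root}


def host_fits(bounds, mem_bytes, free_disk_bytes):
    """Naive check: the host must offer at least the requested memory and disk."""
    return (mem_bytes >= bounds["rsc_memory_bound"]
            and free_disk_bytes >= bounds["rsc_disk_bound"])


bounds = parse_bounds(WU_XML)
# Assumption: decimal units; BOINC may well use binary units internally.
fits = host_fits(bounds, mem_bytes=1900.76e6, free_disk_bytes=4.47e9)
print("host satisfies memory and disk bounds:", fits)
```

If this naive comparison holds, memory and disk alone would not explain the missing work, though the free-disk margin is small.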

Toni
Message 48133 - Posted: 10 Nov 2017 | 16:09:28 UTC - in response to Message 48132.

Another boinc mystery...

Toni
Message 48134 - Posted: 10 Nov 2017 | 16:44:59 UTC - in response to Message 48133.

Jobs only seem to go to a subset of eligible machines. If anybody out there has a clue as to the reason, I'll be glad to hear it.

klepel
Joined: 23 Dec 09
Posts: 135
Credit: 1,805,959,720
RAC: 1,485,046
Message 48135 - Posted: 10 Nov 2017 | 17:25:12 UTC

All error out with this:
Stderr output

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
12:19:41 (31019): wrapper (7.7.26016): starting
12:19:41 (31019): wrapper (7.7.26016): starting
12:19:41 (31019): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)
Python 3.6.3 :: Anaconda, Inc.
12:19:49 (31019): miniconda-installer exited; CPU time 6.649529
12:19:49 (31019): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)
12:19:59 (31019): $PROJECT_DIR/miniconda/bin/python exited; CPU time 7.101246
12:19:59 (31019): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4 (-n 14 -i psi4.in -o psi4.out)
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 3: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: readlink: not found
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 9: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: /bin/psi4.bin: not found
12:20:00 (31019): $PROJECT_DIR/miniconda/bin/psi4 exited; CPU time 0.001541
12:20:00 (31019): app exit status: 0x7f
12:20:00 (31019): called boinc_finish(195)

</stderr_txt>
]]>
It is this computer:
http://www.gpugrid.net/show_host_detail.php?hostid=420971

Stoneageman
Joined: 25 May 09
Posts: 211
Credit: 12,084,430,221
RAC: 11,938,233
Message 48136 - Posted: 10 Nov 2017 | 17:53:41 UTC

All error out after a few seconds on AMD and Intel machines

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
17:27:46 (14006): wrapper (7.7.26016): starting
17:27:46 (14006): wrapper (7.7.26016): starting
17:27:46 (14006): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)
Python 3.6.3 :: Anaconda, Inc.
17:27:54 (14006): miniconda-installer exited; CPU time 6.648000
17:27:54 (14006): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)
17:28:05 (14006): $PROJECT_DIR/miniconda/bin/python exited; CPU time 7.584000
17:28:05 (14006): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4 (-n 15 -i psi4.in -o psi4.out)
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 3: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: readlink: not found
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 9: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: /bin/psi4.bin: not found
17:28:06 (14006): $PROJECT_DIR/miniconda/bin/psi4 exited; CPU time 0.000000
17:28:06 (14006): app exit status: 0x7f
17:28:06 (14006): called boinc_finish(195)

</stderr_txt>
]]>

[AF] fansyl
Joined: 26 Sep 13
Posts: 1
Credit: 550,411,904
RAC: 887,038
Message 48137 - Posted: 10 Nov 2017 | 18:09:57 UTC
Last modified: 10 Nov 2017 | 18:11:15 UTC

Hello,

error on my computer: Ubuntu mate 16.04/kernel 4.13.11/Ryzen 5 1400

Stderr output

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
19:00:23 (31619): wrapper (7.7.26016): starting
19:00:23 (31619): wrapper (7.7.26016): starting
19:00:23 (31619): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda)
Python 3.6.3 :: Anaconda, Inc.
19:00:33 (31619): miniconda-installer exited; CPU time 8.382948
19:00:33 (31619): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)
19:03:37 (31619): $PROJECT_DIR/miniconda/bin/python exited; CPU time 63.497739
19:03:37 (31619): wrapper: running /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4 (-n 7 -i psi4.in -o psi4.out)
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 3: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: readlink: not found
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: 9: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/psi4: /bin/psi4.bin: not found
19:03:38 (31619): $PROJECT_DIR/miniconda/bin/psi4 exited; CPU time 0.002335
19:03:38 (31619): app exit status: 0x7f
19:03:38 (31619): called boinc_finish(195)

</stderr_txt>
]]>


Good luck with the debugging.

Toni
Message 48138 - Posted: 10 Nov 2017 | 18:21:56 UTC - in response to Message 48137.

Dears, all three errors mention a missing "readlink" executable. This is surprising, because it's a fairly basic command, but please check whether you can run "readlink" in a terminal. If it is not installed, it should be in the "coreutils" package.
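
For anyone who prefers to check this from Python, here is a small sketch of the lookup. "readlink: not found" is what a shell prints when the command cannot be resolved on its PATH, so the sketch compares the normal PATH with a restricted one. The restricted value `/usr/bin` is only an assumed example; nothing here tells us which PATH the BOINC wrapper actually passes to the script, and `find_on_path` is a hypothetical helper name.

```python
import shutil


def find_on_path(command, search_path=None):
    """Return the full path of `command` if it is found on `search_path`, else None.

    With search_path=None the PATH of the current process is used.
    """
    return shutil.which(command, path=search_path)


print("normal PATH:  ", find_on_path("readlink"))
# Assumed restricted PATH without /bin, only to illustrate the failure mode.
print("/usr/bin only:", find_on_path("readlink", "/usr/bin"))
```

If the second line prints None while the first prints /bin/readlink, the binary is installed but would be invisible to a process whose PATH lacks /bin.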

Stoneageman
Message 48139 - Posted: 10 Nov 2017 | 18:40:21 UTC

It is installed

readlink --version
readlink (GNU coreutils) 8.26
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.

klepel
Message 48140 - Posted: 10 Nov 2017 | 19:01:39 UTC

Same here. readlink version 8.26 is installed.

Daniel
Joined: 17 Sep 16
Posts: 4
Credit: 63,931,247
RAC: 1,647,530
Message 48141 - Posted: 10 Nov 2017 | 19:12:06 UTC

I also have a problem getting new WUs on some of my machines. It looks like the ones with an Nvidia card get work, and the ones without do not get anything.

PappaLitto
Joined: 21 Mar 16
Posts: 264
Credit: 1,216,483,931
RAC: 5,012,972
Message 48142 - Posted: 10 Nov 2017 | 19:21:43 UTC

Is there a particular reason this is a CPU application and not a GPU one?

mmonnin
Joined: 2 Jul 16
Posts: 30
Credit: 20,067,201
RAC: 298,140
Message 48144 - Posted: 10 Nov 2017 | 23:02:51 UTC - in response to Message 48140.

Same here.

NNW until there's a fix.

Daniel
Message 48145 - Posted: 11 Nov 2017 | 0:06:59 UTC - in response to Message 48144.
Last modified: 11 Nov 2017 | 0:19:28 UTC

On Linux CentOS 7.4 it works fine. I suspect that BOINC is not able to find or execute the readlink command. Please try executing the following commands:

which readlink
ls -l `which readlink`
sudo -iu boinc bash -c 'which readlink'
sudo -iu boinc bash -c 'ls -l `which readlink`'
sudo -iu boinc readlink /lib/libz.so.1


On my CentOS machine they return the following results:

# which readlink
/usr/bin/readlink
# ls -l `which readlink`
-rwxr-xr-x. 1 root root 41800 2016-11-05 /usr/bin/readlink
# sudo -iu boinc bash -c 'which readlink'
/bin/readlink
# sudo -iu boinc bash -c 'ls -l `which readlink`'
-rwxr-xr-x. 1 root root 41800 2016-11-05 /bin/readlink
# sudo -iu boinc readlink /lib/libz.so.1
libz.so.1.2.7


Stoneageman
Message 48146 - Posted: 11 Nov 2017 | 1:25:35 UTC

# which readlink
/bin/readlink


# ls -l `which readlink`
-rwxr-xr-x 1 root root 43192 Oct 4 20:56 /bin/readlink

The following return nothing
# sudo -iu boinc bash -c 'which readlink'
# sudo -iu boinc bash -c 'ls -l `which readlink`'
# sudo -iu boinc readlink /lib/libz.so.1

mmonnin
Message 48147 - Posted: 11 Nov 2017 | 2:51:59 UTC - in response to Message 48146.

Commands do not work for me either.

Trotador
Joined: 25 Mar 12
Posts: 83
Credit: 1,067,963,599
RAC: 144,007
Message 48148 - Posted: 11 Nov 2017 | 9:13:37 UTC

So, I copied the readlink program to /usr/bin and now it is working on my Ubuntu hosts.

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 47
Credit: 537,635,512
RAC: 409,979
Message 48149 - Posted: 11 Nov 2017 | 10:27:00 UTC - in response to Message 48148.
Last modified: 11 Nov 2017 | 10:35:56 UTC

The readlink path is usually /usr/bin, but it depends on the packaging and configuration provided by the distro.

Don't copy the file from /bin to /usr/bin (or wherever).

Just create a symlink. If for some reason readlink is updated, the file you've copied will not be updated.

$ sudo ln -sf /bin/readlink /usr/bin/readlink


PS : my readlink path is
$ which readlink
/usr/bin/readlink

Toni
Message 48150 - Posted: 11 Nov 2017 | 10:51:23 UTC - in response to Message 48149.
Last modified: 11 Nov 2017 | 10:55:02 UTC

I'll add /bin to the path in the next app update. That may work, unless there is some weird sandboxing thing going on. You shouldn't need to tweak your system: just let them fail (they should fail fast, so no CPU loss).
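
To illustrate the idea of adding /bin to the path (this is not the actual app change, just a sketch): a launcher can extend PATH in the environment it hands to a child process, leaving the system untouched. `path_with_dir` is a hypothetical helper, and appending rather than prepending is an arbitrary choice made here.

```python
import os


def path_with_dir(current_path, extra_dir="/bin"):
    """Return a PATH string that contains extra_dir, appending it only if absent."""
    parts = [p for p in current_path.split(os.pathsep) if p]
    if extra_dir not in parts:
        parts.append(extra_dir)
    return os.pathsep.join(parts)


# Build a private copy of the environment; the caller's own PATH is not modified.
env = dict(os.environ)
env["PATH"] = path_with_dir(env.get("PATH", ""))
# env could then be passed as subprocess.run([...], env=env) when starting a child.
print(env["PATH"])
```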

Concerning why some hosts are not receiving WUs, it's baffling me. It's not a matter of hosts already having GPUs because my own machine does and it did not get tasks. It may be related to the "reliable hosts" classification.

Toni
Message 48151 - Posted: 11 Nov 2017 | 11:38:12 UTC - in response to Message 48150.

@Daniel: can you list one of your hosts which gets QC tasks and one which doesn't?

Thanks

Daniel
Message 48152 - Posted: 11 Nov 2017 | 12:09:40 UTC - in response to Message 48151.

Hosts which get tasks: 449991, 449992, 391907
Hosts which did not get any: 444456, 452231

John C MacAlister
Joined: 17 Feb 13
Posts: 178
Credit: 132,357,411
RAC: 16,335
Message 48153 - Posted: 11 Nov 2017 | 12:28:20 UTC - in response to Message 48127.

Many thanks for this: I look forward to the Windows version!

____________
John

Conan
Joined: 25 Mar 09
Posts: 5
Credit: 85,320
RAC: 96
Message 48154 - Posted: 11 Nov 2017 | 22:48:37 UTC
Last modified: 11 Nov 2017 | 22:51:20 UTC

Two of my computers have received tasks and processed them with no trouble.
Both run Fedora (16 and 21), host ids are 192138 and 189186.
My 8 core (16 thread) computer (running Fedora 25) has yet to receive a task.

Host 192138 is a 6 core computer and Host 189186 is a four core computer.

The 6 core has shorter run times per task and more CPU time than the 4 core.

This is as expected due to core count. However, the 4 core computer gets higher credit per task than the 6 core, which does not make sense.

6 core getting around 1,500 sec Run time, 8,600 CPU time and about 66 credits.

4 core getting around 3,200 sec Run time, 6,900 CPU time and about 85+ credits.

A bit odd perhaps?
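
To put rough numbers on it, here is a small Python sketch using the approximate figures above (the 4 core credit is "85+", taken as 85 here). It only does the arithmetic and says nothing about how the credit algorithm works internally; `summarize` is just a made-up helper.

```python
# Approximate figures quoted above: (run time in s, CPU time in s, credit per task)
hosts = {
    "6-core": (1500, 8600, 66),
    "4-core": (3200, 6900, 85),
}


def summarize(run_s, cpu_s, credit):
    """Return (average busy cores, credit per CPU-hour, credit per wall-clock hour)."""
    return cpu_s / run_s, credit / (cpu_s / 3600), credit / (run_s / 3600)


for name, figures in hosts.items():
    cores, per_cpu_hour, per_wall_hour = summarize(*figures)
    print(f"{name}: {cores:.2f} cores busy, "
          f"{per_cpu_hour:.1f} credits/CPU-hour, {per_wall_hour:.1f} credits/hour")
```

Per task and per CPU-hour the 4 core comes out ahead, while per wall-clock hour the 6 core still earns more because it finishes tasks about twice as fast.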

Conan

Toni
Message 48155 - Posted: 11 Nov 2017 | 23:04:48 UTC - in response to Message 48154.
Last modified: 11 Nov 2017 | 23:29:30 UTC

Credit assignment logic has historically been problematic (see here) to the point that I am inclined to think that it has no best solution. For the time being the credit algorithm is the old default one from boinc. I think it relies heavily on the self-computed FLOPS and yes that seems paradoxical.

el_gallo_azul
Joined: 14 Jun 14
Posts: 8
Credit: 28,088,602
RAC: 0
Message 48156 - Posted: 12 Nov 2017 | 2:36:31 UTC

I haven't been able to successfully process a WU on my computer. I've received many, but they've all resulted in "Computation error".

See screenshot: https://imgur.com/z0vLkoh

mmonnin
Message 48157 - Posted: 12 Nov 2017 | 3:14:36 UTC - in response to Message 48156.

You'll have to try one of the suggestions posted by Daniel or [VENETO] sabayonino above. I'm waiting for more WUs to try myself.

gianni
Joined: 11 Jul 08
Posts: 16
Credit: 105,098
RAC: 0
Message 48158 - Posted: 12 Nov 2017 | 4:39:28 UTC - in response to Message 48142.

We are not aware of any fast and free GPU QM applications. If you know of one, let us know.

Toni
Message 48159 - Posted: 12 Nov 2017 | 8:40:07 UTC - in response to Message 48157.
Last modified: 12 Nov 2017 | 9:30:30 UTC

Please do not tweak your system. The current application (QC 3.10) should solve the problem.

eXaPower
Joined: 25 Sep 13
Posts: 263
Credit: 1,010,763,867
RAC: 2,126,677
Message 48160 - Posted: 12 Nov 2017 | 13:28:59 UTC - in response to Message 48158.

@UF & @UNC developed ANAKIN-ME to create fast, accurate quantum mechanical simulations. See the demo at #SC17 http://nvda.ws/2zyBhKj


https://twitter.com/NVIDIADC

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1895
Credit: 629,356
RAC: 0
Message 48161 - Posted: 12 Nov 2017 | 16:14:10 UTC - in response to Message 48160.

Yes, we have that and it is nice, but limited and not a QM code.

mmonnin
Message 48167 - Posted: 13 Nov 2017 | 13:14:17 UTC

I completed one this morning in Ubuntu.

Toni
Message 48169 - Posted: 13 Nov 2017 | 14:03:22 UTC - in response to Message 48167.

The new app has 0% failure rate. However, only a handful of hosts are receiving it, for reasons utterly obscure.

This is the only indication I found in the logs:

2017-11-10 20:06:33.9454 [PID=182743] [quota] Overall limits on jobs in progress:
2017-11-10 20:06:33.9454 [PID=182743] [quota] CPU: base 2 scaled 112 njobs 0
2017-11-10 20:06:33.9454 [PID=182743] [quota] GPU: base 2 scaled 0 njobs 0


That "njobs 0" seems to prevent result sending. Any clue hugely appreciated...

Richard Haselgrove
Joined: 11 Jul 09
Posts: 783
Credit: 1,396,074,620
RAC: 1,295,937
Message 48170 - Posted: 13 Nov 2017 | 14:19:55 UTC - in response to Message 48169.

The only reading material I can suggest is http://boinc.berkeley.edu/trac/wiki/ProjectOptions#Joblimits, but I imagine you know that already. Remember to read the following 'Job limits (advanced)' section too.

captainjack
Joined: 9 May 13
Posts: 111
Credit: 799,309,059
RAC: 1,595,268
Message 48171 - Posted: 13 Nov 2017 | 14:44:30 UTC

For those interested in controlling the number of threads used by the multicore app, the following app_config.xml entries seem to work.

<app_config>
  <app>
    <name>QC</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>9</avg_ncpus>
    <cmdline>--nthreads 9</cmdline>
  </app_version>
</app_config>

The <avg_ncpus> entry tells BOINC the number of threads to reserve for the app.

The <cmdline> entry tells the app the number of threads available for processing.
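
If you want a different thread count on each host, the file can also be generated. Below is a minimal Python sketch that writes the same elements for a chosen number of threads. `build_app_config` is a made-up helper, and it assumes the entries belong inside the usual `<app_config>` root element of app_config.xml. The file goes in the project directory, and the client has to re-read its config files (or be restarted) to pick it up.

```python
import xml.etree.ElementTree as ET


def build_app_config(threads, app_name="QC", plan_class="mt"):
    """Build an app_config.xml string that limits the app to `threads` threads."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = "1"

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Keep both values in sync: what BOINC reserves and what the app is told to use.
    ET.SubElement(version, "avg_ncpus").text = str(threads)
    ET.SubElement(version, "cmdline").text = f"--nthreads {threads}"

    return ET.tostring(root, encoding="unicode")


print(build_app_config(4))
```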

Toni
Message 48172 - Posted: 13 Nov 2017 | 14:49:59 UTC - in response to Message 48171.

Can anybody comment on the suspend/resume behavior under a variety of conditions (i.e. with and without "keep in memory")? I expect the calculation to restart from scratch, but not crash.

bormolino
Joined: 16 May 13
Posts: 17
Credit: 17,886,346
RAC: 10,670
Message 48173 - Posted: 13 Nov 2017 | 15:17:03 UTC

Like many others, I don't get any WUs on my Linux machines.

captainjack
Message 48174 - Posted: 13 Nov 2017 | 15:32:43 UTC

Toni wrote:
"Can anybody comment on the suspend/resume behavior under a variety of conditions (ie. with and without 'keep in memory')? I expect the calculation to restart from scratch, but not crash."


When I suspended a task with LAIM (leave applications in memory) on, BOINC manager showed that it was suspended, but the system monitor showed that the task was still busy using all the threads that were allocated to it.

When I suspended a task with LAIM off, BOINC manager showed that the task was suspended and the task disappeared from the system monitor. When the task was resumed, it restarted from 0 and appears to be running normally.

Toni
Message 48175 - Posted: 13 Nov 2017 | 15:54:35 UTC - in response to Message 48174.

@captainjack - thanks, appreciated.

klepel
Message 48176 - Posted: 13 Nov 2017 | 16:03:57 UTC

I just wanted to report back:
My host ID: 420971 gets work and finishes the latest version successfully!
My host ID: 452211 does not get any work. Message is: There is no work available. This host does not have any GPU and runs from a USB stick.

Toni
Message 48177 - Posted: 13 Nov 2017 | 16:15:25 UTC - in response to Message 48176.
Last modified: 13 Nov 2017 | 16:21:37 UTC

Working/not working pairs are useful for debugging indeed (if they have the same preferences, that is). It was suggested that it was the presence of a GPU, but there are GPU-less counter-examples, like this. The scheduler is a software nightmare...

I'll resume tests later this week. In the meantime, there are 1000 more CPU WUs (QC310big).

Jim1348
Joined: 28 Jul 12
Posts: 455
Credit: 1,130,760,908
RAC: 122,902
Message 48180 - Posted: 13 Nov 2017 | 17:45:45 UTC
Last modified: 13 Nov 2017 | 17:52:18 UTC

Today is my lucky day. I just enabled the multicore app, and immediately picked up two of them on my i7-3770 machine running Ubuntu 16.04.3 (Linux 4.10.0.38), and BOINC 7.8.3. They run on 7 cores, with one core reserved for GPU support as set by BOINC preferences, not in the app_config (though I use one for other purposes).

However, suspending them does not shut them down with LAIM enabled, as noted before. I have not tried the non-LAIM case.

If it matters, this machine was attached to GPUGrid earlier, and I had run a few GPU work units on the GTX 980, though I am requesting only the CPU work now. But maybe that has something to do with why I am getting them.

EDIT: Also, I have "Run test applications?" enabled, though I don't know if that is necessary in this case.

Conan
Message 48183 - Posted: 13 Nov 2017 | 22:42:44 UTC

My two computers that are getting or have gotten cpu work, have both been connected before.
The new computer I attached does not get work but says "No work available" even when there is plenty.

Conan

el_gallo_azul
Message 48184 - Posted: 14 Nov 2017 | 0:19:29 UTC - in response to Message 48157.
Last modified: 14 Nov 2017 | 0:20:27 UTC

OK, thanks @mmonnin.

I've just run

which readlink

followed by

sudo ln -sf /bin/readlink /usr/bin/readlink

and am now waiting for some more WUs.

Toni
Message 48187 - Posted: 14 Nov 2017 | 9:55:22 UTC - in response to Message 48184.

Do not make symlinks. The problem is already solved.

Coleslaw
Joined: 24 Jul 08
Posts: 30
Credit: 139,813,917
RAC: 212,327
Message 48192 - Posted: 14 Nov 2017 | 20:09:33 UTC
Last modified: 14 Nov 2017 | 20:11:47 UTC

Toni wrote:
"Since it’s the first time we have a CPU app out, I’ll test the behavior of GPUGRID with a relatively large batch that you will see soon."


I just started reading this thread. I thought I would point out that there was a multi-threaded CPU application back in 2014. It just wasn't necessarily for Quantum Chemistry.

Conan
Message 48198 - Posted: 16 Nov 2017 | 7:21:52 UTC - in response to Message 48192.

Yes I ran that one on both Windows 32 bit and Linux 64 bit, which is where nearly all my points came from, as I had to stop GPU use a few years ago so I ran the CPU app instead.

Conan

Dayle Diamond
Joined: 5 Dec 12
Posts: 69
Credit: 1,047,803,965
RAC: 788,549
Message 48228 - Posted: 23 Nov 2017 | 16:47:19 UTC
Last modified: 23 Nov 2017 | 17:05:04 UTC

On a 1950x it's reserving all 32 threads but not running them near the maximum.
It seems to be switching which cores are active - my System Monitor CPU usage chart looks like a long line of infinity symbols.

If you divide the CPU time by the run time, you'll see that on average only about seventeen cores are busy at any moment. Everything else is going to waste.

(Columns: task, workunit, host, sent, reported, status, run time in seconds, CPU time in seconds, credit, application.)

16713948 12878079 453935 23 Nov 2017 | 12:59:03 UTC 23 Nov 2017 | 16:09:15 UTC Completed and validated 680.18 11,586.25 67.70 Quantum Chemistry v3.10 (mt)

16713947 12878078 453935 23 Nov 2017 | 12:59:03 UTC 23 Nov 2017 | 14:12:17 UTC Completed and validated 761.12 12,984.46 267.57 Quantum Chemistry v3.10 (mt)

16713946 12878077 453935 23 Nov 2017 | 12:59:03 UTC 23 Nov 2017 | 15:11:46 UTC Completed and validated 702.76 11,639.75
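
The division for the three tasks listed above, as a short Python sketch. The run times and CPU times are copied from the task list; the 32 is the number of threads reserved on this host, and `average_busy_threads` is just an illustrative helper name.

```python
# (run time in s, CPU time in s) for the three tasks listed above
tasks = [(680.18, 11586.25), (761.12, 12984.46), (702.76, 11639.75)]
RESERVED_THREADS = 32


def average_busy_threads(run_s, cpu_s):
    """CPU seconds consumed per wall-clock second, i.e. threads busy on average."""
    return cpu_s / run_s


ratios = [average_busy_threads(run_s, cpu_s) for run_s, cpu_s in tasks]
for (run_s, cpu_s), ratio in zip(tasks, ratios):
    print(f"{cpu_s:>9.2f} / {run_s:.2f} = {ratio:.1f} threads busy on average")

mean_ratio = sum(ratios) / len(ratios)
print(f"share of reserved threads in use: {mean_ratio / RESERVED_THREADS:.0%}")
```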

PS. It's running at top priority over World Community Grid, but they've got similar deadlines. Is this intentional?

dfygrvty
New member
Joined: 21 Nov 17
Posts: 2
Credit: 249,438
RAC: 22,016
Message 48229 - Posted: 23 Nov 2017 | 17:53:43 UTC - in response to Message 48127.

I'm getting a ton of Quantum Chemistry tasks on my AWS EC2 p2.xlarge instance.
a47-toni_qc310k-0-1-* are the names of the tasks. Are these the new multicore tasks you talked about? The machine takes a task to 66% in 2 seconds and then sits at that percentage for ~10 minutes.

I think the task stops reporting progress at 66%. Is that a bug? I compiled the BOINC client on the EC2 instance, so it could definitely be user error as well.

klepel
Message 48230 - Posted: 23 Nov 2017 | 18:05:48 UTC

Same here, stuck at 66%. I'll go to lunch and see if it has finished in the meantime.

dfygrvty
New member
Send message
Joined: 21 Nov 17
Posts: 2
Credit: 249,438
RAC: 22,016
Level

Scientific publications
wat
Message 48231 - Posted: 23 Nov 2017 | 18:28:37 UTC - in response to Message 48229.

They finish about 10-15 minutes after they 'hang' on my EC2 instance.

klepel
Send message
Joined: 23 Dec 09
Posts: 135
Credit: 1,805,959,720
RAC: 1,485,046
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 48232 - Posted: 23 Nov 2017 | 18:50:00 UTC

Here as well! The run times are in line with the other computer having more threads and a higher clock frequency.

Dayle Diamond
Send message
Joined: 5 Dec 12
Posts: 69
Credit: 1,047,803,965
RAC: 788,549
Level
Met
Scientific publications
watwatwatwatwatwatwatwat
Message 48233 - Posted: 23 Nov 2017 | 20:44:43 UTC

I'm using Ubuntu's bundled system monitor to display CPU usage graphs. That 66% thing is just a bug with the work unit time estimation, but my cores really were gradually rising and falling from 0 to 100%. Like a helix on its side, but with 32 lines.

(It's not thermal throttling.)

If at all possible, consider limiting each multicore task to four cores. The thread count of almost every modern CPU is divisible by four, so this would ensure the highest throughput, as no thread would go to waste.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 585
Credit: 4,273,184
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 48234 - Posted: 23 Nov 2017 | 21:57:07 UTC - in response to Message 48233.
Last modified: 23 Nov 2017 | 22:00:00 UTC

The 66% is due to our using the boinc wrapper for an app which doesn't report its progress. There are three steps in the WU (install, update, compute) and the third is the long one, hence the 2/3.
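
A rough model of where the 2/3 comes from, assuming the wrapper credits each finished step equally and sees no progress inside a running step. The step names follow the post; the function is illustrative and is not the wrapper's actual code.

```python
STEPS = ["install", "update", "compute"]  # the three steps named above


def reported_fraction(steps_finished, total_steps=len(STEPS)):
    """Fraction done when only completed steps count and each step has equal weight."""
    if not 0 <= steps_finished <= total_steps:
        raise ValueError("steps_finished out of range")
    return steps_finished / total_steps


if __name__ == "__main__":
    # Show what the progress bar would read while each step is running.
    for done in range(len(STEPS) + 1):
        label = STEPS[done] if done < len(STEPS) else "finished"
        print(f"{label:>8}: {reported_fraction(done):.0%}")
```

Under this model the bar sits at about two thirds for the whole of the compute step, which is the long one.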

If I figure out how, I'll try to limit the number of CPUs requested. I think the client has some control over it as well.

Petr Kriz
Send message
Joined: 22 Feb 09
Posts: 1
Credit: 9,642
RAC: 0
Level

Scientific publications
wat
Message 48235 - Posted: 23 Nov 2017 | 22:46:53 UTC

I just tried to run a few tasks and am still getting the same error:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
23:27:04 (6871): wrapper (7.7.26016): starting
23:27:04 (6871): wrapper (7.7.26016): starting
23:27:04 (6871): wrapper: running ../../projects/www.gpugrid.net/Miniconda3-4.3.30-Linux-x86_64.sh (-b -f -p /var/lib/boinc/projects/www.gpugrid.net/miniconda)
Python 3.6.3 :: Anaconda, Inc.
23:33:01 (6871): task miniconda-installer reached time limit 360
23:33:01 (6871): wrapper: running /var/lib/boinc/projects/www.gpugrid.net/miniconda/bin/python (pre_script.py)
Traceback (most recent call last):
File "pre_script.py", line 1, in <module>
import conda.cli
ModuleNotFoundError: No module named 'conda'
23:33:02 (6871): $PROJECT_DIR/miniconda/bin/python exited; CPU time 0.025285
23:33:02 (6871): app exit status: 0x1
23:33:02 (6871): called boinc_finish(195)

</stderr_txt>
]]>

Any idea how to solve it?
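
The log above shows the installer step hitting its 360-second limit just before the import of conda fails, which suggests (though nothing in this thread confirms it) that the miniconda installation was left incomplete. A minimal sketch for checking whether a given interpreter can import a module; the interpreter path is copied from the log, and the helper name is made up for illustration.

```python
import subprocess
import sys


def can_import(python_exe, module):
    """Return True if `python_exe -c "import <module>"` exits with status 0."""
    try:
        result = subprocess.run(
            [python_exe, "-c", f"import {module}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        return False  # interpreter missing or not executable
    return result.returncode == 0


if __name__ == "__main__":
    # Interpreter path copied from the log above; pass another path as argv[1] if needed.
    default = "/var/lib/boinc/projects/www.gpugrid.net/miniconda/bin/python"
    exe = sys.argv[1] if len(sys.argv) > 1 else default
    print("conda importable:", can_import(exe, "conda"))
```

If this prints False, the installation under the project directory is probably broken or incomplete.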

klepel
Send message
Joined: 23 Dec 09
Posts: 135
Credit: 1,805,959,720
RAC: 1,485,046
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 48236 - Posted: 24 Nov 2017 | 1:56:57 UTC

This one hung for about 6 hours:
http://www.gpugrid.net/result.php?resultid=16717461

el_gallo_azul
Send message
Joined: 14 Jun 14
Posts: 8
Credit: 28,088,602
RAC: 0
Level
Val
Scientific publications
wat
Message 48237 - Posted: 24 Nov 2017 | 4:44:33 UTC

Since I had 100% errors (Message 48156 - Posted: 12 Nov 2017 | 2:36:31 UTC) on my first batch of these CPU tasks, I created a symlink as instructed, then deleted the symlink as subsequently instructed, but I have never received a single task since my 12 Nov 2017 post.

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1895
Credit: 629,356
RAC: 0
Level
Gly
Scientific publications
watwatwatwatwat
Message 48239 - Posted: 24 Nov 2017 | 11:10:37 UTC - in response to Message 48237.

OK, we will start production mode next week. Unfortunately we will need more than 50x the current number of CPUs, but it is just the start now, so it is ok.

gdf

mmonnin
Send message
Joined: 2 Jul 16
Posts: 30
Credit: 20,067,201
RAC: 298,140
Level
Pro
Scientific publications
wat
Message 48241 - Posted: 24 Nov 2017 | 11:17:28 UTC - in response to Message 48228.

On a 1950x it's reserving all 32 threads but not running them near the maximum.
It seems to be switching which cores are active - my System Monitor CPU usage chart looks like a long line of infinity symbols.

If you divide the CPU time by the run time, you'll see an average of about seventeen threads in use. Everything else is going to waste.

16713948 12878079 453935 23 Nov 2017 | 12:59:03 UTC 23 Nov 2017 | 16:09:15 UTC Completed and validated 680.18 11,586.25 67.70 Quantum Chemistry v3.10 (mt)

16713947 12878078 453935 23 Nov 2017 | 12:59:03 UTC 23 Nov 2017 | 14:12:17 UTC Completed and validated 761.12 12,984.46 267.57 Quantum Chemistry v3.10 (mt)

16713946 12878077 453935 23 Nov 2017 | 12:59:03 UTC 23 Nov 2017 | 15:11:46 UTC Completed and validated 702.76 11,639.75

PS. It's running at top priority over World Community Grid, but they've got similar deadlines. Is this intentional?


It's pretty typical of multithreaded apps (in any BOINC project) that they don't scale well past 4-8 cores. I typically use an app_config to limit mt apps like LHC, Cosmology, yafu, etc. to 4 cores.
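
For reference, a client-side limit of this kind is usually set in an app_config.xml file in the project directory. The sketch below builds such a file in Python. The app name "QC" is a placeholder (the real name has to be looked up in the client's client_state.xml), and whether this particular app honours the limit is an assumption that has not been confirmed in this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class="mt", ncpus=4):
    """Return app_config.xml text asking the client to budget `ncpus` CPUs per task."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "QC" is a hypothetical app name used only for illustration.
    print(build_app_config("QC"))
```

The output would be saved as app_config.xml in the project's directory, after which the client has to re-read its config files or be restarted.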

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 336
Credit: 3,803,697,809
RAC: 891,908
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 48242 - Posted: 24 Nov 2017 | 11:18:30 UTC - in response to Message 48239.

OK, we will start production mode next week. Unfortunately we will need more than 50x the current number of CPUs, but it is just the start now, so it is ok.

gdf



You will need a Windows app for this.


bormolino
Send message
Joined: 16 May 13
Posts: 17
Credit: 17,886,346
RAC: 10,670
Level
Pro
Scientific publications
watwat
Message 48243 - Posted: 24 Nov 2017 | 12:19:44 UTC - in response to Message 48237.
Last modified: 24 Nov 2017 | 12:20:25 UTC

Since I had 100% errors (Message 48156 - Posted: 12 Nov 2017 | 2:36:31 UTC) on my first batch of these CPU tasks, I created a symlink as instructed, then deleted the symlink as subsequently instructed, but I have never received a single task since my 12 Nov 2017 post.


Same here ...

mmonnin
Send message
Joined: 2 Jul 16
Posts: 30
Credit: 20,067,201
RAC: 298,140
Level
Pro
Scientific publications
wat
Message 48244 - Posted: 24 Nov 2017 | 13:12:43 UTC

I received some yesterday on a new install of Ubuntu 17.10. No symlink or anything and they completed.

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 264
Credit: 1,216,483,931
RAC: 5,012,972
Level
Met
Scientific publications
watwat
Message 48245 - Posted: 24 Nov 2017 | 16:05:33 UTC

If you need that many CPUs, you will definitely need a Windows app.
