
Message boards : News : New workunits

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Message 53010 - Posted: 21 Nov 2019 | 17:10:20 UTC
Last modified: 21 Nov 2019 | 17:17:54 UTC

I'm loading a first batch of 1000 workunits for a new project (GSN*) on the acemd3 app. This batch is both for a basic science investigation and for load-testing the app. Thanks!

If you disabled "acemd3" from the preferences for some reason, please re-enable it.

[PUGLIA] kidkidkid3
Message 53011 - Posted: 21 Nov 2019 | 18:46:17 UTC - in response to Message 53010.

Hi,
my first new WU stopped at 1% after 20 minutes of running.
I suspended and restarted it, and the elapsed time restarted from 0.
After another 20 minutes with no increase in progress, I killed it.
For investigation see http://www.gpugrid.net/result.php?resultid=21501927
Thanks in advance.
K.
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)

jjch
Message 53012 - Posted: 21 Nov 2019 | 19:03:27 UTC

Thank you Toni!

I already have my GTX 1070/1080 GPU's pegged to nearly 100% even with one WU running on each.

The GPU load does drop lower intermittently and also will drop PerfCap to Idle.

The new thing I am noticing is I am now hitting the Power PerfCap throttling the GPU's.





jjch
Message 53013 - Posted: 21 Nov 2019 | 19:22:08 UTC - in response to Message 53011.
Last modified: 21 Nov 2019 | 19:55:09 UTC

Hi,
my first new WU stopped at 1% after 20 minutes of running.
I suspended and restarted it, and the elapsed time restarted from 0.
After another 20 minutes with no increase in progress, I killed it.
For investigation see http://www.gpugrid.net/result.php?resultid=21501927
Thanks in advance.
K.


[PUGLIA] kidkidkid3

It would help a lot to know what your setup looks like. Your computers are hidden, so we can't see them. Also, the configuration may make a difference.

Please provide some details.

[PUGLIA] kidkidkid3
Message 53014 - Posted: 21 Nov 2019 | 20:26:58 UTC - in response to Message 53013.
Last modified: 21 Nov 2019 | 20:37:24 UTC

Sorry for the missing configuration details.

Intel Quadcore Q9450 with 4GB (2*2 DDR3 at 1333) and GTX 750 TI

... here the log.

Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
19:10:11 (2408): wrapper (7.9.26016): starting
19:10:11 (2408): wrapper: running acemd3.exe (--boinc input --device 0)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {1583} normal block at 0x0000020099013380, 8 bytes long.
Data: < > 00 00 F9 98 00 02 00 00
..\lib\diagnostics_win.cpp(417) : {198} normal block at 0x0000020099011BA0, 1080 bytes long.
Data: <( ` > 28 11 00 00 CD CD CD CD 60 01 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)

jjch
Message 53015 - Posted: 21 Nov 2019 | 20:34:40 UTC - in response to Message 53014.

Sorry for the missing configuration details ... here the log.

Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
aborted by user</message>
<stderr_txt>
19:10:11 (2408): wrapper (7.9.26016): starting
19:10:11 (2408): wrapper: running acemd3.exe (--boinc input --device 0)
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {1583} normal block at 0x0000020099013380, 8 bytes long.
Data: < > 00 00 F9 98 00 02 00 00
..\lib\diagnostics_win.cpp(417) : {198} normal block at 0x0000020099011BA0, 1080 bytes long.
Data: <( ` > 28 11 00 00 CD CD CD CD 60 01 00 00 00 00 00 00
Object dump complete.

</stderr_txt>
]]>



OK, I checked the link to your computer and I see you have 2x GTX 750 Tis:
http://www.gpugrid.net/show_host_detail.php?hostid=208691

I'm not sure a GTX 750 series can run the new app. Let's see if one of the resident experts will know the answer.

Erich56
Message 53016 - Posted: 21 Nov 2019 | 20:48:13 UTC - in response to Message 53015.

I'm not sure a GTX 750 series can run the new app. Let's see if one of the resident experts will know the answer.

The strange thing with my hosts is that the host with the GTX980ti and the host with the GTX970 received the new ACEMD v2.10 tasks this evening, but the two hosts with a GTX750ti did NOT. Was this a coincidence, or is the new version not being sent to GTX750ti cards?

ServicEnginIC
Message 53017 - Posted: 21 Nov 2019 | 21:03:13 UTC - in response to Message 53015.

I'm not sure a GTX 750 series can run the new app

I can confirm that I've successfully finished ACEMD3 test tasks on GTX750 and GTX750Ti graphics cards running under Linux.
I should also note that I had some trouble under Windows 10 with antivirus interference.
This was commented at following thread:
http://www.gpugrid.net/forum_thread.php?id=4999

ServicEnginIC
Message 53018 - Posted: 21 Nov 2019 | 21:08:18 UTC - in response to Message 53016.

Was this coincidence, or is the new version not being sent to GTX750ti cards?

Please try updating your drivers.

Greg _BE
Message 53020 - Posted: 21 Nov 2019 | 23:23:13 UTC
Last modified: 21 Nov 2019 | 23:25:15 UTC

Just started a task on my GTX 1050 Ti (fully updated drivers, no overdrive, default settings).
It's been running 30 mins and finally reached 2%.
You should change something in the code so it reports progress with decimals; I use that to check whether a task is moving in BoincTasks.
Your app only updates in whole-percent steps, with no decimals.
Memory usage is minimal compared to LHC ATLAS: only 331 MB real memory and 648 MB virtual, which is more in the range of what Rosetta uses.
So it looks like I should have this task done in about 26 hrs from now.
For a GPU task it is taking a lot of CPU; one thread needs 100%+ CPU usage all the time.
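A quick sanity check on that estimate (a back-of-the-envelope sketch; the 2%-in-30-minutes figures come from the observations above):

```python
# Rough ETA from the coarse progress figures quoted above:
# 2% done after 30 minutes of running.
elapsed_min = 30
progress_pct = 2.0

rate = progress_pct / elapsed_min                 # percent per minute
remaining_min = (100.0 - progress_pct) / rate     # minutes left at this rate

print(f"estimated remaining: {remaining_min / 60:.1f} h")  # ~24.5 h
```

Given the whole-percent rounding, the true rate could be anywhere from roughly 1.5% to 3% per half hour, so "about 26 hrs" is well within the error bars.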

Pop Piasa
Message 53021 - Posted: 22 Nov 2019 | 1:12:12 UTC - in response to Message 53015.
Last modified: 22 Nov 2019 | 1:37:58 UTC

Hi, I'm running test 309 on an i7-860 with one GTX 750 Ti, and the ACEMD3 test is reporting 4.680%/hr.
That's better than my GTX 1060 and i7-7700K running test 725 at 4.320%/hr.
Why does the GTX 1060 run slower? Toni, anybody?
(running latest drivers)
____________

STARBASEn
Message 53022 - Posted: 22 Nov 2019 | 1:53:59 UTC

I got this one today http://www.gpugrid.net/workunit.php?wuid=16850979 and it ran fine. As I've said before, Linux machines are quite ready.

Keith Myers
Message 53023 - Posted: 22 Nov 2019 | 2:09:05 UTC

Three finished so far, working on a fourth. Keep 'em coming.

KAMasud
Message 53025 - Posted: 22 Nov 2019 | 4:24:07 UTC

Got one task. GTX1060 with Max-Q. Windows 10. Task errored out. Following is the complete story.
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
23:16:17 (1648): wrapper (7.9.26016): starting
23:16:17 (1648): wrapper: running acemd3.exe (--boinc input --device 0)
# Engine failed: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
02:43:35 (1648): acemd3.exe exited; CPU time 12377.906250
02:43:35 (1648): app exit status: 0x1
02:43:35 (1648): called boinc_finish(195)
0 bytes in 0 Free Blocks.
506 bytes in 8 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 184328910 bytes.
Dumping objects ->
{1806} normal block at 0x000001C10FAA6190, 48 bytes long.
Data: <ACEMD_PLUGIN_DIR> 41 43 45 4D 44 5F 50 4C 55 47 49 4E 5F 44 49 52
{1795} normal block at 0x000001C10FAA6350, 48 bytes long.
Data: <HOME=D:\ProgramD> 48 4F 4D 45 3D 44 3A 5C 50 72 6F 67 72 61 6D 44
{1784} normal block at 0x000001C10FAA6580, 48 bytes long.
Data: <TMP=D:\ProgramDa> 54 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44 61
{1773} normal block at 0x000001C10FAA6120, 48 bytes long.
Data: <TEMP=D:\ProgramD> 54 45 4D 50 3D 44 3A 5C 50 72 6F 67 72 61 6D 44
{1762} normal block at 0x000001C10FAA5CC0, 48 bytes long.
Data: <TMPDIR=D:\Progra> 54 4D 50 44 49 52 3D 44 3A 5C 50 72 6F 67 72 61
{1751} normal block at 0x000001C10FA8A0B0, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {1748} normal block at 0x000001C10FAA86C0, 8 bytes long.
Data: < > 00 00 A4 0F C1 01 00 00
{977} normal block at 0x000001C10FA8D840, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{203} normal block at 0x000001C10FAA8CB0, 8 bytes long.
Data: < > 10 BB AA 0F C1 01 00 00
{197} normal block at 0x000001C10FAA5B70, 48 bytes long.
Data: <--boinc input --> 2D 2D 62 6F 69 6E 63 20 69 6E 70 75 74 20 2D 2D
{196} normal block at 0x000001C10FAA8030, 16 bytes long.
Data: < > 18 BA AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{195} normal block at 0x000001C10FAA83F0, 16 bytes long.
Data: < > F0 B9 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{194} normal block at 0x000001C10FAA89E0, 16 bytes long.
Data: < > C8 B9 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{193} normal block at 0x000001C10FAA7FE0, 16 bytes long.
Data: < > A0 B9 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{192} normal block at 0x000001C10FAA8DA0, 16 bytes long.
Data: <x > 78 B9 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{191} normal block at 0x000001C10FAA8B20, 16 bytes long.
Data: <P > 50 B9 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{190} normal block at 0x000001C10FAA5A90, 48 bytes long.
Data: <ComSpec=C:\Windo> 43 6F 6D 53 70 65 63 3D 43 3A 5C 57 69 6E 64 6F
{189} normal block at 0x000001C10FAA7F90, 16 bytes long.
Data: < > D0 FE A8 0F C1 01 00 00 00 00 00 00 00 00 00 00
{188} normal block at 0x000001C10FA9D540, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{187} normal block at 0x000001C10FAA88F0, 16 bytes long.
Data: < > A8 FE A8 0F C1 01 00 00 00 00 00 00 00 00 00 00
{185} normal block at 0x000001C10FAA8C10, 16 bytes long.
Data: < > 80 FE A8 0F C1 01 00 00 00 00 00 00 00 00 00 00
{184} normal block at 0x000001C10FAA81C0, 16 bytes long.
Data: <X > 58 FE A8 0F C1 01 00 00 00 00 00 00 00 00 00 00
{183} normal block at 0x000001C10FAA8210, 16 bytes long.
Data: <0 > 30 FE A8 0F C1 01 00 00 00 00 00 00 00 00 00 00
{182} normal block at 0x000001C10FAA85D0, 16 bytes long.
Data: < > 08 FE A8 0F C1 01 00 00 00 00 00 00 00 00 00 00
{181} normal block at 0x000001C10FAA88A0, 16 bytes long.
Data: < > E0 FD A8 0F C1 01 00 00 00 00 00 00 00 00 00 00
{180} normal block at 0x000001C10FA8FDE0, 280 bytes long.
Data: < \ > A0 88 AA 0F C1 01 00 00 C0 5C AA 0F C1 01 00 00
{179} normal block at 0x000001C10FAA8800, 16 bytes long.
Data: <0 > 30 B9 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{178} normal block at 0x000001C10FAA8A80, 16 bytes long.
Data: < > 08 B9 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{177} normal block at 0x000001C10FAA8850, 16 bytes long.
Data: < > E0 B8 AA 0F C1 01 00 00 00 00 00 00 00 00 00 00
{176} normal block at 0x000001C10FAAB8E0, 496 bytes long.
Data: <P acemd3.e> 50 88 AA 0F C1 01 00 00 61 63 65 6D 64 33 2E 65
{65} normal block at 0x000001C10FAA8C60, 16 bytes long.
Data: < > 80 EA D6 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x000001C10FA9BA00, 16 bytes long.
Data: <@ > 40 E9 D6 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x000001C10FA9B9B0, 16 bytes long.
Data: < W > F8 57 D3 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x000001C10FA9B960, 16 bytes long.
Data: < W > D8 57 D3 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x000001C10FA9B910, 16 bytes long.
Data: <P > 50 04 D3 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x000001C10FA9B870, 16 bytes long.
Data: <0 > 30 04 D3 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x000001C10FA9B780, 16 bytes long.
Data: < > E0 02 D3 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x000001C10FA9B730, 16 bytes long.
Data: < > 10 04 D3 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x000001C10FA9B690, 16 bytes long.
Data: <p > 70 04 D3 DA F7 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x000001C10FA9B640, 16 bytes long.
Data: < > 18 C0 D1 DA F7 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.

</stderr_txt>

Enjoy reading it.

rod4x4
Message 53026 - Posted: 22 Nov 2019 | 5:43:18 UTC
Last modified: 22 Nov 2019 | 6:01:38 UTC

I'm not sure a GTX 750 series can run the new app.

I have a GTX 750 on a Linux host that is processing an ACEMD3 task; it is about halfway through and should complete in about 1 day.

A Win7 host with a GTX 750 Ti is also processing an ACEMD3 task. This should take 20 hours.

On a Win7 host with a GTX 960, two ACEMD3 tasks have failed, both with this error: # Engine failed: Particle coordinate is nan
Host can be found here: http://gpugrid.net/results.php?hostid=274119

What I have noticed on my Linux hosts is that nvidia-smi reports the ACEMD3 tasks using about 10% more power than the ACEMD2 tasks. This would indicate that the ACEMD3 tasks are more efficient at pushing the GPU to its full potential.

Because of this, I have reduced the overclocking on some hosts (particularly the GTX 960 above).

Erich56
Message 53027 - Posted: 22 Nov 2019 | 5:59:40 UTC - in response to Message 53018.

Was this coincidence, or is the new version not being sent to GTX750ti cards?

Please, try updating drivers

It would be useful if we were told the minimum required driver version.

rod4x4
Message 53028 - Posted: 22 Nov 2019 | 6:03:14 UTC - in response to Message 53027.

would be useful if we were told which is the minimum required version number of the driver.


This info can be found here:
http://gpugrid.net/forum_thread.php?id=5002

Erich56
Message 53029 - Posted: 22 Nov 2019 | 6:11:17 UTC - in response to Message 53028.

would be useful if we were told which is the minimum required version number of the driver.


This info can be found here:
http://gpugrid.net/forum_thread.php?id=5002

oh, thanks very much; so all is clear now - I need to update my drivers on the two GTX750ti hosts.

rod4x4
Message 53031 - Posted: 22 Nov 2019 | 7:05:16 UTC - in response to Message 53021.

Hi, I'm running test 309 on an i7-860 with one GTX 750Ti and ACEMD 3 test is reporting 4.680%/Hr.
Better than my GTX 1060 and i7-7700K running test 725 @ 4.320%/Hr.
Why does the GTX 1060 run slower, Toni, anybody?
(running latest drivers)


The GTX 1060 performance seems fine for the ACEMD2 task in your task list.
You may find some clues to the slow ACEMD3 performance in the Stderr output when the task completes.
The ACEMD3 progress reporting is not as accurate as the ACEMD2 tasks', a side effect of using a wrapper, so performance should only be judged once the task has completed.

Erich56
Message 53032 - Posted: 22 Nov 2019 | 7:31:07 UTC - in response to Message 53029.
Last modified: 22 Nov 2019 | 8:02:16 UTC

would be useful if we were told which is the minimum required version number of the driver.


This info can be found here:
http://gpugrid.net/forum_thread.php?id=5002

oh, thanks very much; so all is clear now - I need to update my drivers on the two GTX750ti hosts.


Driver updates are complete, and 1 of my 2 GTX750ti hosts has already received a task; it's running well.

What I noticed, also on the other hosts (GTX980ti and GTX970), is that GPU usage (as shown in NVIDIA Inspector and GPU-Z) is now up to 99% most of the time. This was not the case before, most probably due to the WDDM "brake" in Win7 and Win10 (it was at 99% under WinXP, which had no WDDM).
This is noticeable now, as the new software seems to have overcome this problem.

rod4x4
Message 53034 - Posted: 22 Nov 2019 | 8:48:13 UTC - in response to Message 53032.
Last modified: 22 Nov 2019 | 8:53:13 UTC

Driver updates complete, and 1 of my 2 GTX750ti has already received a task, it's running well.

Good News!

What I noticed, also on the other hosts (GTX980ti and GTX970), is that the GPU usage (as shown in the NVIDIA Inspector and GPU-Z) now is up to 99% most of the time; this was not the case before, most probably due to the WDDM "brake" in Win7 and Win10 (it was at 99% in WinXP which had no WDDM).
And this is noticable, as the new software seems to have overcome this problem.

The ACEMD3 performance is impressive. Toni did indicate that the performance using the Wrapper will be better (here:
http://gpugrid.net/forum_thread.php?id=4935&nowrap=true#51939)...and he is right!
Toni (and GPUgrid team) set out with a vision to make the app more portable and faster. They have delivered. Thank you Toni (and GPUgrid team).

Greg _BE
Message 53036 - Posted: 22 Nov 2019 | 8:52:30 UTC

http://www.gpugrid.net/result.php?resultid=21502590

Crashed and burned after getting past 2%.
Memory leaks.

Updated my drivers and have another task in queue.

Erich56
Message 53037 - Posted: 22 Nov 2019 | 9:00:41 UTC - in response to Message 53034.

Toni (and GPUgrid team) set out with a vision to make the app more portable and faster. They have delivered. Thank you Toni (and GPUgrid team).

+ 1

rod4x4
Message 53038 - Posted: 22 Nov 2019 | 9:03:05 UTC - in response to Message 53036.
Last modified: 22 Nov 2019 | 9:06:17 UTC

http://www.gpugrid.net/result.php?resultid=21502590

Crashed and burned after going 2% or more.
Memory leaks

Updated my drivers and have another task in queue.


The memory-leak messages appear at startup; they are probably not critical errors.

The issue in your case is that ACEMD3 tasks cannot start on one GPU and be resumed on another.

From your STDerr Output:
.....
04:26:56 (8564): wrapper: running acemd3.exe (--boinc input --device 0)
.....
06:08:12 (16628): wrapper: running acemd3.exe (--boinc input --device 1)
ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!


It was started on Device 0
but failed when it was resumed on Device 1

Refer this FAQ post by Toni for further clarification:
http://www.gpugrid.net/forum_thread.php?id=5002
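One common workaround is BOINC's `<exclude_gpu>` option in `cc_config.xml`, which hides a device from one project so its tasks can't be resumed on a different GPU. A minimal sketch (the `device_num` value here is only an example; use the device you want hidden from GPUGRID):

```xml
<cc_config>
  <options>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <!-- hide device 1 from GPUGRID; its tasks then stay on device 0 -->
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```

Place the file in the BOINC data directory and re-read the config (or restart the client) for it to take effect.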

Toni
Message 53039 - Posted: 22 Nov 2019 | 9:20:03 UTC - in response to Message 53038.
Last modified: 22 Nov 2019 | 9:21:18 UTC

Thanks to all! To summarize some responses to the feedback above:

* GPU occupation is high (100% on my Linux machine)
* %/day is not an indication of performance, because WU size differs between WU types
* Minimum required drivers, failures on notebook cards: see the FAQ - thanks to those posting the links
* Tasks apparently stuck: this may be an impression caused by the % being rounded (e.g. an 8h task divided into 1% steps shows no apparent progress for minutes)
* "Memory leaks": ignore the message, it's always there. The actual error, if present, is at the top.
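To put a number on the "apparently stuck" point, a small sketch using the 8-hour example mentioned above:

```python
# With progress rounded to whole percent, a long task shows no visible
# movement between 1% ticks (8-hour figure taken from the summary above).
task_hours = 8
minutes_per_tick = task_hours * 60 / 100   # minutes per 1% of the run

print(f"one 1% tick every {minutes_per_tick} minutes")  # 4.8
```

So nearly five minutes can pass with the progress display sitting still, which is easily mistaken for a hang.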

Erich56
Message 53040 - Posted: 22 Nov 2019 | 9:30:14 UTC

Toni, since the new app is an obvious success - now the inevitable question: when will you send out the next batch of tasks?

rod4x4
Message 53041 - Posted: 22 Nov 2019 | 9:44:45 UTC - in response to Message 53039.

Hi Toni

"Memory leaks": ignore the message, it's always there. The actual error, if present, is at the top.

I am not seeing the error at the top; am I missing it? All I find is the generic wrapper error message stating there is an error in the client task.
The task error is buried in the Stderr output.
Can the task error be passed to the wrapper error code?

Toni
Message 53042 - Posted: 22 Nov 2019 | 11:04:23 UTC - in response to Message 53041.

@rod4x4 which error? No resume on a different card is a known limitation; please see the FAQ.

San-Fernando-Valley
Message 53043 - Posted: 22 Nov 2019 | 11:42:23 UTC

WAITING FOR WU's

Greg _BE
Message 53044 - Posted: 22 Nov 2019 | 13:26:53 UTC - in response to Message 53038.

Oh, interesting.
Then I guess I have to write a script to keep all your tasks on the 1050.
That's my better GPU anyway.

Greg _BE
Message 53045 - Posted: 22 Nov 2019 | 13:28:46 UTC - in response to Message 53039.

Why is CPU usage so high?
I expect the GPU load to be high, but the CPU?
One thread is running at 85-100%+ CPU usage.

jp de malo
Message 53046 - Posted: 22 Nov 2019 | 14:05:37 UTC - in response to Message 53043.
Last modified: 22 Nov 2019 | 14:06:39 UTC

The test is already finished; no errors on my 1050 Tis or on my 1080 Ti.

Toni
Message 53047 - Posted: 22 Nov 2019 | 14:09:04 UTC - in response to Message 53044.

oh interesting.
then I guess I have to write a script to keep all your tasks on the 1050.
That's my better GPU anyway.


See the FAQ; you can restrict which GPUs are usable.

Keith Myers
Message 53048 - Posted: 22 Nov 2019 | 14:32:35 UTC - in response to Message 53038.

http://www.gpugrid.net/result.php?resultid=21502590

Crashed and burned after going 2% or more.
Memory leaks

Updated my drivers and have another task in queue.


The memory leaks do appear on startup, probably not critical errors.

The issue in your case is ACEMD3 tasks cannot start on one GPU and be resumed on another.

From your STDerr Output:
.....
04:26:56 (8564): wrapper: running acemd3.exe (--boinc input --device 0)
.....
06:08:12 (16628): wrapper: running acemd3.exe (--boinc input --device 1)
ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!


It was started on Device 0
but failed when it was resumed on Device 1

Refer this FAQ post by Toni for further clarification:
http://www.gpugrid.net/forum_thread.php?id=5002

You can solve the issue of processing stopping on one type of card and attempting to finish on another by changing your compute preference "Switch between tasks every XX minutes" to a larger value than the default 60. Change it to a value that will allow a task to finish on your slowest card; I suggest 360-640 minutes depending on your hardware.
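As far as I know, that website preference corresponds to `cpu_scheduling_period_minutes` in BOINC, so it can also be set locally with a `global_prefs_override.xml` in the BOINC data directory. A sketch, using the 360-minute value suggested above:

```xml
<global_preferences>
  <!-- "Switch between tasks every X minutes"; default is 60 -->
  <cpu_scheduling_period_minutes>360</cpu_scheduling_period_minutes>
</global_preferences>
```

Re-read the config files (or restart the client) after saving for the override to apply.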

Toni
Message 53049 - Posted: 22 Nov 2019 | 14:36:02 UTC

I'm looking for a confirmation that the app works on windows machine with > 1 device. I'm seeing some


7:33:28 (10748): wrapper: running acemd3.exe (--boinc input --device 2)
# Engine failed: Illegal value for DeviceIndex: 2

Keith Myers
Message 53051 - Posted: 22 Nov 2019 | 14:48:22 UTC - in response to Message 53045.

Why is CPU usage so high?
I expect GPU to be high, but CPU?
One thread running between 85-100+% on CPU

Because that is what the GPU application and wrapper require. The science application is faster and needs a constant supply of data fed to it by a CPU thread because of the higher GPU utilization. Tasks finish in 1/3 to 1/2 the time the old acemd2 app needed.

Keith Myers
Message 53052 - Posted: 22 Nov 2019 | 15:14:44 UTC - in response to Message 53039.

Thanks to all! To summarize some responses of the feedback above:

* GPU occupation is high (100% on my Linux machine)
* %/day is not an indication of performance because WU size differs between WU types
* Minimum required drivers, failures on notebook cards: see FAQ - thanks for those posting the links
* Tasks apparently stuck: may be an impression due to the % being rounded (e.g. 8h task divided in 100% fractions = no apparent progress for minutes)
* "Memory leaks": ignore the message, it's always there. The actual error, if present, is at the top.

Toni, new features are available for CUDA-MEMCHECK in CUDA 10.2. The CUDA-MEMCHECK tool seems useful; it can be run against the application with:
cuda-memcheck [memcheck_options] app_name [app_options]

https://docs.nvidia.com/cuda/cuda-memcheck/index.html#memcheck-tool

Erich56
Message 53053 - Posted: 22 Nov 2019 | 15:25:17 UTC - in response to Message 53049.

I'm looking for a confirmation that the app works on windows machine with > 1 device. I'm seeing some

7:33:28 (10748): wrapper: running acemd3.exe (--boinc input --device 2)
# Engine failed: Illegal value for DeviceIndex: 2


In one of my hosts I have 2 GTX980Ti cards. However, I have excluded one of them from GPUGRID via cc_config.xml, since one of its fans became defective. With regard to your request, I guess this does not matter.
At any rate, the other GPU processes the new app perfectly.

Greg _BE
Message 53054 - Posted: 22 Nov 2019 | 15:55:46 UTC - in response to Message 53048.

http://www.gpugrid.net/result.php?resultid=21502590

Crashed and burned after going 2% or more.
Memory leaks

Updated my drivers and have another task in queue.


The memory leaks do appear on startup, probably not critical errors.

The issue in your case is ACEMD3 tasks cannot start on one GPU and be resumed on another.

From your STDerr Output:
.....
04:26:56 (8564): wrapper: running acemd3.exe (--boinc input --device 0)
.....
06:08:12 (16628): wrapper: running acemd3.exe (--boinc input --device 1)
ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!


It was started on Device 0
but failed when it was resumed on Device 1

Refer this FAQ post by Toni for further clarification:
http://www.gpugrid.net/forum_thread.php?id=5002

Solve the issue of stopping processing one type of card and attempting to finish on another type of card by changing your compute preferences of "switch between tasks every xx minutes" to a larger value than the default 60. Change to a value that will allow the task to finish on your slowest card. I suggest 360-640 minutes depending on your hardware.


360 is already where it is set, since I also run LHC ATLAS, which does not like to be disturbed and usually finishes in 6 hrs.

I added a cc_config file to force your project to use just the 1050. I will double-check my placement a bit later.

Aurum
Message 53055 - Posted: 22 Nov 2019 | 18:57:10 UTC

The %Progress keeps resetting to zero on 2080 Ti's but seems normal on 1080 Ti's.
____________

Richard Haselgrove
Message 53056 - Posted: 22 Nov 2019 | 19:39:41 UTC - in response to Message 53049.

I'm looking for a confirmation that the app works on windows machine with > 1 device. I'm seeing some

7:33:28 (10748): wrapper: running acemd3.exe (--boinc input --device 2)
# Engine failed: Illegal value for DeviceIndex: 2

I'm currently running test340-TONI_GSNTEST3-3-100-RND9632_0 on a GTX 1660 SUPER under Windows 7, BOINC v7.16.3

The machine has a secondary GPU, but is running on the primary: command line looks correct, as

"acemd3.exe" --boinc input --device 0

Progress is displaying plausibly as 50.000% after 2 hours 22 minutes, updating in 1% increments only.

Richard Haselgrove
Message 53057 - Posted: 22 Nov 2019 | 22:03:04 UTC

Task completed and validated.

Aurum
Send message
Joined: 12 Jul 17
Posts: 131
Credit: 7,642,556,493
RAC: 3,980,674
Level
Tyr
Scientific publications
wat
Message 53058 - Posted: 22 Nov 2019 | 22:18:54 UTC - in response to Message 53055.

The %Progress keeps resetting to zero on 2080 Ti's but seems normal on 1080 Ti's.
My impression so far is that Win7-64 can run four WUs on two 1080 Ti's in the same computer fine.
The problem seems to be with 2080 Ti's running on Win7-64. Running four WUs on one 2080 Ti, with four Einstein or four Milkyway tasks on the second 2080 Ti, seems OK so far. Earlier, when I had two WUs on each 2080 Ti along with either two Einstein or two Milkyway tasks, it kept resetting.
All Linux computers with 1080 Ti's seem normal.
Plan to move my two 2080 Ti's back to a Linux computer and try that.

____________

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53059 - Posted: 23 Nov 2019 | 2:32:37 UTC - in response to Message 53058.

The %Progress keeps resetting to zero on 2080 Ti's but seems normal on 1080 Ti's.
My impression so far is that Win7-64 can run four WUs on two 1080 Ti's in the same computer fine.
The problem seems to be with 2080 Ti's running on Win7-64. Running four WUs on one 2080 Ti, with four Einstein or four Milkyway tasks on the second 2080 Ti, seems OK so far. Earlier, when I had two WUs on each 2080 Ti along with either two Einstein or two Milkyway tasks, it kept resetting.
All Linux computers with 1080 Ti's seem normal.
Plan to move my two 2080 Ti's back to a Linux computer and try that.

As a single ACEMD3 task can push the GPU to 100%, it would be interesting to see if there is any clear advantage to running multiple ACEMD3 tasks on a GPU.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53060 - Posted: 23 Nov 2019 | 2:42:50 UTC - in response to Message 53042.
Last modified: 23 Nov 2019 | 2:46:13 UTC

@rod4x4 which error? no resume on different cards is known, please see the faq.

Hi Toni
Not referring to any particular error.
When the ACEMD3 task (Child task) experiences an error, the Wrapper always reports a generic error (195) in the Exit Status:
Exit status 195 (0xc3) EXIT_CHILD_FAILED

Can the specific (Child) task error be passed to the Exit Status?

KAMasud
Send message
Joined: 27 Jul 11
Posts: 21
Credit: 157,044,390
RAC: 99,882
Level
Ile
Scientific publications
wat
Message 53061 - Posted: 23 Nov 2019 | 4:05:58 UTC

Okay, my 1060 with Max-Q design completed one task and validated.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53062 - Posted: 23 Nov 2019 | 5:16:43 UTC - in response to Message 53061.

Okay, my 1060 with Max-Q design completed one task and validated.

Good news.
Did you make any changes to the config after the first failure?

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 386
Credit: 4,837,651,939
RAC: 1,398,889
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53065 - Posted: 23 Nov 2019 | 10:30:42 UTC

My Windows 10 computer with the RTX 2080 Ti is finishing these WUs in about 6100 seconds, which is about the same time as computers running Linux with the same card.

Is the WDDM lag gone or is it my imagination?


The 18000 to 19000 second range is for these WUs running on the GTX 980 Ti.



http://www.gpugrid.net/results.php?hostid=263612&offset=0&show_names=0&state=0&appid=32




KAMasud
Send message
Joined: 27 Jul 11
Posts: 21
Credit: 157,044,390
RAC: 99,882
Level
Ile
Scientific publications
wat
Message 53066 - Posted: 23 Nov 2019 | 11:03:36 UTC

@rod4x4: I did make a change, but I do not know its relevance. I set SWAN_SYNC to 0; I did that for some other reason. Anyway, the second WU completed and validated.

Erich56
Send message
Joined: 1 Jan 15
Posts: 638
Credit: 3,157,242,642
RAC: 818,959
Level
Arg
Scientific publications
watwatwatwatwatwat
Message 53069 - Posted: 23 Nov 2019 | 13:12:50 UTC - in response to Message 53065.
Last modified: 23 Nov 2019 | 13:13:10 UTC

Is the WDDM lag gone or is it my imagination?

Given that the various tools now show a GPU utilization of mostly up to 99% or even 100% (as it was with WinXP before), it would seem to me that WDDM does not play a role any more.

eXaPower
Send message
Joined: 25 Sep 13
Posts: 281
Credit: 1,457,437,667
RAC: 162,551
Level
Met
Scientific publications
watwatwatwatwatwatwatwat
Message 53070 - Posted: 23 Nov 2019 | 13:32:02 UTC

WUs now require 1 CPU core each; WUs run slower on 4 or 5 GPUs with only 4 CPU cores.

3 GPUs run at full speed, while 4 or 5 cause GPU usage to tank.

ELISA WU GPU max power draw (55C GPU temp)

330W on 2080ti 95% GPU utilization (PCIe 3.0 x4) @ 1995MHz

115W on 1660 99% "" (PCIe 3.0 x4) @ 1995MHz

215W on 2080 89% "" (PCIe 2.0 x1) @ 1995MHz

Progress bar runtime (2080ti 1hr 40min) / (2080 2hr 40min) / (1660 5hr)

1660 runtime equal to the 980ti.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2078
Credit: 15,135,591,890
RAC: 4,579,208
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53071 - Posted: 23 Nov 2019 | 13:48:57 UTC - in response to Message 53065.

My Windows 10 computer with the RTX 2080 Ti is finishing these WUs in about 6100 seconds, which is about the same time as computers running Linux with the same card.

Is the WDDM lag gone or is it my imagination?
I came to this conclusion too. The runtimes on Windows 10 are about 10880 sec (3h 1m 20s) (11200 sec on my other host), while on Linux it's about 10280 sec (2h 51m 20s) on a GTX 1080 Ti (Linux is about 5.5% faster). These are different cards, and the fastest GPU appears to be the slowest in this list. It's possible that the CPU feeding the GPU(s) is more important for ACEMD3 than it was for ACEMD2, as my ACEMD3-wise slowest host has the oldest CPU (i7-4930K, which is 3rd gen: Ivy Bridge-E) while the other has an i3-4330 (which is 4th gen: Haswell). The other difference between the two Windows hosts is that the i7 had 2 rosetta@home tasks running, while the i3 had only ACEMD3 running. Now I have reduced the number of rosetta@home tasks to 1. I will suspend rosetta@home if there is a steady flow of GPUGrid workunits.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2078
Credit: 15,135,591,890
RAC: 4,579,208
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53072 - Posted: 23 Nov 2019 | 14:22:14 UTC - in response to Message 53069.
Last modified: 23 Nov 2019 | 14:23:21 UTC

Is the WDDM lag gone or is it my imagination?
Given that the various tools now show a GPU utilization of mostly up to 99% or even 100% (as it was with WinXP before), it would seem to me that WDDM does not play a role any more.
While this high readout of GPU usage could be misleading, I think it's true this time. I expected this to happen on Windows 10 v1703, but apparently it didn't. So it seems that older CUDA versions (8.0) don't have the appropriate drivers to get around WDDM, but CUDA 10 does.
I mentioned it at the end of a post almost 2 years ago.
There are new abbreviations from Microsoft to memorize (the links lead to TLDR pages, so click on them at your own risk):
DCH: Declarative Componentized Hardware supported apps
UWP: Universal Windows Platform
WDF: Windows Driver Frameworks
- KMDF: Kernel-Mode Driver Framework
- UMDF: User-Mode Driver Framework
This 'new' Windows Driver Framework is responsible for the 'lack of WDDM' and its overhead. Good work!

[AF>Libristes]on2vhf
Send message
Joined: 7 Oct 17
Posts: 2
Credit: 6,974,520
RAC: 98,772
Level
Ser
Scientific publications
wat
Message 53073 - Posted: 23 Nov 2019 | 15:09:36 UTC

Hi,
many thanks for adding new tasks, but please add even more lol
bests regards
laurent from Belgium

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 857
Credit: 4,301,782
RAC: 1
Level
Ala
Scientific publications
watwatwatwat
Message 53075 - Posted: 23 Nov 2019 | 16:45:21 UTC - in response to Message 53073.
Last modified: 23 Nov 2019 | 16:48:13 UTC

100% GPU use and low WDDM overhead are good news. However, they may be specific to this particular WU type; we'll see in the future. (The swan sync variable is ignored and plays no role.)

Profile [AF>Libristes] hermes
Send message
Joined: 11 Nov 16
Posts: 24
Credit: 317,357,618
RAC: 648,044
Level
Asp
Scientific publications
watwatwat
Message 53084 - Posted: 24 Nov 2019 | 12:18:53 UTC - in response to Message 53075.
Last modified: 24 Nov 2019 | 12:24:26 UTC

For me, 100% on the GPU is not the best ;-)
Because I have just one card in the PC, I can't watch videos when GPUGrid is running, even if I tell smplayer or vlc to use the CPU. So I have to pause this project when I use my PC.
Maybe one day we will be able to give some priority to the use of the GPU (on Linux).
I think I will buy a cheap card to handle the TV and play movies. But well, in general I am at work or somewhere else...

Nice to have some work. Folding@Home will wait. I was thinking of changing, but the other BOINC projects running on the GPU don't interest me.

Erich56
Send message
Joined: 1 Jan 15
Posts: 638
Credit: 3,157,242,642
RAC: 818,959
Level
Arg
Scientific publications
watwatwatwatwatwat
Message 53085 - Posted: 24 Nov 2019 | 12:58:27 UTC

there was a task which ended after 41 seconds with:
195 (0xc3) EXIT_CHILD_FAILED

stderr here: http://www.gpugrid.net/result.php?resultid=21514460

mmonnin
Send message
Joined: 2 Jul 16
Posts: 273
Credit: 653,868,889
RAC: 133,587
Level
Lys
Scientific publications
wat
Message 53087 - Posted: 24 Nov 2019 | 13:16:38 UTC

Are tasks being sent out for CUDA80 plan_class? I have only received new tasks on my 1080Ti with driver 418 and none on another system with 10/1070Ti with driver 396, which doesn't support CUDA100

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53088 - Posted: 24 Nov 2019 | 13:25:52 UTC - in response to Message 53085.

there was a task which ended after 41 seconds with:
195 (0xc3) EXIT_CHILD_FAILED

stderr here: http://www.gpugrid.net/result.php?resultid=21514460

Unfortunately, ACEMD3 no longer tells you the real error; the wrapper provides a meaningless generic message (error 195).
The task error in your STDerr Output is
# Engine failed: Particle coordinate is nan

I had this twice on one host. I'm not sure I'm completely correct, as ACEMD3 is a new beast we have to learn and tame, but in my case I reduced the overclocking and it seemed to fix the issue, though that could just be a coincidence.

ALAIN_13013
Avatar
Send message
Joined: 11 Sep 08
Posts: 16
Credit: 1,448,826,738
RAC: 144,368
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53089 - Posted: 24 Nov 2019 | 13:29:53 UTC - in response to Message 53084.

For me, 100% on the GPU is not the best ;-)
Because I have just one card in the PC, I can't watch videos when GPUGrid is running, even if I tell smplayer or vlc to use the CPU. So I have to pause this project when I use my PC.
Maybe one day we will be able to give some priority to the use of the GPU (on Linux).
I think I will buy a cheap card to handle the TV and play movies. But well, in general I am at work or somewhere else...

Nice to have some work. Folding@Home will wait. I was thinking of changing, but the other BOINC projects running on the GPU don't interest me.

That's exactly what I did: I installed a GT710 just for video output, and it works great, so my 980 Ti at 100% load doesn't bother me at all!
____________

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53090 - Posted: 24 Nov 2019 | 13:32:19 UTC - in response to Message 53087.

Are tasks being sent out for CUDA80 plan_class? I have only received new tasks on my 1080Ti with driver 418 and none on another system with 10/1070Ti with driver 396, which doesn't support CUDA100

Yes, CUDA80 is supported; see the apps page here: https://www.gpugrid.net/apps.php
Also see the FAQ for ACEMD3 here: https://www.gpugrid.net/forum_thread.php?id=5002

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 386
Credit: 4,837,651,939
RAC: 1,398,889
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53092 - Posted: 24 Nov 2019 | 15:43:53 UTC - in response to Message 53088.

there was a task which ended after 41 seconds with:
195 (0xc3) EXIT_CHILD_FAILED

stderr here: http://www.gpugrid.net/result.php?resultid=21514460

Unfortunately, ACEMD3 no longer tells you the real error; the wrapper provides a meaningless generic message (error 195).
The task error in your STDerr Output is
# Engine failed: Particle coordinate is nan

I had this twice on one host. I'm not sure I'm completely correct, as ACEMD3 is a new beast we have to learn and tame, but in my case I reduced the overclocking and it seemed to fix the issue, though that could just be a coincidence.



I had a couple of errors on my Windows 7 computer, and none on my Windows 10 computer, so far. In my case it's not overclocking, since I don't overclock.

http://www.gpugrid.net/results.php?hostid=494023&offset=0&show_names=0&state=0&appid=32

Yes, I do believe we need some more testing.





mmonnin
Send message
Joined: 2 Jul 16
Posts: 273
Credit: 653,868,889
RAC: 133,587
Level
Lys
Scientific publications
wat
Message 53093 - Posted: 24 Nov 2019 | 15:50:53 UTC - in response to Message 53090.

Are tasks being sent out for CUDA80 plan_class? I have only received new tasks on my 1080Ti with driver 418 and none on another system with 10/1070Ti with driver 396, which doesn't support CUDA100

Yes, CUDA80 is supported; see the apps page here: https://www.gpugrid.net/apps.php
Also see the FAQ for ACEMD3 here: https://www.gpugrid.net/forum_thread.php?id=5002


Then the app creates an odd situation on Linux, where it supposedly supports CUDA 8.0, but using it requires a newer driver than that.

What driver/card/OS combinations are supported?

Windows, CUDA80 Minimum Driver r367.48 or higher
Linux, CUDA92 Minimum Driver r396.26 or higher
Linux, CUDA100 Minimum Driver r410.48 or higher
Windows, CUDA101 Minimum Driver r418.39 or higher


There's not even a Linux CUDA92 plan_class, so I'm not sure what that's for in the FAQ.

klepel
Send message
Joined: 23 Dec 09
Posts: 171
Credit: 2,853,412,838
RAC: 547,670
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53096 - Posted: 24 Nov 2019 | 18:56:12 UTC

I just wanted to confirm: with a driver supporting CUDA100 or CUDA101, even a GTX 670 can crunch the "acemd3" app.

See computer: http://www.gpugrid.net/show_host_detail.php?hostid=486229

Although it will not make the 24-hour deadline, and I can tell the GPU is extremely stressed. I will run some more WUs on it to confirm that it can handle the new app, and afterwards it will go into its summer pause or might be retired from BOINC altogether.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 273
Credit: 653,868,889
RAC: 133,587
Level
Lys
Scientific publications
wat
Message 53098 - Posted: 24 Nov 2019 | 20:01:08 UTC - in response to Message 53093.

Are tasks being sent out for CUDA80 plan_class? I have only received new tasks on my 1080Ti with driver 418 and none on another system with 10/1070Ti with driver 396, which doesn't support CUDA100

Yes, CUDA80 is supported; see the apps page here: https://www.gpugrid.net/apps.php
Also see the FAQ for ACEMD3 here: https://www.gpugrid.net/forum_thread.php?id=5002


Then the app creates an odd situation on Linux, where it supposedly supports CUDA 8.0, but using it requires a newer driver than that.

What driver/card/OS combinations are supported?

Windows, CUDA80 Minimum Driver r367.48 or higher
Linux, CUDA92 Minimum Driver r396.26 or higher
Linux, CUDA100 Minimum Driver r410.48 or higher
Windows, CUDA101 Minimum Driver r418.39 or higher


There's not even a Linux CUDA92 plan_class, so I'm not sure what that's for in the FAQ.


And now I got the 1st CUDA80 task on that system w/o any driver changes.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53100 - Posted: 24 Nov 2019 | 21:24:31 UTC - in response to Message 53085.
Last modified: 24 Nov 2019 | 21:25:46 UTC

there was a task which ended after 41 seconds with:
195 (0xc3) EXIT_CHILD_FAILED

stderr here: http://www.gpugrid.net/result.php?resultid=21514460

Checking this task: it has failed on 8 computers, so it is simply a faulty work unit.
Clocking would not be the cause, as previously stated.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53101 - Posted: 24 Nov 2019 | 21:44:32 UTC - in response to Message 53092.

there was a task which ended after 41 seconds with:
195 (0xc3) EXIT_CHILD_FAILED

stderr here: http://www.gpugrid.net/result.php?resultid=21514460

Unfortunately, ACEMD3 no longer tells you the real error; the wrapper provides a meaningless generic message (error 195).
The task error in your STDerr Output is
# Engine failed: Particle coordinate is nan

I had this twice on one host. I'm not sure I'm completely correct, as ACEMD3 is a new beast we have to learn and tame, but in my case I reduced the overclocking and it seemed to fix the issue, though that could just be a coincidence.



I had a couple of errors on my Windows 7 computer, and none on my Windows 10 computer, so far. In my case it's not overclocking, since I don't overclock.

http://www.gpugrid.net/results.php?hostid=494023&offset=0&show_names=0&state=0&appid=32

Yes, I do believe we need some more testing


Agreed, testing will be an ongoing process...some errors cannot be fixed.

this task had an error code 194...
finish file present too long</message>

This error has been seen in ACEMD2 and listed as "Unknown"

Matt Harvey did a FAQ on error codes for ACEMD2 here
http://gpugrid.net/forum_thread.php?id=3468

icg studio
Send message
Joined: 24 Nov 11
Posts: 3
Credit: 82,250
RAC: 5
Level

Scientific publications
wat
Message 53102 - Posted: 24 Nov 2019 | 23:47:10 UTC

Finally CUDA 10.1! In other words, support for Turing CUDA cores.
My RTX 2060 has started crunching.
Will post runtimes later.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53103 - Posted: 25 Nov 2019 | 0:20:18 UTC - in response to Message 53101.
Last modified: 25 Nov 2019 | 0:24:00 UTC

this task had an error code 194...
finish file present too long</message>

This is a bug in the BOINC 7.14.2 client and earlier versions. You need to update to the 7.16 branch to fix it.
Identified/quantified in https://github.com/BOINC/boinc/issues/3017
And resolved for the client in:
https://github.com/BOINC/boinc/pull/3019
And in the server code in:
https://github.com/BOINC/boinc/pull/3300

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53104 - Posted: 25 Nov 2019 | 0:42:18 UTC - in response to Message 53103.
Last modified: 25 Nov 2019 | 0:57:09 UTC

this task had an error code 194...
finish file present too long</message>

This is a bug in the BOINC 7.14.2 client and earlier versions. You need to update to the 7.16 branch to fix it.
Identified/quantified in https://github.com/BOINC/boinc/issues/3017
And resolved for the client in:
https://github.com/BOINC/boinc/pull/3019
And in the server code in:
https://github.com/BOINC/boinc/pull/3300

Thanks for the info and links. Sometimes we overlook the BOINC client's performance.

From the Berkeley download page(https://boinc.berkeley.edu/download_all.php):

7.16.3 Development version
(MAY BE UNSTABLE - USE ONLY FOR TESTING)

and
7.14.2 Recommended version

This needs to be considered by volunteers; install the latest version if you are feeling adventurous. (Any issues you find will help the Berkeley team develop the new client.)

Alternatively,
- reducing the CPU load on your PC and/or
- ensuring the PC is not rebooted as the finish file is written,
may avert this error.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53105 - Posted: 25 Nov 2019 | 5:45:50 UTC

I haven't had a single instance of "finish file present" errors since moving to the 7.16 branch. I used to get a couple or more a day before on 7.14.2 or earlier.

It may be labelled an unstable development revision, but it is as close to general release stable as you can get. The only issue is that it is still in flux as more commits get added to it and the version number gets increased.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53109 - Posted: 25 Nov 2019 | 18:39:45 UTC - in response to Message 53084.

For me, 100% on GPU is not the best ;-)
Because I have just one card on the pc, and I can't see videos when GPUgrid is running. Even if I ask to smplayer or vlc to use the CPU So I have to pause this project when I use my pc.
Maybe one day we will can put some priority to the use of GPU (on linux).
I think I will buy a cheap card for manage the TV and play movies. But well, in general I am at work or somewhere else...

Nice to have some work. Folding@Home will wait. I was thinking to change, the others BOINC projects running on GPU doesn't interest me.



I see you have a RTX and a GTX.
You could save your GTX for video and general PC usage and put the RTX full time on GPU tasks.

I find it odd that you are having issues watching videos. I noticed that with my system as well, and it was not the GPU that was having trouble; it was the CPU that was overloaded. After I changed the CPU time to about 95%, I had no trouble watching videos.

After much tweaking of the way BOINC and all the projects I run use my system, I finally have it to where I can watch videos without any problems, and I use a GTX 1050 Ti as my primary card along with a Ryzen 2700, which has no integrated video.

There must be something overloading your system if you can't watch videos on an RTX GPU while running GPUGrid.

Wiyosaya
Send message
Joined: 22 Nov 09
Posts: 113
Credit: 523,669,003
RAC: 101,972
Level
Lys
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53110 - Posted: 25 Nov 2019 | 20:19:01 UTC
Last modified: 25 Nov 2019 | 20:20:26 UTC

I am getting high CPU/South bridge temps on one of my PCs with these latest work units.

The PC is http://www.gpugrid.net/show_host_detail.php?hostid=160668
and the current work unit is http://www.gpugrid.net/workunit.php?wuid=16866756

Every WU since November 22, 2019 has been exhibiting high temperatures on this PC. The previous apps never exhibited this. In addition, I found the PC unresponsive this afternoon. I was able to reboot; however, this does not give me a warm fuzzy feeling about continuing to run GPUGrid on this PC.

Anyone else seeing something similar or is there a solution for this?

Thanks.
____________

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2078
Credit: 15,135,591,890
RAC: 4,579,208
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53111 - Posted: 25 Nov 2019 | 23:03:47 UTC - in response to Message 53110.

I am getting high CPU/South bridge temps on one of my PCs with these latest work units.
That's because of two reasons:
1. The new app uses a whole CPU thread (or core, if there's no HT or SMT) to feed the GPU
2. The new app is not hindered by WDDM.

Every WU since November 22, 2019 has been exhibiting high temperatures on this PC. The previous apps never exhibited this.
That's because of two reasons:
1. The old app didn't feed the GPU with a full CPU thread unless the user configured it with the SWAN_SYNC environmental variable.
2. The performance of the old app was hindered by WDDM (under Windows Vista...10)

In addition, I found the PC unresponsive this afternoon. I was able to reboot, however, this does not give me a warm fuzzy feeling about continuing to run GPUGrid on this PC.

Anyone else seeing something similar or is there a solution for this?
There are a few options:
1. reduce the GPU's clock frequency (and the GPU voltage accordingly) or its power target.
2. increase cooling (cleaning fins, increasing air ventilation/fan speed).
If the card is overclocked (by you, or the factory) you should re-calibrate the overclock settings for the new app.
A small reduction in GPU voltage and frequency results in a perceptible decrease in power consumption (= heat output), as power consumption is proportional to the clock frequency multiplied by the square of the GPU voltage.
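As a quick illustration of that relation (a sketch assuming the common approximation P ∝ f · V² for dynamic power; real cards add static leakage on top of this):

```python
def relative_power(freq_scale: float, volt_scale: float) -> float:
    """Relative dynamic power after scaling clock frequency and voltage.

    Uses the approximation P ~ f * V**2, so an undervolt pays off
    quadratically while a downclock pays off only linearly.
    """
    return freq_scale * volt_scale ** 2

# Example: a 10% downclock combined with a 5% undervolt
print(relative_power(0.90, 0.95))  # ~0.81, i.e. roughly a 19% power reduction
```

This is why calibrating voltage together with frequency, as suggested above, gives a larger heat reduction than downclocking alone.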

RFGuy_KCCO
Send message
Joined: 13 Feb 14
Posts: 5
Credit: 666,856,737
RAC: 915,540
Level
Lys
Scientific publications
watwatwatwatwatwatwat
Message 53114 - Posted: 26 Nov 2019 | 4:32:11 UTC
Last modified: 26 Nov 2019 | 4:36:14 UTC

I have found that running GPU's at 60-70% of their stock power level is the sweet spot in the compromise between PPD and power consumption/temps. I usually run all of my GPU's at 60% power level.

icg studio
Send message
Joined: 24 Nov 11
Posts: 3
Credit: 82,250
RAC: 5
Level

Scientific publications
wat
Message 53119 - Posted: 26 Nov 2019 | 10:27:17 UTC - in response to Message 53102.
Last modified: 26 Nov 2019 | 10:28:37 UTC

Finally CUDA 10.1! In other words, support for Turing CUDA cores.
My RTX 2060 has started crunching.
Will post run-times later.


13134.75 seconds run-time @ RTX 2060, Ryzen 2600, Windows 10 1909.
Average GPU CUDA utilisation 99%.
No issues at all with these workunits.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 21
Credit: 157,044,390
RAC: 99,882
Level
Ile
Scientific publications
wat
Message 53126 - Posted: 26 Nov 2019 | 17:36:55 UTC - in response to Message 53111.

1. The old app didn't feed the GPU with a full CPU thread unless the user configured it with the SWAN_SYNC environmental variable.




Something was making my Climate models unstable and crashing them. That was the reason I lassoed in the GPU through SWAN_SYNC. Now my Climate models are stable. Plus I am getting better clock speeds.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 78
Credit: 1,250,277,476
RAC: 713,065
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53129 - Posted: 26 Nov 2019 | 19:16:01 UTC - in response to Message 53110.
Last modified: 26 Nov 2019 | 19:25:35 UTC

I am getting high CPU/South bridge temps on one of my PCs with these latest work units.

As commented in several threads on the GPUGrid forum, the new ACEMD3 tasks are challenging our computers to their maximum.
They can be taken as a true hardware quality control!
CPUs, GPUs, PSUs and MoBos all seem to be squeezed simultaneously while processing these tasks.
I'm thinking of printing stickers for my computers: "I processed ACEMD3 and survived" ;-)

Regarding your processor:
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
It has a rated TDP of 130W. A lot of heat to dissipate...
It was launched on Q3/2013.
If it has been running for more than three years, I would recommend renewing the CPU cooler's thermal paste.
A clean CPU cooler and fresh thermal paste usually help to reduce CPU temperature by several degrees.

Regarding chipset temperature:
I can't remember any motherboard whose chipset heatsink I could touch with confidence.
On most standard motherboards, chipset heat evacuation relies on passive convection heatsinks.
If there is room at the upper back of your computer case, I would recommend installing an extra fan to extract heated air and improve air circulation.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 704
Credit: 1,375,263,468
RAC: 99,044
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 53132 - Posted: 26 Nov 2019 | 22:56:01 UTC

Wow. My GTX 980 on Ubuntu 18.04.3 is running at 80C. It is a three-fan version, not overclocked, with a large heatsink. I don't recall seeing it above 65C before.

I can't tell about the CPU yet. It is a Ryzen 3700X, and apparently the Linux kernel does not fully support its temperature sensors yet. But "Tdie" and "Tctl", whatever they are, report 76C in Psensor.

That is good. I want to use my hardware, and it is getting colder around here.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53133 - Posted: 26 Nov 2019 | 23:56:56 UTC - in response to Message 53132.

Tdie is the actual CPU temperature of the 3700X. Tctl is the control temperature, which includes a package power limit offset. The offset is 0 on Ryzen 3000, 20°C on Ryzen 1000, and 10°C on Ryzen 2000. The offset is used for CPU fan control.
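The relation above is just a fixed subtraction; a tiny helper makes it concrete (a sketch; the per-generation offsets are the values quoted in this post, so verify them for your exact CPU):

```python
# Tctl-to-Tdie offsets per Ryzen generation, in degrees C (values as quoted above)
TCTL_OFFSET = {1: 20.0, 2: 10.0, 3: 0.0}

def tdie_from_tctl(tctl_c: float, ryzen_gen: int) -> float:
    """Return the real die temperature given the reported control temperature."""
    return tctl_c - TCTL_OFFSET[ryzen_gen]

print(tdie_from_tctl(72.2, 3))  # Ryzen 3000: no offset -> 72.2
print(tdie_from_tctl(95.0, 1))  # Ryzen 1000: 20 C offset -> 75.0
```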

Tdie and Tctl are provided by the k10temp driver. You still have access to the sensors command if you install lm-sensors.

Ryzen only provides a single monolithic temperature for all cores. It doesn't support individual core temperatures like Intel CPUs do.

If you have a ASUS motherboard with a WMI BIOS, there is a driver that can report all the sensors on the motherboard, the same as you would get in Windows.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 704
Credit: 1,375,263,468
RAC: 99,044
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 53136 - Posted: 27 Nov 2019 | 2:54:05 UTC - in response to Message 53133.

Thanks. It is an ASRock board, and it probably has the same capability. I will look around.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53139 - Posted: 27 Nov 2019 | 4:34:07 UTC - in response to Message 53136.
Last modified: 27 Nov 2019 | 4:34:37 UTC

AFAIK, only ASUS implemented a WMI BIOS to overcome the limitations and restrictions of using a crappy SIO chip on most of their boards.

The latest X570 boards went with a tried-and-true NCT6775 SIO chip that is well supported in both Windows and Linux.

To give you an idea of what the independent developer accomplished with his asus-wmi-sensors driver, this is the output of the sensors command on my Crosshair VII Hero motherboard:

keith@Serenity:~$ sensors
asus-isa-0000
Adapter: ISA adapter
cpu_fan: 0 RPM

asuswmisensors-isa-0000
Adapter: ISA adapter
CPU Core Voltage: +1.24 V
CPU SOC Voltage: +1.07 V
DRAM Voltage: +1.42 V
VDDP Voltage: +0.64 V
1.8V PLL Voltage: +2.14 V
+12V Voltage: +11.83 V
+5V Voltage: +4.80 V
3VSB Voltage: +3.36 V
VBAT Voltage: +3.27 V
AVCC3 Voltage: +3.36 V
SB 1.05V Voltage: +1.11 V
CPU Core Voltage: +1.26 V
CPU SOC Voltage: +1.09 V
DRAM Voltage: +1.46 V
CPU Fan: 1985 RPM
Chassis Fan 1: 0 RPM
Chassis Fan 2: 0 RPM
Chassis Fan 3: 0 RPM
HAMP Fan: 0 RPM
Water Pump: 0 RPM
CPU OPT: 0 RPM
Water Flow: 648 RPM
AIO Pump: 0 RPM
CPU Temperature: +72.0°C
CPU Socket Temperature: +45.0°C
Motherboard Temperature: +36.0°C
Chipset Temperature: +52.0°C
Tsensor 1 Temperature: +216.0°C
CPU VRM Temperature: +50.0°C
Water In: +216.0°C
Water Out: +35.0°C
CPU VRM Output Current: +71.00 A

k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +72.2°C (high = +70.0°C)
Tctl: +72.2°C

keith@Serenity:~$

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53141 - Posted: 27 Nov 2019 | 4:37:55 UTC

So you can at least look at the driver project at github, this is the link.
https://github.com/electrified/asus-wmi-sensors

Jim1348
Send message
Joined: 28 Jul 12
Posts: 704
Credit: 1,375,263,468
RAC: 99,044
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 53146 - Posted: 27 Nov 2019 | 8:03:28 UTC - in response to Message 53141.
Last modified: 27 Nov 2019 | 8:18:14 UTC

OK, I will look at it occasionally. I think Psensor is probably good enough. Fortunately, the case has room for two (or even three) 120 mm fans side by side, so I can cool the length of the card better, I just don't normally have to.

It is now running Einstein FGRBP1, and is down to 69C. It will probably go down a little more.

EDIT:
I also have a Ryzen 3600 machine with the same motherboard (ASRock B450M PRO4) and BIOS. Tdie and Tctl are reporting 95C. I will shut it down and put on a better cooler; I just used the stock one that came with the CPU.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53147 - Posted: 27 Nov 2019 | 8:30:16 UTC

I am running a GTX 1050 at full load and full OC and it goes to only 56C. Fan speed is about 90% of capacity.

For heat, my system with a Ryzen7 2700 running at 40.75 GHZ in OC, and running a wide range of projects from LHC (including ATLAS) to easier stuff like Rosetta plus this new stuff, rarely gets above 81C.

I upgraded my case recently and have a Corsair case with 2x 240 fans on the front intake, 1x 120 exhaust fan on the rear and 1x 120 intake fan on the bottom. Cooling is with an Arctic Cooling Freezer using stock fans, which are as good as or better than Noctua fans. My top grill can take a 360mm radiator, but I opted for a 240 due to budget, so I have one extra slot for hot air to escape.

I burned out a Corsair single radiator with push/pull fans after 3 years. And a gamer I met at an electronics box store in the US while I was home visiting family told me she uses the Arctic radiator and has no problems. It is also refillable.

This kind of cooling is what I consider the best short of a gas-cooled system or one of those 1,000-dollar external systems.

computezrmle
Send message
Joined: 10 Jun 13
Posts: 9
Credit: 122,193,551
RAC: 294,103
Level
Cys
Scientific publications
wat
Message 53148 - Posted: 27 Nov 2019 | 11:27:59 UTC

@ Keith Myers
This would be steam!

Water In: +216.0°C




@ Greg _BE
my system with a Ryzen7 2700 running at 40.75 GHZ ... rarely gets above 81C.

Wow!!



@ Jim1348
This is the output from the standard sensors package.

>sensors nct6779-isa-0290
nct6779-isa-0290
Adapter: ISA adapter
Vcore:          +0.57 V  (min =  +0.00 V, max =  +1.74 V)
in1:            +1.09 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:           +3.23 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:          +3.23 V  (min =  +2.98 V, max =  +3.63 V)
in4:            +1.79 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:            +0.92 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:            +1.35 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:           +3.46 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:           +3.28 V  (min =  +2.70 V, max =  +3.63 V)
in9:            +0.00 V  (min =  +0.00 V, max =  +0.00 V)
in10:           +0.75 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:           +0.78 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:           +1.66 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:           +0.91 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:           +0.74 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:          3479 RPM  (min =    0 RPM)
fan2:             0 RPM  (min =    0 RPM)
fan3:             0 RPM  (min =    0 RPM)
fan4:             0 RPM  (min =    0 RPM)
fan5:             0 RPM  (min =    0 RPM)
SYSTIN:         +40.0°C  (high =  +0.0°C, hyst =  +0.0°C)  sensor = thermistor
CPUTIN:         +48.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:         +8.0°C  sensor = thermistor
AUXTIN1:        +40.0°C  sensor = thermistor
AUXTIN2:        +38.0°C  sensor = thermistor
AUXTIN3:        +40.0°C  sensor = thermistor
SMBUSMASTER 0:  +57.5°C
PCH_CHIP_CPU_MAX_TEMP:   +0.0°C
PCH_CHIP_TEMP:   +0.0°C
PCH_CPU_TEMP:    +0.0°C
intrusion0:    ALARM
intrusion1:    ALARM
beep_enable:   disabled


The real Tdie is shown as "SMBUSMASTER 0" already reduced by 27° (Threadripper offset) using the following formula in /etc/sensors.d/x399.conf
chip "nct6779-isa-0290"
    compute temp7 @-27, @+27
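For readers unfamiliar with libsensors compute statements: the first expression converts the raw reading to the displayed value, and the second is its inverse, used when writing limits back. A minimal sketch of the 27° offset arithmetic (the function names are illustrative only):

```python
# Sketch of what "compute temp7 @-27, @+27" does in a sensors.conf file.
# '@' stands for the raw register value; the first expression maps
# raw -> displayed, the second maps displayed -> raw (for setting limits).

def displayed(raw_c: float) -> float:
    """@-27: subtract the Threadripper reporting offset."""
    return raw_c - 27.0

def raw(displayed_c: float) -> float:
    """@+27: inverse mapping, displayed value back to raw."""
    return displayed_c + 27.0

# The SMBUSMASTER 0 reading of +57.5 °C corresponds to a raw +84.5 °C.
```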

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53149 - Posted: 27 Nov 2019 | 12:42:59 UTC - in response to Message 53148.
Last modified: 27 Nov 2019 | 13:12:48 UTC

No... it's just 177F. No idea where you got that value from.
Water boils at 212F, and that would trigger a thermal shutdown on the CPU.

AMD specifies 85°C as the maximum safe temperature for a Ryzen™ 7 2700X processor. That's the chip with the video co-processor; I am just pure CPU.
From what I can see, thermal shutdown is 100-115C according to some posts.

If the chip is 80C, then I guess the outgoing water would be that, but the radiator does not feel that hot. According to NZXT CAM monitoring I am only using 75% of the temperature range.

Checked the AMD website. Max temp for my CPU is 95C, or 203F. So I am well within the limits of the design specs of this CPU. Your temperature calculation was way off.




@ Keith Myers
This would be steam!
Water In: +216.0°C





@ Greg _BE
my system with a Ryzen7 2700 running at 40.75 GHZ ... rarely gets above 81C.

Wow!!




rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53150 - Posted: 27 Nov 2019 | 13:04:30 UTC

@ Keith Myers
This would be steam!
Water In: +216.0°C


I saw the same thing. Funny Huh!

Jim1348
Send message
Joined: 28 Jul 12
Posts: 704
Credit: 1,375,263,468
RAC: 99,044
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 53151 - Posted: 27 Nov 2019 | 13:15:32 UTC
Last modified: 27 Nov 2019 | 13:47:59 UTC

The heatsink on the Ryzen 3600 that reports Tdie and Tctl at 95C is only moderately warm to the touch. That was the case when I installed it.

So there are two possibilities: the heatsink is not making good contact to the chip, or else the reading is wrong. I will find out soon.

EDIT: The paste is spread out properly and sticking fine. It must be that the Ryzen 3600 reports the temps differently, or else it is interpreted wrongly. That is my latest build, by the way, put together within the last month.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53152 - Posted: 27 Nov 2019 | 13:16:36 UTC
Last modified: 27 Nov 2019 | 13:17:46 UTC

https://i.pinimg.com/originals/94/63/2d/94632de14e0b1612e4c70111396dc03f.jpg

C-to-F degrees conversion chart
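The disputed figures in this thread can also be checked directly with the standard conversion formula; a quick sketch (temperatures taken from the posts above):

```python
def c_to_f(c: float) -> float:
    """Convert Celsius to Fahrenheit: F = C * 9/5 + 32."""
    return c * 9.0 / 5.0 + 32.0

# 80 °C (the reported CPU temperature) is 176 °F,
# while the bogus +216.0 °C sensor reading would be about 420.8 °F.
print(c_to_f(80))             # 176.0
print(round(c_to_f(216), 1))  # 420.8
```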

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53153 - Posted: 27 Nov 2019 | 13:20:31 UTC

I have checked my system with HW Monitor, CAM, MSI Command Center and Ryzen Master. All report the same thing: 80C, and AMD says max 95C before shutdown.

I'll leave it at that.

computezrmle
Send message
Joined: 10 Jun 13
Posts: 9
Credit: 122,193,551
RAC: 294,103
Level
Cys
Scientific publications
wat
Message 53154 - Posted: 27 Nov 2019 | 13:30:37 UTC - in response to Message 53149.

No idea where you got that value from.

I got it from this message:
http://www.gpugrid.net/forum_thread.php?id=5015&nowrap=true#53139
If this is really °C then 216 would be steam or
if it is °F then 35 would be close to ice.
Water In: +216.0°C
Water Out: +35.0°C



If the chip is 80C, then I guess the outgoing water would be that, but the radiator does not feel that hot.

Seriously (don't try this!) -> any temp >60 °C would burn your fingers.
Most components used in watercooling circuits are specified for a Tmax (water!) of 65 °C.
Any cooling medium must be (much) cooler than the device to establish a heat flow.


But are you sure you really run your Ryzen at 40.75 GHZ?
It's from this post:
http://www.gpugrid.net/forum_thread.php?id=5015&nowrap=true#53147
;-)

Aurum
Send message
Joined: 12 Jul 17
Posts: 131
Credit: 7,642,556,493
RAC: 3,980,674
Level
Tyr
Scientific publications
wat
Message 53156 - Posted: 27 Nov 2019 | 15:06:26 UTC - in response to Message 53148.

This would be steam!
Water In: +216.0°C
Not at 312 PSIA.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53159 - Posted: 27 Nov 2019 | 15:58:46 UTC - in response to Message 53150.
Last modified: 27 Nov 2019 | 16:04:19 UTC

@ Keith Myers
This would be steam!
Water In: +216.0°C


I saw the same thing. Funny Huh!

No, it is just the value you get from an unterminated input on the ASUS boards. Put a standard 10K thermistor on it and it reads normally.

Just ignore any input with the +216.0 °C value. If it annoys you, you could fabricate two-pin headers with a resistor to pull the inputs down.

klepel
Send message
Joined: 23 Dec 09
Posts: 171
Credit: 2,853,412,838
RAC: 547,670
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53163 - Posted: 27 Nov 2019 | 19:43:32 UTC

I just made an interesting observation comparing my computers with GTX1650 and GTX1660ti with ServicEnginIC´s computers:
http://www.gpugrid.net/show_host_detail.php?hostid=147723
http://www.gpugrid.net/show_host_detail.php?hostid=482132
mine:
http://www.gpugrid.net/show_host_detail.php?hostid=512242
http://www.gpugrid.net/show_host_detail.php?hostid=512293
The computers of ServicEnginIC are approx. 10% slower than mine. His CPUs are both Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz; mine are two AMD Ryzen 5 2600 six-core processors.
Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs? Is this the experience from other users as well?

biodoc
Send message
Joined: 26 Aug 08
Posts: 161
Credit: 1,407,476,347
RAC: 28,976
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53167 - Posted: 27 Nov 2019 | 20:50:07 UTC - in response to Message 53151.
Last modified: 27 Nov 2019 | 20:58:05 UTC

The heatsink on the Ryzen 3600 that reports Tdie and Tctl at 95C is only moderately warm to the touch. That was the case when I installed it.

So there are two possibilities: the heatsink is not making good contact to the chip, or else the reading is wrong. I will find out soon.

EDIT: The paste is spread out properly and sticking fine. It must be that the Ryzen 3600 reports the temps differently, or else is interpreted wrong. That is my latest build by the way, put together within the last month.


One option to cool your processor down a bit is to run it at base frequency using the cTDP and PPL (package power limit) settings in the bios. Both are set at auto in the "optimized defaults" bios setting. AMD and the motherboard manufacturers assume we are gamers or enthusiasts that want to automatically overclock the processors to the thermal limit.

Buried somewhere in the bios AMD CBS folder there should be an option to set the cTDP and PPL to manual mode. When set to manual you can key in values for watts. I have my 3700X rigs set to 65 and 65 watts for cTDP and PPL. My 3900X is set to 105 and 105 watts respectively. The numbers come from the TDP of the processor. So for a 3600 it would be 65 and for a 3600X the number is 95 watts. Save the bios settings and the processor will now run at base clock speed at full load and will draw quite a bit less power at the wall.

Here's some data I collected on my 3900X.

3900X (105 TDP; AGESA 1.0.0.3 ABBA) data running WCG at full load:

bios optimized defaults (PPL at 142?): 4.0 GHz pulls 267 watts at the wall.
TDP/PPL (package power limit) set at 105/105: 3.8 GHz pulls 218 watts at the wall
TDP/PPL set at 65/88: 3.7 GHz pulls 199 watts at the wall
TDP/PPL set at 65/65: 3.0 GHz pulls 167 watts at the wall

3.8 to 4.0 GHz requires 49 watts
3.7 to 4.0 GHz requires 68 watts
3.7 to 3.8 GHz requires 19 watts
3.0 to 3.7 GHz requires 32 watts

Note: The latest bios with 1.0.0.4 B does not allow me to underclock using TDP/PPL bios settings.
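A quick sketch recomputing the per-step wall-power deltas from the measurements listed above (dictionary keys are the sustained clocks in GHz):

```python
# Wall-power draw at each sustained clock, from the 3900X measurements above.
watts_at_ghz = {4.0: 267, 3.8: 218, 3.7: 199, 3.0: 167}

def delta(lo: float, hi: float) -> int:
    """Extra wall watts needed to go from clock lo to clock hi."""
    return watts_at_ghz[hi] - watts_at_ghz[lo]

print(delta(3.8, 4.0))  # 49 W for the last 0.2 GHz
print(delta(3.7, 4.0))  # 68 W
print(delta(3.7, 3.8))  # 19 W
print(delta(3.0, 3.7))  # 32 W
```

The steeply rising cost per 0.1 GHz near the top of the curve is why capping cTDP/PPL saves so much power for so little clock speed.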

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2078
Credit: 15,135,591,890
RAC: 4,579,208
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53168 - Posted: 27 Nov 2019 | 21:43:14 UTC - in response to Message 53163.
Last modified: 27 Nov 2019 | 21:43:48 UTC

Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs?
Is this the experience from other users as well?
I have similar experiences with my hosts.

Pop Piasa
Send message
Joined: 8 Aug 19
Posts: 2
Credit: 17,669,900
RAC: 145,413
Level
Pro
Scientific publications
wat
Message 53170 - Posted: 27 Nov 2019 | 22:09:13 UTC - in response to Message 53031.

Thank you Rod4x4, I later saw the first WU speed up and subsequent units have been running over 12%/Hr without issues. Guess I jumped on that too fast. The 1% increments are OK with me. Thanks again.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 704
Credit: 1,375,263,468
RAC: 99,044
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 53171 - Posted: 27 Nov 2019 | 22:24:42 UTC - in response to Message 53167.

The heatsink on the Ryzen 3600 that reports Tdie and Tctl at 95C is only moderately warm to the touch. That was the case when I installed it.

So there are two possibilities: the heatsink is not making good contact to the chip, or else the reading is wrong. I will find out soon.

EDIT: The paste is spread out properly and sticking fine. It must be that the Ryzen 3600 reports the temps differently, or else is interpreted wrong. That is my latest build by the way, put together within the last month.


One option to cool your processor down a bit is to run it at base frequency using the cTDP and PPL (package power limit) settings in the bios. Both are set at auto in the "optimized defaults" bios setting. AMD and the motherboard manufacturers assume we are gamers or enthusiasts that want to automatically overclock the processors to the thermal limit.

Thanks, but I believe you misread me. The CPU is fine. The measurement is wrong.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 78
Credit: 1,250,277,476
RAC: 713,065
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53172 - Posted: 27 Nov 2019 | 22:30:57 UTC

The computers of ServicEnginIC are approx. 10% slower than mine. His CPUs are Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz and Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz, mine are two AMD Ryzen 5 2600 Six-Core Processors.
Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs?

I have similar experiences with my hosts.

+1

And some other cons for my veteran rigs:
- DDR3-1333 DRAM
- Both motherboards are PCIe 2.0, probably bottlenecking the newest PCIe 3.0 cards

A 10% performance loss seems consistent with all of this.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53175 - Posted: 28 Nov 2019 | 0:49:23 UTC - in response to Message 53171.

Thanks, but I believe you misread me. The CPU is fine. The measurement is wrong.

No, I believe the measurement is incorrect, but it is still going to be rather high in actuality. The Ryzen 3600 ships with the Wraith Stealth cooler, which is just the normal Intel-style solution of a copper plug embedded in an aluminum casting. It just doesn't have the ability to quickly move heat away from the IHS.

You would see much better temps if you switched to the Wraith MAX or Wraith Prism cooler which have real heat pipes and normal sized fans.

The temps are correct for the Ryzen and Ryzen+ CPUs, but the k10temp driver that is stock in Ubuntu didn't get the change needed to accommodate the Zen 2 (Ryzen 3000) CPUs with the correct 0°C temp offset. That is only shipping in the 5.3.4 or 5.4 kernels.

https://www.phoronix.com/scan.php?page=news_item&px=AMD-Zen2-k10temp-Patches

There are other solutions you could use in the meantime like the ASUS temp driver if you have a compatible motherboard or there also is a zenpower driver that can report the proper temp as well as the cpu power.

https://github.com/ocerman/zenpower

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53176 - Posted: 28 Nov 2019 | 0:56:03 UTC - in response to Message 53154.
Last modified: 28 Nov 2019 | 1:05:06 UTC

Damn! Wishful thinking!

How about 4.075? Too many numbers on my screen.
It shows 4075, and I automatically dropped the decimal point in two places over without realizing my mistake!

As far as temperature goes, I am only reporting the CPU temp at the sensor point.
I have sent a webform to Arctic asking what the temp would be at the radiator after passing the CPU heatsink. The exhaust air does not feel anywhere near 80; I would put it at around 40C or less.

I will see what they say and let you know.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53177 - Posted: 28 Nov 2019 | 1:06:19 UTC
Last modified: 28 Nov 2019 | 1:07:59 UTC

Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)

It runs 1524 seconds and bombs.
What's up with that?

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 704
Credit: 1,375,263,468
RAC: 99,044
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 53178 - Posted: 28 Nov 2019 | 3:06:52 UTC - in response to Message 53175.

The temps are correct for the Ryzen and Ryzen+ cpus, but the k10temp driver which is stock in Ubuntu didn't get the change needed to accommodate the Ryzen 2 cpus with the correct 0 temp offset. That only is shipping in the 5.3.4 or 5.4 kernels.

Then it is probably reading 20C too high, and the CPU is really at 75C.
Yes, I can improve on that. Thanks.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53179 - Posted: 28 Nov 2019 | 3:47:00 UTC - in response to Message 53177.

Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)

It runs 1524 seconds and bombs.
What's up with that?

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.


# Engine failed: Particle coordinate is nan

Two issues can cause this error:
1. An error in the task itself. This would mean all hosts fail the task. See this link for details: https://github.com/openmm/openmm/issues/2308
2. If other hosts do not fail the task, the error could be in the GPU clock rate. I have tested this on one of my hosts and am able to reproduce this error when I clock the GPU too high.

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050.

One setting to try: in BOINC Manager, Computing Preferences, set "Switch between tasks every xxx minutes" to between 800 and 9999. This should allow the task to finish on the same GPU it started on.
Can you post your app_config.xml file contents?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53180 - Posted: 28 Nov 2019 | 4:34:21 UTC

I've had a couple of the NaN errors: one where everyone errors out the task, and another recently where it errored out after running through to completion. I had already removed all overclocking on the card, but it still must have been too hot for the stock clock rate. It is my hottest card, being sandwiched in the middle of the GPU stack with very little airflow. I am going to have to start putting a negative clock offset on it to get the temps down, I think, to avoid any further NaN errors on that card.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53181 - Posted: 28 Nov 2019 | 5:52:47 UTC - in response to Message 53180.
Last modified: 28 Nov 2019 | 5:55:02 UTC

I've had a couple of the NaN errors. One where everyone errors out the task and another recently where it errored out after running through to completion. I had already removed all overclocking on the card but it still must have been too hot for the stock clockrate. It is my hottest card being sandwiched in the middle of the gpu stack with very little airflow. I am going to have to start putting in negative clock offset on it to get the temps down I think to avoid any further NaN errors on that card.

Would be interested to hear if the Under Clocking / Heat reduction fixes the issue.
I am fairly confident this is the issue, but need validation / more data from fellow volunteers to be sure.

Erich56
Send message
Joined: 1 Jan 15
Posts: 638
Credit: 3,157,242,642
RAC: 818,959
Level
Arg
Scientific publications
watwatwatwatwatwat
Message 53183 - Posted: 28 Nov 2019 | 6:32:03 UTC - in response to Message 53163.

http://www.gpugrid.net/show_host_detail.php?hostid=147723
http://www.gpugrid.net/show_host_detail.php?hostid=482132
...
Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs? Is this the experience from other users as well?


that's really interesting: the comparison of the above two hosts shows that the host with the GTX1660ti yields lower GFLOP figures (single as well as double precision) than the host with the GTX1650.
In both hosts, the CPU is the same: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz.

And now the even more surprising fact: by coincidence, exactly the same CPU is running in one of my hosts (http://www.gpugrid.net/show_host_detail.php?hostid=205584) with a GTX750ti - and here the GFLOP figures are even markedly higher than in the above-cited hosts with more modern GPUs.

So, is the conclusion now: the weaker the GPU, the higher the number of GFLOPs generated by the system?

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2078
Credit: 15,135,591,890
RAC: 4,579,208
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53184 - Posted: 28 Nov 2019 | 9:28:26 UTC - in response to Message 53183.
Last modified: 28 Nov 2019 | 9:34:47 UTC

http://www.gpugrid.net/show_host_detail.php?hostid=147723
http://www.gpugrid.net/show_host_detail.php?hostid=482132
...
Might it be that the Wrapper is slower on slower CPUs and therefore slows down the GPUs? Is this the experience from other users as well?
that's really interesting: the comparison of above two tasks shows that the host with the GTX1660ti yields lower GFLOP figures (single as well as double) as the host with the GTX1650.
In both hosts, the CPU ist the same: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz.

And now the even more surprising fact: by coincidence, exactly the same CPU is running in one of my hosts (http://www.gpugrid.net/show_host_detail.php?hostid=205584) with a GTX750ti - and here the GFLOP figures are even markedly higher than in the abeove cited hosts with more modern GPUs.

So, is the conclusion now: the weaker the GPU, the higher the number of GFLOPs generated by the system?
The "Integer" speed measured (I hope it's called this way in English) is much higher under Linux than under Windows
(the 1st and 2nd hosts use Linux, the 3rd uses Windows).
See the stats of my dual boot host:
Linux 139876.18 - Windows 12615.42
There's more than one order of magnitude difference between the two OSes on the same hardware; one of them must be wrong.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53185 - Posted: 28 Nov 2019 | 13:27:26 UTC - in response to Message 53176.

Damn! Wishful thinking!

How about 4.75. To many numbers on my screen
It's because it shows 4075 and then I automatically drop in the . at 2 places not realizing my mistake!

As far as temperature goes, I am only reporting the CPU temp at the sensor point.
I have sent a webform to Arctic asking them what the temp would be at the radiator after passing by the CPU heatsink. The air temp of the exhaust air does not feel anywhere near 80. I would put it down around 40C or less.

I will see what they say and let you know.

------------------------

Hi Greg

I talked to my colleague who is in the Liquid Freezer II Dev. Team, and he said that these temps are normal with this kind of load.
Installation sounds good to me.


With kind regards


Your ARCTIC Team,
Stephan
Arctic/Service Manager

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53186 - Posted: 28 Nov 2019 | 13:30:27 UTC - in response to Message 53179.

Tony - I keep getting this on random tasks
unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
13:11:40 (25792): wrapper (7.9.26016): starting
13:11:40 (25792): wrapper: running acemd3.exe (--boinc input --device 1)
# Engine failed: Particle coordinate is nan
13:37:25 (25792): acemd3.exe exited; CPU time 1524.765625
13:37:25 (25792): app exit status: 0x1
13:37:25 (25792): called boinc_finish(195)

It runs 1524 seconds and bombs.
What's up with that?

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050. I see another task that is starting on the 1050 and then jumping to the 650 since the 1050 is tied up with another project.


# Engine failed: Particle coordinate is nan

Two issues can cause this error:
1. Error in the Task. This would mean all Hosts fail the task. See this link for details: https://github.com/openmm/openmm/issues/2308
2. If other Hosts do not fail the task, the error could be in the GPU Clock rate. I have tested this on one of my hosts and am able to produce this error when I Clock the GPU too high.

It also appears that BOINC or the task is ignoring the appconfig command to use only my 1050.

One setting to try....In Boinc Manager, Computer Preferences, set the "Switch between tasks every xxx minutes" to between 800 - 9999. This should allow the task to finish on the same GPU it started on.
Can you post your app_config.xml file contents?


--------------------

<?xml version="1.0"?>

<app_config>
<exclude_gpu>
<url>www.gpugrid.net</url>
<device_num>1</device_num>
<type>NVIDIA</type>
</exclude_gpu>
</app_config>

I was having some issues with LHC ATLAS and was in the process of pausing the tasks and then shutting down the client. In this process I discovered that another instance popped up right after I closed the one I was looking at, and then yet another instance popped up with a message saying that there were two running. I shut that down, and it shut down the last instance. This is a first for me.

I have restarted my computer and now will wait and see what's going on.

computezrmle
Send message
Joined: 10 Jun 13
Posts: 9
Credit: 122,193,551
RAC: 294,103
Level
Cys
Scientific publications
wat
Message 53187 - Posted: 28 Nov 2019 | 16:52:29 UTC - in response to Message 53186.

What you posted is a mix of app_config.xml and cc_config.xml.
Be so kind as to strictly follow the hints and examples on this page:
https://boinc.berkeley.edu/wiki/Client_configuration
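For reference, a minimal valid app_config.xml contains only app-level settings; an exclude_gpu element never goes there. A sketch, assuming the app name acemd3 matches the project's app list:

```xml
<app_config>
    <app>
        <name>acemd3</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```

GPU exclusions, by contrast, belong in cc_config.xml in the BOINC data directory.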

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53189 - Posted: 28 Nov 2019 | 19:24:12 UTC - in response to Message 53187.
Last modified: 28 Nov 2019 | 19:26:52 UTC

What you posted is a mix of app_config.xml and cc_config.xml.
Be so kind as to strictly follow the hints and examples on this page:
https://boinc.berkeley.edu/wiki/Client_configuration



You gave me a page on cc_config. I jumped down to what appears to be the stuff related to app_config and copied this

<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>

project URL is www.gpugrid.net
device = 1
type is NVIDIA
removed the app name since the app name changes so much

*****GPUGRID: Notice from BOINC
Missing <app_config> in app_config.xml
11/28/2019 8:24:51 PM***

This is why I had it in the text.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53190 - Posted: 28 Nov 2019 | 19:57:02 UTC

What the heck now???!!!

A slew of "Exit child" errors! What is this? Are these speed problems from the OC?
Also getting "restart on different device" errors!!!
Now this... is that because something is not right in the app_config?

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 181
Credit: 4,144,805,476
RAC: 609,649
Level
Arg
Scientific publications
watwatwatwat
Message 53191 - Posted: 28 Nov 2019 | 20:05:23 UTC - in response to Message 53189.



You give me a page on CC config. I jumped down to what appears to be stuff related to app_config and copied this

<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>

project id is the gpugrid.net
device = 1
type is nvidia
removed app name since app name changes so much

*****GPUGRID: Notice from BOINC
Missing <app_config> in app_config.xml
11/28/2019 8:24:51 PM***

This is why I had it in the text.



<cc_config>
<options>
<exclude_gpu>
<url>project_URL</url>
[<device_num>N</device_num>]
[<type>NVIDIA|ATI|intel_gpu</type>]
[<app>appname</app>]
</exclude_gpu>
</options>
</cc_config>

This needs to go into the Boinc folder not the GPUGrid project folder
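For example, a filled-in version with the values Greg described (GPUGRID's URL and NVIDIA device 1; the device number here is an assumption and must match how BOINC enumerates the cards in your Event Log) would look like this. Note that <exclude_gpu> sits inside <options>, and the square brackets in the documentation template mean "optional" and must not be typed into the actual file:

```xml
<cc_config>
  <options>
    <!-- Keep GPUGRID tasks off NVIDIA device 1; other projects may still use it. -->
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>1</device_num>
      <type>NVIDIA</type>
    </exclude_gpu>
  </options>
</cc_config>
```

Save it as cc_config.xml in the BOINC data directory, then restart the client or use Options > Read config files.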

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53192 - Posted: 28 Nov 2019 | 20:56:21 UTC - in response to Message 53190.

If you are going to use an exclude, then you need to exclude all devices other than the one you want to use. That is how to get rid of "restart on different device" errors. Or just set the switch-between-tasks interval to 360 minutes or greater and don't exit BOINC while the task is running.

The device number you use in the exclude statement is defined by how BOINC enumerates the cards in the Event Log at startup.

The gpu_exclude statement goes into cc_config.xml in the main BOINC directory, not a project directory.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 857
Credit: 4,301,782
RAC: 1
Level
Ala
Scientific publications
watwatwatwat
Message 53193 - Posted: 28 Nov 2019 | 21:18:18 UTC - in response to Message 53190.

What the heck now???!!!

A slew of "Exit child" errors! What is this? Is this a speed problem with OC?
Also getting "restart on different device" errors!!!
Now this... is that because something is not right in the app_config?


I see two types of errors:

ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!


as the name says, exclusion not working. And

# Engine failed: Particle coordinate is nan


this usually indicates mathematical errors in the operations performed, memory corruption, or similar (or a faulty wu, unlikely in this case). Maybe a reboot will solve it.

computezrmle
Send message
Joined: 10 Jun 13
Posts: 9
Credit: 122,193,551
RAC: 294,103
Level
Cys
Scientific publications
wat
Message 53194 - Posted: 28 Nov 2019 | 21:28:01 UTC - in response to Message 53189.

You gave me a page on cc_config.

I posted the official documentation for more than just cc_config.xml:
cc_config.xml
nvc_config.xml
app_config.xml

It's worth carefully reading this page a couple of times, as it provides all you need to know.

Long ago the page had a direct link to the app_config.xml section.
Unfortunately that link is no longer available, but you can use your browser's find function.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53195 - Posted: 28 Nov 2019 | 22:58:46 UTC - in response to Message 53192.
Last modified: 28 Nov 2019 | 23:08:50 UTC

If you are going to use an exclude, then you need to exclude all devices other than the one you want to use. That is how to get rid of "restart on different device" errors. Or just set the switch-between-tasks interval to 360 minutes or greater and don't exit BOINC while the task is running.

The device number you use in the exclude statement is defined by how BOINC enumerates the cards in the Event Log at startup.

The gpu_exclude statement goes into cc_config.xml in the main BOINC directory, not a project directory.



Ok, on point 1, it was set for 360 already because that's a good time for LHC ATLAS to run complete. I moved it up to 480 now to try and deal with this stuff in GPUGRID.

Point 2 - Going to try a cc_config with a triple exclude gpu code block for here and for 2 other projects. From what I read this should be possible.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53196 - Posted: 28 Nov 2019 | 23:00:43 UTC - in response to Message 53193.
Last modified: 28 Nov 2019 | 23:01:58 UTC

What the heck now???!!!

A slew of "Exit child" errors! What is this? Is this a speed problem with OC?
Also getting "restart on different device" errors!!!
Now this... is that because something is not right in the app_config?


I see two types of errors:

ERROR: src\mdsim\context.cpp line 322: Cannot use a restart file on a different device!


as the name says, exclusion not working. And

# Engine failed: Particle coordinate is nan


this usually indicates mathematical errors in the operations performed, memory corruption, or similar (or a faulty wu, unlikely in this case). Maybe a reboot will solve it.



One of these days I will get this problem solved. Driving me nuts!

rod4x4
Send message
Joined: 4 Aug 14
Posts: 129
Credit: 1,648,208,676
RAC: 824,434
Level
His
Scientific publications
watwatwatwatwatwatwat
Message 53197 - Posted: 29 Nov 2019 | 0:15:26 UTC - in response to Message 53195.
Last modified: 29 Nov 2019 | 0:18:28 UTC

Ok, on point 1, it was set for 360 already because that's a good time for LHC ATLAS to run complete. I moved it up to 480 now to try and deal with this stuff in GPUGRID.

As your GPU is taking 728 minutes to complete the current batch of tasks, this setting needs to be MORE than 728 to have a positive effect. Times suited to other projects don't suit GPUGrid requirements, as tasks here can be longer.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53205 - Posted: 29 Nov 2019 | 14:31:21 UTC - in response to Message 53197.
Last modified: 29 Nov 2019 | 14:34:45 UTC

Ok, on point 1, it was set for 360 already because that's a good time for LHC ATLAS to run complete. I moved it up to 480 now to try and deal with this stuff in GPUGRID.

As your GPU is taking 728 minutes to complete the current batch of tasks, this setting needs to be MORE than 728 to have a positive effect. Times suited to other projects don't suit GPUGrid requirements, as tasks here can be longer.


Oh? That's interesting. Changed to 750 minutes.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 39
Credit: 42,093,642
RAC: 9,172
Level
Val
Scientific publications
watwatwatwat
Message 53224 - Posted: 30 Nov 2019 | 17:11:11 UTC

Just suffered a DPC_WATCHDOG_VIOLATION on my system. Will be offline for a few days.

Profile Retvari Zoltan
Avatar
Send message
Joined: 20 Jan 09
Posts: 2078
Credit: 15,135,591,890
RAC: 4,579,208
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53225 - Posted: 30 Nov 2019 | 20:04:53 UTC - in response to Message 53193.
Last modified: 30 Nov 2019 | 20:14:23 UTC

# Engine failed: Particle coordinate is nan

this usually indicates mathematical errors in the operations performed, memory corruption, or similar (or a faulty wu, unlikely in this case). Maybe a reboot will solve it.
These workunits have failed on all 8 hosts with this error condition:
initial_1923-ELISA_GSN4V1-12-100-RND5980
initial_1086-ELISA_GSN0V1-2-100-RND9613
Perhaps these workunits inherited a NaN (=Not a Number) from their previous stage.
I don't think this could be solved by a reboot.
I'm eagerly waiting to see how many batches will survive through all the 100 stages.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 386
Credit: 4,837,651,939
RAC: 1,398,889
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53290 - Posted: 6 Dec 2019 | 23:38:21 UTC

I ran the following unit:

1_7-GERARD_pocket_discovery_d89241c4_7afa_4928_b469_bad3dc186521-0-2-RND1542_1, which ran well and would have finished as valid had the following error not occurred:

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>1_7-GERARD_pocket_discovery_d89241c4_7afa_4928_b469_bad3dc186521-0-2-RND1542_1_9</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>

Scroll to the bottom on this page:

http://www.gpugrid.net/result.php?resultid=21553962


It looks like you need to increase the size limits of the output files for them to upload. This should be done for all subsequent WUs.





Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53291 - Posted: 7 Dec 2019 | 6:12:34 UTC - in response to Message 53290.

I must have squeaked in under the wire by just this much with this GERARD_pocket_discovery task.
https://www.gpugrid.net/result.php?resultid=21551650

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 386
Credit: 4,837,651,939
RAC: 1,398,889
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53293 - Posted: 7 Dec 2019 | 12:45:41 UTC - in response to Message 53291.

I must have squeaked in under the wire by just this much with this GERARD_pocket_discovery task.
https://www.gpugrid.net/result.php?resultid=21551650



Apparently, these units vary in length. Here is another one with the same problem:

http://www.gpugrid.net/workunit.php?wuid=16894092



Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 918
Credit: 2,223,381,052
RAC: 251,037
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53294 - Posted: 7 Dec 2019 | 17:11:49 UTC
Last modified: 7 Dec 2019 | 17:20:36 UTC

I've got one running from 1_5-GERARD_pocket_discovery_d89241c4_7afa_4928_b469_bad3dc186521-0-2-RND2573 - I'll try to catch some figures to see how bad the problem is.

Edit - the _9 upload file (the one named in previous error messages) is set to allow

<max_nbytes>256000000.000000</max_nbytes>

or 256,000,000 bytes. You'd have thought that was enough.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 857
Credit: 4,301,782
RAC: 1
Level
Ala
Scientific publications
watwatwatwat
Message 53295 - Posted: 7 Dec 2019 | 17:47:56 UTC - in response to Message 53294.

The 256 MB is the new limit - I raised it today. There are only a handful of WUs like that.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 918
Credit: 2,223,381,052
RAC: 251,037
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53301 - Posted: 7 Dec 2019 | 23:16:16 UTC

I put precautions in place, but you beat me to it - final file size was 155,265,144 bytes. Plenty of room. Uploading now.

Erich56
Send message
Joined: 1 Jan 15
Posts: 638
Credit: 3,157,242,642
RAC: 818,959
Level
Arg
Scientific publications
watwatwatwatwatwat
Message 53303 - Posted: 8 Dec 2019 | 5:54:33 UTC

What I also noticed with the GERARD tasks (currently running 0_2-GERARD_pocket_discovery ...):

the GPU utilization oscillates between 76% and 95% (in contrast to the ELISA tasks, where it was permanently close to or even at 100%)

Profile God is Love, JC proves it...
Avatar
Send message
Joined: 24 Nov 11
Posts: 30
Credit: 143,504,152
RAC: 197,331
Level
Cys
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 53316 - Posted: 9 Dec 2019 | 20:12:48 UTC - in response to Message 53295.
Last modified: 9 Dec 2019 | 20:16:22 UTC

I am getting upload errors too, on most but not all (4 of 6) WUs...
but, only on my 950M, not on my 1660 Ti, ... or EVEN my GeForce 640 !!

need to increase the size limits of the output files

So, how is this done?
Via Options > Computing preferences, under Network, the default values are not shown (that I can see). I WOULD have assumed that BOINC Manager would leave these limited only by system constraints unless tighter limits are desired.
AND, only the download rate, upload rate, and usage limits can be set.
Again, how should output file size limits be increased?

It would have been VERY polite of GpuGrid to post some notice about this with the new WU releases.

I am very miffed, and justifiably so, at having wasted so much of my GPU time and energy, and effort on my part to hunt down the problem. Indeed, there was NO feedback from GpuGrid on this at all; I only noticed that my RAC kept falling even though I was running WUs pretty much nonstop.

I realize that getting research done is the primary goal, but if GpuGrid is asking people to donate their PC time and GPU time, then please be more polite to your donors.

LLP, PhD

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 330
Credit: 252,993,463
RAC: 450,216
Level
Asn
Scientific publications
wat
Message 53317 - Posted: 9 Dec 2019 | 21:38:27 UTC

You can't control the result output file size. That is set by the science application under the control of the project administrators. The quote you referenced was Toni acknowledging that he needed to increase the size of the upload server's input buffer to handle the larger result files that a few tasks were producing. That is not the norm for the work we have processed so far; cases where result files exceed 250 MB should be rare.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 918
Credit: 2,223,381,052
RAC: 251,037
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53320 - Posted: 9 Dec 2019 | 23:39:40 UTC

Neither of those two. The maximum file size is specified in the job specification associated with the task in question. You can (as I did) increase the maximum size by careful editing of the file 'client_state.xml', but it needs a steady hand, some knowledge, and is not for the faint of heart. It shouldn't be needed now, after Toni's correction at source.
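For the curious, the relevant entries are the <file_info> blocks in client_state.xml. A small sketch (Python, using a made-up fragment with hypothetical file names, not taken from any real task) that pulls out each output file's <max_nbytes> limit:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a client_state.xml fragment (hypothetical names and values).
SAMPLE = """
<client_state>
  <file_info>
    <name>task_abc_0_9</name>
    <max_nbytes>256000000.000000</max_nbytes>
  </file_info>
  <file_info>
    <name>task_abc_0_0</name>
    <max_nbytes>100000000.000000</max_nbytes>
  </file_info>
</client_state>
"""

def upload_limits(xml_text):
    """Return {file name: max allowed upload size in bytes} for each <file_info>."""
    root = ET.fromstring(xml_text)
    limits = {}
    for fi in root.iter("file_info"):
        name = fi.findtext("name")
        max_nbytes = fi.findtext("max_nbytes")
        if name and max_nbytes:
            limits[name] = int(float(max_nbytes))
    return limits

if __name__ == "__main__":
    for name, limit in upload_limits(SAMPLE).items():
        print(f"{name}: {limit:,} bytes")
```

Comparing those limits against the sizes of the files a running task has already written shows how much headroom it has, without hand-editing anything.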

Profile God is Love, JC proves it...
Avatar
Send message
Joined: 24 Nov 11
Posts: 30
Credit: 143,504,152
RAC: 197,331
Level
Cys
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 53321 - Posted: 9 Dec 2019 | 23:59:42 UTC - in response to Message 53317.
Last modified: 10 Dec 2019 | 0:03:52 UTC

Hm,
Toni's message (53295) was posted on the 7th. Toni used the past tense on the 7th ("I raised");
yet, https://gpugrid.net/result.php?resultid=21553648
ended on the 8th and still had the same frustrating error.
After running for hours, the results were nonetheless lost:
upload failure: <file_xfer_error>
<file_name>initial_1497-ELISA_GSN4V1-20-100-RND8978_0_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>

Also, I must be just extremely unlucky. Toni says this came up on 'only a handful' of WUs, yet this happened to at least five of the WUs my GPUs ran.

I am holding off on running any GpuGrid WUs for a while, until this problem is more fully corrected.

Just for full disclosure... Industrial Engineers hate waste.

LLP
MS and PhD in Industrial & Systems Engineering.
Registered Prof. Engr. (Industrial Engineering)

Profile God is Love, JC proves it...
Avatar
Send message
Joined: 24 Nov 11
Posts: 30
Credit: 143,504,152
RAC: 197,331
Level
Cys
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 53322 - Posted: 10 Dec 2019 | 0:15:47 UTC
Last modified: 10 Dec 2019 | 0:18:53 UTC

Besides the upload errors,
a couple, resultid=21544426 and resultid=21532174, had said:
"Detected memory leaks!"
So I ran extensive memory diagnostics, but no errors were reported by windoze (extensive as in some eight hours of diagnostics).
BOINC did not indicate whether these were RAM or GPU 'memory leaks'.

In fact, now I am wondering whether these 'memory leaks' were on my end at all, or on the GpuGrid servers...

LLP
____________
I think ∴ I THINK I am
My thinking neither is the source of my being
NOR proves it to you
God Is Love, Jesus proves it! ∴ we are

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 918
Credit: 2,223,381,052
RAC: 251,037
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53325 - Posted: 10 Dec 2019 | 8:26:48 UTC - in response to Message 53321.

Hm,
Toni's message (53295) was posted on the 7th. Toni used the past tense on the 7th ("I raised");
yet, https://gpugrid.net/result.php?resultid=21553648
ended on the 8th and still had the same frustrating error.
After running for hours, the results were nonetheless lost:
upload failure: <file_xfer_error>
<file_name>initial_1497-ELISA_GSN4V1-20-100-RND8978_0_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>

That's a different error. Toni's post was about a file size error.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 857
Credit: 4,301,782
RAC: 1
Level
Ala
Scientific publications
watwatwatwat
Message 53326 - Posted: 10 Dec 2019 | 9:24:58 UTC - in response to Message 53322.

Besides the upload errors,
a couple, resultid=21544426 and resultid=21532174, had said:
"Detected memory leaks!"
So I ran extensive memory diagnostics, but no errors were reported by windoze (extensive as in some eight hours of diagnostics).
BOINC did not indicate whether these were RAM or GPU 'memory leaks'.

In fact, now I am wondering whether these 'memory leaks' were on my end at all, or on the GpuGrid servers...

LLP



Such messages are always present in Windows. They are not related to whether termination was successful. If an error message is present, it's elsewhere in the output.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 857
Credit: 4,301,782
RAC: 1
Level
Ala
Scientific publications
watwatwatwat
Message 53327 - Posted: 10 Dec 2019 | 9:26:15 UTC - in response to Message 53326.
Last modified: 10 Dec 2019 | 9:27:26 UTC

Also, slow and mobile cards should not be used for crunching for the reasons you mention.

Gustav
Send message
Joined: 24 Jul 19
Posts: 1
Credit: 4,562,500
RAC: 9,387
Level
Ala
Scientific publications
wat
Message 53328 - Posted: 10 Dec 2019 | 9:46:06 UTC

Hi,

I have not received any new WU in like 30-40 days.

Why? Are there no available WUs for anyone, or could it be bad settings on my side?

My PCs are starving...


Br Thomas

Carlos Augusto Engel
Send message
Joined: 5 Jun 09
Posts: 38
Credit: 2,668,021,604
RAC: 366,296
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53329 - Posted: 10 Dec 2019 | 13:50:59 UTC - in response to Message 53328.

Hello,
I think you have to install the latest version of the Nvidia drivers.
____________

Aurum
Send message
Joined: 12 Jul 17
Posts: 131
Credit: 7,642,556,493
RAC: 3,980,674
Level
Tyr
Scientific publications
wat
Message 53330 - Posted: 10 Dec 2019 | 17:32:54 UTC - in response to Message 53328.

I have not received any new WU in like 30-40 days. Why?
Did you check ACEMD3 in Prefs?

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 386
Credit: 4,837,651,939
RAC: 1,398,889
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 53338 - Posted: 12 Dec 2019 | 2:20:02 UTC

I have another observation to add. One of my computers had an abrupt shutdown (in other words, the power was accidentally shut off, of course) while crunching this unit: initial_1609-ELISA_GSN4V1-19-100-RND7717_0. Upon restart, the unit finished as valid, which would not have happened with the previous ACEMD app. See link:


http://www.gpugrid.net/result.php?resultid=21561458



Of course the run time is wrong; it should be about 2000 seconds more.


Erich56
Send message
Joined: 1 Jan 15
Posts: 638
Credit: 3,157,242,642
RAC: 818,959
Level
Arg
Scientific publications
watwatwatwatwatwat
Message 53339 - Posted: 12 Dec 2019 | 6:14:45 UTC - in response to Message 53338.

I have another observation to add. One of my computers had an abrupt shutdown (in other words, the power was accidentally shut off, of course)

Now that you say this - I had a similar situation with one of my hosts 2 days ago. The PC shut down and restarted.
I had/have no idea whether this was caused by crunching a GPUGRID task or whether there was some other reason behind it.

Post to thread

Message boards : News : New workunits