
Message boards : Graphics cards (GPUs) : GPU work units [network connection issue]

mscharmack
Joined: 20 Aug 07
Posts: 18
Credit: 1,319,274
RAC: 0
Message 8812 - Posted: 24 Apr 2009 | 1:52:05 UTC
Last modified: 24 Apr 2009 | 1:53:29 UTC

Are GPU work units dependent on the internet for anything other than sending and receiving? I ask because our internet went out three times over the last month, and the work units being processed at the time continued until completion but were reported as client error/compute error. During the last outage, I turned off the computer and restarted it after the outage, and the work unit processed correctly. Are the work units dependent on a good internet connection while being processed? I have lost 6 work units due to this.

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8828 - Posted: 24 Apr 2009 | 11:38:17 UTC
Last modified: 24 Apr 2009 | 11:39:58 UTC

Um, no, and, um, well, yes ...

Which is confusing ... bear with me ...

There is no need for an internet connection to compute a GPU Grid task.

On the other hand, the TCP/IP connection used to connect the client to the manager can be farbled up if you lose your external connection. This is a long-standing issue with the BOINC client that, because of its "intermittent" nature, is difficult to locate.

*IF* you know you are losing, or will lose, or have lost your connection, you have two choices to protect your work: one, shut down the system, or two, turn off BOINC's connection (Activity menu, suspend internet).

What killed your tasks was "No heartbeat", meaning the client and the science application lost "sight" of each other ... most specifically, the application thinks that the client is dead. It isn't really; it is just tied up trying to make a connection to the Internet. And it is locked up, so to speak, in the deadly embrace waiting for the connection to fail ... but while it is doing that it is not paying any attention to anything else ...

{edit}
Like the running applications ... so they try to make the connection a specific number of times and if they fail that number of times, they quit ...
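
To make the "no heartbeat" idea concrete, here is a toy simulation in Python. It is not BOINC code: the class name, the 30-second timeout and the one-second loop are all assumptions picked for illustration. It only shows the general pattern, where an application gives up once the client has been silent for too long, even though the client may just be busy with something else.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Application-side watchdog (hypothetical, not BOINC's real implementation)."""
    timeout_s: float        # assumed value for illustration only
    last_beat: float = 0.0  # time of the most recent heartbeat from the client

    def beat(self, now: float) -> None:
        # The client signalled that it is still alive.
        self.last_beat = now

    def client_alive(self, now: float) -> bool:
        # The client counts as alive while the silence is within the timeout.
        return (now - self.last_beat) <= self.timeout_s


def run_task(beat_times, task_length_s, timeout_s=30.0):
    """Simulate one task second by second and report how it ended."""
    watchdog = HeartbeatWatchdog(timeout_s=timeout_s)
    beats = set(beat_times)
    for now in range(task_length_s + 1):
        if now in beats:
            watchdog.beat(float(now))
        if not watchdog.client_alive(float(now)):
            # The application assumes the client is dead and quits.
            return f"no heartbeat at t={now}s"
    return "finished"


# Client keeps signalling every 10 seconds: the task runs to the end.
print(run_task(range(0, 100, 10), 100))
# Client gets stuck after t=20s (for example, blocked on a network call).
print(run_task([0, 10, 20], 100))
```

In the second call the client is never actually dead in the story above, it is just not answering, which is why the failure looks like a compute error rather than a network error.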

*SO* ...

Hope this helps ...

jrobbio
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Message 8834 - Posted: 24 Apr 2009 | 13:20:44 UTC - in response to Message 8828.

Thanks for that. Coincidentally, I was having latency issues last night when BOINC lost connection to the client.

Isn't there a reference server (or servers) that the system uses to check whether the connection is good or not? Maybe it has something to do with losing sight of that.

Rob

mscharmack
Message 8838 - Posted: 24 Apr 2009 | 13:57:56 UTC

Thanks for the info. I'll shut down the internet connection next time and let the application continue to process.

Clownius
Joined: 19 Feb 09
Posts: 37
Credit: 30,657,566
RAC: 0
Message 8857 - Posted: 24 Apr 2009 | 18:40:19 UTC - in response to Message 8834.

I think the reference server it tries is Google. I think I saw it somewhere, but I have no idea where.

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 47,698,744
RAC: 107,365
Message 8882 - Posted: 25 Apr 2009 | 4:23:28 UTC - in response to Message 8857.

I think the reference server it tries is Google. I think I saw it somewhere, but I have no idea where.


The default reference site is indeed Google. You can change that if you wish; I believe it's an option you can put in the cc_config.xml file (or whatever the filename is.)

The place you saw that information was probably the documentation for the cc_config.xml file.
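
For anyone who wants to try it, here is a minimal Python sketch that builds such a config fragment. It assumes the option is called network_test_url and sits inside an options block, which is how I recall the client configuration documentation describing it; please check the documentation for your client version before relying on it. The function name and the example URL are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(test_url: str) -> str:
    """Return a cc_config.xml body that sets the connectivity test URL.

    Assumption: the option is named <network_test_url> and belongs under
    <options>. Verify this against the BOINC client configuration docs.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "network_test_url").text = test_url
    return ET.tostring(root, encoding="unicode")


# Hypothetical URL; substitute a site you trust to be reachable.
print(build_cc_config("http://example.org/"))
```

The printed text would go into cc_config.xml in the BOINC data directory, and the client would need to re-read its config file (or be restarted) for a change to take effect.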

Mike

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8910 - Posted: 25 Apr 2009 | 13:31:29 UTC

Mhh, I also had occasional network failures, but I didn't see the apps losing connection to the BOINC client, even though I didn't suspend network activity (after all, there's always the chance the connection is restored and I won't have to run dry).

So I'm wondering: is it really your external connection? Or some strange problem with your Windows setup?

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 8931 - Posted: 25 Apr 2009 | 17:28:10 UTC - in response to Message 8910.

Mhh, I also had occasional network failures, but I didn't see the apps losing connection to the BOINC client, even though I didn't suspend network activity (after all, there's always the chance the connection is restored and I won't have to run dry).

So I'm wondering: is it really your external connection? Or some strange problem with your Windows setup?

MrS

That is why it is such a hard problem to find. It does not happen to all people all the time. But, if you look into the past, the "no heartbeat" error has been an annoying bug for a long time in the BOINC world.

Just like the "Can't acquire lockfile" ... another pest ...

mscharmack
Message 8939 - Posted: 26 Apr 2009 | 1:14:29 UTC
Last modified: 26 Apr 2009 | 1:17:46 UTC

I don't think it's a problem with the Windows setup. This is just a recent problem, and it only happens when the internet in the area goes down. It has gone down three times in the last month, and the two work units being processed at the time crashed and burned. The only problem I can think of is a bug in the BOINC 6.6.20 application files (which were also installed about the time the problems started). I will revert to a previous version and see what happens. I am also running the most current NVIDIA drivers for the video cards.

Paul D. Buck
Message 8941 - Posted: 26 Apr 2009 | 1:18:26 UTC - in response to Message 8939.

I don't think it's a problem with the Windows setup. This is just a recent problem, and it only happens when the internet in the area goes down. It has gone down three times in the last month, and the two work units being processed at the time crashed and burned. The only problem I can think of is a bug in the BOINC 6.6.20 application files (which were also installed about the time the problems started). I will revert to a previous version and see what happens. I am also running the most current NVIDIA drivers for the video cards.

No Heartbeat has been around for years ... but trying another client may work ... :)

ExtraTerrestrial Apes
Message 9013 - Posted: 27 Apr 2009 | 20:38:18 UTC - in response to Message 8939.

Yes, if the inet was really broken, then it surely was not an issue with Windows and the installed programs. However, what I was thinking: what if some program went berserk and blocked your inet access as well as your local servers, and thus caused the no heartbeat issue? In this case it would also look like a broken inet from your point of view.

Except if you have different computers and/or you know the neighbours' inet is also gone, or you see the service guys working, or whatever. I don't know your situation, so this was just an idea ... maybe a crazy one ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 9024 - Posted: 27 Apr 2009 | 21:31:14 UTC - in response to Message 9013.
Last modified: 27 Apr 2009 | 21:32:35 UTC

Except if you have different computers and/or you know the neighbours' inet is also gone, or you see the service guys working, or whatever. I don't know your situation, so this was just an idea ... maybe a crazy one ;)

Not really.

ANOTHER bug I am chasing causes OS X versions of BOINC Manager to lose connection to the client, though the client continues to run, apparently properly ... but the manager cannot connect to it.

I have sent Charlie Fenton I think 4 reports now of what I have discovered and what I suspect ... nothing back from him yet ... (sadly) ...

But Charlie is a good guy, I think, so patience is a virtue, which is probably why I don't have much of it ...

{edit}

Forgot to mention, it looks like a TCP/IP bug also ...

mscharmack
Message 9150 - Posted: 30 Apr 2009 | 19:37:47 UTC

Well, I reverted to BOINC 6.4.7 and everything has been running properly for the last two days. No problems to report at all. In my opinion there is a problem in the BOINC 6.6.20 code as it is applied to GPU/CUDA functions. BOINC 6.6.20 runs fine on my other computers; however, they are not running GPU/CUDA functions.

ExtraTerrestrial Apes
Message 9158 - Posted: 30 Apr 2009 | 20:24:56 UTC - in response to Message 9150.

In my opinion there is a problem in the BOINC 6.6.20 code as it is applied to GPU/CUDA functions.


There is a problem? Boy, we could all finally be happy again if it was only one ;)

MrS
____________
Scanning for our furry friends since Jan 2002

mscharmack
Message 9162 - Posted: 30 Apr 2009 | 20:56:45 UTC

Maybe I should have said "some problems" instead of "a problem." Bad choice of words on my part. However, these problems have not manifested themselves on my computers that are not using CUDA.

Paul D. Buck
Message 9168 - Posted: 1 May 2009 | 2:10:47 UTC

There are two major problems with 6.6.20; neither of which is recognized by UCB as far as I know. One of them has been fixed in 6.6.23 and later, though 6.6.24 introduced a new issue, addressed in 6.6.25 (multi-GPU users only).

The two problems in 6.6.20 show as long running tasks on either CPU or GPU (I now have a confirmed instance of this relating to AP tasks of SaH), and secondly a gradual imbalance in internal debts that results in a very poor mix of work selection on the system.

robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 729,045,933
RAC: 96,502
Message 9233 - Posted: 3 May 2009 | 3:30:47 UTC - in response to Message 9168.

Could the problem with long running tasks be only for those tasks that don't have fairly frequent checkpoints available yet?

Paul D. Buck
Message 9237 - Posted: 3 May 2009 | 6:47:48 UTC - in response to Message 9233.

Could the problem with long running tasks be only for those tasks that don't have fairly frequent checkpoints available yet?

No, because it is universal with tasks from more than one project. I saw it only with GPU Grid for sure (it may have affected other tasks, I just did not see it). Another user saw the estimated times of his AP tasks of SaH drop from 187 hours plus change to 63 hours by going back to 6.4.7 (I think) ... I have not seen it with 6.6.23, and there was a change in the resource scheduler (in the release notes), and though I forget what it said, it certainly sounded like the issue. I have been running 6.6.23 and 6.6.25 for several weeks so I can get at the other issues (if you watched the mailing list this weekend, I sure did fill that up), and the only way you can get the developers' attention is to run the latest versions (or close to it; there is no significant change in 6.6.26, so I have not tried it yet).
