Message boards : Server and website : SOS-Downloads stuck

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 44717 - Posted: 15 Oct 2016 | 21:25:43 UTC

After not having run this project for months, I return to find the same problem that was here when I left. Absolutely unconscionable. Goodbye again.

caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Message 44718 - Posted: 15 Oct 2016 | 22:46:49 UTC

What is the problem and why are you having it? Maybe we can help you figure it out. The project's uploads and downloads are working just fine and we are all getting tons of tasks at the moment, some every few minutes, with these SDOERR_CASP tasks being handed out that are literally running in less than 5-10 minutes on many hosts.

One thing you could do, if frequent interruptions over the internet are pausing your uploads or downloads, is change the line in your BOINC cc_config from the default

<http_transfer_timeout>3000</http_transfer_timeout>

to
<http_transfer_timeout>60</http_transfer_timeout>


That will make it so if something does interrupt the transfer, it will retry a connection after 60 seconds and not wait 3000 seconds (50 minutes).

As far as I can tell, all the servers are running fine (judging from the Server Status page), all my uploads and downloads are running smoothly, and nobody else has complained about this issue in weeks, not since a server-full issue that lasted a weekend.
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

nanoprobe
Message 44721 - Posted: 16 Oct 2016 | 0:47:19 UTC

The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,336,851
RAC: 8,787,904
Message 44723 - Posted: 16 Oct 2016 | 6:36:21 UTC - in response to Message 44721.

The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic.

Note that the default, as stated in http://boinc.berkeley.edu/wiki/Client_configuration#Options, is actually 300.

<http_transfer_timeout>seconds</http_transfer_timeout>
Abort HTTP transfers if idle for this many seconds; default 300.
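
For anyone unsure where this tag goes: the linked page lists it under the options section of cc_config.xml. Below is a minimal sketch, assuming Python 3, that builds such a file with the 60-second value suggested earlier in the thread. The function name build_cc_config and the template layout are illustrative only and are not part of BOINC; where to save the file depends on your BOINC data directory.

```python
import xml.etree.ElementTree as ET

# Illustrative template: a cc_config.xml containing only the one option
# discussed in this thread. Other options would go inside <options> too.
CC_CONFIG_TEMPLATE = """<cc_config>
  <options>
    <http_transfer_timeout>{timeout}</http_transfer_timeout>
  </options>
</cc_config>
"""


def build_cc_config(timeout_seconds):
    """Return cc_config.xml text that sets http_transfer_timeout (in seconds)."""
    if timeout_seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    text = CC_CONFIG_TEMPLATE.format(timeout=int(timeout_seconds))
    ET.fromstring(text)  # sanity check: raises ParseError if not well-formed
    return text


if __name__ == "__main__":
    print(build_cc_config(60))
```

The client only sees the change after it re-reads its config files or is restarted.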

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 44724 - Posted: 16 Oct 2016 | 12:53:15 UTC - in response to Message 44721.
Last modified: 16 Oct 2016 | 12:54:40 UTC

The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic.

This is the only project with that problem. Have no idea what they have set wrong but we've complained about it a number of times. Anyway, to address this problem I use the switch above:

<http_transfer_timeout>60</http_transfer_timeout>

That helps, but in order to make it acceptable I also have to start BOINC from the command line and use this argument:

--pers_retry_delay_max 60

It still starts and stops but retries much more quickly. Downloads went from sometimes taking hours to now a maximum of 7-8 minutes. We shouldn't have to jump through these hoops but I really don't think there's anyone anymore on the project that knows how to configure the system. There are other easy fixes that we've asked for that never get addressed.

caffeineyellow5
Message 44736 - Posted: 16 Oct 2016 | 22:47:21 UTC

I just don't get these problems, except at times when everyone does because of server failures. I have systems at 2 different locations on 2 different internet providers (Comcast and RCN).

Beyond
Message 44742 - Posted: 17 Oct 2016 | 2:20:53 UTC - in response to Message 44736.
Last modified: 17 Oct 2016 | 2:22:57 UTC

You've complained about it previously as have many others. It happens here on virtually every GPUGrid download (Centurylink). It never happens on any other downloads, BOINC or otherwise.

nanoprobe
Message 44751 - Posted: 17 Oct 2016 | 16:16:27 UTC

Maybe I'm overreacting, but there are just too many issues with seemingly simple remedies for me. Since I'm already in the process of scaling back my DC operation, it doesn't really matter.
With that said, I have 3 SuperMicro dual socket boards with Xeon ES V4 CPUs that will be going up for sale. 2 are 28c/56t and 1 is 36c/72t. Won't go into all the details now. If there's anyone here interested PM me and we'll discuss things.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Message 44756 - Posted: 17 Oct 2016 | 20:04:27 UTC

While I don't think the staff of GPUGrid could do anything about your HTTP timeout problem, out of curiosity I ask you to run a very basic network diagnostic:
If you have a Windows based PC on the same network as your crunching box, please open a command prompt and type

ping www.gpugrid.net -n 100

You can do it on Linux also, but I'm not familiar with its command syntax (the -n 100 parameter tells the ping command to try 100 times).
You'll see a lot of (exactly 100, if everything's going well) messages like:

Reply from 84.89.134.145: bytes=32 time=83ms TTL=49

Then, at the end:

Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 83ms, Maximum = 88ms, Average = 83ms

These are the actual results from my host; I'm curious about your statistics.
I expect your packet loss and round trip times to be significantly higher than what I experience.
Unfortunately these numbers do not reveal which device is responsible for your problem, but I'm quite confident that it's closer to your end (most probably at your ISP) than to the GPUGrid site (otherwise many more users would have such difficulties).

You could also try a traceroute command:

tracert www.gpugrid.net

Which gives you a list of the devices between your end and grosso.upf.edu (on which the gpugrid.net project resides).
Perhaps this list could help us to figure out what's wrong. Especially if it gives you very different results when you run it multiple times.
In some cases these errors are simply caused by network congestion (when the ISP has limited bandwidth to certain destinations), but that could depend on the time of day. On your end, however, P2P file sharing applications or appliances, or a faulty router/switch, could cause such strange errors (but I'm sure in that case there would be problems with other sites as well).
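
To make the ping results easier to compare between hosts, here is a small sketch, assuming Python 3 and the Windows summary format shown above. The function name parse_ping_summary and the returned field names are made up for illustration. On Linux the equivalent count option is -c 100 and the summary is formatted differently, so this parser would not apply there.

```python
import re

# Summary block copied from the output quoted above (Windows ping format).
SAMPLE = """\
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 83ms, Maximum = 88ms, Average = 83ms
"""


def parse_ping_summary(text):
    """Extract packet counts and round-trip times from a Windows ping summary."""
    packets = re.search(r"Sent = (\d+), Received = (\d+), Lost = (\d+)", text)
    rtt = re.search(r"Minimum = (\d+)ms, Maximum = (\d+)ms, Average = (\d+)ms", text)
    if packets is None or rtt is None:
        raise ValueError("no Windows-style ping summary found")
    sent, received, lost = (int(n) for n in packets.groups())
    rtt_min, rtt_max, rtt_avg = (int(n) for n in rtt.groups())
    return {
        "sent": sent,
        "received": received,
        "lost": lost,
        "loss_percent": 100.0 * lost / sent if sent else 0.0,
        "rtt_min_ms": rtt_min,
        "rtt_max_ms": rtt_max,
        "rtt_avg_ms": rtt_avg,
        # Spread between slowest and fastest reply; a rough stability hint.
        "rtt_spread_ms": rtt_max - rtt_min,
    }


if __name__ == "__main__":
    print(parse_ping_summary(SAMPLE))
```

A clean summary like this one only shows that small ICMP packets get through; it says nothing about how long HTTP transfers behave.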

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 44757 - Posted: 17 Oct 2016 | 21:03:57 UTC - in response to Message 44756.

Actually Retvari, I experience the same problems with downloads sticking, not the whole package just one or two files that stick.

My Stats:

Pinging www.gpugrid.net [84.89.134.145] with 32 bytes of data:


Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 58ms, Maximum = 59ms, Average = 58ms


My Tracert:

Tracing route to www.gpugrid.net [84.89.134.145]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms 192.168.0.1
2 * * * Request timed out.
3 13 ms 14 ms 12 ms be363.pr2.hobir.isp.sky.com [89.200.135.232]
4 12 ms 12 ms 12 ms ae-3.r02.londen03.uk.bb.gin.ntt.net [83.231.221.45]
5 11 ms 11 ms 11 ms ae-3.r24.londen12.uk.bb.gin.ntt.net [129.250.4.23]
6 33 ms 33 ms 33 ms ae-6.r01.mdrdsp03.es.bb.gin.ntt.net [129.250.4.138]
7 36 ms 34 ms 34 ms rediris.baja.espanix.net [193.149.1.26]
8 48 ms 48 ms 48 ms CIEMAT.AE1.cica.rt1.and.red.rediris.es [130.206.245.38]
9 52 ms 52 ms 51 ms CICA.AE1.uv.rt1.val.red.rediris.es [130.206.245.34]
10 58 ms 60 ms 68 ms anella-val1-router.red.rediris.es [130.206.211.70]
11 * * * Request timed out.
12 58 ms 57 ms 57 ms grosso.upf.edu [84.89.134.145]
13 58 ms 57 ms 57 ms grosso.upf.edu [84.89.134.145]
14 58 ms 58 ms 58 ms grosso.upf.edu [84.89.134.145]

Trace complete.


I have also noticed the same thing on a remote host with a different ISP.

captainjack
Joined: 9 May 13
Posts: 171
Credit: 2,322,079,288
RAC: 2,364,757
Message 44758 - Posted: 17 Oct 2016 | 21:16:10 UTC

I am also having issues downloading individual files. It usually takes two or three retries. My trace route says:

Tracing route to www.gpugrid.net [84.89.134.145]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms dsldevice.attlocal.net [192.168.1.254]
2 21 ms 20 ms 21 ms 99-17-40-3.lightspeed.edmdok.sbcglobal.net [99.17.40.3]
3 * * * Request timed out.
4 * * * Request timed out.
5 21 ms 22 ms 23 ms 12.83.71.89
6 28 ms 30 ms 29 ms ggr3.dlstx.ip.att.net [12.122.139.17]
7 27 ms 27 ms 27 ms 192.205.36.222
8 28 ms 26 ms 28 ms be2764.ccr22.dfw01.atlas.cogentco.com [154.54.47.213]
9 32 ms 32 ms 33 ms be2443.ccr22.iah01.atlas.cogentco.com [154.54.44.229]
10 46 ms 46 ms 45 ms be2690.ccr42.atl01.atlas.cogentco.com [154.54.28.129]
11 54 ms 54 ms 54 ms be2113.ccr42.dca01.atlas.cogentco.com [154.54.24.221]
12 59 ms 59 ms 58 ms be2807.ccr42.jfk02.atlas.cogentco.com [154.54.40.109]
13 129 ms 129 ms 129 ms be2747.ccr42.par01.atlas.cogentco.com [154.54.31.190]
14 143 ms 142 ms 142 ms be2423.ccr22.bio02.atlas.cogentco.com [130.117.50.78]
15 148 ms 147 ms 148 ms be2293.ccr22.mad05.atlas.cogentco.com [130.117.50.26]
16 147 ms 149 ms 147 ms be2853.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.56.62]
17 150 ms 161 ms 149 ms 149.11.68.2
18 149 ms 148 ms 148 ms CIEMAT.AE2.telmad.rt4.mad.red.rediris.es [130.206.245.2]
19 155 ms 154 ms 159 ms TELMAD.AE4.uv.rt1.val.red.rediris.es [130.206.245.89]
20 163 ms 162 ms 163 ms anella-val1-router.red.rediris.es [130.206.211.70]
21 * * * Request timed out.
22 161 ms 160 ms 160 ms grosso.upf.edu [84.89.134.145]
23 160 ms 159 ms 160 ms grosso.upf.edu [84.89.134.145]
24 161 ms 161 ms 160 ms grosso.upf.edu [84.89.134.145]

Trace complete.

Maybe that will help isolate the issue.

Richard Haselgrove
Message 44760 - Posted: 17 Oct 2016 | 21:29:27 UTC

After a quiet period when most downloads completed at the first attempt, in the last few days I've seen a marked increase in download delays - as Betting Slip says, usually just one file dropping to zero speed, while nominally still 'active'. That's coincided with more work being downloaded (and re-downloaded - see Pascal thread): I doubt that's a coincidence.

If I notice quickly, I can briefly 'suspend network activity' for BOINC, then go back to 'network activity always', while the download status is still 'active', and hence the underlying TCP/IP connection is still alive (an authenticated route still exists). If I manage that, the download usually completes far faster than a normal download.

That leads me to suspect that the majority of the packets have already arrived safely, with just a few gaps where individual packets have dropped out. And, for some reason, the 'resend packet xxxx' messages aren't getting through, or are themselves being dropped by the server.

Three years ago, we had a great deal of success at SETI with advising Windows users to enable rfc1323 - but that was to overcome exactly the problem with "large bandwidth*delay" paths described in the RFC. SETI has moved to a better network environment since then. We don't have exactly the same problem here (I've tried the fix, and it made no difference), but I suspect we may need a similar sort of packet-level analysis to identify and alleviate the problem we are observing.

Retvari Zoltan
Message 44762 - Posted: 17 Oct 2016 | 23:13:06 UTC

My trace route looks very similar after the first couple of hops:

Tracing route to www.gpugrid.net [84.89.134.145]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms 192.168.11.254 [192.168.11.254]
2 16 ms 16 ms 16 ms lo1.bsr0-zugliget.net.telekom.hu [145.236.238.178]
3 16 ms 16 ms 16 ms 81.183.3.4
4 17 ms 16 ms 17 ms 81.183.3.4
5 19 ms 16 ms 16 ms 81.183.3.145
6 24 ms 23 ms 23 ms 80.157.202.125
7 22 ms 22 ms 22 ms 80.150.171.74
8 28 ms 28 ms 28 ms be2974.ccr21.muc03.atlas.cogentco.com [154.54.58.5]
9 33 ms 34 ms 34 ms be3072.ccr21.zrh01.atlas.cogentco.com [130.117.0.17]
10 46 ms 46 ms 45 ms be3080.ccr21.mrs01.atlas.cogentco.com [130.117.49.1]
11 58 ms 58 ms 57 ms be2354.ccr21.vlc02.atlas.cogentco.com [130.117.0.150]
12 62 ms 61 ms 62 ms be2339.ccr22.mad05.atlas.cogentco.com [130.117.49.81]
13 63 ms 62 ms 63 ms be2853.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.56.62]
14 63 ms 62 ms 63 ms 149.11.68.50
15 159 ms 74 ms 74 ms CIEMAT.AE1.cica.rt1.and.red.rediris.es [130.206.245.38]
16 78 ms 77 ms 77 ms CICA.AE1.uv.rt1.val.red.rediris.es [130.206.245.34]
17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]
20 84 ms 83 ms 83 ms grosso.upf.edu [84.89.134.145]
21 83 ms 91 ms 84 ms grosso.upf.edu [84.89.134.145]

Trace complete.

I suspect that one of my hosts has had a stalled download, and that made it crunch for Einstein@home for a while. But these glitches happen to my hosts almost exclusively when new workunits become available after a near-empty period. That's when the ghost workunits appear too. Probably too many hosts are connected or trying to connect to the server during these periods. Perhaps it looks like a DDoS attack to some firewall/router along the way.

caffeineyellow5
Message 44763 - Posted: 18 Oct 2016 | 4:12:34 UTC - in response to Message 44762.

Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 99, Lost = 1 (1% loss),
Approximate round trip times in milli-seconds:
Minimum = 114ms, Maximum = 123ms, Average = 118ms

C:\Windows\system32>tracert www.gpugrid.net

Tracing route to www.gpugrid.net [84.89.134.145]
over a maximum of 30 hops:

1 9 ms 11 ms 12 ms bdl1.rdl-ubr2.trpr-rdl.pa.cable.rcn.net [10.49.128.1]
2 13 ms 12 ms 10 ms bdle25-sub202.aggr1.phdl.pa.rcn.net [207.172.196.209]
3 11 ms 11 ms 11 ms xe-4-1-0.bar2.Philadelphia1.Level3.net [4.78.154.89]
4 * * * Request timed out.
5 16 ms 17 ms 17 ms Comcast-Level3-10G.boston1.Level3.net [4.68.110.90]
6 18 ms 16 ms 12 ms be2060.ccr41.jfk02.atlas.cogentco.com [154.54.31.9]
7 99 ms 96 ms 100 ms be2746.ccr41.par01.atlas.cogentco.com [154.54.29.118]
8 120 ms 121 ms 121 ms be2475.ccr21.bio02.atlas.cogentco.com [130.117.48.181]
9 121 ms 119 ms 117 ms be2235.ccr21.mad05.atlas.cogentco.com [130.117.48.134]
10 108 ms 105 ms 109 ms be2852.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.36.166]
11 103 ms 106 ms 105 ms 149.11.68.50
12 105 ms 106 ms 107 ms CIEMAT.AE2.telmad.rt4.mad.red.rediris.es [130.206.245.2]
13 112 ms 112 ms 107 ms TELMAD.AE4.uv.rt1.val.red.rediris.es [130.206.245.89]
14 121 ms 123 ms 119 ms anella-val1-router.red.rediris.es [130.206.211.70]
15 * * * Request timed out.
16 115 ms 114 ms 120 ms grosso.upf.edu [84.89.134.145]
17 114 ms 120 ms 119 ms grosso.upf.edu [84.89.134.145]
18 120 ms 114 ms 115 ms grosso.upf.edu [84.89.134.145]

Trace complete.

Dayle Diamond
Joined: 5 Dec 12
Posts: 84
Credit: 1,629,213,415
RAC: 672,941
Message 44764 - Posted: 18 Oct 2016 | 5:25:25 UTC

Nanoprobe, check your messages ^^

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Message 44765 - Posted: 18 Oct 2016 | 8:18:46 UTC

Download / upload issues have been around for a while now. We have discussed them at sufficient length to conclude that neither we nor GPUGRID staff have a clue as to their cause. :(

For the record, I too have downloads / uploads stall, succeed only after a number of retries, and all this only for GPUGRID!

Beyond
Message 44767 - Posted: 18 Oct 2016 | 14:54:59 UTC - in response to Message 44765.

For the record, I too have downloads / uploads stall, succeed only after a number of retries, and all this only for GPUGRID!

Same here, but only downloads stall for me, and again: only for GPUGrid.

Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 170ms, Maximum = 215ms, Average = 175ms

It seems that everyone (including me) has this happening:

17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]

Is it the problem? Again, the ONLY project this happens to is GPUGrid and never on any other downloads of any kind.

Richard Haselgrove
Message 44768 - Posted: 18 Oct 2016 | 15:17:37 UTC - in response to Message 44767.

For me, it's only one file (or rarely two) that stalls, out of a dozen or more for a typical task. And it gets partway through the download before stalling.

That tells me that the destination address has been found, the route established, and the connection set up - no amount of pinging or tracerting is going to diagnose anything more than that. What we really need (and the project probably doesn't employ) is a network specialist experienced in analysing throughput at the individual packet level - and I don't know any of those, personally.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 44770 - Posted: 18 Oct 2016 | 20:12:19 UTC - in response to Message 44768.

If there is an internal network issue it's likely something the university needs to sort out, rather than the group.

captainjack
Message 44771 - Posted: 18 Oct 2016 | 20:34:27 UTC

Just processed a number of short tasks. Many of them had issues downloading files. Going back through the event log, all of the interruptions happened with the "*-psf_file" and "*-pdb_file" files.

More clues maybe?

Vagelis Giannadakis
Message 44772 - Posted: 18 Oct 2016 | 22:28:44 UTC - in response to Message 44767.

Same here, but only downloads stall for me, and again: only for GPUGrid.


After checking the logs a little more closely, I concur: it is only downloads that exhibit this symptom, not uploads. For example:

18-Oct-2016 23:54:49 [GPUGRID] Requesting new tasks for NVIDIA GPU
18-Oct-2016 23:54:52 [GPUGRID] Scheduler request completed: got 1 new tasks
18-Oct-2016 23:54:54 [GPUGRID] Started download of ...
...
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:03:05 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:02:47 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file
19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-psf_file
19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-par_file
...

The downloads of these two files kept failing and retrying; they took about 10 minutes to complete.
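
For anyone who wants to check which file types stall most often in their own event log, here is a rough sketch, assuming Python 3 and log lines worded exactly like the ones above. The helper name summarize_failures is made up for illustration, and grouping by the text after the last hyphen is only a guess at how the file kinds are named.

```python
import re
from collections import Counter

# Two failure/back-off pairs copied from the event log excerpt above.
LOG = """\
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:03:05 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:02:47 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file
"""

FAILED = re.compile(r"Temporarily failed download of (\S+): ")
BACKOFF = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")


def file_kind(name):
    """Return the part after the last hyphen, e.g. 'pdb_file'."""
    return name.rsplit("-", 1)[-1]


def summarize_failures(log_text):
    """Count failed downloads and total back-off seconds per file kind."""
    failures = Counter()
    backoff_seconds = Counter()
    for line in log_text.splitlines():
        failed = FAILED.search(line)
        if failed:
            failures[file_kind(failed.group(1))] += 1
            continue
        backoff = BACKOFF.search(line)
        if backoff:
            hours, minutes, seconds = (int(n) for n in backoff.group(1, 2, 3))
            total = hours * 3600 + minutes * 60 + seconds
            backoff_seconds[file_kind(backoff.group(4))] += total
    return failures, backoff_seconds


if __name__ == "__main__":
    print(summarize_failures(LOG))
```

Run over a longer log, the two counters would show whether particular file kinds really stall more often than the rest, or whether it just looks that way.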

Retvari Zoltan
Message 44774 - Posted: 18 Oct 2016 | 23:42:13 UTC - in response to Message 44767.

It seems that everyone (including me) has this happening:

17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]

Is it the problem?

I assume you refer to #18: It's quite normal that some routers don't reply to requests which come from random computers on the internet.
I hoped to get some clues, but we're still just guessing at the problem.
To investigate this issue, the network admins at the campus would need to do some packet-level network traffic analysis and then decide whether to take countermeasures locally or contact other ISPs for a solution. But frankly I think this issue doesn't have that much impact on the project's throughput. I don't know how many sites are hosted on this server (besides ps3grid.net and gpugrid.net). I presume there are a lot of servers hosting a lot of webpages at the campus which are routed through the same devices. Their traffic may interfere with GPUGrid's traffic, but it can't be analysed from outside.

Vagelis Giannadakis
Message 44775 - Posted: 19 Oct 2016 | 8:06:06 UTC

I think this is more likely a gateway / firewall / reverse proxy issue. The connections are not closed, they are just stalled. Force-closing a connection raises an error on both sides immediately, and clearly this does not happen with our downloads. I think some network component (hardware or software), through which our connections are routed, intervenes and stalls them. Perhaps some network traffic limiter?

Richard Haselgrove
Message 44776 - Posted: 19 Oct 2016 | 8:13:26 UTC - in response to Message 44772.

transient HTTP error

Transient HTTP errors can be diagnosed further by setting the <http_debug> event log flag in BOINC. I'll do that next time I'm due to download a new task (if I remember to notice in time), but my expectation is that it will turn out to be simply BOINC's own timeout, which doesn't get us much further forward.

But it would confirm that reducing the timeout to 60 seconds is likely to help.
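
In case it helps anyone else switch the flag on: log flags sit in a log_flags section of cc_config.xml, next to the options section. The following is a sketch only, assuming Python 3; the function name enable_log_flag and the EXISTING sample text are illustrative. It adds the flag to a config that already has other settings, without disturbing them.

```python
import xml.etree.ElementTree as ET

# Illustrative starting point: a config that already carries one option.
EXISTING = """<cc_config>
  <options>
    <http_transfer_timeout>60</http_transfer_timeout>
  </options>
</cc_config>"""


def enable_log_flag(config_text, flag):
    """Return config text with <flag>1</flag> set inside <log_flags>."""
    root = ET.fromstring(config_text)
    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")
    element = log_flags.find(flag)
    if element is None:
        element = ET.SubElement(log_flags, flag)
    element.text = "1"
    if hasattr(ET, "indent"):  # pretty-printing helper, Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(enable_log_flag(EXISTING, "http_debug"))
```

The extra log lines are verbose, so it is probably worth turning the flag off again once a stalled download has been captured.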

Richard Haselgrove
Message 44780 - Posted: 19 Oct 2016 | 10:48:19 UTC

Well, I've downloaded and logged a new task, and - wouldn't you believe - it didn't get stuck.

But here's a log section the network gurus could have a look at.

19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 2736 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] http op done; retval 0 (Success)
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] file transfer status 0 (Success)
19-Oct-2016 10:46:45 [GPUGRID] Finished download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-vel_file
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] Throughput 0 bytes/sec
19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file
19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'D:\BOINC\ca-bundle.crt'
19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
19-Oct-2016 10:46:45 [GPUGRID] Started download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] URL: http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Found bundle for host www.gpugrid.org: 0x40b89e0 [can pipeline]
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Re-using existing connection! (#1191) with host www.gpugrid.org
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#1191)
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: GET /PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file HTTP/1.1
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Host: www.gpugrid.org
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.7.0)
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept: */*
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Encoding: deflate, gzip
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Content-Type: application/x-www-form-urlencoded
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Language: en_GB
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server:
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 12696 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes

Most of the time the download jogged along writing 1368 bytes at a time: I'm interpreting that as individual packets being received in the right order, and being sent to the disk-writing queue immediately.

"wrote 2736 bytes" appears a lot of times too - probably two packets arriving in reverse order, and both needing to be processed before being written.

But when a new file was being requested, the writes increased to 16384 bytes, and stayed that way for some time. That suggests to me that something in one or other system - server or client - is having problems walking and chewing gum at the same time. Since this is the only project where it happens, I'd suggest that possibly the server is the one on the verge of being overloaded.

Connection [ID#1271] was downloading:

e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-coor_file - 924 KB with throughput 730677 bytes/sec (over 500 packets/sec, if my analysis is right). That's going to be really hard to diagnose from outside the lab, and even inside it without specialist equipment and skills. But one thing comes to mind - restricting BOINC to one file being transferred at a time might ease the pressure caused by that hiccup in the middle.
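
The arithmetic behind that packets/sec figure, plus a quick tally of the write sizes, can be sketched as follows, assuming Python 3. Treating one 1368-byte write as one packet is only the interpretation offered above, not something the log confirms, and the function names are made up for illustration.

```python
import re
from collections import Counter

# A few representative lines from the log section above.
SAMPLE_LOG = """\
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 2736 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
"""

WROTE = re.compile(r"HTTP: wrote (\d+) bytes")


def write_size_histogram(log_text):
    """Count how often each write size appears in an [http_xfer] log."""
    return Counter(int(size) for size in WROTE.findall(log_text))


def packets_per_second(throughput_bytes_per_sec, payload_bytes=1368):
    """Estimate the packet rate, assuming one write of payload_bytes per packet."""
    return throughput_bytes_per_sec / payload_bytes


if __name__ == "__main__":
    print(write_size_histogram(SAMPLE_LOG))
    # Throughput reported for the coor_file download above.
    print(round(packets_per_second(730677)))
```

On the idea of restricting transfers: BOINC's client configuration documents max_file_xfers and max_file_xfers_per_project options in cc_config.xml, though nobody in this thread has tested whether lowering them helps here.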

Beyond
Message 44782 - Posted: 19 Oct 2016 | 16:32:25 UTC - in response to Message 44776.

But it would confirm that reducing the timeout to 60 seconds is likely to help.

It does help. I'll post again what made these GPUGrid downloads acceptable for me:

<http_transfer_timeout>60</http_transfer_timeout>

That helps but in order to make it more acceptable I also have to start BOINC from the command line and use this argument:

--pers_retry_delay_max 60

It still starts and stops but retries much more quickly. Downloads went from sometimes taking hours to now around 7-8 minutes. We shouldn't have to jump through these hoops but at least these workarounds help (a lot).
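
To illustrate why capping the retry delay shortens the total download time, here is a minimal sketch of a randomized exponential backoff with a ceiling. It is a generic model and not BOINC's actual retry algorithm: the function name `retry_delay`, the doubling rule and the jitter are all assumptions made for illustration.

```python
import random
from typing import Optional


def retry_delay(failures: int, min_delay: float, max_delay: float,
                rng: Optional[random.Random] = None) -> float:
    """Hypothetical capped exponential backoff with jitter (NOT BOINC's formula).

    failures  -- number of consecutive failed attempts so far
    min_delay -- shortest delay allowed, in seconds
    max_delay -- ceiling on the delay, in seconds
    """
    rng = rng or random.Random()
    # The upper bound doubles with every failure but never exceeds max_delay.
    ceiling = min(max_delay, min_delay * (2 ** failures))
    # Jitter: pick a delay somewhere between the floor and the current ceiling.
    return rng.uniform(min_delay, ceiling)


# With the ceiling equal to the floor, every retry waits exactly 60 seconds.
print([retry_delay(n, 60.0, 60.0) for n in range(4)])
```

In this model a high ceiling lets the wait grow after each stall, while a ceiling of 60 keeps every retry one minute apart, which matches the behaviour described above.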

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 44790 - Posted: 20 Oct 2016 | 9:52:52 UTC

OK, I was able to capture a DEBUG-level section of the log with a failed download. It confirms (of course) the experience we have: the file download begins, a part of the file is downloaded, then the download stalls.

Here's how the download begins. Notice connection #2131 (already open from a previous download / scheduler request), the thread performing the download (ID#2558), and the server-reported file size of 3193090 bytes:


20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
20-Oct-2016 01:14:18 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:14:18 [---] [http_xfer] [ID#2555] HTTP: wrote 4140 bytes
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Found bundle for host www.gpugrid.org: 0x3f09850
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Re-using existing connection! (#2131) with host www.gpugrid.org
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2131)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: GET /PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file HTTP/1.1
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.6.9)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Host: www.gpugrid.org
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept: */*
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept-Encoding: deflate, gzip
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Content-Type: application/x-www-form-urlencoded
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept-Language: en_US
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server:
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: HTTP/1.1 200 OK
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Date: Wed, 19 Oct 2016 22:09:48 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Server: Apache/2.2.3 (CentOS)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Last-Modified: Mon, 03 Oct 2016 08:52:19 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: ETag: "6b8c03c-30b902-53df20fbd56c0"
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Accept-Ranges: bytes
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Content-Length: 3193090
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Cache-Control: max-age=300
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Expires: Wed, 19 Oct 2016 22:14:48 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Content-Type: text/plain; charset=UTF-8
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server:
20-Oct-2016 01:14:18 [---] [http_xfer] [ID#2558] HTTP: wrote 1053 bytes
...
20-Oct-2016 01:14:19 [---] [http_xfer] [ID#2558] HTTP: wrote 1380 bytes


At this point, the client has downloaded a part of the file and the connection stalls. Then, after about 5 minutes, the client gives up. It closes the connection and the thread dies (it does not appear in the log again):


20-Oct-2016 01:19:25 [GPUGRID] [http] [ID#2558] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
20-Oct-2016 01:19:25 [GPUGRID] [http] [ID#2558] Info: Closing connection 2131
20-Oct-2016 01:19:25 [GPUGRID] [http] HTTP error: Timeout was reached
20-Oct-2016 01:19:26 [GPUGRID] Temporarily failed download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file: transient HTTP error
20-Oct-2016 01:19:26 [GPUGRID] Backing off 00:02:04 on download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file


About 2 minutes later, it tries again. Notice the new connection #2134 and the new thread ID#2579. Also notice the Range header in the request, asking the server to resume at byte offset 227893 instead of the beginning, and the 206 Partial Content status code, the Content-Length and the Content-Range headers in the server's response:


20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
20-Oct-2016 01:21:31 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2134)
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Sent header to server: Range: bytes=227893-
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: HTTP/1.1 206 Partial Content
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: Content-Length: 2965197
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: Content-Range: bytes 227893-3193089/3193090
20-Oct-2016 01:21:32 [---] [http_xfer] [ID#2579] HTTP: wrote 995 bytes
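
The resume headers in the log above can be cross-checked with a few lines of Python. This is a sketch under the assumption that the Content-Range value has the simple `bytes first-last/total` form shown in the log; the function name `check_partial_response` is made up for this example.

```python
import re


def check_partial_response(range_start: int, content_length: int,
                           content_range: str) -> bool:
    """Check that a 206 response is consistent with an open-ended Range request.

    range_start    -- offset sent in the request ("Range: bytes=<range_start>-")
    content_length -- value of the Content-Length response header
    content_range  -- value of the Content-Range response header
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", content_range)
    if match is None:
        raise ValueError(f"unexpected Content-Range: {content_range!r}")
    first, last, total = (int(g) for g in match.groups())
    return (
        first == range_start                     # server resumed where we asked
        and last == total - 1                    # and will send up to the final byte
        and content_length == last - first + 1   # body length covers that span
    )


# Values taken from the log excerpt above.
print(check_partial_response(227893, 2965197, "bytes 227893-3193089/3193090"))
```

The already-downloaded 227893 bytes plus the 2965197 bytes promised here add up to the original 3193090.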


The file sizes match up and the download begins once more. The client downloads another chunk of the file and then the connection stalls again. Again the connection is closed and the thread dies:


...
20-Oct-2016 01:21:34 [---] [http_xfer] [ID#2579] HTTP: wrote 1380 bytes
20-Oct-2016 01:26:39 [GPUGRID] [http] [ID#2579] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
20-Oct-2016 01:26:39 [GPUGRID] [http] [ID#2579] Info: Closing connection 2134
20-Oct-2016 01:26:39 [GPUGRID] [http] HTTP error: Timeout was reached
20-Oct-2016 01:26:39 [GPUGRID] Temporarily failed download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file: transient HTTP error
20-Oct-2016 01:26:39 [GPUGRID] Backing off 00:06:28 on download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
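
For what it's worth, the two timeouts above can be compared with the 300-second limit reported in the "Operation too slow" message. The sketch below assumes that the last "wrote ... bytes" line shown before each stall really was the last data received, and that both timestamps fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def gap_seconds(last_write: str, timed_out: str) -> int:
    """Seconds between the last logged write and the timeout message."""
    delta = datetime.strptime(timed_out, FMT) - datetime.strptime(last_write, FMT)
    return int(delta.total_seconds())


# (last write, "Operation too slow") pairs copied from the log excerpts above.
attempts = [("01:14:19", "01:19:25"), ("01:21:34", "01:26:39")]
gaps = [gap_seconds(last, timeout) for last, timeout in attempts]
print(gaps)  # [306, 305]
```

Both gaps sit just over 300 seconds, so in each case the connection appears to have received essentially nothing from the moment it stalled until the client gave up.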


After about 6 minutes, the client retries:


20-Oct-2016 01:33:07 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
...
20-Oct-2016 01:33:07 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Connection 2137 seems to be dead!
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Closing connection 2137
...
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2138)
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Sent header to server: GET /PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file HTTP/1.1
...
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Received header from server: HTTP/1.1 206 Partial Content
...
20-Oct-2016 01:33:08 [---] [http_xfer] [ID#2581] HTTP: wrote 995 bytes
...


Finally, the download succeeds:


20-Oct-2016 01:33:26 [---] [http_xfer] [ID#2581] HTTP: wrote 179 bytes
20-Oct-2016 01:33:26 [GPUGRID] [http] [ID#2581] Info: Connection #2138 to host www.gpugrid.org left intact
20-Oct-2016 01:33:27 [GPUGRID] Finished download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file


This is not an error at the transport layer or lower. Such errors are either corrected automatically by the mechanism implementing the relevant layer (e.g. a lost packet is retransmitted automatically by TCP), or they immediately raise an error to the application layer above (the BOINC client in our case). Rather, this is caused by some mechanism working at the application layer, through which our connections are routed and which intervenes under certain conditions.

I checked the lifetimes of the connections, to see whether they stall some fixed time after being opened, but lifetime doesn't seem to matter: the second download attempt above stalled a few seconds after the connection had been opened.

I assert that GPUGRID has a mechanism, like a reverse proxy, which filters HTTP connections and selectively pauses them (but does not force-close them!) under some criteria; I believe it happens when a certain bandwidth is reached.

Can someone from the project please check with the network people and see if this is the case?
____________

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 44830 - Posted: 24 Oct 2016 | 14:34:34 UTC - in response to Message 44790.
Last modified: 24 Oct 2016 | 14:58:43 UTC

Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed.

Cache-Control: max-age=300
That tells us the server's Hyper Text Transfer Protocol time-out is 5min.
Perhaps that should be reduced server side, say to 120?

Maybe reducing the number of simultaneous connections would help, but it might just spread the problem.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 44831 - Posted: 25 Oct 2016 | 0:08:10 UTC - in response to Message 44830.

It also may be a consideration that this is on a university campus. When I worked for my last company, we worked with municipalities and schools to get them streaming television channels online and also allow online access to transfer MPEG-2 files into the servers from around the campus and off campus. Many times the university television staff would be at constant odds with the IT department because many departments wanted the bandwidth and there was only so much to go around. The IT department would act like they were working with us and the department to improve speed or cut down on interruptions, but then we would catch them by doing ongoing pings between us and the station computers and giving the data to IT and they would deny for a while and then say, "Oh yeah, that limiter parameter! We forgot about that! We'll 'loosen' that for you to get a better stream." Then we would still get calls from them asking why "our stream" was cutting out on people and always traced it back to IT giving bandwidth to other departments and putting limiters on the bandwidth that would make the signal intermittent. So maybe the department needs to battle for a more steady bandwidth, even if they have to trade speed for stability. If everybody was a few K slower up and downloading but the signal never broke, maybe we could live with that easier?
____________
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44833 - Posted: 25 Oct 2016 | 14:10:50 UTC - in response to Message 44830.

Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed.

Cache-Control: max-age=300
That tells us the server's Hyper Text Transfer Protocol time-out is 5min.
Perhaps that should be reduced server side, say to 120?

Maybe reducing the number of simultaneous connections would help, but it might just spread the problem.

I don't think it's a traffic issue as it stalls here on every WU download, often several times before the download is complete. Doesn't matter what time of day. Again, this never happens on any other download of any kind. Only GPUGrid.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 44835 - Posted: 25 Oct 2016 | 17:43:10 UTC
Last modified: 25 Oct 2016 | 17:47:07 UTC

FWIW, I see the same thing on almost every download. But I have two GTX 960s on the same machine, which I just started up again. One of them downloaded all the files, while the second one got stuck as usual. It is always the longest file (or maybe the second-longest), and I concluded some time ago that it must be a problem with the server rather than the network. It seems to pause the long ones to give preference to the shorter ones, and then can't start up again.

After it times out at least once (after 5 minutes), I can manually restart the download OK. Or else I can just leave it alone, and it completes on the second or third try. So I lose maybe 10 minutes, and I don't worry about it.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 44837 - Posted: 25 Oct 2016 | 18:10:35 UTC

Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr?

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 44838 - Posted: 25 Oct 2016 | 18:47:40 UTC - in response to Message 44837.

Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr?

All I can say is that it is reproducible, and so I don't see how it can be a network issue, unless it is a router or switch on your own network. It could be some sort of traffic-shaping that a router might do; I don't really know that it is a server per se, but it is not at all random. About the only time I don't see it is when re-attaching to the project after an absence, though I have not done rigorous tests on that.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 44846 - Posted: 26 Oct 2016 | 0:52:09 UTC - in response to Message 44838.

I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps?

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 44847 - Posted: 26 Oct 2016 | 1:28:08 UTC - in response to Message 44846.

I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps?

I think that's not the right reason. Right now the network traffic at GPUGrid is very low, because there's plenty of work available in both queues, so there are no constant unfulfilled work requests. However, network statistics from calm and disturbed periods could prove or disprove it.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 44848 - Posted: 26 Oct 2016 | 2:18:59 UTC - in response to Message 44847.

That may be so, but I wonder if it is related to the fact that I run GPUGrid with a zero resource share? As one work unit ends and starts to upload, a new work unit starts to download. That is when I see the pauses, sometimes both on the upload and download.

So I wonder whether other people who are having the problem use a zero resource share also? If not, the pauses should not matter even if they occur, since the downloads for the next work unit will normally occur long before the current work unit is finished, and any pauses will be hidden.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44850 - Posted: 26 Oct 2016 | 2:54:30 UTC - in response to Message 44848.

So I wonder whether other people who are having the problem use a zero resource share also?

I'm using a high resource share and almost always have stalls/pauses.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 44852 - Posted: 26 Oct 2016 | 8:36:09 UTC - in response to Message 44850.

I just started another download three hours after the last upload finished, and it got stuck again on the longest file:
e15s9_e13s9p0f80-SDOERR_CASP11_crystal_ss_20ns_ntl9_1-0-pdb_file
(897.12 K file size)

All the shorter files downloaded quickly. So it does not seem to depend on uploading and downloading at the same time. And as usual, I was able to restart it after the 5-minute timeout, and it finished the download OK. So again it seems to be a server problem or something related to it. I don't see how a transmission problem could distinguish so reliably between files based on their size. And I have a good cable modem connection, at 25Mbps/4Mbps, which I usually exceed in tests.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 44856 - Posted: 26 Oct 2016 | 13:06:03 UTC
Last modified: 26 Oct 2016 | 13:11:13 UTC

Hm this sounds suspiciously familiar. We are having issues with another webservice of ours getting stuck at loading from time to time these days. The two could be related if the network of the university is having problems. I will report this to our guys just in case.

Profile caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 44857 - Posted: 26 Oct 2016 | 14:05:32 UTC - in response to Message 44856.

The Computing Preferences tab of the BOINC Options list has upload and download rate limiting. On some of my systems I set this to less than half of what they can push when opening the connection, and I have seen these user-side rate-limited connections pause and time out less, if at all. It may be that the university is limiting the bandwidth and one of the triggers is a noticeable spike for a single connection.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 44861 - Posted: 26 Oct 2016 | 17:13:21 UTC - in response to Message 44857.

The Computing Preferences tab of the BOINC Options list has upload and download rate limiting. On some of my systems I set this to less than half of what they can push when opening the connection, and I have seen these user-side rate-limited connections pause and time out less, if at all. It may be that the university is limiting the bandwidth and one of the triggers is a noticeable spike for a single connection.

Doubt it. My max DL speed is 5 Mbps. That can't tax anybody's server.

(centurylink monopoly dsl. We don't care, we don't have to. We're the phone company...)

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level
Asp
Message 44866 - Posted: 27 Oct 2016 | 10:14:58 UTC - in response to Message 44837.

Stefan wrote:
Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr?


You can start by asking the university / campus IT people whether they are doing any form of traffic shaping on incoming connections to servers in the university. If there is some traffic shaping going on, you can tell them your contributors have reported problems downloading files (tasks) from certain servers (grosso??) and ask them to monitor the traffic shaping for incoming connections to your servers. Finally, ask them to report any findings to you and, if you do find we are victim to any bandwidth / number of connections limiting mechanism, start to exercise the fine art of negotiating for "MOAR BANDWIDTH!!" :D
____________

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 45060 - Posted: 31 Oct 2016 | 8:58:43 UTC
Last modified: 31 Oct 2016 | 13:06:00 UTC

University staff have a "won't bother looking into it till you prove it" attitude, so right now Jose is running a script from home testing the connection over a few days. Then we can throw the hard cold data at them and tell them to fix it.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 45063 - Posted: 31 Oct 2016 | 13:08:34 UTC

Does anyone notice download problems on weekends?

captainjack
Joined: 9 May 13
Posts: 171
Credit: 2,322,079,288
RAC: 2,364,757
Level
Phe
Message 45066 - Posted: 31 Oct 2016 | 14:27:35 UTC

Stefan asked

Does anyone notice download problems on weekends?


Yes

Sun 30 Oct 2016 11:45:55 PM CDT | | Project communication failed: attempting access to reference site
Sun 30 Oct 2016 11:45:55 PM CDT | GPUGRID | Temporarily failed download of e26s11_e22s4p0f35-SDOERR_CASP11_crystal_ss_20ns_ntl9_0-0-psf_file: transient HTTP error
Sun 30 Oct 2016 11:45:55 PM CDT | GPUGRID | Backing off 00:06:21 on download of e26s11_e22s4p0f35-SDOERR_CASP11_crystal_ss_20ns_ntl9_0-0-psf_file
Sun 30 Oct 2016 11:45:57 PM CDT | | Internet access OK - project servers may be temporarily down.


Sun 30 Oct 2016 04:42:45 PM CDT | | Project communication failed: attempting access to reference site
Sun 30 Oct 2016 04:42:45 PM CDT | GPUGRID | Temporarily failed download of e12s17_e4s21p0f210-PABLO_SH2TRIPEP_Q_TRI_2-0-pdb_file: transient HTTP error
Sun 30 Oct 2016 04:42:45 PM CDT | GPUGRID | Backing off 00:04:07 on download of e12s17_e4s21p0f210-PABLO_SH2TRIPEP_Q_TRI_2-0-pdb_file
Sun 30 Oct 2016 04:42:46 PM CDT | | Internet access OK - project servers may be temporarily down.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,336,851
RAC: 8,787,904
Level
Tyr
Message 45071 - Posted: 31 Oct 2016 | 18:04:02 UTC - in response to Message 45063.

Does anyone notice download problems on weekends?

Yes, here too. Examples from just one machine:
29-Oct-2016 01:51:04 [GPUGRID] Temporarily failed download of e10s3_e9s8p0f10-SDOERR_CASP11_crystal_contacts_20ns_a3D_0-0-coor_file: transient HTTP error

29-Oct-2016 05:08:31 [GPUGRID] Temporarily failed download of e16s12_e9s18p0f486-GERARD_CXCL12CHALCLD_mol0_2-0-coor_file: transient HTTP error

29-Oct-2016 10:31:01 [GPUGRID] Temporarily failed download of e6s1_e5s2p0f181-SDOERR_CASP11_crystal_ss_50ns_a3D_0-0-pdb_file: transient HTTP error

30-Oct-2016 14:03:47 [GPUGRID] Temporarily failed download of e28s4_e27s3p0f1-SDOERR_CASP11_crystal_ss_20ns_ntl9_1-0-psf_file: transient HTTP error

30-Oct-2016 22:11:57 [GPUGRID] Temporarily failed download of e13s11_e10s4p0f159-SDOERR_CASP11_crystal_ss_contacts_20ns_a3D_1-0-pdb_file: transient HTTP error

I do have a copy of Wireshark available and I can try to capture a log, if that would be helpful?
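
To confirm that the failures listed above really fall on a weekend, the date at the start of each log line can be converted to a weekday. This is a small sketch for the `DD-Mon-YYYY` format used in these lines only (it does not handle the `Sun 30 Oct 2016 ...` style from the previous post); `weekday_of` is a made-up helper name.

```python
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def weekday_of(log_line: str) -> str:
    """Return the weekday name for a log line that starts with DD-Mon-YYYY."""
    day, month, year = log_line.split()[0].split("-")
    return DAYS[date(int(year), MONTHS[month], int(day)).weekday()]


# Dates taken from the failed downloads listed above.
for stamp in ["29-Oct-2016 01:51:04", "30-Oct-2016 22:11:57"]:
    print(stamp, "->", weekday_of(stamp))
```

The 29th and 30th of October 2016 were a Saturday and a Sunday, so all five failures happened over the weekend.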

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 45073 - Posted: 31 Oct 2016 | 18:38:02 UTC - in response to Message 45071.

I do have a copy of Wireshark available and I can try to capture a log, if that would be helpful?

You can have a try, but we'll see similar events: some http requests remain unanswered, but we won't know which device blocked/dropped that packet (and why). Perhaps if it's a packet fragmentation issue we'll see something useful in the log.

mindcrime
Joined: 27 Feb 14
Posts: 4
Credit: 121,376,887
RAC: 0
Level
Cys
Message 45074 - Posted: 31 Oct 2016 | 18:50:05 UTC - in response to Message 44756.

While I don't think the staff of GPUGrid could do anything about your HTTP timeout problem, out of curiosity I ask you to run a very basic network diagnostics:
If you have a Windows based PC on the same network as your crunching box, please open a command prompt and type

ping www.gpugrid.net -n 100

You can do it on Linux also, but I'm not familiar with its command syntax (the -n 100 parameter tells the ping command to try 100 times).
You'll see a lot of (exactly 100, if everything's going well) messages like:

Reply from 84.89.134.145: bytes=32 time=83ms TTL=49

Then, at the end:

Ping statistics for 84.89.134.145: Packets: Sent = 100, Received = 100, Lost = 0 (0% loss), Approximate round trip times in milli-seconds: Minimum = 83ms, Maximum = 88ms, Average = 83ms

These are the actual results of my host, I'm curious about your statistics.
I expect your loss of packets and the round trip times be significantly higher than what I experience.
Unfortunately these numbers do not reveal the device which is responsible for your problem, but I'm quite confident in that it's closer to your end (most probably it's at your ISP) than to the GPUGrid site (in this case much more users would have such difficulties).

You could also try a traceroute command:

tracert www.gpugrid.net

Which gives you a list of the devices between your end and grosso.upf.edu (on which the gpugrid.net project resides).
Perhaps this list could help us to figure out what's wrong. Especially if it gives you very different results when you run it multiple times.
In some cases these errors are simply caused by network congestion (when the ISP has limited bandwidth to certain destinations), but it could depend on the time of the day. On your end however, P2P file sharing applications or appliances, a faulty router/switch could cause such strange errors (but I'm sure in this case there would be problems with other sites as well).
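
On the Linux syntax question in the quoted post: the count option of `ping` is `-c` on Linux (and macOS), where Windows uses `-n`. A small sketch that builds the right command line for either platform follows; `build_ping_command` is a made-up helper name, and the sketch only builds the command without running it.

```python
import platform
from typing import List, Optional


def build_ping_command(host: str, count: int,
                       system: Optional[str] = None) -> List[str]:
    """Build a ping command that sends `count` echo requests to `host`.

    `system` defaults to the current platform; pass "Windows" or "Linux"
    explicitly to see the command for another operating system.
    """
    system = system or platform.system()
    # Windows uses -n for the request count; Linux and macOS use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


print(" ".join(build_ping_command("www.gpugrid.net", 100, system="Linux")))
# ping -c 100 www.gpugrid.net
```

The resulting list can be passed to `subprocess.run` to run the test.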



Nanoprobe's network/setup/config is NOT the issue, I've experienced this issue many times on different machines with different OS and connections. The issue is exactly as he describes, usually the larger file will hangup and some of the smaller ones will finish. Then after timing out the big one will restart and make a small amount of progress and hang again. This is unique to this project, I have no issues elsewhere. It IS on gpugrid's side, not sure if it's the project or their provider.

People regularly mention download/upload problems around here. Currently experiencing this on win7 64bit and linux 64bit

mindcrime
Joined: 27 Feb 14
Posts: 4
Credit: 121,376,887
RAC: 0
Level
Cys
Message 45075 - Posted: 31 Oct 2016 | 18:51:56 UTC - in response to Message 45060.
Last modified: 31 Oct 2016 | 19:38:30 UTC

University staff have a "won't bother looking into it till you prove it" attitude, so right now Jose is running a script from home testing the connection over a few days. Then we can throw the hard cold data at them and tell them to fix it.



Tell them to install the BOINC client and add GPUGrid to it. I bet they'll get some hangups.

I currently have a stalled file transfer (libcufft.so.6.5). After it stalls and times out, I can watch my network activity when I retry it: it spikes up but immediately comes back down and stalls, looking like half a sine wave.

I'm not an IT guy. I could do tracerts and whatnot, but I have no idea how to diagnose this, as everything points to the server side for the following reasons.

-This is the only project I have this problem on
-Many other people have posted about "transient http error" for over 6 months. This kind of error is almost unheard of on other projects.
-But most importantly, veteran crunchers with years of BOINC experience are telling you that they cannot contribute.

And what kind of IT department doesn't do the IT work? Sounds like they said we don't want to figure it out, you figure it out. That's pretty messed up.

edit: felt I should follow up since I made some progress after I got ranty.

I edited my cc_config on a linux machine to do 1 max transfer per project and I was able to get all the files without interruptions. It feels like there's something at play affecting parallel downloads to the same IP/host?

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 45077 - Posted: 31 Oct 2016 | 19:33:15 UTC - in response to Message 45074.

Nanoprobe's network/setup/config is NOT the issue, ...

I'm aware of that. ISPs are doing some traffic shaping (or QoS), which could result in issues like this one.
Most probably the campus' ISP (or WAN operator, or IT staff) is to blame.
This issue began when there was a change in the network at the campus about a year ago.
It was much worse than now in the beginning, but it seems that there is still something which escaped their attention.

I've experienced this issue many times on different machines with different OS and connections. The issue is exactly as he describes, usually the larger file will hangup and some of the smaller ones will finish. Then after timing out the big one will restart and make a small amount of progress and hang again.

This is probability at work: large files are divided into many more packets than smaller ones, so if a packet gets lost from time to time, a larger file has a higher probability of getting stuck (even many times).
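
The size argument above can be put into numbers. The sketch below assumes, purely for illustration, that each packet independently has the same small chance `p` of triggering a stall, so a transfer of `n` packets stalls at least once with probability 1 - (1 - p)^n. The value of `p` and the 50 KB comparison file are hypothetical; only the 3193090-byte size and the 1380-byte write size come from the logs earlier in this thread.

```python
import math


def packets_needed(file_bytes: int, payload_bytes: int) -> int:
    """Number of packets needed to carry a file of the given size."""
    return math.ceil(file_bytes / payload_bytes)


def stall_probability(n_packets: int, p_per_packet: float) -> float:
    """Chance of at least one stall, assuming independent per-packet events."""
    return 1.0 - (1.0 - p_per_packet) ** n_packets


P_STALL = 1e-4   # hypothetical per-packet stall probability
PAYLOAD = 1380   # write size seen in the debug log earlier in the thread

for label, size in [("3193090-byte pdb_file", 3193090),
                    ("hypothetical 50 KB file", 50_000)]:
    n = packets_needed(size, PAYLOAD)
    print(f"{label}: {n} packets, stall chance {stall_probability(n, P_STALL):.1%}")
```

Whatever the true value of `p` is, the stall chance under this model grows with the packet count, so the largest file in a work unit is the one most likely to hang.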

This is unique to this project, I have no issues elsewhere. It IS on gpugrid's side, not sure if it's the project or their provider.

Perhaps GPUGrid's BOINC server log (compared to the user's log) could help in deciding this.

[AF>P4G] anthony
Joined: 14 Mar 10
Posts: 14
Credit: 501,938,373
RAC: 0
Level
Lys
Message 45079 - Posted: 31 Oct 2016 | 20:07:07 UTC
Last modified: 31 Oct 2016 | 20:11:18 UTC

Hello,

The problem is solved for me: I edited my cc_config file as caffeineyellow5 suggested in the second message.

If you don't have a cc_config.xml, create a file with Notepad ("Bloc-notes") containing the following line:
<http_transfer_timeout>10</http_transfer_timeout>
(I lowered the value to save time.)
Then rename the file to cc_config.xml

and put it into C:\ProgramData\BOINC
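
For anyone creating the file from scratch, here is a minimal sketch that generates it. It assumes the usual cc_config.xml layout, where settings sit inside an `<options>` element under a `<cc_config>` root, and it uses the 60-second timeout and the one-transfer-per-project limit mentioned earlier in this thread as example values. The function name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(http_transfer_timeout, max_file_xfers_per_project=None):
    """Return cc_config.xml content as a string (hypothetical helper)."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # Seconds without progress before the client gives up on a transfer.
    ET.SubElement(options, "http_transfer_timeout").text = str(http_transfer_timeout)
    if max_file_xfers_per_project is not None:
        # Limit on simultaneous file transfers per project.
        ET.SubElement(options, "max_file_xfers_per_project").text = str(
            max_file_xfers_per_project
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Prints the XML; save it as cc_config.xml in the BOINC data directory.
    print(build_cc_config(60, max_file_xfers_per_project=1))
```

If a cc_config.xml already exists, it is safer to add the lines inside its existing `<options>` section than to overwrite the file. The client should pick up the change after a restart or after re-reading the config files from the BOINC Manager menu.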

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 45132 - Posted: 3 Nov 2016 | 10:33:46 UTC

We sent them our tests which show the timeouts and now they are looking into it. Let's hope we get some news soon.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 45232 - Posted: 7 Nov 2016 | 15:58:39 UTC - in response to Message 45079.
Last modified: 7 Nov 2016 | 16:11:56 UTC

Hello, the problem is solved for me: I edited my cc_config file as caffeineyellow5 suggested in the second message.

If you don't have a cc_config.xml, create a file with Notepad ("Bloc-notes") containing the following line:
<http_transfer_timeout>10</http_transfer_timeout>

Put it into C:\ProgramData\BOINC

Anthony, it doesn't really solve the problem; it simply masks it somewhat so that DLs don't hang for hours. BTW, this was first suggested by Richard Haselgrove. A more complete workaround is the one I posted in the 5th message:

https://www.gpugrid.net/forum_thread.php?id=4399&nowrap=true#44724

Realize that these are only workarounds and not a real solution. The bad news is that they might tend to hammer the server with more requests than would be necessary if everything were working correctly. Personally, I wouldn't go under 60 for http_transfer_timeout.

Also the same DL problem is evident when trying to access long threads on the message board: the thread DL stalls especially on threads with too many graphics (like the crunchathlon thread for instance). Quite irritating. Again, this happens on no other projects except GPUGrid.

Arif Mert Kapicioglu
Joined: 26 May 10
Posts: 6
Credit: 597,131,550
RAC: 0
Level
Lys
Message 45253 - Posted: 12 Nov 2016 | 18:19:49 UTC

Have you received any new info on this issue? I can confirm that the problem occurs when downloading files with sizes <1MB. In fact, I had to attach a backup project just to keep my GPUs working.

Profile caffeineyellow5
Joined: 30 Jul 14
Posts: 225
Credit: 2,658,976,345
RAC: 0
Level
Phe
Message 45277 - Posted: 16 Nov 2016 | 2:05:25 UTC

I still suspect a cache/packet size, open active connection limit, timeout, or throttling issue at the University's IT level, which stands between the project and the world.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 45283 - Posted: 16 Nov 2016 | 18:12:12 UTC

Downloads still stalling here...

pvh
Joined: 17 Mar 10
Posts: 23
Credit: 1,173,824,416
RAC: 0
Level
Met
Message 45811 - Posted: 21 Dec 2016 | 12:54:55 UTC

I too am now wrestling with the download issue, manually trying to force libcufft through... This really reminds me of how the internet worked 20 years ago when it was hopelessly overloaded. I agree with caffeineyellow5 that this all points to some overloaded or malfunctioning network component on the campus causing it to randomly drop packets. This is something the network guys on the campus need to solve...

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,601,336,851
RAC: 8,787,904
Level
Tyr
Message 45812 - Posted: 21 Dec 2016 | 14:08:31 UTC

If you have a 'modern' internet connection, you can download the full darn official toolkit for CUDA 8.0 direct from NVidia:

https://developer.nvidia.com/cuda-toolkit

It's 1.2 GB in total, but only took me about 8 minutes to download - NVidia have good servers and connections.

I wouldn't bother installing the whole package: just use an archive manager (I used 7-zip) to pull the file(s) you need from cufft\bin\

For the Windows cufft64_80.dll I get a file size of 145,769,016 bytes (142,353 KB), and an MD5 of fe5ab557e61c775e6eda899a229dd42b - all identical to the file distributed by GPUGrid.

I'd need to rename the file from cufft64_80.dll to _cufft64_80.dll, and then drop it into the GPUGrid project folder in the BOINC data directory: click 'retry download' and it should accept that the download is complete.

The same procedure should work for other operating systems too, though you may need to mark the file as executable.
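
Before dropping the extracted file into the project folder, it may be worth checking it against the size and MD5 quoted above. A minimal sketch using only the Python standard library; the file path is an assumption about where you extracted the DLL, and the helper names are made up for illustration.

```python
import hashlib
import os

# Values quoted above for the Windows cufft64_80.dll.
EXPECTED_SIZE = 145_769_016
EXPECTED_MD5 = "fe5ab557e61c775e6eda899a229dd42b"


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a file in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_size, expected_md5):
    """True if both the byte count and the MD5 agree with the expected values."""
    if os.path.getsize(path) != expected_size:
        return False
    return file_md5(path) == expected_md5.lower()


if __name__ == "__main__":
    path = "cufft64_80.dll"  # hypothetical location of the extracted file
    if os.path.exists(path):
        print("match" if matches_expected(path, EXPECTED_SIZE, EXPECTED_MD5) else "MISMATCH")
    else:
        print(f"{path} not found")
```

Run it on the file before renaming it to _cufft64_80.dll; a mismatch suggests a different toolkit version or a damaged extraction.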

LSG
Joined: 22 Nov 10
Posts: 4
Credit: 647,970,482
RAC: 0
Level
Lys
Message 45815 - Posted: 21 Dec 2016 | 15:00:31 UTC

Downloads are hanging anywhere from 0.00% to 92.71% of the download. I've aborted several transfers that remain hung for an hour and keep cycling between "Download: active" (with nothing downloading), "Download: pending" (ditto), and "Download: retry in {time}." Very frustrating. Unproductive, too. This problem has not occurred in the three other BOINC projects I'm subscribed to. My location: NH, USA.

Tomas Brada
Joined: 3 Nov 15
Posts: 38
Credit: 6,768,093
RAC: 0
Level
Ser
Message 45818 - Posted: 21 Dec 2016 | 17:27:49 UTC - in response to Message 45815.

I've aborted several transfers that remain hung for an hour and keep cycling between "Download: active" (with nothing downloading), "Download: pending" (ditto), and "Download: retry in {time}".

Next time you can select "Suspend network activity" in the manager, wait a few seconds, and then resume it. This pauses the download and closes the stalled TCP connection, so a fresh one is opened when you resume.
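
The same suspend-and-resume trick can probably be scripted with BOINC's command-line tool. A minimal sketch, assuming `boinccmd` is on your PATH and is allowed to talk to the local client (it may need to be run from the BOINC data directory or given a password); the helper names are made up for illustration.

```python
import subprocess
import time


def network_mode_command(mode, boinccmd="boinccmd"):
    """Build the boinccmd call that sets the client's network mode."""
    if mode not in ("always", "auto", "never"):
        raise ValueError(f"unknown network mode: {mode}")
    return [boinccmd, "--set_network_mode", mode]


def bounce_network(pause_seconds=5, run=subprocess.run):
    """Suspend network activity, wait briefly, then resume it."""
    run(network_mode_command("never"), check=True)
    time.sleep(pause_seconds)
    # "auto" hands control back to the client's preferences.
    run(network_mode_command("auto"), check=True)


if __name__ == "__main__":
    bounce_network()
```

If your client normally runs with network activity set to "always", resume with that mode instead of "auto".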

LSG
Joined: 22 Nov 10
Posts: 4
Credit: 647,970,482
RAC: 0
Level
Lys
Message 45832 - Posted: 22 Dec 2016 | 2:52:58 UTC - in response to Message 45818.

Clever. Better than smacking "Retry now." Thanks. I'll have to remember that... for "normal" network problems. What's going on these past few weeks with GPUGRID isn't "normal." <Sigh>

Cheers,
LSG

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 7,520
Level
Trp
Message 45835 - Posted: 22 Dec 2016 | 9:02:57 UTC

I have a laptop connected through WiFi to the same network as my desktops. I can't download the cufft64_80.dll with my laptop, however I can download it with my desktops.
I suspect that this is caused by a packet scheduling / routing problem: packets arrive too far out of order, or a packet is delayed too long (perhaps dropped) en route. This fills the receive buffers of the network adapter, and if they get full before the missing packet arrives, the download stalls (that's why it can be fixed by increasing the number of receive buffers).

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 45978 - Posted: 30 Dec 2016 | 10:31:31 UTC - in response to Message 45835.

Should be solved - https://gpugrid.net/forum_thread.php?id=4466&nowrap=true#45967

LSG
Joined: 22 Nov 10
Posts: 4
Credit: 647,970,482
RAC: 0
Level
Lys
Message 45996 - Posted: 30 Dec 2016 | 19:41:20 UTC

To paraphrase "My Fair Lady": By George, I think they've got it!

About an hour ago, I downloaded two WUs, both estimated to take 1d 07:23:14. The download was fast -- no pauses / hiccups / delays / restarts. For the past month (I'm guessing), WU segments of 13+ MB would not download. These did.

Thanks for the fix. My location: NH, USA.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 46042 - Posted: 2 Jan 2017 | 22:44:52 UTC - in response to Message 45978.

Should be solved - https://gpugrid.net/forum_thread.php?id=4466&nowrap=true#45967

https://gpugrid.net/forum_thread.php?id=4466&nowrap=true#45967
