SLAC External Network Connectivity

Les Cottrell and Gary Buhrmaster

Reported Problem

On 6/26/07, Gregory Dubois reported by email:

I'm seeing high packet loss rates and long latencies to a variety of offsite addresses, including www.rl.ac.uk, www.energy.gov, and a variety of other commercial sites.

A number of university sites that I've tried all seem to be OK.

Is there some general problem with the public Internet? I see similar problems when trying to access the problem sites from Caltech.

Gregory

A later email explained this in more detail:

I saw very slow page loads on a number of commercial web sites, at first, like news.yahoo.com. I always got at least some amount of data back, but just very slowly and erratically. I followed up by looking at ping statistics. When packets were getting through they were taking longer than usual (e.g., ~100 ms for that site when I usually see 7-15ms), and loss rates were around 50%.

I saw the same effect from inky.its.caltech.edu, more or less.

I didn't see any problems in going to a number of other commercial sites, like www.google.com (well, to whatever host that happened to resolve to at that moment), or in going to a few university sites that I tried.

First Glimpse

We checked whether we could reproduce the losses to the sites Gregory noted, or to other sites close by. We found a host that was giving poor web response time to SLAC: transit1.511.org. Pings to that host showed 38% packet loss, so something was definitely wrong. The traceroute showed extended RTTs at the handoff between ESnet and Level3, between Palo Alto and San Jose (just a few miles apart). We then ran pathneck [1] from SLAC to the host. This also showed that the bottleneck was somewhere close to the exchange between ESnet and Level3.
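The two checks described above (loss rate from ping, and where the RTT jumps along a traceroute) can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, the summary line assumes the Linux iputils ping format, and the per-hop RTT values are invented for the example rather than taken from the measurements.

```python
import re


def packet_loss_percent(ping_summary: str) -> float:
    """Extract the loss percentage from a ping summary line.

    Assumes a summary of the form '... 38% packet loss ...'.
    """
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_summary)
    if match is None:
        raise ValueError("no packet-loss figure found in summary")
    return float(match.group(1))


def largest_rtt_jump(hop_rtts_ms):
    """Return (hop_number, increase_ms) for the largest RTT increase
    between consecutive hops. Hop numbers are 1-based; the returned hop
    is the one where the higher RTT first appears. Returns (None, 0.0)
    if no hop shows an increase.
    """
    best_hop, best_jump = None, 0.0
    for i in range(1, len(hop_rtts_ms)):
        jump = hop_rtts_ms[i] - hop_rtts_ms[i - 1]
        if jump > best_jump:
            best_hop, best_jump = i + 1, jump
    return best_hop, best_jump


# Hypothetical inputs, for illustration only.
summary = "100 packets transmitted, 62 received, 38% packet loss, time 99123ms"
rtts = [1.0, 1.0, 2.0, 2.0, 32.0, 33.0]

print(packet_loss_percent(summary))  # 38.0
print(largest_rtt_jump(rtts))        # (5, 30.0)
```

A large jump that persists on all later hops suggests congestion at or just before that hop; a jump at a single hop that disappears afterwards is more often a router that is slow to answer probes.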

We also saw poor web response time from www.yahoo.com, with ping losses of about 40%. The traceroute as far as hop 11 was the same as for the transit1.511.org host. At hop 14 the route is transferred to Equinix and the RTT increases from 6 ms to 36 ms. The poor response to Yahoo from Caltech continued into the evening of 6/26/07; however, this was not seen from SLAC. Of course, which site or node one gets for Yahoo is probably unpredictable, especially if some of the pings are made from SLAC and others from Caltech (by Gregory Dubois).
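One way to see why results for a name like www.yahoo.com are hard to compare between sites is to list every address the name resolves to from each vantage point. The sketch below uses Python's standard socket.getaddrinfo; the function name distinct_addresses and the injectable resolver parameter are assumptions made for this example so it can be exercised without network access.

```python
import socket


def distinct_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses a hostname resolves to.

    getaddrinfo yields (family, type, proto, canonname, sockaddr) tuples;
    the address string is the first element of sockaddr.
    """
    results = resolver(hostname, None)
    return sorted({sockaddr[0] for _, _, _, _, sockaddr in results})


# Example (requires DNS access, so it is left commented out):
# print(distinct_addresses("www.yahoo.com"))
```

Running this from two different sites, or at two different times, may return different address sets, in which case pings from those sites are not measuring the same path.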

It looks like there is a bottleneck when one goes from ESnet to some public Internet providers.

Possible Causes

A compromised userid was used on a number of hosts in the SSRL domain at SLAC to start a daemon that sent traffic to offsite locations as part of a DDoS attack. The SSRL network was overloaded, and a "meltdown" ensued. The SLAC border forwarded the traffic upstream, as expected, although at one point near noon the load was very high (around 900 Mbits/sec on a 1000 Mbits/sec link; given the random nature of traffic, some was probably dropped from time to time). One of ESnet's connections to commodity peers *was* overloaded, causing some traffic destined for the "Internet" at large to slow to a crawl or be dropped.
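The point about a link averaging roughly 900 Mbits/sec out of 1000 Mbits/sec can be made concrete with a little arithmetic. In the sketch below only the 900 and 1000 figures come from the report; the per-interval offered-load samples are hypothetical, chosen to average 900, and the function names are illustrative.

```python
def utilization(load_mbps, capacity_mbps):
    """Fraction of link capacity in use."""
    if capacity_mbps <= 0:
        raise ValueError("capacity must be positive")
    return load_mbps / capacity_mbps


def overloaded_intervals(offered_mbps, capacity_mbps):
    """Indices of intervals where offered load exceeds link capacity,
    i.e. where some traffic would have to be queued or dropped."""
    return [i for i, load in enumerate(offered_mbps) if load > capacity_mbps]


capacity = 1000                        # Mbits/sec, from the report
samples = [820, 900, 1050, 760, 970]   # hypothetical offered load, mean 900

print(utilization(900, capacity))               # 0.9
print(overloaded_intervals(samples, capacity))  # [2]
```

Even though the average sits at 90% of capacity, a single bursty interval exceeds the link rate, which is the sense in which some traffic was probably dropped from time to time.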

Interestingly, neither Google nor Yahoo uses the peering in question (or, to be precise, the addresses those names currently resolve to do not). Yahoo and Google both use multiple datacenters, with multiple target addresses, which may have caused some confusion as to what is what, and where is where. Neither do most of the European collaborator institutions, which go via GEANT and have a much bigger pipe than what we could throw at the link in any case.

And, of course, nothing that happened at SLAC should have any impact on traffic from Caltech to the Internet.

There was a transatlantic cable cut late last week. While the protected circuits failed over as designed, there are always some unprotected circuits on the fiber, so some of the traffic went to other links, which may mean there is less headroom across the pond than was previously available. And that often generates moments of challenge. (The fiber is supposed to be repaired by next week.)

References:

1: Pathneck