
Throughput from SLAC to Caltech reduced by a factor of 5

Les Cottrell. Page created: October 3, 2003.


Problem

We noticed that the iperf throughput from SLAC to Caltech dropped from between 300 and 400 Mbits/s to about 50 Mbits/s on August 27, 2003.

Host checks

On looking at how the Caltech host was configured, we found that the maximum TCP window size was 132kBytes. We contacted the admin for the Caltech host, who increased the window size to 32MBytes. The throughput was still limited to < 100Mbits/s, so the window size was not the limiting factor. We verified that the Caltech host had a GE (Gigabit Ethernet) interface, was using it, and that it was correctly configured.
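
The reason for checking the window size first can be made concrete: a single TCP stream can have at most one window of data in flight per round trip, so its throughput is bounded by window / RTT. The following is a minimal sketch of that arithmetic and was not part of the original investigation. The round-trip time ASSUMED_RTT_MS is a hypothetical placeholder (this page does not report the SLAC to Caltech RTT), and the function names are ours.

```python
def window_limited_mbps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on one TCP stream's throughput: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def window_needed_bytes(target_mbps: float, rtt_ms: float) -> float:
    """Window (bandwidth-delay product) needed to sustain target_mbps."""
    return target_mbps * 1e6 / 8 * (rtt_ms / 1000.0)


# ASSUMPTION: hypothetical round-trip time, not a value measured on this page.
ASSUMED_RTT_MS = 25.0

# The two window sizes mentioned in the text, taking 1 kByte = 1024 bytes.
for label, window in (("132 kBytes", 132 * 1024), ("32 MBytes", 32 * 1024 * 1024)):
    bound = window_limited_mbps(window, ASSUMED_RTT_MS)
    print(f"{label} window: at most {bound:.1f} Mbits/s at {ASSUMED_RTT_MS} ms RTT")

needed = window_needed_bytes(400, ASSUMED_RTT_MS)
print(f"window needed for 400 Mbits/s: {needed / 1024:.0f} kBytes")
```

Substituting a measured RTT (for example from ping) for the placeholder gives the bound for a real path. If the measured throughput stays well below that bound after the window is enlarged, as happened here, the bottleneck is elsewhere.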

Asymmetry in throughput

We ran iperf from plato.cacr.caltech.edu with a 4MB window and 1 stream of Reno to iphicles.slac.stanford.edu and could get more than a couple of hundred Mbits/s. A minute later, however, with the same setup we got < 100Mbits/s from SLAC to Caltech. We also tried Chicago to Caltech, with results similar to the SLAC to Caltech case. We concluded there was an asymmetry in the routes from SLAC to Caltech and Caltech to SLAC.

Routes

We compared the traceroutes from SLAC to Caltech and vice versa.

We then looked in more detail at the history of routes around the time the dramatic change in throughput occurred, i.e. between 5:55am and 6:55am on August 27, 2003. The routes from SLAC before and after the change (see Traceroute Analysis for 08/27/2003 or, for more detail, Traceroute Analysis for node1.cacr.caltech.edu on 08/27/2003) were:

Date       time     numhops Epoch time rte#   Route
08/27/2003 06:05:05      13 1061989505   3   (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.73),(198.32.249.154),(198.32.248.126),(198.32.248.13),(198.32.248.9),(192.41.208.50),(131.215.254.253),(131.215.5.147),(131.215.144.226),
08/27/2003 06:20:07      16 1061990407   31   (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.30),(137.164.22.28),(137.164.24.244),(130.152.181.17),(130.152.180.13),(192.12.19.252),(131.215.254.253),(131.215.5.147),(131.215.144.226),
Looking in more detail at the iperf throughput measurement histories from SLAC to Caltech, the measurement at 5:25am on August 27, 2003 gave about 450Mbits/s and the next measurement, at 6:55am, gave about 58Mbits/s. The throughput then held at this level for many days. There was a brief blip around Sep 17-18 when the throughput went back up to several hundred Mbits/s. After that it fell back to just under 100Mbits/s and was independent of time of day.
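
The comparison of the 06:05:05 and 06:20:07 routes listed above can be sketched in code. This is an illustrative sketch, not the tool that produced the traceroute analysis: it extracts the hop addresses from the two route strings, strips the hops they share at the start and end, and reports the middle segment that changed. The function names parse_hops and changed_segment are ours.

```python
import re

# Route strings copied from the 08/27/2003 table (before and after the change).
BEFORE = ("(134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),"
          "(198.32.249.73),(198.32.249.154),(198.32.248.126),(198.32.248.13),"
          "(198.32.248.9),(192.41.208.50),(131.215.254.253),(131.215.5.147),"
          "(131.215.144.226),")
AFTER = ("(134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),"
         "(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.30),"
         "(137.164.22.28),(137.164.24.244),(130.152.181.17),(130.152.180.13),"
         "(192.12.19.252),(131.215.254.253),(131.215.5.147),(131.215.144.226),")


def parse_hops(route):
    """Return the hop IPv4 addresses of a '(a.b.c.d),(e.f.g.h),' route string."""
    return re.findall(r"\((\d+\.\d+\.\d+\.\d+)\)", route)


def changed_segment(before, after):
    """Drop the common leading and trailing hops; return the differing middles."""
    shortest = min(len(before), len(after))
    prefix = 0
    while prefix < shortest and before[prefix] == after[prefix]:
        prefix += 1
    suffix = 0
    while suffix < shortest - prefix and before[-1 - suffix] == after[-1 - suffix]:
        suffix += 1
    return before[prefix:len(before) - suffix], after[prefix:len(after) - suffix]


old_hops, new_hops = changed_segment(parse_hops(BEFORE), parse_hops(AFTER))
print("hops only in the earlier route:", old_hops)
print("hops only in the later route:  ", new_hops)
```

The two routes share their first four hops and their last three (the 131.215.x.x hops), so the change is confined to the middle of the path. The same comparison can be applied to the 09/18/2003 routes.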

We followed this up by looking in more detail at the routes when the blip occurred. The blip in throughput was observed starting 9/18/03 at 14:25 and was last observed at 23:25 on the same day. During this time throughputs of 380-480Mbits/s were measured. The throughput measured at 14:25 was 30Mbits/s, and after the blip, at 00:55 on 9/19/03, it was about 85Mbits/s.

The route before was:
09/18/2003 14:05:05      14 1063919105   72   (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.30),(137.164.22.88),(137.164.24.246),(192.12.19.252),(131.215.254.253),(131.215.5.147),(131.215.144.226),

The route when performance was better (during the blip) was mainly:
09/18/2003 16:05:06      16 1063926306   68   (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.53),(137.164.22.28),(137.164.22.57),(198.32.248.2),(198.32.248.6),(192.41.208.50),(131.215.254.253),(131.215.5.147),(131.215.144.226),

And the route afterwards was:
09/18/2003 23:50:07      14 1063954207   81   (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.53),(137.164.22.28),(137.164.24.244),(137.164.27.246),(131.215.254.253),(131.215.5.147),(131.215.144.226),
We reported the problem to noc@calren2.net and support@4c.net at 12:14 on 10/3/2003.

Resolution

The NOC responded at 2:14pm that it had received the report and was working on it.

At 6:09pm Eric Sizelove of the NOC responded as follows:


Les, I believe we have located and fixed the current bottleneck.  Here 
is what was happening:

 - a CENIC router in Los Angeles (ASN 2152) is receiving Caltech's 
prefixes via a Los Nettos route server on a shared connector segment.

- the Los Nettos route server was preferring paths to Caltech that 
went through a next hop that was not reachable from the CENIC router 
(and was then advertising that next-hop to the CENIC router)

- because of the unreachable next-hop the CENIC router was re-writing 
the advertised next-hop to be the direct peering address of the Los 
Nettos route server

- the los nettos route server's unreachable-from-CENIC path traverses 
a 100 Mb/s ethernet.  This appears to have been the cause of the 
bottleneck and would have started when Caltech switched paths and went 
behind the Los-Nettos-to-CENIC-path last month.

This problem was corrected by Los Nettos implementing a manual re-write 
of the next-hop advertised to CENIC.  Please let us know if you see 
better performance now.
We remeasured the iperf throughput from SLAC to Caltech and found we could get about 300Mbits/s. Case closed.
Page owner: Jiri Navratil