Throughput from SLAC to Caltech reduced by a factor of 5

Les Cottrell. Page created: October 3, 2003.
We then looked in more detail at the history of routes around the time the dramatic change in throughput occurred, i.e. between 5:55am and 6:55am on August 27, 2003. The routes from SLAC before and after the change (see Traceroute Analysis for 08/27/2003, or for more detail Traceroute Analysis for node1.cacr.caltech.edu on 08/27/2003) were:
Date       time     numhops Epoch time rte# Route
08/27/2003 06:05:05 13      1061989505 3    (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.73),(198.32.249.154),(198.32.248.126),(198.32.248.13),(198.32.248.9),(192.41.208.50),(131.215.254.253),(131.215.5.147),(131.215.144.226),
08/27/2003 06:20:07 16      1061990407 31   (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.30),(137.164.22.28),(137.164.24.244),(130.152.181.17),(130.152.180.13),(192.12.19.252),(131.215.254.253),(131.215.5.147),(131.215.144.226),

Looking in more detail at the iperf throughput measurement histories from SLAC to Caltech, at 5:25am on August 27, 2003 we got about 450Mbits/s. The next measurement, at 6:55am, got about 58Mbits/s, and throughput then held at this level for many days. There was then a brief blip around Sep 17-18 where the throughput went back up to several hundred Mbits/s. After that it fell back to just under 100Mbits/s and was independent of time of day.
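One way to see where the two August 27 routes part ways is to compare their hop lists directly. The following is a minimal Python sketch, not the tool used for the original analysis; the function names `parse_route`, `first_divergence` and `common_tail` are hypothetical helpers introduced here, and the route strings are copied from the two log lines above.

```python
def parse_route(field):
    """Split a '(a),(b),...,' route field into a list of hop addresses."""
    return [hop.strip("()") for hop in field.split(",") if hop]


def first_divergence(a, b):
    """Return the 0-based index of the first hop at which two routes differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def common_tail(a, b):
    """Return the hops that both routes share at their far (destination) end."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


# Route fields copied from the 06:05:05 and 06:20:07 log lines.
before = parse_route(
    "(134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),"
    "(198.32.249.73),(198.32.249.154),(198.32.248.126),(198.32.248.13),"
    "(198.32.248.9),(192.41.208.50),(131.215.254.253),(131.215.5.147),"
    "(131.215.144.226),"
)
after = parse_route(
    "(134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),"
    "(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.30),"
    "(137.164.22.28),(137.164.24.244),(130.152.181.17),(130.152.180.13),"
    "(192.12.19.252),(131.215.254.253),(131.215.5.147),(131.215.144.226),"
)

split = first_divergence(before, after)
tail = common_tail(before, after)
print("hops:", len(before), "->", len(after))
print("first differing hop number:", split + 1, before[split], "vs", after[split])
print("shared final hops:", tail)
```

Run on these two lines, the sketch reports that the routes share their first four hops, differ from hop 5 onward (198.32.249.73 versus 198.32.249.2), and rejoin for the final three hops starting at 131.215.254.253.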
We followed this up by looking in more detail at the routes when the blip occurred. The blip in throughput was observed starting 9/18/03 at 14:25 and was last observed at 23:25 on the same day. During this time throughputs of 380-480Mbits/s were measured. The throughput measured at 14:25 was 30Mbits/s, and after the blip, at 00:55 on 9/19/03, it was about 85Mbits/s.
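The edges of the blip can be picked out mechanically by flagging consecutive measurements that differ by a large factor. The sketch below is illustrative only: `step_changes` and the threshold of 3 are assumptions introduced here, and the samples are the approximate levels quoted above (labelled by phase rather than by timestamp), not the full iperf history.

```python
def step_changes(samples, factor=3.0):
    """Return (label_before, label_after, ratio) for each pair of consecutive
    samples whose throughput differs by at least `factor` in either direction."""
    changes = []
    for (label0, mbps0), (label1, mbps1) in zip(samples, samples[1:]):
        ratio = max(mbps0, mbps1) / min(mbps0, mbps1)
        if ratio >= factor:
            changes.append((label0, label1, round(ratio, 1)))
    return changes


# Approximate throughput levels in Mbits/s, taken from the text above.
samples = [
    ("before blip", 30),
    ("blip, low end", 380),
    ("blip, high end", 480),
    ("after blip", 85),
]

changes = step_changes(samples)
for label0, label1, ratio in changes:
    print(f"{label0} -> {label1}: changed by a factor of about {ratio}")
```

With these four values only the two transitions into and out of the blip are flagged; the variation within the blip (380 to 480Mbits/s) stays below the threshold.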
The route before was:
09/18/2003 14:05:05 14 1063919105 72 (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.30),(137.164.22.88),(137.164.24.246),(192.12.19.252),(131.215.254.253),(131.215.5.147),(131.215.144.226),

The route when performance was better (during the blip) was mainly:
09/18/2003 16:05:06 16 1063926306 68 (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.53),(137.164.22.28),(137.164.22.57),(198.32.248.2),(198.32.248.6),(192.41.208.50),(131.215.254.253),(131.215.5.147),(131.215.144.226),

And the route afterwards was:
09/18/2003 23:50:07 14 1063954207 81 (134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),(198.32.249.2),(198.32.249.6),(137.164.22.84),(137.164.22.53),(137.164.22.28),(137.164.24.244),(137.164.27.246),(131.215.254.253),(131.215.5.147),(131.215.144.226),

We reported the problem to noc@calren2.net and support@4c.net at 12:14 on 10/3/2003.
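The three September 18 routes all end with the same three hops, so one quick comparison is which hop hands traffic to that final segment. The sketch below assumes that addresses beginning 131.215. belong to Caltech (the destination is 131.215.144.226); `parse_route` and `entry_hop` are hypothetical helpers introduced here, and the route strings are copied from the log lines above.

```python
def parse_route(field):
    """Split a '(a),(b),...,' route field into a list of hop addresses."""
    return [hop.strip("()") for hop in field.split(",") if hop]


def entry_hop(hops, prefix="131.215."):
    """Return the hop just before the first address starting with `prefix`.

    The default prefix is an assumption: it treats 131.215.x.x as Caltech."""
    for i, hop in enumerate(hops):
        if hop.startswith(prefix):
            return hops[i - 1] if i > 0 else None
    return None


# Hops shared by all three routes at the SLAC end.
HEAD = (
    "(134.79.243.1),(134.79.135.15),(192.68.191.83),(171.64.1.213),"
    "(198.32.249.2),(198.32.249.6),(137.164.22.84),"
)
# Hops shared by all three routes at the Caltech end.
TAIL = "(131.215.254.253),(131.215.5.147),(131.215.144.226),"

routes = {
    "before the blip (14:05:05)": HEAD
    + "(137.164.22.30),(137.164.22.88),(137.164.24.246),(192.12.19.252),"
    + TAIL,
    "during the blip (16:05:06)": HEAD
    + "(137.164.22.53),(137.164.22.28),(137.164.22.57),(198.32.248.2),"
    + "(198.32.248.6),(192.41.208.50),"
    + TAIL,
    "after the blip (23:50:07)": HEAD
    + "(137.164.22.53),(137.164.22.28),(137.164.24.244),(137.164.27.246),"
    + TAIL,
}

entries = {label: entry_hop(parse_route(field)) for label, field in routes.items()}
for label, hop in entries.items():
    print(f"{label}: enters via {hop}")
```

On these routes the entry hop is 192.12.19.252 before the blip, 192.41.208.50 during it, and 137.164.27.246 afterwards. The same two addresses appear in the August 27 routes: 192.41.208.50 in the 06:05:05 route before the drop and 192.12.19.252 in the 06:20:07 route after it. This grouping is only suggestive; it does not by itself identify which link was the bottleneck.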
At 6:09pm Eric Sizelove of the NOC responded as follows:
Les,

I believe we have located and fixed the current bottleneck. Here is what was happening:

- a CENIC router in Los Angeles (ASN 2152) is receiving Caltech's prefixes via a Los Nettos route server on a shared connector segment.
- the Los Nettos route server was preferring paths to Caltech that went through a next hop that was not reachable from the CENIC router (and was then advertising that next-hop to the CENIC router)
- because of the unreachable next-hop the CENIC router was re-writing the advertised next-hop to be the direct peering address of the Los Nettos route server
- the los nettos route server's unreachable-from-CENIC path traverses a 100 Mb/s ethernet.

This appears to have been the cause of the bottleneck and would have started when Caltech switched paths and went behind the Los-Nettos-to-CENIC-path last month.

This problem was corrected by Los Nettos implementing a manual re-write of the next-hop advertised to CENIC.

Please let us know if you see better performance now.

We remeasured the iperf throughput from SLAC to Caltech and found we could get about 300Mbits/s. Case closed.