Problems with Stanford Connectivity, September 2005

Les Cottrell. Page created: September 24, 2005

Problem

The Internet End-to-end Performance Monitoring (IEPM) automated performance drop detection sent an email alert to the SLAC developers on Saturday 10th September at 2:00pm.

The alert indicated a roughly 50% drop (from about 600 Mbits/s to about 300 Mbits/s) in the achievable throughput measured by the thrulay probe from SLAC to Caltech at 11:43am on 9/10/2005. The email also included the routes before and after the change and time series measurements from the various probes.
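The size of such a step can be estimated with a simple before/after comparison of the throughput time series. The following is a minimal sketch only, not the IEPM detection algorithm: the function name `detect_drop`, the window size, and the 33% threshold are illustrative assumptions, and the sample values are made up to resemble the 600 to 300 Mbits/s step described above.

```python
from statistics import median

def detect_drop(samples, window=5, threshold=0.33):
    """Return (index, fractional_drop) for the largest step down, or None.

    samples: throughput measurements in Mbits/s, oldest first.
    For each candidate index, the median of the `window` samples before it
    is compared with the median of the `window` samples starting at it.
    Window size and threshold are assumptions chosen for illustration.
    """
    best = None
    for i in range(window, len(samples) - window + 1):
        before = median(samples[i - window:i])
        after = median(samples[i:i + window])
        if before <= 0:
            continue
        drop = (before - after) / before
        # Keep the candidate with the largest drop above the threshold.
        if drop >= threshold and (best is None or drop > best[1]):
            best = (i, drop)
    return best

# Hypothetical throughput samples (Mbits/s); not real IEPM measurements.
throughput = [610, 595, 600, 605, 598, 602, 310, 295, 300, 305, 298]
result = detect_drop(throughput)
if result is not None:
    index, drop = result
    print(f"step down at sample {index}: {drop:.0%} drop")
```

Medians are used rather than means so that a single outlying measurement on either side of the step does not trigger or mask a detection.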

Analysis

This looked like a classic route change from CENIC/Internet2 to ESnet. From the IEPM traceroutes (taken at 10 minute intervals), the change appeared to start between 00:17:00 and 00:26:47 on 9/10/05 PDT, with the route being restored between 21:46:36 and 21:56:45. Looking at the traceroute tables, it appears to have affected Caltech, Indiana, UFl, SDSC, and UT Dallas. We did not see end-to-end performance events on these links since we were only measuring with pathchirp and not thrulay or iperf.
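Because the traceroutes are periodic, a route change can only be bracketed between the last measurement on the old path and the first on the new one. The sketch below illustrates that bracketing using the four traceroute times quoted above. It is an assumption-laden illustration, not the IEPM code: the function name `bracket_route_changes` is hypothetical, and the hop labels are simplified placeholders rather than real router names.

```python
from datetime import datetime

def bracket_route_changes(snapshots):
    """Find consecutive traceroute snapshots whose hop lists differ.

    snapshots: list of (timestamp, hops) tuples sorted by time, where hops
    is a tuple of hop labels. Returns a list of
    (last_seen_old, first_seen_new, old_hops, new_hops) tuples.
    """
    changes = []
    for (t_prev, hops_prev), (t_cur, hops_cur) in zip(snapshots, snapshots[1:]):
        if hops_prev != hops_cur:
            changes.append((t_prev, t_cur, hops_prev, hops_cur))
    return changes

# Placeholder hop labels (assumed, heavily simplified paths).
CENIC_PATH = ("slac", "stanford", "cenic", "abilene", "caltech")
ESNET_PATH = ("slac", "esnet", "caltech")

# Timestamps are the four traceroute times quoted in the text (PDT).
snapshots = [
    (datetime(2005, 9, 10, 0, 17, 0), CENIC_PATH),
    (datetime(2005, 9, 10, 0, 26, 47), ESNET_PATH),
    (datetime(2005, 9, 10, 21, 46, 36), ESNET_PATH),
    (datetime(2005, 9, 10, 21, 56, 45), CENIC_PATH),
]

changes = bracket_route_changes(snapshots)
for last_old, first_new, old_hops, new_hops in changes:
    print(f"route changed between {last_old:%H:%M:%S} and {first_new:%H:%M:%S}: "
          f"{' > '.join(old_hops)}  ->  {' > '.join(new_hops)}")
```

The width of each bracket is set by the measurement interval, so the actual change time cannot be resolved more finely than about 10 minutes from these data alone.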

Looking at the topology map from the above nodes for this day, the routes can be seen to have switched from CENIC (green)/Abilene (blue) to ESnet (red). Looking at http://calendar.es.net/cgi-bin/pmcalendar.pl, we could not see any scheduled maintenance between these times. Drilling down at http://cricket.cenic.org/grapher.cgi to utilization > CENIC backbone hpr-routers > hpr-svl-1ge-summary (multiple targets) Octets > weekly, one could see that the Stanford router lost all its traffic, while another Sunnyvale router picked up a lot of extra traffic.

The thrulay time series showed the step down (and later step up) in performance. As expected, we saw similar effects in the iperf probe time series, though the effect on the multi-stream iperf was smaller than on the single stream (presumably because the multi-stream iperf is less friendly and pushes aside other traffic on the more congested backup link). The effect was not noticeable in the pathchirp probe time series. Looking at the ABwE and ping RTT results, there was no effect visible on the ABwE dynamic bandwidth capacity. There was a slight effect visible on the available bandwidth, though we did not automatically detect it. On the other hand, the minimum RTT showed a dramatic effect. This was also visible to the other affected sites such as U Florida.

Cause

CENIC had an unplanned outage of the Stanford 15540 due to a fan failure(*). This, of course, affected SLAC's paths to CENIC (and I2) via Stanford (SLAC has no direct visibility/connection into CENIC and I2), and the effect was (mostly) limited to Stanford (and Stanford departments such as SLAC). The fan problem is not really resolved: external fans are currently being used to "adequately"(?) cool the 15540, but the new fan tray should arrive on Monday, when another short outage *might* occur as it is replaced. See CENIC's ticket #24540 for more information, as well as some email from CENIC operations. We were also able to find information about the outage from the InterMapper Web Server, which provides information on recent outages.

Later we found the official Cisco Field Notice about the generic fan problems.


Page owner: Les Cottrell, Gary Buhrmaster