Problems with link to/from BNL

Les Cottrell. Page created: August 16, 2005

Problem

ESnet reported at 7:26pm, Monday August 15th, 2005:
Qwest advises that the local carrier has confirmed there is a fiber cut 
which caused the OC48 outage to Brookhaven National Laboratory.
The cut has not yet been pinpointed and there is no estimated uptime yet.  
Connectivity to BNL was lost at 15:44PT.

Steve Lowe
ESnet - The Energy Science Network
========================================================================
Details and tracking of the problem can be found by doing a
finger 13716@ticket.es.net
Soon after that (7:45pm) we received an anomalous network event email from the BNL IEPM-BW monitoring host at iepmbw.bnl.gov. This email identified events seen by the Plateau Algorithm in: iperf data to Caltech, CERN and SLAC; thrulay data to ANL, Daresbury (near Liverpool, UK), Indiana, and the University of Florida; and pathchirp data to SDSC, SLAC, and the University of Florida. Further alerts were sent from iepmbw.bnl.gov at 9:45pm, 10:45pm, 11:45pm, 12:45am, 1:45am, and 2:23am the next day. Alerts were also sent from the IEPM-BW monitoring host at CERN (pcgiga.cern.ch), identifying events in the iperf measurements to BNL.

Observations

An example of the Plateau Algorithm's identification of the event, in the iperf data from BNL to SLAC, shows the start and end of the event. It also identifies the route changes that occurred.
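
This page does not describe how the Plateau Algorithm works internally. As a rough illustration of the general idea behind plateau-style change detection (a history buffer that holds the baseline, plus a trigger buffer that must fill with consecutive low readings before an event is declared), here is a simplified sketch. It is not the IEPM-BW implementation: the function name, parameters, thresholds, and sample values are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(samples, history_len=20, trigger_len=5, beta=2.0):
    """Return the indices at which a sustained drop begins.

    Simplified illustration only. All names and defaults are hypothetical
    and are not taken from the IEPM-BW code.
    """
    history = deque(maxlen=history_len)  # recent "normal" measurements
    trigger = []                         # consecutive low readings awaiting confirmation
    events = []
    for i, x in enumerate(samples):
        if len(history) < history_len:
            # Still building (or rebuilding) the baseline.
            history.append(x)
            continue
        threshold = mean(history) - beta * pstdev(history)
        if x < threshold:
            trigger.append((i, x))
            if len(trigger) == trigger_len:
                # Enough consecutive low readings: report the first one
                # as the start of the event.
                events.append(trigger[0][0])
                # Adopt the new, lower level as the start of a fresh baseline.
                history.clear()
                history.extend(v for _, v in trigger)
                trigger.clear()
        else:
            # An isolated dip is discarded rather than reported.
            trigger.clear()
            history.append(x)
    return events


# Made-up throughput series (Mbit/s): a steady level, then a sustained drop.
series = [400.0] * 30 + [40.0] * 30
print(detect_drops(series))
```

Requiring several consecutive low readings is what keeps a single bad measurement from raising an alert, at the cost of a delay between the onset of a problem and its report.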

If one looks in detail at the pathchirp data, it is clear that the effect of the event started between 18:41 and 18:57 on 8/15/05 and ended between 3:42am and 3:57am on 8/16/05. Looking at several metrics together (RTT, pathchirp, iperf, multi-stream iperf (miperf) and thrulay), it can also be seen that they all change at the time of the event.
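
The ESnet report gives the loss of connectivity in Pacific time, while the pathchirp onset window appears to be in BNL local (Eastern) time. The following sketch converts between the two and checks that they agree. It assumes daylight saving time was in effect on both coasts (true for mid-August 2005) and that the pathchirp timestamps are Eastern time, which this page does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for August 2005, when daylight saving time was in effect.
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")

# "Connectivity to BNL was lost at 15:44PT" (from the ESnet report).
lost_pt = datetime(2005, 8, 15, 15, 44, tzinfo=PDT)
lost_et = lost_pt.astimezone(EDT)

# Onset window seen in the pathchirp data (assumed to be Eastern time).
window_start = datetime(2005, 8, 15, 18, 41, tzinfo=EDT)
window_end = datetime(2005, 8, 15, 18, 57, tzinfo=EDT)

print(lost_et.strftime("%Y-%m-%d %H:%M %Z"))   # 2005-08-15 18:44 EDT
print(window_start <= lost_et <= window_end)   # True
```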

The traceroute visualization also clearly shows the onset of instability after 18:00 hours.

The PingER loss data from PingER monitoring sites to BNL also shows increased packet loss from many monitoring sites in Japan, the UK, Germany, Hungary, Canada, and the US (both Internet2 and ESnet), typically rising from no loss to 1 or 2%. At the same time, it appears that two Italian sites lost connectivity to BNL for at least 30 minutes.

Cause

BNL's primary ESnet fiber connection (OC-48) to NYC went down on 8/15/05 at around 6:45pm EDT. From then until the primary link was restored, BNL's only connection to the Internet was its secondary backup connection (T3) through NYSERNet. The primary link was restored at around 3:44am EDT on 8/16/05.
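
To give a sense of scale for the fall-back from the OC-48 primary to the T3 backup, the sketch below compares the nominal line rates of the two circuit types. The rates are the standard SONET and DS3 figures, not values taken from this page, and the usable payload on each link is somewhat lower than the line rate.

```python
# Nominal line rates in Mbit/s (standard figures, not from this page).
OC48_MBPS = 2488.32   # SONET OC-48 = 48 x 51.84 Mbit/s
T3_MBPS = 44.736      # T3 / DS3

ratio = OC48_MBPS / T3_MBPS
print(f"The backup link has roughly 1/{ratio:.0f} of the primary's capacity")
```

A reduction of this size is consistent with the throughput metrics (iperf, thrulay, pathchirp) all shifting at the time of the event.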
Page owner: Les Cottrell