Impact of network measurements on Internet traffic

Les Cottrell. Page created: August 18, 2004.


Problem

Joe Metzger of ESnet reported the following:

-----Original Message-----
From: Joe Metzger [mailto:metzger@es.net]
Sent: Wednesday, June 30, 2004 2:01 PM
To: Cottrell, Les
Cc: routing@es.net
Subject: Side effects of IEPM monitoring.

Les,
ESnet and Abilene have recently started working on a joint effort to monitor the performance between ESnet sites and Abilene participants. As part of this effort, I am using OWAMP to measure latency between a box at FERMI and LBL, SDSC, NCSU & OSU.

Some general details about the measurements and graphs of the data can be found at http://measurement.es.net.

We noticed that our OWAMP traffic was experiencing significant delays in a fairly regular pattern and that many of these episodes of high delay corresponded with bursts of packet discards on the FNAL - CHI OC12 interfaces.

After quite a bit of investigation and analysis of the flows that occurred during periods where packet discard events happened, we hypothesize that IEPM scheduled tests run on dmzmon0.deemz.net were saturating the OC12 between FNAL and Chicago and causing packet loss and large queuing delays for other traffic.

I think we have pretty much confirmed this hypothesis with a test that we started yesterday. We changed the inbound and outbound routing for dmzmon0.deemz.net to bypass the OC12 and go through a new FNAL/Starlight connection yesterday around 11:15 AM PDT. Since that change, the OWAMP data, which is still running on the old path, is not showing the regular high latency spikes, and the interface discard counters have stopped their regular spikes.

Graph showing typical spikes before the routing change:
http://measurement.es.net/cgi-bin/mrtg_graph_select.cgi-jcm?width=&height=&expire=300&start=Jun+28+0%3A00&end=Jun+28+12%3A00&graph=1min&upperLimit=&lowerLimit=&html=%2Flatency%2Ffnal-all-latency.html

Graph showing no spikes after the routing change:
http://measurement.es.net/cgi-bin/mrtg_graph_select.cgi-jcm?width=&height=&expire=300&start=Jun+29+12%3A00&end=Jun+29+23%3A00&graph=1min&upperLimit=&lowerLimit=&html=%2Flatency%2Ffnal-all-latency.html

So my question is: have you done any sort of analysis of the effects that IEPM bandwidth tests are having on other traffic?

RTT elongation and packet discards

Thanks for the update. It is very interesting. We looked at elongation of RTT with iperf TCP throughput a year or so ago (see for example slide 14 of http://www.slac.stanford.edu/grp/scs/net/talk/pfdl-feb03.ppt). For example, for CERN we saw that the RTT, instead of being a sharp spike at about 165 ms, had a long flat tail out to 400 ms, and the average increased by over 100 ms when the pings were run at the same time as an iperf transfer.
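As a rough illustration of how such a comparison can be made, here is a minimal sketch in Python (the file names, input format, and sample data are hypothetical, not the actual IEPM data) that summarizes ping RTT samples taken with and without a concurrent iperf transfer:

    # Sketch: compare ping RTT distributions measured while no iperf
    # transfer is running vs. during a transfer. Input files are
    # hypothetical: one RTT sample in ms per line.
    import statistics

    def load_rtts(path):
        with open(path) as f:
            return [float(line) for line in f if line.strip()]

    quiet  = load_rtts("rtt_quiet.txt")       # pings with no competing iperf
    loaded = load_rtts("rtt_with_iperf.txt")  # pings during an iperf transfer

    for label, rtts in (("quiet", quiet), ("with iperf", loaded)):
        p95 = sorted(rtts)[int(0.95 * len(rtts))]
        print(f"{label}: mean={statistics.mean(rtts):.1f} ms  "
              f"median={statistics.median(rtts):.1f} ms  "
              f"95th percentile={p95:.1f} ms  max={max(rtts):.1f} ms")

For the CERN case above, the "with iperf" distribution would show the long tail: a mean shifted up by roughly 100 ms and a maximum well above the quiet median.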

Looking at our router stats (out discards) for the ESnet interface, we do see occurrences of a couple of hundred out-discards (to ESnet) in a 5-minute interval and then many intervals with no out-discards. However, these do not correlate with the 5-minute utilizations, which are fairly smooth between 150 and 350 Mbits/s all day long.
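A minimal sketch (Python 3.10+, with a hypothetical input file of one "timestamp discards mbps" record per 5-minute interval) of the kind of correlation check described above:

    # Sketch: correlate 5-minute out-discard counts with 5-minute
    # utilization (Mbits/s) for the same router interface.
    import statistics

    discards, mbps = [], []
    with open("esnet_interface_5min.txt") as f:   # hypothetical export
        for line in f:
            _, d, u = line.split()
            discards.append(float(d))
            mbps.append(float(u))

    # Pearson correlation; a value near zero matches the observation that
    # the discard bursts do not track the (fairly smooth) utilization.
    r = statistics.correlation(discards, mbps)
    print(f"correlation(out-discards, utilization) = {r:.2f}")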

Our interface to Stanford/CENIC Abilene is more lightly used in general, and we do see lots of utilization spikes on it (median in ~5 Mbits/s, out ~20 Mbits/s, with spikes of 40-80 Mbits/s). I would expect these spikes could well be caused by the monitoring. However, we see no in or out discards. Possibly the link is used lightly enough that we never congest it with our iperf traffic.

Correlations of discards and traffic

I tried to correlate the output from our IEPM tests with the drop reports from the router. We also looked at the switches attached to the two hosts making the tests, but could see no discards on their ports. A plot superimposing the IEPM throughputs and the DMZ switch/router reports shows no out-discards spike at 7/01/04, and IEPM spikes with no out-discards; this would argue against a correlation. Unfortunately the timestamps are hard to reconcile: the router timestamp marks the end of the 5-minute interval when the data is read out, whereas the IEPM timestamps mark when the set of tests started for a given host.
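To make the two time bases comparable, one has to map each IEPM test start into the router's 5-minute readout bins. A minimal sketch of that alignment (Python; the epoch values are made-up examples):

    # Sketch: the router timestamp marks the END of each 5-minute window,
    # while an IEPM timestamp marks the START of a set of tests, so a test
    # starting at epoch time t falls into the bin whose router timestamp is
    # the next multiple of 300 s after t (a long test may spill into the
    # following bin as well).
    BIN = 300  # seconds

    def router_bin(test_start_epoch):
        return ((int(test_start_epoch) // BIN) + 1) * BIN

    iepm_starts = [1088668800, 1088669105, 1088669410]  # hypothetical epochs
    tests_per_bin = {}
    for t in iepm_starts:
        b = router_bin(t)
        tests_per_bin[b] = tests_per_bin.get(b, 0) + 1
    print(tests_per_bin)  # tests per router readout bin, for overlaying on discards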

Next we switched off our iperf/bbftp testing for about 9 hours on 7/2/04, and the bursts of discards were significantly reduced. This was reproducible.

We have a very nice tool from Network Physics that enables me to quickly look at the retransmits for all SLAC traffic and for groups of hosts (e.g. all our test machines, all the BaBar data transfer hosts, or our main high-performance collaborators), and to drill down. At first glance it is apparent that the bursts of discards correlate with bursts of retransmits for our network testing hosts. However, it appears the retransmit bursts only affect the network testing itself. There are other retransmits for all of SLAC and for the various groups, but they are not correlated with the network testing, being smoother.

The router discard rates for the 5-minute periods in which there are bursts due to network monitoring are about 0.05% (i.e. out-discards/out-pkts).
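For concreteness, the quoted rate is simply the ratio of the interface counters over the interval; a one-line sketch with hypothetical counter deltas:

    # Sketch: discard rate for one 5-minute sample, from the change in the
    # SNMP ifOutDiscards and ifOutUcastPkts counters over the interval
    # (the values below are hypothetical, chosen to give 0.05%).
    out_discards = 250        # counter delta over 5 minutes
    out_pkts = 500_000        # counter delta over 5 minutes
    print(f"discard rate = {100.0 * out_discards / out_pkts:.2f}%")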

I also looked at 1-minute data (as opposed to the 5-minute SNMP router discard counters) using the Network Physics monitor, which easily enables me to see the number of retransmits by host or by group of hosts. I grouped the hosts into "SLAC all", "IEPM test network hosts" (i.e. the hosts that generate the network test traffic with iperf etc.), the SLAC farm network hosts (where all the data is kept), and so on, and then looked at the retransmits per group. At that level the spikes in retransmits from the SLAC test hosts do not seem to impact the overall retransmits or the goodput, unless there are very few retransmits so that they nearly all come from the test hosts.
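A minimal sketch of this kind of grouping (Python; the host names, group membership, and the input format of one "timestamp host retransmits" record per minute are hypothetical):

    # Sketch: aggregate per-host retransmit counts into host groups such as
    # the IEPM test hosts, the farm hosts, and all of SLAC.
    from collections import defaultdict

    GROUPS = {
        "iepm_test": {"testhost1.slac.stanford.edu", "testhost2.slac.stanford.edu"},
        "farm":      {"farm001.slac.stanford.edu", "farm002.slac.stanford.edu"},
    }

    per_group = defaultdict(lambda: defaultdict(int))  # group -> minute -> count
    with open("retransmits_1min.txt") as f:            # hypothetical export
        for line in f:
            ts, host, n = line.split()
            per_group["slac_all"][ts] += int(n)
            for group, members in GROUPS.items():
                if host in members:
                    per_group[group][ts] += int(n)

    # Plotting per_group["iepm_test"] against per_group["slac_all"] then shows
    # whether the test-host retransmit spikes dominate the site-wide totals.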

Effect of QBSS

QBone Scavenger Service (or simply "scavenger service") is a network mechanism that lets users and applications take advantage of otherwise unused network capacity in a manner that does not substantially affect the performance of the default best-effort class of service. Both Abilene and ESnet support QBSS.

We ran 10-15 second iperf multistream throughput tests to multiple sites with QBSS turned on, each QBSS test being followed sequentially by an identical test without QBSS turned on. This was repeated at 90-minute intervals for about a month, from July 25 through August 17, 2004. It was apparent that QBSS had little effect on most routes to U.S. hosts; however, for most trans-oceanic routes the effect was marked (see for example CESnet iperf1 (with QBSS) and iperf (without QBSS) traffic). We analyzed the throughputs from the two iperf measurements (with and without QBSS) to extract the averages, standard deviations, and the asymmetries ((iperf-iperf1)/(iperf+iperf1)). We also compared the iperf measurements with those from a packet pair dispersion technique (ABwE).

The results confirm that QBSS has a marked effect on most trans-oceanic routes and on TRIUMF (in Vancouver, Canada). The one exception is the INFN Milan host: it has a high-speed link via ESnet/GEANT/INFN, but only 34 Mbits/s on the last hop. This is probably the most common bottleneck, and there is probably no QBSS on this link, so there is no effect. SLAC has two routes to the Internet, one via ESnet and the second via CENIC/Abilene/Internet2. Most of the trans-oceanic routes go via ESnet (then to GEANT for Europe and to SInet for Japan), except for Italy/INFN, which is routed via CENIC/Abilene/GEANT.
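A minimal sketch of the paired analysis (Python; the input format of one "host iperf_mbps iperf1_mbps" record per paired test is hypothetical, and each host is assumed to have at least two pairs):

    # Sketch: per-host averages, standard deviations, and the asymmetry
    # (iperf - iperf1) / (iperf + iperf1) for paired tests without (iperf)
    # and with (iperf1) QBSS.
    import statistics
    from collections import defaultdict

    pairs = defaultdict(list)                 # host -> list of (iperf, iperf1)
    with open("qbss_pairs.txt") as f:         # hypothetical export
        for line in f:
            host, a, b = line.split()
            pairs[host].append((float(a), float(b)))

    for host, vals in sorted(pairs.items()):
        iperf  = [a for a, _ in vals]
        iperf1 = [b for _, b in vals]
        asym   = [(a - b) / (a + b) for a, b in vals if a + b > 0]
        print(f"{host}: iperf {statistics.mean(iperf):.1f}+-{statistics.stdev(iperf):.1f} Mbits/s, "
              f"iperf1 {statistics.mean(iperf1):.1f}+-{statistics.stdev(iperf1):.1f} Mbits/s, "
              f"asymmetry {statistics.mean(asym):.2f}")

An asymmetry near zero indicates QBSS has little effect (as seen for most U.S. routes), while a value approaching one indicates the QBSS traffic is being heavily squeezed by competing best-effort traffic.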

The ABwE measurements are exponentially weighted moving averages with a damping factor (a) of 0.8. The measurements are taken once a minute and the damping is of the form new_prediction = a * old_prediction  + (1 - a) * new_measurement. We had thought that maybe the QBSS results would be close to the available bandwidth estimates from ABwE since QBSS basically soaks up all unused bandwidth (i.e. available bandwidth) while not affecting other cross-traffic. However, there was little correlation between the iperf1 and the ABwE measurements.
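The smoothing itself is straightforward; a minimal sketch of the formula in Python (the sample numbers are made up):

    # Sketch: exponentially weighted moving average as used by ABwE, with
    # damping factor a = 0.8:
    #   new_prediction = a * old_prediction + (1 - a) * new_measurement
    def ewma(measurements, a=0.8):
        prediction = measurements[0]
        smoothed = [prediction]
        for m in measurements[1:]:
            prediction = a * prediction + (1 - a) * m
            smoothed.append(prediction)
        return smoothed

    # A brief one-minute dip is heavily damped in the prediction:
    print(ewma([100, 100, 40, 100, 100]))  # approx. [100, 100.0, 88.0, 90.4, 92.32]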

 


Page owner: Les Cottrell