High statistics ping results

Created: May 14, 1999; last updated by: Les Cottrell on February 27, 2000


Page Contents

  • Introduction
  • Pings between hosts at the same site
  • Pings between 2 ESnet sites
  • Pings between International sites

    Introduction

    To better understand how to interpret PingER results we decided to make a series of one-off, high statistics ping measurements with shorter time frames than the normal PingER measurements, on both LAN and various WAN paths. The idea is to look at the frequency distributions and the time variations for various types of networks in the LAN and WAN environment, and to correlate the results with the topology, routes and known performance issues. Our goal is also to compare these results with results from other high statistics delay measurements.

    Unless otherwise noted, the pings were sent at one second intervals with a timeout of 20 seconds and a payload (including the 8 ICMP protocol bytes) of 100 bytes.

    Pings between hosts at the same site

    To understand the behavior of ping at a single site we used the NIKHEF ping client (since it has a resolution of 1 usec.) running under Red Hat 5.2 Linux on a 400 MHz Intel Pentium host (doris) at SLAC, pinging other hosts at SLAC separated from doris by various network devices with interface speeds from 10 Mbps to 1 Gbps. Doris, the ping client, is connected to the network via a shared 10 Mbps hub that is connected to a 10 Mbps edge switch port. The table below shows the hosts pinged (i.e. acting as ping servers) together with their hardware and software configurations and the connection between doris and each server. The edge switches are Cisco Catalyst 5000s, the core switches are Cisco Catalyst 6500s, the farm and server switches are Cisco Catalyst 5500s, and the core routers are Cisco Catalyst 8500s.
    Server name | Server hardware | Server OS | Server interface speed | Network connection devices & speeds
    mercury | Sun Ultra 5 | Solaris 5.6 | 10 Mbps HDX shared | Same shared 10 Mbps hub
    charon | Sun Ultra 1 | Solaris 5.6 | 10 Mbps HDX shared | 10 Mbps to edge switch (cgb3), 10 Mbps to doris
    bronco001 | Sun Ultra 5 | Solaris 5.6 | 100 Mbps FDX switched | 100 Mbps to farm switch, 1 Gbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris
    mailbox | Sun Ultra 5 | Solaris 5.6 | 100 Mbps FDX switched | 100 Mbps to server switch, 1 Gbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris
    grouse | Sun Sparc 1+ | SunOS 4.1.3.1 | 10 Mbps HDX shared | 10 Mbps to edge switch, 100 Mbps to core switch, 1 Gbps to core router, 1 Gbps to core switch, 100 Mbps to edge switch, 10 Mbps to doris

    A simple model to understand the median or minimum ping response times for an unloaded local area network and lightly loaded hosts is the following: ignore the hubs (a hub inserts about 1 bit-time of delay) and the cable lengths (for a site with cable runs of < 10,000 feet this should introduce an error of < 20 usec.); assume the latency of each switch and router is about 15 usec. (this comes from Cisco specification sheets); calculate the time to clock the 100 byte ping packet into each device at the interface speed; measure the ping client host time by comparing the time reported by the ping client with the wire time immediately in front of the host (~210 usec.); and measure the server host time to echo the ping from the wire times (going in and coming out) immediately in front of the server (~100 usec. for the Ultra 5 (330 MHz) hosts, ~125 usec. for a Sun Ultra 1 (167 MHz), ~550 usec. for the Sun Sparc 1+ (25 MHz), and ~170 usec. for a Sun SparcStation 5 (110 MHz)). Putting this all together, for the hosts in the table the agreement between the measured and predicted ping RTTs is within 60 usec.
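
    As an illustration, here is a minimal Python sketch of the model just described, worked through for the doris to bronco001 path in the table above. The per-device latency, host times and link speeds are the rough figures quoted in the text; the breakdown into hops is for illustration only, not a reproduction of the original calculation.

    # Sketch of the simple LAN RTT model described above (illustrative only).
    # All times are in microseconds.
    PACKET_BITS = 100 * 8        # 100 byte ping packet
    DEVICE_LATENCY_US = 15.0     # per switch or router, from Cisco spec sheets
    CLIENT_HOST_US = 210.0       # measured ping client host time (doris)
    SERVER_HOST_US = 100.0       # measured echo time for a Sun Ultra 5 (bronco001)

    # From the table: bronco001 -100-> farm switch -1G-> core switch -1G-> core
    # router -1G-> core switch -100-> edge switch -10-> doris (hub ignored).
    link_speeds_mbps = [100, 1000, 1000, 1000, 100, 10]
    n_devices = 5                # farm switch, 2 core switches, core router, edge switch

    # bits / (Mbit/s) conveniently gives microseconds.
    serialization_us = sum(PACKET_BITS / s for s in link_speeds_mbps)
    one_way_us = serialization_us + n_devices * DEVICE_LATENCY_US
    predicted_rtt_us = 2 * one_way_us + CLIENT_HOST_US + SERVER_HOST_US

    print("predicted RTT: %.0f usec." % predicted_rtt_us)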

    In the subsections below we show some examples of the ping RTT history and the frequency distributions. We do not attempt to explain the frequency distributions in any detail, but simply note that in all cases there is a large peak near the low end of the measured RTTs followed by a long tail with some observable structure. The artificial regularity of every n-th bin having a higher or lower frequency above log10(RTT) = 1 is a binning effect of the logarithmic bin sizes interacting with the measurement granularity of the NIKHEF ping (1 usec. for RTT < 1 msec., 10 usec. for 1 msec. <= RTT < 10 msec., 100 usec. for 10 msec. <= RTT < 100 msec., and 1 msec. for RTT >= 100 msec.); it does not show up when using equally spaced linear RTT bins. The double peak in the frequency distribution for the two hosts on the same subnet is also a binning effect and does not show up when using linear bin widths. On the other hand, by measuring the wire-time difference between packets entering and leaving the server, the double peak seen in the low RTT "peak" of the distribution for the two hosts on the same shared hub is found to be caused by the ping server. For another example of a pathological RTT distribution caused by a ping server, see Pinger Measurement Pathologies.
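
    The binning artifact described above can be reproduced with a few lines of Python. This is a sketch with synthetic data, not the measured RTTs: smooth RTTs are quantized to 100 usec. (the NIKHEF ping granularity for 10 msec. <= RTT < 100 msec.) and then histogrammed with logarithmic and with linear bins.

    import numpy as np

    rng = np.random.default_rng(0)
    rtt_ms = rng.gamma(shape=25.0, scale=0.6, size=200_000)  # smooth synthetic RTTs around 15 msec.
    rtt_ms = np.round(rtt_ms, 1)                             # quantize to 0.1 msec. (100 usec.)

    log_edges = np.logspace(np.log10(10.0), np.log10(100.0), 200)  # logarithmic bins
    lin_edges = np.arange(10.0, 101.0, 1.0)                        # equally spaced 1 msec. bins
    counts_log, _ = np.histogram(rtt_ms, bins=log_edges)
    counts_lin, _ = np.histogram(rtt_ms, bins=lin_edges)

    # Around 15 msec. the narrow logarithmic bins alternately contain 1 or 2 of the
    # 0.1 msec. quantization levels, so neighbouring bins jump up and down; the
    # equally spaced linear bins show no such regularity.
    print("log bins:   ", counts_log[30:50])
    print("linear bins:", counts_lin[3:10])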

    Pings between 2 hosts on the same shared 10 Mbps hub

    The following plots are for pings from the Linux host (doris) to a Sun Ultra 5 (mercury) running Solaris 5.6. The first plot shows the ping RTT in msec. for about 260,000 100 byte pings started on May 30, 1999.

    The second plot shows the frequency histogram of the ping RTTs with log scales. From measurements of the "wire time" using NetXray running on a separate Windows NT host on the same hub as the ping server/responder host (mercury), we verified that the two peaks around log10(RTT) = 0.5 are a function of the ping server/responder host itself.

    Pings between 2 hosts on the same subnet but different ports on the same switch

    The following plots are for pings from the Linux host (doris) to a Sun Ultra 1 (charon) running Solaris 5.6. The hosts are on the same Cisco Catalyst 5000 switch but on different 10 Mbps shared ports. The pings were started on May 30, 1999 at 12:38:20 PST and the packet loss was about 0.08%. The first plot shows the behavior of about 260,000 ping RTTs as a function of time.

    The second plot shows the frequency distribution of the pings.

    Pings between 2 hosts at the same site but on different subnets

    The following plots are for 500,000 ping RTTs between the Red Hat 5.2 Linux host (doris) and a Sun Sparc 1+ (grouse) running SunOS, and for pings between the same Linux host and a Surveyor host (a Pentium II running FreeBSD). The grouse pings were started on May 30, 1999. Doris and grouse are on separate subnets and are separated by 4 switches and a router. The first plot shows the time variation of the ping RTT.

    The second plot is the frequency histogram of the ping RTT. The blue line shows the cumulative distribution function (CDF). The data was binned into 2 different bin widths to provide a reasonable number of counts in the higher RTT bins: 0.1 msec. bins, shown in magenta, extend out to 10 msec., and 1 msec. bins run from 10 to 100 msec. The counts in the 1 msec. wide bins are normalized to the 0.1 msec. wide bins by dividing the count in each 1 msec. bin by 10. A simple power-law fit to the data between RTT 2.3 msec. and 61 msec. is also shown as a black line.

    The distribution has a sharp peak, with a median of 1.35 msec. and an Inter-Quartile Range (IQR) of 0.2 msec. There is also a high RTT tail.

    The third plot in this subsection shows the time variation of the ping RTT for 306,000 pings between the Linux host and the SLAC Surveyor host.

    The final plot in this subsection shows the frequency distribution of the ping RTTs between the Linux host and the SLAC Surveyor host. The blue line shows the cumulative distribution function (CDF). The data is binned into 3 different bin widths. The black dots are for bins with a width of 0.1 msec. and are for RTT < 1 msec. The magenta dots are for bin widths of 1 msec. and are for RTTs < 10 msec. The green dots have bin widths of 10 msec. and cover the entire range of data. The binned data is normalized by dividing the counts in the 1 msec. bins by 10 and the counts in the 10 msec. bins by 10. The black line is a simple power-law fit to the data between 2.3 msec. and 61 msec. inclusive.

    The distribution exhibits a sharp peak with a median of 0.9 msec., an IQR of 0.06 msec., and a high RTT tail. There are also secondary peaks at 2.4 msec. and 10 msec.
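
    The binning, normalization, and median/IQR calculations described above can be sketched as follows. This uses synthetic RTTs rather than the measured data; the bin widths (0.1 msec. below 10 msec., 1 msec. above) follow the description of the doris to grouse plot.

    import numpy as np

    rng = np.random.default_rng(1)
    # Synthetic RTTs: a sharp peak near 1.35 msec. plus a long high-RTT tail.
    rtt_ms = np.concatenate([rng.normal(1.35, 0.15, 490_000),
                             rng.pareto(1.7, 10_000) * 2.0 + 2.0])

    def normalized_hist(rtt, edges, base_width):
        """Histogram counts rescaled to counts per `base_width` msec."""
        counts, _ = np.histogram(rtt, bins=edges)
        widths = np.diff(edges)
        return counts * (base_width / widths)

    narrow = normalized_hist(rtt_ms, np.arange(0.0, 10.0, 0.1), 0.1)    # 0.1 msec. bins out to 10 msec.
    wide   = normalized_hist(rtt_ms, np.arange(10.0, 100.0, 1.0), 0.1)  # 1 msec. bins, rescaled by 1/10

    q25, median, q75 = np.percentile(rtt_ms, [25, 50, 75])
    print("median = %.2f msec., IQR = %.2f msec." % (median, q75 - q25))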

    Pings between 2 ESnet sites

    ESnet sites have excellent connectivity, with low packet loss and a high speed, well-provisioned backbone. Thus they provide an example of "how good it can get". The ESnet operations center is at LBNL, and SLAC is an ESnet site with, at the time of the measurements below, a T3 interconnect to the ESnet ATM backbone cloud. The SLAC link to ESnet was also lightly loaded, with peaks measured over 5 minutes only reaching about 50% utilization for the period of interest.

    The ping distribution for an extensive (500K samples) measurement between a host at SLAC (minos.slac.stanford.edu) and a host at ESnet at LBNL (hershey.es.net) is seen below, starting at 9:01 am on April 23, 1999 and ending at 3:59 am on April 29, 1999. The pings were separated by 1 second and the timeout was 20 seconds. It can be seen that there is a narrow (IQR = 1 msec.) peak at 4 msec. with a very long tail extending out beyond 750 msec. The black line is a power-law fit with the parameters shown.
    [Figure: minos - hershey ping RTT distribution]
    If one plots this data on a log-log plot (see below) it can be seen that there are two time scales (4-18 msec. and 18-1000 msec.) with quite different behaviors. The bulk of the data (99.8%) falls in the 4-18 msec. region. In the 4-18 msec. region (the magenta points) the data falls off as y ~ A * RTT^-6.6, whereas beyond 18 msec. (the blue points) it falls off as y ~ B * RTT^-1.7. The parameters of the fits are shown in the chart. Note that in the 4-18 msec. region the data are histogrammed in 1 msec. bins, whereas beyond that they are histogrammed in 10 msec. bins, and the 2 y scales are adjusted appropriately (the scale for the wider bins beyond 18 msec. is a factor of 10 greater than the other). The green points are not used in the fits; they are the data histogrammed in 1 msec. bins for the range 19 msec. to 55 msec. The power-law exponent in the 4-18 msec. region is of the kind exhibited by very chaotic processes such as fully developed turbulence or the stock market, whereas the data beyond 18 msec. is more characteristic of heavy-tailed or long-range self-similar behavior. A guess is that the transition at about 18 msec. reflects a change from delays caused by simple queueing to delays caused by router processing; this needs more work to substantiate.
    [Figure: log-log plot of the minos - hershey ping RTT data]
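
    A minimal sketch of this kind of power-law fit is given below, again with synthetic data rather than the measured RTTs. The fit is a straight line in log-log space, i.e. log10(counts) = log10(A) - b*log10(RTT), done separately for the two regions.

    import numpy as np

    rng = np.random.default_rng(2)
    rtt_ms = 4.0 + rng.pareto(1.7, 500_000) * 2.0         # synthetic heavy-tailed RTTs

    def power_law_fit(rtt, lo, hi, bin_width):
        """Fit counts ~ A * RTT^-b over [lo, hi) and return (A, b)."""
        edges = np.arange(lo, hi + bin_width, bin_width)
        counts, _ = np.histogram(rtt, bins=edges)
        centers = 0.5 * (edges[:-1] + edges[1:])
        mask = counts > 0
        slope, intercept = np.polyfit(np.log10(centers[mask]), np.log10(counts[mask]), 1)
        return 10 ** intercept, -slope

    A1, b1 = power_law_fit(rtt_ms, 4.0, 18.0, 1.0)        # narrow bins in the bulk region
    A2, b2 = power_law_fit(rtt_ms, 18.0, 1000.0, 10.0)    # wide bins in the tail
    print("4-18 msec.: exponent %.1f" % b1)
    print("18+ msec.:  exponent %.1f" % b2)
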
    The autocorrelation function for the first 64000 RTTs (there was no packet loss in this period) is shown below. It can be seen that in general there is a very weak (< 0.01) positive correlation for lags of less than 300 seconds. This weak correlation is present even for pings separated by only 1 second. The red horizontal lines are plotted at +-2/sqrt(64000) and indicate twice the standard error expected if the true autocorrelation is zero (95% of the autocorrelation values would lie within +-2/sqrt(64000) if the autocorrelation were zero). The following quote is from "Nonlinear Time Series Analysis" by Kantz and Schreiber.

    Stochastic processes have decaying autocorrelations, but the rate of decay depends on the properties of the process. Autocorrelations of signals from deterministic chaotic systems decay exponentially with increasing lag. Autocorrelations are not characteristic enough to distinguish a random from a deterministic chaotic signal.
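
    A minimal sketch of the autocorrelation calculation and the +-2/sqrt(N) band used above is given below; the rtt array here is a synthetic stand-in for the 64000 measured RTT samples (one per second).

    import numpy as np

    def autocorrelation(x, max_lag):
        """Sample autocorrelation of x for lags 1..max_lag."""
        x = np.asarray(x, dtype=float)
        x = x - x.mean()
        var = np.dot(x, x) / len(x)
        return np.array([np.dot(x[:-lag], x[lag:]) / (len(x) * var)
                         for lag in range(1, max_lag + 1)])

    rng = np.random.default_rng(3)
    rtt = 4.0 + rng.exponential(0.5, 64_000)   # stand-in for the 64000 measured RTTs

    acf = autocorrelation(rtt, max_lag=300)    # lags of 1 to 300 seconds
    band = 2.0 / np.sqrt(len(rtt))             # twice the standard error if the true autocorrelation is zero
    print("lags with |acf| outside the +/- %.4f band: %d" % (band, int(np.sum(np.abs(acf) > band))))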


    By performing a discrete Fourier transform of the RTTs one can obtain the spectral power periodogram shown below. There are pronounced peaks (over twice the height of most other peaks) in the distribution at periods of 200 sec., 600 sec., 900 sec. and 1800 sec. More work is needed to ascertain the causes of these peaks. Possibilities include periodic tasks in one or more routers (e.g. reading out SNMP variables, updating routing tables) causing queuing delays at regular intervals. It is worth noting that there is a cron job that runs every 10 minutes and reads out the SNMP interface statistics and router CPU utilizations for the routers at SLAC (RTR-CORE1, RTR-CGB6 & RTR-DMZ). The route between the two hosts traverses 5 Cisco routers:
    traceroute to hershey.es.net (198.128.1.11), 30 hops max, 40 byte packets
     1  RTR-CORE1.SLAC.Stanford.EDU (134.79.199.2)  1 ms  1 ms  1 ms
     2  RTR-CGB6.SLAC.Stanford.EDU (134.79.135.6)  2 ms  1 ms  2 ms
     3  RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)  2 ms  2 ms  2 ms
     4  ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18)  2 ms  175 ms  212 ms
     5  lbl1-atms.es.net (134.55.24.11)  4 ms  4 ms  4 ms
     6  esnet-lbl.es.net (134.55.23.66)  4 ms  4 ms  4 ms
     7  hershey.es.net (198.128.1.11)  5 ms  4 ms  5 ms
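
    Returning to the periodogram: a minimal sketch of obtaining the spectral power of the RTT series with a discrete Fourier transform is shown below. The RTT series here is synthetic (noise plus a weak 600 second periodicity) and simply stands in for the measured one-per-second samples.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 64_000
    t = np.arange(n)                                  # sample times in seconds (1 ping per second)
    # Synthetic RTT series: noise plus a weak periodic component every 600 seconds.
    rtt = 4.0 + rng.exponential(0.5, n) + 0.05 * np.sin(2 * np.pi * t / 600.0)

    spectrum = np.fft.rfft(rtt - rtt.mean())
    power = np.abs(spectrum) ** 2                     # spectral power periodogram
    freqs = np.fft.rfftfreq(n, d=1.0)                 # cycles per second

    peak = np.argmax(power[1:]) + 1                   # skip the zero-frequency bin
    print("strongest period: %.0f seconds" % (1.0 / freqs[peak]))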
    


    If one looks at the RTT of just the pings in the 20-1000 msec. region as a function of time of day, then one gets the chart shown below:

    It can be seen that there are some long individual ping response times (up to 1200 msec. on the Saturday), as well as periods when there were many pings with RTTs over 20 msec.; see, for example, Tuesday April 27 around noon. More details on the RTTs of these latter pings can be seen below (there were no packet losses in the period shown in the chart). It can be seen that from about 11:50 am through 12:25 pm there was a period of increased RTT.

    The autocorrelation function for the roughly 7000 ping RTTs around noon (between 11 am and 2 pm) on Tuesday April 27th, 1999 is seen below. Again the correlation for pings separated by as little as 1 second is seen to be weak. The horizontal red line is drawn at twice the standard error expected if the autocorrelation were zero. It is seen that there is a positive autocorrelation that decreases as the lag increases, out to about 150 seconds.

    The pathchar behavior between a host on the same subnet as minos (minos is an AIX host and pathchar does not run on it) and hershey is shown below:

    >pathchar -q 64 hershey.es.net
    pathchar to hershey.es.net (198.128.1.11)
     mtu limitted to 8192 bytes at local host
     doing 64 probes at each of 64 to 8192 by 260
     0 FLORA03.SLAC.Stanford.EDU (134.79.16.55)
     |    77 Mb/s,   462 us (1.77 ms)
     1 RTR-CGB5.SLAC.Stanford.EDU (134.79.19.3)
     |   294 Mb/s,   218 us (2.43 ms)
     2 RTR-CGB6.SLAC.Stanford.EDU (134.79.135.6)
     |    18 Mb/s,   276 us (6.53 ms)
     3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
     |   ?? b/s,   -85 us (2.44 ms)
     4 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18)
                            -> 192.68.191.18 (1)           
     |   ?? b/s,   1.42 ms (5.13 ms)
     5?lbl1-atms.es.net (134.55.24.11)
     |   245 Mb/s,   71 us (5.54 ms)
     6 esnet-lbl.es.net (134.55.23.66)
     |   9.7 Mb/s,   95 us (12.5 ms)
     7 hershey.es.net (198.128.1.11)
    7 hops, rtt 4.91 ms (12.5 ms), bottleneck 9.7 Mb/s, pipe 42418 bytes
    

    Pings between International sites

    CERN, at the time of the measurements below, shared an 8 Mbps link across the Atlantic with the World Health Organization, IN2P3 in France, and Switch (the Swiss academic network). The shared trans-Atlantic link reached over 80% utilization for 5 minute periods during the measurements and was normally the bottleneck. The loading on the link is seen below. The green represents the average (over 30 minutes) traffic to Switzerland and the blue is the average (over 30 minutes) traffic to the U.S. The dark green and magenta are the 5 minute maxima. The ping measurements below were taken on the consecutive days labelled Sun, Mon, Tue, Wed in the utilization graph below.
    [Figure: CERN trans-Atlantic link utilization]

    To better understand the behavior of the ping Round Trip Time (RTT) in the WAN, we pinged CERN (ping.cern.ch) from SLAC (minos.slac.stanford.edu) every second with a timeout of 20 seconds, for 260K pings between 8:36 am Sunday May 9 and 10:35 am Wednesday May 12, 1999 (PDT). The packet loss for these measurements was about 0.053%. The distribution of the RTT is seen in the chart below.

    The distribution shows a lot of structure. First there is a sharp peak at about 224 msec.; 90% of the peak is contained within a width of 9.5 msec. On the high RTT side of the peak several smaller peaks are seen, together with a long tail. If we look at the individual RTTs in the high RTT tail beyond 260 msec., we get the chart shown below:

    The clusters of points for Tuesday May 11 also show up in the Surveyor data, as shown in the graphs below:

    Of particular interest is the cluster around 18:00 hours on Tuesday May 11. The ping RTT and loss data for this cluster are shown in the chart below. The loss is calculated by looking for missing ping sequence numbers. The routes are obtained from Surveyor measurements, which use traceroute to measure the routes about every 15 minutes.
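
    A minimal sketch of this loss calculation from missing ping sequence numbers is given below; the sequence numbers in the example are made up to mimic runs of consecutive losses like those described in the next paragraph.

    def loss_from_sequence(seq_numbers):
        """seq_numbers: ICMP echo sequence numbers of the replies, in send order."""
        lost_runs = []
        total_lost = 0
        for prev, cur in zip(seq_numbers, seq_numbers[1:]):
            gap = cur - prev - 1                    # number of missing sequence numbers
            if gap > 0:
                total_lost += gap
                lost_runs.append((prev + 1, gap))   # (first missing sequence number, run length)
        sent = seq_numbers[-1] - seq_numbers[0] + 1
        return total_lost / sent, lost_runs

    loss_fraction, runs = loss_from_sequence([1, 2, 3, 173, 174, 211])
    print("loss = %.1f%%, longest run = %d packets" % (100 * loss_fraction, max(r[1] for r in runs)))
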
    There is a clear change in behavior starting at about 18:10 hours and stopping at about 19:20 hours. At the start of this period there is a loss of 169 consecutive ping packets (a break in connectivity of 169 seconds, since the pings are sent at one second intervals, while the network routing converges to a new route), and at the end a further loss of 36 consecutive ping packets. Apart from this period the route (as measured by traceroute) to CERN runs from SLAC to ESnet to the New York Sprint NAP, then to West Orange in New Jersey, and thence back to Chicago to the STAR TAP and on to CERN. During the period from 18:10 hours to 19:20 hours the route runs from SLAC to ESnet to BBN, going via New York and London to Geneva. This route is more congested, hence the increase in packet loss, but it avoids the trip back from New Jersey to Chicago and so saves about 30 msec. in the round trip. The complete routes can be seen below:

    The ping RTT data for the cluster around 1:00 am on May 11, 1999 can be seen in more detail in the chart below. It can be seen that there is a complete loss of connectivity (i.e. no ping responses) lasting about 14 minutes, from about 1:07 am until about 1:21 am. After this, performance looks fairly normal. Prior to the loss of connectivity, there are periods of longer RTT (almost double) followed by shorter losses of connectivity. For CERN to SLAC, Surveyor shows changes from the normal route at 1:00 am and 1:15 am, returning to the normal route at 1:35 am. For SLAC to CERN, Surveyor shows a change in route at 0:56 am, returning to the normal route at the next measurement at 1:23 am. The alternate routes are limited to the SLAC site. This cluster is coincident with problems resulting from changes being made to a core switch at SLAC.

    The cluster around 7:15 am on May 11, 1999, shown in more detail below, is actually 3 sudden changes in RTT from about 220 msec. to about 525 msec. and back after 1 to 2 minutes, with top-hat shaped RTT peaks at about 7:14 am to 7:16 am, 7:19 am to 7:20 am, and 7:23 am to 7:24 am. Surveyor traceroute samples did not coincide with any of these peaks and saw no route changes. Only one packet was lost in the period shown below. The black line is a 10 second moving average, inserted to help the eye discern the top-hat peaks.
    [Figure: ping RTT cluster around 7 am]
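
    The 10 second moving average used as the eye guide above can be sketched as follows, with a synthetic one-sample-per-second RTT series standing in for the measured data.

    import numpy as np

    def moving_average(rtt, window=10):
        """Simple trailing moving average over `window` samples (seconds)."""
        kernel = np.ones(window) / window
        return np.convolve(rtt, kernel, mode="valid")

    rng = np.random.default_rng(5)
    rtt = np.full(600, 220.0) + rng.normal(0, 5, 600)   # stand-in: 10 minutes of RTTs around 220 msec.
    rtt[120:210] += 305.0                                # a 90 second "top hat" step up to ~525 msec.
    smoothed = moving_average(rtt)                       # the eye-guide line
    print("max of smoothed RTT: %.0f msec." % smoothed.max())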

    Surveyor also does not indicate any route changes for the clusters around 14:00 hours on May 11, 1999, or for those around 15:00 hours and 18:30 hours on May 10, 1999.

    The pathchar information for the normal path from SLAC to CERN is shown below:

    >pathchar -q 64 ping.cern.ch
    pathchar to dxcoms.cern.ch (137.138.28.176)
     mtu limitted to 8192 bytes at local host
     doing 64 probes at each of 64 to 8192 by 260
     0 FLORA03.SLAC.Stanford.EDU (134.79.16.55)
     |   162 Mb/s,   369 us (1.14 ms)
     1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2)
     |   115 Mb/s,   281 us (2.28 ms)
     2 RTR-CGB6.SLAC.Stanford.EDU (134.79.159.12)
     |    19 Mb/s,   242 us (6.29 ms)
     3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
     |   ?? b/s,   -100 us (2.29 ms)
     4 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.18)
                            -> 192.68.191.18 (1)           
     |   ?? b/s,   31.1 ms (64.4 ms)
     5?nynap1-atms.es.net (134.55.24.9)
     |   914 Mb/s,   118 us (64.7 ms)
     6 1-sprint-nap.cw.net (192.157.69.11)
                            -> 192.157.69.11 (1)           
     |   1997 Mb/s,   1.72 ms (68.2 ms)
     7?core4-hssi6-0-0.WestOrange.cw.net (204.70.10.225)
     |   591 Mb/s,   9.52 ms (87.4 ms)
     8 bordercore4.WillowSprings.cw.net (166.48.34.1)
                            -> 166.48.34.1 (2)           
     |    86 Mb/s,   1.13 ms (90.4 ms)
     9?cern-cwe.WillowSprings.cw.net (166.48.34.6)
                            -> 166.48.34.6 (3)           
     |   130 Mb/s,   59.9 ms (211 ms)
    10?cernh9-ar1-chicago.cern.ch (192.65.184.166)
                            -> 192.65.184.166 (2)           
     |   ?? b/s,   356 us (211 ms)
    11?cgate2.cern.ch (192.65.185.1)
     |   2634 Mb/s,   135 us (211 ms)
    12 cgate1-dmz.cern.ch (192.65.184.65)
                            -> 192.65.184.65 (3)           
     |   551 Mb/s,   327 us (212 ms)
    13?r513-c-rci47-15-gb0.cern.ch (128.141.211.41)
                            -> 128.141.211.41 (1)           
     |    15 Mb/s,   -225 us (216 ms)
    14?dxcoms.cern.ch (137.138.28.176)
    14 hops, rtt 210 ms (216 ms), bottleneck  15 Mb/s, pipe 425545 bytes
    
