Last Update on 5/8/98
This report is available from http://www.slac.stanford.edu/xorg/icfa/ntf/mon-wg-report-may98.html
This report attempts to be a fairly self-contained, complete account of the activity of the group. It covers:
We use the standard ICMP ping facility to measure end-to-end link performance. Ping is a simple tool that comes installed on most platforms, so no complicated software needs to be installed. It also often runs at high priority (i.e. in the Unix kernel), which makes it a good way to measure network performance independently of the application. We use ping to measure round-trip performance, including response time, packet loss, reachability, and unpredictability. These metrics are defined in the Tutorial on WAN Monitoring at SLAC.
Currently, every 30 minutes, we ping each of a set of remote nodes with 11 pings of 100 data bytes each. The pings are separated by at least one second to help reduce self-correlation effects between individual pings, and the default ping timeout of 20 seconds is used. The first ping is thrown away to reduce possible effects such as priming of caches, and the minimum/average/maximum for the remaining set of 10 pings is recorded. This is repeated for ten pings of 1000 data bytes.
Work is in progress at HEPNRC to use Poisson arrival times for the ping intervals, rather than a fixed 30-minute interval, to avoid missing periodic events. A new version of the measuring tools has also been developed by SLAC to measure medians as well as averages, in order to reduce sensitivity to outliers. Both the Poisson arrival time and the median measurement tools are in beta test, and we hope to release them in Summer 1998.
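The per-set bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the PingER code: the function names, the use of `None` for a timed-out ping, and the sample values are all assumptions made for the example. The second function shows one common way to obtain Poisson arrival times, namely drawing exponentially distributed gaps between measurements.

```python
import random
import statistics
from typing import Optional, Sequence


def summarize_ping_set(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one set of pings; None marks a ping that timed out."""
    if len(rtts_ms) < 2:
        raise ValueError("need the discarded first ping plus at least one more")
    # Discard the first ping, which may be inflated by cache priming.
    kept = rtts_ms[1:]
    answered = [r for r in kept if r is not None]
    loss_pct = 100.0 * (len(kept) - len(answered)) / len(kept)
    if not answered:
        return {"loss_pct": loss_pct, "min": None, "avg": None,
                "max": None, "median": None}
    return {
        "loss_pct": loss_pct,
        "min": min(answered),
        "avg": statistics.mean(answered),
        "max": max(answered),
        "median": statistics.median(answered),
    }


def poisson_intervals(mean_seconds: float, count: int, seed: int = 0) -> list:
    """Exponentially distributed gaps between sets give Poisson arrival times."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_seconds) for _ in range(count)]


# Hypothetical set of 11 pings (milliseconds); one ping was lost.
sample = [250.0, 180.0, 182.0, None, 181.0, 179.0, 400.0,
          183.0, 180.0, 181.0, 182.0]
print(summarize_ping_set(sample))
print(poisson_intervals(1800.0, 3))
```

In the sample set the single 400 ms outlier pulls the average well above the median, which is the sensitivity the median measurement is meant to reduce.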
The easiest way to validate ping is to demonstrate that measurements made with it correlate with application response. Such a correlation between the lower bounds of Web and ping responses is seen in the figure below (from Internet Monitoring in the Energy Research Community). The remarkably clear lower boundary seen around y=2x is not surprising, since: a slope of 2 corresponds to HTTP GETs that take twice the ping time; the minimum ping time is approximately the round trip time; and a minimal TCP transaction involves two round trips, one to establish the connection and the second to send the request and receive the response.
The lower boundary can also be visualized by displaying the distribution of residuals between the measurements and the line y = 2x (where y = HTTP GET response time and x = minimum ping response time). Such a distribution is shown below. The steep increase in the frequency of measurements as one approaches zero residual value (y = 2x) is apparent. The interquartile range (the residual range within which the middle half of the measurements fall, i.e. between the 25% and 75% points) is about 250 ms; this range is indicated on the plot by vertical green lines. The full-width at half maximum is about 120 ms.
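As a rough illustration of the residual analysis, the sketch below computes the residuals from the line y = 2x and their interquartile range. The function names and the sample measurements are hypothetical, and `statistics.quantiles` (Python 3.8 or later) uses its own default interpolation rule, which may differ slightly from whatever package produced the original plot.

```python
import statistics


def residuals_from_2x(ping_min_ms, http_get_ms):
    """Residual of each HTTP GET time from twice the minimum ping time."""
    return [y - 2.0 * x for x, y in zip(ping_min_ms, http_get_ms)]


def interquartile_range(values):
    """Width of the range holding the middle 50% of the values."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical paired measurements in milliseconds.
ping_min = [80.0, 150.0, 160.0, 200.0, 90.0, 300.0, 120.0, 170.0]
http_get = [165.0, 320.0, 600.0, 410.0, 200.0, 640.0, 250.0, 900.0]

res = residuals_from_2x(ping_min, http_get)
print(res)
print(interquartile_range(res))
```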
An earlier study of the relationship between ping response time and FTP data rates can be found in Correlations between FTP & Ping, and Correlations between FTP throughput, Hops & Packet Loss. Another way to correlate throughput measurements with packet loss is by Modeling TCP Throughput.
Sites are divided into 4 categories: remote sites that are simply monitored; monitoring sites that perform the monitoring and collect the data and in some cases provide some reports on the recently gathered data; an archive site which gathers the data via the Web from the monitoring sites, saves it in the archive and makes it available on demand via HTTP; one or more analysis sites (an analysis site may be identical to the archive site) that read the data from the archive or monitoring sites and run programs to analyse the data and provide reports via the Web. This hierarchical architecture removes the problem of full-mesh pinging where every site would ping every other site, which would not scale for large numbers of sites. It also matches the organization of HENP sites into major Laboratories and collaboration sites (usually located at universities). The figure below depicts the architecture.
The ping measuring tools (collectively referred to as PingER) are now installed at 15 HENP/ESnet monitoring sites in 8 countries. Over 480 links are being monitored in 22 countries, and HEPNRC is acting as the archive site. A table of the current HENP/ESnet monitoring sites is seen below.
|ARM (US)||BNL (US)||CERN (CH)||Carleton U (CA)||CMU-HEP (US)|
|DESY (DE)||DOE-MICS (US)||HEPNRC/FNAL (US)||INFN/CNAF (IT)||KEK (JP)|
|RMKI/KFKI (HU)||RAL (UK)||SLAC (US)||TRIUMF (CA)||UMD-HEP (US)|
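To see why the hierarchical arrangement scales better than full-mesh pinging, it helps to count monitored links in each case. The sketch below is only arithmetic: the community size of 200 sites and the figure of 32 remote hosts per monitoring site are hypothetical values chosen so that 15 monitoring sites give 480 links, in line with the numbers quoted above.

```python
def full_mesh_pairs(sites: int) -> int:
    """Directed links when every site pings every other site."""
    return sites * (sites - 1)


def hierarchical_pairs(monitoring_sites: int, remotes_per_monitor: int) -> int:
    """Directed links when only the monitoring sites send pings."""
    return monitoring_sites * remotes_per_monitor


# Hypothetical community of 200 sites versus 15 monitoring sites
# that each watch 32 remote hosts.
print(full_mesh_pairs(200))
print(hierarchical_pairs(15, 32))
```

The full-mesh count grows with the square of the number of sites, whereas the hierarchical count grows only linearly with the number of remote hosts each monitoring site chooses to watch.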
In the past, monitoring sites have chosen sites of interest to themselves to monitor. However, it has become apparent that the ability to accurately compare network performance will be enhanced by selecting a number of sites common to all monitoring sites. Hence we have defined so-called beacon sites that are considered important and it is recommended that all monitoring sites include these in their list of monitored sites. Currently, the list of beacon sites is being finalized, and at the time of writing they are only just being implemented.
Remote sites require no effort beyond providing the name of an appropriate host to monitor and a contact person.
The monitoring sites require a small amount of effort (say about 2 full-time-equivalent (FTE) days) to initially install the monitoring code and make the data available via a Web page, plus install occasional updates to the code and respond to questions on pathological problems.
The analysis site takes about 25% of an FTE effort to maintain. This effort includes monitoring the data collection, contacting monitoring sites when the monitoring fails, and managing the archive data.
The other major effort is to understand the data, develop analysis and reporting tools, make the information available on the Web, respond to questions from potential and existing users, and document everything. This effort corresponds to between 1.5 and 2.5 FTEs scattered across several sites, in particular SLAC and HEPNRC. The main tools used for the analysis are perl, Statistical Analysis Software (SAS) and the spreadsheet package Excel. The reports are a mix of dynamic (i.e. the report is generated upon user demand) and static (i.e. the report is generated at regular intervals), with the aim of providing the most information as quickly as possible while minimizing the amount of disk storage required.
The raw data archive is growing at about 600 kbytes/month per link monitored, the network traffic is about 100 bits/second per link monitored, and the load on an 80 MHz IBM RS/6000 250 monitoring ~100 sites is about 40 minutes of total CPU time per day. Currently (May 1998) we have about 4 GBytes of data in the archive. The actual growth of the archive can be seen in the following plot:
The numbers are for the extra data added to the archive per month (i.e. this is NOT a cumulative plot). The growth is mainly due to the addition of extra monitored links. The drop in December is due to sites that had not upgraded to the latest release of the monitoring programs, so their data could not be retrieved. The indexes are additional data files created by Statistical Analysis Software (SAS) to facilitate finding data quickly.
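The traffic and storage figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes 28 bytes of overhead per ping (a 20-byte IPv4 header without options plus an 8-byte ICMP header), ignores link-layer framing, and assumes every echo request is answered by a reply of the same size. The function names are invented for the example.

```python
# Assumed per-packet overhead: 20-byte IPv4 header + 8-byte ICMP header.
IP_ICMP_OVERHEAD_BYTES = 28


def ping_traffic_bits_per_second(ping_sets, interval_seconds=1800):
    """Average traffic per link for ping sets given as (count, data_bytes)."""
    one_way_bytes = sum(count * (data_bytes + IP_ICMP_OVERHEAD_BYTES)
                        for count, data_bytes in ping_sets)
    # Each echo request is assumed to be matched by an equally sized reply.
    return 2 * one_way_bytes * 8 / interval_seconds


def archive_growth_mbytes_per_month(links, kbytes_per_link=600):
    """Archive growth for a given number of monitored links."""
    return links * kbytes_per_link / 1000.0


# 11 pings of 100 data bytes and 10 pings of 1000 data bytes every 30 minutes.
print(ping_traffic_bits_per_second([(11, 100), (10, 1000)]))
# About 480 monitored links at 600 kbytes/month each.
print(archive_growth_mbytes_per_month(480))
```

Under these assumptions the traffic works out at a little over 100 bits/second per link, consistent with the figure quoted, and 480 links add just under 300 Mbytes to the archive each month.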
Viewing the above plot it can be seen that ESnet sites have good ping quality most of the time, whereas the loss between SLAC and sites in the other groups averaged poor or worse. This is suspected to be mainly due to the poor performance encountered as packets traversed the interchanges between ESnet and the rest of the Internet.
Reports on the current state of the links are available from 7 monitoring sites. The report is created by a perl Web CGI script (connectivity.pl) and is in the form of a table showing the latest ping measurements (loss, min/max/average/median response in msec, and the slope of the minimum response time versus data bytes) for each remote site monitored. The table can be sorted, and the losses are colored to make the quality of the links stand out. The table provides clickable links to burrow down to more detailed information, such as a plot of the packet loss for a given host for the last 24 hours. An example of the table is shown below.
These short term reports are used to look at small amounts of data from a single monitoring site. As with other tabular examples, the data can be sorted and imported into spreadsheet packages such as Excel.
Plots of the response time and loss (one point per half hour) are available via a Web form at HEPNRC which allows one to select the links, the time frame, and the graphical format. An example of such a plot, for the link from HEPNRC to the University of Manchester in England for part of December 1997, is seen below. The left-hand scale, for the black line, is the ping response time in msec; the right-hand scale, for the red line, is the percent packet loss. The improved performance (reduced packet loss) across the Christmas holiday can be clearly seen.
Plots of the response times and packet loss with one point per day and going back for the last 180 days are valuable for revealing the effect of network changes. We have developed SAS tools to create such plots automatically. The plot below shows such a 180 day plot for the link between CERN and SLAC. The lines are cubic spline fits to the data to aid the eye. The black and green dots show the 1000 byte and 100 byte ping response times. The blue circles and red hash marks (#) show the weekday 1000 and 100 byte percent packet losses. The cyan circles and blue hash marks show the weekend packet losses. It is immediately apparent that around early October, the response time improved by about a factor of two and the packet loss improved by more than an order of magnitude. This date corresponds with the CERN link to the US being upgraded. Looking in more detail, it can also be observed that prior to October there are big differences between the weekend and weekday performances. Such differences are an indicator of a congested link, the weekend measurements giving a rough indication of how well the link can perform under low utilization conditions.
Similar plots are shown below for major HEP sites in Germany, France, Japan and the UK. It can be seen that: performance to KEK in Japan is very good; performance to IN2P3 in France improved in early October (IN2P3 links to the US via CERN), but not by as much as the CERN link improved (note in particular the weekend to weekday differences); performance to RL in the UK got much worse in September, and the daily packet loss is now over 20%, which is basically unusable; packet loss to DESY (Hamburg) in Germany grew from about 4.5% to around 10% from July to November and then dropped back to under 4%; and packet loss to INFN in Italy is hovering around 7%. The sudden change for IN2P3 around December 14th was caused by our being unable to contact the host there; probably the host we were monitoring was removed from service.
By comparing a metric for the last 3 months one can see which sites are experiencing the most problems and how the metric has changed over the recent time period. The Excel figures below show the packet loss measured from SLAC during prime time (7 am to 7 pm on weekdays at SLAC) for the 3 months of September, October and November 1997. The order of the sites is determined by sorting the losses within each group for November 1997. Such a plot enables one to quickly see the worst performing sites and the variation over the previous 3 months.
The following observations are in order:
Note that these plots and those following in this section on End-to-end ping measurements are produced manually in the spreadsheet package Excel using the output facility of the tabular reports.
By comparing the running averages of the response times and packet loss for the previous 10 weeks (weekends excluded) against the averages for the most recent week, one can determine how close (in units of standard deviations) the most recent week's averages are to those of the previous 10 weeks. The table below (only currently available for the SLAC data) was created by SAS and then displayed in web format by a perl CGI script. It shows such a comparison and by sorting on the difference (number of standard deviations difference) one can quickly identify which sites have degraded the most. The table also provides clickable links to allow one to burrow down to more detailed information on a given link, and provide access to the raw data in Excel format to facilitate further analysis.
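The comparison described above amounts to expressing the most recent week's average as a number of standard deviations away from the mean of the previous 10 weeks. The sketch below is a minimal Python illustration, not the SAS code: the host names and loss values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics


def deviation_in_sigmas(previous_weeks, latest_week):
    """How many standard deviations the latest week is from the earlier mean."""
    mean = statistics.mean(previous_weeks)
    sigma = statistics.stdev(previous_weeks)  # sample standard deviation (assumed)
    if sigma == 0:
        raise ValueError("previous weeks show no variation")
    return (latest_week - mean) / sigma


# Hypothetical weekly average packet loss (%) for two remote hosts:
# ten previous weeks, followed by the most recent week.
history = {
    "host-a.example": ([10.0, 12.0] * 5, 14.0),
    "host-b.example": ([2.0, 3.0] * 5, 2.6),
}

# Sort so that the sites that have degraded the most come first.
ranked = sorted(history.items(),
                key=lambda item: deviation_in_sigmas(*item[1]),
                reverse=True)
for host, (previous, latest) in ranked:
    print(host, round(deviation_in_sigmas(previous, latest), 2))
```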
The differences in performance between low and high utilization (or loaded) periods may be used as a sensitive indicator of how close the link is to saturation. This difference may be quantified in several ways. These include:
Currently long term reports going back over several years are only available from SLAC. The SAS analysis code that produces tables of sites versus monthly metric value has been converted to handle multiple monitoring sites and is in production use at SLAC. It has recently been copied to the analysis site at HEPNRC and is being actively worked on to provide a Web user interface.
The Excel plot below shows the monthly average prime time ping response times for four groups of sites monitored from SLAC: ESnet, N. America East, N. America West and International sites. The lines through the points are exponential fits to guide the eye. The parameters of the exponentials are seen in the upper right hand part of the figure. It can be seen that for the former 3 groups the response time has improved by about a factor of two in the last 2 years. The response time for the international group on the other hand has increased from about 400 msec to over 500 msec. A large contribution to this increase is the adding of monitoring to sites which have slow links to SLAC. The most notable such sites are IHEP in Beijing China (added in February 97), Novosibirsk in Russia (added in November 96), and the FZU in the Czech Republic (added in May 97). If these sites are removed from the international group then an exponential fit shows the response time is improving by just under 1% per month, which is a similar value to the other groups.
Similar improvements can be seen in the figure below for the prime time packet loss for the same grouping of sites.
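An exponential trend of this kind can be fitted by ordinary least squares on the logarithm of the metric, which also converts the fitted exponent into a percentage change per month. The sketch below illustrates the idea in Python; it is not the Excel procedure used for the figures, and the function names and the synthetic data are assumptions.

```python
import math


def fit_exponential(months, values):
    """Least-squares fit of values = a * exp(b * month), done on log(values)."""
    n = len(months)
    logs = [math.log(v) for v in values]
    mean_x = sum(months) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in months)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def monthly_change_percent(b):
    """Convert the fitted exponent into a percentage change per month."""
    return 100.0 * (math.exp(b) - 1.0)


# Synthetic series: a 400 msec response time improving by 1% per month.
months = list(range(25))
values = [400.0 * 0.99 ** m for m in months]
a, b = fit_exponential(months, values)
print(round(a, 1), round(monthly_change_percent(b), 2))

# A factor of two improvement over 24 months, expressed per month.
print(round(monthly_change_percent(-math.log(2.0) / 24.0), 2))
```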
If one burrows down to look at the international long term trends in more detail, then one sees a figure like the one below. In this figure a set of representative HEP sites in various countries has been chosen. This is generally valid since elsewhere we have shown that, viewed from SLAC, sites within a given country (apart from Japan) have similar performance. From this figure it can be seen that KEK has the best performance, followed by DESY and CERN, with the UK, the FSU and China having the worst performance. The UK is interesting in that the line is "saw-toothed", the packet loss improving (decreasing) when capacity is added to the UK-US link (April 96, February 97 and August 97), and then degrading (increasing) until the next addition (the other dips are due to holiday periods). It can also be noticed that IN2P3 (France) and CERN (Switzerland) tracked one another fairly well until recently. This is probably a reflection of the IN2P3 link being one hop beyond CERN.
The example below shows monitoring sites and remote sites aggregated by continent for median monthly ping loss in April 1998.
Other groupings can be selected and displayed, such as top level domain (e.g. edu, gov) or backbone (e.g. ESnet, vBNS, TEN-34). Further, the metric (loss, response, quiescence, reachability, unpredictability) and month can also be selected. The following screenshot shows the collection sites across the top and the remote sites aggregated by country along the side for ping loss in April 1998. By placing the mouse pointer on the icon next to the number it is possible to see the number of links the number represents. In the example, there are two links from CERN to the Russian Federation (suncs02.cern.ch to www.jinr.dubna.su and suncs02.cern.ch to www.ihep.su).
Each combination (such as North America to Asia or SLAC to China in the above examples) links to a by-month summary table. The by-month table itself has further links to ping the remote node from the collection site, if a reverse ping program is installed, and a link to the graphing program at the archive site. The by-month table provides summary information intended to give an indication of trends in the performance of each particular grouping.
The following example shows a by-month table for remote sites on ESnet seen from monitoring sites in Canada (TRIUMF and CARLETON). The groups to be displayed can be selected. Also the data can be sorted by clicking on a column heading, and it can also be exported to a spreadsheet package such as Excel for further analysis.
For both table.pl and pingtable.pl, the numbers, or the cells that the numbers are in, are color coded by quality (see also the section on Link Quality for further information).
|Ping-Loss||Less than 1%||1% to 2.5%||2.5% to 5%||5% to 12%||more than 12%|
|Ping-Response||Less than 62.5ms||62.5ms to 125ms||125ms to 250ms||250ms to 500ms||more than 500ms|
|Zero Packet Loss||More than 95%||85% to 95%||65% to 85%||45% to 65%||less than 45%|
|Unreachability||Less than 1%||1% to 3%||3% to 5%||5% to 10%||more than 10%|
|Unpredictability||Less than 5||5 to 10||10 to 20||20 to 30||more than 30|
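The thresholds in the table above translate directly into a lookup that assigns each measurement to one of five quality bands. The Python sketch below is only an illustration: the metric keys and the function name are invented, the bands are numbered 0 (leftmost column) to 4 (rightmost column) because the table does not name them, and a value lying exactly on a boundary is assumed to fall into the worse band.

```python
from bisect import bisect_left, bisect_right

# Upper boundaries of the four best bands, taken from the table above.
THRESHOLDS = {
    "loss_pct": [1.0, 2.5, 5.0, 12.0],
    "response_ms": [62.5, 125.0, 250.0, 500.0],
    "unreachability_pct": [1.0, 3.0, 5.0, 10.0],
    "unpredictability": [5.0, 10.0, 20.0, 30.0],
}
# For zero packet loss a higher percentage is better, so the scale is reversed.
ZERO_LOSS_BOUNDARIES = [45.0, 65.0, 85.0, 95.0]


def quality_band(metric: str, value: float) -> int:
    """Return the band index, from 0 (leftmost column) to 4 (rightmost)."""
    if metric == "zero_loss_pct":
        return 4 - bisect_left(ZERO_LOSS_BOUNDARIES, value)
    return bisect_right(THRESHOLDS[metric], value)


print(quality_band("loss_pct", 0.5))
print(quality_band("response_ms", 180.0))
print(quality_band("zero_loss_pct", 50.0))
```

A table cell could then be colored by mapping the band index to whatever color scheme the report uses.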
An example is shown below where one can select predefined interest groups via the scrollable box that in the figure shows babar, cern and ch-it.
Another advantage of grouping the data is that it allows the user to analyze trends in a subset of links. In the example below the data has been grouped into links that cross an ocean, and immediately we see a steady decrease in the median packet loss and an overall improvement of transoceanic connectivity. The vertical blue bars indicate the interquartile range of the measurements. Further analysis reveals that various links had been upgraded during the time period shown.
When we get a zero packet loss sample (a sample refers to a set of n pings), we refer to the network as being quiescent (or non-busy). We can then measure the percentage frequency of how often the network was found to be quiescent. A high percentage is an indication of a good (quiescent or non-heavily loaded) network. For example, a network that is busy for 8 work hours per weekday and quiescent at other times would have a quiescent percentage of about 75%, i.e. (total_hours/week - 5 weekdays/week * 8 hours/day) / (total_hours/week) = (168 - 40) / 168. This frequency analysis also avoids criticism aimed at selecting only prime time measurements (it is similar to the phone companies' metric of error-free seconds). An example of a plot of the ping Network Quiescence for groups of sites seen from SLAC can be seen in the figure below. From this figure it can be seen that quiescence has improved for ESnet sites but not markedly for other sites.
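The quiescence metric is simple to compute from a series of per-sample loss values. The following Python sketch is an illustration only, with invented function names and sample data; it also reproduces the worked example of a network that is busy for 8 hours on each of 5 weekdays.

```python
def quiescent_percentage(loss_per_sample):
    """Percentage of samples (sets of pings) that saw zero packet loss."""
    quiet = sum(1 for loss in loss_per_sample if loss == 0)
    return 100.0 * quiet / len(loss_per_sample)


def expected_quiescence(busy_hours_per_weekday=8, weekdays=5):
    """Quiescent percentage for a network busy only during work hours."""
    total_hours = 7 * 24
    return 100.0 * (total_hours - weekdays * busy_hours_per_weekday) / total_hours


# Hypothetical loss percentages for eight consecutive samples.
print(quiescent_percentage([0, 0, 10, 0, 20, 0, 0, 0]))
print(round(expected_quiescence(), 1))
```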
To provide a quick visualization of link performance, SLAC is developing a Java tool (based on the Mapnet tool from NLANR/CAIDA) to display the links as colored lines on a map of the world. The tool allows one to select areas of the world to view and how the links are to be colored (e.g. red to reflect links which have poor packet loss, and green to reflect ones with good packet loss). Options include selecting the metric (packet loss, response time, unpredictability etc.) and the time frame. Also, by moving the mouse over the links or sites, the coordinates of the link or sites can be displayed. Currently the only monitoring site selectable is SLAC, but this will be extended to include all monitoring sites. A screen shot of the tool (MapPing) is seen below.
ESnet keeps traffic measurements from its routers. This data goes back to June 1990, when there were about 20 routers; today there are about 56. Unfortunately, over the period of January 1996 through September 1997 the data was not consistently gathered and reported. This is being addressed by ESnet, but the old data has not been corrected. The graph below shows the bytes accepted by ESnet from the major ESnet HENP laboratories. It can be seen that the growth for the period reported is roughly exponential, with a monthly growth of 2% - 6%.
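A monthly growth rate is easier to interpret as a doubling time. The short Python sketch below does the conversion for the two ends of the 2% - 6% range quoted above, assuming steady compound growth; the function name is invented for the example.

```python
import math


def doubling_time_months(monthly_growth_percent: float) -> float:
    """Months needed for traffic to double at a steady compound growth rate."""
    return math.log(2.0) / math.log(1.0 + monthly_growth_percent / 100.0)


for rate in (2.0, 6.0):
    months = doubling_time_months(rate)
    print(f"{rate:.0f}% per month doubles in about {months:.1f} months")
```

Under this assumption, 2% per month corresponds to doubling roughly every three years, and 6% per month to doubling roughly every year.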
The overall ESnet growth in bytes accepted is seen below. The dip starting around January 1996 is believed to be due to faulty data gathering/reporting. The more recent data since October 1997 is believed to be more accurate and shows that the growth experienced prior to January 1996 is continuing.
The Multi Router Traffic Grapher (MRTG) is a tool to monitor the traffic load on network-links. MRTG generates HTML pages containing GIF images which provide a LIVE visual representation of this traffic. In addition to a detailed daily view, MRTG also creates visual representations of the traffic seen during the last seven days, the last four weeks and the last twelve months. A couple of ICFA-NTF (SLAC & CERN) sites are running MRTG and have made the MRTG reports for their external connections available on the Web. An example of the output for CERN is seen below.
The graphs readily show the diurnal and weekly (low use at weekends) variations and show intermediate (over the last year) term trends.
A very powerful tool for diagnosing network problems is traceroute. For an introduction to traceroute see Mapping the Internet with Traceroute. We have been actively encouraging HENP sites to provide Web based reverse traceroute servers so a user can trace routes from both ends of a link. About 30 HENP sites currently have such servers. A list of such Traceroute Servers for HENP & ESnet has been set up.
John Macallister of Oxford has developed a traceping tool based on the standard traceroute and ping utilities. It does a traceroute to the host at each remote site and then pings the nodes along the route to each remote host. Statistics are gathered at regular intervals for 24-hour periods and provide information on routing configuration, route quality (in terms of packet loss) and route stability. This data is archived and can be used to look at route changes since the data gathering was started. The archived data is useful for looking at changes in performance and for seeing how the common (i.e. most used) route to each site has changed over time. The tool currently runs under VMS, but John is porting it to perl. An example of the output is seen below, where the table shows for each hour the packet loss percentage from Oxford to each of the nodes on the route to Minnesota. The nodes are identified by both their IP address and name. One can also observe that during the day two distinct routes were seen, one via UTelecom and the other via tglobe.
TRIUMF has developed a topology map to help visualize the routes seen from a particular site to other sites. This is based on the Anemone project at NLANR/CAIDA. It shows each node on each route as an ellipse, colored blue if not reachable, and red otherwise. The ellipses are linked with lines to indicate the routes. Historic data is also available so one can see how the routes have changed over the last year. An example of the output from this tool is seen below:
Work is underway at SLAC to combine these network maps with their traceroute information and the data gathered by the PingER monitoring effort. The following image shows the routes from SLAC to 11 ESnet sites, with each hop color coded according to the round trip time recorded by traceroute. The uncolored site is the traceroute host.
1cottrell@vesta02:~>prtraceroute www.cern.ch
traceroute to www.cern.ch (188.8.131.52) with AS and policy additions
1 AS3671 RTR-CGB4.SLAC.Stanford.EDU 184.108.40.206 [I]
2 AS3671 RTR-DMZ.SLAC.Stanford.EDU 220.127.116.11 [I] SS
3 AS32 ESNET-A-GATEWAY.SLAC.Stanford. 18.104.22.168 [ERROR]
4 AS293 cebaf-atms.es.net 22.214.171.124 [?]
5 AS293 dccon-cebaf-mae-e.es.net 126.96.36.199 [I]
6 AS291 cern-dcconn.es.net 188.8.131.52 [?]
7 AS513 cernh8-s0.cern.ch 184.108.40.206 [ERROR]
8 AS513 cgate1.cern.ch 220.127.116.11 [I]
9 AS513 r513-c-rci47-17-gb0.cern.ch 18.104.22.168 [I]
10 AS513 www.cern.ch 22.214.171.124 [I]
AS Path followed: AS3671 AS32 AS293 AS291 AS513
AS3671 = SLAC
AS32 = STANFORD
AS293 = ESnet
AS291 = Energy Science Network (Eastern US sites)
AS513 = CERN
NIKHEF has also developed a similar extended traceroute.
In the same period, the main activities of the Working Group have been: