Authors: Les Cottrell. Created: February 24, 2000
Deployment: 30 monitoring sites, extending to Asia, Latin America, S. Pacific, & the PPDG. 600 remote sites, > 2100 pairs. At this size a node manager is required.
Timeping now tries a DNS lookup first. Poisson scheduling of the inter-burst interval is once again on the list of projects. We have also found a requirement for "directives" to allow customization of the number, size etc. of pings. In addition we will provide different levels of logging detail (more or less) and a choice of application. We have made the data available in ULM format, and are looking at XML.
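As a rough illustration of what Poisson scheduling of bursts could look like, the sketch below draws the gaps between bursts from an exponential distribution, which gives a Poisson arrival process. The function name `poisson_schedule` and the 30-minute mean are assumptions made for the example, not the Timeping design.

```python
import random


def poisson_schedule(mean_interval_s, count, seed=None):
    """Return `count` burst start times (seconds from now).

    Gaps between bursts are exponentially distributed with the given
    mean, so the bursts form a Poisson process rather than a fixed grid.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(count):
        # expovariate takes the rate, i.e. 1 / mean gap
        t += rng.expovariate(1.0 / mean_interval_s)
        times.append(t)
    return times


if __name__ == "__main__":
    # Hypothetical setting: one burst every 30 minutes on average.
    for start in poisson_schedule(1800, 5, seed=42):
        print(f"burst at +{start:8.1f} s")
```

Randomized gaps avoid locking the probes to periodic events on the network, at the cost of uneven spacing between samples.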
The following new tools have been put into use: synack (measures session opening time), though there is a concern about it appearing as a security probe. Sting (from U. Washington) has been used to get one-way losses; the author is looking at adding delay or round trip measurement. The NIKHEF ping allows sub-millisecond pings and setting of the PHB bits. The NIKHEF traceroute allows recording of ASs. This is available at ftp://ftp.nikhef.nl/pub/network/traceroute.tar.Z The data is being moved to a database (currently Oracle 7). Warren wants to do more on relating the ping measurements to application performance.
Better visualization is needed to provide better comprehension of the data for users. A student is looking at using JAS to display the table results in graphical format. There is a heavier HEP based C++ tool called ROOT. Also a student is looking at using cichlid from NLANR to show 3D bar charts of performance of monitor to monitored sites with animation to show time development.
A student is looking at ICMP rate limiting to derive signatures to identify its occurrence. Jeff Sedayao is seeing rate limiting but typically it is transient. We are also looking at doing measurements related to VoIP and QoS. Another activity is looking for anomalies such as duplicate or out-of-order packets (possibly sometimes, but not always, due to load balancing). There is also interest in real time monitoring and reporting, but since we do not run a NOC this is so far of limited interest.
Paperwork in progress: a publication to go in the IEEE, and presentations at PAM and the IETF. There were 3 presentations at the recent Computing in High Energy Physics 2000.
Questions/suggestions: subtract out fiber miles from RTTs; bursts are good for measuring jitter; try and show utilization versus performance; need to look more carefully at AMP vs PingER data for medians, drift etc.
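The suggestion to subtract fiber miles from RTTs can be sketched as below. The velocity factor of 0.67 (light in glass travelling at roughly two thirds of its vacuum speed) is an assumption, as are the function names; the route mileage has to come from the carrier or be estimated, and is not something PingER measures.

```python
C_VACUUM_KM_S = 299_792.458    # speed of light in vacuum
FIBER_VELOCITY_FACTOR = 0.67   # assumption: typical for optical fiber
KM_PER_MILE = 1.609344


def fiber_rtt_ms(route_miles):
    """Round-trip propagation time (ms) over `route_miles` of fiber."""
    speed_km_s = C_VACUUM_KM_S * FIBER_VELOCITY_FACTOR
    one_way_s = route_miles * KM_PER_MILE / speed_km_s
    return 2 * one_way_s * 1000.0


def excess_rtt_ms(measured_rtt_ms, route_miles):
    """RTT left over once pure propagation delay is removed."""
    return measured_rtt_ms - fiber_rtt_ms(route_miles)


if __name__ == "__main__":
    # Hypothetical pair: 2,500 route miles, 70 ms measured RTT.
    print(f"propagation floor: {fiber_rtt_ms(2500):.1f} ms")
    print(f"excess (queueing, routing, hosts): {excess_rtt_ms(70.0, 2500):.1f} ms")
```

The remainder is the part of the RTT that queueing, router hops and end hosts contribute, which is the part one can hope to improve.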
To a large extent IP measurement is being strapped onto networks which were not adequately instrumented to begin with. Vendor & standards-based support for IP measurements is arriving late while deployment & service development continues at an astonishing rate. IP suffers from both under-measurement & lack of robustness in its protocols for performance guarantees (e.g. routing, transport).
Goals are to understand live ISP network IP data, to provide network design information, network operation support (fault, performance, usage & SLA verification), provisioning and billing, capacity & facilities planning near & long term, trending deployment, marketing usage profiling, customer behavior modeling and sales tool.
The challenge is macroscopic QoS. It is hard due to the large, complex, distributed, heterogeneous, constantly changing environment.
They keep all the data in a data warehouse, so they can test theories on it. They have 10 TBytes of data. They have session data, IP flow (netflow, 120 GB/day), BGP routing tables, MIB data, active probes, latency, throughput & jitter, application logs (web hosting, email, ecommerce ...), configuration data (topology, service, customer, policies, tunnels, multi-cast). The reporting is via the web, with access security.
The components include packetscopes (packet traces at large modem banks), call record journals for modem sessions; server logs. They have measurement servers with GPS/UTC time synchronized hardware for 1-way active measurements. They have a suite of distributed measurement applications.
New possibilities include distillation of performance data, e.g. statistical sketches where only send deviations to harvester. Requires dynamic filtering, aggregation etc.
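One way to picture "only send deviations to the harvester" is a filter at the probe that keeps a running baseline and forwards a sample only when it strays from it. This is a minimal sketch under assumptions of our own (an exponentially weighted baseline, a fixed tolerance, the names `DeviationFilter` and `distill`); the notes do not say how the actual distillation is done.

```python
class DeviationFilter:
    """Forward a sample only when it strays from a running baseline."""

    def __init__(self, tolerance, alpha=0.5):
        self.tolerance = tolerance  # allowed distance from the baseline
        self.alpha = alpha          # weight given to the newest in-range sample
        self.baseline = None

    def offer(self, value):
        """Return True if `value` should be sent to the harvester."""
        if self.baseline is None:
            self.baseline = value   # first sample only seeds the baseline
            return False
        if abs(value - self.baseline) > self.tolerance:
            return True             # deviation: report it, keep baseline as is
        # In-range sample: fold it into the baseline and stay silent.
        self.baseline = self.alpha * value + (1 - self.alpha) * self.baseline
        return False


def distill(samples, tolerance, alpha=0.5):
    """Return (index, value) for every sample that would be reported."""
    flt = DeviationFilter(tolerance, alpha)
    return [(i, v) for i, v in enumerate(samples) if flt.offer(v)]


if __name__ == "__main__":
    latencies_ms = [100, 102, 98, 150, 101]
    print(distill(latencies_ms, tolerance=10))
```

Here only the 150 ms sample is forwarded; the other four stay at the probe.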
One application is WIPM, an active probe for backbone-wide measurement of loss, latency, throughput & jitter. The underlying mechanism for RTT and loss is still ICMP but they can use other probes. They have some web based visualization tools to show loss etc. matrices (with min, mean & 95% and coloring based on individual pair thresholds). They were using moving averages, but these can make problems look longer than they should, which is bad for PR. Detailed graphs are available by clicking on boxes. The detailed plots use Jchart, which allows zoom & pan. Another plot showed day of year versus time of day with colored points to show performance, so one can identify effects of time of day. MINT has been folded into NIMI for multicast performance inference.
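The remark that moving averages make problems look longer than they were can be seen with a small made-up series. The sketch below assumes a trailing window and a fixed alarm threshold; the numbers are illustrative and are not WIPM data.

```python
def moving_average(values, window):
    """Trailing moving average; early points use the samples available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def intervals_over(values, threshold):
    """Count the intervals whose value exceeds the threshold."""
    return sum(1 for v in values if v > threshold)


if __name__ == "__main__":
    # Made-up loss (%) per interval: one bad interval, the rest clean.
    loss = [0, 0, 0, 20, 0, 0, 0, 0, 0, 0]
    smoothed = moving_average(loss, window=4)
    print("raw intervals over 2%:     ", intervals_over(loss, 2))
    print("smoothed intervals over 2%:", intervals_over(smoothed, 2))
```

A single bad interval shows as four intervals above threshold once a four-sample window is applied, which is the effect described above.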
They have also been doing optimization of OSPF weights. He is working on Peermon to look at routing anomalies by snooping on BGP advertisements (e.g. routing withdrawals, duplicates etc.).
They came up with a data model for IP networks with objects (e.g. links, routers), classes, hierarchies and methods (e.g. linkages, statistics, histogram tables etc.).
The display shows the map with routers as points and lines as the links. The lines and/or points can be colored and the thickness can be set from various measured metrics (e.g. utilization & delay). Things can be labelled and mouseover can be used to get more text information on an object. They can also show time sequences. It can also be 3D.
Also see: http://www.nanog.org/mtg-0002/greenberg.html
Use a TKL topology mapping tool to provide an overview. There is a SIGCOMM 2000 paper to be published. Uses topology (connectivity and capacity of routers & links), configuration; demands i.e. expected or measured load between pairs; routing with tunable rules for selecting a path for each traffic flow; performance objective of balancing load, low latency, supporting SLAs. The question they are addressing is given the topology & demands how do you decide what routes to use. They need to derive the topology from network configuration information, compute the demands from edge measurements, modeling the path selection can be achieved by IP routing protocols, build a visualization environment and reports.
Had to get operations people to build the lat/long etc. somewhere into the MIB.
Configuration information provides backbone topology, link capacities, router locations, layer 2 & 3 links (e.g. ATM PVCs), inter & intra domain routing (e.g. OSPF weights), customer location and IP addresses; external IP addresses. They extract topology from router configuration files to construct a single view of the network topology, determining the OSPF link weights. They also extract addresses from forwarding tables to determine the IP addresses at each access interface, and determine the external IP addresses at each peering interface.
For netdb see http://www.research.att.com/~anja/feldmann/papers.netdb.ps
They have a variety of measurement sources: e.g. they use SNMP for utilization, loss, fault recording (runts, giants ...).
CAIDA has a large collection of Web tools. It also has an education & outreach program to try and improve the education in Internet engineering.
RTFM started in the IETF in 1992 when it was called Internet accounting. But the name was politically incorrect. Nevil got involved since in New Zealand there was no subsidy for the Internet and everything had to be charged for. Nevil put together NeTraMet, an open source implementation of a generalized, distributed, asynchronous, reliable flow measurement architecture (RFC 2722 defines the architecture, RFC 2720 the meter MIB). It runs on several Unix flavors, OCxMON, DOS and Windows, with FDDI, Enet, OC3 and Netflow interfaces, and supports IPv4, IPv6, IPX, AppleTalk, DECnet & CLNS.
The architecture includes meters, meter readers (SNMP management agents), managers (which tell readers what to read & when, and download configurations/rule sets into the meters), and sample analysis tools. Flows are defined by a set of attribute values; the user specifies which flows to measure and the level of detail via a ruleset. Flows are bi-directional, with the user specifying the direction. The meter does the front end data reduction to produce the table of flows.
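To make the idea of a bi-directional flow table concrete, here is a toy sketch, not NeTraMet code and not its ruleset language. It assumes a flow is keyed on just two attributes (the two addresses) and that the user supplies a hypothetical predicate `is_source_side` saying which end counts as the flow's source; packets are then counted in the "to" or "from" direction of one shared entry.

```python
from collections import defaultdict


def build_flow_table(packets, is_source_side):
    """Reduce (src, dst, size_bytes) packets to a table of two-way flows.

    `is_source_side(addr)` is the user's choice of direction: addresses
    for which it returns True are treated as the source end of a flow.
    """
    table = defaultdict(lambda: {"to_pkts": 0, "to_bytes": 0,
                                 "from_pkts": 0, "from_bytes": 0})
    for src, dst, size in packets:
        if is_source_side(src):
            key, direction = (src, dst), "to"      # source -> destination
        else:
            key, direction = (dst, src), "from"    # destination -> source
        table[key][direction + "_pkts"] += 1
        table[key][direction + "_bytes"] += size
    return dict(table)


if __name__ == "__main__":
    # Made-up packets between a local host and a remote server.
    pkts = [("10.0.0.1", "192.0.2.5", 100),
            ("192.0.2.5", "10.0.0.1", 1500),
            ("192.0.2.5", "10.0.0.1", 1500)]
    flows = build_flow_table(pkts, lambda a: a.startswith("10."))
    for key, counters in flows.items():
        print(key, counters)
```

Both directions of the conversation land in one table row, which is the data reduction the meter performs before a reader collects the table.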
They are defining new attributes (RFC 2724) to include multi valued attributes. It defines a general way of retrieving distributions as arrays of counters for such attributes. The current versions include to/from packet size, to/from bit/pkt rate for specified short intervals (e.g. 10 second), packet turnaround (e.g. ping, DNS syn/ack), time to get small objects, TCP stream duration, DSCodePoint (for diffserv), to/from ASN.
He showed turnaround times (ms) for DNS requests (passively measured) with mean, 25% and 75%. He also showed the SNMP response times as a function of the remote site.
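A summary of the kind shown (mean plus the 25% and 75% points) can be computed as below. The function name `turnaround_summary` and the choice of the "inclusive" interpolation method are assumptions for the example; the notes do not say how the quartiles were derived.

```python
import statistics


def turnaround_summary(turnaround_ms):
    """Mean and quartiles of a list of turnaround times (ms)."""
    # quantiles(n=4) returns the three cut points: 25%, 50%, 75%
    q25, _median, q75 = statistics.quantiles(
        turnaround_ms, n=4, method="inclusive")
    return {
        "mean": statistics.mean(turnaround_ms),
        "q25": q25,
        "q75": q75,
    }


if __name__ == "__main__":
    # Made-up DNS turnaround samples with one slow outlier.
    samples = [12, 14, 15, 15, 16, 18, 21, 25, 240]
    print(turnaround_summary(samples))
```

Reporting quartiles alongside the mean shows how far a few slow responses pull the mean away from the typical case.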
CAIDA has a metric working group (email@example.com) with co-chairs (Sue Moon - SprintLabs & Brett Watson - MFN/AboveNet) from the networking industry. Nevil is the CAIDA point of contact. This was driven by the need to tell the network vendors what to measure in their devices. For education they want to produce a FAQ on "What does 'measuring the Internet' really mean?", and a survey paper on "Metrics and Internet Measurement". They also want to define new metrics in the light of new/emerging services, and to organize experiments and publish definitions via the IETF. They want to recommend a "Service Measurement Toolkit" for CAIDA to implement, and want to publish a "Measurement Requirements for hardware/software vendors" document.
Questions include what types of network & transport layer metrics are being used by ISPs in engineering and operating networks? What new services are being offered (e.g. DiffServ)? Will these new differentiated transport and applications layer services need new metrics? How can the service metrics be measured in a multi-ISP environment? How can customers verify these measurements?
OC48 being tested in NZ, have OC3 & OC12.
Napster MP3 streaming of music is the fastest growing application, and network managers are starting to block it. Can see quake games starting (UDP spike), can watch security intrusion (e.g. port scan triggers full packet capture). They show data by application, by AS, and by protocol by time of day.
See also https://anala.caida.org/CoralReef/Demos/cerfnet/in-link0:137/app_bytes.html
CAIDA is interested in the infrastructure rather than the end-to-end performance. For example they look at the optimality of the root server locations in terms of hops needed to access them. Visualizations need to be simple but are hard to do with the very complex data available from multiple sources. Requires 3D and time playout etc. http://www.caida.org/Tools/Skitter/Summary/iad.skitter.caida.org/20000221/ gives a plot of longitude vs RTT among a lot of other information.
They have a service called NetGeo: if you send in an IP name they will give you the lat/long. They hope to have a public tool available in a few months. They also have gtrace which is available via the CAIDA tools page.
They were monitoring about 40,000 hosts and that number is dropping at about 1% per month due to re-addressing, ICMP blocking, firewalls etc.
Mit has been looking at validating the PingER measurements in particular ICMP rate limiting. Have 2 tools (synack - locally developed & sting from University of Washington) which are not ICMP based and can use them to see whether the problem is network related or ICMP specific. Found high losses/unreachability to Pakistan are mainly due to ICMP limiting, since saw low loss with synack or sting. Also looked at the loss as a function of packet number and measured the slope. Then look for slopes well outside the normal distribution and look at those sites as potential candidates for limiting.
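The slope test described above can be sketched as follows: fit a least-squares line to loss versus packet number within a burst for each site, then flag sites whose slope lies far from the bulk. The z-score cut of 2, the function names, and the use of a plain mean and standard deviation over sites are assumptions for illustration; the actual analysis may define "well outside the normal distribution" differently.

```python
from statistics import mean, pstdev


def loss_slope(loss_by_packet):
    """Least-squares slope of loss fraction against packet number (1-based)."""
    n = len(loss_by_packet)
    xs = list(range(1, n + 1))
    x_bar = mean(xs)
    y_bar = mean(loss_by_packet)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, loss_by_packet))
    return sxy / sxx


def limiting_candidates(site_losses, z_cut=2.0):
    """Sites whose slope is more than `z_cut` std deviations from the mean."""
    slopes = {site: loss_slope(v) for site, v in site_losses.items()}
    values = list(slopes.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []            # all sites behave alike: nothing stands out
    return sorted(s for s, m in slopes.items() if abs(m - mu) / sigma > z_cut)


if __name__ == "__main__":
    # Made-up data: nine flat sites and one where later packets are lost more.
    data = {f"site{i}": [0.01] * 10 for i in range(9)}
    data["suspect"] = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    print(limiting_candidates(data))
```

A rate limiter tends to pass the first packets of a burst and drop later ones, so a steep positive slope is the signature being looked for. With only a handful of sites a single outlier inflates the standard deviation, so a robust spread estimate may be preferable in practice.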
Showed that CAR should give a clear pattern in drops, but did not find evidence of this since it is not by site but by total traffic on link. Will try pinging with variable size packets and detect whether loss varies with size (since more likely to run out of queue space for bigger packets).
Is there interest in extending to application monitoring? Web serving is today's most important application. It would be good to break down the performance into its components (DNS lookup, session opening, server response ...). Is there interest in putting together a white paper on things to look out for in SLAs? Jeff says he now has more time to look at things. One thing of interest is to do some mining of the traceroute data. Another item talked about on the email list has been looking for anomalies/exceptions/alerts. We have also talked about doing passive monitoring. Validation of the metrics is important.
Previous discussions have come up with 3 areas:
Sharad pointed out that installing new programs and configurations takes a lot of time and effort, and so is not something for the short term. On the other hand we have a lot of existing data, and analyzing/interpreting it does not mean everybody has to be involved, so it can start immediately. One could use this analysis to help in producing a realistic SLA. CAIDA discussed defining SLAs at an earlier stage but decided not to go down that path.
We discussed extending the measurements to applications. Two ways of doing this were discussed, the timeit tools from Intel, and an enhanced PingER. It was agreed that this would be a useful exercise but would take considerable effort to set up/deploy and so would take longer to get results from. This might be a good activity if the XIWT proposal to DARPA is accepted.
It was proposed to analyze the existing data, with the eventual goal of coming up with recommendations for how to come up with SLA definitions. To get to this we need to dig deeper into the current data, validate the measurements and metrics, decide what extra metrics are needed, what are the sampling intervals (may want to feed back the ideas into the measurements to try out things), how to combine metrics, derive baselines, expectations (help on deciding what percentiles are relevant, maybe a section on statistics and how to use them), and thresholds/tolerances with very carefully specified ways of making the measurements and analyses. The output will be a white paper on Service Level Agreements (SLAs).
We worked on an outline:
Internet Service Performance: Performance Aspects of SLAs
Given the rough outline, one question that comes up is what analysis is needed to get started. There are some public domain ratings of ISPs from InverseNet.
It was decided that the next meeting should be in about 3 months, that would give time to make a start on the next white paper on SLAs. It will probably be on the East Coast. Thursday May 4th or June 1st at the AT&T facility in Bedminster, New Jersey. A voice conference call was scheduled for 1pm EST March 31st.
Ira is retiring March 31st. Jeremy and Chuck will take over the leadership of the XIWT/IPERF. Ira expects to continue to work for CNRI at about a 20% time level, but at least initially most of this time will not be on IPERF activities.
Filtered 750K traceroutes measured from 7 hosts from June 1 1999 through January 1 2000, down to 350K loops by requiring a valid response (all three probes respond, and the last host was the right host). Did not include Motorola since it was added later. For some sites, like SLAC, stopped at the DMZ router since we can't see inside the DMZ.
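The validity filter can be sketched as below. The record layout (`Hop`, `Trace`) is hypothetical, and the code reads "all three probes respond" as applying to the final hop only, which is one possible interpretation of the rule above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    address: Optional[str]            # None if no router answered
    rtts_ms: List[Optional[float]]    # one entry per probe; None = no reply


@dataclass
class Trace:
    target: str                       # address the traceroute was aimed at
    hops: List[Hop]


def is_valid(trace: Trace) -> bool:
    """True if the last hop is the target and all three probes got replies."""
    if not trace.hops:
        return False
    last = trace.hops[-1]
    all_replied = len(last.rtts_ms) == 3 and all(
        r is not None for r in last.rtts_ms)
    return all_replied and last.address == trace.target


def keep_valid(traces: List[Trace]) -> List[Trace]:
    """Drop traceroutes that did not cleanly reach their target."""
    return [t for t in traces if is_valid(t)]


if __name__ == "__main__":
    good = Trace("192.0.2.9", [Hop("198.51.100.1", [1.0, 1.1, 0.9]),
                               Hop("192.0.2.9", [20.0, 21.0, 19.5])])
    lossy = Trace("192.0.2.9", [Hop("192.0.2.9", [20.0, None, 19.5])])
    short = Trace("192.0.2.9", [Hop("198.51.100.1", [1.0, 1.1, 0.9])])
    print(len(keep_valid([good, lossy, short])))
```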
Median hops is 11, majority between 8 & 18 hops. Some go out to 30 hops. A small number of routers occur frequently, in particular those close to the end points, e.g. 10% of the routers cover about 90% of the routes. Looking at the number of paths seen per pair, e.g. 70% of the pairs measured have less than 80 routes. For most sites there is one dominant path for > 90% of the time. One might look at whether the dominant route percentage changes with time, then one might see if there is a correlation with RTT/loss. Vern Paxson in his thesis did a lot of work on analyzing traceroutes.
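The "one dominant path for > 90% of the time" figure comes down to counting identical routes per pair. A minimal sketch, assuming each traceroute has already been reduced to an ordered list of router addresses (the function name `dominant_route` is ours):

```python
from collections import Counter


def dominant_route(paths):
    """Summarize the routes seen for one monitor/remote pair.

    `paths` is a non-empty list of routes, each an ordered sequence of
    router addresses. Returns (most common route, its share of all
    traceroutes, number of distinct routes seen).
    """
    counts = Counter(tuple(p) for p in paths)
    route, hits = counts.most_common(1)[0]
    return route, hits / len(paths), len(counts)


if __name__ == "__main__":
    # Made-up pair: nine traceroutes via router b, one via router d.
    seen = [["a", "b", "c"]] * 9 + [["a", "d", "c"]]
    route, share, distinct = dominant_route(seen)
    print(f"dominant route {route} used {share:.0%} of the time, "
          f"{distinct} distinct routes")
```

Running this per pair for successive time windows would give the time series of dominant-route share that could then be compared against RTT and loss.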
The project is called PingER-6. SLAC has a small amount of bandwidth carved out for 6BONE - need to understand how carving is done. IPv6 addresses have been assigned, and a Linux machine will be used to provide a firewall. Monitoring 41 sites around the world, mainly in N. America, W. Europe and Australia. 6BONE is production, 6REN is research network. When we started monitoring many of the sites were unreachable; this was due to IPv6 peering between ESnet and BBN being down but not noticed. Weekends look good, but weekdays look bad, typical of a congested network. Typically IPv6 has much poorer performance than the production network. Need more bandwidth (i.e. headroom) to be carved off for 6BONE. Quake has been ported. Besides PingER-6 there is traceroute, Gnu libraries. Drivers to IPv6 could be jumbo frames, better security (IPSec is integrated into IPv6), better accounting (e.g. for settlements), auto-configuration ability, improved QoS potential. The edge router at SLAC is a Cisco 2600 with an experimental version of the IOS. There is an IPv6 forum in March in Telluride, Colorado.
Figure 2 needs to be finalized, Jeremy has been deferring this until he knows the format required by the publisher. We need to finalize this so we can give it to the editor. Andrew and Ira will finish up by Tuesday.
There was a comment that we need to explain why we use median rather than mean. This would probably go in around page 8 or 9. There was also a reviewer comment on the need for an executive overview, but we feel this is served by the abstract. There was also a comment on the definition of poor for availability in Figure 5: it needs to be made more mathematically precise, and Ira and Andrew will work on this. There are some long sentences in the abstract that the technical editor will fix. Some work will be needed to make the figure labels more clearly rendered in PDF. Chuck said he can work with the publishers to fix this.
We need to decide what data to put on the CD-ROM. An issue that came up was the need to anonymize the data. To make the traceroute data useful one needs the names. Unfortunately we do not know how the data will be used. It was decided to proceed with the White Paper and not await the availability of the CD. We will investigate to see if there are ways to make the traceroute data available in a useful form but suitably anonymized. If possible it will be made available on a CD. The White Paper will not refer to the CD-ROM since it is unclear if and when it will be available; it will instead point to a web page which will be updated as we decide how and what to make available.