XIWT/IPWT Meeting, Austin 18-19 Sep, 1997

XIWT/IPWT Meeting

Southwestern Bell, Austin TX, 18-19 Sept 1997

Trip report from Les Cottrell, SLAC

Contents

Welcome - Howard Shimokura, SBC *

Internet Weather Report (IWR)- John Quarterman, MIDS *

Emerging Measurement Tools & Initiatives - Tracie Monk, NLANR *

End-to-end Methods for Measuring & Improving Internet Performance - Jeff Sedayao, Intel *

Technical Subcommittee T1A1 on Performance: Status & Plans for Internet-Related Work - Spilios Makris, Bellcore *

Metrics & Infrastructure for IP Performance - Guy Almes, ANS *

Internet End-to-end Performance Monitoring, Methodology, Tools & Results - Les Cottrell, SLAC *

The Automotive Network eXchange Overseer Performance Test Tool - Dorata Blat, Bellcore *

Testing for Structure- John Leong, Inverse *

Discussion/Planning - Ira Richer *

Discussion of what to do next? *

This was meeting of the Cross-Industry Working Team (XIWT) Internet Performance Working Team (IPWT). It was a very interactive meeting with about 25 attendees with 8 laptops deployed. As far as SLAC is concerned, the most important section to read is the last one (Discussion of what to do next).

One outcome of the meeting was that the SLAC/HEPNRC tools for end-to-end performance monitoring have been selected to be deployed at about 6 XIWT member sites including (if my memory is correct CyberCash, SBC, Intel, Houston Associates) with Intel acting as the Archive site. They want measure performance of members' own networks, get some tests up to validate and understand what should be recommended to other commercial customers and for what purposes. A goal is to build a community within XIWT so can evolve it to address harder issues.

Welcome - Howard Shimokura, SBC

We are moving from a circuit switched world to a packet switched one. The Telecommunications /Information industry contributes ~ 16% of the US GNP. In the rest of the world it is typically < 6%. Thus it is a big contributor to the US economy and also the US has leadership so can export to the global economy. Performance monitoring is critical to provide users with a predictable and stable environment that they are looking for.

Internet Weather Report (IWR)- John Quarterman, MIDS

John gave an interactive demo of the many reports that the IWR has. Many of those demonstrated are only available for paying customers. He has been gathering data since 1993 and has come up with some very nice ways of presenting the data by graphs generated by Perl scripts. One of the graphs showed the Internet response time has improved from 650 to 450ms (averaged) between 1993 and Sept-97. The graph also showed distinct seasonal patterns. He also showed graphs of the 25%, median, 75% and the minimum (all plotted on one graph as shaded areas) for various metrics, which were intelligible and clear. One interesting graph was the number of hosts per domain versus time, with lines for all the domains (many tens possibly over a hundred) which was still intelligible. We talked a lot during breaks about how we might collaborate since we have critical needs on how to display long term data from multiple sources for many metrics. He is very interested in historical data and I will get him in contact with the folks at RAL who gather the traceping data, in case they have some long term data.

Emerging Measurement Tools & Initiatives - Tracie Monk, NLANR

For a copy of this talk see http://www.nlanr.net/Caida/xiwt_970918.html/index.htm.

Why measure:

There are different reasons for ISPs vs. "Users"

Activities

Various national measurement initiatives

NLANR/CAIDA

Common Solutions Group (CSG)

NIMI (LBL/PSC) suite to be available soon

IPMA/Salamander (Merit)

Measurement "services", e.g. HEP & Inverse

User measurement tools (see http://www.caida.org/Tools/taxonomy.html) with both public & commercial tools, supported by grant from Cisco.

Requirements

Proactive id,

Required Tools

Performance, scalable, standardized measurement tools, analysis & visualization tools such as maproute (developing tools to show geographically with pop ups to show delays with color coding, requires a reverse traceroute server to be enabled at host (with gateway (-g) flag enabled),
IETF / IPPM metrics, NIMI, pathchar, tcpanaly

NIMI (National Internet Measurement Infrastructure)

~ NPD on steroids
LBL/PSC collaboration (Paxson, Mathis, Mahdavi)
Software suite, not hardware, platform independent, typically run on a PC running Unix (BSD).
Deploy on measurement hosts around network for Internet for black box infrastructure measurements
Will incorporate various tools with Poisson packet generator such as treno, RTT, one way delay, hooks for GPS, tcpanaly (uses tcpdump format files & analyzes TCP's sending / receiving behavior), pathchar
Prototype tar file for deployment on NLANR boxes (100 - 200 hosts) in November '97.

Challenges

Scaling, data retrieval/analysis
Envision a network of 1 Million NIMIds
Multiple test topologies overlaid on same infrastructure
No central management
Improved timing of test packets

As a group we need, for the health of the net, to encourage enabling some things such as reverse traceroute servers, identify ping hosts, provide trustable latitude and longitude of routers in the DNS records for visualization. This is needed from the ISPs. However, some of the ISPs are concerned with identifying the exact location of a site due to terrorist threats. For most visualization accuracy to a city or phone exchange or Zip code is good enough. Maybe as users we could ask for ISP to provide such open-ness in our contracts.

Flow tools:

Integrated into routers

Stand alone boxes

Differs from performance measurements, inserted into topology, insights on nature of actual traffic

Cisco has cflowd which is particularly valuable for developing AS matrices

Coral provides constant flow real-time monitoring via fibre bleed, inexpensive boxes ~ $6K, can use with OC3mon. Does packet traces, AS matrices, identify purpose of traffic (e.g. 1% traffic at an MCI major node was real-audio). Specs on how to build are public on the Web, see www.nlanr.net/Coral

Management tools

Macro level insights into Internet topology

Tools to enhance visualization, e.g. mapnet to show topology of networks, using Java applets to allow zoom in, to allow pop-ups on links, shows peering relationships between networks, allows authorized users to update information such as peering, only publicly released last week (see http://www.caida.org/Tools/Mapnet/Backbones/ ). I had the opportunity during the break to "drive" Mapnet. It shows links between POPs for user selected ISPs (e.g. ESnet, Sprint etc.), with zoom in & out on the map, display of links at a given POP that the mouse was placed over., and coloring of the links to represent traffic, or ISP again at user selection.

Developing visualization of Internet traffic flows including MBONE, caching hierarchy, AS's, pathchar (see http://www.nlanr.net/Viz )

Plans

NLANR plans to deploy NIMIDs at 100 sites
CAIDA plans continued tool development - performance, flow & management

Note that NLANR/CAIDA tools are public domain

End-to-end Methods for Measuring & Improving Internet Performance - Jeff Sedayao, Intel

Uses:

Open tickets with ISP including traceroute information especially with multiple egress points to identify problems
Provide marketing assessment of performance
Use to select new & improved current ISPs
Use in contractual SLAs, ISPs may claim they do not control the whole path, so Intel says they are buying a service & it is up to the ISP to improve peering, or to compare different ISP support.
Experimental designs to compare configurations of proxy, NS, routers etc.
Justify expenditures to improve internal and external customer WWW performance.
Give the real user something to look at to get an idea of what is going on. Saves time for developers being asked questions which user can answer for themselves.
Provide tools for NOCs (Network Operation Centers) to watch, be alerted on, and take action on.

Imeter

Ping Internet landmarks
Source is available via ftp:
Huge dependency of delay and time of day indicates infrastructure overload
Moved from means to medians to improve robustness to heavy tailed distributions

Timeit:

Tool to measure HTTP GET performance by getting files from landmarks.
For HTTP get breakdown DNS lookup, connect time, delivery rate, fractional error
Source available at: ftp://ftp.va.pubnix.com/pub/uunet/timeit-2.1.tar.gz

Internet Measurement & Control System (IMCS)

It is a system to trigger alerts and take actions
Uses Imeter & Timeit metrics
Scripts developed for NOC to examine (via ping, traceroute) & escalate to local staff or ISP
Totally web based

Issues:

Firewall blocking ICMP, servers wanting a browser type
Use of redirects on timeit
Refinement of landmark selection (they do not check the robots file)
Scalability/traffic generation:
Maintain sampling frequency below annoyance levels
Measurements add traffic load and are non-value added
Users will measure if there is no alternative

Futures

Analysis of outliers
Aggregation of geographies of interest for Imeter
Analysis of internal proxy data performance
Inclusion in ISP (Internet Service Provider) SLAs (Service Level Agreements)
ISPs cooperate (maybe after some beating on) & were interested in results
ISPs use data to detect, debug & fix problems
Tools & algorithms are available & can be implemented

Technical Subcommittee T1A1 on Performance: Status & Plans for Internet-Related Work - Spilios Makris, Bellcore

Spilios is vice-chair of the T1A1.2 (732-758-5640)

T1A1 develops & recommends ANSI standards & technical reports on speech, audio, data, image & video & multimedia integration within US telecommunications networks. Also fosters consistency and positions in other North American & International standards bodies.

The motivation is the explosive growth of public internet services which is raising fundamental questions on standardization efforts. Performance measures for end-to-end Internet measurements have not been standardized. Delay, information loss for "best effort" Internet services are of increasing interest, as are also new services such as multimedia, streaming, inter-working with the PSTN.

T1A1 has a key role in harmonizing ANSI standards with the IETF and ITU. There are several working groups.

T1A1.2 on Network survivability, performance including interactions between Internet & PSTN.
T1A1.3 on performance of Digital Networks & Services for example defining performance parameters for characterization of end-to-end public data communication services using the Internet, preparation of US inputs to ITU
T1A1.5 on multimedia communications coding & performance with initial emphasis on vide conferencing & video telephony.
T1A1.7 on signal processing & network performance for voice band service processing. Will investigate interaction between Internet & PSTN including performance issues for Internet.

See www.t1.org/t1a1/t1a1.htm for updated information on working group activities. They have only just started and have not done anything substantive yet. The participants are mainly coming from the PSTN side. They are looking for improved liaisons between XIWT/IPWT and the T1A1 working groups.

Metrics & Infrastructure for IP Performance - Guy Almes, ANS

Background:

Internet topology increasingly complex
Load grows faster than capacity
Relationships among networks is increasingly complex (people are guarded about releasing information that results in bad press)
Result is that users don't understand the Internet's performance & reliability

Objective is to enable users & service providers to have common understanding, tools etc.

IETF/IPPM have produced: a framework document, a one-way delay metric, a packet loss metric, bulk transfer capacity, and are working on an availability metric (Paxson & Mahdavi started Dec-96)

Measurement Strategies:

Active vs. passive measurements
Hard vs. soft degree of cooperation (e.g. ping & HTTP GET are soft degrees of cooperation)
Single metric with multiple methodologies

Framework revisions:

clock issues - synchronization, accuracy & resolution;
singletons (e.g. a single ping),
samples (a sequence of pings between T0 & Tf with a Poisson distribution (lambda)) &
statistics (minimum, percentile (the response time for which x% of the singletons in a sample have a better response time), median (avoids undefined, i.e. unreachable problems), inverse percentile); generic "P" type packets.

Surveyor Infrastructure

Collaborating organizations include ANS & the 23 Common Solutions Group (CSG) universities
Active tests with ongoing tests of delay & packet loss & occasional tests of flow capacity. Key ideas: a database / web server to receive results, use GPS to synchronize clocks (accurate to a few microseconds and synchronized), need for measurement machines both at campuses and at/near exchange points.

Database/Web server

Measurement machines upload their results to the database server (use SSH for security)
These results are stored so that queries can be made at the same time

Operational:

Two sites operational since Jun-97,
8 sites now operational,
remaining 15 sites being deployed this fall,
currently they are measuring delay/loss with lambda = 4/sec (expect to reduce this (i.e. fewer packets/sec), they have 7Gbytes/month), with minimum packet size between all 56 paths ( (n-1)**2 ) paths for n=8).
The GPS card they use is an AT bus card. It is still expensive. There are PCI based cards now available. The technology is getting cheaper at GPS is deployed to cars.

There is limited analysis / reports available at the moment. He showed:

One-way ping graphs of response & loss used minimum, 50% (median), 90%, and GMT showed for samples of ~ 240 pings.
Can see routes with different loading patterns in the two directions, which result in factors of 5-7 differences in the one way response times, even for symmetric routes.
This tells you that ping round trip times may have very different components for one direction compared to the other.
This type of measurement is complementary to flow-analysis techniques.

Policy implications:

Better understanding of cost vs. performance tradeoff;
Cooperation among users / providers;
Value of cooperating in sharing results even of imperfect results;
Relationships with quality of service.

Internet End-to-end Performance Monitoring, Methodology, Tools & Results - Les Cottrell, SLAC

See: http://www.slac.stanford.edu/grp/scs/net/talk/xiwt-sep97/

The Automotive Network eXchange Overseer Performance Test Tool - Dorata Blat, Bellcore

TCP/IP based VPN interconnecting automotive industry trading partners. They have 15K trading partners in N. America, and 50K worldwide. It is characterized by controlled service quality for mission critical business to business trading partner communications. The idea is to replace a special purpose network consisting of leased lines with an equally good (or better) performing network at lower cost. It comprises multiple ANX certified ISPs (CSP) & ANX certified exchange point operators (ANX CEPO, these are the folks who interconnect the CSPs).

ANX Service Quality is a metrics-based approach. There are 8 categories: net services, interoperability, performance, reliability, business continuity & disaster recovery, security, customer care, trouble handling. Bellcore designed the metrics, criteria & measurement techniques. Bellcore has the role of the ANX overseer and defines certification criteria for service quality, assessment of service providers, certification verification of ISPs.

The metrics are driven by trading partner quality needs. They have a test methodology based on a black box (they require their CSP to install these black boxes) approach which measure metrics visible to the end users (they are not interested in internet trends). These include file transfers, with test data of incompressible, unpredictable random data, and qualitative criteria representing bounds on acceptable performance. They measure throughput, packet loss & delay and the metrics will be enhanced based on ANX release 1 pilot experiences (Sep-Dec 97). The measurements are not done continuously but rather made occasionally (maybe a few times a year, or if complaints are made) on a sample basis.

The throughput target criterion is to measure throughput equal to or exceeding half of the access link bandwidth adjusted for capacity consumed by link-layer. Size of the test files 30+ Mbytes (their users mainly used large file transfers of CAD data rather than a lot of WWW browsing), size of IP packets 512 byte payload (avoids fragmentation). The throughput is averaged over several tests with a sliding window over the most recent tests.

The packet loss rate criterion (PLR = (# packets sent - # necessary packets)/#packets sent) <= 0.03% initially. The loss depends on link utilization. If the end node link is getting < 30% loading then the PLR threshold required is 0.03%, for 50% loading they threshold is 0.05%, if the loading is > 70% they throw away the measurement and try again later. They are likely to be made more stringent when running smoothly. The number of packets aggregated must be statistically meaningful (100/PLR). They put test points at the end of the relevant links. Modern, stable TCP performance is based on some packet loss to help identify how big the windows can be made before congestion occurs. This is especially the case on long high bandwidth links where many packets can be simultaneously in transit on the link. The question arose as to what percent packet loss is normal (i.e. caused by having a larger window size than the intervening queuing elements (routers) can buffer) for normal good TCP performance.

The file transfer delay criterion takes into account file size, access link rates, number of hops, propagation delays (physical distances). The measurement technique uses 1 Mbyte files, of 512 byte payloads, with a calculation aggregated over several tests with sliding window over the most recent trials. They demand that 90% of the file transfers meet the delay requirements.

Testing for Structure- John Young, Inverse

Describing ISP benchmark report to understand end user experience. They have done controlled tests on accessibility to 25 ISPs with 40 POPs (Point of Presence) per ISP to look at Internet performance in DNS, Web latency, throughput, loss. They enter the Internet from over 1200 locations. Need a standard test page, standard server, a basket of 10 popular URLs which all ISPs tests. Other issues that affect the user experience include the server, network, browser, modem. Then there is network caching &/or compression.

Their test setup can dialup any of the POPs and make the tests from there. Thus the test link can be a very poor compared to the backbones, it works OK for packet loss. Throughput is more challenging since limited by the dialup link. Very excited by pathchar since it shows one can measure the throughput of a fat pipe from a much slower link. By comparing various pairs (e.g. Netcom-MCI vs. Netcom-UUnet vs. Netcom-Concentric) they can start to identify a bad ISP. They use UDP transfers. They randomize in time the measurements, but are not doing Poisson distributions. The multiple monitoring points allow better understanding of where the problems lie (e.g. if monitoring to a site works from any monitoring site, the the remote site must be reachable). The destination node is co-located with the monitoring site so the clocks are fairly well correlated.

They find that inter ISP links are worse than links that stay within an ISP. Typically within ISP they see 0-4% for 90% of the links, for inter ISP links (which may go through several ISPs) they see >5% probably 10-20% of the time.

Discussion/Planning - Ira Richer

The purposes are:

To learn about some of the available tools for measuring Internet performance (this was the purpose of the presentations in the first part of the meeting).
To identify which of these tools would be appropriate and feasible to deploy at member sites.
To determine the resources (hardware, software, human) required for an IPWT measurement activity.
To agree on next steps, including specific action items schedule.

Need to identify what is important to be measured?

Identify what will/could ISPs provide to the users (especially if it is in an SLA).

What measurements would one want the user sites to provide, or are useful.

ISPs want standardized data, produced by organizations which know what they are doing, so they an respect, put credibility on the measurements.

There are lots of interfaces, data format, kinds of data being collected. Need to agree on a common data format, or common tools.

The IPWT does not want to develop tools, so they are interested in using existing tools. Most felt the tools demonstrated earlier were an existence proof. Some tools are based on extremely simple tools in particular ping and traceroute. A way of doing it would be that IPWT members become collection/monitoring sites and identify the remote sites they are interested in, the XIWG provides the analysis site and builds the tools.

There may be a sensitive issue with a standards body (such as the XIWT/IPWT) identifying with rating products. This would be especially so if the metric is subject to optimization by the ISP, e.g. GETting Web pages at a few well-known sites, so the ISP identifies those sites and provides good connectivity to them. One possibility might be to define the tests, but not publish them. This also reduces possible competition to companies like Inverse. Could publish but not produce conclusions, or publish the results without labeling them (i.e. making them anonymous). Tracie Monk claims that, even if don't label the reports, some experts can identify from patterns which ISPs may be involved. This may not be a problem if the number of experts is small.

Another issue is long term trends versus real-time monitoring for trouble shooting especially for the site NOC. Both were felt to be of interest.

XIWT/IPWT maybe a good group to work on hard collaboration (in Guy Almes definition of hard vs. soft collaboration) since they consist of a group of members who are interested in performance monitoring.

Another concern is how to do analysis/presentation so people may not be misled. This is where much of the work is. Maybe some commercial outfit can mine the data so they can sell the results to customers or ISPs.

Much of the discussion was on what should be the role of the XIWT/IPWT. Two possible roles came up:

"confront" the ISPs with performance issue with an interpretation asking them to address the issues. Maybe do it via IOPS.
A second way would be to provide tools for ISPs and others to use.

How does one want to promote measurements, should the XIWT be promotional, and who should it promote, e.g. the IETF, the T1A1, ANX efforts. John Quarterman pointed out that one of the things that promoted TCP vs. OSI, was the defense folks bought a lot of Sun workstations, so an analogy would be for the XIWT to buy measurement services (John acknowledged that MIDS is a major provider of such services and so he was not unbiased J ).

Discussion of what to do next?

Want to measure performance of members' own networks, get some tests up to validate and understand what should be recommended to other commercial customers and for what purposes. A goal is to build a community within XIWT so can evolve it to address harder issues. The XIWT represents a very different audience than the SLAC/HEPNRC, Intel, Surveyor communities. In particular the XIWT includes ISPs (e.g. GTE, SBC, AT&T), Internet providers (e.g. CyberCash), commercial users (HP, Intel).

What tools should be deployed?

Possibilities: SLAC/HEPNRC, Intel, Surveyor (Almes), NIMI. There are other tools such as Merit and flow monitoring which provides traffic flow characterization. Flow monitoring requires a higher level of trust and cooperation. It is easy to monitor own flows. There might be value in aggregating data from the flows of multiple users. Flows also have severe privacy issues.

Criteria:

Stability, maturity, what does it measure, underlying tools, is it still evolving, is it supported, what are the resource requirements (people, hardware, system etc.), visualizing the data, upgrade plans, ease of use (install, documentation, consulting).

Criterion	SLAC/HEPNRC	Intel	Surveyor	NIMI
Measures	Packet Loss, RTT, unreachability	Packet loss, RTT, HTTP GET, DNS	1 way delay, packet loss	Delay, packet loss, traceroute, throughput
Underlying tool	Ping	Ping, HTTP GET		Treno, traceroute, ping
Visualization	Plots of loss RTT vs. time, Web tables of most recent measurements with navigation. Pilot of long term tools
Database	Flat files & SAS	Flat files	Oracle
Storage	0.5MB/ping pair
Platform	Unix	NT/Unix	BSDI
Software	SAS (archive site) for NT/Unix
Installation	DIY and is documented, root access not required	Intel does installation, requires root access
People	0.3 for archive site, 0.1 for collection site
Upgrade plans
Level of support available	Limited	?	Limited
Ease of use	Some, man pages, attention has been paid to configuration, instructions for installation		Minimal
Firewall preference	Can be inside or outside, currently it is usually inside the firewall		Can be inside or outside	Prefer outside since uses dynamic UDP ports

Possibility would be to install SLAC/HEPNRC or the Intel tools generally among the XIWT community, with a few more aggressive sites (a sub-team) pursuing the Surveyor. The existence of SLAC/HEPNRC's visualization tools is a very valuable selling tool to management. It may depend on how soon results are needed. Also it may depend on how important it is to have one-way timing. Can XIWT members do both? About 6 XIWT members are interested in deploying tools. There was a move to start out with the SLAC tools, but not exclude Surveyor to later use. Could buy a Surveyor box and run the SLAC/HEPNRC tools on it. Note Guy Almes (father of Surveyor) wants to put Surveyors at a few DoE sites including SLAC/HEPNRC so there should be a good growth evolution.

Next steps/Action items

Investigate port to BSD for Intel boxes
Need an archive site support (NIST?)
Bell South, SBC, Houston Associates Inc. will be Collection sites. NLANR interested but may not have resources
Set up subcommittee including: Jeff Sedayao, Scott Toborg, Rich Ogata, to drive this effort.
Set up a mailing list to share experiences.

Next meeting:

The next plenary meeting of the XIWT is November, not many members of the IPWT attend the plenary.
They want to set up a conference call to report on the status of the port, in three or four weeks (October 21st, late afternoon EST (e.g. 1pm PST for 1 hour). Ira will organize the call.
Mid November for a face to face one day meeting. W. Coast since XIWT has many members on W. Coast. Intel will look at hosting. Tentatively November 13^th. Could invite Paxson to talk about NIMI.

Is there interest in developing a model contract for ISPs? The subcommittee will jump start this.

There was a recommendation that the collection sites provide a reverse traceroute server at their sites.