FY97 Budget Request

U.S. DEPARTMENT OF ENERGY

FIELD WORK PROPOSAL

1. WORK PACKAGE NUMBER	2. REVISION NO. 0			3. DATE PREPARED 5/8/97
4. WORK PACKAGE TITLE Internet End-to-end Performance Monitoring (IEPM)			5. BUDGET AND REPORTING CODE KJ-01
6. WORK PROPOSAL TERM 12 Months			7. IS THIS WORK PACKAGE INCLUDED IN THE INSTITUTIONAL PLAN? No
8. DOE PROGRAM MANAGER Daniel A. Hitchcock, Acting Director Mathematical, Information, And Computational Sciences Division FTS 233-7486		11. HEADQUARTERS ORGANIZATION Energy Research			14. DOE ORGANIZATION CODE
9. OPERATIONS OFFICE WORK PROPOSAL REVIEWER Dr. Burton Richter FTS 462-2601		12. OPERATIONS OFFICE Oakland			15. DOE ORGANIZATION CODE 55
10. CONTRACTOR WORK PROPOSAL MANAGER Dr. Burton Richter FTS 462-2601		13. CONTRACTOR NAME Stanford Linear Accelerator Center			16. CONTRACTOR CODE 55
17. WORK PROPOSAL DESCRIPTION (Approach, anticipated benefit in 200 words or less) The Field Work Proposal covers the development and deployment of end-to-end Internet monitoring tools on the production network in an architecture that mimics the large collaborations it is expected to serve. The architecture for the deployment will also help minimize the network and people resources required for support. The tools will gather, process and electronically publish statistics (via WWW) that provide intelligible information on short and long term end-to-end Internet performance. The project will develop an improved understanding of the critical components that limit end-to-end performance for the provision of network layer services and major applications such as WWW. It will improve low-impact (on the network and servers), low-cost (to deploy), understandable mechanisms for automated monitoring of the end-to-end connections at the network layer (ping) between selected sites, such as those in a large distributed collaboration. It will also look at the issues of measuring the performance of higher layers of the network protocol stack, in order to get closer to the "user layer". The data will be gathered automatically and archived for further analysis. We plan to extend the existing monitoring, now at 3 sites in the U.S., to sites in over 8 countries. This information is critical to enable modern distributed collaborations to function effectively over today's Internet. The information will help provide realistic end-to-end service quality expectations. In addition, the work will provide information for problem diagnosis, realistic planning, and identifying where extra resources may be effectively applied.
18. CONTRACTOR WORK PROPOSAL MANAGER: Burton Richter, Director Stanford Linear Accelerator Center 5/8/97 (Signature) (Date)			19. OPERATIONS OFFICE REVIEW OFFICIAL (Signature) (Date)

20. DETAIL ATTACHMENTS:

Refer to related Field Task Proposals

Attachment 20: Proposed work

We will develop tools to reduce the raw data into forms suitable for analysis. The data will be analyzed statistically, using existing commercial and public domain tools, to identify the critical long term trends and short term exceptions to expected performance. The reports will be accessible via WWW forms that allow users to dynamically customize what information is presented, or more statically (for reports that cannot be generated sufficiently quickly to be interactive) by browsing the project's WWW pages. We will produce reports in tabular, graphical and other forms. We expect to be monitoring several hundred remote sites, each at roughly 30 minute intervals. Reports will be generated showing various metrics, including response time, packet loss and unreachability, at 30 minute intervals as well as averaged over longer periods such as daily and monthly. We will also create alerts identifying unreachable hosts, and major departures from expected daily response and packet loss performance, for network service people and interested collaborators. The alerts will be delivered via email or pager, or simply by highlighting some part of a WWW report with a hyperlink to more information.
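As an illustration only, the reduction of raw ping data into the reported metrics, and the check for major departures from expected performance, might look like the following minimal sketch. The record layout, the function names and the alert thresholds (a doubling of the median response time, a 5 point rise in packet loss) are assumptions made for this example; they are not specified by the proposal.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Sequence


@dataclass
class PingSample:
    """One set of pings sent to a remote host in one monitoring interval."""
    sent: int                  # packets sent in this interval
    rtts_ms: Sequence[float]   # round-trip times of the replies received


@dataclass
class Summary:
    loss_pct: float                 # percentage of packets lost
    median_rtt_ms: Optional[float]  # None when no reply was received
    unreachable_intervals: int      # intervals in which every packet was lost


def summarize(samples: Sequence[PingSample]) -> Summary:
    """Reduce a period's samples (e.g. one day) to the reported metrics."""
    sent = sum(s.sent for s in samples)
    rtts = [r for s in samples for r in s.rtts_ms]
    loss = 100.0 * (sent - len(rtts)) / sent if sent else 0.0
    unreachable = sum(1 for s in samples if s.sent > 0 and not s.rtts_ms)
    return Summary(loss, median(rtts) if rtts else None, unreachable)


def find_alerts(current: Summary, expected: Summary,
                rtt_factor: float = 2.0,
                loss_margin_pct: float = 5.0) -> List[str]:
    """Flag unreachability and major departures from expected performance.

    The thresholds are illustrative assumptions, not project values.
    """
    alerts = []
    if current.unreachable_intervals > 0:
        alerts.append("unreachable")
    if (current.median_rtt_ms is not None
            and expected.median_rtt_ms is not None
            and current.median_rtt_ms > rtt_factor * expected.median_rtt_ms):
        alerts.append("response time")
    if current.loss_pct > expected.loss_pct + loss_margin_pct:
        alerts.append("packet loss")
    return alerts
```

The same summarize step could be applied to a single 30 minute interval, a day or a month of samples; the list returned by find_alerts is what an email, pager or WWW highlight would be driven from.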

The reports will be accessed via WWW and hence accessible worldwide (with appropriate restrictions) using any common platform. In general the reports will be vendor neutral. Care will be taken for selected reports to sanitize them to reduce liability issues. In some cases, access to the reports may be restricted to certain sites such as those involved in the relevant distributed collaboration. Updating of the reports will be automated and based on appropriate time schedules.

To make the tools and information available to a wide range of sites in an orderly and timely fashion, we will investigate setting up a hierarchy of sites in a distributed collaboration. The first level "remote sites" will simply need to provide the name of a suitable host that, by design, is pingable all the time. The second level "collection sites" will install the data monitoring and data collection tools to be provided by the proposed project. The collection sites will monitor the other sites in the distributed collaboration, save the data and make it available to the third level "analysis sites". The analysis sites will collect the data from the collection sites, analyze the data, create the reports and make them available via WWW. This hierarchy matches the hierarchy of sites seen in many distributed collaborations such as those in High Energy Physics experiments. Besides making the monitoring more manageable, it also helps limit the monitoring traffic by helping eliminate the need or desire for every host to monitor every other host. It also ensures that the data is measured with a common methodology and is archived in a common format.
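The traffic saving from the hierarchy can be estimated with simple arithmetic. The sketch below assumes, for illustration, that each collection site probes every remote site and every other collection site once per 30 minute interval; the function names and the example site counts are assumptions, not project figures.

```python
PROBES_PER_DAY = 24 * 2  # one probe set per monitored pair every 30 minutes


def full_mesh_pairs(hosts: int) -> int:
    """Directed monitoring pairs if every host monitors every other host."""
    return hosts * (hosts - 1)


def hierarchical_pairs(collection_sites: int, remote_sites: int) -> int:
    """Directed pairs if only the collection sites do the monitoring.

    Each collection site probes every remote site and every other
    collection site; remote sites only answer pings.
    """
    return collection_sites * (remote_sites + collection_sites - 1)


def probe_sets_per_day(pairs: int) -> int:
    """Probe sets sent per day across all monitored pairs."""
    return pairs * PROBES_PER_DAY
```

With, say, 3 collection sites and 200 remote sites, hierarchical_pairs(3, 200) gives 606 monitored pairs (29,088 probe sets per day), against full_mesh_pairs(203) = 41,006 pairs if all 203 hosts monitored one another.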

The work will include documenting and making the data collection tools available for distribution and installation at remote sites.

We recognize that an important but hard problem is distinguishing which aspects of end-to-end measurement reflect limitations in the application; the local-area networks at the endpoints; the Internet connectivity of the source and destination sites; and the wide-area network path between them. By its nature, end-to-end measurement is not conducive to separating out performance factors beyond the split between the application and the network path.

However, we picture addressing this issue as follows. First, we could use a packet filter to record the "wire times" associated with the measurement traffic. These timings then allow us to distinguish application effects from network path effects. Second, to distinguish between different network elements, we could add "traceroute" and its newly-announced successor "pathchar" (both written by Van Jacobson of LBNL) to the measurement suite.

Finally, if IEPM measurements can interact with NIMI measurements (see below), then NIMI cloud measurements could provide additional localization of network path problems.

We view pursuing these approaches as beyond the limited scope of the initial IEPM effort, but important to consider as we develop the IEPM architecture, to ensure that it can ultimately accommodate better fault localization.
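To illustrate the first approach, a minimal sketch of how wire times could split a measured delay into a network part and a local host/application part is given below. The record layout and names are hypothetical, and the sketch assumes that all four timestamps are taken from the same clock on the measuring host. Note that the "network" part as computed here still includes any processing time at the remote host.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ProbeTiming:
    """Timestamps (seconds) for one request/reply, all from one clock."""
    app_send: float  # application issued the request
    wire_out: float  # packet filter saw the request leave the host
    wire_in: float   # packet filter saw the reply arrive at the host
    app_recv: float  # application received the reply


def split_delay(t: ProbeTiming) -> Tuple[float, float]:
    """Return (network_s, local_s) for one probe.

    network_s: time between the wire timestamps (path plus remote host).
    local_s:   remainder of the application-level delay, spent in the
               local application and protocol stack.
    """
    total = t.app_recv - t.app_send
    network = t.wire_in - t.wire_out
    return network, total - network
```

For example, a probe with timestamps 0.000, 0.002, 0.082 and 0.090 seconds would be attributed roughly 80 ms to the network and 10 ms to the local host.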

Attachment 20: Development Collaboration & Relationship to Other Work

The work will involve a collaboration of three major DOE sites, SLAC, LBNL and HEPNRC/FNAL. We also expect to involve other sites in the collaboration, in particular ORNL, BNL and CERN. In addition we have serious expressions of interest from a further 7 sites in Europe, one in Japan and one in Canada. The development collaboration between HEPNRC and SLAC is already in place and working. It also includes people at ORNL and BNL, though due to staff shortages these latter 2 sites have reduced their involvement recently. In the summer of 1997 we also expect to involve CERN (one of SLAC's staff members will be spending a year at CERN to help deploy the tools), and 7 other European sites have recently agreed to utilize the tools we will develop.

A major HEPNRC/FNAL contribution (the chief investigator at HEPNRC is David Martin) is currently in the area of defining the distributed architecture for monitoring, and preparing the monitoring and collection code for public distribution. The work with LBNL will be to better understand the Internet dynamics and the available tools. This will help to improve the details of the monitoring and to evaluate how to incorporate new tools into the monitoring. SLAC is developing new ways to analyze and report the information and preparing tools to facilitate this.

We envision the work in this proposal as dovetailing with that of the NSF-sponsored "National Internet Measurement Infrastructure" (NIMI) project. (Vern Paxson is one of the principal investigators of NIMI; the others are Matt Mathis, Jamshid Mahdavi, and Gwendolyn Hunt, of the Pittsburgh Supercomputing Center.) The goal of NIMI is to pilot a scalable measurement system for probing Internet clouds and assessing the performance they deliver. A key facet of NIMI is the development of a modular architecture that can accommodate numerous, different measurement techniques.

The two projects mesh very well. NIMI will provide the general mechanisms for scheduling measurements and retrieving results. IEPM will provide NIMI with access to ESnet and HEP sites (the initial deployment of systems will be to LBNL, SLAC and FNAL; this will be followed by systems at ORNL, BNL and/or CERN) as a testbed for developing and refining the measurement architecture (in particular, the mechanisms for scheduling individual measurements and returning the results) in the context of a large, but not unduly large, network. Both projects will contribute measurement and analysis tools to one another, and IEPM will be assured of compatibility with the NIMI infrastructure if and when the latter becomes widely deployed.

The initial deployment of NIMI systems will be at LBNL, SLAC, and FNAL, to be followed by ORNL and BNL. An alternative would be to place one of the systems at CERN. In a future project, after the initial deployment is successfully completed, we expect to deploy NIMI systems at one to two dozen sites in N. America, Europe and Japan.

The key differences between the work we propose here and that of NIMI are:

* The NIMI work focuses on large-scale measurement of Internet "clouds", while here we focus on end-to-end performance (across multiple clouds) for a particular, specialized community, namely HEP.

* Consequently, the main measurements of interest to NIMI are pure network path measurements, while for this work, we emphasize end-to-end all the way up to the application layer (e.g., HTTP timing).

* The NIMI architecture must address significantly larger scaling issues. For this work, the scale is sufficiently limited that approaches such as defining a hierarchy of participating sites should adequately address the scale.

Attachment 20: Our Strengths

SLAC & LBNL have a long history of Wide Area Network support going back to the original creation of HEPnet (the predecessor of ESnet), the creation of the first Internet link to Mainland China, and, most recently, the collocation of ESnet management at LBNL. SLAC & LBNL have also assumed a leadership role in wide area network monitoring (for a list of presentations and papers see http://www/xorg/nmtf/nmtf.html#present and http://www-nrg.ee.lbl.gov/nrg-papers.html). Vern Paxson of LBNL has published ground-breaking work on Internet performance, and Les Cottrell of SLAC is the chairman of the ESnet Network Monitoring Task Force and of the Network Monitoring Focal Group, which has made many presentations and publications in this area.

The LBNL & SLAC people also have many contacts in the Internet monitoring and the ESnet communities. In particular, besides chairing the ESnet task force on network monitoring, in the last year Les Cottrell of SLAC has made presentations on End-to-end Internet Monitoring to the CCIRN, EOWG, and CHEP97, and has provided information to the FNCAC, ICFA and the ESSC. Vern Paxson of LBNL is a key member of the IETF/IPPM and has published Internet drafts as well as many papers on Internet monitoring and dynamics.

Attachment 20: Deliverables and Schedule

1. Identify and hire developer to work on project.

2. Tools for automated monitoring of remote hosts at their network layer (utilizing ICMP/ping) and gathering the data.

3. Tools to analyze the network layer data and produce WWW reports for long term trends and short term alerts. Reports will include response time, packet loss, variability, and reachability.

4. A WWW site for the project to provide and publicize information on the project, including progress, results, documentation and availability of tools.

5. Deploy the tools of item 2 to two pilot sites.

The above deliverables will be ready for demonstration 4 months after the start of the project.

6. Deploy tools of item 2 to further sites, at a rate dependent on experience gained with item 5.

7. Package the tools of item 3 for delivery to analysis sites.

8. Document results of our research and recommendations on characterizing the relationships between user experience and the end-to-end performance of higher layers of the network protocol stack.

The above will be completed 12 months after the start of the project.

In addition we will provide:

* Monthly highlights reports (1 paragraph) over the life of the project.

* A final DOE publishable report at the end of the project.

* The software tools from item 2 will be made publicly available.

* The software tools from item 3 will be made available to collaboration sites that have the appropriate prerequisites (e.g. commercial analysis tool licenses such as SAS and SPlus).

* Consulting type support during the life of the project for users of the software.

Attachment 20: References

Internet End-to-end Performance Monitoring, Les Cottrell, Dave Martin & Connie Logg, presented at the ISMA meeting San Diego, May 1997.

Report on CHEP97/ICFA Mini Workshop on HEP and the Internet, Les Cottrell, presented at the ESCC meeting San Diego, Apr 1997.

Internet End-to-end Monitoring and Performance: Measuring, Analysis & Reports and Uses, Les Cottrell, David Martin & Connie Logg, ESCC Meeting Apr. 1997, San Diego.

What is the Internet Doing for and to You?, R. L. A. Cottrell, C. A. Logg, D. E. Martin, CHEP97, April 1997.

Internet Performance and Reliability Measurements, David E. Martin, R. Les Cottrell, Connie Logg, CHEP97, April 1997.

Current State of the Internet and Network Monitoring, Les Cottrell & W. P. Lidinsky, for the ESnet Progress Report, Feb 1997.

Network Monitoring Efforts, W. Toki & W. P. Lidinsky, Presentation to the ESSC, Jan 7, 1997, Washington DC.

End-to-End Internet Ping Performance, Les Cottrell, Charley Granieri, Mike Wendling & Connie Logg, presentation to CCIRN Meeting, San Jose, Dec 12, 1996.

End-to-End Internet Performance, Les Cottrell, C. Granieri & C. Logg, presentation to EOWG/FEPG Meeting, DOE/Germantown, Nov 14, 1996.

Internet Monitoring & NMTF Futures, Les Cottrell, ESCC Meeting, Oct 29, 1996, Princeton, NJ.

Report from the Network Monitoring Focus Group, D. E. Martin, Energy Sciences Network Site Coordinating Committee Meeting, Oct 29, 1996, Princeton, NJ.

Report on ESCC Network Monitoring Efforts, Les Cottrell, Connie Logg, Bill Lidinsky, Dave Martin & Gary Haney, presentation to the EOWG/FEPG, Sep 30, 1996 [13 pages].

Internet Gridlock and WAN Monitoring, Les Cottrell, invited talk given at the Babar Collaboration Meeting, Dresden, July 1996.

Network Monitoring for the LAN and WAN, Les Cottrell, invited talk given at ORNL, June 24, 1996.

Distributed Computing Environment Monitoring and User Expectations, Les Cottrell & Connie Logg, SLAC-PUB-95-7008, contributed to the International Conference on Computing in High Energy Physics '95 (CHEP95), Rio de Janeiro, Brazil, September 18-22, 1995.

Network Monitoring, Les Cottrell & Connie Logg, presented at the 1995 DOE Telecommunications Conference, Portland, Oregon, July 1995.