DOE Office of Science Notice 01-06
Collaboratory Pilot: High Performance Networks
Title of Proposed Project:
Optimizing Performance of Throughput by Simulation (OPTS)
Principal Investigator:
Les Cottrell, (650)926-2523, FAX (650)926-3329, <cottrell@slac.stanford.edu>
Stanford Linear Accelerator Center (SLAC), MS97, 2575 Sand Hill Rd., Menlo Park, California 94025
Key members of team:
Warren Matthews, Andy Hanushevsky, + student, SLAC
George Riley + student, Georgia Institute of Technology
Rich Wolski, University of Tennessee at Knoxville
Submitted to:
High Performance Networks Research Program, Mathematical, Information and Computational Sciences Division,
Office of Advanced Scientific Computing Research, U. S. Department of Energy, 19901 Germantown Rd, Germantown, MD 20874-1207
Summary description of proposed research
Many of today’s networks and emerging networks enable increased levels of collaboration for DOE scientists and their collaborators. At the same time, there is an increasing need to replicate and access large-scale (terabyte and beyond) databases from anywhere on the network. Providing high-performance access today, however, requires a thorough understanding of many technologies, including how congestion is managed, which metrics determine performance, what the values of those metrics are for the paths involved, and even what the paths themselves are. As a result, most scientists are unaware of what is possible, what the expectations and constraints are, and how to address these issues.
We propose to make bulk-throughput measurements, together with other measurements (e.g. round-trip times, bottleneck bandwidth), on a variety of network paths of interest to the ESnet community. The results of these performance measurements will then be used to construct realistic simulations of the actual network paths, using the popular and widely used ns-2 network simulator. The observed measurements and simulation results will be compared to establish the range and level of detail over which the ns-2 simulator agrees with observation, and to understand how performance interacts with the various metrics. This information will be fed back to the ns-2 developer community to assist in identifying where improvements are needed. Once the simulator’s validity is understood, it will be used to help users understand today’s poor performance, to provide guidance on the settings required to improve performance, to set expectations of what is possible, and to predict the improvements expected from upgrades. In addition, we believe we can use the simulator to identify where a bulk-throughput transfer saturates the network and even over-runs it (i.e. increases the offered traffic without increasing the throughput). With this information we expect to be able to provide guidelines for setting bulk-throughput application parameters so as to limit the impact on other users while improving bulk-throughput performance. Feedback from these activities will in turn be used to provide an easy-to-use front end to the simulator, and simplified distribution techniques to enable wider use of the simulator for tuning etc. We will work with developers of parallel bulk-throughput applications to explore how to improve the applications based on our findings. We will also collaborate with the Network Weather Service (NWS) project to provide a new forecasting method based on simulation.
There is an increasing need today for high throughput by data-intensive science applications, such as the Particle Physics Data Grid (PPDG), remote backup and archiving, and data replication. At the same time, networks, especially the research and education networks such as ESnet, Abilene and those in Europe and Japan, are becoming increasingly capable of high throughput. However, it is difficult for application users to achieve high performance, since doing so requires expertise beyond that of most users. This difficulty arises because: host systems are optimized for low bandwidths and latencies; configuring the host’s OS and its TCP/IP stack to achieve higher performance varies between operating systems and usually requires system privileges; TCP by design hides the problems; and there is a lack of instrumentation and tools to diagnose performance issues. So far, most of the work on understanding how to improve performance has not investigated varying the number of parallel flows. Parallel bulk-transfer applications are emerging that take advantage of multiple simultaneous streams running on top of the TCP stack. Little has been done so far to see how using multiple flows interacts with large windows, or how to optimally set the window size and number of flows to best achieve high throughput with both today’s common TCP stacks and newer ones. There has also been little work to understand and limit the impact of a high-throughput bulk data transfer on other applications using all or part of the path it traverses.
We propose to make TCP throughput measurements with multiple window sizes and parallel flows, using tools such as iperf, GridFTP, bbftp and sfcp, for a range of network-connected hosts, with round-trip times (RTTs) varying from a few to hundreds of milliseconds, bandwidths varying from Mbits/s to Gbits/s, and using a variety of networks and Internet Service Providers (ISPs). The paths will be chosen as being typical (mainly in terms of bottleneck bandwidth and RTT) of the various paths of interest to the ESnet and High Energy and Nuclear Physics communities, for the ability to get iperf servers installed at the site, and for the ability to find administrators/users at the site who are supportive of our efforts. Simultaneously with the bulk-throughput measurements we will also make ping measurements of RTT, loss and variability to crudely represent the effect of bulk throughput on other packets at the bottleneck. We will compare these results with those obtained from the ns-2 network simulator. In particular we wish to quantify how well, and over what range of the parameters (RTT, bottleneck bandwidth, window size, flows, queue lengths, congestion, TCP implementation, run time), the simulator agrees with observation for various metrics (goodput, RTT, loss and variability). We will also look at the sensitivity of the parameters, try to decide which ones are most critical, and provide guidelines on how to obtain reasonable starting estimates for them. Since improvements of factors of 5 to 60 have been observed in bulk throughput by optimizing the maximum window sizes and number of flows, only rough agreement between the simulator and observation is required for the simulator’s results to be effective in tuning the TCP stack to optimize bulk throughput.
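The tuning space described above is anchored by the bandwidth-delay product (BDP). As a rough illustration only (not part of the measurement plan), the idealized window sizing can be sketched in Python, assuming a clean, loss-free path:

```python
def bdp_bytes(bottleneck_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bottleneck_bps * rtt_s / 8.0

def per_flow_window(bottleneck_bps, rtt_s, n_flows):
    """Split the BDP across parallel flows (idealized: equal shares, no loss)."""
    return bdp_bytes(bottleneck_bps, rtt_s) / n_flows

# Example: 155 Mbit/s bottleneck, 80 ms RTT
bdp = bdp_bytes(155e6, 0.080)          # about 1.55 MB of total window needed
w = per_flow_window(155e6, 0.080, 8)   # about 194 KB per flow with 8 flows
```

In practice loss and cross-traffic push the optimum away from this idealized value, which is exactly what the measurements and simulations are intended to quantify.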
We also intend to modify some bulk-throughput applications to adjust the window/system memory buffer sizes at the start of a connection, and later to adjust them dynamically during the transfer, so that they remain optimal and the buffer sizes are coordinated between transmitter and receiver.
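As a hedged sketch of the kind of per-connection buffer adjustment intended (the applications actually to be modified are bbftp, sfcp and similar; this sketch just uses the standard socket options via Python):

```python
import socket

def tune_buffers(sock, target_bytes):
    """Request send/receive buffers sized to the path's bandwidth-delay
    product; the OS may clamp the request to its configured maxima."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, target_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, target_bytes)
    # Read back what the kernel actually granted (Linux reports double
    # the requested value, to account for bookkeeping overhead).
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
granted = tune_buffers(s, 2 ** 21)  # ask for 2 MB each way
s.close()
```

Note that on most operating systems of this era the buffers must be set before the connection is established for large windows to take effect, which is why coordinating transmitter and receiver at connection start is part of the plan.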
2.1 Simulator extensions
We propose to use the ns-2 simulator to model ESnet-related network paths, using realistic values for various simulation parameters (such as RTT and bottleneck bandwidth). We will provide some extensions to the baseline ns-2 simulator, such as arbitrary packet reordering to observe the effect of such anomalies, reporting of the mean and variance of the RTT for each flow to see whether we can estimate loading, and a model for cross-traffic to understand its effects. We will also provide a front end around the simulator to simplify running simulations with a variety of maximum window sizes, parallel flows and other parameters, and to enable the simulation results to be easily compared with observations. To bring the ability to use the simulator to choose bulk-throughput parameters to a larger audience, we will provide a simpler, point-and-shoot interface to the simulator with online help, and package the simulator and attendant tools for easy download and installation on a variety of platforms. This will enable non-expert users at other sites to use the simulator to make predictions etc.
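To illustrate the kind of front end proposed, a minimal sketch of a parameter sweep over window sizes and flow counts. The `toy_goodput` stand-in model below is hypothetical and exists only to make the sketch self-contained; in the real front end each point would invoke ns-2 with a generated simulation script:

```python
import itertools

def sweep(windows_kb, flows, run_one):
    """Run one simulation per (window, flows) combination, collect the
    reported goodput, and return the results sorted best-first."""
    results = []
    for w, n in itertools.product(windows_kb, flows):
        results.append((run_one(w, n), w, n))
    return sorted(results, reverse=True)

# run_one would normally invoke ns-2 on a generated OTcl script, e.g.
#   subprocess.run(["ns", "bulk.tcl", str(w), str(n)], ...)
# Toy stand-in: window-limited per-flow rate, capped at the bottleneck.
def toy_goodput(window_kb, n_flows, bottleneck_mbps=155, rtt_ms=80):
    per_flow = window_kb * 8 / (rtt_ms / 1000.0) / 1000.0  # Mbit/s
    return min(bottleneck_mbps, per_flow * n_flows)

best = sweep([64, 256, 1024], [1, 4, 8], toy_goodput)[0]
# best holds (goodput, window_kb, n_flows) for the best combination
```

The front end would additionally record the observed metrics alongside each simulated point, so that simulation and observation can be compared directly.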
We also plan to provide a network enabled simulation server to accept necessary parameters from a client such as the NWS and return goodput and other estimates. The NWS maintains its own database of network performance data which it gathers from a distributed set of sensors. It also includes a forecasting subsystem that applies a suite of performance prediction models to the most recently gathered data, and makes the resulting predictions available in real-time. By comparing the prediction error over time, the NWS adaptively chooses the most accurate model from its suite to use at any given point in time. As part of this work, we will provide an on-line client-server interface to ns-2 so that the NWS can include predictions resulting from simulation in addition to the suite of models it now supports. The study of how simulation-based predictions (based on recently gathered data) can be combined with statistical forecasting techniques to make on-line end-to-end performance predictions is a significant scientific contribution that this proposal will make.
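The NWS adaptive-forecasting idea described above can be sketched as follows; the two predictors and the error accounting here are deliberately simplified stand-ins, not the actual NWS model suite:

```python
def last_value(history):
    """Trivial predictor: next value equals the most recent one."""
    return history[-1]

def running_mean(history):
    """Trivial predictor: next value equals the mean so far."""
    return sum(history) / len(history)

def nws_style_forecast(history, predictors):
    """Pick the predictor whose one-step-ahead error over the history
    was lowest, in the spirit of the NWS adaptive forecaster, then use
    it for the next prediction.  (The real NWS applies a larger suite
    of models and more careful error accounting.)"""
    def total_error(p):
        return sum(abs(p(history[:i]) - history[i])
                   for i in range(1, len(history)))
    best = min(predictors, key=total_error)
    return best(history)

# A throughput series with a level shift: last-value tracks it better
# than the running mean, so it is selected.
measurements = [40.0, 42.0, 41.0, 60.0, 59.0, 61.0]
forecast = nws_style_forecast(measurements, [last_value, running_mean])
```

A simulation-based predictor would slot into the same framework as one more entry in the predictor suite, which is what the proposed ns-2 simulation server enables.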
If we run into problems with execution times of the simulations especially as we try to achieve larger simulations, then we will use the "Parallel/Distributed ns" and the "NixVector ns", both developed at Georgia Tech.
2.2 Uses
With the existence of network simulators such as ns-2, it is much simpler to simulate throughput than to actually measure it for many paths (see below for more discussion of this). The simulator can therefore be used to more simply understand the current performance achieved by bulk-data-throughput applications, especially those using non-optimal default settings, and to aid in optimizing the parameters. It will also be of value for predicting the performance available between sites on existing paths with existing TCP stacks, and how well a proposed upgrade to the path or stack can perform. After an upgrade is put in place, the simulator can be used to help ensure the expected performance is achieved and to help identify where further improvement may be needed. The simulator can also be used to look at the impact of bulk throughput in terms of increased losses and RTT variability, and guidelines can be provided to limit that impact, e.g. by deliberately setting the parameters so that the bulk throughput leaves some fraction of the bandwidth unused, or so that it does not drive the network into unnecessary packet losses and over-runs. Integrating the simulator with the NWS via a simulation server will provide another forecasting model for the NWS, as mentioned previously. In the reverse direction, interfacing to the NWS will enable our project to "pull" information from the NWS so the simulation can build an archive of expected throughputs, end-to-end network probe data etc., which will be useful in different off-line simulation and analysis contexts.
2.3 Advantages & disadvantages
The simulator makes it very easy to adjust many of the parameters that affect throughput and quickly identify the effects of each parameter on the throughput. There are available methods to realistically estimate many of the parameters needed for the simulation. For example ping can be used to estimate the RTT, and pipechar or pchar can give estimates of the bottleneck bandwidth for lower speed (<=T3) paths. Since the simulations can be done fairly quickly one can also optimize the parameters to improve the fit between observed and simulated values of various metrics. The simulator does not impact the performance of the real network by making active measurements (apart from measurements made to determine the simulator’s initial parameter estimates). Since bulk throughput measurements can easily utilize over 90% of the bottleneck bandwidth this can be a very important consideration. There is no need to install iperf or another application client and server at local and remote hosts, so one does not need the collaboration of a remote administrator to do the install and configuration or to make available an account and password. Given today’s security concerns, this is a major advantage. There are a relatively small number of parallel high performance bulk throughput applications in major use today in the High Energy and Nuclear Physics community, yet these provide a major fraction of the utilization of many bottlenecks for long periods (days at a time). It is therefore relatively easy to identify these applications and their users/developers and work with them to improve the performance both of the application and to reduce the impact on the bottleneck. The latter is extremely important until the appropriate Quality of Service (QoS) mechanisms are defined, implemented and deployed in the appropriate places so we can use them to limit the impact of high performance bulk throughput applications.
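A useful back-of-envelope complement to both simulation and measurement is the macroscopic TCP model of Mathis et al., which estimates single-flow throughput as (MSS/RTT)·C/√p for loss rate p. A sketch, assuming the conventional constant C = √(3/2); this is a sanity check on simulated goodput, not a replacement for it:

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Macroscopic TCP throughput estimate: rate ~ (MSS/RTT) * C/sqrt(p).
    Assumes a long-lived flow governed by congestion avoidance."""
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

# 1460-byte MSS, 80 ms RTT, 0.1% loss: roughly 5.7 Mbit/s per flow,
# which hints at why parallel flows and larger windows matter on
# long, lossy paths.
est = mathis_throughput_bps(1460, 0.080, 0.001)
```

Where the simulator disagrees strongly with both this model and observation, that disagreement itself is informative feedback for the ns-2 developers.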
There are some concerns that need to be recognized and addressed. These include the fact that the simulator is less accurate than actual current observations. This is why we will be making real observations on a variety of paths to validate the simulator and understand its range of applicability etc. Estimating the bottleneck bandwidth for high speed paths is tricky even with tools like pipechar, so in some cases more detailed measurements and/or contacts with ISPs and site administrators will be needed to understand a path in some detail. Blocking or rate limiting of ICMP at some remote sites may mean that getting an estimate for the RTT parameter may be a bit tricky; however, in our experience this is not a problem for 98% of the sites of interest today, and in most cases it can be resolved by contacting the site administrator or simply pinging the border router. We still need to understand the impacts of other parameters such as queue lengths, congestion, runtime etc. This will be an early goal of this project, though initial estimates show there is quite a lot of leeway in setting these parameters. We will develop more sophisticated ns-2 models to more accurately measure and understand the effects of cross-traffic, i.e. traffic competing at the bottleneck with the simulated flow(s). Fortunately many of the paths used between high-performance sites are "over-engineered", so frequently a single bulk-throughput application instance will absorb most of the bottleneck bandwidth. However, as more users take advantage of advanced high-throughput applications, the competition for bandwidth is likely to increase and new simulation models taking account of cross-traffic will be needed. We hope to spur the development of better measurement, understanding and modeling of cross-traffic by identifying and studying breakdowns in the current simulation techniques.
3. Anticipated results
The validation of the simulator against bulk-throughput observations for a wide range of paths (SLAC has access to a 155 Mbps link to ESnet, a 622 Mbps link to the Stanford campus and Internet2, and a 2.5 Gbps link to NTON, in addition to having major collaborators in many countries) will help indicate the range (paths and metrics) and level of detail over which confidence can be placed in the simulator. In turn this information should also provide feedback to the ns-2 developer community to improve ns-2 (George Riley of Georgia Tech, a key person for the current proposal, is an ns-2 developer). With the availability of such a validated tool we will then have another way to assist in optimizing Internet throughput performance, especially for parallel FTP. We believe it will provide a useful low-impact middle ground between simply estimating the bottleneck bandwidth * RTT product to give the optimum window size (which provides no information on the impact of parallel flows) and doing full bulk-throughput measurements (which is very network intensive and can be difficult to set up). We also hope to provide realistic estimates of the impact of the bulk throughput on other users, and guidance on how to set the parameters to make this impact acceptable while achieving reasonable bulk throughput. Our close contacts with the BaBar physicists and applications developers (the experiment is based at SLAC) and the PPDG community (the PI is a member of the PPDG collaboration) will enable us to provide them with ready assistance to tune and enhance their applications to take advantage of our experience (bbftp is a BaBar parallel FTP application; Andy Hanushevsky, one of the key people on this proposal, is the architect of sfcp). We also anticipate tying the current work in with the Web100 developments, since we are an alpha-level tester and in close contact with the developers.
Finally with strong ties with the SLAC LAN engineering group (the PI is head of the group) and access to the ESnet testbed, we believe we will be able to make other detailed measurements to compare with SNMP accessible statistics collected by routers and switches.
Deliverables will include:
- publications comparing simulation with observation for high bulk-throughput applications, and feedback to the ns-2 community on the validity of the simulator;
- an extended ns-2 providing packet reordering, RTT variation and loss reporting, together with the development of a model for cross-traffic;
- a front end to ns-2 to enable users and application developers to tune high-throughput applications, including code packaged for download and installation on the major platforms of interest, with documentation on how to use it;
- a web site providing access to results, status reports, the proposal and the participants;
- consulting for research users and application developers on how to tune or modify their applications, setting expectations, etc.;
- feedback to the Web100 developers on usage and requirements.
4. Project schedule
5. Budget
Total budget for this project is $380K/year for three years. The breakdown is roughly: University of Tennessee at Knoxville $135K/year, Georgia Tech $75K/year, SLAC $170K/year.