Author: Les Cottrell. Created: Jan 30 '02
There were about 60 attendees. There was a wide variety of interests, including people from the Netherlands, Australia and Brazil. There was wireless connectivity and it worked. The meeting was at the Arizona State University Memorial Union.
Defined the end-to-end problem, its causes, and what is needed to solve it. Causes are often not the network (45% of problems may be network). He went through the needs:
GMA terminology: events are made available by a producer and used by a consumer, with a directory service to publicize data and support discovery; it must be distributed and replicated. People implementing it are in the UK, at GSFC and at LBL.
Event archives required
Event analysis and visualization.
Many pieces exist BUT all use different event formats and protocols!
DMF at LBNL goal is to lead GGF project and work with PingER, NWS, MDS.
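The GMA roles noted above (producer, consumer, directory service) can be sketched in a few lines. This is only an illustration of the terminology, not any of the UK, GSFC or LBL implementations; the class names, the `rtt_ms` event name and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    name: str      # kind of measurement, e.g. "rtt_ms" (hypothetical)
    source: str    # where the measurement was made
    value: float


class Producer:
    """Makes events available to whoever has subscribed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: Event) -> None:
        for callback in self._subscribers:
            callback(event)


class Directory:
    """Publicizes which producers offer which event names."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[Producer]] = {}

    def register(self, event_name: str, producer: Producer) -> None:
        self._entries.setdefault(event_name, []).append(producer)

    def lookup(self, event_name: str) -> List[Producer]:
        return list(self._entries.get(event_name, []))


# A consumer discovers a producer through the directory, then subscribes.
directory = Directory()
pinger = Producer("pinger-archive")
directory.register("rtt_ms", pinger)

received: List[Event] = []
for producer in directory.lookup("rtt_ms"):
    producer.subscribe(received.append)

pinger.publish(Event("rtt_ms", "hostA->hostB", 71.5))
```

A real deployment would distribute and replicate the directory, which this single-process sketch does not attempt.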
~80 DOD sites in the continental US, Alaska and Hawaii, 10Mbps to OC12, ATM core, BGP clusters, NHRP. FedNet, I2 & commercial peers. Deployed AMPs 1999 - current, Gigabit test platforms 2001 - current. Last couple of years looking at how to get high throughput. Gigabit test platforms: AMP++, Linux 2.4.17, ... 10msec NTP accuracy without clock, 10usec NTP accuracy with CDMA clock (Endrun Technologies Praecis Ct ($800), serial port, tagged events). Would love to see NTP nanokernel timing for Linux. New systems can do Gb with no difficulty.
Low latency matters. Turned from 45Mbps to 2*OC3, to Seattle 1.3*ideal RTT 2.8* for Ohio. Use Mping to measure RTT. Have seen many problems with duplex, do not see with pings. TCP acceleration ~ MSS/RTT^2 Mbps/s. With congestion avoidance can take 60 minutes to get from 0 to 10Gbps. Q~ (bw*rtt)^2, queue duration ~ RTT^2.
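A rough check of the ramp-up figure: in congestion avoidance the window grows by about one MSS per RTT, so the rate grows by about MSS/RTT^2 per second. The sketch below assumes a 1460-byte MSS and a 65 ms cross-country RTT; neither value is from the talk, and the function names are illustrative.

```python
def tcp_acceleration_bps_per_s(mss_bytes: float, rtt_s: float) -> float:
    """Rate gain per second in congestion avoidance: one MSS per RTT, every RTT."""
    return mss_bytes * 8 / rtt_s ** 2


def ramp_time_s(target_bps: float, mss_bytes: float, rtt_s: float) -> float:
    """Time to climb linearly from 0 to target_bps at that acceleration."""
    return target_bps / tcp_acceleration_bps_per_s(mss_bytes, rtt_s)


MSS_BYTES = 1460   # assumed Ethernet-sized segment
RTT_S = 0.065      # assumed cross-country round trip

accel_mbps_per_s = tcp_acceleration_bps_per_s(MSS_BYTES, RTT_S) / 1e6
ramp_minutes = ramp_time_s(10e9, MSS_BYTES, RTT_S) / 60
print(f"acceleration ~ {accel_mbps_per_s:.2f} Mbps/s, 0 to 10Gbps in ~ {ramp_minutes:.0f} min")
```

With these assumed values the ramp takes roughly an hour, in line with the 60 minutes quoted. The model ignores delayed ACKs and any loss during the climb.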
Low latency really helps, maybe more than just adding bandwidth. A 100KB BDP needs ~400KB queue in router, 1MB BDP ~40MB queue! Jumbo frames help reduce the size and duration of queues. For high BDPs average rate at best = 3/4 of the tight link rate. To do better requires throttling the sender in a way that doesn't halve cwin at peak, e.g. carefully setting rwin, pacing the sender, TCPW (sets cwin based on estimated BW), clever response to ECN.
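Both numbers above can be reproduced with simple arithmetic. The sketch assumes an idealized sawtooth (window climbs linearly from half the peak to the peak, then halves) and takes the quadratic queue scaling Q ~ (bw*rtt)^2 from these notes at face value, anchored on the 100KB -> 400KB example. Function names are illustrative.

```python
def sawtooth_average_fraction(steps: int = 1000) -> float:
    """Mean window as a fraction of the peak over one idealized sawtooth cycle."""
    peak = 1.0
    samples = [peak / 2 + (peak / 2) * i / (steps - 1) for i in range(steps)]
    return sum(samples) / len(samples)


def queue_bytes_for_bdp(bdp_bytes: float,
                        ref_bdp_bytes: float = 100e3,
                        ref_queue_bytes: float = 400e3) -> float:
    """Scale the reference queue size quadratically with the BDP."""
    return ref_queue_bytes * (bdp_bytes / ref_bdp_bytes) ** 2


avg_fraction = sawtooth_average_fraction()
queue_for_1mb_bdp = queue_bytes_for_bdp(1e6)
print(f"average rate ~ {avg_fraction:.2f} of peak")
print(f"1MB BDP -> ~ {queue_for_1mb_bdp / 1e6:.0f}MB queue")
```

The 3/4 figure is the best case because it assumes the peak of the sawtooth exactly reaches the tight link rate.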
With UDP, as one increases the offered load one sees a sudden dramatic build-up of RTT due to queue build-up, followed by packet loss. In such cases there are very few single packet drops; rather, losses come in bursts of hundreds of packets. This hits people with big RTTs: Gbit/s across country needs O(10^-8) loss, about 1 ping packet every 3 years. Require provider to demonstrate 50% of capacity for 10 minutes.
BER specs for NICs/circuits may not be low enough, e.g. 10^-12 BER gives 10^-8 packet ER, over 10 hops 10^-7. Worry about latency and loss, worry about duplex, 9K Bytes for jumbo is a start.
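The loss requirement can be estimated with the widely used steady-state TCP model, rate ~ (MSS/RTT) * C/sqrt(p) with C ~ sqrt(3/2). Whether the speaker used exactly this model is an assumption, as are the 1460-byte MSS, the 70 ms RTT and the one-ping-per-second probe rate.

```python
import math

SECONDS_PER_YEAR = 365.25 * 86400


def max_loss_rate(target_bps: float, mss_bytes: float, rtt_s: float,
                  c: float = math.sqrt(1.5)) -> float:
    """Largest loss probability p that still allows target_bps in the model."""
    return (c * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


def years_between_losses(loss_rate: float, probes_per_s: float = 1.0) -> float:
    """Mean time between lost probes when probing at a fixed rate."""
    return 1.0 / (loss_rate * probes_per_s) / SECONDS_PER_YEAR


p_needed = max_loss_rate(1e9, 1460, 0.070)
years_at_1e8 = years_between_losses(1e-8)
print(f"loss needed for 1 Gbit/s: ~ {p_needed:.1e}")
print(f"at 1e-8 loss, one lost ping every ~ {years_at_1e8:.1f} years")
```

At such loss rates ordinary pinging cannot verify a path in any reasonable time, which is why a sustained load test from the provider is asked for instead.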
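The packet error figures follow directly from the bit error rate: a packet is lost if any one of its bits is corrupted on any hop. The sketch assumes 1500-byte packets and independent bit errors with the same BER on every hop; the function name is illustrative.

```python
import math


def packet_error_rate(ber: float, packet_bytes: int = 1500, hops: int = 1) -> float:
    """Probability that at least one bit of the packet is corrupted over all hops."""
    bits_exposed = packet_bytes * 8 * hops
    # 1 - (1 - ber)**bits, computed stably for very small ber.
    return -math.expm1(bits_exposed * math.log1p(-ber))


per_one_hop = packet_error_rate(1e-12)
per_ten_hops = packet_error_rate(1e-12, hops=10)
print(f"1 hop: {per_one_hop:.1e}, 10 hops: {per_ten_hops:.1e}")
```

Under these assumptions a 10^-12 BER gives a packet error rate of order 10^-8 per hop and 10^-7 over ten hops, which is already above the loss budget for Gbit/s across the country.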
Built wrapper around iperf to allow web user to request an iperf from his machine to a server somewhere at a GigaPoP. Need a fabric parallel to the network to enable a user to probe and find out what is going on. From http://www.thequilt.net, get list of GigaPoP pages, then from individual pages get usage stats, traces, interactive iperf, web100 bandwidth tests, multicast beacon, looking glass server, history of many of the above measurements. From central site have a route view server. Legal policy is important, e.g. usage statistics within gigapops. Security is extremely important: who can do the iperf, how do we do authentication (quilt wide, individual gigapops), limit servers from too much activity, limit servers from frequent activity.
Goals: make it as simple as possible for users/netadmins to debug problems at gigapop level. Archive data to provide a historical view of what happens at the gigapop. Support measurement projects for researchers at gigapop layer, whenever possible. Start on idea of measurement fabric (user interface, admin interface). They will write and hand out Java applets to do this. Want input on direction, tools to include etc.
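One way to limit servers from too much or too frequent activity is a small admission check in front of the iperf wrapper. This is a hypothetical sketch, not the Quilt implementation: the class name, the per-client minimum interval and the hourly cap are all assumptions.

```python
from collections import deque
from typing import Deque, Dict


class MeasurementLimiter:
    """Admits a test only if the client waited long enough and the server is under its hourly cap."""

    def __init__(self, min_interval_s: float = 60.0, max_per_hour: int = 20) -> None:
        self.min_interval_s = min_interval_s
        self.max_per_hour = max_per_hour
        self._last_by_client: Dict[str, float] = {}
        self._recent: Deque[float] = deque()   # start times of admitted tests

    def allow(self, client: str, now: float) -> bool:
        # Forget admitted tests that started an hour or more ago.
        while self._recent and now - self._recent[0] >= 3600:
            self._recent.popleft()
        last = self._last_by_client.get(client)
        if last is not None and now - last < self.min_interval_s:
            return False                       # this client is asking too often
        if len(self._recent) >= self.max_per_hour:
            return False                       # server already busy enough
        self._last_by_client[client] = now
        self._recent.append(now)
        return True


limiter = MeasurementLimiter(min_interval_s=60.0, max_per_hour=2)
decisions = [
    limiter.allow("a", 0.0),      # first request from a
    limiter.allow("a", 10.0),     # a again, too soon
    limiter.allow("b", 10.0),     # different client
    limiter.allow("c", 20.0),     # hourly cap of 2 reached
    limiter.allow("c", 3601.0),   # oldest test has aged out
]
```

Authentication (who may run the iperf at all) would sit in front of a check like this and is not covered here.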
Motivation: improve, understand stacks, TCP tuning, router/switch buffer queues inadequate, eliminate wizard gap, need to instrument the kernel. Tuning is painful: log RTT, IP routing, MSS negotiations, IP reordering, losses, congestion ... Want real time triage to pinpoint problem areas. 30 alpha testers from SLAC, ORNL, LBNL & universities www.net100.org. Modified kernel support 2.4.16. Separation between Kernel Implementation Set (KIS) and library functions. Want to get improvements into RFCs (draft-mathis-rfc2012-extension-00.txt and draft-ietf-ipngwg-rfc2012-update-01.txt). Open up distribution next Wednesday www.web011.org/download. Please be cognizant of impacts on others, please use, test & provide feedback, IETF standards process to benefit all. Attention will shortly turn to working with OS vendors to incorporate into TCP stacks.
Includes Web100 + PSC, ORNL, LBNL with objective of improving bulk throughput. Components are active network probes & passive probes, network metrics data base, tuning daemon (WAD) to tune network flows based on network metrics. See www.net100.org. Java applet can be demonstrated at http://firebird.ccs.ornl.gov:7123/, there are 120 variables. LBL has an iperf modified to get statistics back after a run. Can modify TCP variables to do tuning. Have active probes at LBL, ORNL, NCAR, PSC, NERSC.
WAD uses network performance data to tune flows. Is there a risk of congestive failure as one tunes the stacks?
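As an illustration of tuning a flow from network metrics, the sketch below sizes socket buffers to the bandwidth-delay product of the path. It is not WAD's actual algorithm, which these notes do not describe; the function names and the 100 Mbps / 70 ms figures are assumptions. The `setsockopt` calls with `SO_SNDBUF` and `SO_RCVBUF` are the standard socket API, and the kernel may clamp the requested size.

```python
import socket


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Bytes in flight needed to keep a path of this bandwidth and RTT full."""
    return int(bandwidth_bps * rtt_s / 8)


def tune_socket(sock: socket.socket, bandwidth_bps: float, rtt_s: float) -> int:
    """Request send and receive buffers of one BDP; returns the size asked for."""
    size = bdp_bytes(bandwidth_bps, rtt_s)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return size


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        requested = tune_socket(s, 100e6, 0.070)   # assumed measured path metrics
        granted = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        print(f"requested {requested} bytes, kernel reports {granted}")
```

Larger buffers let a flow fill the pipe, but they also let it hold more data in flight, which is the congestion concern raised above.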
Concern is too much active probing so very interested in using passive measures due to traffic impacts.
Used passive monitoring to characterize campus traffic. 0.2% of flows on UAB get 3Mbits/sec, this is the PITAC threshold. In May 2000 one host accounted for 66% of all flows >= 3Mbits/s and the flows were going to 70 different hosts.
H.323: 75% of network problems involve duplex mismatch. Latency of 150ms or less, definitely 300ms or less, need to understand sensitivity to jitter. Want 0.1% or less packet loss.
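The H.323 targets above can be turned into a simple path check. The thresholds (150 ms, 300 ms, 0.1% loss) are from the notes; the three rating labels and the function name are my own, and jitter is left out because no figure was given.

```python
def h323_path_rating(latency_ms: float, loss_fraction: float) -> str:
    """Rate a path against the latency and loss targets quoted for H.323."""
    if loss_fraction > 0.001:      # want 0.1% or less packet loss
        return "poor"
    if latency_ms <= 150:          # preferred latency
        return "good"
    if latency_ms <= 300:          # outer limit
        return "marginal"
    return "poor"


ratings = [
    h323_path_rating(80, 0.0005),
    h323_path_rating(220, 0.0005),
    h323_path_rating(80, 0.01),
    h323_path_rating(350, 0.0),
]
```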
This is an applications perspective, joint project from CS and physics people. In the tiered HENP environment users are using applications that need to know where to find the "closest" copy of some data. Each project named monitoring as a key need. First need was to agree how to define monitoring, is the ftp server up, how many events has my job processed, what's the load on this network connection. Formed joint group in October 2001, cross-cutting between HENP experiments. First step has been to define use cases for requirements setting. Then agree on an initial set of sensors. Define schemas (likely to be MDS based), agree on a higher-level reporting framework. 19 use cases from ~ 9 groups, fall into 4 categories: health of system (network, servers, cpus etc.), system upgrade evaluation, resource selection and application-specific progress tracking. Defined a template for the use cases, includes: description, contact; performance events/sensors required; how will info be used; what access is needed (last value, streaming of data, logs); size of data to be gathered; overhead constraints (for sensor); frequency data will be updated; frequency data will be accessed; how timely does data need to be; scale issues: how many producers will there be, how many consumers, what portion of this will be of interest to a specific query; security requirements; consistency or failure concerns; duration of logging (2 weeks is a good length); platform information. She went through a particular use case for replica selection.
Evaluation: all sensors must be non-obtrusive; all data is small and must be as timely as possible; all data must be kept for years (for trend analysis) and must be accessible by any and all means; no one really knows how many sensors will be accessed at one time (or reporting to a higher level service), or how often they will be accessed; security is not a concern yet. The line between monitoring system and higher level services isn't always clear. From use cases gathered requirements, split by type: network, cpu, storage system, other. Host sensor: cpu load, available memory, disk; network: bandwidth & latency; storage system: available free storage. Next steps: what tools should we deploy?
Contact information http://www.mcs.anl.gov/~jms/pg-monitoring
Spirent PLC is a company of 5K people worldwide, publicly traded in London & NYSE. Performance analysis solutions division (AdTech, SmartBits) and service assurance solutions division. Voyager (came about as part of working with I2) project components. Systems hosted in a NOC. They have 2 probes: a connector probe and a participant probe. There is a probe management, data management and analysis framework. CenterOp perform gathers service availability and throughput data, it associates this data with customer circuits or virtual connections. The data can be reduced into a Quality Index. Framework runs on AIX and Informix. Start beta test in network environment in May '02.
Will be divided up into 4 groups: two groups each to cover both Applications and HostOS, and two groups each to cover both Network 1 (campus wall jack to campus border, LAN) and Network 2 (WAN, GigaPoP, backbone). There will be breakout sessions and then groups come back with reports, then discuss where to go next. Each group will assign a discussion leader and recorder. The leader & recorder are responsible for presenting the group's conclusion. The major focus will be packet loss detection. Each group will prepare answers to the Report Questions and make recommendations on how to solve the problems presented.
Process seemed to absorb a lot of time, getting definitions for terms, what is a path, what does "Common" mean, what does packet loss mean?
Defined three ways in which packets can be lost, since there are different ways of detecting loss. Analysis is periodic ongoing low-level baseline/health measure when things are working normally, these could be active or passive. When run into trouble then need to be able to drill down to more intensive quasi-real time measurements. Summary data should be available via the web or XML. For sharing an XML schema would be useful to convert between formats.
Need to provide a catalogue of all the measurement activities. Want to reach out through the workshop meetings to invite others (non Internet 2 end to end measurers) to the meetings. A technology need is to find and allow more standard access to the various measurement archives. We want the E2Epi to
I met with Thomas Ndousse of DoE/MICS. He is now interested in setting up a NIMI measurement infrastructure with archives and access for users and applications to the archive. Users would be able to query the archive and request the NIMIs to schedule measurements. Sounds a bit like the AIME proposal, which I pointed out to Thomas. Thomas also wants to get more of an engineering viewpoint into measurement with well defined methodologies and statistically designed experiments.
I had a long discussion with Margaret Murray of CAIDA and agreed that we would work together on validating the pipechar, iperf etc. measurements and when we have them, pathrate and pathload. Apparently Thomas is asking CAIDA for this type of analysis and Hans Werner Braun of NLANR is anxious to comply.
I met with Jenny Schopf of ANL. We discussed the Globus MDS infrastructure for providing access to data measurements. She explained to me how one would write an Information Provider (IP) for the PingER data that would put the data in some defined schema and make it available in LDIF (the LDAP Data Interchange Format) to a Grid Resource Information Server (GRIS). The user requests the information from a Grid Information Index Server (GIIS) which gets it from its cache or requests it from the appropriate GRIS. There is nothing at the moment to allow the user to know what GIIS may have access to the required information. The IP and GRIS could run in the PingER archive server, or in separate machines. The IP, GRIS and GIIS would probably all be at SLAC. Jenny pointed me to a web page for further information at http://www.globus.org/gt2/mds2/. She also will provide example shell scripts that implement an IP for GridFTP. The other person working on MDS is John McGee <mcGee@isi.edu>. We also discussed the idea of making occasional high impact performance measurements for normalization, then more frequently making lower impact measurements and using the latter to help in interpolation or extrapolation.
I talked to Dantong Yu who is very interested in collaborating on the IEPM-BW project and has some time to work on it. He has experience with MDS and would be interested in working on putting the PingER data into MDS. We will send him the documentation on the IEPM-BW data, where to get it and the formats etc. When I get back I will discuss this with Warren, point Dantong to how to get started on getting an account at SLAC, and set up a phone call. In the other direction Dantong and I are still working on figuring how to get through the BNL firewall from SLAC.
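To make the Information Provider idea concrete, here is a minimal sketch of a program that writes one PingER measurement as an LDIF entry on standard output, which is the general shape of what a GRIS would read. The DN layout, the `PingerPair` object class and the attribute names are all hypothetical placeholders; the real schema was still to be defined, and a real provider would read the values from the PingER archive rather than from constants.

```python
from typing import Dict, List, Union

AttrValue = Union[str, List[str]]


def to_ldif(dn: str, attributes: Dict[str, AttrValue]) -> str:
    """Format one entry as LDIF: a dn line, then one 'name: value' line per value."""
    lines = [f"dn: {dn}"]
    for name, values in attributes.items():
        for value in (values if isinstance(values, list) else [values]):
            lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


def pinger_entry(src: str, dst: str, rtt_ms: float, loss_pct: float) -> str:
    """Build an entry for one monitored pair (hypothetical schema)."""
    dn = f"pinger-pair={src}-{dst}, mds-vo-name=local, o=grid"
    return to_ldif(dn, {
        "objectclass": ["PingerPair"],
        "pinger-source": src,
        "pinger-destination": dst,
        "pinger-rtt-ms": f"{rtt_ms:.1f}",
        "pinger-loss-pct": f"{loss_pct:.2f}",
    })


# Hypothetical host names and values, for illustration only.
entry = pinger_entry("monitor.example.org", "remote.example.net", 71.5, 0.25)
print(entry)
```

The sketch only handles plain ASCII values; LDIF requires base64 encoding for values with special characters, which is omitted here.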
I talked to Brian Tierney of LBNL about measurement durations, we agreed that it is probable that in some cases measurement durations could actually be less than 10 seconds (e.g. for small RTT links).
One of the big concerns is how much impact active measurements have on the network being measured. I discussed an idea I had on the plane of using passive bandwidth measurements from our netflow measurements for predictions. There was general agreement that it is worth trying.
I talked to T Charles Yun, Program manager of the Internet 2 Applications Group. They are looking to deploy small dedicated PCs running Red Hat Linux 7.2, with 2*100Mbps and 1 GE NIC but no GPS for timing. How to make the PCs will be documented and they will be specified with 66MHz, 64bit PCI buses. They will be used to run applications like iperf, ping, traceroute on demand. There are 2 of these boxes at the moment, they do not appear to be the boxes that Stanislaus Shalunov was developing. The software to run on the boxes is being developed at U Mich by Sushi@umich.edu and Bob Riddle. The site puts together the hardware and then installs Red Hat Linux. After they are connected to the network they can get their address from DHCP or have it pre-configured. The software including iperf will then have to be loaded, and it will register itself to an LDAP database (they may use Verisign certificates). This will provide the ability to exchange authorization information. There will be a web front end to allow a user to make a request for measurement between any two of these boxes, which will provide the output back to the web browser. They are looking at unifying this with NIMI.
I talked to Maxim Grigoriev of FNAL about the missing IEPM-BW host at FNAL, and about getting FNAL involved in IEPM-BW. We agreed that I would put together information on IEPM-BW for the FNAL folks and set up a phone meeting in the near future.