Next Generation Internet (NGI) Testbed Workshop
Berkeley, Jul 21-22, 1999
Rough notes by Les Cottrell, SLAC
There were about 40 attendees, ~ 50% took notes with laptops.
Several questions arose on why PPDG needs Tbps for interactive use: what are the data flows, where is the data cached, how much data is needed before one can start looking at it? A data flow diagram indicating the number of flows, volumes, end-points, durations, and latency/jitter requirements would be illustrative. For the non-interactive use it was pointed out that the data volume is already severely restricted by very conservative trigger settings. Further, today Particle Physics transfers bulk data by tape, which with copying, cataloging, packing, sending, getting through customs, receiving, cataloging, loading, reading, tracking etc. typically takes a couple of weeks and can be a full-time job for a couple of people per experiment. This severely limits the ability of remote sites to perform timely reconstruction of the data.
PPDG is not currently a heavy user of the production network. It is an application that can be an early, aggressive user of the testbeds. So far the focus has not been on what happens at the network layer when one runs such applications. Leighton says the current connectivity for the collaborators is as good as it can get today and is in production. The critical path is to measure/understand how the applications run over today's networks and how to improve things.
Analyze petabytes of climate data from distributed locations. Large caches at LBNL, LLNL & ANL. The source of the data is LANL. Cache coordination is a big part of the project, to help migration & replication of data between caches. Today they run climate models: a 10-minute run creates 700 MB of data, and models run for 6 hours/day; they want to copy the generated sets from LANL to NCAR, corresponding to 25 GB/day. Future requirements (1-2 years): data sets 50-100 times bigger generated in the same time, 1.25-2.5 TB/day, i.e. 400-920 Mbps (50-115 MBps).
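The quoted rates can be sanity-checked with a little arithmetic; a minimal sketch, assuming (since the notes do not say) that each day's data set must be moved within roughly the same 6-hour window in which it is generated:

```python
# Sanity check of the climate-data transfer rates quoted above.
# Assumption (not stated in the notes): the daily data set is copied
# within the same 6-hour window in which it is generated.

def required_mbps(tb_per_day: float, window_hours: float) -> float:
    """Sustained rate in Mbps needed to move tb_per_day TB in window_hours."""
    bits = tb_per_day * 1e12 * 8              # TB -> bits (decimal units)
    return bits / (window_hours * 3600) / 1e6

# Today: 700 MB per 10-minute run, 6 hours/day -> 36 runs
today_gb = 36 * 0.7
print(f"today: {today_gb:.1f} GB/day")        # ~25 GB/day, as quoted

# Future: 1.25-2.5 TB/day moved in the 6-hour window
for tb in (1.25, 2.5):
    print(f"{tb} TB/day -> {required_mbps(tb, 6):.0f} Mbps")
```

With a 6-hour window, 1.25-2.5 TB/day works out to roughly 460-930 Mbps (58-116 MBps), broadly consistent with the 400-920 Mbps (50-115 MBps) quoted above.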
Issues include a higher-speed path from NERSC HPSS to ESnet and from NERSC HPSS to the LLNL RAID system, firewall issues, and ESnet to vBNS/Abilene (are reservations on both possible?).
This was used as an example of how a LAN connects to the Testbed while carrying a production network at the same time. Dave showed the part of the LAN being used for the DiffServ, VoIP, and IPv6 experiments with LBNL, together with the ESnet production connection. Today the two nets (production & pilot) are separated as ATM PVCs. In the longer term the PPDG data is on the production network and will need to be able to access the Testbed network. Becca Nitzan pointed out that the ballgame changes when/if one moves to a GSR (Cisco 12000), since its QoS is different. Becca says the Cisco 7500s have more knobs for adjusting things (e.g. QoS features).
One issue is the security requirements and the monitoring of flows, e.g. by OCxMON (it only looks at headers). Boeing has something that runs at OC12 and is looking for OC48 requirements (it is like a high-speed sniffer; they licensed the sniffer technology from somebody, and the ballpark figure is $50K).
They are connected via OC3 to vBNS. Getting rid of all LANE stuff on the LAN. The backbone is switched Gbps. They do not have jitter requirements, but do want to be able to schedule/reserve bandwidth for bulk data transfers.
There was an interesting discussion on the importance of correctly setting Ethernet speed/duplex auto-negotiation. If auto-negotiation is set incorrectly, performance can be degraded by a factor of 50. In some cases mis-configuration can be seen by looking at the error counters in the switches. However, one cannot see all cases, since it is hard to see the counters in the end node (e.g. a PC).
Parallel Sessions: LAN issues, WAN issues
Treats every resource in a generic way. A reservation has the following properties: start time & duration, resource type to reserve, and who is to get the resource (which implies authentication). There is a gatekeeper for authentication, then a Local Resource Manager (LRM). The LRM exists at each site and keeps track of resources and slots (e.g. periods of reservation). There can be multiple resource managers, and an end-to-end reservation may require multiple resource managers to be contacted. The LRMs do not talk directly to each other. There is a global MDS database which is accessed by the LRMs and the gatekeepers. There is a working implementation at ANL. They are working to get it into the DPSS system. There will be a web-based calendaring interface that allows defining the start & duration of a reservation and the resources (IP addresses) required. It will need to be able to show who has reservations etc.
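The LRM's slot bookkeeping described above can be sketched as follows (hypothetical class and field names; this does not reproduce the real GARA/LRM interfaces):

```python
# Minimal sketch of the slot bookkeeping an LRM might do: each
# reservation has a start time, duration, resource, and principal;
# a request is refused if it overlaps an existing slot for the same
# resource. All names here are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Slot:
    resource: str     # e.g. "bandwidth:siteA"
    start: float      # seconds since epoch
    duration: float   # seconds
    principal: str    # authenticated requester (via the gatekeeper)

    @property
    def end(self) -> float:
        return self.start + self.duration

@dataclass
class LocalResourceManager:
    slots: list = field(default_factory=list)

    def reserve(self, req: Slot) -> bool:
        """Grant unless the request overlaps an existing slot for the
        same resource."""
        for s in self.slots:
            if s.resource == req.resource and \
               req.start < s.end and s.start < req.end:
                return False
        self.slots.append(req)
        return True

lrm = LocalResourceManager()
print(lrm.reserve(Slot("bandwidth:siteA", 0, 3600, "alice")))   # True
print(lrm.reserve(Slot("bandwidth:siteA", 1800, 3600, "bob")))  # False: overlaps
print(lrm.reserve(Slot("bandwidth:siteA", 3600, 3600, "bob")))  # True
```

An end-to-end reservation would then be built by a co-reservation agent calling `reserve` on each LRM along the path.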
The discussion raised the issue of conflicts between the developers and the application users. There will be a fair amount of time spent setting things up (for some functions measured in years), during which the availability and reliability of the Testbed will vary widely (e.g. functions will not be available in a reliable, regular fashion; routers will need rebooting, often caused by bugs in the beta code needed to provide the functionality; configurations will change without warning). This expectation level needs to be made clear to the applications people. We need a project plan stating where we will be at various stages and when, e.g. how long until Abilene & ESnet can exchange reservations.
A second concern is how does an application user decide what network to use, and how is this decision implemented (e.g. how does the application switch from the Testbed to the production network, especially if the node to be accessed is deep inside a campus).
The third concern was that it is possible to defeat the QoS even on an uncongested network.
Helen Chen raised the issue of whether we can get by with "infinite bandwidth" and hence avoid the complexity of QoS. An example of "infinite bandwidth" is NTON. In this case there are the questions of how one accesses the bandwidth and how one gets funding for access to NTON. It is clear that infinite bandwidth is preferable, but the question is really economic: what do you do if you can't afford infinite bandwidth? Perhaps the issue is where to put one's money to get the biggest bang for the buck: in WDM over dark fiber, or in research into QoS? It is also true that a corollary of infinite bandwidth is that there is always a bottleneck somewhere, so the solutions are complementary. Another problem is that the latest, fastest hardware (e.g. OC48, OC192, ...) does not support QoS.
Created a taxonomy of sites in the NGI in terms of project, partners, bulk transfer, real-time, and came up with what are likely to be the busiest paths.
Bandwidth reservation: requires 200Mbps over hours (FedEx alternative), 10Mbps for 5 minutes (interactive). Issues: how much bandwidth (200 Mbps), latency & jitter requirements, how many flows, how interactive is the setup.
Key issues: authentication, instrumentation (end-to-end), directory services, data access / management.
Want to focus on tools & techniques needed to understand what applications are doing and require. This may require accurately time-stamped data. How important is jitter, and how does one measure it?
Want to capture burstiness and multipoint requirements. There are active & passive mechanisms; what tools do the plumbers need, as well as the applications? Want to play back streams of data to look for/understand problems. How does one store the data, and how is it stored to enable correlation later on? How can we put persistent monitoring into place?
An independent implementation of Van Jacobson's pathchar. The output has been made more intelligible.
Bruce is looking at some improved measurement algorithms to reduce the impact on the network, to reduce the measurement time, produce useful results over switched networks. Also want to improve analysis to reduce the effects of experimental errors, doing adaptive analysis & measurements. He is also looking at a programming interface (API) for applications.
To download, go to: http://www.ca.sandia.gov/~bmah/Software/pchar. To contact Bruce use <email@example.com>. Runs on FreeBSD, Solaris, Linux, and IRIX; mainly developed on FreeBSD & Solaris.
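The core idea behind pathchar-style tools, which Bruce's algorithm work builds on, can be illustrated with synthetic data: the minimum RTT through a hop grows linearly with probe packet size, and the slope of that line is the reciprocal of the path bandwidth, so differencing slopes between successive hops yields per-link bandwidth. A sketch (not pchar's actual code):

```python
# Recover a link's bandwidth from (packet size, minimum RTT) samples,
# the way pathchar-style tools do. The timing data here is synthetic.

def slope(points):
    """Least-squares slope of (size_bytes, min_rtt_seconds) points."""
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

# Synthetic minimum RTTs for a 10 Mbps link with 1 ms fixed latency:
# rtt = 2*latency + size_bytes*8/bandwidth (one-way serialization).
bandwidth = 10e6
points = [(size, 2e-3 + size * 8 / bandwidth)
          for size in range(64, 1500, 64)]

est = 8 / slope(points)   # slope is seconds/byte, so 8/slope is bits/sec
print(f"estimated link bandwidth: {est/1e6:.1f} Mbps")
```

Real probes are noisy, which is why these tools take the *minimum* RTT over many probes at each size before fitting; Bruce's adaptive analysis work aims at reducing how many probes that takes.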
Active measurement for accurate one-way performance. 54 nodes deployed. About 11 of the ~20 NGI sites have Surveyors. They have a CGI-based summary server, with Java/Excel visualization/analysis of the summary server data.
IETF "Diff" (EF PHB) + QBone "Serv" (QPS): the QBone Premium Service is based on Van Jacobson's VLL "premium service". Includes an architecture for measurement & dissemination. Collects active measurements (one-way delay variation, one-way loss, traceroutes) and passive ones (loads, discards, link bandwidths, EF reservation load). HTTP access to all data (including raw data).
http://www.internet2.edu/qbone is the QBone home page.
Problems with getting a (GPS) aerial into Qwest POPs, or alternatively getting access to the SONET timing. Surveyor over ATM is being implemented. Indiana U polls all Abilene interfaces every 3 seconds. Early on (up to 3 months ago) the counters disagreed (e.g. out of one interface did not equal into the other). They appear to agree better now. Want to correlate Surveyor data with the Abilene SNMP data. They also provide web access for browsing the SNMP node MIBs. The Abilene tools are at http://www.abilene.iu.edu/. The underlying tools are in Perl.
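The counter-consistency check mentioned above (octets out of one end of a link should match octets into the far end over the same interval) can be sketched simply; the numbers below are made up, and in practice the deltas would come from SNMP ifOutOctets/ifInOctets polls:

```python
# Check that the byte count leaving one end of a point-to-point link
# agrees with the byte count arriving at the other end, to within a
# fractional tolerance. Sample deltas below are hypothetical.

def counters_consistent(out_octets: int, in_octets: int,
                        tolerance: float = 0.01) -> bool:
    """True if the two counts agree to within `tolerance` (fractional)."""
    if max(out_octets, in_octets) == 0:
        return True
    diff = abs(out_octets - in_octets)
    return diff / max(out_octets, in_octets) <= tolerance

# Deltas over one 3-second polling interval (made-up numbers):
print(counters_consistent(1_250_000, 1_248_900))  # True: ~0.09% apart
print(counters_consistent(1_250_000, 1_000_000))  # False: 20% apart
```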
Problem breakdown: 40% network, 20% host, 40% bugs or problems in applications (50:50 client/server). Ping & ttcp may not be good indicators of network problems.
NetLogger is used to do performance/bottleneck analysis on distributed applications. It can be used for post-mortem analysis. Current visualization tools don't scale beyond 20 events at a time, and it only works to the millisecond level. Needs synchronized clocks, i.e. NTP. Uses the Universal Log Message format (ULM, an IETF draft standard): an easy-to-understand, self-naming/describing format. It would be nice if everybody used the same format; they will build filters to convert the data. Pablo, NWS (Network Weather Service), Surveyor, others? APIs for C, C++, Java and Perl. Has 6 simple calls: open, write, flush, ... Has wrappers for netstat, vmstat, uptime, ..., snmpget.
For visualization they use the concept of lifelines to trace an event through the system, e.g. x-axis is time, y-axis is the event. To do this there has to be a way to associate a set of NetLogger messages with events.
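A rough sketch of emitting and parsing ULM-style messages follows; the field names (DATE, HOST, PROG, NL.EVNT) follow the NetLogger/ULM convention only loosely and should be treated as assumptions, not the exact draft format:

```python
# Emit and parse self-describing, whitespace-separated FIELD=value
# log messages in the spirit of NetLogger's ULM format. Field names
# here are assumptions, not the exact draft-standard set.

import time

def ulm_write(event: str, prog: str, host: str, **fields) -> str:
    """Build one ULM-style log line for the given event."""
    ts = time.strftime("%Y%m%d%H%M%S", time.gmtime())
    parts = [f"DATE={ts}", f"HOST={host}", f"PROG={prog}",
             f"NL.EVNT={event}"]
    parts += [f"{k}={v}" for k, v in fields.items()]
    return " ".join(parts)

def ulm_parse(line: str) -> dict:
    """Split a ULM-style line back into a field dictionary."""
    return dict(p.split("=", 1) for p in line.split())

msg = ulm_write("TRANSFER_START", "ttcp", "node1.example.org",
                SIZE=1048576)
rec = ulm_parse(msg)
print(rec["NL.EVNT"], rec["SIZE"])
```

Lifeline visualization then amounts to grouping such records by a shared key (e.g. a transfer ID carried in the fields) and plotting each group's events against their DATE stamps.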
Should be able to get 1 msec accuracy with NTP, better if there is a GPS on your site.
Can use netlogger to launch probes when it sees some pathological behavior, e.g. trigger netstat.
An issue is how to archive the data; can one use multicast to send the data to the archiver and the NetLogger daemon at the same time?
Source code (for the Solaris, Linux & IRIX versions of nldaemon) is available at http://www-didc.lbl.gov/NetLogger
EMERGE is a project to provide Grid infrastructure services for MREN. Such services include uniform distributed authentication, information (LDAP) services, resource management & authorization services.
Requires a mechanism for issuing certificates (CAs). Most sites represented at the meeting do not run their own CAs (FNAL & some of the weapons sites (LLNL) do). Setting up & running a CA is a big effort. We could either run a DOE NGI CA, or try to get someone else (e.g. Globus) to issue keys for us. This will require some coordination of accounts across sites.
EMERGE also provides resource management (discovery and reservation) starting in a single domain, later to tackle inter-domain issues. Will manage a small fraction of bandwidth. How does MREN infrastructure integrate with ESnet? Does it switch between production and test network? Does it access multiple levels of service on say NTON? How is the re-routing/addressing done if one switches from test to production? This will require working with/coordinating with the LAN architects at each site.
A 3rd area will be instrumentation. They have GloPerf & a heartbeat monitor by default. In addition, Surveyor, NetLogger, and Network Weather Service are all very relevant and will need integration. The goal is an integrated performance data archive for DOE NGI sites and applications.
There will also be a Grid information service based on LDAP.
They have traffic conditioning, IP to ATM class of service, etc.
They have been testing Cisco traffic conditioning (TC) methods in a lab environment, using the Chariot software (TCP and UDP modules), which allows multiple streams between a source & destination. They use an HP ATM analyzer.
Using CAR on Cisco boxes with WRED (WFQ does not work well). Also use distributed Traffic Shaping (dTS, not available for a few months on the GSR), which uses the committed information rate (CIR), burst size (Bc) and excess burst size (Be).
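The CIR/Bc/Be parameters can be illustrated with a simplified two-bucket policer model (a sketch of the general token-bucket mechanism, not Cisco's exact CAR or dTS algorithm):

```python
# Simplified two-bucket policer illustrating CIR (committed rate),
# Bc (committed burst) and Be (excess burst). Packets within the
# committed burst conform, packets within the excess burst exceed,
# everything else is dropped. NOT Cisco's exact algorithm.

class TokenBucketPolicer:
    def __init__(self, cir_bps: float, bc_bytes: int, be_bytes: int):
        self.cir = cir_bps / 8.0          # committed rate, bytes/sec
        self.bc = bc_bytes                # committed burst size
        self.be = be_bytes                # excess burst size
        self.c_tokens = float(bc_bytes)   # committed bucket starts full
        self.e_tokens = float(be_bytes)   # excess bucket starts full
        self.last = 0.0

    def police(self, size: int, now: float) -> str:
        # Refill the committed bucket; overflow spills into excess.
        refill = (now - self.last) * self.cir
        self.last = now
        new_c = self.c_tokens + refill
        spill = max(0.0, new_c - self.bc)
        self.c_tokens = min(float(self.bc), new_c)
        self.e_tokens = min(float(self.be), self.e_tokens + spill)
        if size <= self.c_tokens:
            self.c_tokens -= size
            return "conform"
        if size <= self.e_tokens:
            self.e_tokens -= size
            return "exceed"
        return "drop"

p = TokenBucketPolicer(cir_bps=8_000_000, bc_bytes=1500, be_bytes=3000)
print(p.police(1500, 0.0))   # conform: committed bucket starts full
print(p.police(1500, 0.0))   # exceed: committed bucket empty, excess full
print(p.police(1500, 0.0))   # exceed: excess bucket had 3000 bytes
print(p.police(1500, 0.0))   # drop: both buckets empty
```

In a router, the "conform"/"exceed" outcomes would map to actions such as transmit, re-mark precedence, or drop.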
They are also working on bandwidth reservation using an expect script to control the router. They use NeTraMet to monitor full OC3-rate flows.
Created an isolated Testbed for QoS applications (the 1st one is DPSS) based on DiffServ. Currently a single domain. Uses CAR & WRED. A resource manager controls the router through an expect script. Machines are connected at Fast Ethernet (switched 100 Mbps). A gatekeeper does the authentication to get to the resource manager (LRM). Resource managers access the MDS. The end-to-end co-reservation agent talks to the resource managers along the path.
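The end-to-end co-reservation agent's role can be sketched as an all-or-nothing pass over the resource managers along the path (hypothetical interfaces, not the actual GARA code):

```python
# Sketch of end-to-end co-reservation: ask every resource manager
# along the path for the same bandwidth, and roll back earlier grants
# if any manager refuses, so the reservation is all-or-nothing.
# All names and numbers here are hypothetical.

class ResourceManager:
    def __init__(self, name: str, free_mbps: float):
        self.name = name
        self.free = free_mbps

    def reserve(self, mbps: float) -> bool:
        if mbps <= self.free:
            self.free -= mbps
            return True
        return False

    def release(self, mbps: float) -> None:
        self.free += mbps

def co_reserve(path, mbps: float) -> bool:
    granted = []
    for rm in path:
        if rm.reserve(mbps):
            granted.append(rm)
        else:
            for r in granted:        # roll back the partial reservation
                r.release(mbps)
            return False
    return True

path = [ResourceManager("site-LAN", 100),
        ResourceManager("ESnet", 50),
        ResourceManager("Abilene", 200)]
print(co_reserve(path, 80))   # False: ESnet can only spare 50 Mbps
print(co_reserve(path, 40))   # True: all three managers grant 40 Mbps
```

The rollback step matters: without it, a refused end-to-end request would leave bandwidth stranded at the domains that had already granted it.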
They used a modified version of ttcp (GARA-enabled: support for a desired rate, consecutive bandwidth reporting on both sides, and support for automated testing). They have a UDP generator from U. Mich. They are ready to test on the WAN. For info on the graphs in the presentation, contact the email address in the presentation.
There are concerns about interconnecting the AS domains (Abilene, MREN & ESnet) for the NGI participants. Bandwidth for the Testbed will be allocated out of the production services. ANL, SNL, LBL & SLAC have Testbed routers. A question is what fraction of ESnet bandwidth will be set aside for the NGI Testbed (premium services?), a figure of 30% was suggested as a straw man.
Knobs for controlling the bandwidth include WFQ, (W)RED, & CAR.
Each site is responsible for carrying "premium" service to the application end system. Site routers will mark premium traffic. ESnet & Abilene will police at ingress and will set the PHB (per-hop behavior). The goal is to ease manual configuration and to have a persistent Testbed infrastructure. Abilene has a proxy for access (mainly querying) to router MIBs via SNMP; this might be a big advantage to people making tests, to understand how a path is configured and to see the bytes etc. transferred.
A reservation system will be important: to know who is on the Testbed, and also to reserve and record bandwidth usage. It may be necessary to check whether tests conflict in terms of reservations; exclusive use may need to be specifiable as well as inclusive use (exclusive use is not supported by the bandwidth reservation system).
Bandwidth reservation issues: how much bandwidth is needed to make it worth while (one suggestion was 200Mbps), what other parameters matter (e.g. latency/jitter), how many flows, how interactive is the setup?
Need to define a distributed measurement system to gather performance data which will be used by many apps for debugging. This includes: defining what data to gather; defining methods for data collection; defining the format of the data; defining where data is to be collected; defining mechanisms for researchers/applications to query and analyze data in near real-time; and providing long-term storage of performance data.
We also need to make applications aware of the availability of performance data: work with the combustion corridor to integrate and validate the architecture. Need correlation of measurement data with application results.
Open issues include: authentication/access control to use Testbed resources; directory services; data access / management; effort requirements; configuration management.
Another concern is the availability of passive monitoring tools at speeds above OC3. OC12MON is hard to get; Boeing has something, and there is the expensive Smartbits, which supports OC12 (LBNL has one).
A discussion ensued as to where to start. This devolved into which sites are involved with the most applications, which indicated ANL, LBNL & U. Wisconsin. Then there was a discussion of which applications to focus on first.
Several working groups were set up:
Authentication (Engert/Johnston); Instrumentation/Measurement (Tierney); Application requirements (Steve Lau, Dean W); Testbed definitions (Nitzan, Winkler, Steve Wallace); Testbed site host/LAN interconnect (Millsom, DeMar); Resource Management (Foster, Winkler).
There is a PI meeting October 4/5 in Washington.