Author: Les Cottrell. Created: April 19
This was held at the Institute for Pure and Applied Mathematics at UCLA. There were about 75 attendees. Walt Willinger was the chairperson. Paul Barford of U Wisconsin gave a talk to set the stage.
http://www.psc.edu/~web100/pathprobe, http://www.web100.org/
Need to attack the "wizard gap" between the experts who can debug/tune TCP to get maximum performance and what the non-expert can get. Much of the difficulty of addressing this is the lack of transparency of TCP. So they have developed an instrumented TCP stack in which one can dynamically read and set variables. Want to make this an IETF standard and get vendors to implement it.
They have implemented 5 groups of variables. Stages of impact of Web100: first round reduce stack bottleneck, indirectly fix paths, indirectly fix applications; second round change user notion of "big" and raise loads everywhere.
Pathprobe is a pre-alpha tool to test and diagnose paths, currently under rapid prototyping; a joint effort of Web100 & Net100 with mixed NSF & DoE funding. Measures all parameters of the model Rate = MSS*0.7/(RTT*sqrt(loss)), and the model should agree with the actual performance. Do all parameters agree with expected values? Excessive RTT suggests bad routing. Tests from a Web100 sender to any TCP discard server; scanning different windows gives data rate, run length (transmitted pkts/recoveries) & RTT. One signature of reordering is a large number of dup ACKs with no retransmissions. Pass/fail statuses greatly ease non-expert interpretation: is there sufficient data, is the window clamp too small, did the path fail because it is too lossy, did the path meet the rate but remain too lossy for a longer link? Want to sectionalize a long path S->R by testing A>B, A>C, A>D etc.; each section has to run at the target data rate and meet the model parameters for S>R. Pathprobe is likely to be the Web100 killer app: it pushes Web100 out to needy users, collects interesting pathological paths, and proves the instruments in the TCP MIB. At present it is a fast-moving target.
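The model check pathprobe performs can be sketched numerically. A minimal illustration of the formula as quoted above; the MSS, RTT, and loss figures here are hypothetical, chosen only to show the arithmetic:

```python
import math

def mathis_rate_bps(mss_bytes, rtt_s, loss):
    """Model as quoted in the talk: Rate = MSS * 0.7 / (RTT * sqrt(loss)), in bits/s."""
    return mss_bytes * 8 * 0.7 / (rtt_s * math.sqrt(loss))

# Hypothetical path: 1460-byte MSS, 70 ms RTT, 1e-4 loss probability
print(f"{mathis_rate_bps(1460, 0.070, 1e-4) / 1e6:.2f} Mbit/s")  # 11.68 Mbit/s
```

If the measured throughput falls well below this prediction while MSS, RTT and loss all look normal, that points at a limit outside the model (e.g. a window clamp).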
Development of network-aware operating systems. DoE funded at $1M/yr for 3 years. Measure & understand e2e net & app performance, tune apps (grid & bulk transfer). Components: leverage Web100, network tool analysis framework (NTAF), tool design & analysis, active net probes and sensors, network metrics dB, transport protocol analysis, tuning daemon to tune network tools based on findings. Web100 is based on Linux 2.4 with 100+ variables per flow. Net100 adds Web100 to iperf and a java applet (see http://firebird.ccs.ornl.gov:7123/). Working on a daemon (WAD) that will tune a network-unaware application, work around net problems, query the dB & calculate optimum TCP parameters. Version 0 WAD for ORNL to NERSC showed no tuning gets <10Mbits/s, hand tuning gets up to 70Mbits/s, WAD got 60Mbits/s. Use NetLogger to log derived events; now working on tuning parallel streams. Looking at ways to get WADs to communicate with each other. Need to try and avoid losses (choose right buffer size, ECN, TCP Vegas, reduce bursts), faster recovery from losses (bigger (jumbo) frames), speculative recovery (D-SACK), modified congestion avoidance (AIMD, TCP Westwood), autotune (WAD buffer sizes, dupthresh, Del ACK, Nagle, aggressive AIMD, virtual MSS (e.g. add k segments per RTT rather than 1 MSS per RTT), initial window, ssthresh, apply only to designated flows/paths), non-TCP solutions (rate based, ?).
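The "calculate optimum TCP parameters" step presumably starts from the bandwidth-delay product. A minimal sketch of that calculation (the path figures below are hypothetical, not the actual ORNL-NERSC numbers):

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: the TCP buffer needed to keep the pipe full."""
    return int(bandwidth_bps * rtt_s / 8)

# Hypothetical wide-area path: 100 Mbit/s at 70 ms RTT
print(bdp_bytes(100e6, 0.070))  # 875000 bytes, far above a typical 64 KB default
```

This is the kind of gap (875 KB needed vs. 64 KB default) that explains the <10 Mbit/s untuned result quoted above.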
Have Net100 probes at ORNL, PSC, NCAR, LBL, NERSC; a preliminary schema for network data; initial web100 sensor daemon & tuning daemons; integration of Wu Feng's DRS (Dynamic Right Sizing) & Web100. In progress: TCP tuning extensions to the Linux/Web100 kernel, analysis of TCP tuning options, deriving tuning info from network measurements, tuning parallel flows and gridFTP. Future: interactions with other network measurement sources, multipath/parallel selection.
Want to look at real jumbo frames, ECN (supported by Linux, instrumented by Web100), drop tail vs RED, SNMP path data (where are losses occurring, what kind of losses), SNMP mirrors (MRTG).
NTAF allows easy configuration and launch of network tools (iperf, pchar, pipechar; will add pathrate & pathload, netest and host stuff), and augments the tools to report Web100 data. Collects & transforms tool results into a common format. Archives results and saves them for later analysis; all tools are instrumented with Web100. Want to compare iperf with applications, determine the advantage of parallel data streams, analyze variability over time, compare tools. Have an archive in place in SQL. Working on an XML format; also support the NetLogger format. Will be tied into the GMA publish/subscribe mechanisms. Have an analysis form to select the time frame, application and way to report. Have 5 sites with a full mesh. Also provide access to ascii data. Can look at the data interactively with NLV, which allows looking at Web100 variables with data at 1/sec.
NTAF issues: reliability — tools hang or crash, so need a management thread/tool to timeout, and need non-blocking I/O; servers hang or crash (use cron to restart); archive dB performance; SQL table design; buffering issues (pipelining to guarantee that applications & sensors never block when writing to the archive); fault tolerance with redundant archive destinations.
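The hang/timeout issue might be handled along these lines. This is a sketch only, not the actual NTAF code; the tool command and timeout are placeholders:

```python
import subprocess

def run_tool(cmd, timeout_s=60):
    """Launch a measurement tool, killing it if it hangs (the NTAF timeout issue)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        return result.stdout
    except subprocess.TimeoutExpired:
        return None  # tool hung; caller can log a failure and retry later

print(run_tool(["echo", "pretend iperf output"], timeout_s=5))
```

A real framework would run this from a scheduler and push the captured output through the common-format transformation before archiving.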
10 clients each transferring 10MB/s = 6200 ev/s = 313 KB/s (1.1 GB/hr).
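The quoted figures can be sanity-checked: the archive rate implies roughly 50 bytes per logged event, and the hourly volume follows directly. A worked check of the arithmetic as noted:

```python
# Figures as quoted: 10 clients x 10 MB/s of transfers -> 6200 events/s of
# monitoring data at 313 KB/s into the archive.
events_per_s = 6200
archive_rate_Bps = 313_000
print(f"{archive_rate_Bps / events_per_s:.0f} B/event")  # implied ~50 B per event
print(f"{archive_rate_Bps * 3600 / 1e9:.2f} GB/hr")      # ~1.13 GB/hr
```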
Pipechar is a tool to measure hop-by-hop capacity and congestion through the network. Faster than pchar or pathrate, and more accurate on fast networks; not always accurate, still refining the algorithms; results are affected by host speeds; client side only; ~100kbits/s load. Uses UDP/ICMP packets of various sizes and TTLs; uses packet trains and assumes dispersion is inversely proportional to available bandwidth. In general pretty accurate.
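The dispersion assumption can be sketched as follows; this is the generic packet-train idea, not pipechar's actual algorithm, and the packet size and gap are illustrative:

```python
def capacity_from_dispersion(pkt_bytes, gap_s):
    """Packet-train estimate: the narrow link spreads back-to-back packets,
    so capacity ~ packet size / inter-arrival gap at the receiver."""
    return pkt_bytes * 8 / gap_s

# 1500-byte packets sent back-to-back, arriving 120 microseconds apart
print(f"{capacity_from_dispersion(1500, 120e-6) / 1e6:.0f} Mbit/s")
```

Host-speed effects enter exactly here: any timestamping error in the measured gap translates directly into capacity error, which is why fast hosts matter on fast networks.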
Netest is a new tool to provide non-wizards with enough information to know what to expect; an active client/server tool; puts a fairly heavy load on the net, about 5Mbps for 1-2 mins but with faster bursts. Gives single/multi-stream UDP throughputs, then does TCP to give the optimal window size and single-stream throughput, and makes a recommendation on the number of streams.
Have a self-configuring network monitoring project; leverages the passive Bro project to monitor traffic and characterize application streams as they cross the network. Uses passive optical taps. Needs fast cpus and tricks with GE drivers; recommend Syskonnect cards; add time stamps in driver code. Synchronize via NTP. Use interrupt coalescing. Install monitor hosts at critical points in the network (i.e. next to key routers); passively capture packet headers of monitored traffic, configured & activated by application end hosts (i.e. like remotely activated tcpdump); end hosts can only monitor their own traffic. Have 2 boxes at the moment (NERSC & LBL), adding a third at ORNL. No results yet. Contact Craig Leres or Brian if you want to build your own box. A writeup is needed.
Web based tool based on ORNL applets with modified analysis; also modified NLANR iperf to support 2 new options (-ee, -R) to print Web100-derived details and get receiver stats too. The tool is web based and runs on any client with a java-enabled web browser. IDs what is happening, but cannot tell if a 3rd party is performing properly; it is only end-to-end, can't tell you where the problem is, and is only relevant for the particular ends. Sends 10 sec of data from client to server, then 10 sec in reverse, and prints out a summary of link speed and whether duplex is full or half. A statistics button gives more details: snd/rcv throughput; details for 5 configuration tests (link speed, duplex, congestion, excessive errors, duplex mismatch); a throughput limits section (% sender/receiver/network limited, RTT, % loss, % out of order); tweakable settings (Nagle, SACK etc.). Plus more details: individual TCP counters collected by Web100, conditional test parms, theoretical limits analysis (bw*delay, loss rate, buffer sizes); the server logs all counter variables used for the condition tests.
Current development have 3 servers at ANL (450MHz P3s), servers support both web based tool and modified iperf based tools. Looking for people who might be interested in tool.
Demonstration of web tool running for 10 secs (http://miranda.ctd.anl.gov:7123) showing the two end throughputs, then showed the statistical analysis, also showed all the web100 variables. Looks for cwndtime >30%, maxssthresh >0, PktsRetrans/sec > 2 ..., he claims this works quite reliably. Sees a lot of transitions between sender limited and receiver limited states.
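The thresholds quoted above can be sketched as a toy classifier. This is a hypothetical wrapper around three of the quoted conditions, not the tool's actual logic, and the input values are made up:

```python
def diagnose(cwnd_time_frac, max_ssthresh, retrans_per_s):
    """Toy classifier using thresholds quoted in the talk (hypothetical wrapper)."""
    findings = []
    if cwnd_time_frac > 0.30:
        findings.append("sender congestion-window limited >30% of the time")
    if max_ssthresh > 0:
        findings.append("ssthresh was set, i.e. congestion was experienced")
    if retrans_per_s > 2:
        findings.append("excessive retransmission rate")
    return findings or ["no obvious problem"]

print(diagnose(0.45, 12000, 3.1))
```

The real tool combines many more Web100 counters, but the pattern is the same: each Web100 variable feeds a simple threshold test whose pass/fail result a non-expert can read.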
Future: a 4th server at ANL for external users; explore Gigabit Ethernet issues; explore wireless; talking with Tom DeFanti about Starlight and Starlight partners/collaborators.
Caveats: the server needs time between tests (~90 secs); if 2 machines try simultaneously, the 1st succeeds and the 2nd waits or fails. Analysis messages need to be validated.
Update on the program, new developments: early career PI awards ($100K over 3 years) in applied mathematics, CS & net research; applications due in the next week; expect a large number of profs at universities in the early stages of their careers. Coming workshops: LSN workshop on cyber security and info assurance (joint agencies) for next generation cyber security; LSN sponsored workshop on NG transport protocols (where is TCP going, is a new protocol needed). ESCC/I2 joint techs meeting in planning (at Boulder, end of July/beginning of August); joint ESCC/net research PIs meeting in planning. The January meeting of the I2 joint techs in Miami will be co-scheduled with ESCC. This joint focus is to allow ESnet to focus on research & leading edge applications that cannot be done elsewhere. Ongoing research: hi-perf transport protocols, network measurement and analysis, cyber security. Program growth opportunities include Tbps optical networking, distributed cyber security, terascale storage and file systems.
Challenges are e2e (end-to-end) performance, network-aware applications, scalable cyber-security. Want to provide high performance to applications with current & future networks, plus R&D for ultra high speed nets. Near term: enhance existing network protocols to operate at high speed, e2e perf measurement & analysis, develop new advanced network services. Long term: harness abundant optical bandwidth, develop optical protocols. The e2e communities are ESnet, I2, and international communities; this means having to manage inter-network-domain issues. The objective is high throughput & advanced networking capabilities to enable distributed high end apps. Partners are ESnet application communities, network research/middleware, site networking groups, I2 & international collaborators. The approach is net measurement & analysis research, infrastructure, and toolkits; coordination of e2e at the various sites; net performance portals. A new call for proposals for a performance program will coincide with the termination of the current 3-year-cycle SciDAC projects. It will need to move into a space that is innovative.
Major focus today: TCP enhancements (30), net storage (1), grid security (1).
Components of E2E performance: net-aware apps, SANs & storage performance issues; host issues: TCP, NIC & MTU, OS & I/O; core: routing, traffic eng ...
Want to make applications host-aware and network-aware (i.e. tie them together) by the end of 2003. By 2005 want all the coupling to be transparent.
Network measurement infrastructure (NMI): proposed NIMI-based measurement platforms deployed at each site (SLAC, ANL, ORNL, CERN, UMich). Need to coordinate measurements: have all measurements colocated in a single platform and schedule, run as needed, then make them available to all who need them. Also want to isolate whether a problem is at SLAC, at CERN, in Abilene, GEANT, ESnet, IXPs etc., so want boxes placed at exchange points to diagnose along the path. NIMI is not funded adequately and needs re-funding so most of the tools can be put in that infrastructure; also hope ESnet will deploy it in their infrastructure. Thomas wants the NIMI structure to be developed and made operational, then it is up to each lab to procure and maintain etc. Wants to set up an E2E perf group in ESnet so this can be done between labs and linked to the I2 E2Epi.
TCP futures: TCP works everywhere, so need to work on enhancing TCP with a unified implementation and enhance existing TCP implementations; the idea is to derive classes of TCP for different transport media etc. In parallel, for further into the future, develop a new TCP, an Internet-based transport for all-optical fiber. Talking about overlay networks between big science centers. Solve the problem in a community that needs higher speeds (HENP, ESnet, NASA), then take it to the IETF for standardization. The different classes come from a generic implementation of TCP from which Reno, wireless, Tbits/s, Vegas versions etc. are generated. Otherwise it is hard to cover emails or web pages lasting for seconds versus file transfers lasting for days.
Talked about optical burst switching. For very high throughputs, allow moving bursts in sizes of 64kbytes, 2GB, 4GB. Use forward error correction. In addition have a secondary channel with just-in-time scheduling to set up the transfer. The protocol has been designed; NSA and DARPA are taking the lead on this. SGI is working on it. DoE is being asked to test it for large transfers. Wants to set up an optical burst switching testbed.
Then went on to show an overlay network which provides depots to move data, e.g. from CERN to Sunnyvale, allowing access from west coast sites over slower links using TCP. This enables a high speed core with optical burst switching and lower speed edge connections.
Program opportunities: Ultra high speed data transfer protocols (this would sit well when going to congress to get money). Distributed cyber security for large scale collaboration. High performance WAN file storage/systems.
StarTap was a single ATM exchange point to connect international links to. But nobody was willing to pull fiber to the coasts. It did enable transit traffic (e.g. Europe to Japan). Has 15 connections. StarTap funding is running out after 5 years. StarLight = StarTap NG. Now have a production 1Gbps interchange at Chicago, looking at OC48 (2.5Gbps) and OC192 (10Gbps). Located at Northwestern so vendor neutral. Access to the carrier POP has been a major headache with StarTap; could not find a carrier hotel that was suitable (i.e. had all the carriers coming in). ANL does engineering and brainy stuff. The space is replacing a PBX being decommissioned. Currently present: Ameritech, AT&T, QWest, Global Crossing (expect someone else will buy out its services), Teleglobe. Policy-free 802.1q VLANs are created between peering partners. StarLight benefits connecting networks: hi speed peering with large MTU, available colo, belief in multicast support. Equipment is a Cisco 6509 with lots of GE ports, Juniper M10 & M5 with GE and OC12, a Cisco LS1010 with OC-12, an ESnet IPv6 router, a data mining cluster with GE NICs, SURFnet's Cisco GSR 12000 & 15454. Two OC12s between StarTap and StarLight. ESnet and NASA are both located (now) at StarLight; vBNS is still at StarTap. Also have a connection to the QWest POP at 455 N. Cityfront Plaza where TeraGrid, NREN, ESnet, Abilene, I-WIRE show up. Have 36 strands between QWest and StarLight. Hope this summer to upgrade the link between QWest and StarLight by using DWDM at 64 channels; they have the DWDM gear. 10 Gbps transfer service. GE is easy and simple. SURFnet has a research project with Cisco between AMS & CHI with a 15454 on each end and wants to drive it to do interesting things. Also CERN & the UK are interested in joining in. Want to research what applications could take advantage of such a setup.
NSF TeraGrid facility awarded by NSF to put a distributed computer center in place between ANL, SDSC, & NCSA. 40Gbps between the QWest PoPs in LA & Chicago. The problem has been to get it to the end sites. Goal is to get it in place by June; expect ANL to QWest & NCSA in May. iGrid in Amsterdam, September 24-26. Looking for applications that can use more than OC48 (2.5Gbps) and 10 Gbps. Will be at the Amsterdam Science Center (same location as the Amsterdam exchange point); easier than connecting up a convention center. Need to say within the next 60 days what is needed.
Users voluntarily mark their traffic to be "nice", i.e. back off in case of competition. SLAC traffic is heavily bulk throughput. Used I2 recipes for QBSS. Three test beds: 10Mbps bottleneck, 100Mbps and 2Gbps. Showed plots of how quickly QBSS backs off and restores; aggregate throughput is maintained.
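Marking traffic as scavenger class from an application amounts to setting the DSCP bits on the socket; a minimal sketch using the QBSS codepoint (8) via the standard socket API (the I2 recipes mark at the host or router; this is just the per-socket variant):

```python
import socket

QBSS_DSCP = 8  # QBone Scavenger Service codepoint (binary 001000)

def mark_scavenger(sock):
    """Mark a socket's traffic as scavenger class: the DSCP occupies the
    upper six bits of the old IP TOS byte, hence the shift by 2."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, QBSS_DSCP << 2)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mark_scavenger(s)
print(s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 32
s.close()
```

Routers configured for QBSS then give such packets only leftover capacity, which is why the plots show the marked traffic yielding quickly and recovering when the link frees up.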
Goal: enable researchers and faculty to obtain high performance from the network. The four initiative foci are applications, host OS/tuning, measurement infrastructure, performance improvement environment. The role of Internet2 is to capitalize on existing work and bring people together. Want to disambiguate where the problems lie. Instrumentation options have the goal of heterogeneous instruments. Quilt allows measurements to be made from places across the network with archived data, e.g. http://noc.greatplains.net/measurement. Use existing machines where possible (AMP, Surveyor), beacons (H.323, FTP), packet reflector. The packet reflector has agents at various sites to allow packets to be sent & reflected. Want access to standard operational information from key places along the path. Need an end-to-end analyzer to put the data together and interpret it. Matt can get 3 sec interval loss data for Abilene. Matt is the measurement czar for the SC2002 show floor network.
Aside from Bill Wing: in future there will be much closer collaboration between I2 & ESnet, since much of the data required by universities is from the DoE Labs.
The Abilene upgrade will put GE switches in the rack connecting into router interfaces. There will be dedicated measurement machines, with 1/4 of the space for worthy experiments, e.g. 3-6 month projects. Hi-bw flow tester, latency tester (think AMP/Surveyor/Skitter), CDMA timing on board, local measurement collector. A smattering of E2E performance initiative boxes. Will provide SNMP, flow data, routing (BGP & IGP); conceptually a circular buffer to hold interesting data.
Project funded by DoE. Want to provide trusted authentication across domains. The risk model is difficult to determine: protect, but don't interfere too much. The holy grail is single login across domains etc. The vulnerabilities: PKI is a largely unproven technology (undergoing development); a compromised CA (improperly generated keys, improperly signed certificates); a compromised key; inadequate protection of private keys. Then there are applications that improperly check assigned keys (e.g. don't check for revocation). Compromise of machines that generate key pairs, generate certificates, store private keys, or verify authorization for local resource access: is it worse than current practice? No secure time source, which is important for certificate expiration; effective revocation may be OK if there are no long term dependencies (e.g. signing).
Threats in HENP: have an active, adaptive opponent; unauthorized use of resources, use for DDOS attacks (large net pipes), use for cracking keys; theft of capabilities (identity); destruction of data; public embarrassment (funding impact).
GSI says each site must implement a mapping from the grid account space (X.509 + proxy certificates) to the local account namespace. Local account information remains at the site. Requires a distributed grid database; the grid repository maintains a list of trusted third parties to provide authentication services; sites may refer to the grid repository and augment that list with their own trusted parties (not real time).
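The per-site mapping can be sketched as a lookup table keyed by the X.509 distinguished name. This is modeled loosely on the Globus grid-mapfile format; the DNs and account names below are made up for illustration:

```python
# Each line maps a quoted grid identity (X.509 DN) to a local account name.
GRIDMAP = '''"/O=Grid/OU=example.org/CN=Alice Researcher" alice
"/O=Grid/OU=example.org/CN=Bob Analyst" bob'''

def map_dn(dn):
    """Return the local account for a grid identity, or None if unmapped."""
    for line in GRIDMAP.splitlines():
        # the DN itself contains spaces, so split at the last space only
        quoted_dn, _, local = line.rpartition(" ")
        if quoted_dn.strip('"') == dn:
            return local
    return None

print(map_dn("/O=Grid/OU=example.org/CN=Alice Researcher"))  # alice
```

Because this table lives at the site, local account information never leaves it, which is the point made above.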
Delegation/proxy: reluctance to send long-term (years) credentials across the Grid to repeatedly respond to authentication requests. Instead create short-term, possibly limited-capability credentials (a key-pair) signed by the long-term credentials.
Community Authorization Service: many users have a common level of authentication to a group of resources; provide a single access ID to the site for authorization, and group the resources to reduce the auth overhead. Have to preserve accountability: need protection from errors that destroy results, and the ability to backtrack to a single user.
CAS overview gives a single GSI credential for whole community, pre-existing relationship with resource provider. CAS manages its own access control, access to resources it controls. User requests access from CAS.
New DWDM technology drivers: high-powered unidirectional amplification; Raman amplification, which allows amplification along the way; wavelength add/drop optical muxes are appearing; higher TDM rates going from 10 => 40 Gbps (but lose a considerable number of DWDM channels, so there may be no net gain, and need something capable of accepting 40Gbps); dynamic gain equalizing; high levels of FEC give a few dB of gain. Today conventionally need a regen every 400km. Between regens need optical amps (cheaper), and only one is needed for multiple wavelengths. Raman enhancement puts 1200km between regens (the money is in the regens). Ultra long haul DWDM exists but is not deployed; it gives 2000km, expected 2003. That is long enough to get between end-points without regens; still need amps every 50km. Removal of regens saves 30% of total cost.
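The regen economics follow directly from the spacing figures quoted above; a small worked example (the 4000 km route length is hypothetical):

```python
import math

def regens_needed(route_km, regen_spacing_km):
    """Regenerators required between the two terminals of a route
    (the terminals themselves are not counted)."""
    return max(0, math.ceil(route_km / regen_spacing_km) - 1)

route_km = 4000  # hypothetical long-haul route
print(regens_needed(route_km, 400))   # conventional 400 km spacing: 9 regens
print(regens_needed(route_km, 1200))  # Raman-enhanced 1200 km spacing: 3 regens
```

Since regens are per-wavelength and amps are shared across wavelengths, cutting 9 regens to 3 is where the quoted ~30% cost saving comes from.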
Single fiber capacity going 80Gbps > 120Gbps > 150/300Gbps (i.e. 30 wavelengths). Then either go to more wavelengths (300/600) or to the C band (but new amps) (350/700); in 2003-2004, 0.7/1.4 Tbps > 1.4/2.8 Tbps (C+L bands, 25GHz spacing) with Raman.
Different types of fiber. With the fiber QWest picked in 1996, want to keep dispersion as flat as possible so you don't get different velocities for different wavelengths. To help, use dispersion compensators; but they are costly and increase losses, so more amps. LEAF fibers have problems with edge channels and extra dispersion, so fewer channels.
Two IP visions of transport: transport as dumb pipes on an unprotected lambda mesh, with MPLS fast reroute on the routers providing protection switching; or smart optical transport using OXCs over a lambda mesh to provide protection switching, with GMPLS running on smart OXCs and routers allowing dynamic provisioning.
High speed routers are taking a lot of power, and power management is becoming necessary: more A/C, bigger batteries. It is becoming a gating factor.
QWest has SONET rings with 10 wavelengths, based on STS1 (51 Mbit) or OC3 granularity. Moving to a lambda mesh, which is more flexible, using OC48 & OC192 lambda granularity. Use collector rings to get from the customer to the POP, then use the lambda mesh for the backbone.
Roll your own network, say for a 3-hub net: NY, CHI, LA. The system could use existing design rules or Raman design rules; a 2-fiber system. Terminals with 10Gbps interfaces, no add/drop multiplexing. Costed such a model using Raman design rules and came up with $22M. Fiber at $2K/mile = $12M. On 3K-mile links need 50 ($800) amps, which need colo ($480K); need terminals in CHI (2), NY & LA ($200K each = $800K). Amps $17K, need 35; 11 regens ($250K each); equipment maint ($145K); fiber maint $150/mile; ops $500 for each of 50 colos. NRC $21.6M, recurring yearly $2.9M. Adding a wavelength costs $50K per TX/RX card (14: 5 regens and two ends with two for each location), i.e. $700K, plus maint $11.7K for 12 months, $140K.
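As a sanity check, the quoted one-time figures can be summed. Note the notes list both "50 amps" and "Amps $17K need 35", so this rough roll-up does not reproduce the quoted $21.6M NRC; every figure should be treated as approximate:

```python
# Non-recurring costs using figures as noted above (amp count taken as 35).
nrc = {
    "fiber, 6000 mi @ $2K/mi": 12_000_000,
    "amps, 35 @ $17K": 35 * 17_000,
    "amp colo": 480_000,
    "terminals, 4 @ $200K": 800_000,
    "regens, 11 @ $250K": 11 * 250_000,
}
print(f"${sum(nrc.values()) / 1e6:.1f}M non-recurring")  # $16.6M
```

The ~$5M gap between this sum and the quoted $21.6M suggests some per-item figures in the notes (e.g. the "($800)" amp entry) are garbled or incomplete.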