Report date: February 25, 2003
Dates of travel: Jan 23 - Feb 5, 2003
Travellers: Roger L. A. Cottrell
Name of others traveling with the traveler: None
Position: Assistant Director, SLAC Computer Services
Employing Organization: Stanford Linear Accelerator Center
Address: SLAC MS 97, 2575 Sand Hill Rd., Menlo Park, California 94025
FTMS trip number: 200332613
Travel destinations: NIKHEF/NV Tropen Hotel, Amsterdam, Netherlands; CERN, Geneva, Switzerland.
Vacation dates: Saturday January 25 - Sunday January 26, 2003; Saturday February 1 - Sunday February 2, 2003.
Purpose of trip: Attend the 1st SCAMPI Workshop, make a presentation on "Monitoring High-Performance Networks", learn of SCAMPI activities, and meet European peers in advanced network monitoring; visit NIKHEF to meet with the network staff there and to discuss and assist their joining our testbed. Attend the 1st International Workshop on Protocols for Fast Long-Distance Networks (PFLDnet), present a talk on "High Performance Active End-to-end Monitoring", listen to other presentations on high-performance network protocols, and meet with peers in the high-performance networking arena. Visit the CERN networking group, meet with peers there, share information on our joint testbed collaboration, and extend measurements.
Funding sources: DoE
Total Costs for DoE: $2489.03
Duty days: 10
The presentations and discussions on the latest TCP stack developments (Grid DT, Scalable TCP, FAST TCP and HS TCP) are extremely important to follow for HENP and Data Grid activities. The new stacks appear to have a solid theoretical foundation, and from our own measurements FAST works well up to Gbit/s rates. Now that we know about Scalable and HS TCP, we will test and compare these stacks in the future. It is also apparent that jumbo frames work well. One of the next steps will be to compare the various TCP stacks at high speeds, including above 1Gbits/s. The discussions on alternatives to TCP for data transfer (tsunami and RDMA) were also valuable. It will be very important to understand the limits of TCP and what can be done about them (e.g. CRC calculation off-loading, interrupt coalescing, zero-copy, protocol off-loading to the NIC, etc.), and if and when it makes sense to use alternatives.
I was able to successfully re-invigorate our collaborations with the DataTAG testbed and the European DataGrid measurement community. I gained many insights into TCP details and developments, as well as emerging possible alternatives. Much of this will be extremely valuable for future proposals and projects.
At NIKHEF I met Cees de Laat, head of the computing/networking group there; Antony Antony, who supports the network measurements at NIKHEF; and Jason Lee of LBNL, who is on a year's leave of absence at NIKHEF.
At CERN I met Richard Hughes-Jones of Manchester University, Tom Kelly of Cambridge University, Sylvain Ravot of Caltech, and Olivier Martin, Ben Segal and Jean-Philippe Martin-Flatin of CERN.
The morning was mainly devoted to presentations on the proposed SCAMPI measurement box hardware and software. It will be able to make high-speed passive measurements and may use the DAG cards designed by the University of Waikato WAND group (marketed by Endace). Unfortunately, due to a Microsoft Windows XP feature that only allows admin access to a laptop whose security log is full, I spent most of the morning re-creating my talk on Jason Lee's laptop. As part of this I also found a way to use a Linux recovery disk to access WinXP files in read-only mode, and was later able to recover my presentation from my laptop and email it to someone else for presentation. By mid-afternoon I was able to contact someone at SLAC who walked me through re-accessing my laptop, clearing the offending security file and changing its configuration to prevent a repeat. Luckily my presentation was late in the day, so by that time I had recovered everything. However, the notes below from the talks only cover the afternoon.
DataGrid is one of the major European Grid research collaboration projects. It has a Work Package 7 that is concerned with network monitoring. They are using GEANT and need monitoring for provisioning, capacity planning, problem identification and resolution, forecasting and optimization, and for resource brokers. They have chosen to use active monitoring, since the end sites have their own policies, needs etc. They make one-way delay measurements using the RIPE package, round-trip measurements using PingER (from SLAC), iperf, UDPmon and router netload. They are running 5 RIPE engines and run PingER, iperf and UDPmon at measurement nodes (7, with a full mesh). They use the Public Coordination Protocol (PCP), an extension based on NWS, to schedule measurements so they do not stamp on one another. The results are stored in MDS/LDAP. They have an elegant topology map showing who is up/down etc. They use a network cost function for resource brokers to decide where to schedule jobs.
Stephen discussed the design aims of the WAND DAG cards (passive monitoring cards). Originally they were used to verify simulation traffic and router models. They were built as custom hardware since there were unusual interfaces (e.g. OC3-ATM). They put considerable effort into quality time-stamps and full capture capability. Full capture is becoming increasingly harder at today's high speeds (e.g. 10Gbps); typically they only do header capture. A limitation is bus speed: e.g. 64-bit 66MHz PCI peaks at 528MBytes/s, and the actual offering is 60-70% of that, i.e. ~350MBytes/s. Even then they have to carefully optimize per-packet overheads and avoid dynamic buffer management, the network stack, high-rate interrupts and copying, all of which results in a light-weight capture process. They use a large 128MB circular buffer, memory-mapped to user space to avoid copying. They can easily write to disk at 32MBytes/s; in one hour they can capture 100GBytes. PCI-X (64-bit 133MHz) allows 1064MBytes/s, or about 80% of a 10Gbits/s load, but the PCI bus is only 80% efficient, so one gets about 70% of the load. A new PCI-X 2 bus is coming which will double the speed (266MHz), which will help; however, the FPGAs will not run at 266MHz. They hope to be able to capture 100% of the header load, assuming an average packet size of 200-600Bytes and a header of 40Bytes (i.e. a 5:1 to 15:1 reduction).
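The bus-bandwidth and header-reduction figures above check out with simple arithmetic; the following Python sketch is my own sanity check (the 65% efficiency factor is my midpoint of the quoted 60-70%), not code from the talk.

```python
# Back-of-envelope figures for the DAG capture discussion.

def bus_peak_mbytes_per_s(width_bits, clock_mhz):
    """Theoretical peak of a parallel bus in MBytes/s."""
    return width_bits / 8 * clock_mhz

pci = bus_peak_mbytes_per_s(64, 66)     # classic 64-bit/66MHz PCI
pcix = bus_peak_mbytes_per_s(64, 133)   # PCI-X
print(pci, pcix)                        # 528.0 1064.0

print(pci * 0.65)                       # ~343 MBytes/s actually deliverable

# Header-only capture: 40-byte headers from 200-600 byte packets
for pkt_bytes in (200, 600):
    print(f"{pkt_bytes}B packets -> {pkt_bytes // 40}:1 reduction")
```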
This talk mainly concerned the IETF IPFIX group activities. IPFIX is working on a flow measurement standard; the goal is to find or develop a basic IP traffic flow technology. The requirements RFC is almost done; the requirements relate to accounting, QoS, attack detection and analysis, and intrusion detection. The architecture work is starting. Cisco has proposed NetFlow version 9 to be the standard/standard-compliant. There are 5 competing proprietary technologies proposed for the standard (NetFlow, sFlow, CRANE, LFAP and DIAMETER). The protocol does not use UDP.
Common MRTG measurements are made at 5-minute intervals. He was interested in evaluating the effects of reducing the time intervals; the lower limit is set by time-stamp precision and physics. He is only using passive measurements, with COTS hardware and software, so as to enable easy reproduction. At Twente they have 2000 workstations connected at 100Mbits/s with a 300Mbits/s uplink; the long-term load is about 50%. Reading the SNMP MIBs takes too long for fine-grain measurements. He showed cases where the short-term average is much higher than the long-term average.
We discussed with Cees de Laat and Antony Antony how to add NIKHEF to the testbed.
I met with Sylvain Ravot and Olivier Martin. We discussed measurements with new TCP stacks and jumbo frames and exchanged many useful practical details. As a result of this we are now able to make reliable high-speed TCP measurements between Sunnyvale, StarLight and CERN using stock and FAST TCP and jumbo frames. I also worked with Sylvain, Richard Hughes-Jones and Antony Antony of NIKHEF to make more measurements and to put together the talk for the PFLDnet workshop.
I had discussions with Richard Hughes-Jones of Manchester University and Brian Tierney of LBNL concerning the final version of the GGF Network Monitoring Working Group recommendation on naming conventions for network measurements.
I had discussions with Nicolas Simar of DANTE about collaborating with the Internet2 piPEs group in deploying and making one-way delay and loss measurements for the purposes of troubleshooting.
He gave an introduction to CERN and the LHC project in particular concerning its requirements for high performance computing, networking and Grid technology.
www.icir.org/floyd/hstcp.html & www.icir.org/floyd/quickstart.html
Problems: sustaining a high congestion window. E.g. with a 1500B MTU and 100ms RTT, 10Gbps requires a window of 83,333 segments, with at most one drop every ~5,000,000,000 packets, or one drop every 1 2/3 hours, which is not realistic; so TCP cannot achieve the bandwidth available. There are solutions, e.g. parallel streams, fixing the cwnd, etc. Sally is looking for light-weight improvements (easy to implement); the other end of the spectrum is more powerful new transport protocols and more explicit feedback from routers. Response function: the average sending rate S, in packets/RTT, expressed as a function of the steady-state packet drop rate p. Stock TCP's average sending rate is (3/4)W packets/RTT; each cycle takes W/2 RTTs with one drop in ~(3/8)W^2 packets, so p ~ 1/((3/8)W^2) and S ~ sqrt(1.5)/sqrt(p) for drop rate p. HS TCP behaves like standard TCP when cwnd is low, and is more aggressive when cwnd is high (it uses a modified TCP response function); it can be regarded as behaving like an aggregate of N TCP connections at higher congestion windows. It changes the response function at 10^-3 loss and 50Mbits/s, and decreases by 1/8 instead of 0.5. Regular TCP has S ~ 1.22/p^0.5. Whether it gets standardized will depend on its relative fairness. She showed a graph of the relative send rates HS TCP/stock TCP; the relative fairness is ~0.11/p^0.32. Note one also needs to add in RTT and MTU.
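The arithmetic behind these numbers is easy to reproduce; this Python sketch (mine, using only the response functions quoted above) recovers the 83,333-segment window, the ~2x10^-10 tolerable loss rate, and shows the HS TCP/stock send ratio growing as loss falls:

```python
# Window and loss-rate arithmetic for 10 Gbit/s, 100 ms RTT, 1500 B MTU.
rate_bps, rtt_s, mtu_bytes = 10e9, 0.100, 1500

w = rate_bps * rtt_s / (mtu_bytes * 8)       # segments needed to fill the pipe
print(f"window ~ {w:,.0f} segments")         # ~83,333

# Stock TCP response function S ~ sqrt(1.5/p), so tolerable loss p ~ 1.5/W^2
p = 1.5 / w ** 2
print(f"needs p <= {p:.1e}")                 # ~2e-10, i.e. ~1 drop per 5e9 packets

pkts_per_s = w / rtt_s
print(f"one drop every {1 / p / pkts_per_s / 3600:.1f} hours")

# HS TCP's share relative to stock TCP (~0.11/p^0.32, as quoted):
for loss in (1e-3, 1e-5, 1e-7):
    print(f"p={loss:.0e}: HS/stock ratio ~ {0.11 / loss ** 0.32:.1f}")
```

Note the ratio is ~1 at p = 10^-3, consistent with the response function changing over at that loss rate.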
HS TCP in tail-drop environments (what is queue management doing in routers?): routers may have Active Queue Management (AQM), e.g. RED, where a packet is dropped before buffer overflow. In tail drop, assume TCP increases its sending rate by P packets per RTT; then P packets are likely to be dropped for each congestion event for that connection. As the number of flows increases, fairness increases (i.e. stock TCP gets closer to HS TCP), and it is even better if RED is used.
There are 3 parameters in HS TCP: the low window (38), the high window (83,000) with its associated loss rate (10^-7), and the decrease factor at the high window (0.1; stock TCP's is 0.5, and the decrease starts at 0.5 at the low window).
The HS TCP proposal needs feedback from experiments. Sally feels that HS TCP is the correct path, given backwards compatibility and incremental deployment. It needs to be robust to re-ordering, to make prompt use of newly-available bandwidth, and to start up with high congestion windows.
Limited slow-start stops the doubling of cwnd above a threshold. Quick-Start sends an IP option in the SYN packet with the sender's desired sending rate (in pps). Routers on the path decrement the TTL and decrease the allowed initial sending rate if necessary. Only underutilized routers would use this (a router could keep track of how many flows are using it and how much bandwidth is available). SYN and SYN/ACK packets would not take the fast path in routers (since they have options in the packets). Once one demands feedback from routers, how far does one go in adding new features?
Goals: incremental deployment; steps must be fundamentally correct and a long-term direction, not short-term hacks. Robustness in a heterogeneous environment is valued over efficiency.
Datagram Congestion Control Protocol (DCCP): unreliable data delivery but with congestion control; needs to use ECN; offers a choice of TCP-friendly congestion control mechanisms. Constraints: low overhead, must traverse firewalls.
What is XCP? The eXplicit Control Protocol (XCP, from Dina Katabi, Mark Handley and Charlie Rohrs) has goals of fairness, high utilization, small queues, and near-zero packet drops. The packet header contains: cwnd, an RTT estimate, and a feedback field (initialized to the desired increase in bytes in cwnd, per ACK). Routers modify the congestion field, and deal with efficiency and fairness separately. The efficiency controller computes the desired change in the number of arriving bytes in a control interval (i.e. an average RTT), based on spare bandwidth and persistent queue. The fairness controller uses AIMD to allocate the increase or decrease to individual packets. Policing agents can be used at the edge of the network for security.
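The efficiency controller's per-interval computation can be written down concretely; the sketch below is mine, with the alpha/beta stability constants taken from the Katabi et al. XCP paper (they were not given in this talk), and a made-up example link:

```python
# Aggregate feedback computed by an XCP router's efficiency controller
# once per control interval (one average RTT).

def xcp_aggregate_feedback(spare_bw_bps, queue_bytes, avg_rtt_s,
                           alpha=0.4, beta=0.226):
    """Desired change, in bytes, of traffic arriving in the next interval."""
    spare_bytes_per_rtt = spare_bw_bps / 8 * avg_rtt_s
    return alpha * spare_bytes_per_rtt - beta * queue_bytes

# Underutilized link (100 Mbit/s spare, empty queue): tell senders to ramp up
print(xcp_aggregate_feedback(100e6, 0, 0.08))       # positive feedback

# Saturated link with a 50 kB persistent queue: drain it
print(xcp_aggregate_feedback(0, 50_000, 0.08))      # negative feedback
```

The fairness controller then splits this aggregate across individual packets' feedback fields using AIMD, as described above.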
Addressing stock TCP congestion window performance problems. Loss recovery times at 10Gbit/s: with a ~170Kpkt window, loss recovery takes 4hr 35min, requiring a 5x10^-11 error rate. So the goal is effective high-performance link utilization; changes need to be robust in the face of bugs, packet corruption, reordering and jitter, and L2 switches; must not adversely damage existing traffic; and must not require manual tuning to achieve reasonable performance (80% of performance for 95% of the people is fine). Increase for each ACK: cwnd = cwnd + a; on loss: cwnd = cwnd - b*cwnd. This has the nice feature that recovery is quick and almost independent of RTT (recovery time ~ -log(1-b)/log(1+a) RTTs). For fairness, choose a legacy window lwnd: when cwnd > lwnd use the scalable algorithm, when cwnd <= lwnd use traditional TCP; fixing lwnd fixes a/b. Is it control-theoretically stable? Yes, if a < p_j(y_j)/(y_j * p'_j(y_j)), where y_j is the equilibrium rate of link j. He fixed lwnd at 16, looked at variance and convergence, and chose a value of b (1/8) that gives good loss recovery (a = 0.01, lwnd = 16). He has a patch against Linux 2.4.19 implementing the Scalable TCP algorithm.
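Scalable TCP's key selling point, fixed-length recovery, follows directly from the constants quoted above; this small Python check (mine, assuming a 200ms RTT to match the ~170Kpkt window) reproduces both the constant recovery time and stock TCP's hours-long recovery:

```python
import math

# Scalable TCP: per-ACK increase cwnd += a, on loss cwnd -= b*cwnd.
a, b = 0.01, 0.125   # the values quoted in the talk

# RTTs to recover after a loss, independent of window size:
# (1 - b) * (1 + a)^n = 1  =>  n = -ln(1 - b) / ln(1 + a)
n_rtts = -math.log(1 - b) / math.log(1 + a)
print(f"scalable recovery: ~{n_rtts:.1f} RTTs")      # ~13.4 RTTs
print(f"= {n_rtts * 0.2:.1f} s at 200 ms RTT, any link speed")

# Stock TCP needs W/2 RTTs: at 10 Gbit/s, 200 ms RTT, 1500 B MTU
w = 10e9 * 0.2 / (1500 * 8)                          # ~167 Kpkt window
print(f"stock recovery: ~{w / 2 * 0.2 / 3600:.1f} hours")
```

The stock-TCP figure comes out at ~4.6 hours, in line with the 4hr 35min quoted.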
DataTAG has 2048 packets of buffering and 40 packets/link (memory is expensive at high speeds). This seems OK in the presence of 4200 concurrent web users (400Mbits/s out of a 2.5Gbits/s link) across 3 hosts. The extra delay (assuming no loss) is a few msec (2048 packets maximum). Code from
Driven by advanced applications such as HENP. Showed that link utilization with TCP/RED drops off with capacity; FAST does much better. Feedback mechanisms can use loss probability (Reno) or queuing delay (Vegas). AQM: drop tail, RED, REM/PI, average queue length. There are 2 components: TCP adapts the sending rate (window) to congestion, while AQM adjusts and feeds back congestion information.
FAST uses end-to-end delay and loss, achieves any desired fairness expressed by a utility function, and gets very high utilization. Difficulties are due to effects ignored in theory and large window sizes. There is a big benefit from SACK. It is a sender-side kernel modification. Described the SC2002 demonstration and showed the breaking of the Internet2 Land Speed Record (LSR): throughput averaged over 1 hour with 1-10 flows (one per host pair), memory to memory; 21TB transferred in 6 hours.
SURFnet lambda (100ms RTT) to Chicago; iGrid2002 10Gbits to Chicago; DataTAG link CERN/Chicago. Sent 1000 1kB UDPmon packets; losses start from 1500 up through 5000 packets. By looking at the onset of loss one can estimate the buffer space at the bottleneck. Believes the most important part is the slow-start bandwidth discovery phase. Linux is very bursty in slow start; this can cause one to run out of buffer and cause loss. Tried pacing packets at the device level, and also tried HS TCP (Net100) and Linux queue management (txqueuelen). Using blocking of the CPU (i.e. a slow CPU) to pace packets can give improvements of a factor of 5. IFQ is a Net100 mod to disable buffer management between the NIC card and the CPU. Looked at HS TCP and disabling IFQ: HS helps a lot, IFQ helps a bit during slow-start. Looked at txqueuelen effects (the default is 100 packets). On DataTAG, HS TCP got 950Mbits/s single stream with a 1500B MTU, which agrees with jumbos and UDP. 980Mbps LSR over 196ms SNV-AMS with jumbos, no congestion, entire path OC192. Concludes TCP is not robust; is this an implementation, protocol or philosophical issue? Next steps: use 10GE NICs, and use traces to examine TCP behavior.
Understanding requirements: aggregate vs. single-stream flows; consider scalability vs. flexibility. Realities include hardware limitations and protocol issues. Hardware: beware of marketing claims. Interface rates should not exceed the fabric interconnect. 10GE is a parallel data stream as compared to Ethernet's serial data stream. IEEE 802.3ae was only recently standardized; IPG options will impact performance, and early products do not reach the maximum. The OC192 data rate (9.95328Gbits/s) still dominates the wide area. Expect BER = 10^-8 to 10^-11. European experience claims that at 2.5Gbit/s one can get BER 10^-13 to 10^-15. Need to watch BERs; the impact of loss rates on TCP is severe. OC192 POS supports up to 9192 bytes (IP 9180); GE 1500, 9K or larger (IP 9174). Most fibers use DWDM forward error correction, and BERs are quoted after FEC. On a single link one can get 10^-18, but after patch panels it can drop to 10^-14.
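To see why BERs matter so much to TCP, one can convert a BER into the packet loss rate TCP experiences; this sketch is my own first-order estimate (assuming independent bit errors and that any single bit error loses the packet):

```python
# First-order conversion of link BER into TCP-visible packet loss rate.

def packet_loss_rate(ber, packet_bytes):
    return ber * packet_bytes * 8

for ber in (1e-8, 1e-11):                 # the BER range quoted above
    print(f"BER {ber:.0e} -> packet loss ~ {packet_loss_rate(ber, 1500):.1e}")
```

Even the good end of the range, ~1.2x10^-7 with 1500-byte packets, is far above the ~2x10^-10 a single stock-TCP stream needs to fill 10Gbit/s at 100ms RTT, hence the need to watch BERs.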
iGrid2002 got very light utilization. US15 got 100Mbits/s on a GE link. Improvement required access via Web100 to the TCP variables. With Net100 HS TCP they got 700Mbits/s.
SC02: 10*10GE interfaces to booths, 15 10GE interfaces across LAN. Caltech/Caltech and LBNL demos.
TeraGrid is a distributed cluster to build 4 computer centers and make them look like one; it is an NSF award. Site access is all built; now the backbone is needed to complete it, in March. The Chicago-LA BER is higher than they would like on the QWest link. Aggregation of interfaces is a challenge, e.g. how do you connect 800 1GE PCs to 4 OC192s?
Pessimistic that lambda switching will remove the need for routers.
Showed many useful examples of the effect of jumbo frames and RTT on fairness, recovery from congestion etc. Also showed increase in RTT as buffer fills. Modified TCP to allow selection of additive increment and used it to adjust to provide fairness between different RTTs and different MTUs.
Changing from protective QoS to proactive QoS. One needs to consider the interaction of all computer system components; he concentrated on Linux. We need to increase TCP windows, but by how much, and what auto-configuration do we need? Linux is not RFC-standard slow start + congestion avoidance; it has many modifications. The interaction with the network adapter must be considered, as must the TCP cache. Pathload was often unreliable: 40/149 too low, 10/149 too high, 99/149 realistic. Found that as windows increased, the throughput started to drop. txqueuelen gave factors of 2-3 improvement for big windows. Watch out for the TCP cache of the window, which will lock the ssthresh, after which it goes into AIMD. A GSR had a 37MB buffer which held the RTT at 400ms for 2 seconds. One does not want to fill queues, so using a buffered pipe in general is not good; in some cases it can result in catastrophic throughput problems. Can the sender control filling the pipe by checking the RTT? Can the receiver better moderate its advertised window to assist? Found the internal chan_ses_window_default and on increasing it went from 10Mbits/s to 88Mbits/s. Many of the configuration options and interactions can have more effect than new TCP stacks.
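On "increase TCP windows, but by how much": the standard answer is the bandwidth-delay product. This small sketch is mine, and the example path (1Gbit/s at 150ms RTT, roughly transatlantic scale) is illustrative:

```python
# Bandwidth-delay product: the window needed to keep a pipe full.

def bdp_bytes(rate_bps, rtt_s):
    """Bytes in flight needed to fill the path."""
    return rate_bps * rtt_s / 8

window = bdp_bytes(1e9, 0.150)
print(f"{window / 2 ** 20:.1f} MBytes")   # ~17.9 MB, far above typical defaults
```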
Problems: TCP is sensitive to loss, and we have reached the best packet loss rates (10^-8 to 10^-11), which approach the BER rates; also, routers cost more than switches. There are newer storage strategies (YottaYotta, etc.). They are proposing end-to-end lightpaths, not using GMPLS or ASON: an Internet packet overlay network built on top of a telco switched network. The idea is to reroute elephant traffic onto a special path, using OBGP to redirect it; mice traffic (i.e. most of it) stays on the regular network. This complements the existing network, enabling elephants to do special things and get improved performance. They expect this for small communities, making selections at daily intervals. At iGrid2002 they demonstrated a manually provisioned end-to-end lightpath, transferring 1TB of Atlas Monte Carlo (MC) data from TRIUMF to CERN. They used iperf (10 streams, 940Mbps), bbftp (666Mbps) and tsunami (800Mbps disk to disk, 1.2Gbps disk to memory) with a 9000-byte MTU. Channel bonding of 2 GE seemed to work well on an unshared link. Dual CPUs will not give the best performance unless one has multi-threaded apps. They have beta 10GE cards from Intel and are looking more at GE bonding.
Feels like a real hack compared to proving TCP-friendliness; tsunami is not a replacement for TCP, but may meet a particular need over non-traditional networks. The motivation is to break away from TCP's assumption of an AIMD response to congestion. In high-speed backbones packet loss is usually not due to congestion; losses come from equipment and cabling, may not be avoidable, and TCP does not respond well. Alternatives include multi-streams. File transfer is a special case/problem domain (one knows the size in advance, has random access to the data, can have holes in the data ...).
Tsunami uses TCP for the control stream and UDP for data, with exponential backoff and re-growth. A prototype was used for the GTRN net test in May '02. Performance on COTS hardware is about 400-450Mbps without special OS tuning. They are using 3ware IDE RAID controllers with 4-6 drives per controller.
The user sets parameters to tune for a target error rate and target transmission rate. The client tracks missing blocks, which are put in a retransmission queue that is sent periodically to the server with error-rate information. It puts a delay between packets, using a busy wait, so it uses 100% of one CPU. There are a lot of tuning parameters: block size, speedup and slowdown factors, ... They want to reduce the number of user parameters to 2, and to integrate tsunami into Globus.
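The busy-wait pacing described can be sketched as follows; this is my own illustration (the 32kB block size is an assumption; the 400Mbit/s rate is the COTS figure quoted above), not tsunami's actual code, and it omits retransmission entirely:

```python
import time

# Tsunami-style rate pacing: a fixed inter-block delay enforced by
# busy-waiting, hence 100% of one CPU.

def inter_block_gap_s(block_bytes, target_bps):
    return block_bytes * 8 / target_bps

def send_paced(blocks, block_bytes, target_bps, send):
    gap = inter_block_gap_s(block_bytes, target_bps)
    due = time.perf_counter()
    for block in blocks:
        while time.perf_counter() < due:   # busy wait: precise but CPU-hungry
            pass
        send(block)
        due += gap

# 32 kB blocks at 400 Mbit/s => ~0.66 ms between block sends
print(f"{inter_block_gap_s(32768, 400e6) * 1e3:.2f} ms")
```

A sleep-based loop would free the CPU but, at sub-millisecond gaps, timer granularity makes the rate far less accurate, which is presumably why tsunami busy-waits.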
IP routers vs. Ethernet switches, statistical multiplexing switching vs. dedicated / Circuit switching.
Routers are more expensive than switches, may take more time, and may (be expected to) do a lot more (e.g. QoS). Can TCP make effective use of high-speed WANs? Dimensions: utilization, fairness, robustness. Desired consequences: repeatable, high performance, fair. Current FTPs are limited by disk speeds. Can improvements work for non-gargantuan flows? We know how to manage IP networks, and have a tradition of networks that serve high-end users, benefiting early academic adopters and the broader Internet community. Predictable performance of TCP is a big issue. Typically one over-provisions the core so as to push problems to the edge.
What we want from theory: explain what is observed, design alternatives, separate what is necessary from what is accidental. He started with a simple hierarchical model, with constraints on the product of bandwidth and number of connections; there are different solutions for high-speed connectivity and for large numbers of connections. Power laws are everywhere, and he expects a power law in the topology of the network. One needs to check whether this is statistically random: a random distribution is not a power law, it is not probable. Maybe customer demand plus bandwidth constraints are all that is needed to design networks; there are similar results from biological networks. The goal is incremental deployment in the current Internet. It needs a core theory not tied to any particular implementation. There are many networks (energy, phone, water, transportation ...) that need to converge. Current computer networks are mainly computer-to-human; they will increasingly become computer-to-computer, which will need control theory for feedback to keep them "stable". Biological networks and the Internet both work as hour-glass systems (e.g. the Internet is built on IP). The Internet is easier to measure than comparable systems in biology, ecology or econometrics. The self-similarity/long-range ordering falls out of the way the Internet is used, e.g. one gets a few small web pages while searching for something, then a big file when one wants real detail (i.e. it is a feature of navigational techniques, like TV channel surfing with very little time per channel, then a long time viewing an interesting channel). We want to be able to separate things, e.g. change TCP without changing web sites; decentralized asynchronous control at hosts and routers converges to near-optimal utilization; layering of protocols. The underlying driver is that reducing waste and fragility under constraints results in something else getting big, e.g. heavy tails.
Call setup delay may be a big deal; it can be reduced to microseconds. There is a cost to the signaling link (i.e. the bandwidth needed to set up the call). A challenge is how much bandwidth should be allocated to a file transfer: one could be greedy and then hurt later calls, and more bandwidth may become available later (after the start of the transfer). He looked at how to use an optical cloud, say from NY to Seattle. The idea is to have a 2nd NIC in the host that has a path to the optical circuit-switched network; one then needs a signaling/setup protocol. He uses the Scheduled Transfer (ST) ANSI standard (an OS-bypass implementation): the receiver pins space in memory and passes the address to the sender, and the sender includes the memory address in the ST header. One needs to be able to choose between the normal TCP/IP link and the optical link (based on file lengths and the relative throughputs achievable), with call blocking (e.g. if the requested throughput cannot be delivered, block the optical circuit and use the TCP/IP circuit).
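The TCP/IP-vs-optical choice comes down to amortizing the circuit setup time against the faster transfer; this toy decision rule is my own illustration of the idea, and all the rates and sizes in it are made-up examples:

```python
# Toy path-selection rule: optical wins only when the setup cost is
# amortized by the higher transfer rate.

def best_path(file_bytes, tcp_bps, optical_bps, setup_s):
    t_tcp = file_bytes * 8 / tcp_bps
    t_optical = setup_s + file_bytes * 8 / optical_bps
    return "optical" if t_optical < t_tcp else "tcp/ip"

print(best_path(1e9, 1e9, 10e9, 2.0))    # 1 GB file: optical wins
print(best_path(10e6, 1e9, 10e9, 2.0))   # 10 MB file: setup dominates
```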
Class covers many control circuits (e.g. AIMD).
They try to understand TCP performance using Web100/Net100 and NetLogger. Web100 has about 140 variables that one can interrogate and in some cases set.
Wanted to look at why cwnd changes. They look at congestionSignals (rexmit, FastRexmit and ECN), sendStall (the interface queue, txqueuelen, is full), MaxRwinRcvd (the receiver's advertised window), and congestion window validation. Net100's pyWAD was conceived as a tuning daemon; it simplifies logging Web100 events, including derived events, and one can specify the duration of a run. An important point is that one cannot analyze TCP in isolation from the OS etc. It is a very powerful visualization tool. The behavior is not understood in detail but is very interesting. Parallel streams may be a bad idea compared with well-tuned single streams. Web100 adds about 5% CPU overhead.
Using UDPmon for throughputs. He has a logic analyzer on the PCI bus to look at NIC behaviors. The Intel Pro/1000 has interrupt coalescing; NICs generally have checksum off-load and interrupt coalescing. RDMA will allow off-loading of data copies.
Host processing costs for TCP/IP are too high at high (>1Gbps) speeds; a TCP/IP host can't keep up with the link bandwidth on receive, and per-byte costs dominate, so copying needs to be eliminated. It takes 1.2 CPUs at 700Mbits/s without RDMA, and only 30% of a CPU with RDMA. Zero-copy TCP is unreliable, idiosyncratic and expensive. Memory-to-memory transfer using network protocols to carry placement information has satisfactory performance for FCS. Microsoft is interested in making RDMA a standard (http://ietf.org/html.charters/rddp-charter.html and www.rdmaconsortium.org/home). iSCSI will run directly over RDMA. There are a lot of open issues (e.g. ordering constraints, whether it eliminates the need for jumbos in the short term, security). The initial focus is not on the WAN, but it will probably be used there eventually. Other technologies in the non-TCP disk space are Scheduled Transfer (ST), SCTP, FCS over IP and iSCSI.
YottaYotta - Geoff Hayward. We discussed their Fibre Channel Standard (FCS) disk array controllers with a TCP interface. The controller comes as a switch with (I think) 12 blades; each blade has 4x2Gbits/s. The interface to TCP is via a product (iFCP) from another company, where each interface will drive a 2x1GE TCP interface. Flow control is set up by configuring the interface to accept/provide so many FCS credits, based on the bandwidth*RTT product. If the link is dedicated, then the number of credits equals BW*RTT; if one believes the link is <10% utilized elsewhere, then it is set to 90% of BW*RTT. If there is loss, the iFCP dynamically reduces the FCS credits. They are able to support distributed RAID and remote mirroring, which might simplify replication needs. There could be interest in porting FAST TCP to the YottaYotta equipment.
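The credit-sizing rule they described can be sketched numerically; this is my own illustration, and the 2112-byte FC frame size and the example path are my assumptions, not figures from the discussion:

```python
# FCS buffer-credit sizing: credits cover (a share of) the BW*RTT product,
# expressed in Fibre Channel frames.

def fcs_credits(rate_bps, rtt_s, share=1.0, frame_bytes=2112):
    bytes_in_flight = share * rate_bps * rtt_s / 8
    return round(bytes_in_flight / frame_bytes)

print(fcs_credits(2e9, 0.050))             # dedicated 2 Gbit/s, 50 ms path
print(fcs_credits(2e9, 0.050, share=0.9))  # sized for 90% of BW*RTT
```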
I talked to Wade Hong of Carleton University about their high-throughput tests. They achieved 11.1Gbits/s bi-directional TRIUMF to Chicago over lightpaths.
I had discussions with Sven Ubik of CESNET concerning helping them get access to NIKHEF and thence StarLight. We involved Antony Antony and Linda Winkler in these discussions.
I had discussions with Steven Low of Caltech concerning an NSF ITR proposal he is putting together. I will follow up on my return (after I finish writing my trip report, filling out all the other paperwork to get reimbursed for the travel and meeting fees, catching up on email, etc.).