Breaking the Internet2 Land Speed Record: Twice
Les Cottrell – SLAC
Prepared for Ricoh, Menlo Park, April 2003
http://www.slac.stanford.edu/grp/scs/net/talk/ricoh-hiperf.html

Outline
Who did it?
What was done?
How was it done?
What was special about this anyway?
Who needs it?
So what’s next?
Where do I find out more?

Who did it: Collaborators and sponsors
Caltech: Harvey Newman, Steven Low, Sylvain Ravot, Cheng Jin, Xiaoling Wei, Suresh Singh, Julian Bunn
SLAC: Les Cottrell, Gary Buhrmaster, Fabrizio Coccetti
LANL: Wu-chun Feng, Eric Weigle, Gus Hurwitz, Adam Englehart
NIKHEF/UvA: Cees DeLaat, Antony Antony
CERN: Olivier Martin, Paolo Moroni
ANL: Linda Winkler
DataTAG, StarLight, TeraGrid, SURFnet, NetherLight, Deutsche Telekom, Information Society Technologies
Cisco, Level(3), Intel
DoE, European Commission, NSF

What was done?
Beat the Gbps limit for a single TCP stream across the Atlantic – transferred a TByte in an hour

On February 27-28, over a Terabyte of data was transferred in 3700 seconds by S. Ravot of Caltech between the Level3 PoP in Sunnyvale, near SLAC, and CERN.

The data passed through the TeraGrid router at StarLight from memory to memory as a single TCP/IP stream at an average rate of 2.38 Gbps (using large windows and 9KByte “jumbo” frames).

This beat the former record by a factor of approximately 2.5,  and used the US-CERN link at 99% efficiency.
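
The quoted average rate can be checked with a little arithmetic. The sketch below assumes a "Terabyte" here means 2^40 bytes (the slide does not say which definition was used) and also shows the decimal definition for comparison; the function and constant names are illustrative only.

```python
def average_rate_gbps(num_bytes: float, seconds: float) -> float:
    """Average throughput in Gbit/s (10^9 bits per second)."""
    return num_bytes * 8 / seconds / 1e9

TBYTE_BINARY = 2 ** 40    # assumption: 1 TByte = 2^40 bytes
TBYTE_DECIMAL = 10 ** 12  # alternative: 1 TByte = 10^12 bytes

rate_binary = average_rate_gbps(TBYTE_BINARY, 3700)
rate_decimal = average_rate_gbps(TBYTE_DECIMAL, 3700)

print(f"binary TByte:  {rate_binary:.2f} Gbit/s")   # about 2.38
print(f"decimal TByte: {rate_decimal:.2f} Gbit/s")  # about 2.16
```

Under the binary assumption the result matches the 2.38 Gbps quoted above.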

How was it done: Typical testbed

Typical Components
CPU
Pentium 4 (Xeon) with 2.4GHz cpu
For GE used Syskonnect NIC
For 10GE used Intel NIC
Linux 2.4.19 or 2.4.20
Routers
Cisco GSR 12406 with OC192/POS & 1 and 10GE server interfaces (loaned, list > $1M)
Cisco 760x
Juniper T640 (Chicago)
Level(3) OC192/POS fibers (loaned SNV-CHI monthly lease cost ~ $220K)

Challenges
After a loss it can take over an hour for stock TCP (Reno) to recover to maximum throughput at 1Gbits/s
i.e. loss rate of 1 in ~ 2 Gpkts (3Tbits), or BER of 1 in 3.6*10^12
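
Why recovery is so slow can be sketched with a simple model: after a loss Reno halves its congestion window, then adds roughly one packet per round trip. The round-trip times, the 1500-byte packet size and the absence of delayed ACKs are assumptions for illustration, not figures from this slide.

```python
def reno_recovery_seconds(rate_bps: float, rtt_s: float,
                          packet_bytes: int = 1500) -> float:
    """Time for Reno to climb from half window back to full window.

    Simplified model: window needed = bandwidth * RTT, growth of one
    packet per RTT, no further losses, no delayed ACKs.
    """
    window_pkts = rate_bps * rtt_s / (packet_bytes * 8)
    rtts_needed = window_pkts / 2   # packets lost by halving the window
    return rtts_needed * rtt_s

for rtt in (0.065, 0.180, 0.300):   # assumed RTTs in seconds
    t = reno_recovery_seconds(1e9, rtt)
    print(f"RTT {rtt * 1000:.0f} ms: {t:.0f} s ({t / 60:.1f} min)")
```

In this model the recovery time grows with the square of the RTT, so long-distance paths are hit hardest: at an assumed 300 ms it already exceeds an hour.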

Windows and Streams
Well accepted that multiple streams (n) and/or big windows are important to achieve optimal throughput
Effectively reduces the impact of a loss to 1/n, and the recovery time to 1/n
Optimum windows & number of streams change with conditions on the path (e.g. utilization), so n is hard to optimize
Can be unfriendly to others
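
The window and stream trade-off above can be sketched numerically. The total window needed is the bandwidth-delay product; splitting it across n streams shrinks each stream's window, so a single loss halves only 1/n of the aggregate. This is an idealized model (equal shares, a loss hits exactly one stream, Reno halving); the function names are illustrative.

```python
def bdp_bytes(rate_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_bps * rtt_s / 8

def per_stream_window_bytes(rate_bps: float, rtt_s: float, n_streams: int) -> float:
    """Window each stream needs if n equal streams share the path."""
    return bdp_bytes(rate_bps, rtt_s) / n_streams

def aggregate_drop_fraction(n_streams: int) -> float:
    """Fraction of aggregate throughput lost when one stream halves its window."""
    return 0.5 / n_streams

# 1 Gbit/s over a 65 ms path (the SNV-CHI RTT quoted later in the talk)
for n in (1, 2, 4, 8):
    w = per_stream_window_bytes(1e9, 0.065, n)
    print(f"n={n}: window {w / 1e6:.2f} MB per stream, "
          f"one loss costs {aggregate_drop_fraction(n):.1%} of aggregate")
```

With these inputs a single stream needs a window of about 8 MB, while eight streams need about 1 MB each.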

Even with big windows (1MB) still need multiple streams with Standard TCP
Above the knee, performance still improves slowly, maybe due to squeezing out others and taking more than a fair share due to the large number of streams
Streams, windows can change during day, hard to optimize

Impact on others

New TCP Stacks
Reno (AIMD) based, loss indicates congestion
Back off less when see congestion
Recover more quickly after backing off
Scalable TCP: exponential recovery
Tom Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks", submitted for publication, December 2002.
High Speed TCP: same as Reno for low performance, then increase window more & more aggressively as window increases using a table
Vegas based, RTT indicates congestion
Caltech FAST TCP, quicker response to congestion, but …
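
To illustrate "back off less" and "recover more quickly", the sketch below compares how many round trips Reno and Scalable TCP need to regain the pre-loss window. It assumes the parameter values proposed in Kelly's paper (window grows by 0.01 packet per ACK, shrinks by a factor 0.125 on loss) and approximates Scalable TCP's growth as a factor of 1.01 per RTT. HighSpeed TCP and FAST are not modelled.

```python
import math

def reno_recovery_rtts(window_pkts: float) -> float:
    """Reno: window halves on loss, then grows by 1 packet per RTT."""
    return window_pkts / 2

def scalable_recovery_rtts(a: float = 0.01, b: float = 0.125) -> float:
    """Scalable TCP: window drops to (1 - b) * W, then grows by about
    a factor (1 + a) per RTT, so recovery does not depend on W."""
    return math.log(1 / (1 - b)) / math.log(1 + a)

for w in (1_000, 15_000, 100_000):   # window sizes in packets (illustrative)
    print(f"W={w}: Reno {reno_recovery_rtts(w):.0f} RTTs, "
          f"Scalable {scalable_recovery_rtts():.1f} RTTs")
```

Reno's recovery grows linearly with the window, while in this model Scalable TCP takes roughly 13 to 14 round trips regardless of window size.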

Stock vs FAST TCP
MTU=1500B
Need to measure all parameters to understand effects of parameters, configurations:
Windows, streams, txqueuelen, TCP stack, MTU, NIC card
Lot of variables
Examples of 2 TCP stacks
FAST TCP no longer needs multiple streams, this is a major simplification (reduces # variables to tune by 1)

Jumbo frames
Become more important at higher speeds:
Reduce interrupts to CPU and packets to process, reduce cpu utilization
Similar effect to using multiple streams (T. Hacker)
Jumbo can achieve >95% utilization SNV to CHI or GVA with 1 or multiple streams up to Gbit/s
Factor 5 improvement over single stream 1500B MTU throughput for stock TCP (SNV-CHI(65ms) & CHI-AMS(128ms))
Complementary approach to a new stack
Deployment doubtful
Few sites have deployed
Not part of GE or 10GE standards
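
The reduction in per-packet work can be estimated directly from the frame size. The sketch below ignores header and framing overhead and treats the MTU as the full packet size, which is a simplification; the function name is illustrative.

```python
def packets_per_second(rate_bps: float, mtu_bytes: int) -> float:
    """Approximate packet rate needed to sustain rate_bps (overheads ignored)."""
    return rate_bps / (mtu_bytes * 8)

standard = packets_per_second(1e9, 1500)
jumbo = packets_per_second(1e9, 9000)

print(f"1500B MTU: {standard:,.0f} packets/s")
print(f"9000B MTU: {jumbo:,.0f} packets/s")
print(f"ratio: {standard / jumbo:.1f}x fewer packets with jumbo frames")
```

Since Reno grows its window by about one packet per round trip, a packet that is 6 times larger also means the window, counted in packets, is 6 times smaller and is rebuilt correspondingly faster after a loss.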

TCP stacks with 1500B MTU @1Gbps

Jumbo frames, new TCP stacks at 1 Gbits/s

Other gotchas
Large windows and large number of streams can cause last stream to take a long time to close.
Linux memory leak
Linux TCP configuration caching
What is the window size actually used/reported
32 bit counters in iperf and routers wrap, need latest releases with 64bit counters
Effects of txqueuelen (number of packets queued for NIC)
Routers do not pass jumbos
Performance differs between drivers and NICs from different  manufacturers
May require tuning a lot of parameters
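
The counter-wrap problem is easy to quantify. The sketch below assumes a counter that counts bytes, and shows both how quickly it wraps at these speeds and the usual modular-arithmetic correction, which only works if the counter is polled at least once per wrap interval. Names are illustrative.

```python
def wrap_seconds(rate_bps: float, counter_bits: int = 32) -> float:
    """Time for a byte counter of the given width to wrap at rate_bps."""
    return (2 ** counter_bits) * 8 / rate_bps

def counter_delta(old: int, new: int, counter_bits: int = 32) -> int:
    """Bytes counted between two polls, tolerating at most one wrap."""
    return (new - old) % (2 ** counter_bits)

print(f"32-bit at 1 Gbit/s:    {wrap_seconds(1e9):.1f} s")
print(f"32-bit at 2.38 Gbit/s: {wrap_seconds(2.38e9):.1f} s")
print(f"64-bit at 1 Gbit/s:    {wrap_seconds(1e9, 64) / 3.156e7:.0f} years")

# Counter wrapped between polls: 4294967000 -> 500
print(counter_delta(4_294_967_000, 500))
```

At Gbit/s rates a 32-bit byte counter wraps in well under a minute, so polling every few minutes silently loses data; 64-bit counters remove the problem in practice.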

What was special?
End-to-end application-to-application, single and multi-streams (not just internal backbone aggregate speeds)
TCP has not run out of steam yet, scales into multi-Gbits/s region
TCP well understood, mature, many good features: reliability etc.
Friendly on shared networks
New TCP stacks only need to be deployed at sender
Often just a few data sources, many destinations
No modifications to backbone routers etc
No need for jumbo frames
Used Commercial Off The Shelf (COTS) hardware and software

Who needs it?
HENP – current driver
Multi-hundreds Mbits/s and Multi TByte files/day transferred across Atlantic today
SLAC BaBar experiment already has > PByte stored
Tbits/s and ExaBytes (10^18) stored in a decade
Data intensive science:
Astrophysics, Global weather, Bioinformatics, Fusion, seismology…
Industries such as aerospace, medicine, security …
Future:
Media distribution
Gbits/s=2 full length DVD movies/minute
2.36Gbits/s is equivalent to
Transferring a full CD in 2.3 seconds  (i.e. 1565 CDs/hour)
Transferring 200 full length DVD movies in one hour
(i.e. 1 DVD in 18 seconds)
Will sharing movies be like sharing music today?
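
These equivalences are simple size-over-rate calculations. The slide does not state the media sizes it used, so the sketch below assumes a 700 MB CD and a 4.7 GB single-layer DVD; with those assumptions the results land near, but not exactly on, the figures above.

```python
def transfer_seconds(size_bytes: float, rate_bps: float) -> float:
    """Time to move size_bytes at a sustained rate of rate_bps."""
    return size_bytes * 8 / rate_bps

RATE_BPS = 2.36e9    # rate quoted on the slide
CD_BYTES = 700e6     # assumption: 700 MB CD
DVD_BYTES = 4.7e9    # assumption: 4.7 GB single-layer DVD

cd_s = transfer_seconds(CD_BYTES, RATE_BPS)
dvd_s = transfer_seconds(DVD_BYTES, RATE_BPS)

print(f"CD:  {cd_s:.1f} s each, {3600 / cd_s:.0f} per hour")
print(f"DVD: {dvd_s:.1f} s each, {3600 / dvd_s:.0f} per hour")
```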

When will it have an impact
ESnet traffic doubling/year since 1990
SLAC capacity increasing by 90%/year since 1982
SLAC Internet traffic increased by factor 2.5 in last year
International throughput increase by factor 10 in 4 years
So traffic increases by factor 10 in 3.5 to 4 years, so in:
3.5 to 5 years 622 Mbps => 10Gbps
3-4 years 155 Mbps => 1Gbps
3.5-5 years 45Mbps => 622Mbps
2010-2012:
100s of Gbits/s for high speed production net end connections
10Gbps will be mundane for R&E and business
Home: doubling ~ every 2 years, 100Mbits/s by end of decade?
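
The projections above follow from compound growth. The sketch below converts annual growth factors into "years per factor of 10" and estimates how long each capacity step takes when traffic grows tenfold every 3.5 to 4 years. It assumes smooth exponential growth; function names are illustrative.

```python
import math

def years_to_grow(start: float, target: float, annual_factor: float) -> float:
    """Years for a quantity to grow from start to target at a fixed annual factor."""
    return math.log(target / start) / math.log(annual_factor)

def annual_factor_from_tenfold(years_per_tenfold: float) -> float:
    """Annual growth factor equivalent to a tenfold increase in the given years."""
    return 10 ** (1 / years_per_tenfold)

print(f"doubling/year -> x10 in {years_to_grow(1, 10, 2.0):.1f} years")
print(f"90%/year      -> x10 in {years_to_grow(1, 10, 1.9):.1f} years")

for start, target in ((622, 10_000), (155, 1_000), (45, 622)):   # Mbps
    fast = years_to_grow(start, target, annual_factor_from_tenfold(3.5))
    slow = years_to_grow(start, target, annual_factor_from_tenfold(4.0))
    print(f"{start} Mbps -> {target} Mbps: {fast:.1f} to {slow:.1f} years")
```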

What’s next?
Break 2.5Gbits/s limit
Disk-to-disk throughput & useful applications
Need faster cpus (extra 60% MHz/Mbits/s over TCP for disk to disk), understand how to use multi-processors
Evaluate new stacks with real-world links, and other equipment
Other NICs
Response to congestion, pathologies
Fairness
Deploy for some major (e.g. HENP/Grid) customer applications
Understand how to make 10GE NICs work well with 1500B MTUs
Move from “hero” demonstrations to commonplace

More Information
Internet2 Land Speed Record Publicity
www-iepm.slac.stanford.edu/lsr/
www-iepm.slac.stanford.edu/lsr2/
10GE tests
www-iepm.slac.stanford.edu/monitoring/bulk/10ge/
sravot.home.cern.ch/sravot/Networking/10GbE/10GbE_test.html
TCP stacks
netlab.caltech.edu/FAST/
datatag.web.cern.ch/datatag/pfldnet2003/papers/kelly.pdf
www.icir.org/floyd/hstcp.html
Stack comparisons
www-iepm.slac.stanford.edu/monitoring/bulk/fast/
www.csm.ornl.gov/~dunigan/net100/floyd.html