Breaking the Internet2 Land Speed Record: Twice

Les Cottrell – SLAC
Prepared for Ricoh, Menlo Park, April 2003
http://www.slac.stanford.edu/grp/scs/net/talk/ricoh-hiperf.html

Outline

- Who did it?
- What was done?
- How was it done?
- What was special about this anyway?
- Who needs it?
- So what’s next?
- Where do I find out more?
Who did it: Collaborators and sponsors

- Caltech: Harvey Newman, Steven Low, Sylvain Ravot, Cheng Jin, Xiaoling Wei, Suresh Singh, Julian Bunn
- SLAC: Les Cottrell, Gary Buhrmaster, Fabrizio Coccetti
- LANL: Wu-chun Feng, Eric Weigle, Gus Hurwitz, Adam Englehart
- NIKHEF/UvA: Cees DeLaat, Antony Antony
- CERN: Olivier Martin, Paolo Moroni
- ANL: Linda Winkler
- DataTAG, StarLight, TeraGrid, SURFnet, NetherLight, Deutsche Telecom, Information Society Technologies
- Cisco, Level(3), Intel
- DoE, European Commission, NSF

What was done?

- Beat the Gbps limit for a single TCP stream across the Atlantic – transferred a TByte in an hour.
- On February 27-28, 2003, over a Terabyte of data was transferred in 3700 seconds by S. Ravot of Caltech between the Level(3) PoP in Sunnyvale, near SLAC, and CERN. The data passed through the TeraGrid router at StarLight, from memory to memory, as a single TCP/IP stream at an average rate of 2.38 Gbps (using large windows and 9 KByte “jumbo” frames).
- This beat the former record by a factor of approximately 2.5, and used the US-CERN link at 99% efficiency.
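The headline figures are easy to sanity-check. A quick sketch, where the ~1.1 TByte payload and the ~2.4 Gbps usable path capacity are assumed values (the slide says only "over a Terabyte"):

```python
# Sanity-check the record: ~1.1 TBytes (assumed) moved in 3700 s.
payload_bytes = 1.1e12   # assumption: "over a Terabyte"
duration_s = 3700        # from the record run
rate_gbps = payload_bytes * 8 / duration_s / 1e9
print(f"average rate: {rate_gbps:.2f} Gbps")

# 99% efficiency is consistent with roughly 2.4 Gbps of usable
# capacity on the US-CERN path (assumed figure for illustration).
print(f"efficiency at 2.4 Gbps capacity: {rate_gbps / 2.4:.0%}")
```

With these assumptions the arithmetic reproduces the quoted 2.38 Gbps and 99% figures.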
How was it done: Typical testbed

Typical components:

- CPU: Pentium 4 (Xeon) with 2.4 GHz CPU
  - For GE, used SysKonnect NIC
  - For 10GE, used Intel NIC
  - Linux 2.4.19 or 2.4.20
- Routers:
  - Cisco GSR 12406 with OC192/POS & 1 and 10GE server interfaces (loaned; list price > $1M)
  - Cisco 760x
  - Juniper T640 (Chicago)
- Level(3) OC192/POS fibers (loaned; SNV-CHI monthly lease cost ~ $220K)
Challenges

- After a loss it can take over an hour for stock TCP (Reno) to recover to maximum throughput at 1 Gbits/s
  - i.e. a loss rate of 1 in ~2 Gpkts (3 Tbits), or a BER of 1 in 3.6×10^12
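The slow recovery follows from Reno's additive increase of one segment per round trip. A back-of-the-envelope sketch, where the 180 ms transatlantic RTT and 1500 B segment size are assumed values, not figures from the slide:

```python
# Reno recovery after one loss: cwnd is halved, then grows by one
# segment (MSS) per RTT until it is back at the full window.
rtt_s = 0.180        # assumed Sunnyvale-CERN round-trip time
mss_bytes = 1500     # assumed segment size
rate_bps = 1e9       # 1 Gbits/s target throughput

window_segments = rate_bps * rtt_s / 8 / mss_bytes  # full window
recovery_s = (window_segments / 2) * rtt_s          # one segment per RTT
print(f"full window: {window_segments:.0f} segments")
print(f"recovery after one loss: {recovery_s:.0f} s")
```

Delayed ACKs make the growth closer to one segment every two RTTs, roughly doubling this, and at the 2.38 Gbps record rate the same arithmetic stretches recovery toward the hour quoted above.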
Windows and Streams

- Well accepted that multiple streams (n) and/or big windows are important to achieve optimal throughput
- Effectively reduces the impact of a loss by 1/n, and improves recovery time by 1/n
- Optimum windows & streams change as the path changes (e.g. with utilization); hard to optimize n
- Can be unfriendly to others
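The window sizes in play come from the bandwidth-delay product. A minimal sketch, where the 180 ms round-trip time is an assumed value:

```python
# Window needed to fill the path (bandwidth-delay product), and how
# n parallel streams split both the window and the cost of a loss.
rate_bps = 1e9
rtt_s = 0.180                      # assumed round-trip time
bdp_bytes = rate_bps * rtt_s / 8   # total window to keep the pipe full

for n in (1, 4, 16):
    print(f"{n:2d} streams: {bdp_bytes / n / 1e6:.2f} MB window each; "
          f"one loss costs ~1/{2 * n} of the aggregate throughput")
```

Halving one of n equal streams removes 1/(2n) of the aggregate rate, which is the "reduces impact of a loss by 1/n" effect above.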
Even with big windows (1 MB), standard TCP still needs multiple streams

- Above the knee, performance still improves slowly, maybe due to squeezing out others and taking more than a fair share because of the large number of streams
- Optimal streams and windows can change during the day; hard to optimize
Impact on others
New TCP Stacks

- Reno (AIMD) based: loss indicates congestion
  - Back off less when congestion is seen
  - Recover more quickly after backing off
  - Scalable TCP: exponential recovery
    - Tom Kelly, “Scalable TCP: Improving Performance in Highspeed Wide Area Networks”, submitted for publication, December 2002
  - High Speed TCP: same as Reno at low performance, then increases the window more and more aggressively (via a table) as the window grows
- Vegas based: RTT indicates congestion
  - Caltech FAST TCP: quicker response to congestion, but …
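The Reno-based stacks differ mainly in their window-update rules. A simplified comparison of Reno's additive increase against Scalable TCP's multiplicative increase; the 0.01 per-ACK growth constant is from Kelly's proposal, the window sizes are illustrative, and for a like-for-like comparison both stacks are started from half the target window (Scalable TCP actually backs off by less):

```python
# RTTs needed to grow cwnd from half the target window back to the
# full window, for Reno vs Scalable TCP.  Windows are in segments.
def reno_rtts(target):
    return target - target // 2        # +1 segment per RTT

def scalable_rtts(target, a=0.01):
    # Scalable TCP: cwnd += a per ACK, i.e. roughly *(1 + a) per RTT.
    cwnd, rtts = target / 2, 0
    while cwnd < target:
        cwnd *= 1 + a
        rtts += 1
    return rtts

for w in (1500, 15000):                # illustrative window sizes
    print(f"W={w}: Reno {reno_rtts(w)} RTTs, Scalable {scalable_rtts(w)} RTTs")
```

The point of "exponential recovery": Scalable TCP takes a fixed ~70 RTTs to double its window regardless of how large the window is, while Reno's recovery grows linearly with the window.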
Stock vs FAST TCP (MTU = 1500 B)

- Need to measure all parameters to understand the effects of parameters and configurations:
  - Windows, streams, txqueuelen, TCP stack, MTU, NIC card
  - Lots of variables
- Examples of 2 TCP stacks
- FAST TCP no longer needs multiple streams; this is a major simplification (reduces the number of variables to tune by 1)

Jumbo frames

- Become more important at higher speeds:
  - Reduce interrupts to the CPU and packets to process; reduce CPU utilization
  - Similar effect to using multiple streams (T. Hacker)
- Jumbos can achieve >95% utilization from SNV to CHI or GVA with 1 or multiple streams up to 1 Gbit/s
- Factor of 5 improvement over single-stream 1500 B MTU throughput for stock TCP (SNV-CHI (65 ms) & CHI-AMS (128 ms))
- Complementary approach to a new stack
- Deployment doubtful:
  - Few sites have deployed
  - Not part of the GE or 10GE standards
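The interrupt-rate argument is simple arithmetic. A sketch comparing the packet rates a host must handle at 1 Gbits/s with the standard and 9000 B jumbo MTUs:

```python
# Packet (and hence interrupt) rate at 1 Gbits/s, standard vs jumbo MTU.
rate_bps = 1e9
for mtu_bytes in (1500, 9000):
    pps = rate_bps / 8 / mtu_bytes
    print(f"MTU {mtu_bytes}: {pps:,.0f} packets/s")
```

9000 B frames cut the per-packet work by a factor of 6, which is where the CPU-utilization saving comes from.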
TCP stacks with 1500 B MTU @ 1 Gbps

Jumbo frames, new TCP stacks at 1 Gbits/s
Other gotchas

- Large windows and a large number of streams can cause the last stream to take a long time to close
- Linux memory leak
- Linux TCP configuration caching
- What window size is actually used/reported?
- 32-bit counters in iperf and routers wrap; need the latest releases with 64-bit counters
- Effects of txqueuelen (the number of packets queued for the NIC)
- Routers that do not pass jumbos
- Performance differs between drivers and NICs from different manufacturers
- May require tuning a lot of parameters
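The counter-wrap gotcha is worth quantifying. A sketch, assuming a 32-bit byte counter of the kind older tool releases used:

```python
# Time for a 32-bit byte counter to wrap at various transfer rates.
max_count = 2 ** 32                    # 32-bit counter ceiling
for gbps in (1, 2.38, 10):
    wrap_s = max_count / (gbps * 1e9 / 8)
    print(f"{gbps} Gbps: counter wraps every {wrap_s:.1f} s")
```

At these speeds the counter wraps in well under a minute, so any measurement interval longer than that silently under-reports unless 64-bit counters are used.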
What was special?

- End-to-end, application-to-application, single and multiple streams (not just internal backbone aggregate speeds)
- TCP has not run out of steam yet; it scales into the multi-Gbits/s region
- TCP is well understood and mature, with many good features: reliability etc.
- Friendly on shared networks
- New TCP stacks only need to be deployed at the sender
  - Often just a few data sources, many destinations
- No modifications to backbone routers etc.
- No need for jumbo frames
- Used Commercial Off The Shelf (COTS) hardware and software
Who needs it?

- HENP – the current driver
  - Multi-hundreds of Mbits/s, with multi-TByte files transferred across the Atlantic per day, today
  - The SLAC BaBar experiment already has > 1 PByte stored
  - Tbits/s and ExaBytes (10^18 bytes) stored in a decade
- Data intensive science:
  - Astrophysics, global weather, bioinformatics, fusion, seismology…
- Industries such as aerospace, medicine, security…
- Future:
  - Media distribution
    - 1 Gbits/s = 2 full-length DVD movies/minute
    - 2.36 Gbits/s is equivalent to:
      - Transferring a full CD in 2.3 seconds (i.e. 1565 CDs/hour)
      - Transferring 200 full-length DVD movies in one hour (i.e. 1 DVD in 18 seconds)
  - Will sharing movies be like sharing music today?
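The CD/DVD equivalences are straightforward division. A sketch assuming a 680 MB CD and a 4.7 GB single-layer DVD (both media sizes are assumptions):

```python
# CD/DVD transfer-time equivalents at the 2.36 Gbits/s record rate.
rate_bps = 2.36e9
cd_bytes = 680e6      # assumed CD capacity
dvd_bytes = 4.7e9     # assumed single-layer DVD capacity

cd_s = cd_bytes * 8 / rate_bps
dvd_s = dvd_bytes * 8 / rate_bps
print(f"one CD in {cd_s:.1f} s ({3600 / cd_s:.0f} CDs/hour)")
print(f"one DVD in {dvd_s:.0f} s ({3600 / dvd_s:.0f} DVDs/hour)")
```

The CD numbers land on the figures above; the DVD numbers come out slightly faster than the rounded 18 s / 200-per-hour figures, which evidently assumed a somewhat larger disc.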
When will it have an impact?

- ESnet traffic has doubled every year since 1990
- SLAC capacity has increased by 90%/year since 1982
- SLAC Internet traffic increased by a factor of 2.5 in the last year
- International throughput increased by a factor of 10 in 4 years
- So traffic increases by a factor of 10 every 3.5 to 4 years; thus in:
  - 3.5 to 5 years: 622 Mbps => 10 Gbps
  - 3-4 years: 155 Mbps => 1 Gbps
  - 3.5-5 years: 45 Mbps => 622 Mbps
- 2010-2012:
  - 100s of Gbits/s for high-speed production network end connections
  - 10 Gbps will be mundane for R&E and business
- Home: doubling ~ every 2 years; 100 Mbits/s by the end of the decade?
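The extrapolations above can be reproduced by backing out the implied annual growth rate:

```python
# Annual growth factor implied by "10x in n years", and the time for
# 622 Mbps to reach 10 Gbps at that compounding rate.
import math

for years in (3.5, 4):
    annual = 10 ** (1 / years)
    t = math.log(10e9 / 622e6, annual)   # years of growth needed
    print(f"10x in {years} y: {annual:.2f}x/year; "
          f"622 Mbps -> 10 Gbps in {t:.1f} y")
```

Both assumptions give roughly 4 to 5 years, consistent with the 3.5-5 year estimate above.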
What’s next?

- Break the 2.5 Gbits/s limit
- Disk-to-disk throughput & useful applications
  - Need faster CPUs (an extra 60% MHz per Mbits/s over TCP for disk-to-disk); understand how to use multi-processors
- Evaluate new stacks on real-world links and with other equipment:
  - Other NICs
  - Response to congestion, pathologies
  - Fairness
- Deploy for some major (e.g. HENP/Grid) customer applications
- Understand how to make 10GE NICs work well with 1500 B MTUs
- Move from “hero” demonstrations to the commonplace
More Information

- Internet2 Land Speed Record publicity:
  - www-iepm.slac.stanford.edu/lsr/
  - www-iepm.slac.stanford.edu/lsr2/
- 10GE tests:
  - www-iepm.slac.stanford.edu/monitoring/bulk/10ge/
  - sravot.home.cern.ch/sravot/Networking/10GbE/10GbE_test.html
- TCP stacks:
  - netlab.caltech.edu/FAST/
  - datatag.web.cern.ch/datatag/pfldnet2003/papers/kelly.pdf
  - www.icir.org/floyd/hstcp.html
- Stack comparisons:
  - www-iepm.slac.stanford.edu/monitoring/bulk/fast/
  - www.csm.ornl.gov/~dunigan/net100/floyd.html