High Performance WAN Testbed Experiences & Results

Les Cottrell – SLAC

Prepared for the CHEP03, San Diego, March 2003

http://www.slac.stanford.edu/grp/scs/net/talk/chep03-hiperf.html

Outline

Who did it?

What was done?

How was it done?

Who needs it?

So what’s next?

Where do I find out more?

Who did it: Collaborators and sponsors

Caltech: Harvey Newman, Steven Low, Sylvain Ravot, Cheng Jin, Xiaoling Wei, Suresh Singh, Julian Bunn

SLAC: Les Cottrell, Gary Buhrmaster, Fabrizio Coccetti

LANL: Wu-chun Feng, Eric Weigle, Gus Hurwitz, Adam Englehart

NIKHEF/UvA: Cees DeLaat, Antony Antony

CERN: Olivier Martin, Paolo Moroni

ANL: Linda Winkler

DataTAG, StarLight, TeraGrid, SURFnet, NetherLight, Deutsche Telecom, Information Society Technologies

Cisco, Level(3), Intel

DoE, European Commission, NSF

What was done?

Beat the Gbps limit for a single TCP stream across the Atlantic – transferred a TByte in an hour

On February 27-28, over a Terabyte of data was transferred in 3700 seconds by S. Ravot of Caltech between the Level3 PoP in Sunnyvale, near SLAC, and CERN.

The data passed through the TeraGrid router at StarLight from memory to memory as a single TCP/IP stream at an average rate of 2.38 Gbps (using large windows and 9KByte “jumbo” frames).

This beat the former record by a factor of approximately 2.5, and used the US-CERN link at 99% efficiency.
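The figures above can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a constant average rate; the function names are illustrative only.

```python
# Figures quoted in the slide above.
RATE_GBPS = 2.38        # average single-stream rate
DURATION_S = 3700       # transfer time in seconds
LINK_EFFICIENCY = 0.99  # fraction of the US-CERN link in use


def bytes_transferred(rate_gbps, seconds):
    """Bytes moved at a constant average rate (decimal units assumed)."""
    return rate_gbps * 1e9 * seconds / 8


def implied_link_gbps(rate_gbps, efficiency):
    """Link capacity implied by the achieved rate and the quoted efficiency."""
    return rate_gbps / efficiency


total_bytes = bytes_transferred(RATE_GBPS, DURATION_S)
link_gbps = implied_link_gbps(RATE_GBPS, LINK_EFFICIENCY)

print(f"{total_bytes / 1e12:.2f} TB transferred")
print(f"{link_gbps:.2f} Gbit/s implied link capacity")
```

Under these assumptions the total comes out slightly above one Terabyte, consistent with "over a Terabyte".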

How was it done: Typical testbed

Typical Components

CPU

2.4 GHz Pentium 4 (Xeon) CPU

For GE used Syskonnect NIC

For 10GE used Intel NIC

Linux 2.4.19 or 2.4.20

Routers

Cisco GSR 12406 with OC192/POS & 1 and 10GE server interfaces (loaned, list > $1M)

Cisco 760x

Juniper T640 (Chicago)

Level(3) OC192/POS fibers (loaned SNV-CHI monthly lease cost ~ $220K)

Challenges

After a loss it can take over an hour for stock TCP (Reno) to recover to maximum throughput at 1Gbits/s

i.e. loss rate of 1 in ~ 2 Gpkts (3Tbits), or BER of 1 in 3.6*10¹²
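Why recovery is slow can be sketched with a textbook Reno model: after a loss the window is halved, then grows by one segment per round trip. The code below is a rough estimate under that model only. The 1460-byte segment size is an assumption, the RTTs are the SNV-CHI and CHI-AMS values quoted later in the talk, and it does not attempt to reproduce the "over an hour" figure, which depends on path conditions not stated here.

```python
def reno_recovery_seconds(rate_bps, rtt_s, mss_bytes=1460):
    """Approximate time for Reno to regain its full window after one loss.

    Simplified model: the window that fills the path is rate * RTT,
    a loss halves it, and it then grows by one segment per RTT.
    """
    window_segments = rate_bps * rtt_s / (mss_bytes * 8)
    rtts_needed = window_segments / 2
    return rtts_needed * rtt_s


if __name__ == "__main__":
    for rtt_ms in (65, 128):
        seconds = reno_recovery_seconds(1e9, rtt_ms / 1000)
        print(f"RTT {rtt_ms} ms: about {seconds / 60:.1f} minutes to recover")
```

In this model the recovery time grows linearly with the line rate and with the square of the RTT, which is why long, fast paths are hit hardest.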

Windows and Streams

Well accepted that multiple streams (n) and/or big windows are important to achieve optimal throughput

With n streams, a loss effectively has 1/n of the impact, and recovery takes 1/n of the time

Optimum window size & number of streams change as the path changes (e.g. with utilization), so n is hard to optimize

Can be unfriendly to others

Even with big windows (1MB), multiple streams are still needed with standard TCP

Above the knee, performance still improves slowly, maybe due to squeezing out others and taking more than a fair share because of the large number of streams

Optimum streams and windows can change during the day, hard to optimize
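One way to see why a 1MB window still needs several streams is to compare it with the bandwidth-delay product of the path. This is a minimal sketch: it assumes 1MB means 2^20 bytes, uses the 65 ms and 128 ms RTTs quoted elsewhere in the talk, and ignores losses and competing traffic.

```python
import math


def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return rate_bps * rtt_s / 8


def streams_needed(rate_bps, rtt_s, window_bytes):
    """Smallest number of streams whose combined windows cover the BDP."""
    return math.ceil(bdp_bytes(rate_bps, rtt_s) / window_bytes)


if __name__ == "__main__":
    one_mb = 2 ** 20  # assumed meaning of "1MB" on the slide
    for rtt_ms in (65, 128):
        n = streams_needed(1e9, rtt_ms / 1000, one_mb)
        print(f"1 Gbit/s, RTT {rtt_ms} ms: at least {n} streams of 1MB")
```

This only gives a lower bound on n; as the slides note, the practical optimum shifts with path utilization.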

New TCP Stacks

Reno (AIMD) based, loss indicates congestion

Back off less when congestion is seen

Recover more quickly after backing off

Scalable TCP: exponential recovery

Tom Kelly, "Scalable TCP: Improving Performance in Highspeed Wide Area Networks", submitted for publication, December 2002.

High Speed TCP: same as Reno at low throughput, then increases the window more and more aggressively as the window grows, using a table

Vegas based, RTT indicates congestion

Caltech FAST TCP, quicker response to congestion, but …
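The difference between Reno's linear recovery and Scalable TCP's exponential recovery can be illustrated per round trip. This is a simplified sketch, not any stack's actual implementation: the Scalable TCP constants (grow by 1% per RTT, back off by 1/8) are not given in these slides and are assumptions here, and High Speed TCP's table and FAST's delay-based control are not modeled.

```python
def rtts_to_recover_reno(window):
    """RTTs for Reno to return to `window` segments after one loss.

    Model: halve on loss, then add one segment per RTT.
    """
    cwnd = window * 0.5
    rtts = 0
    while cwnd < window:
        cwnd += 1
        rtts += 1
    return rtts


def rtts_to_recover_scalable(window, a=0.01, b=0.125):
    """RTTs for a Scalable-TCP-like rule to return to `window` segments.

    Model (assumed constants): shrink by fraction b on loss,
    then grow by fraction a of the current window per RTT.
    """
    cwnd = window * (1 - b)
    rtts = 0
    while cwnd < window:
        cwnd += a * cwnd
        rtts += 1
    return rtts


if __name__ == "__main__":
    for window in (1000, 10000):
        print(window, rtts_to_recover_reno(window), rtts_to_recover_scalable(window))
```

In this model Reno's recovery time grows with the window size, while the multiplicative rule recovers in a fixed number of round trips regardless of window size.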

Stock vs FAST TCP (MTU=1500B)

Need to measure across all parameters and configurations to understand their effects:

Windows, streams, txqueuelen, TCP stack, MTU, NIC

A lot of variables

Examples of 2 TCP stacks

FAST TCP no longer needs multiple streams; this is a major simplification (reduces the number of variables to tune by 1)

Jumbo frames

Become more important at higher speeds:

Reduce interrupts to the CPU and packets to process, which reduces CPU utilization

Similar effect to using multiple streams (T. Hacker)

Jumbo frames can achieve >95% utilization SNV to CHI or GVA with 1 or multiple streams up to Gbit/s

Factor 5 improvement over single stream 1500B MTU throughput for stock TCP (SNV-CHI(65ms) & CHI-AMS(128ms))

Complementary approach to a new stack

Deployment doubtful

Few sites have deployed

Not part of GE or 10GE standards
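The reduction in packets to process can be estimated directly from the frame size. This is a back-of-the-envelope sketch that treats the MTU as the whole packet and ignores Ethernet framing overhead and interrupt coalescing, so the absolute numbers are only indicative.

```python
def packets_per_second(rate_bps, mtu_bytes):
    """Packets per second needed to sustain rate_bps with full-size packets."""
    return rate_bps / (mtu_bytes * 8)


if __name__ == "__main__":
    standard = packets_per_second(1e9, 1500)
    jumbo = packets_per_second(1e9, 9000)
    print(f"1500B MTU: {standard:,.0f} packets/s")
    print(f"9000B MTU: {jumbo:,.0f} packets/s")
    print(f"Reduction factor: {standard / jumbo:.1f}")
```

The per-packet work on the host falls in proportion to the frame size, which is why the benefit grows at higher line rates.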

TCP stacks with 1500B MTU @1Gbps

Jumbo frames, new TCP stacks at 1 Gbits/s

Other gotchas

Large windows and large number of streams can cause last stream to take a long time to close.

Linux memory leak

Linux TCP configuration caching

What is the window size actually used/reported

32-bit counters in iperf and routers wrap; need latest releases with 64-bit counters

Effects of txqueuelen (number of packets queued for NIC)

Routers do not pass jumbos

Performance differs between drivers and NICs from different manufacturers

May require tuning a lot of parameters
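The counter-wrap gotcha above is easy to quantify. The sketch below assumes the counter counts bytes (octets) and shows both how quickly a 32-bit counter wraps at these rates and a wrap-tolerant way to difference two samples; the function names are illustrative and not taken from iperf or any router software.

```python
def wrap_seconds(rate_bps, bits=32):
    """Seconds until a byte counter of the given width wraps at rate_bps."""
    return (2 ** bits) / (rate_bps / 8)


def counter_delta(previous, current, bits=32):
    """Bytes counted between two samples, tolerating a single wrap.

    Only correct if the counter wrapped at most once between samples,
    so the sampling interval must be shorter than wrap_seconds().
    """
    return (current - previous) % (2 ** bits)


if __name__ == "__main__":
    for gbps in (1, 2.38, 10):
        print(f"{gbps} Gbit/s: 32-bit byte counter wraps every "
              f"{wrap_seconds(gbps * 1e9):.1f} s")
    print(counter_delta(4294967000, 200))
```

At these rates a 32-bit byte counter wraps within tens of seconds, so a typical polling interval of minutes would miss wraps, which is why 64-bit counters are needed.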

Who needs it?

HENP – current driver

Data intensive science:

Astrophysics, global weather, fusion, seismology

Industries such as aerospace, medicine, security …

Future:

Media distribution

Gbits/s=2 full length DVD movies/minute

2.36Gbits/s is equivalent to

Transferring a full CD in 2.3 seconds (i.e. 1565 CDs/hour)

Transferring 200 full length DVD movies in one hour (i.e. 1 DVD in 18 seconds)

Will sharing movies be like sharing music today?
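The equivalences above can be checked by working backwards from the quoted times. This sketch uses only the slide's own figures (2.36 Gbits/s, 2.3 seconds per CD, 18 seconds per DVD) and derives the media sizes they imply; the sizes are therefore outputs of the calculation, not values stated in the talk.

```python
RATE_BPS = 2.36e9  # rate quoted on the slide


def implied_size_bytes(rate_bps, seconds):
    """Size of an item that takes `seconds` to transfer at rate_bps."""
    return rate_bps * seconds / 8


def items_per_hour(seconds_per_item):
    """How many items fit into one hour at the given time per item."""
    return 3600 / seconds_per_item


if __name__ == "__main__":
    cd_bytes = implied_size_bytes(RATE_BPS, 2.3)
    dvd_bytes = implied_size_bytes(RATE_BPS, 18)
    print(f"Implied CD size:  {cd_bytes / 1e6:.0f} MB, {items_per_hour(2.3):.0f} per hour")
    print(f"Implied DVD size: {dvd_bytes / 1e9:.2f} GB, {items_per_hour(18):.0f} per hour")
```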

What’s next?

Break 2.5Gbits/s limit

Disk-to-disk throughput & useful applications

Need faster CPUs (extra 60% MHz/Mbits/s over TCP for disk to disk), and need to understand how to use multi-processors

Evaluate new stacks with real-world links, and other equipment

Other NICs

Response to congestion, pathologies

Fairness

Deploy for some major (e.g. HENP/Grid) customer applications

Understand how to make 10GE NICs work well with 1500B MTUs

More Information

Internet2 Land Speed Record Publicity

www-iepm.slac.stanford.edu/lsr/

www-iepm.slac.stanford.edu/lsr2/

10GE tests

www-iepm.slac.stanford.edu/monitoring/bulk/10ge/

sravot.home.cern.ch/sravot/Networking/10GbE/10GbE_test.html

TCP stacks

netlab.caltech.edu/FAST/

datatag.web.cern.ch/datatag/pfldnet2003/papers/kelly.pdf

www.icir.org/floyd/hstcp.html

Stack comparisons

www-iepm.slac.stanford.edu/monitoring/bulk/fast/

www.csm.ornl.gov/~dunigan/net100/floyd.html

Impact on others

