PAM 2003, San Diego April 7-8, 2003

Introduction.

I was on the organizing committee. There were about 90 registrants. Previous PAMs were held at Waikato, New Zealand, and Fort Collins, Colorado. There were 4 papers co-authored by SLAC people (Jiri, Connie, Les). 93 papers were submitted and there were 15 reviewers, so it was very hard to decide which to accept. Over Xmas each reviewer had 4 weeks to go through 12-14 papers. 23 papers were accepted.

ABwE - Jiri Navratil

Tool to estimate available bandwidth quickly and with low impact. Uses simple techniques: looks at how packets get separated without cross-traffic (due to different link speeds, which both compress and expand inter-packet delays) and then at the effects of the cross-traffic. Most of the effects of cross-traffic come from large packets. They cause a multiplication factor (QDF) for multiple peaks in inter-packet delays. Appears to track changes in routing and utilization. Absolute values appear to be within a factor of 2 of iperf. Need to be able to quantify the agreement, e.g. statistical differences. There may be problems with iperf configuration parameters which cast doubt on the accuracy of iperf. Packet pairs measure capacity of a path, not the TCP bandwidth, so the detailed relationship is unclear. Need clearer definition of available bandwidth. {Need to discuss this with Constantinos, also ask about new pathload, TCP vs other estimators and no-server ideas}
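A minimal sketch of the packet-pair idea noted above, assuming the usual relation that two back-to-back packets of size S leave a bottleneck of capacity C separated by S*8/C seconds. The function name and the sample numbers are illustrative and are not taken from ABwE.

```python
def capacity_from_dispersion(packet_bytes: int, dispersion_s: float) -> float:
    """Return the bottleneck capacity (bits/s) implied by a packet-pair gap."""
    if dispersion_s <= 0:
        raise ValueError("dispersion must be positive")
    return packet_bytes * 8 / dispersion_s


# Hypothetical sample: a 1500-byte pair arriving 120 microseconds apart.
estimate_bps = capacity_from_dispersion(1500, 120e-6)
print(f"{estimate_bps / 1e6:.0f} Mbit/s")
```

This gives the capacity of the narrowest link only; as the notes say, relating it to available or TCP bandwidth needs the cross-traffic analysis on top.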

PathChirp - Vinay Ribeiro

Efficient estimator of bandwidth with light load on network. Available bandwidth on a single link is the average unused capacity over some time interval. The available bandwidth of a path is the minimum of the link available bandwidths. Applications include server selection, route selection, net monitoring, SLA verification. Must be fast, low impact, accurate. Principle of self-induced congestion: probing rate < available bandwidth => no delay increase; probing rate > available bandwidth => delays increase. Trains of Packet-Pairs (TOPP) varies packet-pair spacing and looks for where an increase in delay is seen. A shortcoming is that packet-pairs do not capture temporal queuing behavior. So use packet trains. Pathload uses CBR packet trains and varies the rate until it pins down where the delay starts to increase. PathChirp varies the packet spacing (exponentially) within a packet train. This reduces the load on the network by 50%, and it also captures temporal queuing behavior. Look for the point in the chirp train that sees an increase in delay. Not as statistically clean as one might like (i.e. a single point where delay increases is not always present) due to noise of cross traffic etc. So have to average & smooth over many chirp trains. If the delay within a chirp is increasing then the available bandwidth is less than the instantaneous rate of the current chirp packet pair; if the delay is decreasing then the available bandwidth is greater than that rate. Does some outlier suppression (e.g. need > 2 adjacent inter-packet delays to increase). This can give multiple estimates from a single chirp train. Implemented in pathChirp tool, available online (spin.rice.edu), uses UDP packets. Looked at Mean Square Error as vary the probe size and the exponential increase parameter in the packet delays in a chirp. Changed cross-traffic using a (Poisson) traffic generator to see how pathChirp tracks. Compared with TOPP, for the same probing rate pathChirp was more accurate. Also 10 times as efficient as Pathload. Question about time accuracy on server, may limit method to OC12 links.
Can get a good estimate at 7 chirps. What is the standard deviation of the measurements? Unclear if it works above OC3 (155Mbps).
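The chirp idea above can be sketched as follows. This is a simplified illustration, not the pathChirp algorithm itself (which combines per-packet estimates and averages over many chirps). The function names, the spread factor of 2 and the delay values are assumptions made for the example.

```python
from typing import List, Optional


def chirp_rates(packet_bytes: int, first_gap_s: float,
                spread: float, n_packets: int) -> List[float]:
    """Instantaneous probing rate (bits/s) for each gap in one chirp.

    Each gap is the previous one divided by `spread`, so the rate
    grows exponentially along the chirp.
    """
    gaps = [first_gap_s / spread ** k for k in range(n_packets - 1)]
    return [packet_bytes * 8 / gap for gap in gaps]


def excursion_start(delays_s: List[float], min_run: int = 3) -> Optional[int]:
    """Index of the packet after which delay rises `min_run` times in a row.

    Requiring several adjacent increases is a crude form of the
    outlier suppression mentioned in the notes.
    """
    run = 0
    for k in range(1, len(delays_s)):
        run = run + 1 if delays_s[k] > delays_s[k - 1] else 0
        if run >= min_run:
            return k - min_run
    return None


# Hypothetical chirp: 8 packets of 1000 bytes, first gap 8 ms, spread 2.
rates = chirp_rates(1000, 8e-3, 2.0, 8)
# Hypothetical queuing delays (ms): flat, then rising from packet 3 onwards.
delays = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
start = excursion_start(delays)
if start is not None:
    print(f"delay starts rising at about {rates[start] / 1e6:.1f} Mbit/s")
```

The rate of the gap where the sustained delay increase begins serves as a rough single-chirp estimate of available bandwidth.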

Quick Iperf - Ajay Tirumala

Measures TCP bandwidth; want to make measurements in the stable (congestion avoidance) period of the data flow. Otherwise, if one wants 90% of the measurement to be in the stable state, one needs to measure for 10*slow-start time. So need to detect when we are out of slow-start. Can do this using web100, sampling the variables every 20ms. Then make a short measurement (e.g. a second) during the non-slow-start stable state of the flow. Designed for TCP Reno, also works for Vegas. Needs minor modifications to work for Sally Floyd's limited slow start. Made measurements to validate from SLAC to 20 sites, comparing 20 s iperf aggregate measurements with those from Quick Iperf. Appears to work well (better than 68%). Gives a significant reduction of the time needed to make measurements. {Ajay, what happens if have very high speed link, then AIMD/Reno takes a long time to increase. Might be interesting to run for a long time.}
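A small sketch of the arithmetic behind the "10*slow-start time" remark, plus one possible way to spot the end of slow start from polled congestion-window samples. The detection heuristic (first sample where the window shrinks) and the sample values are assumptions for illustration; they are not taken from Quick Iperf or the web100 API.

```python
from typing import List, Optional, Tuple


def total_time_for_stable_fraction(slow_start_s: float,
                                   stable_fraction: float) -> float:
    """Run length needed so that `stable_fraction` of it is after slow start.

    Solves (T - slow_start) / T = stable_fraction for T.
    """
    if not 0 <= stable_fraction < 1:
        raise ValueError("stable_fraction must be in [0, 1)")
    return slow_start_s / (1 - stable_fraction)


def slow_start_end(samples: List[Tuple[float, int]]) -> Optional[float]:
    """Time of the first sample whose cwnd is smaller than the previous one.

    `samples` are (time_s, cwnd_segments) pairs, e.g. polled every 20 ms.
    """
    for (_, prev_cwnd), (t, cwnd) in zip(samples, samples[1:]):
        if cwnd < prev_cwnd:
            return t
    return None


# Hypothetical 20 ms samples: doubling, then a cut at 0.08 s.
samples = [(0.00, 2), (0.02, 4), (0.04, 8), (0.06, 16), (0.08, 8), (0.10, 9)]
print(slow_start_end(samples))
print(total_time_for_stable_fraction(2.0, 0.9))
```

With a 90% target the required run is about ten times the slow-start duration, which is the cost Quick Iperf avoids by measuring only after slow start has ended.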

Integrating Active Methods and Flow Meters - Thomas Lindh

Uses NeTraMet. Testbed measurements using NIST Net, with a source and destination for traffic and passive monitors. Sticks active marker packets in flows, bracketing a block. Can use these to synchronize two traffic meters. Looks at loss periods and loss-free periods, delays etc.

ANEMOS - Antonios Danalis

http://www.eecis.udel.edu/~danalis/ANeMoS.html

Looked at existing tools. Commercial tools are NetView, OpenView, Sun. Open source: PingER, Surveyor, NWS. ANEMOS is a monitoring infrastructure for multiple paths. Uses pathload and ping (plug-ins). Uses MRTG for visualization (do not use MRTG roll-ups so have all data going back a long way), archived using MySQL; sophisticated rules can automate measurements and data analysis. Client is a Java applet, coordinator is the brains of the system. Workers call the external tools that perform the measurement. Measurement requests are a series of measurements for workers to do. Coordinator requests workers to make measurements. Coordinator & workers are written in Java, multi-threaded. Has a grammar for rules (BNF) to combine variables with operators. Has a GUI to choose operators and variables. Prototype, just released. Have tried with 10-15 workers, do not know if it would scale to thousands of workers. Desynchronizes measurements.

Third Party Addresses in Traceroute Paths - Young Hyun

AS level Internet topology is very useful: study growth, performance, resiliency, convergence time. Traceroutes do not need ISP cooperation, but can only see active links. Inaccuracies are due to unresponsive hosts, filtering, rate limiting, private addresses, multicast, loopback, routes not stable, routes incomplete. 200k destination lists used, clients of DNS root servers. Many organizations have more than one AS (mergers), some are closely allied (e.g. Qwest, US West). 3rd party addresses have to do with the reverse path not being the same as the forward path.

Self Configuring Network Monitor - Brian Tierney

http://www-didc.lbl.gov/SCNM/

Host with passive tap to network, plus a regular interface. End nodes can request that capture of specified packet headers be turned on (these have to be packets for the requesting end node). Captured packets are sent back to the end node. Monitor host system is installed and maintained by the network admin. FreeBSD 933MHz, 2 SysKonnect 1GE NICs (one input, one output). Time stamp in NIC; interrupt moderation (to reduce CPU requirements) must be short enough to avoid running out of kernel buffer descriptors. Memory bandwidth is the key bottleneck (need fast PCI bus). Believe hardware will almost handle 10GE. Do header compression (10s 300Mbps TCP stream = 20MB of header data). Use CSLIP approach (3% of bits are header for 1500B MTU). Developed SCNMPlot based on tcptrace; allows multiple inputs, with a data nudge facility. Boxes at NERSC, LBNL, ORNL and SLAC.

IEPM-BW - Connie Logg

Architecture of a Network Monitor - Andrew Moore

Nprobe does full line rate capture at 1GE on COTS hardware. Discard is the best form of compression, so be selective. Can also split the workload across CPUs in 1 box, or multi-CPUs in multiple boxes. Host to host <= capacity of a single monitor. Works well. Have to avoid memory copies, single thread to minimize memory usage. Have time stamps below 1us, filtering on card with minimal impact. Use XOR of src and dst as the hashing criterion. Compress/discard to improve recording rates. Have modules, most mature are for tcp, udp, http ... Compare tcpdump/nprobe: 64B packets 19.4/95.9Mbps, live mix 209/340Mbps. Also looked at disk rates. Again nprobe outperformed tcpdump. 44K transactions (flows) concurrently (304Mbps). CPU limited (not PCI or disk), CRC computed in CPU caused some of the load. 23000 new flows/second, CPU starved (FSB/L2 cache limited). Sustainable traffic load of 280Mbps, mean packet size 190Bytes, bursts of up to 500Mbps with 1 sec measurement period. 50% of traffic (by byte) was http. With http only got 189Mbps (mean 234Mbps). Can scale up with more analysis processors.
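The XOR hashing criterion mentioned above has a useful property that a short sketch makes visible: XOR is symmetric, so both directions of a flow land on the same monitor. The function name and the modulo step are assumptions for illustration and are not taken from Nprobe.

```python
import ipaddress


def monitor_index(src: str, dst: str, n_monitors: int) -> int:
    """Pick a monitor/CPU for a packet from the XOR of its two addresses.

    Because a ^ b == b ^ a, requests and replies of one flow map to
    the same index, so a single monitor sees the whole conversation.
    """
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % n_monitors


# Hypothetical addresses: forward and reverse direction of one flow.
forward = monitor_index("10.0.0.1", "10.0.0.6", 4)
reverse = monitor_index("10.0.0.6", "10.0.0.1", 4)
print(forward, reverse)
```

This is also why traffic between one pair of hosts cannot be split further, matching the note that host-to-host traffic must fit within a single monitor.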

Next (nb this is a prototype), use smarter hardware (TCP off-load), move work to NIC processors

pktd: A Packet Capture and Injection Daemon

This is part of NIMI. Active (looks like an attack) vs Passive (security). Trust is a major problem for deployment. Most infrastructures require people to know one another, which is not very scalable. Client host requirements vs host owner requirements. Need fine granularity to allow people access (read/write, user, group ...). Still an administrative hassle. Idea of pktd is to allow/enable trust. pktd is a daemon that runs on NIMI hosts: sole, trusted, privileged entity with full NIC access, multiplexed access for clients. Provides similar information to what libpcap provides + an inject interface. Advantages: only entity host owners need to trust, static, finer granularity, more efficient use of resources (packet filter access). Control mechanisms: per-client tuning, access type (capture vs injection), traffic type selection to deliver to which clients, traffic content, and (to come) resource control. Can allow ssh access but not telnet port data access (since password is in the clear). Can decide how much data/packet to provide (IP/ttl OK, src/dst may be private). Define protocol masks that decide which fields are accessible to the client. Have trace anonymization that works on Linux, FreeBSD and Solaris. Able to capture 832Mbps with full 90 byte headers. They also have CSLIP packet compression. Compression may be lossy.

The Spectrum of Internet Performance - Roberto Percacci

Relate distance to RTT. Can spot bunches of data by where the pings go between, but it is hard to extract quantitative measures from scatterplots. Use T=RTT/distance. 1<T<infinity, and T represents the slope of RTT vs distance. It is dimensionless and has a nice distribution. Tmin=RTTmin/d, Tavg=RTTavg/d. R=Tavg/Tmin=RTTavg/RTTmin. Refer to Tmin, Tavg, R, PL (packet loss) as the spectrum of Internet performance. RTT=propagation+processing=L (in seconds)+P. At unloaded times can neglect processing, and RTTmin~L. R=estimate of congestion, Tmin~wiggliness of path. Distributions show peaks with dips across the Atlantic, RTT distribution is flat out to 200msec. Log(R) vs Log(Tmin) is roughly a straight line, with a tail of the form P(x)=x^-a; also show CP(x)=x^-(a-1). Power law tails result in lack of characteristic scale, large fluctuations, and an average that has little meaning. Looking at PingER data see clear tail in Tmin dominating the distribution with a~3, similar for Tavg; also packet loss gives a nice straight line from 1-10% loss. See similar results from RIPE hosts, AMP and Grid hosts. All show similar long straight tails.
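The spectrum quantities defined above can be computed as in the sketch below. For T to be dimensionless and bounded below by 1, the distance has to be expressed as a time; the sketch assumes the round-trip light time 2d/c in vacuum as that normalization, which is an assumption and may differ from the speaker's choice. The function name and sample values are illustrative.

```python
from statistics import mean
from typing import Dict, List

C_KM_PER_S = 299_792.458  # speed of light in vacuum (assumed normalization)


def spectrum(rtts_ms: List[float], distance_km: float,
             loss_fraction: float) -> Dict[str, float]:
    """Return Tmin, Tavg, R and PL for one pair of hosts."""
    if distance_km <= 0 or not rtts_ms:
        raise ValueError("need a positive distance and at least one RTT")
    light_rtt_ms = 2 * distance_km / C_KM_PER_S * 1000
    t_min = min(rtts_ms) / light_rtt_ms
    t_avg = mean(rtts_ms) / light_rtt_ms
    # R = Tavg / Tmin = RTTavg / RTTmin, so the distance cancels out.
    return {"Tmin": t_min, "Tavg": t_avg, "R": t_avg / t_min,
            "PL": loss_fraction}


# Hypothetical pair of hosts 6000 km apart with three RTT samples (ms).
result = spectrum([80.0, 90.0, 100.0], 6000.0, 0.01)
print({name: round(value, 3) for name, value in result.items()})
```

Since R does not depend on the distance, it can be computed even when host locations are unknown, whereas Tmin and Tavg need a location for both ends.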

amin is close to 3 for PingER and RIPE. For AMP it is closer to 4. AMP has a large number of Abilene sites, so is not as representative (Abilene performs better than others). aavg is ~2.6. a (for R) has larger exponents, 4.5 to 5. Loss has a~1.2. The power law is telling us something about an intrinsic property of the Internet.

Has a very simple model that seems to reproduce fairly well. Random distribution of hosts in a star like network (hubs are IXPs etc.) Such a stellar network has an a~3. Tried variations on how the hosts are distributed about hubs (e.g. Gaussian, spherical), all have slopes of ~3. For meshed network gets a~3.4.

Further work compare with real measurements of L that use traceroute + localization. Confirm power law distributions on larger samples. How much from layer 2, how much from IP, how much from inter-domain routing. Could use concept to define quality of connection to the "Internet cloud".

Effect of Malicious Traffic on Network - Alefiya Hussain

Objective: quantify the effect of malicious traffic on normal network. Look at impact of latency & throughput during attack. Malicious = DoS & worms, compare with normal (http, ftp, smtp, nntp, ...) traffic

Discussions

John Estabrook of UIUC has agreed to ensure Ajay's Quick Iperf mods are folded back in to iperf.

Constantinos Dovrolis said his new pathload would be available Thursday this week. Among other things it detects interrupt coalescing.

Met with John Hicks from Indiana University. He will provide us an account, password and

E2Epi TAG, Washington

Brief report on PAM03. Ajay's paper on Quick Iperf was very interesting and promising. Advisor is coming close to completion. Working on new GUI for iperf. Analysis engine and data collection GUI will be integrated in May. June 9th is the date for overall availability, so can demo this summer. Iperf 1.7, which provides 2-way measurements, is officially released.

Matt Z mentioned presentation of the ANEMOS system to allow input to data selection from a database.

Open plenary 1:00pm today. Will join Thursday sessions on web100 and E2EpiPEs.

SC03 is in Phoenix this year. Matt Z will be making measurements at SC03.

Abilene PAC meeting was yesterday, starting to think about Abilene 3. Endorsing large MTU, getting experiments going on various TCP stacks. Constructing an MPLS path from PSC to Chicago, can try experiments over MPLs, over production and over dedicated.

Have submitted a set of (mainly web100) variables to Microsoft to get them to expose them. Will come out with .net SP1. Will provide ability to query various flow variables. Will have something web100-ish.

Had presentation of MonaLISA at Pipe fitters meeting. Delves more into the host side than Pipes does. They want to work well with Pipes, Harvey will put manpower (1/3 FTE) into integrating it. There did not appear to be a scheduler to prevent, say, 2 large iperf tests going on simultaneously.

Wizard workshop: working with Quilt people on a proposal to NSF. May approach DoE and NLM (?). Peter Clarke of UCL, UK is also very interested. At the measurement workshop in Miami, Matt & Stas presented how they had used tools; this caused considerable interest and sparked the idea for the Wizard workshop. Rough outline is a series of workshops over 12-18 months, with at least one major wizard camp with global wizards for master presentations, with video capture for later streaming of presentations. Three regionals to work with GigaPoPs, LAN folks. Then a monthly "virtual briefing" web cast to parallel topics in the regionals. 1.5 days of instruction and 0.5 day of bring your problems. There will be an archive, FAQ, streaming videos, listserv mailing list. Start middle to end of September 2003. A tangible deliverable will be a trouble-shooting guide.

Looking at a new chair for the TAG, need a non-staff member to chair.

Radio astronomers (David Lapsley, network engineer). Need toolkit to get E2E performance, need Gbps for radio telescopes. Trying to establish connections between sites. Lot of trouble-shooting to get good performance. VLBI correlator, want to use hi-speed research nets. Some of the radio-telescopes have high speed connections (e.g. Haystack), not all. Dennis Paus might be a contact.

HENP Network WG

Optical networks - Benoit Fleury

Costs of links are coming down (at least a factor of 2 every 18 months). This is faster than the improvement in cpu density or memory (double every 25 months). This is being driven by higher speeds per wavelength on a fiber (currently 10Gbits/s; 40Gbits/s is almost ready but needs to become economically viable, i.e. today it is cheaper to buy 4*10Gbps than 1*40Gbps), and by more wavelengths per fiber (about 80 wavelengths per band, and 4 bands per fiber; a band is a range of frequency where one can get signal through, the most common band today is the L band ~ 1550nm). Today there are typically 30-40 wavelengths per fiber (currently advanced production systems are configured for 30-40, but typically only use 8). Also, removing the need for OEO (optical-electrical-optical) regenerators (replaced with optical amps, which are relatively cheap to build and are wideband, so do not need a lot of tuning or one amplifier for each wavelength) and only needing electronics (transponders) at user sites reduces costs and increases flexibility, i.e. enables faster set up and less setup & maintenance. $10/km/Gbits/s is the new price point, actually closer to $100-200/km/Gbits/s for the first fiber. Looking into optical VPNs, which are ideally suited for multiple users who share a network but want their own shared security shell.
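To make the two trend figures above comparable, the sketch below compounds a factor of 2 every 18 months (link cost) against a doubling every 25 months (cpu density or memory). It treats both as smooth exponentials, which is a simplifying assumption; the function name is illustrative.

```python
def improvement(months: float, doubling_months: float) -> float:
    """Improvement factor after `months` if it doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


for years in (1, 3, 5):
    months = 12 * years
    link = improvement(months, 18)   # link cost/performance trend
    cpu = improvement(months, 25)    # cpu density / memory trend
    print(f"{years} yr: links x{link:.1f}, cpu/memory x{cpu:.1f}, "
          f"ratio {link / cpu:.2f}")
```

The ratio column shows how quickly the gap widens, which is the argument for expecting bandwidth to become cheap relative to end-host resources.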

iGOC Update, James Williams, Indiana U

http://igoc.iu.edu/

High speed TCP Transfer over long delay high performance links - Sylvain Ravot

Microsoft have achieved 5Gbits/s from one host sending data through an Intel 10GE card. It was important to use a large send buffer of 64K which then gets chopped up into MTUs by the NIC. This feature is not available on Linux.

Discussions:

Shawn McKee and I discussed ideas on creating MonaLisa clients to provide host information (e.g. free memory, # processes etc.). A user on the client would request the Java applet from the server, then run it to get the OS configuration. This info would be analyzed and compared with baselines to identify problem areas. It could also do pings, traceroutes (both directions), bandwidth estimation, and iperf tests to a server. The servers might be Grid/HENP site servers with tools such as reverse traceroute servers, Rich Carlson's tool etc. The information gathered could be recorded to an archive server so others could access the data for analysis etc. (e.g. look at frequency of various types of errors etc.)

I had several discussions with Guy Almes of Internet2. He is very interested in the performance of various stacks and I gave him a copy of my slides. He pointed out that a big advantage of FAST TCP is that it should induce less queuing than methods that rely on loss for congestion notification. This in turn should lead to less jitter which is important for many applications.

Warren, Richard Carlson and Shawn McKee met over breakfast to discuss the MAGGIE proposal. We discussed how to use Richard Carlson's test applet, e.g. deploy at MAGGIE servers so can add to on-demand tests in cases of problems (other tests would be ping, traceroute, iperf, ABwE). Shawn will apply for funding for his project, which will provide a database to capture and archive measurements, and publish and make them available via GGF NMWG and emerging schemas. We will integrate MAGGIE measurements with his database. For the proposal(s) we will write an introduction to show the importance of monitoring/measurements related to providing performant, predictable, robust end-to-end networks, and stressing the needs for such network features. The idea is to sell this to DoE and NSF programs as well as to users (e.g. HENP, astrophysics, Grid, bioinformatics) so as to get wide support. Rich agreed to make a stab at the introduction.

I talked with Julio Ibarra of Florida International University (a major player in the Latin American AMPATH project). He said they would be very interested in participating in the Digital Divide ICTP meeting in November. I will send him dates and web pages explaining what we are doing.

Antony Antony of NIKHEF has been looking at how to get high performance throughput end-to-end. He has evidence (to be published in an iGrid2002 paper) that one needs the end hosts to be well matched in terms of their speeds of sending and receiving packets. If they are not then one host can over-run the other resulting in packet loss and degraded performance.

Russ Hobby is pushing to get link speed for an IP address added into the DNS record (using a text record). He has talked to CENIC, Internet2 and will talk to the Quilt next. Once one has it, it is made useful by including it in the output from a modified traceroute.

Opening plenary - Doug Van Houweling

The simple idea is that applications drive the network that enables the applications. There are layers between the applications and the network, namely the middleware and services, and the pillars between the layers include security and end-to-end performance. Internet2 community includes 202 universities, 30 affiliates, corporate members, government partners, international (40 MOUs), GigaPoPs, K20.

Peter Freeman - Assistant Director Computer & Information Science & Engineering Directorate

Ultra High Speed Networking, ANL April 10, 2003

This was a follow up meeting to the August 13-15, 2002 High-Performance Network Planning Workshop for the Office of Advanced Scientific Computing Research of the DoE. The official title of this meeting was "DoE Workshop on High-Speed Network for Large Science Applications".

Walt Polansky

Terabit Networking for R&D for PetaScale Science - Thomas Ndousse

MICS looks after basic R&D in CS, App. Math, Collab ... Program mission is research to develop & deploy advanced network capabilities to address the unique net requirements of the Office of Science. 47% budget to DoE Labs, 54% to Universities. Try to ensure links from Universities to Labs. 31% measurement, 26% high-speed protocols, 32% security, others 11%. Why do net R&D: address issues not addressed by other federal R&D programs, no economic payoff for industry, unique net requirements to support agency science mission, research focus & agenda driven by big impact science applications, e.g. HENP.

There are large scale collaboration challenges beyond high-speed networking, that involve cyber-security, locating resources. This requires middleware. Need to bring the network to par with super-computing. Need to make sure the networking is ready when the application/users need it. Must identify what ESnet applications need beyond commercial internet. So 3 layers: commercial, Internet2/ESnet, Internet3/Tbits/s nets. Logical optical networks make it possible to construct special, high-speed, specially provisioned networks on demand (may include coarse-grained QoS, non IP etc.)

The workshop focus is on the network transport and provisioning (not on applications or middleware). Looking at radical changes as opposed to incremental changes. Then need to consider effects on business model. Not here to discuss what type of transport protocols should be on the commercial Internet, rather it is to discuss what is to be used on ultra-high scale networks. Outcome is a workshop report and a special IEEE issue on communications in order to assist in communicating. Need to indicate where we need to go in networking if the country is to be on the leading edge. The audience is both technical but also the people who can make policy. Need to also understand the ability of the transport to scale, how does it interoperate with non ESnet sites (n.b. most of ESnet sites communicate mainly with non-ESnet sites), specifically need to address legacy interaction. Aim at Tbits/s throughputs end-to-end, what mods are required to get there, can TCP scale.

Issues: load balancing (e.g. OC768 (40Gbps) and beyond may consist of multiple wavelengths). Transactions on ultra-high speed nets may be short lived in contrast to 7x24 operations. New business models. Current net provisioning may not scale to Tbits/s nets. Can consider decoupling the transport protocol from the backbone routed network. Has idea of multi-tier networks: production, science ultranet for a few high impact science sites, and one for network testing.

ESnet - George Seweryniak

Current backbone is OC192, up to OC768 by 2008. Pent up demand of big science and new optical technologies make it possible. Concern about security (e.g. firewalls) impact on throughput.

Bill Wing

DoE has greater networking needs (and less money) than any other federal agency, e.g. HENP 10Gbps steady by 2006, Genomics 10's of GB today and 10's of TB by 2005, Climate modeling PB file transfers by 2006. Climate modeling wants to drive the decision making down to local governments, so needs high resolution. DoE's next gen net 2-20Tbps, separate channel for TCP-hostile control. How to get it cheap, can we get there from here.

High performance Networks for high impact Science Workshop - Ray Bair

55 people attended Aug 13-15, 2002, including network, apps and provider experts. Complementary to this meeting, more of a requirements bias from applications/users. Diverse applications: climate modeling (few large data repositories, many computing sites), SNS (tight schedules, high data rate/volumes), macromolecular crystallography, HEP (data intensive, global collaborations), MFE (tight schedules, RT collaboration), chemical sciences, bioinformatics (v. large research community, many large databases). There are many shared requirements (large dataset dissemination, distributed collaboration, distributed data source & usage, community data directories, auth reliable transport/control, service guarantees to dist resources and RT distributed analysis/steering). Much of science is already a distributed endeavor or is rapidly becoming so; science paradigm shifts depend critically upon an integrated advanced infrastructure well beyond today's.

Middleware priorities: secure control over who does what, information integration and access, co-scheduling & QoS, effective network caching & computing, community services to support collaborative work, monitoring & problem diagnosis.

Evolution: connectivity/bandwidth > predictability > manageable data.

High-priority network research items (more related to this workshop): ubiquitous monitoring & measurement infrastructure (middleware often needs understanding of underlying net to make decisions, data must be publishable in a scalable format); hi-perf transport protocols (TCP limitations, research to both improve & create new protocols); multicast; guaranteed performance & delivery; intrusion detection (need predictive analysis, can one get warning of an attack about to occur).

Three needs for provisioning: production level for base program; resources for high utilization science (in support of challenging science applications); resources for network research (easily separable for running controlled experiments). But may not be 3 different nets. Need good migration of capabilities from one level to another. Need integrated net provisioning strategy across all 3 layers, be agile and adapt to changes, develop shared visions/metrics of success.

Breakout group: Transport

Discuss existing methods & their evolution 3-5 year horizon

What requirements do apps need? Some metrics include: high bandwidth, low loss, stable bandwidth (low burstiness), low latency, low resource (cpu, memory etc.) utilization, fairness, performance gap (COTS vs wizard), QoS, robust/reliable (checksums, maybe faster to transfer file twice and then diff rather than do MD5 checksum once), security, multicast. High bandwidth and real-time applications, RT requires stability, predictability. Caches are bad for RT.

Usage scenarios: bulk data transfer / data replication, remote visualization, computational steering, data mining, instrument control
Service classification / QoS: guaranteed vs. best effort, predictability
High bandwidth
Stable bandwidth (low burstiness)
Low latency
low resource (cpu, memory ...) utilization
Fairness
Real-time
Performance gap: out of box vs. net wizard vs. what apps want
Robustness/reliability/error rate: loss
Security - authorization & authentication (session layer), privacy & integrity (presentation layer)
Multicast

The big investment in the Internet is the algorithms. Yet for congress we cannot get funding for TCP. So what is TCP, is it new TCPs, do we have to throw out all possible TCP algorithms and go to UDP or something. Can we make major changes to TCP, e.g. {possible solutions are in braces}

Algorithmic: block vs. byte oriented {SCTP/RDMA/R-UDP/Tsunami}, bigger sequence counts (wrap around within 14 ms at Tbits/s with out of order packets separated by > max time) {PAWS, FAT TCP RFC 1263}, check-sums/CRC {SCTP/RDMA}, congestion control as it relates to stability & convergence (impact of using loss for congestion on the queuing) {FAST, Scalable, HS, XCP, control-theoretic approach, stochastic approximation (SA)}, self-clocking ACKs, friendliness/fairness (alternate control functions, slow start - especially for small files and impact of slow start on other flows at high speeds in a shared environment, can one use out-of band knowledge or history to by-pass slow-start), ability to turn off congestion control {not implemented but could be}, bigger window field (no need for window scaling), assumes shared packet-switched network; striping /parallel streams {SCTP, RDMA, R-UDP}.
Implementation: flow control with respect to advertised window -> DRS / Web100 / Net100, MTU size, "excessive" cpu & memory utilization {TOE}, make it self tuning/configuring (today systems are optimally tuned for homes), MTU {TCP/IP . device driver (virtualize MSS/MTU)}
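The sequence-count wrap-around concern in the Algorithmic item can be checked with simple arithmetic, assuming TCP's 32-bit byte sequence space. The result depends on the line rate assumed, so it need not reproduce the 14 ms figure quoted above; the function name and the rates chosen are illustrative.

```python
SEQ_SPACE_BYTES = 2 ** 32  # TCP sequence numbers count bytes in a 32-bit field


def wrap_time_s(rate_bps: float) -> float:
    """Seconds for a flow at `rate_bps` to consume the whole sequence space."""
    if rate_bps <= 0:
        raise ValueError("rate must be positive")
    return SEQ_SPACE_BYTES * 8 / rate_bps


for label, rate in [("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10), ("1 Tbit/s", 1e12)]:
    print(f"{label}: {wrap_time_s(rate) * 1000:.1f} ms")
```

At Tbit/s rates the wrap time falls to tens of milliseconds, which is comparable to wide-area round-trip times, hence the interest in PAWS-style protection or larger sequence fields.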

Problems with current transport methods & possible "solutions" with TCP, short term (now - 5 years)

Algorithmic: alternate means of estimating path properties (out of band), slow start, congestion control; Rate based rather than window based, tsunami;
Implementation: DRS / Web100 / Net100; kernel based vs. user space; adoptable by mass-market

What TCP cannot address, who addresses it?

Reservations (RSVP) or at transport
Operating over circuit switched lambdas

Vision for next 10 years

Composable ("Lego") Transport Protocols: configuration time vs. run-time loadable modules; user hints, network passed info, need to define functions to be composed:
- parallel streams, network striping, multi-path, data aggregation (dual of multicast)
- unit of book-keeping byte vs. block vs. file oriented,
- error control: strong CRC & FEC
- QoS: best effort vs. reservation;
- information export API to application and network management;
- environment: LAN vs. SAN vs. WAN mixed
- Persistence of data in the network
- Control: RTT dependence or not ...
Reservation-based solutions:
- on demand /advanced network provisioning: circuit-switched with FEC (packet loss is NOT due to congestion);
- limited sharing vs. exclusive use
Guarantees (deadline, bandwidth)
Analytical design based on control and statistics: e.g. effects of composing certain transport features with others, e.g. RTT dependent vs. independent
Knowing where data is headed ahead of time may influence how transport protocols are composed: interface with I/O issues
Issues:
- Legacy problems: interoperability, smooth migration, inter-working, stand on shoulders
Circuit switched lambda
Hybrid: packet & switched; dedicated SAN vs. shared WAN
MTU

Discussions

Talked to Ray Struble of Level(3) about testbeds. Talked to Matt Mathis about tests we should make if we had a testbed. He is interested in throughput for a single stream as a function of MTU size. Talked to someone from Juniper about fundamental limitations of routers (memory speeds and buffering, instructions). Talked to Bill Wing about the testbed proposal, he said he got quotes from Level(3) for circuits from Atlanta to Chicago and the costs were too high to contemplate for any proposal to the DoE. He was hopeful of a better deal from AT&T but has not been able to get definitive quotes. Talked to Linda Winkler about whether there is a spare OC192 port at StarLight if we were able to get a circuit to there. She said she has no interfaces and is not even sure if she has a slot spare in the router. I mentioned to Wes Kaplow of QWest our desire for a few months loan of a high speed link from SNV (actually he pointed out he can bring the link to SLAC/Stanford) to Chicago to replace the Level(3) testbed, for example for SC2003, he did not laugh at me, and will look into. I also talked to Cisco about the need for high speed demonstration equipment for SC2003.

Other interesting asides:

Middleware wants to be able to make predictions many hours into future (e.g. take into account diurnal behavior - Warren this might be a way to "sell" the diurnal behavior work).
RDMA (see http://www.rdmaconsortium.org/home) is becoming a solution for TCP's impact due to small MTUs.