Author: Les Cottrell. Created: May 13
Attendees: Thomas Ndousse (DoE MICS, SciDAC network program manager), Richard Baraniuk (Rice University), Wu-Chun Feng (LANL), Les Cottrell (SLAC)
Thomas confirmed that there are no funds available at the moment for a new network proposal. He needs such proposals to show that there is important work that should be funded. The last payment from the current funding cycle will be in Jan '03, but the work is expected to go on until June '04. Having proposals in hand can be very important, since monies can become available at any time.
Thomas feels he needs a very exciting proposal that will be unique in meeting DoE scientists' needs, will get a lot of support from DoE scientists, and will catch the attention and minds of Congress. Something like "Terascale networking for Petascale computing"; see below for more on this. He does not feel that extending the current SciDAC network proposals will get the required support from either scientists or Congress. He also feels that the amount of SciDAC money currently being focused on network measurement is hard to defend. To defend the measurements he needs scientists to say "my science will break without this or that capability". Thus we need to tie the measurements into scientific applications. PPDG is a good start, but it is only 5% ($3M) of the SciDAC funding ($60M), so it would be valuable to be involved with other SciDAC applications.
Thomas presented ideas to fund a wavelength network between Chicago, ORNL and Sunnyvale. This test network would be used to study how to meet next-generation high-speed bulk-transfer requirements. Thomas is concerned that TCP cannot scale to 10 Gbits/s and beyond due to the lack of large (GB) transfer units, the need for cpu intervention, the lack of forward error correction, the need for flow control, etc. He believes that ST (Scheduled Transfer) may be an answer. ST basically allows disk-to-disk transfer over the network, so it removes some of the problems faced by TCP. SGI is implementing in the Lab an ST interface to read from disk and put data on the wire.

At the same time Thomas does not feel he can fund wavelengths to each end node (e.g. SLAC, LBNL, NASA, Caltech, SDSC) for the west coast, so he is interested in funding "depots" at each wavelength end-point. Copies between the depots would be made at high speed, and data would be cached there until copied from the depot to the end node. Only one transfer uses the wavelength at a time, so careful scheduling of the wavelength is necessary. The end node also needs to be notified when the copy is ready in the local depot so it can pull the data out, making the space in the depot available for re-use. Alternatively, the depot could push the data onto the end node. There will need to be a reasonable balance between the speed of the end-node links and the network backbone links, otherwise what may take, say, an hour to transfer on the backbone will take many hours to copy to the end node.

Another issue with any copies (TCP or ST) at high speeds is disk transfer rates, which are not keeping up with network transfer rates. This will require a lot of parallelism in both disks and servers. Thomas said David Wormsley is working on parallel I/O at Sandia. Thomas is thinking about a $3M proposal, with ~$1M for the wavelength and $2M for R&D. At a later stage more wavelengths could be added, to show the ideas scale.
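The backbone/end-node balance issue can be made concrete with a quick back-of-envelope calculation. The link speeds, dataset size, and efficiency factor below are hypothetical illustrations, not figures from the meeting:

```python
# Back-of-envelope check of the backbone vs. end-node link balance.
# All figures here are hypothetical illustrations, not from the meeting.

def transfer_hours(size_bytes, link_bps, efficiency=0.8):
    """Hours to move size_bytes over a link, assuming a fixed
    protocol/disk efficiency factor."""
    return size_bytes * 8 / (link_bps * efficiency) / 3600

TB = 1e12
size = 4 * TB                            # a 4 TB dataset parked in a depot
backbone = transfer_hours(size, 10e9)    # 10 Gbps wavelength
end_link = transfer_hours(size, 622e6)   # 622 Mbps end-node link

print(f"backbone: {backbone:.1f} h, end node: {end_link:.1f} h "
      f"({end_link / backbone:.0f}x longer)")
```

With these assumed figures, a copy that occupies the wavelength for about an hour ties up the depot for roughly 16 times longer while it drains to the end node, which is why the depot scheduling and link balance matter.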
Thomas's concerns: understanding complex network dynamics, the lack of adequate modeling techniques, network internals being inaccessible, and poor understanding of application/network inter-modulation. There are also concerns over next-generation TCP protocols on high-speed circuit-switched networks; a researcher from Brookhaven Polytechnic says TCP will not scale and wants to use ST. SciDAC needs to produce demonstrable results that improve scientific applications, so our projects must enable SciDAC goals and we must integrate our tools with SciDAC applications; i.e., a 3-year project needs both research and a demonstration of integration with applications. TransPAC at Indiana University wants to use TICKET to keep up with monitoring grid traffic, for example so that applications can be optimized in how they run and use the network. Scientists need to get high-performance throughput more easily. Thomas wants to take our presentations back to DoE and to scientists to show how our work benefits them.
Rice is working on developing analysis of data, e.g. from TICKET. The analysis is offline; path modeling will be real-time (after the MatLab code is cleaned up). The techniques are based on provable mathematical theories and could be used to develop new protocols. They need to be tied into real applications, e.g. how do we improve performance for the HEP program, or on the Probe project? E.g. put TICKET at SLAC or BNL on a datamover and understand and improve the performance.
INCITE tasks are: multiscale traffic analysis; inference algorithms for paths, links, and network tomography; active measurements; MAGNET; TICKET; passive path measurement; and a tomography tool kit.
Presented multifractal chirp probing for cross-traffic (CT), tomography (which adds a spatial view of the net, not just end-to-end but along the path, and might one day be useful for selecting optimal paths), and alpha/beta traffic patterns.
Available bandwidth estimation measures CT from the probe-packet delay spread of packet pairs with various initial separations between the packets in each pair (an exponentially spaced packet train). There is a conflict between closely spaced packets, which load/perturb the net, and widely spaced packets, which don't see the CT. One problem is the use of Matlab: great for prototyping, but they are looking at alternative free software (e.g. Octave) and also C code. They would like to get MRTG SNMP utilization at peering points to validate the CT results, e.g. at SLAC, the CERN Starlight router, the CERN router, and the IN2P3 router. Once validated, SNMP access to the routers is no longer needed.
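For readers unfamiliar with the packet-pair principle behind this, the following is a toy single-bottleneck fluid model. It sketches the general idea only, not the Rice/INCITE chirp estimator; the link speed, probe size, and CT rate are invented:

```python
# Toy single-bottleneck simulation of the packet-pair idea: the gap
# between two probe packets after the bottleneck reflects the cross
# traffic (CT) queued between them.  This shows the basic principle
# only, not the INCITE chirp estimator; all rates are invented.

def output_gap(input_gap_s, probe_bytes, ct_rate_bps, link_bps):
    """Gap between two probe packets leaving a FIFO bottleneck link."""
    # Cross traffic arriving during the input gap queues between probes.
    ct_bits = ct_rate_bps * input_gap_s
    service = (probe_bytes * 8 + ct_bits) / link_bps  # time to drain
    return max(input_gap_s, service)

link = 100e6   # 100 Mbps bottleneck (hypothetical)
probe = 1500   # probe packet size in bytes
ct = 60e6      # 60 Mbps of cross traffic (hypothetical)

for gap_ms in (0.12, 0.5, 2.0, 8.0):        # exponentially spaced gaps
    g_in = gap_ms / 1e3
    g_out = output_gap(g_in, probe, ct, link)
    # When the output gap expands, the expansion reveals the CT rate:
    est_ct = (g_out * link - probe * 8) / g_in if g_out > g_in else 0.0
    print(f"in {gap_ms:5.2f} ms  out {g_out*1e3:5.2f} ms  "
          f"est CT {est_ct/1e6:5.1f} Mbps")
```

Note how only the closely spaced pair expands and recovers the 60 Mbps of CT, while the widely spaced pairs pass through unchanged and see nothing; that is exactly the perturb-versus-blind conflict described above.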
Rice has set up IP tunneling to allow packets going from A to B to go via C, so C can look at the packets as they pass by.
Tomography: Yolanda Tsang & Rob Nowak (also collaborating with Don Towsley). They use active probes to understand the links by looking for shared features. They are looking at topology (discovering where things are logically located) and tomography (given the topology, finding the characteristics of the links, e.g. loss, CT, RTT). This is good for knowing where things are located on the grid, so grid people/applications know where things are, how they are laid out, and how data centers are connected up. We should write a brief presentation for PPDG to show the Grid folks how to find out where resources are and how they are connected. It is a complex mathematical problem; Rice is also working on how to make the inference.
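The "shared features" idea can be illustrated with a toy simulation: two receivers behind a common lossy link see correlated probe losses, while receivers on separate lossy links do not. The topology and loss rates below are invented; this sketches the principle only, not the Rice inference algorithm:

```python
# Toy illustration of shared-feature tomography: losses on a common
# link show up simultaneously at both receivers, losses on private
# links do not.  Topology and loss rates are invented for illustration.
import random

random.seed(1)

def probe(shared_loss, leaf_loss):
    """One probe fanned out to receivers A and B past a branch point."""
    if random.random() < shared_loss:      # lost on the common link
        return (False, False)
    return (random.random() >= leaf_loss,  # A's private link
            random.random() >= leaf_loss)  # B's private link

def joint_loss_rate(shared_loss, leaf_loss, n=100_000):
    lost = sum(1 for _ in range(n)
               if probe(shared_loss, leaf_loss) == (False, False))
    return lost / n

# A 5% shared link yields ~5% simultaneous losses; the same 5% loss on
# independent per-receiver links yields only ~0.25% simultaneous loss.
print(joint_loss_rate(0.05, 0.0))
print(joint_loss_rate(0.0, 0.05))
```

The inference runs in the other direction: from the observed correlation of losses (or delays) one decides which receivers share a link, which is what makes the mathematics hard.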
Wu's focus is more on the end point, from when traffic hits the computer through to the application. He has two tools, MAGNET and TICKET, that make network measurements with commodity parts.
MAGNET monitors traffic immediately after it is generated by the application, i.e. before it is modulated by the TCP/IP stack. The goal is to create a library of application traces for network testing. It captures information at the interfaces between the application and the TCP stack, between TCP & IP, IP & data link, and data link & network. It has fine-granularity (ns) timestamps, high performance and low overhead. Besides capturing the actual event (e.g. a packet header) it can also capture a union of other data (e.g. TCP dynamic parameters such as packets in flight). The minimum record size is 24 bytes; it can add 64 bytes for the TCP parameter data and a further 8 bytes for the IP parameter data. Information is communicated between the kernel (Linux 2.4) and user space using shared memory. The impact on throughput is < 5% at 100 and 1000 Mbps. It also has a modest increase in cpu utilization (up to 26% at 100Mbps and 2% at 1000Mbps). Event loss is a function of the buffer size and delay (of what?) (see presentation for graph) and is less than tcpdump/libpcap (15%). They do not have any visualization tools yet. The ability to make nsec timestamps could be useful for the INCITE BW/CT tools, especially for making measurements on high-speed (> 100Mbps) links.
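The record sizes quoted above imply a simple budget for the shared-memory buffer; the buffer size and event rate below are assumed for illustration only:

```python
# Record-size budget for MAGNET events, using the sizes quoted above.
# The shared-memory buffer size and event rate are assumed examples.

BASE = 24         # minimum record size (bytes)
TCP_EXTRA = 64    # optional TCP dynamic-parameter data
IP_EXTRA = 8      # optional IP parameter data

full_record = BASE + TCP_EXTRA + IP_EXTRA    # 96 bytes per event

buffer_bytes = 4 * 1024 * 1024   # assumed 4 MB shared-memory buffer
events_per_s = 8300              # ~100 Mbps of 1500-byte packets

capacity = buffer_bytes // full_record
deadline = capacity / events_per_s   # seconds before user space must
                                     # drain the buffer to avoid loss
print(f"{capacity} records buffered, ~{deadline:.1f} s to drain")
```

This makes concrete why event loss depends on both the buffer size and the drain delay: at higher link speeds the event rate rises and the drain deadline shrinks proportionally.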
TICKET is like tcpdump on steroids, i.e. it allows gathering traces on Gbps links. It is much cheaper than commercial products such as NetScout ($26K for NetScout hardware vs. $2K for TICKET hardware). TICKET will capture at up to 2 Gbits/s vs. tcpdump at 300 Mbps (not sure if this depends on the speed of the cpu - Wu?). It also has ns timestamps, vs. tcpdump's ms timestamps and NetScout's 1 sec timestamps. Again, visualization tools need to be developed.
In an aside, Wu also mentioned a novel cluster design based on low-power Transmeta chips. Transmeta chips implement the Intel instruction set, but with far fewer transistors: they implement the common paths in hardware and the less likely paths in microcode. The net result is that a Transmeta chip draws 6W vs. 130W for an Itanium. Because of this one can put 240 processors in a regular single rack, with 3U-high boxes each containing 24 cpus on blades. The power drain is 4KW, which is equivalent to 10 Intel Itanium processor boards. Note that a 10 degree C increase in temperature doubles the failure rate. If one were to build an ASCI Red machine, the MFlops/Watt would be about 0.5 vs. 6.25 for the Transmeta implementation. A downside is that the Transmeta chips, though of equivalent MIPS/MHz, currently only go to 667MHz and will soon go to 990MHz. A second downside is the health of Transmeta: they are not doing well, their sales having dropped from $54M to $2M in the last report, though they have sufficient venture capital to last 2 years. If Intel continues on current trends, then by 2010 they will have 1B transistors on a chip at 1KW/sq cm, getting close to the Wattage/sq cm of a nuclear reactor.
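The rack arithmetic above checks out; all inputs below come from the notes, except the ~400 W per Itanium board, which is only implied by "4 KW is equivalent to 10 Itanium boards":

```python
# Sanity check of the rack-power figures quoted in the notes.  The
# ~400 W per Itanium board is inferred, not stated directly.

rack_cpus = 10 * 24        # ten 3U boxes of 24 blade cpus = 240 cpus
rack_watts = 4000          # quoted rack power drain

per_cpu = rack_watts / rack_cpus     # ~16.7 W per blade slot, vs. 6 W
                                     # for the bare Transmeta chip
itanium_boards = rack_watts / 400    # ~10 boards, as quoted

efficiency_ratio = 6.25 / 0.5        # quoted MFlops/Watt advantage
print(per_cpu, itanium_boards, efficiency_ratio)
```

The gap between 6 W per chip and ~16.7 W per blade slot presumably covers memory, disk, and fans, which the notes do not break out.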
The BW tool appears to compare reasonably well with iperf and bbcp for links with bandwidths below 100Mbps, no packet re-ordering, and low loss. The BW tools have much lower network impact than bbcp & iperf.
The CT tool appears to give qualitatively sensible results. It is low impact, can be used for continuous reporting, and does not require SNMP access to routers.
Traceroute measurements are being made from 5 sites. The traceroute measurement tool has been ported from VMS/DCL to Unix/perl and is in production at SLAC, running traceroutes to over 30 sites. Initial reporting is via drill-down tables. The Rice topology/tomography tool has been installed at SLAC. It identifies branch points, and thus will be both cleaner than traceroute and able to identify switch branch points. A new visualization tool is being built at SLAC for topology and tomography measurements from traceroute and the Rice topology/tomography tools. It enables a tree-graph display with links colored by ISP or RTT, and drill-down to detailed node information.
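As a rough illustration of what the tree display needs from the traceroute data, branch points can be pulled out of a set of hop lists by noting where paths from the one source diverge. The router names below are invented, and this is not the Rice algorithm, which can also infer switch-level branch points that traceroute cannot see:

```python
# Identify branch points in a set of traceroute paths from one source:
# any hop with more than one distinct next hop is a branch point.
# Router names are invented for illustration.
from collections import defaultdict

def branch_points(paths):
    """Return {hop: set of next hops} for hops with >1 child."""
    children = defaultdict(set)
    for hops in paths:
        for a, b in zip(hops, hops[1:]):
            children[a].add(b)
    return {n: c for n, c in children.items() if len(c) > 1}

paths = [
    ["slac-gw", "esnet-snv", "esnet-chi", "anl-gw"],
    ["slac-gw", "esnet-snv", "esnet-chi", "fnal-gw"],
    ["slac-gw", "esnet-snv", "caltech-gw"],
]
print(branch_points(paths))
```

The branch-point map is exactly the skeleton a tree-graph display colors by ISP or RTT; the Rice tool's contribution is finding such branch points without relying on every router answering traceroute.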
A major end product is to address one of the holy grails of the Grid and SciDAC applications, i.e. making applications network-aware. This includes knowing what windows/streams to use, what throughput to expect, and where to place and request data from (e.g. for replicas). All of the above need a cheap way to get available bandwidth (e.g. from an archive of measurements or from on-demand measurements).
Will the project enable a new generation of applications? Which application community is interested in implementing our tools? We need GridFTP and PPDG to say they are using the INCITE tools. What problem, unsolved today, is being addressed by this proposal? We must list all components of the proposal: what are the fundamental network issues being addressed, and what are the tools? Go through 5 slides that will explain to non-networkers what we are doing. We might push to the base program, push for new money, or push to NSF. We need to see how it could map onto an overlay network (e.g. wavelengths). We need a project report in the next few days, and a presentation (5 slides) for Thomas to present to his boss. How does INCITE differ from the other network projects (Web100, NetLogger, self-configuring networks, CAIDA, ...)? Identify users (network researchers, lay scientists, network operators), contributions to general network research, and how each group benefits.