Networking for Non-Networkers 2005

Held at the UK National eScience Center, Edinburgh

Rough notes on topics of interest to Les Cottrell

There were about 60 attendees, almost entirely from the UK. The attendees were interested in networks but were not network experts; typically they came from application support/development and system support/admin areas. The speakers were invited and came from the US and the UK.

TCP - Brian Tierney

See http://gridmon.dl.ac.uk/nfnn/slides/BrianTierneyNFNN2.pdf. ssh has its own buffer-size control that limits performance to about 1 Mbit/s.
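As a rough, back-of-envelope illustration (not from the talk) of why a fixed application-level buffer such as ssh's caps throughput on a long path no matter how the kernel's TCP is tuned, throughput is bounded by window/RTT; the 64 KiB and 4 MiB windows and the 150 ms RTT below are illustrative assumptions:

    # Throughput is limited to roughly window / round-trip time, whichever
    # buffer (application or kernel) is smallest.
    def max_throughput_mbps(window_bytes, rtt_seconds):
        return window_bytes * 8 / rtt_seconds / 1e6

    # Illustrative numbers only: a small fixed application window vs. a
    # tuned 4 MiB window over an assumed 150 ms path.
    for window in (64 * 1024, 4 * 1024 * 1024):
        print(f"{window // 1024} KiB window -> {max_throughput_mbps(window, 0.150):.0f} Mbit/s")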

Linux 2.6.12 (released Friday 6/17/05) has a way to turn off the ssthresh caching permanently (like web100), as opposed to just for the next transfer.
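A minimal sketch of doing this from a script, assuming the knob in question is the net.ipv4.tcp_no_metrics_save sysctl (which stops the kernel caching ssthresh/cwnd/rtt metrics when a connection closes); writing it via /proc requires root:

    # Assumption: the behaviour referred to is controlled by
    # net.ipv4.tcp_no_metrics_save. Must be run as root.
    def disable_tcp_metrics_cache():
        with open("/proc/sys/net/ipv4/tcp_no_metrics_save", "w") as f:
            f.write("1\n")

    if __name__ == "__main__":
        disable_tcp_metrics_cache()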

There is a bug (fixed in Linux 2.6.12) in the way Large Send Offload (LSO) is implemented.

LAN Issues - Sam Wilson, Edinburgh

See http://gridmon.dl.ac.uk/nfnn/slides/SamWilsonNFNN2.pdf. Nice discussion of half/full-duplex mismatch problems.
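One quick way to spot one side of such a mismatch on a Linux host is to check the negotiated duplex setting; a minimal sketch, assuming ethtool is installed and "eth0" is the interface of interest:

    import subprocess

    def duplex_setting(iface="eth0"):
        # Parse the "Duplex:" line from `ethtool <iface>` output.
        out = subprocess.run(["ethtool", iface], capture_output=True, text=True).stdout
        for line in out.splitlines():
            if line.strip().startswith("Duplex:"):
                return line.split(":", 1)[1].strip()
        return "unknown"

    if __name__ == "__main__":
        print("eth0 duplex:", duplex_setting("eth0"))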

Richard Hughes-Jones

See http://gridmon.dl.ac.uk/nfnn/slides/RichardHughesJonesNFNN2.pdf. Journaling can significantly degrade disk-write performance.

Network Jargon - Robin Tasker, Daresbury Lab

See http://gridmon.dl.ac.uk/nfnn/slides/RobinTaskerJustWhatisOC-192NFNN2.pdf. Very nice talk explaining many networking terms, including Ethernet, frames vs. packets, DSn, OCn, STM, SDH, DWDM, TDM, etc. Also discussed fiber types and twisted-pair copper. A nice anecdote was that 10 Gbps = 167 King James Bibles/second.
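The arithmetic behind the anecdote, for what it is worth; the ~7.5 MB per Bible is simply the plain-text size the figure implies, not a number from the talk:

    # 10 Gbps is 1.25 GB/s; dividing by an assumed ~7.5 MB per plain-text
    # Bible gives roughly 167 Bibles per second.
    link_bits_per_second = 10e9
    bible_bytes = 7.5e6
    print(link_bits_per_second / 8 / bible_bytes)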

Security and Performance - Paul Kummer, Daresbury

See http://gridmon.dl.ac.uk/nfnn/slides/PaulKummerNFNN2.pdf. Covered the problem (data rates, hacking, viruses and spam, the web), followed by comments and possible solutions. Even though today's backbones are lightly loaded (tens to a few hundred Mbits/s), the LHC service challenge shows that the requirement is for continuous Gbits/s from multiple sites. At the same time a typical CCLRC site sees 300 probes/second, and the uncompressed firewall log is about 5 GBytes/day. Once a hole is found, the compromise time is measured in seconds. Peer-to-peer file sharing can badly impact performance. The typical data rate for viruses and spam is 15 Kbits/s, so it does not impact overall network performance; overall mail traffic is 50 Kbits/s. A typical user averages about 2 Mbits/s over a day, but the traffic is very bursty. The scientific environment is hard to control: it needs openness, computers cannot be locked down without preventing useful work, and lots of protocols are in use. One can never get absolute security; the enemy is dynamic, so protection must constantly be kept up to date, currently measured in hours for viruses.

Firewalls impact performance (bits/s, sessions/s, total sessions), are expensive, and do not know about special applications such as multi-stream FTPs (e.g. GridFTP, bbftp). Security in depth is needed. An alternative might be ACLs, but they do not look at the data stream so will not handle FTP, and they do not handle Denial of Service attacks (firewalls can detect the increase in requests and then block things). Their advantages are that they come standard in routers/switches and run at line rate. One can separate high-speed FTPs in a controlled way through ACLs and use the firewall for lower-speed, more general applications.

Grid security is based on certificates. They imply a level of trust, but take no account of low-level attacks (e.g. buffer overruns). Grid design is not firewall friendly: it requires many ports to be opened, and web services may be worse, especially if run on port 80, since there is then no way to distinguish good traffic from bad. Web services will need certificate sending, checking, etc. to the remote end and to the CA, so one may want to cache certificates (but then has to worry about how long to cache versus revocation needs).

Diagnostic Steps - Les Cottrell, SLAC

See http://gridmon.dl.ac.uk/nfnn/slides/LesCottrellNFNN2.pdf.

What can you do with all this - Clive Davenhall, NESC

See http://gridmon.dl.ac.uk/nfnn/slides/CliveDavenhallNFNN2.pdf. The WFCAM (Wide Field Camera) astronomy project produces 200 GB/night. The data are transferred from Hawaii to Cambridge by tape/courier, then via JANET from Cambridge to Edinburgh. They get 200 GB in 5 hours, or about 12 MB/s, using multiple threads of SCP.
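A quick sanity check of the quoted rate (200 GB in 5 hours):

    # 200 GB in 5 hours works out to roughly 11 MB/s, consistent with the
    # quoted ~12 MB/s.
    gigabytes, hours = 200, 5
    print(gigabytes * 1000 / (hours * 3600), "MB/s")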

Grid Performance Workshop

UK Science Core Program - Tony Hey, moving to Microsoft

One of the foci is pervasive computing and networking. Scientific progress is needed to keep the country competitive. Major components: multidisciplinary working; a national information e-Infrastructure; access to capital infrastructure/large-scale facilities (e.g. ITER, the Diamond synchrotron light source, ISIS, LHC...). Key elements: networks, remote access, HPCx (DL), middleware... The EPSRC e-Science core programme is $18M over 3 years, and there will be $9M for more pilots. There will be a new figurehead for eScience in the UK. He hopes to get Microsoft more involved in Grid activities.

Inca - Shava Smallen, SDSC http://inca.sdsc.edu

This is a test harness and reporting system: can user X run application Y on Grid Z and access dataset N? Is it all compatible (are updates synchronized), is there enough space, etc.? Inca is a framework to enable automated testing, benchmarking and monitoring of Grid systems. Scripts (called reporters) output XML conforming to the Inca specification. The context of execution is recorded (what commands were run, at what time, with what result, on what machine, with what inputs). Reporters can be run repeatedly (e.g. every hour or every day) or one-shot (e.g. at boot time). In future they want to look for repeated errors to provide feedback on how a problem was fixed last time, and also to detect errors and notify people. They monitor the network with pathload and are looking at pathchirp. Typical common errors are services not being up and security problems.
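A rough sketch of what such a reporter might look like; the XML element names below are hypothetical, not the actual Inca schema:

    import socket, subprocess, time
    from xml.sax.saxutils import escape

    def run_reporter(command):
        # Run a check and emit XML recording the execution context and result.
        started = time.strftime("%Y-%m-%dT%H:%M:%S")
        proc = subprocess.run(command, shell=True, capture_output=True, text=True)
        print("<report>")
        print(f"  <host>{escape(socket.gethostname())}</host>")
        print(f"  <command>{escape(command)}</command>")
        print(f"  <start>{started}</start>")
        print(f"  <exitStatus>{proc.returncode}</exitStatus>")
        print(f"  <output>{escape(proc.stdout.strip())}</output>")
        print("</report>")

    if __name__ == "__main__":
        run_reporter("df -k /tmp | tail -1")  # e.g. a free-space check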

Hawkeye - Nick LeRoy, Wisconsin www.cs.wisc.edu/condor/glidein

Hawkeye is a monitoring system for the Grid, intended to detect and report problems. It can monitor system and I/O loads, runaway processes, and the health of a site. It is distributed, uses a push data model, is built on Condor, and is of stable production quality. It alerts when things go wrong, for virtually any problem found (e.g. a CVS lock held for, say, 20 minutes, or a checkpoint server disk filling up). It uses RRDtool to visualize what is going on. It currently monitors processes, CPU use, RAM, I/O, VM stats and disk use. It has Condor-specific modules, and one can develop one's own custom Hawkeye modules.
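A sketch of what a custom module might look like; the ClassAd-style "Attribute = Value" output format is an assumption here, not taken from the Hawkeye documentation:

    import shutil

    def report_disk(path="/"):
        # Sample disk usage and print attribute/value pairs for a collector.
        usage = shutil.disk_usage(path)
        pct_used = 100.0 * usage.used / usage.total
        print(f'DiskPath = "{path}"')
        print(f"DiskPercentUsed = {pct_used:.1f}")
        print(f"DiskNearlyFull = {str(pct_used > 95).upper()}")

    if __name__ == "__main__":
        report_disk("/")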

LCG - Dave Colling

The big question is whether the overall requirements for LCG (1000 users, 14K hosts, PB of data, 100 sites) are matched by any other applications. The concern is that it has been over-sold to the EU as a leading application that will help build infrastructures for multiple data-intensive sciences.

Network Measurement Tools - Matthew Allen, UCSB

Investigating how various tools compare. They want to compare measurements, may want to aggregate them, and want to understand the relationships between low-impact and heavier tools. They compared NWS, NTTCP, iperf and netperf measurements on the same network testbed (UCSB - LA) at the same times, looking at time series. The data are heavily self-correlated and non-stationary. They use the capture percentage to determine whether two time series are consistent with one another, and auto-correlation to see if they are telling you different things. Means and standard deviations are used to set confidence levels; the capture percentage is then the percentage of values falling within +-2 standard deviations, which indicates how one set of measurements compares with another. This uses statistics to provide heuristics that help a human decide how well the tools agree (i.e. are measuring the same thing). Next they look at correlation scatter plots, and then at auto-correlation (how related value 1 is to value 2, etc., and then for values 2, 3, 4... steps (lags) apart). Pair-wise differencing pairs each element from one measurement series with the element from the other series that is temporally closest to it; a new series is created from the differences, and one then checks whether it shows much auto-correlation.

The heuristics provide some information about the relationships between measurements. The techniques usually work, but not always: they provide insight into whether two time series contain the same information, but do not tell how the two series are related.
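A minimal sketch of the comparison heuristics described above, assuming two already-aligned NumPy arrays of throughput measurements from different tools (synthetic data stands in for the real iperf/NWS-style series):

    import numpy as np

    def capture_percentage(a, b):
        # Percentage of b's values falling within a's mean +/- 2 standard deviations.
        lo, hi = a.mean() - 2 * a.std(), a.mean() + 2 * a.std()
        return 100.0 * np.mean((b >= lo) & (b <= hi))

    def autocorrelation(x, lag):
        # Correlation of the series with itself shifted by `lag` steps.
        return np.corrcoef(x[:-lag], x[lag:])[0, 1]

    # Pair-wise differencing: here the two series are assumed already paired
    # one-to-one, rather than matched by closest timestamp as in the talk.
    rng = np.random.default_rng(0)
    a = 90 + rng.normal(0, 5, 200)
    b = 88 + rng.normal(0, 6, 200)
    diff = a - b
    print(capture_percentage(a, b), autocorrelation(a, 1), autocorrelation(diff, 1))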

Grid Futures Discussion

Problems with the Grid: point solutions to applications; still hard to use; not reproducible or scalable; the technical difficulties were underestimated; not enough of a scientific discipline; oversold.

It is unclear to the computing folks what the applications folks want, and what threshold of pain they are willing to endure to use the services. Computing folks love addressing complexity, but applications folks want simplicity.

The Grid is a fault-rich environment, with missing or immature Grid services.

Applications people cannot think a long way into the future; they focus on immediate needs. It is hard to get applications folks to come to meetings such as this one, since they cannot see the benefits; one needs to show how the tools have improved the performance of applications. What is the killer application for the Grid? Is it an infrastructure in need of an audience? How should tool developers get their information exposed at developers' meetings, etc.? Many of the tools wanted are typically very simple: good, reliable file transfer; a global/Grid "top" command; red lights/green lights. The ideas of resource brokering, replica location and meta-catalogs are still a long way into the future for most applications users. There is a disconnect between the research and papers on the one hand and their application to middleware, applications and users on the other. There are many layers from paper, to implementation, to installable toolkit, to integration with middleware, to use in an application, to applause from the scientist user, and there is often no funding to progress through these layers.

Where should funding be applied, and are there any funds? Maybe we need to prepare a survey of Grid applications users on what they want with respect to the tools.

We need closer interaction with users, so that the tools developed are closely related to users' perceived needs. One needs a survey of Grid tools to see where and whether they apply to a given problem. There are some such lists for network measurement tools, e.g. http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html and http://www.caida.org/tools/. We may need to start standardizing and synthesizing things.

We still do not have a single standard resource broker, and the user does not want to try to compare all the possibilities. GGF does not do implementations. Is GGF too big? Does one need sub-group meetings?

There will be a white paper from this meeting. We need to track down the funding issues and need collaborative projects to make further progress. What should be done about next year's meeting? We really need input from applications folks, but it is hard to get their attention even to attend a meeting like this one. Demonstrations of applications, tools and infrastructures would be useful. Exemplars of applications using toolkits would help encourage such work; there could also be a peer-reviewed special issue of a journal of Grid computing. There are monthly newsletters from NESC, SDSC, ScienceGrid, etc., which could perhaps carry features on applications and toolkits. This could help promote the field and its interactions, and help get funding agencies' attention. The last three meetings have been in the UK; should we move to the US? We need somewhere that is easy to get to.