> The LSN committee (DOE, NSF, DARPA, NASA, NSA, NIST, etc.), the federal
> agencies that fund network research in the US, has requested from the
> research community a comprehensive presentation on the state of
> network measurement and analysis in the US. The presentation will
> enable the committee to identify the R&D gaps and large-scale
> deployment issues. Could the four of you put together an hour
> presentation during the next LSN meeting at NSF on 4/8/03? It could be
> a single presentation consisting of slides collected from the four of
> you and presented by one person at NSF. Your contribution to this
> effort will be invaluable in defining new R&D directions in the
> next several years.

Where in measurement/monitoring should the federal government invest dollars to improve the Internet?

The network is transparent by design, so it is hard to find out how it is working. The GGF Grid High Performance Network group is trying to bring together networkers, application writers, and users by creating documents such as “Top ten things network engineers wish grid programmers knew” and vice versa. http://www.csm.ornl.gov/ghpn/

Understanding is hard: the Internet is an immense, moving target; traditional mathematical tools (e.g. Poisson distributions) don’t work; we are still looking for invariants and need parsimonious models. See Vern Paxson’s work, e.g. http://www.icir.org/vern/talks/vp-painfully-hard.UCB-mig.99.ps.gz

The top three networking problems, according to a paper by Claudia DeLuna of JPL, are Ethernet duplex, host configuration, and bad media.

A breakdown of failure causes for 3 Internet sites indicated that 51% were caused by operator error. “Self Repairing Computers”, Scientific American, June 2003

Reviewing the user-reported, long-lasting WAN problems that SLAC experienced over the last two years (typically lasting days, i.e. not including router reboots or time out for reconfiguration), the biggest contributors (30%) were a combination of misconfigured routers (loose unicast RPF filters, wrong buffer size, poorly chosen backup route), misconfigured switches (needed reboot, PVC incorrectly rate limited), and firewalls (limited throughput, reset window scaling option). Note that these are mainly engineering problems or bugs, as opposed to problems that need research before we know how to fix each one individually. However, we do need to investigate how to accurately and automatically identify and report the location and cause of such problems for the end-user.

Consider that a critical and immediate network communications need comes up, possibly spanning many service providers and possibly in a life-threatening situation, but the network behaves much worse than at any other time over the last year. What are you able to do about it? Hans Werner Braun

Grids are a natural resource for disaster response, e.g. calculating the plume trace on a nuclear reactor after a terrorist event. However, connectivity to a major Grid site may be lost due to, for example, misconfigured routers (malicious or inadvertent), and critical time may be lost through the inability to spot or diagnose the problem.

An open issue related to network management is helping users and others to identify and resolve problems in the network. If a user can't access a web page, it would be useful if the user could find out, easily, without having to run ping and traceroute, whether the problem was that the web server was down, that the network was partitioned due to a link failure, that there was heavy congestion along the path, that the DNS name couldn't be resolved, that the firewall prohibited the access, or something else. We encourage work on application of artificial intelligence (AI) or expert system techniques to network management systems.
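The kind of user-facing diagnosis described above can be sketched, very roughly, as a sequence of checks that each rule out one cause. This is only an illustration, not an existing tool: the function name `diagnose`, the wording of the verdicts, and the idea of passing in the resolver and connect functions are all assumptions made for the example. It cannot distinguish congestion from a firewall drop or a path failure, which is exactly where better measurement infrastructure or expert-system techniques would be needed.

```python
import socket
from urllib.parse import urlsplit


def diagnose(url, resolve=socket.getaddrinfo,
             connect=socket.create_connection, timeout=5.0):
    """Return a rough, human-readable guess at why `url` is unreachable.

    `resolve` and `connect` default to the standard socket functions; they
    are parameters (an assumption of this sketch) so they can be replaced.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return "invalid URL"
    port = parts.port or (443 if parts.scheme == "https" else 80)

    # Step 1: can the name be resolved at all?
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "DNS name could not be resolved"

    # Step 2: can a TCP connection be opened to the first address?
    address = infos[0][4][:2]  # (ip, port) part of the sockaddr tuple
    try:
        conn = connect(address, timeout=timeout)
    except ConnectionRefusedError:
        return "host reachable, but the server is not accepting connections"
    except socket.timeout:
        return "no response: possible firewall drop, congestion, or path failure"
    except OSError:
        return "network error: possible link failure or unreachable network"

    conn.close()
    return "server reachable at the TCP level"
```

A call such as `diagnose("http://www.example.org/")` would run the checks against the live network; the verdict strings are deliberately vague where the probe itself cannot tell the causes apart.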

The end-to-end principle is the core architectural principle of the Internet. … it addresses concerns of maintaining openness, increasing reliability and robustness, and preserving properties of user choice and ease of new service deployment (harder to change core than end nodes, e.g. difficulty of rolling out multicast in public service providers; Requiring someone with a new idea for a service to convince a bunch of ISPs to modify their networks is much more difficult than simply putting up a web page with some downloadable software implementing the service). Draft-iab-e2e-futures.txt

Monitoring data is not just for end-to-end performance analysis

Lots of Grid “middleware services” need monitoring data too:

Grid Schedulers

find the best match of CPUs and data sets for a given job

Grid Replica Selection

find the “best” copy of a data set to use

Reliable File Copy Service

detect failures and recover

Network-aware Applications

TCP buffer size tuning, number of parallel streams, etc.
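As an illustration of why network-aware applications need monitoring data, the usual starting point for TCP buffer size tuning is the bandwidth-delay product (bandwidth times round-trip time). The sketch below assumes the bandwidth and RTT come from some measurement service; the function names are hypothetical, and splitting the buffer evenly across parallel streams is a simplifying assumption rather than a recommendation.

```python
def tcp_buffer_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product in bytes: the data in flight on a full pipe."""
    if bandwidth_bps <= 0 or rtt_seconds <= 0:
        raise ValueError("bandwidth and RTT must be positive")
    return round(bandwidth_bps * rtt_seconds / 8)  # bits -> bytes


def per_stream_buffer_bytes(bandwidth_bps, rtt_seconds, streams):
    """Buffer per stream if the product is shared evenly by parallel streams.

    The even split is an assumption of this sketch.
    """
    if streams < 1:
        raise ValueError("need at least one stream")
    return tcp_buffer_bytes(bandwidth_bps, rtt_seconds) // streams


if __name__ == "__main__":
    # Example figures (made up): a 1 Gbit/s path with an 80 ms round trip.
    print(tcp_buffer_bytes(1_000_000_000, 0.08))
    print(per_stream_buffer_bytes(1_000_000_000, 0.08, 4))
```

Without a measured RTT and an estimate of the achievable bandwidth, the application has nothing to feed into this calculation, which is the point of the list above.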

Many of these components already exist or are in progress:

instrumentation tools

Pablo (UIUC), NetLogger (LBNL), Magnet (LANL), ARM (Open Group), log4j (apache), web100, etc.

host and network sensors

too many to list

sensor management tools

Ganglia, Nagios, NetLogger

event publication service

CIM (DMTF), MDS (Globus), NWS (UCSB), R-GMA (RAL), CODE (NASA), pyGMA (LBNL)

event archive service

netarchd (LBNL), NWS (UCSB)

event analysis and visualization tools

lots, but most only work for specific types of events:

NetLogger nlv (LBNL), Probe (Stazi), Autopilot (UIUC), etc.

BUT, all use different event formats and protocols!

no interoperability

Many of these tools still in the “early prototype” stage

Need few false positives

Key reasons for Grid failure are: 1. security, 2. problem diagnosis

US scientific leadership is threatened if the US cannot maintain leadership in networking. The US used to enjoy the best connectivity; now Europe has recognized the need for excellent scientific networking and has caught up with, and in some cases surpassed, the US.

There are jobs that industry must tackle and that are rightfully its responsibility, e.g. more reliable routers/switches, easier and less error-prone configuration, better quality control of software, undo functions, non-intrusive system/controller upgrades, controlled access to selected router information, and self-aware/autonomic (self-healing) networks.

Recovery Oriented Networking (RON): Develop consensus and working code for acceptable performance metrics such as MTBF, MTTR (focus on evaluating recovery time), causes of problems. Then develop suites to inject common (including human, software, and hardware) errors and measure robustness and recovery time – like what Whetstone benchmarks did for processors
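To make the metrics concrete, here is a minimal sketch of how MTBF, MTTR, and availability might be computed from a log of outages. The `Outage` record and the `recovery_metrics` function are hypothetical names for this example; it uses the common definitions MTTR = total downtime / number of failures and MTBF = total uptime / number of failures, and assumes outages do not overlap and fall inside the observation window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    start: float  # seconds since the start of the observation window
    end: float    # seconds since the start of the observation window


def recovery_metrics(outages, observation_seconds):
    """Return MTBF, MTTR (seconds) and availability for one observed path."""
    if not outages:
        raise ValueError("no failures observed; MTBF and MTTR are undefined")
    downtime = sum(o.end - o.start for o in outages)
    uptime = observation_seconds - downtime
    failures = len(outages)
    mttr = downtime / failures  # mean time to repair
    mtbf = uptime / failures    # mean (up)time between failures
    return {
        "mtbf": mtbf,
        "mttr": mttr,
        # Equivalent to mtbf / (mtbf + mttr) under these definitions.
        "availability": uptime / observation_seconds,
    }


if __name__ == "__main__":
    # Made-up log: two outages in a 1000-second window.
    log = [Outage(100, 200), Outage(500, 800)]
    print(recovery_metrics(log, 1000))
```

A fault-injection suite of the kind proposed above would produce the outage log; the arithmetic itself is the easy part, and agreeing on what counts as an outage is where the consensus is needed.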

The Internet had early success in network device monitoring with the Simple Network Management Protocol (SNMP) and its associated Management Information Bases (MIBs). There has been comparatively less success in managing networks, in contrast to the hierarchical monitoring of individual devices. Unfortunately, network management research has historically been very under-funded, because it is difficult to get funding bodies to recognize this as legitimate networking research.

Sally Floyd: http://www.ietf.org/internet-drafts/draft-iab-research-funding-00.txt

There are many email lists and unfunded groups in this arena, e.g.

The IRTF IMRG and e2e groups, the IETF IPPM group, the GGF NMWG, Glue-schema, and GHPN groups, plus European-based groups

Plus less monitoring focused groups such as Grid groups (GGF, PPDG, GriPhyN), NANOG, CENIC, ESCC, IETF, ITU, Internet2, SciDAC funded projects, groups from other countries such as DataTAG, GEANT, APAN

Example of Grid file transfer: a Grid Scheduler determines that a given file needs to be copied to site A before a job can be run. Several copies of this file are in a Grid Replica Catalogue, so there is a choice of where to copy the file from. The Grid Scheduler needs to determine the optimal method to create this new file copy and to estimate how long the file creation will take. To make this selection, the scheduler must determine the best source (or sources) to copy the data from. Selecting the best source requires a prediction of future end-to-end path characteristics between the destination and each possible source. Accurate prediction of the performance obtainable from each source requires measurement of available bandwidth (both end-to-end and hop-by-hop), latency, loss, and other characteristics important to transfer performance. Profile for Network Performance Measurements for Grids, GGF NMWG group.
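The source-selection step in this example can be sketched as follows. Everything here is an assumption made for illustration: the `PathEstimate` record, the host names, the measurement values, and especially the crude cost model (one round trip of setup plus file size over loss-discounted available bandwidth). A real scheduler would use a proper prediction service and a better TCP model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathEstimate:
    source: str                     # replica host (hypothetical names below)
    available_bandwidth_bps: float  # measured end-to-end available bandwidth
    rtt_seconds: float              # measured round-trip latency
    loss_rate: float                # measured packet loss, 0.0 to 1.0


def estimated_transfer_seconds(file_bytes, path):
    """Crude estimate: one RTT of setup plus size / effective bandwidth."""
    effective_bps = path.available_bandwidth_bps * (1.0 - path.loss_rate)
    if effective_bps <= 0:
        return float("inf")  # path is unusable
    return path.rtt_seconds + file_bytes * 8 / effective_bps


def choose_source(file_bytes, paths):
    """Pick the replica with the smallest estimated transfer time."""
    if not paths:
        raise ValueError("no replicas available")
    return min(paths, key=lambda p: estimated_transfer_seconds(file_bytes, p))


if __name__ == "__main__":
    # Made-up measurements for two replicas of a 1 GB file.
    replicas = [
        PathEstimate("replica-a.example.org", 100e6, 0.02, 0.0),
        PathEstimate("replica-b.example.org", 1e9, 0.15, 0.0),
    ]
    best = choose_source(10**9, replicas)
    print(best.source, estimated_transfer_seconds(10**9, best))
```

Note that the nearer replica (lower latency) is not necessarily the better one for a large file, which is why the scheduler needs bandwidth and loss measurements and not just ping times.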