1
|
- Les Cottrell, KC Claffy, Brian Tierney,
Ronn Ritke, Hans-Werner Braun
- Prepared for the LSN meeting at NSF Washington 6/10/03
|
2
|
- Goal of this network monitoring & analysis talk:
- identify the R&D gaps and large-scale deployment issues for DOE,
NSF, DARPA, NASA, NSA, NIST, etc. – the federal agencies that fund network research
in the US
- Two complementary presentations
- High performance networking measurement needs for Science (E2E) – Les
- Consumer grade & net-centric measurement needs – kc
- Science network measurement needs
- The end-to-end challenge, illustrations
- Solution
- End to end Monitoring Goals
- Current issues
- Problem analysis, measurement infrastructure, analysis tools,
standards, collaborations
- Benefits to Science
- Consequences of not addressing issues
- Why not leave to industry
- Appendix
- What is being done today
- Who is measuring?
- Who is using the measurements?
- What is being measured?
- What tools are being used?
|
3
|
- Distributed systems are very hard
- “A distributed system is one in which I can't get my work done because a
computer I've never heard of has failed.” (Leslie Lamport)
- When building distributed systems, we often observe unexpectedly low
performance
- the reasons for which are usually not obvious
- The bottlenecks can be in any of the following components:
- the applications
- the operating systems
- the disks, network adapters, bus, memory, etc. on either the sending or
receiving host
- the network switches and routers, and so on
- Problems may not be logical
- Most problems are operator errors, configurations, bugs
|
4
|
|
5
|
- I’ve lost my connection
- Despite over-provisioned networks, users cannot get the throughput they expect
- What should I expect the performance to be?
- It sometimes works …
- What am I, as a scientist, supposed to do?
- Need tools/measurements to detect problems, identify location, cause and
time of occurrence
|
6
|
- A complete End-to-End monitoring framework that includes the following
components:
- instrumentation tools (application, middleware, and OS monitoring)
- host and network sensors (host and network monitoring)
- sensor management / activation tools
- event publication service
- event archive service
- event analysis and visualization tools
- a common set of protocols for describing, exchanging, and locating
monitoring data
- Needed for applications (e.g. Grid middleware), diagnosis, and performance
analysis
- toolkit for streamlined problem diagnosis: detection, location,
isolation & reporting
- glue to multiple sources of information, traceroute archives, router
info, delay/loss archives, on-demand tests, baselines
- analysis and heuristics
- E2Epi is working on a solution, but is funded only for coordination, not for
all the underlying work
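
The "baselines" and "analysis and heuristics" items above can be illustrated with a minimal Python sketch. It assumes throughput measurements are already available as plain lists of Mbit/s values. The function name `detect_drops` and the 50% drop threshold are hypothetical choices for illustration, not part of any tool named in this talk.

```python
from statistics import median


def detect_drops(history_mbps, recent_mbps, drop_fraction=0.5):
    """Flag recent throughput samples that fall well below the baseline.

    The baseline is the median of the historical samples. A recent sample
    is flagged when it is more than `drop_fraction` below that baseline.
    Returns a list of (index, value) pairs for the flagged samples.
    """
    if not history_mbps:
        raise ValueError("need at least one historical sample")
    baseline = median(history_mbps)
    threshold = baseline * (1 - drop_fraction)
    return [(i, v) for i, v in enumerate(recent_mbps) if v < threshold]


# Made-up throughput samples in Mbit/s, for illustration only.
history = [90, 95, 100, 92, 98]
recent = [94, 40, 96, 30]
flagged = detect_drops(history, recent)
print(flagged)
```

A real toolkit would also have to account for time-of-day effects and measurement noise; the median is used here only because it is robust to a few outliers in the history.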
|
7
|
- E2E performance has to be solved: it is THE critical metric for the user,
and it is not just a backbone bandwidth problem
- Improve end-to-end data throughput for data intensive applications in
high-speed WAN environments
- Provide the ability to do performance analysis and fault detection in a
Grid computing environment
- Provide accurate, detailed, and adaptive monitoring of all
distributed computing components, including the network
- Unfortunately, network management research has historically been very
under-funded, because it is difficult to get funding bodies to recognize
this as legitimate networking research (source: IAB Concerns &
Recommendations Regarding Internet Research & Evolution)
|
8
|
- Cultivate systematic studies of problems: their causes, how to discover
them, how to report them, and how to by-pass them
- Analyze which problems are the most important, and see
how they are tackled manually today
- Decide which problems are most cost-effective to address by
developing tools to assist in diagnosis
|
9
|
- Need to build infrastructure to support troubleshooting:
- Requires repetitive and on-demand measurements with appropriate
security model.
- Provide recommended/accepted set of tools for delay, RTT, loss, route
tracking, "bandwidth" estimation.
- Include archiving and access to data, analysis and reporting of
repetitive data.
- Allow for evaluation, validation and comparison of new measurement
tools, TCP stacks, applications (e.g. file transfer).
- Reverse traceroute, looking glass, remote tcpdump (e.g. SCNM), remote
testing of connection (ANL NDT)
- Traceroute archives
- Make tools easier to comprehend and use by scientists
- Encourage efforts such as Internet2 E2Epi to provide
measurements inside the cloud
- Extend to ESnet & other NRNs, and beyond
- Fund collaboration across boundaries
- Ubiquitous coverage (require multiple toolkits): Inter agency,
international, hi-speed, digital divide, long term and current
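
As an illustration of how a traceroute archive supports troubleshooting, the following minimal Python sketch compares an archived baseline path with a current path and reports the first hop at which they differ. It assumes each path has already been reduced to an ordered list of router names. The function name and the example router names are hypothetical.

```python
from itertools import zip_longest


def first_route_change(baseline, current):
    """Return (hop_number, old_router, new_router) for the first hop that
    differs between two paths, or None if the paths are identical.

    Hop numbers start at 1. If one path is shorter, the missing hop is
    reported as None.
    """
    pairs = zip_longest(baseline, current)
    for hop, (old, new) in enumerate(pairs, start=1):
        if old != new:
            return hop, old, new
    return None


# Hypothetical archived paths; the router names are made up.
baseline = ["gw.site-a.example", "core1.net-x.example",
            "core2.net-x.example", "gw.site-b.example"]
current = ["gw.site-a.example", "core1.net-x.example",
           "core9.net-y.example", "gw.site-b.example"]
print(first_route_change(baseline, current))
```

Pairing the reported hop with the time of the archived measurement gives the location and approximate time of a route change, which can then be correlated with delay or loss archives.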
|
10
|
- Provide measurement tools to accurately & quickly identify
performance problems,
- to automatically take action to investigate and provide information
for:
- Scientist
- Grid support “NOC”
- Network administrator or network person
- Promote well understood, accepted metrics for customers for realistic,
enforceable SLAs,
- provide acceptable limits,
- provide tools to track
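
A minimal Python sketch of tracking measurements against SLA limits is given here, assuming each sample is a (timestamp, RTT in ms, loss in percent) tuple. The class `SlaLimits`, the function `sla_violations`, and the numeric limits are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SlaLimits:
    """Acceptable limits agreed with the provider (illustrative fields)."""
    max_rtt_ms: float
    max_loss_pct: float


def sla_violations(limits, samples):
    """Return (timestamp, metric, value) for every sample over a limit.

    `samples` is an iterable of (timestamp, rtt_ms, loss_pct) tuples.
    """
    violations = []
    for ts, rtt_ms, loss_pct in samples:
        if rtt_ms > limits.max_rtt_ms:
            violations.append((ts, "rtt_ms", rtt_ms))
        if loss_pct > limits.max_loss_pct:
            violations.append((ts, "loss_pct", loss_pct))
    return violations


# Made-up limits and samples.
limits = SlaLimits(max_rtt_ms=150.0, max_loss_pct=1.0)
samples = [
    ("2003-06-10T10:00", 120.0, 0.0),
    ("2003-06-10T10:30", 180.0, 0.5),
    ("2003-06-10T11:00", 140.0, 2.5),
]
for violation in sla_violations(limits, samples):
    print(violation)
```

The point of the sketch is that an SLA becomes enforceable only when the metric, the limit, and the measurement record are all explicit and agreed on.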
|
11
|
- All the above requires:
- easy to use standard ways (e.g. web services) for applications to access
data from existing and new monitoring projects.
- standard naming conventions and schemas.
- This will provide the ability to share information from multiple
measurement infrastructure projects
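
To show what a shared schema buys, here is a minimal Python sketch of a common measurement record that can be serialized to JSON and read back by another project. The field names are hypothetical and are not taken from any existing schema. The sketch only illustrates that agreeing on names and units is what makes data from different infrastructures exchangeable.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Measurement:
    """One measurement result with illustrative, agreed-upon field names."""
    source: str
    destination: str
    metric: str
    value: float
    unit: str
    timestamp: str
    tool: str


def to_json(measurement):
    """Serialize a record to JSON with a stable key order."""
    return json.dumps(asdict(measurement), sort_keys=True)


def from_json(text):
    """Rebuild a record from its JSON form."""
    return Measurement(**json.loads(text))


# Made-up record; host names are placeholders.
record = Measurement(
    source="host-a.example",
    destination="host-b.example",
    metric="rtt",
    value=172.4,
    unit="ms",
    timestamp="2003-06-10T14:00:00Z",
    tool="ping",
)
encoded = to_json(record)
print(encoded)
```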
|
12
|
- Need to build multi-disciplinary teams (incent orthogonal groups to work
with one another):
- include people close to eventual customers (scientists, operational
folks)
- to ensure what is developed is useful, tested out in realistic
environments
- include vendors and providers in funded projects to bridge the gaps
- E2Epi is funded to provide coordination
- Multi agency funding!
- This is not a problem a single agency can address
- Science applications cross multi-agency networks, but there are barriers to
interagency network monitoring collaborations
|
13
|
- Network reaches its potential
- enable new ways of doing science:
- data intensive science (astrophysics, global weather, seismology,
medicine),
- remote instrument control (SNS, fusion (ITER), surgery),
- remote visualization/insight (Terascale supernova, climate modeling),
- world-wide collaboration enabling (LHC, ITER)
- enables scientists to do science
- Wizard gap closure, not fighting the network, network becomes a
catalyst
- Without good troubleshooting capabilities, the Grid vision will fail
- Predictability, planning, expectations, raising the bar
|
14
|
- Data continues to ship inefficiently by truck/plane (FedEx)
- Long delays (2 weeks), degraded collaboration, US scientists continue
to lose leadership
- Increased costs (manpower costs, lack of automation)
- Inadequate reliability or performance for new applications, (e.g. Grid
fails to reach its potential)
- New capabilities do not emerge in US:
- remote instrument control, real-time video, media distribution…
- US science loses leadership to Japan, Europe, Canada
|
15
|
- Industry won’t do it (“it’s not my problem”):
- Has its interest and hands full elsewhere
- It’s hard, does not sell products, little Return on Investment
- Historically poor record, competitive concerns
- Management features are late in product development cycle
- Early success with SNMP and Netflow
- Commercial Network Management Platforms (e.g. OpenView, Tivoli) have had
limited success (network oriented, not user oriented) and are not cost effective
- ISPs only measure own nets, not E2E, SLA guarantees are not
cross-provider
|
16
|
- Some Measurement Infrastructures:
- CAIDA list: www.caida.org/analysis/performance/measinfra/
- AMP: amp.nlanr.net/, PMA: pma.nlanr.net
- IEPM/PingER home site: www-iepm.slac.stanford.edu/
- IEPM-BW site: www-iepm.slac.stanford.edu/bw
- NIMI: ncne.nlanr.net/nimi/
- RIPE: www.ripe.net/test-traffic/
- NWS: nws.cs.ucsb.edu/
- Internet2 PiPES: e2epi.internet2.edu/
- Tools
- CAIDA measurement taxonomy: www.caida.org/tools/
- SLAC Network Tools: www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
- Internet research needs:
- www.ietf.org/internet-drafts/draft-iab-research-funding-00.txt
|
17
|
|
18
|
- CAIDA (skitter, macroscopic …)
- NLANR (e.g. AMP – active, PMA – passive)
- LBL (e.g. netest)
- SLAC/FNAL (e.g. PingER, IEPM-BW)
- PSC (NIMI)
- RICE (INCITE)
- Europe: RIPE (European ISPs), PPMCG
- NWS
- Internet2 (PiPES, IETF/IPPM, Netflow)
- Sprint, ATT Research
- Commercial (Keynote, Matrix, internetweather…)
- For more see www.caida.org/analysis/performance/measinfra
|
19
|
- Users
- “Why is the performance not what I would like or expect?”
- Set expectations, build case to complain to ISP
- What should I expect, what applications are likely to work
- Planners: observe growth, decide when upgrades are needed, make cases
for upgrades
- Network engineers: pin-point problem, provide information to providers
- Providers: “where is the problem and what is it”, best bang for the buck
- Grid application users/developers look forward to using the measurements,
- e.g. Grid Resource Broker data placement
- Requires APIs (e.g. web services), common naming conventions (e.g.
NMWG, GLUE schema …) etc.
- Security: anomalies
- Researchers: modeling, theory testing, scaling laws
|
20
|
|
21
|
- Delays, RTT, loss, jitter, availability
- “Bandwidth” estimation
- TCP & UDP throughputs
- Packet pair techniques
- Packet length techniques (pchar …)
- Topology /tomography, routing
- Utilization, errors
- Security
- Evaluation of new protocols
- Applications (many commercial packages)
- One off: traffic characterization at borders and IXPs
- This is the exception: providers do not make information public
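
The packet pair technique listed above rests on one relation: two packets sent back to back are spread apart by the bottleneck link, so capacity is roughly packet size divided by the arrival gap (dispersion). A minimal Python sketch under that assumption follows. The function name is hypothetical, and the median is a simple stand-in for the filtering of cross-traffic effects that real tools perform.

```python
from statistics import median


def packet_pair_capacity_mbps(packet_size_bytes, dispersions_s):
    """Estimate bottleneck capacity in Mbit/s from packet pair dispersions.

    `dispersions_s` holds the arrival gaps, in seconds, measured for
    several back-to-back pairs of packets of `packet_size_bytes` each.
    """
    if not dispersions_s:
        raise ValueError("need at least one dispersion sample")
    gap = median(dispersions_s)
    if gap <= 0:
        raise ValueError("dispersion must be positive")
    return packet_size_bytes * 8 / gap / 1e6


# Made-up gaps for 1500-byte packets. The third pair was delayed by
# cross traffic; the median ignores it.
gaps = [0.00012, 0.00012, 0.00030]
estimate = packet_pair_capacity_mbps(1500, gaps)
print(f"{estimate:.1f} Mbit/s")
```

With these illustrative numbers the estimate is about 100 Mbit/s, since 12,000 bits arriving 120 microseconds apart correspond to that rate.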
|
22
|
- Delays etc.: ping, OWAMP, GPS
- “Bandwidth”: iperf, pathload, pipechar, netest, ABwE
- Utilization: SNMP
- Topology/tomography: traceroute, skitter, INCITE
- Routing: RIPE, routeviews
- Traffic characterization: netflow, NeTraMet, tcpdump, coralreef
- Visualization: MRTG, RRD, netgeo, geoplot, tcptrace, xplot
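
Whatever tool collects them, delay probes reduce to a few summary metrics. The minimal Python sketch below assumes a list of RTT samples in milliseconds, with `None` marking a lost probe. The function name is hypothetical, and jitter is taken here as the mean absolute difference between consecutive received RTTs, which is one of several definitions in use.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize probe results as loss, minimum RTT, average RTT, jitter.

    `rtts_ms` is a list of RTTs in milliseconds; None marks a lost probe.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe")
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    if not received:
        return {"loss_pct": loss_pct, "min_ms": None,
                "avg_ms": None, "jitter_ms": None}
    # Differences between consecutive probes that were received.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "loss_pct": loss_pct,
        "min_ms": min(received),
        "avg_ms": mean(received),
        "jitter_ms": mean(diffs) if diffs else 0.0,
    }


# Made-up probe results: the third probe was lost.
summary = summarize_probes([10.0, 12.0, None, 11.0])
print(summary)
```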
|