State of Network Monitoring and Analysis in the US

Les Cottrell, KC Claffy, Brian Tierney, Ronn Ritke, Hans-Werner Braun

Prepared for the LSN meeting at NSF, Washington, 6/10/03

Outline

Goal of the network monitoring & analysis talk:
  identify the R&D gaps and large-scale deployment issues for DOE, NSF, DARPA, NASA, NSA, NIST, etc., the federal agencies that fund network research in the US
Two complementary presentations:
  High-performance networking measurement needs for science (E2E): Les
  Consumer-grade & net-centric measurement needs: kc
Science network measurement needs
  The end-to-end challenge, illustrations
  Solution
  End-to-end monitoring goals
  Current issues
    Problem analysis, measurement infrastructure, analysis tools, standards, collaborations
  Benefits to science
  Consequences of not addressing issues
  Why not leave to industry
Appendix
  What is being done today
  Who is measuring?
  Who is using the measurements?
  What is being measured?
  What tools are being used?

The Problem

Distributed systems are very hard
  "A distributed system is one in which I can't get my work done because a computer I've never heard of has failed." (Butler Lampson)
When building distributed systems, we often observe unexpectedly low performance
  the reasons for which are usually not obvious
The bottlenecks can be in any of the following components:
  the applications
  the operating systems
  the disks, network adapters, bus, memory, etc. on either the sending or receiving host
  the network switches and routers, and so on
Problems may not be logical
  Most problems are operator errors, configurations, bugs

Anatomy of a Problem

Problem examples: Help, it’s not working

I’ve lost my connection
Despite over-provisioned networks, the user cannot get the throughput expected
  Wizard gap
What should I expect the performance to be?
It sometimes works …
What am I, as a scientist, supposed to do?
Need tools/measurements to detect problems and identify their location, cause and time of occurrence

The Solution

A complete end-to-end monitoring framework that includes the following components:
  instrumentation tools (application, middleware, and OS monitoring)
  host and network sensors (host and network monitoring)
  sensor management / activation tools
  event publication service
  event archive service
  event analysis and visualization tools
  a common set of protocols for describing, exchanging, and locating monitoring data
Needed for applications (e.g. Grid middleware), diagnosis and performance analysis:
  toolkit for streamlined problem diagnosis: detection, location, isolation & reporting
  glue to multiple sources of information: traceroute archives, router info, delay/loss archives, on-demand tests, baselines
  analysis and heuristics
E2Epi is working on a solution, but is funded only for coordination, not for all the underlying work

End-2-End Monitoring Goals

Have to solve E2E performance: it is THE critical metric for the user, not just a backbone bandwidth problem
Improve end-to-end data throughput for data-intensive applications in high-speed WAN environments
Provide the ability to do performance analysis and fault detection in a Grid computing environment
Provide accurate, detailed, and adaptive monitoring of all distributed computing components, including the network

Unfortunately, network management research has historically been very under-funded, because it is difficult to get funding bodies to recognize this as legitimate networking research (IAB, Concerns & Recommendations Regarding Internet Research & Evolution)

Current Issues 1: Problem Analysis

Cultivate systematic studies of problems: causes, how to discover, how to report, how to by-pass
Analysis to help decide which are the most important problems; see how they are tackled manually today
Decide for which problems it is most cost-effective to develop tools to assist in diagnosis

Current Issues 2: Measurement Infrastructures

Need to build infrastructure to support troubleshooting:
  Requires repetitive and on-demand measurements with an appropriate security model.
  Provide a recommended/accepted set of tools for delay, RTT, loss, route tracking, "bandwidth" estimation.
  Include archiving and access to data, analysis and reporting of repetitive data.
  Allow for evaluation, validation and comparison of new measurement tools, TCP stacks, applications (e.g. file transfer).
  Reverse traceroute, looking glass, remote tcpdump (e.g. SCNM), remote testing of connection (ANL NDT)
  Traceroute archives
Make tools easier for scientists to comprehend and use
Encourage efforts such as Internet2 E2Epi to provide measurements inside the cloud
  Extend to ESnet & other NRNs, and beyond
Fund collaboration across boundaries
Ubiquitous coverage (requires multiple toolkits): inter-agency, international, hi-speed, digital divide, long term and current

Current Issues 3: Analysis Tools

Provide measurement tools to accurately & quickly identify performance problems,
  to automatically take action to investigate and provide information for:
    Scientist
    Grid support “NOC”
    Network administrator or network person
Promote well-understood, accepted metrics for customers for realistic, enforceable SLAs,
  provide acceptable limits,
  provide tools to track

Current Issues 4: Standards

All the above requires:
  easy-to-use standard ways (e.g. web services) for applications to access data from existing and new monitoring projects.
  standard naming conventions and schemas.
This will provide the ability to share information from multiple measurement infrastructure projects

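As a rough illustration of what a shared naming convention and schema could look like, the sketch below defines one measurement record and serializes it to JSON so that it could be exchanged between projects. The class name, field names, host names and units are all hypothetical, chosen only for this example; they are not taken from NMWG, GLUE or any existing project.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MeasurementEvent:
    """One measurement result. All field names here are hypothetical."""
    source: str       # host that ran the measurement
    destination: str  # host that was measured
    metric: str       # agreed metric name, e.g. "rtt"
    value: float      # measured value
    unit: str         # agreed unit, e.g. "ms"
    timestamp: str    # time of measurement, ISO 8601 in UTC
    tool: str         # tool that produced the value, e.g. "ping"


def to_json(event: MeasurementEvent) -> str:
    """Serialize an event so that another project can read it."""
    return json.dumps(asdict(event), sort_keys=True)


def from_json(text: str) -> MeasurementEvent:
    """Rebuild an event from its JSON form."""
    return MeasurementEvent(**json.loads(text))


if __name__ == "__main__":
    event = MeasurementEvent(
        source="hostA.example.org",
        destination="hostB.example.org",
        metric="rtt",
        value=42.5,
        unit="ms",
        timestamp="2003-06-10T14:00:00Z",
        tool="ping",
    )
    print(to_json(event))
```

Agreeing on the names and units is the substance of such a standard; the transport (web services or otherwise) can then carry the same record unchanged.
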
Current Issues 5: Collaboration

Need to build multi-disciplinary teams (incent orthogonal groups to work with one another):
  include people close to eventual customers (scientists, operational folks)
    to ensure what is developed is useful and tested out in realistic environments
  include vendors and providers in funded projects to bridge the gaps
  E2Epi is funded to provide coordination
Multi-agency funding!
  This is not a problem a single agency can address
  Science applications cross multi-agency networks, but there are barriers to inter-agency network monitoring collaborations

Benefits to Science

Network reaches its potential
  enables new ways of doing science:
    data-intensive science (astrophysics, global weather, seismology, medicine),
    remote instrument control (SNS, fusion (ITER), surgery),
    remote visualization/insight (Terascale supernova, climate modeling),
    world-wide collaboration enabling (LHC, ITER)
  enables scientists to do science
    Wizard gap closure, not fighting the network, network becomes a catalyst
Without good troubleshooting capabilities, the Grid vision will fail
Predictability, planning, expectations, raising the bar

What happens if we do not address the issues

Data continues to ship inefficiently by truck/plane (FedEx)
  Long delays (2 weeks), degraded collaboration, US scientists continue to lose leadership
Increased costs (manpower costs, lack of automation)
Inadequate reliability or performance for new applications (e.g. Grid fails to reach its potential)
New capabilities do not emerge in US:
  remote instrument control, real-time video, media distribution…
US science loses leadership to Japan, Europe, Canada

Why not leave it to industry

Industry won’t do it (“it’s not my problem”):
  Has its interest and hands full elsewhere
  It’s hard, does not sell products, little return on investment
Historically poor record, competitive concerns
  Management features are late in product development cycle
  Early success with SNMP and Netflow
  Commercial network management platforms (e.g. OpenView, Tivoli) have had limited success (network oriented, not user oriented) and are not cost effective
ISPs only measure own nets, not E2E; SLA guarantees are not cross-provider

More Information

Some Measurement Infrastructures:
  CAIDA list: www.caida.org/analysis/performance/measinfra/
  AMP: amp.nlanr.net/, PMA: http://pma.nlanr.net
  IEPM/PingER home site: www-iepm.slac.stanford.edu/
  IEPM-BW site: www-iepm.slac.stanford.edu/bw
  NIMI: ncne.nlanr.net/nimi/
  RIPE: www.ripe.net/test-traffic/
  NWS: nws.cs.ucsb.edu/
  Internet2 PiPES: e2epi.internet2.edu/
Tools:
  CAIDA measurement taxonomy: www.caida.org/tools/
  SLAC Network Tools: www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
Internet research needs:
  www.ietf.org/internet-drafts/draft-iab-research-funding-00.txt

Appendix: Current Practices

Who is Measuring?

CAIDA (skitter, macroscopic …)
NLANR (e.g. AMP – active, PMA – passive)
LBL (e.g. netest)
SLAC/FNAL (e.g. PingER, IEPM-BW)
PSC (NIMI)
RICE (INCITE)
Europe: RIPE (Eu ISPs), PPMCG
NWS
Internet2 (PiPES, IETF/IPPM, Netflow)
Sprint, ATT Research
Commercial (Keynote, Matrix, internetweather…)
For more see www.caida.org/analysis/performance/measinfra

Who is using measurements (customers)?

Users
  “Why is the performance not what I would like or expect”
  Set expectations, build case to complain to ISP
  What should I expect, what applications are likely to work
Planners: observe growth, decide when upgrades are needed, make cases for upgrades
Network engineers: pin-point problem, provide information to providers
Providers: “where is the problem and what is it”, best bang for the buck
Grid applications users/developers look forward to using,
  e.g. Grid Resource Broker data placement
  Requires APIs (e.g. web services), common naming conventions (e.g. NMWG, GLUE schema …) etc.
Security: anomalies
Researchers: modeling, theory testing, scaling laws

What is being measured? 1/2

What is being measured? 2/2

Delays, RTT, loss, jitter, availability
“Bandwidth” estimation
  TCP & UDP throughputs
  Packet pair techniques
  Packet length techniques (pchar …)
Topology/tomography, routing
Utilization, errors
Security
Evaluation of new protocols
Applications (many commercial packages)
  Email, DB, www …
One off: traffic characterization at borders and IXPs
Exception: providers do not make information public

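The packet pair technique listed above can be sketched as follows. Two equal-size packets are sent back to back; the bottleneck link spreads them apart, and the packet size divided by the arrival spacing (the dispersion) gives a capacity estimate. This is a minimal sketch, not how any particular tool does it: the function name is hypothetical, the dispersions are assumed to have been measured already, and taking the median to damp cross-traffic noise is a choice made for this example.

```python
from statistics import median


def packet_pair_capacity_bps(packet_size_bytes, dispersions_s):
    """Estimate bottleneck capacity in bits/s from packet-pair dispersions.

    dispersions_s holds, for each back-to-back pair, the time in seconds
    between the arrival of the first and second packet at the receiver.
    """
    # Ignore non-positive samples (clock glitches, reordering).
    valid = [d for d in dispersions_s if d > 0]
    if not valid:
        raise ValueError("need at least one positive dispersion sample")
    # Median is an assumption of this sketch, used to reduce the effect
    # of pairs that were spread or compressed by cross traffic.
    return packet_size_bytes * 8 / median(valid)


if __name__ == "__main__":
    # Illustrative values only: 1500-byte packets, spacing near 120 microseconds.
    samples = [0.00012, 0.00012, 0.0005]
    print(packet_pair_capacity_bps(1500, samples))
```

With 1500-byte packets and a median spacing of 120 microseconds the estimate is about 100 Mbit/s.
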
What tools are being used

Delays etc.: ping, OWAMP, GPS
“Bandwidth”: iperf, pathload, pipechar, netest, ABwE
Utilization: SNMP
Topology/tomography: traceroute, skitter, INCITE
Routing: RIPE, routeviews
Traffic characterization: netflow, NeTraMet, tcpdump, coralreef
Visualization: MRTG, RRD, netgeo, geoplot, tcptrace, xplot
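
To show how ping-style samples turn into the delay, loss and jitter metrics, here is a minimal sketch that summarizes a list of round-trip times. The function name and the input layout (one entry per probe, None for a lost probe) are assumptions of this example. Jitter is computed here as the mean absolute difference between consecutive received RTTs, which is only one of several definitions in use.

```python
from statistics import mean


def summarize_rtts(rtts_ms):
    """Summarize probe results; None in rtts_ms marks a lost probe."""
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("no probes were sent")
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(received)) / sent
    if not received:
        return {"sent": sent, "loss_pct": loss_pct,
                "min_ms": None, "avg_ms": None, "max_ms": None,
                "jitter_ms": None}
    # Jitter as mean absolute difference of consecutive received RTTs
    # (an assumption of this sketch; other definitions exist).
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "sent": sent,
        "loss_pct": loss_pct,
        "min_ms": min(received),
        "avg_ms": mean(received),
        "max_ms": max(received),
        "jitter_ms": mean(diffs) if diffs else 0.0,
    }


if __name__ == "__main__":
    # Illustrative values only: four probes, the third one lost.
    print(summarize_rtts([10.0, 12.0, None, 11.0]))
```
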