State of Network Monitoring and Analysis in the US

Les Cottrell, KC Claffy, Brian Tierney, Ronn Ritke, Hans-Werner Braun

Prepared for the LSN meeting at NSF Washington 6/10/03

Outline

Goal of the network monitoring & analysis talk:

identify the R&D gaps and large-scale deployment issues for DOE, NSF, DARPA, NASA, NSA, NIST, etc. – the federal agencies that fund network research in the US

Two complementary presentations

High performance networking measurement needs for Science (E2E) – Les

Consumer grade & net-centric measurement needs – kc

Science network measurement needs

The end-to-end challenge, illustrations

Solution

End to end Monitoring Goals

Current issues

Problem analysis, measurement infrastructure, analysis tools, standards, collaborations

Benefits to Science

Consequences of not addressing issues

Why not leave to industry

Appendix

What is being done today

Who is measuring?

Who is using the measurements?

What is being measured?

What tools are being used?

The Problem

Distributed systems are very hard

A distributed system is one in which I can't get my work done because a computer I've never heard of has failed. Butler Lampson

When building distributed systems, we often observe unexpectedly low performance

the reasons for which are usually not obvious

The bottlenecks can be in any of the following components:

the applications

the operating systems

the disks, network adapters, bus, memory, etc. on either the sending or receiving host

the network switches and routers, and so on

Problems may not be logical

Most problems are operator errors, configurations, bugs

Problem examples: Help, it’s not working

I’ve lost my connection

Despite over-provisioned networks, users cannot get the throughput they expect

Wizard gap

What should I expect the performance to be?

It sometimes works …

What am I, as a scientist, supposed to do?

Need tools/measurements to detect problems, identify location, cause and time of occurrence

The Solution

A complete End-to-End monitoring framework that includes the following components:

instrumentation tools (application, middleware, and OS monitoring)

host and network sensors (host and network monitoring)

sensor management / activation tools

event publication service

event archive service

event analysis and visualization tools

a common set of protocols for describing, exchanging, and locating monitoring data

Needed for applications (e.g. Grid middleware), diagnosis, performance analysis

toolkit for streamlined problem diagnosis: detection, location, isolation & reporting

glue to multiple sources of information, traceroute archives, router info, delay/loss archives, on-demand tests, baselines

analysis and heuristics

E2Epi is working on a solution, but is funded only for coordination, not for all the underlying work
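As a rough illustration of how the framework components listed above fit together, here is a minimal Python sketch of a sensor feeding an event publication service, an archive, and a trivial analysis step. All class, field and host names are hypothetical and chosen only for this example; they are not taken from E2Epi or any existing monitoring toolkit, and the measured values are made up.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Event:
    """One monitoring event; the field names are illustrative, not a standard."""
    source: str   # host or sensor that produced the event
    name: str     # metric name, e.g. "rtt_ms"
    value: float
    timestamp: float = field(default_factory=time.time)


class EventBus:
    """Toy event publication service: fans events out to subscribers by name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, name, callback):
        self._subscribers[name].append(callback)

    def publish(self, event):
        for callback in self._subscribers[event.name]:
            callback(event)


class Archive:
    """Toy event archive service: keeps every event in memory."""

    def __init__(self):
        self.events = []

    def store(self, event):
        self.events.append(event)

    def query(self, name, source=None):
        return [e for e in self.events
                if e.name == name and (source is None or e.source == source)]


def mean_value(events):
    """Trivial analysis step: average of the archived values, or None if empty."""
    return sum(e.value for e in events) / len(events) if events else None


bus = EventBus()
archive = Archive()
bus.subscribe("rtt_ms", archive.store)

# A real sensor would measure these; the values here are made up.
for rtt in (40.0, 42.0, 44.0):
    bus.publish(Event(source="hostA", name="rtt_ms", value=rtt))

print(mean_value(archive.query("rtt_ms", source="hostA")))  # prints 42.0
```

Keeping publication, archiving and analysis behind separate small interfaces is one way to let applications, diagnosis tools and performance analysis share the same data; a real deployment would also need the security model and common protocols called for in these slides.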

End-2-End Monitoring Goals

E2E performance has to be solved: it is THE critical metric for the user, and it is not just a backbone bandwidth problem

Improve end-to-end data throughput for data intensive applications in high-speed WAN environments

Provide the ability to do performance analysis and fault detection in a Grid computing environment

Provide accurate, detailed, and adaptive monitoring of all distributed computing components, including the network

“Unfortunately, network management research has historically been very under-funded, because it is difficult to get funding bodies to recognize this as legitimate networking research.” (IAB Concerns & Recommendations Regarding Internet Research & Evolution)

Current Issues 1: Problem Analysis

Cultivate systematic studies of problems, causes, how to discover, how to report, how to by-pass

Analyze which problems are most important and how they are tackled manually today

Decide which problems are most cost-effective to target when developing tools to assist in diagnosis

Current issues 2: Measurement Infrastructures

Need to build infrastructure to support troubleshooting:

Requires repetitive and on-demand measurements with appropriate security model.

Provide recommended/accepted set of tools for delay, RTT, loss, route tracking, "bandwidth" estimation.

Include archiving and access to data, analysis and reporting of repetitive data.

Allow for evaluation, validation and comparison of new measurement tools, TCP stacks, applications (e.g. file transfer).

Reverse traceroute, looking glass, remote tcpdump (e.g. SCNM), remote testing of connection (ANL NDT)

Traceroute archives

Make tools easier to comprehend and use by scientists

Encourage efforts such as Internet2 E2Epi to provide measurements inside the cloud

Extend to ESnet & other NRNs, and beyond

Fund collaboration across boundaries

Ubiquitous coverage (require multiple toolkits): Inter agency, international, hi-speed, digital divide, long term and current
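To make the repetitive delay, RTT and loss measurements mentioned above concrete, here is a small Python sketch that summarizes one round of ping-style probes. It assumes the probes have already been sent by some tool and that a lost probe is recorded as None; the function name, the sample values and the output fields are hypothetical.

```python
import statistics


def summarize_probes(rtts_ms):
    """Summarize one round of probes; None marks a probe that got no reply."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    summary = {"sent": sent, "received": len(replies), "loss_pct": loss_pct}
    if replies:
        summary["min_ms"] = min(replies)
        summary["median_ms"] = statistics.median(replies)
        summary["max_ms"] = max(replies)
    return summary


# Made-up round of ten probes with two losses and one delay spike.
samples = [80.1, 79.8, None, 81.0, 95.3, None, 80.4, 80.0, 79.9, 80.2]
print(summarize_probes(samples))
```

Archiving one such summary per interval is enough to build baselines, so that a later on-demand test can be compared against what the path normally delivers.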

Current issues 3: Analysis tools

Provide measurement tools to accurately & quickly identify performance problems,

to automatically take action to investigate and provide information for:

Scientist

Grid support “NOC”

Network administrator or network person

Promote well understood, accepted metrics for customers for realistic, enforceable SLAs,

provide acceptable limits,

provide tools to track
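A sketch of what "acceptable limits" and "tools to track" could look like in practice: the Python below checks per-interval measurements against a set of limits and reports the fraction of compliant intervals. The metric names, the limits and the sample history are invented for illustration and do not come from any real SLA.

```python
# Hypothetical SLA limits; real values would come from the agreement itself.
SLA_LIMITS = {"rtt_ms": 100.0, "loss_pct": 1.0}


def violations(sample, limits=SLA_LIMITS):
    """Return {metric: (measured, limit)} for every metric above its limit."""
    return {
        metric: (sample[metric], limit)
        for metric, limit in limits.items()
        if metric in sample and sample[metric] > limit
    }


def compliance_fraction(samples, limits=SLA_LIMITS):
    """Fraction of measurement intervals with no violation at all."""
    if not samples:
        raise ValueError("no samples to evaluate")
    ok = sum(1 for s in samples if not violations(s, limits))
    return ok / len(samples)


# Made-up hourly summaries for one path.
history = [
    {"rtt_ms": 82.0, "loss_pct": 0.0},
    {"rtt_ms": 85.0, "loss_pct": 2.5},   # loss above the limit
    {"rtt_ms": 140.0, "loss_pct": 0.2},  # delay above the limit
    {"rtt_ms": 80.0, "loss_pct": 0.1},
]

for hour, sample in enumerate(history):
    for metric, (measured, limit) in violations(sample).items():
        print(f"hour {hour}: {metric} = {measured} exceeds limit {limit}")
print(f"compliant intervals: {compliance_fraction(history):.0%}")
```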

Current issues 4: Standards

All the above requires:

easy to use standard ways (e.g. web services) for applications to access data from existing and new monitoring projects.

standard naming conventions and schemas.

This will provide the ability to share information from multiple measurement infrastructure projects
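The value of common naming conventions and schemas can be shown with a small Python sketch: one measurement record with agreed field names, serialized so that another project could read it. The field list, the dotted characteristic name and the host names are assumptions made for this example, not an existing standard.

```python
import json

# Hypothetical required fields; a real schema would be agreed across projects.
REQUIRED_FIELDS = ("characteristic", "source", "destination",
                   "timestamp", "value", "units")


def check_fields(record):
    """Raise ValueError if any required field is missing from the record."""
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")


def encode(record):
    """Serialize a measurement record, refusing records with missing fields."""
    check_fields(record)
    return json.dumps(record, sort_keys=True)


def decode(text):
    """Parse a record produced by encode() and re-check the required fields."""
    record = json.loads(text)
    check_fields(record)
    return record


record = {
    "characteristic": "path.delay.roundTrip",  # illustrative dotted name
    "source": "hostA.example.org",
    "destination": "hostB.example.org",
    "timestamp": "2003-06-10T14:00:00Z",
    "value": 80.2,
    "units": "ms",
}

wire = encode(record)
assert decode(wire) == record  # the record survives the round trip
print(wire)
```

Once two measurement infrastructures agree on names and units like these, an application can combine their data without project-specific parsing.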

Current issues 5: Collaboration

Need to build multi-disciplinary teams (incent orthogonal groups to work with one another):

include people close to eventual customers (scientists, operational folks)

to ensure what is developed is useful, tested out in realistic environments

include vendors and providers in funded projects to bridge the gaps

E2Epi is funded to provide coordination

Multi agency funding!

This is not a problem a single agency can address

Science applications cross multi-agency networks, but there are barriers to interagency network monitoring collaborations

Benefits to Science

Network reaches its potential

enable new ways of doing science:

data intensive science (astrophysics, global weather, seismology, medicine),

remote instrument control (SNS, fusion (ITER), surgery),

remote visualization/insight (Terascale supernova, climate modeling),

world-wide collaboration enabling (LHC, ITER)

enables scientists to do science

Wizard gap closure, not fighting the network, network becomes a catalyst

Without good troubleshooting capabilities, the Grid vision will fail

Predictability, planning, expectations, raising the bar

What happens if we do not address the issues

Data continues to ship inefficiently by truck/plane (FedEx)

Long delays (2 weeks), degraded collaboration, US scientists continue to lose leadership

Increased costs (manpower costs, lack of automation)

Inadequate reliability or performance for new applications (e.g. Grid fails to reach its potential)

New capabilities do not emerge in US:

remote instrument control, real-time video, media distribution…

US science loses leadership to Japan, Europe, Canada

Why not leave it to industry

Industry won’t do it (“it’s not my problem”):

Has its interest and hands full elsewhere

It’s hard, does not sell products, little Return on Investment

Historically poor record, competitive concerns

Management features are late in product development cycle

Early success with SNMP and Netflow

Commercial Network Management Platforms (e.g. OpenView, Tivoli) have had limited success (network oriented, not user oriented) and are not cost effective

ISPs only measure own nets, not E2E, SLA guarantees are not cross-provider

More Information

Some Measurement Infrastructures:

CAIDA list: www.caida.org/analysis/performance/measinfra/

AMP: amp.nlanr.net/, PMA: pma.nlanr.net/

IEPM/PingER home site: www-iepm.slac.stanford.edu/

IEPM-BW site: www-iepm.slac.stanford.edu/bw

NIMI: ncne.nlanr.net/nimi/

RIPE: www.ripe.net/test-traffic/

NWS: nws.cs.ucsb.edu/

Internet2 PiPES: e2epi.internet2.edu/

Tools

CAIDA measurement taxonomy: www.caida.org/tools/

SLAC Network Tools: www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html

Internet research needs:

www.ietf.org/internet-drafts/draft-iab-research-funding-00.txt

Appendix: Current Practices

Who is Measuring?

CAIDA (skitter, macroscopic …)

NLANR (e.g. AMP – active, PMA – passive)

LBL (e.g. netest)

SLAC/FNAL (e.g. PingER, IEPM-BW)

PSC (NIMI)

RICE (INCITE)

Europe: RIPE (Eu ISPs), PPMCG

NWS

Internet2 (PiPES, IETF/IPPM, Netflow)

Sprint, AT&T Research

Commercial (Keynote, Matrix, internetweather…)

For more see www.caida.org/analysis/performance/measinfra

Who is using the measurements (customers)?

Users

“Why is the performance not what I would like or expect”

Set expectations, build case to complain to ISP

What should I expect, what applications are likely to work

Planners: observe growth, decide when upgrades are needed, make cases for upgrades

Network engineers: pin-point problem, provide information to providers

Providers: “where is the problem and what is it”, best bang for the buck

Grid application users/developers look forward to using measurements,

e.g. Grid Resource Broker data placement

Requires APIs (e.g. web services), common naming conventions (e.g. NMWG, GLUE schema …) etc.

Security: anomalies

Researchers: modeling, theory testing, scaling laws

What is being Measured 1/2

What is being measured 2/2?

Delays, RTT, loss, jitter, availability

“Bandwidth” estimation

TCP & UDP throughputs

Packet pair techniques

Packet length techniques (pchar …)

Topology/tomography, routing

Utilization, errors

Security

Evaluation of new protocols

Applications (many commercial packages)

Email, DB, www …

One off: traffic characterization at borders and IXPs

Exception: providers do not make information public
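The packet pair technique listed above rests on one relation: two back-to-back packets of size L leave the narrowest link separated by a time gap delta, so the capacity of that link is roughly C = L / delta. The Python sketch below applies this to a set of measured gaps and takes the median as a simple guard against cross-traffic noise. The function name and sample values are hypothetical, and the median filter is a simplification chosen for this sketch rather than the method of any tool named in these slides.

```python
import statistics


def packet_pair_capacity_bps(packet_size_bytes, dispersions_s):
    """Estimate bottleneck capacity in bit/s from packet-pair gaps: C = L / delta."""
    usable = [d for d in dispersions_s if d > 0]
    if not usable:
        raise ValueError("need at least one positive dispersion")
    delta = statistics.median(usable)  # crude filter against queueing noise
    return packet_size_bytes * 8 / delta


# Made-up receiver-side gaps (seconds) for pairs of 1500-byte packets.
gaps = [120e-6, 118e-6, 125e-6, 120e-6, 300e-6]
estimate = packet_pair_capacity_bps(1500, gaps)
print(f"estimated capacity: {estimate / 1e6:.1f} Mbit/s")
```

With these sample values the median gap is 120 microseconds, which corresponds to about 100 Mbit/s. The estimate describes the capacity of the narrowest link, not the throughput a TCP transfer will actually achieve, which is one reason the slides put "bandwidth" in quotes.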

What tools are being used

Delays etc.: ping, OWAMP, GPS

“Bandwidth”: iperf, pathload, pipechar, netest, ABwE

Utilization: SNMP

Topology/tomography: traceroute, skitter, INCITE

Routing: RIPE, routeviews

Traffic characterization: netflow, NeTraMet, tcpdump, coralreef

Visualization: MRTG, RRD, netgeo, geoplot, tcptrace, xplot

