Network Monitoring for the LAN and WAN

Les Cottrell and Connie Logg, SLAC

Talk given at ORNL, June 24, 1996

Talk available at http://www.slac.stanford.edu/grp/scs/net/talk/ornl-96/ornl.htm

Outline of Talk:

Why Monitor

To provide:

  1. Performance Tuning - Improve service - proactively id & reduce bottlenecks, tune and optimize systems, improve QOS, optimize investments - id under/over utilized resources, balance workloads
  2. Trouble Shooting - Get out of crisis mode, id probs & start diagnosis/fixing before end user notices, increase reliability/availability, allow user to accomplish work more effectively and maximize productivity.
  3. Planning - understand performance trends for planning
  4. Expectations - set expectations for the Distributed System (from network thru applications) and see how well they are met
  5. Security
  6. Accounting

What's Changed that Makes Monitoring so Crucial now

1. Distributed environment (client/server)

2. Network growth:

3. Complexity:

4. Reduced Resources:

What Should we Monitor

The ultimate measures of performance are the users' perceptions of the performance of their networked applications (e.g. WWW, email, a distributed RDBMS, a spreadsheet accessing a distributed file system etc.)

This performance is affected by the performance of the complete Distributed System, which includes:

How do we Monitor - Components

Network Data Collection at SLAC

Collect data via SNMP from:

Data Analysis at SLAC

Once a day (in the early morning), via batch jobs:

Data Reduction at SLAC

Analysis generates thousands of reports most of which are uninteresting

Reduction examines the analysis reports and extracts the exceptions e.g.

Alert Notification

The daily WWW visible exception reports are manually reviewed each working morning and used as input to the morning H. O. T. meeting

Results

Service Level Expectations:

Future - Switched Network

Shared media => switched network

Need a probe on every switch/hub port

Future - RMON2

Provides monitoring for full 7 layers of OSI model

RMON still handles layers 1 and 2 of OSI model

RMON2 will enable trouble-shooting tools to

Future - ATM

Challenges:

Summary

Still no out of the box integrated solution available

Require distributed, easy-to-use, heterogeneous "system" management to enable focus to shift to service management

Need to make information digestible

Developing tools is costly and still has to be done in-house

Wide Area Monitoring

WAN monitoring for an end site has different requirements to LAN monitoring

Internet Problems ("Gridlock")

Quality of service over Internet has dramatically worsened in last year

Possible Solutions

Tail circuits from collaborator sites to ESnet

Peering with NSI, NSF and other government agencies

Develop service quality metrics and performance measurements:

How do we Monitor - Tools

The main tools used today are:

Others see:

http://www.slac.stanford.edu/~cottrell/tcom/nmtf-tools.html

Tools

Ping

FTP rates depend on many inter-related factors including:

Sites That SLAC Monitors

Short Term Problems

Long Term Degradation

Studies at SLAC, HEPNRC and LBNL show major degradation in quality of service between Universities and ESnet Labs in last year:

For example, see the ping loss and response time degradation between SLAC and UCD over the 180 days from Dec '95 to Mar '96.

Note: difference in weekends vs workdays

3D Plot of Ping Response vs Hour of day
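As a rough sketch of how the data behind a response-versus-hour plot might be prepared, the Python below averages ping response times per hour of day, kept separately for workdays and weekends (the difference noted above). The function name, the sample layout, and the sample values are illustrative assumptions, not SLAC's actual tooling or measurements, and the axes of the original plot are not reproduced here.

```python
from collections import defaultdict
from datetime import datetime


def mean_response_by_hour(samples):
    """Average ping response per (day type, hour of day).

    samples: iterable of (timestamp, response_ms) pairs, where timestamp is a
    datetime and response_ms is the round-trip time in milliseconds.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, response_ms in samples:
        # weekday() returns 0-4 for Monday-Friday, 5-6 for Saturday-Sunday.
        day_type = "weekend" if timestamp.weekday() >= 5 else "workday"
        key = (day_type, timestamp.hour)
        sums[key] += response_ms
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical samples, for illustration only.
samples = [
    (datetime(1996, 3, 4, 14, 0), 300.0),   # a Monday afternoon
    (datetime(1996, 3, 4, 14, 30), 500.0),
    (datetime(1996, 3, 9, 14, 0), 120.0),   # a Saturday afternoon
]
averages = mean_response_by_hour(samples)
```

Each key of the result is a (day type, hour) pair, so the workday and weekend curves can be plotted side by side.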

Alerts

For each node being monitored:

Service Predictability

Scatter plot:

daily average ping / max ping rate versus

daily average packet success / maximum packet success

(where % success = (total packets - packets lost) / total packets)
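A minimal sketch of how the scatter-plot coordinates might be computed, assuming each day's values are normalized by the maximum seen over the period. The function names and sample numbers are hypothetical, and reading "max ping rate" as the largest daily average response time is an assumption rather than something the talk states.

```python
def packet_success(total_packets, packets_lost):
    """Fraction of packets that succeeded: (total - lost) / total."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return (total_packets - packets_lost) / total_packets


def predictability_points(daily_avg_ping_ms, daily_success):
    """Return one (x, y) point per day, each normalized by the period maximum.

    x = daily average ping / maximum daily average ping
    y = daily average success / maximum daily average success
    """
    max_ping = max(daily_avg_ping_ms)
    max_success = max(daily_success)
    if max_ping <= 0 or max_success <= 0:
        raise ValueError("maxima must be positive to normalize")
    return [
        (ping / max_ping, success / max_success)
        for ping, success in zip(daily_avg_ping_ms, daily_success)
    ]


# Hypothetical three-day sample, for illustration only.
daily_ping = [100.0, 200.0, 400.0]
daily_success = [
    packet_success(100, 0),
    packet_success(100, 10),
    packet_success(100, 50),
]
points = predictability_points(daily_ping, daily_success)
```

Under this normalization both coordinates fall between 0 and 1, so links with very different absolute response times can share one plot.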

Changes in Service Predictability

By this metric:

What are We (ESCC/NMTF) Doing

Meeting at Berkeley, May 1996 - agreed:

Will monitor end-to-end connections for:

What are Others Doing

Big concerns about Internet gridlock

"to a significant extent due to the lack of efficient financial pressure on service providers to strengthen the infrastructure" from Metrics for Internet Settlements, by Brian Carpenter, CERN

Much effort therefore to identify critical networking metrics and tools that can be employed by users and ISPs to quantify Internet quality of service

Organizations involved include the Federal Networking Council (FNC) and its Advisory Committee (FNCAC), DARPA, Kansas University, Merit, NSF, ESnet/NMTF, National Laboratory for Applied Network Research (NLANR) working with MCI.

Further Information

ESCC Network Monitoring Task Force (NMTF):
http://www.slac.stanford.edu/~cottrell/tcom/nmtf/
Tutorial on WAN Monitoring:
http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html
Survey of Internet Statistics / Metrics:
http://www.tomco.net/~tmonk/metrics.htm
Federal Network Council:
http://www.fnc.gov/metrics.html
http://www.fnc.gov/fnc_collab1.html
Draft RFC on Metrics for Internet Settlements (Carpenter/CERN):
ftp://ds.internic.net/internet-drafts/draft-carpenter-metrics-00.txt
National Laboratory for Applied Network Research (NLANR):
http://www.nlanr.net
http://www.nlanr.net/INFO/
http://www.nlanr.net/Viz/End2end
NCSA/NLANR Internet Performance Sampling:
http://computer.ncsa.uiuc.edu/vwelch/projects/inetperf/
Merit's Network Statistics Collection and Reporting Facility:
http://home.merit.edu/~wbn/RAC/play.html
SLAC's WAN Monitoring:
http://www.slac.stanford.edu/comp/net/wan-mon.html

Owner: cottrell@slac.stanford.edu