SLAC's Network Management Features
Connie Logg, October 1993
Current NMS:
We have a DECstation 5000 on which we are running DEC MSU. DEC has
recently announced that it is no longer supporting DEC MSU, and we are
going to replace it with NETVIEW 6000 running on an RS/6000 37T.
The features of DEC MSU which we currently use are:
- The network map display feature: we use this as the configuration
documentation for our system, as an interface to various utilities
(select a node and select an action from a menu to look at
something about that node), as our service desk network interface
status and information display, and other things I may have missed.
- The trap generation facility: when MSU detects that a node has
  changed in responsiveness, it generates a trap, which results in a
  software routine being invoked. This routine records the transition
  in a flat file. Every transition, from up to down and from down to up,
  is recorded in this file. The file is then processed daily and various
  reports are generated from it. Samples of the file and reports
  are discussed later.
- Various utilities which have been integrated with DEC MSU, such as:
  RMON from NAT (which allows the user to set up various types of
  data collection in an NAT ethermeter and display the results), and
  our plot packages for displaying interface and error statistics
  for the NAT bridges and ethermeters. Note that these are independent
  packages that are simply placed in a menu, so that a node can be
  chosen via the mouse and these routines invoked via the menu.
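The trap-handler routine itself is not listed above; the following is a minimal sketch of the idea, assuming a simple "timestamp node state" record layout for the flat file (the actual MSU record format is not shown in this report):

```python
from datetime import datetime

def format_transition(node, new_state, when=None):
    """Render one up/down transition as a flat-file line.

    The field layout (ISO timestamp, node name, new state) is a guess;
    the report does not show the real record format.
    """
    when = when or datetime.now()
    return f"{when.isoformat()} {node} {new_state}"

def record_transition(logfile, node, new_state):
    """Append the transition so the daily report jobs can process it."""
    with open(logfile, "a") as f:
        f.write(format_transition(node, new_state) + "\n")
```

A daily job can then read this append-only file sequentially and tally outages per node, which is how the later connectivity reports are produced.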
Ethermeter Utilization
We have attempted to place an NAT ethermeter on EVERY segment of cable
in our network. Several pieces of code have been developed in-house to
probe these ethermeters on a regular basis (currently once an hour) and to
look at the MIB variables in a sensible fashion (we call this piece
of code natlookup).
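The natlookup probe is not listed here; a minimal sketch of the core idea, with the SNMP transport hidden behind an injected `fetch` function and hypothetical counter names (the real MIB variable names are not shown in this report):

```python
def poll_ethermeter(fetch, oids):
    """Read one snapshot of MIB counters from an ethermeter.

    `fetch` stands in for the actual SNMP transport and maps a counter
    name (hypothetical here) to an integer cumulative value.
    """
    return {oid: fetch(oid) for oid in oids}

def hourly_rates(prev, curr, seconds=3600):
    """Turn two hourly snapshots of cumulative counters into per-second
    rates, the form the activity plots use (good packets/s, etc.)."""
    return {oid: (curr[oid] - prev[oid]) / seconds for oid in curr}
```

Polling hourly and differencing consecutive snapshots is what turns the ethermeters' cumulative counters into the per-second figures plotted below.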
Data and error statistics are collected hourly, 24 hours a day,
7 days a week, from all the ethermeters (44 of them currently). Plots
of network activity are available on demand. These plots include:
- Figure 1: plot of per second good packets, multicasts, broadcasts,
  shorts, and kilobytes
- Figure 2: plots of errors such as crcs, alignment, oversized, and shorts
- Figure 3: plots of packet sizes
- Figure 4: plots of the peaks for good packets, multicasts, broadcasts,
  shorts, and kilobytes
Extensive use of WWW is made to provide plots to the user community.
Once an hour the data for "today" is plotted and the postscript file is
made available via WWW. Once a day (early in the morning) plots of yesterday's
data and of the data for the past week are created and the postscript files
are made available via WWW. If a plot of any other time period is desired,
the plotting program can be invoked directly by the user.
Other reports generated from this data include:
- Figure 5: daily summary of network traffic - "*" lines indicate that
  there is something that needs checking out.
- Figure 6: daily summary of network errors - "*" lines indicate that
  there is something that needs checking out.
- Figure 7: a monthly trend plot which plots one point per month of the
  mean of the various data points.
- Figure 8: a daily lineprinter-format scatterplot which shows the ratio
  of collisions vs. goods. This is glanced at every day to quickly
  pinpoint network "hot spots".
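The lineprinter scatterplot lends itself to a very small program; the following is an illustrative sketch (segment names, column scaling, and layout are invented for the example, not the actual report format):

```python
def collision_scatter(segments, width=50):
    """Render a lineprinter-style scatter of the collision/good ratio.

    One row per segment: a '*' is placed at a column proportional to
    collisions/goods, so busy or faulty segments stand out at a glance.
    `segments` is a list of (name, goods, collisions) tuples.
    """
    rows = []
    for name, goods, collisions in segments:
        ratio = collisions / goods if goods else 0.0
        col = min(int(ratio * width), width - 1)
        rows.append(f"{name:<10}|" + " " * col + "*")
    return "\n".join(rows)
```

Because the output is plain fixed-width text, it prints equally well on a lineprinter or a terminal, which suits a quick daily eyeball check.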
Bridge Monitoring
Data is collected hourly from the NAT bridges in the SLAC network. The
data includes the number of good packets, the number of crc and alignment
errors, the number of collisions, and the number of multicasts.
- Figure 9: plot of bridge statistics. This is generated hourly for
  "today's" data, and daily for the data from "yesterday" and the
  "past week". These are all available via WWW. In addition, the user
  can invoke the plot program directly if another time range is desired.
In addition, daily summary reports are generated to summarize the data.
- Figure 10: Daily Bridge Summary - available via WWW
Routers
Currently, data is collected hourly from the routers in the network. A daily
report of this data is generated, but no "analysis" is done; it is just
eyeballed as needed.
- Figure 11: Router Daily Summary - available via WWW.
Network Timing
We have made an attempt to "time" our network by issuing pings to the
ethermeters and critical servers. The ethermeters and critical servers are
pinged 4 times an hour, a total of 10 times each: 5 times with a 100 byte
packet and 5 times with a 1000 byte packet. The ping command returns the
minimum, maximum, and average ping time for the 5 pings. This is stuffed
into a flat file and plotted on a daily basis.
- Figure 12: individual plots for each node show the average and maximum
  ping times for each node for both 100 byte and 1000 byte packets.
- Figure 13: the top 10 servers have a frequency plot generated for the
  maximum ping times for the 100 byte packets.
- Figure 14: the top 10 servers have a frequency plot generated for the
  average ping times for the 100 byte packets.
The pings only capture a snapshot of the network at a specific time,
but they do allow us to compare the responsiveness of nodes.
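Turning the ping output into flat-file records is a small parsing job; here is a sketch, assuming the classic BSD-style summary line `round-trip min/avg/max = ... ms` (other ping variants word this line differently, and the actual flat-file layout is not shown in this report):

```python
import re

def parse_ping_summary(output):
    """Pull (min, avg, max) round-trip times in ms out of ping output.

    Assumes the BSD summary line
    'round-trip min/avg/max = 0.4/0.5/0.7 ms'.
    Returns None if no summary line is found (e.g. total packet loss).
    """
    m = re.search(r"round-trip min/avg/max = ([\d.]+)/([\d.]+)/([\d.]+) ms",
                  output)
    if not m:
        return None
    return tuple(float(x) for x in m.groups())

def flat_file_record(node, size, times):
    """One record per node per packet size, ready for the daily plots.
    The whitespace-separated layout is illustrative."""
    mn, avg, mx = times
    return f"{node} {size} {mn} {avg} {mx}"
```

Run once per node per packet size, four times an hour, this yields the per-day series that Figures 12-14 are plotted from.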
System Connectivity Tracking
As mentioned in the first section, we track network connectivity (network
management station centric) via the NMS. The trap generated by the NMS
when it sees a change of response from a node (nodes are polled every two
minutes by the NMS) results in the event being recorded in a flat file.
This flat file is processed daily by 2 programs, which generate the
following reports:
- Figure 15: list of connectivity outages for our major servers, routers,
  ethermeters, and bridges. Note that IHEP is a special router
  which is used for the China connection. IHEP's data is
  actually imported from a SLAC VAX which monitors it via DECNET.
- Figure 16: a report which summarizes year-to-date connectivity.
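The outage and year-to-date reports amount to pairing DOWN/UP transitions from the flat file; a sketch of that pairing, assuming an "ISO-timestamp node STATE" line layout (the real flat-file format and the two report programs are not shown in this report):

```python
from datetime import datetime

def outages(lines):
    """Pair DOWN/UP transitions into outage intervals per node.

    Each line is assumed to be 'ISO-timestamp node STATE'.
    Returns {node: [(down_time, up_time), ...]} with datetimes;
    a node still down at end of file has no closing interval.
    """
    down_at, result = {}, {}
    for line in lines:
        stamp, node, state = line.split()
        t = datetime.fromisoformat(stamp)
        if state == "DOWN":
            down_at[node] = t
        elif state == "UP" and node in down_at:
            result.setdefault(node, []).append((down_at.pop(node), t))
    return result

def total_downtime_minutes(intervals):
    """Sum outage intervals, as a year-to-date summary would."""
    return sum((up - down).total_seconds() for down, up in intervals) / 60
```

Summing these intervals per node over the whole file gives the year-to-date connectivity figures of Figure 16.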
Enterprise Wide Network Database
We have an Oracle database (currently hosted by a VAX 9000) which
contains a plethora of information about our network. Much of the data
collection detailed above is driven by lists of servers, bridges, ethermeters,
and routers which are extracted daily (automatically) from this database,
known affectionately as CANDO. The person who maintains this database and
updates it as needed also maintains the MSU network map described in the
first section.
Problem Tracking
Our network problem tracking is currently done by a system developed for
VM problem tracking years ago. Every problem, change, or other action to the
network is currently registered in this system. This system will probably be
replaced in the next year with a new system that taps into our new network
management system and our CANDO database.
Summary
I have tried to summarize some of the components of our network management
strategy. There are several other areas which are monitored that I have not
covered (for example: AppleTalk, the Micom switch, and IHEP traffic).
If you would like more information, please feel free to contact me by
phone (415-926-2879) or email (CAL@SLAC.STANFORD.EDU).