SLAC's Network Management Features
Connie Logg, October 1993
Current NMS:
We have a DECstation 5000 on which we are running DEC MSU. DEC has
recently announced that it is no longer supporting DEC MSU, and we are
going to replace it with NETVIEW 6000 running on an RS/6000 37T.
The features of DEC MSU which we currently use are:
- The network map display feature: we use this as the configuration
documentation for our system, as an interface to various utilities
(select a node and select an action from a menu to look at
something about that node), as our service desk network interface
status and information display, and other things I may have missed.
- The trap generation facility: when MSU detects that a node has
  changed in responsiveness, it generates a trap, which results in a
  software routine being invoked. This routine records the transition
  in a flat file. Every transition, from up to down and from down to up,
  is recorded in this file. The file is then processed daily and various
  reports are generated from it. Samples of the file and reports
  are discussed later.
- Various utilities which have been integrated with DEC MSU, such as:
  RMON from NAT (which allows the user to set up various types of
  data collection in an NAT ethermeter and display the results), and
  our plot packages for displaying interface and error statistics
  for the NAT bridges and ethermeters. Note that these are independent
  packages that are simply placed in a menu, so that a node can be
  chosen via the mouse and these routines invoked via the menu.
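The trap-handler routine itself is not listed above; the following is a minimal sketch of the idea, assuming a simple "timestamp node state" record layout for the flat file (the actual MSU record format is not shown in this report):

```python
from datetime import datetime

def format_transition(node, new_state, when=None):
    """Render one up/down transition as a flat-file line.

    The field layout (ISO timestamp, node name, new state) is a guess;
    the report does not show the real record format.
    """
    when = when or datetime.now()
    return f"{when.isoformat()} {node} {new_state}"

def record_transition(logfile, node, new_state):
    """Append the transition so the daily report jobs can process it."""
    with open(logfile, "a") as f:
        f.write(format_transition(node, new_state) + "\n")
```

A daily job can then read this append-only file sequentially and tally outages per node, which is how the later connectivity reports are produced.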
Ethermeter Utilization
We have attempted to place an NAT ethermeter on EVERY segment of cable
in our network. Several pieces of code have been developed in-house to
probe these ethermeters on a regular basis (currently once an hour) and to
look at the MIB variables in a sensible fashion (we call this piece
of code natlookup).
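The natlookup probe is not listed here; a minimal sketch of the core idea, with the SNMP transport hidden behind an injected `fetch` function and hypothetical counter names (the real MIB variable names are not shown in this report):

```python
def poll_ethermeter(fetch, oids):
    """Read one snapshot of MIB counters from an ethermeter.

    `fetch` stands in for the actual SNMP transport and maps a counter
    name (hypothetical here) to an integer cumulative value.
    """
    return {oid: fetch(oid) for oid in oids}

def hourly_rates(prev, curr, seconds=3600):
    """Turn two hourly snapshots of cumulative counters into per-second
    rates, the form the activity plots use (good packets/s, etc.)."""
    return {oid: (curr[oid] - prev[oid]) / seconds for oid in curr}
```

Polling hourly and differencing consecutive snapshots is what turns the ethermeters' cumulative counters into the per-second figures plotted below.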
Data and error statistics are collected hourly, 24 hours a day,
7 days a week, from all the ethermeters (44 of them currently). Plots
of network activity are available on demand. These plots include:
- Figure 1: plot of per second good packets, multicasts, broadcasts,
  shorts, and kilobytes
- Figure 2: plots of errors such as crcs, alignment, oversized, and shorts
- Figure 3: plots of packet sizes
- Figure 4: plots of the peaks for good packets, multicasts, broadcasts,
  shorts, and kilobytes
Extensive use of WWW is made to provide plots to the user community.
Once an hour the data for "today" is plotted and the postscript file is
made available via WWW. Once a day (early in the morning) plots of yesterday's
data and of the data for the past week are created and the postscript files
are made available via WWW. If a plot of any other time period is desired,
the plotting program can be invoked directly by the user.
Other reports generated from this data include:
- Figure 5: daily summary of network traffic - "*" lines indicate that
  there is something that needs checking out.
- Figure 6: daily summary of network errors - "*" lines indicate that
  there is something that needs checking out.
- Figure 7: a monthly trend plot which plots one point per month of the
  mean of the various data points.
- Figure 8: a daily lineprinter-format scatterplot which shows the ratio
  of collisions vs. goods. This is glanced at every day to quickly
  pinpoint network "hot spots".
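The lineprinter scatterplot lends itself to a very small program; the following is an illustrative sketch (segment names, column scaling, and layout are invented for the example, not the actual report format):

```python
def collision_scatter(segments, width=50):
    """Render a lineprinter-style scatter of the collision/good ratio.

    One row per segment: a '*' is placed at a column proportional to
    collisions/goods, so busy or faulty segments stand out at a glance.
    `segments` is a list of (name, goods, collisions) tuples.
    """
    rows = []
    for name, goods, collisions in segments:
        ratio = collisions / goods if goods else 0.0
        col = min(int(ratio * width), width - 1)
        rows.append(f"{name:<10}|" + " " * col + "*")
    return "\n".join(rows)
```

Because the output is plain fixed-width text, it prints equally well on a lineprinter or a terminal, which suits a quick daily eyeball check.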
Bridge Monitoring
Data is collected hourly from the NAT bridges in the SLAC network. The
data includes the number of good packets, the number of crc and alignment
errors, the number of collisions, and the number of multicasts.
- Figure 9: plot of bridge statistics. This is generated hourly for
  "today's" data, and daily for the data from "yesterday" and the
  "past week". These are all available via WWW. In addition, the user
  can invoke the plot program directly if another time range is desired.
In addition, daily summary reports are generated to summarize the data.
- Figure 10: Daily Bridge Summary - available via WWW
Routers
Currently, data is collected hourly from the routers in the network. A daily
report of this data is generated, but no "analysis" is done; it is just
eyeballed as needed.
- Figure 11: Router Daily Summary - available via WWW.
Network Timing
We have made an attempt to "time" our network by issuing pings to the
ethermeters and critical servers. The ethermeters and critical servers are
pinged 4 times an hour, a total of 10 times each: 5 times with a 100 byte
packet and 5 times with a 1000 byte packet. The ping command returns the
minimum, maximum, and average ping time for the 5 pings. This is stuffed
into a flat file and plotted on a daily basis.
- Figure 12: individual plots for each node show the average and maximum
  ping times for each node for both 100 byte and 1000 byte packets.
- Figure 13: the top 10 servers have a frequency plot generated for the
  maximum ping times for the 100 byte packets.
- Figure 14: the top 10 servers have a frequency plot generated for the
  average ping times for the 100 byte packets.
The pings only capture a snapshot of the network at a specific time,
but they do allow us to compare the responsiveness of nodes.
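Turning the ping output into flat-file records is a small parsing job; here is a sketch, assuming the classic BSD-style summary line `round-trip min/avg/max = ... ms` (other ping variants word this line differently, and the actual flat-file layout is not shown in this report):

```python
import re

def parse_ping_summary(output):
    """Pull (min, avg, max) round-trip times in ms out of ping output.

    Assumes the BSD summary line
    'round-trip min/avg/max = 0.4/0.5/0.7 ms'.
    Returns None if no summary line is found (e.g. total packet loss).
    """
    m = re.search(r"round-trip min/avg/max = ([\d.]+)/([\d.]+)/([\d.]+) ms",
                  output)
    if not m:
        return None
    return tuple(float(x) for x in m.groups())

def flat_file_record(node, size, times):
    """One record per node per packet size, ready for the daily plots.
    The whitespace-separated layout is illustrative."""
    mn, avg, mx = times
    return f"{node} {size} {mn} {avg} {mx}"
```

Run once per node per packet size, four times an hour, this yields the per-day series that Figures 12-14 are plotted from.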
System Connectivity Tracking
As mentioned in the first section, we track network connectivity (network
management station centric) via the NMS. The trap generated by the NMS
when it sees a change of response from a node (nodes are polled every two
minutes by the NMS) results in the event being recorded in a flat file.
This flat file is processed daily by 2 programs, which generate the
following reports:
- Figure 15: list of connectivity outages for our major servers, routers,
  ethermeters, and bridges. Note that IHEP is a special router
  which is used for the China connection. IHEP's data is
  actually imported from a SLAC VAX which monitors it via DECNET.
- Figure 16: a report which summarizes year-to-date connectivity.
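The outage and year-to-date reports amount to pairing DOWN/UP transitions from the flat file; a sketch of that pairing, assuming an "ISO-timestamp node STATE" line layout (the real flat-file format and the two report programs are not shown in this report):

```python
from datetime import datetime

def outages(lines):
    """Pair DOWN/UP transitions into outage intervals per node.

    Each line is assumed to be 'ISO-timestamp node STATE'.
    Returns {node: [(down_time, up_time), ...]} with datetimes;
    a node still down at end of file has no closing interval.
    """
    down_at, result = {}, {}
    for line in lines:
        stamp, node, state = line.split()
        t = datetime.fromisoformat(stamp)
        if state == "DOWN":
            down_at[node] = t
        elif state == "UP" and node in down_at:
            result.setdefault(node, []).append((down_at.pop(node), t))
    return result

def total_downtime_minutes(intervals):
    """Sum outage intervals, as a year-to-date summary would."""
    return sum((up - down).total_seconds() for down, up in intervals) / 60
```

Summing these intervals per node over the whole file gives the year-to-date connectivity figures of Figure 16.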
Enterprise Wide Network Database
We have an Oracle database (currently hosted by a VAX 9000) which
contains a plethora of information about our network. Much of the data
collection detailed above is driven by lists of servers, bridges, ethermeters,
and routers which are extracted daily (automatically) from this database,
known affectionately as CANDO. The person who maintains this database and
updates it as needed also maintains the MSU network map described in the
first section.
Problem Tracking
Our network problem tracking is currently done by a system developed for
VM problem tracking years ago. Every problem, change, or other action to the
network is currently registered in this system. This system will probably be
replaced in the next year with a new system that taps into our new network
management system and our CANDO database.
Summary
I have tried to summarize some of the components of our network management
strategy. There are several other areas which are monitored that I have not
covered (for example: AppleTalk, the Micom switch, and IHEP traffic).
If you would like more information, please feel free to contact me by
phone (415-926-2879) or email (CAL@SLAC.STANFORD.EDU).