A Sudden Change in Bandwidth Utilization

Connie Logg. Page created: January 10, 2002; last update: January 10, 2002.


Problem description

We have code in production that performs various types of tests of WAN bandwidth utilization for several different applications. On January 4, 2002, we noticed that the available bandwidth appeared to decrease substantially for these tests to certain nodes, while for others it did not. There were no network changes, and the paths to the nodes did not appear to have changed (we have the traceroutes stored in the raw data files). The effect seemed to have started between 12:34 and about 20:00 on Jan 3rd, 2002.

Figure 1

As can be seen from Figures 1 & 2, the bandwidth tests for the higher-performance paths dropped dramatically (to 100 Mbits/second or less), while the tests along the lower-performance path did not change appreciably. The graphs showed us the event, but did little to identify the source of the problem. Examination of the "raw data" logs proved more helpful. One of the tests we run is PIPECHAR, which attempts to ascertain the size of the various pipes connecting two nodes. The PIPECHAR data in the log files showed that at one point PIPECHAR's estimate of the speed of the client machine's network interface dropped from 1 Gigabit/sec to 100 Megabits/sec, although no changes had been made to the network connections.
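
A drop like the one PIPECHAR recorded can also be flagged automatically rather than spotted by eye. The sketch below is not our production code. It assumes the interface-speed estimates have already been extracted from the raw logs into (timestamp, Mbits/sec) pairs; the function name, the sample values, the window size, and the 50% threshold are all illustrative assumptions.

```python
from statistics import median

def find_drop(samples, window=3, ratio=0.5):
    """Return the timestamp of the first sample at which the median of the
    next `window` estimates falls below `ratio` times the median of the
    previous `window` estimates, or None if there is no such drop.

    samples: list of (timestamp, mbits_per_sec) tuples in time order.
    """
    values = [value for _, value in samples]
    for i in range(window, len(values) - window + 1):
        before = median(values[i - window:i])
        after = median(values[i:i + window])
        if after < ratio * before:
            return samples[i][0]
    return None

# Hypothetical estimates in Mbits/sec; these are NOT values from the real logs.
samples = [
    ("10:00", 980), ("11:00", 1010), ("12:00", 995),
    ("13:00", 102), ("14:00", 98), ("15:00", 101), ("16:00", 99),
]
drop_at = find_drop(samples)
print(drop_at)
```

Comparing medians over a short window, instead of single points, keeps one noisy measurement from being reported as a step change.
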

Resolution

Looking at the interface statistics (available via netstat -i), it was evident that the host was reporting many errors on the NIC (an error rate of about 1%). This suggested that the cable, the switch port, or the host adapter needed some readjustment (our best guess at the time was the cable, and after that the GBIC on the switch). However, no errors were reported on the switch interface that the host was connected to.

After making sure the fiber and GBIC were properly seated, we recalled that the host had crashed on Jan 3rd. Following this crash, the Sun engineers had recommended that we turn on various kernel debugging flags (kmem_flags in /etc/system) to assist in tracing the problem down.
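
The NIC error rate mentioned above is simply the error count divided by the packet count for each interface. As a rough sketch, the following computes it from captured netstat -i text. It assumes a Solaris-style header containing the columns Name, Ipkts, Ierrs, Opkts and Oerrs; the sample text, the host and interface names, and the 0.1% reporting threshold are made up for illustration.

```python
def interface_error_rates(netstat_text):
    """Parse `netstat -i` style text and return
    {interface: (input_error_rate, output_error_rate)}.

    Columns are located by header name, so their order does not matter.
    """
    rows = [line.split() for line in netstat_text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    idx = {name: header.index(name)
           for name in ("Name", "Ipkts", "Ierrs", "Opkts", "Oerrs")}
    rates = {}
    for fields in body:
        if len(fields) != len(header):
            continue  # skip rows that do not line up with the header
        ipkts, ierrs = int(fields[idx["Ipkts"]]), int(fields[idx["Ierrs"]])
        opkts, oerrs = int(fields[idx["Opkts"]]), int(fields[idx["Oerrs"]])
        rates[fields[idx["Name"]]] = (
            ierrs / ipkts if ipkts else 0.0,
            oerrs / opkts if opkts else 0.0,
        )
    return rates

# Hypothetical output; the counters are invented to give a 1% input error rate.
NETSTAT_SAMPLE = """\
Name  Mtu   Net/Dest  Address    Ipkts    Ierrs  Opkts    Oerrs  Collis  Queue
lo0   8232  loopback  localhost  77814    0      77814    0      0       0
ge0   1500  myhost    myhost     2000000  20000  1500000  0      0       0
"""
rates = interface_error_rates(NETSTAT_SAMPLE)
# Report interfaces whose input or output error rate exceeds 0.1% (assumed threshold).
noisy = {name: r for name, r in rates.items() if max(r) > 0.001}
print(noisy)
```
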

After these debugging flags were disabled, the bandwidth utilization rates returned to normal.

This shows that we should not leave these kernel debugging flags turned on for long periods on our production servers without being aware of the significant performance impact they can have.
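
One way to avoid forgetting such flags is to check for them periodically. The sketch below scans the text of a system configuration file for uncommented kmem_flags settings with a non-zero value. It assumes the flags are set with lines of the form "set kmem_flags=0xf" and that comment lines begin with "*" or "#"; the function name and the sample text are hypothetical, and this is not something we ran at the time.

```python
import re

# Matches lines such as "set kmem_flags=0xf" or "set kmem_flags = 0xf".
KMEM_RE = re.compile(r"^\s*set\s+kmem_flags\s*=\s*(\S+)")

def active_kmem_flags(system_text):
    """Return (line_number, value) for each uncommented kmem_flags
    setting whose value is not zero."""
    found = []
    for number, line in enumerate(system_text.splitlines(), start=1):
        if line.lstrip().startswith(("*", "#")):
            continue  # comment line (assumed comment markers)
        match = KMEM_RE.match(line)
        if not match:
            continue
        value = match.group(1)
        try:
            enabled = int(value, 0) != 0  # base 0 accepts "0xf" as well as "15"
        except ValueError:
            enabled = True  # unparseable value: report it rather than hide it
        if enabled:
            found.append((number, value))
    return found

# Hypothetical file contents, for illustration only.
SYSTEM_SAMPLE = """\
* debugging enabled after the crash
set kmem_flags=0xf
* set kmem_flags=0x1
"""
hits = active_kmem_flags(SYSTEM_SAMPLE)
for number, value in hits:
    print(f"line {number}: kmem_flags={value} is enabled")
```

On a real host the text would come from reading the configuration file; a non-empty result is a reminder to confirm that the debugging is still needed.
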

Figure 2

Figure 3
