IEPM

Stanford to SLAC file transfer problems


Introduction

Around 5/14/00 Dick Guertin of Stanford reported slow FTP performance transferring files from lindy.stanford.edu to ftp-slac.slac.stanford.edu. This had only started in the previous two weeks. Neither ftp-slac nor afs09 (where his home directory is) looked busy. Dick did an FTP of a file to /tmp on ftp-slac to cut out AFS, and that was equally slow. Chuck Boeheim got his file and FTP'd it from icarus to ftp-slac, and that was very fast. Chuck didn't see any packet loss on the link to campus, and saw very good round-trip times. Dick did a klog to SLAC's AFS cell and copied the file over AFS, and that turned out to be quite fast. Since that is more direct than FTP and involves no clear-text password, it is a better solution for him. However, still trying to figure out the performance problem with FTP, Chuck logged onto an elaine system on campus and tried the same thing. He also found that FTP was extremely slow, much slower than AFS, even though the ftp-slac machine is not busy and he saw no network packet loss or loads on the systems involved that would explain the slow performance.

Routing

The routing between a typical SLAC host (in this case www3.slac.stanford.edu) and lindy.stanford.edu is shown below. Nodes 1 through 3 are physically at SLAC. Between nodes 2 and 3 there is a 100Mbps FDDI ring (the Router Ring); see the SLAC DMZ Network page for more details. Between nodes 3 and 4 there are ATM switches (ATM-DMZ and Stan-L1010) with OC3 (155Mbps) interfaces to the routers, and a single-mode fiber from SLAC to Stanford carrying an OC12 (622Mbps) ATM circuit.
traceroute to Lindy.Stanford.EDU (171.64.11.11): 1-25 hops, 38 byte packets
 1  router                       (w.x.y.z)     [AS3671 - SU-SLAC]  1.14 ms (ttl=255)
 2  router                     (w.x.y.z     ) [AS3671 - SU-SLAC]  1.06 ms (ttl=254)
 3  RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4) [AS3671 - SU-SLAC]  1.13 ms (ttl=253)
 4  RTR-SUNET47.SLAC.Stanford.EDU (192.68.191.34) [AS32 - Stanford Linear Accelerator Center]  1.72 ms (ttl=252)
 5  Core-gateway.Stanford.EDU (171.64.1.115) [AS32 - BN-CIDR-171.64]  1.61 ms (ttl=251)
 6  i2-gateway.Stanford.EDU (171.64.1.209) [AS32 - BN-CIDR-171.64]  1.89 ms (ttl=250)
 7  Core3-gateway.Stanford.EDU (171.64.1.222) [AS32 - BN-CIDR-171.64]  1.76 ms (ttl=249)
 8  Core1-gateway.Stanford.EDU (171.64.3.67) [AS32 - BN-CIDR-171.64]  1.94 ms (ttl=248)
 9  Lindy.Stanford.EDU (171.64.11.11) [AS32 - BN-CIDR-171.64]  2.52 ms (ttl=247)
A typical route from elaine.stanford.edu to SLAC is shown below:
elaine41:~> traceroute ftp-slac.slac.stanford.edu
traceroute to ftp-slac.slac.stanford.edu (w.x.y.z     ): 1-30 hops, 38 byte packets
 1  leland-gateway.Stanford.EDU (171.64.15.97)  1.26 ms  0.986 ms  0.788 ms
 2  Core2-gateway.Stanford.EDU (171.64.1.233)  0.517 ms  0.526 ms  0.444 ms
 3  Core3-gateway.Stanford.EDU (171.64.3.34)  0.895 ms  0.815 ms  0.852 ms
 4  i2-gateway.Stanford.EDU (171.64.1.221)  0.862 ms  1.1 ms  0.964 ms
 5  Core-gateway.Stanford.EDU (171.64.1.210)  1.19 ms  1.13 ms  1.37 ms
 6  sunet47-gateway.Stanford.EDU (171.64.1.113)  1.40 ms  1.58 ms  1.50 ms
 7  RTR-DMZ.SLAC.Stanford.EDU (192.68.191.33)  2.28 ms  2.28 ms  2.42 ms
There have been no routing changes in the last month for the traffic between SLAC and Stanford; the routes shown above were recorded on Thursday April 13th, 2000.

Utilization

The link between SLAC and Stanford is an ATM circuit whose capacity is limited to OC3 (155Mbps) by the router interfaces. It is lightly used (see SLAC ATM to Stanford MRTG plots), especially since Friday 12 May, 2000, when Abilene traffic to SLAC was rerouted via ESnet rather than the Stanford-SLAC link. Since Saturday 13 May, 2000 the largest 5 minute peaks have been under 5Mbps, which is less than 5% of the link's nominal capacity and less than 25% of the bottleneck bandwidth measured by pathchar (see below). One can also see that traffic inbound to SLAC (the green area) dropped off in week 13, which coincides roughly with when Dick Guertin reported the drop in TCP performance when sending to SLAC.

Ping

Pinging from flora01.slac.stanford.edu (a Sun Solaris 2.6 host) to lindy.stanford.edu (a Unix System V host) shows that there is a problem as soon as one sends ping packets whose size requires IP fragmentation, i.e. packets larger than the Ethernet MTU. This is shown below for packets of 1472 data bytes (no fragmentation) and 1473 data bytes (requires fragmentation); the arithmetic behind the 1472 byte threshold is spelled out after the output.
14cottrell@flora01:~>ping -s lindy.stanford.edu 1473 10
PING lindy.stanford.edu: 1473 data bytes
ICMP Port Unreachable from gateway Lindy.Stanford.EDU (171.64.11.11)
 for udp from    login.SLAC.Stanford.EDU (w.x.y.z)      to Lindy.Stanford.EDU (171.64.11.11) port 33444
^C
----lindy.stanford.edu PING Statistics----
5 packets transmitted, 0 packets received, 100% packet loss
15cottrell@flora01:~>ping -s lindy.stanford.edu 1472 10
PING lindy.stanford.edu: 1472 data bytes
1480 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=0. time=5. ms
1480 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=1. time=5. ms
^C
----lindy.stanford.edu PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 5/5/5
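The 1472 byte threshold is simple arithmetic on the Ethernet MTU (a worked example added for clarity):
  20 bytes IP header + 8 bytes ICMP header + 1472 bytes data = 1500 bytes (fits in one Ethernet frame)
  20 bytes IP header + 8 bytes ICMP header + 1473 bytes data = 1501 bytes (must be split into two IP fragments)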
This loss of 1473 byte packets is fairly consistent with the reverse direction (from elaine7.stanford.edu to ftp-slac.slac.stanford.edu, starting at 21:07 PDT 5/16/00 with pings separated by 1 second), where roughly 1 in 1000 packets was successfully echoed. The reported failure reasons are shown below: "no reply" was the response for 967 out of 1000 pings, and "Frag reassembly" was the response for 34 out of 1000.
packet seq=552 bounced at FTP-SLAC.SLAC.Stanford.EDU (w.x.y.z     ): Frag reassembly time exceeded
no reply from ftp-slac.slac.stanford.edu within 1 sec
Pinging each of the nodes along the route from flora01.slac.stanford.edu using pingroute.pl indicates that the effect (100% loss of 1473 byte packets) starts at RTR-CGB6, as seen below. A minimal sketch of this approach is given after the pingroute.pl output below.

29cottrell@flora01:~>bin/pingroute.pl -c 4 -s 1473 lindy.stanford.edu
Architecture=SUN5, commands=traceroute -q 1 and  node , pingroute.pl version=1.3, 5/13/00
pingroute.pl version 1.3, 5/13/00 using traceroute to get nodes in route from flora01 to lindy.stanford.edu
traceroute: Warning: ckecksums disabled
traceroute to lindy.stanford.edu (171.64.11.11), 30 hops max, 40 byte packets
pingroute.pl version 1.3, 5/13/00 found 9 hops in route from flora01 to lindy.stanford.edu
1  router                       (w.x.y.z)      0.626 ms
2  router                     (w.x.y.z     )  0.907 ms
3  RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)  1.090 ms
4  RTR-SUNET47.SLAC.Stanford.EDU (192.68.191.34)  1.211 ms
5  Core-gateway.Stanford.EDU (171.64.1.115)  1.404 ms
6  i2-gateway.Stanford.EDU (171.64.1.209)  1.253 ms
7  Core4-gateway.Stanford.EDU (171.64.1.226)  1.337 ms
8  Core1-gateway.Stanford.EDU (171.64.3.19)  2.255 ms
9  Lindy.Stanford.EDU (171.64.11.11)  2.398 ms
Wrote 9 addresses to /tmp/pingaddr, now ping each address 4 times from flora01
         pings/node=4                              100 byte packets           1473 byte packets
         NODE                                  %loss    min    max    avg %loss   min    max    avg from flora01
w.x.y.z         router                            0%    0.0    4.0    1.0   0%    1.0    1.0    1.0 Tue May 16 21:41:04 PDT 2000
w.x.y.z         router                           0%    0.0    1.0    0.0 100%    0.0    0.0    0.0 Tue May 16 21:41:10 PDT 2000
134.79.111.4    RTR-DMZ.SLAC.STANFORD.EDU         0%    1.0    1.0    1.0 100%    0.0    0.0    0.0 Tue May 16 21:41:26 PDT 2000
192.68.191.34   RTR-SUNET47.SLAC.STANFORD.EDU     0%    1.0    2.0    1.0 100%    0.0    0.0    0.0 Tue May 16 21:41:42 PDT 2000
171.64.1.115    CORE-GATEWAY.STANFORD.EDU         0%    1.0    1.0    1.0 100%    0.0    0.0    0.0 Tue May 16 21:41:58 PDT 2000
171.64.1.209    I2-GATEWAY.STANFORD.EDU           0%    1.0    2.0    1.0   0%    3.0    3.0    3.0 Tue May 16 21:42:14 PDT 2000
171.64.1.226    CORE4-GATEWAY.STANFORD.EDU      100%    0.0    0.0    0.0 100%    0.0    0.0    0.0 Tue May 16 21:42:20 PDT 2000
171.64.3.19     CORE1-GATEWAY.STANFORD.EDU      100%    0.0    0.0    0.0 100%    0.0    0.0    0.0 Tue May 16 21:42:46 PDT 2000
171.64.11.11    LINDY.STANFORD.EDU                0%    1.0    2.0    1.0 100%    0.0    0.0    0.0 Tue May 16 21:43:12 PDT 2000
In the reverse direction, from elaine7.stanford.edu to ftp-slac.slac.stanford.edu, the loss of the big packets varies along the route, as seen below.

elaine7:~> bin/pingroute.pl -c 4 -s 1473 ftp-slac.slac.stanford.edu
Architecture=STANFORDU, commands=traceroute -q 1 and ping -s -t 3 node 100 4, pingroute.pl version=1.3, 5/13/00, debug=1
pingroute.pl version 1.3, 5/13/00 using traceroute to get nodes in route from elaine7.Stanford.EDU to ftp-slac.slac.stanford.edu
pingroute.pl version 1.3, 5/13/00 found 11 hops in route from elaine7.Stanford.EDU to ftp-slac.slac.stanford.edu
traceroute to ftp-slac.slac.stanford.edu (w.x.y.z     ): 1-30 hops, 38 byte packets
1  leland-gateway.Stanford.EDU (171.64.15.65)  1.10 ms
2  Core6-gateway.Stanford.EDU (171.64.1.229)  0.489 ms
3  Core4-gateway.Stanford.EDU (171.64.3.82)  1.85 ms
4  i2-gateway.Stanford.EDU (171.64.1.225)  1.38 ms
5  Core-gateway.Stanford.EDU (171.64.1.210)  1.20 ms
6  sunet47-gateway.Stanford.EDU (171.64.1.113)  1.92 ms
7  RTR-DMZ.SLAC.Stanford.EDU (192.68.191.33)  2.27 ms
8  *
9  *
10  FTP-SLAC.SLAC.Stanford.EDU (w.x.y.z     )  2.49 ms
Wrote 11 addresses to /tmp/pingaddr, now ping each address 4 times from elaine7.Stanford.EDU
         pings/node=4                              100 byte packets           1473 byte packets
         NODE                                  %loss    min    max    avg %loss   min    max    avg from elaine7.Stanford.EDU
171.64.15.65    LELAND-GATEWAY.STANFORD.EDU       0%    1.0    1.0    1.8 100%    0.0    0.0    0.0 Tue May 16 22:06:08 PDT 2000
171.64.1.229    CORE6-GATEWAY.STANFORD.EDU        0%    0.5    0.0    0.6 100%    0.0    0.0    0.0 Tue May 16 22:06:34 PDT 2000
171.64.3.82     CORE4-GATEWAY.STANFORD.EDU        0%    0.9    1.0    1.5   0%    2.9    3.0    3.3 Tue May 16 22:07:00 PDT 2000
171.64.1.225    I2-GATEWAY.STANFORD.EDU           0%    0.9    2.0    1.6   0%    3.5    4.0    3.9 Tue May 16 22:07:18 PDT 2000
171.64.1.210    CORE-GATEWAY.STANFORD.EDU         0%    1.4    2.0    1.9 100%    0.0    0.0    0.0 Tue May 16 22:07:36 PDT 2000
171.64.1.113    SUNET47-GATEWAY.STANFORD.EDU      0%    2.2    2.0    2.6 100%    0.0    0.0    0.0 Tue May 16 22:08:02 PDT 2000
192.68.191.33   RTR-DMZ.SLAC.STANFORD.EDU         0%    2.3    5.0    3.1 100%    0.0    0.0    0.0 Tue May 16 22:08:28 PDT 2000
w.x.y.z         FTP-SLAC.SLAC.STANFORD.EDU        0%    2.3    2.0    2.5 100%    0.0    0.0    0.0 Tue May 16 22:08:55 PDT 2000
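The pingroute.pl script itself is not reproduced on this page; the general approach can be approximated with a few lines of shell, sketched below assuming Solaris ping and traceroute syntax (a hypothetical illustration, not the actual script; hops that never answer may have to be interrupted by hand).
#!/bin/sh
# Sketch: list the hops with traceroute, then ping each hop with small
# packets and with packets big enough to force IP fragmentation.
TARGET=lindy.stanford.edu
for hop in `traceroute -n -q 1 $TARGET 2>/dev/null | awk '$2 != "*" {print $2}'`
do
  echo "=== $hop ==="
  ping -s $hop 100 4     # Solaris ping: host, data bytes, count (no fragmentation)
  ping -s $hop 1473 4    # 1473 data bytes: requires IP fragmentation
done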

We later realized that some of the routers would pass ping packets > 1472 bytes but would not respond to them, which reduced the effectiveness of this method for indicating where the failure to pass > 1472 byte packets (i.e. packets that require IP fragmentation) sets in.

PingER also shows low packet loss for the month of May measured between oceanus.slac.stanford.edu and argus.stanford.edu. This is shown below where each column contains the median percentage packet loss for that day of the month.


Monitoring-Site Remote-Site     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15                 
oceanus.slac    argus.stanford 0.10  0.52  0.49  0.52  1.46  0.42  0.10  0.62  1.17  0.31  1.04  0.94  0.10  0.21  0.42 
Using the Cisco-type ping from RTR-DMZ (a Cisco 7507) to lindy.stanford.edu gave losses of between 0.1 and 0.5% for packet sizes of 100, 1472 and 1500 bytes, but using 1700 byte packets gave over 99% loss (further study showed the threshold was 1500 bytes; 1501 byte packets give rise to fragmentation and heavy loss). Repeating the 1700 byte packets from RTR-DMZ to RTR-SUNET47, CORE-GATEWAY and I2-GATEWAY gave 1 loss in 100 pings in each case. However, pinging with 100 or 1700 byte packets from RTR-DMZ to CORE3-GATEWAY or CORE4-GATEWAY gave a 0% success rate with an error message of ICMP Destination Unreachable. We also pinged from RTR-DMZ in towards the SLAC network with 1700 byte pings and got <= 1% loss to RTR-CGB6 (a Cisco 7513), RTR-CORE1 (a Cisco 8500), RTR-CORE2 (a Cisco 8500) and login01 (a Sun Solaris 2.6 host). This suggests the problem is not in the ATM cloud but in the Stanford network beyond the I2-GATEWAY.Stanford.EDU router.
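For reference, the Cisco-type pings were done with the router's interactive extended ping dialog, which looks roughly as follows (a sketch rather than a transcript; prompts and defaults can vary by IOS release):
rtr-dmz# ping
Protocol [ip]:
Target IP address: 171.64.11.11
Repeat count [5]: 100
Datagram size [100]: 1700
Timeout in seconds [2]:
Extended commands [n]:
Sweep range of sizes [n]:
Sending 100, 1700-byte ICMP Echos to 171.64.11.11, timeout is 2 seconds:
In the output each "!" is a successful echo, each "." a timeout and each "U" an ICMP unreachable, which is the origin of the "U.U." notation in the next paragraph.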

Using 100 Cisco-type pings with 1500 byte packets from RTR-DMZ to the Stanford campus routers showed no packet loss (100/100) to sunet47, core1-gateway and I2-gateway, but Host unreachable (U.U.) to core-gateway, core3-gateway and core4-gateway.

We also used the Stanford (from NIKHEF) ping from elaine10.stanford.edu to ping rtr-dmz.slac.stanford.edu with the preload option, which sends a specified number of packets back to back at startup before listening for a response. With this we got about 73 small (<= 100 byte) packets through before loss ensued; for larger packets (1000 bytes) we got about 50 packets through before loss ensued. Preload pings of 100 and 1000 bytes from elaine10 to the first router (leland-gateway.stanford.edu) also successfully got 73 packets through before loss ensued. We repeated these tests from another machine at Stanford (elaine1.stanford.edu) with similar results. We believe this result is an artifact of the buffer space available in the Cisco router.
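The exact flags of the Stanford/NIKHEF ping are not reproduced here; as a hedged illustration, on a BSD-derived or Linux ping that supports a preload option the same kind of test looks something like this (preload normally requires root):
# send 100 packets back to back (-l 100), each with 1000 data bytes (-s 1000),
# then continue at the normal one-per-second rate up to 200 packets (-c 200)
sudo ping -l 100 -s 1000 -c 200 rtr-dmz.slac.stanford.edu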

Big echo requests sent from elaine26.stanford.edu to core-gateway.stanford.edu, sunet47-gateway.stanford.edu or rtr-dmz.slac.stanford.edu all fail; however, big echo requests from elaine26 to i2-gateway.stanford.edu work.

Further study at the SLAC end showed that the losses for 1473 byte packets occurred when the pings were initiated from pharlap.slac.stanford.edu (w.x.y.z, chosen since we have sudo privileges to run tcpdump on it) on NETHUB, or from flora01.slac.stanford.edu (w.x.y.z), to rtr-cgb6.slac.stanford.edu (interfaces w.x.y.z, w.x.y.z, w.x.y.z, w.x.y.z or w.x.y.z), but not between pharlap and rtr-core1.slac.stanford.edu. Looking at the diagram of the SLAC Switched Network, this removes the FDDI ring as the cause of the problem. Further, we turned on debugging in rtr-cgb6 and it indicated that it was replying to the ping ICMP requests. The route from pharlap to rtr-cgb6 passes through swh-core1 to rtr-core1, out through interface w.x.y.z, through swh-core1 to interface w.x.y.z of rtr-cgb6 (rtr-core1 is acting as a one-armed router to swh-core1). Pings from pharlap with 1473 bytes work as far as rtr-core1 interface w.x.y.z but not to rtr-cgb6 interface w.x.y.z. Using tcpdump on pharlap we verified that the echo responses were not being received from rtr-cgb6.
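The tcpdump check on pharlap was of this general form (a sketch; the actual capture filter used is not recorded here):
# capture all ICMP traffic between pharlap and rtr-cgb6; the "icmp" test
# matches every fragment, since the protocol field is carried in the IP
# header of each fragment
sudo tcpdump -n -v icmp and host rtr-cgb6.slac.stanford.edu
# with 1473 byte pings running one expects to see the outgoing echo request
# fragments; no echo reply fragments from rtr-cgb6 ever appeared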

On the other hand, if the big packet (>> 1500 byte) pings were initiated from rtr-cgb6, then there was low loss to either pharlap or flora01.

Even stranger, the effect does not appear when we launch the big echo request packets to rtr-cgb6 from an AIX 4.2 machine (vesta01 - w.x.y.z), a Linux machine (noric03.slac.stanford.edu - w.x.y.z) or a Windows NT machine (atreides.slac.stanford.edu - w.x.y.z), but does occur when they are launched from a Solaris 2.6 (SunOS 5.6) machine (flora01.slac.stanford.edu - w.x.y.z or pharlap.slac.stanford.edu - w.x.y.z). Repeating these big echo requests to lindy.stanford.edu from vesta01, noric03 and atreides also results in high loss. The default behavior of Solaris is to set the don't fragment bit.
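Solaris sets the don't fragment bit because path MTU discovery is on by default. As an aside (a sketch using the standard Solaris ndd tunable; requires root and should be undone after testing), it can be inspected and temporarily disabled with:
ndd /dev/ip ip_path_mtu_discovery           # 1 = enabled (the default)
ndd -set /dev/ip ip_path_mtu_discovery 0    # turn it off; outgoing packets
                                            # then no longer carry the DF bit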

Packet traces

Using NetXray on atreides to look at the packets when sending big echo requests from atreides to lindy.stanford.edu, both of the echo request fragments are seen but only the first echo response fragment is seen. The first echo request fragment has the don't fragment bit set.

When sending big echo requests from elaine26.stanford.edu (171.64.15.101) to atreides, only the first echo request fragment is seen by NetXray running on atreides. Similarly, when sending big echo requests from elaine26.stanford.edu to pharlap, only the first of the 2 echo request fragments is seen by tcpdump running on pharlap.

When sending big echo requests from pharlap to rtr-cgb6 then no responses are seen by tcpdump running on pharlap.

When sending big echo requests from elaine26 to pharlap and looking on the FDDI router ring with a Netscout sniffer, both the echo request fragments were seen as well as both the echo reply fragments. See the packet trace. Note that the first request fragment has the don't fragment bit set.
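The same observations can also be made with tcpdump alone, using standard BPF byte-offset filters (a sketch; byte 6 of the IP header carries the DF and MF flags and the top bits of the fragment offset):
# show every IP fragment (more-fragments bit set, or non-zero fragment offset)
sudo tcpdump -n 'ip[6:2] & 0x3fff != 0'
# show packets with the don't fragment bit set
sudo tcpdump -n 'ip[6] & 0x40 != 0'
# adding -v also prints the fragment id/offset and the DF flag for each packet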

Router statistics

Looking at RTR-DMZ's ATM interface to ATM-DMZ (a LightStream 1010), one sees the following:
ATM5/0 is up, line protocol is up 
  Hardware is cxBus ATM
  MTU 4470 bytes, sub MTU 4470, BW 156250 Kbit, DLY 80 usec, rely 255/255, load 1/255
  Encapsulation ATM, loopback not set, keepalive set (10 sec)
  Encapsulation(s): AAL5, PVC mode
  256 TX buffers, 256 RX buffers,
  2048 maximum active VCs, 1024 VCs per VP, 1 current VCCs
  VC idle disconnect time: 300 seconds
  Last input never, output 00:00:00, output hang never
  Last clearing of "show interface" counters 1w2d
  Queueing strategy: fifo
  Output queue 0/40, 0 drops; input queue 0/75, 572 drops
  5 minute input rate 233000 bits/sec, 60 packets/sec
  5 minute output rate 73000 bits/sec, 64 packets/sec
     37439574 packets input, 2747274064 bytes, 0 no buffer
     Received 0 broadcasts, 0 runts, 0 giants
     0 input errors, 0 CRC, 0 frame, 0 overrun, 2300788 ignored, 0 abort
     70905621 packets output, 1067094969 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 output buffer failures, 0 output buffers swapped out
The 2300788 ignored count (the number of packets that were ignored on the receive interface due to lack of internal buffers) is of interest. We compared the increase in ignored with the increase in packets input while doing the FTP GETs from Stanford to SLAC and saw ignore rates of 5 to 25%.
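The comparison is just a ratio of counter deltas taken over the transfer; the deltas in the example below are made up for illustration, only the 5 to 25% range comes from the actual measurements:
ignore rate = delta(ignored) / delta(packets input) during the FTP GET
e.g. 5,000 additional ignored / 40,000 additional packets input = 12.5%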

Looking at the corresponding interface in ATM-DMZ, no similar errors were noticeable:

ATM1/0/0 is up, line protocol is up 
  Hardware is oc3suni
  Description: --> RTR-DMZ
  MTU 4470 bytes, sub MTU 4470, BW 155520 Kbit, DLY 0 usec, rely 255/255, load 1/255
  Encapsulation ATM, loopback not set, keepalive not supported 
  Last input 00:00:00, output 00:00:00, output hang never
  Last clearing of "show interface" counters 10w1d
  Queueing strategy: fifo
  Output queue 0/40, 0 drops; input queue 0/75, 0 drops
  5 minute input rate 136000 bits/sec, 329 packets/sec
  5 minute output rate 303000 bits/sec, 733 packets/sec
     4013949168 packets input, 2285908400 bytes, 0 no buffer
     Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
     7 input errors, 7 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
     3622564490 packets output, 3017356946 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 output buffer failures, 0 output buffers swapped out
We disconnected the fiber and ran a test on it, and it looked good. We exchanged the TX and RX fibers between RTR-DMZ and ATM-DMZ; this had no effect. Since we had a spare, we also changed the ATM interface card in ATM-DMZ at about 3:00pm 5/19/00, in case a faulty driver at the ATM-DMZ end was causing the problems at RTR-DMZ; this also had no effect.

Finally, at about 5:20pm PDT 5/19/00, we did a "clear interface atm 5/0", which clears the hardware logic on an interface; the FTP GET rate improved to 2.3 MBytes/sec (yes, MBytes!) and the ignored count stopped increasing.

After clearing the interface it was also interesting to note that at least some of the earlier ping fragmentation problems cleared; for example, compare the ping below with the similar one above:

2cottrell@flora01:~>ping -s lindy.stanford.edu 1473 5
PING lindy.stanford.edu: 1473 data bytes
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=0. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=1. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=2. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=3. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=4. time=5. ms

----lindy.stanford.edu PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 5/5/6

Pathchar

Running pathchar between a node on the SLAC network (flora01.slac.stanford.edu) and lindy.stanford.edu around 6pm PDT May 16, 2000 gives the result shown below.
30cottrell@flora01:~>sudo /afs/slac/g/scs/bin/pathchar lindy.stanford.edu
Password:
pathchar to lindy.stanford.edu (171.64.11.11)
 mtu limitted to 1500 bytes at    login.SLAC.Stanford.EDU (w.x.y.z)     
 doing 32 probes at each of 64 to 1500 by 44
 0    login.SLAC.Stanford.EDU (w.x.y.z)     
 |    29 Mb/s,   204 us (816 us)
 1 router                       (w.x.y.z)    
 |    51 Mb/s,   184 us (1.42 ms)
 2 router                     (w.x.y.z     )
 |   115 Mb/s,   58 us (1.64 ms)
 3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
 |    52 Mb/s,   79 us (2.03 ms)
 4 RTR-SUNET47.SLAC.Stanford.EDU (192.68.191.34)
 |   1238 Mb/s,   15 us (2.07 ms)
 5 Core-gateway.Stanford.EDU (171.64.1.115)
 |    92 Mb/s,   -34 us (2.13 ms)
 6 i2-gateway.Stanford.EDU (171.64.1.209)
 |    46 Mb/s,   132 us (2.65 ms)
 7 Core4-gateway.Stanford.EDU (171.64.1.226)
 |    48 Mb/s,   112 us (3.13 ms)
 8 Core1-gateway.Stanford.EDU (171.64.3.19)
 9  * 1   548 535       3
10  * 1   504 533       4
11  * 1   460 530       3
12  * 1   460 533       3
13  * 1   196 529       3
14:  14   460 312       3

One way measurements

We also used sting with 100 probes to look at the one way packet losses in both directions. The results show that the loss measured by sting at this time (10:40pm PST 5/16/2000) was 0% in both directions.
% sudo sting lindy.stanford.edu
Password:
Connection setup took 2 ms
src = w.x.y.z     :11005 (2919847614)
dst = 171.64.11.11:80 (4208648018)

dataSent = 100, dataReceived = 100
acksSent = 100, acksReceived = 100
Forward drop rate = 0.000000
Reverse drop rate = 0.000000
316 packets received by filter
0 packets dropped by kernel
One can also look at the Surveyor reports for the one way delays and losses. The graphs indicate very low delay in both directions and low loss in both directions, though slightly higher in the Stanford to SLAC direction.

FTP tests

The Stanford FTP server is transfer.stanford.edu. Doing an FTP binary PUT around 10:00 am Friday 5/19/00 from flora01 at SLAC to transfer.stanford.edu achieved about 900 kBytes/sec, while an FTP binary GET only achieved about 2.4 kBytes/sec.
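The tests were ordinary interactive FTP transfers, along these lines (a sketch; the file names are placeholders):
flora01:~> ftp transfer.stanford.edu
ftp> binary
ftp> put bigfile                  (SLAC to Stanford: about 900 kBytes/sec)
ftp> get bigfile bigfile.back     (Stanford to SLAC: only about 2.4 kBytes/sec)
ftp> quit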

Summary

The link is uncongested. There is heavy loss as soon as fragmentation is required; otherwise the link looks good (< 1% loss and RTT < 10 msec). Since both ends of the connection are on Ethernet (10 or 100Mbps), it is unclear why fragmentation should be needed: the two ends should negotiate a TCP maximum segment size that does not require fragmentation.
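The expected TCP segment sizes are easy to work out (a reminder of the arithmetic, not a measurement):
Ethernet MTU = 1500 bytes
TCP MSS = 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes
Each end advertises its MSS in the SYN and the smaller value is used, so
full-sized segments fit in one Ethernet frame; the FDDI and ATM hops have
MTUs well above 1500 bytes, so no IP fragmentation should occur on this path.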

It is beginning to look like some of the fragmentation effects might be associated with a particular operating system, since (so far) the effect does not happen when the ping request is sent from a Linux, AIX or Windows NT machine at SLAC to rtr-cgb6, but does occur when the request is sent from a Solaris 2.6 host.

One also needs to be aware that a router may pass a fragmented packet but may not respond to it. Another concern is that some of the routers don't correctly generate fragmented pings even though they will reply to them.

The effect is seen on both the SLAC and Stanford sites separately. It does not appear to be associated with the ATM fabric or the FDDI rings (both places where fragmentation may occur).

It smelt like a big packet being sent with the don't fragment bit set. The correct response from a switch or router receiving an oversized packet that it cannot fragment (because the don't fragment bit is set) is to send back an ICMP "Destination Unreachable (fragmentation needed)" message. This feature is actually used for path maximum transmission unit (MTU) discovery (see Path MTU Discovery and Filtering ICMP). The NetXray results indicate that echo request packets sent from elaine (a Solaris system) have the don't fragment bit set, as seen at the end host atreides, and the Netscout sniffer on the SLAC FDDI router ring showed it was also set there.

Further possibilities to test included: looking at the trace of an FTP flow to understand more about the TCP dynamics; sitting on the Stanford SUNET47-GATEWAY router and looking in both directions; reviewing the rtr-cgb5.slac.stanford.edu configuration, and possibly the swh-core1.slac.stanford.edu configuration, looking for inconsistencies; turning on debug ip packet detail with an access list that selects destination pharlap on rtr-core1, rtr-core2 and rtr-cgb6, then pinging from pharlap and looking at the debug trace (the route cache may need to be turned off, or something similar), first with small packets to make sure the packets can be seen and then with large packets to see where they disappear, and if they stop in the switch, telling Cisco and rebooting the switch; and trying swapping the TX and RX fibers along the route to see if the direction of the poor file transfer changes.
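A sketch of the proposed debug setup follows (hypothetical commands: the access list number is arbitrary, <pharlap-ip> stands for pharlap's address, elided elsewhere on this page as w.x.y.z, and debug ip packet should always be restricted with an access list to avoid overloading the router):
rtr-core1# configure terminal
rtr-core1(config)# access-list 150 permit icmp any host <pharlap-ip>
rtr-core1(config)# end
rtr-core1# debug ip packet 150 detail
   (send small pings from pharlap first, to confirm the debug sees them, then
    1473 byte pings, and note on which router the fragments disappear; repeat
    on rtr-core2 and rtr-cgb6; fast switching may need to be disabled on the
    interface with "no ip route-cache" so that the packets are process
    switched and hence logged)
rtr-core1# undebug all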

We never did get to the bottom of the fragmentation problems and left them for others to solve, or until they affect us in some real way. The solution to the poor performance of FTP data flows from Stanford to SLAC was to do a "clear interface atm 5/0", which clears the hardware logic on an interface; the FTP GET rate improved to 2.3 MBytes/sec (yes, MBytes!) and the ignored count stopped increasing. So the problem was the interface card getting confused. We will follow up with Cisco to see what needs to happen next.

Afterthoughts

This problem was made easier to solve since the Stanford-SLAC link had a redundant backup path through ESnet, so we could take short outages to swap cards, cables etc. (luckily we are not currently trying to put time-critical applications like voice over IP through this link). Another facilitating factor was that we (SLAC people) had accounts at both ends of the link, which enabled us to reproduce the problem easily and make end-to-end tests from both ends. However, neither site's personnel had logon access to the other site's routers, switches etc. (this is to be expected), so we had to rely on each other to make and report the results of tests; having good relations between the sites was a big help here. Having multiple technologies (ATM, Ethernet and FDDI) involved made the skill set needed more diverse and slowed debugging down, since there were more possibilities to follow up.

The apparent fragmentation problem led us off in the wrong direction, and even once it was suspected we spent valuable time pinning it down. It is a pity that some routers (even from a highly regarded single vendor such as Cisco) do not treat oversize pings uniformly. It is even more disturbing that at least one router had a failure mode, unrelated to packet fragmentation, that resulted in ignored packets and such poor performance. The fact that a router interface can fail in such a manner (ignoring incoming packets for lack of buffers) appears to be very poor design or implementation, and certainly not something one would accept from a system that is competing with phone PBXs that advertise 99.999% reliability. It will be interesting to see what Cisco has to say about this. I suspect that they will suggest upgrading to the latest stable release of IOS (rtr-dmz is at release 11.1(16), which is quite old); this will of course require an outage of the complete router (rtr-dmz), and hence a major scheduled outage, and will put us further away from any 99.999% goal.

It would also be useful to get definitive information concerning under what conditions (configuration, model, code release) Cisco routers respond to ICMP echo requests that require fragmentation. Even when pinging with 1473 byte echo request packets from an AIX 4.2 host (vesta01.slac.stanford.edu), which does not set the don't fragment bit, to core3-gateway.stanford.edu or core5-gateway.stanford.edu, there is still 100% packet loss.

Maybe part of the lesson is that even UPSs, redundant power supplies, redundant paths etc. do not make up for poor implementations, and one needs to be able to reliably make code-level updates without requiring an outage of even a second or more (if one is to contemplate using the data network for Internet telephony).

Postscript

Prior to fixing this problem we had been seeing some problems with PPTP sessions over the Covad DSL links that pass over the Stanford-SLAC link. These problems manifested themselves as disconnected PPTP sessions (see Troubleshooting VPN Disconnects). They had been occurring for longer than the FTP problems mentioned above; however, the PPTP disconnects apparently dropped dramatically in frequency after we fixed the ignored-packet problem described above.



Created: May 16, 2000
Comments to iepm@slac.stanford.edu