Stanford to SLAC file transfer problems
There have been no routing changes in the last month for the traffic between SLAC and Stanford. The routes on Thursday April 13th, 2000 are shown below.

traceroute to Lindy.Stanford.EDU (171.64.11.11): 1-25 hops, 38 byte packets
 1  router (w.x.y.z) [AS3671 - SU-SLAC]  1.14 ms (ttl=255)
 2  router (w.x.y.z) [AS3671 - SU-SLAC]  1.06 ms (ttl=254)
 3  RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4) [AS3671 - SU-SLAC]  1.13 ms (ttl=253)
 4  RTR-SUNET47.SLAC.Stanford.EDU (192.68.191.34) [AS32 - Stanford Linear Accelerator Center]  1.72 ms (ttl=252)
 5  Core-gateway.Stanford.EDU (171.64.1.115) [AS32 - BN-CIDR-171.64]  1.61 ms (ttl=251)
 6  i2-gateway.Stanford.EDU (171.64.1.209) [AS32 - BN-CIDR-171.64]  1.89 ms (ttl=250)
 7  Core3-gateway.Stanford.EDU (171.64.1.222) [AS32 - BN-CIDR-171.64]  1.76 ms (ttl=249)
 8  Core1-gateway.Stanford.EDU (171.64.3.67) [AS32 - BN-CIDR-171.64]  1.94 ms (ttl=248)
 9  Lindy.Stanford.EDU (171.64.11.11) [AS32 - BN-CIDR-171.64]  2.52 ms (ttl=247)

A typical route from elaine.stanford.edu to SLAC is shown below:

elaine41:~> traceroute ftp-slac.slac.stanford.edu
traceroute to ftp-slac.slac.stanford.edu (w.x.y.z): 1-30 hops, 38 byte packets
 1  leland-gateway.Stanford.EDU (171.64.15.97)  1.26 ms  0.986 ms  0.788 ms
 2  Core2-gateway.Stanford.EDU (171.64.1.233)  0.517 ms  0.526 ms  0.444 ms
 3  Core3-gateway.Stanford.EDU (171.64.3.34)  0.895 ms  0.815 ms  0.852 ms
 4  i2-gateway.Stanford.EDU (171.64.1.221)  0.862 ms  1.1 ms  0.964 ms
 5  Core-gateway.Stanford.EDU (171.64.1.210)  1.19 ms  1.13 ms  1.37 ms
 6  sunet47-gateway.Stanford.EDU (171.64.1.113)  1.40 ms  1.58 ms  1.50 ms
 7  RTR-DMZ.SLAC.Stanford.EDU (192.68.191.33)  2.28 ms  2.28 ms  2.42 ms
14cottrell@flora01:~>ping -s lindy.stanford.edu 1473 10
PING lindy.stanford.edu: 1473 data bytes
ICMP Port Unreachable from gateway Lindy.Stanford.EDU (171.64.11.11) for udp from login.SLAC.Stanford.EDU (w.x.y.z) to Lindy.Stanford.EDU (171.64.11.11) port 33444
^C
----lindy.stanford.edu PING Statistics----
5 packets transmitted, 0 packets received, 100% packet loss

15cottrell@flora01:~>ping -s lindy.stanford.edu 1472 10
PING lindy.stanford.edu: 1472 data bytes
1480 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=0. time=5. ms
1480 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=1. time=5. ms
^C
----lindy.stanford.edu PING Statistics----
2 packets transmitted, 2 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 5/5/5

This loss of 1473 byte packets is fairly consistent with what we see in the reverse direction (pings from elaine7.stanford.edu to ftp-slac.slac.stanford.edu starting at 21:07 PDT 5/16/00, separated by 1 second), where roughly 1 in 1000 packets was successfully echoed. The reported failure reasons are shown below: "no reply" was the response for 967 out of 1000 pings, and "Frag reassembly" was the response for 34 out of 1000.
packet seq=552 bounced at FTP-SLAC.SLAC.Stanford.EDU (w.x.y.z): Frag reassembly time exceeded
no reply from ftp-slac.slac.stanford.edu within 1 sec

Pinging each of the nodes along the route from flora01.slac.stanford.edu using pingroute.pl indicates that the effect (100% loss of the 1473 byte packets) starts at RTR-CGB6, see below.
29cottrell@flora01:~>bin/pingroute.pl -c 4 -s 1473 lindy.stanford.edu
Architecture=SUN5, commands=traceroute -q 1 and node, pingroute.pl version=1.3, 5/13/00
pingroute.pl version 1.3, 5/13/00 using traceroute to get nodes in route from flora01 to lindy.stanford.edu
traceroute: Warning: checksums disabled
traceroute to lindy.stanford.edu (171.64.11.11), 30 hops max, 40 byte packets
pingroute.pl version 1.3, 5/13/00 found 9 hops in route from flora01 to lindy.stanford.edu
 1  router (w.x.y.z)  0.626 ms
 2  router (w.x.y.z)  0.907 ms
 3  RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)  1.090 ms
 4  RTR-SUNET47.SLAC.Stanford.EDU (192.68.191.34)  1.211 ms
 5  Core-gateway.Stanford.EDU (171.64.1.115)  1.404 ms
 6  i2-gateway.Stanford.EDU (171.64.1.209)  1.253 ms
 7  Core4-gateway.Stanford.EDU (171.64.1.226)  1.337 ms
 8  Core1-gateway.Stanford.EDU (171.64.3.19)  2.255 ms
 9  Lindy.Stanford.EDU (171.64.11.11)  2.398 ms
Wrote 9 addresses to /tmp/pingaddr, now ping each address 4 times from flora01
pings/node=4                                     100 byte packets       1473 byte packets
NODE                                             %loss  min  max  avg   %loss  min  max  avg   from flora01
w.x.y.z        router                              0%   0.0  4.0  1.0     0%   1.0  1.0  1.0   Tue May 16 21:41:04 PDT 2000
w.x.y.z        router                              0%   0.0  1.0  0.0   100%   0.0  0.0  0.0   Tue May 16 21:41:10 PDT 2000
134.79.111.4   RTR-DMZ.SLAC.STANFORD.EDU           0%   1.0  1.0  1.0   100%   0.0  0.0  0.0   Tue May 16 21:41:26 PDT 2000
192.68.191.34  RTR-SUNET47.SLAC.STANFORD.EDU       0%   1.0  2.0  1.0   100%   0.0  0.0  0.0   Tue May 16 21:41:42 PDT 2000
171.64.1.115   CORE-GATEWAY.STANFORD.EDU           0%   1.0  1.0  1.0   100%   0.0  0.0  0.0   Tue May 16 21:41:58 PDT 2000
171.64.1.209   I2-GATEWAY.STANFORD.EDU             0%   1.0  2.0  1.0     0%   3.0  3.0  3.0   Tue May 16 21:42:14 PDT 2000
171.64.1.226   CORE4-GATEWAY.STANFORD.EDU        100%   0.0  0.0  0.0   100%   0.0  0.0  0.0   Tue May 16 21:42:20 PDT 2000
171.64.3.19    CORE1-GATEWAY.STANFORD.EDU        100%   0.0  0.0  0.0   100%   0.0  0.0  0.0   Tue May 16 21:42:46 PDT 2000
171.64.11.11   LINDY.STANFORD.EDU                  0%   1.0  2.0  1.0   100%   0.0  0.0  0.0   Tue May 16 21:43:12 PDT 2000

In the reverse direction from elaine7.stanford.edu to ftp-slac.slac.stanford.edu, the 100% loss of the big packets varies along the route, as seen below.
elaine7:~> bin/pingroute.pl -c 4 -s 1473 ftp-slac.slac.stanford.edu
Architecture=STANFORDU, commands=traceroute -q 1 and ping -s -t 3 node 100 4, pingroute.pl version=1.3, 5/13/00, debug=1
pingroute.pl version 1.3, 5/13/00 using traceroute to get nodes in route from elaine7.Stanford.EDU to ftp-slac.slac.stanford.edu
pingroute.pl version 1.3, 5/13/00 found 11 hops in route from elaine7.Stanford.EDU to ftp-slac.slac.stanford.edu
traceroute to ftp-slac.slac.stanford.edu (w.x.y.z): 1-30 hops, 38 byte packets
 1  leland-gateway.Stanford.EDU (171.64.15.65)  1.10 ms
 2  Core6-gateway.Stanford.EDU (171.64.1.229)  0.489 ms
 3  Core4-gateway.Stanford.EDU (171.64.3.82)  1.85 ms
 4  i2-gateway.Stanford.EDU (171.64.1.225)  1.38 ms
 5  Core-gateway.Stanford.EDU (171.64.1.210)  1.20 ms
 6  sunet47-gateway.Stanford.EDU (171.64.1.113)  1.92 ms
 7  RTR-DMZ.SLAC.Stanford.EDU (192.68.191.33)  2.27 ms
 8  *
 9  *
10  FTP-SLAC.SLAC.Stanford.EDU (w.x.y.z)  2.49 ms
Wrote 11 addresses to /tmp/pingaddr, now ping each address 4 times from elaine7.Stanford.EDU
pings/node=4                                     100 byte packets       1473 byte packets
NODE                                             %loss  min  max  avg   %loss  min  max  avg   from elaine7.Stanford.EDU
171.64.15.65   LELAND-GATEWAY.STANFORD.EDU         0%   1.0  1.0  1.8   100%   0.0  0.0  0.0   Tue May 16 22:06:08 PDT 2000
171.64.1.229   CORE6-GATEWAY.STANFORD.EDU          0%   0.5  0.0  0.6   100%   0.0  0.0  0.0   Tue May 16 22:06:34 PDT 2000
171.64.3.82    CORE4-GATEWAY.STANFORD.EDU          0%   0.9  1.0  1.5     0%   2.9  3.0  3.3   Tue May 16 22:07:00 PDT 2000
171.64.1.225   I2-GATEWAY.STANFORD.EDU             0%   0.9  2.0  1.6     0%   3.5  4.0  3.9   Tue May 16 22:07:18 PDT 2000
171.64.1.210   CORE-GATEWAY.STANFORD.EDU           0%   1.4  2.0  1.9   100%   0.0  0.0  0.0   Tue May 16 22:07:36 PDT 2000
171.64.1.113   SUNET47-GATEWAY.STANFORD.EDU        0%   2.2  2.0  2.6   100%   0.0  0.0  0.0   Tue May 16 22:08:02 PDT 2000
192.68.191.33  RTR-DMZ.SLAC.STANFORD.EDU           0%   2.3  5.0  3.1   100%   0.0  0.0  0.0   Tue May 16 22:08:28 PDT 2000
w.x.y.z        FTP-SLAC.SLAC.STANFORD.EDU          0%   2.3  2.0  2.5   100%   0.0  0.0  0.0   Tue May 16 22:08:55 PDT 2000
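For reference, the per-hop approach that pingroute.pl takes can be sketched roughly as follows. This is a Python illustration using the system traceroute and a Linux-style ping, not the actual pingroute.pl code; the target host is just an example.

# Rough sketch of the pingroute idea: find the hops with traceroute, then ping
# each hop with a small and a large payload and compare the loss.
import re
import subprocess

def hops(dest):
    out = subprocess.run(["traceroute", "-q", "1", "-n", dest],
                         capture_output=True, text=True).stdout
    return re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", out, re.MULTILINE)

def loss(addr, size, count=4):
    out = subprocess.run(["ping", "-c", str(count), "-s", str(size), addr],
                         capture_output=True, text=True).stdout
    m = re.search(r"(\d+)% packet loss", out)
    return int(m.group(1)) if m else 100

for addr in hops("lindy.stanford.edu"):
    print(addr, "100B loss:", loss(addr, 100), "%", " 1473B loss:", loss(addr, 1473), "%")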
We later realized that some of the routers would pass ping packets > 1472 bytes but would not respond to them themselves, which reduced the effectiveness of this method for locating the onset of the failure to pass packets > 1472 bytes (i.e. packets that require IP fragmentation).
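The sizes involved work out as follows (a quick illustrative check, assuming IPv4 with no options, the 8 byte ICMP echo header and the 1500 byte Ethernet MTU; note that the Unix ping -s argument counts only the ICMP data bytes, whereas the Cisco ping size is the whole IP datagram):

# Why 1472 data bytes pass but 1473 do not: with a 20 byte IP header and an
# 8 byte ICMP header, 1472 data bytes exactly fill a 1500 byte datagram,
# while 1473 data bytes give a 1501 byte datagram that must be fragmented.
MTU = 1500                 # Ethernet MTU on the campus links
IP_HDR, ICMP_HDR = 20, 8

def needs_fragmentation(ping_data_bytes):
    return ping_data_bytes + ICMP_HDR + IP_HDR > MTU

print(needs_fragmentation(1472))   # False: 1500 byte datagram, fits
print(needs_fragmentation(1473))   # True:  1501 byte datagram, must be fragmented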
PingER also shows low packet loss for the month of May measured between oceanus.slac.stanford.edu and argus.stanford.edu. This is shown below where each column contains the median percentage packet loss for that day of the month.
Monitoring-Site  Remote-Site        1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
oceanus.slac     argus.stanford  0.10 0.52 0.49 0.52 1.46 0.42 0.10 0.62 1.17 0.31 1.04 0.94 0.10 0.21 0.42

Using the Cisco type ping from RTR-DMZ (a Cisco 7507) to lindy.stanford.edu gave losses of between 0.1 and 0.5% for packet sizes of 100, 1472 and 1500 Bytes, but using 1700 Byte packets gave over 99% loss (further study showed the threshold was 1500 bytes; 1501 byte packets give rise to fragmentation and heavy loss). Repeating the 1700 byte packets from RTR-DMZ to RTR-SUNET47, CORE-GATEWAY and I2-GATEWAY gave 1 loss in 100 pings in each case. However, pinging with 100 or 1700 Byte packets from RTR-DMZ to CORE3-GATEWAY or CORE4-GATEWAY gave a 0% success rate with an error message of ICMP Destination Unreachable. We also pinged from RTR-DMZ inwards towards the SLAC network with 1700 byte pings and got <= 1% loss to RTR-CGB6 (a Cisco 7513), RTR-CORE1 (a Cisco 8500), RTR-CORE2 (a Cisco 8500) and login01 (a Sun Solaris 2.6 host). This suggests the problem is not in the ATM cloud but rather in the Stanford network beyond the I2-GATEWAY.Stanford.EDU router.
Using 100 Cisco type pings of 1500 Bytes from RTR-DMZ to the Stanford campus routers showed no packet loss (100/100) to sunet47, core1-gateway and i2-gateway, but Host Unreachable (U.U.) to core-gateway, core3-gateway and core4-gateway.
We used the Stanford (from NIKHEF) ping on elaine10.stanford.edu to ping rtr-dmz.slac.stanford.edu with the preload option, which sends a specified number of packets back to back at startup before listening for a response. With this we got about 73 small (<= 100 Byte) packets through before loss ensued; for larger packets (1000 Bytes) we got about 50 packets through before loss ensued. Preload pings of 100 and 1000 Bytes from elaine10 to the first router (leland-gateway.stanford.edu) also successfully got 73 packets through before loss ensued. We repeated these tests from another machine at Stanford (elaine1.stanford.edu) with similar results. We believe this result is an artifact of the buffer space available in the Cisco router.
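For illustration, a similar preload test can be sketched with the Linux iputils ping, which has a comparable -l preload option (this is an assumed stand-in for the NIKHEF ping used above; preloads larger than 3 need root):

# Sketch of the preload test: blast "preload" echo requests back to back,
# then see how many replies came back before loss set in.
import re
import subprocess

def preload_test(host, preload, size):
    out = subprocess.run(
        ["ping", "-c", str(preload), "-l", str(preload), "-s", str(size), host],
        capture_output=True, text=True).stdout
    m = re.search(r"(\d+) received", out)
    return int(m.group(1)) if m else 0

print(preload_test("rtr-dmz.slac.stanford.edu", 100, 100))    # small packets
print(preload_test("rtr-dmz.slac.stanford.edu", 100, 1000))   # larger packets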
Big echo requests sent from elaine26.stanford.edu to core-gateway.stanford.edu, sunet47-gateway.stanford.edu or rtr-dmz.slac.stanford.edu all fail. However, big echo requests sent from elaine26 to i2-gateway.stanford.edu succeed.
Further study at the SLAC end showed that the losses for 1473 byte packets occurred when the pings were initiated from pharlap.slac.stanford.edu or w.x.y.z (chosen since we have sudo privileges to run tcpdump on it) on NETHUB, or from flora01.slac.stanford.edu (w.x.y.z), to rtr-cgb6.slac.stanford.edu (interfaces w.x.y.z, w.x.y.z, w.x.y.z, w.x.y.z or w.x.y.z), but not between pharlap and rtr-core1.slac.stanford.edu. Looking at the diagram of the SLAC Switched Network, this removes the FDDI ring as the cause of the problem. Further, we turned on debugging in rtr-cgb6 and it indicated that it was replying to the ICMP echo requests. The route from pharlap to rtr-cgb6 passes through swh-core1 to rtr-core1, out through interface w.x.y.z, and back through swh-core1 to interface w.x.y.z of rtr-cgb6 (rtr-core1 is acting as a one-armed router to swh-core1). Pings from pharlap with 1473 bytes work as far as rtr-core1 interface w.x.y.z but not to rtr-cgb6 interface w.x.y.z. Using tcpdump on pharlap we verified that the echo responses were not being received from rtr-cgb6.
On the other hand, if the big packet (>> 1500 Byte) pings were initiated from rtr-cgb6, then there was low loss to either pharlap or flora01.
Even more strange was that the effect does not appear when we launch the big echo request packets to rtr-cgb6 from an AIX 4.2 machine (vesta01 - w.x.y.z), a Linux machine (noric03.slac.stanford.edu - w.x.y.z) or a Windows NT machine (atreides.slac.stanford.edu - w.x.y.z), but does occur when they are launched from a Solaris 2.6 (SunOS 5.6) machine (flora01.slac.stanford.edu - w.x.y.z or pharlap.slac.stanford.edu - w.x.y.z). Repeating these big echo requests to lindy.stanford.edu from vesta01, noric03 and atreides also results in high loss. The default behavior of Solaris is to set the don't fragment bit.
When sending big echo requests from elaine26.stanford.edu (171.64.15.101) to atreides, only the first echo request fragment is seen by NetXray running on atreides. Similarly, when sending big echo requests from elaine26.stanford.edu to pharlap, only the first of the two echo request fragments is seen by tcpdump running on pharlap.
When sending big echo requests from pharlap to rtr-cgb6, no responses are seen by tcpdump running on pharlap.
When sending big echo requests from elaine26 to pharlap and looking on the FDDI router ring with a Netscout sniffer, both the echo request fragments were seen as well as both the echo reply fragments. See the packet trace. Note that the first request fragment has the don't fragment bit set.
Looking at the ATM interface (ATM5/0) on RTR-DMZ showed:

ATM5/0 is up, line protocol is up
  Hardware is cxBus ATM
  MTU 4470 bytes, sub MTU 4470, BW 156250 Kbit, DLY 80 usec, rely 255/255, load 1/255
  Encapsulation ATM, loopback not set, keepalive set (10 sec)
  Encapsulation(s): AAL5, PVC mode
  256 TX buffers, 256 RX buffers, 2048 maximum active VCs, 1024 VCs per VP, 1 current VCCs
  VC idle disconnect time: 300 seconds
  Last input never, output 00:00:00, output hang never
  Last clearing of "show interface" counters 1w2d
  Queueing strategy: fifo
  Output queue 0/40, 0 drops; input queue 0/75, 572 drops
  5 minute input rate 233000 bits/sec, 60 packets/sec
  5 minute output rate 73000 bits/sec, 64 packets/sec
     37439574 packets input, 2747274064 bytes, 0 no buffer
     Received 0 broadcasts, 0 runts, 0 giants
     0 input errors, 0 CRC, 0 frame, 0 overrun, 2300788 ignored, 0 abort
     70905621 packets output, 1067094969 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 output buffer failures, 0 output buffers swapped out

The 2300788 ignored packets (packets that were ignored on the receive interface due to a lack of internal buffers) are of interest. We compared the increase in ignored with the increase in packets input while doing the FTP GETs from Stanford to SLAC and saw error rates of 5 to 25%.
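That 5 to 25% figure is just the growth in the ignored counter divided by the growth in the packets input counter between two readings of the interface. A minimal sketch of the calculation (the before/after snapshots of the show interface text are assumed inputs, not shown here):

# Estimate the fraction of received packets the interface ignored during a
# transfer, from two captures of the "show interface" output text.
import re

def counters(show_interface_text):
    packets_in = int(re.search(r"(\d+) packets input", show_interface_text).group(1))
    ignored = int(re.search(r"(\d+) ignored", show_interface_text).group(1))
    return packets_in, ignored

def ignored_percent(before_text, after_text):
    in0, ig0 = counters(before_text)
    in1, ig1 = counters(after_text)
    return 100.0 * (ig1 - ig0) / max(in1 - in0, 1)

For example, if packets input grew by 10000 while ignored grew by 1200 during an FTP GET, the interface ignored 12% of the traffic it received.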
Looking at ATM-DMZ, similar errors were not noticeable:
ATM1/0/0 is up, line protocol is up
  Hardware is oc3suni
  Description: --> RTR-DMZ
  MTU 4470 bytes, sub MTU 4470, BW 155520 Kbit, DLY 0 usec, rely 255/255, load 1/255
  Encapsulation ATM, loopback not set, keepalive not supported
  Last input 00:00:00, output 00:00:00, output hang never
  Last clearing of "show interface" counters 10w1d
  Queueing strategy: fifo
  Output queue 0/40, 0 drops; input queue 0/75, 0 drops
  5 minute input rate 136000 bits/sec, 329 packets/sec
  5 minute output rate 303000 bits/sec, 733 packets/sec
     4013949168 packets input, 2285908400 bytes, 0 no buffer
     Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
     7 input errors, 7 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
     3622564490 packets output, 3017356946 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 output buffer failures, 0 output buffers swapped out

We disconnected the fiber and ran a test on it and it looked good. We exchanged the TX and RX fibers between RTR-DMZ and ATM-DMZ; this had no effect. Since we had a spare, we also changed the ATM interface card on ATM-DMZ at about 3:00pm 5/19/00, in case a poor driver at the ATM-DMZ end was causing the problems at RTR-DMZ; this also had no effect.
Finally, at about 5:20pm PDT 5/19/00, we did a clear interface atm 5/0, which clears the hardware logic on an interface; the FTP GET rate improved to 2.3 MBytes/sec (yes, MBytes!) and the ignored count stopped increasing.
After clearing the interface it was also interesting to note that at least some of the earlier ping fragmentation problems cleared; for example, compare the ping below with the similar one above:
2cottrell@flora01:~>ping -s lindy.stanford.edu 1473 5
PING lindy.stanford.edu: 1473 data bytes
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=0. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=1. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=2. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=3. time=6. ms
1481 bytes from Lindy.Stanford.EDU (171.64.11.11): icmp_seq=4. time=5. ms
----lindy.stanford.edu PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 5/5/6
A pathchar measurement from flora01 to lindy.stanford.edu is shown below:

30cottrell@flora01:~>sudo /afs/slac/g/scs/bin/pathchar lindy.stanford.edu
Password:
pathchar to lindy.stanford.edu (171.64.11.11)
 mtu limitted to 1500 bytes at login.SLAC.Stanford.EDU (w.x.y.z)
 doing 32 probes at each of 64 to 1500 by 44
 0 login.SLAC.Stanford.EDU (w.x.y.z)
 |    29 Mb/s,   204 us (816 us)
 1 router (w.x.y.z)
 |    51 Mb/s,   184 us (1.42 ms)
 2 router (w.x.y.z)
 |   115 Mb/s,   58 us (1.64 ms)
 3 RTR-DMZ.SLAC.Stanford.EDU (134.79.111.4)
 |    52 Mb/s,   79 us (2.03 ms)
 4 RTR-SUNET47.SLAC.Stanford.EDU (192.68.191.34)
 |  1238 Mb/s,   15 us (2.07 ms)
 5 Core-gateway.Stanford.EDU (171.64.1.115)
 |    92 Mb/s,   -34 us (2.13 ms)
 6 i2-gateway.Stanford.EDU (171.64.1.209)
 |    46 Mb/s,   132 us (2.65 ms)
 7 Core4-gateway.Stanford.EDU (171.64.1.226)
 |    48 Mb/s,   112 us (3.13 ms)
 8 Core1-gateway.Stanford.EDU (171.64.3.19)
 9 * 1 548 535 3
10 * 1 504 533 4
11 * 1 460 530 3
12 * 1 460 533 3
13 * 1 196 529 3
14: 14 460 312 3
Running sting to lindy.stanford.edu showed no loss in either direction:

% sudo sting lindy.stanford.edu
Password:
Connection setup took 2 ms
src = w.x.y.z:11005 (2919847614)
dst = 171.64.11.11:80 (4208648018)
dataSent = 100, dataReceived = 100
acksSent = 100, acksReceived = 100
Forward drop rate = 0.000000
Reverse drop rate = 0.000000
316 packets received by filter
0 packets dropped by kernel

One can also look at the Surveyor reports for the one way delays and losses. The graphs indicate there is very low delay in both directions and low loss in both directions, though slightly higher in the Stanford to SLAC direction.
It is beginning to look like some of the fragmentation effects might be associated with a particular operating system, since (so far) they do not happen when the ping request is sent from a Linux, AIX or Windows NT machine at SLAC to rtr-cgb6, but do occur when the request is sent from a Solaris 2.6 host.
One also needs to be aware that a router may pass a fragmented packet but may not respond to it. Another concern is that some of the routers don't correctly generate fragmented pings even though they will reply to them.
The effect is seen on both the SLAC and Stanford sites separately. It does not appear to be associated with the ATM fabric or the FDDI rings (both places where fragmentation may occur).
It smelt like a big packet being sent with the don't fragment bit set. The correct response from a switch or router receiving an oversized packet that it can't fragment is to send back an ICMP "Destination Unreachable" message. This feature is actually used for maximum transfer unit (MTU) size discovery (see Path MTU Discovery and Filtering ICMP). The NetXray results indicate that echo request packets sent from elaine (a Solaris system) have the don't fragment bit set, as seen at the end host atreides, and the Netscout sniffer on the SLAC FDDI router ring showed it was also set there.
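For illustration, the DF check that was done above with NetXray, Netscout and tcpdump can be sketched in a few lines of Python using scapy (an assumed stand-in for those tools; it needs root privileges to sniff):

# Report whether arriving ICMP packets have the IP "don't fragment" (DF) bit
# set and whether they are fragments (non-zero offset or "more fragments" set).
from scapy.all import sniff, IP

def report(pkt):
    ip = pkt[IP]
    print(f"{ip.src} -> {ip.dst}  DF={'set' if ip.flags.DF else 'clear'}  "
          f"MF={'set' if ip.flags.MF else 'clear'}  frag_offset={ip.frag}")

sniff(filter="icmp", prn=report, store=False)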
Further possibilities to test included: looking at the trace of an FTP flow to understand more about the TCP dynamics; sitting on the Stanford SUNET47-GATEWAY router and looking in both directions; reviewing the rtr-cgb5.slac.stanford.edu configuration, and possibly the swh-core1.slac.stanford.edu configuration, looking for inconsistencies; turning on debug ip packet detail on rtr-core1, rtr-core2 and rtr-cgb6 with an ACL that selects destination pharlap, then pinging with big packets from pharlap and looking at the debug trace (it may be necessary to turn off the mroute cache or something similar), doing this first with small packets (to make sure the packets can be seen) and then with large ones to see where the packets disappear, and if they stop in a switch, telling Cisco and rebooting the switch; and trying to swap the TX and RX fibers along the route to see if the direction of the poor file transfer changes.
We never did get to the bottom of the fragmentation problems and left them for others to solve, or until they affect us in some real way. The solution to the poor performance of FTP data flows from Stanford to SLAC was to do a clear interface atm 5/0, which clears the hardware logic on an interface; the FTP GET rate improved to 2.3 MBytes/sec (yes, MBytes!) and the ignored count stopped increasing. So the problem was the interface card getting confused. We will follow up with Cisco to see what needs to happen next.
The apparent fragmentation problem led us off in the wrong direction, and even once it was suspected we spent valuable time pinning it down. It is a pity some routers (even from a highly regarded single vendor such as Cisco) do not treat oversize pings uniformly. It is even more disturbing that at least one router had a failure mode, unrelated to packet fragmentation, that resulted in ignored packets (for lack of buffers) and such poor performance. The fact that a router interface can fail in such a manner appears to be very poor design or implementation, and certainly not something one would accept from a system that is competing with phone PBXs that advertise 99.999% reliability. It will be interesting to see what Cisco has to say about this. I suspect that they will suggest upgrading to the latest stable release of the IOS etc. (rtr-dmz is at release 11.1(16), which is quite old); that of course will require an outage of the complete router (rtr-dmz), i.e. a major scheduled outage, and will put us further away from any 99.999% goal.
It would also be useful to get definitive information concerning under what conditions (configuration, model, code release) Cisco routers respond to ICMP echo requests that require fragmentation. Even when pinging with 1473 byte echo request packets from an AIX 4.2 host (vesta01.slac.stanford.edu), which does not set the don't fragment bit, to core3-gateway.stanford.edu or core5-gateway.stanford.edu, there is still 100% packet loss.
Maybe part of the lesson is that even UPS, redundant power supplies, redundant paths etc. do not make up for poor implementations, and one needs to be able to reliably make code level updates without requiring an outage of even a second or more (if one is to contemplate using the data network for internet telephony).
Created: May 16, 2000
Comments to iepm@slac.stanford.edu