Problems with performance between SLAC and INFN/Rome March '02
Les Cottrell
Page created: March 21 '02, last update March 21, 2002.
PING datamove33.SLAC.Stanford.EDU (134.79.125.253) from 141.108.23.100 : 56(84) bytes of data.
--- datamove33.SLAC.Stanford.EDU ping statistics ---
1000 packets transmitted, 934 packets received, 6% packet loss
round-trip min/avg/max/mdev = 164.045/167.031/230.046/3.106 ms

traceroute to datamove33.SLAC.Stanford.EDU (134.79.125.253), 30 hops max, 38 byte packets
 1  gw23 (141.108.23.254)  2.492 ms  1.739 ms  1.278 ms
 2  193.206.131.13 (193.206.131.13)  1.977 ms  2.345 ms  1.587 ms
 3  193.206.134.165 (193.206.134.165)  3.005 ms  3.443 ms  5.770 ms
 4  193.206.134.17 (193.206.134.17)  12.006 ms  11.663 ms  12.046 ms
 5  193.206.134.206 (193.206.134.206)  11.620 ms  11.497 ms  11.178 ms
 6  garr.it1.it.geant.net (62.40.103.89)  11.139 ms  11.107 ms  13.646 ms
 7  it.de2.de.geant.net (62.40.96.61)  21.624 ms  21.262 ms  24.809 ms
 8  62.40.103.254 (62.40.103.254)  103.601 ms  105.535 ms  103.168 ms
 9  clev-nycm.abilene.ucaid.edu (198.32.8.29)  119.222 ms  117.544 ms  114.208 ms
10  ipls-clev.abilene.ucaid.edu (198.32.8.25)  122.297 ms  124.763 ms  120.409 ms
11  kscy-ipls.abilene.ucaid.edu (198.32.8.5)  130.461 ms  *  130.286 ms
12  dnvr-kscy.abilene.ucaid.edu (198.32.8.13)  141.464 ms  139.223 ms  141.918 ms
13  snva-dnvr.abilene.ucaid.edu (198.32.8.1)  164.660 ms  168.570 ms  *
14  198.32.249.161 (198.32.249.161)  167.355 ms  165.196 ms  167.212 ms
15  STAN--SUNV.POS.calren2.net (198.32.249.74)  165.871 ms  167.715 ms  166.413 ms
16  i2-gateway.Stanford.EDU (171.64.1.214)  173.036 ms  173.239 ms  166.065 ms
17  * * *
18  * * *
19  DATAMOVE33.SLAC.Stanford.EDU (134.79.125.253)  166.621 ms  172.125 ms  167.544 ms
4cottrell@pharlap:~>ping -s www.roma1.infn.it
PING www.roma1.infn.it: 56 data bytes
64 bytes from www1.roma1.infn.it (141.108.26.1): icmp_seq=0. time=166. ms
----www.roma1.infn.it PING Statistics----
148 packets transmitted, 148 packets received, 0% packet loss
round-trip (ms) min/avg/max = 164/165/171

That is closer to what I would expect. I wonder if there is a problem with bbpcfarm00.roma1.infn.it: for example, is it heavily overloaded, or is there a mismatch between its Ethernet duplex or speed setting and that of the switch port it is connected to? I don't think the problem is at datamove33, since from pharlap.slac.stanford.edu I see the problem to bbpcfarm00 but not to www.roma1.infn.it.
Pingroute (see below) also indicates that the losses start at the end host (bbpcfarm00):
7cottrell@pharlap:~>fpingroute.pl -c 1000 -i 3 bbpcfarm00.roma1.infn.it
Thu Mar 21 10:00:23 2002 Architecture=SUN5, commands=traceroute -q 1 and bbpcfarm00.roma1.infn.it
fpingroute.pl version=0.2, 11/29/01. Author cottrell@slac.stanford.edu, debug=1
using traceroute to get nodes in route from pharlap (134.79.240.26) to bbpcfarm00.roma1.infn.it starting at node 3
traceroute: Warning: ckecksums disabled
traceroute to bbpcfarm00.roma1.infn.it (141.108.23.100), 30 hops max, 40 byte packets
fpingroute.pl version 0.2, 11/29/01 found 19 hops in route from pharlap to bbpcfarm00.roma1.infn.it
 3  I2-GATEWAY.Stanford.EDU (192.68.191.83)  0.339 ms
 4  STAN.POS.calren2.NET (171.64.1.213)  0.391 ms
 5  SUNV--STAN.POS.calren2.net (198.32.249.73)  0.878 ms
 6  Abilene--QSV.POS.calren2.net (198.32.249.162)  1.116 ms
 7  dnvr-snva.abilene.ucaid.edu (198.32.8.2)  25.479 ms
 8  kscy-dnvr.abilene.ucaid.edu (198.32.8.14)  36.114 ms
 9  ipls-kscy.abilene.ucaid.edu (198.32.8.6)  45.428 ms
10  clev-ipls.abilene.ucaid.edu (198.32.8.26)  51.581 ms
11  nycm-clev.abilene.ucaid.edu (198.32.8.30)  64.054 ms
12  62.40.103.253 (62.40.103.253)  145.159 ms
13  de.it1.it.geant.net (62.40.96.62)  154.258 ms
14  garr-gw.it1.it.geant.net (62.40.103.90)  154.201 ms
15  rt-rtg.mi.garr.net (193.206.134.205)  154.374 ms
16  rm-mi.garr.net (193.206.134.18)  162.515 ms
17  rc-rt-1.rm.garr.net (193.206.134.162)  164.973 ms
18  infnrmI-rc.rm.garr.net (193.206.131.14)  294.040 ms
19  bbpcfarm00.roma1.infn.it (141.108.23.100)  285.698 ms
Wrote 17 addresses to /tmp/fpingaddr, now ping each address 1000 times from pharlap starting at hop 3 ...
pings/node=1000                           100 byte packets                1400 byte packets
NODE                                       %loss     min     max     avg   %loss     min     max     avg  from pharlap
 3 I2-GATEWAY.Stanford.EDU                  0.0%     0.3    17.9     0.5    0.0%     0.5    62.3     0.7  Thu Mar 21 10:00:25 PST 2002
 4 STAN.POS.calren2.NET           AS32      0.0%     0.3    21.2     0.6    0.0%     0.7    48.0     0.9  Thu Mar 21 10:00:25 PST 2002
 5 SUNV--STAN.POS.calren2.net     AS11423   0.0%     0.7    28.9     1.6    0.0%     1.1    72.4     1.3  Thu Mar 21 10:00:25 PST 2002
 6 Abilene--QSV.POS.calren2.net   AS11423   0.0%     0.8   237.3     2.0    0.0%     1.2    27.4     1.4  Thu Mar 21 10:00:25 PST 2002
 7 dnvr-snva.abilene.ucaid.edu              0.0%    25.4   197.3    26.7    0.0%    25.9   141.0    26.3  Thu Mar 21 10:00:25 PST 2002
 8 kscy-dnvr.abilene.ucaid.edu              0.0%    36.0   157.4    37.2    0.0%    36.5   120.0    36.9  Thu Mar 21 10:00:25 PST 2002
 9 ipls-kscy.abilene.ucaid.edu              0.0%    45.2   107.4    46.4    0.0%    45.7   125.9    46.1  Thu Mar 21 10:00:25 PST 2002
10 clev-ipls.abilene.ucaid.edu              0.0%    51.3    84.2    52.5    0.0%    52.0   110.1    52.3  Thu Mar 21 10:00:25 PST 2002
11 nycm-clev.abilene.ucaid.edu              0.0%    63.6   152.2    64.8    0.0%    64.2   126.0    64.5  Thu Mar 21 10:00:25 PST 2002
12 62.40.103.253                  AS20965   0.0%   144.6   237.5   147.4    0.0%   146.2   207.2   147.8  Thu Mar 21 10:00:25 PST 2002
13 de.it1.it.geant.net            AS20965   0.0%   153.9   205.6   156.2    0.0%   155.4   251.3   157.3  Thu Mar 21 10:00:25 PST 2002
14 garr-gw.it1.it.geant.net       AS20965   0.0%   153.5   206.1   155.0    0.0%   154.5   278.9   154.9  Thu Mar 21 10:00:25 PST 2002
15 rt-rtg.mi.garr.net             AS137     0.0%   153.9   198.5   155.1    0.0%   154.6   195.8   154.9  Thu Mar 21 10:00:25 PST 2002
16 rm-mi.garr.net                 AS137     0.0%   161.9   443.6   165.2    0.0%   163.2   381.1   164.8  Thu Mar 21 10:00:25 PST 2002
17 rc-rt-1.rm.garr.net            AS137     0.0%   162.9   221.8   164.6    0.0%   164.3   225.3   164.9  Thu Mar 21 10:00:25 PST 2002
18 infnrmI-rc.rm.garr.net         AS137     0.0%   165.1   316.4   168.7    0.0%   167.3   251.1   168.5  Thu Mar 21 10:00:25 PST 2002
19 bbpcfarm00.roma1.infn.it       AS137     3.4%   164.1   323.2   167.7    0.8%   167.2   214.9   168.0  Thu Mar 21 10:00:25 PST 2002
Thu Mar 21 10:34:23 2002 fpingroute.pl done.
>ssh cutter.roma1.infn.it
Warning: Permanently added 'cutter.roma1.infn.it,141.108.23.10' (RSA1) to the list of known hosts.
cottrell@cutter.roma1.infn.it's password:
Last login: Fri Mar 22 19:07:47 2002 from pharlap.slac.st
Sun Microsystems Inc. SunOS 5.6 Generic August 1997
PING bbpcfarm00.roma1.infn.it: 56 data bytes
64 bytes from bbpcfarm00.roma1.infn.it (141.108.23.100): icmp_seq=0. time=0. ms
----bbpcfarm00.roma1.infn.it PING Statistics----
102 packets transmitted, 102 packets received, 0% packet loss
round-trip (ms) min/avg/max = 0/0/0
At the same time I ran ping from pharlap.slac.stanford.edu to bbpcfarm00.roma1.infn.it and saw 7% loss in 100 packets.
2cottrell@pharlap:~>ping -s bbpcfarm00.roma1.infn.it
PING bbpcfarm00.roma1.infn.it: 56 data bytes
64 bytes from bbpcfarm00.roma1.infn.it (141.108.23.100): icmp_seq=0. time=171. ms
----bbpcfarm00.roma1.infn.it PING Statistics----
105 packets transmitted, 97 packets received, 7% packet loss
round-trip (ms) min/avg/max = 164/166/203

From pharlap.slac.stanford.edu to www.roma1.infn.it I got:
1cottrell@pharlap:~>ping -s www.roma1.infn.it
PING www.roma1.infn.it: 56 data bytes
----www.roma1.infn.it PING Statistics----
103 packets transmitted, 102 packets received, 0% packet loss
round-trip (ms) min/avg/max = 164/185/1131

From cutter.roma1.infn.it to pharlap.slac.stanford.edu the loss in 100 packets was 4%.
> ping -s pharlap.slac.stanford.edu
PING PHARLAP.slac.stanford.edu: 56 data bytes
----PHARLAP.slac.stanford.edu PING Statistics----
102 packets transmitted, 97 packets received, 4% packet loss
round-trip (ms) min/avg/max = 164/174/317

Then we measured pings from cutter.roma1.infn.it to www.roma1.infn.it and saw 9% loss:
> ping -s www.roma1.infn.it
PING www.roma1.infn.it: 56 data bytes
----www.roma1.infn.it PING Statistics----
111 packets transmitted, 101 packets received, 9% packet loss
round-trip (ms) min/avg/max = 0/0/0

We concluded that there is a problem at Rome between www.roma1.infn.it and both cutter.roma1.infn.it and bbpcfarm00.roma1.infn.it. A pingroute from pharlap.slac.stanford.edu to bbpcfarm00.roma1.infn.it also indicates that most of the loss occurs on the last hop:
4cottrell@pharlap:~>fpingroute.pl -c 1000 bbpcfarm00.roma1.infn.it
Mon Mar 25 11:11:58 2002 Architecture=SUN5, commands=traceroute -q 1 and bbpcfarm00.roma1.infn.it
fpingroute.pl version=0.2, 11/29/01. Author cottrell@slac.stanford.edu, debug=1
using traceroute to get nodes in route from pharlap (134.79.240.26) to bbpcfarm00.roma1.infn.it starting at node 1
traceroute: Warning: ckecksums disabled
traceroute to bbpcfarm00.roma1.infn.it (141.108.23.100), 30 hops max, 40 byte packets
fpingroute.pl version 0.2, 11/29/01 found 19 hops in route from pharlap to bbpcfarm00.roma1.infn.it
 1  RTR-GSR-TEST.SLAC.Stanford.EDU (134.79.243.1)  0.377 ms
 2  RTR-DMZ1-GER.SLAC.Stanford.EDU (134.79.135.15)  0.379 ms
 3  I2-GATEWAY.Stanford.EDU (192.68.191.83)  0.325 ms
 4  STAN.POS.calren2.NET (171.64.1.213)  0.401 ms
 5  SUNV--STAN.POS.calren2.net (198.32.249.73)  0.746 ms
 6  Abilene--QSV.POS.calren2.net (198.32.249.162)  0.968 ms
 7  dnvr-snva.abilene.ucaid.edu (198.32.8.2)  25.990 ms
 8  kscy-dnvr.abilene.ucaid.edu (198.32.8.14)  36.281 ms
 9  ipls-kscy.abilene.ucaid.edu (198.32.8.6)  45.433 ms
10  clev-ipls.abilene.ucaid.edu (198.32.8.26)  51.607 ms
11  nycm-clev.abilene.ucaid.edu (198.32.8.30)  63.686 ms
12  62.40.103.253 (62.40.103.253)  144.947 ms
13  de.it1.it.geant.net (62.40.96.62)  154.044 ms
14  garr-gw.it1.it.geant.net (62.40.103.90)  154.102 ms
15  rt-rtg.mi.garr.net (193.206.134.205)  154.139 ms
16  rm-mi.garr.net (193.206.134.18)  162.725 ms
17  rc-rt-1.rm.garr.net (193.206.134.162)  163.895 ms
18  infnrmI-rc.rm.garr.net (193.206.131.14)  170.017 ms
19  bbpcfarm00.roma1.infn.it (141.108.23.100)  166.840 ms
Wrote 19 addresses to /tmp/fpingaddr, now ping each address 1000 times from pharlap starting at hop 1 ...
pings/node=1000                           100 byte packets                1400 byte packets
NODE                                       %loss     min     max     avg   %loss     min     max     avg  from pharlap
 1 RTR-GSR-TEST.SLAC.Stanford.EDU AS3671    0.4%     0.2    50.8     0.6    0.6%     0.3   215.4     1.2  Mon Mar 25 11:12:01 PST 2002
 2 RTR-DMZ1-GER.SLAC.Stanford.EDU AS3671    0.3%     0.3  1280.0     3.5    0.7%     0.7   208.7     2.1  Mon Mar 25 11:12:01 PST 2002
 3 I2-GATEWAY.Stanford.EDU                  0.5%     0.3  1236.3     3.3    1.0%     0.5  1363.1     4.0  Mon Mar 25 11:12:01 PST 2002
 4 STAN.POS.calren2.NET           AS32      0.7%     0.3  1246.3     3.8    1.0%     0.7  1128.5     2.8  Mon Mar 25 11:12:01 PST 2002
 5 SUNV--STAN.POS.calren2.net     AS11423   1.0%     0.7  1206.4     4.2    1.2%     1.1  1099.4     3.3  Mon Mar 25 11:12:01 PST 2002
 6 Abilene--QSV.POS.calren2.net   AS11423   1.2%     0.8  1166.5     2.9    1.0%     1.2  1059.5     3.4  Mon Mar 25 11:12:01 PST 2002
 7 dnvr-snva.abilene.ucaid.edu              0.7%    25.4   261.7    26.9    0.5%    25.9  1046.4    29.0  Mon Mar 25 11:12:01 PST 2002
 8 kscy-dnvr.abilene.ucaid.edu              0.2%    36.0   232.4    37.1    1.1%    36.5  1052.0    39.0  Mon Mar 25 11:12:01 PST 2002
 9 ipls-kscy.abilene.ucaid.edu              0.1%    45.2   201.8    46.2    0.7%    45.7  1021.2    49.2  Mon Mar 25 11:12:01 PST 2002
10 clev-ipls.abilene.ucaid.edu              0.5%    51.4   171.6    52.3    0.9%    52.0   997.6    54.8  Mon Mar 25 11:12:01 PST 2002
11 nycm-clev.abilene.ucaid.edu              0.5%    63.6   149.2    64.5    1.0%    64.2   970.0    67.1  Mon Mar 25 11:12:01 PST 2002
12 62.40.103.253                  AS20965   0.2%   144.9   207.6   147.1    0.5%   146.2  1012.0   149.7  Mon Mar 25 11:12:01 PST 2002
13 de.it1.it.geant.net            AS20965   0.2%   154.0   220.4   155.8    0.8%   155.4   776.0   157.6  Mon Mar 25 11:12:01 PST 2002
14 garr-gw.it1.it.geant.net       AS20965   0.2%   153.8   240.6   154.6    0.8%   154.5   200.4   154.9  Mon Mar 25 11:12:01 PST 2002
15 rt-rtg.mi.garr.net             AS137     0.7%   153.8   322.8   154.9    0.9%   154.6  1228.5   156.3  Mon Mar 25 11:12:01 PST 2002
16 rm-mi.garr.net                 AS137     0.3%   162.2   327.7   163.9    0.5%   163.2  1196.7   165.9  Mon Mar 25 11:12:01 PST 2002
17 rc-rt-1.rm.garr.net            AS137     0.4%   163.0   228.7   164.6    0.9%   164.3  1137.9   166.5  Mon Mar 25 11:12:01 PST 2002
18 infnrmI-rc.rm.garr.net         AS137     0.3%   165.0   348.2   174.2    0.7%   167.3  1083.8   180.4  Mon Mar 25 11:12:01 PST 2002
19 bbpcfarm00.roma1.infn.it       AS137    11.7%   164.0   315.9   172.2    5.0%   167.2  1041.1   180.6  Mon Mar 25 11:12:01 PST 2002
Mon Mar 25 12:33:45 2002 fpingroute.pl done.
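As a rough illustration of how the per-hop figures above can be read, the following Python sketch scans the 100 byte %loss column (values copied from the Mar 25 run) and reports the first hop whose loss exceeds a threshold. The 2% threshold and the function name first_lossy_hop are assumptions made for this example; they are not part of fpingroute.pl.

```python
# %loss for 100 byte packets, hops 1..19, copied from the Mar 25 fpingroute.pl run.
LOSS_100B = [0.4, 0.3, 0.5, 0.7, 1.0, 1.2, 0.7, 0.2, 0.1, 0.5,
             0.5, 0.2, 0.2, 0.2, 0.7, 0.3, 0.4, 0.3, 11.7]


def first_lossy_hop(loss_percent, threshold=2.0, first_hop=1):
    """Return the number of the first hop whose loss exceeds threshold, or None.

    threshold is a hypothetical cut-off chosen for illustration only.
    """
    for offset, loss in enumerate(loss_percent):
        if loss > threshold:
            return first_hop + offset
    return None


if __name__ == "__main__":
    print(first_lossy_hop(LOSS_100B))
```

With these values the only hop above the threshold is hop 19, the end host, which is consistent with the statement that most of the loss occurs on the last hop.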
2cottrell@pharlap:~>ssh cutter.roma1.infn.it
cottrell@cutter.roma1.infn.it's password:
Last login: Mon Mar 25 20:07:10 2002 from antonia.slac.st
Sun Microsystems Inc. SunOS 5.6 Generic August 1997
> ping -s www.roma1.infn.it
PING www.roma1.infn.it: 56 data bytes
64 bytes from www2.roma1.infn.it (141.108.26.11): icmp_seq=0. time=1. ms
64 bytes from www2.roma1.infn.it (141.108.26.11): icmp_seq=43. time=0. ms
^C
----www.roma1.infn.it PING Statistics----
44 packets transmitted, 42 packets received, 4% packet loss
round-trip (ms) min/avg/max = 0/0/3

Traceroutes from cutter to www and from cutter to bbpcfarm00 indicate that there is a router in the path in the first case; maybe it is overloaded or misconfigured. The router does not respond to pings from cutter, so I cannot test whether the loss is in that path:
> traceroute www.roma1.infn.it
traceroute: Warning: ckecksums disabled
traceroute: Warning: www.roma1.infn.it has multiple addresses; using 141.108.26.1
traceroute to www.roma1.infn.it (141.108.26.1), 30 hops max, 40 byte packets
 1  gw23.roma1.infn.it (141.108.23.254)  2.002 ms  1.762 ms  1.696 ms
 2  * www1.roma1.infn.it (141.108.26.1)  1.528 ms  1.334 ms
> traceroute bbpcfarm00.roma1.infn.it
traceroute: Warning: ckecksums disabled
traceroute to bbpcfarm00.roma1.infn.it (141.108.23.100), 30 hops max, 40 byte packets
 1  bbpcfarm00.roma1.infn.it (141.108.23.100)  2.834 ms  1.213 ms  1.335 ms
> ping -s gw23.roma1.infn.it
PING gw23.roma1.infn.it: 56 data bytes
^C
----gw23.roma1.infn.it PING Statistics----
45 packets transmitted, 0 packets received, 100% packet loss

The packet loss distribution from pharlap.slac.stanford.edu to bbpcfarm00.roma1.infn.it for 1000 pings separated by 1 second with a 2 second timeout appears to be random, see below. CLP = Conditional Loss Probability, i.e. the probability that if one packet is lost the next packet will also be lost. PLG is the packet loss gap, and the number in parentheses is 1/(1-CLP), the theoretical value if the losses were random. For more on this see for example http://www.ils.unc.edu/dempsey/186s00/bolot.pdf
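To make the definitions above concrete, here is a minimal Python sketch that estimates CLP and the mean loss-burst length from a sequence of ping outcomes. It is not the myping.pl code: the function names, the input representation (True = reply received) and the exact estimators are assumptions made for illustration, and myping.pl may compute PLG differently.

```python
def loss_stats(received):
    """Estimate loss fraction, CLP and mean loss-burst length.

    received: non-empty list of booleans, True if the reply to that ping arrived.
    """
    lost = [not r for r in received]
    n_lost = sum(lost)
    # Pairs of consecutive pings that were both lost.
    both_lost = sum(1 for a, b in zip(lost, lost[1:]) if a and b)
    # Lost pings that have a successor (the final ping has none).
    lost_with_next = sum(lost[:-1])
    clp = both_lost / lost_with_next if lost_with_next else 0.0
    # A burst starts at every lost ping not preceded by another lost ping.
    bursts = sum(1 for i, is_lost in enumerate(lost)
                 if is_lost and (i == 0 or not lost[i - 1]))
    mean_burst = n_lost / bursts if bursts else 0.0
    return n_lost / len(received), clp, mean_burst


def expected_gap(clp):
    """1/(1-CLP), the quantity quoted in parentheses after PLG."""
    return 1.0 / (1.0 - clp)


if __name__ == "__main__":
    # Toy sequence (made up): 10 pings with losses at positions 2, 3 and 7.
    seq = [True, True, False, False, True, True, True, False, True, True]
    print(loss_stats(seq))
    print(round(expected_gap(0.0241), 3))
```

For a CLP of 2.41%, expected_gap gives about 1.025, which matches the figure in parentheses after PLG in the myping.pl output.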
Fri Mar 29 08:23:54 2002 myping.pl began Fri Mar 29 08:00:10 2002 on pharlap(134.79.240.26), to bbpcfarm00.roma1.infn.it
min/avg/max 164/171.6/392 ms. xmt=1000 rcv=917 lost=83(8.300%), CLP= 2.410%, PLG=1.000(1.025)
Losses (total=83 of 1000, i.e. 8.300%) as function of time, bins=10, bin width=100:
   0s-  100s-  200s-  300s-  400s-  500s-  600s-  700s-  800s-  900s-
    9      5      7      6      8      7     10      8      9     14
Successful ping run distribution:
1 2 3 4 5 6 7 8 9 10 12 13 14 15 17 19 21 22 23 28 29 30 32 35 36 43 46 50
3 10 5 10 2 1 11 5 4 3 3 3 3 1 2 1 1 1 21 1 2 1 2 1 2 1 1

In summary: The problem appears to be visible between cutter and www on the Rome site. The losses appear to be random, i.e. not bursty: when measured one second apart with a 2 second timeout, the conditional loss probability (the probability that if a packet is lost the next packet will also be lost) is about 2.4%.

I am afraid I do not have any knowledge of the networking at your site at Rome, so it will need your network experts there to get involved. I would look at the utilization of the router gw23.roma1.infn.it to see if there is some congestion that may be causing the loss. Also look at the router/switch MIB error rates, reported via SNMP or by logging onto the console, to see what kind of errors are occurring (e.g. you should not see any collisions on a full duplex link). I would check the duplex/speed settings of the router/switch/host ports in the path to make sure they are correct. You will also need to get a topology map including the switches in the path. You might also make sure the physical connections are OK: get someone to make noise or dB loss measurements on the cables.
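One way to check the statement that the losses are spread randomly in time is a chi-square comparison of the ten 100-second bin counts from the myping.pl output against a uniform expectation. The sketch below is an illustration added here and is not part of the original analysis; the function name is hypothetical, and the 5% critical value for 9 degrees of freedom (16.92) is taken from standard chi-square tables.

```python
# Losses per 100 s bin, copied from the myping.pl output (total 83).
BIN_LOSSES = [9, 5, 7, 6, 8, 7, 10, 8, 9, 14]

# Chi-square critical value for 9 degrees of freedom at the 5% level.
CHI2_CRIT_9DOF_5PCT = 16.92


def chi_square_uniform(counts):
    """Chi-square statistic of counts against equal expected counts per bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


if __name__ == "__main__":
    stat = chi_square_uniform(BIN_LOSSES)
    print(f"chi-square = {stat:.2f}")
    if stat < CHI2_CRIT_9DOF_5PCT:
        print("bin counts are consistent with a uniform loss rate")
    else:
        print("bin counts deviate from a uniform loss rate")
```

For these counts the statistic is about 6.8, well below 16.92, so this test gives no evidence that the loss rate varied over the 1000 seconds. It says nothing about burstiness on the scale of consecutive packets, which is what the CLP figure addresses.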
-----Original Message-----
From: Cristina Bulfon [mailto:Cristina.Bulfon@roma1.infn.it]
Sent: Friday, March 29, 2002 6:16 AM
To: les.cottrell@SLAC.Stanford.EDU
Cc: giuseppe.della-ricca@ts.infn.it
Subject: network problem

Ciao Les,

since last Tues we didn't transfer any single file to Slac, we tried both of bbftp, new bbcp and also with scp. Every time we got connection timed out with bbftp, hangs with bbcp, so we thought that is not software problem but due to a network problem.

We are not experts but if you tell us what we can do and join our efforce for finding a solutions.

thanks
cristina
Hi Les,

the names and addresses of the MC Farm nodes are the same as before: it's only the physical location that has changed, the farm is still in Rome, just few hundred meters from where it was ... AFAIK, before the move the farm there was an optical fibre connection between where the farm was and where it is now. Now this connection is gone. But for the details of the setup, either Cristina or Franz should answer your questions: I'm not an expert, and I'm not in Rome.

After the move, we haven't had major network problems. Just two or three glitches that lasted about a day or so, and never related to the local system.

regards,
Giuseppe.

On Fri, 9 Aug 2002, Cottrell, Les wrote:

> Hi Cristina,
>
> What is the new name/address/site for the MC Farm, is it at Padova or
> just somewhere else at Rome? Are there no problems with networking now?
>
> -----Original Message-----
> From: Bulfon, Cristina
> Sent: Wednesday, May 29, 2002 8:11 AM
> To: Cottrell, Les
> Cc: giuseppe.della-ricca@ts.infn.it
> Subject: RE: bbfarm04
>
>
> Ciao Les,
>
> I have forwarded your e-mail to our network experts and sorry if I
> didn't let you know about what we have done here.
>
> - first of all some of my collegues have measured the affidability of the
> optical fiber between INFN Rome and Caspur and also they have cleaned it.
>
> - after a switch reboot at Caspur we have had some performance
> improvements, no data lost with ping ...
>
> All of these improved the situation just for few days, we still have some
> "BBFTP connection timeout" when transferring to Slac but they don't
> influence too bad our MC production (since the amount of data we need to
> transfer has been reduced by a new scheme of data export)
> and we got 40Mbps for importing data from Slac.
>
> As soon as we finished our air-conditioning work at INFN Rome we are going
> to move MC farm from Caspur to here so in this case you will measure
> the same throughtput as for www.roma1.infn.it
>
> thanks
> cristina