Loss of connectivity from BINP/Novosibirsk to SLAC
Les Cottrell. Page created: April 18, 2005
Our physicists noted that for some time we have had no connectivity to SLAC. Our packets go out via our main Internet connection, and I guess SLAC returns its replies via the KEK channel. Due to this traffic asymmetry it gets blocked on the external firewall.
Our router shows in "sh ip rou" that the SLAC route is absent among the routes we get from KEK, while other ESnet routes are still there, like www.es.net, www.bnl.gov, www.fnal.gov.
mx:belov {102} ping www.slac.stanford.edu
PING www8.slac.stanford.edu (134.79.18.163): 56 data bytes
64 bytes from 134.79.18.163: icmp_seq=0 ttl=243 time=300.086 ms
64 bytes from 134.79.18.163: icmp_seq=1 ttl=243 time=297.625 ms
64 bytes from 134.79.18.163: icmp_seq=2 ttl=243 time=297.588 ms
--- www8.slac.stanford.edu ping statistics ---
4 packets transmitted, 3 packets received, 25% packet loss
round-trip min/avg/max/std-dev = 297.588/298.433/300.086/1.168 ms
Thus we see that ICMP packets are getting through, though the RTT is not the usual 200 ms.
mx:belov {103} traceroute !$
traceroute www.slac.stanford.edu
traceroute to www8.slac.stanford.edu (134.79.18.163), 64 hops max, 40 byte packets
 1  rtc-gw (193.124.167.5)  0.652 ms  0.761 ms  0.557 ms
 2  NSC-FO-c3550-INP.nsc.ru (212.192.189.53)  0.957 ms  1.19 ms  1.17 ms
 3  s3550-12a-unknown-nsc.sbras.ru (217.79.60.6)  6.607 ms  1.55 ms  0.983 ms
 4  s3750-48a-ge.sbras.ru (217.79.60.18)  0.874 ms  1.64 ms  0.746 ms
 5  r7206-ge.sbras.ru (217.79.60.1)  2.498 ms  2.1 ms  2.43 ms
 6  217.79.60.45 (217.79.60.45)  3.271 ms  2.204 ms  4.837 ms
 7  SM-TCMS5-RBNet-2.RBNet.ru (195.209.14.73)  34.949 ms  33.875 ms  34.694 ms
 8  MSK-M9-RBNet-7.RBNet.ru (195.209.14.25)  50.952 ms  50.707 ms  50.800 ms
 9  AMS-RBNet-1.RBNet.ru (195.209.14.182)  91.85 ms  90.67 ms  89.725 ms
10  Chicago-RBNet-1.rbnet.ru (195.209.14.249)  194.388 ms  193.805 ms  194.564 ms
11  chi-gev156-naukanet.es.net (198.125.140.169)  249.276 ms  248.924 ms  249.639 ms
12  chicr1-ge0-chirt1.es.net (134.55.209.189)  249.488 ms  248.933 ms  264.646 ms
This trace shows that our probes are indeed going via RBNet, creating the asymmetry mentioned above.
binp-gw>sh ip ro www.slac.stanford.edu
% Network not in table
This proves that the network really is not in the table of our router, while others nevertheless are.
binp-gw>sh ip ro www.es.net
Translating "www.es.net"...domain server (194.226.160.66) [OK]
Routing entry for 198.128.0.0/14, supernet
  Known via "bgp 5402", distance 20, metric 0
  Tag 2505, type external
  Last update from 192.153.114.137 4d07h ago
  Routing Descriptor Blocks:
  * 192.153.114.137, from 192.153.114.137, 4d07h ago
      Route metric is 0, traffic share count is 1
      AS Hops 4

binp-gw>sh ip ro www.fnal.gov
Translating "www.fnal.gov"...domain server (194.226.160.66) [OK]
Routing entry for 131.225.0.0/16
  Known via "bgp 5402", distance 20, metric 0
  Tag 2505, type external
  Last update from 192.153.114.137 4d07h ago
  Routing Descriptor Blocks:
  * 192.153.114.137, from 192.153.114.137, 4d07h ago
      Route metric is 0, traffic share count is 1
      AS Hops 4

binp-gw>sh ip bgp sum
BGP router identifier 193.124.134.129, local AS number 5402
BGP table version is 28506, main routing table version 28506
156 network entries and 156 paths using 20748 bytes of memory
44 BGP path attribute entries using 2640 bytes of memory
26 BGP AS-PATH entries using 656 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
14 BGP filter-list cache entries using 168 bytes of memory
BGP activity 12682/12526 prefixes, 14219/14063 paths, scan interval 60 secs

Neighbor         V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.153.114.137  4  2505  153172  150541    28506    0    0 4d07h           131
194.67.81.141    4  2683       0       0        0    0    0 never    Active
194.67.223.185   4  8756       0       0        0    0    0 never    Active
194.67.223.189   4  8756       0       0        0    0    0 never    Active
194.226.160.17   4  5387       0       0        0    0    0 never    Active
212.192.189.53   4  5387  159002  153597    28506    0    0 1w6d             24

binp-gw>ping www.kek.jp
Translating "www.kek.jp"...domain server (194.226.160.66) [OK]
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 130.87.104.99, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 104/116/128 ms
Hope you'll look into the problem. Thank you, Serge Belov
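The "% Network not in table" result is ordinary longest-prefix-match behaviour: with 134.79.0.0/16 missing from the BGP feed, no prefix in the router's table covers the SLAC address, while www.es.net and www.fnal.gov still match their prefixes. A minimal sketch of such a lookup, using Python's `ipaddress` module with prefixes taken from the outputs above:

```python
import ipaddress

def lookup(routing_table, dst):
    """Longest-prefix match, as a router's forwarding lookup does;
    returns the best matching prefix, or None ("Network not in table")."""
    dst = ipaddress.ip_address(dst)
    matches = [p for p in routing_table if dst in p]
    return max(matches, key=lambda p: p.prefixlen, default=None)

# Prefixes from the outputs above; 134.79.0.0/16 is the one that
# went missing from the BGP feed at BINP.
table = [ipaddress.ip_network(p) for p in
         ("198.128.0.0/14", "131.225.0.0/16")]  # ESnet, FNAL

print(lookup(table, "134.79.18.163"))   # None: SLAC route absent
table.append(ipaddress.ip_network("134.79.0.0/16"))
print(lookup(table, "134.79.18.163"))   # 134.79.0.0/16
```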
The host is pingable from SLAC:
cottrell@noric03:~>ping rainbow.inp.nsk.su
PING rainbow.inp.nsk.su (193.124.167.29) 56(84) bytes of data.
64 bytes from rainbow.inp.nsk.su (193.124.167.29): icmp_seq=0 ttl=238 time=298 ms
64 bytes from rainbow.inp.nsk.su (193.124.167.29): icmp_seq=1 ttl=238 time=299 ms
...
64 bytes from rainbow.inp.nsk.su (193.124.167.29): icmp_seq=1288 ttl=238 time=297 ms
--- rainbow.inp.nsk.su ping statistics ---
1290 packets transmitted, 1289 received, 0% packet loss, time 1301545ms
rtt min/avg/max/mdev = 296.975/299.671/417.648/8.781 ms, pipe 2
The host is not reachable from SLAC by traceroute:
cottrell@noric06:~>traceroute rainbow.inp.nsk.su
traceroute to rainbow.inp.nsk.su (193.124.167.29), 30 hops max, 38 byte packets
 1  rtrg-farm0 (134.79.87.1)  0.243 ms  0.162 ms  0.157 ms
 2  rtr-dmz1-ger (134.79.135.15)  0.224 ms  0.197 ms  0.199 ms
 3  slac-rt4.es.net (192.68.191.146)  0.291 ms  0.255 ms  0.249 ms
 4  snv-pos-slac.es.net (134.55.209.1)  0.625 ms  0.651 ms  0.594 ms
 5  chicr1-oc192-snvcr1.es.net (134.55.209.54)  48.748 ms  48.728 ms  48.672 ms
 6  aoacr1-oc192-chicr1.es.net (134.55.209.58)  68.709 ms  68.691 ms  68.741 ms
 7  aoapr1-ge0-aoacr1.es.net (134.55.209.110)  68.787 ms  68.787 ms  68.752 ms
 8  198.124.216.126 (198.124.216.126)  181.134 ms  181.039 ms  181.067 ms
 9  keksw2-ns.kek.jp (130.87.4.35)  181.052 ms  181.051 ms  181.036 ms
10  kekcis7.kek.jp (130.87.43.7)  186.412 ms  186.347 ms  186.556 ms
11  * * *
    ... (hops 12 through 29 likewise time out) ...
30  * * *
However, ICMP-based traceroutes get through OK:
cottrell@noric06:~>traceroute -I rainbow.inp.nsk.su
traceroute to rainbow.inp.nsk.su (193.124.167.29), 30 hops max, 38 byte packets
 1  rtrg-farm0 (134.79.87.1)  0.269 ms  0.174 ms  0.160 ms
 2  rtr-dmz1-ger (134.79.135.15)  0.226 ms  0.199 ms  0.196 ms
 3  slac-rt4.es.net (192.68.191.146)  0.293 ms  0.256 ms  0.253 ms
 4  snv-pos-slac.es.net (134.55.209.1)  0.637 ms  0.598 ms  0.666 ms
 5  chicr1-oc192-snvcr1.es.net (134.55.209.54)  48.678 ms  49.454 ms  48.719 ms
 6  aoacr1-oc192-chicr1.es.net (134.55.209.58)  68.734 ms  68.700 ms  68.781 ms
 7  aoapr1-ge0-aoacr1.es.net (134.55.209.110)  68.828 ms  68.783 ms  68.745 ms
 8  198.124.216.126 (198.124.216.126)  181.043 ms  181.144 ms  181.218 ms
 9  keksw2-ns.kek.jp (130.87.4.35)  181.115 ms  181.053 ms  181.122 ms
10  kekcis7.kek.jp (130.87.43.7)  186.574 ms  186.347 ms  186.415 ms
11  * * *
12  rainbow.inp.nsk.su (193.124.167.29)  297.008 ms  297.259 ms  299.242 ms
Looking at the PingER data there is no obvious change in losses or RTT.
Looking at the traceroutes measured by IEPM-BW, it appears we lost traceroute connectivity between 4:33am and 4:43am on April 15, 2005.
It appears our colleagues in Novosibirsk are not seeing a particular route ESnet announces to KEK at our New York peering. The route (134.79.0.0/16) is to our customer SLAC. It does appear that we are announcing the route to KEK:

joeb@aoa-pr1-re0> show route advertising-protocol bgp 198.124.216.126
...
* 134.79.0.0/16          Self                 3671 3671 I

Could you please verify that this route is being propagated? Thank you so much for your attention to this.

Yamagata sent email at 6:53pm:
Indeed, we are receiving it.

#sh ip bgp nei 198.124.216.125 rece | incl 134.79.0.0
*  134.79.0.0       198.124.216.125           0      293 3671 3671 i

But our upper network SINET (AS2907) announces a different one.

#sh ip bgp | incl 134.79.0.0
* i134.79.0.0       130.87.4.14        10    100      0 2907 2153 32 3671 i

        +-------SLAC-----+
        |                |
        |                |
      ESnet           AS2153
        |                |
        |(C)             |
        |          (B)   |
aoa-pr1.es.net--------SINET---KEK---BINP
        |                  |
        +---(IP tunnel)----+
               (A)

Previously, (A) was used from SLAC to BINP and (B) was used from BINP to SLAC. At this time, (C) is used from KEK to SLAC and (B) is used from SLAC to KEK. If necessary, we can announce (C) to BINP; then (A) will be used from SLAC to BINP and (C) will be used from BINP to SLAC.

Should we announce (C) to BINP? Or, if this asymmetric route is too harmful, we will change the tunnel endpoint at our side and send packets from BINP to ESnet into (A).
Hello Yamagata-San,
Thank you for your diagram and explanation, very helpful.
Unfortunately, I do not know how BINP implemented their
firewall for blocking routes, so I can't say which of your
options will work. Perhaps Serge could comment on what
he would like to see from KEK.
Email from Serge Belov 9:36pm:
Hello Joe, Yamagata-San,
I'm afraid our US colleagues will not receive this message immediately, so I'll duplicate it from another account; sorry for dups.
Thank you for the investigation. As I wrote yesterday's messages under pressure, I omitted a detailed explanation of things like local firewalling, though I have already had a lot of trouble with it.
We have the following configuration:
             +--------+
BINP---FWa---| Ext GW |----->Link to KEK
             +---+----+
                 |
                 +--->Our Default Link-----FWb-->to Global Internet
                            SB RAS (AS5387)

Note there are two FWs in place: the local one (FWa), operated within our network, and the other (FWb), protecting the upper network, AS5387, belonging to the SB RAS.
This second one operates in stateful mode, so it creates and remembers state for each TCP and UDP conversation established through it. Thus no TCP/UDP exchange is possible when there is asymmetry at the Ext GW. The firewall does not specifically filter any routing information and is almost invisible to most of us, except in cases like the current one, when connectivity is broken entirely.
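Serge's point about the stateful FWb can be illustrated with a toy connection-tracking table (a sketch only, not BINP's actual firewall): replies pass only for conversations whose outbound leg the firewall itself has seen.

```python
# Minimal model of a stateful firewall: outbound packets create flow
# state; inbound packets pass only if they match remembered state.
class StatefulFirewall:
    def __init__(self):
        self.state = set()  # remembered (src, dst, sport, dport) flows

    def outbound(self, src, dst, sport, dport):
        self.state.add((src, dst, sport, dport))
        return True  # outbound packets create state and pass

    def inbound(self, src, dst, sport, dport):
        # A reply passes only if the mirrored flow is in the table.
        return (dst, src, dport, sport) in self.state

fw = StatefulFirewall()

# Symmetric path: request and reply both traverse the firewall.
fw.outbound("binp", "slac", 40000, 80)
print(fw.inbound("slac", "binp", 80, 40000))   # True

# Asymmetric path: the request left via the other link (KEK tunnel),
# so the firewall never built state, and the reply is dropped.
print(fw.inbound("slac", "binp", 80, 50000))   # False
```

This is exactly why the asymmetry is harmless on an unfiltered path but fatal on a path that crosses FWb in one direction only.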
There are several additional features/holes in the filtering policy on FWb. For example, a few of our internal machines (rainbow.inp.nsk.su, 193.124.167.29, in particular) are excluded from the filtering ruleset, so they are at least pingable from outside, while most other machines of the SB RAS and our LAN are tightly filtered there (I guess).
Last fall I discovered almost complete blocking of a few ESnet networks due to just such a setup and wrong routing on their side. For example, the ORNL network preferred to use its default route to contact us instead of the route it should have received from KEK. We still used our specific route to ORNL received from KEK, and thus all communication with ORNL and one or two other networks (out of the whole set KEK announces to us) was blocked. I tried to explain this problem to ORNL experts but failed, gave up, and just installed at the Ext GW a specific route for ORNL pointing to our default, thus restoring the symmetry they had broken.
It seems the total traffic between ORNL and BINP is negligible, so that solution was acceptable, but fixing the current problem in the same way would, I think, be wrong, as SLAC is one of our important partners, taking care of the KEK-BINP link all the time, and it should not be prevented from using this channel.
Understandably, some networks on the ESnet side might introduce similar problems locally, exactly like ORNL, and those problems could be solved, though in such an inelegant way. But I don't understand why the routing would be changed in such a specific way on an intermediate, transit system. Why was the routing changed just for SLAC and not for others? Or will we discover in a few days, as we discovered in the ORNL case, that some other networks are also inaccessible in just the same way as SLAC?
Nevertheless, thanks to all of you for your efforts in the investigation; I hope the problem will be fixed today.
Serge.
Currently the tunnel router and the serial router are different. If you prefer, we can change the tunnel endpoint to the router that is connected to BINP via the serial line. That would make all routes announced from ESnet be announced to BINP, and the route symmetric.
This "symmetric" means packets from or to BINP will be sent into the tunnel directly, and it may make performance worse, because the serial router can't use an MTU larger than 1500. The current tunnel router uses MTU 1550 because of the encapsulation overhead.
I am thinking of two plans.
Email from Yamagata, Wed 20 April 12:28pm:
If 2) is chosen,
please use 130.87.43.7 as the address of endpoint at KEK.
Could you please keep the old tunnel between 134.55.200.31 and 130.87.4.2 until BGP works via the new tunnel? I will wait for the new address assignment for the new tunnel.
Email from Joe Burrescia Wed 20 April 2:57pm:
OK, I've set our end up. Our tunnel endpoint stays the same, 134.55.200.31. I've created a /30 for the tunnel; our end is 198.124.216.169, your end is 198.124.216.170. I've set up our end of the BGP session in passive mode, so once you configure your end it should just come up. (Our BGP source address is 198.124.216.169.)
I've left the original tunnel up, per your request, until the new tunnel is operational.
Please let me know if there is anything else we can do.
Email from Yamagata Wed April 20, 3:22pm:
Thanks, I have just brought up the tunnel and the BGP connection.
The SLAC route is announced to BINP:
kekcis7#sh ip bgp nei 192.153.114.138 advertised-routes | include 3671
*> 134.79.0.0 198.124.216.169 0 293 3671 3671 i
*> 198.51.111.0 198.124.216.169 0 293 3671 3671 i
How about the current status from BINP?
Email from Yukio Karita Wed 20 April 7:10pm:
Les, Joe, and Serge,
One thing you should be aware of is that the 1500-byte MTU limit is not for the user packets but for the encapsulated tunnel packets. The MTU limit for the user packets will be less than 1500 bytes.
This means that the usual 1500-byte user packets will be fragmented at the ingress router for the tunnel, which will make performance worse. Even worse, if some applications use Path MTU Discovery and ICMP packets are blocked at some firewalls, some packets cannot be transferred, with no error message.
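Yukio's point can be checked with back-of-envelope arithmetic. The thread does not say what kind of tunnel is in use, so the overhead figures below (plain IP-in-IP and GRE) are illustrative assumptions, not the actual configuration:

```python
# The serial router's link MTU applies to the *encapsulated* packet,
# so the largest unfragmented user packet is smaller by the overhead.
LINK_MTU = 1500  # what the serial router can carry

for name, overhead in (("IP-in-IP", 20), ("GRE", 24)):  # assumed values
    user_mtu = LINK_MTU - overhead
    print(f"{name}: largest unfragmented user packet = {user_mtu} bytes")

# A standard 1500-byte user packet therefore no longer fits: it must
# be fragmented, or dropped if its DF bit is set.
user_packet = 1500
assert user_packet + 20 > LINK_MTU
```

With the old tunnel router's 1550-byte MTU, a 1500-byte user packet plus overhead still fit, which is why the problem only appears on the 1500-byte serial path.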
Best regards,
Yukio
Email from Serge Belov 5:40am 4/26/05:
Thank you for your efforts; it seems everything is working now.
This episode was not as routine as I wanted it to be, as I spent
the whole of last week in Moscow and had to investigate and warn
everyone remotely, with no access to my mail, tools, etc.
Just now it seems everything is working almost OK, but a few issues
remain to be solved.
Maybe this is due to the reduced MTU and the clearing of the DF bit:
we observe rather strange behavior of TCP sessions. Interactive ones
work, but bulk-transfer sessions stall. I saw a lot of packets with
the DF bit set, and I guess that clearing it may be harmful; I need
to investigate further.
Current trace from BINP is given below.
More details tomorrow, sorry.
S.
mx:belov {142} traceroute www.slac.stanford.edu
traceroute to www8.slac.stanford.edu (134.79.18.163), 64 hops max, 40 byte packets
1 rtc-gw (193.124.167.5) 0.615 ms 0.572 ms 0.552 ms
2 192.153.114.137 (192.153.114.137) 103.863 ms 103.633 ms 103.879 ms
3 aoa-t1-kek.es.net (198.124.216.169) 282.556 ms 282.588 ms 282.530 ms
4 aoacr1-ge0-aoapr1.es.net (134.55.209.109) 282.604 ms 282.558 ms 282.607 ms
5 chicr1-oc192-aoacr1.es.net (134.55.209.57) 302.350 ms 302.646 ms 302.270 ms
6 snvcr1-oc192-chicr1.es.net (134.55.209.53) 350.339 ms 350.381 ms 350.557 ms
7 slac-pos-snv.es.net (134.55.209.2) 350.655 ms 350.938 ms 350.863 ms
8 rtr-dmz1-vlan400.slac.stanford.edu (192.68.191.149) 350.969 ms 350.836 ms 350.897 ms
9 * * *
10 *^C
At 05:47 AM 4/26/2005, yamagata@nwgpc.kek.jp wrote:
>In message <20050425223421.991D824FFD@beagle.es.net>,
>Joe Burrescia wrote
>> Thank you, I now see the 193.124.160.0/21 route via the new
>>peering.
>> Do you believe it is possible to increase the mtu of the tunnel to
>> match our end at 1550 bytes?
>No, our serial router can't speak MTU larger than 1500 and the actual
>MTU via tunnel decreases.
>That is why I asked about two plans.
>I want to hear about the current status from SLAC and BINP people.
>Is it out of use?
Email from Yamagata, 6:28am 4/26/05:
This morning I wrote the previous mail and found it wasn't delivered
to BINP. I waited a few hours but it still remained queued. I felt
the DF-clear configuration might be bad, and removed it from the
router. Now I have restored the DF-clear configuration. If this
mail is also delayed, I will remove it again.
Email from Yamagata, 6:32am 4/26/05:
At this time, it seems to work and I keep it.
The DF-clear configuration is active.
Email from Joe Metzger, 8:27am, 4/26/05:
I don't fully understand your reasons for clearing the DF bit.
I thought that the 'right' answer is to drop 'large' packets with the
DF bit set and generate a ICMP Fragmentation Needed message back to the
source so that Path MTU discovery works as specified in RFC 1191.
Or, are you assuming that path MTU doesn't work due to ICMP filtering,
broken TCP/IP stacks, or other reasons so we just have to go ahead and
fragment any large packets we see to make it work?
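The two behaviours Joe contrasts can be sketched as a single decision at the tunnel ingress. This is a simplified model of the RFC 1191 logic versus the DF-clear workaround, not the actual router code:

```python
# What a router does with a packet larger than the next-hop MTU,
# depending on the DF bit and on whether DF-clear is configured.
def forward(size, df_set, mtu, clear_df=False):
    """Return the fate of the packet at the tunnel ingress."""
    if size <= mtu:
        return "forwarded"
    if df_set and not clear_df:
        # RFC 1191 path: drop the packet and tell the sender the MTU,
        # so its Path MTU Discovery can shrink subsequent packets.
        return "dropped + ICMP Fragmentation Needed"
    # DF clear (or cleared by the router): fragment and forward.
    return "fragmented"

print(forward(1400, True, 1476))                 # forwarded
print(forward(1500, True, 1476))                 # dropped + ICMP ...
print(forward(1500, True, 1476, clear_df=True))  # fragmented
```

The RFC 1191 path only works end-to-end if the ICMP message actually reaches the sender; DF-clear trades that dependency for the fragmentation cost Yukio described.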
Email from Yamagata 3:15pm, 4/26/05:
Thanks for the comment.
> I thought that the 'right' answer is to drop 'large' packets with the
> DF bit set and generate a ICMP Fragmentation Needed message back to
> the source so that Path MTU discovery works as specified in RFC 1191.
OK. I have disabled the DF-clear configuration just now, and will wait for comment from BINP.
Jerrod Williams of SLAC sent email 2:30pm 4/26/05:
We are having trouble running Iepm-ABWE to your 'rainbow' machine.
This problem began yesterday, April 25, sometime after 2:47PM PST.
I am suddenly seeing "Connection Refused" when I try to run Iepm-ABWE
tests (abing) to your machine on port 8176. Telnets to this port return:
jerrodw@nettest2 $ telnet rainbow.inp.nsk.su 8176
Trying 193.124.167.29...
telnet: connect to address 193.124.167.29: Connection refused
Can you verify that this port is open to access from our machines
at SLAC (nettest2.slac.stanford.edu [134.79.240.12] and
iepm-resp.slac.stanford.edu [134.79.240.36])? We cannot run our
full tests to your site with the Iepm-ABWE tests failing. Can you
please investigate?
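The distinction Jerrod's telnet test relies on — "Connection refused" versus a silent firewall drop — can be probed with a few lines of Python. This is a sketch; the host and port in the commented example are just the ones from the mail:

```python
import socket

def probe(host, port, timeout=3.0):
    """Classify a TCP port: open, refused (host up, no listener),
    or filtered (no answer at all, e.g. dropped by a firewall)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "closed (refused - host answered, no listener)"
    except socket.timeout:
        return "filtered (no answer - possibly firewalled)"
    finally:
        s.close()

# Example (result depends on the remote machine's state):
# print(probe("rainbow.inp.nsk.su", 8176))
```

A "refused" result means the host replied with a TCP RST, i.e. it is reachable but nothing is listening — which matches Serge's later reply that the abing daemon had simply died, rather than the port being firewalled.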
Email from Les Cottrell 5:06 pm 4/26/05:
This may be a spin-off of the new route through KEK, the 1500-byte limitation,
and the setting of the DF (don't fragment) bit;
see http://www.slac.stanford.edu/grp/scs/net/case/binp-apr05/
ABwE uses a UDP payload of 1450 Bytes. Adding on 8 Bytes of UDP header
plus 20 Bytes of IP header gets us up to 1478 Bytes which is indeed what
Netflow in the router reports.
Yamagata indicates that he has changed the routing recently to use a
route that only allows a 1500Byte MTU. It used to be 1550 Bytes.
Yukio Karita also points out that the MTU limit for user packets on
this new route is < 1500Bytes.
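Les's arithmetic above can be spelled out. The tunnel overhead used below is a hypothetical GRE value (24 bytes), since the thread does not state the actual encapsulation:

```python
# On-the-wire size of an ABwE probe, versus a hypothetical user-packet
# MTU for the new 1500-byte tunnel.
udp_payload = 1450
packet = udp_payload + 8 + 20      # + UDP header + IP header
print(packet)                      # 1478, matching Netflow's report

tunnel_user_mtu = 1500 - 24        # assumed GRE overhead; illustrative
print(packet <= tunnel_user_mtu)   # False: the probe would not fit
```

So even the reduced 1450-byte payload could still exceed the tunnel's user-packet MTU, if the overhead is in this range; this is consistent with Yukio's warning that the user MTU on the new route is below 1500 bytes.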
It is easy for us to reduce the size of the UDP payload. We have reduced
it to 1450 bytes, but we are still seeing problems. Thus it may have nothing
to do with the new route/MTU. On the other hand, it may be a port-blocking
problem. ABwE uses port 8176, and as your mail indicates, we cannot connect
to this port via telnet (which I would expect to use small packets). So
this sounds more like a port-blocking effect than an MTU problem.
Email from Yamagata 4/26/05 7:20pm:
I have stopped announcing BINP routes into the new tunnel, and packets from
BINP to ESnet are sent to SINET (AS2907), the same as in the former condition,
even though that may violate the AUP of AS2153.
The newer tunnel is used only for feeding ESnet routes to BINP.
Email from Yukio Karita, 4/26/05 10:11pm:
Les, Serge, Joe, and all,
You may already know this, but I'd like to explain the problem.
Best regards,
Yukio
Email from Serge 4/27/05 1:42am:
I try to keep rainbow.inp.nsk.su as open as possible, with almost no
firewalling at all (or indeed no filtering). So if you got a
"connection refused" diagnostic it should be treated as just that:
no one is listening here on that port. I don't know what program
should be doing so; if there was some daemon you started before,
it has certainly died and should be restarted. Tell me what program
to start and I'll do it, or I'll take a look at the diagnostics, if
there are any: why it died, when, etc. No specific actions were
taken on our side beforehand.
There is no program listening on that port, and refusing the connection is a valid response. When I simulate a listening program, everything works OK in both directions after the TCP connection is opened:
rainbow$ nc -l 8176
12345
54321
67890

> telnet rainbow.inp.nsk.su 8176
Trying 193.124.167.29...
Connected to rainbow.inp.nsk.su.
Escape character is '^]'.
12345
54321
67890
^]

Email from Jerrod Williams 4/27/05 3:11pm:
Email from Yamagata 4/27/05 2:08pm:
The current state is: DF is enabled, and the route is asymmetric.
Traffic from BINP to SLAC uses AS2907 and AS2153 link.
Traffic from SLAC to BINP uses old ESnet tunnel.
Email from Serge 4/27/05 4:00am:
Thank you for all the work fixing these routing issues.
As I explained last week, we are afraid of asymmetry near our end,
where traffic might go via the angry firewall and get blocked for
being unidirectional.
Asymmetry at a relatively far end is not a problem if there is no
strict filtering like the filtering we have here.
Some questions still might come to mind:

> - in case there are some other sites injured by this reform, are
>   the measures taken at KEK sufficient to restore our interactions
>   with them?

So if you still find some unreachable sites, please notify us.
Email from Cottrell 4/27/05 6:48pm:
Yukio, thanks for the explanation. It would seem to me that, given the
limitation on bandwidth from KEK to BINP of 512 kbits/s, if there are
no problems with MTU then BINP will not be able to tell the difference
between an OC48 (via CalREN) and an OC192 (via NY), unless one is
much more congested than the other. I suppose if we wanted to be
exhaustive we could make tests of the two routes (we actually
have measurements on the NY route from the last few months).
So my preference would be to try out the shorter RTT, watch it
for a couple of weeks, make some measurements, and all things
being roughly equal, stick with it (the shorter RTT via CalREN).
However, I bow to the major user, i.e. Serge, to make the final decision.