Problems with performance between SLAC and IN2P3, Jan '02

Les Cottrell. Page created: January 8, 2002; last update: March 4, 2002.


Problem description

There have been various problems reported with the IN2P3 link to SLAC since late November 2001. The one we wish to focus on at the moment is the throughput observed from datamove33.slac.stanford.edu to ccb2sn04.in2p3.fr. There are two paths between SLAC and CCIN2P3:

the CERN one: SLAC-ESNET-CERN-LYON, with a minimum bandwidth of 155 Mbps; this is our default path
the RENATER one (Renater is part of France Telecom): SLAC-ESNET-LYON with an ATM VP of 30Mbps

IN2P3 announces only the 194.5.57.0 network to ESNET on this link (for the moment the only machine concerned is ccb2sn04.in2p3.fr). So the traffic from SLAC to CCIN2P3 goes through the CERN link, except to ccb2sn04, where it uses the RENATER path; the traffic from CCIN2P3 to SLAC uses the CERN link, except from ccb2sn04 to DATAMOVE33.SLAC.Stanford.EDU, where it uses the RENATER VP (we have a static route for this). (Jerome Bernier, IN2P3, Nov 29 2001.)

Traceroute / Pipechar results

Traceroute shows the route and the ASes traversed. It confirms that the route at this time did not go via CERN. Pipechar was used to measure the bottleneck from hercules.slac.stanford.edu to ccb2sn04.in2p3.fr. Hercules is a 2*1131 MHz Linux 2.4 host with 2 GE interfaces. Pipechar indicates a limit of about 46Mbps between the ESnet router in Chicago (chi-s-snv.es.net) and the router at 192.70.69.14 [AS1717 - PHYNET-INTER], and a limit of about 30Mbits/s between the last router (Lyon-ANDA.in2p3.fr) and ccb2sn04. So at least from this point of view things look good.
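The check that a route avoids CERN can be automated. The following is a minimal sketch, not a tool used in this investigation: it parses traceroute hop lines and looks for a marker string in the hop names. The sample text is the February 22 '02 traceroute from datamove33 recorded later on this page; the helper names and the assumption that CERN routers would carry "cern" in their reverse DNS names are illustrative only.

```python
import re

# Matches hop lines of the form: " 5 chi-s-snv.es.net (134.55.205.102) 48.747 ms ..."
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)")

# Traceroute from datamove33 to ccb2sn04 on Feb 22 '02, as quoted on this page.
SAMPLE = """\
traceroute to CCB2SN04.IN2P3.FR (194.5.57.104), 30 hops max, 40 byte packets
 1 RTR-FARMCORE1A.SLAC.Stanford.EDU (134.79.127.7) 0.550 ms 0.408 ms 0.382 ms
 2 RTR-DMZ1-GER.SLAC.Stanford.EDU (134.79.135.15) 0.361 ms 0.305 ms 0.307 ms
 3 192.68.191.146 (192.68.191.146) 0.361 ms 0.351 ms 0.339 ms
 4 snv-pos-slac.es.net (134.55.209.1) 0.761 ms 0.710 ms 0.732 ms
 5 chi-s-snv.es.net (134.55.205.102) 48.747 ms 48.815 ms 48.791 ms
 6 192.70.69.14 (192.70.69.14) 149.914 ms 149.556 ms 167.480 ms
 7 Lyon-ANDA.in2p3.fr (134.158.224.1) 149.755 ms 155.322 ms 151.614 ms
 8 ccb2sn04.in2p3.fr (194.5.57.104) 150.993 ms * 151.876 ms
"""


def parse_hops(traceroute_text):
    """Return a list of (hop_number, name, address) tuples from traceroute output."""
    hops = []
    for line in traceroute_text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), match.group(3)))
    return hops


def goes_via(hops, marker):
    """True if any hop name contains the marker (case-insensitive)."""
    return any(marker.lower() in name.lower() for _, name, _ in hops)


hops = parse_hops(SAMPLE)
# Assumption: a CERN router would have "cern" somewhere in its hop name.
print("via CERN:", goes_via(hops, "cern"))
```

A name-based check like this can miss routers that have no reverse DNS entry, so comparing the AS numbers that traceroute reports would be a more robust variant.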

Fpingroute results

Fpingroute results from SLAC to IN2P3 for 1000 pings indicate there is less than 1% packet loss.

Throughputs

Various measures of bandwidth/throughputs are obtained from the IEPM-BW. These include iperf, bbcp memory to memory, bbcp disk to disk, bbftp, and pipechar. These are shown below for the SLAC - IN2P3 link that goes via CERN. The light blue bars show the ping success rate (usually 100% so there is a complete blue line), the black triangles are the minimum pipechar link bandwidths, and the green dots are the iperf throughputs for the optimum window and streams (256KB window and 12 streams).

We have now extended the IEPM-BW monitoring to also monitor the SLAC-ESNET-RENATER-IN2P3 link on a regular basis. We also used tcpload.pl to measure the throughput for various window sizes and streams (see Bulk throughput measurements for details of the measurement methodology) from pharlap.slac.stanford.edu (a Solaris 5.8 host with 4*336 MHz cpus) to ccb2sn04.in2p3.fr. The maxima (top 10% throughputs) are over 18.25Mbits/s, and the maximum achieved was 21.43 Mbits/s. The details can be seen in the plot below. We repeated the tcpload.pl measurement from hercules.slac.stanford.edu to ccb2sn04.in2p3.fr with very similar results.

The monitoring of the IN2P3 RENATER link from IN2P3 indicates that the utilization is currently about 20Mbits/s; however, the measurements of June and September showed the link could sustain close to 30 Mbits/s, see below (from Trafic IN2P3/ESnet).

Packet trace

Since none of the above appeared to identify the cause of the problem, I ran tcpdump and tcptrace on an iperf transfer from SLAC to IN2P3. I wrote a script to facilitate this (tcpload.pl). It started tcpdump on hercules and then started iperf for 10 seconds for 8 streams with a window size of 64KB. When iperf finished, tcpdump was terminated and tcptrace was run to analyze the dump file. The output from tcpload.pl shows in detail what happened. The tcptrace and tcpdump files were then made available by anonymous ftp from ftp.slac.stanford.edu, and email was sent to Joe Metzger of ESnet so he could analyze them. I am unsure if the "incomplete" warning will cause problems, or how to avoid it.
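The capture sequence described above can be sketched as follows. This is not the tcpload.pl script itself, only an illustration of the same steps in Python: start tcpdump, run iperf for 10 seconds with 8 streams and a 64KB window, stop tcpdump, then run tcptrace on the dump file. The interface name, dump file name and one-second settling delay are assumptions, and tcpdump normally needs root privileges.

```python
import subprocess
import time


def build_commands(dest, iface="eth0", dump_file="iperf.dump",
                   seconds=10, streams=8, window="64K"):
    """Build the three command lines; iface and dump_file are placeholder values."""
    tcpdump_cmd = ["tcpdump", "-i", iface, "-w", dump_file, "host", dest]
    iperf_cmd = ["iperf", "-c", dest, "-t", str(seconds),
                 "-P", str(streams), "-w", window]
    tcptrace_cmd = ["tcptrace", "-l", dump_file]  # -l: long per-connection output
    return tcpdump_cmd, iperf_cmd, tcptrace_cmd


def run_capture(dest, **kwargs):
    """Capture packets while iperf runs, then return the tcptrace report."""
    tcpdump_cmd, iperf_cmd, tcptrace_cmd = build_commands(dest, **kwargs)
    dump = subprocess.Popen(tcpdump_cmd)
    time.sleep(1)  # give tcpdump a moment to start capturing
    try:
        subprocess.run(iperf_cmd, check=True)
    finally:
        dump.terminate()  # stop the capture even if iperf fails
        dump.wait()
    result = subprocess.run(tcptrace_cmd, capture_output=True,
                            text=True, check=True)
    return result.stdout


# Example (requires tcpdump, iperf and tcptrace to be installed):
# print(run_capture("ccb2sn04.in2p3.fr"))
```

Keeping command construction separate from execution makes it easy to inspect or log the exact command lines before anything is run.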

Looking at the tcpdump output it is clear that the window sizes and MTU discovery are working OK.

Joe Metzger <JMetzger@lbl.gov> responded on Jan 9 '02:

Thanks for generating the trace files! We have been looking at this problem and it is not a straightforward issue of a PVC shaped at 20 Mbps (at least on the ESnet links). My current suspicion is that the flow exceeds a queue limit somewhere along the path that is generating packet loss and the drop in performance. Hopefully the trace files you are providing will help us to determine if this is the cause and give us a better idea of exactly what we should be looking for.

Correspondence

Email from Joe Metzger 1/10/02 to Les Cottrell:

Hello. Analyzing the trace from Les confirms the suspicion that something is policing this path at 20 Mbps. I think the next step is to try to identify a responsible party for each link in this path and have them identify where policing is occurring and at what rates.
I have filled in the ESnet details. Hopefully Jerome Bernier and/or Gilles Farrache can fill in some of the details between Chicago and the destination host.
I am confident that there are no policing issues on the links between SLAC and the ESnet router in Chicago because everything up to that point is shared with other flows that frequently exceed 20 Mbps. I suspect one of the carriers in between Chicago and Lyon is policing but I have no data to support this hypothesis.

On Jan 15 '02 Jerome Bernier of IN2P3 responded:

I can fill some details on the traceroute.

>  6  Lyon-INTER.in2p3.fr (192.70.69.14) [AS1717 - PHYNET-INTER]  168 ms
         CCIN2P3 Site LAN
                 GigE
                 No policing or shaping
>  7  Lyon-ANDA.in2p3.fr (134.158.224.1) [AS789 - Institut National de Physique
> Nucleaire et de Physique des Particules]  195 ms
         CCIN2P3 Site LAN
                 FastE
                 No policing or shaping
> 
>  8  ccbbsn04.in2p3.fr (134.158.104.74) [AS789 - Institut National de Physique
> Nucleaire et de Physique des Particules]  169 ms

The only information that I am not sure about is the configuration of the VP between AADS Chicago and us. Renater people already told me that it is a 30Mbps VP (with 40Mbps peak), but this VP crosses several switches with several configurations, so I have asked them to re-re-check the VP configuration.

Our understanding of the various link components is as follows (this is from a series of emails between Les Cottrell and Gary Buhrmaster of SLAC, Joe Metzger of ESnet, and Jerome Bernier of IN2P3):

We know SLAC to chi-rt1.es.net is clean because SLAC frequently sends bigger flows through that path to CERN.
The two links on the site LAN at IN2P3 can sustain a 30+ Mbps flow to anywhere off-site. Note that this is the same part used when the flow comes by the CERN path.
That just leaves the intercontinental, multi-carrier, rate-limited PVC from ESnet to Lyon via AADS, with two parts: the ESNET-AADS one and the AADS-RENATER-CCIN2P3 one. There is no easy way to put a testing host in the middle of a PVC.

Further email from Joe Metzger of ESnet, Jan 22 '02:

Jerome,
Do you have any new information about how the AADS-RENATER-CCIN2P3 PHYnet PVC is being policed? I have been digging deeper into this performance issue and discovered several things that lead me to think it might be reasonable for Renater to be policing this link at 20 Mbps.

According to the AADS web page, Renater only has a DS3 to AADS. They have other peering sessions at AADS over this DS3 in addition to the PVC to IN2P3, at least one of which is configured at 8 Mbps.
I found an Email message from Gilles Farrache dated June 28, 2001 which seems to indicate that this link should only be used as a backup to the CERN OC3 and no traffic should flow over the link when the OC3 is up.
If the AADS web page is up-to-date and I am not taking Gilles message out of context, then perhaps we need to re-examine the initial assertion that "performance on the link is bad". Maybe we are getting 20 Mbps better performance than we should?
The ESnet trouble ticket information can be found at: finger 8281@ticket.es.net and finger 8294@ticket.es.net

Email from Jerome Bernier of IN2P3, January 28 '02:

I don't yet have a definitive answer from Renater (in fact they don't have a definitive one from OpenTransit) on the policy, but what is sure is that Renater knows and agrees that this link is not for backup but for production. In fact we are also in the process of upgrading this VP to 100Mbps and of using it at the same time as the CERN link.

Email from Joe Metzger, Jan 29 '02:

Jerome, The information about Renater's connection to AADS came from http://www.aads.net/customers.html
Please let me know if there is anything I can do to help you to get the details about this circuit.

Email from Jerome Bernier, Jan 29 '02:

Email from Joe Metzger, Feb 13 '02:

Jerome, Have you been able to get any new information from Renater about the policing on the AADS -> Lyon PVC?

Email from Jerome Bernier, Feb 18 '02:

Sorry for not answering sooner, but I was ill these last weeks and have just come back to work. I don't have any new information, but I will check with them quickly. Regards, Jerome Bernier

Email from Dominique Boutigny LAPP, Annecy, Feb 18 '02:

There is something very interesting happening now on the LYON-US RENATER link, Anne-Marie is doing a transfer _to_ SLAC (in principle we are using this link for transfer _from_ SLAC) and she is using 27.5 Mb/s while the transfers in the other direction are still limited to 20 Mbit/s. http://ccweb.in2p3.fr/statReseau/mrtg-files/in2p3/in2p3-esnet.html Is this clue useful to find the bottleneck?

Email from Anne-Marie Lutz IN2P3, Feb 18 '02:

even another clue? the on-going transfer _from_ slac has dropped at the time the transfer _to_ slac has started. It's rather a surprise since the connection is thought to be full-duplex (or am I wrong ?)

Email from Joe Metzger Feb 19 '02:

This is interesting but I don't know if it means much. Can you tell if the transfer to SLAC was rate limited by your application or by the network?

Email from Dominique Boutigny February 22 '02:

The situation is now extremely bad on the LYON-RENATER-US link: We are now down to 4.5 Mb/s.
Interestingly, Anne-Marie repeated a transfer in the other direction this morning and got 25 Mb/s.

Email from Joe Metzger Feb 22 '02:

Have you made any progress determining why your carrier is policing your circuit at these rates?

Email from Jerome Bernier Feb 22 '02:

I have opened a ticket at the Renater NOC. Let's see...

Email from Les Cottrell, Feb 22 '02:

I logged onto ccbbsn04 and verified that the route to datamove33 does not go via CERN:
cbbsn04:tcsh[34] traceroute datamove33.slac.stanford.edu
traceroute: Warning: Multiple interfaces found; using 134.158.104.74 @ ge0
traceroute to datamove33.slac.stanford.edu (134.79.125.253), 30 hops max, 40 byte packets
 1 Lyon-ANDA.in2p3.fr (194.5.57.1) 1.167 ms 0.773 ms 0.724 ms
 2 Lyon-INTER.in2p3.fr (134.158.224.4) 24.430 ms 0.981 ms 0.940 ms
 3 192.70.69.13 (192.70.69.13) 120.463 ms 120.455 ms 120.580 ms
 4 snv-s-chi.es.net (134.55.205.101) 168.731 ms 168.318 ms 169.043 ms
 5 slac-pos-snv.es.net (134.55.209.2) 169.221 ms 168.987 ms 169.077 ms
 6 RTR-DMZ1-VLAN400.SLAC.Stanford.EDU (192.68.191.149) 169.001 ms 168.699 ms 168.647 ms
Then I started some iperf servers on datamove33 and ran iperf for 10 seconds many times, with different window sizes and numbers of streams, from ccbbsn04 to datamove33.
I was able to get over 20Mbits/sec throughput with several window and stream combinations, see below. Dominique, can you verify whether you are using bbftp or iperf to make your measurements, whether you are going from IN2P3 to SLAC or vice versa, and what streams and window you are using?

Dominique Boutigny sent email Feb 22 '02:

This is not incompatible with what I see: the throughput _from_ SLAC _to_ IN2P3 is currently limited to 5 Mb/s, while the throughput _to_ SLAC _from_ IN2P3 is normal.

Les Cottrell sent email Feb 22 '02:

I have made the measurement from datamove33.slac.stanford.edu to CCB2SN04.IN2P3.FR. This uses the Renater link:
18cottrell@datamove33:~>traceroute CCB2SN04.IN2P3.FR
traceroute: Warning: ckecksums disabled
traceroute to CCB2SN04.IN2P3.FR (194.5.57.104), 30 hops max, 40 byte packets
 1 RTR-FARMCORE1A.SLAC.Stanford.EDU (134.79.127.7) 0.550 ms 0.408 ms 0.382 ms
 2 RTR-DMZ1-GER.SLAC.Stanford.EDU (134.79.135.15) 0.361 ms 0.305 ms 0.307 ms
 3 192.68.191.146 (192.68.191.146) 0.361 ms 0.351 ms 0.339 ms
 4 snv-pos-slac.es.net (134.55.209.1) 0.761 ms 0.710 ms 0.732 ms
 5 chi-s-snv.es.net (134.55.205.102) 48.747 ms 48.815 ms 48.791 ms
 6 192.70.69.14 (192.70.69.14) 149.914 ms 149.556 ms 167.480 ms
 7 Lyon-ANDA.in2p3.fr (134.158.224.1) 149.755 ms 155.322 ms 151.614 ms
 8 ccb2sn04.in2p3.fr (194.5.57.104) 150.993 ms * 151.876 ms
The results of 10 second iperf measurements with various windows and streams look as follows:

I am unclear why the behavior of throughput with streams and windows on this latest measurement should differ from that measured on January 8 '02.
Using bbcp to make a disk to disk copy of an uncached 60MByte Objectivity file to /dev/null, with 40 streams and an 8KByte window, I was able to achieve just over 11Mbits/s. I then ran bbcp memory to memory for about an hour and the throughput in the MRTG plot went up to about 14-15 Mbits/s; at this time bbcp was reporting throughput of about 1183KB/s or 9.46 Mbits/s. Possibly bbcp was adding its traffic to another application which was for some reason constrained to transmit about 5-6 Mbits/s.
I then ran iperf for 20 seconds with 40 streams and a window size of 8Kbytes, and measured a throughput of about 15-16 Mbits/s. I repeated this with 2 separate iperf clients, each with 40 streams and an 8KByte window. The aggregate throughput was again about 15-16 Mbits/s. So doubling the number of streams from 40 to 80 had little effect. In order to see the effect on the MRTG plot I then ran iperf for 30 minutes with 40 streams and an 8Kbyte window. The maximum throughput recorded by MRTG was about 22.7Mbits/s, and iperf recorded an average throughput of about 17 Mbits/s. Below is seen the MRTG plot showing the impact of the bbcp measurement from 3-4 am, and the iperf measurement just after 6am.
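As a rough cross-check of these numbers, the window-limited ceiling for parallel TCP streams can be estimated as streams * window / RTT. The sketch below is an illustration under stated assumptions, not part of the original measurements: it takes the RTT to be 150 msec (the average ping RTT from datamove33 to ccb2sn04 reported elsewhere on this page), treats "8KByte" as 8192 bytes, and assumes each stream really gets the requested window.

```python
def window_limited_mbps(window_bytes, rtt_ms, streams=1):
    """Upper bound in Mbits/s when each stream sends at most one window per round trip."""
    per_stream_bps = window_bytes * 8 * 1000 / rtt_ms
    return streams * per_stream_bps / 1e6


RTT_MS = 150             # assumed: average ping RTT, datamove33 to ccb2sn04
WINDOW_BYTES = 8 * 1024  # assumed: 8KByte window means 8192 bytes

for streams in (40, 80):
    ceiling = window_limited_mbps(WINDOW_BYTES, RTT_MS, streams)
    print(f"{streams} streams: ceiling about {ceiling:.1f} Mbits/s")
```

Under these assumptions the ceiling is about 17.5 Mbits/s for 40 streams and about 35 Mbits/s for 80. The 40-stream measurements sit close to the first figure, while the 80-stream result stayed well below the second, which would be consistent with some other limit on the path taking over; this is one possible reading, not a conclusion drawn in the original emails.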

The datamove33 window sizes are:
ndd /dev/tcp tcp_max_buf = 1048576
ndd /dev/tcp tcp_cwnd_max = 1048576
ndd /dev/tcp tcp_xmit_hiwat = 16384
ndd /dev/tcp tcp_recv_hiwat = 24576
and for ccb2sn04 (= ccbbsn04, apart from how routing is done) are:
ndd /dev/tcp tcp_max_buf = 4194304
ndd /dev/tcp tcp_cwnd_max = 2097152
ndd /dev/tcp tcp_xmit_hiwat = 65536
ndd /dev/tcp tcp_recv_hiwat = 65536
Thus my conclusion is that if there is an application generating the 5-6 Mbits/s background traffic, then its throughput is limited by something other than the network in this case. Possibly it does not have enough streams. If the optimum number of streams was deduced from measurements made earlier (e.g. before or around January 8 '02) and the behavior of throughput with streams and windows changed so streams are now much more effective than windows, then this would account for today's poor performance. I do not currently have a hypothesis for why the throughput behavior with streams and windows should have changed. Also note that the above only explains the poor performance of the application that started around midday on Wednesday February 20th French time. It does not explain the apparent rate limiting around 20-22 Mbits/s that was initially reported.

Though I doubt it will make much difference, I also recommend that the SLAC Unix administrators increase the window size on datamove33 to be the same as for IN2P3.
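To put the window sizes in context, the bandwidth-delay product (BDP) of the path gives the amount of data that must be in flight to fill it. The following sketch is an illustration only: it assumes the 30 Mbits/s sustained rate of the VP, a 150 msec RTT, and that applications use the default send buffer (tcp_xmit_hiwat) without setting their own window.

```python
import math


def bdp_bytes(rate_bps, rtt_ms):
    """Bandwidth-delay product in bytes (integer arithmetic)."""
    return rate_bps * rtt_ms // 8000


def streams_to_fill(rate_bps, rtt_ms, window_bytes):
    """Number of parallel streams whose windows together cover the BDP."""
    return math.ceil(bdp_bytes(rate_bps, rtt_ms) / window_bytes)


RATE_BPS = 30_000_000  # assumed: sustained rate of the RENATER VP
RTT_MS = 150           # assumed: round-trip time SLAC to IN2P3

print("BDP:", bdp_bytes(RATE_BPS, RTT_MS), "bytes")
for host, hiwat in (("datamove33", 16384), ("ccb2sn04", 65536)):
    needed = streams_to_fill(RATE_BPS, RTT_MS, hiwat)
    print(f"{host}: tcp_xmit_hiwat={hiwat} -> about {needed} streams")
```

With these assumptions the BDP is 562500 bytes, so roughly 35 streams would be needed at the datamove33 default of 16384 bytes, against roughly 9 at the IN2P3 default of 65536 bytes. Applications that request their own window size, as iperf and bbcp can, are not bound by these defaults.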

Email from Dominique Boutigny Feb 23 '02:

I think that we are using 10 streams and the default bbftp window size, so we have room for improvement.
But the point here is that _in_the_same_conditions_ the transfer rate dropped from 20 Mbit/s to ~5 Mbit/s; the fact that you are able to get 11 with bbcp and optimized parameters clearly shows that the problem is _not_ due to rate policing somewhere.
I remember that last time we got the same kind of throughput drop, it was due to something queuing the packets on the ESNET side instead of dropping them (sorry if I am not using the right terminology !!! ) Could it be the same problem here ?
To summarize, we have probably 2 problems here:
- The first one is the limitation to 20 Mbit/s that ESNET is interpreting as a rate policing on the RENATER side.
- The second one is the limitation from 20 to 5 Mbit/s for an unknown reason.

Would it be possible to have the e-mail of a RENATER responsible person to put him in the loop ? I have added Denis Linglin and Francois Etienne in the recipient list, I hope that escalating this problem in the French side will help.

Email from Les Cottrell February 23 '02:

It is possible, maybe even likely, that something has changed that has resulted in requiring a different set of streams and window for optimal throughput. It would be extremely interesting to identify what caused that. The previous queuing problem you refer to resulted in RTTs of many hundreds of msec. The RTTs at the moment (min/avg/max for 208 pings from datamove33 to ccb2sn04) are 149/150/264 msec, which is pretty close to the optimal. On the other hand I do not have a record of the RTT on Jan 8 '02 (I have it from pharlap to ccbbsn04 but that is on the CERN route) so maybe there was more queuing and RTT then. I do have a record of pings from pharlap to ccb2sn04 (a route that avoids CERN) on February 1 '02 and that shows for 10 pings min/avg/max 170/170/171 msec, so maybe the buffering has been reduced. The routes on Feb 1 were identical to those seen today.

Email from Joe Metzger Feb 25 '02:

ESnet did not make any policing or queuing changes to this circuit last week. I have checked out our traffic stats and none of the interfaces on the ESnet portion of the path appeared to be saturated.
I don't think we will be able to make any progress on this problem until we can bring Renater, who is responsible for over two thirds of the path, into the discussion.
The ESnet contacts in Renater have been copied on several of the email messages in this long running thread but they have not responded.
Jerome Bernier has been trying to work with Renater for over a month on this problem and hasn't reported any significant progress.
I have been relying on Jerome to coordinate debugging the Renater end of the problem because IN2P3 is a Renater Customer and ESnet is not. Many network service providers ignore difficult problems reported by third parties but they will usually deal with direct customer complaints.
Please let me know if you think there is some other approach ESnet needs to take to move this problem towards resolution.

Email from Francois Etienne of IN2P3 Feb 25 '02:

you are right: we are missing the right communication channel with our service provider, in fact three layers (FT, CS and RENATER) and we are trying to get out of this inappropriate situation. By the way with the help of the RENATER director the problem has been said to be solved today by FT and I hope that Dominique will confirm this soon. Sorry to have bothered you with this issue, but let me hope that this will help to set up the adequate communication channel to guarantee the Lyon and SLAC link, in particular to sustain the new coming bandwidth at 100 Mbps. Regards François

Email from Dominique Boutigny Feb 26 '02:

I confirm that since 9am (French time) the throughput is now back to 30 Mbit/s on the US-RENATER link. This was due to a bad configuration on the "France-Telecom" side.
Many thanks to all of you for your efforts and help to solve this long, long standing problem.
I hope that we will be able to test the upgrade to 100 Mb/s very soon.

Email from Les Cottrell Feb 26 '02:

After just under 2 months of elapsed time (and a lot of man hours in multiple locations, let alone unavailable bandwidth and lost productivity to users) trying to track down this problem, I wonder if we can learn a bit more about the cause from FT or Renater. This is not to try and assign blame, but rather to see whether there was some signature or some measurement we could have made which would have isolated/identified the problem more quickly/easily. "Bad configuration" is rather limited in its clarification of what was wrong. Was it in rate limiting, what kind, was it in an ATM switch(es), was it a PVC or something, was it in a router? Is there a ticket that would provide more information? The information from our end can be found at http://www.slac.stanford.edu/grp/scs/net/case/in2p3-jan02/problem-20020108.html. Is there a Service Level Agreement (e.g. between Renater and FT) that addresses this? What do we do next time we suspect something like this? Are there plans/needs to have better escalation procedures to Renater, and then to their carriers? Will IN2P3 be meeting with Renater to discuss some of the above issues?

Email from Dominique Boutigny Feb 26 '02:

This problem has been escalated to the highest level on the French side. Some email exchange today with the RENATER director clearly indicates that they will not close the problem like that but will investigate to understand where the problem was.

Email from Jerome Bernier Feb 28 '02:

Renater has investigated this problem further, but after a meeting with France Telecom/Opentransit, FT told them that nothing was changed on the FT side. Was it really magic ???? We don't want to close this problem as it is, and neither does Renater. But at present what is said is:
. there were no changes in the application
. we (IN2P3) did not change anything on our routers/switches
. Renater did not change anything
. Opentransit did not change anything
Can you confirm that there were no changes in ESnet and SLAC ?

Les Cottrell and Joe Metzger responded that there had been no changes at SLAC or ESnet respectively.

On February 28th at 15:00 hours France time Renater upgraded the link to IN2P3 to 100Mbits/s. Now we need to get the link from Renater to ESnet upgraded. ESnet is policing the SLAC - RENATER link in question at:

    shaping {
        vbr peak 40m sustained 30m burst 200;
        queue-length 980;
    }

The throughput achievable now with iperf is shown below:

On March 4 '02 Joe Metzger, in response to a request from Jerome Bernier of IN2P3, reported:

I have bumped the rate on this PVC up to 50 Mbps until we can get our line into AADS upgraded. I think we will be able to manage this rate provided the traffic to the 60+ other peers at the NAP that share this line doesn't grow too fast. We will drop this PVC back down to 30 Mbps if the line starts to get congested. Our upgrade was ordered a while back but all upgrades to AADS circuits are taking forever. We have been told it will not happen before May. We have not been told when it will happen...

The following shows the throughput after upgrading the PVC to 50Mbits/s. It is seen that the iperf throughput has increased from about 20 Mbits/s to about 45 Mbits/s.