Problems with performance between SLAC and IN2P3, Jan '02
Les Cottrell. Page created: January 8, 2002, last update March 4, 2002.
There have been various problems reported with the IN2P3 link to SLAC
since late November 2001. The one we wish to focus on at the moment is the
performance throughput observed from datamove33.slac.stanford.edu to
There are two paths between SLAC
and the CCIN2P3
- the CERN one: SLAC-ESNET-CERN-LYON with a minimum bandwidth of 155 Mbps; this
is our default path
- the RENATER one (Renater is part of
France Telecom): SLAC-ESNET-LYON with an ATM VP of 30Mbps
IN2P3 announce only the 126.96.36.199 network to ESNET on this link (for the
moment the only machine concerned is ccb2sn04.in2p3.fr).
So the traffic from SLAC to CCIN2P3 goes through the CERN link except to ccb2sn04
where it uses the RENATER path, and the traffic from CCIN2P3 to SLAC uses the
CERN link except for ccb2sn04 to DATAMOVE33.SLAC.Stanford.EDU where it
uses the RENATER VP (we have a static routing for this). Jerome
Bernier, IN2P3 Nov 29 2001.
Traceroute / Pipechar results
Traceroute shows the route and the ASes (autonomous systems) traversed.
It confirms that the route at this time did not go via CERN. Pipechar was used
to measure the bottleneck from hercules.slac.stanford.edu to ccb2sn04.in2p3.fr.
Hercules is a 2*1131 MHz Linux 2.4 host with 2 GE interfaces. Pipechar
indicates a limit of about 46Mbps between the ESnet router in
Chicago (chi-s-snv.es.net) and the 188.8.131.52 (184.108.40.206) [AS1717 - PHYNET-INTER]
router, and a limit of about 30Mbits/s between the last router
(Lyon-ANDA.in2p3.fr) and ccb2sn04. So at least from this point of view things look good.
Ping results from SLAC to IN2P3 for 1000 pings indicate there is
less than 1% packet loss.
Various measures of bandwidth/throughputs are obtained from the IEPM-BW.
These include iperf, bbcp memory to memory, bbcp disk to disk, bbftp, and
pipechar. These are shown below for the SLAC - IN2P3 link that goes via
CERN. The light blue bars show the ping success rate
(usually 100% so there is a complete blue line), the black triangles are the
minimum pipechar link bandwidths, and the green dots are the iperf throughputs
for the optimum window and streams (256KB window and 12 streams).
We have now extended the IEPM-BW monitoring to also monitor the SLAC-ESNET-RENATER-IN2P3
link on a regular basis. We also used tcpload.pl to measure the throughput for various
window sizes and streams (see
Bulk throughput measurements
for details of the measurement methodology) from pharlap.slac.stanford.edu
(a Solaris 5.8 host with 4*336 MHz cpus) to
ccb2sn04.in2p3.fr. The maxima (top 10% throughputs) are over 18.25Mbits/s, and the
maximum achieved was 21.43 Mbits/s. The details can be seen in the plot below.
We repeated the tcpload.pl measurement from hercules.slac.stanford.edu
to ccb2sn04.in2p3.fr with very similar results.
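A sweep over window sizes and stream counts of this kind can be sketched in Python as follows. This is not tcpload.pl itself: the function name and the grid values are assumptions chosen for illustration, and only the standard iperf client options -c, -t, -w and -P are used.

```python
from itertools import product

def iperf_sweep(server, windows_kb, streams, seconds=10):
    """Return one iperf client command (as an argv list) per window/stream pair."""
    commands = []
    for window_kb, n_streams in product(windows_kb, streams):
        commands.append([
            "iperf", "-c", server,
            "-t", str(seconds),       # test duration in seconds
            "-w", f"{window_kb}K",    # requested TCP window (socket buffer)
            "-P", str(n_streams),     # number of parallel streams
        ])
    return commands

if __name__ == "__main__":
    # Grid values are illustrative, not the ones tcpload.pl used.
    for argv in iperf_sweep("ccb2sn04.in2p3.fr", [8, 64, 256], [1, 8, 12]):
        print(" ".join(argv))
```

Each command would be run in turn against an iperf server on the remote host and the reported throughput recorded against its window and stream count.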
The monitoring of the IN2P3 RENATER link from IN2P3 indicates that the utilization
is currently about 20Mbits/s, however the measurements of June and September
showed they could sustain close to 30 Mbits/s, see below (from
Since none of the above appeared to identify the cause of the problem, I ran
tcpdump and tcptrace on an iperf transfer from SLAC to IN2P3. I wrote a script to facilitate
It started tcpdump on hercules and then started iperf for 10 seconds for 8 streams
with a window size of 64KB. When iperf finished, tcpdump was terminated and
tcptrace was run to analyze the dump file.
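The sequence just described (start tcpdump, run iperf, stop tcpdump, run tcptrace) might look roughly like the Python sketch below. It is not the script that was actually used; the function names, the interface name and the dump file name are hypothetical, and tcpdump normally needs root privileges.

```python
import subprocess
import time

def capture_commands(iface, server, dump_file, seconds=10, streams=8, window="64K"):
    """Build the three command lines: packet capture, load generator, trace analysis."""
    tcpdump = ["tcpdump", "-i", iface, "-w", dump_file, "host", server]
    iperf = ["iperf", "-c", server, "-t", str(seconds), "-P", str(streams), "-w", window]
    tcptrace = ["tcptrace", "-l", dump_file]   # -l asks for the long per-connection report
    return tcpdump, iperf, tcptrace

def run_capture(iface, server, dump_file):
    """Capture one iperf run and return the tcptrace report as text."""
    tcpdump, iperf, tcptrace = capture_commands(iface, server, dump_file)
    dump = subprocess.Popen(tcpdump)
    time.sleep(1)                      # give tcpdump a moment to start capturing
    try:
        subprocess.run(iperf, check=True)
    finally:
        dump.terminate()               # stop the capture even if iperf failed
        dump.wait()
    result = subprocess.run(tcptrace, check=True, capture_output=True, text=True)
    return result.stdout
```

A call such as run_capture("eth0", "ccb2sn04.in2p3.fr", "iperf.dump") would reproduce the 8 stream, 64KB window, 10 second test; the interface and file names here are placeholders.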
The output from tcpload.pl
shows in detail what happened. The tcptrace and tcpdump files were then made
available by anonymous ftp from ftp.slac.stanford.edu and email was sent to Joe Metzger
of ESnet so he could analyze.
I am unsure if the "incomplete" warning will cause problems, or how to avoid it.
Looking at the tcpdump output it is clear that the window sizes
and MTU discovery are working OK.
Joe Metzger <JMetzger@lbl.gov> responded on Jan 9 '02: Thanks for generating the trace files!
We have been looking at this problem
and it is not a straightforward issue of a PVC shaped at 20 Mbps (at least on
the ESnet links). My current suspicion is that the flow exceeds a queue limit
somewhere along the path that is generating packet loss and the drop in
Hopefully the trace files you are providing will help us to determine if this is
the cause and give us a better idea of exactly what we should be looking
On February 28th at 15:00 hours France time Renater upgraded the link to
IN2P3 to 100Mbits/s. Now we need to get the link from Renater to ESnet upgraded.
ESnet is policing the SLAC - RENATER link in question at:
- Email from Joe Metzger 1/10/02 to Les Cottrell:
Analyzing the trace from Les confirms the suspicion that something is policing
this path at 20 Mbps.
I think the next step is to try to identify a
responsible party for each link in this path and have them identify where
policing is occurring and at what rates.
I have filled in the
ESnet details. Hopefully Jerome Bernier
and/or Gilles Farrache can fill in some of the details between Chicago and
the destination host.
I am confident that there are no policing issues on the links between SLAC
and the ESnet router in Chicago because everything up to that point is
shared with other flows that frequently exceed 20 Mbps. I suspect one
of the carriers in between Chicago and Lyon are policing but I have no
data to support this hypothesis.
- On Jan 15 '02 Jerome Bernier of IN2P3 responded:
- I can fill some details on the traceroute.
> 6 Lyon-INTER.in2p3.fr (220.127.116.11) [AS1717 - PHYNET-INTER] 168 ms
CCIN2P3 Site LAN
No policing or shaping
> 7 Lyon-ANDA.in2p3.fr (18.104.22.168) [AS789 - Institut National de Physique
> Nucleaire et de Physique des Particules] 195 ms
CCIN2P3 Site LAN
No policing or shaping
> 8 ccbbsn04.in2p3.fr (22.214.171.124) [AS789 - Institut National de Physique
> Nucleaire et de Physique des Particules] 169 ms
The only information that I am not sure of is the configuration of the VP between
AADS Chicago and us.
Renater people already told me that it is a 30Mbps VP (with 40Mbps peak)
but this VP is crossing several switches with several configurations
so i have asked them to re-re-check the VP configuration.
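To make concrete what policing a path at 20 Mbps does to a faster flow, here is a minimal token-bucket sketch in Python. It is a generic illustration, not the configuration of any ESnet, Renater or France Telecom device: the function names, the 15000 byte bucket depth and the 1500 byte packet size are assumptions, and TCP's reaction to the drops is not modelled.

```python
def police(arrivals, rate_bps, burst_bytes):
    """Token-bucket policer: forward a packet only if enough tokens are available.

    arrivals: list of (time_s, size_bytes) sorted by time.
    Returns the number of bytes forwarded; non-conforming packets are dropped.
    """
    tokens = float(burst_bytes)   # bucket starts full
    last_time = 0.0
    forwarded = 0
    for time_s, size in arrivals:
        # Refill at the policed rate, never beyond the bucket depth.
        tokens = min(burst_bytes, tokens + (time_s - last_time) * rate_bps / 8.0)
        last_time = time_s
        if size <= tokens:
            tokens -= size
            forwarded += size
    return forwarded

def constant_stream(rate_bps, seconds, packet_bytes=1500):
    """Evenly spaced packets offering rate_bps for the given duration."""
    interval = packet_bytes * 8.0 / rate_bps
    count = int(seconds / interval)
    return [(i * interval, packet_bytes) for i in range(count)]

if __name__ == "__main__":
    offered = constant_stream(30e6, seconds=1.0)
    passed = police(offered, rate_bps=20e6, burst_bytes=15000)
    print(f"offered 30 Mbit/s, forwarded about {passed * 8 / 1e6:.1f} Mbit/s")
```

In this model a steady 30 Mbps stream is cut to roughly the policed rate, while a stream below 20 Mbps passes untouched. A real TCP flow would additionally back off after each drop, so it can end up below the policed rate.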
As far as understanding the various link components (this is from a series of emails
between Les Cottrell and Gary Buhrmaster of SLAC, Joe Metzger of ESnet, and Jerome Bernier of IN2P3):
- We know SLAC to chi-rt1.es.net is clean because SLAC frequently sends
bigger flows through that path to CERN.
- The two links on the site LAN at IN2P3 can sustain a 30+ Mbps flow to anywhere off-site.
Note that this is the same part used when the flow comes
by the CERN path.
- That just leaves the intercontinental, multi-carrier, rate limited PVC
from ESnet to Lyon via AADS, with two parts: the ESNET-AADS one and the AADS-RENATER-CCIN2P3 one.
There is no easy way to put a testing host in the middle of a PVC.
- Further email from Joe Metzger of ESnet, Jan 22 '02:
Do you have any new information about how the AADS-RENATER-CCIN2P3 PHYnet PVC is being policed?
I have been digging deeper into this performance issue and discovered several things that lead me to think it might be reasonable for Renater to be policing this link at 20 Mbps.
If the AADS web page is up-to-date and I am not taking Gilles message out of context, then perhaps we need to re-examine the initial assertion that "performance on the link is bad". Maybe we are getting 20 Mbps
better performance than we should?
- According to the AADS web page, Renater only has a DS3 to AADS. They have other peering sessions at AADS over this DS3 in addition to the PVC to IN2P3, at least one of which is configured at 8 Mbps.
- I found an Email message from Gilles Farrache dated June 28, 2001 which seems to indicate that this link should only be used as a backup to the CERN OC3 and no traffic should flow over the link when the OC3 is up.
The ESnet trouble ticket information can be found at:
finger email@example.com and finger firstname.lastname@example.org
- Email from Jerome Bernier of IN2P3, January 28 '02:
- I don't have yet a definitive answer from Renater (in fact they don't have
a definitive one from OpenTransit) for the policy, but what is sure is that
Renater know and agree that this link is not for backup but for production.
In fact we are also in the process to upgrade this VP to 100Mbps and also
for using it in the same time that the CERN link.
PS what is the AADS web page that you spoke about,
because on the Startap web page it is wrote that Renater have an OC3 link
- Email from Joe Metzger, Jan 29 '02:
- The information about Renater's connection to AADS came from http://www.aads.net/customers.html
Please let me know if there is anything I can do to help
you to get the details about this circuit.
- Email from Joe Metzger, Feb 13 '02:
Have you been able to get any new information from Renater about the policing on the AADS -> Lyon PVC?
- Email from Jerome Bernier, Feb 18 '02:
- Sorry if i don't answer before, but i was ill last weeks
and i just come back to work.
I don't have new informations, but i will check with them quickly. Regards, Jerome Bernier
- Email from Dominique Boutigny LAPP, Annecy, Feb 18 '02:
- There is something very interesting happening now on the LYON-US RENATER link, Anne-Marie is doing a transfer _to_ SLAC (in principle we are using this link for transfer _from_ SLAC) and she is using 27.5 Mb/s while the transfers in the other direction are still limited to 20 Mbit/s.
Is this clue useful to find the bottleneck?
- Email from Anne-Marie Lutz IN2P3, Feb 18 '02:
- even another clue? the on-going transfer _from_ slac has dropped at
the time the transfer _to_ slac has started. It's rather a surprise
since the connection is thought to be full-duplex (or am I wrong ?)
- Email from Joe Metzger Feb 19 '02:
- This is interesting but I don't know if it means much.
Can you tell if the transfer to SLAC was rate limited by your application or by the network?
- Email from Dominique Boutigny February 22 '02:
- The situation is now extremely bad on the LYON-RENATER-US link:
We are now down to 4.5 Mb/s.
Interestingly, Anne-Marie repeated a transfer in the other direction this morning and got 25 Mb/s.
- Email from Joe Metzger Feb 22 '02:
- Have you made any progress determining why your carrier is policing your circuit at these rates?
- Email from Jerome Bernier Feb 22 '02:
- I have opened a ticket at the Renater NOC.
- Email from Les Cottrell, Feb 22 '02:
- I logged onto ccbbsn04 and verified that the route to datamove33 does not go via CERN:
cbbsn04:tcsh traceroute datamove33.slac.stanford.edu
traceroute: Warning: Multiple interfaces found; using 126.96.36.199 @ ge0
traceroute to datamove33.slac.stanford.edu (188.8.131.52), 30 hops max, 40 byte packets
1 Lyon-ANDA.in2p3.fr (184.108.40.206) 1.167 ms 0.773 ms 0.724 ms
2 Lyon-INTER.in2p3.fr (220.127.116.11) 24.430 ms 0.981 ms 0.940 ms
3 18.104.22.168 (22.214.171.124) 120.463 ms 120.455 ms 120.580 ms
4 snv-s-chi.es.net (126.96.36.199) 168.731 ms 168.318 ms 169.043 ms
5 slac-pos-snv.es.net (188.8.131.52) 169.221 ms 168.987 ms 169.077 ms
6 RTR-DMZ1-VLAN400.SLAC.Stanford.EDU (184.108.40.206) 169.001 ms 168.699 ms 168.647 ms
Then I started some iperf servers on datamove33 and ran iperf for 10 seconds many times with different windows and stream sizes from ccbbsn04 to datamove33:
I was able to get over 20Mbits/sec throughput with several windows and
stream combinations, see below, so Dominique can you verify whether you are using
bbftp or iperf to make your measurements and whether you are going from IN2P3
to SLAC or vice-versa, and what streams & window you are using.
- Dominique Boutigny sent email Feb 22 '02:
- This is not incompatible with what I see, the throughput _from_ SLAC _to_ IN2P3 is currently limited to 5 Mb/s, while the throughput _to_ SLAC _from_ IN2P3 is normal.
- Les Cottrell sent email Feb 22 '02:
- I have made the measurement from datamove33.slac.stanford.edu to
This uses the Renater link
traceroute: Warning: ckecksums disabled
traceroute to CCB2SN04.IN2P3.FR (220.127.116.11), 30 hops max, 40 byte packets
1 RTR-FARMCORE1A.SLAC.Stanford.EDU (18.104.22.168) 0.550 ms 0.408 ms 0.382 ms
2 RTR-DMZ1-GER.SLAC.Stanford.EDU (22.214.171.124) 0.361 ms 0.305 ms 0.307 ms
3 126.96.36.199 (188.8.131.52) 0.361 ms 0.351 ms 0.339 ms
4 snv-pos-slac.es.net (184.108.40.206) 0.761 ms 0.710 ms 0.732 ms
5 chi-s-snv.es.net (220.127.116.11) 48.747 ms 48.815 ms 48.791 ms
6 18.104.22.168 (22.214.171.124) 149.914 ms 149.556 ms 167.480 ms
7 Lyon-ANDA.in2p3.fr (126.96.36.199) 149.755 ms 155.322 ms 151.614 ms
8 ccb2sn04.in2p3.fr (188.8.131.52) 150.993 ms * 151.876 ms
The results of 10 second iperf measurements with various windows and streams look as follows:
I am unclear why the behavior of throughput with streams and windows on this latest
measurement should differ from that measured on January 8 '02.
Using bbcp to make a disk to disk copy of an uncached 60MByte Objectivity file to /dev/null,
with 40 streams and an 8KByte window, I was able to achieve
just over 11Mbits/s. I then ran bbcp memory to memory for about an hour and
the throughput in the MRTG plot went up to about 14-15 Mbits/s; at this time bbcp was reporting
a throughput of about 1183KB/s or 9.46 Mbits/s. Possibly bbcp was adding its
traffic to another application which was for some reason constrained to transmit about 5-6 Mbits/s.
I then ran iperf for 20 seconds with 40 streams and a window size of 8KBytes, and measured a throughput of about
15-16 Mbits/s. I repeated this with 2 separate iperf clients each with 40 streams and an 8KByte window.
The aggregate throughput was again about 15-16 Mbits/s. So doubling the number of streams from 40 to 80
had little effect. In order to see the effect on the MRTG plot I then ran iperf for 30 minutes
with 40 streams and an 8Kbyte window. The maximum throughput recorded by MRTG was about 22.7Mbits/s,
and iperf recorded an average throughput of about 17 Mbits/s.
Below is seen the MRTG plot showing the impact of the bbcp measurement from 3-4 am, and the iperf
measurement just after 6am.
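As a rough cross-check of these figures, a TCP stream can send at most one window per round trip, so window size, stream count and RTT together give an upper bound on throughput. The Python sketch below works this out. The 150 ms RTT is an assumption based on the traceroute and ping times on this path, the function names are illustrative, and the socket buffer actually granted by the operating system may differ from the window requested.

```python
def window_limited_mbits(window_bytes, streams, rtt_s):
    """Upper bound on aggregate TCP throughput: one window per stream per round trip."""
    return window_bytes * 8 * streams / rtt_s / 1e6

def kbytes_per_s_to_mbits(kbytes_per_s):
    """Convert KB/s to Mbits/s, taking 1 KB = 1000 bytes."""
    return kbytes_per_s * 8 / 1000.0

if __name__ == "__main__":
    rtt = 0.150  # seconds; assumed, roughly the ping RTT on this path
    print(round(window_limited_mbits(8 * 1024, 40, rtt), 1))   # 40 streams, 8 KByte window
    print(round(window_limited_mbits(8 * 1024, 80, rtt), 1))   # 80 streams, 8 KByte window
    print(round(kbytes_per_s_to_mbits(1183), 2))               # bbcp's reported rate
```

Under these assumptions 40 streams with an 8KByte window are bounded at about 17.5 Mbits/s, close to what iperf reported, whereas 80 streams would be bounded at about 35 Mbits/s. That doubling the streams gave no gain therefore suggests a limit other than the window at that point. The last line reproduces the 1183KB/s to 9.46 Mbits/s conversion.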
The datamove33 window sizes are:
ndd /dev/tcp tcp_max_buf = 1048576
ndd /dev/tcp tcp_cwnd_max = 1048576
ndd /dev/tcp tcp_xmit_hiwat = 16384
ndd /dev/tcp tcp_recv_hiwat = 24576
and for ccb2sn04 (= ccbbsn04, apart from how routing is done) are:
ndd /dev/tcp tcp_max_buf = 4194304
ndd /dev/tcp tcp_cwnd_max = 2097152
ndd /dev/tcp tcp_xmit_hiwat = 65536
ndd /dev/tcp tcp_recv_hiwat = 65536
Thus my conclusion is that if there is an application generating the 5-6 Mbits/s
background traffic, then its throughput is limited by something other than the
network in this case. Possibly it does not have enough streams. If the optimum
number of streams was deduced from measurements made earlier (e.g.
before or around January 8
'02) and the behavior of throughput with streams and windows changed so
streams are now much more effective than windows, then this would account
for today's poor performance. I do not currently have a hypothesis for why the
throughput behavior with streams and windows should have changed.
Also note that the above only explains the poor performance of the application that
started around midday on Wednesday February 20th French time. It does not
explain the apparent rate limiting around 20-22 Mbits/s that
was initially reported.
Though I doubt it will make much difference I also recommend that
the SLAC Unix administrators increase the window size on datamove33
to be the same as for IN2P3.
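One way to judge whether a host's maximum buffer is large enough is to compare it with the bandwidth-delay product of the path, that is, the amount of data a single stream must keep in flight to fill the link. The Python sketch below does this for the two tcp_max_buf values quoted above. The 150 ms RTT and the target rates are assumptions, and with several parallel streams the requirement per stream is correspondingly smaller.

```python
def bdp_bytes(rate_bps, rtt_ms):
    """Bandwidth-delay product in bytes: data in flight needed to fill the path."""
    return rate_bps * rtt_ms // 8000   # bits/s * ms / (8 bits/byte * 1000 ms/s)

DATAMOVE33_MAX_BUF = 1048576   # tcp_max_buf on datamove33, from the ndd output
IN2P3_MAX_BUF = 4194304        # tcp_max_buf on the IN2P3 host, from the ndd output

if __name__ == "__main__":
    for mbps in (30, 50, 100):
        need = bdp_bytes(mbps * 1_000_000, 150)   # 150 ms RTT is an assumption
        print(mbps, need, need <= DATAMOVE33_MAX_BUF, need <= IN2P3_MAX_BUF)
```

By this arithmetic a single stream needs about 562500 bytes in flight at 30 Mbps, which fits within both limits, but about 1875000 bytes at 100 Mbps, which exceeds the datamove33 maximum while still fitting within the IN2P3 one.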
- Email from Dominique Boutigny Feb 23 '02:
- I think that we are using 10 streams and the default bbftp window size, so
we have room for improvement.
But the point here is that
_in_the_same_conditions_ the transfer rate dropped from 20 Mbit/s to ~5
Mbit/s, the fact that you are able to get 11 with bbcp and optimized
parameters clearly shows that the problem is _not_ due to rate policing
I remember that last time we got the same kind of throughput drop, it was
due to something queuing the packets on the ESNET side instead of dropping them
(sorry if I am not using the right terminology !!! ) Could it be the same
problem here ?
To summarize, we have probably 2 problems here:
- The first one is the limitation to 20 Mbit/s that ESNET is interpreting
as a rate policing on the RENATER side.
- The second one is the limitation from 20 to 5 Mbit/s for an unknown
Would it be possible to have the e-mail of a RENATER responsible person to
put him in the loop ?
I have added Denis Linglin and Francois Etienne in the recipient list, I
hope that escalating this problem in the French side will help.
- Email from Les Cottrell February 23 '02:
- It is possible, maybe even likely, that something has changed that has
resulted in requiring a different set of streams and window for optimal throughput.
It would be extremely interesting to identify what caused that.
The previous queuing problem you refer to, resulted in RTTs of many hundreds of msec.
The RTTs at the moment, min/avg/max (for 208 pings from datamove33 to ccb2sn04),
are 149/150/264 msec, which is pretty close to the optimal. On the other hand
I do not have a record of the RTT on Jan 8 '02 (I have it from pharlap to ccbbsn04
but that is on the CERN route) so maybe there was more queuing and RTT then. I do
have a record of pings from pharlap to ccb2sn04 (a route that avoids CERN) on
February 1 '02 and that shows for 10 pings min/avg/max 170/170/171 msec, so maybe
the buffering has been reduced. The routes on Feb 1 were identical to those seen today.
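A simple way to compare such ping summaries is to take the average minus the minimum RTT as a rough indicator of standing queueing delay, and the maximum minus the minimum as the worst excursion. The Python sketch below applies this to the two min/avg/max triples quoted above. The function name is illustrative, and the two triples come from different source hosts, so only the spread within each one is compared.

```python
def rtt_summary(stats):
    """Parse a 'min/avg/max' string in ms; return (avg - min, max - min)."""
    rtt_min, rtt_avg, rtt_max = (float(x) for x in stats.split("/"))
    return rtt_avg - rtt_min, rtt_max - rtt_min

if __name__ == "__main__":
    for label, stats in (("datamove33, 208 pings", "149/150/264"),
                         ("pharlap Feb 1, 10 pings", "170/170/171")):
        excess, spread = rtt_summary(stats)
        print(f"{label}: avg-min {excess:.0f} ms, max-min {spread:.0f} ms")
```

Both samples show an average within about 1 ms of the minimum, far from the many hundreds of msec seen during the earlier queuing problem.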
- Email from Joe Metzger Feb 25 '02:
- ESnet did not make any policing or queuing changes to this circuit
last week. I have checked out our traffic stats and none of the
interfaces on the ESnet portion of the path appeared to be saturated.
I don't think we will be able to make any progress on this problem
until we can bring Renater, who is responsible for over two thirds of
the path, into the discussion.
The ESnet contacts in Renater have been copied on several of the
email messages in this long running thread but they have not responded.
Jerome Bernier has been trying to work with Renater for over a month
on this problem and hasn't reported any significant progress.
I have been relying on Jerome to coordinate debugging the Renater
end of the problem because IN2P3 is a Renater Customer and ESnet is not.
Many network service providers ignore difficult problems reported
by third parties but they will usually deal with direct customer complaints.
Please let me know if you think there is some other approach ESnet
needs to take to move this problem towards resolution.
- Email from Francois Etienne of IN2P3 Feb 25 '02:
- you are right: we are missing the right communication channel with
our service provider, in fact three layers (FT, CS and RENATER) and
we are trying to get out of this inappropriate situation. By the
way with the help of the RENATER director the problem has been said
to be solved today by FT and I hope that Dominique will confirm this soon.
Sorry to have bothered you with this issue, but let me hope that
this will help to set up the adequate communication channel to
guarantee the Lyon and SLAC link, in particular to sustain the new
coming bandwidth at 100 Mbps. Regards François
- Email from Dominique Boutigny Feb 26 '02:
I confirm that since 9am (French time) the throughput is now back to 30
Mbit/s on the US-RENATER link.
This was due to a bad configuration on the "France-Telecom" side.
Many thanks to all of you for your efforts and help to solve this long, long
I hope that we will be able to test the upgrade to 100 Mb/s very soon.
- Email from Les Cottrell Feb 26 '02:
- After just under 2 months of elapsed time (and a lot of man hours in multiple
locations, let alone unavailable bandwidth and lost productivity to users) trying to
track down this problem, I wonder if we can learn a bit more about the cause
from FT or Renater. This is not to try and assign blame, but rather to see whether
there was some signature or some measurement we could have made which would have
isolated/identified the problem more quickly/easily. "Bad configuration" is
rather limited in its clarification of what was wrong. Was it in rate limiting,
what kind, was it in an ATM switch(es), was it a PVC or something, was it in a router?
Is there a ticket that would provide more information? The information from our
end can be found at
Is there a Service Level Agreement (e.g. between Renater and FT) that addresses this?
What do we do next time we suspect something like this? Are there plans/needs to
have better escalation procedures to Renater, and then to their carriers? Will IN2P3
be meeting with Renater to discuss some of the above issues?
- Email from Dominique Boutigny Feb 26 '02:
- This problem has been escalated at the highest level on the French side.
Some email exchange today with the RENATER director clearly indicate that they
will not close the problem like that but will investigate to understand
where the problem was.
- Email from Jerome Bernier Feb 28 '02:
- Renater investigate more on this problem, but after a meeting
with France Telecom/Opentransit, FT told them that nobody was changed
on the FT side.
It was really magic ????
We don't want to close this problem as it is, and Renater doesn't want too.
But actually what is said is
. there is no changes on the application
. we (IN2P3) doesn't change anything on our router/switches
. Renater doesn't change anything
. Opentransit doesn't change anything
can you confirm that they were no changes in ESnet and SLAC ?
Les Cottrell and Joe Metzger responded that there had been no changes at SLAC or
vbr peak 40m sustained 30m burst 200;
The throughput achievable now with iperf is shown below:
On March 4 '02 Joe Metzger in response to a request from Jerome Bernier of IN2P3 reported
I have bumped the rate on this PVC up to 50 Mbps until we can get our line into AADS upgraded.
I think we will be able to manage this rate provided the traffic to the 60+ other peers at the NAP that share this line doesn't grow too fast.
We will drop this PVC back down to 30 Mbps if the line starts to get congested.
Our upgrade was ordered a while back but all upgrades to AADS circuits are taking forever. We have been told it will not happen before May. We have not been told when it will happen...
The following shows the throughput after upgrading the PVC to 50Mbits/s.
It is seen the iperf throughput has increased from about 20 Mbits/s to about