Problems with Oracle between SLAC and ORNL, Jan 2006

Les Cottrell (SLAC) and Bill Wing (ORNL) Page created: February 2, 2006


Problem

Terri Lahey of SLAC's MCC reported the following problem by email:

From: Lahey, Terri E. 
Sent: Wednesday, February 01, 2006 2:18 PM
To: Cottrell, Les
Cc: Lahey, Terri E.
Subject: connection to SNS

Les,

Is the connection between SLAC and SNS (ornl) running at high throughput?

The LCLS software team is testing XAL code connecting to an RDB 
(ORACLE, I think) using servers at SNS (ORNL). They see slowness, and are 
trying to identify where it is. They are connecting to snsdb1.sns.ornl.gov (port: 1521)

My guess is that this is slowness once they get into SNS, or on their 
application desktop. Can you check if packet traffic is fast between SLAC 
and SNS, so we can eliminate that?

Terri

ps. here's a traceroute that looks like icmp is disabled into SNS.

lahey@flora03 $ traceroute snsdb1.sns.ornl.gov
traceroute: Warning: checksums disabled
traceroute to snsdb1.sns.ornl.gov (160.91.230.34), 30 hops max, 40 byte packets
 1  rtrg-nethub.slac.stanford.edu (134.79.19.1)  0.414 ms  0.284 ms  0.217 ms
 2  134.79.255.25 (134.79.255.25)  11.183 ms  0.504 ms  0.351 ms
 3  rtr-dmz1-ger.slac.stanford.edu (134.79.135.15)  0.335 ms  0.341 ms  0.347 ms
 4  192.68.191.146 (192.68.191.146)  0.469 ms  0.515 ms  0.463 ms
 5  slacmr1-slacrt4.es.net (134.55.209.93)  0.584 ms  0.481 ms  0.461 ms
 6  snv2mr1-slacmr1.es.net (134.55.217.2)  0.838 ms  0.785 ms  0.723 ms
 7  snv1mr1-snv2mr1.es.net (134.55.217.5)  0.837 ms  0.752 ms  0.721 ms
 8  snvcr1-snv1mr1.es.net (134.55.218.21)  0.838 ms  0.824 ms  0.845 ms
 9  elpcr1-oc48-snvcr1.es.net (134.55.209.218)  27.545 ms  27.522 ms  27.553 ms 
10  atlcr1-oc48-elpcr1.es.net (134.55.209.222)  61.824 ms  61.692 ms  61.869 ms
11  ornl-oc48-atlcr1.es.net (134.55.213.210)  66.308 ms  66.467 ms  66.357 ms
12  192.31.96.1 (192.31.96.1)  68.730 ms  68.145 ms  66.492 ms
13  * * *
14  * * *
Over the phone Terri also indicated that the time to run the "transaction" between hosts at ORNL is 3-10 seconds, while between SLAC and ORNL it is 25-30 seconds.

Analysis

The end site or host does appear to block pings, since we were unable to ping the remote host. This also explains the lack of responses from hop 13 onwards in the traceroute: the ICMP replies the probes depend on are being filtered.

Traceroutes

From ORNL to SLAC:
traceroute to iepm-resp.slac.stanford.edu (134.79.240.36), 64 hops 
max, 40 byte packets
  1  swgecsb-1-004.ens.ornl.gov (160.91.212.1)  0.667 ms  0.252 ms  0.208 ms
  2  ornlgwy.ens.ornl.gov (160.91.0.1)  0.272 ms  0.229 ms  0.241 ms
  3  orgwy-fw (192.31.96.161)  0.558 ms  0.501 ms  0.397 ms
  4  ornl-rt3-ge.cind.ornl.gov (192.31.96.2)  0.560 ms  0.576 ms  0.446 ms
  5  atlcr1-oc48-ornl.es.net (134.55.213.209)  5.201 ms  99.826 ms  29.394 ms
  6  elpcr1-oc48-atlcr1.es.net (134.55.209.221)  39.546 ms  39.549 ms  39.469 ms
  7  snvcr1-oc48-elpcr1.es.net (134.55.209.217)  66.154 ms  66.136 ms  69.778 ms
  8  snv1mr1-snvcr1.es.net (134.55.218.22)  66.187 ms  66.198 ms  66.255 ms
  9  snv2mr1-snv1mr1.es.net (134.55.217.6)  66.192 ms  66.197 ms  66.374 ms
 10  slacmr1-snv2mr1.es.net (134.55.217.1)  66.658 ms  67.601 ms  66.668 ms
 11  slacrt4-slacmr1.es.net (134.55.209.94)  66.671 ms  66.737 ms  66.555 ms
 12  rtr-dmz1-vlan400.slac.stanford.edu (192.68.191.149)  66.868 ms  66.905 ms  66.703 ms
 13       * * *
 14       * * *
 15  iepm-resp.slac.stanford.edu (134.79.240.36)  66.599 ms 66.957 ms  66.775 ms
Comparing this with the SLAC-to-ORNL traceroute that Terri ran (shown above under Problem), the two routes appear to be the inverse of each other inside ESnet, but are not symmetric inside Stanford/SLAC.

Pingroute

The small amount of dispersion in the traceroute hop responses may indicate there is little congestion. To verify this, and to probe the routers along the route, we ran fpingroute.pl with the results below:
112cottrell@noric01:~>bin/fpingroute.pl -i 3 -c 1000 snsdb1.sns.ornl.gov
Wed Feb  1 15:11:25 2006 Architecture=LINUX, commands=/usr/sbin/traceroute -q 1 and fping snsdb1.sns.ornl.gov
fpingroute.pl version=0.21, 8/24/04. Author cottrell@slac.stanford.edu, debug=1
  using traceroute to get nodes in route from noric01 (134.79.86.51) to snsdb1.sns.ornl.gov starting at node 3
traceroute to snsdb1.sns.ornl.gov (160.91.230.34), 30 hops max, 38 byte packets
fpingroute.pl version 0.21, 8/24/04 found 30 hops in route from noric01 to snsdb1.sns.ornl.gov
3  slac-rt4.es.net (192.68.191.146)  0.292 ms
4  slacmr1-slacrt4.es.net (134.55.209.93)  0.255 ms
5  snv2mr1-slacmr1.es.net (134.55.217.2)  0.868 ms
6  snv1mr1-snv2mr1.es.net (134.55.217.5)  0.732 ms
7  snvcr1-snv1mr1.es.net (134.55.218.21)  0.822 ms
8  elpcr1-oc48-snvcr1.es.net (134.55.209.218)  27.571 ms
9  atlcr1-oc48-elpcr1.es.net (134.55.209.222)  61.616 ms
10  ornl-oc48-atlcr1.es.net (134.55.213.210)  66.251 ms
11  192.31.96.1 (192.31.96.1)  66.318 ms
12  *
...
30  *

Wed Feb  1 15:13:01 2006 wrote 28 addresses to /tmp/fpingaddr
  now ping each address 1000 times from noric01 starting at hop 3 ...
             pings/node=1000                      100 byte packets            1400 byte packets
             to NODE (from noric01)             %loss   min    max    avg   %loss   min    max    avg
 3                    slac-rt4.es.net           0.0%    0.5   45.3    1.4   0.0%    1.0   50.6    1.7
 4             slacmr1-slacrt4.es.net           0.0%    0.3  229.3    5.3   0.0%    0.7  221.7    5.5
 5             snv2mr1-slacmr1.es.net           0.0%    0.7  197.6    6.5   0.0%    1.0  218.7    6.5
 6             snv1mr1-snv2mr1.es.net           0.0%    0.7  215.8    5.6   0.1%    1.1  217.4    6.4
 7              snvcr1-snv1mr1.es.net           0.0%    0.8   24.6    1.1   0.0%    1.5   33.8    1.8
 8          elpcr1-oc48-snvcr1.es.net           0.0%   27.6  100.0   30.0   0.0%   28.5  110.3   30.9
 9          atlcr1-oc48-elpcr1.es.net           0.0%   61.8  133.4   64.9   0.0%   62.7  140.0   66.2
10            ornl-oc48-atlcr1.es.net           0.0%   66.4  113.1   67.4   0.0%   67.2  135.6   68.5
11                        192.31.96.1           0.1%   66.2  272.1   67.6   0.6%   66.7   95.8   67.0
Wed Feb  1 15:46:41 2006 fpingroute.pl done.
113cottrell@noric01:~>
This indicates there is low loss at least up to the last responding hop, and little dispersion in the RTTs at El Paso and beyond. Unfortunately, pings to the end host are blocked, so we cannot measure the end-to-end loss. If we use the Mathis formula to estimate the maximum standard TCP throughput, then for 0.1% loss we get 52 Mbits/s and for 0.6% loss we get 22 Mbits/s.
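
The Mathis estimate can be sketched as below. This is a minimal illustration only: the MSS (1460 Bytes) and the constant C (taken as 1 here) are assumptions, so the absolute numbers need not match the figures quoted above, though the ratio between the two loss rates is the same.

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_s, loss, c=1.0):
    """Mathis et al. upper bound on standard TCP throughput:
    rate <= C * MSS / (RTT * sqrt(p)).  Result in Mbits/s."""
    return (c * mss_bytes * 8) / (rtt_s * math.sqrt(loss)) / 1e6

# Loss rates measured at hop 11 for 100 and 1400 Byte pings,
# over the ~70 ms SLAC-ORNL RTT
for loss in (0.001, 0.006):
    print(f"loss {loss:.1%}: {mathis_throughput_mbps(1460, 0.070, loss):.1f} Mbits/s")
```

Whatever constants are used, the estimate scales as 1/sqrt(loss), so going from 0.1% to 0.6% loss cuts the predicted throughput by a factor of sqrt(6) ≈ 2.4.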

Pathneck

Next we ran pathneck, a packet-train method to discover the bottleneck along a path. Typical results appear below and indicate that the bottleneck is in the hundreds of Mbits/s range:
8cottrell@iepm-resp:~>sudo /afs/slac.stanford.edu/package/netperf/bin/@sys/pathneck snsdb1.sns.ornl.gov
1138834332.027564 160.91.230.34 500 60 0

00   0.227    134.79.243.1   1038
01   0.225    134.79.252.5   1019
02   0.254   134.79.135.15   1009
03   0.346  192.68.191.146    957
04   0.345   134.55.209.93    926
05   0.718    134.55.217.2    972
06   0.753    134.55.217.5    920
07   0.772   134.55.218.21    926
08  27.492  134.55.209.218    908
09  61.680  134.55.209.222    877
10  66.239  134.55.213.210    946
11  66.303     192.31.96.1    817
Then we looked at the router utilization plots for ESnet routers along the route. Hops 6-10 showed light utilization.

abing

We tried running abing, a two-way packet-pair bandwidth estimation tool. It gave:
90cottrell@iepm-resp:~>abing -t 5 -n 10 -b 80 -d wrw.ornl.gov
1138854487 T: 160.91.212.99 ABw-Xtr-DBC: 940.7  55.3 996.0 ABW: 940.7 Mbps RTT: 67.128 70.319 100.956 ms 80 80
1138854487 F: 160.91.212.99 ABw-Xtr-DBC:   4.0 817.2 821.2 ABW:   4.0 Mbps RTT: 67.128 70.319 100.956 ms 80 80
1138854494 T: 160.91.212.99 ABw-Xtr-DBC: 905.4  94.6 1000.0 ABW: 931.9 Mbps RTT: 67.106 70.098 100.808 ms 80 80
1138854494 F: 160.91.212.99 ABw-Xtr-DBC:   4.2 782.8 786.9 ABW:   4.0 Mbps RTT: 67.106 70.098 100.808 ms 80 80
1138854502 T: 160.91.212.99 ABw-Xtr-DBC: 826.2 172.3 998.6 ABW: 905.5 Mbps RTT: 67.126 69.991 100.840 ms 80 80
1138854502 F: 160.91.212.99 ABw-Xtr-DBC:   4.5 669.9 674.4 ABW:   4.1 Mbps RTT: 67.126 69.991 100.840 ms 80 80
1138854509 T: 160.91.212.99 ABw-Xtr-DBC: 956.0  43.3 999.4 ABW: 918.1 Mbps RTT: 67.132 67.831 96.850 ms 80 80
1138854509 F: 160.91.212.99 ABw-Xtr-DBC:  19.6 598.0 617.6 ABW:   8.0 Mbps RTT: 67.132 67.831 96.850 ms 80 80
1138854517 T: 160.91.212.99 ABw-Xtr-DBC: 983.4  16.6 1000.0 ABW: 934.4 Mbps RTT: 67.130 69.322 100.845 ms 80 80
1138854517 F: 160.91.212.99 ABw-Xtr-DBC:   5.7 677.4 683.0 ABW:   7.4 Mbps RTT: 67.130 69.322 100.845 ms 80 80
1138854525 T: 160.91.212.99 ABw-Xtr-DBC: 971.0  28.7 999.6 ABW: 943.6 Mbps RTT: 67.137 68.124 86.116 ms 80 80
1138854525 F: 160.91.212.99 ABw-Xtr-DBC:  13.1 633.1 646.2 ABW:   8.8 Mbps RTT: 67.137 68.124 86.116 ms 80 80
1138854532 T: 160.91.212.99 ABw-Xtr-DBC: 1000.0   0.0 1000.0 ABW: 957.7 Mbps RTT: 67.147 68.956 100.861 ms 80 80
1138854532 F: 160.91.212.99 ABw-Xtr-DBC:   6.8 740.6 747.4 ABW:   8.3 Mbps RTT: 67.147 68.956 100.861 ms 80 80
1138854540 T: 160.91.212.99 ABw-Xtr-DBC: 910.7  89.3 1000.0 ABW: 945.9 Mbps RTT: 67.117 67.933 100.696 ms 80 80 
1138854540 F: 160.91.212.99 ABw-Xtr-DBC:  16.8 714.7 731.6 ABW:  10.5 Mbps RTT: 67.117 67.933 100.696 ms 80 80
1138854547 T: 160.91.212.99 ABw-Xtr-DBC: 990.6   9.4 1000.0 ABW: 957.1 Mbps RTT: 67.137 70.655 100.833 ms 80 80
1138854547 F: 160.91.212.99 ABw-Xtr-DBC:   3.5 367.6 371.0 ABW:   8.7 Mbps RTT: 67.137 70.655 100.833 ms 80 80
1138854555 T: 160.91.212.99 ABw-Xtr-DBC: 962.4  37.6 1000.0 ABW: 958.4 Mbps RTT: 67.131 68.161 100.704 ms 80 80
1138854555 F: 160.91.212.99 ABw-Xtr-DBC:  12.7 676.4 689.1 ABW:   9.7 Mbps RTT: 67.131 68.161 100.704 ms 80 80
(Avg/Sdev) RTT: 69.139/1.027 ms ABW To: 944.647/49.543 From:  9.085/5.654 Mbits/s Exit 82 91cottrell@iepm-resp:~>
Here F denotes the direction from ORNL to SLAC and T the direction from SLAC to ORNL; ABw = available bandwidth, Xtr = cross-traffic, and DBC = dynamic bandwidth capacity. There is little congestion (low Xtr) and plenty of available bandwidth from SLAC to ORNL. From ORNL to SLAC there appears to be cross-traffic and much less available bandwidth.

Iperf

From ORNL to SLAC using a 1MByte window we got:
Client connecting to iepm-resp.slac.stanford.edu, TCP port 5001
TCP window size: 1.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 160.91.212.99 port 49396 connected with 134.79.240.36 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  66.4 MBytes   111 Mbits/sec
[  3]  5.0-10.0 sec  73.4 MBytes   123 Mbits/sec
[  3] 10.0-15.0 sec  72.3 MBytes   121 Mbits/sec
[  3] 15.0-20.0 sec  20.3 MBytes  34.1 Mbits/sec
[  3] 20.0-25.0 sec  30.4 MBytes  51.0 Mbits/sec
[  3] 25.0-30.0 sec  33.6 MBytes  56.4 Mbits/sec
[  3] 30.0-35.0 sec  37.1 MBytes  62.2 Mbits/sec
[  3] 35.0-40.0 sec  39.5 MBytes  66.3 Mbits/sec
[  3] 40.0-45.0 sec  42.2 MBytes  70.9 Mbits/sec
[  3] 45.0-50.0 sec  45.6 MBytes  76.5 Mbits/sec
[  3] 50.0-55.0 sec  48.9 MBytes  82.0 Mbits/sec
[  3] 55.0-60.0 sec  39.4 MBytes  66.1 Mbits/sec
[  3]  0.0-60.5 sec   549 MBytes  76.2 Mbits/sec
With an RTT of ~70 ms, using the bandwidth-delay product (BDP) to predict the achievable throughput, a 1 MByte window will only give about 115 Mbits/s.
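
The window-limited estimate is a one-line calculation; this sketch assumes 1 MByte = 10^6 Bytes (with 2^20 Bytes the cap is ~120 Mbits/s):

```python
def bdp_limited_mbps(window_bytes, rtt_s):
    """Window-limited TCP throughput: at most one full window of data
    can be in flight per round trip, so rate <= window / RTT."""
    return window_bytes * 8 / rtt_s / 1e6

# 1 MByte window over the ~70 ms SLAC-ORNL RTT
print(f"{bdp_limited_mbps(1_000_000, 0.070):.0f} Mbits/s")  # prints "114 Mbits/s"
```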

Using multiple parallel streams from ORNL to SLAC we get:

wrw:~/cstuff/iperf wrw$ ./iperf -c iepm-resp.slac.stanford.edu -w 1m -t 60 -i 200 -P 10
------------------------------------------------------------
Client connecting to iepm-resp.slac.stanford.edu, TCP port 5001
TCP window size: 1.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[ 12] local 160.91.212.99 port 49662 connected with 134.79.240.36 port 5001
[  3] local 160.91.212.99 port 49653 connected with 134.79.240.36 port 5001
[  5] local 160.91.212.99 port 49655 connected with 134.79.240.36 port 5001
[  4] local 160.91.212.99 port 49654 connected with 134.79.240.36 port 5001
[ 11] local 160.91.212.99 port 49661 connected with 134.79.240.36 port 5001
[  6] local 160.91.212.99 port 49656 connected with 134.79.240.36 port 5001
[  8] local 160.91.212.99 port 49658 connected with 134.79.240.36 port 5001
[  9] local 160.91.212.99 port 49659 connected with 134.79.240.36 port 5001
[ 10] local 160.91.212.99 port 49660 connected with 134.79.240.36 port 5001
[  7] local 160.91.212.99 port 49657 connected with 134.79.240.36 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 11]  0.0-60.3 sec   146 MBytes  20.3 Mbits/sec
[  5]  0.0-60.5 sec   134 MBytes  18.6 Mbits/sec
[ 12]  0.0-60.5 sec  99.6 MBytes  13.8 Mbits/sec
[  9]  0.0-60.7 sec   132 MBytes  18.3 Mbits/sec
[  3]  0.0-60.8 sec   178 MBytes  24.6 Mbits/sec
[  7]  0.0-60.8 sec   173 MBytes  23.9 Mbits/sec
[ 10]  0.0-60.8 sec   223 MBytes  30.8 Mbits/sec
[  8]  0.0-61.0 sec   218 MBytes  29.9 Mbits/sec
[  6]  0.0-61.3 sec   183 MBytes  25.0 Mbits/sec
[  4]  0.0-62.3 sec   161 MBytes  21.7 Mbits/sec
[SUM]  0.0-62.3 sec  1.61 GBytes   222 Mbits/sec
From SLAC to ORNL (the ORNL max receive window was set to 1MByte):
4cottrell@iepm-resp:~>iperf -c wrw.ornl.gov -w 1m -t 60 -i 5
------------------------------------------------------------
Client connecting to wrw.ornl.gov, TCP port 5001
TCP window size: 20.0 MByte (WARNING: requested 10.0 MByte)
------------------------------------------------------------
[  3] local 134.79.240.36 port 43536 connected with 160.91.212.99 port 5001
[  3]  0.0- 5.0 sec  59.8 MBytes    100 Mbits/sec
[  3]  5.0-10.0 sec  20.0 MBytes  33.5 Mbits/sec
[  3] 10.0-15.0 sec  15.0 MBytes  25.2 Mbits/sec
[  3] 15.0-20.0 sec  9.90 MBytes  16.6 Mbits/sec
[  3] 20.0-25.0 sec  14.9 MBytes  24.9 Mbits/sec
[  3] 25.0-30.0 sec  15.0 MBytes  25.1 Mbits/sec
[  3] 30.0-35.0 sec  14.9 MBytes  25.0 Mbits/sec
[  3] 35.0-40.0 sec  9.90 MBytes  16.6 Mbits/sec
[  3] 40.0-45.0 sec  4.96 MBytes  8.32 Mbits/sec
[  3] 45.0-50.0 sec  14.9 MBytes  25.0 Mbits/sec
[  3] 50.0-55.0 sec  10.1 MBytes  16.9 Mbits/sec
[  3] 55.0-60.0 sec  4.98 MBytes  8.35 Mbits/sec
[  3]  0.0-60.3 sec    194 MBytes  27.0 Mbits/sec
It is seen that slow start ramps up to about 100 Mbits/s, but the throughput is then limited to 10-30 Mbits/s. Using multiple (8) parallel streams with a 1 MByte window we got about 100 Mbits/s.

The reason we get higher iperf throughput from ORNL to SLAC than vice versa may be the asymmetric utilization of the inbound (to SLAC) and outbound links at the SLAC border seen here. This shows the outbound link sustaining 500-700 Mbits/s, while the inbound link typically sees less than 100 Mbits/s.

Hosts

The hosts at SLAC are lcls-rogind.slac.stanford.edu and lcls-fairley.slac.stanford.edu. Lcls-rogind is an Intel Pentium 4 with a 3 GHz CPU and a 100 Mbits/s Fast Ethernet NIC, so the network limitation from the host hardware should be close to 100 Mbits/s. It is running Linux 2.4.21. The TCP window size is set to the default of 132 KBytes. This will limit the TCP throughput for a single stream to about 1 Mbits/s.

Netflow

We extracted the Netflow records for the flows to/from snsdb1.sns.ornl.gov port 1521. They are available here. Most of the data, by about a factor of 5, is going from ORNL to SLAC. The average throughput per flow from SLAC to ORNL is 15 kbits/s (max 22 kbits/s); from ORNL to SLAC it is 85 kbits/s (max 832 kbits/s). Probably most of the data is being transferred from ORNL to SLAC, and the SLAC-to-ORNL traffic consists of acknowledgements, requests and control information.

The flows are all very short (the maximum number of uni-directional packets in a flow is 172, the average 141). These flows are short enough that TCP, over such a long-RTT link, never gets out of startup: the bandwidth * delay product for a 100 Mbits/s bottleneck on a 70 ms RTT is about 600 1500-Byte segments. The distribution of flow sizes from SLAC to ORNL is multi-modal, with two major peaks at about 104 Bytes and 31000 Bytes. For ORNL to SLAC it is also mainly bi-modal, with peaks at 104 Bytes and 174000 Bytes (166 packets). For 166 packets on such a link TCP is still in slow start, so the best one could hope for in TCP throughput, with the number of packets in flight doubling each RTT, would be about 3 Mbits/s.
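
The slow-start limit can be sketched as follows. This is a simplified model (initial congestion window of one segment doubling each round trip, no delayed ACKs, no losses), so it lands near, but not exactly on, the ~3 Mbits/s figure above.

```python
import math

def slow_start_rounds(total_segments):
    """Round trips for TCP slow start (cwnd doubling from 1 segment)
    to deliver total_segments: after k rounds, 2**k - 1 segments sent."""
    return math.ceil(math.log2(total_segments + 1))

def slow_start_mbps(total_bytes, total_segments, rtt_s):
    """Rough throughput when the entire flow fits inside slow start."""
    return total_bytes * 8 / (slow_start_rounds(total_segments) * rtt_s) / 1e6

# A 166-packet, ~174000-Byte flow over the ~70 ms RTT path
print(slow_start_rounds(166))  # 8 round trips
print(f"{slow_start_mbps(174000, 166, 0.070):.1f} Mbits/s")
```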

Since the flows come in closely-spaced groups (e.g. ~60 separate flows from ORNL to SLAC in a 431 second interval), we also looked at the aggregate uni-directional throughput for all the flows in a group. To do this we took the total number of Bytes sent in a group of flows from ORNL to SLAC and divided it by the time between the start of the first flow and the end of the last flow. This yielded ~24 MBytes transferred in about 460 seconds across about 60 flows, for a typical aggregate throughput of 7.5 kbits/s.

Resolution

The aggregate throughput of 7.5 kbits/s is far below what the network is capable of (several hundred Mbits/s). It is also well below what the NIC and Ethernet connection on the host are capable of (80-90 Mbits/s), and well below what TCP can deliver even in slow start (3 Mbits/s) or in its stable state (1 Mbits/s) with the default host TCP window settings. Even the best flow only achieved 832 kbits/s, which may be approaching the TCP window size limitation. It would thus appear that this is not a network problem. The short flows indicate that the application is not well designed for the wide area.

If the Oracle application can be tuned for better WAN throughput, the next limitation is likely to be the small default TCP window size. We have sent email to the Unix admins to increase this window size so it will not be the limitation. We do not know the TCP window size on the host at ORNL, nor can we measure loss to that host since pings are blocked. To probe this problem further it would help to have an account at the ORNL end, and to bring in the application developers, Oracle experts, ORNL staff etc.
Page owner: Les Cottrell