The aim of this study is to compare three different approaches for calculating throughput (Active, Passive, and Web100).
Active calculations are based on the bytes and times reported by the data transfer application (Iperf, bbcpmem (/dev/zero to /dev/null), bbcpdisk (disk to disk), or bbftp (disk to disk)). Passive calculations are based on the NetFlow records, which capture all the flows going into and out of SLAC. From these records we identify which flows go with which active application run. Web100 calculations are based on data and statistics that are readily available through Web100. Web100 exposes an enormous variety of statistics for each stream, from which we select the small set of variables needed to calculate throughput for this study. The variables we use are DataBytesOut, which reports the number of data bytes sent (not including headers, retransmitted packets, or SYN/FIN packets), and SndLimTimeSender, SndLimTimeCwnd, and SndLimTimeRwin, which are summed to closely approximate the elapsed time.
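The Web100 calculation described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the function name is hypothetical, the time unit of the SndLimTime counters is assumed to be microseconds (hence the `ticks_per_second` parameter), and reporting in Mbit/s is also an assumption.

```python
def web100_throughput_mbps(data_bytes_out,
                           snd_lim_time_sender,
                           snd_lim_time_cwnd,
                           snd_lim_time_rwin,
                           ticks_per_second=1_000_000):
    """Estimate per-stream throughput in Mbit/s from Web100 counters.

    The elapsed time is approximated by summing the time spent in the
    three sender-side limit states. ticks_per_second is an assumption
    (microsecond counters); adjust it if the counters use another unit.
    """
    elapsed_ticks = snd_lim_time_sender + snd_lim_time_cwnd + snd_lim_time_rwin
    if elapsed_ticks <= 0:
        raise ValueError("no elapsed time recorded for this stream")
    elapsed_seconds = elapsed_ticks / ticks_per_second
    return data_bytes_out * 8 / elapsed_seconds / 1e6


# Hypothetical stream: 125,000,000 data bytes sent over 2 s + 7 s + 1 s.
example_mbps = web100_throughput_mbps(125_000_000, 2_000_000, 7_000_000, 1_000_000)
```

With these made-up counter values the stream spent 10 seconds in total across the three states, so the estimate is 100 Mbit/s.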
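Both Web100 and NetFlow (passive) give us bytes and timing information per stream, from which we can calculate throughput using the following three methods (the subscript 's' means "aggregate over all streams"). Method 1 is sum_s(bytes / time), the sum of the individual stream throughputs. Method 2 is sum_s(bytes) / avg_s(time), the total bytes over all streams divided by the average stream time. Method 3 is sum_s(bytes) / max_s(time), the total bytes divided by the maximum stream time.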
| node | test | meth | x | y | R | samples | avg(x) | avg(y) | std(x) | std(y) | Error | min(x) | max(x) | min(y) | max(y) | Cov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| node1.cacr.caltech.edu | iperf | 1 | passive | web | 1.00 | 19 | 337.95 | 327.18 | 51.53 | 49.80 | 0.032 | 190.58 | 389.81 | 185.27 | 377.44 | 2566.26 |
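The columns of the row above appear to be related in a simple way, which the sketch below checks. The R value is consistent with the usual Pearson definition, R = Cov / (std(x) * std(y)). The Error value is consistent with (avg(x) - avg(y)) / avg(x), but that definition is inferred from the numbers in this one row and is not stated in the text. The function name and the choice of sample (n - 1) statistics are assumptions.

```python
import math


def summarize(x, y):
    """Compute avg, std, covariance, R, and relative error for paired samples.

    Uses sample (n - 1) statistics; whether the study used sample or
    population statistics is not stated, so this is an assumption.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally sized series with at least 2 points")
    avg_x = sum(x) / n
    avg_y = sum(y) / n
    cov = sum((a - avg_x) * (b - avg_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - avg_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - avg_y) ** 2 for b in y) / (n - 1))
    if std_x == 0 or std_y == 0:
        raise ValueError("R is undefined when a series has no variation")
    return {
        "avg_x": avg_x,
        "avg_y": avg_y,
        "cov": cov,
        "r": cov / (std_x * std_y),
        "error": (avg_x - avg_y) / avg_x,  # inferred definition, see lead-in
    }


# Cross-check using only the summary values printed in the table row.
row_r = 2566.26 / (51.53 * 49.80)
row_error = (337.95 - 327.18) / 337.95
```

Rounded to the precision used in the table, `row_r` matches the reported R of 1.00 and `row_error` matches the reported Error of 0.032.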
Passive vs Web100: The throughput calculations based on NetFlow records and Web100 data are very highly correlated. As of 8/26/2002, the average error across all nodes over all test runs is roughly -0.01 and the average correlation (R) is 0.96. The (error, R) averages over individual tests are (-0.01, 0.96) for bbcpdisk, (-0.02, 0.94) for bbcpmem, (-0.02, 0.98) for bbftp, and (0.02, 0.98) for iperf. See the passive/web100 summary for up-to-date statistics. Inspecting the correlation table, we can see that R is around 0.99 or 1.0 for the majority of the rows. In fact, 90% of all rows in the table have a correlation of >= 0.9. Again, see the summary for up-to-date figures. A small minority of the rows have a significantly lower R than expected. These aberrations occur for iperf, bbcpmem, and bbcpdisk across multiple nodes. Examining all of the runs of the offending tests on the nodes in question reveals the cause of these aberrations. Bbcpmem and bbcpdisk throughput calculations were corrupted when the passive data indicated that a stream had been open for orders of magnitude longer than it should have been during a test. We will refer to these streams as long flows. For example, consider these bbcpmem test runs to node1.cern.ch and node1.clrc.ac.uk. In both test runs there are one or more streams for which NetFlow reported an elapsed time of more than 10 minutes, yet the bbcpmem test is set to run for under 20 seconds for each node. This dramatically decreases the passive throughput calculation, thus increasing the error and decreasing the overall correlation between passive and Web100 calculations for that node (see the bbcpmem row for node1.cern.ch in the correlation table). This type of error was seen in prior IEPM research and was attributed to possible bugs in NetFlow, although that has not been confirmed.
It is worth stating that, thus far, whenever long flows are reported by NetFlow, Web100 reports normal flows and agrees with the Active output. On another occasion it was Web100 that falsely reported the elapsed time for a stream (see this iperf test run to node1.rcf.bnl.gov). This lowers Web100's throughput calculation for that test run. So while it is more often NetFlow that gives an unexpected elapsed time, Web100 can also report a false time, though this seems to be extremely rare. The reporting of long flows by NetFlow and by Web100 appears to be independent. So far they have not both reported the same long flow. Figures on how often and for which nodes these long flows occur are available.
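One simple way to flag long flows is to compare each stream's reported elapsed time with the configured test duration. The sketch below is only an illustration of that idea, not the method used in this study: the function name and the factor-of-10 threshold are assumptions, and the sample times are made up to resemble the bbcpmem case above (a test of under 20 seconds with one stream reported at more than 10 minutes).

```python
def find_long_flows(stream_seconds, expected_seconds, factor=10.0):
    """Return the indices of streams whose elapsed time looks like a long flow.

    A stream is flagged when its reported elapsed time exceeds
    factor * expected_seconds. The threshold is an arbitrary assumption.
    """
    if expected_seconds <= 0:
        raise ValueError("expected_seconds must be positive")
    limit = factor * expected_seconds
    return [i for i, t in enumerate(stream_seconds) if t > limit]


# Hypothetical per-stream elapsed times (seconds) for one 20-second test run.
flagged = find_long_flows([19.2, 18.7, 612.0, 19.9], expected_seconds=20.0)
```

Here only the third stream (index 2) is flagged, since 612 seconds exceeds the 200-second limit.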
Active vs Web100: The throughput calculations based on Active output and Web100 data are also well correlated overall. As of 8/26/2002, the average error across all nodes over all the tests is roughly -0.08 and the average correlation (R) is 0.92. The (error, R) averages over the individual tests are (0.06, 0.92) for bbcpdisk, (-0.01, 0.95) for bbcpmem, (-0.48, 0.87) for bbftp, and (0.09, 0.95) for iperf. So Active and Web100 throughput calculations are well correlated for all tests, and the error is low for all of the tests except bbftp. See the summary for more up-to-date details. The high error is expected for bbftp because of the way it actively calculates throughput. Bbftp considers the overall elapsed time to be the duration of the entire transfer, from when the program starts to when it ends. This time includes the connection setup phase, in which certain parameters are set up and communicated between the two nodes. This extra time is unrelated to the actual transfer time, so we decided to ignore the connection setup stream when processing Web100 records for the transfer. Because we consider only data transfer streams when calculating throughput, bbftp reports a longer elapsed time (and thus a lower throughput) than our Web100 calculations. This discrepancy may be larger for transfers between nodes with a longer RTT, since the initial handshaking phase will last longer; we are currently investigating this. Bbcpmem, bbcpdisk, and iperf all actively report a throughput calculation that does not include the initial handshaking. (Bbcpdisk also reports an additional bbftp-like throughput value, which we ignore, again because we only want to consider the throughput of the data transfer.) Approximately 75% (as of 8/26/2002) of all iperf, bbcpmem, and bbcpdisk rows in the Active vs. Web100 correlation table have an R >= 0.9. See the summary for up-to-date details.
There are still cases where the correlation is very low for iperf, bbcpmem, and bbcpdisk that cannot be explained by long flows. Many of these are caused by lingering sockets. During some transfers (especially ones with a large number of streams), the socket connection lingers for a few seconds before the kernel can properly close it. This causes the active elapsed time to be significantly less than the Web100 (or NetFlow) elapsed time, which results in a much lower throughput for Web100. The only way to avoid this effect is to monitor Web100 throughput during the entire transfer and to disregard the time at the end of the transfer when the sockets are being closed. There also used to be cases where the error was nearly 0, yet R was also 0, implying no correlation between the active and Web100 throughput calculations for those tests. This occurred for nodes/tests that experienced almost no fluctuation in throughput (a very small range), causing both the covariance and R to be low. To get a good measure of covariance and R, we need a larger range; otherwise noise will mask the natural relationship between the two data sets. Originally, we ran each test to each node once per day. Changing these tests to run 4 times a day (thereby adding temporal fluctuation to the throughput) seems to have fixed the low-range problem.
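The low-range effect can be illustrated with a small, fully synthetic example. Both series below receive the same fixed measurement noise; only the spread of the underlying throughput differs. All numbers are invented for illustration and are not taken from the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally sized series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally sized series with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R is undefined when a series has no variation")
    return sxy / math.sqrt(sxx * syy)


# The same fixed "noise" is added to both hypothetical measurements.
noise = [3, -3, -3, 3, 3, -3]

# Narrow range: throughput barely changes between runs.
narrow_x = [100, 101, 102, 103, 104, 105]
narrow_y = [v + e for v, e in zip(narrow_x, noise)]

# Wide range: throughput fluctuates strongly between runs.
wide_x = [100, 150, 200, 250, 300, 350]
wide_y = [v + e for v, e in zip(wide_x, noise)]

r_narrow = pearson_r(narrow_x, narrow_y)
r_wide = pearson_r(wide_x, wide_y)
```

With identical noise, `r_narrow` comes out below 0.5 while `r_wide` stays above 0.99, even though the two methods agree equally well in both cases. This is the sense in which a small range lets noise mask the relationship.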
Active vs Passive: The throughput calculations based on Active output and NetFlow records are also highly correlated. As of 8/26/2002, the average error across all nodes over all the tests is roughly -0.07 and the average correlation (R) is 0.91. The (error, R) averages over the individual tests are (0.07, 0.90) for bbcpdisk, (0.004, 0.89) for bbcpmem, (-0.42, 0.89) for bbftp, and (0.07, 0.96) for iperf. See the summary for up-to-date details. Much like the Active/Web100 results, Active and Passive throughput calculations are well correlated for all tests. The error is also small (average < 10%) for all tests except bbftp, for the reasons stated previously: bbftp considers the overall elapsed time to include the duration of the initial handshaking, whereas we intentionally ignore the stream involved in that initial handshaking when processing the NetFlow records for the transfer. Many of the cases where R is low for a node/test can be explained by the effect of long flows. For instance, consider the active/passive correlation for bbcpmem on node1.nersc.gov. The correlation is very low for both the method 2 and method 3 throughput calculations, yet it is high for method 1. It may seem strange at first that long flows affect two methods of throughput calculation but not the third, but the effect follows directly from how the three methods are defined. Method 1 sums the individual throughputs of each stream. Thus, if relatively few streams in the transfer suffer from a long flow, the individual throughputs of those streams are greatly lowered, but the overall sum is not altered much. Method 2, however, divides the total data over all streams by the average time of all streams. This average time is greatly distorted if even one of the streams has an extremely long elapsed time.
Similarly, method 3 divides the total data by the maximum time of all streams, so it is also greatly affected if even one of the streams has an extremely long elapsed time. Examining the stream-by-stream data, we can see that on 7/11/2002, 7/18/2002, 7/19/2002, and 7/21/2002, the node1.nersc.gov bbcpmem transfer suffered from exactly one long flow stream. This explains the dramatic lowering of the overall R value for node1.nersc.gov/bbcpmem in the active/passive correlation table for methods 2 and 3 but not method 1. Active/Passive correlation also suffers from lingering sockets.
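The different sensitivity of the three methods to a single long flow can be seen in a short sketch. The function names and the stream data are hypothetical: eight streams of 10,000,000 bytes each over 20 seconds, with one stream's elapsed time then replaced by 600 seconds to imitate a long flow.

```python
def method1(streams):
    """Sum of the individual stream throughputs (bytes per second)."""
    return sum(nbytes / seconds for nbytes, seconds in streams)


def method2(streams):
    """Total bytes divided by the average stream time."""
    total_bytes = sum(nbytes for nbytes, _ in streams)
    avg_seconds = sum(seconds for _, seconds in streams) / len(streams)
    return total_bytes / avg_seconds


def method3(streams):
    """Total bytes divided by the maximum stream time."""
    total_bytes = sum(nbytes for nbytes, _ in streams)
    return total_bytes / max(seconds for _, seconds in streams)


# Hypothetical transfer: (bytes, elapsed seconds) for each of 8 streams.
normal = [(10_000_000, 20.0)] * 8
# Same transfer, but one stream is reported with a 600-second elapsed time.
with_long_flow = [(10_000_000, 20.0)] * 7 + [(10_000_000, 600.0)]

# Fraction of the normal throughput that each method still reports.
ratio1 = method1(with_long_flow) / method1(normal)
ratio2 = method2(with_long_flow) / method2(normal)
ratio3 = method3(with_long_flow) / method3(normal)
```

In this example method 1 retains roughly 88% of the normal throughput, method 2 roughly 22%, and method 3 roughly 3%, which matches the pattern described above: one long flow barely moves the sum of per-stream throughputs but badly distorts the average and maximum times.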
Passive vs Web100: The stream-by-stream comparisons reveal that the elapsed times reported by NetFlow (passive) and Web100 are relatively close, with NetFlow consistently reporting a slightly larger time. The similarity in their time estimates can be attributed to the fact that both methods perform a similar function, namely, passively observing network traffic, taking notice of when each stream starts, and recording statistics for each stream until the stream terminates. The fact that NetFlow consistently reports a longer elapsed stream time than Web100 may be because Web100 does not report an absolute 'elapsed time' value. Web100 reports the total time spent in each of three mutually exclusive TCP states: the "Sender Limited", "Congestion Limited", and "Receiver Limited" states (there is a short description of these states in the Web100 Variable Documentation). One must add the three values together to get an estimate of the total elapsed time. One possible explanation for the difference is that the microseconds spent transitioning between the states are not recorded and, over the duration of the stream, accumulate to a noticeable, albeit small, time differential. NetFlow also consistently reports a larger bytes transferred value. This is expected, since NetFlow counts all traffic sent between the two hosts, including TCP packet retransmissions and control packets with no payload. It also includes the TCP header as part of the total byte count. This behavior is appropriate since NetFlow functions a layer or two below TCP and should concern itself with the total bytes it sees, rather than looking into each packet to determine whether it is a retransmission, and so on.
Web100, since it is essentially a window into the kernel's TCP implementation, can report all kinds of statistics, including a 'DataBytesOut' value that disregards retransmitted packets and packets with no TCP payload, allowing us to get a byte total that more closely resembles the actual number of data bytes sent by the test application. So in times of high congestion, when there may be a lot of retransmissions, NetFlow will report an appropriately inflated 'bytes transferred' value, whereas Web100's 'bytes transferred' value will not increase from the expected value. Passive analysis usually results in a higher calculated throughput than Web100, since the increase in 'bytes' reported by NetFlow is proportionally larger than the increase in 'time' reported by NetFlow. The throughput calculations from Passive and Web100 are, however, very highly correlated (see the above discussion of the correlation tables).
Active vs Web100: We can only compare the stream-by-stream throughput calculations between Active and Web100 for the Iperf test, the only test that actively reports statistics for each parallel stream in the test run. Elapsed time is consistently reported lower by Iperf (Active) than by Web100. This is expected, as Iperf only starts the elapsed time counter when the first data packets are sent, whereas Web100 tracks the TCP connection from its conception to its eventual destruction. Still, a very peculiar pattern emerged. If you look at any stream-by-stream comparison, for example this one, you'll notice that for each table, as you move down from row to row, the difference between the Iperf- and Web100-reported times gets progressively smaller. Because the tables were populated with streams in the order that Iperf actively reported them (and the ordering is likely related to the time each stream was created by Iperf), a possible explanation emerged: perhaps Iperf creates each stream (thereby also establishing a TCP connection for each) sequentially, and waits until all streams are created before simultaneously allowing each stream to proceed with the data transfer. This would explain why the earlier streams (higher up in the table) report longer time differentials: they spent more time waiting than the entries lower down in the table. This reasoning was confirmed by Ajay Tirumala, one of the creators of Iperf, so we are satisfied with our explanation for this particular behavior pattern. The bytes transferred totals reported by Iperf and Web100 are nearly identical, with Iperf consistently reporting the lower value. We are currently investigating this discrepancy. Perhaps the actual creation of the stream sends some TCP data that is counted by Web100 but not by Iperf.
Active vs Passive: Again, this comparison is only possible for the Iperf test, since it is the only test that reports statistics on a stream-by-stream basis. The comparison for elapsed time is similar to the comparison of Active vs Web100. Again, Iperf reports a lower elapsed time since it only starts the timer for a stream when it begins sending actual data, whereas NetFlow (passive) considers the start time to be when the first TCP packet is sent during the connection setup. Also, the difference between the elapsed time reported by Iperf and by NetFlow decreases as you move down the table (moving sequentially from earlier to later streams). This difference exists for the same synchronization reason as explained in the previous paragraph. The comparison for bytes sent per stream is very similar to the comparison between Passive and Web100. NetFlow (passive) includes every packet from the source to the destination port in its byte count for the corresponding stream, whereas Iperf only reports the data bytes sent. Again, NetFlow includes the TCP and possibly the IP headers for each packet, thus padding its total even more (roughly 40 bytes per packet).
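The size of the header padding can be estimated with simple arithmetic. The sketch below assumes 40 bytes of TCP and IP headers per packet, as mentioned above, and a hypothetical stream of 10,000 packets carrying 1460 payload bytes each. The function name and the packet sizes are illustrative only. Retransmissions and payload-free control packets, which NetFlow also counts, would add further bytes and are not modeled here.

```python
def estimated_netflow_bytes(payload_bytes, packets, header_bytes_per_packet=40):
    """Estimate a NetFlow-style byte count from application payload bytes.

    Adds a fixed per-packet header overhead. The 40-byte default assumes
    TCP and IP headers without options.
    """
    if payload_bytes < 0 or packets < 0:
        raise ValueError("payload_bytes and packets must be non-negative")
    return payload_bytes + packets * header_bytes_per_packet


# Hypothetical stream: 10,000 packets with 1460 payload bytes each.
packets = 10_000
payload = 1460 * packets
wire = estimated_netflow_bytes(payload, packets)

# Relative inflation of the byte count caused by headers alone.
overhead_fraction = (wire - payload) / payload
```

For this example the headers add 400,000 bytes to 14,600,000 payload bytes, an inflation of about 2.7%.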
It is apparent that Web100 closely resembles passive methods when reporting elapsed time, while closely resembling active methods when reporting bytes transferred. It remains to be seen whether Web100 is more highly correlated with passive or active methods in regard to the overall throughput calculations. Refer to the section that discusses correlation tables.