Author: Les Cottrell. Created: March 25
This is the 3rd in a series of workshops held at roughly yearly intervals. The previous two were held in Waikato, New Zealand, and in Amsterdam. The numbers of attendees were 60, 120 and 100 at this one. There was a panel session at which the panelists were posed the questions: what critical problems need to be addressed, and what are the key enablers?
ICMP is used by many network tools such as ping, traceroute and pathchar, so it is important to know whether routers treat ICMP as they do regular packets, and if not, how much noise is introduced. They spoof the source address to be the destination address so the ICMP TTL exceeded packet is sent on to the destination rather than back to the source. Then they can compare this with the direct packet which is sent to the destination with a full TTL. Ingress filtering routers will drop packets (but the error message is actually sent on to the destination!), so the method is only useful until the probe hits filtering. Unclear whether ingress filtering routers may introduce a bias, e.g. they might be more technologically advanced. 5 of the 11 routers had ingress filtering. The tool built on this is called fsd. It runs on NIMI (ran on 11 boxes) and does not detect path changes. Ran as a mesh, probed each pairwise path bi-directionally, and ran 2 hour traces. Used interpolation between the adjacent direct probes (one before, one after) to get a baseline to subtract from the ICMP times. Found that the difference for most routers is < 500us. But there are puzzling outliers. Some of the outliers may be due to clock shifts, but they have not been able to validate this. Many load balancers differentiate based on source address, so can check by also spoofing the source address on the direct probe. But even then the ICMP TTL exceeded will have the source address of the router that sent it. This gets rid of some of the outliers. Also tried back-to-back (hop limited and direct) probes, but this inflates the estimates of the difference. Would expect a deflation since the hop limited probe goes first and so causes delay for the direct probe, but actually get 70% out of order delivery. See http://www.icir.org/ramesh/private/misc/fwd-1.21.tar.gz for the source code of fsd.
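The interpolation step can be sketched as follows. This is a minimal illustration, not the fsd code: the function names and the (send_time, delay) tuple layout are assumptions, and linear interpolation is assumed since the notes only say "interpolation".

```python
from bisect import bisect_left

def interpolate_direct(direct, t):
    """Linearly interpolate the direct-probe delay at time t.

    direct: list of (send_time, delay), sorted by strictly increasing send_time.
    Returns None when t is not bracketed by two direct probes.
    """
    times = [s for s, _ in direct]
    i = bisect_left(times, t)
    if i == 0 or i == len(direct):
        return None  # no direct probe before t, or none at/after t
    (t0, d0), (t1, d1) = direct[i - 1], direct[i]
    return d0 + (d1 - d0) * (t - t0) / (t1 - t0)

def icmp_excess_delay(direct, hop_limited):
    """For each hop-limited probe, subtract the interpolated direct delay.

    hop_limited: list of (send_time, delay) for probes that triggered
    an ICMP TTL exceeded message. Returns a list of (send_time, excess).
    """
    out = []
    for t, d in hop_limited:
        base = interpolate_direct(direct, t)
        if base is not None:
            out.append((t, d - base))
    return out
```

Probes that fall outside the span of the direct probes are dropped rather than extrapolated, which is a design choice of this sketch.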
Capacity is the maximum throughput possible when there is no cross-traffic; it is limited by the narrow link. They have pathrate to measure this capacity. The available bandwidth is the unutilized capacity of a link; along a path it is limited by the tight link. They use pathload to measure this. Easy to measure if one has router MRTG plots. A TCP transfer cannot get the available bandwidth in an under-buffered path, gets more than the available bandwidth in over-buffered paths, and TCP saturates the path. Methods such as cprobe have problems if there is more than one bottleneck. SLoPS (Self-Loading Periodic Streams) measures one way delay (OWD) and does not need synchronized clocks. Send a periodic stream of K packets, L bytes each, T seconds apart, so the rate is R = L/T. If R > A (available bandwidth) then the relative OWD increases with packet number within the stream. For R << A the overall trend is that the relative OWD does not increase. When R is just below A (R <~ A) can get both behaviors, possibly due to cross traffic changing since A is dynamic. So the receiver notes whether the OWD is increasing and tells the sender: if it is, the sender decreases R; if not, the sender increases R.
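The feedback loop can be sketched as a bisection on the stream rate. This is an illustration under assumptions: the rate adjustment algorithm of the actual tool is not reproduced here, bisection is swapped in as a simple stand-in, and owd_increases is a hypothetical callback that would send one stream at rate R and report the receiver's verdict.

```python
def stream_rate_bps(packet_bytes, period_s):
    """Rate R = L/T of a periodic stream, in bits per second."""
    return packet_bytes * 8 / period_s

def bracket_available_bw(owd_increases, r_min, r_max, resolution):
    """Narrow [r_min, r_max] around the available bandwidth A.

    owd_increases(r) -> bool is a hypothetical probe: True when a stream
    sent at rate r shows an increasing one way delay trend (r > A).
    """
    while r_max - r_min > resolution:
        r = (r_min + r_max) / 2
        if owd_increases(r):
            r_max = r  # stream overloaded the path: A is below r
        else:
            r_min = r  # no increasing trend: A is at least r
    return r_min, r_max
```

For example, with a simulated path where A is 37 Mbits/s, bracket_available_bw(lambda r: r > 37e6, 0.0, 100e6, 1e6) returns a range no wider than 1 Mbits/s that contains 37e6.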
Try to keep T small (100us) to reduce the probability of context switches, and the packet size < MTU and >> header size (200 bytes). Have two schemes to determine whether the OWD is increasing or not: Pairwise Comparison Test (adjacent pairs) and Pairwise Difference Test (1st and last), which are complementary. Uses a fleet of N (usually of order 10-12) streams, each stream is 100 packets, and the silence between streams is an RTT to allow network queues to drain. The rate adjustment algorithm is described in the paper.
Clock skew is not an issue due to the small stream duration (~10msec). Pathload aborts if loss > 10%. Measurements from U Delaware to Greece. For low speeds showed the pathload measurement agrees with MRTG within statistics. Also OK for UOregon to UDelaware at 60Mbps. First release of code April 2002.
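A sketch of the two trend statistics, under assumptions: the exact definitions and decision thresholds used by the tool are in the paper and are not reproduced here. This sketch takes PCT as the fraction of adjacent pairs that increase, and PDT as the first-to-last change normalized by the total absolute variation.

```python
def pct(owds):
    """Pairwise Comparison Test: fraction of adjacent pairs that increase.

    Near 1.0 for a steadily increasing stream, near 0.5 for no trend.
    """
    pairs = list(zip(owds, owds[1:]))
    return sum(1 for a, b in pairs if b > a) / len(pairs)

def pdt(owds):
    """Pairwise Difference Test: (last - first) / sum of |adjacent differences|.

    Ranges from -1.0 to 1.0; near 1.0 for a strong increasing trend,
    near 0.0 when the stream ends about where it started.
    """
    total = sum(abs(b - a) for a, b in zip(owds, owds[1:]))
    return 0.0 if total == 0 else (owds[-1] - owds[0]) / total
```

The two are complementary in that PCT reacts to many small steady increases while PDT reacts to the net change from start to end of the stream.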
The idea is to model delays along a single path. Used the RIPE NCC TTM setup with GPS synchronized boxes measuring one way delays: 32 boxes active, 963 usable paths, 2100 one way delay measurements done during one day for each path. Delay distributions: class A (84%) has a big peak (gamma peak) with a long tail, class B (6.3%) has a Gaussian peak out in the tail (caused by occasional overloading), class C (2.8%) has two peaks sometimes above (lower delay) the maximum, class D (5%) has multiple sharp peaks with a long tail due to multiple paths. There are 2 components in the distributions: a deterministic component (known routers, cables, interfaces) and a stochastic component (due to queuing, buffers etc.) that varies from packet to packet. Lab measurements to study the deterministic component measure one way delays on an unloaded network. Each router adds 224us delay; modern backbone routers are usually faster. The line of sight (LoS) distance is easy to get since one has GPS, but the actual cable route is almost never a straight line (e.g. LoS is 408km Bratislava to Munich but actual is 2218km; for 35 routes the actual distance is between 1.2 and 14 times the LoS distance). Can often get the location of routers via hostnames; can also check the location record in DNS but it is not really usable, the data is unreliable. For links use distance by car since fibers are usually buried along highways or railways. Given the distance and a refraction index of 1.5, i.e. propagation of 5us/km, was able to estimate the cable length delay, typically within a factor of 2 (32 out of 35 cases). Problems are the exact location of routers, model/make of router, exact path of cable, local fiber loops, fiber or copper; does not work for trans-Atlantic links and satellites.
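The deterministic component can be worked through with the numbers above. A minimal sketch: the 224us per router and the refraction index of 1.5 come from the talk, while the function names and the router count in the usage note are illustrative assumptions.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum

def propagation_us_per_km(refraction_index=1.5):
    """Propagation delay per km of fiber, in microseconds."""
    return 1e6 * refraction_index / C_KM_PER_S

def deterministic_delay_us(cable_km, n_routers,
                           us_per_km=5.0, router_us=224.0):
    """Cable propagation delay plus a fixed per-router delay."""
    return cable_km * us_per_km + n_routers * router_us
```

propagation_us_per_km(1.5) gives about 5.0 us/km. For the 2218km Bratislava to Munich route the cable alone contributes deterministic_delay_us(2218, 0) = 11090us; a hypothetical 10 routers on the path would add another 2240us.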
In a future paper will model the stochastic contribution.
Goal is to provide a framework & tools for the study of large amounts of packet header data, both at packet level and as summaries. Open Source, Perl, R (open source S). Data flow goes from packet header traces to primary and secondary Unix flat file objects, then to S objects. Want to do time blocks and do the same analysis on each block. Detailed packet level analysis, e.g. arrivals, plus summary levels. Multipage display using a trellis display in S to allow the analyst to view hundreds of pages of data. Go from pilot runs/analyses to batch analyses with a nice web interface. FSD tool provides open loop packet traffic modeling, will focus on arrival times. Have a row by column format for packet headers captured by tcpdump.
As more users and flows are put together, there is a theory that says there will be less influence from long-range dependence (LRD). Tried this model out on the data and showed it is correct. As the amount of multiplexing increases get less LRD and more entropy (white noise).
Can get software from: http://cm.bell-labs.com/stat/InternetTraffic
Packet Delay & Loss at the Auckland Internet Access Path
&0 measurement points, 3 in US, 1 in NZ, rest in Europe. Full mesh. Adding IPDV, bw and trends. IPDV matters for voice and video on demand. IPDV(ij) = d(i)-d(j), with error dIPDV = sqrt(dd(i)^2+dd(j)^2) ~ sqrt(2) sqrt(dt(src)^2+dt(dst)^2). IPDV changes over time (showed up to +-5ms). No obvious function fits the distributions, so use percentiles (median, 85% where the peak ends, 97.5% for tails). Class A (60%) has 85% ~ 97.5%, class 2 (30%) has long tails from occasional hiccups, class 3 has multiple peaks, believed due to load balancing (if packets are closely spaced there is no recalculation of the route so they take the same route; with wider spacing a different path is calculated, e.g. if one path has +4ms over the standard path, then get peaks at 0, +4 ms and -4 ms). 60% of paths have low IPDV.
Bandwidth estimation using pathchar, clink, pchar and pathrate. Pathrate works best but takes longer. Integrated into the web infrastructure to allow someone to request the bandwidth between two hosts. Want to turn into production but seeing problems with high speed links. Want to correlate delays, losses and bandwidth, and also look at available bandwidth.
Long term trends: expect delays to follow a saw-tooth pattern: new hardware installed, little load, minimal delays; traffic grows; connection fully utilized, long delays; then upgrade. Summarize for 4 hour periods (00:00-6:00 GMT ...), calculate 2.5%, median, 97.5%. Reduces from 1M points/year to 5K points/year, still a pile of plots, which then drop to a single page per day with a table. Also want to zoom in from the table.
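The IPDV definition, its error and the percentile summary can be sketched as below. Assumptions of this sketch: IPDV is taken between consecutive packets (later minus earlier), and a simple nearest-rank percentile is used; the talk did not say which percentile estimator was used.

```python
import math

def ipdv(delays):
    """IPDV between consecutive packets: d(i+1) - d(i)."""
    return [b - a for a, b in zip(delays, delays[1:])]

def ipdv_error(dd_i, dd_j):
    """Error on IPDV(ij) from the two independent delay errors."""
    return math.hypot(dd_i, dd_j)

def percentile(values, p):
    """Nearest-rank percentile, p in (0, 100]."""
    s = sorted(values)
    rank = max(1, math.ceil(p * len(s) / 100))
    return s[rank - 1]

def summarize(values):
    """Median, 85% (where the peak ends) and 97.5% (tail)."""
    return tuple(percentile(values, p) for p in (50, 85, 97.5))
```

A path whose 85% and 97.5% values are close has essentially no tail, which is how the first class above is characterized.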
Active measurements are controlled but intrusive, can incur costs, and probe packets may be treated differently. Passive measures existing traffic, no extra load, no side effects, has huge amounts of measurement data, but no control over the traffic. Sampling methods include systematic (time based, packet based), random, and stratified (divide into parent populations and sample within each population). Talked about content-based sampling for multi-point sampling so one can see the same packet at each point.
Diagnosing problems requires a knowledge of multiple layers maintained by separate organizations. Telco environment has a lot of legacy and new equipment running side by side. Encapsulation may require capturing a large part of the packet. Requires data modeling to understand what is going on.
See problems with link sharing in that hardware does not switch well and loses packets during switchover. Multi links can have strange effects: switching, selection algorithms, out of order delivery. Showed a few tricky examples. Wants a tool to assist in diagnosis.
NSPs use SLAs with penalties to specify and meet performance. Then measure to ensure that QoS is met, or if not then what penalty is due. Typically rely on active measurements.
QoS works on marked packets, unmarked get best effort. For measurements it is another dimension: unmarked measurement packets may see poor performance while the application is doing fine. So measurements must use the same QoS characteristics as the application will use (matching). This becomes a problem when SLAs apply QoS dynamically. Can mark the packets in the network or in the application. Doing it in the network means the application does not need modifying, but may require configuring the router via the console or via SNMP. SNMP is a bit immature in its set capability in this respect, so he used the CLI. The lack of standardization of QoS and how it is set makes this difficult; would be better if could use an SNMP standard MIB.
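The sampling schemes can be sketched as follows. This is an illustration only: the function names are hypothetical, packets are represented as byte strings, and CRC32 stands in for whatever hash a real content-based scheme would use. A real deployment would hash only fields that do not change along the path (not the TTL or header checksum).

```python
import random
import zlib

def systematic_sample(packets, n):
    """Packet-count based systematic sampling: every n-th packet."""
    return packets[::n]

def random_sample(packets, prob, seed=None):
    """Independent random sampling with probability prob per packet."""
    rng = random.Random(seed)
    return [p for p in packets if rng.random() < prob]

def content_sample(packets, modulus):
    """Content-based sampling: keep packets whose hash is 0 mod modulus.

    The decision depends only on packet content, so every measurement
    point that sees a given packet makes the same decision about it.
    """
    return [p for p in packets if zlib.crc32(p) % modulus == 0]
```

With content_sample, two measurement points running independently select the same packets wherever their traffic overlaps, which is what makes multi-point matching possible.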
Want to relate network QoS and human/customer experience/satisfaction. ITU has MOS with how customer perceives against metric such as delay. He reports on a first step for data nets. QoE(app)=Func(QoS(net),QoS(app)). How does one measure satisfaction, how to avoid human bias, how to measure net QoS simultaneously with measuring satisfaction. Can't measure satisfaction, but can measure dissatisfaction. A normal HTTP connection has a GET followed by data with a FIN, but if user gives up then will get RST before data is all sent. This could be used as a dissatisfaction measure. So place passive probe on link and collect HTTP objects (status & length OK - code 200), 1.9M served, 68K cancels (3.6%). 40% of cancelled was re-requested by same user. Also collect delivery time and the response time (GET to response). There is an anecdotal rule that a web page needs to be given in 8 seconds for reasonable performance. Show cancellation rate increases with delivery time out to 20 seconds. Rule of thumb is that network & inter-server latencies are major factors for quality of experience. In the low (sub-second range) there is a break in cancellation rate around 50 msec, then pretty flat out to 300 msec. Important to do content delivery from a server close to user. More bandwidth you have the happier you are, but it is non linear, e.g. if look at cancellation rate as function of available bandwidth there are big changes as one goes over 2400bits/s and another around 150kbits/s.
He reports on a prototype to combine active & passive methods. Need measurements for daily ops & main, RTE eng, plan & dimension, traffic & protocol usage, user usage ...
Uses a single method to estimate losses, delay and throughput that reflects actual usage and does not depend on traffic model assumptions. Uses a traffic meter and sends monitoring packets. Applications are SLAs and operations and control. Meter filters based on ports, protocols and IP addresses and counts the number of packets and bytes. Insert monitoring packets between blocks of user packets. Small block sizes give better resolution, but often large blocks are suitable. Measure losses in periods. Count packets out of the sending node and compare to the number received.
Uses Netramet and matches pairs (send/receive) of packets, in particular DNS requests/responses. Use an optical splitter to steal 5% of the light. 95% of requests are resolved without retry. If the resolver receives no response, it will time out and send the request to a secondary DNS server and so on until it gets a response. Bind will remember which DNS server gave the best result. Timeout varies from resolver to resolver, can be 4 seconds. Can see delays increase, probably with load (e.g. bad during the week, OK at the weekend), also can see multiple servers seeing delay peaks at similar times. Can see multipathing; a range of RTT distributions is observed, some due to server overloading. Gamma distribution is a good fit for clean data; probably simpler stats such as median and IQRs may be sufficient.
He has added a feature to save the 1st 100 points, then determine the histogram limits from them, then use the same 100 memory spaces for the histogram itself (calls it dynamic distribution).
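The per-block loss calculation can be sketched as below. Assumptions: each monitoring packet is taken to carry the sender's packet count for the block it closes, and the receiver counts what arrived between consecutive monitoring packets; the function name and list layout are hypothetical.

```python
def block_loss(sent_counts, received_counts):
    """Loss fraction for each block delimited by monitoring packets.

    sent_counts[i]: user packets the sending node counted in block i.
    received_counts[i]: user packets the receiver counted in block i.
    """
    losses = []
    for sent, received in zip(sent_counts, received_counts):
        losses.append((sent - received) / sent if sent else 0.0)
    return losses
```

Smaller blocks localize a loss burst in time more precisely, at the cost of more monitoring packets.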
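The dynamic distribution idea can be sketched as below. This is not the meter's code: the class name is hypothetical, and clamping out-of-range values into the edge bins is an assumption of this sketch. Python does not literally reuse the same memory, so the single slots attribute stands in for the shared storage.

```python
class DynamicHistogram:
    """Store the first n samples raw, then turn the storage into n bins."""

    def __init__(self, n=100):
        self.n = n
        self.slots = []  # raw samples first, bin counts after freezing
        self.lo = self.hi = None

    def _bin(self, x):
        if self.hi == self.lo:
            return 0
        i = int((x - self.lo) / (self.hi - self.lo) * self.n)
        return min(max(i, 0), self.n - 1)  # clamp outliers to edge bins

    def add(self, x):
        if self.lo is None:
            self.slots.append(x)
            if len(self.slots) == self.n:
                self._freeze()
        else:
            self.slots[self._bin(x)] += 1

    def _freeze(self):
        samples = self.slots
        self.lo, self.hi = min(samples), max(samples)  # limits from data
        self.slots = [0] * self.n
        for s in samples:
            self.slots[self._bin(s)] += 1
```

The limits are fixed once the first n samples are in, so the quality of the binning depends on how representative those samples are.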
Provide a tool for tracing packets; want to monitor between the application and the TCP/IP stack. Also needed finer time resolution than can get in user space, and wanted to run at higher speeds than most similar tools. Required intrusive kernel modification, but low overhead and no modifications to applications. Kernel has a circular buffer for events, which are then written to disk. The buffer absorbs transient loads. They have instrumented 4 layers, application, TCP, IP, UDP and network. Keep a sockid which cannot be traced back to the application. Have a 64 bit cpu clock timer, which is orders of magnitude better than previous tools, also have a size and an event id (tells what is being measured); total storage is 64 bytes/event (includes 44 TCP bytes). Runs under Linux 2.4. Will be available under the GNU public license. It is endian aware. Runs on both i386 and PowerPC platforms. Kernel & user space communicate via shared memory. Ran on 2 dual 400MHz Pentium IIs with both 100Mbps & 1000Mbps Ethernets. Showed the impact on throughput: 1. without MAGNeT (<1%), 2. with MAGNeT but not running (<5%), 3. magnet_read on sender (<5%), 4. magnet_read on sender, 5. tcpdump on receiver, 6. tcpdump on sender. Tcpdump does not scale at GE: it drops throughput by 35% on receive, 25% on send. The reason for poor tcpdump performance is thought to be transferring data from kernel to user space. Tcpdump turbo does more in the kernel and may be much closer to MAGNeT. Increasing the kernel buffer size from 128KB to 1MB reduces loss from 3% to 1% at GE. Also can reduce the delay between reads of the buffer to reduce cpu load. Showed an FTP application sending data to the socket in 10KB blocks, which the network layer then sends as 1500 Byte blocks. I.e. what one sees on the wire is quite different from what one sees at the application. Also showed a 128Kbps MPEG-1 layer 3 streaming audio from MP3.com. Socket gets data in short 400 Byte blocks while IP sees 1400 Byte blocks.
Unloaded Linux context switch time is ~ 2 usec on the 1st (of 2) cpus; when running MAGNeT occasionally see 16 us context switches. This is a known problem that will be fixed in a future Linux release. Future work is to collect traces, run-time vs compile time configuration, kernel thread implementation, handle CPU clock rate changes (Intel SpeedStep to handle power requirements, not easy since the kernel is unaware when the clock speed changes).
Will be available under GPL (http://www.lanl.gov/radiant)
TICKET is a minimum kernel level API designed for high-resolution networking tasks. Implemented in Linux. Soft real-time. Nanosecond granularity, tested up to 1.6Gbits/s. Limitations are in the PCI bus and the Alteon card, which lacks memory. Wanted to use COTS hardware, yet make it high speed. TICKET is designed to be scalable (unlike pcap), very efficient, but hard to use, O(ns) timers (vs O(us) in others), parsing capture, not portable, can perform in real-time, highly reliable, free to inexpensive. Needs an optical tap or port spanning, or a promiscuous NIC for wireless. TICKET runs on 2 or more cooperating hosts: one (T) runs a dedicated OS, others (D) run user tools, e.g. disk storage. T waits on a mutex, gets an interrupt, receives & timestamps the packet. Timestamp comes from the Intel clock counter. But it can drift, so use external NTP information to verify times. Will be compatible with tcpdump output. Test had 2 machines sending 500Mbps (1066B + headers) thru a switch to TICKET. For a 1GHz host records at 230 Mbits/s, disk limited. Without disk limitation (/dev/null) get 650 Mbits/s. Tcpdump can do about 230Mbps on a GHz machine, TICKET gets 650Mbps with no sharing and 950Mbits/s with sharing.
Need to get round the limitations of current protocols. Things change during measurement (smearing). Need to measure both forward & reverse paths. Desirable features: integrated path & delay measurements, ability to measure different protocol streams, no (or very low) overhead in routers, do not require full deployment, DoS resistant, reverse path measurement, relate interface IP to router, get information about clocks. IPMP proposal: echo request/reply with path record. Path record has IP address and timestamp; the path record may be inserted by a router together with the time stamp. Resolving router timestamps to real time is not done on the measurement path. Packets report both the real time and the reported time so can fix up timestamps. Have added several optimizations such as fixed field locations and reduced need to recalculate the checksum. Full deployment is not required; it is being used in AMP. Allows an ISP to say I want to be measured, and the path record mechanisms are well optimized. Believe can implement in a small number of instructions, e.g. in router silicon. Believe it is denial of service resistant. Used ICMP and IPMP measurements in parallel: 81% show no difference (within 1 msec), 8% smaller with IPMP, 10% larger. Presented at the IPPM working group at IETF 53. Send email to ippm@advanced