Antibes Juan-les-Pins, France, April 19-20, 2004
Rough Notes by Les Cottrell
This was the fourth in a series of workshops; the previous ones were in Amsterdam, Denver, and San Diego. There were about 130 attendees, up from about 90 last year. The conference was extremely well organized with working wireless Internet access, the quality of the selected papers (30 out of 130 submissions) was very high, and the proceedings were published and handed out at the meeting. My talk on Correlating Internet Performance & Route Changes to Assist in Trouble-shooting from an End-user Perspective was well received, and there was a lot of valuable feedback as well as interest from several sites in deploying it. The most interesting talks (to me) were: measurements on peer-to-peer file sharing applications such as BitTorrent and eDonkey; using delay to triangulate location on the Internet; how to estimate available bandwidth in the presence of interrupt coalescence; errors at the physical and framing levels in optical networks. I had many useful discussions, in particular with Mark Crovella of Boston U who provided some very useful information on multivariate event detection, with Constantinos Dovrolis of GATech concerning an NSF proposal we are working on, Maxim Grigoriev of FNAL concerning our PingER and IEPM collaboration, and Tony McGregor of NLANR/Waikato about collaborating on AMP.
One BT = one torrent = one file copy. Tit for tat, i.e. the getter must also put. The client contacts a web server, which returns meta information. Then it contacts a tracker that connects it with 50 peers to copy with. During its lifetime it keeps >= 20 connections open in parallel. The file is broken into blocks; each time a block is downloaded a new peer set is announced. Upload data to serve the peers that offer the best rates.
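The tit-for-tat rule above — serve the peers that currently offer the best rates — can be sketched roughly as follows (function name, peer names, and rates are illustrative, not from the talk):

```python
def select_unchoked(download_rates, n=4):
    """Tit-for-tat sketch: unchoke the n peers currently offering us
    the best download rates; everyone else stays choked."""
    ranked = sorted(download_rates, key=download_rates.get, reverse=True)
    return set(ranked[:n])

# Hypothetical rates observed from four peers, in kbps.
rates = {"peerA": 120, "peerB": 45, "peerC": 300, "peerD": 80}
best = select_unchoked(rates, n=2)  # peerC and peerA
```

Real clients also periodically "optimistically unchoke" a random peer to discover better partners; that refinement is omitted here.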
Measurement of the RedHat 9 distribution, a 1.77GB file. The log contains all reports of all the clients, e.g. ID, IP address. 180,000 clients; can geographically map clients (32% US, 24% Europe, 17% Asia, 27% other); total volume 52TB. Seeds are clients which have downloaded and remain in session. Seeds stay ~6.5 hours after download. The percentage of seeds is consistently high (20%), reaching 40% during a flash crowd.
Download rate for leechers is on average > 500kbps (at least an ADSL client). Instantaneous aggregate throughput of the system (sum over all leechers) is > 800 Mbits/s. Only 130,000 of 180,000 download sessions complete. Completed sessions average 1.3Mbps; average download time is 8hrs, with high variance.
Instrumented a client behind a 10Mbps campus access link. Three transfers over 3 days. Download time 4500s, peer set size 80-90, 40% of the file provided by seeds (60% by leechers). It takes twice as long to upload as to download, so clients need to stay long in the system.
BT scales well. Seems very efficient for highly popular downloads; might be affected if clients do not stay long enough as seeds. What if all peers arrive at the same time, e.g. an anti-virus update?
Heavy use of P2P traffic in networks today (e.g. France Telecom close to 50%, with eDonkey (35%) the largest P2P contributor). P2P increases mice-elephant effects. eDonkey2000 was implemented by a NY company to share huge files. Has hybrid P2P with client & server. It is open source but badly documented. Downloads and signaling are on the same port. Has multiple-source download, allowing a file to be fetched from different sources in parallel. Divides the file into chunks and can request chunks in parallel. Chunks are ~9MB and are divided into parts requested by the client. eD has v. smart queue management: the downloading client has to check its position in the queue of the uploading client. Index servers return the location of file chunks. Can download a file in parallel. Index servers exchange keep-alive messages between them. Port 4662 is used for client-to-client communication. Have ID'd downloads & non-downloads in the traffic.
Measurement duration 300h, 25 local hosts, 259K remote hosts, 3.5M connections, 300GB downloads, huge amount of signaling traffic. Mean inter-arrival time for outbound download connections is 50s, for non-download 8s. Connection holding time ~67s (avg), but 800s for download, 45s for non-download. Flow sizes 86KByte mean, 2.5MB for download, and 15KB for non-download. The non-download distribution is Gaussian (normal); download is heavier tailed. Plotting volume vs. connection hold time shows download and non-download are in different areas (maybe a crude way to separate them).
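The volume-vs-hold-time observation suggests a crude separator. A minimal sketch, with thresholds made up to sit between the means quoted above (they are illustrative, not from the paper):

```python
def classify_flow(bytes_transferred, hold_time_s,
                  vol_threshold=500_000, time_threshold=300):
    """Crude eDonkey flow separation: downloads are big and long-lived
    (means quoted: ~2.5MB / ~800s), non-downloads small and short
    (~15KB / ~45s). Thresholds here are illustrative guesses."""
    if bytes_transferred > vol_threshold and hold_time_s > time_threshold:
        return "download"
    return "non-download"
```

Anything near the decision boundary would need a better-fitted model, but the two clusters quoted in the talk are far enough apart for this to work on typical flows.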
Summary: P2P is a significant traffic source, is here to stay, need to separate download from non-download, does not heavily influence mice vs. elephant, good idea to keep traffic within an AS.
Caching servers might have some copyright issues.
Gnutella is an unstructured P2P file download service with voluntary membership, so membership changes with time. Peers join through a bootstrapping process. The quality of peers depends on the bootstrap. Analyse the BS process in 4 Gnutella implementations. Looks at times to BS, etc.
Net of voluntarily operated caches. Each cache maintains the following FIFO lists: 20 online hosts, and a list of 10 other GWebCaches. The BS algorithm reads known hosts, permanent hosts, ultrapeers, caches, determines the number of connections, then tries to establish connections. Different server implementations treat ultrapeers differently.
Compare by search performance and bootstrapping time. Gtk-Gnutella performs worst due to no ultrapeer support. LimeWire gets the fastest search results since it prioritizes the ultrapeers.
Summary: differences are in server implementations; LimeWire performs best.
How many GWebCaches are there, how many are active, and does the system have sufficient capacity? A poll of 250 caches found only 150 active, so significant mis-reporting; about 100K peers.
Get stats files from caches every hour. Request rates vary from 0/hr to 30k/hr, so huge variation. 10% of peers updating the caches are unique throughout the system. On average only 50% of peers were online (others rejected connections), 16% accepted connections, 14% were ultrapeers.
Overlay routing identifies alternate paths with better routing than the direct path. Can provide benefits in some cases. To do it we monitor the network, then each node exchanges information with peers to get global state, then using the data compute overlay paths. What are the gains vs. overheads?
Gains: # node-pairs having >= one better alternate paths, number of better alternate paths per pair, degree of improvements. Overheads: monitoring, computation.
Past work showed efficiency; overlay size could be scaled to 50 nodes with reasonable overhead (30kbps). The current paper looks to see if it can do better.
Factors affecting gains & overheads: # nodes in net, # alternate paths, average logical connectivity (# nodes connected to cf total # nodes). Monitoring frequency, smaller more current data so more accurate, but more overheads. Max # intermediate hops allowed, more hops more alternate paths, but more computational overhead.
Want to increase the size without increasing overhead. Use the ping utility to monitor the net, then exchange link-state routing info, and compute the set of all better paths (loss & latency). Analyze offline with sub-sets to see effects. Used PlanetLab nodes: 5 data sets over 24 hours with 36 nodes. 10% of host pairs had at least one better path; 15% have 3 or more better paths. Changing connectivity affects gains but only moderately.
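Computing the one-hop better paths from a measured latency matrix is straightforward; a minimal sketch (the threshold parameter reflects the "only move if better by more than a threshold" rule mentioned later, with an assumed value):

```python
def better_one_hop_paths(latency, src, dst, threshold_ms=1.0):
    """Find intermediate nodes k where src->k->dst beats the direct
    src->dst latency by at least threshold_ms. `latency` is a dict of
    dicts of measured RTTs in ms. Returns {k: alternate_latency}."""
    direct = latency[src][dst]
    better = {}
    for k in latency:
        if k in (src, dst):
            continue
        alt = latency[src][k] + latency[k][dst]
        if alt + threshold_ms < direct:
            better[k] = alt
    return better
```

Multi-hop paths would be found the same way with a shortest-path algorithm over the full matrix, at correspondingly higher computational cost (the overhead trade-off the paper quantifies).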
Effect of monitoring frequency: doubling the epoch duration leads to 30% of the chosen paths no longer being the best (though they may still be better than the direct path).
Monitoring overhead is independent of connectivity.
Multi-hop alternate paths do not help for latency. Multi-metric improvement only 2-3% node pairs can be improved. Increasing overlay size does not help much either.
Increasing epoch size reduces gains by 11% and reduces overhead by 50%. Reducing connectivity by half reduces better alternate paths by 40% for latency and over 30% for loss. Can go to a network 4-6 times the size. Only move to an alternate path if it is better by more than a threshold.
If one can locate hosts geographically then one can optimize the provision of services, e.g. if a host is in Nice then one could advertise information on Nice, and learn where clients are coming from. Need a location service that goes from IP address to lat/long. Builds on the GeoPing measurement-based approach. Have a set of landmarks (L) as reference hosts that can locate target hosts (T); landmarks measure RTT by ping to T. Probes gather landmark info. If the location server asks for T's location then the probe machines return delays to T and to the landmarks. So one gets a set of viewpoints from the probes, and compares the delays of landmarks and the target host. Put the vectors into a matrix and find the most common delay patterns; this provides the location estimate. Location is based on known landmarks. The comparison can use angles (cosine based), correlation based, distance based, and Canberra based measures. Use 9 NIMI boxes as probes (3 in US, 1 in Japan, 2 in Paris) and 397 landmarks (85% in US & W. Europe). Measured in Aug 2003. A few landmarks have v. large delays - weak correlation. N. America & W. Europe have the best connectivity and a strong correlation between min RTT and distance. Look at one landmark as target and the remaining landmarks for location, and plot distance vs. probability: big bias to smaller distances (i.e. better estimates for close distances). The correlation gets stronger with richer connectivity between hosts. Coarse-grained location is useful for some applications. Looking at triangulation from some landmarks.
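The cosine-based comparison of delay vectors can be sketched as below: each host is represented by its vector of RTTs from the probes, and the target is assigned the location of the landmark with the most similar vector (landmark names and delays are made up for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two delay vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_landmark(target_delays, landmark_delays):
    """Pick the landmark whose delay vector (RTT in ms from each probe)
    looks most like the target's; its known location is the estimate."""
    return max(landmark_delays,
               key=lambda L: cosine_similarity(target_delays,
                                               landmark_delays[L]))

# Hypothetical RTT vectors from three probes:
landmarks = {"Nice": [10, 20, 200], "Tokyo": [200, 180, 15]}
estimate = nearest_landmark([12, 22, 190], landmarks)  # -> "Nice"
```

The correlation- and Canberra-based variants mentioned in the talk differ only in the similarity function plugged into `nearest_landmark`.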
Interested in the number of neighbors (vertex degree in graph theory) an AS has. The distribution of AS degrees goes as a straight line on a log-log scale. Why, and how do we model it? A good model matches reality, is well understood, is natural, has understood parameters, is simpler rather than complex, and allows mathematical analysis.
The model says the network grows incrementally by adding new vertices, and new nodes attach preferentially to vertices that are already well connected; this produces nets described by a power law. The Barabási-Albert (BA) model assumes every node is added with m edges, with preferential selection proportional to current attachment. The model does not include leaves (30% leaves in the real AS graph), the power-law exponent is too high (g=3 but should be 2.11), and it does not produce a dense core.
New model: each step add one new vertex v and m edges; one edge connects v using linear preferential attachment, and for the other m-1 edges one endpoint is selected uniformly at random. Gives a power law that is still too steep, but better agreement for small node degree. Next step: make the preferentiality super-linear (e.g. if AS1 has 2 times more connectivity than AS2 then it gets preference by more than a factor of 2). This produces something more complex than a power law but still roughly a power law. Using the extra parameter gives reasonable agreement with the Internet. Look at the dense core (43 ASs have 73% of edges). The new model does better than BA but is still not all that good (it only produces 19 ASs in the core).
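The growth rule in the modified model above (one preferential edge plus m-1 uniform-endpoint edges per new vertex) can be simulated in a few lines; this is a sketch of that rule as I read it from the notes, not the authors' code, and the seed graph is an assumed small clique:

```python
import random

def grow_graph(n, m, seed=None):
    """Grow a graph to n vertices: each new vertex v arrives with m
    edges; one endpoint is chosen with probability proportional to
    degree (via a stub list), the other m-1 uniformly at random.
    Returns {vertex: degree}."""
    rng = random.Random(seed)
    degree = {i: m for i in range(m + 1)}            # seed (m+1)-clique
    stubs = [i for i in range(m + 1) for _ in range(m)]
    for v in range(m + 1, n):
        targets = {rng.choice(stubs)}                # preferential edge
        while len(targets) < m:                      # uniform edges
            targets.add(rng.choice(list(degree)))
        degree[v] = len(targets)
        for t in targets:
            degree[t] += 1
            stubs.extend([v, t])
    return degree
```

The super-linear variant would replace the stub list with weights proportional to degree raised to a power > 1, which is the extra parameter the talk mentions.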
Location AND distance matter in the Internet (for content delivery, P2P, multi-user games...). Internet distance is used for Internet coordinates and nearness estimates (position location). Distances to a fixed set of landmarks are used to provide coordinates. Picking landmarks is hard to do: more landmarks are better but mean more overhead, and two close-by landmarks do not provide much added benefit. There are potentially millions of landmarks. One way to select is to examine the geometric structure of Internet distances, and ask whether geometric algorithms can help in landmark selection.
Look at skitter (12x196K locations), sockeye (11x156K) and AMP (116x116) minimum RTTs to a large set of nodes. Use an Internet coordinate scheme, embedding in a 7-dimensional space (IMC 2003). The first 2 dimensions are the most significant. Shows quite a lot of clustering in the first 3 dimensions. Then color IP addresses based on which registry provided the IP address. Extremely strong correlation, so clusters are determined by geographic location (since registries are chosen by geography). Study landmarks by how well they predict distance compared to actual distance. The best algorithm (greedy, too computationally complex to do online) gets 20% disagreement, while others get 25% for AMP data. For skitter the greedy algorithm works best (10%), max distance does poorly, random and cluster are about equal; as one picks a large number (>20) of landmarks, the difference between cluster and random becomes small.
Conclusion: landmark choice is important, there is a lot of clustering. Either choose a large number of landmarks (then landmark choice can be random), or do clustering, offline could do greedy.
Motivating question: what classes of problems can experimental results from PlanetLab be safely generalized to the global Internet? Has over 360 deployed systems at 152 sites. Sites are in A&R nets, ~1000 ASs. Many hosts have both A&R and commercial Internet connections, but preferentially choose A&R for communication in PlanetLab. 85% of PlanetLab nodes are in A&R nets, 10% in commercial. Most A&R nodes are in Abilene, mostly in N. America and Europe. 70% of traffic between PlanetLab nodes is carried within the GREN, 60% by Abilene.
Want to know: does the triangle inequality hold? Hot-potato routing can cause violations. Quantify by looking at minimum RTT over 7 days between 13 sites. 25% of commercial paths do not form triangles; for A&R, 20%. Want to know how to improve PlanetLab diversity: need to target commercial sites and motivate other communities.
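The triangle-inequality check described above amounts to: for each node pair, is there an intermediate node through which the summed RTT beats the direct RTT? A minimal sketch over a measured RTT matrix:

```python
def triangle_violations(rtt):
    """Count node pairs (a, b) for which some intermediate k gives
    rtt[a][k] + rtt[k][b] < rtt[a][b], i.e. the direct path violates
    the triangle inequality (as hot-potato routing can cause).
    `rtt` is a dict of dicts of minimum RTTs."""
    nodes = sorted(rtt)
    violated = 0
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if any(rtt[a][k] + rtt[k][b] < rtt[a][b]
                   for k in nodes if k not in (a, b)):
                violated += 1
    return violated
```

Dividing by the total number of pairs gives the fraction quoted in the talk (25% commercial, 20% A&R).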
Bursty traffic compared to Poisson process. Depends on time scales. Burstiness due to heavy tailed behavior of file transfers long & short (minutes time scale), also due to TCP congestion and other behaviors (slow start, idle restart) give burstiness on seconds time scale.
Measuring capacity passively. The packet-pair dispersion technique can be used passively by looking at traces. Have to know the pre-trace (earlier in path) packet separation. Delayed ACKs generate a lot of packet pairs (50% of packets are back to back), since responding to 2 packets with an ACK results in the sender sending the next 2 packets. Most of the packets will have the pre-trace separation (i.e. a minimum separation) even in the case of cross-traffic. Can use this to get estimates of the link capacities of TCP flows by looking at the packet-pair dispersion.
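The core packet-pair relation is that back-to-back packets leave the bottleneck separated by (packet size / capacity), so the minimum observed dispersion over many pairs gives the capacity. A minimal sketch:

```python
def capacity_from_dispersion(packet_size_bytes, dispersions_s):
    """Packet-pair capacity estimate: back-to-back packets of size S
    leave a bottleneck of capacity C separated by S/C seconds, so
    C = S / min(dispersion). Taking the minimum over many pairs
    filters out pairs widened by cross-traffic."""
    return packet_size_bytes * 8 / min(dispersions_s)
```

For example, 1500-byte pairs whose tightest observed spacing is 120 microseconds imply a ~100 Mbit/s bottleneck. In the passive setting described above, only pairs known to have had minimal pre-trace separation should be fed in.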
Measure energy by variance. Then using different time bins and increasing the bins by factors of 2 can measure energy in different bin sizes. Is there a connection between energy and capacity? Look at high capacity flows and see that the energy is higher. For low capacity flows the energy is flat with bin size, i.e. more Poisson like. So cannot just say Internet traffic is multi-fractal, it depends on high-capacity flows.
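The variance-at-dyadic-scales idea can be sketched as follows: start from per-bin packet (or byte) counts, compute the variance ("energy"), merge adjacent bins by factors of 2, and repeat. This is a plain-variance sketch of the multiscale analysis described, not the authors' wavelet code:

```python
def energy_by_scale(counts, n_scales):
    """Variance of a count series at successively coarser dyadic
    time scales. For Poisson-like traffic the (normalized) energy is
    flat across scales; multi-fractal traffic shows scale-dependent
    structure."""
    energies = []
    series = list(counts)
    for _ in range(n_scales):
        mean = sum(series) / len(series)
        energies.append(sum((x - mean) ** 2 for x in series) / len(series))
        # merge adjacent bins: factor-of-2 coarser time scale
        series = [series[i] + series[i + 1]
                  for i in range(0, len(series) - 1, 2)]
    return energies
```

Comparing these energy curves between high-capacity and low-capacity flows is what leads to the conclusion that the multi-fractal behavior is driven by the high-capacity flows.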
Time series look the same (burstiness-wise) regardless of time binning. Important to understand self-similarity (SS) for the impact on queue sizing. Investigating whether SS exists on General Packet Radio Service (GPRS, cellular wireless network for IP access to the Internet, typically 48kbits/s access) networks. New applications: WAP, MMS, LBS; new constraints: more use of UDP for WAP/MMS, mobility. Results show there IS long-range behavior, and some of it is SS. Given that one of the causes of heavy tails in Internet traffic has been the heavy-tailed behavior of file size distributions, and the fact that one expects few long files to be loaded into phones, this behavior is not necessarily expected a priori.
Net latency at µs timescales exhibits unique characteristics. Wanted to look in a detailed way at how NISTNet works (i.e. how a Linux router really works). When looking at distributions of latency there is structure at µs levels at lower Mbps loads. Looking inside Linux to see where packets hang around: usually due to other processes interrupting. May also be caused by reading interfering with sending, for example. As the throughput rate increases the latencies spread out due to queuing, and in detail one sees a ladder-like pattern as packets get queued to a critical state and are then forwarded in bulk. At low throughputs one runs into interrupt coalescing giving bunching. At very low loads the effects of clocks, cpu time, buses etc. come into evidence.
Future work is a catalog of µs-timescale behaviors, then testing production routers at higher rates. Want to integrate the model into the NISTNet emulator.
Grew from analysis of BGP sizes and AS degrees. Most data is traffic traces.
What are the causes of brief periods of congestion in routers? Installed packet monitors on all links, synchronized with GPS (accuracy 5µs). Came up with an accurate model of a store & forward router. Usually due to a congested link at an edge (e.g. a customer with 90% utilization over a 5 min interval).
Changes of speed result in stretching, and merging causes queuing at the output link. This can be increased by multiplexing. Highly clustered arriving packets can result in queuing. Looking at packets one can see the duration of a transfer and the ramp-up period. Max duration of congestion is 50ms. Send traffic at OC48 with OC3 output, then repeat with OC3 on both sides. N.b. traffic can be paced for OC3 by earlier links. Showed the effect of multiplexing and of a bursty flow (e.g. a big flow providing 80% of total traffic).
They provided a methodology and metrics to quantify burstiness. Flow burstiness does not lead to excessive packet delay.
Want to provide a diagnostic that tells what quality of experience the user had in an H.323 session. Map end-user perception against the 3 metrics delay, jitter and loss via a model; also want to know which are the most important metrics. End-user experience depends on human factors (e.g. training). Need more than "can you hear me now"; use MOS for voice quality. Jitter: 0-20ms good, 20-50ms acceptable, > 50ms poor. Loss: < 0.5% good, 0.5%-1.5% acceptable, > 1.5% poor. Can use the result to tell the user in advance whether the session will be acceptable. Jitter is more dominant than delay.
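The jitter and loss bands quoted above translate directly into a simple rating function (boundary handling at exactly 20ms/50ms/0.5%/1.5% is my assumption; the talk only gave ranges):

```python
def rate_jitter(jitter_ms):
    """Jitter bands from the talk: 0-20ms good, 20-50ms acceptable,
    > 50ms poor."""
    if jitter_ms <= 20:
        return "good"
    if jitter_ms <= 50:
        return "acceptable"
    return "poor"

def rate_loss(loss_pct):
    """Loss bands from the talk: < 0.5% good, 0.5%-1.5% acceptable,
    > 1.5% poor."""
    if loss_pct < 0.5:
        return "good"
    if loss_pct <= 1.5:
        return "acceptable"
    return "poor"
```

A pre-call diagnostic could measure jitter and loss to the far end and report the worse of the two ratings as the expected session quality.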
Look at requests to top-level domain servers to understand the causes of delay in getting results. Lots (98%) of the requests are garbage (i.e. should not have been sent). 3-5 name servers are responsible for 80% of traffic. 58% are type A requests. There are 120 country-code servers; many have multiple hosts serving a country. RTTs are heavy tailed except for G, which was bimodal (it is badly overloaded). Need to reduce the garbage requests. Caching reduces name server load. TTLs work well for hours, days and weeks. Could use them for distributing load, but that interferes with caching. TTLs are usually over 1 day, so one should see one request per site per day, but much more is seen (e.g. an Auckland caching name server asks 125 times per day). How does a cache know which name server to contact? BIND tries to group the servers for a particular domain & tries to use the best-performing name server. Not all caching name servers take account of distance to name servers. The implementations differ slightly across caching name servers.
In Lab have tested six different DNS caches using default configuration parameters. Different caching nameservers use very different approaches to distribute load to upper levels. Need to implement exponential back off when a nameserver stops responding.
Used for security attacks, distinguishing OS effects in traces, and for inventory. Observation: TCP stacks between vendors & OS vendors are unique. Differences are due to implementations, configurations etc. Active fingerprinting sends packets to ports and looks at responses; it can run from anywhere & is adaptive, but puts packets on the network and is detectable. The most popular tool is nmap, which has over 450 profiles. Passive is non-intrusive. Passive rule-based tools on exchange point traces fail to ID up to 5% of trace hosts due to stack scrubbers, and signatures need updating frequently.
They developed an adaptive naive Bayesian classifier. Uses max. likelihood and provides a probability. Their method avoids deep packet inspection. Based on the initial SYN of the TCP handshake, look at TTL, initial window size, and whether the DF bit is set. Window size can be fixed, a function of MTU or MSS, or other. Used it to look at some commercial Internet traces. Measure host, packet & byte distributions with different training and compare with rule based. Windows dominates in # hosts, but Linux wins for bytes (web crawlers, Linux distributions).
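A toy version of a naive Bayes classifier over the SYN features named above (TTL, initial window, DF bit) looks like this. It is a sketch with made-up training samples and Laplace smoothing, not the authors' model:

```python
from collections import defaultdict

def train(samples):
    """samples: list of ((ttl, window, df), os_label).
    Returns class priors and per-feature value counts."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    prior = defaultdict(int)
    for feats, label in samples:
        prior[label] += 1
        for i, f in enumerate(feats):
            counts[label][i][f] += 1
    return prior, counts

def classify(feats, prior, counts):
    """Naive Bayes: pick the OS maximizing prior * product of
    per-feature likelihoods (Laplace-smoothed)."""
    best, best_p = None, -1.0
    total = sum(prior.values())
    for label in prior:
        p = prior[label] / total
        for i, f in enumerate(feats):
            p *= (counts[label][i][f] + 1) / (prior[label] + 2)
        if p > best_p:
            best, best_p = label, p
    return best

# Illustrative training data (typical-looking SYN signatures, not real):
samples = [((128, 65535, 1), "Windows")] * 3 + [((64, 5840, 1), "Linux")] * 3
prior, counts = train(samples)
```

Being probabilistic rather than rule-based is what lets it degrade gracefully on scrubbed or unusual stacks instead of failing to classify them.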
Look at the prevalence of NATs; assume hosts behind a NAT have different OS signatures. Looking at sequences, the number of distinct sequences gives the number of hosts. Synthesised NAT on traffic from the MIT border.
Satisfy the need for collection of e2e user data, maintain privacy, minimally impact users; a large variety of measurements collected for common Internet protocols; needs to be scalable and upgradeable.
Monitors the Internet connection while surfing, built on Ethereal. The user specifies a privacy level; results are sent to GATech. Open Source (GNU GPL), written in C++, on top of Ethereal, available for Windows >= 95, *nix, Linux. Does not sniff in promiscuous mode; measurements are collected on TCP, UDP, ICMP, IGMP; results are compressed by zlib, sent to GATech and made publicly available. Privacy levels allow IP addresses to be hidden entirely, only the network portion kept, or at the lowest level the whole IP address kept. Users can examine their own data to verify privacy.
Provides map of whom you are connecting to, cute, may be useful for seeing spyware & pop ups, could add visual traceroute. Available http://neti.gatech.edu, also available from SourceForge.
Need to unify the user interface, consider development of APIs, need standard output formats such as XML DTDs from GGF NMWG, NetLogger ULM format.
Can probe links passively (SNMP) or actively. Active beacons send probes to destinations. Goal to locate failure with minimum beacons and probes. Sprint reports mean time to failure is 34 minutes, and about 30% are multiple link failures. Construct a dependency matrix of what probes are minimum needed from which beacons to test all nodes. Can do this for single failures and for multiple failures. Requires an extended matrix whose construction takes exponential time.
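Choosing a minimum set of beacons whose probes together test all links is a set-cover problem; the standard greedy approximation gives a reasonable sketch of the single-failure case (the exponential-time extended matrix for multiple failures is beyond this sketch):

```python
def choose_beacons(coverage):
    """Greedy set cover: coverage maps beacon -> set of links its
    probes can test. Repeatedly pick the beacon covering the most
    still-uncovered links until everything is covered."""
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        best = max(coverage, key=lambda b: len(coverage[b] & uncovered))
        if not coverage[best] & uncovered:
            break  # remaining links cannot be covered
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen
```

Greedy set cover is within a ln(n) factor of optimal, which is usually acceptable given that computing the exact minimum is NP-hard.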
Some octets suffer more errors, leading to content-dependent errors. Set up a switched optical infrastructure. Power is a problem in machine rooms, so low-power drivers are important, but then there are larger errors. A coding scheme is needed to maintain clock synchronization; Ethernet uses 8B/10B. Bit errors at the physical layer can result in larger errors after decoding. Built special-purpose hardware to look at and deduce errors on the line and various types of code-group errors. BER assumes physical-layer errors are uniform and independent. Use a BERT with variable attenuation between sender/receiver. Errors depend on octet values, and for structured data can also depend on position in the frame. Bit error rate and packet error rate are not necessarily related. There are error hot-spots. The coding scheme used by Ethernet (8B/10B) results in hot-spots. CRC assumes uniform error probability. Needs careful consideration for future network design.
Can DB systems help save/manage traffic analysis data with access in real time as the data is being gathered? This is called data stream management (DSMS). Used the public domain TelegraphCQ; everything is in main memory, using sliding windows. What are the benefits?
Replace CREATE TABLE with CREATE STREAM. Add WINDOW to SELECT command. A problem is that one does not know IP addresses a priori so creates difficulties for a single pass system (sub-queries not supported in TCQ). So still a need for offline analysis but not integrated with online. Can handle data rates of ~ 3GBytes/s.
Max queue size on the access router has a big impact on throughput, but is not widely known by the user. Want to estimate the max queue length. Developed QFind, a black-box measurement technique to assess the queue in the last-mile router.
QFind assumes each access link has a relatively small queue size (10-100 packets) which is constant & independent of link capacities. Also assumes the last-mile router is the bottleneck of the home access network. Ping a DNS server from the home node, then start a download session from a remote server to the home node. This fills the queue and packets are dropped. Report min and max ping times: min gives capacity, max gives max queue length.
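The queue-length arithmetic implied above is: the worst ping exceeds the best ping by the time spent sitting in a full queue, and the queue drains at the bottleneck capacity. A minimal sketch (my reading of the method, with an assumed MTU-sized packet):

```python
def max_queue_packets(min_rtt_s, max_rtt_s, bottleneck_bps,
                      packet_bytes=1500):
    """QFind-style estimate: extra delay of the worst ping over the
    best one is queuing delay at the full last-mile queue, so
    queue length ~ (max_rtt - min_rtt) * capacity / packet size."""
    queuing_delay = max_rtt_s - min_rtt_s
    return queuing_delay * bottleneck_bps / (packet_bytes * 8)
```

For example, a 120ms spread between min and max pings on a 1 Mbit/s bottleneck corresponds to roughly a 10-packet queue of 1500-byte packets.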
Possible sources of error include the TCP window size, which is limited (defaults W98=8KB, W2K=WXP=16KB, Linux 64KB) and can limit the measurable max queue size. Validated via ns-2 simulation; also used the NISTNet emulator. Results show that with the W2K default the window is smaller than the queue, so the max queue size cannot be detected. With a large enough window size one can get accurate results.
During Jan 2004, total 47 volunteers. Cable 55%, DSL 45%. Cable modems appear to have a shorter q than DSL.
Get program from http://perform.wpi.edu/downloads#qfind
Mainly caused by parallelism. Amount of reordering depends on network load, and configuration and software in the routers. Reordering may impact application performance.
Send k back-to-back packets from source to destination, and note the ordering of the packets when received. Introduce the reorder packet-length metric: both the number of reordered packets AND how late a packet arrives (how far out of sequence) are important; call the latter the packet lag, and also define the time lag from when the packet was expected.
Make measurements on 70 RIPE TTM boxes. Two experiments:
First send 50 100-Byte back-to-back UDP packets (N50), then send 100 100-Byte UDP packets (N100). Tests were run for 3 hours.
Reorder packet length follows a power law. Packet lag (important for sizing reorder buffers) has a long tail; 45% of reordered packets arrive within 3 packets of their slot. Time lag is normalized by the minimum.
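The packet-lag metric above (how far past its expected slot a reordered packet arrives) can be computed from a received sequence in a few lines; this is my sketch of the metric as described, assuming packets are sent numbered 0,1,2,... with no loss:

```python
def reorder_lags(received):
    """For a stream sent in order 0,1,2,..., return {seq: lag} for
    each reordered packet, where lag is how many positions after its
    expected slot the packet arrived. A packet is reordered if a
    higher-numbered packet arrived before it."""
    lags = {}
    seen_max = -1
    for pos, seq in enumerate(received):
        if seq < seen_max:
            lags[seq] = pos - seq   # positions past its in-order slot
        else:
            seen_max = seq
    return lags
```

A receiver's reorder buffer must hold at least max-lag packets to restore ordering, which is why the tail of this distribution matters.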
Packet reordering is a frequent phenomenon in the Internet. Packet reordering has a significant impact on UDP performance, but little effect on UDP delay.
Interrupt coalescing is a technique to reduce the cpu utilization per packet. Three parameters in the Intel E1000 driver: InterruptThrottleRate (max number of interrupts per second), (Rx/Tx)IntDelay (delay between arrival of a packet & generation of an interrupt), (Rx/Tx)AbsIntDelay (max delay seen by a packet). This destroys the packet inter-arrival times seen by the user. This impacts available bandwidth estimation, capacity estimation, and TCP self-clocking; the latter causes TCP to send data in a very bursty fashion.
How to estimate bandwidth with IC? The first step is to detect IC by looking for jumps and bursts; n.b. context switches can have a similar effect. The first packet of each burst gives the dispersion of the burst. This is related to the speed of the link (i.e. the capacity) via the number of packets in each burst. For available bandwidth, look at the last packet of each burst. Have new versions of pathrate & pathload.
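The burst-detection step described above amounts to splitting the receive timestamps wherever the gap exceeds a threshold; per-burst dispersions and packet counts then feed the capacity estimate. A sketch (threshold value is an assumption, not from the paper):

```python
def split_bursts(timestamps, gap_threshold_s=0.001):
    """With interrupt coalescing, packets are delivered to the
    application in bursts. Split a sorted timestamp list into bursts
    wherever the inter-arrival gap exceeds gap_threshold_s, as a first
    step before per-burst capacity/available-bandwidth estimation."""
    bursts = [[timestamps[0]]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > gap_threshold_s:
            bursts.append([cur])
        else:
            bursts[-1].append(cur)
    return bursts
```

Given the bursts, (packets per burst × packet size) / (time between burst starts) recovers the rate that per-packet timestamps can no longer show directly.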
Cellular data nets are being deployed worldwide (GPRS, CDMA, UMTS etc.). End-user experience is severely impacted by the vagaries of wireless. There are link-related issues, e.g. variable RTT. TCP is being improved to accommodate wireless; SACK is important.
BGP convergence slowness is still an issue. Put a well known router (Cisco 12000) under a defined load. Tap in & out links and capture traffic with DAG cards. Preloaded RIB, implicit withdrawals, 1 update/s with 50 e-bgp sessions. Look at effect of cpu utilization, number of peers, RIB size, inter-update rate. Direct packets at router cpu to generate cpu utilization. Easy to get cpu utilization this way, in fact can crash routers. Little impact of cpu utilization on BGP pass-through. BGP update rate only slightly dependent on table size.
BGP is 15 yrs old: 16K ASs, 130K routes per eBGP peer, 26% of IPv4 advertised. How are operational routers impacted by BGP routing changes? Metric: cpu utilization (load). On average 130 changes to the table/min in the Sprint network, going up to 5000/min. Slowing down the processor will affect BGP convergence. Looked at 196 routers in Sprint, Cisco GSR 12000 or 7500-VIP, 256 or 512MB, 200MHz R5000, show proc cpu, SNMP data (1 min exponentially decayed moving average of cpu load) every 5 mins for 2.5 yrs, iBGP, all 150+ route reflectors. Aggregate behavior: cpu usage since uptime, high-load instances, large time scales (mins), typical & abnormal conditions (SQL Slammer worm), also very fine time scales (seconds). BGP can consume 95% of cycles. CPU utilization is > 60% for a majority of routers. Over the last 3.5 years < 1% of data points had > 50% load, mostly occurring in isolation. High-load episodes occur in isolation. Typical router loads are 20%. The SQL Slammer worm attack caused edge links to get congested, which caused BGP sessions to be lost and links to lose connectivity, so congestion went down and recovered, the BGP table updated again, connectivity returned and the cycle repeated. The whole sequence took many hours (tens of hours). There was a fairly strong correlation between the BGP updates and cpu utilization during this period. Max cpu load experienced was < 40% (i.e. an increase of 20% over background, not something to be worried about).