ISMA2004 Data Catalog Workshop

SDSC, June 3, 2004

Introduction - kc

About 30 attendees at the one-day workshop. Three-year project funded by NASA. Want to open up network measurement data to the community. This is an interim workshop to show progress so far and to get feedback on the directions.

There are many challenges in the measurements. Scientifically rigorous monitoring & instrumentation were not included in the post-NSFnet Internet. The data we have is disparate, incoherent, limited in scope, scattered, unintended. Need globally relevant measurements: a rational architecture for data collection, instrumentation for > OC48 links. This project is about developing the metadata to describe the data, including annotation. Mission is to help researchers share data; there will be a clearinghouse. Includes pedigrees, how to navigate, audit trails, well-managed metadata, understanding sampling implications, anonymization tools. How collected, by whom, when, where saved, access policies, format, packaging, compression.

Tasks: improve measurement tools for high-speed links, expanded security, modules for storage & manipulation of data, coordinate movement of traffic measurement data, front-end user interface, create a back-end information management system, maintain & develop compelling tools, solicit input from concerned research & standards groups.

The Internet is not well understood, e.g. trends, see "The Digital Imprimatur" by John Walker.

Motivation and Challenges - Colleen Shannon

Lots of data out there (traces, routing tables, traceroutes, security, names, geographic); how does one represent the research importance of the data? Goal is to provide an easy way for users to find data and for contributors to publish data. Users want data that is 100% complete, 100% accurate, and freely available; contributors are often not funded. Have to make the catalog attractive to both users and contributors. Help users share information they discover in the data. Give users the ability to correct incorrect information provided about the data. Be ambitious: try to come up with all kinds of uses, not all of which will be realized. Start simple in building the system and work up to full functionality using feedback; multiple access modes are a necessity.

Internet Measurement Data Catalog - David Moore

Focus: help researchers find available net data; central conceptual object: Data Descriptor (DD). The DD maps conceptually to the level of a file and is shared between all references to the same data item, even if it is available in multiple ways (views). DD fields: name; description (long, short, URL); keywords; file size; format; location (geographic, net, logistic); platform; time period (start, end, TZ offset, TZ name); creation process; MD5 hash (to detect duplicates and to make sure the data is not corrupted). Granularity is at the file level. Common fields: creator (not a person, the creator of the data), contributor (who actually puts the data in the database), creation time, modification date.

Format Descriptor (FD): when a researcher gets some data, how can they process it; pointers to information about file formats. FD contains: name, description, keywords, package or data format, type (ascii/binary/mixed), file suffixes (comma-separated list of suffixes, can overlap).

Package Descriptor (PD): a physical grouping of one or more data files, can be thought of as a downloadable unit; may contain multiple data files, and a data file may be in multiple packages. PD fields: name, description, keywords, file size, format ID, MD5 hash, linkage to contained DD/PD via a path.

Location Descriptor (LD): tells a researcher how to actually fetch some data; packages may be available from multiple locations, and not all packages will be directly available. LD fields: download URL, download procedure (including AUPs), geographic location of server.
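
A minimal sketch of how the descriptor hierarchy above could be represented; the field names follow the talk, but the class layout (and the use of Python dataclasses) is illustrative, not the actual IMDC schema.

```python
# Illustrative layout of the descriptors described above; field names follow
# the talk, but the dataclass structure is an assumption, not the IMDC schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataDescriptor:              # DD: granularity is one data file
    name: str
    description: str               # long, short, URL in the real schema
    keywords: List[str]
    file_size: int                 # bytes
    format_id: str                 # refers to a Format Descriptor (FD)
    location: str                  # geographic, net, logistic
    platform: str
    time_start: str                # with TZ offset and TZ name
    time_end: str
    creation_process: str          # free text for now
    md5: str                       # detects duplicates and corruption
    creator: str                   # creator of the data, not a person record
    contributor: str               # whoever put the data into the catalog

@dataclass
class PackageDescriptor:           # PD: a downloadable grouping of files
    name: str
    description: str
    keywords: List[str]
    file_size: int
    format_id: str
    md5: str
    contents: List[str] = field(default_factory=list)  # paths to contained DDs/PDs

@dataclass
class LocationDescriptor:          # LD: how to actually fetch a package
    download_url: str
    download_procedure: str        # including any AUP to sign
    server_geography: Optional[str] = None
```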

Creation process: currently text fields until we gain a better understanding of what people might want to put here (may include noting that data was derived from other data).

ToolSet Descriptor (TD): what tools are available to process some data, what tools were used to generate data; version info allows for different versions. TD fields: name, description, keywords, release date, OS ... waiting to see how these are used before formalizing further. Notes: annotation; bugs: annotation.

Study Descriptor (SD): want to know what data was used in a paper/web writeup and what results are available about specific data. SD fields: name, description, keywords, linkage to DDs and TDs, linkage to a StudyWriteup (i.e. text of the paper).

Generalized collections: groupings of data with a specific purpose; groupings may not exist physically, e.g. AS topology collection #27: all skitter and surveyor traceroute data for Dec 1, 2003 ... Could be very important for identifying the data sets used in a paper, or for others to use.

Annotations: what other information is there about this data (or tool or study or package)? How do I let other people know something important I learned about this data, e.g. at 1:30am the primary route failed, between T1 & T2 there was a DoS attack. Annotation dictionary: key name (e.g. hierarchical namespace, FORMAT-pcap-snaplen), description, value type, position type (time range, all, string). Will standardize further when widely accepted. Annotation fields: dictionary key, "object" of annotation (DD, PD, LD), value, position (e.g. time). Might add a field for Warning, Information, etc.
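
A hypothetical illustration of the annotation scheme above: one dictionary entry per key, and individual annotations that attach a value to some object (DD, PD, LD) at an optional position. The key names beyond FORMAT-pcap-snaplen, the ids, and the values are made up for illustration.

```python
# Hypothetical example of the annotation scheme: a dictionary entry defines
# the key, and an annotation attaches a value to an object at a position.
annotation_dictionary = {
    "FORMAT-pcap-snaplen": {          # hierarchical key name from the talk
        "description": "snap length used when the pcap trace was captured",
        "value_type": "integer",
        "position_type": "all",       # applies to the whole object
    },
    "EVENT-dos-attack": {             # assumed key, for illustration only
        "description": "a DoS attack was observed in the trace",
        "value_type": "string",
        "position_type": "time_range",
    },
}

annotation = {
    "dictionary_key": "EVENT-dos-attack",
    "object": ("DD", 1042),           # annotated object type and id (assumed)
    "value": "SYN flood against a single /24",
    "position": ("2003-12-01T01:30:00Z", "2003-12-01T01:45:00Z"),
    # a severity field (Warning / Information) may be added later
}
```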

Contact: login, password, name, description (long, short, URL), email (hideable), phone (hideable), address (hideable), country (hideable), organization, research interests.

So far have implementations of DD, FD, PD, contacts & locations. Showed a demo allowing selection of descriptors and then search on the items in the fields for the chosen descriptor. Can then look at the information to decide what data is interesting and find out how to get it. May be useful to add a field to indicate how public the data is (e.g. freely available & fetched from here in an automated fashion vs. contact the author, sign an AUP, and it will be sent on DVD).

Data Collection at CAIDA - Colleen Shannon & Brad

Working on routing topology, passive & workload measurement, bandwidth estimation, IMDC, flow collection, security, visualization.

Data sets:

IP Monitoring at Sprint - Supratik Bhattacharyya

Packet-level data collected using specialized IPMON systems with Endace DAG cards. Also collect Cisco NetFlow data, periodic BGP tables, continuous BGP & IS-IS table updates, SNMP utilization. Data storage & management infrastructure uses GFS (see http://www.sistina.com/products_gfs.htm). The original goal was to provide automated analyses of the traces for requesters. This did not fly since it was not possible to support arbitrary requests, results had to be sanitized, allocation of compute resources had to be automated, the existing user base was not ready, tools were not stable or documented, etc. Now, after a trace is collected, it is archived on tape, "cleaned" & put on a SAN; a script on the SAN checks for new clean traces and runs flow analysis; on the web site, traces are organized by date of collection; upon accessing a given data set, a meta-database is consulted for details (e.g. link name).
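
A minimal sketch of the SAN-side step described above: find cleaned traces that have not yet been analyzed and run a flow analysis over each one. The directory layout, marker files, and the analysis command are assumptions, not Sprint's actual scripts.

```python
#!/usr/bin/env python3
# Sketch: look for newly cleaned traces on the SAN and run flow analysis.
# Paths, marker files, and the "flow-analysis" command are assumptions.
import subprocess
from pathlib import Path

CLEAN_DIR = Path("/san/traces/clean")   # assumed location of cleaned traces
DONE_SUFFIX = ".flows.done"             # marker written after analysis

def done_marker(trace: Path) -> Path:
    return trace.parent / (trace.name + DONE_SUFFIX)

def pending_traces():
    for trace in sorted(CLEAN_DIR.glob("*.pcap")):
        if not done_marker(trace).exists():
            yield trace

if __name__ == "__main__":
    for trace in pending_traces():
        # "flow-analysis" stands in for the real analysis tool
        subprocess.run(["flow-analysis", str(trace)], check=True)
        done_marker(trace).touch()
```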

SOAP on a Rope: Standard Schemas for Grid Network Measurements - Dan Gunter

Efficiency vs. Extensibility in network measurement systems - Martin Swany

Information services at the RIPE NCC - Henk Uijterwaal

See TTM: www.ripe.net/ttm, RIS: www.ripe.net/ris. Measure delay, losses & other IPPM metrics. Raw data: traceroutes, packets sent, packets arrived (sent minus lost); processed data: DB with traceroutes, ROOT files combining send, receive & traceroute information. Traceroute MySQL DB has 2 tables (v4 & v6): table 1 has src, dst, set of validity ranges & index; table 2 has index, src, ip1, ip2, ip3, dst, and the AS # of all IP addresses based on RIS. Delay & loss data table with src, dst, send time, delay (or -1 if lost), error estimate on the clocks, index to the path. ROOT format: the data analysis framework used for the LHC; library & instructions available. Restricted AUP until 1/1/2004, now opened up considerably: publish anonymously, everybody can get a copy of the data, present to the RIPE community for comments first (RIPE meeting, mail the paper around). 120 GB/year, currently copied by hand, investigating another setup.
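
An illustrative sqlite3 sketch of the traceroute/delay tables described above. The production DB is MySQL, the column names are assumptions, and the ip1/ip2/ip3 hop columns are normalized into one row per hop here.

```python
# Illustrative sqlite3 sketch of the TTM traceroute/delay tables (assumed
# column names; the real system uses MySQL with per-hop ip1, ip2, ... columns).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Table 1: which path index was valid for a src/dst pair, and when
CREATE TABLE path_validity (
    src        TEXT,
    dst        TEXT,
    valid_from TEXT,
    valid_to   TEXT,
    path_idx   INTEGER
);

-- Table 2: the hops of each path, with AS numbers resolved via RIS
CREATE TABLE path_hops (
    path_idx INTEGER,
    hop_num  INTEGER,
    hop_ip   TEXT,
    hop_asn  INTEGER
);

-- Delay & loss measurements; delay is -1 when the packet was lost
CREATE TABLE delay_loss (
    src         TEXT,
    dst         TEXT,
    send_time   TEXT,
    delay_ms    REAL,      -- -1 if lost
    clock_error REAL,      -- error estimate on the clocks
    path_idx    INTEGER    -- index into the path tables
);
""")
```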

RIS: RIB dumps (3/day) in MRT format, plus time-stamped BGP updates. IPv4: up to 12 locations from 9/1999 onwards, volume 250 GB/year; IPv6: from 10/2002 onwards, a few GB. Software available to read the files; log files available online. AUP: download & analyze as you like, tell us if you publish anything.

DNSMON: monitoring of DNS servers, beta service, full service next year, data can be made available for research.

Whois DB: route, inet, ..., contact information (name, address, phone numbers), privacy, can be mirrored (sign document, restricted due to contact information).

Future plans: information services, provide data for community (different sets for different target groups, explanation of what is in the data).

Abilene Observatory Datasets - Matt Zekauskas

Datasets: flow data (last 11 bits of IP zeroed), latency (one-way, 2*11^2 paths (v4, v6)), near future IGP (IS-IS updates per node), SYSLOG dumps from routers, router snapshots, 1 & 5 min SNMP usage and errors, throughput (iperf, v4, v6, UDP, TCP), multicast.
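
A small sketch of the flow anonymization mentioned above (zeroing the last 11 bits of each IPv4 address); the function is illustrative, not Abilene's actual code.

```python
# Zero the low 11 bits of an IPv4 address, as described for the flow data.
import ipaddress

def zero_low_bits(addr: str, bits: int = 11) -> str:
    """Return addr with its least-significant `bits` bits set to zero."""
    ip = ipaddress.IPv4Address(addr)
    mask = (0xFFFFFFFF >> bits) << bits          # 11 bits -> a /21-style mask
    return str(ipaddress.IPv4Address(int(ip) & mask))

# Example: the last 11 bits span parts of the last two octets.
print(zero_low_bits("192.0.2.200"))              # -> 192.0.0.0
```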

Summaries available via web: graphs, tables, time series (of summaries) via "web services". Raw data in diverse formats only by special request (probably not enough metadata; stored by date (some tar balls); recovery is a manual process, except for router snapshots, which are all XML files). Flow data is gone after 30 days.

Stuff is archived in many places with different administrative hurdles. Data is available off http://abilene.internet2.edu/observatory as a twisty maze of web links. Validation: some of the flow data is known to be bad.

Future plans: new databases for IGP and BGP; looking to use a DHS grant to help clean up databases & improve access; hope to contribute to this effort.

Flow data is collected using Mark Fullmer's flow-tools with 1/100 sampling (may be losing some data); summaries are stored forever; raw data access via rsync (one directory per day per router).

Latency: OWAMP, 1/sec Poisson, full mesh among router nodes. Summaries stored in MySQL with web displays, including graphical views & a worst-"10" list; XML/SOAP access at http://abilene.internet2.edu/ami/webservices.html using NMWG; raw data you have to ask for.
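
A sketch of the Poisson probe scheduling mentioned above: at an average rate of 1 probe/second, the inter-probe gaps are exponentially distributed with mean 1 s. This only illustrates the sampling process; it is not OWAMP.

```python
# Poisson probe scheduling: exponential inter-probe gaps with mean 1/rate.
import random

def poisson_send_times(rate_hz: float = 1.0, duration_s: float = 60.0):
    """Yield probe send times (seconds from start) for a Poisson process."""
    t = 0.0
    while True:
        t += random.expovariate(rate_hz)   # exponential inter-arrival gap
        if t > duration_s:
            return
        yield t

times = list(poisson_send_times())
print(f"{len(times)} probes scheduled in 60 s (expected ~60)")
```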

Router snapshots.

5 min SNMP, polled using custom software, stored in RRD files. Access depends on link type (backbone from weathermap ...)

NETI@home - George Riley

See http://www.neti.gatech.edu/

Goal: passive Internet measurement from world-wide vantage points, capture "real" users' experience, satisfy the need for collection of e2e experience. A large variety of measurements collected, for the most commonly used Internet protocols. Software should minimally affect the user and the user's system, support a large user DB, run in the background (little or no intervention), ship as an all-in-one distribution of Ethereal, libpcap, neti (future), and provide user motivation. Collected measurements are reported to GATech and will be made publicly available; scalable collection method. Users can select privacy levels (e.g. no IP addresses; the default gives the first 24 bits of the IP address; or tell me who you are). Open-source (GNU GPL), written in C++, built on top of Ethereal, an open-source packet sniffer; available for Windows (>= 95), Linux, *NIXes. Does not sniff packets in promiscuous mode; measurements are kept on a per-flow (bidirectional) basis; collect TCP, UDP, ... Easy to collect: IP addresses, ports, times, flags, packets sent & received, bytes sent & received, TTL values. Not so easy: RTT, losses, connection closure method, re-xmits, OS, TCP internals (cwnd).
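
A hedged sketch of the per-flow (bidirectional) bookkeeping and the default privacy level described above, where only the first 24 bits of each IPv4 address are reported. Function and field names are illustrative, not NETI@home's code.

```python
# Illustrative per-flow key and /24 privacy masking (assumed helper names).
import ipaddress

def mask_24(addr: str) -> str:
    """Keep only the first 24 bits of an IPv4 address (default privacy level)."""
    net = ipaddress.ip_network(f"{addr}/24", strict=False)
    return str(net.network_address)

def flow_key(src: str, sport: int, dst: str, dport: int, proto: str):
    """Order endpoints so both directions of a flow map to the same key."""
    a, b = (src, sport), (dst, dport)
    return (proto,) + (a + b if a <= b else b + a)

print(mask_24("203.0.113.77"))                                    # -> 203.0.113.0
print(flow_key("10.0.0.2", 443, "10.0.0.1", 5000, "tcp") ==
      flow_key("10.0.0.1", 5000, "10.0.0.2", 443, "tcp"))         # -> True
```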

Motivation for users is partially altruism, plus red dots on a map for locations contacted during TCP sessions (e.g. web browser). Got written up on Slashdot. As of June 1, 2004: 4113 downloads; approx 240 unique users contacted the server since May 26 (one week); approximately 500 MB of uncompressed binary data collected since May 26; approximately 730 unique users have contacted the server since January 7, 2004.

Spectral Techniques for Internet Traffic - Christos Papadopoulos

Topics: security, net management, spectral techniques. Spectral techniques can be used to detect congested links (see a peak at the bandwidth of the link). Playground is Los Nettos, the regional net for the LA area, 15 years in existence. Monitor links to ISPs. Trace software runs on FreeBSD with tcpdump, 2-minute traces. Detect attacks. Have about 80 DDoS attacks (anonymized, binned into 1 ms time series), available as a DVD; must sign a 1-page reasonable MOU; about 8-10 takers so far.
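
A minimal sketch of the spectral idea: bin packet arrivals into 1 ms buckets and look at the power spectrum of the resulting series; a strong periodic component shows up as a peak. Synthetic data and numpy only; this is not the actual analysis code.

```python
# Spectral analysis of a binned (1 ms) packet-count series with a periodic
# component injected; a congested link or periodic attack shows up as a peak.
import numpy as np

fs = 1000                                    # 1 ms bins -> 1000 samples/second
t = np.arange(0, 10, 1 / fs)                 # 10 seconds of binned counts
counts = np.random.poisson(5, t.size).astype(float)
counts += 3 * np.sin(2 * np.pi * 60 * t)     # inject a 60 Hz periodicity

spectrum = np.abs(np.fft.rfft(counts - counts.mean())) ** 2
freqs = np.fft.rfftfreq(counts.size, d=1 / fs)
print(f"strongest component near {freqs[spectrum.argmax()]:.1f} Hz")   # ~60 Hz
```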

Wide Area network data and analysis at MIT - Nick Feamster

Using the RON testbed: topology of 31 distributed nodes with stratum-1 NTP servers, CDMA time sync. Active probes: periodic pairwise probes, local logging of one-way loss and delay; failure = 3 consecutive lost probes over 2 minutes. Failure-triggered traceroutes, daily pairwise traceroutes over the testbed topology, iBGP feeds at 8 measurement hosts (Zebra), data pushed to a centralized measurement box.

Changes in connectivity (IP renumbering breaks BGP sessions), upstream providers change. Home-brew tools (sometimes buggy ... keep raw files). Management: continuous collection vs. archival (snapshots take space). MySQL table corruption, disk failures, etc. Collection machine downtime (power outages, moves, etc.). Complaints (preemptions: DNS TXT record, mailing NANOG, etc.). Collection subtleties: keeping track of downtimes, session resets, etc.; hosts not firewalled; some hosts located in the "core" (e.g. GBLX hosts); iBGP sessions to a border router on the same LAN.

BGP monitor overview: see http://bgp.lcs.mit.edu/ for summaries; can see BGP updates by time. Failure characterization study: where do failures occur; e.g. failures are seen about 3-4 minutes before BGP activity is seen. 60% of failures that appeared >= 3 hops from an end host coincided with at least one BGP message. Invalid prefix advertisement study (see http://bgp.lcs.mit.edu/bogons.cgi): many are private addresses leaking (large number of offending ASes); simple static filters would make a difference. Many bogons stick around for a day or more (over 50% persist for over one hour).
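
A hedged sketch of the kind of simple static filter mentioned above, flagging advertisements of private or reserved ("bogon") prefixes; a real filter would load a maintained bogon list rather than the short RFC 1918 list used here.

```python
# Flag private/reserved ("bogon") prefixes seen in BGP advertisements.
import ipaddress

# RFC 1918 space; a real filter would also use a maintained bogon list.
PRIVATE = [ipaddress.ip_network(p) for p in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_bogon(prefix: str) -> bool:
    """True for prefixes that should never appear in global BGP tables."""
    net = ipaddress.ip_network(prefix, strict=False)
    return (net.is_private or net.is_reserved or
            any(net.subnet_of(p) for p in PRIVATE))

for p in ("10.1.0.0/16", "192.0.2.0/24", "128.9.0.0/16"):
    print(p, "bogon" if is_bogon(p) else "ok")
```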

DIMES - Yuval Shavitt, Tel Aviv University

See http://www.netdimes.org/. Similar in concept to NETI@home. Windows only today; Linux in the autumn, Mac next year.

Overview of scalable security data management for internal/external data sharing - Bill Yurcik, NCSA

 

How do we build a culture that values data catalogs - Mark Allman

Data needs to be easy to set up and useful (often to the people who have to make the effort to set it up). Passive measurements have privacy/policy/legal/competitive issues: lots of reasons to say no, few to say yes. Easy to be lazy; it can be a big hassle to release data. Data may not be useful to others. Metadata is often only in the filename. Would need to create a big README so that others can untangle our mess. Researchers do not get credit for releasing data, maybe an ACK in a paper. Effort is on par with writing a good paper or a good piece of software. Not much funding for making data available. We need a cultural shift: not be lazy, collect metadata serving no purpose for ourselves, hold data sets in high esteem. Will require lots of mundane work. Need to commit to keeping a repository operational (not trivial); it must be easy to use and useful to researchers; measurement tools should help collect metadata (wrapper scripts). Need anonymization techniques that work. Could reject papers whose authors won't release the data (maybe too drastic); concentrate on easy stuff first (e.g. active data); find some pioneers to seed the system. Cite data sets; make data contribution a condition of funding.

Scalable Internet Measurement Repository (SIMR) - Ethan Blanton, Purdue

Forerunner of IMDC. Schema definitions are the crux of the project; determining what is interesting to report is important and difficult. Want to store metadata, but metadata about results is explicitly out of scope; where is the line between data and results?

Metropolis network measurement infrastructure - Timur Friedman, Paris 6

Measures RENATER and other French networks. Uses RIPE TTM boxes, NIMI, plus new tools, one called Pandora (a kind of extension of NIMI). Found it useful to convert data to XML due to existing parsing tools, data manipulation tools, and XML schemas to verify proper format; using XML for the traceroute@home system. Most tools do not output XML; data compression is needed for large data sets; encourage tools to emit XML natively.
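
An illustrative sketch of emitting a traceroute measurement as XML, in the spirit of the approach described above; element and attribute names are assumptions, not the project's actual schema.

```python
# Emit a traceroute record as XML (assumed element/attribute names).
import xml.etree.ElementTree as ET

hops = [(1, "192.0.2.1", 0.4), (2, "198.51.100.9", 5.2), (3, "203.0.113.30", 11.8)]

root = ET.Element("traceroute", src="192.0.2.10", dst="203.0.113.30",
                  timestamp="2004-06-03T16:00:00Z")
for ttl, ip, rtt_ms in hops:
    ET.SubElement(root, "hop", ttl=str(ttl), ip=ip, rtt_ms=str(rtt_ms))

print(ET.tostring(root, encoding="unicode"))
```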

Data needs for sampling the Internet to measure Performance - Juana Sanchez, UCLA Statistics

Objective is to introduce students in statistics courses to the field and motivate them to propose ideas and solutions. Probabilistic modeling. Single node (link) data analysis of traces (wavelets, Hurst, long-range dependence, etc.); new traces showing evidence for the Poisson assumption (CAIDA/Broido). General network tomography models: very difficult, heavy use of matrices. Network topology identification (Nowak, Rice U): knowing end-to-end delays, can I estimate the link tree structure in the middle? Sampling (e.g. sampling in routers: random, probabilistic, scheduled).

Bare-bones Measurement Data Archiving - Dave Plonka, U Wisconsin

Passive: exported flow data, SNMP-gathered measurement data. Active: some traceroute & ping-like text output, show ip bgp (from routeviews), campus data. Flow data: packet-sampled flow records from Junipers (varying sample rates, varying regularity), non-sampled flow data from Ciscos (sometimes lossy, always voluminous).

Archiving - short term: raw (binary) flow files, sometimes compressed; random access to 5-minute intervals, sequential access to unpredictably ordered flows; retained for 5-14 days for operational use; storage space limited. Long term: RRD files, occasionally copy raw data. Have used reversed DNS hostnames of the exporters or observation points (/edu/wisc/net/). Annotations: create detailed README files in each directory containing the data. Maintain a journal/log of events as "events.txt" (e.g. 2004/06/03 1600 happened thru 1730). These events are web-browsable using RRGrapher.

Encoding / anonymization / obfuscation: use ip2anonip, a simple filter for CSV files. Pros: easy to add arbitrary field rewrites. Cons: can take hours to prep a day-long flow data set, very tedious, results depend on the order of IPv4 addresses in the input, known attacks exist ... better to use CryptoPAn.
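
A hedged sketch of an ip2anonip-style CSV filter: each distinct address is replaced by the next address from a counter, which is exactly why the output depends on the order of addresses in the input (one of the cons noted above). Column positions and the output address range are assumptions; a prefix-preserving scheme such as CryptoPAn is preferable.

```python
# Order-dependent CSV IP anonymizer (illustrative, not ip2anonip itself).
import csv
import ipaddress
import sys

mapping = {}

def anonymize(addr: str) -> str:
    if addr not in mapping:
        # hand out 10.0.0.1, 10.0.0.2, ... in order of first appearance
        mapping[addr] = str(ipaddress.IPv4Address("10.0.0.0") + len(mapping) + 1)
    return mapping[addr]

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)
for row in reader:
    row[0] = anonymize(row[0])    # assume src IP in column 0
    row[1] = anonymize(row[1])    # assume dst IP in column 1
    writer.writerow(row)
```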

Access & usage policy tries the NLANR/CAIDA model: usage agreement documents that the recipient signs off on; data (and therefore analysis) resides on a central server. In theory, release as little as possible but no less. In practice, increased levels of access come with an improved (trust) relationship between researcher and practitioner (creator/archiver). Result is minimally successful. Useful to store multiple encodings of the same data (e.g. for follow-up questions); canonicalize net element names (r-peer.net.wisc.edu => border.our.domain). Often find an anomaly in sampled data, then drill down into non-sampled data based on the point in time. Can this be accommodated in the UI?