Interest in Traceping

Les Cottrell, created March 31, 1999, last update May 16, 2000
Traceroute archives at: CERN/Geneva/Switzerland | Oxford/UK | SLAC/California/US | TRIUMF/Vancouver/Canada

Reason for Interest

Increasingly we are finding that to understand and compare the monitoring results from PingER, we need the routing information before, after and possibly during some observed change in performance. The IETF/IPPM folks have recognized this critical need in the Surveyor tools, which run traceroutes approximately every 20 minutes between their sites (about 41 sites in a roughly full mesh, which comes to about 1600 pairs). They represent a different community from HENP/ESnet, although there is some overlap. The web page on the Impact of routing on End-to-end Internet performance illustrates the effects of routing changes.

Possible approach

Rather than develop a new package around traceroute to provide such information, we are looking at the existing traceping package from Oxford, or something similar. Traceping (examples of traceping reports are available) would allow us to perform the traceroutes to a list of sites (e.g. the Beacon sites), gather the data, archive it, display the results, and allow navigation to more detail or to older data. Today (March 1999), there are traceping monitor/probe processes running at DESY, CERN, Oxford, Munich, Gran-Sasso & SLAC. SLAC & Oxford are also analysis/collector sites, and Oxford acts as the collector/analysis site for the other monitor sites.

One gotcha today (March 1999) is that it only runs under VMS. The author (John MacAllister) is actively looking at porting to Perl with the intention of running it on Unix, Linux or Windows NT. Thus in March 1999 we ran a survey of the ESnet & HENP monitoring sites to see how many of the PingER monitoring sites would be willing/able to run traceping on a VMS platform. Traceping Contacts provides a list of the contacts for each of the sites contacted. We followed up the above survey with another to see how many sites would be willing to run traceping if it were ported to Perl and could run on Unix, Linux or Windows NT. The table below shows the results of the surveys.

Site        State               VMS  Perl Platform Choice
ARM         Willing             No   Solaris
BNL         Willing             No   Solaris
CERN        Running analyzer B  Yes  ?
Carleton    Willing             No   HPUX, Solaris, Linux
CMU         Willing             No   Linux, Digital
DESY        Running probe       Yes  ?
DOE         Willing             No   Solaris, Linux, WNT
FNAL        Willing             ?    Solaris or WNT
Gran Sasso  Running probe       Yes  ?
INFN        No answer           ?    ?
KEK         No answer           ?    ?
KFKI        Responded           No   ?
Munich      Running probe       Yes  ?
NBI         Willing             Yes  Digital, Linux
Oxford      Running analyzer B  Yes  ?
PNL         Willing             No   Solaris, Linux
RAL/DL      No answer           ?    ?
SLAC        Running analyzer B  Yes  Solaris, Linux, AIX
Taiwan      Willing             No   Linux
TRIUMF      Running analyzer B  Yes  ?
Toronto     Willing             Yes  ?
UMD         No answer           ?    ?

Notes:
Willing: Indicates the site is willing to run traceping.
Running probe: Indicates the site is already running the traceping probe task.
Running analyzer: Indicates the site is running a probe and also collecting and analyzing data for itself and possibly other sites.
B: Indicates that the site is running a test version of traceping that probes all Beacon sites.
Responded: Indicates the site responded to the first survey; we await a response to the second survey.
No answer: Indicates there has been no response from the site to either survey.
Perl Platform Choice: This column indicates on what platforms, in order of preference, the site would deploy a Perl version of traceping.

Concerns

Some concerns have been raised about the impact of traceping on the network. We list the concerns below, each followed by our response.
  1. Traceroute sends and receives packets to & from each router along the path to the remote host. This might correspond to a large amount of traffic.

    The standard mode of running traceping is to aim for three shots per hour, to avoid strange-looking plots. However, the number, size and frequency of the packets sent are all configurable parameters, so they can be adjusted with experience or to suit specific requirements. The normal setup is to send 9 packets of 100 bytes each approximately every 20 minutes. If the average route has 20 hops, that would give (per hour):

    50 sites X 3 shots X 9 packets X 20 hops = 27,000 packets per hour ==> about 8 packets per second or 6 kbits/sec per monitoring site
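The estimate above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the function and constant names are ours, the figures are the ones quoted in the text, and it counts only the packets sent by the monitoring site (not the replies coming back from the routers).

```python
# Back-of-envelope check of the per-site traceping load quoted above.
# All names here are illustrative; the numbers come from the text.
SITES = 50            # remote sites probed by one monitoring site
SHOTS_PER_HOUR = 3    # one shot roughly every 20 minutes
PACKETS_PER_SHOT = 9  # packets sent to each hop in one shot
HOPS = 20             # assumed average route length
PACKET_BYTES = 100    # size of each packet


def traceping_load(sites=SITES, shots=SHOTS_PER_HOUR, packets=PACKETS_PER_SHOT,
                   hops=HOPS, packet_bytes=PACKET_BYTES):
    """Return (packets per hour, packets per second, kbits per second)."""
    per_hour = sites * shots * packets * hops
    per_second = per_hour / 3600
    kbits_per_second = per_second * packet_bytes * 8 / 1000
    return per_hour, per_second, kbits_per_second


per_hour, pps, kbps = traceping_load()
print(f"{per_hour} packets/hour, {pps:.1f} packets/s, {kbps:.1f} kbits/s")
# prints: 27000 packets/hour, 7.5 packets/s, 6.0 kbits/s
```

Because every factor is a parameter, the same function shows how the load scales if, for example, the number of sites or the packets per shot is changed.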

  2. Traceroute works by provoking error handling in the routers along the path (i.e., responding to hop-count-exceeded by generating ICMP time-exceeded packets). It has to be more work to format and send an ICMP response than to simply forward a packet. This almost certainly is not an optimized operation in the routers, probably is not handled by on-board silicon, and would need the router's CPU to deal with. The question is how many traceroutes/second the Internet's core routers can handle, keeping in mind that their CPUs could already be pretty busy maintaining the core's humongous route tables.

    One way to minimize the traceping impact is for each traceping monitoring site to determine which sites it needs to keep traceroutes for (e.g. the Beacon sites, but if the monitoring site is already a Surveyor site, then it need not traceroute to Beacon sites already covered by Surveyor).

    We also posted a request to the IETF/IPPM news group asking for "comments, advice, measurements of the traceroute impact (e.g. how many traceroutes/second can a core router handle as a function of its typical load), is there any other way to learn the routing information history etc."

    The responses indicated that people did not feel our level of traceroute activity would impact the network. One suggestion was that, rather than always sending the default 3 probes for each TTL, traceroute should stop sending probes for a TTL once one of them succeeds. This should reduce the impact of traceroute by up to a factor of 3.
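To see where the "up to a factor of 3" comes from, the sketch below counts probes with and without the early stop. It is a pure simulation, not traceroute or traceping code: `probes_sent` and its input format (one list of booleans per hop, saying whether each successive probe would be answered) are hypothetical and exist only for illustration.

```python
def probes_sent(responses_per_hop, max_probes=3, stop_on_success=True):
    """Count the probes sent along one path.

    responses_per_hop holds one list of booleans per hop; entry i says
    whether probe i to that hop would be answered (made-up input data).
    """
    total = 0
    for answered in responses_per_hop:
        for i in range(max_probes):
            total += 1
            # Early stop: once a probe for this TTL succeeds, move to the next TTL.
            if stop_on_success and i < len(answered) and answered[i]:
                break
    return total


# A 20-hop path where 19 hops answer the first probe and one hop only
# answers the third probe.
path = [[True, True, True]] * 19 + [[False, False, True]]

print(probes_sent(path, stop_on_success=False))  # 60: always 3 probes per hop
print(probes_sent(path))                         # 22: 19 hops x 1 + 1 hop x 3
```

The full factor of 3 is reached only when every hop answers its first probe; hops that drop probes or never answer still cost up to 3 probes each.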

  3. The pings generated by traceping are directed at the intermediate routers rather than being traffic that is routed through them. Again, this could well be a non-optimized operation in the routers, requiring the router's CPU for handling. In fact, PingER's Requirements for hosts being monitored recommends against pinging routers. Further, since there are multiple routers in the path of each monitor-remote-host pair, the traffic from traceping could be much greater than from PingER itself.

    The effect of pinging can be reduced if by default traceping uses 0 pings for the routers along the route, except to sites where it is critical to understand the congestion and there is deemed to be adequate bandwidth. John will provide an option to support setting the number of pings to 0.

  4. Traceping adds to the load on DNS servers (both the root DNS servers and the servers for the intermediate sites being probed). This may or may not be significant, depending on how much caching is done by the systems that traceping is running on. In any case, DNS caching won't help when the address being looked up is not in DNS. The DNS implementations we're familiar with either do not cache negative responses or only cache them for a couple of minutes. Worse, some routers use private IP addresses, and attempts to look up their non-existent names have to be handled by the root DNS servers.

    This impact could be reduced if traceping does its own address-to-name caching.
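A minimal sketch of what such address-to-name caching might look like is given below. It is not traceping's implementation: the class `NameCache`, its `name_for` method and the one-day default lifetime are all assumptions made for illustration. The key point is that failed lookups are cached too, so routers with no DNS name (e.g. those using private addresses) are not looked up again on every shot.

```python
import socket
import time


class NameCache:
    """Cache reverse-DNS results, including failures (negative caching)."""

    def __init__(self, resolver=None, ttl=86400.0, clock=time.monotonic):
        # resolver: callable(address) -> name or None; defaults to real DNS.
        self._resolver = resolver or self._reverse_lookup
        self._ttl = ttl          # seconds an entry stays valid (assumed value)
        self._clock = clock
        self._entries = {}       # address -> (expiry time, name or None)

    @staticmethod
    def _reverse_lookup(address):
        try:
            return socket.gethostbyaddr(address)[0]
        except OSError:          # covers socket.herror and socket.gaierror
            return None

    def name_for(self, address):
        now = self._clock()
        entry = self._entries.get(address)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit, possibly a cached "no name"
        name = self._resolver(address)
        self._entries[address] = (now + self._ttl, name)
        return name


# Demonstration with a stand-in resolver so that no network access is needed.
calls = []

def fake_resolver(address):
    calls.append(address)
    return {"192.0.2.1": "router.example.net"}.get(address)

cache = NameCache(resolver=fake_resolver)
for _ in range(3):
    cache.name_for("192.0.2.1")   # has a name
    cache.name_for("10.0.0.1")    # private address with no name
print(len(calls))  # 2: each address was resolved only once
```

The lifetime is a trade-off: a long one minimizes DNS load, while a short one picks up renamed routers sooner.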

  5. How much CPU capacity is needed to run traceping?

    The main requirement for CPU cycles is for the analysis. The design allows the monitoring to be done on a separate machine from the analysis; the monitoring machine(s) send their data to the analysis/collection machine(s) via email. On a DEC Alpha 3400, each remote site requires about 1-2% of the CPU for the analysis.

  6. How much disk space does traceping require?

    The storage space required is about 1.5 Mbytes/day (or 45 Mbytes/month) for each site that is monitoring about 50 Beacon sites. The storage space difference between traceping with routes only and traceping with the pings is only about 5%.