TeraPaths: DataGrid Wide Area Network Monitoring Infrastructure (DWMI)

Submitted by Dr. R. Les Cottrell, PI, SLAC

Presentation:  http://www.slac.stanford.edu/grp/scs/net/talk03/scidac-dwmi-sep04.ppt

Need

Today’s data-intensive sciences, such as High Energy and Nuclear Physics (HENP), need to share data at high speeds. This in turn requires high-performance, reliable end-to-end paths between the major collaborating sites. In addition, end users need long- and short-term estimates of network and application performance for planning, setting expectations and troubleshooting. Meeting these needs requires a network monitoring infrastructure that provides measurements and analysis of network performance between the major sites. The purpose of this proposal is to provide an initially small but rich and robust monitoring infrastructure focused on the needs of critical HENP experiments.

Design

The network monitoring infrastructure will initially be based on the existing Internet End-to-end Performance Monitoring (IEPM) - BandWidth (BW) measurement infrastructure and toolkit. This enables quick deployment of regular active end-to-end monitoring of paths between monitoring hosts (MH) and remote (monitored) hosts (RH). It also provides for archiving, analysis, and presentation of the results, as well as interactive and machine access to the raw and analyzed data and web-based navigation of the results.

Current HENP experiment collaborations such as Atlas, BaBar and CMS are organized in a hierarchical tiering of sites. To track this tiered approach, the IEPM-BW monitoring hosts (MH) are currently independent of each other, and each MH chooses to monitor sites of interest to it (e.g. a tier 1 site will monitor its main tier 2 sites). In other words, it is a hierarchical rather than a full-mesh monitoring infrastructure. Each MH site runs its own MH, without a central hierarchy of control.
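The difference in scale between hierarchical and full-mesh monitoring can be illustrated with a small sketch. The site names and the monitoring map below are hypothetical placeholders, not the actual IEPM-BW configuration. The point is only that each MH independently lists the RHs it cares about, so the number of monitored paths follows the tier fan-out rather than the square of the number of sites.

```python
# Hypothetical tiered monitoring map: each monitoring host (MH) lists only
# the remote hosts (RH) it chooses to monitor. All names are illustrative.
MONITORING_MAP = {
    "tier0-site": ["tier1-a", "tier1-b"],
    "tier1-a": ["tier2-a1", "tier2-a2"],
    "tier1-b": ["tier2-b1"],
}


def hierarchical_paths(monitoring_map):
    """Return the directed (MH, RH) pairs that are actually measured."""
    return [(mh, rh) for mh, rhs in monitoring_map.items() for rh in rhs]


def full_mesh_path_count(monitoring_map):
    """Count directed paths if every site measured every other site."""
    sites = set(monitoring_map)
    for rhs in monitoring_map.values():
        sites.update(rhs)
    n = len(sites)
    return n * (n - 1)


paths = hierarchical_paths(MONITORING_MAP)
print(len(paths), "hierarchical paths vs",
      full_mesh_path_count(MONITORING_MAP), "in a full mesh")
```

Because each entry in the map belongs to a single MH, a site can add or drop an RH without coordinating with any other MH, which matches the absence of central control described above.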

Each MH has a Point of Contact (POC). The POC, with assistance from the central IEPM-BW support people (CISP), is responsible for the operation of the IEPM-BW toolkit on the MH, including installation, configuration (in particular selection of and coordination with the RHs to be monitored), and day-to-day operation of the monitoring.

The MH makes regular measurements to its selected set of RHs. The measurements are typically stored and analyzed locally at the MH site. Since the results, including the data, are made available via the web, they are publicly available worldwide.

Description and location of monitoring platforms

The physical hardware used for the MH and RH is typically provided and administered by the host (monitoring or remote) site. This enables the site to determine exactly how the host is administered. Typically the host is located inside the site’s border firewall/filtering router, yet close to the site border or to the services having the most interest in performance measurements on the wide area network (e.g. a computer or data farm).

Usually the MH will also perform the analysis and report generation, though this can be done on a separate host as long as the data is accessible from it (e.g. by means of a network file system). Besides providing a monitoring platform, the MH site will need to provide a web server with access to the data analyzed by the IEPM-BW toolkit. This server may run on the MH itself or may be a separate web server.

The MHs will be located at central points of interest to the HENP and Grid community. This includes LHC and BaBar tier 0 sites, tier 1 sites including FNAL and BNL, and selected tier 2 sites such as U Michigan (Atlas) and Caltech (CMS). Other sites will be chosen to reflect their importance to HENP networking and the DoE programs.

Access and security

Currently, access for logon and for executing commands from the MH on the RH is by ssh keys. This requires the RH site to provide an account for the MH POC to submit commands to. Some potential RH sites are not comfortable with this, so part of this proposal is to provide support for running the servers “all the time” on the RH. This removes the need for ssh to start the servers. However, it requires greater cooperation from the RH POC, since she/he has to install and configure the IEPM-BW RH toolkit, ensure the servers are always running, and so on. Part of the current project will be to develop tools to facilitate these functions for the RH POC.

We will also evaluate and utilize the Internet2 E2E PiPES BWCTL scheduling and authentication techniques to provide another security option, as well as a scheduling mechanism for network-intensive probes such as iperf.

Monitoring Infrastructure Capabilities

The initial measurement tool/probe suite will include ping, forward and reverse traceroutes, a lightweight bandwidth estimation, iperf, bbftp, and GridFTP. Other probes will be evaluated and, if found suitable, integrated into the probe suite as part of the proposed project. The IEPM-BW analysis and presentation tools will be extended to improve data selection and access (including web services) and traceroute visualization.

A major problem evolving from today’s monitoring is the sheer volume of reports to be manually viewed when looking for problems. The proposed project will develop advanced analysis for anomalous event detection (including the elimination of periodic effects) plus techniques to filter the events so as to reduce the noise reported to network operators. We also plan to create an architecture to automatically gather further information concerning filtered events.
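As an illustration of what such detection and filtering might look like, the following is a minimal sketch and not the project's algorithm. It assumes regularly spaced measurements (for example hourly throughput), removes periodic effects by subtracting the median value seen at the same phase of the cycle, flags residuals that are large relative to the median absolute residual, and reports only runs of consecutive outliers so that isolated glitches do not reach network operators. The function name, parameters, and default values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def detect_anomalies(samples, period=24, threshold=5.0, min_run=2):
    """Flag sustained deviations from a periodic baseline.

    samples:   list of (slot_index, value) pairs, e.g. hourly throughput.
    period:    slots per cycle (24 for a daily cycle of hourly data).
    threshold: residual size, in units of the median absolute residual.
    min_run:   consecutive outliers required before an event is reported.
    Returns a list of (first_slot, last_slot) events.
    """
    # 1. Periodic baseline: median of all values seen at the same phase.
    by_phase = defaultdict(list)
    for slot, value in samples:
        by_phase[slot % period].append(value)
    baseline = {phase: median(vals) for phase, vals in by_phase.items()}

    # 2. Residuals after removing the periodic component.
    residuals = [value - baseline[slot % period] for slot, value in samples]
    scale = median(abs(r) for r in residuals) or 1e-9  # avoid dividing by 0

    # 3. Filter: keep only runs of at least min_run consecutive outliers.
    events, run = [], []
    for (slot, _), r in zip(samples, residuals):
        if abs(r) / scale > threshold:
            run.append(slot)
        else:
            if len(run) >= min_run:
                events.append((run[0], run[-1]))
            run = []
    if len(run) >= min_run:
        events.append((run[0], run[-1]))
    return events
```

Medians are used here instead of means so that the anomalies being searched for do not distort the baseline they are compared against. The `min_run` filter is the simplest possible noise reduction; a production system would likely need something more selective.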

For robustness, it is critical to provide tools to detect infrastructure problems such as hung processes, servers that have died, files that should have been deleted, hosts that are not accessible, ports that have been blocked etc. The proposed project will develop infrastructure management tools to facilitate and automate the detection and logging of such problems, fix them when possible, provide web accessible summary reports, and notify POCs if necessary.
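Two of the checks listed above, inaccessible hosts or blocked ports and files that should have been deleted, can be sketched with the Python standard library alone. This is a hedged illustration under stated assumptions rather than part of the IEPM-BW toolkit: the function names, the choice of a TCP connection attempt as the reachability test, and the use of file modification time as the staleness criterion are all assumptions.

```python
import os
import socket
import time


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def stale_files(directory, max_age_seconds, now=None):
    """Return sorted paths of regular files older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.path)
    return sorted(stale)


def check_host(host, ports, scratch_dir, max_age_seconds):
    """Collect human-readable problem descriptions for one host."""
    problems = []
    for port in ports:
        if not port_reachable(host, port):
            problems.append(
                f"{host}:{port} not reachable (server down or port blocked)")
    for path in stale_files(scratch_dir, max_age_seconds):
        problems.append(f"stale file: {path}")
    return problems
```

The list returned by `check_host` could be written to a log, rolled into a web-accessible summary, or used to decide whether a POC needs to be notified. A failed connection attempt cannot by itself distinguish a dead server from a blocked port, which is why the message names both possibilities.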

The project will also utilize the wide diversity of IEPM-BW infrastructure paths to evaluate various QoS mechanisms, as well as the impact of the measurements on the network itself and ways to limit that impact.

Finally, the project will collaborate with other monitoring projects such as PiPES and AMP to share data, tools, analysis and presentation techniques and integrate where advantageous.