Submitted by Dr. R. Les Cottrell, PI, SLAC
Presentation: http://www.slac.stanford.edu/grp/scs/net/talk03/scidac-dwmi-sep04.ppt
Today’s data-intensive sciences, such
as High Energy and Nuclear Physics (HENP), need to share data at high speeds. This
in turn requires high-performance, reliable end-to-end paths between the major
collaborating sites. In addition, end users need long- and short-term
estimates of network and application performance for planning, setting
expectations, and troubleshooting. Meeting these needs requires a network monitoring
infrastructure that provides measurements and analysis of network performance
between the major sites. The purpose of this proposal is to provide a monitoring
infrastructure that is initially relatively small but rich and robust, focused
on the needs of critical HENP experiments.
The network monitoring infrastructure
will initially be based on the existing Internet End-to-end Performance
Monitoring (IEPM) - BandWidth (BW) measurement
infrastructure and toolkit. This enables quick deployment of regular active
end-to-end monitoring of paths between monitoring hosts (MH) and remote
(monitored) hosts (RH). It also provides for archiving, analysis, and presentation
of the results, as well as interactive and machine access to the raw and
analyzed data and web-based navigation of the results.
Current HENP experiment collaborations
such as Atlas, BaBar and CMS are organized in a
hierarchical tiering of sites. To track this tiered
approach, the IEPM-BW monitoring hosts (MH) are currently independent of each
other, and each MH chooses to monitor sites of interest to it (e.g. a tier 1
site will monitor its main tier 2 sites). In other words, it is a hierarchical rather than full-mesh monitoring infrastructure.
Each MH site runs its own MH, without a central hierarchy of control.
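To illustrate why a hierarchical arrangement measures far fewer paths than a full mesh, the following minimal sketch counts the monitored paths in each case. The site names and the tier mapping are hypothetical and purely illustrative; they are not the project's actual configuration.

```python
from itertools import permutations

# Hypothetical tiering: each monitoring host (MH) lists the remote hosts (RH)
# it has chosen to monitor. All names are illustrative assumptions.
monitored_by = {
    "tier0-site": ["tier1-a", "tier1-b"],
    "tier1-a": ["tier2-a1", "tier2-a2"],
    "tier1-b": ["tier2-b1"],
}

def hierarchical_paths(config):
    """Return the (MH, RH) pairs measured under the tiered scheme."""
    return [(mh, rh) for mh, rhs in config.items() for rh in rhs]

def full_mesh_paths(config):
    """Return every ordered pair of distinct sites, as a full mesh would measure."""
    sites = sorted(set(config) | {rh for rhs in config.values() for rh in rhs})
    return list(permutations(sites, 2))

tiered = hierarchical_paths(monitored_by)
mesh = full_mesh_paths(monitored_by)
print(len(tiered), "tiered paths vs", len(mesh), "full-mesh paths")
```

With six sites, a full mesh grows quadratically with the number of sites, whereas the tiered scheme measures only the paths each MH has selected.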
Each MH has a Point of Contact (POC). The
POC, with assistance from the central IEPM-BW support people (CISP), is
responsible for the operation of the IEPM-BW toolkit on the MH including:
installation, configuration (in particular selection of and coordination with the
RHs to be monitored), and day-to-day operation of the
monitoring.
The MH makes regular measurements to
its selected set of RHs. The measurements are typically
stored and analyzed locally at the MH site. Since the
results, including the data, are made available via the web, they are publicly
accessible worldwide.
The physical hardware used for the MH and
RH is typically provided and administered by the host (monitoring or remote)
site. This lets the site determine exactly how the host is administered.
Typically the host is located inside the site’s border firewall/filtering
router, yet close to the site border or to the services having the most interest
in performance measurements on the wide area network (e.g. a computer or data
farm).
Usually the MH will also perform the
analysis and report generation, though this can be done on a separate host as
long as the data is accessible (e.g. by means of a network file
system). Besides providing a monitoring platform, the MH site will need to
provide a web server with access to the data analyzed by the IEPM-BW toolkit.
This server may run on the MH itself or may be a separate web server.
The MHs will
be located at central points of interest to the HENP and Grid community. This
includes LHC and BaBar tier 0 sites, tier 1 sites including FNAL and BNL, and
selected tier 2 sites such as U Michigan (Atlas) and Caltech (CMS). Other sites
will be chosen to reflect their importance to HENP networking and the DoE programs.
Currently, logging on to the RH and
executing commands on it from the MH is authenticated with ssh keys. This requires the RH site to provide an
account through which the MH POC can submit commands. Some potential RH sites are not
comfortable with this, so part of this proposal is to provide support for
running the servers “all the time” on the RH. This removes the need for ssh to start the servers. However,
it requires greater cooperation from the RH POC, who has to install and
configure the IEPM-BW RH toolkit, ensure the servers are always running, and so on. Part of the current project will be to develop
tools to facilitate these functions for the RH POC.
We will also evaluate and utilize the
Internet 2 E2E PiPES BWCTL scheduling and
authentication techniques to provide another
The initial measurement tool/probe
suite will include ping, forward and reverse traceroutes,
a lightweight bandwidth estimation tool, iperf, bbftp, and GridFTP. Other probes
will be evaluated and, if found suitable, integrated into the probe suite as part
of the proposed project. The IEPM-BW analysis and presentation tools will be
extended to improve data selection and access (including web services) and traceroute visualization.
A major problem emerging from today’s
monitoring is the sheer volume of reports that must be viewed manually when looking
for problems. The proposed project will develop advanced analysis for anomalous
event detection (including the elimination of periodic effects) plus techniques
to filter the events so as to reduce the noise reported to network operators.
We also plan to create an architecture to automatically
gather further information concerning filtered events.
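As an illustration of anomalous event detection with periodic effects removed, the sketch below subtracts a per-phase median baseline (for example, the typical value for each hour of the day) and flags points whose residual is large relative to the median absolute deviation. This is only one possible approach, shown under stated assumptions: the source does not specify the algorithm, and the function name, the `period` of 24 samples, and the `threshold` of 3 are all hypothetical choices.

```python
from statistics import median

def detect_anomalies(samples, period=24, threshold=3.0):
    """Flag indices whose value deviates strongly from the typical value at
    the same phase of the cycle (e.g. the same hour of day).

    samples:   measurements taken at regular intervals; must cover at least
               one full period.
    period:    samples per cycle (assumed here: 24, i.e. hourly data).
    threshold: allowed deviation, in multiples of the median absolute
               deviation (MAD) of the residuals.
    """
    # Baseline for each phase: median of all samples sharing that phase.
    baseline = [median(samples[p::period]) for p in range(period)]
    # Subtracting the baseline removes the periodic (e.g. diurnal) component.
    residuals = [x - baseline[i % period] for i, x in enumerate(samples)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        mad = 1e-9  # avoid division by zero on perfectly regular data
    return [i for i, r in enumerate(residuals)
            if abs(r - center) / mad > threshold]
```

Using medians rather than means keeps the baseline from being pulled toward the anomalies themselves; raising the threshold is a simple way to reduce the number of events reported to operators.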
For robustness, it is critical to
provide tools to detect infrastructure problems such as hung processes, servers
that have died, files that should have been deleted, hosts that are not
accessible, and ports that have been blocked. The proposed project will develop
infrastructure management tools that automate the detection and
logging of such problems, fix them when possible, provide web-accessible summary
reports, and notify POCs if necessary.
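Two of the checks listed above (unreachable servers or blocked ports, and files that should have been deleted) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the IEPM-BW management tooling itself; the function names, the scratch directory argument, and the one-day age limit are assumptions.

```python
import os
import socket
import time

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def stale_files(directory, max_age_seconds, now=None):
    """Return sorted paths of regular files older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.path)
    return sorted(stale)

def check_host(host, port, scratch_dir, max_age_seconds=86400):
    """Collect human-readable problems for one host (empty list if healthy)."""
    problems = []
    if not port_is_open(host, port):
        problems.append(f"{host}:{port} not reachable (server down or port blocked)")
    for path in stale_files(scratch_dir, max_age_seconds):
        problems.append(f"stale file not cleaned up: {path}")
    return problems
```

A scheduler could run `check_host` periodically for each MH and RH, log the returned list, and notify the POC only when it is non-empty.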
The project will also utilize the wide
diversity of IEPM-BW infrastructure paths to evaluate various QoS mechanisms, as well as the impact of the measurements on the
network itself and ways to limit the impact.
Finally, the project will collaborate
with other monitoring projects such as PiPES and AMP
to share data, tools, analysis and presentation
techniques and integrate where advantageous.