DataGrid Network Monitoring Infrastructure
This title is suitable since we are focusing on supporting LHC activities in this limited scope of work. Internet appears to be too broad.
Submitted by Dr. R. Les Cottrell, PI, SLAC
Overview of Project
See http://www.slac.stanford.edu/grp/scs/net/talk03/escc-jul04.ppt for a presentation.
Description of the network monitoring infrastructure
a. Centralized or distributed
b. Description and location of monitoring platform
c. OS
d. Access and security (if any)
e. Location
Monitoring Infrastructure Capabilities
a. List initial measurement tools
b. Data Analysis
c. Visualization
d. Others
Integration with DataGrid Infrastructure
a. DataGrid monitoring with wide-area network monitoring
b. User interface issues
c. Integration with ESnet/Abilene end-to-end network measurement effort?
d. Others


Tasks – These tasks are too vague in detail
Design, develop and productize an improved IEPM-BW toolkit to provide robust, regular end-to-end active network performance measurements with:
* Choice of amount of network bandwidth to be used: the monitoring site admin can select from a menu of tools, with documented guidance on the bandwidth used by each tool. In addition, the admin can choose whether to use QOS (e.g. via QBSS or HSTCP-LP) to limit bandwidth utilization.
* Choice of tools to make measurements
o Very low network traffic bandwidth estimator (ABwE)
o TCP memory to memory throughput (iperf)
o Bulk data throughput applications (bbftp, bbcp, GridFTP)
o Ping, traceroute
o Other tools based on demand and applicability (e.g. owamp)
* Enable choice of security requirements at remote hosts (ssh vs. run servers)
* When ready, integrated traceroute recording, analysis and reporting
* When ready, integrated anomalous event detection and reporting
* Improved infrastructure management tools for monitoring sites
o Detect and report failing applications, hung processes and unreachable hosts; restart daemons and report on restarts; ensure everything starts up right after a reboot, etc.
* Improved code distribution tools
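The management checks listed above (hung processes, daemon restarts, reporting) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the IEPM-BW implementation: the ProcStatus record, the max_runtime_s limit and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProcStatus:
    """Hypothetical snapshot of one monitored process."""
    name: str
    running: bool      # is the process currently alive?
    runtime_s: float   # seconds since it started (0 if not running)
    is_daemon: bool    # daemons are expected to be running at all times


def plan_actions(procs: List[ProcStatus], max_runtime_s: float) -> List[Tuple[str, str]]:
    """Return (action, process name) pairs to carry out and report."""
    actions = []
    for p in procs:
        if p.is_daemon and not p.running:
            # A daemon that has died is restarted, and the restart is reported.
            actions.append(("restart", p.name))
        elif not p.is_daemon and p.running and p.runtime_s > max_runtime_s:
            # A measurement that runs far longer than allowed is treated as hung.
            actions.append(("kill", p.name))
    return actions


if __name__ == "__main__":
    snapshot = [
        ProcStatus("iperf-server", running=False, runtime_s=0, is_daemon=True),
        ProcStatus("bbftp-transfer", running=True, runtime_s=900, is_daemon=False),
        ProcStatus("ping-probe", running=True, runtime_s=5, is_daemon=False),
    ]
    for action, name in plan_actions(snapshot, max_runtime_s=300):
        print(action, name)
```

Keeping the decision logic separate from the code that actually kills or restarts processes would let the same rules be tested offline and reused on both monitoring and remote hosts.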
Establish relations with support people at CMS, ATLAS and BaBar tier 0 and 1 sites
* Contact responsible people at sites ?
* Provide information on goals, benefits and desired outcome from the project ?
* Establish Point Of Contact (POC) person at each site for project
* Work with POCs at each site to deploy network monitoring platform - get an ssh account, or set up servers with checks to monitor, restart etc. when necessary, plus a host to run the toolkit on (preferably dedicated, and provided and administered by the monitoring site to simplify security administration)
For each monitoring site in turn:
* Install code, provide initial configuration template
o Initially SLAC will assist with this; as we develop better distribution tools, this will be done more by the monitoring site POC.
* Work with POC at each tier 0 and 1 site to assist in tuning configuration so tier 0 and 1 sites can monitor each other (as required by site)
* On request, provide guidance as tier 0 and 1 sites set up to monitor chosen tier 2 sites of interest
Develop a non-ssh (server-only) IEPM-BW option for remote hosts
* Provide option in monitoring host
* Provide downloadable toolkit for remote host admin, to include bbftp, iperf, and management tools (kill hung processes etc.)
Develop traceroute analysis and visualization, including topology maps, access to raw data, and archive navigation
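One possible starting point for the traceroute analysis is detecting when the route to a remote host changes between successive measurements. The sketch below is an assumption about how this might be done, not the design of the toolkit: the Route type, the function names and the treatment of "*" (a hop that did not answer) as unknown are all hypothetical choices.

```python
from typing import List, Tuple

Route = List[str]  # ordered hop addresses; "*" marks a hop that did not answer


def same_route(a: Route, b: Route) -> bool:
    """Routes match if they have equal length and every hop agrees,
    treating "*" as unknown rather than as a difference."""
    return len(a) == len(b) and all(x == y or "*" in (x, y) for x, y in zip(a, b))


def route_changes(history: List[Tuple[str, Route]]) -> List[Tuple[str, Route, Route]]:
    """Return (timestamp, old_route, new_route) for each detected change."""
    changes = []
    reference = None
    for timestamp, route in history:
        if reference is None:
            reference = route
        elif not same_route(reference, route):
            changes.append((timestamp, reference, route))
            # Only a real change replaces the reference, so a run of
            # partially unanswered traces cannot hide a later change.
            reference = route
    return changes


if __name__ == "__main__":
    # Addresses come from documentation ranges; these are not real measurements.
    history = [
        ("t1", ["192.0.2.1", "192.0.2.9", "198.51.100.7"]),
        ("t2", ["192.0.2.1", "*", "198.51.100.7"]),
        ("t3", ["192.0.2.1", "203.0.113.4", "198.51.100.7"]),
    ]
    for timestamp, old, new in route_changes(history):
        print(timestamp, old, "->", new)
```

The recorded changes could then feed the archive navigation and be drawn as differences on a topology map.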
Research, develop and evaluate an automated anomalous event detection algorithm
* Evaluate accommodating diurnal variations, and using multiple variables (RTT, capacity, available bandwidth, multiple sites etc.)
* Develop notification filters for events
* Work with Internet2, ESnet and others to design, develop and integrate a toolkit to identify and gather more detailed relevant information (traceroute, RTT and bandwidth histories, utilization and error counts from accessible and relevant devices), and make it available to appropriate troubleshooters
Understand the impact of active high performance network testing on other traffic, and evaluate the use of QOS techniques such as QBone Scavenger Service and HSTCP-LP to reduce the negative impact of iperf testing
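To make the idea of accommodating diurnal effects concrete, the sketch below compares each new measurement with the history for the same hour of day and flags large drops. It is a simple baseline offered as an assumption, not the algorithm to be researched: the per-hour mean and standard deviation, the threshold k and the function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (hour of day 0-23, measured throughput in Mbit/s)


def hourly_baseline(history: List[Sample]) -> Dict[int, Tuple[float, float]]:
    """Group past samples by hour of day and return {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two points; hours with less history are skipped.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous(sample: Sample, baseline: Dict[int, Tuple[float, float]],
                 k: float = 3.0) -> bool:
    """Flag a sample more than k standard deviations below the mean
    observed for the same hour of day."""
    hour, value = sample
    if hour not in baseline:
        return False  # not enough history to judge
    mu, sigma = baseline[hour]
    return value < mu - k * sigma


if __name__ == "__main__":
    # Invented numbers for illustration only: busy mornings, quiet nights.
    history = [(9, 100.0), (9, 110.0), (9, 90.0), (3, 400.0), (3, 420.0)]
    baseline = hourly_baseline(history)
    print(is_anomalous((9, 95.0), baseline))  # normal for a busy hour
    print(is_anomalous((3, 95.0), baseline))  # same value at a quiet hour
```

The same throughput can be normal at a busy hour and anomalous at a quiet one, which is the reason for keeping a separate baseline per hour. Extending this to multiple variables and adding notification filters would build on the same structure.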
Upgrade IEPM-BW code at tier 0 and 1 sites as new versions become available
Develop documentation
* Develop web site with goals, benefits, desired outcome
* Add reports for existing SLAC measurements
* Add technical information
Possibly: scheduling
The capabilities of the final network monitoring platform are still elusive after reading the above tasks.

At some point you have to prototype and test the monitoring infrastructure that will be deployed in various locations. When will this happen? 
Deliverables
A small focused infrastructure of ~10 self-managed sites with regular active bandwidth performance measurements. Possible initial sites to reflect the HENP tiered computation sites and ESnet needs:
* CERN (LHC tier 0 site)
* BNL, FNAL (ATLAS and CMS tier 1 sites)
* SLAC (BaBar tier 0 site)
* Network sites (ESnet, StarLight)
* Caltech, U Michigan, SDSC (LHC tier 2 sites)
* Optional: European tier 1 sites (INFN/Padova, IN2P3, RAL)
Robust IEPM-BW toolkit for making regular end-to-end throughput performance measurements, archiving the data, and analyzing and reporting on the results. This will include toolkits for both monitoring sites and remote (monitored) sites. For the monitoring host this will include:
* Monitor site selection of a wide range of measurement tools
* Monitor site choice of security access for remote hosts (this choice will probably mainly depend on the policies at the remote sites)
* Database support for improved selection of data
* Improved management tools
* Toolkit for managing remote hosts
An evaluation and implementation of using various QOS techniques to reduce the impact of high performance iperf tool testing on other traffic.
A new production quality anomalous event detection toolkit
* Publication on algorithm and its performance
* Incorporate diurnal effects and multi variables
* Distributable code
* Integrated into IEPM-BW toolkit
Event filtering and alert generation toolkit integrated into IEPM-BW
Toolkit to enable gathering information relevant to an alert 
A new traceroute analysis and visualization toolkit
* Including generation of and access to topology maps
* Published and presented to network audiences
* Distributable code
* Integrated into IEPM-BW toolkit
Lightweight bandwidth estimation (ABwE)
* Improved to meet security requirements
* Added RTT measurements
* Integrated into IEPM-BW
Access to data:
* Interactively via the web in various easy to use formats (e.g. tab, comma or space separated variables)
* Upon demand for large volumes of data
* Via prototype web services interfaces
Current Deployment
The following maps identify the major HENP monitoring sites in existence today (SLAC & FNAL), plus the location of the various remote (monitored) sites.
USA


Collaborators – we need collaborators identified at the following sites who have commitments from their management:

1. BNL – ATLAS – tier 1
2. Michigan U. – ATLAS – tier 2
3. FNAL – CMS – tier 1
4. SLAC – BaBar – tier 0
5. France – BaBar – tier 1
6. StarLight
7. UltraNET – ORNL
8. CERN (LHC tier 0)
9. One other approved HEP site

Timelines
First 2 years

Third year

Budget justification

Year 1:		$248,956
Year 2:		$255,580
Year 3:		$266,292
Total:		$770,827

See budget detailed sheets from SLAC.

SLAC Personnel

R. Les Cottrell - (0.17 FTE) will supervise the work on this project, direct and participate in the research and data analysis, interface with the HENP and ESnet communities to gather requirements and promote deployment.

Connie Logg - (0.17 FTE) will be responsible for architecting and overall design of the IEPM toolkit, code specification, and code development.

New Hire - (0.75 FTE other professional) will be responsible for day-to-day administration of the project, coordination with the Points of Contact at the monitoring sites, code development, testing, and deployment.

SLAC Direct Costs

Cost estimates have been presented in this proposal to be comparable to other research institutions' proposals. At the Stanford Linear Accelerator Center, actual costs will be collected and reported in accordance with the Department of Energy (DOE) guidelines. Total cost presented in this proposal and actual cost totals will be equivalent.

Senior Personnel – Item A.1-6
The salary figure listed for Senior Personnel is an estimate based on the current actual salary for an employee in her/his division plus 3% per year for inflation.

Fringe Benefits – Item C
Fringe Benefits for SLAC employees are estimated to be the following percent calculated on labor costs:
•	Career Employees – 30.5%
•	Students/Others – 3.49%

Travel – Items E.1 and E.2 
The senior staff members plan to attend domestic and/or foreign technical conferences/workshops in the areas of research covered by this proposal.  There will also be visits to monitoring sites to assist in installation etc. Total cost includes plane fare, housing, meals and other allowable costs under government per diem rules.

Other Direct Costs – Item G.6
The estimated cost of tuition for graduate students.

Indirect Costs – Item I

•	Materials and supplies, clerical support, publication costs, computer (including workstations for people) and network support, phone, site support, heating, lighting etc. are examples of activities included under indirect costs. Indirect costs are 36% of the Salaries including the Fringe, 36% of Travel costs and 6.8% on Materials and Supplies. The entry under this topic is for a high powered Linux workstation with the appropriate network interface to be used for network monitoring of the high speed paths.
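As a worked illustration of how the rates quoted above combine for one person, the sketch below applies the 3% yearly escalation, the 30.5% career fringe rate and the 36% indirect rate on salary including fringe. The base salary is a hypothetical round number, not a figure from the budget sheets, and the function name is likewise invented; travel, materials and equipment are left out, so the result does not reproduce the yearly totals.

```python
CAREER_FRINGE = 0.305  # fringe rate for career employees (from the text)
INDIRECT_RATE = 0.36   # indirect rate on salaries including fringe (from the text)
INFLATION = 0.03       # yearly salary escalation (from the text)


def loaded_salary_cost(base_salary: float, fte: float, year: int) -> float:
    """Salary plus fringe plus indirect cost for one person in project year 1, 2, ..."""
    salary = base_salary * fte * (1 + INFLATION) ** (year - 1)
    with_fringe = salary * (1 + CAREER_FRINGE)
    return with_fringe * (1 + INDIRECT_RATE)


if __name__ == "__main__":
    # 100000 is a hypothetical base salary chosen only to keep the arithmetic readable.
    for year in (1, 2, 3):
        print(year, round(loaded_salary_cost(100000.0, fte=0.17, year=year), 2))
```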

Facilities
SLAC is the home of the BaBar HENP experiment. BaBar was recently recognized as having the largest database in the world. In addition to the large amounts of data, the SLAC site has farms of compute servers with over 3000 CPUs. The main (tier A) BaBar computer site is at SLAC. In addition, BaBar has major tier B computer/data centers in Lyon, France; near Oxford, England; Padova, Italy; and Karlsruhe, Germany, which share TBytes of data daily with SLAC. Furthermore, BaBar has 600 scientist and engineer collaborators at about 75 institutions in 10 countries. This is a very fertile ground for deployment and testing of new bulk-data transfer utilizing improved TCP stacks and Grid replication middleware. There are close ties between the SLAC investigators, the BaBar scientists and the SLAC production network engineers.
What about CMS and ATLAS? The project is about the LHC and BaBar.

Relevant DataGrid LHC facilities 

a. CERN
b. FNAL
c. BNL
d. StarLight
e. ESnet network measurement effort
f. Abilene network measurement effort