CHEP95 Distributed Computing Environment Monitoring and User Expectations

Table 1: Comparison of Old Environment to Distributed Environment
Mainframe/Workstation	Distributed Environment
One OS	Many OSs & distributed system services
One local file system	Multiple distributed file systems
In "Glass House"	All over site, modifications by people with varying skills & responsibilities
Mature diagnostics with vendor call in	Roll your own diagnostics & reports

What's Changed that Makes Monitoring so Crucial now

2. Network growth:

· Extent/coverage of network increasing

· Number of devices increasing exponentially (30-50% / year is typical)

· Traffic doubling typically every 18 months

· Technology to manage the network is not advancing as fast as network technology itself
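
The two growth figures above are quoted in different forms (a yearly percentage and a doubling time), so a small conversion makes them comparable. The sketch below assumes steady compound growth, which the slide does not state; the function names are illustrative only. Under that assumption, doubling every 18 months is about 59% per year, and 30-50% per year means doubling roughly every 1.7 to 2.6 years.

```python
import math


def annual_growth_from_doubling(months_to_double: float) -> float:
    """Annual growth rate (as a fraction) implied by a doubling time in months.

    Assumes steady compound growth, which is an assumption of this sketch.
    """
    return 2 ** (12.0 / months_to_double) - 1.0


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant annual growth rate (as a fraction)."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    # Traffic: doubling every 18 months, expressed as a yearly rate.
    print(f"traffic: {annual_growth_from_doubling(18):.0%} per year")
    # Devices: 30-50% per year, expressed as a doubling time.
    for rate in (0.30, 0.50):
        years = doubling_time_years(rate)
        print(f"devices at {rate:.0%}/year double in {years:.1f} years")
```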

3. Complexity:

· a typical ESnet site has:

  - products from about ten vendors, suppliers, carriers

  - ~ a dozen different configurable equipment types (routers, bridges, hubs, switches ...)

  - ~ half a dozen network management applications (NMS, trouble ticket, probe management ...)

  - ~ 9 different vendor MIBs

  - 5 protocol suites (TCP/IP, DEC, AppleTalk, Netware, ...), typically routing 4 protocols, bridging 3 and tunnelling 2

  - 9 server platforms (VMS, MacOS, AIX, SunOS, WNT ...)

  - ~ 30 networked applications

· this results in:

  - decreased support effectiveness

  - decreased QOS

  - inability to support existing & new applications

  - increased downtime, lost opportunity, users' time wasted & security exposures

4. Reduced Resources:

· budgets increasingly constrained

· few experienced personnel available, hard to retain after training

So we need simple-to-use, well-integrated tools to automate network management and improve the productivity of existing personnel

What Should we Monitor

The ultimate measures of performance are the users' perceptions of the performance of their networked applications (e.g. WWW, email, a distributed RDBMS, a spreadsheet accessing a distributed file system etc.)

This performance is affected by the performance of the complete Distributed System, which includes:

· physical network plant

· communications devices (e.g. routers, switches), computers and peripherals attached to the network plant

· host resource utilization

· software, from device interfaces through operating systems to applications running on computers and devices

To set and meet user expectations for distributed system performance, we must monitor all of the above
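
One way to turn "user perception" into a number is to time a user-visible operation end to end and compare the samples with an agreed target. The following is a minimal sketch under stated assumptions: the helper names, the sample values and the 0.5 second target are hypothetical, and the operation being timed is whatever callable the site chooses (a page fetch, a file open on a distributed file system, a database query).

```python
import time
from typing import Callable, Sequence


def time_operation(operation: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one run of a user-visible operation."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def fraction_within_target(response_times: Sequence[float],
                           target_seconds: float) -> float:
    """Fraction of measured response times that met the agreed target."""
    if not response_times:
        raise ValueError("need at least one measurement")
    met = sum(1 for t in response_times if t <= target_seconds)
    return met / len(response_times)


if __name__ == "__main__":
    # Hypothetical samples (seconds) and a hypothetical 0.5 s expectation.
    samples = [0.2, 0.4, 1.5, 0.3]
    print(f"{fraction_within_target(samples, 0.5):.0%} of requests met the target")
```

Because the timing wraps the whole operation, a slow result reflects every layer listed above and does not by itself say which one is at fault; it indicates when the per-layer monitoring needs to be consulted.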

What is the Current State

Companies are finding it difficult to manage network performance ⁴:

· only 24% adequately manage network performance

· only 16% have network performance service level agreements

· 55% indicate they are understaffed for managing network performance

· 56% have a project in the works or plan to improve network performance

· 65% have a project in the works or plan to improve network management

· 95% would like to report on network utilization, but only 55% do

· 91% would like to report on network availability, but only 25% do
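
Utilization and availability reports of the kind mentioned above can be derived from periodic polls of a device. The sketch below is an illustration, not a description of any surveyed site's method. It assumes each poll records a cumulative 32-bit octet counter (such as the MIB-II `ifInOctets` object) and whether the device answered, that the counter wraps at most once between polls, and that the link speed is known; the `Sample` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

COUNTER32_MODULUS = 2 ** 32  # SNMP Counter32 values wrap at 2^32


@dataclass
class Sample:
    timestamp: float  # poll time in seconds
    octets: int       # cumulative octet counter (e.g. ifInOctets)
    reachable: bool   # whether the device answered this poll


def utilization(prev: Sample, curr: Sample, link_bps: float) -> float:
    """Fraction of link capacity used between two polls.

    Assumes the counter wrapped at most once between the two samples.
    """
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta_octets = (curr.octets - prev.octets) % COUNTER32_MODULUS
    return delta_octets * 8 / (elapsed * link_bps)


def availability(samples: Sequence[Sample]) -> float:
    """Fraction of polls that the device answered."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for s in samples if s.reachable) / len(samples)


if __name__ == "__main__":
    # Hypothetical 5-minute polls of a 10 Mbit/s link; the counter wraps once.
    before = Sample(timestamp=0.0, octets=2 ** 32 - 500_000, reachable=True)
    after = Sample(timestamp=300.0, octets=37_000_000, reachable=True)
    print(f"utilization: {utilization(before, after, 10_000_000):.1%}")
```

Poll-based availability only counts the instants that were sampled, so an outage shorter than the polling interval can be missed entirely.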

ESnet Sites: Practices vs. Desires for Monitoring

Largest Increase is in Security and Applications

What is the Current State of Tools

Table 2: Summary of Existing Tools
Tool	Target	Key Issues
SNMP agents & managers	Device management	Node focus; physical layer
Umbrella Management Systems	Integration platform; sys admin tools	Centralized polling; costly; SNMP with RMON applications
LAN analyzers	Troubleshooting	Cost; not proactive