Monitoring Scripts Overview:

The core of the Unix Monitoring System is a set of scripts that run on various computers to monitor Unix system performance, resources, and applications critical to production. The Unix-based controls system is multi-OS and highly distributed; currently, however, the monitoring system runs only on Solaris. The following machines are monitored:

opi00gtw00, opi00gtw01, opi00gtw02, opi00gtw04, opi00gtw05, slcs1, slcs2, netmon-pep, mccora

Once EPICS has been compiled for Linux, we will deploy the monitoring system on px00 and px01. We will also deploy it on mccux02 when that machine is moved from HP-UX to Linux.

There is only one set of scripts and configuration files, common to all machines, with one script and one configuration file for each type of monitored item. Because the Unix-based controls system is highly distributed, with some machines on the public network and some on the private network, there is no single location that all machines can reach. Instead, one copy is installed in the AFS area, which is accessible to all machines on the public network, and another copy in /usr/local/admin/scripts, which is accessible to all machines on the private network. The files are distributed and updated with rdist.
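Keeping the two copies in step could be done with an rdist Distfile like the one below. This is a minimal sketch that assumes the AFS copy is treated as the master and is pushed to a single private-network machine serving /usr/local/admin/scripts; the host name privhost is a hypothetical placeholder.

```
# Hypothetical Distfile: push the public AFS copy of the monitoring
# scripts to the private-network copy. "privhost" is a placeholder.
HOSTS = ( privhost )
FILES = ( /afs/slac.stanford.edu/g/cd/soft/ref/common/sys-admin )

${FILES} -> ${HOSTS}
	install /usr/local/admin/scripts ;
```

It would be run as rdist -f Distfile from a host that can read AFS and reach the private-network machine; rdist only copies files that are out of date, so repeated runs are cheap.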

Network   Location of scripts & config files                       Startup script
Public    /afs/slac.stanford.edu/g/cd/soft/ref/common/sys-admin    /etc/init.d/st.watchdogs
Private   /usr/local/admin/scripts                                 /etc/init.d/st.watchdogs

All of the scripts are written in Bourne shell because it is the standard shell on all Unix systems and therefore the most portable across the different platforms. System scripts are also typically written in Bourne shell.

Monitoring scripts:

The startup script

There is one startup script, /etc/init.d/st.watchdogs, which is common to all machines. At boot time, on entering run level 3, it starts each monitoring script individually. It is run through the following link in /etc/rc3.d:

ls -all /etc/rc3.d/S98st.watchdogs
lrwxrwxrwx   1 root     other         24 Jan 14 10:01 /etc/rc3.d/S98st.watchdogs -> /etc/init.d/st.watchdogs

The script is written so that the monitoring scripts can easily be stopped or restarted when needed by calling it with the argument stop or restart.
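The start/stop/restart behaviour can be sketched as a conventional init.d script. This is a minimal sketch, not the actual st.watchdogs: the monitor script names, the PID-file directory, and the variables WATCHDOG_DIR, MONITORS and PIDDIR are all hypothetical. It sticks to Bourne shell features, so it uses backquotes rather than $(...).

```shell
#!/bin/sh
# Sketch of /etc/init.d/st.watchdogs (assumed structure, not the real script).
# WATCHDOG_DIR, MONITORS and PIDDIR are hypothetical settings; the monitor
# script names are placeholders.

WATCHDOG_DIR=${WATCHDOG_DIR:-/usr/local/admin/scripts}
MONITORS=${MONITORS:-"cpu_watch.sh disk_watch.sh mem_watch.sh"}
PIDDIR=${PIDDIR:-/var/run}

# Start each monitor in the background and record its PID in a file.
start_monitors() {
    for m in $MONITORS; do
        "$WATCHDOG_DIR/$m" >/dev/null 2>&1 &
        echo $! > "$PIDDIR/$m.pid"
    done
}

# Kill each monitor that has a PID file, then remove the PID file.
stop_monitors() {
    for m in $MONITORS; do
        pidfile="$PIDDIR/$m.pid"
        if [ -f "$pidfile" ]; then
            kill `cat "$pidfile"` 2>/dev/null
            rm -f "$pidfile"
        fi
    done
}

# Dispatch on the action name; unknown actions print usage and fail.
watchdogs_ctl() {
    case "$1" in
    start)   start_monitors ;;
    stop)    stop_monitors ;;
    restart) stop_monitors; start_monitors ;;
    *)       echo "Usage: $0 {start|stop|restart}" >&2; return 1 ;;
    esac
}

# At boot the rc3 script runs the S98 link with "start"; with no
# argument the script only defines the functions above.
if [ $# -gt 0 ]; then
    watchdogs_ctl "$1"
    exit $?
fi
```

The run-level link shown above would be created with ln -s /etc/init.d/st.watchdogs /etc/rc3.d/S98st.watchdogs, and /etc/init.d/st.watchdogs restart would pick up an updated monitor. Recording PIDs in files is just one simple way to make stop and restart work; the real script may locate its monitors differently.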

The monitoring scripts

The monitoring scripts are started at system startup by /etc/rc3.d/S98st.watchdogs. Each one handles one type of monitored item:

CPU: returns the percentage of CPU time being used.

Disk: returns the percentage of disk space used on each partition listed in the configuration file.

Ping: opi00gtw00 and opi00gtw04 ping a list of hosts and return the status of each (0-up, 1-down).

Memory: returns the percentage of memory being used, with physical and virtual memory combined.

Process: monitors specific processes and returns their status (0-running, 1-not running). Regular expressions distinguish processes that have the same process name, are owned by the same account, and run on the same machine, by matching additional process information such as the configuration with which the process was started.

Quota: on opi00gtw00, monitors the quota of each user in the quota file on the /u1 partition and returns the percentage used by each user.
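Each of the checks described above could reduce to a small filter over a standard Solaris command. The functions below are a sketch under stated assumptions: their names, the commands they are fed (vmstat, df -k, swap -s, ps -eo user,args, ping, repquota) and the output layouts they parse are assumptions, not the project's actual scripts. Writing them as filters on standard input keeps the parsing testable without a live host. On Solaris the awk programs would need nawk or /usr/xpg4/bin/awk, because /usr/bin/awk is the old awk without -v.

```shell
#!/bin/sh
# Hypothetical check functions, one per monitored item type. Each reads
# command output on stdin (ping_status takes host names instead) and
# prints the value a monitor would report.

# CPU: input from `vmstat 5 2`; the last field of the last line is idle %.
cpu_used_pct() {
    awk '{ idle = $NF } END { print 100 - idle }'
}

# Disk: input from `df -k`; arguments are the mount points from the
# configuration file. Prints "<mount> <percent used>" for each match.
disk_used_pct() {
    awk -v mounts="$*" '
        BEGIN { n = split(mounts, m, " "); for (i = 1; i <= n; i++) want[m[i]] = 1 }
        NR > 1 && ($NF in want) { cap = $5; sub(/%/, "", cap); print $NF, cap }'
}

# Memory: input from `swap -s`, e.g.
#   total: 100k bytes allocated + 50k reserved = 250k used, 750k available
# Solaris virtual swap includes part of physical memory, so
# used / (used + available) is one combined physical-plus-virtual figure.
mem_used_pct() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($(i + 1) == "used,") used = $i + 0
            if ($(i + 1) == "available") avail = $i + 0
        }
    }
    END { if (used + avail > 0) printf "%d\n", used * 100 / (used + avail) }'
}

# Process: input from `ps -eo user,args`. $1 is the owning account, $2 an
# extended regular expression matched against the full command line (for
# example the configuration the process was started with). Prints 0 if a
# matching process is running, 1 if not.
proc_status() {
    awk -v owner="$1" -v re="$2" '
        NR > 1 && $1 == owner {
            cmd = $0
            sub(/^[ \t]*[^ \t]+[ \t]+/, "", cmd)   # drop the user column
            if (cmd ~ re) found = 1
        }
        END { print (found ? 0 : 1) }'
}

# Ping: Solaris syntax `ping host timeout` exits 0 when the host answers.
# Prints "<host> 0" (up) or "<host> 1" (down) for each host argument.
ping_status() {
    for h in "$@"; do
        if ping "$h" 5 >/dev/null 2>&1; then
            echo "$h 0"
        else
            echo "$h 1"
        fi
    done
}

# Quota: input from `repquota /u1`, assuming the BSD/Solaris layout
#   user  --  used  soft  hard  ...
# Prints "<user> <percent of soft limit used>"; users without a soft
# limit are skipped.
quota_used_pct() {
    awk '$2 ~ /^[-+][-+]$/ && $4 > 0 { printf "%s %d\n", $1, $3 * 100 / $4 }'
}
```

A CPU monitor might then run vmstat 5 2 | cpu_used_pct, and a process monitor might run ps -eo user,args | proc_status with an account and a pattern naming the process's configuration file. Because awk's -v option processes backslash escapes, a bracket expression such as [.] is safer than \. inside those patterns.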

All of the scripts are similar:

Environment setup script

Each script calls watchdog_setup.sh to set up its environment.
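The contents of watchdog_setup.sh are not described here. A minimal sketch, assuming its main job is to locate the shared scripts, could choose between the AFS copy and the private copy listed in the script-location table. The function name watchdog_home and the variables AFS_DIR, PRIVATE_DIR, WATCHDOG_DIR and HOST are hypothetical.

```shell
#!/bin/sh
# Sketch of watchdog_setup.sh (assumed contents). A monitor would source
# it with ". watchdog_setup.sh" so the variables land in its own shell.

AFS_DIR=/afs/slac.stanford.edu/g/cd/soft/ref/common/sys-admin
PRIVATE_DIR=/usr/local/admin/scripts

# Public-network hosts see the AFS copy; private-network hosts fall back
# to the copy under /usr/local/admin/scripts.
watchdog_home() {
    if [ -d "$AFS_DIR" ]; then
        echo "$AFS_DIR"
    else
        echo "$PRIVATE_DIR"
    fi
}

WATCHDOG_DIR=`watchdog_home`   # where scripts and config files live
HOST=`uname -n`                # node name, used to pick per-host config entries
export WATCHDOG_DIR HOST
```

Sourcing rather than executing the file matters here: a child process could not change the calling monitor's variables.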

The scripts are implemented generically, so adding a new type of monitored item or a new machine to the monitoring system requires little effort.
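One way to get that genericity, assuming each configuration file lists the items to check per host, is a single lookup helper shared by every monitor. The file layout and the function name items_for_host are hypothetical.

```shell
#!/bin/sh
# Hypothetical per-type configuration file (one per monitored item type),
# for example disk.conf:
#
#   # host        items to check
#   opi00gtw00    /  /u1
#   slcs1         /  /var
#
# Adding a machine means adding one line; adding a type means adding one
# script and one configuration file.

# Print the items configured for host $2 in configuration file $1,
# one per line; print nothing if the host is not listed.
items_for_host() {
    awk -v host="$2" '
        /^[ \t]*#/ || NF == 0 { next }
        $1 == host { for (i = 2; i <= NF; i++) print $i }' "$1"
}
```

A disk monitor could then loop over the output of items_for_host for its own node name (from uname -n) and report each partition, so supporting a new machine touches only the configuration file, not the script.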