A Plan for System Reliability/Availability Improvement

                                                           K. Brobeck, R. Hall, J. Rock and J. Zhou

                                                                    (Updated on 09/03/09)

As LCLS moves into operations mode, availability and reliability are crucial to building its reputation and expanding its user pool. According to upper management, the uptime target for LCLS is 95%, which means 99% for the controls system. Below is a list of vulnerabilities identified by the system team, with a brief mitigation plan for each.
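
To make these targets concrete, the sketch below converts an availability figure into a downtime budget. It assumes continuous operation over a full 8760-hour year, which is a simplification, since the actual scheduled operating hours are not stated here. The series model in series_availability (all subsystems must be up at once, with independent failures) is also an assumption; this plan does not say how the 99% controls figure was derived. Both function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes year-round operation, ignores leap years


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Return the downtime budget, in hours, implied by an availability target."""
    return (1.0 - availability) * period_hours


def series_availability(availabilities):
    """Combined availability when every subsystem must be up at the same time.

    Assumes failures are independent, so the individual availabilities multiply.
    """
    total = 1.0
    for a in availabilities:
        total *= a
    return total


if __name__ == "__main__":
    # Downtime budgets for the two targets quoted above.
    print(f"LCLS at 95%:     {allowed_downtime_hours(0.95):.1f} hours/year")
    print(f"Controls at 99%: {allowed_downtime_hours(0.99):.1f} hours/year")
    # Under the series assumption, what everything else must achieve
    # if controls delivers 99% and the overall target is 95%.
    print(f"Remaining systems must reach about {0.95 / 0.99:.4f}")
```

Under these assumptions, a 99% target leaves the controls system roughly 88 hours of downtime per year out of the roughly 438 hours allowed for LCLS as a whole.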

Archiver

  1. Archiver Data Server:
    1. The Archive Data Server executable was built with a dependency on EPICS libraries being available from AFS.
      • Plan: Build the Archive Data Server using references to local disk libraries.
      • Estimated Cost: Perhaps 3 days of effort.
      • Severity Level: Medium (an AFS outage is a very rare event, but the availability of the Archive Viewer for viewing archiver data is highly important).
  2. Archiver System Management:
    1. The daily restart of the archive engines, the copying of archive data from lcls-archeng local disk space to NFS space, and the updating of archiver indexes should be reliably automated. If this manual procedure is not done daily while the primary archiver administrator is out of the office for a long period, recovery from unexpected problems is at risk.
      • Plan: Several significant problems must be overcome before this manual activity can be reliably automated. The main problem is reliably restarting the archive engines so that all PVs available for connection do connect successfully.
      • Estimated Cost: Perhaps 3 weeks of effort.
      • Severity Level: Small. The manual procedure is very effective. However, automating it is still desirable: it would save the 5-minute daily effort, and the backup archiver administrator would not need to learn the procedure when the primary administrator is out of the office for a long period.
  3. Archiver Data Storage:
    1. The archiver system is vulnerable to the unavailability of the NFS storage area.
      • Plan: This problem can be addressed in one of two ways: (a) make the NFS storage area more reliable, or (b) keep a secondary top-level LCLS archiver index always available so that, in case of an NFS problem, it still gives access to the archiver data currently held in the lcls-archeng local disk area (currently at least 1 month of the most recent archiver data).
      • Estimated Cost: Perhaps 2 weeks of effort to develop a scheme to make sure that a secondary top-level LCLS archiver index for access to only local data is always available.
      • Severity Level: Medium (an NFS outage is a very rare event, but the ability to access LCLS archiver data is highly important).
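
The main automation problem named under Archiver System Management, restarting an archive engine and confirming that all available PVs connect, could be approached with a restart-and-verify loop. The following is only a minimal sketch under stated assumptions: restart and count_disconnected are hypothetical callables standing in for whatever site-specific commands actually restart an engine and report PVs that are available but not connected. Neither is an existing archiver API, and the retry limits are placeholders.

```python
import time


def restart_until_connected(restart, count_disconnected,
                            max_attempts=3, settle_seconds=60.0,
                            sleep=time.sleep):
    """Restart an engine and retry until no available PV is left disconnected.

    restart            -- hypothetical callable that restarts one archive engine
    count_disconnected -- hypothetical callable returning the number of PVs that
                          are available for connection but not yet connected
    Returns (success, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        restart()
        # Give the engine time to connect before checking (placeholder value).
        sleep(settle_seconds)
        if count_disconnected() == 0:
            return True, attempt
    # Attempts exhausted: report failure so that an administrator can be notified.
    return False, max_attempts
```

A failure result is the case the backup administrator would still need to handle by hand, so a real script would want to send a notification at that point rather than fail silently.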

Operations and Physics E-log Systems

  1. E-log NFS dependence:
  2. Operations E-log Oracle dependence:
 

Cmlog System

Proxy Server

Alarms

IRMIS

ORACLE

iocConsole

MCC Computer Room Power

MCC Computer Room Power Conditioner

MCC Computer Room A/C unit

VMS (mcc)

NFS server

Sunray Workstation OPIs

Web servers