SLAC ESD Software Engineering Group

UNIX SYSTEM ADMIN

How to Restore UNIX Service

Introduction

This document describes how to restore UNIX service after either a power outage or a system failure.

In general, if there is a scheduled power outage, all UNIX production systems must be brought down gracefully and in the proper sequence. To restore service after a power outage, whether scheduled or unexpected, the systems must likewise be brought up in the proper sequence. When systems need to be rebooted, reboot them one at a time. Always make sure that mccfs2 and mcc are up and running, and always make sure that a console is in place.

In the case of a system failure, a reboot is always the first thing to try. If the reboot doesn't clear the problem, try booting the system to single-user mode to see if the problem can be fixed there (e.g., repairing a corrupted system configuration file). If the failure turns out to be in hardware, a restore onto a backup system becomes necessary.

For application-related problems, please check with Ernest Williams for EPICS applications and Patrick Krejcik for high-level applications. For any network problem, please contact Terri Lahey. Listed below are the procedures for system shutdown and restore, whether from a power outage or a system failure.

System Shutdown and Restore from Power Outage

NFS server: mccfs2

The NFS server provides application filesystems and data buffers for the production systems. It is hosted on a SunFire V240 system running Solaris 9. It is on the PEPII and LAVC networks and is a standalone machine. To gain console access, log in to opi00gtw04, telnet to mccfs0-mgt, and type console -f at the SC prompt.
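
For reference, a typical console session might look like the following (a sketch; the exact prompts and banner text depend on the ALOM firmware):

    ssh opi00gtw04            # log in to the gateway that reaches the MGT network
    telnet mccfs0-mgt         # connect to the ALOM management port
    console -f                # at the SC> prompt, force-attach to the system console

Typing #. at the console returns you to the SC prompt (the default ALOM escape sequence).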

To shutdown mccfs0

    1. get a console
    2. login mccfs0
    3. su to root
    4. /usr/sbin/shutdown -i0 -y -g0; wait until "OK" prompt is seen
    5. power off disk array, backup drive and then the unit; power off UPS connected to mccfs0 by pressing "O" button
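
For illustration, the console interaction for steps 2-4 might look like this (a sketch; prompts and messages will vary slightly):

    mccfs0 console login: <your account>
    $ su -
    Password:
    # /usr/sbin/shutdown -i0 -y -g0    # take the system to run level 0
    ...
    ok                                 # PROM prompt; now safe to power off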
     

Note: in the event of a power failure, mccfs0 will be shut down automatically by PowerChute plus (the UPS software).

To bring mccfs0 up from an UNEXPECTED power outage

    1. make sure power to the rack is turned on (check with Ken or Ed)
    2. make sure mcc is up and running (check with Ken or Ed)
    3. get a console to mccfs0
    4. power on UPS connected to mccfs0, by pressing "Test" button (the unit should be powered on automatically), wait until login prompt is seen

To bring mccfs0 up from a SCHEDULED power outage

    1. make sure power to the rack is turned on (check with Ken or Ed)
    2. make sure mcc is up and running (check with Ken or Ed)
    3. get a console to mccfs0
    4. power on UPS connected to mccfs0, by pressing "Test" button, power on disk array, tape drive and the unit, wait until login prompt is seen

To reboot mccfs0

    1. get a console
    2. login
    3. su to root
    4. /usr/sbin/shutdown -i6 -y -g0; wait until login prompt is seen
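
Once mccfs0 is back up, a quick sanity check that it is serving NFS again can be run from the server itself or from any client (a sketch; the actual export list and mounts will differ):

    /usr/sbin/showmount -e mccfs0    # list the filesystems mccfs0 is currently exporting
    df -k                            # on a client, confirm the mccfs0 filesystems are mounted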


LCLS servers on CA network

The PEPII gateways are hosted on SunFire V240 systems running Solaris 9. They are on the PEPII and LAVC networks, they are standalone machines, and they are served by mccfs0. To gain console access, log in to opi00gtw04, telnet to opi00gtw00-mgt (or opi00gtw01-mgt), and type console -f at the SC prompt.

To shutdown PEPII gateways (opi00gtw01, opi00gtw00)

    1. login opi00gtw01
    2. su to root
    3. /usr/sbin/shutdown -i0 -y -g0; wait until "OK" prompt is seen
    4. power off the unit; power off UPS connected to opi00gtw01 by pressing "O" button
    5. login opi00gtw00 and repeat 2-4 for opi00gtw00.

Note: in the event of a power failure, opi00gtw00 and opi00gtw01 will be shut down automatically by PowerChute plus (the UPS software).

To bring PEPII gateways up from an UNEXPECTED power outage

    1. make sure power to the rack is turned on (check with Ken or Ed)
    2. make sure mccfs0 and mcc are up and running (check with Ken or Ed)
    3. get a console to opi00gtw00
    4. power on UPS connected to opi00gtw00, by pressing "Test" button (the unit should be powered on automatically), wait until login prompt is seen
    5. repeat 1-4 for opi00gtw01.

To bring PEPII gateways up from a SCHEDULED power outage

    1. make sure power to the rack is turned on (check with Ken or Ed)
    2. make sure mccfs0 and mcc are up and running (check with Ken or Ed)
    3. get a console to opi00gtw00
    4. power on UPS connected to opi00gtw00, by pressing "Test" button, power on the unit, wait until login prompt is seen
    5. repeat 1-4 for opi00gtw01.

To reboot PEPII gateways

If there is a need to reboot the PEPII gateways, do one at a time (first opi00gtw00, then opi00gtw01). Always make sure that mccfs0 and mcc are up and running.

    1. get a console
    2. login
    3. su to root
    4. /usr/sbin/shutdown -i6 -y -g0; wait until login prompt is seen
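
After a gateway comes back, it is worth confirming that its NFS mounts from mccfs0 are in place before rebooting the next one (a sketch; the exact mount points will vary):

    df -k | grep mccfs0    # the application filesystems served by mccfs0 should be listed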
 

PEPII SUN Clients

All PEPII SUN clients, functioning as OPIs along the PEPII rings, are hosted on SUN Ultra 5/10 systems running Solaris 9. They are on the PEPII network and thus not visible to the public, and they are served by mccfs0.

Nodenames:

To shutdown the systems

  1. login as root
  2. /usr/sbin/shutdown -i0 -y -g0; wait until "OK" prompt is seen
  3. power off unit and monitor

To bring the systems up

  1. power on the system (monitor and unit)
  2. wait until login prompt is seen
 

To reboot the systems

  1. login as root
  2. /usr/sbin/shutdown -i6 -y -g0; wait until login prompt is seen

Note: you may need to get the PEP Master key from the operator in MCC in order to gain access to a building where a workstation is located.

GPIB server: mccux01

The GPIB server is hosted on mccux01, a SunFire V65x system running Linux. It is on the PEPII and LAVC networks. Although tailored, this machine has its own /usr/local, so the applications have no dependence on SCS AFS. The console is located in rack 513. Press Ctrl twice to make a selection.

To shutdown mccux01

  1. login mccux01
  2. su to root
  3. shutdown -h now
  4. power off the unit
 

To bring up mccux01

  1. make sure power to the rack is turned on (check with Ken or Ed)
  2. make sure mcc and mccdev is up and running (check with Ken or Ed)
  3. power on mccux01
  4. wait until system is up (boot normally takes about 2 minutes)
  5. login mccux01 and verify that the GPIB server programs (gvs_service and scope) are running:

    ps -ea | grep gvs_service

    ps -ea | grep scope

If mccux01 is booted up before mcc and mccdev, it has to be rebooted.

To reboot mccux01

  1. login mccux01
  2. su to root
  3. shutdown -r now
  4. wait until system is up (reboot normally takes about 2 minutes)
  5. login mccux01 and verify that the GPIB server programs (gvs_service and scope) are running:

    ps -ea | grep gvs_service

    ps -ea | grep scope

To reset GPIB services

GPIB devices are very particular about command syntax and sequencing, and a device may become confused and hang. Other times, a network failure or power outage may cause a device to hang. If this happens (users will call to report it), resetting the GPIB services should clear the problem.

  1. login mccux01 using gpib (ssh mccux01 -l gpib)
  2. at the prompt, type "reset_gvs". A message "reset successful" should be seen if everything goes OK. Try again in case of failure. Reboot the system (shutdown -r now) if the reset fails a few times.
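
A typical reset session might look like the following (a sketch; the exact message text may differ):

    ssh mccux01 -l gpib    # log in with the gpib account
    reset_gvs              # reset the GPIB services
    reset successful       # expected response when the reset works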
 

Proxy servers: px00 and px01

The proxy servers are hosted on PowerEdge 1750 systems running RHEL3. px00 is for production while px01 is for development and test. px00 is a standalone machine on the PEPII network, thus not visible to the public; px01 is SCS tailored and on the LAVC network. The Forward Server running on px00 is a SLC control system proxy between the VMS hosts and the iRMX micros. The VMS hosts and the control system Ethernet micros request connections to the proxy server using port 6060. The task that manages this port is called "fwd_server"; it provides the message forwarding service and is required by the control system. The console is located in rack 513. Press Ctrl twice to make a selection.

To shutdown px00

  1. login px00 as root
  2. shutdown -h now
  3. power off the unit

To bring up px00

  1. make sure power to the rack is turned on (check with Ken or Ed)
  2. power on px00
  3. wait until system is up

To reboot px00

  1. login px00 as root
  2. shutdown -r now
  3. wait until system is up

px01 works the same way as px00.
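
After bringing px00 up or rebooting it, a quick way to confirm that the forwarding service is back is to check for the fwd_server task and its port (a sketch using standard RHEL3 tools; the exact output will vary):

    ps -e | grep fwd_server    # the message forwarding task should be running
    netstat -an | grep 6060    # port 6060 should be in LISTEN state for host/micro connections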

SCS Tailored Production Machines

The SCS tailored machines are administered jointly by us and SCS, and their operation partially depends on SCS services. We have been making an effort to minimize these dependencies by using local services whenever possible. In most cases, with sudo privileges, you can quickly fix problems local to the system or get a local admin job done. Other times, unix-admin@slac.stanford.edu has to be consulted to see if SCS services are the problem; Len Moss (x3370) is the primary contact person. A reboot often clears problems. Be sure to check with Terri whether the network to SCS is working. Below is a list of tailored machines that are used for PEPII and NLCTA production.

mccelog is a Netra T1 system running Solaris 9; its console is located in rack 513 (press Ctrl twice to make a selection). The rest are SunFire V240 systems running Solaris 9. To gain console access (e.g., to slcs2), log in to opi00gtw05 (note: opi00gtw05 should be the last machine to shut down; with this machine available, we can gain console access to the other machines, whose MGT ports are on the LEB network), telnet to slcs2-mgt, and type console -f at the SC prompt.
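
For example, a console session to slcs2 might look like this (a sketch; slcs2 is used only because it is the example named above):

    ssh opi00gtw05            # gateway with access to the LEB MGT network
    telnet slcs2-mgt          # connect to the ALOM management port
    console -f                # at the SC> prompt, force-attach to the system console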

To shutdown the system

  1. login
  2. sudo shutdown -i0 -y -g0; wait until "OK" prompt is seen
  3. at the "OK" prompt, power off the system (unit and monitor)

To bring up the system

  1. power on the system
  2. wait until login prompt is seen

To reboot

  1. login
  2. sudo shutdown -i6 -y -g0
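
If any of the sudo commands above are refused, check what the local sudo configuration allows before escalating to unix-admin (a sketch):

    sudo -l    # list the commands this account may run with sudo on this machine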

Restore from System Failure

System unit failure

In the case of a system unit failure (such as a failed CPU or motherboard), a replacement with a backup machine is needed. All backup machines are located in Rm 210, Bldg 5: there is one for the SunFire V240 systems, one for the PowerEdge systems, one for the Netra T1, and one for the Ultra 5/10 systems. To replace a failed machine with a backup, follow the steps below:

  1. take out the system disk from the failed machine
  2. install the system disk to the backup machine
  3. make sure all the cables are connected properly to the backup machine
  4. get a console
  5. power on the backup machine
 

System disk failure

If the system failure is due to a corrupted system disk, the disk can be easily replaced with a mirrored one. The production UNIX system disks are all mirrored.

The system disk is mirrored with Solaris LiveUpgrade. To replace the corrupted disk:

  1. make sure the system is powered off
  2. remove the corrupted disk, install the mirrored one
  3. get a console
  4. power on the system

Alternatively, you can boot up from the mirrored system disk by:

  1. get a console
  2. luactivate -s secondary
  3. shutdown -i6 -y -g0 (must use shutdown)

or

  1. get a console
  2. enter the PROM monitor (OK prompt)
  3. change the boot device to the mirrored one by typing setenv boot-device disk:b
  4. type boot
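
Either way, before switching boot environments or boot devices it can help to confirm what is currently active (a sketch, assuming the standard Live Upgrade tools are installed and the alternate boot environment is named "secondary" as above):

    lustatus                 # in the running OS: list boot environments and show which is active
    printenv boot-device     # at the OK prompt: show the current boot device setting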

The system disk is mirrored with Solaris LiveUpgrade. The procedure to replace the corrupted system disk or to boot from the mirrored disk is identical to that for the SunFire V240 systems.

A system disk is mirrored with the Linux dd facility. To replace the corrupted disk:

  1. make sure the system is powered off
  2. remove the corrupted disk (/dev/sda), install the mirrored one (/dev/sdb)
  3. get a console
  4. power on the system
 

Alternatively, you can boot up from the mirrored system disk by:

  1. get a console
  2. power on the system
  3. get to the BIOS setup by pressing F2 early in the boot process (e.g., right after power reset)
  4. select Boot Sequence and follow the instructions.

The system disk is mirrored with the Linux dd facility. The procedure to replace the corrupted system disk or to boot from the mirrored one is identical to that for the PowerEdge systems.
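
For reference, a dd-based mirror is normally refreshed with a straight block copy from the primary disk onto the mirror; a minimal sketch, assuming /dev/sda is the primary and /dev/sdb the mirror as above, and that the copy is made while the system is quiesced (e.g., in single-user mode):

    dd if=/dev/sda of=/dev/sdb bs=1024k    # copy the entire primary disk onto the mirror
    sync                                   # flush buffers before trusting the copy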

System Type:

PowerEdge based Linux system with disk array

PowerEdge based Linux system

SunFire based Linux system

SunFire V240 based Solaris system

Reboot sequence

-----------------

mccfs2, lcls-archeng, lcls-daemon*, lcls-srv*, lcls-builder, lcls-opi*

Shutdown sequence

-------------------

Bring up sequence

-----------------

Summary

The functions of each production UNIX system are detailed at http://www.slac.stanford.edu/grp/cd/soft/share/slaconly/network/opi/index.html. A diagram of the UNIX network is at http://www.slac.stanford.edu/grp/cd/soft/unix/slaconly/UNIX.ppt.

Any UNIX system failure should be reported to Jingchen Zhou immediately (pager 849-9598). Consult Ken Brobeck (x2558) if Jingchen is not available; if both Jingchen and Ken are unavailable, seek help from unix-admin@slac in SCS. Len Moss (x3370) is the primary contact person in SCS. Steffen Luitz (x2822), the unix master for BaBar, who is very familiar with standalone systems, may also be a good resource.


Author: Jingchen Zhou (x4661, jingchen@slac). Last edited on July 8, 2006