In general, if there is a scheduled power outage, all UNIX/Linux/VMS production systems must be brought down gracefully and in the proper sequence. To restore service after a power outage, whether scheduled or unexpected, the systems must likewise be brought up in the proper sequence. When the systems need to be rebooted, reboot them one at a time. Always make sure that mcc and mccfs2 are up and running, and that a console is in place. In the case of a system failure, a reboot is usually the first thing to try. If the reboot doesn't clear the problem, try booting the system into single-user mode to see whether the problem can be fixed (e.g. by repairing a corrupted system file). If the failure is found to be in hardware, a restore from a backup system unit becomes necessary. For application-related problems, please check with Ernest or Jingchen for EPICS, and with Debbie or Partick for HLA. For any network problem, please contact Ken or Charlie.
This document describes how to restore power in MCC and VMS computing services after a power outage or a system failure.
Listed below are the procedures for shutting down the systems and for restoring them after either a power outage or a system failure.
Background on MCC Power:
Building 7 has two power feeds (T1 & T2). T1 has a breaker that provides power to the DP6A1 Main MCC breaker and a breaker that provides power to 4C. The MCC computer room PDU and the MCC computer room air conditioning unit get their power from 4C. We are installing a Cut-Over switch so that the MCC PDU can switch its power source to 4A, which gets its power from the T2 line.
During the March 2011 downtime we will shut down the DP6A1 breaker only, so the MCC PDU and AC unit will still have power via 4C. Once the Cut-Over switch is installed, we will shut down the MCC computer room, switch the power to 4A, and power up the MCC PDU.
The primary power source for the MCC computer room and AC unit will be 4C.
System Recovery from Power Outage
When there is a power outage, the VMS servers should be powered off so that, when the power is restored, we control when power is applied to the servers. Sometimes the power will go on and off several times before it becomes stable.
Confirm with the EOIC desk that the power has been fully restored and is stable. (The EOIC is in contact with the SLAC Electrical Team and will be notified when the power is stable.) We usually wait 10-15 minutes to make sure we don't lose power again. If the breakers were turned off and power is stable, ask the EOIC desk to contact the electricians to power on the master breaker to the computer room (by the vending machines) and then the main breaker inside the power conditioner in our computer room. The tool for opening the power conditioner is kept above it; a system person can power off the power conditioner while the master breaker is off. There is a picture on top of the power conditioner that shows the breaker.
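The waiting rule above can be sketched in Python as a simple timer check. This is only an illustration and does not replace confirmation from the EOIC desk. The function name, the 15-minute default (the upper end of the 10-15 minute range), and the example timestamps are assumptions, not part of the procedure.

```python
from datetime import datetime, timedelta

# Assumption: use 15 minutes, the upper end of the 10-15 minute wait.
STABLE_WAIT = timedelta(minutes=15)


def power_is_stable(last_interruption: datetime, now: datetime,
                    wait: timedelta = STABLE_WAIT) -> bool:
    """Return True once `wait` has elapsed since the last observed interruption."""
    return now - last_interruption >= wait


# Illustrative timestamps only.
last_flicker = datetime(2011, 3, 1, 8, 0)
print(power_is_stable(last_flicker, datetime(2011, 3, 1, 8, 10)))  # False
print(power_is_stable(last_flicker, datetime(2011, 3, 1, 8, 15)))  # True
```

If the power drops again during the wait, the timer restarts from the new interruption time.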
Only specially trained personnel can power on the master breaker to the computer room and the main breaker inside the power conditioner.
Verify that the A/C unit is functioning (blowing cold air) before powering on any servers. The MCC computer room must be kept at around 65°. If the A/C unit is not blowing cold air, have the EOIC contact the HVAC personnel immediately. Do not power on any equipment until the A/C is functioning properly.
There are two WS-C6509-NEB-A network routers in the MCC computer room: RTR-MCCCORE1 (the primary) and RTR-MCCCORE2 (the backup for the primary). Each router has its own UPS, and both are located in LCLS rack B005-614.
- Make sure each UPS is on.
- Verify each router is up and running by checking:
- Power Supply 1 and Power Supply 2 on each router are on: INPUT OK and FAN OK.
- green lights next to the network cable connectors.
- the SWH-MCC0-NW01 and SWH-MCC0-NW04 switches in MCC, which are for non-critical nodes.
- SWH-MCC0-NW02 in the network closet
Contact net-admin (Antonio x2895, Jared x3545, Yee x2945) if there is any issue with the routers or switches.
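As a supplement to the visual checks above, a reachability check from another host could be sketched as follows. The device names come from the text; everything else is an assumption: that the names resolve in DNS, that the devices accept TCP connections on port 22, and the helper names themselves. The demonstration uses a stand-in probe so it runs without network access.

```python
import socket
from typing import Callable, Dict, Iterable, List

# Device names from the text; DNS resolution of these names is an assumption.
NETWORK_DEVICES = [
    "RTR-MCCCORE1", "RTR-MCCCORE2",
    "SWH-MCC0-NW01", "SWH-MCC0-NW02", "SWH-MCC0-NW04",
]


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection succeeds (port 22 is an assumption)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_devices(hosts: Iterable[str],
                  probe: Callable[[str], bool] = tcp_reachable) -> Dict[str, bool]:
    """Probe every host and record whether it answered."""
    return {host: probe(host) for host in hosts}


def unreachable(results: Dict[str, bool]) -> List[str]:
    """Return the sorted names of hosts that did not answer."""
    return sorted(host for host, ok in results.items() if not ok)


# Stand-in probe: pretend only RTR-MCCCORE2 fails to answer.
demo = check_devices(NETWORK_DEVICES, probe=lambda host: host != "RTR-MCCCORE2")
print(unreachable(demo))  # ['RTR-MCCCORE2']
```

Calling check_devices(NETWORK_DEVICES) without the probe argument would use the real TCP check. Any device it reports as unreachable is a reason to contact net-admin, not to troubleshoot the router directly.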
There are two RAID ARRAY 450 units, each of which has two HSJ controllers: HSJ08/HSJ09 and HSJ011/HSJ012. Each unit has a VT100 monitor as a console: one for HSJ08/HSJ09 and one for HSJ011/HSJ012. The two power switches on each unit should be kept on, so that the units turn on automatically when power to the computer room is restored. The HSJ controllers have a battery cache that must be charged before you can boot the VMS machines. If the power was off for more than about 2 hours, the battery will take quite a long time to recharge (possibly up to 45 minutes). Check the HSJ monitors for the message that says the battery cache is fully charged.
Check the VT100 for status. On each HSJ console (e.g. HSJ011/HSJ012), at the prompt HSJ011>, type: show this, and then: show other. The battery status will be shown as: Cache is GOOD, Battery is GOOD. Each console controls 2 controllers.
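The readiness test described above amounts to seeing both "Cache is GOOD" and "Battery is GOOD" in the output of both commands. A minimal Python sketch of that test follows. It assumes the console text has been copied by hand into strings; the function names are hypothetical, and the "not ready" sample is a made-up placeholder, not real HSJ output.

```python
def status_ok(console_text: str) -> bool:
    """True only if both status phrases quoted in the procedure are present."""
    text = console_text.upper()
    return "CACHE IS GOOD" in text and "BATTERY IS GOOD" in text


def unit_ready(show_this: str, show_other: str) -> bool:
    """A RAID unit is ready only when both of its controllers report GOOD."""
    return status_ok(show_this) and status_ok(show_other)


good = "Cache is GOOD, Battery is GOOD"
not_ready = "Cache is GOOD, Battery is LOW"  # hypothetical placeholder text

print(unit_ready(good, good))       # True
print(unit_ready(good, not_ready))  # False
```

Both RAID units would need to pass this test before MCC is booted.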
There are a number of VMS Alpha machines (MCC, MCCDEV, MCCA1 and MCCA2). MCC is for production and MCCDEV is for development, while MCCA1 and MCCA2 are for Ken to test system updates. It is critical to keep MCC and MCCDEV up; the other VMS machines can stay off. Once the HSJ controllers are back up, MCC can be booted.
From the MCC console (at prompt >>>) type: b.
Wait for MCC to come up.
Check with the EOIC whether the Control System is fully up; if not, ask the EOIC to bring it up by typing “warmslcx *” from the slcshr account.
Once MCC is mounting the disks, you can boot MCCDEV and any other VMS system.
MCCA1 (Alpha) -- can stay off
MCCA2 (Alpha) -- can stay off
Note: all VMS machines boot from the HSJ controllers and share the same data storage.
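The boot order described above (HSJ controllers, then MCC, then MCCDEV) can be written down as a small Python checklist helper. This is a sketch only: the names and the dictionary-based state are assumptions, and "up" for MCC here means it has at least reached the point of mounting its disks. MCCA1 and MCCA2 are left out because they can stay off.

```python
from typing import Dict, List, Optional

# Order taken from the text; MCCA1 and MCCA2 are omitted (they can stay off).
BOOT_ORDER: List[str] = ["HSJ controllers", "MCC", "MCCDEV"]


def next_to_boot(up: Dict[str, bool]) -> Optional[str]:
    """Return the first item in BOOT_ORDER that is not yet up, or None if all are."""
    for name in BOOT_ORDER:
        if not up.get(name, False):
            return name
    return None


print(next_to_boot({}))                                      # HSJ controllers
print(next_to_boot({"HSJ controllers": True}))               # MCC
print(next_to_boot({"HSJ controllers": True, "MCC": True}))  # MCCDEV
```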
Check the UPS Startup list (the list is also posted on the side of the LCLS rack).
Follow the "power up order".
Make sure each server plugged into a UPS is up before powering up the next UPS.
mccfs2, the KVM, and the switch come up first.
- Use the KVM interface to watch each server boot up
VMS Expected Power outage
If there is an expected power outage, the systems should be taken down gracefully. You can shut down the VMS servers in any order, as long as MCC is shut down last. To see which VMS systems are up, type on MCC: show cluster. Log in to each server with a privileged account, type @sys$system:shutdown, and follow the prompts until the shutdown begins. Power off the servers once they are shut down to protect them from an accidental power surge. When power is restored, follow "System Recovery from Power Outage" (see above).
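The "MCC last" rule can be sketched in Python as a small ordering helper. It assumes the member names reported by show cluster have been typed into a list by hand; the function name is hypothetical, and the sketch does not parse the command's output.

```python
from typing import Iterable, List


def shutdown_order(members: Iterable[str]) -> List[str]:
    """Return the members with MCC moved to the end; other order is preserved."""
    names = [m.strip().upper() for m in members]
    others = [n for n in names if n != "MCC"]
    return others + (["MCC"] if "MCC" in names else [])


print(shutdown_order(["MCC", "MCCDEV", "MCCA1"]))  # ['MCCDEV', 'MCCA1', 'MCC']
```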
VMS Unexpected Power outage
If power is off in the MCC computer room, power off the VMS servers to protect them when the power comes back on. When power is restored, follow "System Recovery from Power Outage" (see above).
To reboot a VMS server, log in to it with a privileged account, type @sys$system:shutdown, and follow the prompts. One prompt will ask whether you want to reboot. Remember that the MCC Control System comes up only to minsys; the slcshr account must type “warmslcx *” to bring the Control System up fully.
VMS Hardware Failure
All VMS hardware failures should be logged with Maintech. Karl Behnke is our POC for Maintech (x3830 or 510 739-6951). Most of Maintech's service has been on our VMS/VAX systems, which we are getting rid of next downtime. If our VAX system goes down…leave it down. There is no impact if slcsrv, mcca1, and mcca2 go down.
Our HSJ controllers house all our disks. Each disk is configured as RAID 1. This means that we have two identical disks paired together, so if one fails the other takes over. This is completely transparent to the users. We have 2 spares in each HSJ controller, so the controller will automatically replace a bad disk with a spare and start rebuilding the new disk to take its place. We have two HSJ disk arrays with 4 controllers (HSJ008/009 for one disk array and HSJ011/012 for the other). Do not type anything on the HSJ consoles except “show this” and “show other”.
If you suspect a problem with the HSJ controllers, contact Maintech.
Do not try to fix it yourself.
Tape backups are performed daily on the VMS system. The tape units are located on top of MCCDEV. We perform incremental backups daily, level 9 backups once a week, and a full backup once a month.
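The three-tier schedule above can be illustrated with a short Python sketch. The procedure does not say which day the weekly and monthly backups run, so the choices here (full backup on the 1st of the month, level 9 on Sundays) are assumptions made only for the example, as is the function name.

```python
from datetime import date


def backup_type(day: date) -> str:
    """Return which backup would run on `day` under the assumed schedule."""
    # Assumption (not from the procedure): the monthly full backup runs on the 1st.
    if day.day == 1:
        return "full"
    # Assumption (not from the procedure): the weekly level 9 backup runs on Sunday.
    if day.weekday() == 6:
        return "level 9"
    return "incremental"


print(backup_type(date(2011, 3, 1)))  # full
print(backup_type(date(2011, 3, 6)))  # level 9
print(backup_type(date(2011, 3, 7)))  # incremental
```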
VMS Admin Backup Person
Charlie Granieri is our backup person for VMS admin. With his background and our procedures he should be able to handle these issues.