SLAC CPE Software Engineering Group
Stanford Linear Accelerator Center
System Admin

New Archive Engine

Disk Failure Recovery

 

 

Modified: December 13, 2012


Virtual Disk 0 (RAID0): primary system disk in slot 0
Virtual Disk 1 (RAID0): backup system disk in slot 1
Virtual Disk 2 (RAID1): mirrored data disks in slots 2 and 3

 

 

This procedure is for the Dell R610 PERC H700 RAID controller in our archiver appliance. We want to keep the disk configuration the same at all times: the system disk is always in slot 0 [00:00:00], the backup system disk in slot 1 [00:00:01], and the data disks in slots 2 [00:00:02] and 3 [00:00:03]. Deleting a VD in the PERC H700 RAID menu does not destroy the data on the disk, but initializing it DOES destroy the data.

Replace failed system disk

The instructions below remove the bad system disk, move the backup system disk into slot 0, and add the new disk in slot 1, which keeps our disk configuration in place.

  1. Shut down the server
  2. Remove disk from slot 0 and put aside (label as bad)
  3. Move disk from slot 1 up to slot 0         (This is our backup system disk)
  4. Add new disk to slot 1                          (This will be our "new" backup disk)
  5. Power on the system and enter the RAID menu when prompted ( CTRL-R )
    1. If the system says there is a foreign disk
      1. Type "f" to import the disk
      2. Type "c" to enter config menu
      3. Type "y" to "Are you sure?"
    2. Delete this foreign Virtual Disk when the RAID menu comes up
    3. If the disk installed in slot 1 came from another RAID system, it will appear "foreign" to the RAID controller.  Remove that VD and continue with the steps below.
  6. Remove Virtual Disk 1
    1. Highlight Virtual Disk: 1 (sys01)
    2. Press F2
    3. Select Delete VD ( This was our backup system disk, so we are going to make this Virtual Disk 0 )
  7. Create Virtual Disk 0 ( Our Primary system disk )
    1. Highlight PERC H700 Integrated and Press F2
    2. Select Create New VD
    3. Select disk [00:00:00]   -Disk in slot 0
    4. Change VD Name to sys00
    5. Do NOT initialize the disk -(This was our backup system disk; initializing it would destroy its data)
  8. Create Virtual Disk 1  ( Will be our "New" backup system disk )
    1. Highlight PERC H700 Integrated and Press F2
    2. Select Create New VD
    3. Select disk [00:00:01]   -Disk in slot 1
    4. Change VD Name to sys01
      1. If the disk was found as "Foreign" do a full initialization
        1. Highlight Virtual Disk: 1 and press F2
        2. Highlight Initialization and hit enter
        3. Select Start Init.
      2. If the disk was new/empty then do a fast initialization
        1. Highlight Virtual Disk: 1 and press F2
        2. Highlight Initialization and hit enter
        3. Select Fast Init.
  9. Make sure VD 0 is the bootable disk
    1. Go to Ctrl Mgmt ( see menu below ) ( Use CTRL-N to move through RAID Menus)
    2. Verify and/or change "Select bootable VD:" to VD 0
      1. Highlight "Select bootable VD" box
      2. Hit enter to get choices
      3. Highlight VD 0
      4. Save configuration
        1. Highlight "APPLY" and hit enter
  10. Reboot
    1. Hit ESC
      1. When you exit the RAID configuration menu it will tell you to hit CTRL-ALT-DEL to reboot
  11. When the system is back up, use the dd command to back up the system disk to the new backup disk in slot 1
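
The dd backup in the last step could look roughly like the sketch below. It is a dry run only: the hypothetical helper build_dd_cmd prints the command rather than executing it. The device names /dev/sda (VD 0) and /dev/sdb (VD 1) and the block size are assumptions, not taken from this procedure; confirm the real device names (for example with lsblk) before running dd, because writing to the wrong target destroys its contents.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints a dd command, never runs it.
# ASSUMPTION: VD 0 appears as /dev/sda and VD 1 as /dev/sdb on this host.
# ASSUMPTION: bs=1M is only an illustrative block size.
build_dd_cmd() {
    src=$1
    dst=$2
    if [ -z "$src" ] || [ -z "$dst" ]; then
        echo "usage: build_dd_cmd SOURCE TARGET" >&2
        return 2
    fi
    # Refuse to copy a disk onto itself.
    if [ "$src" = "$dst" ]; then
        echo "source and target must differ" >&2
        return 1
    fi
    echo "dd if=$src of=$dst bs=1M"
}

# Print the command for review; run it by hand once the devices are verified.
build_dd_cmd /dev/sda /dev/sdb
```

Printing the command first gives a chance to check the source and target against the slot layout before anything is overwritten.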

 

Replace failed data disk

If one data disk fails, identify it through the OpenManage web interface (https://nodename:1311).
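
As a command-line alternative to the web interface, the sketch below filters physical-disk status text for disks that are not online. It assumes the input is the "Key : Value" text printed by "omreport storage pdisk controller=0", with an ID field followed by a State field for each disk. The function name non_online_pdisks and the sample text are illustrative, not captured from a real system.

```shell
#!/bin/sh
# List physical disks whose State is anything other than "Online".
# ASSUMPTION: each disk is reported as "Key : Value" lines, with ID
# appearing before State. Field names may differ between OMSA versions.
non_online_pdisks() {
    awk '
        {
            # Split on the first colon only, since IDs such as 0:0:3
            # contain colons themselves.
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "ID" { id = val }
        key == "State" && val != "Online" { print id, val }
    '
}

# Illustrative sample input (not real output from this appliance).
non_online_pdisks <<'EOF'
ID                        : 0:0:2
State                     : Online

ID                        : 0:0:3
State                     : Failed
EOF
# On a live system: omreport storage pdisk controller=0 | non_online_pdisks
```

The ID has the form connector:enclosure:slot, so its last number gives the slot to pull.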

Replace the disk with a new one WHILE the system is running. This is very important: never power off the system to replace a failed data disk, as we do when replacing the system disk.

As soon as the new disk is inserted, the controller automatically rebuilds it as the second half of the mirror. Rebuild progress can be monitored with "omreport storage pdisk controller=0" or via https://nodename:1311. The system should remain functional while the rebuild is in progress.
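
To watch a rebuild without reading the full report, the omreport output can be reduced to one line per disk. This is a minimal sketch, assuming each disk is printed as "Key : Value" lines with ID, State and Progress fields in that order. The function name rebuild_progress and the sample text are illustrative only.

```shell
#!/bin/sh
# Print "<id> <state> (<progress>)" for each physical disk read from stdin.
# ASSUMPTION: every disk block has ID, State and Progress lines in that
# order. Adjust the field names if your OMSA version prints them differently.
rebuild_progress() {
    awk '
        {
            # Split on the first colon only; IDs like 0:0:3 contain colons.
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "ID"       { id = val }
        key == "State"    { state = val }
        key == "Progress" { printf "%s %s (%s)\n", id, state, val }
    '
}

# Illustrative sample input (not real output from this appliance).
rebuild_progress <<'EOF'
ID                        : 0:0:2
State                     : Online
Progress                  : Not Applicable

ID                        : 0:0:3
State                     : Rebuilding
Progress                  : 45% complete
EOF
# On a live system: omreport storage pdisk controller=0 | rebuild_progress
```

Re-running the pipeline every few minutes shows whether the new disk is still rebuilding or has returned to the online state.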

If a data disk failure is caused by a power outage, turn the system on once power is restored and only then replace the disk. In short, always replace a data disk while the system is running.

 



Created by Ken Brobeck 29-Nov-2012

Modified by Jingchen Zhou 13-Dec-2012
