Proxy Server Expert Recovery

This document covers the same recovery from proxy server problems as the "Proxy Server Recovery" document, but it is oriented toward experts rather than readers who just need a simple recovery procedure. It was prepared from information provided by Nancy Spencer and Kristi Luchini in email messages and documentation.

This document provides information for recovering the proxy server system from problems that prevent communication between the SLC-aware IOCs and the VMS SLC control system. These problems result in a flood of proxy server related error messages in the error logs. They also result in bad status being displayed for micros both on the SLC-Aware IOC EDM display, which may be accessed from the lclshome display, and on the VMS SCP Network Micro Index panel. Another symptom is a high percentage (e.g., 90%) of time spent in the iowait state for both CPUs on the proxy server machine, PX00.

The problems may be triggered by the reboot of a SLC-aware IOC. They are usually caused by rebooting more than one SLC-aware IOC in a short time period. They are common on ROD days when SLC-aware IOCs are booted. They may also occur when power is cycled (e.g., when there is a power failure).

Proxy Server Recovery Approach

These problems cause the proxy server software running on PX00 to enter a state where communication between one or more SLC-aware IOCs and the VMS SLC control system is lost. Once communication is lost with one SLC-aware IOC, the problem often spreads to a loss of communication with others.

The recovery approach involves first resetting, on the VMS SLC control system using a button macro, all of the SLC micro names associated with the SLC-Aware IOCs listed on the SLC-aware IOCs EDM display. This stops the proxy server network traffic. Once the error logs no longer show messages indicating problems with the proxy server, the SLC-Aware IOCs listed on the SLC-aware IOCs EDM display may be rebooted individually with a sufficient delay (e.g., one minute) between reboots. This should give the proxy server software an opportunity to restore network traffic without reentering a state where communication starts to be lost.
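The "one at a time, with a delay" rule can be sketched in Python as follows. This is only an illustration of the sequencing, not an existing tool: the reboots described in this document are done by hand from the EDM display, so the `reboot` callback is a hypothetical stand-in for whatever action actually reboots an IOC, and the 60-second default simply mirrors the "one minute" example above.

```python
import time
from typing import Callable, Iterable, List


def reboot_staggered(
    ioc_names: Iterable[str],
    reboot: Callable[[str], None],
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Reboot IOCs one at a time, pausing between consecutive reboots.

    `reboot` is a hypothetical callback (assumption); `sleep` is injectable
    so the sequencing can be exercised without actually waiting.
    Returns the names in the order they were rebooted.
    """
    done: List[str] = []
    for name in ioc_names:
        if done:
            # Pause only between reboots, never before the first one.
            sleep(delay_seconds)
        reboot(name)
        done.append(name)
    return done
```

Passing a list of IOC names together with a function that prompts the operator to press the "Reboot..." button would turn this into a simple pacing aid; the delay value should be chosen from experience with the proxy server, not from this sketch.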

It is important to note that the PX00 machine should NOT be rebooted in an effort to help in the recovery. This will likely cause the proxy server software running on this machine to soon enter a state where all communication between SLC-aware IOCs and the VMS SLC control system becomes lost.

Procedure

  1. From a SCP, one may reset the individual SLC-aware IOCs that are associated with proxy server error messages. One may see these error messages using the CMLOG Viewer, which may be accessed from lclshome as follows:
  2. Alternatively, one may see these error messages by logging into a VMS account on the MCC machine and entering the "errdsp" command (this is usually the preferred method of seeing the error messages since it is easier to spot messages using "errdsp" than using the CMLOG Viewer). An example of a message using "errdsp" that indicates a failure to connect to a SLC-aware IOC is:
  3. Once an error message such as the one above has been identified, the next step is to determine the name of the SLC-aware IOC associated with the IP hex abbreviation. In the message above, this IP hex abbreviation is "B15". The "dump_gs" command on the MCC machine may be used to translate between IP hex abbreviations and SLC-aware IOC names. For example, the command "dump_gs b15" produces the following output:
  4. For each SLC-aware IOC associated with proxy server error messages, reset its SLC micro. This is done using the SCP on VMS. For example, in the output above the name of the SLC-aware IOC associated with the B15 message code is IOC-IN20-BP02 and its SLC micro name is IB20. To reset this SLC micro, do the following actions on a SCP: and enter the SLC micro name IB20 followed by selecting "OK". Then select the "Reset Micro" button.
  5. Perform an "auto check status" on the SCP: followed by selecting the appropriate micro button (e.g., IB20) and selecting the "Auto Check Status" button. The date/time lines showing the last time run on the display should be very close to the current date/time (ignore the "BPMOTIME" and "BPMOCHCK" lines that appear in white). If they are very close to the current date/time, the next step of rebooting the SLC-aware IOC may be skipped.
  6. Reboot the SLC-aware IOC. First, from the lclshome EDM display bring up the SLC-aware IOC display: To reboot the SLC-aware IOC, select the "SLC-Aware..." button corresponding to the SLC-aware IOC in the "Diag Status" column and then select "Reboot..." on the resulting popup dialog.
  7. Monitor the error messages using either the CMLOG Viewer or the VMS errdsp utility to verify that the proxy server error message seen previously no longer appears.
  8. Repeat the "auto check status" on the SCP as described in step 5 above to verify that the SLC-aware IOC was rebooted successfully.
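The decision made after an "auto check status" (skip the reboot if the last-run times are very close to now) can be sketched as below. This is a minimal illustration under stated assumptions: the timestamps are assumed to have been read off the display and entered by hand, the line names other than BPMOTIME and BPMOCHCK are hypothetical, and the five-minute tolerance is an arbitrary stand-in for "very close", which this document does not quantify.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Lines the procedure says to ignore (they appear in white on the display).
IGNORED_LINES: Tuple[str, ...] = ("BPMOTIME", "BPMOCHCK")


def stale_lines(
    last_run: Dict[str, datetime],
    now: datetime,
    tolerance: timedelta = timedelta(minutes=5),  # assumption, not from the source
) -> List[str]:
    """Return the names of lines whose last-run time is not close to `now`."""
    stale = []
    for name, stamp in last_run.items():
        if name in IGNORED_LINES:
            continue
        if abs(now - stamp) > tolerance:
            stale.append(name)
    return sorted(stale)


def reboot_needed(last_run: Dict[str, datetime], now: datetime) -> bool:
    """True if any non-ignored line is stale, i.e. the reboot step should not be skipped."""
    return bool(stale_lines(last_run, now))
```

If `reboot_needed` returns False, the reboot in step 6 may be skipped; otherwise the stale line names indicate what to recheck after the reboot.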

Other Notes

  1. One method of checking the status of each SLC-aware IOC listed on the SLC-aware IOC EDM display is to check the status for each on the SCP Network Micro Index panel: On this panel first select the "Disply Last Page" button and then the "Disply Prev Page" button. The first SLC-aware IOC name listed on this display is IA20. To verify that a SLC-aware IOC status is good, check to make sure the HSTA and IPL time column values are good and are green.
  2. When there are proxy server problems, issuing a "top" command on the PX00 proxy server machine can indicate approximately 90% IO wait.
  3. The SLC-aware IOC EDM display has a Help button that provides helpful information regarding proxy server recovery.
  4. The following URL provides SLC-aware IOC documentation: SLC-aware IOC documentation
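As an alternative to reading the IO wait figure from "top" (note 2 above), the percentage can be computed from two samples of the kernel's CPU counters. The sketch below assumes PX00 is a Linux machine that exposes `/proc/stat`, which this document does not state; it uses the aggregate "cpu" line, so it reports the figure for all CPUs combined.

```python
import time
from typing import List


def parse_cpu_line(stat_text: str) -> List[int]:
    """Return the counters from the aggregate 'cpu' line of /proc/stat.

    Field order is user, nice, system, idle, iowait, irq, softirq, steal.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two /proc/stat snapshots."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0 or len(deltas) < 5:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def sample_iowait(interval_seconds: float = 5.0, path: str = "/proc/stat") -> float:
    """Take two snapshots `interval_seconds` apart and return the iowait percentage."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = handle.read()
    return iowait_percent(before, after)
```

Calling `sample_iowait()` on the proxy server machine returns a value comparable to the IO wait figure shown by "top"; a result near 90 would match the symptom described in this document.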

Author:  Bob Hall 03-Feb-2009