SLAC CPE Software Engineering Group
The purpose of the Alarm Heartbeat Monitor (alh_hrtbeat_mon) is to ensure that the
Alarm Handler and Channel Watcher are logging messages to cmlog. The Alarm
Heartbeat Monitor periodically checks cmlog for messages containing specific
PVs from a representative sample of IOCs and sets Unix Watchdog monitoring PVs
to "Missing" if messages are not found.
ALH Diagnostics and Troubleshooting
|nlcta||gtw04||gtw04||8955||TR00, TR08, TR10|
Log into the desired host as cddev
st.alh_HrtBeat start
Preferred method: You can also click the little "P" button on the UWD to restart alh_HrtBeat.
UWD Status PV
|NLCTA||CW - nlcta||CS04:CWNLCTA:CMLOG:STATUS|
|PEP||CW - pepii||CS01:CWPEPII:CMLOG:STATUS|
|PEP||CW - p2rf||CS01:CWP2RF:CMLOG:STATUS|
alh_hrtbeat_mon can check (i.e., monitor) cmlog for any group of PVs. At present, it monitors ALH and CW using the alarm heartbeat PVs (see the column "HrtBeat PV" in the table above). These PVs are named $(STN):IOC:ALRM:HRTBT and can be booted into any IOC. The PVs are defined in VXStats.db. The appropriate ALH and CW config files need to be updated to include each heartbeat PV, with the logging flag set. Every 10 minutes, each PV changes value from 0 to 1 and alarm status from NO_ALARM to MAJOR, then returns to 0 and NO_ALARM 10 seconds later, thereby generating alarm and channel change messages to cmlog. There should be one monitored PV for each CW application for each realm.
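The heartbeat timing described above can be pictured with a small Python sketch. It is an illustration only, not code from VXStats.db or the IOC; the function name and the assumption that each pulse starts at an exact multiple of 10 minutes are hypothetical.

```python
PERIOD_S = 600  # heartbeat interval from the text: 10 minutes
PULSE_S = 10    # the PV stays at 1 / MAJOR for 10 seconds

def heartbeat_state(t_s):
    """Return (value, alarm status) of a heartbeat PV at time t_s seconds.

    ASSUMPTION: pulses start at multiples of PERIOD_S; the real phase
    depends on when the IOC booted.
    """
    in_pulse = (t_s % PERIOD_S) < PULSE_S
    return (1, "MAJOR") if in_pulse else (0, "NO_ALARM")
```

Under this model every 10-minute window contains at least one transition, which is why a 10-minute cmlog query should always find messages for a healthy heartbeat PV.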
alh_hrtbeat_mon periodically searches cmlog messages for text containing the specified search string over the specified time period; both are configurable in the config file (see below). The search string needs to be specified so that it will retrieve all PVs to be monitored (typically "IOC:ALRM:HRTBT"). The typical time period is 10 minutes, the interval for the heartbeat PVs. After the query, alh_hrtbeat_mon goes to sleep for the specified time period, and so on in an infinite loop.
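The query-and-sleep cycle can be sketched roughly as follows. This is a Python illustration, not the actual alh_hrtbeat_mon source; query_cmlog, report, and max_cycles are hypothetical names, and the real cmlog query interface is not shown.

```python
import time

def monitor_loop(query_cmlog, report, search_string, period_s,
                 sleep=time.sleep, max_cycles=None):
    """Query cmlog, report the result, sleep, and repeat.

    query_cmlog(search_string, period_s) is a hypothetical callable that
    returns the cmlog messages matching search_string from the last
    period_s seconds. max_cycles exists only so the sketch can be
    exercised; with max_cycles=None the loop never ends, as described.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        messages = query_cmlog(search_string, period_s)
        report(messages)
        sleep(period_s)  # same interval is used for the query and the sleep
        cycles += 1
    return cycles
```

Passing the query and sleep functions in as arguments keeps the sketch independent of any particular cmlog client.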
If you suspect something is not working (i.e., CMLOGGING is red on the UWD or someone complains that they are not seeing expected ALH or CW messages in CMLOG), here are some steps you can take:
- Query CMLOG for the HrtBeat PVs in the Summary Table above. You should see a message from CW *and* a second message from ALH for each of the HrtBeat PVs, because each of those PVs is monitored by both CW and ALH. Look in the left-hand "Sys" column of CMLOG for ALH and CW to be listed. If one or both are not found, then they are not logging or the Heartbeat Monitor process is in trouble. Continue to the next step.
- Check the status of the ALH_HRTBEAT_MON processes on the UWD display. See the table above for which host to check on the UWD (called alh_htrbeat_mon_host in the table above). Restart it if it is red using the "p" button. If this clears up all the UWD entries, stop here. If it doesn't, proceed to the next step.
- Click the "p" button next to CMLOGGING on the UWD. This will cause the dependent CW/ALH processes to be restarted automatically. It may take 10 minutes for the CMLOGGING button to turn grey. If this clears up all the UWD entries, stop here. If it doesn't, proceed to the next step.
- See the page: ALH Diagnostics and Troubleshooting
UWD Status PVs:
The heartbeat monitor sets UWD status PVs depending on the search results. The status PVs, served by UW00, can be viewed in the UWD display.
For Channel Watcher, there is a UWD status PV for each realm/instance, since Channel Watcher instances can be restarted separately. Alarm Handler instances in a realm must be restarted together (due to Xvfb), so there is a single Alarm Handler UWD status PV per realm.
Setting the status PVs:
Channel Watcher PVs:
The status of the CW instances is monitored separately for each realm. For each realm and for each instance, if no IOC:ALRM:HRTBT messages are found in cmlog in the last designated time period, the cmlogging status of that specific realm/CW application is assumed bad.
Alarm Handler PVs:
The statuses of the ALH instances in each realm are OR-ed together to create a single ALH status for each realm. Thus, for each realm, if IOC:ALRM:HRTBT messages for either instance are not found in cmlog in the last designated time period, the single ALH cmlogging status for that realm is assumed bad.
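The difference between the two rules (one status per realm/instance for CW, one OR-ed status per realm for ALH) can be sketched in Python as below. The function names and the dictionary layout are hypothetical and chosen only for illustration; the real program's data structures are not documented here.

```python
def cw_status_bad(found_by_instance):
    """Channel Watcher: one status per (realm, instance).

    found_by_instance maps (realm, instance) -> True if heartbeat
    messages were found in cmlog during the last period.
    """
    return {key: not found for key, found in found_by_instance.items()}

def alh_status_bad(found_by_instance):
    """Alarm Handler: one status per realm, bad if ANY instance is missing."""
    bad = {}
    for (realm, _instance), found in found_by_instance.items():
        bad[realm] = bad.get(realm, False) or not found
    return bad
```

For example, with two instances in one realm of which only one has logged, cw_status_bad flags just the silent instance, while alh_status_bad flags the whole realm.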
Ping PVs: If cmlog messages are found to be missing for a heartbeat PV, the state of the IOC serving the PV is checked by querying its Ping PV. If the IOC is found to be down, the corresponding cmlog status is assumed to be OK...
If cmlogging is found to have a problem, the appropriate UWD status PV is set to 1 (Missing = red, major alarm). Otherwise the PV is set to 0 (Running = green, no alarm).
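Taken together, the Ping PV check and the 0/1 status values amount to a small decision rule, sketched here in Python. The function and constant names are hypothetical, and the Channel Access calls that would read the Ping PV and write the UWD status PV are deliberately left out.

```python
RUNNING = 0  # green, no alarm
MISSING = 1  # red, major alarm

def uwd_status(messages_found, ioc_is_up):
    """Decide the UWD status value for one heartbeat PV.

    If messages are missing but the IOC serving the PV is down, the
    silence says nothing about cmlog, so logging is treated as OK.
    """
    if messages_found:
        return RUNNING
    if not ioc_is_up:
        return RUNNING
    return MISSING
```

Only the combination "no messages, IOC up" yields Missing; every other combination is reported as Running.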
The UWD sends e-mail to designated people if any of the PVs are set to Missing.
Alarm Heartbeat Monitor errors:
There are rare instances where the alarm heartbeat monitor will not be able to set the UWD status PVs: for example, the UW00 soft IOC may be down for an extended period, or there may be network, gateway, or channel access problems. It is also possible that the alarm heartbeat monitor cannot connect to cmlog. In either case, the Alarm Heartbeat Monitor will send e-mail to a designated list of people every hour until the problem is fixed.
If a config file is changed, the corresponding program must be restarted to load the new parameters/PV list.
The config files, alh_hrtbeat_mon_nlc.config and alh_hrtbeat_mon_pep.config, contain:
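One way to get "e-mail every hour until the problem is fixed" behaviour is a simple time-based throttle, sketched below in Python. This is an assumption about how such a reminder could work, not a description of the actual program; the class name, its methods, and the send callable are all hypothetical.

```python
class HourlyNotifier:
    """Send a problem report at most once per interval while it persists."""

    def __init__(self, send, interval_s=3600):
        self.send = send              # hypothetical callable that e-mails text
        self.interval_s = interval_s  # one hour, per the text
        self.last_sent = None

    def problem(self, now, text):
        """Report an ongoing problem at time `now` (seconds); True if sent."""
        if self.last_sent is None or now - self.last_sent >= self.interval_s:
            self.send(text)
            self.last_sent = now
            return True
        return False

    def cleared(self):
        """Call when the problem is fixed so the next one is sent at once."""
        self.last_sent = None
```

With a 10-minute monitoring cycle, problem() would be called on every cycle but would only send on the first failure and then once per hour.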
line 1: number of seconds for the query and sleep interval. 10 minutes (600) is a useful number for the heartbeat PVs.
line 2: search string for the cmlog query that will retrieve all the PVs in the list - typically IOC:ALRM:HRTBT, which will get all heartbeat PVs.
line 3 and subsequent lines: the PVs to monitor, each with its corresponding CW App, one per line. Maximum of 48 (could be increased in the program if necessary).
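A reader for this three-part layout might look like the Python sketch below. It is illustrative only: the function name is hypothetical, and the exact layout of the PV lines is not given here, so the sketch assumes the PV name and the CW App are separated by whitespace.

```python
MAX_PVS = 48  # limit stated in the text

def parse_config(text):
    """Parse config text into (interval_s, search_string, [(pv, cw_app), ...]).

    ASSUMPTION: each PV line holds the PV name and its CW App separated
    by whitespace; the real file layout may differ.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("config needs an interval line and a search-string line")
    interval_s = int(lines[0])   # line 1: query and sleep interval in seconds
    search_string = lines[1]     # line 2: cmlog search string
    entries = []
    for ln in lines[2:]:         # line 3 onward: one PV per line
        fields = ln.split()
        if len(fields) != 2:
            raise ValueError(f"expected 'PV CW_APP', got: {ln!r}")
        entries.append((fields[0], fields[1]))
    if len(entries) > MAX_PVS:
        raise ValueError(f"at most {MAX_PVS} PVs are supported")
    return interval_s, search_string, entries
```

A made-up input such as "600", "IOC:ALRM:HRTBT", and "TR00:IOC:ALRM:HRTBT nlcta" on three lines would yield a 600-second interval, the search string, and one (PV, CW App) pair; that PV line is an invented example following the $(STN):IOC:ALRM:HRTBT naming pattern.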
The source code for the Alarm Heartbeat Monitor is in the reference area in $CD_REF/app/alh_hrtbeat_mon/src. It is compiled and escalated in the usual UDE manner using gmakedev and new.
The type II startup file is in the reference area, $CD_REF/app/alh_hrtbeat_mon/script/st.alh_hrtbeat. It is escalated using gmakedev, and then by Jingchen or Ken using gmakenew.
Note: In order for the type I startup file, st.alh_HrtBeat, to see changes, they must be escalated at least up to the "new" level.
The config files are in the reference area in $CD_REF/app/alh_hrtbeat_mon/config. They are gmakedev-ed into $CD_CONFSYS, where they are read by the heartbeat monitor program at startup.
alh_hrtbeat_mon creates a log file, alh_hrtbeat_mon_pep.log or alh_hrtbeat_mon_nlc.log, which records the time of each query and a summary of the results. The log file is recreated each time the program is restarted.
See "environment variables" below for the location of the log file. It is typically in /tmp.
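The logging behaviour (a fresh file on every restart, one timestamped summary per query) can be sketched in Python as follows. The line format and the function names are invented for illustration; the actual log layout is not documented here.

```python
import time

def start_log(path):
    """Open the log for writing; mode "w" truncates, so each restart
    of the program begins with an empty log file."""
    return open(path, "w")

def format_log_line(n_found, n_missing, now=None):
    """Build one summary line for a query (hypothetical format).

    now is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"{stamp} query: {n_found} found, {n_missing} missing\n"
```

Because the file is truncated at startup, anything worth keeping from a previous run has to be copied elsewhere before the monitor is restarted.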
st.alh_hrtbeat sets these heartbeat monitor environment variables depending on host.
ALHMONCONFIG: config file path and name
ALHMONLOG: log file path and name
ALHMONMSG: temporary file containing messages to e-mail - path and name
PEP realm: CS01
NLCTA realm: CS04
for concatenating UWD status PV names in the code
For the Channel Access API, the startup file also does:
source epicsSetupNlcta or epicsSetupPepii (to set the path to channel access elements)
setenv EPICS_CA_ADDR_LIST "126.96.36.199" (to get to the UWD status PVs on UW00)
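A program started from st.alh_hrtbeat could pick up the three named file-path variables as in the Python sketch below. It is a hedged illustration rather than the monitor's actual startup code; the function name is hypothetical, and only ALHMONCONFIG, ALHMONLOG, and ALHMONMSG come from the list above.

```python
import os

REQUIRED = ("ALHMONCONFIG", "ALHMONLOG", "ALHMONMSG")

def read_monitor_env(environ=os.environ):
    """Return the heartbeat monitor file paths from the environment.

    Raises KeyError naming every variable that is unset or empty, so a
    missing setting is reported before any query is attempted.
    """
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise KeyError("unset environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}
```

Failing at startup with all missing names listed is a design choice for the sketch; it makes a misconfigured host obvious the first time the startup file is run.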
Author: Judy Rock, 22-Sep-2003
Modified by: 17-Mar-2006, jrock, modifications to monitor CW, and set UWD status PVs. 5-Apr-2006 more mods for UWD changes. Note, all modifications are listed on same line!