LCLS Controls

SLC-Aware IOC Design


 

Cluster Status and Test Service

 


1.   Scope

This document describes a software design for the SLC-aware IOC (slcIOC) Cluster Status and Test Service. It covers the functionality of the service and the design of each thread involved in it.

2.   Introduction

This service consists of the cstrAsync Thread and the cstrHdlr Thread.

The cstrAsync Thread periodically updates CSTR health and status secondaries (supertype 3 data) for all SLC IOC Async Threads in the SLC database. The data comes from Global Memory (where it was stored by this and other Async threads) and from associated EPICS PVs. It then queues an update request to the SLC database service. cstrAsync is itself an Async thread and behaves like any other Async thread. It also maintains counters in the Global Area for itself and other Async Threads.

The cstrHdlr Thread services all TEST job messages, which are used by Paranoia on the VMS system. cstrHdlr echoes back messages that are sent to it. This functionality was in the SLC IOC Message Service in past design reviews.

2.1 Background

This same functionality is implemented in the TESTMAIN job on the SLC Micros.

2.2 References

1.      PRIMARY.DBS: List of SLC Database Primaries and Secondaries

2.      SLC Asynch Database Update Design Spec by T Lahey, N Spencer 1989

3.      Improving Control of Auto-Checking Functions by T Lahey, N Spencer, R Hall 1990

2.3 Requirements

See the SLC-Aware IOC Functional Requirements by Stephanie Alison.

3.   Cluster Status and Test Service Design

3.1 Service Description and External Interfaces

3.1.1 Service Block Diagram

The Cluster Status and Test Service includes two threads shown in the block diagram below: the cstrAsync thread and the cstrHdlr thread. The cstrAsync thread initializes and maintains the Job Global area by calling Async utilities. It reads data from the EPICS database using EPICS Runtime Database Access calls. It reads and writes the SLC CSTR database primary using SLC database utilities and Async utilities.

The cstrHdlr thread handles TEST function messages from Paranoia on VMS. This involves receiving the messages and sending a reply back. cstrHdlr also calls the cstrAsync CHK1 check function after receiving TEST messages from the SLC IOC Message Service.


3.1.2 External Interfaces

Described elsewhere in this document.

3.1.3 Data Flow

See Diagram Above.

3.1.4 Data Structures

See slcJobFunc_ts described in Async Utilities Design.

3.2 cstrAsync Thread Detailed Design

The CSTR Async Thread (cstrAsync) periodically updates CSTR health and status secondaries (supertype 3 data) in the SLC database. The data comes from Global Memory (where it was stored by this and other Async threads) and from associated EPICS PVs. It then queues an update request to the SLC database service. It also maintains counters in the Global Area for itself and other Async Threads. All access to the Global Area is through Async Utilities.

3.2.1 Functional Flow

3.2.1.1 Thread Initialization

Note: Prior to cstrAsync being run, slcAsyncInit() (see Utility section below) is called by dbHdlr at SLC IOC initialization time to set up the Async Global area once for all ASYNC threads and perform other initialization. This initialization includes reading Cycling parameters from the SLC database for all cycling functions.

·        Call private function cstrInitMicrStatus(), which initializes micro status data items and calls private function cstrProcMicrSts() to set the AMSK and MSTA secondaries in the SLC database per the table below. Call asyncMicroDbSend() to send the update to VMS. As indicated in the table of secondaries below, CSTR,[this_micro],1,[CRTS, CRTT, CRVn, CAM, NTIM, and TSTA] are set in the database once at this stage of startup processing using dblistalloc(), dblput(), dbupdate(), and dblistfree(). They are never set again.

·        The value of CSTR:VTIM is echoed to VMS in CSTR:MTIM as an indication that the micro's database has been downloaded (and a message is logged saying "at your service"). This is accomplished by calling a local cstrAsync function that does a dblistalloc()/dblist() for each secondary, a dblget() for CSTR:VTIM, a dblput() that writes the VTIM value to the database using the MTIM list, a dbupdate(), and a dblistfree() to free the pointer and data lists.

·        CSTR:VTIM will not be checked to see if it is 25 minutes older than current micro time as it is on the SLC micro. On the SLC micro, that test is done to see if no SCP was involved in the boot request (in which case fast feedback loops are not to be restarted).

·        Call slcAsyncMeterDbSend(forceUpdate=yes) to send a non-metered request to update the Alpha database.

·        Call public function slcAsyncCycleInit(cstrAsyncThread_enum) (see CSTR Utility web page) to set up to be a cycling function.

3.2.1.2 Periodic Thread Processing

1.      There will be no handling of Check Functions from Paranoia (or other messages from VMS) in cstrAsync like there is in the SLC micro TEST job. On the SLC micro, those functions are used for such purposes as telling paranoia that the micro is still online. On the SLC IOC, that functionality is provided by the cstrHdlr thread.

2.      In a Loop, cstrAsync will perform the following processing:

·        Call slcAsyncSleep(cstrAsync_thread_enum) (see cstr Util web page), which returns an indication of which cycling function is to be executed or, if there is nothing to be done immediately, first waits the appropriate amount of time based on cstrAsync's cycling parameters for all its check functions. This wait can be indefinite.

·        Based on the returned indication of what to do, call one of the cstrAsync check functions:
cstrAsyncChk1(dbUpdateNeeded)
cstrAsyncChk2(dbUpdateNeeded)
cstrAsyncCpum(dbUpdateNeeded)
cstrAsyncChk3(dbUpdateNeeded)
Those functions call dblput() (not chk3) as needed to write to the SLC database.
Those functions return an indication if the database needs to be updated.

·        Call slcAsyncSetFcnStats(funcId) to update statistics for the function in the async function (global) table (this is done inside CHK functions on RMX).

·        If the check function returned an indication that the database update needs to be sent to VMS, call slcAsyncMeterDbSend(). That indication is set only if something changed and needs to be updated.
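The loop body above can be sketched as follows. This is a minimal illustration, not the actual cstrAsync code: the enum, the stub check functions, the counters, and all signatures are assumptions made for the sketch, since this document names slcAsyncSleep(), slcAsyncSetFcnStats(), and slcAsyncMeterDbSend() but does not give their prototypes.

```c
#include <stdbool.h>

/* Hypothetical identifiers for the cycling functions; the real enum is not
 * given in this document. */
typedef enum { FCN_CHK1, FCN_CHK2, FCN_CPUM, FCN_CHK3, FCN_COUNT } cstrFcn_te;

/* Stand-in for a check function: sets *dbUpdateNeeded when something changed
 * that must be sent to VMS. */
typedef void (*cstrCheckFcn_t)(bool *dbUpdateNeeded);

/* Counters so the sketch can be exercised without the real utilities. */
static int statsCalls_a[FCN_COUNT];
static int dbSends;

static void stubChk1(bool *n) { *n = true; }
static void stubChk2(bool *n) { *n = false; }
static void stubCpum(bool *n) { *n = true; }
static void stubChk3(bool *n) { *n = false; } /* CHK3 does no SLC database update */

static const cstrCheckFcn_t checkFcns_a[FCN_COUNT] = {
    stubChk1, stubChk2, stubCpum, stubChk3
};

/* Stand-ins for slcAsyncSetFcnStats() and slcAsyncMeterDbSend(). */
static void stubSetFcnStats(cstrFcn_te f) { statsCalls_a[f]++; }
static void stubMeterDbSend(void)         { dbSends++; }

/* One pass of the loop body; 'due' plays the role of the indication that
 * slcAsyncSleep() would return. */
bool cstrAsyncRunOnce(cstrFcn_te due)
{
    bool dbUpdateNeeded = false;

    checkFcns_a[due](&dbUpdateNeeded);  /* run the check function that is due */
    stubSetFcnStats(due);               /* statistics are updated every pass  */
    if (dbUpdateNeeded)
        stubMeterDbSend();              /* sent only when something changed   */
    return dbUpdateNeeded;
}
```

Keeping the statistics update in the loop rather than in each check function matches the note above that this differs from RMX.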

3.2.1.3 Messages Received by cstrAsync

None. See the cstrHdlr thread design.

3.2.1.4 Termination

Termination is initiated by the slcExec thread setting the slcThreads_as[cstrAsync].stop flag to epicsTrue.

When the slcThreads_as[cstrAsync].stop flag == epicsTrue:

·        Set slcThreads_as[cstrAsync].active = epicsFalse

·        Set slcThreads_as[cstrAsync].tid_ps = NO_THREAD

·        Log a "Stopped and out of service" message.

·        Return

See the Async utility slcAsyncExit() for a description of resources released when the SLC IOC is stopped.

3.2.2 Global and Database Data

1.      The following CSTR Supertype I (read-only) secondaries are read as part of cstrAsync processing. Much of the processing is generalized for all ASYNC threads and is therefore handled within utilities listed in the Async Utilities Section below.

CSTR Secn           Description
CNAM                Cycling function name (job + function name). Read at initialization time only.
CYCL                Cycle length in seconds for this function.
MTRL, MTRC, MAXT    Database-write metering parameters (length, count, and max_time).

2.      The following CSTR Supertype II (read-only) secondaries are read as part of cstrAsync processing. Much of the processing is generalized for all ASYNC threads and is therefore handled within utilities listed in the Async Utilities Section.

CSTR Secn    Description
HSTA         Status. Set by Cluster Status Panel. Used to tell if an ASYNC is enabled/disabled. Read by IS_FCN_ENABLED() on RMX.
CMSK         Mask of functions in this IOC which should cycle asynchronously. Set by Cluster Status Panel. Read by IS_FCN_ENABLED() on RMX.
MMSK         Mask of functions that have CA monitors. Used in calculating when to do database updates (if set, use minimum of CYCL and SCAN).
FMSK         Mask of functions that have CA monitors. Used in calculating when to do database updates (if set, force at SCAN interval).
SCAN         Cycle length in seconds for this async function. Used in conjunction with MMSK, FMSK, and CYCL.
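As an illustration of how MMSK, FMSK, SCAN, and CYCL could combine, here is a minimal sketch. The function name, the 32-bit mask width, the mapping of a cycling function to a bit position, the precedence of FMSK over MMSK when both bits are set, and the fallback to CYCL when neither bit is set are all assumptions of the sketch; the table above states only the two "if set" rules.

```c
#include <stdint.h>

/* Returns the database-update interval in seconds for the cycling function
 * at bit position 'fcnBit' (must be less than 32).
 * Assumption: FMSK takes precedence over MMSK when both bits are set. */
unsigned cstrUpdateInterval(unsigned fcnBit, uint32_t mmsk, uint32_t fmsk,
                            unsigned cycl, unsigned scan)
{
    uint32_t bit = (uint32_t)1 << fcnBit;

    if (fmsk & bit)
        return scan;                         /* force at the SCAN interval */
    if (mmsk & bit)
        return (cycl < scan) ? cycl : scan;  /* minimum of CYCL and SCAN   */
    return cycl;                             /* assumed default: use CYCL  */
}
```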

3.      The following CSTR Supertype III secondaries are handled by the crate verifier on the SLC micros.
On the SLC IOC, they are set to zero at SLC IOC startup time by cstrAsync and then not set periodically.

CSTR Secn       Description
CRTS            CAMAC Crate Status Mask
CRTT            CAMAC Crate Temperatures (degF?)
CRV1 to CRV7    CAMAC Crate Voltages (V?)
CAM             Available CAMAC memory pool (bytes)
NTIM            Last time TSTA was updated by micro
TSTA            Timing job interrupt status

4.      The following CSTR Supertype III secondaries are set to the values indicated in the table below at SLC IOC startup time by slcAsyncInit().
There is no periodic processing for these secondaries.

CSTR Secn    Description                       Init. Value
MTIM         Last Successful IPL from micro    VTIM

5.      The following CSTR Supertype III secondaries are periodically processed.
dblistalloc()/dblist() are called at process startup time. Then, dblput() is called to write values to the database as part of periodic cycling.

CSTR Secn         Description
UTIM              Last database update timestamp per job
CTIM              Last async cycling function update timestamp per job
ELPS              Elapsed time for last cycling function update per job (seconds)
NRUN              Number of executions in last calculation interval per job
FAIL              Number of failed executions in last calculation interval per job
PUPD              Percent of executions triggering database update in last calculation interval per job
PVAX              Percent of executions triggered by VMS message in last calculation interval per job
CPU               CPU idle time (percent)
RMX               Available memory (bytes)
AMSK/MSTA/JMSK    Masks of Active Jobs/Micro Job Status/Jobs Included (JMSK is Supertype I)

At startup and periodically, CSTR AMSK (job active mask) and MSTA (micro job status) are set.
AMSK is set by looking at the SLC job array (slcJobs_as) and setting a bit for each index where the non-zero threads (async and/or hdlr) have the active flag set. Each bit in AMSK corresponding to a particular job is set if ALL the tasks associated with the job are active AND the corresponding JMSK bit is set. Then MSTA is set to 1 if AMSK == JMSK and 0 otherwise. This logic is consistent with the logic in ref_rmx_test:procmicrsts.f86.
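The AMSK and MSTA rules just described can be sketched as below. The jobView_ts type is a hypothetical, reduced view of a slcJobs_as entry, and the 32-bit mask width and the use of the job index as the bit position are assumptions of the sketch rather than details taken from the real code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_JOB_THREADS 2   /* async and/or hdlr thread per job (assumed) */

/* Hypothetical, reduced view of one slcJobs_as entry. */
typedef struct {
    size_t nThreads;                 /* 0 means the job has no threads */
    bool   active[MAX_JOB_THREADS];  /* active flag of each thread     */
} jobView_ts;

/* Builds AMSK: bit i is set when job i has at least one thread, all of its
 * threads are active, and bit i is also set in JMSK. */
uint32_t cstrBuildAmsk(const jobView_ts *jobs, size_t nJobs, uint32_t jmsk)
{
    uint32_t amsk = 0;

    for (size_t i = 0; i < nJobs && i < 32; i++) {
        bool allActive = jobs[i].nThreads > 0;
        for (size_t t = 0; t < jobs[i].nThreads; t++)
            allActive = allActive && jobs[i].active[t];
        if (allActive && (jmsk & ((uint32_t)1 << i)))
            amsk |= (uint32_t)1 << i;
    }
    return amsk;
}

/* MSTA is 1 (alive) only when AMSK equals JMSK, otherwise 0 (bad). */
int cstrMsta(uint32_t amsk, uint32_t jmsk)
{
    return (amsk == jmsk) ? 1 : 0;
}
```

With three jobs where only the first two have all threads active, a JMSK of 0x3 gives MSTA 1, while a JMSK of 0x7 gives MSTA 0 because the third included job is not fully active.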

 

6.      The following CSTR Supertype III secondaries are handled by the Magnet Handler (not by cstrAsync).

CSTR Secn    Description
MAGF         Current Magnet Job Function Code
BTIM         Timestamp when magnet job was initialized

3.2.3 Resource Management

Writes to the SLC database will be metered in a manner similar to the SLC Micro Test Job by calling the utility function slcAsyncMeterDbSend().

3.2.4 Message Logging

Little message logging is done (there is none on the RMX micros). Any messages logged are stated explicitly in this document.

3.2.5 Diagnostics

Update the following diagnostics, reset some on-demand.
They will be accessed from the IOC shell via the iocsh commands.

·        Diagnostics list is TBD

3.2.6 Major Routines

The following table shows the names of Functions on the SLC RMX micros and the corresponding cstrAsync Functions on the SLC IOC which accomplish roughly the same functionality.

This code exists in ioc/cstr/cstrAsync.c and cstrAsync.h

SLC Micro Function    SLC IOC Function
TESTMAIN (job)        cstrAsync (thread)
EXEC_CHK1             cstrAsyncChk1
EXEC_CHK2             cstrAsyncChk2
EXEC_CPUM             cstrAsyncCpum
none                  cstrAsyncChk3

cstrAsync Check functions

1.      cstrAsyncChk1()
Call cstrProcMicrSts() to update CSTR secondaries AMSK and MSTA.
Call ???TBD to get current time of day.
Call slcAsyncLockStats() to get a mutex on the Global Area.
For each cycling function in the Global Area,
      Put the last database update time from Global into the UTIM dblist.
      Put the last cycling function update from Global into the CTIM dblist.
      Set Elapsed time to the value in Global divided by 100. Put the result in the ELPS dblist.
Call slcAsyncUnlockStats() to release the mutex on the Global Area.
Unlike RMX, don't call slcAsyncSetFcnStats(Function_enum) to set statistics in the Global Area for this async function. It's called in the main routine.

2.      cstrAsyncChk2()
Call slcAsyncLockStats() to get a mutex on the Global Area.
Call ???TBD epicsOsiCall (EPICS runtime database access call) to get current time of day.
For each cycling function in the Global Area,
      If Number of Executions (NRUN) has changed in Global, put the new value from Global into the NRUN dblist.
      If number of failures (FAIL) has changed in Global, put the new value from Global into the FAIL dblist.
      Calculate Percent executions triggered by Vax Messages (PVAX). If it has changed, store it in the PVAX dblist.
      Calculate Percent executions triggering database updates (PUPD). If it has changed, store it in the PUPD dblist.
      Zero the local counters for NRUN, FAIL, DBUPDATES, and VAXFUNCS.
Call slcAsyncUnlockStats() to release the mutex on the Global Area.
Do dblput() on all lists that have changed.
Unlike RMX, don't call slcAsyncSetFcnStats(Function_enum) to set statistics in the Global Area for this async function. It's called in the main routine.
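The counter handling in cstrAsyncChk2() can be sketched as follows. The struct layouts, the function names, and the integer (truncating) percentage calculation are assumptions of the sketch. Locking with slcAsyncLockStats()/slcAsyncUnlockStats() and the dblput() calls are left out so that the example stays self-contained.

```c
#include <stdbool.h>

/* Hypothetical per-function counters for one calculation interval. */
typedef struct {
    unsigned nrun;       /* executions in this interval                  */
    unsigned fail;       /* failed executions                            */
    unsigned dbUpdates;  /* executions that triggered a database update  */
    unsigned vaxFuncs;   /* executions triggered by a VMS message        */
} fcnCounters_ts;

/* Last values written to the NRUN, FAIL, PUPD and PVAX lists. */
typedef struct { unsigned nrun, fail, pupd, pvax; } fcnPublished_ts;

/* Integer percentage; returns 0 when nothing ran, avoiding divide by zero. */
static unsigned percent(unsigned part, unsigned total)
{
    return total ? (part * 100u) / total : 0u;
}

/* Folds one interval of counters into the published values, zeroes the
 * counters, and reports whether any published value changed. */
bool cstrFoldCounters(fcnCounters_ts *c, fcnPublished_ts *p)
{
    fcnPublished_ts next = {
        c->nrun, c->fail,
        percent(c->dbUpdates, c->nrun),
        percent(c->vaxFuncs, c->nrun)
    };
    bool changed = next.nrun != p->nrun || next.fail != p->fail ||
                   next.pupd != p->pupd || next.pvax != p->pvax;

    *p = next;
    c->nrun = c->fail = c->dbUpdates = c->vaxFuncs = 0;
    return changed;
}
```

The returned flag corresponds to the "database needs to be updated" indication: an interval that reproduces the previously published values causes no update.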

3.     cstrAsyncCpum()
Call EPICS dbGet() (EPICS runtime database access call) to get CPU idle time from PV IOC::1:CPU and put it into the CPU dblist.
Call EPICS dbGet() (EPICS runtime database access call) to get Available Memory from PV IOC::1:MEM and put it into the RMX dblist.
Note: The parameters for these EPICS dbGet() calls were set up at initialization time.
Call dblput() to put the CPU and RMX lists into the local database.
Unlike RMX, don't call slcAsyncSetFcnStats(Function_enum) to set statistics in the Global Area for this async function. It's done in the main routine.

4.      cstrAsyncChk3()
This is a future CHK function. The current values of TBD are fetched from TBD and copied to TBD EPICS PVs. NO SLC database update is done.

Functions called only from within cstrAsync

1.      cstrProcMicrSts()
This function processes AMSK and MSTA per the table above.
Clear local AMSK bit array.
Go through slcThreads_as and set a bit for each active job (*TBD* Is it necessary to check both active and stopped flags?)
Call dblput() to put local AMSK into CSTR:AMSK.
If AMSK != JMSK, set MSTA to BAD(0)
ELSE set MSTA to ALIVE(1).
Call dblput() to put local MSTA into CSTR:MSTA
When it is noticed that a thread has exited, possible *TBD* additional processing could include:

o       Send a stop request to SLC exec to bring down the whole SLC interface.

o       Or perhaps the task that exits on a fatal error should send a stop request right away.

o       But is it better to have users notice the problem and attempt a restart manually (and if the thread that's dead is not critical, then they could time the restart to be less invasive). BTW, this is the way EPICS works - when a task suspends or exits, it doesn't crash the IOC. Instead people start noticing (sometimes subtle) problems and notify controls. Doesn't happen very often.

o       We agree that an automatic SLC restart should not be attempted since the fatal error would probably just happen again.

cstrAsync logs a message for the dead threads (inactive tasks) every time through as a reminder that repair and restart are needed.

2.      cstrInitMicrStatus()
Initialize data items associated with micro status after startup.

Call dballoc() and dblist() to set up lists for CSTR,[this_micro],1:AMSK, MSTA.
Do a dballoc(), dblist(), dblget(), and dbfree() for JMSK. Store JMSK in file scope for use by another function.
Likewise, do the same for CPU and RMX (one per SLC IOC) for use by the CPUM check function.
These CSTR secondaries are one per SLC IOC.
The lists are kept in file scope for future use by cstrProcMicrSts().
EPICS PVs (IOC::1:CPU and IOC::1:MEM) will be accessed to calculate the SLC database CPU and RMX CSTR secondaries, and the initial values will be written to the SLC database using dblput().
Perform initialization for access to these EPICS PVs as follows:

o       Call EPICS dbNameToAddr() to get the pointer for each PV and EPICS dbGet() for the initial value. dbGet() requires a dbrType argument. That argument will be hard coded based on what's in primary.dbs for the particular secondary (alternatively, the proper value could be determined using a not yet written but handy utility in database utility dbget.c called dbgetDbrType that converts from dbgetFormat and dbgetWidth to an EPICS dbrType).

o       A bad status return from any db routine is cause for exit and proper cleanup (dblistfree's, reset "active" flag, etc).

o       The PV name format will be:
CSTR:(4 char ioc name):1:(secondary name of interest).
The 4 char ioc name is determined using executive routine slcGetIOCName().

A bad status return from any of these db calls is cause for exit and cleanup.

Call cstrProcMicrSts()
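The PV name format given above, CSTR:(4 char ioc name):1:(secondary name of interest), can be built with a small helper such as the one below. The helper name and return convention are assumptions of the sketch; in the real code the IOC name would come from slcGetIOCName(), whose prototype is not given here, so the sketch takes the name as a parameter.

```c
#include <stdio.h>
#include <string.h>

#define IOC_NAME_LEN 4

/* Formats "CSTR:<ioc>:1:<secn>" into buf. Returns 0 on success, -1 when the
 * IOC name is not exactly four characters or the buffer is too small. */
int cstrFormatPvName(char *buf, size_t bufLen,
                     const char *iocName, const char *secn)
{
    int n;

    if (strlen(iocName) != IOC_NAME_LEN)
        return -1;
    n = snprintf(buf, bufLen, "CSTR:%s:1:%s", iocName, secn);
    return (n < 0 || (size_t)n >= bufLen) ? -1 : 0;
}
```

For a hypothetical IOC named "LI00" and the CPU secondary, the helper produces "CSTR:LI00:1:CPU". The resulting string is what would then be passed to dbNameToAddr().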

3.3 cstrHdlr Thread Detailed Design

The CSTR Handler Thread (cstrHdlr) services all TEST job messages from VMS by sending replies back to VMS. It also calls the cstrAsyncChk1() (CHK1) cycling function after each message received.

3.3.1 Functional Flow

3.3.1.1 Initialization

The cstrHdlr thread performs the following initialization:

·        Create the message queue using the messageQCreate utility

·        Set its slcThreads_as[cstrHdlr].active = epicsTrue

·        Send an "I'm alive" unsolicited message to the Alpha PARANOIA process (V016) so that PARANOIA turns the micro online in the CSTR STAT secondary and begins existence check polling.

·        Log a "Started and at your service" message.

3.3.1.2 Normal Processing Loop

The logic of the cstrHdlr thread is similar to that of any message-queue-driven thread in the slcIOC.
It accepts incoming message commands and acts on them. The basic loop is:

While (slcThreads_as[cstrHdlr].stop != epicsTrue)

·        Read message from message queue using the msgQMsgRecv utility.
Gets the data word length, function code

·        Perform the function based on function code in msgheader
If not a valid function code, drop the message

·        Allocate buffer for reply message using the msgQGetSmallBuffer utility

·        Create reply data, converting for VMS

·        Copy the source SCP ID from the request header to the destination SCP ID used by message logging. Reset after reply is queued.

·        Copy the source and destination from the request header to destination and source, respectively, in the reply header. Copy the VMS timestamp and function code from the request to the reply header. Set the proper data length in the reply header.

·        Release the incoming message buffer back to the pool

·        Send message to the msgSend thread using the msgQMsgReply utility.
Sets up the msgheader in native format

·        Call Async function cstrAsyncChk1() to perform a CHK1 function.

·        Update statistics for the CHK1 function in the async function tables by calling slcAsyncSetFcnStats()

·        Send a request to update the Alpha database by calling slcAsyncDbSend().
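The reply header step in the loop above (swap source and destination, copy the VMS timestamp and function code, set the data length) can be sketched as follows. The header type, its field names, and its field widths are hypothetical; the real message header layout is defined by the SLC IOC Message Service and is not reproduced in this document.

```c
#include <stdint.h>

/* Hypothetical message header; field names and widths are assumptions. */
typedef struct {
    uint32_t source;        /* sender id                         */
    uint32_t dest;          /* receiver id                       */
    uint32_t timestamp[2];  /* VMS timestamp, copied unchanged   */
    uint16_t func;          /* function code                     */
    uint16_t dataLen;       /* length of the data that follows   */
} msgHeaderSketch_ts;

/* Fills the reply header from the request header. */
void cstrBuildReplyHeader(const msgHeaderSketch_ts *req, uint16_t replyLen,
                          msgHeaderSketch_ts *reply)
{
    reply->source       = req->dest;    /* this thread is now the sender   */
    reply->dest         = req->source;  /* reply returns to the requester  */
    reply->timestamp[0] = req->timestamp[0];
    reply->timestamp[1] = req->timestamp[1];
    reply->func         = req->func;
    reply->dataLen      = replyLen;
}
```

Because everything needed from the request is copied before the reply is queued, the incoming buffer can be released back to the pool first, as the loop describes.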

3.3.1.3 Messages Received by the cstrHdlr

The cstrHdlr thread performs the functions related to the following messages from the SLC Control System:

3.3.1.3.1 IOC_STOP

The slcExec thread sends this message to tell the thread to terminate itself. The message is an internal SLC IOC message, not from VMS. The cstrHdlr thread terminates without a reply.

3.3.1.3.2 TEST_ECHO

Reply with data exactly as sent.

3.3.1.3.3 TEST_ECHO_MWORD

Reply with blocks of repetitions for given word.

3.3.1.3.4 FUNC_TEST

Existence check from PARANOIA. Reply with data exactly as sent.

3.3.1.3.5 TEST_ERR_METER_RESET

Reset cmlog throttling (TBD). Reply with the throttling reset status.

3.3.1.4 Termination

Termination is initiated by the slcExec thread setting the slcThreads_as[cstrHdlr].stop flag to epicsTrue, or when the cstrHdlr thread receives an IOC_STOP message.

When the slcThreads_as[cstrHdlr].stop flag == epicsTrue, or IOC_STOP is received:

·        Set slcThreads_as[cstrHdlr].active = epicsFalse

·        Destroy the message queue using the messageQRelease utility. This releases any message buffer held in the message queue first.

·        Set slcThreads_as[cstrHdlr].tid_ps = NO_THREAD

·        Log a "Stopped and out of service" message.

·        Return

See the Async utility slcAsyncExit() for a description of resources released when the SLC IOC is stopped.

3.3.2 Global Data

The cstrHdlr thread uses the following globals:
slcThreads_as[cstrHdlr]

3.3.3 Resource Management

3.3.3.1 Message Queue

The cstrHdlr thread creates a message queue for incoming messages. The message queue is destroyed before the thread exits. The cstrHdlr thread uses the messageQRelease utility to destroy the queue, which first reads out each message in the queue and releases the message buffer it points to, then destroys the queue.
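The drain-then-destroy order described for messageQRelease can be illustrated with a toy queue. Everything in the sketch is hypothetical: the queue type, the function names, the fixed depth, and the use of malloc()/free() in place of the real message buffer pool. It is meant only to show why pending buffers must be released before the queue itself goes away.

```c
#include <stdlib.h>

#define QUEUE_DEPTH 8

/* Hypothetical queue of pointers to heap-allocated message buffers. */
typedef struct {
    void  *slots[QUEUE_DEPTH];
    size_t head;    /* index of the oldest message  */
    size_t count;   /* number of queued messages    */
} msgQueueSketch_ts;

/* Returns an empty queue, or NULL when allocation fails. */
msgQueueSketch_ts *sketchQueueCreate(void)
{
    return calloc(1, sizeof(msgQueueSketch_ts));
}

/* Returns 0 on success, -1 when the queue is full. */
int sketchQueuePut(msgQueueSketch_ts *q, void *buf)
{
    if (q->count == QUEUE_DEPTH)
        return -1;
    q->slots[(q->head + q->count) % QUEUE_DEPTH] = buf;
    q->count++;
    return 0;
}

/* Drains the queue, freeing every pending buffer, then frees the queue.
 * Returns the number of buffers released. */
size_t sketchQueueRelease(msgQueueSketch_ts *q)
{
    size_t released = 0;

    while (q->count > 0) {
        free(q->slots[q->head]);
        q->head = (q->head + 1) % QUEUE_DEPTH;
        q->count--;
        released++;
    }
    free(q);
    return released;
}
```

If the queue were destroyed first, the pointers to any unread message buffers would be lost and those buffers could never be returned.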

3.3.3.2 Memory Pool

Once it is done acting upon the incoming message the cstrHdlr thread releases the message buffer in which it was stored. The cstrHdlr thread allocates the memory block required to store the reply data for the current reply message.

3.3.4 Message Logging

Many of the error messages attributed to the cstrHdlr thread are generated and logged by utility functions. For messages logged directly, the cstrHdlr thread will log all status and error messages using the slcCmlogLogMsg utility (described in the slcCmlog Utilities section of the General Purpose Utilities document).

The cstrHdlr logs messages for the following conditions:

·        Service availability change

·        Invalid function code

·        Unable to queue a reply

·        Queue full

3.3.5 Diagnostics

See the requirements document for a list of diagnostics that are updated and reset upon demand.

3.3.6 Major Routines

This code exists in ioc/cstr/cstrHdlr.c and cstrHdlr.h

3.3.6.1 void cstrHdlrThread(void)

cstrHdlrThread is the main procedure and loop of the cstrHdlr thread. It initializes the cstrHdlr resources and globals, and waits for messages on the message queue. It cleans up resources when terminating.

3.3.6.2 void cleanup(void)

This routine is called just before the thread terminates. It sets the active flag to false and destroys the cstrHdlr message queue.

3.3.6.3 void doStop(void)

This routine is called when the TEST_STOP message is received. It sets the slcThreads_as[cstrHdlr].active flag to false, causing the thread to terminate.

3.3.6.4 void doEcho(TBD)

This routine is called when the TEST_ECHO message is received.

3.3.6.5 void doEchoMWord(TBD)

This routine is called when the TEST_ECHO_MWORD message is received.

3.3.6.7 void doExistance(TBD)

This routine is called when the TEST_EXISTANCE message is received.

3.3.6.8 void doErrMeterReset(TBD)

This routine is called when the TEST_ERR_METER_RESET message is received.



Contact: Ron MacKenzie
Last Modified: Feb, 2005. by RONM.