LM - the log manager

TaoNSDumper / TaoNSLookup / OprLMInterrupter / TaoNSUnbinder

How it works

The LM has 2 modes of operation:

To get data, the DRS first loads the xtc file onto disk. The LM then reads this file event by event and sends each event over a TCP/IP connection to the OprDaemonApp on the client nodes. There the events are written into a shared memory space, which is shared with the ELF app running on the client node.
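The transport step can be pictured with a minimal sketch. The real xtc event layout and the LM wire protocol are not described on this page, so the 4-byte length prefix, the function names send_event and recv_event, and the local socket pair standing in for the LM to OprDaemonApp connection are all assumptions made purely for illustration.

```python
import socket
import struct


def send_event(sock, payload):
    """Send one event as a 4-byte big-endian length plus payload (assumed framing)."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def _recv_exact(sock, n):
    """Read exactly n bytes; raise if the peer closes the connection early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("connection closed mid-event")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_event(sock):
    """Receive one length-prefixed event and return its payload."""
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return _recv_exact(sock, length)


def demo():
    """Push a few fake events through a local socket pair, one event at a time."""
    lm_side, daemon_side = socket.socketpair()  # hypothetical stand-in for the TCP link
    events = [b"event-0", b"event-1", b""]
    with lm_side, daemon_side:
        for ev in events:
            send_event(lm_side, ev)
        return [recv_event(daemon_side) for _ in events]
```

A length prefix is used here only because a TCP stream has no message boundaries of its own; the receiving side needs some way to tell where one event ends and the next begins.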

Start up

The LM is started from the OprManager/runOprProcesses script. You can track the start up by looking for the runlog* file in a Run directory and looking at the output files that are associated with the LM:

Normally the LM should start up without any problems. However, if the LM fails to start for a run, or anything else strange happens, search the files [assuming that they exist] in this order: runlog*, lm.log, lmm.log. Pay attention only to the first few lines of lm.log unless you know what it is doing.

Looking for and killing

The utilities that we have to look for and kill running Log Managers are:

where [corbaRefName] is the CORBA reference, which looks like 'Opr/LM/run#-cb#' for Opr and 'Rep/LM/run#-cb#' for any of the REP farms. The [lookup-corbaRefName] is something like 'Opr LM/run#-cb#' or 'Rep LM/run#-cb#'.

For example, if you want to check whether a LM is running for run # 14697 you use:

The return of a long string like this tells you that there is a LM running. To kill it you would use:

NOTE: the Rep LM/14697-1 -> Rep/LM/14697-1 change. You can check that the LM has been removed by repeating the above command:
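The NOTE above on the two name forms can be captured in a small helper. This is only a sketch of the string change shown in the example (the first space becomes a slash); the function name lookup_name_to_corba_ref is hypothetical and is not part of any of the utilities.

```python
def lookup_name_to_corba_ref(lookup_name):
    """Turn a lookup name such as 'Rep LM/14697-1' into 'Rep/LM/14697-1'.

    Hypothetical helper: it only mirrors the 'Rep LM/14697-1 -> Rep/LM/14697-1'
    change described in the NOTE and does not talk to the naming service.
    """
    farm, sep, rest = lookup_name.partition(" ")
    if not sep or not farm or not rest:
        raise ValueError("expected '<farm> LM/<run#>-<cb#>', got %r" % lookup_name)
    return farm + "/" + rest
```

For instance, lookup_name_to_corba_ref("Opr LM/14697-1") gives "Opr/LM/14697-1".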

The other inspection utility that you have to use is TaoNSDumper:

Each entry marked as context corresponds to a LM having started for a given run. The line with 'reference' indicates either that a LM is still running for that particular run, or that the LM has been killed in a nasty way ...

If the machine running the LM crashes, or someone does a kill -9, the reference for the name in the NamingService is not removed. In that case you will not be able to start a new LM for that run until you remove the name from the naming service. To do this you can use TaoNSUnbinder.

Possible problems with running

For the moment you'll need to see the oprshifter page for troubleshooting beyond these words.

If a node crashes, events that the client has not signaled as done are redistributed to other client nodes [killer events are never redistributed, so that corrupt events cannot wipe out the farm]. For each client that the LM connects to it has a thread for that client, plus a few others.
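The redistribution rule can be sketched as simple bookkeeping. How the LM actually identifies killer events and how it picks a target node are not stated here, so the killers set passed in by the caller, the least-loaded choice of target, and the name redistribute are all assumptions for illustration.

```python
def redistribute(outstanding, crashed, killers):
    """Reassign a crashed client's unfinished events to the surviving clients.

    outstanding: dict mapping client name -> list of event ids that were sent
                 but not yet signaled as done.
    crashed:     name of the client node that went down.
    killers:     set of event ids treated as killer events (assumed input).
    Returns (new_outstanding, withheld); withheld holds the killer events,
    which are never handed to another client.
    """
    survivors = sorted(c for c in outstanding if c != crashed)
    if not survivors:
        raise RuntimeError("no surviving clients to take the events")
    result = {c: list(outstanding[c]) for c in survivors}
    withheld = []
    for ev in outstanding.get(crashed, []):
        if ev in killers:
            withheld.append(ev)  # keep corrupt events away from the rest of the farm
            continue
        # Assumed policy: give the event to the client with the fewest outstanding events.
        target = min(survivors, key=lambda c: len(result[c]))
        result[target].append(ev)
    return result, withheld
```

With node1 holding events 1, 2 and 3 when it crashes, and event 2 flagged as a killer, events 1 and 3 go to the other nodes and event 2 is withheld.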

If the LM is 'starving', this means that the memory has filled up and that most events will be pending. The sum of current + pending events is the number stored in memory at any given time. If you see this happening, it is a sign that you have a problem that needs fixing.
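The arithmetic above can be written out as a small check. The page does not say how the counts are obtained or how large the memory is, so this sketch only covers the "most events pending" part; the function name lm_memory_status and the 0.5 cut-off for "most" are assumptions.

```python
def lm_memory_status(current, pending, pending_fraction_threshold=0.5):
    """Return (events_in_memory, looks_starving) from the two event counts.

    events_in_memory is current + pending, as described in the text.
    looks_starving is True when the pending share exceeds the threshold
    (assumed to be 0.5, i.e. "most" events are pending).
    """
    in_memory = current + pending
    if in_memory == 0:
        return in_memory, False
    return in_memory, (pending / in_memory) > pending_fraction_threshold
```

For example, 2 current and 8 pending events give 10 events in memory with most of them pending, which matches the 'starving' picture described above.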




Maintained by Adrian Bevan. Last updated on 13-Aug-2001