Looking for and killing Log Managers
The utilities that we have to look for and kill running Log Managers are:
TaoNSLookup [lookup-corbaRefName]
TaoNSDumper [lookup-corbaRefName]
OprLMInterrupter [corbaRefName]
TaoNSUnbinder
Here [corbaRefName] is the CORBA reference name, which looks like 'Opr/LM/run#-cb#' for Opr and 'Rep/run#-cb#/LM' for
any of the REP farms. The [lookup-corbaRefName] looks like 'Opr LM/run#-cb#' or 'Rep LM/run#-cb#'. For example,
to check whether a LM is running for run 14697, use:
REP2:logFiles/halflingWorkdir> TaoNSLookup Rep LM/14697-1
IOR:000000000000001249444c3a4f70724c6f674d67723a312e30000000000000010000000000000090000102000000001c4f50525345525630332e534c41432e5374616e666f72642e45445500a74b00000000002714010f004e5550000000120000000000000001004368696c64504f410000000000000000014c4d000000000300000000000000080000000054414f000000000100000014000000000001000100000000000101090000000054414f000000000400000000
A long IOR string like this means that a LM is running. To kill it, use:
REP2:logFiles/halflingWorkdir> OprLMInterrupter Rep/LM/14697-1
NOTE: the lookup name 'Rep LM/14697-1' becomes 'Rep/LM/14697-1' for OprLMInterrupter (the space is replaced by a slash). To check that the LM is gone, repeat the TaoNSLookup command:
REP2:logFiles/halflingWorkdir> TaoNSLookup Rep LM/14697-1
TaoNSLookup: unable to locate Rep/LM/14697-1
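The lookup-then-kill sequence above can be wrapped in a few shell functions. This is only a sketch: the function names are made up for illustration, and it assumes that TaoNSLookup takes the farm and the LM path as two separate arguments (as in the example above) and that a result starting with 'IOR:' is the only sign of a registered LM.

```shell
# Convert a TaoNSLookup-style name ("Rep LM/14697-1") into the
# slash-separated form passed to OprLMInterrupter ("Rep/LM/14697-1").
# Only the first space is replaced.
lookup_to_ref() {
    printf '%s\n' "$1" | sed 's| |/|'
}

# Succeed if the given text looks like an IOR string.
looks_like_ior() {
    case "$1" in
        IOR:*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Look up a LM and interrupt it if one is registered.
# Usage (hypothetical helper): kill_lm "Rep LM/14697-1"
kill_lm() {
    # $1 is left unquoted on purpose so that "Rep" and "LM/14697-1"
    # reach TaoNSLookup as two arguments (an assumption, see above).
    out=$(TaoNSLookup $1 2>&1)
    if looks_like_ior "$out"; then
        OprLMInterrupter "$(lookup_to_ref "$1")"
    else
        echo "no LM found for $1"
    fi
}
```

After calling kill_lm, running TaoNSLookup again by hand is still the way to confirm that the LM has really gone.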
The other inspection utility available is TaoNSDumper:
REP2:logFiles/halflingWorkdir> TaoNSDumper
Name service ior:
IOR:000000000000002b49444c3a6f6d672e6f72672f436f734e616d696e672f4e616d696e67436f6e746578744578743a312e30000000000001000000000000009c000102000000001c4f50525345525630332e534c41432e5374616e666f72642e45445500805500000000003314010f004e5550000000150000000000000001004e616d65536572766963650000000000000000014e616d6553657276696365000000000300000000000000080000000054414f000000000100000014000000000001000100000000000101090000000054414f000000000400000000
Name Graph
0: Bdb: context
  0: Conditions: context
    0: OIDService: context
      0: \nfs\objyboot1\objy\databases\production\boot\physics\V1\rep\current\con002\BaBar.BOOT: reference
1: Rep: context
  0: 14697-1: context
    0: LM: reference
  1: 14687-1: context
  2: 14683-1: context
  3: 14667-1: context
  4: 14663-1: context
  5: 14633-1: context
  6: 14627-1: context
  7: 14623-1: context
  8: 14617-1: context
  9: 14607-1: context
  10: 14603-1: context
  11: 14597-1: context
  12: 14593-1: context
Each run entry marked 'context' corresponds to a LM having been started for that run. An 'LM: reference' line
under a run means either that a LM is still running for that run, or that the LM was killed
uncleanly and left its name behind.
If the machine running the LM crashes, or someone does a kill -9 on the LM, its name is not removed from the naming service. In that case you cannot start a new LM for that run until the stale name has been removed; use TaoNSUnbinder to do this.
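To find which runs still have a LM name registered, the TaoNSDumper output can be filtered. The sketch below assumes the dump has the layout shown above, with a 'run#-cb#: context' line followed directly by its 'LM: reference' line; the function name is made up for illustration.

```shell
# Print the run#-cb# name of every run whose context holds a LM reference.
# Reads TaoNSDumper output on standard input.
runs_with_lm_reference() {
    awk '
        # Remember the most recent "<index>: <run#>-<cb#>: context" line.
        $NF == "context" && $(NF-1) ~ /^[0-9]+-[0-9]+:$/ {
            run = $(NF-1)
            sub(/:$/, "", run)
        }
        # An "LM: reference" line belongs to the run seen just before it.
        $NF == "reference" && $(NF-1) == "LM:" && run != "" {
            print run
        }
    '
}
```

It would be used as 'TaoNSDumper | runs_with_lm_reference'. Each run it prints either has a live LM or a stale name, so check it with TaoNSLookup before removing anything.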
Possible problems with running the LM
For now, see the oprshifter page for troubleshooting beyond what is covered here.
If a node crashes, events that the client has not signaled as done are redistributed to other client nodes. Killer events are never redistributed, so that corrupt events cannot wipe out the farm. The LM runs one thread for each client it connects to, plus a few others.
If the LM is 'starving', its memory is full and most events will be pending. The number of events held in memory at any given time is the sum of current and pending events. If you see this happening, it is a sign of a problem that needs fixing.
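As a rough illustration of the arithmetic, the two counts can be combined as follows. The function names, and the reading of 'most' as 'more than half', are assumptions made for this sketch; the counts themselves would come from whatever monitoring the LM provides.

```shell
# events_in_memory CURRENT PENDING
# Print the number of events held in memory (current + pending).
events_in_memory() {
    echo $(( $1 + $2 ))
}

# mostly_pending CURRENT PENDING
# Succeed if more than half of the events in memory are pending
# (the one-half threshold is an assumption, not a documented limit).
mostly_pending() {
    [ $(( 2 * $2 )) -gt $(( $1 + $2 )) ]
}
```

For example, with 20 current and 180 pending events (made-up numbers), events_in_memory prints 200 and mostly_pending succeeds, which would match the 'starving' picture described above.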