Using the OID Server at OPR

October 11, 2000

Igor Gaponenko (gapon@slac.stanford.edu) 

0. Introduction

This document is a short guide to using the "Remote OID Server Facility", intended for those responsible for running the OPR farm.

The OID Server facility provides more efficient access to the Conditions/DB for multiple jobs processing the same set of runs simultaneously. The server reduces the overall traffic between these jobs and the Objectivity servers by caching a significant amount of Conditions/DB information in its transient memory cache.

The instructions presented in this document are valid for any lettered production build of release 8.8.0, starting from version 8.8.0c. They may not be valid for more recent or earlier releases of the code.

The guide is followed by a short "Troubleshooting" part (section 2) covering the most common problems that may be seen while using this kind of setup.

1. Setting up the system

Before the OID Server can be used, both the server side and its clients (the reconstruction jobs) must be properly prepared. The basic steps needed to run the farm in this mode are:

  • starting the server itself (see section 1.1);
  • setting up the clients (the reprocessing jobs) to work with the server (see section 1.2).

Once the server has been set up, there is no need to start it again. The same instance of the server process can be used to process multiple runs sequentially, unless a non-standard situation arises.

NOTE: See the description of possible problems and the corresponding solutions in the "Troubleshooting" section at the end of this document.

1.1 The OID Server

1.1.1 Finding the server and its management utility's binaries

The OID Server setup requires two binaries:

BdbCondROIDServerApp

BdbCondRemoteCmd

The first one is the server itself. The second one is a management tool intended to manage the running server(s).

Unfortunately, due to a bug, the server built in production releases 8.8.0c, 8.8.0d and 8.8.0e is not functional. It crashes when it opens a transaction and produces the following message:

TaoServer: Trapping signal 11

TaoServer: exit handler invoked

Segmentation fault (core dumped)

A fixed version of this binary can be found at:

AFS: ~gapon/vol0/release/8.8.0-BdbCondRemote-bin/

This directory also contains the management utility, which is not built in the production release.

NOTE: It's strongly recommended to copy both binaries to a more "stable" place and to use them from there.

1.1.2 Choosing a machine to run the server on

The first question to be answered is where to run the server. Although a running server does not use 100% of the CPU at all times, a dedicated machine is still recommended.

The version of the server covered by this document (see the release statement in the Introduction) is single-threaded, so a machine with a single CPU is adequate.

NOTE: This is not true for the most recent version of the server, which is not in production yet. That version is MT-capable.

In theory, the best place to run the server is a machine with a good connection to the AMS server of the federation.

1.1.3 Starting the server

The server binary is a command line utility taking a number of optional parameters. A description of these parameters can be obtained by running the server in "help" mode:

BdbCondROIDServerApp help

BdbCondROIDServerApp -h

The simplest sequence of operations required to start the server is therefore:

setenv OO_FD_BOOT /nfs/objyboot1/objy/databases/Production/physics/V2/production/opr/BaBar.BOOT

BdbCondROIDServerApp

Since the server provides its service via the CORBA protocol, it is very important to make sure that ACE/TAO (the CORBA implementation installed on Solaris machines at SLAC) is properly configured before the server is started. The most important point concerns the single-threaded nature of the server. The ACE/TAO configuration is controlled by a file named "svc.conf" located in the current working directory at the time the server is started. If this file exists, it must have the following contents:

static Resource_Factory "-ORBReactorType select_st"

static Server_Strategy_Factory "-ORBConcurrency reactive"

If the file contains anything else, the server may crash during execution with unpredictable diagnostics.
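
As a pre-flight check, the rule above can be sketched in Python. This is only an illustration under stated assumptions: the helper name check_svc_conf is hypothetical and not part of the release, and it assumes that blank lines and whitespace differences in "svc.conf" are harmless, which this guide does not state.

```python
from pathlib import Path

# The two directives this guide requires for the single-threaded server.
REQUIRED_LINES = [
    'static Resource_Factory "-ORBReactorType select_st"',
    'static Server_Strategy_Factory "-ORBConcurrency reactive"',
]


def check_svc_conf(directory="."):
    """Return True if svc.conf is absent or holds exactly the required lines.

    Assumption (not stated in the guide): blank lines and differences in
    whitespace are treated as harmless.
    """
    conf = Path(directory) / "svc.conf"
    if not conf.exists():
        # The guide only constrains the file if it exists.
        return True
    found = [
        " ".join(line.split())          # normalize runs of whitespace
        for line in conf.read_text().splitlines()
        if line.strip()                 # skip blank lines
    ]
    return found == REQUIRED_LINES
```

Calling check_svc_conf() from the directory in which the server is about to be started returns False if the file exists with any other contents.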

1.1.3.1 Running the server in "verbose" mode

The most important parameter, especially for the very first run of the server, is:

-OIDServer_verbose

This tells the server to print a line (preceded by a time stamp) for every single request arriving from clients. It will also print information about starting/stopping transactions, which may be important for troubleshooting.

NOTE: It's recommended to run the server in "verbose" mode at all times and to redirect its messages into a log file. The resulting log file can then be used for further analysis should it be needed later on.

The estimated volume of information produced by the server can be calculated with the following formula:

<NUMBER OF LINES PER RUN> = 200 * <NUMBER OF JOBS>

This means that each job causes about 200 messages, one for each OID found via the server.

Each line produced by the server is about 100 bytes long. Therefore the total amount of information produced by a server serving 150 clients during a single run would be about

100*150*200 = 3 MBYTES
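
The same estimate can be written as a small Python helper, useful when sizing the log area for a different number of jobs. The function name is hypothetical; the default values of 200 lines per job and 100 bytes per line are the approximate figures quoted above.

```python
def estimated_log_bytes(n_jobs, lines_per_job=200, bytes_per_line=100):
    """Estimate the verbose log volume for one run, in bytes."""
    return n_jobs * lines_per_job * bytes_per_line


print(estimated_log_bytes(150))  # prints 3000000, i.e. about 3 MBYTES
```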

1.1.3.2 Checking the run-time status of the server

There are a few ways to check the status of the server:

  • Looking for the server process (ps -ef, top) on the machine where the server was started.
  • Looking for the server in the CORBA Naming Service tables (see below).
  • Running the management utility "BdbCondRemoteCmd", which talks to the server via the CORBA protocol. One of the commands supported by the utility acquires and prints the current status of the running server (including the number of processed requests, the times of the very first and the very last requests, etc.).

These methods are especially useful when there is an indication of trouble related to the Conditions/DB on the jobs' side.

The first method allows monitoring the server's activity. The server is typically very busy (taking about 100% of the CPU) during the first 10 minutes after the start of a reconstruction run, while it loads its cache and serves the initial client requests.

The second method checks whether the server has been made known to the CORBA Naming Service, which is how the client jobs locate the right instance of the server for the desired federation. This is done by means of the following command (which is part of the release build):

TaoNSDumper

The output should contain something like this:

1: Bdb: context
0: Conditions: context
0: OIDService: context
0: \nfs\objyboot1\objy\databases\Production\physics\V2\production\opr\BaBar.BOOT: reference

The last line of this snapshot is the boot file name of the OPR federation with slashes replaced by backslashes.
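
The mapping from a boot file name to the name shown in the dump can be sketched as follows. The function name is hypothetical and the sketch only reproduces the slash-to-backslash replacement visible in the snapshot above; it is not taken from the server's source code.

```python
def naming_service_name(boot_file):
    """Return the boot file path with every '/' replaced by a backslash."""
    return boot_file.replace("/", "\\")


boot = "/nfs/objyboot1/objy/databases/Production/physics/V2/production/opr/BaBar.BOOT"
print(naming_service_name(boot))
```

Comparing this string with the last line of the TaoNSDumper output is a quick way to confirm that the registered server belongs to the federation selected by OO_FD_BOOT.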

The third way of checking the presence of the server would be to use the management utility (with the proper boot file name set beforehand):

setenv OO_FD_BOOT /nfs/objyboot1/objy/databases/Production/physics/V2/production/opr/BaBar.BOOT

BdbCondRemoteCmd statistics

It will print something like this (or issue an error message if the server is not available):

History Statistics...

STARTUP TIME: 17:00:06.347

FIRST OPERATION EXECUTED: 17:00:20.122

LAST OPERATION EXECUTED: 17:00:20.122

Counters...

Totals...

REQUESTS TO THE SERVER: 1

FAILED REQUESTS: 0

Transaction Management Requests...

TO START: 0

TO COMMIT: 0

EXECUTED STARTS: 0

EXECUTED COMMITS: 0

Find Interval as Function of Time...

TOTAL REQUESTS: 0

CACHED REQUESTS: 0

Find FIRST Interval...

TOTAL REQUESTS: 0

CACHED REQUESTS: 0

Find LAST Interval...

TOTAL REQUESTS: 0

CACHED REQUESTS: 0

Other...

TOTAL REQUESTS: 1
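
When the statistics are collected regularly, it may be convenient to turn the printout into a structure and compute, for example, the fraction of requests answered from the cache. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the parser assumes the layout of the sample above (section headings ending in "..." followed by "KEY: value" lines), which may differ in the real output.

```python
def parse_statistics(text):
    """Group 'KEY: value' lines under the most recent 'Section...' heading."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # A heading ends with three dots (or a single ellipsis character).
        if line.endswith("...") or line.endswith("\u2026"):
            current = line.rstrip(".\u2026").strip()
            sections[current] = {}
        elif ":" in line and current is not None:
            # Split on the first colon only, so time stamps stay intact.
            key, _, value = line.partition(":")
            sections[current][key.strip()] = value.strip()
    return sections


def cached_fraction(section):
    """Return CACHED/TOTAL for one 'Find ...' section, or None if no requests."""
    total = int(section["TOTAL REQUESTS"])
    cached = int(section["CACHED REQUESTS"])
    return cached / total if total else None
```

For the sample above, cached_fraction() returns None for each "Find ..." section because no interval requests have been served yet.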

1.1.3.3 Running the server in "rebind" mode

This mode is required when the previous instance of the server died without disconnecting itself from the CORBA Naming Service, leaving its name registered (the ways of checking the server's status are described in the previous subsection).

The same name conflict also arises if another instance of the server is already serving the same federation. In this case the other instance should either be shut down, or the naming conflict should be resolved by other means (these exist but are not covered by this document; contact the author for details).

In either of these cases the following message can be seen while attempting to run the server:

TaoServer: Name \nfs\objyboot1\objy\databases\Production\physics\V2\production\opr\BaBar.BOOT is already in use

BdbCondROIDServerApp: failed to register the OID servant with the Naming Service.

Given the assumption of the "dead" server above, the new instance of the server should be started by specifying another command line option:

BdbCondROIDServerApp -OIDServer_rebind

This will force the server to unbind the previous instance of the server from the Naming Service and to register itself under the service name.

1.1.3.4 Stopping the server

The correct way to stop the server is to use the management utility with the following parameters:

BdbCondRemoteCmd shutdown_server -facility "Bdb/Conditions/OIDService" $OO_FD_BOOT

This will tell the server to close the current transaction (if any) and to shut itself down immediately. The server will also disconnect itself from the CORBA Naming Service. The boot file name is required to choose the proper service to be stopped. The "-facility" keyword indicates that the OID Server is targeted by this command (rather than other possible services associated with the federation).

A less preferable way is simply to "kill -9" the server's process. This, however, requires starting a fresh instance of the server in the "rebind" mode described above.

1.2 Setting up the clients

The only additional requirement for the clients is that the following environment variable be set before each client is run:

setenv BDBCOND_USE_OIDSERVER "yes"

This will tell the clients to redirect their requests to the OID Server rather than going directly to Objectivity. The right instance of the OID Server will be located automatically via the CORBA Naming Service. The current value of the OO_FD_BOOT environment variable is used to locate the server.
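
If the client jobs are launched from a script rather than from an interactive shell, the same two variables can be placed into the job's environment. The following Python sketch assumes a hypothetical helper name, client_environment; only the variable names and the value "yes" come from this guide.

```python
import os


def client_environment(boot_file, base_env=None):
    """Return a copy of the environment prepared for an OID Server client."""
    env = dict(os.environ if base_env is None else base_env)
    env["OO_FD_BOOT"] = boot_file           # selects the federation and the server
    env["BDBCOND_USE_OIDSERVER"] = "yes"    # redirects requests to the OID Server
    return env
```

The resulting dictionary can be passed as the env argument of subprocess.run() together with the site-specific command that starts a reconstruction job.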

2. Troubleshooting

This section contains some troubleshooting information for the most common problems connected to the remote setup. The sample messages presented in the section are illustrated using the current boot file of the OPR federation as an example.

2.1 What happens to a client when there is no server serving the federation?

The client will fail with the following message:

BdbCondRemoteAccessor::initialize() -- ERROR.

Failed to obtain an object reference through the Naming Service:

FACILITY: Bdb/Conditions/OIDService

SERVICE: \nfs\objyboot1\objy\databases\Production\physics\V2\production\opr\BaBar.BOOT

This message means that no instance of the server is registered with the CORBA Naming Service. The simplest way to check which servers are currently known is to inspect the contents of the CORBA Naming Service table using the second method described in section 1.1.3.2, by running the following utility:

TaoNSDumper

If the server is not known to the CORBA Naming Service then it should be restarted as described in section 1.1.3 and its subsections.

2.2 What happens when the server dies without disconnecting itself from the CORBA Naming Service before the jobs start up?

In this case client jobs will fail with the following messages:

BdbCondRemoteAccessor::findInterval() -- ERROR.

Can not find desired interval in this container due to exception occurred while communicating with remote server.

DETECTOR -> <detector name>

CONTAINER -> <container name>

The first step is to check whether the instance of the server is still running on the machine where it is supposed to run. Kill this process if it is still running.

Then start the OID Server using the "rebind" method described in section 1.1.3.3:

BdbCondROIDServerApp -OIDServer_rebind

This will guarantee that a new instance of the server will register itself under the required name with the CORBA Naming Service.

2.3 What happens if the server dies during execution of a client job?

See the answer to the previous question (section 2.2).

2.4 What will happen if a client process using the OID Server crashes before the "finalize" stage?

The answer is: nothing special, unless this is the "finalizing" process. If it is, see section 2.6.

2.5 What happens if the process which was doing "finalize" crashes in the middle of this operation?

This should not be a problem, because the current implementation of the OID Server is only responsible for read-only operations on the Conditions/DB.

2.6 What happens if the farm is either running in "No Rolling Calibration" mode or if a process meant to be the "finalizing" one dies before starting the "finalize"?

This can be a potential problem given the current implementation of the server because the server will keep an open read-mode transaction. Normally this transaction is closed when the "finalize" begins storing new conditions.

The suggestion therefore is to force the transaction commit on the server's side by sending the following command:

BdbCondRemoteCmd commit

As an alternative the server can be stopped and restarted.