
Databases

David R. Quarrie
23 Jun 97

 

Outline

  • Quick overview of what Objectivity/C++ can & can't do
  • How do we get from where we are now to where we need to be?
  • How will developers work?
  • How will people run jobs & select data?

What can Objectivity/C++ do (and what can't it do) ?

Objectivity/C++ can:

  • Provide C++ language interface - but with some differences (see later)
  • Support declaration of classes via a Data Definition Language (DDL)
  • Store instance data members of C++ classes
  • Manage concurrent access to database
  • Manage distributed access from heterogeneous machine architectures
  • Quickly locate a persistent object by name (really the only way to start things off)
  • Navigate through a web or hierarchy of persistent objects that can be scattered across the database (in different database files on different machines etc.)
  • Use indexing or hash tables to perform fast object lookup
  • Support data versioning (cf. CVS)
  • Support schema evolution (automatic modification of old data to fit changes to the declarations)
  • Scale to >> 100 TB
  • Integrate with HPSS (High Performance Storage System, a hierarchical mass storage system) ("real soon now")

Objectivity/C++ cannot:

  • Store function members or static data members of C++ objects
  • Operate with totally unchanged C++ code (.hh & .cc files)
  • Traverse conventional C++ pointers - need to use special database pointers
  • Let everyone develop totally independently
  • Let you arbitrarily move database files around
  • Use the container classes we're currently using (HepAList & Tools.h++)

In particular:

  • The class .hh file is no longer the primary class declaration file. It is replaced by a DDL file (.ddl), from which the .hh file is generated by the DDL compiler (ooddlx). The compiler also creates other .hh & .cc files that declare & define Objectivity-specific classes & operations.
  • Any persistent-capable object must inherit from ooObj or d_Persistent_Object.
  • A persistent-capable class cannot contain other persistent-capable classes - it can only contain references to them.
  • References to other classes from within a persistent class cannot take the form T* but the form ooRef(T) [we're probably going to wrap this to BdbRef(T)]
  • A transient object can only access a persistent object via ooRef(T) or ooHandle(T). The latter "pins" the object in virtual memory while the handle is in scope. A pinned object can be passed as a function argument using T*, whereas an unpinned one can't.
  • Machine-architecture-independent basic datatypes must be used. These are d_Long, d_ULong, d_Float, d_Double etc.
  • Persistent objects must indicate where they wish to be created within the database by giving a clustering hint to the new operator. e.g.
ooHandle(A) theA = new( A::clustering( ) ) A;
  • Persistent objects are deleted by the ooDelete(T) macro instead of the delete keyword.
  • Transient versions of persistent-capable classes can be created, but with some limitations on their capability to reference other objects. The new(0) operator is used to create a transient version of a persistent-capable class.
  • In order to efficiently access parts of the event independently of other parts, the persistent event must be split into small navigation nodes and data leaf nodes. Navigation of this hierarchy is a burden on the programmer unless we hide it with a mechanism like ProxyDict.
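
The clustering hint passed to the new operator relies on a standard C++ feature: a class may declare a placement form of operator new that takes extra arguments. The following is a minimal sketch in plain C++, not Objectivity code. ClusterHint, lastContainer( ) and makeClustered( ) are hypothetical names introduced only for illustration, and the "database" is just the ordinary heap.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Hypothetical stand-in for a database location; not an Objectivity type.
struct ClusterHint {
    std::string container;   // where the object would like to be placed
};

// Records the hint last seen by operator new, so the effect is observable.
inline std::string& lastContainer() {
    static std::string name;
    return name;
}

class A {
public:
    // Placement form: "new(hint) A" routes the hint to the allocator.
    static void* operator new(std::size_t size, const ClusterHint& hint) {
        lastContainer() = hint.container;
        return ::operator new(size);   // a real database would allocate elsewhere
    }
    // Matching placement delete, called only if the constructor throws.
    static void operator delete(void* p, const ClusterHint&) noexcept {
        ::operator delete(p);
    }
    // Ordinary delete, used to release the object in this sketch.
    static void operator delete(void* p) noexcept { ::operator delete(p); }

    // Assumed convention: each class suggests its own default location.
    static ClusterHint clustering() { return ClusterHint{"tracks"}; }

    int value = 42;
};

// Creates an A at the place suggested by A::clustering().
inline A* makeClustered() {
    A* a = new (A::clustering()) A;
    assert(a != nullptr);
    return a;
}
```

Because the class declares only the placement form, a plain "new A" no longer compiles, which is one way such a scheme can force every creation to state a location.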

The list of restrictions is quite long and there is an attempt at cataloging them in the draft BABAR DDL Coding Guidelines and Hints. This is based on work by the RD45 collaboration at CERN.
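
The difference between a reference and a handle can be illustrated with a toy model in standard C++. This is a sketch under stated assumptions, not the ooRef(T) or ooHandle(T) implementation: Store, Ref, Handle and Track are hypothetical names, and "pinning" is modelled as a simple counter.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Toy object store standing in for the database cache (hypothetical).
template <class T>
class Store {
public:
    int add(T value) {
        int id = nextId_++;
        objects_.emplace(id, Slot{std::move(value), 0});
        return id;
    }
    T* pin(int id)     { Slot& s = objects_.at(id); ++s.pins; return &s.object; }
    void unpin(int id) { --objects_.at(id).pins; }
    int pinCount(int id) const { return objects_.at(id).pins; }
private:
    struct Slot { T object; int pins; };
    std::map<int, Slot> objects_;
    int nextId_ = 1;
};

// A reference names an object but does not keep it in memory.
template <class T>
struct Ref {
    Store<T>* store;
    int id;
};

// A handle pins the object for as long as the handle is in scope.
template <class T>
class Handle {
public:
    explicit Handle(const Ref<T>& ref)
        : store_(ref.store), id_(ref.id), object_(store_->pin(id_)) {}
    ~Handle() { store_->unpin(id_); }
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;
    T* get() const { return object_; }        // usable as a plain T* while pinned
    T* operator->() const { return object_; }
private:
    Store<T>* store_;
    int id_;
    T* object_;
};

struct Track { double momentum; };

// Takes a plain pointer; the caller must hold a handle for the duration.
inline double twice(const Track* t) {
    assert(t != nullptr);
    return 2.0 * t->momentum;
}
```

In this model a function such as twice( ) can take T* only because the caller's handle keeps the pin count above zero for the length of the call.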


How do we get from where we are now to where we need to be?

Given that the database gains us a lot, but also has some restrictions & limitations, how best do we incorporate it, and can we do so without breaking things?

Subject of much discussion. Basically two strategies, with variations:

  1. Persistent Strategy. In this strategy you take the approach of exposing the database to everyone. If it is decided that a class should be made persistent, then it is modified appropriately and all client objects are also modified. A set of persistent container classes has to be used in addition to the existing transient classes. A variant of this strategy (Pinned Persistence) utilizes the pinning ability of ooHandle(T) to pin all required persistent objects so that they may be accessed by other objects as if they were transient.
  2. Transient Strategy. This strategy attempts to minimize the impact of the database by coupling a transient & persistent class together (siblings), corresponding to the existing "data" classes. The transient class exhibits the same interface as the present transient class (or a superset thereof) so that clients need not be modified (other than being recompiled). The transient objects are created as needed from their persistent counterparts when clients attempt to access them from the transient event (via the Ifd<T>::get( ) functions), or by navigation from other transient objects. The latter is performed by smart pointers that intercept the -> & * operators. Additional transient objects are added to the event in the same way as they are now. Once processing of the event is complete, the newly created transient objects are scanned and their persistent siblings created & added to the persistent event.
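
The smart pointers mentioned in the Transient Strategy can be sketched as follows. This is an illustration in standard C++ only: LazyPtr, Hit and PersistentHit are hypothetical names, and the fetch function stands in for whatever would read the persistent sibling from the database.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Persistent-side record (hypothetical layout).
struct PersistentHit { int channel; float charge; };

// Transient class seen by clients; its interface does not mention the database.
class Hit {
public:
    explicit Hit(const PersistentHit& p) : channel_(p.channel), charge_(p.charge) {}
    int channel() const { return channel_; }
    float charge() const { return charge_; }
private:
    int channel_;
    float charge_;
};

// Smart pointer that builds the transient object on first dereference.
template <class T, class P>
class LazyPtr {
public:
    explicit LazyPtr(std::function<P()> fetch) : fetch_(std::move(fetch)) {
        assert(fetch_);   // a fetch function is required
    }
    T* operator->() { return &load(); }
    T& operator*()  { return load(); }
    bool loaded() const { return static_cast<bool>(object_); }
private:
    T& load() {
        if (!object_) object_.reset(new T(fetch_()));   // fault in from the sibling
        return *object_;
    }
    std::function<P()> fetch_;     // reads the persistent sibling when called
    std::unique_ptr<T> object_;    // transient copy, created at most once
};
```

Clients write hit->channel( ) as they would with a plain pointer; the persistent sibling is read only if and when the pointer is actually followed.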

Both of these strategies have advantages and disadvantages. Some of these are:

Persistent API

This API still uses the ProxyDict classes to present a flat access to data objects within the event, but retrieves the persistent objects directly. Similarly, newly created objects are already persistent (although they must be explicitly attached to the event).

Advantages

  • Single class per type, plus a proxy for anything accessible from the event directly.
  • Simple to maintain.
  • No memory to memory copies of persistent objects to transient ones and v.v.
  • Could eventually make all classes persistent-capable.
  • Proxies are simple and just need to locate the persistent objects in the object hierarchy for reading, and to create the link from the rest of the event to the new data.

Disadvantages

  • Significant modifications to user code (HepRef(T), HepNew, HepDelete, etc.). Not only must the classes to be made persistent be changed to conform to the guidelines, client classes must also be changed to access them. The Pinned Persistence strategy could greatly reduce these changes.
  • Use of ddl files instead of hh files complicates the dependency structure.
  • Use of "new" operator by mistake creates persistent leaks (but easily tracked).
  • Delete is irreversible (apart from transaction abort).
  • Unconventional storage paradigm (newly created objects are already persistent - although not yet attached to the persistent event - and must be explicitly removed).
  • What persistent container classes should be used is very unclear. Objectivity persistent Rogue Wave classes lag behind the transient ones (still v6 instead of v7). More importantly, Objectivity is migrating to STL-based classes and will drop support for the Rogue Wave classes.

Transient API

This API uses the ProxyDict classes to present a fully transient view of the event to the end user. They locate objects within the event using the Ifd<T>::get( ) function and (logically) add new objects to the event using the Ifd<T>::put( ) function. Objects that are to become persistent are "markForStore"'d and the event itself is "store"'d, which creates the appropriate persistent objects.
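
The put, mark and store sequence can be sketched with a toy event in standard C++. This is not the ProxyDict or Ifd<T> interface: the Event class, its string keys and its double-valued payload are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy transient event (hypothetical; objects are reduced to a single double).
class Event {
public:
    // Logically add a new transient object to the event.
    void put(const std::string& key, double value) {
        transient_[key] = Entry{value, false};
    }
    // Locate a transient object; returns nullptr if it is not in the event.
    const double* get(const std::string& key) const {
        auto it = transient_.find(key);
        return it == transient_.end() ? nullptr : &it->second.value;
    }
    // Flag an object that should become persistent at store time.
    void markForStore(const std::string& key) {
        auto it = transient_.find(key);
        assert(it != transient_.end());   // only objects in the event can be marked
        it->second.marked = true;
    }
    // Create persistent copies of everything marked; returns how many.
    std::size_t store() {
        std::size_t stored = 0;
        for (auto& kv : transient_) {
            if (kv.second.marked) {
                persistent_[kv.first] = kv.second.value;
                kv.second.marked = false;
                ++stored;
            }
        }
        return stored;
    }
    bool isPersistent(const std::string& key) const {
        return persistent_.count(key) != 0;
    }
private:
    struct Entry { double value; bool marked; };
    std::map<std::string, Entry> transient_;     // what the user sees
    std::map<std::string, double> persistent_;   // stands in for the database
};
```

Unmarked objects remain purely transient, which is what allows scratch data and "what-if" modifications to live in the event without reaching the database.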

Advantages

  • No (or little - see later) modifications to existing classes.
  • Standard C++ class definitions & API.
  • Conventional storage paradigm (new objects added to the event & then marked for storage).
  • Allows the possibility of data compressed on the database (non-simple mapping between transient & persistent objects).
  • Allows transient objects to be used to cache information for performance advantages etc. without requiring that such caches be persistent.
  • Allows mapping of a hierarchical persistent event organisation into a flat transient one (hidden by proxies).
  • Allows use of existing Rogue Wave and CLHEP classes in transient classes - must decide on mapping to persistent container classes.
  • Allows easy "what-if" modifications to transient information without affecting the underlying persistent information (e.g. interactive event display).
  • Compatible with the proposed solution for access to the environment (AbsEnv).

Disadvantages

  • Twice as many classes (approximately, depending on details of mapping).
  • The same number of proxies is needed as for the persistent API, but they are more complex, having to create the transient or persistent objects from each other as well as perform the navigation within the event.
  • Difficult to manage consistency between transient & persistent classes.
  • Need to write store/fault handlers for every class pair.
  • Need to replace pointer data members (T*) by BdbPtr<T> smart pointer data members or always fault in the complete transitive closure (all events reachable from head).
  • Requires memory to memory copies (transient <-> persistent).

Common issues

Neither solution fully addresses polymorphic lists without some overhead.

  • The transient solution cannot create a persistent list from a transient one without losing the polymorphic information from the list items, unless each transient class implements a persistent( ) member function to create a persistent object (which is what is done in the prototype). This will tightly couple the two classes and the dependencies need to be understood.
  • The persistent solution only works with persistent lists of HepRef's to objects where each object is separately persistent. This incurs some overhead (12 bytes) per object. An alternative would be separate lists for each subclass, although this then requires the use of index offsets into each list. Messy.
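
The persistent( ) member function approach for the transient solution can be sketched in standard C++. The class names (Cand, Charged, Neutral and their Persistent counterparts) and the kind( ) function are hypothetical, and the persistent classes here are ordinary heap objects rather than database objects.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Persistent-side hierarchy (stands in for persistent-capable classes).
struct PersistentCand {
    virtual ~PersistentCand() = default;
    virtual std::string kind() const = 0;
};
struct PersistentCharged : PersistentCand {
    std::string kind() const override { return "charged"; }
};
struct PersistentNeutral : PersistentCand {
    std::string kind() const override { return "neutral"; }
};

// Transient-side hierarchy: each class knows how to create its own sibling.
class Cand {
public:
    virtual ~Cand() = default;
    virtual std::unique_ptr<PersistentCand> persistent() const = 0;
};
class Charged : public Cand {
public:
    std::unique_ptr<PersistentCand> persistent() const override {
        return std::unique_ptr<PersistentCand>(new PersistentCharged());
    }
};
class Neutral : public Cand {
public:
    std::unique_ptr<PersistentCand> persistent() const override {
        return std::unique_ptr<PersistentCand>(new PersistentNeutral());
    }
};

// Converts a polymorphic transient list without losing the dynamic types.
inline std::vector<std::unique_ptr<PersistentCand>>
toPersistent(const std::vector<std::unique_ptr<Cand>>& list) {
    std::vector<std::unique_ptr<PersistentCand>> out;
    for (const auto& item : list) {
        assert(item != nullptr);
        out.push_back(item->persistent());   // virtual call picks the right sibling
    }
    return out;
}
```

The coupling noted above is visible here: every transient header must name its persistent sibling, so the two hierarchies cannot be built independently.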

Migration Procedures

We're trying to understand the pros and cons of the different strategies and are in the process of drafting several documents:

In addition, there is to be a two-day workshop on this issue on 8th-9th July at LBNL.


How will developers work?

There are several issues here:

  • Accessing (and protecting) the production federated database.
  • Preventing lock conflicts when running the DDL compiler
  • Allowing developers to try out their own modifications to production schema and adding their own schema
  • Merging schema into the production environment

This is an area that still needs more thought and a better understanding of the implications. The following is very preliminary and subject to change as we understand more.

Accessing and protecting the production federated database

The FDDB contains both the schema (class definitions) and the database catalog (locations etc.). It's imperative that this be well managed and protected against accidental corruption or deletion.

In general, once a significant amount of data is loaded into an FDDB, it should be "hard" to change the schema. Objectivity supports several strategies for schema migration and conversion:

  • Treat attempted changes to the schema (as identified by the DDL compiler) as errors and cause the compilation to fail. This is the default.
  • Allow changes to the schema to be made. Existing objects then have to be converted to "fit" their new shape. This can be done in one of several ways:

    Deferred: Objects are converted one at a time as they are accessed and will only be converted if accessed in update mode.

    On-demand: Objects are converted automatically at one of several granularities (container, database, FDDB).

    Immediate: Depending on the schema changes (e.g. replacing a base class), a stand-alone upgrade application might need to be run before other applications can access the new schema.
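
The difference between deferred conversion and conversion at a coarser granularity can be illustrated with a toy model in standard C++. This is a sketch, not Objectivity's mechanism. StoredObject, the version numbers and the "append one field" schema change are all hypothetical, and it is an assumption of this sketch that a reader always sees the current shape while the stored form is rewritten only under update access.

```cpp
#include <cassert>
#include <vector>

// Toy stored object that remembers which schema version wrote it.
struct StoredObject {
    int schemaVersion;
    std::vector<double> fields;
};

const int kCurrentSchema = 2;   // hypothetical: version 2 appends one field

// Brings one object up to the current shape.
inline void convert(StoredObject& obj) {
    assert(obj.schemaVersion <= kCurrentSchema);
    if (obj.schemaVersion < kCurrentSchema) {
        obj.fields.push_back(0.0);           // default value for the new field
        obj.schemaVersion = kCurrentSchema;
    }
}

enum class Mode { Read, Update };

// Deferred: one object at a time, on access. The caller sees the new shape,
// but the stored form is rewritten only in update mode (assumed semantics).
inline StoredObject access(StoredObject& stored, Mode mode) {
    StoredObject view = stored;
    convert(view);
    if (mode == Mode::Update) stored = view;
    return view;
}

// Coarser granularity: convert everything in a container in one pass.
inline void convertAll(std::vector<StoredObject>& container) {
    for (auto& obj : container) convert(obj);
}
```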

A further consideration is that two developers attempting to run the DDL compiler against the same federated database may come into conflict, such that one of their compilations fails because it can't lock the FDDB for exclusive use during the course of the compilation.

Another factor is the concept of multiple partitions where one site is designated as being the primary partition and contains the primary FDDB and data, and other sites (secondary partitions) may have local copies of only some of the databases within the FDDB, and may make extensions to the primary schema without affecting the primary FDDB. The secondary partitions act as write-back caches such that if requested data is available locally it is directly accessed, otherwise it is fetched from the primary partition. Similarly, if data is updated then it will automatically be propagated back to the primary partition (and to other secondary partitions) unless it is purely local to the secondary partition.
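
The caching behaviour of a secondary partition can be sketched in standard C++. Primary and Secondary are hypothetical classes, data items are reduced to named integers, and propagation to other secondary partitions is deliberately left out of this sketch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy primary partition: holds the authoritative copy of every shared item.
struct Primary {
    std::map<std::string, int> data;
};

// Toy secondary partition acting as a cache in front of the primary.
class Secondary {
public:
    explicit Secondary(Primary& primary) : primary_(primary) {}

    // Use the local copy if there is one; otherwise fetch from the primary.
    int read(const std::string& key) {
        auto it = local_.find(key);
        if (it != local_.end()) return it->second;
        int value = primary_.data.at(key);   // throws if the primary lacks it
        local_[key] = value;
        ++fetches_;
        return value;
    }

    // Update locally and propagate back, unless the data is purely local.
    void write(const std::string& key, int value, bool localOnly = false) {
        assert(!key.empty());
        local_[key] = value;
        if (!localOnly) primary_.data[key] = value;
    }

    int fetches() const { return fetches_; }   // number of trips to the primary
private:
    Primary& primary_;
    std::map<std::string, int> local_;
    int fetches_ = 0;
};
```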

The present (very preliminary) strategy for managing the FDDB and allowing developer access is the following:

  • Maintain (at least) two copies of the primary FDDB. One (the Reference FDDB) contains just the reference schema; the other (the Production FDDB) contains both schema and the production data.
  • New releases of the BABAR software are compiled against the Reference FDDB, normally disallowing any schema migration.
  • After some sort of reconstruction/analysis review, modified schema are accepted and accommodated in a new release by appropriate DDL compile switches, and an object conversion strategy is put in place. It's possible that the mechanics of this might involve building against both the Reference & Production FDDBs.
  • When developers create new test releases (via newrel -t) they get their own copy of the Reference FDDB (Developer FDDB). Subsequent development is performed against this developer-specific FDDB and this bypasses any lock conflicts with other developers. Database files may be imported from the Production FDDB (a management activity that Objectivity/C++ supports). Any new data will be entirely local to this Developer FDDB. Any schema migration is entirely local and does not impact the Production FDDB. One disadvantage of this scheme is that it is difficult to propagate data back to the Production FDDB.
  • An alternative developer scheme puts them each in a separate secondary partition. This would allow them to propagate back compatible data to the primary FDDB. This needs to be thought through in more detail. This use of secondary partitions is not entirely compatible with our use of it as a mechanism to distribute data at the Institution level.

The present design supports within the Production FDDB both work-group and user-based databases and event collections in addition to the global system ones. This concept is not yet integrated with the developer view outlined above.


How will people run jobs & select data?

The basic access model is that there will be named collections of events which are accessed through one of several dictionaries. There is one System-wide dictionary, several Group-wide dictionaries and many User-specific dictionaries. Thus there can be named collections at the System level representing particular physics data samples (and obviously the complete event sample), similar collections at the Group level (e.g. for each physics working group) and at the User level. In general these collections need not contain any data; they just contain references to the event headers of the selected events. However, it may be advantageous to duplicate part of the information for the events in order to improve repeated access to that data.
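
The access model can be sketched with standard C++ containers. All names here are hypothetical, an integer stands in for a reference to an event header, and the search order (User, then Group, then System) is an assumption of the sketch rather than part of the design described above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using EventHeaderId = int;                        // stands in for a reference to an event header
using Collection = std::vector<EventHeaderId>;    // references only, no event data
using Dictionary = std::map<std::string, Collection>;

// One System-wide dictionary, plus one per group and one per user.
struct Dictionaries {
    Dictionary system;
    std::map<std::string, Dictionary> group;
    std::map<std::string, Dictionary> user;
};

// Looks up a named collection in a single dictionary.
inline const Collection* lookup(const Dictionary& dict, const std::string& name) {
    auto it = dict.find(name);
    return it == dict.end() ? nullptr : &it->second;
}

// Resolves a collection name, trying User, then Group, then System.
inline const Collection* findCollection(const Dictionaries& d,
                                        const std::string& user,
                                        const std::string& group,
                                        const std::string& name) {
    assert(!name.empty());
    auto u = d.user.find(user);
    if (u != d.user.end()) {
        if (const Collection* c = lookup(u->second, name)) return c;
    }
    auto g = d.group.find(group);
    if (g != d.group.end()) {
        if (const Collection* c = lookup(g->second, name)) return c;
    }
    return lookup(d.system, name);
}
```

Because a collection stores only references, the same event can appear in many collections at all three levels without its data being copied.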

Components of the strategy

  • Database Input Module (BdbEventInput). Allows selection of input event collection.
  • Database Output Module (BdbEventOutput). Allows additional information to be added to the events.
  • Tools to allow selected parts of the selected events to be duplicated in order to improve repeated access. [these don't yet exist]
  • etc.

Several documents describe the present thinking on these issues:

The Design of the BABAR Event Store

The BABAR Event Store User's Guide

The BABAR Event Store Reference Manual

 


e-mail DRQuarrie@LBL.Gov