ROOT's Scribes and Modules
A. Salnikov
September 26, 1999
In this brief report I describe the ideas and some
implementation details of two packages, RooScribes and
RooModules, as I understand them. They will probably evolve
as new requirements or details become clearer, so
this description may soon be outdated. The best source of
information in such cases, as anywhere else in BaBar, is the
source code itself.
For those encountering these packages for the first time,
I should add that they implement a common set of
tools to be used for I/O into the ROOT event store. They
deal with event data, as opposed to conditions data,
although some part of the conditions data may migrate to the event
data in the ROOT-based data store. For conditions data you can
look at the excellent material provided by David Kirkby [1].
As the time scale for the whole project is seriously limited,
it looked quite natural, and was generally accepted, that we
should emphasize reuse of the ideas and implementations of the
existing event store and conditions
database. My own list of features we have to implement in the
ROOT-based event store looks like this:
- Schema evolution. Support for this must be
included at the design level. ROOT's own support for
schema evolution is not quite satisfactory for a project as
big as ours, so we must do something about it
ourselves. Fortunately the problem has already been solved in
a quite general way by the Objectivity team, and we can reuse
that approach, which has already proved itself.
- Persistent references. Although many basic
classes to be used in the analysis do not require
persistent references, it is useful to think about possible
future extensions, and this leads directly to the
need to include support for referencing in the design.
Even our short-term plans will probably need this
feature if we are going to implement the MC truth information
store in the same terms as was done for Objectivity.
- Modularity and extensibility. As in any other
area in BaBar, the tools we produce should
integrate easily into the existing BaBar SRT. One of the
key words here, I believe, is interdependencies, which
should be reduced to a minimum. The I/O part of the
Framework responsible for data exchange with the ROOT
event store should not depend on any particular data type
in the event store, but should at the same time allow the
addition of new persistent data types as they appear.
- Incremental event processing and event selection.
The idea of loading parts of the event data incrementally is
not new, and is expected to allow much faster event
processing. It is realized in the Objectivity
event store with so-called filtering modules and a number
of update modules. The approach can be extended to
include a preselection directly in the input module, thus
removing the need to go through the Framework event()
cycle, with the possibility of gaining even more speed.
It seems that many of the approaches used in the Objectivity
event store can also be used for ROOT. As for the code itself,
the possibility of reuse is seriously limited by the dependency
of practically all the code on the lower-level technology, i.e.
Objectivity data types.
Another package worth mentioning here, as it contains some
classes which are central to the persistence implementation, is
RooUtils. It was created to hold tools and utilities which do
not depend on anything except ROOT itself. It currently holds a
few class definitions which are of particular interest for the
discussion here:
| RooRef |
Implementation of a persistent reference. The reference
is basically the object ID (OID) of the persistent object. The
method "UInt_t id() const;"
returns the OID of the object. |
| RooPersObj |
This is the base class for all persistent
classes to be created in the ROOT event store in BaBar.
It inherits from ROOT's TObject. The functionality it
provides now is the creation of unique object IDs for all
persistent objects. Uniqueness is guaranteed only
between objects in the same session, i.e. there is no
guarantee that ROOT objects will have different IDs in
two different runs. The IDs are used in the
implementation of persistent references. The method
"RooRef refToMe() const;"
returns a persistent reference to the object. The protected
method registerThis(...) is used to
register persistent-transient relations (see below). |
| RooEvtObj<T> |
This is the next-layer base class for
persistent objects, inheriting from RooPersObj. It
defines an interface for all persistent classes which can
be created from, or converted to, the transient class of
type T. The interface includes the following methods:
- T* transient(
RooEvtObjLocReg& reg ) const = 0
- bool fillRefs( const T* trans,
const RooEvtObjLocReg& aRegistry )
- bool fillPointers( T* trans,
const RooEvtObjLocReg& reg ) const
In addition to these methods, each persistent class
should implement a constructor from the transient object of
the form (assuming you have declared "class XxxDataR
: public RooEvtObj<XxxData> {...};"):
- XxxDataR( const XxxData* trans,
RooEvtObjLocReg& reg )
|
| RooEvtObjLocReg |
This is a registry of relations between
transient and persistent objects, used in the
implementation of persistent references. It provides a
bi-directional mapping between persistent references
(RooRef) and pointers to the transient objects. The map
should be filled in the constructors of the persistent
objects and in their transient(...)
methods, and is then used to reconstruct
transient-to-transient relations in the fillPointers(...)
method or persistent-to-persistent relations in the fillRefs(...)
method. |
Here is an example (not working code) of a class which uses
this approach to save references between objects in the
persistent store:
//
// Version 001 of the persistent class XxxDataR
//
class XxxDataR_001 : public RooEvtObj<XxxData> {
public:
  // ====== Constructors ======
  // default ctor must be provided if you are saving a collection of objects
  XxxDataR_001() : RooEvtObj<XxxData>() {}
  // ==> ctor from the transient object and a registry must be there
  XxxDataR_001( const XxxData* trans, RooEvtObjLocReg& reg )
    : RooEvtObj<XxxData>()
  {
    // if you care about storing references, do this:
    registerThis ( trans, reg ) ;
    // all other stuff relevant to constructing the persistent object
  }
  // ==> must provide transient object "ctor"
  XxxData* transient( RooEvtObjLocReg& reg ) const
  {
    // simply create the transient object from all the info you have
    XxxData* trans = .... ;
    // if you care about storing references, do this:
    registerThis ( trans, reg ) ;
    // hand the new transient object back to the caller
    return trans ;
  }
  // fillRefs() method must be there too if you really care about
  // storing references, otherwise the default implementation is OK
  bool fillRefs ( const XxxData* trans, const RooEvtObjLocReg& reg )
  {
    // this is the basic idea of how to make a persistent ref. to another object
    AbsEvtObj* transPointer = trans->getSomePointer() ; // just an example
    if ( transPointer ) {
      // this will work only if registerThis() was called for the other object
      RooRef persRef = reg.find( transPointer ) ;
      if ( ! persRef.id() ) return false;
      refToOther = persRef ; // store the reference
    }
    return true ;
  }
  // fillPointers() method must be there too if you really care about
  // storing references, otherwise the default implementation is OK
  bool fillPointers ( XxxData* trans, const RooEvtObjLocReg& reg ) const
  {
    // this will work only if registerThis() was called for the other object
    if ( refToOther.id() ) {
      AbsEvtObj* transPointer = reg.find( refToOther ) ;
      if ( ! transPointer ) return false ;
      trans->setSomePointer( transPointer ) ; // and set transient reference
    }
    return true ;
  }
  // ...... whatever you want to be here .......
private:
  // ...... some stuff I don't care about
  RooRef refToOther; // persistent reference
  // ROOT specific declarations
  ClassDef(XxxDataR_001,1)
};
For simple classes which do not need to care about
references, there is no need to override the default
implementations of fillRefs() and fillPointers(),
or to call registerThis() in the constructor and the transient()
method, although the signature of the constructor should be the
same and transient() must still be supplied.
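The bi-directional registry described above can be pictured as a pair of maps. The following is only a minimal sketch of that idea and not the RooUtils code: LocalRegistry, ObjectId, Transient, findTransient() and findId() are hypothetical names, and the use of 0 as the "no reference" value is an assumption suggested by the id() checks in the example class.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins; none of these are the real RooUtils classes.
typedef unsigned int ObjectId;        // plays the role of RooRef::id()
struct Transient { int value; };      // plays the role of a transient object

class LocalRegistry {
public:
    // Record one persistent-transient pair (both conversion directions do this).
    void add(ObjectId id, Transient* trans) {
        assert(id != 0);              // assumption: 0 means "no reference"
        idToTransient_[id] = trans;
        transientToId_[trans] = id;
    }
    // Transient pointer for a persistent ID, or nullptr if never registered.
    Transient* findTransient(ObjectId id) const {
        auto it = idToTransient_.find(id);
        return it == idToTransient_.end() ? nullptr : it->second;
    }
    // Persistent ID for a transient pointer, or 0 if never registered.
    ObjectId findId(const Transient* trans) const {
        auto it = transientToId_.find(trans);
        return it == transientToId_.end() ? 0 : it->second;
    }
private:
    std::map<ObjectId, Transient*> idToTransient_;
    std::map<const Transient*, ObjectId> transientToId_;
};
```

A lookup that fails (an object that was never registered) is what makes fillRefs() or fillPointers() return false in the example class.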
RooScribes contains a set of classes performing
data transfer between the transient event store (AbsEvent) and
the persistent store (ROOT trees). Data exchange proceeds through
"streams", a stream being just a ROOT tree. Two or
more streams can share the same file, in which case there will
be more than one tree in the file. Each stream is identified by
its name, which is the same as the tree name. The actual job of
transient-persistent exchange is performed by the
"scribes". A scribe is an object that knows which objects it
should convert and how to do it. Each scribe corresponds to a
single object branch (TBranchObject) in the ROOT tree.
The reduced class diagram for this package can be
found in [3]. The "chief" class on the diagram is
RooConversionManager, and the "central" one is
RooGenericScribe. RooConversionManager controls the conversion
(in either direction) of the data, and it does this through the
list of scribes registered for the job.
Each output stream calls the conversion manager's
convertToPersistent() method. For each such call the manager
executes the following sequence for all scribes "valid" for
this stream:
- scribe->attemptPersistent(),
which creates the persistent representation of the transient
objects in memory,
- scribe->fillRefs(),
which validates persistent references between persistent
objects in the stream,
- scribe->store(),
which moves the persistent data to the external store.
This sequence is different from the sequence used
for Objy due to some conceptual differences: Objy's persistent
objects are created "already in the store" (actually
they are moved there at the end of the transaction, but this
does not happen at the scribe level). As a consequence, the
store() method is not present in the Objy scribes. It is
possible to live without it in ROOT too, because the main part
of its functionality is executed by the stream itself (the
TTree::Fill method), but in terms of OO abstractions this
method appeared quite naturally, so I prefer to keep it.
Each input stream in turn calls the manager's convertToTransient()
method. For each call the manager executes the following
sequence for all scribes "valid" for this stream:
- scribe->attemptTransient(),
which fetches data from the persistent store and converts
them to transient objects,
- scribe->fillPointers(),
which validates transient references (pointers) between
transient objects in the stream.
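The two sequences can be sketched as follows. This is an illustration under assumptions, not the RooConversionManager code: the Scribe interface, its signatures, the name attemptPersistent() for the first output step, and LoggingScribe are all hypothetical. The sketch also assumes that each step is run for all valid scribes before the next step starts, since references can only be resolved once every object has been registered.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scribe interface; signatures are assumptions of this sketch.
class Scribe {
public:
    virtual ~Scribe() {}
    virtual const std::string& stream() const = 0;  // stream the scribe is valid for
    virtual bool attemptPersistent() = 0;  // transient -> persistent, in memory
    virtual bool fillRefs() = 0;           // resolve persistent references
    virtual bool store() = 0;              // move persistent data to the store
    virtual bool attemptTransient() = 0;   // persistent -> transient
    virtual bool fillPointers() = 0;       // resolve transient pointers
};

typedef bool (Scribe::*Step)();

// Run one step on every scribe valid for the stream; report overall success.
inline bool runStep(const std::vector<Scribe*>& scribes,
                    const std::string& stream, Step step) {
    bool ok = true;
    for (Scribe* s : scribes) {
        assert(s != nullptr);
        if (s->stream() == stream) ok = (s->*step)() && ok;
    }
    return ok;
}

// Output direction: three passes (error handling is deliberately simplistic).
inline bool convertToPersistent(const std::vector<Scribe*>& scribes,
                                const std::string& stream) {
    bool ok = runStep(scribes, stream, &Scribe::attemptPersistent);
    ok = runStep(scribes, stream, &Scribe::fillRefs) && ok;
    ok = runStep(scribes, stream, &Scribe::store) && ok;
    return ok;
}

// Input direction: two passes.
inline bool convertToTransient(const std::vector<Scribe*>& scribes,
                               const std::string& stream) {
    bool ok = runStep(scribes, stream, &Scribe::attemptTransient);
    ok = runStep(scribes, stream, &Scribe::fillPointers) && ok;
    return ok;
}

// Toy scribe that only records the order of the calls it receives.
class LoggingScribe : public Scribe {
public:
    LoggingScribe(std::string name, std::string stream,
                  std::vector<std::string>& log)
        : name_(std::move(name)), stream_(std::move(stream)), log_(log) {}
    const std::string& stream() const override { return stream_; }
    bool attemptPersistent() override { return note("attemptPersistent"); }
    bool fillRefs() override { return note("fillRefs"); }
    bool store() override { return note("store"); }
    bool attemptTransient() override { return note("attemptTransient"); }
    bool fillPointers() override { return note("fillPointers"); }
private:
    bool note(const char* step) { log_.push_back(name_ + "." + step); return true; }
    std::string name_;
    std::string stream_;
    std::vector<std::string>& log_;
};
```

With two LoggingScribe objects on the same stream, the log shows both attemptPersistent() calls before the first fillRefs() call, which is the ordering the reference mechanism relies on in this sketch.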
The conversion manager works with objects of
class RooGenericScribe, which is an abstract class providing only
the interface for the operations described above. It is the job
of the client code to create real scribe objects and pass them
to the manager. This is usually done in so-called "loader
modules" (see next sections). To simplify the job for the
loader modules, a set of concrete scribe classes was implemented,
which should cover most of the required functionality. The
following concrete scribe classes exist:
RooDefScribe<T,P>
- provides conversion of a single transient object of
class T to a single persistent object of class P and
back.
RooAListScribe<T,P>
- provides conversion of a transient collection
(HepAList<T>) of transient objects of type T to a
persistent collection (TObjArray) of persistent objects
of class P and back.
RooAListClonesScribe<T,P,I>
- the same as above, but uses TClonesArray as the persistent
collection. This class has a third template parameter, which
is the interface type for the persistent object; usually
this is RooEvtObj<T>.
RooCompositeScribe<T,P,I>
- provides conversion of a transient collection
(HepAList<T>) of transient objects of type T to a
composite persistent object of type P, having an
interface I. The interface will usually be
RooEvtObj<HepAList<T> >.
RooAListRCVScribe<T,P>
- provides conversion of a transient collection
(HepAList<T>) of transient objects of type T to a
persistent collection (RooClonesVector<P> from the
RooUtils package) of persistent objects of class P and
back.
RooDefSyncScribe<T,P>
- this is a special version of the RooDefScribe class which
provides sync'ing scribes. See below for details.
The interfaces to the persistent classes, which
appeared in the previous section, are the central part of the
implementation of schema evolution. The idea is that we can read
the data back from the event store without knowing all details of
the object layout, using one of its base classes. If all versions
of the persistent class (such as XxxDataR_001, XxxDataR_002, etc.)
have the same base class, then schema evolution can
be realized very easily. This idea (stolen from the Objy event
store, as usual) is implemented for the ROOT event store as well.
The interface class for a persistent class is basically the
minimal set of methods which allows a transient representation
of the data to be created when the data are read from the event
store through this interface, i.e. it defines a "convertible to
transient" type. Two methods are sufficient for this, and both
are defined in RooEvtObj<T>:
"T* transient()" and "bool fillPointers()".
RooEvtObj<T> also defines the fillRefs() method but, strictly
speaking, it is not necessary for interface types. This method,
as well as the constructor from the transient object, must be
implemented by any persistent class used with scribes, so it was
included for convenience.
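The idea of reading any version through a common base class can be illustrated with a small stand-alone sketch. Everything here is hypothetical (XxxDataIface, the energy field, the MeV to GeV change between versions, readBack()); it only shows the pattern and is not tied to ROOT or to the real RooEvtObj<T>.

```cpp
#include <cassert>
#include <memory>

// Hypothetical transient class.
struct XxxData {
    explicit XxxData(double e) : energyGeV(e) {}
    double energyGeV;
};

// Interface shared by all persistent versions: "convertible to XxxData".
class XxxDataIface {
public:
    virtual ~XxxDataIface() {}
    virtual XxxData* transient() const = 0;
};

// Version 1 (assumed layout): energy stored in MeV as a float.
class XxxDataR_001 : public XxxDataIface {
public:
    explicit XxxDataR_001(float energyMeV) : energyMeV_(energyMeV) {}
    XxxData* transient() const override { return new XxxData(energyMeV_ / 1000.0); }
private:
    float energyMeV_;
};

// Version 2 (assumed layout): energy stored in GeV as a double.
class XxxDataR_002 : public XxxDataIface {
public:
    explicit XxxDataR_002(double energyGeV) : energyGeV_(energyGeV) {}
    XxxData* transient() const override { return new XxxData(energyGeV_); }
private:
    double energyGeV_;
};

// Reading code sees only the interface, never the concrete version.
inline std::unique_ptr<XxxData> readBack(const XxxDataIface& stored) {
    XxxData* trans = stored.transient();
    assert(trans != nullptr);
    return std::unique_ptr<XxxData>(trans);
}
```

The reading side stays unchanged when a new version is added; only a new class deriving from the interface has to be compiled into the application.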
One important note. ROOT plays all possible
low-level tricks with pointers, thus making itself completely
non-OO. One particular thing important to us is that it passes
pointers to objects as "void*". For our schema
evolution to work this implies that the pointers to the
persistent object and to the interface should be physically the
same. If they are different the result will be memory
corruption and all sorts of problems related to it. So, never
ever ever ever use multiple inheritance in the persistent
classes. Or, if you really mean to use it, make sure that the
interface is physically the first thing in the class hierarchy.
(I'm not sure what the standard says about the memory layout of
the subobjects, but it seems that on most platforms you can do
it by placing the interface first in the inheritance list:
"class XxxDataR_001 : public
RooEvtObj<XxxData>, public Whatever {...}".)
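The address problem can be checked directly. The sketch below uses hypothetical classes (Interface, Whatever, InterfaceFirst, InterfaceSecond) and compares the address of the whole object with the address of its Interface subobject. The result depends on the compiler's object layout, which the C++ standard does not fix for such classes; the behavior described in the comments is what common ABIs do.

```cpp
#include <cassert>

struct Interface { virtual ~Interface() {} int interfaceData = 1; };
struct Whatever  { virtual ~Whatever() {}  int otherData = 2; };

// Interface listed first: on common ABIs it sits at the start of the object.
struct InterfaceFirst  : public Interface, public Whatever {};
// Interface listed second: on common ABIs it sits at a non-zero offset.
struct InterfaceSecond : public Whatever, public Interface {};

// True when the Interface subobject starts at the object's own address, so a
// void* to the object could be taken for an Interface* without adjustment.
template <class Derived>
bool interfaceSharesAddress(Derived& obj) {
    void* asObject = static_cast<void*>(&obj);
    void* asInterface = static_cast<void*>(static_cast<Interface*>(&obj));
    return asObject == asInterface;
}

// Guard that could be run once per persistent class, e.g. in a test program.
template <class Derived>
void requireInterfaceFirst() {
    Derived probe;
    assert(interfaceSharesAddress(probe) && "interface must be the first base");
}
```

A check of this kind in a test program catches a wrongly ordered inheritance list before it shows up as corrupted data.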
As the persistent data will be spread across
several ROOT trees and files, which can be produced and
distributed separately, we should have a means to guarantee the
consistency of the information in different locations. The idea
was to put a separate branch in every tree holding some unique
data for each event. During the read stage these data will be
read from all trees and checked for consistency. The data can be
anything that provides uniqueness, but to avoid unnecessary data
in the file, something which is useful for other purposes can be
used too, for example the event ID. The behavior of the I/O
system is somewhat different w.r.t. these sync'ing data, so a
special type of scribe was introduced for this job:
RooDefSyncScribe. On output a scribe of this type will write
the same persistent data to all output streams in a separate
branch. On input the scribe is executed for each input stream and
checks that the objects from different streams are equal. If the
sync'ing object in some stream is different from the object in
the first stream read, all the data from that stream will be
discarded and a warning message will be printed.
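The input-side check amounts to comparing one value per stream against the first stream read. A minimal sketch, assuming the sync'ing datum is a plain integer event ID; StreamRecord and outOfSyncStreams() are hypothetical names, not part of RooScribes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: the sync value read from one stream for the current event.
struct StreamRecord {
    std::string stream;      // stream (tree) name
    unsigned long eventId;   // sync'ing datum, here simply an event ID
};

// Names of the streams whose sync value disagrees with the first stream read;
// the caller would discard their data and print a warning.
inline std::vector<std::string>
outOfSyncStreams(const std::vector<StreamRecord>& records) {
    std::vector<std::string> bad;
    for (std::size_t i = 1; i < records.size(); ++i) {
        assert(!records[i].stream.empty());  // streams are identified by name
        if (records[i].eventId != records[0].eventId) {
            bad.push_back(records[i].stream);
        }
    }
    return bad;
}
```

Taking the first stream as the reference mirrors the description above; it cannot tell which of two disagreeing streams is actually the wrong one.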
The RooModules package contains a number of Framework modules
and their helper classes to organize input and output in
Framework jobs. The organization of input and output closely
follows that of the Objy event store; similar classes and modules
can be found in the BdbModules package. There are four basic
modules in the package, described in detail below, and also a
base class for "loader modules".
One of the concerns worth mentioning here is that all modules
are independent of the specific format of the transient or
persistent data. I think the best approach is to keep it this
way, as this will keep package dependencies more manageable. But
this may conflict with the idea of making a fast selection of
events based on the content of the data; I'll return to this
point below in the discussion.
RooEventOutput is the standard output module for the production
of ROOT persistent data. It inherits from AppStreamsOutputModule
and adds practically no new functionality. All the real work is
done in the output stream objects (RooOutputStream
class), which are created with the module's "output"
command (RooOutputCommand class). The stream's
responsibility is to open the output file, create a TTree with
the stream's name and call RooConversionManager for each event.
The conversion manager object is extracted from the transient
event. The creation of an output stream from a Tcl script looks
like this:
# communicate with the output module
module talkTo RooEventOutput
# create new output stream and give a destination
output stream "Tag" $env(EVENT_STORE)/runXXXX-Tag.root
# associate framework path with this stream. stream will be executed
# only when given path is "passed". Path must be already defined.
output paths "Tag" "Everything"
exit
Two commands are necessary to make output: "output
stream" and "output paths". The first one creates an
output stream with the given name and gives it the name of the
output file; file naming is discussed below. The second command
associates a framework path (which must be created before this
command is issued) with the output stream. The stream will be
executed only when the path's state is "passed", thus
allowing filtering applications to write only selected events.
There can be more than one stream with the same destination
file. The following script shows an example of this:
module talkTo RooEventOutput
# create two output streams with the same destination
output stream "Tag" $env(EVENT_STORE)/runXXXX-Tag+AOD.root
output stream "AOD" $env(EVENT_STORE)/runXXXX-Tag+AOD.root
# associate framework path with these streams.
output paths "Tag" "Everything"
output paths "AOD" "Everything"
exit
In this case the name of the destination file you give must be
literally the same for both streams, letter for letter,
otherwise the result will be a complete disaster.
A few remarks about data production in OPR. Our Framework does
not allow two output modules in one job, so, if we are
going to produce ROOT data directly in OPR, we'll need an output
module which is not an APPOutputModule and which can be executed
as the last module in a standard path. This can be achieved by
putting all the functionality of AppStreamsOutputModule and
RooEventOutput in a separate class inheriting from APPModule, for
example. One more issue is that OPR runs on many machines. To get
one file per run (or, probably, one file per few runs) we'll need
a special merging application which gathers all the output from
tens or hundreds of files into one at the end of the run.
And of course there is an input module which can read the
data produced by the output module. The approach for input is the
same as Objy's: there is a RooEventInput module and a
RooEventUpdate module, with RooCreateCM and a number of loader
modules between them in the path. Briefly, the responsibilities
of these modules are:
- RooEventInput - to locate the next event and prepare the input
streams for reading,
- RooCreateCM - to create an instance of
RooConversionManager and put it in the transient event,
- loader modules - to create scribes for the objects they want
to fetch (or to save) and pass these scribes to the
conversion manager,
- RooEventUpdate - to execute all input streams using the
data prepared by all previous stages.
More details about all of these are given below.
The responsibility of the RooEventInput module is to locate
the next event for processing and to pass the information about
this event to the input streams. Despite the name there is no
real input occurring in this module; it happens further down the
framework path in the RooEventUpdate module. The streams owned by
RooEventInput are created by the "input stream" command
(RooInputCommand class). To pass the streams to the
RooEventUpdate module, RooEventInput puts them into the transient
event.
To decide which event to process next, the input
module uses the "collection" and "selector"
abstractions. A collection (RooInputCollection) is just a set of
input files to open. A selector is an object responsible for
finding the address (file name and event index) of the next
available event.
As we can write the data to many output
destinations, we have to be able to read them all at once too.
To do this the input module must be able to read from different
input files. Hence the full path name of an input file is
determined both by its name in the collection and by the stream.
At present the path name is constructed by concatenating the
collection's name and a suffix defined for the stream, with a
dash between them and a ".root" extension. Here is an example
of how all this works from Tcl:
module talkTo RooEventInput
# create two input streams to read from different files
input stream "Tag" tag
input stream "AOD" aod
# define the collections (file name prefixes) to read from
collection add $env(EVENT_STORE)/run10001
collection add $env(EVENT_STORE)/run10002
collection add $env(EVENT_STORE)/run10003
collection add $env(EVENT_STORE)/run10004
collection add $env(EVENT_STORE)/run10005
select all
exit
In this case the stream "Tag" will read
data from the following files:
$EVENT_STORE/run10001-tag.root,
$EVENT_STORE/run10002-tag.root,
$EVENT_STORE/run10003-tag.root,
$EVENT_STORE/run10004-tag.root,
$EVENT_STORE/run10005-tag.root,
and the stream "AOD" will read the
files
$EVENT_STORE/run10001-aod.root,
$EVENT_STORE/run10002-aod.root,
$EVENT_STORE/run10003-aod.root,
$EVENT_STORE/run10004-aod.root,
$EVENT_STORE/run10005-aod.root.
The logic of constructing the full path name from the
collection name and stream name is contained in a separate class,
RooDirectorySvc. In principle, this class can be modified to use
more sophisticated schemes, e.g. a run database, log-books, etc.
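The present naming rule is simple enough to state as a one-line function. This is only a sketch of the rule as described in the text; inputFileName() is a hypothetical helper and says nothing about the actual RooDirectorySvc interface.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the naming rule described in the text:
// <collection name>-<stream suffix>.root
inline std::string inputFileName(const std::string& collection,
                                 const std::string& suffix) {
    assert(!collection.empty() && !suffix.empty());
    return collection + "-" + suffix + ".root";
}
```

Keeping the rule behind one function is what would allow it to be replaced later by a lookup in a run database without touching the streams.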
Just like the output streams, the input streams can share the
same file; in this case the parameter giving the suffix of
the file path should be the same in the "input stream"
commands.
The command "select all" in the above example creates an
"all events" selector. This command can be omitted, in which
case the input module will create the selector itself. For now
this is the only existing selector for input events
(RooInputSelectAll class). Other selectors can easily be added
by implementing the RooInputSelector interface, e.g. a selector
based on the tag data.
(The problem of dependencies comes into play here. A
tag-based selector will inevitably depend on the tag format, the
thing which I want to keep RooModules away from. I can imagine a
solution where a separate package implements a framework module
which creates such a selector in beginJob() and gives it to the
input module via the setSelector() method. The problem here is
that this module should know about the input module object, but
this can be resolved by a careful implementation of the
AppUserBuild stuff.)
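To make the selector abstraction concrete, here is a stand-alone sketch of a selector interface with an "all events" implementation. The names InputSelector, EventAddress, FileInfo, SelectAll and the next() signature are hypothetical; the real RooInputSelector interface may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Address of an event: file name plus event index, as described in the text.
struct EventAddress { std::string file; long index; };

// Hypothetical description of one file in the collection.
struct FileInfo { std::string name; long nEvents; };

// Hypothetical selector interface.
class InputSelector {
public:
    virtual ~InputSelector() {}
    // Fills addr with the next selected event; returns false when exhausted.
    virtual bool next(EventAddress& addr) = 0;
};

// "All events" selector: walks every event of every file in order.
class SelectAll : public InputSelector {
public:
    explicit SelectAll(std::vector<FileInfo> files)
        : files_(std::move(files)), file_(0), event_(0) {
        for (const FileInfo& f : files_) assert(f.nEvents >= 0);
    }
    bool next(EventAddress& addr) override {
        // Skip files that are empty or already fully read.
        while (file_ < files_.size() && event_ >= files_[file_].nEvents) {
            ++file_;
            event_ = 0;
        }
        if (file_ == files_.size()) return false;
        addr.file = files_[file_].name;
        addr.index = event_++;
        return true;
    }
private:
    std::vector<FileInfo> files_;
    std::size_t file_;
    long event_;
};
```

A tag-based selector would implement the same interface in its own package, which keeps the dependency on the tag format out of the input module.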
The whole purpose of the RooCreateCM module is to create a
conversion manager object and to make this object accessible to
all downstream modules by placing it in the transient event. The
conversion manager is created for every event and its ownership
is transferred to AbsEvent, so it gets deleted together with
the transient event.
The loader modules represent the client part of the
ROOT persistence task. Their responsibility is to create scribes
for the particular objects to be saved to or fetched from the
event store, and to register the scribes with the conversion
manager. The base class RooAbsLoader implements the part of this
functionality which registers scribes. Instantiation of the
needed scribes should be implemented in the subclass. With
RooAbsLoader, scribes are created once per job in the beginJob()
method. A typical beginJob() method of such a module could look
like this:
AppResult
XxxRooLoad::beginJob( AbsEvent* anEvent )
{
  // Add all the Scribes to this load that we want
  if ( _readXxx.value() || _writeXxx.value() ) {
    // Create single persistent object scribe:
    RooGenericScribe* scribe =
      new RooDefScribe< XxxData , XxxDataR_001 >
        ( &_key.value(),          // IfdKey for transient object in AbsEvent
          _stream.value(),        // stream name - string
          _branch.value(),        // branch name - string
          _bufferSize.value(),    // buffer size - integer value
          _splitMode.value() );   // split mode - bool
    if ( _readXxx.value() ) {
      addScribeForInput( scribe ) ;
    }
    if ( _writeXxx.value() ) {
      addScribeForOutput( scribe ) ;
    }
  }
  return AppResult::OK ;
}
In general, the design of the ROOT loader modules is pretty much
the same as for Objy, except that ROOT loader
modules work only with scribes, and not with converters,
so creating ROOT loaders should be more or less
trivial if one takes existing Objy loaders as a starting point.
The RooEventUpdate module is the final point of the ROOT input
sequence for event data: all real data loading is performed in
this module. But just like the output module, this module does
not do any real work itself. Instead it fetches the list of input
streams stored in the transient event and executes the input()
method for each stream in the list. This method uses the
conversion manager to execute all scribes registered for input
and associated with this stream (and the sync'ing scribes too).
Just as in the Objy case, incremental loading of event data is
possible if one instantiates multiple copies of the
RooEventUpdate module and places filtering modules and loader
modules at the right places. But as everyone today is concerned
with speed, providing the selectors mentioned above may be the
wiser thing to do, I think.
There are also two packages created to test
ROOT scribes and modules, RooScribesTest and RooModulesTest. Here
is a brief description of what they do.
The RooScribesTest package tests the scribes from RooScribes and
depends only on it. RooScribesTest introduces the
RooScribesChkClass class representing transient data, and two
versions of persistent data, RooScribesChkClass_001 and
RooScribesChkClassR_002. Two applications are built,
testRooScribesOutput and testRooScribesInput, to test output and
input with the scribes, respectively. The first application
produces a tree containing one event with two branches, a single
object and a collection of 10 objects of the latest version of
the persistent data. Objects in the collection have references
to each other. The second application reads these data, converts
them back to transient form and prints them. A regression test
script was created to be executed for the RooScribesTest.test
target; it compares the output of the two programs to what is
expected.
The RooModulesTest package uses the transient and persistent
classes from the RooScribesTest package to check how the modules
work. It implements three framework modules:
- RooScribeChkLoad - loader module for
RooScribesChkClass. It can work with both versions of the
persistent data,
- RooFakeScribeChk - framework module which
produces transient objects of RooScribesChkClass and
fills the transient event,
- RooFakeReadChk - framework module which
reads the transient objects of RooScribesChkClass from the
transient event.
There are two applications built in this package,
one for writing and one for reading, with Tcl files offering a
number of parameters to play with. My experience with this stuff
shows that schema evolution basically works: one can read
any version of the data, provided that the class for this version
is defined in the application.
All the basic stuff related to scribes and
input/output modules is implemented and tested in nightly builds.
More work is needed to implement selectors which would work with
tag data or event collections. No speed tests have been performed
yet (although there are some numbers obtained in simple tests),
as in my opinion realistic conditions for this cannot be achieved
with a single application running on a single host.
References
[1] D. Kirkby, Framework Access to Conditions Data via
ROOT, http://www.slac.stanford.edu/~davidk/RooCond/
[2] See also the README files in the RooUtils, RooScribes, and
RooModules packages.
[3] RooScribes reduced class diagram, http://www.slac.stanford.edu/~salnikov/Root/RooScribesDiag.ps.gz