CM2 builds on the experience gained with the original computing model and
attempts to address, in particular, issues that become more and more important
as the BaBar dataset grows over the next few years.
The high-level strategy and
requirements for CM2 were developed by the Computing Model Working Group 2
(CMWG2) committee in the summer and fall of 2002, and the
implementation of the new model took place during 2003. As implemented,
the changes can be broken down into four specific areas:
A new eventstore implementation - the CM2 Kanga eventstore
A new analysis model
A new data content for analysis - the New Micro and the Mini
Improved Bookkeeping and "Task Management"
A high-level description of, and motivation for, each of these is provided
below; much greater detail is available in other sections of this document.
The CM2 Kanga Eventstore
Since 1999 BaBar has used two eventstores:
the Bdb/Objectivity eventstore - used for analysis at Tier A sites and for all data/MC production
the Kanga/ROOT eventstore - used for micro analysis at Tier A and Tier C sites
While the Bdb/Objy eventstore in principle provided greater functionality,
in practice there were a number of data access and scalability issues that
proved very difficult to solve. In addition, the size of the Bdb/Objy data was
larger than foreseen, and there were a number of practical difficulties in
using the Bdb/Objy data for analysis at Tier C sites (i.e. universities).
The original Kanga implementation (often referred to as "classic Kanga") was
much more easily accessed and easier to export and use at small Tier C sites.
However, the data had to be converted from the Bdb/Objy data produced in Prompt
Reconstruction (PR) and Simulation Production (SP),
as these could not produce Kanga data directly, and only "micro" data
was available in classic Kanga.
As part of CM2 we decided to build a next-generation Kanga eventstore that
could be used at both Tier A and Tier C sites as well as on laptops and
workstations. The main properties of the CM2 Kanga eventstore are:
A simple file based format
Minimal size overhead to keep disk space costs low
Simple to setup and distribute
Used at all sites from Tier A to Tier C to laptop/workstation
Written directly from production (PR, SP, ..)
Support for multiple data "components" (tag, usr, cnd, aod, esd, tru, ...)
Support for the requirements of the new analysis model (see below)
To summarize, the new CM2 Kanga eventstore is meant to retain all of the
advantages of classic Kanga, but to extend them with significant new
functionality.
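The idea of a file-based store with several data components can be pictured with a minimal sketch. It assumes, purely for illustration, that each component of a collection lives in its own file, so that a job opens only the files it needs. The component names come from the list above; the Collection class, its methods and the file paths are hypothetical and do not describe the actual CM2 Kanga layout or API.

```python
from dataclasses import dataclass, field

# Component names taken from the text; everything else here is hypothetical.
COMPONENTS = ("tag", "usr", "cnd", "aod", "esd", "tru")


@dataclass
class Collection:
    """A named set of events whose data is split across per-component files."""
    name: str
    # Maps component name -> file path (invented naming scheme).
    files: dict = field(default_factory=dict)

    def add_component(self, component, path):
        if component not in COMPONENTS:
            raise ValueError(f"unknown component: {component}")
        self.files[component] = path

    def files_needed(self, requested):
        """Return only the files a job must open for the requested components."""
        missing = [c for c in requested if c not in self.files]
        if missing:
            raise KeyError(f"components not available: {missing}")
        return [self.files[c] for c in requested]


def example_collection():
    """Build a toy collection that has four of the six components."""
    coll = Collection("example-skim")
    for comp in ("tag", "cnd", "aod", "esd"):
        coll.add_component(comp, f"/store/example-skim.{comp}.root")
    return coll
```

In this picture a micro-level job would request only the components it reads (for example tag and aod), while a site that does not need the larger components would simply not copy those files.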
A new analysis model
In addition to the limitations described above, the two eventstore
implementations had other disadvantages in the context of analysis:
In PR and SP, combinatorics were done for various channels, but none
of this information was saved into the eventstore data except for the
tag indicating that a particular event passed a particular skim. The
combinatorics needed to be redone by the analysis user for the events
of interest.
There was no easy way to write back or extend the data in either the
Bdb/Objy or classic Kanga eventstores.
The access rates to any data in the eventstores were too low to make
it practical to do analysis directly on the eventstore.
These limitations led to the standard analysis method of running large
"productions" over the data in the eventstore, in one of the two formats,
and writing out ntuples in various AWG-specific custom formats.
This allowed analysis-specific information to be stored (composite candidates
and any calculated quantities), but as the ntuples had no connection with the
eventstore it was also necessary to copy out some or all of the micro. Access
to these ntuples was easy enough for analysis to be done, but at the cost of
large AWG-organized ntuple productions that wrote into what
was effectively a set of ad-hoc eventstore formats.
As part of the CM2 Kanga eventstore implementation we decided to improve
on this situation in three principal ways:
Extending the eventstore to allow the storage of user-defined composite
candidate lists as well as any user-calculated quantities (so-called
"user data") associated with either the event or with specific candidates.
Consolidating the user/AWG "ntuple productions" into a central skim
production whose output is a set of customized (and possibly deep-copy) skims.
Providing an easy and fast means of reading back customized output in Beta
analysis jobs and augmenting that with direct analysis access to
customized data from the ROOT/CINT prompt (so-called "interactive"
access).
The upshot of this for the average analysis user is that instead of doing
a large ntuple production in each AWG, you provide code and configuration
for your skim, which will be run for you in a "production skim". You can then
read the skim output back in Beta jobs much faster (typically by a factor of
2-10, as it is no longer necessary to redo the combinatorics) and can also use
it, at some level, interactively even more quickly.
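Why stored candidates save time on read-back can be shown with a toy sketch. Here a "skim" step does the combinatorics once and stores the resulting candidate list together with per-candidate and per-event user data, and an "analysis" step only reads them back. The Event and Candidate classes, the function names and the pair-sum selection are all invented for the illustration; they are not the Beta or CM2 Kanga interfaces.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Candidate:
    daughters: tuple                                # indices of the combined tracks
    user_data: dict = field(default_factory=dict)   # per-candidate quantities


@dataclass
class Event:
    tracks: list                                          # toy "micro": one number per track
    candidate_lists: dict = field(default_factory=dict)   # list name -> [Candidate]
    user_data: dict = field(default_factory=dict)         # per-event quantities


def skim(event, list_name="pairs", threshold=5.0):
    """Production step: do the combinatorics once and store the result."""
    selected = []
    for i, j in combinations(range(len(event.tracks)), 2):
        total = event.tracks[i] + event.tracks[j]   # stand-in for a real calculation
        if total > threshold:
            selected.append(Candidate((i, j), {"total": total}))
    event.candidate_lists[list_name] = selected
    event.user_data["n_" + list_name] = len(selected)
    return event


def analyse(event, list_name="pairs"):
    """Analysis step: read the stored candidates back, with no combinatorics."""
    return [c.user_data["total"] for c in event.candidate_lists[list_name]]
```

The cost of skim grows with the number of track combinations, while analyse only loops over the candidates that were kept, which is the effect behind the speed-up quoted above.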
These central skim productions are intended to be run every three months, so
that new and updated skims can be introduced frequently, and are expected to
significantly reduce the need for AWG-organized ntuple productions.
The new Micro and Mini
Still to finish....
Bookkeeping and Task Management
The average user doing analysis needs access to a variety of "bookkeeping"
information in order to use the data from the eventstore. This can include
luminosities, "good run" classifications, MC information like decay files,
etc. For historical reasons this information was scattered in multiple places
and not particularly well integrated. Users were required to put together
what they needed from the various sources.
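As a small illustration of the kind of combination users had to assemble by hand, the sketch below sums luminosity over runs that are flagged as good. The run numbers, luminosity values (in arbitrary units) and the function name good_run_luminosity are invented for the example; they do not correspond to the GoodRuns package, the lumi script or any other BaBar tool.

```python
def good_run_luminosity(lumi_by_run, good_runs, first_run=None, last_run=None):
    """Sum luminosity over runs that are flagged good and inside the run range."""
    total = 0.0
    for run, lumi in lumi_by_run.items():
        if run not in good_runs:
            continue                      # skip runs not classified as good
        if first_run is not None and run < first_run:
            continue
        if last_run is not None and run > last_run:
            continue
        total += lumi
    return total


# Invented inputs standing in for two separate sources of information.
LUMI_BY_RUN = {1001: 2.0, 1002: 1.5, 1003: 3.0, 1004: 0.5}
GOOD_RUNS = {1001, 1003, 1004}
```

A call such as good_run_luminosity(LUMI_BY_RUN, GOOD_RUNS, first_run=1002) restricts the sum to a run range, which is the sort of query an integrated bookkeeping system can answer from one place.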
In addition, as the integrated luminosity increases, the sheer number of jobs
that one needs to run, and the effort of managing information about their
success or failure, outputs, etc., can become quite significant.
The new bookkeeping and "task management" are intended to address these
issues. The new dataset bookkeeping integrates in one place all information
relevant for the analysis user. This replaces the functionality of "skimData"
(with a much simplified interface), the GoodRuns package, spruns, and the
lumi script. "Task Management" is the name given to a general replacement
for the SkimTools package, allowing application of a "task" to a "dataset".
The set of scripts and the task bookkeeping are intended to be much more
flexible and powerful than the SkimTools package.
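The notion of applying a "task" to a "dataset" while keeping track of job outcomes can be sketched as follows. This is only a toy model under stated assumptions: a dataset is taken to be a list of collection names, and the Task and Job classes, their methods and the status strings are hypothetical rather than the interface of the actual task management scripts.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    collections: list
    status: str = "pending"      # pending -> done | failed
    output: object = None


@dataclass
class Task:
    name: str
    dataset: list                # collection names making up the dataset
    jobs: list = field(default_factory=list)

    def split(self, collections_per_job):
        """Create one job per chunk of the dataset."""
        self.jobs = [
            Job(self.dataset[i:i + collections_per_job])
            for i in range(0, len(self.dataset), collections_per_job)
        ]

    def run(self, func):
        """Run func on every job that is not done yet and record the outcome."""
        for job in self.jobs:
            if job.status == "done":
                continue
            try:
                job.output = func(job.collections)
                job.status = "done"
            except Exception:
                job.status = "failed"

    def summary(self):
        """Count jobs per status, e.g. {'done': 2, 'failed': 1}."""
        counts = {}
        for job in self.jobs:
            counts[job.status] = counts.get(job.status, 0) + 1
        return counts
```

Because the status of each job is recorded, calling run a second time retries only the jobs that are still pending or failed, so the user does not have to track by hand which parts of the dataset have been processed.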