
Bookkeeping User Tools

This page contains a simple introduction to the main bookkeeping tools that are useful to the BaBar analyst. This is a subset of the information contained in the full bookkeeping documentation.

There are four main command-line tools for using the core bookkeeping information, though for most cases only the first is needed:-

BbkDatasetTcl
Creates Tcl files for collections in a dataset
BbkExpertTcl
Like BbkDatasetTcl, but allows full selections (can produce unexpected results or unnecessarily large queries, so should only be used if you know what you are doing).
BbkUser
Detailed selection and query (also for experts only)
BbkLumi and lumi
Obtain luminosity information (to be described elsewhere)

A note on the command names: the bookkeeping user tools are accessed from a particular release (currently 16.0.3-physics-1a, which is identical to analysis-24 except for updates to the bookkeeping code). This is defined in $BFROOT/bin/BbkDatasetTcl etc. To prevent accessing the wrong version from your PATH, the commands in the release are called relBbkDatasetTcl etc. The rel* versions should not be used directly unless you are testing a specific version. This scheme is used to maintain compatibility between the commands and the database schema (the version that is in your analysis release may not be compatible with the current version of the bookkeeping database). New versions are announced in the Bookkeeping and Site Contacts HyperNews groups.

Dataset Information

We can start by getting a list of datasets

% BbkDatasetTcl
BbkDatasetTcl: 33202 datasets found:-

A0-Run1-OffPeak-R14
A0-Run1-OffPeak-R14-BlackDiamond
A0-Run1-OffPeak-R14-BlueSquare
A0-Run1-OffPeak-R14-BlueSquarePrime
A0-Run1-OffPeak-R14-GreenCircle
A0-Run1-OffPeak-R14-Total
A0-Run1-OnPeak-R14
A0-Run1-OnPeak-R14-BlackDiamond
...
users-phnic-TwoPhotonTwoTrackSkim-BlackDiamond-Run4
users-phnic-TwoPhotonTwoTrackSkim-GreenCircle-Run3

That's a very long list. You can use grep to search for what you want, or specify a wildcard with the -l option (faster).

% BbkDatasetTcl -l '*Inclppbar*'
BbkDatasetTcl: 38 datasets found:-

Inclppbar-Run1-OffPeak-R16a
Inclppbar-Run1-OnPeak-R16a
Inclppbar-Run2-OffPeak-R14
...
Additional options are planned for BbkDatasetTcl that will allow selecting on dataset properties, eg. --ds_stream=Inclppbar.
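
If you have already saved the full listing to a file, the same kind of wildcard match can be applied offline. The following is a minimal Python sketch, assuming the -l pattern behaves like a shell-style glob (an assumption; the exact matching rules are not described here). The function name match_datasets is hypothetical.

```python
from fnmatch import fnmatchcase


def match_datasets(names, pattern):
    """Return the dataset names matching a shell-style wildcard pattern.

    ASSUMPTION: glob semantics ('*' matches any run of characters),
    applied case-sensitively.
    """
    return [name for name in names if fnmatchcase(name, pattern)]


# Dataset names taken from the listings above.
names = [
    "A0-Run1-OffPeak-R14",
    "Inclppbar-Run1-OffPeak-R16a",
    "Inclppbar-Run1-OnPeak-R16a",
    "Inclppbar-Run2-OffPeak-R14",
]
print(match_datasets(names, "*Inclppbar*"))
```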

Creating Tcl Files: BbkDatasetTcl

Now we can create a Tcl file

% BbkDatasetTcl Inclppbar-Run4-OnPeak-R16a
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a.tcl
Selected 73 collections, 57721309/1448776065 events, ~99532.6/pb
The file Inclppbar-Run4-OnPeak-R16a.tcl contains all the collections in the dataset. For large datasets, it is usually necessary to split it into smaller chunks, so each job finishes in a reasonable time.
% BbkDatasetTcl Inclppbar-Run4-OnPeak-R16a --tcl 20000k --splitruns
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a-1.tcl (25 collections, 20000000 events)
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a-2.tcl (26 collections, 20000000 events)
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a-3.tcl (24 collections, 17721309 events)
Selected 73 collections, 57721309/1448776065 events, ~99532.6/pb

You can then submit three jobs, one for each Tcl file (you'd probably want to split into smaller chunks still, but you get the idea).
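
To judge a chunk size before running the command, the number of Tcl files can be estimated from the summary line. This is only a sketch in Python: it assumes the two numbers in "57721309/1448776065 events" are the events in the skim and the events input to it, and that a trailing "k" in the --tcl value means thousands. The function names are hypothetical.

```python
import math
import re


def parse_event_count(text):
    """Convert a size such as '20000k' into an integer number of events.

    ASSUMPTION: only a trailing 'k' (thousands) is handled.
    """
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1000
    return int(text)


def parse_summary(line):
    """Extract the numbers from a 'Selected ...' summary line."""
    match = re.search(
        r"Selected (\d+) collections, (\d+)/(\d+) events, ~([\d.]+)/pb", line
    )
    if match is None:
        raise ValueError("unrecognised summary line: " + line)
    return {
        "collections": int(match.group(1)),
        "events": int(match.group(2)),      # assumed: events in the skim
        "events_in": int(match.group(3)),   # assumed: events input to the skim
        "lumi_pb": float(match.group(4)),
    }


summary = parse_summary(
    "Selected 73 collections, 57721309/1448776065 events, ~99532.6/pb"
)
per_file = parse_event_count("20000k")
n_files = math.ceil(summary["events"] / per_file)
print(n_files)  # 3, matching the three Tcl files written above
```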

The --splitruns option forces exactly the specified number of events into each Tcl file (except, of course, the last). This can be a bit inefficient, as the collections at the boundaries have to be opened by both jobs. It may also make it harder to work out where an error occurred if the job starts in the middle of a collection. If you don't mind some jobs being a bit shorter (and as long as each job can process at least one collection), you can leave out the --splitruns option. Try it and see whether the number of events is reasonable.
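
The exact splitting can be pictured with a short Python sketch. It is not the BbkDatasetTcl implementation, only an illustration of the idea, assuming event sequence numbers within a collection are 1-based and inclusive. The collection names and sizes below are made up.

```python
def split_exact(collections, events_per_file):
    """Split (name, n_events) collections into chunks of exactly events_per_file.

    Returns a list of chunks; each chunk is a list of (name, first, last)
    with 1-based, inclusive event sequence numbers. Only the last chunk
    may hold fewer events.
    """
    chunks, current, room = [], [], events_per_file
    for name, n_events in collections:
        first = 1
        while first <= n_events:
            # Take as many events as fit in the current chunk.
            take = min(room, n_events - first + 1)
            current.append((name, first, first + take - 1))
            first += take
            room -= take
            if room == 0:
                chunks.append(current)
                current, room = [], events_per_file
    if current:
        chunks.append(current)
    return chunks


# Hypothetical collections: B is shared between the two jobs.
for chunk in split_exact([("A", 5), ("B", 4), ("C", 3)], 6):
    print(chunk)
```

Collection B appears in both chunks, which is the boundary overhead mentioned above.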

It is instructive to look at one of the generated Tcl files from an R14 skim. In this case we look at the sixth file generated by the command

% BbkDatasetTcl A0-Run2-OnPeak-R14 --tcl 1000k --splitruns
% cat A0-Run2-OnPeak-R14-6.tcl
## This file was generated automatically on 2005/05/20-18:31:45-BST
## by user adye on host csfe from /home/csf/adye
## using: BbkDatasetTcl --site=ral A0-Run2-OnPeak-R14 --tcl 1000k --splitruns
## version Id: BbkTcl.pm,v 1.33 2005/01/26 16:23:11 adye Exp
## Selected dataset:
## A0-Run2-OnPeak-R14 (A0 stream PR skim coll. for Run2, On Peak) created 2004/04/04-00:34:50-BST by douglas

# 580343/20070731 events selected from 116 on-peak runs, added to dataset at 2004/04/04-00:34:50-BST, lumi = ~1331.8/pb
lappend inputList /store/PRskims/R12/14.4.0c/A0/01/A0_0116%rejectRun=21618%selectEventSequence=248540-580343

# 592293/20068686 events selected from 111 on-peak runs, added to dataset at 2004/04/04-00:34:50-BST, lumi = ~1347.9/pb
lappend inputList /store/PRskims/R12/14.4.0c/A0/01/A0_0119%rejectRun=22538

# 577556/20161405 events selected from 124 on-peak runs, added to dataset at 2004/04/04-00:34:50-BST, lumi = ~1343.6/pb
lappend inputList /store/PRskims/R12/14.4.0c/A0/01/A0_0127%selectEventSequence=1-75903

## In this tcl file: 3 collections, 1000000 events

This includes three collections, specified with lappend inputList Framework Tcl commands. The extended collection name syntax is used. Exactly 1 million events per job are specified with %selectEventSequence. The %rejectRun modifiers are there to remove runs that were determined to be bad after the skimming was complete. The comments give information on each collection, including the number of events selected by the skimming. Events from the rejected runs are excluded from these totals, though the splitting between files does not take this into account, so the job may encounter a slightly different number of events in the first/last collection (this will be fixed in a future version of BbkDatasetTcl).
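
As a cross-check of the event count, the collection names in this file can be taken apart with a few lines of Python. This is a sketch only: the "%key=value" layout is inferred from the example above rather than from the full syntax description, and the function names are hypothetical.

```python
def parse_collection(spec):
    """Split 'name%key=value%key=value' into (name, {key: value}).

    ASSUMPTION: each modifier appears at most once per collection.
    """
    name, *modifiers = spec.split("%")
    options = dict(modifier.split("=", 1) for modifier in modifiers)
    return name, options


def events_requested(spec, events_in_collection):
    """Events a job is asked to read from one collection.

    Honours %selectEventSequence (1-based, inclusive range); run
    rejection is not taken into account here.
    """
    _, options = parse_collection(spec)
    if "selectEventSequence" not in options:
        return events_in_collection
    first, last = (int(x) for x in options["selectEventSequence"].split("-"))
    return last - first + 1


# Collection names and skim event counts copied from the Tcl file above.
input_list = [
    ("/store/PRskims/R12/14.4.0c/A0/01/A0_0116%rejectRun=21618"
     "%selectEventSequence=248540-580343", 580343),
    ("/store/PRskims/R12/14.4.0c/A0/01/A0_0119%rejectRun=22538", 592293),
    ("/store/PRskims/R12/14.4.0c/A0/01/A0_0127%selectEventSequence=1-75903",
     577556),
]
total = sum(events_requested(spec, n) for spec, n in input_list)
print(total)  # 1000000
```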

Another option, --basename, allows you to specify a different output file name (.tcl and the sequence numbers are still added).

Evolving Datasets

When using a dataset that is still ongoing, you need to take care not to reanalyse data you have already included (at best, this wastes CPU, at worst it artificially inflates your luminosity!). For these examples, let's update a run 4 analysis and, for simplicity, imagine that we are putting our entire analysis in a single job. Hopefully we still have the old Tcl file, Tau11-Run4-OnPeak-R14.tcl. At the top is the line

## This file was generated automatically on 2004/05/29-13:27:43+0100

(though since it will have been generated with a previous version of BbkDatasetTcl, the date will be in a different format without the timezone). We can update this with

% BbkDatasetTcl Tau11-Run4-OnPeak-R14 --since=2004/05/29-13:27:43+0100
BbkDatasetTcl: wrote Tau11-Run4-OnPeak-R14.tcl
Selected 33 collections, 99529441/589347702 events, ~41441.6/pb
wrote : Tau11-Run4-OnPeak-R14-bad-runs.txt (231 runs, ~3891.4/pb)
The events associated with these runs at the start time are now known to be bad.
Please removed or block events with these run numbers to protect from possible
double counting and use of bad data.

You should include the timezone to be sure of using the exact same time, as the dataset is modified continuously. The Tcl file will contain the new collections, which may include events that were reprocessed in the meantime. You should make sure that you exclude the runs listed in Tau11-Run4-OnPeak-R14-bad-runs.txt from your combined results (eg. by excluding those runs from your n-tuple).
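
Excluding the bad runs from an n-tuple might look like the following Python sketch. The layout of the bad-runs file is not described on this page, so the sketch assumes whitespace-separated integer run numbers; the n-tuple is represented as a list of dictionaries with a "run" key. Both are assumptions, and all run numbers below are made up.

```python
def read_bad_runs(text):
    """Collect run numbers from the text of a bad-runs file.

    ASSUMPTION: whitespace-separated integer run numbers; anything after
    a '#' on a line is ignored. The real file layout may differ.
    """
    runs = set()
    for line in text.splitlines():
        content = line.split("#", 1)[0]
        runs.update(int(token) for token in content.split())
    return runs


def drop_bad_runs(rows, bad_runs):
    """Keep only the n-tuple rows whose run number is not in bad_runs."""
    return [row for row in rows if row["run"] not in bad_runs]


# Hypothetical run numbers and n-tuple rows, for illustration only.
bad_runs = read_bad_runs("40001\n40003  # reprocessed\n")
ntuple = [
    {"run": 40001, "x": 0.1},
    {"run": 40002, "x": 0.2},
    {"run": 40003, "x": 0.3},
]
good = drop_bad_runs(ntuple, bad_runs)
print([row["run"] for row in good])  # [40002]
```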

You can also show the state of the dataset at any time in the past by specifying a date with the --end option. Note that --since excludes the given date but --end includes it, so that consecutive time ranges cover all times without overlapping.
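
The boundary rule can be illustrated in Python. The sketch assumes the selection is made on the time at which a collection was added to the dataset; the function name in_window is hypothetical, and apart from the boundary itself the times are made up.

```python
from datetime import datetime


def in_window(added, since=None, end=None):
    """True if 'added' lies in the interval (since, end].

    The lower bound is exclusive and the upper bound inclusive,
    mirroring the --since and --end behaviour described above.
    """
    if since is not None and added <= since:
        return False
    if end is not None and added > end:
        return False
    return True


first_pass = datetime(2004, 5, 29, 13, 27, 43)  # date of the old Tcl file
added_times = [
    datetime(2004, 5, 1, 0, 0, 0),
    first_pass,                       # exactly on the boundary
    datetime(2004, 6, 15, 12, 0, 0),
]
old = [t for t in added_times if in_window(t, end=first_pass)]
new = [t for t in added_times if in_window(t, since=first_pass)]
print(len(old), len(new))  # 2 1
```

The entry added exactly at the boundary falls in the first range only, so nothing is counted twice or missed.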

Every now and again the Data Quality Group and Physics Analysis Coordinator specify an official dataset tag, which can be used in place of the dataset name. For run 4 we had R14 tags called "GreenCircle", "BlueSquare", "BlueSquarePrime", "BlackDiamond", and "Total", eg.

% BbkDatasetTcl Tau11-Run4-OnPeak-R14-GreenCircle
BbkDatasetTcl: wrote Tau11-Run4-OnPeak-R14-GreenCircle.tcl
Selected 44 collections, 146469352/868642325 events, ~55048.9/pb

You can also specify a dataset tag, like Tau11-Run4-OnPeak-R14-GreenCircle, for the --since or --end dates. The above command is equivalent to

% BbkDatasetTcl Tau11-Run4-OnPeak-R14 --end Tau11-Run4-OnPeak-R14-GreenCircle
except for the name of the output Tcl file.

Distributed Analysis

All these commands access the SLAC database by default (soon this default will be changed to access the local database if you are at another site). You can access another site by specifying it on the command line

% BbkDatasetTcl --site=ral

When the bookkeeping tools are used for the first time at a new site, you may notice a short delay. A directory ~/.bbk/sites is created containing the connection information for SLAC, Tier A, and some Tier C databases (this is copied from SLAC via ssh or AFS, so if you have problems try getting a SLAC AFS token the first time). This is only updated when there is a problem connecting.

It is usually best to connect to the database of the site where you intend to run the analysis. This contains a record of which collections are available at that site, so if the dataset is not present or is incomplete, BbkDatasetTcl will warn you when you create the Tcl file and include only those collections that are available (override with the --nolocal option).

Currently the dataset list shows all datasets, but a future version of BbkDatasetTcl will restrict it to just those available at the specified site. This will also be used to limit the datasets shown at SLAC to those that are not blocked.

More Refined Selections: BbkExpertTcl

One of the purposes of using datasets is to provide predefined selections. Anyone who had to use the old skimData command will appreciate this. Nevertheless, in some cases, especially for detector studies, it may be necessary to hone the selection further. BbkExpertTcl is identical to BbkDatasetTcl, except that it enables the full set of selections provided by the database (the selection options are the same as the BbkUser command described below). As well as selecting by run number, there are currently 100 other selectors defined (not counting database ids and derived selectors)! The full list can be obtained with BbkExpertTcl -h.

One fairly common case is selecting by run period. There are separate datasets for run cycles run1, run2, run3, run4, and run5, but not for finer delineations (eg. run1a or 200309), so one may be tempted to select by run number or condalias. Unfortunately there are a few problems with this.

  1. it involves querying the run table, which is enormous (due to SP runs).
  2. the record of which runs are in each SP skim collection has been removed (there were too many of them!)
  3. SP and skim collections are merged from many runs, so the selected collections may also contain other runs.
A command like
% BbkExpertTcl Tau11-Run4-OnPeak-R14 --run=42400-42404
BbkExpertTcl: wrote Tau11-Run4-OnPeak-R14.tcl
Selected 3 collections, 10176060/60438180 events, ~44.6/pb
includes %selectRun=42400-42404 on each collection to ensure that only the required runs are processed (the rest will be skipped by the job Framework). Where that won't work (SP skims, or where the database is too slow), you can use --run_select=42400-42404 (or eg. --condalias_select=200309-200407), which returns all collections in the dataset with the %selectRun or %selectCondAlias applied. The use of --run_select and --condalias_select is fairly safe, so they will shortly be enabled for use in BbkDatasetTcl too.

Information for Experts: BbkUser

BbkUser allows arbitrary queries with arbitrary selection of most of the information available in the core bookkeeping.

For example, the following command lists the input and output events, collection luminosity, and collection name for collections in the Tau11-Run4-OnPeak-R14 dataset that are available at the local site.
% BbkUser --dataset=Tau11-Run4-OnPeak-R14 --is_local=1 \
events_in events dse_lumi collection \
--display --style=adye
EVENTS_IN EVENTS  DSE_LUMI COLLECTION
========= ======= ======== ==============================================
 20057716 3217405   1250.3 /store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0239
 20161268 3394079   1348.4 /store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0240
 20147656 3306032   1289.3 /store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0241
...
 20229215 3407837   1409.0 /store/PRskims/R14/14.4.4e/Tau11/15/Tau11_1551
80 rows returned
The --display option outputs numbers in "display format" (eg. luminosities in inverse picobarns to one decimal place, rather than inverse nanobarns as they are stored in the database). --style=adye (stupid name, I know - not my choice!) lines up in columns.
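
The unit conversion behind the display format is simple: one inverse picobarn is 1000 inverse nanobarns. A minimal Python sketch follows; the stored value is back-computed from the displayed 1250.3 for illustration and is not taken from the database.

```python
def lumi_display(lumi_inverse_nb):
    """Format a luminosity given in inverse nanobarns as inverse picobarns,
    to one decimal place."""
    return f"{lumi_inverse_nb / 1000.0:.1f}"


# Hypothetical stored value corresponding to the first row above.
print(lumi_display(1250300))  # 1250.3
```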

The --summary option allows you to calculate totals of various quantities. Just list the quantities you are interested in: enumerated types (like run number or dse_id) are counted, numbers (like events, dse_lumi, or gbytes) are summed, and listed for each combination of the remaining string values (like collection or file name). Eg.

% BbkUser --dataset='Tau11-Run4-*Peak-R14' \
dataset components dse_id events dse_lumi gbytes \
--summary --style=adye
DATASET                COMPONENTS #DSE_ID   +EVENTS +DSE_LUMI +GBYTES
====================== ========== ======= ========= ========= =======
Tau11-Run4-OffPeak-R14 HBCA            10  25015705   10117.8    41.3
Tau11-Run4-OnPeak-R14  HBCA            80 255400590  103845.2   419.9
====================== ========== ======= ========= ========= =======
Totals                                 90 280416295  113963.0   461.1
333 rows returned
See BbkUser -h for a full list of selection and query values (85 of them!).
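
The counting and summing rules of --summary can be sketched in Python as follows. This is an illustration, not BbkUser's code: it assumes "counted" means the number of distinct values, and the dse_id values are hypothetical (the events and luminosities are the first three rows of the earlier Tau11-Run4-OnPeak-R14 listing).

```python
from collections import defaultdict


def summarise(rows, count_cols, sum_cols, group_cols):
    """Group rows by group_cols; within each group count the distinct
    values of count_cols ('#' prefix) and sum sum_cols ('+' prefix)."""
    counts = defaultdict(lambda: defaultdict(set))
    sums = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        for col in count_cols:
            counts[key][col].add(row[col])
        for col in sum_cols:
            sums[key][col] += row[col]
    result = {}
    for key in counts.keys() | sums.keys():
        line = {"#" + col: len(values) for col, values in counts[key].items()}
        line.update({"+" + col: total for col, total in sums[key].items()})
        result[key] = line
    return result


rows = [
    {"dataset": "Tau11-Run4-OnPeak-R14", "dse_id": 1, "events": 3217405, "dse_lumi": 1250.3},
    {"dataset": "Tau11-Run4-OnPeak-R14", "dse_id": 2, "events": 3394079, "dse_lumi": 1348.4},
    {"dataset": "Tau11-Run4-OnPeak-R14", "dse_id": 3, "events": 3306032, "dse_lumi": 1289.3},
]
summary = summarise(rows, ["dse_id"], ["events", "dse_lumi"], ["dataset"])
print(summary[("Tau11-Run4-OnPeak-R14",)])
```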

A few notes on the use of BbkUser

  • BbkUser translates its options into an SQL query - often with some pre- or post-processing (like dataset pruning or summary generation). You can use the -s option to see what query it uses. This is probably a good idea if you are trying a new sort of query for the first time.
  • If you return information on datasets, runs, files, etc where there is the possibility of more than one dataset, run, file, etc per collection, then you'll get the same collection listed more than once (eg. once per file). You can use --distinct to remove duplicate lines.
  • You can also use SQL expressions to calculate derived or aggregate values, though this can sometimes produce confusing or unexpected results.
  • Some aggregate values have BbkUser aliases like tot_lumi, tot_bytes, and nruns. These can be used to produce efficient queries, but may not aggregate over the values you are interested in. The rule is (roughly) that a total is calculated for each unaggregated value that is returned in the same query. One important restriction is that this cannot aggregate collection totals over pruned datasets (since aggregation is done by the server and dataset pruning is done by the client). BbkUser will issue a warning if you try to do this. A better (if slower) way to do this is to use --summary.

BbkUser is based on the SQL selection API.

More Info...

For more information and information on other subjects related to the bookkeeping, like database mirroring, see the Bookkeeping Documentation page.


/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/BbkUserTools.html last modified on 27th May 2005 by
Tim Adye, <T.J.Adye@rl.ac.uk>