Bookkeeping User Tools

This page contains a simple introduction to the main bookkeeping tools that will be used by the analysis user. This is a subset of the information contained in the full bookkeeping documentation.

Currently there are just three command-line tools for using the bookkeeping information:

  BbkDatasetTcl
    Creates tcl files for the collections in a dataset.
  BbkDatasetHistory
    Lists changes to a dataset.
  BbkUser
    Detailed selection and query information (experts only).

Dataset Information

We can start by getting a list of datasets:

% BbkDatasetTcl
BbkDatasetTcl: 1703 datasets found:-

A0-Run1-OffPeak-R14
A0-Run1-OnPeak-R14
A0-Run2-OffPeak-R14
A0-Run2-OnPeak-R14
A0-Run3-OffPeak-R14
A0-Run3-OnPeak-R14
AlignCal-Run1-OnPeak-R14
AllEvents-Run1-OffPeak-R12
AllEvents-Run1-OnPeak-R12
AllEvents-Run1-R12
AllEvents-Run2-OffPeak-R12
AllEvents-Run2-OnPeak-R12
AllEvents-Run2-R12
AllEvents-Run3-OffPeak-R12
AllEvents-Run3-OnPeak-R12
AllEvents-Run3-R12
AllEvents-Run4-OffPeak-R14
AllEvents-Run4-OnPeak-R14
...

To get a summary of one of these datasets:

% BbkDatasetHistory -ds AllEvents-Run3-R12
Dataset AllEvents-Run3-R12
  CREATED             ADDED REMOVED   TOT_NEV   TOT_LUMI INPUT_DS MAINTAINER DESCRIPTION                                          
  =================== ===== ======= ========= ========== ======== ========== =====================================================
  2004/03/18 14:09:15  3685       0 473551540 33444975.6 NA       douglas    AllEvent stream PR collections produced for run 3    
  2004/04/09 21:27:54     0       4         0        0.0          douglas    Removing bad collections due to changes in dse status
  2004/04/09 23:44:16    25       0   3766263      268.9 NA       douglas    AllEvents stream PR collection for Run3              
  2004/04/12 15:06:39     0       8         0        0.0          douglas    Removing bad collections due to changes in dse status
DBD::Proxy::db disconnect failed: Can't call method "Call" on an undefined value
at (eval 10) line 5 during global destruction.

The final DBD::Proxy message is spurious and can be ignored. The table shows when collections were added to or removed from the dataset.
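
As a cross-check, the ADDED and REMOVED columns can be totalled to get the net number of collections in the dataset. The following is a minimal Python sketch, assuming the history rows have been pasted into a string and that ADDED and REMOVED are always the third and fourth whitespace-separated fields; the function name net_collections is made up for this illustration and is not part of the bookkeeping tools.

```python
# History rows copied from the BbkDatasetHistory output above.
HISTORY = """\
2004/03/18 14:09:15  3685       0 473551540 33444975.6 NA       douglas    AllEvent stream PR collections produced for run 3
2004/04/09 21:27:54     0       4         0        0.0          douglas    Removing bad collections due to changes in dse status
2004/04/09 23:44:16    25       0   3766263      268.9 NA       douglas    AllEvents stream PR collection for Run3
2004/04/12 15:06:39     0       8         0        0.0          douglas    Removing bad collections due to changes in dse status
"""


def net_collections(text):
    """Return (added, removed, net) summed over all history rows."""
    added = removed = 0
    for line in text.splitlines():
        fields = line.split()
        # Expected layout: date, time, ADDED, REMOVED, ...
        # Header, separator and other non-data lines are skipped.
        if len(fields) < 4 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue
        added += int(fields[2])
        removed += int(fields[3])
    return added, removed, added - removed


added, removed, net = net_collections(HISTORY)
print(f"added={added} removed={removed} net={net}")
```

For this dataset the net count is 3698, which agrees with the "Selected 3698 collections" line that BbkDatasetTcl prints for it.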

Creating TCL Files

Now we can create a tcl file:

% BbkDatasetTcl -ds AllEvents-Run3-R12
BbkDatasetTcl: wrote AllEvents-Run3-R12.tcl (476389055 events)
Selected 3698 collections, 476389055/476389055 events, 33368.8/pb

That command is roughly equivalent to:

% skimData -t -G good_run3.txt -s AllEvents --tableprefix=objy

The file AllEvents-Run3-R12.tcl contains all the collections in the dataset. For large datasets like this, it is usually necessary to split it into smaller chunks, so each job finishes in a reasonable time.

% BbkDatasetTcl -ds AllEvents-Run3-R12 -t 200000000
BbkDatasetTcl: wrote AllEvents-Run3-R12-1.tcl (199838611 events)
BbkDatasetTcl: wrote AllEvents-Run3-R12-2.tcl (199944398 events)
BbkDatasetTcl: wrote AllEvents-Run3-R12-3.tcl (76606046 events)
Selected 3698 collections, 476389055/476389055 events, 33368.8/pb

(You'd probably want to split into still smaller chunks, but you get the idea.)
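
The chunk sizes above fall just short of the requested 200000000 events, which is what you would expect if whole collections are packed into each file until the next one would overshoot the target. The Python sketch below illustrates that idea only. It assumes a simple greedy fill, which may not be how BbkDatasetTcl actually does it, and the collection sizes are invented for the example.

```python
def split_on_boundaries(sizes, target):
    """Greedily pack whole collections into chunks of at most `target` events.

    A single collection larger than `target` ends up in a chunk of its own.
    """
    chunks, current, total = [], [], 0
    for n in sizes:
        # Close the current chunk if adding this collection would overshoot.
        if current and total + n > target:
            chunks.append(current)
            current, total = [], 0
        current.append(n)
        total += n
    if current:
        chunks.append(current)
    return chunks


# Hypothetical collection sizes (in events), not taken from a real dataset.
sizes = [90, 80, 70, 150, 40, 30]
for i, chunk in enumerate(split_on_boundaries(sizes, target=200), start=1):
    print(f"chunk {i}: {sum(chunk)} events in {len(chunk)} collection(s)")
```

With these made-up sizes the second chunk holds only 70 events, because the next collection (150 events) would not fit: when collections are large, splitting on their boundaries can give very uneven jobs.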

Many CM2 collections are much larger than older collections, because runs are merged. This is more efficient, but can lead to nasty edge effects where splitting on collection boundaries produces a job that is too long or too short. In that case, you can use the --splitruns option.

% BbkDatasetTcl -ds AllEvents-Run3-R12 -t 200000000 --splitruns
BbkDatasetTcl: wrote AllEvents-Run3-R12-1.tcl (200000000 events)
BbkDatasetTcl: wrote AllEvents-Run3-R12-2.tcl (200000000 events)
BbkDatasetTcl: wrote AllEvents-Run3-R12-3.tcl (76389055 events)
Selected 3698 collections, 476389055/476389055 events, 33368.8/pb

Each tcl file contains code to stop one job in the middle of a collection and start the next job from where the previous one left off.
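
The event counts in the --splitruns example follow directly from cutting the 476389055 events into consecutive slices of at most 200000000. A short Python sketch of that arithmetic (the function split_exact is purely illustrative and says nothing about how the generated tcl implements the cut):

```python
def split_exact(total_events, target):
    """Return one (first_event, n_events) pair per job, cutting at exact counts."""
    jobs = []
    start = 0
    while start < total_events:
        n = min(target, total_events - start)
        jobs.append((start, n))
        start += n
    return jobs


jobs = split_exact(476389055, 200000000)
print([n for _, n in jobs])  # [200000000, 200000000, 76389055]
```

Every job except the last gets exactly the requested number of events, regardless of where the collection boundaries fall.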

Another option, --basename, allows you to specify a different output file name (.tcl and the sequence numbers are still added).

Evolving Datasets

When using a dataset that is still growing (run 4 at the moment), take care not to reanalyse data you have already included: at best this wastes CPU, at worst it artificially inflates your luminosity! The --marker option records the state of the dataset at the time the tcl file was created (a file is written to ~/.bbk/BbkDatasetTcl/lastsession.sav). The next time you run BbkDatasetTcl, specify --newer (or --older) to include just the collections added to the dataset after (or before) that point.

You can also give an explicit starting or end point with --start or --end, for example the name of a previously defined dataset:

% BbkDatasetTcl -ds AllEventsSkim-Run4-OnPeak-R14 --start AllEventsSkim-Run4-OnPeak-R14-GreenCircle
BbkDatasetTcl: wrote AllEventsSkim-Run4-OnPeak-R14.tcl (437640532 events)
Selected 40 collections, 437640532/437640532 events, ~30492.2/pb
wrote : AllEventsSkim-Run4-OnPeak-R14-bad-runs.txt (36 runs, ~170.4/pb)
The events associated with these runs at the start time are now known to be bad.Please removed or block events with these run numbers to protect from possible
double counting and use of bad data.

As you can see, this only includes the 40 collections added since the Green Circle dataset was defined (see the BbkDatasetHistory output above). You can also use a date as the argument to --start; for example, "04/05/28" would produce the same output, since no collections were added to the dataset between Green Circle and that date.

You should also note that some runs have been removed from the dataset since Green Circle. Because of this, a warning is printed and a file listing the removed runs is written.
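
One way to act on the warning is to read the bad-runs file into a set and veto events from those runs in your own analysis code. The Python sketch below assumes the file holds one run number per line. That layout is an assumption, not something documented here, so check the real file first. The helper names load_bad_runs and keep_event are hypothetical.

```python
def load_bad_runs(lines):
    """Collect run numbers from an iterable of text lines.

    ASSUMPTION: one run number per line; blank lines and anything
    that is not a plain integer are ignored.
    """
    runs = set()
    for line in lines:
        token = line.strip()
        if token.isdigit():
            runs.add(int(token))
    return runs


def keep_event(run_number, bad_runs):
    """True if the event's run is not in the bad-run list."""
    return run_number not in bad_runs


# Made-up file contents for illustration only.
sample_lines = ["42400\n", "42403\n", "\n"]
bad_runs = load_bad_runs(sample_lines)
print(sorted(bad_runs))
print(keep_event(42401, bad_runs), keep_event(42403, bad_runs))
```

With a real file you would pass an open file object instead of the sample list, e.g. open("AllEventsSkim-Run4-OnPeak-R14-bad-runs.txt").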

Distributed Analysis

All these commands access the SLAC database by default (this default will probably be changed to access the local database if you are at another site). You can access another site by specifying it on the command line:

% BbkDatasetTcl --site=ral

When the bookkeeping tools are used for the first time, you may notice a short delay. A directory ~/.bbk is created containing the connection information for SLAC, Tier A, and some Tier C databases (this is copied from SLAC AFS). This is only updated when there is a problem connecting.

Caveats

  1. BbkDatasetTcl ignores is_local, so it includes collections even if they haven't been imported yet (this is an issue outside SLAC).

Futures

  1. Lots of good suggestions from David Kirkby and others. Some have already been incorporated, or are mentioned above. Some others:-
    1. Should we allow datasets to be combined? What about duplicate runs?
    2. tcl file integrity (do we want to protect against users editing the file)?
    3. Dataset naming
    4. Short names: dstcl, dshist?
    5. Print selection fractions (with errors)?
  2. For experts, allow further selection on top of the dataset (merge in some of the functionality of BbkUser).

More Info...

For more information, including other bookkeeping topics such as database mirroring, see the Bookkeeping Documentation page.

Information for Experts: BbkUser

BbkUser allows arbitrary queries with arbitrary selection of most of the information available in the core bookkeeping.

For example, the following command lists the input and output events, PR luminosity, and collection name for collections in the AllEvents-Run4-R14 dataset containing runs in the range 42400-42404:

% BbkUser --dataset=AllEvents-Run4-R14 --run=42400-42404 \
          events_in events pr_lumi collection
EVENTS_IN EVENTS PR_LUMI COLLECTION
202182 84177 5322.3819 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042400_14.3.2V00
284197 117071 8022.5873 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042401_14.3.2V00
256763 106387 6852.1561 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042402_14.3.2V00
401839 166754 10675.5647 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042403_14.3.2V00
493289 204189 13537.9877 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042404_14.3.2V00
5 rows returned

See BbkUser -h for a full list of selection and query options (there are currently 50 of each!).
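
Because the BbkUser output shown above is a plain whitespace-separated table, it is easy to post-process. The Python sketch below totals the numeric columns of that example. It assumes the output has been captured into a string, that the first line is the column header, and that no field contains spaces; parse_rows is an illustrative helper, not part of the bookkeeping tools.

```python
# Output copied from the BbkUser example above.
OUTPUT = """\
EVENTS_IN EVENTS PR_LUMI COLLECTION
202182 84177 5322.3819 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042400_14.3.2V00
284197 117071 8022.5873 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042401_14.3.2V00
256763 106387 6852.1561 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042402_14.3.2V00
401839 166754 10675.5647 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042403_14.3.2V00
493289 204189 13537.9877 /store/PR/R14/AllEvents/0004/24/14.3.2/AllEvents_00042404_14.3.2V00
5 rows returned
"""


def parse_rows(text):
    """Turn the table into a list of dicts keyed by the header names."""
    lines = text.splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        # Skip the "N rows returned" trailer and any malformed line.
        if len(fields) != len(header) or not fields[0].isdigit():
            continue
        rows.append(dict(zip(header, fields)))
    return rows


rows = parse_rows(OUTPUT)
total_in = sum(int(r["EVENTS_IN"]) for r in rows)
total_out = sum(int(r["EVENTS"]) for r in rows)
total_lumi = sum(float(r["PR_LUMI"]) for r in rows)
print(f"{len(rows)} collections: {total_in} events in, {total_out} events out, "
      f"PR_LUMI total {total_lumi:.4f}")
```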

A few caveats on the use of BbkUser

BbkUser is based on the SQL selection API.


Tim Adye, <T.J.Adye@rl.ac.uk>
Will Roethel, <roethel@slac.stanford.edu>