Bookkeeping User Tools
This page contains a simple introduction to the main bookkeeping tools
that are useful to the BaBar analyst. This is a subset of the information
contained in the full bookkeeping documentation.
There are four main command-line tools for using the core
bookkeeping
information, though for most cases only the first is needed:-
- BbkDatasetTcl: creates Tcl files for collections in a dataset.
- BbkExpertTcl: like BbkDatasetTcl, but allows full selections (can produce
unexpected results or unnecessarily large queries, so should only be used
if you know what you are doing).
- BbkUser: detailed selection and query (also for experts only).
- BbkLumi and lumi: obtain luminosity information (to be described elsewhere).
A note on the command names: the bookkeeping user tools are accessed
from a particular release (currently 16.0.3-physics-1a, which is
identical to analysis-24 except for updates to the bookkeeping code).
This is
defined in $BFROOT/bin/BbkDatasetTcl
etc. To prevent accessing the wrong version from your PATH, the
commands in the release are called relBbkDatasetTcl
etc. The rel*
versions should not be used directly unless you are testing a specific
version. This scheme is used to maintain compatibility between the
commands and the database schema (the version that is in your analysis
release may not be compatible with the current version of the
bookkeeping database). New versions are announced in the Bookkeeping
and Site
Contacts HyperNews groups.
Dataset Information
We can start by getting a list of datasets:
% BbkDatasetTcl
BbkDatasetTcl: 33202 datasets found:-
A0-Run1-OffPeak-R14 A0-Run1-OffPeak-R14-BlackDiamond A0-Run1-OffPeak-R14-BlueSquare A0-Run1-OffPeak-R14-BlueSquarePrime A0-Run1-OffPeak-R14-GreenCircle A0-Run1-OffPeak-R14-Total A0-Run1-OnPeak-R14 A0-Run1-OnPeak-R14-BlackDiamond
...
users-phnic-TwoPhotonTwoTrackSkim-BlackDiamond-Run4 users-phnic-TwoPhotonTwoTrackSkim-GreenCircle-Run3
That's a very long list. You can use grep to search for what you
want, or specify a wildcard with the -l
option (faster).
% BbkDatasetTcl -l '*Inclppbar*'
BbkDatasetTcl: 38 datasets found:-
Inclppbar-Run1-OffPeak-R16a Inclppbar-Run1-OnPeak-R16a Inclppbar-Run2-OffPeak-R14
...
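If you have already saved the full dataset list, the same kind of wildcard selection can be reproduced offline. The following is only a sketch: it assumes that the -l wildcard behaves like an ordinary shell-style pattern, which this page does not state explicitly. The function name match_datasets is made up for this illustration, and the dataset names are a few of those shown in the listings above.

```python
from fnmatch import fnmatchcase

# A handful of dataset names copied from the listings above.
datasets = [
    "A0-Run1-OffPeak-R14",
    "A0-Run1-OnPeak-R14",
    "Inclppbar-Run1-OffPeak-R16a",
    "Inclppbar-Run1-OnPeak-R16a",
    "Inclppbar-Run2-OffPeak-R14",
    "users-phnic-TwoPhotonTwoTrackSkim-GreenCircle-Run3",
]


def match_datasets(names, pattern):
    """Return the names matching a shell-style wildcard (case-sensitive).

    Assumption: the bookkeeping wildcard follows shell-style rules.
    """
    return [name for name in names if fnmatchcase(name, pattern)]


matches = match_datasets(datasets, "*Inclppbar*")
print(len(matches), "datasets found")
```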
Additional options are planned for BbkDatasetTcl
that will allow
selecting on dataset properties, eg. --ds_stream=Inclppbar.
Creating Tcl Files: BbkDatasetTcl
Now we can create a Tcl file:
% BbkDatasetTcl Inclppbar-Run4-OnPeak-R16a
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a.tcl
Selected 73 collections, 57721309/1448776065 events, ~99532.6/pb
The file Inclppbar-Run4-OnPeak-R16a.tcl
contains all the collections in the dataset. For large datasets, it is
usually necessary to split it into smaller chunks, so each
job finishes in a reasonable time.
% BbkDatasetTcl Inclppbar-Run4-OnPeak-R16a --tcl 20000k --splitruns
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a-1.tcl (25 collections, 20000000 events)
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a-2.tcl (26 collections, 20000000 events)
BbkDatasetTcl: wrote Inclppbar-Run4-OnPeak-R16a-3.tcl (24 collections, 17721309 events)
Selected 73 collections, 57721309/1448776065 events, ~99532.6/pb
You can then submit three jobs, one for each Tcl file (you'd
probably want to split into smaller chunks still, but you get
the idea).
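To choose a chunk size before running the command, it can help to work out how many Tcl files an exact split would give. The sketch below only reproduces the arithmetic seen in the output above (57721309 events in chunks of 20000k). The helper names are invented for this example, and the reading of the "k" suffix as a factor of 1000 is inferred from that output rather than from the tool's documentation.

```python
def parse_event_count(text):
    """Turn a count such as '20000k' into an integer number of events.

    Assumption: a trailing 'k' means a factor of 1000, as the output
    above suggests ('--tcl 20000k' gave files of 20000000 events).
    """
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1000
    return int(text)


def split_events(total_events, events_per_file):
    """Return the number of events in each file for an exact split."""
    if events_per_file <= 0:
        raise ValueError("events_per_file must be positive")
    full_files, remainder = divmod(total_events, events_per_file)
    sizes = [events_per_file] * full_files
    if remainder:
        sizes.append(remainder)  # the last file takes what is left over
    return sizes


sizes = split_events(57721309, parse_event_count("20000k"))
print(len(sizes), "files:", sizes)
```

Trying a few chunk sizes this way gives a quick estimate of how many jobs you would have to submit.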
The --splitruns option
forces exactly the specified number of events into each Tcl file
(except of course the last). This can be a bit inefficient as the
collections at the boundaries have to be opened by both jobs. It also
might make it a bit more complicated to work out where an error
occurred in your job if the job starts in the middle of a collection.
If you don't mind some jobs being a bit shorter (and as long as each
job can process at least one collection), then you can leave out the
--splitruns option. Try it and see
whether the number of events is reasonable.
It is instructive to look at one of the generated Tcl files from an
R14 skim. In this case we look at the sixth file generated by the command
% BbkDatasetTcl A0-Run2-OnPeak-R14 --tcl 1000k --splitruns
% cat A0-Run2-OnPeak-R14-6.tcl
## This file was generated automatically on 2005/05/20-18:31:45-BST
## by user adye on host csfe from /home/csf/adye
## using: BbkDatasetTcl --site=ral A0-Run2-OnPeak-R14 --tcl 1000k --splitruns
## version Id: BbkTcl.pm,v 1.33 2005/01/26 16:23:11 adye Exp
## Selected dataset:
## A0-Run2-OnPeak-R14 (A0 stream PR skim coll. for Run2, On Peak) created 2004/04/04-00:34:50-BST by douglas
# 580343/20070731 events selected from 116 on-peak runs, added to dataset at 2004/04/04-00:34:50-BST, lumi = ~1331.8/pb
lappend inputList /store/PRskims/R12/14.4.0c/A0/01/A0_0116%rejectRun=21618%selectEventSequence=248540-580343
# 592293/20068686 events selected from 111 on-peak runs, added to dataset at 2004/04/04-00:34:50-BST, lumi = ~1347.9/pb
lappend inputList /store/PRskims/R12/14.4.0c/A0/01/A0_0119%rejectRun=22538
# 577556/20161405 events selected from 124 on-peak runs, added to dataset at 2004/04/04-00:34:50-BST, lumi = ~1343.6/pb
lappend inputList /store/PRskims/R12/14.4.0c/A0/01/A0_0127%selectEventSequence=1-75903
## In this tcl file: 3 collections, 1000000 events
This includes three collections,
specified with lappend
inputList Framework Tcl commands. The extended
collection name syntax is used. Exactly 1 million events per job
are specified with %selectEventSequence.
The %rejectRun entries are there
to remove runs that were determined to be bad after the skimming was
complete. The comments give information on each collection, including
the number of events selected by the skimming. Events from the rejected
runs are excluded from these totals, though the splitting
between files does not take this into account, so the job may encounter
a slightly different number of events in the first/last collection
(this will be fixed in a future version of BbkDatasetTcl).
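The event bookkeeping in the Tcl file above can be checked by hand. The following sketch splits each extended collection name at the % separators and adds up the events that each lappend line asks for. It assumes that the %selectEventSequence range is inclusive at both ends, and the function names are made up for this illustration. As noted above, the count ignores any effect of the rejected runs on the split.

```python
def parse_collection(entry):
    """Split 'name%key=value%key=value' into (name, {key: value})."""
    name, *modifiers = entry.split("%")
    options = {}
    for modifier in modifiers:
        key, _, value = modifier.partition("=")
        options[key] = value
    return name, options


def sequence_length(value):
    """Number of events in a 'first-last' range (assumed inclusive)."""
    first, last = (int(part) for part in value.split("-"))
    return last - first + 1


# (extended collection name, events selected by the skim) from the file above
entries = [
    ("/store/PRskims/R12/14.4.0c/A0/01/A0_0116"
     "%rejectRun=21618%selectEventSequence=248540-580343", 580343),
    ("/store/PRskims/R12/14.4.0c/A0/01/A0_0119%rejectRun=22538", 592293),
    ("/store/PRskims/R12/14.4.0c/A0/01/A0_0127"
     "%selectEventSequence=1-75903", 577556),
]

total = 0
for entry, selected_events in entries:
    name, options = parse_collection(entry)
    if "selectEventSequence" in options:
        # Only part of the collection is read by this job.
        total += sequence_length(options["selectEventSequence"])
    else:
        # The whole collection is read.
        total += selected_events

print(total)  # 331804 + 592293 + 75903 = 1000000
```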
Another option, --basename,
allows you to specify a different output file name (.tcl and the
sequence numbers
are still added).
When using a dataset that is still ongoing,
you need to take
care not to reanalyse data you have already included (at best, this
wastes CPU, at worst it artificially inflates your luminosity!). For
these
examples, let's update a run 4 analysis and, for simplicity, imagine
that we are putting our entire analysis in a single job. Hopefully we
still have the old Tcl file, Tau11-Run4-OnPeak-R14.tcl.
At the top is the line
## This file was generated automatically on 2004/05/29-13:27:43+0100
(though since it will have been generated with a previous version of
BbkDatasetTcl, the date will be in a different format without the
timezone). We can update this with
% BbkDatasetTcl Tau11-Run4-OnPeak-R14 --since=2004/05/29-13:27:43+0100
BbkDatasetTcl: wrote Tau11-Run4-OnPeak-R14.tcl
Selected 33 collections, 99529441/589347702 events, ~41441.6/pb
wrote : Tau11-Run4-OnPeak-R14-bad-runs.txt (231 runs, ~3891.4/pb)
The events associated with these runs at the start time are now known to be bad.
Please removed or block events with these run numbers to protect from
possible double counting and use of bad data.
You should include the timezone to be sure of using the exact same
time, as the dataset is modified continuously. The Tcl file will
contain the new collections, which may include events that were
reprocessed in the meantime. You should make sure that you exclude the
runs listed in Tau11-Run4-OnPeak-R14-bad-runs.txt
from your combined results (eg. by excluding those runs from your
n-tuple).
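How you exclude the bad runs depends on your own n-tuple format, but the idea can be sketched as follows. Everything here is an assumption made for illustration: the layout of the bad-runs file is not described on this page (the sketch accepts whitespace-separated run numbers and ignores anything after a #), the run numbers and the "mass" field are invented, and the events are represented as plain dictionaries rather than a real n-tuple.

```python
def parse_bad_runs(text):
    """Collect run numbers from text (assumed: whitespace-separated integers)."""
    runs = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop anything after a '#'
        for token in line.split():
            if token.isdigit():
                runs.add(int(token))
    return runs


def drop_bad_runs(events, bad_runs):
    """Keep only the events whose run number is not in bad_runs."""
    return [event for event in events if event["run"] not in bad_runs]


# Hypothetical file contents and events, for illustration only.
bad_runs = parse_bad_runs("42400\n42402  # invented entries\n")
events = [
    {"run": 42400, "mass": 1.77},
    {"run": 42401, "mass": 1.78},
    {"run": 42402, "mass": 1.76},
]
kept = drop_bad_runs(events, bad_runs)
print(len(kept), "of", len(events), "events kept")
```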
You can also show the state of the dataset at any time in the past
by specifying a date with the --end option. Note that, so that
consecutive time ranges cover all times without overlapping, --since
excludes the given date, but --end includes it.
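The boundary rule for --since and --end amounts to a time window that is open at the start and closed at the end. The sketch below illustrates that rule with the timestamp format used in the newer Tcl file headers. The function names are invented for this example, and it only models the comparison, not how the bookkeeping tools implement it.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y/%m/%d-%H:%M:%S%z"  # eg. 2004/05/29-13:27:43+0100


def parse_stamp(text):
    """Parse a timestamp with a numeric timezone offset."""
    return datetime.strptime(text, STAMP_FORMAT)


def in_window(added, since=None, end=None):
    """True if since < added <= end (since excluded, end included)."""
    if since is not None and added <= since:
        return False
    if end is not None and added > end:
        return False
    return True


cut = parse_stamp("2004/05/29-13:27:43+0100")
later = cut + timedelta(seconds=1)

# Something added exactly at the cut belongs to the old selection only.
print(in_window(cut, end=cut), in_window(cut, since=cut))
# Something added one second later belongs to the update only.
print(in_window(later, end=cut), in_window(later, since=cut))
```

Because the offset is part of the timestamp, the same instant written in another timezone (eg. 2004/05/29-12:27:43+0000) compares equal, which is why including the timezone matters.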
Every now and again the Data Quality Group and Physics Analysis
Coordinator specify an official dataset tag,
which can be used in place of the dataset name. For run 4 we had
R14 tags called "GreenCircle", "BlueSquare", "BlueSquarePrime",
"BlackDiamond", and "Total", eg.
% BbkDatasetTcl Tau11-Run4-OnPeak-R14-GreenCircle
BbkDatasetTcl: wrote Tau11-Run4-OnPeak-R14-GreenCircle.tcl
Selected 44 collections, 146469352/868642325 events, ~55048.9/pb
You can also specify a dataset tag, like Tau11-Run4-OnPeak-R14-GreenCircle,
for the --since or --end dates. The above command is
equivalent to
% BbkDatasetTcl Tau11-Run4-OnPeak-R14 --end Tau11-Run4-OnPeak-R14-GreenCircle
except for the name of the output Tcl file.
Distributed Analysis
All these commands access the SLAC database by default (soon this
default
will be changed to access the local database if you are at
another site). You can access another site by specifying it on the
command line:
% BbkDatasetTcl --site=ral
When the bookkeeping tools are used for the first time at a new
site, you may
notice a short delay. A directory ~/.bbk/sites
is created containing the connection information for SLAC, Tier A, and
some Tier C databases (this is copied from SLAC via ssh or AFS, so if
you have problems try getting a SLAC AFS token the first time). This is
only updated when there is a problem connecting.
It is usually best to connect to the database of the site where you
intend to run the analysis. This contains a record of which collections
are available at that site, so if the dataset is not present or is
incomplete, BbkDatasetTcl will warn you when you create the Tcl file
and include
only those collections that are available (override with the --nolocal
option).
Currently the dataset list shows all datasets, but a future version
of BbkDatasetTcl will restrict it to just those available at the
specified site. This will also
be used to limit the datasets shown at SLAC to those that are not
blocked.
More Refined Selections: BbkExpertTcl
One of the purposes of using datasets is to provide predefined
selections. Anyone who had to use the old skimData command will
appreciate this. Nevertheless, in some cases, especially for detector
studies, it may be necessary to hone the selection further.
BbkExpertTcl is identical to BbkDatasetTcl, except that it enables the
full set of selections provided by the database (the selection options
are the same as the BbkUser command described below). As well as
selecting by run number, there are currently 100 other
selectors defined (not counting database ids and derived selectors)!
The full list can be obtained with
BbkExpertTcl -h.
One fairly common case is selecting by run period. There are
separate datasets for run cycles run1, run2, run3, run4, and run5, but
not for finer delineations (eg. run1a or 200309), so one may be tempted to
select by run number or condalias. Unfortunately there are a few
problems with this:
- it involves querying the run table, which is enormous (due to SP
runs).
- the record of which runs are in each SP skim collection has been
removed (there were too many of them!)
- SP and skim collections are merged from many runs, so the
selected collections may also contain other runs.
A command like
% BbkExpertTcl Tau11-Run4-OnPeak-R14 --run=42400-42404
BbkExpertTcl: wrote Tau11-Run4-OnPeak-R14.tcl
Selected 3 collections, 10176060/60438180 events, ~44.6/pb
includes %selectRun=42400-42404
on each collection to ensure that only the required runs are processed
(the rest will be skipped by the job Framework). Where that won't work
(SP skims, or where the database is too slow), you can use --run_select=42400-42404 (or eg. --condalias_select=200309-200407),
which returns all collections in the dataset with the %selectRun or %selectCondAlias applied. The use of
--run_select and --condalias_select is fairly safe,
so they will shortly be enabled for use in BbkDatasetTcl too.
Information for Experts: BbkUser
BbkUser allows arbitrary queries with arbitrary selection of most of
the information available in the core bookkeeping.
For example, the following command lists the input and output events,
collection luminosity,
and collection name for collections in the Tau11-Run4-OnPeak-R14
dataset that are available at the local site.
% BbkUser --dataset=Tau11-Run4-OnPeak-R14 --is_local=1 \
          events_in events dse_lumi collection \
          --display --style=adye
EVENTS_IN EVENTS  DSE_LUMI COLLECTION
========= ======= ======== ==============================================
 20057716 3217405   1250.3 /store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0239
 20161268 3394079   1348.4 /store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0240
 20147656 3306032   1289.3 /store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0241
...
 20229215 3407837   1409.0 /store/PRskims/R14/14.4.4e/Tau11/15/Tau11_1551
80 rows returned
The --display option outputs
numbers in "display format" (eg. luminosities in inverse picobarns to
one decimal place, rather than inverse nanobarns as they are stored in
the database). --style=adye (stupid name, I know - not my choice!)
lines up the output in columns.
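The luminosity conversion behind the display format is just a factor of 1000 (1/pb = 1000/nb). A minimal sketch follows. The function name is invented, and the raw value used in the example is back-computed from the 1250.3 shown in the table above, so it is illustrative rather than a value read from the database.

```python
def lumi_display(lumi_inverse_nb):
    """Convert inverse nanobarns to inverse picobarns, one decimal place."""
    return f"{lumi_inverse_nb / 1000.0:.1f}"


# Hypothetical stored value, chosen to reproduce the first row shown above.
print(lumi_display(1250300))
```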
The --summary option allows you to calculate totals of various
quantities. Just list the quantities you are interested in: enumerated
types (like run number or dse_id) are counted, and numbers (like events,
dse_lumi, or gbytes) are summed. The results are listed for each
combination of the remaining string values (like collection or file name). Eg.
% BbkUser --dataset='Tau11-Run4-*Peak-R14' \
          dataset components dse_id events dse_lumi gbytes \
          --summary --style=adye
DATASET                COMPONENTS #DSE_ID   +EVENTS +DSE_LUMI +GBYTES
====================== ========== ======= ========= ========= =======
Tau11-Run4-OffPeak-R14 HBCA            10  25015705   10117.8    41.3
Tau11-Run4-OnPeak-R14  HBCA            80 255400590  103845.2   419.9
====================== ========== ======= ========= ========= =======
Totals                                 90 280416295  113963.0   461.1
333 rows returned
See BbkUser -h for a
full list of selection and query values (85 of them!).
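The counting and summing rule used by --summary can be imitated on a few rows to see what the # and + columns mean. This is only a sketch of the rule as described above, not of how BbkUser implements it. It assumes that "counted" means counting distinct values, the function name summarise is made up, and the dse_id values are invented. The events and luminosities are the first three rows of the earlier collection listing, so the totals cover only those three rows and not the whole dataset.

```python
def summarise(rows, group_keys, count_keys, sum_keys):
    """Group rows by string columns, count distinct ids, and sum numbers."""
    groups = {}
    for row in rows:
        group = tuple(row[key] for key in group_keys)
        entry = groups.setdefault(
            group,
            {"distinct": {key: set() for key in count_keys},
             "totals": {key: 0 for key in sum_keys}},
        )
        for key in count_keys:
            entry["distinct"][key].add(row[key])
        for key in sum_keys:
            entry["totals"][key] += row[key]

    summary = {}
    for group, entry in groups.items():
        # '#' marks a counted column and '+' a summed one, as in the output above.
        line = {"#" + key: len(values) for key, values in entry["distinct"].items()}
        line.update({"+" + key: total for key, total in entry["totals"].items()})
        summary[group] = line
    return summary


rows = [  # dse_id values are hypothetical
    {"dataset": "Tau11-Run4-OnPeak-R14", "dse_id": 1, "events": 3217405, "dse_lumi": 1250.3},
    {"dataset": "Tau11-Run4-OnPeak-R14", "dse_id": 2, "events": 3394079, "dse_lumi": 1348.4},
    {"dataset": "Tau11-Run4-OnPeak-R14", "dse_id": 3, "events": 3306032, "dse_lumi": 1289.3},
]

summary = summarise(rows, ["dataset"], ["dse_id"], ["events", "dse_lumi"])
for group, line in summary.items():
    print(group, line)
```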
A few notes on the use of BbkUser:
- BbkUser translates its options into an SQL query - often with
some pre- or post-processing (like dataset pruning or summary
generation). You can use the
-s
option to see what query it uses. This is probably a good idea if you
are trying a new sort of query for the first time.
- If you return information on datasets, runs, files, etc where
there is the possibility of more than one dataset, run, file, etc per
collection, then you'll get the same collection listed more than once
(eg. once per file). You can use
--distinct
to remove duplicate lines.
- You can also use SQL expressions to calculate derived or
aggregate values, though this can sometimes produce confusing or
unexpected results.
- Some aggregate values have BbkUser aliases like
tot_lumi, tot_bytes, and nruns.
These can be used to produce efficient queries, but may not aggregate
over the values you are interested in. The rule is (roughly) that a
total is calculated for each unaggregated value that is returned in the
same query. One important restriction is that this cannot aggregate
collection totals over pruned datasets (since aggregation is done by
the server and dataset pruning is done by the client). BbkUser will
issue a warning if you try to do this. A better (if slower) way to do
this is to use --summary.
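The effect of --distinct mentioned in the notes above can be pictured as dropping repeated output lines. The sketch below keeps the first occurrence of each line and preserves the order, which is an assumption of this example (this page does not say how BbkUser orders or removes duplicates). The file names are invented, and the collection name is taken from the listing above.

```python
def distinct(rows):
    """Remove duplicate rows, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))


# One output line per file: the file names here are hypothetical.
per_file = [
    ("/store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0239", "file-1"),
    ("/store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0239", "file-2"),
    ("/store/PRskims/R14/14.4.0d/Tau11/02/Tau11_0240", "file-1"),
]

# If only the collection is requested, the lines repeat once per file.
collections_only = [(collection,) for collection, _ in per_file]
print(distinct(collections_only))
```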
BbkUser is based on the SQL
selection API.
More Info...
For more information and information on other subjects related to
the bookkeeping, like database mirroring, see the Bookkeeping
Documentation page.
/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/BbkUserTools.html
Last modified on 27th May 2005 by
Tim Adye,
<T.J.Adye@rl.ac.uk>