To see a list of data sets available at your site, enter the command
> BbkDatasetTcl
without arguments. (To keep the output from flying by all at once,
you may wish to pipe this command to less ("BbkDatasetTcl | less"),
or redirect it to a file ("BbkDatasetTcl >& BbkDatasetTcl.txt").)
The available datasets are also documented at the Data Quality homepage,
which links to pages describing the data sets for each release
(Release 18, for example).
If you know the name of the dataset that you want, you can search for it
with the "-l WILDCARD" option, for example:
> BbkDatasetTcl -l "Inclppbar*"
BbkDatasetTcl: 14 datasets found in bbkr18 at slac:-
Inclppbar-Run1-OffPeak-R18b
Inclppbar-Run1-OnPeak-R18b
Inclppbar-Run2-OffPeak-R18b
Inclppbar-Run2-OnPeak-R18b
Inclppbar-Run3-OffPeak-R18b
...
Data set names
The names of the different data sets have the following form:
SkimName-Run[1-5]-[On/Off]Peak-RXX[a/b/c]
Examples:
AllEventsSkim-Run4-OnPeak-R18b
BchToD0KstarAll-Run4-OffPeak-R14
BFourHHHE-Run3-OffPeak-R18b
InclDs-Run4-OnPeak-R16a
Inclppbar-Run2-OnPeak-R18b
LambdaC-Run3-OnPeak-R16b
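If you want to pick these names apart in a script, the naming pattern above can be turned into a regular expression. The following is only a sketch: the function name parse_dataset_name is made up for this illustration, and the skim-cycle letter is treated as optional because names such as "BchToD0KstarAll-Run4-OffPeak-R14" do not carry one.

```python
import re

# SkimName-Run[1-5]-[On/Off]Peak-RXX[a/b/c]
# The trailing skim-cycle letter is optional (assumption based on the
# "-R14" example, which has no letter).
DATASET_RE = re.compile(
    r"^(?P<skim>[A-Za-z0-9]+)"
    r"-Run(?P<run>[1-5])"
    r"-(?P<peak>On|Off)Peak"
    r"-R(?P<release>\d+)(?P<cycle>[a-c]?)$"
)


def parse_dataset_name(name):
    """Split a data set name into its parts; return None if it does not match."""
    m = DATASET_RE.match(name)
    if m is None:
        return None
    parts = m.groupdict()
    parts["run"] = int(parts["run"])
    parts["release"] = int(parts["release"])
    return parts


print(parse_dataset_name("Inclppbar-Run2-OnPeak-R18b"))
print(parse_dataset_name("BchToD0KstarAll-Run4-OffPeak-R14"))
```

The meaning of each part (skim name, run, release, skim cycle) is explained in the sections that follow.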
SkimName
SkimName indicates the type of data set or skim.
The most general data set available is the AllEvents data set.
Data begins as signals in the different subdetectors. Then it is converted
to digital format and stored in an XTC file. The XTC files are sent to the
prompt reconstruction ("PR") system, which reconstructs particle candidates
from the detector signals. The output of prompt reconstruction is the
AllEvents dataset.
The AllEventsSkim data set is similar to AllEvents, except that
each event is labeled with over a hundred tags. Tags are
boolean variables (set to true or false) that indicate whether an
event has a given characteristic. For example, the Jpsitoll tag is set to
true if the event contains a J/psi to l+l- decay, and false otherwise.
The AllEventsSkim data set is created when a skim executable is
run over AllEvents.
The remaining data sets are skims produced from AllEventsSkim.
A skim is a subset of the data whose events all have the same
value of a given tag (or tags). For example, the Jpsitoll skim is the
subset of events in AllEventsSkim that have the Jpsitoll tag=true.
A skim does not necessarily consist of a physical copy of events in
AllEventsSkim - sometimes it consists of pointers to the skim events,
instead. But "deep copy" and pointer skims look the same to the user.
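The relationship between tags, "deep copy" skims and pointer skims can be pictured with a toy model. This is not how the BaBar event store is implemented; the event layout, the tag name SomeOtherTag, and all function names are invented for the illustration.

```python
import copy

# Toy events, each carrying a dictionary of boolean tags.
events = [
    {"id": 0, "tags": {"Jpsitoll": True, "SomeOtherTag": False}},
    {"id": 1, "tags": {"Jpsitoll": False, "SomeOtherTag": True}},
    {"id": 2, "tags": {"Jpsitoll": True, "SomeOtherTag": True}},
]


def pointer_skim(all_events, tag):
    """Pointer skim: store only the positions of the selected events."""
    return [i for i, ev in enumerate(all_events) if ev["tags"].get(tag, False)]


def deep_copy_skim(all_events, tag):
    """Deep-copy skim: store independent copies of the selected events."""
    return [copy.deepcopy(ev) for ev in all_events if ev["tags"].get(tag, False)]


def read_pointer_skim(all_events, pointers):
    """Follow the pointers back to the original events."""
    return [all_events[i] for i in pointers]


pointers = pointer_skim(events, "Jpsitoll")
copies = deep_copy_skim(events, "Jpsitoll")

# Both kinds of skim give the user the same events.
print([ev["id"] for ev in read_pointer_skim(events, pointers)])
print([ev["id"] for ev in copies])
```

Either way, the user reads back the same list of events, which is why the two kinds of skim look the same in an analysis job.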
Run[1-5]: Run Cycle
The Run Cycle denotes the data-taking period, as
shown in the table below.
Run Periods and SP
| Run | Begin | End |
|---|---|---|
| Run1 | Feb 2000 | Oct 2000 |
| Run2 | Feb 2001 | Jun 2002 |
| Run3 | Dec 2002 | Jun 2003 |
| Run4 | Sep 2003 | Jul 2004 |
| Run5 | May 2005 | |
RXX: Release series
RXX is the release series of the reconstruction software used to
process the data. Data is initially processed with whatever release is
current at the time. But later, when a new and improved release becomes
available, the data is reprocessed using the new software. In general you
want to use the data set that was processed using the same software as your
test release. For example, in the Quicktour you used analysis-30, which is
18.6.2a, an 18-series release. So you would want the data sets ending in "R18".
Fortunately, BbkDatasetTcl is smart: once you have entered
"srtpath" and "cond18boot" from analysis-30, it knows that you are
using an 18-series release, and will list only the R18 collections.
This year (2006), everyone will be using R18 data, so you will
probably not have to worry about older releases unless you are
continuing an older analysis.
[a/b/c]: Skim cycle
Skims are produced at regular intervals, in skim cycles.
This ensures that researchers do not have to wait too long to
create or update skims to incorporate new physics ideas.
Usually several skim cycles are run for a given release,
so different versions of a skim often share the same skim name, run, and release.
The versions are therefore distinguished by the skim-cycle letter [a/b/c].
For R18 there have been 3 skim cycles:
- R18a - Just a test, not to be used for analysis.
- R18b - Second skim cycle, almost done (Jan 2006).
- R18c - Third skim cycle, just starting (Jan 2006).
So at the moment, BbkDatasetTcl lists only R18b collections.
BbkDatasetTcl also lists the available Monte Carlo (simulated) sets.
The names are similar to the data set names, of the form:
SP-XXXX-SkimName-Run[1-5]-RXX[a/b/c]
For example:
SP-1237-AllEventsSkim-Run5-R18b
SP-1005-B0ToRhoPRhoM-Run4-R16b
SP-1235-BFourHHHE-Run3-R18b
SP-1235-InclK0s-Run4-R18b
SP-1235-Inclppbar-Run2-R18b
SP-1237-BPCBhabha-Run3-R14
SP-XXXX: mode number
The names of Monte Carlo sets begin with the prefix "SP-XXXX", where
XXXX is a 3- or 4-digit mode number.
A list of the available physics
decay modes can be found on the MC Production home page.
To find the definition of a certain decay mode, for example 1237, you
can use BbkSPModes:
> BbkSPModes --modenum 1237
The system will respond:
: Mode : Decfile : Generator : Filter : Run Type : Category :
: 1237 : B0B0bar_generic.dec : Upsilon(4S) : : B0B0bar generic : generic decays :
To find out more about BbkSPModes, you can check the BbkSPModes web page, or type "BbkSPModes --help" at the command line.
SkimName
If you use a skim of your data set, then you will want to study the same
skim of your Monte Carlo set, so you can compare the two. Decay modes like
1237=B0B0bar_generic are standard decays that show up (as background) in
nearly all analyses, so nearly all skims are run over decay mode 1237.
However, for other decays, like 1261=B+B-_pi_D0-Kpipi0, the only skims
available are:
SP-1261-AllEventsSkim-SP5-R14
SP-1261-BchToD0KAll-R14
SP-1261-BchToD0KAll-R16a
SP-1261-BchToD0KAll-SP5-R14
SP-1261-Run2
This probably means that mode 1261 was produced for a particular analysis
that uses only the BchToD0KAll skim. So only the BchToD0KAll skim (and
the AllEventsSkim, from which this skim is derived) were produced.
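Since you usually want the same skim for data and Monte Carlo, it can be handy to derive the expected MC set name from a data set name. The sketch below simply rearranges the name parts according to the naming forms given above; matching_mc_name is a hypothetical helper, and it only builds a name. As the mode 1261 example shows, the resulting set may not exist, so check with "BbkDatasetTcl -l".

```python
import re

# Data set names: SkimName-Run[1-5]-[On/Off]Peak-RXX[a/b/c]
DATA_RE = re.compile(
    r"^(?P<skim>[A-Za-z0-9]+)-Run(?P<run>[1-5])"
    r"-(?:On|Off)Peak-(?P<release>R\d+[a-c]?)$"
)


def matching_mc_name(dataset, mode):
    """Build the MC set name SP-XXXX-SkimName-RunN-RXX[a/b/c] for a data set."""
    m = DATA_RE.match(dataset)
    if m is None:
        raise ValueError(f"unrecognized data set name: {dataset!r}")
    return f"SP-{mode}-{m['skim']}-Run{m['run']}-{m['release']}"


print(matching_mc_name("Inclppbar-Run2-OnPeak-R18b", 1235))
```

Note that the On/OffPeak part is dropped, because the MC names listed above do not contain it.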
Run[1-5]
Runs are data-taking periods, not MC production periods. However,
Monte Carlo data sets are designed to reproduce the data as closely
as possible, including the conditions (detector, online, parameters)
at the time. So MC data sets are labeled with Run Cycles that
indicate which data sets they are intended to model.
Simulation Productions: SPN
Simulated (Monte Carlo) data sets are produced in Simulation Production
(SP) cycles:
SP1, SP2, SP3 = obsolete
SP4 = Release 10
SP5 = Release 12
SP6 = Release 14
SP7 = none
SP8 (in production) = Release 18
(SP7 would have been Release 16, but they decided not to produce it.)
Data Location
The plain "BbkDatasetTcl" command tells you only about the
collections at the Tier A site that you are logged in to. A Tier A
site is a computing facility. BaBar's Tier A sites are:
- SLAC (slac)
- Bologna (cnaf)
- IN2P3 (ccin2p3)
- GridKa (gridka)
- RAL (ral)
Each skim is assigned to an AWG (Analysis Working Group). So in order to access your skim,
you need to log in to the Tier A site that hosts the AWG that
owns your skim. Detailed information is available here:
Data Distribution page
Once you have determined which data sets you need, you can use BbkDatasetTcl
to produce tcl files that tell your application how to access the data sets.
The simplest form of the BbkDatasetTcl command is:
> BbkDatasetTcl DATASET
This will produce a single file, DATASET.tcl.
In practice, however, you will probably want to use a more complicated
BbkDatasetTcl command:
> BbkDatasetTcl DATASET --tcl Nmax --splitruns --basename MYNAME
where:
- DATASET: The name of the data set.
- --tcl Nmax: Produces multiple tcl files, each with a
maximum of Nmax events.
- --splitruns: Allows BbkDatasetTcl to split collections, so
that you end up with exactly Nmax events per file (except the last one,
of course) instead of at most Nmax events per file.
- --basename MYNAME: A name of your choice. If you do not
use this option, then BbkDatasetTcl will use the default name: DATASET.
For example, in the Quicktour you used BbkDatasetTcl to produce a
single tcl file:
> BbkDatasetTcl SP-1237-Run4
However, if there are too many events in your tcl file, your
job may fail due to CPU time limits. To avoid this, you will want
to divide the events among many tcl files:
> BbkDatasetTcl SP-1237-Run4 --tcl 100k --splitruns --basename MC-B0B0bar-Run4
Now instead of one big tcl file, you have many tcl files of 100k
events each: MC-B0B0bar-Run4-1.tcl, MC-B0B0bar-Run4-2.tcl, MC-B0B0bar-Run4-3.tcl
... MC-B0B0bar-Run4-N.tcl.
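To see what --splitruns changes, here is a toy model of the two ways of dividing collections among tcl files. It is a sketch of the idea only, not BbkDatasetTcl's actual algorithm; the function names are invented, and the treatment of a collection that is larger than Nmax in the first function is an assumption.

```python
def split_whole_collections(collections, nmax):
    """Without splitting: pack whole collections, at most nmax events per file.

    collections is a list of (name, n_events). A single collection larger
    than nmax gets a file of its own (assumption of this sketch).
    """
    files, current, count = [], [], 0
    for name, n in collections:
        if current and count + n > nmax:
            files.append(current)
            current, count = [], 0
        current.append((name, n))
        count += n
    if current:
        files.append(current)
    return files


def split_exact(collections, nmax):
    """With splitting: cut collections so every file but the last has nmax events."""
    files, current, room = [], [], nmax
    for name, n in collections:
        remaining = n
        while remaining > 0:
            take = min(remaining, room)
            current.append((name, take))
            remaining -= take
            room -= take
            if room == 0:
                files.append(current)
                current, room = [], nmax
    if current:
        files.append(current)
    return files


def sizes(files):
    """Number of events in each file."""
    return [sum(n for _, n in f) for f in files]


collections = [
    ("SP_001237_013238", 138000),
    ("SP_001237_013240", 48000),
    ("SP_001237_013270", 6000),
    ("SP_001237_013286", 138000),
]
print(sizes(split_whole_collections(collections, 100000)))
print(sizes(split_exact(collections, 100000)))
```

In this model the unsplit files have uneven sizes, while the split files all hold exactly Nmax events except the last one. That evenness is what keeps the batch time per job predictable.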
You can determine how many events to include in one tcl
file by submitting a test job and checking whether it finishes before it hits the CPU time limit.
Note that
all datasets evolve continuously with time. This is due to new data being added,
old data being found to be bad, or reprocessing.
For more information about how to use BbkDatasetTcl to produce
tcl files, see
the Bookkeeping User Tools web page
(in particular the section
on "Evolving Datasets" gives details on how to update an analysis when new data becomes available), or
type "BbkDatasetTcl --help"
at the command line.
If you look in your SP-1237-Run4.tcl file from the Quicktour,
you will see a bunch of lines like this:
# 138000/138000 events selected from 69 on-peak runs, added to dataset at 2005/11/04-22:50:14-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
# 48000/48000 events selected from 24 on-peak runs, added to dataset at 2005/11/05-04:48:59-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013240
# 6000/6000 events selected from 3 on-peak runs, added to dataset at 2005/11/05-04:48:58-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013270
# 138000/138000 events selected from 69 on-peak runs, added to dataset at 2005/11/03-22:47:54-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013286
# 104000/104000 events selected from 52 on-peak runs, added to dataset at 2005/11/03-22:47:54-PST, lumi = ~0.0/pb
The lines that begin with a "#" are
comments:
- "A/B events selected": In a data set of B events, only A of them
were from "Good" runs satisfying certain quality requirements. For MC
(simulated) sets, all runs are good runs (there's no point in simulating
bad ones), so A=B. But for data sets, usually A<B so this line is not
redundant.
- "from N on-peak runs": These small-r runs are different than the
big-R Runs 1-5. A small-r run is a much smaller data-taking period, just
one batch of analysis within a Framework job.
- "lumi=X/pb": The (time-integrated) luminosity in pb-1. This is always
set to 0.0/pb for MC, but it is nonzero for data. Luminosity will be
discussed in more detail in the luminosity section
below.
The other lines are tcl commands. "lappend
inputList" is a tcl command that adds a collection to the input module's
list of collections to be analyzed.
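Because the comments follow a regular pattern, a short script can total the events in a tcl file. The sketch below assumes that each comment line immediately precedes the "lappend" line it describes, as in the excerpt above; the function name summarize_tcl is made up for this example.

```python
import re

COMMENT_RE = re.compile(r"^#\s*(\d+)/(\d+) events selected from (\d+) \S+ runs")
LAPPEND_RE = re.compile(r"^lappend\s+inputList\s+(\S+)")

SAMPLE = """\
# 138000/138000 events selected from 69 on-peak runs, added to dataset at 2005/11/04-22:50:14-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
# 48000/48000 events selected from 24 on-peak runs, added to dataset at 2005/11/05-04:48:59-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013240
"""


def summarize_tcl(text):
    """Return one record per collection, using the comment line before it."""
    records, pending = [], None
    for line in text.splitlines():
        line = line.strip()
        m = COMMENT_RE.match(line)
        if m:
            pending = tuple(int(g) for g in m.groups())
            continue
        m = LAPPEND_RE.match(line)
        if m:
            selected, total, runs = pending if pending else (None, None, None)
            records.append({"collection": m.group(1), "selected": selected,
                            "total": total, "runs": runs})
            pending = None
    return records


records = summarize_tcl(SAMPLE)
print(len(records), "collections,", sum(r["selected"] for r in records), "events selected")
```

To run it on a real file, pass the file's contents, for example summarize_tcl(open("SP-1237-Run4.tcl").read()).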
Tcl files produced by BbkDatasetTcl can be used directly in your
analysis. The next section explains how to do this.
Tcl files produced by BbkDatasetTcl are designed to be used directly in
BaBar analysis. The following example shows how you can modify
the analysis described in the Workbook Quicktour
to take advantage of this.
The Workbook Quicktour and
the examples based on it do NOT use BbkDatasetTcl files to add collections to
the input list.
It is very simple to add all of the collections in the SP-1237-Run4 data
set to your job's input list. Just add the following line to your
snippet.tcl file:
source SP-1237-Run4.tcl
before the line "sourceFoundFile BetaMiniUser/MyMiniAnalysis.tcl".
Handling multiple tcl files
If you have multiple tcl files, then you will need multiple snippets. For
example, if you have used the "--tcl Nmax" switch to divide the dataset among
100 tcl files, SP-1237-Run4-1.tcl, SP-1237-Run4-2.tcl, ...
SP-1237-Run4-100.tcl, then you will need 100 snippets; say snippet-1.tcl,
snippet-2.tcl ... snippet-100.tcl. You're also going to end up with 100 log
files and 100 output files. So it's a good idea to make a directory
for each one:
workdir> mkdir log
workdir> mkdir tcl
workdir> mkdir snippet
workdir> mkdir output
If you've already produced your 100 tcl files, move
them to the tcl directory. Otherwise, you can put them
there in the first place by running BbkDatasetTcl from
the tcl directory.
The difference between each snippet will be the two lines
(for example, for the 36th tcl file):
source tcl/SP-1237-Run4-36.tcl
set histFileName output/SP-1237-Run4-36.root
The command to run job 36 is then:
workdir> bsub -q kanga -o log/SP-1237-Run4-36.log BetaMiniApp snippet/snippet-36.tcl
It would be very time-consuming to make 100 snippet files
yourself, so most users develop scripts that generate large batches
of snippets. Here is a Perl script
that generates the 100 snippets. To run it, copy it to your workdir directory and enter the command:
workdir> perl MultiSnippets.pl SP-1237-Run4 100
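If you would rather write your own generator, the idea is simple enough to sketch in a few lines. The Perl script itself is not reproduced here, so the following is not that script, just an illustration of the approach in Python. It assumes that each snippet consists only of the two job-specific lines shown above followed by the sourceFoundFile line from the Quicktour snippet; a real snippet may contain further settings. The function names are invented.

```python
from pathlib import Path


def make_snippet(dataset, n):
    """Return the text of the snippet for the n-th tcl file of a data set."""
    return (
        f"source tcl/{dataset}-{n}.tcl\n"
        f"set histFileName output/{dataset}-{n}.root\n"
        "sourceFoundFile BetaMiniUser/MyMiniAnalysis.tcl\n"
    )


def write_snippets(dataset, count, directory="snippet"):
    """Write snippet-1.tcl ... snippet-<count>.tcl into the given directory."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in range(1, count + 1):
        path = out_dir / f"snippet-{n}.tcl"
        path.write_text(make_snippet(dataset, n))
        paths.append(path)
    return paths


print(make_snippet("SP-1237-Run4", 36))
```

Calling write_snippets("SP-1237-Run4", 100) from workdir would then create the 100 files in the snippet directory.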
Now that you have to run 100 jobs, you probably do not want
to enter user input for each job. To make the job run by itself
instead of giving you a framework prompt (">"), add the following
lines at the end of "MyMiniAnalysis.tcl":
ev begin
exit
"ev begin" starts the job and runs over all events. "exit" exits the
framework. In general, any commands that you enter at the framework
prompt (">") can be put in a tcl file.
(In fact, the original MyMiniAnalysis.tcl contained an "ev begin"
command, which was removed to make the job run interactively.)
Now that your jobs run without stopping and do not require
user input, you can submit them to the batch queue.
The commands (run from workdir) are:
bsub -q bldrecoq -o log/SP-1237-Run4-1.log BetaMiniApp snippet/snippet-1.tcl
bsub -q bldrecoq -o log/SP-1237-Run4-2.log BetaMiniApp snippet/snippet-2.tcl
...
bsub -q bldrecoq -o log/SP-1237-Run4-100.log BetaMiniApp snippet/snippet-100.tcl
You will probably want to put these 100 commands in a script and then
source it. (Any command that you enter at the command-line can
be put in a unix script, or shell script, and run with the command
"source <script name>".) Here is a Perl command to generate the script:
workdir> perl -e 'foreach $N (1..100)
{print "bsub -q bldrecoq -o log/SP-1237-Run4-$N.log BetaMiniApp
snippet/snippet-$N.tcl\n"}' >& submit.job
This is one line, but has been split for formatting purposes.
This will generate a file called submit.job in your workdir. If you
examine it you will find that it contains all 100 of the above commands.
Now all you have to do to submit your 100 jobs is source the script:
workdir> source submit.job
The system will respond with 100 messages like:
Job <840401> is submitted to queue <bldrecoq>.
As usual, you can use "bjobs", "bpeek" and other batch commands
to check the progress of your jobs.
You will probably want to develop a strategy for keeping track of
your many snippets, tcl files, log files, and jobs. The CM2 tutorial provides one example of
tcl-file bookkeeping.
The last thing to look at is the collection names themselves.
Before doing that, you should be aware that a collection in the event
store isn't really the same thing as a file - any given collection is
stored in lots of files, and each file has parts of multiple collections
in it. They also get moved around from time to time to balance loads,
they come and go from tape, etc. So it's not really useful to track down
where the files live in the Unix filesystem. Instead, you should use
BbkDatasetTcl to generate tcl files, and then have a Framework input
module convert those to file locations.
That said, let's look at a collection name. The first collection in
SP-1237-Run4.tcl is:
/store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
- /store All collections are located in /store.
- /SP This is an MC collection.
- /R18 It was produced using Release 18 software.
- /001237 SP mode number is 1237.
- /200309 This MC set was simulated using the
conditions from September 2003.
- /18.6.0b It was produced using SP release 18.6.0b.
- /SP_001237_013238 SP mode number is 1237, and SP merge
number is 013238. Merged skims will be explained in a moment.
Why repeat the mode number? Because you want every file in the
directory /store/SP/R18/001237/200309/18.6.0b/ to have a different name.
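The breakdown above can be written as a small parser. This is a sketch based on the single example collection name shown here; the function name and the returned field names are invented, and other SP collection names may not follow exactly the same layout.

```python
def parse_sp_collection(path):
    """Split an SP collection name such as
    /store/SP/R18/001237/200309/18.6.0b/SP_001237_013238 into its parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 7 or parts[0] != "store" or parts[1] != "SP":
        raise ValueError(f"not an SP collection name: {path!r}")
    _, _, series, mode, conditions, sp_release, name = parts
    prefix, name_mode, merge = name.split("_")
    if prefix != "SP" or name_mode != mode:
        raise ValueError(f"inconsistent collection name: {path!r}")
    return {
        "release_series": series,                  # e.g. "R18"
        "mode": int(mode),                         # 001237 -> 1237
        "conditions_year": int(conditions[:4]),    # 200309 -> 2003
        "conditions_month": int(conditions[4:6]),  # 200309 -> 9
        "sp_release": sp_release,                  # e.g. "18.6.0b"
        "merge_number": merge,                     # kept as text: "013238"
    }


print(parse_sp_collection("/store/SP/R18/001237/200309/18.6.0b/SP_001237_013238"))
```

The merge number is kept as a string so that its leading zero is not lost.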
For real data the collection names are a bit different. For example,
consider the real-data collection:
/store/PR/R18/AllEvents/0001/81/18.1.0c/AllEvents_00018190_18.1.0cV00
- /store All collections are located in /store.
- /PR This is a real data collection.
- /R18 The data was reconstructed using Release 18 software.
- /AllEvents It is part of the AllEvents data set.
- /0001 First 4 digits of the run number.
- /81 Next 2 digits of the run number.
- /18.1.0c It was processed using release 18.1.0c.
- /AllEvents_00018190_18.1.0cV00 The collection is
part of AllEvents, run 00018190. "VXX" tags are introduced
when for some reason someone produces the collection a second time.
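To make the run-number directories concrete, the sketch below rebuilds the example collection name from its run number. It is meant only to illustrate the naming scheme (you should still let BbkDatasetTcl find collections for you). The function name and its default arguments are invented, and the 8-digit zero padding and two-digit "VXX" format are inferred from this one example.

```python
def pr_collection_name(run, series="R18", dataset="AllEvents",
                       release="18.1.0c", version=0):
    """Build a PR collection name from a run number (layout inferred from one example)."""
    run_str = f"{run:08d}"      # 18190 -> "00018190"
    first4 = run_str[:4]        # "0001"
    next2 = run_str[4:6]        # "81"
    return (f"/store/PR/{series}/{dataset}/{first4}/{next2}/{release}/"
            f"{dataset}_{run_str}_{release}V{version:02d}")


print(pr_collection_name(18190))
```

For run 18190 this reproduces the collection name discussed above.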
The point of the "/0001/81/" format is to make it easier for users
to find the collections they want. For example, someone who wanted
all runs beginning with "0001" could just look in the directory
"/store/PR/R18/AllEvents/0001/".
For skim collections, BaBar skims many different (PR or SP) collections
and then merges the output. Collections from these skims begin with
"PRskims" or "SPskims." There are also collections that begin with
"SPruns". SP data is initially generated in the SPruns tree and merged
into the SP tree. The SPruns files are then deleted. The collection
names for PRskims, SPskims, and SPruns also have their own special
conventions. They are not described here, but you can learn about them
by following the links below.
For a more detailed explanation of collection names, refer to the
Extended Collection Names part of the CM2 introduction, and the
link to the RFC provided on that page.
Making your own collections
You can make your own collections during analysis and simulation jobs. A
walk-through example of how to do this appears in the Workbook section
Simulation. There is further information on this
subject in later sections of the BaBar Workbook.
In the Quicktour, you entered the command:
> cond18boot
and the system responded:
Setting OO_FD_BOOT to /afs/slac/g/babar-ro/objy/databases/boot/physics/V7/ana/conditions/BaBar.BOOT
"/afs/slac/g/babar-ro/objy/databases/boot/physics/V7/ana/conditions/BaBar.BOOT"
is a "BOOT file". Whenever a new database is created,
a BOOT file is created as well. The BOOT file tells applications
like BetaMiniApp how to find the database.
In general, if you are using Release XX software, then the command
that you should use is condXXboot. (Note: The release name
should begin with a number, like 16. If it begins with a word, like
analysis-30, then it is a nickname --- so don't use cond30boot!)
(In pre-Release-12 analyses, the boot commands were a bit different:
physboot or data12boot for data, and simuboot or mc12boot for MC.
But it is very unlikely that you will need these commands.)
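The condXXboot rule is easy to get wrong when you only remember the release nickname, so here is the rule as a tiny function. It is a sketch of the naming rule described above for Release 12 and later; the function name is made up, and it does not check that the resulting command actually exists at your site.

```python
import re


def boot_command(release):
    """Return the condXXboot command for a release number such as "18.6.2a"."""
    m = re.match(r"^(\d+)", release)
    if m is None:
        # Names like "analysis-30" are nicknames, not release numbers.
        raise ValueError(f"{release!r} is a nickname; use the release number instead")
    return f"cond{m.group(1)}boot"


print(boot_command("18.6.2a"))
```

Passing a nickname such as "analysis-30" raises an error instead of producing the wrong command cond30boot.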
If you are curious, you can check out the contents of the boot file:
> cat /afs/slac/g/babar-ro/objy/databases/boot/physics/V6/ana/0192/BaBar.BOOT
ooFDNumber=193
ooLFDNumber=65535
ooPageSize=16384
ooLockServerName=objylock06.slac.stanford.edu
ooFDDBHost=objycat02.slac.stanford.edu
ooFDDBFileName=/objy/databases/production/dynamic/physics/V7/0193/BaBar.FDB
ooJNLHost=objyjrnl02.slac.stanford.edu
ooJNLPath=/objy/databases/production/journals/physics/V7/ana/0193/
Basically, the BOOT file sets up the paths to the database.
But you do not need to know about these paths. All that you need to
do is use the correct condXXboot command before you run your analysis.
To determine the luminosity of your (real) data set, you can use BaBar's
bookkeeping tool BbkLumi.
BbkLumi -ds DATASET
BbkLumi --tcl inputfile.tcl
For example, suppose you are using the dataset AllEventsSkim-Run2-OnPeak-R18b.
Then the command:
BbkLumi -ds AllEventsSkim-Run2-OnPeak-R18b
prints the luminosity of the full dataset:
Using B Counting release 18 from dataset name AllEventsSkim-Run2-OnPeak-R18b
Failed on dbname : bbkr14 trying bbkr18
==============================================
Run by penguin at Fri Jan 27 12:16:47 2006
First run = 18190 : Last Run 29435
== Your Run Selection Summary =============
***** NOTE only runs in B-counting release 18 considered *****
***** Use --OPR or --L3 options to see runs without B-counting *****
Number of Data Runs 5085
Number of Contributing Runs 5085
-------------------------------------------
Y(4s) Resonance ON OFF
Number Recorded 5085 0
== Your Luminosity (pb-1) Summary =========
Y(4s) Resonance ON OFF
Lumi Processed 59527.608 0.000
== Number of BBBar Events Summary =========
Number | ERROR
| (stat.) (syst.) (total)
Total 69589824.6 | 43257.7 765488.1 766709.3
==For On / Off subtraction======
Nmumu(ON) = 29019709.0 +/- 5387.0 (stat)
Nmumu(OFF) = 0.0 +/- 0.0 (stat)
Nmh(ON) = 211684313.0 +/- 14549.4 (stat)
Nmh(OFF) = 0.0 +/- 0.0 (stat)
(Don't worry about the "failed on dbname" message -
that's just BbkLumi realizing that it should be using
the R18 (bbkr18) database instead of the R14-R16
(bbkr14) database.)
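If you save BbkLumi's output to a file, you can pull the luminosity out with a script instead of copying it by hand. The sketch below assumes the "Lumi Processed" line always has the layout shown in the output above (ON value, then OFF value, in pb-1); the function name is invented.

```python
import re

SAMPLE = """\
== Your Luminosity (pb-1) Summary =========
Y(4s) Resonance ON OFF
Lumi Processed 59527.608 0.000
"""


def parse_lumi_processed(output):
    """Return (on_peak, off_peak) luminosity in pb^-1 from BbkLumi's summary text."""
    m = re.search(r"^\s*Lumi Processed\s+([\d.]+)\s+([\d.]+)\s*$",
                  output, re.MULTILINE)
    if m is None:
        raise ValueError("no 'Lumi Processed' line found")
    return float(m.group(1)), float(m.group(2))


on_peak, off_peak = parse_lumi_processed(SAMPLE)
print(f"on-peak: {on_peak / 1000.0:.3f} /fb, off-peak: {off_peak / 1000.0:.3f} /fb")
```

Dividing by 1000 converts pb-1 to fb-1, so the example above corresponds to about 59.5/fb of on-peak data.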
Alternatively, you could use BbkLumi to find the luminosity
for the tcl file that you produced with BbkDatasetTcl.
BbkLumi --tcl AllEventsSkim-Run2-OnPeak-R18b.tcl
The output message will be similar to the one above.
The --tcl option can be useful if some of your jobs
fail. As long as your tcl files were produced without
the --splitruns option, you can use the --tcl option
to obtain the luminosity of successful jobs only
(whereas the --ds option tells you the luminosity of
the full data set only).
For example, imagine that you have divided the AllEventsSkim-Run2-OnPeak-R18b
data set among 50 tcl files (without the --splitruns option) and submitted
50 jobs. 40 of them run with no problems, but 10 of them keep failing no
matter what you do. So you decide not to use the data from those 10 tcl files.
Now you need to know the luminosity of the 40 tcl files that
produced successful jobs, but not the 10 that failed. So you create
two directories, "success" and "failed", and move the 40 successful tcl files
to "success" and the failed tcl files to "failed." Then you go to "success"
and run BbkLumi:
> cd success
success> BbkLumi --tcl *
This will give you the luminosity of the data from
the 40 successful jobs only.
As mentioned above, the --tcl option works only on tcl
files that were NOT produced with the --splitruns option.
When --splitruns is used, BbkLumi cannot tell exactly which runs are in a
given tcl file, or whether a whole run or only part of it is included.
On the other hand, --splitruns is useful because without it
the tcl files each have a variable number of events, and this
leads to a rather random amount of time in the
batch system. Most people use --splitruns to ensure
that all of their jobs use about the same amount of batch time.
BbkLumi can also be used to obtain the luminosity for a particular
series or range of runs, for example:
BbkLumi --range 38358-38363 : Gets lumi between runs 38358 and 38363
BbkLumi --run 38358,38363,38451 : Gets lumi for the listed runs
You will probably not need to use these options, but they are
there just in case.
At some point in your analysis you will probably need to determine
the "luminosity" of your Monte Carlo samples. You would use this
information to scale your Monte Carlo set to your data set.
For example, suppose you have a real data set of 100/fb, and a
Monte Carlo set of 300/fb. Then you would need to rescale the
Monte Carlo set by a factor of 1/3 in order to make direct
comparisons with data.
The luminosity that you want is actually the equivalent luminosity,
the luminosity that a generic real data sample filled with all types of decays
would have to have in order to contain the type and number of decays
in your Monte Carlo sample. For example, if you have a Monte Carlo sample
of 90,000,000 e+e- to tau tau decays, you would want to know what size
data sample would contain 90,000,000 e+e- to tau tau decays.
In general, the equivalent luminosity of an MC sample of N events
and cross section sigma is:
lumi = N / sigma
For our e+e- tau tau example, the equivalent luminosity is:
lumi = 90,000,000/0.90nb = 100,000,000/nb = 100/fb
(Be careful with your units!)
Note that due to detector acceptance, the effective cross section is
sometimes lower than the actual production cross section. For example
the theoretical cross section for e+e- to tau+tau- is 0.94nb, but the
cross section from the detector is 0.90nb.
If you are using your e+e- to tau tau sample to model the e+e- to tau tau
part of a real data sample of 200/fb, then you'd need to rescale
your e+e- to tau tau sample by a factor of 2.
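The arithmetic above, including the unit conversion, can be checked with a few lines of code. This is just a worked version of the formula lumi = N / sigma and of the rescaling examples in this section; the function names are made up.

```python
def equivalent_lumi_fb(n_events, sigma_nb):
    """Equivalent luminosity N / sigma, returned in fb^-1 (sigma given in nb)."""
    lumi_inv_nb = n_events / sigma_nb   # result is in nb^-1
    return lumi_inv_nb * 1.0e-6         # 1 nb^-1 = 1e-6 fb^-1


def mc_scale_factor(data_lumi_fb, mc_lumi_fb):
    """Factor by which to rescale the MC sample to match the data luminosity."""
    return data_lumi_fb / mc_lumi_fb


# 90,000,000 tau-pair events with an effective cross section of 0.90 nb
mc_lumi = equivalent_lumi_fb(90_000_000, 0.90)
print(f"equivalent luminosity: {mc_lumi:.1f} /fb")

# Scaling that sample to 200/fb of real data
print(f"scale factor: {mc_scale_factor(200.0, mc_lumi):.2f}")
```

The same mc_scale_factor function gives 1/3 for the earlier example of 100/fb of data and 300/fb of Monte Carlo.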
General related documents:
Author:
Contributors:
Massimiliano Turri,
Joseph Perl,
Jenny Williams
Last modification: 13 April 2006
Last significant update: 13 June 2005