Finding Data

Note: this page is in the process of being updated for R24/R26 skims. Some of the information still refers to older skims (R22 and before). Please adjust the information accordingly until we get the page fully updated. Thank you. —Ray


Finding out what data sets are available

To see a list of data sets available at your site, enter the command

    >  BbkDatasetTcl --dbname bbkr24

without any other arguments. If you are looking for older datasets, such as R22 and earlier, omit the "--dbname bbkr24". This will produce a list of everything in the bookkeeping system. To keep it from scrolling by all at once, pipe the output of the command to less (e.g., "BbkDatasetTcl | less") or redirect it to a file (e.g., "BbkDatasetTcl >& BbkDatasetTcl.txt").
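
For example, to page through only the on-peak entries, you can combine the listing with standard Unix filters (a generic shell sketch; substitute any pattern you like for "OnPeak"):

    >  BbkDatasetTcl --dbname bbkr24 | grep OnPeak | less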

The available datasets are also documented at the Data Quality homepage, which will send you to this page to learn more about the data sets for Release 24 runs 1-6 and this page for Release 24 run 7 (for example).

If you know the name of the dataset that you want, you can search for it with the "-l WILDCARD" option. For example:

> BbkDatasetTcl -l "Inclppbar*"
BbkDatasetTcl: 142 datasets found in bbkr18 at slac:-

Inclppbar-Run1-OffPeak-R18b
Inclppbar-Run1-OffPeak-R18b-v02
Inclppbar-Run1-OffPeak-R18b-v03
Inclppbar-Run1-OffPeak-R18b-v04
Inclppbar-Run1-OffPeak-R18b-v05
Inclppbar-Run1-OffPeak-R18b-v06
Inclppbar-Run1-OffPeak-R18b-v07
Inclppbar-Run1-OffPeak-R18c
Inclppbar-Run1-OffPeak-R18c-v03
Inclppbar-Run1-OffPeak-R18c-v04
Inclppbar-Run1-OffPeak-R18c-v05
...
Inclppbar-Run5-OnPeak-R18c-v05
Inclppbar-Run5-OnPeak-R18c-v06
Inclppbar-Run5-OnPeak-R18c-v07
Inclppbar-Run5-OnPeak-R22b
Inclppbar-Run5-OnPeak-R22b-v02

Data set names

The names of the different data sets have the following form:

SkimName-Run[1-7]-[On/Off]Peak-RXX[a/b/c/d...][-vXX]

 Examples:
AllEventsSkim-Run5-OffPeak-R24c-v07
BchToD0KstarAll-Run4-OffPeak-R24a1-v06
DmixD0ToKPiPi0-Run3-OnPeak-R24c-v07
InclPhi-Run2-OnPeak-R18c
Kll-Run3-OffPeak-R22b-v02

SkimName

SkimName indicates the type of data set or skim.

The most general data set available is the AllEvents data set. Data begins as signals in the different subdetectors. Then it is converted to digital format and stored in an XTC file. The XTC files are sent to the prompt reconstruction ("PR") system, which reconstructs particle candidates from the detector signals. The output of prompt reconstruction is the AllEvents dataset.

The AllEventsSkim data set is similar to AllEvents, except that each event is labeled with over a hundred tags. Tags are boolean variables (set to true or false) that indicate whether an event has a given characteristic. For example, the Jpsitoll tag is set to true if the event contains a J/psi to l+l- decay, and false otherwise. The AllEventsSkim data set is created when a skim executable is run over AllEvents.

The remaining data sets are skims produced from AllEventsSkim. A skim is a subset of the data whose events all have the same value of a given tag (or tags). For example, the Jpsitoll skim is the subset of events in AllEventsSkim that have the Jpsitoll tag set to true. A skim does not necessarily consist of a physical copy of events in AllEventsSkim; sometimes it consists of pointers to the selected events instead. But "deep copy" and pointer skims look the same to the user.

Run[1-7]: Run Cycle

The Run Cycle denotes the data-taking period, as shown in the table below.
Run Periods

Run     Begin      End
Run1    Feb 2000   Oct 2000
Run2    Feb 2001   Jun 2002
Run3    Dec 2002   Jun 2003
Run4    Sep 2003   Jul 2004
Run5    May 2005   Aug 2006
Run6    Jan 2007   Aug 2007
Run7    Dec 2007   Sep 2008

RXX: Release series

RXX is the release series of the reconstruction software used to process the data. Data is initially processed with whatever release is current at the time. But later, when a new and improved release becomes available, the data is reprocessed using the new software. In general you want to use the data set that was processed using the same software as your test release. For example, in the Quicktour you used analysis-41, which is 22.1.1, a 22-release. So you would want the data sets ending in "R22".

Fortunately, BbkDatasetTcl is smart: once you have entered "srtpath" and "cond22boot" from analysis-41, it knows that you are using a 22-series release, and will list the R22 collections.

This year (2007), everyone will be using R18 and R22 data, so you will probably not have to worry about older releases unless you are continuing an older analysis.

[a/b/c]: Skim cycle

Skims are produced at regular intervals, in skim cycles. This ensures that researchers do not have to wait too long to create or update skims to incorporate new physics ideas.

Usually several skim cycles are run for a given release, so the release series, run period, and skim name alone do not uniquely identify a data set. The different skim cycles within a release are therefore labeled [a/b/c].

Some skim cycles will have a further specifier in the form "-v0X" where "X" is a digit. This is called a "tag". The tag denotes repeated iterations of the skim cycle, usually with either more data added to it, or some data removed (bad runs). In general, the highest-numbered tag is the one to use.
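
Because the tags sort alphabetically, one quick (if unofficial) way to find the highest-numbered tag is to combine the wildcard search shown earlier with standard Unix sorting, which prints the last matching name:

    >  BbkDatasetTcl -l "Inclppbar-Run1-OffPeak-R18c*" | grep Inclppbar | sort | tail -1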

At the moment of writing (April 2007) the first R22 skim is in progress. This skim is called R22a, and its target deadline is May 1, 2007. So a lot of people are still using R18 datasets. For R18 there have been several skim cycles:

  • R18a - Just a test, not to be used for analysis.
  • R18b - Second skim cycle.
  • R18c - Third skim cycle.
  • R18d - Fourth skim cycle - the recommended one.

Finding out what SP Modes are available

BbkDatasetTcl also lists the available Monte Carlo (simulated) sets. The names are similar to the data set names, of the form:

SP-XXXX-SkimName-Run[1-7][-(On/Off)Peak]-RXX[a/b/c]
For example:
SP-1237-AllEventsSkim-Run5-R22a
SP-1005-BchToD0KstarAll-Run4-OffPeak-R18b
SP-998-DmixD0ToKPiPi0-Run3-R22b
SP-1235-InclPhi-Run2-R18c
SP-3981-Kll-Run3-OffPeak-R22b

SP-XXXX: mode number

The names of Monte Carlo sets begin with the prefix "SP-XXXX", where XXXX is a 3 or 4-digit mode number. A list of the available physics decay modes is available on the MC Production home page.

To find the definition of a certain decay mode, for example 1237, you can use BbkSPModes:

     >  BbkSPModes --modenum 1237 

The system will respond:

: Mode : Decfile             : Generator   : Filter : Run Type        : Category       :
: 1237 : B0B0bar_generic.dec : Upsilon(4S) :        : B0B0bar generic : generic decays :

To find out more about BbkSPModes, you can check the BbkSPModes web page, or type "BbkSPModes --help" at the command line.

SkimName

If you use a skim of your data set, then you will want to study the same skim of your Monte Carlo set, so you can compare the two. Decay modes like 1237=B0B0bar_generic are standard decays that show up (as background) in nearly all analyses, so nearly all skims are run over decay mode 1237. However, for other decays, like 3527 = B0toD2StarPi_D2StartoD0Pi_D0toKPi.dec, the only skims available (besides the standard ones) are:

SP-3527
SP-3527-AllEventsSkim-R18[b,c,c-F2KBug,R22a]
SP-3527-B0DNeutralLight-R18[b,c,c-F2KBug,R18b,R22b]
SP-3527-BtoDPiPi-R18[b,c,c-F2KBug,R18b,R22b]
SP-3527-Run[1,2,3,4,5]-[F2KBug,G4Bug,R22]

This probably means that mode 3527 was produced for a particular analysis that uses only the B0DNeutralLight and BtoDPiPi skims. So only those skims (and the AllEventsSkim, from which they are derived) were produced.

Run[1-7]

Runs are data-taking periods, not MC production periods. However, Monte Carlo data sets are designed to reproduce the data as closely as possible, including the conditions (detector, online, parameters) at the time. So MC data sets are labeled with Run Cycles that indicate which data sets they are intended to model.

Simulation Productions: SPN

Simulated (Monte Carlo) data sets are produced in Simulation Production (SP) cycles:

SP1, SP2, SP3 = obsolete
SP4 = Release 10
SP5 = Release 12
SP6 = Release 14
SP7 = none
SP8 = Release 18
SP9 = Release 22

(SP7 would have been Release 16, but they decided not to produce it.)

Data Location

The plain "BbkDatasetTcl" command tells you only about the collections at the Tier A site that you are logged in to. A Tier A site is a computing facility. SLAC's Tier A sites are:

  • SLAC (slac)
  • Bologna (cnaf)
  • IN2P3 (ccin2p3)
  • GridKa (gridka)
  • RAL (ral)

Each skim is assigned to an Analysis Working Group (AWG), and each AWG is hosted at a particular Tier A site. So in order to access your skim, you need to log in to the Tier A site that hosts the AWG that owns it. Detailed information is available here:

Data Distribution page

Making tcl files

Once you have determined which data sets you need, you can use BbkDatasetTcl to produce tcl files that tell your application how to access the data sets. The simplest form of the BbkDatasetTcl command is:

    > BbkDatasetTcl DATASET 

This will produce a single file, DATASET.tcl.

In practice, however, you will probably want to use a more complicated BbkDatasetTcl command:

    > BbkDatasetTcl DATASET --tcl Nmax --splitruns --basename MYNAME 

where:

DATASET
The name of the data set.
--tcl Nmax
Produces multiple tcl files, each with a maximum of Nmax events.
--splitruns
Allows BbkDatasetTcl to split collections, so that you end up with exactly Nmax events per file (except the last file, of course) instead of at most Nmax per file.
--basename MYNAME
A name of your choice. If you do not use this option, then BbkDatasetTcl will use the default name: DATASET.

For example, in the Quicktour you used BbkDatasetTcl to produce a single tcl file:

    >  BbkDatasetTcl SP-1237-Run4 

However, if there are too many events in your tcl file, your job may fail due to CPU time limits. To avoid this, you will want to divide the events among many tcl files:

>  BbkDatasetTcl SP-1237-Run4 --tcl 100k --splitruns --basename MC-B0B0bar-Run4 

Now instead of one big tcl file, you have many tcl files of 100k events each: MC-B0B0bar-Run4-1.tcl, MC-B0B0bar-Run4-2.tcl, MC-B0B0bar-Run4-3.tcl ... MC-B0B0bar-Run4-N.tcl.
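
Before submitting jobs, you can double-check how many tcl files were actually produced with a plain shell count (nothing BaBar-specific here):

    >  ls MC-B0B0bar-Run4-*.tcl | wc -l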

You can determine how many events you should include in one tcl file by submitting a test job and seeing when it crashes.

Note that all datasets evolve continuously with time. This is due to new data being added, old data being found to be bad, or reprocessing.

For more information about how to use BbkDatasetTcl to produce tcl files, see the Bookkeeping User Tools web page (in particular the section on "Evolving Datasets" gives details on how to update an analysis when new data becomes available), or type "BbkDatasetTcl --help" at the command line.

Contents of a tcl file

If you look in your SP-1237-Run4.tcl file from the Quicktour, you will see a bunch of lines like this:

# 138000/138000 events selected from 69 on-peak runs, added to dataset at 2005/11/04-22:50:14-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
# 48000/48000 events selected from 24 on-peak runs, added to dataset at 2005/11/05-04:48:59-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013240
# 6000/6000 events selected from 3 on-peak runs, added to dataset at 2005/11/05-04:48:58-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013270
# 138000/138000 events selected from 69 on-peak runs, added to dataset at 2005/11/03-22:47:54-PST, lumi = ~0.0/pb
lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013286
# 104000/104000 events selected from 52 on-peak runs, added to dataset at 2005/11/03-22:47:54-PST, lumi = ~0.0/pb
The lines beginning with a "#" are comments:
  • "A/B events selected": Of the B events in the collection, only A are from "Good" runs satisfying certain quality requirements. For MC (simulated) sets, all runs are good runs (there's no point in simulating bad ones), so A=B. But for real data sets, usually A<B, so this line is not redundant.
  • "from N on-peak runs": These small-r runs are different from the big-R Runs 1-7. A small-r run is a much smaller data-taking period, just one batch of analysis within a Framework job.
  • "lumi=X/pb": The (time-integrated) luminosity in pb^-1. This is always set to 0.0/pb for MC, but it is nonzero for real data. Luminosity will be discussed in more detail in the luminosity section below.

The remaining lines are tcl commands. "lappend inputList" is a tcl command that adds a collection to the input module's list of collections to be analyzed.
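
If you are not familiar with tcl, "lappend" is simply the standard tcl command for appending an element to a list variable. A minimal standalone illustration (plain tcl, nothing BaBar-specific):

    set inputList {}
    lappend inputList /store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
    puts [llength $inputList]    ;# prints 1: the list now holds one collection name

The input module then works through whatever collections have accumulated in inputList.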

Tcl files produced by BbkDatasetTcl can be used directly in your analysis. The Workbook's Tcl files section explains how to do this.

Collection names

The last thing to look at is the collection names themselves.

Before doing that, you should be aware that a collection in the event store isn't really the same thing as a file: any given collection is stored in many files, and each file holds parts of multiple collections. Files also get moved around from time to time to balance loads, and they come and go from tape. So it's not really useful to track down where the files live in the Unix filesystem. Instead, you should use BbkDatasetTcl to generate tcl files, and then have a Framework input module convert those to file locations.

That said, let's look at a collection name. The first collection in SP-1237-Run4.tcl is:

/store/SP/R18/001237/200309/18.6.0b/SP_001237_013238
  • /store All collections are located in /store.
  • /SP This is an MC collection.
  • /R18 It was produced in the Release 18 series.
  • /001237 SP mode number is 1237.
  • /200309 This MC set was simulated using the conditions from September 2003.
  • /18.6.0b It was produced using SP release 18.6.0b.
  • /SP_001237_013238 SP mode number is 1237, and SP merge number is 013238. Merged skims will be explained in a moment. Why repeat the mode number? Because you want every file in the directory /store/SP/R18/001237/200309/18.6.0b/ to have a different name.

For real data the collection names are a bit different. For example, consider the real-data collection:

/store/PR/R18/AllEvents/0001/81/18.1.0c/AllEvents_00018190_18.1.0cV00
  • /store All collections are located in /store.
  • /PR This is a real data collection.
  • /R18 The data was reconstructed using Release 18 software.
  • /AllEvents It is part of the AllEvents skim.
  • /0001 First 4 digits of the run number.
  • /81 Next 2 digits of the run number.
  • /18.1.0c The data was processed with release 18.1.0c.
  • /AllEvents_00018190_18.1.0cV00 The collection itself: AllEvents, run 00018190, processed with release 18.1.0c. "VXX" tags are introduced when for some reason someone produces the collection a second time.

The point of the "/0001/81/" format is to make it easier for users to find the collections they want. For example, someone who wanted all runs beginning with "0001" could just look in the directory "/store/PR/R18/AllEvents/0001/".

For skim collections, BaBar skims many different (PR or SP) collections and then merges the output. Collections from these skims begin with "PRskims" or "SPskims". There are also collections that begin with "SPruns": SP data is initially generated in the SPruns tree and merged into the SP tree, after which the SPruns files are deleted. The collection names for PRskims, SPskims, and SPruns also have their own special conventions. They are not described here, but you can learn about them by following the links below.

For a more detailed explanation of collection names, refer to the Extended Collection Names part of the CM2 introduction, and the link to the RFC provided on that page.

Making your own collections

Most users do NOT ever have to make their own collections.

However, for advanced users who do want to make their own collections, walk-through examples are provided in the Workbook section Gen/Sim/Reco.

Locating the database: condXXboot

In the Quicktour, you entered the command:

    > cond22boot

and the system responded:

Setting OO_FD_BOOT to /afs/slac/g/babar-ro/objy/databases/boot/physics/V9/ana/0202/BaBar.BOOT

"/afs/slac/g/babar-ro/objy/databases/boot/physics/V9/ana/0202/BaBar.BOOT" is a "BOOT file". Whenever a new database is created, a BOOT file is created as well. The BOOT file tells applications like BetaMiniApp how to find the database.

In general, if you are using Release XX software, then the command that you should use is condXXboot. (Note: the release name should begin with a number, like 22. If it begins with a word, like analysis-41, then it is a nickname, so don't use cond41boot!)

(In pre-Release-12 analyses, the boot commands were a bit different: physboot or data12boot for data, and simuboot or mc12boot for MC. But it is very unlikely that you will need these commands.)

If you are curious, you can check out the contents of the boot file:

> cat /afs/slac/g/babar-ro/objy/databases/boot/physics/V9/ana/0202/BaBar.BOOT
BaBar.BOOT
ooFDNumber=202
ooLFDNumber=65535
ooPageSize=16384
ooLockServerName=objylock05.slac.stanford.edu
ooFDDBHost=objycat03.slac.stanford.edu
ooFDDBFileName=/objy/databases/production/dynamic/physics/V9/0202/BaBar.FDB
ooJNLHost=objyjrnl02.slac.stanford.edu
ooJNLPath=/objy/databases/production/journals/physics/V9/ana/0202/

Basically, the BOOT file sets up the paths to the database. But you do not need to know about these paths. All that you need to do is use the correct condXXboot command before you run your analysis.
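
If you want to confirm which database your session points to, you can inspect the environment variable that condXXboot sets (a generic shell check):

    > echo $OO_FD_BOOT
    /afs/slac/g/babar-ro/objy/databases/boot/physics/V9/ana/0202/BaBar.BOOT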

Luminosity of a data sample: BbkLumi

To determine the luminosity of your (real) data set, you can use BaBar's bookkeeping tool BbkLumi.


BbkLumi -ds DATASET
BbkLumi --tcl inputfile.tcl

For example, suppose you are using the dataset AllEventsSkim-Run2-OnPeak-R18b. Then the command:

BbkLumi -ds AllEventsSkim-Run2-OnPeak-R18b

prints the luminosity of the full dataset:

Failed on dbname : bbkr14 trying bbkr18
Using aliases:
AllEventsSkim-Run2-OnPeak-R18b

Using B Counting release 18 from dataset name AllEventsSkim-Run2-OnPeak-R18b
==============================================
Run by penguin at Sun Apr 22 21:05:18 2007
First run = 18190 : Last Run 29435
== Your Run Selection Summary =============
 ***** NOTE only runs in B-counting release 18 considered *****
 ***** Use --OPR or --L3 options to see runs without B-counting *****
 Number of Data Runs                 5150
 Number of Contributing Runs         5150
-------------------------------------------
 Y(4s)   Resonance            ON         OFF
 Number  Recorded           5150           0

== Your Luminosity (pb-1) Summary =========
 Y(4s)  Resonance             ON         OFF
 Lumi   Processed          61145.302      0.000

== Number of BBBar Events Summary =========
             Number    |             ERROR
                       |   (stat.)   (syst.)   (total)
Total       67472454.3 |    43793.4   742197.0   743487.9

==For On / Off subtraction======
Nmumu(ON)            =   29566811.0 +/-       5437.5 (stat)
Nmumu(OFF)           =          0.0 +/-          0.0 (stat)
Nmh(ON)              =  214220114.0 +/-      14636.3 (stat)
Nmh(OFF)             =          0.0 +/-          0.0 (stat)

(Don't worry about the "failed on dbname" message - that's just BbkLumi realizing that it should be using the R18 (bbkr18) database instead of the R14-R16 (bbkr14) database.)

Alternatively, you could use BbkLumi to find the luminosity for the tcl file that you produced with BbkDatasetTcl.


BbkLumi --tcl AllEvents-Run2-R18b.tcl

The output message will be similar to the one above.

The --tcl option can be useful if some of your jobs fail. As long as your tcl files were produced without the --splitruns option, you can use the --tcl option to obtain the luminosity of the successful jobs only (whereas the --ds option tells you the luminosity of the full data set only).

For example, imagine that you have divided the AllEventsSkim-Run2-OnPeak-R18b data set among 50 tcl files (without the --splitruns option) and submitted 50 jobs. 40 of them run with no problems, but 10 of them keep failing no matter what you do. So you decide not to use the data from those 10 tcl files.

Now you need to know the luminosity of the 40 tcl files that produced successful jobs, but not the 10 that failed. So you create two directories, "success" and "failed", and move the 40 successful tcl files to "success" and the failed tcl files to "failed." Then you go to "success" and run BbkLumi:

    > cd success
    success> BbkLumi --tcl *

This will give you the luminosity of the data from the 40 successful jobs only.

As mentioned above, the --tcl option works only on tcl files that were NOT produced with the --splitruns option. When --splitruns is used, BbkLumi cannot tell exactly which runs are in a given tcl file, or whether a whole run or only part of it is included.

On the other hand, --splitruns is useful because without it the tcl files each contain a variable number of events, which leads to unpredictable job lengths in the batch system. Most people use --splitruns to ensure that all of their jobs use about the same amount of batch time.

BbkLumi can also be used to obtain the luminosity for a particular series or range of runs, for example:

BbkLumi --range 38358-38363     : Gets lumi between runs 38358 and 38363
BbkLumi --run 38358,38363,38451 : Gets lumi for the listed runs

You will probably not need to use these options, but they are there just in case.

Equivalent luminosity for Monte Carlo

At some point in your analysis you will probably need to determine the "luminosity" of your Monte Carlo samples. You would use this information to scale your Monte Carlo set to your data set.

For example, suppose you have a real data set of 100/fb, and a Monte Carlo set of 300/fb. Then you would need to rescale the Monte Carlo set by a factor of 1/3 in order to make direct comparisons with data.

The luminosity that you want is actually the equivalent luminosity: the luminosity that a generic real data sample, filled with all types of decays, would have to have in order to contain the type and number of decays in your Monte Carlo sample. For example, if you have a Monte Carlo sample of 90,000,000 e+e- to tau tau decays, you would want to know what size data sample would contain 90,000,000 e+e- to tau tau decays.

In general, the equivalent luminosity of an MC sample of N events and cross section sigma is:

lumi = N / sigma 

For our e+e- tau tau example, the equivalent luminosity is:

lumi  = 90,000,000/0.90nb = 100,000,000/nb = 100/fb

(Be careful with your units!)
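
To see why the units work out: 1 nb = 10^6 fb, so 1/nb = 10^-6/fb, and therefore

    lumi = 90,000,000 / 0.90 nb = 1.0 x 10^8 /nb = 1.0 x 10^8 x 10^-6 /fb = 100/fb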

Note that due to detector acceptance, the effective cross section is sometimes lower than the actual production cross section. For example, the theoretical cross section for e+e- to tau+tau- is 0.94 nb, but the effective cross section seen by the detector is 0.90 nb.

If you are using your e+e- to tau tau sample to model the e+e- to tau tau part of a real data sample of 200/fb, then you'd need to rescale your e+e- to tau tau sample by a factor of 2.



Last modified: March 2010