

BaBar Version SP10





BaBar Monte Carlo Production


SP10 Installation and User Guide



F.F. Wilson
Rutherford Appleton Laboratory, UK
David N. Brown
Louisville, US
Douglas Smith
SLAC, US



June 29, 2008



Introduction

This document gives the basic instructions for setting up and running the SP10 production at remote sites. You may already have scripts that automate much of this (e.g. downloading the conditions), so check with your system administrator first.

How to get this document

This document is userguide.tex in the doc directory of the ProdTools package. Check out the head of ProdTools to get the latest version. It can be converted to postscript, pdf and html. Simply issue the following commands to update the files:

> latex userguide.tex
> dvips  -Ppdf -o userguide.ps userguide.dvi
> ps2pdf userguide.ps  userguide.pdf
> gzip -f -v --best userguide.ps
> latex2html -split 0 -t 'BaBar SP10 MC Production Guide' userguide.tex
> tar -zcvf u.gtar userguide

You can then update the official web page at SLAC if you want. Copy userguide.ps and userguide.pdf to $BFROOT/www/Computing/Offline/Production and unpack the tar file there:

> tar -C $BFROOT/www/Computing/Offline/Production -zxvf u.gtar


SP Web Page and Information that changes with time

Information that changes is available from the following points.

ProdTools Documentation

The man pages for all the ProdTools commands are in the $BFROOT/prod/ProdTools/doc/man directory. To make the man pages accessible, copy the files to $BFROOT/man/man1. If the $MANPATH environment variable is set up properly, they will be accessible with the man command:

local> man sprite

To make the man pages in a ps or pdf file for printing, use the commands:

local> cd $BFROOT/prod/ProdTools/doc/man
local> groff -Tps -man *.1 > prodtools.ps
local> ps2pdf prodtools.ps prodtools.pdf

These can be copied to the SLAC web page at $BFROOT/www/Computing/Offline/Production. A pdf/ps version can be found here:

http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Production/prodtools.pdf

http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Production/prodtools.ps

Installation

You need SL3, SL4, RHEL 3, or RHEL 4, together with Objectivity 8.0.9, ROOT 5.14-00e, CLHEP 1.9, GEANT 4.8, and mysql 4.1.9. The guide assumes that you are using the cond24boot database. When SP10 production starts, the new database will be accessed via sp10boot and the documentation will be updated.

You need to follow the steps in Sections 5.1 to 5.5 before installing the base release for the first time, or you may get problems with missing shared libraries, etc. You only need to do these steps once.


Installing Objectivity

This section is not necessary with releases greater than or equal to 24.3.1.

local> mkdir $BFROOT/package/objy8.0.9

If you install it anywhere else, you will need to configure SiteConfig for it. Get Objectivity.R8.0.9.linux86gcc3.tar.gz from /afs/slac.stanford.edu/g/babar/package/objy8.0.9/Objectivity_admin/Objectivity_src/

local> cd $BFROOT/package/objy8.0.9
local> tar -zxvf Objectivity.R8.0.9.linux86gcc3.tar.gz
local> cd cdrom
local> ./install
- select option 3 'Custom Installation'
- select 5,9,10,12
- select install directory <objydir> (highly recommend $BFROOT/package/objy8.0.9 )
- set linux86gcc3 C++ include path:
 /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/include /usr/include/c++/3.2.3 
 /usr/include/c++/3.2.3/backward /usr/include/c++/3.2.3/i386-redhat-linux
- use defaults otherwise
Copy liboocx.so from SLAC into <objydir>/linux86gcc3/lib. Copy oolicense.runtime.txt from SLAC into <objydir>.

Installing ROOT

Please install ROOT 5.14-00e in $BFROOT/package/root/5.14-00e. If you install it anywhere else, you will need to configure SiteConfig for it.

local> mkdir $BFROOT/package/root/5.14-00e
local> cd $BFROOT/package/root/
local> wget http://hep.phys.utk.edu/~gragghia/babar/root-5.14-00e.tgz
local> tar -xvpzf root-5.14-00e.tgz

Installing CLHEP

Please install CLHEP 1.9.2.1 in $BFROOT/package/clhep/1.9.2.1.

local> mkdir $BFROOT/package/clhep/1.9.2.1
local> cd $BFROOT/package/clhep/1.9.2.1
local> wget http://www.slac.stanford.edu/~dbrown/clhep-1.9.2.1.tar.gz
local> tar -xvpzf clhep-1.9.2.1.tar.gz

Note: If you have already installed your 24-series release, you should cd to the release directory and gmake ldlink to ensure that clhep is linked in to your release successfully. If you have not installed the release yet, gmake siteinstall should take care of this automatically.

Installing Geant

Please install GEANT version geant4-08-03-ref-00-patch-03 in $BFROOT/simu/geant4/geant4-08-03-ref-00-patch-03.

local> mkdir $BFROOT/simu/geant4/geant4-08-03-ref-00-patch-03
local> cd $BFROOT/simu/geant4/geant4-08-03-ref-00-patch-03
local> wget http://www.slac.stanford.edu/~dbrown/geant4.8.tar.gz
local> tar -xvpzf geant4.8.tar.gz

Note: If you have already installed your 24-series release, you should cd to the release directory and gmake ldlink to ensure that Geant is linked in to your release successfully. If you have not installed the release yet, gmake siteinstall should take care of this automatically.


Installing MySQL

Please install the mysql libraries for version 4.1.9 in $BFROOT/package/mysql/4.1.9/Linux24SL3_i386_gcc323/lib.

local> mkdir $BFROOT/package/mysql/4.1.9/Linux24SL3_i386_gcc323/lib
local> cd $BFROOT/package/mysql/4.1.9/Linux24SL3_i386_gcc323/lib
local> wget http://www.slac.stanford.edu/~dbrown/mysql-4.1.9.tar.gz
local> tar -xvpzf mysql-4.1.9.tar.gz

Note: If you have already installed your 24-series release, you should cd to the release directory and gmake ldlink to ensure that mysql is linked in to your release successfully. If you have not installed the release yet, gmake siteinstall should take care of this automatically.
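Before moving on to the release installation, it can help to confirm that the external packages ended up where the sections above say they should. The following is a minimal sketch, not part of ProdTools: the function name check_prereqs is hypothetical, the directory list is simply copied from the install locations recommended above, and it only tests that the directories exist, not that their contents are complete.

```shell
# Report which external-package directories from the sections above are missing.
# Usage: check_prereqs [root]   (root defaults to $BFROOT)
# check_prereqs is a hypothetical helper, not a BaBar tool.
check_prereqs() {
    cp_root=${1:-$BFROOT}
    cp_missing=0
    for cp_dir in \
        package/objy8.0.9 \
        package/root/5.14-00e \
        package/clhep/1.9.2.1 \
        simu/geant4/geant4-08-03-ref-00-patch-03 \
        package/mysql/4.1.9/Linux24SL3_i386_gcc323/lib
    do
        if [ -d "$cp_root/$cp_dir" ]; then
            echo "ok       $cp_root/$cp_dir"
        else
            echo "MISSING  $cp_root/$cp_dir"
            cp_missing=$((cp_missing + 1))
        fi
    done
    # Non-zero exit status if any directory is missing.
    [ "$cp_missing" -eq 0 ]
}
```

If you installed a package somewhere else and configured SiteConfig for it, the corresponding line will be reported as MISSING even though your setup is fine.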


Getting set up to import Conditions, Configurations and Backgrounds

Downloading the conditions snapshot is similar to downloading background trigger events, and requires use of BbkImport. BbkImport uses a package that depends on BaBar::SQL, which only exists at sites that use a MySQL or Oracle database. To get around this dependency at other sites, you can create a dummy BaBar::SQL package. Create a file $BFROOT/lib.shared/perl/BaBar/SQL.pm which looks like this:

#
# dummy module
#
package SQL;
1;
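If you prefer to create the dummy file from the command line, something like the following sketch should do. The function name make_dummy_sql is hypothetical; the file location and contents are exactly those given above, and the function refuses to overwrite an existing SQL.pm so that it is harmless at a site that has the real module.

```shell
# Create the dummy BaBar::SQL module under <root>/lib.shared/perl/BaBar.
# Usage: make_dummy_sql [root]   (root defaults to $BFROOT)
# make_dummy_sql is a hypothetical helper, not a BaBar tool.
make_dummy_sql() {
    mds_dir=${1:-$BFROOT}/lib.shared/perl/BaBar
    # Do not overwrite a real module at a site that already has one.
    if [ -e "$mds_dir/SQL.pm" ]; then
        echo "$mds_dir/SQL.pm already exists, leaving it alone"
        return 0
    fi
    mkdir -p "$mds_dir" || return 1
    cat > "$mds_dir/SQL.pm" <<'EOF'
#
# dummy module
#
package SQL;
1;
EOF
    echo "created $mds_dir/SQL.pm"
}
```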

Importing database files to your site (and later exporting data to SLAC) requires that you have password-less access to the bbrdist account on the bbr-xfer0x machines at SLAC. There is now an alias for all the bbr-xfer0x machines: bbr-xferlfi. Test this by trying to log in via ssh to bbrdist@bbr-xferlfi.slac.stanford.edu. If this doesn't work, you need to have your server's RSA key added to the authorized_keys file in the bbrdist directory at SLAC. To do this, first log in to your production account on the head node at your site and generate a key:

local> ssh-keygen -t rsa

Hit <CR> when prompted for a passphrase. Send the contents of the resulting public key file to Wilko Kroeger (wilko@slac.stanford.edu) and ask him to put the key in the bbrdist account's authorized_keys file. When he has done this you should be ready to import the conditions database.

Update: The connection to SLAC is set up in $HOME/.bbk by the BbkGetConnectInfo script. That script is part of the release, but it is also part of the bin package, and the bin package is first in the search path. The bin package version of the script uses /usr/local/bin/perl, not the site-specific perl version configured for BABAR. Make sure /usr/local/bin/perl exists; create a link if it doesn't (or modify the bin version of BbkGetConnectInfo).


Installing a base release

To install a new release you must first install the base release (e.g. 24.3.1) and afterwards the lettered release (e.g. 24.3.1a). First log in to your bfactory account, enable AFS access to the computer with the source repository, and set the environment variable for your remote source repository. This example assumes you are using the SLAC repository, but you can replace it with a nearer or faster site.

local> klog username@rl.ac.uk
local> printenv BFDISTr
local> setenv BFDISTr /afs/slac.stanford.edu/g/babar/dist

Some people have had faster access to SLAC using the following:

local> setenv BFDISTr \
  bbrdist@bbr-xferlfi.slac.stanford.edu:/afs/slac.stanford.edu/g/babar/dist

Now install the base release. You first import the release (operating system independent) and then any operating system specific files (libraries, executables, etc.). You only need to run the importarch command for the operating system architectures that you have. This process can be quite slow (a few hours) or fast (a few minutes), depending on the number of packages and libraries that have to be updated:

local> importrel -pa 24.3.1 >& importrel_24.3.1.log
local> importarch -p 24.3.1 Linux24SL3_i386_gcc323 >& importarch_24.3.1.log

When it has finished, check that the executable MooseApp has been copied (look in $BFDIST/releases/24.3.1/bin/). Some remote sites do not copy the executables. If they are missing, copy by hand directly from SLAC.
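The check for MooseApp can be scripted. This is a sketch only: check_mooseapp is a hypothetical name, and because the layout below bin/ (for example, per-architecture subdirectories) is not spelled out here, it simply searches the whole bin tree for a file called MooseApp.

```shell
# Look for the MooseApp executable under <dist>/releases/<release>/bin.
# Usage: check_mooseapp release [dist]   (dist defaults to $BFDIST)
# check_mooseapp is a hypothetical helper, not a BaBar tool.
check_mooseapp() {
    cm_bin=${2:-$BFDIST}/releases/$1/bin
    # find is used because no particular layout below bin/ is assumed.
    cm_found=$(find "$cm_bin" -name MooseApp 2>/dev/null | head -n 1)
    if [ -n "$cm_found" ]; then
        echo "found $cm_found"
    else
        echo "MooseApp not found under $cm_bin"
        return 1
    fi
}
```

For the example release in this guide the call would be check_mooseapp 24.3.1.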

After importing the release you need to run gmake siteinstall.

local> cd $BFDIST/releases/24.3.1
local> gmake siteinstall >& siteinstall.log

Please note that you might need an ad-hoc setup for your local site for e.g. locating external packages such as ROOT, tcl, etc. This is normally handled through the SiteConfig package and the BFOVERRIDE mechanism, as explained in $BFROOT/www/Computing/Offline/Production.

Now repeat for the lettered release, as described in the next section.


Installing a lettered release

First, make sure you have installed a base release (see section 5.7). Then repeat the commands for the lettered release.

local> importrel -pa 24.3.1a >& importrel_24.3.1a.log
local> importarch -p 24.3.1a Linux24SL3_i386_gcc323 >& importarch_24.3.1a.log
local> cd $BFDIST/releases/24.3.1a
local> gmake siteinstall >& siteinstall.log


Removing a release

Removing a release is simple. The rmrel command will remove the libraries and executables that are specific to a release and then remove any links. It does not remove any packages, so you can use it without worrying about deleting packages that you will need later. To remove the two releases we installed in sections 5.7 and 5.8:

local> rmrel -p 24.3.1a
local> rmrel -p 24.3.1

Installing the Conditions database

This section assumes $BFDISTr is set to /afs/slac.stanford.edu/g/babar/dist. If not, you will have to copy the files with scp or a similar tool. It also assumes that you are using cond24boot (this will become sp10boot when production starts) and that the top-level directory for the kanga files is $BFROOT.

local> cd $BFROOT/kanga/config
local> mkdir cdb
local> cd cdb
local> cp $BFROOTr/kanga/config/cdb/CdbNameRules.cfg .

Copy the latest cond24boot configuration file:

local> cd $BFROOT/kanga/config/cdb
local> ls $BFROOTr/kanga/config/cdb/*cond24boot*
local> cp $BFROOTr/kanga/config/cdb/CdbNameRules-cond24boot-full-20080212T131912.cfg .
local> ln -fs CdbNameRules-cond24boot-full-20080212T131912.cfg CdbNameRules-cond24boot-full.cfg

Repeat if the snapshot changes.

local> cd $BFROOT/kanga/config
local> mkdir cfgdb
local> cd cfgdb
local> cp $BFROOTr/kanga/config/cfgdb/CfgDBNameRules.cfg .

Copy the latest configuration file:

local> cd $BFROOT/kanga/config/cfgdb
local> ls $BFROOTr/kanga/config/cfgdb/CfgDBName*
local> cp $BFROOTr/kanga/config/cfgdb/CfgDBNameRules-20080209T194615.cfg .
local> ln -fs CfgDBNameRules-20080209T194615.cfg CfgDBNameRules-latest.cfg
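Since the rules files carry a timestamp in their name and have to be re-linked whenever the snapshot changes, the "find the newest, then link it" step can be scripted. The sketch below assumes the timestamped file has already been copied into the local directory as shown above; the function names latest_cfg and link_latest_cfg are hypothetical. It relies on the fact that timestamps of the form 20080212T131912 sort correctly as plain text.

```shell
# Print the newest file named <prefix><YYYYMMDDTHHMMSS>.cfg in a directory.
# Usage: latest_cfg dir prefix
# (hypothetical helper, not a BaBar tool)
latest_cfg() {
    ls "$1" 2>/dev/null | grep "^$2[0-9]\{8\}T[0-9]\{6\}\.cfg\$" | sort | tail -n 1
}

# Point <link> (created inside dir) at the newest matching file.
# Usage: link_latest_cfg dir prefix link
link_latest_cfg() {
    ll_new=$(latest_cfg "$1" "$2")
    [ -n "$ll_new" ] || { echo "no $2<timestamp>.cfg file in $1"; return 1; }
    ( cd "$1" && ln -fs "$ll_new" "$3" ) && echo "$3 -> $ll_new"
}
```

For the two directories above this would be called as link_latest_cfg $BFROOT/kanga/config/cdb CdbNameRules-cond24boot-full- CdbNameRules-cond24boot-full.cfg and link_latest_cfg $BFROOT/kanga/config/cfgdb CfgDBNameRules- CfgDBNameRules-latest.cfg.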

Edit the $BFROOT/kanga/config/KanAccess.cfg file to point to where the conditions files will be imported (called target-directory in the later examples):

If you are using an NFS file system:
read /store/*  file <target-directory>

If you are using an xrootd file system:
read /store/*   xrootd  xrootd-server:1094

Importing the Conditions and Configurations

You should now be ready to import the conditions database. You can either issue the next two commands or combine them with the KanImportCdbCfgDB command:

local> srtpath 24.3.1a
local> BbkImport --dbsite=SLAC --dbname bbkr24 --noupdate-sql \
   --dataset Nonevent-CDB-cond24boot-full --remote=0 --ftp-type=bbcp \
   --nstreams 10 --window 1M --multiple 10 --reverse \
   bbrdist@bbr-xferlfi.slac.stanford.edu <target-directory>

And to get the configurations db:

local> srtpath 24.3.1a
local> BbkImport --dbsite=SLAC --dbname bbkr24 --noupdate-sql \
   --dataset Nonevent-CfgDB --remote=0 --ftp-type=bbcp \
   --nstreams 10 --window 1M --multiple 10 --reverse \
   bbrdist@bbr-xferlfi.slac.stanford.edu <target-directory>

These two commands can be combined, but you will need at least KanNonEventUtils V00-00-09:

local> srtpath 24.3.1a
local> KanImportCdbCfgDB -t cond24boot -n bbkr24 -p $BFROOT/ \
  -i " --remote=0 --ftp-type=bbcp  --nstreams 10 --window 1M \
      --multiple 10 --reverse --remote-user=bbrdist"
The '-p $BFROOT' option puts the files under $BFROOT/kanga/store.

After the imports, you need to update the database with these new configurations and conditions. To find out which are the currently used files you can do something like:

local> ls -l $BFROOT/kanga/config/cfgdb/CfgDBNameRules-latest.cfg
local> ls -l $BFROOT/kanga/config/cdb/CdbNameRules-cond24boot-full.cfg

These two commands return something like:

CfgDBNameRules-20080321T212457.cfg
CdbNameRules-cond24boot-full-20080319T173024.cfg

To find the names of the latest imported files:

local> srtpath 24.3.1a
local> BbkUser --dataset=Nonevent-CfgDB file_status file \
--dbname=bbkr24 | grep CfgDB | cut -f 2 -d ' '
local> BbkUser --dataset=Nonevent-CDB-cond24boot-full file_status file \
--dbname=bbkr24 | grep boot.root | cut -f 2 -d ' '

These two commands return something like:

/store/cfg/2008/03/CfgDB-20080328T123723.root
/store/cdb/cond24boot/full/2008/04/20080407T133030/CDB-20080407T133030-cdb_boot.root

You can then update the database with these files:

local> srtpath 24.3.1a
local> KanUpdateRulesCfgDB /store/cfg/2008/03/CfgDB-20080328T123723.root
local> KanUpdateRulesCDB \
/store/cdb/cond24boot/full/2008/04/20080407T133030/CDB-20080407T133030-cdb_boot.root

This will update the kanga rules files for you but the exact name of the root files will change with time. In general, to find out what snapshot you will need and where to download it from, see Section 3.
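To decide whether an update is needed at all, you can compare the timestamp in the name of the rules file currently linked with the timestamp of the latest imported root file. The sketch below does only that comparison. It assumes, without having checked it against the tools, that the timestamp in a rules file name is the timestamp of the root file it was generated from; the function names extract_stamp and needs_update are hypothetical.

```shell
# Extract the first YYYYMMDDTHHMMSS stamp from a file name or path.
# (hypothetical helper, not a BaBar tool)
extract_stamp() {
    printf '%s\n' "$1" | grep -o '[0-9]\{8\}T[0-9]\{6\}' | head -n 1
}

# Succeed if the imported file carries a newer stamp than the rules file.
# Returns 1 if it is not newer and 2 if either name has no stamp.
# ASSUMPTION: the stamp in the rules file name matches the root file it
# was built from.
# Usage: needs_update rules_file imported_root_file
needs_update() {
    nu_old=$(extract_stamp "$1")
    nu_new=$(extract_stamp "$2")
    [ -n "$nu_old" ] && [ -n "$nu_new" ] || return 2
    # Equal-length stamps: the newer of the two sorts last as text.
    [ "$nu_old" != "$nu_new" ] &&
        [ "$(printf '%s\n%s\n' "$nu_old" "$nu_new" | sort | tail -n 1)" = "$nu_new" ]
}
```

With the example names above, needs_update CfgDBNameRules-20080321T212457.cfg /store/cfg/2008/03/CfgDB-20080328T123723.root succeeds, which suggests that KanUpdateRulesCfgDB should be run.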

Importing the background triggers

First make sure your connection is working, see Section 5.6.

local> srtpath 24.3.1a
local> klog
local> BbkImport --dbsite=SLAC --dbname bbkr24 --noupdate-sql \
   --dataset BkgTriggers-R24 --remote=0 \
   --ftp-type=bbcp --nstreams 10 --window 1M --multiple 10 --reverse \
   bbrdist@bbr-xferlfi.slac.stanford.edu <target-directory>

The target-directory is the directory that contains store/SP/BkgTriggers. The '--nstreams ...' options are all bbcp options. Depending on your site setup you might need other options here (in particular, --reverse might or might not work). The options quoted here worked at Louisville. Overall this will import about 10GB of new background collections as of mid-February 2008. This amount will grow as more background becomes available. Make sure you have enough free disk space.
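A quick way to check the free space before starting the import is sketched below. The function name has_free_gb is hypothetical; it uses the POSIX output format of df (options -P and -k), in which the fourth column of the second line is the available space in 1024-byte blocks. The 10 GB figure is the mid-February 2008 estimate quoted above, so allow a generous margin.

```shell
# Succeed if the file system holding <dir> has at least <needed_gb> GB free.
# Usage: has_free_gb dir needed_gb
# (hypothetical helper, not a BaBar tool)
has_free_gb() {
    # df -P gives a fixed layout; -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    hf_avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$hf_avail_kb" ] || return 2
    hf_need_kb=$(( $2 * 1024 * 1024 ))
    echo "$1: $((hf_avail_kb / 1024 / 1024)) GB free, $2 GB requested"
    [ "$hf_avail_kb" -ge "$hf_need_kb" ]
}
```

For example, has_free_gb <target-directory> 10 before importing the background triggers.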

Importing missing files

If you are missing files and the normal methods don't seem to be importing them, you can import them explicitly e.g.

local> BbkImport --dbsite=slac --dbname bbkr24 --noupdate-sql --remote=0 \
   --include "/store/SP/R24/BkgTriggers/BkgTriggers_200707_OnPeak_V01.*" \
   --ftp-type=bbcp --nstreams 10 --window 1M --multiple 10 --reverse \
   bbrdist@bbr-xferlfi.slac.stanford.edu <target-directory>

Checking the root files

To check the files exist (you will see lots of warnings about missing dictionaries):

local> KanFileCheck /store/cfg/2008/02/CfgDB-20080229T051835.root
/store/cfg/2008/02/CfgDB-20080229T051835.root exists

To check they are not corrupted, get the checksum from the database and then run cksum on the file:

local> BbkUser --dbname=bbkr24 checksum bytes file \
               --file=/store/cfg/2008/02/CfgDB-20080229T051835.root
CHECKSUM   BYTES    FILE
1511119755 10860715 /store/cfg/2008/02/CfgDB-20080229T051835.root
1 rows returned from bbkr24 at RAL
local> cksum /stage/xrootd-data3/kanga/store/cfg/2008/02/CfgDB-20080229T051835.root
1511119755 10860715
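The comparison between the database values and the local cksum output can be done by a small function rather than by eye. This is a sketch: verify_cksum is a hypothetical name, and you still have to take the CHECKSUM and BYTES values from the BbkUser output yourself and pass them in.

```shell
# Compare a local file with the CHECKSUM and BYTES values reported by BbkUser.
# Usage: verify_cksum file expected_checksum expected_bytes
# (hypothetical helper, not a BaBar tool)
verify_cksum() {
    [ -r "$1" ] || { echo "cannot read $1"; return 2; }
    # Reading from stdin makes cksum print just "<checksum> <bytes>".
    vc_got=$(cksum < "$1" | awk '{ print $1, $2 }')
    if [ "$vc_got" = "$2 $3" ]; then
        echo "OK $1"
    else
        echo "MISMATCH $1: expected $2 $3, got $vc_got"
        return 1
    fi
}
```

With the example above the call would be verify_cksum /stage/xrootd-data3/kanga/store/cfg/2008/02/CfgDB-20080229T051835.root 1511119755 10860715.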

Checking the database

To check you have the right snapshots issue the following command and compare with the official snapshot given in Section 3.

local> cond24boot
local> CdbRooBrowser views | grep -i default
NAME="MASTER::Run7"  ID=0::45  STATUS=NOT-FROZEN,DEFAULT

To find out which background triggers you have available use the following command:

local> cond24boot
local> BbkUser --dbname bbkr24 --dataset 'BkgTriggers*' dse --distinct
/store/SP/R24/BkgTriggers/BkgTriggers_200707_OnPeak_V01
/store/SP/R24/BkgTriggers/BkgTriggers_200712_OnPeak_V01
/store/SP/R24/BkgTriggers/BkgTriggers_200801_OnPeak_V01
/store/SP/R24/BkgTriggers/BkgTriggers_200802_OffPeak_V01
.
14 rows returned from bbkr24 at RAL

Installing ProdTools

You will now need to extend the $BFROOT directory for MC production. Add two new directories:

local> mkdir -p $BFROOT/prod
local> mkdir -p $BFROOT/prod/log/allruns

The first directory will contain the production tools and scripts. The second directory will contain all the run directories and intermediate files. This will require about 20 Gbytes for a 200,000 event run so you may wish to make allruns a link to a disk with sufficient space.
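To size the allruns disk for a different number of events, the figure above can be scaled. The sketch below assumes the space needed grows linearly with the number of events (about 20 GB per 200,000 events, i.e. roughly 100 kB per event); that linearity is an assumption made here, not a measured result, and the function name estimate_allruns_gb is hypothetical.

```shell
# Rough allruns space estimate in GB for a given number of events.
# ASSUMPTION: linear scaling of "about 20 GB per 200,000 events".
# Usage: estimate_allruns_gb events
estimate_allruns_gb() {
    # Integer arithmetic, rounded up to the next whole GB.
    echo $(( ($1 * 20 + 199999) / 200000 ))
}
```

For example, estimate_allruns_gb 1000000 prints 100.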

Now you need to install ProdTools. You cannot use the standard BABAR addpkg command as you are not in a standard release directory; you will need to use cvs directly. First, find out the current tag at SLAC and then install it at your site (the tag will be something like "V00-07-03"; ignore the leading N):

local> cd $BFROOT/prod
local> klog username@slac.stanford.edu
local> cvs checkout -r V00-07-03 ProdTools

To update a previously checked out ProdTools, again check the release at SLAC then update:

local> cd $BFROOT/prod/ProdTools
local> klog username@slac.stanford.edu
local> cvs update -Ad -r <tag>

If this is the first time you have installed ProdTools, you will need to install the module that allows access to the database at SLAC. Use the installdbi script in ProdTools to do this using your production account (you may need to ask your system admin to install a perl bundle).

You will see that there is a directory in ProdTools called "site" with a subdirectory for each site doing MC production. If your site does not exist, you should create a directory with the name of your site, $BFSITE. Then copy over the files from one of the other site subdirectories. It is easiest to copy from a site that uses the same batch system as you.

local> mkdir -p $BFROOT/prod/ProdTools/site/$BFSITE
local> cd $BFROOT/prod/ProdTools/site/$BFSITE
local> cp ../cu-boulder/* .

Now you have to edit the files. The main things you need to change are:

Installing ProdDecayFiles

Now you need to install the decay files used in MC production. You cannot use the standard BABAR addpkg command as you are not in a standard release directory. Instead there is a script in the $BFROOT/prod/ProdTools directory called updateDF that will update or install the ProdDecayFiles directory. Installing and updating ProdDecayFiles is similar to ProdTools:

local> mkdir -p $BFROOT/prod/packages/ProdDecayFiles
local> cd $BFROOT/prod/packages/ProdDecayFiles
local> klog username@slac.stanford.edu
local> $BFROOT/prod/ProdTools/updateDF <tag>

Setting up connections to SLAC Oracle database

ProdTools keeps records of everything that is done in production, along with configurations of production runs and decay modes. To keep up-to-date records of what is done at the remote site, and to pass configurations of production runs to local tools, a connection to the SLAC Oracle database must be set up.

The first step is getting the Bundle::DBI perl modules installed. You can do this yourself by logging in as root and using the command:

perl -MCPAN -e 'install Bundle::DBI'

You may have to talk to your system admin to get this done. This will install modules into your perl installation, so you need to be root when this is done.

There have been problems with some sysadmins who don't like the idea of CPAN installing whatever it wants in the perl directories. This command will also upgrade things as needed to get the modules to work, and this may interfere with existing perl installs. If necessary, or if you simply prefer it, you can download the needed modules as source, compile them for your system and install them yourself. These steps will be different for each system, so I can't help too much with this. The needed modules are DBI, PlRPC, Net::Daemon, and Storable. Make sure that Storable is from the 1.x series of releases, to be compatible with the proxy server here. But if you have control over root, and don't really care that much about how your perl is installed, 'install Bundle::DBI' will do all this for you.

After that, run the installdbi script that comes with ProdTools. This will install a local DbiProxy module to manage the Oracle connection from the remote site. The script will check on the state of your installation and tell you what problems you might have. It will also e-mail a message to Douglas Smith when it is successful. Your production computer needs to be registered with the proxy server at SLAC to allow database connections. This e-mail will tell Douglas the info he needs to know about your computer. He will register your computer and reply within a day or so. You should then be set up for Oracle connections.

Setting up a production account

Assuming the BABAR environment is set up for your account (is $BFROOT defined?), you should create the production directory in the normal way. This will not require a great deal of space.

local> newrel -t 24.3.1a SP24.3.1a
local> cd SP24.3.1a
local> srtpath
local> cond24boot
local> addpkg workdir
local> gmake installdirs
local> gmake workdir.setup
local> cd workdir

You might want to set up some default environment variables if they have not been set:

export BFSITE=ral
export PRODTOOLS=${BFROOT}/prod/ProdTools
export PRODSITE=${PRODTOOLS}/site/${BFSITE}
export ALLRUNS=/stage/xrootd-data7/simuprod/local/allruns
export MERGEDIR=/stage/xrootd-data7/simuprod/local/merge

If you use multiple $ALLRUNS directories (maybe to keep SP8 and SP9 production separate or to keep the validation runs apart from production runs), be careful with setting a default $ALLRUNS in your login scripts.

When an SP job is submitted, your spsub interface to the batch system should transmit all environment variables to the batch job, including $ALLRUNS. But the SP jobs will start a new shell, so if you always set a default in your login scripts, that default will override whatever $ALLRUNS was set in the shell calling spsub. For the Moose part of the SP job that doesn't matter, since for Moose everything is fixed in the config files. But Job.bash checks the Moose output with spcheck, and with the wrong $ALLRUNS spcheck won't find the run. The result is that the run is stuck in a 'done' state and never marked as 'good'. Only 'good' runs are picked up by spmerge.

The problem can be fixed by checking in your login scripts whether $ALLRUNS is already set before setting it to the default value. Something like this (for csh):

if (! $?ALLRUNS) then
  setenv ALLRUNS /somedir/allruns
endif
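If your login scripts are written for bash (as the export lines earlier in this section are), the same guard could look like the sketch below. The function name set_default_allruns is hypothetical and /somedir/allruns is the same placeholder as in the csh example. Note one small difference from the csh version: the ':=' form also applies the default when ALLRUNS is set but empty.

```shell
# Set ALLRUNS to a default only if it is not already set (or is empty),
# so that a value passed on by spsub is not overridden.
# Usage: set_default_allruns /path/to/allruns
# (hypothetical helper name)
set_default_allruns() {
    : "${ALLRUNS:=$1}"
    export ALLRUNS
}

# In the login script:
set_default_allruns /somedir/allruns
```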

Validation

The run numbers that need to be validated are posted on the web, see Section 3. At the time of writing the validation run range for 24.3.1c was: 9981453-9981497

Building Validation Jobs

local> spbuild -j valid --user babarmc 9981454
local> spbuild -j valid --user babarmc 9981453-9981497

Executing the first command will create the output:

local> spbuild -j valid --user babarmc 9981454
9981454(simu): A24.3.1aV01x34F 4000 24.3.1a x34 B0B0bar_JpsiKS_+-.dec
   building /stage/xrootd-data7/simuprod/local/allruns/9981454/A24.3.1aV01x34F

The directory will contain the files:

Submitting Jobs

The first time you submit a job, you should probably do just one run. This assumes you are in your release directory and have run srtpath and cond24boot:

local> cd ./SP24.3.1a
local> srtpath
local> cond24boot

The command to run the job is:

local> spsub 9981453
local> spsub -y 9981453-9981497

spsub does a lot of checking. It then copies part of the workdir to $ALLRUNS/9981497/A24.3.1aV01x15F (note how the release name and the background rate are incorporated into the directory name) and submits the simulation job. When the job finishes, the workdir files are removed, leaving a log file (often gzipped). The main files in the new directory are:

Checking Jobs

While the job is running you can use the spupdate command to interrogate the log files and tell you how things are going. In this example, the spupdate command will attempt to interrogate runs 9981453 to 9981497:

local> spupdate 9981453-9981497

Cleaning up Jobs

Once the simu job has finished, you can clean up and filter the log files using spclean. In this example, the log files for runs 9981453 to 9981497 are filtered together:

local> spclean 9981453-9981497
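The sp* commands accept a run range such as 9981453-9981497 directly, but for your own checks it is sometimes handy to loop over the individual run numbers. The sketch below expands a range and lists the runs that have no directory under $ALLRUNS yet (run directories are created as $ALLRUNS/<run>/<procspec>, as in the spbuild output above). Both function names are hypothetical.

```shell
# Expand "first-last" (or a single run number) into one run number per line.
# Usage: expand_runs 9981453-9981497
# (hypothetical helper, not a BaBar tool)
expand_runs() {
    er_run=${1%%-*}
    er_last=${1##*-}
    while [ "$er_run" -le "$er_last" ]; do
        echo "$er_run"
        er_run=$((er_run + 1))
    done
}

# List the runs in a range that have no directory under <allruns>.
# Usage: missing_run_dirs range [allruns]   (allruns defaults to $ALLRUNS)
missing_run_dirs() {
    expand_runs "$1" | while read -r mr_run; do
        [ -d "${2:-$ALLRUNS}/$mr_run" ] || echo "$mr_run"
    done
}
```

The validation range 9981453-9981497 used in this section expands to 45 runs.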

Validating Jobs

The log files and root files from the validation runs have to be checked to see they are the same as the reference set. First set up the files needed to compare with SLAC. Then use the supplied utilities to compare with SLAC for each run.

At your site, use the saveDir command to create a tar file of the runs you want to validate. You must use the '--nodb' option:

local> saveDir --nodb -t ./ 9981453-9981497

Now transfer this file to SLAC and unpack. You will need to create a directory if this is the first validation you have done i.e.

yakut> mkdir -p $BFROOT/prod/log/validation/intel/$BFSITE
yakut> mkdir -p $BFROOT/prod/log/validation/amd/$BFSITE

yakut> cd $BFROOT/prod/log/validation/intel/ral
yakut> scp -p username@local:~/9981453-9981497.tar .
yakut> tar -xvf 9981453-9981497.tar
yakut> rm 9981453-9981497.tar

Now compare your files against SLAC. In the $BFROOT/prod/log/validation directory you will see a number of utilities for comparing the log and root files. You will see some differences (about 200 normally, nearer 2000 if you accidentally compare an AMD processor with Intel).

yakut> cd $BFROOT/prod/log/validation
yakut> validate.csh <arch> <first run> <last run> slac 01 <site> <version>

For example:

yakut> cd $BFROOT/prod/log/validation/
yakut> ./validate.csh intel 9983524 9983525 slac 01 ral 01
Checking run 9983524 : Number of discrepancies: 269 (4438 histograms compared)
Checking run 9983525 : Number of discrepancies: 268 (4438 histograms compared)
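When validating a long run range it can be easier to let a filter pick out the runs with a suspiciously high discrepancy count. The sketch below parses lines in the format shown in the example above; flag_discrepancies is a hypothetical name, and the default limit of 1000 is an arbitrary value chosen here to sit between the normal level (about 200) and the level seen for an architecture mismatch (nearer 2000).

```shell
# Read validate.csh output on stdin and print "<run> <count>" for every
# run whose discrepancy count exceeds <limit>.
# Usage: ./validate.csh ... | flag_discrepancies [limit]
# (hypothetical helper; the default limit of 1000 is an assumption)
flag_discrepancies() {
    awk -v limit="${1:-1000}" '
        /^Checking run [0-9]+ : Number of discrepancies:/ {
            # Fields: Checking run <run> : Number of discrepancies: <count> ...
            if ($8 + 0 > limit + 0) print $3, $8
        }'
}
```

With the two example lines above nothing is printed, since 269 and 268 are both below the default limit.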

Production

Once you've got the validation runs working, the production run is easy. A quota for your site will be put in the database and you can build the necessary runs by using spbuild. The next few sections show the individual steps which can be followed when initially testing the system. When you are happy the system is working you can use sprite to automate production (see Section 7.9).

Building Jobs

Notice that this time we do not use the "-j" option, as we wish to use the default collection naming scheme for production and export the output.

local> spbuild --cycle SP10 9981453
local> spbuild --cycle SP10 -u douglas --debug -n 100
local> spbuild --cycle SP10 
local> spbuild --cycle SP10 --local -u babarmc -n 1

The first command builds one run; the second command builds 100 runs, identifying each run with user douglas and printing a lot of debug info; the third command builds all the runs that have been allocated to you. This command probably isn't very useful if you have a lot of jobs and you would be better off letting sprite handle the production (see Section 7.9).

The fourth command uses "--local". When the "--local" option is used, the root files are written to a local directory on the batch node and copied over (and checksummed) after the job has finished. This can reduce the load. If the environment variable $SPLOCALDIR is defined in local-simu-setup, it will be used as the temporary area. If $SPLOCALDIR is not defined, the default /tmp area is used. If $SPLOCALDIR is not defined by the user but a suitable area is defined by the batch system at run time, that area can be used by copying its value to $SPLOCALDIR in local-simu-setup, e.g.

# for CCIN2P3
export SPLOCALDIR="$TMPBATCH"
# for RAL - WORKDIR is defined at run time by the batch system
export SPLOCALDIR="$WORKDIR"
# the default will be /tmp if not defined.

The output will look like this:

local> spbuild --cycle SP10 -n 3
There are still 668 runs in the system to build at this location.

13329101(simu): A24.3.1aV01x35F 8000 24.3.1a x35 B+B-_generic.dec
   building /stage/xrootd-data7/simuprod/local/allruns/13329101/A24.3.1aV01x35F
13329102(simu): A24.3.1aV01x35F 8000 24.3.1a x35 B+B-_generic.dec
   building /stage/xrootd-data7/simuprod/local/allruns/13329102/A24.3.1aV01x35F
13329103(simu): A24.3.1aV01x35F 8000 24.3.1a x35 B+B-_generic.dec
   building /stage/xrootd-data7/simuprod/local/allruns/13329103/A24.3.1aV01x35F

If you try to build an already existing run you will get an error:

local> spbuild 13329096
ERROR: job A24.3.1aV01x35F for 13329096 already exists

If you run the spcheck command you can see the state of the runs:

local> spcheck --status
Run    Status of runs by procspec
---------------------------------
13329096 simu : A24.3.1aV01x35F - built
13329097 simu : A24.3.1aV01x35F - built
13329098 simu : A24.3.1aV01x35F - built
13329099 simu : A24.3.1aV01x35F - built
13329100 simu : A24.3.1aV01x35F - built
13329101 simu : A24.3.1aV01x35F - built
13329102 simu : A24.3.1aV01x35F - built
13329103 simu : A24.3.1aV01x35F - built

You can use the spbuild command as a useful way to find out how many runs have been allocated to you.

local> spbuild --cycle SP10 -n 1
There are still 722 runs in the system to build at this location.

13389416(simu): A24.3.1aV01x34F 8000 24.3.1a x34 B0B0bar_generic.dec
   building /stage/xrootd-data7/simuprod/local/allruns/13389416/A24.3.1aV01x34F

Submitting Jobs

Now submit all the jobs. You can of course just submit a subset at a time by modifying the run range at the end of the spsub command:

local> spsub 13329096
local> spsub -y 13329097-13329103

local> spsub 13329096
Submit run 13329096? [Yes/No/All/None] y
Run 13329096 submitted as job 8052930

To see what is happening in the batch queue now:

local> spjobs --summary
Run      JobID   Type User     Stat Queue    Host       Exec   Per.  LastMod
----------------------------------------------------------------------------
13329096 8052930 simu babarmc  RUN  sl4p     csflnx353           0%  0.20 min

spjobs does not show the merge jobs. To see what is happening based on the log files:

local> spcheck --status
Run    Status of runs by procspec
---------------------------------
13329096 simu : A24.3.1aV01x35F - run - lcg0306 -  0.1% - 0.25 mins
13329097 simu : A24.3.1aV01x35F - built

The run directory will contain the following files while running a job:

local> ls $ALLRUNS/13329096/A24.3.1aV01x35F
13329096.moose.01.root bin DECAY.DEC pdt.table RooLogon.C imu13329096.log
13329096.moose.02E.root config.sh  GNUmakefile RELEASE config.tcl
seed-overuse.txt workdir.files B+B-_generic.dec PARENT RooAlias.C shlib

spcheck, as well as parsing the log file, will change the status of production of each run as appropriate so that the next stage can start.

To submit all the run range without being prompted, use the '-y' option:

local> spsub -y 13329097-13329103
Run 13329097 submitted as job 8053009
Run 13329098 submitted as job 8053010
Run 13329099 submitted as job 8053011
Run 13329100 submitted as job 8053012
Run 13329101 submitted as job 8053013
Run 13329102 submitted as job 8053014
Run 13329103 submitted as job 8053015

local> spcheck --status
Run    Status of runs by procspec
--------------------------------------------------
13329096 simu : A24.3.1aV01x35F - run - lcg0306 -  0.5% - 0.23 mins
13329097 simu : A24.3.1aV01x35F - run - lcg0342 -  0.0% - 0.98 mins
13329098 simu : A24.3.1aV01x35F - run - lcg0295 -  0.0% - 0.98 mins
13329099 simu : A24.3.1aV01x35F - submit - 0.32 min
13329100 simu : A24.3.1aV01x35F - submit - 0.30 min
13329101 simu : A24.3.1aV01x35F - submit - 0.30 min
13329102 simu : A24.3.1aV01x35F - submit - 0.30 min
13329103 simu : A24.3.1aV01x35F - submit - 0.28 min

local> spjobs --summary
Run      JobID  Type User     Stat Queue    Host       Exec Per.  LastMod
------------------------------------------------------------------------
13329096 805293 simu babarmc  RUN  sl4p     csflnx353  lcg0664    1%  0.55 min
13329097 805300 simu babarmc  RUN  sl4p     csflnx353  lcg0664    0%  0.05 min
13329098 805301 simu babarmc  RUN  sl4p     csflnx353  lcg0664    0%  0.02 min
13329099 805301 simu babarmc  RUN  sl4p     csflnx353  lcg0664    0%  0.05 min
13329100 805301 simu babarmc  RUN  sl4p     csflnx353  csflnx389  0%  0.07 min
13329101 805301 simu babarmc  RUN  sl4p     csflnx353  lcg0655    0%  0.05 min
13329102 805301 simu babarmc  RUN  sl4p     csflnx353  csflnx390  0%  0.38 min
13329103 805301 simu babarmc  RUN  sl4p     csflnx353  csflnx390  0%  0.38 min
Summary of runs: 8 in system, 8 running and 0 pending.
Types in system: 8 simu, 0 mixr, and 0 reco.

Now go away and have a very large coffee (and dinner, and a nap). And hope nothing goes wrong!

Checking Jobs

While runs are in progress, there are a couple of tools for checking on their state: spjobs and spcheck.

spjobs will query the batch system, find production jobs, and produce a nicely formatted output on the status of these jobs. It will look for all runs which are in the batch system, submitted by the username with which you are logged in. To check on all jobs in the batch system just do:

local> spjobs --summary
Run      JobID  Type User     Stat Queue    Host       Exec Per.  LastMod
------------------------------------------------------------------------
13329096 805293 simu babarmc  RUN  sl4p     csflnx353        3%  0.43 min
13329097 805300 simu babarmc  RUN  sl4p     csflnx353        2%  0.18 min
13329098 805301 simu babarmc  RUN  sl4p     csflnx353        2%  0.35 min
13329099 805301 simu babarmc  RUN  sl4p     csflnx353        2%  0.27 min
13329100 805301 simu babarmc  RUN  sl4p     csflnx353        2%  0.35 min
13329101 805301 simu babarmc  RUN  sl4p     csflnx353        2%  0.22 min
13329102 805301 simu babarmc  RUN  sl4p     csflnx353        2%  0.10 min
13329103 805301 simu babarmc  RUN  sl4p     csflnx353        2%  0.43 min

Summary of runs: 8 in system, 8 running and 0 pending.
Types in system: 8 simu, 0 mixr, and 0 reco.

This script needs to be able to query the batch system. The output shows which computer the job is running on, whether the job is running or just pending in the system, the percentage of events done in the running job, and how long ago the log file was last modified. If a log file has not been touched for over 30 minutes, the run will be marked as hanging.
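
The hanging test described above comes down to comparing the log file's modification time with the current time. The following Python sketch shows that comparison only; it is not the spjobs implementation, and the names is_hanging and log_is_hanging are made up for illustration. The 30 minute threshold is the one quoted in the text.

```python
import os
import time

HANG_AFTER_SEC = 30 * 60  # "over 30 mins", as quoted in the text


def is_hanging(last_modified, now):
    """True if the log was last touched more than 30 minutes before 'now'.

    Both arguments are Unix timestamps in seconds.
    """
    return (now - last_modified) > HANG_AFTER_SEC


def log_is_hanging(log_path):
    """Apply the same test to a real log file on disk."""
    return is_hanging(os.path.getmtime(log_path), time.time())
```

For example, log_is_hanging could be called on the log file in a run directory under $ALLRUNS to check a single run by hand.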

A range of run numbers can be given to this command to list only the desired runs. Options can also filter on the state of runs, to show only those that are running, pending, exited, done, or hanging. You can also filter on computer name, using partial strings to view runs on a set of computers. Check the man pages and the help message for a full list of options. A particularly useful option is '-T', which tails the log file of each running job; it takes a number, which is the number of lines to print from the log file.

spcheck does not query the batch system, so it will work anywhere. This script will go through the files in the $ALLRUNS directory looking for the status of the runs. The default is to list the status of all procspecs built in all runs in the $ALLRUNS directory. It will report back on a procspec as built, submitted, running, done, good, or failed. spcheck will also update the status of the production for each run so that the next stage can begin.

To use the script type:

local> spcheck --status
Run    Status of runs by procspec
-------------------------------------------------------------------
9979558 : finished - 4000 events, 5.96 s/ev
9979559 : finished - 20 events, 20.10 s/ev
9981453 simu : A24.3.1cV01x34F - built
9981454 simu : A24.3.1cV01x34F - built
13305635 : finished - 8000 events, 4.28 s/ev
13305685 : finished - 8000 events, 8.11 s/ev
13305686 : finished - 8000 events, 7.34 s/ev
13305687 : finished - 8000 events, 6.38 s/ev
13305688 simu : A24.2.0cV06x75F - done - 12.33 hrs
13305689 : finished - 8000 events, 7.68 s/ev
13305690 : finished - 8000 events, 7.76 s/ev
13305691 : finished - 8000 events, 7.69 s/ev
13305692 : finished - 8000 events, 7.33 s/ev
13305693 : finished - 8000 events, 5.40 s/ev
13305694 : finished - 8000 events, 6.10 s/ev
13329096 simu : A24.3.1aV01x35F - run - lcg0306 -  2.9% - 0.43 mins
13329097 simu : A24.3.1aV01x35F - run - lcg0342 -  2.1% - 0.35 mins

The output can be a little confusing since there can be different procspecs for each run number. But there are options to filter the display on jobs that are running, built, submitted, done, good, or failed. There is also an option to check the collection to see how many events were written compared to the number of events requested.
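
If you save the status listing to a file, a short script can summarise it for you. The Python sketch below counts runs per state; it assumes only the line layout visible in the example output above ("RUN [TYPE] : [PROCSPEC -] STATE - ..."), which may not cover every line spcheck can print. The function names are hypothetical and not part of ProdTools.

```python
from collections import Counter


def parse_status_line(line):
    """Return (run, procspec, state) for one status line, or None for headers."""
    left, sep, right = line.partition(" : ")
    head = left.split()
    if not sep or not head or not head[0].isdigit():
        return None  # header, separator or blank line
    fields = [f.strip() for f in right.split(" - ")]
    if len(head) > 1:
        # "RUN TYPE : PROCSPEC - STATE - ..."
        if len(fields) < 2:
            return None
        return int(head[0]), fields[0], fields[1]
    # "RUN : STATE - ..." (no type or procspec shown)
    return int(head[0]), None, fields[0]


def count_states(text):
    """Count how many listed procspecs are in each state."""
    parsed = (parse_status_line(line) for line in text.splitlines())
    return Counter(p[2] for p in parsed if p is not None)


sample = """\
Run    Status of runs by procspec
---------------------------------
9979558 : finished - 4000 events, 5.96 s/ev
9981453 simu : A24.3.1cV01x34F - built
13305688 simu : A24.2.0cV06x75F - done - 12.33 hrs
13329096 simu : A24.3.1aV01x35F - run - lcg0306 -  2.9% - 0.43 mins
"""
counts = count_states(sample)
print(dict(counts))
```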

This script can also tail the log files for the procspecs listed, using '-T'. This option requires a number, which is the number of lines to print from each log file. Doing this for all the procspecs built can be very lengthy, so there are a number of options to filter the procspecs that are displayed.

Check the man pages for this script (in "ProdTools/doc/man") for a complete listing of options, and the help message.

Recovering from crashes and abandoned runs

If you run a job and it crashes you can re-run once you have solved the problem. This assumes that the problem is not repeatable. In other words, if the problem is inherent in the code, simply rerunning the job will not cure it. If a job crashes, it will be marked as failed in the database of completed runs and there will be an incomplete collection in the database. To get round this, you will have to make a second version of the simu stage. This is done with the '-V' option in the ProdTools scripts.

local> spbuild -V02 385257
local> spsub -y -V02 385257

This will build a second version, and submit it.

If your jobs have been run by sprite and the problem is not a one-off, you will see that many versions of the job have run and been marked as failed. If you have reached sprite's limit, the runs will be marked as abandoned and sprite will no longer attempt to resubmit them. You have two choices.

The maximum number of retries is twice the value of 'failuresBeforeAbandon' set in sprite_rc. Therefore, if 'failuresBeforeAbandon=3' then the files will be marked as abandoned by sprite if V03 fails; after '-set recover' has been run, the files will be marked as abandoned if V06 fails. After that, '-set recover' has no effect unless you increase 'failuresBeforeAbandon'.
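
The arithmetic in the paragraph above can be written out as a small Python sketch. It simply restates the rule given in the text (abandon at version N, then at version 2N after a recover); the function names are hypothetical and the two-digit version label is inferred from the V03 and V06 examples.

```python
def abandon_versions(failures_before_abandon):
    """Return the version numbers at which a run is abandoned.

    The first value applies before a recover has been run, the second after.
    """
    first_limit = failures_before_abandon
    final_limit = 2 * failures_before_abandon
    return first_limit, final_limit


def version_label(number):
    """Format a version number the way the guide writes it, e.g. 3 -> 'V03'."""
    return f"V{number:02d}"


first, final = abandon_versions(3)
print(version_label(first), version_label(final))
```

With the value failuresBeforeAbandon=4 used in the example sprite_rc later in this guide, the same rule would give V04 and V08.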

Updating Job Status

You use the spupdate command to update the SLAC database with the status of the runs. In this example, the spupdate command will attempt to interrogate runs 13329096 to 13329103:

local> spupdate 13329096-13329103
RUN      PROCSPEC          HOST    CPUSEC NEVTS  STATUS
13329096 A24.3.1aV01x35F   lcg0306      0   775  RUNNING Wed Feb 20 22:41:59 2008
13329097 A24.3.1aV01x35F   lcg0342      0   701  RUNNING Wed Feb 20 22:41:51 2008
13329098 A24.3.1aV01x35F   lcg0295      0   679  RUNNING Wed Feb 20 22:41:51 2008
13329099 A24.3.1aV01x35F   lcg0295      0   696  RUNNING Wed Feb 20 22:41:56 2008
13329100 A24.3.1aV01x35F   lcg0296      0   691  RUNNING Wed Feb 20 22:41:48 2008
13329101 A24.3.1aV01x35F   lcg0296      0   678  RUNNING Wed Feb 20 22:42:08 2008
13329102 A24.3.1aV01x35F   lcg0298      0   697  RUNNING Wed Feb 20 22:41:56 2008
13329103 A24.3.1aV01x35F   lcg0298      0   692  RUNNING Wed Feb 20 22:42:06 2008

The CPUSEC will only be updated when the run has finished.

local> spupdate --debug 13329096
Does not have local database system.
Has dbiproxy module, using BaBar::DbiProxy.
Connecting to the database, nodb:0, dbiproxy:1...
Using proxy db...
Established connection to database.
Setting nodb option: nodb:0
Getting param for run: 13329096 - A24.3.1aV01x35F, simu, 8000 events.
RUN    PROCSPEC                      HOST             CPUSEC  NEVTS  STATUS
Wed Feb 20 21:36:50 2008
Procspec for job is: A24.3.1aV01x35F
Checking status file.
Wed Feb 20 21:36:50 2008
lcg0306,0,292,,0,V00-08-62,24.3.1a,,, , ,,
Intel(R) Xeon(TM) CPU 2.66GHz,2668,
 13329096  lcg0306               0    292  RUNNING Wed Feb 20 21:36:21 2008

The line beginning 'lcg0306' gives, in order: the batch machine, the batch length (0 until the job ends), the number of events done so far (292), the time the job finished (null until the end of the run), whether the job is marked as good (1) or not marked (0), the tag release of ProdDecayFiles (V00-08-62), the release number (24.3.1a), the number of generated events, three fields that are not filled, and ScanCol (?). The next line contains the CPU type (Intel(R) Xeon(TM) CPU 2.66GHz), the CPU speed in MHz (2668) and the CPU percent (filled at the end of the run).
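
To make the field order easier to follow, here is a Python sketch that splits the comma-separated status line from the debug output into named fields. The field names are invented for this illustration (ProdTools does not necessarily use them), and the order is taken directly from the description above.

```python
# Hypothetical names for the fields, in the order described in the text.
STATUS_FIELDS = [
    "batch_host", "batch_length", "events_done", "finish_time", "good_flag",
    "decay_files_tag", "release", "events_generated",
    "unused_1", "unused_2", "unused_3", "scan_col",
]


def parse_status_record(line):
    """Split one status line into a dict keyed by the names above."""
    values = [v.strip() for v in line.split(",")]
    # zip() stops at the shorter sequence, so a trailing empty field is ignored.
    record = dict(zip(STATUS_FIELDS, values))
    for key in ("events_done", "good_flag"):
        # Convert the numeric fields that are filled while the job is running.
        if record.get(key):
            record[key] = int(record[key])
    return record


record = parse_status_record("lcg0306,0,292,,0,V00-08-62,24.3.1a,,, , ,,")
print(record["batch_host"], record["events_done"], record["release"])
```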

spupdate involves a lot of database access so should not be run very often.

Merging Jobs

If you installed mysql after the release, spmerge will not work unless you have done a 'gmake ldlink'. To test whether spmerge will work, run KanCollUtil from the command line and see if there is a missing library:

local> KanCollUtil -h
KanCollUtil: error while loading shared libraries: libmysqlclient.so.14:

local> spmerge -v -u babarmc
Couldn't open merge log file /stage/xrootd-data7/simuprod/local/ \ 
merge/001235/200707/24.2.0c/SP_001235_000132/merge.log(.gz).

This has in fact started a merge job (look in the batch queue); the error just means spmerge couldn't confirm that an already existing merge attempt was successful, so it will try a new merge. If spmerge says it has tried too many times, this can be overridden with the '-ignore' option.

The merged jobs are created under the $ALLRUNS/merge directory. When the merge is successful, sparchive is run in the background to clean up the run directories. The run directories are made into tar files. What you do with them is up to you. sparchive needs a connection to the SLAC database. spmerge will also update the SLAC database and run directories when the merge is successful.

local> spcheck --status
Run    Status of runs by procspec
----------------------------------------------------------------
9981453 simu : A24.3.1cV01x34F - built
9981454 simu : A24.3.1cV01x34F - built
13305635 simu : A24.2.0cV06x75F - merging - into SP_001235_000703
13305685 simu : A24.2.0cV06x75F - merging - into SP_001005_000690
13305688 simu : A24.2.0cV06x75F - done - 12.33 hrs
13305693 simu : A24.2.0cV07x75F - merging - into SP_001005_000690
13305694 simu : A24.2.0cV07x75F - merging - into SP_001005_000690
13329096 simu : A24.3.1aV01x35F - run - lcg0306 - 27.8% - 0.38 mins
13329097 simu : A24.3.1aV01x35F - run - lcg0342 - 26.6% - 0.20 mins

The next time you run spmerge it will change the status to merged for successfully merged jobs. You can only see the status of the merged runs if you use the '--merged' option or do not use the '--status' option with spcheck:

local> spmerge -v -u babarmc
local> spcheck --status --merged
Run    Status of runs by procspec
---------------------------------------------------------------
13305688 : merged - into SP_001005_000690

Exporting Merged Jobs

local> spexport -v -u babarmc

spexport scans the merge directory $RUNDIR/merge for successfully merged runs and exports them to SLAC. When finished the database at SLAC is updated. spexport starts the transfers in the background:

local>  ps -aux | grep transfer
babarmc  28694  0.0  0.1  8236 4452 ?        S    05:29   0:00 
/usr/local/bin/perl -w /afs/rl.ac.uk/bfactory/prod/ProdTools/transferUtil 
--debug --wait 2 -group SP10 
-coll /store/SP/R24/001235/200707/24.2.0c/SP_001235_000132 
-dir /stage/xrootd-data7/simuprod/local/merge/001235/200707/24.2.0c

The status of the transfers is written to $ALLRUNS/transfer.log

If you rerun spexport, there is a delay of 1 hour after an export has failed before it is marked as failed. Until then it will be marked as running in spexport. If the transfer starts but for some reason does not indicate it has failed, spexport will mark it as failed after 8 hours. spexport currently does not export log files (why?).

Cleaning up Jobs

spmerge will run sparchive to archive the completed runs if the merge has been successful (the merged runs don't have to have been exported). sparchive will also automatically archive the directories of abandoned runs after 3 weeks. The contents of the run directories are appended to a tar file in the $ALLRUNS directory and then deleted. Each tar file will be up to 300 Mbytes in size.


Using sprite

All the above commands can be combined and controlled using the sprite daemon. sprite is controlled by the spcontrol command. It is a daemon that takes care of all the individual tasks outlined above. The settings are controlled by a file sprite_rc in the site directory. At first time start up, sprite will create a run control file for you, with default settings, if one does not exist.

If your site directory is in afs, this can cause problems as the token can expire and sprite will no longer be able to write to sprite_rc. sprite will check for the expiration of the afs token if AfsTokenWatch is set and will stop 2 hours before expiration. Alternatively, you can specify a different sprite_rc when you start spcontrol (see below).

sprite can get particularly confusing if you log into different machines from time to time either on purpose or due to some DNS-name aliasing. Make sure you don't start multiple sprite sessions. It's always good to start a new session by issuing the 'status' command (see below).

To start sprite:

local> spcontrol start

This will launch sprite as a background process on that machine. sprite will not interfere with ongoing runs or production. You can continue to submit runs if you wish; sprite will see that they are running and ignore them. sprite will just check on runs, do what is needed, and use the available resources.

To stop sprite:

local> spcontrol stop

sprite will stop after between 30 seconds and 6 minutes.

To use a different configuration file use the '-c' option. However, you will need to use the same option for all subsequent calls to spcontrol or it will get confused. For example:

local> spcontrol -c ./new_sprite_rc start

For all the available options see the man pages for sprite (Section 7.9).

Here is an example of a sprite_rc file:

#Run control for the simu production run control daemon
#runcontrol choices: stop, suspend, running
runcontrol=stop
#run range for sprite control
runrange=
#run cycle to work in
cycle=SP10
# spritename (not really needed these days) 
spritename=
#maximum jobs to submit at one time
maxSubmitPerCycle=50
#maximum number of running jobs in system
maxJobRunning=100
#time to wait between submissions in seconds
submitWait=10
#number of failures before a run is abandoned to manual
failuresBeforeAbandon=4
#turn off spmonitor
monitor=off
#turn on OpenPBS interaction
spjobs=yes
#set user
produser=babarmc
#check this disk and don't submit runs if not enough disk space
diskwatch=/stage/xrootd-data7
# minimum amount of space (GB)
minavailspace=5
# hours to wait between merges
mergeWait=4
# hours to wait between exports
exportWait=4
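
If you want to inspect the settings from your own scripts, the file is easy to read. The Python sketch below parses the key=value layout shown in the example above, skipping '#' comment lines. It is an illustration only: sprite's own parser may treat whitespace, comments or repeated keys differently, and the function name parse_sprite_rc is hypothetical.

```python
def parse_sprite_rc(text):
    """Parse key=value lines into a dict of strings, ignoring '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Empty values such as "runrange=" are kept as empty strings.
            settings[key.strip()] = value.strip()
    return settings


example = """\
#maximum jobs to submit at one time
maxSubmitPerCycle=50
#number of failures before a run is abandoned to manual
failuresBeforeAbandon=4
#run range for sprite control
runrange=
"""
settings = parse_sprite_rc(example)
print(int(settings["failuresBeforeAbandon"]))
```

To read a real file, pass its contents to the function, for example parse_sprite_rc(open("sprite_rc").read()).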

sprite status

To find out what sprite is doing, issue the spcontrol command with the 'status' option and the same '--conf' option you used when you started sprite (otherwise you may pick up a different sprite_rc file). In this example, the spritename is RALSP10:

local> spcontrol --conf ./sprite_rc status
Status: 4 checks think sprite with spritename RALSP10 is running:
0) sptools.log: sprite RALSP10 last started: Fri Feb 22 19:08:50 2008 lcgui0361
1) the .sprite_runningRALSP10 file IS in place.
2) The sprite_rc file says sprite IS running.
3) sprite modified its log 20.1 hrs ago: >10 mins implies sprite NOT running.
4) sprite IS running on this machine (lcgui0361) with pid 25765.

There are 5 tests:

ProdTools Commands

Documentation

There are man pages for the main ProdTools commands and they can be accessed with the man command. The man pages' source files for all the ProdTools commands are in the $BFROOT/prod/ProdTools/doc/man directory. To make the man pages into a ps file or to put the man pages where they can be accessed with the man command, see the README in $BFROOT/prod/ProdTools/doc/man.

Definitions of SP states and log files

There are 3 sources of information used by SP commands and jobs:

About this document ...

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

The translation was initiated by Fergus Wilson on 2008-06-29

