This document is userguide.tex in the doc directory of the ProdTools package. Check out the head of ProdTools to get the latest version. It can be converted to postscript, pdf and html. Simply issue the following commands to update the files:
> latex userguide.tex
> dvips -Ppdf -o userguide.ps userguide.dvi
> ps2pdf userguide.ps userguide.pdf
> gzip -f -v --best userguide.ps
> latex2html -split 0 -t 'BaBar SP10 MC Production Guide' userguide.tex
> tar -zcvf u.gtar userguide
You can then update the official web page at SLAC if you wish. Copy the (gzipped) userguide.ps and userguide.pdf to $BFROOT/www/Computing/Offline/Production, then unpack the tar file there.
> tar -C $BFROOT/www/Computing/Offline/Production -zxvf u.gtar
Information that changes over time is available at the following locations:
http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Production
http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Production/run_ranges_sp10.html
http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Production/releases.html
/afs/slac.stanford.edu/g/babar/kanga/config/cdb
/afs/slac.stanford.edu/g/babar/kanga/config/cfgdb
The man pages for all the ProdTools commands are in the $BFROOT/prod/ProdTools/doc/man directory. To make the man pages accessible, copy the files to $BFROOT/man/man1. If the $MANPATH environment variable is set up properly, they will be accessible with the man command:
local> man sprite
To make the man pages in a ps or pdf file for printing, use the commands:
local> cd $BFROOT/prod/ProdTools/doc/man
local> groff -Tps -man *.1 > prodtools.ps
local> ps2pdf prodtools.ps prodtools.pdf
These can be copied to the SLAC web page at $BFROOT/www/Computing/Offline/Production. A pdf/ps version can be found here:
http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Production/prodtools.pdf
http://www.slac.stanford.edu/BFROOT/www/Computing/Offline/Production/prodtools.ps
You need SL3, SL4, RHEL 3, or RHEL 4, together with Objectivity 8.0.9, ROOT 5.14-00e, CLHEP 1.9, GEANT 4.8, and mysql 4.1.9. The guide assumes that you are using the cond24boot database. When SP10 production starts, the new database will be accessed via sp10boot and the documentation will be updated.
You need to follow the steps in Sections 5.1 to 5.5 before installing the base release for the first time, or you may get problems with missing shared libraries and the like. You only need to do these steps once.
This section is not necessary with releases greater than or equal to 24.3.1.
local> mkdir $BFROOT/package/objy8.0.9
If you install it anywhere else, you will need to configure SiteConfig for it. Get Objectivity.R8.0.9.linux86gcc3.tar.gz from /afs/slac.stanford.edu/g/babar/package/objy8.0.9/Objectivity_admin/Objectivity_src/
local> cd $BFROOT/package/objy8.0.9
local> tar -zxvf Objectivity.R8.0.9.linux86gcc3.tar.gz
local> cd cdrom
local> ./install
- select option 3 'Custom Installation'
- select 5,9,10,12
- select install directory <objydir> (highly recommend $BFROOT/package/objy8.0.9)
- set linux86gcc3 C++ include path:
  /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/include
  /usr/include/c++/3.2.3
  /usr/include/c++/3.2.3/backward
  /usr/include/c++/3.2.3/i386-redhat-linux
- use defaults otherwise
Copy liboocx.so from SLAC (into <objydir>/linux86gcc3/lib). Copy oolicense.runtime.txt from SLAC into <objydir>.
Please install ROOT 5.14-00e in $BFROOT/package/root/5.14-00e. If you install it anywhere else, you will need to configure SiteConfig for it.
local> mkdir $BFROOT/package/root/5.14-00e
local> cd $BFROOT/package/root/
local> wget http://hep.phys.utk.edu/~gragghia/babar/root-5.14-00e.tgz
local> tar -xvpzf root-5.14-00e.tgz
Please install CLHEP 1.9.2.1 in $BFROOT/package/clhep/1.9.2.1.
local> mkdir $BFROOT/package/clhep/1.9.2.1
local> cd $BFROOT/package/clhep/1.9.2.1
local> wget http://www.slac.stanford.edu/~dbrown/clhep-1.9.2.1.tar.gz
local> tar -xvpzf clhep-1.9.2.1.tar.gz
Note: If you have already installed your 24-series release, you should cd to the release directory and gmake ldlink to ensure that clhep is linked in to your release successfully. If you have not installed the release yet, gmake siteinstall should take care of this automatically.
Please install GEANT version geant4-08-03-ref-00-patch-03 in $BFROOT/simu/geant4/geant4-08-03-ref-00-patch-03.
local> mkdir $BFROOT/simu/geant4/geant4-08-03-ref-00-patch-03
local> cd $BFROOT/simu/geant4/geant4-08-03-ref-00-patch-03
local> wget http://www.slac.stanford.edu/~dbrown/geant4.8.tar.gz
local> tar -xvpzf geant4.8.tar.gz
Note: If you have already installed your 24-series release, you should cd to the release directory and gmake ldlink to ensure that GEANT4 is linked in to your release successfully. If you have not installed the release yet, gmake siteinstall should take care of this automatically.
Please install mysql libs for version 4.1.9 in $BFROOT/package/mysql/4.1.9/Linux24SL3_i386_gcc323/lib.
local> mkdir $BFROOT/package/mysql/4.1.9/Linux24SL3_i386_gcc323/lib
local> cd $BFROOT/package/mysql/4.1.9/Linux24SL3_i386_gcc323/lib
local> wget http://www.slac.stanford.edu/~dbrown/mysql-4.1.9.tar.gz
local> tar -xvpzf mysql-4.1.9.tar.gz
Note: If you have already installed your 24-series release, you should cd to the release directory and gmake ldlink to ensure that the mysql libraries are linked in to your release successfully. If you have not installed the release yet, gmake siteinstall should take care of this automatically.
Downloading the conditions snapshot is similar to downloading background trigger events, and requires use of BbkImport. BbkImport uses a package that depends on BaBar::SQL, which only exists at sites that use a MySQL or Oracle database. To get around this dependency at other sites, you can create a dummy BaBar::SQL package. Create a file $BFROOT/lib.shared/perl/BaBar/SQL.pm which looks like this:
#
# dummy module
#
package SQL;
1;
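The file can also be created from the shell. The following is a minimal sketch, assuming a POSIX shell; make_dummy_sql is a hypothetical helper name and is not part of any BaBar package. It refuses to overwrite an existing SQL.pm so that a site with a real BaBar::SQL is not damaged.

```shell
# Sketch only: make_dummy_sql is a hypothetical helper, not a BaBar tool.
# Usage: make_dummy_sql <top-level directory, normally $BFROOT>
make_dummy_sql() {
    root="$1"
    dir="$root/lib.shared/perl/BaBar"
    mkdir -p "$dir" || return 1
    # Leave any existing module (real or dummy) untouched.
    if [ -e "$dir/SQL.pm" ]; then
        echo "$dir/SQL.pm already exists, leaving it alone" >&2
        return 0
    fi
    # Write the dummy module exactly as shown in the text above.
    cat > "$dir/SQL.pm" <<'EOF'
#
# dummy module
#
package SQL;
1;
EOF
}
```

A typical call would be make_dummy_sql "$BFROOT" from the production account.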
Importing database files to your site (and later exporting data to SLAC) requires that you have password-less access to the bbrdist account on the bbr-xfer0x machines at SLAC. There is now an alias for all the bbr-xfer0x machines: bbr-xferlfi. Test this by trying to login via ssh to bbrdist@bbr-xferlfi.slac.stanford.edu. If this doesn't work, you need to have your server's rsa key added to the authorized_keys file in the bbrdist directory at SLAC. To do this, first login to your production account on your head node at your site.
local> ssh-keygen -t rsa
Hit Return when prompted for a passphrase. Send the
contents of the resulting public key file to Wilko Kroeger
(wilko@slac.stanford.edu) and ask him to put the key in the bbrdist
account's authorized_keys file. When he has done this you should be
ready to import the conditions database.
Update: The connection to SLAC is set up in $HOME/.bbk by the BbkGetConnectInfo script. That script is part of the release, but it is also part of the bin package, and the bin package comes first in the search path. The bin package version of the script uses /usr/local/bin/perl, not the site-specific perl version configured for BABAR. Make sure /usr/local/bin/perl exists; create a link if it does not (or modify the bin version of BbkGetConnectInfo).
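The perl check can be sketched as below. This is an illustration only: check_bin_perl is a hypothetical helper name, and the function only prints a suggested ln command rather than running it, since creating the link normally needs root.

```shell
# Sketch only: check_bin_perl is a hypothetical helper, not a BaBar tool.
# Usage: check_bin_perl [path]   (path defaults to /usr/local/bin/perl)
check_bin_perl() {
    target="${1:-/usr/local/bin/perl}"
    if [ -x "$target" ]; then
        echo "ok: $target exists"
        return 0
    fi
    # Look for whichever perl is first in the current PATH.
    site_perl=$(command -v perl) || {
        echo "missing: $target, and no perl found in PATH" >&2
        return 2
    }
    echo "missing: $target; as root, consider: ln -s $site_perl $target"
    return 1
}
```

Whether the perl found in PATH is the right one to link is a site decision; the suggestion printed here is only a starting point.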
To install a new release you must first install the base release (e.g. 24.3.1) and afterwards the lettered release (e.g. 24.3.1a). First log in as your bfactory account, enable afs to the computer with the source repository and set the environment variable for your remote source repository. This example assumes you are using the SLAC repository, but you can replace it with a nearer or faster site.
local> klog username@rl.ac.uk
local> printenv BFDISTr
local> setenv BFDISTr /afs/slac.stanford.edu/g/babar/dist
Some people have had faster access to SLAC using the following:
local> setenv BFDISTr \
       bbrdist@bbr-xferlfi.slac.stanford.edu:/afs/slac.stanford.edu/g/babar/dist
Now install the base release. You first import the release (operating system independent) and then any operating system specific files (libraries, executables, etc.). You only need to use the importarch command for the operating system architectures that you have. This process can be quite slow (a few hours) or fast (a few minutes) depending on the number of packages and libraries that have to be updated:
local> importrel -pa 24.3.1 >& importrel_24.3.1.log
local> importarch -p 24.3.1 Linux24SL3_i386_gcc323 >& importarch_24.3.1.log
When it has finished, check that the executable MooseApp has been copied (look in $BFDIST/releases/24.3.1/bin/). Some remote sites do not copy the executables. If they are missing, copy by hand directly from SLAC.
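The check for the executable can be sketched as follows. release_has_exe is a hypothetical helper name; the recursive search is an assumption made here in case the executables sit in a per-architecture subdirectory of bin.

```shell
# Sketch only: release_has_exe is a hypothetical helper, not a BaBar tool.
# Usage: release_has_exe <bin directory> [executable name, default MooseApp]
release_has_exe() {
    bindir="$1"
    exe="${2:-MooseApp}"
    # Accept regular files and symbolic links anywhere below the bin directory.
    found=$(find "$bindir" \( -type f -o -type l \) -name "$exe" 2>/dev/null | head -n 1)
    if [ -n "$found" ]; then
        echo "found: $found"
        return 0
    fi
    echo "not found: $exe under $bindir (copy it by hand from SLAC)" >&2
    return 1
}
```

For the example release this would be called as release_has_exe "$BFDIST/releases/24.3.1/bin".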
After importing the release you need to run gmake siteinstall.
local> cd $BFDIST/releases/24.3.1
local> gmake siteinstall >& siteinstall.log
Please note that you might need an ad-hoc setup for your local site for e.g. locating external packages such as ROOT, tcl, etc. This is normally handled through the SiteConfig package and the BFOVERRIDE mechanism, as explained in $BFROOT/www/Computing/Offline/Production.
Now repeat for the lettered release:
First, make sure you have installed a base release (see section 5.7). Then repeat the commands for the lettered release.
local> importrel -pa 24.3.1a >& importrel_24.3.1a.log
local> importarch -p 24.3.1a Linux24SL3_i386_gcc323 >& importarch_24.3.1a.log
local> cd $BFDIST/releases/24.3.1a
local> gmake siteinstall >& siteinstall.log
Removing a release is simple. The rmrel command will remove the libraries and executables that are specific to a release and then remove any links. It does not remove any packages so you can use it without worrying about deleting packages that you will need later. To remove the 2 releases we installed in sections 5.7 and 5.8:
local> rmrel -p 24.3.1a
local> rmrel -p 24.3.1
This section assumes $BFDISTr is set to /afs/slac.stanford.edu/g/babar/dist. If not, you will have to copy the files with scp or a similar tool. It also assumes that you are using cond24boot (this will become sp10boot when production starts) and that the top-level directory for the kanga files is $BFROOT.
local> cd $BFROOT/kanga/config
local> mkdir cdb
local> cd cdb
local> cp $BFROOTr/kanga/config/cdb/CdbNameRules.cfg .
Copy the latest cond24boot configuration file:
local> cd $BFROOT/kanga/config/cdb
local> ls $BFROOTr/kanga/config/cdb/*cond24boot*
local> cp $BFROOTr/kanga/config/cdb/CdbNameRules-cond24boot-full-20080212T131912.cfg .
local> ln -fs CdbNameRules-cond24boot-full-20080212T131912.cfg CdbNameRules-cond24boot-full.cfg
Repeat if the snapshot changes.
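Picking the latest file out of the ls listing can be sketched as below. latest_rules_file is a hypothetical helper name. It relies on the observation that the time stamps in the file names (YYYYMMDDTHHMMSS) sort correctly as plain text, which is an assumption about the naming scheme based on the examples in this guide.

```shell
# Sketch only: latest_rules_file is a hypothetical helper, not a BaBar tool.
# Usage: latest_rules_file <directory> <shell pattern>
latest_rules_file() {
    dir="$1"
    pattern="$2"
    # $pattern is deliberately unquoted so that the shell expands it.
    for f in "$dir"/$pattern; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | LC_ALL=C sort | tail -n 1
}
```

For example, latest_rules_file $BFROOTr/kanga/config/cdb 'CdbNameRules-cond24boot-full-*.cfg' prints the newest time-stamped file; the pattern requires a dash after "full", so the CdbNameRules-cond24boot-full.cfg link is not matched.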
local> cd $BFROOT/kanga/config
local> mkdir cfgdb
local> cd cfgdb
local> cp $BFROOTr/kanga/config/cdb/CfgDBNameRules.cfg .
Copy the latest configuration file:
local> cd $BFROOT/kanga/config/cfgdb
local> ls $BFROOTr/kanga/config/cfgdb/CfgDBName*
local> cp $BFROOTr/kanga/config/cfgdb/CfgDBNameRules-20080209T194615.cfg .
local> ln -fs CfgDBNameRules-20080209T194615.cfg CfgDBNameRules-latest.cfg
Edit the $BFROOT/kanga/config/KanAccess.cfg file to point to where the conditions files will be imported (called target-directory in the later examples):
If you are using an nfs file system:
read /store/* file <target-directory>
If you are using an xrootd file system:
read /store/* xrootd xrootd-server:1094
You should now be ready to import the conditions database. You can either issue the next two commands or combine them with the KanImportCdbCfgDB command:
local> srtpath 24.3.1a
local> BbkImport --dbsite=SLAC --dbname bbkr24 --noupdate-sql \
       --dataset Nonevent-CDB-cond24boot-full --remote=0 --ftp-type=bbcp \
       --nstreams 10 --window 1M --multiple 10 --reverse \
       bbrdist@bbr-xferlfi.slac.stanford.edu <target-directory>
And to get the configurations db:
local> srtpath 24.3.1a
local> BbkImport --dbsite=SLAC --dbname bbkr24 --noupdate-sql \
       --dataset Nonevent-CfgDB --remote=0 --ftp-type=bbcp \
       --nstreams 10 --window 1M --multiple 10 --reverse \
       bbrdist@bbr-xferlfi.slac.stanford.edu <target-directory>
These two commands can be combined, but you will need at least KanNonEventUtils V00-00-09:
local> srtpath 24.3.1a
local> KanImportCdbCfgDB -t cond24boot -n bbkr24 -p $BFROOT/ \
-i " --remote=0 --ftp-type=bbcp --nstreams 10 --window 1M \
--multiple 10 --reverse --remote-user=bbrdist"
The '-p $BFROOT' option puts the files under $BFROOT/kanga/store.
After the imports, you need to update the database with these new configurations and conditions. To find out which are the currently used files you can do something like:
local> ls -l $BFROOT/kanga/config/cfgdb/CfgDBNameRules-latest.cfg
local> ls -l $BFROOT/kanga/config/cdb/CdbNameRules-cond24boot-full.cfg
These two commands return something like:
CfgDBNameRules-20080321T212457.cfg
CdbNameRules-cond24boot-full-20080319T173024.cfg
To find the names of the latest imported files:
local> srtpath 24.3.1a
local> BbkUser --dataset=Nonevent-CfgDB file_status file \
       --dbname=bbkr24 | grep CfgDB | cut -f 2 -d ' '
local> BbkUser --dataset=Nonevent-CDB-cond24boot-full file_status file \
       --dbname=bbkr24 | grep boot.root | cut -f 2 -d ' '
These two commands return something like:
/store/cfg/2008/03/CfgDB-20080328T123723.root
/store/cdb/cond24boot/full/2008/04/20080407T133030/CDB-20080407T133030-cdb_boot.root
You can then update the database with these files:
local> srtpath 24.3.1a
local> KanUpdateRulesCfgDB /store/cfg/2008/03/CfgDB-20080328T123723.root
local> KanUpdateRulesCDB \
       /store/cdb/cond24boot/full/2008/04/20080407T133030/CDB-20080407T133030-cdb_boot.root
This will update the kanga rules files for you but the exact name of the root files will change with time. In general, to find out what snapshot you will need and where to download it from, see Section 3.
First make sure your connection is working, see Section 5.6.
local> srtpath 24.3.1a
local> klog
local> BbkImport --dbsite=SLAC --dbname bbkr24 --noupdate-sql \
       --dataset BkgTriggers-R24 --remote=0 \
       --ftp-type=bbcp --nstreams 10 --window 1M --multiple 10 --reverse \
       bbrdist@bbr-xferlfi.slac.stanford.edu <target directory>
The target directory is the directory that contains store/SP/BkgTriggers. The '--nstreams ...' options are all bbcp options. Depending on your site setup you might need other options here (in particular, --reverse might or might not work). The options quoted here worked at Louisville. Overall this will import about 10 GB of new background collections as of mid-February 2008. This amount will grow as more background becomes available. Make sure you have enough free disk space.
If you are missing files and the normal methods don't seem to be importing them, you can import them explicitly e.g.
local> BbkImport --dbsite=slac --dbname bbkr24 --noupdate-sql --remote=0 \
       --include "/store/SP/R24/BkgTriggers/BkgTriggers_200707_OnPeak_V01.*" \
       --ftp-type=bbcp --nstreams 10 --window 1M --multiple 10 --reverse \
       bbrdist@bbr-xferlfi.slac.stanford.edu <target directory>
To check the files exist (you will see lots of warnings about missing dictionaries):
local> KanFileCheck /store/cfg/2008/02/CfgDB-20080229T051835.root
/store/cfg/2008/02/CfgDB-20080229T051835.root exists
To check they are not corrupted, get the checksum from the database and then run cksum on the file:
local> BbkUser --dbname=bbkr24 checksum bytes file \
       --file=/store/cfg/2008/02/CfgDB-20080229T051835.root
CHECKSUM BYTES FILE
1511119755 10860715 /store/cfg/2008/02/CfgDB-20080229T051835.root
1 rows returned from bbkr24 at RAL
local> cksum /stage/xrootd-data3/kanga/store/cfg/2008/02/CfgDB-20080229T051835.root
1511119755 10860715
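The comparison of the two numbers can be sketched as below. verify_cksum is a hypothetical helper name; the expected checksum and byte count are the first two columns of the BbkUser output shown above and have to be supplied by hand.

```shell
# Sketch only: verify_cksum is a hypothetical helper, not a BaBar tool.
# Usage: verify_cksum <expected checksum> <expected bytes> <local file>
verify_cksum() {
    expected_sum="$1"
    expected_bytes="$2"
    file="$3"
    # cksum prints "<checksum> <bytes> <file name>" for a named file.
    out=$(cksum "$file") || return 2
    actual_sum=${out%% *}
    rest=${out#* }
    actual_bytes=${rest%% *}
    if [ "$actual_sum" = "$expected_sum" ] && [ "$actual_bytes" = "$expected_bytes" ]; then
        echo "ok: $file"
        return 0
    fi
    echo "MISMATCH: $file has $actual_sum $actual_bytes, expected $expected_sum $expected_bytes" >&2
    return 1
}
```

With the values from the example above, the call would be verify_cksum 1511119755 10860715 /stage/xrootd-data3/kanga/store/cfg/2008/02/CfgDB-20080229T051835.root.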
To check you have the right snapshots issue the following command and compare with the official snapshot given in Section 3.
local> cond24boot
local> CdbRooBrowser views | grep -i default
NAME="MASTER::Run7" ID=0::45 STATUS=NOT-FROZEN,DEFAULT
To find out which background triggers you have available use the following command:
local> cond24boot
local> BbkUser --dbname bbkr24 --dataset 'BkgTriggers*' dse --distinct
/store/SP/R24/BkgTriggers/BkgTriggers_200707_OnPeak_V01
/store/SP/R24/BkgTriggers/BkgTriggers_200712_OnPeak_V01
/store/SP/R24/BkgTriggers/BkgTriggers_200801_OnPeak_V01
/store/SP/R24/BkgTriggers/BkgTriggers_200802_OffPeak_V01
.
14 rows returned from bbkr24 at RAL
You will now need to extend the $BFROOT directory for MC production. Add two new directories:
local> mkdir -p $BFROOT/prod
local> mkdir -p $BFROOT/prod/log/allruns
The first directory will contain the production tools and scripts. The second directory will contain all the run directories and intermediate files. This will require about 20 Gbytes for a 200,000 event run so you may wish to make allruns a link to a disk with sufficient space.
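Turning allruns into a link can be sketched as below. link_allruns is a hypothetical helper name and the large-disk path is whatever your site provides. The function only replaces an existing allruns if it is a link or an empty directory, so existing run directories are never removed.

```shell
# Sketch only: link_allruns is a hypothetical helper, not a BaBar tool.
# Usage: link_allruns <directory on the large disk> <link path>
link_allruns() {
    target="$1"   # e.g. a directory on a disk with enough free space
    link="$2"     # normally $BFROOT/prod/log/allruns
    mkdir -p "$target" || return 1
    if [ -L "$link" ]; then
        rm "$link" || return 1            # replace an old link
    elif [ -d "$link" ]; then
        # rmdir only succeeds on an empty directory, so run data is safe.
        rmdir "$link" 2>/dev/null || {
            echo "$link is a non-empty directory, not replacing it" >&2
            return 1
        }
    elif [ -e "$link" ]; then
        echo "$link exists and is not a directory or link" >&2
        return 1
    fi
    ln -s "$target" "$link"
}
```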
Now you need to install the ProdTools. You can not use the standard BABAR addpkg command as you are not in a standard release directory. You will need to use cvs directly. First, find out the current tag at SLAC and then install it at your site (the tag will be something like ``V00-07-03'', ignore the leading N):
local> cd $BFROOT/prod
local> klog username@slac.stanford.edu
local> cvs checkout -r V00-07-03 ProdTools
To update a previously checked out ProdTools, again check the release at SLAC then update:
local> cd $BFROOT/prod/ProdTools
local> klog username@slac.stanford.edu
local> cvs update -Ad -r <tag>
If this is the first time you have installed ProdTools, you will need to install the module that allows access to the database at SLAC. Use the installdbi script in ProdTools to do this using your production account (you may need to ask your system admin to install a perl bundle).
You will see that there is a directory in ProdTools called ``site'' with further subdirectories for each site doing MC production. If your site does not exist, you should create a directory with the name $BFSITE of your site. Then copy over the files from one of the other site subdirectories. It is easiest to copy a site subdirectory that uses the same batch system as you.
local> mkdir -p $BFROOT/prod/ProdTools/site/$BFSITE
local> cd $BFROOT/prod/ProdTools/site/$BFSITE
local> cp ../cu-boulder/* .
Now you have to edit the files. The main things you need to change are:
Now you need to install the decay files used in MC production. You can not use the standard BABAR addpkg command as you are not in a standard release directory. Instead there is a script in the $BFROOT/prod/ProdTools directory called updateDF that will update/install the ProdDecayFiles directory. Installing and updating the ProdDecayFiles is similar to ProdTools:
local> mkdir -p $BFROOT/prod/packages/ProdDecayFiles
local> cd $BFROOT/prod/packages/ProdDecayFiles
local> klog username@slac.stanford.edu
local> $BFROOT/prod/ProdTools/updateDF <tag>
ProdTools keeps records of everything that is done in production, along with configurations of production runs, and decay modes. To keep up-to-date records of what is done at a remote site, and to pass the configurations of production runs to the local tools, a connection to the SLAC Oracle database must be set up.
The first step is getting the Bundle::DBI perl modules installed. You can do this yourself by logging in as root and using the command:
perl -MCPAN -e 'install Bundle::DBI'
You may have to talk to your system admin to get this done. This will install modules into your perl installation, so you need to be root when this is done.
There have been problems with some sysadmins who do not like CPAN installing arbitrary modules into the perl directories. This command will also upgrade things as needed to get the modules to work, which may interfere with existing perl installations. If necessary, or if you simply prefer, you can build the needed modules from source and install them yourself. The steps differ for each system, so this guide cannot help much with them. The needed modules are DBI, PlRPC, Net::Daemon, and Storable. Make sure that Storable is from the 1.x series of releases, to be compatible with the proxy server at SLAC. If you have root access and are not concerned about how your perl is installed, 'install Bundle::DBI' will do all this for you.
After that, run the installdbi script that comes with ProdTools. This will install a local DbiProxy module to manage the Oracle connection from the remote site. The script will check the state of your installation and tell you what problems you might have. It will also e-mail a message to Douglas Smith when it is successful. Your production computer needs to be registered with the proxy server at SLAC to allow database connections. This e-mail gives Douglas the information he needs about your computer. He will register your computer and reply within a day or so. You should then be set up for Oracle connections.
Assuming the BABAR environment is set up for your account (is $BFROOT defined?), you should create the production directory in the normal way. This will not require a great deal of space.
local> newrel -t 24.3.1a SP24.3.1a
local> cd SP24.3.1a
local> srtpath
local> cond24boot
local> addpkg workdir
local> gmake installdirs
local> gmake workdir.setup
local> cd workdir
You might want to setup some default environment variables if they have not been set:
export BFSITE=ral
export PRODTOOLS=${BFROOT}/prod/ProdTools
export PRODSITE=${PRODTOOLS}/site/${BFSITE}
export ALLRUNS=/stage/xrootd-data7/simuprod/local/allruns
export MERGEDIR=/stage/xrootd-data7/simuprod/local/merge
If you use multiple $ALLRUNS directories (maybe to keep SP8 and SP9 production separate or to keep the validation runs apart from production runs), be careful with setting a default $ALLRUNS in your login scripts.
When an SP job is submitted, your spsub interface to the batch system should transmit all environment variables to the batch job, including $ALLRUNS. But the SP jobs will start a new shell, so if you always set a default in your login scripts, that default will override whatever $ALLRUNS was set for the shell calling spsub. For the Moose part of the SP job that does not matter, since for Moose everything is fixed in the config files. But Job.bash checks the Moose output with spcheck, and with the wrong $ALLRUNS, spcheck will not find the run. The result is that the run is stuck in a 'done' state and never marked as 'good'. Only 'good' runs are picked up by spmerge.
The problem can be fixed by checking in your login scripts if $ALLRUNS is already set before setting it to the default value. Something like this (for csh) :
if (! $?ALLRUNS) then
    setenv ALLRUNS /somedir/allruns
endif
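For sites whose login scripts are written for bash (as in the export examples above), the same guard can be sketched as below. set_default_allruns is a hypothetical helper name and /somedir/allruns is the same placeholder path as in the csh version. Note one small difference: the := expansion also applies the default when ALLRUNS is set but empty, whereas the csh test only checks whether the variable is set.

```shell
# Sketch only: set_default_allruns is a hypothetical helper name.
# Usage: set_default_allruns <default directory>
set_default_allruns() {
    # Assign the default only if ALLRUNS is unset or empty,
    # so a value inherited from the shell that called spsub wins.
    : "${ALLRUNS:=$1}"
    export ALLRUNS
}
```

In a login script this would be followed by a single call such as set_default_allruns /somedir/allruns.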
The run numbers that need to be validated are posted on the web, see Section 3. At the time of writing the validation run range for 24.3.1c was: 9981453-9981497
local> spbuild -j valid --user babarmc 9981454
local> spbuild -j valid --user babarmc 9981453-9981497
Executing the first command will create the output:
local> spbuild -j valid --user babarmc 9981454
9981454(simu): A24.3.1aV01x34F 4000 24.3.1a x34 B0B0bar_JpsiKS_+-.dec
building /stage/xrootd-data7/simuprod/local/allruns/9981454/A24.3.1aV01x34F
The directory will contain the files:
local> cd ./SP24.3.1a
local> srtpath
local> cond24boot
The command to run the job is:
local> spsub 9981453
local> spsub -y 9981453-9981497
spsub does a lot of checking. It then copies over part of the workdir to $ALLRUNS/9981497/A24.3.1aV01x15F (note how the release name and the background rate are incorporated into the directory name); and submits the simulation job. When it finishes, the workdir files are removed leaving a log file (often gzipped). The main files in the new directory are:
While the job is running you can use the spupdate command to interrogate the log files and tell you how things are going. In this example, the spupdate command will attempt to interrogate runs 9981453 to 9981497:
local> spupdate 9981453-9981497
local> spclean 9981453-9981497
The log files and root files from the validation runs have to be checked to see they are the same as the reference set. First set up the files needed to compare with SLAC. Then use the supplied utilities to compare with SLAC for each run.
At your site use the saveDir command to create a tar file of the runs you want to validate. You must use the '-nodb' parameter:
local> saveDir --nodb -t ./ 9981453-9981497
Now transfer this file to SLAC and unpack. You will need to create a directory if this is the first validation you have done i.e.
yakut> mkdir -p $BFROOT/prod/log/validation/intel/$BFSITE
yakut> mkdir -p $BFROOT/prod/log/validation/amd/$BFSITE
yakut> cd $BFROOT/prod/log/validation/intel/ral
yakut> scp -p username@local:~/9981453-9981497.tar .
yakut> tar -xvf 9981453-9981497.tar
yakut> rm 9981453-9981497.tar
Now compare your files against SLAC. In the $BFROOT/prod/log/validation directory you will see a number of utilities for comparing the log and root files. You will see some differences (about 200 normally, nearer 2000 if you accidentally compare an AMD processor with Intel).
yakut> cd $BFROOT/prod/log/validation
yakut> validate.csh <arch> <first run> <last run> slac 01 <site> <version>
For example:
yakut> cd $BFROOT/prod/log/validation/
yakut> ./validate.csh intel 9983524 9983525 slac 01 ral 01
Checking run 9983524 : Number of discrepancies: 269 (4438 histograms compared)
Checking run 9983525 : Number of discrepancies: 268 (4438 histograms compared)
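When many runs are validated at once, the output can be filtered for runs with an unusually high number of discrepancies. The sketch below assumes the output wording shown in the example above; flag_discrepancies is a hypothetical helper name, and the default limit of 1000 is an arbitrary value chosen here to sit between the normal level (about 200) and the level seen for an architecture mismatch (nearer 2000).

```shell
# Sketch only: flag_discrepancies is a hypothetical helper, not a BaBar tool.
# Reads validate.csh output on stdin and prints "<run> <count>" for every
# run whose discrepancy count exceeds the limit (default 1000, an assumption).
flag_discrepancies() {
    limit="${1:-1000}"
    awk -v limit="$limit" '
        {
            # Walk over the words so the line layout does not matter.
            for (i = 1; i <= NF; i++) {
                if ($i == "run" && $(i + 1) ~ /^[0-9]+$/) {
                    run = $(i + 1)
                } else if ($i == "discrepancies:") {
                    n = $(i + 1) + 0
                    if (n > limit) print run, n
                }
            }
        }
    '
}
```

A possible use is ./validate.csh intel 9983524 9983525 slac 01 ral 01 | flag_discrepancies, which prints nothing when every run is below the limit.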
Once you've got the validation runs working, the production run is easy. A quota for your site will be put in the database and you can build the necessary runs by using spbuild. The next few sections show the individual steps which can be followed when initially testing the system. When you are happy the system is working you can use sprite to automate production (see Section 7.9).
Notice that this time we do not have a ``-j'' option as we wish to use the default collection naming scheme for production and export the output.
local> spbuild --cycle SP10 9981453
local> spbuild --cycle SP10 -u douglas --debug -n 100
local> spbuild --cycle SP10
local> spbuild --cycle SP10 --local -u babarmc -n 1
The first command builds one run; the second command builds 100 runs, identifying each run with user douglas and printing a lot of debug info; the third command builds all the runs that have been allocated to you. This command probably isn't very useful if you have a lot of jobs and you would be better off letting sprite handle the production (see Section 7.9).
The fourth version uses ``--local''. When the ``--local'' option is used, the root files are written to a local directory on the batch node and copied over (and checksummed) after the job has finished. This can reduce the load. If the environment variable $SPLOCALDIR is defined in local-simu-setup, it will be used as the temporary area. If $SPLOCALDIR is not defined, the default /tmp area is used. If $SPLOCALDIR is not defined by the user but a suitable area is defined by the batch system at run time, that area can be used by copying its value to $SPLOCALDIR in local-simu-setup, e.g.
# for CCIN2P3
export SPLOCALDIR="$TMPBATCH"
# for RAL - WORKDIR is defined at run time by the batch system
export SPLOCALDIR="$WORKDIR"
# the default will be /tmp if not defined.
The output will look like this:
local> spbuild --cycle SP10 -n 3
There are still 668 runs in the system to build at this location.
13329101(simu): A24.3.1aV01x35F 8000 24.3.1a x35 B+B-_generic.dec
building /stage/xrootd-data7/simuprod/local/allruns/13329101/A24.3.1aV01x35F
13329102(simu): A24.3.1aV01x35F 8000 24.3.1a x35 B+B-_generic.dec
building /stage/xrootd-data7/simuprod/local/allruns/13329102/A24.3.1aV01x35F
13329103(simu): A24.3.1aV01x35F 8000 24.3.1a x35 B+B-_generic.dec
building /stage/xrootd-data7/simuprod/local/allruns/13329103/A24.3.1aV01x35F
If you try to build an already existing run you will get an error:
local> spbuild 13329096
ERROR: job A24.3.1aV01x35F for 13329096 already exists
If you run the spcheck command you can see the state of the runs:
local> spcheck --status
Run      Status of runs by procspec
---------------------------------
13329096 simu : A24.3.1aV01x35F - built
13329097 simu : A24.3.1aV01x35F - built
13329098 simu : A24.3.1aV01x35F - built
13329099 simu : A24.3.1aV01x35F - built
13329100 simu : A24.3.1aV01x35F - built
13329101 simu : A24.3.1aV01x35F - built
13329102 simu : A24.3.1aV01x35F - built
13329103 simu : A24.3.1aV01x35F - built
You can use the spbuild command as a useful way to find out how many runs have been allocated to you.
local> spbuild --cycle SP10 -n 1
There are still 722 runs in the system to build at this location.
13389416(simu): A24.3.1aV01x34F 8000 24.3.1a x34 B0B0bar_generic.dec
building /stage/xrootd-data7/simuprod/local/allruns/13389416/A24.3.1aV01x34F
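If only the number of remaining runs is wanted, it can be extracted from that output. The sketch below assumes the message wording shown above; runs_remaining is a hypothetical helper name.

```shell
# Sketch only: runs_remaining is a hypothetical helper, not a BaBar tool.
# Reads spbuild output on stdin and prints the number of runs still to build.
runs_remaining() {
    sed -n 's/^.*There are still \([0-9][0-9]*\) runs in the system.*$/\1/p' | head -n 1
}
```

It would be used as spbuild --cycle SP10 -n 1 | runs_remaining. Bear in mind that, as in the example above, this spbuild call still builds one run as a side effect.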
Now submit all the jobs. You can of course just submit a subset at a time by modifying the run range at the end of the spsub command:
local> spsub 13329096
local> spsub -y 13329097-13329103
local> spsub 13329096
Submit run 13329096? [Yes/No/All/None] y
Run 13329096 submitted as job 8052930
To see what is happening in the batch queue now:
local> spjobs --summary
Run JobID Type User Stat Queue Host Exec Per. LastMod
----------------------------------------------------------------------------
13329096 8052930 simu babarmc RUN sl4p csflnx353 0% 0.20 min
spjobs does not show the merge jobs. To see what is happening based on the log files:
local> spcheck --status
Run      Status of runs by procspec
---------------------------------
13329096 simu : A24.3.1aV01x35F - run - lcg0306 - 0.1% - 0.25 mins
13329097 simu : A24.3.1aV01x35F - built
The run directory will contain the following files while running a job:
local> ls $ALLRUNS/13329096/A24.3.1aV01x35F
13329096.moose.01.root bin DECAY.DEC pdt.table RooLogon.C imu13329096.log
13329096.moose.02E.root config.sh GNUmakefile RELEASE config.tcl
seed-overuse.txt workdir.files B+B-_generic.dec PARENT RooAlias.C shlib
spcheck, as well as parsing the log file, will change the status of production of each run as appropriate so that the next stage can start.
To submit all the run range without being prompted, use the '-y' option:
local> spsub -y 13329097-13329103
Run 13329097 submitted as job 8053009
Run 13329098 submitted as job 8053010
Run 13329099 submitted as job 8053011
Run 13329100 submitted as job 8053012
Run 13329101 submitted as job 8053013
Run 13329102 submitted as job 8053014
Run 13329103 submitted as job 8053015
local> spcheck --status
Run      Status of runs by procspec
--------------------------------------------------
13329096 simu : A24.3.1aV01x35F - run - lcg0306 - 0.5% - 0.23 mins
13329097 simu : A24.3.1aV01x35F - run - lcg0342 - 0.0% - 0.98 mins
13329098 simu : A24.3.1aV01x35F - run - lcg0295 - 0.0% - 0.98 mins
13329099 simu : A24.3.1aV01x35F - submit - 0.32 min
13329100 simu : A24.3.1aV01x35F - submit - 0.30 min
13329101 simu : A24.3.1aV01x35F - submit - 0.30 min
13329102 simu : A24.3.1aV01x35F - submit - 0.30 min
13329103 simu : A24.3.1aV01x35F - submit - 0.28 min
local> spjobs --summary
Run      JobID  Type User    Stat Queue Host      Exec      Per. LastMod
------------------------------------------------------------------------
13329096 805293 simu babarmc RUN  sl4p  csflnx353 lcg0664   1%   0.55 min
13329097 805300 simu babarmc RUN  sl4p  csflnx353 lcg0664   0%   0.05 min
13329098 805301 simu babarmc RUN  sl4p  csflnx353 lcg0664   0%   0.02 min
13329099 805301 simu babarmc RUN  sl4p  csflnx353 lcg0664   0%   0.05 min
13329100 805301 simu babarmc RUN  sl4p  csflnx353 csflnx389 0%   0.07 min
13329101 805301 simu babarmc RUN  sl4p  csflnx353 lcg0655   0%   0.05 min
13329102 805301 simu babarmc RUN  sl4p  csflnx353 csflnx390 0%   0.38 min
13329103 805301 simu babarmc RUN  sl4p  csflnx353 csflnx390 0%   0.38 min
Summary of runs: 8 in system, 8 running and 0 pending.
Types in system: 8 simu, 0 mixr, and 0 reco.
Now go away and have a very large coffee (and dinner, and a nap). And hope nothing goes wrong!
While runs are in progress there are a couple of tools for checking on their state: spjobs and spcheck.
spjobs will query the batch system, find production jobs, and produce a nicely formatted output on the status of these jobs. It will look for all runs which are in the batch system, submitted by the username with which you are logged in. To check on all jobs in the batch system just do:
local> spjobs --summary
Run      JobID  Type User    Stat Queue Host      Exec      Per. LastMod
------------------------------------------------------------------------
13329096 805293 simu babarmc RUN  sl4p  csflnx353           3%   0.43 min
13329097 805300 simu babarmc RUN  sl4p  csflnx353           2%   0.18 min
13329098 805301 simu babarmc RUN  sl4p  csflnx353           2%   0.35 min
13329099 805301 simu babarmc RUN  sl4p  csflnx353           2%   0.27 min
13329100 805301 simu babarmc RUN  sl4p  csflnx353           2%   0.35 min
13329101 805301 simu babarmc RUN  sl4p  csflnx353           2%   0.22 min
13329102 805301 simu babarmc RUN  sl4p  csflnx353           2%   0.10 min
13329103 805301 simu babarmc RUN  sl4p  csflnx353           2%   0.43 min
Summary of runs: 8 in system, 8 running and 0 pending.
Types in system: 8 simu, 0 mixr, and 0 reco.
This script needs to be able to query the batch system. The output shows which computer each job is running on, whether the job is running or just pending in the system, the percentage of events done in the running job, and how long ago the log file was last modified. If a log file has not been touched for over 30 minutes, the run will be marked as hanging.
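The 30-minute rule can be illustrated with a short Python sketch. This is not the spjobs implementation, only an illustration of the idea: the function name `is_hanging`, the threshold constant, and the temporary log file used in the demonstration are all assumptions made for this example.

```python
import os
import tempfile
import time

# Assumed threshold, taken from the 30 minutes quoted in the text.
HANG_THRESHOLD_SECONDS = 30 * 60


def is_hanging(log_path, now=None, threshold=HANG_THRESHOLD_SECONDS):
    """Return True if the log file was last modified more than `threshold` seconds ago."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(log_path)) > threshold


if __name__ == "__main__":
    # Hypothetical log file, created only for this demonstration.
    with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as handle:
        handle.write(b"event 1 processed\n")
        demo_log = handle.name
    try:
        print("fresh log hanging?", is_hanging(demo_log))
        # Pretend the log was last touched 40 minutes ago.
        forty_minutes_ago = time.time() - 40 * 60
        os.utime(demo_log, (forty_minutes_ago, forty_minutes_ago))
        print("stale log hanging?", is_hanging(demo_log))
    finally:
        os.remove(demo_log)
```

The check relies only on the modification time of the log file, so a job that is alive but writes nothing for a long stretch would also be flagged by such a rule.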
A range of run numbers can be given to this command to list only the desired runs. Options can also filter on the state of runs, to see only runs which are running, pending, exited, done, or hanging. You can also filter on computer name, using partial strings to view runs on a set of computers. Check the man pages and help message for a full list of options. A particularly useful option is '-T', which tails the log file of a running job. This option takes a number, which is the number of lines to print from the log file.
spcheck does not query the batch system, so this will work anywhere. This script will go through the files in the $ALLRUNS directory looking for the status of the runs. The default is to list the status of all procspecs built in all runs in the $ALLRUNS directory. It will report back on a procspec as built, submitted, running, done, good, or failed. spcheck will also update the status of the production for each run so that the next stage can begin.
To use the script type:
local> spcheck --status
Run Status of runs by procspec
-------------------------------------------------------------------
9979558 : finished - 4000 events, 5.96 s/ev
9979559 : finished - 20 events, 20.10 s/ev
9981453 simu : A24.3.1cV01x34F - built
9981454 simu : A24.3.1cV01x34F - built
13305635 : finished - 8000 events, 4.28 s/ev
13305685 : finished - 8000 events, 8.11 s/ev
13305686 : finished - 8000 events, 7.34 s/ev
13305687 : finished - 8000 events, 6.38 s/ev
13305688 simu : A24.2.0cV06x75F - done - 12.33 hrs
13305689 : finished - 8000 events, 7.68 s/ev
13305690 : finished - 8000 events, 7.76 s/ev
13305691 : finished - 8000 events, 7.69 s/ev
13305692 : finished - 8000 events, 7.33 s/ev
13305693 : finished - 8000 events, 5.40 s/ev
13305694 : finished - 8000 events, 6.10 s/ev
13329096 simu : A24.3.1aV01x35F - run - lcg0306 - 2.9% - 0.43 mins
13329097 simu : A24.3.1aV01x35F - run - lcg0342 - 2.1% - 0.35 mins
The output can be a little confusing since there can be different procspecs for each run number. However, there are options to filter the display on jobs that are running, built, submitted, done, good, or failed. There is also an option to check the collection to see how many events were written compared to the number of events requested.
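For a long listing, it can help to tally how many entries are in each state. The following Python sketch does this for text saved from spcheck. It assumes that the line layout matches the sample output shown in this guide; `parse_spcheck_line` and `summarise_spcheck` are hypothetical helpers written for this example and are not part of ProdTools, whose own filter options remain the supported way to narrow the listing.

```python
from collections import Counter

# Lines copied from the sample spcheck output in this guide.
SAMPLE = """\
Run Status of runs by procspec
-------------------------------------------------------------------
9979558 : finished - 4000 events, 5.96 s/ev
9981453 simu : A24.3.1cV01x34F - built
13305688 simu : A24.2.0cV06x75F - done - 12.33 hrs
13329096 simu : A24.3.1aV01x35F - run - lcg0306 - 2.9% - 0.43 mins
13329097 simu : A24.3.1aV01x35F - run - lcg0342 - 2.1% - 0.35 mins
"""


def parse_spcheck_line(line):
    """Return (run, stage, procspec, state), or None for headers and rules.

    Assumed layout: "<run> [<stage>] : [<procspec> - ]<state>[ - details]".
    """
    head, sep, tail = line.partition(":")
    tokens = head.split()
    if not sep or not tokens or not tokens[0].isdigit():
        return None
    fields = [field.strip() for field in tail.split(" - ")]
    if len(tokens) > 1:
        # A stage name is present, so the first field is the procspec.
        if len(fields) < 2:
            return None
        return int(tokens[0]), tokens[1], fields[0], fields[1]
    if not fields[0]:
        return None
    return int(tokens[0]), None, None, fields[0]


def summarise_spcheck(text):
    """Count how many listed entries are in each state."""
    counts = Counter()
    for line in text.splitlines():
        parsed = parse_spcheck_line(line)
        if parsed is not None:
            counts[parsed[3]] += 1
    return counts


if __name__ == "__main__":
    for state, count in sorted(summarise_spcheck(SAMPLE).items()):
        print(state, count)
```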
This script can also tail the log files for the procspecs listed, using '-T'. This option requires a number, which is the number of lines to print from the log file. Doing this for all the procspecs built can be very lengthy, so there are a number of options to filter the procspecs which are displayed.
Check the man pages for this script (in "ProdTools/doc/man") for a complete listing of options, and the help message.
If you run a job and it crashes you can re-run once you have solved the problem. This assumes that the problem is not repeatable. In other words, if the problem is inherent in the code, simply rerunning the job will not cure it. If a job crashes, it will be marked as failed in the database of completed runs and there will be an incomplete collection in the database. To get round this, you will have to make a second version of the simu stage. This is done with the '-V' option in the ProdTools scripts.
local> spbuild -V02 385257
local> spsub -y -V02 385257
This will build a second version, and submit it.
If your jobs have been run by sprite and the problem is not a one-off, you will see many versions of the job marked as failed. If you have reached sprite's limit, the runs will be marked as abandoned and sprite will no longer attempt to resubmit them. You have two choices.
local> spbuild --set recover 13329296-13329396
The maximum number of retries is twice the value of ``failuresBeforeAbandon'' set in sprite_rc. Therefore, if ``failuresBeforeAbandon=3'' then the runs will be marked as abandoned by sprite if V03 fails; after ``--set recover'' has been run, the runs will be marked as abandoned if V06 fails. After that, ``--set recover'' has no effect unless you increase ``failuresBeforeAbandon''.
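The arithmetic behind these limits can be written out as a small Python sketch. It only restates the rule given in the text (the limit doubles once after a recover); the function name `abandon_versions` is an assumption made for this example and does not correspond to anything in sprite.

```python
def abandon_versions(failures_before_abandon):
    """Return the version at which runs are abandoned, before and after a recover.

    Follows the rule stated in the text: the first limit is
    failuresBeforeAbandon, and the limit after a recover is twice that value.
    """
    if failures_before_abandon < 1:
        raise ValueError("failuresBeforeAbandon must be at least 1")
    first_limit = failures_before_abandon
    limit_after_recover = 2 * failures_before_abandon
    return "V%02d" % first_limit, "V%02d" % limit_after_recover


if __name__ == "__main__":
    # failuresBeforeAbandon=3 is the worked example used in the text.
    print(abandon_versions(3))
```

With the value 4 from the example sprite_rc file, the same rule would give V04 and V08.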
You use the spupdate command to update the SLAC database with the status of the runs. In this example, the spupdate command will attempt to interrogate runs 13329096 to 13329103:
local> spupdate 13329096-13329103
RUN      PROCSPEC        HOST    CPUSEC NEVTS STATUS
13329096 A24.3.1aV01x35F lcg0306      0   775 RUNNING Wed Feb 20 22:41:59 2008
13329097 A24.3.1aV01x35F lcg0342      0   701 RUNNING Wed Feb 20 22:41:51 2008
13329098 A24.3.1aV01x35F lcg0295      0   679 RUNNING Wed Feb 20 22:41:51 2008
13329099 A24.3.1aV01x35F lcg0295      0   696 RUNNING Wed Feb 20 22:41:56 2008
13329100 A24.3.1aV01x35F lcg0296      0   691 RUNNING Wed Feb 20 22:41:48 2008
13329101 A24.3.1aV01x35F lcg0296      0   678 RUNNING Wed Feb 20 22:42:08 2008
13329102 A24.3.1aV01x35F lcg0298      0   697 RUNNING Wed Feb 20 22:41:56 2008
13329103 A24.3.1aV01x35F lcg0298      0   692 RUNNING Wed Feb 20 22:42:06 2008
The CPUSEC will only be updated when the run has finished.
local> spupdate --debug 13329096
Does not have local database system.
Has dbiproxy module, using BaBar::DbiProxy.
Connecting to the database, nodb:0, dbiproxy:1...
Using proxy db...
Established connection to database.
Setting nodb option: nodb:0
Getting param for run: 13329096 - A24.3.1aV01x35F, simu, 8000 events.
RUN PROCSPEC HOST CPUSEC NEVTS STATUS
Wed Feb 20 21:36:50 2008
Procspec for job is: A24.3.1aV01x35F
Checking status file.
Wed Feb 20 21:36:50 2008
lcg0306,0,292,,0,V00-08-62,24.3.1a,,, , ,,
Intel(R) Xeon(TM) CPU 2.66GHz,2668,
13329096 lcg0306 0 292 RUNNING Wed Feb 20 21:36:21 2008
The line beginning ``lcg0306'' gives, in order: the batch machine, the batch length (0 until the job ends), the number of events done so far (292), the time the job finished (null until the end of the run), whether the job is marked as good (1) or not marked (0), the tag release of ProdDecayFiles (V00-08-62), the release number (24.3.1a), the number of generated events, three fields that are not filled, and ScanCol (?). The next line contains the CPU type (Intel(R) Xeon(TM) CPU 2.66GHz), the CPU speed in MHz (2668), and the CPU percent (filled at the end of the run).
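To make the field order concrete, here is a minimal Python sketch that splits the ``lcg0306'' line from the debug output into named values. Only the first seven fields are decoded, since those are the ones the description above identifies unambiguously. The dictionary keys and the function name `parse_status_record` are invented for this example and are not names used by spupdate.

```python
# Hypothetical names for the first seven comma-separated fields,
# in the order given by the description in the text.
STATUS_FIELDS = (
    "batch_machine",
    "batch_length",
    "events_done",
    "time_finished",
    "good_flag",
    "decay_files_tag",
    "release",
)


def parse_status_record(line):
    """Decode the leading fields of a status line into a dictionary."""
    values = [value.strip() for value in line.split(",")]
    if len(values) < len(STATUS_FIELDS):
        raise ValueError("status line has too few fields: %r" % line)
    record = dict(zip(STATUS_FIELDS, values))
    # Convert the fields whose meaning is clear from the description.
    record["events_done"] = int(record["events_done"]) if record["events_done"] else 0
    record["time_finished"] = record["time_finished"] or None
    record["marked_good"] = record.pop("good_flag") == "1"
    return record


if __name__ == "__main__":
    # Status line copied from the spupdate --debug output in this guide.
    sample = "lcg0306,0,292,,0,V00-08-62,24.3.1a,,, , ,,"
    for key, value in parse_status_record(sample).items():
        print(key, "=", value)
```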
spupdate involves a lot of database access so should not be run very often.
If you installed mysql after the release, spmerge will not work unless you have done a 'gmake ldlink'. To test whether spmerge will work, run KanCollUtil from the command line and see if there is a missing library:
local> KanCollUtil -h
KanCollUtil: error while loading shared libraries: libmysqlclient.so.14:
local> spmerge -v -u babarmc
Couldn't open merge log file /stage/xrootd-data7/simuprod/local/ \
merge/001235/200707/24.2.0c/SP_001235_000132/merge.log(.gz).
This has in fact started a merge job (look in the batch queue); the error just means that spmerge couldn't confirm that an already existing merge attempt was successful, so it will try a new merge. If spmerge says it has tried too many times, this can be overridden with the '-ignore' option.
The merged jobs are created under the $ALLRUNS/merge directory. When the merge is successful, sparchive is run in the background to clean up the run directories. The run directories are made into tar files. What you do with them is up to you. sparchive needs a connection to the SLAC database. spmerge will also update the SLAC database and run directories when the merge is successful.
local> spcheck --status
Run Status of runs by procspec
----------------------------------------------------------------
9981453 simu : A24.3.1cV01x34F - built
9981454 simu : A24.3.1cV01x34F - built
13305635 simu : A24.2.0cV06x75F - merging - into SP_001235_000703
13305685 simu : A24.2.0cV06x75F - merging - into SP_001005_000690
13305688 simu : A24.2.0cV06x75F - done - 12.33 hrs
13305693 simu : A24.2.0cV07x75F - merging - into SP_001005_000690
13305694 simu : A24.2.0cV07x75F - merging - into SP_001005_000690
13329096 simu : A24.3.1aV01x35F - run - lcg0306 - 27.8% - 0.38 mins
13329097 simu : A24.3.1aV01x35F - run - lcg0342 - 26.6% - 0.20 mins
The next time you run spmerge it will change the status to merged for successfully merged jobs. You can only see the status of the merged runs if you use the '--merged' option or do not use the '--status' option with spcheck:
local> spmerge -v -u babarmc
local> spcheck --status --merged
Run Status of runs by procspec
---------------------------------------------------------------
13305688 : merged - into SP_001005_000690
local> spexport -v -u babarmc
spexport scans the merge directory $RUNDIR/merge for successfully merged runs and exports them to SLAC. When finished the database at SLAC is updated. spexport starts the transfers in the background:
local> ps -aux | grep transfer
babarmc 28694 0.0 0.1 8236 4452 ? S 05:29 0:00 /usr/local/bin/perl -w /afs/rl.ac.uk/bfactory/prod/ProdTools/transferUtil --debug --wait 2 -group SP10 -coll /store/SP/R24/001235/200707/24.2.0c/SP_001235_000132 -dir /stage/xrootd-data7/simuprod/local/merge/001235/200707/24.2.0c
The status of the transfers is written to $ALLRUNS/transfer.log
If you rerun spexport, there is a delay of 1 hour after an export has failed before it is marked as failed. Until then it will be marked as running in spexport. If the transfer starts but for some reason does not indicate it has failed, spexport will mark it as failed after 8 hours. spexport currently does not export log files (why?).
spmerge will run sparchive to archive the completed runs if the merge has been successful (the merged runs don't have to have been exported). sparchive will also automatically archive the directories of abandoned runs after 3 weeks. The contents of the run directories are appended to a tar file in the $ALLRUNS directory and then deleted. Each tar file will be up to 300 Mbytes in size.
All the above commands can be combined and controlled using the sprite daemon. sprite is controlled by the spcontrol command. It is a daemon that takes care of all the individual tasks outlined above. The settings are controlled by a file sprite_rc in the site directory. At first time start up, sprite will create a run control file for you, with default settings, if one does not exist.
If your site directory is in afs, this can cause problems, as the token can expire and sprite will then no longer be able to write to sprite_rc. If AfsTokenWatch is set, sprite will check for the expiration of the afs token and will stop 2 hours before expiration. Alternatively, you can specify a different sprite_rc when you start spcontrol (see below).
sprite can get particularly confusing if you log into different machines from time to time either on purpose or due to some DNS-name aliasing. Make sure you don't start multiple sprite sessions. It's always good to start a new session by issuing the ``status'' command (see below).
To start sprite:
local> spcontrol start
This will launch sprite as a background process on that machine. sprite will not interfere with ongoing runs or production. You can continue to submit runs if you wish; sprite will see that they are running and ignore them. sprite will just check on runs, do what is needed, and use the available resources.
To stop sprite:
local> spcontrol stop
sprite will stop after between 30 seconds and 6 minutes.
To use a different configuration file use the '-c' option. However, you will need to use the same option for all subsequent calls to spcontrol or it will get confused. e.g.
local> spcontrol -c ./new_sprite_rc start
For all the available options see the man pages for sprite (Section 7.9).
Here is an example of a sprite_rc file:
#Run control for the simu production run control daemon
#runcontrol choices: stop, suspend, running
runcontrol=stop
#run range for sprite control
runrange=
#runcyle to work in
cycle=SP10
# spritename (not really needed these days)
spritename=
#maximum jobs to submit at one time
maxSubmitPerCycle=50
#maximum number of running jobs in system
maxJobRunning=100
#time to wait between submissions in seconds
submitWait=10
#number of failures before a run is abandoned to manual
failuresBeforeAbandon=4
#turn off spmonitor
monitor=off
#turn on OpenPBS interaction
spjobs=yes
#set user
produser=babarmc
#check this disk and don't submit runs if not enough disk space
diskwatch=/stage/xrootd-data7
# minimum amount of space (GB)
minavailspace=5
# hours to wait between merges
mergeWait=4
# hours to wait between exports
exportWait=4
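If you want to inspect such a file from a script of your own, the key=value layout with '#' comment lines is simple to read. The Python sketch below assumes exactly the layout of the example file (one setting per line, comments on lines of their own); how sprite itself reads the file is not described in this guide, and `parse_sprite_rc` is a hypothetical helper, not part of ProdTools.

```python
# A few lines taken from the example sprite_rc file in this guide.
EXAMPLE_RC = """\
#runcontrol choices: stop, suspend, running
runcontrol=stop
#run range for sprite control
runrange=
cycle=SP10
#maximum jobs to submit at one time
maxSubmitPerCycle=50
#maximum number of running jobs in system
maxJobRunning=100
#number of failures before a run is abandoned to manual
failuresBeforeAbandon=4
diskwatch=/stage/xrootd-data7
# minimum amount of space (GB)
minavailspace=5
"""


def parse_sprite_rc(text):
    """Return the settings in a sprite_rc-style text as a dict of strings."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key=value line; ignored in this sketch
        settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    settings = parse_sprite_rc(EXAMPLE_RC)
    print("runcontrol:", settings["runcontrol"])
    print("maxJobRunning:", int(settings["maxJobRunning"]))
    print("runrange set:", bool(settings["runrange"]))
```

All values are kept as strings, so numeric settings such as maxJobRunning have to be converted by the caller, as the demonstration does.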
To find out what sprite is doing, issue the spcontrol command with the ``status'' option and the same --conf option you used when you started sprite (otherwise you may pick up a different sprite_rc file). In this example, the spritename is RALSP10:
local> spcontrol --conf ./sprite_rc status
Status: 4 checks think sprite with spritename RALSP10 is running:
0) sptools.log: sprite RALSP10 last started: Fri Feb 22 19:08:50 2008 lcgui0361
1) the .sprite_runningRALSP10 file IS in place.
2) The sprite_rc file says sprite IS running.
3) sprite modified its log 20.1 hrs ago: >10 mins implies sprite NOT running.
4) sprite IS running on this machine (lcgui0361) with pid 25765.
There are 5 tests:
There are man pages for the main ProdTools commands and they can be accessed with the man command. The man pages' source files for all the ProdTools commands are in the $BFROOT/prod/ProdTools/doc/man directory. To make the man pages into a ps file or to put the man pages where they can be accessed with the man command, see the README in $BFROOT/prod/ProdTools/doc/man.
There are 3 sources of information used by SP commands and jobs:
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
The command line arguments were:
latex2html -noaddress -split 0 -t 'BaBar SP10 MC Production Guide' userguide.tex
The translation was initiated by Fergus Wilson on 2008-06-29