
Workbook for BaBar Offline Users - Batch Processing

Getting large jobs done more quickly while taking the load off the interactive machines.

You have already used the batch system for several tasks during the Quicktour section of this Workbook. We will now discuss the batch system in more detail.


Quick link: Job crash/Exit codes page



When to Use Batch versus Interactive

Interactive processing is intended for activities that truly require user interaction (e.g. debugging) or that are very short. Other jobs should be submitted to batch. This includes, but is not limited to, CPU-intensive jobs.

Building BaBar code (gmake lib, gmake bin, or gmake all) should not be done interactively. The special bldrecoq queue is the place to run a build.
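For example, a build can be submitted as follows (this is the same command used again in the Running Beta section below):

    bsub -q bldrecoq -o all.log gmake all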


Submitting and Monitoring the Job

How the Different Queues Work

Use the bqueues command to get a list of all the queues available. To get detailed information on a specific queue, type
   > bqueues -l <queuename>
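For example,
   > bqueues -l kanga
shows the detailed parameters of the kanga queue listed in the table below.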
Here is a summary of the relevant information at SLAC:

   Queue Name   Description                                                          Users
   ----------   ------------------------------------------------------------------  ---------
   bldrecoq     For BaBar builds                                                     all users
   kanga        For CM2-Kanga analysis jobs. All jobs accessing CM2-Kanga (Mini)
                data MUST be run in the kanga queue, apart from very long jobs.      all users
   express      For very short jobs (maximum 4 minutes)                              all users
   short        Maximum 15 minutes SLAC CPU time                                     all users
   control      Queue for batch control jobs                                         all users
   medium       Maximum 90 minutes SLAC CPU time                                     all users
   long         Maximum 6 hours SLAC CPU time                                        all users
   xlong        Maximum 24 hours SLAC CPU time                                       all users
   idle         Jobs scheduled only if the machine is lightly loaded                 all users
   bfobjy       For analysis jobs accessing physboot, data12boot, simuboot, and
                all other Objectivity federations. All jobs accessing Objectivity
                federations MUST be run in the bfobjy queue.                         all users

Choose the queue for your job carefully. Please note that bldrecoq is for gmake builds only.

The above table lists CPU time limits for some of the queues. Note that these limits are in "SLAC units": each machine has its own CPU normalization factor (CPUF) that relates SLAC time to CPU time:

SLAC time = CPU time * CPUF
See below for an example of how to convert from the CPU time in your log file to SLAC time.

The bsub Command

Use the bsub command to submit jobs. See the LSF manual for details.

bsub [ -q queuename ] [ -J jobname ] [ -o outfile ] command
 
-q queuename  Submit the job to the queue specified by queuename.
-J jobname  Assign the character string specified by jobname to the job.
-o outfile  Store the standard output of the job in the file outfile. If the file already exists, the job output is appended to it. If the -o option is NOT specified, the job output is sent by mail to the submitter.
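For example, the following submits a job to the medium queue, names it myjob, and writes its output to myjob.log (the executable and input file names here are just placeholders):

    bsub -q medium -J myjob -o myjob.log myExecutable myInput.tcl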
The system should respond with something like:

   Job <IdNumber> is submitted to queue <queuename>

Monitoring the Job

You can check progress of a SLAC batch job by using the command

bjobs [ -J jobname ] [ -u username ] [ -q queue ] [ -l ] [ IdNumber ]

This displays the status and other information about the batch jobs specified by the options. If no options are specified, the default is to display information about all your unfinished jobs (that is, pending, running and suspended).

  • -l returns detailed information ("long" listing)
  • -u username returns a listing of jobs owned by username
  • -u all lists all jobs belonging to all users
  • -q queue lists all jobs running in a specific queue
By default, bjobs will list all the jobs that you have on any queue; a few example invocations are shown below.
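For example (using the job ID and job name from the examples in this section):

   bjobs -l 87082           detailed ("long") listing for job 87082
   bjobs -J MyJob           status of the job named MyJob
   bjobs -u all -q kanga    all jobs from all users in the kanga queue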

When you issue the bjobs command, the system should respond with something like:

   JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME

   87082 perl     RUN   bldrecoq   shire01     palomino10  *aUser.bin Dec 20 17:01
The status (STAT) of a job can be one of:
   PEND    The job has not yet been started; it is pending
   PSUSP   The job has been suspended, either by its owner or by the LSF administrator, while pending
   RUN     The job is currently running
   USUSP   The job has been suspended, either by its owner or by the LSF administrator, while running
   SSUSP   The job has been suspended by LSF due to the conditions of the queue
   DONE    The job has terminated normally
   EXIT    The job has been either killed or aborted due to an error in its execution
   UNKWN   Contact has been lost with the job
   ZOMBI   The job is a zombie

Your jobs are done when bjobs responds:

   No unfinished job found

Deleting a Job

You can kill a batch job by using the command
bkill [ -q queuename ] [ -J jobname ] [ IdNumber ]

where queuename, jobname, and IdNumber can be obtained using the bjobs command.

For example to kill a job that was submitted with:

   bsub -q kanga -J MyJob -o myJob.log BetaMiniApp snippet.tcl
first check the job's status with
   bjobs
or
   bjobs -J MyJob
and then kill it with
   bkill <IdNumber>
or
   bkill -J MyJob
The system response should be
    Job <IdNumber> is being terminated
Issuing the command:
  bkill 0
will kill all jobs you have on all queues, whether they are currently running or pending.

Getting the Output

You can peek at the output of a job in progress by using the command

bpeek [ -q queuename ] [ -J jobname ] [ -f ] [ IdNumber ]

You can leave out the job number in the above command if you only have one job running.

If bpeek gives the following message:

   << output from stdout >>

   select: protocol failure in circuit setup.
it just means the job hasn't gone far enough yet to allow a peek. Wait a few minutes and try bpeek again.

The -f option will keep listing the last lines of the output file until you type ^C (like the Unix command "tail -f").
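For example, to keep watching the output of the job named MyJob from the bkill example above:

   bpeek -f -J MyJob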


The Main Queues for Analysis Users

  • bldrecoq is for gmake builds (gmake all, lib, bin) only.
  • kanga is for jobs reading CM2/Mini/Kanga files. All jobs accessing Kanga data MUST be run in the kanga queue, unless they require longer CPU times, in which case they should be run in the xlong queue.

Running Beta

You already saw how to compile and link BetaMiniUser in batch:
    bsub -q bldrecoq -o all.log gmake all
But you ran your job interactively:
    BetaMiniApp snippet.tcl

In the Quicktour, MyMiniAnalysis.tcl was modified a bit so that the job would run interactively, and you would get a framework prompt. You cannot submit this type of job to the batch system, because it requires user input. But once you put the "ev begin" and "exit" commands back in MyMiniAnalysis.tcl, as explained near the end of the Framework Continued section, the job will run without stopping.

Once you have a job that runs without stopping in the middle for user input, you can submit it to the kanga queue:

    bsub -q kanga -o snippet.log BetaMiniApp snippet.tcl
The output will be written to the file snippet.log.
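Once the job is submitted, you can follow it with the monitoring commands described above, for example:

    bjobs -q kanga
    bpeek <IdNumber>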

Log files and CPU time

As mentioned above, the CPU time limits in the queue table are in "SLAC units". Each machine has its own CPU normalization factor (CPUF) that relates SLAC time to CPU time:
SLAC time = CPU time * CPUF

The CPUF can be obtained with the command bhosts -l <host name>.

For example, if you issue

  > bhosts -l bldlnx06
you will find (among other things) that the CPUF for bldlnx06 is 3.36.

As an example, we will work out the SLAC time used to compile and link with gmake all. Your all.log file tells you how much CPU time you used; all batch job log files end with a message like:

------------------------------------------------------------
Sender: LSF System <lsf@bldlnx06>
Subject: Job 403100: <gmake all> Done

Job <gmake all> was submitted from host <yakut06> by user <penguin>.
Job was executed on host(s) <bldlnx06>, in queue <bldrecoq>, as user <penguin>.
</u/br/penguin> was used as the home directory.
</u/br/penguin/ana30> was used as the working directory.
Started at Sat Jan 21 09:59:33 2006
Results reported at Sat Jan 21 10:01:11 2006

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
gmake all
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :     64.99 sec.
    Max Memory :       218 MB
    Max Swap   :       259 MB

    Max Processes  :        14
    Max Threads    :        14

The output (if any) is above this job summary.

You can see from this message that the job was run on bldlnx06, with a CPU time of 64.99 seconds. (Of course, your compile and link time was probably a bit different from mine.) And you just saw that the CPUF for bldlnx06 is 3.36. So the SLAC time for the above job was:

SLAC time = 64.99 s * 3.36 = 218.4 s

Failed jobs and exit codes

When a batch job fails, instead of "Successfully completed" your log file will probably contain a line like this:
Exited with exit code 137
Exit codes are assigned by the LSF batch system. Sometimes they are helpful, and sometimes they are just confusing. To learn more about exit codes and how to decode them, check out the following webpage:

Exit codes page

"batch system daemon not responding"

Sometimes when you enter a batch command, you will get the message:

batch system daemon not responding ... still trying
batch system daemon not responding ... still trying
...
batch system daemon not responding ... still trying

This just means that there is a temporary problem with the batch system. The easiest way to deal with it is to type Ctrl-C to stop the message, and try again later.

A note about different machines

Some commands and programs - for example the debuggers - work only on certain machines. The commands to set up environment variables also vary from system to system. (This is one reason that the use of environment variables to set parameters is discouraged - use Framework configuration variables instead.) This occasionally (but not often) causes problems when submitting jobs to batch queues.



Author: Massimiliano Turri
Contributors: Dominique Boutigny, Joseph Perl, Jenny Williams

Last modification: 20 January 2006
Last significant update: 13 June 2005