Batch Processing



Quick link: Job crash/Exit codes page

You already used the batch system for several tasks in the Quicktour section of this Workbook. This section discusses the batch system in more detail.


Batch guide in a nutshell

It is almost always best to submit your jobs to the batch system, especially CPU-intensive jobs. The batch system makes sure that computing resources are shared fairly among users.

To submit the job "command" to the batch queue, use the bsub command:

bsub [ -q queuename ] [ -o outfile ] command

(See below for details and options.)

The only jobs that do not need to be submitted to the batch system are jobs that are very short, or that truly require user interaction, such as debugging.


The Main Queues for Analysis Users

  • bldrecoq is for gmake builds (gmake all, lib, bin) only.
  • kanga is for running analysis applications (like BetaMiniApp). If your jobs require longer CPU times, you can run them on the long or xlong queue instead.

Available Queues

Use the bqueues command to get a list of all available queues. To get detailed information on a specific queue, type:
   > bqueues -l (queuename)
Here is a summary of the relevant information at SLAC:

Queue Name   Description                                                  Users
bldrecoq     For BaBar builds                                             all users
kanga        For CM2-Kanga analysis jobs. All jobs accessing CM2-Kanga    all users
             (Mini) data MUST be run in the kanga queue, apart from
             very long jobs.
express      For very short jobs (maximum 4 minutes)                      all users
short        Maximum 15 minutes SLAC CPU time                             all users
control      Queue for batch control jobs                                 all users
medium       Maximum 90 minutes SLAC CPU time                             all users
long         Maximum 6 hours SLAC CPU time                                all users
xlong        Maximum 24 hours SLAC CPU time                               all users
idle         Jobs scheduled only if the machine is lightly loaded         all users
bfobjy       For analysis jobs accessing physboot, data12boot, simuboot,  all users
             and all Objectivity federations. All jobs accessing
             Objectivity federations MUST be run in the bfobjy queue.

Choose the queue for your job carefully. Please note that bldrecoq is for gmake builds only.

The table above lists CPU time limits for some of the queues. These limits are given in "SLAC units", not in raw CPU seconds. Each machine has its own CPU normalization factor (CPUF) that relates SLAC time to CPU time:

SLAC time = CPU time * CPUF
See below for an example of how to convert from the CPU time in your log file to SLAC time.
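
As a rough illustration of how the formula and the queue limits fit together, the following Python sketch converts a raw CPU time to SLAC time and finds the shortest time-limited queue that would accommodate it. The function names are hypothetical, the limits are copied from the table above, and the sketch deliberately ignores the rules for kanga, bfobjy, and bldrecoq, which depend on what the job does rather than on how long it runs.

```python
# CPU limits of the time-limited queues, in seconds of SLAC time
# (values taken from the queue table above).
QUEUE_LIMITS_S = [
    ("express", 4 * 60),
    ("short", 15 * 60),
    ("medium", 90 * 60),
    ("long", 6 * 3600),
    ("xlong", 24 * 3600),
]


def slac_time(cpu_seconds, cpuf):
    """SLAC time = CPU time * CPUF."""
    return cpu_seconds * cpuf


def smallest_time_limited_queue(slac_seconds):
    """Return the first queue whose limit covers slac_seconds, or None."""
    for name, limit in QUEUE_LIMITS_S:
        if slac_seconds <= limit:
            return name
    return None  # longer than the xlong limit


if __name__ == "__main__":
    # Hypothetical job: 1000 s of CPU on a host with CPUF 2.0.
    t = slac_time(1000.0, 2.0)
    print(t, smallest_time_limited_queue(t))
```

In practice it is sensible to leave some margin below the limit, since the CPUF of the host that actually runs the job may differ from the one used for the estimate.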

Submitting and Monitoring the Job

The bsub Command

Use the bsub command to submit jobs. See the LSF manual for details.

bsub [ -q queuename ] [ -J jobname ] [ -o outfile ] command
 
-q queuename  Submit the job to the queue specified by queuename.
-J jobname    Assign the character string specified by jobname to the job.
-o outfile    Store the standard output of the job in the file outfile. If the file already exists, the job output is appended to it. If the -o option is NOT specified, the job output is sent by mail to the submitter.
The system should respond with something like:

   Job <IdNumber> is submitted to queue <queuename>
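
If you submit jobs from a script, it can help to assemble the bsub command line in one place. The sketch below is one possible way to do that in Python; build_bsub_command is a hypothetical helper, and it uses only the three options described above (-q, -J, -o). It builds the argument list without running anything, so it can be tried on a machine without LSF.

```python
import shlex


def build_bsub_command(command, queue=None, jobname=None, outfile=None):
    """Return the argument list for a bsub submission.

    command is the job's own command line, given as a single string.
    """
    argv = ["bsub"]
    if queue:
        argv += ["-q", queue]
    if jobname:
        argv += ["-J", jobname]
    if outfile:
        argv += ["-o", outfile]
    # Split the job command into words the way a POSIX shell would.
    argv += shlex.split(command)
    return argv


if __name__ == "__main__":
    argv = build_bsub_command(
        "BetaMiniApp snippet.tcl",
        queue="kanga",
        jobname="MyJob",
        outfile="myJob.log",
    )
    # Show the command line that would be submitted.
    print(" ".join(shlex.quote(a) for a in argv))
    # On a machine with LSF, one could then submit it with
    # subprocess.run(argv, check=True).
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when file names contain spaces.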

Monitoring the Job

You can check progress of a SLAC batch job by using the command

bjobs [ -J jobname ] [ -u username ] [ -q queue ] [ -l ] [ IdNumber ]

This displays the status and other information about the batch jobs selected by the options. If no options are specified, the default is to display information about all your unfinished (that is, pending, running, and suspended) jobs on any queue.

  • -l returns detailed information ("long" listing)
  • -u username returns a listing of jobs owned by username
  • -u all lists jobs belonging to all users
  • -q queue lists jobs in a specific queue

When you issue the bjobs command, the system should respond with something like:

   JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME

   87082 perl     RUN   bldrecoq   shire01     palomino10  *aUser.bin Dec 20 17:01
The status (STAT) of a job can be one of:
PEND   The job has not yet been started; it is pending
PSUSP  The job has been suspended, either by its owner or by the LSF administrator, while pending
RUN    The job is currently running
USUSP  The job has been suspended, either by its owner or by the LSF administrator, while running
SSUSP  The job has been suspended by LSF due to the conditions of the queue
DONE   The job has terminated normally
EXIT   The job has been either killed or aborted due to an error in its execution
UNKWN  Contact has been lost with the job
ZOMBI  The job is a zombie (typically, it was killed while contact with its execution host was lost)

Your jobs are done when bjobs responds:

   No unfinished job found
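
When checking on many jobs from a script, the default bjobs listing can be turned into records. The following Python sketch is one way to do it, under the assumption that every row is aligned with the header line as in the sample above; it slices each row at the column positions of the header titles, so an empty field (for example a pending job with no EXEC_HOST yet) does not shift the later columns. parse_bjobs is a hypothetical name, and very long field values that overflow their column would break this simple approach.

```python
import re

# Sample listing copied from the example above.
SAMPLE = (
    "JOBID USER     STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME\n"
    "87082 perl     RUN   bldrecoq   shire01     palomino10  *aUser.bin Dec 20 17:01\n"
)


def parse_bjobs(text):
    """Parse default bjobs output into a list of dicts keyed by column title."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].lstrip().startswith("JOBID"):
        return []  # for example "No unfinished job found"
    # Record the start offset of every column title in the header line.
    cols = [(m.group(), m.start()) for m in re.finditer(r"\S+", lines[0])]
    jobs = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            # A column runs up to the start of the next title (or end of line).
            end = cols[i + 1][1] if i + 1 < len(cols) else None
            row[name] = line[start:end].strip()
        jobs.append(row)
    return jobs


if __name__ == "__main__":
    for job in parse_bjobs(SAMPLE):
        print(job["JOBID"], job["STAT"], job["QUEUE"])
```

On a machine with LSF, the text could come from subprocess.run(["bjobs"], capture_output=True, text=True).stdout instead of the sample string.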

Deleting a Job

You can kill a batch job by using the command:
bkill [ -q queuename ] [ -J jobname ] [ IdNumber ]

where queuename, jobname, and IdNumber are obtained using the bjobs command.

For example, to kill a job that was submitted with:

   bsub -q kanga -J MyJob -o myJob.log BetaMiniApp snippet.tcl
first check the job's status with
   bjobs
or
   bjobs -J MyJob
and then kill it with
   bkill <IdNumber>
or
   bkill -J MyJob
The system response should be
    Job <IdNumber> is being terminated
Issuing the command:
  bkill 0
will kill all jobs you have on all queues, whether they are currently running or pending.

Getting the Output

You can peek at the output of a job in progress by using the command

bpeek [ -q queuename ] [ -J jobname ] [ -f ][ IdNumber ]

You can leave out the job number in the above command if you only have one job running.

If bpeek gives the following message:

   << output from stdout >>

   select: protocol failure in circuit setup.
it just means the job hasn't gone far enough yet to allow a peek. Wait a few minutes and try bpeek again.

The -f option keeps listing the last lines of the output file until you type ^C (like the Unix command "tail -f").


Log files and CPU time

As mentioned above, the CPU time limits in the queue table are given in "SLAC units". Each machine has its own CPU normalization factor (CPUF) that relates SLAC time to CPU time:
SLAC time = CPU time * CPUF

The CPUF can be obtained with the command bhosts -l <host name>.

For example, issuing

  > bhosts -l bldlnx06
shows (among other things) that the CPUF for bldlnx06 is 3.36.

As an example, we will find out how much CPU time it took you to compile and link with gmake all. Your all.log file tells you how much CPU time you used. All batch job log files end with a message like:

------------------------------------------------------------
Sender: LSF System <lsf@bldlnx06>
Subject: Job 563645: <gmake all> Done

Job <gmake all> was submitted from host <yakut06> by user <penguin>.
Job was executed on host(s) <bldlnx06>, in queue <bldrecoq>, as user <penguin>.
</u/br/penguin> was used as the home directory.
</u/br/penguin/ana41> was used as the working directory.
Started at Fri Apr 20 21:13:34 2007
Results reported at Fri Apr 20 21:23:12 2007

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
gmake all
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :    254.36 sec.
    Max Memory :       592 MB
    Max Swap   :       637 MB

    Max Processes  :        14
    Max Threads    :        14

The output (if any) is above this job summary.

You can see from this message that the job was run on bldlnx06, with a CPU time of 254.36 seconds. (Of course, your compile and link time was probably a bit different from mine.) And you just saw that the CPUF for bldlnx06 is 3.36. So the SLAC time for the above job was:

SLAC time = 254.36 s * 3.36 = 854.65 s
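
The same conversion can be scripted. The Python sketch below pulls the execution host and the CPU time out of an LSF job summary like the one shown above and applies SLAC time = CPU time * CPUF. parse_lsf_summary is a hypothetical helper, the regular expressions assume the summary wording shown in this example log, and the CPUF still has to be looked up separately with bhosts -l.

```python
import re

# Excerpt of the job summary shown above.
LOG_EXCERPT = """\
Job was executed on host(s) <bldlnx06>, in queue <bldrecoq>, as user <penguin>.
Resource usage summary:

    CPU time   :    254.36 sec.
    Max Memory :       592 MB
"""


def parse_lsf_summary(log_text):
    """Return (host, cpu_seconds) found in an LSF job summary."""
    host = re.search(r"Job was executed on host\(s\) <([^>]+)>", log_text)
    cpu = re.search(r"CPU time\s*:\s*([\d.]+)\s*sec", log_text)
    if host is None or cpu is None:
        raise ValueError("no LSF job summary found in log text")
    return host.group(1), float(cpu.group(1))


if __name__ == "__main__":
    host, cpu_seconds = parse_lsf_summary(LOG_EXCERPT)
    cpuf = 3.36  # CPUF of bldlnx06, as reported by bhosts -l in the text
    print(f"{host}: {cpu_seconds} s CPU -> {cpu_seconds * cpuf:.2f} s SLAC time")
```

For a real log file, read its contents with open("all.log").read() and pass that string to parse_lsf_summary.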

Failed jobs and exit codes

When a batch job fails, your log file will probably contain a line like this instead of "Successfully completed":
Exited with exit code 137
Exit codes are assigned by the LSF batch system. Sometimes they are helpful, and sometimes they are just confusing. To learn more about exit codes and how to decode them, check out the following webpage:

Exit codes page
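
As a first guess before consulting that page, it is often useful to know the common Unix convention that an exit code above 128 means the process was terminated by a signal, with signal number = exit code - 128. The Python sketch below applies that convention; describe_exit_code is a hypothetical helper, it assumes a Unix-like system for the signal names, and it treats the convention as a heuristic only. The exit codes page remains the reference for BaBar-specific meanings.

```python
import signal


def describe_exit_code(code):
    """Give a first-guess reading of a batch job exit code."""
    if code == 0:
        return "success"
    if code > 128:
        # Unix convention: 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"application exited with status {code}"


if __name__ == "__main__":
    for code in (0, 1, 137):
        print(code, "->", describe_exit_code(code))
```

Under this reading, the exit code 137 in the example above corresponds to signal 9, which would be consistent with the job having been killed, for instance with bkill or for exceeding a queue limit.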

"batch system daemon not responding"

Sometimes when you enter a batch command, you will get the message:

batch system daemon not responding ... still trying
batch system daemon not responding ... still trying
...
batch system daemon not responding ... still trying

This just means that there is a temporary problem with the batch system. The easiest way to deal with it is to type <control c> to stop the messages, and try again later.



Page maintained by Adam Edwards

Last modified: January 2008