Workbook for BaBar Offline Users - Batch Processing
Getting large jobs done more quickly while taking the load off the
interactive machines.
You already used the batch system for several tasks during the Quicktour section of this
Workbook. We will now discuss it in more detail.
Quick link: Job crash/Exit codes page
When to Use Batch versus Interactive
Interactive processing is intended for those activities that truly
require user interaction, e.g. debugging, or are very short. Other
jobs should be submitted to batch. This includes but is not limited to
CPU intensive jobs.
Building BaBar code (gmake lib, bin or all) should not be done
interactively. The special bldrecoq queue is the place to run
a build.
Submitting and Monitoring the Job
How the Different Queues Work
Use the bqueues command to get a list of all the queues
available. To get detailed information on a specific queue, type
> bqueues -l (queuename)
Here is a summary of the relevant information at SLAC:
| Queue Name | Description | Users |
| bldrecoq | For BaBar builds | all users |
| kanga | For CM2-Kanga analysis jobs. All jobs accessing CM2-Kanga (Mini) data MUST be run in the kanga queue, apart from very long jobs | all users |
| express | For very short jobs (maximum 4 minutes) | all users |
| short | Maximum 15 minutes SLAC CPU time | all users |
| control | Queue for batch control jobs | all users |
| medium | Maximum 90 minutes SLAC CPU time | all users |
| long | Maximum 6 hours SLAC CPU time | all users |
| xlong | Maximum 24 hours SLAC CPU time | all users |
| idle | Jobs scheduled only if the machine is lightly loaded | all users |
| bfobjy | For analysis jobs accessing physboot, data12boot, simuboot, and all Objectivity federations. All jobs accessing Objectivity federations MUST be run in the bfobjy queue. | all users |
Choose the queue for your job carefully. Please note that
bldrecoq is for gmake builds only.
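For jobs that are not tied to a special-purpose queue, the time limits in the table can be turned into a small lookup. The following is only a sketch: pick_queue is a hypothetical helper, not an LSF or BaBar command, it takes a whole number of SLAC CPU minutes, and it assumes the express limit of 4 minutes can be compared in the same way as the others. It deliberately ignores bldrecoq, kanga, bfobjy, control and idle, which are chosen by job type rather than by length.

```shell
# pick_queue MINUTES   (hypothetical helper, not part of LSF)
# Print the shortest general-purpose queue whose limit, as listed in the
# table above, covers an estimate given in whole SLAC CPU minutes.
pick_queue() {
    minutes=$1
    if   [ "$minutes" -le 4 ];    then echo express
    elif [ "$minutes" -le 15 ];   then echo short
    elif [ "$minutes" -le 90 ];   then echo medium
    elif [ "$minutes" -le 360 ];  then echo long     # 6 hours
    elif [ "$minutes" -le 1440 ]; then echo xlong    # 24 hours
    else
        echo "no queue in the table allows $minutes SLAC minutes" >&2
        return 1
    fi
}
```

For instance, pick_queue 100 prints long, because 100 minutes exceeds the 90-minute medium limit.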
The table above lists CPU time limits for some of the queues.
These limits are expressed in "SLAC units" rather than in raw CPU time.
Each machine has its own CPU normalization factor (CPUF) that relates
SLAC time to CPU time:
SLAC time = CPU time * CPUF
See below for an example of how to convert from the
CPU time in your log file to SLAC time.
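The relation can also be read in the other direction, to estimate how much real CPU time a queue limit allows on a particular host: CPU time = SLAC time / CPUF. The sketch below assumes a hypothetical helper name, cpu_seconds_allowed, and uses awk for the floating-point arithmetic. The CPUF of 3.36 is the value this page quotes for bldlnx06; other hosts will differ.

```shell
# cpu_seconds_allowed SLAC_MINUTES CPUF   (hypothetical helper)
# Invert "SLAC time = CPU time * CPUF" to get the real CPU seconds that a
# queue limit in SLAC minutes corresponds to on a host with the given CPUF.
cpu_seconds_allowed() {
    awk -v m="$1" -v f="$2" 'BEGIN { printf "%.1f\n", m * 60 / f }'
}

# The 15-minute "short" limit on a host with CPUF 3.36:
cpu_seconds_allowed 15 3.36   # prints 267.9
```

In other words, on such a host a job in the short queue would have roughly four and a half minutes of actual CPU time before reaching the limit.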
The bsub Command
Use the bsub command to submit jobs. See the LSF manual for details.
bsub [ -q queuename ] [ -J jobname ] [ -o outfile ] command
| -q queuename | Submit the job to one of the queues specified by queuename |
| -J jobname | Assign the character string specified by jobname to the job |
| -o outfile | Store the standard output of the job in the file outfile. If the file already exists, the job output is appended to it. If the -o option is NOT specified, the job output is sent by mail to the submitter. |
The system should respond with something like:
Job <IdNumber> is submitted to queue <queuename>
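If you submit jobs from a script, it can be handy to capture the job ID from this response so that it can be passed to other batch commands later. The following is a minimal sketch: job_id_from_bsub is a hypothetical helper, and it assumes the response has the form shown above and arrives on standard output.

```shell
# job_id_from_bsub   (hypothetical helper)
# Read bsub's response on stdin and print the numeric job ID, if one is found.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted to .*$/\1/p'
}

# Example with a canned response instead of a real submission:
echo 'Job <403100> is submitted to queue <kanga>.' | job_id_from_bsub   # prints 403100
```

In real use the helper would sit at the end of a pipeline, for example id=$(bsub -q kanga -o snippet.log BetaMiniApp snippet.tcl | job_id_from_bsub). If the response does not match the expected form, the helper prints nothing, so a script should check that the result is non-empty.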
Monitoring the Job
You can check progress of a SLAC batch job by using the command
bjobs [ -J jobname ] [ -u username ] [ -q queue ] [ -l ] [ IdNumber ]
This will display the status and other information about the batch jobs
that are specified by the options. If no options are specified, the
default is to display information about all of your unfinished (that
is, pending, running and suspended) jobs.
- -l returns detailed information ("long" listing)
- -u username returns a listing of jobs owned by username
- -u all will list all jobs belonging to all users
- -q queue lists all jobs in a specific queue
By default, bjobs will list all the jobs that
you have running on any queue.
When you issue the bjobs command, the system should
respond with something like:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
87082 perl RUN bldrecoq shire01 palomino10 *aUser.bin Dec 20 17:01
The status (STAT) of a job can be one of:
| PEND | The job has not yet been started; it is pending |
| PSUSP | The job has been suspended, either by its owner or by the LSF administrator, while pending |
| RUN | The job is currently running |
| USUSP | The job has been suspended, either by its owner or by the LSF administrator, while running |
| SSUSP | The job has been suspended by LSF due to the conditions of the queue |
| DONE | The job has terminated successfully (exit status 0) |
| EXIT | The job has been either killed or aborted due to an error in its execution |
| UNKWN | Contact has been lost with the job |
| ZOMBI | The job is a zombie: it was killed while its execution host was unreachable, so LSF could not confirm that it ended |
Your jobs are done when bjobs responds:
No unfinished job found
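When many jobs are in flight, a per-status count is often easier to read than the full listing. The sketch below is one way to summarise it, assuming the column layout shown above, with STAT in the third column and a single header line. count_by_status is a hypothetical helper, not an LSF command.

```shell
# count_by_status   (hypothetical helper)
# Read bjobs output on stdin, skip the header line, and print one
# "STAT count" line per status found in the third column, sorted by name.
count_by_status() {
    awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Typical use would be bjobs | count_by_status, which might print lines such as "PEND 4" and "RUN 12". When there are no unfinished jobs, the helper prints nothing.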
Deleting a Job
You can kill a batch job by using the command
bkill [ -q queuename ] [ -J jobname ] [ IdNumber ]
where queuename, jobname, and IdNumber are obtained using the
bjobs command.
For example to kill a job that was submitted with:
bsub -q kanga -J MyJob -o myJob.log BetaMiniApp snippet.tcl
first check the job's status with
bjobs
or
bjobs -J MyJob
and then kill it with
bkill <IdNumber>
or
bkill -J MyJob
The system response should be
Job <IdNumber> is being terminated
Issuing the command:
bkill 0
will kill all jobs you have on all queues, whether they are currently
running or pending.
Getting the Output
You can peek at the output of a job in progress by using the command
bpeek [ -q queuename ] [ -J jobname ] [ -f ] [ IdNumber ]
You can leave out the job number in the above command if you only have
one job running.
If bpeek gives the following message:
<< output from stdout >>
select: protocol failure in circuit setup.
it just means the job hasn't gone far enough yet to allow a peek. Wait
a few minutes and try bpeek again.
The -f option keeps listing the last lines of the output file until
you type ^C (like the Unix command "tail -f").
The Main Queues for Analysis Users
- bldrecoq is for gmake builds (gmake all, lib, bin) only.
- kanga is for jobs reading CM2/Mini/kanga files. All jobs
accessing kanga data MUST be run in the kanga queue, unless they
require longer CPU times, in which case they should be run in the xlong queue.
Running Beta
You already saw how to compile and link BetaMiniUser in batch:
bsub -q bldrecoq -o all.log gmake all
But you ran your job interactively:
BetaMiniApp snippet.tcl
In the Quicktour, MyMiniAnalysis.tcl was modified a bit so that
the job would run interactively, and you would get a framework prompt.
You cannot submit this type of job to the batch system, because it
requires user input. But once you put the "ev begin" and "exit"
commands back in MyMiniAnalysis.tcl, as explained near the end
of the Framework Continued
section, the job will run without stopping.
Once you have a job that runs without stopping in the middle for
user input, you can submit it to the kanga queue:
bsub -q kanga -o snippet.log BetaMiniApp snippet.tcl
The output will be written to the file snippet.log.
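Before submitting, it can be worth checking that the tcl file really will run without stopping for input. The following sketch is a rough heuristic only: batch_ready is a hypothetical helper that looks for an "ev begin" line and an "exit" line outside of tcl comments. It does not follow files that are sourced from the one it is given, so a pass is not a guarantee.

```shell
# batch_ready TCLFILE   (hypothetical helper, heuristic only)
# Succeed if the file has both an "ev begin" line and an "exit" line,
# ignoring lines that start with "#" (tcl comments).
batch_ready() {
    awk '
        /^[ \t]*#/               { next }           # skip comment lines
        /^[ \t]*ev[ \t]+begin/   { has_begin = 1 }
        /^[ \t]*exit([ \t;]|$)/  { has_exit = 1 }
        END { exit !(has_begin && has_exit) }       # status 0 only if both seen
    ' "$1"
}
```

A guarded submission could then look like: batch_ready snippet.tcl && bsub -q kanga -o snippet.log BetaMiniApp snippet.tcl.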
As mentioned above, the CPU time in the table above is in "SLAC units".
Each machine has its own CPU normalization factor (CPUF) that relates
SLAC time to CPU time:
SLAC time = CPU time * CPUF
The CPUF can be obtained with the command bhosts -l <host name>.
So for example issuing
> bhosts -l bldlnx06
you will find (among other things) that the CPUF for bldlnx06 is 3.36.
As an example, we will find out how much CPU time it took you to
compile and link with gmake all. Your all.log file tells you how much
CPU time you used. All batch job log files end with a message like:
------------------------------------------------------------
Sender: LSF System <lsf@bldlnx06>
Subject: Job 403100: <gmake all> Done
Job <gmake all> was submitted from host <yakut06> by user <penguin>.
Job was executed on host(s) <bldlnx06>, in queue <bldrecoq>, as user <penguin>.
</u/br/penguin> was used as the home directory.
</u/br/penguin/ana30> was used as the working directory.
Started at Sat Jan 21 09:59:33 2006
Results reported at Sat Jan 21 10:01:11 2006
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
gmake all
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 64.99 sec.
Max Memory : 218 MB
Max Swap : 259 MB
Max Processes : 14
Max Threads : 14
The output (if any) is above this job summary.
You can see from this message that the job was run on bldlnx06,
with a CPU time of 64.99 seconds. (Of course, your compile and link
time was probably a bit different from mine.) And you just saw that the
CPUF for bldlnx06 is 3.36. So the SLAC time for the above job was:
SLAC time = 64.99 s * 3.36 = 218.366 s
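The same conversion can be scripted. The sketch below assumes the log ends with a job summary in the format shown above, containing a line of the form "CPU time : N sec.", and that you have already looked up the CPUF of the execution host with bhosts -l. slac_time_from_log is a hypothetical helper name.

```shell
# slac_time_from_log LOGFILE CPUF   (hypothetical helper)
# Take the last "CPU time : N sec." line in the log and print N * CPUF,
# i.e. the CPU usage of the job in SLAC seconds.
slac_time_from_log() {
    awk -v f="$2" '
        /CPU time[ \t]*:/ { cpu = $(NF - 1) }   # the number before "sec."
        END {
            if (cpu == "") exit 1               # no job summary found
            printf "%.2f\n", cpu * f
        }
    ' "$1"
}
```

For the log above, slac_time_from_log all.log 3.36 prints 218.37, which is the result of the hand calculation rounded to two decimal places.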
When a batch job fails, then instead of
"Successfully completed" your log file will probably
contain a line like this:
Exited with exit code 137
Exit codes are assigned by the LSF batch system.
Sometimes they are helpful, and sometimes they are just confusing.
To learn more about exit codes and how to decode them, check
out the following webpage:
Exit codes page
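As a first step before consulting that page, one widely used convention is worth knowing: an exit code above 128 usually means the job was terminated by a signal, and the signal number is the exit code minus 128. The sketch below applies only that convention. describe_exit is a hypothetical helper, and its interpretation is a hint rather than a diagnosis.

```shell
# describe_exit CODE   (hypothetical helper)
# Apply the common convention that codes above 128 mean "128 + signal number".
describe_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo "exit code $code: probably terminated by signal $((code - 128))"
    else
        echo "exit code $code: returned by the job itself"
    fi
}

describe_exit 137   # prints: exit code 137: probably terminated by signal 9
```

Signal 9 is SIGKILL on Linux, so exit code 137 suggests the job was killed from outside, for example by bkill or by the batch system enforcing a limit, rather than failing on its own.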
"batch system daemon not responding"
Sometimes when you enter a batch command, you will get the message:
batch system daemon not responding ... still trying
batch system daemon not responding ... still trying
...
batch system daemon not responding ... still trying
This just means that there is a temporary problem with the
batch system. The easiest way to deal with it is just to
type <control c> to get the message to stop,
and try again later.
A note about different machines
Some commands and programs - for example the debuggers
- work only on certain machines. The commands to set up
environment variables also vary from system to system. (This is one
reason that the use of environment variables to set parameters is
discouraged - use Framework configuration variables instead.) This
occasionally (but not often) causes problems when submitting jobs to
batch queues.
Author:
Massimiliano Turri
Contributors:
Dominique Boutigny,
Joseph Perl,
Jenny Williams
Last modification: 20 January 2006
Last significant update: 13 June 2005