Batch Processing
Quick link: Job crash/Exit codes page
You already used the batch system for several tasks during the Quicktour section of this
Workbook. We will now discuss the batch system in more detail.
Batch guide in a nutshell
In general, it is almost always best to submit your jobs
to the batch system. This is especially important for CPU-intensive jobs.
The batch system makes sure that computing resources are shared fairly between
different users.
To submit the job "command" to the batch queue, use the bsub command:
bsub [ -q queuename ] [ -o outfile ] command
(See below for details and options.)
The only jobs that do not need to be submitted to the batch system
are jobs that are very short, or that truly require user interaction,
such as debugging.
The Main Queues for Analysis Users
- bldrecoq is for gmake builds (gmake all, lib, bin) only.
- kanga is for running analysis applications (like BetaMiniApp).
If your jobs require longer CPU times, you can run them on
the long or xlong queue.
Available Queues
Use the bqueues command to get a list of all the queues
available. To get detailed information on a specific queue, type
> bqueues -l (queuename)
Here is a summary of the relevant information at SLAC:
| Queue Name | Description | Users |
| bldrecoq | For BaBar builds | all users |
| kanga | For CM2-Kanga analysis jobs. All jobs accessing CM2-Kanga (Mini) data MUST be run in the kanga queue, apart from very long jobs | all users |
| express | For very short jobs (Maximum 4 minutes) | all users |
| short | Maximum 15 minutes SLAC CPU time | all users |
| control | Queue for batch control jobs | all users |
| medium | Maximum 90 minutes SLAC CPU time | all users |
| long | Maximum 6 hours SLAC CPU time | all users |
| xlong | Maximum 24 hours SLAC CPU time | all users |
| idle | Jobs scheduled only if the machine is lightly loaded | all users |
| bfobjy | For analysis jobs accessing physboot, data12boot, simuboot, and all Objectivity federations. All jobs accessing Objectivity federations MUST be run in the bfobjy queue. | all users |
Choose the queue for your job carefully. Please note that
bldrecoq is for gmake builds only.
The above table lists CPU time limits for some of the queues.
But the CPU time in the table above is in "SLAC units".
Each machine has its own CPU normalization factor (CPUF) that relates
SLAC time to CPU time:
SLAC time = CPU time * CPUF
See below for an example of how to convert from the
CPU time in your log file to SLAC time.
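As a rough sketch of how these limits might be applied, the shell function below converts a CPU time and a CPUF into SLAC minutes and prints the shortest general-purpose queue whose limit covers the job. The name pick_queue is hypothetical (it is not an LSF command), a Bourne-style shell with awk is assumed, and the 4-minute express limit is assumed to be in SLAC units like the others. The special-purpose queues (bldrecoq, kanga, bfobjy, control, idle) are left out because they are chosen by job type, not by job length.

```shell
# pick_queue CPU_SECONDS CPUF
# Hypothetical helper: prints the shortest general-purpose queue whose
# SLAC CPU time limit (taken from the table above) covers the job.
pick_queue() {
    awk -v cpu="$1" -v cpuf="$2" 'BEGIN {
        slac_min = cpu * cpuf / 60          # SLAC time = CPU time * CPUF
        if      (slac_min <= 4)    print "express"  # assumed to be SLAC minutes
        else if (slac_min <= 15)   print "short"
        else if (slac_min <= 90)   print "medium"
        else if (slac_min <= 360)  print "long"     # 6 hours
        else if (slac_min <= 1440) print "xlong"    # 24 hours
        else                       print "none"     # exceeds every limit
    }'
}
```

For instance, pick_queue 3600 3.36 treats one CPU hour on a machine with CPUF 3.36 as about 202 SLAC minutes, which fits the long queue.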
Use the bsub command to submit jobs. See the LSF manual for details.
bsub [ -q queuename ] [ -J jobname ] [ -o outfile ] command
| -q queuename | Submit the job to one of the queues specified by queuename |
| -J jobname | Assign the character string specified by jobname to the job |
| -o outfile | Store the standard output of the job to the file outfile. If the file already exists, the job output is appended to it. If the -o option is NOT specified, the job output is sent by mail to the submitter. |
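When you have several jobs to submit, the three options above can be filled in from a loop. The following is a minimal sketch, assuming a Bourne-style shell. The function name submit_all, the DRY_RUN variable, and the convention of naming each job and log file after its tcl file are illustrative choices, not part of LSF or of the BaBar software.

```shell
# submit_all FILE.tcl...
# Hypothetical helper: submits one BetaMiniApp job per tcl file to the
# kanga queue, naming each job and its log file after the tcl file.
# Set DRY_RUN=1 to print the bsub commands instead of running them.
submit_all() {
    for tcl in "$@"; do
        name=$(basename "$tcl" .tcl)
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set.
        ${DRY_RUN:+echo} bsub -q kanga -J "$name" -o "$name.log" BetaMiniApp "$tcl"
    done
}
```

Running it once with DRY_RUN=1 is a cheap way to check the queue, job names, and log file names before anything is actually submitted.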
The system should respond with something like:
Job <IdNumber> is submitted to queue <queuename>
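If you want to use the job number later in a script, it can be pulled out of this response. The sketch below assumes the response has the form shown above. The function name job_id_from_response is hypothetical.

```shell
# job_id_from_response
# Reads bsub's response on stdin and prints the numeric job ID, assuming
# the "Job <IdNumber> is submitted to queue <queuename>" format.
job_id_from_response() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted to .*/\1/p'
}

# Example use (not run here):
#   id=$(bsub -q kanga -o myJob.log BetaMiniApp snippet.tcl | job_id_from_response)
#   bjobs "$id"
```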
You can check progress of a SLAC batch job by using the command
bjobs [ -J jobname ] [ -u username ] [ -q queue ] [ -l ] [ IdNumber ]
This will display the status and other information about the batch jobs
that are specified by the options. If no options are specified, then the
default is to display information about all of your unfinished (that
is, pending, running and suspended) jobs.
- -l returns detailed information ("long" listing)
- -u username returns a listing of jobs owned by username
- -u all will list all jobs belonging to all users.
- -q queue lists jobs in a specific queue
By default, bjobs will list all the jobs that
you have running on any queue.
When you issue the bjobs command, the system should
respond with something like:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
87082 perl RUN bldrecoq shire01 palomino10 *aUser.bin Dec 20 17:01
The status (STAT) of a job can either be:
| PEND | The job has not yet been started; it is pending |
| PSUSP | The job has been suspended, either by its owner or by the LSF administrator, while pending |
| RUN | The job is currently running |
| USUSP | The job has been suspended, either by its owner or by the LSF administrator, while running |
| SSUSP | The job has been suspended by LSF due to the conditions of the queue |
| DONE | The job has terminated successfully |
| EXIT | The job has been either killed or aborted due to an error in its execution |
| UNKWN | Contact has been lost with the job |
| ZOMBI | The job is a zombie |
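With many jobs in the system, a per-status tally can be easier to read than the full listing. The following sketch counts jobs by status, assuming STAT is the third column of the bjobs listing as in the sample above. The function name count_by_status is hypothetical.

```shell
# count_by_status
# Reads bjobs output on stdin and prints one "STATUS COUNT" line per
# status. Only lines that start with a numeric job ID are counted, so
# the header line is skipped.
count_by_status() {
    awk '$1 ~ /^[0-9]/ { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example use (not run here):
#   bjobs | count_by_status
```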
Your jobs are done when bjobs responds:
No unfinished job found
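A script that has to wait for your jobs can poll for this message. This is a minimal sketch, assuming a Bourne-style shell. The function name wait_for_jobs and the 60-second default interval are illustrative choices. Both stdout and stderr are searched, because no assumption is made here about which stream bjobs uses for the message.

```shell
# wait_for_jobs [SECONDS]
# Hypothetical helper: polls bjobs every SECONDS seconds (default 60)
# until it reports that you have no unfinished jobs.
wait_for_jobs() {
    interval=${1:-60}
    # Give up at once if bjobs is not available, to avoid looping forever.
    command -v bjobs >/dev/null 2>&1 || return 1
    until bjobs 2>&1 | grep -q "No unfinished job found"; do
        sleep "$interval"
    done
}
```

A generous polling interval keeps the load on the batch system low.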
You can kill a batch job by using the command
bkill [ -q queuename ] [ -J jobname ] [ IdNumber ]
where queuename, jobname, and IdNumber are obtained using the
bjobs command.
For example, to kill a job that was submitted with:
bsub -q kanga -J MyJob -o myJob.log BetaMiniApp snippet.tcl
first check the job's status with
bjobs
or
bjobs -J MyJob
and then kill it with
bkill <IdNumber>
or
bkill -J MyJob
The system response should be
Job <IdNumber> is being terminated
Issuing the command:
bkill 0
will kill all jobs you have on all queues, whether they are currently
running or pending.
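If you only want to kill some of your jobs, for example those still pending, one possible approach is to select their job numbers from the bjobs listing and pass each to bkill. The sketch below assumes the column layout of the bjobs sample shown earlier (job ID first, STAT third). The function name job_ids_with_status is hypothetical.

```shell
# job_ids_with_status STATUS
# Reads bjobs output on stdin and prints the IDs of the jobs whose STAT
# column equals STATUS (for example PEND or RUN).
job_ids_with_status() {
    awk -v want="$1" '$1 ~ /^[0-9]/ && $3 == want { print $1 }'
}

# Example use (not run here): kill only the pending jobs.
#   for id in $(bjobs | job_ids_with_status PEND); do bkill "$id"; done
```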
You can peek at the output of a job in progress by using the command
bpeek [ -q queuename ] [ -J jobname ] [ -f ][ IdNumber ]
You can leave out the job number in the above command if you only have
one job running.
If bpeek gives the following message:
<< output from stdout >>
select: protocol failure in circuit setup.
it just means the job hasn't gone far enough yet to allow a peek. Wait
a few minutes and try bpeek again.
The -f option will keep listing the last lines of the output file until
you type ^C (like the Unix command "tail -f").
As mentioned above, the CPU time in the table above is in "SLAC units".
Each machine has its own CPU normalization factor (CPUF) that relates
SLAC time to CPU time:
SLAC time = CPU time * CPUF
The CPUF can be obtained with the command bhosts -l <host name>.
For example, if you issue
> bhosts -l bldlnx06
you will find (among other things) that the CPUF for bldlnx06 is 3.36.
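To pick the CPUF out of the bhosts output in a script, something like the following sketch may work. It rests on an ASSUMPTION about the layout of the bhosts -l output: that CPUF appears as a column title on one line, with the values on the next line in the same column order. Check this against the real output on your system before relying on it. The function name cpuf_from_bhosts is hypothetical.

```shell
# cpuf_from_bhosts
# Reads "bhosts -l <host name>" output on stdin and prints the CPUF.
# ASSUMPTION: a header line contains the column title CPUF, and the line
# directly below it holds the values in the same column order.
cpuf_from_bhosts() {
    awk '
        col { print $col; exit }    # line after the header: print that column
        { for (i = 1; i <= NF; i++) if ($i == "CPUF") col = i }
    '
}

# Example use (not run here):
#   bhosts -l bldlnx06 | cpuf_from_bhosts
```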
As an example, we will find out how much CPU time it took you to
compile and link with gmake all. Your all.log file tells you how much
CPU time you used. All batch job log files end with a message like:
------------------------------------------------------------
Sender: LSF System <lsf@bldlnx06>
Subject: Job 563645: <gmake all> Done
Job <gmake all> was submitted from host <yakut06> by user <penguin>.
Job was executed on host(s) <bldlnx06>, in queue <bldrecoq>, as user <penguin>.
</u/br/penguin> was used as the home directory.
</u/br/penguin/ana41> was used as the working directory.
Started at Fri Apr 20 21:13:34 2007
Results reported at Fri Apr 20 21:23:12 2007
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
gmake all
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 254.36 sec.
Max Memory : 592 MB
Max Swap : 637 MB
Max Processes : 14
Max Threads : 14
The output (if any) is above this job summary.
You can see from this message that the job was run on bldlnx06,
with a CPU time of 254.36 seconds. (Of course, your compile and link
time was probably a bit different from mine.) And you just saw that the
CPUF for bldlnx06 is 3.36. So the SLAC time for the above job was:
SLAC time = 254.36 s * 3.36 = 854.65 s
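The same calculation can be scripted. The sketch below reads the CPU time from the "CPU time" line of the job summary shown above and multiplies it by a CPUF that you supply. The function names cpu_time_from_log and slac_time are hypothetical, and a Bourne-style shell with sed and awk is assumed.

```shell
# cpu_time_from_log LOGFILE
# Prints the CPU time in seconds from the "CPU time : ... sec." line of
# the job summary. If the log holds several summaries (output is appended
# when the file already exists), the last one is used.
cpu_time_from_log() {
    sed -n 's/^[[:space:]]*CPU time[[:space:]]*:[[:space:]]*\([0-9.][0-9.]*\)[[:space:]]*sec\..*/\1/p' "$1" | tail -n 1
}

# slac_time CPU_SECONDS CPUF
# Prints SLAC time = CPU time * CPUF, in seconds, to two decimal places.
slac_time() {
    awk -v cpu="$1" -v cpuf="$2" 'BEGIN { printf "%.2f\n", cpu * cpuf }'
}

# Example use (not run here):
#   slac_time "$(cpu_time_from_log all.log)" 3.36
```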
When a batch job fails, then instead of
"Successfully completed" your log file will probably
contain a line like this:
Exited with exit code 137
Exit codes are assigned by the LSF batch system.
Sometimes they are helpful, and sometimes they are just confusing.
To learn more about exit codes and how to decode them, check
out the following webpage:
Exit codes page
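As a first check before consulting that page, a common Unix convention is that an exit code above 128 means the job was ended by a signal, with the signal number equal to the exit code minus 128. Under that convention, exit code 137 corresponds to signal 9 (SIGKILL). The sketch below applies this rule of thumb only. It is not a substitute for the exit codes page, and the function name describe_exit_code is hypothetical.

```shell
# describe_exit_code CODE
# Rule of thumb only: codes above 128 are read as "ended by signal
# (CODE - 128)"; other non-zero codes are read as the application's own
# exit status.
describe_exit_code() {
    if [ "$1" -gt 128 ]; then
        echo "terminated by signal $(( $1 - 128 ))"
    elif [ "$1" -eq 0 ]; then
        echo "success"
    else
        echo "application exit status $1"
    fi
}
```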
"batch system daemon not responding"
Sometimes when you enter a batch command, you will get the message:
batch system daemon not responding ... still trying
batch system daemon not responding ... still trying
...
batch system daemon not responding ... still trying
This just means that there is a temporary problem with the
batch system. The easiest way to deal with it is just to
type <control c> to get the message to stop,
and try again later.
Page maintained by Adam Edwards
Last modified: January 2008