Quick link: Job crash/Exit codes page
You already used the batch system for several tasks in the Quicktour section of this Workbook. We will now discuss it in more detail.
In general, it is almost always best to submit your jobs to the batch system. This is especially important for CPU-intensive jobs. The batch system makes sure that computing resources are shared fairly between different users.
To submit the job "command" to the batch queue, use the bsub command:
bsub [ -q queuename ] [ -o outfile ] command
(See below for details and options.)
The only jobs that do not need to be submitted to the batch system are jobs that are very short, or that truly require user interaction, such as debugging.
To see the full details of a queue, use:
> bqueues -l (queuename)
Here is a summary of the relevant information at SLAC:
| Queue | Purpose | Who can use it |
| bldrecoq | For BaBar builds | all users |
| kanga | For CM2-Kanga analysis jobs. All jobs accessing CM2-Kanga (Mini) data MUST be run in the kanga queue, apart from very long jobs | all users |
| express | For very short jobs (Maximum 4 minutes) | all users |
| short | Maximum 15 minutes SLAC CPU time | all users |
| control | Queue for batch control jobs | all users |
| medium | Maximum 90 minutes SLAC CPU time | all users |
| long | Maximum 6 hours SLAC CPU time | all users |
| xlong | Maximum 24 hours SLAC CPU time | all users |
| idle | Jobs scheduled only if the machine is lightly loaded | all users |
| bfobjy | For analysis jobs accessing physboot, data12boot, simuboot, and all Objectivity federations. All jobs accessing Objectivity federations MUST be run in the bfobjy queue | all users |
Choose the queue for your job carefully. Please note that bldrecoq is for gmake builds only.
The above table lists CPU time limits for some of the queues. These limits are given in "SLAC units" rather than in raw CPU time. Each machine has its own CPU normalization factor (CPUF) that relates SLAC time to CPU time:
SLAC time = CPU time * CPUF
See below for an example of how to convert from the CPU time in your log file to SLAC time.
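As a rough illustration of how the conversion affects your choice of queue, here is a minimal Python sketch. The function names and the list QUEUE_LIMITS_MIN are hypothetical helpers, not part of LSF. The limits are copied from the table above, and the CPU time and CPUF values in the usage lines are only example numbers. The special-purpose queues (bldrecoq, kanga, bfobjy, control, idle) are left out because the table gives no time limit for them.

```python
# Time limits in SLAC minutes, copied from the queue table above.
# Only the queues with a stated limit are listed (assumption: the
# special-purpose queues are chosen by job type, not by run time).
QUEUE_LIMITS_MIN = [
    ("express", 4),
    ("short", 15),
    ("medium", 90),
    ("long", 6 * 60),
    ("xlong", 24 * 60),
]


def slac_seconds(cpu_seconds, cpuf):
    """Convert raw CPU seconds on a host to SLAC seconds using that host's CPUF."""
    return cpu_seconds * cpuf


def smallest_queue(cpu_seconds, cpuf):
    """Return the name of the shortest queue whose limit covers the job, or None."""
    needed = slac_seconds(cpu_seconds, cpuf)
    for name, limit_min in QUEUE_LIMITS_MIN:
        if needed <= limit_min * 60:
            return name
    return None  # longer than the xlong limit


if __name__ == "__main__":
    # Example numbers only: 300 s of CPU time on a host with CPUF 3.36.
    print(smallest_queue(300, 3.36))
```

Note that a job taking 300 seconds of raw CPU time looks as if it fits in the short queue, but on a host with a CPUF of 3.36 it needs 1008 SLAC seconds and so belongs in medium.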
bsub [ -q queuename ] [ -J jobname ] [ -o outfile ] command
| -q queuename | Submit the job to one of the queues specified by queuename |
| -J jobname | Assign the character string specified by jobname to the job |
| -o outfile | Store the standard output of the job to the file outfile. If the file already exists, the job output is appended to it. If the -o option is NOT specified the job output is sent by mail to the submitter. |
If the submission succeeds, the system responds with:
Job <IdNumber> is submitted to queue <queuename>
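If you submit many jobs from a script, it can help to build the bsub command line in one place and to pull the job ID out of the response. The following Python sketch shows one way to do that. build_bsub_command and parse_bsub_response are hypothetical helper names, and the regular expression assumes that the response has exactly the form shown above.

```python
import re


def build_bsub_command(command, queue=None, job_name=None, outfile=None):
    """Return the bsub argument list for the options described above.

    `command` is the job itself, given as a list of words.
    """
    args = ["bsub"]
    if queue:
        args += ["-q", queue]
    if job_name:
        args += ["-J", job_name]
    if outfile:
        args += ["-o", outfile]
    return args + list(command)


# Assumption: the response looks like "Job <12345> is submitted to queue <kanga>".
_SUBMIT_RE = re.compile(r"Job <(\d+)> is submitted to queue <([^>]+)>")


def parse_bsub_response(text):
    """Return (job_id, queue) from the bsub response, or None if it does not match."""
    match = _SUBMIT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    cmd = build_bsub_command(["BetaMiniApp", "snippet.tcl"],
                             queue="kanga", job_name="MyJob", outfile="myJob.log")
    print(" ".join(cmd))
    # To really submit, you could pass `cmd` to subprocess.run(cmd,
    # capture_output=True, text=True) and feed its stdout to parse_bsub_response.
```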
bjobs [ -J jobname ] [ -u username ] [ -q queue ] [ -l ] [ IdNumber ]
This will display the status and other information about the batch jobs that are specified by the options. If no options are specified, then the default is to display information about all your unfinished (that is, pending, running and suspended) jobs.
-l returns detailed information ("long" listing)
-u username returns a listing of jobs owned by username
-u all will list all jobs belonging to all users.
-q queue lists the jobs in a specific queue
bjobs will list all the jobs that you have running on any queue.
When you issue the bjobs command, the system should respond with something like:
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
87082 perl RUN bldrecoq shire01 palomino10 *aUser.bin Dec 20 17:01
The status (STAT) of a job can be one of:
| PEND | The job has not yet been started, it is pending |
| PSUSP | The job has been suspended, either by its owner or by the LSF administrator, while pending |
| RUN | The job is currently running |
| USUSP | The job has been suspended, either by its owner or by the LSF administrator, while running |
| SSUSP | The job has been suspended by LSF due to the conditions of the queue |
| DONE | The job has terminated |
| EXIT | The job has been either killed or aborted due to an error in its execution |
| UNKWN | Contact has been lost with the job |
| ZOMBI | The job is a zombie (typically a job that was killed while its execution host was unreachable) |
Your jobs are done when bjobs responds:
No unfinished job found
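If you want a script to check on your jobs, the bjobs listing can be read line by line. The Python sketch below is one possible approach, not an official tool. parse_bjobs and all_finished are hypothetical names. It reads only the first four columns (JOBID, USER, STAT, QUEUE), on the assumption that these are always filled in. The later columns are ignored because a column such as EXEC_HOST may be empty for a pending job, which would confuse a simple whitespace split.

```python
def parse_bjobs(output):
    """Return a list of dicts with jobid, user, stat and queue for each job line."""
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and messages: job lines start with a number.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        jobs.append({
            "jobid": int(fields[0]),
            "user": fields[1],
            "stat": fields[2],
            "queue": fields[3],
        })
    return jobs


def all_finished(output):
    """True if bjobs reported that there are no unfinished jobs."""
    return "No unfinished job found" in output


if __name__ == "__main__":
    sample = (
        "JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME\n"
        "87082 perl RUN bldrecoq shire01 palomino10 *aUser.bin Dec 20 17:01\n"
    )
    for job in parse_bjobs(sample):
        print(job["jobid"], job["stat"], job["queue"])
```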
bkill [ -q queuename ] [ -J jobname ] [ IdNumber ]
where queuename, jobname, and IdNumber are obtained using the bjobs command.
For example, to kill a job that was submitted with:
bsub -q kanga -J MyJob -o myJob.log BetaMiniApp snippet.tcl
first check the job's status with
bjobs
or
bjobs -J MyJob
and then kill it with
bkill <IdNumber>
or
bkill -J MyJob
The system response should be
Job <IdNumber> is being terminated
Issuing the command:
bkill 0
will kill all jobs you have on all queues, whether they are currently running or pending.
bpeek [ -q queuename ] [ -J jobname ] [ -f ][ IdNumber ]
You can leave out the job number in the above command if you only have one job running.
If bpeek gives the following message:
<< output from stdout >>
select: protocol failure in circuit setup.
it just means the job hasn't gone far enough yet to allow a peek. Wait a few minutes and try bpeek again.
The -f option will keep listing the last lines of the output file until you type ^C (like the Unix command "tail -f").
SLAC time = CPU time * CPUF
The CPUF can be obtained with the command bhosts -l <host name>.
So, for example, issuing
> bhosts -l bldlnx06
you will find (among other things) that the CPUF for bldlnx06 is 3.36.
As an example, we will find out how much CPU time it took you to compile and link with gmake all. Your all.log file tells you how much CPU time you used. All batch job log files end with a message like:
------------------------------------------------------------
Sender: LSF System <lsf@bldlnx06>
Subject: Job 563645: <gmake all> Done
Job <gmake all> was submitted from host <yakut06> by user <penguin>.
Job was executed on host(s) <bldlnx06>, in queue <bldrecoq>, as user <penguin>.
</u/br/penguin> was used as the home directory.
</u/br/penguin/ana41> was used as the working directory.
Started at Fri Apr 20 21:13:34 2007
Results reported at Fri Apr 20 21:23:12 2007
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
gmake all
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 254.36 sec.
Max Memory : 592 MB
Max Swap : 637 MB
Max Processes : 14
Max Threads : 14
The output (if any) is above this job summary.
You can see from this message that the job was run on bldlnx06, with a CPU time of 254.36 seconds. (Of course, your compile and link time was probably a bit different from mine.) And you just saw that the CPUF for bldlnx06 is 3.36. So the SLAC time for the above job was:
SLAC time = 254.36 s * 3.36 = 854.65 s (about 14 minutes)
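If you have many log files, you may prefer to let a script do this arithmetic. Here is a minimal Python sketch, assuming that the log contains a "CPU time : ... sec." line in the format of the job summary shown above. cpu_time_from_log is a hypothetical helper name, and you still have to look up the CPUF for the execution host yourself with bhosts -l.

```python
import re

# Assumption: the resource usage summary contains a line like
# "CPU time : 254.36 sec." as in the example log above.
_CPU_TIME_RE = re.compile(r"CPU time\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*sec")


def cpu_time_from_log(log_text):
    """Return the CPU time in seconds from a batch log, or None if not found."""
    match = _CPU_TIME_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1))


if __name__ == "__main__":
    log_tail = (
        "Resource usage summary:\n"
        "CPU time : 254.36 sec.\n"
        "Max Memory : 592 MB\n"
    )
    cpuf = 3.36  # CPUF of bldlnx06, taken from the bhosts example above
    cpu = cpu_time_from_log(log_tail)
    print(f"SLAC time = {cpu * cpuf:.2f} s")
```

For the log above this gives a SLAC time of about 854.65 s.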
When a job fails, the job summary in its log file reports a line such as:
Exited with exit code 137
Exit codes are assigned by the LSF batch system. Sometimes they are helpful, and sometimes they are just confusing. To learn more about exit codes and how to decode them, check out the Job crash/Exit codes page.
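As a first guess at what an exit code means, you can try the common Unix convention that a code above 128 is 128 plus the number of the signal that ended the job. The Python sketch below applies that convention. decode_exit_code is a hypothetical helper name, the signal names come from the machine on which you run the script, and the convention is only an assumption here: it does not cover every code that the batch system can report.

```python
import signal


def decode_exit_code(code):
    """Split an exit code into ("signal", number, name) or ("exit", code, None).

    Assumes the usual convention that codes above 128 mean 128 + signal number.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = None  # no signal with this number on this platform
        return ("signal", signum, name)
    return ("exit", code, None)


if __name__ == "__main__":
    kind, number, name = decode_exit_code(137)
    print(kind, number, name)
```

Under this convention, exit code 137 corresponds to signal 9 (SIGKILL on Linux), which means that the job was killed from outside rather than crashing on its own.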
Sometimes when you enter a batch command, you will get the message:
batch system daemon not responding ... still trying
batch system daemon not responding ... still trying
...
batch system daemon not responding ... still trying
This just means that there is a temporary problem with the batch system. The easiest way to deal with it is to type ^C to stop the messages, and then try again later.