Batch system in a nutshell


The SLAC batch system runs on the SCS UNIX compute farm and is based on LSF (Load Sharing Facility).

13 Feb 2006

Map:

  • Description of the system
  • Useful batch commands

  • General Description

    The SLAC batch system uses the Load Sharing Facility system from Platform Computing.

    The queues available to ATLAS users at SLAC are the "general" queues, which are not experiment-specific or reserved for production-type usage. The names of these queues (with their CPU limits in SLAC minutes in parentheses) are express (5), short (20), medium (90), long (360), xlong (2900), and xxl (15192). The longer the queue, the fewer jobs will be run from it, so you should select the queue most appropriate for your job. Most of the queues also have a wall-clock limit, typically twice the CPU limit, to ensure that jobs don't get stuck (technical term).

    SLAC time units are normalized to a machine no longer in use at SLAC; this was a Sun Netra T1 with a 440 MHz SparcIII (IIRC). The most recent machines at SLAC have a scaling factor (CPUF) of 8.46, so on one of them a job in the long queue would get only about 42.5 minutes of real CPU time (360 / 8.46). You can check the CPUF of a machine with the "bhosts -l <host>" command.
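
    For example, to check this yourself (the host name below is just the one used in the worked example near the end of this page; substitute any batch host):

         bqueues -l long          [the CPULIMIT line shows the queue limit in SLAC minutes]
         bhosts -l cob0313        [the output includes the CPUF factor for that host]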


    bsub command

    Introduction:

       Submits a command to the batch system.
    

    Syntax:

       bsub [options]   command [argument]
    

    Major Options:

         -c <hh:mm>      [amount of CPU time]
         -q <queue>      [job queue] 
    
    

    Minor Options:

         -J <jobname>    [specify job name]
         -m <host>       [run job on this machine]
         -R <resource>   [run job on this resource]
    
    

    Execution Options:

         -E <command>    [specify pre-run command]          
         -L <shell>      [specify a login-shell]          
         -nr             [job is not rerunnable from the beginning or the last checkpoint] 
         -r              [job is rerunnable from the beginning or the last checkpoint] 
    
    

    I/O Options:

         -i <infile>     [specify standard input file]
         -o <outfile>    [specify standard output file]
         -e <errfile>    [specify standard error file]
    
    

    Example of bsub:

        bsub -q bldrecoq -m build02 gmake all
        bsub -q bldrecoq -m build02 ls -la /u1/drjohn/bfdist/releases/nightly
        bsub -q bldrecoq -m build02 'ls -la /u1/drjohn/bfdist/releases/nightly/DbiEvent/*'
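
    For reference, a sketch of a submission combining the CPU-limit, naming, and I/O options above (the queue, job name, file names, and program are placeholders, not real SLAC settings):

        bsub -q long -c 00:30 -J myjob -o myjob.out -e myjob.err myprog arg1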
    
    

    bjobs command

    Introduction:

       Queries the status of jobs in the batch system.
    

    Syntax:

         bjobs [options]
    

    Major Options:

         -u <user>        [specify user; "all" means all users]
    
    

    Minor Options:

         -a               [all jobs]
         -l               [long form]
    
    

    Example:

         bjobs            [query my jobs in the batch queue]
         bjobs -u mark    [query all jobs submitted by user mark]
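         bjobs -l 388999  [long-form details of a single job; this job ID is just a placeholder]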
    
    

    Command summary of LSF batch system

    Major batch queue commands:

         bkill    [kill batch jobs.]
         bsub     [submit a job for batched execution.]
         bmod     [modify the parameters of a submitted job.]
    

    Minor batch queue commands:

         bacct    [generate accounting information about batch jobs.]
         bchkpnt  [checkpoint batch jobs.]
         bmig     [migrate a job.]
         brestart [restart a job from its checkpoint files.]
    

    Suspend/resume commands:

         bbot     [move a pending job to the bottom (end) of its queue.]
         bresume  [resume suspended batch jobs.]
         bstop    [suspend batch jobs.]
         bswitch  [switch pending jobs from one queue to another.]
         btop     [move a pending job to the top (beginning) of its queue.]
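
    For example (the job ID is a placeholder, reused from the general examples below; the queue name is likewise illustrative):

         bstop 388999             [suspend job 388999]
         bresume 388999           [resume it]
         bswitch long 388999      [move the pending job to the long queue]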
    

    Query commands:

         bjobs    [display the status and other information about batch jobs.]
         bqueues  [display the status and other information about batch job queues.]
         bhosts   [display the status and other information about batch server hosts.]
         bhpart   [display information about batch host partitions.]
         busers   [display information about batch users.]
         bugroup  [display the user group names and their memberships.]
         bmgroup  [display the host group names and their memberships.]
         bparams  [display information about the configurable system parameters.]
         bpeek    [display the stdout and stderr output produced so far by a batch job.]
         bhist    [display the processing history of batch jobs.]
    

    Examples of using the batch system:

    General examples:

         bsub -c00:30 gmake all       [build test release]
         bjobs                        [find my batch job]
         bkill 388999                 [kill this job]
    
    

    Use specific host:

         bqueues -m <host>            [which queue supports this machine]
         bsub -q <queue> -m <host> <commands..>    [run on this machine]
    
         [Note]: at the moment this won't work for build10. Use
         the following instead:
          bsub -q <queue> -R sol7 <commands..>    [run on build10]
    
    

    Job crashes and exit codes

    This webpage is a collection of information about job crashes and exit codes, gleaned from HyperNews and wherever else I could find it.

    If you have anything useful to add, or if any of the information is incorrect, please feel free to edit the page!

    Quick diagnosis

    The overall impression I get from searching HyperNews is summarized in the list of frequently seen exit codes further down this page; the sections below explain how to interpret the codes yourself.

    Exit codes and kill-job signals

    The exit code from a batch job is a standard Unix termination status, the same sort of number you get in a shell script from checking the "$?" variable after executing a command.

    Typically, exit code 0 (zero) means successful completion. Codes 1-127 are typically generated by your job itself calling exit() with a non-zero value to terminate itself and indicate an error. In BaBar we don't make very much use of this. The most common such value you might see is 64, which is the value used by Framework to say that its event loop is being stopped before all the requested data have been read, typically because time ran out. In recent BaBar releases you might also see 125, which we use as a code for a generic "severe error"; the job log should contain a message stating what the error was.

    Exit codes in the range 129-255 represent jobs terminated by Unix "signals". Each type of signal has a number, and what's reported as the job exit code is the signal number plus 128. Signals can arise from within the process itself (as for SEGV, see below) or be sent to the process by some external agent (such as the batch control system, or your using the "bkill" command).

    By way of example, then, exit code 64 means that the job deliberately terminated its execution by calling "exit(64)", exit code 137 means that the job received a signal 9, and exit code 140 represents signal 12.
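
    If you want to see this mapping for yourself, here is a small sketch you can run in a Bourne-type shell such as bash (any long-running command will do):

    sleep 60 &          [start a background job]
    kill -9 %1          [send it signal 9]
    wait %1; echo $?    [prints 137, i.e. 128 + 9]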

    The specific meaning of the signal numbers is platform-dependent. If you are trying to figure out a problem that was seen on Linux, you have to run the following commands on Linux. We don't have Solaris or Mac OS batch resources in BaBar at the moment, but if we did, you would have to match platforms similarly when debugging.

    terminationDecoder

    BaBar provides a little program that will take your exit code and spit out an explanation. The program is called terminationDecoder. Examples:

    [yakut] terminationDecoder 137
    terminated by signal 9 (Killed)
    
    [yakut] terminationDecoder 64
    exited with code 64 (in Framework: stop requested, e.g., by CpuCheck)
    

    More details

    You can also look this up yourself; if you know the signal number, then you can find out why the job was killed using the command "kill -l":

    [yakut] kill -l
    
    HUP INT QUIT ILL TRAP ABRT BUS FPE KILL USR1 SEGV USR2 PIPE ALRM TERM STKFLT
    CHLD CONT STOP TSTP TTIN TTOU URG XCPU XFSZ VTALRM PROF WINCH POLL PWR SYS
    RTMIN RTMIN+1 RTMIN+2 RTMIN+3 RTMAX-3 RTMAX-2 RTMAX-1 RTMAX
    

    So for example, if your job was killed by signal 6, then it got an "ABRT", which is short for ABORT.

    To find out what all the "kill -l" words mean, you can use the command:

    man 7 signal    
    

    (or, on Solaris, "man -s 3HEAD signal"). This will give you the man page for SIGNAL(7). Scroll down a bit and you will get a list of the kill-signal words with a short explanation. Here is a sample:

    SIGHUP        1       Term    Hangup detected on controlling terminal
                                  or death of controlling process
    SIGINT        2       Term    Interrupt from keyboard
    SIGQUIT       3       Core    Quit from keyboard
    SIGILL        4       Core    Illegal Instruction
    SIGABRT       6       Core    Abort signal from abort(3)
    SIGFPE        8       Core    Floating point exception
    SIGKILL       9       Term    Kill signal
    SIGSEGV      11       Core    Invalid memory reference
    SIGPIPE      13       Term    Broken pipe: write to pipe with no readers
    SIGALRM      14       Term    Timer signal from alarm(2)
    SIGTERM      15       Term    Termination signal
    

    (Obviously, these are just the "kill -l" words, but with a "SIG" in front of them.)

    You may also find it useful to look at the file signal.h. On a Linux machine, the location is:

    /usr/include/asm/signal.h
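
    For example, to look up one signal's number there (assuming the header exists at that path on your machine):

    grep -w SIGKILL /usr/include/asm/signal.h     [the matching #define line gives the number, 9 for SIGKILL]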
    

    Frequently Seen Codes

    Exit code 9: Ran out of CPU time.

    Exit code 64: The framework ended the job nicely for you, most likely because the job was running out of CPU time. But it means you did not go through all the data requested. The solution is to submit the job to a queue with more resources (bigger CPU time limit).
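
    For example (the queue name is illustrative; use whichever queue has enough CPU time), resubmitting is simply:

    bsub -q xlong <your command and its arguments>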

    Exit code 125: An ErrMsg(severe) was reached in your job.

    Exit code 127: The command could not be found or executed (the shell returns 127 for "command not found"); this may also point to something wrong with the machine.

    Exit code 130: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.

    Exit code 131: The job ran out of CPU or swap time. If swap time is the culprit, check for memory leaks.

    Exit code 134: The job was killed with an abort signal, and a core file was probably dumped. Often this is caused by an assert() or an ErrMsg(fatal) being hit in your job, or by a run-time bug in your code. Use a debugger such as gdb or dbx to find out what went wrong.
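
    A minimal sketch of looking at the resulting core file with gdb (the program and core-file names are placeholders; the actual core file name depends on your system):

    gdb ./myprog core     [load the program together with its core dump]
    (gdb) bt              [print the stack trace at the point of the abort]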

    Exit code 137: The job was killed because it exceeded the time limit.

    Exit code 139: Segmentation violation.

    Exit code 140: The job exceeded the "wall clock" time limit (as opposed to the CPU time limit).

    A HOWTO guide to job-kill signals

    SEGV
    A segmentation violation or segmentation fault typically means that something is trying to access memory that it shouldn't be accessing. One common example of this is trying to access memory through a NULL pointer, for example:
    sunprompt> cat main.c
    #include <iostream>
    using namespace std;
    int main()
    {
      int* bunk(0);            // NULL pointer
      cout << *bunk << endl;   // dereferencing it raises SIGSEGV
    }
    sunprompt> CC main.c
    sunprompt> ./a.out
    Segmentation fault (core dumped)
    
    ABRT
    asserts are one common source of the "abort" signal, for example:
    sunprompt> cat main.c
    #include <cassert>
    int main()
    {
      int i=0;
      assert(i!=0);   // fails, so abort() is called and the job gets SIGABRT
    }
    sunprompt> CC main.c
    sunprompt> ./a.out
    Assertion failed: i!=0, file main.c, line 5
    Abort (core dumped)
    
    Note that the actual assertion which was failed and the location is also printed. An ABRT can also be generated from the BaBar ErrMsg(fatal) construct, in which case your job log should contain a message explaining the error.
    FPE
    A "Floating Point Error" usually indicates a numerical problem such as a division by zero or an overflow. One example would be:
    osfprompt> cat main.c
    int main()
    {
      float a = 1.;
      float b = 0.;
      float c = a/b;   // division by zero; trapped as SIGFPE on this platform
      return (int)c;
    }
    osfprompt> g++ main.c
    osfprompt> ./a.out
    Floating exception (core dumped)
    
    ILL
    If you receive a signal like this ("Illegal Instruction"), it means that, while running, your program tried to execute a machine "instruction" which does not exist. This can happen for a variety of reasons, including:
    • a memory overwrite that happens to overwrite part of the program stored in memory. This may result in the program trying, for example, to execute data as if it were a machine instruction.
    • an attempt to run an executable compiled on one platform on another, for example on an earlier version of the same chip.
    • a truncated or corrupted executable being loaded for execution.
    • incomplete recompilation of source code, e.g. you changed one C++ class and didn't recompile all the other code affected by that change.
    BUS
    A "Bus Error" may come, for example, from accessing unaligned data (i.e. like trying to access a 4 byte integer with a pointer to the middle of it). What this means will vary from platform to platform. (I haven't come up with a good example of this one yet.)

    A "Bus Error" can also often indicate a memory overwrite, e.g. somebody wrote a number where a pointer is kept. Often caused by going past the end of an array and into the system pointers at the start of the next memory block.

    How do you know if you've exceeded your CPU time?

    To find out whether your job has exceeded the CPU time limit, you have to do 3 things:

    1. Look at your log file to get the job's CPU time.
    2. Use the machine-dependent CPUF to convert the CPU time to SLAC time. The formula is: SLAC time = CPU time * CPUF.
    3. Compare this to the time allowed by the queue in which the job was run.

    Here is an example.

    First, look at the end of your log file:


    Job <VubRecoilUserApp VubXlnu.tcl SP-1237-BSemiExcl-Run5-R18b-1 MC> was submitted from host <yakut02> by user <penguin>.
    Job was executed on host(s) <cob0313>, in queue <xlong>, as user <penguin>.
    </u/br/penguin> was used as the home directory.
    </u/br/penguin/vubrecoil/vub30/workdir> was used as the working directory.
    Started at Wed Feb 8 17:25:33 2006
    Results reported at Wed Feb 8 19:27:28 2006

    Your job looked like:
    ------------------------------------------------------------
    # LSBATCH: User input
    VubRecoilUserApp VubXlnu.tcl SP-1237-BSemiExcl-Run5-R18b-1 MC
    ------------------------------------------------------------

    Exited with exit code 134.

    Resource usage summary:

        CPU time      :   7058.71 sec.
        Max Memory    :      2863 MB
        Max Swap      :      2968 MB
        Max Processes :         3
        Max Threads   :         3

    The job was run on the machine cob0313.

    > bhosts -l cob0313
    

    This tells you (among other things) that the CPUF for cob0313 is 7.65.

    The SLAC time for your job is thus:

    SLAC time = (CPU time) * CPUF = (7058.71 sec) * 7.65 = 53999.1 sec = 900 min
    

    The next step is to find out whether this exceeds the CPU limit of the queue in which the job was run. In this example, the job ran in the xlong queue:

    > bqueues -l xlong
    

    Among other things, this tells you the CPU limit for the queue:

     CPULIMIT
     2900.0 min of slac
    

    The job used only 900 minutes of SLAC time, less than the 2900 allowed by the xlong queue. So the job did not exceed its CPU time limit. It must have crashed for some other reason.

    Memory Leaks

    Jobs can also crash because of memory problems --- things like leaks, dangling pointers, or array overruns.
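
    On Linux, a memory checker such as valgrind can help track these down (a sketch; the program name and arguments are placeholders):

    valgrind --leak-check=full ./myprog arg1     [reports leaks and some invalid memory accesses]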


    More information or help for LSF system:

    For help or more information about the LSF batch system, see the "High Performance Computing at SLAC" web page provided by SCS.


    Maintained by
    Stephen J. Gowdy
    Originally by
    Terry Hung