Using the SLAC Batch Farm

Problem? Contact: helpsoftlist@glast.stanford.edu

If you need more cycles than you have readily at hand, you're welcome to use the SLAC Linux batch farm, which lets you run many jobs concurrently.

The batch system is easy to use and offers substantially more CPU horsepower than can be obtained from the handful of interactive servers available. However, there are some important issues regarding the use of the batch farm as described below.

Test Drive. You can submit a simple batch job, consisting only of the UNIX command "hostname", which prints out the name of the host computer on which the job runs.

  1. Log in to your SLAC Public account.
  2. At the prompt, enter: bsub hostname

This will submit a one-command batch job to the 'short' queue, which gives you about 2 minutes of CPU time (on a fell-class machine). A message listing your job number and the queue to which the job was submitted will be displayed:

  Job <206422> is submitted to default queue <short>.
  chuckp@noric06 $

  3. Check your email for a message with the name of your host machine; it should be similar to this example: email response.

Normally, a batch job is more complex and you will likely submit a shell or python script to perform the desired computation. The syntax for a more typical batch submission might look like this:

$ bsub -q long -R rhel60 myScript.py

This command submits a job to the 'long' queue (about 2 hours of CPU time on a fell-class machine) and further specifies that the job be run on a machine running Red Hat Enterprise Linux version 6. A log of the batch job is returned to you by email, and any files created are written wherever your script directs them.
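If you would rather have the job's log written to a file than sent by email, LSF's -o option can be added to the submission; a minimal sketch (the logs directory is an assumption and should already exist, and %J is expanded by LSF to the job number):

$ bsub -q long -R rhel60 -o $HOME/logs/myScript.%J.log myScript.py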

Note: Since SLAC moved to LSF 9, batch jobs must be submitted from a 64-bit machine, such as rhel5-64 or rhel6-64.

Shared Resources

Once you log in to AFS, your SLAC Public environment enables you to share a great many resources. For example, when you submit a job to the batch farm, the shared resources you are using include:

  • Interactive machines (e.g., noric, rhel5-32, etc.)
    Note:
    For a complete list, see Public Machines at SLAC.
  • Batch farm (non-interactive)
  • NFS disks (Fermi group space)
  • AFS disks (home directories and some Fermi group space)
  • Xroot disks (Fermi storage for bulk data)
  • Network facilities

Monitoring Resources

Remember, there are many users. When sharing these resources, it is important to avoid overloading them, which degrades response times and can cause jobs to fail for everybody. For example, if you are simultaneously running hundreds of jobs that are doing a lot of I/O, the server can struggle to handle the requests.

When running jobs on the batch farm, you can — and should — monitor the activity and state of these resources to ensure that you haven't inadvertently overloaded them. Ganglia monitors many of them, and serves as an early alert to a problem. Once you find the relevant monitoring page(s), select an appropriate time period for the plots, and don't forget to refresh the page with new data periodically.

Tip: Ganglia pages are available for monitoring each of these resources.

Disk access is the single most likely bottleneck in a swarm of batch jobs. To avoid having colleagues beating down your door because your jobs are causing problems for them, take the time to do a bit of homework first, be prepared to *carefully* monitor your project as it ramps up, and be equally prepared to kill those batch jobs if a problem materializes.

First, assess which disks you will be reading from and writing to; examples include:

  • Your $HOME directory in AFS (e.g., "dot" files, pfiles).
  • Other AFS directories you own where code might be kept.
  • Fermi NFS directories (e.g., /nfs/farm/g/glast/uXX and /afs/slac/g/glast/users).
  • Xroot directories (e.g., results of a DataCatalog query).

Next, discover which servers are involved:

  1. For AFS disks, use something like this:

$ fs whereis /afs/slac/g/glast/ground/releases/analysisFiles/
File /afs/slac/g/glast/ground/releases/analysisFiles/ is on host afs00.slac.stanford.edu

Here, the server is "afs00".

  2. For NFS disks, use something like this:

$ cd /nfs/farm/g/glast/u33
$ pwd
/a/sulky36/g.glast.u33

Here, the server is "sulky36".

  3. For xroot, there is no way to know (even from day to day) on which server a particular file is stored, so you will just need to monitor the entire system for overloading.

Then, bring up the Ganglia web pages mentioned above, and zero in on the server(s) of interest.

Finally, how do you know if you are stressing the system? Before you begin to submit jobs, take a look at the CPU utilization and the disk and network I/O plots for each server to get an idea of the instantaneous baseline.

As your jobs begin to run, any abrupt and significant increase in those metrics indicates that you are adding a significant load.

How much is too much? That is hard to say exactly, but beyond roughly these levels the situation can become very painful, both for your jobs and for other users:

  • CPU utilization > 90%
  • Disk I/O > ~50 MB/s on a single disk

If you notice a large number of "nfs_server_badcalls", that is evidence of problems with the server, and if it is correlated with your batch jobs, it is probably your fault!

If a problem occurs, contact helpsoftlist@glast.stanford.edu.

Best Practices

There are some basic guidelines to keep in mind before you submit a batch job.

  • Put analysis code and scripts in your afs home directories (which are backed up), and ultimately put your output files in nfs, e.g., the user disk

    /afs/slac/g/glast/users/<username>

    1. Create a unique directory in /scratch for your batch job, e.g.,

            mkdir -p /scratch/<userid>/${LSB_JOBID}
    2. Define this directory as your $HOME and then go there before running any ScienceTools/Ftools, etc.:

    export HOME=/scratch/<userid>/${LSB_JOBID}
    cd ${HOME}

    This will automatically take care of PFILES being unique for your job and avoid overloading the /nfs user disk with large numbers of opens and closes.  Create any new files in $HOME and then copy anything you wish to save at the end of your job.

    3. Clean up the scratch directory at the end of the job (after you have copied out anything you want to save):
           
            rm -rf /scratch/<userid>/${LSB_JOBID}

    Note: Cleaning up the scratch directory is critical! Scratch files left behind will slowly fill the /scratch partition until it is completely full. A minimal wrapper script combining these steps is sketched below.
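    The following is only a sketch of such a wrapper, to be submitted with bsub; the analysis command, output file name, and destination directory are assumptions to be replaced with your own:

        #!/bin/bash
        # Sketch of a batch wrapper: unique scratch area, local $HOME, copy-out, cleanup.
        SCRATCH=/scratch/<userid>/${LSB_JOBID}
        mkdir -p ${SCRATCH}
        export HOME=${SCRATCH}
        cd ${HOME}

        # ... run your ScienceTools/Ftools analysis here, creating new files in ${HOME} ...

        # Copy anything worth keeping to Fermi user space (the destination is an assumption).
        cp myResults.fits /afs/slac/g/glast/users/<username>/

        # Always clean up the scratch area.
        rm -rf ${SCRATCH}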

  • Submitting jobs. While submitting a few jobs at a time will likely result in reasonably rapid turn-around, if you plan to run a large number of batch jobs (e.g., 50 or more), you run the risk of delays while competing with other users of the batch farm. Allocations and job scheduling are governed by complex formulas, making it difficult to predict batch response times.

    Tip: Weekends are often good for such work — except just before major conferences when there may be a lot of contention for CPU cycles.

If you are planning a large batch operation, please inform and coordinate with SAS management (Richard Dubois).

  • Disable core dumps for most batch jobs.

    Creating a core dump from a failed job can be very useful for debugging. However, when running many jobs in parallel on the batch farm, allowing core dumps can cause a serious problem: if a critical resource malfunctions and causes many jobs to crash, those jobs will all attempt to write core dumps at the same time and will likely overwhelm the target file server. In general, when running more than one instance of a batch job for a given project, core dumps should be disabled or severely limited in size.

    Here is how to limit each core dump to 512 kB in three scripting languages. Limiting to 0 bytes is also possible, and even desirable if running hundreds of simultaneous jobs. These statements can be added to the wrapper scripts for your jobs so that they do not affect your interactive login sessions. (If you wish to limit core dump sizes in your interactive login sessions, just add these lines to .cshrc, .bashrc, etc. as appropriate.)

    In python:
    import resource
    resource.setrlimit(resource.RLIMIT_CORE, (524288, resource.RLIM_INFINITY))  # limit core dump size to 512 kB

    In bash (units are in kB):
    $ ulimit -c 512

    In tcsh (units are kB):
    $ limit coredumpsize 512

  • PFILES. When running multiple simultaneous jobs, whether on interactive machines or on the SLAC batch farm, be sure that each job is given a unique, local PFILES path in which to write its parameter files. (One .par file is created by each ScienceTool or Ftool.) This is accomplished by setting the $PFILES environment variable appropriately.

Why bother? If you do not specify a unique PFILES path for each job, these parameter files will be created in your $HOME directory, i.e., $HOME/pfiles, and each job will attempt to write its .par files to the same directory, causing an unfortunate and painful conflict if the same tools run simultaneously. Not only will your jobs fail to give reliable results, but this sort of activity is very demanding on file servers and can cause severely degraded performance for all users.

Ideally, you should direct your pfiles to a local scratch space (the resulting parameter files are not something you likely need to keep upon completion of the job). Whatever you do, DO NOT direct your writable PFILES path to one of the GLAST user disks! (Such anti-social behavior will not go unnoticed.)

Tip: All SLAC-managed Linux machines are configured with scratch space. Most public interactive machines have a space called '/usr/work'. Most (all?) batch machines have a large space called '/scratch'. Desktop machines may also have '/scratch'. All machines have '/tmp', but this space should not be used for large files as the space is usually limited and filling up /tmp can cause a machine to crash. When using scratch space on public machines, always create a directory with your own username, e.g., /scratch/<username>, into which all of your temporary files are written. It is also crucial on batch machines to always clean up any temporary files you create or that area will fill up.

Note: Remember that the batch farm is not interactive and it inherits whatever environment you happen to have set up when you submit the job. Thus, all desired non-default ScienceTool/Ftool parameters must be specified explicitly.

Below are examples of two approaches to managing $PFILES. Note that $PFILES consists of two lists separated by a semicolon, ";". The first list contains one or more directory paths for the ScienceTools (and FTOOLS) to use for writing and preferentially for reading parameter files, while the second list specifies directories to be used as read-only reference, as needed.

Example 1 — Explicitly set $PFILES to a unique writable space (in conjunction with the SCons ScienceTools-09-15-05 build):

PFILES=/scratch/<uniqueIdentifier>/pfiles;/nfs/farm/g/glast/u35/ReleaseManagerBuild/redhat3-i686-32bit-gcc32/Optimized/ScienceTools/09-15-05/burstFit/pfiles

Note: This environment variable tells the ScienceTools (and FTOOLS) to use the first path in the list for writing and preferentially for reading, but use the second as a read-only reference, as needed. (Note the semi-colon between the r/w and read-only path elements.)
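Because the semicolon is also a command separator in the shell, quote the value when setting $PFILES in bash; a minimal sketch using the paths from Example 1 (the unique identifier is an assumption):

$ mkdir -p /scratch/<uniqueIdentifier>/pfiles    # create the writable pfiles directory first
$ export PFILES="/scratch/<uniqueIdentifier>/pfiles;/nfs/farm/g/glast/u35/ReleaseManagerBuild/redhat3-i686-32bit-gcc32/Optimized/ScienceTools/09-15-05/burstFit/pfiles"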

Example 2 — Set the home directory to the current directory:

    1. Move to a unique directory:

       mkdir uniqueDirectory
       cd uniqueDirectory

    2. Set the home directory to the current working directory:

       HOME=$PWD

    3. Run your environment setup script for the version of the ScienceTools (and/or Ftools) that you wish to use.
  • Cleanup. Be sure to perform a cleanup on /scratch after your jobs have completed!

Python Script. For an example of setup and cleanup routines from a python script, each with a unique environment variable, see new.py.txt and note the two FILE.write blocks. The first block creates the unique environment variables for each job, and the second block cleans up the scratch directory after the batch jobs have completed.

Commands You Need to Know

When using the batch farm, you need to know:

  • bsub: submit a job
  • bjobs (-l): get a summary of running jobs (-l gets a longer summary)
  • bkill: kill an errant job
  • bqueues: get info on the queues

Logging into a noric system is your first step. Jobs submitted from these machines will automatically go to the batch farm.

An easy way to operate is to use the GLAST nfs user space:

/afs/slac/g/glast/users/

The batch machines can access this space.

Submitting a job

The syntax is simple:

bsub -q [queueName] <command>  
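For example, to run a hypothetical analysis script in the 'medium' queue with a readable job name (the -J option is optional and simply labels the job in bjobs output; the script name is an assumption):

bsub -q medium -J myAnalysis ./myAnalysis.sh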

Monitoring a Job

bjobs [-l]

Provides a summary of running jobs.

Tip: It also gives you a batch id for the job, which you can use with bkill.

bpeek <job-id>

Provides a 'peek' at the log file for a running batch job.

Tip: There are 'man' pages for all the batch commands, e.g., 'man bsub'. The batch system is formally known as LSF (Load Sharing Facility), and a brief overview of the system can be read with 'man lsfintro'.

Other useful batch monitoring commands include:

  • bqueues: summary status of the entire batch system, organized by queue
  • bqueues -l long: detailed summary status of the 'long' queue
  • lshosts: very long listing of all batch machines along with their resources
  • bmod: change the queue for a submitted job
  • bhist: get history information for completed jobs
  • lsinfo: list all 'resources' defined in the batch system
  • busers: summary of my batch activity

Error Codes on exit

Exit codes from batch jobs should be interpreted similarly to those from any other
unix program. Codes:

  • 1-128 are exit codes generated by the job itself.
  • 129-255 usually indicate that a signal was received and the value is the signal+128.

For example, an exit code of 131 means the job received signal 3 (= 131-128) which is SIGQUIT.

Note: A brief summary of the signal codes can be seen with the command: kill -l
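As a quick sketch, the same arithmetic can be done in bash (the example exit code of 131 is just for illustration):

code=131                                             # exit code reported for the batch job
if [ $code -gt 128 ]; then
    sig=$((code - 128))
    echo "killed by signal $sig ($(kill -l $sig))"   # prints: killed by signal 3 (QUIT)
else
    echo "exited with status $code"
fi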

Queue Information

bqueues (-l)

The standard queues we can submit to are:

  • short: 2 minutes
  • medium: 15 minutes
  • long: 2 hours
  • xlong: 16 hours
  • xxl: 130 hours

Note: Times are specified for fell-class machines. (See Queue Limits and CPU Factors for more information.)

Killing a Bad Job

bkill <id>

Assuming a batch id of 217304, the command to kill a bad job would be:

bkill 217304

Note: An id = 0 is a wildcard. To cancel all of your jobs, enter: bkill 0
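In LSF, bkill also accepts a -q option to restrict the wildcard to a single queue; for example, to cancel all of your jobs in the 'long' queue:

bkill -q long 0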

Queue Limits and CPU Factors

(As of February 2, 2010)
QUEUE    MAX    JL/U   JL/H   CPULIMIT   don    cob    yili   boer   fell   hequ
short    -      -      -      21m        4m     2m     2m     2m     2m     1.4m
medium   -      1000   -      168m       31m    22m    20m    17m    15m    12m
long     -      1000   -      22.3h      4.1h   2.9h   158m   134m   121m   92m
xlong    2800   -      4      177.6h     33h    23h    21h    18h    16h    12h
xxl      800    400    2      1428h      261h   187h   169h   142h   130h   98h

CPULIMIT is in "SLAC time" = wall-clock time * CPU_Factor
 
CPU Factors:

MODEL_NAME   CPU_FACTOR   Machine Names
RS6k-370     0.19         [morgan]
Ultra5       0.46         [pinto]
UT1_440      1.00         [bronco]
VA_867       2.11         [barb]
PC_2660      2.81         [tori, orlov]
PC_1400      3.36         [noma, morab]
G5_2000      4.82         fuji (MacOS)
AMD_1800     5.47         don, noric
AMD_2000     7.65         cob, coma
AMD_2200     8.46         yili, sdc, noric
AMD-2600     10.00        boer, bali, orange, sdc
INTEL_2660   11.00        fell, simes
INTEL_3000   12.00        sdc
INTEL_2930   14.58        hequ

[ ] = no longer in service
For more information about the various machine types, see: Public Machines at SLAC.
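As a quick consistency check of the conversion, dividing the 'long' queue CPULIMIT of 22.3 hours of "SLAC time" by the fell-class CPU factor of 11.00 recovers the roughly 121 minutes shown in the fell column of the queue table:

$ echo "scale=1; 22.3 * 60 / 11.00" | bc    # ≈ 121.6 minutes on a fell-class machine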

 


Owned by: Tom Glanzman
Last updated by: Chuck Patterson 01/31/2011