
SJM - A Simple Job Manager

Overview

1. New
2. Introduction
3. Download
4. Quick Start
5. Example
6. Reference
7. Miscellaneous

New

SJM version V01.11 contains some bug fixes and enhancements:

  • sjm sprited was arguably broken. Though it did manage jobs as intended, it did not run well as a daemon process. Thanks to Jan Strube for bug report and testing.
  • V01.11 adds better support for projects that are shared between several users in the same group. Thanks to Chih-hsiang Cheng for the suggestion.
  • Head and tail of the log files are now checked for the JobFinishedString/JobSuccessfulString. Thanks again to Jan Strube for pointing that out.

SJM version V1.09 corrects a bug when printing out a warning message for jobs with log files that have not been updated for more than 1 hour (3600 sec.). Thanks to Olga Igonkina for reporting this bug.

SJM version V1.08 corrects a problem where SJMPrepareJobs crashes when the configuration file contains 'empty' lines with spaces or options that can not be identified. Otherwise V1.08 is identical with V1.07.

SJM version V1.07 includes means for automatic job monitoring and submission and the possibility to run jobs in gdb. See below for more information on these.

In addition V1.07 has improved job validation, where it now uses the same procedure as FjrCheckJob.

V1.07 is fully backward compatible with V1.0, i.e. you only need to replace the old files in your workdir with the contents of the tar file.
SJM V1.07 is compatible with python 2.2.3 and higher.

Some parts of this document still need to be updated. That doesn't mean that they are wrong, but rather that some new files or features might not be included in all parts of this document.

Introduction

The 'Simple Job Manager' (SJM) is a simplistic framework to manage jobs running in the BABAR framework. It is written in OO-Python, which should make it readable for you hackers out there, and it is fairly lightweight with only about 1000 lines of code (compared to the Task Manager which has more than 30,000 lines). So what does it do?

Download

SJM can be downloaded here as a tar file. 'cd' into your workdir and gunzip/tar/gtar whatever you feel like.

Quick Start

This assumes that you have a working framework application and that you run your application from the workdir in the test release (no fancy stuff here). The steps required to configure SJM are then to provide a short configuration file and a tcl snippet template file.

  1. Download the SJM tar file, e.g. SJM-V01-11.tar, and untar it. The tar file contains the following files:
    roethel@noric04> ls
    SJMConfigFile.txt  SJMTestSnippet.tcl  sjm
      
    Then copy sjm into a location where it can be found, e.g. ~/bin.
  2. Edit the example configuration file SJMConfigFile.txt to fit your analysis.
  3. Edit the example tcl snippet template SJMTestSnippet.tcl to fit your analysis (and preferably rename it).
  4. Set up the SJM directory tree and create jobs by running sjm prepare SJMConfigFile.txt in your workdir. This should create a subdirectory with the name of the SJM Project you defined in step 2.
  5. If everything worked, you should see a list of tcl files created by BbkDatasetTcl in the subdirectory <SJM Project Name>/tcl and a list of tcl snippets in <SJM Project Name>/prepared. You can check the existence of the jobs by running sjm show <SJM Project Name>
  6. Submit jobs with sjm submit --njobs 2 <SJM Project Name> (don't forget srtpath and condxxboot).
  7. When the jobs have completed (you can check that with sjm show again or by running bjobs) you can check them with sjm check <SJM Project Name>

A Simple Example

This example still uses the old command structure when SJM consisted of a set of separate executables and the main library SJMBase.py. The example is still valid, you just need to replace the old commands with the new ones, e.g. use sjm prepare instead of SJMPrepareJobs.

In the following I will show a simple test case for using SJM. You can't run this example line-by-line (since you can't write to my scratch space - I hope). But it is fairly straightforward to adapt this example for your own analysis.

I prepared a test release for a simple two photon analysis in ~roethel/analysis/analysis-21. My executable is called BetaMiniApp and the main tcl file that drives the analysis is in BetaMiniUser/GamGamTo4pi.tcl. I'm not skimming, but just writing out ntuples, which should all be written to /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/ntuple/ntuple_<ID>.root in scratch space, where <ID> should be the job-id (or job number) of the current job. The run directories (the directories the log files and jobreport files are written to) should be /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/<ID>. The input dataset for this analysis should be
users-phnic-TwoPhotonPentaquarkSkim-BlackDiamond-Run1,
and I only want to run (or better 'can run') 250,000 events per job.

Edit the Configuration File:

With that I can edit the configuration file... ok - the example configuration file is already set up for this example, what a coincidence. However, I want to change the name of this SJM Project to SJMPentaquarkRun1, so I edit the line

# define the name of the SJM 
SJMName = SJMPentaquarkRun1 
  
The whole configuration file looks like this now. So on with the next step.

Edit the Tcl Snippet Template:

First, what is a tcl snippet template? The main tcl file - in my example GamGamTo4pi.tcl - provides the general configuration that should be used by all analysis jobs I want to run in this context. However, I do have parameters like the names of my ntuples or names of my input collections that are different for every job, and I need to pass them on to my general tcl file. It used to be common to define these parameters via environment variables in the current unix shell, which is not a very good idea (and I don't want to go into this here). The better solution is to provide a short, job specific tcl file, which only defines parameters (as tcl variables) particular to the current job and then itself sources the main tcl file, which properly sets up the framework using these parameters.

In the context of a configurable job manager there is a little complication though, since it is not possible to anticipate what anyone would want to define in a tcl snippet. To get around this problem SJM (like the Task Manager) uses a tcl template and a set of 'tags' that act as placeholders for job specific information (for a list of tags see below). A user now can define any parameter in the tcl snippet template using these tags. When the jobs are created the tags get resolved and the actual tcl snippets are written out. If this is not totally clear yet, just follow the example and you will see how the job specific tcl snippets are created from the tcl snippet template.

The tcl snippet template in this example, as defined in the configuration file, is called SJMTestSnippet.tcl. I don't need to change anything in this tcl snippet template as it is already configured for this example. I would like to point out the definition of rootName (the FwkCfgVar that defines the ntuple to be written out) and the last line that sources the main tcl file. For more details on tcl snippets see below. Also note that GamGamTo4pi.tcl uses these FwkCfgVars to set up the framework job, in particular the jobreport file and the ntuple name (there is really not much use defining variables if they are not used later on).

See here for the tcl snippet template file and here for the GamGamTo4pi.tcl file.

Create the Jobs:

roethel@noric04> SJMPrepareJobs SJMConfigFile.txt
Running BbkDatasetTcl --tcl 250000 --basename SJMPentaquarkRun1 \
--splitruns users-phnic-TwoPhotonPentaquarkSkim-BlackDiamond-Run1 ...
BbkDatasetTcl: wrote SJMPentaquarkRun1-1.tcl (250000 events)
...
BbkDatasetTcl: wrote SJMPentaquarkRun1-166.tcl (151236 events)
Selected 11 collections, 41401236/0 events, ~0.0/pb
done. Creating tcl snippets in directory 'prepared' now...
done!

Running this command created a subdirectory called SJMPentaquarkRun1 in my workdir. Looking at this directory you can find six subdirectories, one for each job state (prepared, submitted, done, ok, failed) and one storing the tcl files defining the input collections that were created by BbkDatasetTcl. Listing these directories you find

roethel@noric04> ls SJMPentaquarkRun1/tcl
SJMPentaquarkRun1-1.tcl    SJMPentaquarkRun1-15.tcl   SJMPentaquarkRun1-50.tcl
SJMPentaquarkRun1-10.tcl   SJMPentaquarkRun1-150.tcl  SJMPentaquarkRun1-51.tcl
SJMPentaquarkRun1-100.tcl  SJMPentaquarkRun1-151.tcl  SJMPentaquarkRun1-52.tcl
...
roethel@noric04> ls SJMPentaquarkRun1/prepared
SJMPentaquarkRun1-0001.tcl  SJMPentaquarkRun1-0084.tcl
SJMPentaquarkRun1-0002.tcl  SJMPentaquarkRun1-0085.tcl
...

If you remember the discussion on tcl snippet templates, you may want to compare the resolved tcl snippet for e.g. the first job SJMPentaquarkRun1-0001.tcl with the tcl template file. Before continuing you may want to check if your snippets in the 'prepared' directory look ok. If not, fix the snippet template and/or the configuration file and try again (delete the subdirectory tree to remove the existing configuration - see below).

You can look at the job statistics with

roethel@noric04> SJMShowJobs SJMPentaquarkRun1

     name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       166         0         0         0         0

If you messed up when creating jobs, you could simply remove the subdirectory tree SJMPentaquarkRun1 (e.g. with rm -rf SJMPentaquarkRun1) and start over again. Now let's run some jobs...

Submitting Jobs

Let's test our configuration by submitting 2 jobs:

roethel@noric04> SJMSubmitJobs --njobs 2 SJMPentaquarkRun1
submitting jobs
Submitting job 1
Job <150864> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/SJMPentaquarkRun1.log \
 /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/wrapper-1.sh

Submitting job 2
Job <150865> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/SJMPentaquarkRun1.log \
/afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/wrapper-2.sh

Submitted 2 job(s).

While preparing this example something interesting happened - the jobs crashed with (taken from the log file):

  ...
  BetaMiniApp: error while loading shared libraries: libCore_pkgid_3.10-01.so: cannot open shared object file: No such file or directory
  ...

and SJMShowJobs reported

roethel@noric04> SJMShowJobs SJMPentaquarkRun1
Job 1: Log file assumed done, job report file not found!
Job 2: Log file assumed done, job report file not found!

     name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       164         0         2         0         0

First, the message indicates that the log file satisfies the 'done' conditions but the jobreport file for the job was not found (which is never a good sign). I fix this problem by submitting from a RH7.2 noric and want to resubmit the jobs, i.e. I need to move the jobs from the 'done' state back to the 'prepared' state. To do that I simply move the tcl snippet files for these jobs from the SJMPentaquarkRun1/done directory to the prepared directory:

roethel@noric04> ls SJMPentaquarkRun1/done
SJMPentaquarkRun1-0001.tcl  SJMPentaquarkRun1-0002.tcl
roethel@noric04> mv SJMPentaquarkRun1/done/*.tcl SJMPentaquarkRun1/prepared/.
roethel@noric04> SJMShowJobs SJMPentaquarkRun1

     name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       166         0         0         0         0
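
The manual move shown above can also be scripted. The following is a minimal sketch in current Python 3, not part of SJM: the function name requeue_jobs is hypothetical, and it only assumes the state directories (done, prepared, ...) described in this example.

```python
import glob
import os
import shutil


def requeue_jobs(project_dir, from_state="done"):
    """Move all tcl snippets from one job state directory back to 'prepared'.

    Hypothetical helper, not part of SJM. Files are moved, never copied,
    so each job still exists in exactly one state directory.
    """
    source = os.path.join(project_dir, from_state)
    target = os.path.join(project_dir, "prepared")
    moved = []
    for snippet in sorted(glob.glob(os.path.join(source, "*.tcl"))):
        name = os.path.basename(snippet)
        shutil.move(snippet, os.path.join(target, name))
        moved.append(name)
    return moved
```

Calling requeue_jobs("SJMPentaquarkRun1") would do the same as the mv command above; passing from_state="failed" would requeue failed jobs instead.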

We're ready to resubmit the jobs now:

roethel@noric14>  SJMSubmitJobs --njobs 2 SJMPentaquarkRun1
submitting jobs
Submitting job 1
Run directory /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1 exists 
already. Cleaning up
Job <152456> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/SJMPentaquarkRun1.log \
 /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/wrapper-1.sh

Submitting job 2
Run directory /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2 exists 
already. Cleaning up
Job <152458> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/SJMPentaquarkRun1.log \
/afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/wrapper-2.sh

Submitted 2 job(s).

The warnings indicate that the run directories for the two jobs in question already exist, since I submitted these jobs before. The directories will be cleaned up so the output does not conflict. We can check if the jobs are really running:

roethel@noric04> SJMShowJobs SJMPentaquarkRun1

     name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       164         2         0         0         0

And waiting a little more...

roethel@noric04> SJMShowJobs SJMPentaquarkRun1

     name            prepared submitted    done     failed      ok
---------------------------------------------------------------------------
 SJMPentaquarkRun1       164         0         2         0         0

We can now check the success of these jobs:

roethel@noric04> SJMCheckJobs SJMPentaquarkRun1
Checking jobs
...updating job status
...checking
job 1 ok.
job 2 ok.
Checked 2 jobs. Ok: 2   failed: 0

All fine for me - I hope for you as well... have fun!

NEW: In addition to the way of running jobs just described, SJM V1.06 and later supports running jobs in gdb (this only works where gdb is installed on the batch machines). To use this option run
> SJMSubmitJobs -g <SJMName>.

Using the Job Monitor

From V1.02 on, SJM comes with the script/daemon sjm sprited (was SJMSprited) to automatically take on the management of jobs. This includes keeping a constant number of jobs in the queue, checking jobs that are done and, if requested, sending an email in case of problems. As mentioned, the script is designed to run as a daemon, i.e. it will continue to run even after you log off, but it has mainly been tested running in a terminal window. The configuration file takes the following parameters to configure sjm sprited:
  • SpriteMaxJobs : The maximum number of jobs to keep in the queue.
  • SpriteSleepTime : The time between checks on the processing status.
  • SpriteEmailNotify : If set the email address to send notification to.
  • SpriteCheckAfsToken : A simple check that sends notification and terminates in cases when no valid afs token exists.

Running automated job monitoring in the background adds a non-trivial complication to the simple job manager. The main (or rather the only) issue is the possibility that two commands attempt to do the same thing at the same time, e.g. sjm sprited is running sjm check in the background while you are running the same command from the command line. That can lead to race conditions with unpredictable results. The damage is pretty limited, since after all the bookkeeping is done by moving files within a unix file system. This is very safe and takes care of most of the possible race conditions, which boil down to two processes trying to do things with the same file. However, you may see unusual error messages from SJM because an expected file all of a sudden does not exist.

To avoid that, a sophisticated lock mechanism was introduced, which prohibits two critical processes from running at the same time. Ok, ok - well, the sophisticated lock mechanism is simply a file called 'lock.pid', which resides in the SJM project directory and which contains the process id of the process that owns the lock. The lock should only be set when a process is attempting to move files and update job status, e.g. when running sjm submit, sjm show (with updating job status) and sjm check.

Sometimes it can happen that a process did not remove the lock, either because it was killed before it finished (better not do that) or because the sjm sprited daemon process died, or... If a lock persists for a long time you should probably check the process id in the lock file and see if that process is still alive (you can do that by logging on to the machine the process is running on and using > ps -p <process id>). If not, it is safe to remove the lock file and proceed (i.e. > rm lock.pid in the SJM project directory). In addition it is also advisable to check for a lock file when moving files from the prepared, submitted or done states (directories) to other states. However, in practice one will move files from failed or possibly ok to prepared, which is always safe.
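
The manual check of a stale lock can be sketched in Python as follows. This is an illustration only, not SJM code: the function name lock_owner_alive is hypothetical, and it assumes that lock.pid contains nothing but the integer process id and that the check runs on the same machine as the process that owns the lock.

```python
import os


def lock_owner_alive(project_dir):
    """Return None if there is no lock, else whether the lock owner is alive.

    Sketch only: assumes lock.pid holds a single integer process id and that
    the owning process runs on the machine where this check is executed.
    """
    lock_path = os.path.join(project_dir, "lock.pid")
    if not os.path.exists(lock_path):
        return None
    with open(lock_path) as handle:
        pid = int(handle.read().strip())
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False     # no such process: the lock is stale
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

A result of False corresponds to the case in which ps -p <process id> shows no process, i.e. the case where removing the lock file is described as safe.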

sjm sprited maintains a log file which contains besides log information also the output of the various job submit and check operations. The output however is not flushed, so the order can be somewhat confusing. To start the daemon you just need to run
> sjm sprite --start <SJMName>
For further options see > sjm sprite -h. To just run sjm sprited in a terminal window (in which case the output is flushed and easier to follow) run
> sjm sprited <SJMName>

Finally - when all jobs have been submitted and checked sjm sprited will terminate by itself and optionally send an email notification.


Reference

SJM is the Task Manager with every feature removed that is not absolutely essential. The result was small enough to be written in two days and still do the work. The main idea behind SJM is that the tcl snippet for each job contains enough information to run a job and do some essential bookkeeping on it. The bookkeeping itself is managed over the particular directory structure in SJM.

SJM File- and Directory Structure and Bookkeeping

As mentioned the bookkeeping is managed over the directory structure and file names in SJM. The only parameter required to identify a job and resolve all its associated files and directories (for a given SJM project name) is the job id, which is stored over the tcl snippet name and input tcl file name convention

  <SJM project name>-<job id>.tcl
  

(It is not a good idea to choose an SJM name that itself uses a '-<some number>' pattern, since this may conflict with the job id extraction.) The other files SJM relies on (the log file, the job report file and the wrapper script) are all located in the run directory, whose name is made up of the job id and (possibly) the SJM project name and is defined by the user in the configuration file.
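
To illustrate the naming convention, here is a small Python sketch that splits such a file name into project name and job id. How SJM itself extracts the job id is not shown in this document; the regular expression and the function name parse_snippet_name are assumptions made for this example.

```python
import re

# Matches '<SJM project name>-<job id>.tcl'. The job id is taken to be the
# last '-<digits>' group before the extension (an assumption of this sketch).
SNIPPET_PATTERN = re.compile(r"^(?P<name>.+)-(?P<job_id>\d+)\.tcl$")


def parse_snippet_name(filename):
    """Split a snippet or input tcl file name into (project name, job id)."""
    match = SNIPPET_PATTERN.match(filename)
    if match is None:
        raise ValueError("not an SJM tcl file name: %r" % filename)
    return match.group("name"), int(match.group("job_id"))
```

Both the zero-padded snippet name SJMPentaquarkRun1-0001.tcl and the input file name SJMPentaquarkRun1-1.tcl resolve to job id 1 with this pattern. A parser that splits at the first '-<number>' instead would misread a project name that contains such a pattern, which is the conflict the warning above refers to.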

The current job state is defined by the directory the snippet file is located in. At the beginning all snippet files are in the prepared directory. The command SJMShowJobs just counts the number of tcl files in any of these directories and displays the count. There is no other hidden behind-the-scenes bookkeeping. So just for fun you could move a snippet file from the prepared directory to any other job state directory and see how the output of SJMShowJobs changes (don't forget to move the file back again... and please use mv, don't cp the files!!!). As mentioned above, if you mess up (or don't like your setup) just delete the SJM directory structure and start over again.
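
Since the job state is nothing but the location of the snippet file, the counting can be sketched in a few lines of Python. This is an illustration of the idea, not the SJM source; the function name count_jobs is hypothetical and the state names are the directory names listed in the example above.

```python
import glob
import os

JOB_STATES = ("prepared", "submitted", "done", "failed", "ok")


def count_jobs(project_dir):
    """Count the tcl snippet files in each job state directory."""
    counts = {}
    for state in JOB_STATES:
        pattern = os.path.join(project_dir, state, "*.tcl")
        counts[state] = len(glob.glob(pattern))
    return counts
```

Moving a snippet file from one state directory to another changes two of the counts by one, which is exactly the effect described above.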

The Run Directory

Every job managed by SJM has its own run directory. This may seem a bit inconvenient, but it simplifies the management of jobs and makes it more flexible. Instead of having to keep track of different files individually, the only variable is the run directory itself and all other files (currently the log file, job report file and the wrapper script used to submit a job) can be identified using that.

The uniqueness of the run directory also requires one more thing - the user has to make sure that he/she defines a unique run directory when configuring a SJM project. The simplest way to do that is to make sure the <ID> tag is part of the run directory (The job creation should fail if this is not satisfied!).
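
The uniqueness requirement can be illustrated with a short Python sketch. It is not taken from SJM; the function names are hypothetical, and the only rule it encodes is the one stated above, namely that the <ID> tag has to be part of the run directory setting.

```python
def check_run_directory(run_directory_template):
    """Raise ValueError if the RunDirectory setting lacks the <ID> tag."""
    if "<ID>" not in run_directory_template:
        raise ValueError("RunDirectory must contain the <ID> tag")
    return run_directory_template


def resolve_run_directory(run_directory_template, job_id):
    """Return the run directory of one job by filling in its job id."""
    template = check_run_directory(run_directory_template)
    return template.replace("<ID>", str(job_id))
```

With a setting that ends in Pentaquark-Run1/<ID>, jobs 1 and 2 get the directories Pentaquark-Run1/1 and Pentaquark-Run1/2, as seen in the example output earlier on this page.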

The Configuration File

The parameters defined in the configuration file are:
  • SJMName - The name of the SJM project. In the Task Manager we would call this a task, but this is not the Task Manager. Make sure there is no subdirectory in workdir with that name, since SJM will create this subdirectory.
  • DatasetName - The name of the dataset that the jobs should run over.
  • MaxEvents - The max. number of events run per job. SJM runs BbkDatasetTcl internally to create tcl files, that define the input collections. This parameter specifies the value passed on to the --tcl option.
  • BbkDatasetTclRaw - Some 'raw' options to be passed on to BbkDatasetTcl. SJM provides the --tcl, --splitruns and --basename option. Further options may be provided by setting this string, however there is no guarantee that the command will work as it is supposed to. You'll just have to try.
  • TclSnippet - The name of the tcl snippet template file.
  • RunDirectory - The name of the run directories. In SJM every job gets its own run directory (typically in scratch space) which contains the log file and the jobreport file. Users may add further files to this directory by using the <RUNDIR> tag in the tcl snippet template.
  • Executable - the name of the framework executable that should be run (like BetaMiniApp). Just list the executable, do not include the tcl file.
  • BatchCommand - The command used to submit jobs to the job scheduler. Uses the tags <LOG> and <WRAPPER>, which will be replaced by the job specific log file and a wrapper shell script at the time the job is submitted. For running at SLAC with LSF this can be left as it is.
  • JobFinishedString - A string searched for in the last 200 lines, that - if found - indicates that the job has finished/exited the batch system. This does not need to be set at SLAC (LSF) where it defaults to 'Resource usage summary'. This may need to be set at sites using PBS.
  • JobSuccessfulString - A string indicating that the job was run successfully in the batch queue (exit code 0), similar to JobFinishedString described above. LSF does not show the exit code directly, but instead uses the string 'Successfully completed' to indicate an exit code of 0. Again, this may have to be customized to meet the requirements at other sites.
  • Share - Set to 1 if the SJM project should be shared with others. This will make the project and run directories group writable.
  • umask - Similar to Share, but allows custom values to be set for the file mask.
  • SpriteMaxJobs - The maximum number of jobs to keep in the queue.
  • SpriteSleepTime - The time between checks on the processing status.
  • SpriteEmailNotify - If set the email address to send notification to.
  • SpriteCheckAfsToken - A simple check that sends notification and terminates in cases when no valid afs token exists.

The syntax used in the configuration file is of the form parameter = text, where parameter is not allowed to contain space characters (leading and trailing spaces will be removed). Text is provided as raw text, i.e. all spaces besides leading and trailing spaces will be preserved, so no quotes are necessary(!). Comments can be added by using the hash character '#' as the first character(!) in a line.
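
The syntax rules above are simple enough to sketch a parser in Python. This is not the SJM parser; the function name parse_config is hypothetical, and the choice to silently skip lines without an '=' is an assumption of this sketch.

```python
def parse_config(text):
    """Parse 'parameter = text' lines into a dict, following the rules above."""
    settings = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # comment (hash as first character) or empty line
        if "=" not in line:
            continue  # unidentifiable line: skipped here (assumption)
        parameter, value = line.split("=", 1)
        # Leading and trailing spaces are removed, inner spaces are preserved.
        settings[parameter.strip()] = value.strip()
    return settings
```

Because the line is split only at the first '=', and only the outer spaces are stripped, a value such as a complete batch command with several options is kept as raw text without any quoting.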

The Tcl Snippet Template File

The tcl snippet file and the use of tags were introduced in the example. Though the basic contents of the tcl snippet template is up to the user SJM requires the following lines:


sourceFoundFile <INPUTTCL>
set jobReportName <JOBREPORT>

SJM could automatically add these lines to the snippet file, but adding things behind the scenes that may interfere with other user defined settings may be more confusing than requiring certain values to be set. You also need to make sure your main tcl file contains the appropriate line to write out the jobreport file (see below).

Valid tags that can be used in the tcl snippet template are:

  • <ID> - The job id of the current job.
  • <ID100> - The job id divided by 100 (int(job id)/100). This may be useful if you have a very large number of jobs and want to distribute the run directories over several subdirectories in the run directory tree.
  • <INPUTTCL> - The name of the tcl file created by BbkDatasetTcl, which defines the input collections. This tag is basically only used to source the input tcl file. Ah, yes, don't forget to source <INPUTTCL> or your jobs will have no defined input.
  • <JOBREPORT> - The (predefined) name of the job report file. Do not give the job report file any name other than <JOBREPORT>, since SJM uses the job report file to determine the success or failure of a job. You may however choose a different name for the tcl variable (in this example jobReportName).
  • <RUNDIR> - The name of the run directory for the current job. This tag may be used if other files (e.g. ntuples) should be written to the run directory.
  • <NAME> - The name of this SJM project as defined by SJMName in the configuration file.
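
How the tags turn a template into a job specific snippet can be sketched in Python as plain string replacement. This is an illustration, not the SJM implementation; the function names resolve_tags and tag_values are hypothetical, and the actual values (in particular the job report file name) are chosen by SJM and not shown in this document.

```python
def resolve_tags(template, values):
    """Replace each '<TAG>' placeholder in the template with its value."""
    resolved = template
    for tag, value in values.items():
        resolved = resolved.replace("<%s>" % tag, str(value))
    return resolved


def tag_values(name, job_id, run_dir, input_tcl, job_report):
    """Collect the job specific values for the tags listed above."""
    return {
        "ID": job_id,
        "ID100": job_id // 100,  # integer division, as in int(job id)/100
        "INPUTTCL": input_tcl,
        "JOBREPORT": job_report,
        "RUNDIR": run_dir,
        "NAME": name,
    }
```

A template line such as set rootName <RUNDIR>/ntuple_<ID>.root (an illustrative line, not copied from SJMTestSnippet.tcl) would come out with the run directory and job id of the current job filled in, while text without tags is written to the snippet unchanged.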

How to use FwkCfgVars is explained elsewhere (I don't know where) but essentially do the equivalent of the following in your main tcl file:

   FwkCfgVar jobReportName
   FwkCfgVar rootName
   ...
   jobReport filename $jobReportName
  

The Commands

With the two new additions there are now six commands altogether. These are simple commands and don't take a lot of command line options, but they all, with the exception of SJMSprited, do have a basic -h, --help option to remind you of all the options that they (don't) have.

sjm prepare (was SJMPrepareJobs)

sjm prepare actually does three different things:

First it reads in the configuration file, verifies some entries and creates the SJM directory structure in the current workdir. It also copies the configuration file to the subdirectory where it serves as the main configuration file for all the other SJM commands. For reference a copy of the tcl snippet is also stored in the SJM directory.

The second thing sjm prepare does is to run BbkDatasetTcl to create a list of input tcl files in the tcl subdirectory.

Finally sjm prepare reads in the list of input tcl files and creates the tcl snippets for every input tcl file in the prepared subdirectory.

sjm show (was SJMShowJobs)

sjm show basically just counts tcl snippet files in the different job status subdirectories. However, before doing so it checks if jobs listed in the submitted directory have finished. Whether a job is still running or is assumed to be finished is determined by checking the last 200 lines of the log file for a given string that signals that the job has completed. The default is to look for 'Resource usage summary', but this can be overridden by defining JobFinishedString in the configuration file (e.g. for running at sites using PBS). In addition, the existence of the job report file and the stop time written in the job report file are also checked, and a warning is printed if the log file indicates a finished job but the job report file does not mirror this.

sjm submit (was SJMSubmitJobs)

sjm submit submits jobs. It first creates the run directory (and cleans up old run directories if these happen to exist already) and then creates a small shell wrapper script in that directory. The wrapper script, which is necessary to provide compatibility with other job schedulers like PBS, is then submitted to the batch queue using the command defined in the configuration file.
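
The submit steps can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the SJM code: the function names are hypothetical, and the contents of the wrapper script are guessed (only the file name pattern wrapper-<ID>.sh, the <LOG> and <WRAPPER> tags and the use of 'exec' are taken from this page). Cleaning up an existing run directory is left out.

```python
import os
import shlex
import stat


def write_wrapper(run_dir, job_id, executable, snippet_path):
    """Create the run directory and a wrapper script that execs the executable."""
    os.makedirs(run_dir, exist_ok=True)
    wrapper_path = os.path.join(run_dir, "wrapper-%d.sh" % job_id)
    with open(wrapper_path, "w") as handle:
        handle.write("#!/bin/sh\n")
        # Assumption of this sketch: the job runs inside its run directory.
        handle.write("cd %s\n" % shlex.quote(run_dir))
        # 'exec' replaces the shell, so the exit code seen by the batch
        # system is the exit code of the framework executable itself.
        handle.write("exec %s %s\n" % (shlex.quote(executable),
                                       shlex.quote(snippet_path)))
    mode = os.stat(wrapper_path).st_mode
    os.chmod(wrapper_path, mode | stat.S_IXUSR)
    return wrapper_path


def build_batch_command(batch_command, log_path, wrapper_path):
    """Fill the <LOG> and <WRAPPER> tags of the BatchCommand setting."""
    command = batch_command.replace("<LOG>", log_path)
    return command.replace("<WRAPPER>", wrapper_path)
```

Keeping the scheduler specific part in the BatchCommand string and everything else in the wrapper is what makes it possible to switch from LSF to another scheduler by editing the configuration file only.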

sjm check (was SJMCheckJobs)

This command finally checks jobs that are found to be done. Similar to identifying completed jobs, jobs are marked ok if the exit code of the job running in the queue was found to be 0 (note that the shell wrapper 'exec's the framework executable for this purpose instead of running it as a subprocess!). In LSF this can be done by parsing the first and last 200 lines of the log file for the string 'Successfully completed', which is the default in SJM. This default can be overridden by defining JobSuccessfulString in the configuration file. In addition, the job report file must exist and must contain the stop time.
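
The log file part of this check can be sketched in Python. The sketch is not the SJM code: the function names are hypothetical, it reads the whole log into memory for simplicity, and it leaves out the job report file check because the format of that file is not described here. The default strings are the LSF defaults named on this page.

```python
def log_contains(log_path, needle, window=200):
    """Return True if needle occurs in the first or last `window` lines."""
    with open(log_path, errors="replace") as handle:
        lines = handle.readlines()
    candidates = lines[:window] + lines[-window:]
    return any(needle in line for line in candidates)


def job_finished(log_path, finished_string="Resource usage summary"):
    """The batch system has written its summary, so the job has exited."""
    return log_contains(log_path, finished_string)


def job_successful(log_path, successful_string="Successfully completed"):
    """The batch system reported exit code 0 for the job."""
    return log_contains(log_path, successful_string)
```

Restricting the search to the head and tail of the log avoids false matches in the middle of the job output, for example when the application itself happens to print one of these strings.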

sjm sprite (was SJMSprite)

sjm sprite is just used to start and stop the job monitoring script sjm sprited when it is run as a daemon. To keep the daemon running in an afs environment, e.g. at SLAC, you need to run klog -setpag to ensure that you have a valid token after closing the terminal window.

sjm sprited (was SJMSprited)

sjm sprited is the job monitoring daemon - when run as daemon. But it can be run just as well in a terminal window. There are no command line options for this command, except for the SJMName itself. For the most part sjm sprited sleeps (the default sleep time is 20 minutes). When it awakes it first updates the job status, then determines how many jobs are currently in the queue (it uses SJM's own bookkeeping for this and does not rely on a specific batch interface and is therefore also not sensitive to occasional outages of the batch system) and submits the necessary number of jobs to keep the requested number of jobs in the queue. Finally it checks done jobs. When nothing is left to be done sjm sprited exits.
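
The cycle described above can be sketched as a small Python loop. This is one possible shape of such a loop, not the sjm sprited source: the function name run_sprite is hypothetical, and the four operations are passed in as callables so that the sketch makes no assumption about how SJM implements them. The email notification and the afs token check are left out.

```python
import time


def run_sprite(update_status, count_states, submit, check_done,
               max_jobs, sleep_seconds=20 * 60, sleep=time.sleep):
    """Keep up to max_jobs jobs in the queue until nothing is left to do."""
    while True:
        update_status()  # finished jobs go from 'submitted' to 'done'
        counts = count_states()
        # The queue size is taken from the bookkeeping, not from the batch system.
        free_slots = max_jobs - counts["submitted"]
        to_submit = min(free_slots, counts["prepared"])
        if to_submit > 0:
            submit(to_submit)
        check_done()  # done jobs are sorted into 'ok' or 'failed'
        counts = count_states()
        if not (counts["prepared"] or counts["submitted"] or counts["done"]):
            return counts  # nothing left to be done
        sleep(sleep_seconds)
```

The default of 20 minutes matches the sleep time mentioned above; the max_jobs argument plays the role of SpriteMaxJobs and sleep_seconds that of SpriteSleepTime.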

Running in PBS (or other Job Schedulers)

Since SJM uses a wrapper script to submit jobs, running on PBS (or yet another job scheduler) is not a big problem. However, you have to provide the necessary strings in the configuration file (see JobFinishedString and JobSuccessfulString in the list of configuration parameters above), which SJM looks for in the log file to determine whether a job has finished and has run successfully.

SJM tries to identify idle jobs by checking the time of the last update of a log file. If the update occurred more than an hour before the check, a warning message will be printed, but no further action will be taken. It is up to the users to check the status of the job and (possibly) fix the problem. PBS typically writes log files to a private area and only renames the log file to the defined log file name when the job has finished. Therefore this additional check is not possible (redirecting the output over the shell wrapper is not a good alternative, since this will not capture the report from the job scheduler containing the job exit code).
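
The idle job check boils down to comparing the modification time of the log file with the current time. A minimal Python sketch, with the hypothetical function name log_is_stale and the one hour threshold (3600 sec.) mentioned on this page as default:

```python
import os
import time


def log_is_stale(log_path, max_age_seconds=3600, now=None):
    """Return True if the log file was last modified too long ago."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(log_path)) > max_age_seconds
```

Such a check can only produce a warning: a job that is waiting for input data or is stuck in the queue looks the same from the outside, so deciding what to do is left to the user.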

A short description of how to run at RAL will follow shortly. However, the only things that need to be configured are the batch command and the string in the log file that identifies a job as done. (Actually at RAL this can be anything, because the log file only exists in the globally readable area when the job is done. Job validation is done exclusively using the job report file, not the log file.)

Miscellaneous

What is SJM

SJM was born out of the need to run some analysis while I'm still writing the Task Manager. It should not replace the Task Manager though - if you are looking for a full production type analysis framework which allows e.g. merges, imports of collections to the bookkeeping database/hpss, full bookkeeping of runs etc., SJM is not the tool to use. However, if you just want to run some jobs with a given input dataset, and don't really care about all the additional features, then the heavy-weight Task Manager will not be the ideal tool to use and you are better off with a simple tool such as SJM... I would guess that the majority of analysis jobs will fall into the latter category...

Distribution

SJM is distributed as a tar file and not as a package in cvs. The reason is that a cvs package needs a maintainer and I don't have the time to maintain SJM. If someone volunteers to take over this job, I can unwrap the SJMBase.py file and put SJM in a package.

Bug Reports/Disclaimer

SJM is provided as is. It probably has bugs. You can send me a mail with bug reports and I will try to fix them whenever I have time.

 


Page author(s): Will Roethel
Last significant update: Sep 30, 2005 Expiry date: January 31, 2006