SJM - A Simple Job Manager
Overview
1. New
2. Introduction
3. Download
4. Quick Start
5. Example
6. Reference
7. Miscellaneous
New
SJM version V01.11 contains some bug fixes and enhancements:
- sjm sprited was arguably broken. Though it did manage jobs as
intended, it did not run well as a daemon process. Thanks to Jan Strube for
bug report and testing.
- V01.11 adds better support for projects that are shared between several
users in the same group. Thanks to Chih-hsiang Cheng for the suggestion.
- Head and tail of the log files are checked now for the
JobFinishedString/JobSuccessfulString. Thanks again to Jan Strube for
pointing that out.
SJM version V1.09 corrects a bug when printing out a warning message
for jobs with log files that have not been updated for more than 1 hour
(3600 sec.). Thanks to Olga Igonkina for reporting this bug.
SJM version V1.08 corrects a problem where SJMPrepareJobs crashes when
the configuration file contains 'empty' lines with spaces or options
that cannot be identified. Otherwise V1.08 is identical to V1.07.
SJM version V1.07 includes means
for automatic job monitoring and submission and the possibility to run
jobs in gdb. See below for more information on these.
In addition V1.07 has improved job validation, where it now uses
the same procedure as FjrCheckJob.
V1.07 is fully backward compatible with V1.0, i.e. you only need to
unpack the tar file into your workdir, replacing the old files.
SJM V1.07 is compatible with python 2.2.3 and higher.
Some parts of this document still need to be updated. That doesn't
mean that they are wrong, but rather that some new files or features might
not be included in all parts of this document.
Introduction
The 'Simple Job Manager' (SJM) is a simplistic framework to manage
jobs running in the BABAR framework. It is written in OO-Python, which should
make it readable for you hackers out there, and it is fairly lightweight with
only about 1000 lines of code (compared to the Task Manager which has more than
30,000 lines). So what does it do?
Download
SJM can be downloaded here as a
tar file. 'cd' into your workdir and gunzip/tar/gtar whatever you feel like.
Quick Start
This assumes that you have a working framework application and
that you run your application from the workdir in the test release (no fancy
stuff here). The steps required to configure SJM are then to provide a short
configuration file and a tcl snippet template file.
- Download the SJM tar file, e.g. SJM-V01-11.tar, and untar it. The tar file
contains the following files:
roethel@noric04> ls
SJMConfigFile.txt SJMTestSnippet.tcl sjm
Then copy sjm into a location where it can be found, e.g. ~/bin.
- Edit the example configuration file SJMConfigFile.txt to fit your analysis.
- Edit the example tcl snippet template SJMTestSnippet.tcl to fit your
analysis (and preferably rename it).
- Set up the SJM directory tree and create jobs by running sjm prepare
SJMConfigFile.txt in your workdir. This should create a subdirectory
with the name of the SJM Project you defined in step 2.
- If everything worked, you should see a list of tcl files created by BbkDatasetTcl
in the subdirectory <SJM Project Name>/tcl and a list of tcl
snippets in <SJM Project Name>/prepared. You can check the
existence of the jobs by running sjm show <SJM Project Name>
- Submit jobs with sjm submit --njobs 2 <SJM Project Name>
(don't forget srtpath and condxxboot).
- When the jobs have completed (you can check that with sjm show
again or by running bjobs) you can check them with sjm check <SJM
Project Name>
A Simple Example
This example still uses the old command structure from when SJM consisted of
a set of separate executables and the main library SJMBase.py. The example is
still valid; you just need to replace the old commands with the new ones, e.g.
use sjm prepare instead of SJMPrepareJobs.
In the following I will show a simple test case for using SJM. You can't run
this example line-by-line (since you can't write to my scratch space - I hope).
But it is fairly straightforward to adapt this example for your own analysis.
I prepared a test release for a simple two photon analysis in ~roethel/analysis/analysis-21.
My executable is called BetaMiniApp and the main tcl file that drives the analysis
is in BetaMiniUser/GamGamTo4pi.tcl. I'm not skimming, but just writing out ntuples,
which should all be written to /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/ntuple/ntuple_<ID>.root
in scratch space, where <ID> should be the job-id (or job number)
of the current job. The run directories (the directories the log files and jobreport
files are written to) should be /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/<ID>.
The input dataset for this analysis should be
users-phnic-TwoPhotonPentaquarkSkim-BlackDiamond-Run1, and I only want
to run (or better 'can run') 250,000 events per job.
Edit the Configuration File:
With that I can edit the configuration file... ok - the example configuration
file is already set up for this example, what a coincidence. However I want
to change the name of this SJM Project to SJMPentaquarkRun1, so I edit the line
# define the name of the SJM
SJMName = SJMPentaquarkRun1
The whole configuration file looks like this now.
So on with the next step.
Edit the Tcl Snippet Template:
First, what is a tcl snippet template? The main tcl file - in my example GamGamTo4pi.tcl
- provides the general configuration that should be used by all analysis jobs
I want to run in this context. However, I do have parameters like the names
of my ntuples or names of my input collections that are different for every
job and I need to pass them on to my general tcl file. It used to be common
to define these parameters over environmental variables in the current unix
shell, which is not a very good idea (and I don't want to go into this here).
The better solution is to provide a short, job specific, tcl file, which only
defines parameters (as tcl variables) particular for the current job and then
itself sources the main tcl file which properly sets up the framework using
these parameters.
In the context of a configurable job manager there is a little complication
though, since it is not possible to anticipate what anyone would want to define
in a tcl snippet. To get around this problem SJM (like the Task Manager) uses
a tcl template and a set of 'tags' that act as placeholders for job specific
information (for a list of tags see below). A user now can define any parameter
in the tcl snippet template using these tags. When the jobs are created the
tags get resolved and the actual tcl snippets are written out. If this is not
totally clear yet, just follow the example and you will see how the job specific
tcl snippets are created from the tcl snippet template.
The tcl snippet template in this example as defined in the configuration file
is called SJMTestSnippet.tcl. I don't need to change anything in this tcl snippet
template as it already is configured for this example. I would like to point out the
definition of rootName (the FwkCfgVar that defines the ntuple to be written
out) and the last line that sources the main tcl file. For more details on tcl
snippets see below. Also note that GamGamTo4pi.tcl uses these FwkCfgVars to
set up the framework job, in particular the jobreport file and the ntuple name
(there is really not much use defining variables if they are not used later
on).
See here for the tcl snippet template file
and here for the GamGamTo4pi.tcl file.
Create the Jobs:
roethel@noric04> SJMPrepareJobs SJMConfigFile.txt
Running BbkDatasetTcl --tcl 250000 --basename SJMPentaquarkRun1 \
--splitruns users-phnic-TwoPhotonPentaquarkSkim-BlackDiamond-Run1 ...
BbkDatasetTcl: wrote SJMPentaquarkRun1-1.tcl (250000 events)
...
BbkDatasetTcl: wrote SJMPentaquarkRun1-166.tcl (151236 events)
Selected 11 collections, 41401236/0 events, ~0.0/pb
done. Creating tcl snippets in directory 'prepared' now...
done!
Running this command created a subdirectory called SJMPentaquarkRun1 in my
workdir. Looking at this directory you can find six subdirectories, one for
each job state (prepared, submitted, done, ok, failed) and one storing the tcl
files defining the input collections that were created by BbkDatasetTcl. Listing
these directories you find
roethel@noric04> ls SJMPentaquarkRun1/tcl
SJMPentaquarkRun1-1.tcl SJMPentaquarkRun1-15.tcl SJMPentaquarkRun1-50.tcl
SJMPentaquarkRun1-10.tcl SJMPentaquarkRun1-150.tcl SJMPentaquarkRun1-51.tcl
SJMPentaquarkRun1-100.tcl SJMPentaquarkRun1-151.tcl SJMPentaquarkRun1-52.tcl
...
roethel@noric04> ls SJMPentaquarkRun1/prepared
SJMPentaquarkRun1-0001.tcl SJMPentaquarkRun1-0084.tcl
SJMPentaquarkRun1-0002.tcl SJMPentaquarkRun1-0085.tcl
...
If you remember the discussion on tcl snippet templates, you may want to compare
the resolved tcl snippet for e.g. the first job SJMPentaquarkRun1-0001.tcl
with the tcl template file. Before continuing you may want to check if your
snippets in the 'prepared' directory look ok. If not, fix the snippet template
and/or the configuration file and try again (delete the subdirectory tree to
remove the existing configuration - see below).
You can look at the job statistics with
roethel@noric04> SJMShowJobs SJMPentaquarkRun1
name prepared submitted done failed ok
---------------------------------------------------------------------------
SJMPentaquarkRun1 166 0 0 0 0
If you messed up when creating jobs, you could simply remove the subdirectory
tree SJMPentaquarkRun1 (e.g. with rm -rf SJMPentaquarkRun1)
and start over again. Now let's run some jobs...
Submitting Jobs
Let's test our configuration by submitting 2 jobs:
roethel@noric04> SJMSubmitJobs --njobs 2 SJMPentaquarkRun1
submitting jobs
Submitting job 1
Job <150864> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/SJMPentaquarkRun1.log \
/afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/wrapper-1.sh
Submitting job 2
Job <150865> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/SJMPentaquarkRun1.log \
/afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/wrapper-2.sh
Submitted 2 job(s).
While preparing this example something interesting happened - the jobs crashed
with (taken from the log file):
...
BetaMiniApp: error while loading shared libraries: libCore_pkgid_3.10-01.so: can
not open shared object file: No such file or directory
...
and SJMShowJobs reported
roethel@noric04> SJMShowJobs SJMPentaquarkRun1
Job 1: Log file assumed done, job report file not found!
Job 2: Log file assumed done, job report file not found!
name prepared submitted done failed ok
---------------------------------------------------------------------------
SJMPentaquarkRun1 164 0 2 0 0
First the error message indicates that the log file satisfies the 'done'-conditions
but the jobreport file for the job was not found (which is never a good sign).
I fix this problem by submitting from a RH7.2 noric and want to resubmit the
jobs, i.e. I need to move the jobs from the 'done' state back to the 'prepared'
state. To do that I simply move the tcl snippet files for these jobs from the
SJMPentaquarkRun1/done directory to the prepared directory:
roethel@noric04> ls SJMPentaquarkRun1/done
SJMPentaquarkRun1-0001.tcl SJMPentaquarkRun1-0002.tcl
roethel@noric04> mv SJMPentaquarkRun1/done/*.tcl SJMPentaquarkRun1/prepared/.
roethel@noric04> SJMShowJobs SJMPentaquarkRun1
name prepared submitted done failed ok
---------------------------------------------------------------------------
SJMPentaquarkRun1 166 0 0 0 0
We're ready to resubmit the jobs now :
roethel@noric14> SJMSubmitJobs --njobs 2 SJMPentaquarkRun1
submitting jobs
Submitting job 1
Run directory /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1 exists
already. Cleaning up
Job <152456> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/SJMPentaquarkRun1.log \
/afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/1/wrapper-1.sh
Submitting job 2
Run directory /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2 exists
already. Cleaning up
Job <152458> is submitted to queue <kanga>.
bsub -q kanga -C 0 \
-o /afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/SJMPentaquarkRun1.log \
/afs/slac.stanford.edu/g/babar/work/r/roethel/GamGam/Pentaquark-Run1/2/wrapper-2.sh
Submitted 2 job(s).
The warnings indicate that the run directories for the two jobs in question
exist already, since I submitted these jobs before. The directories will be
cleaned up so the output does not conflict. We can check if the jobs are really
running:
roethel@noric04> SJMShowJobs SJMPentaquarkRun1
name prepared submitted done failed ok
---------------------------------------------------------------------------
SJMPentaquarkRun1 164 2 0 0 0
And waiting a little more...
roethel@noric04> SJMShowJobs SJMPentaquarkRun1
name prepared submitted done failed ok
---------------------------------------------------------------------------
SJMPentaquarkRun1 164 0 2 0 0
We can now check the success of these jobs:
roethel@noric04> SJMCheckJobs SJMPentaquarkRun1
Checking jobs
...updating job status
...checking
job 1 ok.
job 2 ok.
Checked 2 jobs. Ok: 2 failed: 0
All fine for me - I hope for you as well... have fun!
NEW: In addition to the just mentioned way of running jobs, SJM V1.06 and
later supports running jobs in gdb (This only works where gdb is installed
on the batch machines). To use this option run
> SJMSubmitJobs -g <SJMName>.
Using the Job Monitor
From V1.02 on SJM comes with the script/daemon sjm sprited (was SJMSprited) to
automatically take on the management of jobs. This includes keeping a constant
number of jobs in the queue, checking jobs that are done and, if requested,
sending an email in case of problems. As mentioned the script is designed to run
as a daemon, i.e. it will continue to run even after you log off, but
it has been mainly tested running in a terminal window. The configuration
file takes the following parameters to configure sjm sprited:
- SpriteMaxJobs : The maximum number of
jobs to keep in the queue.
- SpriteSleepTime : The time between checks on
the processing status.
- SpriteEmailNotify : If set the email address to
send notification to.
- SpriteCheckAfsToken : A simple check that sends notification and
terminates in cases when no valid afs token exists.
Running automated job monitoring in the background adds some
non-trivial complication to the simple job manager. The main (or
better the only issue) is the possibility that two commands are attempting
to do the same thing at the same time, e.g. sjm sprited
is running
sjm check in the background and you are
running the same from the
command line. That can lead to race conditions with unpredictable results
(though the damage is pretty limited since after all the bookkeeping is done
by moving files within a unix file system. This is very safe and takes care
of most of the possible race conditions, which boil down to two processes
trying to do things with the same file. However, you may see unusual error
messages from SJM because an expected file all of a sudden does not exist).
To avoid that a sophisticated lock mechanism was introduced, which
prohibits two critical processes to run at the same time.
Ok, ok - well, the sophisticated lock mechanism is simply a file
called 'lock.pid', which resides in the SJM project directory and which
contains the process id of the process which owns the lock. The lock should
only be set when a process is attempting to move files and update job
status, e.g. when running sjm submit,
sjm show
(with updating job
status) and sjm check. Sometimes it can happen that a
process did not
remove the lock, either because it was killed before it finished (better not
do that) or because the sjm sprited daemon process
died, or... If a
lock persists for a long time you should probably check the process id in the
lock file and see if that process is still alive (you can do that by
logging on the machine the process is running on and using
> ps -p <process id>).
If not it is safe to
remove the lock file and proceed (i.e. > rm lock.pid).
In addition it is also
advisable to check for a lock file when moving files from the
prepared, submitted or done
states (directories) to other states. However in
practice one will move files from failed or
possibly ok to prepared which is
always safe.
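The manual check described above (read the process id from the lock file, then see if that process is still alive) can be sketched in Python. This is only an illustration and not SJM's own code: the function names are made up here, the lock file is assumed to contain nothing but the decimal process id, and the check only makes sense on the machine where the lock owner is running.

```python
import os


def read_lock_pid(project_dir):
    """Return the pid stored in <project_dir>/lock.pid, or None if unusable.

    Assumption: the lock file holds just the decimal process id.
    """
    lock_path = os.path.join(project_dir, "lock.pid")
    try:
        with open(lock_path) as lock_file:
            return int(lock_file.read().strip())
    except (OSError, ValueError):
        return None


def process_is_alive(pid):
    """Check for a process on the local machine without disturbing it."""
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_looks_stale(project_dir):
    """True if a lock file exists but its owner is no longer running."""
    pid = read_lock_pid(project_dir)
    return pid is not None and not process_is_alive(pid)
```

If lock_looks_stale returns True on the machine the lock owner was started on, removing the lock file by hand is the step described above.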
sjm sprited maintains a log file
which contains besides log information
also the output of the various job submit and check operations. The output
however is not flushed, so the order can be somewhat confusing. To start the
daemon you just need to run
> sjm sprite --start <SJMName>
For further options see > sjm sprite -h. To just run
sjm sprited in a terminal window (in which case
the output is flushed and is better understandable) run
> sjm sprited <SJMName>
Finally - when all jobs have been submitted and checked
sjm sprited will
terminate by itself and optionally send an email notification.
Reference
SJM is the Task Manager with every feature removed that is not
absolutely essential. The result was small enough to be written in two days
and still do the work. The main idea behind SJM is that the tcl snippet for
each job contains enough information to run a job and do some essential bookkeeping
on it. The bookkeeping itself is managed over the particular directory structure
in SJM.
SJM File- and Directory Structure and Bookkeeping
As mentioned the bookkeeping is managed over the directory structure
and file names in SJM. The only parameter required to identify a job and resolve
all its associated files and directories (for a given SJM project name) is the
job id, which is stored over the tcl snippet name and input tcl file name convention
<SJM project name>-<job id>.tcl
(It is not a good idea to choose an SJM name that itself uses a '-<some
number>' pattern since this may conflict with the job id extraction.)
The other files SJM relies on (the log file, the job report file and the wrapper
script) are all located in the run directory that is made up of the job id and
(possibly) the SJM project name and is defined by the user in the configuration
file.
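To make the naming convention concrete, here is a small Python sketch that recovers the job id from a file name of the form <SJM project name>-<job id>.tcl. The function name is hypothetical and this is not necessarily how SJM extracts the id; anchoring the match on the full project name is just one way to do it.

```python
import re


def job_id_from_filename(filename, project_name):
    """Return the integer job id from '<project>-<id>.tcl', or None."""
    pattern = r"^%s-(\d+)\.tcl$" % re.escape(project_name)
    match = re.match(pattern, filename)
    if match is None:
        return None
    return int(match.group(1))


# Snippet names are zero padded, input tcl names are not; both give the same id.
print(job_id_from_filename("SJMPentaquarkRun1-0001.tcl", "SJMPentaquarkRun1"))
print(job_id_from_filename("SJMPentaquarkRun1-1.tcl", "SJMPentaquarkRun1"))
```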
The current job state is defined by the directory the snippet file is located
in. At the beginning all snippet files are in the prepared directory.
The command SJMShowJobs just counts the number of tcl files
in any of these directories and displays the count. There is no other hidden
behind-the-scenes bookkeeping. So just for fun you could move a snippet file
from the prepared directory to any other job state directory and see
how the output of SJMShowJobs changes (don't forget to move
the file back again... and please use mv,
don't cp the files!!!). As mentioned above, if you mess
up (or don't like your setup) just delete the SJM directory structure and start
over again.
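Since the job state is nothing more than the directory a snippet file sits in, the counting step can be sketched in a few lines of Python. The function name is an assumption made for this illustration; the directory names are the five job states listed in the example.

```python
import glob
import os

JOB_STATES = ("prepared", "submitted", "done", "failed", "ok")


def count_jobs(project_dir):
    """Count the tcl snippet files in each job state directory."""
    counts = {}
    for state in JOB_STATES:
        pattern = os.path.join(project_dir, state, "*.tcl")
        counts[state] = len(glob.glob(pattern))
    return counts
```

Moving a snippet file with mv changes two of these counts by one; copying it with cp would make the same job appear in two states at once, which is why cp should be avoided.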
The Run Directory
Every job managed by SJM has its own run directory. This may seem a bit inconvenient,
but it simplifies the management of jobs and makes it more flexible. Instead
of having to keep track of different files individually, the only variable is
the run directory itself and all other files (currently the log file, job report
file and the wrapper script used to submit a job) can be identified using that.
The uniqueness of the run directory also requires one more thing - the user
has to make sure that he/she defines a unique run directory when configuring
a SJM project. The simplest way to do that is to make sure the <ID>
tag is part of the run directory (The job creation should fail if this is not
satisfied!).
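A uniqueness check of this kind could look like the following Python sketch. It is not taken from SJM: the function names are invented, and it assumes that the <NAME>, <ID100> and <ID> tags may all appear in the RunDirectory setting.

```python
def resolve_run_directory(pattern, name, job_id):
    """Fill in the tags of a RunDirectory pattern for one job."""
    return (pattern.replace("<NAME>", name)
                   .replace("<ID100>", str(job_id // 100))
                   .replace("<ID>", str(job_id)))


def run_directories_are_unique(pattern, name, job_ids):
    """True if every job id resolves to a different run directory."""
    resolved = [resolve_run_directory(pattern, name, job_id)
                for job_id in job_ids]
    return len(set(resolved)) == len(resolved)
```

A pattern such as /some/scratch/<NAME>/<ID> passes this check, while a pattern without <ID> (or with only <ID100>) does not as soon as there is more than one job.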
The Configuration File
The parameters defined in the configuration file are:
- SJMName - The name of the SJM project. In the Task Manager
we would call this a task, but this is not the Task Manager. Make sure there
is no subdirectory in workdir with that name, since SJM will create this subdirectory.
- DatasetName - The name of the dataset that the jobs should
run over.
- MaxEvents - The max. number of events run per job. SJM
runs BbkDatasetTcl internally to create tcl files, that define
the input collections. This parameter specifies the value passed on to the
--tcl option.
- BbkDatasetTclRaw - Some 'raw' options to be passed on to
BbkDatasetTcl. SJM provides the --tcl, --splitruns
and --basename option. Further options may be provided by setting
this string, however there is no guarantee that the command will work as it
is supposed to. You'll just have to try.
- TclSnippet - The name of the tcl snippet template file.
- RunDirectory - The name of the run directories. In SJM
every job gets its own run directory (typically in scratch space) which contains
the log file and the jobreport file. Users may add further files to this directory
by using the <RUNDIR> tag in the tcl snippet template.
- Executable - the name of the framework executable that
should be run (like BetaMiniApp). Just list the executable, do not include
the tcl file.
- BatchCommand - The command used to submit jobs to the job
scheduler. Uses the tags <LOG> and <WRAPPER>,
which will be replaced by the job specific log file and a wrapper shell script
at the time the job is submitted. For running at SLAC with LSF this can be
left as it is.
- JobFinishedString - A string searched for in the last 200
lines, that - if found - indicates that the job has finished/exited the batch
system. This does not need to be set at SLAC (LSF) where it defaults to 'Resource
usage summary'. This may need to be set at sites using PBS.
- JobSuccessfulString - A string indicating that the job
was run successfully in the batch queue (exit code 0), similar to JobFinishedString
described above. LSF does not show the exit code directly, but instead uses
the string 'Successfully completed' to indicate an exit code of 0. Again,
this may have to be customized to meet the requirements at other sites.
- Share - Set to 1 if the SJM project should be shared with
others. This will make the project and run directories group writable.
- umask - Similar to Share but allows setting
custom values for the file mask.
- SpriteMaxJobs - The maximum number of
jobs to keep in the queue.
- SpriteSleepTime - The time between checks on
the processing status.
- SpriteEmailNotify - If set the email address to
send notification to.
- SpriteCheckAfsToken - A simple check that sends notification and
terminates in cases when no valid afs token exists.
The syntax used in the configuration file is of the form parameter = text,
where parameter is not allowed to contain space characters (leading and trailing
spaces will be removed). Text is provided as raw text, i.e. all spaces besides
leading and trailing spaces will be preserved, so no quotes are necessary(!).
Comments can be added by using the hash character '#' as the first character(!)
in a line.
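The syntax rules above are simple enough to sketch as a small Python parser. This is an illustration under assumptions, not the parser used by SJM: the function name is made up, and lines that cannot be identified are silently skipped here, whereas SJM may report them.

```python
def parse_config(text):
    """Parse 'parameter = text' lines into a dictionary."""
    options = {}
    for line in text.splitlines():
        # Skip 'empty' lines (including lines with only spaces) and comments.
        # A comment needs '#' as the very first character of the line.
        if not line.strip() or line.startswith("#"):
            continue
        if "=" not in line:
            continue  # cannot be identified as an option
        name, value = line.split("=", 1)
        # Leading and trailing spaces are removed; inner spaces are kept.
        options[name.strip()] = value.strip()
    return options


example = """# define the name of the SJM
SJMName = SJMPentaquarkRun1
MaxEvents = 250000
"""
print(parse_config(example))
```

Splitting on the first '=' only means the text part may itself contain '=' characters, which matters for raw option strings.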
The Tcl Snippet Template File
The tcl snippet file and the use of tags were introduced in the example. Though
the basic contents of the tcl snippet template is up to the user SJM requires
the following lines:
sourceFoundFile <INPUTTCL>
set jobReportName <JOBREPORT>
SJM could automatically add these lines to the snippet file, but adding things
behind the scenes that may interfere with other user defined settings may be
more confusing than requiring certain values to be set. You also need to make
sure your main tcl file contains the appropriate line to write out the jobreport
file (see below).
Valid tags that can be used in the tcl snippet template are:
- <ID> - The job id of the current job.
- <ID100> - The job id divided by 100 (int(job id)/100).
This may be useful if you have a very large number of jobs and want to distribute
the run directories over several subdirectories in the run directory tree.
- <INPUTTCL> - The name of the tcl file created by
BbkDatasetTcl, which defines the input collections. This tag is basically
only used to source the input tcl file. Ah, yes, don't forget to source <INPUTTCL>
or your jobs will have no defined input.
- <JOBREPORT> - The (predefined) name of the job report
file. Do not give the job report file a name other than <JOBREPORT>,
since SJM uses the job report file to determine the success or failure of
a job. You may however choose a different name for the tcl variable (in this
example jobReportName).
- <RUNDIR> - The name of the run directory for the current
job. This tag may be used if other files (e.g. ntuples) should be written
to the run directory.
- <NAME> - The name of this SJM project as defined over
SJMName in the configuration.
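How a template with these tags turns into a job specific snippet can be sketched in Python. This is a minimal illustration, not SJM's implementation: the function names and the job report file name used in the usage example are assumptions.

```python
def resolve_tags(template_text, values):
    """Replace each <TAG> placeholder with its job specific value."""
    resolved = template_text
    for tag, value in values.items():
        resolved = resolved.replace("<%s>" % tag, str(value))
    return resolved


def job_values(name, job_id, run_dir_pattern, input_tcl, job_report_pattern):
    """Build the tag table for one job."""
    values = {
        "ID": job_id,
        "ID100": job_id // 100,  # integer division, as in int(job id)/100
        "NAME": name,
        "INPUTTCL": input_tcl,
    }
    # The run directory and job report name may themselves contain tags.
    values["RUNDIR"] = resolve_tags(run_dir_pattern, values)
    values["JOBREPORT"] = resolve_tags(job_report_pattern, values)
    return values


template = ("sourceFoundFile <INPUTTCL>\n"
            "set jobReportName <JOBREPORT>\n"
            "set rootName <RUNDIR>/ntuple_<ID>.root\n")
values = job_values("SJMPentaquarkRun1", 1, "/some/scratch/<NAME>/<ID>",
                    "SJMPentaquarkRun1-1.tcl", "<RUNDIR>/jobreport.txt")
print(resolve_tags(template, values))
```

Because each tag includes its closing '>', replacing <ID> cannot accidentally touch <ID100>.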
How to use FwkCfgVars is explained elsewhere (I don't know where) but essentially
do the equivalent of the following in your main tcl file:
FwkCfgVar jobReportName
FwkCfgVar rootName
...
jobReport filename $jobReportName
The Commands
With the two new additions there are now six commands altogether.
These are simple commands and
don't take a lot of command line options, but they all, with the exception of
SJMSprited, do have a basic -h, --help
option to remind you of all the options that they (don't) have.
sjm prepare (was SJMPrepareJobs)
sjm prepare actually does three different things:
First it reads in the configuration file, verifies some entries and creates
the SJM directory structure in the current workdir. It also copies the configuration
file to the subdirectory where it serves as the main configuration file for
all the other SJM commands. For reference a copy of the tcl snippet is also
stored in the SJM directory.
The second thing sjm prepare does is to run BbkDatasetTcl to create a list
of input tcl files in the tcl subdirectory.
Finally sjm prepare reads in the list of input tcl files and creates the
tcl snippets for every input tcl file in the prepared subdirectory.
sjm show (was SJMShowJobs)
sjm show basically just counts tcl snippet files in the different job status
subdirectories. However, before doing so it checks if jobs listed in the submitted
directory have finished. Whether a job is still running or is assumed to be finished
is determined by checking the last 200 lines of the log file for a given string
that signals that the job has completed. The default is to look for 'Resource
usage summary', but this can be overridden defining JobFinishedString
in the configuration file (e.g. for running at sites using PBS). In addition
the existence of the job report file and the stop time written in the job report
file is also checked and a warning is printed if the log file has been found
to indicate a finished job but the job report file does not mirror this.
sjm submit (was SJMSubmitJobs)
sjm submit submits jobs. It first creates the run directory (and cleans
up old run directories if these happen to exist already) and then creates a
small shell wrapper script in that directory. The wrapper script, which is necessary
to provide compatibility with other job schedulers like PBS, is then submitted
to the batch queue using the command defined in the configuration file.
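The two pieces involved in a submission, the wrapper script and the batch command with its <LOG> and <WRAPPER> tags, can be sketched in Python as follows. Both functions are hypothetical: the actual contents of SJM's wrapper script are not shown in this document, so the three line script below is only a guess at its shape, and the BatchCommand value in the usage example is inferred from the bsub lines in the example output.

```python
import shlex


def wrapper_script(run_dir, executable, snippet_path):
    """Return the text of a minimal shell wrapper (assumed layout).

    'exec' replaces the shell with the framework executable, so the exit
    code seen by the batch system is the one of the executable itself.
    """
    return "#!/bin/sh\ncd %s\nexec %s %s\n" % (
        shlex.quote(run_dir), shlex.quote(executable),
        shlex.quote(snippet_path))


def build_batch_command(batch_command, log_path, wrapper_path):
    """Resolve <LOG> and <WRAPPER> and split into an argument list."""
    command = batch_command.replace("<LOG>", log_path)
    command = command.replace("<WRAPPER>", wrapper_path)
    return shlex.split(command)


print(build_batch_command("bsub -q kanga -C 0 -o <LOG> <WRAPPER>",
                          "/some/run/1/job.log", "/some/run/1/wrapper-1.sh"))
```

The resulting argument list could be handed to subprocess.run; for another scheduler only the BatchCommand string would change.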
sjm check (was SJMCheckJobs)
This command finally checks jobs that are found to be done. Similar to identifying
completed jobs, jobs are checked ok if the exit code of the job running in the
queue was found to be 0 (Note that the shell wrapper 'exec's the framework executables
for this purpose instead of running the framework executable as a sub process!).
In LSF this can be done by parsing the first and last 200 lines of the log file for the
string 'Successfully completed', which is the default in SJM. This default can
be overridden by defining JobSuccessfulString in the configuration
file. In addition, the job report file must exist and must contain the stop
time.
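The log file part of this check, searching the first and last 200 lines for a given string, can be sketched in Python like this. The function name is an assumption, and the sketch reads the whole file into memory, which SJM itself may or may not do.

```python
def log_contains(log_path, marker, nlines=200):
    """True if marker occurs in the first or last nlines of the log file."""
    with open(log_path, errors="replace") as log_file:
        lines = log_file.readlines()
    # For short files head and tail overlap, which does no harm here.
    window = lines[:nlines] + lines[-nlines:]
    return any(marker in line for line in window)
```

With the LSF defaults quoted above, a call such as log_contains(path, 'Successfully completed') corresponds to the JobSuccessfulString test, and log_contains(path, 'Resource usage summary') to the JobFinishedString test.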
sjm sprite (was SJMSprite)
sjm sprite is just used to start and stop the job monitoring script
sjm sprited when that is run as a daemon. To keep the daemon running in an
afs environment such as at SLAC, you need to run klog -setpag to ensure you have a
valid token after closing the terminal window.
sjm sprited (was SJMSprited)
sjm sprited is the job monitoring daemon - when run as daemon.
But it can be run
just as well in a terminal window. There are no commandline options for
this command, except for the SJMName itself. For most parts
sjm sprited sleeps (the
default sleep time is 20 minutes). When it awakes it first updates the job
status, then determines how many jobs are currently in the queue (it uses SJM's own
bookkeeping for this and does not rely on a specific batch interface and is therefore
also not sensitive to occasional outages of the batch system) and submits
the necessary number of jobs to keep the requested number of jobs in the queue.
Finally it checks done jobs. When nothing is left to be done
sjm sprited
exits.
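The decision made in each wake-up cycle can be sketched in Python from the job counts alone. This is a guess at the logic based on the description above, not SJM's code: the function names are invented, 'in the queue' is taken to mean the jobs in the submitted state, and 'nothing left to be done' is taken to mean that no job is prepared, submitted or done any more.

```python
def jobs_to_submit(counts, max_jobs):
    """Number of prepared jobs to submit to reach max_jobs in the queue."""
    free_slots = max_jobs - counts["submitted"]
    return max(0, min(free_slots, counts["prepared"]))


def nothing_left(counts):
    """True when every job has ended up in the ok or failed state."""
    return (counts["prepared"] == 0 and counts["submitted"] == 0
            and counts["done"] == 0)


counts = {"prepared": 164, "submitted": 2, "done": 0, "failed": 0, "ok": 0}
print(jobs_to_submit(counts, 10))
```

Here max_jobs plays the role of SpriteMaxJobs, and the time between two such cycles is SpriteSleepTime.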
Running in PBS (or other Job Schedulers)
Since SJM uses a wrapper script to submit jobs, running on PBS (or yet another
job scheduler) is not a big problem. However you have to provide the necessary
strings in the configuration file (see 1. in the list above) that SJM looks for in
a log file to determine if a job has finished and has run successfully.
SJM tries to identify idle jobs by checking the time of the last update of
a log file. If the update occurred more than an hour before the check a warning
message will be printed, but no further action will be taken. It is up to the
users to check the status of the job and (possibly) fix the problem. PBS typically
writes log files to a private area and only renames the log file to the defined
log file name when the job has finished. Therefore this additional check is
not possible (Redirecting the output over the shell wrapper is not a good alternative,
since this will not capture the report from the job scheduler containing the
job exit code).
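The idle job check described above, comparing the last update time of the log file with the current time, can be sketched in Python as follows. The function name and the constant are assumptions made for this illustration; the one hour (3600 sec.) limit is the value quoted in this document.

```python
import os
import time

STALE_AFTER_SECONDS = 3600  # one hour


def log_is_stale(log_path, now=None):
    """True if the log file was last modified more than an hour ago."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path) > STALE_AFTER_SECONDS
```

On a scheduler that only puts the log file in place after the job has finished, the file does not exist while the job runs, so this check cannot be applied there.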
A short description on how to run at RAL will follow shortly. However the
only thing needed to configure is the batch command and the string in the log file
to identify that a job is done. (Actually at RAL this can be anything because the
log file only exists in the globally readable area when the job is done. Job
validation is exclusively done using the job report file, not the log file.)
Miscellaneous
What is SJM
SJM was born out of the need to run some analysis while I'm still working on the
Task Manager. It should not replace the Task Manager though - if you are looking
for a full production type analysis framework which allows e.g. merges, imports
of collections to the bookkeeping database/hpss, full bookkeeping of runs etc.,
SJM is not the tool to use. However, if you just want to run some jobs with
a given input dataset, and don't really care about all the additional features,
then the heavy-weight Task Manager will not be the ideal tool to use and you
are better off with a simple tool like SJM... I would guess that the majority
of analysis jobs will fall into the latter category...
Distribution
SJM is distributed as a tar file and not in a package in cvs. The reason is
that a cvs package needs a maintainer and I don't have the time to maintain
SJM. If someone volunteers to take over this job, I can unwrap the SJMBase.py
file and put SJM in a package.
Bug Reports/Disclaimer
SJM is provided as is. It probably has bugs. You can send me a mail with bug
reports and I will try to fix them whenever I have time.
Page author(s): Will Roethel | Last significant update: Sep 30, 2005 | Expiry date: January 31, 2006