Example: XML Task Configuration File

This section presents an example of a Pipeline XML Task Configuration File (TCF) that defines a routine science-processing task: an automated microquasar (mq) analysis of a variable number of sources that are tracked on a regular basis.

Note: This is an example only, and is not meant as an exercise to be run.

Using the pipeline makes it easy to:

  • Submit a large number of jobs.
  • "Roll back" (i.e., rerun) any jobs, whether they
    completed successfully or failed due to quirks
    in the SLAC batch system.
  • Register output datasets in a data catalog, which
    in turn makes it easy to keep track of what data is available.

In addition, the pipeline maintains a record of all
jobs run, including links to log files and to other
files produced, and the web interface allows status of jobs to be monitored worldwide.

Note:

  • The SRS Pipeline runs on the SLAC Batch Farm. However, it has been designed so that jobs can be run on other participating batch farms, regardless of where in the world they may be located.
  • When running the SRS Pipeline on the SLAC Batch Farm, be sure you are running on a RHEL4 machine. (See Using the SLAC Batch Farm.)

Example: Creating the rspmq Task Configuration File

Note: When editing an XML file for the pipeline, you are encouraged to use an editor which can validate XML files against XML schema, since this will save you a lot of time. EMACS users may be interested in Using EMACS for XML documents.

rspmq Tasks and Processes. The boxes in the following flow diagram depict a top-level pipeline task (rspmq) and one subtask (mqAnalysis). The ellipses represent processes, of which there are two types: batch jobs and jython scripts. In this example there are five processes, two of which are top-level batch jobs (setup and finishup). Among other things, the setup process invoked the pipeline's createStream command to create a stream for each mq source; in this example, that came to 75 streams. Each stream successfully performed all of the processes in the subtask independently of the other streams.

Subtask Processes. Two of the subtask processes are batch jobs (mqAnalysis and loadTrending); the third, finishSource, is a jython script. When the setup process and all of the streams have completed, the pipeline runs the finishup batch job. Had a stream or a process failed, a link would have been displayed in the "X" column.

Note: An option to "rollback" and rerun a task, subtask, or process is also available, whether or not there was a failure. When rolled back, all successive processes are also rerun.

Key points are summarized below:

rspTestConfig XML File

Note: To see a popup of the full XML file for this example, without explanations, click on: SRS Example: rspTestConfig.xml.

Required. At the beginning of the SRS pipeline XML configuration file, include the following lines:


Note: These prerequisites allow the XML file that defines a task to require that certain variables be set at stream creation time. If values are not given, the stream creation will fail. These are the minimal set of variables that must be set; it is always possible to specify additional variables.
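A prerequisites block of this kind might look like the following sketch; the tag layout and the variable names (nSources, sourceList) are illustrative assumptions rather than the actual contents of rspTestConfig.xml:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch only; the actual tag and variable names
     in rspTestConfig.xml may differ. -->
<prerequisites>
    <!-- Stream creation fails if these are not supplied. -->
    <prerequisite name="nSources"/>
    <prerequisite name="sourceList"/>
</prerequisites>
```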

The next line is delimited by "<!--" ... "-->", and is therefore commented out.


Below, the <task> tag is used to open the task definition. The task name, type (DATA, MC, EXO, etc.), and version are defined, followed by a set of constants, i.e., <variables>, that will be used by one or more tasks or subtasks.

Beneath the <variables> tag, the first two lines are commented out. The first is a reminder that the values given to the variables are defaults; these default values can be overridden for any specific stream when the stream is created. The second commented-out line is a "TASK_ROOT" variable that is not used in the current example.
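As a sketch, the opening of the task definition might resemble the following; the variable names, values, and paths shown here are illustrative assumptions, not the actual contents of the example file:

```xml
<!-- Hypothetical sketch; names, values, and paths are illustrative. -->
<task name="rspmq" type="DATA" version="1.0">
    <variables>
        <!-- Values given here are defaults; any of them can be
             overridden for a specific stream at creation time. -->
        <!-- <var name="TASK_ROOT">/a/task/root/path</var> -->
        <var name="OUTPUT_DIR">/an/output/directory</var>
        <var name="SCRIPT_DIR">/a/script/directory</var>
    </variables>
    <!-- process and subtask definitions follow here -->
</task>
```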


Notes:

  • Variables can be defined within an XML configuration file at the <task> or <process> level;
  • they can also be defined when a stream or substream is created or rolled back;
  • and they can be defined within – or computed by – processes (either scripts or batch jobs) and passed into the stream, thereby providing the ability to perform runtime calculations.

Be aware, however, that there are some restrictions when passing variables into the stream. For example, subtasks can see variables of the parent process, but parent task processes cannot see variables of a subtask.

Similarly, a pipeline python process-generated variable cannot be seen by a subsequent python process unless the variable was predefined by the XML configuration file at the <process> level.
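For instance, to make a value computed by one process visible to a later python process, the variable could be predeclared at the <process> level, roughly as follows; both the tag layout and the setVariable call are assumptions about the pipeline API, not taken from this example:

```xml
<!-- Hypothetical sketch; the API call shown is an assumption. -->
<process name="compute">
    <variables>
        <!-- Predeclared so later processes can see the value. -->
        <var name="MY_RESULT"></var>
    </variables>
    <script><![CDATA[
        # Compute a value at runtime and pass it into the stream.
        pipeline.setVariable("MY_RESULT", "42")
    ]]></script>
</process>
```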

There are also anomalies and some workarounds when trying to pass process-generated variables between jython and python scriptlets, and vice versa. For a more detailed discussion, see XXXXXXXXXXXXXXX??? TOM ???

Top-level Processes

Top-level processes are next. First comes the setup batch job, which creates the top-level stream as well as the substreams (in this case, one substream is created for each of the sources being analyzed):


Note: The queue specification for running this script is: batchOptions="-q medium". For more queue information, see: Using the SLAC Batch Farm: Queue Information.
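A process definition along these lines might be sketched as follows; the executable path is an illustrative assumption, while the batchOptions value is the queue specification noted above:

```xml
<!-- Hypothetical sketch of the setup batch job;
     the executable path is illustrative. -->
<process name="setup">
    <job executable="${SCRIPT_DIR}/setup.sh"
         batchOptions="-q medium"/>
</process>
```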

... and then the finishup batch job:


Note: The depends statement stipulates that the finishup job will run only after the top-level setup job and the finishSource jython script for each of the subtask streams have successfully completed.
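A sketch of such a process, with its dependency on setup and on every stream's finishSource, might look like this; the dependency tag names and executable path are assumptions:

```xml
<!-- Hypothetical sketch; dependency tag names are assumptions. -->
<process name="finishup">
    <job executable="${SCRIPT_DIR}/finishup.sh"/>
    <depends>
        <after process="setup"/>
        <!-- finishSource must complete in every subtask stream -->
        <after process="mqAnalysis.finishSource"/>
    </depends>
</process>
```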

Subtasks

Next, the subtasks are defined. Note that the first statement defines the subtask name, type, and version, which is then followed by the process name and job executable:
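Sketched out, the subtask opening and its first process might resemble the following; names other than those mentioned in the text are illustrative:

```xml
<!-- Hypothetical sketch of the mqAnalysis subtask. -->
<task name="mqAnalysis" type="DATA" version="1.0">
    <process name="mqAnalysis">
        <job executable="${SCRIPT_DIR}/mqAnalysis.sh"/>
    </process>
    <!-- the loadTrending and finishSource processes follow here -->
</task>
```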

 

The "loadTrending" batch job is next; note the "depends" statement that stipulates each stream will wait until its mqAnalysis batch job has completed successfully:
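A sketch of loadTrending with that dependency, the dependency tag names again being assumptions:

```xml
<!-- Hypothetical sketch; dependency tag names are assumptions. -->
<process name="loadTrending">
    <job executable="${SCRIPT_DIR}/loadTrending.sh"/>
    <depends>
        <!-- wait for this stream's mqAnalysis job to succeed -->
        <after process="mqAnalysis"/>
    </depends>
</process>
```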

 

Finally, the last subtask process (i.e., the "finishSource" jython script) runs after an individual stream's loadTrending job has completed successfully.
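The finishSource scriptlet might be sketched as follows; the jython body and the dependency tags are illustrative assumptions:

```xml
<!-- Hypothetical sketch of the finishSource jython scriptlet. -->
<process name="finishSource">
    <script><![CDATA[
        # Per-stream bookkeeping would go here.
        print "source stream finished"
    ]]></script>
    <depends>
        <after process="loadTrending"/>
    </depends>
</process>
```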


Owned by: Tom Glanzman

 

Last updated by: Chuck Patterson 03/16/2010