Enforcing Resource Usage Limits for Parallel Tasks

A typical Platform LSF parallel job launches its tasks across multiple hosts. By default you can enforce limits on the total resources used by all the tasks in the job.

Resource usage limits

Because PAM reports only the sum of parallel task resource usage, LSF does not enforce resource usage limits on individual tasks in a parallel job. For example, resource usage limits cannot cap the memory allocated by a single task, so they cannot prevent one task from allocating enough memory to bring down the entire system. Conversely, for some jobs the total resource usage may exceed a configured resource usage limit even though no single task does, and the job is terminated unnecessarily.
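The second case can be illustrated with a little arithmetic. The sketch below uses made-up figures (a 1000 MB limit and four tasks using 300 MB each) and only mimics the comparison; it does not call any LSF command.

```shell
# Hypothetical figures, for illustration only.
limit_mb=1000                      # configured memory limit
task_usage_mb="300 300 300 300"    # memory used by each of four tasks

total=0
over_per_task=0
for usage in $task_usage_mb; do
    total=$((total + usage))
    # Count tasks that would exceed the limit on their own.
    if [ "$usage" -gt "$limit_mb" ]; then
        over_per_task=$((over_per_task + 1))
    fi
done

echo "total=${total}MB limit=${limit_mb}MB tasks_over_limit=${over_per_task}"
if [ "$total" -gt "$limit_mb" ]; then
    echo "job-level check: limit exceeded, job would be terminated"
fi
```

No task is over the limit on its own, yet the summed usage of 1200 MB is, which is the situation described above.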

Attempting to limit individual tasks by setting a system-level swap hard limit (RLIMIT_AS) in the system limit configuration file (/etc/security/limits.conf) is not satisfactory, because it only prevents tasks running on that host from allocating more memory than they should; other tasks in the job can continue to run, with unpredictable results.

By default, custom job controls (JOB_CONTROL in lsb.queues) apply only to the entire job, not individual parallel tasks.

Enabling resource usage limit enforcement for parallel tasks

Use the LSF_HPC_EXTENSIONS options TASK_SWAPLIMIT and TASK_MEMLIMIT in lsf.conf to enable resource usage limit enforcement and job control for parallel tasks. When TASK_SWAPLIMIT or TASK_MEMLIMIT is set in LSF_HPC_EXTENSIONS, LSF terminates the entire parallel job if any single task exceeds the corresponding swap or memory limit.

Other resource usage limits (CPU limit, process limit, run limit, and so on) continue to be enforced for the entire job, not for individual tasks.
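As a minimal sketch, enabling both options in lsf.conf might look like the fragment below. It assumes no other extensions are configured; if LSF_HPC_EXTENSIONS is already set at your site, add the two names to the existing space-separated list rather than defining the parameter a second time.

```shell
# lsf.conf (fragment)
# Enforce swap and memory limits per parallel task instead of per job.
LSF_HPC_EXTENSIONS="TASK_SWAPLIMIT TASK_MEMLIMIT"
```

Either option can also be set on its own if only one kind of limit should be enforced per task.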

Assumptions and behavior

  • To enforce resource usage limits by parallel task, you must use the LSF generic Parallel Job Launcher (PJL) framework (PAM/TS) to launch your parallel jobs.

  • This feature only affects parallel jobs monitored by PAM. It has no effect on other LSF jobs.

  • LSF_HPC_EXTENSIONS=TASK_SWAPLIMIT overrides the default behavior of swap limits (bsub -v, bmod -v, or SWAPLIMIT in lsb.queues).

  • LSF_HPC_EXTENSIONS=TASK_MEMLIMIT overrides the default behavior of memory limits (bsub -M, bmod -M, or MEMLIMIT in lsb.queues).


  • When a parallel job is terminated because of task limit enforcement, LSF sets a value in the LSB_JOBEXIT_INFO environment variable for any post-execution programs.

  • When a parallel job is terminated because of task limit enforcement, LSF logs the job termination reason in the lsb.acct file:

    • TERM_SWAP for swap limit

    • TERM_MEMLIMIT for memory limit

    bacct displays the termination reason.
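A post-execution program can pick up LSB_JOBEXIT_INFO from its environment. The following is a minimal sketch, not a supplied LSF script: the function name log_job_exit and the default log path are hypothetical, and the function makes no assumption about the format of the value that LSF sets.

```shell
# Hypothetical post-execution helper: record why the job ended.
# LSB_JOBID and LSB_JOBEXIT_INFO come from the LSF environment;
# the default log path below is an assumption for this example.
log_job_exit() {
    log_file=${1:-/tmp/lsf_job_exit.log}
    if [ -n "${LSB_JOBEXIT_INFO:-}" ]; then
        echo "job ${LSB_JOBID:-unknown}: exit info: ${LSB_JOBEXIT_INFO}" >> "$log_file"
    else
        echo "job ${LSB_JOBID:-unknown}: no exit info set" >> "$log_file"
    fi
}
```

A script that calls this function could be attached as a post-execution command, for example through POST_EXEC in lsb.queues or bsub -Ep. After the job finishes, bacct -l with the job ID shows the termination reason recorded in lsb.acct.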