Memory enforcement based on Linux cgroup memory subsystem

LSF can impose strict host-level memory and swap limits on systems that support Linux cgroups. These limits cannot be exceeded. All LSF job processes are controlled by the Linux cgroup system. If job processes on a host use more memory than the defined limit, the job is immediately killed by the Linux cgroup memory subsystem. Memory is enforced on a per-job and per-host basis, not per task. If the host OS is Red Hat Enterprise Linux 6.3 or above, cgroup memory limits are enforced, and LSF is notified to terminate the job. Users receive additional notification through specific termination reasons that are displayed by bhist -l.

Memory enforcement for Linux cgroups is supported on Red Hat Enterprise Linux (RHEL) 6.2 or above and SUSE Linux Enterprise 11 SP2 or above.

Without cgroup-based enforcement, LSF enforces memory limits for jobs by periodically collecting job memory usage and comparing it with the memory limits set by users. If a job exceeds its memory limit, the job is terminated. However, if a job allocates a large amount of memory between two enforcement checks, it can exceed its memory limit before LSF kills it.

To enable memory enforcement through the Linux cgroup memory subsystem, configure LSB_RESOURCE_ENFORCE="memory" in lsf.conf.

If you are enabling memory enforcement through the Linux cgroup memory subsystem after upgrading an existing LSF cluster, make sure that the following parameters are set in lsf.conf:

  • LSF_PROCESS_TRACKING=Y
  • LSF_LINUX_CGROUP_ACCT=Y

Setting LSB_RESOURCE_ENFORCE="memory" automatically turns on cgroup accounting (LSF_LINUX_CGROUP_ACCT=Y) to provide more accurate memory and swap consumption data for memory and swap enforcement checking. Setting LSF_PROCESS_TRACKING=Y enables LSF to kill jobs cleanly after memory and swap limits are exceeded.
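
The settings named in this section can be checked with a short script. The following is a minimal sketch, not an LSF utility: the function name check_cgroup_mem_conf is hypothetical, and the sketch assumes that each parameter appears on its own uncommented line, written exactly as shown in this section.

```shell
# Hypothetical helper: report whether the parameters named in this section
# are set in the given lsf.conf file.
# Assumption: each setting is on its own line, written exactly as below.
check_cgroup_mem_conf() {
  conf_file=$1
  rc=0
  for setting in \
      'LSB_RESOURCE_ENFORCE="memory"' \
      'LSF_PROCESS_TRACKING=Y' \
      'LSF_LINUX_CGROUP_ACCT=Y'
  do
    # -F: fixed string, -x: match the whole line, -q: no output
    if grep -Fxq "$setting" "$conf_file"; then
      echo "ok:      $setting"
    else
      echo "missing: $setting"
      rc=1
    fi
  done
  return $rc
}
```

For example, check_cgroup_mem_conf "$LSF_ENVDIR/lsf.conf" prints one line per parameter and returns a nonzero status if any of them is missing. A configuration that sets LSB_RESOURCE_ENFORCE to a different string form is reported as missing by this strict match.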

Note: If LSB_RESOURCE_ENFORCE="memory" is configured, all existing LSF parameters that relate to memory limits, such as LSF_HPC_EXTENSIONS="TASK_MEMLIMIT", LSF_HPC_EXTENSIONS="TASK_SWAPLIMIT", LSB_JOB_MEMLIMIT, and LSB_MEMLIMIT_ENFORCE, are ignored.

For example, submit a parallel job with 3 tasks and a memory limit of 100 MB, with span[ptile=2] so that 2 tasks can run on one host and 1 task can run on another host:

bsub -n 3 -M 100 -R "span[ptile=2]" blaunch ./mem_eater

The application mem_eater continuously increases its memory usage.

The limit on each host is the job memory limit multiplied by the number of job tasks on that host. If 2 tasks run on hosta and 1 task runs on hostb, LSF kills the job as soon as the total memory that the 2 tasks consume on hosta exceeds 200 MB, or the memory that the task consumes on hostb exceeds 100 MB.

LSF does not support per-task memory enforcement for cgroups. For example, if one of the tasks on hosta consumes 150 MB of memory and the other task consumes only 10 MB, the job is not killed because, at that point in time, the total memory that is consumed by the job on hosta is only 160 MB.

Memory enforcement does not apply to accumulated memory usage. For example, suppose that the peak memory usage of task1 on hosta is 150 MB and the peak memory usage of task2 on hosta is 100 MB, so the two peaks add up to 250 MB. If the two peaks never occur at the same time, the two tasks together consume less than 200 MB at any given time, and the job is not killed. The job is killed only if, at a specific point in time, the two tasks together consume more than 200 MB on hosta.
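
The arithmetic in this example can be written out as a short sketch. The function names host_limit_mb and exceeds_host_limit are hypothetical and only restate the rule described above; they do not query LSF or the cgroup. All values are in MB, as in the example.

```shell
# Per-host limit: the job memory limit (-M) multiplied by the number of
# job tasks on that host.
host_limit_mb() {
  per_task_limit_mb=$1
  tasks_on_host=$2
  echo $(( per_task_limit_mb * tasks_on_host ))
}

# Succeeds (exit status 0) if the memory in use at one instant, summed over
# all job tasks on the host, is greater than the per-host limit.
#   $1      per-host limit
#   $2...   memory in use by each task at the same point in time
exceeds_host_limit() {
  limit_mb=$1
  shift
  total_mb=0
  for task_mb in "$@"; do
    total_mb=$(( total_mb + task_mb ))
  done
  [ "$total_mb" -gt "$limit_mb" ]
}
```

With a limit of host_limit_mb 100 2 (200 MB) on hosta, exceeds_host_limit 200 150 10 fails because the simultaneous total is 160 MB, so the job keeps running. The check succeeds only when the usage values passed for the same instant add up to more than 200 MB.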

Note: The cgroup memory subsystem does not separate enforcement of memory usage and swap usage. If a swap limit is specified, limit enforcement differs from previous LSF behavior.

For example, for the following job submission:

bsub -M 100 -v 50 ./mem_eater

After the application uses more than 100 MB of memory, the cgroup will start to use swap for the job process. The job is not killed until the application reaches 150 MB memory usage (100 MB memory + 50 MB swap).

The following job specifies only a swap limit:

bsub -v 50 ./mem_eater

Because no memory limit is specified, LSF considers the memory limit to be the same as the swap limit. The job is killed when it reaches 50 MB of combined memory and swap usage.
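
The two cases above can be summarized in a small sketch. The function name kill_threshold_mb is hypothetical, and the sketch covers only the two submissions shown in this section (both -M and -v, or -v alone) for a job with a single task. Values are in MB.

```shell
# Combined memory-plus-swap usage (MB) at which the job is killed.
#   $1  memory limit from -M, or an empty string if -M is not given
#   $2  swap limit from -v
kill_threshold_mb() {
  mem_limit_mb=$1
  swap_limit_mb=$2
  if [ -z "$mem_limit_mb" ]; then
    # Only -v given: the memory limit is treated as equal to the swap
    # limit, and the job is killed at that combined usage.
    echo "$swap_limit_mb"
  else
    # Both given: the job is killed at memory limit plus swap limit.
    echo $(( mem_limit_mb + swap_limit_mb ))
  fi
}
```

For the first submission, kill_threshold_mb 100 50 gives 150. For the second, kill_threshold_mb "" 50 gives 50.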

Limitations and known issues:
  • For parallel jobs, cgroup limits are only enforced for jobs that are launched through the LSF blaunch framework. Parallel jobs that are launched through LSF PAM/Taskstarter are not supported.

  • On RHEL 6.2, LSF cannot receive notification from the cgroup that memory and swap limits are exceeded. When job memory and swap limits are exceeded, LSF cannot guarantee that the job is killed. On RHEL 6.3, LSF does receive notification and kills the job.

  • On RHEL 6.2, a multithreaded application becomes a zombie process if the application is killed by the cgroup due to memory enforcement. As a result, LSF cannot collect the exit status of the user application, and the LSF processes hang. Because LSF never sees the job exit, the job remains in the running state indefinitely.

Host-based memory and swap limit enforcement by Linux cgroup

When LSB_RESOURCE_ENFORCE="memory" is configured in lsf.conf and memory and swap limits are specified for the job (at the job level with -M and -v, or in lsb.queues or lsb.applications with MEMLIMIT and SWAPLIMIT), the limits that are enforced on each execution host are the specified limits multiplied by the number of job tasks running on that host.

The bsub -hl option enables job-level host-based memory and swap limit enforcement regardless of the number of tasks running on the execution host. LSB_RESOURCE_ENFORCE="memory" must be specified in lsf.conf for host-based memory and swap limit enforcement with the -hl option to take effect. If no memory or swap limit is specified for the job (the merged limit for the job, queue, and application profile, if specified), or LSB_RESOURCE_ENFORCE="memory" is not specified, a host-based memory limit is not set for the job. The -hl option applies only to memory and swap limits; it does not apply to any other resource usage limits.
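
The difference between the default behavior and the -hl option can be illustrated with a small sketch. The function name enforced_host_limit_mb is hypothetical. The sketch also assumes that, with -hl, the specified limit is applied unchanged on each execution host, which is one reading of "regardless of the number of tasks" above. Values are in MB.

```shell
# Hypothetical helper: memory limit (MB) applied on one execution host.
#   $1  limit given with -M (or MEMLIMIT)
#   $2  number of job tasks on the host
#   $3  "hl" if the job was submitted with bsub -hl, anything else otherwise
enforced_host_limit_mb() {
  limit_mb=$1
  tasks_on_host=$2
  mode=$3
  if [ "$mode" = "hl" ]; then
    # Assumption: host-based enforcement applies the limit as given,
    # whatever the number of tasks on the host.
    echo "$limit_mb"
  else
    # Default: the limit is multiplied by the number of tasks on the host.
    echo $(( limit_mb * tasks_on_host ))
  fi
}
```

For a job submitted with -M 100 that runs 2 tasks on a host, enforced_host_limit_mb 100 2 "" gives 200, and enforced_host_limit_mb 100 2 hl gives 100.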
