Out Of Memory (OOM)

This page discusses what happens when a job gets an out of memory (OOM) error under SLURM. The way SLURM handles this event recently changed (Sept. 2025) due to configuration changes made by Research Computing. Generally speaking, when a job runs out of memory it is best to kill the job so that false results are not produced.

Old Behavior

The behavior varies depending on whether you run 1) a single-threaded job, 2) a multi-threaded job, or 3) job steps (for example, launched with the srun command). In the first case the job is killed; in the other two, the remaining threads or steps continue to run.

To activate the old behavior you can set the option: `#SBATCH --oom-kill-step=0`
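As a minimal sketch of how this flag is used in a batch script (the job name, memory request, and program name are placeholders, not a required setup):

```bash
#!/bin/bash
#SBATCH --job-name=oom-demo
#SBATCH --mem=4g                # total memory requested for the job
#SBATCH --oom-kill-step=0       # opt back into the old behavior described above

# Placeholder program; replace with your own executable.
srun ./my_program
```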

New Behavior

The new behavior is to kill the entire job step in which an OOM event occurs (all of its processes and threads, whether on the same node or on different nodes). For jobs that consist of a single step, this means the whole job is killed.
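To illustrate, here is a sketch of a job with two srun steps (node count, memory, and program names are placeholders). Assuming the batch script simply continues after a failed step, an OOM in step 1 now kills every task of step 1 on every node, while step 2 can still run:

```bash
#!/bin/bash
#SBATCH --job-name=two-steps
#SBATCH --nodes=2
#SBATCH --mem=8g

# Step 1: if any task here runs out of memory, the whole step
# (all its tasks, on all nodes) is killed under the new behavior.
srun ./step_one

# Step 2: a separate job step; the script continues to it
# because only the offending step was killed, not the job.
srun ./step_two
```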

Behavior on OpenOnDemand (OOD)

If your session runs out of memory, the entire session will be terminated (this is new) and you will have to log back in. You should then request additional memory for your job, e.g. using the "--mem=" flag. For example, to ask SLURM for 8 GB of memory, use "--mem=8g".
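As a sketch, the memory request can go in a batch script or on the sbatch command line (the script name is a placeholder); for an OOD session you would typically enter the equivalent amount in the session request form:

```bash
# In a batch script:
#SBATCH --mem=8g          # request 8 GB of memory for the job

# Or on the command line when submitting:
sbatch --mem=8g my_script.sh
```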

 

Last Update 10/14/2025 6:16:27 PM