SLURM basic commands

notes
slurm
HPC
Some basic commands to run jobs in SLURM

TL;DR

Basic SLURM job script

Write the following SLURM directives and code in a job script file, e.g. script.sh:

#!/bin/bash

#SBATCH --mail-type=ALL                # Type of events triggering email
#SBATCH --mail-user=[email_address]    # Email address for notifications
#SBATCH --job-name=[name_for_job]      # Job name
#SBATCH --time=50:00:00                # Max. runtime, 50 hours in this case
#SBATCH --mem=128G                     # Define amount of RAM, 128 gigabytes here
#SBATCH --partition=[partition_name]   # Partition to use
#SBATCH --gres=gpu:1                   # Generic resource specification, here a node with 1 GPU
#SBATCH --constraint=A10               # Additional constraint: only nodes with the A10 feature are valid
#SBATCH --no-requeue                   # Do not requeue job if failed


## Bash code to execute
script="$HOME/script.py" # A script to be run
data_path="$HOME/some_data.txt" # Input data
out_path="$PWD" # Path for output

## Execute
srun /usr/bin/nvidia-smi # E.g. to validate that the node has the required GPU
srun python -c 'import torch; print(torch.cuda.is_available())' # To validate that PyTorch detects the GPU
srun python $script $data_path $out_path

Note that lines starting with #SBATCH are SLURM directives that will be interpreted by SLURM.

Then, submit it with the sbatch SLURM command:

sbatch script.sh
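
sbatch prints the ID of the submitted job, and by default the job's standard output and error are written to a file named slurm-[jobID].out in the submission directory. Options can also be passed directly on the sbatch command line, in which case they override the corresponding #SBATCH directives. A minimal sketch, assuming the default output file name:

sbatch --time=10:00:00 script.sh # Command-line options override the #SBATCH directives in the script
tail -f slurm-[jobID].out        # Follow the job's standard output while it runs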

Check status of running jobs

The SLURM command to list running jobs and their status is squeue. In practice the queue usually contains many jobs, so some filtering must be applied, most often by user. This can be done using standard grep with the username of interest:

squeue | grep [username]

or using the specific options of squeue:

squeue -u [username] # or use $(whoami) instead of [username]
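
squeue also supports a custom output format through its -o/--format option, similar to sinfo below. A minimal sketch (the fields and column widths chosen here are just an example):

squeue -u $(whoami) -o '%.10i %.25j %.8T %.10M %.6D %R' # Job ID, name, state, elapsed time, node count, nodes/reason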

For more information, see the squeue documentation.

Cancel a running job

After submitting a job with sbatch, SLURM will output the ID of the submitted job. To cancel the job, use scancel:

scancel [jobID]
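
scancel can also target several jobs at once, e.g. all jobs of a user or all jobs with a given name. A minimal sketch, reusing the placeholders from above:

scancel -u [username]         # Cancel all jobs of a user
scancel --name=[name_for_job] # Cancel jobs whose --job-name matches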

List available nodes and features

Different jobs may have different requirements in terms of memory, cores, GPUs… In the job script we have to specify these requirements as generic resources (via --gres) or constraints (via --constraint). To list the available generic resources and features (used to specify constraints) we use sinfo:

sinfo -o '%25N %5c %10m %40f %G'

where

  • -o is the option to specify the output format
  • '%25N %5c %10m %40f %G' is the format specification:
    • number after % indicates the width of that field
    • %N to show node names
    • %c to show number of cores
    • %m to show available RAM memory
    • %f to show available features (which can be specified in the --constraint option of a job script)
    • %G to show available generic resources (used in the --gres option of a job script)

sinfo provides lots of information about the cluster. For more information and options, see the sinfo documentation.
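
To find out which nodes provide a particular feature or generic resource, the sinfo output can simply be piped through grep. A minimal sketch, using the A10 feature from the TL;DR script:

sinfo -o '%25N %40f %G' | grep A10 # Nodes whose feature list contains A10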

Resources