Slurm

Slurm (Simple Linux Utility for Resource Management) is an open-source, highly configurable, fault-tolerant, and adaptable workload manager, used extensively in High Performance Computing (HPC) environments.

Slurm is designed to accommodate the complex needs of large-scale computational workloads by efficiently distributing and managing tasks across clusters comprising thousands of nodes, offering control over resources, scheduling, and job queuing. On the Discovery cluster, you can also use Slurm features such as job arrays, job monitoring and management, and partition queries with sinfo.

Basic Slurm commands for running, monitoring, and canceling jobs.

Advanced usage and explanation of srun and sbatch for running jobs.

Advanced usage and explanation of squeue, scancel, and sinfo for monitoring and managing jobs.

An introduction to Slurm job arrays, with use cases, for launching a large series of jobs.
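
As a rough illustration of how a job array ties these pieces together, the following is a minimal sketch of a batch script, not a Discovery-specific recipe. The job name, the array range, the time limit, and the `input_<N>.txt` file naming are all hypothetical, and no partition is requested because partition names depend on the cluster.

```shell
#!/bin/bash
# Hypothetical job name, shown in squeue output.
#SBATCH --job-name=array_demo
# Run ten array tasks, indexed 1 through 10 (illustrative range).
#SBATCH --array=1-10
# Wall-clock limit for each task.
#SBATCH --time=00:05:00
# One output file per task: %A is the array job ID, %a is the task index.
#SBATCH --output=array_demo_%A_%a.out

# Map a task index to a hypothetical input file and report it.
describe_task() {
    local task_id="$1"
    local input_file="input_${task_id}.txt"
    echo "Task ${task_id} would process ${input_file}"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; fall back to 1
# so the script can also be tried outside of Slurm.
describe_task "${SLURM_ARRAY_TASK_ID:-1}"
```

Assuming the script is saved as `array_demo.sh` (a hypothetical file name), it would typically be submitted with `sbatch array_demo.sh`, checked with `squeue -u $USER`, and canceled with `scancel` followed by the job ID. In a real workload, the `echo` line would be replaced by the command that processes the selected input.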