Slurm Best Practices#

Slurm is the nerve center of many High-Performance Computing (HPC) clusters. Understanding it well is not optional; it is a requirement for effective computational work. This guide outlines how to navigate Slurm as a user.

Job Submission#

Interactive and Batch Jobs#

  • Batch Jobs: Use sbatch for long-running jobs. Write a batch script that specifies the required resources and runtime parameters, then submit it with sbatch.

  • Interactive Jobs: For testing or debugging, use srun or salloc to work directly on a compute node.
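
As a minimal sketch, an interactive session can look like this (the resource values are illustrative; adjust them for your site):

# open an interactive shell on a compute node
srun --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash

# or reserve an allocation first, then run commands inside it
salloc --cpus-per-task=2 --mem=4G --time=01:00:00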

Sbatch Scripts#

#!/bin/bash
#SBATCH --job-name=example_job    # name shown in squeue output
#SBATCH --nodes=1                 # run on a single node
#SBATCH --cpus-per-task=4         # CPU cores allocated to each task
#SBATCH --mem=8G                  # memory per node
#SBATCH --time=2-00:00:00         # wall-time limit: 2 days

# commands to execute, e.g. ./my_program

The above script is a template. Modify the directives according to your needs.
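
To submit, pass the script to sbatch; Slurm responds with the assigned job ID (the script name below is a placeholder):

sbatch my_job.sh
# Submitted batch job 123456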

Optimizing Resource Usage#

Specify Exact Requirements#

  • Use --cpus-per-task to specify the exact number of CPU cores your job requires.

  • Allocate only the memory you need using --mem (memory per node). Check the limits of the partition you are submitting to before requesting resources.
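
A quick way to check those limits, as a sketch (the partition name is a placeholder):

# CPUs, memory (MB) per node, and time limit for each partition
sinfo -o "%P %c %m %l"

# full configuration of a single partition, including its limits
scontrol show partition mypartition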

Monitoring Jobs#

Job Status#

  • squeue: Shows the status of your submitted jobs.

  • sacct: Reports accounting information for completed jobs, such as state, exit code, and resource usage; useful for diagnosing failures.
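
For example (the job ID is a placeholder):

# your jobs currently in the queue
squeue -u $USER

# accounting summary for a finished job
sacct -j 123456 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS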

Real-Time Monitoring#

  • sstat: Shows real-time CPU, memory, and I/O statistics for your jobs; it only works on jobs that are currently running.
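
A sketch of querying a running job (the job ID is a placeholder; sstat reports on job steps, so the batch step of an sbatch job is addressed as <jobid>.batch):

sstat -j 123456.batch --format=JobID,AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite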

Troubleshooting#

Job Priority#

  • sprio: Shows the priority factors of your pending jobs, which helps explain why they are still queued.
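
For example:

# priority components (age, fair-share, and so on) for your pending jobs
sprio -u $USER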

Logs#

  • Each Slurm job writes its stdout and stderr to a log file, by default slurm-<jobid>.out in the submission directory. Check this file first when a job fails or misbehaves.
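
With the default log name, a running job can be followed like this (the job ID is a placeholder):

# stream a job's output as it is written
tail -f slurm-123456.out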

Data Management#

Input and Output#

  • The --input and --output directives in your sbatch script redirect the job's stdin and stdout; use them to control where a job reads from and writes to.
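
A minimal sketch (the file names are placeholders; %x expands to the job name and %j to the job ID):

#SBATCH --input=params.txt     # file fed to the job's stdin
#SBATCH --output=%x_%j.out     # stdout (also stderr, unless --error is set)
#SBATCH --error=%x_%j.err      # keep stderr in a separate file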

Staging#

  • Utilize tools like rsync to move data to and from the cluster efficiently. Be mindful of disk quotas.
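
A sketch of staging a dataset onto the cluster (the host and paths are placeholders):

# archive mode with compression; the trailing slash copies directory contents
rsync -az --progress ./dataset/ user@cluster.example.org:/scratch/myproject/dataset/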