# Slurm Best Practices
Slurm is the nerve center of many High-Performance Computing (HPC) clusters. Understanding it well is not optional; it is a requirement for effective computational research. This guide outlines how to navigate Slurm as a user.
## Job Submission

### Interactive and Batch Jobs
- **Batch Jobs:** Use `sbatch` for long-running jobs. Create an sbatch script specifying resources and runtime parameters.
- **Interactive Jobs:** For testing or debugging, use `srun` or `salloc`.
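As a sketch, an interactive session might be requested like this (the resource values are illustrative, not recommendations):

```shell
# Allocate resources and drop into a shell on a compute node
srun --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash

# Or reserve an allocation first, then run commands inside it
salloc --nodes=1 --time=01:00:00
```

`srun --pty bash` gives you a terminal directly on the compute node, while `salloc` holds the allocation so you can launch multiple commands against it.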
### sbatch Scripts
```bash
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=2-00:00:00    # two days, in days-hours:minutes:seconds format

# commands to execute
```
The above script is a template. Modify the directives according to your needs.
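Assuming the template is saved as `job.sh` (a hypothetical filename), submission and cancellation look like this:

```shell
# Submit the script; Slurm replies with "Submitted batch job <jobid>"
sbatch job.sh

# Cancel a job by ID if needed (123456 is a placeholder)
scancel 123456
```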
## Optimizing Resource Usage

### Specify Exact Requirements
- Use `--cpus-per-task` to specify the exact number of CPU cores your job requires.
- Allocate only the memory you need using `--mem`. Make sure to check the amount of resources you are allowed to use for a given partition.
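To check what a partition allows before submitting, a quick sketch (the partition name `batch` is a placeholder; use one listed on your cluster):

```shell
# Per-partition time limit, memory per node, and CPUs per node
sinfo -o "%P %l %m %c"

# Full limits and defaults for a single partition
scontrol show partition batch
```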
## Monitoring Jobs

### Job Status
- `squeue`: Shows the status of your submitted jobs.
- `sacct`: Provides accounting information after job completion, which is useful for diagnosing problems.
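Typical invocations might look like this (the job ID is a placeholder):

```shell
# Your pending and running jobs only
squeue -u $USER

# Post-mortem: state, runtime, peak memory, and exit code of a finished job
sacct -j 123456 --format=JobID,State,Elapsed,MaxRSS,ExitCode
```

Comparing `MaxRSS` against the `--mem` you requested is a simple way to right-size future submissions.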
### Real-Time Monitoring
- `sstat`: Your real-time dashboard for running jobs, showing CPU, memory, and I/O metrics. Note that it only reports on jobs that are currently running.
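For a job launched with `sbatch`, the metrics usually live on the `.batch` step, so a sketch looks like this (job ID is a placeholder):

```shell
# Live CPU, memory, and disk I/O figures for a running batch job
sstat -j 123456.batch --format=AveCPU,MaxRSS,MaxDiskRead,MaxDiskWrite
```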
## Troubleshooting

### Job Priority
- `sprio`: Use this to understand the priority of your pending jobs and why they might be queued.
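A sketch of how you might inspect priority (job ID is a placeholder):

```shell
# Priority of all your pending jobs
sprio -u $USER

# Long format: the per-factor breakdown (age, fair-share, etc.) for one job
sprio -l -j 123456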
### Logs
Each Slurm job generates logs; by default, `sbatch` writes combined stdout and stderr to `slurm-<jobid>.out` in the directory you submitted from. Know how to access and interpret them.
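A quick sketch of inspecting a job's log (the job ID is a placeholder):

```shell
# Follow a running job's output live
tail -f slurm-123456.out

# Scan a finished job's log for problems
grep -i error slurm-123456.out
```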
## Data Management

### Input and Output
Use the `--input` and `--output` options in your sbatch script to redirect your job's standard input and output to files.
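As a sketch, the directives might look like this (the filenames are hypothetical; `%j` expands to the job ID and `%x` to the job name):

```shell
#SBATCH --input=params.txt        # file fed to the job's stdin
#SBATCH --output=%x-%j.out        # stdout, e.g. example_job-123456.out
#SBATCH --error=%x-%j.err         # keep stderr in a separate file
```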
### Staging
Utilize tools like `rsync` to move data to and from the cluster efficiently. Be mindful of disk quotas.
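A staging sketch; the hostname and paths are placeholders for your site's login node and scratch filesystem:

```shell
# Push input data to the cluster (archive mode, compressed, with progress)
rsync -avz --progress data/ user@cluster.example.edu:/scratch/$USER/project/data/

# Pull results back after the job finishes
rsync -avz user@cluster.example.edu:/scratch/$USER/project/results/ results/
```

Because `rsync` only transfers changed files, re-running the same command after a partial or interrupted copy is cheap.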