Using Slurm#
Slurm Overview#
Slurm (Simple Linux Utility for Resource Management) is an open-source, highly configurable, fault-tolerant, and adaptable workload manager. It is extensively used across High-Performance Computing (HPC) environments.
Slurm is designed to accommodate the complex needs of large-scale computational workloads. It can efficiently distribute and manage tasks across clusters comprising thousands of nodes, offering seamless control over resources, scheduling, and job queuing. On the HPC, Slurm provides the functionality covered in this guide: job arrays, job management, account information, and cluster and node state queries (sinfo).
Slurm on HPC#
HPC systems are designed to perform complex, computationally intensive tasks. Efficiently managing these tasks and resources in such an environment is a daunting challenge, and that's where Slurm comes into play. For example, users can specify complex workflows of jobs where specific jobs depend on others, and Slurm will manage the scheduling and execution of these workflows.
Slurm allows users to submit their computational tasks as jobs to be scheduled on the cluster’s compute nodes. Its role-based access control ensures proper resource allocation and job execution, preventing resource conflicts.
Slurm is crucial in research environments, where it ensures fair usage of resources among a multitude of users, helps optimize the workload for the available resources, and provides precise job accounting and statistics.
Page Objective:#
To understand the Slurm workload manager, which will allow you to properly leverage the HPC. It starts with the basics: the resources that Slurm manages. Then, useful Slurm features (e.g., job submission, monitoring, canceling) are covered with code examples. We discuss jobs that are both interactive (i.e., Interactive Jobs: srun Command) and batch (i.e., Batch Jobs: sbatch), along with the job array variants (i.e., Slurm Job Arrays). Advanced Usage, Common Problems and Troubleshooting, and Best Practices are also covered.
Who Should Use This Guide?#
This guide is for HPC users: researchers intending to use Slurm-based clusters for their computation tasks, system administrators managing HPC environments, and even seasoned HPC users looking to brush up on their knowledge. It progresses from fundamental to advanced topics, making it a valuable resource for a broad audience.
Slurm: Basic Concepts#
Before we delve into using Slurm, it’s essential to grasp some fundamental concepts related to its operation.
Nodes#
In the context of Slurm, a 'node' refers to a server within the HPC cluster. Each node possesses its own resources, such as CPUs, memory, storage, and potentially GPUs. Slurm manages these nodes and allocates their resources to tasks.
Partition(s)#
A ‘partition’ is a grouping of nodes. You can think of partitions as virtual clusters within your HPC system. They allow administrators to segregate the compute environment based on factors like job sizes, hardware type, or resource allocation policies.
See also
Our Partitions documentation.
Account information#
When running a job with either srun or sbatch, if you have more than one account associated with your username, we recommend you use the --account= flag and specify the account that corresponds to the respective project.
To find out what account(s) your username is associated with, use the following command:
sacctmgr show associations user=<yourusername>
After you have determined which accounts your username is associated with, if you have more than one account association, you can use the --account= flag with your usual srun or sbatch commands.
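As a sketch of the flag above (the account name myproject and the script name my_job.sh are placeholders, not real names on the cluster), guarded so the commands are skipped on machines without Slurm installed:

```shell
# Placeholder account taken from `sacctmgr show associations`; substitute your own.
account="myproject"
# Guard: only call sbatch where Slurm is actually installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --account="$account" my_job.sh
else
    echo "Would run: sbatch --account=$account my_job.sh"
fi
```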
Jobs, Job Steps, and Job Arrays#
A job in Slurm is a user-defined computational task that’s submitted to the cluster for execution. Each job has one or more job steps, sub-tasks that are part of a larger job and can be executed in parallel or sequentially.
Job arrays are a series of similar jobs that differ only by the array index. They’re especially useful when you want to execute the same application with different inputs.
Tasks#
Tasks are the individual processes that run within a job step. They could be single-threaded or multi-threaded and can run on one or more nodes.
Understanding the Configuration File#
Note
Most users need not be concerned with Slurm configuration details, beyond being aware that the configuration exists.
See also
Slurm documentation for a complete list of available parameters, along with their meanings.
The slurm.conf file is the primary configuration file for Slurm. It contains the parameters that govern the behavior of the Slurm controller, nodes, and partitions.
Example of a very basic slurm.conf file:
# Basic slurm.conf file
ControlMachine=slurm-controller # Hostname of the control node
AuthType=auth/munge # Authentication type
CryptoType=crypto/munge # Cryptography type
MpiDefault=none # Default MPI type
ProctrackType=proctrack/pgid # Process tracking type
ReturnToService=1 # Return failed nodes to service
SlurmctldPidFile=/var/run/slurmctld.pid # PID file for the Slurm controller
SlurmctldPort=6817 # Communication port for the Slurm controller
SlurmdPidFile=/var/run/slurmd.pid # PID file for the Slurm daemon
SlurmdPort=6818 # Communication port for the Slurm daemon
SlurmdSpoolDir=/var/spool/slurmd # Spool directory for the Slurm daemon
SlurmUser=slurm # User the Slurm daemon runs as
StateSaveLocation=/var/spool/slurmctld # Where state information is saved
SwitchType=switch/none # Job switch type
TaskPlugin=task/none # Task plugin
#
# Node definitions
NodeName=compute[1-16] CPUs=1 State=UNKNOWN
#
# Partition definitions
PartitionName=test Nodes=compute[1-16] Default=YES MaxTime=INFINITE State=UP
Basic Slurm Usage#
To submit your job script to Slurm, you use the sbatch command:
sbatch job_script.sh
Slurm will then schedule your job to run when resources are available. It returns a job ID that you can use to monitor your job’s status.
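The returned job ID can also be captured for scripting; --parsable is a standard sbatch flag that prints only the ID. A sketch (job_script.sh is a placeholder), guarded so it is harmless on machines without Slurm:

```shell
# --parsable makes sbatch print just the job ID (no "Submitted batch job" text).
if command -v sbatch >/dev/null 2>&1; then
    jobid=$(sbatch --parsable job_script.sh)
    squeue -j "$jobid"          # check its status using the captured ID
else
    jobid="unavailable"         # no Slurm on this machine
fi
echo "Job ID: $jobid"
```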
To check the status of your job, you use the squeue command:
squeue -u <username>
To cancel a job, you use the scancel command with the job ID:
scancel <job_id>
To check detailed information about a job, use the scontrol command:
scontrol show job <job_id>
This information is crucial for managing your jobs and ensuring they are running as expected.
To view cluster information, use sinfo: this command allows you to view partition and node information. Use the -a option to view all partitions.
sinfo <options>
Details on each of the commands above, and more, are covered in the following sections.
Allocating Resources#
You have two options when running tasks: interactively via Interactive Jobs: srun Command, or in batch via Batch Jobs: sbatch.
For parallel tasks, one option is to treat each task as a separate job and run them independently. The other is to allocate resources for all the jobs simultaneously, allowing them to overlap (share CPUs, RAM, etc.). This is done with the --overlap flag. The assumption must be that not all tasks require all resources simultaneously; this creates a more natural working environment, and resources are not wasted on idle time.
Note
While the sbatch and srun commands request resource allocation if none exists, using salloc allows us to separate the allocation and submission processes.
By default, Slurm assumes there is no overlap between CPUs: tasks do not share CPUs with other tasks running in parallel. If overlap is needed, use the following Slurm flag:
--overlap
Also, set the environment variable SLURM_OVERLAP=1 via:
export SLURM_OVERLAP=1
Important
Run export SLURM_OVERLAP=1 prior to logging onto a compute node when using MPI interactively.
Batch Jobs: sbatch#
The sbatch command is used to submit a job script for later execution. The script includes #SBATCH directives that control job parameters like the number of nodes, CPUs per task, job name, etc.
Syntax: sbatch#
sbatch [options] <script_file>
Options and Usage: sbatch#
-n, --ntasks=<number>: specify the number of tasks
-N, --nodes=<minnodes[-maxnodes]>: specify the number of nodes
-J, --job-name=<jobname>: specify a name for the job
#!/bin/bash
#SBATCH -J MyJob # Job name
#SBATCH -N 2 # Number of nodes
#SBATCH -n 16 # Number of tasks
#SBATCH -o output_%j.txt # Standard output file
#SBATCH -e error_%j.txt # Standard error file
# Your program/command here
srun ./my_program
To submit this job script, save it as my_job.sh and run:
sbatch my_job.sh
Examples using sbatch#
Single node#
Run a job on one node for four hours on the short partition:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --partition=short
# <commands to execute>
Single node with additional memory#
The default memory per allocated core is 1.95GB. If calculations attempt to access more memory than allocated, Slurm automatically terminates the job. Request a specific amount of memory in the job script if calculations require more than the default. The example script below requests 100GB of memory (--mem=100G). Use one capital letter to abbreviate the unit of memory (i.e., kilo K, mega M, giga G, and tera T) with the --mem= option, as that is what Slurm expects to see:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --mem=100G
#SBATCH --partition=short
# <commands to execute>
Single node with exclusive use#
If you need exclusive use of a node, such as when you have a job that has high I/O requirements, you can use the exclusive flag. The example script below specifies the exclusive use of one node in the short partition for four hours:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=4:00:00
#SBATCH --job-name=MyJobName
#SBATCH --exclusive
#SBATCH --partition=short
# <commands to execute>
Interactive Jobs: srun Command#
The srun command is used to submit an interactive job, which runs a single task directly via the shell. This method is useful when you want to test a short computation or run an interactive session like a shell, Python, or an R terminal.
Syntax: srun#
srun [options] [command]
Options and Usage: srun#
-n, --ntasks=<number>: specify the number of tasks
-N, --nodes=<minnodes[-maxnodes]>: specify the number of nodes
-J, --job-name=<jobname>: specify a name for the job
srun -N 1 -n 1 --pty bash
This command starts an interactive bash shell on one node with one task.
Examples using srun#
Review the Hardware Overview and Partitions pages to be familiar with the available hardware and partition limits on Discovery. This way, you can tailor your request to fit both the needs of the job and the limits of the partitions. For example, specifying --partition=express and --time=04:00:00 will result in an error because the time specified exceeds the limit for that partition.
This simple srun example moves you to a compute node after you first log into the HPC:
srun --pty /bin/bash
To request one node and one task for 30 minutes with X11 forwarding on the short partition, type:
srun --partition=short --nodes=1 --ntasks=1 --x11 --mem=10G --time=00:30:00 --pty /bin/bash
To request one node, with 10 tasks and 2 CPUs per task (a total of 20 CPUs), 1 GB of memory, for one hour on the short partition, type:
srun --partition=short --nodes 1 --ntasks 10 --cpus-per-task 2 --pty --mem=1G --time=01:00:00 /bin/bash
To request two nodes, each with 10 tasks per node and 2 CPUs per task (a total of 40 CPUs), 1 GB of memory, for one hour on the short partition, type:
srun --partition=short --nodes=2 --ntasks 10 --cpus-per-task 2 --pty --mem=1G --time=01:00:00 /bin/bash
To allocate a GPU node, specify the gpu partition and use the --gres option:
srun --partition=gpu --nodes=1 --ntasks=1 --gres=gpu:1 --mem=1G --time=01:00:00 --pty /bin/bash
Slurm Job Arrays#
Job arrays are a convenient way to submit and manage large numbers of similar jobs quickly. They allow millions of tasks to be submitted in milliseconds, provided they are within size limits. Job arrays are particularly useful when running similar jobs, such as performing the same analysis with different inputs or parameters.
Using job arrays can save time and reduce the amount of manual work required. Instead of submitting each job individually, you can submit a single job array and let Slurm handle the scheduling of individual jobs. This approach is beneficial if you have limited time or resources, as it allows you to use the cluster’s computing power more efficiently by running multiple jobs in parallel.
There are several ways to define job arrays, such as specifying the range of indices or providing a list of indices in a file. Slurm also offers various features to manage and track job arrays, such as options to simultaneously suspend, resume, or cancel all jobs in the array.
Syntax: Job Arrays#
The most basic configuration for a job array is as follows:
#!/bin/bash
#SBATCH --partition=short
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --array=1-6
This configuration runs the same script six times using Slurm job arrays. Each job array task has two additional environment variables set: SLURM_ARRAY_JOB_ID (%A) is set to the first job ID of the array, and SLURM_ARRAY_TASK_ID (%a) is set to the job array index value.
Note
Both SLURM_ARRAY_JOB_ID (%A) and SLURM_ARRAY_TASK_ID (%a) are referenced when naming outputs so files are not overwritten when a "task" (i.e., one of the executions of the script through the job array) finishes.
Tip
Generally, we want to pass the task ID as an argument to our script. If you are using R, you can retrieve it with task_id <- Sys.getenv("SLURM_ARRAY_TASK_ID"). If you are using job arrays with Python, you can obtain the task ID using the following:
import os
task_id = os.getenv('SLURM_ARRAY_TASK_ID')
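To see how this variable drives per-task behavior, you can simulate a single array task in a plain shell by setting the variable yourself (input_3.txt is a hypothetical input name):

```shell
# Simulate one array task: inside a real array job, Slurm sets this automatically.
export SLURM_ARRAY_TASK_ID=3
input_file="input_${SLURM_ARRAY_TASK_ID}.txt"
echo "This task would process $input_file"   # -> input_3.txt
```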
When submitting a large array, use the % symbol to limit how many tasks run simultaneously. For example, the following code specifies an array of 600 jobs, with 20 running at a time:
#!/bin/bash
#SBATCH --partition=short
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --array=1-600%20
Whenever you specify the memory, number of nodes, number of CPUs, or other specifications, they will be applied to each task. Therefore, if we set the header of our submission file as follows:
#!/bin/bash
#SBATCH --partition=short
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --array=1-600%20
#SBATCH --mem=128G
#SBATCH --nodes=2
Slurm will run 20 jobs simultaneously. Each job, represented by a task ID, will use two nodes with 128GB of RAM each. In most cases, a single node per task is sufficient.
Lastly, job arrays are usually used for embarrassingly parallel jobs. If the job executed at each array index does not use any multi-threading libraries, you can use the following header to avoid wasting resources:
#!/bin/bash
#SBATCH --partition=short
#SBATCH --job-name=jarray-example
#SBATCH --output=out/array_%A_%a.out
#SBATCH --error=err/array_%A_%a.err
#SBATCH --array=1-600%50 # 50 is the maximum number
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G # adjust after testing how much RAM each task needs
Warning
50 is the maximum number of jobs allowed to be run at once per user-account.
Array options can also be passed directly on the command line rather than in the script. For instance:
sbatch --array=<indexes> [options] script_file
Indexes can be listed as 1-5 (i.e., one to five), 1,2,3,5,8,13 (i.e., each index listed), or 1-200%5 (i.e., a 200-task job array with only 5 tasks active at any given time). The % sign limits how many tasks run at once (again, it cannot be set larger than 50).
Use-cases: Job Arrays#
Job arrays can be used in situations where you have to process multiple data files using the same procedure or program. Instead of creating multiple scripts or running the same script multiple times, you can create a job array, and Slurm will handle the parallel execution for you.
Example using Job Array Flag#
In the following script, the $SLURM_ARRAY_TASK_ID variable is used to differentiate between array tasks.
#!/bin/bash
#SBATCH -J MyArrayJob # Job name
#SBATCH -N 1 # Number of nodes
#SBATCH -n 1 # Number of tasks
#SBATCH -o output_%A_%a.txt # Standard output file (%A for array job ID, %a for array index)
#SBATCH -e error_%A_%a.txt # Standard error file
# Your program/command here
srun ./my_program input_$SLURM_ARRAY_TASK_ID
To submit this job array, save it as my_array_job.sh and run:
sbatch --array=1-50 my_array_job.sh
This command will submit 50 jobs, running my_program with input_1 through input_50.
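A common related pattern, sketched below with a hypothetical params.txt, is to map each task ID to one line of a parameter file:

```shell
# Hypothetical parameter file: one parameter set per line, one line per array task.
printf 'alpha=0.1\nalpha=0.5\nalpha=0.9\n' > params.txt
SLURM_ARRAY_TASK_ID=2        # set by Slurm inside a real array job
# sed -n "Np" prints only line N, selecting this task's parameters.
params=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
echo "Task $SLURM_ARRAY_TASK_ID uses: $params"   # -> alpha=0.5
```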
Job Management#
Managing jobs in a Slurm-based HPC environment involves monitoring running jobs, modifying job parameters, and canceling jobs when necessary. This section will cover the commands and techniques you can use for these tasks.
Monitoring Jobs#
The squeue command allows you to monitor the state of jobs in the queue. It provides information such as the job ID, the partition it's running on, the job name, and more.
Syntax: squeue#
squeue [options]
Options and Usage: squeue#
-j, --jobs=<job_id>: display information about specific job(s)
-u, --user=<user_name>: display jobs for a specific user
-l, --long: display more information (long format)
Code Example of Job Monitoring#
To monitor all jobs of a specific user, use the following command:
squeue -u <username>
To monitor a specific job, use:
squeue -j <job_id>
Modifying Jobs#
The scontrol command is a command-line utility that allows users to view and control Slurm jobs and job-related resources. It provides a way to check the status of jobs, modify job properties, and perform other job-related tasks.
You can monitor your jobs by using the scontrol command. Type scontrol show jobid -d <JOBID>, where JOBID is the number of your job. When you submit your srun command, Slurm displays the unique ID number of your job (e.g., job 12962519). This is the number you use with scontrol to monitor your job.
Using scontrol#
Some of the tasks that can be done using scontrol include:
Viewing job status and properties: scontrol can display detailed information about a job, including its status, node allocation, and other properties.
Modifying job properties: scontrol allows users to modify job properties such as the job name, the number of nodes, the time limit, and other parameters.
Managing job dependencies: scontrol provides a way to specify job dependencies and view the dependencies between jobs.
Suspending and resuming jobs: scontrol can stop and resume running jobs, allowing users to temporarily halt jobs or continue them as needed.
Canceling jobs: scontrol can cancel jobs that are running or queued, allowing users to stop jobs that are no longer needed.
Overall, scontrol is a powerful tool for managing Slurm jobs and job-related resources. Its command-line interface allows users to perform a wide range of tasks, from checking the status of jobs to modifying job properties and managing dependencies.
Controlling jobs: scontrol#
Place a hold on a pending job, i.e., prevent the specified job from starting. <job_list> is either a space-separated list of job IDs or job names.
scontrol hold <jobid>
Release a held job, i.e., permit the specified job to start (see hold):
scontrol release <jobid>
Re-queue a completed, failed, or cancelled job:
scontrol requeue <jobid>
For more information on the commands listed above, along with a complete list of scontrol commands, run:
scontrol --help
Syntax: scontrol#
scontrol [command] [options]
Example: scontrol#
scontrol show jobid -d <JOBID>
Options and Usage: scontrol#
update: used to modify job or system configuration
hold jobid=<job_id>: hold a specific job
release jobid=<job_id>: release a specific job
requeue jobid=<job_id>: requeue a specific job
Examples using scontrol#
View information about a specific node, including the node name, state, number of CPUs, and amount of memory:
scontrol show node -d <node_name>
For information on all reservations in the cluster:
scontrol show reservations
View information about a specific job, including the job ID, state, username, and partition name:
scontrol show job <job_id>
To view information about a specific reservation (e.g., found via scontrol show reservations listed above), including the reservation name, start time, end time, and nodes included in the reservation:
scontrol show reservation <reservation_name>
Cancelling Jobs: scancel#
The scancel command is used to cancel a running or pending job. Once cancelled, a job cannot be resumed.
Syntax: scancel#
scancel [options] [job_id]
Options and Usage: scancel#
-u, --user=<user_name>: cancel all jobs of a specific user
--name=<job_name>: cancel all jobs with a specific name
Examples using scancel#
To cancel a specific job, use:
scancel <job_id>
To cancel all jobs of a specific user:
scancel -u <username>
To cancel all jobs with a specific name:
scancel --name=<job_name>
This job management section aims to give you a solid understanding of how to manage and control your jobs effectively. Always monitor your jobs regularly and adjust parameters as needed to achieve the best performance.
Cluster and Node States: sinfo#
Below are some more examples of using sinfo and scontrol to provide information about the state of the cluster and specific nodes.
Using sinfo#
The sinfo command will show information about all partitions in the cluster, including the partition name, available nodes, and status. By default, sinfo reports:
| Field | Description |
| --- | --- |
| PARTITION | The list of the cluster's partitions; a set of compute nodes grouped logically |
| AVAIL | The active state of the partition (up or down) |
| TIMELIMIT | The maximum job execution wall-time per partition |
| NODES | The total number of nodes per partition |
| STATE | See STATE table below |
| NODELIST(REASON) | The list of nodes per partition |
Examples using sinfo#
View information about all partitions:
sinfo -a
Or, a specific partition, which gives all the nodes and the states the nodes are in at the current time:
sinfo -p gpu
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 8:00:00 5 drain* c[2171,2184,2188],d[1008,1014]
gpu up 8:00:00 3 down* c2162,d[1006,1017]
gpu up 8:00:00 1 drain d1025
gpu up 8:00:00 2 resv c2177,d1029
gpu up 8:00:00 50 mix c[2160,2163-2170,2172-2176,2178-2179,2185-2187,2189-2195,2204-2207],d[1001,1003-1005,1007,1009-1013,1016,1018,1020-1024,1026-1028]
gpu up 8:00:00 3 alloc d[1002,1015,1019]
gpu up 8:00:00 4 idle c[2180-2183]
The current TimeLimit for the queues:
sinfo -o "%12P %.10A %.11l"
PARTITION NODES(A/I) TIMELIMIT
debug 402/174 20:00
express 403/180 1:00:00
short* 401/178 1-00:00:00
long 224/47 5-00:00:00
large 376/172 6:00:00
gpu 41/17 8:00:00
multigpu 41/17 1-00:00:00
lowpriority 118/102 1-00:00:00
reservation 617/402 100-00:00:0
ai-jumpstart 2/15 2-00:00:00
allshouse 5/7 infinite
bansil 15/4 30-00:00:00
ce-mri 3/10 30-00:00:00
chen 0/12 30-00:00:00
ctbp 0/20 30-00:00:00
...
View information about a specific partition (e.g., short, gpu, long):
sinfo -p <partition_name>
Or, only view nodes in a certain state:
sinfo -p <partition> -t <state>
You can use the --Format flag to get more information or a specific format for the output:
sinfo -p <partition> -t idle --Format=gres,nodes
The command below will show detailed information about all nodes in the cluster, including the node name, state, CPU architecture, memory, and available features:
sinfo -N -l
View what features a node has:
sinfo -n <node> --Format=nodes,nodelist,statecompact,features
View which nodes are in which state in a partition using statecompact:
sinfo -p <partition> --Format=time,nodes,statecompact,features,memory,cpus,nodelist
Advanced Usage#
Advanced usage of Slurm involves working with Multi-node Jobs, GPU Jobs, and understanding Priority and QoS parameters. It also involves Memory Management and Using Environment Variables in Job Scripts.
Multi-node Jobs#
Multi-node jobs involve executing a single job across multiple nodes. Such jobs are typically used for computationally intensive tasks that require significant parallelization.
Use-cases: Multi-node Jobs#
Multi-node jobs are used in scenarios where tasks can be broken down into sub-tasks that can be executed in parallel, such as simulation and modeling, machine learning training, or big data analysis.
Code Example for Multi-node Job Submission#
#!/bin/bash
#SBATCH -J MultiNodeJob # Job name
#SBATCH -N 4 # Number of nodes
#SBATCH -n 16 # Number of tasks
# Your program/command here
srun ./my_program
GPU Jobs#
Slurm can also manage GPU resources, allowing you to specify GPU requirements in your job scripts.
Use-cases: GPUs#
GPU jobs are used in scenarios where tasks are parallelized and can benefit from the high computational capabilities of GPUs, such as machine learning and deep learning workloads, image processing, or simulations.
Code Example for GPU Job Submission#
#!/bin/bash
#SBATCH -J GPUJob # Job name
#SBATCH -N 1 # Number of nodes
#SBATCH -n 1 # Number of tasks
#SBATCH --gres=gpu:1 # Number of GPUs
#SBATCH -p gpu # gpu partition
# Your program/command here
srun ./my_gpu_program
Priority and QoS#
Slurm uses priority and Quality of Service (QoS) parameters to determine the order in which jobs are scheduled.
Understanding Job Priorities and Quality of Service (QoS) Parameters#
The job priority is a numerical value assigned to each job, determining its position in the queue. Higher priority jobs are scheduled before lower priority ones. Quality of Service (QoS) parameters control various job limits, such as the maximum allowed job runtime, the maximum number of CPUs or nodes a job can use, etc.
Code Example to Manipulate Job Priority#
scontrol update jobid=<job_id> priority=<new_priority>
Memory Management#
In Slurm, memory allocation can be controlled at the job or per-CPU level using the --mem or --mem-per-cpu options, respectively.
Code example for specifying memory in job scripts#
#!/bin/bash
#SBATCH -J MyJob # Job name
#SBATCH -N 1 # Number of nodes
#SBATCH -n 4 # Number of tasks
#SBATCH --mem=8G # Memory for the entire job
# Your program/command here
srun ./my_program
We can also specify memory per CPU.
#!/bin/bash
#SBATCH -J MyJob # Job name
#SBATCH -N 1 # Number of nodes
#SBATCH -n 4 # Number of tasks
#SBATCH --mem-per-cpu=2G # Memory per task (CPU)
# Your program/command here
srun ./my_program
Note
Either --mem-per-cpu or --mem can be specified as an sbatch directive, but not both.
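With --mem-per-cpu, the total request scales with the number of allocated CPUs. A quick sanity check for the second script above (4 tasks, 1 CPU each by default, 2G per CPU):

```shell
# Total memory implied by --mem-per-cpu: per-CPU value times allocated CPUs.
ntasks=4
cpus_per_task=1          # the default when --cpus-per-task is not given
mem_per_cpu_gb=2
total_gb=$(( ntasks * cpus_per_task * mem_per_cpu_gb ))
echo "Total request: ${total_gb}G"   # 4 x 1 x 2G = 8G
```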
Using Environment Variables in Job Scripts#
Slurm sets several environment variables that you can use in your job scripts to control the job behavior dynamically. Some of these include SLURM_JOB_ID, SLURM_JOB_NUM_NODES, SLURM_JOB_NODELIST, etc.
Code Example Showcasing Use of Environment Variables#
#!/bin/bash
#SBATCH -J MyJob # Job name
#SBATCH -N 2 # Number of nodes
#SBATCH -n 8 # Number of tasks
echo "Job ID: $SLURM_JOB_ID"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "Node list: $SLURM_JOB_NODELIST"
# Your program/command here
srun ./my_program
Common Problems and Troubleshooting#
Despite Slurm's flexibility and robustness, it's not uncommon to encounter issues when using it. Here we'll explore some common problems and provide strategies for debugging and optimizing job scripts.
Commonly Encountered Issues in Using Slurm#
Job Stuck in Queue: If your job is stuck in the queue and not getting scheduled, it may be due to insufficient available resources, low priority, or system limits set by the Quality of Service (QoS) parameters.
Solution: Check the job requirements, priority, and QoS parameters using scontrol show job <job_id>. Ensure your job requirements do not exceed available resources and system limits.
Job Failed with Non-zero Exit Code: If your job script or the program it's running exits with a non-zero code, it indicates an error.
Solution: Check the job's output and error files for any error messages. They can provide valuable clues about what went wrong.
Insufficient Memory: If your job fails with "Out Of Memory" (OOM) errors, it means it's using more memory than allocated.
Solution: Increase the --mem or --mem-per-cpu value in your job script. Remember that requesting more memory might increase the time your job spends in the queue.
Strategies for Debugging and Optimizing Job Scripts#
Testing Job Scripts Interactively: Use the srun command to run your job script interactively for debugging purposes. This approach allows you to observe the program's behavior in real-time.
srun --pty bash -i
Using the echo Command for Debugging: Use the echo command in your job script to print the values of variables, command outputs, etc., to the output file. This method can help you understand the script's flow and pinpoint any issues.
echo "Value of variable x is $x"
Optimizing Job Resources: Monitor your job's resource usage using sstat <job_id> and adjust the resource requirements accordingly in your job script. Requesting more resources than needed can result in your job spending more time in the queue, while requesting less than needed can lead to job failures.
Remember, troubleshooting requires patience and a systematic approach. Begin by identifying the problem, then hypothesize potential causes, test those hypotheses, and apply solutions. Make small changes one at a time and retest after each change. With experience, you will be able to troubleshoot effectively and make the most of your HPC resources.
Best Practices#
In this section, we will discuss some best practices that can help you make efficient use of resources, write optimized job scripts, and ensure maximum throughput and minimal queue time in a Slurm-based HPC environment.
Efficient Usage of Resources#
Request Only What You Need: When submitting a job, request only the resources that you need. Overestimating your requirements can result in longer queue times as Slurm waits for the requested resources to become available.
Use Job Arrays for Similar Jobs: If you need to run multiple similar jobs, consider using job arrays. This approach makes job management more straightforward and reduces overhead.
Use Appropriate Partitions: Select the appropriate partition for your job based on your requirements. Each partition may have different limits and priorities, so choose the one that suits your needs best.
Writing Optimized Job Scripts#
Use Environment Variables: Slurm provides several environment variables that can be used to customize job behavior dynamically. Use these variables to make your scripts more flexible and efficient.
Specify All Necessary Options: Ensure to specify all necessary SBATCH options in your job script. Missing options can lead to unpredictable job behavior or performance.
Check Exit Codes: Always check the exit codes of commands in your job script. Non-zero exit codes usually indicate an error, and failing to check these can lead to undetected job failures.
command
if [ $? -ne 0 ]; then
echo "Command failed"
exit 1
fi
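An alternative to checking $? after every command, sketched under the assumption that your job runs a sequence of steps, is to run them under set -e so the first failure aborts the whole sequence:

```shell
# Run the steps in a subshell with `set -e`: the first failing command aborts it.
if ( set -e
     echo "step 1"        # stand-ins for your real commands
     echo "step 2" )
then
    status="succeeded"
else
    status="failed"
fi
echo "All steps $status"
```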
Guidelines to Ensure Maximum Throughput and Minimal Queue Time#
Monitor Job Performance: Regularly monitor your jobs' performance using sstat and adjust resource requests as needed. This can help improve job scheduling efficiency.
Use the time Option Judiciously: While it's crucial to give your job enough time to complete, overestimating can lead to longer queue times. Start with a reasonable estimate and adjust based on actual run times.
Prioritize Critical Jobs: If you have a critical job that needs to be run immediately, you can temporarily increase its priority using scontrol. However, use this feature sparingly to maintain fairness.
Remember, these best practices aim to ensure efficient and fair usage of shared HPC resources. Always be mindful of other users and try to use the resources in a way that maximizes productivity for everyone.
Proper resource request syntax#
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --ntasks=4 # Number of tasks
#SBATCH --cpus-per-task=2 # Number of CPUs per task
#SBATCH --mem-per-cpu=4G # Memory per CPU
#SBATCH --time=02:00:00 # Time limit
#SBATCH --partition=my_partition # Partition (queue) to use
Use environment modules: Load the necessary modules before running your job. In this example, we load the Python and TensorFlow modules:
module load python/3.8
module load tensorflow/2.5.0
For further reading and resources, consider the following:
High Performance Computing For Dummies, IBM Limited Edition.
If you need further support, please contact your system administrator or visit the Slurm community page for mailing lists and support forums.
Appendix#
Glossary of Slurm Terms#
Node: A computer or server in the HPC cluster.
Partition: A group of nodes with specific attributes and limits.
Job: A user-submitted work request.
Task: A unit of work within a job.
Full List of Slurm Commands and Their Options#
Sample Job Scripts for Various Use-cases#
Slurm FAQs#
Slurm References#
SchedMD. (2023). Slurm Workload Manager. https://slurm.schedmd.com
SchedMD. (2023). Slurm Quick Start User Guide. https://slurm.schedmd.com/quickstart.html
IBM. (2023). High Performance Computing For Dummies, IBM Limited Edition. https://www.ibm.com/downloads/cas/WQDZWBYJ
Thank you for following along with this guide, and we wish you success in your HPC journey!