Monitoring and Managing Jobs

You can use Slurm commands to check the status of the cluster, inspect the compute nodes, and monitor your jobs.

Query Partitions: sinfo

The sinfo command shows information about all partitions in the cluster, including the partition name, node availability, and node state. To view a specific partition (e.g., short, gpu, long), pass its name with -p:

sinfo -p <partition_name>

For example:

sinfo -p gpu
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up    8:00:00      5 drain* c[2171,2184,2188],d[1008,1014]
gpu          up    8:00:00      3  down* c2162,d[1006,1017]
gpu          up    8:00:00      1  drain d1025
gpu          up    8:00:00      2   resv c2177,d1029
gpu          up    8:00:00     50    mix c[2160,2163-2170,2172-2176,2178-2179,2185-2187,2189-2195,2204-2207],d[1001,1003-1005,1007,1009-1013,1016,1018,1020-1024,1026-1028]
gpu          up    8:00:00      3  alloc d[1002,1015,1019]
gpu          up    8:00:00      4   idle c[2180-2183]
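When a partition spans many nodes, it can help to total the NODES column per STATE. The following is a minimal sketch, assuming the default column layout shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name count_nodes_by_state is hypothetical and not part of Slurm.

```shell
# Hypothetical helper: sum the NODES column (field 4) per STATE (field 5)
# from `sinfo -p <partition>` output read on stdin.
# Assumes the default sinfo layout, with a header on the first line.
count_nodes_by_state() {
  awk 'NR > 1 && NF >= 5 { total[$5] += $4 }
       END { for (s in total) print s, total[s] }' | sort
}

# Query the live cluster only when sinfo is actually installed.
if command -v sinfo >/dev/null 2>&1; then
  sinfo -p gpu | count_nodes_by_state
fi
```

For the sample output above, this would report 4 idle nodes, 50 mixed nodes, and so on, one state per line.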

Use the --Format flag to choose which fields appear in the output. For example, to list the generic resources (GRES) and node counts for the idle nodes in a partition:

sinfo -p <partition> -t idle --Format=gres,nodes

For more information about sinfo, please review the sinfo manual.

Monitoring Jobs: squeue

The squeue command lets you monitor the state of jobs in the queue. It provides information such as the job ID, the partition the job was submitted to, and the job name.

To monitor all jobs of a specific user, use the following command:

squeue -u <username>
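To get a quick summary rather than a full listing, the squeue output can be reduced to a count of jobs per state. This is a sketch under stated assumptions: it relies on the -h (no header) and -o (output format) options with the %i (job ID) and %T (job state) fields, and the helper count_jobs_by_state is a hypothetical name introduced for illustration.

```shell
# Hypothetical helper: count jobs per state from lines of
# "<job_id> <state>" read on stdin.
count_jobs_by_state() {
  awk 'NF >= 2 { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Query the live cluster only when squeue is actually installed.
if command -v squeue >/dev/null 2>&1; then
  # -h drops the header; -o prints the job ID (%i) and full state name (%T).
  squeue -u "${USER:-$(id -un)}" -h -o "%i %T" | count_jobs_by_state
fi
```

Each output line pairs a state such as PENDING or RUNNING with the number of jobs in that state.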

For more information about squeue, please review the squeue manual.

Canceling Jobs: scancel

The scancel command cancels running or pending jobs.

To cancel a specific job, use:

scancel <job_id>

To cancel all jobs of a specific user (regular users can cancel only their own jobs):

scancel -u <username>
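Canceling every job at once is often more than intended. One narrower option is to restrict scancel to pending jobs with its --state option. The wrapper below is a hedged sketch: cancel_pending and the DRY_RUN variable are hypothetical names, used here so the command can be previewed before anything is canceled.

```shell
# Hypothetical wrapper: cancel only the PENDING jobs of the given user.
# With DRY_RUN=1 it prints the scancel command instead of running it.
cancel_pending() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "scancel -u $1 --state=PENDING"
  else
    scancel -u "$1" --state=PENDING
  fi
}

# Preview the command for the current user; nothing is canceled here.
DRY_RUN=1 cancel_pending "${USER:-$(id -un)}"
```

Calling cancel_pending <username> without DRY_RUN=1 runs the command for real, leaving that user's running jobs untouched.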

For more information about scancel, please review the scancel manual.