Working with GPUs#
This page covers the Graphics Processing Unit (GPU) resources available on the cluster.
| GPU Type | GPU Architecture | Memory (GB) | Tensor Cores | CUDA Cores | Public Nodes (x # GPUs) | Private Nodes (x # GPUs) |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | 12 | N/A | 3,584 | 12 (x3-4) | 3 (x4) |
|  |  | 32 | 640 | 5,120 | 4 (x2) | 1 (x2), 16GB |
|  |  | 32 | 640 | 5,120 | 24 (x4) | 10 (x4), 16GB |
|  |  | 15 | 320 | 2,560 | 2 (x3-4) | 1 (x4) |
|  |  | 41 & 82 | 432 | 6,912 | 3 (x4) | 15 (x2-8) |
|  |  | 46 | 576 | 4,608 | 0 | 2 (x3) |
|  |  | 24 | 224 | 3,804 | 0 | 1 (x3) |
|  |  | 24 | 256 | 8,192 | 0 | 6 (x8) |
|  |  | 49 | 336 | 10,752 | 0 | 3 (x8) |
The gpu partition is the general GPU resource for HPC users who need a single GPU; multigpu is the alternative for jobs that need access to more than one GPU. Anyone with a cluster account has access to the gpu partition. To use multigpu, however, you must submit a ServiceNow ticket requesting temporary access and demonstrate sufficient need and preparation.
Note
The multigpu partition is available for a limited time window to fulfill urgent needs, and only workloads that genuinely require multiple GPUs will be granted access. Because access is granted for a limited window (e.g., 48 hours), it is advisable to use the partition at full capacity. A member of the RC team will review your request to confirm there is a genuine need for the partition. Please note that all user limits are subject to the availability of multigpu resources at the time and will be allocated based on user needs.
| Name | Requires Approval? | Time in Hours (Default/Max) | Submitted Jobs | GPU per Job Limit | User Limit (No. GPUs) |
| --- | --- | --- | --- | --- | --- |
| gpu | No | 4/8 | 50/100 | 1 | 8 |
| multigpu | Yes | 4/24 | 50/100 | 12 | 12 |
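As an illustration of how the multigpu limits above map onto an actual job request, a batch script might look like the following sketch (the GPU count, memory size, and run time here are placeholder values, not recommendations):

```shell
#!/bin/bash
#SBATCH --partition=multigpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:4            # placeholder; up to 12 GPUs per job on multigpu
#SBATCH --time=04:00:00         # default is 4 hours; maximum is 24 hours
#SBATCH --job-name=multigpu_run
#SBATCH --mem=32GB              # placeholder; size to your workload
#SBATCH --ntasks=1
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
## <your code>
```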
Important
Consider the compatibility of the GPU, as some programs do not work on the older k40m or k80 GPUs.
Execute the following command to display the non-Kepler GPUs that are available:
sinfo -p gpu --Format=nodes,cpus,memory,features,statecompact,nodelist,gres
This shows the state (idle or not) of each GPU node type and can help you find an idle one. However, the command does not give real-time state information, so use it with care.
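If you only want to see nodes that are currently idle, sinfo's --states filter can narrow the output; for example:

```shell
# List only idle nodes in the gpu partition, with their GPU resources
sinfo -p gpu --states=idle --Format=nodes,cpus,memory,features,statecompact,nodelist,gres
```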
Requesting GPUs with Slurm#
Use srun for interactive jobs and sbatch for batch jobs. The srun example below requests 1 node and 1 GPU with 4GB of memory on the gpu partition. You must use the --gres= option to request a GPU:
srun --partition=gpu --nodes=1 --pty --gres=gpu:1 --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash
Note
On the gpu partition, requesting more than one GPU (e.g., --gres=gpu:2) will cause your request to fail. Additionally, you cannot request all of the CPUs on a GPU node, as some are reserved for jobs using the node's other GPUs.
The sbatch
example below is similar to the srun
example above, but it submits the job in the background, gives it a name, and directs the output to a file:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --job-name=gpu_run
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
## <your code>
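To confirm which GPU a job actually received, you can add a couple of diagnostic lines to the body of the script above (a fragment of the job script, not a complete script):

```shell
## Diagnostic lines for the body of a GPU job script
echo "Allocated GPU IDs: $CUDA_VISIBLE_DEVICES"  # Slurm sets this for jobs requesting --gres=gpu
nvidia-smi                                       # shows driver version, GPU model, and utilization
```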
Specifying a GPU type#
You can add a specific type of GPU to the --gres= option (with either srun or sbatch). For a list of available GPU types, refer to the GPU Type column in the table at the top of this page; only types with entries under Public Nodes are available to all users.
--gres=gpu:p100:1
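For example, combining this option with the interactive srun request shown earlier:

```shell
# Interactive session that requests one p100 GPU specifically
srun --partition=gpu --nodes=1 --pty --gres=gpu:p100:1 --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash
```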
Note
Requesting a specific type of GPU could result in longer wait times, based on GPU availability at that time.
Using CUDA#
Several versions of the CUDA Toolkit are available on the HPC cluster. Use the module avail
command to check for the latest software versions on the cluster.
$ module avail cuda
------------------------------- /shared/centos7/modulefiles -------------------------------
cuda/10.0 cuda/10.2 cuda/11.1 cuda/11.3 cuda/11.7 cuda/12.1 cuda/9.1
cuda/10.1 cuda/11.0(default) cuda/11.2 cuda/11.4 cuda/11.8 cuda/9.0 cuda/9.2
To see details on a specific CUDA toolkit version, use module show
(e.g., module show cuda/11.4
).
To add CUDA to your path, use module load
(e.g., module load cuda/11.4
adds CUDA 11.4).
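Loading a cuda module puts the toolkit, including the nvcc compiler, on your PATH. As a quick sanity check (hello.cu here stands in for any CUDA source file of your own):

```shell
module load cuda/11.4
nvcc --version          # confirm the toolkit version now on your PATH
nvcc hello.cu -o hello  # compile your own CUDA source file
./hello
```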
Note
Executing nvidia-smi
(i.e., the NVIDIA System Management Interface) on a GPU node displays the CUDA driver information and lets you monitor the GPU device.
GPUs for Deep Learning#
See also
Deep learning frameworks tend to consume storage that can quickly exceed your Home Directory Storage Quota; follow best practices for Conda environments.
First, start an interactive session on the gpu
partition, then load the anaconda and CUDA 11.8 modules:
srun --partition=gpu --nodes=1 --gres=gpu:v100-sxm2:1 --cpus-per-task=2 --mem=10GB --time=02:00:00 --pty /bin/bash
module load anaconda3/2022.05 cuda/11.8
Select the tab for the desired deep learning framework.
Important
Each tab assumes you are already on a GPU node with the CUDA 11.8 and anaconda modules loaded, as shown above.
The following example demonstrates how to build PyTorch inside a conda virtual environment for CUDA version 11.8.
conda create --name pytorch_env python=3.10 -y
source activate pytorch_env
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
Now, let us check the installation:
python -c 'import torch; print(torch.cuda.is_available())'
If PyTorch detects CUDA, the command prints True.
See also
PyTorch documentation for the most up-to-date instructions and for different CUDA versions.
Here are the steps for installing CUDA 11.8 with a recent version of TensorFlow (TF). Use the TensorFlow pip package, which includes GPU support for CUDA-enabled devices:
conda create --name TF_env python=3.9 -y
source activate TF_env
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y
pip install --upgrade pip
pip install tensorflow==2.13.*
Verify the installation:
python3 -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())' # True
Note
You can safely ignore any Warning
messages generated by the above commands.
To install both PyTorch and TensorFlow in a single conda environment, combine the steps from the previous tabs:
conda create --name deeplearning-cuda11_8 python=3.9 -y
source activate deeplearning-cuda11_8
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y
pip install --upgrade pip
pip install tensorflow==2.13.*
Verify installation:
python -c 'import torch; print(torch.cuda.is_available())' # True
python3 -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())' # True
Tip
Install jupyterlab
and a few other commonly used data science packages in the pytorch_env
environment:
conda install pandas scikit-learn matplotlib seaborn jupyterlab -y