Working with GPUs#

The cluster has various NVIDIA Graphics Processing Units (GPUs) available on gpu-equipped partitions, as listed in the table below.

| GPU Type | GPU Memory | Tensor Cores | CUDA Cores | Public Nodes | Private Nodes |
|---|---|---|---|---|---|
| p100 (Pascal) | 12GB | N/A | 3,584 | 12 (x3-4 GPUs) | 3 (x4 GPUs) |
| v100-pcie (Volta) | 32GB | 640 | 5,120 | 4 (x2 GPUs) | 1 (x2 GPUs, 16GB) |
| v100-sxm2 (Volta) | 32GB | 640 | 5,120 | 24 (x4 GPUs) | 10 (x4 GPUs, 16GB); 8 (x4 GPUs, 32GB) |
| t4 (Turing) | 15GB | 320 | 2,560 | 2 (x3-4 GPUs) | 1 (x4 GPUs) |
| quadro (Quadro RTX 8000) | 46GB | 576 | 4,608 | 0 | 2 (x3 GPUs) |
| a30 (Ampere) | 24GB | 224 | 3,804 | 0 | 1 (x3 GPUs) |
| a100 (Ampere) | 41 & 82GB | 432 | 6,912 | 3 (x4 GPUs) | 15 (x2-8 GPUs) |
| a5000 (Ampere RTX A5000) | 24GB | 256 | 8,192 | 0 | 6 (x8 GPUs) |
| a6000 (Ampere RTX A6000) | 49GB | 336 | 10,752 | 0 | 3 (x8 GPUs) |

The gpu partition is the general GPU resource for HPC users who need a single GPU; multigpu is the alternative for jobs that need access to more than one GPU.

Anyone with a Discovery account can use the gpu partition. To use multigpu, however, you must submit a ServiceNow ticket requesting temporary access for a predefined time window. Your request must demonstrate the need for multiple GPUs and include specifics about the working code; because access is granted only for a limited time (e.g., 48 hours), make sure your code is ready to use the partition at full capacity. Members of the RC team will evaluate your request to ensure that the resources in this partition will be used appropriately.

Note

All user limits are subject to the availability of cluster resources at the time of submission and are honored accordingly.

| Name | Requires Approval? | Time Limit (Default/Max) | Submitted Jobs | GPUs per Job | Max GPUs per User |
|---|---|---|---|---|---|
| gpu | No | 4 hours / 8 hours | 50/100 | 1 | 8 |
| multigpu | Yes | 4 hours / 24 hours | 50/100 | 12 | 12 |
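
Before requesting access or submitting jobs, you can check the current state of either partition with a standard Slurm query (a quick sketch; partition names as listed above):

sinfo -p gpu,multigpu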

Requesting GPUs with Slurm#

Use srun for interactive jobs and sbatch for batch jobs. The srun example below requests one node and one GPU with 4GB of memory in the gpu partition. You must use the --gres= option to request a GPU:

srun --partition=gpu --nodes=1 --pty --gres=gpu:1 --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash

Note

On the gpu partition, requesting more than one GPU (e.g., --gres=gpu:2) will cause your request to fail. Additionally, you cannot request all of the CPUs on a gpu node, as some are reserved for jobs using the node's other GPUs.

The sbatch example below is similar to the srun example above, but it submits the job in the background, gives it a name, and directs the output to a file:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --job-name=gpu_run
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err

## <your code>
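
Save the script and submit it with sbatch, then monitor it with squeue. A brief sketch, using a hypothetical filename gpu_run.sbatch:

sbatch gpu_run.sbatch    # submit the script; Slurm prints the assigned job ID
squeue -u $USER          # check its state (PD = pending, R = running)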

Specifying a GPU type#

You can add a specific type of GPU to the --gres= option (with either srun or sbatch). For a list of available GPU types, refer to the GPU Type column in the table at the top of this page; only types with nodes in the Public Nodes column can be requested this way. The following example requests a single p100 GPU:

--gres=gpu:p100:1
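
Put together, an interactive request for a specific public GPU type looks like the following sketch (here a v100-sxm2, using the same options as the srun example above):

srun --partition=gpu --nodes=1 --pty --gres=gpu:v100-sxm2:1 --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash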

Note

Requesting a specific type of GPU could result in longer wait times, based on GPU availability at that time.

Using CUDA#

There are several versions of CUDA Toolkits available on the HPC, including:

cuda/9.0
cuda/9.2
cuda/10.0
cuda/10.2
cuda/11.0
cuda/11.1
cuda/11.2
cuda/11.3
cuda/11.4
cuda/11.7
cuda/11.8
cuda/12.1

Use the module avail command to check for the latest software versions on Discovery. To see details on a specific CUDA toolkit version, use module show. For example, module show cuda/11.4.

To add CUDA to your path, use module load. For example, type module load cuda/11.4 to load version 11.4 to your path.

Use the nvidia-smi command (NVIDIA System Management Interface) on a GPU node to get CUDA driver information and to monitor the GPU device.
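
For example, a typical first session on a GPU node might run the following commands (a sketch; any of the versions listed above would work in place of 11.4):

module avail cuda     # list the CUDA toolkit modules installed on Discovery
module show cuda/11.4 # inspect what this module adds to your environment
module load cuda/11.4 # put CUDA 11.4 on your path
nvcc --version        # confirm the CUDA compiler is now available
nvidia-smi            # check the driver version and current GPU utilization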

GPUs for Deep Learning#

See also

Deep learning frameworks consume storage that can quickly exceed the Home Directory Storage Quota: follow best practices for Conda environments.

First, log on to a gpu node interactively and load the anaconda and CUDA 11.8 modules:

srun --partition=gpu --nodes=1 --gres=gpu:v100-sxm2:1 --cpus-per-task=2 --mem=10GB --time=02:00:00 --pty /bin/bash
module load anaconda3/2022.05 cuda/11.8

Note

Be aware of compatibility with your GPU type: some installations do not work on k40m or k80 GPUs. To see which non-Kepler GPUs might be available, execute the following command:

sinfo -p gpu --Format=nodes,cpus,memory,features,statecompact,nodelist,gres

This indicates the state (idle or not) of each GPU type, which can help you request an idle GPU. However, the output is not real-time state information, so use it with caution.

Select the tab for the desired deep learning framework.

Warning

Each tab assumes that you are already on a GPU node with the CUDA 11.8 and anaconda modules loaded, as shown above.

The following example demonstrates how to install PyTorch inside a conda virtual environment for CUDA version 11.8.

PyTorch’s installation steps (with a specific GPU-type):#
conda create --name pytorch_env python=3.10 -y
source activate pytorch_env
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y

Now, let’s check the installation:

python -c 'import torch; print(torch.cuda.is_available())'

If PyTorch detects CUDA, the command prints True.
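
To confirm which device PyTorch sees, you can also print the device name (a small sketch using standard PyTorch calls):

python -c 'import torch; print(torch.cuda.get_device_name(0))'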

See also

PyTorch documentation for the most up-to-date instructions and for different CUDA versions.

TensorFlow’s installation steps:#

We recommend that you use CUDA 11.8 (the latest supported version) when working on a GPU with the latest version of TensorFlow (TF).

For the latest installation, use the TensorFlow pip package, which includes GPU support for CUDA-enabled devices:

conda create --name TF_env python=3.9 -y
source activate TF_env
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y
pip install --upgrade pip
pip install tensorflow==2.13.*

Verify the installation:

python3 -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())' # True

Note

You can safely ignore the warning messages generated by the above commands.
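
The is_built_with_cuda() check only confirms that the TF build includes CUDA support; to check that TF can actually see the node's GPU, you can also list the physical devices (a sketch using the standard TF API):

python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'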

Installing PyTorch and TensorFlow in one environment:#

To use both frameworks in a single environment, combine the two installations above:

conda create --name deeplearning-cuda11_8 python=3.9 -y
source activate deeplearning-cuda11_8
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y
pip install --upgrade pip
pip install tensorflow==2.13.*

Verify both installations:

python -c 'import torch; print(torch.cuda.is_available())' # True
python3 -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())' # True

Tip

Install JupyterLab and a few other commonly used data science packages in the pytorch_env environment:

conda install pandas scikit-learn matplotlib seaborn jupyterlab -y
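
Once your environment works interactively, the same setup can run as a batch job. The sketch below combines the sbatch template from earlier with the module and environment steps above; train.py is a hypothetical placeholder for your own training script:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --job-name=train
#SBATCH --mem=10GB
#SBATCH --ntasks=1
#SBATCH --output=train.%j.out
#SBATCH --error=train.%j.err

# Load the same modules used in the interactive session above
module load anaconda3/2022.05 cuda/11.8

# Activate the Conda environment created earlier
source activate pytorch_env

# Run the (hypothetical) training script
python train.py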