Working with GPUs¶

The Discovery cluster has a number of NVIDIA Graphics Processing Units (GPUs) available, as detailed in the table below.

Note

The tables on this page scroll horizontally. Swipe or scroll to the right to see the columns on the right side of the table.

GPU Type	GPU Memory	Tensor Cores	CUDA Cores	Nodes in Public GPUs	Nodes in Private GPUs
p100 (Pascal)	12GB	N/A	3,584	12 with 3-4 GPUs each	3 with 4 GPUs each
v100-pcie (Volta)	32GB	640	5,120	4 with 2 GPUs each	1 with 2 GPUs (16GB each)
v100-sxm2 (Volta)	32GB	640	5,120	24 with 4 GPUs each	10 with 4 GPUs each & 16GB GPU memory; 8 with 4 GPUs & 32GB GPU memory
t4 (Turing)	15GB	320	2,560	2 with 3-4 GPUs each	1 with 4 GPUs
quadro (Quadro RTX 8000)	46GB	576	4,608	0	2 with 3 GPUs each
a30 (Ampere)	24GB	224	3,584	0	1 with 3 GPUs
a100 (Ampere)	41 & 82GB	432	6,912	3 nodes with 4 GPUs each	15 nodes with 2-8 GPUs each
a5000 (Ampere RTX A5000)	24GB	256	8,192	0	6 with 8 GPUs each
a6000 (Ampere RTX A6000)	49GB	336	10,752	0	3 with 8 GPUs each

The public GPUs are available within two partitions, named gpu and multigpu. The differences between the two partitions are the number of GPUs that one can request per job and the time limit on each job. Both partitions give access to all of the public GPU types mentioned above. The table below shows the differences between the two partitions. For more information about the partitions on Discovery, see Partitions.

Note

All user limits are subject to the availability of cluster resources at the time of submission and will be honored accordingly.

Name	Requires Approval?	Time limit (Default/Max)	Submitted Jobs	GPU per job Limit	Max GPUs per user Limit
gpu	No	4 hours/8 hours	50/100	1	8
multigpu	Yes	4 hours/24 hours	50/100	12	12

Anyone with a Discovery account can use the gpu partition. However, you must submit a ServiceNow ticket to request temporary access to multigpu for testing, or to request full access to the multigpu partition. Your request will be evaluated by members of the RC team to ensure that the resources in this partition will be used appropriately.

Requesting GPUs with `srun` or `sbatch`¶

Use srun for interactive mode and sbatch for batch mode. The srun example below requests 1 node and 1 GPU with 4GB of memory in the gpu partition. You must use the --gres= option to request a GPU:

srun --partition=gpu --nodes=1 --pty --gres=gpu:1 --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash

Note

On the gpu partition, requesting more than 1 GPU (that is, any value above --gres=gpu:1) will cause your request to fail. Additionally, you cannot request all the CPUs on a GPU node, because some are reserved for the other GPUs on that node.

The sbatch example below is similar to the srun example above, but it submits the job in the background, gives it a name, and directs the output to a file:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --job-name=gpu_run
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
<your code>
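
Once the batch script above is saved, it can be submitted and tracked with standard Slurm commands. The following is a minimal sketch, not an official workflow: the file name gpu_job.sh and the helper submit_and_watch are hypothetical, and the function only submits when sbatch is present on the machine.

```shell
#!/bin/bash
# Hypothetical file name for the batch script shown above.
JOB_SCRIPT="gpu_job.sh"

submit_and_watch() {
    local script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on a cluster login node" >&2
        return 1
    fi
    # --parsable prints only the job ID (optionally followed by ";cluster").
    local job_id
    job_id=$(sbatch --parsable "$script") || return 1
    job_id="${job_id%%;*}"
    echo "Submitted job ${job_id}"
    # Show the job's current state in the queue.
    squeue -j "$job_id"
    # %j in --output and --error expands to the job ID.
    echo "Output: myjob.${job_id}.out  Errors: myjob.${job_id}.err"
}

# "|| true" keeps the script from failing when sbatch is unavailable.
submit_and_watch "$JOB_SCRIPT" || true
```

The job ID printed at submission is the same value that replaces %j in the output and error file names.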

Specifying a GPU type¶

You can add a specific type of GPU to the --gres= option (with either srun or sbatch). For a list of available GPU types, refer to the GPU Type column in the table at the top of this page, and choose a type that has nodes listed in the Public GPUs column. The following is an example for requesting a single p100 GPU:

--gres=gpu:p100:1

Note

Requesting a specific type of GPU could result in longer wait times, based on GPU availability at that time.
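
To avoid typos in the GPU type, the --gres value can be assembled by a small helper. This is an illustrative sketch only: build_gres and PUBLIC_GPU_TYPES are hypothetical names, and the list of types is copied from the public column of the table at the top of this page, so it may go out of date.

```shell
#!/bin/bash
# Public GPU types, copied from the table at the top of this page (assumption:
# the table is current).
PUBLIC_GPU_TYPES="p100 v100-pcie v100-sxm2 t4 a100"

# Print a --gres value such as gpu:p100:1, or fail for an unknown type.
build_gres() {
    local gpu_type="$1" count="${2:-1}"
    local t
    for t in $PUBLIC_GPU_TYPES; do
        if [ "$t" = "$gpu_type" ]; then
            echo "gpu:${gpu_type}:${count}"
            return 0
        fi
    done
    echo "unknown public GPU type: ${gpu_type}" >&2
    return 1
}

# Print (do not run) the interactive request from earlier on this page
# with a GPU type added.
echo "srun --partition=gpu --nodes=1 --pty --gres=$(build_gres p100 1) --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash"
```

The command is printed rather than executed, so you can inspect it before pasting it into a terminal on the cluster.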

Using CUDA¶

There are several versions of the CUDA Toolkit on Discovery, including:

cuda/9.0
cuda/9.2
cuda/10.0
cuda/10.2
cuda/11.0
cuda/11.1
cuda/11.2
cuda/11.3
cuda/11.4
cuda/11.7
cuda/11.8
cuda/12.1

Use the module avail command to check for the latest software versions on Discovery. To see details on a specific CUDA toolkit version, use module show. For example, module show cuda/11.4.

To add CUDA to your path, use module load. For example, type module load cuda/11.4 to load version 11.4 to your path.

Use the command nvidia-smi (NVIDIA System Management Interface) on a GPU node to get the CUDA driver information and monitor the GPU device.
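
The commands in this section can be combined into a quick check to run after landing on a GPU node. The sketch below assumes cuda/11.4 is the version you want (any version from the list above works) and skips each step when the corresponding tool is missing. The variable name CUDA_MODULE is hypothetical.

```shell
#!/bin/bash
# Run on a GPU node. CUDA_MODULE is one of the versions listed above.
CUDA_MODULE="cuda/11.4"

# "module" is a shell function on clusters that use environment modules.
if type module >/dev/null 2>&1; then
    module load "$CUDA_MODULE"
    # nvcc ships with the toolkit, so this shows which version is on PATH.
    nvcc --version
else
    echo "module command not found; skipping module load" >&2
fi

if command -v nvidia-smi >/dev/null 2>&1; then
    # List the GPUs visible to this job.
    nvidia-smi -L
    # Query selected fields as CSV, which is easier to log than the full table.
    nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu,driver_version --format=csv
else
    echo "nvidia-smi not found; are you on a GPU node?" >&2
fi
```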

Using GPUs with PyTorch¶

You should use PyTorch within a conda virtual environment if you need to run it on the NVIDIA GPUs on Discovery. The following example demonstrates how to install PyTorch inside a conda virtual environment for CUDA version 11.7.

Note

Make sure you are on a GPU node before loading the environment. Additionally, the latest version of PyTorch is not compatible with GPUs with CUDA version 11.7 or less. Hence, the installation does not work on k40m or k80 GPUs. To see which non-Kepler GPUs might be available, run this command:

sinfo -p gpu --Format=nodes,cpus,memory,features,statecompact,nodelist,gres

This shows the state (idle or not) of each GPU type, which can help when requesting an idle GPU. However, the command does not report the state in real time, so use it with caution.
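
If you only care about idle nodes, the sinfo output can be filtered. The following is a rough sketch: it uses the short -o format codes instead of --Format so that the fields are space-separated, and filter_idle is a hypothetical helper name. The same caveat applies: the result is a snapshot, not a reservation.

```shell
#!/bin/bash
# Keep only rows whose state column (2nd field) is exactly "idle".
filter_idle() {
    awk '$2 == "idle"'
}

if command -v sinfo >/dev/null 2>&1; then
    # %D = node count, %t = compact state, %N = node list, %G = generic resources (GPUs)
    sinfo -p gpu --noheader -o "%D %t %N %G" | filter_idle
else
    echo "sinfo not found; run this on the cluster" >&2
fi
```

The last column of each remaining row shows the GPU type in the form used by --gres, which you can then pass to srun or sbatch.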

PyTorch installation steps (with a specific GPU-type):

srun --partition=gpu --nodes=1 --gres=gpu:v100-sxm2:1 --cpus-per-task=2 --mem=10GB --time=02:00:00 --pty /bin/bash
module load anaconda3/2022.05 cuda/11.7
conda create --name pytorch_env python=3.9 -y
source activate pytorch_env
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y
python -c 'import torch; print(torch.cuda.is_available())'

Note

If the installation times out, make sure that your .condarc file doesn’t contain additional channels. Also, consider cleaning your conda instance using the conda clean command. See Conda best practices.

If CUDA is detected by PyTorch, you should see the result, True.

Because the latest version of PyTorch often depends on the newest CUDA available, refer to the PyTorch documentation page for the most up-to-date installation instructions.

The above PyTorch installation instructions do not include JupyterLab and a few other commonly used data science packages in the environment. To include them, run the following command after activating the pytorch_env environment:

conda install pandas scikit-learn matplotlib seaborn jupyterlab -y
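
To reuse pytorch_env in batch mode, the interactive steps above can be combined with the sbatch directives shown earlier on this page. The following sketch writes such a script to disk without submitting it. The file name pytorch_job.sh and the job name pytorch_check are hypothetical, and the resource values simply mirror the srun example above.

```shell
#!/bin/bash
# Write a batch script that activates pytorch_env and runs the CUDA check.
# The quoted 'EOF' prevents the shell from expanding anything in the body.
cat > pytorch_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:v100-sxm2:1
#SBATCH --cpus-per-task=2
#SBATCH --mem=10GB
#SBATCH --time=02:00:00
#SBATCH --job-name=pytorch_check
#SBATCH --output=pytorch_check.%j.out

# Same modules and environment as the interactive installation steps.
module load anaconda3/2022.05 cuda/11.7
source activate pytorch_env

# Prints True when PyTorch can see the allocated GPU.
python -c 'import torch; print(torch.cuda.is_available())'
EOF

echo "Wrote pytorch_job.sh; submit it with: sbatch pytorch_job.sh"
```

Replace the final python line in the generated script with your own training or inference command.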

Using GPUs with TensorFlow¶

We recommend that you use CUDA 11.2 (latest supported version) when working on a GPU with the latest version of TensorFlow (TF). TensorFlow provides information on the compatibility of CUDA and TensorFlow versions, and detailed installation instructions.

For the latest installation, use the TensorFlow pip package, which includes GPU support for CUDA-enabled devices:

srun --partition=gpu --gres=gpu:1 --nodes=1 --cpus-per-task=2 --mem=10GB --time=02:00:00 --pty /bin/bash
module load anaconda3/2022.05 cuda/11.2
conda create --name TF_env python=3.9 -y
source activate TF_env
conda install -c conda-forge cudatoolkit=11.2.2 cudnn=8.1.0 -y
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
pip install --upgrade pip
pip install tensorflow==2.11.*

Verify the installation:

# Verify the CPU setup (if successful, then a tensor is returned):
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

# Verify the GPU setup (if successful, then a list of GPU devices is returned):
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# Check whether TensorFlow was built with CUDA support (if so, then True is returned):
python3 -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())'

To get the name of the GPU, type:

python -c 'import tensorflow as tf; print(tf.test.gpu_device_name())'

If the installation is successful, you should see output similar to the following:

2023-02-24 16:39:35.798186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /device:GPU:0 with 10785 MB memory:  -> device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0, compute capability: 3.7 /device:GPU:0

Note

Ignore the warning messages that are generated after executing the above commands.
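
If those messages clutter your logs, TensorFlow's C++ logging can be made quieter through an environment variable. This is a sketch that assumes TF_CPP_MIN_LOG_LEVEL behaves as documented by TensorFlow (0 shows everything, 1 hides INFO, 2 also hides WARNING, 3 also hides ERROR). Messages emitted from Python code are not affected.

```shell
#!/bin/bash
# Assumption: TF_CPP_MIN_LOG_LEVEL controls TensorFlow's C++ log output
# (0 = all messages, 1 = hide INFO, 2 = also hide WARNING, 3 = also hide ERROR).
export TF_CPP_MIN_LOG_LEVEL=2

# Re-run the GPU check with the quieter setting, if TensorFlow is importable.
if python3 -c "import tensorflow" >/dev/null 2>&1; then
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
else
    echo "TensorFlow not importable; activate TF_env first" >&2
fi
```

With level 2, INFO lines such as the "Created device" message shown above are hidden as well, so unset the variable when you need that detail for debugging.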
