Working with GPUs

The Discovery cluster has a number of NVIDIA Graphics Processing Units (GPUs) available, as detailed in the table below.

| GPU Type | GPU Memory | Tensor Cores | CUDA Cores | Public GPU Nodes | Private GPU Nodes |
| --- | --- | --- | --- | --- | --- |
| p100 (Pascal) | 12 GB | N/A | 3,584 | 12 with 3-4 GPUs each | 3 with 4 GPUs each |
| v100-pcie (Volta) | 32 GB | 640 | 5,120 | 4 with 2 GPUs each | 1 with 2 GPUs (16 GB) |
| v100-sxm2 (Volta) | 32 GB | 640 | 5,120 | 24 with 4 GPUs each | 10 with 4 GPUs each (16 GB GPU memory); 8 with 4 GPUs each (32 GB GPU memory) |
| t4 (Turing) | 15 GB | 320 | 2,560 | 2 with 3-4 GPUs each | 1 with 4 GPUs |
| quadro (Quadro RTX 8000) | 46 GB | 576 | 4,608 | 0 | 2 with 3 GPUs each |
| a30 (Ampere) | 24 GB | 224 | 3,584 | 0 | 1 with 3 GPUs |
| a100 (Ampere) | 41 & 82 GB | 432 | 6,912 | 3 with 4 GPUs each | 15 with 2-8 GPUs each |
| a5000 (Ampere RTX A5000) | 24 GB | 256 | 8,192 | 0 | 6 with 8 GPUs each |
| a6000 (Ampere RTX A6000) | 49 GB | 336 | 10,752 | 0 | 3 with 8 GPUs each |

The public GPUs are available within two partitions, named gpu and multigpu. The two partitions differ in the number of GPUs you can request per job and in the time limit on each job. Both partitions give access to all of the public GPU types mentioned above; the table below summarizes their limits. For more information about the partitions on Discovery, see Partitions.

Note

All user limits are subject to the availability of cluster resources at the time of submission.

| Name | Requires Approval? | Time Limit (Default/Max) | Submitted Jobs | GPUs per Job | Max GPUs per User |
| --- | --- | --- | --- | --- | --- |
| gpu | No | 4 hours/8 hours | 50/100 | 1 | 8 |
| multigpu | Yes | 4 hours/24 hours | 50/100 | 12 | 12 |

Anyone with a Discovery account can use the gpu partition. However, to use multigpu, you must submit a ServiceNow ticket requesting either temporary access for testing or full access to the partition. The RC team will evaluate your request to ensure that the resources in this partition will be used appropriately.

Requesting GPUs with srun or sbatch

Use srun for interactive jobs and sbatch for batch jobs. The srun example below requests one node and one GPU with 4GB of memory in the gpu partition. You must use the --gres= option to request a GPU:

srun --partition=gpu --nodes=1 --pty --gres=gpu:1 --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash

Note

On the gpu partition, requesting more than one GPU (that is, anything beyond --gres=gpu:1) will cause your request to fail. Additionally, you cannot request all of the CPUs on a gpu node, because some are reserved for jobs using the node's other GPUs.
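
If your job needs more than one GPU, use the multigpu partition instead (which requires approval, as described above). For example, a hypothetical interactive request for four GPUs might look like this (the GPU count, memory, and time are illustrative):

srun --partition=multigpu --nodes=1 --pty --gres=gpu:4 --ntasks=1 --mem=16GB --time=01:00:00 /bin/bash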

The sbatch example below is similar to the srun example above, but it submits the job in the background, gives it a name, and directs the output and error streams to files:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --job-name=gpu_run
#SBATCH --mem=4GB
#SBATCH --ntasks=1
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
<your code>
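
Assuming you save the script above as, say, gpu_batch.sh (an illustrative name), you can submit it and check its status with:

sbatch gpu_batch.sh
squeue -u $USER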

Specifying a GPU type

You can add a specific GPU type to the --gres= option (with either srun or sbatch). For a list of available GPU types, refer to the GPU Type column in the table at the top of this page; you can request any type that has public nodes. The following example requests a single p100 GPU:

--gres=gpu:p100:1
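
Combined with the interactive example shown earlier, a full request for one p100 might look like this:

srun --partition=gpu --nodes=1 --pty --gres=gpu:p100:1 --ntasks=1 --mem=4GB --time=01:00:00 /bin/bash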

Note

Requesting a specific type of GPU can result in longer wait times, depending on GPU availability at the time.

Using CUDA

There are several versions of CUDA Toolkits on Discovery, including:

cuda/9.0
cuda/9.2
cuda/10.0
cuda/10.2
cuda/11.0
cuda/11.1
cuda/11.2
cuda/11.3
cuda/11.4
cuda/11.7
cuda/11.8
cuda/12.1

Use the module avail command to check for the latest software versions on Discovery. To see details on a specific CUDA toolkit version, use module show. For example, module show cuda/11.4.

To add CUDA to your path, use module load. For example, type module load cuda/11.4 to load version 11.4 to your path.

Use the command nvidia-smi (NVIDIA System Management Interface) on a GPU node to get the CUDA driver information and monitor the GPU device.
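
Putting these commands together, a typical sequence on a GPU node might look like the following (using version 11.4 as an example):

module avail cuda
module show cuda/11.4
module load cuda/11.4
nvidia-smi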

Using GPUs with PyTorch

Use PyTorch inside a conda virtual environment when running on the NVIDIA GPUs on Discovery. The following example demonstrates how to install PyTorch inside a conda virtual environment for CUDA version 11.7.

Note

Make sure you are on a GPU node before loading the environment. Additionally, the latest version of PyTorch does not support the older Kepler-generation GPUs, so the installation below does not work on k40m or k80 GPUs. To see which non-Kepler GPUs are available, run this command:

sinfo -p gpu --Format=nodes,cpus,memory,features,statecompact,nodelist,gres

The output shows the state (idle or otherwise) of the nodes with each GPU type, which can help you request an idle GPU. However, the state information is not real-time, so treat it as a guide rather than a guarantee.

PyTorch installation steps (with a specific GPU-type):

srun --partition=gpu --nodes=1 --gres=gpu:v100-sxm2:1 --cpus-per-task=2 --mem=10GB --time=02:00:00 --pty /bin/bash
module load anaconda3/2022.05 cuda/11.7
conda create --name pytorch_env python=3.9 -y
source activate pytorch_env
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y
python -c 'import torch; print(torch.cuda.is_available())'

Note

If the installation times out, make sure that your .condarc file doesn't contain additional channels. Also consider cleaning your conda instance with the conda clean command. See Conda best practices.

If PyTorch detects CUDA, the command prints True.

Because the latest version of PyTorch often depends on the newest available CUDA, refer to the PyTorch documentation for the most up-to-date installation instructions.

The PyTorch installation above does not include JupyterLab or a few other commonly used data science packages. To include them, run the following command after activating the pytorch_env environment:

conda install pandas scikit-learn matplotlib seaborn jupyterlab -y
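
As a further check that tensors are actually placed on the GPU, you can run a quick test in the same style as the verification step above (the reported device name will vary by node):

python -c 'import torch; x = torch.rand(3, 3).to("cuda"); print(x.device, torch.cuda.get_device_name(0))'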

Using GPUs with TensorFlow

We recommend using CUDA 11.2 (the latest supported version) when working on a GPU with the latest version of TensorFlow (TF). TensorFlow provides information on the compatibility of CUDA and TensorFlow versions, as well as detailed installation instructions.

For the latest installation, use the TensorFlow pip package, which includes GPU support for CUDA-enabled devices:

srun --partition=gpu --gres=gpu:1 --nodes=1 --cpus-per-task=2 --mem=10GB --time=02:00:00 --pty /bin/bash
module load anaconda3/2022.05 cuda/11.2
conda create --name TF_env python=3.9 -y
source activate TF_env
conda install -c conda-forge cudatoolkit=11.2.2 cudnn=8.1.0 -y
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
pip install --upgrade pip
pip install tensorflow==2.11.*

Verify the installation:

# Verify the CPU setup (if successful, a tensor is returned):
python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

# Verify the GPU setup (if successful, a list of GPU devices is returned):
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# Test whether TF was built with CUDA support (if successful, True is returned):
python3 -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())'

To get the name of the GPU, type:

python -c 'import tensorflow as tf; print(tf.test.gpu_device_name())'

If the installation is successful, you should see output similar to the following:

2023-02-24 16:39:35.798186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /device:GPU:0 with 10785 MB memory:  -> device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0, compute capability: 3.7 /device:GPU:0

Note

Ignore the warning messages generated when executing the above commands.
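
If you also want to confirm that a computation is actually placed on the GPU, a minimal check along the same lines as the commands above is the following (the reported device string will vary):

python3 -c 'import tensorflow as tf; print(tf.matmul(tf.random.normal([2, 2]), tf.random.normal([2, 2])).device)'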
