# Access to Multi-GPU Partition The `multigpu` partition in the HPC cluster enables parallel processing across multiple GPUs. This setup is ideal for applications that require significant computational power, such as deep learning, scientific simulations, and large-scale data analysis. Please follow these steps to apply for the Multi-GPU Partition. ## MultiGPU Partition Access Guide ### Requesting Access to MultiGPU partition - Please use the partition application form here: [HPC Partition Request](https://bit.ly/NURC-PartitionAccess) - Select under "Partition Type" - `Multigpu` - Make sure that you have filled the form with briefly answering the setup questions at the end. :::{note} Please consider the following while requesting GPUs for testing: - Testing should represent your planned work but doesn't need full production runs. - Use scaled-down versions of your jobs for timing data when possible. - Ensure your test cases cover a range of processor/GPU counts to measure scaling accurately. - Consider testing with different problem sizes to understand how scaling efficiency changes with workload. - If you need to use A100s in your workflow, consider testing your code on V100 GPUs: * V100 testing is often sufficient to demonstrate scaling. * If a job scales well on V100s, it will also scale on A100s. * This approach conserves A100 resources for all the users on the cluster. * Only use A100s for testing if your job requires their capabilities or memory that V100s cannot provide. ::: - Submit the form. - After submitting the form Research Computing team will reach out to you for an estimate of the duration needed for testing (preferably less than 24 hrs). - Research Computing team will contact you with more details for accessing a multi-GPU node. ### Testing the Code on the Temporary Reservation In the testing phase, you will be provided with the reservation for performing test on 1, 2, 4, and 8 GPUs and recording runtimes for each test to calculate efficiency. #### Check your reservation Before you begin check your reservation ```bash scontrol show reservation= # will be in the details provided by the RC team. ``` #### Interactive Job To run an {term}`Interactive Job`, for example, using V100-sxm2 with 4 GPUs, use the command below ```bash srun -p reservation --reservation= --gres=gpu:v100-sxm2:4 --time=24:00:00 -N 1 --pty /bin/bash ``` :::{note} Your local machine must remain active for successful job execution. Be aware of the following risks: - Network disconnections have the potential to interrupt the job. - Computer sleep/hibernation can break the connection. - Losing RDP session (e.g., timeout, local reboot) stops GUI-dependent processes. - Power outages or system updates can cause unexpected disconnects. **Recommendations:** - Use a stable network connection. - Disable sleep mode on your local machine during testing. - Implement job checkpointing where possible. - Monitor job status regularly. - Use [Non-interactive job](https://rc-docs.northeastern.edu/en/latest/gpus/multigpu-partition-access.html#non-interactive-job) ::: #### Non-interactive job To run {term}`Non-interactive job`, you can use the following command 1- Create a script ```bash nano my_job.sh # You can use any text editor you prefer, not just nano. Choose the editor you're most comfortable with for modifying files. ``` 2- Add reservation and job details ```bash #!/bin/bash #SBATCH --job-name=MyJob # Job name #SBATCH -p reservation #SBATCH --reservation= #SBATCH --gres=gpu:v100-sxm2:4 #SBATCH --nodes=1 # Number of nodes #SBATCH --ntasks=4 # Number of tasks #SBATCH --cpus-per-task=2 # Number of CPUs per task #SBATCH --mem=8G # Total CPU memory (not GPU memory) #SBATCH --time=02:00:00 # Time limit hh:mm:ss #SBATCH --output=output_%j.txt # Standard output and error log #SBATCH --error=error_%j.txt # Standard error log # Load any required modules # Run your program python your_script.py #example # You can use the editor of your choice to edit and save the file on the cluster. # Following commands are for the editor 'Nano', and can be used to write this script to disk # Ctrl+x to save the file # press 'Y' to save the changes # press enter to complete saving the file to the disk ``` 3- Submit the job ```bash sbatch job_script.sh ``` 4-You can monitor the status of your job using the squeue command: ```bash squeue -u #To cancel a job scancel # where is the ID of the Job we are canceling ``` :::{important} This multi-GPU setup is intended for research workflows only. For course-related multi-GPU needs, please refer to the course request form. Instructors should submit those requests directly through the appropriate channels. ::: #### Post-Testing Sharing Performance Results - Ensure your {term}`Scaling efficiency` is adequate (generally over 0.5) when using the maximum GPUs selected on the `multigpu` partition⁠. If needed, please consult with the Research Computing (RC) team for guidance and support throughout the process⁠. - Share your execution time across 1, 2, 3, 4 GPUs with efficiency in the ticket for review. - Based on your performance, the team will evaluate your application for permanent access to the `multigpu` partition until your research work is completed.