GPU Resources

GPUs are expensive, in high demand, and consume a lot of energy. Careful attention to your workloads and appropriateness of requested resources for a given job is always important in shared facilities such as ITS Research Computing, and these concerns are greatly amplified for GPU resources. It is imperative that you know how to profile and monitor your codes' usage of allocated GPU resources.

If you are not well-versed in viewing and monitoring GPU utilization for your jobs, please do not request more than one GPU per job; the most likely outcome is that the additional GPU(s) will remain idle throughout the lifetime of the job.

GPU type   Quantity  Precision  GPU Memory  Partition  NVLink
L40S       24        Single     48GB        l40-gpu    None
L40        28        Single     48GB        l40-gpu    None
A100       24        Double     40GB        a100-gpu   None
V100       88        Double     16GB        volta-gpu  25GB/s
GTX 1080   32        Single     8GB         gpu        None
A100       3         Double     40GB        OnDemand   None
A100 MIG   9         Double     10GB        OnDemand   None
A100 MIG   3         Double     5GB         OnDemand   None
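As a rough aid for reading the table, the sketch below transcribes the batch rows into Python and lists the partitions whose GPUs meet a memory and precision requirement. The function name candidate_partitions and its parameters are hypothetical, the OnDemand rows are left out because they are not batch partitions, and the "Precision" column is treated as a simple label. This is an illustration only, not an official selection tool.

```python
# Batch rows transcribed from the table above; memory is in GB.
# (gpu_type, quantity, precision, memory_gb, partition)
GPU_TABLE = [
    ("L40S", 24, "Single", 48, "l40-gpu"),
    ("L40", 28, "Single", 48, "l40-gpu"),
    ("A100", 24, "Double", 40, "a100-gpu"),
    ("V100", 88, "Double", 16, "volta-gpu"),
    ("GTX 1080", 32, "Single", 8, "gpu"),
]


def candidate_partitions(min_memory_gb, need_double=False):
    """Return partitions whose GPUs satisfy the request, smallest memory first.

    Hypothetical helper: it only filters the table above and knows nothing
    about current availability or queue wait times.
    """
    matches = [
        row for row in GPU_TABLE
        if row[3] >= min_memory_gb and (not need_double or row[2] == "Double")
    ]
    matches.sort(key=lambda row: row[3])  # prefer the smallest GPU that fits
    partitions = []
    for row in matches:
        if row[4] not in partitions:
            partitions.append(row[4])
    return partitions


print(candidate_partitions(12))
print(candidate_partitions(12, need_double=True))
```

Listing the smallest sufficient GPU first reflects the point that older, smaller cards often provide the most capacity when they can run your code.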

New GPUs are purchased each year to support increased demand, and the best price/performance point is often different each time, leading to diversity in capabilities across the overall pool of resources. We find that, in general, older GPU types provide significant value via capacity, provided they are capable of running your code and you can keep batch-scheduled resources loaded with work.

Batch vs Interactive

GPU sessions for interactive work are available in OnDemand. These sessions are designed for dev/test/repl activities, allowing you to build and test code, and perhaps even run shorter GPU jobs. OnDemand sessions are limited to 10 hours, so work must be completed within that timeframe.

There are currently 15 GPU sessions available in OnDemand. They can be used in a Longleaf Desktop for build/test/repl work or compute of less than 10 hours, in an MD Desktop optimized for Molecular Dynamics, or in other applications such as Jupyter and Matlab. The only way to acquire these resources is via OnDemand.

Nvidia provides a mechanism to slice a GPU into smaller parts via virtualization, called "MIG" or "Multi-Instance GPU". The purpose of MIG is to enable a card such as an A100 to present as more than one GPU, i.e. to provide multiple users with simultaneous access to fractional portions of the A100. Unlike slicing a delicious pie, it is not practical to slice an A100 into equal parts, each with exactly the same capability, without leaving resources unused. Each of the three A100s serving MIG'd OnDemand sessions is sliced into four parts: three with 10GB of GPU RAM and one with 5GB of GPU RAM. This yields a total of 9 slices at 10GB and 3 slices at 5GB. Just as the GPU RAM is sliced differently, so are the GPU cores. An A100 contains 6912 CUDA cores; the slices with 10GB of GPU memory have roughly twice as much GPU compute capability as the 5GB slices.
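The slice arithmetic above can be checked with a few lines of Python. This is only a worked calculation using figures stated on this page (three sliced A100s, three 10GB slices and one 5GB slice per card, 40GB per A100, and three un-sliced A100 sessions). The names are made up for illustration and nothing here queries real hardware.

```python
# Figures taken from this page; nothing here talks to a GPU.
MIG_A100_COUNT = 3               # A100s serving MIG'd OnDemand sessions
FULL_A100_SESSIONS = 3           # un-sliced A100 OnDemand sessions
SLICES_PER_A100 = {10: 3, 5: 1}  # GPU memory in GB -> slices of that size per card
A100_MEMORY_GB = 40


def mig_session_totals(card_count, slices_per_card):
    """Multiply the per-card slice layout by the number of sliced cards."""
    return {gb: n * card_count for gb, n in slices_per_card.items()}


totals = mig_session_totals(MIG_A100_COUNT, SLICES_PER_A100)
memory_assigned_gb = sum(gb * n for gb, n in SLICES_PER_A100.items())
total_sessions = sum(totals.values()) + FULL_A100_SESSIONS

print(totals)
print(memory_assigned_gb, "of", A100_MEMORY_GB, "GB assigned per sliced card")
print(total_sessions, "OnDemand GPU sessions in total")
```

By these figures, each sliced card hands out 35GB of its 40GB, which illustrates the point that slicing leaves some resources unused, and the session count agrees with the 15 OnDemand GPU sessions mentioned earlier.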

To explicitly request one of the two MIG slice types configured on the A100s serving interactive GPU sessions, enter one of these two strings into the "Additional Job Submission Arguments" field of the OnDemand web form:
--gres=gpu:1g.5gb:1
--gres=gpu:2g.10gb:1
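The profile names in these strings follow Nvidia's MIG naming, where the number before "g" is the count of compute slices and the number before "gb" is the GPU memory. As a small sketch, the hypothetical helper parse_mig_gres below pulls those numbers out of a string in exactly the form shown above; it is an illustration, not part of OnDemand or Slurm.

```python
import re

# Matches only the form shown above: --gres=gpu:<N>g.<M>gb:<count>
_MIG_GRES = re.compile(r"^--gres=gpu:(\d+)g\.(\d+)gb:(\d+)$")


def parse_mig_gres(arg):
    """Split a MIG gres argument into compute slices, memory (GB), and count.

    Hypothetical helper for illustration; raises ValueError on any other form.
    """
    match = _MIG_GRES.match(arg.strip())
    if match is None:
        raise ValueError(f"not a MIG gres argument: {arg!r}")
    compute_slices, memory_gb, count = (int(g) for g in match.groups())
    return {
        "compute_slices": compute_slices,
        "memory_gb": memory_gb,
        "count": count,
    }


small = parse_mig_gres("--gres=gpu:1g.5gb:1")
large = parse_mig_gres("--gres=gpu:2g.10gb:1")
print(small)
print(large)
print("compute ratio:", large["compute_slices"] / small["compute_slices"])
```

The ratio of compute slices between the two profiles is consistent with the 10GB slices having roughly twice the compute capability of the 5GB slices.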

Note that it is never appropriate to use NVLink or any other method to re-combine MIG'd slices into a larger resource. If a larger resource is needed, the job should use a non-MIG'd GPU. Virtualizing a GPU into slices, then stringing two or more of those slices back together for a larger job, is never a wise use of resources.

As of November 2023, we have removed the MIG configuration on one of the A100 nodes serving OnDemand to provide interactive GPU sessions with 40GB of GPU memory and all 6912 CUDA cores. While this reduces the total number of sessions available, it provides significant capabilities that were previously unavailable. There are exactly THREE OnDemand sessions available that use a full A100 GPU; these sessions are limited to one per user at a time and to TWO HOURS each. Feedback on our choices for slicing (or not slicing) GPUs to create more/fewer sessions with less/more capability per session is welcome via email to research@unc.edu.

See Longleaf SLURM examples for details on how to submit a job to the gpu partition.

How to view GPU utilization for your job while it is running

It is very important to know how well your code/workflow is utilizing GPU resources allocated to your job. The following link demonstrates a few methods to interrogate GPU utilization for your actively running jobs:
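Separately from the linked material, one common approach is to poll nvidia-smi on the node where the job is running. The sketch below assumes nvidia-smi is on the PATH of that node and uses its --query-gpu and --format=csv,noheader,nounits options. The function names and the 5% idle threshold are arbitrary choices for illustration, and MIG slices may report utilization as "[N/A]", which is stored as None here.

```python
import subprocess

QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Convert a CSV field to int, or None when nvidia-smi reports e.g. [N/A]."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = line.split(",")
        rows.append({
            "index": _to_int(index),
            "util_pct": _to_int(util),
            "mem_used_mib": _to_int(used),
            "mem_total_mib": _to_int(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once on the current node and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def idle_gpus(rows, threshold_pct=5):
    """Return indices of GPUs whose reported utilization is below the threshold."""
    return [
        row["index"] for row in rows
        if row["util_pct"] is not None and row["util_pct"] < threshold_pct
    ]
```

Calling idle_gpus(query_gpus()) from within a running job a few times over its lifetime gives a quick indication of whether every allocated GPU is doing work. A single sample can mislead, since utilization fluctuates between kernels.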

Single and Double Precision

If you are unsure whether your code requires single or double precision, please use only single precision cards until you have more information about the code/environment/frameworks in use and their specific needs/capabilities. Double precision code will work on single precision GPUs; however, the performance penalty can be significant. Some resources about precision:
   Nvidia blog post
   insideHPC article
   Numerical Accuracy from PyTorch docs
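To make the numerical difference concrete, the sketch below emulates single precision on the CPU using only the Python standard library, by rounding values through the 4-byte float format. It says nothing about GPU performance; it only illustrates the kind of rounding a calculation experiences in single precision, which can help when judging whether your code needs double precision. The helper names are made up for this example.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_float32(a, b):
    """Add two numbers as if both operands and the result were single precision."""
    return to_float32(to_float32(a) + to_float32(b))


# A small increment survives in double precision but is lost in single precision.
print(1.0 + 1e-8 == 1.0)             # double precision: the sum differs from 1.0
print(add_float32(1.0, 1e-8) == 1.0)  # single precision: the increment is rounded away

# Above 2**24, single precision can no longer represent every integer.
print(add_float32(16777216.0, 1.0))
```

If results in your own code change materially under this kind of rounding, for example in long accumulations, that is a signal to look more closely at its precision requirements.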

Some GPU jobs, such as training large language models, require more resources than a single GPU card can provide. NVLink enables more than one GPU card to participate in the work for an individual job. This is an advanced scenario, and you should be certain, through iterative tests and validation, that your jobs can make effective use of multiple cards.
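A first sanity check in a multi-GPU test is to confirm how many devices the job can actually see. The sketch below reads the CUDA_VISIBLE_DEVICES environment variable, which CUDA uses to restrict visible devices. Whether the scheduler sets it for your job is an assumption not confirmed by this page, and the function name is hypothetical.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is not set (CUDA then applies no
    restriction, so the count is unknown from the environment alone) and an
    empty list when it is set but empty (no devices visible).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [token.strip() for token in raw.split(",") if token.strip()]


ids = visible_gpu_ids()
if ids is None:
    print("CUDA_VISIBLE_DEVICES is not set")
else:
    print(len(ids), "GPU(s) visible:", ids)
```

Seeing two or more devices only shows that they were allocated. Whether the code keeps all of them busy still has to be verified by monitoring utilization during test runs.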

 

Last Update 6/23/2024 10:53:31 AM