GPUs in Sycamore

Status as of 8/20/2025

There are currently seven 4xH100 GPU nodes in Sycamore: 28 H100s in total, with 2.2TB of GPU memory in aggregate.

Within each node, NVLink enables aggregating up to four H100s for a given job using the h100_sn (H100 single node) partition described below.
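
To see how the GPUs allocated to your job are connected within a node, one option (a small sketch; the output format depends on the driver version) is to print the GPU topology from inside the job:

# NV* entries in the matrix indicate NVLink connections between GPUs
nvidia-smi topo -m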

Sycamore has a high-speed, low-latency InfiniBand network between nodes, enabling GPUs from across these H100 nodes to be coupled for even larger GPU jobs. At this time, only two nodes are fully IB-interconnected, allowing a single job to use 8 GPUs in aggregate: 640GB of GPU memory and roughly 135k CUDA cores. Reach out to our helpdesk if you can take advantage of this multi-node GPU infrastructure; we want to hear from you and help get your code running! Special QOS access is required, and multi-node jobs use a different partition, h100_mn (H100 multinode).
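
As a rough illustration of what a multi-node submission could look like, the sketch below requests two nodes with four GPUs each in the h100_mn partition. The QOS name shown is hypothetical, the task layout is an assumption, and my_multinode_gpu_job stands in for your own multi-GPU executable; confirm the real QOS and a sensible layout for your code with the helpdesk before submitting.

#!/bin/bash

#SBATCH --partition=h100_mn
# hypothetical QOS name; request the real one from the helpdesk
#SBATCH --qos=h100_mn_qos
# two fully IB-interconnected H100 nodes, 8 GPUs in aggregate
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
# one task per GPU; adjust to how your code distributes work
#SBATCH --ntasks-per-node=4

# put module commands here

srun my_multinode_gpu_job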

Accessing these H100 GPUs

You must have an account on Sycamore to access these GPUs. Anyone with a Sycamore account should be able to submit jobs to the h100_sn SLURM partition, which is the partition through which the H100 GPUs are made available.
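
To confirm the partition is visible to your account and check node availability, one quick check (output format varies by SLURM version) is:

sinfo -p h100_sn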

Submitting jobs

Like all jobs on Sycamore, jobs for the H100 GPUs are submitted through SLURM: you construct a job submission script that targets the dedicated partition and requests the appropriate compute resources (e.g., CPUs, RAM, GPUs), then submit it with the job submission command.

As an example of submitting a multi-GPU job (in this case four GPUs), create a SLURM script called example.sl using a text editor and enter the following into it:

#!/bin/bash

# Request four tasks, four GPUs, and the single-node H100 partition
#SBATCH -n 4
#SBATCH --gpus=4
#SBATCH --partition=h100_sn

# put module commands here
# module purge
# module add etc.

my_gpu_job

Then submit your job using the sbatch command:

sbatch example.sl

Here example.sl is the SLURM job submission script used to run the executable my_gpu_job (which needs to be on your PATH) on Sycamore; my_gpu_job stands in for your own GPU code that uses four GPUs.

The script runs four tasks (-n 4) and allocates four GPUs (--gpus=4) in the h100_sn partition (--partition=h100_sn). You can optionally add any module commands after the SBATCH directives. A slightly fuller resource request is sketched below.
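
If your workload needs explicit CPU, memory, or wall-time requests alongside the GPUs, a variant of the same script might look like the following sketch. The specific values (CPUs per task, memory, time limit) are illustrative assumptions, not site policy; adjust them to your code and your group's limits.

#!/bin/bash

# Illustrative resource requests; adjust to your own workload
#SBATCH --partition=h100_sn
#SBATCH -n 4
#SBATCH --gpus=4
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=04:00:00

# put module commands here

my_gpu_job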

More Sycamore SLURM examples

Monitoring

Monitoring how well utilized resources match what was allocated benefits everyone, and your own workloads most of all. With high demand for GPUs, it is essential to monitor batch jobs and verify they are making reasonable and expected use of the allocated hardware resources. Some starting commands are sketched below.
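
As a starting point (tool availability varies by SLURM version and site configuration, so treat these as a hedged sketch), the commands below show common ways to check a job's status and its GPU utilization; 12345 is a stand-in job ID.

# Your jobs currently in the queue
squeue -u $USER

# Accounting summary for a job (stand-in job ID 12345)
sacct -j 12345 --format=JobID,Elapsed,MaxRSS,State

# GPU utilization inside a running job: attach a step to it and run nvidia-smi
# (--overlap requires a reasonably recent SLURM release)
srun --jobid=12345 --overlap nvidia-smi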

Thank You Lenovo

These nodes are available courtesy of Lenovo via a Free Trial program. They are also available for purchase as patron nodes by faculty, staff, departments, centers, institutes, etc., via the patron program. If interested, reach out to us for a consult.

Per-node specifications:

  • 4x NVIDIA H100 GPUs – 80GB each; with NVLink
  • 2x AMD “Bergamo” 128 Core Sockets (256 physical CPU cores)
  • 1.5TB System RAM
  • NDR CX-7 with 2x 800Gbps ports
  • 25GbE Ethernet ports
  • Liquid Cooling

 

Last Update 9/9/2025 4:10:02 PM