NVbandwidth: Your Essential Guide to Measuring GPU Interconnect and Memory Performance

Why GPU Memory Bandwidth Matters

Modern GPU applications—from large language models to real-time inference—are often bottlenecked by data movement, not compute. You can have the fastest tensor cores in the world, but if your data can't reach them quickly, performance tanks.

Whether you're a CUDA developer optimizing a kernel, an ML infrastructure engineer validating a cluster, or a system architect designing a multi-GPU node, understanding your system's memory bandwidth is critical.

NVIDIA's open-source tool, NVbandwidth, gives you a standardized, reproducible way to measure exactly how fast data moves between:

CPU memory ↔ GPU memory
GPU memory ↔ GPU memory (peer-to-peer)
GPU memory ↔ GPU memory across nodes in a multi-node cluster

This guide walks you through installation, usage, and interpretation of results—with real command examples you can run today.

Reference: This content is based on the official NVIDIA NVbandwidth blog post.

Getting Started: Installation & Basic Usage

Prerequisites

NVIDIA GPU with CUDA compute capability 6.0+
CUDA Toolkit 11.x or later (12.3 or later for multi-node builds)
C++17 compiler (GCC 7+ on Linux)
CMake 3.20+
Boost program_options
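
The version minimums above can be checked before building. The following is a minimal sketch, not part of NVbandwidth: the helper names (version_ge, report, first_version) are illustrative, it assumes GNU sort with the -V option, and it only covers CMake, GCC, and the CUDA Toolkit (Boost and compute capability are not checked).

```shell
#!/usr/bin/env bash
# Sketch: compare installed tool versions against the minimums listed above.

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints OK, TOO OLD, or MISSING for one tool.
# Arguments: label, minimum version, detected version (may be empty).
report() {
    local label="$1" minimum="$2" found="$3"
    if [ -z "$found" ]; then
        echo "$label: MISSING (need >= $minimum)"
    elif version_ge "$found" "$minimum"; then
        echo "$label: OK ($found >= $minimum)"
    else
        echo "$label: TOO OLD ($found < $minimum)"
    fi
}

# Pull the first dotted version number out of a tool's version banner.
first_version() { grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1; }

report "CMake"        3.20 "$(cmake --version 2>/dev/null | first_version)"
report "GCC"          7.0  "$(gcc -dumpfullversion 2>/dev/null | first_version)"
report "CUDA Toolkit" 11.0 "$(nvcc --version 2>/dev/null | grep -i release | first_version)"
```

A tool that is not on PATH is reported as MISSING rather than causing the script to fail, so the script can run on a machine that is only partly set up.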

Build from Source

# Clone the repository
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth

# Configure and build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Run the full test suite (all testcases, default settings)
./nvbandwidth

Quick Benchmark: Host-to-Device Bandwidth

# Run one testcase: 1024 MiB (1 GiB) buffer, 10 samples, JSON output
./nvbandwidth -t host_to_device_memcpy_ce -b 1024 -i 10 -j

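Example JSON output (simplified for illustration; the real output nests results per testcase, and its fields can vary between versions):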

{
  "test": "host_to_device_memcpy_ce",
  "bandwidth_gbs": 55.63,
  "coefficient_of_variation": 0.00,
  "unit": "GB/s"
}
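
To keep results for later comparison, the JSON can be redirected to a timestamped file. This is a minimal sketch under a few assumptions: the function names, the results directory, and the NVBANDWIDTH environment variable are all hypothetical, and it assumes that -j writes only JSON to standard output.

```shell
#!/usr/bin/env bash
# Sketch: save each JSON run under a timestamped file name.

# Build a path such as results/host_to_device_memcpy_ce_20240101T120000Z.json
# Arguments: testcase name, optional output directory (default: results).
result_path() {
    local testcase="$1" dir="${2:-results}"
    printf '%s/%s_%s.json\n' "$dir" "$testcase" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Run one testcase with the same flags as the example above and save the output.
# NVBANDWIDTH is a hypothetical override for the binary location.
run_and_save() {
    local testcase="$1" out
    out="$(result_path "$testcase")"
    mkdir -p "$(dirname "$out")"
    "${NVBANDWIDTH:-./nvbandwidth}" -t "$testcase" -b 1024 -i 10 -j > "$out" && echo "$out"
}
```

After sourcing the file from the build directory, run_and_save host_to_device_memcpy_ce prints the path of the saved file on success.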

Multi-GPU Peer-to-Peer Test

# Measure device-to-device bandwidth (copy engine)
./nvbandwidth -t device_to_device_memcpy_read_ce -b 1024 -i 10 -j

This prints a matrix showing the measured bandwidth between every pair of GPUs in your system.

Pro Tip: Use the -j flag for machine-readable output that you can pipe into monitoring dashboards or CI pipelines.

Figure: NVbandwidth running in a terminal, showing bandwidth test results in JSON format.

Understanding the Results & Limitations

Interpreting Bandwidth Matrices

NVbandwidth prints each result as a matrix, and the header line states the copy direction. In the example below, rows are source GPUs and columns are destination GPUs; values are in GB/s:

memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
      0      1      2      3
0   N/A  397.4  397.4  397.5
1  397.6   N/A  397.4  397.5
2  397.5  397.4   N/A  397.5
3  397.5  397.4  397.5   N/A

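A value close to the theoretical peak of your interconnect indicates healthy hardware. For example, fourth-generation NVLink on an H100 offers 900 GB/s of total bidirectional bandwidth per GPU, which is 450 GB/s per direction. Measured against that figure, the roughly 397 GB/s above would be about 88% of peak.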
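
The comparison against peak can be scripted. The sketch below is an illustration only: it parses the plain-text matrix format shown above (not the JSON), the function name summarize_matrix is hypothetical, and the peak is a value you supply from your hardware's datasheet. The 450 used here is an assumed per-direction figure, not something NVbandwidth reports.

```shell
#!/usr/bin/env bash
# Sketch: summarize a text bandwidth matrix against an assumed theoretical peak.

# Usage: summarize_matrix PEAK_GBS < matrix.txt
summarize_matrix() {
    awk -v peak="$1" '
        # Data rows start with a row index and hold decimals or N/A;
        # the title line and the column-index header are skipped.
        $1 ~ /^[0-9]+$/ && /N\/A|[0-9]\.[0-9]/ {
            for (col = 2; col <= NF; col++) {
                if ($col == "N/A") continue
                value = $col + 0
                if (count == 0 || value < min) min = value
                sum += value
                count++
            }
        }
        END {
            if (count == 0) { print "no bandwidth values found"; exit 1 }
            printf "pairs=%d min=%.1f mean=%.1f min_pct_of_peak=%.1f\n",
                   count, min, sum / count, 100 * min / peak
        }
    '
}

# The matrix from the example above, checked against an assumed 450 GB/s peak.
summarize_matrix 450 <<'EOF'
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
      0      1      2      3
0   N/A  397.4  397.4  397.5
1  397.6   N/A  397.4  397.5
2  397.5  397.4   N/A  397.5
3  397.5  397.4  397.5   N/A
EOF
# prints: pairs=12 min=397.4 mean=397.5 min_pct_of_peak=88.3
```

Reporting the minimum rather than only the mean makes a single slow link stand out, which is usually what you are looking for when validating hardware.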

Common Pitfalls

Driver version matters: Use a driver release that supports your CUDA toolkit version, and note both versions with every result. A mismatch can cause failures or skew the numbers.
Thermal throttling: If GPUs overheat, clocks drop and bandwidth suffers. Take baseline measurements on a cool, idle system, for example right after a cold boot.
Topology unawareness: NVbandwidth is topology-agnostic, but the system topology (PCIe switch, NVSwitch) directly affects results. Check nvidia-smi topo -m.
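
Each of the pitfalls above is easier to diagnose when the driver version, temperatures, clocks, and topology are recorded alongside the benchmark output. A minimal sketch follows; the function name capture_env is hypothetical, and the query fields are standard nvidia-smi ones that you can confirm on your system with nvidia-smi --help-query-gpu.

```shell
#!/usr/bin/env bash
# Sketch: record driver, thermal, and topology details next to a benchmark run.

capture_env() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; skipping environment capture" >&2
        return 1
    fi
    # Per-GPU driver version, temperature, and current SM clock as CSV.
    nvidia-smi --query-gpu=index,name,driver_version,temperature.gpu,clocks.sm --format=csv
    # Interconnect matrix (NVLink, PCIe switches, CPU and NUMA affinity).
    nvidia-smi topo -m
}
```

Calling capture_env > env.txt just before each run gives you a snapshot to compare when two runs disagree.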

Limitations & Caveats

Not a replacement for application profiling: Bandwidth numbers are synthetic. Your actual workload's memory access pattern may differ.
Single-node vs multi-node: Multi-node tests require MPI and the NVIDIA IMEX service. Setup is non-trivial.
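Copy Engine vs SM kernels: CE tests move data with the GPU's dedicated copy engines through the memcpy APIs, while SM tests move it with copy kernels that occupy streaming multiprocessors. The two methods can report noticeably different bandwidth for the same link.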
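
To see the CE and SM difference on your own system, both variants of a testcase can be run back to back with identical settings. This sketch assumes the testcase names follow the _ce and _sm suffix pattern used by the examples in this guide (./nvbandwidth -l lists the names your build supports); the function names and the NVBANDWIDTH variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run the copy-engine and SM variants of one testcase family.

# Print both variant names for a family such as device_to_device_memcpy_read.
variants() {
    printf '%s_ce\n%s_sm\n' "$1" "$1"
}

# Run both variants with the same buffer size and sample count.
# NVBANDWIDTH is a hypothetical override for the binary location.
compare_ce_sm() {
    local testcase
    for testcase in $(variants "$1"); do
        echo "== $testcase =="
        "${NVBANDWIDTH:-./nvbandwidth}" -t "$testcase" -b 1024 -i 10
    done
}
```

For example, compare_ce_sm device_to_device_memcpy_read prints the two matrices one after the other.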


Figure: Multi-GPU memory copy patterns between CPU and GPU devices, as measured by NVbandwidth.

Next Steps: From Measurement to Optimization

NVbandwidth gives you the raw numbers. Now what?

Establish a baseline: Run the full suite on a fresh system. Save the JSON output.
Regression testing: After driver updates or hardware changes, re-run and compare.
Optimize data transfer patterns:
- Use pinned (page-locked) memory for faster H2D transfers.
- Prefer asynchronous transfers with streams to overlap compute and copy.
- For multi-GPU, use NVLink peer-to-peer access instead of staging through CPU.
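
The baseline and regression steps above can be automated with a cell-by-cell comparison. The sketch below is one possible approach, not a feature of NVbandwidth: it compares plain-text matrices (saved without -j) in the format shown earlier, the function name compare_matrices is hypothetical, and the default 5% threshold is an arbitrary starting point.

```shell
#!/usr/bin/env bash
# Sketch: flag cells whose bandwidth dropped relative to a saved baseline.

# Usage: compare_matrices BASELINE CURRENT [MAX_DROP_PERCENT]
# Prints every cell that dropped by more than the threshold; fails if any did.
compare_matrices() {
    awk -v limit="${3:-5}" '
        # Keep only data rows: a row index followed by decimals or N/A.
        !($1 ~ /^[0-9]+$/ && /N\/A|[0-9]\.[0-9]/) { next }
        FNR == NR {                       # first file: remember the baseline
            for (col = 2; col <= NF; col++) base[$1, col] = $col
            next
        }
        {                                 # second file: compare against it
            for (col = 2; col <= NF; col++) {
                old = base[$1, col]
                if (old == "" || old == "N/A" || $col == "N/A") continue
                drop = 100 * (old - $col) / old
                if (drop > limit) {
                    # Assumes column indices start at 0, as in the example output.
                    printf "row %s col %d: %.1f -> %.1f GB/s (-%.1f%%)\n",
                           $1, col - 2, old, $col, drop
                    failed = 1
                }
            }
        }
        END { exit failed + 0 }
    ' "$1" "$2"
}
```

Because the function exits non-zero when any cell regresses, a call such as compare_matrices baseline.txt current.txt 5 can gate a CI job directly.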
