NVIDIA Shatters MLPerf Inference Records: Blackwell Ultra, 2.7x Software Gains, and the Rise of Interactive AI

The New Benchmark King: MLPerf Inference v6.0

NVIDIA just dropped its latest MLPerf Inference results, and the numbers are staggering. With 291 cumulative wins since 2018 (9x more than every other submitter combined), NVIDIA is flexing its full-stack dominance. But this isn't just about raw hardware specs. The real story is how co-designed hardware, software, and models are driving down token costs and unlocking new AI use cases.

MLPerf Inference v6.0 introduced five new, more demanding benchmarks:

DeepSeek-R1 Interactive: A high-interactivity scenario with 5x faster minimum token rate
Qwen3-VL-235B-A22B: The first multimodal model in the suite (vision-language)
GPT-OSS-120B: A 120B-parameter MoE reasoning LLM from OpenAI
WAN-2.2-T2V-A14B: Text-to-video generation
DLRMv3: A transformer-based generative recommendation benchmark

NVIDIA was the only platform to submit results on all new models and scenarios, and it topped every single one. Check out the full breakdown from the official NVIDIA MLPerf blog post.

Software Optimization: The 2.7x Secret Sauce

Here's where it gets interesting for developers. The same NVIDIA GB300 NVL72 hardware launched last year now delivers 2.7x higher token throughput on the DeepSeek-R1 server scenario compared to just six months ago. On unchanged hardware, each token now costs about 1/2.7 ≈ 37% of what it did, which is a 60%+ reduction in cost per token from software alone.
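To make the arithmetic behind that cost claim concrete, here is a minimal sketch. It assumes the hardware cost per hour stays fixed, so cost per token scales inversely with throughput. The function name is illustrative and does not come from any NVIDIA tool.

```python
def cost_per_token_reduction(throughput_gain: float) -> float:
    """Fractional drop in cost per token when throughput rises by
    `throughput_gain` and the hardware cost per hour is unchanged."""
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    # New cost per token is 1/throughput_gain of the old cost.
    return 1.0 - 1.0 / throughput_gain


reduction = cost_per_token_reduction(2.7)
print(f"{reduction:.1%}")  # prints "63.0%"
```

The result, roughly 63%, matches the "60%+" figure quoted above.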

This is powered by the open-source TensorRT-LLM and NVIDIA Dynamo frameworks. Key optimizations include:

Faster fused kernels: Fewer, higher-performance kernel calls
Optimized Attention Data Parallel: Better load balancing across GPU ranks
Disaggregated serving: Separating prefill and decode phases for optimal throughput
Wide Expert Parallel (WideEP): Sharding MoE experts across multiple GPUs to reduce weight-load bottlenecks
Multi-Token Prediction (MTP): Using idle compute to predict and verify up to 3 tokens in parallel
KV-aware routing: Dynamo routes requests by evaluating compute costs across workers

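# Example: conceptual sketch of KV-aware routing (simplified).
# In practice, Dynamo handles this transparently. The Worker class and the
# cost model below are hypothetical stand-ins, not Dynamo APIs.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    kv_cache_utilization: float  # fraction of KV cache in use (0.0 to 1.0)

    def infer(self, request):
        return f"{self.name} handled {request!r}"

def estimate_compute_cost(request, kv_cache_utilization):
    """Toy cost model: request length scaled up by KV cache pressure."""
    return len(request) * (1.0 + kv_cache_utilization)

def route_request(request, workers):
    """Route the request to the worker with the lowest estimated cost."""
    # Pick the worker with the lowest estimated cost
    best_worker = min(
        workers, key=lambda w: estimate_compute_cost(request, w.kv_cache_utilization))
    return best_worker.infer(request)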
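The verify step behind Multi-Token Prediction can be sketched in a similar spirit. This is a toy illustration of draft-and-verify decoding under stated assumptions, not TensorRT-LLM's implementation: `verify_draft` and `toy_target` are hypothetical names, the "model" is a trivial function, and the loop checks tokens one at a time for readability, whereas a real system would score all draft positions in a single forward pass.

```python
from typing import Callable, List, Sequence

MAX_DRAFT_TOKENS = 3  # the post cites "up to 3 tokens"


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Keep the longest draft prefix the target model agrees with.

    On the first disagreement, the target's own token is emitted instead,
    so every call yields at least one token (given a non-empty draft).
    """
    accepted: List[int] = []
    for token in draft[:MAX_DRAFT_TOKENS]:
        expected = target_next(list(context) + accepted)
        if token != expected:
            accepted.append(expected)  # replace the wrong guess and stop
            break
        accepted.append(token)
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Hypothetical stand-in for the full model: next token is last + 1 (mod 10)."""
    return (tokens[-1] + 1) % 10


print(verify_draft([1, 2, 3], [4, 5, 9], toy_target))  # prints [4, 5, 6]
```

When the draft is fully correct, three tokens come out of one verification round instead of one, which is where the throughput gain would come from.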

Scale-Out: Millions of Tokens Per Second

NVIDIA didn't stop at single-node performance. By connecting four GB300 NVL72 systems (288 Blackwell Ultra GPUs) with Quantum-X800 InfiniBand, they achieved:

2,494,310 tokens/sec (Offline)
1,555,110 tokens/sec (Server)

This is the largest scale ever submitted to any MLPerf Inference benchmark. For AI factories, this means serving more users, generating more revenue, and reducing token costs at scale.
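For a rough sense of scale, the aggregate figures above can be turned into a per-GPU average. This is a back-of-the-envelope sketch: it simply divides by the 288 GPUs stated above (four systems of 72 GPUs each) and says nothing about scaling efficiency or how the load was actually distributed.

```python
GPUS_PER_NVL72 = 72
NUM_SYSTEMS = 4
TOTAL_GPUS = GPUS_PER_NVL72 * NUM_SYSTEMS  # 288, as stated in the post

# Aggregate tokens/sec reported for the four-system submission
RESULTS = {"Offline": 2_494_310, "Server": 1_555_110}


def per_gpu_throughput(total_tokens_per_sec: float, num_gpus: int = TOTAL_GPUS) -> float:
    """Simple average: aggregate throughput divided by GPU count."""
    return total_tokens_per_sec / num_gpus


for scenario, total in RESULTS.items():
    print(f"{scenario}: {per_gpu_throughput(total):,.0f} tokens/sec per GPU")
# Offline: 8,661 tokens/sec per GPU
# Server: 5,400 tokens/sec per GPU
```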

Limitations and Caveats

While these results are impressive, a few things to keep in mind:

Benchmark conditions: MLPerf results are run under specific, controlled conditions. Real-world production deployments may vary.
Hardware cost: Blackwell Ultra GPUs and Quantum-X800 InfiniBand networking are premium infrastructure. Not every team can justify the investment.
Software complexity: Achieving these optimizations requires deep integration with TensorRT-LLM and Dynamo, which have a learning curve.
Model specificity: The 2.7x gain is model-specific (DeepSeek-R1). Gains on other architectures may be smaller.

What's Next: MLPerf Endpoints

NVIDIA is already working with MLCommons on MLPerf Endpoints, a new benchmark that measures deployed service performance under real API traffic. This will capture metrics like latency, throughput, and cost under realistic conditions, giving developers a clearer picture of production readiness.

For AI engineers, the takeaway is clear: software optimization is the new hardware. The same GPUs can deliver dramatically different performance with the right stack. Dive into TensorRT-LLM and start experimenting with disaggregated serving and multi-token prediction.
