The New Benchmark King: MLPerf Inference v6.0
NVIDIA just dropped its latest MLPerf Inference results, and the numbers are staggering. With 291 cumulative wins since 2018—that's 9x more than every other submitter combined—NVIDIA is flexing its full-stack dominance. But this isn't just about raw hardware specs. The real story is how co-designed hardware, software, and models are driving down token costs and unlocking new AI use cases.
MLPerf Inference v6.0 introduced five new, more demanding benchmarks:
- DeepSeek-R1 Interactive: A high-interactivity scenario with a 5x faster minimum token rate
- Qwen3-VL-235B-A22B: The first multimodal model in the suite (vision-language)
- GPT-OSS-120B: A 120B-parameter MoE reasoning LLM from OpenAI
- WAN-2.2-T2V-A14B: Text-to-video generation
- DLRMv3: A transformer-based generative recommendation benchmark
NVIDIA was the only platform to submit results on all new models and scenarios, and it topped every single one. The official NVIDIA MLPerf blog post has the full breakdown.
Software Optimization: The 2.7x Secret Sauce
Here's where it gets interesting for developers. The same NVIDIA GB300 NVL72 hardware that launched last year now delivers 2.7x higher token throughput on the DeepSeek-R1 server scenario than it did just six months ago. At a fixed hardware cost, 2.7x the throughput means each token costs about 1/2.7 ≈ 37% of what it did, a 60%+ reduction in cost per token from software alone.
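To make the cost arithmetic concrete, here is a minimal sketch. It assumes the hourly cost of running the system is unchanged, so cost per token scales inversely with throughput. The function name is ours, not part of any NVIDIA tool.

```python
def cost_reduction_from_speedup(speedup: float) -> float:
    """Fractional drop in cost per token when throughput rises by `speedup`.

    Assumes the hourly cost of the system stays the same, so cost per
    token is proportional to 1 / throughput.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


reduction = cost_reduction_from_speedup(2.7)
print(f"{reduction:.1%}")  # 63.0%
```

A 2.7x speedup therefore works out to roughly a 63% lower cost per token, which is where the "60%+" figure comes from.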
This is powered by the open-source TensorRT-LLM and NVIDIA Dynamo frameworks. Key optimizations include:
- Faster fused kernels: Fewer, higher-performance kernel calls
- Optimized Attention Data Parallel: Better load balancing across GPU ranks
- Disaggregated serving: Separating prefill and decode phases for optimal throughput
- Wide Expert Parallel (WideEP): Sharding MoE experts across multiple GPUs to reduce weight-load bottlenecks
- Multi-Token Prediction (MTP): Using idle compute to predict and verify up to 3 tokens in parallel
- KV-aware routing: Dynamo routes requests by evaluating compute costs across workers
# Example: conceptual code for KV-aware routing (simplified).
# In practice, Dynamo handles this transparently.

def estimate_compute_cost(request, kv_cache_utilization):
    """Hypothetical cost model (not Dynamo's): fuller KV caches cost more."""
    # `request.prompt_tokens` is an assumed attribute, used for illustration only
    return len(request.prompt_tokens) * (1.0 + kv_cache_utilization)

def route_request(request, workers):
    """Route an inference request to the worker with the lowest estimated cost."""
    # Score every worker and pick the one with the lowest estimated cost
    best_worker = min(
        workers,
        key=lambda w: estimate_compute_cost(request, w.kv_cache_utilization),
    )
    return best_worker.infer(request)
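Multi-Token Prediction from the list above can be illustrated in the same spirit. The following is a toy draft-and-verify loop, not TensorRT-LLM's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and its cheap multi-token predictor, and the verification loop runs sequentially here, whereas a real engine would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    num_draft: int = 3,
) -> List[Token]:
    """Return the tokens produced by one draft-and-verify step."""
    # 1. Draft: cheaply propose `num_draft` tokens, one after another.
    draft: List[Token] = []
    for _ in range(num_draft):
        draft.append(draft_next(context + draft))

    # 2. Verify: keep drafted tokens while they match the target's choice.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: use the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def target_next(ctx: List[Token]) -> Token:
    """Toy 'full model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    """Toy 'draft': agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], target_next, draft_next))  # [2, 3, 4, 5]
print(speculative_step([3], target_next, draft_next))  # [4, 5]
```

When all three drafts are right, one step yields four tokens. When a draft is wrong, the output still matches what the target alone would have produced, so the speedup costs nothing in output quality under this greedy-matching assumption.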
Scale-Out: Millions of Tokens Per Second
NVIDIA didn't stop at single-node performance. By connecting four GB300 NVL72 systems (288 Blackwell Ultra GPUs) with Quantum-X800 InfiniBand, they achieved:
- 2,494,310 tokens/sec (Offline)
- 1,555,110 tokens/sec (Server)
This is the largest scale ever submitted to any MLPerf Inference benchmark. For AI factories, this means serving more users, generating more revenue, and reducing token costs at scale.
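For a sense of scale, the headline numbers can be broken down per GPU. This is simple division over the figures quoted above, assuming the load is spread evenly across all 288 GPUs. The variable names are ours.

```python
GPUS_PER_NVL72 = 72
NUM_SYSTEMS = 4
total_gpus = GPUS_PER_NVL72 * NUM_SYSTEMS  # 288, matching the submission

# Aggregate throughput figures quoted in this post (tokens/sec)
results_tokens_per_sec = {"Offline": 2_494_310, "Server": 1_555_110}

# Average per-GPU throughput, assuming an even split across GPUs
per_gpu = {name: tps / total_gpus for name, tps in results_tokens_per_sec.items()}

for name, tps in per_gpu.items():
    print(f"{name}: {tps:,.0f} tokens/sec per GPU")
# Offline: 8,661 tokens/sec per GPU
# Server: 5,400 tokens/sec per GPU
```

The Server figure is lower than Offline, as expected, because the server scenario has to meet latency constraints that the offline scenario does not.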
Limitations and Caveats
While these results are impressive, a few things to keep in mind:
- Benchmark conditions: MLPerf results are run under specific, controlled conditions. Real-world production deployments may vary.
- Hardware cost: Blackwell Ultra GPUs and Quantum-X800 InfiniBand networking are premium infrastructure. Not every team can justify the investment.
- Software complexity: Achieving these optimizations requires deep integration with TensorRT-LLM and Dynamo, which have a learning curve.
- Model specificity: The 2.7x gain is model-specific (DeepSeek-R1). Gains on other architectures may be smaller.
What's Next: MLPerf Endpoints
NVIDIA is already working with MLCommons on MLPerf Endpoints, a new benchmark that measures deployed service performance under real API traffic. This will capture metrics like latency, throughput, and cost under realistic conditions—giving developers a clearer picture of production readiness.
For AI engineers, the takeaway is clear: software optimization is the new hardware. The same GPUs can deliver dramatically different performance with the right stack. Dive into TensorRT-LLM and start experimenting with disaggregated serving and multi-token prediction.
