The MoE Renaissance: Why Sparse Models Matter Now
Large Language Models (LLMs) have followed a simple scaling law: more data + more parameters = better performance. But dense models hit practical limits — training cost skyrockets, inference latency grows, and deployment requires enormous memory.
Mixture of Experts (MoE) models offer a way out. By replacing dense feed-forward layers with multiple 'expert' sub-networks and a router that activates only a few per token, MoEs decouple total parameters from active parameters. For example, GPT-OSS-20B has 21B total parameters but only ~3.6B active per token, enabling ~110 tokens/sec on consumer hardware.
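To make the "router activates only a few experts per token" idea concrete, here is a minimal pure-Python sketch of top-k routing. It is an illustration under stated assumptions, not the code of any particular model: the experts are toy scalar functions, and renormalizing the selected probabilities so they sum to 1 is one common choice rather than a universal rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize over the selected experts only.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts; the rest never run."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Toy experts: scalar functions standing in for feed-forward sub-networks.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 0.3, 2.0]  # experts 1 and 3 score highest

selected = route_top_k(router_logits, k=2)
y = moe_forward(1.0, experts, router_logits, k=2)
```

All four experts exist in memory, but only two of them execute for this token, which is the total-versus-active distinction in miniature.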
This isn't just theory. Recent open MoE releases include Qwen 3.5, MiniMax M2, GLM-5, and Kimi K2.5 — all building on the momentum from DeepSeek R1. Even ChatGPT is rumored to use a sparse architecture.
But MoEs break assumptions baked into most ML tooling. Model loading, device placement, quantization, and backend execution were originally designed for dense models. That's where the transformers library's v5 refactor comes in.

The Weight Loading Pipeline: From Chaos to Conversion
The Problem
In a MoE checkpoint, each expert is serialized independently. For DeepSeek-V3, you see keys like:
# Checkpoint keys (256 separate tensors)
'model.layers.3.mlp.experts.0.gate_proj.weight'
'model.layers.3.mlp.experts.1.gate_proj.weight'
# ... up to expert 255
But at runtime, GPUs want all experts packed into a single contiguous tensor for efficient grouped GEMM operations. The old approach assumed checkpoint layout matches runtime layout. MoEs break that assumption.
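The gap between the two layouts can be sketched with plain Python. The snippet below groups per-expert keys by a regular expression and stacks them in expert order. Nested lists stand in for tensors, and `pack_experts` and `EXPERT_KEY` are hypothetical names made up for this illustration; they are not part of transformers.

```python
import re
from collections import defaultdict

# Matches keys such as 'model.layers.3.mlp.experts.7.gate_proj.weight'.
EXPERT_KEY = re.compile(
    r"^(?P<prefix>.*\.experts)\.(?P<expert>\d+)\.(?P<proj>\w+)\.weight$"
)

def pack_experts(state_dict):
    """Group per-expert entries and stack them in expert order.

    Returns {"<prefix>.<proj>": [expert_0_weight, expert_1_weight, ...]}.
    Entries that do not match the per-expert pattern are copied unchanged.
    """
    grouped = defaultdict(dict)
    packed = {}
    for key, tensor in state_dict.items():
        match = EXPERT_KEY.match(key)
        if match is None:
            packed[key] = tensor
            continue
        target = f"{match['prefix']}.{match['proj']}"
        grouped[target][int(match["expert"])] = tensor
    for target, by_expert in grouped.items():
        # Sort numerically so expert 10 does not land before expert 2.
        packed[target] = [by_expert[i] for i in sorted(by_expert)]
    return packed

# Toy checkpoint with experts listed out of order.
checkpoint = {
    f"model.layers.3.mlp.experts.{e}.gate_proj.weight": [[float(e)] * 2] * 2
    for e in (2, 0, 1)
}
checkpoint["model.layers.3.input_layernorm.weight"] = [1.0, 1.0]

packed = pack_experts(checkpoint)
```

With real tensors, the list comprehension would become a stack along a new leading dimension, giving one contiguous tensor per projection.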
The Solution: WeightConverter
Hugging Face introduced a generic WeightConverter abstraction. Instead of a key-by-key copy, loading becomes a conversion pipeline:
from transformers.core_model_loading import (
    Concatenate,
    MergeModulelist,
    SplitModulelist,
    WeightConverter,
)
# Stack every expert's w1 and w3 weights along a new expert dimension,
# then concatenate the two stacks into one packed gate_up_proj tensor
converter_pack = WeightConverter(
    ["block_sparse_moe.experts.*.w1.weight", "block_sparse_moe.experts.*.w3.weight"],
    "mlp.experts.gate_up_proj",
    operations=[
        MergeModulelist(dim=0),
        Concatenate(dim=1),
    ],
)
# Split back for debugging or custom backends
converter_split = WeightConverter(
    "mlp.experts.down_proj",
    "block_sparse_moe.experts.*.w2.weight",
    operations=[SplitModulelist(dim=0)],
)
Lazy Materialization = Faster Loading
The loader scans checkpoint keys once, matches them against converter patterns, and groups tensors per converter. Each key is registered as a future and materialized via a thread pool. Conversion operations (like MergeModulelist) run only once all dependencies are ready. This avoids repeated scans and reduces memory peaks.
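The scheduling idea in this paragraph can be approximated with the standard library. This is a simplified sketch, not the transformers loader: `read_tensor` is a hypothetical stand-in for a disk read, and the "conversion" is just collecting results in expert order.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

PATTERN = re.compile(r"(.*\.experts)\.(\d+)\.(\w+)\.weight")

def read_tensor(key):
    """Hypothetical stand-in for a disk read; returns a fake 1-D tensor."""
    expert = int(PATTERN.fullmatch(key).group(2))
    return [float(expert)] * 4

def load_lazily(keys, max_workers=4):
    pending = defaultdict(dict)  # target name -> {expert index: Future}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One pass over the keys: match, group by target, register a future.
        for key in keys:
            match = PATTERN.fullmatch(key)
            if match is None:
                continue  # unmatched keys are skipped in this sketch
            prefix, expert, proj = match.groups()
            pending[f"{prefix}.{proj}"][int(expert)] = pool.submit(read_tensor, key)
        # Conversion step: each target is merged once all its futures resolve.
        return {
            target: [futures[i].result() for i in sorted(futures)]
            for target, futures in pending.items()
        }

keys = [f"model.layers.3.mlp.experts.{e}.gate_proj.weight" for e in (3, 1, 0, 2)]
merged = load_lazily(keys)
```

Because reads are submitted as futures during the single scan, I/O for later keys overlaps with matching and grouping, and the merge never starts on a partially loaded group.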
Benchmark Results
| Version | Strategy | Loading Mode | Time |
|---|---|---|---|
| v4.57.6 | device_map="auto" | Threadpool | 66.24s |
| v4.57.6 | TP | — | OOM |
| v5 | device_map="auto" | Async (default) | 20.71s |
| v5 | TP | Async | 10.1s |
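Loading with TP in v5 takes 10.1s, roughly 6.5x faster than the 66.24s v4 baseline with device_map="auto" (v4 ran out of memory under TP, so there is no like-for-like TP comparison). The gain comes from single-pass routing, async materialization, and conversion-aware scheduling, not just more threads.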

Expert Backend: Pluggable Execution for Any Hardware
Once experts are packed into a single tensor, you need to route tokens efficiently. The Expert Backend system (PR #42697) decouples expert computation from model implementation via a decorator:
from torch import nn
from transformers.integrations.moe import use_experts_implementation
@use_experts_implementation
class LlamaSparseMoeBlock(nn.Module):
    # The decorator dispatches to the selected backend automatically
    pass
Three backends are available:
- eager: loops over experts. Great for debugging.
- batched_mm: duplicates expert weights per token, uses torch.bmm. Fast for small batches.
- grouped_mm: sorts tokens by expert ID, uses torch._grouped_mm. Best for large batches or memory-constrained setups.
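The difference between looping over experts and sorting tokens by expert ID can be shown without any tensor library. The sketch below is a conceptual illustration only: it uses top-1 routing and toy scalar experts, and it does not call torch.bmm or torch._grouped_mm. The function names are made up for this example.

```python
def eager_experts(tokens, expert_ids, experts):
    """Loop over experts; each one scans for the tokens routed to it."""
    out = [None] * len(tokens)
    for e, expert in enumerate(experts):
        for t, token in enumerate(tokens):
            if expert_ids[t] == e:
                out[t] = expert(token)
    return out

def grouped_experts(tokens, expert_ids, experts):
    """Sort tokens by expert ID, process each contiguous group, then unsort."""
    order = sorted(range(len(tokens)), key=lambda t: expert_ids[t])
    out = [None] * len(tokens)
    start = 0
    while start < len(order):
        e = expert_ids[order[start]]
        end = start
        while end < len(order) and expert_ids[order[end]] == e:
            end += 1
        # One pass per group; a real kernel would issue one grouped matmul here.
        for t in order[start:end]:
            out[t] = experts[e](tokens[t])  # write back to the original slot
        start = end
    return out

experts = [lambda x, s=s: s * x for s in (10, 20, 30)]
tokens = [1, 2, 3, 4, 5]
expert_ids = [2, 0, 2, 1, 0]  # top-1 routing for simplicity

eager_out = eager_experts(tokens, expert_ids, experts)
grouped_out = grouped_experts(tokens, expert_ids, experts)
```

Both paths produce the same outputs in the original token order; the grouped version simply arranges the work so that each expert sees one contiguous slice.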
Expert Parallelism: Scaling Beyond One GPU
Expert parallelism distributes experts across multiple devices. Each GPU loads only its assigned subset. Enable it with one config:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed.configuration_utils import DistributedConfig
distributed_config = DistributedConfig(enable_expert_parallel=True)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    dtype="auto",
    distributed_config=distributed_config,
)
Launch with:
torchrun --nproc-per-node N script.py
Here, N must evenly divide the total number of experts. The model switches from tensor-parallel to expert-parallel with specialized sharding: GroupedGemmParallel splits expert weights along dim=0, and RouterParallel remaps global indices to local ones, using all-reduce to combine outputs.
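The sharding and index remapping described above can be simulated in a single process. This is a rough sketch of the idea, not the GroupedGemmParallel or RouterParallel implementation: all function names are hypothetical, experts are toy scalar weights, and the all-reduce is modeled as an element-wise sum over per-rank partial outputs.

```python
def local_expert_range(num_experts, world_size, rank):
    """Contiguous block of global expert IDs owned by one rank."""
    if num_experts % world_size != 0:
        raise ValueError("world_size must evenly divide num_experts")
    per_rank = num_experts // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)

def remap_to_local(global_ids, num_experts, world_size, rank):
    """Map global expert IDs to local ones; None marks experts held elsewhere."""
    owned = local_expert_range(num_experts, world_size, rank)
    return [g - owned.start if g in owned else None for g in global_ids]

def rank_output(tokens, global_ids, scales, num_experts, world_size, rank):
    """Partial output of one rank: zero for tokens whose expert lives elsewhere."""
    owned = local_expert_range(num_experts, world_size, rank)
    local_scales = scales[owned.start:owned.stop]  # this rank's shard along dim 0
    local_ids = remap_to_local(global_ids, num_experts, world_size, rank)
    return [0.0 if i is None else local_scales[i] * x
            for x, i in zip(tokens, local_ids)]

def all_reduce_sum(partials):
    """Simulated all-reduce: element-wise sum across ranks."""
    return [sum(vals) for vals in zip(*partials)]

num_experts, world_size = 8, 4
scales = [float(e + 1) for e in range(num_experts)]  # toy expert "weights"
tokens = [1.0, 1.0, 2.0]
global_ids = [0, 5, 7]  # top-1 expert per token

partials = [rank_output(tokens, global_ids, scales, num_experts, world_size, r)
            for r in range(world_size)]
combined = all_reduce_sum(partials)
```

Each rank contributes a non-zero value only for tokens routed to its own experts, so summing the partials reconstructs the full output.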
Training MoEs: 12x Faster with Unsloth
Training MoEs is complex — massive parameter count, distributed communication, routing instabilities. Hugging Face collaborated with Unsloth to deliver:
- ~12× faster MoE training
- 35% VRAM reduction
- ~6× longer context
- 12–30× overall speedup over v4
This leverages the Expert Backend abstraction and custom Triton grouped-GEMM + LoRA kernels. For full details, check out Unsloth’s official guide.
Limitations & Caveats
- Router Collapse: Without careful regularization (e.g., load balancing loss), the router may route all tokens to the same few experts, negating MoE benefits.
- Memory Overhead: Even though active parameters are few, total parameters must fit in aggregate GPU memory across devices.
- Quantization Complexity: Per-expert quantization only makes sense once experts are in a predictable packed layout — which v5 now enables, but it's still an advanced workflow.
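As an illustration of the load balancing loss mentioned under Router Collapse, here is a minimal sketch of the auxiliary loss in the style of the Switch Transformer paper: the number of experts times the sum over experts of (fraction of tokens dispatched to the expert) times (mean router probability for the expert), scaled by a small coefficient. The coefficient value and the top-1 dispatch are assumptions for this example; individual models may use different variants.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Auxiliary loss: alpha * N * sum_i f_i * p_i over N experts.

    router_probs: one probability list per token, each of length N.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]
    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens spread evenly over four experts.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Four tokens that all prefer expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
```

The collapsed routing yields a higher penalty than the balanced one, which is what pushes the router toward spreading tokens across experts during training.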
What's Next?
- Dynamic Expert Allocation: Allocating more experts to harder tokens during inference.
- MoE for Vision & Multimodal: Sparse architectures are spreading beyond language.
- Better Router Architectures: Learned routing (as in Switch Transformer) vs. hash-based routing.
If you're building with MoEs or experimenting with new sparse ideas, the transformers library is evolving with you. Let the community know what abstractions, kernels, or workflows you'd like to see next.
Source: Hugging Face Blog - MoE in Transformers

Conclusion
Hugging Face transformers v5 isn't just a minor update — it's a fundamental rethinking of how the library handles sparse architectures. The WeightConverter pipeline, pluggable Expert Backend, and native expert parallelism make MoE models first-class citizens. If you've been avoiding MoEs because of loading complexity or training overhead, now is the time to dive in.
Next Steps:
- Try loading a MoE model with transformers>=5.0 and benchmark loading time.
- Experiment with different Expert Backends (eager, batched_mm, grouped_mm) for your use case.
- Explore Unsloth for MoE fine-tuning if you're training custom models.
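For the loading-time benchmark, a small timing helper is enough to get started. The sketch below is generic: `time_call` is a hypothetical helper, and the workload is a placeholder so the snippet runs without downloading a model. The commented-out call shows how a real load might be wrapped, reusing the model ID from the expert-parallel example.

```python
import time

def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Placeholder workload so the sketch runs anywhere.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.4f}s")

# To time a real load, something along these lines should work:
# from transformers import AutoModelForCausalLM
# elapsed = time_call(
#     lambda: AutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b", dtype="auto"),
#     repeats=1,
# )
```

Note that repeated loads of the same checkpoint usually benefit from the operating system's file cache, so the first run and later runs are not directly comparable.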