Introduction
Pretraining frontier-scale large language models (LLMs) in FP8 has become standard practice, but moving to 4-bit floating point has remained an open research challenge because narrower formats compress dynamic range and amplify quantization error over long token horizons. NVIDIA's research introduces a pretraining methodology built around NVFP4, a 4-bit microscaling format natively supported by Blackwell Tensor Cores. The authors validated this approach by pretraining a 12-billion-parameter hybrid Mamba-Transformer model on 10 trillion tokens, the longest publicly documented training run in 4-bit precision to date. The resulting model achieves 62.58% on MMLU-Pro 5-shot versus 62.62% for the FP8 baseline, demonstrating near-lossless accuracy while offering up to 3× the peak GEMM throughput of FP8. This guide walks you through the steps to replicate that methodology.

What You Need
- NVIDIA Blackwell GPU (e.g., GB200 or GB300) with Tensor Core support for NVFP4.
- PyTorch (or similar framework) with NVIDIA Transformer Engine installed and configured for FP4 operations.
- LLM model architecture compatible with hybrid Mamba-Transformer designs (or you can adapt your own).
- Training dataset of at least 10 trillion tokens (e.g., curated web text).
- Optimizer (e.g., AdamW) with FP32 state accumulation.
- Knowledge of microscaling formats—review NVFP4 details below.
Step-by-Step Guide
Step 1: Understand the NVFP4 Format
NVFP4 differs from standard MXFP4 in three key ways. First, the block size is reduced from 32 to 16 elements, narrowing the dynamic range each scale must cover. Second, block scale factors use E4M3 instead of UE8M0, trading exponent range for mantissa precision so the per-block absolute maximum (amax) maps closer to the FP4 maximum representable value. Third, an additional FP32 per-tensor scale remaps values so the E4M3 block scales stay in range. Because the amax of every 16-element block is mapped close to the FP4 maximum, at least one value in 16 (6.25%) is represented at near-FP8 precision, while the rest remain in FP4. On Blackwell hardware, FP4 GEMMs run at 4× BF16 peak throughput on GB200 and 6× on GB300, which is roughly 2× and 3× the FP8 rate, respectively.
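To make the format concrete, here is a minimal NumPy sketch of quantizing a single 16-element block onto the FP4 (E2M1) value grid. It is a simulation under stated assumptions, not the hardware path: the block scale is left unrounded (real NVFP4 stores it in E4M3), ties are not resolved to even, and the names `E2M1_GRID` and `quantize_block` are hypothetical.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1 (the sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0
BLOCK_SIZE = 16  # NVFP4 block size (MXFP4 uses 32)


def quantize_block(block):
    """Fake-quantize one block: scale so amax maps to FP4_MAX, snap to the grid, rescale."""
    block = np.asarray(block, dtype=np.float64)
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return block.copy()
    scale = amax / FP4_MAX  # decode scale; assumption: kept unrounded for clarity
    scaled = np.abs(block) / scale
    # Nearest grid point per element (exact ties resolve to the lower grid index).
    idx = np.argmin(np.abs(scaled[:, None] - E2M1_GRID[None, :]), axis=1)
    return np.sign(block) * E2M1_GRID[idx] * scale


rng = np.random.default_rng(0)
block = rng.normal(size=BLOCK_SIZE)
deq = quantize_block(block)

amax_pos = np.argmax(np.abs(block))
print("amax preserved:", np.isclose(deq[amax_pos], block[amax_pos]))
print("fraction of the block that is the amax:", 1 / BLOCK_SIZE)  # 0.0625
```

The element holding the block amax lands exactly on the top grid point, which is where the 6.25% (one in 16) figure comes from; every other element is snapped to the coarse 8-value grid.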
Step 2: Identify Which Layers to Quantize
Only the GEMMs inside linear (fully connected) layers should run in NVFP4: the forward pass (Fprop), the weight-gradient computation (Wgrad), and the activation-gradient computation (Dgrad). Embeddings, the output projection head, normalization layers, non-linearities, and all attention components (softmax, query-key and attention score-value batched GEMMs) must remain in BF16 or FP32. Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states stay in FP32. Tensor-parallel reductions use BF16. This selective quantization preserves precision where it matters most.
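The rules above can be written down as a small lookup, which is handy as a checklist when auditing a model definition. This is only an illustrative sketch: the layer-kind labels and the names `compute_precision` and `STORAGE_PRECISION` are hypothetical and do not correspond to any framework's module names.

```python
# Hypothetical layer-kind labels; real frameworks name their modules differently.
HIGH_PRECISION_KINDS = {
    "embedding",
    "output_head",
    "norm",
    "nonlinearity",
    "attention_softmax",
    "attention_qk_gemm",
    "attention_score_value_gemm",
}

# Formats for stored or communicated tensors, as listed in the step above.
STORAGE_PRECISION = {
    "model_weights": "fp32",
    "weight_grad_accumulation": "fp32",
    "optimizer_state": "fp32",
    "tensor_parallel_reduction": "bf16",
}


def compute_precision(layer_kind):
    """Return the compute format for a layer kind, following the rules in this step."""
    if layer_kind == "linear":
        # Applies to all three GEMMs of a linear layer: Fprop, Wgrad, and Dgrad.
        return "nvfp4"
    if layer_kind in HIGH_PRECISION_KINDS:
        return "bf16_or_fp32"
    raise ValueError(f"unknown layer kind: {layer_kind!r}")


for kind in ("linear", "embedding", "attention_softmax"):
    print(kind, "->", compute_precision(kind))
```

Raising on unknown kinds, rather than defaulting to NVFP4, keeps a newly added layer type from being quantized by accident.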
Step 3: Configure the Two-Level Scaling System
The NVFP4 methodology relies on a two-level scaling approach: per-block and per-tensor. For each linear layer, define 16-element blocks (instead of 32). Store block scale factors in E4M3 format. Then compute an FP32 per-tensor scale that remaps all values so the E4M3 block scales remain within range. During training, the per-tensor scale is updated periodically (e.g., every 100 steps) to adapt to shifting activation statistics. This prevents the block scales from saturating or underflowing.
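The two levels can be sketched in NumPy as follows. This is a simulation under assumptions, not NVIDIA's implementation: the per-tensor scale is chosen as `tensor_amax / (6 * 448)` so that the largest block scale lands at the E4M3 maximum, E4M3 rounding is emulated in software, and the names `round_to_e4m3` and `two_level_scales` are hypothetical.

```python
import numpy as np

FP4_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest finite E4M3 magnitude
BLOCK_SIZE = 16


def round_to_e4m3(x):
    """Round a 1-D array of non-negative values to the nearest E4M3 value (simulated)."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, E4M3_MAX)
    out = np.zeros_like(x)
    nz = x > 0
    # Exponent clamped to the E4M3 range; values below 2**-6 use the subnormal step.
    e = np.clip(np.floor(np.log2(x[nz])), -6, 8)
    step = 2.0 ** (e - 3)  # 3 mantissa bits give 8 steps per binade
    out[nz] = np.minimum(np.round(x[nz] / step) * step, E4M3_MAX)
    return out


def two_level_scales(tensor):
    """Return (FP32 per-tensor scale, E4M3-rounded block scales) for a flat tensor."""
    blocks = np.asarray(tensor, dtype=np.float64).reshape(-1, BLOCK_SIZE)
    tensor_amax = np.max(np.abs(blocks))
    if tensor_amax == 0.0:
        return np.float32(1.0), np.zeros(blocks.shape[0])
    # Assumption: pick the per-tensor scale so the largest block scale equals E4M3_MAX.
    per_tensor = np.float32(tensor_amax / (FP4_MAX * E4M3_MAX))
    block_amax = np.max(np.abs(blocks), axis=1)
    # Ideal block scale maps each block's amax to FP4_MAX; store it relative to per_tensor.
    stored = round_to_e4m3(block_amax / FP4_MAX / per_tensor)
    return per_tensor, stored


rng = np.random.default_rng(0)
# Blocks with very different magnitudes, to exercise the range of the block scales.
magnitudes = np.logspace(-3, 2, 64)[:, None]
tensor = (rng.normal(size=(64, BLOCK_SIZE)) * magnitudes).ravel()

per_tensor, stored = two_level_scales(tensor)
print("per-tensor scale:", per_tensor)
print("stored block scales: min", stored.min(), "max", stored.max())
```

The effective decode scale of a block is `stored * per_tensor`. Without the per-tensor factor, block scales for tensors with very large or very small values would fall outside what E4M3 can represent.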
Step 4: Integrate with Training Loop
Implement the quantization within the training loop using NVIDIA's Transformer Engine. For each linear layer, quantize the GEMM inputs to NVFP4 immediately before the GEMM. Use round-to-nearest-even for weights and activations; the NVIDIA paper applies stochastic rounding to gradient tensors to reduce rounding bias. The per-tensor scale is computed once per layer per step (or less frequently) from the full-precision tensor's maximum absolute value. As described in Step 2, the backward GEMMs (Wgrad and Dgrad) also run in NVFP4, not only the forward GEMM; what stays in FP32 is the weight-gradient accumulation across microbatches and the optimizer state. Transformer Engine handles these conversions when configured correctly.
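Before wiring up the real library, it can help to see what "quantize, then GEMM" does numerically. The sketch below fake-quantizes both operands of a forward GEMM in 1×16 blocks along the reduction dimension and measures the resulting error. It is a CPU simulation only and does not use Transformer Engine: block scales are left unrounded, rounding is plain nearest-value, and `fake_quant_rows` and `linear_fprop` are hypothetical names.

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 16


def fake_quant_rows(mat):
    """Quantize-dequantize a 2-D matrix in 1x16 blocks along its last axis."""
    rows, cols = mat.shape
    assert cols % BLOCK == 0, "last dimension must be a multiple of the block size"
    blocks = mat.reshape(rows, cols // BLOCK, BLOCK)
    amax = np.max(np.abs(blocks), axis=-1, keepdims=True)
    # Assumption: block scale kept in full precision instead of E4M3.
    scale = np.where(amax > 0, amax / GRID[-1], 1.0)
    scaled = np.abs(blocks) / scale
    idx = np.argmin(np.abs(scaled[..., None] - GRID), axis=-1)
    return (np.sign(blocks) * GRID[idx] * scale).reshape(rows, cols)


def linear_fprop(x, w):
    """Forward GEMM y = x @ w.T with both operands fake-quantized first."""
    return fake_quant_rows(x) @ fake_quant_rows(w).T


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))   # activations: (tokens, in_features)
w = rng.normal(size=(32, 64))  # weights: (out_features, in_features)

y_ref = x @ w.T
y_q = linear_fprop(x, w)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative GEMM error from fake quantization: {rel_err:.3f}")
```

Both operands are blocked along `in_features`, the dimension the GEMM reduces over, so each block scale applies to values that are multiplied and summed together.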

Step 5: Validate with a Small Model First
Before scaling to 12B parameters and 10T tokens, test your setup on a smaller model (e.g., 1B parameters) with a few billion tokens. Monitor the loss curves and compare against an FP8 baseline. Divergence early in training indicates incorrect scaling or block size settings. Adjust the per-tensor scale update frequency or re-check the block scale format. Successful validation on a smaller scale builds confidence for the full run.
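A simple way to automate the loss-curve comparison is to track the relative gap between the NVFP4 run and the FP8 baseline and flag sustained excursions. The sketch below is one possible check; the 5% threshold, the `patience` of three evaluation points, the function names, and the sample loss values are all hypothetical and should be tuned to your own runs.

```python
def relative_loss_gap(nvfp4_losses, baseline_losses):
    """Per-point relative gap (nvfp4 - baseline) / baseline."""
    if len(nvfp4_losses) != len(baseline_losses):
        raise ValueError("loss curves must be sampled at the same steps")
    return [(q - b) / b for q, b in zip(nvfp4_losses, baseline_losses)]


def first_divergence_index(nvfp4_losses, baseline_losses, threshold=0.05, patience=3):
    """Index where the gap first exceeds `threshold` for `patience` points in a row, else None."""
    run = 0
    for i, gap in enumerate(relative_loss_gap(nvfp4_losses, baseline_losses)):
        run = run + 1 if gap > threshold else 0
        if run >= patience:
            return i - patience + 1
    return None


# Made-up loss values, for illustration only.
baseline = [4.00, 3.50, 3.00, 2.80, 2.70, 2.60]
healthy = [4.02, 3.53, 3.03, 2.83, 2.73, 2.63]
diverging = [4.02, 3.53, 3.30, 3.40, 3.60, 4.00]

print("healthy run:", first_divergence_index(healthy, baseline))      # None
print("diverging run:", first_divergence_index(diverging, baseline))  # 2
```

Requiring several consecutive points above the threshold avoids flagging a single noisy evaluation as divergence.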
Step 6: Run the Full-Scale Pretraining
With a validated pipeline, launch the 12B-parameter hybrid Mamba-Transformer model on 10 trillion tokens using multiple Blackwell GPUs (GB200 or GB300). Use data parallelism and tensor parallelism as needed, ensuring reductions use BF16. Keep optimizer states in FP32. Peak GEMM throughput is ~2× FP8 on GB200 and ~3× on GB300; end-to-end gains will be lower because attention and the other unquantized components still run in higher precision. Monitor MMLU-Pro scores at checkpoints; the target for the final model is ~62.6% (5-shot) to match the FP8 baseline.
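To set expectations for the whole training step rather than for the GEMMs alone, an Amdahl's-law estimate is a reasonable first approximation. In the sketch below, `gemm_fraction` is the share of FP8 step time spent in the linear-layer GEMMs that move to NVFP4; the fractions shown are hypothetical placeholders, not measurements.

```python
def end_to_end_speedup(gemm_fraction, gemm_speedup):
    """Amdahl's-law estimate: only the quantized-GEMM share of step time gets faster."""
    if not 0.0 <= gemm_fraction <= 1.0:
        raise ValueError("gemm_fraction must be between 0 and 1")
    if gemm_speedup <= 0:
        raise ValueError("gemm_speedup must be positive")
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)


# Peak FP4-over-FP8 GEMM ratios quoted in this guide; the fractions are made up.
for name, peak in (("GB200", 2.0), ("GB300", 3.0)):
    for fraction in (0.5, 0.7, 0.9):
        estimate = end_to_end_speedup(fraction, peak)
        print(f"{name}: GEMM share {fraction:.0%} -> about {estimate:.2f}x end to end")
```

The estimate ignores quantization overhead and communication, so treat it as an upper bound and confirm with profiling.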
Tips and Best Practices
- Start with conservative scaling: Use the default 1×16 block scaling everywhere. If the loss diverges early, change how often the per-tensor scale is updated, trying values between every 50 and every 200 steps.
- Watch for saturation: If block scales often hit the E4M3 maximum, update the per-tensor scale more frequently or reduce the learning rate.
- Profile throughput: Measure actual speedup against FP8; theoretical gains may not appear due to memory bandwidth bottlenecks in attention components.
- Use Transformer Engine's built-in support: NVIDIA's library handles NVFP4 conversion and backprop, reducing implementation errors.
- Document hyperparameters: Record per-tensor scale update frequency, block size, and rounding mode to reproduce results or debug.
- Consider hybrid architectures: The Mamba-Transformer mix helps distribute quantization noise—pure transformers may need additional tuning.
- Scale up gradually: For models larger than 12B, test with 1T tokens first to verify loss trends before committing to 10T.
- Reference the paper: See the full arXiv paper for ablation studies and alternative scaling configurations.
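The saturation tip can be checked offline from logged per-block amax values. The sketch below counts how many blocks would need a scale above the E4M3 maximum when the per-tensor scale is stale versus freshly computed. It assumes the per-tensor scale is `amax / (6 * 448)`; the function names, the synthetic data, and the 4× growth factor are hypothetical.

```python
import numpy as np

FP4_MAX = 6.0
E4M3_MAX = 448.0
REL_TOL = 1e-9  # guards against floating-point noise right at the boundary


def fresh_per_tensor_scale(block_amax):
    """Per-tensor scale that places the largest block scale at E4M3_MAX (assumed formula)."""
    return float(np.max(block_amax)) / (FP4_MAX * E4M3_MAX)


def saturation_fraction(block_amax, per_tensor_scale):
    """Fraction of blocks whose ideal scale exceeds E4M3_MAX before clamping."""
    ideal = np.asarray(block_amax, dtype=np.float64) / FP4_MAX / per_tensor_scale
    return float(np.mean(ideal > E4M3_MAX * (1.0 + REL_TOL)))


rng = np.random.default_rng(0)
amax_then = np.abs(rng.normal(size=1024)) + 0.1  # block amax when the scale was last set
amax_now = amax_then * 4.0                       # values have grown 4x since then

stale = fresh_per_tensor_scale(amax_then)
fresh = fresh_per_tensor_scale(amax_now)
print("saturation with stale scale:", saturation_fraction(amax_now, stale))
print("saturation with fresh scale:", saturation_fraction(amax_now, fresh))
```

A rising saturation fraction between updates is a signal that the per-tensor scale is being refreshed too rarely for how quickly the tensor statistics are drifting.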
By following these steps, you can replicate NVIDIA's breakthrough pretraining in 4-bit precision, achieving near-FP8 accuracy with significant speed and memory savings on Blackwell hardware.