vuiseng9/mlperf-t5.1-rundown

Quick Rundown of MLPerf v5.1 Training on the New Llama3.1-8B, Flux.1 Models — 2025/11/13

A few days ago, MLCommons released the MLPerf Training v5.1 results, with a record level of participation.

I combed through the data for my own use and figured I might as well share it. Docker instructions are included to reproduce the runs locally, intentionally bypassing the need for a SLURM system and terabytes of downloads.

MLPerf covers many benchmarks, but this rundown focuses only on the new ones, Llama3.1-8B (and 405B) and Flux.1, and only on the GPU submissions, as these represent the hardware most widely used today.

Links:

Maybe useful for you: local single-node 8xGPU runs.


Time-to-train per GPU model (8x GPU)

time-to-train (mins)   GPU            Organization        Public ID
122.929                MI350X (fp8)   AMD                 5.1-0017
 99.709                MI355X (fp8)   AMD                 5.1-0018
 84.379                B200 (fp4)     Nvidia/Supermicro   5.1-0081
 79.325                GB200 (fp4)    Nvidia              5.1-0067
 75.841                B300 (fp4)     Nvidia/Nebius       5.1-0008
 67.373                GB300 (fp4)    Nvidia              5.1-0058

Datasheet: B300, B200, MI355X, MI350X.
  • We take only the fastest submission per GPU type, and only 8xGPU submissions. Most Nvidia GPU submissions are evaluated in FP4, whereas AMD GPU submissions are conducted in FP8.

  • ** in the figure is our estimate on B200 fp8 based on the provided FP8 recipe included in the submission. The intent is to approximate the performance uplift achievable when transitioning from FP8 to FP4. We measure elapsed time per training and validation step and assume the same number of steps to convergence as FP4. This estimate is slightly optimistic, as it accounts only for the training and evaluation loops and excludes miscellaneous overheads, which can contribute up to an additional ~5% based on logs from other submissions.
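The estimation method above boils down to simple arithmetic. A minimal sketch follows; the per-step timings and step counts below are hypothetical placeholders for illustration, not values from the submission logs:

```python
# Sketch of the FP8 time-to-train estimate: measure elapsed time per train
# and eval step, assume the same number of steps to convergence as FP4,
# then add a margin for miscellaneous overheads (up to ~5% per logs from
# other submissions). All inputs below are hypothetical placeholders.

def estimate_time_to_train(train_step_s, eval_step_s,
                           train_steps, eval_steps,
                           misc_overhead=0.05):
    """Return estimated time-to-train in minutes."""
    loop_s = train_step_s * train_steps + eval_step_s * eval_steps
    return loop_s * (1.0 + misc_overhead) / 60.0

# Hypothetical per-step timings (seconds), not measured values:
fp8_mins = estimate_time_to_train(train_step_s=4.6, eval_step_s=1.2,
                                  train_steps=1344, eval_steps=112)
print(f"estimated FP8 time-to-train: {fp8_mins:.1f} min")
```

Dividing such an FP8 estimate by the measured FP4 time-to-train yields the speedup figure discussed next.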

  • For an 8×B200 setup training Llama-3.1-8B under these hyperparameters, our estimate leads to a 1.22× speedup when moving from FP8 to FP4. This aligns with our early local benchmarking results observed when the NVFP4 recipe was first released in Transformer Engine. While the two benchmarks differ in hyperparameters, I generally keep in mind that the expected gain is around 20%, though it ultimately depends on the specific hyperparameters.

  • MI355X edges out B200 in the fp8 comparison. While this aligns with the on-paper HW specs, the comparison is inherently difficult given config differences (batch size, attention implementation, ...). E.g., AMD uses batch size 32 vs 16 on the Nvidia side, meaning AMD runs half the number of gradient updates while Nvidia has a smaller load per forward/backward pass. It is not clear which has the advantage here, as software implementation/optimization could differ a great deal too. Also, in practice we do care about the RDMA interconnects for scale-out training, which these numbers do not reflect.
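The "half the gradient updates" point is back-of-envelope arithmetic, assuming the 16/32 figures are per-GPU batch sizes on an 8-GPU node (if they are global batch sizes instead, the 2× ratio is unchanged) and using the ~172,032-sample convergence point from the training specifics later in this rundown:

```python
# Global batch size determines how many gradient updates each submission
# performs to consume the same ~172,032 samples to convergence.
# Assumption: 16 (Nvidia) and 32 (AMD) are per-GPU batch sizes, 8 GPUs.

SAMPLES_TO_CONVERGE = 172_032

def gradient_updates(per_gpu_batch, num_gpus=8):
    global_batch = per_gpu_batch * num_gpus
    return SAMPLES_TO_CONVERGE // global_batch

amd_updates = gradient_updates(32)     # global batch 256
nvidia_updates = gradient_updates(16)  # global batch 128
print(amd_updates, nvidia_updates)     # prints: 672 1344
```

AMD performs half as many optimizer steps, while each Nvidia step processes half as many samples per forward/backward pass.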

  • How does (G)B300 (Blackwell Ultra) get so fast?

    • Industry's first FP4 recipe using NVFP4 precision (with last few iterations in FP8).
    • 1.5× Tensor Core uplift over (G)B200
    • 2× attention speed via HW-accelerated Softmax
    • FP8 BMM in Attention, previously in BF16
  • GB systems being faster than B systems is likely due to the NVLink-connected Grace CPU-GPU.

  • AMD's optimizations: GEMM tile sizing, BF16 FlashAttention v3, DataLoader tuning (validation from 15 mins to 3 mins). See the blog for LoRA optimization.

  • Optional Contexts:

    • Training specifics: 12288 train and 1024 eval samples per train-eval loop. Convergence typically takes 172,032 samples. Sequence length 8192; batch size 16 (Nvidia) or 32 (AMD).
    • Advanced features from the NVFP4 paper, such as stochastic rounding and rotation-based 2D quantization, are disabled, which most likely means it behaves just like the FP8 recipe except that NVFP4 is used.
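The loop structure above works out to a fixed number of train-eval loops at the typical convergence point:

```python
# Train-eval loop arithmetic from the training specifics above.
TRAIN_PER_LOOP = 12_288        # training samples per train-eval loop
EVAL_PER_LOOP = 1_024          # eval samples per loop
TYPICAL_CONVERGENCE = 172_032  # samples typically consumed to converge

loops = TYPICAL_CONVERGENCE // TRAIN_PER_LOOP
print(loops)  # prints: 14 -- convergence lands exactly on a loop boundary
```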

Almost Log-linear Cluster Scaling

Train an 8B or 405B LLM and a Flux.1 in one bio break, if you have thousands of GPUs. Okay, hyped, but not too far off 😆

[Scaling plots: 8B | 405B | Flux.1 (11.9B)]

Results from more organizations are available; we only pick those covering a large range of GPU counts.

  • MLCommons reference pretraining for FLUX.1 uses the TorchTitan framework, while most submissions rely on NeMo.

  • The model is a customized subclass of MegatronFluxModel, trained in MXFP8 using Transformer Engine.

  • Scaling is handled via Megatron DP with distributed optimizer (ZeRO-1), as defined in flux1_schnell.yaml.

  • We sampled multiple operating points from the plot and inspected the corresponding logs. We confirm that the global batch size varies across scales with no gradient accumulation, and that learning rates are adjusted accordingly, as you would expect.

  • Our local reproduction is based on the University of Florida submission and adapted to run on a single 8×B200 node. This setup is intended for implementation understanding rather than benchmarking. Modifications include using a small CC12M subset for faster iteration (while retaining the full COCO validation set) and disabling the IB interface. To see the changes: code --diff flux_training_scale.sh flux_training_local.sh
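A minimal sketch of the batch-size-dependent learning-rate adjustment noted above, assuming the common linear scaling rule; the base values below are hypothetical, and the actual schedules live in the submissions' configs:

```python
# Linear LR scaling sketch: when global batch size grows with GPU count
# (no gradient accumulation), the learning rate is scaled proportionally.
# base_lr and base_global_batch are hypothetical placeholders.

def scaled_lr(base_lr, base_global_batch, global_batch):
    """Scale learning rate linearly with global batch size."""
    return base_lr * global_batch / base_global_batch

# Doubling GPU count with a fixed per-GPU batch doubles the global batch:
lrs = [scaled_lr(1e-4, 256, gb) for gb in (256, 512, 1024)]
print(lrs)  # prints: [0.0001, 0.0002, 0.0004]
```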
