==== Token counts ====
* Llama-3.1 405B: ~15T tokens.<ref>{{cite web|title=Meta AI|url=https://ai.meta.com/blog/meta-llama-3-1/|publisher=Meta AI|access-date=2025-11-16}}</ref>
* Zebra-Llama / HybridLM: 7–11B tokens for 1–8B hybrids.<ref>{{cite web|title=arXiv|url=https://arxiv.org/html/2505.17272v1|publisher=arxiv.org|access-date=2025-11-16}}</ref>

Token ratio:

<math>\frac{15 \times 10^{12}}{10 \times 10^{9}} \approx 1500\times</math>

So just on tokens, the full 400B pre-train is ~1.5–2.0k× heavier than a single 8B hybrid post-train.

===== Concrete known numbers =====
* Llama-3.1 405B: 39.3M H100 GPU-hours.<ref>{{cite web|title=Hugging Face|url=https://huggingface.co/meta-llama/Llama-3.1-405B|publisher=Hugging Face|access-date=2025-11-16}}</ref>

For hybrids, we have actual MI300 data:
* 1B hybrid, 7B tokens: 136 MI300 GPU-hours.<ref>{{cite web|title=ROCm Blog|url=https://rocm.blogs.amd.com/artificial-intelligence/hybrid-models%2C-mla%2C/README.html|publisher=ROCm Blog|access-date=2025-11-16}}</ref>

Roughly scaling that up to an 8B hybrid on similar hardware:
* 8× parameters → ~8× compute.
* 7B → 11B tokens → ~1.6×.

So the ballpark, computed in the sketch below, is:

<math>136 \times 8 \times 1.6 \approx 1{,}700 \text{ GPU-hours}</math>

Call it ~1–3k GPU-hours for an 8B hybrid.

GPU-hour ratio (H100 vs. MI-class is apples to oranges, but the order of magnitude still holds):

<math>\frac{39.3\text{ M}}{1.7\text{ k}} \approx 23{,}000\times</math>

So one 8B Zebra-Llama-style hybrid is ~10⁴–10⁵× cheaper in GPU-hours than a 400B pre-train. Even if Rubin VR-200 is, say, 3–4× faster than an H100 for training, you are still looking at several million VR-200 GPU-hours for the 400B run versus thousands of MI450X GPU-hours for a hybrid.
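The back-of-envelope arithmetic above can be reproduced with a short script. This is a minimal sketch using the cited figures; the 8× parameter scaling and ~1.6× token scaling are the rough assumptions stated above, not measured values.

<syntaxhighlight lang="python">
# Ballpark comparison: full 405B pre-train vs. one 8B hybrid post-train.
# Cited figures from the sources above; scaling factors are assumptions.

pretrain_tokens = 15e12   # Llama-3.1 405B: ~15T tokens
hybrid_tokens = 10e9      # hybrid post-train: 7-11B tokens (midpoint ~10B)

token_ratio = pretrain_tokens / hybrid_tokens
print(f"Token ratio: ~{token_ratio:,.0f}x")  # ~1,500x

# GPU-hours: scale the measured 1B-hybrid MI300 run up to 8B.
base_gpu_hours = 136      # 1B hybrid, 7B tokens, MI300 (measured)
param_scale = 8           # 1B -> 8B parameters, assumed ~linear in compute
token_scale = 11 / 7      # 7B -> 11B tokens, ~1.6x

hybrid_gpu_hours = base_gpu_hours * param_scale * token_scale
print(f"8B hybrid: ~{hybrid_gpu_hours:,.0f} GPU-hours")  # ~1,700

pretrain_gpu_hours = 39.3e6  # Llama-3.1 405B: 39.3M H100 GPU-hours
gpu_hour_ratio = pretrain_gpu_hours / hybrid_gpu_hours
print(f"GPU-hour ratio: ~{gpu_hour_ratio:,.0f}x")  # ~23,000x
</syntaxhighlight>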