==== 4. Step-by-step scenarios ====

===== Scenario A: You have a single 4090 and want a 70B model =====
# Download a GGUF Q4_K_M quantised 70B.
# Install llama.cpp.
# Run with a small number of GPU layers offloaded:
<syntaxhighlight lang="bash">./llama-cli -m ./Llama-3.1-70B-Q4_K_M.gguf --n-gpu-layers 40 --ctx-size 4096
</syntaxhighlight>
Expect ~4–10 tokens/s, depending on quantisation (a rough VRAM budget for the layer count follows the scenarios).

===== Scenario B: You have 2×80 GB H100s and want 70B at full precision =====
# Install vLLM.
# Pull the HF weights in bf16.
# Run distributed:
<syntaxhighlight lang="bash">python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2
</syntaxhighlight>
Expect 150–250 tokens/s (a quick way to verify the server is sketched after the scenarios).

===== Scenario C: You only have a CPU but want to experiment with 70B =====
# Download a Q2_K or Q3_K GGUF.
# Run llama.cpp CPU-only:
<syntaxhighlight lang="bash">./llama-cli -m Llama-3.1-70B.Q2_K.gguf -t 14
</syntaxhighlight>
Expect ~1–2 tokens/s (see the thread-count note after the scenarios).
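Why 40 layers in Scenario A? A back-of-the-envelope VRAM budget explains the choice. The figures below are illustrative assumptions rather than measurements: a Q4_K_M 70B GGUF is roughly 42 GB, Llama-3.1-70B has 80 transformer layers, and an RTX 4090 has 24 GB of VRAM.

<syntaxhighlight lang="bash"># Rough VRAM budget for Scenario A (illustrative figures, not measured).
FILE_GB=42    # approximate size of a Q4_K_M 70B GGUF; check your actual file
LAYERS=80     # transformer layers in Llama-3.1-70B
OFFLOAD=40    # the --n-gpu-layers value used above
# Per-layer cost ~ file size / layer count; multiply by the offloaded layers.
echo "scale=2; $FILE_GB * $OFFLOAD / $LAYERS" | bc   # prints 21.00 (GB on the GPU)
</syntaxhighlight>

That leaves roughly 3 GB of the 24 GB card for the KV cache and CUDA overhead. If the run OOMs, lower <code>--n-gpu-layers</code>; if <code>nvidia-smi</code> shows headroom, raise it.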
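For Scenario B, the vLLM process above serves an OpenAI-compatible HTTP API, by default on port 8000, so you can sanity-check the deployment with curl. A minimal sketch; the prompt and token limit are arbitrary:

<syntaxhighlight lang="bash">curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": "Say hello in one short sentence.",
        "max_tokens": 32
      }'
</syntaxhighlight>

A JSON response with a <code>choices</code> array indicates the server is up and the model loaded across both GPUs.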
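In Scenario C, <code>-t 14</code> is only an example. llama.cpp is typically fastest with the thread count matched to the number of physical cores; hyperthreads rarely help this workload. A sketch for picking the value on Linux, assuming <code>lscpu</code> is available:

<syntaxhighlight lang="bash"># Count physical cores (unique core/socket pairs), not logical CPUs.
PHYS=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
./llama-cli -m Llama-3.1-70B.Q2_K.gguf -t "$PHYS"
</syntaxhighlight>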