Editing Openai/691c1dba-9228-800f-8463-13b3a9006306 (section)

==== - You can’t easily combine 8GB + 8GB VRAM for a single large model. Model parallelism exists (DeepSpeed ZeRO, Megatron-LM) but is complex, fragile, and often requires NVLink or enterprise hardware. ====
* Practical rule: Use one GPU for inference at a time. Use the other GPU as a worker for another process (embedding calc, caching, or a second smaller model).
* If you want to run larger models, you can: - Quantize aggressively (GPTQ/Q4_0/Q4_K) to fit 13B into ~8–10GB VRAM. - Use ggml CPU mode (slower) and let your NVMe paging help. - Rent a single A100/4090 hour from a spot provider when needed.