
=== Assistant ===
Nice — in that setup, you’ll usually want to treat llama.cpp as a fleet of small, single-purpose services and put a thin Python control plane in front of them.

Here’s a pattern that works well.

==== One server process per allocation unit ====

An allocation unit is typically:
* one model on one GPU (most common), or
* one model sharded across multiple GPUs (for big models), still as one server process.

This keeps memory/KV-cache ownership simple and avoids cross-talk between workloads.

You can run either:
* llama.cpp <code>llama-server</code> processes, or
* <code>llama-cpp-python</code> server processes

Both expose OpenAI-compatible endpoints; the key is to run a separate process per model/GPU.

===== Use environment + flags so each instance only sees the GPU(s) it should use. =====

Common approaches:
* CUDA_VISIBLE_DEVICES=… per process (simplest, very reliable)
* llama.cpp GPU selection/sharding flags (exact names vary by build/version)

Example idea:
* model A on GPU 0 → port 8001
* model B on GPU 1 → port 8002
* model C sharded across GPUs 2,3 → port 8003 (with an appropriate split flag, if you’re doing tensor split)
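
The layout above can be turned into launch parameters with a small helper. This is a minimal sketch, not a definitive launcher: <code>launch_spec</code>, <code>start</code>, and the model paths are hypothetical names, and only <code>-m</code>, <code>--port</code>, and <code>CUDA_VISIBLE_DEVICES</code> are taken from this answer. Any split flag for the sharded instance is left out on purpose, since its exact name depends on your build.

```python
import os
import subprocess


def launch_spec(model_path, port, gpus, binary="./llama-server"):
    """Return (cmd, env) for one instance pinned to the given GPU indices."""
    if not gpus:
        raise ValueError("at least one GPU index is required")
    cmd = [binary, "-m", model_path, "--port", str(port)]
    env = dict(os.environ)
    # The process only sees these devices; inside it they are renumbered from 0.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return cmd, env


def start(model_path, port, gpus):
    """Spawn the server; the caller owns the returned Popen handle."""
    cmd, env = launch_spec(model_path, port, gpus)
    return subprocess.Popen(cmd, env=env)


# Layout from the example: A on GPU 0, B on GPU 1, C across GPUs 2 and 3.
# The paths are placeholders.
LAYOUT = [
    ("/models/a.gguf", 8001, [0]),
    ("/models/b.gguf", 8002, [1]),
    ("/models/c.gguf", 8003, [2, 3]),
]
```

Keeping the spec builder separate from <code>start</code> means you can log or unit-test the exact command and environment without spawning anything.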

===== Make one FastAPI (or similar) service that: =====
* Accepts OpenAI-compatible requests (/v1/chat/completions, /v1/embeddings, /v1/models)
* Routes by model name (and optionally by purpose or max_tokens, ctx, etc.)
* Proxies streaming responses (SSE) back to the client
* Handles retries, timeouts, and health checks

Clients talk to one base URL; your router decides which llama.cpp instance handles each request.

==== What the Python control plane manages ====
* Start/stop/restart instances (via subprocess, systemd, Docker, or Nomad/K8s)
* Assign ports, log paths, and GPU pinning
* Enforce per-instance limits (max concurrent requests, queue depth)
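
The per-instance limits in the last bullet can be enforced in the router itself. The sketch below is one possible approach using only <code>asyncio</code>; <code>InstanceLimiter</code> and its parameters are hypothetical names, and it assumes the limiter is created inside the running event loop.

```python
import asyncio


class InstanceLimiter:
    """Caps concurrent and waiting requests for one backend instance."""

    def __init__(self, max_concurrent, max_queue):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queue = max_queue
        self._waiting = 0

    async def run(self, make_call):
        """Run make_call() under the limits; raise if the queue is full."""
        if self._slots.locked() and self._waiting >= self._max_queue:
            raise RuntimeError("backend queue full")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            return await make_call()
        finally:
            self._slots.release()


async def demo():
    """One slot, one queue place: the third concurrent request is rejected."""
    limiter = InstanceLimiter(max_concurrent=1, max_queue=1)
    gate = asyncio.Event()

    async def slow():
        await gate.wait()
        return "done"

    first = asyncio.create_task(limiter.run(slow))
    await asyncio.sleep(0)  # let the first request take the slot
    second = asyncio.create_task(limiter.run(slow))
    await asyncio.sleep(0)  # the second request is now queued
    try:
        await limiter.run(slow)
        rejected = False
    except RuntimeError:
        rejected = True
    gate.set()
    return rejected, await first, await second
```

In a real router you would probably map the rejection to an HTTP 429 or 503 so clients can back off.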

===== Health checks and scheduling =====
* Periodic health checks (e.g., call /v1/models or a tiny completion)
* Track “busy” signals:
** in-flight requests
** recent latency
** (optional) parse server logs for queue/backpressure messages
* Simple scheduler policies:
** static: model→instance mapping
** pooled: multiple replicas per model, pick least-busy
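
A health check along these lines could be as small as the following sketch. It uses only the standard library and probes /v1/models; <code>probe</code>, <code>Health</code>, and the failure threshold are hypothetical choices for illustration, not anything llama.cpp prescribes.

```python
import json
import time
import urllib.error
import urllib.request


def probe(base_url, timeout=2.0):
    """Return (healthy, latency_seconds) from a GET on /v1/models."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            healthy = resp.status == 200
            json.load(resp)  # a malformed body also counts as unhealthy
    except (urllib.error.URLError, OSError, ValueError):
        healthy = False
    return healthy, time.monotonic() - started


class Health:
    """Marks a backend down only after several consecutive failed probes."""

    def __init__(self, fail_threshold=3):
        self.fail_threshold = fail_threshold
        self.failures = 0

    def record(self, healthy):
        """Feed one probe result; return whether the backend counts as up."""
        self.failures = 0 if healthy else self.failures + 1
        return self.is_up

    @property
    def is_up(self):
        return self.failures < self.fail_threshold
```

Requiring a few consecutive failures avoids pulling an instance out of rotation because of one slow probe during a long generation. The measured latency can double as one of the "busy" signals.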

===== Observability =====
* Central structured logs
* Metrics endpoint (Prometheus style if you want)
* Request tracing (request id propagated to backends)
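
Request tracing and structured logs fit together naturally: the router picks one id per request, forwards it, and writes it into every log record. The sketch below assumes an <code>X-Request-ID</code> header, which is a common convention rather than something the backends are known to honor; <code>outbound_headers</code> and <code>log_line</code> are hypothetical helpers.

```python
import json
import uuid


def outbound_headers(inbound_headers):
    """Reuse the caller's X-Request-ID if present, else mint one, and forward it."""
    lowered = {k.lower(): v for k, v in inbound_headers.items()}
    request_id = lowered.get("x-request-id") or uuid.uuid4().hex
    headers = {"X-Request-ID": request_id, "Content-Type": "application/json"}
    return request_id, headers


def log_line(request_id, instance, model, status, latency_s):
    """Build one JSON log record per proxied request."""
    record = {
        "request_id": request_id,
        "instance": instance,
        "model": model,
        "status": status,
        "latency_ms": round(latency_s * 1000, 1),
    }
    return json.dumps(record, sort_keys=True)
```

Even if a backend ignores the header, having the same id in the router's log and in the response to the client is usually enough to correlate a slow request with the instance that served it.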

==== Example instance config and routing ====

<syntaxhighlight lang="yaml">instances:
  - name: mistral_7b_gpu0
    model: mistral-7b-instruct
    cmd: ["./llama-server", "-m", "/models/mistral.gguf", "--port", "8001"]
    env: { CUDA_VISIBLE_DEVICES: "0" }
  - name: qwen_14b_gpu1
    model: qwen-14b
    cmd: ["./llama-server", "-m", "/models/qwen14b.gguf", "--port", "8002"]
    env: { CUDA_VISIBLE_DEVICES: "1" }

routing:
  mistral-7b-instruct: ["mistral_7b_gpu0"]
  qwen-14b: ["qwen_14b_gpu1"]

</syntaxhighlight>

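A config like this is easy to get subtly wrong (a routing entry pointing at a missing instance, two instances on the same port), so it may be worth validating at startup. The sketch below works on the already-parsed config as a plain dict; <code>validate</code> is a hypothetical helper, and loading the YAML itself (for example with PyYAML) is left out.

```python
def validate(config):
    """Return a list of problems; an empty list means the config is consistent."""
    problems = []
    instances = config["instances"]
    by_name = {inst["name"]: inst for inst in instances}
    if len(by_name) != len(instances):
        problems.append("duplicate instance names")

    ports = []
    for inst in instances:
        cmd = inst["cmd"]
        if "--port" in cmd and cmd.index("--port") + 1 < len(cmd):
            ports.append(cmd[cmd.index("--port") + 1])
        else:
            problems.append(f"{inst['name']}: no --port in cmd")
    if len(ports) != len(set(ports)):
        problems.append("duplicate ports")

    for model, targets in config["routing"].items():
        for target in targets:
            if target not in by_name:
                problems.append(f"routing[{model}]: unknown instance {target}")
            elif by_name[target]["model"] != model:
                problems.append(f"routing[{model}]: {target} serves another model")
    return problems


# The same structure as the YAML example, written as a Python dict.
CONFIG = {
    "instances": [
        {"name": "mistral_7b_gpu0", "model": "mistral-7b-instruct",
         "cmd": ["./llama-server", "-m", "/models/mistral.gguf", "--port", "8001"],
         "env": {"CUDA_VISIBLE_DEVICES": "0"}},
        {"name": "qwen_14b_gpu1", "model": "qwen-14b",
         "cmd": ["./llama-server", "-m", "/models/qwen14b.gguf", "--port", "8002"],
         "env": {"CUDA_VISIBLE_DEVICES": "1"}},
    ],
    "routing": {
        "mistral-7b-instruct": ["mistral_7b_gpu0"],
        "qwen-14b": ["qwen_14b_gpu1"],
    },
}
```
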
===== Router logic =====
* POST /v1/chat/completions:
** read body["model"]
** pick backend from routing[model] (optionally least-busy)
** proxy request to http://backend_host:port/v1/chat/completions
** if stream=true, proxy SSE bytes directly

Use httpx.AsyncClient for proxying; FastAPI makes streaming pass-through pretty clean.
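
The backend-selection step can be sketched independently of the HTTP layer. The example below is standard-library only and covers just the "read model, pick least-busy instance, build upstream URL" part; the actual proxying with httpx and FastAPI is not shown. <code>Router</code>, <code>UnknownModel</code>, and the base URLs are hypothetical.

```python
from collections import Counter


class UnknownModel(Exception):
    """Raised when the request names a model with no routing entry."""


class Router:
    def __init__(self, routing, base_urls):
        self.routing = routing        # model name -> list of instance names
        self.base_urls = base_urls    # instance name -> "http://host:port"
        self.in_flight = Counter()    # instance name -> requests in progress

    def pick(self, body):
        """Choose the least-busy instance for body["model"] and reserve it."""
        model = body.get("model")
        candidates = self.routing.get(model)
        if not candidates:
            raise UnknownModel(model)
        # min() keeps the first candidate on ties, so order acts as priority.
        name = min(candidates, key=lambda n: self.in_flight[n])
        self.in_flight[name] += 1
        return name, f"{self.base_urls[name]}/v1/chat/completions"

    def done(self, name):
        """Call when the proxied request (including any stream) has finished."""
        self.in_flight[name] -= 1
```

For streaming requests, <code>done</code> should only be called once the last SSE chunk has been forwarded, otherwise a long generation looks idle to the scheduler.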

==== Practical tips ====
* Prefer one model per process. Multi-model in one process is almost never worth the complexity unless you’re doing something very custom.
* Be explicit about context size per instance; KV cache is the silent VRAM killer. If you’re mixing workloads, separate “long ctx” and “short ctx” instances even for the same model.
* If you use MIG / MPS / time-slicing, keep the policy outside llama.cpp (router-level) and don’t expect llama.cpp to “fair schedule” across tenants.
* Add a cheap warmup (1-token completion) after start so the first real request doesn’t eat the initialization penalty.
* If you need strict isolation, use containers; if you want max performance and simplest GPU pinning, systemd services with env vars are hard to beat.
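
To make the context-size point concrete, here is a rough back-of-the-envelope estimate of KV-cache size. It assumes an unquantized 16-bit cache and ignores compute buffers and weights; the layer and head counts are illustrative values, so read the real ones from your model's metadata.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence of n_ctx tokens."""
    # Two tensors (K and V) per layer, each holding n_kv_heads * head_dim
    # values per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# Illustrative shape: 32 layers, 8 KV heads, head dimension 128, f16 cache.
for n_ctx in (4096, 32768):
    size = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"n_ctx={n_ctx}: {size / GIB:.1f} GiB")
# prints:
# n_ctx=4096: 0.5 GiB
# n_ctx=32768: 4.0 GiB
```

With these numbers, the same model costs eight times more cache at 32k context than at 4k, which is why separate "long ctx" and "short ctx" instances can pay off.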

If you tell me:
* whether you’re using <code>llama-server</code> or <code>llama-cpp-python[server]</code>
* how many GPUs and whether any models are multi-GPU sharded
* whether you need streaming + embeddings + tool/function calling

…I can sketch a router skeleton (FastAPI + httpx) that’s already OpenAI-compatible and drop-in for your clients.