=== Assistant ===
Yes: the main Python tooling is <code>llama-cpp-python</code>, which provides Python bindings to llama.cpp plus a couple of “batteries included” ways to serve and call models.<ref name="gh-bindings">{{cite web|title=abetlen/llama-cpp-python|url=https://github.com/abetlen/llama-cpp-python|publisher=github.com|access-date=2025-12-16}}</ref>

==== What you get ====
* A high-level <code>Llama</code> class for local inference (text and chat-style prompting).<ref name="gh-bindings" />
* An OpenAI-like API surface (so it feels like <code>chat.completions</code>, etc.).<ref name="pypi">{{cite web|title=llama-cpp-python|url=https://pypi.org/project/llama-cpp-python/|publisher=pypi.org|access-date=2025-12-16}}</ref>
* Embeddings support.<ref name="pypi" />
* Low-level access to llama.cpp’s C API via <code>ctypes</code>, if you want finer control.<ref name="pypi" />
* Integrations: works with LangChain and LlamaIndex.<ref name="pypi" />
* Also advertised in the ecosystem/docs: function calling and a vision API (when the model/build supports it).<ref name="pypi" />

Tiny example (in-process):
<syntaxhighlight lang="python">
from llama_cpp import Llama

# Load a GGUF model with a 4096-token context window.
llm = Llama(model_path="model.gguf", n_ctx=4096)

# Plain completion call; the result is an OpenAI-style dict.
out = llm("Write a haiku about GPUs.")
print(out["choices"][0]["text"])
</syntaxhighlight>
(Exact knobs vary by your build/model.)

==== Out-of-process: the bindings’ OpenAI-compatible server ====
If you’d rather keep models out-of-process, llama-cpp-python can run an OpenAI API-compatible web server, so you can point any OpenAI client library at it.<ref>{{cite web|title=llama-cpp-python documentation|url=https://llama-cpp-python.readthedocs.io/|publisher=llama-cpp-python.readthedocs.io|access-date=2025-12-16}}</ref> Run it:<ref>{{cite web|title=llama-cpp-python documentation|url=https://llama-cpp-python.readthedocs.io/en/latest/server/|publisher=llama-cpp-python.readthedocs.io|access-date=2025-12-16}}</ref>
<syntaxhighlight lang="bash">
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model model.gguf
</syntaxhighlight>
Then, in Python, you can use OpenAI-style clients by setting the base URL to your local server (exact code depends on which OpenAI SDK version you use); see the client sketch near the end of this message.

==== llama.cpp’s own <code>llama-server</code> ====
Separately from the Python bindings, llama.cpp itself ships <code>llama-server</code>, a lightweight HTTP server exposing OpenAI-compatible endpoints such as <code>/v1/chat/completions</code>, with support for concurrent requests and parallel decoding.<ref>{{cite web|title=ggml-org/llama.cpp|url=https://github.com/ggml-org/llama.cpp|publisher=github.com|access-date=2025-12-16}}</ref> Run:
<syntaxhighlight lang="bash">
llama-server -m model.gguf --port 8080
</syntaxhighlight>
Then you can call it from Python with <code>requests</code>, or with an OpenAI-compatible client pointed at <code>http://localhost:8080/v1</code>. For example:
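A minimal <code>requests</code> sketch against that <code>llama-server</code> instance (the port matches the <code>--port 8080</code> flag above; the prompt and <code>max_tokens</code> value are arbitrary placeholders):
<syntaxhighlight lang="python">
import requests

# llama-server speaks the OpenAI chat-completions protocol; this hits
# the /v1/chat/completions endpoint on the port chosen above.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
</syntaxhighlight>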
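And a sketch of the same call through the official <code>openai</code> package (v1-style SDK), which also covers the <code>llama_cpp.server</code> route from earlier. It assumes that server’s default address of <code>http://localhost:8000</code>; swap in <code>http://localhost:8080/v1</code> to target <code>llama-server</code> instead:
<syntaxhighlight lang="python">
from openai import OpenAI

# Base URL points at the local llama_cpp.server instance (port 8000 is
# its default; adjust if you passed --port). The API key is required by
# the SDK but ignored by the local server, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="model.gguf",  # largely informational for a single-model server
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(resp.choices[0].message.content)
</syntaxhighlight>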
If you tell me whether you prefer in-process (bindings) or server-based (HTTP), and what GPUs/build you’re using (CUDA/Vulkan/CPU), I can suggest the cleanest setup + example code for your exact workflow.