==== Transport layer assumptions: llama-server's OpenAI-compatible HTTP API ====
This appendix defines what the worker's transport layer (transport.py) should assume about llama-server's OpenAI-compatible HTTP API, and how it should behave when the server is imperfect or evolving.

===== Endpoints =====
Readiness probe (locked in):
* GET /v1/models (no trailing slash). GET /v1/models/ may return 404 on some versions, so the worker must probe the exact path without a trailing slash.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16525|publisher=github.com|access-date=2025-12-16}}</ref>

Chat inference (primary endpoint):
* POST /v1/chat/completions<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp|publisher=github.com|access-date=2025-12-16}}</ref>

Notes:
* llama-server is described as an OpenAI-API-compatible HTTP server and documents the chat completion endpoint at /v1/chat/completions.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp|publisher=github.com|access-date=2025-12-16}}</ref>
* Other endpoints exist (e.g., embeddings/reranking in certain modes), but they are out of scope for this module unless explicit support is added later.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp|publisher=github.com|access-date=2025-12-16}}</ref>

===== Request payload =====
The worker should construct an OpenAI-style chat completion payload with (at minimum):
* messages: ordered list of {role, content, ...} including:
** the BIOS system message
** the caller system message
** user / assistant / tool messages from the ongoing request context
* stream: true (the module always streams internally)
* tools: a list of OpenAI function tool definitions, including:
** normal tools (round-trip)
** exit tools (one-way signals)

Forward compatibility requirement: the worker must pass through any caller-provided params fields unchanged unless they collide with fields the worker owns (e.g., messages, tools, stream). This keeps the module from needing edits when llama.cpp adds new knobs. (A sketch of this merge appears after the parsing rules below.)

===== Streaming (SSE) =====
When stream=true, the transport should expect Server-Sent Events (SSE) framing, where:
* The stream consists of event records separated by blank lines.
* The primary payload is carried in data: lines.
* The stream typically terminates with data: [DONE].<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16104|publisher=github.com|access-date=2025-12-16}}</ref>

Transport parsing rules (robust and tolerant):
# Treat any bytes received after the headers as "progress" (worker-level definition).
# Parse SSE incrementally:
#* tolerate partial lines and partial JSON frames
#* ignore keepalive/comment/empty lines that carry no data
# For each data: payload:
#* if it is [DONE], finish the stream cleanly
#* otherwise parse the JSON and yield structured events upward (e.g., "text delta", "tool call", "usage update")
# If the stream ends unexpectedly (socket close), treat it as an error unless a terminal condition was already observed.
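To make the parsing rules above concrete, here is a minimal sketch of an incremental SSE decoder in Python. It is illustrative only: the class name, the (field, payload) frame shape, and the assumption that each data:/error: line carries a complete JSON document are choices made here, not anything mandated by llama-server or by the rest of this appendix. It already tolerates the non-standard error: field described below.

<syntaxhighlight lang="python">
import json
from collections.abc import Iterator
from typing import Any

# Hypothetical frame type: (field, payload) pairs, where field is "data" or "error".
SSEFrame = tuple[str, Any]


class SSEDecoder:
    """Incremental decoder for llama-server's SSE stream (illustrative sketch).

    Feed raw byte chunks in arrival order; iterate the frames that become
    complete. Tolerates partial lines, comments/keepalives, and both the
    standard `data:` field and the non-standard `error:` field.
    """

    def __init__(self) -> None:
        self._buffer = b""
        self.done = False  # set once `data: [DONE]` has been seen

    def feed(self, chunk: bytes) -> Iterator[SSEFrame]:
        self._buffer += chunk
        # Only consume complete lines; a trailing partial line stays buffered.
        while b"\n" in self._buffer:
            raw_line, self._buffer = self._buffer.split(b"\n", 1)
            line = raw_line.rstrip(b"\r").decode("utf-8", errors="replace")

            if not line or line.startswith(":"):
                # Blank separator or SSE comment/keepalive: nothing to yield.
                continue

            field, _, value = line.partition(":")
            value = value.lstrip(" ")

            if field == "data" and value == "[DONE]":
                self.done = True
                continue

            if field in ("data", "error"):
                try:
                    yield (field, json.loads(value))
                except json.JSONDecodeError:
                    # Best-effort tolerance: skip a partial or garbled JSON
                    # frame rather than aborting the whole stream.
                    continue
</syntaxhighlight>

A higher layer would call decoder.feed() on each received chunk and translate the resulting frames into transport events; if the connection closes before decoder.done is set, the worker treats the stream as having ended unexpectedly.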
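Returning to the payload-construction requirement earlier in this appendix, the pass-through rule reduces to a small merge step. The function name and the exact ownership set below are illustrative assumptions, not part of the worker's actual API.

<syntaxhighlight lang="python">
from typing import Any

# Fields the worker owns; caller-provided params may not override these.
# (The precise ownership list is an assumption for illustration.)
WORKER_OWNED_FIELDS = {"messages", "tools", "stream"}


def build_chat_payload(
    messages: list[dict[str, Any]],
    tools: list[dict[str, Any]],
    caller_params: dict[str, Any],
) -> dict[str, Any]:
    """Build an OpenAI-style chat completion payload.

    Unknown caller-supplied fields (temperature, top_p, future llama.cpp
    knobs, ...) are passed through unchanged; collisions with worker-owned
    fields are dropped so the worker keeps control of the protocol shape.
    """
    payload: dict[str, Any] = {
        k: v for k, v in caller_params.items() if k not in WORKER_OWNED_FIELDS
    }
    payload["messages"] = messages   # BIOS + caller system + conversation
    payload["tools"] = tools         # normal tools + exit tools
    payload["stream"] = True         # the module always streams internally
    return payload
</syntaxhighlight>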
===== Known streaming quirk: error: events =====
Some llama-server versions have emitted streaming error records using an SSE field name like error: instead of data:, which strict SSE decoders (including OpenAI client implementations) can silently ignore.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16104|publisher=github.com|access-date=2025-12-16}}</ref>

Transport requirements:
* Treat both data: and error: as possible carriers of JSON.
* If an error: record is seen:
** parse its JSON (best effort),
** surface it as a terminal transport error to the worker,
** preserve any partial output already accumulated for later get_result() retrieval.
This keeps the worker resilient across server versions.

===== Tool calling =====
llama-server supports OpenAI-style function/tool calling via its chat handling, including:
* native tool-call formats for many model families
* a generic tool-call handler when a template isn't recognized
* optional parallel tool calling via the payload field "parallel_tool_calls": true (supported but disabled by default).<ref>{{cite web|title=GitHub|url=https://raw.githubusercontent.com/ggml-org/llama.cpp/master/docs/function-calling.md|publisher=raw.githubusercontent.com|access-date=2025-12-16}}</ref>

Worker behavior requirements (transport-facing):
* The transport should not "decide" tool semantics; it should simply surface parsed JSON events to the worker/tool loop.
* Tool calls may appear:
** in a final message object, or
** in streaming deltas (depending on server/model/template behavior).
* The tool loop layer (tooling.py) must support:
** structured tool_calls when provided, and
** the BIOS-driven fallback parsing strategy when they're not.

Note: the function-calling doc indicates that llama-server tool calling is used when the server is started with --jinja, and that both generic and native handlers exist.<ref>{{cite web|title=GitHub|url=https://raw.githubusercontent.com/ggml-org/llama.cpp/master/docs/function-calling.md|publisher=raw.githubusercontent.com|access-date=2025-12-16}}</ref> (Your worker config keeps the server command fully configurable, so enabling --jinja or templates is an orchestrator concern.)

===== Error taxonomy =====
The transport layer should distinguish:

A) Transport/protocol errors
* connection refused / connect timeout
* header timeout
* malformed SSE frames / unparseable JSON (beyond best-effort tolerance)
* premature disconnect before completion
These are candidates for worker-level restart decisions (nuke and repave), depending on policy and frequency.

B) Application errors from the server
* returned as JSON error bodies in non-streaming responses, or
* emitted into the streaming channel as a JSON error (sometimes via the error: field described above).<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16104|publisher=github.com|access-date=2025-12-16}}</ref>
These should fail the request (preserving partial output) and may or may not trigger a restart, depending on how often they occur (policy-driven).

===== Minimal transport surface =====
To preserve separation of concerns, transport.py should expose a small surface, e.g.:
* async probe_ready() -> bool (calls GET /v1/models)
* async stream_chat(payload: dict[str, Any]) -> AsyncIterator[TransportEvent]

Where TransportEvent is a small internal union such as:
* BytesProgress() (optional)
* TextDelta(str)
* ToolCallEvent(tool_call_payload)
* UsageEvent(usage_payload)
* ServerErrorEvent(error_payload)
* StreamDone()

The worker should treat any received bytes as progress; everything else is higher-level semantics. A sketch of this surface appears at the end of this appendix.

If you want to continue, the next section I'd propose is Appendix G: Tool-call loop algorithm (step-by-step pseudo-code for the tool loop, the fallback parsing, and how exit tools are recorded alongside), so that implementation and tests line up exactly.
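As referenced above, here is a minimal sketch of what that surface could look like in Python. It is a sketch under stated assumptions, not the module's actual implementation: httpx is one possible async HTTP client, the dataclass names mirror the event union listed above (BytesProgress is omitted since it is optional), the SSEDecoder import refers to the decoder sketched earlier (its module name is hypothetical), and the chunk-shape handling inside stream_chat is deliberately simplified.

<syntaxhighlight lang="python">
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, AsyncIterator, Union

import httpx

from sse_decoder import SSEDecoder  # the decoder sketched earlier; module name is hypothetical


# --- TransportEvent union (names mirror the list above) ---------------------
@dataclass
class TextDelta:
    text: str

@dataclass
class ToolCallEvent:
    tool_call: dict[str, Any]

@dataclass
class UsageEvent:
    usage: dict[str, Any]

@dataclass
class ServerErrorEvent:
    error: dict[str, Any]

@dataclass
class StreamDone:
    pass

TransportEvent = Union[TextDelta, ToolCallEvent, UsageEvent, ServerErrorEvent, StreamDone]


class Transport:
    """Thin transport over llama-server's OpenAI-compatible HTTP API (sketch)."""

    def __init__(self, base_url: str, client: httpx.AsyncClient) -> None:
        self._base_url = base_url.rstrip("/")
        self._client = client

    async def probe_ready(self) -> bool:
        """Readiness probe: GET /v1/models, exact path, no trailing slash."""
        try:
            resp = await self._client.get(f"{self._base_url}/v1/models")
        except httpx.HTTPError:
            return False
        return resp.status_code == 200

    async def stream_chat(self, payload: dict[str, Any]) -> AsyncIterator[TransportEvent]:
        """POST /v1/chat/completions and translate SSE frames into events."""
        decoder = SSEDecoder()
        url = f"{self._base_url}/v1/chat/completions"
        async with self._client.stream("POST", url, json=payload) as resp:
            async for chunk in resp.aiter_bytes():
                for field, obj in decoder.feed(chunk):
                    if field == "error":
                        # Non-standard error: record -> terminal transport error.
                        yield ServerErrorEvent(obj)
                        return
                    # Chunk-shape handling is simplified for illustration.
                    for choice in obj.get("choices", []):
                        delta = choice.get("delta", {})
                        if delta.get("content"):
                            yield TextDelta(delta["content"])
                        for tc in delta.get("tool_calls") or []:
                            yield ToolCallEvent(tc)
                    if obj.get("usage"):
                        yield UsageEvent(obj["usage"])
        if decoder.done:
            yield StreamDone()
        # else: no [DONE] was seen; the worker treats the stream as having
        # ended unexpectedly and applies its error/restart policy.
</syntaxhighlight>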