==== Transport layer assumptions: llama-server's OpenAI-compatible HTTP API ====
This appendix defines what the worker's transport layer (transport.py) should assume about llama-server's OpenAI-compatible HTTP API, and how it should behave when the server is imperfect or evolving.

===== Endpoints =====
Readiness probe (locked in):
* GET /v1/models (no trailing slash). GET /v1/models/ may return 404 on some versions, so the worker must probe the exact path without a trailing slash.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16525|publisher=github.com|access-date=2025-12-16}}</ref>

Chat inference (primary endpoint):
* POST /v1/chat/completions<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp|publisher=github.com|access-date=2025-12-16}}</ref>

Notes:
* llama-server is described as an OpenAI-API-compatible HTTP server and documents the chat completion endpoint at /v1/chat/completions.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp|publisher=github.com|access-date=2025-12-16}}</ref>
* Other endpoints exist (e.g., embeddings/reranking in certain modes), but they are out of scope for this module unless explicit support is added later.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp|publisher=github.com|access-date=2025-12-16}}</ref>

===== Request payload =====
The worker should construct an OpenAI-style chat completion payload with (at minimum):
* messages: ordered list of {role, content, ...} including:
** the BIOS system message
** the caller system message
** user / assistant / tool messages from the ongoing request context
* stream: true (the module always streams internally)
* tools: a list of OpenAI function tool definitions, including:
** normal tools (round-trip)
** exit tools (one-way signals)

Forward compatibility requirement: the worker must pass through any caller-provided params fields unchanged unless they collide with fields the worker owns (e.g., messages, tools, stream). This keeps the module from needing edits when llama.cpp adds new knobs. (A sketch of this merge appears after the parsing rules below.)

===== Streaming (SSE) =====
When stream=true, the transport should expect Server-Sent Events (SSE) framing, where:
* The stream consists of event records separated by blank lines.
* The primary payload is carried in data: lines.
* The stream typically terminates with data: [DONE].<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16104|publisher=github.com|access-date=2025-12-16}}</ref>

Transport parsing rules (robust and tolerant):
# Treat any bytes received after the headers as "progress" (worker-level definition).
# Parse SSE incrementally:
#* tolerate partial lines and partial JSON frames
#* ignore keepalive/comment/empty lines that carry no data
# For each data: payload:
#* if it is [DONE], finish the stream cleanly
#* otherwise parse the JSON and yield structured events upward (e.g., "text delta", "tool call", "usage update")
# If the stream ends unexpectedly (socket close), treat it as an error unless a terminal condition was already observed.
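To make the parsing rules above concrete, here is a minimal sketch of an incremental SSE decoder in Python. It is illustrative only: the class name, the (field, payload) frame shape, and the assumption that each data:/error: line carries a complete JSON document are choices made here, not anything mandated by llama-server or by the rest of this appendix. It already tolerates the non-standard error: field described below.

<syntaxhighlight lang="python">
import json
from collections.abc import Iterator
from typing import Any

# Hypothetical frame type: (field, payload) pairs, where field is "data" or "error".
SSEFrame = tuple[str, Any]


class SSEDecoder:
    """Incremental decoder for llama-server's SSE stream (illustrative sketch).

    Feed raw byte chunks in arrival order; iterate the frames that become
    complete. Tolerates partial lines, comments/keepalives, and both the
    standard `data:` field and the non-standard `error:` field.
    """

    def __init__(self) -> None:
        self._buffer = b""
        self.done = False  # set once `data: [DONE]` has been seen

    def feed(self, chunk: bytes) -> Iterator[SSEFrame]:
        self._buffer += chunk
        # Only consume complete lines; a trailing partial line stays buffered.
        while b"\n" in self._buffer:
            raw_line, self._buffer = self._buffer.split(b"\n", 1)
            line = raw_line.rstrip(b"\r").decode("utf-8", errors="replace")

            if not line or line.startswith(":"):
                # Blank separator or SSE comment/keepalive: nothing to yield.
                continue

            field, _, value = line.partition(":")
            value = value.lstrip(" ")

            if field == "data" and value == "[DONE]":
                self.done = True
                continue

            if field in ("data", "error"):
                try:
                    yield (field, json.loads(value))
                except json.JSONDecodeError:
                    # Best-effort tolerance: skip a partial or garbled JSON
                    # frame rather than aborting the whole stream.
                    continue
</syntaxhighlight>

A higher layer would call decoder.feed() on each received chunk and translate the resulting frames into transport events; if the connection closes before decoder.done is set, the worker treats the stream as having ended unexpectedly.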
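Returning to the payload-construction requirement earlier in this appendix, the pass-through rule reduces to a small merge step. The function name and the exact ownership set below are illustrative assumptions, not part of the worker's actual API.

<syntaxhighlight lang="python">
from typing import Any

# Fields the worker owns; caller-provided params may not override these.
# (The precise ownership list is an assumption for illustration.)
WORKER_OWNED_FIELDS = {"messages", "tools", "stream"}


def build_chat_payload(
    messages: list[dict[str, Any]],
    tools: list[dict[str, Any]],
    caller_params: dict[str, Any],
) -> dict[str, Any]:
    """Build an OpenAI-style chat completion payload.

    Unknown caller-supplied fields (temperature, top_p, future llama.cpp
    knobs, ...) are passed through unchanged; collisions with worker-owned
    fields are dropped so the worker keeps control of the protocol shape.
    """
    payload: dict[str, Any] = {
        k: v for k, v in caller_params.items() if k not in WORKER_OWNED_FIELDS
    }
    payload["messages"] = messages   # BIOS + caller system + conversation
    payload["tools"] = tools         # normal tools + exit tools
    payload["stream"] = True         # the module always streams internally
    return payload
</syntaxhighlight>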
===== Known streaming quirk: error: events =====
Some llama-server versions have emitted streaming error records using an SSE field name like error: instead of data:, which strict SSE decoders (including OpenAI client implementations) can silently ignore.<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16104|publisher=github.com|access-date=2025-12-16}}</ref>

Transport requirements:
* Treat both data: and error: as possible carriers of JSON.
* If an error: record is seen:
** parse its JSON (best effort),
** surface it as a terminal transport error to the worker,
** preserve any partial output already accumulated for later get_result() retrieval.
This keeps the worker resilient across server versions.

===== Tool calling =====
llama-server supports OpenAI-style function/tool calling via its chat handling, including:
* native tool-call formats for many model families
* a generic tool-call handler when a template isn't recognized
* optional parallel tool calling via the payload field "parallel_tool_calls": true (supported but disabled by default).<ref>{{cite web|title=GitHub|url=https://raw.githubusercontent.com/ggml-org/llama.cpp/master/docs/function-calling.md|publisher=raw.githubusercontent.com|access-date=2025-12-16}}</ref>

Worker behavior requirements (transport-facing):
* The transport should not "decide" tool semantics; it should simply surface parsed JSON events to the worker/tool loop.
* Tool calls may appear:
** in a final message object, or
** in streaming deltas (depending on server/model/template behavior).
* The tool loop layer (tooling.py) must support:
** structured tool_calls when provided, and
** the BIOS-driven fallback parsing strategy when they're not.

Note: the function-calling doc indicates that llama-server tool calling is used when the server is started with --jinja, and that both generic and native handlers exist.<ref>{{cite web|title=GitHub|url=https://raw.githubusercontent.com/ggml-org/llama.cpp/master/docs/function-calling.md|publisher=raw.githubusercontent.com|access-date=2025-12-16}}</ref> (Your worker config keeps the server command fully configurable, so enabling --jinja or templates is an orchestrator concern.)

===== Error taxonomy =====
The transport layer should distinguish:

A) Transport/protocol errors
* connection refused / connect timeout
* header timeout
* malformed SSE frames / unparseable JSON (beyond best-effort tolerance)
* premature disconnect before completion
These are candidates for worker-level restart decisions (nuke and repave), depending on policy and frequency.

B) Application errors from the server
* returned as JSON error bodies in non-streaming responses, or
* emitted into the streaming channel as a JSON error (sometimes via the error: field described above).<ref>{{cite web|title=GitHub|url=https://github.com/ggml-org/llama.cpp/issues/16104|publisher=github.com|access-date=2025-12-16}}</ref>
These should fail the request (preserving partial output) and may or may not trigger a restart, depending on how often they occur (policy-driven).

===== Minimal transport surface =====
To preserve separation of concerns, transport.py should expose a small surface, e.g.:
* async probe_ready() -> bool (calls GET /v1/models)
* async stream_chat(payload: dict[str, Any]) -> AsyncIterator[TransportEvent]

Where TransportEvent is a small internal union such as:
* BytesProgress() (optional)
* TextDelta(str)
* ToolCallEvent(tool_call_payload)
* UsageEvent(usage_payload)
* ServerErrorEvent(error_payload)
* StreamDone()

The worker should treat any received bytes as progress; everything else is higher-level semantics. A sketch of this surface appears at the end of this appendix.

If you want to continue, the next section I'd propose is Appendix G: Tool-call loop algorithm (step-by-step pseudo-code for the tool loop, the fallback parsing, and how exit tools are recorded alongside), so that implementation and tests line up exactly.
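As referenced above, here is a minimal sketch of what that surface could look like in Python. It is a sketch under stated assumptions, not the module's actual implementation: httpx is one possible async HTTP client, the dataclass names mirror the event union listed above (BytesProgress is omitted since it is optional), the SSEDecoder import refers to the decoder sketched earlier (its module name is hypothetical), and the chunk-shape handling inside stream_chat is deliberately simplified.

<syntaxhighlight lang="python">
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, AsyncIterator, Union

import httpx

from sse_decoder import SSEDecoder  # the decoder sketched earlier; module name is hypothetical


# --- TransportEvent union (names mirror the list above) ---------------------
@dataclass
class TextDelta:
    text: str

@dataclass
class ToolCallEvent:
    tool_call: dict[str, Any]

@dataclass
class UsageEvent:
    usage: dict[str, Any]

@dataclass
class ServerErrorEvent:
    error: dict[str, Any]

@dataclass
class StreamDone:
    pass

TransportEvent = Union[TextDelta, ToolCallEvent, UsageEvent, ServerErrorEvent, StreamDone]


class Transport:
    """Thin transport over llama-server's OpenAI-compatible HTTP API (sketch)."""

    def __init__(self, base_url: str, client: httpx.AsyncClient) -> None:
        self._base_url = base_url.rstrip("/")
        self._client = client

    async def probe_ready(self) -> bool:
        """Readiness probe: GET /v1/models, exact path, no trailing slash."""
        try:
            resp = await self._client.get(f"{self._base_url}/v1/models")
        except httpx.HTTPError:
            return False
        return resp.status_code == 200

    async def stream_chat(self, payload: dict[str, Any]) -> AsyncIterator[TransportEvent]:
        """POST /v1/chat/completions and translate SSE frames into events."""
        decoder = SSEDecoder()
        url = f"{self._base_url}/v1/chat/completions"
        async with self._client.stream("POST", url, json=payload) as resp:
            async for chunk in resp.aiter_bytes():
                for field, obj in decoder.feed(chunk):
                    if field == "error":
                        # Non-standard error: record -> terminal transport error.
                        yield ServerErrorEvent(obj)
                        return
                    # Chunk-shape handling is simplified for illustration.
                    for choice in obj.get("choices", []):
                        delta = choice.get("delta", {})
                        if delta.get("content"):
                            yield TextDelta(delta["content"])
                        for tc in delta.get("tool_calls") or []:
                            yield ToolCallEvent(tc)
                    if obj.get("usage"):
                        yield UsageEvent(obj["usage"])
        if decoder.done:
            yield StreamDone()
        # else: no [DONE] was seen; the worker treats the stream as having
        # ended unexpectedly and applies its error/restart policy.
</syntaxhighlight>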