=== llama_worker Design Document ===

Status: Draft (repo design doc; intended to be fed into future code generation)
Language: Python 3, fully typed (PEP 484) with typing_extensions (Protocol, TypedDict, etc.)
Backend: llama-server from llama.cpp (OpenAI-compatible HTTP server)

==== 0. Overview ====

llama_worker is an asyncio-native supervisor for a single llama-server subprocess (one model per process). It is designed to be used by a higher-level "hivemind" orchestrator that runs multiple workers (different models / GPUs / profiles), routes jobs between them, and prefers nuke & repave (restart broken workers) over fragile replay.

The worker provides:
* Subprocess supervision (start/stop/restart) with no-orphan guarantees (process group teardown).
* Async request execution: submit() returns immediately with a request id; the caller polls status/result.
* Slot-based concurrency (no queue by default).
* Robust failure handling and restart policy (nuke & repave).
* BIOS prompt injection (hivemind + runtime data), generated by a distinct method/component.
* OpenAI-format normal tool calling (round-trip via a pluggable ToolRunner) with a fallback parsing option.
* Exit tools (one-way "control signals") provided at init and recorded (never executed, never altering control flow).
* Internal streaming for monitoring and progress detection.
* Loop mitigation: max_tokens plus a conservative repeated-line detector.
* Partial output always retrievable via get_result(), even for failures/cancels/restarts.

This module optimizes for simplicity, clarity, robustness, and testability. It must also avoid wasting compute in CPU/RAM-constrained environments (lightweight polling/probes, minimal background work).

==== 1. Goals ====

# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API (explicitly asyncio-native).
# Slot-based concurrency: accept up to slots concurrent in-flight requests; otherwise reject immediately.
# Nuke & repave reliability:
#* detect dead/stalled/unreachable server,
#* restart the subprocess,
#* fail in-flight requests with explicit reasons,
#* no replay/resume.
# Long-prefill-friendly:
#* time-to-first-token can be minutes to tens of minutes,
#* stall detection is based on progress/liveness, not TTFT.
# Tools:
#* OpenAI function/tool calling for normal tools (round-trip),
#* ToolRunner is pluggable; the mechanism must not assume "lightweight tools."
# BIOS prompt:
#* universal hivemind context + runtime metadata,
#* generated via a distinct, testable method/component.
# Exit tools (signals):
#* provided at init,
#* recorded as structured signals,
#* never change worker control flow.
# Loop mitigation:
#* rely on max_tokens as the baseline,
#* plus repeated-line early kill.
# Forward-compatible params:
#* accept a pass-through mapping of generation params without modifying the module for new server features.
# Engineering style:
#* simple state machines, clear invariants, strong tests,
#* avoid compute waste (no aggressive polling, no expensive per-token work).

==== 2. Non-goals ====

* In-place model swapping or reconfiguration of a running worker.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (belongs in the orchestrator).
* Heavy output post-processing or complex token analytics.
* Persistent storage of prompts/outputs (handled by the caller/orchestrator).
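To make the intended call pattern from the overview concrete, here is a minimal caller-side sketch of the submit/poll/collect flow. Class, method, and field names (LlamaWorker, WorkerConfig, submit.accepted, status.is_terminal, result.output_text) are illustrative placeholders consistent with the public interface in section 8, not a finalized API.

<syntaxhighlight lang="python">
import asyncio

from llama_worker import LlamaWorker, WorkerConfig  # hypothetical module layout

async def run_one_job() -> None:
    # Configuration field names are placeholders; the design only fixes
    # host/port, slots, and profile-level timeout policy as worker-owned settings.
    worker = LlamaWorker(WorkerConfig(host="127.0.0.1", port=8080, slots=2))
    await worker.start()
    try:
        submit = await worker.submit(
            job_name="summarize-report",
            system_prompt="You are a concise technical summarizer.",
            user_prompt="Summarize the attached design document.",
            params={"max_tokens": 1024, "temperature": 0.2},  # passed through untouched
        )
        if not submit.accepted:          # e.g. NO_SLOT_AVAILABLE
            return
        # Poll lazily; the worker is asyncio-native but the caller decides cadence.
        while True:
            status = await worker.get_status(submit.request_id)
            if status.is_terminal:       # completed / failed / canceled
                break
            await asyncio.sleep(5.0)     # coarse polling to avoid compute waste
        # One-time completion call: returns (possibly partial) output and
        # releases all stored state for this request id.
        result = await worker.get_result(submit.request_id)
        print(result.status, len(result.output_text))
    finally:
        await worker.stop()

asyncio.run(run_one_job())
</syntaxhighlight>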
==== 3. Terminology ====

* Worker: one running llama-server subprocess plus its management logic.
* Slot: admission-control unit representing one in-flight request.
* Iteration (tools): one assistant "tool call turn" (a single assistant emission that may include multiple tool calls).
* Normal tools: round-trip tools executed via ToolRunner.
* Exit tools: one-way control-signal "tools", recorded only.
* Progress: any bytes received after HTTP headers, plus liveness evidence during prefill.

==== 4. Async model (explicit) ====

* The module is asyncio-native.
* Public APIs are async def and intended to be called from an asyncio event loop.
* Thread-safety is not a v1 requirement; keep calls within a consistent async context.

==== 5. Subprocess supervision requirements ====

* The worker launches llama-server in its own process group/session.
* stop() must ensure no orphaned processes: SIGTERM the process group → short wait → SIGKILL the process group if needed.
* Capture stdout/stderr into a bounded ring buffer for debug breadcrumbs.
* Port assignment is external: the worker config includes host/port; the worker does not auto-assign ports.

==== 6. Concurrency model: slots ====

* The worker has slots: int.
* A slot is permission to have one request "in flight" (best mapping: one concurrent streaming HTTP request).
* If all slots are full, submit() returns immediately with NO_SLOT_AVAILABLE.
* No internal queue by default.

==== 7. Request identity ====

* Request IDs are incrementing integers per worker lifetime (1, 2, 3, …).
* The caller provides a job_name string per request for correlation.

==== 8. Public interface ====

===== Lifecycle =====
* async start() -> None
* async stop() -> None
* (internal) async restart(reason: str) -> None

===== Requests =====
* async submit(job_name: str, system_prompt: str, user_prompt: str, params: Mapping[str, Any] | None = None) -> SubmitResult
* async get_status(request_id: int) -> RequestStatus | NOT_FOUND
* async get_result(request_id: int) -> RequestResult | NOT_FOUND
* async cancel(request_id: int) -> bool (best-effort)

===== Resource release semantics (locked) =====
* get_result(request_id) is the one-time completion call:
** returns the terminal result (completed/failed/canceled),
** returns partial output for any failure/cancel (possibly empty),
** releases all stored state/output for that request id.
* After a successful get_result(), later lookups return NOT_FOUND.

==== 9. Generation params: forward-compatible pass-through ====

* params is an open mapping passed through to the OpenAI-compatible request payload.
* The worker overwrites only fields it owns (e.g., messages, tools, stream).
* Unknown keys are preserved, so adding new server features does not require modifying the worker module.

==== 10. Prompt assembly ====

===== BIOS prompt (worker-owned) =====
The BIOS prompt is a universal system-level prompt layer that includes:
* hivemind/cooperating-agent context,
* current date/time and timezone,
* remaining tool iteration budget,
* instructions for normal tools and exit tools,
* fallback tool-call formatting rules (if fallback parsing is enabled).

BIOS generation must be a distinct method/component and unit-testable.

===== Ordering (invariant) =====
# BIOS system prompt
# caller system prompt
# conversation messages (user/assistant/tool)

==== 11. Transport contract (llama-server) ====

* Readiness probe: GET /v1/models → HTTP 200 + JSON parse success means READY.
* Chat inference: POST /v1/chat/completions with stream=true.
* The streaming parser should be tolerant of SSE framing and partial JSON.
* Progress definition (locked): any bytes after headers count as progress.

Transport is a distinct module; the worker consumes semantic events (text delta, final message, error, done).
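A minimal sketch of the transport layer's two HTTP interactions (readiness probe and streaming chat completion), assuming httpx as the async HTTP client; the design does not mandate a particular client library, and the helper names here are illustrative stand-ins for the semantic-event interface described above.

<syntaxhighlight lang="python">
import json
from typing import Any, AsyncIterator

import httpx  # assumed client library; any asyncio HTTP client would do

async def is_ready(base_url: str) -> bool:
    """Readiness probe: GET /v1/models must return 200 with parseable JSON."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get(f"{base_url}/v1/models")
            return resp.status_code == 200 and isinstance(resp.json(), dict)
    except (httpx.HTTPError, json.JSONDecodeError):
        return False

async def stream_chat(base_url: str, payload: dict[str, Any]) -> AsyncIterator[str]:
    """Yield text deltas from POST /v1/chat/completions with stream=true.

    Tolerant of SSE framing: non-data lines and partial/garbled JSON chunks
    are skipped rather than treated as fatal.
    """
    payload = {**payload, "stream": True}  # the worker owns the stream flag
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{base_url}/v1/chat/completions", json=payload) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data:"):
                    continue
                data = line[len("data:"):].strip()
                if data == "[DONE]":
                    return
                try:
                    chunk = json.loads(data)
                except json.JSONDecodeError:
                    continue  # incomplete event; keep reading
                delta = chunk.get("choices", [{}])[0].get("delta", {}).get("content")
                if delta:
                    yield delta
</syntaxhighlight>

Any bytes that arrive after the response headers, including unparseable lines, would still count as progress for the liveness policy in section 12.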
==== 12. Timeouts, progress, and liveness (per worker profile only) ====

* No per-request timeout overrides. Workers are configured per job type/hardware profile.
* Stall detection must tolerate multi-minute to tens-of-minutes prefill: use last_progress_at, not time-to-first-token.

===== Liveness evidence (prefill-safe) =====
During prefill (no stream bytes yet), the worker uses lightweight probes:
* process alive,
* /proc/<pid>/stat CPU tick deltas (Linux baseline).

===== Restart policy =====
Nuke & repave:
* restart the subprocess on death/unreachable/stall per policy,
* fail in-flight requests with explicit reasons,
* preserve partial output for retrieval.

==== 13. Tools and exit tools ====

===== Normal tools (round-trip) =====
* OpenAI function/tool calling schema.
* Executed via a pluggable ToolRunner.
* Budget is iterations, not calls: one assistant tool-emission that contains any normal tool calls consumes exactly 1 iteration, even if it requests multiple tools at once.
* Fallback parsing is allowed when structured tool_calls are absent/unreliable (BIOS-guided strict JSON).

===== Tool failure classification (locked) =====
Tools should kill the request only when there is an actual execution/format problem, such as:
* tool runner timeout,
* tool runner exception,
* malformed tool arguments (parse/validation),
* non-serializable result (implementation error),
* unknown tool name.

Domain outcomes are not failures: e.g., "search found nothing" must be a successful tool result (empty list / {results: []} / {found: false}), and the request continues.

===== Exit tools (one-way signals; non-terminating) =====
* Exit tools are provided at worker init as tool definitions.
* The worker exposes them to the model and records any exit-tool calls as structured signals[].
* Exit tools are never executed (no ToolRunner) and never alter control flow.
* Priority: process the normal tool loop as usual; record exit signals whenever they occur.
* Models will be encouraged (via BIOS) to emit exit tools near completion, but correctness does not depend on it.

==== 14. Loop mitigation ====

* Baseline: pass max_tokens via params (default set per worker profile/orchestrator).
* Repeated-line detector:
** detects degenerate loops where the model repeats the same line,
** on trigger: cancel request → FAILED(reason="repeated_line_loop"),
** partial output remains retrievable via get_result().

==== 15. Output retention ====

* The worker accumulates the full output in memory while running.
* For any terminal state (completed/failed/canceled), get_result() returns whatever output was accumulated.
* Output and state are released only after get_result() succeeds.

==== 16. Observability (minimum) ====

Minimum worker status surface:
* READY
* RUNNING
* FAILED
* (internally, STOPPED is useful)

Nice-to-haves (optional):
* output length,
* tokens received,
* tokens/sec.

Logging is breadcrumbs only; persistent storage of prompts/outputs is handled by the caller.

==== 17. Engineering rule: separation of concerns (explicit) ====

===== Specific rule =====
Prompt generation (especially BIOS) and message-stack construction must be implemented as distinct methods/components, not embedded inline inside request execution or transport code.
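As an illustration of this rule, here is a minimal sketch of BIOS generation and message-stack construction as standalone, unit-testable functions; field names such as hivemind_context and iterations_remaining are illustrative, not part of the locked design.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any

@dataclass(frozen=True)
class BiosContext:
    # Illustrative inputs only; the real set of runtime fields is open.
    hivemind_context: str
    iterations_remaining: int
    fallback_parsing: bool

def build_bios_prompt(ctx: BiosContext, now: datetime | None = None) -> str:
    """Pure function: render the BIOS layer from hivemind + runtime data."""
    now = now or datetime.now(timezone.utc)
    lines = [
        ctx.hivemind_context,
        f"Current time: {now.isoformat()} ({now.tzname()})",
        f"Tool iteration budget remaining: {ctx.iterations_remaining}",
        "Normal tools are executed and their results returned to you; "
        "exit tools are one-way signals and receive no response.",
    ]
    if ctx.fallback_parsing:
        lines.append("If structured tool calls are unavailable, emit a single strict-JSON tool call object.")
    return "\n".join(lines)

def build_messages(bios: str, caller_system: str, conversation: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Locked ordering: BIOS system prompt, caller system prompt, conversation."""
    return (
        [{"role": "system", "content": bios},
         {"role": "system", "content": caller_system}]
        + list(conversation)
    )
</syntaxhighlight>

Keeping both functions pure makes the ordering invariant and the BIOS content directly assertable in unit tests, independent of transport or request execution.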
===== Guide for approaching problems =====
When adding features or fixing issues, place changes in the smallest responsible layer:
* process supervision,
* transport,
* prompting,
* tool loop,
* exit signal parsing,
* liveness probes,
* timeout policy evaluation,
* loop detection,
* state accounting.

Prefer small, testable functions over clever shared logic; keep policy data-driven.
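As an example of the "small, testable function" style, here is a minimal sketch of the repeated-line detector from section 14; the threshold and minimum line length are illustrative defaults, not locked policy.

<syntaxhighlight lang="python">
def detects_repeated_line_loop(text: str, threshold: int = 8, min_line_len: int = 4) -> bool:
    """Return True if the tail of the accumulated output repeats one line.

    Conservative by design: only complete, non-trivial lines are compared, so
    normal prose and short boilerplate lines do not trigger a false positive.
    """
    lines = [ln.strip() for ln in text.splitlines() if len(ln.strip()) >= min_line_len]
    if len(lines) < threshold:
        return False
    tail = lines[-threshold:]
    return len(set(tail)) == 1  # the last `threshold` lines are identical
</syntaxhighlight>

The request-execution layer would call this on the accumulated output at a coarse cadence (not per token), and on a positive result cancel the request with reason "repeated_line_loop", leaving the partial output retrievable via get_result().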