=== llama_worker design overview ===

llama_worker is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single <code>llama-server</code> subprocess from llama.cpp and provides an async, slot-limited, resilient interface for running chat-style inference requests against it. It is meant to be used by a higher-level "hivemind" orchestrator that:

* runs many workers (different models / GPUs / profiles),
* routes jobs between them, and
* prefers nuke & repave (restart a broken worker) over fragile replay.

==== Design goals ====

# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API: explicitly asyncio-native.
# Slot-based concurrency: accept up to <code>slots</code> concurrent in-flight requests; otherwise reject immediately.
# Nuke & repave reliability: detect a dead/stalled/unreachable server, restart the subprocess, and fail in-flight requests with explicit reasons (no replay).
# Long-prefill-friendly: supports workloads where time-to-first-token can be minutes to tens of minutes.
# OpenAI-format tool calling with a pluggable ToolRunner, plus a fallback tool-call parsing method.
# BIOS prompt layer: inject stable platform-wide instructions (hivemind context) plus runtime metadata (date/time, budgets, etc.).
# One-way exit-tools (control signals): the model can emit structured signals upward; the worker records them but does not alter control flow.
# Simple early loop kill: a repeated-line detector as a supplement to max token limits.
# Forward-compatible parameters: allow passing arbitrary generation parameters without module changes.
# Separation-of-concerns rule: prompt generation (especially BIOS) must be isolated, testable, and independently modifiable.

==== Non-goals ====

* In-place model swapping or reconfiguration of a running worker.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (belongs in the orchestrator).
* Heavy output post-processing.

==== Async model ====

The module is asyncio-native.

* Public methods are <code>async def</code> and expected to be called from an asyncio event loop.
* Thread-safety is not a v1 requirement; keep calls within a consistent async context.

==== Process supervision ====

Launch llama-server in its own process group/session.

* stop() must ensure no orphaned processes: SIGTERM to the process group → wait → SIGKILL the process group if needed.
* Capture stdout/stderr into a bounded ring buffer for debugging.
* The port is assigned externally and passed in config.

==== Slot-based concurrency ====

The worker has <code>slots: int</code>.

* A slot is permission to have one request "in flight" (best mapping: one concurrent HTTP streaming request).
* If all slots are in use, submit() returns immediately with NO_SLOT_AVAILABLE.
* No internal queue by default.

==== Request identity ====

Request IDs are incrementing integers (1, 2, 3, … per worker lifetime).

* Each request includes a caller-provided <code>job_name</code> for correlation.

==== Public API ====

===== Lifecycle =====

* <code>async start() -> None</code>
* <code>async stop() -> None</code>
* (internal) <code>async restart(reason: str) -> None</code>

===== Per-request =====

* <code>async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult</code>
* <code>async get_status(request_id: int) -> RequestStatus</code>
* <code>async get_result(request_id: int) -> RequestResult | NotReady</code>
* <code>async cancel(request_id: int) -> bool</code>
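A minimal sketch of how this lifecycle and per-request surface could be typed; the <code>LlamaWorkerProtocol</code> name and the <code>Any</code> placeholders standing in for the result types are illustrative assumptions, not part of the spec:

<syntaxhighlight lang="python">
from __future__ import annotations

from typing import Any, Mapping, Protocol

# Placeholder aliases: the real module is expected to define concrete
# SubmitResult, RequestStatus, RequestResult and NotReady types; "Any"
# is used here only to keep the sketch self-contained.
SubmitResult = Any
RequestStatus = Any
RequestResult = Any
NotReady = Any


class LlamaWorkerProtocol(Protocol):
    """Illustrative structural type for the public surface described above."""

    # Lifecycle
    async def start(self) -> None: ...
    async def stop(self) -> None: ...

    # Per-request API
    async def submit(
        self,
        job_name: str,
        system_prompt: str,
        user_prompt: str,
        *,
        params: Mapping[str, Any] | None = None,
    ) -> SubmitResult: ...

    async def get_status(self, request_id: int) -> RequestStatus: ...
    async def get_result(self, request_id: int) -> RequestResult | NotReady: ...
    async def cancel(self, request_id: int) -> bool: ...
</syntaxhighlight>

Using a Protocol here is one option for keeping the orchestrator decoupled from the concrete worker class; the spec itself only requires the coroutine signatures.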
Result retrieval releases resources (explicit decision):

* Calling get_result(request_id) when the request is terminal returns the result and releases all stored state/output for that request.
* After a successful get_result(), subsequent get_status/get_result calls for that request_id should return a stable "unknown/released" response.

===== Worker-level =====

* <code>async get_worker_status() -> WorkerStatus</code>
* <code>async get_debug_info() -> WorkerDebugInfo</code>

==== Generation parameters ====

params is an open mapping passed through to llama-server's OpenAI-compatible request body.

* The worker merges in the fields it controls (messages/tools/stream) and passes unknown keys through untouched.

==== BIOS prompt layer ====

The BIOS prompt is worker-owned and regenerated at request start and before each post-tool continuation. It includes:

* universal platform/hivemind guidance
* current date/time + timezone
* tool-iteration budget remaining and constraints
* instructions for using tools and exit-tools

===== Message ordering =====

# BIOS system prompt
# caller system prompt
# conversation messages

==== Tool calling ====

Tools use the OpenAI function-calling schema. The worker:

* exposes tools,
* detects tool calls (structured, or via fallback parsing),
* executes them via ToolRunner,
* appends tool-result messages, and
* continues generation until completion or the tool-iteration budget is exhausted.

Fallback tool parsing is allowed via BIOS-enforced structured JSON conventions if native tool_calls is unreliable.

==== Exit-tools (one-way signals) ====

Explicit decision: exit-tools never terminate output or alter control flow beyond what the model itself does.

* Exit-tools are provided at worker init as OpenAI-format function tool definitions.
* The worker includes them so the model knows its signaling options.
* When the model emits an exit-tool call, the worker records it into signals[].
* The orchestrator may react (including choosing to cancel externally), but the worker itself does not change behavior.

==== Streaming and progress tracking ====

The worker always uses streaming internally.

* Progress definition: any response data flowing after headers counts as progress.
* Track:
** last_stream_byte_at
** last_liveness_at (prefill probes)
** last_progress_at = max(last_stream_byte_at, last_liveness_at)
* Prefill liveness baseline:
** subprocess alive
** /proc/<pid> CPU time delta

==== Timeout profile ====

Timeouts come from a worker-level profile; there are no per-request overrides. The profile includes:

* connect timeout
* headers timeout
* TTFT timeout (typically disabled/None)
* prefill liveness timeout (large/None)
* idle stream timeout (no bytes)
* optional absolute timeout
* probe interval
* restart backoff + crash-loop limits

Recovery is nuke & repave; in-flight requests fail with explicit reasons (no replay).

==== Output limits and loop kill ====

* Primary: max_tokens (passed via params, default set per worker profile).
* Secondary: a repeated-line detector cancels the request on clear degenerate repetition.

==== Output handling ====

* The full output is accumulated while the request is running.
* On successful <code>get_result()</code>, the request's stored output, tool trace, and signals are released immediately.

==== Status and debug reporting ====

* WorkerStatus: active/healthy/restarting, slot usage, active request IDs, restart count, last error.
* RequestStatus: state, timestamps, output length so far, last_progress_at, tool iterations remaining, recorded signals.
* DebugInfo: bounded subprocess logs + recent restart reasons.

==== Separation of concerns ====

In places likely to evolve (especially BIOS prompt generation and prompt assembly), the implementation must be separated into distinct methods/components, not embedded inline inside request execution or transport code. At minimum (see the sketch below):

* BIOS prompt creation is a distinct method/component.
* Message-stack assembly (BIOS + caller system + conversation) is a distinct method/component.
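A minimal sketch of what that separation could look like; the function names, the <code>BiosContext</code> fields, and the exact BIOS formatting are illustrative assumptions, not part of the spec:

<syntaxhighlight lang="python">
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class BiosContext:
    """Structured inputs for BIOS generation (field names are assumptions)."""
    hivemind_guidance: str          # stable platform-wide instructions
    now: datetime                   # current date/time (timezone-aware)
    tool_iterations_remaining: int  # budget surfaced to the model
    tool_instructions: str          # how to use tools and exit-tools


def build_bios_prompt(ctx: BiosContext) -> str:
    """Build the worker-owned BIOS system prompt from structured inputs.

    Kept as a pure function so it can be unit-tested in isolation
    (required fields present, formatting stable, budget updates correct).
    """
    return "\n".join(
        [
            "[BIOS v1]",  # version tag for formatting-stability tests (assumed)
            ctx.hivemind_guidance,
            f"Current time: {ctx.now.isoformat()} ({ctx.now.tzname() or 'unknown tz'})",
            f"Tool iterations remaining: {ctx.tool_iterations_remaining}",
            ctx.tool_instructions,
        ]
    )


def assemble_messages(
    bios_prompt: str,
    caller_system_prompt: str,
    conversation: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Assemble the message stack in the documented order:
    BIOS system prompt, then caller system prompt, then conversation."""
    return [
        {"role": "system", "content": bios_prompt},
        {"role": "system", "content": caller_system_prompt},
        *conversation,
    ]
</syntaxhighlight>

Keeping both pieces as pure functions of structured inputs also satisfies the "policy separate from mechanics" rule below: prompt content is generated from data, not assembled inline in transport code.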
===== When adding features or handling edge cases =====

# Identify which concern the change belongs to, and keep it inside that layer:
#* Process supervision (start/stop/restart, orphan prevention)
#* Transport (HTTP request/stream parsing, retries)
#* Prompting (BIOS generation, message ordering, tool/exit-tool exposure)
#* Tool loop (detect tool call, execute via ToolRunner, resume)
#* Parsing (tool-call parsing, exit-tool parsing, repeated-line detection)
#* State & accounting (slots, request lifecycle, status reporting)
# Prefer small, testable functions over clever shared logic.
# Keep policy separate from mechanics:
#* Timeout policy and restart policy should be data/config driven.
#* Prompt content should be generated from structured inputs.

===== BIOS generation and message assembly must be unit-testable in isolation =====

Tests should cover:

* required fields present (hivemind guidance, date/time, budgets)
* formatting stability (e.g., a version tag or delimiters, if used)
* correct updates as the tool budget decreases

==== Testing ====

Strong unit and integration tests are prioritized over cleverness. Integration tests should validate that:

* process-group teardown leaves no orphans
* long prefill does not trigger false timeouts
* a disconnect/stall triggers restart and fails in-flight requests
* tool-call parsing works (native + fallback)
* exit-tools are recorded but do not affect control flow
* get_result() releases request state

==== Open recommendations ====

# Health endpoint / readiness check: recommend GET /v1/models, falling back to a tiny warmup completion if needed.
# Post-release status behavior: recommend that get_status() after get_result() return a stable NOT_FOUND / RELEASED code.
# Exit-tools parsing precedence: if both a normal tool call and an exit-tool appear, the recommended default is to process the normal tool loop as usual and record exit-tool signals whenever they occur.

If you want, the next step is producing a compact "Types & Protocols" appendix (Protocols/TypedDicts/dataclasses/enums) that exactly matches this doc, so implementation can proceed test-first without ambiguity.
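As a rough illustration of the kind of definitions such an appendix would pin down, here is a hedged sketch of two of them; the state names and field names are assumptions derived from the status fields described above, not a finalized design:

<syntaxhighlight lang="python">
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class RequestState(str, Enum):
    """Illustrative request lifecycle states; the exact set is an assumption."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    RELEASED = "released"  # reported after get_result() has released state


@dataclass
class WorkerStatus:
    """Worker-level snapshot, mirroring the fields listed in this doc."""
    active: bool
    healthy: bool
    restarting: bool
    slots_total: int
    slots_in_use: int
    active_request_ids: list[int] = field(default_factory=list)
    restart_count: int = 0
    last_error: str | None = None
</syntaxhighlight>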