=== llama_worker design ===

llama_worker is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single <code>llama-server</code> subprocess from llama.cpp and provides an async, slot-limited, resilient interface for running chat-style inference requests against it. It is meant to be used by a higher-level "hivemind" orchestrator that:

* runs many workers (different models / GPUs / profiles),
* routes jobs between them,
* and prefers nuke & repave (restart a broken worker) over fragile replay.

==== Goals ====

# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API: explicitly asyncio-native.
# Slot-based concurrency: accept up to <code>slots</code> concurrent in-flight requests; otherwise reject immediately.
# Nuke & repave reliability: detect a dead/stalled/unreachable server, restart the subprocess, and fail in-flight requests with explicit reasons (no replay).
# Long-prefill-friendly: support workloads where time-to-first-token can be minutes to tens of minutes.
# OpenAI-format tool calling with a pluggable ToolRunner, plus a fallback tool-call parsing method.
# BIOS prompt layer: inject stable platform-wide instructions (hivemind context) plus runtime metadata (date/time, budgets, etc.).
# One-way exit-tools (control signals): the model can emit structured signals upward; the worker records them but does not alter control flow.
# Simple early loop kill: a repeated-line detector as a supplement to max token limits.
# Forward-compatible parameters: allow passing arbitrary generation parameters without module changes.

==== Non-goals ====

* In-place model swapping or reconfiguration of a running worker.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (belongs in the orchestrator).
* Heavy output post-processing.

==== Async model ====

* The module is asyncio-native.
* Public methods are <code>async def</code> and expected to be called from an asyncio event loop.
* Thread-safety is not a v1 requirement; keep calls within a consistent async context.

==== Process management ====

* Launch <code>llama-server</code> in its own process group/session.
* <code>stop()</code> must ensure no orphaned processes: SIGTERM to the process group → wait → SIGKILL the process group if needed.
* Capture stdout/stderr into a bounded ring buffer for debugging.
* The port is assigned externally and passed in config.

==== Slots and concurrency ====

* The worker has <code>slots: int</code>.
* A slot is permission to have one request "in flight" (best mapping to concurrent HTTP streaming requests).
* If all slots are in use, <code>submit()</code> returns immediately with NO_SLOT_AVAILABLE.
* There is no internal queue by default.

==== Request identity ====

* Request IDs are incrementing integers (1, 2, 3… per worker lifetime).
* Each request includes a caller-provided <code>job_name</code> for correlation.

==== Public API ====

* <code>async start() -> None</code>
* <code>async stop() -> None</code>
* (internal) <code>async restart(reason: str) -> None</code>

===== Requests =====

* <code>async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult</code>
* <code>async get_status(request_id: int) -> RequestStatus</code>
* <code>async get_result(request_id: int) -> RequestResult | NotReady</code>
* <code>async cancel(request_id: int) -> bool</code>

Result retrieval releases resources (explicit decision):

* Calling <code>get_result(request_id)</code> when the request is terminal returns the result and releases all stored state/output for that request.
* After a successful <code>get_result()</code>, subsequent <code>get_status</code>/<code>get_result</code> calls for that request_id should return a stable "unknown/released" response (e.g., NOT_FOUND / RELEASED).
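For orientation, here is a minimal sketch of the public surface described above. The method signatures, the incrementing request IDs, and the NO_SLOT_AVAILABLE outcome come from this document; the <code>LlamaWorker</code>, <code>WorkerConfig</code>, and <code>SubmitOutcome</code> names and their fields are illustrative assumptions, not a fixed implementation.

<syntaxhighlight lang="python">
from __future__ import annotations

import enum
from dataclasses import dataclass, field
from typing import Any, Mapping


class SubmitOutcome(enum.Enum):
    ACCEPTED = "accepted"           # request admitted into a free slot
    NO_SLOT_AVAILABLE = "no_slot"   # all slots busy; worker rejects immediately, no queue


@dataclass
class SubmitResult:
    outcome: SubmitOutcome
    request_id: int | None = None   # incrementing integer, assigned only when accepted


@dataclass
class WorkerConfig:
    # Illustrative fields; the doc only pins down an externally assigned port and a slot count.
    model_path: str
    port: int
    slots: int
    extra_server_args: list[str] = field(default_factory=list)


class LlamaWorker:
    """Supervises one llama-server subprocess and exposes a slot-limited async API."""

    def __init__(self, config: WorkerConfig) -> None:
        self._config = config
        self._next_request_id = 1   # request IDs are 1, 2, 3, ... per worker lifetime

    async def start(self) -> None:
        """Launch llama-server in its own process group/session and wait for readiness."""
        ...

    async def stop(self) -> None:
        """SIGTERM the process group, wait, SIGKILL if needed; never leave orphans."""
        ...

    async def submit(
        self,
        job_name: str,
        system_prompt: str,
        user_prompt: str,
        *,
        params: Mapping[str, Any] | None = None,
    ) -> SubmitResult:
        """Return NO_SLOT_AVAILABLE immediately when all slots are in flight."""
        ...

    async def get_status(self, request_id: int) -> "RequestStatus":
        ...

    async def get_result(self, request_id: int) -> "RequestResult | NotReady":
        """A terminal result is returned once; retrieval releases all stored state."""
        ...

    async def cancel(self, request_id: int) -> bool:
        ...
</syntaxhighlight>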
===== Introspection =====

* <code>async get_worker_status() -> WorkerStatus</code>
* <code>async get_debug_info() -> WorkerDebugInfo</code>

==== Generation parameters ====

* <code>params</code> is an open mapping passed through to llama-server's OpenAI-compatible request body.
* The worker merges in the fields it controls (messages/tools/stream) and passes unknown keys through untouched (see the request-assembly sketch below).

==== BIOS prompt layer ====

The BIOS prompt is worker-owned and regenerated at request start and before each post-tool continuation. It includes:

* universal platform/hivemind guidance
* current date/time + timezone
* tool iteration budget remaining and constraints
* instructions for using tools and exit-tools

===== Message order =====

# BIOS system prompt
# caller system prompt
# conversation messages

==== Tool calling ====

Tools use the OpenAI function-calling schema. The worker:

* exposes tools,
* detects tool calls (structured, or via fallback parsing),
* executes them via ToolRunner,
* appends tool result messages,
* continues generation until completion or the tool-iteration budget is exhausted.

Fallback tool parsing is allowed via BIOS-enforced structured JSON conventions if native tool_calls is unreliable.

==== Exit-tools (control signals) ====

Explicit decision: exit-tools never terminate output or alter control flow beyond what the model itself does.

* Exit-tools are provided at worker init as OpenAI-format function tool definitions.
* The worker includes them in the tool list (or a dedicated list, depending on server compatibility) so the model knows its signaling options.
* When the model emits an exit-tool call, the worker records it into signals[] (structured, typed).
* The worker does not automatically stop the request, cancel it, restart the server, or change sampling/params.
* The orchestrator can react to signals[] (including choosing to call cancel() externally).

This keeps exit-tools purely informational and "upward-facing".

==== Streaming and progress tracking ====

* The worker always uses streaming internally.
* Progress definition: any response data flowing after headers counts as progress.
* Tracked timestamps:
** last_stream_byte_at
** last_liveness_at (prefill probes)
** last_progress_at = max(last_stream_byte_at, last_liveness_at)
* Prefill liveness baseline:
** subprocess alive
** /proc/<pid> CPU time delta

==== Timeouts and recovery ====

Timeouts come from a worker-level profile; there are no per-request overrides. The profile includes:

* connect timeout
* headers timeout
* TTFT timeout (typically disabled/None)
* prefill liveness timeout (large/None)
* idle stream timeout (no bytes)
* optional absolute timeout
* probe interval
* restart backoff + crash-loop limits

Recovery is nuke & repave; in-flight requests fail with explicit reasons (no replay).

==== Runaway-generation control ====

* Primary: max_tokens (passed via params, default set per worker profile).
* Secondary: a repeated-line detector cancels the request on clear degenerate repetition.

==== Output handling ====

* Full output is accumulated while the request is running.
* On successful <code>get_result()</code>, the request's stored output, tool trace, and signals are released immediately.

==== Status and observability ====

* WorkerStatus: active/healthy/restarting, slot usage, active request ids, restarts, last error.
* RequestStatus: state, timestamps, output length so far, last_progress_at, tool iterations remaining, recorded signals.
* DebugInfo: bounded subprocess logs + recent restart reasons.
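As a reference for the "Generation parameters", "BIOS prompt layer", and "Message order" sections above, here is a small sketch of how a request body could be assembled. Only the message ordering (BIOS → caller system prompt → conversation) and the rule that worker-controlled keys (messages/tools/stream) override caller params come from this document; the function names, the BIOS wording, and the exact fields are assumptions.

<syntaxhighlight lang="python">
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any, Mapping


def build_bios_prompt(tool_iterations_left: int) -> str:
    """Regenerated at request start and before each post-tool continuation."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (
        "You are one worker inside a hivemind of cooperating models.\n"  # platform guidance (illustrative wording)
        f"Current UTC time: {now}\n"
        f"Tool iterations remaining: {tool_iterations_left}\n"
        "Use tools via structured tool calls; use exit-tools only to signal upward."
    )


def build_request_body(
    system_prompt: str,
    conversation: list[dict[str, Any]],
    tools: list[dict[str, Any]],
    params: Mapping[str, Any] | None,
    tool_iterations_left: int,
) -> dict[str, Any]:
    messages = [
        {"role": "system", "content": build_bios_prompt(tool_iterations_left)},  # 1. BIOS
        {"role": "system", "content": system_prompt},                            # 2. caller system prompt
        *conversation,                                                           # 3. conversation messages
    ]
    body: dict[str, Any] = dict(params or {})  # unknown caller keys pass through untouched
    body.update(                               # worker-controlled fields always win
        messages=messages,
        tools=tools,
        stream=True,                           # worker always streams internally
    )
    return body
</syntaxhighlight>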
==== Testing ====

Strong unit and integration tests are prioritized over cleverness. Integration tests should specifically validate that:

* process-group teardown leaves no orphans
* long prefill does not trigger false timeouts
* disconnect/stall triggers a restart and fails in-flight requests
* tool-call parsing works (native + fallback)
* exit-tools are recorded but do not affect control flow
* <code>get_result()</code> releases request state

==== Open decisions ====

These are the only places where you may want to pick a default before coding:

# Health endpoint / readiness check: prefer GET /v1/models, falling back to a tiny completion as warmup if needed (see the probe sketch below).
# What <code>get_status()</code> returns after <code>get_result()</code>: recommend NOT_FOUND or RELEASED with a stable error code.
# Exit-tools parsing precedence: if a response includes both normal tool_calls and an exit-tool call, do you process normal tools first, or just record exit-tools and continue? (Default recommendation: process the normal tool loop as usual; record exit-tools whenever they occur.)

If you want, I can append a compact "Types & Protocols" section (TypedDicts/Enums/Protocols) that matches this design doc 1:1, ready to implement.
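For the readiness-check decision above, a possible probe, shown only as a sketch: the preference for GET /v1/models and the tiny-completion fallback come from this document, while the use of httpx, the /v1/chat/completions path, the timeout values, and the probe payload are assumptions.

<syntaxhighlight lang="python">
from __future__ import annotations

import httpx


async def is_ready(base_url: str) -> bool:
    """Cheap readiness check: try GET /v1/models, then fall back to a one-token completion."""
    async with httpx.AsyncClient(base_url=base_url, timeout=10.0) as client:
        try:
            resp = await client.get("/v1/models")
            if resp.status_code == 200:
                return True
        except httpx.HTTPError:
            pass  # fall through to the warmup probe

        try:
            # Tiny completion as warmup; exercises the model path, not just the HTTP layer.
            resp = await client.post(
                "/v1/chat/completions",
                json={
                    "messages": [{"role": "user", "content": "ping"}],
                    "max_tokens": 1,
                    "stream": False,
                },
                timeout=120.0,  # first-load warmup may be slow
            )
            return resp.status_code == 200
        except httpx.HTTPError:
            return False
</syntaxhighlight>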