=== llama_worker design ===
<code>llama_worker</code> is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single <code>llama-server</code> subprocess from llama.cpp and provides an async, slot-limited, resilient interface for running chat-style inference requests against it. It is designed to be used by a higher-level "hivemind" orchestrator that:
* runs many workers (different models / GPUs / profiles),
* routes jobs between them,
* and prefers nuke & repave (restart a broken worker) over fragile replay.

==== Goals ====
# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API (explicitly asyncio-native): requests are submitted and completed asynchronously.
# Slot-based concurrency: accept up to <code>slots</code> concurrent in-flight requests; otherwise reject immediately.
# Robustness via nuke & repave:
#* detect an unhealthy/stalled server,
#* restart the subprocess,
#* fail in-flight requests with explicit reasons (no replay).
# Long-prefill-friendly: supports workloads where time-to-first-token is minutes to tens of minutes.
# OpenAI-format tool calling with a pluggable ToolRunner; also supports a fallback tool-call parsing method.
# BIOS prompt layer: inject stable platform-wide instructions (hivemind context) + runtime metadata (date/time, budgets, etc.).
# One-way "exit tools" (control signals): the model can emit structured signals (issue/escalation/decision request) without a tool round-trip.
# Simple early loop kill: a repeated-line detector as a supplement to max token limits.
# Forward-compatible parameters: allow passing arbitrary generation parameters without changing module code.

==== Non-goals ====
* In-place model swapping or reconfiguration of a running worker instance.
* Replay/resume of in-flight requests after restart.
* A global scheduler across workers (belongs in the orchestrator).
* Fancy token analytics or heavyweight monitoring agents.

==== Async model ====
* The module is asyncio-native.
* Public methods are <code>async def</code> and expected to be called from an asyncio event loop.
* Thread-safety is not a primary requirement; the orchestrator should call worker methods from a consistent async context. (A sync wrapper may exist later, but is not required for v1.)

==== Core components ====
* <code>LlamaWorker</code>
** owns the llama-server subprocess and HTTP client
** manages slots, the request table, and health state
** assembles prompts (BIOS + caller system + conversation)
** streams responses internally and accumulates the full output
** runs the tool-call loop and parses/records exit tools
* <code>ToolRunner</code> (plugin)
** pluggable execution for normal tool calls (OpenAI function schema)
** may be lightweight or heavy; the worker must not assume either
* Request records
** store the full output until retrieved
** expose status + debug metadata

==== Process management ====
* The worker launches llama-server as a subprocess in its own process group/session.
* <code>stop()</code> must ensure no orphaned processes: send SIGTERM to the process group → wait → SIGKILL the process group if needed.
* The worker captures stdout/stderr into a bounded ring buffer for debugging.
* The port is assigned externally and provided via config; the worker validates that the server binds and that it can connect.

==== Slots and admission control ====
* The worker has <code>slots: int</code>.
* A slot is "whatever maps best to having multiple queries in flight at once":
** the implementation will treat a slot as permission to dispatch one concurrent HTTP streaming request,
** slots are admission control, not a guarantee of linear throughput.
* If all slots are in use, <code>submit()</code> returns immediately with <code>NO_SLOT_AVAILABLE</code> (see the sketch below).
* No internal queue by default.
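A minimal sketch of the admission check described above, assuming illustrative internal names (<code>LlamaWorkerSketch</code>, <code>_slots_total</code>, <code>_active</code>, <code>_next_request_id</code>, <code>_ready</code>) that are not part of this spec; only the <code>submit()</code> signature and the error strings come from the sections above.

<syntaxhighlight lang="python">
# Minimal sketch of slot-based admission control; not the final implementation.
# Names prefixed with "_" and the class name are illustrative assumptions.
from __future__ import annotations

from typing import Any, Mapping, TypedDict


class SubmitOk(TypedDict):
    ok: bool        # always True here
    request_id: int


class SubmitErr(TypedDict):
    ok: bool        # always False here
    error: str      # e.g. "NO_SLOT_AVAILABLE", "WORKER_NOT_READY"


class LlamaWorkerSketch:
    def __init__(self, slots: int) -> None:
        self._slots_total = slots
        self._active: dict[int, str] = {}   # request_id -> job_name
        self._next_request_id = 1           # incrementing request IDs: 1, 2, 3, ...
        self._ready = False                 # set True once llama-server is healthy

    async def submit(
        self,
        job_name: str,
        system_prompt: str,
        user_prompt: str,
        *,
        params: Mapping[str, Any] | None = None,
    ) -> SubmitOk | SubmitErr:
        # Reject immediately: slots are admission control and there is no queue.
        if not self._ready:
            return {"ok": False, "error": "WORKER_NOT_READY"}
        if len(self._active) >= self._slots_total:
            return {"ok": False, "error": "NO_SLOT_AVAILABLE"}

        request_id = self._next_request_id
        self._next_request_id += 1
        self._active[request_id] = job_name
        # ... dispatch one streaming HTTP request to llama-server here,
        # removing the entry (freeing the slot) when the request finishes ...
        return {"ok": True, "request_id": request_id}
</syntaxhighlight>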
==== Request identification ====
* The worker assigns request IDs as an incrementing integer (1, 2, 3, …).
* Each submission also includes a caller-provided <code>job_name</code> string for correlation and orchestration.
* The <code>(job_name, request_id)</code> pair is the primary handle externally (<code>request_id</code> is unique within a worker lifetime).

==== Public API ====
===== Lifecycle =====
* <code>async start() -> None</code>
* <code>async stop() -> None</code>
* (internal) <code>async restart(reason: str) -> None</code>

===== Submission =====
* <code>async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult</code>
* Returns:
** success: <code>{ ok: True, request_id: int }</code>
** failure: <code>{ ok: False, error: "NO_SLOT_AVAILABLE" | "WORKER_NOT_READY" | ... }</code>

===== Request tracking =====
* <code>async get_status(request_id: int) -> RequestStatus</code>
* <code>async get_result(request_id: int) -> RequestResult | NotReady</code>
* <code>async cancel(request_id: int) -> bool</code>
* <code>async release(request_id: int) -> None</code> (or auto-release after a successful <code>get_result()</code>; implementation choice)

===== Worker status =====
* <code>async get_worker_status() -> WorkerStatus</code>

===== Debugging =====
* <code>async get_debug_info() -> WorkerDebugInfo</code> (recent logs, last errors, restart count)

All structures are typed (TypedDict/dataclasses) with stable machine-readable fields.

==== Forward-compatible parameters ====
* <code>params</code> is an open mapping passed through to llama-server's OpenAI-compatible request body.
* The worker:
** merges the required fields it controls (messages/tools/stream flags),
** passes unknown keys through untouched,
** may optionally record "unsupported parameter" warnings if the backend rejects them.
* This prevents needing module changes whenever llama.cpp adds a new knob.

==== BIOS prompt layer ====
BIOS is a worker-generated system prompt that includes:
* a stable platform-wide instruction: the model is one cooperating agent in a multi-model hivemind
* current date/time + timezone
* tool budget remaining / tool usage rules
* instructions on how to emit exit tools / control signals (if enabled)

BIOS is regenerated:
* at request start
* before each continuation after tool execution

===== Prompt ordering =====
# BIOS system prompt (highest priority)
# caller's system prompt
# conversation messages (user/assistant/tool)
Multiple system messages are allowed; if backend behavior requires it, the worker may concatenate them with clear delimiters (fallback).

==== Tool calling ====
* Tools follow the OpenAI schema: <code>tools=[{type:"function", function:{name, description, parameters}}]</code>.
* Workflow:
*# Send the request with tools available.
*# If the assistant returns a tool call:
*#* execute it via ToolRunner,
*#* append the tool result to the conversation,
*#* decrement the tool iteration budget,
*#* continue generation.
* Tool loop controls:
** <code>max_tool_iterations</code> (per worker)
** per-tool timeout (per worker)
* Fallback parsing: if the backend doesn't emit structured <code>tool_calls</code> reliably, the worker can use BIOS-enforced structured JSON conventions and strict parsing; failure is a request-level terminal error (or can emit an exit-tool escalation signal).

==== Exit tools (control signals) ====
* Exit tools are specified at worker init as a list of OpenAI-format tool definitions (same schema; an illustrative definition follows below).
* They do not use the ToolRunner and do not involve a tool-result round trip.
* The worker exposes these to the model as "available control actions" and then parses and records any such tool calls emitted.

===== Example signals =====
* LOW_CONFIDENCE
* NEEDS_EXTERNAL_INFO
* NEEDS_HIGHER_REASONER
* NEEDS_MANAGER_DECISION (with options)
* POLICY_RISK (optional lane)
* etc.

===== Handling =====
* Exit tool calls become structured <code>signals[]</code> in the request record.
* Default: <code>stop_on_decision_request=True</code> (the worker terminates generation early on decision requests so the orchestrator can branch via a new request).
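For illustration, an exit tool for the NEEDS_MANAGER_DECISION signal might be declared like any other OpenAI-format tool. The parameter schema and the <code>exit_tools=</code> keyword shown here are assumptions; only the overall <code>{type, function: {name, description, parameters}}</code> shape is fixed by this spec.

<syntaxhighlight lang="python">
# Illustrative exit-tool definition in OpenAI function-tool format.
# The parameter schema and field names are assumptions; only the overall
# shape (type: "function", function: {name, description, parameters})
# is fixed by the spec above.
NEEDS_MANAGER_DECISION = {
    "type": "function",
    "function": {
        "name": "needs_manager_decision",
        "description": (
            "One-way control signal: ask the orchestrator to choose between "
            "options. No tool result is returned to the model."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "options": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Candidate decisions for the manager to pick from.",
                },
            },
            "required": ["question", "options"],
        },
    },
}

# Passed at worker init, e.g. (keyword name assumed):
# worker = LlamaWorker(..., exit_tools=[NEEDS_MANAGER_DECISION])
</syntaxhighlight>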
==== Streaming and progress tracking ====
* The worker always uses streaming internally (even if the caller only polls).
* Progress definition (explicit): any data flowing after headers counts as progress.
* Timestamps tracked per request:
** <code>last_stream_byte_at</code> (any bytes received)
** <code>last_liveness_at</code> (prefill liveness probes)
** <code>last_progress_at = max(last_stream_byte_at, last_liveness_at)</code>

===== Prefill liveness probes =====
Before tokens/bytes arrive, the worker uses lightweight probes:
* process alive
* /proc/<pid> CPU time delta (baseline on Linux)
* optional GPU probes later

==== Timeout profile ====
No per-request overrides. A worker's timeout profile includes:
* <code>connect_timeout_s</code> (short)
* <code>headers_timeout_s</code> (moderate)
* <code>ttft_timeout_s</code> (typically disabled/None)
* <code>prefill_liveness_timeout_s</code> (large/None)
* <code>idle_stream_timeout_s</code> (time without any bytes once streaming)
* <code>absolute_timeout_s</code> (optional/None)
* <code>liveness_probe_interval_s</code>
* restart controls: backoff and crash-loop limits

===== Failure handling =====
* On connect/header failures: restart quickly.
* On stall (no progress beyond thresholds): restart and fail in-flight requests.
* On process death: restart and fail in-flight requests.
* No replay; in-flight requests fail with reasons like:
** <code>worker_restarted</code>
** <code>stall_timeout</code>
** <code>connect_failed</code>
** <code>headers_timeout</code>
** <code>server_died</code>

==== Output limits and loop detection ====
* Primary: <code>max_tokens</code> / max-new-tokens in <code>params</code> (per-worker default; may be overridden by the caller if desired).
* Secondary: repeated-line detector (see the sketch below):
** detects the same sufficiently-long line repeated N times consecutively,
** cancels the request and records <code>FAILED(reason="repeated_line_loop")</code>,
** cheap and conservative (line-based, not token-heavy).
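A minimal sketch of such a detector; the defaults for the repeat threshold and minimum line length are illustrative assumptions, not values fixed by this spec.

<syntaxhighlight lang="python">
# Minimal sketch of the repeated-line loop detector. The defaults for the
# repeat threshold and minimum line length are illustrative assumptions.
class RepeatedLineDetector:
    def __init__(self, repeats: int = 8, min_line_len: int = 20) -> None:
        self._repeats = repeats
        self._min_line_len = min_line_len
        self._buffer = ""                   # partial line not yet terminated by "\n"
        self._last_line: str | None = None  # last sufficiently-long complete line
        self._count = 0                     # length of the current run of repeats

    def feed(self, text: str) -> bool:
        """Feed streamed output text; return True once a loop is detected."""
        self._buffer += text
        *complete, self._buffer = self._buffer.split("\n")
        for raw in complete:
            line = raw.strip()
            if len(line) < self._min_line_len or line != self._last_line:
                # A short line or a different line breaks the run (conservative).
                self._last_line = line if len(line) >= self._min_line_len else None
                self._count = 1
                continue
            self._count += 1
            if self._count >= self._repeats:
                return True
        return False
</syntaxhighlight>

A worker could feed each streamed text delta into <code>feed()</code> and cancel the request once it returns <code>True</code>.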
==== Result retention ====
* The worker stores the full output text in memory for each request until:
** the caller retrieves it and the worker auto-releases, or
** the caller calls <code>release(request_id)</code>.
* Given your RAM assumptions, no output cap is required for v1.

==== Status and debug structures ====
===== WorkerStatus fields =====
* active, healthy, restarting
* slots_total, slots_used
* list of active request IDs (and job_name mapping)
* restart_count, last_error, last_healthy_at

===== RequestStatus fields =====
* request_id, job_name
* state + timestamps
* output length so far
* last_progress_at
* tool iterations remaining
* captured signals[]

===== WorkerDebugInfo fields =====
* last N lines of subprocess output
* last restart reasons and times

==== Testing ====
Priority is a solid suite over cleverness.

===== Unit tests =====
* slot accounting and state transitions
* BIOS assembly + ordering
* tool loop logic + fallback parsing
* exit-tools parsing and early-stop behavior
* repeated-line detector behavior
* timeout bookkeeping with synthetic time/progress

===== Integration tests =====
Using a stub server (or a controlled llama-server in CI when feasible) to simulate:
* slow headers
* long silence then streaming
* streaming stalls
* disconnect mid-stream
* malformed tool calls and exit tool calls
* subprocess crash and restart behavior
* process-group teardown (no orphans)

==== Decisions to lock now ====
These aren't blockers, but locking them now will reduce rework:
# Health/readiness endpoint: exactly which endpoint is used for health/ready. Recommend <code>GET /v1/models</code> if supported; otherwise a tiny warmup completion endpoint.
# Result retention policy: auto-release on <code>get_result()</code> vs requiring explicit <code>release()</code>. (Either is fine; pick one and test it.)
# Exit tool semantics: which exit tools should always terminate generation immediately (decision requests, policy risk) vs which just annotate and continue.
# Cancellation behavior: confirm that cancel is "best effort" and may return partial output + terminal reason <code>canceled</code>.

If you want, I can apply your choices for (2)–(4) and produce a final "interfaces + types" section (Protocols/TypedDicts/enums/dataclasses) that's ready to implement directly.
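As a non-authoritative starting point for that interfaces-and-types section, the status structures above might be sketched roughly as follows; exact field names beyond those listed in this spec, the enum members, and the timestamp representation are assumptions.

<syntaxhighlight lang="python">
# Rough, non-authoritative sketch of a few typed structures named above.
# Field sets follow the status sections of this spec; exact field names,
# enum members, and timestamp representation (floats here) are assumptions.
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, TypedDict


class RequestState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


class WorkerStatus(TypedDict):
    active: bool
    healthy: bool
    restarting: bool
    slots_total: int
    slots_used: int
    active_requests: dict[int, str]   # request_id -> job_name
    restart_count: int
    last_error: str | None
    last_healthy_at: float | None     # UNIX timestamp


@dataclass
class RequestStatus:
    request_id: int
    job_name: str
    state: RequestState
    submitted_at: float
    last_progress_at: float | None = None
    output_chars_so_far: int = 0
    tool_iterations_remaining: int = 0
    signals: list[dict[str, Any]] = field(default_factory=list)  # parsed exit-tool calls
</syntaxhighlight>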