=== llama.cpp worker module ===

This module provides a robust, simple, supervisor-style wrapper around a single llama.cpp server subprocess (one model per process). It is intended for use inside a larger "hivemind" system that runs multiple model instances across multiple GPUs, routes work between them, and occasionally tears down and recreates instances based on workload.

The worker module's responsibilities are:
* Start/stop a llama.cpp server subprocess configured for a specific model + GPU allocation.
* Accept generation requests up to a configured number of concurrency slots.
* Execute requests asynchronously (submit returns immediately; the caller polls for status/result).
* Provide monitoring status (health, slot usage, in-flight request details).
* Handle tool calling (OpenAI function-calling schema) via a pluggable tool runner.
* Inject a low-level BIOS system prompt with platform-wide guidance and runtime metadata.
* Detect and recover from failures using a nuke & repave strategy (restart the server rather than attempt delicate replay).
* Detect obvious "runaway loops" via a simple repeated-line detector, in addition to max token limits.

This module is designed to prioritize simplicity, clarity, robustness, and testability over cleverness. It is also designed not to waste compute in a CPU/RAM-constrained environment: bounded polling, lightweight liveness checks, and minimal background work.

==== Goals ====
# One-process-per-model worker: each instance manages exactly one llama.cpp server process and its configuration.
# Slot-based concurrency: support slots > 1 with immediate refusal if full (no queue by default).
# Async request lifecycle:
#* submit() returns immediately with a request handle/id.
#* the caller uses polling APIs to read status and final output.
# Nuke & repave reliability:
#* detect a dead/stalled/unreachable server;
#* restart the subprocess;
#* fail in-flight requests with explicit error reasons.
# Long-prefill-friendly timeouts:
#* accommodate multi-minute / tens-of-minutes time-to-first-token;
#* stall detection is liveness/progress-based, not TTFT-based.
# Tool calling:
#* OpenAI function-calling format;
#* the tool runner is pluggable; the protocol should not assume "lightweight tools" even if that is typical for smaller models.
# BIOS system prompt:
#* inject platform/hivemind guidance + runtime data (date/time, tool budget remaining, etc.).
# Control signals channel:
#* allow the model to emit structured "signals" (e.g., low confidence, escalation, decision request) without requiring downstream models to parse long text.
# Simple loop early-kill:
#* a repeated-line detector that cancels a request when it clearly loops.

==== Non-goals ====
* In-place reconfiguration or model swapping within a worker instance.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (handled by the higher-level orchestrator).
* Complex token-level analytics or heavy output post-processing.

==== Core concepts ====
* Worker: owns the llama.cpp server subprocess; exposes request APIs; supervises health; injects the BIOS prompt; runs the tool loop; emits status.
* Request: encapsulates one generation job (messages, tool state, output accumulation, status).
* ToolRunner (plugin): executes tools invoked by the model; the worker handles tool-call parsing and continuation.
* Control Signals: structured records emitted by the model via "control tools" (one-way).
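
To make these concepts concrete, here is a minimal Python sketch of how they might be typed. All names, fields, and defaults below are illustrative assumptions rather than a committed interface.

<syntaxhighlight lang="python">
# Illustrative sketch only: names, fields, and signatures are assumptions,
# not a committed interface.
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Protocol


class RequestState(Enum):
    """Minimum viable request states from the request-lifecycle section."""
    RUNNING = auto()        # includes prefill + generation
    TOOL_RUNNING = auto()   # worker is executing a tool; generation paused
    COMPLETED = auto()
    FAILED = auto()
    CANCELED = auto()


@dataclass
class ControlSignal:
    """Structured record emitted by the model via a control tool (one-way)."""
    kind: str               # e.g. "issue", "escalation", "decision_request", "outcome"
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Request:
    """One generation job: messages, tool state, output accumulation, status."""
    request_id: str
    messages: list[dict[str, Any]]
    state: RequestState = RequestState.RUNNING
    output_text: str = ""
    signals: list[ControlSignal] = field(default_factory=list)
    fail_reason: str | None = None


class ToolRunner(Protocol):
    """Plugin boundary: execute a tool by name with parsed arguments."""
    def run(self, name: str, arguments: dict[str, Any]) -> str: ...
</syntaxhighlight>
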
===== Runtime =====
The worker runs a small internal event loop / async runtime (or a thread-based equivalent) to:
* start the subprocess,
* dispatch HTTP requests,
* stream responses,
* perform liveness checks at a configurable interval,
* update request state and worker status.

==== Public API ====
Exact naming is flexible; the intent is to keep the API small and explicit.

===== Lifecycle =====
* start() -> None
* stop() -> None
* (internal) restart(reason: str) -> None

===== Requests =====
* submit(system_prompt: str, user_prompt: str, *, params: GenerationParams | None = None) -> SubmitResult
** returns immediately: on success, a request_id; on failure, NO_SLOT_AVAILABLE (no queue by default) or WORKER_NOT_READY
* get_status(request_id) -> RequestStatus
* get_result(request_id) -> RequestResult | NOT_READY
* cancel(request_id) -> bool (best-effort)
* release(request_id) -> None (optional; can also auto-release after a successful get_result())

===== Monitoring =====
* get_worker_status() -> WorkerStatus
* get_debug_info() -> WorkerDebugInfo (recent subprocess logs, last errors, restart count, etc.)

==== Configuration ====
* name: identifier
* model_path
* host, port
* gpu_env: typically CUDA_VISIBLE_DEVICES="0" (or similar)
* llama.cpp server args: ctx, gpu layers, batch, tensor split, etc. (opaque list/dict)
* slots: max concurrent requests
* timeout_profile: per-worker only (no per-request overrides)
* tool_policy: max tool iterations, per-tool timeout, etc.
* bios_provider: callable that renders the BIOS prompt text
* control_tools_enabled: bool
* stop_on_decision_request: bool (default true)
* loop_detector: enable + thresholds

===== Timeout profile =====
Designed for long-TTFT workloads.
* connect_timeout_s (short)
* headers_timeout_s (moderate)
* ttft_timeout_s: typically disabled / None
* prefill_liveness_timeout_s: large or None (prefill-safe)
* idle_stream_timeout_s: max time without bytes once streaming starts
* absolute_timeout_s: optional / None
* liveness_probe_interval_s
* restart backoff + crash-loop limits:
** restart_backoff_s
** max_restarts_per_window
** restart_window_s

==== Request lifecycle ====
States (minimum viable):
* RUNNING (includes prefill + generation)
* TOOL_RUNNING (the worker is executing a tool; generation is paused)
* Terminal:
** COMPLETED
** FAILED (with reason)
** CANCELED

Key timestamps:
* created_at
* dispatched_at
* last_stream_byte_at
* last_liveness_at
* last_progress_at = max(last_stream_byte_at, last_liveness_at)
* completed_at

==== BIOS system prompt ====
Worker-owned and regenerated:
* at request start
* before each tool-loop continuation

The BIOS includes:
* a platform/hivemind statement (universal)
* current date/time + timezone
* remaining tool iteration budget and constraints
* optionally: how to emit control signals / decision requests

===== Message ordering =====
# BIOS system prompt (highest priority)
# Caller system prompt (job-specific)
# User/assistant/tool messages

The worker should support either:
* multiple system messages, or
* a single combined system message with clear delimiters (fallback).

==== Tool calling ====
Tools use the OpenAI function-calling schema (tools=[{type: "function", function: {...}}]).

===== Responsibilities =====
* Worker: registers tools into the request payload; detects tool calls; manages continuation; appends tool result messages.
* ToolRunner (plugin): executes a tool by name with arguments and returns the result.

===== Limits =====
* max_tool_iterations budget (exposed in the BIOS)
* per-tool timeout
* tool output size handling (output may be truncated/summarized by the ToolRunner if needed)

Even if most tools for small models are lightweight, the worker must not embed that assumption into the protocol. The same mechanism must work for heavier tools when plugged into a different ToolRunner.
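
To illustrate the flow, here is a hedged Python sketch of a worker-side tool loop against an OpenAI-compatible chat completions endpoint (recent llama-server builds expose one, though the exact response shape can vary by version). The example tool definition, endpoint path, timeout, and budget-exhaustion handling are assumptions.

<syntaxhighlight lang="python">
# Hedged sketch of the worker-side tool loop. The example tool, endpoint path,
# timeout, and budget-exhaustion policy are illustrative assumptions.
import json
import requests  # any HTTP client works; streaming is omitted for brevity

EXAMPLE_TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_time",  # placeholder tool; real definitions come from the ToolRunner
        "description": "Return the current time for a timezone.",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}]


def run_tool_loop(base_url, messages, tool_runner, max_tool_iterations=4):
    """Call the chat endpoint, executing tool calls until the model stops asking."""
    for _ in range(max_tool_iterations):
        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"messages": messages, "tools": EXAMPLE_TOOLS, "stream": False},
            timeout=600,  # long-prefill workloads need generous (or absent) read timeouts
        )
        resp.raise_for_status()
        msg = resp.json()["choices"][0]["message"]
        tool_calls = msg.get("tool_calls")
        if not tool_calls:
            return msg  # normal completion: no further tool use requested
        messages.append(msg)  # keep the assistant turn that requested the tools
        for call in tool_calls:
            args = json.loads(call["function"]["arguments"] or "{}")
            result = tool_runner.run(call["function"]["name"], args)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": result,
            })
    raise RuntimeError("max_tool_iterations exhausted")  # a real worker would fail the request instead
</syntaxhighlight>
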
==== Control signals ====
Purpose: let the model emit structured "flags" and "requests" that the orchestrator can act on without parsing large outputs.

===== Mechanism =====
Expose a minimal set of control tools in OpenAI function format. The worker intercepts them and records them into signals[], which is returned with the result/status.

Example signal types:
* issue (low confidence, tool limit reached, etc.)
* escalation (needs a higher-reasoning model / external information)
* decision_request (requires a management-branch decision)
* outcome (summarizable terminal condition)

Behavior:
* bounded count/size
* stop_on_decision_request=True by default: end generation immediately when a decision is requested
* the orchestrator later creates a new request with the decision injected (no resume).

==== Failure handling ====
Nuke & repave:
* if the server is unhealthy, stalled, or disconnected: restart the subprocess
* mark all in-flight requests as FAILED(reason="worker_restarted") (no replay)

===== Failure classes =====
# Connect/header failures: cannot connect, or cannot receive headers in time → likely a dead server → restart.
# Stalls: use last_progress_at, driven by:
#* stream bytes when streaming exists
#* liveness probes during prefill
#* if there is no progress for the configured window → restart (policy-controlled)
# Process death: the subprocess exits → restart; fail in-flight requests.

===== Restart policy =====
* restart backoff
* cap restarts per time window; mark the worker unhealthy if the cap is exceeded.

==== Repeated-line loop detection ====
A simple early-kill mechanism in addition to max_tokens:
* Parse streamed output into completed lines.
* Compare each completed line with the previous one after normalization (trim whitespace).
* Ignore empty/very short lines.
* If the same line repeats N times consecutively, cancel the request:
** FAILED(reason="repeated_line_loop")
** include the repeated line snippet + repeat count in debug/status.

This should be conservative (low false positives) and cheap (minimal CPU).
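
A minimal sketch of such a detector follows, assuming illustrative default thresholds (minimum line length, consecutive-repeat count) that would need tuning against real outputs.

<syntaxhighlight lang="python">
# Minimal sketch of the repeated-line early-kill detector; the thresholds
# (min_line_len, max_repeats) are illustrative defaults, not tuned values.
class RepeatedLineDetector:
    """Flags a request when the same completed line streams N times in a row."""

    def __init__(self, max_repeats: int = 8, min_line_len: int = 4) -> None:
        self.max_repeats = max_repeats
        self.min_line_len = min_line_len
        self._last_line: str | None = None
        self._repeat_count = 0
        self._buffer = ""

    def feed(self, chunk: str) -> bool:
        """Feed streamed text; return True if the request should be canceled."""
        self._buffer += chunk
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            line = line.strip()  # normalization: trim whitespace
            if len(line) < self.min_line_len:
                continue  # ignore empty/very short lines (keeps it conservative)
            if line == self._last_line:
                self._repeat_count += 1
                if self._repeat_count >= self.max_repeats:
                    return True  # caller fails the request with reason="repeated_line_loop"
            else:
                self._last_line = line
                self._repeat_count = 1
        return False
</syntaxhighlight>

Because only completed lines are compared, the per-chunk cost stays negligible even for long generations, in line with the performance constraints below.
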
==== Output handling ====
* The worker accumulates the full output text for each request in memory until a terminal state.
* The caller retrieves it via get_result().
* Cleanup policy:
** auto-release after get_result(), or explicit release(request_id)
** optional TTL-based GC for abandoned requests (for robustness, not RAM pressure)

==== Status reporting ====

===== WorkerStatus =====
* active, healthy, restarting
* slot counts: slots_total, slots_used
* active_request_ids
* restart_count, last_error, last_healthy_at

===== RequestStatus =====
* state, timestamps
* last_progress_at
* output length so far
* tool iterations remaining
* recorded signals[]

===== Debug info =====
* ring buffer of subprocess logs (most recent N lines)
* last health-check results / recent restart reasons

==== Performance and resource use ====
* No heavy periodic work; liveness probing is interval-based and configurable.
* Default to lightweight Linux /proc CPU-time checks for prefill liveness.
* Avoid expensive per-token processing; repeated-line detection only processes completed lines.
* Keep dependencies minimal and stable.

==== Testing ====

===== Unit tests =====
* Request state transitions and slot accounting (including cancellation)
* BIOS prompt assembly and ordering
* Tool-call parsing → ToolRunner invocation → continuation assembly
* Control-tool interception and "stop on decision request"
* Repeated-line detector behavior (including conservative non-triggers)
* Timeout logic with synthetic progress/liveness timestamps

===== Integration tests =====
Use a stub HTTP server that can simulate:
* slow headers
* long silence, then streaming
* streaming stalls
* disconnects mid-stream
* malformed tool calls
* server crash / restart conditions and crash-loop limits

===== Lifecycle tests =====
* repeated start/stop cycles
* concurrent submits up to the slot limit
* ensure no slot leaks on failures/cancellations

==== Summary ====
This module is a single-worker supervisor for llama.cpp servers designed for a multi-instance, multi-GPU orchestrated environment. It exposes a simple async request API with slot-based concurrency, robust failure recovery via nuke & repave, long-prefill-friendly health detection, OpenAI-format tool calling, BIOS prompt injection for platform context, and a structured control-signals channel for escalation and decision-making.

==== Next steps ====
If you want, the next step is to translate this doc into:
* concrete class/method signatures + dataclasses/enums, and
* a test-first skeleton (request store, slot semaphore, stub server harness) before hooking up the llama-server subprocess + HTTP streaming.
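
As a rough starting point for that test-first skeleton, the sketch below shows one way slot accounting with immediate refusal could look, plus a leak-check unit test. The class name, refusal convention, and test shape are all assumptions.

<syntaxhighlight lang="python">
# Rough sketch of slot accounting with immediate refusal (no queue).
# Names and the exact refusal mapping are assumptions for illustration.
import threading
import uuid


class SlotPool:
    """Tracks in-flight request slots; refuses immediately when full."""

    def __init__(self, slots_total: int) -> None:
        self.slots_total = slots_total
        self._lock = threading.Lock()
        self._active: set[str] = set()

    def try_acquire(self) -> str | None:
        """Return a new request_id, or None if no slot is available."""
        with self._lock:
            if len(self._active) >= self.slots_total:
                return None  # caller maps this to NO_SLOT_AVAILABLE
            request_id = uuid.uuid4().hex
            self._active.add(request_id)
            return request_id

    def release(self, request_id: str) -> None:
        """Idempotent release so failure/cancellation paths cannot leak slots."""
        with self._lock:
            self._active.discard(request_id)


def test_no_slot_leak_on_failure() -> None:
    # Example unit test: fill all slots, fail one request, verify the slot is reusable.
    pool = SlotPool(slots_total=2)
    a, b = pool.try_acquire(), pool.try_acquire()
    assert a and b and pool.try_acquire() is None
    pool.release(a)  # simulate FAILED/CANCELED cleanup
    assert pool.try_acquire() is not None
</syntaxhighlight>
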