Editing Openai/694057b6-101c-8007-9a65-c40578c7252d (section)

=== ## ===

This module is a Python 3 component (using PEP 484 typing plus <code>typing_extensions</code> constructs such as Protocol, TypedDict, Self, etc.) that provides a robust, simple supervisor-style wrapper around a single llama.cpp server subprocess (one model per process).

It is intended for a larger “hivemind” system that runs multiple model instances across multiple GPUs, routes work between them, and occasionally tears down and recreates instances based on workload.

The worker module’s responsibilities are:
* Start/stop a llama.cpp server subprocess configured for a specific model + GPU allocation.
* Accept generation requests up to a configured number of concurrency slots.
* Execute requests asynchronously (submit returns immediately; caller polls for status/result).
* Provide monitoring status (health, slot usage, in-flight request details).
* Handle tool calling (OpenAI function-calling schema) via a pluggable tool runner.
* Inject a low-level BIOS system prompt with platform-wide guidance and runtime metadata.
* Detect and recover from failures using a nuke & repave strategy (restart the server rather than attempt delicate replay).
* Detect obvious “runaway loops” via a simple repeated-line detector, in addition to max token limits.

This module prioritizes simplicity, clarity, robustness, and testability over cleverness. It also aims not to waste compute in CPU/RAM constrained environments: bounded polling, lightweight liveness checks, and minimal background work.

==== 1. One-process-per-model worker: each instance manages exactly one llama.cpp server process and its configuration. ====
# Slot-based concurrency: support slots > 1 with immediate refusal if full (no queue by default).
# Async request lifecycle: - submit() returns immediately with a request handle/id. - caller uses polling APIs to read status and final output.
# Nuke & repave reliability: - detect dead/stalled/unreachable server; - restart subprocess; - fail in-flight requests with explicit error reasons.
# Long-prefill-friendly timeouts: - accommodate multi-minute / tens-of-minutes time-to-first-token; - stall detection is liveness/progress-based, not TTFT-based.
# Tool calling: - OpenAI function-calling format; - tool runner is pluggable; protocol should not assume “lightweight tools” even if typical for smaller models.
# BIOS system prompt: - inject platform/hivemind guidance + runtime data (date/time, tool budget remaining, etc.).
# Control signals channel: - allow model to emit structured “signals” (e.g., low confidence, escalation, decision request) without requiring downstream models to parse long text.
# Simple loop early-kill: - repeated-line detector that cancels a request when it clearly loops.

==== - In-place reconfiguration or model swapping within a worker instance. ====
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (handled by the higher-level orchestrator).
* Heavy output post-processing or expensive token analytics.

==== ### ====
* Worker: owns the server subprocess; exposes request APIs; supervises health; injects BIOS prompt; runs tool loop; emits status.
* Request: encapsulates one generation job (messages, tool state, output accumulation, status).
* ToolRunner (plugin): executes tools invoked by the model; worker handles tool-call parsing and continuation.
* Control Signals: structured records emitted by the model via “control tools” (one-way).

==== ### ====
* start() -> None
* stop() -> None
* (internal) restart(reason: str) -> None

===== - submit(system_prompt: str, user_prompt: str, *, params: GenerationParams | None = None) -> SubmitResult - immediate return: - success: request_id - failure: NO_SLOT_AVAILABLE or WORKER_NOT_READY =====
* get_status(request_id) -> RequestStatus
* get_result(request_id) -> RequestResult | NOT_READY
* cancel(request_id) -> bool
* release(request_id) -> None (or auto-release after get_result())

===== - get_worker_status() -> WorkerStatus =====
* get_debug_info() -> WorkerDebugInfo

==== ### ====
* name
* model_path
* host, port
* gpu_env (e.g., CUDA_VISIBLE_DEVICES)
* server command template: executable path + args (opaque list/dict)
* slots
* timeout_profile (per worker only)
* tool_policy (max iterations, per-tool timeout, etc.)
* bios_provider
* control_tools_enabled
* stop_on_decision_request (default true)
* loop_detector enable + thresholds

===== - connect_timeout_s =====
* headers_timeout_s
* ttft_timeout_s (typically None/disabled)
* prefill_liveness_timeout_s (large/None)
* idle_stream_timeout_s
* absolute_timeout_s (optional/None)
* liveness_probe_interval_s
* restart backoff + crash-loop limits

==== ### ====

Worker-owned, regenerated:
* at request start
* before each tool-loop continuation

BIOS includes:
* universal platform/hivemind guidance
* current date/time + timezone
* tool iteration budget remaining and constraints
* instructions for emitting control signals/decision requests

===== 1. BIOS system prompt =====
# caller system prompt
# user/assistant/tool messages

==== - Tools use OpenAI function-calling schema. ====
* Worker manages: exposing tools, detecting calls, invoking ToolRunner, appending tool results, continuing generation.
* Budgeted by max iterations; timeout per tool; ToolRunner may bound output sizes.

==== - One-way “control tools” intercepted by worker and returned as structured signals[]. ====
* Supports issue/escalation/outcome/decision_request.
* Default stop_on_decision_request=True; orchestrator makes a later new request with chosen decision.

==== - Nuke & repave: restart server on unhealthy/stalled/disconnected. ====
* In-flight requests are failed with a clear reason on restart.
* Crash-loop protection via backoff and restart rate limits.

==== - Repeated-line detector on streamed output: - detect same sufficiently-long line repeated N consecutive times; - cancel request; mark FAILED(reason="repeated_line_loop"); - attach repeated line snippet + count to debug/status. ====
* Also always support max_tokens as hard stop.

==== - Accumulate full output in memory until caller retrieves it. ====
* Auto-release after result retrieval or explicit release().
* Optional TTL GC for abandoned requests (robustness).

==== - WorkerStatus: active/healthy/restarting, slot usage, active request IDs, restart count, last error, last healthy time. ====
* RequestStatus: state, timestamps, last progress, output length, tool iteration remaining, captured signals.
* Debug: ring buffer of subprocess logs.

==== - Python 3. ====
* Heavily typed using built-in typing + <code>typing_extensions</code> (Protocol, TypedDict, etc.).
* Design emphasizes: - small explicit state machines - straightforward control flow - minimal dependencies - test suite depth over clever optimizations
* Resource-aware: - liveness probing interval is configurable and not aggressive - avoid heavy per-token work; repeated-line detector operates on completed lines