Editing Openai/694057b6-101c-8007-9a65-c40578c7252d (section)

==== ### ====
* One class instance owns one llama.cpp server subprocess (one model per process; may span multiple GPUs if sharded).
* Instance config is immutable; reconfiguration is drop & recreate.
* Public lifecycle: <code>start()</code> / <code>stop()</code> (internal restart() allowed).
* Failure handling is nuke & repave: - detect unhealthy/stalled server → restart subprocess - no replay of in-flight requests; they terminate with a clear failure reason.

===== - Worker has slots (≥1). A slot = one in-flight generation. =====
* If all slots are occupied, submit() immediately returns NO_SLOT_AVAILABLE.
* No internal queueing unless explicitly added later as a bounded optional feature.

===== - submit(system_prompt, user_prompt, …) -> request_id/handle returns immediately. =====
* Caller polls: - get_status(request_id) → state + progress metadata - get_result(request_id) → final output or “not ready”
* Minimal state machine (exact naming flexible): RUNNING | TOOL_RUNNING | COMPLETED | FAILED | CANCELED
* cancel(request_id) is best-effort and frees the slot.

===== - Worker uses streaming internally even if the external API is polling. =====
* Streaming drives: - progress timestamps (last_stream_byte_at) - partial output capture (bounded) - loop detection - liveness/stall detection

===== - No per-request timeout overrides. Workers are configured based on expected workload. =====
* Timeout profile (per worker) includes: - connect timeout (short) - headers/accept timeout (moderate) - TTFT timeout disabled by default (or effectively huge) - stall detection is liveness/progress based to support long-prefill (tens of minutes) - optional absolute timeout (usually None or very large)
* Liveness sources (pluggable): - subprocess alive - /proc/<pid> CPU time delta (baseline) - optional GPU activity probes later
* Restart policy with backoff + crash-loop protection.

===== - Always support max new tokens (max_tokens) per worker default (override per request for token limit may exist, but it’s orthogonal to timeouts). =====
* Add a simple repeated-line detector (low false positives): - triggers when the same sufficiently-long line repeats N times consecutively - cancels the request (does not restart worker) - records repeated_line_loop reason + the repeated line snippet.

===== - Tools use OpenAI function-calling schema. =====
* Tool execution is via a plug-in ToolRunner interface.
* The worker must not bake in assumptions like “tools are always lightweight” into the protocol: - the worker handles registration, call parsing, execution, and resume uniformly - the plugged-in runner determines whether a tool is lightweight/heavy and how it is executed.
* Tool loop controls: - max tool iterations (budget) - per-tool timeout - tool output size caps

===== - Worker injects a BIOS system prompt layer that includes: - universally applicable platform guidance (e.g., “you are one agent in a cooperating hivemind of models”) - current date/time + timezone - tool budget remaining and tool usage rules - optionally capability hints (“tools available”, “decision signals available”) =====
* Ordering remains: 1. BIOS system prompt (highest priority) 2. caller system prompt (job-specific) 3. conversation messages

===== - Support one-way control signals emitted by the model for orchestration: - low confidence, needs external info, needs higher reasoner, tool limits, etc. - decision requests for management-level branching choices =====
* Implemented as a small set of control tools (OpenAI function format) that the worker intercepts and returns as structured signals[].
* stop_on_decision_request=True default: the worker ends generation immediately when such a signal is emitted.
* Orchestrator handles the decision and later starts a new request with the chosen option injected.

===== - Worker status includes: - active/healthy/restarting - slots total/used + active request IDs - restart count, last error, last healthy time =====
* Request status includes: - state + timestamps - last progress time (from stream/liveness) - partial output length + bounded tail snippet (for ops) - tool trace metadata and captured control signals
* Subprocess logs captured in a bounded ring buffer.

===== - No in-place model swapping =====
* No replay/resume after restart
* No global scheduler inside the worker