=== Assistant ===

Here’s the consolidated list of design decisions / requirements we’ve locked in so far for the llama.cpp worker module.

==== Each class instance owns one llama.cpp server subprocess (one model per process; may use 1+ GPUs if sharded). ====
* Instance is immutable once created; reconfiguration is drop & recreate.
* Public lifecycle is <code>start()</code> / <code>stop()</code> (optionally <code>restart()</code> internally).
* Preferred recovery strategy is nuke & repave: restart the server rather than attempting delicate replay.

==== Concurrency is slot-based: the worker has a configured integer <code>slots</code> (≥ 1). ====
* A slot represents one in-flight generation request handled by this worker.
* If all slots are full, the worker does not accept new work (it returns an immediate “no slot available” result). No internal queue by default.

==== Caller submits work via a method like <code>submit(system_prompt, user_prompt, …)</code> that returns immediately with a request id/handle. ====
* Caller retrieves progress/output via polling initially (callbacks optional later).
* Worker tracks per-request state and exposes:
** running/queued (if a queue is ever added), completed, failed, canceled, timed out
** in-flight count / slot utilization

==== Worker uses streaming internally even if the external API is polling-only. ====
* Internal streaming is used for:
** monitoring / progress timestamps
** partial output capture (debug/ops)
** stall detection

==== Timeouts must handle very long prefill (multi-minute to tens of minutes before the first token). ====
* Timeout configuration is per-worker (a “profile” / policy object) with optional per-request overrides.
* Timeout types:
** connect timeout (short)
** headers/accept timeout (moderate)
** TTFT timeout disabled by default (or set very high)
** primary stall detection is progress/liveness based, not “no tokens yet”
** optional absolute timeout (default None or very large)
* Liveness/progress signals (pluggable):
** stream bytes/tokens (when streaming)
** subprocess liveness + /proc CPU time delta (baseline)
** optional GPU activity probes (nice-to-have)
* On stall/health failure: reconnect attempt(s), then restart, per policy.
* Crash-loop protection via restart backoff / restart rate limits.

==== On worker restart, in-flight requests are failed with a clear reason (no replay). ====
* Looping output is treated as a request-level failure (not necessarily a worker restart).

==== Always support <code>max_tokens</code> / max-new-tokens as a hard stop (per-worker default, per-request override). ====
* Add an optional repeated-line detector as an early-kill mechanism (a minimal sketch follows after the tools section below):
** detects the same sufficiently long line repeated N times consecutively
** cancels the request, frees the slot, returns terminal reason <code>repeated_line_loop</code>
** designed to be simple, with a low false-positive rate

==== Tools follow the OpenAI tool/function-calling format. ====
* Tool runner is a separate component passed in, but the worker handles:
** registering/exposing tools to the model
** receiving tool calls
** executing via the tool runner
** resuming the model with tool results
* Tools are intended to be lightweight (calculator/lookup). Large calls are escalated to higher-level models.
* Tool loop constraints (a loop skeleton follows below):
** max tool iterations
** per-tool timeout
** tool output size caps
* Tool iteration budget should be visible to the model via the BIOS prompt.
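To make the repeated-line detector concrete, here is a minimal sketch, assuming the worker feeds it completed lines from the internal stream; the class name and the <code>min_line_len</code> / <code>max_repeats</code> parameters are illustrative, not a settled API.

<syntaxhighlight lang="python">
class RepeatedLineDetector:
    """Early-kill heuristic: trip when the same sufficiently long line
    appears N times in a row in the streamed output (illustrative sketch)."""

    def __init__(self, min_line_len: int = 20, max_repeats: int = 5):
        self.min_line_len = min_line_len  # ignore short lines (low signal)
        self.max_repeats = max_repeats    # N consecutive repeats to trip
        self._last_line = None
        self._count = 0

    def feed_line(self, line: str) -> bool:
        """Feed one completed output line; return True when the request
        should be killed early."""
        line = line.strip()
        if len(line) < self.min_line_len:
            # Short lines reset the streak, keeping false positives low.
            self._last_line, self._count = None, 0
            return False
        if line == self._last_line:
            self._count += 1
        else:
            self._last_line, self._count = line, 1
        return self._count >= self.max_repeats
</syntaxhighlight>

On a <code>True</code> return, the worker would cancel the request, free the slot, and report terminal reason <code>repeated_line_loop</code>.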
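The tool loop with its three constraints could be shaped roughly as below; this is a sketch assuming the worker holds an injected tool runner, and every method name (<code>_generate</code>, <code>tool_runner.execute</code>, <code>append_tool_result</code>, <code>_fail</code>) and default value is a placeholder, not the real interface.

<syntaxhighlight lang="python">
def run_tool_loop(self, request, max_tool_iters: int = 8,
                  per_tool_timeout_s: float = 10.0,
                  tool_output_cap: int = 4096):
    """Hypothetical tool loop: generate, execute tool calls via the
    injected tool runner, resume the model with results, bounded by an
    iteration budget, a per-tool timeout, and an output size cap."""
    for _ in range(max_tool_iters):
        reply = self._generate(request)      # one model turn
        if not reply.tool_calls:
            return reply                     # final answer, no tools needed
        for call in reply.tool_calls:
            try:
                result = self.tool_runner.execute(
                    call, timeout=per_tool_timeout_s)
            except TimeoutError:
                result = "error: tool timed out"
            # Cap tool output before resuming the model with it.
            request.append_tool_result(call.id, str(result)[:tool_output_cap])
    # Budget exhausted: fail the request rather than looping forever.
    return self._fail(request, reason="tool_iteration_budget_exhausted")
</syntaxhighlight>

The iteration budget (and how much of it remains) is what the BIOS prompt, described next, would surface to the model.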
==== Worker injects a low-level BIOS system prompt (secondary system prompt) per request/turn. ====
* BIOS contains runtime/control-plane facts such as:
** tool iteration budget remaining
** current date/time + timezone
** (optional) capability hints / platform rules
* Prompt ordering:
*# BIOS system prompt (highest priority)
*# caller’s system prompt
*# conversation messages (user/assistant/tool)

==== Worker maintains a status readout. ====
* Worker-level status includes at least:
** active/healthy/restarting
** slots total / slots used + active request ids
** restart count, last error, last healthy time
* Per-request status includes:
** state, timestamps (created, started, last_progress)
** partial output length (and optionally a small tail snippet for debugging)
** tool call count / trace metadata
* Subprocess stdout/stderr captured (ring buffer) for diagnostics.

==== In addition to normal tools, the worker supports one-way control signals emitted by the model to help orchestration without needing another model to read all output. ====
* Signals cover:
** uncertainty / low confidence
** needs external info / browsing
** needs a higher reasoning model
** tooling limitations
** policy/safety escalation lane (optional)
** decision requests (management-level choice among options)
* Implemented as “control tools” in OpenAI function-call format (intercepted by the worker), or an equivalent structured footer if needed.
* Signals are:
** bounded (count + size)
** returned alongside output
** treated as hints; the orchestrator decides actions
* For <code>decision_request</code> specifically:
** the worker can be configured to stop immediately once signaled (<code>stop_on_decision_request=True</code>)
** the orchestrator will later create a new request with the chosen decision injected (no resume/replay required)

==== No in-place model swapping; recreate the worker instead. ====
* No replay/resume of in-flight requests after restart.
* No global scheduler inside the worker (handled by a higher-level manager).

If you want, the next step is to turn this into a crisp “module spec” with: class responsibilities, method signatures, request/worker state enums, timeout policy schema, and the exact control-tool JSON schemas. Rough sketches of the state enums, the timeout policy, and one control-tool schema follow below.
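As a starting point for that spec, one possible shape for the state enums and the timeout policy schema is sketched here; every field name and default value is an assumption to be reviewed, not a locked-in decision.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerState(Enum):
    # Hypothetical worker-level states from the status readout above.
    ACTIVE = "active"
    RESTARTING = "restarting"
    STOPPED = "stopped"  # assumption: terminal state after stop()


class RequestState(Enum):
    # Hypothetical request lifecycle states from the list above.
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
    TIMED_OUT = "timed_out"


@dataclass(frozen=True)
class TimeoutPolicy:
    """Per-worker timeout/liveness profile; per-request overrides would
    copy this with replaced fields. All defaults are illustrative."""
    connect_timeout_s: float = 5.0          # short: TCP/HTTP connect
    headers_timeout_s: float = 30.0         # moderate: request accepted
    ttft_timeout_s: Optional[float] = None  # disabled by default (long prefill)
    stall_timeout_s: float = 120.0          # no liveness signal for this long
    absolute_timeout_s: Optional[float] = None  # off unless explicitly set
</syntaxhighlight>

The dataclass is frozen to match the “immutable once created” rule: changing a policy means constructing a new one, just as reconfiguring a worker means recreating it.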
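And a control-tool schema for <code>decision_request</code> in the OpenAI function-calling format could look like this; the parameter names and descriptions are placeholders, and the worker would intercept calls to this tool itself rather than forwarding them to the tool runner.

<syntaxhighlight lang="python">
# Hypothetical control-tool definition in OpenAI function-calling format.
DECISION_REQUEST_TOOL = {
    "type": "function",
    "function": {
        "name": "decision_request",
        "description": (
            "Signal that a management-level choice among options is "
            "needed. One-way: emit it and, if configured, the turn stops."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string",
                             "description": "What needs deciding."},
                "options": {"type": "array",
                            "items": {"type": "string"},
                            "description": "Candidate choices."},
                "recommendation": {"type": "string",
                                   "description": "Model's preferred option."},
            },
            "required": ["question", "options"],
        },
    },
}
</syntaxhighlight>

With <code>stop_on_decision_request=True</code>, the worker would end the turn as soon as this call is intercepted and return the arguments alongside the partial output for the orchestrator to act on.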