
==== This section makes the “boring but critical” rules explicit so the implementation and tests have a tight correctness target. The module should enforce these invariants rather than relying on convention. ====

===== Worker states =====
* STOPPED: subprocess not running; cannot accept requests
* RUNNING: subprocess started but not yet confirmed ready (or starting/restarting)
* READY: readiness probe succeeds (GET /v1/models → 200 + JSON)
* FAILED: subprocess dead or crash-loop lockout; cannot accept requests

====== Allowed transitions ======
* STOPPED → RUNNING (start)
* RUNNING → READY (readiness probe succeeds)
* RUNNING → FAILED (subprocess exits before ready, or repeated probe failure beyond policy)
* READY → RUNNING (restart initiated)
* READY → FAILED (subprocess exits, or crash-loop lockout triggers)
* FAILED → RUNNING (explicit restart/start allowed by policy)
* RUNNING/READY → STOPPED (stop)

====== Invariants ======
# No request dispatch unless state is READY.
# READY implies that:
#* the subprocess is alive, and
#* the last readiness probe succeeded within a configurable recency window (at minimum, one probe has succeeded).
# FAILED implies that submit() returns WORKER_FAILED immediately, without consuming a slot.
# STOPPED implies that submit() returns WORKER_NOT_READY immediately.
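The states, transitions, and submit gate above can be enforced with a small transition table rather than by convention. The following is a minimal sketch, not the module's actual implementation: the class name <code>WorkerStateMachine</code> and the method <code>check_submit()</code> are hypothetical, and mapping RUNNING to WORKER_NOT_READY is an assumption, since the text only names the codes for STOPPED and FAILED.

```python
from enum import Enum


class WorkerState(Enum):
    STOPPED = "STOPPED"
    RUNNING = "RUNNING"
    READY = "READY"
    FAILED = "FAILED"


# Allowed transitions, copied from the list above.
ALLOWED = {
    WorkerState.STOPPED: {WorkerState.RUNNING},
    WorkerState.RUNNING: {WorkerState.READY, WorkerState.FAILED, WorkerState.STOPPED},
    WorkerState.READY: {WorkerState.RUNNING, WorkerState.FAILED, WorkerState.STOPPED},
    WorkerState.FAILED: {WorkerState.RUNNING},
}

# Rejection code returned by submit() for each non-READY state.
# RUNNING -> WORKER_NOT_READY is an assumption made for this sketch.
REJECTIONS = {
    WorkerState.STOPPED: "WORKER_NOT_READY",
    WorkerState.RUNNING: "WORKER_NOT_READY",
    WorkerState.FAILED: "WORKER_FAILED",
}


class WorkerStateMachine:
    def __init__(self):
        self.state = WorkerState.STOPPED

    def transition(self, new_state):
        # Reject anything that is not in the transition table.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def check_submit(self):
        """Return None if dispatch is allowed, otherwise a rejection code."""
        return REJECTIONS.get(self.state)
```

Because the check runs before any slot is touched, a rejected submit cannot consume a slot.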

===== Request states =====
* RUNNING: request task is executing (includes prefill and token streaming)
* TOOL_RUNNING: normal tool execution in progress (awaiting ToolRunner)
* COMPLETED: finished successfully
* FAILED: finished with error
* CANCELED: canceled by caller

====== Terminal states ======
* COMPLETED, FAILED, CANCELED

====== Allowed transitions ======
* (created) → RUNNING
* RUNNING → TOOL_RUNNING (model emitted normal tool call)
* TOOL_RUNNING → RUNNING (tool result appended; generation continues)
* RUNNING → COMPLETED
* RUNNING → FAILED
* TOOL_RUNNING → FAILED (tool error / parse error)
* RUNNING/TOOL_RUNNING → CANCELED (cancel)
* RUNNING/TOOL_RUNNING → FAILED (worker restart / server death / timeout)

====== Invariants ======
# One request occupies exactly one slot for its entire lifetime, from acceptance to terminal state.
# A request ID is unique within the lifetime of a worker instance.
# Output accumulation is monotonic (it only grows) until the request reaches a terminal state.
# Exit tools never cause a state transition by themselves: they may be recorded at any time, but they do not change request state.
# The tool iteration budget is monotonically decreasing and never negative.
# After <code>get_result()</code> succeeds, the request record is released, and subsequent <code>get_status()</code>/<code>get_result()</code> calls return NOT_FOUND.
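The request lifecycle rules above translate fairly directly into a record type plus a small store. This is a sketch under stated assumptions: <code>RequestRecord</code>, <code>RequestStore</code>, <code>begin_tool_iteration()</code>, and the NOT_FINISHED code are hypothetical names, and failing the request when the tool budget is exhausted is an assumption the text does not spell out.

```python
import itertools
from enum import Enum


class RequestState(Enum):
    RUNNING = "RUNNING"
    TOOL_RUNNING = "TOOL_RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


TERMINAL = {RequestState.COMPLETED, RequestState.FAILED, RequestState.CANCELED}

# Allowed transitions from the list above; terminal states have no exits.
ALLOWED = {
    RequestState.RUNNING: {RequestState.TOOL_RUNNING} | TERMINAL,
    RequestState.TOOL_RUNNING: {
        RequestState.RUNNING,
        RequestState.FAILED,
        RequestState.CANCELED,
    },
}


class RequestRecord:
    def __init__(self, request_id, tool_budget):
        self.request_id = request_id
        self.state = RequestState.RUNNING  # (created) -> RUNNING
        self.output = ""
        self.signals = []
        self.tool_iters_remaining = tool_budget

    def transition(self, new_state):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def append_output(self, text):
        # Output only grows, and only while the request is non-terminal.
        if self.state in TERMINAL:
            raise ValueError("output is frozen once the request is terminal")
        self.output += text

    def record_exit_signal(self, signal):
        # Exit tools are recorded but never change the request state.
        self.signals.append(signal)

    def begin_tool_iteration(self):
        # Assumption: an exhausted budget fails the request.
        if self.tool_iters_remaining <= 0:
            self.transition(RequestState.FAILED)
            return False
        self.transition(RequestState.TOOL_RUNNING)
        self.tool_iters_remaining -= 1  # never goes below zero
        return True


class RequestStore:
    def __init__(self):
        self._ids = itertools.count(1)  # IDs are never reused in this instance
        self._records = {}

    def create(self, tool_budget):
        record = RequestRecord(next(self._ids), tool_budget)
        self._records[record.request_id] = record
        return record

    def get_status(self, request_id):
        record = self._records.get(request_id)
        return "NOT_FOUND" if record is None else record.state.name

    def get_result(self, request_id):
        record = self._records.get(request_id)
        if record is None:
            return "NOT_FOUND", None
        if record.state not in TERMINAL:
            return "NOT_FINISHED", None  # hypothetical code for this sketch
        del self._records[request_id]  # release the record on success
        return record.state.name, record.output
```

Drawing IDs from a counter rather than checking the live dictionary keeps them unique even after records are released.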

===== Slots are the core stability mechanism and must be leak-proof. =====
# No slot is consumed on a failed submit: if <code>submit()</code> returns NO_SLOT_AVAILABLE, nothing was allocated.
# If <code>submit()</code> succeeds, a slot is consumed immediately and will be released exactly once.
# Slot release must occur in a <code>finally</code> block in the request task, so that none of the following leak slots:
#* tool errors
#* transport exceptions
#* cancellations
#* restart-triggered failures
# slots_used equals the number of non-terminal request records currently held (equivalently, the number of acquired semaphore permits).
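One way to make the slot rules above hard to violate is to acquire the slot synchronously in <code>submit()</code> and release it in a <code>finally</code> inside the request task. The sketch below assumes an asyncio-based worker; <code>SlotPool</code>, <code>run_request()</code>, and the <code>(status, task)</code> return shape are hypothetical, and a plain counter stands in for the semaphore.

```python
import asyncio


class SlotPool:
    """Counter-based stand-in for a semaphore with a non-blocking acquire."""

    def __init__(self, slots_total):
        self.slots_total = slots_total
        self.slots_used = 0

    def try_acquire(self):
        if self.slots_used >= self.slots_total:
            return False  # nothing is allocated on failure
        self.slots_used += 1
        return True

    def release(self):
        assert self.slots_used > 0, "slot released more than once"
        self.slots_used -= 1


async def run_request(pool, work):
    """Run work() inside a slot that submit() has already acquired."""
    try:
        return await work()
    finally:
        # Runs on success, on exceptions, and on cancellation of a started task.
        pool.release()


def submit(pool, work):
    """Must be called from within a running event loop."""
    if not pool.try_acquire():
        return "NO_SLOT_AVAILABLE", None
    task = asyncio.create_task(run_request(pool, work))
    return "OK", task
```

One caveat worth covering in tests: an asyncio task that is cancelled before its first step never enters the coroutine body, so the <code>finally</code> would not run in that case. Releasing from a done callback instead is one way to close that gap.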

===== When a restart happens, correctness matters more than salvage. =====
# No replay of in-flight requests.
# On restart initiation, all in-flight requests must transition to FAILED with fail_reason="worker_restarted" (or a more specific reason, e.g. "server_died") and must free their slots promptly.
# A restart is performed as:
## stop the process group (SIGTERM → SIGKILL)
## start a new subprocess
## wait for readiness probe success
## move the worker state to READY or FAILED
# Crash-loop protection invariant: if restarts exceed max_restarts_per_window, the worker becomes FAILED and will not auto-restart until it is explicitly restarted (or until policy allows).
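The crash-loop rule above can be sketched as a sliding-window counter with a sticky lockout. Only <code>max_restarts_per_window</code> comes from the text; <code>RestartPolicy</code>, <code>window_seconds</code>, and the method names are hypothetical, and clearing the history on an explicit restart is an assumption about what "explicitly restarted" should mean.

```python
from collections import deque


class RestartPolicy:
    def __init__(self, max_restarts_per_window, window_seconds):
        self.max_restarts_per_window = max_restarts_per_window
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent automatic restarts
        self.locked_out = False

    def allow_auto_restart(self, now):
        """Record and permit an automatic restart, or trip the lockout."""
        # Drop restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if self.locked_out or len(self._restarts) >= self.max_restarts_per_window:
            self.locked_out = True  # sticky until an explicit restart
            return False
        self._restarts.append(now)
        return True

    def explicit_restart(self):
        """Operator-initiated restart: clear the lockout and the history."""
        self.locked_out = False
        self._restarts.clear()
```

Passing <code>now</code> in explicitly (for example from <code>time.monotonic()</code>) keeps the policy deterministic under test.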

===== Given long prefill, timeouts must be careful. =====
# Progress is defined as any bytes received after headers, plus any liveness evidence from prefill probes.
# A request updates:
#* last_stream_byte_at on any received stream data
#* last_liveness_at on each successful liveness probe
#* last_progress_at = max(last_stream_byte_at, last_liveness_at)
# “Stall” decisions are based on the absence of progress for a configured window, not on the absence of tokens.
# Connect and header timeouts remain short/moderate and are treated as “server unreachable” rather than “model slow”.
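The progress definition above amounts to tracking two timestamps and comparing their maximum against a stall window. A minimal sketch follows; the three <code>last_*</code> names come from the text, while <code>ProgressTracker</code>, <code>stall_timeout</code>, and <code>started_at</code> are hypothetical, and using the start time as the baseline before any progress has been seen is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressTracker:
    stall_timeout: float  # seconds without progress before declaring a stall
    started_at: float     # baseline used until the first progress event
    last_stream_byte_at: Optional[float] = None
    last_liveness_at: Optional[float] = None

    def on_stream_bytes(self, now: float) -> None:
        # Any bytes received after headers count as progress.
        self.last_stream_byte_at = now

    def on_liveness_ok(self, now: float) -> None:
        # A successful liveness probe counts as progress during long prefill.
        self.last_liveness_at = now

    @property
    def last_progress_at(self) -> float:
        stamps = [
            t for t in (self.last_stream_byte_at, self.last_liveness_at)
            if t is not None
        ]
        return max(stamps) if stamps else self.started_at

    def is_stalled(self, now: float) -> bool:
        return now - self.last_progress_at > self.stall_timeout
```

With this shape, a request in a long prefill stays alive as long as liveness probes keep succeeding, even though no tokens have arrived yet.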

===== Tool invariants =====
Normal tools:
# Each tool call must match a known tool name in normal_tools; otherwise it is a tool parse/validation failure and the request transitions to FAILED.
# Each tool iteration decrements the remaining tool budget.
# ToolRunner output must be JSON-serializable (or the worker must serialize it safely) and is appended as a tool message.

Exit tools:
# Exit tool calls are recorded into signals[] on a best-effort basis even if they are malformed; a malformed signal must not crash the worker.
# Exit tools do not decrement the normal tool budget unless you explicitly choose to share budgets (separate budgets are the default, not a requirement).
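The normal-tool and exit-tool rules above can be combined into a single dispatch step. This sketch makes several assumptions that are not in the text: the tool call is a dict with <code>name</code> and a JSON string in <code>arguments</code>, the request is a plain dict, tools are plain callables standing in for ToolRunner, budgets are separate, and the <code>fail_reason</code> strings are illustrative.

```python
import json


def _fail(req, reason):
    req["state"] = "FAILED"
    req["fail_reason"] = reason
    return "failed"


def handle_tool_call(call, normal_tools, exit_tool_names, req):
    """Process one tool call; returns "signal", "tool_message", or "failed"."""
    name = call.get("name")
    raw_args = call.get("arguments")

    if name in exit_tool_names:
        # Best effort: record the signal even if its arguments do not parse.
        try:
            signal = {"name": name, "arguments": json.loads(raw_args)}
        except (TypeError, ValueError):
            signal = {"name": name, "raw": raw_args, "malformed": True}
        req["signals"].append(signal)
        return "signal"  # no state change, no budget change

    if name not in normal_tools:
        return _fail(req, "unknown_tool")
    if req["tool_iters_remaining"] <= 0:
        return _fail(req, "tool_budget_exhausted")  # assumption for this sketch

    req["tool_iters_remaining"] -= 1
    try:
        result = normal_tools[name](**json.loads(raw_args))
        # default=str is a simple way to serialize safely for the sketch.
        content = json.dumps(result, default=str)
    except Exception as exc:  # tool error or parse error
        return _fail(req, f"tool_error: {exc}")

    req["messages"].append({"role": "tool", "name": name, "content": content})
    return "tool_message"
```

Checking exit tools first keeps them out of the normal budget path entirely, which matches the separate-budget default.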

===== Status reporting =====
Worker status:
* the reported state (READY/RUNNING/FAILED/STOPPED) must accurately reflect the readiness probe result and the subprocess state.

Request status:
* state must reflect the real request lifecycle
* output length must be consistent with accumulated text
* after release, request is NOT_FOUND

===== In debug builds or behind a flag, periodically assert: =====
* 0 <= slots_used <= slots_total
* available semaphore permits + active (non-terminal) requests = slots_total
* no request is both terminal and still has an active task
* tool_iters_remaining never negative

These checks are cheap and surface silent corruption early.
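The assertions listed above could be collected into one function that a debug flag or a periodic task calls. This is a sketch only: <code>check_invariants()</code> and the dict fields <code>id</code>, <code>state</code>, <code>task_active</code>, and <code>tool_iters_remaining</code> are hypothetical, and the permit check assumes that every active request holds exactly one permit.

```python
TERMINAL = {"COMPLETED", "FAILED", "CANCELED"}


def check_invariants(slots_total, slots_used, permits_available, requests):
    """Return a list of violation messages; an empty list means all checks pass."""
    violations = []

    if not 0 <= slots_used <= slots_total:
        violations.append(f"slots_used={slots_used} outside [0, {slots_total}]")

    # Assumption: each non-terminal request holds exactly one permit.
    active = sum(1 for r in requests if r["state"] not in TERMINAL)
    if permits_available + active != slots_total:
        violations.append(
            f"permits_available={permits_available} + active={active} "
            f"!= slots_total={slots_total}"
        )

    for r in requests:
        if r["state"] in TERMINAL and r["task_active"]:
            violations.append(f"request {r['id']} is terminal but its task is active")
        if r["tool_iters_remaining"] < 0:
            violations.append(f"request {r['id']} has a negative tool budget")

    return violations
```

Returning messages instead of raising lets the caller decide whether a violation should assert in debug builds or only be logged.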
