==== Appendix: llama-server restart policy ====

This appendix specifies when and how the worker restarts llama-server, and how it avoids thrashing.

===== A restart may be initiated when any of the following occurs: =====

# Process death
#* The llama-server subprocess exits unexpectedly.
# Readiness probe failure (when expecting READY)
#* While in READY, GET /v1/models fails repeatedly beyond policy thresholds.
# Stall timeout (see the progress-tracking sketch below)
#* A request makes no progress (no stream bytes and no liveness evidence) beyond stall_timeout/policy thresholds.
#* "Progress" is defined elsewhere: any bytes after headers, or /proc liveness during prefill.

===== Restart is "nuke & repave" (see the restart sketch below): =====

# Mark worker state = RUNNING (restarting).
# Fail all in-flight requests:
#* Transition each in-flight request → FAILED.
#* Set fail_reason="worker_restarted" (or "server_died" if the process actually died).
#* Do not discard partial output; it must remain retrievable via get_result().
#* Ensure each request releases its slot promptly (task cancel + cleanup).
# Stop the subprocess process group:
#* SIGTERM → short wait → SIGKILL the group if needed.
# Start the subprocess:
#* Spawn with the configured server_cmd and env (the port is externally assigned).
# Wait for readiness:
#* Poll GET /v1/models until success or the startup deadline.
#* On success: worker state = READY.
#* On failure: worker state = FAILED (or remain RUNNING briefly if retrying per backoff policy).

===== To prevent thrash (see the rolling-window sketch below): =====

* Track restart timestamps in a rolling window.
* If restarts exceed max_restarts_per_window inside restart_window_s:
** Worker state becomes FAILED.
** submit() returns WORKER_FAILED immediately (no slot consumption).
** The orchestrator can decide whether/when to call start() again.

===== When a restart is triggered repeatedly: =====

* Apply restart_backoff_s between restart attempts (constant or a modest exponential, but keep it simple and testable).
* Backoff must never block stop() from completing promptly.

===== While a restart is in progress (see the submit() sketch below): =====

* submit() returns WORKER_NOT_READY (or equivalent) without consuming a slot.
* In-flight requests failed by the restart remain retrievable via get_result() until fetched (and then released).
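
===== Illustrative sketches =====

The stall-timeout trigger hinges on a per-request notion of progress. Below is a minimal Python sketch of that bookkeeping, assuming one clock per in-flight request; ProgressClock and its method names are hypothetical, while the two progress signals (stream bytes after headers, /proc liveness during prefill) come from the policy above.

<syntaxhighlight lang="python">
import time


class ProgressClock:
    """Tracks the last evidence of progress for one request (sketch).

    "Progress" per the policy: any stream bytes after headers, or
    /proc liveness evidence during prefill.
    """

    def __init__(self) -> None:
        self.last_progress = time.monotonic()

    def on_stream_bytes(self) -> None:
        # Called whenever response bytes arrive after headers.
        self.last_progress = time.monotonic()

    def on_proc_liveness(self) -> None:
        # Called when prefill-phase liveness is observed, e.g. a CPU-time
        # delta read from /proc/<pid>/stat.
        self.last_progress = time.monotonic()

    def stalled(self, stall_timeout_s: float) -> bool:
        return time.monotonic() - self.last_progress > stall_timeout_s
</syntaxhighlight>

A periodic watchdog task can then scan every in-flight request's clock and trigger a restart as soon as any of them reports stalled().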
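Next, a sketch of the "nuke & repave" sequence as an asyncio coroutine, assuming POSIX process groups and aiohttp for the readiness probe. The field names (inflight, proc, server_cmd, term_grace_s, startup_deadline_s) are illustrative placeholders for the worker's actual configuration, not the real implementation.

<syntaxhighlight lang="python">
import asyncio
import os
import signal
import time

import aiohttp  # assumption: any async HTTP client works for the probe


class Worker:
    # Assumed fields: state, inflight (request id -> request), proc,
    # server_cmd, env, port, term_grace_s, startup_deadline_s.

    async def restart(self, reason: str = "worker_restarted") -> None:
        self.state = "RUNNING"  # restarting; submit() rejects in this window

        # Fail all in-flight requests but keep partial output for get_result().
        for req in list(self.inflight.values()):
            req.state = "FAILED"
            req.fail_reason = reason  # "worker_restarted" or "server_died"
            req.task.cancel()         # slot is released by the task's cleanup

        # Stop the whole process group: SIGTERM, short grace, then SIGKILL.
        if self.proc is not None and self.proc.returncode is None:
            pgid = os.getpgid(self.proc.pid)
            os.killpg(pgid, signal.SIGTERM)
            try:
                await asyncio.wait_for(self.proc.wait(),
                                       timeout=self.term_grace_s)
            except asyncio.TimeoutError:
                os.killpg(pgid, signal.SIGKILL)
                await self.proc.wait()

        # Respawn in a fresh session so killpg() stays scoped to this server.
        self.proc = await asyncio.create_subprocess_exec(
            *self.server_cmd, env=self.env, start_new_session=True)

        # Poll GET /v1/models until success or the startup deadline.
        deadline = time.monotonic() + self.startup_deadline_s
        async with aiohttp.ClientSession() as http:
            while time.monotonic() < deadline:
                try:
                    url = f"http://127.0.0.1:{self.port}/v1/models"
                    async with http.get(url) as resp:
                        if resp.status == 200:
                            self.state = "READY"
                            return
                except aiohttp.ClientError:
                    pass
                await asyncio.sleep(0.25)
        self.state = "FAILED"  # deadline passed; the orchestrator decides next
</syntaxhighlight>

Spawning with start_new_session=True is what makes the SIGTERM → SIGKILL escalation safe: signals go to llama-server's own process group rather than to the worker.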
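The rolling-window thrash guard might look like the following. The parameter names max_restarts_per_window, restart_window_s, and restart_backoff_s come from the policy above; the RestartGuard class and its methods are hypothetical. Accepting now as an argument keeps the window logic trivially testable.

<syntaxhighlight lang="python">
import time
from collections import deque


class RestartGuard:
    def __init__(self, max_restarts_per_window: int,
                 restart_window_s: float, restart_backoff_s: float) -> None:
        self.max_restarts = max_restarts_per_window
        self.window_s = restart_window_s
        self.backoff_s = restart_backoff_s
        self.stamps: deque = deque()  # monotonic timestamps of recent restarts

    def record_restart(self, now: float | None = None) -> None:
        self.stamps.append(time.monotonic() if now is None else now)

    def exceeded(self, now: float | None = None) -> bool:
        """True once the window limit is passed; the worker then goes FAILED."""
        now = time.monotonic() if now is None else now
        while self.stamps and now - self.stamps[0] > self.window_s:
            self.stamps.popleft()  # evict timestamps outside the rolling window
        return len(self.stamps) > self.max_restarts

    def backoff(self, attempt: int) -> float:
        """Modest exponential delay between attempts, capped to stay simple."""
        return min(self.backoff_s * (2 ** min(attempt, 4)), 60.0)
</syntaxhighlight>

The restart loop records each attempt, checks exceeded() before retrying, and sleeps for backoff(attempt) inside the restart task itself; stop() can then cancel that task, so the backoff never delays shutdown.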
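Finally, a sketch of the submit() admission check; SubmitStatus and free_slots are illustrative names. The point is ordering: the state check happens before any slot accounting, so calls made during a restart, or after the thrash guard trips, can never consume or leak a slot.

<syntaxhighlight lang="python">
from enum import Enum, auto


class SubmitStatus(Enum):
    ACCEPTED = auto()
    WORKER_NOT_READY = auto()  # restart in progress; caller may retry later
    WORKER_FAILED = auto()     # thrash guard tripped; needs an explicit start()


class Worker:
    # Assumed fields: state, free_slots, inflight (request id -> request).

    def submit(self, request) -> SubmitStatus:
        # Reject before touching the slot pool, so a request that can only
        # fail never consumes a slot.
        if self.state == "FAILED":
            return SubmitStatus.WORKER_FAILED
        if self.state != "READY":
            return SubmitStatus.WORKER_NOT_READY
        self.free_slots -= 1
        self.inflight[request.id] = request
        return SubmitStatus.ACCEPTED
</syntaxhighlight>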