
==== ### ====

Each request has:
* id, timestamps, config snapshot (model/ctx/tools enabled), slot id
* states: ALLOCATED_SLOT → DISPATCHED → RUNNING → TOOL_RUNNING (optional) → COMPLETED
* terminal: FAILED | CANCELED | TIMEOUT | NO_SLOT (NO_SLOT is a submit-time result)

Required methods:
* submit(...) -> RequestHandle | error(NO_SLOT_AVAILABLE)
* poll(id) -> status (includes partial text length / last_progress_time)
* result(id) -> final (or still running)
* cancel(id) best-effort: cancels HTTP stream and frees slot
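The state list above can be sketched as a small transition table. This is an illustrative sketch, not a fixed design: the names `State`, `TERMINAL` and `can_transition` are hypothetical, and it assumes that TOOL_RUNNING may either return to RUNNING (the model continues after a tool result) or go straight to COMPLETED. NO_SLOT is left out because it is a submit-time result, not a state of an accepted request.

```python
from enum import Enum


class State(Enum):
    ALLOCATED_SLOT = "allocated_slot"
    DISPATCHED = "dispatched"
    RUNNING = "running"
    TOOL_RUNNING = "tool_running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
    TIMEOUT = "timeout"


# States a request can never leave once reached.
TERMINAL = {State.COMPLETED, State.FAILED, State.CANCELED, State.TIMEOUT}

# Error outcomes reachable from any non-terminal state.
_ERROR_STATES = {State.FAILED, State.CANCELED, State.TIMEOUT}

# Happy-path transitions (assumption: TOOL_RUNNING can loop back to RUNNING).
_FORWARD = {
    State.ALLOCATED_SLOT: {State.DISPATCHED},
    State.DISPATCHED: {State.RUNNING},
    State.RUNNING: {State.TOOL_RUNNING, State.COMPLETED},
    State.TOOL_RUNNING: {State.RUNNING, State.COMPLETED},
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if a request may move from src to dst."""
    if src in TERMINAL:
        return False
    if dst in _ERROR_STATES:
        return True
    return dst in _FORWARD.get(src, set())
```

Rejecting illegal transitions in one place makes it harder for a late stream callback to move a canceled request back to RUNNING.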

===== You want multiple independent timers, and at least one that is progress-based rather than “first token in X seconds”. =====

Recommended set:

A. Connect timeout (short)
* e.g. 1–3s to open TCP connection to the server.

B. “Request accepted” timeout (medium)
* time until you get the HTTP status line and headers back, i.e. the server has actually accepted the request. If you can’t even get headers, that’s a strong sign of a dead server.

C. Progress timeout (primary stall detector)
* a “deadman” timer that resets whenever any bytes arrive on the stream (or whenever token count increases).
* This is the one that works with slow hardware + huge contexts.
* Example: if no bytes received for 120s (configurable), treat as stall.

D. Absolute wall-clock timeout (optional / large)
* Set it very high or disable it by default for long prompts. If enabled, it should be overridable per request.

Key point: avoid a strict “time to first token” timeout unless it’s huge or derived from request size.
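A minimal sketch of the progress ("deadman") timer from item C, assuming the stream reader calls `on_progress()` whenever bytes arrive. The class name `ProgressWatchdog` and its parameters are hypothetical; the 120s default simply mirrors the example value above. The clock is injectable so the logic can be tested without sleeping.

```python
import time


class ProgressWatchdog:
    """Reports a stall when no progress has been seen for stall_after_s seconds."""

    def __init__(self, stall_after_s: float = 120.0, clock=time.monotonic):
        self._stall_after_s = stall_after_s
        self._clock = clock
        self._last_progress = clock()

    def on_progress(self) -> None:
        # Call on every received chunk (or whenever the token count grows).
        self._last_progress = self._clock()

    def seconds_since_progress(self) -> float:
        return self._clock() - self._last_progress

    def stalled(self) -> bool:
        return self.seconds_since_progress() > self._stall_after_s
```

Because the timer resets on every chunk, a slow but steadily producing server never trips it, while a silent one does, regardless of how long the first token took.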

Also specify:
* after a stall: attempt one reconnect/health check, then restart if still bad
* restart backoff + crash-loop breaker (e.g., if 5 restarts in 2 minutes, mark worker unhealthy and stop)
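The restart policy could look roughly like the following: a sliding-window crash-loop breaker plus capped exponential backoff. The names `CrashLoopBreaker` and `backoff_delay`, and the 1s base / 30s cap, are assumptions for illustration; the 5-restarts-in-2-minutes threshold is the example figure from the bullet above.

```python
import time
from collections import deque


class CrashLoopBreaker:
    """Trips once max_restarts restarts have happened within window_s seconds."""

    def __init__(self, max_restarts: int = 5, window_s: float = 120.0,
                 clock=time.monotonic):
        self._max_restarts = max_restarts
        self._window_s = window_s
        self._clock = clock
        self._restarts = deque()

    def record_restart(self) -> bool:
        """Record a restart; return True if further restarts are still allowed."""
        now = self._clock()
        self._restarts.append(now)
        # Drop restarts that have fallen out of the sliding window.
        while self._restarts and now - self._restarts[0] > self._window_s:
            self._restarts.popleft()
        return len(self._restarts) < self._max_restarts


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0) -> float:
    """Capped exponential backoff: base, 2*base, 4*base, ... up to cap_s."""
    return min(cap_s, base_s * (2 ** attempt))
```

When `record_restart()` returns False, the worker would mark itself unhealthy and stop restarting rather than sleeping and trying again.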

===== Worker-level status should include: =====
* active (started), healthy, restarting
* slots_total, slots_in_use, list of active request ids
* last_healthy_time, restart_count, last_error
* optional: rolling latency stats (ttft, tokens/sec, total time)

Request-level status should include:
* state, created_at, started_at, last_progress_at
* partial_output_chars (or tokens) + optionally last N chars of partial for debugging
* tool_calls_count and tool trace entries
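One possible shape for the two status records, using the field names listed above. The dataclass names are hypothetical, and deriving `slots_in_use` from the active request list assumes one slot per active request; the optional latency stats and tool trace entries are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RequestStatus:
    state: str
    created_at: float
    started_at: Optional[float] = None
    last_progress_at: Optional[float] = None
    partial_output_chars: int = 0
    partial_tail: str = ""  # last N chars of partial output, for debugging
    tool_calls_count: int = 0


@dataclass
class WorkerStatus:
    active: bool
    healthy: bool
    restarting: bool
    slots_total: int
    active_request_ids: List[str] = field(default_factory=list)
    last_healthy_time: Optional[float] = None
    restart_count: int = 0
    last_error: Optional[str] = None

    @property
    def slots_in_use(self) -> int:
        # Assumption: each active request occupies exactly one slot.
        return len(self.active_request_ids)
```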

===== Lock down: =====
* Tools are provided as OpenAI-style tools=[{type:"function", function:{name, description, parameters}}]
* Model-facing messages follow OpenAI tool calling conventions:
** assistant returns tool call(s)
** worker executes them via the provided ToolRunner
** worker appends tool results and continues until final answer

Constraints (important for safety + stability):
* max tool call depth / iterations (e.g. 8)
* per-tool timeout (e.g. 2–10s)
* max tool output size (truncate / summarize)
* tool calls should be deterministic / idempotent where possible (since a restart kills in-flight requests, and a retried request may run the same tool again)
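The loop and two of the constraints (iteration cap, output truncation) can be sketched as below. This is a sketch under assumptions, not the worker's actual implementation: `call_model` and `run_tool` are hypothetical callables standing in for the HTTP client and the ToolRunner, the 4000-character limit is arbitrary, and the per-tool timeout is left to `run_tool`. Message shapes follow the OpenAI chat tool-calling format (assistant `tool_calls` with JSON-string `arguments`, results as `role: "tool"` messages).

```python
import json

MAX_TOOL_ITERATIONS = 8        # example cap from the constraints above
MAX_TOOL_OUTPUT_CHARS = 4000   # arbitrary illustrative limit


def truncate(text: str, limit: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    """Cut oversized tool output and note how much was dropped."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n[truncated {len(text) - limit} chars]"


def run_tool_loop(call_model, run_tool, messages, tools,
                  max_iterations: int = MAX_TOOL_ITERATIONS):
    """Drive model <-> tool turns until the model gives a final answer.

    call_model(messages, tools) -> assistant message dict
    run_tool(name, args_dict)   -> str
    Returns (final_text, full_message_history).
    """
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_iterations):
        assistant = call_model(messages, tools)
        messages.append(assistant)
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:
            return assistant.get("content"), messages
        for call in tool_calls:
            fn = call["function"]
            output = run_tool(fn["name"], json.loads(fn["arguments"]))
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": truncate(output),
            })
    raise RuntimeError("tool_error: max tool iterations exceeded")
```

Raising once the cap is hit gives the caller a definite terminal error instead of a request that loops indefinitely.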

===== Explicitly require: =====
* server stdout/stderr captured (ring buffer for debug)
* SIGTERM then SIGKILL on stop
* port binding strategy (caller provides port OR worker allocates)
* GPU pinning via env (e.g., CUDA_VISIBLE_DEVICES) per worker process
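A sketch of the process-lifecycle requirements using only the Python standard library. The function names, the 1000-line ring buffer and the 5s grace period are assumptions. Note that `pick_free_port()` has an inherent race (another process can take the port between the probe and the server binding it), so a caller-provided port avoids that.

```python
import collections
import os
import socket
import subprocess
import threading


def pick_free_port() -> int:
    """Ask the OS for a currently free TCP port (racy; see note above)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def start_server(cmd, gpu_ids: str, log_lines: int = 1000):
    """Start the server pinned to gpu_ids; capture stdout+stderr in a ring buffer."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids)
    proc = subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    ring = collections.deque(maxlen=log_lines)  # oldest lines drop off first

    def pump():
        for line in proc.stdout:
            ring.append(line.rstrip("\n"))

    threading.Thread(target=pump, daemon=True).start()
    return proc, ring


def stop_server(proc: subprocess.Popen, grace_s: float = 5.0):
    """Terminate politely, then force-kill if the process ignores it."""
    if proc.poll() is None:
        proc.terminate()              # SIGTERM on POSIX
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()               # SIGKILL on POSIX
            proc.wait()
    return proc.returncode
```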

===== If submit() succeeds, the worker guarantees: =====
* the request either completes with a result or ends in a terminal error with a reason
* Terminal error reasons should be machine-usable:
** server_restart, stall_timeout, connect_failed, http_error, tool_error, canceled, etc.
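The guarantee can be made concrete with an enum of reasons and an outcome type that must carry either a result or a reason. This is a sketch only: `FailureReason` and `RequestOutcome` are hypothetical names, and the enum covers just the reasons listed above (the original list ends in "etc.", so a real set would likely be larger).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureReason(str, Enum):
    SERVER_RESTART = "server_restart"
    STALL_TIMEOUT = "stall_timeout"
    CONNECT_FAILED = "connect_failed"
    HTTP_ERROR = "http_error"
    TOOL_ERROR = "tool_error"
    CANCELED = "canceled"


@dataclass(frozen=True)
class RequestOutcome:
    text: Optional[str] = None              # set on success
    reason: Optional[FailureReason] = None  # set on terminal error
    detail: str = ""                        # free-form context for humans

    def __post_init__(self):
        # Exactly one of text / reason must be present.
        if (self.text is None) == (self.reason is None):
            raise ValueError("exactly one of text or reason must be set")

    @property
    def ok(self) -> bool:
        return self.reason is None
```

Callers can then branch on `outcome.reason` (for example, retrying on `server_restart` but not on `tool_error`) without parsing error strings.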