
=== Assistant: That’s totally compatible with the “nuke & repave” philosophy — you just have to make sure your stall detection doesn’t… ===
That’s totally compatible with the “nuke & repave” philosophy — you just have to make sure your stall detection doesn’t confuse “long prefill” with “hung”.

Here’s how I’d lock the timeout requirements so they work with multi-minute / tens-of-minutes TTFT.

==== Transport-level timeouts (keep these short) ====

These detect dead server / dead socket, not slow inference:
* Connect timeout (TCP connect): ~1–3s
* Header/accept timeout (time to receive HTTP status/headers): ~5–30s (configurable)

If you can’t connect or can’t even get headers back, that’s not “slow model”, that’s “not serving”.
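To make those two checks concrete, here is a rough stdlib-only sketch. The function name, the `/health` path, and the result labels are placeholders I made up, and it assumes your server answers that path with headers promptly; adapt it to whatever endpoint your worker actually exposes.

```python
import http.client


def probe_server(host, port, path="/health",
                 connect_timeout=2.0, header_timeout=15.0):
    """Return 'ok', 'connect_failed', or 'no_headers' (labels are illustrative)."""
    conn = http.client.HTTPConnection(host, port, timeout=connect_timeout)
    try:
        try:
            conn.connect()  # TCP connect, bounded by connect_timeout
        except OSError:
            return "connect_failed"
        # From here on, bound the wait for the status line and headers instead.
        conn.sock.settimeout(header_timeout)
        try:
            conn.request("GET", path)
            conn.getresponse()  # returns once status + headers are parsed
        except (OSError, http.client.HTTPException):
            return "no_headers"
        return "ok"
    finally:
        conn.close()
```

Neither timeout here says anything about how long inference may take; they only answer “is something serving on that port at all?”.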

===== With your workloads, a “time to first token” timeout is basically unusable unless it’s huge. =====

Requirement:
* ttft_timeout = None by default (disabled)
* If enabled, it’s per-worker profile and set very high.

===== Instead of “no tokens yet”, stall should mean “nothing indicates progress”. =====

Define a Progress/Liveness watchdog that is satisfied by any of these signals:

A. Stream progress
Once streaming begins, any bytes/tokens reset the watchdog.

B. Process liveness progress (prefill-safe)
Before tokens arrive, use one or more low-level signals to decide “the model is still working”:
* the server subprocess is alive, and
* at least one of the following shows activity:
** its CPU time is increasing (via /proc/<pid>/stat)
** optional: GPU activity associated with that PID (via nvidia-smi pmon / utilization sampling), if you’re on NVIDIA

Requirement-wise: make these checks pluggable so you can support “just /proc” everywhere and add GPU checks where available.
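A minimal sketch of the “just /proc” probe could look like this (Linux only). The class and function names are mine, not from any library; the reader function is injectable so a GPU-based probe could later expose the same `made_progress()` method.

```python
def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split only what follows the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime is field 14 and stime is field 15.
    return int(rest[11]) + int(rest[12])


def read_cpu_ticks(pid: int):
    """Return CPU ticks for pid, or None if the process no longer exists."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_cpu_ticks(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


class ProcCpuProbe:
    """Liveness check: reports progress when CPU time grew since the last sample."""

    def __init__(self, pid, read=read_cpu_ticks):
        self.pid = pid
        self.read = read
        self.last = None  # no baseline yet

    def made_progress(self) -> bool:
        ticks = self.read(self.pid)
        if ticks is None:
            return False  # process is gone, so no evidence of life
        progressed = self.last is not None and ticks > self.last
        self.last = ticks
        return progressed
```

The first call only records a baseline and returns False, so sample it on a fixed interval that is much shorter than your stall timeout.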

Then you can define:
* stall_timeout = “no stream bytes AND no liveness progress” for X seconds → treat as hung

Given your TTFT reality, X should be measured in tens of minutes by default, or even disabled unless you have liveness checks.
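Here is roughly what that watchdog could look like. This is a sketch under my own naming (`ProgressWatchdog`, `on_stream_bytes`, `on_liveness` are not from any existing library), with the clock injected so it can be tested without waiting.

```python
import time


class ProgressWatchdog:
    """Stalled only when neither stream bytes nor liveness probes show progress."""

    def __init__(self, stall_timeout, clock=time.monotonic):
        self.stall_timeout = stall_timeout  # seconds, or None to disable
        self.clock = clock
        self.last_progress_at = clock()

    def on_stream_bytes(self, n: int) -> None:
        # Signal A: any bytes/tokens on the stream reset the watchdog.
        if n > 0:
            self.last_progress_at = self.clock()

    def on_liveness(self, progressed: bool) -> None:
        # Signal B: a probe (e.g. /proc CPU time) saw the process doing work.
        if progressed:
            self.last_progress_at = self.clock()

    def is_stalled(self) -> bool:
        if self.stall_timeout is None:
            return False
        return self.clock() - self.last_progress_at > self.stall_timeout
```

With `stall_timeout=30 * 60`, a 25-minute prefill that keeps burning CPU never trips it, while a process that goes fully quiet for more than 30 minutes does.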

===== For long contexts, absolute timeouts should be either: =====
* very large (hours), or
* per-request override set by the caller (e.g., “this request can run 45 minutes”).

Default requirement:
* absolute_timeout = None or extremely high.

==== Even if no tokens for 20 minutes, the worker should still show: ====
* request state = RUNNING
* last_liveness_at updating from /proc / GPU checks
* optionally: “prefill_estimated” flag (if you infer it from prompt size/ctx)

This prevents your supervisor from “helpfully” restarting healthy-but-slow jobs.
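One possible shape for that status record, as a sketch. The field names beyond `state`, `last_liveness_at`, and `prefill_estimated` are my own additions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestStatus:
    request_id: str
    state: str = "RUNNING"
    tokens_streamed: int = 0                  # hypothetical extra field
    last_liveness_at: Optional[float] = None  # set by /proc or GPU probes
    prefill_estimated: bool = False           # inferred, e.g. from prompt size

    def record_liveness(self, now: float) -> None:
        self.last_liveness_at = now

    def snapshot(self, now: float) -> dict:
        """What a supervisor or status endpoint would see for this request."""
        age = None if self.last_liveness_at is None else now - self.last_liveness_at
        return {
            "request_id": self.request_id,
            "state": self.state,
            "tokens_streamed": self.tokens_streamed,
            "seconds_since_liveness": age,
            "prefill_estimated": self.prefill_estimated,
        }
```

A supervisor reading this sees RUNNING with a small `seconds_since_liveness` and zero tokens, which is exactly the “healthy but slow” case it should leave alone.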

==== Restart policy ====
* If connect/header failures happen: restart quickly.
* If stall_timeout triggers (no evidence of life): restart + fail in-flight requests with FAILED(reason="stall_timeout").
* If the process exits: restart + fail in-flight with FAILED(reason="worker_died").
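Those rules are small enough to write as a pure function. In this sketch the `Event` enum and `plan_recovery` are hypothetical names; for connect/header failures it only restarts and leaves in-flight handling unspecified, since the rules above don't say what should happen to requests in that case.

```python
from enum import Enum


class Event(Enum):
    CONNECT_FAILED = "connect_failed"
    HEADER_TIMEOUT = "header_timeout"
    STALL_TIMEOUT = "stall_timeout"
    WORKER_DIED = "worker_died"


def plan_recovery(event, in_flight_ids):
    """Return (restart, failures), where failures maps request id -> final status."""
    if event in (Event.STALL_TIMEOUT, Event.WORKER_DIED):
        # No evidence of life, or the process is gone: restart and fail everything.
        status = f'FAILED(reason="{event.value}")'
        return True, {rid: status for rid in in_flight_ids}
    if event in (Event.CONNECT_FAILED, Event.HEADER_TIMEOUT):
        # Not serving at all: restart quickly (in-flight handling is left open here).
        return True, {}
    return False, {}
```

Note that “slow” never appears as an event: only transport failure, a tripped stall watchdog, or process exit lead to a restart.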

==== Workers should support profiles, because timeout expectations will differ wildly: ====
* “fast 7B on P40” vs “big model on M10 shards” etc.

So bake into config:
* timeouts = {connect, headers, stall, ttft(optional), absolute(optional)}
* liveness_checks = {proc_cpu_time: on, gpu_util: optional}
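As a sketch, that config could be a few frozen dataclasses. The numeric defaults and the two example profiles are illustrative guesses picked from the ranges above, not measured values, and `absolute_timeout_for` is a hypothetical helper for the per-request override.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Timeouts:
    connect: float = 2.0               # seconds, TCP connect
    headers: float = 15.0              # seconds, until HTTP status/headers
    stall: Optional[float] = 30 * 60   # no stream bytes AND no liveness progress
    ttft: Optional[float] = None       # disabled by default
    absolute: Optional[float] = None   # disabled by default


@dataclass(frozen=True)
class LivenessChecks:
    proc_cpu_time: bool = True
    gpu_util: bool = False             # optional, NVIDIA only


@dataclass(frozen=True)
class WorkerProfile:
    name: str
    timeouts: Timeouts = field(default_factory=Timeouts)
    liveness_checks: LivenessChecks = field(default_factory=LivenessChecks)

    def absolute_timeout_for(self, request_override: Optional[float] = None):
        """A per-request override wins; None means no absolute limit."""
        if request_override is not None:
            return request_override
        return self.timeouts.absolute


PROFILES = {
    "fast-7b-p40": WorkerProfile("fast-7b-p40", Timeouts(stall=10 * 60)),
    "big-model-m10": WorkerProfile(
        "big-model-m10",
        Timeouts(stall=60 * 60),
        LivenessChecks(proc_cpu_time=True, gpu_util=True),
    ),
}
```

A caller that knows a request may legitimately run 45 minutes would pass `request_override=45 * 60`; everything else inherits the profile.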

If you’re good with that spec, next step (when you’re ready for code) is implementing a watchdog that updates last_progress_at from either stream bytes or liveness probes, so “slow prefill” never trips the restart logic.