
=== Assistant:  ===
==== These defaults are meant to be safe, conservative, and boring—especially given your long-prefill workloads and “nuke & repave” philosophy. They’re starting points, not hard commitments; each worker profile can tune them. ====

===== Readiness probe =====
* GET /v1/models
* Consider READY on: HTTP 200 + JSON parse success.

Startup sequence
* Start subprocess → probe every 0.5s initially (short burst) until ready, then back off.
* Startup max wait: 120s before declaring startup failed and moving to FAILED (this is about “server didn’t come up”, not model prefill).

Rationale: server should bind/respond quickly even if model load is heavy; if it truly needs longer, increase this per worker.
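The startup sequence above could be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: `wait_until_ready`, `is_ready`, the `probe` callable (which would perform the GET /v1/models request and return a status and body), and the `burst_probes` and `max_interval_s` values are all assumptions. It also assumes "back off" means doubling the probe interval once the initial burst is over.

```python
import json
import time
from typing import Callable, Optional, Tuple


def is_ready(status: int, body: bytes) -> bool:
    """READY means HTTP 200 plus a body that parses as JSON."""
    if status != 200:
        return False
    try:
        json.loads(body)
    except ValueError:  # covers JSONDecodeError and bad encodings
        return False
    return True


def wait_until_ready(
    probe: Callable[[], Optional[Tuple[int, bytes]]],
    max_wait_s: float = 120.0,
    burst_interval_s: float = 0.5,
    burst_probes: int = 20,       # assumption: length of the initial burst
    max_interval_s: float = 5.0,  # assumption: ceiling for the back-off
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the server is ready, False if startup timed out."""
    deadline = clock() + max_wait_s
    interval = burst_interval_s
    attempts = 0
    while clock() < deadline:
        result = probe()  # None stands for "connection refused / no response"
        if result is not None and is_ready(*result):
            return True
        attempts += 1
        if attempts >= burst_probes:
            # Burst is over: double the interval up to the ceiling.
            interval = min(interval * 2, max_interval_s)
        sleep(min(interval, max(0.0, deadline - clock())))
    return False  # caller would move the worker to FAILED
```

Injecting `clock` and `sleep` keeps the loop testable without a real server or real waiting.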

===== A good baseline profile for “slow hardware / big contexts”: =====
* connect_timeout_s = 3.0
* headers_timeout_s = 30.0
* ttft_timeout_s = None (disabled)
* prefill_liveness_timeout_s = None (disabled), or 3600.0 (1 hour) if you want an eventual kill even while liveness checks keep passing
* idle_stream_timeout_s = 300.0 (5 minutes). Once streaming has started, 5 minutes with zero bytes is suspicious; tune higher if needed.
* absolute_timeout_s = None (disabled by default)
* liveness_probe_interval_s = 5.0 (lightweight)
* Restart controls:
** restart_backoff_s = 5.0 (first delay)
** restart_window_s = 120.0 (2 minutes)
** max_restarts_per_window = 5

Notes:
* If you disable prefill_liveness_timeout_s, then a “hard hang during prefill” is handled only by detecting process death or external cancellation. That may be acceptable in your environment.
* If you enable it, keep it large.
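One way to carry this profile around is a small frozen dataclass, with `None` meaning "disabled". The field names and values come from the list above; the `TimeoutProfile` and `RestartLimiter` class names are hypothetical, and treating the restart window as a sliding window is an assumption about the intended semantics.

```python
from collections import deque
from dataclasses import dataclass, replace
from typing import Deque, Optional


@dataclass(frozen=True)
class TimeoutProfile:
    """Baseline values from the list above; None means the timeout is disabled."""
    connect_timeout_s: float = 3.0
    headers_timeout_s: float = 30.0
    ttft_timeout_s: Optional[float] = None
    prefill_liveness_timeout_s: Optional[float] = None
    idle_stream_timeout_s: float = 300.0
    absolute_timeout_s: Optional[float] = None
    liveness_probe_interval_s: float = 5.0
    restart_backoff_s: float = 5.0
    restart_window_s: float = 120.0
    max_restarts_per_window: int = 5


class RestartLimiter:
    """Hypothetical helper: at most N restarts within a sliding time window."""

    def __init__(self, profile: TimeoutProfile) -> None:
        self._profile = profile
        self._restarts: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Record a restart at `now` if the budget allows it."""
        window = self._profile.restart_window_s
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= window:
            self._restarts.popleft()
        if len(self._restarts) >= self._profile.max_restarts_per_window:
            return False
        self._restarts.append(now)
        return True


# Per-worker tuning: enable the optional one-hour prefill kill for one profile.
slow_worker = replace(TimeoutProfile(), prefill_liveness_timeout_s=3600.0)
```

Keeping the profile immutable and deriving variants with `replace` makes it harder for one worker's tuning to leak into another's.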

===== max_tool_iterations = 8 =====
* per-tool timeout:
** lightweight tool runners: 5–10s
** heavier tool runners: configured in that ToolRunner, not here
* tool output size:
** not capped by worker; ToolRunner is responsible if needed

Fallback parsing:
* Enabled if native tool_calls aren’t present; strict JSON parsing; failure → tool_parse_error.
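A minimal sketch of the fallback path, assuming the model emits a single JSON object of the form `{"name": ..., "arguments": {...}}` as its whole message. That payload shape, the `parse_fallback_tool_call` function, and the `ToolParseError` class are illustrative assumptions; only the strict-JSON rule and the `tool_parse_error` code come from the text above.

```python
import json
from typing import Any, Dict


class ToolParseError(Exception):
    """Raised when fallback parsing fails; maps to tool_parse_error."""
    code = "tool_parse_error"


def parse_fallback_tool_call(content: str) -> Dict[str, Any]:
    """Strictly parse a tool call from message text (no repair, no guessing)."""
    try:
        obj = json.loads(content.strip())
    except ValueError as exc:
        raise ToolParseError(f"invalid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ToolParseError("tool call must be a JSON object")
    name = obj.get("name")
    arguments = obj.get("arguments")
    if not isinstance(name, str) or not name:
        raise ToolParseError("missing or non-string 'name'")
    if not isinstance(arguments, dict):
        raise ToolParseError("missing or non-object 'arguments'")
    return {"name": name, "arguments": arguments}
```

Being strict here means malformed output is reported as an error rather than being silently reinterpreted.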

===== Exit tools list is provided by the orchestrator at worker init. =====
* Worker records signals; does not change control flow.
* Suggested “starter” exit tools (names are up to you):
** signal_issue(code, severity, summary, meta={})
** request_escalation(reason, summary, meta={})
** request_decision(question, options, default=None, context=None, meta={})
** declare_outcome(code, summary=None, meta={})

(These are recommendations only; worker treats them as opaque schemas.)
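"Records signals; does not change control flow" could look something like the sketch below. `ExitSignal` and `ExitSignalLog` are hypothetical names, and the arguments are stored untouched because the worker treats the schemas as opaque.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, List, Optional


@dataclass
class ExitSignal:
    name: str
    arguments: Dict[str, Any]  # opaque to the worker; never interpreted
    recorded_at: float


@dataclass
class ExitSignalLog:
    """Hypothetical recorder; exit_tool_names is supplied by the orchestrator."""
    exit_tool_names: FrozenSet[str]
    signals: List[ExitSignal] = field(default_factory=list)

    def observe(self, name: str, arguments: Dict[str, Any],
                now: Optional[float] = None) -> bool:
        """Record the call if it is an exit tool; return whether it was one.

        The return value is informational only. The caller keeps running
        its loop as usual, and the orchestrator reads `signals` afterwards.
        """
        if name not in self.exit_tool_names:
            return False
        stamp = time.time() if now is None else now
        self.signals.append(ExitSignal(name, dict(arguments), stamp))
        return True
```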

===== Conservative thresholds to minimize false positives: =====
* Ignore lines with length < 64 characters (after stripping).
* Ignore empty/whitespace-only lines.
* Trigger if the same normalized line repeats consecutively:
** repeat_threshold = 10
* Start checking only after some output exists:
** min_output_chars_before_check = 512

On trigger:
* cancel request → FAILED(reason="repeated_line_loop")
* record: repeated line snippet (truncated) + count
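A sketch of a detector using these thresholds, fed with streamed text chunks. The `RepeatedLineDetector` class is hypothetical, and several details are assumptions the text leaves open: normalization is "strip and collapse whitespace", ignored lines neither count toward nor reset a run, and the threshold counts total consecutive occurrences.

```python
from typing import Any, Dict, Optional


class RepeatedLineDetector:
    def __init__(self, min_line_chars: int = 64, repeat_threshold: int = 10,
                 min_output_chars_before_check: int = 512,
                 snippet_chars: int = 120) -> None:
        self.min_line_chars = min_line_chars
        self.repeat_threshold = repeat_threshold
        self.min_output_chars = min_output_chars_before_check
        self.snippet_chars = snippet_chars  # assumption: truncation length
        self._buffer = ""      # trailing partial line from the last chunk
        self._total_chars = 0
        self._last: Optional[str] = None
        self._count = 0

    def feed(self, chunk: str) -> Optional[Dict[str, Any]]:
        """Consume a chunk; return a failure record if a loop is detected."""
        self._total_chars += len(chunk)
        self._buffer += chunk
        # Everything before the final newline is complete; keep the rest.
        *lines, self._buffer = self._buffer.split("\n")
        for line in lines:
            hit = self._check(line)
            if hit is not None:
                return hit
        return None

    def _check(self, line: str) -> Optional[Dict[str, Any]]:
        norm = " ".join(line.split())  # strip and collapse whitespace
        if len(norm) < self.min_line_chars:
            return None  # short or empty line: ignored entirely
        if norm == self._last:
            self._count += 1
        else:
            self._last, self._count = norm, 1
        if (self._count >= self.repeat_threshold
                and self._total_chars >= self.min_output_chars):
            return {"reason": "repeated_line_loop",
                    "snippet": norm[:self.snippet_chars],
                    "count": self._count}
        return None
```

On a non-`None` result the caller would cancel the request and move to FAILED with the returned reason.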

===== Default slots = 1 for very slow or memory-tight models. =====
* Use slots > 1 only when the server and hardware demonstrably handle concurrency without pathological latency.

===== Track tokens/sec only if the backend provides usage/token counts without expensive parsing. =====
* Otherwise, record:
** output chars
** elapsed time
** last_progress_at
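As a rough illustration, the cheap counters can live in one small record, with tokens/sec filled in only when the backend hands over a token count. `ProgressMetrics` and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ProgressMetrics:
    started_at: float
    output_chars: int = 0
    last_progress_at: Optional[float] = None
    completion_tokens: Optional[int] = None  # only if the backend reports it

    def on_chunk(self, text: str, now: float) -> None:
        """Count characters and note when output last advanced."""
        if text:
            self.output_chars += len(text)
            self.last_progress_at = now

    def on_usage(self, completion_tokens: int) -> None:
        """Store a backend-provided token count; no tokenizing is done here."""
        self.completion_tokens = completion_tokens

    def snapshot(self, now: float) -> Dict[str, Any]:
        elapsed = now - self.started_at
        tokens_per_s = None
        if self.completion_tokens is not None and elapsed > 0:
            tokens_per_s = self.completion_tokens / elapsed
        return {"output_chars": self.output_chars,
                "elapsed_s": elapsed,
                "last_progress_at": self.last_progress_at,
                "tokens_per_s": tokens_per_s}
```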

If you want to keep going, the next appendix would be Appendix E: Example BIOS prompt template (including a stable formatting/versioning approach and how it references tool budgets and exit-tools) since that’s a high-churn area and benefits from being specified early.