=== Assistant: You’ve already got the core nailed. ===

The main “gotchas” that tend to bite later are around request lifecycle, capacity/backpressure, observability, and tool-call state. Here’s a checklist of requirements I’d add or lock down now so you don’t end up redesigning mid-flight.

==== Lifecycle ====

* start() / stop() only (no “reconfigure in place”).
* Owns a single llama server subprocess + its port + its GPU-pinning env.
* Clean shutdown: try graceful, then hard kill; always free VRAM.

==== Requests ====

* submit(system_prompt, user_prompt, …) -> request_id returns immediately.
* Completion retrieved by polling: get_result(request_id) / get_status(request_id).
* Supports max_concurrent_requests > 1 (internal queue when saturated).

==== Health & resilience ====

* Detect stalls/disconnects.
* Attempt reconnect; if unhealthy past a threshold, restart the subprocess.
* Resume operation after restart (at minimum, fail in-flight requests cleanly with a clear reason).

==== Tools ====

* Tool registry passed in (a separate class is fine).
* Handles tool-call detection → executes the tool → resumes generation with the tool result.

That’s a solid v1. The sections below are the requirements I’d add or make explicit now.

==== Request state machine ====

You want an explicit state machine, e.g.:

* QUEUED → DISPATCHED → RUNNING → TOOL_WAIT → COMPLETED
* terminal: FAILED | CANCELED | TIMEOUT

Add:

* cancel(request_id) (even if it’s best-effort). Without cancellation, queue buildup becomes painful under load.
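For illustration, a minimal Python sketch of that state machine; the names here (RequestState, ALLOWED, transition) are placeholders, not a settled API:

<syntaxhighlight lang="python">
# Illustrative sketch: request states plus an explicit transition table,
# so illegal transitions fail loudly instead of corrupting request records.
from enum import Enum, auto

class RequestState(Enum):
    QUEUED = auto()
    DISPATCHED = auto()
    RUNNING = auto()
    TOOL_WAIT = auto()
    COMPLETED = auto()  # terminal
    FAILED = auto()     # terminal
    CANCELED = auto()   # terminal
    TIMEOUT = auto()    # terminal

TERMINAL = {RequestState.COMPLETED, RequestState.FAILED,
            RequestState.CANCELED, RequestState.TIMEOUT}

ALLOWED = {
    RequestState.QUEUED:     {RequestState.DISPATCHED, RequestState.CANCELED},
    RequestState.DISPATCHED: {RequestState.RUNNING, RequestState.FAILED,
                              RequestState.CANCELED, RequestState.TIMEOUT},
    RequestState.RUNNING:    {RequestState.TOOL_WAIT, RequestState.COMPLETED,
                              RequestState.FAILED, RequestState.CANCELED,
                              RequestState.TIMEOUT},
    RequestState.TOOL_WAIT:  {RequestState.RUNNING, RequestState.FAILED,
                              RequestState.CANCELED, RequestState.TIMEOUT},
}

def transition(current: RequestState, new: RequestState) -> RequestState:
    """Validate and apply a state change; terminal states never transition."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
</syntaxhighlight>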
==== Backpressure ====

Decide now what happens when saturated:

* queue up to max_queue_depth, then reject with QueueFullError,
* or drop the oldest, or add priority lanes (later).

Also decide whether max_concurrent_requests is:

* per worker instance (likely), and
* enforced with a semaphore + queue (the class sketch at the end of this note shows one way to wire this).

==== Timeouts ====

People often add only one of these, and then it gets messy. You want three:

* Connect timeout: can I reach the server at all?
* First-byte timeout: the server accepted the request but never started responding (a common stall symptom).
* Total timeout: a wall-clock cap per request.

Also add a restart backoff to avoid crash loops.

==== Restart semantics ====

Lock this down explicitly. If the server restarts, do you:

* mark all in-flight requests as FAILED(restarted) immediately, or
* try to replay them (dangerous unless your prompts/tools are deterministic and idempotent)?

For v1, I strongly recommend failing in-flight requests with a clear error.

==== Streaming hooks ====

Even if you only expose polling now, build the internal plumbing so you can later add:

* partial tokens,
* “tail” polling (get the latest chunk since an offset),
* or callback hooks.

If you don’t, tool calls + long responses can feel opaque.

==== Observability ====

The minimum that saves you hours:

* the last N log lines from the subprocess, accessible via get_debug_info()
* per-request timing: queued_ms, run_ms, tool_ms, retries
* health metrics: restart_count, last_healthy_time, consecutive_failures

==== Config introspection ====

Expose:

* model path/name
* context length, gpu-layers, batch settings, etc.
* a config hash, so the higher-level system can answer “what worker is this?”

This matters when you have many instances and need to correlate behavior.

==== Port management ====

If multiple instances are created/destroyed dynamically, the module should either:

* accept a port (caller-managed), or
* include a reliable port allocator + bind check (a small allocator sketch appears at the end of this note).

Also decide whether you bind to 127.0.0.1 only (recommended).

==== Tool-call requirements ====

Tool support is where hidden requirements show up. Define:

* the tool schema format (OpenAI tool schema? custom?)
* max tool recursion / max tool calls per request
* tool timeout + retry policy
* tool result size limits (avoid stuffing huge payloads back into the context)
* idempotency expectations (important if you ever retry a tool call)

Also: persist enough per-request state that tool resumes are correct even if you poll later.

==== Concurrency model ====

If you want max_concurrent_requests > 1, decide whether that means:

* multiple simultaneous HTTP requests to one server process (depends on the server’s parallelism support), or
* actually spawning multiple worker processes for true parallel throughput.

Even if you choose the former, keep a switch so you can move to the latter without API changes.

==== Non-goals for v1 ====

It can help to say explicitly what is not in scope for v1:

* automatic model swapping inside an instance
* replaying in-flight requests across restarts
* global scheduling across workers (belongs in a higher-level manager)
* distributed tool execution (keep it in-process initially)

==== Suggested API surface ====

* start(), stop(), restart()
* submit(system, user, *, tools_enabled=True, request_params={...}) -> id
* get_status(id) -> {state, progress?, error?, timestamps...}
* get_result(id) -> {text, tool_trace?, usage?}
* cancel(id)
* get_worker_status() -> {active, healthy, in_flight, queued, restarts, last_error...}
* get_debug_info() -> logs, config_hash, last_health_checks...

If you lock those items down now, you’ll avoid the common rewrites later. When you’re ready to move to code, we can turn this into a concrete state machine + class layout (worker + request objects + tool runner) with clear boundaries.
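For the port-management point above, a small bind-check allocator is usually enough to start with. This is a standard-library sketch, and the function name allocate_port is made up here:

<syntaxhighlight lang="python">
import socket

def allocate_port(host: str = "127.0.0.1") -> int:
    # Bind to port 0 so the OS picks a free port, then release it.
    # There is an inherent race between closing this socket and the
    # llama server binding the port, so treat a later bind failure
    # as "retry with a freshly allocated port".
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]
</syntaxhighlight>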
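And as a head start on that class layout, here is a minimal skeleton showing how the API surface, the bounded queue, and the concurrency semaphore could fit together. Everything here (names, defaults, the dict-based request records) is an assumption to be revised, and the subprocess/dispatcher internals are stubbed out:

<syntaxhighlight lang="python">
# Illustrative skeleton only: names, defaults, and record shapes are
# assumptions, not a fixed design. Server + dispatcher logic is stubbed.
import queue
import threading
import uuid

class QueueFullError(Exception):
    """Raised when the request queue is already at max_queue_depth."""

class LlamaWorker:
    """One worker = one llama server subprocess + one request queue."""

    def __init__(self, config, max_concurrent_requests=1, max_queue_depth=32):
        self.config = config
        self._queue = queue.Queue(maxsize=max_queue_depth)  # backpressure
        self._slots = threading.Semaphore(max_concurrent_requests)
        self._requests = {}  # request_id -> mutable request record
        self._lock = threading.Lock()

    def start(self):
        """Spawn the subprocess + dispatcher thread (stubbed here)."""
        ...

    def stop(self):
        """Graceful shutdown, then hard kill; always free VRAM (stubbed)."""
        ...

    def submit(self, system, user, *, tools_enabled=True, request_params=None):
        """Enqueue a request and return its id immediately."""
        request_id = str(uuid.uuid4())
        record = {"id": request_id, "state": "QUEUED",
                  "system": system, "user": user,
                  "tools_enabled": tools_enabled,
                  "params": request_params or {}}
        with self._lock:
            self._requests[request_id] = record
        try:
            self._queue.put_nowait(record)  # reject instead of blocking
        except queue.Full:
            with self._lock:
                del self._requests[request_id]
            raise QueueFullError("request queue is at max_queue_depth")
        return request_id

    # Polling / control surface, matching the list above (all stubbed):
    def get_status(self, request_id): ...
    def get_result(self, request_id): ...
    def cancel(self, request_id): ...
    def get_worker_status(self): ...
    def get_debug_info(self): ...
</syntaxhighlight>

The dispatcher thread would then pop records, acquire a semaphore slot, drive the HTTP call plus the tool loop, and apply the transition table sketched earlier.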