=== Below are the main “spec gaps” I see, plus a concrete default choice for each so you can lock them down without adding complexity. ===

==== Under-specified: Are submit/get_status/get_result safe to call from multiple threads? Is the module asyncio-native? ====
Recommendation (simple + robust):
* Make the worker asyncio-native internally, but expose a thread-safe sync facade if needed.
* At minimum, specify “public methods are thread-safe” or “caller must stay on one event loop”.

==== Under-specified: Are you targeting llama-server (llama.cpp) or llama-cpp-python[server]? Their OpenAI compatibility can differ across versions. ====
Recommendation:
* Treat the server as an OpenAI-compatible HTTP endpoint and keep the server command fully configurable.
* Define the required endpoints you rely on:
** POST /v1/chat/completions (streaming supported)
** GET /v1/models (health/ready check)
* If /v1/models doesn’t exist, allow a configurable health URL or a tiny completion warmup.

==== Under-specified: SSE vs raw chunks, JSON framing differences. ====
Recommendation:
* Specify that the worker supports SSE “data:” JSON lines and tolerates keepalive/comment lines.
* Define what “progress” means: “any received bytes after headers count as stream progress”.

==== Under-specified: If the backend doesn’t emit tool_calls formally, how do you detect tool calls? ====
Recommendation (robust without replay):
* Prefer native tool_calls when present.
* Otherwise, fall back to a structured prompting convention plus a strict JSON parse (still OpenAI-shaped), controlled by BIOS instructions.
* If tool-call parsing fails, treat the request as FAILED(reason="tool_parse_error") (or emit a control signal requesting escalation).

==== Under-specified: ID format, how long results persist, GC behavior. ====
Recommendation:
* IDs: monotonic integer or UUID (integer is cheaper; UUID is easier for distributed logging).
* Results persist until get_result() or release().
* Optional TTL-based GC is disabled by default; enable it only if you see abandoned requests.

==== Under-specified: Does the worker allocate ports, or are they assigned externally? ====
Recommendation:
* The port is provided by the caller (simplest in multi-worker orchestrators).
* The worker validates port availability at start; if the bind fails, it marks WORKER_NOT_READY with an explicit error.

==== Under-specified: Ensuring the subprocess and any children die. ====
Recommendation:
* Start the subprocess in its own process group/session.
* On stop: SIGTERM → wait → SIGKILL the process group.
* This avoids orphaned workers consuming VRAM.

==== Under-specified: Does a slot map to “one concurrent HTTP request” or to the server’s internal parallel slots? ====
Recommendation:
* Define slots as “max concurrent requests the worker will dispatch”.
* Don’t assume the server truly parallelizes well; slots is an admission-control knob, not a throughput guarantee.

==== Under-specified: Besides max_tokens, what parameters are supported (temperature, top_p, stop, seed, etc.)? ====
Recommendation:
* Define a typed GenerationParams that includes common OpenAI-like fields (all optional): max_tokens, temperature, top_p, stop, seed (if supported), etc.
* The worker passes through fields it knows and ignores/records unsupported fields (for forward compatibility).
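To make the GenerationParams recommendation above concrete, here is a minimal Python sketch, assuming a dataclass-based design. Only the field names (max_tokens, temperature, top_p, stop, seed) come from the notes above; the class layout, the extra dict, and the to_payload() helper are illustrative assumptions, not part of the spec.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field, asdict
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class GenerationParams:
    """Illustrative typed parameter bag; fields mirror common OpenAI-style knobs."""
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    stop: Optional[Sequence[str]] = None
    seed: Optional[int] = None                 # only if the backend supports it
    extra: dict = field(default_factory=dict)  # unknown fields, recorded and forwarded as-is

    def to_payload(self) -> dict:
        """Return only the fields that were explicitly set, ready to merge into
        the POST /v1/chat/completions request body."""
        payload = {k: v for k, v in asdict(self).items()
                   if k != "extra" and v is not None}
        payload.update(self.extra)
        return payload
</syntaxhighlight>

With this shape, GenerationParams(max_tokens=256, temperature=0.2).to_payload() yields only the fields the caller set, so unknown or unsupported parameters never reach the server unless they were passed deliberately via extra.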
==== Under-specified: Exact JSON schema and stable enum codes. ====
Recommendation:
* Lock a tiny stable enum set now (you can expand it later without breaking anything): LOW_CONFIDENCE, NEEDS_EXTERNAL_INFO, NEEDS_HIGHER_REASONER, NEEDS_MANAGER_DECISION, TOOL_LIMIT_REACHED, MAX_TOKENS_REACHED, LOOP_DETECTED, POLICY_RISK.
* Specify the maximum number of signals per request and the “stop on decision request” behavior (already decided).

If you want, I can produce the “next artifact” as a concrete typed interface spec (Protocols + TypedDicts/Enums + dataclasses) that directly matches this doc, ready to implement and test against.
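As a sketch of what that “next artifact” could look like, here is a minimal, non-authoritative Python outline. Only the signal codes, the FAILED(reason="tool_parse_error") convention, and the submit/get_status/get_result names come from the notes above; the class names (ControlSignal, RequestState, LlamaWorker), the exact signatures, and the string values are illustrative assumptions.

<syntaxhighlight lang="python">
from __future__ import annotations

import enum
from typing import Any, Protocol


class ControlSignal(enum.Enum):
    """The eight stable signal codes proposed above, with stable string values."""
    LOW_CONFIDENCE = "LOW_CONFIDENCE"
    NEEDS_EXTERNAL_INFO = "NEEDS_EXTERNAL_INFO"
    NEEDS_HIGHER_REASONER = "NEEDS_HIGHER_REASONER"
    NEEDS_MANAGER_DECISION = "NEEDS_MANAGER_DECISION"
    TOOL_LIMIT_REACHED = "TOOL_LIMIT_REACHED"
    MAX_TOKENS_REACHED = "MAX_TOKENS_REACHED"
    LOOP_DETECTED = "LOOP_DETECTED"
    POLICY_RISK = "POLICY_RISK"


class RequestState(enum.Enum):
    """Illustrative lifecycle states for a submitted request."""
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"  # e.g. carries reason="tool_parse_error"


class LlamaWorker(Protocol):
    """Structural interface a worker implementation would satisfy.
    GenerationParams refers to the dataclass sketched earlier on this page."""

    def submit(self, messages: list[dict[str, Any]],
               params: GenerationParams | None = None) -> str:
        """Enqueue a request and return its ID (UUID or monotonic integer as a string)."""
        ...

    def get_status(self, request_id: str) -> RequestState:
        """Non-blocking status lookup for a previously submitted request."""
        ...

    def get_result(self, request_id: str) -> dict[str, Any]:
        """Return the final OpenAI-shaped result; the result is released afterwards."""
        ...
</syntaxhighlight>

A caller would then depend only on the LlamaWorker protocol, which keeps the concrete asyncio or threaded implementation swappable without touching orchestrator code.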