==== This section describes a clean internal structure for implementing the design without accreting cruft. The guiding rule is: keep supervision, transport, prompting, parsing, and state accounting in separate, testable components. ====

===== A small package (directory) with a few focused modules: =====

* llama_worker/types.py
** All public TypedDict/dataclasses/enums/protocols from Appendix A.
** No logic.
* llama_worker/worker.py
** LlamaWorker public API implementation.
** Wires together the other components.
** Owns the request table + slot semaphore.
** No low-level subprocess details, no prompt-formatting details.
* llama_worker/process.py
** Subprocess lifecycle:
*** spawn llama-server (process group/session)
*** stop/kill the group reliably
*** capture logs into a ring buffer
** Exposes a minimal interface: start(), stop(), pid, is_alive(), recent_logs().
* llama_worker/transport.py
** HTTP client logic to talk to llama-server.
** Implements:
*** readiness probe: GET /v1/models
*** POST /v1/chat/completions streaming request/response handling
** Does not know about slots, tools, BIOS, or orchestration.
* llama_worker/prompting.py
** BIOS prompt generation adapter + message-stack assembly.
** Contains the build_message_stack() implementation.
** May contain helpers for combining multiple system prompts if needed.
* llama_worker/tooling.py
** Normal tool-loop machinery:
*** detect tool call vs final output
*** invoke ToolRunner
*** append tool-result messages
*** decrement the tool budget
** Also includes the fallback tool-parsing strategy.
* llama_worker/exit_signals.py
** Parsing and recording of exit-tool calls into signals[].
** Must be explicitly non-terminating (records only).
* llama_worker/liveness.py
** Prefill-safe liveness probes:
*** /proc/<pid> CPU-time delta
*** process-alive checks
** Returns timestamps or "evidence of life" booleans.
* llama_worker/timeouts.py
** Implements timeout bookkeeping based on:
*** connect/header timeouts (transport-level)
*** progress timestamps (last_stream_byte_at, last_liveness_at)
** Returns "should_restart" decisions (policy evaluation), but does not perform the restart itself.
* llama_worker/loopdetect.py
** Repeated-line detector.
** Pure incremental API: feed text chunks, ask "triggered?".
* llama_worker/util.py
** Small shared utilities (ring buffer, monotonic time helpers, etc.).
** Keep tiny.

This structure ensures each concern can be unit-tested in isolation.

===== LlamaWorker (in worker.py) should be mostly orchestration glue: =====

* Holds:
** ProcessSupervisor (from process.py)
** TransportClient (from transport.py)
** BiosProvider and prompt assembly (from prompting.py)
** ToolLoopRunner (from tooling.py)
** ExitSignalParser (from exit_signals.py)
** LivenessProbe (from liveness.py)
** TimeoutEvaluator (from timeouts.py)
** a RepeatedLineDetector per request (from loopdetect.py)
** the request table: dict[int, RequestRecord]
** the slot semaphore: asyncio.Semaphore(slots)
* Owns the only long-lived background tasks:
** an optional supervisor task that watches subprocess death and restart policy
** an optional periodic readiness check (low frequency, only when needed)

Everything else should be invoked only when work arrives (to avoid wasting CPU).
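To make the wiring concrete, here is a minimal sketch of that glue, assuming the component classes live in the sibling modules listed above. The RequestRecord shape, the constructor arguments, and the submit() signature shown here are illustrative placeholders, not the final API.

<syntaxhighlight lang="python">
from __future__ import annotations

import asyncio
import itertools
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RequestRecord:                       # stand-in for the Appendix A type
    id: int
    state: str = "running"
    signals: list[dict[str, Any]] = field(default_factory=list)
    fail_reason: str | None = None
    task: asyncio.Task | None = None


class LlamaWorker:
    """Owns the request table and slot semaphore; delegates everything else."""

    def __init__(self, *, slots: int, process, transport, bios_provider, tool_loop):
        self._process = process              # process.py: ProcessSupervisor
        self._transport = transport          # transport.py: TransportClient
        self._bios_provider = bios_provider  # prompting.py: BiosProvider
        self._tool_loop = tool_loop          # tooling.py: ToolLoopRunner
        self._slot_sem = asyncio.Semaphore(slots)       # primary invariant
        self._requests: dict[int, RequestRecord] = {}   # request table
        self._ids = itertools.count(1)

    async def submit(self, messages: list[dict[str, Any]]) -> int | str:
        # Admission control: reject immediately if every slot is busy.
        # NO_SLOT_AVAILABLE would be the types.py enum member; a string stands in here.
        if self._slot_sem.locked():
            return "NO_SLOT_AVAILABLE"
        await self._slot_sem.acquire()       # returns at once: a slot is free
        request_id = next(self._ids)
        record = self._requests[request_id] = RequestRecord(id=request_id)
        # One asyncio Task per request; keep a handle so cancel() can find it.
        record.task = asyncio.create_task(self._run_request(record, messages))
        return request_id

    async def _run_request(self, record: RequestRecord, messages) -> None:
        try:
            ...  # assemble the prompt, stream the completion, run the tool loop
            record.state = "completed"
        except asyncio.CancelledError:
            record.state, record.fail_reason = "canceled", "canceled"
            raise
        except Exception as exc:
            record.state, record.fail_reason = "failed", str(exc)
        finally:
            self._slot_sem.release()         # never leak a slot
</syntaxhighlight>

The one invariant enforced directly here is the slot semaphore: it is acquired before the per-request task is created and released in a finally: block inside the task, so a crash or cancellation cannot leak a slot. get_result() and cancel() are omitted but would operate on the same request table.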
===== A request is run by a single asyncio Task created on submit(): =====

# Admission control
#* Try to acquire a slot from the semaphore immediately.
#* If none is available, return NO_SLOT_AVAILABLE.
# Assemble the prompt
#* Call the BIOS generator (bios_provider(ctx)) via prompting.py.
#* Build the message stack via build_message_stack().
# Dispatch the streaming completion
#* Use TransportClient.stream_chat_completions(...).
#* Update last_stream_byte_at on any bytes received after the headers.
#* Accumulate the full output text.
#* Feed chunks into the RepeatedLineDetector.
# Parse tool calls
#* If the server yields a tool call, pass it to ToolLoopRunner:
#** execute the tool via ToolRunner
#** append the tool-result message
#** regenerate the BIOS (updated tool budget)
#** continue generation
#* Exit-tools: if an exit-tool call is detected at any point:
#** record it into signals[]
#** continue normally (non-terminating)
# Finish
#* Terminal states: completed/failed/canceled.
#* get_result() returns the result and releases the stored request record.

===== Slots, cancellation, and result retrieval: =====

* Slots
** Implemented as asyncio.Semaphore(slots).
** Always release the semaphore in a finally: block.
** The slot count is the primary invariant; tests should ensure no leaks.
* Cancellation
** cancel(request_id) cancels the asyncio task for that request.
** Transport streaming must be cancellation-friendly (close the stream promptly).
** The request transitions to CANCELED with fail_reason="canceled".
* Result retrieval = release
** get_result() pops the request record from the table (or marks it as released).
** Subsequent lookups return NOT_FOUND.

===== Nuke & repave is implemented at the worker level: =====

* Restart triggers:
** the subprocess exits
** the readiness probe fails repeatedly
** the timeout evaluator says "stalled" (no progress/liveness for too long)
* On restart:
** fail all in-flight requests with fail_reason="worker_restarted" (no replay)
** stop the process group, start a new subprocess, wait for readiness (GET /v1/models)
** transition the worker state accordingly

Crash-loop protection lives in timeouts.py (policy) plus the worker's restart gatekeeping (mechanics).

===== transport.py owns these details: =====

* Readiness probe:
** GET /v1/models → parse JSON → ok/not ok
* Streaming:
** Accepts the request payload and yields decoded events: raw bytes, structured "delta text", or tool_call payloads (depending on how you decide to parse).
** The worker layer treats any bytes after the headers as progress and does not need to know SSE details.
** Keep the transport tolerant:
*** allow keepalive lines
*** handle partial JSON frames
*** raise clear exceptions for irrecoverable protocol errors

Keep "how to parse llama-server's stream" in one place so it is easy to adjust.
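As an illustration of how little the readiness probe needs, here is a minimal sketch. It assumes httpx as the async HTTP client (the design does not mandate a library), and the is_ready() name and 5-second timeout are placeholders; only the GET /v1/models check itself comes from the design above.

<syntaxhighlight lang="python">
import httpx


class TransportClient:
    def __init__(self, base_url: str) -> None:
        # Connect/read timeouts live here, at the transport level.
        self._client = httpx.AsyncClient(base_url=base_url, timeout=httpx.Timeout(5.0))

    async def is_ready(self) -> bool:
        """GET /v1/models -> parse JSON -> ok/not ok. Never raises on 'not ready'."""
        try:
            response = await self._client.get("/v1/models")
            if response.status_code != 200:
                return False
            payload = response.json()
            # llama-server answers with an OpenAI-style model list; treat any
            # well-formed response containing a "data" array as ready.
            return isinstance(payload, dict) and isinstance(payload.get("data"), list)
        except (httpx.HTTPError, ValueError):
            # Connection refused, timeouts, or malformed JSON all mean "not ready yet".
            return False

    async def aclose(self) -> None:
        await self._client.aclose()
</syntaxhighlight>

worker.py would poll is_ready() after spawning the subprocess and after each restart, until it returns True or the startup deadline expires.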
===== prompting.py must remain "pure-ish": =====

* BiosProvider(ctx) -> str is injected and unit-testable.
* build_message_stack(...) -> list[ChatMessage] is a pure function with unit tests.
* BIOS generation should not touch:
** subprocess state
** HTTP logic
** request-table mutation

This makes it safe to evolve the BIOS rules without breaking transport.

===== When adding something new, put it in the smallest responsible layer: =====

* New timeout rule → timeouts.py + tests
* New stream-parsing quirk → transport.py + tests
* New BIOS fields → prompting.py + tests
* New tool behavior → tooling.py + tests
* New signal schema → exit_signals.py + tests

Avoid "just add an if in worker.py". That's how the module becomes unmaintainable.

===== A test suite with one focused module per concern: =====

* tests/test_prompting.py
** BIOS content, ordering, formatting stability
* tests/test_loopdetect.py
** repeated-line detector cases (positive + non-trigger)
* tests/test_timeouts.py
** stall decisions using synthetic timestamps
** crash-loop/backoff behavior
* tests/test_tooling.py
** tool-loop state machine
** fallback parsing behavior
* tests/test_exit_signals.py
** signals are recorded and are non-terminating
* tests/test_process.py
** process-group teardown (using a dummy child process tree)
* tests/test_worker_integration.py
** a stub HTTP server simulating llama-server behaviors:
*** slow headers
*** long silence, then streaming
*** stall
*** disconnect
*** tool-call emission
*** exit-tool emission
** verifies slot behavior, restart behavior, and get_result() release semantics

This provides high confidence without needing heavyweight infrastructure.

If you're ready, the next natural section is Appendix C: State machines and invariants (explicit invariants for slots, request lifecycle, restart behavior), so the implementation has a tight correctness target.
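As a concrete starting point for loopdetect.py and tests/test_loopdetect.py above, here is a minimal sketch covering the positive and non-trigger cases. The feed()/triggered() method names and the repeat threshold are illustrative assumptions; only the "feed chunks, ask triggered?" shape comes from the layout described earlier.

<syntaxhighlight lang="python">
class RepeatedLineDetector:
    """Feed streamed text chunks; report when the same line repeats too often."""

    def __init__(self, threshold: int = 8) -> None:
        self._threshold = threshold
        self._buffer = ""          # current, not-yet-terminated line
        self._last_line = None
        self._run_length = 0
        self._triggered = False

    def feed(self, chunk: str) -> None:
        self._buffer += chunk
        *complete, self._buffer = self._buffer.split("\n")
        for line in complete:
            if line and line == self._last_line:
                self._run_length += 1
            else:
                self._last_line, self._run_length = line, 1
            if self._run_length >= self._threshold:
                self._triggered = True

    def triggered(self) -> bool:
        return self._triggered


def test_repeated_line_triggers():
    detector = RepeatedLineDetector(threshold=3)
    for _ in range(3):
        detector.feed("I am stuck in a loop.\n")
    assert detector.triggered()


def test_varied_output_does_not_trigger():
    detector = RepeatedLineDetector(threshold=3)
    detector.feed("line one\nline two\nline three\nline one\n")
    assert not detector.triggered()


def test_chunk_boundaries_do_not_matter():
    detector = RepeatedLineDetector(threshold=2)
    # The same line split across chunks must still count as a repeat.
    for chunk in ["loop li", "ne\nloop ", "line\n"]:
        detector.feed(chunk)
    assert detector.triggered()
</syntaxhighlight>

Because the detector is pure and incremental, these tests run without any subprocess or HTTP machinery, which is the point of keeping loopdetect.py separate.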