==== This section describes a clean internal structure for implementing the design without accreting cruft. The guiding rule is: keep supervision, transport, prompting, parsing, and state accounting in separate, testable components. ====

===== A small package (directory) with a few focused modules: =====

* llama_worker/types.py
** All public TypedDict/dataclasses/enums/protocols from Appendix A.
** No logic.
* llama_worker/worker.py
** LlamaWorker public API implementation.
** Wires together the other components.
** Owns the request table + slot semaphore.
** No low-level subprocess details, no prompt-formatting details.
* llama_worker/process.py
** Subprocess lifecycle:
*** spawn llama-server (process group/session)
*** stop/kill the group reliably
*** capture logs into a ring buffer
** Exposes a minimal interface: start(), stop(), pid, is_alive(), recent_logs().
* llama_worker/transport.py
** HTTP client logic to talk to llama-server.
** Implements:
*** readiness probe: GET /v1/models
*** POST /v1/chat/completions streaming request/response handling
** Does not know about slots, tools, BIOS, or orchestration.
* llama_worker/prompting.py
** BIOS prompt generation adapter + message-stack assembly.
** Contains the build_message_stack() implementation.
** May contain helpers for combining multiple system prompts if needed.
* llama_worker/tooling.py
** Normal tool-loop machinery:
*** detect tool call vs final output
*** invoke ToolRunner
*** append tool-result messages
*** decrement the tool budget
** Also includes the fallback tool-parsing strategy.
* llama_worker/exit_signals.py
** Parsing and recording of exit-tool calls into signals[].
** Must be explicitly non-terminating (records only).
* llama_worker/liveness.py
** Prefill-safe liveness probes:
*** /proc/<pid> CPU-time delta
*** process-alive checks
** Returns timestamps or "evidence of life" booleans.
* llama_worker/timeouts.py
** Implements timeout bookkeeping based on:
*** connect/header timeouts (transport-level)
*** progress timestamps (last_stream_byte_at, last_liveness_at)
** Returns "should_restart" decisions (policy evaluation), but does not perform the restart itself.
* llama_worker/loopdetect.py
** Repeated-line detector.
** Pure incremental API: feed text chunks, ask "triggered?".
* llama_worker/util.py
** Small shared utilities (ring buffer, monotonic time helpers, etc.).
** Keep tiny.

This structure ensures each concern can be unit-tested in isolation.

===== LlamaWorker (in worker.py) should be mostly orchestration glue: =====

* Holds:
** ProcessSupervisor (from process.py)
** TransportClient (from transport.py)
** BiosProvider and prompt assembly (from prompting.py)
** ToolLoopRunner (from tooling.py)
** ExitSignalParser (from exit_signals.py)
** LivenessProbe (from liveness.py)
** TimeoutEvaluator (from timeouts.py)
** a RepeatedLineDetector per request (from loopdetect.py)
** the request table: dict[int, RequestRecord]
** the slot semaphore: asyncio.Semaphore(slots)
* Owns the only long-lived background tasks:
** an optional supervisor task that watches subprocess death and restart policy
** an optional periodic readiness check (low frequency, only when needed)

Everything else should be invoked only when work arrives (to avoid wasting CPU).
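To make the wiring concrete, here is a minimal sketch of that glue, assuming the component classes live in the sibling modules listed above. The RequestRecord shape, the constructor arguments, and the submit() signature shown here are illustrative placeholders, not the final API.

<syntaxhighlight lang="python">
from __future__ import annotations

import asyncio
import itertools
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RequestRecord:                       # stand-in for the Appendix A type
    id: int
    state: str = "running"
    signals: list[dict[str, Any]] = field(default_factory=list)
    fail_reason: str | None = None
    task: asyncio.Task | None = None


class LlamaWorker:
    """Owns the request table and slot semaphore; delegates everything else."""

    def __init__(self, *, slots: int, process, transport, bios_provider, tool_loop):
        self._process = process              # process.py: ProcessSupervisor
        self._transport = transport          # transport.py: TransportClient
        self._bios_provider = bios_provider  # prompting.py: BiosProvider
        self._tool_loop = tool_loop          # tooling.py: ToolLoopRunner
        self._slot_sem = asyncio.Semaphore(slots)       # primary invariant
        self._requests: dict[int, RequestRecord] = {}   # request table
        self._ids = itertools.count(1)

    async def submit(self, messages: list[dict[str, Any]]) -> int | str:
        # Admission control: reject immediately if every slot is busy.
        # NO_SLOT_AVAILABLE would be the types.py enum member; a string stands in here.
        if self._slot_sem.locked():
            return "NO_SLOT_AVAILABLE"
        await self._slot_sem.acquire()       # returns at once: a slot is free
        request_id = next(self._ids)
        record = self._requests[request_id] = RequestRecord(id=request_id)
        # One asyncio Task per request; keep a handle so cancel() can find it.
        record.task = asyncio.create_task(self._run_request(record, messages))
        return request_id

    async def _run_request(self, record: RequestRecord, messages) -> None:
        try:
            ...  # assemble the prompt, stream the completion, run the tool loop
            record.state = "completed"
        except asyncio.CancelledError:
            record.state, record.fail_reason = "canceled", "canceled"
            raise
        except Exception as exc:
            record.state, record.fail_reason = "failed", str(exc)
        finally:
            self._slot_sem.release()         # never leak a slot
</syntaxhighlight>

The one invariant enforced directly here is the slot semaphore: it is acquired before the per-request task is created and released in a finally: block inside the task, so a crash or cancellation cannot leak a slot. get_result() and cancel() are omitted but would operate on the same request table.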
===== A request is run by a single asyncio Task created on submit(): =====

# Admission control
#* Try to acquire a slot from the semaphore immediately.
#* If none is available, return NO_SLOT_AVAILABLE.
# Assemble the prompt
#* Call the BIOS generator (bios_provider(ctx)) via prompting.py.
#* Build the message stack via build_message_stack().
# Dispatch the streaming completion
#* Use TransportClient.stream_chat_completions(...).
#* Update last_stream_byte_at on any bytes received after the headers.
#* Accumulate the full output text.
#* Feed chunks into the RepeatedLineDetector.
# Parse tool calls
#* If the server yields a tool call, pass it to ToolLoopRunner:
#** execute the tool via ToolRunner
#** append the tool-result message
#** regenerate the BIOS (updated tool budget)
#** continue generation
#* Exit-tools: if an exit-tool call is detected at any point:
#** record it into signals[]
#** continue normally (non-terminating)
# Finish
#* Terminal states: completed/failed/canceled.
#* get_result() returns the result and releases the stored request record.

===== Slots, cancellation, and result retrieval: =====

* Slots
** Implemented as asyncio.Semaphore(slots).
** Always release the semaphore in a finally: block.
** The slot count is the primary invariant; tests should ensure no leaks.
* Cancellation
** cancel(request_id) cancels the asyncio task for that request.
** Transport streaming must be cancellation-friendly (close the stream promptly).
** The request transitions to CANCELED with fail_reason="canceled".
* Result retrieval = release
** get_result() pops the request record from the table (or marks it as released).
** Subsequent lookups return NOT_FOUND.

===== Nuke & repave is implemented at the worker level: =====

* Restart triggers:
** the subprocess exits
** the readiness probe fails repeatedly
** the timeout evaluator says "stalled" (no progress/liveness for too long)
* On restart:
** fail all in-flight requests with fail_reason="worker_restarted" (no replay)
** stop the process group, start a new subprocess, wait for readiness (GET /v1/models)
** transition the worker state accordingly

Crash-loop protection lives in timeouts.py (policy) plus the worker's restart gatekeeping (mechanics).

===== transport.py owns these details: =====

* Readiness probe:
** GET /v1/models → parse JSON → ok/not ok
* Streaming:
** Accepts the request payload and yields decoded events: raw bytes, structured "delta text", or tool_call payloads (depending on how you decide to parse).
** The worker layer treats any bytes after the headers as progress and does not need to know SSE details.
** Keep the transport tolerant:
*** allow keepalive lines
*** handle partial JSON frames
*** raise clear exceptions for irrecoverable protocol errors

Keep "how to parse llama-server's stream" in one place so it is easy to adjust.
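As an illustration of how little the readiness probe needs, here is a minimal sketch. It assumes httpx as the async HTTP client (the design does not mandate a library), and the is_ready() name and 5-second timeout are placeholders; only the GET /v1/models check itself comes from the design above.

<syntaxhighlight lang="python">
import httpx


class TransportClient:
    def __init__(self, base_url: str) -> None:
        # Connect/read timeouts live here, at the transport level.
        self._client = httpx.AsyncClient(base_url=base_url, timeout=httpx.Timeout(5.0))

    async def is_ready(self) -> bool:
        """GET /v1/models -> parse JSON -> ok/not ok. Never raises on 'not ready'."""
        try:
            response = await self._client.get("/v1/models")
            if response.status_code != 200:
                return False
            payload = response.json()
            # llama-server answers with an OpenAI-style model list; treat any
            # well-formed response containing a "data" array as ready.
            return isinstance(payload, dict) and isinstance(payload.get("data"), list)
        except (httpx.HTTPError, ValueError):
            # Connection refused, timeouts, or malformed JSON all mean "not ready yet".
            return False

    async def aclose(self) -> None:
        await self._client.aclose()
</syntaxhighlight>

worker.py would poll is_ready() after spawning the subprocess and after each restart, until it returns True or the startup deadline expires.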
===== prompting.py must remain "pure-ish": =====

* BiosProvider(ctx) -> str is injected and unit-testable.
* build_message_stack(...) -> list[ChatMessage] is a pure function with unit tests.
* BIOS generation should not touch:
** subprocess state
** HTTP logic
** request-table mutation

This makes it safe to evolve the BIOS rules without breaking transport.

===== When adding something new, put it in the smallest responsible layer: =====

* New timeout rule → timeouts.py + tests
* New stream-parsing quirk → transport.py + tests
* New BIOS fields → prompting.py + tests
* New tool behavior → tooling.py + tests
* New signal schema → exit_signals.py + tests

Avoid "just add an if in worker.py". That's how the module becomes unmaintainable.

===== A test suite with one focused module per concern: =====

* tests/test_prompting.py
** BIOS content, ordering, formatting stability
* tests/test_loopdetect.py
** repeated-line detector cases (positive + non-trigger)
* tests/test_timeouts.py
** stall decisions using synthetic timestamps
** crash-loop/backoff behavior
* tests/test_tooling.py
** tool-loop state machine
** fallback parsing behavior
* tests/test_exit_signals.py
** signals are recorded and are non-terminating
* tests/test_process.py
** process-group teardown (using a dummy child process tree)
* tests/test_worker_integration.py
** a stub HTTP server simulating llama-server behaviors:
*** slow headers
*** long silence, then streaming
*** stall
*** disconnect
*** tool-call emission
*** exit-tool emission
** verifies slot behavior, restart behavior, and get_result() release semantics

This provides high confidence without needing heavyweight infrastructure.

If you're ready, the next natural section is Appendix C: State machines and invariants (explicit invariants for slots, request lifecycle, restart behavior), so the implementation has a tight correctness target.
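As a concrete starting point for loopdetect.py and tests/test_loopdetect.py above, here is a minimal sketch covering the positive and non-trigger cases. The feed()/triggered() method names and the repeat threshold are illustrative assumptions; only the "feed chunks, ask triggered?" shape comes from the layout described earlier.

<syntaxhighlight lang="python">
class RepeatedLineDetector:
    """Feed streamed text chunks; report when the same line repeats too often."""

    def __init__(self, threshold: int = 8) -> None:
        self._threshold = threshold
        self._buffer = ""          # current, not-yet-terminated line
        self._last_line = None
        self._run_length = 0
        self._triggered = False

    def feed(self, chunk: str) -> None:
        self._buffer += chunk
        *complete, self._buffer = self._buffer.split("\n")
        for line in complete:
            if line and line == self._last_line:
                self._run_length += 1
            else:
                self._last_line, self._run_length = line, 1
            if self._run_length >= self._threshold:
                self._triggered = True

    def triggered(self) -> bool:
        return self._triggered


def test_repeated_line_triggers():
    detector = RepeatedLineDetector(threshold=3)
    for _ in range(3):
        detector.feed("I am stuck in a loop.\n")
    assert detector.triggered()


def test_varied_output_does_not_trigger():
    detector = RepeatedLineDetector(threshold=3)
    detector.feed("line one\nline two\nline three\nline one\n")
    assert not detector.triggered()


def test_chunk_boundaries_do_not_matter():
    detector = RepeatedLineDetector(threshold=2)
    # The same line split across chunks must still count as a repeat.
    for chunk in ["loop li", "ne\nloop ", "line\n"]:
        detector.feed(chunk)
    assert detector.triggered()
</syntaxhighlight>

Because the detector is pure and incremental, these tests run without any subprocess or HTTP machinery, which is the point of keeping loopdetect.py separate.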