=== llama_worker design overview ===

llama_worker is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single <code>llama-server</code> subprocess from llama.cpp and provides an async, slot-limited, resilient interface for running chat-style inference requests against it. It is meant to be used by a higher-level "hivemind" orchestrator that:

* runs many workers (different models / GPUs / profiles),
* routes jobs between them, and
* prefers nuke & repave (restart a broken worker) over fragile replay.

==== Design goals ====

# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API: explicitly asyncio-native.
# Slot-based concurrency: accept up to <code>slots</code> concurrent in-flight requests; otherwise reject immediately.
# Nuke & repave reliability: detect a dead/stalled/unreachable server, restart the subprocess, and fail in-flight requests with explicit reasons (no replay).
# Long-prefill-friendly: supports workloads where time-to-first-token can be minutes to tens of minutes.
# OpenAI-format tool calling with a pluggable ToolRunner, plus a fallback tool-call parsing method.
# BIOS prompt layer: inject stable platform-wide instructions (hivemind context) plus runtime metadata (date/time, budgets, etc.).
# One-way exit-tools (control signals): the model can emit structured signals upward; the worker records them but does not alter control flow.
# Simple early loop kill: a repeated-line detector as a supplement to max token limits.
# Forward-compatible parameters: allow passing arbitrary generation parameters without module changes.
# Separation-of-concerns rule: prompt generation (especially BIOS) must be isolated, testable, and independently modifiable.

==== Non-goals ====

* In-place model swapping or reconfiguration of a running worker.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (belongs in the orchestrator).
* Heavy output post-processing.

==== Async model ====

The module is asyncio-native.

* Public methods are <code>async def</code> and expected to be called from an asyncio event loop.
* Thread-safety is not a v1 requirement; keep calls within a consistent async context.

==== Process supervision ====

Launch llama-server in its own process group/session.

* stop() must ensure no orphaned processes: SIGTERM to the process group → wait → SIGKILL the process group if needed.
* Capture stdout/stderr into a bounded ring buffer for debugging.
* The port is assigned externally and passed in config.

==== Slot-based concurrency ====

The worker has <code>slots: int</code>.

* A slot is permission to have one request "in flight" (best mapping: one concurrent HTTP streaming request).
* If all slots are in use, submit() returns immediately with NO_SLOT_AVAILABLE.
* No internal queue by default.

==== Request identity ====

Request IDs are incrementing integers (1, 2, 3, … per worker lifetime).

* Each request includes a caller-provided <code>job_name</code> for correlation.

==== Public API ====

===== Lifecycle =====

* <code>async start() -> None</code>
* <code>async stop() -> None</code>
* (internal) <code>async restart(reason: str) -> None</code>

===== Per-request =====

* <code>async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult</code>
* <code>async get_status(request_id: int) -> RequestStatus</code>
* <code>async get_result(request_id: int) -> RequestResult | NotReady</code>
* <code>async cancel(request_id: int) -> bool</code>
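A minimal sketch of how this lifecycle and per-request surface could be typed; the <code>LlamaWorkerProtocol</code> name and the <code>Any</code> placeholders standing in for the result types are illustrative assumptions, not part of the spec:

<syntaxhighlight lang="python">
from __future__ import annotations

from typing import Any, Mapping, Protocol

# Placeholder aliases: the real module is expected to define concrete
# SubmitResult, RequestStatus, RequestResult and NotReady types; "Any"
# is used here only to keep the sketch self-contained.
SubmitResult = Any
RequestStatus = Any
RequestResult = Any
NotReady = Any


class LlamaWorkerProtocol(Protocol):
    """Illustrative structural type for the public surface described above."""

    # Lifecycle
    async def start(self) -> None: ...
    async def stop(self) -> None: ...

    # Per-request API
    async def submit(
        self,
        job_name: str,
        system_prompt: str,
        user_prompt: str,
        *,
        params: Mapping[str, Any] | None = None,
    ) -> SubmitResult: ...

    async def get_status(self, request_id: int) -> RequestStatus: ...
    async def get_result(self, request_id: int) -> RequestResult | NotReady: ...
    async def cancel(self, request_id: int) -> bool: ...
</syntaxhighlight>

Using a Protocol here is one option for keeping the orchestrator decoupled from the concrete worker class; the spec itself only requires the coroutine signatures.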
Result retrieval releases resources (explicit decision):

* Calling get_result(request_id) when the request is terminal returns the result and releases all stored state/output for that request.
* After a successful get_result(), subsequent get_status/get_result calls for that request_id should return a stable "unknown/released" response.

===== Worker-level =====

* <code>async get_worker_status() -> WorkerStatus</code>
* <code>async get_debug_info() -> WorkerDebugInfo</code>

==== Generation parameters ====

params is an open mapping passed through to llama-server's OpenAI-compatible request body.

* The worker merges in the fields it controls (messages/tools/stream) and passes unknown keys through untouched.

==== BIOS prompt layer ====

The BIOS prompt is worker-owned and regenerated at request start and before each post-tool continuation. It includes:

* universal platform/hivemind guidance
* current date/time + timezone
* tool-iteration budget remaining and constraints
* instructions for using tools and exit-tools

===== Message ordering =====

# BIOS system prompt
# caller system prompt
# conversation messages

==== Tool calling ====

Tools use the OpenAI function-calling schema. The worker:

* exposes tools,
* detects tool calls (structured, or via fallback parsing),
* executes them via ToolRunner,
* appends tool-result messages, and
* continues generation until completion or the tool-iteration budget is exhausted.

Fallback tool parsing is allowed via BIOS-enforced structured JSON conventions if native tool_calls is unreliable.

==== Exit-tools (one-way signals) ====

Explicit decision: exit-tools never terminate output or alter control flow beyond what the model itself does.

* Exit-tools are provided at worker init as OpenAI-format function tool definitions.
* The worker includes them so the model knows its signaling options.
* When the model emits an exit-tool call, the worker records it into signals[].
* The orchestrator may react (including choosing to cancel externally), but the worker itself does not change behavior.

==== Streaming and progress tracking ====

The worker always uses streaming internally.

* Progress definition: any response data flowing after headers counts as progress.
* Track:
** last_stream_byte_at
** last_liveness_at (prefill probes)
** last_progress_at = max(last_stream_byte_at, last_liveness_at)
* Prefill liveness baseline:
** subprocess alive
** /proc/<pid> CPU time delta

==== Timeout profile ====

Timeouts come from a worker-level profile; there are no per-request overrides. The profile includes:

* connect timeout
* headers timeout
* TTFT timeout (typically disabled/None)
* prefill liveness timeout (large/None)
* idle stream timeout (no bytes)
* optional absolute timeout
* probe interval
* restart backoff + crash-loop limits

Recovery is nuke & repave; in-flight requests fail with explicit reasons (no replay).

==== Output limits and loop kill ====

* Primary: max_tokens (passed via params, default set per worker profile).
* Secondary: a repeated-line detector cancels the request on clear degenerate repetition.

==== Output handling ====

* The full output is accumulated while the request is running.
* On successful <code>get_result()</code>, the request's stored output, tool trace, and signals are released immediately.

==== Status and debug reporting ====

* WorkerStatus: active/healthy/restarting, slot usage, active request IDs, restart count, last error.
* RequestStatus: state, timestamps, output length so far, last_progress_at, tool iterations remaining, recorded signals.
* DebugInfo: bounded subprocess logs + recent restart reasons.

==== Separation of concerns ====

In places likely to evolve (especially BIOS prompt generation and prompt assembly), the implementation must be separated into distinct methods/components, not embedded inline inside request execution or transport code. At minimum (see the sketch below):

* BIOS prompt creation is a distinct method/component.
* Message-stack assembly (BIOS + caller system + conversation) is a distinct method/component.
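A minimal sketch of what that separation could look like; the function names, the <code>BiosContext</code> fields, and the exact BIOS formatting are illustrative assumptions, not part of the spec:

<syntaxhighlight lang="python">
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class BiosContext:
    """Structured inputs for BIOS generation (field names are assumptions)."""
    hivemind_guidance: str          # stable platform-wide instructions
    now: datetime                   # current date/time (timezone-aware)
    tool_iterations_remaining: int  # budget surfaced to the model
    tool_instructions: str          # how to use tools and exit-tools


def build_bios_prompt(ctx: BiosContext) -> str:
    """Build the worker-owned BIOS system prompt from structured inputs.

    Kept as a pure function so it can be unit-tested in isolation
    (required fields present, formatting stable, budget updates correct).
    """
    return "\n".join(
        [
            "[BIOS v1]",  # version tag for formatting-stability tests (assumed)
            ctx.hivemind_guidance,
            f"Current time: {ctx.now.isoformat()} ({ctx.now.tzname() or 'unknown tz'})",
            f"Tool iterations remaining: {ctx.tool_iterations_remaining}",
            ctx.tool_instructions,
        ]
    )


def assemble_messages(
    bios_prompt: str,
    caller_system_prompt: str,
    conversation: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Assemble the message stack in the documented order:
    BIOS system prompt, then caller system prompt, then conversation."""
    return [
        {"role": "system", "content": bios_prompt},
        {"role": "system", "content": caller_system_prompt},
        *conversation,
    ]
</syntaxhighlight>

Keeping both pieces as pure functions of structured inputs also satisfies the "policy separate from mechanics" rule below: prompt content is generated from data, not assembled inline in transport code.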
===== When adding features or handling edge cases =====

# Identify which concern the change belongs to, and keep it inside that layer:
#* Process supervision (start/stop/restart, orphan prevention)
#* Transport (HTTP request/stream parsing, retries)
#* Prompting (BIOS generation, message ordering, tool/exit-tool exposure)
#* Tool loop (detect tool call, execute via ToolRunner, resume)
#* Parsing (tool-call parsing, exit-tool parsing, repeated-line detection)
#* State & accounting (slots, request lifecycle, status reporting)
# Prefer small, testable functions over clever shared logic.
# Keep policy separate from mechanics:
#* Timeout policy and restart policy should be data/config driven.
#* Prompt content should be generated from structured inputs.

===== BIOS generation and message assembly must be unit-testable in isolation =====

Tests should cover:

* required fields present (hivemind guidance, date/time, budgets)
* formatting stability (e.g., a version tag or delimiters, if used)
* correct updates as the tool budget decreases

==== Testing ====

Strong unit and integration tests are prioritized over cleverness. Integration tests should validate that:

* process-group teardown leaves no orphans
* long prefill does not trigger false timeouts
* a disconnect/stall triggers restart and fails in-flight requests
* tool-call parsing works (native + fallback)
* exit-tools are recorded but do not affect control flow
* get_result() releases request state

==== Open recommendations ====

# Health endpoint / readiness check: recommend GET /v1/models, falling back to a tiny warmup completion if needed.
# Post-release status behavior: recommend that get_status() after get_result() return a stable NOT_FOUND / RELEASED code.
# Exit-tools parsing precedence: if both a normal tool call and an exit-tool appear, the recommended default is to process the normal tool loop as usual and record exit-tool signals whenever they occur.

If you want, the next step is producing a compact "Types & Protocols" appendix (Protocols/TypedDicts/dataclasses/enums) that exactly matches this doc, so implementation can proceed test-first without ambiguity.
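As a rough illustration of the kind of definitions such an appendix would pin down, here is a hedged sketch of two of them; the state names and field names are assumptions derived from the status fields described above, not a finalized design:

<syntaxhighlight lang="python">
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class RequestState(str, Enum):
    """Illustrative request lifecycle states; the exact set is an assumption."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    RELEASED = "released"  # reported after get_result() has released state


@dataclass
class WorkerStatus:
    """Worker-level snapshot, mirroring the fields listed in this doc."""
    active: bool
    healthy: bool
    restarting: bool
    slots_total: int
    slots_in_use: int
    active_request_ids: list[int] = field(default_factory=list)
    restart_count: int = 0
    last_error: str | None = None
</syntaxhighlight>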