=== llama_worker design ===

llama_worker is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single <code>llama-server</code> subprocess from llama.cpp and provides an async, slot-limited, resilient interface for running chat-style inference requests against it. It is meant to be used by a higher-level "hivemind" orchestrator that:

* runs many workers (different models / GPUs / profiles),
* routes jobs between them,
* and prefers nuke & repave (restart a broken worker) over fragile replay.

==== Goals ====

# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API: explicitly asyncio-native.
# Slot-based concurrency: accept up to <code>slots</code> concurrent in-flight requests; otherwise reject immediately.
# Nuke & repave reliability: detect a dead/stalled/unreachable server, restart the subprocess, and fail in-flight requests with explicit reasons (no replay).
# Long-prefill-friendly: support workloads where time-to-first-token can be minutes to tens of minutes.
# OpenAI-format tool calling with a pluggable ToolRunner, plus a fallback tool-call parsing method.
# BIOS prompt layer: inject stable platform-wide instructions (hivemind context) plus runtime metadata (date/time, budgets, etc.).
# One-way exit-tools (control signals): the model can emit structured signals upward; the worker records them but does not alter control flow.
# Simple early loop kill: a repeated-line detector as a supplement to max token limits.
# Forward-compatible parameters: allow passing arbitrary generation parameters without module changes.

==== Non-goals ====

* In-place model swapping or reconfiguration of a running worker.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (belongs in the orchestrator).
* Heavy output post-processing.

==== Async model ====

* The module is asyncio-native.
* Public methods are <code>async def</code> and expected to be called from an asyncio event loop.
* Thread-safety is not a v1 requirement; keep calls within a consistent async context.

==== Process management ====

* Launch <code>llama-server</code> in its own process group/session.
* <code>stop()</code> must ensure no orphaned processes: SIGTERM to the process group → wait → SIGKILL the process group if needed.
* Capture stdout/stderr into a bounded ring buffer for debugging.
* The port is assigned externally and passed in config.

==== Slots and concurrency ====

* The worker has <code>slots: int</code>.
* A slot is permission to have one request "in flight" (best mapping to concurrent HTTP streaming requests).
* If all slots are in use, <code>submit()</code> returns immediately with NO_SLOT_AVAILABLE.
* There is no internal queue by default.

==== Request identity ====

* Request IDs are incrementing integers (1, 2, 3… per worker lifetime).
* Each request includes a caller-provided <code>job_name</code> for correlation.

==== Public API ====

* <code>async start() -> None</code>
* <code>async stop() -> None</code>
* (internal) <code>async restart(reason: str) -> None</code>

===== Requests =====

* <code>async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult</code>
* <code>async get_status(request_id: int) -> RequestStatus</code>
* <code>async get_result(request_id: int) -> RequestResult | NotReady</code>
* <code>async cancel(request_id: int) -> bool</code>

Result retrieval releases resources (explicit decision):

* Calling <code>get_result(request_id)</code> when the request is terminal returns the result and releases all stored state/output for that request.
* After a successful <code>get_result()</code>, subsequent <code>get_status</code>/<code>get_result</code> calls for that request_id should return a stable "unknown/released" response (e.g., NOT_FOUND / RELEASED).
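For orientation, here is a minimal sketch of the public surface described above. The method signatures, the incrementing request IDs, and the NO_SLOT_AVAILABLE outcome come from this document; the <code>LlamaWorker</code>, <code>WorkerConfig</code>, and <code>SubmitOutcome</code> names and their fields are illustrative assumptions, not a fixed implementation.

<syntaxhighlight lang="python">
from __future__ import annotations

import enum
from dataclasses import dataclass, field
from typing import Any, Mapping


class SubmitOutcome(enum.Enum):
    ACCEPTED = "accepted"           # request admitted into a free slot
    NO_SLOT_AVAILABLE = "no_slot"   # all slots busy; worker rejects immediately, no queue


@dataclass
class SubmitResult:
    outcome: SubmitOutcome
    request_id: int | None = None   # incrementing integer, assigned only when accepted


@dataclass
class WorkerConfig:
    # Illustrative fields; the doc only pins down an externally assigned port and a slot count.
    model_path: str
    port: int
    slots: int
    extra_server_args: list[str] = field(default_factory=list)


class LlamaWorker:
    """Supervises one llama-server subprocess and exposes a slot-limited async API."""

    def __init__(self, config: WorkerConfig) -> None:
        self._config = config
        self._next_request_id = 1   # request IDs are 1, 2, 3, ... per worker lifetime

    async def start(self) -> None:
        """Launch llama-server in its own process group/session and wait for readiness."""
        ...

    async def stop(self) -> None:
        """SIGTERM the process group, wait, SIGKILL if needed; never leave orphans."""
        ...

    async def submit(
        self,
        job_name: str,
        system_prompt: str,
        user_prompt: str,
        *,
        params: Mapping[str, Any] | None = None,
    ) -> SubmitResult:
        """Return NO_SLOT_AVAILABLE immediately when all slots are in flight."""
        ...

    async def get_status(self, request_id: int) -> "RequestStatus":
        ...

    async def get_result(self, request_id: int) -> "RequestResult | NotReady":
        """A terminal result is returned once; retrieval releases all stored state."""
        ...

    async def cancel(self, request_id: int) -> bool:
        ...
</syntaxhighlight>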
===== Introspection =====

* <code>async get_worker_status() -> WorkerStatus</code>
* <code>async get_debug_info() -> WorkerDebugInfo</code>

==== Generation parameters ====

* <code>params</code> is an open mapping passed through to llama-server's OpenAI-compatible request body.
* The worker merges in the fields it controls (messages/tools/stream) and passes unknown keys through untouched (see the request-assembly sketch below).

==== BIOS prompt layer ====

The BIOS prompt is worker-owned and regenerated at request start and before each post-tool continuation. It includes:

* universal platform/hivemind guidance
* current date/time + timezone
* tool iteration budget remaining and constraints
* instructions for using tools and exit-tools

===== Message order =====

# BIOS system prompt
# caller system prompt
# conversation messages

==== Tool calling ====

Tools use the OpenAI function-calling schema. The worker:

* exposes tools,
* detects tool calls (structured, or via fallback parsing),
* executes them via ToolRunner,
* appends tool result messages,
* continues generation until completion or the tool-iteration budget is exhausted.

Fallback tool parsing is allowed via BIOS-enforced structured JSON conventions if native tool_calls is unreliable.

==== Exit-tools (control signals) ====

Explicit decision: exit-tools never terminate output or alter control flow beyond what the model itself does.

* Exit-tools are provided at worker init as OpenAI-format function tool definitions.
* The worker includes them in the tool list (or a dedicated list, depending on server compatibility) so the model knows its signaling options.
* When the model emits an exit-tool call, the worker records it into signals[] (structured, typed).
* The worker does not automatically stop the request, cancel it, restart the server, or change sampling/params.
* The orchestrator can react to signals[] (including choosing to call cancel() externally).

This keeps exit-tools purely informational and "upward-facing".

==== Streaming and progress tracking ====

* The worker always uses streaming internally.
* Progress definition: any response data flowing after headers counts as progress.
* Tracked timestamps:
** last_stream_byte_at
** last_liveness_at (prefill probes)
** last_progress_at = max(last_stream_byte_at, last_liveness_at)
* Prefill liveness baseline:
** subprocess alive
** /proc/<pid> CPU time delta

==== Timeouts and recovery ====

Timeouts come from a worker-level profile; there are no per-request overrides. The profile includes:

* connect timeout
* headers timeout
* TTFT timeout (typically disabled/None)
* prefill liveness timeout (large/None)
* idle stream timeout (no bytes)
* optional absolute timeout
* probe interval
* restart backoff + crash-loop limits

Recovery is nuke & repave; in-flight requests fail with explicit reasons (no replay).

==== Runaway-generation control ====

* Primary: max_tokens (passed via params, default set per worker profile).
* Secondary: a repeated-line detector cancels the request on clear degenerate repetition.

==== Output handling ====

* Full output is accumulated while the request is running.
* On successful <code>get_result()</code>, the request's stored output, tool trace, and signals are released immediately.

==== Status and observability ====

* WorkerStatus: active/healthy/restarting, slot usage, active request ids, restarts, last error.
* RequestStatus: state, timestamps, output length so far, last_progress_at, tool iterations remaining, recorded signals.
* DebugInfo: bounded subprocess logs + recent restart reasons.
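As a reference for the "Generation parameters", "BIOS prompt layer", and "Message order" sections above, here is a small sketch of how a request body could be assembled. Only the message ordering (BIOS → caller system prompt → conversation) and the rule that worker-controlled keys (messages/tools/stream) override caller params come from this document; the function names, the BIOS wording, and the exact fields are assumptions.

<syntaxhighlight lang="python">
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any, Mapping


def build_bios_prompt(tool_iterations_left: int) -> str:
    """Regenerated at request start and before each post-tool continuation."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (
        "You are one worker inside a hivemind of cooperating models.\n"  # platform guidance (illustrative wording)
        f"Current UTC time: {now}\n"
        f"Tool iterations remaining: {tool_iterations_left}\n"
        "Use tools via structured tool calls; use exit-tools only to signal upward."
    )


def build_request_body(
    system_prompt: str,
    conversation: list[dict[str, Any]],
    tools: list[dict[str, Any]],
    params: Mapping[str, Any] | None,
    tool_iterations_left: int,
) -> dict[str, Any]:
    messages = [
        {"role": "system", "content": build_bios_prompt(tool_iterations_left)},  # 1. BIOS
        {"role": "system", "content": system_prompt},                            # 2. caller system prompt
        *conversation,                                                           # 3. conversation messages
    ]
    body: dict[str, Any] = dict(params or {})  # unknown caller keys pass through untouched
    body.update(                               # worker-controlled fields always win
        messages=messages,
        tools=tools,
        stream=True,                           # worker always streams internally
    )
    return body
</syntaxhighlight>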
==== Testing ====

Strong unit and integration tests are prioritized over cleverness. Integration tests should specifically validate that:

* process-group teardown leaves no orphans
* long prefill does not trigger false timeouts
* disconnect/stall triggers a restart and fails in-flight requests
* tool-call parsing works (native + fallback)
* exit-tools are recorded but do not affect control flow
* <code>get_result()</code> releases request state

==== Open decisions ====

These are the only places where you may want to pick a default before coding:

# Health endpoint / readiness check: prefer GET /v1/models, falling back to a tiny completion as warmup if needed (see the probe sketch below).
# What <code>get_status()</code> returns after <code>get_result()</code>: recommend NOT_FOUND or RELEASED with a stable error code.
# Exit-tools parsing precedence: if a response includes both normal tool_calls and an exit-tool call, do you process normal tools first, or just record exit-tools and continue? (Default recommendation: process the normal tool loop as usual; record exit-tools whenever they occur.)

If you want, I can append a compact "Types & Protocols" section (TypedDicts/Enums/Protocols) that matches this design doc 1:1, ready to implement.
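For the readiness-check decision above, a possible probe, shown only as a sketch: the preference for GET /v1/models and the tiny-completion fallback come from this document, while the use of httpx, the /v1/chat/completions path, the timeout values, and the probe payload are assumptions.

<syntaxhighlight lang="python">
from __future__ import annotations

import httpx


async def is_ready(base_url: str) -> bool:
    """Cheap readiness check: try GET /v1/models, then fall back to a one-token completion."""
    async with httpx.AsyncClient(base_url=base_url, timeout=10.0) as client:
        try:
            resp = await client.get("/v1/models")
            if resp.status_code == 200:
                return True
        except httpx.HTTPError:
            pass  # fall through to the warmup probe

        try:
            # Tiny completion as warmup; exercises the model path, not just the HTTP layer.
            resp = await client.post(
                "/v1/chat/completions",
                json={
                    "messages": [{"role": "user", "content": "ping"}],
                    "max_tokens": 1,
                    "stream": False,
                },
                timeout=120.0,  # first-load warmup may be slow
            )
            return resp.status_code == 200
        except httpx.HTTPError:
            return False
</syntaxhighlight>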