=== llama_worker Design Document ===

Status: Draft (repo design doc; intended to be fed into future code generation)
Language: Python 3, fully typed (PEP 484) with typing_extensions (Protocol, TypedDict, etc.)
Backend: llama-server from llama.cpp (OpenAI-compatible HTTP server)

==== 0. Overview ====

llama_worker is an asyncio-native supervisor for a single llama-server subprocess (one model per process). It is designed to be used by a higher-level "hivemind" orchestrator that runs multiple workers (different models / GPUs / profiles), routes jobs between them, and prefers nuke & repave (restart broken workers) over fragile replay.

The worker provides:
* Subprocess supervision (start/stop/restart) with no-orphan guarantees (process group teardown).
* Async request execution: submit() returns immediately with a request id; the caller polls status/result.
* Slot-based concurrency (no queue by default).
* Robust failure handling and restart policy (nuke & repave).
* BIOS prompt injection (hivemind + runtime data), generated by a distinct method/component.
* OpenAI-format normal tool calling (round-trip via a pluggable ToolRunner) with a fallback parsing option.
* Exit tools (one-way "control signals") provided at init and recorded (never executed, never altering control flow).
* Internal streaming for monitoring and progress detection.
* Loop mitigation: max_tokens plus a conservative repeated-line detector.
* Partial output always retrievable via get_result(), even for failures/cancels/restarts.

This module optimizes for simplicity, clarity, robustness, and testability. It must also avoid wasting compute in CPU/RAM-constrained environments (lightweight polling/probes, minimal background work).

==== 1. Goals ====

# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API (explicitly asyncio-native).
# Slot-based concurrency: accept up to slots concurrent in-flight requests; otherwise reject immediately.
# Nuke & repave reliability:
#* detect dead/stalled/unreachable server,
#* restart the subprocess,
#* fail in-flight requests with explicit reasons,
#* no replay/resume.
# Long-prefill-friendly:
#* time-to-first-token can be minutes to tens of minutes,
#* stall detection is based on progress/liveness, not TTFT.
# Tools:
#* OpenAI function/tool calling for normal tools (round-trip),
#* ToolRunner is pluggable; the mechanism must not assume "lightweight tools."
# BIOS prompt:
#* universal hivemind context + runtime metadata,
#* generated via a distinct, testable method/component.
# Exit tools (signals):
#* provided at init,
#* recorded as structured signals,
#* never change worker control flow.
# Loop mitigation:
#* rely on max_tokens as the baseline,
#* plus repeated-line early kill.
# Forward-compatible params:
#* accept a pass-through mapping of generation params without modifying the module for new server features.
# Engineering style:
#* simple state machines, clear invariants, strong tests,
#* avoid compute waste (no aggressive polling, no expensive per-token work).

==== 2. Non-goals ====

* In-place model swapping or reconfiguration of a running worker.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (belongs in the orchestrator).
* Heavy output post-processing or complex token analytics.
* Persistent storage of prompts/outputs (handled by the caller/orchestrator).
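To make the intended call pattern from the overview concrete, here is a minimal caller-side sketch of the submit/poll/collect flow. Class, method, and field names (LlamaWorker, WorkerConfig, submit.accepted, status.is_terminal, result.output_text) are illustrative placeholders consistent with the public interface in section 8, not a finalized API.

<syntaxhighlight lang="python">
import asyncio

from llama_worker import LlamaWorker, WorkerConfig  # hypothetical module layout

async def run_one_job() -> None:
    # Configuration field names are placeholders; the design only fixes
    # host/port, slots, and profile-level timeout policy as worker-owned settings.
    worker = LlamaWorker(WorkerConfig(host="127.0.0.1", port=8080, slots=2))
    await worker.start()
    try:
        submit = await worker.submit(
            job_name="summarize-report",
            system_prompt="You are a concise technical summarizer.",
            user_prompt="Summarize the attached design document.",
            params={"max_tokens": 1024, "temperature": 0.2},  # passed through untouched
        )
        if not submit.accepted:          # e.g. NO_SLOT_AVAILABLE
            return
        # Poll lazily; the worker is asyncio-native but the caller decides cadence.
        while True:
            status = await worker.get_status(submit.request_id)
            if status.is_terminal:       # completed / failed / canceled
                break
            await asyncio.sleep(5.0)     # coarse polling to avoid compute waste
        # One-time completion call: returns (possibly partial) output and
        # releases all stored state for this request id.
        result = await worker.get_result(submit.request_id)
        print(result.status, len(result.output_text))
    finally:
        await worker.stop()

asyncio.run(run_one_job())
</syntaxhighlight>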
==== 3. Terminology ====

* Worker: one running llama-server subprocess plus its management logic.
* Slot: admission-control unit representing one in-flight request.
* Iteration (tools): one assistant "tool call turn" (a single assistant emission that may include multiple tool calls).
* Normal tools: round-trip tools executed via ToolRunner.
* Exit tools: one-way control-signal "tools", recorded only.
* Progress: any bytes received after HTTP headers, plus liveness evidence during prefill.

==== 4. Async model (explicit) ====

* The module is asyncio-native.
* Public APIs are async def and intended to be called from an asyncio event loop.
* Thread-safety is not a v1 requirement; keep calls within a consistent async context.

==== 5. Subprocess supervision requirements ====

* The worker launches llama-server in its own process group/session.
* stop() must ensure no orphaned processes: SIGTERM the process group → short wait → SIGKILL the process group if needed.
* Capture stdout/stderr into a bounded ring buffer for debug breadcrumbs.
* Port assignment is external: the worker config includes host/port; the worker does not auto-assign ports.

==== 6. Concurrency model: slots ====

* The worker has slots: int.
* A slot is permission to have one request "in flight" (best mapping: one concurrent streaming HTTP request).
* If all slots are full, submit() returns immediately with NO_SLOT_AVAILABLE.
* No internal queue by default.

==== 7. Request identity ====

* Request IDs are incrementing integers per worker lifetime (1, 2, 3, …).
* The caller provides a job_name string per request for correlation.

==== 8. Public interface ====

===== Lifecycle =====
* async start() -> None
* async stop() -> None
* (internal) async restart(reason: str) -> None

===== Requests =====
* async submit(job_name: str, system_prompt: str, user_prompt: str, params: Mapping[str, Any] | None = None) -> SubmitResult
* async get_status(request_id: int) -> RequestStatus | NOT_FOUND
* async get_result(request_id: int) -> RequestResult | NOT_FOUND
* async cancel(request_id: int) -> bool (best-effort)

===== Resource release semantics (locked) =====
* get_result(request_id) is the one-time completion call:
** returns the terminal result (completed/failed/canceled),
** returns partial output for any failure/cancel (possibly empty),
** releases all stored state/output for that request id.
* After a successful get_result(), later lookups return NOT_FOUND.

==== 9. Generation params: forward-compatible pass-through ====

* params is an open mapping passed through to the OpenAI-compatible request payload.
* The worker overwrites only fields it owns (e.g., messages, tools, stream).
* Unknown keys are preserved, so adding new server features does not require modifying the worker module.

==== 10. Prompt assembly ====

===== BIOS prompt (worker-owned) =====
The BIOS prompt is a universal system-level prompt layer that includes:
* hivemind/cooperating-agent context,
* current date/time and timezone,
* remaining tool iteration budget,
* instructions for normal tools and exit tools,
* fallback tool-call formatting rules (if fallback parsing is enabled).

BIOS generation must be a distinct method/component and unit-testable.

===== Ordering (invariant) =====
# BIOS system prompt
# caller system prompt
# conversation messages (user/assistant/tool)

==== 11. Transport contract (llama-server) ====

* Readiness probe: GET /v1/models → HTTP 200 + JSON parse success means READY.
* Chat inference: POST /v1/chat/completions with stream=true.
* The streaming parser should be tolerant of SSE framing and partial JSON.
* Progress definition (locked): any bytes after headers count as progress.

Transport is a distinct module; the worker consumes semantic events (text delta, final message, error, done).
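A minimal sketch of the transport layer's two HTTP interactions (readiness probe and streaming chat completion), assuming httpx as the async HTTP client; the design does not mandate a particular client library, and the helper names here are illustrative stand-ins for the semantic-event interface described above.

<syntaxhighlight lang="python">
import json
from typing import Any, AsyncIterator

import httpx  # assumed client library; any asyncio HTTP client would do

async def is_ready(base_url: str) -> bool:
    """Readiness probe: GET /v1/models must return 200 with parseable JSON."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get(f"{base_url}/v1/models")
            return resp.status_code == 200 and isinstance(resp.json(), dict)
    except (httpx.HTTPError, json.JSONDecodeError):
        return False

async def stream_chat(base_url: str, payload: dict[str, Any]) -> AsyncIterator[str]:
    """Yield text deltas from POST /v1/chat/completions with stream=true.

    Tolerant of SSE framing: non-data lines and partial/garbled JSON chunks
    are skipped rather than treated as fatal.
    """
    payload = {**payload, "stream": True}  # the worker owns the stream flag
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{base_url}/v1/chat/completions", json=payload) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data:"):
                    continue
                data = line[len("data:"):].strip()
                if data == "[DONE]":
                    return
                try:
                    chunk = json.loads(data)
                except json.JSONDecodeError:
                    continue  # incomplete event; keep reading
                delta = chunk.get("choices", [{}])[0].get("delta", {}).get("content")
                if delta:
                    yield delta
</syntaxhighlight>

Any bytes that arrive after the response headers, including unparseable lines, would still count as progress for the liveness policy in section 12.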
==== 12. Timeouts, progress, and liveness (per worker profile only) ====

* No per-request timeout overrides. Workers are configured per job type/hardware profile.
* Stall detection must tolerate multi-minute to tens-of-minutes prefill: use last_progress_at, not time-to-first-token.

===== Liveness evidence (prefill-safe) =====
During prefill (no stream bytes yet), the worker uses lightweight probes:
* process alive,
* /proc/<pid>/stat CPU tick deltas (Linux baseline).

===== Restart policy =====
Nuke & repave:
* restart the subprocess on death/unreachable/stall per policy,
* fail in-flight requests with explicit reasons,
* preserve partial output for retrieval.

==== 13. Tools and exit tools ====

===== Normal tools (round-trip) =====
* OpenAI function/tool calling schema.
* Executed via a pluggable ToolRunner.
* Budget is iterations, not calls: one assistant tool-emission that contains any normal tool calls consumes exactly 1 iteration, even if it requests multiple tools at once.
* Fallback parsing is allowed when structured tool_calls are absent/unreliable (BIOS-guided strict JSON).

===== Tool failure classification (locked) =====
Tools should kill the request only when there is an actual execution/format problem, such as:
* tool runner timeout,
* tool runner exception,
* malformed tool arguments (parse/validation),
* non-serializable result (implementation error),
* unknown tool name.

Domain outcomes are not failures: e.g., "search found nothing" must be a successful tool result (empty list / {results: []} / {found: false}), and the request continues.

===== Exit tools (one-way signals; non-terminating) =====
* Exit tools are provided at worker init as tool definitions.
* The worker exposes them to the model and records any exit-tool calls as structured signals[].
* Exit tools are never executed (no ToolRunner) and never alter control flow.
* Priority: process the normal tool loop as usual; record exit signals whenever they occur.
* Models will be encouraged (via BIOS) to emit exit tools near completion, but correctness does not depend on it.

==== 14. Loop mitigation ====

* Baseline: pass max_tokens via params (default set per worker profile/orchestrator).
* Repeated-line detector:
** detects degenerate loops where the model repeats the same line,
** on trigger: cancel request → FAILED(reason="repeated_line_loop"),
** partial output remains retrievable via get_result().

==== 15. Output retention ====

* The worker accumulates the full output in memory while running.
* For any terminal state (completed/failed/canceled), get_result() returns whatever output was accumulated.
* Output and state are released only after get_result() succeeds.

==== 16. Observability (minimum) ====

Minimum worker status surface:
* READY
* RUNNING
* FAILED
* (internally, STOPPED is useful)

Nice-to-haves (optional):
* output length,
* tokens received,
* tokens/sec.

Logging is breadcrumbs only; persistent storage of prompts/outputs is handled by the caller.

==== 17. Engineering rule: separation of concerns (explicit) ====

===== Specific rule =====
Prompt generation (especially BIOS) and message-stack construction must be implemented as distinct methods/components, not embedded inline inside request execution or transport code.
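As an illustration of this rule, here is a minimal sketch of BIOS generation and message-stack construction as standalone, unit-testable functions; field names such as hivemind_context and iterations_remaining are illustrative, not part of the locked design.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any

@dataclass(frozen=True)
class BiosContext:
    # Illustrative inputs only; the real set of runtime fields is open.
    hivemind_context: str
    iterations_remaining: int
    fallback_parsing: bool

def build_bios_prompt(ctx: BiosContext, now: datetime | None = None) -> str:
    """Pure function: render the BIOS layer from hivemind + runtime data."""
    now = now or datetime.now(timezone.utc)
    lines = [
        ctx.hivemind_context,
        f"Current time: {now.isoformat()} ({now.tzname()})",
        f"Tool iteration budget remaining: {ctx.iterations_remaining}",
        "Normal tools are executed and their results returned to you; "
        "exit tools are one-way signals and receive no response.",
    ]
    if ctx.fallback_parsing:
        lines.append("If structured tool calls are unavailable, emit a single strict-JSON tool call object.")
    return "\n".join(lines)

def build_messages(bios: str, caller_system: str, conversation: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Locked ordering: BIOS system prompt, caller system prompt, conversation."""
    return (
        [{"role": "system", "content": bios},
         {"role": "system", "content": caller_system}]
        + list(conversation)
    )
</syntaxhighlight>

Keeping both functions pure makes the ordering invariant and the BIOS content directly assertable in unit tests, independent of transport or request execution.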
===== Guide for approaching problems =====
When adding features or fixing issues, place changes in the smallest responsible layer:
* process supervision,
* transport,
* prompting,
* tool loop,
* exit signal parsing,
* liveness probes,
* timeout policy evaluation,
* loop detection,
* state accounting.

Prefer small, testable functions over clever shared logic; keep policy data-driven.
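As an example of the "small, testable function" style, here is a minimal sketch of the repeated-line detector from section 14; the threshold and minimum line length are illustrative defaults, not locked policy.

<syntaxhighlight lang="python">
def detects_repeated_line_loop(text: str, threshold: int = 8, min_line_len: int = 4) -> bool:
    """Return True if the tail of the accumulated output repeats one line.

    Conservative by design: only complete, non-trivial lines are compared, so
    normal prose and short boilerplate lines do not trigger a false positive.
    """
    lines = [ln.strip() for ln in text.splitlines() if len(ln.strip()) >= min_line_len]
    if len(lines) < threshold:
        return False
    tail = lines[-threshold:]
    return len(set(tail)) == 1  # the last `threshold` lines are identical
</syntaxhighlight>

The request-execution layer would call this on the accumulated output at a coarse cadence (not per token), and on a positive result cancel the request with reason "repeated_line_loop", leaving the partial output retrievable via get_result().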