=== llama_worker design ===
<code>llama_worker</code> is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single <code>llama-server</code> subprocess from llama.cpp and provides an async, slot-limited, resilient interface for running chat-style inference requests against it. It is designed to be used by a higher-level "hivemind" orchestrator that:
* runs many workers (different models / GPUs / profiles),
* routes jobs between them,
* and prefers nuke & repave (restart a broken worker) over fragile replay.

==== Goals ====
# One worker = one <code>llama-server</code> process (one model per process; may span multiple GPUs if configured).
# Async-first API (explicitly asyncio-native): requests are submitted and completed asynchronously.
# Slot-based concurrency: accept up to <code>slots</code> concurrent in-flight requests; otherwise reject immediately.
# Robustness via nuke & repave:
#* detect an unhealthy/stalled server,
#* restart the subprocess,
#* fail in-flight requests with explicit reasons (no replay).
# Long-prefill-friendly: supports workloads where time-to-first-token is minutes to tens of minutes.
# OpenAI-format tool calling with a pluggable ToolRunner; also supports a fallback tool-call parsing method.
# BIOS prompt layer: inject stable platform-wide instructions (hivemind context) + runtime metadata (date/time, budgets, etc.).
# One-way "exit tools" (control signals): the model can emit structured signals (issue/escalation/decision request) without a tool round-trip.
# Simple early loop kill: a repeated-line detector as a supplement to max token limits.
# Forward-compatible parameters: allow passing arbitrary generation parameters without changing module code.

==== Non-goals ====
* In-place model swapping or reconfiguration of a running worker instance.
* Replay/resume of in-flight requests after restart.
* A global scheduler across workers (belongs in the orchestrator).
* Fancy token analytics or heavyweight monitoring agents.

==== Async model ====
* The module is asyncio-native.
* Public methods are <code>async def</code> and expected to be called from an asyncio event loop.
* Thread-safety is not a primary requirement; the orchestrator should call worker methods from a consistent async context. (A sync wrapper may exist later, but is not required for v1.)

==== Core components ====
* <code>LlamaWorker</code>
** owns the llama-server subprocess and HTTP client
** manages slots, the request table, and health state
** assembles prompts (BIOS + caller system + conversation)
** streams responses internally and accumulates the full output
** runs the tool-call loop and parses/records exit tools
* <code>ToolRunner</code> (plugin)
** pluggable execution for normal tool calls (OpenAI function schema)
** may be lightweight or heavy; the worker must not assume either
* Request records
** store the full output until retrieved
** expose status + debug metadata

==== Process management ====
* The worker launches llama-server as a subprocess in its own process group/session.
* <code>stop()</code> must ensure no orphaned processes: send SIGTERM to the process group → wait → SIGKILL the process group if needed.
* The worker captures stdout/stderr into a bounded ring buffer for debugging.
* The port is assigned externally and provided via config; the worker validates that the server binds and that it can connect.

==== Slots and admission control ====
* The worker has <code>slots: int</code>.
* A slot is "whatever maps best to having multiple queries in flight at once":
** the implementation will treat a slot as permission to dispatch one concurrent HTTP streaming request,
** slots are admission control, not a guarantee of linear throughput.
* If all slots are in use, <code>submit()</code> returns immediately with <code>NO_SLOT_AVAILABLE</code> (see the sketch below).
* No internal queue by default.
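A minimal sketch of the admission check described above, assuming illustrative internal names (<code>LlamaWorkerSketch</code>, <code>_slots_total</code>, <code>_active</code>, <code>_next_request_id</code>, <code>_ready</code>) that are not part of this spec; only the <code>submit()</code> signature and the error strings come from the sections above.

<syntaxhighlight lang="python">
# Minimal sketch of slot-based admission control; not the final implementation.
# Names prefixed with "_" and the class name are illustrative assumptions.
from __future__ import annotations

from typing import Any, Mapping, TypedDict


class SubmitOk(TypedDict):
    ok: bool        # always True here
    request_id: int


class SubmitErr(TypedDict):
    ok: bool        # always False here
    error: str      # e.g. "NO_SLOT_AVAILABLE", "WORKER_NOT_READY"


class LlamaWorkerSketch:
    def __init__(self, slots: int) -> None:
        self._slots_total = slots
        self._active: dict[int, str] = {}   # request_id -> job_name
        self._next_request_id = 1           # incrementing request IDs: 1, 2, 3, ...
        self._ready = False                 # set True once llama-server is healthy

    async def submit(
        self,
        job_name: str,
        system_prompt: str,
        user_prompt: str,
        *,
        params: Mapping[str, Any] | None = None,
    ) -> SubmitOk | SubmitErr:
        # Reject immediately: slots are admission control and there is no queue.
        if not self._ready:
            return {"ok": False, "error": "WORKER_NOT_READY"}
        if len(self._active) >= self._slots_total:
            return {"ok": False, "error": "NO_SLOT_AVAILABLE"}

        request_id = self._next_request_id
        self._next_request_id += 1
        self._active[request_id] = job_name
        # ... dispatch one streaming HTTP request to llama-server here,
        # removing the entry (freeing the slot) when the request finishes ...
        return {"ok": True, "request_id": request_id}
</syntaxhighlight>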
==== Request identification ====
* The worker assigns request IDs as an incrementing integer (1, 2, 3, …).
* Each submission also includes a caller-provided <code>job_name</code> string for correlation and orchestration.
* The <code>(job_name, request_id)</code> pair is the primary handle externally (<code>request_id</code> is unique within a worker lifetime).

==== Public API ====
===== Lifecycle =====
* <code>async start() -> None</code>
* <code>async stop() -> None</code>
* (internal) <code>async restart(reason: str) -> None</code>

===== Submission =====
* <code>async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult</code>
* Returns:
** success: <code>{ ok: True, request_id: int }</code>
** failure: <code>{ ok: False, error: "NO_SLOT_AVAILABLE" | "WORKER_NOT_READY" | ... }</code>

===== Request tracking =====
* <code>async get_status(request_id: int) -> RequestStatus</code>
* <code>async get_result(request_id: int) -> RequestResult | NotReady</code>
* <code>async cancel(request_id: int) -> bool</code>
* <code>async release(request_id: int) -> None</code> (or auto-release after a successful <code>get_result()</code>; implementation choice)

===== Worker status =====
* <code>async get_worker_status() -> WorkerStatus</code>

===== Debugging =====
* <code>async get_debug_info() -> WorkerDebugInfo</code> (recent logs, last errors, restart count)

All structures are typed (TypedDict/dataclasses) with stable machine-readable fields.

==== Forward-compatible parameters ====
* <code>params</code> is an open mapping passed through to llama-server's OpenAI-compatible request body.
* The worker:
** merges the required fields it controls (messages/tools/stream flags),
** passes unknown keys through untouched,
** may optionally record "unsupported parameter" warnings if the backend rejects them.
* This prevents needing module changes whenever llama.cpp adds a new knob.

==== BIOS prompt layer ====
BIOS is a worker-generated system prompt that includes:
* a stable platform-wide instruction: the model is one cooperating agent in a multi-model hivemind
* current date/time + timezone
* tool budget remaining / tool usage rules
* instructions on how to emit exit tools / control signals (if enabled)

BIOS is regenerated:
* at request start
* before each continuation after tool execution

===== Prompt ordering =====
# BIOS system prompt (highest priority)
# caller's system prompt
# conversation messages (user/assistant/tool)
Multiple system messages are allowed; if backend behavior requires it, the worker may concatenate them with clear delimiters (fallback).

==== Tool calling ====
* Tools follow the OpenAI schema: <code>tools=[{type:"function", function:{name, description, parameters}}]</code>.
* Workflow:
*# Send the request with tools available.
*# If the assistant returns a tool call:
*#* execute it via ToolRunner,
*#* append the tool result to the conversation,
*#* decrement the tool iteration budget,
*#* continue generation.
* Tool loop controls:
** <code>max_tool_iterations</code> (per worker)
** per-tool timeout (per worker)
* Fallback parsing: if the backend doesn't emit structured <code>tool_calls</code> reliably, the worker can use BIOS-enforced structured JSON conventions and strict parsing; failure is a request-level terminal error (or can emit an exit-tool escalation signal).

==== Exit tools (control signals) ====
* Exit tools are specified at worker init as a list of OpenAI-format tool definitions (same schema; an illustrative definition follows below).
* They do not use the ToolRunner and do not involve a tool-result round trip.
* The worker exposes these to the model as "available control actions" and then parses and records any such tool calls emitted.

===== Example signals =====
* LOW_CONFIDENCE
* NEEDS_EXTERNAL_INFO
* NEEDS_HIGHER_REASONER
* NEEDS_MANAGER_DECISION (with options)
* POLICY_RISK (optional lane)
* etc.

===== Handling =====
* Exit tool calls become structured <code>signals[]</code> in the request record.
* Default: <code>stop_on_decision_request=True</code> (the worker terminates generation early on decision requests so the orchestrator can branch via a new request).
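For illustration, an exit tool for the NEEDS_MANAGER_DECISION signal might be declared like any other OpenAI-format tool. The parameter schema and the <code>exit_tools=</code> keyword shown here are assumptions; only the overall <code>{type, function: {name, description, parameters}}</code> shape is fixed by this spec.

<syntaxhighlight lang="python">
# Illustrative exit-tool definition in OpenAI function-tool format.
# The parameter schema and field names are assumptions; only the overall
# shape (type: "function", function: {name, description, parameters})
# is fixed by the spec above.
NEEDS_MANAGER_DECISION = {
    "type": "function",
    "function": {
        "name": "needs_manager_decision",
        "description": (
            "One-way control signal: ask the orchestrator to choose between "
            "options. No tool result is returned to the model."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "options": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Candidate decisions for the manager to pick from.",
                },
            },
            "required": ["question", "options"],
        },
    },
}

# Passed at worker init, e.g. (keyword name assumed):
# worker = LlamaWorker(..., exit_tools=[NEEDS_MANAGER_DECISION])
</syntaxhighlight>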
==== Streaming and progress tracking ====
* The worker always uses streaming internally (even if the caller only polls).
* Progress definition (explicit): any data flowing after headers counts as progress.
* Timestamps tracked per request:
** <code>last_stream_byte_at</code> (any bytes received)
** <code>last_liveness_at</code> (prefill liveness probes)
** <code>last_progress_at = max(last_stream_byte_at, last_liveness_at)</code>

===== Prefill liveness probes =====
Before tokens/bytes arrive, the worker uses lightweight probes:
* process alive
* /proc/<pid> CPU time delta (baseline on Linux)
* optional GPU probes later

==== Timeout profile ====
No per-request overrides. A worker's timeout profile includes:
* <code>connect_timeout_s</code> (short)
* <code>headers_timeout_s</code> (moderate)
* <code>ttft_timeout_s</code> (typically disabled/None)
* <code>prefill_liveness_timeout_s</code> (large/None)
* <code>idle_stream_timeout_s</code> (time without any bytes once streaming)
* <code>absolute_timeout_s</code> (optional/None)
* <code>liveness_probe_interval_s</code>
* restart controls: backoff and crash-loop limits

===== Failure handling =====
* On connect/header failures: restart quickly.
* On stall (no progress beyond thresholds): restart and fail in-flight requests.
* On process death: restart and fail in-flight requests.
* No replay; in-flight requests fail with reasons like:
** <code>worker_restarted</code>
** <code>stall_timeout</code>
** <code>connect_failed</code>
** <code>headers_timeout</code>
** <code>server_died</code>

==== Output limits and loop detection ====
* Primary: <code>max_tokens</code> / max-new-tokens in <code>params</code> (per-worker default; may be overridden by the caller if desired).
* Secondary: repeated-line detector (see the sketch below):
** detects the same sufficiently-long line repeated N times consecutively,
** cancels the request and records <code>FAILED(reason="repeated_line_loop")</code>,
** cheap and conservative (line-based, not token-heavy).
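A minimal sketch of such a detector; the defaults for the repeat threshold and minimum line length are illustrative assumptions, not values fixed by this spec.

<syntaxhighlight lang="python">
# Minimal sketch of the repeated-line loop detector. The defaults for the
# repeat threshold and minimum line length are illustrative assumptions.
class RepeatedLineDetector:
    def __init__(self, repeats: int = 8, min_line_len: int = 20) -> None:
        self._repeats = repeats
        self._min_line_len = min_line_len
        self._buffer = ""                   # partial line not yet terminated by "\n"
        self._last_line: str | None = None  # last sufficiently-long complete line
        self._count = 0                     # length of the current run of repeats

    def feed(self, text: str) -> bool:
        """Feed streamed output text; return True once a loop is detected."""
        self._buffer += text
        *complete, self._buffer = self._buffer.split("\n")
        for raw in complete:
            line = raw.strip()
            if len(line) < self._min_line_len or line != self._last_line:
                # A short line or a different line breaks the run (conservative).
                self._last_line = line if len(line) >= self._min_line_len else None
                self._count = 1
                continue
            self._count += 1
            if self._count >= self._repeats:
                return True
        return False
</syntaxhighlight>

A worker could feed each streamed text delta into <code>feed()</code> and cancel the request once it returns <code>True</code>.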
==== Result retention ====
* The worker stores the full output text in memory for each request until:
** the caller retrieves it and the worker auto-releases, or
** the caller calls <code>release(request_id)</code>.
* Given your RAM assumptions, no output cap is required for v1.

==== Status and debug structures ====
===== WorkerStatus fields =====
* active, healthy, restarting
* slots_total, slots_used
* list of active request IDs (and job_name mapping)
* restart_count, last_error, last_healthy_at

===== RequestStatus fields =====
* request_id, job_name
* state + timestamps
* output length so far
* last_progress_at
* tool iterations remaining
* captured signals[]

===== WorkerDebugInfo fields =====
* last N lines of subprocess output
* last restart reasons and times

==== Testing ====
Priority is a solid suite over cleverness.

===== Unit tests =====
* slot accounting and state transitions
* BIOS assembly + ordering
* tool loop logic + fallback parsing
* exit-tools parsing and early-stop behavior
* repeated-line detector behavior
* timeout bookkeeping with synthetic time/progress

===== Integration tests =====
Using a stub server (or a controlled llama-server in CI when feasible) to simulate:
* slow headers
* long silence then streaming
* streaming stalls
* disconnect mid-stream
* malformed tool calls and exit tool calls
* subprocess crash and restart behavior
* process-group teardown (no orphans)

==== Decisions to lock now ====
These aren't blockers, but locking them now will reduce rework:
# Health/readiness endpoint: exactly which endpoint is used for health/ready. Recommend <code>GET /v1/models</code> if supported; otherwise a tiny warmup completion endpoint.
# Result retention policy: auto-release on <code>get_result()</code> vs requiring explicit <code>release()</code>. (Either is fine; pick one and test it.)
# Exit tool semantics: which exit tools should always terminate generation immediately (decision requests, policy risk) vs which just annotate and continue.
# Cancellation behavior: confirm that cancel is "best effort" and may return partial output + terminal reason <code>canceled</code>.

If you want, I can apply your choices for (2)–(4) and produce a final "interfaces + types" section (Protocols/TypedDicts/enums/dataclasses) that's ready to implement directly.
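As a non-authoritative starting point for that interfaces-and-types section, the status structures above might be sketched roughly as follows; exact field names beyond those listed in this spec, the enum members, and the timestamp representation are assumptions.

<syntaxhighlight lang="python">
# Rough, non-authoritative sketch of a few typed structures named above.
# Field sets follow the status sections of this spec; exact field names,
# enum members, and timestamp representation (floats here) are assumptions.
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, TypedDict


class RequestState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"


class WorkerStatus(TypedDict):
    active: bool
    healthy: bool
    restarting: bool
    slots_total: int
    slots_used: int
    active_requests: dict[int, str]   # request_id -> job_name
    restart_count: int
    last_error: str | None
    last_healthy_at: float | None     # UNIX timestamp


@dataclass
class RequestStatus:
    request_id: int
    job_name: str
    state: RequestState
    submitted_at: float
    last_progress_at: float | None = None
    output_chars_so_far: int = 0
    tool_iterations_remaining: int = 0
    signals: list[dict[str, Any]] = field(default_factory=list)  # parsed exit-tool calls
</syntaxhighlight>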