=== llama.cpp worker module ===

This module provides a robust, simple, supervisor-style wrapper around a single llama.cpp server subprocess (one model per process). It is intended for use inside a larger "hivemind" system that runs multiple model instances across multiple GPUs, routes work between them, and occasionally tears down and recreates instances based on workload.

The worker module's responsibilities are:
* Start/stop a llama.cpp server subprocess configured for a specific model + GPU allocation.
* Accept generation requests up to a configured number of concurrency slots.
* Execute requests asynchronously (submit returns immediately; the caller polls for status/result).
* Provide monitoring status (health, slot usage, in-flight request details).
* Handle tool calling (OpenAI function-calling schema) via a pluggable tool runner.
* Inject a low-level BIOS system prompt with platform-wide guidance and runtime metadata.
* Detect and recover from failures using a nuke & repave strategy (restart the server rather than attempt delicate replay).
* Detect obvious "runaway loops" via a simple repeated-line detector, in addition to max token limits.

This module is designed to prioritize simplicity, clarity, robustness, and testability over cleverness. It is also designed not to waste compute in a CPU/RAM-constrained environment: bounded polling, lightweight liveness checks, and minimal background work.

==== Goals ====
# One-process-per-model worker: each instance manages exactly one llama.cpp server process and its configuration.
# Slot-based concurrency: support slots > 1 with immediate refusal if full (no queue by default).
# Async request lifecycle:
#* submit() returns immediately with a request handle/id.
#* the caller uses polling APIs to read status and final output.
# Nuke & repave reliability:
#* detect a dead/stalled/unreachable server;
#* restart the subprocess;
#* fail in-flight requests with explicit error reasons.
# Long-prefill-friendly timeouts:
#* accommodate multi-minute / tens-of-minutes time-to-first-token;
#* stall detection is liveness/progress-based, not TTFT-based.
# Tool calling:
#* OpenAI function-calling format;
#* the tool runner is pluggable; the protocol should not assume "lightweight tools" even if that is typical for smaller models.
# BIOS system prompt:
#* inject platform/hivemind guidance + runtime data (date/time, tool budget remaining, etc.).
# Control signals channel:
#* allow the model to emit structured "signals" (e.g., low confidence, escalation, decision request) without requiring downstream models to parse long text.
# Simple loop early-kill:
#* a repeated-line detector that cancels a request when it clearly loops.

==== Non-goals ====
* In-place reconfiguration or model swapping within a worker instance.
* Replay/resume of in-flight requests after restart.
* Global scheduling across workers (handled by the higher-level orchestrator).
* Complex token-level analytics or heavy output post-processing.

==== Core concepts ====
* Worker: owns the llama.cpp server subprocess; exposes request APIs; supervises health; injects the BIOS prompt; runs the tool loop; emits status.
* Request: encapsulates one generation job (messages, tool state, output accumulation, status).
* ToolRunner (plugin): executes tools invoked by the model; the worker handles tool-call parsing and continuation.
* Control Signals: structured records emitted by the model via "control tools" (one-way).
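
To make these concepts concrete, here is a minimal Python sketch of how they might be typed. All names, fields, and defaults below are illustrative assumptions rather than a committed interface.

<syntaxhighlight lang="python">
# Illustrative sketch only: names, fields, and signatures are assumptions,
# not a committed interface.
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Protocol


class RequestState(Enum):
    """Minimum viable request states from the request-lifecycle section."""
    RUNNING = auto()        # includes prefill + generation
    TOOL_RUNNING = auto()   # worker is executing a tool; generation paused
    COMPLETED = auto()
    FAILED = auto()
    CANCELED = auto()


@dataclass
class ControlSignal:
    """Structured record emitted by the model via a control tool (one-way)."""
    kind: str               # e.g. "issue", "escalation", "decision_request", "outcome"
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Request:
    """One generation job: messages, tool state, output accumulation, status."""
    request_id: str
    messages: list[dict[str, Any]]
    state: RequestState = RequestState.RUNNING
    output_text: str = ""
    signals: list[ControlSignal] = field(default_factory=list)
    fail_reason: str | None = None


class ToolRunner(Protocol):
    """Plugin boundary: execute a tool by name with parsed arguments."""
    def run(self, name: str, arguments: dict[str, Any]) -> str: ...
</syntaxhighlight>
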
===== Runtime =====
The worker runs a small internal event loop / async runtime (or a thread-based equivalent) to:
* start the subprocess,
* dispatch HTTP requests,
* stream responses,
* perform liveness checks at a configurable interval,
* update request state and worker status.

==== Public API ====
Exact naming is flexible; the intent is to keep the API small and explicit.

===== Lifecycle =====
* start() -> None
* stop() -> None
* (internal) restart(reason: str) -> None

===== Requests =====
* submit(system_prompt: str, user_prompt: str, *, params: GenerationParams | None = None) -> SubmitResult
** returns immediately: on success, a request_id; on failure, NO_SLOT_AVAILABLE (no queue by default) or WORKER_NOT_READY
* get_status(request_id) -> RequestStatus
* get_result(request_id) -> RequestResult | NOT_READY
* cancel(request_id) -> bool (best-effort)
* release(request_id) -> None (optional; can also auto-release after a successful get_result())

===== Monitoring =====
* get_worker_status() -> WorkerStatus
* get_debug_info() -> WorkerDebugInfo (recent subprocess logs, last errors, restart count, etc.)

==== Configuration ====
* name: identifier
* model_path
* host, port
* gpu_env: typically CUDA_VISIBLE_DEVICES="0" (or similar)
* llama.cpp server args: ctx, gpu layers, batch, tensor split, etc. (opaque list/dict)
* slots: max concurrent requests
* timeout_profile: per-worker only (no per-request overrides)
* tool_policy: max tool iterations, per-tool timeout, etc.
* bios_provider: callable that renders the BIOS prompt text
* control_tools_enabled: bool
* stop_on_decision_request: bool (default true)
* loop_detector: enable + thresholds

===== Timeout profile =====
Designed for long-TTFT workloads.
* connect_timeout_s (short)
* headers_timeout_s (moderate)
* ttft_timeout_s: typically disabled / None
* prefill_liveness_timeout_s: large or None (prefill-safe)
* idle_stream_timeout_s: max time without bytes once streaming starts
* absolute_timeout_s: optional / None
* liveness_probe_interval_s
* restart backoff + crash-loop limits:
** restart_backoff_s
** max_restarts_per_window
** restart_window_s

==== Request lifecycle ====
States (minimum viable):
* RUNNING (includes prefill + generation)
* TOOL_RUNNING (the worker is executing a tool; generation is paused)
* Terminal:
** COMPLETED
** FAILED (with reason)
** CANCELED

Key timestamps:
* created_at
* dispatched_at
* last_stream_byte_at
* last_liveness_at
* last_progress_at = max(last_stream_byte_at, last_liveness_at)
* completed_at

==== BIOS system prompt ====
Worker-owned and regenerated:
* at request start
* before each tool-loop continuation

The BIOS includes:
* a platform/hivemind statement (universal)
* current date/time + timezone
* remaining tool iteration budget and constraints
* optionally: how to emit control signals / decision requests

===== Message ordering =====
# BIOS system prompt (highest priority)
# Caller system prompt (job-specific)
# User/assistant/tool messages

The worker should support either:
* multiple system messages, or
* a single combined system message with clear delimiters (fallback).

==== Tool calling ====
Tools use the OpenAI function-calling schema (tools=[{type: "function", function: {...}}]).

===== Responsibilities =====
* Worker: registers tools into the request payload; detects tool calls; manages continuation; appends tool result messages.
* ToolRunner (plugin): executes a tool by name with arguments and returns the result.

===== Limits =====
* max_tool_iterations budget (exposed in the BIOS)
* per-tool timeout
* tool output size handling (output may be truncated/summarized by the ToolRunner if needed)

Even if most tools for small models are lightweight, the worker must not embed that assumption into the protocol. The same mechanism must work for heavier tools when plugged into a different ToolRunner.
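
To illustrate the flow, here is a hedged Python sketch of a worker-side tool loop against an OpenAI-compatible chat completions endpoint (recent llama-server builds expose one, though the exact response shape can vary by version). The example tool definition, endpoint path, timeout, and budget-exhaustion handling are assumptions.

<syntaxhighlight lang="python">
# Hedged sketch of the worker-side tool loop. The example tool, endpoint path,
# timeout, and budget-exhaustion policy are illustrative assumptions.
import json
import requests  # any HTTP client works; streaming is omitted for brevity

EXAMPLE_TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_time",  # placeholder tool; real definitions come from the ToolRunner
        "description": "Return the current time for a timezone.",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}]


def run_tool_loop(base_url, messages, tool_runner, max_tool_iterations=4):
    """Call the chat endpoint, executing tool calls until the model stops asking."""
    for _ in range(max_tool_iterations):
        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"messages": messages, "tools": EXAMPLE_TOOLS, "stream": False},
            timeout=600,  # long-prefill workloads need generous (or absent) read timeouts
        )
        resp.raise_for_status()
        msg = resp.json()["choices"][0]["message"]
        tool_calls = msg.get("tool_calls")
        if not tool_calls:
            return msg  # normal completion: no further tool use requested
        messages.append(msg)  # keep the assistant turn that requested the tools
        for call in tool_calls:
            args = json.loads(call["function"]["arguments"] or "{}")
            result = tool_runner.run(call["function"]["name"], args)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": result,
            })
    raise RuntimeError("max_tool_iterations exhausted")  # a real worker would fail the request instead
</syntaxhighlight>
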
==== Control signals ====
Purpose: let the model emit structured "flags" and "requests" that the orchestrator can act on without parsing large outputs.

===== Mechanism =====
Expose a minimal set of control tools in OpenAI function format. The worker intercepts them and records them into signals[], which is returned with the result/status.

Example signal types:
* issue (low confidence, tool limit reached, etc.)
* escalation (needs a higher-reasoning model / external information)
* decision_request (requires a management-branch decision)
* outcome (summarizable terminal condition)

Behavior:
* bounded count/size
* stop_on_decision_request=True by default: end generation immediately when a decision is requested
* the orchestrator later creates a new request with the decision injected (no resume).

==== Failure handling ====
Nuke & repave:
* if the server is unhealthy, stalled, or disconnected: restart the subprocess
* mark all in-flight requests as FAILED(reason="worker_restarted") (no replay)

===== Failure classes =====
# Connect/header failures: cannot connect, or cannot receive headers in time → likely a dead server → restart.
# Stalls: use last_progress_at, driven by:
#* stream bytes when streaming exists
#* liveness probes during prefill
#* if there is no progress for the configured window → restart (policy-controlled)
# Process death: the subprocess exits → restart; fail in-flight requests.

===== Restart policy =====
* restart backoff
* cap restarts per time window; mark the worker unhealthy if the cap is exceeded.

==== Repeated-line loop detection ====
A simple early-kill mechanism in addition to max_tokens:
* Parse streamed output into completed lines.
* Compare each completed line with the previous one after normalization (trim whitespace).
* Ignore empty/very short lines.
* If the same line repeats N times consecutively, cancel the request:
** FAILED(reason="repeated_line_loop")
** include the repeated line snippet + repeat count in debug/status.

This should be conservative (low false positives) and cheap (minimal CPU).
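
A minimal sketch of such a detector follows, assuming illustrative default thresholds (minimum line length, consecutive-repeat count) that would need tuning against real outputs.

<syntaxhighlight lang="python">
# Minimal sketch of the repeated-line early-kill detector; the thresholds
# (min_line_len, max_repeats) are illustrative defaults, not tuned values.
class RepeatedLineDetector:
    """Flags a request when the same completed line streams N times in a row."""

    def __init__(self, max_repeats: int = 8, min_line_len: int = 4) -> None:
        self.max_repeats = max_repeats
        self.min_line_len = min_line_len
        self._last_line: str | None = None
        self._repeat_count = 0
        self._buffer = ""

    def feed(self, chunk: str) -> bool:
        """Feed streamed text; return True if the request should be canceled."""
        self._buffer += chunk
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            line = line.strip()  # normalization: trim whitespace
            if len(line) < self.min_line_len:
                continue  # ignore empty/very short lines (keeps it conservative)
            if line == self._last_line:
                self._repeat_count += 1
                if self._repeat_count >= self.max_repeats:
                    return True  # caller fails the request with reason="repeated_line_loop"
            else:
                self._last_line = line
                self._repeat_count = 1
        return False
</syntaxhighlight>

Because only completed lines are compared, the per-chunk cost stays negligible even for long generations, in line with the performance constraints below.
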
==== Output handling ====
* The worker accumulates the full output text for each request in memory until a terminal state.
* The caller retrieves it via get_result().
* Cleanup policy:
** auto-release after get_result(), or explicit release(request_id)
** optional TTL-based GC for abandoned requests (for robustness, not RAM pressure)

==== Status reporting ====

===== WorkerStatus =====
* active, healthy, restarting
* slot counts: slots_total, slots_used
* active_request_ids
* restart_count, last_error, last_healthy_at

===== RequestStatus =====
* state, timestamps
* last_progress_at
* output length so far
* tool iterations remaining
* recorded signals[]

===== Debug info =====
* ring buffer of subprocess logs (most recent N lines)
* last health-check results / recent restart reasons

==== Performance and resource use ====
* No heavy periodic work; liveness probing is interval-based and configurable.
* Default to lightweight Linux /proc CPU-time checks for prefill liveness.
* Avoid expensive per-token processing; repeated-line detection only processes completed lines.
* Keep dependencies minimal and stable.

==== Testing ====

===== Unit tests =====
* Request state transitions and slot accounting (including cancellation)
* BIOS prompt assembly and ordering
* Tool-call parsing → ToolRunner invocation → continuation assembly
* Control-tool interception and "stop on decision request"
* Repeated-line detector behavior (including conservative non-triggers)
* Timeout logic with synthetic progress/liveness timestamps

===== Integration tests =====
Use a stub HTTP server that can simulate:
* slow headers
* long silence, then streaming
* streaming stalls
* disconnects mid-stream
* malformed tool calls
* server crash / restart conditions and crash-loop limits

===== Lifecycle tests =====
* repeated start/stop cycles
* concurrent submits up to the slot limit
* ensure no slot leaks on failures/cancellations

==== Summary ====
This module is a single-worker supervisor for llama.cpp servers designed for a multi-instance, multi-GPU orchestrated environment. It exposes a simple async request API with slot-based concurrency, robust failure recovery via nuke & repave, long-prefill-friendly health detection, OpenAI-format tool calling, BIOS prompt injection for platform context, and a structured control-signals channel for escalation and decision-making.

==== Next steps ====
If you want, the next step is to translate this doc into:
* concrete class/method signatures + dataclasses/enums, and
* a test-first skeleton (request store, slot semaphore, stub server harness) before hooking up the llama-server subprocess + HTTP streaming.
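
As a rough starting point for that test-first skeleton, the sketch below shows one way slot accounting with immediate refusal could look, plus a leak-check unit test. The class name, refusal convention, and test shape are all assumptions.

<syntaxhighlight lang="python">
# Rough sketch of slot accounting with immediate refusal (no queue).
# Names and the exact refusal mapping are assumptions for illustration.
import threading
import uuid


class SlotPool:
    """Tracks in-flight request slots; refuses immediately when full."""

    def __init__(self, slots_total: int) -> None:
        self.slots_total = slots_total
        self._lock = threading.Lock()
        self._active: set[str] = set()

    def try_acquire(self) -> str | None:
        """Return a new request_id, or None if no slot is available."""
        with self._lock:
            if len(self._active) >= self.slots_total:
                return None  # caller maps this to NO_SLOT_AVAILABLE
            request_id = uuid.uuid4().hex
            self._active.add(request_id)
            return request_id

    def release(self, request_id: str) -> None:
        """Idempotent release so failure/cancellation paths cannot leak slots."""
        with self._lock:
            self._active.discard(request_id)


def test_no_slot_leak_on_failure() -> None:
    # Example unit test: fill all slots, fail one request, verify the slot is reusable.
    pool = SlotPool(slots_total=2)
    a, b = pool.try_acquire(), pool.try_acquire()
    assert a and b and pool.try_acquire() is None
    pool.release(a)  # simulate FAILED/CANCELED cleanup
    assert pool.try_acquire() is not None
</syntaxhighlight>
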