=== Assistant ===

Here’s the consolidated list of design decisions / requirements we’ve locked in so far for the llama.cpp worker module.

==== Each class instance owns one llama.cpp server subprocess (one model per process; may use 1+ GPUs if sharded). ====
* Instance is immutable once created; reconfiguration is drop & recreate.
* Public lifecycle is <code>start()</code> / <code>stop()</code> (optionally <code>restart()</code> internally).
* Preferred recovery strategy is nuke & repave: restart the server rather than attempting delicate replay.

==== Concurrency is slot-based: the worker has a configured integer <code>slots</code> (≥ 1). ====
* A slot represents one in-flight generation request handled by this worker.
* If all slots are full, the worker does not accept new work (it returns an immediate “no slot available” result). No internal queue by default.

==== Caller submits work via a method like <code>submit(system_prompt, user_prompt, …)</code> that returns immediately with a request id/handle. ====
* Caller retrieves progress/output via polling initially (callbacks optional later).
* Worker tracks per-request state and exposes:
** running/queued (if a queue is ever added), completed, failed, canceled, timed out
** in-flight count / slot utilization

==== Worker uses streaming internally even if the external API is polling-only. ====
* Internal streaming is used for:
** monitoring / progress timestamps
** partial output capture (debug/ops)
** stall detection

==== Timeouts must handle very long prefill (multi-minute to tens of minutes before the first token). ====
* Timeout configuration is per-worker (a “profile” / policy object) with optional per-request overrides.
* Timeout types:
** connect timeout (short)
** headers/accept timeout (moderate)
** TTFT timeout disabled by default (or set very high)
** primary stall detection is progress/liveness based, not “no tokens yet”
** optional absolute timeout (default None or very large)
* Liveness/progress signals (pluggable):
** stream bytes/tokens (when streaming)
** subprocess liveness + /proc CPU time delta (baseline)
** optional GPU activity probes (nice-to-have)
* On stall/health failure: reconnect attempt(s), then restart, per policy.
* Crash-loop protection via restart backoff / restart rate limits.

==== On worker restart, in-flight requests are failed with a clear reason (no replay). ====
* Looping output is treated as a request-level failure (not necessarily a worker restart).

==== Always support <code>max_tokens</code> / max-new-tokens as a hard stop (per-worker default, per-request override). ====
* Add an optional repeated-line detector as an early-kill mechanism (a minimal sketch follows after the tools section below):
** detects the same sufficiently long line repeated N times consecutively
** cancels the request, frees the slot, returns terminal reason <code>repeated_line_loop</code>
** designed to be simple, with a low false-positive rate

==== Tools follow the OpenAI tool/function-calling format. ====
* Tool runner is a separate component passed in, but the worker handles:
** registering/exposing tools to the model
** receiving tool calls
** executing via the tool runner
** resuming the model with tool results
* Tools are intended to be lightweight (calculator/lookup). Large calls are escalated to higher-level models.
* Tool loop constraints (a loop skeleton follows below):
** max tool iterations
** per-tool timeout
** tool output size caps
* Tool iteration budget should be visible to the model via the BIOS prompt.
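To make the repeated-line detector concrete, here is a minimal sketch, assuming the worker feeds it completed lines from the internal stream; the class name and the <code>min_line_len</code> / <code>max_repeats</code> parameters are illustrative, not a settled API.

<syntaxhighlight lang="python">
class RepeatedLineDetector:
    """Early-kill heuristic: trip when the same sufficiently long line
    appears N times in a row in the streamed output (illustrative sketch)."""

    def __init__(self, min_line_len: int = 20, max_repeats: int = 5):
        self.min_line_len = min_line_len  # ignore short lines (low signal)
        self.max_repeats = max_repeats    # N consecutive repeats to trip
        self._last_line = None
        self._count = 0

    def feed_line(self, line: str) -> bool:
        """Feed one completed output line; return True when the request
        should be killed early."""
        line = line.strip()
        if len(line) < self.min_line_len:
            # Short lines reset the streak, keeping false positives low.
            self._last_line, self._count = None, 0
            return False
        if line == self._last_line:
            self._count += 1
        else:
            self._last_line, self._count = line, 1
        return self._count >= self.max_repeats
</syntaxhighlight>

On a <code>True</code> return, the worker would cancel the request, free the slot, and report terminal reason <code>repeated_line_loop</code>.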
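The tool loop with its three constraints could be shaped roughly as below; this is a sketch assuming the worker holds an injected tool runner, and every method name (<code>_generate</code>, <code>tool_runner.execute</code>, <code>append_tool_result</code>, <code>_fail</code>) and default value is a placeholder, not the real interface.

<syntaxhighlight lang="python">
def run_tool_loop(self, request, max_tool_iters: int = 8,
                  per_tool_timeout_s: float = 10.0,
                  tool_output_cap: int = 4096):
    """Hypothetical tool loop: generate, execute tool calls via the
    injected tool runner, resume the model with results, bounded by an
    iteration budget, a per-tool timeout, and an output size cap."""
    for _ in range(max_tool_iters):
        reply = self._generate(request)      # one model turn
        if not reply.tool_calls:
            return reply                     # final answer, no tools needed
        for call in reply.tool_calls:
            try:
                result = self.tool_runner.execute(
                    call, timeout=per_tool_timeout_s)
            except TimeoutError:
                result = "error: tool timed out"
            # Cap tool output before resuming the model with it.
            request.append_tool_result(call.id, str(result)[:tool_output_cap])
    # Budget exhausted: fail the request rather than looping forever.
    return self._fail(request, reason="tool_iteration_budget_exhausted")
</syntaxhighlight>

The iteration budget (and how much of it remains) is what the BIOS prompt, described next, would surface to the model.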
==== Worker injects a low-level BIOS system prompt (secondary system prompt) per request/turn. ====
* BIOS contains runtime/control-plane facts such as:
** tool iteration budget remaining
** current date/time + timezone
** (optional) capability hints / platform rules
* Prompt ordering:
*# BIOS system prompt (highest priority)
*# caller’s system prompt
*# conversation messages (user/assistant/tool)

==== Worker maintains a status readout. ====
* Worker-level status includes at least:
** active/healthy/restarting
** slots total / slots used + active request ids
** restart count, last error, last healthy time
* Per-request status includes:
** state, timestamps (created, started, last_progress)
** partial output length (and optionally a small tail snippet for debugging)
** tool call count / trace metadata
* Subprocess stdout/stderr captured (ring buffer) for diagnostics.

==== In addition to normal tools, the worker supports one-way control signals emitted by the model to help orchestration without needing another model to read all output. ====
* Signals cover:
** uncertainty / low confidence
** needs external info / browsing
** needs a higher reasoning model
** tooling limitations
** policy/safety escalation lane (optional)
** decision requests (management-level choice among options)
* Implemented as “control tools” in OpenAI function-call format (intercepted by the worker), or an equivalent structured footer if needed.
* Signals are:
** bounded (count + size)
** returned alongside output
** treated as hints; the orchestrator decides actions
* For <code>decision_request</code> specifically:
** the worker can be configured to stop immediately once signaled (<code>stop_on_decision_request=True</code>)
** the orchestrator will later create a new request with the chosen decision injected (no resume/replay required)

==== No in-place model swapping; recreate the worker instead. ====
* No replay/resume of in-flight requests after restart.
* No global scheduler inside the worker (handled by a higher-level manager).

If you want, the next step is to turn this into a crisp “module spec” with: class responsibilities, method signatures, request/worker state enums, timeout policy schema, and the exact control-tool JSON schemas. Rough sketches of the state enums, the timeout policy, and one control-tool schema follow below.
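As a starting point for that spec, one possible shape for the state enums and the timeout policy schema is sketched here; every field name and default value is an assumption to be reviewed, not a locked-in decision.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerState(Enum):
    # Hypothetical worker-level states from the status readout above.
    ACTIVE = "active"
    RESTARTING = "restarting"
    STOPPED = "stopped"  # assumption: terminal state after stop()


class RequestState(Enum):
    # Hypothetical request lifecycle states from the list above.
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
    TIMED_OUT = "timed_out"


@dataclass(frozen=True)
class TimeoutPolicy:
    """Per-worker timeout/liveness profile; per-request overrides would
    copy this with replaced fields. All defaults are illustrative."""
    connect_timeout_s: float = 5.0          # short: TCP/HTTP connect
    headers_timeout_s: float = 30.0         # moderate: request accepted
    ttft_timeout_s: Optional[float] = None  # disabled by default (long prefill)
    stall_timeout_s: float = 120.0          # no liveness signal for this long
    absolute_timeout_s: Optional[float] = None  # off unless explicitly set
</syntaxhighlight>

The dataclass is frozen to match the “immutable once created” rule: changing a policy means constructing a new one, just as reconfiguring a worker means recreating it.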
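And a control-tool schema for <code>decision_request</code> in the OpenAI function-calling format could look like this; the parameter names and descriptions are placeholders, and the worker would intercept calls to this tool itself rather than forwarding them to the tool runner.

<syntaxhighlight lang="python">
# Hypothetical control-tool definition in OpenAI function-calling format.
DECISION_REQUEST_TOOL = {
    "type": "function",
    "function": {
        "name": "decision_request",
        "description": (
            "Signal that a management-level choice among options is "
            "needed. One-way: emit it and, if configured, the turn stops."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string",
                             "description": "What needs deciding."},
                "options": {"type": "array",
                            "items": {"type": "string"},
                            "description": "Candidate choices."},
                "recommendation": {"type": "string",
                                   "description": "Model's preferred option."},
            },
            "required": ["question", "options"],
        },
    },
}
</syntaxhighlight>

With <code>stop_on_decision_request=True</code>, the worker would end the turn as soon as this call is intercepted and return the arguments alongside the partial output for the orchestrator to act on.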