=== Assistant: You’ve already got the core nailed. ===

The main “gotchas” that tend to bite later are around request lifecycle, capacity/backpressure, observability, and tool-call state. Here’s a checklist of requirements I’d add or lock down now so you don’t end up redesigning mid-flight.

==== Lifecycle ====

* start() / stop() only (no “reconfigure in place”).
* Owns a single llama server subprocess + its port + its GPU-pinning env.
* Clean shutdown: try graceful, then hard kill; always free VRAM.

==== Requests ====

* submit(system_prompt, user_prompt, …) -> request_id returns immediately.
* Completion retrieved by polling: get_result(request_id) / get_status(request_id).
* Supports max_concurrent_requests > 1 (internal queue when saturated).

==== Health & resilience ====

* Detect stalls/disconnects.
* Attempt reconnect; if unhealthy past a threshold, restart the subprocess.
* Resume operation after restart (at minimum, fail in-flight requests cleanly with a clear reason).

==== Tools ====

* Tool registry passed in (a separate class is fine).
* Handles tool-call detection → executes the tool → resumes generation with the tool result.

That’s a solid v1. The sections below are the requirements I’d add or make explicit now.

==== Request state machine ====

You want an explicit state machine, e.g.:

* QUEUED → DISPATCHED → RUNNING → TOOL_WAIT → COMPLETED
* terminal: FAILED | CANCELED | TIMEOUT

Add:

* cancel(request_id) (even if it’s best-effort). Without cancellation, queue buildup becomes painful under load.
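For illustration, a minimal Python sketch of that state machine; the names here (RequestState, ALLOWED, transition) are placeholders, not a settled API:

<syntaxhighlight lang="python">
# Illustrative sketch: request states plus an explicit transition table,
# so illegal transitions fail loudly instead of corrupting request records.
from enum import Enum, auto

class RequestState(Enum):
    QUEUED = auto()
    DISPATCHED = auto()
    RUNNING = auto()
    TOOL_WAIT = auto()
    COMPLETED = auto()  # terminal
    FAILED = auto()     # terminal
    CANCELED = auto()   # terminal
    TIMEOUT = auto()    # terminal

TERMINAL = {RequestState.COMPLETED, RequestState.FAILED,
            RequestState.CANCELED, RequestState.TIMEOUT}

ALLOWED = {
    RequestState.QUEUED:     {RequestState.DISPATCHED, RequestState.CANCELED},
    RequestState.DISPATCHED: {RequestState.RUNNING, RequestState.FAILED,
                              RequestState.CANCELED, RequestState.TIMEOUT},
    RequestState.RUNNING:    {RequestState.TOOL_WAIT, RequestState.COMPLETED,
                              RequestState.FAILED, RequestState.CANCELED,
                              RequestState.TIMEOUT},
    RequestState.TOOL_WAIT:  {RequestState.RUNNING, RequestState.FAILED,
                              RequestState.CANCELED, RequestState.TIMEOUT},
}

def transition(current: RequestState, new: RequestState) -> RequestState:
    """Validate and apply a state change; terminal states never transition."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
</syntaxhighlight>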
==== Backpressure ====

Decide now what happens when saturated:

* queue up to max_queue_depth, then reject with QueueFullError,
* or drop the oldest, or add priority lanes (later).

Also decide whether max_concurrent_requests is:

* per worker instance (likely), and
* enforced with a semaphore + queue (the class sketch at the end of this note shows one way to wire this).

==== Timeouts ====

People often add only one of these, and then it gets messy. You want three:

* Connect timeout: can I reach the server at all?
* First-byte timeout: the server accepted the request but never started responding (a common stall symptom).
* Total timeout: a wall-clock cap per request.

Also add a restart backoff to avoid crash loops.

==== Restart semantics ====

Lock this down explicitly. If the server restarts, do you:

* mark all in-flight requests as FAILED(restarted) immediately, or
* try to replay them (dangerous unless your prompts/tools are deterministic and idempotent)?

For v1, I strongly recommend failing in-flight requests with a clear error.

==== Streaming hooks ====

Even if you only expose polling now, build the internal plumbing so you can later add:

* partial tokens,
* “tail” polling (get the latest chunk since an offset),
* or callback hooks.

If you don’t, tool calls + long responses can feel opaque.

==== Observability ====

The minimum that saves you hours:

* the last N log lines from the subprocess, accessible via get_debug_info()
* per-request timing: queued_ms, run_ms, tool_ms, retries
* health metrics: restart_count, last_healthy_time, consecutive_failures

==== Config introspection ====

Expose:

* model path/name
* context length, gpu-layers, batch settings, etc.
* a config hash, so the higher-level system can answer “what worker is this?”

This matters when you have many instances and need to correlate behavior.

==== Port management ====

If multiple instances are created/destroyed dynamically, the module should either:

* accept a port (caller-managed), or
* include a reliable port allocator + bind check (a small allocator sketch appears at the end of this note).

Also decide whether you bind to 127.0.0.1 only (recommended).

==== Tool-call requirements ====

Tool support is where hidden requirements show up. Define:

* the tool schema format (OpenAI tool schema? custom?)
* max tool recursion / max tool calls per request
* tool timeout + retry policy
* tool result size limits (avoid stuffing huge payloads back into the context)
* idempotency expectations (important if you ever retry a tool call)

Also: persist enough per-request state that tool resumes are correct even if you poll later.

==== Concurrency model ====

If you want max_concurrent_requests > 1, decide whether that means:

* multiple simultaneous HTTP requests to one server process (depends on the server’s parallelism support), or
* actually spawning multiple worker processes for true parallel throughput.

Even if you choose the former, keep a switch so you can move to the latter without API changes.

==== Non-goals for v1 ====

It can help to say explicitly what is not in scope for v1:

* automatic model swapping inside an instance
* replaying in-flight requests across restarts
* global scheduling across workers (belongs in a higher-level manager)
* distributed tool execution (keep it in-process initially)

==== Suggested API surface ====

* start(), stop(), restart()
* submit(system, user, *, tools_enabled=True, request_params={...}) -> id
* get_status(id) -> {state, progress?, error?, timestamps...}
* get_result(id) -> {text, tool_trace?, usage?}
* cancel(id)
* get_worker_status() -> {active, healthy, in_flight, queued, restarts, last_error...}
* get_debug_info() -> logs, config_hash, last_health_checks...

If you lock those items down now, you’ll avoid the common rewrites later. When you’re ready to move to code, we can turn this into a concrete state machine + class layout (worker + request objects + tool runner) with clear boundaries.
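For the port-management point above, a small bind-check allocator is usually enough to start with. This is a standard-library sketch, and the function name allocate_port is made up here:

<syntaxhighlight lang="python">
import socket

def allocate_port(host: str = "127.0.0.1") -> int:
    # Bind to port 0 so the OS picks a free port, then release it.
    # There is an inherent race between closing this socket and the
    # llama server binding the port, so treat a later bind failure
    # as "retry with a freshly allocated port".
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]
</syntaxhighlight>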
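And as a head start on that class layout, here is a minimal skeleton showing how the API surface, the bounded queue, and the concurrency semaphore could fit together. Everything here (names, defaults, the dict-based request records) is an assumption to be revised, and the subprocess/dispatcher internals are stubbed out:

<syntaxhighlight lang="python">
# Illustrative skeleton only: names, defaults, and record shapes are
# assumptions, not a fixed design. Server + dispatcher logic is stubbed.
import queue
import threading
import uuid

class QueueFullError(Exception):
    """Raised when the request queue is already at max_queue_depth."""

class LlamaWorker:
    """One worker = one llama server subprocess + one request queue."""

    def __init__(self, config, max_concurrent_requests=1, max_queue_depth=32):
        self.config = config
        self._queue = queue.Queue(maxsize=max_queue_depth)  # backpressure
        self._slots = threading.Semaphore(max_concurrent_requests)
        self._requests = {}  # request_id -> mutable request record
        self._lock = threading.Lock()

    def start(self):
        """Spawn the subprocess + dispatcher thread (stubbed here)."""
        ...

    def stop(self):
        """Graceful shutdown, then hard kill; always free VRAM (stubbed)."""
        ...

    def submit(self, system, user, *, tools_enabled=True, request_params=None):
        """Enqueue a request and return its id immediately."""
        request_id = str(uuid.uuid4())
        record = {"id": request_id, "state": "QUEUED",
                  "system": system, "user": user,
                  "tools_enabled": tools_enabled,
                  "params": request_params or {}}
        with self._lock:
            self._requests[request_id] = record
        try:
            self._queue.put_nowait(record)  # reject instead of blocking
        except queue.Full:
            with self._lock:
                del self._requests[request_id]
            raise QueueFullError("request queue is at max_queue_depth")
        return request_id

    # Polling / control surface, matching the list above (all stubbed):
    def get_status(self, request_id): ...
    def get_result(self, request_id): ...
    def cancel(self, request_id): ...
    def get_worker_status(self): ...
    def get_debug_info(self): ...
</syntaxhighlight>

The dispatcher thread would then pop records, acquire a semaphore slot, drive the HTTP call plus the tool loop, and apply the transition table sketched earlier.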