=== Below are the main “spec gaps” I see, plus a concrete default choice for each so you can lock them down without adding complexity. ===

==== Under-specified: Are submit/get_status/get_result safe to call from multiple threads? Is the module asyncio-native? ====
Recommendation (simple + robust):
* Make the worker asyncio-native internally, but expose a thread-safe sync facade if needed.
* At minimum, specify “public methods are thread-safe” or “caller must stay on one event loop”.

==== Under-specified: Are you targeting llama-server (llama.cpp) or llama-cpp-python[server]? Their OpenAI compatibility can differ across versions. ====
Recommendation:
* Treat the server as an OpenAI-compatible HTTP endpoint and keep the server command fully configurable.
* Define the required endpoints you rely on:
** POST /v1/chat/completions (streaming supported)
** GET /v1/models (health/ready check)
* If /v1/models doesn’t exist, allow a configurable health URL or a tiny completion warmup.

==== Under-specified: SSE vs raw chunks, JSON framing differences. ====
Recommendation:
* Specify that the worker supports SSE “data:” JSON lines and tolerates keepalive/comment lines.
* Define what “progress” means: “any received bytes after headers count as stream progress”.

==== Under-specified: If the backend doesn’t emit tool_calls formally, how do you detect tool calls? ====
Recommendation (robust without replay):
* Prefer native tool_calls when present.
* Otherwise, fall back to a structured prompting convention plus a strict JSON parse (still OpenAI-shaped), controlled by BIOS instructions.
* If tool-call parsing fails, treat the request as FAILED(reason="tool_parse_error") (or emit a control signal requesting escalation).

==== Under-specified: ID format, how long results persist, GC behavior. ====
Recommendation:
* IDs: monotonic integer or UUID (integer is cheaper; UUID is easier for distributed logging).
* Results persist until get_result() or release().
* Optional TTL-based GC is disabled by default; enable it only if you see abandoned requests.

==== Under-specified: Does the worker allocate ports, or are they assigned externally? ====
Recommendation:
* The port is provided by the caller (simplest in multi-worker orchestrators).
* The worker validates port availability at start; if the bind fails, it marks WORKER_NOT_READY with an explicit error.

==== Under-specified: Ensuring the subprocess and any children die. ====
Recommendation:
* Start the subprocess in its own process group/session.
* On stop: SIGTERM → wait → SIGKILL the process group.
* This avoids orphaned workers consuming VRAM.

==== Under-specified: Does a slot map to “one concurrent HTTP request” or to the server’s internal parallel slots? ====
Recommendation:
* Define slots as “max concurrent requests the worker will dispatch”.
* Don’t assume the server truly parallelizes well; slots is an admission-control knob, not a throughput guarantee.

==== Under-specified: Besides max_tokens, what parameters are supported (temperature, top_p, stop, seed, etc.)? ====
Recommendation:
* Define a typed GenerationParams that includes common OpenAI-like fields (all optional): max_tokens, temperature, top_p, stop, seed (if supported), etc.
* The worker passes through fields it knows and ignores/records unsupported fields (for forward compatibility).
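To make the GenerationParams recommendation above concrete, here is a minimal Python sketch, assuming a dataclass-based design. Only the field names (max_tokens, temperature, top_p, stop, seed) come from the notes above; the class layout, the extra dict, and the to_payload() helper are illustrative assumptions, not part of the spec.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field, asdict
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class GenerationParams:
    """Illustrative typed parameter bag; fields mirror common OpenAI-style knobs."""
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    stop: Optional[Sequence[str]] = None
    seed: Optional[int] = None                 # only if the backend supports it
    extra: dict = field(default_factory=dict)  # unknown fields, recorded and forwarded as-is

    def to_payload(self) -> dict:
        """Return only the fields that were explicitly set, ready to merge into
        the POST /v1/chat/completions request body."""
        payload = {k: v for k, v in asdict(self).items()
                   if k != "extra" and v is not None}
        payload.update(self.extra)
        return payload
</syntaxhighlight>

With this shape, GenerationParams(max_tokens=256, temperature=0.2).to_payload() yields only the fields the caller set, so unknown or unsupported parameters never reach the server unless they were passed deliberately via extra.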
==== Under-specified: Exact JSON schema and stable enum codes. ====
Recommendation:
* Lock a tiny stable enum set now (you can expand it later without breaking anything): LOW_CONFIDENCE, NEEDS_EXTERNAL_INFO, NEEDS_HIGHER_REASONER, NEEDS_MANAGER_DECISION, TOOL_LIMIT_REACHED, MAX_TOKENS_REACHED, LOOP_DETECTED, POLICY_RISK.
* Specify the maximum number of signals per request and the “stop on decision request” behavior (already decided).

If you want, I can produce the “next artifact” as a concrete typed interface spec (Protocols + TypedDicts/Enums + dataclasses) that directly matches this doc, ready to implement and test against.
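As a sketch of what that “next artifact” could look like, here is a minimal, non-authoritative Python outline. Only the signal codes, the FAILED(reason="tool_parse_error") convention, and the submit/get_status/get_result names come from the notes above; the class names (ControlSignal, RequestState, LlamaWorker), the exact signatures, and the string values are illustrative assumptions.

<syntaxhighlight lang="python">
from __future__ import annotations

import enum
from typing import Any, Protocol


class ControlSignal(enum.Enum):
    """The eight stable signal codes proposed above, with stable string values."""
    LOW_CONFIDENCE = "LOW_CONFIDENCE"
    NEEDS_EXTERNAL_INFO = "NEEDS_EXTERNAL_INFO"
    NEEDS_HIGHER_REASONER = "NEEDS_HIGHER_REASONER"
    NEEDS_MANAGER_DECISION = "NEEDS_MANAGER_DECISION"
    TOOL_LIMIT_REACHED = "TOOL_LIMIT_REACHED"
    MAX_TOKENS_REACHED = "MAX_TOKENS_REACHED"
    LOOP_DETECTED = "LOOP_DETECTED"
    POLICY_RISK = "POLICY_RISK"


class RequestState(enum.Enum):
    """Illustrative lifecycle states for a submitted request."""
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"  # e.g. carries reason="tool_parse_error"


class LlamaWorker(Protocol):
    """Structural interface a worker implementation would satisfy.
    GenerationParams refers to the dataclass sketched earlier on this page."""

    def submit(self, messages: list[dict[str, Any]],
               params: GenerationParams | None = None) -> str:
        """Enqueue a request and return its ID (UUID or monotonic integer as a string)."""
        ...

    def get_status(self, request_id: str) -> RequestState:
        """Non-blocking status lookup for a previously submitted request."""
        ...

    def get_result(self, request_id: str) -> dict[str, Any]:
        """Return the final OpenAI-shaped result; the result is released afterwards."""
        ...
</syntaxhighlight>

A caller would then depend only on the LlamaWorker protocol, which keeps the concrete asyncio or threaded implementation swappable without touching orchestrator code.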