=== Assistant: ===

This section defines the concrete typed interfaces the implementation and test suite will code against. It's intentionally "boring" and explicit.

<syntaxhighlight lang="python">
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Mapping, MutableMapping, Sequence

from typing_extensions import (
    Literal,
    NotRequired,
    Protocol,
    TypedDict,
    runtime_checkable,
)
</syntaxhighlight>

Worker and request lifecycle states:

<syntaxhighlight lang="python">
class WorkerState(str, Enum):
    RUNNING = "running"  # process started (may or may not be ready yet)
    READY = "ready"      # GET /v1/models succeeds
    FAILED = "failed"    # crashed, or locked out by crash-loop protection
    STOPPED = "stopped"  # not running


class RequestState(str, Enum):
    RUNNING = "running"
    TOOL_RUNNING = "tool_running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
</syntaxhighlight>

Failure and finish reasons are kept as stable strings so the orchestrator can route on them without parsing free text:

<syntaxhighlight lang="python">
RequestFailReason = Literal[
    "worker_restarted",
    "server_died",
    "connect_failed",
    "headers_timeout",
    "stall_timeout",
    "tool_parse_error",
    "tool_execution_error",
    "repeated_line_loop",
    "canceled",
    "unknown_error",
]

RequestFinishReason = Literal[
    "stop",        # normal stop sequence / end-of-generation
    "max_tokens",  # hit max new tokens
    "canceled",
    "failed",
]
</syntaxhighlight>

Message and tool payload types stay permissive, so the module doesn't need edits whenever llama.cpp adds fields:

<syntaxhighlight lang="python">
class ChatMessage(TypedDict, total=False):
    role: Literal["system", "user", "assistant", "tool"]
    content: str
    name: str
    tool_call_id: str
    # For tool calling, the server may include more fields:
    tool_calls: Any


class ToolFunctionDef(TypedDict, total=False):
    name: str
    description: str
    parameters: dict[str, Any]  # JSON schema


class ToolDef(TypedDict, total=False):
    type: Literal["function"]
    function: ToolFunctionDef


class ToolCall(TypedDict, total=False):
    id: str
    type: Literal["function"]
    function: dict[str, Any]  # expects {"name": str, "arguments": str|dict}
</syntaxhighlight>

Exit tools are "one-way": they are recorded, not executed, and do not change control flow.

<syntaxhighlight lang="python">
class ExitSignal(TypedDict, total=False):
    tool_name: str
    arguments: dict[str, Any]
    # helpful metadata for debugging / correlation
    emitted_at: float
</syntaxhighlight>
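For illustration only (not part of the locked-in interface), a minimal sketch of turning a captured exit-tool call into an <code>ExitSignal</code> record. The helper name <code>record_exit_signal</code> and the string-or-dict handling of <code>function.arguments</code> are assumptions here:

<syntaxhighlight lang="python">
import json
import time


def record_exit_signal(call: ToolCall) -> ExitSignal:
    """Hypothetical helper: capture an exit-tool call as a one-way signal."""
    fn = call.get("function", {})
    raw_args = fn.get("arguments", {})
    if isinstance(raw_args, str):
        # OpenAI-style servers often send arguments as a JSON string.
        try:
            args = json.loads(raw_args)
        except json.JSONDecodeError:
            args = {"_raw": raw_args}  # keep unparseable payloads for debugging
    else:
        args = dict(raw_args)
    return ExitSignal(
        tool_name=fn.get("name", ""),
        arguments=args,
        emitted_at=time.time(),
    )
</syntaxhighlight>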
Request and result envelopes. Request IDs are incrementing integers; job_name is caller-provided.

<syntaxhighlight lang="python">
class SubmitOk(TypedDict):
    ok: Literal[True]
    request_id: int


class SubmitErr(TypedDict):
    ok: Literal[False]
    error: Literal["NO_SLOT_AVAILABLE", "WORKER_NOT_READY", "WORKER_FAILED"]


SubmitResult = SubmitOk | SubmitErr


class RequestStatus(TypedDict, total=False):
    request_id: int
    job_name: str
    state: RequestState
    created_at: float
    dispatched_at: NotRequired[float]
    completed_at: NotRequired[float]
    last_progress_at: NotRequired[float]
    output_chars: NotRequired[int]
    # optional nice-to-haves
    tokens_received: NotRequired[int]
    tokens_per_second: NotRequired[float]
    # tool loop info
    tool_iters_remaining: NotRequired[int]
    # exit-tool info
    signals: NotRequired[list[ExitSignal]]
    # error info if terminal failed/canceled
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]


class RequestResult(TypedDict, total=False):
    request_id: int
    job_name: str
    state: Literal["completed", "failed", "canceled"]
    finish_reason: RequestFinishReason
    text: str
    signals: NotRequired[list[ExitSignal]]
    # terminal error info (if failed/canceled)
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]
</syntaxhighlight>

Release semantics (locked in):

* get_result(request_id) returns RequestResult only once; it also releases stored state/output.
* After release, later lookups return a stable "not found".

<syntaxhighlight lang="python">
class NotFound(TypedDict):
    ok: Literal[False]
    error: Literal["NOT_FOUND"]


GetResultResponse = RequestResult | NotFound
GetStatusResponse = RequestStatus | NotFound
</syntaxhighlight>

Worker status reporting. The minimum is running/ready/failed; the extra fields are optional.

<syntaxhighlight lang="python">
class WorkerStatus(TypedDict, total=False):
    state: WorkerState
    slots_total: int
    slots_used: int
    active_request_ids: list[int]
    restart_count: int
    last_error: NotRequired[str]
    last_ready_at: NotRequired[float]


class WorkerDebugInfo(TypedDict, total=False):
    # bounded ring buffer content (most recent N lines)
    recent_logs: list[str]
    recent_restart_reasons: list[str]
</syntaxhighlight>

Timeouts are fixed per worker; there are no per-request overrides.

<syntaxhighlight lang="python">
@dataclass(frozen=True, slots=True)
class TimeoutProfile:
    connect_timeout_s: float
    headers_timeout_s: float
    # time-to-first-token is usually disabled / huge in your environment:
    ttft_timeout_s: float | None
    # prefill-safe: based on liveness probes before any bytes arrive
    prefill_liveness_timeout_s: float | None
    # once streaming starts: max allowed time with no bytes
    idle_stream_timeout_s: float | None
    absolute_timeout_s: float | None
    liveness_probe_interval_s: float
    # restart control
    restart_backoff_s: float
    restart_window_s: float
    max_restarts_per_window: int
</syntaxhighlight>

BIOS generation is its own component, so it should be easy to unit test.

<syntaxhighlight lang="python">
@dataclass(frozen=True, slots=True)
class BiosContext:
    now: datetime
    timezone_name: str
    worker_name: str
    tool_iters_remaining: int
    normal_tools: Sequence[ToolDef]
    exit_tools: Sequence[ToolDef]
    # optional: stable version tag for formatting evolution
    bios_version: str = "bios-v1"


@runtime_checkable
class BiosProvider(Protocol):
    def __call__(self, ctx: BiosContext) -> str: ...
</syntaxhighlight>

A separate method/component assembles the message list:

<syntaxhighlight lang="python">
def build_message_stack(
    *,
    bios_text: str,
    caller_system_prompt: str,
    conversation: Sequence[ChatMessage],
) -> list[ChatMessage]:
    """Pure function: returns full message list in required order."""
    ...
</syntaxhighlight>
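For concreteness, one possible implementation sketch. The ordering (BIOS text first, then the caller's system prompt, then the conversation) is an assumption here, since the spec above only fixes the signature:

<syntaxhighlight lang="python">
def build_message_stack(
    *,
    bios_text: str,
    caller_system_prompt: str,
    conversation: Sequence[ChatMessage],
) -> list[ChatMessage]:
    """Sketch: BIOS and caller prompt as two leading system messages (assumed order)."""
    stack: list[ChatMessage] = [{"role": "system", "content": bios_text}]
    if caller_system_prompt:
        stack.append({"role": "system", "content": caller_system_prompt})
    stack.extend(conversation)
    return stack
</syntaxhighlight>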
Normal tools run through a ToolRunner plugin that can be swapped between lightweight and heavyweight implementations (a minimal sketch follows the summary list below):

<syntaxhighlight lang="python">
@runtime_checkable
class ToolRunner(Protocol):
    async def run_tool(
        self,
        *,
        name: str,
        arguments: dict[str, Any],
        request_id: int,
        job_name: str,
    ) -> Any:
        """
        Return any JSON-serializable result (dict/list/str/number/bool/null).
        Worker will serialize it into a tool message payload.
        """
        ...
</syntaxhighlight>

Per-worker configuration:

<syntaxhighlight lang="python">
GenerationParams = Mapping[str, Any]


@dataclass(frozen=True, slots=True)
class WorkerConfig:
    name: str
    host: str
    port: int
    # full command including executable:
    # ["./llama-server", "-m", "...", "--port", "...", ...]
    server_cmd: Sequence[str]
    # env overrides, e.g. {"CUDA_VISIBLE_DEVICES": "0"}
    env: Mapping[str, str]
    slots: int
    timeouts: TimeoutProfile
    # Tools
    normal_tools: Sequence[ToolDef]
    tool_runner: ToolRunner | None
    # Exit tools (one-way)
    exit_tools: Sequence[ToolDef]
    # BIOS
    bios_provider: BiosProvider
    timezone_name: str
</syntaxhighlight>

The readiness probe has a locked-in default:

<syntaxhighlight lang="python">
@dataclass(frozen=True, slots=True)
class ReadinessProbe:
    method: Literal["GET"] = "GET"
    path: str = "/v1/models"
    # Used with timeouts.headers_timeout_s and connect_timeout_s
</syntaxhighlight>

Semantics locked in by these interfaces:

* Exit tools never alter control flow. They're recorded into signals[] and returned to the orchestrator.
* Normal tool calls run through ToolRunner and continue generation (the tool loop is budgeted).
* Progress is "any bytes after headers count as progress," plus prefill liveness via /proc/<pid> CPU time deltas.
* <code>get_result()</code> releases everything for that request_id immediately.
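As referenced above, a minimal sketch of a lightweight ToolRunner. The dict-registry pattern, the class name DictToolRunner, and the demo tool are illustrative assumptions, not part of the spec:

<syntaxhighlight lang="python">
import asyncio
from typing import Awaitable, Callable


class DictToolRunner:
    """Hypothetical lightweight ToolRunner backed by a name -> coroutine map."""

    def __init__(self, tools: Mapping[str, Callable[..., Awaitable[Any]]]) -> None:
        self._tools = dict(tools)

    async def run_tool(
        self,
        *,
        name: str,
        arguments: dict[str, Any],
        request_id: int,
        job_name: str,
    ) -> Any:
        fn = self._tools.get(name)
        if fn is None:
            # The worker would presumably report this as tool_execution_error.
            raise KeyError(f"unknown tool: {name}")
        return await fn(**arguments)


async def _demo() -> None:
    async def add(a: int, b: int) -> int:
        return a + b

    runner = DictToolRunner({"add": add})
    assert isinstance(runner, ToolRunner)  # structural check via runtime_checkable
    result = await runner.run_tool(
        name="add", arguments={"a": 2, "b": 3}, request_id=1, job_name="demo"
    )
    print(result)  # -> 5


asyncio.run(_demo())
</syntaxhighlight>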
If you want the next section after this, I'd suggest Appendix B: Internal module layout (classes/files and which component owns what), so the implementation stays clean and testable from day one.