
==== Since you’re already planning internal streaming, you can add a loop detector that triggers before max_tokens when the output is clearly stuck. ====

* It runs only while streaming output is coming in.
* If it triggers, it cancels that request and frees the slot.
* It should not automatically restart the worker process (a looping model is usually not a dead server).
* It records a short “why” snippet in the request's debug info.

Terminal reason example: FAILED(reason="loop_detected").
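The behavior above can be sketched as a small streaming consumer. This is only an illustration under assumptions: <code>RequestResult</code>, <code>consume_stream</code>, the <code>is_looping</code> callback, and the <code>cancel</code> callback are hypothetical names, not part of any existing worker API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class RequestResult:
    status: str                    # "COMPLETED" or "FAILED"
    reason: Optional[str] = None   # e.g. "loop_detected"
    text: str = ""
    debug: dict = field(default_factory=dict)


def consume_stream(
    chunks: Iterable[str],
    is_looping: Callable[[str], Optional[str]],
    cancel: Callable[[], None],
) -> RequestResult:
    """Read streamed chunks and cancel the request if the detector fires.

    `is_looping` (hypothetical) returns a short explanation string when the
    output looks stuck, otherwise None. `cancel` (hypothetical) aborts only
    this request; the worker process is left running.
    """
    text = ""
    for chunk in chunks:
        text += chunk
        why = is_looping(text)
        if why is not None:
            cancel()  # frees the slot without restarting the worker
            debug = {"loop_why": why, "tail": text[-80:]}
            return RequestResult("FAILED", "loop_detected", text, debug)
    return RequestResult("COMPLETED", text=text)
```

The detector only runs inside the streaming loop, so a request that never streams is never checked, and the “why” snippet travels with the result instead of being logged separately.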

===== Use multiple weak signals together (to avoid false positives): =====
# Repeated line detector
#* Keep the last K lines (e.g., 20–50).
#* If the same line appears N times in a row (e.g., 8+) or dominates the window (e.g., >70% of its lines), flag.
# Repeated suffix detector
#* Keep the last M characters or tokens (e.g., 2,000–8,000 characters).
#* If, after the newest chunk, the output ends with a tail (roughly 200–500 characters) that already occurs 3+ times in that buffer, flag.
# Low novelty / low entropy
#* Track the rate of unique tokens or unique 3-grams over a sliding window.
#* If novelty stays below a threshold for long enough while the output keeps growing, flag.
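A minimal sketch of the three signals, assuming plain-text output and whitespace-separated words in place of real tokens. All function names and default thresholds here are illustrative, and the “at least two signals must agree” rule is one possible way to combine them, not a prescribed one.

```python
from collections import Counter


def repeated_line_signal(text, window=30, run_limit=8, dominance=0.7):
    """True if one line repeats run_limit times in a row, or dominates a full window."""
    # Ignore a trailing partial line that is still being streamed.
    complete = text.rsplit("\n", 1)[0] if "\n" in text else ""
    lines = [ln for ln in complete.splitlines() if ln.strip()][-window:]
    if len(lines) < run_limit:
        return False
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        run = run + 1 if cur == prev else 1
        if run >= run_limit:
            return True
    if len(lines) < window:
        return False  # only judge dominance once the window is full
    top_count = Counter(lines).most_common(1)[0][1]
    return top_count / len(lines) > dominance


def repeated_suffix_signal(text, history=4000, tail_len=200, min_repeats=3):
    """True if the last tail_len characters occur min_repeats+ times in recent output."""
    recent = text[-history:]
    if len(recent) < tail_len * min_repeats:
        return False
    tail = recent[-tail_len:]
    return recent.count(tail) >= min_repeats  # non-overlapping occurrences


def low_novelty_signal(text, window=400, min_unique_ratio=0.2):
    """True if few distinct word 3-grams appear in the most recent full window."""
    words = text.split()[-window:]
    if len(words) < window:
        return False  # a full window stands in for "long enough"
    trigrams = list(zip(words, words[1:], words[2:]))
    return len(set(trigrams)) / len(trigrams) < min_unique_ratio


def loop_signals(text):
    return {
        "repeated_line": repeated_line_signal(text),
        "repeated_suffix": repeated_suffix_signal(text),
        "low_novelty": low_novelty_signal(text),
    }


def is_looping(text, min_signals=2):
    """Return the names of fired signals if enough agree, else an empty list."""
    fired = [name for name, hit in loop_signals(text).items() if hit]
    return fired if len(fired) >= min_signals else []
```

Requiring agreement between signals means a short run of identical lines (for example, a few repeated log entries) trips the line detector alone and does not cancel the request. The list of fired signal names can double as the “why” snippet.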

===== Make detection tunable, because some legitimate outputs look repetitive: =====
* tables, logs, code templates, poetry refrains, etc.

So add these options to the worker config and allow per-request overrides:
* loop_detection_enabled (default on, but tunable)
* thresholds (line repeats, suffix length, window size)
* min_generated_tokens_before_check (don’t trigger too early)
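One way these options might be represented, as a sketch only: <code>loop_detection_enabled</code> and <code>min_generated_tokens_before_check</code> come from the list above, while the class name, the individual threshold field names, the helper functions, and every default value are assumptions chosen for illustration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class LoopDetectionConfig:
    loop_detection_enabled: bool = True           # default on, but tunable
    line_window: int = 30                         # hypothetical threshold names
    line_repeat_limit: int = 8
    line_dominance_ratio: float = 0.7
    suffix_length: int = 300
    suffix_min_repeats: int = 3
    min_generated_tokens_before_check: int = 256  # assumed default


def effective_config(worker_default, request_overrides=None):
    """Apply per-request overrides on top of the worker config."""
    overrides = request_overrides or {}
    known = {f.name for f in fields(worker_default)}
    unknown = set(overrides) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise ValueError(f"unknown loop-detection options: {sorted(unknown)}")
    return replace(worker_default, **overrides)


def should_check(config, generated_tokens):
    """Skip detection when disabled or when too little has been generated."""
    return (
        config.loop_detection_enabled
        and generated_tokens >= config.min_generated_tokens_before_check
    )
```

A request that is expected to produce a large table or log dump could then pass something like <code>{"loop_detection_enabled": False}</code> or a higher <code>line_repeat_limit</code> without changing the worker-wide defaults.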