==== This matters because LLM-backed API containers are expensive to run compared to UI containers. ====

===== Assume: =====
* Each container has enough CPU/memory to handle LLM calls well
* You deploy 1 service with 1 container type (UI and API bundled together)

Traffic:
* 200 people open the UI (mostly idle browsing)
* 20 people actually send chat messages

Problem:
* UI traffic forces you to scale containers to keep the UI snappy
* But those same containers also include the backend runtime and secret access
* You end up paying for LLM-capable containers even when people are just loading the UI

Example:
* UI load requires 6 tasks to stay responsive
* You now have 6 “full stack” tasks running even though only 1–2 tasks' worth of API compute is needed

===== Now: =====
* UI service: lightweight, can run 2–3 small tasks
* API service: scales only when chat calls increase

Example:
* UI: 2 tasks (small CPU/mem)
* API: 1 task normally, scales to 5 only if request rate spikes
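
To make the cost difference concrete, here is a rough sketch of the arithmetic in Python. The task counts come from the two examples above; the relative sizes (a UI-only task costing 1 unit, an LLM-capable task costing 4 units) are assumptions chosen only for illustration, not real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSize:
    """Hypothetical relative cost of one running task, in arbitrary units per hour."""
    name: str
    cost_units: float


# Assumed sizes (not from the original text): tasks that can serve LLM calls
# are taken to cost 4x a small UI-only task.
FULL_STACK = TaskSize("full-stack", 4.0)
UI_SMALL = TaskSize("ui-small", 1.0)
API_LARGE = TaskSize("api-large", 4.0)


def combined_cost(ui_driven_tasks: int) -> float:
    """Single service: every task is full-stack, and UI load sets the count."""
    return ui_driven_tasks * FULL_STACK.cost_units


def split_cost(ui_tasks: int, api_tasks: int) -> float:
    """Two services: small UI tasks plus separately scaled API tasks."""
    return ui_tasks * UI_SMALL.cost_units + api_tasks * API_LARGE.cost_units


if __name__ == "__main__":
    print("combined, 6 tasks:       ", combined_cost(6))   # 6 * 4 = 24.0
    print("split, 2 UI + 1 API:     ", split_cost(2, 1))   # 2 + 4 = 6.0
    print("split, 2 UI + 5 API peak:", split_cost(2, 5))   # 2 + 20 = 22.0
```

Under these assumed sizes, the split deployment costs a quarter of the combined one in the normal case and is still slightly cheaper at the API peak. The exact ratio depends entirely on how much larger the API tasks really are.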

Benefits:
# Lower cost
#* API tasks are larger (CPU/mem) and scale only when needed
# Higher reliability
#* UI spikes do not crash the API
#* API spikes do not slow the UI
# Better security
#* Only API tasks can read secrets
# Cleaner operations
#* You debug the UI and the API independently

Scaling metrics:
* UI scaling metric: ALB request count / target response time
* API scaling metric: request rate to /api/chat, CPU, latency
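
As one possible way to express the scaling metrics above, the sketch below builds target-tracking policy parameters in the shape accepted by AWS Application Auto Scaling's <code>put_scaling_policy</code> call. It assumes the services run on Amazon ECS behind an ALB, which the text implies but does not state. The cluster and service names, the target values, and the resource label placeholder are all hypothetical. It uses CPU for the API service; a per-path request rate for /api/chat would need a custom metric, which is not shown here.

```python
from typing import Any, Dict, Optional


def target_tracking_policy(
    policy_name: str,
    resource_id: str,
    metric_type: str,
    target_value: float,
    resource_label: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for Application Auto Scaling's put_scaling_policy."""
    metric_spec: Dict[str, Any] = {"PredefinedMetricType": metric_type}
    if resource_label is not None:
        # ALBRequestCountPerTarget needs a label identifying the
        # load balancer and target group; other metrics do not use it.
        metric_spec["ResourceLabel"] = resource_label
    return {
        "PolicyName": policy_name,
        "ServiceNamespace": "ecs",
        "ResourceId": resource_id,
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_value,
            "PredefinedMetricSpecification": metric_spec,
        },
    }


# Hypothetical names and targets, for illustration only.
ui_policy = target_tracking_policy(
    "ui-requests-per-target",
    "service/example-cluster/ui",
    "ALBRequestCountPerTarget",
    target_value=500.0,
    resource_label="<load-balancer-and-target-group-label>",
)
api_policy = target_tracking_policy(
    "api-cpu",
    "service/example-cluster/api",
    "ECSServiceAverageCPUUtilization",
    target_value=60.0,
)
```

Each dictionary could then be passed as <code>boto3.client("application-autoscaling").put_scaling_policy(**ui_policy)</code>, after the service has been registered as a scalable target with its minimum and maximum task counts (for example 1 and 5 for the API service).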