## 1. Why LLMs Are Stateless

Four independent constraints, each manageable on its own, together leave "stateless" as the only viable engineering solution. This conclusion is cross-validated across 67 primary sources.
### Architecture: O(n²) Attention

Self-attention compute scales as O(n²) in sequence length; the standard workaround, caching each token's key and value vectors (the KV cache), trades that compute for memory that grows with every token. A single 4096-token sequence needs about 2 GB of VRAM for the KV cache; 32 concurrent sessions hit 64 GB, more than the model weights themselves. Llama 3.1 at a 100M-token context would require 638 H100 GPUs ($5,400/hour) for the KV cache alone.
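To make the arithmetic concrete, here is a minimal sizing sketch, assuming a 7B-class model without grouped-query attention (32 layers, 32 KV heads, head dimension 128, FP16 weights). All shape parameters are illustrative assumptions, not figures taken from the sources above.

```python
# Back-of-the-envelope KV-cache sizing. Per token, each layer stores one
# key and one value vector per KV head, each of head_dim values.
# Shape parameters below are assumed, illustrative 7B-class values.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,    # transformer blocks
                   n_kv_heads: int = 32,  # no GQA: one KV head per query head
                   head_dim: int = 128,   # hidden_size / n_heads = 4096 / 32
                   dtype_bytes: int = 2   # FP16
                   ) -> int:
    """Bytes of KV cache for one sequence (factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_session = kv_cache_bytes(4096)
print(f"one 4096-token session: {per_session / 2**30:.1f} GiB")      # ~2.0 GiB
print(f"32 concurrent sessions: {32 * per_session / 2**30:.0f} GiB")  # ~64 GiB
```

Under these assumptions the cache costs 0.5 MiB per token, which is why it overtakes the ~14 GB of FP16 weights after only a few dozen 4K-token sessions; production models with grouped-query attention shrink `n_kv_heads` for exactly this reason.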