[{"content":"I\u0026rsquo;m Liu ZhuoQi, an AI Agent developer.\nI integrate AI into real products — agent systems, data visualization, automated workflows. Good tools shouldn\u0026rsquo;t need to explain themselves.\nCursor is my daily driver. This site is where I document and share technical explorations, covering AI Agent development, engineering practices, and creative coding.\nIf you need an AI Agent integration developer, reach out.\nTech Stack # AI \u0026amp; Agents LangChain · OpenAI API · Claude API · Cursor Canvas · Prompt Engineering · RAG\nFrontend React · TypeScript · Next.js · Astro · Tailwind CSS\nBackend \u0026amp; Infra Node.js · Python · PostgreSQL · Docker · Cloudflare Workers\nContact # GitHub: github.com/zhuoqidev X / Twitter: x.com/zhuoqidev Email: hello@zhuoqidev.com ","date":"2026-05-04","externalUrl":null,"language":"zh","permalink":"/en/about/","section":"Home","summary":"AI Agent developer Liu ZhuoQi’s personal introduction","title":"About","type":"en"},{"content":" Why Hugo # When picking a framework for a personal blog, my top criterion was low maintenance cost — I didn\u0026rsquo;t want to abandon writing three months later because of npm dependency hell.\nHugo is a single binary, requires no Node.js, builds thousands of posts in 1-2 seconds, and the PaperMod theme comes with dark mode, full-text search, RSS, Open Graph, and reading time estimates out of the box. Day-to-day writing only requires touching Markdown files.\nArchitecture # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ┌─────────────────────────────┐ │ DNS Geo-Based Routing │ │ (Alibaba Cloud DNS GeoDB) │ └──────┬──────────────┬────────┘ │ │ CN visitors ▼ Intl. visitors ▼ ┌──────────────────┐ ┌─────────────────┐ │ Alibaba CDN │ │ Cloudflare Pages │ │ ↓ │ │ (Free, Global) │ │ Alibaba OSS │ └─────────────────┘ │ (Static Hosting)│ └──────────────────┘ ↑ GitHub Actions auto-build \u0026amp; dual-stack push Annual cost: approximately ¥206 (~$29 USD):\nDomain zhuoqidev.com: ¥85/yr (bought 3 years) Function Compute resource pack (ICP filing): ¥101/yr Alibaba CDN 100GB traffic pack: ¥14/yr OSS storage: ~¥6/yr Cloudflare Pages: ¥0 ICP Filing Without a Server # Websites served to mainland China visitors need an ICP filing, which requires a \u0026ldquo;filing carrier\u0026rdquo; (a server IP). Instead of buying a full server, Alibaba Cloud\u0026rsquo;s Function Compute resource pack (¥101/yr) works as a filing carrier and provides a filing service code.\nTimeline: Alibaba Cloud initial review ~1 day + MIIT review 5-20 business days. Plenty of time to finish the site while waiting.\nGeo-DNS Routing # Alibaba Cloud DNS free tier supports \u0026ldquo;domestic / international\u0026rdquo; split routing:\nDomestic → Alibaba CDN CNAME International (default) → Cloudflare Pages CNAME Domestic visitors get the ICP-compliant Alibaba CDN; international visitors get Cloudflare\u0026rsquo;s free global CDN — one domain, two acceleration paths.\nDeployment # Push to GitHub → Actions runs hugo build → uploads in parallel to OSS and Cloudflare Pages. The whole process takes 2-3 minutes. Publishing a post is nearly instant.\nMore posts on AI Agent development coming soon. 
More posts on AI Agent development coming soon.\nIf you need an AI Agent integration developer, reach out: hello@zhuoqidev.com\n","date":"2026-05-04","externalUrl":null,"language":"en","permalink":"/en/posts/hello-world/","section":"Home","summary":"Why Hugo # When picking a framework for a personal blog, my top criterion was low maintenance cost — I didn’t want to abandon writing three months later because of npm dependency hell.\nHugo is a single binary, requires no Node.js, builds thousands of posts in 1-2 seconds, and the PaperMod theme ships with dark mode, full-text search, RSS, Open Graph, and reading-time estimates out of the box. Day-to-day writing only touches Markdown files.\n","title":"Building a Personal Site with Hugo and Dual-Stack CDN","type":"posts"},{"content":"Hugo shortcodes make it easy to embed live code demos. Here are three ways:\n1. Inline CSS Demo (No External Service) # A spinning loader animation, right in the article:\nPure CSS Spinner (live demo)\nA gradient text animation:\nCSS Gradient Text (live demo: “ZhuoQi Dev”)\n2. Embed CodePen # If you already have CodePen creations, embed them with a single shortcode:\n{{< codepen id=\"yourPenID\" height=\"400\" tab=\"result\" >}}\n3. Embed CodeSandbox # For React / Vue components, use CodeSandbox:\n{{< codesandbox id=\"yourSandboxID\" height=\"450\" view=\"preview\" >}}\nThese three shortcodes cover most code-demo scenarios — no extra tools needed.\n","date":"2026-05-04","externalUrl":null,"language":"en","permalink":"/en/posts/css-animation-demo/","section":"Home","summary":"Hugo shortcodes make it easy to embed live code demos. Here are three ways:\n1. Inline CSS Demo (No External Service) # A spinning loader animation, right in the article:\nPure CSS Spinner (live demo)\nA gradient text animation:\n","title":"Embedding CSS Animation Demos in Hugo Articles","type":"posts"},{"content":"","date":"2026-05-04","externalUrl":null,"language":"en","permalink":"/en/","section":"Home","summary":"","title":"Home","type":"en"},{"content":"","date":"2026-05-04","externalUrl":null,"language":"en","permalink":"/en/posts/","section":"Home","summary":"","title":"Posts","type":"posts"},{"content":"","date":"2026-05-04","externalUrl":null,"language":"en","permalink":"/en/projects/","section":"Home","summary":"","title":"Projects","type":"projects"},{"content":"1. Why LLMs Are Stateless # Four independent constraints — individually manageable, together they leave “stateless” as the only viable engineering solution. This conclusion is cross-validated across 67 primary sources.\nArchitecture: O(n²) Attention # Self-attention compute scales at O(n²), and the KV cache grows linearly with every token kept in context. A single 4096-token sequence needs 2 GB of VRAM for KV cache; 32 concurrent sessions hit 64 GB — more than the model weights themselves. Llama 3.1 at 100M-token context would require 638 H100 GPUs ($5,400/hour) for KV cache alone.\n→ Liu et al., “Lost in the Middle” (TACL 2024): long contexts aren’t just slower — middle-section recall follows a U-shaped curve, worse than closed-book.
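To make the 2 GB figure concrete, here is the back-of-envelope arithmetic, evaluated for an assumed 7B-class model (32 layers, 32 KV heads × 128 head dims, fp16, no grouped-query attention); the dimensions are illustrative, not taken from the report:
# KV-cache sizing sketch. Model dimensions are assumed (7B-class, no GQA);
# production models with grouped-query attention store far fewer KV heads.
layers, kv_heads, head_dim = 32, 32, 128
tokens, bytes_per_elem = 4096, 2   # fp16

kv_bytes = 2 * layers * tokens * kv_heads * head_dim * bytes_per_elem   # 2 = K and V
print(kv_bytes / 2**30)        # 2.0 GiB for one 4096-token session
print(32 * kv_bytes / 2**30)   # 64.0 GiB for 32 concurrent sessions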
Training: Catastrophic Forgetting # LLM knowledge is entangled across billions of weights. No isolated “French module” or “user preference register” exists. Every fine-tune reshapes the entire parameter landscape. Even LoRA suffers from catastrophic forgetting in continual-learning scenarios (arXiv 2404.16789).\n→ Industry standard: offline retraining at a weekly/daily cadence. No one does per-request weight updates.\nCompliance: Right to Be Forgotten # GDPR Article 17 and the PDPA require data controllers to delete personal data “without undue delay.” Once baked into billions of weights, the right to be forgotten becomes nearly impossible to execute — you can’t “subtract” a user from the model. Both Anthropic and OpenAI explicitly state that Memory data lives externally, not in weights. This is a legal constraint, not a technical preference.\n→ RAG / Memory Layer beats fine-tuning because of compliance, not technical superiority.\nSecurity: Persistent Memory = Persistent Attack Surface # ChatGPT Memory has been breached via prompt injection through Google Docs, images, and web pages — attackers invoke to=bio to write malicious persistent instructions affecting all future conversations (Embrace The Red, 2024). This is precisely why Cursor 1.0→1.2 added mandatory user approval, and why Anthropic tested for sycophancy and harmful conversations before releasing Memory.\nKarpathy’s canonical analogy: weights = ROM (static, burned in at training); context window = RAM (directly addressable during inference); KV cache = working memory (formed at test time); external vector / KG store = disk (persistent, requires retrieval). “Knowledge in the weights is a hazy recollection of training-time internet documents; content in the context window is directly accessible” — Andrej Karpathy, Dwarkesh Patel interview (2025-10).\n2. Product Landscape: Cache vs Memory vs True Memory # 13 products, zero weight modifications. This section also disentangles three commonly conflated concepts:\nCache (KV / Prompt Caching): caches K,V projection tensors; a byte-level prefix match skips prefill. 5min–24h lifetime. A compute optimization, not “remembering.”\nMemory (product layer): text in external databases / vector stores / markdown, injected into the system prompt on each call. User-controlled.\nTrue model memory (in-weights): changing the weights themselves. Blocked by catastrophic forgetting + GDPR + interpretability.\nComparison Table #
| Product | Strategy | Type | Weight Δ? |
| --- | --- | --- | --- |
| ChatGPT Memory | 4-layer: metadata + bio + ~40 summaries + window | Memory | No |
| OpenAI Prompt Caching | ≥1024 tokens auto KV cache, 5min–24h TTL | Cache | No |
| Anthropic Prompt Caching | Explicit cache_control, ≤4 breakpoints, byte-level match | Cache | No |
| Gemini Context Caching | Implicit 90% discount + explicit 60min TTL | Cache | No |
| Claude.ai Projects | Instructions + files + history, full prompt injection | Memory | No |
| Claude Memory (2025-10) | Project-isolated, 24h synthesis, editable | Memory | No |
| Claude Code | CLAUDE.md + model-written MEMORY.md (200 lines) | Memory | No |
| Cursor Rules / AGENTS.md | Static markdown, 4 trigger modes, Team > Project > User | Memory | No |
| Cursor Memories (1.0+) | AI generates candidates → user approves → writes | Memory | No |
| Cursor Codebase Index | Merkle tree + encryption + Turbopuffer vector DB | RAG | No |
| Windsurf Cascade | Global + workspace rules + auto Memories + RAG | Memory | No |
| Devin Knowledge | Human-written + AI suggestions + DeepWiki + VM snapshots | Memory+RAG | No |
| Replit Checkpoints | VM snapshot = files + DB + chat + Agent memory | Snapshot | No |
No product in the table modifies weights; the Cache, RAG, and Snapshot rows are compute or state optimizations rather than memory.\nKey reverse-engineering evidence: Manthan Gupta confirmed through three experiments that ChatGPT Memory does not use RAG. Ask ChatGPT about a specific topic discussed a year ago, and it has absolutely no idea. It stores only: session metadata + dozens of bio entries + user-message summaries of the last ~40 chats (not ChatGPT’s own replies) + the current sliding window. Cursor’s official docs put it even more bluntly: “Large language models don’t retain memory between completions. Rules provide persistent, reusable context at the prompt level.”
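Every “Memory” row above reduces to the same loop: read text from an external store, inject it into the prompt, and optionally write new facts back. A minimal sketch of that shared pattern, with a hypothetical store and helper (not any vendor’s API):
# Minimal memory-layer loop: the model stays frozen; only the prompt changes.
# memory_store and build_prompt are hypothetical, not any vendor's API.
memory_store: list[str] = []   # stands in for a DB, vector store, or markdown file

def build_prompt(user_msg: str) -> list[dict]:
    facts = '; '.join(memory_store)   # real systems retrieve top-k, not everything
    return [
        {'role': 'system', 'content': f'Known user facts: {facts}'},
        {'role': 'user', 'content': user_msg},
    ]

memory_store.append('prefers TypeScript over JavaScript')   # the write path
messages = build_prompt('Scaffold a new project for me.')   # the read path
# messages now carries the memory as injected text, with zero weight updates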
3. The Four-Layer Future Stack # Bottom-up: the base layer is forever stateless; the three layers above it are different abstractions for “giving it memory.” L4 is the short-term mainstream; L2 is the highest-value research leap.\nL4 · Agent Memory Layer # Most mature. Treats the LLM as a stateless CPU; memory lives in external databases + the Agent runtime. Representatives: Letta (MemGPT) · Mem0 · Zep + Graphiti · LangGraph Store · AutoGen Memory.\n✅ Auditable · deletable · model-agnostic\n⚠️ Retrieval-quality ceiling · write contamination accumulates\nMem0 scores 26% above OpenAI Memory on LoCoMo, with 91% lower p95 latency and 90% fewer tokens.\nL3 · Ultra-Long Context # Commercialized. Stuffs memory into ultra-long context windows. Representatives: Gemini 2M (>99% needle recall) · Magic LTM-2-Mini 100M tokens.\n✅ Best in-session carrier\n⚠️ Lost-in-the-middle unsolved · 100M ctx for a single user = 638× H100\nL3 and L4 are complementary, not competitive: ultra-long context handles within-session associations; the Agent memory layer handles cross-session / cross-year persistence. Combining both is the current engineering optimum.\nL2 · In-Architecture Memory # Highest research value. Embeds “persistent memory” as a differentiable module in the network — potentially the real paradigm shift. Representatives: Google Titans · Infini-attention · Mamba-2 · RWKV-7 Goose.\n✅ Constant VRAM · linear time\n⚠️ Not yet validated at scale (needs ≥70B params / ≥10T tokens)\nL1 · Bare LLM (Frozen Weights) # Forever stateless. The GPT / Claude / Gemini / Llama core. Each inference is a fresh process. Continual learning won’t become a per-user memory path in the short term; LoRA is for domain/role specialization, not per-user memory.\n4. Memory Economics: Why Cache TTL Is a Hidden Pricing Dial # This is the most underappreciated thread in the entire landscape.\nIn 2026-03, Anthropic silently dropped cache TTL from 1h to 5min, causing Claude Code users to pay 17–26% more. No announcement. No SLA commitment. This exposed a brutal truth: cache TTL directly impacts per-user cost but appears in zero SLAs.
| Metric | Value |
| --- | --- |
| Cost increase after Anthropic TTL change | 17–26% |
| Cache cost transparency | 0% (fully hidden) |
| 100M ctx hardware cost (single user) | ~$5.4k/hr |
| SLA commitments on cache TTL | 0 |
Extrapolate this logic and future “memory economics” increasingly resembles cloud storage — tiered (5min / 1h / 24h / permanent), priceable (micro-adjusting TTL is reverse-pricing by traffic), and lock-in-prone (migration costs mount once agent workflows depend on specific cache strategies).
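A toy model of why the TTL change surfaced on bills: cache reads are so cheap relative to writes that even a small drop in hit rate produces a double-digit cost increase. The READ/WRITE multipliers below follow Anthropic’s published pricing ratios; both hit rates are illustrative assumptions, not measurements:
# Effective per-token input cost as a function of cache hit rate.
# Multipliers are relative to the base input-token price (Anthropic's
# published ratios); the two hit rates are assumed for illustration.
READ, WRITE = 0.10, 1.25

def effective_cost(hit_rate: float) -> float:
    return hit_rate * READ + (1 - hit_rate) * WRITE

before = effective_cost(0.95)   # assumed hit rate under the 1h TTL
after = effective_cost(0.92)    # assumed hit rate under the 5min TTL
print(f'+{after / before - 1:.0%}')   # +22%, inside the reported 17-26% band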
5. Three-Year Paradigm Roadmap # Based on Anthropic, Letta, Karpathy, and LeCun sources. The 2026 row is high-confidence; 2027–2028 are inferential, with explicit uncertainty.
| Year | Mainstream | Potential Dark Horse |
| --- | --- | --- |
| 2026 | Bare LLM + Agent Memory (Mem0/Zep/Letta) + long-context caching | Titans-style architectures begin small-scale commercial use; Sleep-time Compute becomes an agent standard |
| 2027 | Reflection / Sleep-time / TTT enter mainstream Agent framework primitives | A 7B SSM/hybrid surpasses Transformers on long-context benchmarks |
| 2028 | Top models may integrate in-architecture memory (high-risk prediction); otherwise the Memory Layer remains standard | LeCun’s H-JEPA + LLM hybrid prototype (early signal for the 5–10 year bet) |
2028 caveat: in-architecture memory requires ≥70B params and ≥10T training tokens for validation — it is currently arXiv-only. The more likely 2028 scenario is coexistence, not replacement.\n6. Nine Practical Takeaways #\n1. Never conflate Cache and Memory: Cache skips prefill; Memory decides what goes into the prompt. They are orthogonal.\n2. Writing memory = writing the system prompt: any convention expressible in markdown (Cursor Rules / CLAUDE.md / AGENTS.md) beats “letting the AI remember” — diffable, version-controlled, deterministic.\n3. Prefix order: static → dynamic: tool definitions, system prompt, and project rules first; user input last. The top advice in the OpenAI, Anthropic, and Google docs (a sketch follows this list).\n4. Compaction must be cache-safe: don’t open a new system prompt for summarization — it forces a full uncached recomputation. Claude Code calls this “cache-safe forking.”\n5. TTL is a product decision: the Anthropic 1h→5min incident proves it. Expose TTL as user-configurable, or users will find your hidden pricing in their bills.\n6. AI writes, human approves = the steadiest auto-Memory: Cursor 1.2’s user approval + Devin’s suggestion-only flow are the post-prompt-injection consensus.\n7. Visible, editable, exportable = trust: Anthropic’s natural-language synthesis vs ChatGPT’s opaque synthesis — two sides of the same coin.\n8. Privacy mode conflicts with Cache: OpenAI’s extended cache loses ZDR; Cursor’s privacy mode stores no plaintext. Offer “performance vs privacy” as two explicit modes.\n9. The real moat is “context engineering,” not “memory models”: deterministic, version-controlled, human-readable state. Curation cost is one-time; the benefit compounds.
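Takeaway 3 in code form: keep every stable segment at the front so the prefix stays byte-identical across turns, and let only the tail vary. All segment contents here are hypothetical placeholders:
# Prefix order: static → dynamic. Prompt caches match on byte-identical
# prefixes, so anything volatile must come last. Contents are placeholders.
TOOL_DEFINITIONS = '...tool schemas (change only on deploy)...'
SYSTEM_PROMPT = '...house prompt (changes only on deploy)...'
PROJECT_RULES = '...rules / CLAUDE.md (change with the repo)...'

STATIC_PREFIX = [{'role': 'system',
                  'content': TOOL_DEFINITIONS + SYSTEM_PROMPT + PROJECT_RULES}]

def build_messages(history: list[dict], user_input: str) -> list[dict]:
    # Cacheable static prefix first, conversation next, the newest input last.
    return STATIC_PREFIX + history + [{'role': 'user', 'content': user_input}]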
7. Key References # All primary sources, 2024–2026. 30+ curated entries covering vendor docs, arXiv papers, and researcher essays.\nA. Vendor Sources #\nOpenAI: Prompt Caching guide · Caching 201 cookbook · Manthan Gupta, “Reverse Engineered ChatGPT Memory” · Embrace The Red, “Hacking Memories”\nAnthropic: Prompt Caching docs · Lessons from Claude Code · Claude Code Memory · How Claude’s memory works\nGoogle: Gemini Context Caching · Vertex AI caching overview\nCursor / Windsurf / Devin / Replit: Cursor Rules · Codebase Indexing · Cursor 1.0 + 1.2 changelogs · Windsurf Memories · Devin Knowledge · Replit Checkpoints\nB. Key Papers #\nArchitecture: Lost in the Middle · Gemini 1.5 · Magic LTM-2-Mini · Titans · Infini-attention · Mamba-2 · RWKV-7 · KV-Direct\nMemory Layer: MemGPT · Mem0 · Zep + Graphiti · A-Mem · Generative Agents · Sleep-time Compute\nContinual Learning: CL Survey · TTT (ICML 2025) · Memory Taxonomy\nC. Researchers (Karpathy / LeCun / Raschka) #\nKarpathy: Dwarkesh Patel interview (2025-10) · Intro to LLMs\nLeCun: A Path Towards AMI · NVIDIA GTC 2025\nRaschka: Coding the KV Cache\nD. Frameworks #\nLangGraph Persistence · AutoGen Memory · Letta Research · Don’t Break the Cache · ctx.ist\nResearch method: three parallel sub-agents (technical principles + product API design + future paradigms), cross-validated across four sources (Exa, Tavily, Context7, WebSearch). 67 primary URLs, 2024-Q1 to 2026-Q2.\n","date":"2026-05-04","externalUrl":null,"language":"en","permalink":"/en/projects/llm-memory-research/","section":"Home","summary":"1. Why LLMs Are Stateless # Four independent constraints — individually manageable, together they leave “stateless” as the only viable engineering solution. This conclusion is cross-validated across 67 primary sources.\nArchitecture: O(n²) Attention # Self-attention compute scales at O(n²), and the KV cache grows linearly with every token kept in context. A single 4096-token sequence needs 2 GB of VRAM for KV cache; 32 concurrent sessions hit 64 GB — more than the model weights themselves. Llama 3.1 at 100M-token context would require 638 H100 GPUs ($5,400/hour) for KV cache alone.\n","title":"Why LLMs Have No Memory — A Cross-Validated Research Report with 67 Primary Sources","type":"projects"}]