Pages tagged "cost-optimization"
3 pages tagged with cost-optimization.
- Prompt Caching: The Production Cost-Optimization Layer for LLM Applications. Prompt caching reuses LLM input tokens across requests, cutting input-token costs by up to 90% (Anthropic cache reads are 10% of base price; OpenAI cached inputs run 75-90% cheaper). Combined caching strategies achieve 70-80% total cost reduction in production. The 2026 production landscape includes Anthropic cache_control markers with a 5-minute default TTL (1-hour extended), OpenAI automatic prompt caching, and semantic caching via vector similarity (Redis, GPTCache). Prompt caching is distinct from KV caching (model-internal) and agentic memory (cross-session persistence).
- Advisor Strategy: Pairing a Smarter Model as an Occasional Advisor With a Cheaper Executor. In Anthropic's advisor pattern (April 2026), the executor model (Sonnet or Haiku) handles tasks end-to-end and consults an advisor model (Opus) only on hard decisions. The pairing runs server-side within a single API request. Sonnet with an Opus advisor: +2.7 pp on SWE-bench at 11.9% lower cost. Haiku with an Opus advisor: 41.2% on BrowseComp versus 19.7% for Haiku alone, and 85% cheaper than Sonnet alone.
- Advisor Strategy: Smart Model Pairing for Cost-Efficiency. A pattern where a cheap executor model consults an expensive advisor only when facing hard decisions, reducing costs while improving performance.
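
The "up to 90%" figure in the prompt caching entry follows directly from cache reads being billed at 10% of the base input price. A minimal sketch of that arithmetic is below. The function names, the $3 per million tokens base price, and the 80% cached fraction are illustrative assumptions, and the sketch deliberately ignores any cache-write surcharge and output-token costs, so real savings will be lower.

```python
def blended_input_cost(base_price_per_mtok: float,
                       total_tokens: int,
                       cached_fraction: float,
                       cache_read_multiplier: float = 0.10) -> float:
    """Input-token cost when part of the prompt is served from cache.

    cache_read_multiplier=0.10 mirrors the "10% of base price" figure
    quoted for cache reads; everything else here is a hypothetical input.
    Cache-write premiums are NOT modeled.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached_tokens = total_tokens * cached_fraction
    uncached_tokens = total_tokens - cached_tokens
    # Uncached tokens pay full price; cached tokens pay the discounted rate.
    billable = uncached_tokens + cached_tokens * cache_read_multiplier
    return billable * base_price_per_mtok / 1_000_000


def input_savings_fraction(cached_fraction: float,
                           cache_read_multiplier: float = 0.10) -> float:
    """Fraction of input-token spend saved relative to no caching."""
    return cached_fraction * (1.0 - cache_read_multiplier)


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens at an assumed $3 per million.
    for fraction in (0.0, 0.8, 1.0):
        cost = blended_input_cost(3.0, 1_000_000, fraction)
        saved = input_savings_fraction(fraction)
        print(f"cached={fraction:.0%} cost=${cost:.2f} saved={saved:.0%}")
```

The 90% ceiling is reached only when every input token is a cache hit; with a partially cached prompt the saving scales linearly with the cached fraction.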