Most teams overpay for LLM tokens by 3–5×.
Not because the models are expensive — they've gotten cheaper every quarter. Because the code sending tokens to them hasn't been touched since the first prototype worked. These twelve techniques are ordered by impact on a realistic production workload (RAG + agents + some classification). Do the first three and most teams cut their bill in half.
1. Turn on prompt caching. Today.
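Anthropic, OpenAI, and Google all now support prompt caching. On Anthropic, cache reads cost ~10% of the base input price, while writing to the cache costs ~25% more than uncached input. OpenAI bills cached input at roughly 10–50% of the normal rate depending on the model, and Gemini offers comparable discounts.
The win: your system prompt, tool definitions, few-shot examples, and long retrieved context are often the same across calls. Pay the write price once, then ~10% on every cache hit for as long as the cache stays warm (Anthropic's default TTL is 5 minutes, refreshed on each hit).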
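# Anthropic: mark the reusable prefix as cacheable
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_query}],
)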
Typical savings: 60–90% of input token cost on multi-turn or high-QPS workloads. This is the single biggest lever. If you do nothing else on this list, do this.
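To see where that range comes from, here is a back-of-the-envelope sketch. The multipliers (1.25× to write the cache, 0.10× to read it) and the $3 per million input tokens are assumptions for illustration, as is the idea that every call after the first hits a warm cache. Plug in your provider's current prices and your real hit rate.

```python
def input_cost(calls: int, prefix_tokens: int, suffix_tokens: int,
               price_per_mtok: float, cached: bool,
               write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost in dollars for `calls` requests that share one prefix.

    Assumes every call after the first is a cache hit (illustrative only).
    """
    per_token = price_per_mtok / 1_000_000
    if not cached:
        return calls * (prefix_tokens + suffix_tokens) * per_token
    write = prefix_tokens * write_mult * per_token               # first call writes the cache
    reads = (calls - 1) * prefix_tokens * read_mult * per_token  # later calls read it
    suffix = calls * suffix_tokens * per_token                   # the varying part is never cached
    return write + reads + suffix


# 1,000 calls, 8,000-token shared prefix, 200-token user message each
baseline = input_cost(1000, 8000, 200, 3.0, cached=False)
with_cache = input_cost(1000, 8000, 200, 3.0, cached=True)
savings = 1 - with_cache / baseline
print(f"{savings:.0%}")  # prints 88%
```

The longer the shared prefix is relative to the per-call suffix, the closer you get to the top of the range. A cold cache (low traffic, long gaps between calls) pulls you toward the bottom.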
2. Route by difficulty. Stop sending easy work to flagship models.
A classifier or simple heuristic in front of your LLM call can route 70–90% of requests to a cheaper model. GPT-5.4 mini, Claude Haiku 4.5, Gemini Flash-Lite are all roughly 10–20× cheaper than their flagship siblings — and match them on easy tasks.
user query
│
▼
classifier (tiny model or rules)
│
├── "simple" → Haiku 4.5
├── "medium" → Sonnet 4.6
└── "complex" → Opus 4.6 / GPT-5.4
Typical savings: 50–80% of total token cost on mixed workloads, since a cheaper model lowers both the input and the output rate.
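The "rules" variant of that classifier can be very small. The sketch below is one possible heuristic, not a recommended one: the keyword list, the word-count thresholds, and the tier labels are all made-up assumptions you would tune against your own traffic.

```python
import re

# Tier labels mirror the diagram above; map them to whatever model IDs you use.
MODEL_BY_TIER = {
    "simple": "Haiku 4.5",
    "medium": "Sonnet 4.6",
    "complex": "Opus 4.6 / GPT-5.4",
}

# Hypothetical signals that a request needs a stronger model.
COMPLEX_HINTS = re.compile(
    r"\b(prove|refactor|architecture|debug|plan|step[- ]by[- ]step)\b",
    re.IGNORECASE,
)


def classify(query: str) -> str:
    """Cheap rule-based difficulty guess: keywords first, then length."""
    words = len(query.split())
    if COMPLEX_HINTS.search(query) or words > 150:
        return "complex"
    if words > 30 or "\n" in query:
        return "medium"
    return "simple"


def pick_model(query: str) -> str:
    return MODEL_BY_TIER[classify(query)]
```

Log the tier alongside a quality signal (thumbs-down, retries, escalations) so you can tell when the rules send hard work to the small model.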
3. Set max_tokens aggressively — and use stop sequences.
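This is easy to skip and almost always worth doing. If you're asking for a JSON object with three fields, the answer is not 4,000 tokens long. Cap max_tokens at what you actually need, and use stop sequences (</answer>, \n\n, etc.) to short-circuit verbose tails. One caveat: max_tokens truncates output rather than making the model concise, so a cap that is too tight gives you cut-off responses you still pay for. Check the stop reason, and ask for brevity in the prompt as well.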
Typical savings: 20–40% of output cost. Also: faster responses.
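Here is a minimal sketch of both knobs, using the Anthropic Messages API parameter names (`max_tokens`, `stop_sequences`) and its `stop_reason` field. The 150-token cap, the `<answer>` tag convention, and the helper names are assumptions for illustration.

```python
def build_request(user_query: str) -> dict:
    """Keyword arguments for client.messages.create with a tight output budget."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 150,                # hard cap, sized for a short answer
        "stop_sequences": ["</answer>"],  # generation stops before the closing tag
        "messages": [{
            "role": "user",
            "content": f"{user_query}\nReply inside <answer></answer> tags.",
        }],
    }


def was_truncated(stop_reason: str) -> bool:
    """True when the model hit the cap instead of finishing on its own."""
    return stop_reason == "max_tokens"


# Usage, assuming `client` is an anthropic.Anthropic() instance:
#   response = client.messages.create(**build_request("Is 17 prime?"))
#   if was_truncated(response.stop_reason):
#       ...  # raise the cap or retry; the output is incomplete
```

Track how often `was_truncated` fires. If it is more than a rare event, the cap is too low for that endpoint.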
4. Use structured outputs instead of free-form text.
Most major LLM providers now support JSON schema / grammar-constrained decoding. Structured outputs are cheaper because the model doesn't waste tokens on a "Here is the response you asked for:" preamble or a "Let me know if you need anything else" outro.
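from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# my_schema is the wrapper object: {"name": ..., "strict": True, "schema": {...}}
response = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[...],
    response_format={"type": "json_schema", "json_schema": my_schema},
)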
Typical savings: 15–30% of output cost.
5. Stop stuffing the whole document. Use RAG properly.
The most common architecture mistake in 2026: "we have 512K context now, let's just throw everything in." 400K tokens of context costs the same whether the model needs 2K of it or 400K of it.
Embedding + retrieval adds ~50ms and a few tenths of a cent. Stuffing adds dollars per call.
Typical savings: 70–95% of input cost on document-heavy workloads.
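The retrieval step itself is not much code. This sketch ranks pre-embedded chunks by cosine similarity and keeps the top few. The three-dimensional vectors are toy values standing in for real embeddings, and in production a vector store would do the ranking for you.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy index: (chunk text, embedding). Real embeddings have hundreds of dimensions.
chunks = [
    ("refund policy", [1.0, 0.0, 0.0]),
    ("shipping times", [0.0, 1.0, 0.0]),
    ("refund exceptions", [0.9, 0.1, 0.0]),
]
context = top_k([1.0, 0.0, 0.0], chunks, k=2)
# context == ["refund policy", "refund exceptions"]
```

Only `context` goes into the prompt, so input cost scales with k and chunk size rather than with the size of the corpus.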
6. Compress your prompts. Actually compress them.
Most system prompts contain:
- Duplicated instructions ("be helpful" said four different ways)
- Verbose bullet lists that could be one sentence
- XML tags the model doesn't need
- Examples that no longer match production input
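Run your prompt through a compressor such as LLMLingua, or prune it by hand. (Anthropic's "prompt improver" optimizes for output quality and often makes prompts longer, so check the token count afterward.) Expect to cut 30–50% of tokens, and confirm on an eval set that behavior has not changed.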
7. Cache embeddings.
Embeddings are deterministic per model + input. Computing them twice is pure waste. A simple content-hash keyed cache (Redis, SQLite, even a Python dict for small apps) eliminates redundant embedding calls.
Typical savings: 40–80% of embedding cost in any system with re-indexing or duplicate content.
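A content-hash cache along those lines might look like the sketch below. It uses an in-memory dict for brevity, and `embed_fn` is a placeholder for whatever function calls your embedding provider. The model name is part of the key so that switching models never serves stale vectors.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Caches embeddings keyed by a hash of (model name, input text)."""

    def __init__(self, embed_fn: Callable[[str], list[float]], model: str):
        self._embed_fn = embed_fn  # hypothetical: wraps your provider's embedding call
        self._model = model
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        payload = f"{self._model}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1  # only cache misses reach the paid API
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Swapping the dict for Redis or SQLite keeps the same interface. The `misses` counter gives you the hit rate for free.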
8. Batch when you can.
OpenAI and Anthropic both offer batch APIs at a ~50% discount for async workloads: summarization jobs, nightly classification, dataset labeling. Results come back within 24 hours rather than in seconds, so if nothing is waiting on the response, batch it.
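For OpenAI, a batch job is a JSONL file with one request per line. The sketch below only builds those lines. The summarization prompt and the `docs` mapping are made up for illustration, and the upload and job-creation calls are left as comments because they need network access.

```python
import json


def build_batch_lines(docs: dict[str, str], model: str = "gpt-5.4-mini") -> list[str]:
    """One JSONL line per document, in the OpenAI Batch API input format."""
    lines = []
    for doc_id, text in docs.items():
        request = {
            "custom_id": doc_id,  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "user", "content": f"Summarize in two sentences:\n\n{text}"}
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines


lines = build_batch_lines({"doc-1": "First report...", "doc-2": "Second report..."})

# Next steps, assuming `client` is an openai.OpenAI() instance:
#   write "\n".join(lines) to requests.jsonl
#   f = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
#   client.batches.create(input_file_id=f.id,
#                         endpoint="/v1/chat/completions",
#                         completion_window="24h")
```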
9. Summarize long conversation history instead of resending it.
Chat apps that send the entire 30-turn history on every reply pay for the first turn 30 times, and total input tokens grow quadratically with conversation length. Keep a rolling summary + the last 4–6 turns verbatim.
[SYSTEM]
[SUMMARY of turns 1-24] ← ~400 tokens
[TURN 25, 26, 27, 28] ← verbatim
[USER: new message]
Typical savings: 60–85% of input cost on long conversations.
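Assembling that layout is a few lines. In this sketch the function name and the `keep_last` default are assumptions, each "turn" is one message dict, and producing the summary (usually a cheap-model call every few turns) is left out.

```python
def build_context(system_prompt: str, summary: str, turns: list[dict],
                  new_user_message: str, keep_last: int = 4) -> tuple[str, list[dict]]:
    """Return (system, messages): summary in the system text, recent turns verbatim."""
    system = system_prompt
    if summary:
        system += "\n\nSummary of the earlier conversation:\n" + summary
    # Guard keep_last=0: turns[-0:] would return the whole history.
    recent = list(turns[-keep_last:]) if keep_last > 0 else []
    recent.append({"role": "user", "content": new_user_message})
    return system, recent


# 28 stored messages, alternating user/assistant
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i + 1}"}
    for i in range(28)
]
system, messages = build_context("You are a support bot.", "User wants a refund.",
                                 history, "Any update?")
# messages holds turns 25-28 plus the new user message
```

Because the summary is appended after the static system prompt, the static part can still serve as a cacheable prefix.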
10. Keep tool definitions tight.
If your agent has 40 tools, every call sends 40 tool schemas. Most agent turns only need 3–5. Dynamic tool selection (route-then-call, or MCP's tool filtering) cuts this dramatically.
Also: trim tool descriptions. "A function to retrieve the current weather for a given location based on the city name" → "Get current weather for a city."
Typical savings: 20–60% of input cost on tool-heavy agents.
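One way to do dynamic tool selection is a cheap pre-filter that runs before the LLM call. The sketch below matches query words against a hand-written `tags` set per tool. The tool list, the `tags` field, and the scoring are all hypothetical, and an embedding-based match would be the obvious upgrade.

```python
import re

# Hypothetical tool registry; `tags` is local metadata and is never sent to the model.
TOOLS = [
    {"name": "get_weather", "description": "Get current weather for a city.",
     "tags": {"weather", "forecast", "temperature"}},
    {"name": "search_orders", "description": "Find orders by customer email.",
     "tags": {"order", "orders", "purchase"}},
    {"name": "issue_refund", "description": "Refund an order by ID.",
     "tags": {"refund", "money", "return"}},
]


def select_tools(query: str, tools: list[dict], max_tools: int = 5) -> list[dict]:
    """Keep only tools whose tags overlap the query, best matches first."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scored = [(len(words & tool["tags"]), tool) for tool in tools]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep registry order
    return [
        {"name": tool["name"], "description": tool["description"]}
        for score, tool in scored[:max_tools]
        if score > 0
    ]


selected = select_tools("What's the weather in Paris?", TOOLS)
# selected == [{"name": "get_weather", "description": "Get current weather for a city."}]
```

Decide what happens when nothing matches: sending no tools is cheapest, but falling back to a small default set is safer for an agent.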
11. Use extended thinking selectively — not by default.
Claude's extended thinking and GPT's reasoning tokens are powerful and expensive. Turning them on for every request is a common mistake. Use them only when the task's difficulty warrants it (math, multi-step planning, code review), not for "what's the capital of France."
Typical savings: 40–70% of output cost on mixed workloads where thinking was enabled globally.
12. Measure first, then optimize.
Log input_tokens, output_tokens, cached_tokens, and model name on every call. Aggregate by endpoint. You will find:
- One endpoint burning 60% of your bill
- A debug log you forgot to remove sending full transcripts
- A prompt that accidentally doubled last sprint
- A fallback path that silently routes to Opus
Without logging, every optimization is a guess. With it, you can pick the 20% of work that returns 80% of the savings — and skip everything else on this list.
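Once those fields are logged, ranking endpoints by spend is a short aggregation. The sketch assumes `input_tokens` includes the cached tokens (some providers report them separately, so check yours), and the prices per million tokens are placeholders.

```python
from collections import defaultdict


def cost_by_endpoint(records: list[dict], prices: dict[str, dict]) -> list[tuple[str, float]]:
    """Total dollars per endpoint, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        p = prices[r["model"]]
        uncached = r["input_tokens"] - r["cached_tokens"]  # assumes input includes cached
        cost = (uncached * p["input"]
                + r["cached_tokens"] * p["cached"]
                + r["output_tokens"] * p["output"]) / 1_000_000
        totals[r["endpoint"]] += cost
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder prices in dollars per million tokens
prices = {
    "big": {"input": 3.0, "cached": 0.3, "output": 15.0},
    "small": {"input": 0.25, "cached": 0.025, "output": 1.25},
}
records = [
    {"endpoint": "/chat", "model": "big",
     "input_tokens": 10_000, "cached_tokens": 8_000, "output_tokens": 500},
    {"endpoint": "/classify", "model": "small",
     "input_tokens": 1_000, "cached_tokens": 0, "output_tokens": 10},
]
ranking = cost_by_endpoint(records, prices)
# ranking[0][0] == "/chat"
```

Run it over a week of logs before touching anything else. The top row tells you which technique on this list to start with.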
Putting it together
A realistic production AI app, after applying #1–#4 and #9:
| Metric | Before | After |
|---|---|---|
| Monthly cost | $18,400/mo | $4,100/mo |
| p95 latency | 3.1s | 1.4s |
| Model mix | 100% flagship | 78% small / 22% flagship |
The biggest wins aren't from clever prompt tricks. They're from not paying for tokens you were never using.
Numbers in this post are illustrative industry ranges based on typical production workloads — your mileage will vary. Measure before and after on your own traffic.