Cutting LLM Token Costs: 12 Techniques That Actually Move the Bill

Most teams overpay for LLM tokens by 3–5× without realizing it. Here are 12 techniques, ordered by impact — from prompt caching that cuts 90% off repeated system prompts, to model routing that saves 80% on easy tasks, to the context-window mistake almost every team makes.

Initial Editor · 2026-04-21 · 5 min read · 1,021 words

Most teams overpay for LLM tokens by 3–5×.

Not because the models are expensive — they've gotten cheaper every quarter. Because the code sending tokens to them hasn't been touched since the first prototype worked. These twelve techniques are ordered by impact on a realistic production workload (RAG + agents + some classification). Do the first three and most teams cut their bill in half.

1. Turn on prompt caching. Today.

Anthropic, OpenAI, and Google all now support prompt caching. Cached input tokens cost ~10% of uncached ones on Anthropic, ~25–50% on OpenAI, similar on Gemini.

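The win: your system prompt, tool definitions, few-shot examples, and long retrieved context are the same across calls. Pay for the prefix once (Anthropic charges a small premium on the cache write), then roughly 10% on every cache hit, as long as calls keep arriving before the cache entry expires.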

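import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic: mark the reusable prefix as cacheable
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_query}],
)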

Typical savings: 60–90% of input token cost on multi-turn or high-QPS workloads. This is the single biggest lever. If you do nothing else on this list, do this.
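
To see where a number like that comes from, here is a back-of-the-envelope sketch. The multipliers assume Anthropic's published pricing for the default short-lived cache (writes billed at 1.25× the base input rate, reads at 0.10×). The $3 per million tokens base price and the traffic figures are made-up placeholders, and the function only models the repeated prefix.

```python
def prefix_cost(prefix_tokens, calls, price_per_mtok, cached,
                write_mult=1.25, read_mult=0.10):
    """Dollar cost of sending the same prefix `calls` times.

    Assumes every call after the first hits the cache, i.e. calls arrive
    before the cache entry expires. Multipliers are assumptions; check
    your provider's current pricing.
    """
    per_token = price_per_mtok / 1_000_000
    if not cached:
        return prefix_tokens * calls * per_token
    first = prefix_tokens * write_mult * per_token               # one cache write
    rest = prefix_tokens * read_mult * (calls - 1) * per_token   # cache reads
    return first + rest

# Hypothetical workload: 8,000-token prefix, 1,000 calls, $3 per million tokens
without_cache = prefix_cost(8_000, 1_000, 3.0, cached=False)
with_cache = prefix_cost(8_000, 1_000, 3.0, cached=True)
savings = 1 - with_cache / without_cache
print(f"uncached=${without_cache:.2f} cached=${with_cache:.2f} savings={savings:.1%}")
```

Under these assumptions the prefix cost drops by just under 90%. Savings on the whole input bill will be lower, because the per-request user content is never cached.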

2. Route by difficulty. Stop sending easy work to flagship models.

A classifier or simple heuristic in front of your LLM call can route 70–90% of requests to a cheaper model. GPT-5.4 mini, Claude Haiku 4.5, and Gemini Flash-Lite are all roughly 10–20× cheaper than their flagship siblings, and they typically match them on easy tasks.

user query
   │
   ▼
classifier (tiny model or rules)
   │
   ├── "simple" → Haiku 4.5
   ├── "medium" → Sonnet 4.6
   └── "complex" → Opus 4.6 / GPT-5.4

Typical savings: 50–80% of total cost on mixed workloads, since the cheaper model's rates apply to both input and output tokens.
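
A rules-only version of that router can be very small. The sketch below is one possible shape, not a tuned classifier: the keyword list, the word-count thresholds, and the tier-to-model names are all hypothetical placeholders.

```python
import re

# Hypothetical tier names; map them to whatever models you actually run.
TIERS = {"simple": "small-model", "medium": "mid-model", "complex": "flagship-model"}

# Words that hint at multi-step work. Illustrative, not exhaustive.
COMPLEX_HINTS = re.compile(
    r"\b(prove|refactor|architecture|debug|plan|trade-?offs?)\b", re.I
)

def classify(query: str) -> str:
    """Rule-based difficulty guess. Thresholds are illustrative, not tuned."""
    words = len(query.split())
    if COMPLEX_HINTS.search(query) or words > 200:
        return "complex"
    if words > 40 or query.count("\n") > 5:
        return "medium"
    return "simple"

def route(query: str) -> str:
    """Return the model name for the query's difficulty tier."""
    return TIERS[classify(query)]
```

Rules like these misroute some requests, so log the chosen tier alongside quality signals and promote a request to the next tier when the cheap answer fails validation.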

3. Set max_tokens aggressively — and use stop sequences.

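Nobody does this and everybody should. If you're asking for a JSON object with three fields, the answer is not 4,000 tokens long. Cap max_tokens at what you actually need, with a little headroom: the cap truncates the reply rather than making the model more concise, so a cap set too low cuts the JSON off mid-object. Use stop sequences (</answer>, \n\n, etc.) to short-circuit verbose tails.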

Typical savings: 20–40% of output cost. Also: faster responses.
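
As a rough illustration, the request below caps output and stops at a closing tag. It assumes the Anthropic Messages API parameter names (max_tokens, stop_sequences) and its "max_tokens" stop reason; the helper names, the 150-token cap, and the tag are illustrative choices.

```python
def build_request(prompt: str, max_tokens: int = 150) -> dict:
    """Keyword arguments for an Anthropic-style messages.create() call.

    The model name is copied from the example earlier in this post; the cap
    and stop sequence are illustrative values for a short tagged answer.
    """
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": max_tokens,          # hard ceiling on billed output tokens
        "stop_sequences": ["</answer>"],   # stop as soon as the answer closes
        "messages": [{"role": "user", "content": prompt}],
    }

def was_truncated(stop_reason: str) -> bool:
    """True when the cap cut the reply off, which means the cap is too low."""
    return stop_reason == "max_tokens"
```

With the Anthropic SDK this would be sent as client.messages.create(**build_request(...)), with response.stop_reason passed to was_truncated. If truncation shows up often, raise the cap rather than shipping cut-off output.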

4. Use structured outputs instead of free-form text.

The major LLM providers now support JSON schema / grammar-constrained decoding. Structured outputs are cheaper because the model doesn't waste tokens on "Here is the response you asked for:" preamble and "Let me know if you need anything else" outro.

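from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# my_schema is the wrapper the API expects:
# {"name": "...", "strict": True, "schema": {...JSON Schema...}}
response = client.chat.completions.create(
    model="gpt-5.4-mini",
    messages=[...],  # your chat messages
    response_format={"type": "json_schema", "json_schema": my_schema},
)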

Typical savings: 15–30% of output cost.
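
For concreteness, here is what a schema for a three-field answer might look like, plus a defensive parser for the reply. The schema name and fields are invented for illustration. The wrapper shape (name, strict, schema) and the strict-mode requirements (every property listed in required, additionalProperties set to false) follow OpenAI's structured-outputs format as I understand it, so check the current docs before relying on them.

```python
import json

# Hypothetical three-field schema in the wrapper shape used with
# response_format={"type": "json_schema", "json_schema": my_schema}.
my_schema = {
    "name": "ticket_triage",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "other"]},
            "urgent": {"type": "boolean"},
            "summary": {"type": "string"},
        },
        "required": ["category", "urgent", "summary"],
        "additionalProperties": False,
    },
}

def parse_reply(content: str) -> dict:
    """Parse the model's JSON reply and check the expected keys are present."""
    data = json.loads(content)
    missing = set(my_schema["schema"]["required"]) - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data
```

The parser is a cheap safety net for the cases where a reply is truncated or a provider falls back to unconstrained output.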

5. Stop stuffing the whole document. Use RAG properly.

The most common architecture mistake in 2026: "we have 512K context now, let's just throw everything in." 400K tokens of context costs the same whether the model needs 2K of it or 400K of it.

Embedding + retrieval adds ~50ms and a few tenths of a cent. Stuffing adds dollars per call.

Typical savings: 70–95% of input cost on document-heavy workloads.
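
The retrieval side of this can enforce a hard token budget, so context size stops depending on document size. A minimal sketch, assuming a retriever that returns a similarity score and a token count per chunk; the function name and sample data are hypothetical.

```python
def select_chunks(scored_chunks, budget_tokens):
    """Pick the highest-scoring chunks that fit in a token budget.

    scored_chunks: list of (similarity, token_count, text) tuples, as a
    retriever might return them. Chunks that would overflow the budget
    are skipped, and smaller lower-ranked chunks may still fit.
    """
    picked, used = [], 0
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)
    for _score, tokens, text in ranked:
        if used + tokens <= budget_tokens:
            picked.append(text)
            used += tokens
    return picked, used

# Hypothetical retrieval results for a refund question
chunks = [
    (0.91, 700, "refund policy"),
    (0.42, 900, "company history"),
    (0.88, 800, "refund exceptions"),
    (0.75, 600, "shipping times"),
]
context, used = select_chunks(chunks, budget_tokens=2_000)
```

Here the two refund chunks fill 1,500 of the 2,000 budgeted tokens and the rest are dropped, no matter how large the underlying corpus is.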

6. Compress your prompts. Actually compress them.

Most system prompts contain:

  • Duplicated instructions ("be helpful" said four different ways)
  • Verbose bullet lists that could be one sentence
  • XML tags the model doesn't need
  • Examples that no longer match production input

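Run your prompt through a compressor such as LLMLingua, or prune it by hand. (Anthropic's "prompt improver" is good at raising output quality, but it tends to make prompts longer, so it is the wrong tool for this job.) Expect to cut 30–50% of tokens; rerun your evals afterwards to confirm behavior hasn't changed.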

7. Cache embeddings.

Embeddings are deterministic per model + input. Computing them twice is pure waste. A simple content-hash keyed cache (Redis, SQLite, even a Python dict for small apps) eliminates redundant embedding calls.

Typical savings: 40–80% of embedding cost in any system with re-indexing or duplicate content.
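
A content-hash keyed cache fits in a few lines. The sketch below uses an in-memory dict and takes the embedding call as a parameter; the class name, the embed_fn argument, and the model label are all stand-ins for whatever your stack uses.

```python
import hashlib

class EmbeddingCache:
    """Content-hash keyed cache in front of any embedding function.

    embed_fn is a stand-in for your provider call: it takes a string and
    returns a vector. Swap the dict for Redis or SQLite in production.
    """

    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn
        self.model = model
        self.store = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        # Include the model name: vectors from different models
        # are not interchangeable.
        raw = f"{self.model}\x00{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text: str):
        key = self._key(text)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Toy embedding function so the sketch runs without a provider.
cache = EmbeddingCache(lambda t: [float(len(t))], model="example-embedding-model")
first = cache.get("hello world")
second = cache.get("hello world")   # served from the cache, no second call
```

Keying on the model name as well as the text means a model upgrade naturally invalidates old vectors instead of silently mixing them.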

8. Batch when you can.

OpenAI and Anthropic both offer batch APIs at ~50% discount for async workloads: summarization jobs, nightly classification, dataset labeling. Batches are processed asynchronously, with results promised within 24 hours rather than seconds. If the result can wait that long, batch it.
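
On the OpenAI side, a batch is a JSONL file with one request per line. The sketch below builds that file body; the line format (custom_id, method, url, body) follows OpenAI's Batch API as I understand it, while the function name, prompt wording, and custom_id scheme are illustrative.

```python
import json

def to_batch_jsonl(texts, model="gpt-5.4-mini"):
    """Build the JSONL body for an OpenAI Batch API input file.

    One request per line. The model name is the one used earlier in this
    post; the prompt wording and custom_id scheme are illustrative.
    """
    lines = []
    for i, text in enumerate(texts):
        lines.append(json.dumps({
            "custom_id": f"summarize-{i}",   # used to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Summarize:\n{text}"}],
            },
        }))
    return "\n".join(lines)

jsonl = to_batch_jsonl(["first document", "second document"])
```

You would then upload the file with purpose "batch" and create the batch against the same endpoint with a 24-hour completion window. Results come back in arbitrary order, which is why each line carries a custom_id.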

9. Summarize long conversation history instead of resending it.

Chat apps that send the entire 30-turn history on every reply pay for the first turn 30 times over, and total input tokens grow quadratically with conversation length. Keep a rolling summary + the last 4–6 turns verbatim.

[SYSTEM]
[SUMMARY of turns 1-24]   ← ~400 tokens
[TURN 25, 26, 27, 28]     ← verbatim
[USER: new message]

Typical savings: 60–85% of input cost on long conversations.
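
The layout above can be assembled with a small helper. This sketch assumes the summary already exists (for example, produced by a cheap model every few turns) and places it in the system prompt; the function and argument names are hypothetical.

```python
def build_context(system_prompt, summary, turns, new_message, keep_last=4):
    """Return (system, messages): rolling summary plus recent turns verbatim.

    turns is a list of {"role": ..., "content": ...} dicts, oldest first.
    Producing `summary` is out of scope for this sketch.
    """
    system = system_prompt
    if summary:
        system += "\n\nSummary of the earlier conversation:\n" + summary
    recent = turns[-keep_last:] if keep_last > 0 else []
    messages = recent + [{"role": "user", "content": new_message}]
    return system, messages

# 28 alternating turns, as in the layout above
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i + 1}"}
    for i in range(28)
]
system, messages = build_context(
    "You are a support assistant.",
    "User wants a refund for a late order.",
    history,
    "Any update?",
)
```

Only turns 25 to 28 and the new message are sent verbatim; everything older travels as the short summary.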

10. Keep tool definitions tight.

If your agent has 40 tools, every call sends 40 tool schemas. Most agent turns only need 3–5. Dynamic tool selection (route-then-call, or filtering which MCP tools you expose on a given turn) cuts this dramatically.

Also: trim tool descriptions. "A function to retrieve the current weather for a given location based on the city name" → "Get current weather for a city."

Typical savings: 20–60% of input cost on tool-heavy agents.
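
One simple way to do route-then-call is keyword matching against a tool registry before the model call. The registry, keywords, and abbreviated schemas below are hypothetical; a production version might use embeddings or a small classifier instead.

```python
# Hypothetical registry: name -> (trigger keywords, schema sent to the model).
# Schemas are abbreviated; real ones also carry the input parameters.
TOOLS = {
    "get_weather": (
        {"weather", "forecast", "temperature"},
        {"name": "get_weather", "description": "Get current weather for a city."},
    ),
    "search_orders": (
        {"order", "refund", "shipping"},
        {"name": "search_orders", "description": "Look up a customer's orders."},
    ),
    "create_ticket": (
        {"ticket", "bug", "issue"},
        {"name": "create_ticket", "description": "Open a support ticket."},
    ),
}

def select_tools(query: str, max_tools: int = 5) -> list:
    """Return only the schemas whose keywords appear in the query."""
    cleaned = query.lower().replace("?", " ").replace(".", " ").replace(",", " ")
    words = set(cleaned.split())
    scored = [(len(keywords & words), schema) for keywords, schema in TOOLS.values()]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [schema for score, schema in ranked if score > 0][:max_tools]
```

When nothing matches, the call goes out with no tools at all, which is usually the right outcome for small talk.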

11. Use extended thinking selectively — not by default.

Claude's extended thinking and GPT's reasoning tokens are powerful and expensive: the thinking is billed as output tokens. Turning them on for every request is a common mistake. Use them only when the task's difficulty warrants it (math, multi-step planning, code review), not for "what's the capital of France."

Typical savings: 40–70% of output cost on mixed workloads where thinking was enabled globally.
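
In code, "selectively" can be a per-request switch. The sketch below assumes the thinking parameter shape from Anthropic's Messages API ({"type": "enabled", "budget_tokens": N}, with max_tokens larger than the budget); newer models may expose this differently. The task labels and budgets are illustrative.

```python
HARD_TASKS = {"math", "planning", "code_review"}  # hypothetical task labels

def request_kwargs(task_type: str, prompt: str) -> dict:
    """Enable extended thinking only for task types that tend to need it.

    Budgets are illustrative. max_tokens must exceed budget_tokens,
    because the thinking counts against the output limit.
    """
    kwargs = {
        "model": "claude-sonnet-4-6",   # model name from earlier in this post
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    if task_type in HARD_TASKS:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 4096}
        kwargs["max_tokens"] = 4096 + 1024   # room for thinking plus the answer
    return kwargs
```

The task label can come from the same router used for model selection, so one classification decides both the model and whether thinking is on.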

12. Measure first, then optimize.

Log input_tokens, output_tokens, cached_tokens, and model name on every call. Aggregate by endpoint. You will find:

  • One endpoint burning 60% of your bill
  • A debug log you forgot to remove sending full transcripts
  • A prompt that accidentally doubled last sprint
  • A fallback path that silently routes to Opus

Without logging, every optimization is a guess. With it, you can pick the 20% of work that returns 80% of the savings — and skip everything else on this list.
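
A first version of that aggregation needs nothing beyond the standard library. The prices, record fields, and endpoint names below are placeholders, and the sketch assumes cached tokens are logged separately from uncached input tokens, which varies by provider.

```python
from collections import defaultdict

# Hypothetical prices in dollars per million tokens: (input, cached input, output).
PRICES = {"small": (1.0, 0.1, 5.0), "flagship": (5.0, 0.5, 25.0)}

def cost(record: dict) -> float:
    """Dollar cost of one logged call.

    Assumes cached_tokens are counted separately from input_tokens;
    check how your provider reports them.
    """
    p_in, p_cached, p_out = PRICES[record["model"]]
    return (record["input_tokens"] * p_in
            + record["cached_tokens"] * p_cached
            + record["output_tokens"] * p_out) / 1_000_000

def spend_by_endpoint(records: list) -> dict:
    """Total spend per endpoint, most expensive first."""
    totals = defaultdict(float)
    for r in records:
        totals[r["endpoint"]] += cost(r)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

log = [
    {"endpoint": "/chat", "model": "flagship",
     "input_tokens": 40_000, "cached_tokens": 0, "output_tokens": 2_000},
    {"endpoint": "/classify", "model": "small",
     "input_tokens": 1_000, "cached_tokens": 4_000, "output_tokens": 50},
    {"endpoint": "/chat", "model": "flagship",
     "input_tokens": 30_000, "cached_tokens": 0, "output_tokens": 1_000},
]
report = spend_by_endpoint(log)
```

In this toy log, /chat accounts for nearly all of the spend, which is the kind of skew the report is meant to surface.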

Putting it together

A realistic production AI app, after applying #1–#4 and #9:

               Before           After
Monthly bill   $18,400/mo       $4,100/mo
p95 latency    3.1s             1.4s
Model mix      100% flagship    78% small / 22% flagship

The biggest wins aren't from clever prompt tricks. They're from not paying for tokens you were never using.

Numbers in this post are illustrative industry ranges based on typical production workloads — your mileage will vary. Measure before and after on your own traffic.
