The Smallest Agent That Works, Part 1: The Three Cheap Agents

Most production agents are over-engineered for the task they were built for. A team reaches for a "reasoning agent" when a regex would have shipped six months earlier with a tenth of the operational surface. The instinct comes from a hierarchy that's mostly false: more autonomy is more capable, more capable is better, therefore build up. The cheaper truth is that capability you don't need is operational cost you do pay.

This is the first of three posts on picking the smallest agent that works. Part 1 covers the three cheap tiers and the conditions under which each one beats the more expensive options. Part 2 covers what to do when these aren't enough. Part 3 covers the narrow cases where you actually need persistent state.

The three cheap tiers

Tier	What it does	Cost profile	When it wins
Fixed pipeline	Deterministic steps — no model in the loop	Free at runtime, fixed at build time	Inputs are structured, outputs are predictable
LLM with rule constraints	One model call, bounded by a prompt-and-validator	Cheap per call, fast	Inputs are noisy but outcomes are bounded
Reasoning loop (ReAct)	Multi-step think-then-act loop with branching	Costs scale with steps	Multi-stage tasks where the path can't be pre-decided

Each tier costs more than the one before it, and each covers cases the one before it can't. The discipline is to start at tier 1 and move to the next tier only when the current one demonstrably fails.

1. Fixed pipeline — when determinism is the feature

The case for the boring option: rule-based pipelines don't hallucinate, don't drift, and don't change their behavior because someone tuned a temperature parameter. RPA processing invoices into a finance system isn't glamorous, but the failure mode is "the OCR misread one digit," which is bounded and debuggable. The failure mode of an LLM-driven invoice agent is "it hallucinated a vendor."

What it looks like in practice: a Python script reads a CSV, validates each row against a schema, calls a known API per row, logs failures. No model in the loop. Maintenance is reading the logs once a quarter.
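As a rough sketch of that shape, assuming a hypothetical three-column invoice CSV and a `post_invoice` callable standing in for whatever API you actually call (neither comes from a real system):

```python
import csv
import io
import logging
from typing import Callable

logger = logging.getLogger("invoice_pipeline")

# Hypothetical schema: column name -> parser that raises ValueError on bad input.
SCHEMA: dict[str, Callable[[str], object]] = {
    "invoice_id": str,
    "vendor": str,
    "amount": float,
}


def validate_row(row: dict) -> dict:
    """Return a typed copy of the row, or raise ValueError if it breaks the schema."""
    clean = {}
    for column, parse in SCHEMA.items():
        raw = (row.get(column) or "").strip()
        if not raw:
            raise ValueError(f"missing value for {column!r}")
        clean[column] = parse(raw)  # e.g. float("abc") raises ValueError
    return clean


def run_pipeline(csv_text: str, post_invoice: Callable[[dict], None]) -> tuple[int, int]:
    """Validate each row, call the API once per valid row, log and count failures."""
    ok = failed = 0
    # start=2 so logged line numbers match the file (line 1 is the header).
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        try:
            post_invoice(validate_row(row))
            ok += 1
        except Exception as exc:  # bad row or API error: log it and keep going
            logger.warning("line %d failed: %s", line_no, exc)
            failed += 1
    return ok, failed
```

Every branch here is visible in the source. When a row fails, the log line says which row and why, which is the whole debugging story.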

When not to use it. The moment the input is unstructured (free-text emails, varied document layouts, ambiguous user requests), the rules start to multiply: every new layout or phrasing needs its own special case. You'll spend more time tuning regex than the next tier would spend on prompt engineering. As a rough rule of thumb, if your rule file is over 200 lines, you've outgrown the tier.

The diagnostic question: if the input format changed tomorrow, how much rework? If the answer is "a few lines," stay at this tier. If it's "a rewrite," go up.

2. LLM with rule constraints — one shot, bounded

This is the tier most teams underrate. A single model call, with a tight prompt, a schema for output, and a validator on the way out. The model handles the messiness of natural language; the rules around it handle the messiness of model behavior.

What it looks like: an email router that takes an inbound message, classifies it into one of seven buckets, and returns structured JSON. The prompt says "you must return one of these seven values"; the validator rejects anything else. If validation fails, retry with a stricter prompt, then fall back to human review on the third miss.
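A minimal sketch of that router, under some loud assumptions: the seven bucket names are invented for illustration, and `call_model(prompt, email)` is a hypothetical stand-in for whichever LLM client you use, not a real library call.

```python
import json
from typing import Callable, Optional

# Hypothetical bucket names; the only fixed fact is that there are seven.
BUCKETS = {"billing", "bug_report", "feature_request", "account", "sales", "spam", "other"}
MAX_ATTEMPTS = 3  # after the third miss the message goes to a human

BASE_PROMPT = (
    'Classify the email. Reply with JSON like {"bucket": "billing"}. '
    "Allowed buckets: " + ", ".join(sorted(BUCKETS)) + "."
)
STRICT_SUFFIX = " Return ONLY the JSON object, with no other text."


def validate(raw: str) -> Optional[str]:
    """Return the bucket if the reply is valid JSON naming an allowed bucket, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    bucket = data.get("bucket")
    return bucket if isinstance(bucket, str) and bucket in BUCKETS else None


def route(email: str, call_model: Callable[[str, str], str]) -> dict:
    """One bounded classification with retries, then a human-review fallback."""
    prompt = BASE_PROMPT
    for attempt in range(1, MAX_ATTEMPTS + 1):
        bucket = validate(call_model(prompt, email))
        if bucket is not None:
            return {"bucket": bucket, "attempts": attempt, "needs_human": False}
        if attempt == 1:
            prompt += STRICT_SUFFIX  # every retry uses the stricter prompt
    return {"bucket": None, "attempts": MAX_ATTEMPTS, "needs_human": True}
```

The model never decides what happens next. The loop is a retry counter, not reasoning, and the worst case is a message landing in a human queue.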

Three properties make this tier work:

Single call. No loop, no agentic reasoning. The model's job is to map noisy input to bounded output.
Schema enforcement. The output shape is specified, not hoped for. JSON Schema, Pydantic, Zod — pick one and gate every response on it.
Bounded fallback. The agent doesn't have agency; if the call misses three times, a human sees it. This caps the blast radius of a bad model day.

The high-volume / low-stakes sweet spot lives here: support ticket triage, content moderation flags, draft routing, basic summarization with a fixed structure. Tasks that benefit from semantic understanding but don't need the model to do anything beyond returning a label.

When not to use it. If the task genuinely requires multi-step planning — fetch this, then based on that, decide whether to fetch the other thing — you can simulate it inside a single prompt with a complex schema, but the prompt becomes unreadable and the failure modes become opaque. That's the cue to go up a tier.

3. Reasoning loop — when the path can't be pre-decided

A ReAct-style agent thinks, takes an action, observes the result, thinks again. Each loop is a model call. The capability you gain is dynamic path selection — the agent doesn't need to know all the steps in advance; it figures them out as it goes.

What it looks like: a planning agent given "draft a Q3 plan for the platform team." First call: think about what information is needed. Action: fetch the current OKRs. Observation: here they are. Second call: think about what else. Action: check team capacity. And so on until the plan is drafted.
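The loop itself is small. Here is a sketch with the model replaced by a scripted stand-in so it runs without an API key; the tool names, their canned outputs, and the `scripted_model` policy are all hypothetical.

```python
from typing import Callable

# Hypothetical tools; real ones would query an OKR tracker or a capacity sheet.
TOOLS: dict[str, Callable[[], str]] = {
    "fetch_okrs": lambda: "OKR: cut deploy time by 30%",
    "check_capacity": lambda: "Capacity: 4 engineers",
}


def scripted_model(history: list[str]) -> dict:
    """Stand-in for a model call: picks the next step from what has been observed."""
    seen = " ".join(history)
    if "OKR:" not in seen:
        return {"thought": "Need current goals first.", "action": "fetch_okrs"}
    if "Capacity:" not in seen:
        return {"thought": "Goals known; need headcount.", "action": "check_capacity"}
    observations = [h for h in history if h.startswith("Observation:")]
    return {
        "thought": "Enough context.",
        "action": "finish",
        "answer": f"Draft plan using {len(observations)} observations",
    }


def react(task: str, model: Callable[[list[str]], dict], max_steps: int = 6) -> tuple[str, int]:
    """Think, act, observe until the model finishes or the step budget runs out."""
    history = [f"Task: {task}"]
    for calls in range(1, max_steps + 1):
        step = model(history)  # one model call per loop iteration
        if step["action"] == "finish":
            return step["answer"], calls
        observation = TOOLS[step["action"]]()
        history.append(f"Thought: {step['thought']}")
        history.append(f"Observation: {observation}")
    return "stopped: step budget exhausted", max_steps
```

The `max_steps` cap is a design choice worth keeping in any real version: without it, a confused model can loop until the budget runs out.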

This is the lowest tier with genuine agentic behavior, and it's the first one where the bill starts to matter. Each loop is a token-billed call. A six-step task is six round trips, and if any step branches, you're paying for paths that didn't pan out.
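A back-of-envelope sketch of why the bill grows faster than the step count. It assumes each call re-sends the prompt plus everything observed so far, which is common in loop implementations but not universal, and the token counts are made up for illustration.

```python
def loop_input_tokens(steps: int, base_prompt: int, per_step_growth: int) -> int:
    """Total input tokens when call k re-sends the prompt plus k earlier steps (k from 0)."""
    return sum(base_prompt + k * per_step_growth for k in range(steps))


# Hypothetical numbers: a 1,000-token prompt, each step adding 500 tokens of context.
single_call = loop_input_tokens(1, 1000, 500)  # 1000
six_steps = loop_input_tokens(6, 1000, 500)    # 6*1000 + 500*(0+1+2+3+4+5) = 13500
```

Under those assumptions, six steps cost 13.5 times the input tokens of one call, not six times, because the context grows with every round trip.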

When not to use it. Three failure modes worth knowing.

The task has a fixed shape. "For every incoming ticket, classify it" doesn't need reasoning — every ticket follows the same steps. Use tier 2.
The reasoning is theater. If the agent's "thoughts" don't change its actions, you're paying for monologue. Strip the loop, hand-write the sequence, drop down a tier.
The branching is shallow. Two possible paths can be an if statement. Reasoning loops earn their keep when there are five or more plausible next moves and the right one depends on intermediate results.

The honest test: trace through a typical task and ask whether the model's reasoning changed the action it took. If the answer is "no" most of the time, you're at the wrong tier.
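One crude way to run that test over logged runs, assuming you record each task's actions as a list of action names (a hypothetical trace format). It measures a proxy: if nearly every trace follows the same action sequence, the reasoning isn't changing the path.

```python
from collections import Counter


def fixed_path_share(traces: list[list[str]]) -> float:
    """Fraction of traces whose action sequence equals the most common sequence."""
    if not traces:
        raise ValueError("need at least one trace")
    counts = Counter(tuple(trace) for trace in traces)
    return counts.most_common(1)[0][1] / len(traces)
```

A share near 1.0 suggests the sequence can be hand-written. What counts as "near" is a judgment call for your own workload.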

Three diagnostic questions before you build

The answer to each one pushes you up or down the ladder.

Is the input structured? Yes → tier 1 or 2. No → tier 2 or 3.
Is the output bounded? Yes (one of N labels, a JSON schema) → tier 1 or 2. No (free-form text, multi-step plan) → tier 3 or higher.
Does the path change based on intermediate results? No → tier 1 or 2. Yes → tier 3 or higher.
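The three answers can be folded into one small function. This is one possible reading, not the only one: it treats each answer as a floor and returns the cheapest tier worth trying first. The function and parameter names are made up for illustration.

```python
def starting_tier(input_structured: bool, output_bounded: bool, path_depends_on_results: bool) -> int:
    """Cheapest tier worth trying first; 3 means 'tier 3 or higher'."""
    floor = 1
    if not input_structured:
        floor = max(floor, 2)  # noisy input needs a model somewhere
    if not output_bounded or path_depends_on_results:
        floor = max(floor, 3)  # open-ended output or dynamic path needs a loop
    return floor
```

For yes / yes / no this returns 1, and for no / yes / no it returns 2. Either way, the result is where to start, not where to end up.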

Most production tasks score yes / yes / no — and the tier 2 architecture handles them fine. The trap is reading the three questions as a recommendation to scale up; they're a recommendation to scale down when the answers permit.

What you don't get at these tiers

What the cheap tiers cannot do:

Look up information they weren't trained on (covered in Part 2).
Take action on external systems beyond a single bounded call (Part 2).
Critique their own output before returning it (Part 2).
Remember anything between requests (Part 3).
Improve over time (Part 3).

Knowing what's missing is the point. Most tasks don't need any of it.

Reach for the simplest architecture that produces the output you need. The instinct to add capability is strong; the instinct to subtract is what makes a stack maintainable two years in.
