The Smallest Agent That Works, Part 1: The Three Cheap Agents

Most agent stacks are built one tier too capable for the job. Three of the cheapest architectures — a fixed pipeline, an LLM with rule constraints, and a reasoning loop — solve more problems than the architecture diagrams admit. Part 1 of three.

Initial Editor·2026-05-26·6min read·1,254 words

Most production agents are over-engineered for the task they were built for. A team reaches for a "reasoning agent" when a regex would have shipped six months earlier with a tenth of the operational surface. The instinct comes from a hierarchy that's mostly false: more autonomy is more capable, more capable is better, therefore build up. The cheaper truth is that capability you don't need is operational cost you do pay.

This is the first of three posts on picking the smallest agent that works. Part 1 covers the three cheap tiers and the conditions under which each one beats the more expensive options. Part 2 covers what to do when these aren't enough. Part 3 covers the narrow cases where you actually need persistent state.

The three cheap tiers

| Tier | What it does | Cost profile | When it wins |
| --- | --- | --- | --- |
| Fixed pipeline | Deterministic steps, no model in the loop | Free at runtime, fixed at build time | Inputs are structured, outputs are predictable |
| LLM with rule constraints | One model call, bounded by a prompt-and-validator | Cheap per call, fast | Inputs are noisy but outcomes are bounded |
| Reasoning loop (ReAct) | Multi-step think-then-act loop with branching | Costs scale with steps | Multi-stage tasks where the path can't be pre-decided |

Each tier costs more than the one before it, and each covers cases the one before it can't. The discipline is starting at tier 1 and moving up only when the current tier demonstrably fails.

1. Fixed pipeline — when determinism is the feature

The case for the boring option: rule-based pipelines don't hallucinate, don't drift, and don't change their behavior because someone tuned a temperature parameter. RPA processing invoices into a finance system isn't glamorous, but the failure mode is "the OCR misread one digit," which is bounded and debuggable. The failure mode of an LLM-driven invoice agent is "it hallucinated a vendor."

What it looks like in practice: a Python script reads a CSV, validates each row against a schema, calls a known API per row, logs failures. No model in the loop. Maintenance is reading the logs once a quarter.
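A minimal sketch of that script, assuming a made-up invoice schema and an injected `send` function standing in for the known API. The column names, `SCHEMA`, and `run_pipeline` are illustrative, not taken from any real system.

```python
import csv
import logging
from typing import Callable, Iterable

logger = logging.getLogger("pipeline")

# Hypothetical schema: column name -> parser that raises ValueError on bad input.
SCHEMA: dict[str, Callable[[str], object]] = {
    "invoice_id": str,
    "amount": float,
    "currency": str,
}


def validate_row(row: dict) -> dict:
    """Parse one CSV row against SCHEMA; raise ValueError if it does not fit."""
    parsed = {}
    for column, parse in SCHEMA.items():
        value = row.get(column)
        if value is None or value.strip() == "":
            raise ValueError(f"missing {column}")
        parsed[column] = parse(value.strip())
    return parsed


def run_pipeline(lines: Iterable[str], send: Callable[[dict], None]) -> tuple[int, int]:
    """Validate each row, call `send` for valid ones, log the rest.

    Returns (sent, failed). `send` is a stand-in for the real API call.
    """
    sent = failed = 0
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(csv.DictReader(lines), start=2):
        try:
            send(validate_row(row))
            sent += 1
        except Exception as exc:  # one bad row must not stop the batch
            logger.warning("row %d failed: %s", row_no, exc)
            failed += 1
    return sent, failed


# Usage with in-memory data and a list standing in for the API.
sample = ["invoice_id,amount,currency", "A-1,19.99,EUR", "A-2,not-a-number,EUR"]
delivered: list = []
print(run_pipeline(sample, delivered.append))  # (1, 1)
```

Every failure here is a row number and a reason in the log, which is what makes the tier cheap to maintain.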

When not to use it. The moment the input is unstructured (free-text emails, varied document layouts, ambiguous user requests), the rule set balloons: every new variation needs another special case. You'll spend more time tuning regex than the next tier would spend on prompt engineering. As a rough rule of thumb, if your rule file is over 200 lines, you've outgrown the tier.

The diagnostic question: if the input format changed tomorrow, how much rework? If the answer is "a few lines," stay at this tier. If it's "a rewrite," go up.

2. LLM with rule constraints — one shot, bounded

This is the tier most teams underrate. A single model call, with a tight prompt, a schema for output, and a validator on the way out. The model handles the messiness of natural language; the rules around it handle the messiness of model behavior.

What it looks like: an email router that takes an inbound message, classifies it into one of seven buckets, and returns structured JSON. The prompt says "you must return one of these seven values"; the validator rejects anything else. If validation fails, retry with a stricter prompt, then fall back to human review on the third miss.

Three properties make this tier work:

  • Single call. No loop, no agentic reasoning. The model's job is to map noisy input to bounded output.
  • Schema enforcement. The output shape is specified, not hoped for. JSON Schema, Pydantic, Zod — pick one and gate every response on it.
  • Bounded fallback. The agent doesn't have agency; if validation keeps failing after the retry budget is spent, a human sees it. This caps the blast radius of a bad model day.
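The router and its three properties can be sketched with the standard library alone. Everything named here is an assumption for illustration: the bucket names (the example only says there are seven), the prompt wording, the `call_model` callable standing in for whatever model client you use, and the `max_attempts` knob, which you would set to however many misses you tolerate before human review.

```python
import json
from typing import Callable, Optional

# Hypothetical bucket names; only the count of seven comes from the example.
BUCKETS = ("billing", "bug_report", "feature_request", "account", "sales", "spam", "other")

BASE_PROMPT = (
    "Classify the email into exactly one of: {buckets}.\n"
    'Reply with JSON of the form {{"bucket": "<value>"}}.\n\n'
    "Email:\n{email}"
)
STRICT_SUFFIX = "\n\nReturn ONLY the JSON object. Any other value or extra text is invalid."


def validate(raw: str) -> Optional[str]:
    """Return the bucket if `raw` is exactly {"bucket": <one of BUCKETS>}, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and set(data) == {"bucket"} and data["bucket"] in BUCKETS:
        return data["bucket"]
    return None


def route(email: str, call_model: Callable[[str], str], max_attempts: int = 2) -> dict:
    """One model call per attempt, gated by the validator, with a human fallback."""
    prompt = BASE_PROMPT.format(buckets=", ".join(BUCKETS), email=email)
    for attempt in range(1, max_attempts + 1):
        bucket = validate(call_model(prompt))
        if bucket is not None:
            return {"bucket": bucket, "attempts": attempt, "needs_human": False}
        if attempt == 1:
            prompt += STRICT_SUFFIX  # every retry uses the stricter prompt
    return {"bucket": None, "attempts": max_attempts, "needs_human": True}


# Usage with a fake model that misses once, then complies.
replies = iter(["Sure! It's billing.", '{"bucket": "billing"}'])
print(route("I was charged twice", lambda prompt: next(replies)))
```

In a real service the validator would more likely be a JSON Schema, Pydantic, or Zod model; the hand-rolled check just keeps the sketch dependency-free.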

The high-volume / low-stakes sweet spot lives here: support ticket triage, content moderation flags, draft routing, basic summarization with a fixed structure. Tasks that benefit from semantic understanding but don't need the model to do anything beyond returning a label.

When not to use it. If the task genuinely requires multi-step planning — fetch this, then based on that, decide whether to fetch the other thing — you can simulate it inside a single prompt with a complex schema, but the prompt becomes unreadable and the failure modes become opaque. That's the cue to go up a tier.

3. Reasoning loop — when the path can't be pre-decided

A ReAct-style agent thinks, takes an action, observes the result, thinks again. Each loop is a model call. The capability you gain is dynamic path selection — the agent doesn't need to know all the steps in advance; it figures them out as it goes.

What it looks like: a planning agent given "draft a Q3 plan for the platform team." First call: think about what information is needed. Action: fetch the current OKRs. Observation: here they are. Second call: think about what else. Action: check team capacity. And so on until the plan is drafted.
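A stripped-down sketch of that loop, with the model replaced by a scripted function so the control flow is visible. The step format, the `think` callable, the tool names, and the `max_steps` budget are all assumptions for illustration; a real agent would put a model call where `scripted_think` is.

```python
from typing import Callable

Step = dict  # assumed shape: {"thought": str, "action": str, "input": str}


def react_loop(
    task: str,
    think: Callable[[list[str]], Step],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> tuple[str, list[str]]:
    """Run think -> act -> observe until the model picks "finish" or the budget runs out."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = think(history)  # one model call per iteration
        history.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["input"], history
        tool = tools.get(step["action"])
        observation = tool(step["input"]) if tool else f"unknown action {step['action']!r}"
        history.append(f"Action: {step['action']}({step['input']})")
        history.append(f"Observation: {observation}")
    raise RuntimeError(f"no answer within {max_steps} steps")


def scripted_think(history: list[str]) -> Step:
    """Stand-in for the model: picks the next action from what the transcript already holds."""
    seen = "\n".join(history)
    if "fetch_okrs" not in seen:
        return {"thought": "I need the current OKRs.", "action": "fetch_okrs", "input": "platform"}
    if "check_capacity" not in seen:
        return {"thought": "I need team capacity.", "action": "check_capacity", "input": "Q3"}
    return {"thought": "Enough to draft.", "action": "finish", "input": "Q3 plan draft"}


tools = {
    "fetch_okrs": lambda team: "OKR list (placeholder)",
    "check_capacity": lambda quarter: "capacity table (placeholder)",
}
answer, trace = react_loop("draft a Q3 plan for the platform team", scripted_think, tools)
print("\n".join(trace))
```

The step budget is the one piece worth keeping in any real version: without it, a confused model loops until someone notices the invoice.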

This is the lowest tier with genuine agentic behavior, and it's the first one where the bill starts to matter. Each loop is a token-billed call. A six-step task is six round trips, and if any step branches, you're paying for paths that didn't pan out.
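A rough way to put numbers on that, assuming the loop re-sends the whole transcript on every call (common with stateless chat APIs, but an assumption about how your loop is wired). The token counts are invented for illustration.

```python
def loop_input_tokens(steps: int, base_prompt: int, added_per_step: int) -> int:
    """Total input tokens billed when each call re-sends the transcript so far."""
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context           # this call pays for everything accumulated so far
        context += added_per_step  # thought + action + observation get appended
    return total


# Hypothetical numbers: a 1,000-token prompt, 300 tokens added per step.
print(loop_input_tokens(1, 1000, 300))  # 1000
print(loop_input_tokens(6, 1000, 300))  # 10500
```

Under those assumptions the six-step task bills about ten times the input of a single call, not six, because later steps pay again for the earlier ones.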

When not to use it. Three failure modes worth knowing.

  • The task has a fixed shape. "For every incoming ticket, classify it" doesn't need reasoning — every ticket follows the same steps. Use tier 2.
  • The reasoning is theater. If the agent's "thoughts" don't change its actions, you're paying for monologue. Strip the loop, hand-write the sequence, drop down a tier.
  • The branching is shallow. Two possible paths can be an if statement. Reasoning loops earn their keep when there are five or more plausible next moves and the right one depends on intermediate results.

The honest test: trace through a typical task and ask whether the model's reasoning changed the action it took. If the answer is "no" most of the time, you're at the wrong tier.
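One crude proxy for that test, if you already log the actions each run took: measure how often the action sequence differs from the most common one. This is a hypothetical metric, not a standard one, and the trace format (a list of action names per run) is assumed.

```python
from collections import Counter


def path_variation(traces: list[list[str]]) -> float:
    """Fraction of runs whose action sequence differs from the most common sequence."""
    if not traces:
        raise ValueError("need at least one trace")
    counts = Counter(tuple(trace) for trace in traces)
    _, most_common_count = counts.most_common(1)[0]
    return (len(traces) - most_common_count) / len(traces)


# Nine runs took the same two steps; one run took a detour.
runs = [["fetch", "classify"]] * 9 + [["fetch", "lookup", "classify"]]
print(path_variation(runs))  # 0.1
```

A value near zero suggests the loop is replaying a fixed sequence and could be hand-written. It says nothing about whether the detours that do happen are the right ones, so read it alongside the traces, not instead of them.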

Three diagnostic questions before you build

The answer to each pushes you up or down the ladder.

  1. Is the input structured? Yes → tier 1 or 2. No → tier 2 or 3.
  2. Is the output bounded? Yes (one of N labels, a JSON schema) → tier 1 or 2. No (free-form text, multi-step plan) → tier 3 or higher.
  3. Does the path change based on intermediate results? No → tier 1 or 2. Yes → tier 3 or higher.
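The three questions can be read as floors: each answer rules out the tiers below some point, and the starting tier is the highest floor. That reading is one interpretation of the list above, and `smallest_tier` is an illustrative name. A result of 3 means "tier 3 or higher."

```python
def smallest_tier(input_structured: bool, output_bounded: bool, path_depends_on_results: bool) -> int:
    """Lowest tier that none of the three answers rules out (3 means "3 or higher")."""
    floors = [
        1 if input_structured else 2,         # noisy input needs at least a model call
        1 if output_bounded else 3,           # unbounded output points past tier 2
        3 if path_depends_on_results else 1,  # a dynamic path needs a loop
    ]
    return max(floors)


print(smallest_tier(True, True, False))    # 1
print(smallest_tier(False, True, False))   # 2
print(smallest_tier(False, False, True))   # 3
```

The result is where to start, not where to end up. Move up only after that tier has demonstrably failed on real inputs.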

Most production tasks score yes / yes / no — and the tier 2 architecture handles them fine. The trap is reading the three questions as a recommendation to scale up; they're a recommendation to scale down when the answers permit.

What you don't get at these tiers

What the cheap tiers cannot do:

  • Look up information they weren't trained on (covered in Part 2).
  • Take action on external systems beyond a single bounded call (Part 2).
  • Critique their own output before returning it (Part 2).
  • Remember anything between requests (Part 3).
  • Improve over time (Part 3).

Knowing what's missing is the point. Most tasks don't need any of it.

Reach for the simplest architecture that produces the output you need. The instinct to add capability is strong; the instinct to subtract is what makes a stack maintainable two years in.
