
The Smallest Agent That Works, Part 2: The Three Reach-Out Agents

When the cheap tiers run out, the agent has to reach beyond the model itself — into knowledge it doesn't have, tools it can't natively use, or its own previous answer. RAG, tool use, and self-critique: three patterns, three failure modes worth pricing in. Part 2 of three.

Initial Editor · 2026-05-27 · 6 min read · 1,291 words

Part 1 covered three architectures that solve most problems without leaving the model's own context. This post covers the next tier: agents that reach outside themselves. The agent borrows from somewhere — fresh knowledge, an external system, or its own previous attempt — to do something the base model couldn't do alone.

These are augmentation patterns, not state. The agent still doesn't remember anything between sessions. What changes is what it can pull in during a single session.

Three ways to reach out

| Pattern | What it pulls in | When it earns the cost | Failure mode |
| --- | --- | --- | --- |
| Retrieval (RAG) | Fresh or domain-specific knowledge | The model would otherwise hallucinate or be stale | Bad retrieval is worse than no retrieval |
| External tools | The ability to act on systems the model doesn't natively control | The action is non-trivial and needs current data | Tool failures cascade silently |
| Self-critique | A second pass at the model's own output | The first pass is unreliable on this task | Doubles the cost for marginal improvement |

These compose. A real agent might use all three. But each one is worth pricing separately, because each one has a separate failure mode you'll need to debug independently.

1. Retrieval — when the model's training is the bottleneck

The base model knows what was in its training data, cut off at a date you don't control. If your task depends on information that's domain-specific, internal, or fresher than the cutoff, retrieval is the bridge. Pull the relevant documents into context at query time and let the model answer over them.

What it looks like: a legal research agent. The user asks "what's our position on the new SEC rule?" The agent embeds the query, retrieves the three most relevant internal memos, and answers, grounding every claim in them. The model brings language fluency; the retrieval brings the facts.

The wins are real: hallucinations drop sharply, the answer cites sources you trust, the system stays useful as your knowledge base grows.
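The embed, retrieve, answer flow can be sketched in a few lines. This is a minimal illustration, not a production retriever: `embed` is a toy bag-of-words stand-in for a real embedding model, and `MEMOS`, `retrieve`, and `build_prompt` are hypothetical names invented for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k best (doc_id, score) pairs, highest score first."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def build_prompt(query: str, docs: dict[str, str], k: int = 3) -> str:
    """Place the retrieved text ahead of the question, labelled by source id."""
    hits = retrieve(query, docs, k)
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id, _ in hits)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n\nQuestion: {query}"
    )

# Hypothetical internal memos, made up for this sketch.
MEMOS = {
    "memo-1": "Our position on the new SEC rule is to comply early.",
    "memo-2": "Office lunch menu for the spring quarter.",
    "memo-3": "SEC rule timeline and disclosure requirements.",
    "memo-4": "Parking garage access codes.",
}
```

Returning the scores alongside the ids is deliberate: when the wrong memo comes back, the score list is the first thing you will want to look at.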

When not to use it. Three honest failure modes.

  • The retrieval is bad. If your top-three chunks are off-topic, the model now has worse context than it had without RAG — it's been actively misled. Bad retrieval beats no retrieval only on marketing slides.
  • The knowledge is small enough to fit in the prompt. If your entire knowledge base is 4,000 tokens, paste it into the prompt and skip the retrieval infrastructure. Retrieval at that size adds latency and an indexing pipeline for no gain.
  • The user's query is structured. "Get me the invoice for vendor X in March" is a database query in disguise. Translate it to SQL and hit the database. RAG turns an exact, indexed lookup into an approximate similarity search.
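The structured-query case can be sketched with the standard-library `sqlite3` module. The `invoices` table, its columns, and the sample rows are assumptions made up for illustration, and the step that maps the user's sentence to `vendor` and `month` is taken as already done.

```python
import sqlite3

def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a hypothetical invoices table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (vendor TEXT, month TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?)",
        [
            ("Acme", "2026-03", 1200.0),
            ("Acme", "2026-04", 900.0),
            ("Globex", "2026-03", 450.0),
        ],
    )
    return conn

def invoices_for(conn: sqlite3.Connection, vendor: str, month: str) -> list[tuple]:
    """Exact lookup with bound parameters: no embeddings, no ranking."""
    cursor = conn.execute(
        "SELECT vendor, month, amount FROM invoices "
        "WHERE vendor = ? AND month = ? ORDER BY amount",
        (vendor, month),
    )
    return cursor.fetchall()
```

The query either matches or it doesn't, so there is no "top three chunks" to get wrong.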

The diagnostic: when retrieval fails on a query, is the failure debuggable? If you can't trace why the wrong chunks came back, your RAG layer is a liability.

2. External tools — when the agent has to do, not just say

Some tasks require the agent to act — call an API, run a query, push a commit. The base model can describe how to do these things; it can't execute them. Tool use is the contract: the model emits a structured call, your runtime executes it, the result feeds back in.
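That contract can be sketched as a small dispatcher. The JSON call shape (`tool`, `arguments`), the returned message shape, and the `order_status` tool are assumptions for this example, not any particular vendor's format; `ORDERS` is a dict standing in for an external system.

```python
import json
from typing import Any, Callable

# Stand-in for an external system the model cannot reach by itself.
ORDERS = {"A-100": "shipped", "A-101": "processing"}

def order_status(order_id: str) -> str:
    """Hypothetical tool: look up the current status of an order."""
    return ORDERS.get(order_id, "unknown")

# Registry of the functions the agent is allowed to call.
TOOLS: dict[str, Callable[..., Any]] = {"order_status": order_status}

def execute_call(raw_call: str) -> dict:
    """Parse the model's structured call, run it, and package the result
    as a message that can be appended to the conversation."""
    call = json.loads(raw_call)
    func = TOOLS[call["tool"]]
    result = func(**call["arguments"])
    return {"role": "tool", "name": call["tool"], "content": json.dumps(result)}
```

The model only ever produces the JSON string; everything that touches a real system happens inside `execute_call`, under your control.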

What it looks like: a code-generation agent that doesn't just write the function but runs the test suite on it, observes the failures, and patches. A data analysis bot that doesn't describe the analysis — it runs the pandas query and returns the dataframe. Each "tool" is a function the agent can choose to call.

The win is that the agent's output becomes verifiable. Instead of "here's how to fix the bug," you get "I ran the test, it passed."

When not to use it. Three traps.

  • The tool surface is too big. Register twenty tools and the model starts picking the wrong one. Each tool you expose is a chance for the model to misroute the request. Start with three; add carefully.
  • The tool failures aren't surfaced. Tool errors that the runtime swallows cascade silently. The agent gets null, treats it as "no result," and returns a confidently wrong answer. Tool errors need to be loud: fed back into the prompt as visible errors the agent has to handle.
  • The tool wraps a deterministic system. "Use the tool to call our REST API" is fine. "Use the tool to deterministically format JSON" is the agent doing what a ten-line function would do, with extra steps and a model bill.
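One way to keep tool failures loud is to turn every exception into an explicit error message instead of a null result. This is a sketch under assumptions: the call and message shapes are invented for the example, and `flaky_lookup` is a hypothetical tool that always fails.

```python
import json
from typing import Any, Callable

def flaky_lookup(key: str) -> str:
    """Hypothetical tool that fails, standing in for an unreliable upstream."""
    raise TimeoutError("upstream timed out")

def safe_execute(tools: dict[str, Callable[..., Any]], raw_call: str) -> dict:
    """Run a tool call; on any failure, return a visible error message
    rather than None so the agent has to deal with it."""
    try:
        call = json.loads(raw_call)
        result = tools[call["tool"]](**call["arguments"])
    except Exception as exc:  # deliberately broad: nothing may fail silently
        return {
            "role": "tool",
            "ok": False,
            "content": f"TOOL ERROR ({type(exc).__name__}): {exc}",
        }
    return {"role": "tool", "ok": True, "content": json.dumps(result)}
```

The `ok` flag is for your runtime and monitoring; the error text in `content` is what goes back into the prompt.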

The diagnostic: for every tool the agent has, ask whether a few lines of hand-written Python would do the same job. If yes, drop the tool.
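The JSON-formatting case shows how small the hand-written alternative can be. A sketch using only the standard library; `format_json` is a hypothetical helper name.

```python
import json

def format_json(payload: dict) -> str:
    """Deterministic formatting: same input, same output, no model call."""
    return json.dumps(payload, indent=2, sort_keys=True)
```

Call it from your own code after the model returns; it never needs to be registered as a tool.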

3. Self-critique — the second pass

The cheapest reach-out to build: have the model review its own output. It needs no index and no tool runtime, only extra calls. The first call generates a draft; the second evaluates the draft against explicit criteria; a third call (or the second one, with revision instructions) produces the final answer. The agent reaches out to its own previous response.

What it looks like: a QA agent for generated documentation. Draft the doc. Then critique: "Are all code examples runnable? Does the API description match the actual signature? Are there sections that contradict each other?" Then revise based on the critique.
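The draft, critique, revise loop might look like the sketch below. The model is passed in as a plain callable so the control flow can be exercised without any API; the prompt wording, the `PASS` convention, and `scripted_model` (a fake model that replays canned replies) are assumptions made up for this example.

```python
from typing import Callable, Iterable

def draft_critique_revise(
    task: str,
    criteria: Iterable[str],
    model: Callable[[str], str],
) -> str:
    """Generate a draft, check it against explicit criteria, revise if needed."""
    draft = model(f"Write the following.\n{task}")
    checklist = "\n".join(f"- {c}" for c in criteria)
    critique = model(
        "Check the draft against each criterion. Reply PASS if all hold, "
        f"otherwise list the failures.\nCriteria:\n{checklist}\nDraft:\n{draft}"
    )
    if critique.strip() == "PASS":
        return draft  # nothing to fix, so skip the third call
    return model(
        f"Revise the draft to fix these failures.\nFailures:\n{critique}\nDraft:\n{draft}"
    )

def scripted_model(replies: list[str]) -> Callable[[str], str]:
    """Fake model for testing: returns the canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)

CRITERIA = [
    "All code examples are runnable.",
    "The API description matches the actual signature.",
    "No sections contradict each other.",
]
```

Skipping the revision call when the critique passes keeps the common case at two calls instead of three.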

The win is non-trivial: on tasks where the model has a known failure mode (skipping validation, inventing API names, missing edge cases), forcing a second pass against an explicit checklist catches a meaningful chunk of them.

When not to use it. The trap is doubling the cost without doubling the value.

  • The criteria are fuzzy. "Is the output good?" produces fuzzy critiques. The model marks its own work as fine and you've spent two calls on one answer. The criteria have to be specific and checkable.
  • The first pass is already good. On tasks the model handles well in one shot, the second pass mostly rephrases the first. You're paying for paraphrase.
  • The critique reveals what the rewrite would do anyway. If you can collapse the critique-then-revise into a single prompt with the criteria upfront, do that. Two calls is overhead unless the critique step is genuinely diagnostic.

The diagnostic: run the same task with and without self-critique on twenty examples. If the critique-enabled version is meaningfully better on more than half, it earns the slot. If not, drop it and write a stricter first-pass prompt.
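The twenty-example check reduces to counting wins. In this sketch the per-example scores are assumed to come from whatever grading you already trust, and `min_gain`, the threshold for "meaningfully better", is an assumption you would set yourself.

```python
def critique_wins(
    baseline: list[float], with_critique: list[float], min_gain: float = 0.1
) -> int:
    """Count examples where the critique version beats the baseline by min_gain."""
    if len(baseline) != len(with_critique):
        raise ValueError("score lists must cover the same examples")
    return sum(1 for b, c in zip(baseline, with_critique) if c - b >= min_gain)

def critique_earns_slot(
    baseline: list[float], with_critique: list[float], min_gain: float = 0.1
) -> bool:
    """True when self-critique is meaningfully better on more than half."""
    return critique_wins(baseline, with_critique, min_gain) > len(baseline) / 2
```

Comparing per example rather than averaging keeps one large win from hiding nineteen ties.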

How these stack

A common production shape uses all three:

  1. Tools to fetch the latest state from external systems.
  2. Retrieval to ground the answer in your domain knowledge.
  3. Self-critique to catch the failure modes you've seen the model exhibit.

Each adds latency and cost, and each adds a failure surface to monitor. The default ordering (tools first, retrieval second, self-critique third) works because each later layer can check the layers before it: the critic can catch a bad retrieval, and retrieved documents can cross-check a stale or implausible tool result. The reverse doesn't hold, because an earlier layer never sees what a later one produces.
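The ordering can be sketched as one function with each layer injected as a callable. All four callables and their signatures are hypothetical, chosen only to make the data flow visible; the fakes at the bottom exist so the sketch runs without a model.

```python
from typing import Callable, Sequence

def answer(
    question: str,
    fetch_state: Callable[[str], str],
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str, str, Sequence[str], Sequence[str]], str],
    critique: Callable[[str, str, Sequence[str]], Sequence[str]],
) -> str:
    """Tools first, retrieval second, self-critique third."""
    state = fetch_state(question)                  # 1. latest external state
    sources = retrieve(question)                   # 2. grounding documents
    draft = generate(question, state, sources, [])
    problems = critique(draft, state, sources)     # 3. check against both
    if problems:
        draft = generate(question, state, sources, problems)
    return draft

def fake_generate(question, state, sources, problems):
    """Fake model: wrong on the first pass, corrected once given problems."""
    if problems:
        return f"Order A-100 is {state}."
    return "Order A-100 is delivered."

def fake_critique(draft, state, sources):
    """Fake critic: flags a draft that disagrees with the tool result."""
    return [] if state in draft else [f"Draft contradicts tool result: {state}"]
```

The critic receives the tool state and the sources, which is what lets the last layer check the earlier ones.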

What you still don't get at these tiers

The reach-out patterns are stateless. Every request starts fresh. The agent that retrieved your customer history yesterday doesn't remember you today; it retrieves it again from scratch. That's fine for most tasks — and it keeps the architecture clean.

Part 3 covers the narrow cases where statelessness is the failure mode itself.

The reach-out patterns earn their cost only when the cheap tiers have demonstrably failed. Test the boring version first. If your boring version handles the task in one prompt, you didn't need RAG, you didn't need tools, and you didn't need a second pass.
