Part 1 covered three architectures that solve most problems without leaving the model's own context. This post covers the next tier: agents that reach outside themselves. The agent borrows from somewhere — fresh knowledge, an external system, or its own previous attempt — to do something the base model couldn't do alone.
These are augmentation patterns, not state. The agent still doesn't remember anything between sessions. What changes is what it can pull in during a single session.
Three ways to reach out
| Pattern | What it pulls in | When it earns the cost | Failure mode |
|---|---|---|---|
| Retrieval (RAG) | Fresh or domain-specific knowledge | The model would otherwise hallucinate or be stale | Bad retrieval is worse than no retrieval |
| External tools | The ability to act on systems the model doesn't natively control | The action is non-trivial and needs current data | Tool failures cascade silently |
| Self-critique | A second pass at the model's own output | The first pass is unreliable on this task | Doubles the cost for marginal improvement |
These compose. A real agent might use all three. But each one is worth pricing separately, because each one has a separate failure mode you'll need to debug independently.
1. Retrieval — when the model's training is the bottleneck
The base model knows what was in its training data, cut off at a date you don't control. If your task depends on information that's domain-specific, internal, or fresher than the cutoff, retrieval is the bridge. Pull the relevant documents into context at query time, let the model answer over them.
What it looks like: a legal research agent. The user asks "what's our position on the new SEC rule?" The agent embeds the query, retrieves the three most relevant internal memos, and answers grounding every claim in them. The model brings language fluency; the retrieval brings the facts.
The wins are real: hallucinations drop sharply, the answer cites sources you trust, the system stays useful as your knowledge base grows.
When not to use it. Three honest failure modes.
- The retrieval is bad. If your top-three chunks are off-topic, the model now has worse context than it had without RAG — it's been actively misled. Bad retrieval beats no retrieval only on marketing slides.
- The knowledge is small enough to fit in the prompt. If your entire knowledge base is 4,000 tokens, paste it into the prompt and skip the retrieval infrastructure. You've added latency and an indexing pipeline for no gain.
- The user's query is structured. "Get me the invoice for vendor X in March" is a database query in disguise. Translate it to SQL and hit the database. RAG turns a O(1) lookup into a O(N) embedding search.
The diagnostic: when retrieval fails on a query, is the failure debuggable? If you can't trace why the wrong chunks came back, your RAG layer is a liability.
2. External tools — when the agent has to do, not just say
Some tasks require the agent to act — call an API, run a query, push a commit. The base model can describe how to do these things; it can't execute them. Tool use is the contract: the model emits a structured call, your runtime executes it, the result feeds back in.
What it looks like: a code-generation agent that doesn't just write the function but runs the test suite on it, observes the failures, and patches. A data analysis bot that doesn't describe the analysis — it runs the pandas query and returns the dataframe. Each "tool" is a function the agent can choose to call.
The win is that the agent's output becomes verifiable. Instead of "here's how to fix the bug," you get "I ran the test, it passed."
When not to use it. Three traps.
- The tool surface is too big. Twenty registered tools and the model picks the wrong one half the time. Each tool you expose is a chance for the model to misroute the request. Start with three; add carefully.
- The tool failures aren't surfaced. Tool errors that the runtime swallows cascade silently. The agent gets
null, treats it as "no result," and returns a confidently wrong answer. Tool errors need to be loud — back into the prompt as visible errors the agent has to handle. - The tool wraps a deterministic system. "Use the tool to call our REST API" is fine. "Use the tool to deterministically format JSON" is the agent doing what a ten-line function would do, with extra steps and a model bill.
The diagnostic: for every tool the agent has, ask whether a hand-written sequence of three Python lines would do the same job. If yes, drop the tool.
3. Self-critique — the second pass
The cheapest reach-out: have the model review its own output. First call generates a draft; second call evaluates the draft against explicit criteria; third call (or the second one, with revision instructions) produces the final answer. The agent reaches out to its own previous response.
What it looks like: a QA agent for generated documentation. Draft the doc. Then critique: "Are all code examples runnable? Does the API description match the actual signature? Are there sections that contradict each other?" Then revise based on the critique.
The win is non-trivial: on tasks where the model has a known failure mode (skipping validation, inventing API names, missing edge cases), forcing a second pass against an explicit checklist catches a meaningful chunk of them.
When not to use it. The trap is doubling the cost without doubling the value.
- The criteria are fuzzy. "Is the output good?" produces fuzzy critiques. The model marks its own work as fine and you've spent two calls on one answer. The criteria have to be specific and checkable.
- The first pass is already good. On tasks the model handles well in one shot, the second pass mostly rephrases the first. You're paying for paraphrase.
- The critique reveals what the rewrite would do anyway. If you can collapse the critique-then-revise into a single prompt with the criteria upfront, do that. Two calls is overhead unless the critique step is genuinely diagnostic.
The diagnostic: run the same task with and without self-critique on twenty examples. If the critique-enabled version is meaningfully better on more than half, it earns the slot. If not, drop it and write a stricter first-pass prompt.
How these stack
A common production shape uses all three:
- Tools to fetch the latest state from external systems.
- Retrieval to ground the answer in your domain knowledge.
- Self-critique to catch the failure modes you've seen the model exhibit.
Each costs additional latency and money. Each adds a failure surface to monitor. The default ordering — tools first, retrieval second, self-critique third — works because each later layer can correct mistakes from earlier ones (the critic can catch a bad retrieval; the retrieval can fact-check a hallucinated tool result), but the inverse doesn't hold.
What you still don't get at these tiers
The reach-out patterns are stateless. Every request starts fresh. The agent that retrieved your customer history yesterday doesn't remember you today; it retrieves it again from scratch. That's fine for most tasks — and it keeps the architecture clean.
Part 3 covers the narrow cases where statelessness is the failure mode itself.
The reach-out patterns earn their cost only when the cheap tiers have demonstrably failed. Test the boring version first. If your boring version handles the task in one prompt, you didn't need RAG, you didn't need tools, and you didn't need a second pass.