The Smallest Agent That Works, Part 3: The Three Agents With State

The first two posts covered six architectures, all of them stateless. Each request runs from scratch, each session is independent, and the agent's behavior is fully determined by the inputs you pass in. That's a feature — stateless systems are easier to debug, scale, and roll back.

This post covers the cases where statelessness has stopped being a feature and become the bug. Three patterns, in increasing order of how much they change your architecture: memory across sessions, real-time environment manipulation, and self-learning over the agent's own behavior.

State is the most expensive capability

A blunt framing before the details: every kind of state you add doubles the operational surface. Memory means you now have a separate data layer with its own consistency guarantees, retention policy, and failure modes. Environment manipulation means you have an agent that can do things to the world, which means liability that statelessness avoided entirely. Self-learning means the agent in production today is not the agent in production tomorrow, which means your test suite covers a snapshot, not the system.

Pattern	What it persists	What it costs you	When statelessness wins
Memory	User preferences, prior conversations, learned facts	A data layer with consistency, retention, and recall failure modes	Sessions are independent; users don't expect continuity
Environment control	The world the agent is acting on	Physical or operational blast radius for every action	Actions are reversible or one-off
Self-learning	The agent's own behavior or weights	The "current agent" is a moving target your tests can't pin down	The base model's behavior is good enough; tuning happens offline

Each is the right answer in narrow conditions. Each is over-engineered for everything else.

1. Memory — when the user expects to be known

A memory-enhanced agent recalls prior interactions. The clearest use case: a personal assistant that learns your scheduling preferences over months, remembers which projects you're on, knows that "the migration" means a specific repo without you saying so. The agent gets cheaper to use over time because the user stops re-explaining context.

What it looks like in practice: every interaction emits structured facts (preferences, entities, decisions). A retrieval layer surfaces relevant facts at the start of each new session. The agent treats them as context. A retention policy decides when facts decay — the project you finished in 2024 shouldn't show up in next month's planning.
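A minimal sketch of that loop, assuming an in-memory store and a naive substring match standing in for a real retrieval layer. The names here (`Fact`, `MemoryStore`, the 90-day `ttl`) are illustrative assumptions, not part of any particular framework. The point is the shape: facts are written with a timestamp, reinforced when seen again, and dropped once they go unreinforced for longer than the retention window.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Fact:
    key: str             # e.g. "project:migration"
    value: str           # e.g. what "the migration" refers to
    last_seen: datetime  # refreshed whenever the fact is reinforced


@dataclass
class MemoryStore:
    # Hypothetical retention window: facts not reinforced within it are dropped.
    ttl: timedelta = timedelta(days=90)
    facts: dict[str, Fact] = field(default_factory=dict)

    def remember(self, key: str, value: str, now: datetime) -> None:
        """Store a new fact, or reinforce an existing one by refreshing its timestamp."""
        self.facts[key] = Fact(key, value, last_seen=now)

    def expire(self, now: datetime) -> None:
        """Apply the retention policy: keep only facts seen within the TTL."""
        self.facts = {
            k: f for k, f in self.facts.items() if now - f.last_seen <= self.ttl
        }

    def recall(self, query: str, now: datetime) -> list[Fact]:
        """Return live facts whose key or value mentions the query (naive match)."""
        self.expire(now)
        q = query.lower()
        return [
            f for f in self.facts.values()
            if q in f.key.lower() or q in f.value.lower()
        ]
```

At the start of a session the agent would call `recall` with terms from the new request and prepend whatever comes back to its context. A production version would likely swap the substring match for embedding search, but the decay rule stays the same.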

When not to add memory. Four situations where adding a memory layer goes wrong:

Sessions are genuinely independent. A support agent handling ten thousand unique tickets a day doesn't need to remember anything between them. Each ticket is fresh; treating them as a continuing conversation pollutes context.
The "memory" is one query away. If everything the agent needs is already in your CRM, the memory layer is duplicating data you already have. Retrieve it (Part 2's RAG pattern) and move on.
Users don't expect continuity. An agent that remembers an offhand remark from three weeks ago is uncanny, not helpful. Memory needs an explicit consent and visibility model: what's stored, what's shown, what's forgettable on request.
The retention policy is "everything forever." Memory without an expiry policy turns into a slow-leaking liability. The most useful memory layers have aggressive defaults — facts decay unless reinforced.
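The consent and visibility model can be sketched as three operations on the store. This is a rough illustration under assumed names (`VisibleMemory`, `consented`), not a complete privacy design: nothing is written without consent, the user can see exactly what is held, and any fact can be deleted on request.

```python
from dataclasses import dataclass, field


@dataclass
class VisibleMemory:
    consented: bool = False  # hypothetical per-user opt-in flag
    facts: dict[str, str] = field(default_factory=dict)

    def store(self, key: str, value: str) -> bool:
        """Write a fact only if the user has opted in. Returns True if stored."""
        if not self.consented:
            return False
        self.facts[key] = value
        return True

    def show(self) -> dict[str, str]:
        """Return a copy of everything stored, so the user can inspect it."""
        return dict(self.facts)

    def forget(self, key: str) -> bool:
        """Delete one fact on request. Returns True if something was removed."""
        return self.facts.pop(key, None) is not None
```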

The diagnostic: would your user notice if a session started with no recall? If the answer is "no, the task fits in one session," skip the memory layer.

2. Environment control — when the agent has to change the world

The tools tier from Part 2 let the agent reach out and call APIs. Environment control is the same idea at a different blast radius: the agent doesn't just call an API, it actively manipulates a system that other things depend on. Smart-home controllers, adaptive robotics, autonomous infrastructure agents that re-balance load in real time. The output isn't a string; it's a state change in the world.

What it looks like: an agent monitoring server load that scales a cluster up and down without human approval. A building-automation agent that adjusts HVAC based on occupancy. The agent is in a control loop with a physical or operational system.

The capability is real. The cost is the operational surface that comes with it.

When not to add environment control.

The actions are reversible only in theory. "We can roll back" doesn't help if rollback takes an hour and the agent took the action sixty seconds ago. Reversibility has to be operational, not theoretical.
The blast radius isn't bounded. An agent that can scale your cluster from 10 nodes to 1,000 also has a path to 100,000. Put hard ceilings on every action; review every ceiling quarterly.
The decision could be batched. If the "real-time" part is decorative — the decision could have been made every five minutes by a cron job — you don't need an agent in the loop. You need a scheduled report and a human approving changes.
The accountability story is unclear. An autonomous agent took an action; something went wrong; who's responsible? If you can't answer this in one sentence before going live, you're not ready for this tier.
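The hard-ceiling rule is easy to sketch. Assuming a cluster-scaling agent like the one described above, one option is to never pass the agent's requested node count straight to the infrastructure, and instead run it through a clamp that enforces both an absolute bound and a per-action step limit. The names and numbers below (`ScalingLimits`, `bounded_target`, 50 nodes, steps of 5) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingLimits:
    min_nodes: int = 2   # hypothetical floor so the service never scales to zero
    max_nodes: int = 50  # hard ceiling, the number you review on a schedule
    max_step: int = 5    # largest change allowed in a single action


def bounded_target(current: int, requested: int, limits: ScalingLimits) -> int:
    """Clamp the agent's requested node count to the step limit and absolute bounds."""
    # Limit how far one action can move the cluster in either direction.
    step = max(-limits.max_step, min(limits.max_step, requested - current))
    # Then enforce the floor and the ceiling on the result.
    return max(limits.min_nodes, min(limits.max_nodes, current + step))
```

With these limits, a request to jump from 10 nodes to 1,000 moves the cluster to 15, and no sequence of requests can take it past 50. The step limit also buys time: a runaway agent needs many actions to reach the ceiling, which gives a human a window to notice.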

The diagnostic: imagine the agent makes its worst plausible decision. What's the recovery? If the recovery is "human notices, human fixes," you're fine. If the recovery is "the system is in a bad state for hours and we're not sure how to back out," reconsider.

3. Self-learning — when the agent improves itself

The most ambitious tier. The agent's behavior changes based on its own experience. Reinforcement-style updates, evolutionary search over agent populations, online fine-tuning. The agent in production this week is not the agent that was in production last week — because it learned.

What it looks like: a swarm of trading agents whose strategies are tuned by their own outcomes. Neural architectures that mutate and select. Online learning systems that adapt to user feedback in near-real-time.

The capability is genuinely powerful in domains where the environment is dynamic, the optimum is moving, and offline tuning can't keep up. It's also genuinely dangerous in domains where it's not necessary.

When not to use self-learning.

You don't have a way to roll back to a previous version. Every self-learning system needs a checkpointing story. Without one, a bad update has no recovery.
Your reward signal is noisy or proxy-shaped. The agent will optimize the metric you gave it, not the outcome you wanted. If your reward function is "users gave thumbs up," the agent will optimize for thumbs up, including by being sycophantic. Reward design is the whole problem.
The environment is static enough that offline tuning would work. If the world isn't moving fast, batch your updates, evaluate them offline, ship them as discrete versions. Online learning is for genuinely non-stationary environments.
The gains from continuous improvement don't outweigh the cost of "the system is different today than yesterday." Most production systems are easier to operate when their behavior is pinned. Self-learning is for when pinning has become the bottleneck.
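Two of the points above, the checkpointing story and shipping updates as discrete evaluated versions, fit in one small sketch. It assumes the agent's tunable behavior can be represented as a dictionary of parameters and that you have some offline scoring function. `CheckpointedPolicy`, `promote`, and `rollback` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

Params = dict[str, float]


@dataclass
class CheckpointedPolicy:
    params: Params
    history: list[Params] = field(default_factory=list)  # previous versions, oldest first

    def promote(self, candidate: Params, evaluate: Callable[[Params], float]) -> bool:
        """Adopt the candidate only if it scores at least as well offline.

        The current parameters are checkpointed before being replaced.
        """
        if evaluate(candidate) < evaluate(self.params):
            return False
        self.history.append(dict(self.params))
        self.params = dict(candidate)
        return True

    def rollback(self, steps: int = 1) -> None:
        """Restore the version from `steps` promotions ago."""
        if steps < 1 or steps > len(self.history):
            raise ValueError("no checkpoint that far back")
        for _ in range(steps):
            self.params = self.history.pop()
```

The gate is only as good as `evaluate`, which is the reward-design problem again. The checkpoint list is what makes a bad promotion recoverable.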

The diagnostic: if a regulator asked you to explain why the agent took action X on day Y, could you? If the answer involves "well, the model had been updating itself for three weeks, so…", you're past the line where the capability earns its cost.
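One way to keep that question answerable is to record, with every action, which version of the agent took it. The sketch below assumes versions have stable identifiers (for instance, checkpoint names); `ActionRecord`, `AuditLog`, and the field names are illustrative, not a standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActionRecord:
    day: date
    action: str
    model_version: str  # identifier of the checkpoint that made the decision
    reason: str         # short, human-readable summary of what drove it


class AuditLog:
    def __init__(self) -> None:
        self._records: list[ActionRecord] = []

    def record(self, day: date, action: str, model_version: str, reason: str) -> None:
        """Append one immutable record per action taken."""
        self._records.append(ActionRecord(day, action, model_version, reason))

    def explain(self, action: str, day: date) -> list[ActionRecord]:
        """Return every record of the given action on the given day."""
        return [r for r in self._records if r.action == action and r.day == day]
```

If `explain` returns a version identifier you can still load and replay, the regulator question has an answer. If the version was overwritten by the next update, it does not.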

How to read this tier as a whole

The throughline of this series has been a single question: what's the smallest agent that produces the output you need? Part 1 said most tasks don't need any agency. Part 2 said many tasks that need agency don't need state. Part 3's harder claim is that even tasks that seem to need state usually don't — or need a much narrower kind of state than the architecture diagrams imply.

A pattern that holds across all three stateful tiers: the failure mode is invisible until production. Memory leaks privacy. Environment control accumulates blast radius. Self-learning drifts. None of these show up in a demo. All of them dominate the operational reality six months in.

Start with no state. Add the narrowest slice you can prove you need, and add it with explicit policies for retention, reversibility, and rollback. The capability hierarchy isn't a ladder to climb by default; climb it only when you've felt the floor break.
