
The Image Model Started Thinking

On April 21, OpenAI shipped gpt-image-2, the first image model with a reasoning step in the middle of the generation loop. The 242-point Elo leap on LMArena is the headline number. It is not the story. The story is that image generation stopped being decoding and started being thinking, and the workflows built around the old assumption are about to invert.

Initial Editor · 2026-04-22 · 5 min read · 1,026 words

Until last Tuesday, every image generator was a decoder.

You gave it a prompt. It encoded the prompt into a latent representation. It ran a diffusion loop (or an autoregressive transformer, depending on the vendor) and at the end it handed you pixels. If the result was wrong, you prompted again. The model had no opinion about whether the output matched the intent. It had no concept of "opinion."

On April 21, OpenAI shipped gpt-image-2, and that pipeline broke. The new model reads the prompt, plans the layout, decides what text goes where, reasons about constraints, and can revise its own plan before emitting anything. It thinks, in the architectural sense — there is a reasoning step in the middle of the generation loop. It just happens to output images instead of tokens.

The 242-point Elo leap on LMArena is the headline number. It is not the story.

What "thinking" actually means inside an image model

OpenAI's framing is that gpt-image-2 is a "General Purpose Transformer built from the ground up for visual data." That is marketing language for something concrete: the same reasoning stack that produces step-by-step text in GPT-5-class models now sits upstream of the pixel decoder. Before any image gets drawn, the model has already committed to where the headline goes, how many items appear in the chart, which language each label should be in, and whether the composition honors the aspect ratio you asked for.
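To make "committing to a plan before drawing" concrete, here is a minimal sketch of what a layout plan could look like as data, with a check that runs before any pixels exist. Everything in it is hypothetical: `TextBox`, `LayoutPlan`, and `plan_problems` are illustrative names, and nothing here reflects OpenAI's actual internal format.

```python
from dataclasses import dataclass, field


@dataclass
class TextBox:
    """One piece of text the plan commits to before rendering."""
    text: str
    language: str   # e.g. "en", "ja", "hi"
    x: float        # left edge, as a fraction of canvas width (0..1)
    y: float        # top edge, as a fraction of canvas height (0..1)
    width: float    # fraction of canvas width
    height: float   # fraction of canvas height


@dataclass
class LayoutPlan:
    """Hypothetical plan object; not OpenAI's actual internal format."""
    aspect_ratio: tuple[int, int]
    boxes: list[TextBox] = field(default_factory=list)


def plan_problems(plan: LayoutPlan, requested_ratio: tuple[int, int]) -> list[str]:
    """Return human-readable problems in the plan; an empty list means none found."""
    problems = []
    if plan.aspect_ratio != requested_ratio:
        problems.append(
            f"aspect ratio {plan.aspect_ratio} != requested {requested_ratio}"
        )
    for box in plan.boxes:
        overflows = (
            box.x < 0 or box.y < 0
            or box.x + box.width > 1
            or box.y + box.height > 1
        )
        if overflows:
            problems.append(f"text {box.text!r} falls outside the canvas")
    return problems


plan = LayoutPlan(
    aspect_ratio=(16, 9),
    boxes=[
        TextBox("Quarterly results", "en", x=0.1, y=0.05, width=0.8, height=0.1),
        # This label starts at 70% of the width and is 40% wide, so it overflows.
        TextBox("売上", "ja", x=0.7, y=0.5, width=0.4, height=0.1),
    ],
)
print(plan_problems(plan, requested_ratio=(16, 9)))
```

The point of the sketch is only that a plan is something you can inspect and reject before rendering, which a pure prompt-to-pixels decoder never gave you.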

The practical consequence is that gpt-image-2 renders the kinds of images that require planning — the ones previous models failed at not because their decoders were weak, but because they were never asked to plan. A full slide with six data points and a title that wraps cleanly. An infographic whose arrows point at the right boxes. A storefront sign with Japanese, Hindi, and English side by side, each rendered in the correct script at the correct baseline.

TechCrunch called the text rendering "surprisingly good." The surprise is that text rendering was ever the bottleneck. It was not. Planning was.

The second-order effects show up in what you stop doing

Consider the workflows that the image-as-decoder era built around itself. You generated a rough image, then opened Figma or Photoshop to fix the typography. You generated a layout, then hand-placed the labels. You generated a multi-panel comic, then discarded the panels where the character's outfit drifted. You generated a chart, then rebuilt it in whichever tool could actually render numbers.

Every one of those workflows exists because the model was not doing the upstream reasoning. You were. The tool's job was to produce a plausible canvas; yours was to enforce the structure.

Before gpt-image-2 | After gpt-image-2
Generate, then open Figma to fix the typography | Generate, then review whether the layout reasoning was right
Generate, then hand-place labels on the diagram | Generate, then accept or reject the plan
Generate ten comic panels, discard the ones that drift | Generate ten planned panels, pick the strongest
Generate a chart image, rebuild it in a real charting tool | Generate a chart image, ship it

A model that plans before it draws removes the need for most of the patching. OpenAI is claiming close to 99% typography accuracy. Even if the real number is closer to 90%, the shape of the workflow inverts — you stop patching outputs and start reviewing reasoning. "Is this the right layout?" is a much shorter question than "how do I fix this layout?"
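How far apart 99% and 90% are in practice depends on what the accuracy figure measures, which the claim does not specify. As a rough worked example, assume it is a per-text-element rate and that errors are independent (both assumptions). The chance that a whole image needs no typography fix then compounds across its elements:

```python
def clean_image_rate(per_element_accuracy: float, text_elements: int) -> float:
    """Probability that every text element in one image renders correctly,
    assuming independent errors (an assumption, not a published figure)."""
    return per_element_accuracy ** text_elements


# Seven elements: a slide with six data points and a title.
for accuracy in (0.99, 0.90):
    rate = clean_image_rate(accuracy, text_elements=7)
    print(f"{accuracy:.0%} per element -> {rate:.1%} of slides need no fix")
# 99% per element -> 93.2% of slides need no fix
# 90% per element -> 47.8% of slides need no fix
```

If the figure is instead measured per image, no compounding applies and the quoted number is the clean-image rate directly.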

Generation of up to ten images per prompt amplifies the same shift. Ten variants of a planned composition is a design meeting. Ten variants of a decoder's best guess is a slot machine.

Why this is a step change, not a quarterly increment

Image models have been improving on a fairly predictable curve — sharper textures, better faces, marginally better hands. Incremental wins. The industry absorbed a new SOTA every few months and most of them were indistinguishable from the last one six weeks later.

gpt-image-2 is not on that curve. The capability jump is not from a better decoder; it is from a different architecture. The 242-point Elo gap on LMArena is an artifact of exactly that: the arena grades on outputs, and the outputs encode a reasoning stage the competition has not shipped. Leaderboards will compress again once Google and Anthropic ship their own planning-first image models. Until then, the rest of the arena is answering last year's question.
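For a sense of scale, a rating gap can be converted into an expected head-to-head preference rate. The sketch below uses the standard Elo logistic formula with its conventional 400-point scale, and assumes the arena score behaves like an ordinary Elo rating; the function name is illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected win share for the higher-rated side under the standard
    Elo logistic model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


print(f"{elo_expected_score(242):.3f}")  # 0.801
```

Under that assumption, a 242-point gap means the higher-rated model is preferred in roughly four out of five pairwise comparisons.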

There is a concern worth flagging. A reasoning model is still a model, and reasoning models hallucinate. gpt-image-2 will confidently plan a layout that is wrong. It will commit to a color scheme nobody asked for. It will cheerfully output an infographic with two subtly inconsistent numbers if the prompt leaves ambiguity. The failure mode of a thinking model is not the failure mode of a decoder — it is smoother, harder to spot, and easier to ship by accident.

Which means the next generation of tooling around image models is not better prompts. It is review.
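One small piece of what review tooling could look like is a check for the "two subtly inconsistent numbers" failure. This sketch assumes the label and value pairs have already been extracted from a generated image (by OCR or from the model's plan; the extraction step is not shown), and `inconsistent_labels` is a hypothetical helper rather than part of any existing tool.

```python
from collections import defaultdict


def inconsistent_labels(readings: list[tuple[str, float]]) -> dict[str, list[float]]:
    """Map each label that appears with more than one distinct value
    to the sorted list of values seen for it."""
    seen = defaultdict(set)
    for label, value in readings:
        # Normalise case and whitespace so "Revenue" and "revenue" match.
        seen[label.strip().lower()].add(value)
    return {
        label: sorted(values)
        for label, values in seen.items()
        if len(values) > 1
    }


# Hypothetical readings, as if extracted from one generated infographic.
readings = [
    ("Revenue", 4.2),   # figure in the headline
    ("Users", 1200.0),
    ("revenue", 4.3),   # same metric in the chart, slightly different
]
print(inconsistent_labels(readings))  # {'revenue': [4.2, 4.3]}
```

A check like this catches only exact-label conflicts; it says nothing about whether either number is right.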

The part the headlines got wrong

Most of the launch coverage framed gpt-image-2 as a text-rendering breakthrough — "finally, readable words in AI images." That misses the story. Readable words were a hard problem because the old architecture had no plan for where words went. The new architecture has a plan, which is why the words land. The breakthrough was not solving typography. It was changing what the model is thinking about when it begins.

Everything else — the multilingual scripts, the ten-image batches, the 2K resolution, the arena lead — is downstream of that single architectural choice. The image model started thinking. Everything after that follows.

The decoder era asked, "what should this look like?" gpt-image-2 asks, "what should this contain, and where?" — and then draws the answer. That is not a bigger model. It is a different job description.
