The Image Model Started Thinking

Until last Tuesday, every image generator was a decoder.

You gave it a prompt. It encoded the prompt into a latent vector. It ran a diffusion loop (or an autoregressive transformer, depending on the vendor) and at the end it handed you pixels. If the result was wrong, you prompted again. The model had no opinion about whether the output matched the intent. It had no concept of "opinion."

On April 21, OpenAI shipped gpt-image-2, and that pipeline broke. The new model reads the prompt, plans the layout, decides what text goes where, reasons about constraints, and can revise its own plan before emitting anything. It thinks, in the architectural sense — there is a reasoning step in the middle of the generation loop. It just happens to output images instead of tokens.

The 242-point Elo leap on LMArena is the headline number. It is not the story.

What "thinking" actually means inside an image model

OpenAI's framing is that gpt-image-2 is a "General Purpose Transformer built from the ground up for visual data." That is marketing language for something concrete: the same reasoning stack that produces step-by-step text in GPT-5-class models now sits upstream of the pixel decoder. Before any image gets drawn, the model has already committed to where the headline goes, how many items appear in the chart, which language each label should be in, and whether the composition honors the aspect ratio you asked for.
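To make "committing to a plan before drawing" concrete, here is a minimal Python sketch. It does not reflect OpenAI's internals or API: `TextBox`, `LayoutPlan`, and `check_plan` are hypothetical names, and the three checks (aspect ratio, item count, text staying on the canvas) are assumptions about what a planning stage might fix before a decoder runs.

```python
from dataclasses import dataclass, field


@dataclass
class TextBox:
    """One piece of text the plan commits to (hypothetical structure)."""
    text: str
    language: str  # e.g. "en", "ja", "hi"
    x: float       # left edge, as a fraction of canvas width
    y: float       # top edge, as a fraction of canvas height
    w: float       # width, as a fraction of canvas width
    h: float       # height, as a fraction of canvas height


@dataclass
class LayoutPlan:
    """What a planning stage might decide before any pixels are drawn."""
    width: int
    height: int
    headline: TextBox
    labels: list[TextBox] = field(default_factory=list)


def check_plan(plan, aspect_ratio, expected_items):
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    if abs(plan.width / plan.height - aspect_ratio) > 0.01:
        problems.append("aspect ratio mismatch")
    if len(plan.labels) != expected_items:
        problems.append(
            f"expected {expected_items} labels, got {len(plan.labels)}"
        )
    for box in [plan.headline, *plan.labels]:
        inside = (
            box.x >= 0 and box.y >= 0
            and box.x + box.w <= 1 and box.y + box.h <= 1
        )
        if not inside:
            problems.append(f"text box out of bounds: {box.text!r}")
    return problems


# Example: a 16:9 slide that was asked for six labels but planned only two,
# one of which runs off the right edge of the canvas.
plan = LayoutPlan(
    width=1600,
    height=900,
    headline=TextBox("Q3 Revenue", "en", 0.1, 0.05, 0.8, 0.1),
    labels=[
        TextBox("North", "en", 0.1, 0.3, 0.2, 0.05),
        TextBox("South", "en", 0.95, 0.3, 0.2, 0.05),
    ],
)
for problem in check_plan(plan, aspect_ratio=16 / 9, expected_items=6):
    print(problem)
```

The point of the sketch is the ordering: every check runs against the plan, so a bad layout can be rejected or revised before any rendering cost is paid.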

The practical consequence is that gpt-image-2 renders the kinds of images that require planning — the ones previous models failed at not because their decoders were weak, but because they were never asked to plan. A full slide with six data points and a title that wraps cleanly. An infographic whose arrows point at the right boxes. A storefront sign with Japanese, Hindi, and English side by side, each rendered in the correct script at the correct baseline.

TechCrunch called the text rendering "surprisingly good." The surprise is that text rendering was ever the bottleneck. It was not. Planning was.

The second-order effects show up in what you stop doing

Consider the workflows that the image-as-decoder era built around itself. You generated a rough image, then opened Figma or Photoshop to fix the typography. You generated a layout, then hand-placed the labels. You generated a multi-panel comic, then discarded the panels where the character's outfit drifted. You generated a chart, then rebuilt it in whichever tool could actually render numbers.

Every one of those workflows exists because the model was not doing the upstream reasoning. You were. The tool's job was to produce a plausible canvas; yours was to enforce the structure.

| Before `gpt-image-2` | After `gpt-image-2` |
| --- | --- |
| Generate, then open Figma to fix the typography | Generate, then review whether the layout reasoning was right |
| Generate, then hand-place labels on the diagram | Generate, then accept or reject the plan |
| Generate ten comic panels, discard the ones that drift | Generate ten planned panels, pick the strongest |
| Generate a chart image, rebuild it in a real charting tool | Generate a chart image, ship it |

A model that plans before it draws removes the need for most of the patching. OpenAI is claiming close to 99% typography accuracy. Even if the real number is closer to 90%, the shape of the workflow inverts — you stop patching outputs and start reviewing reasoning. "Is this the right layout?" is a much shorter question than "how do I fix this layout?"

Generation of up to ten images per prompt amplifies the same shift. Ten variants of a planned composition is a design meeting. Ten variants of a decoder's best guess is a slot machine.

Why this is a step change, not a quarterly increment

Image models have been improving on a fairly predictable curve — sharper textures, better faces, marginally better hands. Incremental wins. The industry absorbed a new SOTA every few months and most of them were indistinguishable from the last one six weeks later.

gpt-image-2 is not on that curve. The capability jump is not from a better decoder; it is from a different architecture. The 242-point Elo gap on LMArena is an artifact of exactly that: the arena grades on outputs, and the outputs encode a reasoning stage the competition has not shipped. Leaderboards will compress again once Google and Anthropic ship their own planning-first image models. Until then, the rest of the arena is answering last year's question.

There is a concern worth flagging. A reasoning model is still a model, and reasoning models hallucinate. gpt-image-2 will confidently plan a layout that is wrong. It will commit to a color scheme nobody asked for. It will cheerfully output an infographic with two subtly inconsistent numbers if the prompt leaves ambiguity. The failure mode of a thinking model is not the failure mode of a decoder — it is smoother, harder to spot, and easier to ship by accident.

Which means the next generation of tooling around image models is not better prompts. It is review.
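One form that review could take is an automated consistency pass over the text in a generated infographic. The Python sketch below is illustrative only: it assumes the labels have already been recovered as `(name, text)` pairs (from OCR or from a plan the model exposes, neither of which is confirmed here), and `extract_claims` and `find_conflicts` are hypothetical helpers. It flags any quantity that is stated with more than one distinct number.

```python
import re
from collections import defaultdict

# Matches plain decimal numbers such as "120" or "4.2". It ignores units,
# thousands separators, and signs, so treat it as a rough first pass.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_claims(labels):
    """Map each label name to the set of numeric values stated for it."""
    claims = defaultdict(set)
    for name, text in labels:
        for match in NUMBER.findall(text):
            claims[name].add(float(match))
    return claims


def find_conflicts(labels):
    """Return the names that appear with more than one distinct number."""
    return {
        name: sorted(values)
        for name, values in extract_claims(labels).items()
        if len(values) > 1
    }


# Example: the headline and a callout disagree about revenue.
labels = [
    ("revenue", "Revenue: 4.2M"),
    ("revenue", "Revenue grew to 4.5M"),
    ("users", "Users: 120"),
]
print(find_conflicts(labels))  # → {'revenue': [4.2, 4.5]}
```

A check like this catches only the crudest inconsistencies, but it targets the failure mode described above: output that looks finished and is wrong in a way a quick glance will not catch.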

The part the headlines got wrong

Most of the launch coverage framed gpt-image-2 as a text-rendering breakthrough — "finally, readable words in AI images." That misses the story. Readable words were a hard problem because the old architecture had no plan for where words went. The new architecture has a plan, which is why the words land. The breakthrough was not solving typography. It was changing what the model is thinking about when it begins.

Everything else — the multilingual scripts, the ten-image batches, the 2K resolution, the arena lead — is downstream of that single architectural choice. The image model started thinking. Everything after that follows.

The decoder era asked, "what should this look like?" gpt-image-2 asks, "what should this contain, and where?" — and then draws the answer. That is not a bigger model. It is a different job description.
