GPT-5.5: The Benchmark Jump Isn't the Story. Tokens Per Task Is.

OpenAI shipped GPT-5.5 today. Everyone will quote the Terminal-Bench 2.0 jump from 75.1% to 82.7% — and miss the claim buried further down: significantly fewer tokens per Codex task, and state-of-the-art coding intelligence at half the cost of competitive frontier models. Here's which number to actually watch, and a decision rule for when to switch.

Initial Editor · 2026-04-23 · 5 min read

OpenAI shipped GPT-5.5 today, and everyone's going to quote the Terminal-Bench 2.0 number: 82.7%, up from 75.1% for GPT-5.4. That isn't the line that matters.

Further down the same announcement: GPT-5.5 "uses significantly fewer tokens to complete the same Codex tasks," and lands at "state-of-the-art intelligence at half the cost of competitive frontier coding models" on Artificial Analysis's Coding Agent Index. Tokens per task is the claim that reshapes your monthly bill. The headline score is noise by comparison.

The benchmark bump is real — and priced in

Terminal-Bench 2.0 moved from 75.1% to 82.7%. Expert-SWE went from 68.5% to 73.1%. These are respectable point-release gains, and OpenAI itself frames 5.5 as "a faster, sharper thinker for fewer tokens" — a refresh, not a new class of model.

If your decision rule is "does the model clear 80% on Terminal-Bench," switch. If your decision rule is anything else, the benchmark is informational at best. Frontier coding models have been clustered within a few points of each other for six months; a seven-point jump that arrives with a price increase doesn't change the ordering as much as the chart makes it look.

Tokens per task is the number that moves the bill

Reported per-token pricing is up from 5.4: $5 per million input, $30 per million output for GPT-5.5, and $30 / $180 for GPT-5.5 Pro. At face value that's a price increase. But the model also completes the same agentic task in fewer tokens, so the interesting comparison is cost-per-task, not cost-per-token — and it's the one you have to measure yourself on a job you actually run.
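A minimal sketch of that comparison, assuming you can pull total input and output tokens for one task from your own logs. The GPT-5.5 prices are the ones reported above; `OLD_MODEL` and every token count are made-up placeholders, since this post doesn't list GPT-5.4's rates. Substitute your real rate card and measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per one million tokens."""
    input_per_m: float
    output_per_m: float


def cost_per_task(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one task, given the tokens billed across the whole run."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


def breakeven_token_ratio(old: Pricing, new: Pricing,
                          input_tokens: int, output_tokens: int) -> float:
    """Fraction of the old token count the new model may use and still cost the same.

    Assumes the input/output mix stays the same, which is a simplification.
    """
    return (cost_per_task(old, input_tokens, output_tokens)
            / cost_per_task(new, input_tokens, output_tokens))


# GPT-5.5 list prices as reported in this post.
GPT_55 = Pricing(input_per_m=5.0, output_per_m=30.0)
# HYPOTHETICAL placeholder, not a real GPT-5.4 price.
OLD_MODEL = Pricing(input_per_m=2.5, output_per_m=15.0)

if __name__ == "__main__":
    # Hypothetical token counts for one agentic task on the old model.
    old_in, old_out = 400_000, 20_000
    ratio = breakeven_token_ratio(OLD_MODEL, GPT_55, old_in, old_out)
    print(f"old model cost/task: ${cost_per_task(OLD_MODEL, old_in, old_out):.2f}")
    print(f"GPT-5.5 breaks even at {ratio:.0%} of the old token count")
```

With these placeholder numbers the new model has to finish the task in half the tokens just to break even. The real threshold depends entirely on the rates you actually pay.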

Metric	What it tells you	When it misleads
Price per 1M tokens	List pricing only	A "cheap" model that rambles costs more per task
Tokens per task	Real serving cost on your harness	Only comparable within the same agent loop
Cost per completed task	The number you care about	Requires running both models on the same eval

If you're shipping a Codex-style agent loop, row three is the only one that matters. Everything above it is a proxy, and cheap proxies have cost people real money.
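One way to compute row three, sketched under the assumption that your harness logs a cost and a pass/fail outcome for every run. The `Run` record and the sample figures are hypothetical. The point it illustrates: failed runs still bill, so they belong in the numerator but not the denominator.

```python
from typing import Iterable, NamedTuple


class Run(NamedTuple):
    """One agent run as logged by a hypothetical eval harness."""
    cost_usd: float
    completed: bool


def cost_per_completed_task(runs: Iterable[Run]) -> float:
    """Total spend (including failed runs) divided by the number of completed tasks."""
    runs = list(runs)
    total_spend = sum(r.cost_usd for r in runs)
    completed = sum(1 for r in runs if r.completed)
    if completed == 0:
        raise ValueError("no completed tasks; cost per completed task is undefined")
    return total_spend / completed


if __name__ == "__main__":
    # Hypothetical results from running one model on the same four-task eval.
    sample = [Run(1.0, True), Run(1.0, True), Run(2.0, False), Run(1.0, True)]
    print(f"cost per completed task: ${cost_per_completed_task(sample):.2f}")
```

Run the same eval through both models and compare the two results; that is the only like-for-like number in the table.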

"Half the cost of frontier coding models" is a framing, not a measurement

Artificial Analysis's Coding Agent Index plots intelligence against cost. OpenAI's claim is that GPT-5.5 sits on that curve at roughly half the cost of "competitive frontier coding models" for the same intelligence level. Three things to keep in mind:

It's cost on that specific index, which aggregates a handful of coding evals. Your workload is not that index.
"Competitive frontier" is a comparison set selected by OpenAI. Expect the cut to flatter the new model.
The index uses list prices. Enterprise discounts, prompt caching, and batch pricing change the shape of the curve materially.

The claim is plausible and probably directionally correct. It is not a number you can paste into a budget spreadsheet.
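To see why a list price doesn't survive contact with a real contract, here is a rough sketch of an effective input price once prompt caching and a negotiated discount are applied. The cache hit fraction, cached-token multiplier, and discount are illustrative assumptions, not any vendor's actual terms.

```python
def effective_input_price(list_price_per_m: float,
                          cached_fraction: float,
                          cached_price_multiplier: float,
                          contract_discount: float = 0.0) -> float:
    """Blended USD price per 1M input tokens.

    cached_fraction: share of input tokens served from the prompt cache (0..1).
    cached_price_multiplier: cached-token price as a fraction of list price (assumed).
    contract_discount: negotiated discount applied on top (0..1, assumed).
    """
    for name, value in (("cached_fraction", cached_fraction),
                        ("contract_discount", contract_discount)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    blended = list_price_per_m * (
        (1.0 - cached_fraction) + cached_fraction * cached_price_multiplier
    )
    return blended * (1.0 - contract_discount)


if __name__ == "__main__":
    # $5 list price from this post; the other inputs are hypothetical.
    price = effective_input_price(5.0, cached_fraction=0.8, cached_price_multiplier=0.1)
    print(f"effective input price: ${price:.2f} per 1M tokens")
```

Two models with the same list price can land far apart once these terms differ, which is why the index position is a starting point and not a budget line.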

What changes for your stack — a decision rule

Your current stack	What GPT-5.5 changes	Action
GPT-5.4 on agentic workflows (Codex, long tasks)	Same harness, fewer tokens per task, better scores	Switch a pilot repo this week; measure cost/task
Claude Code + Sonnet for daily coding	Nothing in your runtime; adopting it means changing vendors	Don't switch reflexively. Test on your worst three prompts first.
GPT-5.4 for one-shot Q&A or short completions	Marginal — gains concentrate on multi-step work	Stay on 5.4 unless you have latency headroom
Mixed stack (Claude for code, GPT for ops)	GPT-5.5 Pro is the candidate "ops brain"	Pilot Pro on one long-running workflow; hold off on a broad rollout

The wins are concentrated in multi-step agentic tasks — scenarios where "fewer tokens to finish" compounds across tool calls. If your usage is short-turn chat or single-prompt completions, the upgrade barely pays its bill.
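A toy model of that compounding, assuming a harness that resends the full conversation history as input on every call, with no prompt caching or context compaction. The function name and all token counts are hypothetical.

```python
def total_input_tokens(base_context: int, tokens_added_per_step: int, steps: int) -> int:
    """Input tokens billed over an agent run that resends its whole history each call.

    base_context: system prompt plus task description, sent on every call.
    tokens_added_per_step: model output plus tool result appended after each call.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                  # the entire context is billed as input
        context += tokens_added_per_step  # history grows before the next call
    return total


if __name__ == "__main__":
    # Hypothetical run shapes: same task, different verbosity and step counts.
    print(total_input_tokens(2_000, 1_000, 20))  # long, chatty run: 230000
    print(total_input_tokens(2_000, 1_000, 10))  # half the steps: 65000
    print(total_input_tokens(2_000, 800, 10))    # terser steps too: 56000
```

Under these assumptions, billed input grows roughly with the square of the step count, so a model that finishes in fewer, terser steps saves more than its per-step reduction suggests. A single-turn completion has no history to resend, which is why the effect vanishes there.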

When not to switch yet

You're mid-project on a Claude Code setup that works. Tool-chain disruption costs more than a benchmark-point difference. Finish the thing you're shipping, then evaluate.
You don't have a cost-per-task dashboard. If you can't measure what switching costs or saves, you'll just have a different bill and no way to judge the trade.
Your workload leans heavily on cached prompts. OpenAI and Anthropic price prompt caching differently; a model that's cheaper per raw token can be more expensive per cached conversation.
You're considering the Pro tier. At $30 / $180 per million tokens, GPT-5.5 Pro is in the "only if your task actually needs it" bucket. Measure the intelligence gap on your task before paying 6x the standard GPT-5.5 rate.
You rely on a third-party agent framework. Harness assumptions about tool-call shape and reasoning budget shift between model versions. Give your framework of choice a week to catch up before you blame the model for regressions.

Two signals worth watching over the next two weeks

Two data points will tell you more than the blog post:

Artificial Analysis's cost-per-task curves once third parties have run 5.5 at scale. The "half the cost" claim will either hold up or erode — either outcome is informative.
Codex cost dashboards from teams with heavy 5.4 usage. If the "fewer tokens" claim is real, their spend per merged PR should visibly drop inside the first billing cycle.
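If you want the second signal from your own data, a small sketch like the following would do, assuming you can export daily spend and merged-PR counts tagged by model. The `DayRecord` shape, the model labels, and the sample rows are hypothetical; adapt them to whatever your billing export provides.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class DayRecord(NamedTuple):
    """One day of usage from a hypothetical billing export."""
    model: str
    spend_usd: float
    merged_prs: int


def spend_per_merged_pr(records: Iterable[DayRecord]) -> Dict[str, float]:
    """Total spend divided by total merged PRs, grouped by model.

    Models with zero merged PRs are omitted because the ratio is undefined.
    """
    spend: Dict[str, float] = defaultdict(float)
    merged: Dict[str, int] = defaultdict(int)
    for r in records:
        spend[r.model] += r.spend_usd
        merged[r.model] += r.merged_prs
    return {m: spend[m] / merged[m] for m in spend if merged[m] > 0}


if __name__ == "__main__":
    # Hypothetical rows; real data should cover a full billing cycle per model.
    rows = [
        DayRecord("gpt-5.4", 120.0, 10),
        DayRecord("gpt-5.4", 80.0, 10),
        DayRecord("gpt-5.5", 90.0, 12),
    ]
    for model, usd in sorted(spend_per_merged_pr(rows).items()):
        print(f"{model}: ${usd:.2f} per merged PR")
```

Compare like with like: the same repos and the same kind of work in both periods, or the ratio will mostly reflect what the team happened to be shipping.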

Until those land, "GPT-5.5 is 10% better" (82.7% against 75.1% on Terminal-Bench 2.0 is roughly a 10% relative gain) is true and mostly useless. The right response to a press release is not to upgrade. It's to queue a measurement.

The Terminal-Bench score is what OpenAI wants you to remember. Tokens per task is what your CFO will remember three months from now.
