Tech

GPT-5.5: The Benchmark Jump Isn't the Story. Tokens Per Task Is.

OpenAI shipped GPT-5.5 today. Everyone will quote the Terminal-Bench 2.0 jump from 75.1% to 82.7% — and miss the claim buried further down: significantly fewer tokens per Codex task, and state-of-the-art coding intelligence at half the cost of competitive frontier models. Here's which number to actually watch, and a decision rule for when to switch.

Initial Editor·2026-04-23·5min read·963 words·10 views

OpenAI shipped GPT-5.5 today, and everyone's going to quote the Terminal-Bench 2.0 number: 82.7%, up from 75.1% for GPT-5.4. That isn't the line that matters.

Further down the same announcement: GPT-5.5 "uses significantly fewer tokens to complete the same Codex tasks," and lands at "state-of-the-art intelligence at half the cost of competitive frontier coding models" on Artificial Analysis's Coding Agent Index. Tokens per task is the claim that reshapes your monthly bill. The headline score is noise by comparison.

The benchmark bump is real — and priced in

Terminal-Bench 2.0 moved from 75.1% to 82.7%. Expert-SWE went from 68.5% to 73.1%. These are respectable point-release gains, and OpenAI itself frames 5.5 as "a faster, sharper thinker for fewer tokens" — a refresh, not a new class of model.

If your decision rule is "does the model clear 80% on Terminal-Bench," switch. If your decision rule is anything else, the benchmark is informational at best. Frontier coding models have been clustered within a few points of each other for six months; a seven-point jump that arrives with a price increase doesn't change the ordering as much as the chart makes it look.

Tokens per task is the number that moves the bill

Reported per-token pricing is up from 5.4: $5 per million input, $30 per million output for GPT-5.5, and $30 / $180 for GPT-5.5 Pro. At face value that's a price increase. But the model also completes the same agentic task in fewer tokens, so the interesting comparison is cost-per-task, not cost-per-token — and it's the one you have to measure yourself on a job you actually run.

Metric What it tells you When it misleads
Price per 1M tokens List pricing only A "cheap" model that rambles costs more per task
Tokens per task Real serving cost on your harness Only comparable within the same agent loop
Cost per completed task The number you care about Requires running both models on the same eval

If you're shipping a Codex-style agent loop, row three is the only one that matters. Everything above it is a proxy, and cheap proxies have cost people real money.

"Half the cost of frontier coding models" is a framing, not a measurement

Artificial Analysis's Coding Agent Index plots intelligence against cost. OpenAI's claim is that GPT-5.5 sits on that curve at roughly half the cost of "competitive frontier coding models" for the same intelligence level. Three things to keep in mind:

  1. It's cost on that specific index, which aggregates a handful of coding evals. Your workload is not that index.
  2. "Competitive frontier" is a comparison set selected by OpenAI. Expect the cut to flatter the new model.
  3. The index uses list prices. Enterprise discounts, prompt caching, and batch pricing change the shape of the curve materially.

The claim is plausible and probably directionally correct. It is not a number you can paste into a budget spreadsheet.

What changes for your stack — a decision rule

Your current stack What GPT-5.5 changes Action
GPT-5.4 on agentic workflows (Codex, long tasks) Same harness, fewer tokens per task, better scores Switch a pilot repo this week; measure cost/task
Claude Code + Sonnet for daily coding No runtime change; different vendor entirely Don't switch reflexively. Test on your worst three prompts first.
GPT-5.4 for one-shot Q&A or short completions Marginal — gains concentrate on multi-step work Stay on 5.4 unless you have latency headroom
Mixed stack (Claude for code, GPT for ops) GPT-5.5 Pro is the candidate "ops brain" Pilot Pro on one long-running workflow; don't roll it out

The wins are concentrated in multi-step agentic tasks — scenarios where "fewer tokens to finish" compounds across tool calls. If your usage is short-turn chat or single-prompt completions, the upgrade barely pays its bill.

When not to switch yet

  • You're mid-project on a Claude Code setup that works. Tool-chain disruption costs more than a benchmark-point difference. Finish the thing you're shipping, then evaluate.
  • You don't have a cost-per-task dashboard. If you can't measure what switching costs or saves, you'll just have a different bill and no way to judge the trade.
  • Your workload leans heavily on cached prompts. OpenAI and Anthropic price prompt caching differently; a model that's cheaper per raw token can be more expensive per cached conversation.
  • You're considering the Pro tier. At $30 / $180 per million tokens, GPT-5.5 Pro is in the "only if your task actually needs it" bucket. Measure the intelligence gap on your task before paying 6x.
  • You rely on a third-party agent framework. Harness assumptions about tool-call shape and reasoning budget shift between model versions. Give your framework of choice a week to catch up before you blame the model for regressions.

Two signals worth watching over the next two weeks

Two data points will tell you more than the blog post:

  1. Artificial Analysis's cost-per-task curves once third parties have run 5.5 at scale. The "half the cost" claim will either hold up or erode — either outcome is informative.
  2. Codex cost dashboards from teams with heavy 5.4 usage. If the "fewer tokens" claim is real, their spend per merged PR should visibly drop inside the first billing cycle.

Until those land, "GPT-5.5 is 10% better" is true and mostly useless. The right response to a press release is not to upgrade. It's to queue a measurement.

The Terminal-Bench score is what OpenAI wants you to remember. Tokens per task is what your CFO will remember three months from now.

// more in tech

see all →
Tech· 2026-05-29· 5min

The Smallest Agent That Works, Part 3: The Three Agents With State

Stateless agents fit most tasks. State is the most expensive capability you can add — it doubles your operational surface, breaks your debugging, and rewards exactly the use cases that can't survive without it. Memory, environment control, self-learning. Part 3 of three.

#agent-architecture#ai-engineering#ai-agents#system-design
Tech· 2026-05-27· 5min

The Smallest Agent That Works, Part 2: The Three Reach-Out Agents

When the cheap tiers run out, the agent has to reach beyond the model itself — into knowledge it doesn't have, tools it can't natively use, or its own previous answer. RAG, tool use, and self-critique: three patterns, three failure modes worth pricing in. Part 2 of three.

#llm#rag#agent-architecture#ai-engineering
Tech· 2026-05-26· 5min

The Smallest Agent That Works, Part 1: The Three Cheap Agents

Most agent stacks are built one tier too capable for the job. Three of the cheapest architectures — a fixed pipeline, an LLM with rule constraints, and a reasoning loop — solve more problems than the architecture diagrams admit. Part 1 of three.

#llm#agent-architecture#ai-engineering#ai-agents
Tech· 2026-05-15· 5min

What MLX Got to Throw Away (That PyTorch Can't)

Every mature framework is a museum of decisions you can't take back. MLX is interesting mostly because it started after the decisions that matter for Apple Silicon were already mistakes — and the things it threw away are the things that were quietly costing the rest of us the most.

#ai-engineering#apple-silicon#mlx#ml-frameworks
Tech· 2026-05-15· 5min

The Unified-Memory Bet: Why On-Device Inference Stopped Being a Toy

For two years the industry's default answer to every inference question has been "bigger cluster." A different hardware topology is quietly making that the wrong default for a non-trivial slice of workloads — and the framework layer that earns it is the buzzword most decks haven't caught up with yet.

#hardware#ai-infrastructure#inference#edge-ai
Tech· 2026-05-14· 5min

Every Useful Skill Is One of Five Shapes

Skills aren't a freeform format. The useful ones fit one of five shapes — sequential workflow, multi-MCP coordination, iterative refinement, context-aware selection, domain-specific intelligence. Picking the right shape is most of the design work. Picking the wrong one is most of the bugs.

#claude-code#workflow#agents#skills