OpenAI shipped GPT-5.5 today, and everyone's going to quote the Terminal-Bench 2.0 number: 82.7%, up from 75.1% for GPT-5.4. That isn't the line that matters.
Further down the same announcement: GPT-5.5 "uses significantly fewer tokens to complete the same Codex tasks," and lands at "state-of-the-art intelligence at half the cost of competitive frontier coding models" on Artificial Analysis's Coding Agent Index. Tokens per task is the claim that reshapes your monthly bill. The headline score is noise by comparison.
The benchmark bump is real — and priced in
Terminal-Bench 2.0 moved from 75.1% to 82.7%. Expert-SWE went from 68.5% to 73.1%. These are respectable point-release gains, and OpenAI itself frames 5.5 as "a faster, sharper thinker for fewer tokens" — a refresh, not a new class of model.
If your decision rule is "does the model clear 80% on Terminal-Bench," switch. If your decision rule is anything else, the benchmark is informational at best. Frontier coding models have been clustered within a few points of each other for six months; a 7.6-point jump that arrives with a price increase doesn't change the ordering as much as the chart makes it look.
Tokens per task is the number that moves the bill
Reported per-token pricing is up from 5.4: $5 per million input, $30 per million output for GPT-5.5, and $30 / $180 for GPT-5.5 Pro. At face value that's a price increase. But the model also completes the same agentic task in fewer tokens, so the interesting comparison is cost-per-task, not cost-per-token — and it's the one you have to measure yourself on a job you actually run.
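To make the cost-per-task comparison concrete, here is a minimal sketch of the arithmetic. The GPT-5.5 prices are the ones quoted above; the `BASELINE` prices and every token count are hypothetical placeholders, so swap in your own invoice and trace data before reading anything into the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per one million tokens."""
    input_per_m: float
    output_per_m: float


def cost_per_task(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one task run at list price (no caching or batch discounts)."""
    return (
        input_tokens * pricing.input_per_m + output_tokens * pricing.output_per_m
    ) / 1_000_000


def breakeven_token_ratio(old: Pricing, new: Pricing,
                          input_tokens: int, output_tokens: int) -> float:
    """Fraction of the old token count the new model may use before it costs more.

    Assumes input and output tokens shrink by the same factor.
    """
    return cost_per_task(old, input_tokens, output_tokens) / cost_per_task(
        new, input_tokens, output_tokens
    )


GPT_5_5 = Pricing(input_per_m=5.0, output_per_m=30.0)   # prices quoted above
BASELINE = Pricing(input_per_m=2.5, output_per_m=15.0)  # HYPOTHETICAL placeholder

# HYPOTHETICAL token counts for one agentic task on each model.
baseline_cost = cost_per_task(BASELINE, input_tokens=400_000, output_tokens=60_000)
new_cost = cost_per_task(GPT_5_5, input_tokens=250_000, output_tokens=35_000)
ratio = breakeven_token_ratio(BASELINE, GPT_5_5, 400_000, 60_000)

print(f"baseline ${baseline_cost:.2f}  new ${new_cost:.2f}  break-even ratio {ratio:.2f}")
```

With these made-up inputs the new model uses roughly 40% fewer tokens and still costs more per task, because the placeholder per-token price doubled. The break-even ratio is the number to watch: it says how much of the old token count the new model can spend before the bill goes up.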
| Metric | What it tells you | When it misleads |
|---|---|---|
| Price per 1M tokens | List pricing only | A "cheap" model that rambles costs more per task |
| Tokens per task | Real serving cost on your harness | Only comparable within the same agent loop |
| Cost per completed task | The number you care about | Requires running both models on the same eval |
If you're shipping a Codex-style agent loop, row three is the only one that matters. Everything above it is a proxy, and cheap proxies have cost people real money.
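Row three is easy to get subtly wrong, because failed runs still cost money. A rough sketch of the calculation, assuming you log one record per task attempt with its cost and whether it finished; the `TaskRun` type and the sample numbers are illustrative, not taken from any real harness:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRun:
    """One attempt at a task (hypothetical log record)."""
    cost_usd: float
    completed: bool


def cost_per_completed_task(runs: Iterable[TaskRun]) -> float:
    """Total spend, failures included, divided by the number of completed tasks."""
    runs = list(runs)
    completed = sum(1 for run in runs if run.completed)
    if completed == 0:
        raise ValueError("no completed tasks; cost per completed task is undefined")
    return sum(run.cost_usd for run in runs) / completed


# HYPOTHETICAL eval results: same four tasks, two models.
cheap_but_flaky = [TaskRun(1.0, True), TaskRun(1.0, False),
                   TaskRun(1.0, True), TaskRun(1.0, False)]
pricier_but_reliable = [TaskRun(1.5, True)] * 4

print(cost_per_completed_task(cheap_but_flaky))       # spend on failures is counted
print(cost_per_completed_task(pricier_but_reliable))
```

In this toy data the model that is cheaper per run works out to $2.00 per completed task against $1.50 for the pricier one, which is exactly the kind of inversion the first two rows of the table hide.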
"Half the cost of frontier coding models" is a framing, not a measurement
Artificial Analysis's Coding Agent Index plots intelligence against cost. OpenAI's claim is that GPT-5.5 sits on that curve at roughly half the cost of "competitive frontier coding models" for the same intelligence level. Three things to keep in mind:
- It's cost on that specific index, which aggregates a handful of coding evals. Your workload is not that index.
- "Competitive frontier" is a comparison set selected by OpenAI. Expect the cut to flatter the new model.
- The index uses list prices. Enterprise discounts, prompt caching, and batch pricing change the shape of the curve materially.
The claim is plausible and probably directionally correct. It is not a number you can paste into a budget spreadsheet.
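The list-price caveat is worth a quick back-of-the-envelope check. The sketch below blends a list input price with a cache discount; the $5 figure is the GPT-5.5 input price quoted earlier, while the second model, the cache-hit fraction, and both discount rates are invented for illustration and do not describe any vendor's actual caching terms.

```python
def effective_input_price(list_price_per_m: float,
                          cached_fraction: float,
                          cached_discount: float) -> float:
    """Blended USD per 1M input tokens when part of the input is served from cache.

    cached_fraction: share of input tokens that hit the cache (0 to 1).
    cached_discount: price reduction on cached tokens (0.9 means 90% off).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    return list_price_per_m * (1.0 - cached_fraction * cached_discount)


# HYPOTHETICAL: 80% of input tokens are cache hits on both models.
model_a = effective_input_price(5.0, cached_fraction=0.8, cached_discount=0.5)
model_b = effective_input_price(6.0, cached_fraction=0.8, cached_discount=0.9)

print(f"model A: ${model_a:.2f} per 1M input tokens")
print(f"model B: ${model_b:.2f} per 1M input tokens")
```

Here the model with the higher list price ends up cheaper per effective input token once caching is counted. An index built on list prices cannot show that, which is why the curve may look different on your own traffic.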
What changes for your stack — a decision rule
| Your current stack | What GPT-5.5 changes | Action |
|---|---|---|
| GPT-5.4 on agentic workflows (Codex, long tasks) | Same harness, fewer tokens per task, better scores | Switch a pilot repo this week; measure cost/task |
| Claude Code + Sonnet for daily coding | Nothing in your current runtime; adopting it means changing vendors | Don't switch reflexively. Test on your worst three prompts first. |
| GPT-5.4 for one-shot Q&A or short completions | Marginal — gains concentrate on multi-step work | Stay on 5.4 unless you have latency headroom |
| Mixed stack (Claude for code, GPT for ops) | GPT-5.5 Pro is the candidate "ops brain" | Pilot Pro on one long-running workflow; don't roll it out broadly yet |
The wins are concentrated in multi-step agentic tasks — scenarios where "fewer tokens to finish" compounds across tool calls. If your usage is short-turn chat or single-prompt completions, the upgrade barely pays its bill.
When not to switch yet
- You're mid-project on a Claude Code setup that works. Tool-chain disruption costs more than a benchmark-point difference. Finish the thing you're shipping, then evaluate.
- You don't have a cost-per-task dashboard. If you can't measure what switching costs or saves, you'll just have a different bill and no way to judge the trade.
- Your workload leans heavily on cached prompts. OpenAI and Anthropic price prompt caching differently; a model that's cheaper per raw token can be more expensive per cached conversation.
- You're considering the Pro tier. At $30 / $180 per million tokens, GPT-5.5 Pro is in the "only if your task actually needs it" bucket. Measure the intelligence gap on your task before paying 6x the standard GPT-5.5 per-token price.
- You rely on a third-party agent framework. Harness assumptions about tool-call shape and reasoning budget shift between model versions. Give your framework of choice a week to catch up before you blame the model for regressions.
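For the Pro-tier question, one rough way to frame "does the task need it" is to ask what success rate Pro would have to reach to match the standard model on cost per completed task. The sketch assumes you retry a task until it succeeds, with independent attempts, so expected cost per completed task is cost per attempt divided by success rate. The 6x multiple comes from the prices above; the success rates and token ratio are hypothetical.

```python
def required_pro_success_rate(base_success_rate: float,
                              price_multiple: float = 6.0,
                              token_ratio: float = 1.0) -> float:
    """Success rate Pro needs to break even on cost per completed task.

    base_success_rate: fraction of attempts the standard model completes.
    price_multiple:    Pro per-token price divided by the standard price.
    token_ratio:       Pro tokens per attempt divided by standard tokens per attempt.

    A result above 1.0 means Pro cannot win on cost alone under these assumptions.
    """
    if not 0.0 < base_success_rate <= 1.0:
        raise ValueError("base_success_rate must be in (0, 1]")
    return base_success_rate * price_multiple * token_ratio


# HYPOTHETICAL: the standard model completes 60% of attempts on your task.
easy_task = required_pro_success_rate(0.6)
# HYPOTHETICAL: it completes 10%, and Pro needs 20% fewer tokens per attempt.
hard_task = required_pro_success_rate(0.1, token_ratio=0.8)

for name, needed in [("easy task", easy_task), ("hard task", hard_task)]:
    verdict = "cannot break even on cost" if needed > 1.0 else f"needs >= {needed:.0%}"
    print(f"{name}: Pro {verdict}")
```

The shape of the result is the useful part: on tasks the standard model already handles most of the time, a 6x price cannot pay for itself through retries saved. Pro only has a cost argument where the standard model fails most attempts. This ignores tasks the standard model cannot do at all, where the comparison is about capability and not cost.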
Two signals worth watching over the next two weeks
Two data points will tell you more than the blog post:
- Artificial Analysis's cost-per-task curves once third parties have run 5.5 at scale. The "half the cost" claim will either hold up or erode — either outcome is informative.
- Codex cost dashboards from teams with heavy 5.4 usage. If the "fewer tokens" claim is real, their spend per merged PR should visibly drop inside the first billing cycle.
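If you run your own Codex-style workload, the second signal is something you can compute from one billing export and one query against your Git host. A minimal sketch, with made-up spend and PR counts standing in for two billing cycles:

```python
def spend_per_merged_pr(total_spend_usd: float, merged_prs: int) -> float:
    """Model spend for a billing cycle divided by PRs merged in that cycle."""
    if merged_prs <= 0:
        raise ValueError("merged_prs must be positive")
    return total_spend_usd / merged_prs


def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after (negative means spend dropped)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# HYPOTHETICAL billing cycles: one on the old model, one on the new.
before = spend_per_merged_pr(total_spend_usd=12_000.0, merged_prs=400)
after = spend_per_merged_pr(total_spend_usd=9_000.0, merged_prs=400)

print(f"before ${before:.2f}/PR, after ${after:.2f}/PR, "
      f"change {relative_change(before, after):+.0%}")
```

Normalizing by merged PRs matters because raw spend also moves with team activity. A quiet month looks like a cheaper model unless you divide by the work that actually shipped.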
Until those land, "GPT-5.5 is 10% better" is true in relative terms on Terminal-Bench (82.7% against 75.1%) and mostly useless for budgeting. The right response to a press release is not to upgrade. It's to queue a measurement.
The Terminal-Bench score is what OpenAI wants you to remember. Tokens per task is what your CFO will remember three months from now.