GPT-5.5 is OpenAI’s bid to win the assistant war on your phone

GPT-5.5 shows OpenAI wants the phone assistant market, not just benchmarks, with ChatGPT and Codex pushing agentic work into daily mobile use this year.

GPT-5.5 benchmark comparison showing coding and computer-use performance
Image: OpenAI

GPT-5.5 is the model OpenAI built to win the assistant war, not just the benchmark league, and TechCrunch confirmed on 23 April 2026 that it shipped to Plus, Pro, Business and Enterprise users in ChatGPT and Codex the same day. OpenAI calls GPT-5.5 its “smartest and most intuitive to use model yet” and frames it as the next step toward a “super app” that does your work for you. Strip away the marketing and the thesis is simpler and sharper: this is the first OpenAI release engineered around the model doing things rather than answering things — and that is exactly the capability your phone’s assistant has been missing.

Key facts
  • GPT-5.5 launched 23 April 2026, scoring a state-of-the-art 82.7% on Terminal-Bench 2.0 (up from GPT-5.4’s 75.1%) and 78.7% on OSWorld-Verified computer use (GPT-5.4: 75.0%).
  • API pricing is £4 (about $5) per million input tokens and £24 (about $30) per million output tokens, double GPT-5.4’s £2 (about $2.50) and £12 (about $15). GPT-5.5 Pro costs £24 (about $30) per million input tokens and £140 (about $180) per million output tokens. The model has a 1M-token context window.
  • GPT-5.5 Pro scored 33.2% on the GeneBench scientific benchmark; the standard model hit 84.9% on GDPval knowledge work and 58.6% on SWE-Bench Pro.
  • GPT-5.5 Instant became the new default free-tier ChatGPT model on 5 May 2026, putting the agentic engine into every phone running the app.
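
To put the pricing bullet in per-request terms, here is a minimal sketch that uses the dollar figures quoted above. The dictionary keys are illustrative labels rather than confirmed API model identifiers, and the 20,000-in, 2,000-out workload is a made-up example, not a measured one.

```python
# USD per million tokens (input, output), as quoted in the key facts.
# The keys are illustrative labels, not confirmed API model identifiers.
PRICES_USD_PER_MILLION = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}


def request_cost_usd(model, input_tokens, output_tokens):
    """Return the cost in USD of one request at the listed per-million rates."""
    input_price, output_price = PRICES_USD_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical agentic task: 20,000 tokens in, 2,000 tokens out.
for name in PRICES_USD_PER_MILLION:
    print(name, round(request_cost_usd(name, 20_000, 2_000), 4))
```

On that assumed workload the same task costs twice as much on GPT-5.5 as on GPT-5.4, and twelve times as much on GPT-5.5 Pro, which is why volume matters more than the headline rate.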

GPT-5.5 is a computer-use model wearing a chatbot’s clothes

The headline number for GPT-5.5 is not a trivia score. It is 78.7% on OSWorld-Verified, the benchmark that measures whether a model can actually operate a real computer environment — open apps, click menus, fill forms, recover when something goes wrong. GPT-5.4 managed 75.0%; the jump to 78.7% sounds modest until you realise these last percentage points are where autonomous task completion either works or quietly fails halfway through. Pair that with 82.7% on Terminal-Bench 2.0, a command-line benchmark demanding planning and tool coordination, and the shape of the release becomes obvious. OpenAI did not optimise GPT-5.5 to be a better talker. It optimised it to be a better doer.
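
One way to see why a few points matter is to look at the failure rate instead of the score. The sketch below is our own arithmetic on the published percentages, not a metric OpenAI reports, and the function name is hypothetical.

```python
def relative_failure_reduction(old_score_pct, new_score_pct):
    """Fraction by which the failure rate shrinks between two benchmark scores."""
    old_fail = 100.0 - old_score_pct
    new_fail = 100.0 - new_score_pct
    return (old_fail - new_fail) / old_fail


# Scores quoted in the article: GPT-5.4 first, GPT-5.5 second.
osworld = relative_failure_reduction(75.0, 78.7)
terminal_bench = relative_failure_reduction(75.1, 82.7)

print(f"OSWorld-Verified failures cut by {osworld:.1%}")
print(f"Terminal-Bench 2.0 failures cut by {terminal_bench:.1%}")
```

Read this way, the 3.7-point OSWorld gain removes roughly one in seven of the tasks GPT-5.4 failed, and the Terminal-Bench gain removes close to a third.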

The 1M-token context window is the unsung mobile feature

GPT-5.5 carries a one-million-token context window in the API, with a 128,000-token maximum output. On a desktop that means feeding a whole codebase or a quarter’s worth of documents in one shot. On a phone it means something subtler and arguably more important: an assistant that can hold the full thread of your day — your messages, your calendar, the document you were editing, the three tabs you left open — without losing the plot halfway through a task. The reason today’s mobile assistants feel forgetful is not personality. It is context budget. GPT-5.5 quietly removes that ceiling.
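
As a rough illustration of "context budget", the sketch below totals a day's worth of material against the quoted limits. The per-item token counts are invented for the example, and reserving the full 128,000-token output allowance inside the window is a conservative assumption, since the article does not say whether output counts against the 1M figure.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # API context window quoted in the article
MAX_OUTPUT_TOKENS = 128_000        # maximum output quoted in the article


def remaining_budget(item_tokens, reserve_for_output=MAX_OUTPUT_TOKENS):
    """Tokens left after loading every item and reserving room for the reply.

    Assumes, conservatively, that output shares the same window as input.
    """
    used = sum(item_tokens.values())
    return CONTEXT_WINDOW_TOKENS - reserve_for_output - used


# Hypothetical token estimates for one user's day; not measured values.
day = {
    "messages": 40_000,
    "calendar": 5_000,
    "document": 60_000,
    "open_tabs": 90_000,
}

print(remaining_budget(day))  # 677000
```

Even under those assumptions, most of the window is still free, which is the practical meaning of removing the ceiling.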

OpenAI also claims GPT-5.5 matches GPT-5.4’s per-token latency in real-world serving while operating at a much higher level of intelligence. That is the line that should worry competitors most. Historically, every capability jump cost speed, and on mobile, latency is the whole experience: an assistant that thinks for eight seconds is an assistant you stop using. Holding latency flat while pushing computer-use accuracy up is precisely the trade-off that makes an on-phone agent tolerable rather than a novelty you disable after a week.

Where GPT-5.5 still loses, and why that is fine

Let us be honest about the gaps, because OpenAI will not be. On SWE-Bench Pro, GPT-5.5’s 58.6% trails Anthropic’s Claude Opus 4.7 at 64.3%, and Tom’s Guide reported the model lost in all seven of its head-to-head test categories against the same Claude release. Critics also flagged a persistent tendency to hallucinate confidently rather than admit a knowledge gap — the oldest and most dangerous failure mode for anything you let act autonomously on your behalf. There was even a documented quirk where the model kept inserting goblins and gremlins into outputs until OpenAI filtered it out, which is funny until you imagine it inside an agent booking your travel.

But raw coding supremacy is not the metric that decides the mobile assistant war; reliable, low-latency action across everyday apps is. On the computer-use and knowledge-work benchmarks that actually map to “do my admin for me”, GPT-5.5 leads. Anthropic is winning the developer’s terminal; OpenAI is aiming at the other billion people who never open one. Both can be true. The doubled API price is the real cost question for businesses: £4 in and £24 out per million tokens against GPT-5.4’s £2 and £12, or roughly $5 and $30 against $2.50 and $15. It is the same calculus we covered when Claude for Small Business launched with QuickBooks and PayPal inside: capability is now cheap to access and expensive to run at scale.

The verdict: GPT-5.5 is the first credible agent your phone will run

GPT-5.5 is not the model that beats every rival on every chart, and anyone telling you it is has not read the SWE-Bench Pro line. It is something more strategically dangerous: the first frontier model deliberately tuned for agentic computer use, shipped free into the most-installed AI app on the planet, with a context window big enough to hold a working day and a latency profile that does not punish you for the privilege. For coders chasing the absolute top score, Claude Opus 4.7 still edges it. For the mobile assistant war — the fight over who runs the agent in your pocket — GPT-5.5 just moved OpenAI into the lead, and Apple and Google are now responding to OpenAI’s roadmap rather than setting their own.

Watch the ChatGPT app, not the benchmark tables, over the next two quarters. If OpenAI ships native computer-use actions to the mobile app on top of GPT-5.5 — and the bet looks even sharper once you read how OpenAI fast-tracked its own AI phone to 2027 — the question stops being “which assistant answers best” and becomes “which assistant actually finishes the job”, and right now OpenAI has the only honest claim to the second.
