
GPT-5.5 Arrives: OpenAI Narrowly Tops Claude Mythos Preview on Terminal-Bench 2.0
OpenAI Launches GPT-5.5, Reclaiming the Top Spot on a Key Agentic Benchmark
On April 24, 2026, OpenAI officially unveiled GPT-5.5 — internally codenamed "Spud" — its most capable and token-efficient model to date. The release arrives just one week after Anthropic launched Claude Opus 4.7 and roughly two weeks after Anthropic's more restricted frontier model, Claude Mythos Preview, set a new high-water mark on Terminal-Bench 2.0. According to OpenAI's official announcement, GPT-5.5 scores 82.7% on Terminal-Bench 2.0, narrowly edging Claude Mythos Preview's previously reported 82.0% on the same benchmark. If the comparison holds, that result hands OpenAI back the lead among generally available large language models on one of the industry's most closely watched agentic evaluations.
GPT-5.5 is rolling out immediately to Plus, Pro, Business, and Enterprise subscribers in ChatGPT and through Codex. Broader API access will follow once OpenAI finishes incorporating additional cybersecurity guardrails, the company said.
Benchmark Performance: What the Numbers Actually Show
Terminal-Bench 2.0 is a benchmark developed as a joint project between Stanford University and Laude Institute. It consists of 89 carefully curated tasks in computer terminal environments designed to test AI agents on complex, end-to-end command-line workflows that require planning, iteration, and tool coordination. When the benchmark paper was published on arXiv in January 2026, frontier models scored less than 65% on it — making the gains recorded over the following months substantial.
Prior to GPT-5.5's launch, the Terminal-Bench 2.0 leaderboard at llm-stats.com showed Claude Mythos Preview at the top with 82.0%, followed by GPT-5.3 Codex at 77.3% and GPT-5.4 at 75.1%. BenchLM.ai reported the same ordering. With GPT-5.5's claimed score of 82.7%, OpenAI asserts a narrow lead — though it is worth noting that benchmark comparability caveats apply: some analysts have flagged that OpenAI's self-reported Terminal-Bench scores use a different evaluation harness than those used for Anthropic's models, making direct one-to-one comparison difficult.
Beyond Terminal-Bench 2.0, OpenAI's official announcement reports the following GPT-5.5 scores across other evaluations:
- SWE-bench Pro (real-world GitHub issue resolution): 58.6%
- GDPval (agent performance across 44 occupations in knowledge work): 84.9%
- OSWorld-Verified (computer use tasks): 78.7%
- Tau2-bench Telecom (without prompt tuning): 98.0%
For context, Anthropic's system card for Claude Mythos Preview, as analyzed by Vellum.ai, places that model at 93.9% on SWE-bench Verified and 77.8% on SWE-bench Pro — notably higher than GPT-5.5's 58.6% on the Pro variant of that benchmark. The picture across evaluations is therefore mixed, and no single benchmark tells the full story.
OpenAI also states that GPT-5.5 matches GPT-5.4's per-token latency in real-world serving despite the jump in capability — a meaningful claim for enterprise deployments where speed and cost efficiency matter as much as raw performance.
Cybersecurity Capabilities, Safety Guardrails, and Restricted Access
One of the more consequential dimensions of GPT-5.5's release is how OpenAI is handling its cybersecurity capabilities. Under OpenAI's Preparedness Framework, GPT-5.5's biological and cybersecurity capabilities are rated "High" — a designation that triggered additional safeguards before launch. The model underwent safety evaluations including external testing and feedback from approximately 200 early-access partners. OpenAI is also offering specialized access through its "Trusted Access for Cyber" program for verified security professionals, and full API access remains gated pending the finalization of additional cybersecurity guardrails.
The cybersecurity framing puts GPT-5.5's release in direct conversation with Anthropic's Claude Mythos Preview, which Anthropic announced on April 7, 2026, as part of Project Glasswing. Anthropic described Mythos Preview as "strikingly capable at computer security tasks" and has declined to make it generally available. According to the UK AI Security Institute's evaluation, Claude Mythos Preview was the first AI model able to complete the institute's test simulating an attack that takes over a full network, succeeding end-to-end in 3 out of 10 attempts and completing an average of 22 out of 32 steps — compared to an average of 16 steps for Claude Opus 4.6, the next best model. The UK AI Security Institute also found that on expert-level CTF (Capture the Flag) tasks — which no model could complete before April 2025 — Mythos Preview succeeds 73% of the time.
Anthropic's Project Glasswing page further reports that Claude Mythos Preview identified thousands of zero-day vulnerabilities across every major operating system and every major web browser, including a 27-year-old vulnerability in OpenBSD. In 89% of 198 manually reviewed vulnerability reports, expert contractors agreed exactly with Mythos Preview's severity assessment, and 98% of assessments were within one severity level.
Newton Cheng, Anthropic's Frontier Red Team Cyber Lead, has been unambiguous about the company's position: "We do not plan to make Claude Mythos Preview generally available due to its cybersecurity capabilities."
A Microsoft spokesperson quoted on Anthropic's Project Glasswing page captured the broader stakes: "AI capabilities have crossed a threshold that fundamentally changes the urgency required to protect critical infrastructure from cyber threats, and there is no going back."
Pricing, Access, and the Enterprise Push
GPT-5.5's API pricing, as listed on OpenAI's official product page, is set at $5 per million input tokens and $30 per million output tokens, with a 1-million token context window. A GPT-5.5 Pro version carries a significantly higher price: $30 per million input tokens and $180 per million output tokens.
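Those per-million-token rates translate directly into job-level costs. As a minimal sketch, the following uses the published rates with an illustrative workload (the token counts are assumptions, not figures from OpenAI):

```python
# Estimate API cost from the published rates: GPT-5.5 at $5/M input and
# $30/M output tokens; GPT-5.5 Pro at $30/M input and $180/M output tokens.
PRICING = {
    "gpt-5.5":     {"input": 5.00,  "output": 30.00},   # USD per 1M tokens
    "gpt-5.5-pro": {"input": 30.00, "output": 180.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for one request or batch."""
    rates = PRICING[model]
    return (input_tokens / 1_000_000) * rates["input"] + \
           (output_tokens / 1_000_000) * rates["output"]

# Illustrative agentic job: 2M input tokens of context, 500k output tokens.
base_cost = estimate_cost("gpt-5.5", 2_000_000, 500_000)      # $10 + $15 = $25.00
pro_cost  = estimate_cost("gpt-5.5-pro", 2_000_000, 500_000)  # $60 + $90 = $150.00
print(f"GPT-5.5: ${base_cost:.2f}  GPT-5.5 Pro: ${pro_cost:.2f}")
```

On this hypothetical workload the Pro tier costs 6x the base tier, which illustrates why the two price points target different customer segments.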
The tiered pricing structure signals OpenAI's intent to serve both cost-conscious developers and high-throughput enterprise customers who need maximum capability. Nvidia's vice president of enterprise computing, Justin Boitano, told Axios that GPT-5.5 can act as a "chief of staff," helping power agents that are already acting as employees at Nvidia. Nvidia also says its new chips cut the per-token cost of running advanced AI like GPT-5.5 by up to 35x — a claim that, if it holds at scale, could meaningfully lower the barrier to enterprise adoption of frontier models.
According to Axios, OpenAI executives had previously described Anthropic's rise as a "code red" wake-up call that prompted a strategic pivot toward business customer adoption. The timing of GPT-5.5's release — one week after Anthropic's Claude Opus 4.7 launch — fits that competitive posture.
Context: Why This Moment in AI Competition Matters
The GPT-5.5 release lands at a moment when the gap between what AI models can do and what enterprise customers are prepared to deploy is narrowing rapidly. OpenAI states that GPT-5.5's strongest gains are in agentic coding, computer use, knowledge work, and early scientific research — precisely the domains where businesses are beginning to deploy AI not as a productivity assistant but as an autonomous actor.
Greg Brockman, OpenAI co-founder, described GPT-5.5 to Axios as a "faster, sharper thinker for fewer tokens" compared to GPT-5.4, noting that it can handle multi-step workflows more autonomously with less user input. That framing matters for enterprise customers who are building agentic pipelines where every token and every human intervention adds cost.
At the same time, both OpenAI and Anthropic are navigating genuinely new territory on the safety and policy side. GPT-5.5's "High" cybersecurity rating under OpenAI's Preparedness Framework and the gated API rollout reflect an awareness that the capabilities being released are not purely productivity tools. The fact that Anthropic continues to keep Claude Mythos Preview restricted — despite its benchmark performance — suggests the two leading AI labs are converging on a shared, if tacit, acknowledgment that certain capability thresholds warrant a different deployment calculus than previous model generations did.
Terminal-Bench 2.0, the benchmark at the center of this release's competitive narrative, was designed specifically to evaluate whether AI agents can handle the kind of multi-step, real-world terminal tasks that underpin software engineering, system administration, and research workflows. When it was published in January 2026, frontier models were below 65%. Four months later, two competing labs are both above 82% — a trajectory that has significant implications for how quickly agentic AI enters professional workflows.
What OpenAI and Industry Voices Are Saying
Greg Brockman offered two framing statements to Axios that capture OpenAI's broader ambition for GPT-5.5 beyond the benchmark competition:
"This is a new class of intelligence. It's a big step towards more agentic and intuitive computing."
"We are moving to a compute-powered economy."
Those statements, taken alongside Nvidia's hardware cost claims and the tiered enterprise pricing, sketch a picture of an AI industry that is actively working to make frontier-model deployment not just technically feasible but economically routine.
What Comes Next
Several near-term developments are worth watching. OpenAI has indicated that broader API access for GPT-5.5 will open once it finishes incorporating additional cybersecurity guardrails — the timeline for that has not been specified. The "Trusted Access for Cyber" program for verified security professionals is live at launch, but the criteria and scale of that program remain to be detailed.
On the Anthropic side, VentureBeat's coverage of the Claude Opus 4.7 launch confirmed that Anthropic continues to keep Claude Mythos Preview restricted to a small number of external enterprise partners for cybersecurity testing, with no announced plans for general availability. How long that restriction holds — and whether regulatory or competitive pressure eventually changes the calculus — remains an open question.
The benchmark race between GPT-5.5 and Claude Mythos Preview on Terminal-Bench 2.0 is narrow enough (82.7% vs. 82.0%) and subject to enough methodological caveats that the competitive picture could shift quickly with independent evaluation. Developers and enterprises evaluating these models for agentic workflows would be well served to run their own task-specific evaluations rather than relying solely on lab-reported scores.
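A task-specific evaluation of the kind suggested above need not be elaborate. The sketch below shows the basic shape of such a harness; `run_agent` is a hypothetical stand-in for whatever model or agent API is under test, and the toy tasks and checkers are purely illustrative, not drawn from any benchmark:

```python
# Minimal sketch of a task-specific evaluation harness: run each task's
# prompt through an agent, apply a pass/fail checker to the output, and
# report the aggregate pass rate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True if the agent's output passes

def evaluate(run_agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Run every task through the agent and return the fraction that pass."""
    passed = sum(1 for t in tasks if t.check(run_agent(t.prompt)))
    return passed / len(tasks)

# Toy example with a trivial "agent" standing in for a real model call.
tasks = [
    Task("Print the word ok", lambda out: out.strip() == "ok"),
    Task("Count files in an empty dir", lambda out: out.strip() == "0"),
]
toy_agent = lambda prompt: "ok" if "ok" in prompt else "0"
print(f"pass rate: {evaluate(toy_agent, tasks):.0%}")
```

Replacing the toy agent with real API calls against tasks drawn from one's own workflows gives a far more decision-relevant number than any lab-reported leaderboard score.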
Why This Matters for Your Productivity
The rapid capability gains reflected in GPT-5.5's benchmark scores are not just a story about competing AI labs — they are a preview of the tools that knowledge workers, researchers, and professionals will be using to manage complex, multi-step workflows in the near term. As AI agents move from assistants to autonomous actors in coding, research, and decision-support contexts, understanding which models are available, how they are priced, and what guardrails govern their use becomes directly relevant to how individuals and teams plan their work. At Moccet, we track developments like this because the intersection of AI capability and personal productivity is where meaningful gains in health, focus, and output are increasingly being made. Join the Moccet waitlist to stay ahead of the curve.