OpenAI is moving fast enough these days that it teased its next model on the same afternoon it launched its last one. Just two days ago, the company released GPT-5.3 Instant.
On Thursday, it shipped GPT-5.4 — a considerably more substantial release, and one that arrives amid an unusually turbulent moment for the company, whose deal with the US Department of Defense has triggered a wave of user cancellations and a public spat with Anthropic's CEO.
The model itself, at least, is a genuine step forward. OpenAI is positioning GPT-5.4 as “our most capable and efficient frontier model for professional work,” and has released it in three configurations: a standard version for general use, GPT-5.4 Thinking for tasks that benefit from extended chain-of-thought reasoning, and GPT-5.4 Pro for the highest-demand workloads.
In ChatGPT, Thinking is available to Plus, Team, and Pro subscribers starting today, replacing GPT-5.2 Thinking. Pro is reserved for the $200-per-month ChatGPT Pro and Enterprise tiers.
The benchmark story is striking.
On GDPval, OpenAI's internal evaluation measuring performance on knowledge work tasks across 44 occupations, from legal analysis to financial modelling, GPT-5.4 matched or exceeded industry professionals in 83% of comparisons, up from 70.9% for GPT-5.2.
On OSWorld-Verified, which measures a model's ability to navigate a desktop environment using screenshots and keyboard and mouse input, GPT-5.4 hit a 75% success rate, ahead of the reported human performance benchmark of 72.4%, and a substantial jump from GPT-5.2's 47.3%.
It also claimed the top position on Mercor's APEX-Agents benchmark, designed to evaluate agents on sustained professional tasks across investment banking, consulting, and corporate law.
On hallucinations, OpenAI reports that individual factual claims are 33% less likely to be incorrect compared to GPT-5.2, and that overall responses are 18% less likely to contain errors.
These figures are self-reported, and benchmark comparisons are against GPT-5.2 rather than the more recent GPT-5.3 — a pattern worth noting when reading the headline numbers.
Computer use and the 1-million-token window
The most consequential new capability is native computer use in Codex and the API. GPT-5.4 is the first general-purpose OpenAI model with this built in, allowing agents to operate software, navigate file systems, and carry out multi-step workflows across applications: the kind of behaviour previously associated with specialised agentic frameworks layered on top of models.
For developers building automation pipelines, the significance is less about demos and more about reliability: a general-purpose model that handles computer interaction natively removes one category of integration complexity.
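The shape of that integration is an observe-act loop: the model receives the current screen state, proposes an action, the harness executes it, and the cycle repeats until the task is done. The sketch below is a minimal, self-contained illustration of that loop with a stubbed model in place of a real API call; the `Action` type, the `stub_model` function, and the simulated "screen" are all assumptions for illustration, not OpenAI's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    payload: str = ""

def stub_model(screenshot: str, goal: str) -> Action:
    # Stand-in for the model call. A real computer-use agent would
    # send the screenshot to the API and parse the returned action.
    if goal not in screenshot:
        return Action("type", goal)
    return Action("done")

def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    screen = ""        # simulated desktop state instead of a real screenshot
    trace = []
    for _ in range(max_steps):
        action = stub_model(screen, goal)
        trace.append(action.kind)
        if action.kind == "done":
            break
        if action.kind == "type":
            screen += action.payload   # apply the action to the environment
    return trace
```

The point of the structure is that the loop, not any single call, is the unit of reliability: a natively capable model shortens the loop and removes the translation layer a separate agentic framework would otherwise provide.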
The API version also supports context windows up to 1 million tokens, more than double the 400,000 available in GPT-5.3, and the largest OpenAI has shipped.
For organisations dealing with sprawling document sets, long codebases, or multi-quarter financial records, keeping the full context in-window rather than relying on retrieval workarounds is a genuine practical advantage.
It is worth noting, though, that the 1-million-token window comes with a pricing caveat: OpenAI charges double the standard rate per million tokens once input exceeds 272,000 tokens. Google's Gemini 3.1 Pro, by comparison, offers a 2-million-token context at a lower base price.
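The practical effect of that pricing structure is easy to model: input below the 272,000-token threshold bills at the base rate, and input above it at double. The helper below sketches the arithmetic; the base rate used here is a placeholder, not OpenAI's published price.

```python
def input_cost(tokens: int, base_per_m: float = 1.25,
               threshold: int = 272_000, multiplier: float = 2.0) -> float:
    """Estimated input cost in dollars for one request.

    Tokens up to `threshold` bill at `base_per_m` dollars per million;
    tokens above it bill at `base_per_m * multiplier`. The default rate
    is illustrative only.
    """
    cheap = min(tokens, threshold)
    expensive = max(tokens - threshold, 0)
    return (cheap * base_per_m + expensive * base_per_m * multiplier) / 1_000_000
```

At these placeholder rates, a full 1-million-token request costs well over twice a 272,000-token one, which is why "can fit the whole codebase in context" and "should fit the whole codebase in context" are separate questions.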
A secondary efficiency improvement is worth attention for developers. The new Tool Search system changes how API calls handle tool definitions.
Previously, every call included the full specification for all available tools upfront, a practice that could add tens of thousands of tokens to each request as tool ecosystems grew.
Under the new system, the model retrieves tool definitions on demand when it needs them. In internal testing using 250 tasks across 36 MCP servers, OpenAI reported a 47% reduction in total token usage. For developers running large agentic systems with many integrations, that translates directly into lower costs and faster responses.
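The saving comes from a simple change in what each request carries. A toy accounting makes the trade-off concrete; the functions and numbers below are illustrative assumptions, not OpenAI's implementation.

```python
def tokens_upfront(n_tools: int, spec_tokens: int, calls: int) -> int:
    # Old behaviour: every call ships the full definition of every tool.
    return calls * n_tools * spec_tokens

def tokens_on_demand(tools_used_per_call: int, spec_tokens: int,
                     calls: int, search_overhead: int = 50) -> int:
    # New behaviour: each call pays a small search overhead plus
    # only the definitions of the tools it actually retrieves.
    return calls * (search_overhead + tools_used_per_call * spec_tokens)
```

With, say, 100 available tools at 300 tokens each but only 3 used per call, ten calls drop from 300,000 tokens of tool definitions to under 10,000: the saving scales with how much of the tool catalogue goes unused per request, which is exactly the situation large MCP deployments are in.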
The benchmark caveat worth keeping
The Mercor APEX-Agents result is presented in the launch materials as a straightforward win, but there is important context attached.
When Mercor introduced the benchmark in January, it found that even the best frontier models at the time completed fewer than 25% of professional tasks on the first attempt, and with eight tries, the ceiling was around 40%. GPT-5.4 topping the leaderboard means it is the best-performing model in a field where no model is yet close to professional-grade reliability on long-horizon tasks.
Brendan Foody, Mercor's co-founder and CEO, acknowledged as much when he introduced the benchmark: “Right now it's fair to say it's like an intern that gets it right a quarter of the time.”
That caveat does not diminish the progress. It does affect how the headline benchmark result should be read, particularly when OpenAI's own framing presents GDPval, its internal benchmark, as evidence of matching or exceeding “industry professionals.”
GDPval and APEX-Agents measure quite different things: GDPval evaluates individual deliverables across broad occupation categories, while APEX-Agents tests sustained multi-step workflows inside simulated enterprise environments. Both matter; neither tells the complete story.
The safety addition
OpenAI has included a new open-source evaluation called CoT Controllability, designed to test whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring.
This addresses a concern that has been building in AI safety research for some time: that a sufficiently capable model might learn to misrepresent its internal reasoning when being observed.
The company reports that GPT-5.4 Thinking shows low ability to control its chain-of-thought in this way, which OpenAI frames as a positive safety signal, suggesting that monitoring the model's visible reasoning remains a meaningful safeguard.
Anthropic published related research in February noting that its own models sometimes engage in reasoning that differs from their stated chain-of-thought under specific conditions; OpenAI explicitly links to that work in its launch materials.
Whether the chain-of-thought controllability evaluation will hold as models become more capable is an open question. The fact that OpenAI is publishing the evaluation methodology as open source is at least a step toward external scrutiny.
Where things stand
GPT-5.4 arrives during what is arguably the most competitive month in frontier AI to date. Anthropic's Claude Opus 4.6, released in February, still leads on several coding benchmarks.
Google's Gemini 3.1 Pro leads on abstract reasoning measures and offers a larger context window at a lower price. GPT-5.4 appears to take the lead on desktop computer use and professional knowledge work tasks, as measured by the benchmarks OpenAI is choosing to highlight. No single model sweeps everything.
The release cadence itself is worth noticing. GPT-5.3 Instant launched Tuesday; GPT-5.4 landed Thursday.
That pace, two significant model releases in under a week, with the next one already being hinted at, suggests OpenAI is betting that staying visible in the news cycle is as important as any single capability leap.
Whether that strategy translates into sustained enterprise adoption, or simply accelerates the already rapid benchmark turnover that makes it difficult for any lead to last, is the real question heading into the rest of 2026.