GPT-5.4 Thinking Review: The First AI to Beat Humans at Desktop Tasks (75% OSWorld)
OpenAI's GPT-5.4 Thinking scored 75.0% on OSWorld-Verified — higher than the 72.4% human expert baseline. It's the first general-purpose AI to clear that bar. Here's what it actually means for you.
TL;DR
- GPT-5.4 Thinking scored 75.0% on the OSWorld desktop benchmark — humans score 72.4%
- It uses a "thinking-time toggle" — adjustable reasoning depth before responding
- Computer-use capability: operates browsers, editors, and apps via screenshots + mouse/keyboard
- Previous GPT-5.2 scored 47.3% on the same benchmark — a massive jump
- Available to ChatGPT Plus, Team, and Pro subscribers; API: model ID `gpt-5.4`
What Is OSWorld and Why Does This Score Matter?
OSWorld is a benchmark designed to test AI on real-world computer tasks — not trivia, not coding in isolation, but actually controlling a desktop: opening apps, navigating browsers, editing documents in real software, writing and running code in actual IDEs. Tasks reflect the kind of work a junior technical employee might do.
The human expert baseline on OSWorld-Verified is 72.4%. That's the score a competent human achieves when working through the same benchmark tasks. GPT-5.4 Thinking scored 75.0%, making it the first general-purpose frontier model to exceed that baseline.
For context, the previous version (GPT-5.2 Thinking) scored 47.3% on the same benchmark. The jump to 75.0% is enormous: a 27.7-point gain in a single model generation. It's not incremental progress; it's a step change in what AI can do unsupervised at a computer.
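In practice, an OSWorld-style task reduces to an observe-act loop: take a screenshot, decide the next action, execute it, check whether the task is done, repeat. A minimal sketch of that loop, with a stubbed desktop and a stubbed policy standing in for the real harness and model (all names here are illustrative, not OSWorld's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class StubDesktop:
    """Pretend desktop: the 'task' is done once two actions have run."""
    actions_taken: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real harness returns an image; we return a text stand-in.
        return f"screen after {len(self.actions_taken)} actions"

    def execute(self, action: str) -> None:
        # A real harness would move the mouse / press keys here.
        self.actions_taken.append(action)

    def task_done(self) -> bool:
        return len(self.actions_taken) >= 2

def stub_policy(observation: str) -> str:
    """Stand-in for the model: maps an observation to the next action."""
    return "click(100, 200)" if "0 actions" in observation else "type('report.csv')"

def run_agent(env: StubDesktop, max_steps: int = 10) -> list:
    """Observe, decide, act until the task is finished or steps run out."""
    for _ in range(max_steps):
        if env.task_done():
            break
        obs = env.screenshot()
        env.execute(stub_policy(obs))
    return env.actions_taken

env = StubDesktop()
print(run_agent(env))  # ["click(100, 200)", "type('report.csv')"]
```

The benchmark's difficulty lives in the two stubbed pieces: perceiving real application state from pixels, and choosing a correct action among thousands of possible clicks and keystrokes.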
How GPT-5.4 Thinking Works
GPT-5.4 Thinking is not a separate model — it's a reasoning mode of GPT-5.4 that applies additional test-time compute before generating a response. The model "thinks" internally about a problem, generating chains of reasoning that don't appear in the final output, before committing to an answer.
Key features:
- Thinking-time toggle: Users can set reasoning depth from "Light" (fast, less thorough) to "Heavy" (slow, deep analysis). Pro subscribers get full access to Heavy mode for demanding tasks.
- Computer-use: The model interacts with desktop environments by processing screenshots and executing mouse clicks and keyboard inputs — no plugins required.
- "Run, verify, fix" loop: For coding tasks, it writes code, executes it, reads the output, and corrects errors autonomously — a complete development loop without human prompting at each step.
- 1M token context window: Handles entire codebases or long task histories in a single session via the API.
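The "run, verify, fix" loop above can be sketched locally. In this sketch the model is replaced by a stub that "repairs" a known NameError; in the real system the fix would come from GPT-5.4 Thinking reading the traceback:

```python
import os
import subprocess
import sys
import tempfile

BUGGY = "print(totl)\n"                # NameError on purpose
FIXED = "totl = 41 + 1\nprint(totl)\n"

def stub_fix(source: str, stderr: str) -> str:
    """Stand-in for the model: patches code based on the error text."""
    if "NameError" in stderr:
        return FIXED
    return source

def run_verify_fix(source: str, max_attempts: int = 3) -> str:
    """Write code to disk, run it, and on failure ask for a fix and retry."""
    for _ in range(max_attempts):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True)
        os.unlink(path)
        if result.returncode == 0:
            return result.stdout.strip()          # verified: ran cleanly
        source = stub_fix(source, result.stderr)  # fix and retry
    raise RuntimeError("could not repair the program")

print(run_verify_fix(BUGGY))  # prints 42 after one repair
```

The point of the loop is that verification is mechanical (did the program run, did the output match), so the model can iterate without a human checking each attempt.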
Benchmark Comparison
| Model | OSWorld-Verified | Notes |
|---|---|---|
| GPT-5.4 Thinking | 75.0% | First to exceed human baseline |
| Human Expert | 72.4% | Baseline |
| GPT-5.2 Thinking | 47.3% | Previous generation |
Real-World Use Cases That Benefit
The OSWorld score translates to practical capability in tasks like:
- Software QA automation: Running through UI test flows in browsers, recording results
- Data entry and transformation: Moving data between spreadsheets, forms, and databases
- Dev environment setup: Cloning repos, installing dependencies, configuring tools
- Research compilation: Browsing sources, extracting data, building structured reports
- Multi-app workflows: Tasks that require switching between multiple applications in sequence
Pricing and Access
GPT-5.4 Thinking is available through:
- ChatGPT: Plus ($20/mo), Team, and Pro ($200/mo) subscribers — Pro gets unrestricted Heavy mode
- OpenAI API: model ID `gpt-5.4`, priced at $2.50 input / $15.00 output per 1M tokens (Standard variant)
- Note: GPT-5.2 Thinking is now a Legacy Model, retiring June 5, 2026
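At the quoted Standard rates, per-request cost is simple arithmetic. A rough calculator (one assumption: hidden reasoning tokens bill at the output rate, as with earlier reasoning models; that may not hold for `gpt-5.4`):

```python
# Back-of-envelope cost at the article's quoted Standard rates:
# $2.50 per 1M input tokens, $15.00 per 1M output tokens.
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Dollar cost of one request; reasoning tokens counted as output."""
    return (input_tokens * INPUT_RATE
            + (output_tokens + reasoning_tokens) * OUTPUT_RATE)

# A heavy agentic step: large screenshot context, long hidden reasoning.
cost = request_cost(input_tokens=50_000, output_tokens=1_000,
                    reasoning_tokens=20_000)
print(f"${cost:.4f}")  # $0.4400
```

Note how reasoning tokens dominate: the 20,000 hidden tokens cost more than twice the 50,000-token input, which is why Heavy mode gets expensive fast in long agent runs.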
How Does It Compare to Claude and Gemini?
The OSWorld result is specifically for GPT-5.4 Thinking. Anthropic's Claude Mythos 5 (leaked, not yet publicly available) is expected to compete directly in agentic computer-use tasks. Google's Gemini 3.1 Pro currently scores well on agentic benchmarks (33.5% on APEX-Agents, 69.2% on MCP Atlas) but has not yet reported an OSWorld score in this range.
For now, GPT-5.4 Thinking holds the clearest published claim to human-surpassing desktop automation. Whether that translates to real-world superiority in your specific workflows depends on the task — Claude remains stronger for long-form analysis and nuanced writing, while Gemini 3.1 leads on certain multimodal and document tasks.
Should You Switch to GPT-5.4 Thinking?
If you currently use an AI assistant mainly for chat, writing, and coding assistance, the OSWorld benchmark doesn't change your calculus much. The advantage shows up in fully automated, multi-step desktop workflows — tasks where you want to hand the AI a goal and come back when it's done.
If you're building automated desktop agents or want to offload repetitive computer tasks entirely, GPT-5.4 Thinking is now the strongest published option. But the compute cost is high — the Heavy thinking mode is slow and expensive. For most everyday queries, standard GPT-5.4 (no Thinking) or Claude Sonnet will be faster and cheaper.
What This Benchmark Really Tells You
OSWorld-Verified is one of the most demanding real-world benchmarks for AI agents. Crossing the human baseline matters not because AI is "smarter than humans" in general — it's not — but because it signals that for specific, bounded, repetitive desktop workflows, AI agents can now outperform a trained human without human intervention. That's the threshold where serious automation becomes viable.