GPT-5.4 Thinking Review: The First AI to Beat Humans at Desktop Tasks (75% OSWorld)
OpenAI's GPT-5.4 Thinking scored 75.0% on OSWorld-Verified — higher than the 72.4% human expert baseline. It's the first general-purpose AI to clear that bar. Here's what it actually means for you.
TL;DR
- GPT-5.4 Thinking scored 75.0% on the OSWorld desktop benchmark — humans score 72.4%
- It uses a "thinking-time toggle" — adjustable reasoning depth before responding
- Computer-use capability: operates browsers, editors, and apps via screenshots + mouse/keyboard
- Previous GPT-5.2 scored 47.3% on the same benchmark — a massive jump
- Available to ChatGPT Plus, Team, and Pro subscribers; API: model ID `gpt-5.4`
What Is OSWorld and Why Does This Score Matter?
OSWorld is a benchmark designed to test AI on real-world computer tasks — not trivia, not coding in isolation, but actually controlling a desktop: opening apps, navigating browsers, editing documents in real software, writing and running code in actual IDEs. Tasks reflect the kind of work a junior technical employee might do.
The human expert baseline on OSWorld-Verified is 72.4%. That's the score a competent human achieves when working through the same benchmark tasks. GPT-5.4 Thinking scored 75.0%, making it the first general-purpose frontier model to exceed that baseline.
For context, the previous version (GPT-5.2 Thinking) scored 47.3% on the same benchmark. The jump to 75.0% is enormous: a 27.7-point gain in a single model generation. It's not incremental progress; it's a step change in what AI can do unsupervised at a computer.
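In practice, an OSWorld-style task reduces to an observe-act loop: take a screenshot, decide the next action, execute it, check whether the task is done, repeat. A minimal sketch of that loop, with a stubbed desktop and a stubbed policy standing in for the real harness and model (all names here are illustrative, not OSWorld's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class StubDesktop:
    """Pretend desktop: the 'task' is done once two actions have run."""
    actions_taken: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real harness returns an image; we return a text stand-in.
        return f"screen after {len(self.actions_taken)} actions"

    def execute(self, action: str) -> None:
        # A real harness would move the mouse / press keys here.
        self.actions_taken.append(action)

    def task_done(self) -> bool:
        return len(self.actions_taken) >= 2

def stub_policy(observation: str) -> str:
    """Stand-in for the model: maps an observation to the next action."""
    return "click(100, 200)" if "0 actions" in observation else "type('report.csv')"

def run_agent(env: StubDesktop, max_steps: int = 10) -> list:
    """Observe, decide, act until the task is finished or steps run out."""
    for _ in range(max_steps):
        if env.task_done():
            break
        obs = env.screenshot()
        env.execute(stub_policy(obs))
    return env.actions_taken

env = StubDesktop()
print(run_agent(env))  # ["click(100, 200)", "type('report.csv')"]
```

The benchmark's difficulty lives in the two stubbed pieces: perceiving real application state from pixels, and choosing a correct action among thousands of possible clicks and keystrokes.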
How GPT-5.4 Thinking Works
GPT-5.4 Thinking is not a separate model — it's a reasoning mode of GPT-5.4 that applies additional test-time compute before generating a response. The model "thinks" internally about a problem, generating chains of reasoning that don't appear in the final output, before committing to an answer.
Key features:
- Thinking-time toggle: Users can set reasoning depth from "Light" (fast, less thorough) to "Heavy" (slow, deep analysis). Pro subscribers get full access to Heavy mode for demanding tasks.
- Computer-use: The model interacts with desktop environments by processing screenshots and executing mouse clicks and keyboard inputs — no plugins required.
- "Run, verify, fix" loop: For coding tasks, it writes code, executes it, reads the output, and corrects errors autonomously — a complete development loop without human prompting at each step.
- 1M token context window: Handles entire codebases or long task histories in a single session via the API.
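The "run, verify, fix" loop above can be sketched locally. In this sketch the model is replaced by a stub that "repairs" a known NameError; in the real system the fix would come from GPT-5.4 Thinking reading the traceback:

```python
import os
import subprocess
import sys
import tempfile

BUGGY = "print(totl)\n"                # NameError on purpose
FIXED = "totl = 41 + 1\nprint(totl)\n"

def stub_fix(source: str, stderr: str) -> str:
    """Stand-in for the model: patches code based on the error text."""
    if "NameError" in stderr:
        return FIXED
    return source

def run_verify_fix(source: str, max_attempts: int = 3) -> str:
    """Write code to disk, run it, and on failure ask for a fix and retry."""
    for _ in range(max_attempts):
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True)
        os.unlink(path)
        if result.returncode == 0:
            return result.stdout.strip()          # verified: ran cleanly
        source = stub_fix(source, result.stderr)  # fix and retry
    raise RuntimeError("could not repair the program")

print(run_verify_fix(BUGGY))  # prints 42 after one repair
```

The point of the loop is that verification is mechanical (did the program run, did the output match), so the model can iterate without a human checking each attempt.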
Benchmark Comparison
| Model | OSWorld-Verified | Notes |
|---|---|---|
| GPT-5.4 Thinking | 75.0% | First to exceed human baseline |
| Human Expert | 72.4% | Baseline |
| GPT-5.2 Thinking | 47.3% | Previous generation |
Real-World Use Cases That Benefit
The OSWorld score translates to practical capability in tasks like:
- Software QA automation: Running through UI test flows in browsers, recording results
- Data entry and transformation: Moving data between spreadsheets, forms, and databases
- Dev environment setup: Cloning repos, installing dependencies, configuring tools
- Research compilation: Browsing sources, extracting data, building structured reports
- Multi-app workflows: Tasks that require switching between multiple applications in sequence
Pricing and Access
GPT-5.4 Thinking is available through:
- ChatGPT: Plus ($20/mo), Team, and Pro ($200/mo) subscribers — Pro gets unrestricted Heavy mode
- OpenAI API: model ID `gpt-5.4`, priced at $2.50 input / $15.00 output per 1M tokens (Standard variant)
- Note: GPT-5.2 Thinking is now a Legacy Model, retiring June 5, 2026
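At the quoted Standard rates, per-request cost is simple arithmetic. A rough calculator (one assumption: hidden reasoning tokens bill at the output rate, as with earlier reasoning models; that may not hold for `gpt-5.4`):

```python
# Back-of-envelope cost at the article's quoted Standard rates:
# $2.50 per 1M input tokens, $15.00 per 1M output tokens.
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Dollar cost of one request; reasoning tokens counted as output."""
    return (input_tokens * INPUT_RATE
            + (output_tokens + reasoning_tokens) * OUTPUT_RATE)

# A heavy agentic step: large screenshot context, long hidden reasoning.
cost = request_cost(input_tokens=50_000, output_tokens=1_000,
                    reasoning_tokens=20_000)
print(f"${cost:.4f}")  # $0.4400
```

Note how reasoning tokens dominate: the 20,000 hidden tokens cost more than twice the 50,000-token input, which is why Heavy mode gets expensive fast in long agent runs.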
How Does It Compare to Claude and Gemini?
The OSWorld result is specifically for GPT-5.4 Thinking. Anthropic's Claude Mythos 5 (leaked, not yet publicly available) is expected to compete directly in agentic computer-use tasks. Google's Gemini 3.1 Pro currently scores well on agentic benchmarks (33.5% on APEX-Agents, 69.2% on MCP Atlas) but has not yet reported an OSWorld score in this range.
For now, GPT-5.4 Thinking holds the clearest published claim to human-surpassing desktop automation. Whether that translates to real-world superiority in your specific workflows depends on the task — Claude remains stronger for long-form analysis and nuanced writing, while Gemini 3.1 leads on certain multimodal and document tasks.
Should You Switch to GPT-5.4 Thinking?
If you currently use an AI assistant mainly for chat, writing, and coding assistance, the OSWorld benchmark doesn't change your calculus much. The advantage shows up in fully automated, multi-step desktop workflows — tasks where you want to hand the AI a goal and come back when it's done.
If you're building automated desktop agents or want to offload repetitive computer tasks entirely, GPT-5.4 Thinking is now the strongest published option. But the compute cost is high — the Heavy thinking mode is slow and expensive. For most everyday queries, standard GPT-5.4 (no Thinking) or Claude Sonnet will be faster and cheaper.
What This Benchmark Really Tells You
OSWorld-Verified is one of the most demanding real-world benchmarks for AI agents. Crossing the human baseline matters not because AI is "smarter than humans" in general — it's not — but because it signals that for specific, bounded, repetitive desktop workflows, AI agents can now outperform a trained human without human intervention. That's the threshold where serious automation becomes viable.