GPT-5.4 Beats Human Desktop Performance — 75% on OSWorld, Native Computer Use Explained
Released in March 2026, GPT-5.4 is the first general-purpose AI to score above the human expert baseline on desktop automation. Here is what actually changed, how the computer use API works, and whether you should migrate.
TL;DR
- GPT-5.4 scores 75.0% on OSWorld — surpassing human expert baseline of 72.4%
- First general-purpose model with native (not wrapped) computer use in its architecture
- GDPval: 83.0% (+12.1 pts), BrowseComp: 82.7% (+16.9 pts), ARC-AGI-2: 73.3% (+20.4 pts)
- Price: $2.50 / $10 per million tokens (input / output)
- One breaking change: prompt prefilling removed — update assistant turns before migrating
- Claude Opus 4.6 still leads on coding (80.8% SWE-bench); GPT-5.4 leads on knowledge work
What is OSWorld and why does 75% matter?
OSWorld-Verified is a benchmark that measures how well an AI can complete real desktop tasks — opening applications, filling forms, navigating menus, writing files — using only screenshots and keyboard/mouse commands. It is the closest existing proxy for "can this AI do what a human knowledge worker does at a computer?"
The human expert baseline sits at 72.4%. Every major model before GPT-5.4 fell short: GPT-5.2 hit 47.3%. Claude Opus 4.6, despite leading on coding benchmarks, scores in the low 60s on OSWorld. Gemini 3.1 Pro is roughly comparable to Claude.
GPT-5.4's 75.0% means that when you ask it to "fill out the expense report in Excel and email it to finance@company.com," it completes tasks of that kind more reliably, on average, than the human expert baseline. That is a meaningful threshold — not marketing language.
All benchmark improvements at a glance
| Benchmark | GPT-5.2 | GPT-5.4 | Delta |
|---|---|---|---|
| OSWorld-Verified (computer use) | 47.3% | 75.0% | +27.7 pts ✓ beats human |
| GDPval (professional work) | 70.9% | 83.0% | +12.1 pts |
| BrowseComp (web research) | 65.8% | 82.7% | +16.9 pts |
| ARC-AGI-2 (abstract reasoning) | 52.9% | 73.3% | +20.4 pts |
| SWE-bench (coding) | ~74% | ~76% | +2 pts |
| Context window | 512K | 1M+ | 2× longer |
How native computer use actually works
The key word is "native." Previous computer use implementations — including Anthropic's Claude Computer Use (2024) and early GPT-5.x experiments — bolted vision and action modules onto an existing language model. The model would receive a screenshot, process it through a separate vision encoder, and then try to map visual understanding to UI actions through a tool-calling layer. This introduced latency, accuracy loss, and brittle behavior on complex multi-step tasks.
GPT-5.4 integrates computer use from the pre-training stage. The model was trained on data that includes visual UI interactions as first-class training examples — not as an afterthought. The result is a model that genuinely understands the relationship between a button's visual appearance, its label, its position on screen, and the expected outcome of clicking it.
In the API, computer use is enabled by passing a computer_use tool type in the Responses API (not the legacy Chat Completions endpoint). The model can then receive screenshots as input and return structured actions: click, type, scroll, keypress, drag.
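The request/response loop described above can be sketched as follows. This is a minimal sketch, not a confirmed API surface: the `computer_use` tool type is named in the text, but the exact field names (`display_width`, `input_image`, the action schema) and the `"gpt-5.4"` model string are illustrative assumptions.

```python
# Sketch of one turn of a computer-use loop against the Responses API.
# Assumed names: the field names, action schema, and "gpt-5.4" model
# string are illustrative, not confirmed API identifiers.

def build_computer_use_request(screenshot_b64: str, instruction: str) -> dict:
    """Build a request body pairing a screenshot with the task instruction."""
    return {
        "model": "gpt-5.4",
        "tools": [{
            "type": "computer_use",          # tool type named in the article
            "display_width": 1920,           # assumed screen-geometry fields
            "display_height": 1080,
        }],
        "input": [{
            "role": "user",
            "content": [
                {"type": "input_text", "text": instruction},
                {"type": "input_image",
                 "image_url": f"data:image/png;base64,{screenshot_b64}"},
            ],
        }],
    }

def dispatch_action(action: dict) -> str:
    """Map a structured action returned by the model to a local handler."""
    kind = action["type"]  # one of: click, type, scroll, keypress, drag
    if kind == "click":
        return f"click at ({action['x']}, {action['y']})"
    if kind == "type":
        return f"type {action['text']!r}"
    return f"unhandled action: {kind}"

# One iteration: send a screenshot + instruction, then execute the action
# the model returns before capturing the next screenshot.
req = build_computer_use_request("iVBORw0...", "Open the expense report in Excel")
result = dispatch_action({"type": "click", "x": 120, "y": 340})
```

In a real agent loop you would execute each returned action, capture a fresh screenshot, and send it back as the next turn's input until the model signals completion.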
Five levels of reasoning effort
A less-discussed but practically important feature: GPT-5.4 introduces five-level reasoning effort control — ranging from "minimal" (fast, low token usage) to "maximum" (deep chain-of-thought with extended thinking). This gives developers fine-grained control over the cost/quality trade-off for each request.
| Level | Best for | Cost impact |
|---|---|---|
| 1 — Minimal | Classification, routing, simple Q&A | ~0.3× standard |
| 2 — Low | Summarization, extraction, short responses | ~0.6× |
| 3 — Standard (default) | General chat, drafting, analysis | 1× |
| 4 — High | Complex reasoning, long documents | ~2–3× |
| 5 — Maximum | Research, multi-step agentic tasks | ~5–8× |
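In practice this means picking an effort level per request rather than per application. The sketch below assumes a `reasoning.effort` request parameter and the level names from the table; neither is a confirmed API identifier.

```python
# Sketch: routing requests to an effort level by task type.
# The "reasoning.effort" parameter and the level names are assumptions
# drawn from the table above, not confirmed API identifiers.

EFFORT_LEVELS = ("minimal", "low", "standard", "high", "maximum")

def build_request(prompt: str, effort: str = "standard") -> dict:
    """Build a request body with an explicit reasoning-effort level."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "model": "gpt-5.4",
        "reasoning": {"effort": effort},
        "input": prompt,
    }

# Cheap classification call vs. a deep multi-step agentic task:
router_req = build_request("Classify this ticket: billing or technical?",
                           effort="minimal")
agent_req = build_request("Research the top competitors and draft a summary.",
                          effort="maximum")
```

The point of the pattern is that cost scales with the table's multipliers, so routing simple traffic to low effort levels is where most of the savings come from.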
The one breaking change: prefilling is gone
GPT-5.4 removes support for prompt prefilling — the technique of providing a partial assistant turn to steer the model's response direction. If your system prompts or API calls included pre-populated assistant messages, those calls will fail or produce unexpected output with GPT-5.4.
The fix is straightforward: move any content from prefilled assistant turns into the system prompt or user message. OpenAI provides a migration guide, and for most applications this is a 5–10 minute change.
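The migration can be mechanical. The sketch below shows one way to fold a trailing prefilled assistant turn into the preceding user message; the message shapes and the steering phrasing are illustrative assumptions, not taken from OpenAI's migration guide.

```python
# Migration sketch: rewrite a GPT-5.2-style message list that ends in a
# prefilled assistant turn into a GPT-5.4-compatible one. Message shapes
# and the steering wording are illustrative assumptions.

def migrate_prefill(messages: list[dict]) -> list[dict]:
    """Fold a trailing prefilled assistant turn into the prior user message.

    Mutates and returns `messages`. If the list does not end with an
    assistant turn, it is returned unchanged.
    """
    if messages and messages[-1]["role"] == "assistant":
        prefill = messages.pop()["content"]
        messages[-1]["content"] += (
            f"\n\nBegin your response exactly with: {prefill!r}"
        )
    return messages

old_style = [
    {"role": "user", "content": "List three risks of the migration."},
    {"role": "assistant", "content": "1."},  # prefill: rejected by GPT-5.4
]
new_style = migrate_prefill(old_style)  # single user turn, no prefill
```

The same content can equally go into the system prompt; the key constraint is simply that no partial assistant turn is sent.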
Beyond prefilling, GPT-5.4 is a drop-in replacement for GPT-5.2 on the Responses API, with no other breaking changes to parameters, response format, or tool schemas.
When to choose GPT-5.4 vs Claude Opus 4.6
The honest answer: they excel at different things. GPT-5.4 leads on knowledge work (GDPval 83%), web research (BrowseComp 82.7%), and desktop automation (OSWorld 75%). Claude Opus 4.6 leads on pure coding (SWE-bench 80.8%) and tends to produce more reliable long-form structured outputs.
| Use case | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|
| Desktop / computer use automation | ✓ Best | Good |
| Web research + synthesis | ✓ Best | Good |
| Professional writing / analysis | ✓ Best | Good |
| Coding / software engineering | Good | ✓ Best |
| Long structured documents | Good | ✓ Best |
| Abstract reasoning / math | 73.3% ARC-AGI-2 | ~68% |
| Pricing (input/output per M tokens) | $2.50 / $10 | $5 / $25 |
Access GPT-5.4 and Claude Opus 4.6 in one place
Happycapy connects all frontier AI models so you can switch by task, not by subscription.
Try Happycapy Free →
Frequently asked questions
What is OSWorld and why does GPT-5.4 scoring 75% matter?
OSWorld (OSWorld-Verified) is a benchmark that measures an AI model's ability to complete real desktop tasks by interpreting screenshots and controlling the mouse and keyboard. A human expert baseline is 72.4%. GPT-5.4 scoring 75% means it is the first general-purpose AI model to surpass human expert performance on desktop automation — a significant milestone for agentic AI workflows.
How does GPT-5.4's native computer use differ from previous Claude or Anthropic implementations?
Previous computer use implementations (including early Claude Computer Use in 2024) relied on external wrappers or separate tool systems added after training. GPT-5.4 has computer use baked directly into its architecture from pre-training, meaning the model natively understands screenshots, UI elements, and mouse/keyboard actions without needing a separate vision module or wrapper layer. This results in higher accuracy and lower latency.
What does GPT-5.4 cost and how does it compare to GPT-5.2?
GPT-5.4 is priced at $2.50 per million input tokens and $10 per million output tokens — roughly comparable to GPT-5.2 pricing. The key improvements over GPT-5.2: OSWorld score jumped from 47.3% to 75.0% (computer use), GDPval from 70.9% to 83.0% (professional work quality), BrowseComp from 65.8% to 82.7% (web research), and ARC-AGI-2 from 52.9% to 73.3% (abstract reasoning).
Should developers switch from GPT-5.2 to GPT-5.4?
For most use cases, yes. The main breaking change is that prompt prefilling was removed in GPT-5.4. If your prompts relied on prefilled assistant turns, you need to update them. Otherwise the upgrade is a drop-in replacement with significant quality gains, especially for knowledge work, research, and any agentic workflows that involve browsing or desktop automation.