MolmoWeb: AI2's Open-Source Web Agent Beats GPT-4o at 8B Parameters
TL;DR: The Allen Institute for AI (AI2) released MolmoWeb — a fully open-source web agent that navigates browsers using only screenshots, with no DOM or accessibility tree needed. The 8B model scores 78.2% on WebVoyager (vs. OpenAI o3's 79.3%), outperforms GPT-4o-based agents, and is free under Apache 2.0.
Open-source AI just caught up to frontier proprietary web agents. MolmoWeb, released in late March 2026 by AI2, is a compact model (4B and 8B parameters) that can navigate real websites, fill forms, search for products, and complete multi-step tasks — using nothing but visual screenshots of what's on screen.
That's the same way a human uses a browser. No peeking at source code. No accessibility API shortcuts. Just pixels — and remarkable performance.
How MolmoWeb Works
Most AI web agents use a hybrid approach: they receive screenshots plus structured representations of the page (HTML, DOM tree, ARIA labels). This gives them a significant advantage — they can "read" buttons and links even when the visuals are unclear.
MolmoWeb uses none of that. It receives only a screenshot and a natural-language task description. It outputs a mouse click coordinate or keyboard action, then receives the next screenshot. Repeat until done.
This vision-only constraint makes MolmoWeb:
- More generalizable — works on any website, including those that block DOM scraping
- More like human behavior — useful for testing UX that real users would experience
- Deployable in restricted environments — no special API access needed, just a screenshot stream
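The observe-act loop described above can be sketched in a few lines. Everything here is illustrative: `predict_action` stands in for a call to the model, and `capture`/`apply` stand in for whatever browser driver you use. None of these names come from MolmoWeb's actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str              # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def predict_action(screenshot: bytes, task: str) -> Action:
    # Stand-in for the model: a real agent would send the screenshot
    # and task description to MolmoWeb and parse its predicted action.
    return Action(kind="done")

def run_agent(task: str, capture, apply, max_steps: int = 30) -> bool:
    # Vision-only loop: screenshot in, one action out, repeat until done.
    for _ in range(max_steps):
        shot = capture()                 # pixels only -- no DOM, no ARIA tree
        action = predict_action(shot, task)
        if action.kind == "done":
            return True
        apply(action)                    # perform the click/type in the browser
    return False
```

Note that the loop has no fallback channel: if the model misreads the screenshot, there is no DOM to consult, which is exactly why the limitations below (small text, dynamic pages) matter.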
Benchmark Results
MolmoWeb was evaluated on four standard web agent benchmarks. Results for both model sizes:
| Benchmark | MolmoWeb 4B | MolmoWeb 8B | MolmoWeb 8B (×4 scaling) | OpenAI o3 |
|---|---|---|---|---|
| WebVoyager | 68.4% | 78.2% | 94.7% | 79.3% |
| Online-Mind2Web | 28.1% | 35.3% | 60.5% | — |
| DeepShop | 35.8% | 42.3% | — | — |
| WebTailBench | 41.2% | 49.5% | — | — |
"×4 scaling" means running the same task up to 4 times and selecting the best outcome — a test-time compute technique that significantly boosts reliability.
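The best-of-N idea behind "×4 scaling" is easy to express. This is a generic sketch of the technique, not AI2's implementation: `run_task` is a stub rollout, and the success score stands in for whatever verifier picks the best attempt.

```python
import random

def run_task(task: str, seed: int) -> dict:
    # Stand-in for one full agent rollout; a real run would drive a
    # browser end-to-end and be judged by a verifier afterwards.
    rng = random.Random(seed)
    return {"success_score": rng.random(), "trace": f"rollout-{seed}"}

def best_of_n(task: str, n: int = 4) -> dict:
    # Test-time scaling: run the same task n times with different
    # sampling seeds, then keep the highest-scoring attempt.
    attempts = [run_task(task, seed) for seed in range(n)]
    return max(attempts, key=lambda a: a["success_score"])
```

The catch is cost: ×4 scaling spends roughly four times the compute per task, so the 94.7% WebVoyager figure buys reliability with latency and GPU time.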
How It Compares to Other Web Agents
| Agent | Model Size | Input Type | WebVoyager | License |
|---|---|---|---|---|
| MolmoWeb 8B | 8B | Screenshot only | 78.2% | Apache 2.0 |
| OpenAI o3 (web) | Unknown | Screenshot + structured | 79.3% | Proprietary |
| Claude Computer Use | Opus 4.6 | Screenshot + accessibility | ~70% (est.) | API (paid) |
| Microsoft Fara-7B | 7B | Screenshot + DOM | ~62% | Research |
| UI-TARS-1.5-7B | 7B | Screenshot + accessibility | ~61% | Research |
| GPT-4o (web agent) | ~200B (est.) | Screenshot + annotated | ~55–60% | Proprietary |
MolmoWeb 8B beats agents built on GPT-4o even though those agents had access to richer structured inputs. The result demonstrates that strong visual understanding can compensate for missing structural context.
The MolmoWebMix Training Dataset
A major part of MolmoWeb's performance comes from its training data: MolmoWebMix. This dataset, also released under Apache 2.0, contains:
- 36,000 human browsing task demonstrations — real people completing real tasks while their screens were recorded
- 2.2 million screenshot–QA pairs — covering UI element identification, reading UI text, understanding layouts
- Coverage of 100+ popular websites including search, shopping, news, and productivity tools
The training data intentionally omits authentication flows and financial transactions — AI2 made a deliberate safety decision not to train on login or payment sequences.
Limitations to Know About
MolmoWeb is impressive, but not perfect. Key limitations:
- Small text and high-DPI rendering — the model can misread dense text, particularly on 4K displays or when font sizes are small
- Authentication walls — no training data for login flows means it can't get past sign-in screens
- Drag-and-drop — complex interactions like sliders, file drag-and-drop, or canvas operations degrade performance
- Financial transactions — deliberately excluded from training; it won't reliably handle checkout flows
- Dynamic pages — rapidly changing UI (live dashboards, real-time updates) can confuse navigation
How to Get MolmoWeb
Both the model and training data are fully open:
- Hugging Face — search "allenai/MolmoWeb-8B" for model weights
- GitHub — github.com/allenai/molmoweb for inference code and demos
- MolmoWebMix dataset — available on Hugging Face Datasets
The 8B model fits in 16GB VRAM (quantized 4-bit runs on 8GB). A standard RTX 4080 can run it locally at usable inference speeds.
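The VRAM figures follow from simple arithmetic: 8B parameters at 16 bits per weight is 8 × 2 bytes = 16 GB for the weights alone, and 4-bit quantization cuts that to 4 GB, leaving headroom for activations and KV cache within 8 GB. A tiny helper (illustrative, not part of any MolmoWeb tooling) makes the calculation explicit:

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    # Weights-only footprint; runtime adds activations and KV cache on top.
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(8, 16))  # 16.0 GB at fp16/bf16
print(weight_vram_gb(8, 4))   # 4.0 GB at 4-bit quantization
```

This is why the full-precision model is a tight fit on a 16 GB card, while the 4-bit variant runs comfortably on 8 GB GPUs.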
What This Means for AI Agents and Automation
MolmoWeb's release changes the calculus for teams building browser automation:
- No more dependency on proprietary APIs — you can run a near-frontier web agent on your own hardware with no per-call cost
- Privacy-preserving automation — internal tools, sensitive workflows, and intranet navigation can run locally without sending screenshots to third-party APIs
- Foundation for custom fine-tuning — MolmoWebMix is open, so teams can add their own task demonstrations to specialize the model
For AI-powered tools like HappyCapy, MolmoWeb-class models enable browser-based skill execution that was previously only possible with expensive proprietary computer-use APIs.
AI2's Open-Source Streak
AI2 has consistently been one of the most open AI research institutions. MolmoWeb follows its Molmo (vision-language model), OLMo (open language model), and Tulu instruction-tuning series — all released with training data, model weights, and permissive licenses. This track record makes AI2 an increasingly important counterweight to closed frontier labs.
The release also comes shortly after Google released Gemma 4 under Apache 2.0 — a signal that the open-source AI ecosystem is closing the gap with proprietary systems faster than most predicted.
Key Takeaways
- MolmoWeb is a vision-only open-source web agent from AI2 — navigates websites using screenshots only
- The 8B model scores 78.2% on WebVoyager (94.7% with test-time scaling), nearly matching OpenAI o3
- It outperforms GPT-4o-based web agents despite being far smaller and using less structured input
- Fully open: Apache 2.0 license for both models and the 2.2M-example MolmoWebMix dataset
- Runs locally on consumer GPUs (RTX 4080 / 16GB VRAM at full precision, 8GB quantized)
- Key limitations: can't handle logins, financial transactions, or very small text
Sources: Allen Institute for AI blog (allenai.org/blog/molmoweb), GeekWire (Mar 2026), The AI Economy Substack (Mar 2026). Benchmark numbers from AI2's official evaluation report.