The Cost-Efficient Web Agent With Open Weights

See how Surfer H is surfing the web for you


Introducing Post-training as a service

RL engine to train on enterprise and generalist user traces. Illustrating our Pareto frontier for cost-effectiveness. Carving out one example: calendars. That’s how we build effective AI agents at scale.

1. Why we built Holo1 and Surfer H

Large Language Models (LLMs) have transformed our ability to generate and reason about information. But on their own, they cannot act in the world.
We need agents that can:
- See what’s on the screen
- Decide what to do
- Interact with UIs like a human
- Know when a task is done — and correct if it's not

Surfer H was built to fulfill this vision. To power it, we trained Holo1: a family of open, cost-effective VLMs designed to bridge the gap between visual perception and language understanding — enabling agents to interpret and act within web environments.

2. Surfer H: Pareto-Optimal performance on WebVoyager

Surfer H is designed to be flexible and modular. It is composed of three independent components:
- A Policy model that plans, decides, and drives the agent's behavior
- A Localizer model that sees and understands visual UIs to drive precise interactions
- A Validator model that checks whether the answer is valid

The agent thinks before acting, takes notes, and can retry if its answer is rejected. It can operate with different models for each module, allowing for tradeoffs between accuracy, speed, and cost.
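The interplay between the three modules can be sketched as a simple loop. This is a minimal illustration only, not Surfer H's actual implementation: the `Action` and `AgentState` types, the callable signatures, and the `run_agent` function are all hypothetical names introduced here, and real actions would cover more than clicking and answering.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical action emitted by the policy (not Surfer H's real schema)."""
    kind: str          # "click" or "answer"
    target: str = ""   # element description handed to the localizer
    text: str = ""     # answer text handed to the validator


@dataclass
class AgentState:
    task: str
    notes: List[str] = field(default_factory=list)  # the agent's running notes


# Assumed interfaces: each module is a plain callable so models can be swapped.
Policy = Callable[[AgentState, str], Action]        # (state, screenshot) -> action
Localizer = Callable[[str, str], Tuple[int, int]]   # (screenshot, target) -> (x, y)
Validator = Callable[[str, str], bool]              # (task, answer) -> accepted?


def run_agent(
    task: str,
    policy: Policy,
    localizer: Localizer,
    validator: Validator,
    observe: Callable[[], str],
    click: Callable[[int, int], None],
    max_steps: int = 10,
) -> Optional[str]:
    """Run the policy/localizer/validator loop until an answer is accepted."""
    state = AgentState(task)
    for _ in range(max_steps):
        screenshot = observe()
        action = policy(state, screenshot)
        if action.kind == "answer":
            if validator(task, action.text):
                return action.text
            # Rejected: record it so the policy can retry with that knowledge.
            state.notes.append(f"rejected answer: {action.text}")
            continue
        # Any other action needs screen coordinates from the localizer.
        x, y = localizer(screenshot, action.target)
        click(x, y)
        state.notes.append(f"clicked {action.target} at ({x}, {y})")
    return None  # step budget exhausted without a validated answer
```

Because each module is passed in as a separate callable, a cheaper or stronger model can be substituted for any one of them without touching the loop, which is the kind of tradeoff between accuracy, speed, and cost described above.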
We evaluated Surfer H on the WebVoyager benchmark: 643 real-world web tasks ranging from retrieving prices to finding news or scheduling events.

We tested multiple configurations, from agents powered by GPT-4o and GPT-4.1-mini to fully open Holo1 setups.

Among them, the fully Holo1-based agents offered the strongest tradeoff between accuracy and cost:

- Surfer H + Holo1-7B: 92.2% accuracy at only $0.13 per task

- Surfer H + GPT-4o: 84.3% at $0.71

- Surfer H + GPT-4.1-mini: 88.8% at $0.26

- Surfer H + Holo1-3B: 89.7% at $0.11

This places Holo1-powered agents on the Pareto frontier, delivering the best accuracy per dollar. Unlike other agents that rely on custom APIs or brittle wrappers, Surfer H operates purely through the browser — just like a real user. Combined with Holo1, it becomes a powerful, general-purpose, cost-efficient web automation system.
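To make the Pareto claim concrete, the short sketch below checks which of the four reported configurations are not dominated by any other. A configuration is dominated when another one is at least as accurate and at least as cheap, and strictly better on one of the two. The accuracy and cost figures are the ones listed above; the `Config` type and function names are illustrative choices, not part of any released tooling.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # WebVoyager accuracy, in percent
    cost: float      # USD per task


# Figures as reported in this post.
RESULTS = [
    Config("Surfer H + Holo1-7B", 92.2, 0.13),
    Config("Surfer H + GPT-4o", 84.3, 0.71),
    Config("Surfer H + GPT-4.1-mini", 88.8, 0.26),
    Config("Surfer H + Holo1-3B", 89.7, 0.11),
]


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the non-dominated configurations, cheapest first."""
    frontier = [c for c in configs if not any(dominates(o, c) for o in configs)]
    return sorted(frontier, key=lambda c: c.cost)


if __name__ == "__main__":
    for c in pareto_frontier(RESULTS):
        print(f"{c.name}: {c.accuracy}% at ${c.cost:.2f} per task")
```

With these numbers, only the two Holo1 configurations remain on the frontier: both GPT-based setups are less accurate and more expensive than Holo1-7B.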

3. Holo1: State-of-the-Art UI Localization

A key skill for the real-world utility of our VLMs within agents is localization: the ability to identify the precise coordinates on a user interface (UI) to interact with in order to complete a task or follow an instruction. To assess this capability, we evaluated our Holo1 models on several established localization benchmarks, including Screenspot, Screenspot-V2, Screenspot-Pro, and GroundUI-Web.
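As a rough illustration of how localization accuracy can be scored, the sketch below counts a prediction as correct when the predicted click point falls inside the ground-truth bounding box of the target element. This is a common convention for click-based grounding benchmarks and is assumed here; the exact scoring rules of each benchmark may differ, and the function names and box format `(left, top, right, bottom)` are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # predicted click (x, y) in pixels
Box = Tuple[float, float, float, float]  # assumed format: (left, top, right, bottom)


def click_in_box(point: Point, box: Box) -> bool:
    """True if the click lands inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def localization_accuracy(predictions: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted clicks that land inside their target box."""
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not predictions:
        return 0.0
    hits = sum(click_in_box(p, b) for p, b in zip(predictions, boxes))
    return 100.0 * hits / len(predictions)
```

For example, with four toy predictions of which three land inside their target boxes, `localization_accuracy` returns 75.0.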

Holo1 significantly outperforms prior models like Qwen2.5-VL, UI-TARS, and UGround across these benchmarks:

- Holo1-3B: 73.6% average localization accuracy, beating other 3B and even some 7B models

- Holo1-7B: 76.2%, the highest average among small-size models

To support the community, we're also releasing Web Click, a new benchmark for UI Grounding that better reflects how humans really use the web. It includes 1,639 screenshots and instruction-label pairs from over 100 websites, designed to challenge existing VLMs.

4. Looking Ahead

This is the first step in H Company’s mission to build a scalable framework to create better, safer, and cheaper agents. We are now:

- Rapidly amplifying the exploration and insights of our agents so that learnings accumulate

- Expanding the domain of our agents so that they tackle an increasingly broad spectrum of tasks

We believe that most of the value of open agents is yet to be discovered, which is why we are eager to see what the community and our users will build on top of our work. Open weights are more than a philosophy: they are a practical tool to accelerate experimentation, transparency, and collective progress.

We want to put these agents in the hands of real users. We’ll explore new environments — not just the web, but any software interface.
