11-19-2024

A Research Update

H Team

Charting a New Route: The Tech Behind Runner H’s State-of-the-Art Results

Our extensive evaluation of Runner H 0.1 agents against their competitors on the WebVoyager benchmark is proof that the tech behind Runner H can hold up in real-world scenarios.

Evaluation methods

We evaluated agent performance with the automatic evaluation method proposed in the original WebVoyager paper, which uses GPT-4o as a judge. The evaluator compares the agent’s answer against five screenshots gathered during the run and assesses whether the retrieved information is correct and consistent with those screenshots.
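For readers curious about the mechanics, here is a minimal sketch of such a GPT-4o auto-evaluation in Python. It assumes the OpenAI Python client and a simplified pass/fail prompt; the auto_evaluate helper and its prompt are illustrative only and are not the exact harness from the WebVoyager paper.

```python
# Minimal sketch of a WebVoyager-style auto-evaluation: feed the task, the
# agent's final answer, and the run's screenshots to GPT-4o and ask for a
# SUCCESS / NOT SUCCESS verdict. The prompt here is simplified for illustration.
import base64
from openai import OpenAI

client = OpenAI()

def auto_evaluate(task: str, agent_answer: str, screenshot_paths: list[str]) -> bool:
    """Ask GPT-4o whether the answer is supported by the screenshots."""
    images = []
    for path in screenshot_paths[-5:]:  # the five screenshots kept for judging
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        images.append({"type": "image_url",
                       "image_url": {"url": f"data:image/png;base64,{b64}"}})

    prompt = (
        f"Task: {task}\n"
        f"Agent answer: {agent_answer}\n"
        "Based on the screenshots, is the answer correct and consistent with "
        "what is shown? Reply with exactly SUCCESS or NOT SUCCESS."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": [{"type": "text", "text": prompt}, *images]}],
    )
    return "NOT SUCCESS" not in response.choices[0].message.content
```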

Because WebVoyager uses live, public websites, agent performance is highly dependent on when and where the evaluation is performed. We ran all of our reported evaluations from the USA, almost simultaneously in November 2024.

We benchmarked our H agents, which leverage a mix of internal and external models, alongside other agent configurations:

  • Anthropic Computer Use: We used Anthropic’s native Streamlit front-end demo, which controls a Firefox browser inside a Linux virtual machine.
  • Emergence AgentE: The current best open-source agent for web navigation and interaction, which operates on text only.
  • Original WebVoyager Agent: The open-source implementation of the agent designed specifically for the WebVoyager benchmark.

Runner H obtained 67% on WebVoyager, compared to Emergence AgentE’s 61% and Anthropic Computer Use’s 52%.

Model                     Success Rate
Runner H 0.1              67%
Emergence AgentE          61%
Anthropic Computer Use    52%

To concretely illustrate the performance of Runner H 0.1, we compared eight executions of WebVoyager tasks by the Runner H and Anthropic Computer Use agents. Runner H is generally much faster and more accurate.

H-VLM: Runner H’s eye

We trained and specialized our 3B-parameter VLM to perceive, understand, and interact with graphical user interfaces, images, diagrams, and other visual information. Its skills include describing and localizing elements in graphical user interfaces, extracting key information and text from screenshots and images, and accurately interpreting complex diagrams, charts, and documents.

We evaluated H-VLM on ScreenSpot, a benchmark for graphical user interface actions. Each example gives the model a screenshot and an instruction such as "create account" or "switch to show link attributes", and the model must output coordinates that lie within the bounding box of the associated element.
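To make the metric concrete, here is a small illustrative sketch of how a ScreenSpot-style prediction can be scored; the point_in_box and screenspot_accuracy helpers are our own illustration, not the benchmark’s official evaluation code.

```python
# Illustrative scoring for a ScreenSpot-style localization benchmark:
# a prediction counts as correct if the predicted point falls inside the
# ground-truth bounding box of the target element.

def point_in_box(x: float, y: float, box: tuple[float, float, float, float]) -> bool:
    """box is (left, top, right, bottom) in pixels."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def screenspot_accuracy(predictions, ground_truth_boxes) -> float:
    """predictions: list of (x, y) points; ground_truth_boxes: list of boxes."""
    hits = sum(point_in_box(x, y, box)
               for (x, y), box in zip(predictions, ground_truth_boxes))
    return hits / len(ground_truth_boxes)

# Example: if the "create account" button occupies (120, 40, 260, 80),
# a model predicting the point (190, 60) is scored as a hit.
print(point_in_box(190, 60, (120, 40, 260, 80)))  # True
```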

H-VLM is by far the strongest small model in localization.

Impressively, our model is much more accurate than the very large generalist models, while being orders of magnitude cheaper and faster to serve.

H-LLM: Runner H’s brain

To power Runner H and serve as the backbone of its vision capabilities, we trained our own internal family of LLMs designed for the agentic era, combining fundamental programming skills with high-level decision-making.

H-LLM is the backbone of our VLM and can also be used in our agents for text-only roles.

As shown below, our two-billion-parameter model performed extremely well on the average of code and function-calling benchmarks, outperforming much bigger models:

Below you can see our detailed results on code (HumanEval, HumanEval+, MBPP, MBPP+) and function calling (BFCL). For a fair evaluation, we decontaminated our fine-tuning data by removing every document that contained a word 8-gram overlapping with the prompts of our main benchmarks (HumanEval, MBPP/MBPP+, BFCLv2):
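As an illustration of that decontamination step, here is a minimal sketch assuming simple whitespace tokenization and in-memory lists of documents and benchmark prompts; the actual pipeline may normalize and process text differently.

```python
# Minimal sketch of 8-gram decontamination: drop any fine-tuning document that
# shares at least one word 8-gram with a benchmark prompt. Tokenization here is
# plain whitespace splitting, purely for illustration.

def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(documents: list[str], benchmark_prompts: list[str], n: int = 8) -> list[str]:
    """Keep only documents with no word n-gram overlap with any benchmark prompt."""
    forbidden = set()
    for prompt in benchmark_prompts:
        forbidden |= word_ngrams(prompt, n)
    return [doc for doc in documents
            if not (word_ngrams(doc, n) & forbidden)]
```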

If you are excited to test this new class of agentic models, you can join the waitlist here.