Holo1.5

October 9, 2025

Since the release of Holo1, we have been working to improve our models' foundational capabilities for Computer Use agents. Today, we are open-sourcing the Holo1.5 series of models in 3B and 7B sizes, plus a new 72B size. Holo1.5 improves accuracy by more than 10% over Holo1 at every size and sets a new state of the art for Computer Use localization models. It also delivers strong performance on user interface (UI) understanding and question answering. All our models are open-weight and available on Hugging Face.

Figure 1. Pareto frontier of UI Localization accuracy versus Model size

What is UI element localization?

Computer Use (CU) agents interact with software through the same native interface that humans use: they perceive the screen and take actions such as clicking on or typing into specific elements. UI element localization (also called grounding) is a critical skill for CU agents: the model is given a screenshot of a computer interface and a task (e.g., Open the Spotify App) and must output the precise coordinates of the target on the screenshot (“Click X, Y”). Because precise navigation is essential in digital environments, every CU agent needs a strong localization model.
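To make the “Click X, Y” step concrete, here is a minimal sketch of how an agent might turn a localization output into pixel coordinates it can act on. The output format is an assumption based on the illustration above, not the documented Holo1.5 response format, and `parse_click` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Assumed format: the text "Click X, Y" or "Click(X, Y)" with integer pixel
# coordinates. The real model output format may differ.
_CLICK_PATTERN = re.compile(r"click\s*\(?\s*(\d+)\s*,\s*(\d+)", re.IGNORECASE)


def parse_click(model_output: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Return (x, y) in pixels, or None if no valid on-screen click is found."""
    match = _CLICK_PATTERN.search(model_output)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    # Reject coordinates that fall outside the screenshot.
    if not (0 <= x < width and 0 <= y < height):
        return None
    return x, y


if __name__ == "__main__":
    print(parse_click("Click 512, 384", width=1920, height=1080))
```

Rejecting off-screen coordinates up front lets the agent retry the localization step instead of sending an invalid click to the environment.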

Watch a demo of how to prompt the model in a Computer Use setting

The demo is also live on our Hugging Face Space.

State-of-the-art Performance on Localization Benchmarks

Holo1.5 models achieve state-of-the-art results across all major localization benchmarks. We evaluate across Web, Mobile, and Desktop environments—including macOS, Ubuntu, and Windows—and Holo1.5 consistently outperforms open-source (Qwen-2.5 VL) and closed-source generalist models (Sonnet 4), as well as specialized systems designed for these tasks (UI-TARS 1.5, UI-Venus). In particular, Holo1.5 delivers strong gains on ScreenSpot-Pro, a demanding benchmark covering professional, high-resolution GUI software such as Photoshop, AutoCAD, and VSCode—closely mirroring the environments where CU agents operate.
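For readers unfamiliar with how localization accuracy is typically scored, the sketch below uses a common convention from benchmarks such as ScreenSpot: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. We assume that convention here for illustration; the function names and the `(left, top, right, bottom)` box layout are hypothetical and not taken from our evaluation code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom)


def is_hit(point: Point, box: Box) -> bool:
    """True if the click point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def localization_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(is_hit(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)


if __name__ == "__main__":
    preds = [(50, 50), (200, 200)]
    boxes = [(0, 0, 100, 100), (0, 0, 100, 100)]
    print(localization_accuracy(preds, boxes))
```

Under this metric a click anywhere inside the element counts equally, so small icons in high-resolution screenshots are harder to hit than large buttons.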

Figure 2. Accuracy of our models and competitors’ models on UI Localization benchmarks.

UI Understanding and Visual Question Answering Performance

In addition to localization, CU agents need to understand what is happening on the screen in order to act reliably. This capability is evaluated through UI Visual Question Answering (VQA) tasks, where the model is asked natural language questions about the interface—for example, “Which tab is currently active?” or “Is the user signed in?”—and must answer correctly based on the visual state of the software.

UI VQA is critical because it allows agents to track context, verify their actions, and resolve ambiguity in real-world tasks.

On UI VQA benchmarks, Holo1.5 shows consistent improvements over the original Qwen base models and outperforms both open-source and closed-source competitors. These results highlight that Holo1.5 is not only stronger at localization but also more capable of comprehending and reasoning about software interfaces, a key step toward building reliable, general-purpose Computer Use agents.

Figure 3. Pareto Frontier of UI Question Answering Performance versus Model Size

Figure 4. UI Understanding and Visual Question Answering performance

Building Generalist Cross-Platform Computer Use Agents

Our goal is to build cost-efficient and reliable Computer Use agents. With the release of Holo1.5, we are taking an important step toward fostering trust and adoption of this technology.

This milestone is only the beginning—over the coming weeks, we will be unveiling new tools and agents powered by Holo models.

Stay tuned—we’re just getting started.