Computer-use model

A computer-use model is a specialized type of artificial intelligence model that enables autonomous agents to interact with graphical user interfaces (GUIs) by perceiving screen content and executing actions like clicking, typing, and scrolling, similar to how humans use computers.[1][2] These models represent a significant advancement in AI agents, allowing them to control computers through visual understanding rather than programmatic APIs, making them capable of automating complex digital tasks across various applications and operating systems.[3]

Overview

Computer-use models combine vision-language models (VLMs) with reinforcement learning capabilities to understand and interact with computer screens through pixel-level visual processing.[2] Unlike traditional automation approaches that require specific APIs or scripting for each application, computer-use models can control any software that has a graphical interface, using the same visual cues and input methods that humans use.[3] This universal approach makes them particularly valuable for tasks that span multiple applications or require interaction with legacy systems that lack modern APIs.[4]

While many agent systems integrate through structured APIs, a large portion of digital work still happens in GUIs, including form filling, dashboards, and behind-login workflows. Computer-use models address this gap by powering agents that can operate like human users, navigating web pages and applications by clicking, typing, and scrolling.[3]

The fundamental innovation of computer-use models is their ability to translate high-level instructions into low-level computer actions by:

  • Perceiving screen content through screenshot analysis
  • Understanding the spatial layout and purpose of UI elements
  • Generating appropriate mouse and keyboard actions
  • Adapting to dynamic changes in the interface
  • Learning from feedback to improve performance over time[5]

History

The concept of computer-use models emerged as part of the broader development of multimodal large language models (LLMs) capable of processing visual inputs. Early research focused on visual question answering and image captioning, but by 2024, advancements allowed models to actively control UIs.

The first public beta of a computer-use model was introduced by Anthropic on October 22, 2024, with an upgraded Claude 3.5 Sonnet model featuring "computer use" capabilities. This allowed Claude to perceive screens and perform actions like cursor movement and typing.[1][6]

In January 2025, OpenAI introduced its Computer-Using Agent (CUA), which powers the Operator research preview; in July 2025, a preview of the Computer Use tool followed via Azure OpenAI, enabling models to interact with browsers, desktops, and applications across operating systems such as Windows, macOS, and Ubuntu.[2][7]

On October 7, 2025, Google DeepMind announced the Gemini 2.5 Computer Use model, built on Gemini 2.5 Pro, optimized primarily for web browsers and mobile UIs. The model became available through the Gemini API via Google AI Studio and Vertex AI.[3][8]

Technical Architecture

Core Components

Computer-use models typically consist of several integrated components working in an iterative loop (a minimal interface sketch follows the list):[3][9]

  1. Visual Perception Module: Processes screenshots using convolutional neural networks or vision transformers to understand screen content
  2. Language Understanding Module: Interprets user instructions and maintains context using large language models
  3. Action Planning Module: Uses chain-of-thought reasoning to decompose tasks into executable steps[2]
  4. Action Execution Module: Translates high-level decisions into specific UI actions (clicks, keystrokes, scrolls)
  5. Feedback Processing Module: Evaluates action results and adjusts strategy based on observed changes
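
The modular structure above can be sketched as a set of interfaces. The class names, method signatures, and `UIAction` type below are illustrative assumptions for exposition, not any vendor's actual API:

```python
# Hypothetical interfaces for the modules described above; names and
# signatures are illustrative assumptions, not a specific vendor's API.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class UIAction:
    name: str                              # e.g. "click_at"
    args: dict = field(default_factory=dict)


class VisualPerception(Protocol):
    def describe(self, screenshot_png: bytes) -> str:
        """Summarize the screen (elements, layout) from raw pixels."""


class ActionPlanner(Protocol):
    def plan(self, instruction: str, screen: str,
             history: list[UIAction]) -> list[UIAction]:
        """Decompose the instruction into the next executable UI actions."""


class ActionExecutor(Protocol):
    def run(self, action: UIAction) -> bytes:
        """Perform the action and return a fresh screenshot for feedback."""
```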

Agent Loop

At a high level, agents using computer-use models follow a repeated loop:[9]

  1. Send Request: The application invokes the Computer Use tool with the user's goal, a screenshot of the current GUI, the current URL, and optionally recent action history and constraints (for example excluded actions)
  2. Receive Model Response: The model analyzes these inputs and generates a response that typically contains one or more function calls representing UI actions (for example open browser, click, type) and may include a safety decision (for example "requires confirmation")
  3. Execute Actions: Client-side code executes allowed actions, prompting the end user for confirmation when required
  4. Capture New State: After the action has been executed, the client captures a new screenshot of the GUI and the current URL
  5. Send Function Response: The new state is returned to the model as function responses, and the loop repeats from step 2

This process continues until the task is complete, an error occurs, or the loop is terminated by a safety response or user decision. The loop is conceptually similar to function calling with tools, but specialized for GUI manipulation.[9]
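
The loop can be sketched as client-side pseudocode. The `model.generate` call, its response fields, and the `execute` and `capture_state` helpers below are illustrative assumptions standing in for a provider's actual API:

```python
# Minimal client-side agent loop, following the five steps above.
# `model`, `execute`, and `capture_state` are hypothetical stand-ins.
def run_agent(model, execute, capture_state, goal, max_turns=20):
    screenshot, url = capture_state()                 # initial GUI state
    history = []
    for _ in range(max_turns):
        # 1. Send the goal, current screenshot/URL, and recent actions.
        response = model.generate(goal=goal, screenshot=screenshot,
                                  url=url, history=history)
        if not response.function_calls:               # task finished or refused
            return response.text
        for call in response.function_calls:
            # 2-3. Honor safety decisions before executing client-side.
            if call.safety_decision == "requires_confirmation":
                if input(f"Allow {call.name}? [y/N] ").lower() != "y":
                    return "terminated by user"
            execute(call)
            history.append(call)
        # 4-5. Capture the new state and send it back on the next turn.
        screenshot, url = capture_state()
    return "turn limit reached"
```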

Coordinate System

Most computer-use models employ a normalized coordinate system where screen positions are represented on a 1000x1000 grid regardless of actual screen resolution.[9] This approach ensures consistency across different display configurations. The model outputs normalized coordinates that are then converted to actual pixel values by the client implementation:

  • X coordinates: 0-999 (left to right)
  • Y coordinates: 0-999 (top to bottom)
  • Actual pixel position = (normalized_coordinate / 1000) × screen_dimension

Google's documentation recommends a screen size of 1440×900 pixels, though the models work with other resolutions.[9]
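
As an illustration, a small helper that applies this conversion (the 1440×900 defaults mirror the recommendation above; the function itself is an assumption for exposition):

```python
def to_pixels(norm_x: int, norm_y: int,
              width: int = 1440, height: int = 900) -> tuple[int, int]:
    """Map a point on the model's 1000x1000 grid onto real screen pixels."""
    return round(norm_x / 1000 * width), round(norm_y / 1000 * height)

# A click at (500, 250) on the grid lands at (720, 225) on a 1440x900 screen.
print(to_pixels(500, 250))
```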

Training Methodology

Computer-use models are typically trained using a combination of the following methods (a minimal sketch of the supervised stage follows the list):[10]

  1. Supervised Fine-tuning (SFT): Initial training on human demonstrations of UI interactions
  2. Reinforcement Learning (RL): Optimization through trial-and-error with reward signals
  3. Reinforcement Learning from Human Feedback (RLHF): Refinement based on human preferences and corrections[11]
  4. Imitation Learning: Learning from recorded sequences of expert human interactions
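
As an illustration of the supervised stage, behavior cloning on human demonstrations can be sketched as follows. The tiny network, feature sizes, and 13-way action head are placeholder assumptions standing in for a full vision-language backbone:

```python
# Toy supervised fine-tuning step: imitate the action a human demonstrator
# took for a given (screenshot, instruction) pair. Purely illustrative.
import torch
import torch.nn as nn

NUM_ACTIONS = 13                       # e.g. click_at, type_text_at, ...
policy = nn.Sequential(                # stand-in for a VLM backbone + action head
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS))
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

features = torch.randn(32, 512)        # fused screen + instruction embeddings
demo_actions = torch.randint(0, NUM_ACTIONS, (32,))   # demonstrated actions

loss = nn.functional.cross_entropy(policy(features), demo_actions)
loss.backward()                        # standard behavior-cloning update
optimizer.step()
```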

Major Implementations

Google Gemini Computer Use

Released on October 7, 2025, Google DeepMind's Gemini 2.5 Computer Use model is a specialized variant of Gemini 2.5 Pro optimized for browser control.[3] Key features include:

  • Model code: `gemini-2.5-computer-use-preview-10-2025`
  • Optimized for web browser automation, with promising performance on mobile UI control
  • 13 predefined UI actions (click_at, type_text_at, scroll_document, etc.)
  • Built-in safety monitoring with per-step safety service
  • Performance: 70.3% on Online-Mind2Web benchmark, 34.7% on WebVoyager, 70.9% on AndroidWorld[12]
  • Powers Project Mariner, Firebase Testing Agent, and some agentic capabilities in AI Mode in Search[3]
  • Available via Google AI Studio and Vertex AI

Early testers report significant results:

  • Poke.com (AI assistant): "50% faster and better than the next best solutions"[3]
  • Autotab (AI agent): "18% performance increase on hardest evals"[3]
  • Google payments platform: "Successfully rehabilitates over 60% of executions" for failed UI tests[3]

OpenAI Computer-Using Agent (CUA)

OpenAI's Computer-Using Agent (CUA) powers the Operator product and combines GPT-4o's vision capabilities with reinforcement learning.[2] Introduced as a research preview in January 2025 and later made available via Azure OpenAI, it achieves:

  • 58.1% success rate on WebArena benchmark
  • 87% success rate on WebVoyager benchmark
  • 38.1% success rate on OSWorld benchmark
  • Supports cross-platform operation (Windows, macOS, Linux)[13]

Anthropic Claude Computer Use

Anthropic's Claude 3.5 Sonnet was the first frontier AI model to offer computer use capabilities in public beta (October 22, 2024).[1] Features include:

  • Navigates by counting pixels to calculate precise cursor movements
  • 14.9% score on OSWorld (screenshot-only)
  • 22.0% score on OSWorld (with additional steps)
  • Available through API for developer integration
  • Early adopters include Asana, DoorDash, and Replit for multi-step automation[1]

Benchmarks and Evaluation

Performance Comparison on Major Benchmarks
  Model                     OSWorld    WebArena   WebVoyager   Online-Mind2Web   AndroidWorld
  OpenAI CUA                38.1%      58.1%      87%          -                 -
  Gemini 2.5 Computer Use   -          -          34.7%        70.3%             70.9%
  Claude 3.5 Sonnet         14.9-22%   -          -            -                 -
  Human performance         72.36%     -          -            -                 -

OSWorld

OSWorld is a comprehensive benchmark for evaluating multimodal agents on open-ended computer tasks across Ubuntu, Windows, and macOS.[14] The benchmark consists of 369 tasks involving:

  • Real web and desktop applications
  • OS file I/O operations
  • Multi-application workflows
  • Cross-platform compatibility testing[15]

WebArena

WebArena evaluates web browsing agents using self-hosted open-source websites that simulate real-world scenarios in e-commerce, content management systems, and social platforms.[2] It tests abilities including form filling, multi-step navigation, information extraction, and transaction completion.

WebVoyager

WebVoyager tests model performance on live websites including Amazon, GitHub, and Google Maps, evaluating real-world web interaction capabilities.[2]

In a collaboration with Google DeepMind, Browserbase reported Gemini 2.5 Computer Use leading in accuracy, speed, and cost under matched constraints, publishing evaluation traces from thousands of human-verified runs.[16]

Supported Actions

Computer-use models typically support a standardized set of UI actions. Developers must implement the execution logic for these actions on their client-side application:[9]

Common UI Actions Supported by Computer-Use Models
  Action               Description                          Parameters
  open_web_browser     Opens the web browser                None
  click_at             Clicks at specific coordinates       x, y coordinates
  type_text_at         Types text at a location             x, y, text, clear_before_typing, press_enter
  scroll_document      Scrolls the entire page              direction (up/down/left/right)
  scroll_at            Scrolls a specific element/region    x, y, direction, magnitude
  drag_and_drop        Drags an element to a new location   start x, y; destination x, y
  key_combination      Presses keyboard shortcuts           keys (for example "Control+C")
  hover_at             Hovers the mouse at a location       x, y coordinates
  navigate             Goes to a URL                        url
  wait_5_seconds       Pauses execution                     None
  go_back/go_forward   Navigates browser history            None
  search               Goes to the default search engine    None

Developers can also add custom user-defined functions (for example `open_app`, `long_press_at` for mobile) and exclude specific predefined functions to constrain behavior.[9]
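
A minimal sketch of a client-side executor for a few of these actions, using Playwright as the automation runtime. The dispatch structure, the assumed 1440×900 viewport, and the normalized 0-999 coordinate handling are illustrative, not a vendor's reference code:

```python
# Map a handful of predefined actions onto Playwright calls. Unhandled
# actions raise so the agent loop can surface the error.
from playwright.sync_api import Page


def to_px(v: int, dim: int) -> int:
    return round(v / 1000 * dim)         # normalized 0-999 grid -> pixels


def execute_action(page: Page, name: str, args: dict) -> None:
    w, h = 1440, 900                      # assumed viewport size
    if name == "click_at":
        page.mouse.click(to_px(args["x"], w), to_px(args["y"], h))
    elif name == "hover_at":
        page.mouse.move(to_px(args["x"], w), to_px(args["y"], h))
    elif name == "type_text_at":
        page.mouse.click(to_px(args["x"], w), to_px(args["y"], h))  # focus field
        page.keyboard.type(args["text"])
        if args.get("press_enter"):
            page.keyboard.press("Enter")
    elif name == "scroll_document":
        page.mouse.wheel(0, 600 if args["direction"] == "down" else -600)
    elif name == "key_combination":
        page.keyboard.press(args["keys"])             # e.g. "Control+C"
    elif name == "navigate":
        page.goto(args["url"])
    elif name == "go_back":
        page.go_back()
    else:
        raise ValueError(f"unsupported action: {name}")
```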

Applications

Computer-use models have numerous practical applications across industries:[3][5]

Business Automation

  • Data entry and form processing across multiple websites
  • Cross-application workflow automation
  • Report generation from multiple sources
  • Customer service automation
  • Invoice and document processing

Software Development

  • UI testing and quality assurance (Google's payments team rehabilitated over 60% of failed test executions)[3]
  • Automated debugging and cross-browser compatibility testing
  • Accessibility testing
  • Performance monitoring
  • Firebase Testing Agent and similar tools[3]

Research and Analysis

  • Web scraping and data collection (for example gathering product information, prices, and reviews)
  • Competitive intelligence gathering
  • Market research automation
  • Academic research assistance
  • Content aggregation

Personal Productivity

  • Email management and calendar scheduling
  • File organization
  • Online shopping assistance (for example finding "highly rated smart fridges with touchscreen")
  • Social media management
  • Personal assistant applications (Poke.com reports 50% speed improvement)[3]

Safety and Security

Computer-use models introduce unique risks including intentional misuse, unexpected model behavior, and vulnerability to prompt injections and scams. To address these, implementations use layered safety approaches:[3][9]

Built-in Safety

  • Per-step safety service: An out-of-model, inference-time safety service assesses each action before execution
  • Safety decisions: Actions classified as regular/allowed, requires_confirmation, or blocked
  • Training-level safety: Features trained directly into models to avoid harmful actions

Prompt Injection Attacks

Prompt injection represents one of the most significant security risks for computer-use models.[17] These attacks can occur through:

  • Direct injection: Malicious instructions embedded in user input[18]
  • Indirect injection: Hidden commands in external content (web pages, documents)[19]
  • Stored injection: Persistent malicious prompts in training data or memory[20]

Mitigation Strategies

Organizations implementing computer-use models should employ multiple layers of security (a minimal sketch of two of these mitigations follows the list):[21]

  1. Sandboxed Execution: Run agents in isolated virtual machines or containers[1]
  2. Human-in-the-Loop (HITL): Require human confirmation for sensitive actions (for example purchases, CAPTCHA interactions)[9]
  3. System Instructions: Custom safety policies to block or require confirmation for high-stakes actions
  4. Access Control: Implement strict permission boundaries and authentication
  5. Content Filtering: Use guardrails to detect and block malicious inputs[7]
  6. Monitoring and Logging: Track all agent actions for audit and forensics
  7. Rate Limiting: Prevent abuse through action frequency restrictions
  8. Allowlists/Blocklists: Control which websites agents can access
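
A minimal sketch of items 2 and 8 above, combining human confirmation for sensitive actions with an allowlist on navigation targets. The hosts, the set of sensitive actions, and the helper name are illustrative assumptions, not any product's policy:

```python
# Gate each proposed action before the client executes it.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"internal.example.com", "docs.example.com"}   # hypothetical allowlist
SENSITIVE_ACTIONS = {"key_combination", "drag_and_drop"}       # example policy


def approve(action_name: str, args: dict) -> bool:
    """Return True only if the proposed action passes both checks."""
    if action_name == "navigate":
        host = urlparse(args["url"]).hostname or ""
        if host not in ALLOWED_HOSTS:
            print(f"blocked: {host} is not on the allowlist")
            return False
    if action_name in SENSITIVE_ACTIONS:
        answer = input(f"Agent wants to run {action_name}({args}). Allow? [y/N] ")
        return answer.strip().lower().startswith("y")
    return True
```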

Ethical Considerations

The deployment of computer-use models raises several ethical concerns:

  • Privacy implications of screen content analysis
  • Potential for unauthorized data access or exfiltration
  • Risk of perpetuating biases in automated decisions
  • Impact on employment in data entry and similar fields
  • Need for transparency in automated actions[22]

Technical Infrastructure

Virtual Network Computing (VNC)

Many computer-use implementations rely on VNC (Virtual Network Computing) protocol for remote desktop access.[23] VNC provides:

  • Platform-independent remote control
  • Remote Frame Buffer (RFB) protocol for screen sharing
  • Pixel-level screen capture capabilities
  • Mouse and keyboard event transmission
  • Support for various encoding methods (Raw, RRE, Hextile, ZRLE, Tight)[24]

Implementation Requirements

Deploying computer-use models typically requires the following (a minimal setup sketch follows the list):[25][9]

  • Execution environment (cloud VM, local container, or sandboxed system)
  • Screenshot capture mechanism
  • Input device emulation (mouse/keyboard control)
  • Client-side action executor (for example Playwright, Selenium)
  • Browser automation runtime
  • Safety monitoring system
  • Logging and audit infrastructure
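
A minimal sketch of these pieces using Playwright as the browser runtime: a sandboxed browser at the recommended 1440×900 viewport, screenshot capture, and a simple audit log. The URL, file name, and log format are illustrative assumptions:

```python
# Launch a browser, capture the screen state for the model, and log the step.
import json, time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)        # run inside a sandboxed VM/container in practice
    page = browser.new_page(viewport={"width": 1440, "height": 900})
    page.goto("https://example.com")

    screenshot = page.screenshot()                     # bytes to send to the model
    with open("audit.log", "a") as log:                # minimal logging/audit trail
        log.write(json.dumps({"t": time.time(), "url": page.url,
                              "screenshot_bytes": len(screenshot)}) + "\n")
    browser.close()
```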

Google provides a reference implementation on GitHub demonstrating setup and example loops for browser control agents.[26]

Future Developments

Research Directions

Current research in computer-use models focuses on:[27]

  • Improving accuracy on complex, multi-step tasks (current best is roughly 38% versus 72% human performance on OSWorld)
  • Reducing latency for real-time interactions
  • Enhancing spatial reasoning capabilities
  • Developing better safety mechanisms against prompt injection
  • Extending support to mobile and embedded systems
  • Desktop OS-level control optimization

Emerging Trends

  • Multimodal Integration: Combining screen understanding with audio and video processing[28]
  • Continual Learning: Models that improve through experience and retain task-specific knowledge[5]
  • Specialized Agents: Domain-specific models optimized for particular industries or applications
  • Federated Learning: Privacy-preserving training across distributed deployments
  • Neuromorphic Computing: Hardware acceleration for more efficient inference

Limitations

Current computer-use models face several technical limitations:[1][2][9]

  • Preview Status: Models are experimental with potential for errors and security vulnerabilities
  • Accuracy Gap: Best models achieve only ~38% success on complex OS tasks versus 72% human performance[14]
  • Spatial Reasoning: Difficulty with precise positioning and complex layouts
  • Dynamic Content: Challenges with animations, videos, and rapidly changing interfaces
  • Security Restrictions: Cannot autonomously handle CAPTCHAs, authentication flows, or payment systems[4]
  • Context Windows: Limited ability to maintain long interaction histories
  • Error Recovery: Difficulty recovering from unexpected states or errors
  • Platform Optimization: Variable performance across different operating systems and environments

Open Source Projects

Several open-source initiatives support computer-use model development, including:

  • OSWorld, the benchmark and execution environment described above[25]
  • Google's computer-use-preview reference implementation for browser control agents[26]
  • ScreenAgent[29]
  • Self-Operating Computer[30]

References

  1. https://www.anthropic.com/news/3-5-models-and-computer-use
  2. https://openai.com/index/computer-using-agent/
  3. https://blog.google/technology/google-deepmind/gemini-computer-use-model/
  4. https://spectrum.ieee.org/ai-agents-computer-use
  5. https://www.simular.ai/articles/agent-s2
  6. https://www.anthropic.com/news/developing-computer-use
  7. https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/computer-use
  8. https://9to5google.com/2025/10/07/gemini-2-5-computer-use-model/
  9. https://ai.google.dev/gemini-api/docs/computer-use
  10. https://proceedings.mlr.press/v235/wang24bn.html
  11. https://encord.com/blog/guide-to-rlhf/
  12. https://www.marktechpost.com/2025/10/08/google-ai-introduces-gemini-2-5-computer-use-preview-a-browser-control-model-to-power-ai-agents-to-interact-with-user-interfaces/
  13. https://www.convergenceindia.org/industry-news/artificial-intelligence/test-scores-of-chatgpts-all-new-computer-using-agent-operator-might-blow-your-minds-119000/
  14. https://os-world.github.io/
  15. https://neurips.cc/virtual/2024/poster/97468
  16. https://www.browserbase.com/blog/evaluating-browser-agents
  17. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
  18. https://www.paloaltonetworks.com/cyberpedia/what-is-a-prompt-injection-attack
  19. https://www.ibm.com/think/topics/prompt-injection
  20. https://www.lakera.ai/blog/guide-to-prompt-injection
  21. https://aws.amazon.com/blogs/security/safeguard-your-generative-ai-workloads-from-prompt-injections/
  22. https://www.salesforce.com/blog/prompt-injection-detection/
  23. https://en.wikipedia.org/wiki/VNC
  24. https://uvnc.com/docs/ultravnc-viewer/71-ultravnc-viewer-gui.html
  25. https://github.com/xlang-ai/OSWorld
  26. https://github.com/google/computer-use-preview
  27. https://github.com/trycua/acu
  28. https://huggingface.co/blog/vlms-2025
  29. https://github.com/niuzaisheng/ScreenAgent
  30. https://www.hyperwriteai.com/self-operating-computer