Computer-use model

A computer-use model is a specialized type of artificial intelligence model that enables autonomous agents to interact with graphical user interfaces (GUIs) by perceiving screen content and executing actions like clicking, typing, and scrolling, similar to how humans use computers.[1][2] These models represent a significant advancement in AI agents, allowing them to control computers through visual understanding rather than programmatic APIs, making them capable of automating complex digital tasks across various applications and operating systems.[3]

Overview

Computer-use models combine vision-language models (VLMs) with reinforcement learning capabilities to understand and interact with computer screens through pixel-level visual processing.[2] Unlike traditional automation approaches that require specific APIs or scripting for each application, computer-use models can control any software that has a graphical interface, using the same visual cues and input methods that humans use.[3] This universal approach makes them particularly valuable for tasks that span multiple applications or require interaction with legacy systems that lack modern APIs.[4]

While many agent systems integrate through structured APIs, a large portion of digital work still happens in GUIs, including form filling, dashboards, and behind-login workflows. Computer-use models address this gap by powering agents that can operate like human users, navigating web pages and applications by clicking, typing, and scrolling.[3]

The fundamental innovation of computer-use models is their ability to translate high-level instructions into low-level computer actions by:

  • Perceiving screen content through screenshot analysis
  • Understanding the spatial layout and purpose of UI elements
  • Generating appropriate mouse and keyboard actions
  • Adapting to dynamic changes in the interface
  • Learning from feedback to improve performance over time[5]

History

The concept of computer-use models emerged as part of the broader development of multimodal large language models (LLMs) capable of processing visual inputs. Early research focused on visual question answering and image captioning, but by 2024, advancements allowed models to actively control UIs.

The first public beta of a computer-use model was introduced by Anthropic on October 22, 2024, with an upgraded Claude 3.5 Sonnet model featuring "computer use" capabilities. This allowed Claude to perceive screens and perform actions like cursor movement and typing.[1][6]

In January 2025, OpenAI introduced its Computer-Using Agent (CUA), which powers the Operator research preview; in July 2025, a preview of the Computer Use tool followed via Azure OpenAI, enabling models to interact with browsers, desktops, and applications across operating systems such as Windows, macOS, and Ubuntu.[2][7]

On October 7, 2025, Google DeepMind announced the Gemini 2.5 Computer Use model, built on Gemini 2.5 Pro, optimized primarily for web browsers and mobile UIs. The model became available through the Gemini API via Google AI Studio and Vertex AI.[3][8]

Technical Architecture

Core Components

Computer-use models typically consist of several integrated components working in an iterative loop (a minimal interface sketch follows the list):[3][9]

  1. Visual Perception Module: Processes screenshots using convolutional neural networks or vision transformers to understand screen content
  2. Language Understanding Module: Interprets user instructions and maintains context using large language models
  3. Action Planning Module: Uses chain-of-thought reasoning to decompose tasks into executable steps[2]
  4. Action Execution Module: Translates high-level decisions into specific UI actions (clicks, keystrokes, scrolls)
  5. Feedback Processing Module: Evaluates action results and adjusts strategy based on observed changes
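
The modular structure above can be sketched as a set of interfaces. The class names, method signatures, and `UIAction` type below are illustrative assumptions for exposition, not any vendor's actual API:

```python
# Hypothetical interfaces for the modules described above; names and
# signatures are illustrative assumptions, not a specific vendor's API.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class UIAction:
    name: str                              # e.g. "click_at"
    args: dict = field(default_factory=dict)


class VisualPerception(Protocol):
    def describe(self, screenshot_png: bytes) -> str:
        """Summarize the screen (elements, layout) from raw pixels."""


class ActionPlanner(Protocol):
    def plan(self, instruction: str, screen: str,
             history: list[UIAction]) -> list[UIAction]:
        """Decompose the instruction into the next executable UI actions."""


class ActionExecutor(Protocol):
    def run(self, action: UIAction) -> bytes:
        """Perform the action and return a fresh screenshot for feedback."""
```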

Agent Loop

At a high level, agents using computer-use models follow a repeated loop:[9]

  1. Send Request: The application invokes the Computer Use tool with the user's goal, a screenshot of the current GUI, the current URL, and optionally recent action history and constraints (for example excluded actions)
  2. Receive Model Response: The model analyzes these inputs and generates a response that typically contains one or more function calls representing UI actions (for example open browser, click, type) and may include a safety decision (for example "requires confirmation")
  3. Execute Actions: Client-side code executes allowed actions, prompting the end user for confirmation when required
  4. Capture New State: After the action has been executed, the client captures a new screenshot of the GUI and the current URL
  5. Send Function Response: The new state is returned to the model as function responses, and the loop repeats from step 2

This process continues until the task is complete, an error occurs, or the loop is terminated by a safety response or user decision. The loop is conceptually similar to function calling with tools, but specialized for GUI manipulation.[9]
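
The loop can be sketched as client-side pseudocode. The `model.generate` call, its response fields, and the `execute` and `capture_state` helpers below are illustrative assumptions standing in for a provider's actual API:

```python
# Minimal client-side agent loop, following the five steps above.
# `model`, `execute`, and `capture_state` are hypothetical stand-ins.
def run_agent(model, execute, capture_state, goal, max_turns=20):
    screenshot, url = capture_state()                 # initial GUI state
    history = []
    for _ in range(max_turns):
        # 1. Send the goal, current screenshot/URL, and recent actions.
        response = model.generate(goal=goal, screenshot=screenshot,
                                  url=url, history=history)
        if not response.function_calls:               # task finished or refused
            return response.text
        for call in response.function_calls:
            # 2-3. Honor safety decisions before executing client-side.
            if call.safety_decision == "requires_confirmation":
                if input(f"Allow {call.name}? [y/N] ").lower() != "y":
                    return "terminated by user"
            execute(call)
            history.append(call)
        # 4-5. Capture the new state and send it back on the next turn.
        screenshot, url = capture_state()
    return "turn limit reached"
```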

Coordinate System

Most computer-use models employ a normalized coordinate system where screen positions are represented on a 1000x1000 grid regardless of actual screen resolution.[9] This approach ensures consistency across different display configurations. The model outputs normalized coordinates that are then converted to actual pixel values by the client implementation:

  • X coordinates: 0-999 (left to right)
  • Y coordinates: 0-999 (top to bottom)
  • Actual pixel position = (normalized_coordinate / 1000) × screen_dimension

Google's documentation recommends a screen size of 1440×900 pixels, though the models work with other resolutions.[9]
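
As an illustration, a small helper that applies this conversion (the 1440×900 defaults mirror the recommendation above; the function itself is an assumption for exposition):

```python
def to_pixels(norm_x: int, norm_y: int,
              width: int = 1440, height: int = 900) -> tuple[int, int]:
    """Map a point on the model's 1000x1000 grid onto real screen pixels."""
    return round(norm_x / 1000 * width), round(norm_y / 1000 * height)

# A click at (500, 250) on the grid lands at (720, 225) on a 1440x900 screen.
print(to_pixels(500, 250))
```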

Training Methodology

Computer-use models are typically trained using a combination of the following methods (a minimal sketch of the supervised stage follows the list):[10]

  1. Supervised Fine-tuning (SFT): Initial training on human demonstrations of UI interactions
  2. Reinforcement Learning (RL): Optimization through trial-and-error with reward signals
  3. Reinforcement Learning from Human Feedback (RLHF): Refinement based on human preferences and corrections[11]
  4. Imitation Learning: Learning from recorded sequences of expert human interactions
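
As an illustration of the supervised stage, behavior cloning on human demonstrations can be sketched as follows. The tiny network, feature sizes, and 13-way action head are placeholder assumptions standing in for a full vision-language backbone:

```python
# Toy supervised fine-tuning step: imitate the action a human demonstrator
# took for a given (screenshot, instruction) pair. Purely illustrative.
import torch
import torch.nn as nn

NUM_ACTIONS = 13                       # e.g. click_at, type_text_at, ...
policy = nn.Sequential(                # stand-in for a VLM backbone + action head
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS))
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

features = torch.randn(32, 512)        # fused screen + instruction embeddings
demo_actions = torch.randint(0, NUM_ACTIONS, (32,))   # demonstrated actions

loss = nn.functional.cross_entropy(policy(features), demo_actions)
loss.backward()                        # standard behavior-cloning update
optimizer.step()
```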

Major Implementations

Google Gemini Computer Use

Released on October 7, 2025, Google DeepMind's Gemini 2.5 Computer Use model is a specialized variant of Gemini 2.5 Pro optimized for browser control.[3] Key features include:

  • Model code: `gemini-2.5-computer-use-preview-10-2025`
  • Optimized for web browser automation, with promising performance on mobile UI control
  • 13 predefined UI actions (click_at, type_text_at, scroll_document, etc.)
  • Built-in safety monitoring with per-step safety service
  • Performance: 70.3% on Online-Mind2Web benchmark, 34.7% on WebVoyager, 70.9% on AndroidWorld[12]
  • Powers Project Mariner, Firebase Testing Agent, and some agentic capabilities in AI Mode in Search[3]
  • Available via Google AI Studio and Vertex AI

Early testers report significant results:

  • Poke.com (AI assistant): "50% faster and better than the next best solutions"[3]
  • Autotab (AI agent): "18% performance increase on hardest evals"[3]
  • Google payments platform: "Successfully rehabilitates over 60% of executions" for failed UI tests[3]

OpenAI Computer-Using Agent (CUA)

OpenAI's Computer-Using Agent (CUA) powers the Operator product and combines GPT-4o's vision capabilities with reinforcement learning.[2] Introduced as a research preview in January 2025 and later made available via Azure OpenAI, it achieves:

  • 58.1% success rate on WebArena benchmark
  • 87% success rate on WebVoyager benchmark
  • 38.1% success rate on OSWorld benchmark
  • Supports cross-platform operation (Windows, macOS, Linux)[13]

Anthropic Claude Computer Use

Anthropic's Claude 3.5 Sonnet was the first frontier AI model to offer computer use capabilities in public beta (October 22, 2024).[1] Features include:

  • Navigates by counting pixels to calculate precise cursor movements
  • 14.9% score on OSWorld (screenshot-only)
  • 22.0% score on OSWorld (with additional steps)
  • Available through API for developer integration
  • Early adopters include Asana, DoorDash, and Replit for multi-step automation[1]

Benchmarks and Evaluation

Performance Comparison on Major Benchmarks
  Model                     OSWorld    WebArena   WebVoyager   Online-Mind2Web   AndroidWorld
  OpenAI CUA                38.1%      58.1%      87%          -                 -
  Gemini 2.5 Computer Use   -          -          34.7%        70.3%             70.9%
  Claude 3.5 Sonnet         14.9-22%   -          -            -                 -
  Human performance         72.36%     -          -            -                 -

OSWorld

OSWorld is a comprehensive benchmark for evaluating multimodal agents on open-ended computer tasks across Ubuntu, Windows, and macOS.[14] The benchmark consists of 369 tasks involving:

  • Real web and desktop applications
  • OS file I/O operations
  • Multi-application workflows
  • Cross-platform compatibility testing[15]

WebArena

WebArena evaluates web browsing agents using self-hosted open-source websites that simulate real-world scenarios in e-commerce, content management systems, and social platforms.[2] It tests abilities including form filling, multi-step navigation, information extraction, and transaction completion.

WebVoyager

WebVoyager tests model performance on live websites including Amazon, GitHub, and Google Maps, evaluating real-world web interaction capabilities.[2]

In a collaboration with Google DeepMind, Browserbase reported Gemini 2.5 Computer Use leading in accuracy, speed, and cost under matched constraints, publishing evaluation traces from thousands of human-verified runs.[16]

Supported Actions

Computer-use models typically support a standardized set of UI actions. Developers must implement the execution logic for these actions on their client-side application:[9]

Common UI Actions Supported by Computer-Use Models
  Action               Description                          Parameters
  open_web_browser     Opens the web browser                None
  click_at             Clicks at specific coordinates       x, y coordinates
  type_text_at         Types text at a location             x, y, text, clear_before_typing, press_enter
  scroll_document      Scrolls the entire page              direction (up/down/left/right)
  scroll_at            Scrolls a specific element/region    x, y, direction, magnitude
  drag_and_drop        Drags an element to a new location   start x, y; destination x, y
  key_combination      Presses keyboard shortcuts           keys (for example "Control+C")
  hover_at             Hovers the mouse at a location       x, y coordinates
  navigate             Goes to a URL                        url
  wait_5_seconds       Pauses execution                     None
  go_back/go_forward   Navigates browser history            None
  search               Goes to the default search engine    None

Developers can also add custom user-defined functions (for example `open_app`, `long_press_at` for mobile) and exclude specific predefined functions to constrain behavior.[9]
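
A minimal sketch of a client-side executor for a few of these actions, using Playwright as the automation runtime. The dispatch structure, the assumed 1440×900 viewport, and the normalized 0-999 coordinate handling are illustrative, not a vendor's reference code:

```python
# Map a handful of predefined actions onto Playwright calls. Unhandled
# actions raise so the agent loop can surface the error.
from playwright.sync_api import Page


def to_px(v: int, dim: int) -> int:
    return round(v / 1000 * dim)         # normalized 0-999 grid -> pixels


def execute_action(page: Page, name: str, args: dict) -> None:
    w, h = 1440, 900                      # assumed viewport size
    if name == "click_at":
        page.mouse.click(to_px(args["x"], w), to_px(args["y"], h))
    elif name == "hover_at":
        page.mouse.move(to_px(args["x"], w), to_px(args["y"], h))
    elif name == "type_text_at":
        page.mouse.click(to_px(args["x"], w), to_px(args["y"], h))  # focus field
        page.keyboard.type(args["text"])
        if args.get("press_enter"):
            page.keyboard.press("Enter")
    elif name == "scroll_document":
        page.mouse.wheel(0, 600 if args["direction"] == "down" else -600)
    elif name == "key_combination":
        page.keyboard.press(args["keys"])             # e.g. "Control+C"
    elif name == "navigate":
        page.goto(args["url"])
    elif name == "go_back":
        page.go_back()
    else:
        raise ValueError(f"unsupported action: {name}")
```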

Applications

Computer-use models have numerous practical applications across industries:[3][5]

Business Automation

  • Data entry and form processing across multiple websites
  • Cross-application workflow automation
  • Report generation from multiple sources
  • Customer service automation
  • Invoice and document processing

Software Development

  • UI testing and quality assurance (Google's payments team rehabilitated over 60% of failed test executions)[3]
  • Automated debugging and cross-browser compatibility testing
  • Accessibility testing
  • Performance monitoring
  • Firebase Testing Agent and similar tools[3]

Research and Analysis

  • Web scraping and data collection (for example gathering product information, prices, and reviews)
  • Competitive intelligence gathering
  • Market research automation
  • Academic research assistance
  • Content aggregation

Personal Productivity

  • Email management and calendar scheduling
  • File organization
  • Online shopping assistance (for example finding "highly rated smart fridges with touchscreen")
  • Social media management
  • Personal assistant applications (Poke.com reports 50% speed improvement)[3]

Safety and Security

Computer-use models introduce unique risks including intentional misuse, unexpected model behavior, and vulnerability to prompt injections and scams. To address these, implementations use layered safety approaches:[3][9]

Built-in Safety

  • Per-step safety service: An out-of-model, inference-time safety service assesses each action before execution
  • Safety decisions: Actions classified as regular/allowed, requires_confirmation, or blocked
  • Training-level safety: Features trained directly into models to avoid harmful actions

Prompt Injection Attacks

Prompt injection represents one of the most significant security risks for computer-use models.[17] These attacks can occur through:

  • Direct injection: Malicious instructions embedded in user input[18]
  • Indirect injection: Hidden commands in external content (web pages, documents)[19]
  • Stored injection: Persistent malicious prompts in training data or memory[20]

Mitigation Strategies

Organizations implementing computer-use models should employ multiple layers of security (a minimal sketch of two of these mitigations follows the list):[21]

  1. Sandboxed Execution: Run agents in isolated virtual machines or containers[1]
  2. Human-in-the-Loop (HITL): Require human confirmation for sensitive actions (for example purchases, CAPTCHA interactions)[9]
  3. System Instructions: Custom safety policies to block or require confirmation for high-stakes actions
  4. Access Control: Implement strict permission boundaries and authentication
  5. Content Filtering: Use guardrails to detect and block malicious inputs[7]
  6. Monitoring and Logging: Track all agent actions for audit and forensics
  7. Rate Limiting: Prevent abuse through action frequency restrictions
  8. Allowlists/Blocklists: Control which websites agents can access
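
A minimal sketch of items 2 and 8 above, combining human confirmation for sensitive actions with an allowlist on navigation targets. The hosts, the set of sensitive actions, and the helper name are illustrative assumptions, not any product's policy:

```python
# Gate each proposed action before the client executes it.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"internal.example.com", "docs.example.com"}   # hypothetical allowlist
SENSITIVE_ACTIONS = {"key_combination", "drag_and_drop"}       # example policy


def approve(action_name: str, args: dict) -> bool:
    """Return True only if the proposed action passes both checks."""
    if action_name == "navigate":
        host = urlparse(args["url"]).hostname or ""
        if host not in ALLOWED_HOSTS:
            print(f"blocked: {host} is not on the allowlist")
            return False
    if action_name in SENSITIVE_ACTIONS:
        answer = input(f"Agent wants to run {action_name}({args}). Allow? [y/N] ")
        return answer.strip().lower().startswith("y")
    return True
```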

Ethical Considerations

The deployment of computer-use models raises several ethical concerns:

  • Privacy implications of screen content analysis
  • Potential for unauthorized data access or exfiltration
  • Risk of perpetuating biases in automated decisions
  • Impact on employment in data entry and similar fields
  • Need for transparency in automated actions[22]

Technical Infrastructure

Virtual Network Computing (VNC)

Many computer-use implementations rely on VNC (Virtual Network Computing) protocol for remote desktop access.[23] VNC provides:

  • Platform-independent remote control
  • Remote Frame Buffer (RFB) protocol for screen sharing
  • Pixel-level screen capture capabilities
  • Mouse and keyboard event transmission
  • Support for various encoding methods (Raw, RRE, Hextile, ZRLE, Tight)[24]

Implementation Requirements

Deploying computer-use models typically requires the following (a minimal setup sketch follows the list):[25][9]

  • Execution environment (cloud VM, local container, or sandboxed system)
  • Screenshot capture mechanism
  • Input device emulation (mouse/keyboard control)
  • Client-side action executor (for example Playwright, Selenium)
  • Browser automation runtime
  • Safety monitoring system
  • Logging and audit infrastructure
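
A minimal sketch of these pieces using Playwright as the browser runtime: a sandboxed browser at the recommended 1440×900 viewport, screenshot capture, and a simple audit log. The URL, file name, and log format are illustrative assumptions:

```python
# Launch a browser, capture the screen state for the model, and log the step.
import json, time
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)        # run inside a sandboxed VM/container in practice
    page = browser.new_page(viewport={"width": 1440, "height": 900})
    page.goto("https://example.com")

    screenshot = page.screenshot()                     # bytes to send to the model
    with open("audit.log", "a") as log:                # minimal logging/audit trail
        log.write(json.dumps({"t": time.time(), "url": page.url,
                              "screenshot_bytes": len(screenshot)}) + "\n")
    browser.close()
```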

Google provides a reference implementation on GitHub demonstrating setup and example loops for browser control agents.[26]

Future Developments

Research Directions

Current research in computer-use models focuses on:[27]

  • Improving accuracy on complex, multi-step tasks (current best is roughly 38% versus 72% human performance on OSWorld)
  • Reducing latency for real-time interactions
  • Enhancing spatial reasoning capabilities
  • Developing better safety mechanisms against prompt injection
  • Extending support to mobile and embedded systems
  • Desktop OS-level control optimization

Emerging Trends

  • Multimodal Integration: Combining screen understanding with audio and video processing[28]
  • Continual Learning: Models that improve through experience and retain task-specific knowledge[5]
  • Specialized Agents: Domain-specific models optimized for particular industries or applications
  • Federated Learning: Privacy-preserving training across distributed deployments
  • Neuromorphic Computing: Hardware acceleration for more efficient inference

Limitations

Current computer-use models face several technical limitations:[1][2][9]

  • Preview Status: Models are experimental with potential for errors and security vulnerabilities
  • Accuracy Gap: Best models achieve only ~38% success on complex OS tasks versus 72% human performance[14]
  • Spatial Reasoning: Difficulty with precise positioning and complex layouts
  • Dynamic Content: Challenges with animations, videos, and rapidly changing interfaces
  • Security Restrictions: Cannot autonomously handle CAPTCHAs, authentication flows, or payment systems[4]
  • Context Windows: Limited ability to maintain long interaction histories
  • Error Recovery: Difficulty recovering from unexpected states or errors
  • Platform Optimization: Variable performance across different operating systems and environments

Open Source Projects

Several open-source initiatives support computer-use model development, including:

  • OSWorld, the benchmark and execution environment described above[25]
  • Google's computer-use-preview reference implementation for browser control agents[26]
  • ScreenAgent[29]
  • Self-Operating Computer[30]

References

  1. https://www.anthropic.com/news/3-5-models-and-computer-use
  2. https://openai.com/index/computer-using-agent/
  3. https://blog.google/technology/google-deepmind/gemini-computer-use-model/
  4. https://spectrum.ieee.org/ai-agents-computer-use
  5. https://www.simular.ai/articles/agent-s2
  6. https://www.anthropic.com/news/developing-computer-use
  7. https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/computer-use
  8. https://9to5google.com/2025/10/07/gemini-2-5-computer-use-model/
  9. https://ai.google.dev/gemini-api/docs/computer-use
  10. https://proceedings.mlr.press/v235/wang24bn.html
  11. https://encord.com/blog/guide-to-rlhf/
  12. https://www.marktechpost.com/2025/10/08/google-ai-introduces-gemini-2-5-computer-use-preview-a-browser-control-model-to-power-ai-agents-to-interact-with-user-interfaces/
  13. https://www.convergenceindia.org/industry-news/artificial-intelligence/test-scores-of-chatgpts-all-new-computer-using-agent-operator-might-blow-your-minds-119000/
  14. https://os-world.github.io/
  15. https://neurips.cc/virtual/2024/poster/97468
  16. https://www.browserbase.com/blog/evaluating-browser-agents
  17. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
  18. https://www.paloaltonetworks.com/cyberpedia/what-is-a-prompt-injection-attack
  19. https://www.ibm.com/think/topics/prompt-injection
  20. https://www.lakera.ai/blog/guide-to-prompt-injection
  21. https://aws.amazon.com/blogs/security/safeguard-your-generative-ai-workloads-from-prompt-injections/
  22. https://www.salesforce.com/blog/prompt-injection-detection/
  23. https://en.wikipedia.org/wiki/VNC
  24. https://uvnc.com/docs/ultravnc-viewer/71-ultravnc-viewer-gui.html
  25. https://github.com/xlang-ai/OSWorld
  26. https://github.com/google/computer-use-preview
  27. https://github.com/trycua/acu
  28. https://huggingface.co/blog/vlms-2025
  29. https://github.com/niuzaisheng/ScreenAgent
  30. https://www.hyperwriteai.com/self-operating-computer