Computer-use agent

A computer-use agent (CUA) is a type of software agent in artificial intelligence that performs tasks by directly operating a general-purpose computer's graphical user interface (GUI) the way a human does: by "seeing" the screen, moving a cursor, clicking, typing, and interacting with windows and applications.[1] Unlike tool-calling approaches that rely on predefined APIs, CUAs aim to generalize across arbitrary software by treating the computer itself as the universal interface.[2] They combine computer vision, natural language processing, and reinforcement learning to handle open-ended tasks.[3]

CUAs typically combine a large language model (LLM) with computer vision and an action executor (for example, a virtual machine or a remote desktop session), enabling end-to-end perception, reasoning, and control loops.[4][5] Early deployments are experimental and can be error-prone, but rapid progress since 2022 has made "computer use" a central paradigm in building autonomous agents for software workflows.[1][6]

Terminology

  • Computer-use agent (CUA) - a general term for agents that operate computers via on-screen interaction
  • GUI agent - emphasizes interaction with graphical user interfaces
  • Desktop agent - focuses on desktop environment automation
  • Self-operating computer (SOC) - popularized by an open-source framework that allows a multimodal model to observe pixels and emit mouse/keyboard actions[7]
  • Computer-Using Agent - OpenAI's specific terminology for their implementation[2]

Core Characteristics

  • Universal interface: Operates any on-screen software (within OS and permission limits) rather than only API-integrated tools[2]
  • Perception-action loop: Iteratively reads screen state (image + text), plans, and executes actions such as click(x,y) and type("text") until the goal is reached or a stop condition fires (see the action sketch after this list)[4]
  • Multimodal reasoning: Combines text-based planning with visual grounding to locate and manipulate UI elements[8]
  • Environment abstraction: Runs inside controlled sandboxes such as virtual machines, remote desktops, or browser emulators to improve safety and reproducibility[5][6]
  • Agentic autonomy: Can chain steps across applications (for example web browser + spreadsheet + email) with minimal human intervention, while remaining subject to guardrails and approvals[1]
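The low-level action vocabulary behind this loop is small. As a minimal sketch (the type names below are illustrative, not any vendor's schema), the actions an executor must support can be modeled as plain data:

```python
from dataclasses import dataclass

# Illustrative action types for a perception-action loop; real SDKs
# (Anthropic's and OpenAI's computer-use tools) define their own schemas.

@dataclass
class Click:
    x: int  # screen coordinates in pixels
    y: int

@dataclass
class Type:
    text: str

@dataclass
class Scroll:
    dx: int  # horizontal scroll amount
    dy: int  # vertical scroll amount

@dataclass
class Done:
    reason: str  # why the agent believes the goal is reached (stop condition)

Action = Click | Type | Scroll | Done  # union syntax requires Python 3.10+
```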

Technical Architecture

Core Components

Typical CUAs include the following components (sketched as minimal interfaces after the list):

  1. Perception - captures screen frames and auxiliary signals (window hierarchy, OCR). The vision component uses computer vision and multimodal AI models to interpret screenshots and identify interactive elements such as buttons, text fields, menus, and other GUI components[4]
  2. Reasoning and planning - uses an LLM (sometimes with tool-use memory or reinforcement learning fine-tuning) to decide the next high-level step. This component breaks down complex instructions into actionable steps and maintains context across interactions[8]
  3. Grounding - maps plan tokens to concrete UI targets (coordinates, elements, shortcuts)[9]
  4. Action executor - sends clicks, keystrokes, scrolls, and window commands to the target environment. Translates decisions into specific computer interactions at precise coordinates[4]
  5. Monitoring and recovery - detects failure modes (pop-ups, navigation drift) and triggers retries, backtracking, or human handoff[5]
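Under stated assumptions (the interface names below are illustrative, not a published API), this five-part decomposition maps onto a set of narrow interfaces:

```python
from typing import Protocol

class Perception(Protocol):
    def observe(self) -> bytes: ...  # screenshot frame, e.g. PNG bytes

class Planner(Protocol):
    def next_step(self, goal: str, observation: bytes) -> str: ...  # LLM plan

class Grounder(Protocol):
    def locate(self, step: str, observation: bytes) -> tuple[int, int]: ...  # UI target coords

class Executor(Protocol):
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...

class Monitor(Protocol):
    def healthy(self, observation: bytes) -> bool: ...  # detect pop-ups, navigation drift
```

Keeping grounding separate from planning makes it possible to swap a pure-vision grounder for a DOM- or accessibility-tree-based one without touching the planner, which is the design axis behind the implementation approaches below.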

Implementation Approaches

Comparison of CUA Implementation Methods
Approach | Description | Advantages | Limitations
Pure Vision | Relies solely on visual interpretation of screen pixels | Platform-agnostic; works with any GUI | May struggle with complex layouts
DOM-Enhanced | Combines vision with web-page structure analysis | Higher accuracy for web tasks | Limited to browser environments
Hybrid Systems | Integrates multiple data sources, including OS APIs | Most accurate and reliable | Requires platform-specific implementations
Container-Based | Runs in isolated virtual environments | Enhanced security and scalability | Additional infrastructure overhead
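To make the distinction between the Pure Vision and DOM-Enhanced rows concrete, the sketch below resolves a click target from page structure using Playwright (assuming it is installed); a pure-vision agent would instead ask a multimodal model to output the (x, y) coordinates directly from a screenshot:

```python
from playwright.sync_api import sync_playwright

# DOM-enhanced grounding: derive click coordinates from the page's
# structure rather than from a vision model's pixel estimate.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    box = page.locator("a").first.bounding_box()  # element geometry from the DOM
    if box:
        # Click the element's center, the same (x, y) action a
        # pure-vision agent would have to estimate from pixels.
        page.mouse.click(box["x"] + box["width"] / 2,
                         box["y"] + box["height"] / 2)
    browser.close()
```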

History

Year | Milestone
2022 | Adept introduces ACT-1, a transformer trained to use digital tools via a Chrome extension, an early demonstration of end-to-end GUI action from model outputs[10]
November 2023 | The open-source Self-Operating Computer framework by OthersideAI shows a multimodal model operating a desktop using the same inputs/outputs as a human (pixels and mouse/keyboard)[7]
2024 | Frameworks such as LaVague and Skyvern emerge, combining LLMs with vision for web-agent automation[11]
October 22, 2024 | Anthropic publicly announces "computer use" in beta for Claude 3.5 Sonnet, enabling on-screen control (look, move cursor, click, type) via API, marking the first major commercial implementation[1]
January 23, 2025 | OpenAI publishes a formal description of a Computer-Using Agent and provides a documented Computer Use tool that runs a continuous observe-plan-act loop, introduced as part of the "Operator" research preview[2][4]
February 24, 2025 | Anthropic releases Claude 3.7 Sonnet with improved computer use capabilities and an extended thinking mode[12]
March 2025 | Azure OpenAI documents "Computer Use (preview)" for building agents that interact with computer UIs; major cloud providers publish prescriptive guidance patterns[6][5]
March 2025 | Simular AI releases Agent S2, an open-source modular framework outperforming proprietary CUAs on benchmarks such as OSWorld[13]
September 2025 | Anthropic releases Claude Sonnet 4.5, achieving a state-of-the-art 61.4% success rate on the OSWorld benchmark and 77.2% on SWE-bench Verified[14]

Functionality

CUAs operate through an iterative loop of perception, reasoning, and action.

Interaction Model

Many implementations expose a loop in which the agent:

  1. Observes the current screen through screenshot capture or pixel analysis
  2. Proposes an action based on visual interpretation and task context
  3. Executes the action via simulated mouse/keyboard input
  4. Receives updated observations and tool feedback
  5. Repeats until task completion or failure[4]

Process Flow

  1. Perception: The agent receives screenshots or raw pixel data to analyze the current screen state, using computer vision to identify elements like buttons, text fields, and menus[2][15]
  2. Reasoning: Leveraging LLMs, the agent plans actions based on user instructions, past context, and self-correction for errors. Techniques like chain-of-thought prompting enable adaptation to dynamic interfaces[3][2]
  3. Action: The agent emulates inputs via virtual mouse (for example pixel-based cursor movement) and keyboard, performing clicks, types, scrolls, or drags[1]

Public SDKs document low-level actions such as click(x,y), type(text), and clipboard/file operations, executed by a host process controlling a VM or remote session. This loop allows CUAs to handle tasks requiring dozens of steps, such as form filling or software testing.[16] Limitations include challenges with scrolling, zooming, and short-lived UI elements due to screenshot-based (non-video) perception.[15]
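A minimal driver for this loop might look like the following sketch; capture_screen, plan_next_action, send_click, and send_keys are hypothetical placeholders for a screenshot backend, a multimodal LLM call, and an OS input driver, not any specific SDK:

```python
# Hypothetical placeholders: a real agent wires these to a screenshot
# backend, a multimodal LLM, and an OS input driver.
def capture_screen() -> bytes: ...
def plan_next_action(goal: str, screenshot: bytes): ...
def send_click(x: int, y: int) -> None: ...
def send_keys(text: str) -> None: ...

def run_agent(goal: str, max_steps: int = 50) -> bool:
    """Observe-plan-act loop; see the five numbered steps above."""
    for _ in range(max_steps):
        screenshot = capture_screen()                # 1. observe current screen
        action = plan_next_action(goal, screenshot)  # 2. propose next action
        if action.kind == "done":                    # stop condition reached
            return True
        if action.kind == "click":                   # 3. execute via input driver
            send_click(action.x, action.y)
        elif action.kind == "type":
            send_keys(action.text)
        # 4.-5. loop: the next iteration re-observes the updated screen
    return False  # step budget exhausted: fail or hand off to a human
```

The step budget doubles as a failure stop condition; production implementations layer retries, backtracking, and human handoff on top of it.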

Major Implementations

Anthropic Claude Computer Use

Anthropic released computer use capabilities in beta with Claude 3.5 Sonnet in October 2024, allowing developers to direct Claude to use computers through the Anthropic API.[1] The implementation includes specialized tools:

  • computer_20241022 - Original computer tool for Claude 3.5 Sonnet
  • computer_20250124 - Enhanced version with additional features for Claude 4
  • bash_20241022 - Command line interface tool
  • text_editor_20241022 - Text editing capabilities[17]
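A minimal request against this tool set, sketched with the Anthropic Python SDK as documented for the beta (exact parameters may differ across SDK versions):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",   # the original computer tool
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open the calculator app."}],
    betas=["computer-use-2024-10-22"],  # beta flag matching the tool version
)

# The model answers with tool_use blocks (screenshot, mouse_move, click, ...);
# the caller executes each one in a sandboxed VM and returns the result.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```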

Training focused on simple software such as calculators and text editors, with restricted internet access for safety. Anthropic's research emphasized teaching the model to count pixels accurately so it can target cursor movements, generalizing from a limited set of training examples.[15] Early adopters included companies such as Asana, Canva, and DoorDash, using it for multi-step automation.[1]

Claude Sonnet 4.5, released in September 2025, represents the current state-of-the-art with a 61.4% success rate on OSWorld benchmark, a significant improvement from the 14.9% achieved by the October 2024 version.[14]

OpenAI Computer-Using Agent

OpenAI introduced the Computer-Using Agent (CUA) in January 2025 as part of its "Operator" research preview, built on GPT-4o's vision capabilities with advanced reasoning.[2][18] The CUA model achieves:

  • 38.1% success rate on OSWorld benchmark for full computer use
  • 58.1% success rate on WebArena
  • 87% success rate on WebVoyager for web-based tasks[2]

The implementation uses reinforcement learning for reasoning and handles GUI interactions via screenshots. It is integrated into Operator and requires user confirmations for sensitive actions.[18]
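A corresponding request sketch against OpenAI's documented Computer Use tool in the Responses API (tool and model names per OpenAI's guide; details may change while the feature is in preview):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser",  # or "windows", "mac", "ubuntu"
    }],
    input="Search for the latest weather in Paris.",
    truncation="auto",  # required for computer use per the docs
)

# The response contains computer_call items (click, type, screenshot, ...)
# that the caller executes in its own sandbox, returning screenshots back.
for item in response.output:
    if item.type == "computer_call":
        print(item.action)
```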

Microsoft Azure Computer-Using Agent

Microsoft announced the Computer-Using Agent capabilities in Azure AI Foundry in March 2025, featuring integration with the Responses API. The implementation focuses on enterprise integration with Windows 365 and Azure Virtual Desktop.[6]

Open-Source Implementations

Framework | Description | Release Date | Key Features
Self-Operating Computer | Vision-based computer control | November 2023 | Screenshot analysis, basic automation, multimodal control[7]
OpenInterpreter | General-purpose control with Python | 2024 | Extensible, LLM integration[11]
Agent S2 | Modular framework for GUIs | March 2025 | Hierarchical planning, 34.5% OSWorld score[13]
LaVague | Web agent framework | 2024 | Modular architecture, vision + LLMs[11]
Skyvern | Browser workflow automation | 2024 | HTML extraction, task automation[11]
Cua Framework | Containerized environments for CUAs | 2025 | Docker-like deployment, OS virtualization[19]
Browser-Use | Web-specific agent | 2025 | 89.1% WebVoyager success rate, DOM + vision[20]
UFO Agents | Windows-specific control | 2025 | Windows API integration, enhanced accuracy[21]
AutoGen | Distributed agent framework | 2024 | Multi-agent coordination[11]
NatBot | Browser-specific automation | 2024 | GPT-4 Vision integration[11]

Performance Benchmarks

Researchers have proposed interactive benchmarks to evaluate CUAs in realistic settings.

OSWorld

OSWorld is a comprehensive benchmark for evaluating multimodal agents in real computer environments across Ubuntu, Windows, and macOS. It includes 369 tasks involving real web and desktop applications, file I/O operations, and cross-application workflows.[9]

OSWorld Performance Results (as of September 2025)
Model | Success Rate | Multi-step Score | Notes
Human Performance | 72.4% | N/A | Baseline human capability
Claude Sonnet 4.5 | 61.4% | N/A | Current state of the art (September 2025)[14]
OpenAI CUA | 38.1% | N/A | January 2025 release[2]
Agent S2 | 34.5% | N/A | 50-step configuration[13]
Claude 3.5 Sonnet | 14.9% | 22.0% | October 2024 version[1]
Previous best (2024) | 12.0% | N/A | Prior to CUA models

OSWorld-Human

OSWorld-Human augments OSWorld with human-annotated optimal trajectories for each task. Across 16 agents tested, even the best took 1.4–2.7× the human step count on average, indicating significant efficiency gaps.[22]

WebArena

WebArena evaluates web browsing agents using self-hosted open-source websites that simulate real-world scenarios in e-commerce, content management systems, and social platforms. It tests complex, multi-step web interactions offline.[23]

  • OpenAI CUA: 58.1% success rate[2]

WebVoyager

WebVoyager tests agent performance on live websites including Amazon, GitHub, and Google Maps, evaluating real-world web navigation and task completion capabilities. The benchmark includes 586 diverse web tasks.[24]

  • Browser-Use: 89.1% success rate[20]
  • OpenAI CUA: 87% success rate[2]

macOSWorld

macOSWorld introduces the first comprehensive macOS benchmark with 202+ multilingual interactive tasks. It reports distinct performance tiers with >30% success for some proprietary CUAs in its evaluations.[25]

AndroidWorld

AndroidWorld evaluates mobile GUI tasks:

  • Agent S2: 50% success rate[13]

Applications

CUAs automate repetitive tasks in various domains.

Enterprise Automation

Software Development

  • Automated code generation and debugging
  • Application testing across different environments
  • Continuous integration and deployment processes
  • Documentation generation and maintenance
  • Developer tooling inside isolated environments[6][27]

Office Workflows

  • File handling and organization
  • Spreadsheet operations and data manipulation
  • Calendar management and scheduling
  • Email processing and response automation[1]

Research and Analysis

  • Web research and information gathering
  • Data analysis across multiple sources
  • Report generation from various applications
  • Competitive intelligence gathering
  • Data migration and RPA-style tasks with free-form reasoning[5]

Accessibility

  • Assisting users with disabilities in GUI navigation
  • Automating repetitive tasks for users with motor impairments
  • Providing alternative interaction methods for standard software
  • Accessibility augmentation by translating natural-language intents into GUI actions[8][2]

Companies like DoorDash use CUAs for internal processes requiring hundreds of steps, while Replit uses Anthropic's tool for code evaluation.[1]

Security Considerations

Prompt Injection Vulnerabilities

CUAs are susceptible to prompt injection attacks where malicious instructions embedded in content can override intended behavior. This vulnerability is particularly concerning as CUAs can execute actions on behalf of users.[17]

Types of Prompt Injection

  • Direct injection: Malicious commands entered directly by users
  • Indirect injection: Hidden instructions in external content like websites, documents, or images
  • Cross-modal attacks: Exploiting interactions between different input modalities in multimodal systems[28]

Mitigation Strategies

Security Mitigation Approaches for CUAs
Strategy | Description | Effectiveness
Containerization | Run CUAs in isolated virtual machines or Docker containers | High for system isolation
Least Privilege | Restrict CUA access to the minimum necessary resources | Medium-high for damage limitation
Human Oversight | Require approval for sensitive operations | High for critical actions
Input Validation | Filter and sanitize user inputs and external content | Medium; not foolproof
Monitoring | Track CUA actions and detect anomalous behavior | High for incident response
Classifiers | Detect harmful content and restrict actions | Medium-high for known threats
Blocklists | Prevent access to sensitive domains/applications | High for defined restrictions
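The containerization row can be made concrete with a locked-down launch. A sketch using standard Docker hardening flags; the image name and agent command are placeholders:

```python
import subprocess

# Launch the agent inside an isolated, locked-down container.
subprocess.run([
    "docker", "run", "--rm",
    "--network", "none",                      # no network access from the sandbox
    "--read-only",                            # immutable root filesystem
    "--cap-drop", "ALL",                      # drop all Linux capabilities
    "--memory", "2g", "--pids-limit", "256",  # resource limits
    "cua-sandbox:latest",                     # hypothetical agent image
    "python", "agent.py",                     # hypothetical entry point
], check=True)
```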

Best Practices

Organizations deploying CUAs should:[29][15]

  1. Use dedicated virtual environments with minimal privileges
  2. Avoid providing access to sensitive accounts or data
  3. Implement strict approval workflows for high-stakes operations
  4. Conduct regular security audits and penetration testing
  5. Train users on CUA risks and safe usage
  6. Layer multiple security controls (defense in depth)
  7. Instrument logging and telemetry to support audits and incident response
  8. Publish privacy policies and deployment documentation that emphasize consent
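Items 3 and 7 (approval workflows and audit logging) can be combined in a thin wrapper around the action executor; a sketch with hypothetical names:

```python
import json
import logging
import time

logging.basicConfig(filename="cua_audit.log", level=logging.INFO)

SENSITIVE = {"send_email", "submit_payment", "delete_file"}  # illustrative set

def execute_with_guardrails(action: dict, dispatch) -> None:
    """Log every proposed action; require human approval for sensitive ones."""
    logging.info(json.dumps({"ts": time.time(), "proposed": action}))
    if action["name"] in SENSITIVE:
        answer = input(f"Approve {action['name']}({action.get('args')})? [y/N] ")
        if answer.strip().lower() != "y":
            logging.info(json.dumps({"ts": time.time(), "denied": action["name"]}))
            return
    dispatch(action)  # hypothetical: forwards to the real executor
    logging.info(json.dumps({"ts": time.time(), "executed": action["name"]}))
```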

Anthropic implements classifiers to detect harm, restrictions on election-related tasks, and ASL-2 compliance.[15] OpenAI includes refusals for harmful tasks, blocklists, user confirmations, and evaluations against frontier risks like autonomous replication.[2]

Limitations and Challenges

Independent evaluations and benchmark studies report that state-of-the-art CUAs still struggle with robust GUI grounding, long-horizon plans, and operational knowledge of unfamiliar applications.[9][25][22]

Performance Limitations

  • Latency: Current implementations are slower than human-directed actions, limiting real-time applications[17]
  • Accuracy: Vision-based coordinate detection can be unreliable, especially with complex interfaces
  • Context limitations: Token limits restrict the amount of information agents can process simultaneously
  • Efficiency: Even high-performing agents often take more steps than necessary compared with humans[22]

Technical Challenges

  • Difficulty handling dynamic and short-lived UI elements (for example pop-ups, date pickers, dropdown menus), as well as scrolling and zooming, because perception is based on static screenshots rather than video[15]
  • Struggles with CAPTCHAs and anti-automation measures
  • Limited ability to maintain context across multiple applications
  • Challenges with non-standard or custom UI elements[30]

Reliability Issues

  • Success rates remain below human performance (61.4% vs 72.4% on OSWorld for best models)[14]
  • Susceptibility to errors when interfaces change unexpectedly
  • Potential for cascading failures in multi-step processes[2]
  • Dependency on high-quality vision models and potential for cumulative errors in long tasks[31]

Critics also argue that agent-driven interfaces can confuse users without offering clear benefits over traditional tools.[31]

Future Developments

Technical Roadmap

Industry leaders have outlined several advancement areas:

  • Enhanced multimodal models with better visual understanding
  • Improved reasoning capabilities for complex task planning
  • Integration with specialized APIs when available for hybrid approaches
  • Development of CUA-specific training datasets and benchmarks[18]
  • Video stream perception to replace static screenshots
  • Better mobile support and cross-platform compatibility
  • Integration with agentic AI for collaborative workflows[3]

Standardization Efforts

The llms.txt proposal suggests a standardized format for websites to provide AI-readable information, potentially improving CUA reliability while maintaining human usability.[21] This would allow websites to expose structured data specifically for AI consumption.
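An illustrative llms.txt following the proposal's markdown conventions (the site name and URLs are placeholders):

```
# Example Store
> Concise, plain-language summary of the site for AI agents.

## Docs
- [Product API](https://example.com/api.md): endpoints for search and checkout
- [Returns policy](https://example.com/returns.md): conditions and deadlines

## Optional
- [Company history](https://example.com/about.md)
```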

Integration with Existing Systems

Future developments include:

  • Native OS integration for improved performance
  • Combination with traditional RPA for deterministic tasks
  • Enhanced security frameworks specifically designed for CUAs
  • Standardized evaluation metrics and benchmarks[6]

Open-source efforts like Agent S2 emphasize modularity for scalability.[13] By mid-2025, CUAs are seen as foundational for "agentic coworkers."[3]

Impact and Implications

Economic Impact

Organizations implementing CUAs report significant operational improvements:

  • 30-50% reduction in manual workload
  • 65% faster data processing times
  • 35% increase in customer retention through improved service
  • Significant cost savings from reduced human error[26]

Workforce Implications

CUAs are reshaping workplace dynamics by:

  • Automating routine digital tasks
  • Enabling workers to focus on higher-value activities
  • Creating new roles in CUA management and oversight
  • Requiring new skills in AI collaboration and supervision[21]

Ethical Considerations

The deployment of CUAs raises important ethical questions:

  • Privacy concerns regarding automated data access and screen content reading
  • Accountability for agent-initiated actions
  • Potential for misuse in surveillance or unauthorized access
  • Need for transparent disclosure when CUAs interact with humans[32]
  • Potential for fraud, cybersecurity vulnerabilities, and malicious use[33]
  • Job displacement in automation-heavy fields[34]

See Also

References

  1. Anthropic (22 October 2024). "Introducing computer use, a new Claude 3.5 Sonnet, and more." https://www.anthropic.com/news/3-5-models-and-computer-use
  2. OpenAI (23 January 2025). "Computer-Using Agent." https://openai.com/index/computer-using-agent/
  3. a16z (28 August 2025). "The Rise of Computer Use and Agentic Coworkers." https://a16z.com/the-rise-of-computer-use-and-agentic-coworkers/
  4. OpenAI Platform Docs. "Computer use." https://platform.openai.com/docs/guides/tools-computer-use
  5. Amazon Web Services (2025). "Computer-use agents - AWS Prescriptive Guidance." https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/computer-use-agents.html
  6. Microsoft Learn (25 March 2025). "Announcing the Responses API and Computer-Using Agent in Azure AI Foundry." https://azure.microsoft.com/en-us/blog/announcing-the-responses-api-and-computer-using-agent-in-azure-ai-foundry/
  7. OthersideAI (2023). "Self-Operating Computer Framework." https://github.com/OthersideAI/self-operating-computer
  8. Guo, Y. et al. (2025). "GUI Agents with Foundation Models: A Comprehensive Survey." arXiv:2411.04890. https://arxiv.org/abs/2411.04890 ; https://openreview.net/forum?id=CzMnCp6TFl
  9. OSWorld (2024). "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." https://os-world.github.io/ and https://github.com/xlang-ai/OSWorld
  10. Adept (14 September 2022). "ACT-1: Transformer for Actions." https://www.adept.ai/blog/act-1
  11. trycua (GitHub). "acu: A curated list of resources about AI agents for Computer Use." https://github.com/trycua/acu
  12. Anthropic (24 February 2025). "Claude 3.7 Sonnet and Claude Code." https://www.anthropic.com/news/claude-3-7-sonnet
  13. Simular AI (12 March 2025). "Agent S2 - Open, Modular, and Scalable Framework for Computer Use Agents." https://www.simular.ai/articles/agent-s2
  14. Anthropic (29 September 2025). "Claude Sonnet 4.5." https://www.anthropic.com/claude/sonnet
  15. Anthropic (2024). "Developing a computer use model." https://www.anthropic.com/news/developing-computer-use
  16. Labellerr (5 March 2025). "Computer Use Agent: Guide to Functionality & Benefits." https://www.labellerr.com/blog/computer-use-agent-guide-to-functionality-benefits/
  17. Anthropic (2025). "Computer use (beta)." Anthropic Documentation. https://docs.anthropic.com/en/docs/build-with-claude/computer-use
  18. OpenAI (2025). "Introducing Operator." https://openai.com/index/introducing-operator/
  19. trycua (2025). "cua: Open-source infrastructure for Computer-Use Agents." GitHub. https://github.com/trycua/cua
  20. Browser Use (2025). "Browser Use = state of the art Web Agent." https://browser-use.com/posts/sota-technical-report
  21. Microsoft Tech Community (2025). "Computer Use Agents (CUAs) for Enhanced Automation." https://techcommunity.microsoft.com/blog/aiplatformblog/the-future-of-ai-computer-use-agents-have-arrived/4401025
  22. Abhyankar, R. et al. (2025). "Benchmarking the Efficiency of Computer-Use Agents." arXiv:2506.16042. https://arxiv.org/abs/2506.16042
  23. WebArena (2024). "WebArena: A Realistic Web Environment for Building Autonomous Agents."
  24. WebVoyager (2024). "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models." arXiv:2401.13919. https://arxiv.org/abs/2401.13919
  25. Yang, P. et al. (2025). "macOSWorld: A Multilingual Interactive Benchmark for GUI Agents." arXiv:2506.04135. https://arxiv.org/abs/2506.04135
  26. Rapid Innovation (2025). "A Detailed Guide to Computer Using Agent (CUA) Models." Medium. https://medium.com/@rapidinnovation/a-detailed-guide-to-computer-using-agent-cua-models-41dcbf864552
  27. OpenAI (2025). "openai-cua-sample-app." GitHub. https://github.com/openai/openai-cua-sample-app
  28. OWASP (2025). "LLM01:2025 Prompt Injection." OWASP Gen AI Security Project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
  29. IBM (2025). "Protect Against Prompt Injection." IBM Think. https://www.ibm.com/think/insights/prevent-prompt-injection
  30. ZBrain (2025). "Computer-using agent (CUA) models." https://zbrain.ai/cua-models/
  31. Understanding AI (26 June 2025). "Computer-use agents seem like a dead end." https://www.understandingai.org/p/computer-use-agents-seem-like-a-dead
  32. Anthropic (2024). "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." https://www.anthropic.com/news/3-5-models-and-computer-use
  33. Push Security (28 January 2025). "How Computer-Using Agents can be leveraged in cyber attacks." https://pushsecurity.com/blog/considering-the-impact-of-computer-using-agents/
  34. IEEE Spectrum (13 February 2025). "Are You Ready to Let an AI Agent Use Your Computer?" https://spectrum.ieee.org/ai-agents-computer-use
