Browser-use agent
A Browser-Use Agent (BUA), also known as an autonomous web agent or LLM-based browser agent, is an artificial intelligence software agent that operates a standard web browser through its graphical user interface (GUI) to accomplish goals that users specify in natural language.[1][2] Unlike web scraping, API-based integrations, or automation scripts that follow predefined rules, a BUA combines the reasoning of a large language model (LLM) with browser automation technology: it perceives web page content, plans a sequence of actions, and executes them to complete multi-step tasks across diverse websites without bespoke integration.[3][4]
These agents are a step toward general-purpose digital assistants that can handle real-world web tasks, such as booking travel, shopping online, running detailed information searches, or entering data, without human intervention at each step.[5]
Terminology and Definition
The term "browser-use agent" is used in research and industry to describe agents that complete tasks by controlling a web browser, rather than calling site-specific APIs.[6] It encompasses systems that:
- Render or inspect webpages through visual or DOM-based perception
- Plan multi-step procedures using LLM reasoning
- Execute low-level browser actions (click, type, select, navigate) to achieve goals such as booking, data extraction, or account management[7]
BUAs are distinguished from computer-use agents (CUAs), which operate in broader desktop environments beyond browsers, by their focus on web-specific interactions within browser instances.[8]
History
The concept of browser-use agents emerged from the convergence of advances in large language models and web automation technologies.
Early Development (2021-2023)
- 2021: OpenAI's WebGPT demonstrated early browser-assisted question-answering with human feedback[5]
- 2022: Adept AI introduced ACT-1, a model trained to use common software tools including web browsers[9]
- 2023: Academic benchmarks WebArena and Mind2Web established standardized evaluation frameworks[1][3]
Modern Era (2024-2025)
- October 2024: Anthropic released the Computer Use feature for Claude models[10]
- December 2024: Google DeepMind unveiled Project Mariner built on Gemini 2.0[11]
- January 2025:
  - OpenAI announced Operator on January 23, powered by the Computer-Using Agent (CUA) model[2][8]
  - The Browser-Use open-source library reached 21,000+ GitHub stars[12]
- Later in 2025: OpenAI integrated Operator capabilities into ChatGPT Agent mode, with the standalone operator.chatgpt.com site scheduled for deprecation[13]
Rationale
Many real-world workflows remain locked behind human-oriented web interfaces. BUAs aim to generalize across diverse sites without bespoke integration by:[14]
- Interpreting on-page content and structure through perception systems
- Mapping high-level instructions to concrete browser interactions
- Adapting to website changes without reprogramming
- Reducing fragmented evaluation through unified testbeds
Architecture and Core Components
A BUA's operation follows a perception-reasoning-action loop where it perceives the state of a web page, reasons about the next best action toward its goal, and executes that action. This cycle repeats until task completion or failure determination.[15]
| Component | Description | Technologies | Implementation Details |
|---|---|---|---|
| Perception Layer | Understands content and layout of current web page | DOM parsing, CSS selectors, XPath, Accessibility Tree APIs, Vision Models | • DOM extraction for interactive elements • Screenshot processing (base64 encoding) • Visual analysis for layout understanding • Text extraction and semantic parsing |
| Reasoning & Planning Layer | Core decision-making powered by LLMs | GPT-4, Claude, Gemini, Llama, Chain-of-thought prompting, ReAct framework | • Task decomposition into sub-goals • Multi-step action planning • Context management across pages • Error detection and recovery strategies |
| Action Execution Layer | Translates abstract actions into browser commands | Selenium, Playwright, Puppeteer, Browser Extensions, Chrome DevTools Protocol | • Low-level control (click, type, scroll) • Multi-browser support • Headless and visible modes • Session management |
| Memory Management | Maintains state and context | Vector databases, Session storage, RL memories | • Working memory for active tasks • Persistent memory across sessions • Semantic memory for knowledge • Episodic memory for action history |
| Safety & Monitoring | Ensures safe operation and compliance | Refusal mechanisms, Audit logging, Permission systems | • Prompt injection prevention • Sensitive action gates • User approval workflows • Activity logging and rollback |
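The perception-reasoning-action loop described above can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: `FakeBrowser` is an in-memory stand-in for a real driver such as Playwright, `scripted_policy` is a stub where an LLM call would go, and every class and function name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    """What the perception layer hands to the planner (hypothetical shape)."""
    url: str
    elements: Dict[str, str]  # element id -> visible label or current value


@dataclass
class Action:
    kind: str         # "type", "click", or "done"
    target: str = ""  # element id the action applies to
    text: str = ""    # text to type, if any


class FakeBrowser:
    """In-memory stand-in for a real browser driver."""

    def __init__(self) -> None:
        self.url = "https://example.test/search"
        self.fields = {"query": ""}

    def observe(self) -> Observation:
        elements = dict(self.fields)
        elements["submit"] = "Search"
        return Observation(self.url, elements)

    def execute(self, action: Action) -> None:
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            self.url = "https://example.test/results"


def scripted_policy(goal: str, obs: Observation, history: List[Action]) -> Action:
    """Stub for the LLM: fill the query box, submit, then stop."""
    if obs.url.endswith("/results"):
        return Action("done")
    if obs.elements.get("query") != goal:
        return Action("type", "query", goal)
    return Action("click", "submit")


def run_agent(
    goal: str,
    browser: FakeBrowser,
    policy: Callable[[str, Observation, List[Action]], Action],
    max_steps: int = 10,
) -> Tuple[bool, List[Action]]:
    history: List[Action] = []
    for _ in range(max_steps):
        obs = browser.observe()               # perceive
        action = policy(goal, obs, history)   # reason
        if action.kind == "done":
            return True, history
        browser.execute(action)               # act
        history.append(action)
    return False, history  # step budget exhausted: treat as failure
```

Calling `run_agent("cheap flights", FakeBrowser(), scripted_policy)` runs two actions (type, then click) before the policy reports completion. The `max_steps` cap plays the role of the failure determination mentioned above.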
Technical Implementation
Browser Automation Frameworks
| Framework | Primary Use Case | Advantages | Limitations | BUA Adoption |
|---|---|---|---|---|
| Playwright | Cross-browser automation | Fast, reliable, modern API, built-in waiting | Newer ecosystem | Preferred for most BUAs[16] |
| Selenium | Traditional web testing | Mature, wide language support | Slower, more complex setup | Legacy support |
| Puppeteer | Chrome/Chromium control | Direct CDP access, lightweight | Chrome-focused; limited cross-browser support | Specialized use cases |
| CDP (Chrome DevTools Protocol) | Low-level browser control | Maximum control, performance | Complex, browser-specific | Advanced implementations |
Processing Modes
| Mode | Description | Token Usage | Speed | Accuracy | Best For |
|---|---|---|---|---|---|
| Snapshot Mode | Uses accessibility tree for element identification | Low (500-2K) | Fast (<1s) | High for simple pages | Form filling, standard layouts |
| Vision Mode | Processes screenshots for visual understanding | High (5K-15K) | Slow (2-5s) | High for complex layouts | Dynamic content, visual elements |
| Hybrid Mode | Combines DOM parsing with visual processing | Medium (2K-8K) | Medium (1-3s) | Highest overall | General-purpose automation |
| Streaming Mode | Continuous observation and action | Very High | Real-time | Variable | Interactive applications |
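One way to read the table above is as a selection rule: prefer the mode that suits the page, then fall back to a cheaper one if the token budget is too small. The sketch below is only an illustration of that idea. The upper-bound token figures come from the table; the `PageTraits` fields and the 0.9 threshold are assumptions made for the example, and Streaming Mode is left out because the table gives no token figure for it.

```python
from dataclasses import dataclass

# Upper-bound token estimates per observation, taken from the table above.
MAX_TOKENS = {"snapshot": 2_000, "hybrid": 8_000, "vision": 15_000}


@dataclass
class PageTraits:
    labelled_fraction: float  # share of interactive elements with accessible names (0..1)
    visually_driven: bool     # e.g. canvas widgets or image-only buttons


def choose_mode(page: PageTraits, token_budget: int) -> str:
    """Pick the first preferred mode whose worst-case token cost fits the budget."""
    if page.visually_driven:
        preferred = ["vision", "hybrid", "snapshot"]
    elif page.labelled_fraction >= 0.9:   # hypothetical threshold
        preferred = ["snapshot"]
    else:
        preferred = ["hybrid", "snapshot"]
    for mode in preferred:
        if MAX_TOKENS[mode] <= token_budget:
            return mode
    return "snapshot"  # cheapest mode as a last resort
```

For example, a visually driven page with a 10,000-token budget falls back from vision to hybrid, because the vision upper bound of 15,000 tokens does not fit.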
Language Model Integration
BUAs support various LLM providers with different capabilities:[17]
| Provider | Models | Vision Support | Cost (per 1M tokens) | Latency | Best Use Case |
|---|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4-turbo | Yes | $5-15 | Low | Production systems |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus | Yes | $3-15 | Low | Complex reasoning |
| Google | Gemini 1.5 Pro, Gemini 2.0 | Yes | $3.5-7 | Low | Multimodal tasks |
| Open Source | Llama 3, Mistral, Qwen | Limited | $0.5-2 | Variable | Cost-sensitive applications |
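The per-token prices above combine with the per-observation token counts from the Processing Modes table into a rough per-task cost. The sketch below makes several simplifying assumptions that are not in the source: one model call per step, a single blended price for input and output tokens, and a hypothetical 10-step task.

```python
from typing import Tuple


def task_cost_usd(steps: int, tokens_per_step: int, usd_per_million: float) -> float:
    """Cost of one task: steps x tokens per step x price per token."""
    return steps * tokens_per_step * usd_per_million / 1_000_000


def cost_range(
    steps: int,
    token_range: Tuple[int, int],
    price_range: Tuple[float, float],
) -> Tuple[float, float]:
    """Best case pairs the low token count with the low price; worst case the highs."""
    low = task_cost_usd(steps, token_range[0], price_range[0])
    high = task_cost_usd(steps, token_range[1], price_range[1])
    return low, high


# Hybrid Mode (2K-8K tokens) at $5-15 per 1M tokens, over an assumed 10 steps.
low, high = cost_range(10, (2_000, 8_000), (5.0, 15.0))
print(f"${low:.2f} to ${high:.2f} per task")
```

Under these assumptions the estimate spans roughly $0.10 to $1.20 per task. Real costs depend on how much page history is resent at each step, which this sketch ignores.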
Performance Benchmarks
Standardized Evaluation Frameworks
| Benchmark | Focus Area | Task Count | Characteristics | Key Metrics |
|---|---|---|---|---|
| WebArena | Realistic multi-site environment | 812 tasks | Self-hostable sites across e-commerce, CMS, social platforms; execution-based evaluation | Task success rate, efficiency score[1] |
| Mind2Web | Cross-website generalization | 2,350 tasks | 137 websites, real-world task diversity, action sequence annotation | Element accuracy, action F1 score[3] |
| WebVoyager | Live website interaction | 643 tasks | Amazon, GitHub, Google Maps, real-time execution | End-to-end success rate[8] |
| VisualWebArena | Multimodal/visual tasks | 910 tasks | Image-heavy tasks, visual grounding requirements | Visual element accuracy[18] |
| BrowserGym | Unified ecosystem | 5,000+ tasks | Standardized obs/action spaces, cross-benchmark evaluation | Aggregate performance score[14] |
| WebShop | E-commerce navigation | 12,087 instructions (1.18M products) | Product search and selection, attribute matching | Purchase success rate, reward score[19] |
| OSWorld | Full OS control | 369 tasks | Ubuntu, Windows, macOS environments | Cross-platform success rate[8] |
Comparative Performance (2025)
| Agent/Model | WebArena | WebVoyager | Mind2Web | OSWorld | Average |
|---|---|---|---|---|---|
| Human Baseline | 78.2% | 90.0% | 85.3% | 72.4% | 81.5% |
| Browser-Use (Open Source) | 51.2% | 89.1% | 73.4% | N/A | 71.2% |
| CUA (OpenAI) | 58.1% | 87.0% | 76.2% | 38.1% | 64.9% |
| Computer Use (Anthropic) | 45.3% | 56.0% | 62.1% | 22.0% | 46.4% |
| Mariner (Google) | 52.4% | 83.5% | 71.3% | N/A | 69.1% |
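The Average column is consistent with a plain arithmetic mean over the benchmarks that have a reported score, with N/A entries skipped. The check below reproduces the column under that assumption; it uses `Decimal` with half-up rounding so that values such as 64.85 round the same way as in the table.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import List, Optional

# Scores in table order: WebArena, WebVoyager, Mind2Web, OSWorld (None = N/A).
SCORES = {
    "Human Baseline": ["78.2", "90.0", "85.3", "72.4"],
    "Browser-Use (Open Source)": ["51.2", "89.1", "73.4", None],
    "CUA (OpenAI)": ["58.1", "87.0", "76.2", "38.1"],
    "Computer Use (Anthropic)": ["45.3", "56.0", "62.1", "22.0"],
    "Mariner (Google)": ["52.4", "83.5", "71.3", None],
}


def mean_of_reported(scores: List[Optional[str]]) -> Decimal:
    """Arithmetic mean over the reported scores only, rounded to one decimal."""
    reported = [Decimal(s) for s in scores if s is not None]
    mean = sum(reported) / len(reported)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for agent, scores in SCORES.items():
    print(f"{agent}: {mean_of_reported(scores)}%")
```

Because rows with an N/A are averaged over three benchmarks and the others over four, the averages are not directly comparable: the rows that skip OSWorld omit the benchmark on which the other agents score lowest.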
Major Implementations
OpenAI Operator
Released January 23, 2025, Operator is powered by the Computer-Using Agent (CUA) model, combining GPT-4o's vision capabilities with reinforcement learning:[2][8]
- Architecture: Iterative perception-reasoning-action loop with self-correction
- Safety Features: Refusal mechanisms, user approval gates for sensitive actions
- Performance: Reported scores of 58.1% on WebArena and 87% on WebVoyager
- Availability: Initially required a ChatGPT Pro subscription; later integrated into ChatGPT Agent mode[13]
Browser-Use Library
An open-source Python library enabling LLM-powered browser interaction via natural language:[4]
- Statistics: 21,000+ GitHub stars, 1,000+ forks (as of January 2025)
- Features: Multi-model support, memory-enabled workflows, cloud/local execution
- License: MIT License
- Integration: Supports OpenAI, Anthropic, Google, and open-source models
Anthropic Computer Use
Released October 2024, enables Claude models to interact with computer interfaces:[10]
- Approach: Visual perception and simulated input
- Models: Claude 3.5 Sonnet optimized for computer use
- Applications: Desktop and web automation
Google Project Mariner
Experimental agent from Google DeepMind for autonomous web navigation:[11]
- Foundation: Built on Gemini 2.0
- Focus: Research-oriented, multimodal understanding
- Performance: 83.5% on WebVoyager benchmark
Applications and Use Cases
Enterprise Automation
- Business Process Automation: Replacing traditional RPA with adaptive agents
- Data Integration: Cross-platform data extraction and synchronization
- Compliance Monitoring: Automated regulatory checks and reporting
- Supply Chain Management: Vendor portal navigation and order processing
E-Commerce and Services
- Price Monitoring: Real-time competitor analysis and dynamic pricing
- Inventory Management: Multi-marketplace stock synchronization
- Customer Service: Automated order tracking and status updates
- Review Aggregation: Sentiment analysis across platforms
Research and Analysis
- Academic Research: Literature review automation, citation gathering
- Market Intelligence: Trend analysis, competitor monitoring
- Financial Analysis: Earnings report extraction, market data compilation
- Patent Research: Prior art searches, classification analysis
Quality Assurance
- Automated Testing: End-to-end user journey validation[20]
- Accessibility Testing: WCAG compliance verification
- Cross-browser Testing: Compatibility validation across platforms
- Performance Testing: Load testing, response time measurement
Personal Productivity
- Travel Planning: Multi-site booking coordination
- Job Applications: Resume parsing, application submission
- Content Aggregation: News curation, research compilation
- Social Media Management: Cross-platform posting and monitoring
Challenges and Limitations
Technical Challenges
- Dynamic Content Handling:
- Element Identification:
  - Shadow DOM and iframes create isolation barriers
  - Dynamic ID generation prevents reliable selectors
  - Similar elements require disambiguation strategies[22]
- State Management:
- Error Recovery:
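Two of the element-identification issues listed above, unstable generated IDs and visually similar elements, can be illustrated with a small disambiguation sketch. It matches on role, label, and nearby text rather than on element IDs, and it refuses to guess when two candidates tie. The `Candidate` fields and the token-overlap scoring are assumptions chosen for the example, not a method taken from any specific agent.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Candidate:
    element_id: str  # may be auto-generated and unstable, so it is never matched on
    role: str        # e.g. "button"
    name: str        # accessible name or visible label
    context: str     # nearby text, e.g. the product row the button sits in


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def pick_element(
    role: str, description: str, candidates: List[Candidate]
) -> Optional[Candidate]:
    """Return the single best match, or None if nothing matches or the top scores tie."""
    wanted = _tokens(description)
    scored = []
    for cand in candidates:
        if cand.role != role:
            continue
        overlap = wanted & (_tokens(cand.name) | _tokens(cand.context))
        scored.append((len(overlap), cand))
    if not scored:
        return None
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score = scored[0][0]
    if best_score == 0:
        return None
    if len(scored) > 1 and scored[1][0] == best_score:
        return None  # ambiguous: the caller should gather more context
    return scored[0][1]
```

With two "Add to cart" buttons whose rows read "Blue mug $12" and "Red mug $14", the description "add to cart red mug" selects the second button, while the bare description "add to cart" returns `None` because both candidates score the same.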
Performance Limitations
| Issue | Impact | Current Solutions | Future Approaches |
|---|---|---|---|
| LLM Inference Latency | 2-5 second delays per action | Caching, batching | Edge deployment, model optimization |
| Token Consumption | $0.10-1.00 per complex task | Efficient prompting, mode selection | Specialized models, compression |
| Memory Limitations | Context window constraints | Summarization, pruning | Extended context models |
| Reliability | 60-90% success rates | Retry logic, fallbacks | Reinforcement learning, self-improvement |
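The "retry logic, fallbacks" entry in the table can be sketched as a wrapper that retries each strategy with exponential backoff before moving to the next one. This is a generic pattern rather than a specific agent's recovery code; `ActionFailed`, the strategy callables, and the default delays are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence


class ActionFailed(Exception):
    """Raised by a strategy when a browser action does not succeed."""


def run_with_fallbacks(
    strategies: Sequence[Callable[[], str]],
    retries_per_strategy: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each strategy in order; retry each with exponential backoff before falling back."""
    last_error: Optional[ActionFailed] = None
    for strategy in strategies:
        for attempt in range(retries_per_strategy):
            try:
                return strategy()
            except ActionFailed as err:
                last_error = err
                sleep(base_delay * (2 ** attempt))  # wait 0.5s, 1s, ... before the next try
    raise ActionFailed(f"all strategies failed: {last_error}")
```

A typical ordering would place the cheapest strategy first, for example a DOM-selector click, followed by a more expensive fallback such as a vision-based click. The `sleep` parameter exists so that tests can pass a no-op instead of waiting.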
Safety, Security, and Ethical Concerns
- Security Risks:[8]
  - Prompt injection attacks from malicious websites
  - Credential theft through compromised agents
  - Cross-site scripting (XSS) vulnerabilities
  - Data exfiltration risks
- Privacy Concerns:
  - Processing of sensitive personal information
  - Screenshot capture of private data
  - Audit trail storage and retention
- Misuse Prevention:
  - Automated spam and harassment potential
  - Unauthorized web scraping at scale
  - Terms of service violations
  - Legal compliance challenges
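The architecture table lists sensitive action gates, user approval workflows, and activity logging as safeguards against risks like these. A minimal sketch of such a gate follows. The set of sensitive action kinds and all names are hypothetical, and a gate of this kind does not stop prompt injection itself; it only limits what an injected instruction can do without a human noticing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical list of action kinds that must be confirmed by the user.
SENSITIVE_KINDS = {"submit_payment", "send_message", "delete", "enter_credentials"}


@dataclass
class ProposedAction:
    kind: str
    detail: str


def gated_execute(
    action: ProposedAction,
    execute: Callable[[ProposedAction], None],
    approve: Callable[[ProposedAction], bool],
    audit_log: List[Dict[str, object]],
) -> bool:
    """Run the action only if it is non-sensitive or the user approves; log either way."""
    needs_approval = action.kind in SENSITIVE_KINDS
    approved = approve(action) if needs_approval else True
    audit_log.append(
        {
            "kind": action.kind,
            "detail": action.detail,
            "needs_approval": needs_approval,
            "executed": approved,
        }
    )
    if approved:
        execute(action)
    return approved
```

In an interactive setting `approve` would prompt the user. Logging the decision before executing means that a denied or failed action still leaves an audit record.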
Notable Research Projects
Academic Initiatives
- WebGPT (OpenAI): Pioneering work on browser-assisted question-answering with human feedback, establishing foundations for modern BUAs[5]
- Mind2Web (OSU): Large-scale dataset and framework for developing generalist web agents, testing cross-website generalization[3]
- WebArena (CMU): Realistic benchmark environment with self-hosted websites for reproducible agent evaluation[1]
- AgentTuning: Research on enabling generalized agent abilities through fine-tuning LLMs for web tasks[15]
Industry Research
- Adept ACT-1: Universal action transformer for software interface control[9]
- Google Multimodal Web Navigation: Research on instruction-finetuned foundation models for web navigation[25]
Open Source Frameworks
- LangChain/LlamaIndex: Provide building blocks for BUA development[26]
- BrowserGym: Unified ecosystem for web agent research and evaluation[14]
- Skyvern: Open-source browser automation with LLMs and computer vision[24]
Future Directions
Technical Developments
- Enhanced Perception:
  - Perception tokens for improved visual reasoning[27]
  - 3D understanding of page layouts
  - Video comprehension for dynamic content
- Improved Planning:
  - Hierarchical task decomposition
  - Long-horizon reasoning capabilities
  - Multi-objective optimization
- Self-Improvement:
  - Online learning from execution feedback
  - Automated test generation and validation
  - Self-healing adaptation to website changes[20]
Industry Adoption Trajectory
- 2025-2026: Early adoption in QA and testing
- 2026-2027: Enterprise RPA replacement begins
- 2027-2028: Consumer-facing agent assistants
- 2028-2030: Ubiquitous web automation
Emerging Standards and Protocols
- Development of agent-specific web protocols
- Standardized safety and permission frameworks
- Industry benchmarks for reliability certification
- Regulatory compliance frameworks
Comparison with Related Technologies
| Aspect | Browser-Use Agent (BUA) | Computer-Use Agent (CUA) | Traditional RPA | Web Scraping |
|---|---|---|---|---|
| Scope | Web browsers | Full desktop OS | Predefined workflows | Data extraction only |
| Adaptability | High (LLM-based) | High (LLM-based) | Low (scripted) | Low (rule-based) |
| Setup Complexity | Medium | High | High | Low |
| Maintenance | Self-adapting | Self-adapting | Frequent updates needed | Regular updates needed |
| Cost | $0.10-1.00/task | $0.50-2.00/task | High initial, low per-task | Low |
| Use Cases | General web automation | Any desktop application | Repetitive business processes | Data collection |
| Error Handling | Intelligent recovery | Intelligent recovery | Basic retry logic | Minimal |
See also
- Artificial intelligence
- Large language model
- Web scraping
- Robotic Process Automation
- Selenium (software)
- Playwright (software)
- Software agent
- Computer vision
- Natural language processing
- Reinforcement learning
- Human-computer interaction
- Web automation
- Autonomous agent
- Prompt engineering
- Chain-of-thought prompting
References
- [1] Zhou, Shuyan et al. (2023). "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv preprint arXiv:2307.13854. Retrieved from https://arxiv.org/abs/2307.13854
- [2] OpenAI. (2025). "Introducing Operator." Retrieved from https://openai.com/index/introducing-operator/
- [3] Deng, Xiang et al. (2023). "Mind2Web: Towards a Generalist Agent for the Web." arXiv preprint arXiv:2306.06070. Retrieved from https://arxiv.org/abs/2306.06070
- [4] Browser-Use Team. (2025). "Browser-Use: Make websites accessible for AI agents." GitHub. Retrieved from https://github.com/browser-use/browser-use
- [5] Nakano, Reiichiro et al. (2021). "WebGPT: Browser-assisted question-answering with human feedback." arXiv preprint arXiv:2112.09332. Retrieved from https://arxiv.org/abs/2112.09332
- [6] VentureBeat. (2025). "The rise of browser-use agents: Why Convergence's Proxy is beating OpenAI's Operator." Retrieved from industry sources
- [7] The Verge. (2025). "OpenAI's new Operator AI agent can do things on the web for you." Retrieved from https://www.theverge.com/2025/1/23/operator-openai
- [8] OpenAI. (2025). "Computer-Using Agent." Retrieved from https://openai.com/index/computer-using-agent/
- [9] Adept AI. (2022). "ACT-1: A new frontier for models that take actions on your computer." Retrieved from https://www.adept.ai/blog/act-1
- [10] Anthropic. (2024). "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." Retrieved from https://www.anthropic.com/news/3-5-models-and-computer-use
- [11] DataCamp. (2025). "OpenAI's Operator: Examples, Use Cases, Competition & More." Retrieved from https://www.datacamp.com/blog/operator
- [12] InfoWorld. (2025). "Browser Use: An open-source AI agent to automate web-based tasks." Retrieved from https://www.infoworld.com/article/3812644/browser-use-an-open-source-ai-agent-to-automate-web-based-tasks.html
- [13] OpenAI. (2025). "Introducing ChatGPT agent: bridging research and action." Retrieved from https://openai.com/index/introducing-chatgpt-agent/
- [14] Le Sellier De Chezelles, T. et al. (2024). "The BrowserGym Ecosystem for Web Agent Research." arXiv preprint arXiv:2412.05467. Retrieved from https://arxiv.org/abs/2412.05467
- [15] Zeng, Aohan et al. (2023). "AgentTuning: Enabling Generalized Agent Abilities for LLMs." arXiv preprint arXiv:2310.12823. Retrieved from https://arxiv.org/abs/2310.12823
- [16] DZone. (2025). "Build an AI Browser Agent With LLMs, Playwright, Browser-Use." Retrieved from https://dzone.com/articles/build-ai-browser-agent-llms-playwright-browser-use
- [17] ADASCI. (2025). "A Practical Guide to Enabling AI Agent Browser Control using Browser-use." Retrieved from https://adasci.org/a-practical-guide-to-enabling-ai-agent-browser-control-using-browser-use/
- [18] Web-Arena-x. (2024-2025). "VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks." GitHub. Retrieved from project site
- [19] Yao, Shunyu et al. (2022). "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents." arXiv preprint arXiv:2207.01206. Retrieved from https://arxiv.org/abs/2207.01206
- [20] Ministry of Testing. (2025). "Creating self-healing automated tests with AI and Playwright." Retrieved from https://www.ministryoftesting.com/articles/creating-self-healing-automated-tests-with-ai-and-playwright
- [21] Skyvern. (2025). "Web Bench - A new way to compare AI Browser Agents." Retrieved from https://blog.skyvern.com/web-bench-a-new-way-to-compare-ai-browser-agents/
- [22] Browser-Use. (2025). "Browser Use = state of the art Web Agent." Retrieved from https://browser-use.com/posts/sota-technical-report
- [23] OpenAI. (2025). "Context Engineering - Short-Term Memory Management with Sessions." Retrieved from https://cookbook.openai.com/examples/agents_sdk/session_memory
- [24] Skyvern. (2025). "Automate browser-based workflows with LLMs and Computer Vision." GitHub. Retrieved from https://github.com/Skyvern-AI/skyvern
- [25] Furuta, Hiroki et al. (2023). "Multimodal Web Navigation with Instruction-Finetuned Foundation Models." arXiv preprint arXiv:2305.11854. Retrieved from https://arxiv.org/abs/2305.11854
- [26] LangChain. (2025). "Agent architectures." Retrieved from https://langchain-ai.github.io/langgraph/concepts/agentic_concepts/
- [27] arXiv. (2024). "Perception Tokens Enhance Visual Reasoning in Multimodal Language Models." Retrieved from https://arxiv.org/html/2412.03548v1