Browser-use agent


A browser-use agent (BUA), also known as an autonomous web agent or LLM-based browser agent, is an artificial intelligence software agent that operates a standard web browser through its graphical user interface (GUI) to accomplish goals that users specify in natural language.^[1]^[2] Unlike traditional web scraping, API-based integrations, or automation scripts that follow predefined rules, BUAs combine the reasoning capabilities of large language models (LLMs) with browser automation technologies to perceive web page content, plan sequences of actions, and execute them, completing multi-step tasks across diverse websites without bespoke integration.^[3]^[4]

These agents are a step toward general-purpose digital assistants that can handle real-world web tasks, such as booking travel, managing online shopping, conducting detailed information searches, or completing data entry, without human intervention at each step.^[5]

Terminology and Definition

The term "browser-use agent" is used in research and industry to describe agents that complete tasks by controlling a web browser, rather than calling site-specific APIs.^[6] It encompasses systems that:

Render or inspect webpages through visual or DOM-based perception
Plan multi-step procedures using LLM reasoning
Execute low-level browser actions (click, type, select, navigate) to achieve goals such as booking, data extraction, or account management^[7]
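The low-level actions listed above can be represented as a small, validated data structure that sits between the model's output and the browser. The sketch below is illustrative only: the `BrowserAction` class, the JSON field names (`action`, `target`, `value`), and `parse_action` are hypothetical and are not taken from any specific agent.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Action names come from the list above; the JSON shape is an assumption.
ALLOWED_ACTIONS = {"click", "type", "select", "navigate"}


@dataclass(frozen=True)
class BrowserAction:
    name: str                     # one of ALLOWED_ACTIONS
    target: Optional[str] = None  # element reference, e.g. a selector
    value: Optional[str] = None   # text to type, option to select, or URL


def parse_action(raw: str) -> BrowserAction:
    """Parse and validate one action emitted by the model as a JSON object."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")
    name = data.get("action")
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {name!r}")
    if name == "navigate" and not data.get("value"):
        raise ValueError("navigate requires a URL in 'value'")
    if name in {"click", "type", "select"} and not data.get("target"):
        raise ValueError(f"{name} requires a 'target' element")
    if name in {"type", "select"} and data.get("value") is None:
        raise ValueError(f"{name} requires a 'value'")
    return BrowserAction(name, data.get("target"), data.get("value"))
```

Rejecting anything outside the allowed set before it reaches the browser keeps a malformed or unexpected model output from turning into an unintended interaction.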

BUAs are distinguished from computer-use agents (CUAs), which operate in broader desktop environments beyond browsers, by their focus on web-specific interactions within browser instances.^[8]

History

The concept of browser-use agents emerged from the convergence of advances in large language models and web automation technologies.

Early Development (2021-2023)

2021: OpenAI's WebGPT demonstrated early browser-assisted question-answering with human feedback^[5]
2022: Adept AI introduced ACT-1, a model trained to use common software tools including web browsers^[9]
2023: Academic benchmarks WebArena and Mind2Web established standardized evaluation frameworks^[1]^[3]

Modern Era (2024-2025)

October 2024: Anthropic released the Computer Use feature for Claude models as a public beta^[10]
December 2024: Google DeepMind unveiled Project Mariner built on Gemini 2.0^[11]
January 2025:
- OpenAI announced Operator on January 23, powered by the Computer-Using Agent (CUA) model^[2]^[8]
- Browser-Use open-source library reached 21,000+ GitHub stars^[12]
July 2025: OpenAI integrated Operator capabilities into ChatGPT as ChatGPT agent, with the standalone operator.chatgpt.com site scheduled for deprecation^[13]

Rationale

Many real-world workflows remain locked behind human-oriented web interfaces. BUAs aim to generalize across diverse sites without bespoke integration by:^[14]

Interpreting on-page content and structure through perception systems
Mapping high-level instructions to concrete browser interactions
Adapting to website changes without reprogramming
Reducing fragmented evaluation through unified testbeds

Architecture and Core Components

A BUA's operation follows a perception-reasoning-action loop where it perceives the state of a web page, reasons about the next best action toward its goal, and executes that action. This cycle repeats until task completion or failure determination.^[15]
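The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any particular agent: `FakeBrowser` is a hypothetical stand-in for a real browser, and `scripted_policy` is a placeholder for the LLM call that would normally choose the next action.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Action = Tuple[str, str]  # (action name, argument), e.g. ("click", "search")


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a browser: a tiny state machine of pages."""
    page: str = "home"
    # (current page, action) -> next page
    transitions: Dict[Tuple[str, Action], str] = field(default_factory=lambda: {
        ("home", ("click", "search")): "results",
        ("results", ("click", "first_result")): "detail",
    })

    def observe(self) -> str:
        return self.page

    def execute(self, action: Action) -> None:
        # Unknown actions leave the page unchanged.
        self.page = self.transitions.get((self.page, action), self.page)


Policy = Callable[[str, List[Action]], Optional[Action]]


def run_agent(browser: FakeBrowser, policy: Policy, goal_page: str,
              max_steps: int = 10) -> List[Action]:
    """Repeat perceive -> reason -> act until the goal is reached or we fail."""
    history: List[Action] = []
    for _ in range(max_steps):
        observation = browser.observe()        # perceive
        if observation == goal_page:
            return history                     # task complete
        action = policy(observation, history)  # reason
        if action is None:
            raise RuntimeError("policy could not propose an action")
        browser.execute(action)                # act
        history.append(action)
    raise RuntimeError("step budget exhausted before reaching the goal")


def scripted_policy(observation: str, history: List[Action]) -> Optional[Action]:
    """Placeholder for the LLM: maps each page to a fixed next action."""
    return {"home": ("click", "search"),
            "results": ("click", "first_result")}.get(observation)
```

The explicit step budget is what turns "failure determination" into something concrete: an agent that keeps acting without changing the page state stops after `max_steps` instead of looping forever.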

Browser-Use Agent Architecture Components
Component	Description	Technologies	Implementation Details
Perception Layer	Understands content and layout of current web page	DOM parsing, CSS selectors, XPath, Accessibility Tree APIs, Vision Models	• DOM extraction for interactive elements • Screenshot processing (base64 encoding) • Visual analysis for layout understanding • Text extraction and semantic parsing
Reasoning & Planning Layer	Core decision-making powered by LLMs	GPT-4, Claude, Gemini, Llama, Chain-of-thought prompting, ReAct framework	• Task decomposition into sub-goals • Multi-step action planning • Context management across pages • Error detection and recovery strategies
Action Execution Layer	Translates abstract actions into browser commands	Selenium, Playwright, Puppeteer, Browser Extensions, Chrome DevTools Protocol	• Low-level control (click, type, scroll) • Multi-browser support • Headless and visible modes • Session management
Memory Management	Maintains state and context	Vector databases, Session storage, RL memories	• Working memory for active tasks • Persistent memory across sessions • Semantic memory for knowledge • Episodic memory for action history
Safety & Monitoring	Ensures safe operation and compliance	Refusal mechanisms, Audit logging, Permission systems	• Prompt injection prevention • Sensitive action gates • User approval workflows • Activity logging and rollback
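The "sensitive action gates" and "user approval workflows" in the last row of the table can be illustrated with a small wrapper around action execution. This is a sketch under assumptions: the `SENSITIVE_KINDS` set, the `gated_execute` function, and the shape of the audit log are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical set of action kinds that should never run unattended.
SENSITIVE_KINDS = {"submit_payment", "delete_account", "send_message"}


def gated_execute(kind: str,
                  execute: Callable[[], str],
                  approve: Callable[[str], bool],
                  audit_log: List[Tuple[str, str]]) -> str:
    """Run `execute` only if the action is non-sensitive or the user approves.

    Every decision is appended to `audit_log` as (kind, outcome).
    """
    if kind in SENSITIVE_KINDS and not approve(kind):
        audit_log.append((kind, "blocked"))
        return "blocked"
    result = execute()
    audit_log.append((kind, "executed"))
    return result
```

In an interactive setting, `approve` would prompt the user; in the sketch it is just a callable so that the gate can be exercised without a user interface.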

Technical Implementation

Browser Automation Frameworks

Comparison of Browser Automation Technologies for BUAs
Framework	Primary Use Case	Advantages	Limitations	BUA Adoption
Playwright	Cross-browser automation	Fast, reliable, modern API, built-in waiting	Newer ecosystem	Preferred for most BUAs^[16]
Selenium	Traditional web testing	Mature, wide language support	Slower, more complex setup	Legacy support
Puppeteer	Chrome/Chromium control	Direct CDP access, lightweight	Primarily Chrome/Chromium; Firefox support is more limited	Specialized use cases
CDP (Chrome DevTools Protocol)	Low-level browser control	Maximum control, performance	Complex, browser-specific	Advanced implementations

Processing Modes

BUA Processing Mode Comparison
Mode	Description	Token Usage	Speed	Accuracy	Best For
Snapshot Mode	Uses accessibility tree for element identification	Low (500-2K)	Fast (<1s)	High for simple pages	Form filling, standard layouts
Vision Mode	Processes screenshots for visual understanding	High (5K-15K)	Slow (2-5s)	High for complex layouts	Dynamic content, visual elements
Hybrid Mode	Combines DOM parsing with visual processing	Medium (2K-8K)	Medium (1-3s)	Highest overall	General-purpose automation
Streaming Mode	Continuous observation and action	Very High	Real-time	Variable	Interactive applications
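One way to read the "Best For" column is as a dispatch rule that picks a mode from a few observable traits of the page. The sketch below is only an illustration: the `PageTraits` fields and the order of the checks are assumptions, not a rule taken from any particular agent.

```python
from dataclasses import dataclass


@dataclass
class PageTraits:
    has_accessibility_tree: bool  # page exposes usable roles and labels
    visually_driven: bool         # canvas, image-only buttons, custom widgets
    needs_live_updates: bool      # interactive application that changes continuously


def choose_mode(traits: PageTraits) -> str:
    """Map page traits to one of the four processing modes in the table."""
    if traits.needs_live_updates:
        return "streaming"
    if traits.has_accessibility_tree and traits.visually_driven:
        return "hybrid"
    if traits.visually_driven:
        return "vision"
    if traits.has_accessibility_tree:
        return "snapshot"
    return "hybrid"  # general-purpose fallback when nothing is known
```

Trying the cheaper snapshot mode first and escalating only when the page demands it keeps token usage near the low end of the ranges in the table.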

Language Model Integration

BUAs support various LLM providers with different capabilities:^[17]

LLM Provider Capabilities for Browser-Use Agents
Provider	Models	Vision Support	Cost (per 1M tokens)	Latency	Best Use Case
OpenAI	GPT-4o, GPT-4-turbo	Yes	$5-15	Low	Production systems
Anthropic	Claude 3.5 Sonnet, Claude 3 Opus	Yes	$3-15	Low	Complex reasoning
Google	Gemini 1.5 Pro, Gemini 2.0	Yes	$3.5-7	Low	Multimodal tasks
Open Source	Llama 3, Mistral, Qwen	Limited	$0.5-2	Variable	Cost-sensitive applications
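The per-token prices above translate into a per-task cost once the number of steps and the tokens consumed per step are known. The following back-of-the-envelope sketch assumes a single blended price for input and output tokens and a hypothetical 20-step task; real pricing usually distinguishes input from output tokens.

```python
def estimate_task_cost(steps: int, tokens_per_step: int,
                       usd_per_million_tokens: float) -> float:
    """Approximate LLM cost of one task in US dollars."""
    return steps * tokens_per_step * usd_per_million_tokens / 1_000_000


# Assumed example: 20 steps at about 5,000 tokens each, priced at $5 per 1M tokens.
cost = estimate_task_cost(steps=20, tokens_per_step=5_000, usd_per_million_tokens=5.0)
print(f"${cost:.2f}")  # prints $0.50
```

Because cost scales linearly in all three factors, halving the tokens per step (for example by preferring an accessibility-tree observation over a screenshot) halves the cost of the task.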

Performance Benchmarks

Standardized Evaluation Frameworks

Browser-Use Agent Benchmark Suites
Benchmark	Focus Area	Task Count	Characteristics	Key Metrics
WebArena	Realistic multi-site environment	812 tasks	Self-hostable sites across e-commerce, CMS, social platforms; execution-based evaluation	Task success rate, efficiency score^[1]
Mind2Web	Cross-website generalization	2,350 tasks	137 websites, real-world task diversity, action sequence annotation	Element accuracy, action F1 score^[3]
WebVoyager	Live website interaction	643 tasks	Amazon, GitHub, Google Maps, real-time execution	End-to-end success rate^[8]
VisualWebArena	Multimodal/visual tasks	910 tasks	Image-heavy tasks, visual grounding requirements	Visual element accuracy^[18]
BrowserGym	Unified ecosystem	5,000+ tasks	Standardized obs/action spaces, cross-benchmark evaluation	Aggregate performance score^[14]
WebShop	E-commerce navigation	12,087 instructions	1.18 million products; product search and selection, attribute matching	Purchase success rate, reward score^[19]
OSWorld	Full OS control	369 tasks	Ubuntu, Windows, macOS environments	Cross-platform success rate^[8]
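Two of the metrics named in the table can be computed directly from logged runs. The sketch below shows a task success rate and a step-level element accuracy; the function names and input shapes are assumptions, and the exact definitions used by each benchmark may differ in detail.

```python
from typing import Sequence


def task_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks that ended in a verified success."""
    if not outcomes:
        raise ValueError("no task outcomes given")
    return sum(outcomes) / len(outcomes)


def element_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of steps where the chosen element matches the annotated one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("no steps given")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

A high element accuracy does not imply a high task success rate: a single wrong click in a ten-step task still yields 90% element accuracy while the task as a whole fails.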

Comparative Performance (2025)

Browser-Use Agent Performance Comparison
Agent/Model	WebArena	WebVoyager	Mind2Web	OSWorld	Average
Human Baseline	78.2%	90.0%	85.3%	72.4%	81.5%
Browser-Use (Open Source)	51.2%	89.1%	73.4%	N/A	71.2%
CUA (OpenAI)	58.1%	87.0%	76.2%	38.1%	64.9%
Computer Use (Anthropic)	45.3%	56.0%	62.1%	22.0%	46.4%
Mariner (Google)	52.4%	83.5%	71.3%	N/A	69.1%
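The Average column is the mean over the benchmarks for which a score is listed, so rows with an N/A entry are averaged over three benchmarks rather than four. The short sketch below reproduces that calculation from the table's own figures; the `SCORES` dictionary and `mean_available` helper are names introduced here for illustration.

```python
from typing import Dict, List, Optional

# Columns: WebArena, WebVoyager, Mind2Web, OSWorld (None marks N/A in the table).
SCORES: Dict[str, List[Optional[float]]] = {
    "Human Baseline": [78.2, 90.0, 85.3, 72.4],
    "Browser-Use (Open Source)": [51.2, 89.1, 73.4, None],
    "CUA (OpenAI)": [58.1, 87.0, 76.2, 38.1],
    "Computer Use (Anthropic)": [45.3, 56.0, 62.1, 22.0],
    "Mariner (Google)": [52.4, 83.5, 71.3, None],
}


def mean_available(values: List[Optional[float]]) -> float:
    """Mean of the scores that are present, skipping N/A entries."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no scores available")
    return sum(present) / len(present)


for agent, values in SCORES.items():
    covered = sum(v is not None for v in values)
    print(f"{agent}: {mean_available(values):.1f}% over {covered} benchmarks")
```

Because the rows cover different benchmark sets, the averages are not like-for-like. Restricted to the three benchmarks that every row reports (WebArena, WebVoyager, Mind2Web), the CUA row averages about 73.8%, compared with 64.9% when its OSWorld score is included.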

Major Implementations

OpenAI Operator

Released January 23, 2025, Operator is powered by the Computer-Using Agent (CUA) model, combining GPT-4o's vision capabilities with reinforcement learning:^[2]^[8]

Architecture: Iterative perception-reasoning-action loop with self-correction
Safety Features: Refusal mechanisms, user approval gates for sensitive actions
Performance: Reported scores of 58.1% on WebArena and 87% on WebVoyager at launch
Availability: Initially required ChatGPT Pro subscription; later integrated into ChatGPT Agent mode^[13]

Browser-Use Library

An open-source Python library enabling LLM-powered browser interaction via natural language:^[4]

Statistics: 21,000+ GitHub stars, 1,000+ forks (as of January 2025)
Features: Multi-model support, memory-enabled workflows, cloud/local execution
License: MIT License
Integration: Supports OpenAI, Anthropic, Google, and open-source models

Anthropic Computer Use

Released in October 2024 as a public beta, Computer Use enables Claude models to interact with computer interfaces:^[10]

Approach: Visual perception and simulated input
Models: Claude 3.5 Sonnet optimized for computer use
Applications: Desktop and web automation

Google Project Mariner

Experimental agent from Google DeepMind for autonomous web navigation:^[11]

Foundation: Built on Gemini 2.0
Focus: Research-oriented, multimodal understanding
Performance: 83.5% on WebVoyager benchmark

Applications and Use Cases

Enterprise Automation

Business Process Automation: Replacing traditional RPA with adaptive agents
Data Integration: Cross-platform data extraction and synchronization
Compliance Monitoring: Automated regulatory checks and reporting
Supply Chain Management: Vendor portal navigation and order processing

E-Commerce and Services

Price Monitoring: Real-time competitor analysis and dynamic pricing
Inventory Management: Multi-marketplace stock synchronization
Customer Service: Automated order tracking and status updates
Review Aggregation: Sentiment analysis across platforms

Research and Analysis

Academic Research: Literature review automation, citation gathering
Market Intelligence: Trend analysis, competitor monitoring
Financial Analysis: Earnings report extraction, market data compilation
Patent Research: Prior art searches, classification analysis

Quality Assurance

Automated Testing: End-to-end user journey validation^[20]
Accessibility Testing: WCAG compliance verification
Cross-browser Testing: Compatibility validation across platforms
Performance Testing: Load testing, response time measurement

Personal Productivity

Travel Planning: Multi-site booking coordination
Job Applications: Resume parsing, application submission
Content Aggregation: News curation, research compilation
Social Media Management: Cross-platform posting and monitoring

Challenges and Limitations

Technical Challenges

Dynamic Content Handling:
- Single-page applications (SPAs) that update content via AJAX, without full page loads, make navigation state harder to track
- Asynchronous loading requires sophisticated waiting strategies
- Virtual scrolling and lazy loading complicate element discovery^[21]
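A common waiting strategy for asynchronously loaded content is to poll for a condition with a timeout instead of sleeping for a fixed interval. The sketch below is a generic, library-independent illustration: `wait_until` and its `probe` callback are hypothetical names, and automation frameworks such as Playwright ship their own built-in waiting that would normally be preferred.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], Optional[T]],
               timeout: float = 5.0,
               interval: float = 0.1,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `probe` until it returns a non-None value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)
```

Here `probe` would typically look up an element and return it once it exists, so the agent proceeds as soon as the content appears rather than after a worst-case delay.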

Element Identification:
- Shadow DOM and iframes create isolation barriers
- Dynamic ID generation prevents reliable selectors
- Similar elements require disambiguation strategies^[22]

State Management:
- Session persistence across page transitions
- Handling authentication and 2FA
- Recovery from unexpected logouts or timeouts^[23]

Error Recovery:
- CAPTCHA solving limitations
- Popup and modal dialog handling
- Network failure and retry logic^[24]
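Retry logic for transient failures such as network errors is often written as a bounded loop with exponential backoff. The following is a minimal sketch; the `retry` helper, its defaults, and the choice of exception types to retry on are assumptions rather than the behavior of any specific agent.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          attempts: int = 3,
          base_delay: float = 0.5,
          retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying on transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

Only errors listed in `retry_on` are retried; anything else, such as a validation error, propagates immediately, since repeating the same action would not help.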

Performance Limitations

Performance Bottlenecks in Browser-Use Agents
Issue	Impact	Current Solutions	Future Approaches
LLM Inference Latency	2-5 second delays per action	Caching, batching	Edge deployment, model optimization
Token Consumption	$0.10-1.00 per complex task	Efficient prompting, mode selection	Specialized models, compression
Memory Limitations	Context window constraints	Summarization, pruning	Extended context models
Reliability	60-90% success rates	Retry logic, fallbacks	Reinforcement learning, self-improvement
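The "pruning" mitigation for context window constraints can be sketched as keeping the task description plus as many of the most recent messages as fit in a token budget. This is an illustration under assumptions: `prune_history` is a hypothetical helper, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
from typing import Callable, List


def prune_history(messages: List[str], budget: int,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split())
                  ) -> List[str]:
    """Keep the first message (the task) and the newest messages that fit."""
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    remaining = budget - count_tokens(head)
    kept: List[str] = []
    for message in reversed(tail):   # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break                    # older messages are dropped
        kept.append(message)
        remaining -= cost
    return [head] + kept[::-1]       # restore chronological order
```

Dropped messages could instead be replaced by a short summary, which is the "summarization" variant named in the same table row.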

Safety, Security, and Ethical Concerns

Security Risks:^[8]
- Prompt injection attacks from malicious websites
- Credential theft through compromised agents
- Cross-site scripting (XSS) vulnerabilities
- Data exfiltration risks

Privacy Concerns:
- Processing of sensitive personal information
- Screenshot capture of private data
- Audit trail storage and retention

Misuse Prevention:
- Automated spam and harassment potential
- Unauthorized web scraping at scale
- Terms of service violations
- Legal compliance challenges

Notable Research Projects

Academic Initiatives

WebGPT (OpenAI): Pioneering work on browser-assisted question-answering with human feedback, establishing foundations for modern BUAs^[5]

Mind2Web (OSU & Allen AI): Large-scale dataset and framework for developing generalist web agents, testing cross-website generalization^[3]

WebArena (CMU): Realistic benchmark environment with self-hosted websites for reproducible agent evaluation^[1]

AgentTuning: Research on enabling generalized agent abilities through fine-tuning LLMs for web tasks^[15]

Industry Research

Adept ACT-1: Universal action transformer for software interface control^[9]

Google Multimodal Web Navigation: Research on instruction-finetuned foundation models for web navigation^[25]

Open Source Frameworks

LangChain/LlamaIndex: Provide building blocks for BUA development^[26]
BrowserGym: Unified ecosystem for web agent research and evaluation^[14]
Skyvern: Open-source browser automation with LLMs and computer vision^[24]

Future Directions

Technical Developments

Enhanced Perception:
- Perception tokens for improved visual reasoning^[27]
- 3D understanding of page layouts
- Video comprehension for dynamic content

Improved Planning:
- Hierarchical task decomposition
- Long-horizon reasoning capabilities
- Multi-objective optimization

Self-Improvement:
- Online learning from execution feedback
- Automated test generation and validation
- Self-healing adaptation to website changes^[20]

Industry Adoption Trajectory

2025-2026: Early adoption in QA and testing
2026-2027: Enterprise RPA replacement begins
2027-2028: Consumer-facing agent assistants
2028-2030: Ubiquitous web automation

Emerging Standards and Protocols

Development of agent-specific web protocols
Standardized safety and permission frameworks
Industry benchmarks for reliability certification
Regulatory compliance frameworks

Comparison with Related Technologies

Browser-Use Agents vs. Related Technologies
Aspect	Browser-Use Agent (BUA)	Computer-Use Agent (CUA)	Traditional RPA	Web Scraping
Scope	Web browsers	Full desktop OS	Predefined workflows	Data extraction only
Adaptability	High (LLM-based)	High (LLM-based)	Low (scripted)	Low (rule-based)
Setup Complexity	Medium	High	High	Low
Maintenance	Self-adapting	Self-adapting	Frequent updates needed	Regular updates needed
Cost	$0.10-1.00/task	$0.50-2.00/task	High initial, low per-task	Low
Use Cases	General web automation	Any desktop application	Repetitive business processes	Data collection
Error Handling	Intelligent recovery	Intelligent recovery	Basic retry logic	Minimal

References

[1] Zhou, Shuyan et al. (2023). "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv preprint arXiv:2307.13854. Retrieved from https://arxiv.org/abs/2307.13854
[2] OpenAI. (2025). "Introducing Operator." Retrieved from https://openai.com/index/introducing-operator/
[3] Deng, Xiang et al. (2023). "Mind2Web: Towards a Generalist Agent for the Web." arXiv preprint arXiv:2306.06070. Retrieved from https://arxiv.org/abs/2306.06070
[4] Browser-Use Team. (2025). "Browser-Use: Make websites accessible for AI agents." GitHub. Retrieved from https://github.com/browser-use/browser-use
[5] Nakano, Reiichiro et al. (2021). "WebGPT: Browser-assisted question-answering with human feedback." arXiv preprint arXiv:2112.09332. Retrieved from https://arxiv.org/abs/2112.09332
[6] VentureBeat. (2025). "The rise of browser-use agents: Why Convergence's Proxy is beating OpenAI's Operator." Retrieved from industry sources
[7] The Verge. (2025). "OpenAI's new Operator AI agent can do things on the web for you." Retrieved from https://www.theverge.com/2025/1/23/operator-openai
[8] OpenAI. (2025). "Computer-Using Agent." Retrieved from https://openai.com/index/computer-using-agent/
[9] Adept AI. (2022). "ACT-1: A new frontier for models that take actions on your computer." Retrieved from https://www.adept.ai/blog/act-1
[10] Anthropic. (2024). "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku." Retrieved from https://www.anthropic.com/news/3-5-models-and-computer-use
[11] DataCamp. (2025). "OpenAI's Operator: Examples, Use Cases, Competition & More." Retrieved from https://www.datacamp.com/blog/operator
[12] InfoWorld. (2025). "Browser Use: An open-source AI agent to automate web-based tasks." Retrieved from https://www.infoworld.com/article/3812644/browser-use-an-open-source-ai-agent-to-automate-web-based-tasks.html
[13] OpenAI. (2025). "Introducing ChatGPT agent: bridging research and action." Retrieved from https://openai.com/index/introducing-chatgpt-agent/
[14] Le Sellier De Chezelles, T. et al. (2024). "The BrowserGym Ecosystem for Web Agent Research." arXiv preprint arXiv:2412.05467. Retrieved from https://arxiv.org/abs/2412.05467
[15] Zeng, Aohan et al. (2023). "AgentTuning: Enabling Generalized Agent Abilities for LLMs." arXiv preprint arXiv:2310.12823. Retrieved from https://arxiv.org/abs/2310.12823
[16] DZone. (2025). "Build an AI Browser Agent With LLMs, Playwright, Browser-Use." Retrieved from https://dzone.com/articles/build-ai-browser-agent-llms-playwright-browser-use
[17] ADASCI. (2025). "A Practical Guide to Enabling AI Agent Browser Control using Browser-use." Retrieved from https://adasci.org/a-practical-guide-to-enabling-ai-agent-browser-control-using-browser-use/
[18] Web-Arena-x. (2024-2025). "VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks." GitHub. Retrieved from project site
[19] Yao, Shunyu et al. (2022). "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents." arXiv preprint arXiv:2207.01206. Retrieved from https://arxiv.org/abs/2207.01206
[20] Ministry of Testing. (2025). "Creating self-healing automated tests with AI and Playwright." Retrieved from https://www.ministryoftesting.com/articles/creating-self-healing-automated-tests-with-ai-and-playwright
[21] Skyvern. (2025). "Web Bench - A new way to compare AI Browser Agents." Retrieved from https://blog.skyvern.com/web-bench-a-new-way-to-compare-ai-browser-agents/
[22] Browser-Use. (2025). "Browser Use = state of the art Web Agent." Retrieved from https://browser-use.com/posts/sota-technical-report
[23] OpenAI. (2025). "Context Engineering - Short-Term Memory Management with Sessions." Retrieved from https://cookbook.openai.com/examples/agents_sdk/session_memory
[24] Skyvern. (2025). "Automate browser-based workflows with LLMs and Computer Vision." GitHub. Retrieved from https://github.com/Skyvern-AI/skyvern
[25] Furuta, Hiroki et al. (2023). "Multimodal Web Navigation with Instruction-Finetuned Foundation Models." arXiv preprint arXiv:2305.11854. Retrieved from https://arxiv.org/abs/2305.11854
[26] LangChain. (2025). "Agent architectures." Retrieved from https://langchain-ai.github.io/langgraph/concepts/agentic_concepts/
[27] arXiv. (2024). "Perception Tokens Enhance Visual Reasoning in Multimodal Language Models." Retrieved from https://arxiv.org/html/2412.03548v1
