#53

HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities

Xiaoxue Ren, Penghao Jiang, Kaixin Li, Zhiyong Huang, Xiaoning Du, Jiaojiao Jiang, Zhenchang Xing, Jiamou Sun, Terry Yue Zhuo

2025 | arXiv (preprint)

2510.12200

benchmark vulnerability-assessment fully-autonomous single-agent

PDF Preview 论文预览

Loading PDF... 加载 PDF 中...

Problem & Motivation 问题与动机

While computer-use agents (CUAs) have shown strong capabilities in web browsing and visual task automation, their potential to discover and exploit web application vulnerabilities through graphical interfaces remains unknown. Existing benchmarks for CUAs assume sanitized environments and overlook security aspects.

虽然计算机使用智能体（CUA）在网页浏览和视觉任务自动化方面表现出强大能力，但它们通过图形界面发现和利用 Web 应用程序漏洞的潜力尚不清楚。现有的 CUA 基准测试通常假设环境是经过清理的，忽略了安全方面的问题。

Modern web applications require visual understanding, dynamic content rendering, and multi-step interactive workflows that only CUAs can handle. As CUAs increasingly operate autonomously in vulnerable real-world web environments, understanding their exploitation capabilities is critical for both offensive security research and defensive agent design.

现代 Web 应用程序需要只有 CUA 才能处理的视觉理解、动态内容渲染和多步交互工作流。随着 CUA 越来越多地在易受攻击的真实 Web 环境中自主运行，了解它们的利用能力对于进攻性安全研究和防御性智能体设计都至关重要。

Threat Model 威胁模型

CUAs operating autonomously in web environments containing realistic security vulnerabilities. Agents interact through visual interfaces (screenshots, accessibility trees) and have access to industry-standard security tools in a Kali Linux environment.

CUA 在包含现实安全漏洞的 Web 环境中自主运行。智能体通过视觉界面（屏幕截图、无障碍树）进行交互，并能访问 Kali Linux 环境中的工业标准安全工具。

Methodology 核心方法

HackWorld is an evaluation framework that exposes CUAs to 36 curated web applications spanning 11 frameworks and 7 programming languages, each containing realistic vulnerabilities including injection flaws, authentication bypasses, and unsafe input handling. Agents interact through visual interfaces (screenshots, a11y trees, Set-of-Marks) and can use over 20 security tools from a Kali Linux environment. Success is measured via Capture-the-Flag methodology with fuzzy flag matching (edit distance threshold of 5).

HackWorld 是一个评估框架，它让 CUA 面对 36 个精心挑选的 Web 应用程序（涵盖 11 个框架和 7 种编程语言），每个程序都包含现实的漏洞，包括注入缺陷、身份验证绕过和不安全的输入处理。智能体通过视觉界面（屏幕截图、无障碍树、标记集 Set-of-Marks）进行交互，并能使用来自 Kali Linux 环境的 20 多种安全工具。成功通过夺旗赛（CTF）方法衡量，并采用模糊 flag 匹配（编辑距离阈值为 5）。

Architecture 架构设计

Modular pipeline: (1) Task Assignment with natural language instructions; (2) Environment Perception via screenshots and accessibility trees; (3) Tool Selection and Execution from the Kali toolkit; (4) Action Execution through an Action Server mediating web interactions; (5) Progress Monitoring via a Controller that logs HTTP requests, tool invocations, and file-system operations. Each challenge runs in an isolated Docker container.

模块化流水线：（1）带有自然语言指令的任务分配；（2）通过屏幕截图和无障碍树进行环境感知；（3）从 Kali 工具包中进行工具选择和执行；（4）通过调节 Web 交互的操作服务器进行操作执行；（5）通过记录 HTTP 请求、工具调用和文件系统操作的控制器进行进度监控。每个挑战都在隔离的 Docker 容器中运行。

LLM Models 使用的大模型

Tool Integration 工具集成

Memory Mechanism 记忆机制

conversation-history

Attack Phases Covered 覆盖的攻击阶段

reconnaissance

scanning

enumeration

exploitation

post exploitation

privilege escalation

lateral movement

reporting

Evaluation 评估结果

CUAs achieve exploitation rates below 12%, with Claude-3.7-Sonnet performing best at 10.18% average success across observation spaces (max 11.11% with Screenshot or Set-of-Marks). UI-TARS-1.5-7B and Qwen-2.5-VL-72B-Instruct achieved 0% in all or most conditions. Larger/newer models do not necessarily outperform older ones, challenging the naive scaling hypothesis for security tasks.

CUA 的利用率低于 12%，其中 Claude-3.7-Sonnet 表现最佳，在各种观察空间下的平均成功率为 10.18%（在使用屏幕截图或标记集时最高达到 11.11%）。UI-TARS-1.5-7B 和 Qwen-2.5-VL-72B-Instruct 在所有或大多数条件下都取得了 0% 的成功率。更大或更新的模型不一定优于旧模型，这挑战了安全任务中的简单缩放假设（scaling hypothesis）。

Environment 评估环境

Metrics 评估指标

Baseline Comparisons 基准对比

Claude-3.5-Sonnet
Claude-3.7-Sonnet
Claude-4-Sonnet
Claude-4-Opus
UI-TARS-1.5-7B
Qwen-2.5-VL-72B-Instruct

Scale 评估规模

36 web CTF challenges from NYY CTF Bench (26), Cybench (8), and InterCode-CTF (2)

Contributions 核心贡献

First evaluation framework (HackWorld) for systematically assessing CUAs' capabilities in exploiting web application vulnerabilities through visual interaction
A comprehensive benchmark of 36 vulnerable web applications spanning 11 frameworks and 7 languages with realistic vulnerabilities
Systematic evaluation revealing critical safety limitations in current CUAs, identifying the core bottleneck as strategic reasoning and tool orchestration rather than perceptual understanding

首个用于系统评估 CUA 通过视觉交互利用 Web 应用程序漏洞能力的评估框架 (HackWorld)
包含 36 个易受攻击 Web 应用程序的综合基准测试，涵盖 11 个框架、7 种语言以及现实的漏洞
系统评估揭示了当前 CUA 在安全性方面的关键局限，确定核心瓶颈是战略推理和工具编排，而非感知理解

Limitations 局限性

Benchmark limited to 36 challenges, which may not cover the full diversity of real-world web vulnerabilities
Only evaluates exploitation through visual/GUI interaction; does not assess terminal-based or API-based exploitation
Maximum step limit of 30 steps may constrain agent performance on complex multi-stage attacks
Challenges are drawn from CTF competitions which, while realistic, may differ from production vulnerabilities in structure and difficulty distribution
Open-source models (UI-TARS, Qwen) achieved 0% success, limiting comparative analysis across model families

基准测试仅限于 36 个挑战，可能无法涵盖现实世界 Web 漏洞的全部多样性
仅评估通过视觉/GUI 交互进行的利用；未评估基于终端或基于 API 的利用
最大 30 步的限制可能会约束智能体在复杂多阶段攻击中的表现
挑战源自 CTF 竞赛，虽然具有现实意义，但在结构和难度分布上可能与生产环境中的漏洞有所不同
开源模型（UI-TARS, Qwen）成功率为 0%，限制了跨模型系列的对比分析

Research Gaps 研究空白

CUAs lack cybersecurity domain knowledge and strategic reasoning for multi-step vulnerability exploitation
Current security tool interfaces (CLI) are designed for human users, not agents; need for agent-oriented tool design (machine-readable outputs, structured error codes)
No existing methods for training or fine-tuning CUAs specifically for security tasks
Gap between perceptual capabilities (agents can read pages) and strategic synthesis (agents fail to assemble clues into exploit plans)
Need for evaluations that measure reasoning and strategic decision-making, not just benchmark accuracy

CUA 缺乏网络安全领域知识以及用于多步漏洞利用的战略推理
当前的安全工具界面（CLI）是为人类用户而非智能体设计的；需要面向智能体的工具设计（机器可读输出、结构化错误代码）
尚无针对安全任务训练或微调 CUA 的现有方法
感知能力（智能体能阅读页面）与战略综合能力（智能体无法将线索组装成利用计划）之间存在差距
需要衡量推理和战略决策而非仅仅是基准测试准确率的评估方法

Novel Techniques 新颖技术

CTF-based evaluation framework specifically designed for computer-use agents interacting through visual interfaces
Multi-observation-space evaluation (Screenshot, Screenshot+a11y tree, Set-of-Marks) for security tasks
Fuzzy flag matching with edit distance threshold to account for OCR errors in multimodal agents
Comprehensive tool integration with 20+ security tools in a Kali Linux environment for agent evaluation

专门为通过视觉界面交互的计算机使用智能体设计的基于 CTF 的评估框架
针对安全任务的多观测空间评估（屏幕截图、屏幕截图+无障碍树、标记集 Set-of-Marks）
考虑到多模态智能体中 OCR 错误的带有编辑距离阈值的模糊 flag 匹配
综合工具集成，在 Kali Linux 环境中提供 20 多种安全工具用于智能体评估

Open Questions 开放问题

How can CUAs be trained or fine-tuned to develop cybersecurity reasoning capabilities?
Can structured planning strategies (e.g., attack trees, kill chains) improve CUA exploitation success rates?
How should security tools be redesigned for agent-oriented use (MCP/Arazzo-style contracts)?
What is the relationship between inference-time scaling (more steps) and security task success beyond the modest gains observed?
How do CUA security capabilities evolve as models continue to improve, given that scaling alone did not help?

如何训练或微调 CUA 以培养其网络安全推理能力？
结构化的规划策略（如攻击树、杀伤链）能否提高 CUA 利用成功率？
应如何为智能体导向的使用重新设计安全工具（如采用 MCP/Arazzo 风格的契约）？
除了观察到的适度收益外，推理时间缩放（更多步骤）与安全任务成功率之间有何关系？
鉴于单纯的缩放并无帮助，随着模型持续改进，CUA 的安全能力将如何演进？

Builds On 基于前人工作

OSWorld
WebArena
NYY CTF Bench
Cybench
InterCode-CTF
PentestGPT
EnIGMA

Open Source 开源信息

Yes - https://github.com/GUI-Agent/HackWorld