#57

Multi-Agent Penetration Testing AI for the Web

Isaac David, Arthur Gervais

2025 | arXiv (preprint)

arXiv:2508.20816v1

Problem & Motivation

AI-powered development platforms are democratizing software creation but introducing a scalability crisis in security assessment, as up to 40% of AI-generated code contains vulnerabilities and the pace of development outstrips the capacity for thorough security auditing.

Existing approaches face critical limitations: they lack rigorous cost-performance analysis, their insufficient vulnerability validation leads to false positives, and commercial systems such as XBOW lack scientific reproducibility. A semantic gap separates pattern-based vulnerability detection from contextual exploitation understanding, a gap that LLM-based multi-agent systems can bridge.

Threat Model

Two distinct threat models: (1) Blackbox local CTF assessment where the system operates as an external attacker with no access to source code, database schemas, or internal configurations, receiving only target URLs and challenge descriptions; (2) Whitebox local assessment of locally cloned open-source repositories with complete source code access. Both operate within strict ethical constraints, avoiding destructive operations, data exfiltration, or persistent system modifications.

Methodology

MAPTA (Multi-Agent Penetration Testing AI) is a three-role, tool-driven multi-agent architecture that couples high-level planning with concrete exploit execution and mandatory proof-of-concept (PoC) validation. A Coordinator agent performs strategy and delegation, Sandbox agents execute tactical steps in isolated Docker containers, and a Validation agent converts candidate findings into verified end-to-end PoCs. The system executes within a bounded loop with explicit stop conditions (validated exploit, budget/time/tool-call caps) and progresses through hypothesis synthesis, targeted dispatch, PoC assembly, and validation/finalization phases.

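
The bounded loop with explicit stop conditions described above can be sketched as follows. This is an illustrative assumption, not the authors' implementation; the class and function names are hypothetical, and the default caps are taken from the early-stopping thresholds reported in the paper (roughly 40 tool calls, $0.30, or 300 seconds):

```python
from dataclasses import dataclass

@dataclass
class Budget:
    """Resource caps for one assessment run (values from the paper's
    suggested early-stopping thresholds; purely illustrative defaults)."""
    max_tool_calls: int = 40
    max_cost_usd: float = 0.30
    max_seconds: float = 300.0

@dataclass
class RunState:
    """Counters accumulated while the Coordinator loop runs."""
    tool_calls: int = 0
    cost_usd: float = 0.0
    elapsed_s: float = 0.0
    validated_exploit: bool = False

def should_stop(state: RunState, budget: Budget) -> bool:
    """Stop on a validated exploit or when any resource cap is reached."""
    return (state.validated_exploit
            or state.tool_calls >= budget.max_tool_calls
            or state.cost_usd >= budget.max_cost_usd
            or state.elapsed_s >= budget.max_seconds)
```

The key design point is that success (a validated exploit) and failure (budget exhaustion) are both explicit, checkable termination conditions rather than open-ended agent wandering.
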
Architecture

Three-agent-role architecture: (1) Coordinator agent handles attack-path reasoning, tool orchestration, and report synthesis with 8 tools including sandbox_agent delegation, run_command, run_python, and email/Slack workflow helpers; (2) Sandbox agents (1..N) execute tactical steps in isolated LLM context within a shared per-job Docker container using run_command and run_python; (3) Validation agent consumes candidate PoC artifacts (HTTP requests, payloads, scripts), verifies exploitability by concrete execution on the per-job Docker container, and returns pass/fail with evidence. Orchestration is dynamic -- the Coordinator decides at runtime whether to delegate to sandbox agents or act directly.

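
The Validation agent's contract, returning pass/fail with concrete evidence, can be sketched as below. This is a hypothetical sketch, not the authors' code: in MAPTA the candidate PoC would execute against the per-job Docker container, whereas here a plain subprocess stands in for that step:

```python
import subprocess
from typing import NamedTuple

class ValidationResult(NamedTuple):
    passed: bool
    evidence: str  # captured output kept as proof-of-concept evidence

def validate_poc(poc_command: str, timeout_s: int = 60) -> ValidationResult:
    """Concretely execute a candidate PoC and return pass/fail with evidence.

    Illustrative assumption: a zero exit code counts as a successful
    exploit run; anything else (or a timeout) is a failed candidate.
    """
    try:
        proc = subprocess.run(poc_command, shell=True, capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return ValidationResult(False, "PoC timed out")
    return ValidationResult(proc.returncode == 0, proc.stdout + proc.stderr)
```

Forcing every candidate finding through such a concrete execution step is what lets the system claim eliminated false positives: a finding without a reproducible PoC never reaches the report.
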
LLM Models

GPT-5

Tool Integration

nmap, ffuf, nikto, curl, sqlmap, dirb, jq_tool, wfuzz, python, bash, amass, httpx

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

MAPTA achieves 76.9% success rate (80/104) on the XBOW benchmark, within 7.7 percentage points of the commercial XBOW system (84.6%). Total cost across all 104 challenges was $21.38 with median cost of $0.073 for successes vs $0.357 for failures. Perfect performance (100%) on SSRF and misconfiguration vulnerabilities, strong results on SSTI (85%), SQL injection (83%), and command injection (75%). In real-world whitebox assessment of 10 open-source applications (8K-70K GitHub stars), MAPTA discovered 19 vulnerabilities across 6 applications, with 14 classified as high or critical severity and 10 findings under CVE review, at an average cost of $3.67 per assessment.

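
As a quick sanity check, the headline figures above follow from simple arithmetic on the reported counts:

```python
# Success rate on the XBOW benchmark: 80 of 104 challenges solved.
success_rate_pct = round(80 / 104 * 100, 1)   # 76.9

# Gap to the commercial XBOW system's reported 84.6%.
gap_pts = round(84.6 - success_rate_pct, 1)   # 7.7

# Mean cost per challenge implied by the $21.38 total. Note the paper
# reports medians, which differ: $0.073 for successes vs $0.357 for
# failures, reflecting a skewed cost distribution.
mean_cost_usd = round(21.38 / 104, 3)         # 0.206
```
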
Environment

XBOW-benchmark, real-world-open-source-applications

Metrics

success-rate, time-to-exploit, cost, num-steps, token-usage

Baseline Comparisons

  • XBOW-commercial-system

Scale

104 XBOW CTF challenges + 10 real-world open-source web applications

Contributions

  • Tool-grounded multi-agent architecture with three specialized roles (Coordinator, Sandbox, Validation) that separates strategic reasoning from tactical execution and mandatory PoC validation to eliminate false positives
  • First rigorous cost-performance accounting for autonomous penetration testing across 104 challenges, tracking token-level I/O (3.2M regular input, 1.10M output, 50.5M cached, 0.595M reasoning) with median cost of $0.117 per challenge
  • Strong negative correlations between resource usage and success (tools r=-0.661, cost r=-0.606, tokens r=-0.587, time r=-0.557) enabling practical early-stopping thresholds at approximately 40 tool calls, $0.30, or 300 seconds
  • Real-world whitebox validation discovering 19 vulnerabilities (14 high/critical) across 10 popular open-source applications including RCEs, command injections, secret exposure, and arbitrary file write, with 10 findings under CVE review
  • Open-source artifacts including code, evaluation results, and fixes for 43 outdated XBOW benchmark Docker images

Limitations

  • Complete failure on blind SQL injection (0% success rate) due to limitations in timing-based attack detection and payload refinement
  • Moderate performance on cross-site scripting (57%) despite being the largest category, suggesting challenges with complex payload crafting and DOM manipulation
  • Low success on broken authentication challenges (33%), indicating need for improved credential analysis and session manipulation capabilities
  • Cannot guarantee zero false positives for complex business logic vulnerabilities that require deep understanding of application-specific workflows
  • Excludes network-level vulnerabilities (SSL/TLS misconfigurations, network protocol vulnerabilities), physical security controls, social engineering, and human factors
  • Real-world whitebox sample size (N=10) is too small to validate the resource-success correlation patterns observed in CTF evaluation
  • Evaluated exclusively on GPT-5 due to limited financial resources, with no comparison across different LLM providers

Research Gaps

  • Timing-based attack detection for blind SQL injection and similar side-channel exploitation remains unsolved for LLM agents
  • Enhanced payload generation and feedback-based exploration strategies needed for XSS and DOM manipulation attacks
  • Authentication flow analysis and session state reasoning capabilities require improvement for auth bypass scenarios
  • Canary placement systems that embed detectable markers throughout application workflows could provide additional exploitation validation for business logic flaws
  • No rigorous cost-performance frameworks existed prior to this work for autonomous penetration testing systems
  • Commercial systems like XBOW lack scientific reproducibility, with methodologies available only through blog posts rather than detailed system architectures

Novel Techniques

  • Mandatory end-to-end PoC validation via dedicated Validation agent that concretely executes candidate exploits to eliminate false positives
  • Dynamic orchestration where Coordinator decides at runtime whether to delegate to sandbox agents or act directly based on task complexity
  • LLM context isolation (separate prompts/memory per sandbox agent) combined with system state sharing (single Docker container) to reduce context bloat while preserving useful runtime state
  • Resource-based early-stopping heuristics derived from negative correlations between resource consumption and success probability
  • Per-job Docker container lifecycle with job-scoped credentials, artifact reuse across sub-tasks, and graceful teardown with secret purging

Open Questions

  • Can multi-agent architectures be extended to handle network-level and infrastructure vulnerabilities beyond the application layer?
  • How would MAPTA perform with open-source LLMs compared to GPT-5, and what is the minimum model capability threshold?
  • Can the early-stopping thresholds generalize across different vulnerability types and application architectures?
  • How to handle business logic vulnerabilities that require deep understanding of application-specific intended behavior?
  • What is the optimal number and specialization of sandbox agents for different target application complexities?

Builds On

  • PentestGPT
  • PenHeal
  • ReAct
  • Toolformer
  • SWE-agent
  • RESTler

Open Source

Yes - https://github.com/arthurgervais/mapta and https://github.com/arthurgervais/validation-benchmarks

Tags