#07

AutoPT: How Far Are We from the End2End Automated Web Penetration Testing?

Benlong Wu, Guoqiang Chen, Kejiang Chen, Xiuwei Shang, Jiapeng Han, Yanru He, Weiming Zhang, Nenghai Yu

2024 | arXiv (preprint)

arXiv:2411.01236

Problem & Motivation

Fully automated end-to-end web penetration testing remains unsolved. Existing LLM-based agents can handle individual subtasks but fail at completing the entire penetration testing workflow autonomously due to context window limitations, agents getting stuck in repetitive loops, and model inference capability constraints.

Despite the powerful inference capabilities of LLMs, no existing automated method can solve the end-to-end penetration testing task -- completing the entire process from scanning through exploitation without human involvement and adapting to diverse environments. Current benchmarks (e.g., PentestGPT Bench) lack detailed environment specifications and clear completion targets, making fair and granular evaluation of LLM-based agents impossible. The paper addresses the gap between LLM capabilities on individual subtasks and the need for fully autonomous, multi-stage web penetration testing.

Threat Model

Black-box penetration testing from an external attacker perspective. The tester knows nothing about the internal structure, code, or configuration of the target system. The agent starts with only an IP address and port, and must discover and exploit vulnerabilities through exposed interfaces and services. Victim environments are pre-configured Docker containers from VulnHub with known vulnerabilities.

Methodology

AutoPT introduces the Penetration Testing State Machine (PSM), a finite state machine (FSM) framework that decomposes end-to-end penetration testing into discrete states with defined transitions. The system divides states into Agent states (LLM-driven: Scanning, Reconnaissance, Exploitation) and Rule states (deterministic: Selection, Check). Each state operates independently with its own prompt, tools, and context, passing only essential output to the next state rather than maintaining the full conversation history. This design addresses three identified challenges: context window overflow, agent looping on minor issues, and hallucination from model inference limitations.
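
The PSM idea can be sketched as a plain state-machine loop in which each state handler receives only the previous state's essential output, never the full conversation history. This is an illustrative reconstruction, not the paper's actual code: the handler bodies, state names, and the `TARGET_STRING` marker are stand-ins, and the Agent states are stubbed where AutoPT would invoke an LLM with tools.

```python
from typing import Callable, Dict, Tuple

# Each handler maps the previous state's output to (next_state, output).
Handler = Callable[[str], Tuple[str, str]]

def scanning(prev: str) -> Tuple[str, str]:
    # Agent state: AutoPT would run the Xray scanner via an LLM tool call; stubbed here.
    return "selection", "scan report: CVE-XXXX found on port 80"

def selection(prev: str) -> Tuple[str, str]:
    # Rule state: deterministic parsing/prioritization, no LLM call.
    return "reconnaissance", prev.split(": ", 1)[1]

def reconnaissance(prev: str) -> Tuple[str, str]:
    # Agent state: would query the web for exploit information; stubbed.
    return "exploitation", f"exploit info for {prev}"

def exploitation(prev: str) -> Tuple[str, str]:
    # Agent state: would run terminal/browser exploit attempts; stubbed.
    return "check", "command output containing TARGET_STRING"

def check(prev: str) -> Tuple[str, str]:
    # Rule state: success iff the preset target string appears in the output.
    return ("done" if "TARGET_STRING" in prev else "selection"), prev

HANDLERS: Dict[str, Handler] = {
    "scanning": scanning, "selection": selection,
    "reconnaissance": reconnaissance, "exploitation": exploitation,
    "check": check,
}

def run_psm(start: str = "scanning", max_steps: int = 10) -> str:
    state, payload = start, ""
    for _ in range(max_steps):  # iteration cap prevents infinite looping
        if state == "done":
            return "success"
        state, payload = HANDLERS[state](payload)
    return "failure"
```

Note how `payload` is the only thing carried between states: this is the context-isolation design that keeps each state's prompt small regardless of how long the overall run is.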

Architecture

Five-state FSM architecture with two state types. Agent states (Scanning, Reconnaissance, Exploitation) use LLM-based agents with role-playing prompts, tool definitions, examples, and response format constraints. Rule states (Selection, Check) use deterministic rule-matching logic without LLM calls. The Scanning state runs the Xray open-source vulnerability scanner via a terminal tool. The Selection state parses scan results into a vulnerability library and prioritizes vulnerabilities by harm level and exploitability. The Reconnaissance state queries vulnerability information via Google Search. The Exploitation state attempts to exploit vulnerabilities using terminal commands and Playwright browser automation. The Check state verifies exploitation success by matching output against preset target strings. State transitions are managed via a graph-based routing function modeled as a Mealy machine. The system is built on LangChain and runs within a Kali Linux Docker environment with root access.
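
The graph-based routing can be pictured as a Mealy-style transition table: the next state depends on the current state and the condition observed there (e.g. whether the Check state matched the target string, or whether the vulnerability library still has untried entries). The condition names below are assumptions for illustration, not the paper's exact alphabet.

```python
# Hypothetical transition table over the five PSM states.
TRANSITIONS = {
    ("scanning", "scan_done"):        "selection",
    ("selection", "vuln_chosen"):     "reconnaissance",
    ("selection", "library_empty"):   "end",
    ("reconnaissance", "info_found"): "exploitation",
    ("exploitation", "attempted"):    "check",
    ("check", "target_matched"):      "end",
    ("check", "not_matched"):         "selection",  # fall back and try another vulnerability
}

def route(state: str, condition: str) -> str:
    """Mealy-style routing: next state is a function of (state, input)."""
    return TRANSITIONS[(state, condition)]
```

The `("check", "not_matched") -> "selection"` edge is what lets the system abandon a failed exploit and move to the next prioritized vulnerability instead of looping on one attack path.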

LLM Models

GPT-4o, GPT-4o-mini, GPT-3.5-turbo

Tool Integration

  • Xray (vulnerability scanner)
  • Terminal (Kali Linux shell with all pentesting tools)
  • Playwright (headless browser via LangChain)
  • Google Search (web search for vulnerability information)
  • LangChain (agent framework)

Memory Mechanism

Per-state conversation history; only the essential output of each state is passed on to the next.

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation

Evaluation

AutoPT with GPT-4o-mini achieved the highest overall success rate, 36%, and raised subtask completion from 22% (the best ReAct/PTT baseline) to 41%. AutoPT cut execution time by ~50% and API cost by ~71.6% relative to the ReAct and PTT frameworks. Compared with a human penetration tester, AutoPT costs ~$0.99 versus ~$310 (a 99.6% reduction) and takes ~4.48 h versus ~5 h. Even GPT-3.5 completed 11% of tasks under AutoPT, up from 0% under both ReAct and PTT. AutoPT roughly doubled success rates on simple tasks and achieved a nearly 10x improvement on complex tasks over the baseline frameworks.

Environment

  • VulnHub Docker containers
  • Custom end-to-end pentesting benchmark (20 CVEs)

Metrics

success-rate, task-completion, cost (USD per task), time-to-complete, execution-efficiency

Baseline Comparisons

  • ReAct framework (with GPT-4o, GPT-4o-mini, GPT-3.5)
  • PTT (PentestGPT Penetration Testing Tree) framework (with GPT-4o, GPT-4o-mini, GPT-3.5)
  • Manual human penetration testing

Scale

20 CVE vulnerability environments across 17 penetration testing targets from VulnHub, covering 4 major categories and 6 subcategories of OWASP Top 10 2023

Contributions

  • Developed a fine-grained end-to-end penetration testing benchmark with 20 VulnHub Docker environments covering OWASP Top 10 2023, with manual complexity annotations (simple: <3 steps, complex: >=3 steps) and explicit task goal strings for automated success verification.
  • Designed the Penetration Testing State Machine (PSM), a novel FSM-based agent architecture that decomposes end-to-end pentesting into Agent states (LLM-driven) and Rule states (deterministic), passing only essential context between states to avoid context overflow.
  • Implemented AutoPT, an end-to-end automated penetration testing system based on PSM using LangChain, with five states (Scanning, Selection, Reconnaissance, Exploitation, Check) and three tool types (Terminal, Playwright, Search).
  • Conducted the first systematic and quantitative evaluation of LLM-based agents on end-to-end web penetration testing, identifying three core challenges: context limitations, agent looping, and model inference capability constraints.
  • Conducted a comprehensive pre-experiment evaluating 9 LLMs (including Claude-3.5-Sonnet, Llama-3, Qwen2.5, Mixtral, and GLM-4), finding that only GPT-4o, GPT-4o-mini, and GPT-3.5 were capable of basic pentesting tool operation.

Limitations

  • Victim environments are pre-configured to be insecure (e.g., default dangerous configurations), not representing hardened real-world targets.
  • Only focuses on vulnerability exploitation of known CVEs, not vulnerability discovery or mining of unknown vulnerabilities.
  • Benchmark is limited to 20 CVEs, which may not cover the full diversity of real-world web vulnerabilities.
  • Does not attempt post-exploitation, privilege escalation, lateral movement, or reporting phases.
  • Does not address defense mechanisms against LLM-driven automated attacks.
  • Relies on Xray scanner for initial scanning, which may produce inaccurate or incomplete results.
  • Does not use advanced jailbreaking methods to bypass model alignment, limiting exploitation of certain attack categories blocked by LLM safety policies.
  • Web search results used in reconnaissance may be outdated or erroneous, potentially misleading the agent.
  • LLM safety policy refusals ('I cannot assist with that') still occur during exploitation despite role-playing prompts.

Research Gaps

  • No existing method achieves truly end-to-end automated web penetration testing that adapts to diverse environments without human involvement.
  • Current LLM-based agents struggle with maintaining long interaction histories within context limits during multi-step security tasks.
  • Agents tend to get stuck in depth-first loops on minor issues rather than exploring alternative attack paths, unlike human testers who try diverse approaches.
  • LLM safety policies (refusal to assist with attacks) remain a significant barrier even with role-playing prompts and authorization framing.
  • Non-OpenAI LLMs (Llama-3-70B, Llama-3.1-70B, Qwen2.5-72B, Mixtral-8x22B, GLM-4, Claude-3.5-Sonnet, Claude-3-Opus) all failed basic pentesting pre-experiments, indicating a large capability gap with OpenAI GPT models for tool-use in security tasks.
  • Need for defense mechanisms to detect and counter LLM-driven automated penetration testing attacks (e.g., hallucination detection for identifying LLM-generated attack commands).
  • Model 'unconfidence' -- agents prematurely declare failure after unsuccessful exploit attempts instead of trying alternative approaches.

Novel Techniques

  • Penetration Testing State Machine (PSM): FSM-based agent architecture splitting pentesting into Agent states (LLM-driven) and Rule states (deterministic), formalized as a Mealy machine six-tuple.
  • State-based context isolation: each state receives only the output value of the previous state rather than the full conversation history, solving context overflow in long-running security tasks.
  • Hybrid agent-rule architecture: combining LLM reasoning for complex tasks (reconnaissance, exploitation) with deterministic rules for structured tasks (vulnerability selection, success checking).
  • Vulnerability library construction from scan results with priority-based selection rules (high harm + simple exploitability first).
  • Maximum iteration limits per state to prevent infinite looping, with state transition forcing to avoid getting stuck.
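
The priority-based selection rule above can be sketched as a simple two-key sort: highest harm level first, then simplest exploitability. The field names (`harm`, `steps`) and the harm-level scale are assumptions for illustration; the paper describes the rule only at the level of "high harm + simple exploitability first".

```python
# Hypothetical vulnerability-library entries and the priority rule.
def prioritize(vulns):
    harm_rank = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    # Sort by harm level (most severe first), then by number of exploitation steps.
    return sorted(vulns, key=lambda v: (harm_rank[v["harm"]], v["steps"]))

library = [
    {"id": "CVE-B", "harm": "medium", "steps": 1},
    {"id": "CVE-A", "harm": "critical", "steps": 2},
    {"id": "CVE-C", "harm": "critical", "steps": 1},
]
# prioritize(library) puts CVE-C first: critical harm and the fewest steps.
```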

Open Questions

  • Can FSM-based decomposition scale to more complex real-world penetration testing scenarios with unknown vulnerability types and chained exploits?
  • How would AutoPT perform against hardened or patched systems rather than intentionally vulnerable Docker environments?
  • Can open-source LLMs be fine-tuned on penetration testing data to close the capability gap with GPT-4o?
  • How to incorporate vulnerability discovery (zero-day finding) rather than just exploitation of known CVEs into an automated pipeline?
  • What defense mechanisms are effective against FSM-structured automated penetration testing agents?
  • How to handle the LLM safety policy barrier more reliably for legitimate authorized penetration testing?

Builds On

  • ReAct (Yao et al., 2023)
  • PentestGPT (Deng et al., 2023)
  • LangChain (Chase, 2022)
  • Finite State Machine theory (Yannakakis, 1991; Rich et al., 2008)
  • Happe and Cito (2023) - Getting pwn'd by AI: Penetration Testing with LLMs
  • Wintermute (Happe et al., 2024) - LLMs as Hackers for Linux privilege escalation

Open Source

Yes - https://github.com/Dizzy-K/AutoPT