
PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, Stefan Rass

2024 | arXiv (preprint)

arXiv:2308.06782v2

Problem & Motivation

Penetration testing traditionally relies on manual effort and specialized expertise, creating a gap in meeting the growing demand for efficient security evaluations. While LLMs show promise for automating aspects of pentesting, there is no systematic, quantitative assessment of their aptitude in this regard, nor are existing benchmarks comprehensive enough to evaluate progressive accomplishments.

LLMs demonstrate emergent abilities in reasoning and domain-specific problem-solving, but their application to penetration testing is hindered by three key challenges: loss of long-term context during extended testing sessions, over-emphasis on recent tasks (depth-first bias), and hallucination/inaccurate command generation. Existing penetration testing benchmarks are too narrow in scope and only measure final success, not incremental progress.

Threat Model

The system assumes a standard external penetration testing scenario: the tester has network access to the target machine but no prior credentials or insider knowledge. The human operator acts purely as an executor of LLM-generated instructions without contributing expert judgment. Automated vulnerability scanners (e.g., Nessus, OpenVAS) are explicitly excluded to test the LLM's innate capabilities.

Methodology

The paper first constructs a comprehensive penetration testing benchmark of 13 targets (from HackTheBox and VulnHub) decomposed into 182 sub-tasks covering the OWASP Top 10 vulnerabilities. An exploratory study evaluates GPT-3.5, GPT-4, and Bard in a human-in-the-loop setup where a human expert strictly executes the LLM's instructions. Based on the observed LLM limitations (context loss, depth-first bias, hallucination), the authors design PentestGPT, a tripartite system with Reasoning, Generation, and Parsing modules that mirrors the collaborative dynamics of real-world penetration testing teams.

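
The benchmark's progressive scoring can be sketched as follows. This is an illustrative sketch only: the target name and sub-task labels below are hypothetical, not taken from the paper's benchmark.

```python
# Sketch: scoring a benchmark target by progressive sub-task completion
# rather than binary success/failure. Target name and sub-task labels
# are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class Target:
    name: str
    difficulty: str
    subtasks: list                      # ordered sub-tasks for this target
    completed: set = field(default_factory=set)

    def mark_done(self, subtask: str) -> None:
        # Only known sub-tasks count toward progress.
        if subtask in self.subtasks:
            self.completed.add(subtask)

    def completion_rate(self) -> float:
        return len(self.completed) / len(self.subtasks)

target = Target(
    name="example-box",
    difficulty="easy",
    subtasks=["port scanning", "web enumeration", "SQL injection",
              "shell as www-data", "privilege escalation"],
)
target.mark_done("port scanning")
target.mark_done("web enumeration")
print(f"{target.completion_rate():.0%}")  # 2 of 5 sub-tasks -> 40%
```

A model that stalls after enumeration still scores 40% here, which is exactly the kind of partial credit a binary pass/fail benchmark cannot express.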
Architecture

PentestGPT comprises three self-interacting LLM-powered modules, each maintaining its own conversation session: (1) The Reasoning Module acts as a team lead, maintaining a Pentesting Task Tree (PTT) -- a novel natural-language tree representation of the testing state inspired by attack trees -- to preserve context and strategically select the next task. (2) The Generation Module acts as a junior tester, translating high-level sub-tasks into concrete terminal commands or GUI operation descriptions via a two-step Chain-of-Thought process (task expansion, then operation generation). (3) The Parsing Module condenses incoming information, compressing verbose tool outputs, source code, HTTP data, and user intentions into concise summaries. An active feedback mechanism lets users query and optionally update the PTT without disrupting the reasoning context.

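
The three-session loop can be sketched roughly as follows, assuming a generic `chat(session, message)` LLM call. The prompts and function names here are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of the tripartite loop: three independent LLM sessions so that
# parsing noise never pollutes the reasoning context. `chat` is a stand-in
# for a real LLM API call.

def chat(session: list, message: str) -> str:
    """Placeholder LLM call; each module keeps its own message history."""
    session.append(("user", message))
    reply = "..."  # a real model response would go here
    session.append(("assistant", reply))
    return reply

reasoning_session, generation_session, parsing_session = [], [], []

def pentest_step(raw_tool_output: str) -> str:
    # 1. Parsing module compresses verbose tool output into a summary.
    summary = chat(parsing_session,
                   f"Summarize the key findings:\n{raw_tool_output}")
    # 2. Reasoning module updates the PTT and picks the next sub-task.
    subtask = chat(reasoning_session,
                   f"Update the task tree with: {summary}. "
                   "Select the most favorable next sub-task.")
    # 3. Generation module expands the sub-task into concrete commands
    #    (two-step chain-of-thought: expand steps, then emit commands).
    return chat(generation_session,
                f"Break '{subtask}' into detailed steps, then give "
                "exact terminal commands for each step.")
```

Keeping the sessions separate is the point: the Reasoning Module only ever sees compressed summaries, which mitigates both context loss and the depth-first attention bias the exploratory study identified.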
LLM Models

GPT-3.5 (8k token limit), GPT-4 (32k token limit), Bard (LaMDA)

Tool Integration

nmap, nikto, dirb, dirbuster, sqlmap, Burp Suite, GPT-4 code interpreter

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

PentestGPT-GPT-4 achieved a 228.6% increase in sub-task completion over naive GPT-3.5 and 58.6% over naive GPT-4. It solved 6/7 easy and 2/4 medium benchmark targets. On 10 active HackTheBox machines, it completed 4 easy and 1 medium challenge at a total cost of $131.5 USD. In picoMini CTF, it solved 9/21 challenges scoring 1400/4200 points, ranking 24th out of 248 teams.

Environment

HackTheBox, VulnHub, picoMini CTF (Carnegie Mellon / redpwn)

Metrics

success-rate, task-completion, sub-task-completion-rate, cost

Baseline Comparisons

  • GPT-3.5 (naive usage)
  • GPT-4 (naive usage)
  • Bard (naive usage)
  • PentestGPT-no-Parsing (ablation)
  • PentestGPT-no-Generation (ablation)
  • PentestGPT-no-Reasoning (ablation)
  • Official walkthroughs and certified penetration testers (OSCP)

Scale

13 benchmark targets with 182 sub-tasks, 10 active HackTheBox machines, 21 picoMini CTF challenges

Contributions

  • A comprehensive penetration testing benchmark with 13 targets (HackTheBox + VulnHub) decomposed into 182 sub-tasks covering OWASP Top 10 and 18 CWE items, enabling progressive accomplishment tracking rather than binary success/failure evaluation.
  • The first systematic and quantitative exploratory study evaluating LLM capabilities (GPT-3.5, GPT-4, Bard) for penetration testing, yielding five key findings about LLM strengths (tool usage, vulnerability identification, code analysis) and limitations (context loss, depth-first bias, hallucination).
  • PentestGPT, a novel LLM-powered penetration testing framework with a tripartite architecture (Reasoning, Generation, Parsing modules) inspired by real-world pentesting team dynamics, featuring the Pentesting Task Tree (PTT) for structured state representation.
  • Open-source release on GitHub garnering 6,200+ stars and industry collaborations with AWS, Huawei, and ByteDance.

Limitations

  • Cannot process images or visual data, which are crucial in certain penetration testing scenarios (e.g., CAPTCHA, graphical clues).
  • Lacks the ability to employ social engineering techniques or detect subtle contextual cues (e.g., generating custom wordlists from target-specific information).
  • Struggles with accurate exploitation code construction, particularly for low-level bytecode operations and detailed exploitation scripts.
  • All models fail on hard-difficulty targets that feature rabbit holes (seemingly vulnerable but non-exploitable services) and require unique, unpredictable exploitation paths.
  • Still prioritizes brute-force attacks before vulnerability scanning in some cases, reflecting a non-optimal inherited strategy from LLM training data.
  • Requires jailbreak techniques to bypass LLM safety alignment, which impacts reproducibility.
  • Vector database integration was found ineffective due to semantically similar but subtly different pentesting outputs causing confused retrieval.
  • Human-in-the-loop evaluation strategy introduces potential bias since human experts must translate GUI operations and non-textual results into text descriptions.
  • Benchmark machines may have been present in LLM training data (mitigated by selecting post-2021 machines, but not fully eliminated).

Research Gaps

  • No existing benchmark provides progressive accomplishment tracking for penetration testing -- most only measure binary success/failure.
  • LLMs lack the ability to maintain coherent long-term context across extended multi-step penetration testing sessions, even with 32k token windows.
  • No effective method exists to integrate vector databases for pentesting context management due to high semantic similarity between different tool outputs.
  • Multimodal capabilities are absent from current pentesting automation approaches, preventing processing of visual information.
  • LLMs cannot modify or create custom exploitation tools/scripts, which is essential for hard targets requiring novel attack paths.
  • There is no mechanism for LLMs to learn from failed pentesting attempts and adapt their strategies accordingly.
  • The gap between semi-autonomous and fully autonomous penetration testing remains wide -- human expertise is still required for complex reasoning and creative exploitation.

Novel Techniques

  • Pentesting Task Tree (PTT): A novel natural-language tree representation based on attributed trees and attack trees that encodes the ongoing penetration testing state and can be directly interpreted by LLMs, enabling structured reasoning about testing progress.
  • Tripartite multi-session architecture: Separating reasoning (strategic oversight), generation (tactical execution), and parsing (information compression) into independent LLM sessions to mitigate context loss and attention bias.
  • Task tree verification: A validation step ensuring only leaf nodes of the PTT are modified during updates, guarding against LLM hallucination corrupting the overall testing state.
  • Two-step Chain-of-Thought generation: First expanding a sub-task into detailed steps considering available tools, then translating each step into precise terminal commands or GUI instructions.
  • Progressive sub-task decomposition benchmark: Breaking penetration testing targets into granular sub-tasks following NIST 800-115 categories and CWE classifications for fine-grained performance evaluation.

Open Questions

  • Can multimodal LLMs (e.g., GPT-4V) address the image interpretation limitation and handle GUI-heavy pentesting scenarios autonomously?
  • How can LLMs be made to learn and adapt from failed pentesting attempts within a session without catastrophic forgetting?
  • What is the optimal balance between autonomous operation and human guidance for penetration testing across different difficulty levels?
  • Can fine-tuning or retrieval-augmented generation effectively provide LLMs with up-to-date vulnerability and exploit knowledge?
  • How can the PTT representation be extended to handle lateral movement and multi-host network penetration testing scenarios?
  • What safety guardrails are sufficient to prevent misuse of LLM-powered pentesting tools while preserving their utility for legitimate security professionals?
  • Can reinforcement learning from pentesting feedback improve LLM strategic decision-making beyond the static Chain-of-Thought approach?

Builds On

  • Attack trees (Mauw and Oostdijk, 2006)
  • Chain-of-Thought prompting (Wei et al., 2023)
  • AutoHint for prompt optimization (Sun et al., 2023)
  • ExploitFlow (Mayoral-Vilches et al., 2023)
  • NIST 800-115 Technical Guide to Security Testing

Open Source

Yes - https://github.com/GreyDGL/PentestGPT
