PentestAgent: Incorporating LLM Agents to Automated Penetration Testing
Problem & Motivation
Manual penetration testing is time-consuming and expensive and requires deep expertise, while existing automated approaches based on attack graphs or reinforcement learning struggle with the enormous state and action spaces of real-world systems. Prior LLM-based systems such as PentestGPT still require substantial human intervention throughout the testing pipeline, limiting their practical automation potential.
LLMs have demonstrated strong reasoning and planning capabilities that could bridge the gap between fully manual pentesting and brittle rule-based automation. However, existing LLM-based pentesting tools either cover only post-exploitation phases or demand continuous human feedback. There is a need for a framework that can autonomously handle the full penetration testing workflow (from intelligence gathering through vulnerability analysis to exploitation) with minimal human involvement.
Threat Model
External penetration testing scenario targeting web applications and online services. The attacker has network connectivity to the target but no prior credentials or insider knowledge. Each target contains a single exploitable vulnerability. The system assumes standard PTES (Penetration Testing Execution Standard) methodology and does not address sophisticated multi-step attack chains combining multiple vulnerabilities.
Methodology
PentestAgent employs a multi-agent architecture with four specialized LLM agents that collaborate across three PTES stages: intelligence gathering, vulnerability analysis, and exploitation. The Reconnaissance Agent iteratively executes scanning tools using chain-of-thought reasoning. The Search Agent performs hierarchical online searches across vulnerability databases and code repositories, building a RAG-based knowledge base. The Planning Agent identifies attack surfaces and suitable exploits from the knowledge base. The Execution Agent carries out exploits with self-reflection for error debugging and iterative refinement.
Architecture
The system consists of four specialized agents organized around shared databases. (1) The Reconnaissance Agent uses chain-of-thought prompting to iteratively query tools like Nmap and ObserverWard, storing structured findings in an environmental information database. (2) The Search Agent performs hierarchical searches across Google, Snyk, AVD (Alibaba Vulnerability Database), GitHub, and ExploitDB, then uses RAG-based analysis to extract and organize knowledge into a hierarchical tree structure of vulnerabilities, applicable versions, and exploit procedures. (3) The Planning Agent leverages the knowledge base to identify attack surfaces and match them against detected services and application versions. (4) The Execution Agent handles exploit preparation and execution with a self-reflection mechanism for debugging errors and iteratively refining the attack approach.
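The four-stage pipeline above can be sketched as follows. This is a minimal, hypothetical illustration: agent names follow the paper, but every function body, field name, and the CVE example are stand-ins, not the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class EnvInfoDB:
    """Shared environmental information database for reconnaissance findings."""
    findings: list = field(default_factory=list)

def reconnaissance(target: str, db: EnvInfoDB) -> None:
    # The real agent iteratively drives Nmap/ObserverWard with
    # chain-of-thought prompting; here we record one stub finding.
    db.findings.append({"target": target, "service": "http", "version": "2.4.49"})

def search(db: EnvInfoDB) -> dict:
    # The real agent searches Google/Snyk/AVD/GitHub/ExploitDB and builds a
    # RAG knowledge base; stubbed here as a vuln-id -> metadata mapping.
    return {"CVE-2021-41773": {"versions": ["2.4.49"], "exploit": "path_traversal"}}

def plan(db: EnvInfoDB, kb: dict) -> list:
    # Planning Agent: match detected service versions against KB entries.
    return [
        {"vuln": vuln, "exploit": meta["exploit"]}
        for finding in db.findings
        for vuln, meta in kb.items()
        if finding["version"] in meta["versions"]
    ]

def execute(plan_item: dict, run, max_attempts: int = 3) -> bool:
    # Execution Agent's self-reflection loop: on failure the agent would ask
    # the LLM to diagnose the error and refine the command before retrying.
    for _ in range(max_attempts):
        if run(plan_item):
            return True
    return False
```

The sketch makes the data flow explicit: reconnaissance populates the shared database, search produces the knowledge base, planning joins the two, and execution retries with feedback.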
LLM Models
GPT-4o, GPT-3.5-turbo, o1-mini, and Llama 3.1-8B-Instruct (evaluated as interchangeable backends within the framework)
Tool Integration
Nmap and ObserverWard for reconnaissance; Google, Snyk, AVD, GitHub, and ExploitDB as search sources
Memory Mechanism
Environmental information database for structured reconnaissance findings; RAG-based knowledge base for vulnerability and exploit knowledge
RAG
Yes: the Search Agent builds a hierarchical, tree-structured knowledge base of vulnerabilities, applicable versions, and exploit procedures
Attack Phases Covered
Intelligence gathering, vulnerability analysis, and exploitation (the first three PTES stages); post-exploitation is out of scope
Evaluation
PentestAgent with GPT-4o achieved a 74.2% overall success rate on 67 VulHub targets, while GPT-3.5 reached 60.6%. On easy tasks, GPT-4o achieved 81.8% exploitation completion. On 11 HackTheBox challenges, PentestAgent fully exploited 6 machines compared to PentestGPT's 3. PentestAgent completed intelligence gathering in 220 seconds versus PentestGPT's 1,199 seconds, demonstrating a roughly 5x speedup.
Environment
VulHub vulnerable Docker environments (local) and HackTheBox online challenge machines
Metrics
Overall success rate, per-stage completion rate (intelligence gathering, vulnerability analysis, exploitation), and time to complete each stage
Baseline Comparisons
- PentestGPT
- GPT-3.5-turbo (within PentestAgent)
- GPT-4o (within PentestAgent)
- o1-mini (within PentestAgent)
- Llama 3.1-8B-Instruct (within PentestAgent)
Scale
67 VulHub targets spanning 32 CWE categories (50 easy, 11 medium, 6 hard) and 11 HackTheBox challenges (9 easy, 1 medium, 1 hard)
Contributions
- A multi-agent LLM framework with four specialized agents (Reconnaissance, Search, Planning, Execution) that automates the full PTES penetration testing workflow with minimal human intervention.
- A RAG-based hierarchical knowledge base system that dynamically retrieves and organizes vulnerability information from multiple online sources (Google, Snyk, AVD, GitHub, ExploitDB) to augment LLM reasoning during pentesting.
- A comprehensive penetration testing benchmark comprising 67 VulHub targets spanning 32 CWE categories and 8 OWASP Top 10 vulnerability types, plus 11 HackTheBox challenges, with difficulty stratification.
- Systematic evaluation across four LLM models (GPT-4o, GPT-3.5, o1-mini, Llama 3.1-8B) with stage-level performance analysis revealing that exploitation remains the bottleneck while intelligence gathering and vulnerability analysis achieve higher completion rates.
- Open-source release of the framework and benchmark datasets at https://github.com/nbshenxm/pentest-agent.
Limitations
- Struggles to detect non-standalone web components (e.g., PHPMailer, Ghostscript, PHPUnit) that are embedded within larger applications, leading to missed attack surfaces.
- Cannot handle exploits requiring domain-specific knowledge such as Samba credential configurations or Java deserialization techniques without additional human guidance.
- Fails on tasks requiring user interaction such as file uploads, where the exploit chain involves manual steps that the agent cannot automate.
- Subject to LLM hallucinations during exploitation that can cause cascading errors, where an incorrect command leads to compounding failures in subsequent steps.
- Limited to single-vulnerability exploitation per target; does not support sophisticated multi-step attack chains that combine multiple vulnerabilities.
- Performance degrades significantly on hard-difficulty targets (50% exploitation with GPT-4o on hard tasks vs. 81.8% on easy tasks), indicating difficulty with complex reconnaissance and exploitation scenarios.
- Does not cover post-exploitation phases such as privilege escalation, lateral movement, or persistence, limiting its applicability to full red-team engagements.
- Open-source LLM (Llama 3.1-8B) performs substantially worse than proprietary models, limiting deployment in environments where API-based models cannot be used.
Research Gaps
- Automated detection of embedded or non-standalone components within web applications remains unsolved, creating blind spots in vulnerability discovery.
- No effective mechanism exists for LLM agents to autonomously handle interactive exploitation steps such as file uploads or multi-stage user interactions.
- LLM hallucination mitigation in security-critical contexts like exploitation is underdeveloped; cascading errors from hallucinated commands can derail entire attack chains.
- Multi-vulnerability chaining and complex attack path planning across multiple hosts is beyond current LLM agent capabilities.
- The gap between open-source and proprietary LLMs for security tasks is significant, and fine-tuning strategies for pentesting-specific tasks are unexplored.
- There is no standardized cost-effectiveness metric for LLM-based pentesting that balances token costs against testing thoroughness and success rates.
Novel Techniques
- Hierarchical RAG-based knowledge base for pentesting: Dynamically constructs a tree-structured knowledge base from online vulnerability sources (Snyk, AVD, ExploitDB) during testing, organizing information by vulnerability type, affected versions, and exploit procedures.
- Four-agent specialization pattern: Separates pentesting into distinct cognitive roles (reconnaissance, search/knowledge-building, planning, execution) rather than using a single monolithic agent, enabling each agent to be optimized for its specific subtask.
- Self-reflection mechanism for exploitation: The Execution Agent detects errors during exploit execution and iteratively refines its approach through a debugging feedback loop, improving resilience to initial failures.
- Structured output formatting for inter-agent communication: Uses standardized data formats to ensure pipeline compatibility between the four agents, enabling reliable information flow across the testing stages.
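The hierarchical knowledge-base structure can be sketched as a small tree of dataclasses: vulnerability nodes carry applicable versions and exploit procedures, and planning reduces to a version-match lookup. The schema below (class and field names, the `match` method, the CVE example) is an assumed illustration, not the paper's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class ExploitProcedure:
    source: str    # e.g. "ExploitDB" or a GitHub PoC repository
    steps: list    # ordered exploit steps for the Execution Agent

@dataclass
class VulnNode:
    vuln_id: str             # e.g. a CVE identifier
    affected_versions: list  # versions this vulnerability applies to
    procedures: list = field(default_factory=list)

@dataclass
class KnowledgeBase:
    nodes: dict = field(default_factory=dict)  # vuln_id -> VulnNode

    def add(self, node: VulnNode) -> None:
        self.nodes[node.vuln_id] = node

    def match(self, detected_version: str) -> list:
        # The Planning Agent's lookup: which known vulnerabilities apply
        # to a service version reported by the Reconnaissance Agent?
        return [n for n in self.nodes.values()
                if detected_version in n.affected_versions]
```

Keeping versions and procedures as children of each vulnerability node is what lets the Planning Agent filter candidates by detected version before handing a concrete procedure to the Execution Agent.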
Open Questions
- Can the RAG-based knowledge base be extended to support multi-vulnerability attack chains, where information from one exploit informs the planning of subsequent exploits?
- How can LLM agents be made to handle interactive exploitation steps (file uploads, CAPTCHA, multi-factor authentication) without human intervention?
- What fine-tuning or distillation strategies could bring open-source LLMs closer to GPT-4o performance for pentesting tasks, enabling deployment in restricted environments?
- Can the multi-agent architecture be extended to cover post-exploitation, privilege escalation, and lateral movement phases for full red-team automation?
- How should hallucination detection and recovery mechanisms be designed specifically for security-critical agent workflows where errors can cascade?
- What is the optimal balance between online search (dynamic knowledge retrieval) and pre-built knowledge bases for pentesting agents, considering latency, cost, and accuracy tradeoffs?
- Can reinforcement learning from pentesting outcomes be used to improve agent planning strategies beyond static chain-of-thought prompting?
Builds On
- PentestGPT (Deng et al., 2024)
- AutoAttacker (Xu et al., 2024)
- Happe et al. (2023) LLM-based pentesting
- MDP/POMDP-based attack planning (Boddy et al., 2005)
- Reinforcement learning for pentesting
- PTES (Penetration Testing Execution Standard)
Open Source
Yes - https://github.com/nbshenxm/pentest-agent