#05

AutoPenGPT: Highly automated penetration testing framework based on LLM

Tianqi Jiang

2025 | University of Auckland (Master's Thesis) (preprint)

Problem & Motivation

Existing LLM-based penetration testing frameworks suffer from poor context memory across multi-step tasks, high susceptibility to hallucinations, and low levels of automation, making them ineffective in dynamic, complex cyber environments.


The global cybersecurity talent shortage (3.5 million unfilled positions) combined with the rising cost of cybercrime (projected to reach $10.29 trillion by 2025) demands more automated penetration testing (PT) solutions. Current tools such as PentestGPT struggle to maintain context over extended operations, frequently hallucinate commands, and require substantial human intervention, limiting their practical utility in real-world penetration testing.


Threat Model

Standard penetration testing threat model where the tester has network access to a target system and seeks to identify and exploit vulnerabilities to gain unauthorized access or escalate privileges. Testing is performed in controlled VulnHub virtual machine environments with known vulnerability classes.


Methodology

AutoPenGPT integrates three core technologies: (1) Retrieval-Augmented Generation (RAG) with dual knowledge bases (a temporary session-specific knowledge base and a persistent MITRE ATT&CK-based knowledge base) to ground LLM outputs in factual information; (2) an LLM Agent module using the ReAct (reason, act, observe) framework for dynamic task planning and execution with short-term and long-term memory; and (3) a Mixture of Experts (MoE) system where specialized expert models handle different PT facets (vulnerability scanning, asset analysis, attack simulation), with LLM-driven dynamic task routing replacing traditional fixed gated networks.

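
The dual knowledge base design can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: the keyword-overlap `retrieve` stands in for vector search, and the `ground_prompt` helper, the sample ATT&CK rows, and the target details are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Toy keyword-overlap store standing in for a vector database."""
    entries: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Rank entries by shared-word count; the real system uses embeddings.
        q = set(query.lower().split())
        ranked = sorted(self.entries,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]

# Persistent MITRE ATT&CK-style KB: survives across sessions.
attack_kb = KnowledgeBase([
    "T1046 Network Service Discovery: scan open ports with nmap",
    "T1068 Exploitation for Privilege Escalation",
])
# Temporary session KB: rebuilt per engagement, cleared afterwards.
session_kb = KnowledgeBase(["Target 10.0.0.5 runs vsftpd 2.3.4 on port 21"])

def ground_prompt(task: str) -> str:
    """Prepend retrieved facts so the LLM answers from evidence, not memory."""
    context = session_kb.retrieve(task) + attack_kb.retrieve(task)
    return "Context:\n" + "\n".join(context) + f"\nTask: {task}"

prompt = ground_prompt("scan open ports on target 10.0.0.5")
session_kb.entries.clear()  # session KB is wiped for data independence
```

The clearing step mimics the framework's behaviour of wiping the temporary KB after each test while the ATT&CK KB persists across sessions.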

Architecture

Modular multi-agent architecture consisting of five main modules: (1) Decision Module - strategic task tree generation and management with counterfactual analysis for loop prevention (max 5 retry cycles); (2) Expert Module - MoE-based task delegation with multi-threaded parallel execution; (3) Analyser Module - RAG-enhanced result analysis feeding back to Decision Module; (4) Summarizer Module - structured JSON report generation stored in vector database; (5) Util Module - logging, configuration management, command execution, and token compression. The system runs on Kali Linux and integrates with external PT tools.

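
The Decision Module's loop guard (max 5 retry cycles, then a counterfactual check) can be illustrated roughly as follows. All names here are hypothetical; the thesis describes the behaviour, not this exact code.

```python
MAX_RETRIES = 5  # retry threshold reported in the paper

def run_task_tree(tasks, execute, counterfactual_check):
    """Execute tasks, pruning ones judged unsolvable after repeated failure."""
    results, pruned = {}, []
    for task in tasks:
        for attempt in range(1, MAX_RETRIES + 1):
            outcome = execute(task, attempt)
            if outcome is not None:          # success: record and move on
                results[task] = outcome
                break
        else:
            # Counterfactual reasoning: "would any variant of this task have
            # succeeded?" If not, mark it unsolvable instead of looping forever.
            if not counterfactual_check(task):
                pruned.append(task)
    return results, pruned

# Toy drivers: the port scan succeeds on attempt 2, the exploit never succeeds.
def fake_execute(task, attempt):
    if task == "scan" and attempt >= 2:
        return "22/tcp open"
    return None

results, pruned = run_task_tree(["scan", "exploit-smb"],
                                fake_execute,
                                counterfactual_check=lambda t: False)
# → results == {"scan": "22/tcp open"}, pruned == ["exploit-smb"]
```

In the full system the pruned tasks feed back into the task tree, letting the Decision Module reallocate effort rather than retry indefinitely.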

LLM Models

GPT-4

Tool Integration

nmap, Metasploit, Burp Suite, FAISS vector database, Sentence-BERT

Memory Mechanism

RAG

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post-exploitation
privilege escalation
lateral movement
reporting

Evaluation

AutoPenGPT achieved 100%/60%/20% task completion rates on simple/medium/complex tasks respectively, outperforming all baselines on medium and complex tasks (PentestGPT: 100%/40%/0%, Nessus: 80%/20%/0%). Context retention rates were 91.3%/78.6%/57.1% (vs PentestGPT's 71.2%/52.7%/33.4%). Hallucination rates were reduced to 7.3%/9.2%/12.7% compared to PentestGPT's 8.9%/11.4%/15.3%, with user intervention needs of 0%/18%/26% vs PentestGPT's 0%/30%/42%.


Environment

VulnHub

Metrics

task completion rate, time efficiency improvement, context retention rate, context utilization accuracy, hallucination rate, severe hallucination rate, user intervention demand rate

Baseline Comparisons

  • Nessus
  • DeepExploit
  • PentestGPT

Scale

15 VulnHub targets (5 simple, 5 medium, 5 difficult)

Contributions

  • Advanced context management framework using dual RAG knowledge bases (temporary session-specific and persistent MITRE ATT&CK-based) that significantly improves context retention and accuracy in multi-step penetration testing
  • Hallucination mitigation through counterfactual analysis strategy in the Decision Module, which detects looping tasks (threshold of 5 cycles) and uses counterfactual reasoning to identify unsolvable tasks, reducing severe hallucination rates to 1.8-5.5%
  • Novel automated PT framework integrating RAG, an LLM Agent with a ReAct workflow, and a Mixture of Experts (MoE) architecture whose LLM-driven dynamic gating replaces traditional fixed gated networks for task routing
  • Modular multi-agent architecture with five specialized modules (Decision, Expert, Analyser, Summarizer, Util) supporting parallel task execution and dynamic task tree adjustment
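
The LLM-driven gating contribution can be sketched as a routing prompt rather than a trained gating network. Everything here is illustrative: the expert names, the prompt wording, and the keyword-based `stub_llm` (which stands in for a real model call) are assumptions, not the thesis code.

```python
EXPERTS = {
    "vulnerability_scanner": "runs nmap/Nessus-style scans",
    "asset_analyser": "maps hosts, services and credentials",
    "attack_simulator": "drives Metasploit-style exploitation",
}

def route(task: str, llm) -> str:
    """Ask the LLM to choose an expert; fall back to the scanner on a bad answer."""
    menu = "\n".join(f"- {name}: {desc}" for name, desc in EXPERTS.items())
    prompt = (f"Pick exactly one expert for the task.\nExperts:\n{menu}\n"
              f"Task: {task}\nExpert:")
    answer = llm(prompt).strip()
    return answer if answer in EXPERTS else "vulnerability_scanner"

# Stub LLM: simple keyword matching stands in for a real model call.
def stub_llm(prompt: str) -> str:
    task_line = prompt.rsplit("Task:", 1)[1]
    if "exploit" in task_line:
        return "attack_simulator"
    if "inventory" in task_line or "asset" in task_line:
        return "asset_analyser"
    return "vulnerability_scanner"

expert = route("exploit the vsftpd backdoor on 10.0.0.5", stub_llm)
# → "attack_simulator"
```

Because the gate is a prompt rather than fixed weights, new experts can be added by editing the menu, which is the flexibility the contribution claims over fixed gated networks.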

Limitations

  • Only 20% task completion rate on complex tasks, indicating the framework still struggles significantly with multi-step, high-complexity penetration scenarios
  • Context retention drops to 57.1% and utilization accuracy to 63.8% on complex tasks, showing limited ability to manage extended contextual dependencies
  • Hallucination rate of 12.7% on complex tasks with 5.5% severe hallucinations still disrupts task flows and requires manual oversight
  • High computational costs, especially for simple tasks where traditional tools like Nessus achieve better time efficiency (83.7% vs AutoPenGPT's 36.0%)
  • Evaluation limited to 15 VulnHub machines in a controlled virtual environment - no real-world network testing or diversity in target types
  • Struggles with ambiguous inputs and complex logic vulnerabilities with multi-step dependencies, as demonstrated in the failure case study
  • Framework relies on GPT-4 API, creating dependency on a single commercial LLM provider with associated cost and content filtering constraints
  • No comparison with more recent LLM-based pentest tools beyond PentestGPT; the baseline set is limited

Research Gaps

  • No robust mechanism for handling ambiguous or fuzzy inputs in real-world penetration testing scenarios where vulnerability descriptions are not well-defined
  • Context management for very long task chains (12+ steps) remains unsolved, with retention dropping significantly as complexity increases
  • Lack of multimodal feedback validation - the framework does not compare generated outputs against real-time visual or structural feedback from the target environment
  • No integration with Security Operations Center (SOC) workflows or enterprise security tool pipelines
  • Domain-specific fine-tuning of LLMs for penetration testing has not been explored in conjunction with the RAG approach
  • Ethical safeguards and guardrails for preventing misuse of autonomous penetration testing frameworks are insufficiently addressed
  • Multi-target collaborative testing scenarios (testing multiple interconnected systems simultaneously) have not been explored

Novel Techniques

  • Dual knowledge base RAG design: temporary session-specific KB (cleared after each test for data independence) combined with persistent MITRE ATT&CK KB for authoritative reference
  • LLM-driven MoE gating: replacing traditional fixed neural network gating with LLM-based dynamic expert selection and task routing, enabling more flexible adaptation to diverse PT scenarios
  • Counterfactual analysis for loop detection: Decision Module uses counterfactual strategies after 5 failed cycles to determine task unsolvability, preventing infinite loops and resource waste
  • Query-based RAG with FAISS vector database and Sentence-BERT embeddings for semantic similarity retrieval of penetration testing knowledge
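
The query-based retrieval step can be shown in miniature. To keep the sketch dependency-light, FAISS's inner-product index and Sentence-BERT embeddings are replaced here by a toy bag-of-words embedding and brute-force NumPy inner products; the vocabulary and document strings are invented, but the shape of the logic (embed, normalise, nearest-neighbour search) is the same.

```python
import numpy as np

VOCAB = ["nmap", "scan", "port", "exploit", "smb", "privilege", "escalation"]

def embed(text: str) -> np.ndarray:
    """Toy embedding: normalised bag-of-words counts over a tiny vocabulary."""
    words = text.lower().split()
    v = np.array([words.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

docs = [
    "nmap port scan of the target subnet",
    "exploit the smb service for initial access",
    "privilege escalation via a vulnerable kernel module",
]
# Rows are unit vectors, playing the role of a FAISS IndexFlatIP.
index = np.stack([embed(d) for d in docs])

def search(query: str, k: int = 1) -> list[str]:
    scores = index @ embed(query)           # cosine similarity via inner product
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

hit = search("which smb exploit should I try")[0]
# → "exploit the smb service for initial access"
```

Swapping `embed` for a Sentence-BERT encoder and `index` for a FAISS index would turn this toy into the retrieval pipeline the thesis describes.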

Open Questions

  • How can context retention be maintained above acceptable thresholds (e.g., >80%) for tasks requiring 12+ sequential steps?
  • Can the counterfactual analysis strategy be made more sophisticated to distinguish between truly unsolvable tasks and those requiring alternative approaches?
  • Would fine-tuning an open-source LLM on penetration testing data outperform the RAG-augmented GPT-4 approach, especially for reducing hallucinations?
  • How should autonomous penetration testing frameworks handle the discovery of critical zero-day vulnerabilities during testing - what ethical and operational protocols are needed?
  • Can the MoE architecture be extended with self-improving expert models that learn from successful and failed penetration attempts?
  • What is the cost-benefit tradeoff of the RAG overhead for simple tasks where traditional scanning tools are more time-efficient?

Builds On

  • PentestGPT
  • DeepExploit
  • GAIL-PT
  • AutoAttack
  • PTGroup
  • ReAct framework
  • RAG (Lewis et al. 2020)
  • Mixture of Experts

Open Source

No

Tags