#04

xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems

Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, Phan The Duy

2025 | arXiv preprint (submitted to Elsevier)

arXiv:2509.13021

Problem & Motivation 问题与动机

Conventional automated penetration testing approaches based on ML, DL, or RL are limited by simplified action spaces, high computational costs, and weak reasoning across the multi-stage process of reconnaissance, vulnerability analysis, and exploitation. Recent LLM-based frameworks such as PentestGPT and VulnBot rely on very large or proprietary models (GPT-4o, LLaMA3-70B, DeepSeek-V3), which introduces high cost, API dependencies, limited fine-tuning flexibility, and scalability challenges.


There is a pressing need for cost-effective, reproducible, and domain-adapted penetration testing systems that do not depend on large proprietary LLMs. Most current systems use LLMs as black-box assistants without deep task-specific guidance or domain adaptation. The authors explore whether a mid-scale, fine-tuned open-source LLM (32B parameters) can match or exceed larger models when embedded within a structured multi-agent orchestration framework with domain-specific fine-tuning and grey-box prompting.


Threat Model 威胁模型

Black-box to grey-box penetration testing scenario where the attacker has network access to the target but limited initial knowledge. Agents operate from a Kali Linux attacker machine targeting vulnerable hosts on the same NAT network. The grey-box prompting mechanism provides partial system insights (protocol hints, observed services, prior scan summaries) without full system disclosure.


Methodology 核心方法

xOffense is a multi-agent penetration testing framework that decomposes pentesting into three phases (reconnaissance, scanning, exploitation) coordinated through a Task Coordination Graph (TCG). At its core, it uses Qwen3-32B fine-tuned with LoRA on Chain-of-Thought penetration testing data. The framework employs five core components: Task Orchestrator, Knowledge Repository (RAG via vector-store), Command Synthesizer (the fine-tuned LLM), Action Executor (with MemAgent for long-context handling), and Information Aggregator. A Check and Reflection mechanism enables error recovery by re-planning failed tasks using RAG-retrieved similar successful cases.

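The phase coordination with error recovery described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation; all names (`run_phase`, `retrieve_similar_success`, the task/knowledge-base dictionaries) are hypothetical stand-ins.

```python
# Minimal sketch of a phase loop with a Check-and-Reflection step:
# a failed task is retried after attaching RAG-retrieved hints from
# previously successful, similar tasks.

def retrieve_similar_success(task, knowledge_base):
    """Stand-in for RAG retrieval of similar successful cases."""
    return [k for k in knowledge_base if k["phase"] == task["phase"]]

def run_phase(tasks, execute, knowledge_base, max_retries=2):
    """Execute each task; on failure, re-plan using retrieved cases."""
    summary = []
    for task in tasks:
        ok = execute(task)
        retries = 0
        while not ok and retries < max_retries:
            # Check and Reflection: regenerate the attempt guided by
            # similar successful cases from the knowledge repository.
            task["hints"] = retrieve_similar_success(task, knowledge_base)
            ok = execute(task)
            retries += 1
        summary.append({"task": task["name"],
                        "status": "done" if ok else "failed"})
    return summary
```

The summary returned at the end of each phase mirrors the role of the Information Aggregator: only condensed state, not raw output, crosses phase boundaries.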

Architecture 架构设计

Multi-agent system with five core components:

  • Task Orchestrator - constructs and manages a Task Coordination Graph (TCG), a DAG of tasks with dependencies, directives, commands, and statuses
  • Knowledge Repository - a vector-based RAG database (via Langchain-Chatchat) storing penetration testing knowledge from HackTricks and HackingArticles, plus embeddings of previously successful tasks
  • Command Synthesizer - the LoRA fine-tuned Qwen3-32B model that translates task directives into precise tool-specific shell commands
  • Action Executor - executes commands via a Python Paramiko-based interactive shell on Kali Linux, using MemAgent to handle outputs exceeding the 16,384-token context window
  • Information Aggregator - consolidates phase outputs into concise summaries passed between phases, maintaining a persistent shell state log that tracks access levels and system context

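The TCG is at heart a DAG of tasks with statuses, from which the orchestrator selects whatever is unblocked. A toy sketch (field names are illustrative, not the paper's schema):

```python
# Toy Task Coordination Graph: tasks carry dependencies, a directive,
# and a status; ready_tasks() yields pending tasks whose prerequisites
# have all completed.

def make_task(name, directive, deps=()):
    return {"name": name, "directive": directive,
            "deps": list(deps), "status": "pending", "command": None}

def ready_tasks(tcg):
    """Pending tasks whose dependencies are all completed."""
    done = {t["name"] for t in tcg if t["status"] == "completed"}
    return [t for t in tcg if t["status"] == "pending"
            and all(d in done for d in t["deps"])]

tcg = [
    make_task("host-discovery", "enumerate live hosts"),
    make_task("port-scan", "scan open ports", deps=["host-discovery"]),
    make_task("service-enum", "fingerprint services", deps=["port-scan"]),
]
```

Marking "host-discovery" completed unblocks "port-scan", which in turn gates "service-enum", giving the orchestrator a natural recon-to-exploitation ordering.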

LLM Models 使用的大模型

  • Qwen3-32B (base)
  • Qwen3-32B-finetune (LoRA fine-tuned, AWQ quantized)
  • GPT-4o (baseline comparison)
  • Llama3.3-70B (baseline via VulnBot)
  • Llama3.1-405B (baseline via VulnBot and PentestGPT)
  • DeepSeek-V3 (baseline via VulnBot)

Tool Integration 工具集成

nmap, netdiscover, enum4linux, Dirb, Gobuster, Amass, Nikto, WPScan, sqlmap, Metasploit, Hydra, John the Ripper, ExploitDB

Memory Mechanism 记忆机制

RAG
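The retrieval step behind both the Knowledge Repository and the reuse of successful tasks reduces to nearest-neighbor lookup over embeddings. A dependency-free sketch (the embedding vectors here are toy placeholders; the paper uses a vector store via Langchain-Chatchat):

```python
# Toy RAG lookup: rank stored entries by cosine similarity to a query
# embedding and return the top-k texts.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store, k=2):
    """Return the k stored entries most similar to the query embedding."""
    ranked = sorted(store, key=lambda e: cosine(query_vec, e["vec"]),
                    reverse=True)
    return [e["text"] for e in ranked[:k]]
```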

Attack Phases Covered 覆盖的攻击阶段

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation 评估结果

On AutoPenBench, xOffense (Qwen3-32B-finetune) achieved a 72.72% overall task-completion rate, substantially outperforming GPT-4o (21.21%), VulnBot-Llama3.3-70B (18.18%), VulnBot-Llama3.1-405B (30.30%), and PentestGPT (9.09%). The sub-task completion rate reached 79.17% (single experiment) and 60.94% (aggregated over 5 experiments). On AI-Pentest-Benchmark with 6 real-world VulnHub machines, the fine-tuned model with RAG achieved perfect scores (1.00) on Victim1 and WestWild, outperforming all baselines including VulnBot-DeepSeek-V3. The 32B fine-tuned model consistently outperformed the 405B-parameter Llama3.1 across all metrics, demonstrating that domain-adapted mid-scale LLMs can exceed much larger general-purpose models.

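As a quick sanity check (my own arithmetic, not from the paper), each reported AutoPenBench percentage inverts cleanly to an integer number of completed tasks out of the benchmark's 33:

```python
# Recover integer task counts from the reported (truncated) percentages
# over AutoPenBench's 33 tasks, e.g. 24/33 = 72.72...%.
TOTAL = 33
reported = {
    "xOffense": 72.72,
    "GPT-4o": 21.21,
    "VulnBot-Llama3.3-70B": 18.18,
    "VulnBot-Llama3.1-405B": 30.30,
    "PentestGPT": 9.09,
}

def solved_count(rate_pct, total=TOTAL):
    """Invert a truncated percentage back to an integer task count."""
    return round(rate_pct / 100 * total)

counts = {name: solved_count(rate) for name, rate in reported.items()}
```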

Environment 评估环境

AutoPenBench, AI-Pentest-Benchmark, VulnHub

Metrics 评估指标

success-rate, task-completion, sub-task-completion-rate (1 experiment), sub-task-completion-rate (5 experiments)

Baseline Comparisons 基准对比

  • PentestGPT (Llama3.1-405B)
  • VulnBot-Llama3.3-70B
  • VulnBot-Llama3.1-405B
  • VulnBot-DeepSeek-V3
  • GPT-4o
  • Qwen3-32B-base

Scale 评估规模

33 AutoPenBench tasks (22 in-vitro + 11 real-world CVEs) and 13 AI-Pentest-Benchmark VulnHub machines (6 used for detailed evaluation: Victim1, Library2, Sar, WestWild, Symfonos2, Funbox)

Contributions 核心贡献

  • An AI-driven multi-agent penetration testing system with specialized agents covering reconnaissance, vulnerability analysis, exploitation, and reporting phases, coordinated through a Task Coordination Graph (TCG)
  • A domain-adapted mid-scale LLM: Qwen3-32B fine-tuned with LoRA on Chain-of-Thought penetration testing data (PentestData from 1,000+ machine write-ups and WhiteRabbitNeo cybersecurity dataset), achieving superior performance to 405B parameter models
  • Grey-box phase prompting: a context-aware prompting mechanism that selectively integrates environmental cues (protocol hints, discovered services, prior scan outputs) into agent reasoning, balancing black-box and white-box testing constraints
  • Extensive empirical validation on AutoPenBench and AI-Pentest-Benchmark demonstrating state-of-the-art performance across both synthetic and real-world penetration testing scenarios
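The grey-box phase prompting contribution can be illustrated as conditional prompt assembly: environmental cues are injected only when available, so the prompt degrades gracefully toward a black-box view. Cue names and wording below are illustrative, not the paper's prompt templates.

```python
# Sketch of grey-box phase prompting: partial system insight (protocol
# hints, observed services, prior scan summaries) is added to the prompt
# only when it exists.

def build_phase_prompt(phase, target, cues):
    """Assemble a phase prompt from whatever partial insight is known."""
    lines = [f"Phase: {phase}", f"Target: {target}"]
    if cues.get("protocols"):
        lines.append("Protocol hints: " + ", ".join(cues["protocols"]))
    if cues.get("services"):
        lines.append("Observed services: " + ", ".join(cues["services"]))
    if cues.get("scan_summary"):
        lines.append("Prior scan summary: " + cues["scan_summary"])
    lines.append("Plan the next tasks for this phase.")
    return "\n".join(lines)
```

With an empty cue dictionary this reduces to a pure black-box prompt; with full disclosure it approaches white-box testing, which is the balance the contribution describes.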

Limitations 局限性

  • Fine-tuning on CoT pentest data may bias the model toward disproportionately represented attack vectors, limiting generalization to underrepresented scenarios
  • Benchmarks (AutoPenBench, AI-Pentest-Benchmark) cannot fully capture the heterogeneity of production-scale environments with active defenses, non-standard configurations, and deception mechanisms
  • Evaluation metrics focus on task completion and success rates, neglecting stealth, efficiency, resource utilization, time-to-compromise, and resilience against detection
  • Binary success measures fail to capture partial progress or incremental compromise in complex exploitation chains
  • Reproducibility affected by stochastic factors in LLM inference, hardware variation, runtime conditions, network latency, and nondeterministic tool outputs
  • Approximately 18% performance drop observed from single-experiment to 5-experiment aggregated results, indicating variance in autonomous pentesting workflows
  • Prompting strategies and toolchains may embed implicit task-specific heuristics, meaning reported improvements could partly reflect dataset artifacts rather than genuine reasoning ability

Research Gaps 研究空白

  • Need for structured function calling integration in the command generation module to improve execution precision beyond free-form shell command synthesis
  • Improving the robustness of long-running process handling, and strengthening RAG with automated updates from vulnerability intelligence sources (ExploitDB, GitIngest)
  • Extension to advanced web and GUI interactions via browser automation for broader penetration testing scenarios
  • Lack of evaluation on enterprise-scale heterogeneous infrastructures and zero-day exploitation scenarios
  • Need for metrics beyond binary task completion that capture stealth, efficiency, and operational realism
  • Transferability of benchmark results to operational networks with active defenses and dynamic adversarial tactics remains unvalidated

Novel Techniques 新颖技术

  • Grey-box phase prompting: selectively injecting environmental context (observed protocols, discovered services, prior scan summaries) into agent prompts to balance information availability with realistic testing constraints
  • Task Coordination Graph (TCG): a DAG-based task planning structure with Check and Reflection mechanism enabling dynamic re-planning upon failure, using RAG retrieval of similar successful tasks to regenerate failed commands
  • LoRA fine-tuning of mid-scale LLM on Chain-of-Thought penetration testing data with <think> tags, enabling structured step-by-step reasoning for exploit chain planning
  • Inter-agent communication via PlannerSummary (Algorithm 4) that consolidates completed phase outputs into concise context for the next phase planner, managing token overhead across phases
  • Success-preserving plan merge (Algorithm 3) that integrates updated plans after failure while retaining all previously completed tasks with adjusted dependencies
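The success-preserving plan merge can be sketched as follows. This is my reading of the idea (keep completed tasks, take the rest from the updated plan, prune dangling dependencies), not a transcription of the paper's Algorithm 3.

```python
# Sketch of a success-preserving merge: completed tasks from the old plan
# survive re-planning, and the new plan's tasks have their dependencies
# restricted to tasks that still exist in the merged plan.

def merge_plans(old_plan, new_plan):
    """Keep completed tasks; take remaining tasks from the updated plan."""
    completed = [t for t in old_plan if t["status"] == "completed"]
    done_names = {t["name"] for t in completed}
    merged = list(completed)
    for task in new_plan:
        if task["name"] in done_names:
            continue  # already satisfied by a completed task
        # Drop dependencies on tasks absent from both plans.
        known = done_names | {t["name"] for t in new_plan}
        task = dict(task, deps=[d for d in task["deps"] if d in known])
        merged.append(task)
    return merged
```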

Open Questions 开放问题

  • Can the fine-tuned mid-scale model approach generalize to entirely unseen vulnerability classes or novel network topologies not represented in training data?
  • How would xOffense perform against actively defended targets with IDS/IPS, honeypots, or adaptive security measures?
  • What is the optimal balance between model size, fine-tuning data quantity, and RAG corpus quality for penetration testing tasks?
  • Can structured function calling replace or augment free-form command synthesis to reduce hallucinated or malformed commands?
  • How does the system handle interactive or stateful exploitation scenarios requiring real-time adaptation (e.g., buffer overflow development, custom exploit crafting)?

Builds On 基于前人工作

  • VulnBot (multi-agent PTG framework)
  • PentestGPT (LLM-based pentest scaffolding)
  • PentestAgent (RAG-grounded multi-agent)
  • MemAgent (long-context LLM memory management)
  • Langchain-Chatchat (RAG implementation)
  • Qwen3 (base language model)

Open Source 开源信息

No (not mentioned; no repository URL provided)

Tags