PTFusion: LLM-driven context-aware knowledge fusion for web penetration testing

Wenhao Wang, Hao Gu, Zhixuan Wu, Hao Chen, Xingguo Chen, Fan Shi

2026 | Information Fusion (journal)

https://doi.org/10.1016/j.inffus.2025.103731

system penetration-testing fully-autonomous hierarchical chain-of-thought

Problem & Motivation

LLM-enhanced web penetration testing suffers from three critical operational failures: (1) imprecise command execution where LLMs struggle to generate precise commands from contextual information, (2) context decay where critical relationships between ports, services, and vulnerabilities are lost during handovers, and (3) inefficient task guidance where analysts become trapped in feedback loops of localized analysis without unified correlation frameworks.

While LLMs show promise for penetration testing automation, existing approaches like PentestGPT remain confined to human-in-the-loop paradigms and simplistic scenarios without fully demonstrating operational potential in real-world testing environments. Modern security tools (Nmap, Metasploit, Burp Suite) generate heterogeneous data streams whose syntactic and semantic fragmentation causes LLMs to fail at precise command execution and context maintenance. The emergence of the Model Context Protocol (MCP) creates an opportunity to seamlessly orchestrate external tools and process heterogeneous information into actionable penetration testing knowledge.

Threat Model

The system targets web application penetration testing scenarios where the objective is to obtain webshell execution privileges on vulnerable web applications hosted in network environments. The attacker has network access to the target but no prior credentials. The system operates fully autonomously without human intervention during the penetration testing workflow.

Methodology

PTFusion employs a semi-decentralized multi-agent collaborative architecture with three specialized agents (MasterAgent, ReconAgent, AttackAgent) connected via the Model Context Protocol (MCP). The MasterAgent handles strategic planning using a dynamic knowledge graph for situational awareness, while subordinate agents autonomously execute tactical reconnaissance and attack tasks. A context-aware knowledge fusion mechanism combines a dynamic knowledge graph with preference-based chain-of-thought prompting to address noisy tool outputs and guide reliable autonomous decision-making.
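
To make the MCP connection concrete: MCP messages are JSON-RPC 2.0, and a tool invocation is a request with the method `tools/call` whose parameters carry the tool name and its arguments. The sketch below builds such a request. The tool name `nmap_scan` and its argument names are hypothetical, since the summary does not give the tool schemas exposed by PTFusion's MCP servers.

```python
import json
from itertools import count

_request_ids = count(1)  # each JSON-RPC request in a session needs its own id


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool name and arguments, for illustration only.
message = build_tool_call("nmap_scan", {"target": "192.168.56.101", "ports": "1-1024"})
print(message)
```

Because every tool sits behind the same request shape, an agent can switch from a reconnaissance tool to an attack tool by changing only the `name` and `arguments` fields.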

Architecture

The framework consists of three functionally distinct agents that communicate through MCP Servers. (1) MasterAgent is the strategic decision-making node: it analyzes user intent, formulates high-level penetration testing objectives, queries the Dynamic Knowledge Graph (DKG) via a DKG MCP Server, and allocates tasks to subordinate agents using a two-stage planning process (Context Retrieval, then Strategy Generation). (2) ReconAgent is an autonomous tactical agent responsible for information gathering (port scanning, web enumeration, HTTP probing) via a Recon MCP Server that interfaces with Nmap, Dirb, and Curl. (3) AttackAgent is an autonomous tactical agent for exploitation via an Attack MCP Server that interfaces with Msfconsole, Hydra, and Sqlmap. The DKG is a Neo4j-backed graph with 9 entity types and 9 relationship types, modeling hosts, ports, services, vulnerabilities, websites, sensitive web paths, network segments, login statuses, and brute-force statuses. Subordinate agents run their own decision-making loops and decide independently whether to continue a task, without awaiting MasterAgent instructions.
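
The interplay between the knowledge graph and two-stage planning can be sketched in a few lines. This is a minimal in-memory stand-in, not the paper's implementation: the real DKG is Neo4j-backed and reached through the DKG MCP Server. The entity type names follow the list given under Novel Techniques; the relationship names (`HAS_PORT`, `RUNS`), the node ids, and the prompt wording are assumptions made for illustration.

```python
from dataclasses import dataclass, field

ENTITY_TYPES = {
    "Host", "Port", "Service", "Vulnerability", "WebSite",
    "SensWebPath", "NetworkSegment", "LoginStatus", "BruteforceStatus",
}


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> (entity_type, attrs)
    edges: list = field(default_factory=list)  # (src_id, relation, dst_id)

    def add_node(self, node_id, entity_type, **attrs):
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
        self.nodes[node_id] = (entity_type, attrs)

    def add_edge(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already exist")
        self.edges.append((src, relation, dst))


def retrieve_context(graph, host_id):
    """Stage 1 (Context Retrieval): collect only facts reachable from one host."""
    facts, frontier, seen = [], [host_id], {host_id}
    while frontier:
        current = frontier.pop()
        for src, rel, dst in graph.edges:
            if src == current:
                facts.append(f"{src} -{rel}-> {dst}")
                if dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return sorted(facts)


def build_strategy_prompt(goal, facts):
    """Stage 2 (Strategy Generation): give the LLM the retrieved facts, not the whole graph."""
    lines = [f"Goal: {goal}", "Known facts:"] + [f"- {fact}" for fact in facts]
    lines.append("Propose the next concrete steps using only the facts above.")
    return "\n".join(lines)


kg = KnowledgeGraph()
kg.add_node("host:10.0.0.5", "Host", ip="10.0.0.5")
kg.add_node("port:10.0.0.5:80", "Port", number=80, state="open")
kg.add_node("svc:http", "Service", name="http")
kg.add_node("host:10.0.0.9", "Host", ip="10.0.0.9")
kg.add_node("port:10.0.0.9:22", "Port", number=22, state="open")
kg.add_edge("host:10.0.0.5", "HAS_PORT", "port:10.0.0.5:80")   # hypothetical relation name
kg.add_edge("port:10.0.0.5:80", "RUNS", "svc:http")            # hypothetical relation name
kg.add_edge("host:10.0.0.9", "HAS_PORT", "port:10.0.0.9:22")

facts = retrieve_context(kg, "host:10.0.0.5")
print(build_strategy_prompt("obtain a webshell", facts))
```

Restricting the prompt to the subgraph around one host is one way to realize the stated aim of avoiding context overload: facts about the unrelated second host never reach the planner.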

LLM Models

GPT-4.1-mini, GPT-4o-mini, Qwen-72B

Tool Integration

Nmap, Dirb, Curl (via the Recon MCP Server); Msfconsole, Hydra, Sqlmap (via the Attack MCP Server)

Memory Mechanism

knowledge-graph

Attack Phases Covered

reconnaissance

scanning

enumeration

exploitation

post exploitation

privilege escalation

lateral movement

reporting

Evaluation

With GPT-4.1-mini as the backbone LLM, PTFusion achieved a 100% Penetration Success Rate (PSR) in all 6 VulnHub environments, outperforming every baseline, including PentestGPT operated by an experienced expert. With GPT-4o-mini and Qwen-72B as backbones, performance degraded substantially (20-80% and 0-60% PSR, respectively). Ablation studies showed that removing either the DKG or the action history markedly degraded performance, supporting the claim that the two components of the context-aware knowledge fusion mechanism work together.

Environment

VulnHub virtual machines

Metrics

Penetration Success Rate (PSR), Reasoning Similarity Score (RSS)

Baseline Comparisons

PTFusion with action history only (ablation)
PTFusion with DKG only (ablation)
PentestGPT with an experienced expert
PentestGPT with a beginner

Scale

6 VulnHub environments (AI Web 1.0, from_sqli_to_shell_i386, JIS-CTF, Metasploitable 2, SickOs 1.2, Basic Pentesting 1), each tested 5 times
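
A small sketch of how a per-environment PSR could be computed, assuming PSR is simply the percentage of runs that end in a successful penetration (the summary does not give the formal definition). The environment names and outcomes below are hypothetical and are not results from the paper.

```python
def penetration_success_rate(outcomes):
    """Return the percentage of successful runs for one environment."""
    if not outcomes:
        raise ValueError("at least one run is required")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for illustration only (True = penetration succeeded).
runs = {
    "env_a": [True, True, True, True, True],
    "env_b": [True, False, False, True, False],
}

for env, outcomes in runs.items():
    print(env, penetration_success_rate(outcomes))
```

With 5 runs per environment, PSR under this definition can only take the values 0, 20, 40, 60, 80, or 100 percent, which matches the granularity of the ranges reported in the evaluation and is one reason the small sample size matters.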

Contributions

A semi-decentralized multi-agent collaborative architecture using MCP, where a MasterAgent provides strategic guidance while ReconAgent and AttackAgent operate with autonomous tactical decision-making authority, eliminating the need for human-in-the-loop operation.
A context-aware knowledge fusion mechanism combining a dynamic knowledge graph (9 entity types, 9 relationship types, 27 attributes) with preference-based Chain-of-Thought prompting to process heterogeneous, noisy tool outputs into verifiable, ground-truth intelligence for reliable autonomous decision-making.
Empirical validation demonstrating PTFusion achieves 100% penetration success rate across 6 diverse VulnHub environments fully autonomously, outperforming PentestGPT variants that require human expert guidance.
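
The idea of turning noisy tool output into verifiable findings can be illustrated with a deterministic filter. In the paper these rules are enforced through preference-based Chain-of-Thought prompting, so the code below is a different technique standing in for it: it applies deduplication, categorized aggregation, and an evidence check as plain post-processing. The finding fields (`category`, `value`, `evidence`) are assumptions made for this example.

```python
from collections import defaultdict


def align_findings(findings, raw_output):
    """Deduplicate findings, drop unverifiable ones, and group the rest by category."""
    seen = set()
    grouped = defaultdict(list)
    rejected = []
    for finding in findings:
        key = (finding["category"], finding["value"])
        if key in seen:
            continue  # strict deduplication
        seen.add(key)
        evidence = finding.get("evidence", "")
        # No fabrication / factual verification: the quoted evidence must
        # appear verbatim in the raw tool output.
        if not evidence or evidence not in raw_output:
            rejected.append(finding)
            continue
        grouped[finding["category"]].append(finding["value"])  # categorized aggregation
    return dict(grouped), rejected


raw = "80/tcp open http Apache httpd 2.4.7\n22/tcp open ssh OpenSSH 6.6"
candidate_findings = [
    {"category": "open_port", "value": "80/tcp", "evidence": "80/tcp open http"},
    {"category": "open_port", "value": "80/tcp", "evidence": "80/tcp open http"},
    {"category": "open_port", "value": "22/tcp", "evidence": "22/tcp open ssh"},
    {"category": "open_port", "value": "3306/tcp", "evidence": "3306/tcp open mysql"},
]

verified, rejected = align_findings(candidate_findings, raw)
print(verified)
print([finding["value"] for finding in rejected])
```

A verbatim substring match is a deliberately strict check. It rejects paraphrased evidence, which trades recall for the guarantee that every retained finding is traceable to raw output.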

Limitations

Evaluation is limited to 6 relatively simple VulnHub environments focused on web application penetration testing; no hard-difficulty targets or real-world production environments are tested.
The system's scope is restricted to web penetration testing up to webshell acquisition; post-exploitation, privilege escalation, lateral movement, and network-level penetration are not addressed.
Only three LLMs are evaluated (GPT-4.1-mini, GPT-4o-mini, Qwen-72B), and performance is heavily dependent on the backbone LLM -- GPT-4o-mini and Qwen-72B show dramatically worse results, raising questions about generalizability.
The fixed three-agent architecture (MasterAgent, ReconAgent, AttackAgent) is acknowledged as potentially insufficient for more complex scenarios requiring additional specialized agents (e.g., C2, privilege escalation).
The knowledge graph schema is predefined with fixed entity types and relationships, potentially missing novel or unexpected attack surface elements that fall outside the schema.
Each experiment runs only 5 times per environment, which is a small sample size for assessing statistical reliability, especially given the observed variance in reasoning paths.
No cost analysis is provided, making it difficult to assess practical deployment feasibility compared to human-operated alternatives.
The Reasoning Similarity Score (RSS) shows high variability in complex environments (env4, env5), indicating that reasoning consistency degrades precisely when reliability matters most.
The preference-based CoT prompting mechanism encodes domain-specific heuristics (e.g., combining SQL injection paths with physical paths for webshell upload), which may not generalize to novel vulnerability classes.
Data availability is listed as 'will be made available on request' rather than open-source, limiting reproducibility.

Research Gaps

No existing LLM-based penetration testing system achieves fully autonomous operation across diverse web environments without human expert involvement.
Current approaches lack mechanisms to fuse heterogeneous, noisy outputs from multiple security tools into a coherent, queryable knowledge representation for strategic planning.
The Model Context Protocol (MCP) has not been previously applied to penetration testing tool orchestration despite its natural fit for managing diverse security tool interfaces.
Existing LLM pentesting systems do not address the critical problem of ensuring factual verification and preventing hallucinated findings from corrupting downstream decision-making.
There is no established method for measuring reasoning consistency and stability in autonomous penetration testing agents across repeated runs.
Cross-segment network penetration testing and integration of a wider range of penetration testing tools remain unexplored in the LLM-driven paradigm.

Novel Techniques

MCP-enabled semi-decentralized multi-agent architecture: Using the Model Context Protocol to establish standardized JSON-RPC interfaces between specialized penetration testing agents and their respective tool servers (Recon MCP Server, Attack MCP Server, DKG MCP Server), enabling modular, extensible tool integration without custom adapters.
Dynamic Knowledge Graph for penetration testing state: A Neo4j-backed temporally evolving graph with 9 entity types (Host, Port, Service, Vulnerability, WebSite, SensWebPath, NetworkSegment, LoginStatus, BruteforceStatus) and 9 relationship types that captures the complete attack surface and evolves in real-time as reconnaissance and exploitation proceed.
Two-stage task planning with knowledge graph retrieval: The MasterAgent first converts high-level intent into focused knowledge graph queries (Context Retrieval), then performs logical reasoning on the retrieved structured data to generate concrete strategic steps (Strategy Generation), avoiding context overload.
Preference-based Chain-of-Thought prompting: A four-step information alignment process (Strict Deduplication, Categorized Aggregation, Strict Prohibition of Information Fabrication, Principle of Factual Verification) that enforces ground-truth fidelity by requiring every reported finding to be directly traceable to raw tool output.
Autonomous tactical decision-making loops: Subordinate agents (ReconAgent, AttackAgent) independently determine task continuation and tool selection within their tactical scope, initiating their own internal reasoning cycles without awaiting MasterAgent micro-management.
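
The autonomous tactical loop in the last item can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the decision policy would be an LLM call in PTFusion, whereas here it is a hand-written stub, and the tool names and outputs are hypothetical.

```python
def run_tactical_loop(task, decide, tools, max_steps=10):
    """Run a subordinate agent until its own policy decides to stop.

    `decide` returns (tool_name, kwargs) for the next action, or None to stop.
    """
    history = []
    for _ in range(max_steps):  # hard cap so a bad policy cannot loop forever
        action = decide(task, history)
        if action is None:
            break
        tool_name, kwargs = action
        result = tools[tool_name](**kwargs)
        history.append((tool_name, kwargs, result))
    return history


def stub_decide(task, history):
    """Stand-in for the LLM policy: scan first, enumerate the web if port 80 is open."""
    used = {name for name, _, _ in history}
    if "port_scan" not in used:
        return ("port_scan", {"target": task})
    if "web_enum" not in used and "80/tcp open" in history[-1][2]:
        return ("web_enum", {"target": task})
    return None


# Hypothetical tools returning canned output instead of running real scanners.
stub_tools = {
    "port_scan": lambda target: "80/tcp open http",
    "web_enum": lambda target: "/admin (200)",
}

history = run_tactical_loop("10.0.0.5", stub_decide, stub_tools)
print([name for name, _, _ in history])
```

The loop never consults a coordinator between steps: continuation and tool choice both come from the agent's own policy, which is the property the item describes. The `max_steps` cap is an added safeguard that the summary does not mention.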

Open Questions

Can the semi-decentralized architecture scale to full network penetration testing scenarios involving lateral movement, privilege escalation, and multi-host campaigns?
How does the system perform against hardened targets with WAFs, IDS/IPS, and other defensive measures that actively respond to scanning and exploitation attempts?
What is the cost-performance trade-off of using GPT-4.1-mini versus smaller/cheaper models, and can the architecture compensate for weaker backbone LLMs?
Can the dynamic knowledge graph schema be automatically extended or learned rather than being predefined, to handle novel vulnerability classes and attack surfaces?
How would the system handle scenarios requiring social engineering, client-side attacks, or physical security testing that go beyond web application exploitation?
What safety mechanisms are needed to prevent autonomous penetration testing systems from causing unintended damage to production environments?
Can the preference-based CoT approach be automatically tuned per environment rather than relying on manually crafted domain-specific prompts?
How does reasoning consistency (RSS) relate to actual penetration testing effectiveness, and is high consistency always desirable given that diverse strategies may be needed for complex targets?

Builds On

PentestGPT (Deng et al., 2024)
PentestAgent (Shen et al., 2024)
Model Context Protocol (Anthropic, 2024)
Dynamic Knowledge Graphs
Chain-of-Thought prompting
Debate on Graphs (Ma et al., 2025)

Open Source

No - data available on request