#64

SoK: A Comparison of Autonomous Penetration Testing Agents

Raphael Simon, Wim Mees

2024 | ARES 2024 (19th International Conference on Availability, Reliability and Security) (conference)

https://doi.org/10.1145/3664476.3664484

Problem & Motivation

Multiple autonomous penetration testing agents using Deep Reinforcement Learning (DRL) have been proposed in the literature, but they use distinct methods, different environments, different algorithms, and different evaluation setups, making it difficult to draw conclusions about the current state and relative performance of these agents.

With the rise of DRL, agents have emerged that aim to actively assess system security through penetration testing by learning to use tools and emulate human behavior. However, the diversity of approaches, simulators, and evaluation criteria makes fair comparison nearly impossible. A systematic overview and comparative analysis is needed to identify research challenges and future directions.

Threat Model

Standard automated penetration testing scenario where an RL-trained agent interacts with a network environment (simulated or real) containing vulnerable hosts. The agent must discover and exploit vulnerabilities to compromise targets. The environment may be fully or partially observable depending on the modeling choice (MDP vs POMDP).

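
The MDP-versus-POMDP choice determines whether the agent can track uncertainty about hidden host state. As a minimal illustration of the POMDP side (a discrete Bayes filter; all states, probabilities, and names here are illustrative, not taken from the paper):

```python
import numpy as np

def belief_update(belief, obs_likelihood, transition):
    """One step of a discrete Bayes filter over hidden host states.

    belief:         shape (S,)   current P(state)
    obs_likelihood: shape (S,)   P(observation | state) for the observation seen
    transition:     shape (S, S) P(next_state | state) under the chosen action
    """
    predicted = transition.T @ belief       # predict: push belief through dynamics
    posterior = obs_likelihood * predicted  # correct: weight by observation likelihood
    return posterior / posterior.sum()      # renormalise to a distribution

# Toy example: is a host patched (state 0) or vulnerable (state 1)?
belief = np.array([0.5, 0.5])
transition = np.eye(2)                      # host state does not change here
obs_likelihood = np.array([0.2, 0.9])       # a scan reported an open service
belief = belief_update(belief, obs_likelihood, transition)
# belief ≈ [0.18, 0.82] — the agent now leans toward "vulnerable"
```

An MDP-based agent skips this filtering step and acts on raw observations directly, which is the unrealistic full-state-knowledge assumption the paper criticizes.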
Methodology

Systematic literature survey comparing autonomous penetration testing agents that use RL, DRL, or Imitation Learning. The paper catalogs execution environments (simulators), classifies agents by their training environment (simulated vs real-world), algorithm (DQN, A2C, PPO, A3C, DQfD, GAIL), and environment model (MDP, POMDP, model-free). It then performs a comparative analysis across four key research challenges: partial observability, large action spaces, reward structure design, and sim-to-real transfer.

Architecture

N/A - this is a survey paper. However, it catalogs architectures including: single-agent DRL systems (NDSPI-DQN, EPPTA, Deep Exploit), double-agent architectures (DAA with structuring and exploiting agents), hierarchical multi-agent decomposition (HA-DRL, CLAP), and imitation learning frameworks (DQfD-AIPT, GAIL-PT).

LLM Models

N/A (survey covers DRL agents, not LLM-based agents; LLMs mentioned only as related work)

Tool Integration

  • Metasploit Framework (via RPC API)
  • Nmap
  • PowerShell Empire
  • NASim simulator
  • CybORG simulator
  • CyberBattleSim simulator
  • NASimEmu

Memory Mechanism

belief-state

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

EPPTA is the only simulator-based agent that properly accounts for partial observability via POMDP modeling with belief states, handling 3515 actions. HA-DRL handles the largest action space (4646 actions) but ignores partial observability. DAA handles 4136 actions using double-agent decomposition. Real-world agents (Deep Exploit, Kujanpaa, Maeda, Raiju) have much smaller action spaces (38-204) but face harder transfer challenges. No single agent addresses all challenges simultaneously.

Environment

  • NASim (Network Attack Simulator)
  • CybORG (Cyber Operations Research Gym)
  • CyberBattleSim (Microsoft)
  • NASimEmu
  • Metasploitable machines
  • Custom self-defined environments
  • Windows 7 VMs
  • Active Directory networks

Metrics

  • partial-observability-support
  • largest-action-space-handled
  • convergence
  • qualitative-comparison

Baseline Comparisons

  • NDSPI-DQN
  • EPPTA
  • Kujanpaa et al.
  • HA-DRL
  • Maeda et al.
  • CLAP
  • Raiju
  • DQfD-AIPT
  • DAA
  • GAIL-PT
  • Deep Exploit

Scale

11 autonomous pentesting agents compared across simulated and real-world environments with up to 100 hosts

Contributions

  • First systematic comparison of autonomous penetration testing agents using DRL, covering 11 agents across simulated and real-world environments.
  • Comprehensive catalog of execution environments (NASim, CybORG, CyberBattleSim, NASimEmu) with their capabilities and limitations for training pentesting agents.
  • Identification and structured analysis of four key research challenges: partial observability, large action spaces, reward structure design, and sim-to-real transfer.
  • Comparative table (Table 2) evaluating agents on partial observability support and largest action space handled, providing a quantitative basis for comparison.
  • Identification of future research directions including multi-agent learning, LLM integration, imitation learning, and multi-objective reward structures.

Limitations

  • The comparison is largely qualitative due to the fundamental incomparability of existing work: different simulators, scenarios, and reward structures, and, most importantly, missing source code prevent direct quantitative benchmarking.
  • LLM-based pentesting agents (e.g., PentestGPT) are only briefly mentioned as related work and not included in the core comparison, despite their growing prominence.
  • The paper does not propose or implement any new agent or benchmark, limiting its ability to validate its own conclusions empirically.
  • The survey scope is restricted to agents using RL/DRL/IL, excluding rule-based automation, classical planning, or hybrid approaches that are also relevant to autonomous pentesting.
  • CybORG's emulation component was never fully implemented, and the project has been discontinued, yet it is still discussed as a viable environment.
  • The paper does not cover defensive agents or blue-team perspectives, which are increasingly intertwined with offensive automation research.

Research Gaps

  • No standardized benchmark or common evaluation framework exists for comparing autonomous pentesting agents -- each uses different simulators, scenarios, and metrics, making reproducibility and fair comparison impossible.
  • The sim-to-real transfer problem remains largely unsolved: agents trained in simulators cannot be directly deployed in real-world networks due to abstraction gaps, and no general methodology exists for bridging this divide.
  • Partial observability is inadequately handled by most agents: only EPPTA properly models the environment as a POMDP, while most others use MDP formulations that unrealistically assume full state knowledge.
  • Reward structure design lacks principled guidance: current approaches range from simple binary rewards to multi-objective formulations, but no systematic study compares their effectiveness or transferability.
  • Large action spaces remain a fundamental bottleneck, with the largest tested scenario having ~4600 actions, far below the combinatorial complexity of real enterprise networks.
  • Source code availability is poor across the field, preventing independent verification and reproducible comparison of published results.
  • No agent simultaneously handles all four key challenges (partial observability, large action spaces, appropriate reward structures, and real-world deployment).
  • Integration of LLMs with DRL for penetration testing is completely unexplored -- LLMs could provide high-level strategy while DRL handles low-level action selection.

Novel Techniques

  • Attack vector decoupling (NDSPI-DQN): splitting the neural network into two streams to separately select the target host and the exploit, reducing action space from O(MN) to O(M+N) and enabling scaling to 150+ hosts.
  • POMDP belief state module (EPPTA): incorporating transient belief states to capture partial observability, computing probability distributions over true states for better decision-making in uncertain environments.
  • Coverage masking mechanism (CLAP/Yang et al.): maintaining a coverage set tracking past actions to shift agent focus toward newly discovered network segments, mimicking human pentester behavior.
  • Double-agent architecture (DAA): decomposing pentesting into a structuring agent (reconnaissance/scanning) and an exploiting agent, with the structuring agent triggering the exploiting agent when sufficient information is gathered.
  • Multi-objective RL with MOMDP (Yang et al.): using Chebyshev Critic Scalarization to handle conflicting objectives (exploitation vs privilege escalation) in penetration testing.
  • Action decomposition for multi-agent DRL (CLAP/Tran et al.): separating large discrete action spaces into manageable subsets with separate DQN agents per subset, combined via linear function.
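
The action-space reduction behind NDSPI-DQN's decoupling can be sketched as toy factored action selection: score hosts and exploits in separate streams and combine the per-stream argmaxes, so the output layer grows as M + N rather than M × N. Random scores stand in for trained Q-networks here; the shapes and names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 150, 30                     # hosts, exploits

# Joint formulation: one Q-value per (host, exploit) pair -> M * N outputs.
q_joint = rng.normal(size=(M, N))  # 4500 outputs to learn over

# Decoupled formulation: two streams -> only M + N outputs.
q_host = rng.normal(size=M)        # which host to target
q_exploit = rng.normal(size=N)     # which exploit to run

# Greedy action under the decoupled model: argmax over M + N scores
# instead of M * N joint actions.
host = int(np.argmax(q_host))
exploit = int(np.argmax(q_exploit))
action = (host, exploit)
```

Hierarchical and multi-agent designs surveyed in the paper (DAA, HA-DRL, CLAP) apply similar decompositions at a coarser granularity, trading joint optimality for tractable action spaces.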

Open Questions

  • Can DRL and LLM approaches be combined, with LLMs providing high-level penetration testing strategy and DRL agents executing low-level actions in the environment?
  • What would a universal benchmark for autonomous pentesting agents look like that enables fair comparison across different algorithmic paradigms (DRL, LLM, hybrid)?
  • How can agents be trained to generalize across diverse network topologies and vulnerability types rather than being specialized to specific simulator configurations?
  • Is the MDP/POMDP formalization adequate for real-world penetration testing, or do we need fundamentally different mathematical frameworks?
  • Can imitation learning from human pentesters provide a more sample-efficient path to competent agents than pure RL exploration?
  • How should reward structures be designed to encourage both breadth (finding all vulnerabilities) and depth (complete exploitation chains) in autonomous pentesting?
  • What is the minimum fidelity required in a simulation environment for trained agents to transfer successfully to real networks?

Builds On

  • NASim (Schwartz, 2019) - Network Attack Simulator
  • CybORG (Standen et al., 2021) - Cyber Operations Research Gym
  • CyberBattleSim (Microsoft Defender Research Team, 2021)
  • NASimEmu (Janisch et al., 2023)
  • Dulac-Arnold et al. (2021) - Real-world RL challenges survey
  • Sutton and Barto (2018) - Reinforcement Learning: An Introduction

Open Source

No (survey paper; notes that most surveyed agents lack public source code)

Tags