#24

POMDPs Make Better Hackers: Accounting for Uncertainty in Penetration Testing POMDPs Make Better Hackers: Accounting for Uncertainty in Penetration Testing

Carlos Sarraute, Olivier Buffet, Joerg Hoffmann

2013 | arXiv preprint (preprint)

system penetration-testing fully-autonomous single-agent POMDP-based

PDF Preview 论文预览

Loading PDF... 加载 PDF 中...

Problem & Motivation 问题与动机

Automated penetration testing requires generating attack plans under incomplete knowledge about network configurations. Classical planning approaches assume perfect world knowledge and require costly scanning pre-processes to reduce uncertainty, yet residual uncertainty remains. The paper addresses how to automatically generate attack plans that intelligently interleave scanning and exploitation under uncertainty.

自动化渗透测试需要在对网络配置了解不完全的情况下生成攻击计划。经典的规划方法假设拥有完美的场景知识，并需要昂贵的扫描预处理来减少不确定性，但残留的不确定性仍然存在。本文探讨了如何在不确定性下自动生成能够智能地交替进行扫描和漏洞利用的攻击计划。

Existing approaches (e.g., Core Insight Enterprise using Metric-FF) treat scanning as a separate pre-processing step before planning exploits, which is expensive in time and network traffic and still leaves residual uncertainty. No prior method could both (a) avoid exhaustive scanning and (b) model dependencies between exploits. POMDPs naturally capture the uncertainty inherent in pentesting, allowing the planner to intelligently decide when to scan and when to exploit, mimicking how a skilled human hacker would reason.

现有方法（例如使用 Metric-FF 的 Core Insight Enterprise）将扫描视为规划漏洞利用之前的独立预处理步骤，这在时间和网络流量上都很昂贵，且仍会留下残留的不确定性。之前的方法都无法同时做到：(a) 避免详尽扫描和 (b) 建模漏洞利用之间的依赖关系。POMDP 天生就能捕捉渗透测试中固有的不确定性，允许规划器智能地决定何时扫描、何时利用，模拟了熟练的人类黑客的推理方式。

Threat Model 威胁模型

The attacker starts from a single controlled machine and knows the network topology (logical network graph with subnetworks and firewalls) but has incomplete knowledge of individual machine configurations (OS versions, installed programs, open ports, vulnerability status). The uncertainty grows with time since the last pentest due to software updates. Privilege escalation, password retrieval, and lateral credential use are abstracted away at the logical network level.

攻击者从单台受控机器开始，了解网络拓扑（具有子网和防火墙的逻辑网络图），但对单个机器配置（操作系统版本、安装的程序、开放端口、漏洞状态）的了解不完全。由于软件更新，不确定性会随着距离上次渗透测试的时间增加而增长。权限提升、密码获取和凭据的横向使用在逻辑网络层面进行了抽象。

Methodology 核心方法

The paper models penetration testing as a POMDP where states encode machine configurations (installed programs, vulnerability status, protection mechanisms), actions are scans and exploits, observations provide partial information about machine state, and rewards reflect the value of compromising target machines minus action costs and detection risks. The initial belief is computed from the last known configuration evolved through Markov chains modeling software updates over time. To handle scalability, the authors propose the 4AL decomposition algorithm that breaks the network-wide attack problem into single-machine POMDPs by exploiting network structure (biconnected components), combining solutions conservatively.

本文将渗透测试建模为一个 POMDP，其中状态编码了机器配置（安装的程序、漏洞状态、防护机制），动作是扫描和漏洞利用，观察结果提供关于机器状态的部分信息，奖励反映了攻陷目标机器的价值减去动作成本和检测风险。初始信念（belief）是根据已知的上次配置通过建模软件随时间更新的马尔可夫链计算得出的。为了解决可扩展性问题，作者提出了 4AL 分解算法，通过利用网络结构（双连通分量），将全网络范围的攻击问题分解为单机 POMDP，并保守地组合解。

Architecture 架构设计

The 4AL (4 Abstraction Levels) algorithm decomposes the problem hierarchically: Level 1 decomposes the logical network into a tree of biconnected components using Hopcroft-Tarjan; Level 2 considers attack paths through each biconnected component; Level 3 attacks individual subnetworks by choosing which machine to compromise first; Level 4 solves single-machine attack planning as a POMDP using the SARSOP point-based solver. Results propagate upward as pivoting rewards through the tree decomposition.

4AL（4 个抽象层级）算法进行分层分解：第 1 层使用 Hopcroft-Tarjan 算法将逻辑网络分解为双连通分量的树；第 2 层考虑通过每个双连通分量的攻击路径；第 3 层通过选择首先攻陷哪台机器来攻击单个子网；第 4 层使用 SARSOP 点集求解器将单机攻击规划作为 POMDP 求解。结果通过树分解作为横向移动（pivoting）奖励向上层传播。

Tool Integration 工具集成

Memory Mechanism 记忆机制

belief-state

Attack Phases Covered 覆盖的攻击阶段

reconnaissance

scanning

enumeration

exploitation

post exploitation

privilege escalation

lateral movement

reporting

Evaluation 评估结果

The 4AL decomposition achieves near-optimal attack quality compared to the global POMDP solution, with an average quality loss of only 1.96% and maximum loss of 14.1% (at |E|=7, |M|=6). Runtime scales polynomially in the number of machines (unlike the global POMDP which is exponential), staying below 37 seconds even with |M| and |E| both around 100. The decomposition is conservative, never overestimating the optimal policy value.

与全局 POMDP 解相比，4AL 分解实现了接近最优的攻击质量，平均质量损失仅为 1.96%，最大损失为 14.1%（在 |E|=7, |M|=6 时）。运行时间随机器数量呈多项式级增长（不同于呈指数级增长的全局 POMDP），即使在 |M| 和 |E| 都在 100 左右时，运行时间仍低于 37 秒。该分解是保守的，从未高估最优策略值。

Environment 评估环境

Metrics 评估指标

Baseline Comparisons 基准对比

global-POMDP-model

Scale 评估规模

Scalable test scenario based on Core Security industrial test suite with up to 100 machines and 100 exploit templates, network with 3 areas (Exposed, Sensitive, User) separated by firewalls.

Contributions 核心贡献

First POMDP formulation of penetration testing that intelligently interleaves scanning and exploitation under uncertainty, unlike prior classical planning approaches that require exhaustive scanning as a pre-process.
The 4AL decomposition algorithm that exploits network structure (biconnected components) to break the exponentially-scaling global POMDP into tractable single-machine POMDPs with conservative approximation guarantees.
A principled method for computing initial beliefs based on Markov chain models of software updates over time, capturing how uncertainty grows between pentests.
Formal proofs that the 4AL approximation is conservative (never overestimates policy value) at each decomposition level.

首次提出渗透测试的 POMDP 形式化，能够智能地在不确定性下交替进行扫描和漏洞利用，不同于以往需要详尽扫描作为预处理的经典规划方法。
提出了 4AL 分解算法，利用网络结构（双连通分量）将指数级增长的全局 POMDP 转化为可处理的单机 POMDP，并具有保守的近似保证。
提出了一种基于软件随时间更新的马尔可夫链模型来计算初始信念的原则性方法，捕捉了渗透测试间隔期间不确定性增长的过程。
形式化证明了 4AL 近似在每个分解层级上都是保守的（从未高估策略值）。

Limitations 局限性

The POMDP model does not handle privilege escalation or password retrieval; these are abstracted at the logical network level.
The software update model using independent Markov chains per program is simplistic and does not capture correlated updates or complex upgrade patterns.
The evaluation uses a simulated test scenario rather than a real deployed network; although based on Core Security industrial parameters, it remains synthetic.
The 4AL decomposition is domain-specific to network pentesting and does not contribute to general POMDP solving.
Level 2 of 4AL has worst-case exponential runtime in the size of biconnected components (though these are typically small in practice).
No comparison against the classical planning approach (Metric-FF) actually deployed in Core Insight Enterprise; only compared against the global POMDP baseline.
The approach does not model detection avoidance beyond a simple cost parameter r_d; no adaptive defender model is considered.
Actions are assumed deterministic given the true state, which may not hold for all real exploits that can have non-deterministic outcomes.

POMDP 模型不处理权限提升或密码获取；这些在逻辑网络层面进行了抽象。
使用每个程序独立的马尔可夫链进行的软件更新模型过于简单，没有捕捉到关联更新或复杂的升级模式。
评估使用的是模拟测试场景而非真实的部署网络；尽管基于 Core Security 的行业参数，但仍属于合成环境。
4AL 分解是针对网络渗透测试领域的，对通用 POMDP 求解没有贡献。
4AL 的第 2 层在双连通分量的大小上具有最坏情况下的指数级运行时间（尽管在实践中这些分量通常很小）。
未与 Core Insight Enterprise 中实际部署的经典规划方法 (Metric-FF) 进行对比；仅与全局 POMDP 基准进行了对比。
该方法除了简单的成本参数 r_d 外，没有建模检测规避；未考虑自适应防御者模型。
假设给定真实状态时动作是确定性的，这对于所有可能有非确定性结果的真实漏洞利用来说可能并不成立。

Research Gaps 研究空白

No comparison between the POMDP-based approach and the classical planning solution (Metric-FF) used commercially by Core Security, leaving cost-effectiveness unclear.
More accurate models of software updates are needed to produce realistic initial beliefs; current independent Markov chain model is acknowledged as oversimplified.
Tailoring POMDP solvers to the specific structure of pentesting problems (deterministic actions, static uncertain state components) could yield significant efficiency gains.
The relationship between this POMDP decomposition approach and general POMDP decomposition methods (e.g., Pineau et al. 2003, Mueller and Biundo 2011) is unexplored.
No modeling of an adaptive defender who may respond to detected scanning or exploitation attempts.
Post-exploitation actions (data exfiltration, persistence, cleanup) are not modeled.

尚未将基于 POMDP 的方法与 Core Security 商业使用的经典规划方案 (Metric-FF) 进行对比，因此成本效益尚不明确。
需要更准确的软件更新模型来产生现实的初始信念；目前的独立马尔可夫链模型被公认为过于简化。
针对渗透测试问题的特定结构（确定性动作、静态不确定状态组件）定制 POMDP 求解器可能会显著提高效率。
这种 POMDP 分解方法与通用 POMDP 分解方法（例如 Pineau et al. 2003, Mueller and Biundo 2011）之间的关系尚未探索。
未建模可能对检测到的扫描或利用尝试做出反应的自适应防御者。
未建模后渗透操作（数据窃取、持久化、清理）。

Novel Techniques 新颖技术

Modeling pentesting as a POMDP to naturally handle uncertainty about machine configurations and intelligently interleave reconnaissance with exploitation.
4AL hierarchical decomposition exploiting network graph structure (biconnected components via Hopcroft-Tarjan) to reduce exponential POMDP to polynomial-time algorithm with single-machine POMDPs.
Conservative approximation with formal guarantees: the decomposition never overestimates attack value, ensuring safe lower bounds on expected reward.
Initial belief computation via Markov chain evolution of software configurations over time, parameterized by days since last pentest.
Pivoting reward propagation through tree decomposition to account for downstream attack value when breaking into intermediate machines.

将渗透测试建模为 POMDP，以自然地处理关于机器配置的不确定性，并智能地将侦察与漏洞利用交替进行。
4AL 分层分解利用网络图结构（通过 Hopcroft-Tarjan 算法获得双连通分量），将指数级的 POMDP 降为多项式时间算法及单机 POMDP。
具有形式化保证的保守近似：分解从未高估攻击价值，确保了预期奖励的安全下界。
通过软件配置随时间的马尔可夫链演化计算初始信念，参数为距离上次渗透测试的天数。
通过树分解传播横向移动奖励，以在突破中间机器时考虑下游的攻击价值。

Open Questions 开放问题

How does the POMDP approach compare in cost-effectiveness to the classical planning solution already deployed industrially?
Can the POMDP model be extended to incorporate an adaptive defender or IDS that changes behavior based on detected attacker actions?
How would factored POMDP representations (rather than enumerating states) affect scalability of the single-machine models?
Can modern deep RL or neural POMDP solvers replace SARSOP to handle larger individual machine state spaces?
How could this framework incorporate privilege escalation, credential harvesting, and lateral movement as first-class actions?
Would combining this principled planning approach with LLM-based reasoning improve attack plan generation in modern pentesting tools?

POMDP 方法在成本效益上与已经工业化部署的经典规划方案相比如何？
POMDP 模型能否扩展到包含根据检测到的攻击者行为改变策略的自适应防御者或 IDS？
因子分解的 POMDP 表示（而非枚举状态）将如何影响单机模型的可扩展性？
现代深度强化学习或神经 POMDP 求解器能否取代 SARSOP 来处理更大的单机状态空间？
该框架如何将权限提升、凭据收集和横向移动作为一类操作整合进来？
将这种原则性的规划方法与基于 LLM 的推理相结合，是否能提高现代渗透测试工具的攻击计划生成能力？

Builds On 基于前人工作

Lucangeli, Sarraute, and Richarte 2010 (attack planning using PDDL and Metric-FF)
Sarraute, Buffet, and Hoffmann 2011 (preliminary POMDP model at SecArt'11)
Sarraute, Richarte, and Lucangeli 2011 (optimal attack paths in nondeterministic scenarios)
Boddy et al. 2005 (course of action generation for cyber security using classical planning)
SARSOP (Kurniawati, Hsu, and Lee 2008) point-based POMDP solver