Automated Penetration Testing: Formalization and Realization

Charilaos Skandylas, Mikael Asplund

2024 | arXiv (Preprint submitted to Elsevier) (preprint)

system penetration-testing fully-autonomous single-agent classical-planning

Problem & Motivation

There is no existing automated penetration testing approach that can perform the whole penetration testing process, including attack planning, runtime decision-making to handle system state changes, and providing attack automation capabilities to target real-world systems. Most prior work focuses only on attack planning and lacks a precise formal definition of automated penetration testing.

Manual penetration testing is labor-intensive and requires highly skilled practitioners, yet there is an enormous shortage of cybersecurity experts. Existing automated approaches are fragmented: attack graph methods lack transfer to real scenarios, ML/DL approaches need extensive training data, and RL approaches have not been demonstrated on real systems. There is a need for both a unified formal framework and a practical end-to-end architecture that can autonomously perform penetration testing against realistic targets without prior knowledge of the system.

Threat Model

Black-box penetration testing is assumed by default. The tester has only network range access and flag identifiers; no prior information about hosts, services, architecture, or vulnerabilities is provided. The tool also supports white-box and gray-box modes. The goal is to discover as much of the target system as possible and to gain control of as many capabilities as possible (maximize discovery and exploitation).

Methodology

The paper first formalizes penetration testing at the architectural level using security-informed architectures, labeled transition systems (LTS), and game strategies. It defines penetration test states, attacks, scans, and strategies as formal mathematical objects. Building on this formalization, the authors propose a general self-adaptive architecture based on autonomic computing principles (MAPE-K loop) that can handle system dynamics and make runtime decisions. They then implement this architecture in ADAPT, a concrete tool that uses utility-based decision theory to rank targets, attacks, and scans, and employs a plugin-based managed system of scanning, exploitation, and post-exploitation tools.
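
The labeled-transition-system view can be made concrete with a small sketch. This is an illustration under simplifying assumptions, not the paper's formal definitions: a state is modeled as a set of known facts, scans and attacks are labeled transitions with preconditions and effects, and a strategy is a function from the current state to the next action. All names here (`Action`, `first_useful`, `REPERTOIRE`, the fact strings) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional

# Assumption: a penetration test state is just the set of facts the
# tester currently knows or controls.
State = FrozenSet[str]


@dataclass(frozen=True)
class Action:
    """A labeled transition: enabled when `pre` holds, adds the `post` facts."""
    label: str
    kind: str  # "scan" or "attack"
    pre: FrozenSet[str]
    post: FrozenSet[str]

    def enabled(self, state: State) -> bool:
        return self.pre <= state

    def apply(self, state: State) -> State:
        return state | self.post


# A strategy maps the current state and repertoire to the next action,
# or to None when it decides to stop.
Strategy = Callable[[State, list], Optional[Action]]


def first_useful(state: State, repertoire: list) -> Optional[Action]:
    """Toy strategy: pick the first enabled action that adds new facts."""
    for action in repertoire:
        if action.enabled(state) and not action.post <= state:
            return action
    return None


def run(initial: State, repertoire: list, strategy: Strategy) -> tuple:
    """Follow the strategy through the transition system; return (state, trace)."""
    state, trace = initial, []
    while (action := strategy(state, repertoire)) is not None:
        state = action.apply(state)
        trace.append(action.label)
    return state, trace


# Hypothetical repertoire of one scan and two attacks.
REPERTOIRE = [
    Action("exploit_ftp", "attack", frozenset({"ftp_open"}), frozenset({"shell"})),
    Action("port_scan", "scan", frozenset({"network_access"}), frozenset({"ftp_open"})),
    Action("escalate", "attack", frozenset({"shell"}), frozenset({"root"})),
]

final_state, trace = run(frozenset({"network_access"}), REPERTOIRE, first_useful)
print(trace)
```

Starting from network access alone, the toy strategy has to scan before any attack becomes enabled, which mirrors the black-box setting in which discovery precedes exploitation.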

Architecture

A self-adaptive system architecture with two subsystems: (1) a managing subsystem implemented as a MAPE-K autonomic manager with four phases - Monitoring (target and vulnerability discovery), Analysis (threat modeling, vulnerability analysis, target and attack scoring), Planning (target selection, attack selection, active attack planning, post-exploitation planning), and Execution (exploitation, post-exploitation); and (2) a managed subsystem comprising reconfigurable scanners, exploitation tools, and post-exploitation tools organized as plugins. The four MAPE-K phases share a common knowledge base storing the penetration test state, attack/scan repertoires, and strategy. The architecture supports concurrent operations with event-based callbacks.
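
The control flow of such a loop can be sketched as follows. This is a minimal, sequential illustration rather than ADAPT's implementation: the described architecture runs operations concurrently via event-based callbacks, which this sketch omits. The `Knowledge` fields, the plugin stand-ins, and the flat scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared knowledge base read and written by all four phases."""
    discovered: set = field(default_factory=set)   # services seen so far
    attempted: set = field(default_factory=set)    # services already attacked
    compromised: set = field(default_factory=set)  # services exploited successfully
    scores: dict = field(default_factory=dict)     # service -> utility score
    queue: list = field(default_factory=list)      # targets chosen for this cycle


# Hypothetical stand-ins for the managed subsystem's plugins.
def scanner_plugin() -> set:
    return {"ftp:21", "smb:445"}


def exploit_plugin(service: str) -> bool:
    return service == "smb:445"  # pretend only the SMB exploit succeeds


def monitor(k: Knowledge) -> None:
    """Monitoring: discover targets through the scanner plugins."""
    k.discovered |= scanner_plugin()


def analyze(k: Knowledge) -> None:
    """Analysis: score every target not yet attacked (flat toy score)."""
    k.scores = {s: 1.0 for s in k.discovered - k.attempted}


def plan(k: Knowledge) -> None:
    """Planning: select the best-scoring target, ties broken by name."""
    k.queue = sorted(k.scores, key=lambda s: (-k.scores[s], s))[:1]


def execute(k: Knowledge) -> None:
    """Execution: run the exploit plugin against each selected target."""
    for service in k.queue:
        k.attempted.add(service)
        if exploit_plugin(service):
            k.compromised.add(service)


def mape_k(k: Knowledge, max_cycles: int = 10) -> Knowledge:
    """Repeat the four phases until nothing is left to plan."""
    for _ in range(max_cycles):
        monitor(k)
        analyze(k)
        plan(k)
        if not k.queue:
            break
        execute(k)
    return k


result = mape_k(Knowledge())
print(sorted(result.compromised))
```

Because every phase communicates only through the knowledge base, a failed attack simply changes what the next analysis phase sees, which is how the loop adapts at runtime without a fixed up-front plan.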

Tool Integration

Memory Mechanism

knowledge-graph

Attack Phases Covered

reconnaissance

scanning

enumeration

exploitation

post exploitation

privilege escalation

lateral movement

reporting

Evaluation

ADAPT successfully performed fully autonomous black-box penetration testing against three targets: Metasploitable2 (root access via 11 different interfaces), Metasploitable3 (multiple exploitation paths requiring multi-step attacks, including EternalBlue), and a realistic VM network lab (all 11 flags found and root privileges gained on all hosts). ADAPT required 4 attacker actions for Metasploitable2, up to 7 for Metasploitable3, and 12 for the full VM network. The MAPE-K loop overhead was negligible (tens of milliseconds), with most of the time spent on managed system tool execution. Increasing the concurrency threshold from 1 to 2 noticeably reduced total running time.
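
Why a higher concurrency threshold shortens a run dominated by tool execution can be illustrated with a small sketch. It assumes the threshold behaves like a cap on the number of tools in flight, which is modeled here with a thread pool's `max_workers`; `run_tool`, the task names, and the sleep durations are hypothetical stand-ins, not measurements from the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_tool(name: str, seconds: float) -> str:
    """Hypothetical stand-in for one scanner or exploit plugin run."""
    time.sleep(seconds)
    return name


def run_all(tasks: dict, threshold: int) -> tuple:
    """Run every task with at most `threshold` in flight; return (finished, elapsed)."""
    finished = []
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threshold) as pool:
        for name, seconds in tasks.items():
            future = pool.submit(run_tool, name, seconds)
            # Event-based callback: record the result as soon as the tool finishes.
            future.add_done_callback(lambda f: finished.append(f.result()))
    return finished, time.perf_counter() - start


# Four equally long, made-up tool runs.
TASKS = {"nmap": 0.05, "smb_enum": 0.05, "ftp_probe": 0.05, "http_probe": 0.05}

done_1, elapsed_1 = run_all(TASKS, threshold=1)
done_2, elapsed_2 = run_all(TASKS, threshold=2)
print(f"threshold=1: {elapsed_1:.2f}s, threshold=2: {elapsed_2:.2f}s")
```

With four equally long tasks, a threshold of 2 roughly halves the wall-clock time in this sketch, since the work is waiting on tools rather than on decision-making.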

Environment

Metrics

Baseline Comparisons

Student groups performing manual penetration testing on VM network
PentestGPT (implicit comparison via discussion)

Scale

3 case studies: Metasploitable2, Metasploitable3 (Windows), and a multi-host VM network with 11 embedded flags

Contributions

Formal formulation of penetration testing at the architectural level using security-informed architectures, labeled transition systems, and game strategies, providing a unified framework that can incorporate prior RL and PDDL-based formulations
A generic automated penetration testing architecture based on autonomic computing (MAPE-K loop) with self-adaptive capabilities for runtime decision-making, attack automation, and dynamic handling of system state changes
Implementation and evaluation of ADAPT, a concrete plugin-based tool instantiating the architecture, demonstrated against realistic systems including Metasploitable2, Metasploitable3, and a multi-host VM network used in university ethical hacking courses

Limitations

Attack and scan repertoires must be manually designed and configured; the tool does not automatically synthesize new attack tactics from constituent techniques
The effectiveness is bounded by the repertoires: ADAPT cannot exploit vulnerabilities not covered by its scan and attack repertoires or perform attacks not present in its attack tactics
Intrusive tool plugins are withheld from public release due to ethical considerations, limiting full reproducibility
Evaluation limited to three case studies; no large-scale evaluation across diverse network topologies or modern enterprise environments
Utility weights and value functions for target/attack/scan ranking are manually specified and may not generalize across different environments
Does not use LLMs or any form of machine learning for decision-making; relies entirely on predefined utility functions

Research Gaps

Need for automated repertoire generation - the ability to automatically synthesize attack tactics from constituent attack techniques
Investigation of the architecture's effectiveness in different domains and as a component of a larger cybersecurity tool ecosystem
Expanding autonomy beyond level 3 by incorporating more sophisticated reasoning and decision-making capabilities
Most existing automated penetration testing approaches lack runtime decision-making and attack automation capabilities
No existing approach automates the complete penetration testing process end-to-end against realistic systems
LLM-based approaches show initial promise on single VMs but are outperformed by ADAPT in services exploited, running time, and scalability

Novel Techniques

Formal definition of penetration testing as a labeled transition system with a game strategy, unifying RL and PDDL formulations under one framework
Security-informed architecture model that captures components, interfaces, capabilities (intended, adversary-controllable, non-controllable), vulnerabilities, and interactions
MAPE-K autonomic loop applied to penetration testing with event-based callbacks enabling concurrent scanning, exploitation, and post-exploitation operations
Utility-based multi-factor scoring system for runtime target prioritization, attack selection, and scan selection with configurable weights
Exploitation graph representation showing the full penetration test execution trace with scans, exploits, post-exploitation steps, and discovered capabilities
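
The utility-based scoring item above can be illustrated with a short sketch. It assumes the simplest multi-factor form, a weighted sum of normalized factor values; the paper's actual value functions are not reproduced here, and the factor names, weights, and candidate attacks are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A target, attack, or scan described by factor values in [0, 1]."""
    name: str
    factors: dict  # factor name -> normalized value


def utility(candidate: Candidate, weights: dict) -> float:
    """Weighted sum of the candidate's factors; missing factors count as 0."""
    return sum(w * candidate.factors.get(f, 0.0) for f, w in weights.items())


def rank(candidates: list, weights: dict) -> list:
    """Return candidates ordered from highest to lowest utility."""
    return sorted(candidates, key=lambda c: utility(c, weights), reverse=True)


# Hypothetical factors and weights; in practice these would be configured
# per engagement.
WEIGHTS = {"success_likelihood": 0.5, "impact": 0.4, "stealth": 0.1}

ATTACKS = [
    Candidate("brute_force_ssh", {"success_likelihood": 0.3, "impact": 0.6, "stealth": 0.1}),
    Candidate("smb_rce", {"success_likelihood": 0.8, "impact": 0.9, "stealth": 0.4}),
    Candidate("web_sqli", {"success_likelihood": 0.6, "impact": 0.5, "stealth": 0.7}),
]

ranked = rank(ATTACKS, WEIGHTS)
for c in ranked:
    print(f"{c.name}: {utility(c, WEIGHTS):.2f}")
```

Changing only the `WEIGHTS` dictionary reorders the candidates without touching the ranking code, which is the practical benefit of keeping the weights configurable.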

Open Questions

Can LLMs replace or augment the utility-based decision-making in the analysis and planning phases for more adaptive reasoning?
How can attack repertoires be automatically generated or extended, potentially using LLMs to synthesize new attack tactics?
How does the approach scale to large enterprise networks with hundreds or thousands of hosts?
Can the formal framework be extended to model defensive countermeasures and adversarial interactions?
How would the architecture perform against modern, hardened systems with active defenses?

Builds On

MAPE-K autonomic computing architecture (Kephart and Chess, 2003)
Security-informed architecture modeling (Bratus and Shubina, 2017)
Labeled transition systems (van Glabbeek, 2001)
Utility-based decision making (von Neumann and Morgenstern, 1947; Anand, 1995)
Attack planning via PDDL (Obes et al., 2013)
RL-based penetration testing (Zhou et al., 2019; Hu et al., 2020)

Open Source

Partial - https://gitfront.io/r/anonymous-submitter/P2LRhxvh9L7z/ADAPT/ (intrusive plugins withheld for ethical reasons)