#03

Automated Penetration Testing with LLM Agents and Classical Planning

Lingzhi Wang, Xinyi Shi, Ziyu Li, Yi Jiang, Shiyu Tan, Yuhao Jiang, Junjie Cheng, Wenyuan Chen, Xiangmin Shen, Zhenyuan Li, Yan Chen

2025 | arXiv (preprint)

arXiv:2512.11143v1

Problem & Motivation

Fully automated, hands-off-the-keyboard penetration testing remains a significant research challenge. Existing LLM-based pentesting systems struggle with maintaining coherent long-horizon plans, performing complex reasoning, and effectively utilizing specialized tools, which constrains their capability, efficiency, and stability.

The global pentesting market is projected to reach USD 3.9 billion by 2029, driving demand for AI-powered automation. Current systems still require nontrivial human intervention, and LLM agents alone suffer from hallucinations, limited context memory, erratic planning, and inability to leverage specialized tools. There is a clear gap for a structured planning approach that complements LLM strengths while mitigating their weaknesses.

Threat Model

The attacker has network access to a target system (IP address provided) and can use any tool available in a standard Kali Linux distribution. No prior knowledge of the target's vulnerabilities is assumed. All targets are containerized vulnerable environments (Vulhub) with known exploitable vulnerabilities. The system must operate fully autonomously without human guidance.

Methodology

CHECKMATE integrates classical planning with LLM agents through a Planner-Executor-Perceptor (PEP) paradigm. The planner uses 'Classical Planning+', a novel extension of classical planning that leverages LLMs to dynamically determine non-deterministic action effects at runtime, enabling planning in partially observable and non-deterministic environments. Predefined attack actions encode specialized pentesting tools (Metasploit modules, NSE scripts, Nuclei templates) with explicit preconditions and effects, providing structured knowledge that LLMs lack. The LLM agent serves as a constrained executor for individual actions rather than orchestrating the entire workflow.

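The action encoding and runtime effect resolution can be sketched as follows. This is a minimal, hypothetical rendering: the predicates, action names, and the `interpret` hook are illustrative assumptions, not CHECKMATE's actual API. The key ideas are that preconditions are checked classically, while the effect that actually occurred is resolved at runtime (by an LLM in CHECKMATE's design):

```python
from dataclasses import dataclass, field

@dataclass
class AttackAction:
    name: str
    tool: str                # e.g. a Metasploit module, NSE script, or Nuclei template
    preconditions: frozenset # predicates that must hold before the action can run
    possible_effects: list = field(default_factory=list)  # non-deterministic outcome branches

def feasible(action: AttackAction, state: set) -> bool:
    """Classical precondition check: every precondition is already in the state."""
    return action.preconditions <= state

def apply_observed_effect(state: set, action: AttackAction, tool_output: str,
                          interpret) -> set:
    """Classical Planning+: the effect is not known until runtime, so an
    interpreter (the LLM in CHECKMATE) picks which outcome branch occurred."""
    effect = interpret(action, tool_output)
    return state | set(effect)

scan = AttackAction(
    name="nmap_service_scan",
    tool="nmap -sV",
    preconditions=frozenset({("host_reachable", "target")}),
    possible_effects=[{("service_detected", "target", "http")},
                      {("no_open_ports", "target")}],
)

state = {("host_reachable", "target")}
assert feasible(scan, state)
# Stub interpreter standing in for the LLM: always pick the first branch.
new_state = apply_observed_effect(state, scan, "80/tcp open http",
                                  lambda a, out: a.possible_effects[0])
assert ("service_detected", "target", "http") in new_state
```

Precondition-based retrieval falls out of this encoding: only actions whose preconditions are satisfied by the current state are candidates, replacing embedding-similarity lookup.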

Architecture

Three-component PEP architecture: (1) Classical Planning+ planner that maintains a DAG of feasible actions, checks preconditions, and selects next actions; (2) LLM agent executor that translates planned actions into concrete commands using action-specific prompts with parameter placeholders; (3) Dual perceptor system with rule-based parsing for structured outputs and LLM-based interpretation for unstructured outputs, both converting results into classical planning predicates to update the state.

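The three components interact in a closed loop that can be sketched as below. This is a toy rendering under stated assumptions (actions as dicts of precondition/effect predicate sets; the real planner maintains a DAG over actions and the real perceptor combines rule-based parsers with LLM interpretation):

```python
def pep_loop(actions, state, goal, plan_next, execute, perceive, max_steps=50):
    """Minimal Planner-Executor-Perceptor loop: the planner picks a feasible
    action, the executor runs it, and the perceptor converts the raw tool
    output back into planning predicates that update the state."""
    for _ in range(max_steps):
        if goal <= state:                              # all goal predicates satisfied
            return state
        feasible = [a for a in actions if a["pre"] <= state]
        if not feasible:                               # planner has no frontier left
            break
        action = plan_next(feasible, state)            # planner: select next action
        raw_output = execute(action, state)            # executor: build and run command
        state = state | perceive(action, raw_output)   # perceptor: output -> predicates
    return state

# Toy two-step "attack": scan opens up the exploit's precondition.
actions = [
    {"name": "scan",    "pre": frozenset(),                    "eff": {("port_open", 80)}},
    {"name": "exploit", "pre": frozenset({("port_open", 80)}), "eff": {("got_shell",)}},
]
final = pep_loop(actions, set(), {("got_shell",)},
                 plan_next=lambda feas, s: feas[-1],          # toy heuristic: deepest action
                 execute=lambda a, s: f"ran {a['name']}",     # stand-in for real tooling
                 perceive=lambda a, out: a["eff"])            # stand-in for both perceptors
assert ("got_shell",) in final
```

The loop terminates either when the goal predicates hold or when no action's preconditions are satisfiable, which is where the planner's structured knowledge constrains the LLM executor to one action at a time.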

LLM Models

Claude Sonnet 4.5, GPT o4-mini, Gemini Pro 2.5

Tool Integration

nmap, metasploit, nuclei, nse-scripts, whatweb, netcat, searchsploit, kali-linux-tools

Memory Mechanism

knowledge-graph

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

CHECKMATE achieves 88% success rate reaching milestone M7 (interactive shell), improving benchmark success rates by over 20% compared to Claude Code. It reduces both monetary cost (average $0.68, 53% lower than Claude Code) and execution time (average 7.75 minutes, 54% lower) by more than 50%. CHECKMATE achieves 100% consistency across repeated runs versus Claude Code's 75%, with coefficient of variation for cost at 0.129 vs 0.451 and for time at 0.093 vs 0.325.

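The stability figures use the coefficient of variation (sample standard deviation divided by the mean) over repeated runs; lower means more stable cost and time. A quick sketch with made-up per-run costs (not the paper's data):

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean; a scale-free spread measure,
    so dollar costs and minutes can be compared on the same footing."""
    return stdev(samples) / mean(samples)

# Illustrative (fabricated) per-run costs in USD for one task, 3 runs each:
stable_runs = [0.65, 0.70, 0.69]   # tight spread -> low CV
erratic_runs = [0.40, 1.10, 0.55]  # wide spread  -> high CV
assert coefficient_of_variation(stable_runs) < coefficient_of_variation(erratic_runs)
```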

Environment

Vulhub

Metrics

milestone-progression, success-rate, monetary-cost, execution-time, stability (coefficient of variation)

Baseline Comparisons

  • Claude Code + Sonnet 4.5
  • PentestGPT + o4-mini
  • PentestAgent + o4-mini
  • CAI + o4-mini
  • AutoPentester
  • Codex + o4-mini
  • Gemini Code Assist + Gemini Pro 2.5
  • Claude Code + RAG
  • Claude Code + Structured Planning File

Scale

120 Vulhub containerized vulnerable environments (largest benchmark of its kind); 20 tasks for efficiency/stability evaluation with 3 repeated runs each

Contributions

  • Proposed the Planner-Executor-Perceptor (PEP) design paradigm for systematically decomposing, reviewing, and comparing automated pentesting systems
  • Conducted the largest evaluation of existing pentesting systems on the Vulhub dataset (120 containers), showing Claude Code + Sonnet 4.5 substantially outperforms all prior systems, and identified three key limitations of LLM agents in pentesting
  • Proposed Classical Planning+, the first LLM-augmented classical planning framework with dynamic updates, extending classical planning to partially observable and non-deterministic domains
  • Developed CHECKMATE, integrating classical planning+ with LLM agents, achieving 20%+ improvement in success rates and 50%+ reduction in cost and time over Claude Code
  • Defined an 11-milestone evaluation framework for measuring pentesting progress more rigorously than prior sub-task-based metrics

Limitations

  • Predefined attack actions require manual curation and domain expertise to create, limiting scalability to new attack types and tools
  • Evaluation limited to Vulhub single-application vulnerabilities which may not represent complex real-world multi-host network pentesting scenarios
  • Milestones M8-M11 (privilege escalation, lateral movement, credential theft) are difficult to evaluate on Vulhub, leaving post-exploitation capability largely untested
  • System relies on a fixed set of predefined factors (applications, CVEs, URLs, credentials) as preconditions and effects, which may not capture all relevant pentesting state
  • Cannot handle scenarios requiring GUI interaction, visual understanding, or human-computer interaction (e.g., web UI manipulation, CAPTCHA solving)
  • The classical planning+ approach still depends on the LLM for non-deterministic effect interpretation, inheriting some LLM limitations for that component
  • Puzzle-like CTF challenges and real-world networks with active defenses were explicitly excluded from evaluation

Research Gaps

  • No formal framework exists for defining the complete action space and state representation needed for pentesting, which is inherently open-ended
  • Multimodal and UI-aware pentesting is unexplored; no prior work leverages visual artifacts or Computer-User Interaction Simulation for pentesting
  • Automated extraction of structured pentesting knowledge from unstructured sources to enable formal planning algorithms remains unsolved
  • Experience-driven reasoning (recognizing subtle cues like URL patterns hinting at specific backends) is lacking in LLM-based pentesting
  • Standardized evaluation methodology for pentesting systems is missing; human intervention levels are inconsistent across studies making comparison difficult
  • Commercial systems (XBOW, AutoAttacker) do not release code, preventing independent reproduction and fair comparison

Novel Techniques

  • Classical Planning+ - extending classical planning to partially observable and non-deterministic domains by using LLMs to dynamically determine action effects at runtime
  • Predefined attack actions as an alternative to RAG and fine-tuning for expanding LLM knowledge of specialized tools, with precondition-based retrieval instead of embedding-based similarity
  • Dual perceptor design combining rule-based parsing for structured outputs with LLM-based interpretation for unstructured outputs, mapping both to classical planning predicates
  • Using predefined command templates with LLM-populated placeholders to reduce hallucination in command generation while maintaining flexibility
  • 11-milestone evaluation framework measuring actual pentesting progress rather than activity completion

Open Questions

  • How can predefined attack actions be automatically generated or discovered rather than manually curated?
  • Can Classical Planning+ scale to large enterprise networks with hundreds of hosts and complex interdependencies?
  • How would CHECKMATE perform against actively defended targets with IDS/IPS, firewalls, and incident response?
  • Can the PEP paradigm be extended to incorporate multimodal perception for GUI-based and visual pentesting scenarios?
  • What is the optimal granularity for predefined actions? Too coarse loses flexibility; too fine loses the benefit of structured planning
  • How does CHECKMATE compare to commercial systems like XBOW that do not release their implementations?
  • Can the classical planning+ approach be combined with reinforcement learning for improved action selection beyond LLM-based ranking?

Builds On

  • ChainReactor (classical planning for privilege escalation)
  • PentestGPT (LLM + Penetration Testing Tree)
  • PentestAgent (LLM + CVE-Exploit Mapping)
  • CAI (multi-agent cybersecurity AI)
  • AutoPentester (LLM agent-based pentesting)
  • Claude Code (LLM code agent)

Open Source

No
