On the Surprising Efficacy of LLMs for Penetration-Testing
Problem & Motivation
This paper critically examines why LLMs have proven surprisingly effective for penetration testing, systematically reviewing the evolution of LLM capabilities and their adoption in both academic research and industry for offensive security tasks across the cyber kill chain.
There is a massive shortage of cybersecurity professionals (a 4.7 million gap, growing 19.1% year-over-year), creating an urgent need to make penetration testers more effective or to automate parts of their work. LLMs have been increasingly adopted for this purpose, but no comprehensive analysis exists that explains why they work so well or catalogs the obstacles to further adoption.
Threat Model
Both benign (white-hat penetration testers augmenting their work) and malicious (black-hat actors using LLMs for exploit development, social engineering, and information gathering) usage scenarios are considered. The paper explicitly notes the dual-use nature of offensive security tooling and that the same techniques are used by both ethical and unethical hackers.
Methodology
The paper conducts a comprehensive survey and critical analysis of LLM-aided penetration testing across the field's first two years (2023-2025). It categorizes the landscape into interactive vibe-hacking and autonomous systems, identifies key factors explaining LLM efficacy (pattern-matching alignment, uncertainty handling, cost-effectiveness), catalogs obstacles to adoption (reliability, safety, privacy, ethics, accountability), and proposes research directions. The analysis draws on academic publications identified through survey papers found via Google Scholar, together with industry reports from LLM providers.
Architecture
N/A - survey paper. However, the paper describes and categorizes the architectures of the reviewed systems, including single-agent ReAct-based agents, hierarchical multi-agent systems with task-specific sub-agents, and human-in-the-loop copilot systems.
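The single-agent ReAct pattern mentioned above can be sketched as a minimal reason-act-observe loop. This is an illustrative sketch, not code from any reviewed system; `query_llm` and `run_tool` are hypothetical stand-ins for an LLM API call and a sandboxed tool executor:

```python
# Sketch of a ReAct-style pentesting agent loop (illustrative only).
# `query_llm` and `run_tool` are hypothetical placeholders for a real
# chat-completion API and a sandboxed tool dispatcher (nmap, etc.).

def query_llm(history):
    # Placeholder: a real agent would send `history` to an LLM here.
    # Returns (thought, action, action_input); "finish" ends the loop.
    return ("target enumerated", "finish", "report findings")

def run_tool(action, action_input):
    # Placeholder: dispatch to an actual tool and capture its output.
    return f"output of {action}({action_input})"

def react_agent(task, max_steps=10):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, action_input = query_llm(history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            history.append(f"Final: {action_input}")
            break
        # Otherwise: act, observe, and feed the observation back in.
        observation = run_tool(action, action_input)
        history.append(f"Action: {action}({action_input})")
        history.append(f"Observation: {observation}")
    return history

trace = react_agent("enumerate services on 10.0.0.5")
```

The hierarchical multi-agent variants the paper reviews essentially nest this loop, with a planner agent delegating sub-tasks to task-specific instances of it.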
LLM Models
Tool Integration
Memory Mechanism
RAG
Attack Phases Covered
Evaluation
The paper finds that LLMs are surprisingly effective for penetration testing due to three key factors: (1) penetration testing fundamentally resembles pattern matching, which LLMs excel at; (2) LLMs inherently cope with the uncertainty prevalent in real-world pentesting; and (3) LLM providers front-load model-creation costs, making deployment cost-effective ($0.10-$11.64 per pentest run). However, LLMs exhibit capability without reliability: they can exploit systems, but produce different attack chains across runs on the same testbed.
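The capability-versus-reliability gap can be made concrete by distinguishing whether an agent succeeds in at least one of k runs (capability) from whether it succeeds in all k runs (reliability). A minimal sketch, with illustrative outcomes not taken from the paper:

```python
# Sketch: capability vs. reliability over repeated runs of one testbed.
# Each outcome is True if the agent compromised the target in that run.

def capable(outcomes):
    # Capability: at least one of the k runs succeeded.
    return any(outcomes)

def reliable(outcomes):
    # Reliability: every one of the k runs succeeded.
    return all(outcomes)

# Hypothetical outcomes for 5 runs of the same testbed:
runs = [True, False, True, True, False]
print(capable(runs))   # True: the testbed was exploited at least once
print(reliable(runs))  # False: not every run succeeded
```

Under this framing, the paper's finding is that reviewed systems often score well on the first predicate while failing the second.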
Environment
Metrics
Baseline Comparisons
- Human penetration testers
Scale
N/A - survey paper reviewing multiple studies across varying scales
Contributions
- Provides the first comprehensive survey and critical analysis of LLM-aided penetration testing across both academic research and industry adoption (2023-2025)
- Articulates three key hypotheses explaining why LLMs are surprisingly effective for pentesting: alignment with pattern-matching, inherent uncertainty handling, and cost-effective access to pre-trained knowledge
- Categorizes the current landscape into vibe-hacking (interactive) and autonomous hacking, showing how both approaches converge as task complexity increases
- Systematically identifies and discusses six major obstacle categories preventing further adoption: model features/stability, safety/security, costs/efficiency, privacy/digital sovereignty, accountability, capability vs. reliability
- Proposes concrete research directions including better safeguards, improved reliability through single-agent and multi-agent approaches, and societal decision-making frameworks
- Documents malicious adoption of LLMs including darknet chatbots, influence operations, and APT usage, drawing on abuse reports from OpenAI, Anthropic, and Google
Limitations
- As a survey/position paper, the hypotheses about why LLMs work well for pentesting are speculative and based primarily on one author's 13 years of professional experience rather than rigorous empirical validation
- The review of academic research is limited to papers identified through specific survey papers on Google Scholar, potentially missing relevant work not captured by those surveys
- Industry adoption analysis relies on security news sites and LLM provider abuse reports, which likely underrepresent actual usage especially by malicious actors
- The paper does not provide a systematic comparison or meta-analysis of quantitative results across the reviewed systems
- Limited coverage of open-source and fine-tuned security-specific LLMs, which the authors note are typically not published or lack tool-calling capabilities
- The paper acknowledges that few empirical studies exist on penetration testers' actual work practices, making it difficult to fully validate the pattern-matching hypothesis
Research Gaps
- Capability vs. reliability gap: LLMs can exploit systems but lack consistency across runs, and no good solutions exist for improving reliability without prohibitive cost increases
- No standardized benchmarks that reflect real-world penetration testing complexity -- existing CTF-based benchmarks may not measure real-world impact
- Lack of empirical research on how professional penetration testers actually work and make decisions, which would validate or refute the pattern-matching hypothesis
- Missing safeguards for distinguishing ethical from unethical use of offensive AI tools -- identical techniques are used by both whitehats and blackhats
- No established frameworks for accountability when LLM-driven pentesting prototypes cause unintended damage
- Insufficient research on privacy and digital sovereignty implications of sending sensitive vulnerability data to cloud-based LLM providers
- Lack of cost-effectiveness analysis comparing LLM-driven pentesting against traditional approaches including total ecological costs
- Multi-stage network attacks remain under-explored compared to single-host exploitation
- No solutions for the inherent tension between LLM safety filters and legitimate offensive security use
Novel Techniques
- Framing penetration testing as fundamentally a pattern-matching task, explaining why LLMs (which excel at pattern matching) are inherently well-suited for it
- Concept of 'vibe-hacking' as a distinct category of LLM-assisted pentesting parallel to vibe-coding, representing interactive human-AI collaboration for security tasks
- Observation that LLM hallucinations are less problematic for pentesting than other domains because they resemble hypothesis-testing behavior of human pentesters
- Red Queen's hypothesis applied to pentesting: LLMs inherently stay current because newer training data incorporates new attack techniques, transferring maintenance costs to LLM providers
- Sampling-based strategy exploration (from Project Naptime/Big Sleep) where multiple vulnerability hypotheses are tested through independent trajectories rather than a single planning loop
- Tool abstraction layers that simplify tool usage for LLMs, enabling smaller models to perform pentesting effectively
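The sampling-based strategy exploration attributed to Project Naptime/Big Sleep can be sketched as launching an independent trajectory per vulnerability hypothesis rather than running a single planning loop. A minimal sketch with hypothetical helpers (`propose_hypotheses`, `run_trajectory`), not the actual implementation:

```python
# Sketch: sampling-based strategy exploration. Each sampled vulnerability
# hypothesis gets its own independent trajectory; the search stops as
# soon as any trajectory confirms its hypothesis.

def propose_hypotheses(target):
    # Placeholder for an LLM call that samples candidate vulnerabilities.
    return ["sql injection in /login",
            "path traversal in /files",
            "default credentials on admin panel"]

def run_trajectory(target, hypothesis):
    # Placeholder: a real system would run a full agent loop here that
    # tries to confirm the hypothesis; returns evidence or None.
    return "poc" if "default credentials" in hypothesis else None

def explore(target):
    for hypothesis in propose_hypotheses(target):
        # Each trajectory starts fresh, independent of the others.
        evidence = run_trajectory(target, hypothesis)
        if evidence is not None:
            return hypothesis, evidence
    return None

result = explore("http://testbed.local")
```

The design choice this illustrates is breadth over depth: many cheap, independent attempts replace one long, stateful plan, which also makes the trajectories trivially parallelizable.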
Open Questions
- Can LLM reliability be improved without proportionally increasing costs, or is there a fundamental capability-reliability tradeoff?
- How should society regulate dual-use offensive AI tools when the same techniques serve both legitimate security testing and malicious hacking?
- Will LLMs eventually replace human penetration testers or will they primarily serve as force multipliers requiring expert oversight?
- How can penetration testing results and vulnerability data be kept private when using cloud-based LLMs?
- Do reasoning models (o1, o3, DeepSeek-R1) actually improve pentesting outcomes enough to justify their 70x higher energy consumption compared to smaller models?
- Can synthetic benchmarks ever accurately predict real-world pentesting performance?
- Will the convergence of vibe-hacking and autonomous approaches lead to a qualitatively different capability level for LLM-driven attacks?
- How do we prevent the premature use of LLMs in security education from degrading the development of essential human pentesting skills?
Builds On
- Getting pwn'd by AI (Happe & Cito, 2023) - first paper using LLMs for pentesting
- PentestGPT (Deng et al., 2023) - interactive LLM-aided CTF hacking with Pentest-Task-Tree
- LLMs as Hackers (Happe et al., 2023) - privilege escalation benchmark
- LLM Agents can Autonomously Hack Websites (Fang et al., 2024) - function-calling for web hacking
- AutoAttacker (Xu et al., 2024) - Metasploit-based post-breach attacks with RAG
- Can LLMs Hack Enterprise Networks? (Happe et al., 2025) - hierarchical multi-agent for AD networks
- Mirsky et al. (2023) - pre-LLM survey of offensive AI threat to organizations
- VulnBot (Kong et al., 2025) - multi-agent with task graph and RAG
- Singer et al. (2025) - multistage network attacks with tool abstraction layer
Open Source
No