#19

Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks

Andreas Happe, Jürgen Cito

2025 | ACM Transactions on Software Engineering and Methodology (TOSEM) (journal)

https://doi.org/10.1145/3766895

Problem & Motivation

Traditional enterprise penetration testing is limited by high operational costs and the scarcity of human expertise, leaving many organizations (especially SMEs and NGOs) unable to regularly validate their security posture against internal threats like ransomware.

Assumed Breach simulations are critical for finding vulnerabilities in enterprise Active Directory networks before real attackers exploit them, but they are expensive and require skilled human testers. Existing LLM-based pentesting prototypes are constrained to single-host targets or require human intervention, leaving multi-host enterprise network attacks unexplored. This paper fills the gap by investigating whether fully autonomous LLM-driven systems can conduct Assumed Breach simulations against realistic AD environments.

Threat Model

Assumed Breach scenario: the attacker has already breached the network perimeter and has access to a Linux (Kali) VM within the target enterprise network. The attacker starts with an OSINT-gathered user list and a standard password wordlist (rockyou.txt), but no prior credentials or specific knowledge of the GOAD testbed. The goal is to achieve domain dominance by compromising as many AD accounts as possible.

Methodology

The authors introduce cochise, a fully autonomous LLM-driven prototype that performs Assumed Breach penetration testing against live Active Directory enterprise networks. The system uses a two-level architecture: a high-level Planner that maintains a Pentest-Task-Tree (PTT) for strategic planning and task selection, and a low-level Executor that implements a ReAct agent pattern to generate and execute Linux commands via SSH on an attacker VM. Five different LLM configurations are empirically evaluated across 6 sampling runs each on the Game of Active Directory (GOAD) testbed, with results analyzed through both quantitative metrics and qualitative expert review by three professional penetration testers.

Architecture

Two-component hierarchical architecture: (1) The Planner uses an LLM to maintain and update a Pentest-Task-Tree (PTT), select the next task, and provide context to the Executor. It receives the Executor's summary, the shell history, and the existing PTT as input for each update-plan cycle. (2) The Executor implements a ReAct agent pattern: it receives a task and context from the Planner, then iteratively generates Linux commands via function-calling, executes them over SSH on a Kali Linux VM, and analyzes the results. The Executor has a 10-round limit per task and a 10-minute command timeout. The prototype connects to LLM APIs from a control PC outside the target network and issues commands to the attacker VM via SSH. A monetary fail-safe drops the command history from the Planner's input if it exceeds 100 KB, and LangChain's trim_messages is used to fit the shell history into the Executor's context window.

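
The Planner/Executor cycle described above can be sketched as follows; the function names (`update_plan`, `pick_task`, `run_task`) and prompt formats are illustrative assumptions, with stubs standing in for the LLM APIs and the SSH channel:

```python
MAX_EXECUTOR_ROUNDS = 10  # per-task round limit described above

def update_plan(llm, ptt, summary, shell_history):
    """Planner: ask the LLM to revise the Pentest-Task-Tree (PTT)."""
    return llm(f"PTT:{ptt}|summary:{summary}|history:{shell_history}")

def pick_task(llm, ptt):
    """Planner: select the next task from the updated tree."""
    return llm(f"pick next task from {ptt}")

def run_task(llm, task, context, ssh_exec):
    """Executor: ReAct-style loop generating and running shell commands."""
    history = []
    for _ in range(MAX_EXECUTOR_ROUNDS):
        command = llm(f"task:{task}|context:{context}|so far:{history}")
        if command is None:  # the model signals the task is finished
            break
        history.append((command, ssh_exec(command)))
    return f"summary of {len(history)} commands", history

def pentest(llm, ssh_exec, rounds=3):
    """Top-level loop: plan, pick a task, execute, feed results back."""
    ptt, summary, shell_history = "initial recon tasks", "", []
    for _ in range(rounds):
        ptt = update_plan(llm, ptt, summary, shell_history)
        task = pick_task(llm, ptt)
        summary, shell_history = run_task(llm, task, ptt, ssh_exec)
    return ptt
```

The real prototype additionally summarizes Executor output for the Planner and trims the shell history to fit each model's context window.
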
LLM Models

  • GPT-4o (gpt-4o-2024-08-06)
  • DeepSeek-V3
  • Gemini-2.5-Flash (Preview)
  • OpenAI o1 (o1-preview-2024-12-17) + GPT-4o
  • Qwen3:32b

Tool Integration

nmap, netexec (nxc), smbclient, the impacket suite (GetUserSPNs, GetNPUsers, smbexec, secretsdump, getADUsers, mssqlclient), john (John the Ripper), hashcat, rpcclient, ldapsearch, bloodhound-python, certipy, responder, tcpdump, tshark, gobuster, dirb, nikto, gophish, social-engineer-toolkit (SET), evil-winrm, smbmap, and custom SSH command execution
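
All of these tools are driven as ordinary shell commands generated by the Executor. A minimal sketch of the guarded execution step, assuming a local subprocess stands in for the SSH channel to the Kali VM (the 600-second default mirrors the 10-minute command timeout described in the architecture section):

```python
import subprocess

COMMAND_TIMEOUT = 600  # seconds; mirrors the paper's 10-minute limit

def execute(command: str, timeout: int = COMMAND_TIMEOUT) -> str:
    """Run one generated command and return its combined output.

    In the prototype the command runs on the Kali attacker VM over SSH;
    a local shell stands in here for illustration only.
    """
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        # The timeout note is fed back to the LLM so it can react to
        # long-running tools instead of hanging the whole run.
        return f"command timed out after {timeout}s"
```

A call such as `execute("nmap -sn 192.168.56.0/24")` would return the scan output for the Executor's next ReAct round.
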

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

Reasoning LLMs (o1+GPT-4o and Gemini-2.5-Flash) substantially outperformed non-reasoning models, compromising more AD accounts (avg 1.83 and 0.83 vs 0.33 per 2-hour run) and generating double the leads. The o1+GPT-4o configuration was the best performer but most expensive at $23.28 per run ($17.56 per compromised user), while DeepSeek-V3 was cheapest at $0.26 per run. All models except Qwen3 demonstrated sufficient penetration-testing knowledge, with 72 different tools used across o1+GPT-4o runs. The prototype exhibited self-correction capabilities, automatically fixing invalid commands (35.9% invalid rate), and costs were competitive with or significantly lower than professional human penetration testers ($53-300/hour).

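
The cost claim can be sanity-checked with the reported figures; a quick back-of-the-envelope comparison, taking the human rates from the $53-300/hour range above:

```python
# Figures reported above: priciest configuration vs. human rates.
run_cost = 23.28       # o1+GPT-4o, per 2-hour run (USD)
human_low = 53 * 2     # cheapest human tester, same 2-hour window
human_high = 300 * 2   # most expensive human tester

# Even the most expensive LLM configuration undercuts the cheapest
# human rate for the same wall-clock window.
assert run_cost < human_low
print(f"LLM run: ${run_cost:.2f} vs human: ${human_low}-${human_high}")
```
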
Environment

GOAD (Game of Active Directory) v3

Metrics

compromised-accounts, almost-there-attacks, leads-generated, planner-rounds, executor-rounds, commands-executed, token-usage, cost-per-run, cost-per-compromised-user, invalid-command-rate, MITRE-ATT&CK-technique-coverage, attack-vector-coverage

Baseline Comparisons

  • GPT-4o (non-reasoning baseline)
  • DeepSeek-V3 (open-weight non-reasoning baseline)
  • Qwen3:32b (open-weight small reasoning model)
  • Gemini-2.5-Flash (integrated reasoning model)
  • OpenAI o1 + GPT-4o (dedicated reasoning planner + non-reasoning executor)

Scale

5 VMs in GOAD testbed (3 domain controllers + 2 servers across 3 AD domains), 30 user accounts and 3 service accounts, 6 sampling runs per LLM configuration (30 total runs)

Contributions

  • First fully autonomous LLM-driven framework (cochise) capable of performing Assumed Breach penetration testing against live multi-host Active Directory enterprise networks, going beyond single-host CTF-style targets
  • Comprehensive empirical evaluation comparing five LLM configurations (including reasoning and non-reasoning models, closed and open-weight) on a realistic GOAD testbed with both quantitative metrics and qualitative expert analysis from three professional penetration testers
  • First study to apply cutting-edge reasoning LLMs (OpenAI o1, Gemini-2.5-Flash) to automated penetration testing, demonstrating that reasoning models perform 80% more high-level strategy rounds and compromise substantially more accounts than non-reasoning models
  • Systematic mapping of LLM-generated attack trajectories to MITRE ATT&CK tactics and techniques, demonstrating diverse attack coverage across Reconnaissance, Discovery, Credential Access, and Lateral Movement
  • Open-source release of prototype, execution logs, and analysis scripts to foster reproducible research

Limitations

  • High rate of invalid commands generated by the Executor (35.9% for o1+GPT-4o), especially with tools like hashcat (94% failure) and impacket-mssqlclient (69% failure) due to incorrect parameters and hash formats
  • Information transfer problems between Planner and Executor: the Planner often fails to include critical context (e.g., password hashes, credentials) when delegating tasks, and sometimes replaces hashes with placeholders
  • LLMs tend to 'go down rabbit holes', hyper-focusing on a single attack avenue for extended periods (5+ consecutive tasks) while ignoring alternative approaches
  • Qwen3:32b (the only locally-run small model) completely failed to integrate Executor results back into the PTT and routinely hallucinated successful exploits
  • The prototype only operates through a Linux attacker VM, missing Windows-native AD tools (Rubeus, PowerView, SharpView, etc.) that could significantly expand attack capabilities
  • Evaluation limited to a single testbed (GOAD) which, while realistic, may not capture the full diversity of real enterprise networks
  • Safety concerns: Qwen3 ignored safety instructions in the scenario prompt, targeted excluded systems, and replaced the penetration-testing goal with unrelated tasks
  • Advanced attacks like Kerberos Unconstrained Delegations, MSSQL Links abuse, and Pass-the-Hash/Token were added to PTTs but never selected for execution by the Planner
  • No LLM was able to match credential hints from SMB share text files to domain users or extract passwords from PowerShell SecureString encrypted files

Research Gaps

  • Circuit breaker mechanisms to prevent LLMs from getting stuck in unproductive attack loops (rabbit holes) and force exploration of alternative attack vectors
  • Improved shared state management between Planner and Executor, potentially through a persistent shared fact repository rather than relying solely on summary-based information transfer
  • Investigation of small language models (SLMs) for specialized pentesting tasks, as Qwen3 showed sufficient background knowledge but failed at integration and instruction-following
  • Development of attack-specific function-call abstractions for the Executor to reduce command generation errors (e.g., a dedicated password-cracking function instead of raw hashcat/john invocations)
  • Robust countermeasures against LLM-based attacks: automated defenses, LLM-specific tarpits/honeypots, and proactive prompt-injection defenses
  • Better handling of long-running processes and passive reconnaissance tools (network sniffers) beyond simple timeout mechanisms
  • Integration of Windows attacker VMs to unlock the full spectrum of AD-native penetration testing tools
  • Research into LLM guardrails and safety mechanisms, as open-weight models demonstrated concerning disregard for safety instructions
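
The first gap above, a circuit breaker against rabbit holes, admits a very small sketch: track consecutive task selections per attack avenue and veto the next one once a threshold is hit. The limit of five matches the 5+ consecutive-task behaviour noted in the limitations; the class and its API are hypothetical, not part of the prototype:

```python
from collections import deque

class CircuitBreaker:
    """Force the Planner to switch attack avenues once too many
    consecutive tasks target the same one (a sketch, not cochise)."""

    def __init__(self, limit: int = 5):
        self.limit = limit
        self.recent = deque(maxlen=limit)  # sliding window of avenues

    def allow(self, avenue: str) -> bool:
        """Return False once `limit` consecutive tasks share an avenue."""
        self.recent.append(avenue)
        return not (len(self.recent) == self.limit
                    and len(set(self.recent)) == 1)

breaker = CircuitBreaker(limit=5)
# four kerberoasting tasks in a row pass; the fifth trips the breaker
```

On a trip, the Planner prompt could be amended to exclude the exhausted avenue, forcing exploration of the alternatives already present in the PTT.
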

Novel Techniques

  • Pentest-Task-Tree (PTT) as a persistent, evolving planning artifact that encodes the full penetration-testing state and enables resumable test runs
  • Hierarchical Planner-Executor architecture where the Planner uses a reasoning LLM for strategy while the Executor uses a non-reasoning LLM for command generation, optimizing cost and capability
  • Autonomous self-correction at two levels: Executor auto-repairs invalid commands using error messages, and Planner suggests remediation strategies for failed Executor tasks
  • Scenario-specific password generation by LLMs that recognizes testbed themes (Game of Thrones) and generates contextually appropriate password candidates without explicit instruction
  • Inter-context attacks: LLMs autonomously switch between attack modalities (e.g., AD network attacks to web application audits to social engineering) based on discovered services, surpassing traditional scanner capabilities
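
The first of these, the PTT as a persistent planning artifact, can be sketched as a small serializable tree; all field names here are assumptions for illustration, not the prototype's actual schema:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class PTTNode:
    """One task in the Pentest-Task-Tree; status makes runs resumable."""
    task: str
    status: str = "open"          # open | done | failed
    children: list = field(default_factory=list)

    def add(self, task: str) -> "PTTNode":
        child = PTTNode(task)
        self.children.append(child)
        return child

    def next_open(self):
        """Depth-first search for the next leaf task still to attempt."""
        if self.status == "open" and not self.children:
            return self
        for child in self.children:
            found = child.next_open()
            if found:
                return found
        return None

root = PTTNode("compromise AD domains", status="done")
recon = root.add("enumerate SMB shares")
recon.status = "done"
root.add("kerberoast service accounts")

# Persisting the tree is what lets an interrupted run resume later.
saved = json.dumps(asdict(root))
```

Because the whole state lives in the tree rather than in the Executor, the stateless Executor can be restarted at any point, which is what enables resumable test runs.
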

Open Questions

  • Can LLMs achieve full domain dominance (domain admin compromise) in enterprise networks, or are they limited to initial access and lateral movement?
  • How would LLM-driven penetration testing perform against networks with active EDR/NIDS defenses beyond Microsoft Defender?
  • Would retrieval-augmented generation (RAG) with penetration-testing knowledge bases improve performance, or is the LLMs' pre-trained knowledge sufficient?
  • How can the information loss between Planner and Executor be fundamentally solved without abandoning the stateless Executor design that enables run resumability?
  • Can prompt-injection-based defenses effectively trap or mislead LLM-based attackers in real enterprise environments?

Builds On

  • hackingBuddyGPT (Happe and Cito, 2024) - prior single-host Linux privilege escalation prototype
  • pentestGPT (Deng et al., 2024) - Pentest-Task-Tree planning concept
  • ReAct (Yao et al., 2022) - reasoning and acting agent pattern
  • Plan-and-Solve (Wang et al., 2023) - inter-task planning pattern
  • Reflexion (Shinn et al., 2024) - self-reflection agent pattern

Open Source

Yes - https://github.com/andreashappe/cochise

Tags