Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks
Problem & Motivation
Existing LLM-based cyber offense systems are unable to autonomously execute multi-host red team exercises that involve chaining attacks across multiple hosts and network segments, despite the promise of LLMs in simpler CTF-style challenges.
Real-world red team exercises require multi-host network attacks spanning stepping-stone hosts across multiple network segments with different vulnerabilities. These exercises are expensive and require significant expertise. Prior LLM-based offense systems (PentestGPT, CyberSecEval3, CAI) with leading LLMs fail at multi-host challenges due to operating at too low a level of abstraction, pursuing irrelevant tasks, executing incorrect commands, using brittle post-exploitation techniques, and suffering from context bloat.
Threat Model
The attacker has access to an external attacker host (Kali Linux) connected to the target network. The red team exercise targets known vulnerabilities (not zero-days). No active defenders or detection systems are present in the evaluation environments.
Methodology
Incalmo raises the level of abstraction at which LLMs operate for red teaming by explicitly decoupling planning from execution. The LLM serves as a planner that outputs high-level declarative tasks (inspired by the cyber kill chain) rather than low-level shell commands. These tasks are delegated to domain-specific expert task agents that use reliable best practices for execution. Auxiliary services (environment state, attack graph, C&C server) manage context and acquired assets, preventing prompt bloat and enabling structured reasoning about the environment.
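The environment state service can be pictured as a small structured store that the planner queries on demand instead of carrying every discovery in its prompt. The sketch below is an illustrative assumption, not the paper's implementation; the `HostState` schema and `query` interface are invented for the example.

```python
# Hedged sketch of an environment-state-style service: scan findings go
# into a structured store, and the planner retrieves only the hosts
# relevant to its current question (RAG-like), avoiding context bloat.
from dataclasses import dataclass, field

@dataclass
class HostState:
    ip: str
    open_ports: list[int] = field(default_factory=list)
    services: dict[int, str] = field(default_factory=dict)
    compromised: bool = False

class EnvironmentState:
    def __init__(self) -> None:
        self.hosts: dict[str, HostState] = {}

    def record_scan(self, ip: str, ports: dict[int, str]) -> None:
        """Store a scan result (port -> service name) for one host."""
        h = self.hosts.setdefault(ip, HostState(ip))
        h.open_ports = sorted(ports)
        h.services.update(ports)

    def query(self, service: str) -> list[str]:
        """Return only the hosts running a given service."""
        return [h.ip for h in self.hosts.values()
                if service in h.services.values()]

env = EnvironmentState()
env.record_scan("10.0.0.5", {22: "ssh", 80: "http"})
env.record_scan("10.0.1.7", {445: "smb"})
print(env.query("http"))  # a small, targeted answer for the planner
```

The point of the design is that prompt size stays constant as the network grows: the planner asks narrow questions rather than rereading the full scan history.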
Architecture
Two-layer architecture: (1) Planning layer where an LLM plans red team exercises using high-level declarative tasks expressed as Python functions following MITRE ATT&CK and cyber kill chain frameworks; (2) Execution layer with expert red team agents (non-LLM by default) that translate high-level tasks (Scan, LateralMove, EscalatePrivilege, FindInformation, ExfiltrateData) into low-level commands. Three auxiliary services support both layers: environment state service (structured queryable knowledge base akin to RAG), attack graph service (dynamic graph for reasoning about attack paths), and C&C server service (reliable command execution and asset management on compromised hosts).
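The planning abstraction can be sketched as follows: the LLM emits a Python function composed of the five high-level task primitives, and expert agents (stubbed here) handle the low-level execution. The task signatures and return values are illustrative assumptions, not Incalmo's actual API.

```python
# Hypothetical sketch of Incalmo's planning layer: the LLM outputs a plan
# composed of declarative tasks, never raw shell commands. Expert agents
# are replaced by stubs that log which task ran against which target.
executed = []  # (task, target) log, standing in for the expert agents

def scan(network: str) -> list[str]:
    executed.append(("Scan", network))
    return ["10.0.0.5", "10.0.1.7"]           # stubbed discovered hosts

def lateral_move(host: str) -> bool:
    executed.append(("LateralMove", host))
    return True                                # stubbed compromise success

def escalate_privilege(host: str) -> bool:
    executed.append(("EscalatePrivilege", host))
    return True

def find_information(host: str) -> list[str]:
    executed.append(("FindInformation", host))
    return [f"{host}:/srv/db/customers.sql"]   # stubbed critical asset

def exfiltrate_data(path: str) -> None:
    executed.append(("ExfiltrateData", path))

def llm_emitted_plan(entry_network: str) -> list[str]:
    """A plan the LLM planner might emit: compose tasks, not commands."""
    stolen = []
    for host in scan(entry_network):
        if lateral_move(host) and escalate_privilege(host):
            for asset in find_information(host):
                exfiltrate_data(asset)
                stolen.append(asset)
    return stolen

assets = llm_emitted_plan("10.0.0.0/24")
```

Because the plan is ordinary Python over a narrow task vocabulary, the planner cannot emit a malformed shell command, and each task's reliability is the expert agent's responsibility rather than the LLM's.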
LLM Models
Tool Integration
Memory Mechanism
RAG-style: the environment state service acts as a structured, queryable knowledge base
Attack Phases Covered
Evaluation
Incalmo with Sonnet 4 succeeded in 37 out of 40 MHBench environments, while ExpertPromptShell with Sonnet 4 succeeded in only 3 out of 40. Incalmo achieved perfect Reliability (5/5 trials) in 28 of 40 environments. Successful attacks took 12-54 minutes and cost at most $15 in LLM credits. Incalmo with smaller LLMs (e.g., Haiku 3.5) outperformed ExpertPromptShell with larger LLMs, demonstrating that abstractions matter more than model size.
Environment
Metrics
Baseline Comparisons
- ExpertPromptShell (expert-crafted prompt with shell access)
- CyberSecEval3
- CAI
- PentestGPT
- Caldera (non-LLM)
Scale
40 multi-host network environments with 22 to 50 hosts each
Contributions
- Identification of a key gap in LLM-assisted cyber offense: inability to autonomously execute multi-host red team exercises in unforeseen environments
- MHBench, an extensible benchmark with 40 realistic multi-host network environments for evaluating autonomous red teaming
- Systematic failure analysis of state-of-the-art LLM offense systems on multi-host challenges, identifying four failure modes: irrelevant tasks, incorrect execution, brittle post-exploitation, and context bloat
- Incalmo system design that raises the level of abstraction by decoupling planning from execution, using domain-specific expert agents, and introducing auxiliary services for context and asset management
- Demonstration that Incalmo works across 10 different LLMs including small models, showing abstractions matter more than model size
Limitations
- Does not consider active defenders or detection/blocking mechanisms in the environment
- Only considers known vulnerabilities, not zero-day exploits
- Fails in 3 of the 40 MHBench environments, particularly those requiring both external and internal scans from different network vantage points
- LLM planner sometimes stops after acquiring a single critical asset rather than exploring all attack paths for complete TotalAcquisition
- Enterprise network details are sensitive and hard to obtain, so MHBench environments are best-effort approximations of real networks
- Potential memorization concern: LLM success on CTF challenges may partly be due to training data exposure
Research Gaps
- Extending autonomous red teaming to handle zero-day vulnerabilities with advanced task-specific agents
- Adding realistic active defenders to evaluate evasion capabilities of autonomous red teams
- Improving attack graph service to reason about network segmentation and access control (e.g., inter-segment firewalls)
- Training and fine-tuning LLMs specifically for multi-host red teaming to improve coverage
- Scaling to real enterprise networks with proprietary configurations
- Understanding how LLM-based red teams perform when defenders are actively monitoring and blocking
Novel Techniques
- Decoupling LLM planning from execution via high-level declarative tasks modeled on cyber kill chain stages
- Environment state service as a structured queryable knowledge base (RAG-like) to replace context bloat in LLM prompts
- Dynamic attack graph service that reasons about incomplete and evolving attacker knowledge to recommend next actions
- High-level C&C server abstraction that handles low-level communication (proxying, beaconing) internally
- Planning abstraction where LLMs output Python functions composing high-level tasks rather than shell commands
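The dynamic attack graph idea can be illustrated with a minimal sketch: the graph tracks only what the attacker has learned so far, and the "next action" recommendation is the frontier of reachable-but-uncompromised hosts. The data model and recommendation rule below are assumptions for illustration, not the paper's algorithm.

```python
# Illustrative sketch of an attack-graph-style service over incomplete,
# evolving attacker knowledge. Edges are added as scans from controlled
# hosts reveal reachability; the frontier grows after each pivot.
from collections import defaultdict

class AttackGraph:
    def __init__(self) -> None:
        self.reachable = defaultdict(set)  # host -> hosts it can reach
        self.compromised = set()

    def observe_edge(self, src: str, dst: str) -> None:
        """Record reachability learned from a scan on a controlled host."""
        self.reachable[src].add(dst)

    def mark_compromised(self, host: str) -> None:
        self.compromised.add(host)

    def next_targets(self) -> list[str]:
        """Frontier: hosts reachable from compromised hosts, not yet owned."""
        frontier = set()
        for src in self.compromised:
            frontier |= self.reachable[src] - self.compromised
        return sorted(frontier)

g = AttackGraph()
g.mark_compromised("attacker")
g.observe_edge("attacker", "web01")
g.mark_compromised("web01")
g.observe_edge("web01", "db01")  # a new segment visible after the pivot
print(g.next_targets())
```

A stepping-stone attack falls out naturally: `db01` only enters the frontier once `web01` is compromised and scanned, mirroring how the real service reasons over knowledge the attacker did not start with.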
Open Questions
- Can LLM-based red teams evade realistic detection and response systems?
- How to handle zero-day vulnerability discovery and exploitation in autonomous red teaming?
- What is the right balance between LLM-based and non-LLM agents for task execution?
- How to improve LLM planners to explore all attack paths rather than stopping after partial success?
- How will LLM performance on MHBench change as models are trained on its data?
Builds On
- PentestGPT
- CyberSecEval3
- Cybench
- CAI
- MITRE ATT&CK framework
- Cyber kill chain
- Caldera
- Lore (red team emulation tool)
Open Source
Yes - https://github.com/bsinger98/Incalmo