Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios
Problem & Motivation
Most existing evaluations of AI cyber capability rely on isolated CTF challenges or question-answer benchmarks, which do not capture the autonomy, multi-step reasoning, state tracking, and error recovery required to navigate large-scale network environments that more closely resemble real-world offensive operations.
Understanding whether frontier AI models can reliably execute multi-step attack chains with limited human input is critical for cybersecurity and AI governance. Real-world signals already show AI being used for offensive cyber operations (e.g., Anthropic reported a state-sponsored campaign where AI autonomously executed the majority of intrusion steps), yet rigorous longitudinal measurement of this capability across models and compute budgets has been lacking.
Threat Model
A fully autonomous AI agent given access to an attacker machine (Kali Linux) with standard penetration testing tools, tasked with completing a multi-step attack chain across a simulated network environment. No active defenders or detection mechanisms impede the agent. The agent operates with a ReAct loop and a fixed token budget.
Methodology
The authors evaluate seven frontier AI models released over an 18-month period (August 2024 to February 2026) on two purpose-built cyber ranges: 'The Last Ones' (a 32-step corporate network attack) and 'Cooling Tower' (a 7-step industrial control system attack). Models are tested at varying inference-time compute budgets (10M and 100M tokens) to measure both scaling behavior and cross-generational improvement. Performance is measured by the number of sequential attack steps completed autonomously.
Architecture
Single AI agent running on Kali Linux with access to Bash commands, Python code, and the Mythic C2 framework. The agent follows a ReAct (Reason + Act) loop: reasoning about the next step, executing an action, and observing the result before continuing. Context compaction is used when the conversation approaches 80% of the context window capacity.
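The agent loop described above can be sketched as follows. This is a minimal illustration, not the authors' harness: `call_model` and `run_command` are hypothetical stand-ins for the LLM API and the Kali shell executor, and the 200K-token context limit is an assumed figure (only the 80% compaction threshold comes from the paper).

```python
CONTEXT_LIMIT = 200_000   # tokens (assumed limit, not from the paper)
COMPACT_AT = 0.8          # compaction threshold reported by the authors

def token_count(messages):
    # Crude proxy: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages, call_model):
    # The same model summarizes the full conversation, retaining
    # task-relevant state (credentials, sessions, remaining goals),
    # and the run continues in a fresh context window.
    summary = call_model(
        [{"role": "user",
          "content": "Summarize this attack session, preserving all "
                     "task-relevant facts:\n"
                     + "\n".join(m["content"] for m in messages)}])
    return [{"role": "system", "content": summary}]

def react_loop(task, call_model, run_command, max_steps=50):
    """ReAct cycle: reason + choose an action, act, observe."""
    messages = [{"role": "system", "content": task}]
    for _ in range(max_steps):
        if token_count(messages) > COMPACT_AT * CONTEXT_LIMIT:
            messages = compact(messages, call_model)
        reply = call_model(messages)            # reason + pick action
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:                     # agent signals completion
            break
        observation = run_command(reply)        # act
        messages.append({"role": "user", "content": observation})  # observe
    return messages
```

The compaction step is what lets a single run spend far more tokens than one context window can hold.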
LLM Models
Tool Integration
Memory Mechanism
conversation-history
Attack Phases Covered
Evaluation
Model performance scales log-linearly with inference-time compute, with no observed plateau up to 100M tokens (gains of up to 59%). Each successive model generation outperforms its predecessor at fixed token budgets: at 10M tokens on The Last Ones, average steps rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single Opus 4.6 run completed 22 of 32 steps (reaching milestone 6), corresponding to roughly 6 of the estimated 14 hours a human expert would need. On Cooling Tower (ICS range), performance remains limited, with the best model (Opus 4.6) averaging only 1.4 steps out of 7 at 100M tokens.
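To make the log-linear claim concrete, the sketch below fits steps = a + b·log10(tokens) to two points and extrapolates. The 9.8-step figure at 10M tokens is from the results above; treating the 59% gain as the 10M→100M delta is an illustrative assumption, and any extrapolation past 100M assumes the trend continues, which the paper does not establish.

```python
import math

def fit_log_linear(points):
    # points: [(tokens, steps), ...]; least-squares fit of
    # steps = a + b * log10(tokens).
    xs = [math.log10(t) for t, _ in points]
    ys = [s for _, s in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def predict(a, b, tokens):
    return a + b * math.log10(tokens)

# Hypothetical reading of the results: 9.8 steps at 10M tokens,
# 59% more at 100M tokens.
a, b = fit_log_linear([(10_000_000, 9.8),
                       (100_000_000, 9.8 * 1.59)])
# predict(a, b, 1_000_000_000) extrapolates to a 1B-token budget,
# valid only if no plateau appears beyond the measured 100M.
```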
Environment
Metrics
Baseline Comparisons
- GPT-4o (August 2024 baseline)
- Cross-model comparisons across seven frontier models
- Estimated human expert completion time (14 hours for The Last Ones, 15 hours for Cooling Tower)
Scale
Two cyber ranges: 'The Last Ones' with 32 steps across 9 milestones on a multi-domain corporate network, and 'Cooling Tower' with 7 steps on a simulated industrial control system
Contributions
- Introduce two purpose-built cyber ranges ('The Last Ones' and 'Cooling Tower') that test multi-step, multi-host attack chain execution as a complement to isolated CTF-style benchmarks
- Provide the first longitudinal measurement of autonomous cyber capability across seven frontier models spanning 18 months (August 2024 to February 2026)
- Demonstrate that model performance scales log-linearly with inference-time compute (token budget) with no observed plateau up to 100M tokens
- Show that each successive model generation consistently outperforms its predecessor at fixed token budgets, with compounding improvements in both token efficiency and capability depth
- Identify key bottlenecks in multi-step attack chains: NTLM relay attacks and CI/CD pipeline manipulation remain challenging even for the most capable models
Limitations
- No active defenders: ranges lack security teams, endpoint detection, or adaptive defenses that would be present in real networks
- Detections not penalized: security alerts are logged but not incorporated into performance scores, meaning a noisy agent is not disadvantaged
- Vulnerability density differs from reality: ranges are designed to have vulnerabilities, unlike real environments
- Lower artifact density than real environments: fewer nodes, services, and files than typical production networks
- Limited token budgets tested (10M and 100M); higher budgets like 1B tokens would likely yield further improvements
- Minimal scaffolding: standard ReAct agent without specialized cyber scaffolding, representing a lower bound on capability
- No tailored tooling: standard Kali Linux tools rather than model-optimized tool sets
- No human-AI teaming: models' remaining failure modes (knowledge gaps, information tracking errors, long action sequences) are areas where human intervention could yield significant uplift
- Small sample sizes (N=5) for 100M-token runs limit statistical power; high variance between runs
- Only two cyber ranges, which is insufficient to draw general conclusions across all domains and vulnerability types
Research Gaps
- Containerized evaluation infrastructure with sufficient fidelity for cyber ranges to reduce engineering overhead compared to VM-based approaches
- Broader suite of ranges spanning more domains, vulnerability types, and difficulty levels for finer-grained capability tracking
- Validation of synthetic range results against real-world penetration testing engagements
- Incorporating operational security and stealth into capability metrics (measuring alerts per step, testing against active defenses)
- Understanding the optimal strategy for allocating fixed compute budgets: fewer long runs with context compaction vs. many short independent attempts
- Human-AI teaming evaluation: measuring the most operationally relevant threat model where a human operator uses an AI agent and intervenes at specific bottlenecks
- Predictive modeling of agent performance beyond confounding factors like overall attack length
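The budget-allocation question above can be framed with a toy independence model: split a budget B into k = B/c short runs that each succeed with probability p, and compare P(at least one success) = 1 − (1 − p)^k against a single long run. All numbers below are hypothetical, and real runs are neither independent nor equally likely to succeed, so this is only a framing device.

```python
def p_any_success(p_short: float, budget: int, cost_per_run: int) -> float:
    # Probability that at least one of k independent short runs
    # succeeds, where k is how many runs the budget affords.
    k = budget // cost_per_run
    return 1.0 - (1.0 - p_short) ** k

# Hypothetical example: a 100M-token budget split into 10M-token
# attempts that each succeed 10% of the time.
p_many_short = p_any_success(p_short=0.10,
                             budget=100_000_000,
                             cost_per_run=10_000_000)
# 1 - 0.9**10 ~= 0.65: ten weak attempts beat one long run unless
# the long run's own success probability exceeds ~0.65.
```

In this framing, context compaction matters because it changes how success probability grows with run length, which the independence model deliberately ignores.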
Novel Techniques
- Context compaction for long-horizon agent runs: when context window reaches 80% capacity, the full conversation is summarized by the same model to retain task-relevant information and continue in a fresh context window
- Token efficiency vs. capability depth framework for analyzing model performance on long-form cyber tasks
- Measuring cyber capability through sequential attack step completion on multi-host network ranges rather than isolated CTF challenges
- Longitudinal capability tracking across model generations at fixed compute budgets to identify compounding improvement trends
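The sequential-step metric above can be sketched as counting the completed prefix of the attack chain, assuming a strictly linear chain in which each step gates the next; the milestone boundaries below are hypothetical positions, not the paper's actual groupings.

```python
def steps_completed(step_results: list[bool]) -> int:
    # Score = length of the consecutive run of successes from the
    # start of the chain (later successes after a failure don't count
    # under the linear-dependency assumption).
    score = 0
    for ok in step_results:
        if not ok:
            break
        score += 1
    return score

def milestone_reached(score: int, boundaries: list[int]) -> int:
    # boundaries[i] = 1-based index of the last step in milestone i+1.
    return sum(1 for b in boundaries if score >= b)

# A run that clears the first 22 of 32 steps:
results = [True] * 22 + [False] * 10
score = steps_completed(results)   # 22
```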
Open Questions
- Will the log-linear scaling of performance with inference-time compute continue beyond 100M tokens, or will it eventually plateau?
- How would active defensive measures (EDR, SOC teams, adaptive response) affect agent performance?
- What is the optimal human-AI teaming configuration for maximizing offensive cyber capability?
- Can specialized cyber scaffolding, fine-tuning, or tailored tooling substantially improve upon the baseline ReAct agent performance?
- How do these synthetic range results translate to real-world penetration testing effectiveness?
- Will ICS/OT attack capabilities improve at the same rate as corporate network attack capabilities with future model generations?
Builds On
- ReAct (Yao et al., 2023)
- Mythic C2 framework
- Inspect AI framework (AI Security Institute, UK, 2024)
- METR evaluations (Kinniment et al., 2024)
- METR time-horizon framework (Kwa et al., 2025)
Open Source
No