
CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

2025 | arXiv (preprint)

arXiv:2602.08023

Problem & Motivation

Existing LLM-based offensive agent evaluations operate in closed-world settings with predefined goals and binary success criteria, using isolated single-service environments that fail to capture realistic multi-target attack scenarios involving reconnaissance, target selection, and exploitation under uncertainty.

Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Current benchmarks (e.g., NYU CTF Bench, Cybench, CTFTiny) only test agents against single vulnerable services in isolation, preventing evaluation of target discrimination, false-positive handling, prioritization, and multi-service coordination. CyberExplorer fills this gap by providing an open-environment benchmark with noisy, multi-target settings and fine-grained behavioral metrics beyond binary flag recovery.

Threat Model

Agents are given only an address space containing a virtual machine with multiple services. They have no prior knowledge of vulnerability locations, challenge boundaries, or which services are exploitable. The attacker operates from a sandboxed Docker container provisioned with offensive security tools, interacting with services through externally reachable network ports.

Methodology

CyberExplorer introduces (1) an Open Environment Offensive Security Task where 40 web-based vulnerable services run concurrently on a single VM alongside non-vulnerable noise services, and (2) a reactive multi-agent framework that performs reconnaissance to discover entry points, then dispatches parallel agent teams for exploration and exploitation. The framework uses agentic chaining with knowledge hand-off, supervisor-guided exploration, and critic-based trajectory correction. Agents operate with fixed per-agent budgets and self-reflection mechanisms at configurable budget thresholds.

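
The recon-then-dispatch loop described above can be sketched as follows. The chain length (up to 7 agents) and per-agent budget ($0.30) come from the paper; `explore`, the hand-off keys, and the return shape are hypothetical stand-ins for one short-lived agent's run:

```python
from collections import deque

# Hypothetical sketch of the recon -> dispatch -> chained-agents flow.
# Only the chain length (7) and per-agent budget ($0.30) are from the
# paper; names and data shapes are illustrative.

MAX_AGENTS_PER_CHAIN = 7
PER_AGENT_BUDGET = 0.30  # USD

def dispatch(entry_points, explore):
    """Explore each recon-discovered entry point with a chain of
    short-lived agents, passing a directive between successors."""
    queue = deque(entry_points)
    results = {}
    while queue:
        ep = queue.popleft()
        handoff = None  # supervisor-synthesized directive carried forward
        for _ in range(MAX_AGENTS_PER_CHAIN):
            outcome = explore(ep, handoff, PER_AGENT_BUDGET)
            if outcome.get("flag"):
                results[ep] = outcome["flag"]
                break
            handoff = outcome.get("directive")  # knowledge hand-off
        else:
            results[ep] = None  # chain exhausted without a flag
    return results
```

Each `explore` call stands in for one agent's full interaction loop; in the paper the directive is synthesized by the global Supervisor from the whole exploration history rather than by the departing agent itself.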
Architecture

The architecture consists of a Recon Agent that performs network scanning to build an attack surface map, followed by a Dispatcher that queues discovered entry points for parallel exploration. Each entry point is explored by a chain of short-lived agents (up to 7 per team), each with a $0.30 budget. A global Supervisor synthesizes exploration history and generates best-hypothesis task directives for successor agents. A Critic is introduced after the third agent in a chain, capable of injecting interventional advice into the current agent's conversation. A Decision agent node coexists with each agent for self-reflection at 50% and 80% budget thresholds, enabling budget extensions (up to 4 times). Dead-end heuristics terminate unproductive entry point exploration early.

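
The self-reflection budget mechanism above (reflection at 50% and 80% of budget, at most 4 extensions) could look roughly like this; the extension increment, re-arming behavior, and class interface are assumptions not specified in the paper:

```python
# Illustrative budget controller for one agent: a Decision node fires
# self-reflection when spend crosses 50% and 80% of the current budget,
# and a positive reflection can extend the budget (at most 4 times).
# Extension size and threshold re-arming are assumptions.

class BudgetController:
    THRESHOLDS = (0.5, 0.8)
    MAX_EXTENSIONS = 4

    def __init__(self, budget=0.30):
        self.budget = budget
        self.spent = 0.0
        self.extensions = 0
        self._fired = set()

    def record(self, cost, reflect):
        """Account for one interaction round; returns False once the
        agent has exhausted its (possibly extended) budget."""
        self.spent += cost
        for t in self.THRESHOLDS:
            if t not in self._fired and self.spent >= t * self.budget:
                self._fired.add(t)
                if reflect() and self.extensions < self.MAX_EXTENSIONS:
                    self.extensions += 1
                    self.budget += 0.30   # assumed extension increment
                    self._fired.clear()   # thresholds re-arm on the new budget
        return self.spent < self.budget
```

Here `reflect` plays the role of the co-located Decision node: it judges whether continuing looks worthwhile before any extension is granted.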
LLM Models

GPT-5.2
Claude Opus 4.5
Gemini 3 Pro
DeepSeek V3
Qwen 3

Tool Integration

network-reconnaissance-utilities
web-application-fuzzers
cryptographic-analysis-tools
scripting-environments
custom-payload-development

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

Claude Opus 4.5 and Gemini 3 Pro achieved the highest recall (22.50% each, 9/40 correct flags), with Opus 4.5 attaining 90% precision. GPT-5.2 achieved 60% precision with 6 correct flags. Qwen 3 and DeepSeek V3 showed lower precision (17.65% and 62.5%, respectively), with more false positives and prolonged exploration. Dead-end trajectories consume 2.6x to 4.6x more interaction rounds and up to 5.1x higher cost than successful ones, revealing that agent escalation under uncertainty amplifies inefficiency rather than improving convergence.

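
The headline figures can be cross-checked from the reported counts. Note that the submission totals used here are inferred from the stated precision (e.g. 9 correct flags at 90% precision implies 10 submissions), not given directly:

```python
# Cross-check of the reported numbers against the 40-flag total.
# Submission counts are inferred from the stated precision (assumption):
# 9 correct at 90% precision implies 10 submitted flags.

TOTAL_FLAGS = 40

def precision_recall(correct, submitted, total=TOTAL_FLAGS):
    """precision = correct / submitted; recall = correct / total flags."""
    return correct / submitted, correct / total

opus = precision_recall(correct=9, submitted=10)   # Claude Opus 4.5: (0.9, 0.225)
gpt = precision_recall(correct=6, submitted=10)    # GPT-5.2: 60% precision
```

The gap between the two ratios is what separates the models here: recall is capped by how few of the 40 flags any model recovers, while precision reflects how often a submitted flag is a false positive.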
Environment

custom-lab

Metrics

success-rate
precision
recall
time-to-first-flag
average-cost-per-challenge
average-interaction-rounds
agent-count
dead-end-persistence-ratio
vulnerability-discovery-signal
agent-inflation-factor

Baseline Comparisons

  • GPT-5.2
  • Claude Opus 4.5
  • Gemini 3 Pro
  • DeepSeek V3
  • Qwen 3

Scale

40 web-based CTF challenges deployed on a single virtual machine

Contributions

  • Introduces the Open Environment Offensive Security Task, a new evaluation paradigm with noisy multi-target environments where agents must perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations
  • Proposes an asynchronous multi-agent architecture featuring parallel entry-point exploration, supervisor-guided knowledge hand-off between sequential agents, and critic intervention for trajectory correction
  • Provides comprehensive behavioral evaluation of five state-of-the-art LLMs across correctness, efficiency, coordination dynamics, failure modes, and vulnerability discovery signals, going beyond binary success metrics
  • Demonstrates through hyperparameter sensitivity analysis that increasing budget or agent limits does not yield monotonic performance gains, revealing fundamental limitations of budget-driven exploration in agentic systems

Limitations

  • Evaluates only web-based vulnerabilities; does not cover client-side attacks, kernel vulnerabilities, post-exploitation movement, or long-lived campaigns
  • Agent performance is evaluated under fixed interaction and cost budgets, which may influence persistence and termination behavior differently across models with different API pricing
  • Behavioral analysis relies on observable interaction traces rather than internal model reasoning, limiting causal attribution of failures
  • Automatically generated vulnerability signals reflect reconnaissance capability rather than definitive exploitation success
  • Findings are web-centric and reflect web exploitation reasoning rather than full attacker capability

Research Gaps

  • No monotonic relationship between increased computational resources (budget/agents) and improved agentic performance, suggesting fundamental limitations in current agent escalation strategies
  • Agent escalation under uncertainty is primarily reactive (spawning new agents) rather than corrective (refining hypotheses), indicating a need for better uncertainty-handling mechanisms
  • Reasoning continuity is fragmented across sequential agents, with dead-end trajectories characterized by many short-lived agents resetting context rather than building on prior knowledge
  • Current benchmarks and metrics do not capture coordination quality, failure persistence, or vulnerability discovery signals that are critical for realistic offensive security evaluation
  • Cross-service attack reasoning (e.g., FTP-to-HTTP pivots) remains challenging for most models

Novel Techniques

  • Open-environment offensive security task with concurrent noisy multi-service deployment on a single VM, requiring target discrimination and false-positive handling
  • Agentic chaining with knowledge hand-off via supervisor-synthesized task directives, passing exploration history and failed approaches between sequential agents
  • Critic-based interventional trajectory correction that injects advice directly into a running agent's conversation when progress stalls
  • Self-reflective budget management with configurable reflection thresholds (50%/80%) and up to 4 budget extensions based on reflection quality
  • Dead-end heuristic for early termination of low-yield entry points based on absence of medium-or-higher severity findings
  • Vulnerability discovery signal analysis measuring security intelligence extracted even from failed exploitation attempts
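
The dead-end heuristic above (terminate an entry point whose exploration yields no medium-or-higher severity finding) might be sketched as a severity-floor check; the severity scale and finding shape are assumptions:

```python
# Sketch of the dead-end heuristic: stop spawning successor agents for an
# entry point once its accumulated findings include nothing of
# medium-or-higher severity. Scale and record shape are assumptions.

SEVERITY = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

def is_dead_end(findings, floor="medium"):
    """True when no finding reaches the severity floor."""
    return not any(SEVERITY[f["severity"]] >= SEVERITY[floor]
                   for f in findings)
```

In the framework such a check would run between agents in a chain, letting the Dispatcher reclaim budget for more promising entry points.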

Open Questions

  • How can agent escalation be made corrective rather than merely reactive to reduce wasted computation?
  • What mechanisms can improve reasoning continuity across sequential agents without unbounded context growth?
  • Can adaptive budget allocation strategies outperform fixed per-agent budgets?
  • How do these findings extend to non-web attack surfaces such as network, kernel, or Active Directory environments?
  • What is the role of model-specific reasoning styles (depth-first vs breadth-first) in determining optimal agent architecture design?

Builds On

  • NYU CTF Bench
  • Cybench
  • CTFTiny
  • PentestGPT
  • EnIGMA
  • HackSynth
  • D-CIPHER
  • CRAKEN

Open Source

No
