EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities
Problem & Motivation
Existing LM agents for cybersecurity tasks are limited in scope and capability, particularly because they cannot execute interactive command-line utilities (e.g., debuggers, server connection tools) that are essential for solving CTF challenges. They also struggle with long program outputs that exceed context windows.
CTF challenges often require interactive tools like debuggers (gdb) and remote server connections (pwntools/netcat), which current LM agents cannot natively support. Additionally, agents cannot adapt to new strategies after initial failures, and they lack suitable interfaces tailored to cybersecurity. Closing this gap could substantially improve autonomous vulnerability discovery.
Threat Model
The agent operates in a Dockerized environment with access to pre-installed cybersecurity tools. It receives a CTF challenge description and iterates through actions until it finds the flag or exhausts its budget ($3 per instance).
Methodology
EnIGMA is built on top of SWE-agent and extends it for cybersecurity by introducing two key innovations: (1) Interactive Agent Tools (IATs) that enable LM agents to use interactive programs like debuggers (gdb) and server connection tools (pwntools) through non-blocking parallel REPL sessions, and (2) summarizers (both a simple line-count-based summarizer and an LM-based summarizer) that condense long command outputs to fit within context windows. The agent also uses in-context demonstrations and guidelines tailored to each CTF category.
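The simpler of the two summarizers, the line-count-based one, can be sketched as a head-and-tail truncation over long command output. This is an illustrative simplification, not the SWE-agent/EnIGMA implementation; the `MAX_LINES` threshold and the omission marker are assumptions.

```python
# Sketch of a line-count-based output summarizer. The cutoff of 100 lines
# is an assumed value; EnIGMA's actual threshold may differ.
MAX_LINES = 100

def summarize_output(output: str, max_lines: int = MAX_LINES) -> str:
    """Truncate long command output, keeping the head and tail lines."""
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output
    keep = max_lines // 2
    omitted = len(lines) - 2 * keep
    return "\n".join(
        lines[:keep]
        + [f"... <{omitted} lines omitted> ..."]
        + lines[-keep:]
    )
```

The LM-based summarizer replaces the middle marker with a secondary model call that condenses the omitted span, conditioned on the challenge description.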
Architecture
EnIGMA extends SWE-agent's Agent-Computer Interface (ACI) with cybersecurity-specific commands, interactive agent tools (debugger and server connection), and output summarizers. The agent operates in a thought-action-observation loop (ReAct) within a Docker container, maintaining non-blocking interactive sessions alongside the main shell.
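The non-blocking interactive sessions can be sketched as a subprocess kept alive between agent turns, with a reader thread draining its output so the main loop never blocks. This is a minimal illustration in the spirit of the Interactive Agent Tools, not the actual SWE-agent implementation; class and method names are hypothetical.

```python
# Minimal sketch of a non-blocking interactive session: the subprocess
# (e.g. gdb or a pwntools connection) persists across agent turns, and
# its output is drained by a background thread.
import queue
import subprocess
import threading

class InteractiveSession:
    def __init__(self, cmd: list):
        self.proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1,
        )
        self._out = queue.Queue()
        # Reader thread keeps draining stdout so sends never deadlock.
        threading.Thread(target=self._pump, daemon=True).start()

    def _pump(self):
        for line in self.proc.stdout:
            self._out.put(line)

    def send(self, command: str):
        """Forward one agent action to the running interactive program."""
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()

    def read_output(self, timeout: float = 1.0) -> str:
        """Collect whatever output has arrived, without blocking forever."""
        lines = []
        try:
            lines.append(self._out.get(timeout=timeout))
            while True:
                lines.append(self._out.get_nowait())
        except queue.Empty:
            pass
        return "".join(lines)

    def close(self):
        self.proc.terminate()
```

Several such sessions can run in parallel alongside the main shell, mimicking a human analyst's multiple terminal windows.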
LLM Models
Tool Integration
Memory Mechanism
Conversation history
Attack Phases Covered
Evaluation
EnIGMA achieves 13.5% on NYU CTF (more than 3x the previous best of 4%), 67% on InterCode-CTF (29 percentage points above the previous best), 20% on CyBench (surpassing the previous best of 17.5%), and 26% on HTB. Interactive tools shorten solutions by 22.8% (11.5 vs. 14.9 turns on average). The LM summarizer outperforms running with no summarizer, and demonstrations help across most categories.
Environment
Metrics
Baseline Comparisons
- NYU agent (Shao et al., 2024b)
- CyBench agent (Zhang et al., 2024)
- InterCode-CTF agent (Yang et al., 2023b)
- Google DeepMind agent (Team et al., 2024)
Scale
390 CTF challenges across four benchmarks (200 NYU CTF, 100 InterCode-CTF, 40 CyBench, 50 HackTheBox)
Contributions
- Interactive Agent Tools (IATs) that enable LM agents to use interactive programs (debugger via gdb, server connections via pwntools) through non-blocking parallel REPL sessions
- A new development set of 55 CTF challenges from CSAW competitions (2013-2016) for facilitating cybersecurity agent development
- Comprehensive quantitative and qualitative analysis on 390 challenges across four benchmarks, revealing the soliloquizing phenomenon where models self-generate hallucinated observations without environment interaction
Limitations
- Models rarely recover if they do not succeed quickly; most solved challenges are completed within the first 20 steps
- Despite having a give-up action available, models keep trying until they exhaust the maximum cost budget
- Limited ability to creatively explore multiple approaches when problem-solving
- Demonstrations and guidelines are not always helpful and can surprisingly hurt performance for certain categories (web, misc)
- Interactive tools do not uniformly benefit all categories; web category actually shows increased performance when IATs are ablated, suggesting need for better web navigation interfaces
- Soliloquizing phenomenon negatively correlates with challenge success (-26% correlation), indicating hallucinated observations degrade accuracy
- Overall solve rates remain low (e.g., 13.5% on NYU CTF), indicating significant room for improvement
Research Gaps
- Need for human-in-the-loop approaches to amplify agent performance beyond fully autonomous operation
- Lack of proper interactive web navigation tools for LM agents in cybersecurity contexts
- Understanding and mitigating the soliloquizing phenomenon where LMs hallucinate environment observations
- Need for better cost-efficiency strategies, such as capping maximum cost per attempt based on challenge category
- Data leakage in CTF benchmarks remains a challenge for fair evaluation of LM cybersecurity capabilities
- Agents cannot adapt strategies well after initial failures, limiting their ability to solve harder multi-step challenges
Novel Techniques
- Interactive Agent Tools (IATs): non-blocking parallel REPL sessions that allow agents to run interactive programs (debugger, server connections) alongside the main shell, mimicking how humans use multiple terminal windows
- LM summarizer: using a secondary LM call to condense long command outputs into actionable summaries with context from the challenge description
- Soliloquizing detection: identifying when LMs self-generate hallucinated observation strings in their responses without actual environment interaction
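Soliloquizing detection can be sketched as a scan of the model's response for environment-only marker strings that the model should never produce itself. The marker patterns below are assumptions for illustration; the paper's actual detection heuristics may differ.

```python
# Hedged sketch of soliloquizing detection: flag responses in which the
# model fabricates an observation block instead of waiting for the real
# environment output. The marker strings here are assumed examples.
import re

# Prefixes that should only ever appear in environment-produced messages,
# never inside a single model response.
OBSERVATION_MARKERS = re.compile(
    r"^\s*(Observation:|OBSERVATION:)",
    re.MULTILINE,
)

def is_soliloquizing(model_response: str) -> bool:
    """Return True if the response embeds a hallucinated observation."""
    return bool(OBSERVATION_MARKERS.search(model_response))
```

Flagged turns can then be correlated with challenge outcomes, which is how the negative correlation with success reported above would be measured.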
Open Questions
- How can agents be made to recover and try genuinely different strategies after initial failures?
- What causes soliloquizing and can it be suppressed without degrading performance?
- How much of reported benchmark performance is attributable to data leakage from CTF solutions in training data?
- Can human-in-the-loop approaches substantially boost performance beyond fully autonomous agents?
- How to design effective interactive web navigation tools for cybersecurity agents?
Builds On
- SWE-agent (Yang et al., 2024)
- NYU CTF Bench (Shao et al., 2024b)
- ReAct (Yao et al., 2023b)
- InterCode-CTF (Yang et al., 2023a)
Open Source
Yes - https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development