CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution
Problem & Motivation
LLM agents for cybersecurity have two key limitations: they cannot access the latest cybersecurity expertise beyond their training data cutoff, and they struggle to integrate new knowledge into complex, multi-stage task planning for vulnerability discovery and exploitation.
While LLM agents have demonstrated cybersecurity capabilities in CTF competitions, their performance is limited by stale training data and an inability to incorporate domain-specific knowledge such as recent threats, vulnerabilities, and exploits. Knowledge-based approaches that embed technical understanding into task-solving automation can overcome these limitations, filling a gap left by prior purely agentic systems such as D-CIPHER and EnIGMA.
Threat Model
The system assumes access to a Docker environment with shell access and network connectivity to CTF challenge servers. The agent operates autonomously within a cost budget ($3.00 per challenge). The paper acknowledges prompt injection risks when RAG is combined with external knowledge corpora.
Methodology
CRAKEN is a knowledge-based LLM agent framework that enhances cybersecurity capabilities through three core mechanisms: (1) contextual decomposition of task-critical information from lengthy agent conversations into effective queries, (2) iterative self-reflected knowledge retrieval using Self-RAG and Graph-RAG pipelines with hallucination grading and query rewriting, and (3) knowledge-hint injection that transforms retrieved insights into adaptive attack strategies provided to executor agents. The framework builds on D-CIPHER's planner-executor multi-agent architecture and adds a modular retrieval system that can be integrated with any agentic system.
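The first mechanism, contextual decomposition, can be sketched as follows. Instead of embedding an entire agent transcript, only task-critical pieces are condensed into short retrieval queries; the selection heuristic below (challenge statement plus the most recent tool output) is a hypothetical stand-in for CRAKEN's LLM-driven decomposition step, not the paper's actual prompt logic.

```python
# Contextual decomposition: condense a lengthy agent conversation into
# a few focused retrieval queries instead of one oversized prompt.
# The heuristic below (task statement plus recent tool output) is a
# hypothetical stand-in for the LLM-based step described above.
def decompose_context(messages, max_queries=3):
    """messages: list of {"role": ..., "content": ...} dicts."""
    task = next((m["content"] for m in messages if m["role"] == "system"), "")
    # Keep only the two most recent tool observations as focusing context.
    recent = [m["content"] for m in messages if m["role"] == "tool"][-2:]
    queries = [task] + [f"{task}: {obs}" for obs in recent]
    return [q for q in queries if q][:max_queries]

conversation = [
    {"role": "system", "content": "pwn challenge: heap exploitation"},
    {"role": "assistant", "content": "Running the binary under gdb..."},
    {"role": "tool", "content": "free(): double free detected in tcache 2"},
]
queries = decompose_context(conversation)
# queries[0] is the task itself; queries[1] pairs it with the crash log.
```

Each resulting query is short enough to embed well, while still carrying the task framing that a bare error string would lose.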
Architecture
CRAKEN consists of two main components: (1) a Planner-Executor multi-agent system based on D-CIPHER's ReWOO-inspired architecture, where a planner decomposes CTF challenges into subtasks and delegates them to specialized executors, and (2) an iterative retrieval system incorporating Self-RAG and Graph-RAG. The retrieval system has six modules: Retriever (document retrieval from structured knowledge database), RelevanceGrader (evaluates document relevance), Generator (produces knowledge hints from context), HallucinationGrader (ensures grounding and factual correctness), Rewriter (refines queries for improved retrieval), and SolvedGrader (determines if the output sufficiently answers the query). The retrieval is triggered during task delegation and injects knowledge hints for executors. An auto-prompter agent from D-CIPHER is also incorporated.
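The six-module control flow can be sketched as an iterate-grade-rewrite loop. Every grader, generator, and rewriter passed in below is a hypothetical stub standing in for an LLM call; only the loop structure mirrors the architecture described above.

```python
# Control-flow sketch of the six-module retrieval loop: Retriever,
# RelevanceGrader, Generator, HallucinationGrader, Rewriter, and
# SolvedGrader. Each callable is a hypothetical stub for an LLM call.
def iterative_retrieve(query, retrieve, grade_relevance, generate,
                       grade_grounding, grade_solved, rewrite,
                       max_rounds=3):
    for _ in range(max_rounds):
        docs = [d for d in retrieve(query) if grade_relevance(query, d)]
        if docs:
            hint = generate(query, docs)
            # Accept only grounded hints that actually answer the query.
            if grade_grounding(hint, docs) and grade_solved(query, hint):
                return hint
        query = rewrite(query)  # refine the query and retry
    return None  # retrieval budget exhausted

hint = iterative_retrieve(
    "format string exploit",
    retrieve=lambda q: (["use %n to write the GOT entry"]
                        if "format" in q else []),
    grade_relevance=lambda q, d: True,
    generate=lambda q, docs: docs[0],
    grade_grounding=lambda h, docs: h in docs,
    grade_solved=lambda q, h: True,
    rewrite=lambda q: q,
)
```

The two grading gates (grounding and sufficiency) are what distinguish this loop from plain RAG: a hint that fails either check sends the loop back through the Rewriter rather than reaching the executor.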
LLM Models
Tool Integration
Memory Mechanism
RAG
Attack Phases Covered
Evaluation
CRAKEN with Graph-RAG achieved 22% accuracy on NYU CTF Bench, outperforming D-CIPHER (19%) by 3 percentage points and setting a new state of the art at an average cost increase of $0.34 per challenge. Claude 3.5 Sonnet achieved the highest solve rate, 21%, in the default CRAKEN configuration at $0.68 per challenge (vs. 19% at $0.52 for D-CIPHER). On MITRE ATT&CK techniques, CRAKEN solves 25-30% more techniques than prior work, covering 34 techniques in total with Claude 3.5 Sonnet vs. 26 for D-CIPHER with the same model.
Environment
Metrics
Baseline Comparisons
- D-CIPHER
- EnIGMA
Scale
200 CTFs across 6 categories (53 crypto, 15 forensics, 38 pwn, 51 reverse engineering, 19 web, 24 misc) from NYU CTF Bench
Contributions
- The CRAKEN framework for integrating domain-specific knowledge databases to facilitate knowledge-based execution for LLM agents, compatible with other automated task planning systems
- An optimized Self-RAG based retrieval framework performing iterative retrieval, generation, hallucination grading, query rewriting, and answer refinement for accurate grounded outputs in cybersecurity tasks
- A Graph-RAG integrated retrieval algorithm augmenting vector-based search with structured reasoning over a cybersecurity knowledge graph
- An open-source dataset of CTF writeups (1,298 writeups), code snippets (4,656), and attack payloads (135) for knowledge-based automated cybersecurity agents
- Comprehensive evaluation of knowledge-based execution on CTF benchmarks and MITRE ATT&CK classification showing state-of-the-art performance
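The graph side of the Graph-RAG contribution amounts to matching query entities against stored entity-relation triplets (as one would keep in Neo4j) to complement vector search. A minimal sketch; the triplet data and the word-level matching rule are hypothetical illustrations, not data or code from the paper.

```python
# Graph-side retrieval sketch for Graph-RAG: match query terms
# against entity-relation triplets to complement vector search.
# Triplets and matching rule are hypothetical illustrations.
def graph_lookup(query, triplets):
    """triplets: iterable of (subject, relation, object) strings."""
    entities = {w.lower() for w in query.split()}
    return [(s, r, o) for (s, r, o) in triplets
            if s.lower() in entities or o.lower() in entities]

knowledge = [
    ("printf", "enables", "format-string-write"),
    ("format-string-write", "targets", "GOT"),
    ("strcpy", "enables", "stack-buffer-overflow"),
]
facts = graph_lookup("exploit printf vulnerability", knowledge)
```

Returning structured triplets rather than raw text is what enables the "structured reasoning" the contribution describes: the agent can chain relations (printf enables format-string-write, which targets the GOT) instead of reranking passages.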
Limitations
- The knowledge database has limited diversity, comprising only select CTF writeups, code snippets, and attack payloads, which may prevent CRAKEN from reaching its full potential
- CRAKEN relies on tool calling capabilities of LLMs, preventing use of advanced reasoning models such as OpenAI o3 or Claude 3.7 Sonnet with thinking mode
- Only 43.8% of retrieved documents meet relevance grading standards, indicating retrieval quality issues
- 72.7% of generated content fails hallucination verification, requiring heavy reliance on the retry mechanism
- Mixing datasets without careful curation degrades performance (15.5% solve rate with all databases vs 21.0% with writeups alone)
- DeepSeek V3 performs very poorly (2-3% solve rate), indicating strong dependence on the underlying LLM's capabilities
- Moderate increase in computational cost compared to non-RAG baselines (31% cost increase for Claude 3.5 Sonnet)
- Vulnerable to prompt injection when combined with RAG, as malicious actors could theoretically manipulate the agent via the retrieved corpus
Research Gaps
- Expanding retrieval strategies designed for long conversational contexts to improve knowledge integration over extended agent interactions
- Improving integration technologies to strengthen connections between knowledge databases and agents
- Exploring data organization strategies for curating datasets across various cybersecurity domains
- Incorporating advanced reasoning models (o3, thinking modes) that currently lack tool-calling support
- The knowledge graph evaluation demonstrates that retrieval methods are critical for knowledge augmentation in complex task planning, but optimal graph structures remain unexplored
- Prompt injection defenses for RAG-augmented cybersecurity agents
Novel Techniques
- Contextual decomposition of lengthy agent conversations into focused sub-queries for effective knowledge retrieval
- Six-module iterative Self-RAG pipeline with hallucination grading, relevance grading, and query rewriting loops for cybersecurity knowledge retrieval
- Hybrid Graph-RAG combining structured knowledge graph search (entity-relation triplets via Neo4j) with unstructured vector-similarity retrieval (Milvus) for cybersecurity knowledge
- Knowledge-hint injection at task delegation boundaries, where retrieved knowledge is transformed into actionable hints for executor agents
- Multiple configurable RAG algorithms (multi-query, RAG-fusion with Reciprocal Rank Fusion, decomposition, step-back) that can be independently toggled and composed into hybrid pipelines
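Of the configurable RAG algorithms listed, Reciprocal Rank Fusion is concrete enough to sketch: each retriever's ranked hits contribute 1/(k + rank) to a document's fused score. This is the standard RRF formula with the conventional k = 60, not CRAKEN's actual code, and the document IDs are made up.

```python
# Reciprocal Rank Fusion (RRF): merge ranked hit lists from several
# retrievers (e.g. vector and graph search) into one ranking by
# summing 1/(k + rank) per document. Standard formula, k = 60.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-id lists, best hit first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["writeup_42", "payload_7", "snippet_9"]
graph_hits = ["writeup_42", "snippet_9", "writeup_13"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
```

Because RRF only consumes ranks, it fuses retrievers whose raw scores live on incomparable scales (cosine similarity vs. graph-path counts) without any calibration, which is why it composes cleanly with the other toggleable algorithms.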
Open Questions
- How to effectively combine advanced reasoning models (that lack tool-calling) with knowledge-based execution frameworks?
- What is the optimal structure and granularity for cybersecurity knowledge graphs to maximize retrieval quality?
- How to defend RAG-augmented cybersecurity agents against prompt injection via the knowledge corpus?
- Can knowledge-based execution extend to real-world penetration testing beyond CTF challenges?
- How to reduce the 72.7% hallucination rate in generated knowledge hints while maintaining retrieval coverage?
Builds On
- D-CIPHER (multi-agent planner-executor architecture)
- Self-RAG (self-evaluating recursive retrieval-generation)
- Graph-RAG (structured graph-based retrieval)
- ReWOO (decoupled reasoning from observations)
- NYU CTF Bench (evaluation benchmark)
- EnIGMA (interactive tools for LM agents)
Open Source
Yes - https://github.com/NYU-LLM-CTF/nyuctf_agents_craken and https://github.com/NYU-LLM-CTF/craken_baseline_datasets