#45

Improving LLM Agents with Reinforcement Learning on Cryptographic CTF Challenges

Lajos Muzsai, David Imolai, András Lukács

2025 | arXiv (preprint)

arXiv (see https://github.com/aielte-research/HackSynth-GRPO)

Problem & Motivation

Current LLM-based penetration-testing agents are evaluated primarily on CTF challenges, but the field lacks scalable, rigorous environments for training such agents via reinforcement learning. Practical RL training requires abundant, objectively verifiable tasks, which are scarce in cybersecurity.

Reinforcement learning has delivered strong gains in LLM reasoning for math and coding, but the absence of procedurally generated, reward-compatible cybersecurity datasets has so far prevented applying RL to security-domain agents. Cryptographic reasoning is an ideal RL testbed because it combines precise validation, structured multi-step inference, and reliance on reliable computational tool use.

Threat Model

The agent operates in a sandboxed Python execution environment with access to a Python REPL via MCP. No network access or system-level exploitation is assumed; the threat model concerns solving cryptographic CTF challenges that require code execution and mathematical reasoning.

Methodology

The authors introduce Random-Crypto, a procedurally generated cryptographic CTF benchmark spanning 50 algorithmic families and over 5,000 unique tasks with three difficulty levels (easy, medium, hard). They fine-tune a tool-augmented Llama-3.1-8B-Instruct model using Group Relative Policy Optimization (GRPO) for 250 training steps on easy challenges in a secure Python execution environment accessed via Anthropic's Model Context Protocol (MCP). The model generates eight candidate trajectories per training step consisting of iterative reasoning and JSON-formatted tool calls. A composite reward function penalizes hallucinated flags and rewards correct flag recovery, proper answer formatting, valid tool calls, and error-free code execution.
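The composite reward described above can be sketched as follows. The components (correct flag recovery, answer formatting, valid tool calls, error-free execution, and a hallucination penalty) come from the paper, but the weight values and the `flag{...}` answer format are illustrative assumptions, not the authors' exact function.

```python
import re

# Hypothetical weights; the paper specifies the reward components but the
# exact values and flag format here are assumptions for illustration.
W_FLAG, W_FORMAT, W_TOOL, W_EXEC = 1.0, 0.1, 0.1, 0.1
P_HALLUCINATION = -0.5

def composite_reward(response: str, true_flag: str,
                     tool_calls_valid: bool, exec_errors: int) -> float:
    """Score one candidate trajectory (sketch of the composite reward)."""
    reward = 0.0
    m = re.search(r"flag\{[^}]*\}", response)
    if m:
        if m.group(0) == true_flag:
            reward += W_FLAG           # correct flag recovered
        else:
            reward += P_HALLUCINATION  # hallucinated or wrong flag: deduct
        reward += W_FORMAT             # answer appears in the expected format
    if tool_calls_valid:
        reward += W_TOOL               # JSON tool calls parsed cleanly
    if exec_errors == 0:
        reward += W_EXEC               # code executed without errors
    return reward
```

The deduction for a wrong flag is what discourages the policy from guessing plausible-looking flags instead of computing them, which is the reward-hacking failure mode the paper highlights.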

Architecture

A tool-augmented LLM agent interfaces with a Python REPL server via MCP. The agent follows a structured interaction cycle: (1) generate reasoning in XML tags, (2) emit a single JSON tool call to execute Python code, (3) receive the execution output, (4) repeat for up to four iterations until the flag is recovered. The model is augmented with QLoRA adapters for parameter-efficient fine-tuning on a single A100 80GB GPU.
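The interaction cycle above can be sketched as a minimal loop. Here `llm` and `execute_python` are hypothetical stand-ins for the model and the MCP REPL tool, and the regex-based JSON extraction is an assumption, not the authors' parser.

```python
import json
import re

MAX_TOOL_CALLS = 4  # the agent is capped at four sequential tool invocations

def run_agent(llm, execute_python, true_flag):
    """Reason -> tool call -> observe loop (a sketch, not the authors' code)."""
    history = []
    for _ in range(MAX_TOOL_CALLS):
        reply = llm(history)  # reasoning in XML tags plus one JSON tool call
        found = re.search(r"flag\{[^}]*\}", reply)
        if found and found.group(0) == true_flag:
            return found.group(0)  # correct flag recovered: stop early
        call = re.search(r"\{.*\}", reply, re.DOTALL)  # crude JSON extraction
        if call:
            args = json.loads(call.group(0))
            output = execute_python(args["code"])  # run code in the sandbox
            history.append({"tool_call": args, "output": output})
    return None  # tool-call budget exhausted without recovering the flag
```

Appending each tool call and its output to `history` is what implements the conversation-history memory: the model sees all prior execution results on the next iteration.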

LLM Models

  • Llama-3.1-8B-Instruct
  • Llama-4-Scout-17B-16E
  • Llama-3.1-70B
  • GPT-4.1
  • o3

Tool Integration

  • Python REPL via MCP
  • execute_python
  • list_variables
  • install_package

Memory Mechanism

Conversation history

Attack Phases Covered

  • reconnaissance
  • scanning
  • enumeration
  • exploitation
  • post-exploitation
  • privilege escalation
  • lateral movement
  • reporting

Evaluation

GRPO fine-tuning on Llama-3.1-8B increased Pass@8 on Random-Crypto from 0.10 to 0.88-0.90 (easy subset), surpassing GPT-4.1 and approaching o3. The improvements generalized to external benchmarks: picoCTF Pass@8 increased from 0.07 to 0.18, and AICrypto MCQ Pass@8 increased from 0.04 to 0.19 with curriculum training. Training without hints produced a more consistent model (higher Maj@8) despite slightly lower Pass@8.

Environment

  • Random-Crypto
  • picoCTF
  • AICrypto-MCQ

Metrics

  • Pass@8
  • Maj@8
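Assuming eight sampled answers per challenge, the two metrics can be computed as below; `None` marks a trajectory that produced no flag, and the tie-breaking rule for Maj@8 is an assumption.

```python
from collections import Counter

def pass_at_8(answers, true_flag):
    """Pass@8: solved if any of the eight sampled answers is the correct flag."""
    return any(a == true_flag for a in answers)

def maj_at_8(answers, true_flag):
    """Maj@8: solved only if the most common non-empty answer is correct,
    rewarding consistency rather than a single lucky sample."""
    votes = [a for a in answers if a is not None]
    if not votes:
        return False
    majority, _ = Counter(votes).most_common(1)[0]
    return majority == true_flag
```

The gap between the two explains the paper's observation about hint-free training: a model can reach high Pass@8 with one lucky trajectory in eight, while a higher Maj@8 requires most trajectories to agree on the correct flag.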

Baseline Comparisons

  • Llama-3.1-8B-base
  • Llama-3.1-70B
  • Llama-4-Scout-17B-16E
  • GPT-4.1
  • o3

Scale

50 manually validated Random-Crypto challenges for testing, 5,000+ auto-generated challenges for training, 120 picoCTF challenges, 135 AICrypto MCQ questions

Contributions

  • Introduction of Random-Crypto, a procedurally generated cryptographic CTF benchmark covering 50 algorithmic families across 8 archetypes (Classical, RSA, AES, ECC, Hash, PRNG, Web Crypto, Signature Schemes) with three difficulty levels for RL-based agent training
  • Demonstration that GRPO-based reinforcement learning with a composite reward function significantly improves tool-augmented reasoning in a small (8B) LLM, raising Pass@8 from 0.10 to 0.88-0.90 on cryptographic challenges
  • Validation of cross-domain generalization: RL-acquired strategies transfer to the heterogeneous picoCTF benchmark (spanning web, forensics, reverse engineering, binary exploitation) and the AICrypto MCQ multiple-choice benchmark, despite the model never training on these formats

Limitations

  • Training restricted to easy challenges only; harder challenges caused convergence to local optima prioritizing superficial rewards
  • Agent cannot process auxiliary artifacts like ELF binaries, PCAP traces, or PNG images (95 of 120 picoCTF challenges embed such artifacts)
  • Evaluation prompts were optimized for Llama-3.1-8B, inadvertently disadvantaging larger Llama-family models in benchmarking
  • The o3 model experienced content moderation issues and confusion about code execution permissions, limiting fair comparison
  • Security risks from unbounded code execution: the REPL server could crash from memory exhaustion when the agent generated resource-intensive code
  • Limited to four sequential tool invocations per challenge, constraining multi-step reasoning on harder problems
  • Token generation capped at 8192 per interaction sequence

Research Gaps

  • Comprehensive sandboxing strategies for tool-augmented LLM agents remain underdeveloped, including timeouts, memory limits, and instruction whitelisting
  • RL training on harder cryptographic challenges remains unsolved due to reward sparsity causing convergence to superficial local optima
  • Tool APIs with networking capabilities (web scrapers, HTTP clients) could expose systems to unintentional denial-of-service attacks when used by RL-trained agents
  • No established methodology for curriculum-based RL training in cybersecurity that progressively increases challenge difficulty
  • Cross-modal CTF challenges (involving binary analysis, forensics, image processing) are not yet addressable by text-only tool-augmented agents

Novel Techniques

  • Composite reward function with deduction strategy that penalizes hallucinated flags and fictitious tool outputs to prevent reward hacking during RL training
  • Procedural generation of cryptographic CTF challenges with randomized parameters and LLM-generated narratives for scalable RL training data
  • Curriculum-based RL training (hints first, then no hints) that improves adaptability to unseen task formats and cross-domain generalization
  • Restricting RL training to easy challenges to avoid convergence to local optima from reward sparsity on hard tasks
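The procedural-generation idea can be illustrated with a toy generator for one classical-cipher family. The Caesar template, flag format, and parameter ranges below are illustrative assumptions, not the benchmark's actual generators, which span 50 families and attach LLM-generated narratives.

```python
import random
import string

def generate_caesar_challenge(seed):
    """Toy Random-Crypto-style generator: randomized parameters plus an
    objectively verifiable flag usable as an RL reward signal."""
    rng = random.Random(seed)     # seeding makes each challenge reproducible
    shift = rng.randrange(1, 26)  # unknown Caesar shift, never zero
    secret = "".join(rng.choices(string.ascii_lowercase, k=8))
    ciphertext = "".join(
        chr((ord(c) - ord("a") + shift) % 26 + ord("a")) for c in secret
    )
    return {
        "prompt": f"A message was Caesar-shifted by an unknown amount: {ciphertext}",
        "flag": f"flag{{{secret}}}",  # ground truth for automatic scoring
    }
```

Because every generated instance carries its own ground-truth flag, correctness can be checked mechanically at scale, which is exactly the property RL training needs.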

Open Questions

  • Can RL training on cryptographic challenges transfer to non-crypto penetration testing tasks like web exploitation or privilege escalation?
  • How can RL training be extended to harder challenges without reward sparsity causing convergence to local optima?
  • What is the optimal curriculum design for progressively training agents on increasingly difficult security tasks?
  • Can multi-modal agents be trained via RL to handle binary analysis, packet captures, and image-based CTF challenges?
  • How should execution sandboxing be designed to safely support RL training of tool-augmented security agents at scale?

Builds On

  • HackSynth
  • DeepSeek-R1
  • GRPO
  • ReAct
  • QLoRA
  • Intercode
  • MCP-Python

Open Source

Yes - https://github.com/aielte-research/HackSynth-GRPO
