Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning
Problem & Motivation
Current LLMs face significant limitations in autonomous penetration testing, including poor error handling, inefficient reasoning (e.g., circular thought paths), and an inability to perform complex end-to-end tasks autonomously. Chain-of-Thought prompting can even be detrimental, and existing RL paradigms designed for single-turn tasks fail to capture the multi-round, stochastic nature of penetration testing.
Traditional penetration testing workflows rely heavily on deep expertise and significant time investment. While LLMs show promise for automation, two critical gaps remain: (1) the lack of large-scale, multi-step walkthrough datasets reflecting real-world infiltration scenarios, and (2) single-round RL training paradigms that are fundamentally ill-suited to the long-horizon, strategic reasoning required on dynamic attack surfaces. Bridging these gaps is essential for truly autonomous penetration testing agents.
Threat Model
The agent operates fully autonomously on a Kali Linux host with access to standard penetration testing tools. It interacts with target machines through a sandboxed environment (InterCode-CTF for online training; Cybench and AutoPenBench for evaluation). The goal is to capture a flag or complete exploitation tasks without human guidance.
Methodology
Pentest-R1 employs a two-stage reinforcement learning pipeline. Stage 1 (Offline RL) trains the LLM on a curated dataset of over 500 real-world expert walkthroughs from HackTheBox and VulnHub, structured as Thought-Command-Observation tuples, using GRPO to instill foundational attack logic. Stage 2 (Online RL) fine-tunes the model in an interactive CTF environment (InterCode-CTF) where it learns from environmental feedback through multi-turn trajectory rollouts, developing error self-correction and adaptive strategies. Both stages use Group Relative Policy Optimization (GRPO) with LoRA for efficient training.
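Both stages rely on GRPO's group-relative advantage, which replaces a learned value critic by standardizing each rollout's reward against its sampling group. A minimal sketch; the exact normalization constants used in the paper are an assumption here:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group Relative Policy Optimization advantage estimate:
    each rollout's reward is standardized against its group's
    mean and std, so no separate value critic is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts sampled for the same prompt; the rollout that
# captured the flag receives the highest relative advantage.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.0])
```

Because advantages are relative within a group, the update favors rollouts that outperform their siblings rather than chasing an absolute reward scale.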
Architecture
Two-stage pipeline: (1) Offline RL stage processes expert walkthroughs into a sequential auto-regressive training task with 14K multi-turn interaction tuples, applying GRPO with format and accuracy rewards. (2) Online RL stage deploys the pre-trained agent into InterCode-CTF where it generates 8-turn conversational trajectories, evaluated by episodic rewards (flag capture, valid steps, failed commands). A turn-aware loss mask ensures backpropagation only on agent-generated tokens. The framework uses Algorithm 1 for multi-turn trajectory optimization with GRPO updates.
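The turn-aware loss mask can be sketched as follows. The per-token role labels and the mean aggregation are assumptions, but the principle matches the paper: gradients flow only through agent-generated tokens, never through environment observations interleaved into the conversation.

```python
import numpy as np

def masked_policy_loss(token_logps, token_roles):
    """Mean negative log-probability over agent-generated tokens only.
    Environment-observation tokens ('env') are masked out so they do
    not contribute to the policy gradient."""
    logps = np.asarray(token_logps, dtype=float)
    mask = np.array([role == "assistant" for role in token_roles], dtype=float)
    return -(logps * mask).sum() / max(mask.sum(), 1.0)

# The agent emits two tokens, then two observation tokens come back;
# only the first two affect the loss.
loss = masked_policy_loss([-0.2, -0.4, -5.0, -9.0],
                          ["assistant", "assistant", "env", "env"])
```

Without the mask, the very low log-probabilities of environment tokens (which the policy was never meant to generate) would dominate the gradient.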
LLM Models
Tool Integration
Memory Mechanism
conversation-history
Attack Phases Covered
Evaluation
On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, outperforming GPT-4o (21.2%) and ranking second only to Gemini 2.5 Flash (27.3%). On Cybench, it achieves a SOTA 15.0% unguided success rate for open-source models, tying for first place overall among all tested models including proprietary ones. The full two-stage RL framework consumes 31% fewer tokens than the untrained base model on Cybench (1.64M vs 2.39M), demonstrating more efficient reasoning.
Environment
Metrics
Baseline Comparisons
- GPT-4o
- Claude-4-Opus
- Claude-3-Opus
- OpenAI-o1-preview
- Claude-3.7-Sonnet
- Gemini-2.5-Flash
- Gemini-1.5-Pro
- Llama-3.1-405B-Instruct
- Mixtral-8x22b-Instruct
- Qwen3-32B
- Qwen2.5-32B
- Llama-3-70b-Chat
Scale
33 AutoPenBench tasks and 40 Cybench CTF tasks
Contributions
- Construction of a large-scale dataset of 500+ real-world multi-step penetration testing walkthroughs from HackTheBox and VulnHub, structured as Thought-Command-Observation tuples (yielding 14K interaction tuples)
- First two-stage end-to-end reinforcement learning framework for autonomous penetration testing, combining offline RL on expert data with online RL in an interactive CTF environment
- State-of-the-art performance on Cybench and AutoPenBench benchmarks, with an 8B parameter model rivaling or surpassing much larger proprietary models
Limitations
- Built on an 8B parameter model, which may have inherent capacity limitations for the most complex multi-stage attacks compared to larger models
- Online RL training is limited to InterCode-CTF environment with 8-turn trajectories, which may not fully capture the complexity of real-world penetration testing engagements that can span many more steps
- Evaluation is limited to CTF-style benchmarks (Cybench and AutoPenBench); no evaluation on real enterprise networks or production systems
- The walkthrough dataset is derived from publicly available resources (HackTheBox, VulnHub), which may not cover all attack categories (e.g., Active Directory, cloud, IoT)
- Token consumption (1.64M on Cybench) remains higher than highly optimized proprietary models like Claude 3.7 Sonnet (1.61M) and Gemini 2.5 Flash (0.80M), suggesting room for efficiency improvements
- No multimodal capabilities; the agent cannot interpret visual interfaces, screenshots, or graphical outputs from tools
- The Thought reconstruction in Stage 1 uses an auxiliary LLM to reverse-engineer expert reasoning, which may introduce inaccuracies in the training signal
Research Gaps
- Lack of large-scale, realistic multi-step penetration testing datasets that go beyond CTF challenges to cover enterprise-level scenarios
- Existing RL methods for LLMs are predominantly designed for single-turn tasks and fail to capture multi-round, stochastic interactions needed for penetration testing
- Chain-of-thought reasoning in current LLMs is often inefficient for penetration testing, producing verbose or circular thought patterns rather than actionable attack strategies
- No established methodology for incorporating multimodal inputs (visual interfaces, network diagrams) into LLM-based penetration testing agents
- Gap between CTF/benchmark performance and real-world penetration testing effectiveness remains largely unexamined
Novel Techniques
- Two-stage RL pipeline combining offline RL on expert walkthroughs with online RL in interactive environments, specifically tailored for multi-turn penetration testing
- Episodic trajectory optimization extending GRPO from single-turn to multi-turn settings with a turn-aware loss mask that only backpropagates on agent-generated tokens
- Thought-Command-Observation tuple structure for walkthrough data, using an auxiliary LLM to reconstruct expert reasoning chains from raw walkthroughs
- Composite reward function design with separate format (structural adherence) and accuracy (command correctness via Jaccard similarity) components for offline stage, and flag/step/fail rewards for online stage
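The offline accuracy reward scores a generated command against the expert's command via Jaccard similarity. A token-level sketch; whitespace tokenization is an assumption, since the paper does not fix the tokenizer here:

```python
def jaccard_command_reward(pred_cmd, ref_cmd):
    """Accuracy reward for the offline stage: Jaccard similarity
    (intersection over union) between the token sets of the
    predicted command and the expert reference command."""
    p, r = set(pred_cmd.split()), set(ref_cmd.split())
    if not p and not r:
        return 1.0  # two empty commands count as a perfect match
    return len(p & r) / len(p | r)

# Shares 3 of 4 distinct tokens with the expert command -> 0.75
reward = jaccard_command_reward("nmap -sV 10.10.10.5",
                                "nmap -sV -p- 10.10.10.5")
```

Set-based overlap gives partial credit for near-miss commands (e.g. a missing flag) instead of the all-or-nothing signal of exact string matching.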
Open Questions
- How well does the two-stage RL approach generalize to penetration testing domains not represented in the training data (e.g., cloud infrastructure, Active Directory, IoT)?
- Can the framework scale to longer engagement horizons beyond 8-15 turns without degradation in reasoning quality?
- Would incorporating multimodal capabilities (as mentioned in future work) fundamentally change the architecture or can it be added incrementally?
- How does the agent handle novel, zero-day vulnerabilities that have no representation in the walkthrough training data?
- What is the optimal balance between offline and online RL training, and could curriculum learning further improve performance?
- Can the reward shaping be made more fine-grained (e.g., partial exploitation credit) to improve learning signal density in sparse-reward scenarios?
Builds On
- DeepSeek-R1
- GRPO
- LoRA
- InterCode-CTF
- PentestGPT
- VulnBot
- AUTOATTACKER
- Agent-R1
- Search-R1
Open Source
Yes - https://github.com/KHenryAegis/Pentest-R1