#52

RedTeamLLM: an Agentic AI framework for offensive security

Brian Challita, Pierre Parrend

2025 | IJCAI 2025 / arXiv (preprint)

arXiv:2505.06913v1

Problem & Motivation 问题与动机

Existing agentic AI offensive security frameworks converge on a narrow ReAct-style design that lacks memory, plan revision, and context window management, limiting their ability to perform complex, long-horizon penetration testing tasks autonomously.

现有的智能体 AI 进攻性安全框架大多采用狭窄的 ReAct 风格设计,缺乏记忆、计划修订和上下文窗口管理能力,这限制了它们自主执行复杂、长程渗透测试任务的能力。

Agentic AI poses an urgent dual-use threat: malicious actors can leverage autonomous offensive tools for cybercrime, so the security community must proactively build and understand these models before adversaries do. Current systems like PenTestGPT still show restricted automation capacity, use memory only as a scratch-pad for latest observations, and never implement hierarchical plan refinement, long-horizon memory, or roll-back of faulty plans.

智能体 AI 构成了紧迫的双重用途威胁:恶意行为者可以利用自主进攻工具进行网络犯罪,因此安全社区必须在对手之前主动构建并理解这些模型。目前的系统(如 PenTestGPT)仍然表现出受限的自动化能力,仅将记忆用作记录最新观察结果的草稿本,从未实现分层计划细化、长程记忆或故障计划的回滚。

Threat Model 威胁模型

The LLM is used in its default configuration with a benevolent user; typical threats like prompt injection are considered out of scope. The architecture itself addresses two threat families: hijacking of the execution process and inversion of dependency from LLM agents towards the framework. A five-layer security model covers authentication/authorization, network/system isolation (Docker), command validation by user, append-only logging, and a kill switch.

LLM 以其默认配置与仁慈用户配合使用;典型的威胁(如提示注入)被视为超出研究范围。该架构本身解决了两类威胁:执行过程的劫持,以及 LLM 智能体对框架依赖性的倒置。一个五层安全模型涵盖了身份验证/授权、网络/系统隔离(Docker)、用户命令验证、仅追加日志记录以及自毁开关。
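The interaction of the last three layers (user command validation, append-only logging, kill switch) can be sketched as a single gate in front of the terminal. This is a minimal illustrative sketch, not the paper's implementation; the `SecurityGate` class and its method names are assumptions, and authentication and Docker isolation would sit outside it:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SecurityGate:
    """Hypothetical gate implementing three of the five security layers:
    command validation by the user, append-only logging, and a kill switch."""
    kill_switch: bool = False
    audit_log: List[str] = field(default_factory=list)  # append-only: entries are never mutated

    def approve(self, command: str, user_confirms: bool) -> bool:
        # Kill switch: halts all execution immediately.
        if self.kill_switch:
            self.audit_log.append(f"BLOCKED (kill switch): {command}")
            return False
        # Command validation: every agent-proposed command needs explicit user approval.
        if not user_confirms:
            self.audit_log.append(f"REJECTED by user: {command}")
            return False
        # Append-only trace of everything that was allowed to run.
        self.audit_log.append(f"APPROVED: {command}")
        return True

gate = SecurityGate()
print(gate.approve("nmap -sV 10.0.0.5", user_confirms=True))        # True
gate.kill_switch = True
print(gate.approve("sqlmap -u http://target", user_confirms=True))  # False
```

Because the log is only ever appended to, blocked and approved commands alike leave an auditable trace even after the kill switch fires.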

Methodology 核心方法

RedTeamLLM proposes a seven-component architecture (Launcher, RedTeamAgent, Memory Manager, ADaPT Enhanced, Plan Corrector, ReAct, Planner) that combines recursive task decomposition via an enhanced ADaPT mechanism with a ReAct execution loop, a plan-correction module for error recovery, and a tree-structured memory manager for cross-run learning. The evaluated proof of concept focuses on a three-step pipeline (Reasoning, Act, Summarizer) in which each step runs in a separate LLM session: the Reasoner analyzes strategy and plans the next actions, the Act session executes tool calls via a root-privileged Linux terminal, and the stateless Summarizer condenses long command outputs to preserve the context-window budget. The agent has unrestricted terminal access rather than a fixed toolset, promoting genericity and creativity in attack paths.

RedTeamLLM 提出了一个由七个组件(启动器、红队智能体、记忆管理器、ADaPT 增强版、计划校正器、ReAct、计划器)组成的架构。该架构结合了通过增强的 ADaPT 机制实现的递归任务分解、ReAct 执行循环、用于错误恢复的计划校正模块,以及用于跨运行学习的树状结构记忆管理器。评估的验证性实现侧重于三步流水线(推理、行动、总结),每步运行在独立的 LLM 会话中:推理会话分析策略并规划下一步行动,行动会话通过具有 root 权限的 Linux 终端执行工具调用,而状态无关的总结会话压缩长的命令输出以节省上下文窗口预算。智能体拥有不受限制的终端访问权限而非固定工具集,这促进了攻击路径的通用性和创造性。
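The three-session pipeline can be sketched as a loop in which only the compressed summary is carried between iterations, which is how the Summarizer protects the context-window budget. All names here are illustrative assumptions; `llm()` is a stub standing in for three independent GPT-4o sessions:

```python
# Hypothetical sketch of the Reasoning / Act / Summarizer pipeline.
def llm(session: str, prompt: str) -> str:
    # Stub so the control flow is runnable; a real system would call the model.
    return f"[{session} reply to: {prompt[:40]}]"

def run_pipeline(task: str, max_iters: int = 3) -> list:
    transcript = []
    context = task
    for _ in range(max_iters):
        # Session 1 (Reasoner): analyze strategy and plan the next action.
        plan = llm("reasoner", f"Given context: {context}\nPlan next action.")
        # Session 2 (Act): turn the plan into a terminal command and execute it.
        raw_output = llm("act", f"Execute plan: {plan}")
        # Session 3 (stateless Summarizer): compress long tool output so the
        # Reasoner's context window stays small across iterations.
        summary = llm("summarizer", f"Summarize: {raw_output}")
        transcript.append((plan, summary))
        context = summary  # only the compressed summary is carried forward
    return transcript

steps = run_pipeline("Enumerate open ports on 10.0.0.5")
print(len(steps))  # 3
```

Keeping the Summarizer stateless is what causes the limitation noted later: it compresses each output without knowing the agent's overall goal.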

Architecture 架构设计

Seven components: Launcher (UI and task management), RedTeamAgent (orchestrator that delegates to ADaPT Enhanced and saves results to Memory Manager), ADaPT Enhanced (recursive task decomposition into subtask trees), Planner (generates subtask trees from tasks), Plan Corrector (adjusts plans on subtask failure), ReAct (iterative reasoning-execution-observation loop with terminal access), and Memory Manager (embeds and stores task tree nodes in a database for cross-run retrieval). The evaluated implementation focuses on the ReAct component with a three-session pipeline: Reasoning session, Act session, and stateless Summarizer session.

七个组件:启动器(UI 和任务管理)、红队智能体(负责委派给 ADaPT 增强版并将结果保存到记忆管理器的协调器)、ADaPT 增强版(将任务递归分解为子任务树)、计划器(从任务生成子任务树)、计划校正器(在子任务失败时调整计划)、ReAct(带有终端访问权限的迭代“推理-执行-观察”循环)以及记忆管理器(在数据库中嵌入并存储任务树节点,用于跨运行检索)。评估的实现侧重于 ReAct 组件,采用三会话流水线:推理会话、行动会话和无状态总结会话。
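The interplay of ADaPT Enhanced (recursive decomposition) and the Plan Corrector (adjust on subtask failure rather than halt) can be sketched as mutually recursive calls. The `execute`/`decompose` stubs and the correction heuristic below are illustrative assumptions, not the paper's prompts:

```python
# Hypothetical sketch of ADaPT-style recursive decomposition with plan correction.
def execute(task: str) -> bool:
    # Stub: pretend only "atomic:" tasks can be carried out directly.
    return task.startswith("atomic:")

def decompose(task: str) -> list:
    # Stub: a real Planner would ask the LLM for a subtask tree.
    return [f"atomic:{task}/step1", f"atomic:{task}/step2"]

def adapt(task: str, depth: int = 0, max_depth: int = 3) -> bool:
    if execute(task):
        return True
    if depth >= max_depth:
        return False
    for sub in decompose(task):
        if not adapt(sub, depth + 1, max_depth):
            # Plan Corrector: on subtask failure, retry with an adjusted
            # subtask instead of aborting the whole tree.
            corrected = f"atomic:{sub}-corrected"
            if not adapt(corrected, depth + 1, max_depth):
                return False
    return True

print(adapt("scan and exploit target"))  # True
```

The key design point is that failure propagates upward only after a corrected plan has also failed, which is the roll-back behavior the paper says prior frameworks lack.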

LLM Models 使用的大模型

GPT-4o

Tool Integration 工具集成

linux-terminal
nmap
sqlmap

Memory Mechanism 记忆机制

conversation-history

Attack Phases Covered 覆盖的攻击阶段

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation 评估结果

RedTeamLLM outperforms PenTestGPT in 3 out of 5 VulnHub VMs (Victim1: 4 vs 1 steps for GPT4o variant, 300% improvement; WestWild: 4 vs 3, 33% improvement; CTF4: 3.5 vs 2, 75% improvement). The ablation study shows reasoning reduces tool calls by 37-68% in 4 of 5 cases while improving offensive capability in 4 of 5 cases. In two cases, the number of completed steps jumps from 1 to 4 when reasoning is enabled.

RedTeamLLM 在 5 个 VulnHub 虚拟机中的 3 个上表现优于 PenTestGPT(Victim1:GPT4o 变体为 4 步 vs 1 步,提升 300%;WestWild:4 vs 3,提升 33%;CTF4:3.5 vs 2,提升 75%)。消融实验表明,在 5 个案例中的 4 个,启用推理将工具调用减少了 37-68%,同时提高了进攻能力。在两个案例中,启用推理后完成的步骤数从 1 跳升至 4。
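The quoted percentages follow directly from the completed-step counts, taking the baseline as the denominator:

```python
# Relative improvement in completed steps over the PenTestGPT baseline.
def improvement(redteam_steps: float, baseline_steps: float) -> float:
    return (redteam_steps - baseline_steps) / baseline_steps * 100

print(improvement(4, 1))    # 300.0 (Victim1)
print(improvement(4, 3))    # ~33.3 (WestWild)
print(improvement(3.5, 2))  # 75.0  (CTF4)
```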

Environment 评估环境

VulnHub

Metrics 评估指标

task-completion
num-steps
api-calls
tool-calls

Baseline Comparisons 基准对比

  • PenTestGPT-GPT4o
  • PenTestGPT-Llama

Scale 评估规模

5 VulnHub VMs (Sar, CewiKid, Victim1, WestWild, CTF4), each tested 10 times (5 with reasoning, 5 without)

Contributions 核心贡献

  • A seven-component agentic AI architecture for offensive security that addresses four open challenges: dynamic plan correction, memory management, context window constraints, and generality vs. specialization
  • A three-step pipeline (Reasoning, Act, Summarizer) using separate LLM sessions that decouples strategic analysis from action execution, demonstrating that dedicated reasoning reduces tool calls by 37-68% while improving task completion
  • A comprehensive security model with five layers (authentication, isolation, command validation, logging, kill switch) to prevent misuse of the autonomous offensive agent
  • A tree-structured memory management system that stores execution traces with embeddings for cross-run learning and plan improvement
  • Empirical evidence that unrestricted terminal access (genericity) outperforms fixed-toolset approaches by enabling creative attack paths
  • 一种用于进攻性安全的七组件智能体 AI 架构,解决了四个开放挑战:动态计划校正、记忆管理、上下文窗口约束以及通用性与专业性的平衡
  • 一种使用独立 LLM 会话的三步流水线(推理、行动、总结),将战略分析与操作执行解耦,证明了专门的推理将会话工具调用减少了 37-68%,同时提高了任务完成度
  • 一个包含五层(认证、隔离、命令验证、日志、自毁开关)的综合安全模型,以防止自主进攻智能体的滥用
  • 一个树状结构的记忆管理系统,存储带有嵌入向量的执行痕迹,用于跨运行学习和计划改进
  • 实证证据表明,不受限制的终端访问(通用性)通过实现创造性的攻击路径,优于固定工具集的方法

Limitations 局限性

  • Only the ReAct component is fully implemented and evaluated; ADaPT Enhanced, Memory Management, and Plan Correction are implemented but less mature and not evaluated
  • The stateless Summarizer sometimes omits important information because it lacks context about the agent's overall goal
  • Evaluation is limited to 5 entry-level (easy category) VulnHub VMs, not testing on harder or real-world targets
  • Determining when a process requires interactive input is handled via strace but is imprecise because some processes read from multiple file descriptors beyond stdin
  • PenTestGPT-Llama still outperforms RedTeamLLM on 2 of 5 VMs (Sar by 17%, CewiKid by 100%)
  • No evaluation of cost in monetary terms or wall-clock time is provided
  • 目前仅完整实现并评估了 ReAct 组件;ADaPT 增强版、记忆管理和计划校正已实现但不够成熟且未经验证
  • 无状态的总结器有时会遗漏重要信息,因为它缺乏关于智能体总体目标的上下文
  • 评估仅限于 5 个入门级(简单类别)的 VulnHub 虚拟机,未在更难或真实目标上进行测试
  • 通过 strace 确定进程何时需要交互式输入不够精确,因为某些进程会从 stdin 之外的多个文件描述符中读取
  • 在 5 个虚拟机中的 2 个(Sar 领先 17%,CewiKid 领先 100%),PenTestGPT-Llama 的表现仍优于 RedTeamLLM
  • 未提供货币成本或墙钟时间的评估

Research Gaps 研究空白

  • No existing offensive AI agent implements hierarchical plan refinement, long-horizon memory, or roll-back of faulty plans
  • Methodologies for evaluating cost and automation capabilities of agentic offensive models are lacking
  • The impact of recursive planning (ADaPT) combined with memory management and plan correction on offensive capability remains unevaluated
  • Context-aware summarization that preserves task-relevant details from tool outputs is an open problem
  • Scaling agentic offensive AI beyond entry-level CTF challenges to complex, multi-stage real-world penetration tests is unexplored
  • 现有的进攻性 AI 智能体均未实现分层计划细化、长程记忆或故障计划的回滚
  • 缺乏评估智能体攻防模型的成本和自动化能力的方法论
  • 递归计划(ADaPT)结合记忆管理和计划校正对进攻能力的影响仍有待评估
  • 如何实现能保留工具输出中任务相关细节的上下文感知总结,仍然是一个开放性问题
  • 将智能体进攻性 AI 的规模从入门级 CTF 挑战扩展到复杂的、多阶段的真实世界渗透测试,仍未得到探索

Novel Techniques 新颖技术

  • Three separate LLM sessions (Reasoning, Act, Summarizer) to decouple strategic planning from execution and output compression, reducing context window consumption
  • Enhanced ADaPT with plan correction: recursive decomposition that adjusts the plan on subtask failure rather than halting entirely
  • Tree-structured memory management using task description embeddings for cross-run learning, enabling the agent to narrow possibilities to the correct path over multiple executions
  • Unrestricted root terminal access as a generic tool interface rather than a predefined toolset, fostering agent creativity
  • 采用三个独立的 LLM 会话(推理、行动、总结)来解耦战略规划、执行和输出压缩,减少了上下文窗口的消耗
  • 带有计划校正的增强型 ADaPT:递归分解,在子任务失败时调整计划而非直接停止
  • 使用任务描述嵌入的树状结构记忆管理,用于跨运行学习,使智能体能够在多次执行中将可能性缩小到正确路径
  • 不受限制的 root 终端访问作为通用工具接口而非预定义工具集,培养了智能体的创造力
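The tree-structured memory idea (store task-tree nodes with an embedding of their description, retrieve the most similar past node on later runs) can be sketched as follows. The `MemoryManager` interface and the toy bag-of-characters `embed()` are assumptions for illustration; the paper would use a real embedding model and database:

```python
import math

# Hypothetical sketch of cross-run memory via embedding similarity.
def embed(text: str) -> list:
    # Toy embedding: letter-frequency vector, standing in for a real model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryManager:
    def __init__(self):
        self.nodes = []  # (description, embedding, outcome)

    def store(self, description: str, outcome: str):
        self.nodes.append((description, embed(description), outcome))

    def recall(self, query: str):
        # Return the stored node most similar to the new task description.
        q = embed(query)
        return max(self.nodes, key=lambda n: cosine(q, n[1]), default=None)

mem = MemoryManager()
mem.store("port scan of web server", "found 22,80 open")
mem.store("sql injection on login form", "dumped users table")
best = mem.recall("scan ports on the web host")
print(best[0])  # "port scan of web server"
```

Retrieval like this is what would let the agent narrow the search to a previously successful path across multiple executions.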

Open Questions 开放问题

  • How does the full architecture (ADaPT Enhanced + Memory Manager + Plan Corrector + ReAct) perform compared to the ReAct-only proof of concept?
  • Can tree-structured memory management enable the agent to solve previously unsolvable challenges after multiple runs?
  • How does performance scale to medium and hard difficulty CTF challenges or real-world penetration testing scenarios?
  • What is the optimal balance between summarization aggressiveness and information retention for context window management?
  • How robust is the security model against sophisticated adversarial manipulation of the agent's execution process?
  • 与仅包含 ReAct 的概念验证相比,完整架构(ADaPT 增强版 + 记忆管理器 + 计划校正器 + ReAct)的表现如何?
  • 树状结构记忆管理能否使智能体在多次运行后解决此前无法解决的挑战?
  • 性能如何扩展到中等和高难度的 CTF 挑战或现实世界的渗透测试场景?
  • 在上下文窗口管理的总结激进程度与信息保留之间,最佳平衡点是什么?
  • 安全模型对于针对智能体执行过程的复杂对抗操纵的稳健性如何?

Builds On 基于前人工作

  • ReAct
  • ADaPT
  • Plan-and-Execute (P&E)
  • PenTestGPT
  • AutoAttacker
  • HackSynth
  • TDAG

Open Source 开源信息

Yes - https://github.com/lre-security-systems-team/redteamllm
