#06

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks

Jiacen Xu, Jack W. Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, Zhou Li

2024 | arXiv (preprint)

2403.01038

Problem & Motivation

There is no comprehensive study on whether LLMs can automate post-breach, hands-on-keyboard attacks in enterprise network environments. Prior work focused on pre-breach stages (phishing, malware generation) or required significant human involvement during penetration testing.

As LLMs grow more capable, attackers will inevitably use them to automate both pre- and post-breach attack stages, transforming rare expert-led operations into frequent automated ones. Understanding these risks now is critical so defenders can prepare. Existing LLM pentesting tools either require intensive human interaction, focus on a single attack type, or have high failure rates.

Threat Model

The adversary has either controlled a machine in the enterprise or can communicate with one, depending on the attack stage. The enterprise network has certain weaknesses (e.g., anti-virus is turned off) that would allow a human attacker to succeed. The attacker uses off-the-shelf hacking tools like Metasploit and native OS capabilities.

Methodology

AutoAttacker is a modular LLM-guided system that automates post-breach cyber-attacks. Instead of a monolithic agent, it decomposes the attack automation problem into four specialized modules -- summarizer, planner, navigator, and experience manager -- each querying the LLM with carefully designed prompt templates. It uses a role-playing jailbreak technique to bypass LLM safety filters, a RAG-inspired experience manager to reuse successful attack steps, and chain-of-thought prompting within the planner to generate precise attack commands. The system iteratively interacts with the victim environment, summarizing observations, planning next actions, selecting optimal commands from candidates, and executing them via Metasploit or shell.

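A minimal sketch of this loop (illustrative only: `llm` stands for any prompt-to-text callable, and `env` and `experience_db` are hypothetical stand-ins for the victim environment and the experience store; the paper's actual interfaces are not published):

```python
def run_attack(llm, objective, env, experience_db, max_rounds=18):
    """One episode of the summarize -> plan -> navigate -> execute loop."""
    history = []
    for _ in range(max_rounds):
        # Summarizer: condense prior interactions into a short situation report.
        situation = llm(f"Summarize the attacker's situation:\n{history}")
        # Planner: chain-of-thought prompt yields a candidate action.
        candidate = llm(f"Objective: {objective}\nSituation: {situation}\n"
                        "Think step by step, then propose one action.")
        # Navigator: choose among the candidate and top-k similar past experiences.
        options = [candidate] + experience_db.top_k(candidate, k=3)
        action = llm(f"Pick the best action from: {options}")
        # Execute via Metasploit or a shell, and record the observation.
        observation = env.execute(action)
        history.append((action, observation))
        if env.objective_met():
            experience_db.store(history)  # Experience manager keeps successes.
            break
    return history
```

The 18-round cap mirrors the paper's observation that GPT-4 finishes advanced tasks in under 18 rounds.
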
Architecture

Four-module pipeline:

  • Summarizer (SUM) -- condenses previous interactions and current observations into a concise situation description for the LLM.
  • Planner (PLA) -- takes the situation and objective, and uses chain-of-thought prompting to generate a candidate action with planning, command-type, and exact-command fields.
  • Navigator (NAV) -- uses the LLM to select the best action from the planner's suggestion and the top-k similar past experiences retrieved by the experience manager.
  • Experience Manager (EXP) -- stores successful actions in a database alongside embedding vectors of their planning sections, and retrieves relevant past experiences for new tasks by cosine similarity.

A command checker post-processes LLM output to fix common syntactical mistakes.

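The experience manager's retrieval step can be sketched as follows; `embed` is a stand-in for any embedding model (the paper uses text-embedding-ada-002), and the class and method names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class ExperienceManager:
    def __init__(self, embed):
        self.embed = embed   # callable: planning text -> vector
        self.entries = []    # (embedding, action) pairs

    def store(self, planning, action):
        """Save a successful action, keyed by an embedding of its planning text."""
        self.entries.append((self.embed(planning), action))

    def top_k(self, planning, k=3):
        """Return the k stored actions whose planning is most similar."""
        query = self.embed(planning)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [action for _, action in ranked[:k]]
```

Retrieved actions are handed to the navigator alongside the planner's fresh suggestion, which is how advanced tasks reuse experience from basic ones.
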
LLM Models

GPT-4, GPT-3.5, Llama-2-7B-chat, Llama-2-70B-chat

Tool Integration

metasploit, meterpreter, mimikatz, powershell, bash, psexec

Memory Mechanism

RAG

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

With GPT-4 at temperature 0, AutoAttacker achieves a perfect 3/3 success rate on all 14 attack tasks, completing basic tasks in under 8 rounds and advanced tasks in under 18 rounds. GPT-3.5 can only complete 3 out of 14 tasks, while Llama-2-7B-chat and Llama-2-70B-chat fail on all tasks due to lacking Metasploit knowledge and not following the required action format. The experience manager reduces interaction rounds (e.g., from 17 to 7 for ransomware attack) and API costs.

Environment

custom-lab

Metrics

success-rate, num-interactions, cost

Baseline Comparisons

  • GPT-3.5 (direct prompting)
  • Llama-2-7B-chat
  • Llama-2-70B-chat
  • AutoAttacker ablation variants (without experience manager, without summarizer, detailed vs abstract objectives)

Scale

14 attack tasks across multiple VMs (Windows 10, Windows 11, Windows Server 2016, Ubuntu 12.04, Ubuntu 22.04) in a Hyper-V enterprise network

Contributions

  • First comprehensive study evaluating LLMs for automating post-breach hands-on-keyboard attacks across multiple attack stages and OS environments
  • Design of AutoAttacker, a modular multi-component agent (summarizer, planner, navigator, experience manager) with carefully crafted prompt templates for precise attack command generation
  • A new benchmark of 14 attack tasks mapped to MITRE ATT&CK tactics, covering reconnaissance, initial access, execution, persistence, privilege escalation, credential access, lateral movement, and impact
  • A role-playing jailbreak technique achieving 100% jailbreak success rate across all tested LLMs with a single prompt
  • Demonstration that GPT-4 achieves perfect success rate on all 14 tasks while GPT-3.5 and open-source models largely fail

Limitations

  • The victim environment is pre-configured to be insecure (e.g., anti-virus disabled), which does not reflect hardened real-world enterprise networks
  • Only 14 tasks are evaluated, covering a small subset of the MITRE ATT&CK matrix; setting up vulnerable environments for each TTP is time-consuming
  • Uses a single LLM per run; does not explore multi-LLM architectures that could improve performance
  • LLM hallucinations occur (e.g., GPT-4 generates non-existent Metasploit modules), though multi-round interaction can self-correct
  • The system depends heavily on GPT-4; GPT-3.5 and open-source models fail on most tasks, raising questions about generalizability
  • No evaluation against active defenses (IDS, EDR, antivirus) -- all security measures are turned off
  • Complex cross-machine attacks like full lateral movement chains are not thoroughly evaluated
  • The experience manager relies on text-embedding-ada-002 for similarity matching, which may not capture nuanced attack context differences

Research Gaps

  • No prior comprehensive study on LLM-driven post-breach attack automation; most work focuses on pre-breach (phishing, malware) or simple CTF challenges
  • Existing LLM pentesting tools (PentestGPT, wintermute, Happe et al.) require human involvement, target single machines, lack cross-machine attacks, or do not use RAG for experience reuse
  • Standard LLM reasoning techniques (tree-of-thoughts, chain-of-thought) are insufficient for cyber-attack automation due to complex action spaces, environment-dependent observations, and long task chains
  • Lack of benchmarks specifically designed for evaluating LLM-based post-breach attack automation with detailed environment specifications
  • No robust defenses exist against LLM jailbreaking for attack command generation; unlearning attack knowledge is difficult to audit

Novel Techniques

  • Role-playing jailbreak technique that achieves 100% success rate with a single prompt across GPT-3.5, GPT-4, Llama-2-7B-chat, and Llama-2-70B-chat
  • Modular agent decomposition (summarizer/planner/navigator/experience manager) specifically designed to address LLM limitations in attack automation (context window limits, verbose outputs, environment tracking failures)
  • RAG-based experience manager that stores successful attack steps and retrieves them via embedding similarity for new tasks, enabling task chaining where advanced attacks build on basic task experiences
  • Structured action format (<r>planning</r><t>command_type</t><c>command</c>) that constrains LLM output to parseable, executable commands
  • Command checker module that post-processes LLM output to fix syntactical errors (e.g., replacing semicolons with end-of-line symbols)

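The structured action format and command checker can be sketched as follows; the `<r>/<t>/<c>` tags and the semicolon fix come from the paper, while the regex and function names are illustrative:

```python
import re

# Non-greedy groups capture the planning, command-type, and command fields.
ACTION_RE = re.compile(r"<r>(.*?)</r>\s*<t>(.*?)</t>\s*<c>(.*?)</c>", re.DOTALL)

def parse_action(text):
    """Return (planning, command_type, command), or None if the output is malformed."""
    match = ACTION_RE.search(text)
    return match.groups() if match else None

def check_command(command):
    """Post-process an LLM-emitted command, e.g. replace semicolons
    with end-of-line symbols as the paper's command checker does."""
    return command.replace(";", "\n").strip()
```

Constraining output to this format is what lets the system execute LLM suggestions directly instead of parsing free-form prose.
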
Open Questions

  • Can AutoAttacker work against hardened environments with active defenses (EDR, IDS, antivirus)?
  • How would the system perform with newer, more capable LLMs (GPT-4 Turbo, Claude, Gemini) or fine-tuned open-source models?
  • Can the modular architecture scale to full end-to-end penetration tests spanning dozens of machines and attack stages?
  • What defenses are effective against LLM-automated attacks -- can C2 traffic detection or hallucination-pattern detection reliably identify automated attacks?
  • How robust is the role-playing jailbreak against improved LLM safety training, and will an arms race between jailbreaking and safety ensue?
  • Can multi-LLM architectures (e.g., different LLMs for planning vs. command generation) improve performance beyond single-LLM approaches?

Builds On

  • RAG (Lewis et al., 2020)
  • Chain-of-Thought prompting (Wei et al., 2022)
  • ThinkGPT
  • LangChain
  • Metasploit framework
  • MITRE ATT&CK framework
  • PentestGPT (Deng et al., 2023)
  • Happe and Cito (2023) - Getting pwn'd by AI
  • wintermute (Happe et al., 2023)

Open Source

No

Tags