Cloak, Honey, Trap: Proactive Defenses Against LLM Agents
Problem & Motivation
Recent advances in LLMs have enabled autonomous penetration testing agents capable of compromising hosts, but the same capabilities empower attackers to automate cyberattacks at scale. There are currently no defenses specifically designed to counter LLM-powered attack agents.
Existing defenses against human attackers do not target the unique vulnerabilities of LLMs, such as training biases, hallucinations, tokenization flaws, context window limitations, and reliance on unverified input. A proactive defense paradigm is needed that exploits these inherent LLM weaknesses to disrupt, detect, or neutralize malicious autonomous agents before they succeed.
Threat Model
The setting is modeled as a two-player Stackelberg game between a defender (D) and an autonomous multi-LLM attack agent (A). The attacker's goal is to gain root privileges on a host. The defender operates under black-box assumptions: no knowledge of the agent's architecture, system prompts, or backend LLM. The defender plants deceptive payloads (traps, cloaks, honeytokens) into network assets before the attack begins. The attack agent runs fully autonomously, with no human intervention mid-run, using tools such as nmap, ssh, and Metasploit on Kali Linux.
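The one-shot game can be sketched in standard Stackelberg form; the utility functions below are generic placeholders, not the paper's exact formulation:

```latex
% Defender (leader) commits to a trap placement d before the attack;
% the agent (follower) observes the environment and best-responds.
\[
  a^{*}(d) \in \arg\max_{a \in \mathcal{A}} U_{A}(d, a), \qquad
  d^{*} \in \arg\max_{d \in \mathcal{D}} U_{D}\bigl(d, a^{*}(d)\bigr)
\]
```

The leader's advantage is that $d^{*}$ is chosen anticipating the agent's best response, which matches the paper's plant-before-attack ordering.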
Methodology
The paper identifies 7 inherent LLM vulnerabilities (training bias, reliance on untrusted input, memory/context limitations, depth-first search (DFS) behavior, hallucinations, susceptibility to special characters, alignment constraints) and proposes 6 tactics with 15 techniques organized into three strategies: Cloak (mislead perception, divert attention), Honey (specialized lures, LLM-specific honeytokens), and Trap (model corruption, role manipulation, forced code execution). Payloads are crafted and embedded into network data points such as logs, config files, HTML, service banners, and filenames. Most techniques (14 of 15) work without prompt injection.
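As a minimal sketch of how such payloads might be planted (the decoy strings, asset types, and helper names here are illustrative assumptions, not the paper's CHeaT implementation):

```python
# Hypothetical phase-1 embedding helper: appends a crafted decoy data point
# to an existing asset. All decoy strings below are fabricated for illustration.
DECOYS = {
    "log":    "Oct 12 03:14:07 host sshd[812]: Accepted password for backup_admin from 10.0.0.5",
    "config": "DB_USER=backup_admin  # legacy service account (decoy)",
    "banner": "ProFTPD 1.3.3c Server ready.",  # suggests a well-known vulnerable version
}

def embed_decoy(asset_text: str, asset_type: str) -> str:
    """Return the asset text with one deceptive data point appended."""
    return asset_text.rstrip("\n") + "\n" + DECOYS[asset_type] + "\n"
```

An agent scanning the modified log or banner perceives a plausible foothold that leads only into the defender's trap.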
Architecture
A defense framework operating in two phases: (1) embedding crafted payloads into selected data points across the network to form traps, and (2) monitoring attacker interactions with these traps to stop, delay, or detect the agent. The open-source CHeaT tool automates insertion of cloaks, honeytokens, and traps into existing system files and assets. The framework is model-agnostic and works under black-box assumptions.
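The monitoring phase can be approximated by flagging any use of values that exist only inside planted traps; the event schema and token values below are assumptions, not the framework's actual telemetry:

```python
# Any credential that appears only in a trap cannot occur in legitimate
# traffic, so a single match is a high-confidence detection of the agent.
HONEYTOKENS = {"backup_admin", "s3rv1ce-decoy-key"}

def classify_auth_event(event: dict) -> str:
    """Label an authentication event as 'agent-detected' or 'benign'."""
    if {event.get("user"), event.get("secret")} & HONEYTOKENS:
        return "agent-detected"
    return "benign"
```

Because honeytokens are never issued to real users, this check has essentially zero false-positive cost, which is what makes phase 2 viable under black-box assumptions.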
LLM Models
Tool Integration
Memory Mechanism
none
Attack Phases Covered
Evaluation
In isolation, all 15 techniques achieved average Defense Success Rates (DSRs) of 55-67% across models. In the end-to-end evaluation on 11 CTF machines, the defense achieved a 100% success rate: PentestGPT with GPT-4o failed to compromise any machine when defenses were applied, whereas without defenses it solved all easy and some medium machines. Adaptive adversaries were unable to meaningfully blunt the defenses; even against the best ensemble approach, the defender retained a 78.35% cumulative DSR after 10 traps.
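One plausible reading of "cumulative DSR" is the chance that at least one of n sequential traps stops the agent; under an (assumed) independence model this compounds as:

```python
def cumulative_dsr(per_trap_dsr: float, n_traps: int) -> float:
    """P(at least one of n_traps independent traps succeeds)."""
    return 1.0 - (1.0 - per_trap_dsr) ** n_traps
```

Even a modest per-trap rate compounds quickly: a 14% per-trap DSR over 10 traps already exceeds 75% cumulatively, which is consistent with a strong adaptive adversary still losing to the defense overall.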
Environment
Metrics
Baseline Comparisons
- PentestGPT (no defense baseline)
- AutoAttacker
- PenHeal
- HackingBuddy
- Naive adversary (no adaptive strategy)
- Context-aware adversary (paper-based)
- Context-aware adversary (payload-based)
- SVM preprocessing adversary
- DeBERTa preprocessing adversary
- Informed system prompt adversary
- PAT system prompt adversary
- Fine-tuned adversary (DPO)
- Ensemble adversary
Scale
11 CTF machines (5 easy, 4 medium, 2 hard) for end-to-end evaluation; 4,233 data points (17 base types × 249 payloads) for technique analysis
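The data-point count factors exactly as stated:

```python
base_types, payloads_per_type = 17, 249
assert base_types * payloads_per_type == 4_233  # matches the reported total
```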
Contributions
- First work to propose using LLM exploits as a defense against LLM-powered attack agents, inverting the typical offensive use of LLM vulnerabilities
- Identification of 7 inherent LLM vulnerabilities exploitable by defenders, with 6 tactics and 15 novel defense techniques organized into Cloak, Honey, and Trap strategies
- Discovery of novel exploits: LLM susceptibility to lures, reverse shell counterattacks without prompt injection, LLM-specific honeytokens using asymmetric Unicode characters (Set A and Set B), and landmine/rare tokens that cause hallucinations
- A multifaceted defense framework with three components (Cloak to hide assets, Honey to detect agents, Trap to halt attacks), with 14 of 15 techniques requiring no prompt injection
- Open-source CHeaT tool that automates insertion of cloaks, honeytokens, and traps into existing system files and network assets
- Comprehensive evaluation against 3 multi-LLM agents and 1 single-LLM agent, 4 backend models, 11 CTF machines, and 7 adaptive adversary strategies
Limitations
- Defenses are overt by nature; a human adversary who takes over from a stuck agent might recognize planted traps, though debugging misinformation without deep familiarity is difficult
- Evaluation limited to CTF environments, not real production networks
- Only one technique (T5.2, change agent role/objectives) requires prompt injection, but prompt injection was shown to sometimes reduce rather than improve DSR due to model safety detection
- Payloads were crafted and validated primarily using GPT-4o, then tested on other models; results may differ for future model architectures
- Manual deployment assumed for simplicity; automated deployment at scale in dynamic real-world networks remains future work
- Framework modeled as a one-shot Stackelberg game; extension to multi-turn dynamic interaction is left as future work
Research Gaps
- No existing defenses specifically designed to counter autonomous LLM-powered attack agents
- Mitigation of adversarial prompt injection remains an open challenge for AI safety, with current detection and prevention methods largely ineffective
- Larger context windows and RAG do not eliminate vulnerability to well-crafted misinformation; LLMs still struggle when critical details are buried among distractors
- The fundamental tension between agent trust in input (needed for operation) and vulnerability to deception creates an inherent security gap in LLM-based autonomous agents
- Moving target defenses that dynamically adapt traps during an ongoing attack are unexplored
- Cross-model generalizability of rare/landmine tokens needs further investigation as new models emerge
Novel Techniques
- Asymmetric Unicode characters (Set A: invisible to humans but parsed by LLMs; Set B: visible to humans but invisible to LLMs) for creating LLM-specific honeytokens
- Landmine/rare tokens that corrupt LLM internal state, causing hallucinations and garbage output (e.g., sequences like \u00c3\u0142 for Llama)
- Circular reference loops in data points that trap DFS-biased LLM agents in endless cycles, exploiting their inability to detect revisitation
- Reverse shell counterattack without prompt injection: embedding plausible-looking commands in data points that agents execute voluntarily, giving defenders backdoor access to attacker infrastructure
- Jedi Mind Trick: simple false statements in logs (e.g., 'No vulnerabilities found') that LLMs uncritically adopt as beliefs
- Exploding the search space by planting fake CVEs, credentials, and services to overwhelm agent memory and context windows
- Using LLM alignment/safety features offensively: triggering model refusal by embedding dangerous-sounding content in SSH banners or files
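The asymmetric-Unicode idea can be sketched with zero-width characters; the paper's actual Set A / Set B character lists are not reproduced here, so the marker below is an illustrative assumption:

```python
# A zero-width marker renders invisibly for a human analyst but survives in
# the text an LLM agent ingests, so any echo of the decoy name in the agent's
# output or tool calls carries the marker.
ZW_MARK = "\u200b\u200c\u200b"  # zero-width space / zero-width non-joiner / zero-width space

def tag(name: str) -> str:
    """Embed the invisible marker after the first character of a decoy name."""
    return name[:1] + ZW_MARK + name[1:]

def is_tagged(name: str) -> bool:
    return ZW_MARK in name
```

A decoy filename like `tag("passwords.txt")` displays identically to `passwords.txt` for a human, yet remains machine-distinguishable, which is what makes the honeytoken LLM-specific.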
Open Questions
- How will defenses need to evolve as LLMs improve in detecting misinformation and reasoning about deceptive inputs?
- Can defenders achieve effective moving target defense by dynamically updating traps during an ongoing autonomous attack?
- What is the optimal density and distribution of traps across a real production network to maximize defense without impacting legitimate operations?
- How can landmine tokens be systematically discovered for new model architectures, and can models be hardened against them?
- Will multi-modal LLM agents (processing images, audio) be vulnerable to analogous deception techniques in non-text modalities?
Builds On
- PentestGPT
- AutoAttacker
- PenHeal
- HackingBuddy
- Bad Characters (Boucher et al., IEEE S&P 2022) - invisible Unicode attacks
- GCG adversarial suffixes (Zou et al., 2023)
- Prompt injection formalization (Liu et al., USENIX Security 2024)
- PAT - adversarial prompt optimization (Mo et al., NeurIPS 2024)
Open Source
Yes - https://doi.org/10.5281/zenodo.15601739 (CHeaT tool, datasets, CTF machines, payloads)