Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
Problem & Motivation 问题与动机
LLM-based agents are increasingly being used to automate cyberattacks end-to-end, from reconnaissance to exploitation, lowering the barrier for sophisticated attacks. There is a need for defensive strategies specifically tailored to counter these automated LLM-driven cyberattacks.
基于LLM的智能体越来越多地被用于端到端的自动化网络攻击,从侦察到利用,降低了复杂攻击的门槛。需要专门量身定制的防御策略来对抗这些自动化的LLM驱动网络攻击。
While LLMs enable powerful automated attacks, they also have an inherent weakness: susceptibility to prompt injections. The paper proposes flipping this vulnerability from a liability into a defensive asset, using prompt injections proactively to disrupt, mislead, or counterattack LLM-driven adversaries. No prior work has explored using prompt injection as a deliberate defensive mechanism against attacking LLM-agents.
虽然LLM能够实现强大的自动化攻击,但它们也有一个固有的弱点:易受提示词注入(prompt injection)攻击。本文提出将这一弱点从负债转化为防御资产,主动利用提示词注入来干扰、误导或反击LLM驱动的对手。此前尚无研究探索将提示词注入作为对抗攻击型LLM智能体的刻意防御机制。
Threat Model 威胁模型
A two-party game between an attacker (LLM-agent A) and a defender (D). The attacker is an LLM-agent attempting to compromise a remote target machine S to achieve an adversarial objective (e.g., capturing a flag). The defender operates on S, is agnostic to the attacker's LLM model and strategy, is unaware of the actual vulnerabilities in S, and aims to disrupt the attacker's operations by executing a predefined sabotage objective. The attacker has a maximum of 30 rounds of actions.
攻击者(LLM智能体 A)与防御者(D)之间的双人博弈。攻击者是一个尝试攻破远程目标机器S以实现对抗目标(如捕获旗标)的LLM智能体。防御者在S上运行,对攻击者的LLM模型和策略一无所知,也不了解S中的实际漏洞,其目标是通过执行预定义的破坏目标来中断攻击者的操作。攻击者最多有30轮行动机会。
Methodology 核心方法
Mantis (Malicious LLM-Agent Neutralization and exploitation Through prompt Injections) is a defensive framework that deploys purposefully vulnerable decoy services on a target system to attract LLM-agent attackers. When an attacker exploits a decoy, Mantis injects carefully crafted prompt injections into the system responses, manipulating the attacker's LLM into either disrupting its own operations (passive defense via agent-tarpit) or compromising the attacker's own machine (active defense via agent-counterstrike with reverse shell). The prompt injections are hidden from human observers using ANSI escape sequences in terminals and HTML comment tags in web pages.
Mantis(Malicious LLM-Agent Neutralization and exploitation Through prompt Injections,通过提示词注入中和并利用恶意LLM智能体)是一个防御框架,它在目标系统上部署故意留有漏洞的诱饵服务以吸引LLM智能体攻击者。当攻击者利用诱饵时,Mantis在系统响应中注入精心设计的提示词注入载荷,操纵攻击者的LLM,使其要么破坏自身的运行(通过智能体焦油坑/agent-tarpit进行的被动防御),要么攻破攻击者自己的机器(通过带有反弹shell的智能体反击/agent-counterstrike进行的积极防御)。提示词注入载荷在终端中使用ANSI转义序列,在网页中使用HTML注释标签,从而对人类观察者隐藏。
Architecture 架构设计
Two core components: (1) Decoys - fake vulnerable services (FTP server with anonymous auth, Web-app with SQL injection) that attract LLM-agents and trigger activation events upon exploitation; (2) Injection Manager - coordinates prompt injection deployment in real-time, generates payloads consisting of an execution trigger and target instructions, and manages counterstrike operations (spawning tarpit services, reverse shell listeners). Mantis integrates seamlessly with legitimate services on the host machine.
两个核心组件:(1) 诱饵(Decoys)——吸引LLM智能体并在被利用时触发激活事件的虚假漏洞服务(如具有匿名认证的FTP服务器、具有SQL注入的Web应用);(2) 注入管理器(Injection Manager)——实时协调提示词注入的部署,生成由执行触发器和目标指令组成的载荷,并管理反击操作(启动焦油坑服务、反弹shell监听器)。Mantis可以与主机上的合法服务无缝集成。
LLM Models 使用的大模型
Tool Integration 工具集成
Memory Mechanism 记忆机制
none
Attack Phases Covered 覆盖的攻击阶段
Evaluation 评估结果
Mantis achieved approximately 95.4% success rate in achieving its sabotage objective across all configurations, while reducing the attacker's success rate to less than 3%. The agent-counterstrike method was the most reliable sabotage objective. The FTP decoy was more effective than the Web-app decoy due to simpler exploitation requirements. Increasing tarpit complexity directly amplified the attacker's API costs (up to $4.55 per attack with 150 subdirectories per node).
Mantis在所有配置中实现其破坏目标的成功率约为95.4%,同时将攻击者的成功率降低到不到3%。智能体反击(agent-counterstrike)是最可靠的破坏方法。由于利用要求更简单,FTP诱饵比Web应用诱饵更有效。增加焦油坑的复杂度会直接放大攻击者的API成本(每个攻击在每个节点有150个子目录时成本高达4.55美元)。
Environment 评估环境
Metrics 评估指标
Baseline Comparisons 基准对比
- PentestGPTAuto (extended PentestGPT with weaponizer module)
- AutoPenAgent
- HackingBuddyGPT
Scale 评估规模
3 HackTheBox very-easy CTF machines (Dancing, Redeemer, Synced), plus 2 advanced CTFs (Chemistry, Cicada) in appendix testing
Contributions 核心贡献
- Proactive defense via prompt injections: Reframes prompt injections from vulnerabilities into strategic defensive assets by embedding them into system responses to disrupt automated LLM-driven attacks
- Steerability analysis: Provides a foundational study on how LLM-based agents for cyberattacks can be systematically steered using crafted responses, demonstrating that controlled interactions can exploit attacking LLM-agents' decision-making
- Development of the Mantis framework: An adaptive defense system that autonomously deploys decoys and uses prompt injections in real-time to mislead and counterattack LLM-driven attacks, with modular design and open-source release
- 通过提示词注入进行主动防御:将提示词注入从漏洞重新定义为战略防御资产,通过将其嵌入系统响应来破坏自动化的LLM驱动攻击
- 可控性分析:提供了关于如何使用精心设计的响应系统地引导基于LLM的网络攻击智能体的基础研究,证明了受控交互可以利用攻击型LLM智能体的决策过程
- Mantis框架的开发:一个自适应防御系统,能够实时自动部署诱饵并使用提示词注入来误导和反击LLM驱动的攻击,具有模块化设计并已开源
Limitations 局限性
- Defense effectiveness depends on LLMs remaining susceptible to prompt injection; if future LLMs become robust against prompt injection, Mantis's approach would be less effective
- Active defense (hack-back/agent-counterstrike) raises ethical and legal concerns, limiting applicability in general contexts
- Attackers who gain knowledge of Mantis's defenses can instruct their LLM-agent to bypass known decoys or filter execution triggers from Mantis's default pool, requiring human-in-the-loop to counter
- Evaluation limited to very-easy CTF machines; effectiveness on complex real-world systems with multiple genuine vulnerabilities is less clear
- Web-app decoy is less effective on machines where competing real vulnerable services (e.g., SMB) exist, as failed decoy exploitation causes the LLM-agent to shift focus
- Agent-tarpit objective requires continuous interaction and the LLM-agent can occasionally escape the trap
- Only tested against three publicly available open-source attacking agents; proprietary agents like AutoAttacker and PenHeal were unavailable
- 防御有效性取决于LLM是否仍然易受提示词注入的影响;如果未来的LLM对提示词注入具有鲁棒性,Mantis的方法将不那么有效
- 积极防御(黑客反击/智能体反击)引发了伦理和法律问题,限制了其在一般环境中的适用性
- 了解Mantis防御措施的攻击者可以指示其LLM智能体绕过已知诱饵,或从Mantis的默认池中过滤执行触发器,这需要人工参与来应对
- 评估仅限于非常简单的CTF机器;在具有多个真实漏洞的复杂现实系统中的有效性尚不明确
- 在存在竞争性真实漏洞服务(如SMB)的机器上,Web应用诱饵效果较差,因为诱饵利用失败会导致LLM智能体转移注意力
- 智能体焦油坑(agent-tarpit)目标需要持续交互,LLM智能体偶尔可以逃脱陷阱
- 仅针对三种公开可用的开源攻击智能体进行了测试;无法获得像AutoAttacker和PenHeal这样的专有智能体
Research Gaps 研究空白
- Whether defenses tailored explicitly to the context of LLM-agents can counter prompt injection attacks more effectively than general-purpose Generative AI defenses
- Development of dynamically tailored execution triggers specific to the attacking LLM-agent using fingerprinting techniques like LLMmap
- Scalability and adaptability of prompt-injection-based defenses against increasingly sophisticated and human-in-the-loop LLM-agent attacks
- Exploring the arms race between defensive prompt injection and LLM robustness improvements
- Integration of Mantis-like defenses with existing honeypot frameworks for production deployment
- 明确针对LLM智能体语境定制的防御措施是否比通用的生成式AI防御措施更有效地应对提示词注入攻击
- 开发使用LLMmap等指纹识别技术、专门针对攻击型LLM智能体的动态定制执行触发器
- 基于提示词注入的防御措施在应对日益复杂且有人类参与的LLM智能体攻击时的可扩展性和适应性
- 探索防御性提示词注入与LLM鲁棒性提升之间的军备竞赛
- 将类似Mantis的防御措施与现有的蜜罐框架集成以用于生产部署
Novel Techniques 新颖技术
- Using prompt injection offensively as a defensive mechanism against LLM-agent attackers, flipping a known vulnerability into a strategic defense
- Agent-tarpit: trapping LLM-agents in infinite filesystem exploration loops using dynamically generated fake directory trees with supernormal stimulus naming
- Agent-counterstrike: inducing LLM-agents to open reverse shells on their own machines by injecting payloads into decoy service responses
- Hiding prompt injections from human observers using ANSI escape sequences (terminal) and HTML comment tags (browser) so defenses are invisible to manual reconnaissance
- Decoy banner manipulation: embedding hidden prompts in service banners (e.g., HTTP Server header) to subtly influence LLM-agent prioritization of decoy services
- Momentum-based execution triggers that align with the LLM-agent's exploitation flow rather than using traditional adversarial prefixes like 'Ignore all previous instructions'
- 将提示词注入作为对抗LLM智能体攻击者的防御机制,将已知漏洞转化为战略防御
- 智能体焦油坑(Agent-tarpit):使用具有超常刺激命名(supernormal stimulus naming)的动态生成虚假目录树,将LLM智能体困在无限的文件系统探索循环中
- 智能体反击(Agent-counterstrike):通过在诱饵服务响应中注入载荷,诱导LLM智能体在其自己的机器上开启反弹shell
- 对人类观察者隐藏提示词注入:在终端中使用ANSI转义序列,在浏览器中使用HTML注释标签,使防御措施在手动侦察中不可见
- 诱饵横幅操纵:在服务横幅(如HTTP Server头)中嵌入隐藏提示,微妙地影响LLM智能体对诱饵服务的优先级排序
- 基于动量的执行触发器(Momentum-based execution triggers):与LLM智能体的利用流程相吻合,而不是使用“忽略所有之前的指令”等传统的对抗性前缀
Open Questions 开放问题
- Can LLM-agents be made robust to defensive prompt injections without losing their general capability to process system responses?
- What is the equilibrium in the arms race between prompt-injection-based defenses and LLM robustness improvements?
- How effective are these defenses against human-in-the-loop attack configurations where a human can recognize decoys?
- Can Mantis-style defenses scale to protect complex enterprise networks with many genuine services alongside decoys?
- What are the legal frameworks needed to enable active defense (hack-back) techniques like agent-counterstrike in practice?
- 能否在不丧失处理系统响应的通用能力的前提下,使LLM智能体对防御性提示词注入具有鲁棒性?
- 基于提示词注入的防御与LLM鲁棒性提升之间的军备竞赛平衡点在哪里?
- 在人类可以识别诱饵的、有人类参与的攻击配置中,这些防御措施的效果如何?
- Mantis风格的防御措施能否扩展到保护具有许多真实服务和诱饵并存的复杂企业网络?
- 在实践中启用积极防御(反击)技术(如智能体反击)需要什么样的法律框架?
Builds On 基于前人工作
- PentestGPT
- AutoPenAgent
- HackingBuddyGPT
- NeuralExec
- LLMmap
- AutoAttacker
Open Source 开源信息
Yes - https://github.com/pasquini-dario/project_mantis