#50

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild

Reworr, Dmitrii Volkov

2025 | arXiv (preprint)

system defense fully-autonomous single-agent

PDF Preview 论文预览

Loading PDF... 加载 PDF 中...

Problem & Motivation 问题与动机

LLM-based autonomous hacking agents represent a growing threat to cybersecurity, but limited empirical data exists regarding their actual use in real-world attack scenarios. There is a gap in understanding the current threat landscape and developing effective defenses against AI-driven attacks.

基于大语言模型（LLM）的自主黑客智能体对网络安全构成了日益严重的威胁，但关于它们在真实攻击场景中实际使用的实证数据非常有限。目前在理解当前的威胁格局以及开发针对 AI 驱动攻击的有效防御措施方面存在空白。

Recent developments show LLM-based agents can perform complex offensive security tasks, but evaluations have been limited to controlled environments (e.g., CTFs, benchmarks). This paper fills the gap by providing empirical evidence of LLM-based hacking agents operating in the wild, serving as an early warning system about the current level of AI-driven threats.

最近的发展表明，基于 LLM 的智能体可以执行复杂的进攻性安全任务，但评估仅限于受控环境（如 CTF、基准测试）。本文通过提供在野外运行的基于 LLM 的黑客智能体的实证证据，填补了这一空白，作为关于当前 AI 驱动威胁水平的早期预警系统。

Threat Model 威胁模型

Autonomous LLM-based agents that connect to publicly accessible SSH services and attempt to exploit them. The system distinguishes between three actor types: traditional software bots, LLM-based agents, and human attackers. The assumption is that software bots cannot pass human-like prompt injection tests, and humans cannot respond as quickly as LLMs.

自主的基于 LLM 的智能体连接到公开可访问的 SSH 服务并尝试利用它们。该系统区分了三类行为体：传统软件机器人、基于 LLM 的智能体和人类攻击者。假设软件机器人无法通过类人提示注入（prompt injection）测试，而人类的响应速度无法像 LLM 那样快。

Methodology 核心方法

The authors deploy deliberately vulnerable SSH honeypot servers augmented with embedded prompt injections and time-based analysis techniques to detect and monitor autonomous LLM hacking agents in the wild. The system uses a multi-step detection pipeline: first, prompt injections embedded in banner messages, command outputs, and system files attempt to hijack the agent's goal (e.g., via 'Important Message' attacks); then, timing analysis of response intervals distinguishes LLM agents (which respond in under 1.7 seconds) from human operators (who take significantly longer). Cross-referencing behavioral responses to prompt injections with timing data reduces false positives.

作者部署了故意留有漏洞的 SSH 蜜罐服务器，并增强了嵌入式提示注入和基于时间的分析技术，以检测和监控野外自主的 LLM 黑客智能体。该系统采用多步检测流程：首先，在横幅消息、命令输出和系统文件中嵌入提示注入，试图劫持智能体的目标（例如，通过“重要消息”攻击）；然后，通过对响应间隔的时间分析，将 LLM 智能体（响应时间在 1.7 秒以内）与人类操作员（响应时间显著更长）区分开。通过将对提示注入的行为响应与时间数据进行交叉验证，减少了误报。

Architecture 架构设计

Built on a customized Cowrie SSH honeypot with modifications including a vulnerable SSH version, altered default users, and modified standard outputs. Prompt injections are integrated into SSH banner messages, command outputs (e.g., uname), and accessible system files. The detection pipeline proceeds: attacker connects -> command output with embedded prompt injection (goal hijacking via 'Important Message') -> timing analysis -> if agent responds, attempt prompt stealing to extract system instructions -> re-verification and classification. A public dashboard at https://ai-honeypot.palisaderesearch.org/ provides real-time metrics.

基于定制的 Cowrie SSH 蜜罐构建，修改包括易受攻击的 SSH 版本、修改默认用户和标准输出。提示注入被集成到 SSH 横幅消息、命令输出（如 uname）和可访问的系统文件中。检测流程如下：攻击者连接 -> 带有嵌入式提示注入的命令输出（通过“重要消息”进行目标劫持） -> 时间分析 -> 如果智能体有响应，尝试进行提示窃取以提取系统指令 -> 重新验证和分类。公开发布的仪表盘（https://ai-honeypot.palisaderesearch.org/）提供实时指标。

LLM Models 使用的大模型

Tool Integration 工具集成

Memory Mechanism 记忆机制

none

Attack Phases Covered 覆盖的攻击阶段

reconnaissance

scanning

enumeration

exploitation

post exploitation

privilege escalation

lateral movement

reporting

Evaluation 评估结果

Over a three-month public deployment, the honeypot recorded 8,130,731 interaction attempts and identified 8 potential AI-driven hacking attempts. The 'Important Message' prompt injection technique achieved 100% success rate for goal hijacking against both ReAct and CoT agent frameworks using GPT-4o. LLM agents responded within 1.7 seconds on average, while human attackers took over 10 seconds, validating timing analysis as a discrimination method.

在为期三个月的公开部署中，蜜罐记录了 8,130,731 次交互尝试，并识别出 8 个潜在的 AI 驱动的黑客尝试。“重要消息”提示注入技术对 GPT-4o 的 ReAct 和 CoT 智能体框架都实现了 100% 的目标劫持成功率。LLM 智能体平均在 1.7 秒内响应，而人类攻击者则需要 10 秒以上，验证了时间分析作为一种区分方法的可行性。

Environment 评估环境

Metrics 评估指标

Baseline Comparisons 基准对比

AgentDojo prompt injection findings

Scale 评估规模

10 honeypot IP addresses deployed across various countries, recording 8,130,731 total interactions over ~3 months

Contributions 核心贡献

Deploy deliberately vulnerable SSH honeypot servers with embedded prompt injections specifically designed to detect LLM-based hacking agents in the wild
Implement a multi-step detection methodology combining active behavioral manipulation (prompt injection) and passive timing analysis to distinguish LLM agents from humans and software bots
Provide empirical evidence from a public deployment demonstrating that LLM-based hacking attempts exist in the real world, identifying 8 potential AI-driven attacks among over 8 million interactions

部署了故意留有漏洞的 SSH 蜜罐服务器，其中嵌入了专门设计的提示注入，用于检测野外基于 LLM 的黑客智能体
实现了一种多步检测方法，结合了主动行为操纵（提示注入）和被动时间分析，以区分 LLM 智能体与人类及软件机器人
提供了来自公开部署的实证证据，证明野外确实存在基于 LLM 的黑客尝试，在 800 多万次交互中识别出 8 个潜在的 AI 驱动攻击

Limitations 局限性

Limited internet coverage: the honeypot monitors only a narrow slice of global attack traffic across 10 IPs and cannot generalize to all regions or networks
Blind spots in high-level targets: focused on publicly accessible SSH services, missing attackers targeting specialized, state-level, or closed-off infrastructures
Focus limited to fully autonomous agents; does not detect AI-enhanced tools (e.g., AI-assisted fuzzing) or human-in-the-loop AI usage
Novelty of detection methods: prompt injection and timing analysis techniques are relatively new and have only been tested in a limited deployment; robustness against evasion strategies is unconfirmed
Very small number of detected AI agents (8 potential, 1 confirmed) limits the statistical significance of findings

互联网覆盖范围有限：蜜罐仅监控了 10 个 IP 地址，涵盖全球攻击流量的一小部分，无法推广到所有地区或网络
对高级目标的盲区：侧重于公开访问的 SSH 服务，无法涵盖针对专业、国家级或封闭基础设施的攻击者
焦点仅限于完全自主的智能体；无法检测 AI 增强工具（如 AI 辅助模糊测试）或人机协作（human-in-the-loop）的 AI 使用
检测方法的新颖性：提示注入和时间分析技术相对较新，仅在有限的部署中进行了测试；对抗逃逸策略的稳健性尚未确认
检测到的 AI 智能体数量非常少（8 个潜在，1 个确认），限制了研究结果的统计学意义

Research Gaps 研究空白

Need for larger-scale and longer-duration deployments to capture more AI-driven attacks and build statistically significant datasets
Detection methods need to be tested against adversarial evasion strategies (e.g., agents that intentionally add delays to mimic human timing)
Expanding honeypot attack surfaces beyond SSH to web applications, databases, email services, social media, and industrial control systems
Identifying distinctive behavioral patterns and strategies used by different LLM agent frameworks in real-world attacks
Developing detection for human-in-the-loop and AI-assisted (non-autonomous) attack scenarios

需要更大规模和更长时间的部署，以捕获更多 AI 驱动的攻击并建立具有统计学意义的数据集
检测方法需要针对对抗逃逸策略进行测试（例如，智能体故意增加延迟以模仿人类的时间特征）
将蜜罐攻击面从 SSH 扩展到 Web 应用程序、数据库、电子邮件服务、社交媒体和工业控制系统
识别不同 LLM 智能体框架在真实攻击中使用的独特行为模式和策略
开发针对人机协作和 AI 辅助（非自主）攻击场景的检测方法

Novel Techniques 新颖技术

Using prompt injection as a detection mechanism rather than an attack vector - embedding prompts in honeypot outputs to identify LLM agents
Two-stage prompt injection pipeline: goal hijacking first (via 'Important Message' attack), then prompt stealing to extract agent system instructions
Timing analysis as a distinguishing feature: LLM agents respond in under 1.7 seconds vs. humans taking 10+ seconds
Embedding prompt injections across multiple system touchpoints (banners, command outputs, filesystem) to maximize detection surface

将提示注入作为一种检测机制而非攻击向量——在蜜罐输出中嵌入提示以识别 LLM 智能体
两阶段提示注入流水线：首先劫持目标（通过“重要消息”攻击），然后进行提示窃取以提取智能体的系统指令
将时间分析作为区分特征：LLM 智能体响应时间在 1.7 秒以下，而人类需要 10 秒以上
在多个系统接触点（横幅、命令输出、文件系统）嵌入提示注入，以最大化检测表面

Open Questions 开放问题

Can LLM agents be modified to evade timing-based detection by introducing artificial delays?
What proportion of real-world cyberattacks are currently AI-driven vs. human-driven?
How effective would these detection techniques be against more sophisticated or fine-tuned hacking agents?
Can prompt injection detection be extended to non-SSH protocols and web-based attack surfaces?

LLM 智能体能否通过引入人为延迟来绕过基于时间的检测？
目前真实世界的网络攻击中，AI 驱动与人类驱动的比例是多少？
这些检测技术对更复杂或经过微调的黑客智能体有多大效果？
提示注入检测能否扩展到非 SSH 协议和基于 Web 的攻击面？

Builds On 基于前人工作

Cowrie SSH honeypot
AgentDojo
Advanced Cowrie Configuration
Honeynet Project

Open Source 开源信息

Partial - public dashboard at https://ai-honeypot.palisaderesearch.org/ but honeypot system code not released