HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing
Problem & Motivation
Current LLM-based penetration testing agents lack transparency in their decision-making processes, and there is no standardized benchmark for evaluating them. Existing tools either require human intervention (PentestGPT, HackingBuddyGPT) or are constrained to specific frameworks like Metasploit (AutoAttacker), limiting their autonomy and generalizability.
As LLM-based autonomous hacking agents become more capable, there is a critical gap in understanding their underlying mechanisms, decision-making processes, and potential vulnerabilities. Without standardized benchmarks and transparent architectures, it is impossible to predict agent behaviors in complex real-world scenarios or assess their safety. The paper fills this gap by providing both an open, analyzable agent architecture and reproducible CTF-based benchmarks.
Threat Model
The agent operates in a containerized Kali Linux environment with firewall rules restricting network access solely to designated target machines. The primary threats considered are: (1) the agent hallucinating target IP addresses and scanning out-of-scope hosts, (2) the agent executing destructive commands on the host system, and (3) the agent potentially circumventing firewall rules if it gains sufficient privileges inside the container.
Methodology
HackSynth uses a dual-module architecture consisting of a Planner and a Summarizer, both powered by LLMs. The Planner generates one executable bash command per iteration based on the current summarized history. The command is executed in a containerized Kali Linux environment, and its output is forwarded to the Summarizer, which maintains a concise running summary of all past actions and observations. This feedback loop iterates until the flag is captured or a maximum iteration limit (20 steps) is reached. The architecture is intentionally simple to enable thorough analysis of core LLM parameters like temperature and top-p.
Architecture
Two LLM-based modules form a feedback loop: (1) Planner module - receives the summarized history via a dynamically populated {summarized_history} placeholder in its prompt and generates a single terminal command wrapped in <CMD></CMD> tags; (2) Summarizer module - receives the previous summary and new command output (truncated to a configurable 'new observation window size') and produces an updated summary. Commands execute in an isolated containerized Kali Linux environment with firewall restrictions. A cron job resets firewall rules before each command execution to prevent the agent from overriding security controls.
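The Planner-Summarizer loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt templates, the 60-second timeout, and the flag-detection check are invented assumptions; only the `<CMD>` tag convention, the `{summarized_history}` placeholder, the observation window truncation, and the 20-step cap come from the paper.

```python
import re
import subprocess

CMD_RE = re.compile(r"<CMD>(.*?)</CMD>", re.DOTALL)

def run_agent(llm, planner_prompt, summarizer_prompt, max_steps=20, window=500):
    """Minimal sketch of the Planner-Summarizer feedback loop."""
    summary = ""
    for _ in range(max_steps):
        # Planner: fill the {summarized_history} placeholder, ask for one command.
        plan = llm(planner_prompt.format(summarized_history=summary))
        match = CMD_RE.search(plan)
        if not match:
            continue  # no well-formed <CMD>...</CMD> this iteration
        command = match.group(1).strip()

        # Execute in the (assumed containerized) environment; timeout is illustrative.
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        # Truncate to the observation window before it reaches the Summarizer.
        observation = (result.stdout + result.stderr)[:window]

        # Summarizer: fold the new observation into the running summary.
        summary = llm(summarizer_prompt.format(previous_summary=summary,
                                               new_observation=observation))
        if "flag{" in observation.lower():  # illustrative success check
            return command, summary
    return None, summary
```

In a real deployment the `llm` callable would wrap one of the evaluated models, and the success check would compare against the benchmark's dynamically generated flag rather than a fixed substring.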
LLM Models
Tool Integration
Memory Mechanism
Summarized conversation history (a concise running summary maintained by the Summarizer module)
Attack Phases Covered
Evaluation
GPT-4o achieved the best performance, solving 34.2% of PicoCTF (41/120) and 40% of OverTheWire (32/80) challenges. Among open-source models, Llama-3.1-70B was strongest on PicoCTF (22.5%) and Qwen2-72B performed well on OverTheWire (25%). Temperature values at or below 1 are optimal; above 1.6, error rates spike and environments become unusable. The observation window size of 250 characters was optimal for PicoCTF and 500 for OverTheWire. Enabling sampling increased performance by 38%, while prompt-chaining decreased it by 5%.
Environment
Metrics
Baseline Comparisons
- GPT-4o
- GPT-4o-mini
- Llama-3.1-8B
- Llama-3.1-70B
- Qwen2-72B
- Mixtral-8x7B
- Phi-3-mini
- Phi-3.5-MoE
Scale
200 CTF challenges total: 120 from PicoCTF (6 categories, 3 difficulty levels) and 80 from OverTheWire (4 wargames: Bandit, Natas, Leviathan, Krypton)
Contributions
- HackSynth: an autonomous LLM-based penetration testing agent with a transparent dual-module (Planner + Summarizer) architecture that operates without human intervention
- Two new standardized CTF-based benchmarks (PicoCTF with 120 challenges, OverTheWire with 80 challenges) with dynamic solver scripts that prevent memorization of flags, publicly available for the research community
- Extensive parameter optimization study analyzing the effects of temperature, top-p, and observation window size on agent performance and reliability
- Safety and behavioral analysis including documentation of dangerous emergent behaviors (IP hallucination, environment destabilization, sandbox escape attempts) and evaluation of command usage patterns across models
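The dynamic-solver idea behind the benchmark contribution can be illustrated with a toy challenge. Everything here is invented for illustration (the ROT13 task, the flag format, and the function names); the point it demonstrates is from the paper: the expected flag is recomputed from live challenge data at evaluation time, so a flag memorized from training data never matches.

```python
def solve_rot13_challenge(ciphertext: str) -> str:
    """Illustrative dynamic solver: derives the current flag from the live
    challenge data instead of returning a hard-coded string."""
    plaintext = "".join(
        chr((ord(c) - ord("a") + 13) % 26 + ord("a")) if c.islower() else c
        for c in ciphertext
    )
    return f"picoCTF{{{plaintext}}}"

def check_agent_flag(agent_flag: str, live_ciphertext: str) -> bool:
    # Recompute the expected flag per run, so memorized static flags fail.
    return agent_flag == solve_rot13_challenge(live_ciphertext)
```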
Limitations
- The agent is limited to non-interactive command-line operations, preventing it from using interactive tools like text editors, debuggers with interactive modes, or GUI-based applications
- The agent tends to fixate on initial strategies ('rabbit hole' effect), where early problem-solving steps strongly influence subsequent actions, sometimes leading to repetitive ineffective attempts
- No support for challenges requiring personalized instanced environments or CAPTCHAs, limiting benchmark coverage
- The firewall-based security model is acknowledged as potentially circumventable - the agent could reset firewall rules, route attacks through internet-connected target machines, or gain sufficient container privileges to override restrictions
- Binary exploitation challenges were unsolved by all models (0% across both benchmarks), indicating a fundamental limitation in LLM-based approaches for low-level memory exploitation
- The architecture lacks specialized modules for visual analysis, internet search for known exploits, or interactive terminal emulation, all of which could improve performance
- Smaller open-source models (Phi-3-mini, Phi-3.5-MoE) derived limited benefit from additional iterative cycles, frequently getting trapped in repetitive solution attempts
Research Gaps
- No robust safety framework exists for deploying autonomous hacking agents - current containment mechanisms (containers, firewalls) are insufficient and can be circumvented
- Lack of standardized, reproducible benchmarks for LLM-based pentesting agents prior to this work - existing benchmarks have static flags that can be memorized during training
- The relationship between LLM sampling parameters (temperature, top-p) and pentesting agent performance is poorly understood, with implications for both effectiveness and safety
- No existing work thoroughly analyzes the behavioral patterns and emergent risks of autonomous hacking agents at the command-level
- Fine-tuning LLMs specifically for penetration testing tasks remains unexplored - using distillation from larger models or RLHF with cybersecurity expert feedback
- Multi-modal capabilities (visual analysis of screenshots, graphical challenge elements) are absent from current CTF-solving agents
Novel Techniques
- Dual-module Planner-Summarizer architecture that separates command generation from context maintenance, enabling transparent analysis of agent decision-making
- New observation window size parameter that truncates command output to a fixed character limit before passing to the Summarizer, balancing information retention with context pollution
- Dynamic solver scripts for CTF benchmarks that programmatically solve challenges and return current flags, preventing benchmark contamination through LLM training data memorization
- Cron-job-based firewall rule enforcement that resets security rules before every command execution, providing a defense-in-depth approach against agent privilege escalation
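The cron-driven firewall reset can be sketched as a small rule builder. The allowlist, chain choice, and exact iptables policy below are assumptions for illustration, not the paper's configuration; what the sketch conveys is the defense-in-depth idea: flush and re-apply a strict default-deny allowlist before every command, so any rules the agent added are undone.

```python
import subprocess

def build_firewall_rules(targets):
    """Build iptables commands for a strict egress allowlist (illustrative)."""
    rules = [
        ["iptables", "-F", "OUTPUT"],          # flush any agent-added rules
        ["iptables", "-P", "OUTPUT", "DROP"],  # default-deny all egress
    ]
    rules += [["iptables", "-A", "OUTPUT", "-d", ip, "-j", "ACCEPT"]
              for ip in targets]               # re-allow only designated targets
    return rules

def reset_firewall(targets):
    """Invoked from the cron job before each command execution."""
    for rule in build_firewall_rules(targets):
        subprocess.run(rule, check=True)
```

As the Limitations section notes, this containment is acknowledged as circumventable: an agent with sufficient container privileges could run the same iptables binary itself, which is why the reset fires before every command rather than once at startup.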
Open Questions
- How can autonomous hacking agents be made truly safe when the very capabilities that make them useful (command execution, network access) are also the source of risk?
- Why do all tested LLMs completely fail at binary exploitation challenges, and can specialized training or tool integration address this?
- What is the optimal architecture for balancing agent autonomy with safety constraints - is the simple Planner-Summarizer design inherently limited compared to multi-agent approaches?
- Can fine-tuning on pentesting-specific data (via distillation or RLHF) significantly improve smaller models to match or exceed GPT-4o performance at lower cost?
- How would HackSynth perform in live CTF competitions against human participants, and what would this reveal about the current capability gap?
- Does the observation that one LLM tends to dominate across all categories (rather than models specializing) hold for more complex, real-world pentesting scenarios?
Builds On
- PentestGPT
- HackingBuddyGPT
- AutoAttacker
- Enigma
- Cybench
- NYU CTF agent
- SWE-agent
- Intercode CTF benchmark
Open Source
Yes - https://github.com/aielte-research/HackSynth