ARACNE: An LLM-Based Autonomous Shell Pentesting Agent

Tomas Nieponice, Veronica Valeros, Sebastian Garcia

2025 | arXiv (preprint)

arXiv:2502.18528

Problem & Motivation

Existing LLM-based attack agents such as PenHeal, AutoAttacker, and HackSynth show promising results but suffer from occasional mistakes, hallucinations, rigid architectures that cannot assign specialized models to different tasks, and mandatory summarization components that reduce accuracy.

There is a need for a more modular and flexible autonomous shell pentesting agent that can assign the best-suited LLM to each sub-task (planning, command interpretation, summarization), improve upon prior success rates on standard benchmarks, and provide optional summarization to trade off between context length and accuracy.

Threat Model

The agent operates as a fully autonomous attacker connecting to target systems via SSH. It assumes valid SSH credentials are provided and that the target is a Linux shell environment. Jailbreak prompts are used to bypass LLM guardrails, framing interactions as taking place in a simulated or dummy environment.

Methodology

ARACNE is a multi-LLM autonomous shell pentesting agent with a modular architecture consisting of four key modules: a planner (GPT-o3-mini) that generates attack plans in JSON format, an interpreter (LLaMA 3.1) that translates plan steps into executable Linux shell commands, an optional summarizer (GPT-4o) that compresses attack context to manage context window limits, and a core agent that orchestrates all modules and executes commands on the target via SSH. The planner centralizes all decision-making and includes goal verification logic to determine when to stop. The agent iteratively plans, interprets, executes, and re-plans based on command output.

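
As a concrete illustration, a planner response carrying the steps, goal_verification, and goal_reached fields described above might be consumed like this. Only the three field names come from the paper; the sample plan content and variable names are invented for illustration:

```python
import json

# Hypothetical planner output following the fields named in the paper:
# "steps", "goal_verification", and "goal_reached". The step text and
# verification wording here are made up.
planner_response = """
{
  "steps": [
    "List files in the home directory to locate the flag",
    "Read the flag file"
  ],
  "goal_verification": "Check that the command output contains the flag string",
  "goal_reached": false
}
"""

plan = json.loads(planner_response)

# The core agent keeps iterating until the planner sets goal_reached.
if not plan["goal_reached"]:
    next_step = plan["steps"][0]
    print(next_step)
```

Keeping the plan in strict JSON lets the core agent decide mechanically when to stop, rather than parsing free-form LLM prose.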
Architecture

Modular multi-LLM pipeline: (1) User provides a goal to the Core Agent, (2) Core Agent sends context to the Planner, (3) Planner (GPT-o3-mini) generates an attack plan with steps, goal_verification, and goal_reached fields in JSON, (4) Plan is sent to the Interpreter (LLaMA 3.1) which produces a Linux command, (5) Core Agent executes the command on the target via SSH using Paramiko's invoke_shell(), (6) Command output is stored in a context file and fed back to the Planner. When the optional Summarizer (GPT-4o) is enabled, it compresses the context before passing it to the Planner.

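
The six-step loop above can be sketched as a single orchestration function. In this sketch the three LLMs and the SSH channel are stubbed out with plain Python callables (in the real agent the planner is GPT-o3-mini, the interpreter is LLaMA 3.1, and execution goes through Paramiko's invoke_shell()); all function names below are illustrative assumptions, not the paper's code:

```python
# Minimal sketch of ARACNE's plan -> interpret -> execute -> re-plan loop.
def run_agent(goal, plan_fn, interpret_fn, execute_fn, max_turns=10):
    context = [f"GOAL: {goal}"]                   # the context file, as a list
    for _ in range(max_turns):
        plan = plan_fn("\n".join(context))        # planner: dict with steps / goal_reached
        if plan["goal_reached"]:
            return context
        command = interpret_fn(plan["steps"][0])  # interpreter: one Linux command
        output = execute_fn(command)              # core agent: run on target via SSH
        context.append(f"$ {command}\n{output}")  # output feeds the next planning round
    return context

# Fake components standing in for the LLMs and the SSH target:
def fake_planner(context):
    return {"steps": ["read the flag file"], "goal_reached": "flag{" in context}

def fake_interpreter(step):
    return "cat /home/user/flag.txt"

def fake_shell(command):
    return "flag{example}"

history = run_agent("read the flag", fake_planner, fake_interpreter, fake_shell)
print(history[-1])
```

When the optional summarizer is enabled, the `"\n".join(context)` passed to the planner would first be compressed by a third LLM call.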
LLM Models

GPT-o3-mini, LLaMA 3.1, GPT-4o

Tool Integration

SSH (Paramiko), Linux shell commands

Memory Mechanism

conversation-history

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

ARACNE achieved a 60% success rate against ShelLM (both with and without the summarizer module) across 10 attack goals. Against the Over the Wire Bandit CTF, ARACNE solved 19 of 33 challenges for a 57.58% success rate, a 0.48-percentage-point improvement over the prior state of the art (HackSynth at 57.1%). On successful runs, the average number of actions was under five (3.95 for Bandit, 2.83 for ShelLM without the summarizer).
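
As a sanity check, the Bandit success rate and the gap to HackSynth follow directly from the raw counts reported above:

```python
# Recomputing the reported Bandit CTF figures from the raw counts.
solved, total = 19, 33
success_rate = 100 * solved / total
print(f"{success_rate:.2f}%")  # 57.58%

# HackSynth's reported success rate; the gap is in percentage points.
hacksynth_rate = 57.1
print(f"{success_rate - hacksynth_rate:.2f}")  # 0.48
```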

Environment

ShelLM (LLM-based SSH honeypot), Over the Wire Bandit CTF

Metrics

success-rate, num-actions, num-attempts

Baseline Comparisons

  • HackSynth

Scale

10 attack goals against ShelLM, 33 Bandit CTF challenges

Contributions

  • A new fully autonomous LLM-based agent (ARACNE) designed to interact with real Linux shell environments via SSH.
  • A modular multi-LLM architecture that separates planning (GPT-o3-mini), command generation (LLaMA 3.1), and optional summarization (GPT-4o), improving flexibility by allowing the best model for each task.
  • A 0.48-percentage-point improvement over the state of the art (HackSynth) on the Over the Wire Bandit CTF benchmark.

Limitations

  • Only supports SSH-based access with username/password credentials; cannot handle public key authentication (Bandit challenges 14, 17, 18 were unsolvable).
  • Cannot respond to interactive yes/no prompts during command execution, causing failures on Bandit challenges 28-31.
  • Small evaluation sample size (10 goals for ShelLM, 33 Bandit levels) makes it difficult to draw statistically robust conclusions.
  • The summarizer increases action count by 2.3 on average and reduces accuracy of context, though it does not change overall win rate in the small sample tested.
  • Relies on jailbreak prompts to bypass LLM guardrails, which work approximately 95% of the time but not universally across models.
  • Only evaluated with a single fixed combination of LLM models; other model combinations were not explored.
  • ShelLM is an LLM-simulated shell, not a real system, so results may not fully transfer to real-world environments.
  • The 0.48-percentage-point improvement over HackSynth is marginal and may not be statistically significant at this evaluation scale.

Research Gaps

  • Lack of evaluation on real SSH environments and Docker-based targets rather than simulated shells.
  • No comparison against LLM-based defensive mechanisms like Mantis in adversarial settings.
  • No systematic exploration of which LLM model combinations are optimal for each module in the multi-LLM architecture.
  • No integration with established security tools (Nmap, Metasploit, tcpdump) that could extend the attack surface.
  • Limited understanding of how summarization quality affects long-running attack campaigns.
  • No mechanism for the agent to learn from previous failed attempts within the same challenge.

Novel Techniques

  • Multi-LLM modular architecture that assigns different specialized LLMs to planning, command interpretation, and summarization tasks, allowing cost/capability optimization per component.
  • Optional summarizer module that provides a user-configurable trade-off between context accuracy and attack duration.
  • Built-in goal verification mechanism where the planner generates a verification plan alongside the attack plan to determine task completion.
  • Separation of strategic planning from command generation, allowing a powerful reasoning model for strategy and a lightweight model for command translation.

Open Questions

  • What is the optimal combination of LLM models for each module in a multi-LLM pentesting architecture?
  • How would ARACNE perform against active LLM-based defenses such as Mantis prompt injection?
  • Can the agent be extended to handle interactive shell sessions requiring user input beyond simple command execution?
  • Would fine-tuning smaller open-source models for the planning role match or exceed the performance of GPT-o3-mini?
  • How does performance scale when attacking real systems with more complex, multi-step exploitation chains?
  • Could incorporating a feedback loop from failed attempts improve success rates on repeated challenges?

Builds On

  • AutoAttacker
  • PenHeal
  • HackSynth

Open Source

No
