Context Relay for Long-Running Penetration-Testing Agents
Problem & Motivation
LLM-driven autonomous penetration-testing agents struggle with long-duration, multi-stage exploits because the token context window fills with exploration, failed attempts, and terminal output, degrading decision quality over time -- a phenomenon known as 'context rot'.
Real-world penetration testing requires persistence over extended durations, maintaining system knowledge and strategic coherence across multiple phases of exploitation (reconnaissance, exploit attempts, multi-stage attack chains, lateral movement). Empirical studies report degradation beginning at approximately 10-32k tokens depending on the model. OpenAI explicitly identifies context window limitations as a key bottleneck for agentic offensive security applications. No prior work provides reproducible evaluations of context management strategies tailored to agentic offensive security workflows.
Threat Model
Black-box penetration testing scenario: the agent receives only a target IP address and general penetration-testing guidelines. No hints, credentials, or explicit guidance are provided beyond the system prompt. The agent operates autonomously from a Kali Linux container against vulnerable Docker-based target services on a shared network.
Methodology
CHAP (Context Handoff for Autonomous Penetration testing) is a context-relay system that mimics shift-based work. Agents work in rotations: when a context limit is approached or a natural checkpoint is reached, a dedicated summarization agent compresses the current session into a structured 'handoff protocol' capturing the agent's position, network map, approaches tried, exploit attempts, and unexplored attack surface. A fresh agent instance is then initialized with the shared system prompt and accumulated handoff protocols from all previous sessions, maintaining strategic coherence across extended engagements while resetting the context window.
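The relay mechanism can be sketched as a loop over fresh agent instances. In the sketch below, `run_session` and `summarize` are hypothetical stand-ins for the working agent and the dedicated summarization agent, not the paper's actual API:

```python
from typing import Callable, List, Tuple

def run_engagement(
    system_prompt: str,
    run_session: Callable[[List[str]], Tuple[str, bool]],
    summarize: Callable[[str], str],
    max_relays: int = 5,
) -> List[str]:
    """CHAP-style relay loop: each fresh agent is initialized with the
    shared system prompt plus all accumulated handoff protocols."""
    protocols: List[str] = []  # pi_1, pi_2, ..., pi_n
    for _ in range(max_relays):
        context = [system_prompt] + protocols
        session_log, solved = run_session(context)
        if solved:
            break
        # The summarization agent compresses the finished session into a
        # structured handoff protocol before the context window is reset.
        protocols.append(summarize(session_log))
    return protocols

# Toy run: the third session "solves" the challenge.
calls = []
def fake_session(ctx):
    calls.append(len(ctx))
    return f"session-{len(calls)}", len(calls) == 3

protocols = run_engagement("sys", fake_session, lambda log: f"pi({log})")
print(protocols)  # ['pi(session-1)', 'pi(session-2)']
print(calls)      # [1, 2, 3] -- context grows by one protocol per relay
```

Note how the context handed to each session grows only by one compact protocol per relay, rather than by the full transcript of every previous session.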
Architecture
The system has three components: (1) a containerized Docker testbed with a Kali Linux attack container and vulnerable target services on a shared network, (2) an autonomous agent framework with a prompting scheme that instructs the LLM to respond in JSON with reasoning and shell commands, executed via the Docker Python SDK, and (3) the CHAP relay system that invokes a dedicated summarization agent to produce handoff protocols at session boundaries. Each fresh agent starts with the system prompt plus all accumulated prior handoff protocols (pi_1, pi_2, ..., pi_n).
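As a sketch of component (2), the agent loop parses the LLM's JSON reply and forwards the shell command to the attack container. The field names `reasoning` and `command` and the container name are illustrative assumptions, not the paper's exact schema:

```python
import json

def parse_action(llm_reply: str):
    """Extract the model's reasoning and next shell command from the
    JSON-formatted reply the prompting scheme asks for."""
    action = json.loads(llm_reply)
    return action["reasoning"], action["command"]

reply = (
    '{"reasoning": "Port 80 is open; enumerate the web service.",'
    ' "command": "nmap -sV -p80 10.0.0.5"}'
)
reasoning, command = parse_action(reply)
print(command)  # nmap -sV -p80 10.0.0.5

# The command would then run inside the Kali attack container via the
# Docker Python SDK, e.g. (not executed here):
#   client = docker.from_env()
#   kali = client.containers.get("kali-attacker")
#   exit_code, output = kali.exec_run(["/bin/sh", "-c", command])
```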
LLM Models
Tool Integration
Memory Mechanism
conversation-history
Attack Phases Covered
Evaluation
CHAP improved per-run success rate from 27.3% to 36.4% on 11 penetration testing challenges while reducing average token expenditure by 32.4% compared to the baseline agent. At pass@2, both methods converge to 45.5% coverage but solve partially different challenge sets. CHAP averaged 2.41 relays per challenge, with solved challenges averaging only 1.00 relay versus 3.21 for unsolved ones. Successful challenges cost slightly more with CHAP ($0.37 vs $0.27) due to information loss during relay, but unsuccessful challenges cost significantly less ($0.86 vs $1.31).
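Assuming the reported rates are exact fractions of the 11 challenges (an assumption, since only percentages are reported), the headline figures correspond to 3, 4, and 5 solved challenges respectively:

```python
challenges = 11
# (solved count, reported rate in %): baseline per-run, CHAP per-run, pass@2
for solved, reported in [(3, 27.3), (4, 36.4), (5, 45.5)]:
    rate = 100 * solved / challenges
    print(f"{solved}/{challenges} = {rate:.1f}%")  # matches the reported figure
```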
Environment
Metrics
Baseline Comparisons
- Baseline agent (same framework with CHAP disabled)
Scale
11 containerized CTF challenges (extended from AutoPenBench), targeting real-world CVEs across 4 runs total (2 baseline, 2 CHAP)
Contributions
- Introduces CHAP, a context management strategy based on structured handoff protocols designed specifically for autonomous penetration testing agents
- Evaluates how context relay design affects exploit success rate and token cost efficiency compared to a baseline without context management
- Provides fully reproducible results with open-source implementation, datasets, benchmark enhancements, and command logs with LLM reasoning traces
- Extends the AutoPenBench benchmark by patching unintended exploit paths, hardening challenges, and converting them to black-box multi-stage scenarios
Limitations
- The 220-iteration cap limited exploration of truly long-running scenarios; although additional iterations would likely not have yielded significant returns, the cap bounds the evaluation
- 30k token auto-trigger threshold (responsible for 71.7% of relays) was set conservatively and may have triggered compaction prematurely
- Evaluation used a single model (GPT-5.1 Codex mini) with a large 400k context window, potentially mitigating context rot and making CHAP less impactful than it would be for smaller-window models
- Only two runs on eleven single-host challenges limit generalizability; results are preliminary
- No direct comparison against alternative context management strategies such as generic summarization, recursive summarization, selective pruning, or embedding-based compression
- Patching unintended exploit paths does not guarantee all unintended solutions were eliminated or that challenges accurately reflect real-world penetration testing
- Information loss during relay causes subsequent agents to spend approximately 20 additional iterations reorienting before making progress
Research Gaps
- No reproducible benchmarks exist for evaluating context management strategies tailored to agentic offensive security workflows
- Most current CTF benchmarks test isolated exploits rather than sustained multi-host campaigns; reproducible benchmarks for end-to-end penetration testing in realistic network environments are needed
- The offensive security benchmark landscape remains fragmented across independent efforts and lacks systematic consolidation
- Direct comparison of different context management approaches (handoff protocols vs. generic summarization vs. pruning vs. embedding-based compression) for penetration testing agents is unexplored
- Evaluation of context management across diverse models with varying context window sizes is needed
- Multi-host network scenarios and extended attack chains remain unevaluated for context relay approaches
Novel Techniques
- Structured handoff protocols that act as debugger-state snapshots, recording high-signal state information (exact commands, key outputs, paths, sessions, constraints) to let the next agent deterministically reconstruct current state
- Dual relay triggers: agent-initiated relay at natural checkpoints (e.g., gaining foothold, privilege escalation) and automatic relay when context exceeds a token threshold
- Dedicated summarization agent that generates handoff protocols in a structured format with sections for current state, recon/enumeration findings, exploit/foothold status, failed attempts, unexplored surface, and considerations for next agent
- Protocol chaining where each new protocol only adds information not present in previous protocols, avoiding redundancy while building a coherent knowledge chain
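The protocol structure and dual triggers above can be sketched as follows. The dataclass fields mirror the section names described in the text; the class itself, the field types, and the exact threshold semantics are assumptions about the implementation, with the 30k-token figure taken from the limitations discussion:

```python
from dataclasses import dataclass, field
from typing import List

TOKEN_THRESHOLD = 30_000  # automatic relay trigger

@dataclass
class HandoffProtocol:
    current_state: str                 # agent position, open sessions, paths
    recon_findings: List[str] = field(default_factory=list)
    exploit_status: str = ""           # foothold / privilege level reached
    failed_attempts: List[str] = field(default_factory=list)
    unexplored_surface: List[str] = field(default_factory=list)
    next_agent_notes: List[str] = field(default_factory=list)

def should_relay(context_tokens: int, agent_requested: bool) -> bool:
    """Dual triggers: agent-initiated at a natural checkpoint (e.g. after
    gaining a foothold) or automatic once the token threshold is exceeded."""
    return agent_requested or context_tokens > TOKEN_THRESHOLD

print(should_relay(12_000, agent_requested=False))  # False
print(should_relay(31_500, agent_requested=False))  # True
```

Under protocol chaining, each new `HandoffProtocol` would record only findings absent from earlier protocols, so the accumulated chain stays compact.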
Open Questions
- How would CHAP perform with models that have smaller context windows (e.g., 8k-32k tokens) where context rot is more severe?
- Can different models operate in rotation across sessions, potentially combining specialist strengths?
- How does CHAP compare to alternative context management strategies like recursive summarization, selective pruning, or knowledge graphs?
- What is the optimal balance between relay frequency and information loss during compaction?
- Can the handoff protocol format be further optimized to reduce the ~20 iteration reorientation overhead observed in successor agents?
Builds On
- AutoPenBench
- PentestGPT
Open Source
Yes - https://github.com/marvang/chap