#27

CybORG: A Gym for the Development of Autonomous Cyber Agents

Maxwell Standen, Martin Lucas, David Bowman, Toby J. Richer, Junae Kim, Damian Marriott

2021 | IJCAI-21 1st International Workshop on Adaptive Cyber Defense (workshop)

arXiv:2108.09118

Problem & Motivation

Autonomous Cyber Operations (ACO) research lacks a unified training environment that combines simulation (for rapid RL training) and emulation (for real-world validation) of adversarial cyber scenarios involving both red and blue team agents.

Existing cyber security environments each satisfy only a subset of ACO research requirements. Some offer simulation but not emulation, others lack scalability, flexibility, or support for adversarial co-evolution. The 'reality gap' between simulated and real environments means agents trained purely in simulation may fail in practice. A gym that bridges simulation and emulation is needed to train effective autonomous cyber agents and validate them on real infrastructure.

Threat Model

Red team (attacker) and blue team (defender) agents operate in an adversarial network environment. The red agent aims to compromise hosts and escalate privileges, while blue agents defend. The environment assumes standard network attack scenarios following the cyber kill chain.

Methodology

CybORG is a modular gym environment for Autonomous Cyber Operations that provides both a finite state machine-based simulator for rapid agent training and an AWS-based emulator for real-world validation. Scenarios are defined in YAML files specifying hosts, networks, subnets, agent actions, and reward structures. Agents interact through an OpenAI Gym-compatible interface, selecting actions and receiving observations and rewards. The system enables training RL agents in simulation and then transferring and testing them in emulated environments with real virtual machines and security tools like Metasploit.
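
The OpenAI Gym-compatible interaction described above can be sketched as a standard reset/step loop. This is a minimal illustration only: the environment class, observation keys, and reward rule below are hypothetical assumptions, not CybORG's actual API.

```python
# Minimal sketch of an OpenAI Gym-style interaction loop, as used to
# train agents in CybORG. All names here (ToyCyberEnv, observation keys,
# reward rule) are illustrative assumptions, not CybORG's real interface.
import random

class ToyCyberEnv:
    """Stand-in environment following the Gym reset/step convention."""

    def __init__(self, num_actions=5, max_steps=20):
        self.num_actions = num_actions
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.steps = 0
        return {"hosts_compromised": 0}

    def step(self, action):
        """Advance one timestep; return (observation, reward, done, info)."""
        self.steps += 1
        obs = {"hosts_compromised": min(self.steps // 5, 3)}
        reward = 1.0 if action == 0 else 0.0  # toy reward signal
        done = self.steps >= self.max_steps
        return obs, reward, done, {}

env = ToyCyberEnv()
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.randrange(env.num_actions)  # random policy for illustration
    obs, reward, done, info = env.step(action)
    total_reward += reward
```

Because both tiers expose the same interface, a loop like this runs unchanged against either the simulator or the emulator.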

Architecture

Two-tier architecture: (1) A simulator that models scenarios as finite state machines with precondition/effect-based actions for fast training; (2) An emulator that deploys virtual infrastructure on AWS using IaaS, with actuator objects connecting to VMs via SSH to execute real security tool commands (Metasploit, Velociraptor). Both tiers share a common YAML-based scenario definition format and the same OpenAI Gym agent interface.
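
The simulator tier's precondition/effect action model can be sketched in a few lines. The state keys and action names below are hypothetical examples, not CybORG's scenario vocabulary.

```python
# Sketch of the finite-state-machine action model used by the simulator
# tier: each action has preconditions on the state and effects that
# update it. State keys and actions are hypothetical illustrations.

class SimAction:
    def __init__(self, name, preconditions, effects):
        self.name = name
        self.preconditions = preconditions  # required state values
        self.effects = effects              # state updates on success

    def applicable(self, state):
        return all(state.get(k) == v for k, v in self.preconditions.items())

    def apply(self, state):
        """Return the successor state; leave it unchanged if preconditions fail."""
        if not self.applicable(state):
            return dict(state)
        successor = dict(state)
        successor.update(self.effects)
        return successor

# Hypothetical red-team actions against a single host.
scan = SimAction("PortScan", {"discovered": False}, {"discovered": True})
exploit = SimAction("Exploit", {"discovered": True}, {"compromised": True})

state = {"discovered": False, "compromised": False}
state = scan.apply(state)     # host becomes discovered
state = exploit.apply(state)  # host becomes compromised
```

In the emulator tier, the same action name would instead be dispatched to an actuator object that runs the corresponding tool command on a VM over SSH.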

Tool Integration

metasploit, velociraptor, meterpreter, ssh, AWS-CLI

Memory Mechanism

belief-state

Attack Phases Covered

reconnaissance
scanning
enumeration
exploitation
post exploitation
privilege escalation
lateral movement
reporting

Evaluation

21 independent RL agents were trained in the simulator for up to 2500 iterations each. When transferred to the emulator, they achieved a 66% overall success rate (139 successful tests out of 210 runs, with each agent tested 10 times). Almost half of the trained agents succeeded on every emulator run, while 4 agents never succeeded, suggesting simulation-to-emulation transfer is feasible but the simulator model needs refinement.

Environment

custom-lab, AWS-emulated-network

Metrics

success-rate, num-steps, convergence-rate

Scale

1 scenario with 3 hosts (Attacker Kali, Gateway Ubuntu 18, Internal Windows 2008) across 2 subnets; 21 independently trained RL agents each tested 10 times in emulation

Contributions

  • Introduction of CybORG, a modular gym environment combining both simulation and emulation modes for training and validating autonomous cyber agents
  • A common OpenAI Gym-compatible interface that allows the same agent code to interact with both simulated and emulated environments
  • YAML-based scenario definition format that specifies both simulation state transitions and emulation system images/commands
  • Demonstration that RL agents trained in simulation can successfully transfer to emulated environments running real security tools (Metasploit) on virtual infrastructure
  • Comprehensive comparison of 12 existing cyber security training environments against ACO research requirements

Limitations

  • Only a single simple scenario (3 hosts, 2 subnets) was evaluated, limiting generalizability claims
  • 34% failure rate when transferring agents from simulation to emulation, indicating significant reality gap issues
  • Some agents overfit to simulation artifacts not present in the emulator (e.g., one agent failed because it learned to bypass the autoroute action based on simulation-specific observation quirks)
  • Blue team agent training is not yet implemented; only red agent actions, sensors, and actuators are available
  • The emulator relies on AWS cloud infrastructure, introducing external dependencies and potential cost/latency issues
  • Work-in-progress status: limited scenario library, limited public availability at time of publication
  • No comparison with other training approaches or baseline agents beyond the single DQN architecture
  • The simulator uses a simplified finite state machine model that may not capture all real-world complexities

Research Gaps

  • No existing environment prior to CybORG combined simulation, emulation, scalability, flexibility, efficiency, adversarial co-evolution support, and RL-readiness in a single platform
  • The reality gap between cyber simulations and real-world environments remains a significant unsolved challenge
  • Blue team autonomous agent development is underexplored compared to red team agent research
  • Methods for systematically reducing simulator-to-emulator transfer failures are lacking
  • Scalability to complex enterprise-scale networks with many hosts and diverse operating systems is unaddressed
  • No standardized benchmarks exist for comparing autonomous cyber operation agents across different environments

Novel Techniques

  • Dual-fidelity environment design: same scenario definition drives both a fast finite-state-machine simulator and a full AWS-based emulator, enabling train-in-simulation then validate-in-emulation workflows
  • YAML-based scenario specification that defines actions twice: once as state transitions (simulation) and once as executable commands (emulation)
  • Actuator object abstraction that connects to VMs via SSH and wraps third-party security tools (Metasploit, Velociraptor) into a unified agent interface
  • Action masking for RL agents where already-determined parameters mask the agent's parameter selection to prevent random or invalid parameter choices
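
The action-masking idea in the last bullet can be illustrated by restricting a parameter choice to values consistent with what is already determined. The masking rule below (only already-discovered hosts are valid targets) is an assumed example, not the paper's exact scheme.

```python
import random

def masked_choice(param_values, valid_mask):
    """Pick a parameter value uniformly among those the mask allows.

    valid_mask[i] is True when param_values[i] is a legal choice given
    the parameters already determined for this action.
    """
    allowed = [v for v, ok in zip(param_values, valid_mask) if ok]
    if not allowed:
        raise ValueError("no valid parameter values under the mask")
    return random.choice(allowed)

# Hypothetical example: the target-host parameter is masked so that only
# hosts the red agent has already discovered can be selected.
hosts = ["gateway", "internal", "attacker"]
discovered = {"gateway"}
mask = [h in discovered for h in hosts]
target = masked_choice(hosts, mask)  # -> "gateway" (the only unmasked value)
```

Masking invalid parameters this way keeps the agent from wasting exploration on actions that can never succeed.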

Open Questions

  • How well does the simulation-to-emulation transfer approach scale to more complex, realistic enterprise network scenarios?
  • Can the reality gap be systematically minimized through automated simulator refinement based on emulator feedback?
  • How would blue team agents perform when co-trained adversarially against red agents in this framework?
  • Could LLM-based agents replace or complement RL agents in CybORG for more generalizable cyber operations?
  • What is the optimal balance between simulation fidelity and training speed for producing effective transferable agents?
  • How can deception techniques (honeypots, decoys) be incorporated into the blue team's action space?

Builds On

  • OpenAI Gym (Brockman et al., 2016)
  • Deep Q-Network (Mnih et al., 2015)
  • DRQN / Deep Recurrent Q-Learning (Hausknecht and Stone, 2015)
  • Cyber Kill Chain (Hutchins et al., 2011)
  • Metasploit Framework
  • Schwartz and Kurniawati, 2019 (Autonomous Penetration Testing using RL)

Open Source

Partial - limited access for reviewers with public release planned after scenario library expansion

Tags