#17

Using LLMs to Automate Threat Intelligence Analysis Workflows in Security Operation Centers

PeiYu Tseng, ZihDwo Yeh, Xushu Dai, Peng Liu

2024 | arXiv (preprint)

system general-cybersecurity fully-autonomous single-agent

Problem & Motivation

Modern SIEM systems cannot relieve SOC analysts from the labor-intensive, repetitive tasks involved in analyzing Cyber Threat Intelligence (CTI) reports written in natural language. Analysts must manually read CTI reports, extract Indicators of Compromise (IOCs), generate RegEx patterns for SIEM correlation rules, and map relationships between IOCs.

CTI analysis is a critical bottleneck in SOC workflows. The threat intelligence market is growing rapidly (projected to reach $21.92 billion by 2028), yet analysts still spend excessive time on repetitive extraction and pattern-generation tasks, which lengthens response time to attacks. Existing ML-based approaches that rely on domain-specific NER models fail to generalize to evolving attack techniques and new named entities.

Threat Model

The system assumes access to publicly available CTI reports (e.g., from FireEye, CrowdStrike, MITRE ATT&CK, Telegram/X). The threat model focuses on defensive SOC operations where analysts need to rapidly process CTI to create SIEM correlation rules. LLM factual errors are treated as a primary reliability concern.

Methodology

The paper proposes an 8-step AI agent pipeline that automates CTI report analysis. The agent first extracts IOCs from CTI paragraphs using an LLM, then purifies results via majority voting and RAG-based filtering against a domain knowledge vector database. It identifies capture vs. non-capture groups in IOC strings using retrieval-augmented matching, generates RegEx with iterative LLM-plus-tester refinement, extracts inter-IOC relationships via noun-verb pair analysis, maps relationship verbs to standardized categories, verifies relationships against predefined rules, and finally constructs a relationship graph.

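The majority-voting half of the purification step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the strict-majority threshold, and the sample IOC values are all assumptions made for the example, and the RAG filtering stage is left out.

```python
from collections import Counter


def majority_vote(runs, min_fraction=0.5):
    """Keep IOC candidates that appear in more than `min_fraction` of the runs.

    `runs` holds one collection of candidate strings per LLM extraction pass
    over the same paragraph. The threshold is an assumed value.
    """
    counts = Counter()
    for run in runs:
        # De-duplicate within a run so a single pass cannot vote twice.
        counts.update({ioc.strip() for ioc in run})
    needed = len(runs) * min_fraction
    return {ioc for ioc, n in counts.items() if n > needed}


# Three hypothetical extraction passes over the same paragraph.
runs = [
    ["evil.example.com", "198.51.100.7", "update.exe"],
    ["evil.example.com", "198.51.100.7"],
    ["evil.example.com", "update.exe", "Windows"],  # "Windows" is a spurious hit
]
kept = majority_vote(runs)
```

With three passes and a strict majority, a candidate needs at least two votes, so the one-off `"Windows"` hit is dropped while the other three candidates survive. Raising `min_fraction` trades missed IOCs for fewer false positives, which is the tradeoff the summary notes the paper does not quantify.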

Architecture

An 8-step sequential pipeline: (1) IOC extraction via paragraph-level LLM prompting, (2) purification via majority voting + RAG filtering, (3) capture group finding via retrieval-augmented matching, (4) RegEx generation with automated RegEx tester feedback loop, (5) relationship extraction via LLM noun-verb identification, (6) relationship mapping via predefined verb-to-category table, (7) relationship verification against predefined rules, (8) relationship graph construction.

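Steps 6 to 8 of the pipeline (mapping, verification, graph construction) are simple enough to sketch without an LLM. The table entries, the IOC type labels, and the function name `build_graph` below are hypothetical; only the example rule that a registry key cannot have a "create" relationship to a file is taken from this summary.

```python
from collections import defaultdict

# Hypothetical verb-to-category table (step 6); the paper's table is not reproduced here.
VERB_TO_CATEGORY = {
    "drops": "create",
    "writes": "create",
    "creates": "create",
    "connects to": "communicate",
    "downloads": "download",
    "executes": "execute",
}

# Hypothetical rule set (step 7): (source type, category, target type) triples to reject.
FORBIDDEN = {("registry_key", "create", "file")}


def build_graph(triples, ioc_types):
    """Map verbs to categories, drop rule violations, return an adjacency map (step 8)."""
    graph = defaultdict(list)
    for source, verb, target in triples:
        category = VERB_TO_CATEGORY.get(verb.lower())
        if category is None:
            continue  # verb is not in the table, so it has no standardized category
        if (ioc_types[source], category, ioc_types[target]) in FORBIDDEN:
            continue  # relationship fails verification
        graph[source].append((category, target))
    return dict(graph)


ioc_types = {
    "loader.exe": "file",
    "payload.dll": "file",
    "HKCU\\Software\\Run\\upd": "registry_key",
    "evil.example.com": "domain",
}
triples = [
    ("loader.exe", "drops", "payload.dll"),
    ("loader.exe", "connects to", "evil.example.com"),
    ("HKCU\\Software\\Run\\upd", "creates", "payload.dll"),  # rejected by the rule
    ("loader.exe", "obfuscates", "payload.dll"),  # verb missing from the table
]
graph = build_graph(triples, ioc_types)
```

The last triple shows the generalization concern raised under Limitations: any verb outside the predefined table is silently lost unless the table is extended.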

LLM Models

GPT-4

Tool Integration

Memory Mechanism

RAG

Attack Phases Covered

reconnaissance

scanning

enumeration

exploitation

post exploitation

privilege escalation

lateral movement

reporting

Evaluation

Evaluated on 50+ CTI reports, the LLM identified over 2,900 potential IOCs, of which approximately 2,300 remained valid after purification (filenames, domain names, hash values, IP addresses, command lines, and registry keys and values). Hash values, IP addresses, and domain names made up 70% of the valid IOCs. The agent generated approximately 2,200 RegEx patterns and missed only 3% of the IOCs in the manually identified ground truth.

Environment

Metrics

Scale

50+ CTI reports, 2,900+ potential IOCs identified

Contributions

A novel AI agent that automates extraction of important information from CTI reports and generates RegEx patterns for SIEM correlation rules without human intervention
A four-step purification process (majority voting + RAG filtering) to address LLM factual errors and reduce false positives/negatives in IOC extraction
A retrieval-augmented matching mechanism to distinguish capture groups from non-capture groups in IOC strings for accurate RegEx generation
Automated relationship graph construction depicting dependencies between IOCs within a CTI report
Claimed to be the first work to use LLM capabilities to make CTI analysis workflows substantially more automated end to end

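To make the capture-group contribution concrete, the sketch below builds a pattern in which a fixed system path sits in a non-capturing group and the attacker-controlled filename sits in a capturing group. The helper `build_pattern`, the character class for the variable part, and the sample command line are assumptions for illustration; the paper decides which part is fixed through retrieval-augmented matching, which is not modeled here.

```python
import re


def build_pattern(fixed_prefix, variable_class=r"[^\\\s]+"):
    """Put the escaped fixed part in a non-capturing group and capture the rest."""
    return re.compile(
        r"(?:" + re.escape(fixed_prefix) + r")(" + variable_class + r")",
        re.IGNORECASE,  # Windows paths are case-insensitive
    )


# The fixed part is a well-known system directory; the filename is attacker-chosen.
pattern = build_pattern("C:\\Windows\\Temp\\")
match = pattern.search(r"cmd.exe /c C:\Windows\Temp\svch0st.exe -silent")
captured = match.group(1)
```

Because only the filename is captured, the same pattern still matches when the attacker renames the dropped file, while paths outside the fixed directory do not match at all.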

Limitations

No comparison with existing baseline systems or alternative approaches; evaluation lacks rigor in demonstrating relative improvement
Evaluation metrics are limited: no precision/recall/F1 is reported for IOC extraction, only a 3% miss rate against manual ground truth
Relies exclusively on GPT-4 with no evaluation of alternative or open-source LLMs, making cost and reproducibility unclear
The relationship mapping uses a predefined verb-to-category table, which may not generalize to novel or unconventional CTI language
The relationship verification uses predefined rules (e.g., registry keys cannot have 'create' relationship to files), which requires manual rule curation and may miss edge cases
No analysis of computational cost, latency, or API expenses for processing CTI reports at scale
The majority voting mechanism requires multiple LLM calls per paragraph, multiplying cost and latency without quantifying the tradeoff
Scope limited to IOC extraction and RegEx generation; does not address broader SOC tasks like incident response or threat hunting

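The metrics that the second limitation describes as missing could be computed from a set of extracted IOCs and a set of ground-truth IOCs as sketched below. The function name and the sample sets are hypothetical, and exact string equality is assumed as the matching criterion, which the paper may not use.

```python
def extraction_scores(predicted, ground_truth):
    """Return (precision, recall, F1) for exact-match IOC extraction."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sets: three correct extractions, one false positive, one miss.
predicted = {"evil.example.com", "198.51.100.7", "update.exe", "Windows"}
ground_truth = {"evil.example.com", "198.51.100.7", "update.exe", "HKCU\\Run\\upd"}
precision, recall, f1 = extraction_scores(predicted, ground_truth)
```

Under this definition the miss rate is `1 - recall`, so a 3% miss rate would correspond to a recall of about 0.97 and says nothing on its own about precision.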

Research Gaps

No existing work combines LLM-based IOC extraction with automated RegEx generation and relationship graph construction end-to-end
Domain-specific NER models for CTI analysis fail to generalize to evolving attack techniques and new named entities
Existing approaches to extract 'nuggets' from security documents are inadequate in terms of generalization
No established benchmarks or datasets for evaluating automated CTI-to-SIEM-rule pipelines
The paper does not address how the system handles adversarial or deceptive CTI reports

Novel Techniques

Paragraph-level LLM prompting for IOC extraction instead of feeding entire CTI reports, reducing hallucination and improving focus
Majority voting across multiple LLM runs to filter factual errors in extracted IOCs
RAG-based purification using a vector database of OS documentation (Windows/Linux) to validate LLM-extracted filenames and command lines
Retrieval-augmented matching to distinguish capture groups (attacker-modifiable parts) from non-capture groups (fixed system paths/commands) in IOC strings for RegEx generation
Iterative RegEx refinement loop where a RegEx tester provides feedback to the LLM until valid patterns are produced
Noun-verb pair extraction and pronoun resolution for automated relationship identification between IOCs

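The iterative RegEx refinement loop listed above can be sketched as a tester that returns failure messages and a driver that feeds them back to a pattern generator. Everything here is an assumption for illustration: `check_pattern`, `refine`, the round limit, and `fake_llm`, which replays canned candidates in place of a real LLM call.

```python
import re


def check_pattern(pattern, should_match, should_not_match):
    """Return a list of failure messages; an empty list means the pattern passes."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        return [f"does not compile: {exc}"]
    failures = [f"missed: {s}" for s in should_match if not compiled.fullmatch(s)]
    failures += [f"false match: {s}" for s in should_not_match if compiled.fullmatch(s)]
    return failures


def refine(propose, should_match, should_not_match, max_rounds=5):
    """Ask `propose` for patterns until one passes the tester or rounds run out."""
    feedback = []
    for _ in range(max_rounds):
        pattern = propose(feedback)  # in the real pipeline this would prompt the LLM
        feedback = check_pattern(pattern, should_match, should_not_match)
        if not feedback:
            return pattern
    return None


# Stand-in for the LLM: an unbalanced pattern, an over-broad one, then a good one.
candidates = iter([r"([0-9a-f]{32}", r"[0-9a-f]+", r"[0-9a-f]{32}"])


def fake_llm(feedback):
    return next(candidates)


md5 = "d41d8cd98f00b204e9800998ecf8427e"
result = refine(fake_llm, should_match=[md5], should_not_match=["d41d8cd9"])
```

The first candidate is rejected because it does not compile and the second because it also matches a truncated hash, so the loop settles on the third. A bounded `max_rounds` keeps cost predictable, at the price of sometimes returning no pattern.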

Open Questions

How well does this approach generalize to CTI reports in languages other than English?
What is the actual precision/recall tradeoff of the majority voting threshold, and how sensitive is performance to this parameter?
Can the pipeline handle CTI reports describing zero-day or previously unseen attack techniques where the vector database lacks relevant entries?
How does the system perform with smaller or open-source LLMs compared to GPT-4?
Could the relationship graph be integrated with existing threat intelligence platforms (e.g., MISP, OpenCTI) for operational use?
What happens when CTI reports contain deliberately misleading or ambiguous information?

Builds On

RAG
MITRE-ATT&CK
TTPDrill
TINKER