Building an AI SOC Agent: A Project Overview

Why I Started This

I went to a BSidesSLC workshop on building an AI-driven SOC agent, run by Cliff Crosland. It used a pre-built API and a scaffolded workflow. I walked out with the parts of it that actually mattered to me — the idea that an LLM, given the right tools and structure, could do the first pass of work that a Tier 1 analyst spends most of their day on. Triaging alerts. Pulling context. Writing up findings.

I wanted to build my own version, end-to-end. Not because the workshop version was wrong — it was a good starting point — but because I learn by stripping things down to parts I have to assemble myself. So I decided to write my own MCP server instead of using the Scanner API, design my own pipeline, and stage the build in phases I could actually finish.

This post is the overview. A companion post goes deep on Phase 1, which is already done.

Before I Wrote Any Code

The first week of this project was almost entirely research and planning. No Python, no prompts — just trying to understand what I was actually signing up for.

I worked through Anthropic's intro course on MCP and reinforced the weaker parts with short videos until the protocol clicked (certificate). I spent time on mcpmarket.com's leaderboards reading what the top-rated servers were doing, which is where I first saw the MISP threat intel MCP and started thinking about enrichment as a replaceable module rather than a built-in. I also read through the FastMCP / Python SDK source to understand how a real implementation handles transport and tool registration.

I kept a running list of questions I didn't have good answers to yet:

Is there a point to building your own MCP from scratch, or should I just fork a good one and modify it?
How do I balance tight-loop development (Claude Code driving every small change) against wider delegation (agent teams handling big chunks)?
At what layer of an agent architecture should per-task model selection live?
Elasticsearch vs PostgreSQL for the data layer?
Sub-agents versus agent teams — when does each one actually pay off?

Most of those questions didn't have clean answers. What I got from asking them was a mental map of the tradeoff space, which turned out to be more useful than a decision. I landed on a mental model for agent architecture that I kept coming back to: user → orchestrator → agents → tools/skills → memory → observability/guardrails. Every design choice I've made since has been about which layer I'm actually working on.

The planning itself was also an exercise in working with AI. I used ChatGPT to refine the project spec and pressure-test my thinking, then used those refined prompts to drive Claude Code during implementation. The planning output from one model became the input for another. That loop — plan with one, build with another, reflect in writing — is the actual workflow I'm trying to internalize from this project, not just the SOC piece.

I also committed to a spec-first discipline using OpenSpec. Before any code got written, I seeded openspec/specs/ with separate spec files for each service — system, MCP server, agent, Discord bot, data, and infrastructure. Each phase then moves through a tight cycle: propose the change, generate design and task artifacts, apply the tasks, verify, and archive the deltas back into the canonical specs. That structure is what made it possible to think in phases at all instead of everything collapsing into one undifferentiated build.

The Problem Space

Security Operations Centers are drowning in signal. A Tier 1 analyst can face hundreds of alerts in a shift, and most of the work is repetitive: check the source IP's reputation, look up the user, see if this has fired before, decide if it's noise. That repetition is exactly the shape of work LLMs are good at, if you give them the right context and scaffold the reasoning.

The tricky part isn't the AI. It's designing the surface the AI reasons against. Tools that return too much data waste tokens and confuse the model. Tools that return too little force the model to guess. Free-form agents hallucinate. Over-scripted workflows kill the reasoning that makes LLMs useful in the first place. The whole project is an exercise in finding the right level of structure.
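One way to picture the "too much versus too little" tradeoff is a tool that returns a capped, summarized view of matching events instead of the raw log. This is only a sketch, not the project's actual tool: the function name `summarize_events`, the event fields (`timestamp`, `source_ip`, `action`), and the `max_events` default are all assumptions made for illustration.

```python
from collections import Counter
from typing import Any


def summarize_events(events: list[dict[str, Any]], max_events: int = 5) -> dict[str, Any]:
    """Return a bounded view of events: total count, top source IPs, small sample.

    Hypothetical helper. Field names are assumed, and timestamps are assumed
    to be ISO 8601 strings so that string ordering matches time ordering.
    """
    # Count events per source IP so the model sees the distribution, not every row.
    ip_counts = Counter(e.get("source_ip", "unknown") for e in events)

    # Keep only the newest few events, trimmed to the fields a triage step needs.
    newest = sorted(events, key=lambda e: e.get("timestamp", ""), reverse=True)[:max_events]
    sample = [{k: e.get(k) for k in ("timestamp", "source_ip", "action")} for e in newest]

    return {
        "total": len(events),
        "top_source_ips": ip_counts.most_common(3),
        "sample": sample,
        # Tell the model explicitly that it is looking at a subset.
        "truncated": len(events) > max_events,
    }
```

The `truncated` flag is the part worth noticing: the model gets a small payload but is told that more data exists, so it can ask for a narrower query instead of guessing.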

The Vision

An AI-powered SOC simulation that ingests semi-realistic security telemetry, exposes it through a custom Model Context Protocol (MCP) server, and drives a structured agent workflow through triage, investigation, and reporting — eventually posting the results into a Discord channel an analyst can actually interact with.

The design principle I keep coming back to: the system controls the workflow, the model controls the reasoning within each step. Scripted scaffolding with open-ended reasoning inside the lanes. That gives me predictability at the system level without sacrificing flexibility at the task level.
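That split can be sketched in a few lines. The code below is an illustration under assumptions, not the real agent: the step names are borrowed from the agent description in this post (the exact six steps are covered in the Phase 1 deep dive), and `fake_reasoner` is a hypothetical stand-in for an LLM call so the example runs on its own.

```python
from typing import Callable

# Fixed step order owned by the system. Names are illustrative only.
STEPS = ["classify", "select_runbook", "investigate", "assess", "report"]

# A reasoner stands in for the model: it receives the step name and the context
# gathered so far, and returns free-form output for that one step.
Reasoner = Callable[[str, dict[str, str]], str]


def run_pipeline(alert: str, reason: Reasoner) -> dict[str, str]:
    """Run every step in order; the model never chooses which step comes next."""
    context: dict[str, str] = {"alert": alert}
    for step in STEPS:
        # Pass a copy so a step can read earlier results but not rewrite them.
        context[step] = reason(step, dict(context))
    return context


def fake_reasoner(step: str, context: dict[str, str]) -> str:
    """Deterministic placeholder for a model call (assumption for this sketch)."""
    return f"{step} output based on {len(context)} prior fields"


if __name__ == "__main__":
    result = run_pipeline("Suspicious console login", fake_reasoner)
    for step in STEPS:
        print(f"{step}: {result[step]}")
```

The loop is the "system controls the workflow" half; whatever happens inside `reason` is the "model controls the reasoning" half.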

High-Level Architecture

There are six layers in the target design:

Data layer — a mini-SIEM with CloudTrail-style, Sysmon-style, and auth logs, plus derived alerts. File-based to start, database-backed later.
MCP server — a custom-built tool layer that exposes alerts, events, entity context, enrichment, and runbooks to the agent.
AI agent — an LLM-driven pipeline that classifies, selects a runbook, investigates, assesses, and writes up findings.
Runbooks — Markdown playbooks mapped to alert types, guiding the model's investigation.
Discord bot — the analyst-facing interface for triggering investigations and reading results.
Enrichment layer — external threat intel (mocked early on with my own MCP server deployed in the cloud; I plan to evolve it to use MISP, https://github.com/MISP/MISP).

graph TB
    user["Analyst<br/>triggers investigations, reads reports"]
    discord["Discord Bot<br/>analyst-facing surface"]
    agent["AI Agent<br/>six-step pipeline"]
    mcp["MCP Server<br/>six tools over stdio"]
    data["Data Layer<br/>CloudTrail, Sysmon, auth logs, alerts"]
    runbooks["Runbooks<br/>markdown playbooks"]
    enrich["Enrichment<br/>threat intel MCP"]

    user --> discord
    discord --> agent
    runbooks -.->|guides| agent
    enrich -.->|external intel| agent
    agent --> mcp
    mcp --> data

    classDef core fill:#e5dbff,stroke:#5f3dc4,stroke-width:2px,color:#1e1b4b
    classDef layer fill:#e7f5ff,stroke:#1971c2,color:#0c1a33
    classDef side fill:#f8f9fa,stroke:#868e96,stroke-dasharray:4 3,color:#1f2937
    class agent core
    class user,discord,mcp,data layer
    class runbooks,enrich side
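To make the runbook layer concrete, here is a minimal sketch of mapping an alert type to a Markdown playbook with a generic fallback. Everything specific in it is hypothetical: the alert type names, the file paths, the `type` field on the alert, and the `select_runbook` function are assumptions for illustration rather than the project's real layout.

```python
from pathlib import Path
from typing import Any

# Hypothetical mapping from alert type to a Markdown runbook file.
RUNBOOKS: dict[str, str] = {
    "impossible_travel": "runbooks/impossible_travel.md",
    "brute_force_login": "runbooks/brute_force_login.md",
}

# Fallback used when no specific runbook matches the alert type.
DEFAULT_RUNBOOK = "runbooks/generic_triage.md"


def select_runbook(alert: dict[str, Any], base_dir: Path = Path(".")) -> Path:
    """Pick the runbook path for an alert, falling back to a generic playbook."""
    # Normalize the type so "Brute_Force_Login " and "brute_force_login" match.
    alert_type = str(alert.get("type", "")).strip().lower()
    return base_dir / RUNBOOKS.get(alert_type, DEFAULT_RUNBOOK)
```

Keeping the lookup deterministic means an unknown alert type still gets a playbook, and the model is never asked to invent its own investigation procedure.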

Phases

I'm intentionally breaking this into phases I can actually ship, rather than boiling the ocean. Each phase produces something runnable.

Phase 1 — Local MVP. JSON-backed data, custom MCP server, a six-step agent pipeline, console output. End-to-end alert → report loop on one machine. (Done. Deep dive linked below.)
Phase 2 — Investigation depth. Real threat intel enrichment, an evaluation harness for prompt quality, retry and timeout handling, better subprocess management.
Phase 3 — Discord integration. Replace the CLI with a bot. Trigger investigations from chat, get formatted results back.
Phase 4 — Real data layer. Move from JSON files to PostgreSQL or Elasticsearch. Start handling realistic query patterns.
Phase 5 — Containerization. Dockerize the MCP server, agent, and bot. Each gets its own Dockerfile and pyproject.toml. Full local stack via docker-compose.yml, with a docker-compose.dev.yml variant for hot reload and debug ports.
Phase 6 — Cloud deployment. Deploy to Oracle Cloud Free Tier (2× ARM A1) running k3s. Fly.io, Railway, and Render are fallback targets. Scheduled alert processing. Live system.
Phase 7 — Kubernetes and CI/CD. Helm packaging, GitHub Actions for lint/test/build on PR and OCI registry push + rollout on deploy, structured logging via structlog, basic monitoring. Production-shaped environment.

graph TB
    p1["Phase 1: Local MVP<br/>JSON data, custom MCP, six-step pipeline<br/>DONE"]
    p2["Phase 2: Investigation Depth<br/>Real threat intel, eval harness, retries"]
    p3["Phase 3: Discord Integration<br/>Replace CLI with a bot"]
    p4["Phase 4: Real Data Layer<br/>PostgreSQL or Elasticsearch"]
    p5["Phase 5: Containerization<br/>Dockerize MCP, agent, bot"]
    p6["Phase 6: Cloud Deployment<br/>Oracle Cloud Free Tier + k3s"]
    p7["Phase 7: Kubernetes and CI/CD<br/>Helm, GitHub Actions, structlog"]

    p1 --> p2
    p2 --> p3
    p3 --> p4
    p4 --> p5
    p5 --> p6
    p6 --> p7

    classDef done fill:#d3f9d8,stroke:#2f9e44,stroke-width:3px,color:#14532d
    classDef near fill:#fff4e6,stroke:#e67700,color:#78350f
    classDef far fill:#f8f9fa,stroke:#868e96,color:#374151
    class p1 done
    class p2,p3 near
    class p4,p5,p6,p7 far
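Phase 2 lists retry and timeout handling. A rough sketch of what that could look like is a small wrapper that retries a call with exponential backoff. This is an assumption about the shape, not the planned implementation: `call_with_retry`, its defaults, and the choice of which exceptions count as retryable are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TimeoutError or ConnectionError with exponential backoff.

    The sleep function is injectable so tests can run without real delays.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Only reachable when the loop never ran.
    raise ValueError("attempts must be at least 1")
```

Only transient failures are retried here. Anything else, such as a malformed tool response, propagates immediately so the pipeline can fail loudly instead of looping.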

The honest risk is scope creep. I'm stacking four learning curves — MCP, agent design, cloud infrastructure, and the security domain — and the cloud/K8s layers are the least differentiated portfolio signal. What's rare isn't another Dockerized app. It's someone who has actually designed tools an LLM reasons well against, and can talk about the tradeoffs. So Phases 1–3 are the real project. Everything after is bonus.

Why This Approach

I could have forked the workshop repo, swapped in my own credentials, and called it a day. I didn't, for a specific reason: I wanted to understand MCP from the inside. Writing my own server forced me to think about tool granularity, error modes, and what the model actually needs from each call. That perspective doesn't come from configuring someone else's framework.

I also went back and forth on the agent architecture more times than I want to admit. I looked at conversational-profile frameworks (agency-agents was one), found them too thin — no first-class tools, no skills, just personality descriptions. I tried the Superpowers MCP, found it heavy on tokens and pinned to specific Claude Code model versions. Each false start taught me what I actually needed from an architecture. By the time I committed to a pipeline-driven design with its own MCP, it was because I'd rejected three other shapes, not because it was the first thing I tried.

The second reason is portfolio shape. A SOC analyst role, an AppSec role, a detection engineering role — none of them ask for someone who can deploy an LLM wrapper. They ask for someone who understands the workflow, the data, and the tradeoffs. Building this project the hard way gives me something to talk about in every one of those interviews.

The Phase 1 deep dive covers the six-step pipeline, the MCP tool design, the adaptive report formatter, and the places I knowingly cut corners.

→ Phase 1 Deep Dive
→ GitHub repo