AI SOC Agent, Phase 1: Shipping the End-to-End Loop

The local MVP: custom MCP server, six-step pipeline, adaptive reporting, and everything I deliberately cut.

The Problem

A Tier 1 SOC analyst spends most of a shift doing the same work for the hundredth time. Pull the alert. Look up the source IP. Check the user's recent activity. Decide if it's noise or if it needs escalation. Write it up. Move on.

The pattern is mechanical, but the judgment inside each step isn't. You have to read the signal in context, weigh it against what you know about the environment, and decide whether the story the alert tells is plausible. That's exactly the kind of task modern LLMs can do reasonably well — if the surrounding system gives them structured inputs and constrains the shape of their reasoning.

I set out to build a working version of that system. Not a demo, not a prompt-and-pray wrapper, but a pipeline that takes a real alert ID and produces a report a human analyst could actually read. Phase 1 is the first honest cut at that.

What I Built

A locally runnable security investigation agent, built around three components:

  • A custom MCP server exposing six tools an LLM can call to retrieve alerts, search events, get entity context, enrich IPs, and load runbooks
  • A six-step pipeline — classify → select_runbook → investigate → assess → summarize → respond — where each step is a separate module driven by a plain-text prompt template
  • A CLI runner (run_agent.py) that manages the MCP server subprocess lifecycle and feeds alerts into the pipeline

The whole thing runs on one machine with python run_agent.py --alert alert-001. Three sample alerts ship with the repo: a brute-force attack, a lateral-movement scenario, and a data exfiltration case.

graph TB
    subgraph cli["CLI Runner - run_agent.py"]
        pipeline["Six-Step Pipeline<br/>classify → select → investigate →<br/>assess → summarize → respond"]
        prompts["agent/prompts/*.txt"]
        pipeline -.->|reads| prompts
    end

    subgraph mcps["MCP Server - stdio subprocess"]
        tools["Six tools:<br/>list_alerts, get_alert, search_events,<br/>get_entity_context, enrich_ip, get_runbook"]
    end

    store["JSON Store<br/>data/*.json"]

    cli <-->|stdio IPC| mcps
    tools -->|StorageBase| store

    classDef host fill:#e5dbff,stroke:#5f3dc4,color:#1e1b4b
    classDef subp fill:#ffe8cc,stroke:#d9480f,color:#7c2d12
    classDef datap fill:#fff4e6,stroke:#e67700,color:#78350f
    class pipeline,prompts host
    class tools subp
    class store datap

How I Got to a Design

I didn't sit down and start writing Python. The first real chunk of time on this project was spent on two things: building enough mental model of MCP to design a decent server, and figuring out the shape of the agent itself.

For MCP, I worked through Anthropic's intro course, reinforced it with short-form video, then read the Python SDK source until the transport and tool-registration code stopped being opaque. Reading real implementations changed how I thought about my own tools — granularity, error shape, what the model actually sees per call.

For the agent design, I used ChatGPT as a thinking partner. I'd dump raw notes on tradeoffs (Elasticsearch vs PostgreSQL, agent teams vs sub-agents, how much autonomy to give the model), push back on its first answers, and iterate until the project spec felt defensible. That spec then became the input to Claude Code, which did most of the implementation work.

The process I settled into: ChatGPT for design and specs, Claude Code for implementation, and my own judgment for whether the output held together. To make that repeatable across seven phases, I wrapped the whole thing in OpenSpec.

Before writing any production code, I seeded openspec/specs/ with one spec per service — system, mcp-server, agent, discord-bot, data, infra. Each one declares the purpose, the requirements (MUST/SHOULD), and the contract. Phase 1 then ran through the OpenSpec cycle:

/opsx:propose phase-1-local-mvp   # generate proposal, delta specs, design, tasks
review and edit artifacts          # human judgment stays in the loop
/opsx:apply                        # Claude executes the task list
/opsx:verify                       # sanity check before archiving
/opsx:archive                      # merge delta specs into canonical specs

That loop is the reason Phase 1 shipped in one sitting. Thirty-nine discrete tasks, each traceable back to a spec delta, reconciled into the main specs before the change was archived. Two disciplines were load-bearing: context hygiene (clearing the AI context window before each apply so the model wasn't dragging stale assumptions forward) and model routing (higher-reasoning model for proposals and design, faster model for straightforward task execution). Both came directly out of the struggles earlier in the project: they were my working answers to the "how do I balance tight loop vs wide delegation" question I kept asking, even though I never fully resolved it.

OpenSpec would have been overhead on a one-afternoon project. On a seven-phase one, it's what keeps the later phases from getting lost.

System Architecture

The Pipeline

The agent is not a single prompt or a free-roaming ReAct loop. It's an explicit six-step workflow, and the model reasons freely within each step but can't skip between them.

  1. Classify — given the raw alert, identify the likely attack type and confidence.
  2. Select Runbook — map the classification to a Markdown playbook.
  3. Investigate — follow the runbook, calling MCP tools to gather evidence.
  4. Assess — weigh the evidence. Produce a severity, a confidence score, and a recommended action.
  5. Summarize — generate a structured analyst-ready write-up.
  6. Respond — format the output adaptively for the human on the other end.

Each step lives in its own file under agent/workflow/steps/. State is passed forward through a shared InvestigationState object, so each step has exactly the context it needs and nothing more. The pipeline orchestrator in pipeline.py is the only thing that knows the sequence.

graph LR
    alert["Alert"]
    c["Classify"]
    s["Select<br/>Runbook"]
    i["Investigate"]
    a["Assess"]
    sm["Summarize"]
    r["Respond"]
    report["Report"]

    alert --> c
    c -->|classification| s
    s -->|runbook| i
    i -->|evidence| a
    a -->|verdict + confidence| sm
    sm -->|structured summary| r
    r --> report

    classDef io fill:#e7f5ff,stroke:#1971c2,color:#0c1a33
    classDef think fill:#e5dbff,stroke:#5f3dc4,color:#1e1b4b
    classDef act fill:#ffe8cc,stroke:#d9480f,color:#7c2d12
    classDef out fill:#fff4e6,stroke:#e67700,color:#78350f
    class alert,report io
    class c,s think
    class i,a act
    class sm,r out

InvestigationState is threaded forward through every step — each step receives exactly the context it needs from the ones that came before.
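A minimal sketch of how that threading might look. The field names on InvestigationState and the stub step bodies are assumptions for illustration; the real steps render prompts and call the model rather than returning canned values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class InvestigationState:
    """Shared state passed from step to step (field names are hypothetical)."""
    alert_id: str
    classification: Optional[dict] = None
    runbook: Optional[str] = None
    evidence: List[str] = field(default_factory=list)
    assessment: Optional[dict] = None
    summary: Optional[str] = None
    report: Optional[str] = None


Step = Callable[[InvestigationState], InvestigationState]


# Stub steps: each reads what earlier steps wrote and adds its own output.
def classify(state):
    state.classification = {"alert_type": "brute-force", "confidence": 0.9}
    return state


def select_runbook(state):
    state.runbook = state.classification["alert_type"].replace("-", "_") + ".md"
    return state


def investigate(state):
    state.evidence.append(f"followed {state.runbook}")
    return state


def assess(state):
    state.assessment = {"severity": "high", "confidence": 0.8}
    return state


def summarize(state):
    state.summary = f"{state.classification['alert_type']}: {len(state.evidence)} finding(s)"
    return state


def respond(state):
    state.report = f"[{state.assessment['severity']}] {state.summary}"
    return state


# Only the orchestrator knows the order; steps never call each other.
PIPELINE: List[Step] = [classify, select_runbook, investigate, assess, summarize, respond]


def run_pipeline(alert_id: str) -> InvestigationState:
    state = InvestigationState(alert_id=alert_id)
    for step in PIPELINE:
        state = step(state)
    return state
```

Because every step has the same signature, swapping or mocking one is a matter of replacing a single entry in the list.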

The MCP Layer

The MCP server runs as a subprocess launched by run_agent.py and communicates over stdio. It exposes six tools:

| Tool | Purpose |
|---|---|
| list_alerts | Return all alerts in the store |
| get_alert | Fetch a single alert by ID |
| search_events | Keyword search across the event log |
| get_entity_context | Pull context for an IP, hostname, or user |
| enrich_ip | Return threat intel for an IP (mocked in Phase 1) |
| get_runbook | Load a Markdown playbook for an alert type |

The storage layer sits behind an abstract StorageBase class. Phase 1 ships with a JSON file implementation. Swapping in PostgreSQL later is a matter of writing one adapter, not rewriting the MCP server.
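As a rough illustration of that seam, here is what an abstract base plus a JSON-backed implementation could look like. The method names, the JsonStorage class name, the alerts.json filename, and the "id" key are all assumptions; the repo's actual interface may differ.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Optional


class StorageBase(ABC):
    """Storage contract the MCP tools depend on (method names are hypothetical)."""

    @abstractmethod
    def list_alerts(self) -> List[dict]:
        ...

    @abstractmethod
    def get_alert(self, alert_id: str) -> Optional[dict]:
        ...


class JsonStorage(StorageBase):
    """Reads alerts from a JSON file containing a list of alert objects."""

    def __init__(self, data_dir):
        # "alerts.json" is an assumed filename under data/.
        self._alerts_path = Path(data_dir) / "alerts.json"

    def _load(self) -> List[dict]:
        with self._alerts_path.open(encoding="utf-8") as fh:
            return json.load(fh)

    def list_alerts(self) -> List[dict]:
        return self._load()

    def get_alert(self, alert_id: str) -> Optional[dict]:
        # Return the first alert whose "id" matches, or None if absent.
        return next((a for a in self._load() if a.get("id") == alert_id), None)
```

A PostgreSQL adapter would subclass StorageBase the same way, and the tool code that calls list_alerts or get_alert would not need to change.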

Prompt-Driven Design

Every step's behavior is defined by a plain-text prompt file under agent/prompts/. Not Python string templates, not a framework abstraction, just .txt files. This was deliberate. When a step misbehaves, I edit a prompt, not a class. When I want to try a different classification strategy, I copy a file. Prompts are the dominant source of behavior in an LLM pipeline, and burying them inside Python modules hides the thing that actually matters.

Here's the actual classify step prompt in full — agent/prompts/classify.txt:

You are a Tier 1 SOC analyst. Classify the following security alert.

Alert ID: {alert_id}
Alert Data:
{alert_data}

Respond in JSON with exactly this structure:
{{
  "alert_type": "<brute-force|lateral-movement|data-exfil|unknown>",
  "severity": "<low|medium|high|critical>",
  "confidence": <0.0-1.0>,
  "reasoning": "<one sentence explaining your classification>"
}}

Be concise. Output only the JSON object.

Nothing fancy. A role, the inputs, the output contract, and a tone instruction. The doubled braces are Python .format() escapes around the literal JSON the model is asked to emit. When I want to tune classification behavior, I edit this file — not a Python module, not a framework config.
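To make the brace escaping concrete, here is a small sketch of rendering a template like this one and validating the reply. The trimmed template constant and both helper names (render_prompt, parse_classification) are assumptions for illustration, not the repo's code.

```python
import json

# Trimmed copy of the template above; the full text lives in agent/prompts/classify.txt.
CLASSIFY_TEMPLATE = """Alert ID: {alert_id}
Alert Data:
{alert_data}

Respond in JSON with exactly this structure:
{{
  "alert_type": "<brute-force|lateral-movement|data-exfil|unknown>",
  "confidence": <0.0-1.0>
}}
"""

ALLOWED_TYPES = {"brute-force", "lateral-movement", "data-exfil", "unknown"}


def render_prompt(template: str, **fields: str) -> str:
    # str.format() turns "{{" and "}}" into literal braces and fills the
    # named placeholders. Substituted values are not re-scanned, so alert
    # data that itself contains braces (JSON, for example) is safe.
    return template.format(**fields)


def parse_classification(raw: str) -> dict:
    # Enforce the output contract: valid JSON with a known alert_type.
    result = json.loads(raw)
    if result.get("alert_type") not in ALLOWED_TYPES:
        raise ValueError(f"unexpected alert_type: {result.get('alert_type')!r}")
    return result
```

Validating the reply at the step boundary means a malformed classification fails loudly there instead of leaking into runbook selection.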

Adaptive Reporting

The last step, respond, is the one I'm most happy with. The report it produces varies with the assessment's verdict and confidence:

  • A high-confidence false positive gets a two-line dismissal. The analyst doesn't need paragraphs to close a noisy alert.
  • A low-confidence finding produces a detailed breakdown, surfaces the unresolved questions, and names a specific escalation tier.
  • A confirmed incident gets the full structured report — timeline, evidence, entities, recommended action.

The tone and depth of the output are themselves a function of the investigation. That felt right. A static template would have treated all outputs the same and forced the analyst to re-read boilerplate they already know.
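The branching can be made explicit in a few lines. This is only a sketch: the verdict labels, style names, and the 0.7 threshold are hypothetical, and in the repo this decision may well live inside the respond prompt rather than in Python.

```python
def choose_report_style(verdict: str, confidence: float, threshold: float = 0.7) -> str:
    """Pick a report shape from the assessment (all labels are hypothetical)."""
    if confidence < threshold:
        # Unsure either way: show the work and the open questions.
        return "detailed-breakdown"
    if verdict == "false_positive":
        # Confident it's noise: two lines are enough.
        return "short-dismissal"
    if verdict == "confirmed_incident":
        # Confident it's real: timeline, evidence, entities, action.
        return "full-report"
    # Unknown verdicts fall back to the most explicit format.
    return "detailed-breakdown"
```

Checking confidence first means a shaky "false positive" still gets the detailed treatment rather than a quick dismissal.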

~/projects/ai-soc-agent
$ python run_agent.py --alert alert-002

[classify] alert_type=lateral-movement severity=high confidence=0.82
[select] runbook=lateral_movement.md
[investigate] calling search_events(source_ip=10.0.2.41)
[investigate] calling get_entity_context(host=web-prod-03)
[assess] severity=high confidence=0.78 action=escalate-tier-2
[summarize] building structured report
[respond] rendered adaptive report — see below

Key Design Decisions

Step-wise pipeline over a monolithic agent. I could have written a single "investigate this alert" prompt and let the model drive the whole thing. That works in demos and falls over in reality. Every step I broke out is a step I can now inspect, swap, re-prompt, or mock independently. The model still reasons freely — but only inside one lane at a time.

Plain-text prompts over framework abstractions. Frameworks make the first 80% fast and the last 20% miserable. I wanted the prompts to be first-class artifacts I could diff, version, and iterate on without fighting an abstraction. If I outgrow this later, I'll move. For now, flat files.

JSON store behind an abstract interface. It would have been faster to hard-code file reads. It would also have coupled every tool to the storage format. The StorageBase class is two dozen lines and it gives me a clean path to PostgreSQL in Phase 4 without touching tool code.

Adaptive output over a fixed template. Copy-pasting the same report structure for every alert would have been easier. It would also have made the output feel like a mail merge, which is exactly the tone you don't want when you're asking an analyst to trust the system.

Where I Got Stuck

The honest version of this build includes a week of not-shipping. I flipped between agent architectures multiple times. I looked at conversational-profile frameworks like agency-agents, decided the lack of real tools and skills made them too shallow, and backed out. I tried the Superpowers MCP as a shortcut, realized its token usage was heavy and it was pinned to specific Claude Code model versions, and backed out of that too.

I burned time on environment problems that had nothing to do with the agent. nvm-windows fought me on Node setup until I switched to fnm. I lost an afternoon to tool-configuration decisions — skills versus tools versus agent profiles — before realizing the answer depended on a design I hadn't committed to yet.

The question I kept circling was how much context to spend on planning versus building. opus-plan mode helped. Letting an orchestrator decide model selection per task helped more. The pattern that eventually worked: use a stronger model to plan in the abstract, then delegate execution to cheaper models per step. That same idea — the system routes work, the model does the thinking — is what ended up driving the pipeline design itself. The development workflow and the product converged on the same shape, which I don't think was an accident.

The meta-lesson: every false start was information. I didn't know I needed a pipeline with its own MCP until three other architectures had failed to fit. Deleting code is part of the process.

Example Run

export ANTHROPIC_API_KEY=sk-...
python run_agent.py --alert alert-001

The CLI boots the MCP server as a subprocess, passes the alert ID into the pipeline, and streams the final report to stdout. The model calls tools as needed during the investigate and assess steps. Latency is dominated by the Claude calls, not anything in the pipeline itself.
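One way the boot-and-teardown of a stdio subprocess could be wrapped, using only the standard library. This is not the repo's code: managed_server is a hypothetical helper, and the real runner most likely relies on the MCP SDK to spawn the server and talk to it.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_server(cmd, shutdown_timeout=5.0):
    """Start a child process with stdio pipes and always clean it up."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        yield proc
    finally:
        if proc.poll() is None:
            # Ask politely first, then force-kill if it ignores us.
            proc.terminate()
            try:
                proc.wait(timeout=shutdown_timeout)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
        # Release the pipe file descriptors.
        for pipe in (proc.stdin, proc.stdout):
            if pipe is not None:
                pipe.close()


if __name__ == "__main__":
    # Placeholder child process standing in for the MCP server command.
    with managed_server([sys.executable, "-c", "import time; time.sleep(60)"]) as proc:
        print("server running:", proc.poll() is None)
```

Inside the with block, proc.poll() is None doubles as a cheap liveness check before each tool call.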

What's Missing (Honest Assessment)

I shipped Phase 1 in one sitting and I'm clear about what I didn't do:

  • Threat intel is mocked. enrich_ip returns static data regardless of the IP. Every investigation gets the same enrichment. This is the biggest gap between Phase 1 and anything production-shaped, and it's the first thing I'll fix.
  • There's no evaluation harness. The pipeline makes live Claude calls, and I have no systematic visibility into classification accuracy, runbook selection reliability, or how well-calibrated the confidence scores are. Prompt drift will go undetected until something obviously breaks.
  • No retry or timeout logic. A single slow or failed Claude call halts the whole pipeline. No backoff, no fallback, no circuit breaker. Fine for a local MVP, fragile for anything I'd leave running.
  • Subprocess handling is brittle. run_agent.py manages the MCP server lifecycle, but if the server crashes mid-run, the agent can hang or return an unclear error. It needs health checks and a clean shutdown path.
  • The integration test requires a live API key. test_integration.py hits the Anthropic API. Without a key, the suite fails. That needs to be gated explicitly so it doesn't surprise CI or rack up accidental bills.

None of these are unknown unknowns. They're deliberate Phase 1 cutoffs. Phase 2 is the reliability and evaluation pass.
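For the retry gap in particular, the fix does not need to be elaborate. A minimal sketch of exponential backoff around any callable follows; call_with_retry and its parameters are hypothetical, and a real version would narrow retry_on to the API client's transient error types.

```python
import time


def call_with_retry(fn, *, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on failure with exponentially growing delays.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    The last failure is re-raised so the caller still sees the real error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep in as a parameter keeps the helper testable without real waiting.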

What I Learned

Working with an LLM pipeline feels different from working with a traditional backend. In a backend, bugs are mostly binary — it works or it doesn't. In an LLM pipeline, the system can be running perfectly while producing quietly worse output because a prompt drifted or a model version changed. That's why the evaluation harness gap bothers me most: without it, I'm flying blind on the thing that matters.

Structure matters more than I expected. The six-step pipeline isn't there for show. Every time I was tempted to let the model "just figure it out," the output got vaguer. Every time I constrained the step to one job, the quality of each call went up. The structure isn't limiting the model — it's giving the model enough walls to push against.

Prompts are code. Keeping them as .txt files only works if I treat them with the same discipline I'd give a module: version them, diff them, test them against fixtures. That's a Phase 2 commitment too.

I also got a cleaner mental model for agent architecture out of this build that I keep coming back to: user → orchestrator → agents → tools/skills → memory → observability/guardrails. Almost every design question that felt confusing in the abstract resolved once I asked which layer it lived on. Model selection lives at the orchestrator. Input validation lives at the tool layer. Report adaptivity lives above the pipeline, not inside it. Once the layers were named, the decisions stopped fighting each other.

And planning with one AI while building with another — ChatGPT on design, Claude Code on implementation — turned out to be the real process skill I took away. Neither tool alone would have gotten this shipped in one sitting. The handoff is where the work actually happens.

What's Next (Phase 2)

  • Real threat intelligence — OTX, AbuseIPDB, or the MISP MCP behind enrich_ip.
  • An evaluation harness with fixtures, so I can measure classification and assessment quality before shipping prompt changes.
  • Retry, timeout, and graceful degradation on the Claude calls.
  • Subprocess health checks and clean failure modes for the MCP server.
  • Wire up the four runbooks (brute_force, privilege_escalation, lateral_movement, data_exfiltration) and the synthetic data generators (cloudtrail.py, sysmon.py, auth.py) so the investigation step has real log surface to search, not just the three seed alerts.
  • pydantic schemas on MCP tool inputs and outputs, and structlog for anything the agent emits — the groundwork for Phase 6's observability layer.
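The evaluation harness item above could start as small as this: a list of labeled fixtures and a function that scores any classifier against them. Everything here is a sketch under assumptions. Fixture and evaluate are hypothetical names, and apart from alert-002 (shown as lateral-movement in the sample run), the expected labels are illustrative guesses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Fixture:
    alert_id: str
    expected_type: str


# Hypothetical labels, except alert-002, which matches the sample run output.
FIXTURES = [
    Fixture("alert-001", "brute-force"),
    Fixture("alert-002", "lateral-movement"),
    Fixture("alert-003", "data-exfil"),
]


def evaluate(classify_fn: Callable[[str], str], fixtures: List[Fixture]) -> dict:
    """Run classify_fn over every fixture and report accuracy plus misses."""
    misses = []
    for fx in fixtures:
        got = classify_fn(fx.alert_id)
        if got != fx.expected_type:
            misses.append((fx.alert_id, fx.expected_type, got))
    accuracy = (len(fixtures) - len(misses)) / len(fixtures) if fixtures else 0.0
    return {"accuracy": accuracy, "misses": misses}
```

Running this before and after a prompt edit gives a first, crude signal on whether the change helped or quietly regressed classification.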

Close

Repo: github.com/THESunnyNguyen/ai-soc-agent

If you're building something similar, I'd be curious what you're doing for prompt evaluation. That's the part I still don't have a strong answer for, and I suspect nobody does yet.