
AI-Powered Incident Investigation: The Complete Guide for SRE Teams (2026)

AI-powered incident investigation is an LLM agent that runs tools, queries infrastructure, and reasons over evidence — not stream-correlation AIOps. The 2026 landscape, architecture, and pilot plan.

By Noah Casarotto-Dinning, CEO at Arvo AI

Key Takeaways

  • AI-powered incident investigation means an LLM agent that runs tools, queries infrastructure, and reasons over evidence in multiple steps — not stream-correlation AIOps. The distinction is structural: traditional AIOps clusters events; an investigation agent runs kubectl, queries metrics, searches knowledge bases, and updates its hypotheses as findings arrive.
  • We propose the AI Investigation Capability Ladder (AICL). Six tiers: L0 (manual), L1 (alert correlation), L2 (LLM-summarized timeline), L3 (single-shot LLM diagnosis), L4 (agentic multi-step investigation), L5 (closed-loop investigate + remediate with human approval).
  • CNCF now hosts two open-source agentic projects in this lane. HolmesGPT entered the CNCF Sandbox in October 2025. K8sGPT has been Sandbox since December 19, 2023. Aurora (Apache 2.0, self-hosted) is the third major open-source option and the only one that spans AWS, Azure, GCP, OVH, Scaleway, and Kubernetes in a single deployment.
  • DORA now measures recovery time as Failed Deployment Recovery Time (FDRT). Per DORA's metrics history, FDRT replaced "MTTR" as the official term in 2023 because MTTR had grown ambiguous; the 2024 DORA State of DevOps Report further added "deployment rework rate" as a fifth core measure.
  • The closed-source peer set is well-funded. Resolve.ai raised $125M at a $1B valuation in February 2026. Traversal reports 32% MTTR reduction and 82% RCA accuracy at American Express across 250 billion log lines per day. Cleric, Neubird, Causely, and Ciroos round out the category.

Cloud incidents in 2026 surface faster than humans can investigate them. AI-powered incident investigation is a system in which a large language model runs as an agent — calling infrastructure tools, querying logs and metrics, traversing dependency graphs, and reasoning over evidence across multiple steps to produce a root-cause analysis. Unlike traditional AIOps, which clusters events and ranks suspects from streams it already has, an investigation agent goes and gets new evidence: it shells into a sandboxed pod, runs kubectl describe, hits the cloud API, reads the relevant CI/CD pipeline, then re-plans its next step based on what it found.

This guide is for SRE, platform, and DevOps leaders evaluating where to invest their incident-response automation budget in 2026. We cover what the category looks like, how the open-source and commercial offerings actually differ, the standards bodies tracking outcomes, and how to pilot a tool without betting the farm.

What "investigation" means here

Three things blur together when people say "AI incident response":

  1. Alert correlation — clustering related events to reduce noise. PagerDuty AIOps, BigPanda, Moogsoft (now Dell APEX), Dynatrace Davis, Splunk ITSI Event iQ. This is mature ML; not investigation.
  2. Postmortem generation — drafting an incident report from artifacts that already exist (Slack transcript, alert timeline, monitor data). Rootly, incident.io, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered separately in our Automated Post-Mortem Generation guide.
  3. Agentic investigation — an LLM that runs new tool calls during the incident to gather evidence it doesn't already have. Aurora, HolmesGPT, K8sGPT, Cleric, Resolve.ai, Traversal, Neubird Hawkeye, Causely. This is the category this post is about.

Conflating them produces bad evaluations. A team that picks a postmortem generator expecting it to find root cause will be disappointed; a team that picks an AIOps correlator expecting it to run kubectl will be even more disappointed.

The AI Investigation Capability Ladder (AICL)

Six tiers, increasing autonomy. Pick the tier you can defend operationally: climbing higher is an engineering investment; staying lower is a process choice.

| Tier | What runs | Human role | Representative tools |
| --- | --- | --- | --- |
| L0 — Manual | Engineer reads alerts, runs kubectl and cloud CLIs by hand | Everything | PagerDuty, Slack, Datadog |
| L1 — Alert correlation | ML correlator clusters and dedupes events | Triage from a smaller list | PagerDuty AIOps, BigPanda, Dell APEX (Moogsoft), Splunk ITSI |
| L2 — LLM-summarized timeline | LLM summarizes an event stream into prose | Reads summary instead of raw events | Datadog Bits AI summaries, incident.io Scribe |
| L3 — Single-shot LLM diagnosis | LLM produces an RCA from one prompt over alert + telemetry | Trusts a single inference | K8sGPT analyzers, vendor "AI insights" buttons |
| L4 — Agentic multi-step investigation | LLM agent calls many tools across multiple turns, replans as findings arrive | Reviews trace, ships fix | Aurora, HolmesGPT, Cleric, Resolve.ai, Traversal, Neubird, Causely |
| L5 — Closed-loop investigate + remediate | Agent investigates and proposes (or applies, with approval) a fix | Approves remediation | Aurora + Aurora Actions, Resolve.ai, ServiceNow Now Assist SRE |

The honest framing: most teams are L0 or L1 today. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of survey respondents don't use AI in CI/CD workflows at all — a useful proxy for the broader DevOps stack. Investigation lags even further because it requires giving an agent infrastructure permissions, which makes security review harder than for build-time AI.

Traditional AIOps vs agentic investigation

Both are useful; they cover non-overlapping work.

| Capability | Traditional AIOps (L1) | Agentic investigation (L4) |
| --- | --- | --- |
| Input | Event stream, telemetry already ingested | Same, plus live tool calls |
| Output | Ranked suspects, correlated incidents | RCA narrative, evidence chain, suggested fix |
| New evidence | No — operates on what's already in the system | Yes — agent issues new commands |
| Reasoning | ML clustering / topology distance scoring | LLM step-by-step (ReAct or similar) |
| Why it can be wrong | Missing event, weak topology graph | Hallucination, tool misuse, prompt drift |
| Cost model | Per-event or per-host | Per LLM token + tool runtime |
| Failure mode | Quiet — wrong cluster, you don't know | Loud — agent's trace is human-readable |

Most production deployments will run both. AIOps reduces the alert volume the agent has to investigate; the agent does the deep work AIOps cannot. Vendor-neutral evidence of this stacking pattern is showing up in 2025–2026 product announcements: PagerDuty's SRE Agent layers an agentic loop on top of its existing AIOps; Splunk's ITSI Episode Summarization (announced at .CONF25, September 2025) layers an LLM summary on top of its KPI engine.

The agentic peer set in 2026

This is the actual decision the buyer faces. Apache-2.0 open source vs commercial, single cloud vs multi-cloud, in-cluster vs cross-system, with or without RAG.

| Product | License | Scope | Notes |
| --- | --- | --- | --- |
| Aurora | Apache 2.0 | AWS, Azure, GCP, OVH, Scaleway, Kubernetes + 30+ integrations | LangGraph-orchestrated ReAct agent. Memgraph-backed dependency graph used by the alert correlator; Weaviate hybrid (BM25 + vector) RAG over runbooks and past postmortems. Self-hosted via Docker Compose or Helm. |
| HolmesGPT | Apache 2.0 | Cloud-native, Kubernetes-first; AWS, GCP, Oracle Cloud, OpenShift toolsets | Co-maintained by Robusta and Microsoft. CNCF Sandbox since October 2025. Read-only, RBAC-respecting; posts findings back to Slack / PagerDuty / Jira. |
| K8sGPT | Apache 2.0 | Kubernetes resource diagnostics | CNCF Sandbox since December 19, 2023. Analyzer-based — closer to L3 than L4 in our ladder. |
| Cleric.ai | Closed source | Slack-first AI SRE | Gartner Cool Vendor 2025. Integrates Datadog and Grafana. |
| Resolve.ai | Closed source | Multi-cloud AI SRE | $125M Series A at $1B valuation in February 2026. Founded by Spiros Xanthos and Mayank Agarwal, ex-Splunk. |
| Traversal | Closed source | "Causal search engine" for production systems | $48M Sequoia/Kleiner round (June 2025). Reports 32% MTTR reduction and 82% RCA accuracy in production at American Express. |
| Neubird Hawkeye | Closed source | Llama 3.2 70B fine-tuned + ChromaDB RAG | SaaS or VPC, SOC-2. Integrates Datadog, Splunk, CloudWatch, PagerDuty, ServiceNow. |
| Causely | Closed source | Causal-graph reasoner for Kubernetes | Gartner Cool Vendor 2025. MCP server. Gemini-powered. |
| Ciroos.AI | Closed source | "SRE Teammate" multi-agent | MCP and A2A architecture. |

If you need a self-hosted, multi-cloud, Apache-2.0 option, Aurora is the broadest. If you live entirely inside Kubernetes and want a CNCF-blessed option, HolmesGPT is the strong choice. K8sGPT is the lightweight diagnostic pre-step. The closed-source options trade source availability for managed-service ergonomics and (in Resolve.ai and Traversal's cases) a lot of recent capital.

For a deeper open-source-only comparison, see our Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide.

Architecture: what makes investigation "agentic"?

Five components show up in every credible agentic-investigation product. If a tool is missing more than one, it sits below L4 on the AICL.

1. A tool-calling loop (ReAct or similar)

The agent issues a tool call, sees the result, decides the next call, and continues until it has enough evidence. This is the ReAct pattern (Reason + Act, Yao et al. 2022). Aurora's implementation is a single-node LangGraph workflow wrapping langchain.agents.create_agent; the agent decides at every step whether to invoke a tool or finalize the RCA. HolmesGPT uses a similar pattern with its own toolset registry. The choice between LangGraph, LangChain, AutoGen, or a custom loop is implementation detail — what matters is multi-turn tool use.
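The loop itself is simple enough to sketch in a few lines. This is a minimal, self-contained illustration of the ReAct pattern, not Aurora's or HolmesGPT's actual code: the scripted planner below stands in for the LLM, and the tool names and outputs are invented for the example.

```python
# Minimal ReAct-style loop: ask the planner for the next action, run it,
# feed the observation back, repeat until the planner finalizes an RCA.

def investigate(planner, tools, alert, max_steps=8):
    evidence = [f"alert: {alert}"]
    for _ in range(max_steps):
        action = planner(evidence)        # ("tool", name, args) or ("final", rca)
        if action[0] == "final":
            return action[1], evidence
        _, name, args = action
        result = tools[name](args)        # execute the chosen tool
        evidence.append(f"{name}({args}) -> {result}")
    return "inconclusive", evidence

# Scripted stand-in for the LLM: describe the pod, check recent deploys, conclude.
def scripted_planner(evidence):
    if len(evidence) == 1:
        return ("tool", "kubectl_describe", "pod/checkout-7f9")
    if len(evidence) == 2:
        return ("tool", "recent_deploys", "checkout")
    return ("final", "OOMKilled after deploy #482 halved the memory limit")

tools = {
    "kubectl_describe": lambda args: "OOMKilled, restarts=14",
    "recent_deploys": lambda args: "#482 set memory limit 256Mi (was 512Mi)",
}

rca, trace = investigate(scripted_planner, tools, "checkout error rate > 5%")
```

The property that makes this L4 rather than L3 is visible in the control flow: the second tool call only happens because of what the first one returned.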

2. Tool reach across the stack

An investigation agent that can only read Kubernetes will miss every multi-cloud incident. Tool reach matters more than algorithmic sophistication. Aurora exposes 30+ integrations covering cloud CLIs (AWS, Azure, GCP, OVH, Scaleway, Cloudflare), Kubernetes, Terraform, Docker, monitoring (Datadog, Grafana, NewRelic, OpsGenie, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, incident.io), logging (Splunk), CI/CD (Jenkins, Spinnaker, CloudBees), code (GitHub, Bitbucket), and docs (Confluence, Notion, SharePoint, Jira). A fully connected instance surfaces 100+ discrete tool callables to the agent.

3. Sandboxed CLI execution

Letting the agent run kubectl and cloud CLIs raises the obvious concern: arbitrary command execution. Aurora's architecture wraps every command in a four-layer safety pipeline before the command leaves the planner:

  1. Prompt-injection input rail (NVIDIA NeMo Guardrails) blocks commands that originate from injected instructions.
  2. Static signature match against 37 vendored SigmaHQ detection rules covering known-malicious command patterns.
  3. Per-org command policy — allow/deny lists scoped to the customer's tenant.
  4. LLM safety judge adapted from Meta's PurpleLlama AlignmentCheck.

Approved commands execute via kubectl exec into ephemeral terminal pods inside an "untrusted" Kubernetes namespace. See our AI Agent kubectl Safety guide for the full threat model.
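The layering idea generalizes to any agent that executes commands: every proposed command passes a chain of independent checks, and any single failure blocks execution. The sketch below is illustrative only — the deny patterns and allow-list are toy stand-ins, not Aurora's shipped rule set.

```python
# Layered command-safety gate: a command must pass every check before it
# may execute. Rules here are toy examples for illustration.
import re

DENY_SIGNATURES = [r"rm\s+-rf\s+/", r"kubectl\s+delete\s+ns"]   # signature-rule stand-ins
ORG_ALLOWED_PREFIXES = ("kubectl get", "kubectl describe", "kubectl logs")

def signature_check(cmd):
    """Block commands matching known-dangerous patterns."""
    return not any(re.search(p, cmd) for p in DENY_SIGNATURES)

def org_policy_check(cmd):
    """Per-org allow-list: only read-only kubectl verbs pass."""
    return cmd.startswith(ORG_ALLOWED_PREFIXES)

def gate(cmd, checks=(signature_check, org_policy_check)):
    """Return (approved, name_of_failed_check)."""
    for check in checks:
        if not check(cmd):
            return False, check.__name__
    return True, None
```

A real deployment would add the prompt-injection rail before and the LLM safety judge after these static layers; the point is that each layer can veto independently.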

4. Retrieval over organizational memory

The agent's first move on most investigations should be checking whether a similar incident has happened before. Aurora uses Weaviate for hybrid (BM25 + vector) search over runbooks, past postmortems, and Aurora Learn — a corpus of past good RCAs that get injected as context for new investigations. HolmesGPT supports RAG over runbooks via its toolset system. K8sGPT does not have a first-class RAG layer.

The honest measurement: RAG quality dominates accuracy on incidents that have happened before. Sparse-only retrieval misses semantic recall; dense-only retrieval misses literal identifier matches. Hybrid wins, and is now table stakes.
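One common way to combine the two rankings is reciprocal rank fusion (RRF): each document earns a score from its rank in each list, and documents that place well in both rise to the top. This sketch shows the fusion idea only — it is not Weaviate's implementation, and the postmortem IDs are invented.

```python
# Reciprocal rank fusion: merge a keyword (BM25-style) ranking with a
# vector ranking into one ordered list.

def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Sparse search nails the literal identifier; dense search nails the concept.
sparse = ["pm-142-oom-checkout", "pm-091-dns", "pm-077-disk"]
dense  = ["pm-088-memory-leak", "pm-142-oom-checkout", "pm-101-latency"]

fused = rrf([sparse, dense])
# pm-142-oom-checkout places high in both lists, so it wins the fusion
```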

5. Infrastructure topology

An LLM that doesn't know that service A depends on database B will mis-attribute symptoms. Aurora uses Memgraph as a live dependency graph populated by an infrastructure-discovery pipeline; the topology is consulted by Aurora's alert correlator before the agent runs, and dependency context surfaces into the agent's working set through retrieval results tagged "[Auto-Discovery]". The agent does not Cypher-query the graph directly during an incident — it reads digested dependency context the way a human SRE would read a service map.
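The value of topology context is easy to see in miniature: given a dependency map, a symptom on one service implies a suspect list of everything upstream of it. The adjacency map below is a toy example, not Aurora's Memgraph schema.

```python
# Walk upstream from the symptomatic service to find candidate causes.
from collections import deque

DEPENDS_ON = {  # service -> services it calls (toy topology)
    "checkout": ["payments", "inventory"],
    "payments": ["postgres-main"],
    "inventory": ["postgres-main", "redis-cache"],
}

def upstream(service):
    """Breadth-first walk of everything the service transitively depends on."""
    seen, queue = [], deque(DEPENDS_ON.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.append(dep)
            queue.extend(DEPENDS_ON.get(dep, []))
    return seen

# A checkout symptom puts postgres-main on the suspect list automatically.
suspects = upstream("checkout")
```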

What the DORA and VOID anchors actually say

Two industry sources are worth grounding the investigation conversation in.

DORA — Failed Deployment Recovery Time. Per DORA's metrics history, the original "Mean Time to Recover" metric was renamed Failed Deployment Recovery Time (FDRT) in 2023 because MTTR had grown ambiguous in industry usage. FDRT measures recovery from change-induced failures specifically — the place where investigation speed matters most. The 2024 DORA State of DevOps Report PDF further refined the metric set, adding "deployment rework rate" as a fifth core measure.

VOID — incident reality, not vendor claims. The Verica Open Incident Database catalogs public incident reports. The 2nd Annual VOID Report (December 2022) reviewed approximately 10,000 incidents from 600+ organizations and concluded that MTTR is unreliable as a comparison metric across organizations and that only about 25% of public incident reports clearly identify a root cause. The implication for buyers: outcome metrics like "MTTR reduced X%" should be interpreted carefully, especially when vendors quote them. The 2024 DORA report itself notes that AI adoption correlated with a 1.5% throughput decrease and 7.2% stability decrease in the 2024 cohort — a counterintuitive finding that has driven careful 2026 research into where AI helps and where it doesn't.

An evaluation scorecard for AI investigation tools

Treat this as the rubric for a paid PoC. Each row matters more than vendor demos suggest.

  1. Multi-step tool use. Trace one incident end to end — does the agent call more than one tool, and does each subsequent call depend on the previous result? If not, you're at L3, not L4.
  2. Cloud scope. Match the agent's supported clouds to your real footprint. Multi-cloud is the most common reason a single-cloud investigation tool gets ripped out.
  3. Sandboxing and RBAC. Read the tool's command-execution architecture. If the agent runs commands directly from a worker pod with broad cluster credentials, model the blast radius.
  4. RAG quality. Ingest 50 of your real past postmortems and 20 runbooks. Then run a real recurring incident type. Did the agent retrieve the right historical material?
  5. Trace readability. Have a non-ML engineer read the agent's trace for one incident. Could they tell what it tried, what it found, and why it concluded what it did?
  6. Cost and rate-limit headroom. Long agentic investigations are token-expensive. Budget the LLM bill at 10x typical and stress-test rate limits against your busiest incident week, not a quiet one.
  7. Open source vs SaaS posture. If you handle regulated workloads, self-hosting is not optional. Apache-2.0 projects (Aurora, HolmesGPT, K8sGPT) protect you against vendor lock-in.
  8. Where it sits on the AICL. Decide up front whether you want L4 (recommended) or L5 (apply, with approval). Most regulated teams pilot at L4 and stay there for the first year.
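One way to keep a PoC honest is to turn the rubric above into a weighted scorecard and fill it in per candidate. The weights and scores below are illustrative — assign your own per criterion.

```python
# Weighted PoC scorecard over the rubric rows. Score each criterion 0-5.

CRITERIA = {  # weight reflects how much the row matters to your org
    "multi_step_tool_use": 3,
    "cloud_scope": 3,
    "sandboxing_rbac": 2,
    "rag_quality": 2,
    "trace_readability": 1,
    "cost_headroom": 1,
}

def score(tool_scores):
    """Weighted average of 0-5 scores; returns a single 0-5 figure."""
    total_weight = sum(CRITERIA.values())
    return sum(CRITERIA[c] * tool_scores[c] for c in CRITERIA) / total_weight

candidate = {  # hypothetical candidate's scores after a pilot
    "multi_step_tool_use": 5, "cloud_scope": 4, "sandboxing_rbac": 4,
    "rag_quality": 3, "trace_readability": 5, "cost_headroom": 2,
}
```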

How to run a low-risk pilot

  1. Pick one alert source and one cluster. PagerDuty + one Kubernetes cluster, or Datadog + one service group. Resist the urge to install across the org on day one.
  2. Run read-only for at least four weeks. Compare the agent's RCA to the human RCA on every incident. Track agreement rate, time to RCA, and how often the agent surfaced a finding the human missed (or vice versa).
  3. Ingest your historical context. Past postmortems and runbooks into the agent's knowledge base. This is the single biggest accuracy lever, and most teams underinvest in it. Plan a week for the ingestion alone.
  4. Add one chat channel and one slash command. Engineers should be able to ask the agent follow-up questions about the incident interactively. This is where the L4 → L5 trust curve gets built.
  5. Review traces weekly. Spend an hour a week reading the agent's tool-call traces. Look for tool misuse, excessive retries, and hallucinated identifiers. Iterate on prompts or RAG content as needed.
  6. Promote to alert-triggered investigation when the trace is clean for two consecutive weeks. Webhook from PagerDuty / Datadog / Grafana / incident.io straight into the agent. The investigation is now happening before the on-call has opened their laptop.
  7. Decide on L5 (remediation) only after three months at clean L4. Closed-loop remediation is a separate trust escalation. Most teams do it through pull requests with human approval — Aurora's Aurora Actions feature is the open-source pattern for this.
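The read-only phase in step 2 only works if you actually track the numbers. A minimal sketch of the comparison, with invented incident records standing in for your real data:

```python
# Track agreement rate and time saved between agent RCA and human RCA
# during the read-only pilot phase.

def pilot_metrics(incidents):
    """incidents: dicts with agent_rca, human_rca, agent_minutes, human_minutes."""
    n = len(incidents)
    agree = sum(1 for i in incidents if i["agent_rca"] == i["human_rca"])
    saved = sum(i["human_minutes"] - i["agent_minutes"] for i in incidents) / n
    return {"agreement_rate": agree / n, "avg_minutes_saved": saved}

incidents = [  # illustrative pilot data
    {"agent_rca": "oom",  "human_rca": "oom",         "agent_minutes": 4, "human_minutes": 35},
    {"agent_rca": "dns",  "human_rca": "dns",         "agent_minutes": 6, "human_minutes": 50},
    {"agent_rca": "disk", "human_rca": "cert expiry", "agent_minutes": 5, "human_minutes": 20},
]
m = pilot_metrics(incidents)
```

The disagreement cases are the most valuable output of the pilot: each one is either an agent failure to fix or a human miss the agent caught.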

What can go wrong

A short list of failure modes worth pre-mortem-ing.

  • Prompt drift. A model upgrade silently changes agent behavior. Pin model versions in pilot; gate upgrades on a regression suite of past incidents.
  • Tool misuse. Agent runs the wrong cloud account, the wrong cluster, or a destructive subcommand. Mitigated by sandboxing and RBAC, but not eliminated — keep traces auditable.
  • Hallucinated identifiers. Agent cites a pod or resource that doesn't exist. Usually a sign of insufficient retrieval or a stale infrastructure graph; fix the graph, not the prompt.
  • Token cost runaway. Long investigations on busy incidents can produce surprisingly large bills. Budget aggressively and alert on cost as you would on error rate.
  • Over-trust. The agent produces an RCA that reads convincingly but is wrong on a load-bearing detail. The cure is trace review; the prevention is RAG investment and conservative AICL placement.
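The prompt-drift mitigation above — gating model upgrades on a regression suite of past incidents — can be sketched concretely. The replay function and golden cases here are hypothetical stand-ins for real agent runs against your incident history.

```python
# Gate a model upgrade: replay golden incidents under the candidate model
# and block the upgrade if agreement with known-good RCAs drops.

GOLDEN = [  # past incidents with human-verified RCAs (illustrative)
    ("checkout 5xx spike", "oom after memory-limit change"),
    ("login latency", "redis connection pool exhausted"),
    ("batch job stuck", "stale pvc"),
]

def gate_upgrade(replay, threshold=0.9):
    """replay(incident) -> the agent's RCA under the candidate model."""
    hits = sum(1 for incident, expected in GOLDEN if replay(incident) == expected)
    rate = hits / len(GOLDEN)
    return rate >= threshold, rate

# A candidate model that regresses on one golden case fails a 0.9 gate.
answers = {
    "checkout 5xx spike": "oom after memory-limit change",
    "login latency": "redis connection pool exhausted",
    "batch job stuck": "node pressure",          # regression
}
ok, rate = gate_upgrade(lambda incident: answers[incident])
```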

Where Aurora fits

We build Aurora — Apache-2.0, self-hosted, multi-cloud agentic incident investigation. It runs L4 today; the Aurora Actions feature extends to L5 closed-loop work through scheduled and post-incident automations that propose or, with org-level approval, apply remediations. If you're evaluating the category, we're one of the options to test. Whatever you pick should give you a readable trace, a credible sandbox, and a license that doesn't trap you — those criteria narrow the field whether you choose Aurora or not.

Tags: AI SRE, Incident Investigation, Root Cause Analysis, Agentic AI, AIOps, DORA, LangGraph, Aurora

