AI-Powered Incident Investigation: The Complete Guide for SRE Teams (2026)
AI-powered incident investigation is an LLM agent that runs tools, queries infrastructure, and reasons over evidence — not stream-correlation AIOps. The 2026 landscape, architecture, and pilot plan.
Key Takeaways
- AI-powered incident investigation means an LLM agent that runs tools, queries infrastructure, and reasons over evidence in multiple steps — not stream-correlation AIOps. The distinction is structural: traditional AIOps clusters events; an investigation agent runs kubectl, queries metrics, searches knowledge bases, and updates its hypotheses as findings arrive.
- We propose the AI Investigation Capability Ladder (AICL). Six tiers: L0 (manual), L1 (alert correlation), L2 (LLM-summarized timeline), L3 (single-shot LLM diagnosis), L4 (agentic multi-step investigation), L5 (closed-loop investigate + remediate with human approval).
- CNCF now hosts two open-source agentic projects in this lane. HolmesGPT entered the CNCF Sandbox in October 2025. K8sGPT has been Sandbox since December 19, 2023. Aurora (Apache 2.0, self-hosted) is the third major open-source option and the only one that spans AWS, Azure, GCP, OVH, Scaleway, and Kubernetes in a single deployment.
- DORA now measures recovery as Failed Deployment Recovery Time (FDRT). Per DORA's metrics history, FDRT replaced "MTTR" as the official term in 2023 because MTTR had grown ambiguous. The 2024 DORA State of DevOps Report PDF added "deployment rework rate" as a fifth core measure.
- The closed-source peer set is well-funded. Resolve.ai raised $125M at a $1B valuation in February 2026. Traversal reports 32% MTTR reduction and 82% RCA accuracy at American Express across 250 billion log lines per day. Cleric, Neubird, Causely, and Ciroos round out the category.
Cloud incidents in 2026 surface faster than humans can investigate them. AI-powered incident investigation is a system in which a large language model runs as an agent — calling infrastructure tools, querying logs and metrics, traversing dependency graphs, and reasoning over evidence across multiple steps to produce a root-cause analysis. Unlike traditional AIOps, which clusters events and ranks suspects from streams it already has, an investigation agent goes and gets new evidence: it shells into a sandboxed pod, runs kubectl describe, hits the cloud API, reads the relevant CI/CD pipeline, then re-plans its next step based on what it found.
This guide is for SRE, platform, and DevOps leaders evaluating where to invest their incident-response automation budget in 2026. We cover what the category looks like, how the open-source and commercial offerings actually differ, the standards bodies tracking outcomes, and how to pilot a tool without betting the farm.
What "investigation" means here
Three things blur together when people say "AI incident response":
- Alert correlation — clustering related events to reduce noise. PagerDuty AIOps, BigPanda, Moogsoft (now Dell APEX), Dynatrace Davis, Splunk ITSI Event iQ. This is mature ML; not investigation.
- Postmortem generation — drafting an incident report from artifacts that already exist (Slack transcript, alert timeline, monitor data). Rootly, incident.io, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered separately in our Automated Post-Mortem Generation guide.
- Agentic investigation — an LLM that runs new tool calls during the incident to gather evidence it doesn't already have. Aurora, HolmesGPT, K8sGPT, Cleric, Resolve.ai, Traversal, Neubird Hawkeye, Causely. This is the category this post is about.
Conflating them produces bad evaluations. A team that picks a postmortem generator expecting it to find root cause will be disappointed; a team that picks an AIOps correlator expecting it to run kubectl will be even more disappointed.
The AI Investigation Capability Ladder (AICL)
Six tiers of increasing autonomy. Pick the tier you can defend operationally: moving up a tier is an engineering investment; staying lower is a process choice.
| Tier | What runs | Human role | Representative tools |
|---|---|---|---|
| L0 — Manual | Engineer reads alerts, runs kubectl and cloud CLIs by hand | Everything | PagerDuty, Slack, Datadog |
| L1 — Alert correlation | ML correlator clusters and dedupes events | Triage from a smaller list | PagerDuty AIOps, BigPanda, Dell APEX (Moogsoft), Splunk ITSI |
| L2 — LLM-summarized timeline | LLM summarizes an event stream into prose | Reads summary instead of raw events | Datadog Bits AI summaries, incident.io Scribe |
| L3 — Single-shot LLM diagnosis | LLM produces an RCA from one prompt over alert + telemetry | Trusts a single inference | K8sGPT analyzers, vendor "AI insights" buttons |
| L4 — Agentic multi-step investigation | LLM agent calls many tools across multiple turns, replans as findings arrive | Reviews trace, ships fix | Aurora, HolmesGPT, Cleric, Resolve.ai, Traversal, Neubird, Causely |
| L5 — Closed-loop investigate + remediate | Agent investigates and proposes (or applies, with approval) a fix | Approves remediation | Aurora + Aurora Actions, Resolve.ai, ServiceNow Now Assist SRE |
The honest framing: most teams are L0 or L1 today. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of survey respondents don't use AI in CI/CD workflows at all — a useful proxy for the broader DevOps stack. Investigation lags even further because it requires giving an agent infrastructure permissions, which makes security review harder than for build-time AI.
Traditional AIOps vs agentic investigation
Both are useful; they cover non-overlapping work.
| Capability | Traditional AIOps (L1) | Agentic investigation (L4) |
|---|---|---|
| Input | Event stream, telemetry already ingested | Same, plus live tool calls |
| Output | Ranked suspects, correlated incidents | RCA narrative, evidence chain, suggested fix |
| New evidence | No — operates on what's already in the system | Yes — agent issues new commands |
| Reasoning | ML clustering / topology distance scoring | LLM step-by-step (ReAct or similar) |
| Why it can be wrong | Missing event, weak topology graph | Hallucination, tool misuse, prompt drift |
| Cost model | Per-event or per-host | Per LLM token + tool runtime |
| Failure mode | Quiet — wrong cluster, you don't know | Loud — agent's trace is human-readable |
Most production deployments will run both. AIOps reduces the alert volume the agent has to investigate; the agent does the deep work AIOps cannot. Vendor-neutral evidence of this stacking pattern is showing up in 2025–2026 product announcements: PagerDuty's SRE Agent layers an agentic loop on top of its existing AIOps; Splunk's ITSI Episode Summarization (announced at .CONF25, September 2025) layers an LLM summary on top of its KPI engine.
The agentic peer set in 2026
This is the actual decision the buyer faces. Apache-2.0 open source vs commercial, single cloud vs multi-cloud, in-cluster vs cross-system, with or without RAG.
| Product | License | Scope | Notes |
|---|---|---|---|
| Aurora | Apache 2.0 | AWS, Azure, GCP, OVH, Scaleway, Kubernetes + 30+ integrations | LangGraph-orchestrated ReAct agent. Memgraph-backed dependency graph used by the alert correlator; Weaviate hybrid (BM25 + vector) RAG over runbooks and past postmortems. Self-hosted via Docker Compose or Helm. |
| HolmesGPT | Apache 2.0 | Cloud-native, Kubernetes-first; AWS, GCP, Oracle Cloud, OpenShift toolsets | Co-maintained by Robusta and Microsoft. CNCF Sandbox since October 2025. Read-only, RBAC-respecting; posts findings back to Slack / PagerDuty / Jira. |
| K8sGPT | Apache 2.0 | Kubernetes resource diagnostics | CNCF Sandbox since December 19, 2023. Analyzer-based — closer to L3 than L4 in our ladder. |
| Cleric.ai | Closed source | Slack-first AI SRE | Gartner Cool Vendor 2025. Integrates Datadog and Grafana. |
| Resolve.ai | Closed source | Multi-cloud AI SRE | $125M Series A at $1B valuation in February 2026. Founded by Spiros Xanthos and Mayank Agarwal, ex-Splunk. |
| Traversal | Closed source | "Causal search engine" for production systems | $48M Sequoia/Kleiner round (June 2025). Reports 32% MTTR reduction and 82% RCA accuracy in production at American Express. |
| Neubird Hawkeye | Closed source | Llama 3.2 70B fine-tuned + ChromaDB RAG | SaaS or VPC, SOC-2. Integrates Datadog, Splunk, CloudWatch, PagerDuty, ServiceNow. |
| Causely | Closed source | Causal-graph reasoner for Kubernetes | Gartner Cool Vendor 2025. MCP server. Gemini-powered. |
| Ciroos.AI | Closed source | "SRE Teammate" multi-agent | MCP and A2A architecture. |
If you need a self-hosted, multi-cloud, Apache-2.0 option, Aurora is the broadest. If you live entirely inside Kubernetes and want a CNCF-blessed option, HolmesGPT is the strong choice. K8sGPT is the lightweight diagnostic pre-step. The closed-source options trade source availability for managed-service ergonomics and (in Resolve.ai and Traversal's cases) a lot of recent capital.
For a deeper open-source-only comparison, see our Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide.
Architecture: what makes investigation "agentic"?
Five components show up in every credible agentic-investigation product. If a tool is missing more than one, it sits below L4 on the AICL.
1. A tool-calling loop (ReAct or similar)
The agent issues a tool call, sees the result, decides the next call, and continues until it has enough evidence. This is the ReAct pattern (Reason + Act, Yao et al. 2022). Aurora's implementation is a single-node LangGraph workflow wrapping langchain.agents.create_agent; the agent decides at every step whether to invoke a tool or finalize the RCA. HolmesGPT uses a similar pattern with its own toolset registry. The choice between LangGraph, LangChain, AutoGen, or a custom loop is implementation detail — what matters is multi-turn tool use.
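To make the loop concrete, here is a minimal sketch of multi-turn tool use. Aurora wraps this pattern in LangGraph; the sketch below uses the plain OpenAI Python SDK only to show the control flow. The single kubectl_describe tool, the model string, and the step limit are illustrative assumptions, not any vendor's actual implementation.

```python
# Minimal ReAct-style investigation loop. Assumptions: the OpenAI Python SDK,
# one read-only kubectl tool, and the "gpt-4o" model string; all illustrative.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

def kubectl_describe(resource: str, name: str, namespace: str) -> str:
    """Read-only evidence gathering: describe one Kubernetes resource."""
    out = subprocess.run(
        ["kubectl", "describe", resource, name, "-n", namespace],
        capture_output=True, text=True, timeout=30,
    )
    return out.stdout or out.stderr

TOOLS = {"kubectl_describe": kubectl_describe}
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "kubectl_describe",
        "description": "Describe a Kubernetes resource (read-only).",
        "parameters": {
            "type": "object",
            "properties": {
                "resource": {"type": "string"},
                "name": {"type": "string"},
                "namespace": {"type": "string"},
            },
            "required": ["resource", "name", "namespace"],
        },
    },
}]

def investigate(alert_text: str, max_steps: int = 8) -> str:
    messages = [
        {"role": "system", "content": "You are an SRE investigation agent. "
         "Gather evidence with tools, then produce a root-cause analysis."},
        {"role": "user", "content": alert_text},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOL_SPECS)
        msg = resp.choices[0].message
        if not msg.tool_calls:           # the agent decided it has enough evidence
            return msg.content
        messages.append(msg)             # keep the agent's reasoning in history
        for call in msg.tool_calls:      # run each requested tool, feed results back
            args = json.loads(call.function.arguments)
            result = TOOLS[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result[:8000]})
    return "Step limit reached without a final RCA."
```

The defining property is the loop itself: each tool result lands back in the message history, and the model chooses the next call (or the final RCA) with that evidence in view.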
2. Tool reach across the stack
An investigation agent that can only read Kubernetes will miss every multi-cloud incident. Tool reach matters more than algorithmic sophistication. Aurora exposes 30+ integrations covering cloud CLIs (AWS, Azure, GCP, OVH, Scaleway, Cloudflare), Kubernetes, Terraform, Docker, monitoring (Datadog, Grafana, NewRelic, OpsGenie, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, incident.io), logging (Splunk), CI/CD (Jenkins, Spinnaker, CloudBees), code (GitHub, Bitbucket), and docs (Confluence, Notion, SharePoint, Jira). A fully connected instance surfaces 100+ discrete tool callables to the agent.
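Below is a sketch of how that breadth tends to show up in code, assuming a simple registry in which each integration contributes a few read-only callables and the agent receives the flattened union. The two CLI wrappers and the registry shape are illustrative, not Aurora's actual API.

```python
# Each integration registers a handful of read-only callables; the agent is
# handed the flat union of whatever integrations are enabled.
import subprocess
from typing import Callable

REGISTRY: dict[str, list[Callable[..., str]]] = {}

def tool(integration: str):
    """Register a callable under its integration name."""
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        REGISTRY.setdefault(integration, []).append(fn)
        return fn
    return wrap

def _run(args: list[str]) -> str:
    out = subprocess.run(args, capture_output=True, text=True, timeout=60)
    return out.stdout or out.stderr

@tool("aws")
def describe_ec2_instance(instance_id: str, region: str) -> str:
    """Read-only EC2 lookup via the AWS CLI."""
    return _run(["aws", "ec2", "describe-instances",
                 "--instance-ids", instance_id, "--region", region])

@tool("azure")
def show_vm(resource_group: str, name: str) -> str:
    """Read-only VM lookup via the Azure CLI."""
    return _run(["az", "vm", "show", "--resource-group", resource_group, "--name", name])

def tools_for(enabled: list[str]) -> list[Callable[..., str]]:
    """Flatten the enabled integrations into the single tool list offered to the agent."""
    return [fn for integration in enabled for fn in REGISTRY.get(integration, [])]
```

The design consequence is that adding a cloud is additive: new callables join the registry without touching the agent loop.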
3. Sandboxed CLI execution
Letting the agent run kubectl and cloud CLIs raises the obvious concern: arbitrary command execution. Aurora's architecture wraps every command in a four-layer safety pipeline before the command leaves the planner:
- Prompt-injection input rail (NVIDIA NeMo Guardrails) blocks commands that originate from injected instructions.
- Static signature match against 37 vendored SigmaHQ detection rules covering known-malicious command patterns.
- Per-org command policy — allow/deny lists scoped to the customer's tenant.
- LLM safety judge adapted from Meta's PurpleLlama AlignmentCheck.
Approved commands execute via kubectl exec into ephemeral terminal pods inside an "untrusted" Kubernetes namespace. See our AI Agent kubectl Safety guide for the full threat model.
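A simplified sketch of that gating sequence follows. Each real component (NeMo Guardrails, the SigmaHQ signatures, per-org policy, the PurpleLlama-style judge) is stubbed as a predicate; only the control flow, in which every gate must pass before a command reaches the sandbox, follows the description above.

```python
# Four-gate command vetting, with each real gate stubbed for illustration.
import re
from dataclasses import dataclass

@dataclass
class Verdict:
    allowed: bool
    gate: str
    reason: str = ""

DENY_PATTERNS = [r"\brm\s+-rf\b", r"\bkubectl\s+delete\b", r"--force\b"]  # illustrative only

def injection_rail(cmd: str, source_prompt: str) -> bool:
    # Stand-in for the prompt-injection input rail (NeMo Guardrails in Aurora).
    return "ignore previous instructions" not in source_prompt.lower()

def signature_match(cmd: str) -> bool:
    # Stand-in for the vendored SigmaHQ detection rules.
    return not any(re.search(p, cmd) for p in DENY_PATTERNS)

def org_policy(cmd: str, allowlist: list[str]) -> bool:
    # Per-org allow list: the command's binary must be explicitly permitted.
    parts = cmd.split()
    return bool(parts) and parts[0] in allowlist

def llm_safety_judge(cmd: str) -> bool:
    # Stand-in for the PurpleLlama-style safety judge; always defers in this sketch.
    return True

def vet(cmd: str, source_prompt: str, allowlist: list[str]) -> Verdict:
    gates = [
        ("injection_rail", lambda: injection_rail(cmd, source_prompt)),
        ("signature_match", lambda: signature_match(cmd)),
        ("org_policy", lambda: org_policy(cmd, allowlist)),
        ("llm_safety_judge", lambda: llm_safety_judge(cmd)),
    ]
    for name, gate in gates:
        if not gate():
            return Verdict(False, name, f"blocked by {name}")
    return Verdict(True, "all_gates_passed")

# An approved command would then run via kubectl exec in an ephemeral pod
# inside the untrusted namespace, not on the worker itself.
print(vet("kubectl describe pod api-7f9c -n prod", "investigate OOM alert", ["kubectl"]))
```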
4. Retrieval over organizational memory
The agent's first move on most investigations should be checking whether a similar incident has happened before. Aurora uses Weaviate for hybrid (BM25 + vector) search over runbooks, past postmortems, and Aurora Learn — a corpus of past good RCAs that get injected as context for new investigations. HolmesGPT supports RAG over runbooks via its toolset system. K8sGPT does not have a first-class RAG layer.
The honest measurement: RAG quality dominates accuracy on incidents that have happened before. Sparse-only retrieval misses semantic recall; dense-only retrieval misses literal identifier matches. Hybrid wins, and is now table stakes.
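A minimal retrieval sketch, assuming the Weaviate v4 Python client, a local instance, and a "Postmortem" collection with "title" and "summary" properties (the collection name and schema are assumptions, not Aurora's actual layout). The alpha parameter blends keyword and vector scores: 0 is pure BM25, 1 is pure vector search.

```python
# Hybrid (BM25 + vector) search over past postmortems. Requires a running
# Weaviate instance and an existing "Postmortem" collection; both the
# collection name and its properties are assumptions for illustration.
import weaviate

client = weaviate.connect_to_local()
try:
    postmortems = client.collections.get("Postmortem")
    result = postmortems.query.hybrid(
        query="checkout-api OOMKilled after deploy",  # the incident at hand
        alpha=0.5,   # equal weight for literal identifiers and semantic matches
        limit=5,
    )
    for obj in result.objects:
        print(obj.properties["title"], ":", obj.properties["summary"][:120])
finally:
    client.close()
```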
5. Infrastructure topology
An LLM that doesn't know that service A depends on database B will mis-attribute symptoms. Aurora uses Memgraph as a live dependency graph populated by an infrastructure-discovery pipeline; the topology is consulted by Aurora's alert correlator before the agent runs, and dependency context surfaces into the agent's working set through retrieval results tagged "[Auto-Discovery]". The agent does not Cypher-query the graph directly during an incident — it reads digested dependency context the way a human SRE would read a service map.
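For illustration, here is how a correlator or discovery pipeline could pull a dependency neighborhood out of Memgraph and digest it into the kind of context string the agent later reads. Memgraph speaks the Bolt protocol, so the standard neo4j Python driver works; the Service label, the DEPENDS_ON edge, and the exact context framing are schema assumptions, not Aurora's actual model.

```python
# Pull downstream dependencies for an alerting service and render them as a
# digest the agent can read. Schema (Service nodes, DEPENDS_ON edges) is assumed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))  # local Memgraph, no auth

def dependency_context(service: str, depth: int = 2) -> str:
    # Cypher cannot parametrize variable-length bounds, hence the format string.
    query = (
        f"MATCH (s:Service {{name: $name}})-[:DEPENDS_ON*1..{depth}]->(d:Service) "
        "RETURN DISTINCT d.name AS dep"
    )
    with driver.session() as session:
        deps = [record["dep"] for record in session.run(query, name=service)]
    return f"[Auto-Discovery] {service} depends on: {', '.join(deps) or 'nothing recorded'}"

print(dependency_context("checkout-api"))
driver.close()
```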
What the DORA and VOID anchors actually say
Two industry sources are worth grounding the investigation conversation in.
DORA — Failed Deployment Recovery Time. Per DORA's metrics history, the original "Mean Time to Recover" metric was renamed Failed Deployment Recovery Time (FDRT) in 2023 because MTTR had grown ambiguous in industry usage. FDRT measures recovery from change-induced failures specifically — the place where investigation speed matters most. The 2024 DORA State of DevOps Report PDF further refined the metric set, adding "deployment rework rate" as a fifth core measure.
VOID — incident reality, not vendor claims. The Verica Open Incident Database catalogs public incident reports. The 2nd Annual VOID Report (December 2022) reviewed approximately 10,000 incidents from 600+ organizations and concluded that MTTR is unreliable as a comparison metric across organizations and that only about 25% of public incident reports clearly identify a root cause. The implication for buyers: outcome metrics like "MTTR reduced X%" should be interpreted carefully, including when vendors quote them. The 2024 DORA report itself notes that AI adoption correlated with a 1.5% throughput decrease and 7.2% stability decrease in the 2024 cohort — a counterintuitive finding that has driven careful 2026 research into where AI helps and where it doesn't.
An evaluation scorecard for AI investigation tools
Treat this as the rubric for a paid PoC. Each row matters more than vendor demos suggest.
- Multi-step tool use. Trace one incident end to end — does the agent call more than one tool, and does each subsequent call depend on the previous result? If not, you're at L3, not L4.
- Cloud scope. Match the agent's supported clouds to your real footprint. Multi-cloud is the most common reason a single-cloud investigation tool gets ripped out.
- Sandboxing and RBAC. Read the tool's command-execution architecture. If the agent runs commands directly from a worker pod with broad cluster credentials, model the blast radius.
- RAG quality. Ingest 50 of your real past postmortems and 20 runbooks. Then run a real recurring incident type. Did the agent retrieve the right historical material?
- Trace readability. Have a non-ML engineer read the agent's trace for one incident. Could they tell what it tried, what it found, and why it concluded what it did?
- Cost and rate-limit headroom. Long agentic investigations are token-expensive. Budget the LLM bill at 10x typical and stress-test rate limits against your busiest incident week, not a quiet one. A rough sizing sketch follows this list.
- Open source vs SaaS posture. If you handle regulated workloads, self-hosting is not optional. Apache-2.0 projects (Aurora, HolmesGPT, K8sGPT) protect you against vendor lock-in.
- Where it sits on the AICL. Decide up front whether you want L4 (recommended) or L5 (apply, with approval). Most regulated teams pilot at L4 and stay there for the first year.
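A back-of-the-envelope sizing sketch for the cost row above. Every number is a placeholder to be replaced with your provider's token prices and your own observed investigation sizes.

```python
# Rough weekly LLM cost estimate for agentic investigations. All figures are
# placeholders; substitute your provider's pricing and your observed traces.
input_price_per_mtok = 3.00     # USD per million input tokens (placeholder)
output_price_per_mtok = 15.00   # USD per million output tokens (placeholder)

tokens_in_per_step = 30_000     # prompt plus accumulated tool output, averaged per step
tokens_out_per_step = 800       # reasoning plus the next tool call
steps_per_investigation = 15
investigations_per_week = 40    # size for your busiest incident week

cost_per_step = (tokens_in_per_step / 1e6) * input_price_per_mtok \
    + (tokens_out_per_step / 1e6) * output_price_per_mtok
weekly_cost = investigations_per_week * steps_per_investigation * cost_per_step
print(f"estimated weekly LLM bill: ${weekly_cost:,.2f}")  # then budget ~10x for headroom
```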
How to run a low-risk pilot
- Pick one alert source and one cluster. PagerDuty + one Kubernetes cluster, or Datadog + one service group. Resist the urge to install across the org on day one.
- Run read-only for at least four weeks. Compare the agent's RCA to the human RCA on every incident. Track agreement rate, time to RCA, and how often the agent surfaced a finding the human missed (or vice versa).
- Ingest your historical context. Load past postmortems and runbooks into the agent's knowledge base. This is the single biggest accuracy lever, and most teams underinvest in it. Plan a week for the ingestion alone.
- Add one chat channel and one slash command. Engineers should be able to ask the agent follow-up questions about the incident interactively. This is where the L4 → L5 trust curve gets built.
- Review traces weekly. Spend an hour a week reading the agent's tool-call traces. Look for tool misuse, excessive retries, and hallucinated identifiers. Iterate on prompts or RAG content as needed.
- Promote to alert-triggered investigation when the trace is clean for two consecutive weeks. Webhook from PagerDuty / Datadog / Grafana / incident.io straight into the agent. The investigation is now happening before the on-call has opened their laptop. A minimal receiver sketch follows this list.
- Decide on L5 (remediation) only after three months at clean L4. Closed-loop remediation is a separate trust escalation. Most teams do it through pull requests with human approval — Aurora's Aurora Actions feature is the open-source pattern for this.
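To illustrate the alert-triggered hand-off in step 6, here is a minimal webhook receiver that accepts an alert payload and kicks off an investigation in the background. The endpoint path, payload fields, and the investigate() stub are illustrative assumptions, not any vendor's actual webhook schema; real integrations should verify the webhook signature before trusting the payload.

```python
# Minimal alert webhook receiver (FastAPI). The payload fields and investigate()
# hand-off are illustrative; verify the sender's signature in a real deployment.
from fastapi import BackgroundTasks, FastAPI, Request

app = FastAPI()

def investigate(alert_summary: str) -> None:
    # Hand off to the agent loop (see the ReAct sketch earlier in this guide).
    print(f"starting investigation for: {alert_summary}")

@app.post("/webhooks/alerts")
async def receive_alert(request: Request, background: BackgroundTasks):
    payload = await request.json()
    summary = payload.get("summary") or payload.get("title") or "unknown alert"
    background.add_task(investigate, summary)   # do not block the webhook response
    return {"status": "accepted"}
```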
What can go wrong
A short list of failure modes worth pre-mortem-ing.
- Prompt drift. A model upgrade silently changes agent behavior. Pin model versions in pilot; gate upgrades on a regression suite of past incidents. A regression-gate sketch follows this list.
- Tool misuse. Agent runs the wrong cloud account, the wrong cluster, or a destructive subcommand. Mitigated by sandboxing and RBAC, but not eliminated — keep traces auditable.
- Hallucinated identifiers. Agent cites a pod or resource that doesn't exist. Usually a sign of insufficient retrieval or a stale infrastructure graph; fix the graph, not the prompt.
- Token cost runaway. Long investigations on busy incidents can produce surprisingly large bills. Budget aggressively and alert on cost as you would on error rate.
- Over-trust. The agent produces an RCA that reads convincingly but is wrong on a load-bearing detail. The cure is trace review; the prevention is RAG investment and conservative AICL placement.
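A sketch of that regression gate, referenced in the prompt-drift bullet above: replay a fixture set of past incidents and refuse the upgrade if RCA agreement drops against the pinned model. The run_agent callable and the fixture format (alert text plus expected root cause) are assumptions about your own harness, not a specific product's API.

```python
# Model-upgrade regression gate over a fixture set of past incidents.
import json
from typing import Callable

def agreement_rate(run_agent: Callable[[str, str], str],
                   model: str, fixtures: list[dict]) -> float:
    hits = 0
    for case in fixtures:
        rca = run_agent(model, case["alert"])
        # Crude check: the known root cause should appear in the produced RCA.
        hits += int(case["expected_root_cause"].lower() in rca.lower())
    return hits / len(fixtures)

def gate_upgrade(run_agent: Callable[[str, str], str],
                 pinned: str, candidate: str,
                 fixture_path: str, tolerance: float = 0.05) -> None:
    with open(fixture_path) as f:
        fixtures = json.load(f)
    baseline = agreement_rate(run_agent, pinned, fixtures)
    challenger = agreement_rate(run_agent, candidate, fixtures)
    if challenger < baseline - tolerance:
        raise SystemExit(
            f"refusing upgrade: {candidate} scored {challenger:.0%} "
            f"vs {pinned} baseline {baseline:.0%}")
    print(f"upgrade approved: {challenger:.0%} vs baseline {baseline:.0%}")
```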
Where Aurora fits
We build Aurora — Apache-2.0, self-hosted, multi-cloud agentic incident investigation. It runs L4 today; the Aurora Actions feature extends to L5 closed-loop work through scheduled and post-incident automations that propose or, with org-level approval, apply remediations. If you're evaluating the category, we're one of the options to test. Whatever you pick should give you a readable trace, a credible sandbox, and a license that doesn't trap you — those criteria narrow the field whether you choose Aurora or not.
- GitHub: github.com/Arvo-AI/aurora
- Docs: arvo-ai.github.io/aurora
- Related guides: Aurora Actions · Automated Post-Mortem Generation · CI/CD Auto-Remediation · Open-Source AI SRE comparison