
AI-Powered Incident Investigation: The Complete Guide for SRE Teams (2026)

AI-powered incident investigation is an LLM agent that runs tools, queries infrastructure, and reasons over evidence — not stream-correlation AIOps. The 2026 landscape, architecture, and pilot plan.

By Noah Casarotto-Dinning, CEO at Arvo AI

Key Takeaways

  • AI-powered incident investigation means an LLM agent that runs tools, queries infrastructure, and reasons over evidence in multiple steps — not stream-correlation AIOps. The distinction is structural: traditional AIOps clusters events; an investigation agent runs kubectl, queries metrics, searches knowledge bases, and updates its hypotheses as findings arrive.
  • We propose the AI Investigation Capability Ladder (AICL). Six tiers: L0 (manual), L1 (alert correlation), L2 (LLM-summarized timeline), L3 (single-shot LLM diagnosis), L4 (agentic multi-step investigation), L5 (closed-loop investigate + remediate with human approval).
  • CNCF now hosts two open-source agentic projects in this lane. HolmesGPT entered the CNCF Sandbox in October 2025. K8sGPT has been Sandbox since December 19, 2023. Aurora (Apache 2.0, self-hosted) is the third major open-source option and the only one that spans AWS, Azure, GCP, OVH, Scaleway, and Kubernetes in a single deployment.
  • DORA now measures recovery time as Failed Deployment Recovery Time (FDRT). Per DORA's metrics history, FDRT replaced "MTTR" as the official term in 2023 because MTTR had grown ambiguous; the 2024 DORA State of DevOps Report further added "deployment rework rate" as a fifth core measure.
  • The closed-source peer set is well-funded. Resolve.ai raised $125M at a $1B valuation in February 2026. Traversal reports 32% MTTR reduction and 82% RCA accuracy at American Express across 250 billion log lines per day. Cleric, Neubird, Causely, and Ciroos round out the category.

Cloud incidents in 2026 surface faster than humans can investigate them. AI-powered incident investigation is a system in which a large language model runs as an agent — calling infrastructure tools, querying logs and metrics, traversing dependency graphs, and reasoning over evidence across multiple steps to produce a root-cause analysis. Unlike traditional AIOps, which clusters events and ranks suspects from streams it already has, an investigation agent goes and gets new evidence: it shells into a sandboxed pod, runs kubectl describe, hits the cloud API, reads the relevant CI/CD pipeline, then re-plans its next step based on what it found.

This guide is for SRE, platform, and DevOps leaders evaluating where to invest their incident-response automation budget in 2026. We cover what the category looks like, how the open-source and commercial offerings actually differ, the standards bodies tracking outcomes, and how to pilot a tool without betting the farm.

What "investigation" means here

Three things blur together when people say "AI incident response":

  1. Alert correlation — clustering related events to reduce noise. PagerDuty AIOps, BigPanda, Moogsoft (now Dell APEX), Dynatrace Davis, Splunk ITSI Event iQ. This is mature ML; not investigation.
  2. Postmortem generation — drafting an incident report from artifacts that already exist (Slack transcript, alert timeline, monitor data). Rootly, incident.io, FireHydrant, Datadog Bits AI, PagerDuty Scribe. Covered separately in our Automated Post-Mortem Generation guide.
  3. Agentic investigation — an LLM that runs new tool calls during the incident to gather evidence it doesn't already have. Aurora, HolmesGPT, K8sGPT, Cleric, Resolve.ai, Traversal, Neubird Hawkeye, Causely. This is the category this post is about.

Conflating them produces bad evaluations. A team that picks a postmortem generator expecting it to find root cause will be disappointed; a team that picks an AIOps correlator expecting it to run kubectl will be even more disappointed.

The AI Investigation Capability Ladder (AICL)

Six tiers, increasing autonomy. Pick the tier you can defend operationally: climbing higher is an engineering investment; staying lower is a process choice.

| Tier | What runs | Human role | Representative tools |
| --- | --- | --- | --- |
| L0 — Manual | Engineer reads alerts, runs kubectl and cloud CLIs by hand | Everything | PagerDuty, Slack, Datadog |
| L1 — Alert correlation | ML correlator clusters and dedupes events | Triage from a smaller list | PagerDuty AIOps, BigPanda, Dell APEX (Moogsoft), Splunk ITSI |
| L2 — LLM-summarized timeline | LLM summarizes an event stream into prose | Reads summary instead of raw events | Datadog Bits AI summaries, incident.io Scribe |
| L3 — Single-shot LLM diagnosis | LLM produces an RCA from one prompt over alert + telemetry | Trusts a single inference | K8sGPT analyzers, vendor "AI insights" buttons |
| L4 — Agentic multi-step investigation | LLM agent calls many tools across multiple turns, replans as findings arrive | Reviews trace, ships fix | Aurora, HolmesGPT, Cleric, Resolve.ai, Traversal, Neubird, Causely |
| L5 — Closed-loop investigate + remediate | Agent investigates and proposes (or applies, with approval) a fix | Approves remediation | Aurora + Aurora Actions, Resolve.ai, ServiceNow Now Assist SRE |

The honest framing: most teams are L0 or L1 today. Per JetBrains' AI Pulse coverage (April 2026), 78.2% of survey respondents don't use AI in CI/CD workflows at all — a useful proxy for the broader DevOps stack. Investigation lags even further because it requires giving an agent infrastructure permissions, which makes security review harder than for build-time AI.

Traditional AIOps vs agentic investigation

Both are useful; they cover non-overlapping work.

| Capability | Traditional AIOps (L1) | Agentic investigation (L4) |
| --- | --- | --- |
| Input | Event stream, telemetry already ingested | Same, plus live tool calls |
| Output | Ranked suspects, correlated incidents | RCA narrative, evidence chain, suggested fix |
| New evidence | No — operates on what's already in the system | Yes — agent issues new commands |
| Reasoning | ML clustering / topology distance scoring | LLM step-by-step (ReAct or similar) |
| Why it can be wrong | Missing event, weak topology graph | Hallucination, tool misuse, prompt drift |
| Cost model | Per-event or per-host | Per LLM token + tool runtime |
| Failure mode | Quiet — wrong cluster, you don't know | Loud — agent's trace is human-readable |

Most production deployments will run both. AIOps reduces the alert volume the agent has to investigate; the agent does the deep work AIOps cannot. Vendor-neutral evidence of this stacking pattern is showing up in 2025–2026 product announcements: PagerDuty's SRE Agent layers an agentic loop on top of its existing AIOps; Splunk's ITSI Episode Summarization (announced at .CONF25, September 2025) layers an LLM summary on top of its KPI engine.

The agentic peer set in 2026

This is the actual decision the buyer faces. Apache-2.0 open source vs commercial, single cloud vs multi-cloud, in-cluster vs cross-system, with or without RAG.

| Product | License | Scope | Notes |
| --- | --- | --- | --- |
| Aurora | Apache 2.0 | AWS, Azure, GCP, OVH, Scaleway, Kubernetes + 30+ integrations | LangGraph-orchestrated ReAct agent. Memgraph-backed dependency graph used by the alert correlator; Weaviate hybrid (BM25 + vector) RAG over runbooks and past postmortems. Self-hosted via Docker Compose or Helm. |
| HolmesGPT | Apache 2.0 | Cloud-native, Kubernetes-first; AWS, GCP, Oracle Cloud, OpenShift toolsets | Co-maintained by Robusta and Microsoft. CNCF Sandbox since October 2025. Read-only, RBAC-respecting; posts findings back to Slack / PagerDuty / Jira. |
| K8sGPT | Apache 2.0 | Kubernetes resource diagnostics | CNCF Sandbox since December 19, 2023. Analyzer-based — closer to L3 than L4 in our ladder. |
| Cleric.ai | Closed source | Slack-first AI SRE | Gartner Cool Vendor 2025. Integrates Datadog and Grafana. |
| Resolve.ai | Closed source | Multi-cloud AI SRE | $125M Series A at $1B valuation in February 2026. Founded by Spiros Xanthos and Mayank Agarwal, ex-Splunk. |
| Traversal | Closed source | "Causal search engine" for production systems | $48M Sequoia/Kleiner round (June 2025). Reports 32% MTTR reduction and 82% RCA accuracy in production at American Express. |
| Neubird Hawkeye | Closed source | Llama 3.2 70B fine-tuned + ChromaDB RAG | SaaS or VPC, SOC-2. Integrates Datadog, Splunk, CloudWatch, PagerDuty, ServiceNow. |
| Causely | Closed source | Causal-graph reasoner for Kubernetes | Gartner Cool Vendor 2025. MCP server. Gemini-powered. |
| Ciroos.AI | Closed source | "SRE Teammate" multi-agent | MCP and A2A architecture. |

If you need a self-hosted, multi-cloud, Apache-2.0 option, Aurora is the broadest. If you live entirely inside Kubernetes and want a CNCF-blessed option, HolmesGPT is the strong choice. K8sGPT is the lightweight diagnostic pre-step. The closed-source options trade source availability for managed-service ergonomics and (in Resolve.ai and Traversal's cases) a lot of recent capital.

For a deeper open-source-only comparison, see our Open-Source AI SRE: Aurora vs HolmesGPT vs K8sGPT guide.

Architecture: what makes investigation "agentic"?

Five components show up in every credible agentic-investigation product. If a tool is missing more than one, it sits below L4 on the AICL.

1. A tool-calling loop (ReAct or similar)

The agent issues a tool call, sees the result, decides the next call, and continues until it has enough evidence. This is the ReAct pattern (Reason + Act, Yao et al. 2022). Aurora's implementation is a single-node LangGraph workflow wrapping langchain.agents.create_agent; the agent decides at every step whether to invoke a tool or finalize the RCA. HolmesGPT uses a similar pattern with its own toolset registry. The choice between LangGraph, LangChain, AutoGen, or a custom loop is implementation detail — what matters is multi-turn tool use.
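The loop itself is simple enough to sketch in a few lines. This is a minimal, self-contained illustration of the ReAct pattern, not Aurora's or HolmesGPT's actual code: the scripted planner below stands in for the LLM, and the tool names and outputs are invented for the example.

```python
# Minimal ReAct-style loop: ask the planner for the next action, run it,
# feed the observation back, repeat until the planner finalizes an RCA.

def investigate(planner, tools, alert, max_steps=8):
    evidence = [f"alert: {alert}"]
    for _ in range(max_steps):
        action = planner(evidence)        # ("tool", name, args) or ("final", rca)
        if action[0] == "final":
            return action[1], evidence
        _, name, args = action
        result = tools[name](args)        # execute the chosen tool
        evidence.append(f"{name}({args}) -> {result}")
    return "inconclusive", evidence

# Scripted stand-in for the LLM: describe the pod, check recent deploys, conclude.
def scripted_planner(evidence):
    if len(evidence) == 1:
        return ("tool", "kubectl_describe", "pod/checkout-7f9")
    if len(evidence) == 2:
        return ("tool", "recent_deploys", "checkout")
    return ("final", "OOMKilled after deploy #482 halved the memory limit")

tools = {
    "kubectl_describe": lambda args: "OOMKilled, restarts=14",
    "recent_deploys": lambda args: "#482 set memory limit 256Mi (was 512Mi)",
}

rca, trace = investigate(scripted_planner, tools, "checkout error rate > 5%")
```

The property that makes this L4 rather than L3 is visible in the control flow: the second tool call only happens because of what the first one returned.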

2. Tool reach across the stack

An investigation agent that can only read Kubernetes will miss every multi-cloud incident. Tool reach matters more than algorithmic sophistication. Aurora exposes 30+ integrations covering cloud CLIs (AWS, Azure, GCP, OVH, Scaleway, Cloudflare), Kubernetes, Terraform, Docker, monitoring (Datadog, Grafana, NewRelic, OpsGenie, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, incident.io), logging (Splunk), CI/CD (Jenkins, Spinnaker, CloudBees), code (GitHub, Bitbucket), and docs (Confluence, Notion, SharePoint, Jira). A fully connected instance surfaces 100+ discrete tool callables to the agent.

3. Sandboxed CLI execution

Letting the agent run kubectl and cloud CLIs raises the obvious concern: arbitrary command execution. Aurora's architecture wraps every command in a four-layer safety pipeline before the command leaves the planner:

  1. Prompt-injection input rail (NVIDIA NeMo Guardrails) blocks commands that originate from injected instructions.
  2. Static signature match against 37 vendored SigmaHQ detection rules covering known-malicious command patterns.
  3. Per-org command policy — allow/deny lists scoped to the customer's tenant.
  4. LLM safety judge adapted from Meta's PurpleLlama AlignmentCheck.

Approved commands execute via kubectl exec into ephemeral terminal pods inside an "untrusted" Kubernetes namespace. See our AI Agent kubectl Safety guide for the full threat model.
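The layering idea generalizes to any agent that executes commands: every proposed command passes a chain of independent checks, and any single failure blocks execution. The sketch below is illustrative only — the deny patterns and allow-list are toy stand-ins, not Aurora's shipped rule set.

```python
# Layered command-safety gate: a command must pass every check before it
# may execute. Rules here are toy examples for illustration.
import re

DENY_SIGNATURES = [r"rm\s+-rf\s+/", r"kubectl\s+delete\s+ns"]   # signature-rule stand-ins
ORG_ALLOWED_PREFIXES = ("kubectl get", "kubectl describe", "kubectl logs")

def signature_check(cmd):
    """Block commands matching known-dangerous patterns."""
    return not any(re.search(p, cmd) for p in DENY_SIGNATURES)

def org_policy_check(cmd):
    """Per-org allow-list: only read-only kubectl verbs pass."""
    return cmd.startswith(ORG_ALLOWED_PREFIXES)

def gate(cmd, checks=(signature_check, org_policy_check)):
    """Return (approved, name_of_failed_check)."""
    for check in checks:
        if not check(cmd):
            return False, check.__name__
    return True, None
```

A real deployment would add the prompt-injection rail before and the LLM safety judge after these static layers; the point is that each layer can veto independently.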

4. Retrieval over organizational memory

The agent's first move on most investigations should be checking whether a similar incident has happened before. Aurora uses Weaviate for hybrid (BM25 + vector) search over runbooks, past postmortems, and Aurora Learn — a corpus of past good RCAs that get injected as context for new investigations. HolmesGPT supports RAG over runbooks via its toolset system. K8sGPT does not have a first-class RAG layer.

The honest measurement: RAG quality dominates accuracy on incidents that have happened before. Sparse-only retrieval misses semantic recall; dense-only retrieval misses literal identifier matches. Hybrid wins, and is now table stakes.
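One common way to combine the two rankings is reciprocal rank fusion (RRF): each document earns a score from its rank in each list, and documents that place well in both rise to the top. This sketch shows the fusion idea only — it is not Weaviate's implementation, and the postmortem IDs are invented.

```python
# Reciprocal rank fusion: merge a keyword (BM25-style) ranking with a
# vector ranking into one ordered list.

def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Sparse search nails the literal identifier; dense search nails the concept.
sparse = ["pm-142-oom-checkout", "pm-091-dns", "pm-077-disk"]
dense  = ["pm-088-memory-leak", "pm-142-oom-checkout", "pm-101-latency"]

fused = rrf([sparse, dense])
# pm-142-oom-checkout places high in both lists, so it wins the fusion
```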

5. Infrastructure topology

An LLM that doesn't know that service A depends on database B will mis-attribute symptoms. Aurora uses Memgraph as a live dependency graph populated by an infrastructure-discovery pipeline; the topology is consulted by Aurora's alert correlator before the agent runs, and dependency context surfaces into the agent's working set through retrieval results tagged "[Auto-Discovery]". The agent does not Cypher-query the graph directly during an incident — it reads digested dependency context the way a human SRE would read a service map.
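The value of topology context is easy to see in miniature: given a dependency map, a symptom on one service implies a suspect list of everything upstream of it. The adjacency map below is a toy example, not Aurora's Memgraph schema.

```python
# Walk upstream from the symptomatic service to find candidate causes.
from collections import deque

DEPENDS_ON = {  # service -> services it calls (toy topology)
    "checkout": ["payments", "inventory"],
    "payments": ["postgres-main"],
    "inventory": ["postgres-main", "redis-cache"],
}

def upstream(service):
    """Breadth-first walk of everything the service transitively depends on."""
    seen, queue = [], deque(DEPENDS_ON.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.append(dep)
            queue.extend(DEPENDS_ON.get(dep, []))
    return seen

# A checkout symptom puts postgres-main on the suspect list automatically.
suspects = upstream("checkout")
```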

What the DORA and VOID anchors actually say

Two industry sources are worth grounding the investigation conversation in.

DORA — Failed Deployment Recovery Time. Per DORA's metrics history, the original "Mean Time to Recover" metric was renamed Failed Deployment Recovery Time (FDRT) in 2023 because MTTR had grown ambiguous in industry usage. FDRT measures recovery from change-induced failures specifically — the place where investigation speed matters most. The 2024 DORA State of DevOps Report PDF further refined the metric set, adding "deployment rework rate" as a fifth core measure.

VOID — incident reality, not vendor claims. The Verica Open Incident Database catalogs public incident reports. The 2nd Annual VOID Report (December 2022) reviewed approximately 10,000 incidents from 600+ organizations and concluded that MTTR is unreliable as a comparison metric across organizations and that only about 25% of public incident reports clearly identify a root cause. The implication for buyers: outcome metrics like "MTTR reduced X%" should be interpreted carefully, especially when vendors quote them. The 2024 DORA report itself notes that AI adoption correlated with a 1.5% throughput decrease and 7.2% stability decrease in the 2024 cohort — a counterintuitive finding that has driven careful 2026 research into where AI helps and where it doesn't.

An evaluation scorecard for AI investigation tools

Treat this as the rubric for a paid PoC. Each row matters more than vendor demos suggest.

  1. Multi-step tool use. Trace one incident end to end — does the agent call more than one tool, and does each subsequent call depend on the previous result? If not, you're at L3, not L4.
  2. Cloud scope. Match the agent's supported clouds to your real footprint. Multi-cloud is the most common reason a single-cloud investigation tool gets ripped out.
  3. Sandboxing and RBAC. Read the tool's command-execution architecture. If the agent runs commands directly from a worker pod with broad cluster credentials, model the blast radius.
  4. RAG quality. Ingest 50 of your real past postmortems and 20 runbooks. Then run a real recurring incident type. Did the agent retrieve the right historical material?
  5. Trace readability. Have a non-ML engineer read the agent's trace for one incident. Could they tell what it tried, what it found, and why it concluded what it did?
  6. Cost and rate-limit headroom. Long agentic investigations are token-expensive. Budget the LLM bill at 10x typical and stress-test rate limits against your busiest incident week, not a quiet one.
  7. Open source vs SaaS posture. If you handle regulated workloads, self-hosting is not optional. Apache-2.0 projects (Aurora, HolmesGPT, K8sGPT) protect you against vendor lock-in.
  8. Where it sits on the AICL. Decide up front whether you want L4 (recommended) or L5 (apply, with approval). Most regulated teams pilot at L4 and stay there for the first year.
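One way to keep a PoC honest is to turn the rubric above into a weighted scorecard and fill it in per candidate. The weights and scores below are illustrative — assign your own per criterion.

```python
# Weighted PoC scorecard over the rubric rows. Score each criterion 0-5.

CRITERIA = {  # weight reflects how much the row matters to your org
    "multi_step_tool_use": 3,
    "cloud_scope": 3,
    "sandboxing_rbac": 2,
    "rag_quality": 2,
    "trace_readability": 1,
    "cost_headroom": 1,
}

def score(tool_scores):
    """Weighted average of 0-5 scores; returns a single 0-5 figure."""
    total_weight = sum(CRITERIA.values())
    return sum(CRITERIA[c] * tool_scores[c] for c in CRITERIA) / total_weight

candidate = {  # hypothetical candidate's scores after a pilot
    "multi_step_tool_use": 5, "cloud_scope": 4, "sandboxing_rbac": 4,
    "rag_quality": 3, "trace_readability": 5, "cost_headroom": 2,
}
```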

How to run a low-risk pilot

  1. Pick one alert source and one cluster. PagerDuty + one Kubernetes cluster, or Datadog + one service group. Resist the urge to install across the org on day one.
  2. Run read-only for at least four weeks. Compare the agent's RCA to the human RCA on every incident. Track agreement rate, time to RCA, and how often the agent surfaced a finding the human missed (or vice versa).
  3. Ingest your historical context. Past postmortems and runbooks into the agent's knowledge base. This is the single biggest accuracy lever, and most teams underinvest in it. Plan a week for the ingestion alone.
  4. Add one chat channel and one slash command. Engineers should be able to ask the agent follow-up questions about the incident interactively. This is where the L4 → L5 trust curve gets built.
  5. Review traces weekly. Spend an hour a week reading the agent's tool-call traces. Look for tool misuse, excessive retries, and hallucinated identifiers. Iterate on prompts or RAG content as needed.
  6. Promote to alert-triggered investigation when the trace is clean for two consecutive weeks. Webhook from PagerDuty / Datadog / Grafana / incident.io straight into the agent. The investigation is now happening before the on-call has opened their laptop.
  7. Decide on L5 (remediation) only after three months at clean L4. Closed-loop remediation is a separate trust escalation. Most teams do it through pull requests with human approval — Aurora's Aurora Actions feature is the open-source pattern for this.
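The read-only phase in step 2 only works if you actually track the numbers. A minimal sketch of the comparison, with invented incident records standing in for your real data:

```python
# Track agreement rate and time saved between agent RCA and human RCA
# during the read-only pilot phase.

def pilot_metrics(incidents):
    """incidents: dicts with agent_rca, human_rca, agent_minutes, human_minutes."""
    n = len(incidents)
    agree = sum(1 for i in incidents if i["agent_rca"] == i["human_rca"])
    saved = sum(i["human_minutes"] - i["agent_minutes"] for i in incidents) / n
    return {"agreement_rate": agree / n, "avg_minutes_saved": saved}

incidents = [  # illustrative pilot data
    {"agent_rca": "oom",  "human_rca": "oom",         "agent_minutes": 4, "human_minutes": 35},
    {"agent_rca": "dns",  "human_rca": "dns",         "agent_minutes": 6, "human_minutes": 50},
    {"agent_rca": "disk", "human_rca": "cert expiry", "agent_minutes": 5, "human_minutes": 20},
]
m = pilot_metrics(incidents)
```

The disagreement cases are the most valuable output of the pilot: each one is either an agent failure to fix or a human miss the agent caught.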

What can go wrong

A short list of failure modes worth pre-mortem-ing.

  • Prompt drift. A model upgrade silently changes agent behavior. Pin model versions in pilot; gate upgrades on a regression suite of past incidents.
  • Tool misuse. Agent runs the wrong cloud account, the wrong cluster, or a destructive subcommand. Mitigated by sandboxing and RBAC, but not eliminated — keep traces auditable.
  • Hallucinated identifiers. Agent cites a pod or resource that doesn't exist. Usually a sign of insufficient retrieval or a stale infrastructure graph; fix the graph, not the prompt.
  • Token cost runaway. Long investigations on busy incidents can produce surprisingly large bills. Budget aggressively and alert on cost as you would on error rate.
  • Over-trust. The agent produces an RCA that reads convincingly but is wrong on a load-bearing detail. The cure is trace review; the prevention is RAG investment and conservative AICL placement.
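The prompt-drift mitigation above — gating model upgrades on a regression suite of past incidents — can be sketched concretely. The replay function and golden cases here are hypothetical stand-ins for real agent runs against your incident history.

```python
# Gate a model upgrade: replay golden incidents under the candidate model
# and block the upgrade if agreement with known-good RCAs drops.

GOLDEN = [  # past incidents with human-verified RCAs (illustrative)
    ("checkout 5xx spike", "oom after memory-limit change"),
    ("login latency", "redis connection pool exhausted"),
    ("batch job stuck", "stale pvc"),
]

def gate_upgrade(replay, threshold=0.9):
    """replay(incident) -> the agent's RCA under the candidate model."""
    hits = sum(1 for incident, expected in GOLDEN if replay(incident) == expected)
    rate = hits / len(GOLDEN)
    return rate >= threshold, rate

# A candidate model that regresses on one golden case fails a 0.9 gate.
answers = {
    "checkout 5xx spike": "oom after memory-limit change",
    "login latency": "redis connection pool exhausted",
    "batch job stuck": "node pressure",          # regression
}
ok, rate = gate_upgrade(lambda incident: answers[incident])
```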

Where Aurora fits

We build Aurora — Apache-2.0, self-hosted, multi-cloud agentic incident investigation. It runs L4 today; the Aurora Actions feature extends to L5 closed-loop work through scheduled and post-incident automations that propose or, with org-level approval, apply remediations. If you're evaluating the category, we're one of the options to test. Whatever you pick should give you a readable trace, a credible sandbox, and a license that doesn't trap you — those criteria narrow the field whether you choose Aurora or not.

Tags: AI SRE, Incident Investigation, Root Cause Analysis, Agentic AI, AIOps, DORA, LangGraph, Aurora

