HolmesGPT Alternative: Multi-Cloud Open Source AI SRE (2026)
HolmesGPT is a great Kubernetes investigator. Aurora is the open-source alternative for multi-cloud reach, sandboxed execution, and a full incident platform.
Key Takeaways
- HolmesGPT is an excellent, CNCF-blessed, Kubernetes-native investigator. It was accepted into the CNCF Sandbox in October 2025, is Apache 2.0, and is maintained by Robusta with major contributions from Microsoft.
- The honest trade-off is reach and action. HolmesGPT is read-only by design and respects RBAC: it runs read-only commands and queries to diagnose, and in Operator mode it can open pull requests with suggested fixes, but it does not perform write or mutating actions against your cloud or cluster.
- Aurora is the alternative when you need multi-cloud and execution. It investigates across AWS, Azure, GCP, OVH, Scaleway, and Kubernetes, and runs 'kubectl', 'aws', 'az', and 'gcloud' in sandboxed Kubernetes pods.
- Aurora is a fuller incident platform. It builds a Memgraph blast-radius graph, generates postmortems that export to Confluence and Notion, and drafts remediation pull requests.
- Both are free, self-hosted, and BYO-LLM. Both support local inference via Ollama for air-gapped environments, so incident telemetry never has to leave your perimeter.
- If you are pure-Kubernetes and want CNCF governance, HolmesGPT may be the better pick. This is not a hit piece. It is a fit guide.
If your incidents live entirely inside Kubernetes, HolmesGPT is one of the strongest open-source AI SRE agents you can deploy in 2026. It is genuinely good, and it carries CNCF governance. The reason teams look for a HolmesGPT alternative is usually one of three things: their infrastructure spans more than Kubernetes, they want the agent to safely run commands rather than only read state, or they need a fuller platform with postmortems and a dependency graph. This post is an honest fit guide, not a takedown. We name HolmesGPT's real strengths, then show where Aurora is the better tool.
What is HolmesGPT?
HolmesGPT is an open-source, Apache 2.0 AI agent for investigating cloud-native incidents, built by Robusta with major contributions from Microsoft, and accepted into the CNCF Sandbox in October 2025. It connects large language models to live observability data through an agentic loop: when you feed it an alert or a question, it iteratively calls tools, gathers data from multiple sources, and builds a root-cause analysis, as described on its GitHub project and the CNCF announcement.
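The agentic loop described above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not HolmesGPT's code: `Investigation`, `run_agentic_loop`, `scripted_planner`, and the two tool names are all hypothetical, and the planner function stands in for the LLM that would normally decide which tool to call next.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Investigation:
    """Accumulates evidence gathered while answering one question."""
    question: str
    evidence: List[str] = field(default_factory=list)


# A tool takes one argument (for example a service name) and returns text.
Tool = Callable[[str], str]
# A planner inspects the investigation and picks (tool_name, argument),
# or returns None when it has enough evidence. In a real agent this is the LLM.
Planner = Callable[[Investigation], Optional[Tuple[str, str]]]


def run_agentic_loop(question: str, tools: Dict[str, Tool],
                     choose_next: Planner, max_steps: int = 5) -> Investigation:
    """Call tools until the planner stops or the step budget is spent."""
    inv = Investigation(question)
    for _ in range(max_steps):
        step = choose_next(inv)
        if step is None:
            break
        name, arg = step
        inv.evidence.append(f"{name}({arg}): {tools[name](arg)}")
    return inv


# Hypothetical read-only tools returning canned data for the demo.
TOOLS: Dict[str, Tool] = {
    "get_pod_status": lambda svc: "CrashLoopBackOff",
    "get_logs": lambda svc: "OOMKilled",
}


def scripted_planner(inv: Investigation) -> Optional[Tuple[str, str]]:
    """Stand-in for the LLM: follow a fixed two-step plan, then stop."""
    plan = [("get_pod_status", "checkout"), ("get_logs", "checkout")]
    return plan[len(inv.evidence)] if len(inv.evidence) < len(plan) else None


if __name__ == "__main__":
    result = run_agentic_loop("Why is checkout down?", TOOLS, scripted_planner)
    for line in result.evidence:
        print(line)
```

The `max_steps` budget is the one design choice worth copying: it bounds how many tool calls a single investigation can make, whatever the planner decides.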
Its design strengths are real and worth naming clearly:
- CNCF Sandbox governance. Vendor-neutral governance and a joint roadmap maintained by Robusta and Microsoft are meaningful for risk-averse buyers.
- Read-only and RBAC-aware. Per the README, it 'has read-only access and respects RBAC permissions' and is described as safe to run in production. That small blast radius is a legitimate selling point.
- Deep observability integrations. It ships with built-in toolsets for Prometheus, Loki, Tempo, Grafana, Datadog, ArgoCD, and many more, and it can consume MCP servers for additional sources.
- Operator mode. It can run in the background, message you in Slack, and, with the GitHub integration, open pull requests with suggested fixes.
- BYO-LLM. It supports OpenAI, Anthropic, Azure OpenAI, Bedrock, Gemini, and self-hosted models, so your data does not have to train anyone's model.
For a clear sense of where HolmesGPT sits versus the other CNCF Sandbox option, see our HolmesGPT vs K8sGPT comparison.
What is Aurora?
Aurora is an open-source, Apache 2.0 AI SRE and incident-management platform from Arvo AI that autonomously investigates incidents across multiple clouds and Kubernetes, executes commands in sandboxed pods, and generates root-cause analyses, postmortems, and remediation PRs. Its agents are LangGraph-orchestrated and, per the Aurora repository, they query AWS, Azure, GCP, OVH, Scaleway, and Kubernetes, then run 'kubectl', 'aws', 'az', and 'gcloud' commands inside sandboxed Kubernetes pods.
Where HolmesGPT stops at read-only diagnosis, Aurora is built to act. It builds a Memgraph infrastructure knowledge graph to model blast radius, suggests code fixes, and can open pull requests with the remediation. It ingests alerts via webhook from eleven monitoring connectors (PagerDuty, Datadog, Grafana, New Relic, OpsGenie, Netdata, Dynatrace, Coroot, ThousandEyes, BigPanda, and incident.io), plus a Slack bot. It is self-hosted, air-gapped capable, and supports local inference via Ollama. Destructive actions are human-gated.
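To make "blast radius" concrete: given a dependency graph, the blast radius of a failed component is everything that depends on it, directly or transitively. The sketch below computes that with a breadth-first walk over an in-memory dictionary. It is an illustration of the idea only. Aurora stores its graph in Memgraph, and nothing here reflects its schema or queries; the component names and the `blast_radius` function are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency edges: each key depends on every item in its list.
DEPENDS_ON: Dict[str, List[str]] = {
    "storefront": ["checkout-api"],
    "checkout-api": ["payments-db", "auth-svc"],
    "auth-svc": ["iam-role"],
    "reporting": ["payments-db"],
}


def blast_radius(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents: Dict[str, Set[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for upstream in dependents.get(current, ()):
            if upstream not in affected:
                affected.add(upstream)
                queue.append(upstream)
    affected.discard(failed)  # only relevant if the graph contains a cycle
    return affected


if __name__ == "__main__":
    print(sorted(blast_radius("iam-role", DEPENDS_ON)))
```

In this toy graph, a broken IAM role reaches the storefront through two intermediate services, which is the kind of cross-layer chain a dependency graph is meant to surface.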
HolmesGPT vs Aurora: the head-to-head
The core difference is scope and action: HolmesGPT is a read-only Kubernetes-first investigator with CNCF governance, while Aurora is a multi-cloud agent that executes commands and ships a fuller incident platform. Both are Apache 2.0, both self-host, and both let you bring your own model. The decision comes down to how much of your world lives outside Kubernetes and whether you want the agent to act, not only advise.
| Dimension | HolmesGPT | Aurora |
|---|---|---|
| License | Apache 2.0 | Apache 2.0 |
| Governance | CNCF Sandbox since Oct 2025, Robusta plus Microsoft | Independent, built by Arvo AI |
| Multi-cloud reach | Kubernetes and cloud-native; native AWS, Azure, GCP and database toolsets, several via MCP | Native AWS, Azure, GCP, OVH, Scaleway, plus Kubernetes |
| Investigation vs execution | Read-only diagnosis, RBAC-aware | Runs kubectl, aws, az, gcloud in sandboxed pods |
| Write and remediation actions | Suggested-fix PRs in Operator mode | Human-gated execution plus remediation PRs |
| Dependency graph | Not a built-in feature | Memgraph blast-radius graph |
| Postmortems | Investigation reports | RCAs and postmortems exported to Confluence and Notion |
| Self-host and air-gap | Self-hosted, BYO-LLM, local inference via Ollama | Self-hosted, air-gapped bundles, BYO-LLM, local inference via Ollama |
A few numbers to set scale. As of mid-2026, HolmesGPT has around 2,600 GitHub stars and ships frequent releases (0.31.1 shipped on 28 May 2026), reflecting a mature, fast-moving ecosystem. Aurora is the younger project at roughly 263 stars. If raw ecosystem maturity and CNCF backing are your top priorities, that gap is a fair point in HolmesGPT's favor.
Where HolmesGPT wins
Let us be honest about this. If your stack is heavily Kubernetes plus Prometheus plus Grafana, HolmesGPT's 30-plus built-in toolsets and read-only-by-default posture make it a low-risk, high-coverage choice. CNCF Sandbox status gives you neutral governance and a roadmap shared between Robusta and Microsoft. The read-only model means a smaller blast radius and an easier security review. For a pure-Kubernetes team that wants a CNCF-blessed investigator, those are strong reasons to pick HolmesGPT and stop reading here.
Where Aurora wins
Aurora wins in three places. First, reach: when an incident crosses from a pod into an AWS IAM policy, an Azure load balancer, or a GCP quota, an agent that natively queries those clouds correlates the failure in one investigation instead of leaving you to stitch it together by hand. See our guide to multi-cloud incident management for why this matters. Second, action: Aurora runs cloud and cluster commands inside sandboxed Kubernetes pods with destructive actions gated behind human approval, so the agent can do the read-heavy investigative legwork itself. The architecture behind that is covered in our companion piece, AI agent kubectl safety. Third, platform depth: a Memgraph blast-radius graph, postmortems that export to Confluence and Notion, and remediation PRs make Aurora a fuller incident workflow rather than a single investigation step.
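One way to picture human-gated execution is a classifier that lets recognizably read-only commands through and routes everything else to a person. The sketch below is an assumption-laden illustration, not Aurora's gating logic, which the sources here do not describe. The verb allow-lists, the `needs_approval` function, and the decision to cover only 'kubectl' and 'aws' are all choices made for brevity; 'az' and 'gcloud' would need their own parsing rules.

```python
import shlex

# Assumption: verbs treated as read-only. Anything not listed is escalated.
READ_ONLY_VERBS = {
    "kubectl": {"get", "describe", "logs", "top"},
    "aws": {"describe", "list", "get"},
}
# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = set(";|&$`<>")


def needs_approval(command: str) -> bool:
    """Return True unless the command is recognizably read-only."""
    if SHELL_METACHARACTERS & set(command):
        return True
    try:
        tokens = shlex.split(command)
    except ValueError:  # for example, unbalanced quotes
        return True
    if not tokens or tokens[0] not in READ_ONLY_VERBS:
        return True  # unknown tools (including az and gcloud here) escalate

    tool = tokens[0]
    # Collect positional words up to the first flag; a flag placed before
    # the verb therefore escalates, which is the conservative outcome.
    positional = []
    for token in tokens[1:]:
        if token.startswith("-"):
            break
        positional.append(token)

    if tool == "kubectl":
        verb = positional[0] if positional else ""
    else:
        # aws <service> <operation>, e.g. "aws ec2 describe-instances"
        verb = positional[1].split("-")[0] if len(positional) > 1 else ""
    return verb not in READ_ONLY_VERBS[tool]


if __name__ == "__main__":
    for cmd in ("kubectl get pods -n prod", "kubectl delete pod web-1"):
        print(cmd, "->", "approval" if needs_approval(cmd) else "auto-run")
```

The sketch fails closed: a command it cannot parse, or a tool it does not know, goes to a human. That produces some unnecessary approvals, which is usually the cheaper mistake.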
Where do AIOps incumbents fit?
A fair question when you are evaluating open-source AI SRE: should you just buy a mature AIOps platform instead? The honest answer is that those tools solve a different, older problem and come with real lock-in.
Dell APEX AIOps Incident Management, the product formerly known as Moogsoft, is actively maintained and not discontinued. Per Dell's own writeup, Dell acquired Moogsoft, the original AIOps pioneer, and the product is a strong event-correlation and noise-reduction engine built on its 50-plus patented machine-learning inventions. The trade-offs are Dell ownership and contract gating: the companion infrastructure-observability product, formerly CloudIQ, is included only with Dell ProSupport or ProSupport Plus service agreements. Pricing for Incident Management is enterprise and opaque, with no public per-event or per-seat rate published by Dell. In any case, the product offers ML correlation rather than agentic investigation, and it is neither open source nor self-hostable on your terms.
Grafana OnCall is a different category again, and an important one to get right. OnCall is alert routing, scheduling, and escalation, not investigation, and the grafana/oncall OSS repository was archived on 24 March 2026, pushing users toward Grafana Cloud IRM. OnCall and Aurora are complementary, not substitutes. Aurora does not replace your routing layer. It is the AI investigation layer that sits on top of whatever routing you migrate to, whether that is a self-hosted option like Keep or notifications via ntfy or Twilio. If you are mid-migration off OnCall, point your new router's webhook at Aurora and keep the two concerns separate.
Which should you choose?
Choose HolmesGPT if your incidents live inside Kubernetes and CNCF governance matters to you. Choose Aurora if your infrastructure spans multiple clouds, you want the agent to execute and not only diagnose, or you need postmortems, a dependency graph, and remediation PRs in one platform.
Pick HolmesGPT when:
- Your stack is heavily Kubernetes, Prometheus, and Grafana, and your incidents stay there.
- You value CNCF Sandbox governance and a fast-moving ecosystem with 30-plus observability toolsets.
- You want strict read-only behavior for the simplest possible security review.
- You do not need cross-cloud reasoning out of the box.
Pick Aurora when:
- You operate across AWS, Azure, GCP, OVH, or Scaleway and need cross-cloud correlation in a single investigation.
- You want the agent to run 'kubectl', 'aws', 'az', and 'gcloud' itself in sandboxed pods, with destructive actions human-gated.
- You want a Memgraph blast-radius graph, auto-generated postmortems exported to Confluence and Notion, and remediation pull requests.
- You want a vendor-neutral, free, self-hosted platform rather than per-event enterprise licensing.
In practice, some teams run both: HolmesGPT for fast in-cluster Kubernetes triage, Aurora for cross-cloud investigation, execution, and postmortem generation. For a three-way technical breakdown including K8sGPT, see our deeper guide on open-source AI SRE: Aurora vs HolmesGPT vs K8sGPT.
Getting started with Aurora
Aurora is the multi-cloud, execution-capable option among open-source AI SREs. It deploys via Docker Compose or Helm, supports any LLM provider including local models via Ollama for air-gapped deployments, and ingests alerts from eleven monitoring connectors plus a Slack bot. Point your alert source's webhook at Aurora, connect read-only cloud credentials first, and let it investigate alongside your on-call rotation for two weeks before you enable any write actions. For the safety architecture behind sandboxed execution, read AI agent kubectl safety.