Self-Hosted AI SRE: The 2026 Guide to Air-Gapped, Multi-Cloud, and BYO-LLM Deployment
Self-hosted AI SRE runs the agent, memory, and LLM inside your perimeter. The 2026 architecture for air-gapped, multi-cloud, BYO-LLM deployment.
Key Takeaways
- Self-hosted AI SRE means the agent runtime, its memory layer, and the LLM all run inside the customer's perimeter. Every inference call, every telemetry read, and every postmortem write happens on customer-owned infrastructure. The definition is structural. A vendor agent that ships data to vendor-managed inference is not self-hosted under this definition.
- We propose the Sovereignty Spectrum. Five deployment tiers: T1 Public SaaS, T2 Private SaaS, T3 VPC-Isolated, T4 On-Prem Hosted, T5 Air-Gapped. Of the fifteen most-cited AI SRE tools in 2026, only Aurora, HolmesGPT, and K8sGPT credibly reach T4 or T5. The other twelve top out at T1 or T2.
- Air-gapped deployment requires three independent stacks: orchestration, memory, and inference. Orchestration is the agent loop (LangGraph, ReAct). Memory is the dependency graph plus RAG corpus (Memgraph, Weaviate). Inference is the LLM (Ollama, vLLM, or a sovereign endpoint). All three must run locally, with no outbound network call.
- Regulatory drivers are concrete and dated. The EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025. The EU AI Act implementation timeline phases in through 2027. The SEC adopted cybersecurity disclosure rules on 26 July 2023 (Form 8-K Item 1.05 effective 18 December 2023).
- Open-weight LLMs in 2026 are credible for local inference. Meta's Llama 3.3 70B (December 2024) delivers similar performance to Llama 3.1 405B at lower inference cost, per Meta's own announcement. Mistral, DeepSeek, and Qwen have released competitive open-weight models. Aurora's reference local stack uses Ollama with a 70B-class model.
In Arvo's design-partner conversations across 2025, every regulated customer ran into the same procurement wall: every credible commercial AI SRE required production telemetry, including customer data inside log lines, error messages, and stack traces, to leave the customer perimeter for inference. For a SaaS startup the wall is paperwork. For a bank, a defence contractor, an EU sovereign-data buyer, or a healthcare provider, it blocks the procurement.
Self-hosted AI SRE removes the wall. The agent, its memory, and the LLM all run inside the customer's perimeter. This guide is the 2026 reference for evaluating, designing, and deploying a self-hosted AI SRE, with every commercial tool mapped to its deployment tier and Aurora's air-gapped stack used as the worked example.
What does self-hosted AI SRE mean?
The phrase is overloaded. Three definitions circulate in 2026 vendor marketing, and only the strictest meaningfully reduces the trust surface.
- Self-hosted collector with VPC peering. A vendor agent runs in the customer VPC, gathers telemetry, and ships it (sometimes after partial filtering) to a vendor-managed inference plane. The inference call leaves the customer perimeter. Most commercial AI SREs in 2026 use this pattern and call it "private deployment."
- Single-tenant SaaS. A dedicated vendor-managed instance inside a vendor-owned cloud account. The data plane is isolated from other tenants but still vendor-operated. Inference still leaves the customer perimeter.
- True self-hosted. Every component (orchestration runtime, memory layers, inference endpoint, secrets manager) runs on customer-owned infrastructure. No outbound network call is required for an investigation to complete.
This guide uses the third definition. For audits and compliance reviews, only the third meaning answers the question "could a malicious actor at the vendor have read our incident transcript" with a structural no.
The Sovereignty Spectrum
Each tier increases perimeter control over the previous one. Choose the tier the team can defend operationally; aiming further than that is engineering debt waiting to happen.
| Tier | What runs on customer infrastructure | What leaves the perimeter | Representative tools |
|---|---|---|---|
| T1, Public SaaS | Nothing | Telemetry, transcripts, investigation prompts | Datadog Bits AI, incident.io AI SRE, Rootly AI, PagerDuty SRE Agent, ServiceNow Now Assist, Splunk ITSI, Cleric.ai, Causely |
| T2, Private SaaS (VPC peering) | A vendor-supplied agent or collector | Telemetry, embeddings, sometimes whole log lines, all inference calls | Resolve.ai (satellite agent), Traversal, NeuBird Hawkeye (VPC option), Edwin AI |
| T3, VPC-Isolated single-tenant | Vendor-managed control plane inside a vendor-owned cloud account dedicated to one customer | All inference calls; cross-tenant data flow is structurally absent, the vendor still operates the plane | Some incumbent "private cloud" tiers (custom-quoted) |
| T4, On-prem hosted, hosted LLM | Agent, memory, dependency graph, RAG corpus | LLM API calls to OpenAI, Anthropic, Google, or Bedrock | Aurora with managed LLM; HolmesGPT with managed LLM |
| T5, Air-gapped | Agent, memory, dependency graph, RAG corpus, and a local LLM via Ollama, vLLM, or a sovereign endpoint | Nothing. Investigation completes without an outbound call | Aurora with Ollama; HolmesGPT with self-hosted LLM endpoint; K8sGPT with local LLM (Kubernetes-only scope) |
A team's deployment tier is fixed by its strictest constraint, not its average. The FINMA Circular 2018/03 on outsourcing for Swiss banks and insurers pushes regulated workloads toward T5. A privacy-by-design product advertising "your incident data never leaves your servers" lands at T5. A team that cannot obtain controller approval for an LLM provider under GDPR Article 28 lands at T5.
Any other constraint allows T3 or T4. A single strict regulator collapses the choice to T5.
Why does self-hosting matter in 2026?
Three pressures, in roughly this order.
Regulatory. The EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025. The boundary covers data processing and storage for core services and is the model EU procurement teams now apply to other vendors. The EU AI Act timeline phases in through 2027, with high-risk system obligations under Chapter III (risk management, data governance, human oversight, post-market monitoring) applicable to operational AI used in critical infrastructure. The SEC's cybersecurity disclosure rules (adopted 26 July 2023, Form 8-K Item 1.05 effective 18 December 2023) make incident response transparency a public-company concern.
Sovereignty and latency. Sovereign cloud is no longer a French preoccupation. OVHcloud Sovereign Cloud, Scaleway, T-Systems Sovereign Cloud, Stackit (Schwarz Group), and Oracle EU Sovereign Cloud ship contractually sovereign tiers. An AI SRE that cannot operate without sending telemetry to a US hyperscaler region is unfit for these workloads. Latency follows the same constraint: an EU-hosted agent calling a US-hosted LLM during an incident incurs round-trip latency on every step of a multi-turn investigation.
Data leakage and trust. Production log lines frequently contain customer PII, secrets, and proprietary identifiers. GitGuardian's State of Secrets Sprawl 2024 found 12.8 million new exposed secrets across public repositories alone in 2023, a steady reminder that telemetry contains material auditors care about. The audit calculation for a security team is the same as for any third-party data flow: if it can leak, model the risk as if it will. T5 makes the model trivial because nothing leaves the perimeter.
For the full incident-investigation context, see AI-Powered Incident Investigation: The Complete Guide for SRE Teams.
Which AI SRE tools can be fully self-hosted?
The honest map.
| Tool | Best achievable tier | Constraint |
|---|---|---|
| Aurora | T5, Air-Gapped | Reference stack: Docker Compose or Helm chart, Ollama local LLM, Vault, Memgraph, Weaviate. See the Aurora repo. |
| HolmesGPT | T4, On-prem with hosted LLM (T5 with self-hosted LLM endpoint) | Apache 2.0. Per the HolmesGPT docs, documentation assumes a hosted model provider (OpenAI, Azure OpenAI, Bedrock). Self-hosted LLM is an advanced configuration. |
| K8sGPT | T4, On-prem (T5 with local LLM, Kubernetes scope only) | CLI or Helm. Local LLMs via Ollama supported. Scope is limited to the Kubernetes API. |
| Resolve.ai | T2, Private SaaS | Satellite agent in the customer VPC for telemetry. Inference is vendor-managed. No publicly documented air-gapped option. |
| Traversal | T2, Private SaaS | Flexible deployment options. Inference is vendor-managed. |
| NeuBird Hawkeye | T2, Private SaaS (VPC) | VPC deployment available. Ephemeral telemetry processing claimed by NeuBird. Inference path is vendor-managed. |
| Causely | T1, Public SaaS | Kubernetes-only. SaaS control plane. |
| Cleric.ai | T1, Public SaaS | Slack-first SaaS. |
| PagerDuty SRE Agent | T1, Public SaaS | Inside PagerDuty Operations Cloud. |
| Datadog Bits AI SRE | T1, Public SaaS | Multi-tenant inside Datadog. HIPAA-compliant per Datadog's documentation, not air-gapped. |
| incident.io AI SRE | T1, Public SaaS | Hosted multi-tenant. AI SRE access design-partner-gated. |
| Rootly AI | T1, Public SaaS | Closed-core SaaS. Rootly AI Labs publishes open-source prototypes. |
| ServiceNow Now Assist SRE | T1, Public SaaS | ServiceNow cloud. GA targeted June 2026. |
| Edwin AI (LogicMonitor) | T2, Private (LogicMonitor-managed) | Bundled with LogicMonitor Envision platform. Not standalone. |
| Splunk ITSI Episode Summarization | T1, Public SaaS | Splunk Cloud only as of May 2026 (Alpha). |
The open-source projects are the only tools today that credibly reach T4 or T5 with public documentation. Aurora is the only one with multi-cloud scope at T5. Resolve.ai, Traversal, NeuBird, and Datadog Bits AI publish FedRAMP-adjacent or HIPAA tiers but no air-gapped reference architecture as of May 2026. For the broader category overview, see our open-source incident management overview and the Aurora Actions launch post for scheduled and event-triggered automations on top of self-hosted Aurora.
What is the architecture of a self-hosted AI SRE?
A self-hosted agentic AI SRE has three concurrent runtime stacks. Skip any one and the deployment regresses to a lower sovereignty tier.
1. Orchestration runtime
The agent loop is the LangGraph, ReAct, or equivalent orchestration that decides what tool to call next. It is the smallest of the three stacks by resource footprint and the easiest to self-host. Requirements:
- A Python or Node runtime, typically containerised.
- A task queue (Celery, RQ, BullMQ) for long-running investigations.
- Postgres for agent state, investigation records, and audit logs.
- A secrets store (HashiCorp Vault, AWS Secrets Manager, or KMS) for cloud credentials and LLM keys.
- A web UI or API surface for engineers to inspect and trigger investigations.
Aurora ships this stack as a Docker Compose for single-node deployment and a Helm chart for Kubernetes-native deployment, both documented in the repo.
2. Memory layer
The agent without memory is a stateless inference call. Memory is the difference between an agent that learns from the environment and an agent that makes the same investigative mistake every week.
- Dependency graph. A graph database (Memgraph, Neo4j) that holds the live topology of the infrastructure: services, dependencies, alert sources, and ownership. The agent traverses the graph to assess blast radius and trace upstream causes before issuing tool calls.
- RAG corpus. A vector database (Weaviate, Qdrant, Chroma) holding embeddings of past postmortems, runbooks, design docs, and code. Hybrid retrieval combining BM25 and vector search outperforms either alone on SRE corpora because exact-match identifiers (service names, error codes) coexist with semantic concepts (failure modes). See also the root cause analysis complete guide for SREs for the broader investigation context.
- Event store. Postgres or an event-sourcing database for the agent's own investigation history. Past investigations become future evidence.
Aurora's reference stack is Memgraph, Weaviate, and Postgres. Each runs in a customer container, and none requires an outbound network call.
3. Inference layer
The LLM. Three paths, in increasing sovereignty:
- Managed LLM API. OpenAI, Anthropic, Google, Bedrock. Cheapest to start, lowest operational burden, but the deployment stays at T4.
- Private endpoint. Azure OpenAI dedicated, Bedrock Provisioned Throughput, or a partner-hosted endpoint. Stronger contractual perimeter, although the data still leaves the customer cloud account.
- Local LLM. Ollama, vLLM, or a sovereign inference appliance. Reaches T5.
For T5, the inference stack is the operational lift. Hardware is the largest single line item, and team expertise is the second.
BYO-LLM: which models run well locally?
Open-weight model quality has progressed enough to anchor an agentic SRE loop in 2026. The current options:
- Llama 3.3 70B (Meta, December 2024). Meta states the model delivers similar performance to Llama 3.1 405B at lower inference cost. A common starting point for local deployments.
- DeepSeek-R1 (model card). A reasoning-tuned open-weight model.
- Qwen 2.5 and 3 families (Qwen 2.5 release). Strong multilingual support for teams with non-English runbook content.
- Mistral Large (Mistral models). Strong tool-use performance.
Hardware sizing for a 70B-class model: in float16, weights are roughly 140GB, so plan two 80GB cards (a pair of H100 or A100 80GB) or a single H200 (141GB). Q4-quantised variants compress weights to roughly 35-40GB and fit on a single 80GB card with context room, at some latency and quality cost. See the Llama 3.3 70B model card for the canonical parameter and tensor sizes. Specific latency targets are workload-dependent and should be measured, not assumed.
The constraint to flag: running a local LLM is a real engineering discipline. Teams without LLM-ops capacity should consider T4 (managed API) as the long-term answer and revisit T5 when the team is staffed for it.
How does multi-cloud authentication work in a self-hosted agent?
A self-hosted agent must still reach customer cloud APIs. The auth pattern matters because credentials live in the customer perimeter. Vendor-managed inference makes credential exfiltration a vendor-trust problem. Self-hosted inference makes it a customer-operations problem, which is the desired state.
Aurora's reference multi-cloud auth pattern:
| Cloud | Pattern |
|---|---|
| AWS | STS AssumeRole into customer accounts via a least-privilege investigation role. Credentials never persist in agent storage. |
| Azure | Service Principal with Reader (and incident-scoped Operator) role assignments. |
| GCP | OAuth-based authentication or workload identity federation. |
| OVH | API key per investigation scope, stored in Vault. |
| Scaleway | API token stored in Vault. |
| Kubernetes | Kubeconfig per cluster, stored in Vault. Sandboxed kubectl execution into an isolated namespace; see our AI Agent kubectl Safety guide. |
The Vault binding matters: every cloud credential is short-lived where the cloud supports it, and every credential use is auditable. In a T5 deployment, the auditor's "who issued this command" question is answered by the Vault audit log and the agent's tool-call trace, not by a vendor SOC 2 attestation.
What does an air-gapped AI SRE deployment require?
The hard version requires no outbound network call during an investigation, including for inference.
Aurora's air-gapped reference architecture covers six layers:
- Mirrored container registry. Every image (Aurora, Memgraph, Weaviate, Postgres, Vault, Ollama) is pulled from a customer-internal registry. No Docker Hub calls.
- Mirrored package indices. Python wheels and OS packages served from internal Artifactory or equivalent.
- Mirrored model weights. Llama 3.3 weights downloaded once on a connected jumpbox, scanned, hashed, and copied into the air-gapped network. Same for embedding models.
- Local DNS. No outbound DNS resolution required. Cloud APIs are reached via VPC private endpoints (AWS PrivateLink, Azure Private Endpoint, GCP Private Service Connect).
- No telemetry to vendor. Neither Aurora nor the open-source components phone home; this is verified per release.
- Sealed Vault. Vault sealed and unsealed via internal HSM or Shamir keyshares. No auto-unseal against a vendor KMS.
The provisioning lift is real. Teams that have operated air-gapped Kubernetes will recognise the pattern. Teams that have not should pilot in a connected environment first.
How Aurora implements the Sovereignty Spectrum
Every Aurora deployment is configured for the customer's tier. The same code base supports all five.
- T1 and T2. Aurora deployed to a public-cloud account with managed services for Postgres, Memgraph, and Weaviate. LLM via OpenAI or Anthropic API. Useful for evaluation pilots.
- T3. Aurora deployed to a customer-owned VPC with private endpoints to managed services. LLM via private endpoint (Azure OpenAI dedicated, Bedrock).
- T4. Aurora deployed to customer-owned VMs or Kubernetes with self-hosted Postgres, Memgraph, and Weaviate. LLM via managed API or private endpoint.
- T5. Aurora deployed to customer-owned air-gapped infrastructure with Ollama-hosted Llama 3.3 (or a sovereign LLM endpoint). All dependencies mirrored.
Aurora ships a single codebase that serves all five tiers. Tier downgrade ("drop from T5 to T3 for one workload") and upgrade ("move the EU workload from T3 to T5") become configuration changes rather than migrations.
How does self-hosted AI SRE cost compare to SaaS?
A precise total cost of ownership depends on team size, model choice, infrastructure pricing, regional rates, and incident volume. Procurement should model the variable axes against incident volume rather than anchor on a single vendor-supplied number.
- Self-hosted T4 or T5 fixed costs. Compute for the agent runtime, memory stores, and (for T5) the LLM node. Storage for the RAG corpus and audit log. Engineering time to operate the stack.
- Self-hosted T4 variable costs. Managed LLM API usage at provider rates (OpenAI pricing, Anthropic pricing, Bedrock pricing). Scales with the number and depth of investigations.
- Commercial SaaS variable costs. Per-seat tiers (incident.io, Rootly, PagerDuty), per-investigation billing (Datadog Bits AI, NeuBird), or per-credit consumption (ServiceNow). All published on the vendor's pricing page.
The break-even between a self-hosted Tier 5 deployment and per-investigation SaaS depends on the vendor's per-investigation price, the LLM choice, and the engineering cost of running the stack. Procurement teams should model three points: today's incident volume, twelve-month projected volume, and a 3x scenario. If any of the three is dominated by sovereignty rather than economics, the regulator decides the deployment tier, not the spreadsheet.
When self-hosting is the wrong answer
Self-hosting is an engineering commitment, not a checkbox.
Teams that should skip it:
- No LLM-ops capacity. If no one on the team has run inference servers in production, do not start with air-gapped Ollama. Pilot at T1 or T2.
- Small team, low incident volume. Below twenty incidents per month, the operational overhead can exceed the cost savings of self-hosting. T1 is fine if the data classification allows it.
- No regulatory or sovereignty pressure. If the compliance team is not asking and the data classification is not sensitive, the sovereignty premium is paid for nothing.
- Early in the AI SRE evaluation curve. A managed pilot validates the value of the agent to the team. Self-host after that decision, not before it.
Teams that should default to self-hosting:
- Regulated workloads (finance, healthcare, defence, critical infrastructure).
- EU sovereign-data customers.
- Customers that advertise sovereignty as a product attribute themselves.
- Public-sector buyers under FedRAMP High, IRAP PROTECTED, IL5, or equivalent.
- Anyone whose log lines contain customer PII that has not been scrubbed at source.
What to watch next
Arvo expects three shifts in the self-hosted AI SRE landscape over the next twelve months.
- Sovereign LLM endpoints. EU-hosted, contract-bound LLM endpoints from cloud regions outside US jurisdiction will turn T4 into a viable tier for European regulated customers without forcing T5. Anthropic, OpenAI, and Google are each shipping or piloting EU-resident inference.
- Air-gap reference appliances. Appliance-style packages (preloaded GPU servers with Aurora, a local LLM, and a sealed Vault) sold as turn-key T5 deployments are likely to emerge from hardware vendors.
- Open benchmark cohorts. Closed-source players still measure themselves on private datasets. The first open, named, multi-LLM benchmark on a public incident corpus will become the citation surface the category orbits.
In 2024 self-hosted AI SRE was a theoretical option. By 2025 it was niche. In 2026 it has become the procurement default for regulated workloads. The tools that can execute it today are Aurora at the multi-cloud end, HolmesGPT at the CNCF and Kubernetes end, and K8sGPT for diagnostics.
For the full landscape of AI SRE tools and how each maps to a deployment tier, see Top 15 AI SRE Tools in 2026. For the broader category overview, see AI SRE: The Complete Guide for Engineering Teams in 2026. For the investigation and postmortem halves of the workflow, see AI-Powered Incident Investigation and Automated Post-Mortem Generation.