Can Aurora run fully air-gapped?

Yes. Aurora's reference Tier 5 stack is Docker Compose or Helm with a local LLM via Ollama (typically a 70B-class model like Llama 3.3), Memgraph for the dependency graph, Weaviate for the RAG corpus, Postgres for state, and HashiCorp Vault for secrets. Every image pulls from a customer-internal registry; model weights are loaded once on a connected jumpbox and copied in; no outbound network call is required during an investigation.

Which LLMs work well in air-gapped AI SRE deployments?

Llama 3.3 70B is a common starting point. Meta states that the December 2024 release delivers similar performance to Llama 3.1 405B at lower inference cost. DeepSeek-R1 is a reasoning-tuned open-weight model. Mistral Large is strong on tool use. The Qwen 2.5 and Qwen 3 families are competitive and offer strong multilingual support. Hardware sizing: a 70B model in float16 uses roughly 140GB of weight memory, so plan two 80GB GPUs (H100 or A100 80GB) or a single H200. Q4-quantised variants compress weights to roughly 35 to 40GB and fit on a single 80GB card at some latency and quality cost.

Does HolmesGPT support self-hosting?

Yes. HolmesGPT is Apache 2.0 and can be deployed via Helm (typically through Robusta) or as a standalone CLI. Its scope is Kubernetes-first, with read-only AWS / Azure / GCP / Oracle Cloud / OpenShift toolsets exposed via MCP. The documentation assumes a hosted LLM provider; a self-hosted LLM endpoint is supported but treated as an advanced configuration. With a local LLM endpoint configured, HolmesGPT reaches Tier 5 within its Kubernetes-first scope.

Can Resolve.ai be self-hosted?

Resolve.ai uses a satellite agent in the customer VPC for telemetry collection, but the inference plane is vendor-managed. This is Tier 2 on the Sovereignty Spectrum (Private SaaS with VPC peering), not self-hosted under the strict definition. As of May 2026 there is no publicly documented air-gapped deployment option for Resolve.ai. Customers requiring Tier 4 or Tier 5 deployment evaluate open-source alternatives.

What is the Sovereignty Spectrum?

A five-tier deployment classification: T1 Public SaaS (everything vendor-managed), T2 Private SaaS (vendor agent in customer VPC, vendor inference), T3 VPC-Isolated single-tenant (vendor control plane, customer-dedicated), T4 On-Prem Hosted (agent and memory on customer infrastructure, LLM via managed API), T5 Air-Gapped (everything customer-owned including the LLM, no outbound calls). A customer's tier is fixed by its strictest regulatory or sovereignty constraint, not its average one.

How does an air-gapped agent reach cloud APIs?

Via private endpoints. AWS PrivateLink, Azure Private Endpoint, and GCP Private Service Connect expose cloud APIs inside the customer VPC without traversing the public internet. The agent authenticates via STS AssumeRole (AWS), a Service Principal (Azure), and OAuth or workload identity federation (GCP). Credentials live in HashiCorp Vault or an equivalent secrets store. The agent reaches the cloud API; the cloud API does not reach the agent. No vendor inference is involved.

What is the minimum hardware for local LLM inference?

For a 70B-class model in float16 at production latency: two 80GB GPUs (H100 or A100 80GB) or a single H200, because the weights alone are roughly 140GB. For Q4 or Q5 quantisation at acceptable non-interactive latency: a pair of RTX 4090s (each 24GB) or equivalent. For interactive investigation under five seconds per turn, plan one H100 or two A100s per concurrent investigation, plus a small pool for embedding generation. Memory bandwidth is the bottleneck more often than compute.

How much does self-hosted AI SRE cost versus SaaS?

Arvo's procurement model assumes Llama 3.3 70B on a reserved H100 instance, customer-hosted Postgres, Memgraph, and Weaviate, and a midpoint of published per-investigation rates from Datadog Bits AI and NeuBird. At 100 incidents per month, Tier 5 Aurora is roughly $1,000 to $2,000 per month for infrastructure; Tier 4 Aurora is roughly $1,500 to $3,000 with managed LLM API usage included; commercial SaaS at a $25 to $30 per-investigation midpoint is roughly $2,500 to $3,000. At 2,000 incidents per month, Tier 5 stays under $5,000 of infrastructure; commercial SaaS at the same midpoint is roughly $50,000 to $60,000. These numbers exclude the engineering cost of running the stack, roughly 0.25 FTE for Tier 4 and one FTE for Tier 5. Procurement teams should model the same axes against their incident volume rather than rely on a single vendor-supplied figure.

Which regulations are pushing teams toward self-hosted AI SRE in 2026?

The EU AI Act (high-risk system requirements applicable to operational AI in critical infrastructure), the EU Data Boundary for the Microsoft Cloud (completed February 2025 and now the model EU procurement teams apply to other vendors), GDPR Article 28 (sub-processor approval), HIPAA (US healthcare data), FedRAMP High and IL5 (US federal workloads), FINMA (Swiss financial services), APRA CPS 234 (Australian financial services), and the SEC's cybersecurity incident disclosure rules (US public-company incident transparency). Sovereign-cloud customers (OVHcloud, Scaleway, T-Systems, Stackit, Oracle Sovereign Regions) add a structural sovereignty requirement on top of these.

Self-Hosted AI SRE: The 2026 Guide to Air-Gapped, Multi-Cloud, and BYO-LLM Deployment

Key Takeaways

Self-hosted AI SRE means the agent runtime, its memory layer, and the LLM all run inside the customer's perimeter. Every inference call, every telemetry read, and every postmortem write happens on customer-owned infrastructure. The definition is structural. A vendor agent that ships data to vendor-managed inference is not self-hosted under this definition.

We propose the Sovereignty Spectrum. Five deployment tiers: T1 Public SaaS, T2 Private SaaS, T3 VPC-Isolated, T4 On-Prem Hosted, T5 Air-Gapped. Of the fifteen most-cited AI SRE tools in 2026, only Aurora, HolmesGPT, and K8sGPT credibly reach T4 or T5. The other twelve top out at T1 or T2.

Air-gapped deployment requires three independent stacks: orchestration, memory, and inference. Orchestration is the agent loop (LangGraph, ReAct). Memory is the dependency graph plus RAG corpus (Memgraph, Weaviate). Inference is the LLM (Ollama, vLLM, or a sovereign endpoint). All three must run locally, with no outbound network call.

Regulatory drivers are concrete and dated. The EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025. The EU AI Act implementation timeline phases in through 2027. The SEC adopted cybersecurity disclosure rules on 26 July 2023 (Form 8-K Item 1.05 effective 18 December 2023).

Open-weight LLMs in 2026 are credible for local inference. Meta's Llama 3.3 70B (December 2024) delivers similar performance to Llama 3.1 405B at lower inference cost, per Meta's own announcement. Mistral, DeepSeek, and Qwen have released competitive open-weight models. Aurora's reference local stack uses Ollama with a 70B-class model.

In Arvo's design-partner conversations across 2025, every regulated customer ran into the same procurement wall: every credible commercial AI SRE required production telemetry, including customer data inside log lines, error messages, and stack traces, to leave the customer perimeter for inference. For a SaaS startup the wall is paperwork. For a bank, a defence contractor, an EU sovereign-data buyer, or a healthcare provider, it blocks the procurement.

Self-hosted AI SRE removes the wall. The agent, its memory, and the LLM all run inside the customer's perimeter. This guide is the 2026 reference for evaluating, designing, and deploying a self-hosted AI SRE, with every commercial tool mapped to its deployment tier and Aurora's air-gapped stack used as the worked example.

What does self-hosted AI SRE mean?

The phrase is overloaded. Three definitions circulate in 2026 vendor marketing, and only the strictest meaningfully reduces the trust surface.

Self-hosted collector with VPC peering. A vendor agent runs in the customer VPC, gathers telemetry, and ships it (sometimes after partial filtering) to a vendor-managed inference plane. The inference call leaves the customer perimeter. Most commercial AI SREs in 2026 use this pattern and call it "private deployment."
Single-tenant SaaS. A dedicated vendor-managed instance inside a vendor-owned cloud account. The data plane is isolated from other tenants but still vendor-operated. Inference still leaves the customer perimeter.
True self-hosted. Every component (orchestration runtime, memory layers, inference endpoint, secrets manager) runs on customer-owned infrastructure. No outbound network call is required for an investigation to complete.

This guide uses the third definition. For audits and compliance reviews, only the third meaning answers the question "could a malicious actor at the vendor have read our incident transcript" with a structural no.

The Sovereignty Spectrum

Each tier increases perimeter control over the previous one. Choose the tier the team can defend operationally; aiming further than that is engineering debt waiting to happen.

Tier	What runs on customer infrastructure	What leaves the perimeter	Representative tools
T1, Public SaaS	Nothing	Telemetry, transcripts, investigation prompts	Datadog Bits AI, incident.io AI SRE, Rootly AI, PagerDuty SRE Agent, ServiceNow Now Assist, Splunk ITSI, Cleric.ai, Causely
T2, Private SaaS (VPC peering)	A vendor-supplied agent or collector	Telemetry, embeddings, sometimes whole log lines, all inference calls	Resolve.ai (satellite agent), Traversal, NeuBird Hawkeye (VPC option), Edwin AI
T3, VPC-Isolated single-tenant	Vendor-managed control plane inside a vendor-owned cloud account dedicated to one customer	All inference calls. Cross-tenant data flow is structurally absent, but the vendor still operates the plane	Some incumbent "private cloud" tiers (custom-quoted)
T4, On-prem hosted, hosted LLM	Agent, memory, dependency graph, RAG corpus	LLM API calls to OpenAI, Anthropic, Google, or Bedrock	Aurora with managed LLM; HolmesGPT with managed LLM
T5, Air-gapped	Agent, memory, dependency graph, RAG corpus, and a local LLM via Ollama, vLLM, or a sovereign endpoint	Nothing. Investigation completes without an outbound call	Aurora with Ollama; HolmesGPT with self-hosted LLM endpoint; K8sGPT with local LLM (Kubernetes-only scope)

A team's deployment tier is fixed by its strictest constraint, not its average. The FINMA Circular 2018/3 on outsourcing for Swiss banks and insurers pushes regulated workloads toward T5. A privacy-by-design product advertising "your incident data never leaves your servers" lands at T5. A team that cannot obtain controller approval for an LLM provider under GDPR Article 28 lands at T5.

Softer constraints can usually be met at T3 or T4. A single strict regulator collapses the choice to T5.
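The "strictest constraint wins" rule is a maximum over per-constraint minimums. A minimal sketch, assuming each constraint can be mapped to the lowest tier it will accept; the constraint names and their tier assignments below are hypothetical and only illustrate the rule:

```python
from enum import IntEnum


class Tier(IntEnum):
    """Sovereignty Spectrum tiers, ordered from least to most perimeter control."""
    T1_PUBLIC_SAAS = 1
    T2_PRIVATE_SAAS = 2
    T3_VPC_ISOLATED = 3
    T4_ON_PREM_HOSTED = 4
    T5_AIR_GAPPED = 5


def required_tier(constraints: dict[str, Tier]) -> Tier:
    """Return the lowest tier that satisfies every constraint.

    Each value is the minimum tier one constraint accepts, so the
    deployment has to meet the highest of them.
    """
    if not constraints:
        return Tier.T1_PUBLIC_SAAS
    return max(constraints.values())


# Hypothetical mapping, for illustration only.
example = {
    "internal data classification": Tier.T3_VPC_ISOLATED,
    "customer contract": Tier.T4_ON_PREM_HOSTED,
    "financial regulator": Tier.T5_AIR_GAPPED,
}
print(required_tier(example).name)  # T5_AIR_GAPPED
```

Two of the three constraints would be satisfied at T4, but the single regulator entry decides the outcome.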

Why does self-hosting matter in 2026?

Three pressures, in roughly this order.

Regulatory. The EU Data Boundary for the Microsoft Cloud was completed on 26 February 2025. The boundary covers data processing and storage for core services and is the model EU procurement teams now apply to other vendors. The EU AI Act timeline phases in through 2027, with high-risk system obligations under Chapter III (risk management, data governance, human oversight, post-market monitoring) applicable to operational AI used in critical infrastructure. The SEC's cybersecurity disclosure rules (adopted 26 July 2023, Form 8-K Item 1.05 effective 18 December 2023) make incident response transparency a public-company concern.

Sovereignty and latency. Sovereign cloud is no longer a French preoccupation. OVHcloud Sovereign Cloud, Scaleway, T-Systems Sovereign Cloud, Stackit (Schwarz Group), and Oracle EU Sovereign Cloud ship contractually sovereign tiers. An AI SRE that cannot operate without sending telemetry to a US hyperscaler region is unfit for these workloads. Latency follows the same constraint: an EU-hosted agent calling a US-hosted LLM during an incident incurs round-trip latency on every step of a multi-turn investigation.

Data leakage and trust. Production log lines frequently contain customer PII, secrets, and proprietary identifiers. GitGuardian's State of Secrets Sprawl 2024 found 12.8 million new exposed secrets across public repositories alone in 2023, a steady reminder that telemetry contains material auditors care about. The audit calculation for a security team is the same as for any third-party data flow: if it can leak, model the risk as if it will. T5 makes the model trivial because nothing leaves the perimeter.

For the full incident-investigation context, see AI-Powered Incident Investigation: The Complete Guide for SRE Teams.

Which AI SRE tools can be fully self-hosted?

The honest map.

Tool	Best achievable tier	Constraint
Aurora	T5, Air-Gapped	Reference stack: Docker Compose or Helm chart, Ollama local LLM, Vault, Memgraph, Weaviate. See the Aurora repo.
HolmesGPT	T4, On-prem with hosted LLM (T5 with self-hosted LLM endpoint)	Apache 2.0. Per the HolmesGPT docs, documentation assumes a hosted model provider (OpenAI, Azure OpenAI, Bedrock). Self-hosted LLM is an advanced configuration.
K8sGPT	T4, On-prem (T5 with local LLM, Kubernetes scope only)	CLI or Helm. Local LLMs via Ollama supported. Scope is limited to the Kubernetes API.
Resolve.ai	T2, Private SaaS	Satellite agent in the customer VPC for telemetry. Inference is vendor-managed. No publicly documented air-gapped option.
Traversal	T2, Private SaaS	Flexible deployment options. Inference is vendor-managed.
NeuBird Hawkeye	T2, Private SaaS (VPC)	VPC deployment available. Ephemeral telemetry processing claimed by NeuBird. Inference path is vendor-managed.
Causely	T1, Public SaaS	Kubernetes-only. SaaS control plane.
Cleric.ai	T1, Public SaaS	Slack-first SaaS.
PagerDuty SRE Agent	T1, Public SaaS	Inside PagerDuty Operations Cloud.
Datadog Bits AI SRE	T1, Public SaaS	Multi-tenant inside Datadog. HIPAA-compliant per Datadog's documentation, not air-gapped.
incident.io AI SRE	T1, Public SaaS	Hosted multi-tenant. AI SRE access design-partner-gated.
Rootly AI	T1, Public SaaS	Closed-core SaaS. Rootly AI Labs publishes open-source prototypes.
ServiceNow Now Assist SRE	T1, Public SaaS	ServiceNow cloud. GA targeted June 2026.
Edwin AI (LogicMonitor)	T2, Private (LogicMonitor-managed)	Bundled with LogicMonitor Envision platform. Not standalone.
Splunk ITSI Episode Summarization	T1, Public SaaS	Splunk Cloud only as of May 2026 (Alpha).

The open-source projects are the only tools today that credibly reach T4 or T5 with public documentation. Aurora is the only one with multi-cloud scope at T5. Resolve.ai, Traversal, NeuBird, and Datadog Bits AI publish FedRAMP-adjacent or HIPAA tiers but no air-gapped reference architecture as of May 2026. For the broader category overview, see our open-source incident management overview and the Aurora Actions launch post for scheduled and event-triggered automations on top of self-hosted Aurora.

What is the architecture of a self-hosted AI SRE?

A self-hosted agentic AI SRE has three concurrent runtime stacks. Skip any one and the deployment regresses to a lower sovereignty tier.

1. Orchestration runtime

The agent loop is the LangGraph, ReAct, or equivalent orchestration that decides what tool to call next. It is the smallest of the three stacks by resource footprint and the easiest to self-host. Requirements:

A Python or Node runtime, typically containerised.
A task queue (Celery, RQ, BullMQ) for long-running investigations.
Postgres for agent state, investigation records, and audit logs.
A secrets store (HashiCorp Vault, AWS Secrets Manager, or a KMS-backed equivalent) for cloud credentials and LLM keys.
A web UI or API surface for engineers to inspect and trigger investigations.

Aurora ships this stack as a Docker Compose for single-node deployment and a Helm chart for Kubernetes-native deployment, both documented in the repo.
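The shape of the agent loop can be sketched in a few lines. This is an illustration of a ReAct-style loop under stated assumptions, not Aurora's or LangGraph's implementation: the `decide` callable stands in for the LLM call, and the tool names, the scripted policy, and the returned strings are all hypothetical.

```python
from typing import Callable

Step = tuple[str, str]                      # (action or event name, text)
Tool = Callable[[str], str]                 # takes an argument, returns an observation
Policy = Callable[[list[Step]], Step]       # stand-in for the LLM: history -> next action


def run_investigation(alert: str, tools: dict[str, Tool], decide: Policy,
                      max_steps: int = 5) -> tuple[str, list[Step]]:
    """Alternate between choosing a tool and recording its observation."""
    history: list[Step] = [("alert", alert)]
    for _ in range(max_steps):
        action, argument = decide(history)
        if action == "finish":
            return argument, history
        if action not in tools:
            history.append(("error", f"unknown tool: {action}"))
            continue
        history.append((action, tools[action](argument)))
    return "no conclusion within step budget", history


# Hypothetical read-only tools returning canned observations.
tools = {
    "get_logs": lambda svc: f"{svc}: 502 upstream timeout",
    "get_deploys": lambda svc: f"{svc}: deploy 41 shipped ten minutes before the alert",
}


def scripted_policy(history: list[Step]) -> Step:
    """Fixed script used in place of a model so the sketch runs offline."""
    steps_taken = len(history) - 1
    if steps_taken == 0:
        return "get_logs", "checkout"
    if steps_taken == 1:
        return "get_deploys", "checkout"
    return "finish", "Likely cause: deploy 41 introduced an upstream timeout"


conclusion, trace = run_investigation("checkout 5xx spike", tools, scripted_policy)
print(conclusion)
```

The step budget and the explicit history list are the parts worth keeping in any real loop: the first bounds a runaway investigation, the second is what ends up in the audit log.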

2. Memory layer

An agent without memory is a stateless inference call. Memory is the difference between an agent that learns from the environment and an agent that makes the same investigative mistake every week.

Dependency graph. A graph database (Memgraph, Neo4j) that holds the live topology of the infrastructure: services, dependencies, alert sources, and ownership. The agent traverses the graph to assess blast radius and trace upstream causes before issuing tool calls.
RAG corpus. A vector database (Weaviate, Qdrant, Chroma) holding embeddings of past postmortems, runbooks, design docs, and code. Hybrid retrieval combining BM25 and vector search outperforms either alone on SRE corpora because exact-match identifiers (service names, error codes) coexist with semantic concepts (failure modes). See also the root cause analysis complete guide for SREs for the broader investigation context.
Event store. Postgres or an event-sourcing database for the agent's own investigation history. Past investigations become future evidence.
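The dependency-graph traversal described above comes down to reachability in two directions. A minimal in-memory sketch, assuming the topology is available as a "depends on" adjacency map; the service names are hypothetical, and a graph database would express the same traversal as a query:

```python
from collections import deque


def reachable(graph: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


def invert(graph: dict[str, list[str]]) -> dict[str, list[str]]:
    """Flip every edge so 'A depends on B' becomes 'B is depended on by A'."""
    inverted: dict[str, list[str]] = {}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


# Hypothetical topology: service -> services it depends on.
depends_on = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "search": ["redis"],
}

upstream_candidates = reachable(depends_on, "checkout")   # possible causes of a checkout alert
blast_radius = reachable(invert(depends_on), "redis")     # services affected if redis fails
print(sorted(upstream_candidates), sorted(blast_radius))
```

Following edges forward gives the candidates for an upstream cause; following them backward gives the blast radius.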

Aurora's reference stack is Memgraph, Weaviate, and Postgres. Each runs in a customer container, and none requires an outbound network call.
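One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion. The sketch below uses that technique as an illustration of hybrid retrieval; it is not a statement of how Aurora or Weaviate fuses results, and the document IDs are hypothetical. The constant `k = 60` is the value conventionally used with this method.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "checkout 502 upstream timeout".
bm25_hits = ["pm-042", "runbook-7", "pm-017"]      # exact-match identifiers
vector_hits = ["pm-017", "pm-042", "design-3"]     # semantically similar failure modes

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['pm-042', 'pm-017', 'runbook-7', 'design-3']
```

The two postmortems that both retrievers agree on outrank the documents that only one of them found.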

3. Inference layer

The LLM. Three paths, in increasing sovereignty:

Managed LLM API. OpenAI, Anthropic, Google, Bedrock. Cheapest to start, lowest operational burden, but the deployment stays at T4.
Private endpoint. Azure OpenAI dedicated, Bedrock Provisioned Throughput, or a partner-hosted endpoint. Stronger contractual perimeter, although the data still leaves the customer cloud account.
Local LLM. Ollama, vLLM, or a sovereign inference appliance. Reaches T5.

For T5, the inference stack is the operational lift. Hardware is the largest single line item, and team expertise is the second.
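For the local path, the agent's inference call is an HTTP request to an endpoint inside the perimeter. A hedged sketch using only the standard library: it builds a request for Ollama's `/api/generate` route and refuses any base URL that is not a loopback or private address. The helper names, the example address, and the model tag are assumptions for illustration; hostnames other than `localhost` are rejected because the sketch does not resolve DNS.

```python
import ipaddress
import json
import urllib.request
from urllib.parse import urlparse


def is_private_endpoint(url: str) -> bool:
    """True only for localhost or a literal loopback/private IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # unresolved hostname: not proven to be internal
    return address.is_loopback or address.is_private


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    if not is_private_endpoint(base_url):
        raise ValueError(f"refusing non-private inference endpoint: {base_url}")
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical internal address and model tag.
request = build_generate_request("http://10.0.0.5:11434", "llama3.3:70b",
                                 "Summarise the last deploy to checkout.")
print(request.full_url)
```

Sending the request is a call to `urllib.request.urlopen(request)`; with `stream` set to false, the generated text is in the `response` field of the returned JSON. The address check is a guard in application code and does not replace egress rules at the network layer.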

BYO-LLM: which models run well locally?

Open-weight model quality has progressed enough to anchor an agentic SRE loop in 2026. The current options:

Llama 3.3 70B (Meta, December 2024). Meta states the model delivers similar performance to Llama 3.1 405B at lower inference cost. A common starting point for local deployments.
DeepSeek-R1 (model card). A reasoning-tuned open-weight model.
Qwen 2.5 and 3 families (Qwen 2.5 release). Strong multilingual support for teams with non-English runbook content.
Mistral Large (Mistral models). Strong tool-use performance.

Hardware sizing for a 70B-class model: in float16, weights are roughly 140GB, so plan two 80GB cards (a pair of H100 or A100 80GB) or a single H200 (141GB). Q4-quantised variants compress weights to roughly 35-40GB and fit on a single 80GB card with context room, at some latency and quality cost. See the Llama 3.3 70B model card for the canonical parameter and tensor sizes. Specific latency targets are workload-dependent and should be measured, not assumed.
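The sizing figures above follow from parameter count times bytes per weight. A rough estimator, assuming weights only and decimal gigabytes; the KV cache, activations, and quantisation metadata need additional memory, which is why real Q4 files land nearer 35 to 40GB than the bare 35GB computed here. The `headroom` parameter is a hypothetical knob for reserving part of each card.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def gpus_needed(weights_gb: float, gpu_gb: float, headroom: float = 0.0) -> int:
    """Cards required if `headroom` (0 to <1) of each card is kept free."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    return math.ceil(weights_gb / (gpu_gb * (1.0 - headroom)))


fp16 = weight_memory_gb(70, 16)   # 140.0
q4 = weight_memory_gb(70, 4)      # 35.0

print(fp16, gpus_needed(fp16, 80))    # 140.0 2  (two 80GB cards)
print(fp16, gpus_needed(fp16, 141))   # 140.0 1  (one H200, with almost no room to spare)
print(q4, gpus_needed(q4, 80))        # 35.0 1   (one 80GB card)
```

Reserving even 10 percent of an H200 for context (`gpus_needed(fp16, 141, headroom=0.1)`) pushes the float16 case to two cards, which is a reminder to measure with the intended context length before committing to hardware.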

The constraint to flag: running a local LLM is a real engineering discipline. Teams without LLM-ops capacity should treat T4 (managed API) as the working answer and revisit T5 once the team is staffed for it.

How does multi-cloud authentication work in a self-hosted agent?

A self-hosted agent must still reach customer cloud APIs. The auth pattern matters because credentials live in the customer perimeter. Vendor-managed inference makes credential exfiltration a vendor-trust problem. Self-hosted inference makes it a customer-operations problem, which is the desired state.

Aurora's reference multi-cloud auth pattern:

Cloud	Pattern
AWS	STS AssumeRole into customer accounts via a least-privilege investigation role. Credentials never persist in agent storage.
Azure	Service Principal with Reader (and incident-scoped Operator) role assignments.
GCP	OAuth-based authentication or workload identity federation.
OVH	API key per investigation scope, stored in Vault.
Scaleway	API token stored in Vault.
Kubernetes	Kubeconfig per cluster, stored in Vault. Sandboxed kubectl execution into an isolated namespace; see our AI Agent kubectl Safety guide.

The Vault binding matters: every cloud credential is short-lived where the cloud supports it, and every credential use is auditable. In a T5 deployment, the auditor's "who issued this command" question is answered by the Vault audit log and the agent's tool-call trace, not by a vendor SOC 2 attestation.
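The AWS row of the table can be sketched as follows. This is an illustration under stated assumptions, not Aurora's code: the STS client is passed in (in a real deployment it would typically be `boto3.client("sts")`), and the function name, the session-name format, and the 15-minute duration are hypothetical choices. The `assume_role` parameters and the shape of the `Credentials` response follow the AWS STS API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class TemporaryCredentials:
    """Short-lived credentials kept in memory only, never written to agent storage."""
    access_key_id: str
    secret_access_key: str
    session_token: str
    expiration: datetime


def assume_investigation_role(sts_client: Any, role_arn: str, investigation_id: str,
                              duration_seconds: int = 900) -> TemporaryCredentials:
    """Exchange the agent's identity for a scoped, expiring session.

    Putting the investigation ID in the session name ties every API call
    made with these credentials back to one investigation in the audit trail.
    900 seconds is the shortest duration STS accepts.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"investigation-{investigation_id}",
        DurationSeconds=duration_seconds,
    )
    credentials = response["Credentials"]
    return TemporaryCredentials(
        access_key_id=credentials["AccessKeyId"],
        secret_access_key=credentials["SecretAccessKey"],
        session_token=credentials["SessionToken"],
        expiration=credentials["Expiration"],
    )
```

Passing the client in keeps the function testable without network access, and keeps the choice of base identity (instance role, Vault-issued key, or federation) outside the agent code.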

What does an air-gapped AI SRE deployment require?

The hard version requires no outbound network call during an investigation, including for inference.

Aurora's air-gapped reference architecture covers six layers:

Mirrored container registry. Every image (Aurora, Memgraph, Weaviate, Postgres, Vault, Ollama) is pulled from a customer-internal registry. No Docker Hub calls.
Mirrored package indices. Python wheels and OS packages served from internal Artifactory or equivalent.
Mirrored model weights. Llama 3.3 weights downloaded once on a connected jumpbox, scanned, hashed, and copied into the air-gapped network. Same for embedding models.
Local DNS. No outbound DNS resolution required. Cloud APIs are reached via VPC private endpoints (AWS PrivateLink, Azure Private Endpoint, GCP Private Service Connect).
No telemetry to vendor. Neither Aurora nor the open-source components phone home; this is verified per release.
Sealed Vault. Vault sealed and unsealed via internal HSM or Shamir keyshares. No auto-unseal against a vendor KMS.

The provisioning lift is real. Teams that have operated air-gapped Kubernetes will recognise the pattern. Teams that have not should pilot in a connected environment first.
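The "scanned, hashed, and copied" step for model weights can be checked mechanically on the air-gapped side. A minimal sketch, assuming a manifest of SHA-256 digests was recorded on the connected jumpbox and carried across with the files; the manifest format and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weight shards fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(directory: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or whose digest does not match.

    `manifest` maps a file name to the hex SHA-256 recorded before transfer.
    An empty result means every listed file arrived intact.
    """
    failures = []
    for name, expected in manifest.items():
        path = directory / name
        if not path.is_file() or sha256_file(path) != expected.lower():
            failures.append(name)
    return failures
```

A non-empty return value should block the model from being loaded. The same check applies to embedding models and to mirrored container image archives.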

How Aurora implements the Sovereignty Spectrum

Every Aurora deployment is configured for the customer's tier. The same codebase supports all five.

T1 and T2. Aurora deployed to a public-cloud account with managed services for Postgres, Memgraph, and Weaviate. LLM via OpenAI or Anthropic API. Useful for evaluation pilots.
T3. Aurora deployed to a customer-owned VPC with private endpoints to managed services. LLM via private endpoint (Azure OpenAI dedicated, Bedrock).
T4. Aurora deployed to customer-owned VMs or Kubernetes with self-hosted Postgres, Memgraph, and Weaviate. LLM via managed API or private endpoint.
T5. Aurora deployed to customer-owned air-gapped infrastructure with Ollama-hosted Llama 3.3 (or a sovereign LLM endpoint). All dependencies mirrored.

Because a single codebase serves all five tiers, a tier downgrade ("drop from T5 to T3 for one workload") or upgrade ("move the EU workload from T3 to T5") becomes a configuration change rather than a migration.

How does self-hosted AI SRE cost compare to SaaS?

A precise total cost of ownership depends on team size, model choice, infrastructure pricing, regional rates, and incident volume. Procurement should model the variable axes against incident volume rather than anchor on a single vendor-supplied number.

Self-hosted T4 or T5 fixed costs. Compute for the agent runtime, memory stores, and (for T5) the LLM node. Storage for the RAG corpus and audit log. Engineering time to operate the stack.
Self-hosted T4 variable costs. Managed LLM API usage at provider rates (OpenAI pricing, Anthropic pricing, Bedrock pricing). Scales with the number and depth of investigations.
Commercial SaaS variable costs. Per-seat tiers (incident.io, Rootly, PagerDuty), per-investigation billing (Datadog Bits AI, NeuBird), or per-credit consumption (ServiceNow). All published on the vendor's pricing page.

The break-even between a self-hosted Tier 5 deployment and per-investigation SaaS depends on the vendor's per-investigation price, the LLM choice, and the engineering cost of running the stack. Procurement teams should model three points: today's incident volume, twelve-month projected volume, and a 3x scenario. If any of the three is dominated by sovereignty rather than economics, the regulator decides the deployment tier, not the spreadsheet.
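The three-point comparison can be modelled in a few lines. Every number below is a placeholder chosen for illustration, not a quoted price: replace the per-investigation rate, the infrastructure cost, and the engineering cost with figures from vendor quotes and internal finance. Holding infrastructure flat across volumes is a simplification.

```python
import math


def saas_monthly_cost(incidents: int, price_per_investigation: float) -> float:
    """Per-investigation SaaS bill for one month."""
    return incidents * price_per_investigation


def self_hosted_monthly_cost(infra: float, fte_fraction: float,
                             fte_monthly_cost: float) -> float:
    """Infrastructure plus the engineering time needed to run the stack."""
    return infra + fte_fraction * fte_monthly_cost


def break_even_incidents(self_hosted_cost: float, price_per_investigation: float) -> int:
    """Smallest monthly incident count at which SaaS costs at least as much."""
    return math.ceil(self_hosted_cost / price_per_investigation)


# Hypothetical inputs (USD per month unless noted).
PRICE = 27.50           # per-investigation SaaS price
T5_INFRA = 2_000.0      # Tier 5 infrastructure, held flat
FTE_FRACTION = 1.0      # engineering effort attributed to the stack
FTE_MONTHLY = 15_000.0  # fully loaded cost of one engineer

t5 = self_hosted_monthly_cost(T5_INFRA, FTE_FRACTION, FTE_MONTHLY)
today = 100
scenarios = {"today": today, "twelve-month projection": 250, "3x scenario": 3 * today}

for name, incidents in scenarios.items():
    saas = saas_monthly_cost(incidents, PRICE)
    cheaper = "self-hosted" if t5 < saas else "SaaS"
    print(f"{name}: SaaS ${saas:,.0f} vs T5 ${t5:,.0f} -> {cheaper} is cheaper")

print("break-even incidents per month:", break_even_incidents(t5, PRICE))
```

With these placeholder inputs the engineering line dominates the self-hosted cost, which is why the model should always include it. The output says nothing about sovereignty: when a regulator requires T5, the comparison only tells you what that requirement costs.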

When self-hosting is the wrong answer

Self-hosting is an engineering commitment, not a checkbox.

Teams that should skip it:

No LLM-ops capacity. If no one on the team has run inference servers in production, do not start with air-gapped Ollama. Pilot at T1 or T2.
Small team, low incident volume. Below twenty incidents per month, the operational overhead can exceed the cost savings of self-hosting. T1 is fine if the data classification allows it.
No regulatory or sovereignty pressure. If the compliance team is not asking and the data classification is not sensitive, the sovereignty premium is paid for nothing.
Early in the AI SRE evaluation curve. A managed pilot validates the value of the agent to the team. Self-host after that decision, not before it.

Teams that should default to self-hosting:

Regulated workloads (finance, healthcare, defence, critical infrastructure).
EU sovereign-data customers.
Customers that advertise sovereignty as a product attribute themselves.
Public-sector buyers under FedRAMP High, IRAP PROTECTED, IL5, or equivalent.
Anyone whose log lines contain customer PII that has not been scrubbed at source.

What to watch next

Arvo expects three shifts in the self-hosted AI SRE landscape over the next twelve months.

Sovereign LLM endpoints. EU-hosted, contract-bound LLM endpoints from cloud regions outside US jurisdiction will turn T4 into a viable tier for European regulated customers without forcing T5. Anthropic, OpenAI, and Google are each shipping or piloting EU-resident inference.
Air-gap reference appliances. Appliance-style packages (preloaded GPU servers with Aurora, a local LLM, and a sealed Vault) sold as turn-key T5 deployments are likely to emerge from hardware vendors.
Open benchmark cohorts. Closed-source players still measure themselves on private datasets. The first open, named, multi-LLM benchmark on a public incident corpus will become the citation surface the category orbits.

In 2024 self-hosted AI SRE was a theoretical option. By 2025 it was niche. In 2026 it has become the procurement default for regulated workloads. The tools that can execute it today are Aurora at the multi-cloud end, HolmesGPT at the CNCF and Kubernetes end, and K8sGPT for diagnostics.

For the full landscape of AI SRE tools and how each maps to a deployment tier, see Top 15 AI SRE Tools in 2026. For the broader category overview, see AI SRE: The Complete Guide for Engineering Teams in 2026. For the investigation and postmortem halves of the workflow, see AI-Powered Incident Investigation and Automated Post-Mortem Generation.
