Version: main 🚧

AI Factory Architecture Overview

This page describes the architecture for AI infrastructure platforms that provision isolated GPU environments for multiple tenants from a shared pool of GPU infrastructure. It covers node topology, tenant isolation, networking, storage, observability, and identity. Use it as a starting point when designing your deployment, and follow the cross-references to configure each layer.

The architecture has four layers:

  1. Bare metal: server inventory, OS imaging, and hardware lifecycle (vMetal)
  2. Control Plane Cluster: the Kubernetes cluster that runs tenant cluster control planes (vCluster Platform)
  3. Tenant workload clusters: the isolated Kubernetes environments tenants interact with (vCluster)
  4. Platform services: observability, identity, storage, and networking that span all layers

Figure: AI factory architecture, from bare metal to tenant workloads. A central control plane manages private GPU nodes and shared node pools for tenant workloads.

Layer 1: Bare metal

vMetal manages the full hardware lifecycle for your GPU fleet: PXE boot, OS imaging, BMC lifecycle management, and hardware inventory. When a bare metal host registers, vMetal provisions the OS, configures firmware, and makes the machine available to the node provisioner.

Node topology

  • CPU nodes host the Control Plane Cluster. Typically two or three machines in an HA configuration (Standalone HA requires at least three for quorum), sized for control plane pod density and platform services. These machines do not run GPU workloads.
  • GPU nodes are dedicated to tenant workloads. Each tenant cluster receives its own set of GPU nodes. Tenants share no compute, memory, or hardware.

Hardware lifecycle

vMetal integrates with the Metal3 node provider in vCluster Platform. When a tenant cluster is created or a tenant scales their compute, Platform calls vMetal to provision machines from the available pool. vMetal joins them to the correct tenant cluster and reclaims them on release. See Bare metal overview for the full lifecycle model.
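
The provision, join, and reclaim cycle described above can be pictured with a small in-memory model. This is only a sketch: `MachinePool`, its methods, and the machine names are hypothetical and are not the vMetal or Platform API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MachinePool:
    """Toy model of a bare metal pool (hypothetical, not the vMetal API)."""

    available: set[str] = field(default_factory=set)
    assigned: dict[str, str] = field(default_factory=dict)  # machine -> tenant

    def provision(self, tenant: str, count: int) -> list[str]:
        """Move `count` machines from the free pool to one tenant cluster."""
        if count > len(self.available):
            raise RuntimeError("not enough free machines in the pool")
        machines = sorted(self.available)[:count]
        for machine in machines:
            self.available.remove(machine)
            self.assigned[machine] = tenant  # one machine, exactly one tenant
        return machines

    def release(self, tenant: str) -> list[str]:
        """Reclaim every machine held by `tenant` and return it to the pool."""
        reclaimed = sorted(m for m, t in self.assigned.items() if t == tenant)
        for machine in reclaimed:
            del self.assigned[machine]
            self.available.add(machine)
        return reclaimed


pool = MachinePool(available={"gpu-01", "gpu-02", "gpu-03"})
print(pool.provision("tenant-a", 2))  # machines leave the shared pool
print(pool.release("tenant-a"))       # and return to it on release
```

The point of the model is the invariant: a machine is either free or assigned to exactly one tenant, never both.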

Layer 2: Control Plane Cluster

The Control Plane Cluster is the Kubernetes cluster that hosts the control planes of all tenant clusters. It is operated by the platform provider and is invisible to tenants. With shared nodes, it also runs tenant workloads alongside the control plane pods. vCluster Platform runs here and manages provisioning, access control, and lifecycle.

Control Plane Cluster options:

  • vCluster Standalone: a zero-dependency Kubernetes distribution that runs directly on your CPU nodes. The natural choice when building from bare metal with no existing Kubernetes substrate.
  • An existing managed Kubernetes cluster (EKS, AKS, or GKE): suitable if you already operate a cloud cluster for platform services.

For AI cloud deployments, vCluster Standalone is the typical choice. It solves the "cluster one" problem: the platform does not depend on a third-party Kubernetes distribution and bootstraps entirely from vCluster tooling. Deploy in HA mode before taking production traffic.

Tenant cluster control planes

Each tenant cluster's control plane runs as a single StatefulSet pod on the Control Plane Cluster. The pod contains the Kubernetes API server, controller manager, data store, and syncer. From the tenant's perspective this is indistinguishable from a dedicated cluster. From your operations perspective, control planes are namespaced workloads you manage through Platform.

See Control plane sizing for resource recommendations at scale.

Control plane HA

In HA mode, Platform components run as multiple replicas, and high-traffic tenant control planes can be configured with additional replicas.

Control Plane Cluster quorum. Standalone HA requires at least three nodes to maintain quorum. The cluster tolerates the loss of one node without service interruption. Loss of two or more nodes requires intervention. See Standalone HA.
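
The node counts above follow from majority-quorum arithmetic. The sketch below assumes a simple majority quorum, as used by etcd-style consensus stores; the function names are illustrative only.

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority still remains."""
    return nodes - quorum(nodes)


for n in (1, 2, 3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

Under this assumption, three nodes tolerate one failure, and a fourth node adds no extra tolerance, which is why odd node counts are the usual choice.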

Tenant control plane replicas. Each tenant control plane runs as a single pod by default. For tenants with SLA requirements, configure additional replicas. See Container control plane HA.

Backing store. The default embedded SQLite store is suitable for development and small tenant clusters. For production clusters with high API write rates, use the external etcd or PostgreSQL store. See Backing store options.

Failure modes. When a tenant control plane pod is evicted or fails, Kubernetes reschedules it on another Control Plane Cluster node. The tenant's API server recovers without manual intervention. Recovery time depends on backing store replication mode and pod scheduling latency.

Layer 3: Tenant workload clusters

vCluster supports two worker node models. The choice sets the isolation level each tenant receives.

  • Private nodes. Each tenant cluster runs on dedicated bare metal GPU nodes with its own CNI, CSI, and compute. No infrastructure is shared between tenants. Standard for AI cloud providers and regulated deployments. See Private Nodes.
  • Shared nodes with dedicated node pools. Tenant clusters run on the Control Plane Cluster's shared node pool. Each tenant gets exclusive access to a labeled subset of nodes. CNI, CSI, and platform services are shared. Right for enterprise dev and experiment tiers. See Architecture: worker nodes.

The rest of this section covers the private nodes model in detail, as it is standard for AI cloud deployments.

Node joining and automation

Nodes join using vcluster node join with a per-tenant token. For automated provisioning at scale, configure Auto Nodes or the Metal3 node provider to have Platform drive the join and reclaim lifecycle.

See Join private nodes and Node requirements.

Control plane connectivity

Each tenant's control plane pod on the Control Plane Cluster connects to its private nodes over an encrypted VPN tunnel. This means nodes do not need to be on the same network as the Control Plane Cluster. See VPN for private nodes.

Per-tenant GPU stack

Each tenant cluster runs its own GPU stack, isolated from other tenants:

Component | Purpose | Required
GPU Operator | Manages NVIDIA GPU drivers, container runtime, and device plugins | Required
Scheduler (Kueue, Run.ai, Volcano) | Job queuing and GPU quota enforcement | Required
CNI (Calico or Cilium) | Per-tenant network policy and pod networking | Required
CSI driver | Per-tenant persistent storage | Required
vNode | Strong per-workload isolation at the node level | Optional

Certified Stacks provide pre-validated configurations for GPU Operator and schedulers. Use them as the base template for new tenant clusters.

Virtual Nodes

vNode adds a strong isolation boundary between tenant workloads at the node level using Linux user namespaces and seccomp. It is not a separate deployment mode. It is an optional isolation layer that applies to any node model, including shared nodes, private nodes, and Standalone.

On shared nodes, vNode is the primary mechanism for cross-tenant workload isolation. Tenants share physical infrastructure, but each gets its own isolated node view. On private nodes, the value shifts to intra-tenant protection. vNode contains privileged workloads, agentic code execution, and root access within the tenant's own nodes, preventing node escape.

Isolation mechanics. vNode enforces boundaries at the Linux kernel level:

  • UID remapping. Root inside the container (UID 0) maps to an unprivileged host user (UID 100000+). A container escape does not yield host root.
  • Namespace isolation. Each workload gets its own PID, network, mount, UTS, and IPC namespace. Processes in one workload cannot see or signal processes in another.
  • seccomp enforcement. A seccomp profile restricts the syscall surface. Syscalls required by the workload are explicitly allowlisted; the rest are blocked.
  • CDI-based GPU passthrough. The Container Device Interface delivers direct GPU access without requiring privileged mode or host-level device exposure.
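
UID remapping in a Linux user namespace is a linear offset: a container UID maps to a base host UID plus that UID. The sketch below illustrates the arithmetic only. The base of 100000 comes from the text above; the range size of 65536 is an assumption, and the actual ranges vNode allocates may differ.

```python
HOST_UID_BASE = 100_000  # from the text: container root maps to UID 100000+
RANGE_SIZE = 65_536      # assumption: a commonly used subordinate UID range size


def host_uid(container_uid: int,
             base: int = HOST_UID_BASE,
             size: int = RANGE_SIZE) -> int:
    """Translate a UID inside the user namespace to the UID the host sees."""
    if not 0 <= container_uid < size:
        raise ValueError("container UID is outside the mapped range")
    return base + container_uid


print(host_uid(0))     # container root as seen by the host
print(host_uid(1000))  # an ordinary container user as seen by the host
```

Because container root lands on an unprivileged host UID, a process that escapes the container holds no host root privileges.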

Enable vNode by setting sync.toHost.pods.runtimeClassName: vnode in the cluster template.
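
The dotted path above denotes nested configuration keys. As a minimal sketch, the helper below (a hypothetical `expand` function, not part of any vCluster tooling) expands the path into the nested structure it stands for. Where this fragment sits inside a cluster template is not shown on this page.

```python
import json


def expand(path: str, value):
    """Turn 'a.b.c' plus a value into {'a': {'b': {'c': value}}}."""
    nested = value
    for key in reversed(path.split(".")):
        nested = {key: nested}
    return nested


config = expand("sync.toHost.pods.runtimeClassName", "vnode")
print(json.dumps(config, indent=2))
```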

vNode is recommended when:

  • Tenants run agentic or dynamic code execution workloads
  • Tenants need root access inside containers without the risk of node escape
  • Workloads use GPU access through CDI (Container Device Interface)
  • Multiple tenants share physical nodes and require strong workload boundaries

See Virtual Nodes and the vNode documentation.

Layer 4: Platform services

Network isolation

Traffic paths. Three distinct traffic paths exist in this architecture:

  • Tenant API traffic. Tenant users reach the tenant API server pod on the Control Plane Cluster over TLS. Platform issues per-tenant kubeconfigs scoped to that endpoint.
  • Control plane to node traffic. The tenant API server communicates with kubelets and the CNI on private GPU nodes over an encrypted VPN tunnel, so GPU nodes do not need to be on the same network as the Control Plane Cluster.
  • Pod-to-pod traffic. Within the tenant cluster, the tenant's CNI manages pod networking and NetworkPolicy enforcement.

East-west isolation. On private nodes, each tenant cluster installs its own CNI and each private node belongs to exactly one tenant cluster. There is no L2 or L3 adjacency between tenant node pools. Netris enforces this at the switch layer with per-tenant VLAN or VXLAN segments.

CNI selection. Each private-node tenant cluster installs its own CNI during node join. Calico and Cilium are the common choices. The CNI governs pod-to-pod networking, NetworkPolicy enforcement, and load balancer integration within the tenant cluster.

Netris for L2/L3 and InfiniBand isolation. On bare metal, Netris provides switch-level VLAN and VXLAN isolation between tenant networks. Each tenant cluster gets an exclusive network segment at the switch layer, eliminating any L2 adjacency between tenants. For InfiniBand workloads requiring NCCL, Netris manages IB subnet partitioning so each tenant's training jobs run on a dedicated partition with full fabric bandwidth. For RoCE deployments, Netris applies per-tenant VLAN or VXLAN segments with QoS policies that prevent one tenant's traffic from affecting another's.
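
The one-segment-per-tenant rule can be sketched as a small allocator. Everything here is hypothetical: `SegmentAllocator`, the VNI range, and the tenant names are illustrative, and this is not the Netris API.

```python
class SegmentAllocator:
    """Hand out one exclusive segment ID (for example, a VXLAN VNI) per tenant."""

    def __init__(self, first_id: int = 10_000, last_id: int = 10_999):
        # Hypothetical ID range reserved for tenant segments.
        self._free = iter(range(first_id, last_id + 1))
        self._by_tenant: dict[str, int] = {}

    def segment_for(self, tenant: str) -> int:
        """Return the tenant's segment, allocating a fresh one on first use."""
        if tenant not in self._by_tenant:
            try:
                self._by_tenant[tenant] = next(self._free)
            except StopIteration:
                raise RuntimeError("segment range exhausted") from None
        return self._by_tenant[tenant]


alloc = SegmentAllocator()
print(alloc.segment_for("tenant-a"))
print(alloc.segment_for("tenant-b"))
print(alloc.segment_for("tenant-a"))  # same segment as the first call
```

No two tenants ever receive the same ID, which mirrors the absence of L2 adjacency between tenant networks.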

See Netris integration and Bare Metal (Netris).

Storage

Each tenant cluster installs its own CSI driver. The driver runs inside the tenant cluster namespace and has no visibility into other tenants' volumes or StorageClasses.

Common storage backends:

Backend | Use case | Notes
Rook-Ceph | Distributed block storage on the tenant's own nodes | No external dependency; requires spare capacity on GPU nodes
NFS | Shared file storage accessible across tenant nodes | Simple to provision; latency-sensitive for checkpoint-heavy workloads
Weka | High-performance parallel file system | Optimized for AI/ML checkpoint and dataset I/O at scale
Longhorn | Lightweight replicated block storage | Well suited for dev and experiment tiers
Cloud-native CSI (such as EBS or Azure Disk) | Managed cloud block storage | Available when GPU nodes run on cloud infrastructure

To make tenant cluster state durable, configure the backing store for each control plane. See Backing store options.

Observability

Prometheus and Grafana. Each tenant cluster exposes its metrics through a ServiceMonitor, and Platform's Prometheus instance scrapes the targets it selects. Grafana dashboards provide per-tenant cluster visibility. See Aggregating metrics and Fleet monitoring (OpenTelemetry).

OpenCost. OpenCost provides per-tenant cost attribution across GPU, CPU, and memory. This enables chargeback and quota enforcement. See Cost control.
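
Per-tenant cost attribution amounts to multiplying each tenant's resource usage by a unit rate and summing. The sketch below shows that arithmetic only. The rates, field names, and `chargeback` function are hypothetical and do not reflect OpenCost's API or data model.

```python
from collections import defaultdict

# Hypothetical unit prices; real rates come from your own pricing model.
RATES = {
    "gpu_hours": 2.50,
    "cpu_core_hours": 0.04,
    "memory_gib_hours": 0.005,
}


def chargeback(usage_records: list[dict]) -> dict[str, float]:
    """Sum the cost per tenant across GPU, CPU, and memory usage records."""
    totals: dict[str, float] = defaultdict(float)
    for record in usage_records:
        cost = sum(record.get(key, 0.0) * rate for key, rate in RATES.items())
        totals[record["tenant"]] += cost
    return dict(totals)


usage = [
    {"tenant": "tenant-a", "gpu_hours": 10, "cpu_core_hours": 100},
    {"tenant": "tenant-b", "gpu_hours": 2, "memory_gib_hours": 400},
]
print(chargeback(usage))
```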

Audit logging. Every API call against Platform and tenant cluster control planes is logged. See Audit logging.

Identity and access

SSO. Connect vCluster Platform to your corporate identity provider using OIDC or SAML. All human access to Platform and tenant clusters flows through the IdP. See SSO configuration.

Authentication flow. When a tenant user authenticates:

  1. The browser redirects to the corporate IdP.
  2. The IdP issues an identity token.
  3. Platform validates the token and maps the identity to a team and project.
  4. Platform issues a scoped kubeconfig for the tenant's cluster.
  5. API calls hit the tenant control plane pod. The kube-apiserver validates tokens against the same OIDC issuer.

Project-scoped RBAC. Platform organizes tenant clusters into projects. RBAC is scoped to the project boundary. Tenants can only see clusters within their project. Within a tenant cluster, RBAC bindings control access to namespaces and resources. Templates enforce default RBAC policies so every tenant cluster starts with a known permission set. See Users and permissions and Projects.

Cluster access control. Platform issues tenant kubeconfigs scoped to the tenant cluster. Tenants cannot reach the Control Plane Cluster, other tenant clusters, or platform internals. See Access control.

Cluster sizing

These configurations are starting points. Actual sizing depends on tenant count, workload characteristics, control plane fanout, and checkpoint I/O patterns.

Tier | Control Plane Cluster | Tenant cluster count
Small | 3 nodes | Up to 10 tenant clusters
Medium | 3 nodes (larger instances) | 10–50 tenant clusters
Large | 5+ nodes | 50–200+ tenant clusters
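
As a rough aid, the tiers above can be expressed as a lookup. The table's ranges overlap at 10 and 50 tenant clusters; this sketch assumes the boundary values fall into the smaller tier. The function name is illustrative.

```python
def control_plane_tier(tenant_clusters: int) -> str:
    """Pick a starting tier for the Control Plane Cluster from tenant count."""
    if tenant_clusters < 0:
        raise ValueError("tenant cluster count cannot be negative")
    if tenant_clusters <= 10:
        return "Small"   # 3 nodes
    if tenant_clusters <= 50:
        return "Medium"  # 3 nodes, larger instances
    return "Large"       # 5+ nodes


for count in (4, 10, 35, 120):
    print(count, control_plane_tier(count))
```

Treat the result as a starting point only; workload characteristics and API request rates can justify a larger tier at a lower tenant count.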

GPU nodes. GPU node counts are driven entirely by tenant workload requirements. Control Plane Cluster nodes are CPU-only; no GPU workloads run on control plane nodes.

Control plane pod density. Platform schedules one control plane pod per tenant cluster. Pod resource requirements grow with tenant activity, CRD count, and API request rate. See Control plane sizing for per-pod recommendations.

Scaling. Scale the Control Plane Cluster vertically before adding nodes. For large deployments, distribute control plane load across multiple regions. See Platform HA and Multi-region Platform.

Architecture decisions summary

Layer | Decision | Options | Required | Guidance
Bare metal | Machine lifecycle | vMetal (recommended) or manual join | No | vMetal docs, Metal3 node provider
Control Plane Cluster | Foundation | vCluster Standalone or managed Kubernetes | Yes | Standalone, Architecture
Tenant isolation | Node model | Private nodes (standard for AI cloud) | Yes | Private Nodes
Workload isolation | Runtime layer | vNode (optional, recommended for untrusted) | No | vNode docs
Networking | L2/L3 isolation | Netris (bare metal) or cloud VPC | No | Netris integration
GPU stack | Scheduler | Kueue, Run.ai, or Volcano | Yes | Certified Stacks
Observability | Metrics | Prometheus + OpenCost + Grafana | Yes | Monitoring overview
Identity | Auth | OIDC / SAML through Platform SSO | Yes | SSO configuration

Use cases on this architecture

The four-layer stack supports different platform types. The layers themselves stay the same. What changes is the tenant isolation model, the workload runtime, and the compliance posture.

Use case | Node model | vNode | Key additions | Production guide
AI Cloud: Managed Kubernetes Service | Private nodes per customer | Recommended for privileged/agentic workloads | Customer-facing provisioning API, vBilling, Certified Stacks | AI Cloud
Enterprise AI Factory | Mixed: private for production ML, shared for dev/experiment | Recommended for untrusted code execution | Corporate SSO, chargeback through OpenCost, self-service portal | Enterprise AI Factory
Sovereign AI Cloud | Private nodes per customer | Required for regulated or multi-tenant isolation | Netris for hard network isolation, air-gap support, per-tenant audit trails, compliance governance profiles | AI Cloud
Agentic workloads | Private or shared nodes | Required | vNode UID remapping, seccomp profiles, CDI for GPU passthrough | vNode docs

Sovereign AI clouds

Sovereign deployments add compliance requirements on top of the baseline architecture. The stack is identical. The additions are:

  • Hard network isolation at the switch layer using Netris VLAN/VXLAN and IB subnet partitioning. See Netris integration.
  • Air-gap support. vCluster Standalone and the full platform operate without external network dependencies. Certified Stacks can be mirrored to a private registry.
  • Per-tenant audit trails. Every API call against Platform and tenant cluster control planes is logged. See Audit logging.
  • Custom governance profiles. Per-project policies enforce data residency and tenant separation at provisioning time. See Projects and Quotas.

Agentic workloads

Agentic workloads require root access, dynamic package installation, and arbitrary code execution. vNode provides the isolation boundary that makes this safe on shared or dedicated GPU nodes:

  • UID remapping maps root inside the container (UID 0) to an unprivileged host user (UID 100000+). Container escapes do not yield host root.
  • Namespace isolation gives each agent its own PID, network, and mount namespace. No cross-agent visibility.
  • Capability scoping confines privileged Linux capabilities to the container's namespace.
  • Direct GPU passthrough using CDI delivers native GPU performance without VM overhead.

See vNode documentation for security validation results and configuration.

Next steps