Designing a hyperscaler
During interviews, I’ve noticed a question that keeps coming up: SRE teams love to test candidates with this multi-regional cluster design exercise:
Imagine you’re working for a Cloud Provider and can’t rely on any other Cloud Provider’s managed services. Your company has chosen to deploy the internal control plane it uses to operate itself on Kubernetes for high availability and scalability (multi-region). Additionally, you want to create a managed Kubernetes product for your customers. What are typical challenges you need to solve?
- How would you design the internal control plane?
- How would you manage application secrets?
- How would you handle incoming traffic?
- How would you manage updates?
- How would you manage failure domains in new deployments?
- How would you deploy the customer control plane components?
- How would you manage stateful components of the platform?
- How would you achieve proper multi-tenancy?
How would you design the internal control plane?
Assumptions:
- Physical servers are out of scope. I would assume that there is already a well-defined core virtualization layer (e.g. VMware vSphere/ESXi clusters, Proxmox VE, etc.) in place. We only care about VMs going forward.
- Physical connectivity across all zones and regions is fully set up, including hardware-based load balancing (F5, NetScaler, etc.), BGP peering and a software-defined network (SDN) between these virtualization hosts.
- Core services like DNS, NTP, DHCP, etc. are already in place and managed by the core virtualization layer.
- Initial cluster bootstrapping is out of scope. Typically this would be a combination of pre-baked OS images (via Packer) and cloud-init. I would assume that there is a well-defined process in place to provision these new VMs (e.g. via Terraform, Ansible, etc.).
Core cluster design principles:
- Each region gets its own regional Kubernetes cluster, spanning multiple zones.
- 3 Kubernetes control plane nodes, ideally in different zones.
- 3 etcd nodes, again hopefully in different zones.
- Node pools (taints) for different workloads (see the toleration sketch after this list):
- 6+ storage nodes (Ceph, Longhorn, OpenEBS Mayastor, etc.) if the storage layer is not already provided by the core virtualization layer
- 3+ edge nodes (Envoy Proxy, software firewalls)
- 3 monitoring nodes (Prometheus)
- 3 logging nodes (Loki + Alloy)
- 10+ compute nodes (all other workloads)
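To make the taints concrete, here is a minimal sketch of how a workload could be pinned to one of these pools; the pool label/taint key is made up for illustration, and the nodes in the pool would carry it both as a label and as a NoSchedule taint:

```yaml
# Sketch: pinning a workload to a dedicated node pool via a (hypothetical)
# label/taint key "node-pool.example.internal/edge".
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-edge
  namespace: edge
spec:
  replicas: 3
  selector:
    matchLabels:
      app: envoy-edge
  template:
    metadata:
      labels:
        app: envoy-edge
    spec:
      nodeSelector:
        node-pool.example.internal/edge: "true"   # only schedule on edge nodes
      tolerations:
        - key: node-pool.example.internal/edge    # tolerate the pool taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.30.1          # pin to a current release
          ports:
            - containerPort: 8443
```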
A regional management cluster with the same specs as above will be used to manage the other Kubernetes clusters. This cluster will centrally host:
- GitOps (ArgoCD or Flux)
- Container Registry (Harbor)
- Aggregated metrics and logs from all clusters (Alertmanager, Grafana)
- A git service provider if not using a third party (GitHub, GitLab, etc.)
While the above management cluster will be used to sync global workloads across all clusters, each cluster is still fully independent and should be able to operate without the management cluster being up.
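As a sketch of how the management cluster could fan out such global workloads, assuming ArgoCD is chosen and all member clusters are registered with it, an ApplicationSet with a cluster generator stamps out one Application per cluster (repo URL, path and namespaces are placeholders):

```yaml
# Sketch: deploy a "global" workload from the management cluster to every
# registered member cluster via the ApplicationSet cluster generator.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: global-monitoring-agents
  namespace: argocd
spec:
  generators:
    - clusters: {}                      # one Application per registered cluster
  template:
    metadata:
      name: 'monitoring-agents-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.internal/platform/global-workloads.git
        targetRevision: main
        path: monitoring-agents
      destination:
        server: '{{server}}'            # the member cluster's API endpoint
        namespace: monitoring
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```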
How would you manage application secrets?
Via a Key Management Service (KMS) such as HashiCorp Vault, with Sealed Secrets or SOPS (GitOps) for bootstrapping the secrets architecture.
The difficult part here is to securely run the KMS service. You would need to ensure that your physical servers are not tampered with (Secure Boot and TPM 2.0) and that storage is encrypted (typically unattended, attested TPM-based auto-unlock of a LUKS boot volume). You really need to secure and trust all layers where the keys are stored and processed.
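For the GitOps bootstrapping part, a minimal Sealed Secrets sketch could look like the following; the secret name and the (truncated) ciphertext are placeholders, and the ciphertext would be produced with `kubeseal` against the in-cluster controller’s public key:

```yaml
# Sketch: a SealedSecret that is safe to commit to Git; only the sealed-secrets
# controller running in the target cluster can decrypt it back into a Secret.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: vault-bootstrap-credentials
  namespace: vault
spec:
  encryptedData:
    approle-secret-id: AgB3Vx...        # ciphertext from `kubeseal`, truncated
  template:
    metadata:
      name: vault-bootstrap-credentials
      namespace: vault
```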
How would you handle incoming traffic?
Ideally the CNI (e.g. Cilium, Calico, etc.) gets integrated with the BGP routing layer of the core virtualization layer and can announce the IPs of Kubernetes LB services to the physical BGP routers. This would direct the traffic to the Kubernetes edge nodes running Envoy or other L4 proxies (and potentially some firewall / abuse detection) and forward the TCP/UDP traffic to the actual destination (ingress/Gateway API controllers).
MetalLB in BGP mode would essentially do the same thing, but without being tightly integrated into the CNI; it can still make sense if you want FRR compatibility, since MetalLB can run FRR as its BGP speaker.
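For illustration, a MetalLB BGP setup could look roughly like this; peer addresses, ASNs, the address pool and the edge node label are placeholders:

```yaml
# Sketch: MetalLB in BGP mode announcing LoadBalancer IPs to the upstream
# routers, restricted to the edge node pool.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: tor-router-a
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64513
  peerAddress: 10.0.0.1
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: edge-lb-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.0/27
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: edge-lb-advert
  namespace: metallb-system
spec:
  ipAddressPools:
    - edge-lb-pool
  nodeSelectors:                         # announce only from the edge nodes
    - matchLabels:
        node-pool.example.internal/edge: "true"
```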
How would you manage updates?
There are multiple update types that we need to consider:
- OS updates
- Kubernetes updates
- Cluster core updates (CNI, CSI, etc.)
- Application updates (Helm charts, Docker images, etc.)
It’s very important to have some dedicated canary clusters in place for testing any updates. These VMs / clusters should receive all OS, Kubernetes and cluster core updates first.
1. OS updates
Most container-optimized immutable Linux distributions have built-in auto-updates with automated rollback (e.g. openSUSE Leap Micro). I would use that in combination with Kured to automatically drain and reboot the nodes after updates.
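As a sketch, Kured can be constrained to a maintenance window via its flags (normally set through its Helm values); the flag values and image tag below are placeholders, and the sentinel path depends on how the distribution signals a pending reboot:

```yaml
# Sketch: the relevant container fragment of the Kured DaemonSet, limiting
# reboots to a nightly window.
- name: kured
  image: ghcr.io/kubereboot/kured:1.16.0           # pin to a current release
  command:
    - /usr/bin/kured
    - --reboot-sentinel=/var/run/reboot-required   # adjust for the distro
    - --reboot-days=mon,tue,wed,thu
    - --start-time=01:00
    - --end-time=05:00
    - --time-zone=UTC
    - --period=5m
```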
2. Kubernetes updates
I’ve had good experience with the automated system upgrades for k3s and RKE2 (Rancher’s system-upgrade-controller). Depending on the k8s distribution, I would look into those options first. Otherwise, it might make sense to write an operator for that.
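A sketch of such an upgrade for k3s, using the system-upgrade-controller’s Plan CRD; the version and node selector are placeholders:

```yaml
# Sketch: roll the k3s server nodes to a new version, one node at a time.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1                            # one control plane node at a time
  cordon: true
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.30.4+k3s1
```

A second Plan targeting the agent nodes (typically with draining enabled) would cover the workers.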
3. Cluster core updates (CNI, CSI, etc.)
These might not be managed via GitOps (ArgoCD, Flux, etc.) as they are part of the bootstrapping process to get a functional cluster. I would use a combination of GitOps and a CI/CD pipeline to automatically apply the updates to the cluster (via Terraform, Ansible, etc.).
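One possible shape for that pipeline step, assuming Ansible with the kubernetes.core collection drives the Helm upgrade against each cluster; the chart version, values and kubeconfig variable are placeholders:

```yaml
# Sketch: upgrade the CNI on a target cluster from a CI/CD-invoked playbook.
# Assumes the "cilium" Helm repository is already configured on the runner.
- name: Upgrade cluster core components
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Upgrade Cilium via Helm
      kubernetes.core.helm:
        name: cilium
        chart_ref: cilium/cilium
        chart_version: "1.15.6"
        release_namespace: kube-system
        kubeconfig: "{{ target_kubeconfig }}"
        values:
          kubeProxyReplacement: true
```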
4. Application updates (Helm charts, Docker images, etc.)
This is where GitOps really shines. I would use ArgoCD or Flux to automatically apply updates to all clusters. Renovate is also something that makes sense to integrate at this layer.
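For example, a single ArgoCD Application tracking a Helm chart; bumping `targetRevision` (e.g. via a Renovate PR against the Git repo holding this manifest) is all it takes to roll out a new version. Chart name, repository and namespace are placeholders:

```yaml
# Sketch: an auto-syncing Application tracking a chart from an internal Helm repo.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: internal-billing-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.example.internal
    chart: billing-api
    targetRevision: "2.4.1"        # chart version, bumped e.g. by a Renovate PR
    helm:
      values: |
        replicaCount: 3
  destination:
    server: https://kubernetes.default.svc
    namespace: billing
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```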
How would you manage failure domains in new deployments?
Most applications need to be deployed across all zones. As a prerequisite, all nodes need to be properly labeled (topology.kubernetes.io/zone).
I would then look into specifying default pod topology spread constraints, or into ways to automatically inject anti-affinity rules into pods to prevent zonal co-location.
I’m not sure if hiding the complexity of zonal awareness from the maintainers of new deployments is actually a good idea, especially when it comes to stateful/replicated workloads. So I would expect that most applications would still set affinity and disruption budgets explicitly and validate that in CI.
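For illustration, an explicit zone spread for a stateless deployment could look like this, assuming nodes carry the standard zone label; names and image are placeholders:

```yaml
# Sketch: spread replicas evenly across zones, at most one zone of skew.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # hard requirement; ScheduleAnyway for best effort
          labelSelector:
            matchLabels:
              app: example-api
      containers:
        - name: api
          image: registry.example.internal/example-api:1.0.0
```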
How would you deploy the customer control plane components?
Similar to GKE / EKS / AKS, the control plane components would not be directly accessible to customers. I would try to reuse as much as possible from the internal developer platform architecture and commercialize that.
Ideally each cluster for a customer would run on its own VMs on our core virtualization layer:
- 3 Kubernetes control plane nodes, ideally in different zones
- 3 etcd nodes, again hopefully in different zones
An option for automating customer cluster provisioning could be to use Crossplane in combination with its Terraform provider (integrating with the core virtualization layer), or to use the Cluster API project.
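With Cluster API, the top-level object for one customer cluster might look roughly like this; the infrastructure provider kind is hypothetical, standing in for an integration with the in-house virtualization layer, and the names and CIDRs are placeholders:

```yaml
# Sketch: the Cluster API entry point for a single customer cluster.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: customer-acme-prod
  namespace: customers
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: customer-acme-prod-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: InternalCloudCluster            # hypothetical in-house provider
    name: customer-acme-prod
```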
How would you manage stateful components of the platform?
The core components of the platform (control plane, etcd, server certificates) would need to be tightly monitored and a clear maintenance window would be required to upgrade them. Etcd snapshots must be taken regularly.
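As a sketch of the snapshot part, a CronJob along these lines could take periodic etcd snapshots; endpoints, certificate paths and the image tag are placeholders and depend on how etcd is deployed, and shipping the snapshots off-cluster is left out here:

```yaml
# Sketch: periodic etcd snapshots onto a PVC. A real setup would timestamp and
# rotate the files and copy them off-cluster.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"                  # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: quay.io/coreos/etcd:v3.5.15   # pin to the running etcd version
              command:
                - etcdctl
                - snapshot
                - save
                - /backup/etcd-snapshot.db         # fixed name for simplicity
                - --endpoints=https://etcd.kube-system.svc:2379
                - --cacert=/certs/ca.crt
                - --cert=/certs/client.crt
                - --key=/certs/client.key
              volumeMounts:
                - name: backup
                  mountPath: /backup
                - name: certs
                  mountPath: /certs
                  readOnly: true
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: etcd-snapshots
            - name: certs
              secret:
                secretName: etcd-client-certs
```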
How would you achieve proper multi-tenancy?
The core virtualization layer (hypervisor) must be able to provide confidential compute capabilities (SecureBoot, virtual TPM, encrypted virtual RAM and storage) and network isolation (per customer / project). That’s out of scope of the Kubernetes platform itself.
To achieve multi-tenancy within a Kubernetes cluster (from a customer’s perspective), I would use a combination of:
- CRI: Kata Containers (lightweight VMs), potentially with confidential containers, which requires most of the above as a prerequisite.
- CNI: Cilium (eBPF based network policies), service mesh and node-to-node WireGuard encryption
In a nutshell, I would try to use virtualization techniques with proper hardware-based isolation guarantees (e.g. AMD SEV, Intel TDX) and encryption facilities (TPM2, etc.) to have proper VM-based multi-tenancy first. Then I would try to use nested virtualization (Kata Containers) to provide proper isolation for the Kubernetes workloads.
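A sketch of the Kubernetes-facing half of that: Kata Containers exposed as a RuntimeClass and a tenant pod opting into it. Names, labels and the image are placeholders, and the kata runtime handler has to be installed on the selected nodes:

```yaml
# Sketch: make the kata CRI handler selectable and pin it to prepared nodes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata                      # must match the CRI runtime handler name
scheduling:
  nodeSelector:
    katacontainers.io/kata-runtime: "true"
---
apiVersion: v1
kind: Pod
metadata:
  name: tenant-workload
  namespace: tenant-a
spec:
  runtimeClassName: kata           # run this pod inside a lightweight VM
  containers:
    - name: app
      image: registry.example.internal/tenant-a/app:1.0.0
```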