kube-site-follower: Kubernetes operator for site failover

A Kubernetes controller that watches for site failover triggers and signals applications to follow. When a trigger fires (e.g., the database primary moves to a different node), the controller updates DNS records and tells apps which site is active, so everything stays co-located with critical services and avoids cross-site latency.
⭐ GitHub: github.com/runatyr1/kube-site-follower #
Why “Site Follower”? #
In a multi-site Kubernetes setup, you want applications co-located with critical services (like your database primary). Cross-site queries add latency and reduce performance.
When a critical service fails over to a different node/site, everything else should follow. The controller signals the change in two ways:
- DNS update - Records point to the new site’s IP (for external routing)
- ConfigMap update - Apps read the active site and enable/disable accordingly (for internal signaling)
Both methods can be used by frontend or backend apps depending on your setup.
Current limitation: Supports 2-node Kubernetes clusters with 2 sites. Only the CloudNativePG trigger is currently implemented. More triggers and multi-site support are planned.
🚀 Deployment #
The Helm chart can optionally bundle the CloudNativePG operator and a ready-to-use Postgres cluster.
Quick Start (Helm CLI) #
helm repo add kube-site-follower https://runatyr1.github.io/kube-site-follower-helm
helm repo update
helm install kube-site-follower kube-site-follower/kube-site-follower \
  --namespace kube-site-follower --create-namespace \
  --set cloudnativepg.install=true \
  --set controller.sites.us-east.nodeName=node-us-east-01 \
  --set controller.sites.us-east.ip=1.2.3.4 \
  --set controller.sites.eu-west.nodeName=node-eu-west-01 \
  --set controller.sites.eu-west.ip=5.6.7.8 \
  --set controller.dns.provider=namecheap \
  --set 'controller.dns.records[0]=demo1.example.com' \
  --set controller.dns.namecheap.apiUser=YOUR_USER \
  --set controller.dns.namecheap.apiKey=YOUR_KEY
GitOps Examples #
FluxCD HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-site-follower
spec:
  chart:
    spec:
      chart: kube-site-follower
      sourceRef:
        kind: HelmRepository
        name: kube-site-follower
  valuesFrom:
    - kind: Secret
      name: kube-site-follower-secrets
      valuesKey: us-east-ip
      targetPath: controller.sites.us-east.ip
    - kind: Secret
      name: kube-site-follower-secrets
      valuesKey: eu-west-ip
      targetPath: controller.sites.eu-west.ip
    - kind: Secret
      name: kube-site-follower-secrets
      valuesKey: namecheap-apiUser
      targetPath: controller.dns.namecheap.apiUser
    - kind: Secret
      name: kube-site-follower-secrets
      valuesKey: namecheap-apiKey
      targetPath: controller.dns.namecheap.apiKey
  values:
    cloudnativepg:
      install: true
    controller:
      watch:
        preferredPrimaryNode: node-us-east-01 # optional
      sites:
        us-east:
          nodeName: node-us-east-01
        eu-west:
          nodeName: node-eu-west-01
      dns:
        provider: namecheap
        records:
          - demo1.example.com
ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-site-follower
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://runatyr1.github.io/kube-site-follower-helm
    chart: kube-site-follower
    targetRevision: "*"
    helm:
      values: |
        cloudnativepg:
          install: true
        controller:
          watch:
            preferredPrimaryNode: node-us-east-01
          sites:
            us-east:
              nodeName: node-us-east-01
              ip: "1.2.3.4" # or use AVP/external-secrets
            eu-west:
              nodeName: node-eu-west-01
              ip: "5.6.7.8"
          dns:
            provider: namecheap
            records:
              - demo1.example.com
            namecheap:
              apiUser: $NAMECHEAP_USER # ArgoCD Vault Plugin
              apiKey: $NAMECHEAP_KEY
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-site-follower
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
Kubernetes Secret (for sensitive values)
apiVersion: v1
kind: Secret
metadata:
  name: kube-site-follower-secrets
  namespace: flux-system # or argocd namespace
type: Opaque
stringData:
  us-east-ip: "1.2.3.4"
  eu-west-ip: "5.6.7.8"
  namecheap-apiUser: "your-api-user"
  namecheap-apiKey: "your-api-key"
🏗️ Architecture #
The controller runs a simple loop every 10 seconds:
- WATCH - Query trigger source (e.g., CNPG Cluster CRD for primary pod)
- LOOKUP - Get pod spec to find which node it’s running on
- MATCH - Map node name to site config (name + IP)
- ACT - If site changed: update ConfigMap + DNS records
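The MATCH and ACT steps above can be sketched as two small pure functions. This is a minimal illustration under stated assumptions, not the project's code: `Site`, `matchSite`, and `shouldAct` are hypothetical names, the struct fields mirror the `site.Name`, `site.NodeName`, and `site.IP` accesses used by the controller, and the sketch compares site names rather than node names (equivalent in the 2-node, 2-site setup).

```go
package main

// Site is a hypothetical site definition; the fields mirror the
// site.Name, site.NodeName and site.IP accesses in the controller loop.
type Site struct {
	Name     string
	NodeName string
	IP       string
}

// matchSite maps a node name to its site config (the MATCH step).
// The boolean is false when the node belongs to no configured site.
func matchSite(sites []Site, nodeName string) (Site, bool) {
	for _, s := range sites {
		if s.NodeName == nodeName {
			return s, true
		}
	}
	return Site{}, false
}

// shouldAct reports whether the ACT step should run: the primary's node
// must map to a known site, and that site must differ from the last one
// acted on (lastSite is "" before the first update).
func shouldAct(sites []Site, nodeName, lastSite string) (Site, bool) {
	site, ok := matchSite(sites, nodeName)
	if !ok || site.Name == lastSite {
		return Site{}, false
	}
	return site, true
}
```

Keeping the decision separate from the ConfigMap and DNS calls makes it easy to unit test, and an unknown node simply results in no action instead of publishing an empty site.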
Preferred Primary Node (optional): On startup, the controller can ensure the primary runs on a specific node by triggering a switchover if needed. This sets a “home” site for your database. The check is a one-time operation by design: after a failover, restart the controller to move the primary back to the preferred node (kubectl rollout restart deployment/kube-site-follower-kube-site-follower -n kube-site-follower). The controller also detects cluster recreation (via a UID change) and automatically re-runs the preferred primary check without requiring a restart.
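One way to get "run once, and again if the cluster is recreated" is to remember the UID of the cluster the check last ran against. The sketch below is an assumption about how such a gate could look, not the controller's actual implementation; `preferredPrimaryGate` and `ShouldRun` are hypothetical names.

```go
package main

// preferredPrimaryGate is a hypothetical helper that decides when the
// preferred-primary check should run: once per observed cluster UID.
type preferredPrimaryGate struct {
	lastUID string // UID of the cluster the check last ran against
}

// ShouldRun returns true the first time a given cluster UID is seen.
// A controller restart resets lastUID to "", and a recreated cluster
// gets a new UID, so both cases trigger the check again. An ordinary
// failover keeps the same UID and does not.
func (g *preferredPrimaryGate) ShouldRun(clusterUID string) bool {
	if clusterUID == "" || clusterUID == g.lastUID {
		return false
	}
	g.lastUID = clusterUID
	return true
}
```

Because the state lives only in memory, restarting the controller is what re-arms the check, which matches the restart instruction above.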
Database protection: The PostgreSQL cluster carries the helm.sh/resource-policy: keep annotation, so uninstalling the Helm chart will NOT delete your database. To remove it, delete the cluster manually (kubectl delete cluster pg-cluster -n postgres).
📁 Project Structure #
kube-site-follower/
├── cmd/
│   └── controller/
│       └── main.go                # Entry point, main loop
├── pkg/
│   ├── config/
│   │   └── config.go              # Environment variable loading
│   ├── watchers/                  # Trigger implementations
│   │   └── cnpg/
│   │       └── watcher.go         # CloudNativePG trigger
│   └── actions/
│       ├── configmap/
│       │   └── updater.go         # ConfigMap updater
│       └── dns/
│           └── namecheap/
│               └── provider.go    # Namecheap DNS API client
├── charts/
│   └── kube-site-follower/        # Helm chart
│       ├── Chart.yaml             # With CloudNativePG as subchart
│       ├── values.yaml
│       └── templates/
│           ├── deployment.yaml
│           ├── rbac.yaml
│           └── postgres-cluster.yaml
├── Dockerfile
└── go.mod
🔧 Core Components #
Main Controller Loop #
The controller uses a ticker-based loop that checks the database state every 10 seconds:
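// Main loop (abridged; ctx, cfg, watcher, cmUpdater, dnsUpdater and
// currentPrimaryNode are set up earlier in main)
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()

checkAndUpdate := func() {
	// 1. Get primary node from CNPG watcher
	primaryNode, err := watcher.GetPrimaryNode(ctx)
	if err != nil {
		log.Printf("failed to get primary node: %v", err)
		return
	}

	// 2. Find matching site config
	var activeSite, activeIP string
	for _, site := range cfg.Sites {
		if site.NodeName == primaryNode {
			activeSite = site.Name
			activeIP = site.IP
			break
		}
	}

	// 3. If changed, trigger failover actions
	if primaryNode != currentPrimaryNode {
		cmUpdater.Update(ctx, activeSite, primaryNode)
		dnsUpdater.UpdateRecord(record, activeIP) // called for each configured DNS record
		currentPrimaryNode = primaryNode          // remember the state for the next tick
	}
}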
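The excerpt defines the check but not how the ticker drives it. A minimal sketch of a ticker-driven loop with clean shutdown might look like the following; `runLoop` and `countChecks` are hypothetical names, and running the first check immediately is a design assumption rather than something the project states.

```go
package main

import (
	"context"
	"time"
)

// runLoop calls check once immediately and then on every tick until ctx
// is cancelled. It is a sketch of a ticker-driven loop, not the
// project's main function.
func runLoop(ctx context.Context, interval time.Duration, check func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	check() // do not wait a full interval for the first check
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// select picks randomly when both channels are ready,
			// so re-check cancellation before doing more work.
			if ctx.Err() != nil {
				return
			}
			check()
		}
	}
}

// countChecks is a small usage example: it runs the loop with a short
// interval, cancels after n checks, and returns how many checks ran.
func countChecks(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	count := 0
	runLoop(ctx, time.Millisecond, func() {
		count++
		if count == n {
			cancel()
		}
	})
	return count
}
```

In a real controller the context would typically be cancelled on SIGTERM so the pod stops between checks rather than in the middle of a DNS update.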
CNPG Watcher (Trigger) #
The CloudNativePG trigger uses the Kubernetes dynamic client to query the Cluster CRD. Other triggers would follow a similar pattern:
// CloudNativePG Cluster GVR
var clusterGVR = schema.GroupVersionResource{
	Group:    "postgresql.cnpg.io",
	Version:  "v1",
	Resource: "clusters",
}

// GetPrimaryNode returns the node where primary PostgreSQL runs
func (w *Watcher) GetPrimaryNode(ctx context.Context) (string, error) {
	// Get the CNPG cluster CRD
	cluster, err := w.dynamicClient.Resource(clusterGVR).
		Namespace(w.namespace).
		Get(ctx, w.clusterName, metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("get cluster %s/%s: %w", w.namespace, w.clusterName, err)
	}

	// Extract currentPrimary pod name from status
	primaryPod, err := getPrimaryPodName(cluster)
	if err != nil {
		return "", err
	}

	// Get pod to find which node it's scheduled on
	pod, err := w.clientset.CoreV1().Pods(w.namespace).
		Get(ctx, primaryPod, metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("get pod %s: %w", primaryPod, err)
	}
	return pod.Spec.NodeName, nil
}
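The watcher relies on a `getPrimaryPodName` helper that is not shown. As a rough idea of what such a helper has to do, the sketch below reads `status.currentPrimary` from a decoded object. It works on a plain `map[string]interface{}`, which is the same shape as the `Object` field of an `unstructured.Unstructured`, so it runs without Kubernetes dependencies; the function name is hypothetical and the real helper may differ.

```go
package main

import "errors"

// primaryPodNameFromObject reads status.currentPrimary from a decoded
// Cluster object. The name is hypothetical; the map has the same shape
// as the Object field of an unstructured.Unstructured.
func primaryPodNameFromObject(obj map[string]interface{}) (string, error) {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return "", errors.New("cluster has no status yet")
	}
	name, ok := status["currentPrimary"].(string)
	if !ok || name == "" {
		return "", errors.New("status.currentPrimary is missing or empty")
	}
	return name, nil
}
```

With the dynamic client result, the same lookup can also be written with `unstructured.NestedString(cluster.Object, "status", "currentPrimary")` from k8s.io/apimachinery. Returning an error when the field is absent matters during cluster bootstrap, when no primary has been elected yet.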
ConfigMap Updater #
Updates a ConfigMap that apps can mount to know if they’re on the active site:
func (u *Updater) Update(ctx context.Context, activeSite, activeNode string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      u.name,
			Namespace: u.namespace,
		},
		Data: map[string]string{
			"active-site":  activeSite,
			"active-node":  activeNode,
			"last-updated": time.Now().UTC().Format(time.RFC3339),
		},
	}
	// Create or update...
}
DNS Provider (Namecheap) #
Updates A records via the Namecheap API:
func (p *Provider) UpdateRecord(fqdn, ip string) error {
	// Parse domain: "demo1.runatyr.dev" -> host="demo1", sld="runatyr", tld="dev"
	parts := strings.Split(fqdn, ".")
	if len(parts) < 3 {
		return fmt.Errorf("expected host.sld.tld, got %q", fqdn)
	}
	host := strings.Join(parts[:len(parts)-2], ".") // keeps multi-label hosts such as "a.b"
	sld := parts[len(parts)-2]
	tld := parts[len(parts)-1]

	// Build Namecheap API URL
	apiURL := fmt.Sprintf(
		"https://api.namecheap.com/xml.response?"+
			"ApiUser=%s&ApiKey=%s&UserName=%s&ClientIp=%s&"+
			"Command=namecheap.domains.dns.setHosts&"+
			"SLD=%s&TLD=%s&"+
			"HostName1=%s&RecordType1=A&Address1=%s&TTL1=300",
		p.apiUser, p.apiKey, p.apiUser, p.clientIP,
		sld, tld, host, ip,
	)
	// Execute request...
}
📱 How Apps Consume the Signal #
Apps mount the ConfigMap and check if they’re on the active site:
import os

MY_SITE = os.environ.get('MY_SITE')  # Set per-deployment
CONFIG_PATH = '/etc/kube-site-follower/active-site'

def is_active():
    with open(CONFIG_PATH) as f:
        return f.read().strip() == MY_SITE

# Active site: read/write to database, process requests
# Standby site: idle, wait for activation
🎯 Supported Triggers & Signals #
Failover Triggers (what causes a site switch):
- ✅ CloudNativePG - PostgreSQL primary location
- 🔜 MongoDB Operator
- 🔜 K8s Node Status
Signals (how apps are notified):
- ✅ K8s ConfigMap
- ✅ DNS: Namecheap API
- 🔜 DNS: Cloudflare, AWS Route53
- 🔜 Webhook (custom)
📊 Configuration Reference #
| Variable | Required | Default | Description |
|---|---|---|---|
| CNPG_NAMESPACE | No | postgres | Namespace of CNPG cluster |
| CNPG_CLUSTER_NAME | No | pg-cluster | Name of CNPG cluster |
| PREFERRED_PRIMARY_NODE | No | - | Node where primary should run on startup |
| SITES | Yes | - | JSON array of site definitions |
| DNS_PROVIDER | No | namecheap | DNS provider to use |
| DNS_RECORDS | No | - | Comma-separated DNS records |
| NAMECHEAP_API_USER | If namecheap | - | Namecheap API username |
| NAMECHEAP_API_KEY | If namecheap | - | Namecheap API key |
| CONFIGMAP_NAME | No | kube-site-follower-status | Status ConfigMap name |
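To make the `SITES` and `DNS_RECORDS` formats concrete, here is a sketch of how the two values could be parsed. The JSON key names (`name`, `nodeName`, `ip`) are assumptions inferred from the Helm values and are not confirmed by the table; `siteDef`, `parseSites`, and `parseRecords` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// siteDef is a hypothetical shape for one entry of the SITES array.
// The JSON keys are assumed, not taken from the project.
type siteDef struct {
	Name     string `json:"name"`
	NodeName string `json:"nodeName"`
	IP       string `json:"ip"`
}

// parseSites decodes the SITES value and rejects an empty list,
// since SITES is marked as required.
func parseSites(raw string) ([]siteDef, error) {
	var sites []siteDef
	if err := json.Unmarshal([]byte(raw), &sites); err != nil {
		return nil, fmt.Errorf("parse SITES: %w", err)
	}
	if len(sites) == 0 {
		return nil, errors.New("SITES must define at least one site")
	}
	return sites, nil
}

// parseRecords splits the comma-separated DNS_RECORDS value,
// trimming whitespace and dropping empty entries.
func parseRecords(raw string) []string {
	var records []string
	for _, r := range strings.Split(raw, ",") {
		if r = strings.TrimSpace(r); r != "" {
			records = append(records, r)
		}
	}
	return records
}
```

Under these assumptions a value such as `SITES='[{"name":"us-east","nodeName":"node-us-east-01","ip":"1.2.3.4"}]'` would decode into one site, and an unset `DNS_RECORDS` yields no records, which fits its optional status.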