kube-site-follower: Kubernetes operator for site failover

A Kubernetes controller that watches for site failover triggers and signals applications to follow. When a trigger fires (e.g., database primary moves to a different node), the controller updates DNS records and notifies apps which site is active - keeping everything co-located with critical services for optimal performance.

⭐ GitHub: github.com/runatyr1/kube-site-follower #

Why “Site Follower”? #

In a multi-site Kubernetes setup, you want applications co-located with critical services (like your database primary). Cross-site queries add latency and reduce performance.

When a critical service fails over to a different node/site, everything else should follow:

  • DNS update - Records point to the new site’s IP (for external routing)
  • ConfigMap update - Apps read the active site and enable/disable accordingly (for internal signaling)

Both methods can be used by frontend or backend apps depending on your setup.

Current limitations: only 2-node Kubernetes clusters with 2 sites are supported, and CloudNativePG is the only implemented trigger. More triggers and multi-site support are planned.

🚀 Deployment #

The Helm chart optionally bundles CloudNativePG operator and a ready-to-use Postgres cluster.

Quick Start (Helm CLI) #

helm repo add kube-site-follower https://runatyr1.github.io/kube-site-follower-helm
helm repo update

helm install kube-site-follower kube-site-follower/kube-site-follower \
  --namespace kube-site-follower --create-namespace \
  --set cloudnativepg.install=true \
  --set controller.sites.us-east.nodeName=node-us-east-01 \
  --set controller.sites.us-east.ip=1.2.3.4 \
  --set controller.sites.eu-west.nodeName=node-eu-west-01 \
  --set controller.sites.eu-west.ip=5.6.7.8 \
  --set controller.dns.provider=namecheap \
  --set controller.dns.records[0]=demo1.example.com \
  --set controller.dns.namecheap.apiUser=YOUR_USER \
  --set controller.dns.namecheap.apiKey=YOUR_KEY

GitOps Examples #

FluxCD HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-site-follower
spec:
  chart:
    spec:
      chart: kube-site-follower
      sourceRef:
        kind: HelmRepository
        name: kube-site-follower
  valuesFrom:
  - kind: Secret
    name: kube-site-follower-secrets
    valuesKey: us-east-ip
    targetPath: controller.sites.us-east.ip
  - kind: Secret
    name: kube-site-follower-secrets
    valuesKey: eu-west-ip
    targetPath: controller.sites.eu-west.ip
  - kind: Secret
    name: kube-site-follower-secrets
    valuesKey: namecheap-apiUser
    targetPath: controller.dns.namecheap.apiUser
  - kind: Secret
    name: kube-site-follower-secrets
    valuesKey: namecheap-apiKey
    targetPath: controller.dns.namecheap.apiKey
  values:
    cloudnativepg:
      install: true
    controller:
      watch:
        preferredPrimaryNode: node-us-east-01  # optional
      sites:
        us-east:
          nodeName: node-us-east-01
        eu-west:
          nodeName: node-eu-west-01
      dns:
        provider: namecheap
        records:
          - demo1.example.com

ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-site-follower
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://runatyr1.github.io/kube-site-follower-helm
    chart: kube-site-follower
    targetRevision: "*"
    helm:
      values: |
        cloudnativepg:
          install: true
        controller:
          watch:
            preferredPrimaryNode: node-us-east-01
          sites:
            us-east:
              nodeName: node-us-east-01
              ip: "1.2.3.4"  # or use AVP/external-secrets
            eu-west:
              nodeName: node-eu-west-01
              ip: "5.6.7.8"
          dns:
            provider: namecheap
            records:
              - demo1.example.com
            namecheap:
              apiUser: $NAMECHEAP_USER  # ArgoCD Vault Plugin
              apiKey: $NAMECHEAP_KEY
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-site-follower
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Kubernetes Secret (for sensitive values)
apiVersion: v1
kind: Secret
metadata:
  name: kube-site-follower-secrets
  namespace: flux-system  # or argocd namespace
type: Opaque
stringData:
  us-east-ip: "1.2.3.4"
  eu-west-ip: "5.6.7.8"
  namecheap-apiUser: "your-api-user"
  namecheap-apiKey: "your-api-key"

🏗️ Architecture #

flowchart TB
    subgraph K8s["Kubernetes Cluster"]
        Controller[kube-site-follower]
        Trigger["Trigger (e.g. DB primary location)"]
        CM[ConfigMap]
        Apps[Frontend / Backend Apps]
    end
    DNS[DNS Provider]
    Users[Users]
    Controller -->|watch| Trigger
    Controller -->|update| CM
    Controller -->|update A record| DNS
    CM -->|mount & read| Apps
    Users -->|DNS lookup| DNS
    DNS -->|route to active site| Apps

The controller runs a simple loop every 10 seconds:

  1. WATCH - Query trigger source (e.g., CNPG Cluster CRD for primary pod)
  2. LOOKUP - Get pod spec to find which node it’s running on
  3. MATCH - Map node name to site config (name + IP)
  4. ACT - If site changed: update ConfigMap + DNS records

Preferred Primary Node (optional): On startup, the controller can ensure the primary runs on a specific node by triggering a switchover if needed. This sets a “home” site for your database. It is a one-time operation by design: after a failover, restart the controller to move the primary back to the preferred node (kubectl rollout restart deployment/kube-site-follower-kube-site-follower -n kube-site-follower). The controller also detects cluster recreation (via a UID change) and automatically re-runs the preferred primary check without requiring a restart.

Database protection: The PostgreSQL cluster carries the helm.sh/resource-policy: keep annotation. Uninstalling the Helm chart will NOT delete your database - manual deletion is required (kubectl delete cluster pg-cluster -n postgres).

📁 Project Structure #

kube-site-follower/
├── cmd/
│   └── controller/
│       └── main.go              # Entry point, main loop
├── pkg/
│   ├── config/
│   │   └── config.go            # Environment variable loading
│   ├── watchers/                # Trigger implementations
│   │   └── cnpg/
│   │       └── watcher.go       # CloudNativePG trigger
│   └── actions/
│       ├── configmap/
│       │   └── updater.go       # ConfigMap updater
│       └── dns/
│           └── namecheap/
│               └── provider.go  # Namecheap DNS API client
├── charts/
│   └── kube-site-follower/      # Helm chart
│       ├── Chart.yaml           # With CloudNativePG as subchart
│       ├── values.yaml
│       └── templates/
│           ├── deployment.yaml
│           ├── rbac.yaml
│           └── postgres-cluster.yaml
├── Dockerfile
└── go.mod

🔧 Core Components #

Main Controller Loop #

The controller uses a ticker-based loop that checks the database state every 10 seconds:

// Main loop: re-check every 10 seconds
ticker := time.NewTicker(10 * time.Second)
defer ticker.Stop()

checkAndUpdate := func() {
    // 1. Get primary node from CNPG watcher
    primaryNode, err := watcher.GetPrimaryNode(ctx)
    if err != nil {
        log.Printf("watcher error: %v", err)
        return
    }

    // 2. Find matching site config
    var activeSite, activeIP string
    for _, site := range cfg.Sites {
        if site.NodeName == primaryNode {
            activeSite = site.Name
            activeIP = site.IP
            break
        }
    }

    // 3. If the primary moved, trigger failover actions
    if primaryNode != currentPrimaryNode {
        cmUpdater.Update(ctx, activeSite, primaryNode)
        dnsUpdater.UpdateRecord(record, activeIP)
        currentPrimaryNode = primaryNode
    }
}

CNPG Watcher (Trigger) #

The CloudNativePG trigger uses the Kubernetes dynamic client to query the Cluster CRD. Other triggers would follow a similar pattern:

// CloudNativePG Cluster GVR
var clusterGVR = schema.GroupVersionResource{
    Group:    "postgresql.cnpg.io",
    Version:  "v1",
    Resource: "clusters",
}

// GetPrimaryNode returns the node where the primary PostgreSQL instance runs
func (w *Watcher) GetPrimaryNode(ctx context.Context) (string, error) {
    // Get the CNPG Cluster CRD
    cluster, err := w.dynamicClient.Resource(clusterGVR).
        Namespace(w.namespace).
        Get(ctx, w.clusterName, metav1.GetOptions{})
    if err != nil {
        return "", err
    }

    // Extract the currentPrimary pod name from the cluster status
    primaryPod, err := getPrimaryPodName(cluster)
    if err != nil {
        return "", err
    }

    // Get the pod to find which node it's scheduled on
    pod, err := w.clientset.CoreV1().Pods(w.namespace).
        Get(ctx, primaryPod, metav1.GetOptions{})
    if err != nil {
        return "", err
    }

    return pod.Spec.NodeName, nil
}

ConfigMap Updater #

Updates a ConfigMap that apps can mount to know if they’re on the active site:

func (u *Updater) Update(ctx context.Context, activeSite, activeNode string) error {
    cm := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      u.name,
            Namespace: u.namespace,
        },
        Data: map[string]string{
            "active-site":  activeSite,
            "active-node":  activeNode,
            "last-updated": time.Now().UTC().Format(time.RFC3339),
        },
    }
    // Create or update...
}
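
With the defaults from the configuration reference, the ConfigMap this produces looks roughly like the following (values are illustrative; the namespace is whatever the updater is configured with):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-site-follower-status
  namespace: kube-site-follower
data:
  active-site: us-east
  active-node: node-us-east-01
  last-updated: "2025-01-01T12:00:00Z"
```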

DNS Provider (Namecheap) #

Updates A records via the Namecheap API:

func (p *Provider) UpdateRecord(fqdn, ip string) error {
    // Parse domain: "demo1.runatyr.dev" -> host="demo1", sld="runatyr", tld="dev"
    // NOTE: assumes a single-label host and a two-part domain
    // (multi-label hosts and TLDs like co.uk are not handled)
    parts := strings.Split(fqdn, ".")
    host := parts[0]
    sld := parts[len(parts)-2]
    tld := parts[len(parts)-1]

    // Build Namecheap API URL
    apiURL := fmt.Sprintf(
        "https://api.namecheap.com/xml.response?"+
            "ApiUser=%s&ApiKey=%s&UserName=%s&ClientIp=%s&"+
            "Command=namecheap.domains.dns.setHosts&"+
            "SLD=%s&TLD=%s&"+
            "HostName1=%s&RecordType1=A&Address1=%s&TTL1=300",
        p.apiUser, p.apiKey, p.apiUser, p.clientIP,
        sld, tld, host, ip,
    )
    // Execute request...
}

📱 How Apps Consume the Signal #

Apps mount the ConfigMap and check if they’re on the active site:

import os

MY_SITE = os.environ.get('MY_SITE')  # Set per-deployment
CONFIG_PATH = '/etc/kube-site-follower/active-site'

def is_active():
    try:
        with open(CONFIG_PATH) as f:
            return f.read().strip() == MY_SITE
    except FileNotFoundError:
        return False  # ConfigMap not mounted yet

# Active site: read/write to database, process requests
# Standby site: idle, wait for activation
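
On the Kubernetes side, that means mounting the status ConfigMap into the pod and setting MY_SITE per Deployment. A sketch of the relevant pod-template fragment (names assume the defaults; adjust to your chart values):

```yaml
# Pod template fragment for one site's Deployment
spec:
  containers:
    - name: app
      env:
        - name: MY_SITE
          value: us-east            # "eu-west" in the other site's Deployment
      volumeMounts:
        - name: site-status
          mountPath: /etc/kube-site-follower
          readOnly: true
  volumes:
    - name: site-status
      configMap:
        name: kube-site-follower-status  # default CONFIGMAP_NAME
```

Each key of the ConfigMap becomes a file under the mount path, so the app reads /etc/kube-site-follower/active-site as in the Python snippet above.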

🎯 Supported Triggers & Signals #

Failover Triggers (what causes a site switch):

  • CloudNativePG - PostgreSQL primary location
  • 🔜 MongoDB Operator
  • 🔜 K8s Node Status

Signals (how apps are notified):

  • ✅ K8s ConfigMap
  • ✅ DNS: Namecheap API
  • 🔜 DNS: Cloudflare, AWS Route53
  • 🔜 Webhook (custom)

📊 Configuration Reference #

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| CNPG_NAMESPACE | No | postgres | Namespace of CNPG cluster |
| CNPG_CLUSTER_NAME | No | pg-cluster | Name of CNPG cluster |
| PREFERRED_PRIMARY_NODE | No | - | Node where primary should run on startup |
| SITES | Yes | - | JSON array of site definitions |
| DNS_PROVIDER | No | namecheap | DNS provider to use |
| DNS_RECORDS | No | - | Comma-separated DNS records |
| NAMECHEAP_API_USER | If namecheap | - | Namecheap API username |
| NAMECHEAP_API_KEY | If namecheap | - | Namecheap API key |
| CONFIGMAP_NAME | No | kube-site-follower-status | Status ConfigMap name |
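
SITES is documented as a JSON array; its exact field names live in pkg/config/config.go. As a purely illustrative guess based on the Helm values above (verify against the chart's rendered Deployment), it might look like:

```shell
export SITES='[
  {"name": "us-east", "nodeName": "node-us-east-01", "ip": "1.2.3.4"},
  {"name": "eu-west", "nodeName": "node-eu-west-01", "ip": "5.6.7.8"}
]'
```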