Skip to content

Disaster Recovery

Recovery procedures by failure mode. All DB recovery flows are wrapped by scripts/disaster-recovery.sh.

Failure modes at a glance

Scenario Auto-recovers? Action Downtime
Pod crash Yes (Argo CD selfHeal) None Seconds
Spot node preemption Yes (GKE reschedules) None 1-2 min
DB data corruption No PITR via CNPG 10-20 min
Accidental table drop No PITR to before drop 10-20 min
GKE cluster destroyed No Terraform + Argo CD + DB restore 30-60 min
GCP region outage No Manual (cross-region not configured) Hours

Backup architecture

PostgreSQL (CNPG)
  |
  +-- WAL archiving (continuous) --> GCS
  +-- Base backup (daily 03:00 UTC) --> GCS
  |
  GCS bucket: gs://tomoda-db-backups-development-485000/
    dev/
      base/    # Full backups
      wals/    # WAL segments (enables PITR to any second)
    prod/
      base/
      wals/
  • WAL archiving is continuous. RPO is effectively seconds — limited to in-flight transactions.
  • Base backups run daily at 03:00 UTC. PITR uses the nearest base backup plus replayed WALs.
  • Retention is 30 days, enforced by a GCS lifecycle rule.
  • Cost is well under $1/mo at current data sizes.

Three recovery modes

The scripts/disaster-recovery.sh script supports three modes. Pick by scenario:

Mode When
pitr You know exactly when the bad thing happened. Restore to that timestamp.
latest You just need the most recent good state.
full-rebuild Cluster is gone. Re-run Terraform, wait for Argo CD, then restore the DB.

Mode: pitr

Use this for bad migrations or accidental data destruction with a known timestamp.

./scripts/disaster-recovery.sh \
  --env prod \
  --mode pitr \
  --target-time "2026-04-27T09:55:00Z"

The script:

  1. Scales the backend Deployment to 0 to stop further writes
  2. Creates a new CNPG cluster postgres-prod-recovered whose bootstrap.recovery block points at the GCS Barman store with recoveryTarget.targetTime
  3. Waits up to 20 minutes for the recovery cluster to report healthy
  4. Verifies the recovered data — prints user count and PostGIS version
  5. Prints next-step instructions for traffic switchover

Mode: latest

Same flow as PITR but without the recoveryTarget block — CNPG restores to the latest available point (end of the most recent WAL segment).

./scripts/disaster-recovery.sh --env prod --mode latest

Mode: full-rebuild

Cluster destroyed or corrupted at the K8s level.

./scripts/disaster-recovery.sh --env prod --mode full-rebuild

The script:

  1. Runs terraform apply in infrastructure/gcp/ to recreate GKE, networking, IAM
  2. Fetches cluster credentials
  3. Waits for the CNPG operator CRD to exist (Argo CD is reconciling sys apps)
  4. Waits up to 10 minutes for the postgres-prod cluster to report healthy

If the new CNPG cluster does not auto-restore from WAL archive (e.g., it bootstrapped fresh because no bootstrap.recovery was set on the Cluster CR), follow up with:

./scripts/disaster-recovery.sh --env prod --mode latest

Traffic switchover after PITR

The recovery script creates postgres-prod-recovered. To direct the backend at it, repoint the ExternalName service the backend uses:

# Dev
kubectl patch svc postgres-postgresql -n data \
  -p '{"spec":{"externalName":"postgres-dev-recovered-rw.data.svc.cluster.local"}}'

# Prod
kubectl patch svc prod-postgres-postgresql -n data \
  -p '{"spec":{"externalName":"postgres-prod-recovered-rw.data.svc.cluster.local"}}'

# Scale backend pools back up (HPA will then take over within ~1 min)
kubectl scale deployment/tomoda-api -n prod --replicas=2
kubectl scale deployment/tomoda-async -n prod --replicas=2

# Smoke test
curl -fsS https://api.tomoda.life/health

When the recovery cluster is confirmed good:

  1. Delete the old broken Cluster CR: kubectl delete cluster postgres-prod -n data
  2. Update Git manifests so the recovery cluster becomes the canonical primary (or rename it back and update the ExternalName in YAML)
  3. Commit and let Argo CD reconcile

No-action scenarios

Single pod crash

selfHeal: true is set on every Argo CD application. K8s restarts the pod automatically.

kubectl get pods -n prod
kubectl get pods -n data

Spot node preemption

GKE reschedules pods. CNPG handles Postgres pod rescheduling and reattaches the PVC.

kubectl get nodes
kubectl get pods -n data -o wide

Verifying backup health

Run regularly (e.g., weekly via a smoke-test cron) to catch silent backup failures:

# Last successful base backup timestamp
kubectl get cluster postgres-dev -n data \
  -o jsonpath='{.status.lastSuccessfulBackup}'
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.lastSuccessfulBackup}'

# All Backup CRs
kubectl get backup -n data

# Recent base backups in GCS
gcloud storage ls -l gs://tomoda-db-backups-development-485000/prod/base/ | tail -5

# Continuous archiving status
kubectl get cluster -n data \
  -o jsonpath='{range .items[*]}{.metadata.name}: {.status.conditions[?(@.type=="ContinuousArchiving")].status}{"\n"}{end}'

ContinuousArchiving should be True for both clusters at all times.

RTO / RPO

Metric Target Current capability
RPO (max data loss) Seconds Continuous WAL archiving
RTO (recovery time) <30 min PITR 10-20 min; full rebuild 30-60 min

File reference

File Purpose
scripts/disaster-recovery.sh Automated DR script — three modes
k8s/envs/{dev,prod}/middleware/postgres/manifests/cluster.yaml CNPG Cluster + backup config
k8s/envs/{dev,prod}/middleware/postgres/manifests/backup.yaml ScheduledBackup CRs
infrastructure/gcp/backup.tf GCS bucket + Workload Identity SA + IAM