Disaster Recovery¶

Recovery procedures by failure mode. All DB recovery flows are wrapped by scripts/disaster-recovery.sh.

Failure modes at a glance¶

Scenario	Auto-recovers?	Action	Downtime
Pod crash	Yes (Argo CD `selfHeal`)	None	Seconds
Spot node preemption	Yes (GKE reschedules)	None	1-2 min
DB data corruption	No	PITR via CNPG	10-20 min
Accidental table drop	No	PITR to before drop	10-20 min
GKE cluster destroyed	No	Terraform + Argo CD + DB restore	30-60 min
GCP region outage	No	Manual (cross-region not configured)	Hours

Backup architecture¶

PostgreSQL (CNPG)
  |
  +-- WAL archiving (continuous) --> GCS
  +-- Base backup (daily 03:00 UTC) --> GCS
  |
  GCS bucket: gs://tomoda-db-backups-development-485000/
    dev/
      base/    # Full backups
      wals/    # WAL segments (enables PITR to any second)
    prod/
      base/
      wals/

WAL archiving is continuous. RPO is effectively seconds — limited to in-flight transactions.
Base backups run daily at 03:00 UTC. PITR uses the nearest base backup plus replayed WALs.
Retention is 30 days, enforced by a GCS lifecycle rule.
Cost is well under $1/mo at current data sizes.

Three recovery modes¶

The scripts/disaster-recovery.sh script supports three modes. Pick by scenario:

Mode	When
`pitr`	You know exactly when the bad thing happened. Restore to that timestamp.
`latest`	You just need the most recent good state.
`full-rebuild`	Cluster is gone. Re-run Terraform, wait for Argo CD, then restore the DB.

Mode: `pitr`¶

Use this for bad migrations or accidental data destruction with a known timestamp.

./scripts/disaster-recovery.sh \
  --env prod \
  --mode pitr \
  --target-time "2026-04-27T09:55:00Z"

The script:

Scales the backend Deployment to 0 to stop further writes
Creates a new CNPG cluster postgres-prod-recovered whose bootstrap.recovery block points at the GCS Barman store with recoveryTarget.targetTime
Waits up to 20 minutes for the recovery cluster to report healthy
Verifies the recovered data — prints user count and PostGIS version
Prints next-step instructions for traffic switchover

Mode: `latest`¶

Same flow as PITR but without the recoveryTarget block — CNPG restores to the latest available point (end of the most recent WAL segment).

./scripts/disaster-recovery.sh --env prod --mode latest

Mode: `full-rebuild`¶

Cluster destroyed or corrupted at the K8s level.

./scripts/disaster-recovery.sh --env prod --mode full-rebuild

The script:

Runs terraform apply in infrastructure/gcp/ to recreate GKE, networking, IAM
Fetches cluster credentials
Waits for the CNPG operator CRD to exist (Argo CD is reconciling sys apps)
Waits up to 10 minutes for the postgres-prod cluster to report healthy

If the new CNPG cluster does not auto-restore from WAL archive (e.g., it bootstrapped fresh because no bootstrap.recovery was set on the Cluster CR), follow up with:

./scripts/disaster-recovery.sh --env prod --mode latest

Traffic switchover after PITR¶

The recovery script creates postgres-prod-recovered. To direct the backend at it, repoint the ExternalName service the backend uses:

# Dev
kubectl patch svc postgres-postgresql -n data \
  -p '{"spec":{"externalName":"postgres-dev-recovered-rw.data.svc.cluster.local"}}'

# Prod
kubectl patch svc postgres-prod-postgresql -n data \
  -p '{"spec":{"externalName":"postgres-prod-recovered-rw.data.svc.cluster.local"}}'

# Scale backend pools back up (HPA will then take over within ~1 min)
kubectl scale deployment/tomoda-api -n prod --replicas=2
kubectl scale deployment/tomoda-async -n prod --replicas=2

# Smoke test
curl -fsS https://api.tomoda.life/health

When the recovery cluster is confirmed good:

Delete the old broken Cluster CR: kubectl delete cluster postgres-prod -n data
Update Git manifests so the recovery cluster becomes the canonical primary (or rename it back and update the ExternalName in YAML)
Commit and let Argo CD reconcile

No-action scenarios¶

Single pod crash¶

selfHeal: true is set on every Argo CD application. K8s restarts the pod automatically.

kubectl get pods -n prod
kubectl get pods -n data

Spot node preemption¶

GKE reschedules pods. CNPG handles Postgres pod rescheduling and reattaches the PVC.

kubectl get nodes
kubectl get pods -n data -o wide

Verifying backup health¶

Run regularly (e.g., weekly via a smoke-test cron) to catch silent backup failures:

# Last successful base backup timestamp
kubectl get cluster postgres-dev -n data \
  -o jsonpath='{.status.lastSuccessfulBackup}'
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.lastSuccessfulBackup}'

# All Backup CRs
kubectl get backup -n data

# Recent base backups in GCS
gcloud storage ls -l gs://tomoda-db-backups-development-485000/prod/base/ | tail -5

# Continuous archiving status
kubectl get cluster -n data \
  -o jsonpath='{range .items[*]}{.metadata.name}: {.status.conditions[?(@.type=="ContinuousArchiving")].status}{"\n"}{end}'

ContinuousArchiving should be True for both clusters at all times.

RTO / RPO¶

Metric	Target	Current capability
RPO (max data loss)	Seconds	Continuous WAL archiving
RTO (recovery time)	<30 min	PITR 10-20 min; full rebuild 30-60 min

File reference¶

File	Purpose
`scripts/disaster-recovery.sh`	Automated DR script — three modes
`k8s/envs/{dev,prod}/postgres/manifests/cluster.yaml`	CNPG Cluster + backup config
`k8s/envs/{dev,prod}/postgres/manifests/backup.yaml`	ScheduledBackup CRs
`infrastructure/gcp/backup.tf`	GCS bucket + Workload Identity SA + IAM