Disaster Recovery
Recovery procedures by failure mode. All DB recovery flows are wrapped by scripts/disaster-recovery.sh.
Failure modes at a glance
| Scenario | Auto-recovers? | Action | Downtime |
|---|---|---|---|
| Pod crash | Yes (Argo CD selfHeal) | None | Seconds |
| Spot node preemption | Yes (GKE reschedules) | None | 1-2 min |
| DB data corruption | No | PITR via CNPG | 10-20 min |
| Accidental table drop | No | PITR to before drop | 10-20 min |
| GKE cluster destroyed | No | Terraform + Argo CD + DB restore | 30-60 min |
| GCP region outage | No | Manual (cross-region not configured) | Hours |
Backup architecture
PostgreSQL (CNPG)
  |
  +-- WAL archiving (continuous) --> GCS
  +-- Base backup (daily 03:00 UTC) --> GCS
  |
GCS bucket: gs://tomoda-db-backups-development-485000/
  dev/
    base/   # Full backups
    wals/   # WAL segments (enables PITR to any second)
  prod/
    base/
    wals/
- WAL archiving is continuous. RPO is effectively seconds under steady write load, limited to in-flight transactions and any WAL segment that has not yet been archived.
- Base backups run daily at 03:00 UTC. PITR uses the nearest base backup plus replayed WALs.
- Retention is 30 days, enforced by a GCS lifecycle rule.
- Cost is well under $1/mo at current data sizes.
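To make "nearest base backup plus replayed WALs" concrete, here is a minimal conceptual sketch of the selection rule: take the newest base backup at or before the target time, then replay WAL forward from it. The function name `select_base_backup` and the timestamp list are illustrative assumptions. CNPG and Barman perform this selection internally, so this is not something the operator or the script has to run.

```shell
#!/usr/bin/env bash
# Conceptual sketch only: choose the newest base backup taken at or before
# the recovery target. Timestamps are assumed to be UTC in the fixed-width
# form YYYY-MM-DDTHH:MM:SSZ, so string comparison matches time order.
select_base_backup() {
  local target="$1"; shift
  local best="" ts
  for ts in "$@"; do
    # Skip backups taken after the target: WAL replay only moves forward.
    [[ "$ts" > "$target" ]] && continue
    # Keep the most recent remaining candidate.
    if [[ -z "$best" || "$ts" > "$best" ]]; then
      best="$ts"
    fi
  done
  # No candidate means the target predates every retained base backup.
  [[ -n "$best" ]] || return 1
  printf '%s\n' "$best"
}
```

A target that is older than every retained base backup returns a non-zero status, which mirrors why PITR cannot reach back past the 30-day retention window.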
Three recovery modes
The scripts/disaster-recovery.sh script supports three modes. Pick by scenario:
| Mode | When |
|---|---|
| pitr | You know exactly when the bad thing happened. Restore to that timestamp. |
| latest | You just need the most recent good state. |
| full-rebuild | Cluster is gone. Re-run Terraform, wait for Argo CD, then restore the DB. |
Mode: pitr
Use this for bad migrations or accidental data destruction with a known timestamp.
./scripts/disaster-recovery.sh \
--env prod \
--mode pitr \
--target-time "2026-04-27T09:55:00Z"
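A mistyped timestamp costs a full recovery cycle, so it can be worth checking the value before invoking the script. The following is a hedged sketch of such a pre-flight check. `validate_target_time` is a hypothetical helper, not part of scripts/disaster-recovery.sh, and it assumes GNU `date` and the 30-day retention described earlier.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of scripts/disaster-recovery.sh.
# Requires GNU date (the -d option).
RETENTION_DAYS=30   # assumption: mirrors the documented backup retention

validate_target_time() {
  local target="$1"
  local now_epoch="${2:-$(date -u +%s)}"   # injectable so the check is testable
  local target_epoch oldest_epoch

  # Accept only the fixed UTC form used in the example command.
  if [[ ! "$target" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$ ]]; then
    echo "target time must look like 2026-04-27T09:55:00Z" >&2
    return 1
  fi
  # Reject well-formed but impossible dates (for example month 13).
  if ! target_epoch=$(date -u -d "$target" +%s 2>/dev/null); then
    echo "target time is not a real date: $target" >&2
    return 1
  fi
  if (( target_epoch > now_epoch )); then
    echo "target time is in the future" >&2
    return 1
  fi
  oldest_epoch=$(( now_epoch - RETENTION_DAYS * 86400 ))
  if (( target_epoch < oldest_epoch )); then
    echo "target time is older than the ${RETENTION_DAYS}-day retention window" >&2
    return 1
  fi
}
```

For example, `validate_target_time "2026-04-27T09:55:00Z" && ./scripts/disaster-recovery.sh --env prod --mode pitr --target-time "2026-04-27T09:55:00Z"` would only start the recovery if the timestamp passes all checks.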
The script:
- Scales the backend Deployment to 0 to stop further writes
- Creates a new CNPG cluster postgres-prod-recovered whose bootstrap.recovery block points at the GCS Barman store with recoveryTarget.targetTime
- Waits up to 20 minutes for the recovery cluster to report healthy
- Verifies the recovered data: prints user count and PostGIS version
- Prints next-step instructions for traffic switchover
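The recovery cluster created in the second step could look roughly like the manifest rendered below. This is a sketch under stated assumptions, not the script's actual output: the function `render_recovery_manifest`, the external cluster name `origin`, the instance count, the storage size, the use of Workload Identity credentials, and `serverName` being the environment prefix seen in the bucket layout are all assumptions. `imageName` is omitted, and a real manifest would need the same PostgreSQL major version and PostGIS image as the source cluster.

```shell
#!/usr/bin/env bash
# Sketch only: prints a CNPG Cluster manifest for a recovery cluster.
# Assumed (not taken from the real script): 1 instance, 10Gi storage,
# Workload Identity credentials, serverName equal to the env prefix.
BUCKET="gs://tomoda-db-backups-development-485000"

render_recovery_manifest() {
  local env="$1" target_time="${2:-}"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-${env}-recovered
  namespace: data
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: origin
EOF
  # Only PITR sets a recovery target; without one, all archived WAL is replayed.
  if [[ -n "$target_time" ]]; then
    cat <<EOF
      recoveryTarget:
        targetTime: "${target_time}"
EOF
  fi
  cat <<EOF
  externalClusters:
    - name: origin
      barmanObjectStore:
        destinationPath: ${BUCKET}
        serverName: ${env}
        googleCredentials:
          gkeEnvironment: true
EOF
}
```

Rendering to stdout keeps the manifest reviewable before it is applied, for example with `render_recovery_manifest prod "2026-04-27T09:55:00Z" | kubectl apply -f -`.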
Mode: latest
Same flow as pitr, but without the recoveryTarget block: CNPG restores to the latest available point (the end of the most recent archived WAL segment).
./scripts/disaster-recovery.sh --env prod --mode latest
Mode: full-rebuild
Use this when the cluster has been destroyed or corrupted at the K8s level.
./scripts/disaster-recovery.sh --env prod --mode full-rebuild
The script:
- Runs terraform apply in infrastructure/gcp/ to recreate GKE, networking, IAM
- Fetches cluster credentials
- Waits for the CNPG operator CRD to exist (Argo CD is reconciling sys apps)
- Waits up to 10 minutes for the postgres-prod cluster to report healthy
If the new CNPG cluster does not auto-restore from WAL archive (e.g., it bootstrapped fresh because no bootstrap.recovery was set on the Cluster CR), follow up with:
./scripts/disaster-recovery.sh --env prod --mode latest
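The bounded waits in the recovery flows (20 minutes for a recovery cluster, 10 minutes after a rebuild) follow a common poll-until-deadline pattern. The sketch below shows one way to write it. `wait_until` and `cluster_healthy` are hypothetical names, the polling interval is an assumption, and the health check assumes the CNPG Cluster reports the phase string "Cluster in healthy state". The real script may implement its waits differently.

```shell
#!/usr/bin/env bash
# Generic bounded wait: re-run a check until it succeeds or the deadline passes.
wait_until() {
  local timeout_s="$1" interval_s="$2"; shift 2
  local deadline=$(( SECONDS + timeout_s ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# Hypothetical health check based on the Cluster's status.phase field.
cluster_healthy() {
  local name="$1" phase
  phase=$(kubectl get cluster "$name" -n data \
    -o jsonpath='{.status.phase}' 2>/dev/null) || return 1
  [[ "$phase" == "Cluster in healthy state" ]]
}

# Example (not executed here): wait up to 10 minutes, polling every 15 seconds.
# wait_until 600 15 cluster_healthy postgres-prod
```

Passing the check as a command keeps the timeout logic reusable, so the same helper could also wait for the CNPG CRD to appear.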
Traffic switchover after PITR
The recovery script creates postgres-prod-recovered. To direct the backend at it, repoint the ExternalName service the backend uses:
# Dev
kubectl patch svc postgres-postgresql -n data \
-p '{"spec":{"externalName":"postgres-dev-recovered-rw.data.svc.cluster.local"}}'
# Prod
kubectl patch svc prod-postgres-postgresql -n data \
-p '{"spec":{"externalName":"postgres-prod-recovered-rw.data.svc.cluster.local"}}'
# Scale backend pools back up (HPA will then take over within ~1 min)
kubectl scale deployment/tomoda-api -n prod --replicas=2
kubectl scale deployment/tomoda-async -n prod --replicas=2
# Smoke test
curl -fsS https://api.tomoda.life/health
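Because the Service name differs between dev and prod, it is easy to patch the wrong one under pressure. As a small illustration, the hypothetical helper below prints the patch command for a given environment so it can be reviewed before it is run. The Service and host names are the ones shown in the commands above; the function itself is an assumption and not part of the DR script.

```shell
#!/usr/bin/env bash
# Sketch: build (but do not run) the ExternalName patch command for one env.
switchover_cmd() {
  local env="$1" svc host
  case "$env" in
    dev)  svc="postgres-postgresql" ;;
    prod) svc="prod-postgres-postgresql" ;;
    *)    echo "unknown env: $env" >&2; return 1 ;;
  esac
  # Read-write Service of the recovered CNPG cluster in the data namespace.
  host="postgres-${env}-recovered-rw.data.svc.cluster.local"
  printf "kubectl patch svc %s -n data -p '%s'\n" \
    "$svc" "{\"spec\":{\"externalName\":\"${host}\"}}"
}
```

After applying the printed command, `kubectl get svc prod-postgres-postgresql -n data -o jsonpath='{.spec.externalName}'` shows which host the Service currently points at.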
When the recovery cluster is confirmed good:
- Delete the old broken Cluster CR: kubectl delete cluster postgres-prod -n data
- Update Git manifests so the recovery cluster becomes the canonical primary (or rename it back and update the ExternalName in YAML)
- Commit and let Argo CD reconcile
No-action scenarios
Single pod crash
K8s restarts a crashed pod automatically. selfHeal: true is set on every Argo CD application, so any drift from the state defined in Git is reverted as well.
kubectl get pods -n prod
kubectl get pods -n data
Spot node preemption
GKE reschedules pods. CNPG handles Postgres pod rescheduling and reattaches the PVC.
kubectl get nodes
kubectl get pods -n data -o wide
Verifying backup health
Run these checks regularly (e.g., weekly via a smoke-test cron) to catch silent backup failures:
# Last successful base backup timestamp
kubectl get cluster postgres-dev -n data \
-o jsonpath='{.status.lastSuccessfulBackup}'
kubectl get cluster postgres-prod -n data \
-o jsonpath='{.status.lastSuccessfulBackup}'
# All Backup CRs
kubectl get backup -n data
# Recent base backups in GCS
gcloud storage ls -l gs://tomoda-db-backups-development-485000/prod/base/ | tail -5
# Continuous archiving status
kubectl get cluster -n data \
-o jsonpath='{range .items[*]}{.metadata.name}: {.status.conditions[?(@.type=="ContinuousArchiving")].status}{"\n"}{end}'
ContinuousArchiving should be True for both clusters at all times.
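For a cron-driven check, the timestamp has to be turned into a pass/fail result. Here is a minimal sketch of a staleness test on `lastSuccessfulBackup`. The helper `backup_is_fresh` is hypothetical, the 26-hour default threshold is an assumption (the daily schedule plus slack for backup runtime), and GNU `date` is required.

```shell
#!/usr/bin/env bash
# Sketch of a staleness check for the daily base backup. Requires GNU date.
backup_is_fresh() {
  local last_backup="$1"
  local max_age_hours="${2:-26}"           # assumed threshold: 24h plus slack
  local now_epoch="${3:-$(date -u +%s)}"   # injectable so the check is testable
  local backup_epoch

  # An empty status field means no successful backup has been recorded.
  [[ -n "$last_backup" ]] || return 1
  backup_epoch=$(date -u -d "$last_backup" +%s 2>/dev/null) || return 1
  # Succeeds only if the backup is no older than the allowed age.
  (( now_epoch - backup_epoch <= max_age_hours * 3600 ))
}

# Example (not executed here):
# last=$(kubectl get cluster postgres-prod -n data \
#   -o jsonpath='{.status.lastSuccessfulBackup}')
# backup_is_fresh "$last" || echo "postgres-prod: base backup is stale" >&2
```

Treating an empty field as a failure matters here, since a cluster that has never completed a backup would otherwise pass silently.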
RTO / RPO
| Metric | Target | Current capability |
|---|---|---|
| RPO (max data loss) | Seconds | Continuous WAL archiving |
| RTO (recovery time) | <30 min | PITR 10-20 min (within target); full rebuild 30-60 min (exceeds target) |
File reference
| File | Purpose |
|---|---|
| scripts/disaster-recovery.sh | Automated DR script (three modes) |
| k8s/envs/{dev,prod}/middleware/postgres/manifests/cluster.yaml | CNPG Cluster + backup config |
| k8s/envs/{dev,prod}/middleware/postgres/manifests/backup.yaml | ScheduledBackup CRs |
| infrastructure/gcp/backup.tf | GCS bucket + Workload Identity SA + IAM |