Postgres Operations¶
Tomoda uses CloudNativePG (CNPG) — a Kubernetes operator that manages PostgreSQL clusters, including provisioning, WAL archiving, base backups, PITR, monitoring, and failover.
Cluster topology¶
CNPG Operator (namespace: cnpg-system)
|
+-- postgres-dev (namespace: data, image: postgis:17-3.5)
| +-- postgres-dev-1 (primary, 10Gi PVC)
| +-- ScheduledBackup --> gs://tomoda-db-backups-.../dev/
|
+-- postgres-prod (namespace: data, image: postgis:17-3.5)
+-- postgres-prod-1 (primary, 20Gi PVC)
+-- ScheduledBackup --> gs://tomoda-db-backups-.../prod/
Dev and prod are separate clusters, same GKE
The postgres-dev and postgres-prod CNPG Cluster resources are independent — different PVCs, different credentials, different backup paths. Both live in the same GKE cluster and the same data namespace, but they share nothing at the storage layer. Destroying one does not affect the other.
| Property | postgres-dev |
postgres-prod |
|---|---|---|
| Image | ghcr.io/cloudnative-pg/postgis:17-3.5 |
Same |
| Instances | 1 | 1 (raise for HA) |
| Storage | 10Gi PVC | 20Gi PVC |
| Database | tomoda_dev |
tomoda_prod |
| Owner | tomoda_dev_user |
tomoda_prod_user |
| Extensions | PostGIS, PostGIS Topology | Same |
shared_buffers |
128MB | 256MB |
| Daily base backup | 03:00 UTC -> GCS | Same |
| WAL archiving | Continuous -> GCS | Same |
| Retention | 30d | 30d |
| Monitoring | PodMonitor (Prometheus) | Same |
Service DNS¶
CNPG creates three services per cluster automatically:
| Suffix | Purpose |
|---|---|
-rw |
Read-write (primary only) |
-ro |
Read-only (replicas only — empty until you scale to 2+) |
-r |
Any instance |
The backend uses a stable ExternalName alias for backward compatibility:
| Alias | Resolves to | Used by |
|---|---|---|
postgres-postgresql.data.svc.cluster.local |
postgres-dev-rw.data.svc.cluster.local |
Backend (dev) |
prod-postgres-postgresql.data.svc.cluster.local |
postgres-prod-rw.data.svc.cluster.local |
Backend (prod) |
Configuration source files¶
| File | Purpose |
|---|---|
k8s/envs/dev/middleware/postgres/manifests/cluster.yaml |
Dev Cluster CR + ExternalSecret + ExternalName |
k8s/envs/prod/middleware/postgres/manifests/cluster.yaml |
Prod Cluster CR + ExternalSecret + ExternalName |
k8s/envs/{dev,prod}/middleware/postgres/manifests/backup.yaml |
ScheduledBackup CRs |
k8s/envs/dev/sys/cnpg/application.yaml |
Argo CD app installing the CNPG operator via Helm |
infrastructure/gcp/backup.tf |
GCS bucket + cnpg-backup-sa GCP SA + Workload Identity binding |
Tuning PostgreSQL parameters¶
Edit spec.postgresql.parameters in the cluster CR. Example from prod:
postgresql:
parameters:
max_connections: "100"
shared_buffers: "256MB"
effective_cache_size: "512MB"
work_mem: "4MB"
log_min_duration_statement: "1000" # log queries slower than 1s
Commit, push, Argo CD syncs. CNPG applies the change with an automatic rolling restart for params that require one.
Credentials¶
K8s Secrets:
postgres-dev-credentialsindatapostgres-prod-credentialsindata
Both are populated by an ExternalSecret that pulls tomoda-db-password from GCP Secret Manager (see Secrets Management). The same secret is consumed by the backend as DB_PASSWORD — they must match.
Backups¶
How it works¶
CNPG uses Barman with two mechanisms:
- WAL archiving (continuous) — every WAL segment is shipped to GCS as it is written. Enables PITR to any second.
- Scheduled base backups (daily) — physical backup at 03:00 UTC. Combined with WAL archives, this is the starting point for any restore.
Backup storage layout:
gs://tomoda-db-backups-development-485000/
dev/
base/ # base backups
wals/ # WAL archive segments
prod/
base/
wals/
GCS lifecycle auto-deletes after 30 days (configured via Terraform).
Verify backups¶
# Backup CRs
kubectl get backup -n data
# Scheduled backups
kubectl get scheduledbackup -n data
# Last successful backup
kubectl get cluster postgres-prod -n data \
-o jsonpath='{.status.lastSuccessfulBackup}'
# Per-cluster backup history
kubectl get backup -n data -l cnpg.io/cluster=postgres-prod
# Continuous archiving status
kubectl get cluster postgres-prod -n data \
-o jsonpath='{.status.conditions}' | jq
# GCS contents
gcloud storage ls gs://tomoda-db-backups-development-485000/prod/base/
gcloud storage ls gs://tomoda-db-backups-development-485000/prod/wals/ | head
Trigger a manual backup¶
kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
name: postgres-prod-manual-$(date +%Y%m%d%H%M)
namespace: data
spec:
method: barmanObjectStore
cluster:
name: postgres-prod
EOF
kubectl get backup -n data -w
Restores¶
For full-cluster PITR and latest-state restores, use the wrapper script — see Disaster Recovery.
Restore a single table (logical)¶
When you need to recover one table without rolling the whole DB back:
# 1. Spin up a recovery cluster via the DR script
./scripts/disaster-recovery.sh --env prod --mode pitr \
--target-time "2026-04-27T09:55:00Z"
# 2. Wait for postgres-prod-recovered to be healthy
# 3. Dump just the table from the recovery cluster
kubectl exec -it postgres-prod-recovered-1 -n data -- \
pg_dump -U tomoda_prod_user -d tomoda_prod \
--table=users --data-only --column-inserts \
> users-recovered.sql
# 4. Apply to the live cluster
cat users-recovered.sql | kubectl exec -i postgres-prod-1 -n data -- \
psql -U tomoda_prod_user -d tomoda_prod
# 5. Delete the recovery cluster
kubectl delete cluster postgres-prod-recovered -n data
Logical dumps over kubectl exec
kubectl exec buffers stdout. For large tables, prefer port-forwarding and running pg_dump from your laptop, or use a temporary debug pod with gcloud storage cp to GCS.
pg_dump / pg_restore between environments¶
For seeding dev from a redacted prod snapshot:
kubectl exec -it postgres-prod-1 -n data -- \
pg_dump -U tomoda_prod_user -d tomoda_prod | gzip > prod-dump.sql.gz
# Redact before loading into dev. Then:
gunzip -c prod-dump.sql.gz | kubectl exec -i postgres-dev-1 -n data -- \
psql -U tomoda_dev_user -d tomoda_dev
Scaling¶
Storage¶
The default storage class supports online expansion. Edit the cluster CR:
storage:
size: 50Gi # was 20Gi
CNPG resizes the PVC in place. No downtime.
CPU / memory¶
Edit spec.resources in the cluster CR. CNPG performs a rolling restart.
Read replicas¶
spec:
instances: 2 # 1 primary + 1 replica
CNPG configures streaming replication, exposes postgres-prod-ro for read-only queries, and handles automatic failover if the primary dies.
Version upgrades¶
CNPG handles minor PostgreSQL upgrades (e.g., 17.1 -> 17.2) via image bump and rolling restart:
imageName: ghcr.io/cloudnative-pg/postgis:17.2-3.5
Commit, push. For major upgrades (17 -> 18), use CNPG's pg_upgrade workflow — that requires a planned maintenance window and the official CNPG upgrade docs. Do not attempt in-place major upgrades without testing in dev first.
Monitoring¶
CNPG exports Prometheus metrics via the built-in PodMonitor. Useful metrics in Grafana:
| Metric | Watch for |
|---|---|
cnpg_pg_replication_lag |
>1s sustained |
cnpg_pg_stat_activity_count |
Connection saturation |
cnpg_pg_database_size_bytes |
Storage growth trend |
cnpg_collector_up |
Should be 1 |
cnpg_pg_stat_bgwriter_buffers_checkpoint |
Checkpoint cadence |
Import the official CNPG Grafana dashboard (#20417) if it is not already loaded.
Debugging¶
# Cluster overview
kubectl get cluster -n data
# Detailed status
kubectl describe cluster postgres-prod -n data
# Status conditions
kubectl get cluster postgres-prod -n data \
-o jsonpath='{.status.conditions}' | jq
# Current primary
kubectl get cluster postgres-prod -n data \
-o jsonpath='{.status.currentPrimary}'
# Postgres pod logs
kubectl logs postgres-prod-1 -n data
# CNPG operator logs
kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg
Cluster stuck in "Setting up primary"¶
kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg --tail=50
kubectl describe pod postgres-prod-1 -n data
Most common cause: PVC binding failure (storage class issue) or image pull failure.
Backup failing¶
kubectl get backup -n data -o wide
kubectl describe sa postgres-prod -n data
# Should have annotation: iam.gke.io/gcp-service-account: cnpg-backup-sa@...
# Test GCS access from the pod
kubectl exec -it postgres-prod-1 -n data -- \
curl -H "Metadata-Flavor: Google" \
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
PVC full¶
kubectl exec -it postgres-prod-1 -n data -- df -h /var/lib/postgresql/data
# Resize: edit cluster.yaml -> spec.storage.size, commit, push