Postgres Operations¶

Tomoda uses CloudNativePG (CNPG) — a Kubernetes operator that manages PostgreSQL clusters, including provisioning, WAL archiving, base backups, PITR, monitoring, and failover.

Cluster topology¶

CNPG Operator (namespace: cnpg-system)
  |
  +-- postgres-dev  (namespace: data, image: postgis:17-3.5)
  |     +-- postgres-dev-1     (primary, 10Gi PVC)
  |     +-- ScheduledBackup --> gs://tomoda-db-backups-.../dev/
  |
  +-- postgres-prod (namespace: data, image: postgis:17-3.5)
        +-- postgres-prod-1    (primary, 20Gi PVC)
        +-- ScheduledBackup --> gs://tomoda-db-backups-.../prod/

Dev and prod are separate clusters, same GKE

The postgres-dev and postgres-prod CNPG Cluster resources are independent — different PVCs, different credentials, different backup paths. Both live in the same GKE cluster and the same data namespace, but they share nothing at the storage layer. Destroying one does not affect the other.

Property	`postgres-dev`	`postgres-prod`
Image	`ghcr.io/cloudnative-pg/postgis:17-3.5`	Same
Instances	1	1 (raise for HA)
Storage	10Gi PVC	20Gi PVC
Database	`tomoda_dev`	`tomoda_prod`
Owner	`tomoda_dev_user`	`tomoda_prod_user`
Extensions	PostGIS, PostGIS Topology	Same
`shared_buffers`	128MB	256MB
Daily base backup	03:00 UTC -> GCS	Same
WAL archiving	Continuous -> GCS	Same
Retention	30d	30d
Monitoring	PodMonitor (Prometheus)	Same

Service DNS¶

CNPG creates three services per cluster automatically:

Suffix	Purpose
`-rw`	Read-write (primary only)
`-ro`	Read-only (replicas only — empty until you scale to 2+)
`-r`	Any instance

The backend uses a stable ExternalName alias for backward compatibility:

Alias	Resolves to	Used by
`postgres-postgresql.data.svc.cluster.local`	`postgres-dev-rw.data.svc.cluster.local`	Backend (dev)
`postgres-prod-postgresql.data.svc.cluster.local`	`postgres-prod-rw.data.svc.cluster.local`	Backend (prod)

Configuration source files¶

File	Purpose
`k8s/envs/dev/postgres/manifests/cluster.yaml`	Dev `Cluster` CR + ExternalSecret + ExternalName
`k8s/envs/prod/postgres/manifests/cluster.yaml`	Prod `Cluster` CR + ExternalSecret + ExternalName
`k8s/envs/{dev,prod}/postgres/manifests/backup.yaml`	`ScheduledBackup` CRs
`k8s/envs/platform/cnpg/application.yaml`	Argo CD app installing the CNPG operator via Helm
`infrastructure/gcp/backup.tf`	GCS bucket + `cnpg-backup-sa` GCP SA + Workload Identity binding

Tuning PostgreSQL parameters¶

Edit spec.postgresql.parameters in the cluster CR. Example from prod:

postgresql:
  parameters:
    max_connections: "100"
    shared_buffers: "256MB"
    effective_cache_size: "512MB"
    work_mem: "4MB"
    log_min_duration_statement: "1000"   # log queries slower than 1s

Commit, push, Argo CD syncs. CNPG applies the change with an automatic rolling restart for params that require one.

Credentials¶

K8s Secrets:

postgres-dev-credentials in data
postgres-prod-credentials in data

Both are populated by an ExternalSecret that pulls tomoda-db-password from GCP Secret Manager (see Secrets Management). The same secret is consumed by the backend as DB_PASSWORD — they must match.

Backups¶

How it works¶

CNPG uses Barman with two mechanisms:

WAL archiving (continuous) — every WAL segment is shipped to GCS as it is written. Enables PITR to any second.
Scheduled base backups (daily) — physical backup at 03:00 UTC. Combined with WAL archives, this is the starting point for any restore.

Backup storage layout:

gs://tomoda-db-backups-development-485000/
  dev/
    base/     # base backups
    wals/     # WAL archive segments
  prod/
    base/
    wals/

GCS lifecycle auto-deletes after 30 days (configured via Terraform).

Verify backups¶

# Backup CRs
kubectl get backup -n data

# Scheduled backups
kubectl get scheduledbackup -n data

# Last successful backup
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.lastSuccessfulBackup}'

# Per-cluster backup history
kubectl get backup -n data -l cnpg.io/cluster=postgres-prod

# Continuous archiving status
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.conditions}' | jq

# GCS contents
gcloud storage ls gs://tomoda-db-backups-development-485000/prod/base/
gcloud storage ls gs://tomoda-db-backups-development-485000/prod/wals/ | head

Trigger a manual backup¶

kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: postgres-prod-manual-$(date +%Y%m%d%H%M)
  namespace: data
spec:
  method: barmanObjectStore
  cluster:
    name: postgres-prod
EOF

kubectl get backup -n data -w

Restores¶

For full-cluster PITR and latest-state restores, use the wrapper script — see Disaster Recovery.

Restore a single table (logical)¶

When you need to recover one table without rolling the whole DB back:

# 1. Spin up a recovery cluster via the DR script
./scripts/disaster-recovery.sh --env prod --mode pitr \
  --target-time "2026-04-27T09:55:00Z"

# 2. Wait for postgres-prod-recovered to be healthy

# 3. Dump just the table from the recovery cluster
kubectl exec -it postgres-prod-recovered-1 -n data -- \
  pg_dump -U tomoda_prod_user -d tomoda_prod \
  --table=users --data-only --column-inserts \
  > users-recovered.sql

# 4. Apply to the live cluster
cat users-recovered.sql | kubectl exec -i postgres-prod-1 -n data -- \
  psql -U tomoda_prod_user -d tomoda_prod

# 5. Delete the recovery cluster
kubectl delete cluster postgres-prod-recovered -n data

Logical dumps over kubectl exec

kubectl exec buffers stdout. For large tables, prefer port-forwarding and running pg_dump from your laptop, or use a temporary debug pod with gcloud storage cp to GCS.

pg_dump / pg_restore between environments¶

For seeding dev from a redacted prod snapshot:

kubectl exec -it postgres-prod-1 -n data -- \
  pg_dump -U tomoda_prod_user -d tomoda_prod | gzip > prod-dump.sql.gz

# Redact before loading into dev. Then:
gunzip -c prod-dump.sql.gz | kubectl exec -i postgres-dev-1 -n data -- \
  psql -U tomoda_dev_user -d tomoda_dev

Scaling¶

Storage¶

The default storage class supports online expansion. Edit the cluster CR:

storage:
  size: 50Gi   # was 20Gi

CNPG resizes the PVC in place. No downtime.

CPU / memory¶

Edit spec.resources in the cluster CR. CNPG performs a rolling restart.

Read replicas¶

spec:
  instances: 2   # 1 primary + 1 replica

CNPG configures streaming replication, exposes postgres-prod-ro for read-only queries, and handles automatic failover if the primary dies.

To actually route backend reads to the replica, set DB_READ_HOST to postgres-prod-ro.data.svc.cluster.local in the prod ConfigMap and roll the backend. With DB_READ_HOST empty the replica is a warm failover standby only. Full step-by-step: Scaling → Read replica.

Connection pooler¶

When pod autoscaling multiplies pgx pools toward max_connections (100), front the primary with a CNPG Pooler (PgBouncer) in transaction mode, then point the backend DB_HOST at the pooler service and size DB_MAX_CONNS per pod to fit the pooler's server pool.

Config-only cutover

Transaction-mode pooling is incompatible with pgx's default prepared-statement caching, but the backend already handles this behind a flag: set DB_POOLER=true and pgx switches to the pooler-safe CacheDescribe exec mode. No app release needed. Full playbook + the Pooler manifest: Scaling → Connection pooler.

Version upgrades¶

CNPG handles minor PostgreSQL upgrades (e.g., 17.1 -> 17.2) via image bump and rolling restart:

imageName: ghcr.io/cloudnative-pg/postgis:17.2-3.5

Commit, push. For major upgrades (17 -> 18), use CNPG's pg_upgrade workflow — that requires a planned maintenance window and the official CNPG upgrade docs. Do not attempt in-place major upgrades without testing in dev first.

Monitoring¶

CNPG exports Prometheus metrics via the built-in PodMonitor. Useful metrics in Grafana:

Metric	Watch for
`cnpg_pg_replication_lag`	>1s sustained
`cnpg_pg_stat_activity_count`	Connection saturation
`cnpg_pg_database_size_bytes`	Storage growth trend
`cnpg_collector_up`	Should be 1
`cnpg_pg_stat_bgwriter_buffers_checkpoint`	Checkpoint cadence

Import the official CNPG Grafana dashboard (#20417) if it is not already loaded.

Debugging¶

# Cluster overview
kubectl get cluster -n data

# Detailed status
kubectl describe cluster postgres-prod -n data

# Status conditions
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.conditions}' | jq

# Current primary
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.currentPrimary}'

# Postgres pod logs
kubectl logs postgres-prod-1 -n data

# CNPG operator logs
kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg

Cluster stuck in "Setting up primary"¶

kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg --tail=50
kubectl describe pod postgres-prod-1 -n data

Most common cause: PVC binding failure (storage class issue) or image pull failure.

Backup failing¶

kubectl get backup -n data -o wide
kubectl describe sa postgres-prod -n data
# Should have annotation: iam.gke.io/gcp-service-account: cnpg-backup-sa@...

# Test GCS access from the pod
kubectl exec -it postgres-prod-1 -n data -- \
  curl -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email

PVC full¶

kubectl exec -it postgres-prod-1 -n data -- df -h /var/lib/postgresql/data
# Resize: edit cluster.yaml -> spec.storage.size, commit, push