Skip to content

Postgres Operations

Tomoda uses CloudNativePG (CNPG) — a Kubernetes operator that manages PostgreSQL clusters, including provisioning, WAL archiving, base backups, PITR, monitoring, and failover.

Cluster topology

CNPG Operator (namespace: cnpg-system)
  |
  +-- postgres-dev  (namespace: data, image: postgis:17-3.5)
  |     +-- postgres-dev-1     (primary, 10Gi PVC)
  |     +-- ScheduledBackup --> gs://tomoda-db-backups-.../dev/
  |
  +-- postgres-prod (namespace: data, image: postgis:17-3.5)
        +-- postgres-prod-1    (primary, 20Gi PVC)
        +-- ScheduledBackup --> gs://tomoda-db-backups-.../prod/

Dev and prod are separate clusters, same GKE

The postgres-dev and postgres-prod CNPG Cluster resources are independent — different PVCs, different credentials, different backup paths. Both live in the same GKE cluster and the same data namespace, but they share nothing at the storage layer. Destroying one does not affect the other.

Property postgres-dev postgres-prod
Image ghcr.io/cloudnative-pg/postgis:17-3.5 Same
Instances 1 1 (raise for HA)
Storage 10Gi PVC 20Gi PVC
Database tomoda_dev tomoda_prod
Owner tomoda_dev_user tomoda_prod_user
Extensions PostGIS, PostGIS Topology Same
shared_buffers 128MB 256MB
Daily base backup 03:00 UTC -> GCS Same
WAL archiving Continuous -> GCS Same
Retention 30d 30d
Monitoring PodMonitor (Prometheus) Same

Service DNS

CNPG creates three services per cluster automatically:

Suffix Purpose
-rw Read-write (primary only)
-ro Read-only (replicas only — empty until you scale to 2+)
-r Any instance

The backend uses a stable ExternalName alias for backward compatibility:

Alias Resolves to Used by
postgres-postgresql.data.svc.cluster.local postgres-dev-rw.data.svc.cluster.local Backend (dev)
prod-postgres-postgresql.data.svc.cluster.local postgres-prod-rw.data.svc.cluster.local Backend (prod)

Configuration source files

File Purpose
k8s/envs/dev/middleware/postgres/manifests/cluster.yaml Dev Cluster CR + ExternalSecret + ExternalName
k8s/envs/prod/middleware/postgres/manifests/cluster.yaml Prod Cluster CR + ExternalSecret + ExternalName
k8s/envs/{dev,prod}/middleware/postgres/manifests/backup.yaml ScheduledBackup CRs
k8s/envs/dev/sys/cnpg/application.yaml Argo CD app installing the CNPG operator via Helm
infrastructure/gcp/backup.tf GCS bucket + cnpg-backup-sa GCP SA + Workload Identity binding

Tuning PostgreSQL parameters

Edit spec.postgresql.parameters in the cluster CR. Example from prod:

postgresql:
  parameters:
    max_connections: "100"
    shared_buffers: "256MB"
    effective_cache_size: "512MB"
    work_mem: "4MB"
    log_min_duration_statement: "1000"   # log queries slower than 1s

Commit, push, Argo CD syncs. CNPG applies the change with an automatic rolling restart for params that require one.

Credentials

K8s Secrets:

  • postgres-dev-credentials in data
  • postgres-prod-credentials in data

Both are populated by an ExternalSecret that pulls tomoda-db-password from GCP Secret Manager (see Secrets Management). The same secret is consumed by the backend as DB_PASSWORD — they must match.

Backups

How it works

CNPG uses Barman with two mechanisms:

  1. WAL archiving (continuous) — every WAL segment is shipped to GCS as it is written. Enables PITR to any second.
  2. Scheduled base backups (daily) — physical backup at 03:00 UTC. Combined with WAL archives, this is the starting point for any restore.

Backup storage layout:

gs://tomoda-db-backups-development-485000/
  dev/
    base/     # base backups
    wals/     # WAL archive segments
  prod/
    base/
    wals/

GCS lifecycle auto-deletes after 30 days (configured via Terraform).

Verify backups

# Backup CRs
kubectl get backup -n data

# Scheduled backups
kubectl get scheduledbackup -n data

# Last successful backup
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.lastSuccessfulBackup}'

# Per-cluster backup history
kubectl get backup -n data -l cnpg.io/cluster=postgres-prod

# Continuous archiving status
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.conditions}' | jq

# GCS contents
gcloud storage ls gs://tomoda-db-backups-development-485000/prod/base/
gcloud storage ls gs://tomoda-db-backups-development-485000/prod/wals/ | head

Trigger a manual backup

kubectl apply -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: postgres-prod-manual-$(date +%Y%m%d%H%M)
  namespace: data
spec:
  method: barmanObjectStore
  cluster:
    name: postgres-prod
EOF

kubectl get backup -n data -w

Restores

For full-cluster PITR and latest-state restores, use the wrapper script — see Disaster Recovery.

Restore a single table (logical)

When you need to recover one table without rolling the whole DB back:

# 1. Spin up a recovery cluster via the DR script
./scripts/disaster-recovery.sh --env prod --mode pitr \
  --target-time "2026-04-27T09:55:00Z"

# 2. Wait for postgres-prod-recovered to be healthy

# 3. Dump just the table from the recovery cluster
kubectl exec -it postgres-prod-recovered-1 -n data -- \
  pg_dump -U tomoda_prod_user -d tomoda_prod \
  --table=users --data-only --column-inserts \
  > users-recovered.sql

# 4. Apply to the live cluster
cat users-recovered.sql | kubectl exec -i postgres-prod-1 -n data -- \
  psql -U tomoda_prod_user -d tomoda_prod

# 5. Delete the recovery cluster
kubectl delete cluster postgres-prod-recovered -n data

Logical dumps over kubectl exec

kubectl exec buffers stdout. For large tables, prefer port-forwarding and running pg_dump from your laptop, or use a temporary debug pod with gcloud storage cp to GCS.

pg_dump / pg_restore between environments

For seeding dev from a redacted prod snapshot:

kubectl exec -it postgres-prod-1 -n data -- \
  pg_dump -U tomoda_prod_user -d tomoda_prod | gzip > prod-dump.sql.gz

# Redact before loading into dev. Then:
gunzip -c prod-dump.sql.gz | kubectl exec -i postgres-dev-1 -n data -- \
  psql -U tomoda_dev_user -d tomoda_dev

Scaling

Storage

The default storage class supports online expansion. Edit the cluster CR:

storage:
  size: 50Gi   # was 20Gi

CNPG resizes the PVC in place. No downtime.

CPU / memory

Edit spec.resources in the cluster CR. CNPG performs a rolling restart.

Read replicas

spec:
  instances: 2   # 1 primary + 1 replica

CNPG configures streaming replication, exposes postgres-prod-ro for read-only queries, and handles automatic failover if the primary dies.

Version upgrades

CNPG handles minor PostgreSQL upgrades (e.g., 17.1 -> 17.2) via image bump and rolling restart:

imageName: ghcr.io/cloudnative-pg/postgis:17.2-3.5

Commit, push. For major upgrades (17 -> 18), use CNPG's pg_upgrade workflow — that requires a planned maintenance window and the official CNPG upgrade docs. Do not attempt in-place major upgrades without testing in dev first.

Monitoring

CNPG exports Prometheus metrics via the built-in PodMonitor. Useful metrics in Grafana:

Metric Watch for
cnpg_pg_replication_lag >1s sustained
cnpg_pg_stat_activity_count Connection saturation
cnpg_pg_database_size_bytes Storage growth trend
cnpg_collector_up Should be 1
cnpg_pg_stat_bgwriter_buffers_checkpoint Checkpoint cadence

Import the official CNPG Grafana dashboard (#20417) if it is not already loaded.

Debugging

# Cluster overview
kubectl get cluster -n data

# Detailed status
kubectl describe cluster postgres-prod -n data

# Status conditions
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.conditions}' | jq

# Current primary
kubectl get cluster postgres-prod -n data \
  -o jsonpath='{.status.currentPrimary}'

# Postgres pod logs
kubectl logs postgres-prod-1 -n data

# CNPG operator logs
kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg

Cluster stuck in "Setting up primary"

kubectl logs -n cnpg-system -l app.kubernetes.io/name=cloudnative-pg --tail=50
kubectl describe pod postgres-prod-1 -n data

Most common cause: PVC binding failure (storage class issue) or image pull failure.

Backup failing

kubectl get backup -n data -o wide
kubectl describe sa postgres-prod -n data
# Should have annotation: iam.gke.io/gcp-service-account: cnpg-backup-sa@...

# Test GCS access from the pod
kubectl exec -it postgres-prod-1 -n data -- \
  curl -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email

PVC full

kubectl exec -it postgres-prod-1 -n data -- df -h /var/lib/postgresql/data
# Resize: edit cluster.yaml -> spec.storage.size, commit, push