Scaling¶

The platform runs with horizontal autoscaling on the backend pools and manual scaling on data services. Here is how to grow each layer.

The app-side companion, Long-term Scaling & Growth in the tomoda repo, records the seams the code already exposes (DB_READ_HOST, DB_MAX_CONNS, the ws-hub / api-hub process modes, chat_messages partitioning) and the trigger that says "build the next step now". This page is the ops half: the concrete cluster playbook for each of those steps. Read the app-side doc for the "why" and the seam; read here for "which manifest, which command, zero-downtime or not".

Just-in-time playbooks¶

Each step below is unbuilt on purpose. It plugs into a seam that already ships, so it becomes a config or deployment change, not a rewrite. Build it when the trigger fires, not before.

| Step | Trigger | Zero-downtime | Playbook |
| --- | --- | --- | --- |
| Read replica | `PostgresCPUHigh` (primary CPU high under steady read load) | Yes | below |
| Connection pooler | `PostgresConnectionsNearMax` (connections near `max_connections`) | Yes, rolling | below |
| Dedicated WS tier | api HPA pinned at max while CPU looks idle; per-socket memory rising | Yes, rolling | below |
| chat_messages archival | hot-storage cost of retained history grows | Yes | below |

The alerts named in these playbooks live in k8s/envs/platform/manifests/alerting-rules.yaml (group tomoda.scaling). They are set to fire before the bottleneck is actually reached, giving lead time to build the step.

Read replica¶

Trigger. PostgresCPUHigh fires: primary read CPU sits high under steady load while write volume is fine.

Playbook.

Raise instances in k8s/envs/prod/postgres/manifests/cluster.yaml:

spec:
  instances: 2   # was 1 — 1 primary + 1 streaming replica

Commit, push. CNPG creates the replica, sets up streaming replication, and exposes postgres-prod-ro.data.svc.cluster.local (read-only, replicas only).
Point the backend read pool at it. Edit the prod ConfigMap in k8s/apps/tomoda/overlays/prod/external-secret.yaml:
```
DB_READ_HOST: "postgres-prod-ro.data.svc.cluster.local"
```
Commit, push. Argo CD syncs. If the manifest change did not already trigger a rollout, restart the backend with kubectl rollout restart deployment/tomoda-api deployment/tomoda-async -n tomoda-prod. Read-only stores pick up the replica DSN on the next pod start.

Zero-downtime? Yes. DB_READ_HOST empty means reads use the primary, so the seam is a no-op until set. Setting it is a rolling restart, no schema or write path change. Leaving instances: 2 with DB_READ_HOST empty gives a warm failover standby without routing any reads to it.
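
The fallback rule for the seam can be illustrated in a few lines of shell. This is only a sketch of the behaviour described above: the real logic lives in the Go backend, resolve_read_host is a hypothetical helper name, and the primary host name in the usage lines is an assumption (CNPG's conventional -rw service), not a value taken from the repo.

```shell
# Sketch: pick the host the read pool should use.
# An empty DB_READ_HOST falls back to the primary host.
# resolve_read_host is a hypothetical helper, not part of the backend.
resolve_read_host() {  # usage: resolve_read_host <primary_host> <read_host>
  primary="$1"
  read_host="$2"
  printf '%s\n' "${read_host:-$primary}"
}

# Before the step: DB_READ_HOST is empty, so reads stay on the primary.
# (The -rw host name is an assumed value for illustration.)
resolve_read_host "postgres-prod-rw.data.svc.cluster.local" ""

# After the step: reads go to the replica-only service.
resolve_read_host "postgres-prod-rw.data.svc.cluster.local" \
  "postgres-prod-ro.data.svc.cluster.local"
```

The point of the sketch is that clearing DB_READ_HOST is also the rollback: the read pool returns to the primary on the next pod start.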

See Postgres Operations → Read replicas for the CNPG-side detail.

Connection pooler¶

Trigger. PostgresConnectionsNearMax fires: total backend connections approach the primary's max_connections (100 in cluster.yaml) as pod autoscaling multiplies pgx pools (pods x DB_MAX_CONNS).

Playbook. CNPG ships a PgBouncer Pooler CRD. Add it in transaction mode in front of the primary, then point DB_HOST at the pooler service.

# k8s/envs/prod/postgres/manifests/pooler.yaml (new file; add to the dir Argo
# syncs). CNPG exposes the pooler as a Service named after the Pooler.
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: postgres-prod-pooler
  namespace: data
spec:
  cluster:
    name: postgres-prod
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: "500"
      default_pool_size: "40"   # server-side conns per PgBouncer instance and
                                # per user/database pair; keep instances x this
                                # (2 x 40 = 80 here) under max_connections (100)

Then point the backend at the pooler in k8s/apps/tomoda/overlays/prod/external-secret.yaml:

DB_HOST: "postgres-prod-pooler.data.svc.cluster.local"
DB_MAX_CONNS: "10"     # per-pod cap; pods x this must fit the pooler, not the DB
DB_POOLER: "true"      # switches pgx to the pooler-safe CacheDescribe exec mode

Config-only cutover

Transaction-mode pooling is incompatible with pgx's default per-connection prepared-statement cache. The backend already handles this behind a flag: DB_POOLER=true switches pgx to the CacheDescribe exec mode (backend/internal/database/pgx.go), which uses no persistent server-side prepared statements. It is off by default so direct connections keep the faster cache. So the cutover is config-only: add the Pooler, then set DB_HOST + DB_POOLER=true together. Verify in dev first.

Zero-downtime? Yes, rolling. The pooler is a network hop in front of the DSN; moving DB_HOST and setting DB_POOLER is a rolling restart.
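
The sizing rules in this playbook come down to two multiplications, sketched below so they can be re-run when pod ceilings or pool sizes change. The helper names are hypothetical. The pod counts are the prod HPA ceilings listed on this page (6 api + 6 async), and the check treats default_pool_size as applying once per PgBouncer instance, which holds when the backend uses a single user and database.

```shell
# Sketch: connection budget for the pooler cutover (integer arithmetic only).
# All function names are hypothetical helpers for this page.

# Client side: every backend pod opens up to DB_MAX_CONNS connections.
client_conns() {  # usage: client_conns <pods> <db_max_conns>
  echo $(( $1 * $2 ))
}

# Server side: each PgBouncer instance opens up to default_pool_size
# connections to Postgres (per user/database pair).
server_conns() {  # usage: server_conns <pooler_instances> <default_pool_size>
  echo $(( $1 * $2 ))
}

# Prints "ok" when demand fits under the limit, "over" otherwise.
fits() {  # usage: fits <demand> <limit>
  if [ "$1" -le "$2" ]; then echo ok; else echo over; fi
}

# 12 pods x DB_MAX_CONNS=10 = 120 client conns vs max_client_conn=500.
fits "$(client_conns 12 10)" 500
# 2 pooler instances x default_pool_size=40 = 80 vs max_connections=100.
fits "$(server_conns 2 40)" 100
```

Both checks print ok with the values above. Raising the pooler to 3 instances without lowering default_pool_size would put the server side at 120 and over the limit.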

Dedicated WS tier¶

Trigger. api HPA spends meaningful time pinned at maxReplicas (HPAPinnedAtMax) while HTTP request-rate looks idle, or api-pod memory rises from per-socket send buffers (PodMemoryNearLimit). Both mean WS load the CPU-based HPA can't see.

Playbook. The backend binary already ships ws-hub and api-hub modes; this is a deployment split, not a code change. Add a ws-hub Deployment + HPA, route /ws to it at the ingress, and flip the api pods to api-hub. The full manifest sketch, migration sequence, and custom-metric HPA notes live in Tomoda — WS Pool Split so they stay in one place. In short:

Land tomoda-ws Deployment (SERVER_MODE=ws-hub) + Service at 0 replicas; confirm it reconciles.
Bring it to 2 replicas, add its HPA (memory-based until prometheus-adapter is installed for a WS-connection metric).
Add the ingress /ws path pointing at ws-service. New WS connections land on the WS pool.
Flip the api pods from multi-hub to api-hub (edit SERVER_MODE on backend-api-deployment.yaml). Sockets on api pods tear down on the roll; clients reconnect to the WS pool.

Zero-downtime? Yes, rolling. Redis pub/sub keeps cross-pod fan-out correct across the mixed fleet during the roll.

chat_messages partition archival¶

chat_messages is range-partitioned by month (PARTITION BY RANGE). The backend auto-creates the next partitions at boot (ensureChatMessagePartitions in the tomoda repo). There is deliberately no age-based prune: chat history is retained. The retention policy lives in the app, not in ops:

Friend and DM messages persist until the user deletes their account (internal/services/chat/message_purger.go).
Event-chat messages are purged per that event's retention setting.
Account deletion purges a user's chats and messages for compliance (internal/services/user/user_service.go).

So partitioning here is for query pruning and per-partition vacuum, and the only ops scale step is archival, not deletion.

Trigger. Total chat_messages storage on the primary grows past what is cost-effective to keep on hot storage (watch cnpg_pg_database_size_bytes).

Playbook (archive, do not drop). Detach the oldest cold partitions and move them off hot storage, keeping the data queryable-on-demand rather than deleting it:

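ALTER TABLE chat_messages DETACH PARTITION chat_messages_YYYY_MM CONCURRENTLY. Live writes land in the current-month partition and are not touched. A plain DETACH takes a brief ACCESS EXCLUSIVE lock on the parent table; CONCURRENTLY (PostgreSQL 14+) avoids that, but it cannot run inside a transaction block and requires that the table has no default partition.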
Move the detached table to a cheaper tablespace, or pg_dump it to object storage (S3) and drop the local copy once archived.

Zero-downtime? Yes. Detach and archival never touch the current-month partition where writes land.
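
The detach step is repetitive across months, so a small generator helps. The sketch below only prints SQL and runs nothing. partition_name and detach_sql are hypothetical helper names, the dates are placeholders, and the CONCURRENTLY keyword assumes PostgreSQL 14 or later (drop it on older versions).

```shell
# Sketch: print DETACH statements for a run of monthly partitions.
# Pass year and month as plain integers (1, not 01).
partition_name() {  # usage: partition_name <year> <month>
  printf 'chat_messages_%04d_%02d\n' "$1" "$2"
}

detach_sql() {  # usage: detach_sql <year> <month>
  printf 'ALTER TABLE chat_messages DETACH PARTITION %s CONCURRENTLY;\n' \
    "$(partition_name "$1" "$2")"
}

# Example: the first three months of 2023 (placeholder dates).
for month in 1 2 3; do
  detach_sql 2023 "$month"
done
```

Review the printed statements before running them one at a time: each DETACH ... CONCURRENTLY has to run outside a transaction block.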

No prune job by design

Messages are retained, so there is no scheduled DROP. Compliance deletion is handled per-user by account deletion in the app. Only add archival (detach + cold storage) when hot-storage cost, not row age, makes it worthwhile.

Backend (api + async pools)¶

The backend is split into two Deployments (see Apps → Tomoda):

| Deployment | Mode | Prod bounds | Dev bounds |
| --- | --- | --- | --- |
| `tomoda-api` | `multi-hub` (HTTP + WS Hub) | min 2, max 6 | min 1, max 3 |
| `tomoda-async` | `async` (worker + scheduler) | min 2, max 6 | min 1, max 3 |

Both pools auto-scale via HPAs on CPU (70%) + memory (80%). Files:

k8s/apps/tomoda/overlays/prod/hpa-api.yaml
k8s/apps/tomoda/overlays/prod/hpa-async.yaml
k8s/apps/tomoda/overlays/dev/hpa-api.yaml
k8s/apps/tomoda/overlays/dev/hpa-async.yaml

The tomoda-async HPA scales down more slowly (600s window, vs 300s for api) so in-flight worker tasks don't get cut short by aggressive eviction.
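
To reason about what the HPA will do at a given utilisation, the core Kubernetes rule is desired = ceil(current replicas x current utilisation / target utilisation), clamped to the min/max bounds. The sketch below applies that rule with the prod bounds from the table above. It deliberately ignores the HPA tolerance band and the stabilisation windows, and desired_replicas is a hypothetical helper name.

```shell
# Sketch of the HPA core rule, clamped to [min, max].
# Ignores the tolerance band and stabilisation windows.
desired_replicas() {  # usage: desired_replicas <current> <util%> <target%> <min> <max>
  want=$(( ($1 * $2 + $3 - 1) / $3 ))   # integer ceiling division
  if [ "$want" -lt "$4" ]; then want=$4; fi
  if [ "$want" -gt "$5" ]; then want=$5; fi
  echo "$want"
}

desired_replicas 2 90 70 2 6    # 2 pods at 90% CPU, target 70%: prints 3
desired_replicas 6 140 70 2 6   # would want 12, capped at max: prints 6
desired_replicas 4 20 70 2 6    # would want 2, which is the floor: prints 2
```

With two metrics configured (CPU and memory), the HPA computes a desired count for each and uses the larger one, so run the calculation for both signals.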

Day-to-day HPA inspection¶

# Snapshot of current state — TARGETS shows current vs target utilisation.
kubectl get hpa -n prod

# Detail: per-pod resource usage, recent scaling events, last decision time.
kubectl describe hpa tomoda-api-hpa -n prod
kubectl describe hpa tomoda-async-hpa -n prod

# Raw metric values the HPA sees (handy when behavior surprises you).
kubectl top pods -n prod -l app=tomoda-api
kubectl top pods -n prod -l app=tomoda-async

If TARGETS shows <unknown>/70%, metrics-server isn't returning data — check it's running (kubectl -n kube-system get pods | grep metrics-server) and that the pods have resource requests set (HPA percent-of-request math needs them).

Changing bounds¶

Edit the matching HPA manifest, commit, push, Argo CD syncs. For example, to raise the prod api ceiling:

# k8s/apps/tomoda/overlays/prod/hpa-api.yaml
spec:
  minReplicas: 2
  maxReplicas: 10   # was 6

Don't kubectl scale or kubectl edit hpa directly in prod — Argo CD will revert it on the next sync. Always go through Git.

Changing the metric or thresholds¶

To lower the CPU target (more aggressive scale-up):

spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # was 70

To swap to a custom metric (queue depth for async, WS connections for api — both planned, not yet implemented):

# Async on Asynq queue depth — requires prometheus-adapter
spec:
  metrics:
    - type: External
      external:
        metric:
          name: asynq_pending_tasks
        target:
          type: Value
          value: "100"   # scale up while >100 tasks queued

Prometheus Adapter is not yet installed in the cluster. Until then, CPU + memory are the only available signals. Wire prometheus-adapter before adding any custom-metric HPA targets — see the kube-prometheus-stack docs (the operator is already in the cluster; adapter is a separate chart).

When to skip HPA¶

There are still legitimate cases for pinning the replica count or temporarily taking the HPA out of the loop:

Pinning for a planned event (load test, marketing push) — set both minReplicas and maxReplicas to the desired number to effectively pin without disabling the HPA.
Diagnosing a scaling regression — temporarily delete the HPA in the cluster (kubectl delete hpa tomoda-api-hpa -n prod) to isolate whether the HPA itself is misbehaving. Re-apply via Git when done — Argo will heal it.
An incident where the HPA is making things worse (e.g., scaling up into a downstream that's already saturated). Same pattern — delete in cluster, fix root cause, re-apply via Git.

Debugging stuck scaling¶

Symptom: pods don't grow under obvious load, or don't shrink when load relaxes.

# Is the HPA seeing metrics?
kubectl get hpa tomoda-api-hpa -n prod -o yaml | yq .status

# Recent scaling events
kubectl describe hpa tomoda-api-hpa -n prod | grep -A20 Events:

# Are the pods actually under load?
kubectl top pods -n prod -l app=tomoda-api

Common causes:

<unknown>/... in TARGETS — metrics-server isn't returning data. See Debugging → Metrics.
No resources.requests on the container — HPA percent math needs requests defined. All tomoda deployments set them; check after any resource block edit.
Stuck at maxReplicas — load is real, you've hit the ceiling. Raise maxReplicas and watch for the next bottleneck (often Postgres or Redis).
Flapping (rapid scale up/down) — adjust the behavior block to widen stabilizationWindowSeconds. The current configs use 30s up / 300s down for api and 30s up / 600s down for async, which is intentionally conservative on the down side.

PDB interaction¶

Both envs run two PDBs (tomoda-api-pdb, tomoda-async-pdb):

| Env | `minAvailable` | File |
| --- | --- | --- |
| prod | 50% | `k8s/apps/tomoda/overlays/prod/pdb.yaml` |
| dev | 1 | `k8s/apps/tomoda/overlays/dev/pdb.yaml` |

In prod, min replicas of 2 plus HPA-driven growth means at least one pod of each pool stays available during voluntary disruption (node drain, GKE upgrade). The cluster autoscaler respects PDBs when consolidating nodes.

Dev usually sits at its HPA floor of 1 replica, so minAvailable: 1 leaves zero allowed disruptions. The eviction API refuses to evict the only pod, and nothing starts a replacement first, so a node drain or GKE upgrade stalls on it rather than moving the pod. To unblock a dev drain, raise the dev HPA minReplicas to 2 for the window, or delete the dev PDB in-cluster and let Argo heal it back afterwards.

If a PDB is blocking a drain, you'll see it in the kubectl drain output ("cannot evict ... violates pod disruption budget"). In prod, wait for replicas to come up (HPA reacts), then re-drain, or temporarily patch the PDB if it's a planned operation. Don't delete the prod PDB; Argo will heal it back.
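
Whether a drain can proceed comes down to how many evictions the PDB allows at the current replica count. The sketch below works that out for both forms of minAvailable used here, assuming every replica is healthy. Percentages round up to a whole pod. The helper names are hypothetical.

```shell
# Sketch: voluntary evictions a PDB allows, assuming all pods are healthy.

# minAvailable given as a percentage (rounds up to a whole pod).
allowed_pct() {  # usage: allowed_pct <pods> <min_available_percent>
  need=$(( ($1 * $2 + 99) / 100 ))   # ceil(pods * percent / 100)
  echo $(( $1 - need ))
}

# minAvailable given as an absolute pod count.
allowed_abs() {  # usage: allowed_abs <pods> <min_available>
  left=$(( $1 - $2 ))
  if [ "$left" -lt 0 ]; then left=0; fi
  echo "$left"
}

allowed_pct 2 50   # prod at its floor: prints 1
allowed_pct 3 50   # 3 pods need 2 available: prints 1
allowed_abs 1 1    # dev at 1 replica: prints 0, so the eviction is refused
```

A result of 0 means kubectl drain keeps retrying that pod until the replica count rises or the PDB changes.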

Split WS into its own pool¶

tomoda-api can split into separate tomoda-api (HTTP) and tomoda-ws (WebSocket) pools so WS scales on a connection-count metric instead of CPU. The backend binary already supports the modes (api-hub, ws-hub), so this is a deployment change. See the Dedicated WS tier playbook above for the trigger and sequence, and Tomoda — WS Pool Split for the full manifest sketch and custom-metric HPA.

Postgres (CNPG)¶

Both dev and prod CNPG clusters are single-instance today. Three scaling axes:

Vertical (CPU / RAM)¶

Edit k8s/envs/{dev,prod}/postgres/manifests/cluster.yaml:

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2048Mi

CNPG performs a rolling restart to apply. With one instance, expect ~30s downtime.

Storage¶

PVCs are 10Gi (dev) and 20Gi (prod). The pd-balanced storage class supports online expansion:

storage:
  size: 50Gi   # was 20Gi

Commit. CNPG resizes the PVC in place — no downtime.

Horizontal (read replicas)¶

For read-heavy workloads, add a replica. This is a just-in-time step with a config seam already in the backend, so see Read replica above for the full playbook (bump instances, set DB_READ_HOST, rolling restart). The connection-pooler step, which relieves the primary's max_connections cap, is at Connection pooler.

See Postgres Operations for backup, restore, and tuning details.

Redis¶

Redis runs as a Bitnami Helm chart. Tune via values:

# k8s/envs/{dev,prod}/redis/values.yaml
master:
  resources:
    limits:
      memory: 1Gi
replica:
  replicaCount: 0   # set to 1+ to enable replicas

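The current setup is single-master, no replicas. Redis holds sessions and cache, and also carries the pub/sub fan-out between backend pods (see Dedicated WS tier). Losing it forces re-login and interrupts cross-pod fan-out until it returns, but does not lose persistent data.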

Node pools¶

The GKE cluster runs on a spot node pool (defined in infrastructure/gcp/). Two ways to grow capacity:

Static — change `node_count` in Terraform¶

# infrastructure/gcp/gke.tf (or wherever the node pool is defined)
node_count = 5   # was 3

terraform apply. GKE adds nodes.

Dynamic — cluster autoscaler¶

If the node pool was created with autoscaling enabled, GKE will add nodes when pending pods exceed capacity. Check:

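# Per-pool autoscaling is a node pool setting; the cluster-level
# "autoscaling" field only describes node auto-provisioning.
gcloud container node-pools list \
  --cluster gke-tomoda --zone us-central1-a --project development-485000 \
  --format="table(name,autoscaling.enabled,autoscaling.minNodeCount,autoscaling.maxNodeCount)"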

If autoscaling is off, edit the node pool block in Terraform to enable it. Verify before relying on it — node-level autoscaling and HPA are two different mechanisms and both need to be configured for true elastic scale.

Spot nodes can be preempted

GCP can preempt spot VMs at any time, with only a short (roughly 30-second) termination notice. CNPG and the backend both tolerate node loss (Postgres reattaches its PVC; backend reschedules), but a single-replica deployment will see brief downtime. Run at least 2 backend replicas in prod to ride out preemptions.