Scaling

The platform runs with horizontal autoscaling on the backend pools and manual scaling on data services. Here is how to grow each layer.

Backend (api + async pools)

The backend is split into two Deployments (see Apps → Tomoda):

Deployment     Mode                          Prod bounds     Dev bounds
tomoda-api     multi-hub (HTTP + WS Hub)     min 2, max 6    min 1, max 3
tomoda-async   async (worker + scheduler)    min 2, max 6    min 1, max 3

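Both pools auto-scale via HPAs that target 70% average CPU utilization and 80% average memory utilization, each measured against the containers' resource requests. Files: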

  • k8s/apps/tomoda/overlays/prod/hpa-api.yaml
  • k8s/apps/tomoda/overlays/prod/hpa-async.yaml
  • k8s/apps/tomoda/overlays/dev/hpa-api.yaml
  • k8s/apps/tomoda/overlays/dev/hpa-async.yaml

The tomoda-async HPA scales down more slowly (600s window, vs 300s for api) so in-flight worker tasks don't get cut short by aggressive eviction.
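For intuition about how these targets become a replica count: for each metric the Kubernetes HPA computes roughly ceil(currentReplicas * currentUtilization / targetUtilization), takes the largest result across metrics, and clamps it to the min/max bounds. The sketch below reproduces that arithmetic for a single metric in plain shell. hpa_desired is a hypothetical helper written for this page, not part of any tooling in the repo, and it ignores the controller's tolerance band and stabilization windows.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
# Hypothetical helper: approximates the HPA replica calculation for one metric.
hpa_desired() {
  # ceil(current * util / target) using integer arithmetic
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))
  # clamp to the HPA bounds
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
}

hpa_desired 2 105 70 2 6   # 2 pods at 105% of requested CPU -> 3
hpa_desired 6 140 70 2 6   # formula asks for 12, capped at maxReplicas -> 6
hpa_desired 4 20 70 2 6    # load has dropped: ceil(80/70) -> 2
```

Utilization is a percentage of the container's request, so values above 100% are possible whenever the limit is higher than the request.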

Day-to-day HPA inspection

# Snapshot of current state — TARGETS shows current vs target utilisation.
kubectl get hpa -n prod

# Detail: per-pod resource usage, recent scaling events, last decision time.
kubectl describe hpa tomoda-api-hpa -n prod
kubectl describe hpa tomoda-async-hpa -n prod

# Raw metric values the HPA sees (handy when behavior surprises you).
kubectl top pods -n prod -l app=tomoda-api
kubectl top pods -n prod -l app=tomoda-async

If TARGETS shows <unknown>/70%, metrics-server isn't returning data — check it's running (kubectl -n kube-system get pods | grep metrics-server) and that the pods have resource requests set (HPA percent-of-request math needs them).
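To spot this condition across several HPAs at once, the table output can be filtered for rows that contain <unknown>. A minimal sketch, assuming the default `kubectl get hpa` table layout with the HPA name in the first column; flag_unknown_targets is a hypothetical helper, not something that ships with the repo.

```shell
# flag_unknown_targets: reads `kubectl get hpa` table output on stdin and
# prints the name (first column) of every HPA whose row contains <unknown>.
# Hypothetical helper for illustration only.
flag_unknown_targets() {
  # NR > 1 skips the header row; index() does a literal substring match.
  awk 'NR > 1 && index($0, "<unknown>") > 0 { print $1 }'
}

# Intended use against the cluster:
#   kubectl get hpa -n prod | flag_unknown_targets
```

An empty result means every HPA in the namespace is receiving metrics.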

Changing bounds

Edit the matching HPA manifest, commit, and push; Argo CD then syncs the change. For example, to raise the prod api ceiling:

# k8s/apps/tomoda/overlays/prod/hpa-api.yaml
spec:
  minReplicas: 2
  maxReplicas: 10   # was 6

Don't run kubectl scale or kubectl edit hpa directly in prod: Argo CD will revert the change on the next sync. Always go through Git.

Changing the metric or thresholds

To lower the CPU target (more aggressive scale-up):

spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # was 70

To swap to a custom metric (queue depth for async, WS connections for api — both planned, not yet implemented):

# Async on Asynq queue depth — requires prometheus-adapter
spec:
  metrics:
    - type: External
      external:
        metric:
          name: asynq_pending_tasks
        target:
          type: Value
          value: "100"   # scale up while >100 tasks queued

Prometheus Adapter is not yet installed in the cluster. Until then, CPU + memory are the only available signals. Wire prometheus-adapter before adding any custom-metric HPA targets — see the kube-prometheus-stack docs (the operator is already in the cluster; adapter is a separate chart).

When to skip HPA

There are still legitimate cases for pinning the replica count or temporarily taking the HPA out of the loop:

  • Pinning for a planned event (load test, marketing push) — set both minReplicas and maxReplicas to the desired number to effectively pin without disabling the HPA.
  • Diagnosing a scaling regression — temporarily delete the HPA in the cluster (kubectl delete hpa tomoda-api-hpa -n prod) to isolate whether the HPA itself is misbehaving. Re-apply via Git when done — Argo will heal it.
  • An incident where the HPA is making things worse (e.g., scaling up into a downstream that's already saturated). Same pattern — delete in cluster, fix root cause, re-apply via Git.

Debugging stuck scaling

Symptom: pods don't grow under obvious load, or don't shrink when load relaxes.

# Is the HPA seeing metrics?
kubectl get hpa tomoda-api-hpa -n prod -o yaml | yq .status

# Recent scaling events
kubectl describe hpa tomoda-api-hpa -n prod | grep -A20 Events:

# Are the pods actually under load?
kubectl top pods -n prod -l app=tomoda-api

Common causes:

  1. <unknown>/... in TARGETS — metrics-server isn't returning data. See Debugging → Metrics.
  2. No resources.requests on the container — HPA percent math needs requests defined. All tomoda deployments set them; check after any resource block edit.
  3. Stuck at maxReplicas — load is real, you've hit the ceiling. Raise maxReplicas and watch for the next bottleneck (often Postgres or Redis).
  4. Flapping (rapid scale up/down) — adjust the behavior block to widen stabilizationWindowSeconds. The current configs use 30s up / 300s down for api and 30s up / 600s down for async, which is intentionally conservative on the down side.
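The down-side window matters because, when scaling down, the HPA does not act on its latest recommendation alone: it uses the highest replica count it recommended at any point inside the stabilization window. The sketch below illustrates that rule with a hypothetical helper (stabilized_scale_down is not part of any tooling here); its arguments stand for the recommendations computed during the window.

```shell
# stabilized_scale_down REC...
# Hypothetical helper: prints the largest recommendation seen in the window,
# which is the replica count the HPA would settle on when scaling down.
stabilized_scale_down() {
  max=$1
  for rec in "$@"; do
    if [ "$rec" -gt "$max" ]; then max=$rec; fi
  done
  echo "$max"
}

# Load spiked early in the window and then relaxed: replicas stay at 5
# until the recommendation of 5 ages out of the window.
stabilized_scale_down 5 3 2 4 2
```

A wider window therefore keeps more history in play, so a brief dip in load is less likely to remove pods that are needed again moments later.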

PDB interaction

Prod has two PDBs (tomoda-api-pdb, tomoda-async-pdb), both minAvailable: 50%. Kubernetes rounds a percentage up to a whole pod, so at the minimum of 2 replicas at least one pod of each pool stays available during voluntary disruption (node drain, GKE upgrade), and as the HPA grows a pool at least half of its pods do. The cluster autoscaler respects PDBs when consolidating nodes.
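As a worked example of what minAvailable: 50% allows at different pool sizes, the sketch below computes how many pods can be evicted at once. It assumes every replica is currently healthy and relies on Kubernetes rounding a percentage minAvailable up to the next whole pod; pdb_allowed_disruptions is a hypothetical helper for illustration.

```shell
# pdb_allowed_disruptions REPLICAS MIN_AVAILABLE_PCT
# Hypothetical helper: number of pods that may be evicted simultaneously,
# assuming all replicas are healthy.
pdb_allowed_disruptions() {
  # ceil(replicas * pct / 100): pods that must stay available
  min_available=$(( ($1 * $2 + 99) / 100 ))
  echo $(( $1 - min_available ))
}

pdb_allowed_disruptions 2 50   # 1 must stay, so 1 eviction allowed
pdb_allowed_disruptions 3 50   # 2 must stay, so 1 eviction allowed
pdb_allowed_disruptions 6 50   # 3 must stay, so 3 evictions allowed
```

With a single healthy replica the result is 0, which is the situation in which a drain stalls until another pod becomes ready.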

If a PDB is blocking a drain, you'll see it in the kubectl drain output ("cannot evict ... violates pod disruption budget"). Wait for replicas to come up (HPA reacts), then re-drain — or temporarily patch the PDB if it's a planned operation. Don't delete the PDB; Argo will heal it back.

Planned: split WS into its own pool

tomoda-api will eventually split into separate tomoda-api (HTTP) and tomoda-ws (WebSocket) pools so WS scales on a connection-count metric instead of CPU. The backend binary already supports the modes (api-hub, ws-hub). The k8s wiring is documented but not deployed — see Tomoda — WS Pool Split (Planned).

Postgres (CNPG)

Both dev and prod CNPG clusters are single-instance today. Three scaling axes:

Vertical (CPU / RAM)

Edit k8s/envs/{dev,prod}/middleware/postgres/manifests/cluster.yaml:

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2048Mi

CNPG performs a rolling restart to apply. With one instance, expect ~30s downtime.

Storage

PVCs are 10Gi (dev) and 20Gi (prod). The pd-balanced storage class supports online expansion:

storage:
  size: 50Gi   # was 20Gi

Commit. CNPG resizes the PVC in place — no downtime.

Horizontal (read replicas)

For read-heavy workloads, add a replica:

spec:
  instances: 2   # 1 primary + 1 streaming replica

CNPG creates the replica, sets up streaming replication, and exposes a postgres-prod-ro service for read-only queries. Update the backend to route read traffic there if you want to actually use it — otherwise the replica only serves as warm standby for failover.

See Postgres Operations for backup, restore, and tuning details.

Redis

Redis runs as a Bitnami Helm chart. Tune via values:

# k8s/envs/{dev,prod}/middleware/redis/values.yaml
master:
  resources:
    limits:
      memory: 1Gi
replica:
  replicaCount: 0   # set to 1+ to enable replicas

The current setup is single-master, no replicas. Redis is used for sessions and cache only — losing it forces re-login but does not lose persistent data.

Node pools

The GKE cluster runs on a spot node pool (defined in infrastructure/gcp/). Two ways to grow capacity:

Static — change node_count in Terraform

# infrastructure/gcp/gke.tf (or wherever the node pool is defined)
node_count = 5   # was 3

Run terraform apply; GKE adds the nodes.

Dynamic — cluster autoscaler

If the node pool was created with autoscaling enabled, GKE will add nodes when pending pods exceed capacity. Check:

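# Autoscaling is set per node pool, so list the pools rather than the cluster-level field.
gcloud container node-pools list --cluster gke-tomoda \
  --zone asia-east1-a --project development-485000 \
  --format="table(name, autoscaling.enabled, autoscaling.minNodeCount, autoscaling.maxNodeCount)"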

If autoscaling is off, edit the node pool block in Terraform to enable it. Verify before relying on it — node-level autoscaling and HPA are two different mechanisms and both need to be configured for true elastic scale.

Spot nodes can be preempted

GCP can preempt spot VMs at any time with only a short, best-effort warning (about 30 seconds). CNPG and the backend both tolerate node loss (Postgres reattaches its PVC; the backend reschedules), but a single-replica deployment will see brief downtime. Run at least 2 backend replicas in prod to ride out preemptions; the prod HPA minimum of 2 already satisfies this.