Scaling¶
The platform runs with horizontal autoscaling on the backend pools and manual scaling on data services. Here is how to grow each layer.
Backend (api + async pools)¶
The backend is split into two Deployments (see Apps → Tomoda):
| Deployment | Mode | Prod bounds | Dev bounds |
|---|---|---|---|
| tomoda-api | multi-hub (HTTP + WS Hub) | min 2, max 6 | min 1, max 3 |
| tomoda-async | async (worker + scheduler) | min 2, max 6 | min 1, max 3 |
Both pools auto-scale via HPAs on CPU (70%) + memory (80%). Files:
- k8s/apps/tomoda/overlays/prod/hpa-api.yaml
- k8s/apps/tomoda/overlays/prod/hpa-async.yaml
- k8s/apps/tomoda/overlays/dev/hpa-api.yaml
- k8s/apps/tomoda/overlays/dev/hpa-async.yaml
The tomoda-async HPA scales down more slowly (600s window, vs 300s for
api) so in-flight worker tasks don't get cut short by aggressive eviction.
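To make the 70% / 80% targets concrete, here is a minimal bash sketch of the documented HPA core formula, desired = ceil(current × observed / target), with the largest per-metric proposal winning. The function names are hypothetical helpers, not part of the repo, and the sketch deliberately ignores the controller's tolerance band (10% by default), the min/max bounds, and the stabilization windows.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not from the tomoda repo.

# ceil_div A B -> ceil(A / B) with integer arithmetic (A >= 0, B > 0).
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# hpa_desired_replicas CURRENT_REPLICAS OBSERVED_PCT TARGET_PCT
hpa_desired_replicas() {
  ceil_div $(( $1 * $2 )) "$3"
}

# hpa_desired_multi CURRENT_REPLICAS CPU_PCT MEM_PCT
# Uses the targets from this page (CPU 70%, memory 80%) and takes the
# larger of the two per-metric proposals.
hpa_desired_multi() {
  local by_cpu by_mem
  by_cpu=$(hpa_desired_replicas "$1" "$2" 70)
  by_mem=$(hpa_desired_replicas "$1" "$3" 80)
  if (( by_cpu > by_mem )); then echo "$by_cpu"; else echo "$by_mem"; fi
}

# 2 pods averaging 105% CPU and 60% memory of their requests.
hpa_desired_multi 2 105 60   # prints 3
```

In this example CPU alone asks for 3 replicas and memory for 2, so the pool would grow to 3, still inside the prod bounds of 2 to 6.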
Day-to-day HPA inspection¶
# Snapshot of current state — TARGETS shows current vs target utilisation.
kubectl get hpa -n prod
# Detail: per-pod resource usage, recent scaling events, last decision time.
kubectl describe hpa tomoda-api-hpa -n prod
kubectl describe hpa tomoda-async-hpa -n prod
# Raw metric values the HPA sees (handy when behavior surprises you).
kubectl top pods -n prod -l app=tomoda-api
kubectl top pods -n prod -l app=tomoda-async
If TARGETS shows <unknown>/70%, metrics-server isn't returning data —
check it's running (kubectl -n kube-system get pods | grep metrics-server)
and that the pods have resource requests set (HPA percent-of-request
math needs them).
Changing bounds¶
Edit the matching HPA manifest, commit, and push; Argo CD syncs the change. For example, to raise the prod api ceiling:
# k8s/apps/tomoda/overlays/prod/hpa-api.yaml
spec:
  minReplicas: 2
  maxReplicas: 10 # was 6
Don't kubectl scale or kubectl edit hpa directly in prod — Argo CD will
revert it on the next sync. Always go through Git.
Changing the metric or thresholds¶
To lower the CPU target (more aggressive scale-up):
spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # was 70
To swap to a custom metric (queue depth for async, WS connections for api — both planned, not yet implemented):
# Async on Asynq queue depth — requires prometheus-adapter
spec:
  metrics:
    - type: External
      external:
        metric:
          name: asynq_pending_tasks
        target:
          type: Value
          value: "100" # scale up while >100 tasks queued
Prometheus Adapter is not yet installed in the cluster. Until then, CPU + memory are the only available signals. Wire prometheus-adapter before adding any custom-metric HPA targets — see the kube-prometheus-stack docs (the operator is already in the cluster; adapter is a separate chart).
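As a rough guide to how a queue-depth target would behave once the adapter exists, the sketch below applies the documented HPA arithmetic for external metrics. With target type Value the ratio metric / target multiplies the current replica count, whereas AverageValue divides the metric across pods. The helper names are hypothetical, and tolerance, bounds, and stabilization are ignored.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# hpa_external_value CURRENT_REPLICAS METRIC TARGET
# target.type: Value -> ceil(current * metric / target)
hpa_external_value() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# hpa_external_average METRIC TARGET_PER_POD
# target.type: AverageValue -> ceil(metric / target_per_pod)
hpa_external_average() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# 250 pending tasks, 2 async replicas, target 100.
hpa_external_value 2 250 100    # prints 5
hpa_external_average 250 100    # prints 3
```

With Value, the pool keeps growing for as long as the queue stays above 100 (up to maxReplicas), which matches the "scale up while >100 tasks queued" intent. AverageValue instead sizes the pool so each pod carries about 100 queued tasks. Which one fits the async pool better is a design decision to make when the metric is wired in.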
When to skip HPA¶
There are still legitimate cases for pinning the replica count or temporarily removing the HPA:
- Pinning for a planned event (load test, marketing push): set both minReplicas and maxReplicas to the desired number to effectively pin without disabling the HPA.
- Diagnosing a scaling regression: temporarily delete the HPA in the cluster (kubectl delete hpa tomoda-api-hpa -n prod) to isolate whether the HPA itself is misbehaving. Re-apply via Git when done; Argo will heal it.
- An incident where the HPA is making things worse (e.g., scaling up into a downstream that's already saturated). Same pattern: delete in cluster, fix root cause, re-apply via Git.
Debugging stuck scaling¶
Symptom: pods don't grow under obvious load, or don't shrink when load relaxes.
# Is the HPA seeing metrics?
kubectl get hpa tomoda-api-hpa -n prod -o yaml | yq .status
# Recent scaling events
kubectl describe hpa tomoda-api-hpa -n prod | grep -A20 Events:
# Are the pods actually under load?
kubectl top pods -n prod -l app=tomoda-api
Common causes:
- <unknown>/... in TARGETS: metrics-server isn't returning data. See Debugging → Metrics.
- No resources.requests on the container: HPA percent math needs requests defined. All tomoda deployments set them; check after any resource block edit.
- Stuck at maxReplicas: load is real, you've hit the ceiling. Raise maxReplicas and watch for the next bottleneck (often Postgres or Redis).
- Flapping (rapid scale up/down): adjust the behavior block to widen stabilizationWindowSeconds. The current configs use 30s up / 300s down for api and 30s up / 600s down for async, which is intentionally conservative on the down side.
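The effect of the scale-down window can be sketched as follows: on scale-down the HPA applies the highest recommendation it produced inside the window, so a longer window holds replicas up for longer after a spike. The function and the sample data are hypothetical. Samples are written as seconds:recommended_replicas, and the scale-up window (which uses the lowest recommendation instead) is not modelled.

```shell
#!/usr/bin/env bash
# Sketch only: function name and samples are illustrative.

# stabilized_replicas WINDOW_SECONDS "t:replicas" ...
# Prints the highest recommendation made less than WINDOW_SECONDS before
# the newest sample (the last argument).
stabilized_replicas() {
  local window=$1; shift
  local last="${@: -1}"
  local now=${last%%:*}
  local best=0 sample t r
  for sample in "$@"; do
    t=${sample%%:*}   # timestamp in seconds
    r=${sample##*:}   # replicas recommended at that time
    if (( now - t < window && r > best )); then best=$r; fi
  done
  echo "$best"
}

# Load spike at t=30s, then recommendations fall off.
samples=(30:6 150:5 330:3 480:2 600:2)
stabilized_replicas 300 "${samples[@]}"   # api window   -> prints 3
stabilized_replicas 600 "${samples[@]}"   # async window -> prints 6
```

At t=600s the api pool would already have dropped to 3 replicas while the async pool still holds 6, which is the conservative behavior described above.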
PDB interaction¶
Prod has two PDBs (tomoda-api-pdb, tomoda-async-pdb), both
minAvailable: 50%. With min replicas of 2 and HPA-driven growth, this
means at least one pod of each pool stays available during voluntary
disruption (node drain, GKE upgrade). The cluster autoscaler respects PDBs
when consolidating nodes.
If a PDB is blocking a drain, you'll see it in the
kubectl drain output ("cannot evict ... violates pod disruption budget").
Wait for replicas to come up (HPA reacts), then re-drain — or temporarily
patch the PDB if it's a planned operation. Don't delete the PDB; Argo will
heal it back.
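The arithmetic behind minAvailable: 50% can be sketched like this. Kubernetes rounds a percentage minAvailable up to a whole pod, and the number of pods that may be evicted at once is the healthy count minus that floor. The helper names are hypothetical, and the sketch assumes every replica is healthy.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# pdb_min_available REPLICAS PERCENT -> ceil(REPLICAS * PERCENT / 100)
pdb_min_available() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

# pdb_allowed_disruptions HEALTHY_REPLICAS PERCENT
pdb_allowed_disruptions() {
  echo $(( $1 - $(pdb_min_available "$1" "$2") ))
}

for replicas in 2 3 6; do
  echo "replicas=$replicas minAvailable=$(pdb_min_available "$replicas" 50) evictable=$(pdb_allowed_disruptions "$replicas" 50)"
done
# replicas=2 minAvailable=1 evictable=1
# replicas=3 minAvailable=2 evictable=1
# replicas=6 minAvailable=3 evictable=3
```

At the prod minimum of 2 replicas only one pod per pool can be evicted at a time, so a drain that touches both pods of a pool has to wait for a replacement to become ready.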
Planned: split WS into its own pool¶
tomoda-api will eventually split into separate tomoda-api (HTTP) and
tomoda-ws (WebSocket) pools so WS scales on a connection-count metric
instead of CPU. The backend binary already supports the modes (api-hub,
ws-hub). The k8s wiring is documented but not deployed — see
Tomoda — WS Pool Split (Planned).
Postgres (CNPG)¶
Both dev and prod CNPG clusters are single-instance today. Three scaling axes:
Vertical (CPU / RAM)¶
Edit k8s/envs/{dev,prod}/middleware/postgres/manifests/cluster.yaml:
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2048Mi
CNPG performs a rolling restart to apply. With one instance, expect ~30s downtime.
Storage¶
PVCs are 10Gi (dev) and 20Gi (prod). The pd-balanced storage class supports online expansion:
storage:
  size: 50Gi # was 20Gi
Commit. CNPG resizes the PVC in place — no downtime.
Horizontal (read replicas)¶
For read-heavy workloads, add a replica:
spec:
  instances: 2 # 1 primary + 1 streaming replica
CNPG creates the replica, sets up streaming replication, and exposes a postgres-prod-ro service for read-only queries. Update the backend to route read traffic there if you want to actually use it — otherwise the replica only serves as warm standby for failover.
See Postgres Operations for backup, restore, and tuning details.
Redis¶
Redis runs as a Bitnami Helm chart. Tune via values:
# k8s/envs/{dev,prod}/middleware/redis/values.yaml
master:
  resources:
    limits:
      memory: 1Gi
replica:
  replicaCount: 0 # set to 1+ to enable replicas
The current setup is single-master, no replicas. Redis is used for sessions and cache only — losing it forces re-login but does not lose persistent data.
Node pools¶
The GKE cluster runs on a spot node pool (defined in infrastructure/gcp/). Two ways to grow capacity:
Static — change node_count in Terraform¶
# infrastructure/gcp/gke.tf (or wherever the node pool is defined)
node_count = 5 # was 3
Run terraform apply; GKE adds the nodes.
Dynamic — cluster autoscaler¶
If the node pool was created with autoscaling enabled, GKE will add nodes when pending pods exceed capacity. Check:
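# Node pool autoscaling is reported per pool under nodePools[].autoscaling;
# the cluster-level autoscaling field describes node auto-provisioning instead.
gcloud container clusters describe gke-tomoda \
  --zone asia-east1-a --project development-485000 \
  --format="yaml(nodePools[].name,nodePools[].autoscaling)"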
If autoscaling is off, edit the node pool block in Terraform to enable it. Verify before relying on it — node-level autoscaling and HPA are two different mechanisms and both need to be configured for true elastic scale.
Spot nodes can be preempted
GCP can preempt spot VMs at any time, with only a short best-effort shutdown notice (about 30 seconds). CNPG and the backend both tolerate node loss (Postgres reattaches its PVC; backend reschedules), but a single-replica deployment will see brief downtime. Run at least 2 backend replicas in prod to ride out preemptions.