GKE Cluster
The single Kubernetes cluster behind everything Tomoda runs. Provisioned by infrastructure/gcp/gke.tf and infrastructure/gcp/gke_iam.tf.
Cluster shape
| Field | Value |
|---|---|
| Name | gke-tomoda |
| Location | asia-east1-a (single zone) |
| Network | gke-tomoda-vpc (custom) |
| Subnetwork | gke-tomoda-subnet |
| Pod range | pods secondary range (10.1.0.0/16) |
| Services range | services secondary range (10.2.0.0/20) |
| Workload Identity | Enabled (${project_id}.svc.id.goog) |
| Default node pool | Removed (remove_default_node_pool = true) |
| Deletion protection | On |
Deletion protection
deletion_protection = true on the cluster. Running terraform destroy against this module will fail until you flip the bit in gke.tf, apply, and only then destroy. That is intentional — losing the cluster means losing every running workload, every PVC, and the ability to restore CNPG without a manual disaster-recovery flow.
The cluster name gke-tomoda is load-bearing in many places (CI substitutions, kubeconfig contexts, runbooks). Do not rename it without a coordinated rollout — a Terraform rename is destructive and would force-replace the cluster.
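As a rough illustration, the cluster shape above might map to a google_container_cluster resource along these lines. This is a sketch assembled from the table, not a copy of gke.tf: the resource label gke_tomoda, the variable project_id, and the initial_node_count value are assumptions.

```hcl
# Assumed variable; the real module may source the project ID differently.
variable "project_id" {
  type = string
}

resource "google_container_cluster" "gke_tomoda" {
  name     = "gke-tomoda"
  location = "asia-east1-a" # single zone, so this is a zonal cluster

  network    = "gke-tomoda-vpc"
  subnetwork = "gke-tomoda-subnet"

  # The default pool is created and then removed; the real pools are
  # separate google_container_node_pool resources.
  remove_default_node_pool = true
  initial_node_count       = 1 # assumed value

  # Names of the secondary ranges on the subnetwork.
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Must be set to false and applied before terraform destroy can succeed.
  deletion_protection = true
}
```

Because name is an argument that forces replacement, changing "gke-tomoda" in a block like this is what would trigger the destructive rename described above.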
Node pools
Two pools exist. Both are spot. Only one is actually running anything today.
gke-tomoda-node-pool (idle)
| Field | Value |
|---|---|
| Machine type | e2-medium (2 vCPU, 4 GB) |
| Disk | 30 GB |
| Spot | Yes |
| initial_node_count | 0 |
| Autoscaling | min 0, max 2 |
Slot for cheaper, smaller nodes if we ever shed RAM pressure off the high-mem pool. Stays at 0 nodes unless a workload requests it; the autoscaler can grow it to 2 if needed and shrink back to 0 when idle.
gke-tomoda-high-mem-pool (active)
| Field | Value |
|---|---|
| Machine type | e2-standard-2 (2 vCPU, 8 GB) |
| Disk | 50 GB |
| Spot | Yes |
| initial_node_count | 1 |
| Autoscaling | min 1, max 4 |
Every workload (Argo CD, Traefik, oauth2-proxy, CNPG, Photon, the app pods themselves) runs on this pool. A floor of 1 keeps the current workload viable; a ceiling of 4 leaves headroom both for automatically replacing spot-preempted nodes and for any future HPA-driven scale-out.
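For orientation, the active pool's table could correspond to a google_container_node_pool resource like the following. The values come from the table; the resource label high_mem and the literal cluster name are assumptions, and gke.tf may reference the cluster resource instead.

```hcl
resource "google_container_node_pool" "high_mem" {
  name     = "gke-tomoda-high-mem-pool"
  cluster  = "gke-tomoda" # assumed literal; could be a resource reference
  location = "asia-east1-a"

  initial_node_count = 1

  # Bounds the cluster autoscaler works within for this pool.
  autoscaling {
    min_node_count = 1
    max_node_count = 4
  }

  node_config {
    machine_type = "e2-standard-2" # 2 vCPU, 8 GB
    disk_size_gb = 50
    spot         = true # spot VMs can be preempted at any time
  }
}
```

Under the same assumptions, the idle pool would differ only in its name, machine_type (e2-medium), disk_size_gb (30), initial_node_count (0), and autoscaling bounds (0 to 2).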
Scaling the cluster
Both pools have a GKE cluster autoscaler configured (autoscaling {} block in gke.tf). The autoscaler grows the pool when there are unschedulable pods and shrinks it when nodes are underutilized — most importantly, it replaces spot nodes that get preempted.
When you need a higher ceiling (sustained HPA pressure, larger workloads), bump max_node_count in gke.tf and terraform apply. There's no need to manually edit node counts — that's the autoscaler's job.
For pod-level scaling, add an HPA resource against the workload Deployment. Pod-level HPA + cluster autoscaler compose: HPA scales replicas up, autoscaler provisions nodes if existing capacity is exhausted.
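To keep to one language, here is a hedged sketch of such an HPA written against the Terraform kubernetes provider. It assumes that provider is configured for this cluster, which the page does not state; the Deployment name example-app, the namespace, the replica bounds, and the 70% CPU target are all hypothetical. If the HPA is delivered as a manifest through Argo CD instead, the same fields apply in YAML form.

```hcl
resource "kubernetes_horizontal_pod_autoscaler_v2" "example_app" {
  metadata {
    name      = "example-app" # hypothetical
    namespace = "default"     # hypothetical
  }

  spec {
    min_replicas = 1
    max_replicas = 4

    # The workload Deployment whose replica count the HPA adjusts.
    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "example-app"
    }

    # Scale on average CPU utilization across the pods.
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70 # placeholder threshold
        }
      }
    }
  }
}
```

A utilization target only works if the target pods declare CPU requests, since utilization is measured relative to the request. When the extra replicas no longer fit on existing nodes, they become unschedulable pods, which is the signal the cluster autoscaler reacts to.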
For one-off bursts (e.g. running the Photon indexer locally with scripts/photon-index-local.sh), prefer spinning up a separate Compute Engine VM rather than scaling the cluster — it keeps Argo CD's drift detection happy.
Workload Identity
Workload Identity is enabled with the pool ${project_id}.svc.id.goog. Every Workload Identity binding in this project follows the pattern:
serviceAccount:${project_id}.svc.id.goog[<namespace>/<ksa-name>]
The actual bindings used by workloads in the cluster are listed in IAM Overview. The relevant ones for this page:
- The GKE default compute SA (287267207777-compute@developer.gserviceaccount.com) gets roles/artifactregistry.reader so nodes can pull images from both repos without per-pool config (gke_iam.tf).
- Per-workload GCP SAs (CNPG backup, photon-indexer, image-updater, cloud-build) are documented on their own pages.
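The binding pattern and the node image-pull grant could be expressed in Terraform roughly as follows. This is a sketch under stated assumptions: the service account example-workload, the namespace and KSA names, and the variable project_id are hypothetical, and the source does not say whether gke_iam.tf grants the reader role at project or repository level (project level is assumed here).

```hcl
variable "project_id" {
  type = string
}

# Hypothetical per-workload GCP service account.
resource "google_service_account" "example_workload" {
  account_id   = "example-workload"
  display_name = "Example workload (illustrative)"
}

# Lets the Kubernetes service account example-namespace/example-ksa
# impersonate the GCP service account through Workload Identity.
resource "google_service_account_iam_member" "example_workload_wi" {
  service_account_id = google_service_account.example_workload.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[example-namespace/example-ksa]"
}

# Image pulls for nodes: the default compute SA reads from Artifact Registry.
resource "google_project_iam_member" "nodes_artifact_reader" {
  project = var.project_id
  role    = "roles/artifactregistry.reader"
  member  = "serviceAccount:287267207777-compute@developer.gserviceaccount.com"
}
```

On the Kubernetes side, the KSA also needs the iam.gke.io/gcp-service-account annotation pointing at the GCP service account's email for the binding to take effect.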
What is not in here
- No private cluster: the control plane endpoint is public. We rely on GCP's auth, not network ACLs, for control-plane access.
- No release channel: release_channel is not set in gke.tf, so the cluster follows the default channel. If we want predictable Kubernetes version upgrades, declaring RAPID/REGULAR/STABLE explicitly is the next step.
- No maintenance window: upgrades happen whenever GCP decides. Watch the GKE console for advance notice.
- No node taints / no GPU pools: every pod that schedules onto a high-mem node competes with every other pod on that node for the same 2 vCPU / 8 GB.
- No pod-level autoscaling yet: both node pools autoscale at the node level (autoscaling block in gke.tf), but HPA-driven scale-out is still future work.
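If the release channel and maintenance window gaps are closed, the change would probably be two nested blocks inside the existing cluster resource. The sketch below is illustrative only: the resource label, the choice of REGULAR, and the start time are assumptions, not decisions recorded anywhere in this page.

```hcl
resource "google_container_cluster" "gke_tomoda" {
  # ... existing arguments from gke.tf stay as they are ...

  # Pin the upgrade cadence instead of relying on the default.
  release_channel {
    channel = "REGULAR" # one of RAPID, REGULAR, STABLE
  }

  # Restrict automatic maintenance to a daily window.
  maintenance_policy {
    daily_maintenance_window {
      start_time = "18:00" # HH:MM in GMT; placeholder value
    }
  }
}
```

Both blocks are applied to the cluster in place, so a plan for this change is worth reading carefully to confirm it shows an update rather than a replacement.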
Recovery
If the cluster is somehow lost (deletion-protection bypassed, project deleted, etc.):
- Restore the cluster with terraform apply from clean state.
- Re-bootstrap Argo CD from the repo's argocd/manifests.
- Restore Postgres from the most recent CNPG base backup in the backup bucket. The Postgres operations runbook has the step-by-step.
There is no automated DR — recovery is a manual replay of Terraform plus Argo CD plus the Postgres restore.