GKE Cluster

The single Kubernetes cluster behind everything Tomoda runs. Provisioned by infrastructure/gcp/gke.tf and infrastructure/gcp/gke_iam.tf.

Cluster shape

| Field | Value |
| --- | --- |
| Name | gke-tomoda |
| Location | asia-east1-a (single zone) |
| Network | gke-tomoda-vpc (custom) |
| Subnetwork | gke-tomoda-subnet |
| Pod range | pods secondary range (10.1.0.0/16) |
| Services range | services secondary range (10.2.0.0/20) |
| Workload Identity | Enabled (${project_id}.svc.id.goog) |
| Default node pool | Removed (remove_default_node_pool = true) |
| Deletion protection | On |
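
For orientation, the table above maps onto a google_container_cluster resource roughly as follows. This is a minimal sketch, not a copy of gke.tf: the resource label "primary" and the variable name project_id are assumptions, and the real file may set further arguments.

```hcl
# Hypothetical variable; the real module may source the project ID differently.
variable "project_id" {
  type = string
}

# Sketch of the cluster shape. The label "primary" is an assumption.
resource "google_container_cluster" "primary" {
  name     = "gke-tomoda"
  location = "asia-east1-a" # a zone, so the cluster is zonal

  network    = "gke-tomoda-vpc"
  subnetwork = "gke-tomoda-subnet"

  # VPC-native networking using the subnet's named secondary ranges.
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"     # 10.1.0.0/16
    services_secondary_range_name = "services" # 10.2.0.0/20
  }

  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # The default pool is created once and removed immediately;
  # the real pools are separate google_container_node_pool resources.
  remove_default_node_pool = true
  initial_node_count       = 1

  deletion_protection = true
}
```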

Deletion protection

deletion_protection = true on the cluster. Running terraform destroy against this module will fail until you flip the bit in gke.tf, apply, and only then destroy. That is intentional — losing the cluster means losing every running workload, every PVC, and the ability to restore CNPG without a manual disaster-recovery flow.

The cluster name gke-tomoda is load-bearing in many places (CI substitutions, kubeconfig contexts, runbooks). Do not rename it without a coordinated rollout — a Terraform rename is destructive and would force-replace the cluster.
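
A deliberate teardown therefore takes two steps. The fragment below is a sketch of the first one, assuming the cluster resource is labelled "primary" (the actual label in gke.tf may differ).

```hcl
# Sketch only: "primary" is an assumed resource label.
resource "google_container_cluster" "primary" {
  name     = "gke-tomoda"
  location = "asia-east1-a"

  # Step 1: change true to false here and run `terraform apply`.
  # Step 2: only after that apply will `terraform destroy` be able
  #         to delete the cluster.
  deletion_protection = false

  # ... all other arguments stay exactly as they are in gke.tf ...
}
```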

Node pools

Two pools exist. Both are spot. Only one is actually running anything today.

gke-tomoda-node-pool (idle)

| Field | Value |
| --- | --- |
| Machine type | e2-medium (2 vCPU, 4 GB) |
| Disk | 30 GB |
| Spot | Yes |
| initial_node_count | 0 |
| Autoscaling | min 0, max 2 |

Slot for cheaper, smaller nodes if we ever shed RAM pressure off the high-mem pool. Stays at 0 nodes unless a workload requests it; the autoscaler can grow it to 2 if needed and shrink back to 0 when idle.

gke-tomoda-high-mem-pool (active)

| Field | Value |
| --- | --- |
| Machine type | e2-standard-2 (2 vCPU, 8 GB) |
| Disk | 50 GB |
| Spot | Yes |
| initial_node_count | 1 |
| Autoscaling | min 1, max 4 |

Every workload — Argo CD, Traefik, oauth2-proxy, CNPG, Photon, the app pods themselves — runs on this pool. Floor of 1 keeps current workload viable; ceiling of 4 gives headroom both for replacing spot-preempted nodes automatically and for any future HPA-driven scale-out.

Scaling the cluster

Both pools have the GKE cluster autoscaler enabled (autoscaling {} block in gke.tf). The autoscaler grows a pool when pods are unschedulable and shrinks it when nodes are underutilized. Most importantly, it provisions replacements for spot nodes that get preempted.

When you need a higher ceiling (sustained HPA pressure, larger workloads), bump max_node_count in gke.tf and terraform apply. There's no need to manually edit node counts — that's the autoscaler's job.
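
As an illustration, raising the ceiling on the high-mem pool could look like the fragment below. It is a sketch under assumptions: the resource labels "high_mem" and "primary" and the new value 6 are hypothetical, and the real pool definition in gke.tf may carry more settings.

```hcl
# Sketch only: resource labels are assumed, not taken from gke.tf.
resource "google_container_node_pool" "high_mem" {
  name     = "gke-tomoda-high-mem-pool"
  cluster  = google_container_cluster.primary.name
  location = "asia-east1-a"

  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 6 # previously 4; 6 is a hypothetical new ceiling
  }

  node_config {
    machine_type = "e2-standard-2"
    disk_size_gb = 50
    spot         = true
  }
}
```

Only max_node_count changes; the autoscaler keeps choosing the actual node count between the two bounds.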

For pod-level scaling, add an HPA resource against the workload Deployment. Pod-level HPA + cluster autoscaler compose: HPA scales replicas up, autoscaler provisions nodes if existing capacity is exhausted.

For one-off bursts (e.g. running the Photon indexer locally with scripts/photon-index-local.sh), prefer spinning up a separate Compute Engine VM rather than scaling the cluster — it keeps Argo CD's drift detection happy.

Workload Identity

Workload Identity is enabled with the pool ${project_id}.svc.id.goog. Every Workload Identity binding in this project follows the pattern:

serviceAccount:${project_id}.svc.id.goog[<namespace>/<ksa-name>]

The actual bindings used by workloads in the cluster are listed in IAM Overview. The relevant ones for this page:

  • The GKE default compute SA (287267207777-compute@developer.gserviceaccount.com) gets roles/artifactregistry.reader so nodes can pull images from both repos without per-pool config (gke_iam.tf).
  • Per-workload GCP SAs (CNPG backup, photon-indexer, image-updater, cloud-build) are documented on their own pages.
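
A binding that follows the member pattern above could be declared as in this sketch. Every name in it (example-workload, example-namespace, example-ksa, the variable project_id) is a placeholder for illustration and does not refer to an existing binding in this project.

```hcl
# Hypothetical variable; the real module may source the project ID differently.
variable "project_id" {
  type = string
}

# Placeholder GCP service account that a workload would impersonate.
resource "google_service_account" "example" {
  project    = var.project_id
  account_id = "example-workload"
}

# Lets the Kubernetes service account example-namespace/example-ksa
# act as the GCP service account above via Workload Identity.
resource "google_service_account_iam_member" "example_workload_identity" {
  service_account_id = google_service_account.example.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[example-namespace/example-ksa]"
}
```

The Kubernetes side also needs the iam.gke.io/gcp-service-account annotation on the service account, set to the GCP service account's email.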

What is not in here

  • No private cluster — the control plane endpoint is public. We rely on GCP's auth, not network ACLs, for control-plane access.
  • No release channel: release_channel is not set in gke.tf, so the cluster follows the default channel. If we want predictable Kubernetes version upgrades, declaring RAPID / REGULAR / STABLE explicitly is the next step.
  • No maintenance window — upgrades happen whenever GCP decides. Watch the GKE console for advance notice.
  • No node taints / no GPU pools — every pod that schedules onto the high-mem pool competes with every other pod for the same 2 vCPU / 8 GB.
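  • No autoscaling beyond the per-pool bounds: each node pool has its own autoscaling block in gke.tf (see Scaling the cluster), so node counts never exceed the configured max_node_count without a Terraform change.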
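
If the release channel and maintenance window gaps are closed later, the additions to the cluster resource might look like the sketch below. The channel choice, the window start time, and the resource label "primary" are all assumptions for illustration, not decisions recorded anywhere in gke.tf.

```hcl
# Sketch only: values below are illustrative, not current configuration.
resource "google_container_cluster" "primary" {
  # ... existing arguments unchanged ...

  # Pin the upgrade cadence explicitly instead of relying on the default.
  release_channel {
    channel = "REGULAR" # one of RAPID, REGULAR, STABLE
  }

  # Constrain automatic upgrades to a daily window.
  maintenance_policy {
    daily_maintenance_window {
      start_time = "18:00" # HH:MM in GMT; hypothetical value
    }
  }
}
```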

Recovery

If the cluster is somehow lost (deletion-protection bypassed, project deleted, etc.):

  1. Restore the cluster with terraform apply from clean state.
  2. Re-bootstrap Argo CD from the repo's argocd/ manifests.
  3. Restore Postgres from the most recent CNPG base backup in the backup bucket. The Postgres operations runbook has the step-by-step.

There is no automated DR — recovery is a manual replay of Terraform plus Argo CD plus the Postgres restore.