GKE Cluster

The single Kubernetes cluster behind everything Tomoda runs. Provisioned by infrastructure/gcp/gke.tf and infrastructure/gcp/gke_iam.tf.

Cluster shape

Field	Value
Name	`gke-tomoda`
Location	`us-central1-a` (single zone)
Network	`gke-tomoda-vpc` (custom)
Subnetwork	`gke-tomoda-subnet`
Pod range	`pods` secondary range (10.1.0.0/16)
Services range	`services` secondary range (10.2.0.0/20)
Workload Identity	Enabled (`${project_id}.svc.id.goog`)
Dataplane	V2 (`datapath_provider = "ADVANCED_DATAPATH"`)
Default node pool	Removed (`remove_default_node_pool = true`)
Deletion protection	On
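
As a rough illustration, the table above maps onto a google_container_cluster resource along these lines. This is a sketch rather than a copy of gke.tf: the resource label primary and the variable project_id are assumptions, and the real file may pass the network and subnetwork as references instead of names.

```hcl
# Sketch only: the resource label and variable name are assumptions.
variable "project_id" {
  type = string
}

resource "google_container_cluster" "primary" {
  name     = "gke-tomoda"
  location = "us-central1-a" # single zone

  network    = "gke-tomoda-vpc"
  subnetwork = "gke-tomoda-subnet"

  # The default pool is deleted right after creation; the real pools
  # are separate google_container_node_pool resources.
  remove_default_node_pool = true
  initial_node_count       = 1

  # GKE Dataplane V2. Create-time only: changing it forces replacement.
  datapath_provider = "ADVANCED_DATAPATH"

  # VPC-native networking using the subnet's named secondary ranges.
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"     # 10.1.0.0/16
    services_secondary_range_name = "services" # 10.2.0.0/20
  }

  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  deletion_protection = true
}
```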

Dataplane V2 is create-time only

datapath_provider = "ADVANCED_DATAPATH" (GKE Dataplane V2) is set in gke.tf. It switches the cluster to eBPF-based networking and enables enforcement of Kubernetes NetworkPolicy manifests. It cannot be changed on a live cluster: flipping it forces a cluster replacement. See Network policies.

Deletion protection

deletion_protection = true is set on the cluster. Running terraform destroy against this module fails until you set it to false in gke.tf and apply; only then will the destroy go through. That is intentional: losing the cluster means losing every running workload, every PVC, and the ability to restore CNPG without a manual disaster-recovery flow.

The cluster name gke-tomoda is load-bearing in many places (CI substitutions, kubeconfig contexts, runbooks). Do not rename it without a coordinated rollout — a Terraform rename is destructive and would force-replace the cluster.

Node pools

Three pools are defined; two run today. Both active pools (one on-demand, one spot) run under the dedicated node service account gke-node-sa (google_service_account.gke_nodes, wired via node_config.service_account on each pool in gke.tf), not the default Compute Engine SA. The account is least-privilege, holding only roles/logging.logWriter, roles/monitoring.metricWriter, and roles/artifactregistry.reader. See IAM Overview.
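
A minimal sketch of how such a node service account and its three project-level roles could be declared. Only google_service_account.gke_nodes and the role names come from this page; the IAM resource label, the display name, and the project_id variable are assumptions, and the actual definitions in gke.tf / gke_iam.tf may be structured differently.

```hcl
# Sketch only: labels and variable names other than gke_nodes are assumptions.
variable "project_id" {
  type = string
}

resource "google_service_account" "gke_nodes" {
  project      = var.project_id
  account_id   = "gke-node-sa"
  display_name = "GKE node service account" # hypothetical display name
}

# One non-authoritative binding per role, so other members of the
# same roles are left untouched.
resource "google_project_iam_member" "gke_nodes" {
  for_each = toset([
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/artifactregistry.reader",
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}
```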

`gke-tomoda-node-pool` (on-demand system pool)

Field	Value
Machine type	`e2-medium` (2 vCPU, 4 GB)
Disk	30 GB
Spot	No (on-demand)
`initial_node_count`	1
Autoscaling	min 1, max 2
Node label	`pool=system`

One stable node for preemption-sensitive control-plane pods, pinned here via nodeSelector: {pool: system}. These are the ARC controller and listener (a preempted listener drops its GitHub session and stalls CI), Traefik (ingress, where a preemption is a brief outage for every service), and the small controllers cert-manager, external-secrets, external-dns, and oauth2-proxy. All are tiny and fit on one e2-medium. Argo CD stays on Spot because it self-heals: a preempted controller reconnects and re-reconciles with no data loss. ARC runner pods also stay on Spot, since a preempted job just retries. The guaranteed on-demand node costs ~$24.5/mo against ~$7.3/mo on Spot, so control-plane stability costs about $17/mo extra.
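
For orientation, a pool with the shape in the table above could be declared roughly as follows. The resource label system, the project_id variable, and the oauth_scopes line are assumptions; the label pool = "system" is what the nodeSelector matches against.

```hcl
# Sketch only: resource label, variable, and oauth_scopes are assumptions.
variable "project_id" {
  type = string
}

resource "google_container_node_pool" "system" {
  name     = "gke-tomoda-node-pool"
  cluster  = "gke-tomoda"
  location = "us-central1-a"

  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 2
  }

  node_config {
    machine_type = "e2-medium" # 2 vCPU, 4 GB
    disk_size_gb = 30
    spot         = false # on-demand, so the node is never preempted

    # Node label targeted by nodeSelector: {pool: system}.
    labels = {
      pool = "system"
    }

    service_account = "gke-node-sa@${var.project_id}.iam.gserviceaccount.com"
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}
```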

`gke-tomoda-high-mem-pool` (active, spot)

Field	Value
Machine type	`e2-standard-2` (2 vCPU, 8 GB)
Disk	50 GB
Spot	Yes
`initial_node_count`	1
Autoscaling	min 1, max 5

Every general workload (Argo CD, CNPG, Photon, the app pods, CI runner bursts) runs on this pool; Traefik and oauth2-proxy are pinned to the system pool instead. The floor of 1 keeps the current workload viable; the ceiling of 5 is the budget cap. Extra nodes cost ~$15/mo each while they run but $0 at idle, since the autoscaler removes them when the burst clears. The ceiling was raised from 3 to 5 to give CI runner bursts and the dev tenant headroom.

Scaling the cluster

Both pools have a GKE cluster autoscaler configured (autoscaling {} block in gke.tf). The autoscaler grows the pool when there are unschedulable pods and shrinks it when nodes are underutilized — most importantly, it replaces spot nodes that get preempted.

When you need a higher ceiling (sustained HPA pressure, larger workloads), bump max_node_count in gke.tf and terraform apply. There's no need to manually edit node counts — that's the autoscaler's job.
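
As a sketch of where that ceiling lives, the high-mem pool's autoscaling block might look like this. The resource label high_mem is an assumption, and node_config is trimmed to the fields from the pool table (the service account wiring is omitted for brevity).

```hcl
# Sketch only: the resource label is an assumption and node_config is trimmed.
resource "google_container_node_pool" "high_mem" {
  name     = "gke-tomoda-high-mem-pool"
  cluster  = "gke-tomoda"
  location = "us-central1-a"

  initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 5 # raise this value to lift the ceiling, then terraform apply
  }

  node_config {
    machine_type = "e2-standard-2" # 2 vCPU, 8 GB
    disk_size_gb = 50
    spot         = true
  }
}
```

Only max_node_count changes when lifting the ceiling; the autoscaler still decides the actual node count between the two bounds.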

For pod-level scaling, add an HPA resource against the workload Deployment. Pod-level HPA + cluster autoscaler compose: HPA scales replicas up, autoscaler provisions nodes if existing capacity is exhausted.

For one-off bursts (e.g. running the Photon indexer locally with scripts/photon-index-local.sh), prefer spinning up a separate Compute Engine VM rather than scaling the cluster — it keeps Argo CD's drift detection happy.

Workload Identity

Workload Identity is enabled with the pool ${project_id}.svc.id.goog. Every Workload Identity binding in this project follows the pattern:

serviceAccount:${project_id}.svc.id.goog[<namespace>/<ksa-name>]

The actual bindings used by workloads in the cluster are listed in IAM Overview. The relevant ones for this page:

The node service account gke-node-sa gets roles/artifactregistry.reader so nodes can pull images from both repos without per-pool config (gke.tf).
Per-workload GCP SAs (CNPG backup, photon-indexer, image-updater, cloud-build) are documented on their own pages.
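
To make the member pattern concrete, here is a minimal sketch of one Workload Identity binding. The GCP service account, namespace, and Kubernetes service account names are all hypothetical placeholders, not bindings that exist in this project; the real ones are in IAM Overview.

```hcl
# Sketch only: every "example-*" name below is a hypothetical placeholder.
variable "project_id" {
  type = string
}

resource "google_service_account" "example_workload" {
  project    = var.project_id
  account_id = "example-workload"
}

# Lets the Kubernetes service account example-namespace/example-ksa
# impersonate the GCP service account above.
resource "google_service_account_iam_member" "example_workload_wi" {
  service_account_id = google_service_account.example_workload.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[example-namespace/example-ksa]"
}
```

The Kubernetes side of the link is the iam.gke.io/gcp-service-account annotation on the KSA, set to the GCP service account's email.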

What is not in here

No private cluster — the control plane endpoint is public. We rely on GCP's auth, not network ACLs, for control-plane access.
No release channel — release_channel is not set in gke.tf, so the cluster follows the default channel. If we want predictable Kubernetes version upgrades, declaring RAPID / REGULAR / STABLE explicitly is the next step.
No maintenance window — upgrades happen whenever GCP decides. Watch the GKE console for advance notice.
No GPU pools. No taints on the two active pools: every pod on the high-mem pool competes for the same 2 vCPU / 8 GB. The gated stateful_nodes pool (off until prod launch; switched on with enable_stateful_pool = true) is the one exception: it carries a workload=stateful:NoSchedule taint so only the prod CNPG cluster (which tolerates it) lands there.
No fixed node counts: both active node pools carry an autoscaling block in gke.tf (see Scaling the cluster), so node counts are never set by hand.
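
If the release channel and maintenance window gaps listed above were closed, the additions to the cluster resource would look roughly like this. It is a fragment, not a complete resource: the label primary, the REGULAR channel, and the 03:00 start time are illustrative assumptions, not decisions anyone has made.

```hcl
# Fragment only: all other cluster arguments are omitted.
# Channel and start time are illustrative assumptions.
resource "google_container_cluster" "primary" {
  name     = "gke-tomoda"
  location = "us-central1-a"

  # Pin the upgrade cadence explicitly instead of relying on the default.
  release_channel {
    channel = "REGULAR" # or "RAPID" / "STABLE"
  }

  # Limit automatic maintenance to a daily window.
  maintenance_policy {
    daily_maintenance_window {
      start_time = "03:00" # HH:MM, interpreted as GMT
    }
  }
}
```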

Recovery

If the cluster is somehow lost (deletion-protection bypassed, project deleted, etc.):

Restore the cluster with terraform apply from clean state.
Re-bootstrap Argo CD from the repo's argocd/ manifests.
Restore Postgres from the most recent CNPG base backup in the backup bucket. The Postgres operations runbook has the step-by-step.

There is no automated DR — recovery is a manual replay of Terraform plus Argo CD plus the Postgres restore.