Environments

Tomoda runs on a single production cluster that also hosts a dev tenant: one physical GKE cluster, one GCP project, and three logical tiers. The tiers are platform (cluster-wide shared services), dev (a deliberately small tenant for testing changes before they reach prod), and prod (the canonical environment, sized for real traffic).

One project, one cluster, three tiers

| Layer | Value |
| --- | --- |
| GCP project | `development-485000` |
| Region / zone | `us-central1` / `us-central1-a` |
| GKE cluster | `gke-tomoda` (no env suffix; the single cluster serves both tenants) |
| VPC | `gke-tomoda-vpc` (subnet `10.0.0.0/24`, pods `10.1.0.0/16`, services `10.2.0.0/20`) |
| Dev app namespace | `tomoda-dev` |
| Prod app namespace | `tomoda-prod` |
| Shared data namespace | `data` (holds `postgres-dev`/`postgres-prod` and `redis`/`redis-prod`) |
| Shared platform namespace | `platform` (Photon geocoder + indexer; ARC runners to be added later) |
| Shared system namespaces | `argocd`, `cert-manager`, `external-secrets`, `traefik-system`, `monitoring`, etc. |

The cluster shape is defined in infrastructure/gcp/gke.tf and the VPC in infrastructure/gcp/vpc.tf. Workload Identity is enabled, so pods can impersonate GCP service accounts without static credentials — this is how CNPG reaches GCS for backups, for example.
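As a rough illustration of the Workload Identity link described above, the sketch below annotates a Kubernetes ServiceAccount with the GCP service account it is allowed to impersonate. The `iam.gke.io/gcp-service-account` annotation is the standard GKE mechanism; the ServiceAccount name and the `cnpg-backup` GCP service account are hypothetical and not taken from the repo.

```yaml
# Sketch only: the KSA name and the GSA email are assumptions for illustration.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-prod   # hypothetical; CNPG names the ServiceAccount after its Cluster
  namespace: data
  annotations:
    # Standard GKE Workload Identity annotation: pods using this KSA act as the named GSA.
    iam.gke.io/gcp-service-account: cnpg-backup@development-485000.iam.gserviceaccount.com
```

For the impersonation to work, the GCP service account also needs a `roles/iam.workloadIdentityUser` binding for the member `serviceAccount:development-485000.svc.id.goog[data/postgres-prod]`. With CNPG specifically, the annotation is usually set through the Cluster's `serviceAccountTemplate` rather than on a hand-written ServiceAccount, since the operator creates the ServiceAccount itself.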

Bootstrap: the app-of-apps pattern

Argo CD is installed by Terraform (infrastructure/gcp/argocd.tf). Once it is up, a single bootstrap manifest is applied manually and everything else flows from git:

kubectl apply -f k8s/envs/bootstrap.yaml

k8s/envs/bootstrap.yaml creates three Argo Applications:

platform: recurses into k8s/envs/platform/ and brings up cert-manager, traefik, external-dns, external-secrets, oauth2-proxy, monitoring (Prometheus + Grafana + Alertmanager), loki, tempo, blackbox-exporter, and Photon (geocoder + indexer). Required; everything else depends on it.
dev: recurses into k8s/envs/dev/ and brings up postgres-dev (CNPG, 1 Gi PVC, no backups), redis (256 Mi memory limit), pgAdmin, Redis Insight, and the tomoda dev overlay (1 replica each; see Tier sizing below). Optional: delete the dev Argo Application to free dev's footprint.
prod: recurses into k8s/envs/prod/ and brings up postgres-prod (10 Gi PVC, daily backups with 30-day retention), redis-prod (10 Gi PVC), pgAdmin, Redis Insight, and the tomoda prod overlay (HPA-driven). Optional: skip it if you only want a dev environment on this cluster (rare).
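To make the app-of-apps idea concrete, here is a minimal sketch of what one of the three parent Applications in k8s/envs/bootstrap.yaml might look like. The `repoURL`, `project`, `targetRevision`, and sync options are assumptions; only the Application name and the path come from the description above.

```yaml
# Sketch of one parent Application; repoURL and sync options are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform
  namespace: argocd
spec:
  project: default                      # assumed Argo CD project
  source:
    repoURL: https://example.com/your-org/tomoda.git   # hypothetical repository URL
    targetRevision: main                # assumed branch
    path: k8s/envs/platform
    directory:
      recurse: true                     # pick up manifests in nested directories
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                       # assumed: remove resources deleted from git
      selfHeal: true                    # assumed: revert manual drift
```

The dev and prod parents would differ only in `metadata.name` and `source.path`, which is why deleting one of them removes that tier without touching the others.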

Diagram: one GKE cluster (gke-tomoda, us-central1-a) containing three tiers.
platform tier: shared, required (cert-manager, ESO, Traefik, monitoring, …)
dev tier: 1 Gi PVCs, no backups, 1 replica
prod tier: 10 Gi PVCs, full backups, HPA

The three tiers live inside one cluster. They are not separate clusters or projects, just three Argo CD Applications. Between dev and prod, the only real differences are sizing and backup policy.

Bootstrapping a fresh cluster

Apply k8s/envs/bootstrap.yaml once. Argo CD owns the world afterward. Don't kubectl apply anything else by hand — change git and let Argo reconcile.

What's shared vs per-environment

Shared (single instance in the cluster, lives in k8s/envs/platform/):

All system services: cert-manager, traefik, external-dns, external-secrets, monitoring (Prometheus + Grafana + Alertmanager + Loki + Tempo), oauth2-proxy, argocd-ingress, blackbox-exporter, the alerting rules + dashboards.
Photon geocoder + indexer: a singleton in the platform namespace. Both dev and prod backends hit photon.platform.svc.cluster.local:2322. Sharing makes sense because the multilingual planet index is read-only, takes hours and roughly $500 of compute to rebuild, and gives identical answers regardless of which tenant queries it.
The Argo CD control plane.
Workload Identity bindings, Artifact Registry connectivity, GCP Secret Manager access via ESO.

Per-environment (two instances in the cluster, separated by name):

Postgres Cluster (CNPG): postgres-dev (tomoda_dev DB, 1 Gi PVC, no backups) vs postgres-prod (tomoda_prod DB, 10 Gi PVC, daily backups with 30-day retention). Both live in the data namespace and are distinguished by cluster name. See k8s/envs/{dev,prod}/postgres/manifests/cluster.yaml.
Redis (Bitnami chart): redis (dev, 256 Mi memory limit, 1 Gi PVC) vs redis-prod (1 Gi memory limit, 10 Gi PVC). Both live in the data namespace. They are separate per environment so that a runaway dev workload can't degrade prod's queue depth.
The tomoda app: Application/tomoda (dev overlay → tomoda-dev namespace, 1 replica each, HPA pinned at maxReplicas=1) vs Application/tomoda-prod (prod overlay → tomoda-prod namespace, HPA-driven scale-out).
pgAdmin and Redis Insight management UIs.
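For orientation, the prod Postgres settings listed above could be expressed roughly as follows. This is a sketch, not the contents of cluster.yaml: it assumes CNPG's in-tree `barmanObjectStore` backup API, and the bucket path, instance count, database owner, and backup time are hypothetical.

```yaml
# Sketch only: bucket, owner, instance count, and schedule time are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-prod
  namespace: data
spec:
  instances: 1                      # assumed
  storage:
    size: 10Gi                      # dev would use 1Gi
  bootstrap:
    initdb:
      database: tomoda_prod
      owner: tomoda                 # hypothetical role name
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: gs://example-backup-bucket/postgres-prod   # hypothetical bucket
      googleCredentials:
        gkeEnvironment: true        # use Workload Identity instead of a key file
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: postgres-prod-daily
  namespace: data
spec:
  schedule: "0 0 2 * * *"           # six-field cron (seconds first): daily at 02:00, time assumed
  cluster:
    name: postgres-prod
```

The dev cluster would be the same resource with a smaller `storage.size`, a different database name, and no `backup` section or ScheduledBackup at all.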

Per-environment but outside the cluster:

Artifact Registry repos: tomoda-dev-repo and tomoda-prod-repo in us-central1.
Cloud Build triggers: dev (branch main, manual approval) vs prod (tag ^v[0-9]+\.[0-9]+\.[0-9]+$, auto-approved).
S3 + CloudFront: tomoda-assets-dev / assets-dev.tomoda.life vs tomoda-assets-prod / assets.tomoda.life. AWS is the only Terraform stack that splits by environment — it uses workspaces (default = dev, prod). GCP and Cloudflare are single-copy; see Terraform CI/CD for the full split matrix and how the apply pipeline treats each.

Tier sizing: dev is deliberately small

Dev exists to validate changes before they reach prod, not to handle real load. The dev tier is sized to make "is this code broken" answerable cheaply:

| Resource | Dev | Prod |
| --- | --- | --- |
| Postgres PVC | 1 Gi | 10 Gi |
| Postgres memory limit | 256 Mi | 512 Mi+ |
| Postgres backups | Off | Daily, 30d retention |
| Redis PVC | 1 Gi | 10 Gi |
| Redis memory limit | 256 Mi | 1 Gi |
| Tomoda api / async replicas | 1 (HPA pinned at maxReplicas=1) | 2+ (HPA scales up) |
| Total dev footprint | ~600 Mi RAM, ~1 Gi disk | dominant share of cluster |

A runaway dev workload can't OOM prod neighbours because the memory limits cap it. If you outgrow the dev sizing (e.g., you need to test a migration against a realistic dataset), bump the values temporarily; no Terraform change is needed.
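Pinning the dev HPA at a single replica, as the table describes, might look like the sketch below. The Deployment and HPA name `tomoda-api` and the CPU target are assumptions; only the namespace and the `maxReplicas` value come from this page.

```yaml
# Sketch only: resource names and the CPU target are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tomoda-api
  namespace: tomoda-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tomoda-api
  minReplicas: 1
  maxReplicas: 1          # pinned: the HPA exists but can never scale dev out
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed threshold
```

Keeping the HPA object in dev, rather than removing it, keeps the dev and prod overlays structurally identical, so a temporary bump is a one-line change to `maxReplicas`.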

Code promotion: dev → prod on the same cluster

A change ships to prod through dev first:

PR merges to main
    │
    ├─► Cloud Build pushes us-central1-docker.pkg.dev/.../tomoda-{backend,frontend}:dev-<sha>
    │   AND us-central1-docker.pkg.dev/.../tomoda-{backend,frontend}:<git-sha>
    │
    ▼
Argo CD Image Updater watches the dev-* tags
    │
    ▼ writes the new tag into k8s/apps/tomoda/overlays/dev/kustomization.yaml
    │
    ▼
Argo CD reconciles → dev pods (namespace `tomoda-dev`) pull the new image (~2-5 min)
    │
    ▼
[Verify in dev: smoke test, watch Sentry / Grafana for regressions]
    │
    ▼ promote manually
    │
    ▼  scripts/release.sh → cuts semver tag → updates overlays/prod/kustomization.yaml → push
    │
    ▼
Argo CD reconciles → prod pods (namespace `tomoda-prod`) pull the new image
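The promotion step in the flow above comes down to changing one image tag in the prod overlay. A minimal sketch of k8s/apps/tomoda/overlays/prod/kustomization.yaml after a release might look like this; the base path, image name, registry placeholders, and the `v1.4.2` tag are all hypothetical.

```yaml
# Sketch only: base path, image names, and the tag are assumptions.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: tomoda-prod
resources:
  - ../../base                 # assumed overlay layout
images:
  - name: tomoda-backend       # image name as referenced in the base manifests (assumed)
    newName: us-central1-docker.pkg.dev/<project>/<repo>/tomoda-backend   # placeholders, not real values
    newTag: v1.4.2             # hypothetical semver tag cut by the release script
```

Because the tag lives in git, rolling back prod is a revert of that one commit, after which Argo CD reconciles to the previous image.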

Prod's Application/tomoda-prod deliberately has no argocd-image-updater.argoproj.io/* annotations — prod tag bumps require a manual scripts/release.sh invocation. This is the gate that makes the dev → prod flow explicit.
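For contrast, the dev Application would carry Image Updater annotations along the lines of the sketch below. The annotation keys are the ones Argo CD Image Updater documents for its annotation-based configuration; the `backend` alias, the tag pattern, the registry placeholders, and the choice of update strategy are assumptions.

```yaml
# Sketch only: alias, tag regexp, and registry path placeholders are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tomoda
  namespace: argocd
  annotations:
    # Images to track, in alias=image form.
    argocd-image-updater.argoproj.io/image-list: "backend=us-central1-docker.pkg.dev/<project>/<repo>/tomoda-backend"
    # Only consider tags that look like dev-<sha> (assumes a hex sha).
    argocd-image-updater.argoproj.io/backend.allow-tags: "regexp:^dev-[0-9a-f]+$"
    # Pick the most recently built matching image (called "latest" in older releases).
    argocd-image-updater.argoproj.io/backend.update-strategy: "newest-build"
    # Commit the new tag to git rather than patching the live Application.
    argocd-image-updater.argoproj.io/write-back-method: "git"
    argocd-image-updater.argoproj.io/write-back-target: "kustomization"
# spec omitted
```

Leaving these annotations off Application/tomoda-prod is what keeps Image Updater from ever touching the prod overlay.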

Trade-offs

The single-cluster model keeps cost and operational complexity low — one VPC, one Argo CD UI, one set of IAM bindings, one Photon install. The cost is a shared blast radius: a node-level incident, an Argo CD outage, or a buggy ClusterRole affects both tenants simultaneously. Production traffic and dev traffic share the same Traefik LoadBalancer.

If we ever outgrow this, the migration path is to add a second GKE cluster (gke-tomoda-staging?) and move dev there, leaving the current cluster as prod-only. The kustomize overlays already separate everything that would need to move; the bootstrap pattern would just be pointed at the new cluster.