Environments
Tomoda runs on a production cluster with a dev tenant — one physical GKE cluster, one GCP project, and three logical tiers: platform (cluster-wide shared services), dev (a deliberately small tenant for testing changes before they hit prod), and prod (the canonical environment, sized for real traffic).
One project, one cluster, three tiers
| Layer | Value |
|---|---|
| GCP project | development-485000 |
| Region / zone | asia-east1 / asia-east1-a |
| GKE cluster | gke-tomoda (note: no env suffix — single cluster serves both tenants) |
| VPC | gke-tomoda-vpc (subnet 10.0.0.0/24, pods 10.1.0.0/16, services 10.2.0.0/20) |
| Dev app namespace | tomoda |
| Prod app namespace | prod |
| Shared data namespace | data (both postgres-dev/prod-postgres, redis/prod-redis) |
| Shared platform namespace | platform (Photon geocoder + indexer, ARC runners later) |
| Shared system namespaces | argocd, cert-manager, external-secrets, traefik-system, monitoring, etc. |
The cluster shape is defined in infrastructure/gcp/gke.tf and the VPC in infrastructure/gcp/vpc.tf. Workload Identity is enabled, so pods can impersonate GCP service accounts without static credentials — this is how CNPG reaches GCS for backups, for example.
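As a quick sanity check on the address plan in the VPC row of the table above, the three ranges can be inspected with Python's standard `ipaddress` module. This is only an illustrative sketch: the names `RANGES`, `address_counts`, and `overlapping_pairs` are made up here, and the authoritative values remain the ones in infrastructure/gcp/vpc.tf.

```python
import ipaddress
from itertools import combinations

# CIDR ranges copied from the VPC row of the table above.
RANGES = {
    "subnet": ipaddress.ip_network("10.0.0.0/24"),
    "pods": ipaddress.ip_network("10.1.0.0/16"),
    "services": ipaddress.ip_network("10.2.0.0/20"),
}


def address_counts(ranges):
    """Return the total number of addresses in each named range."""
    return {name: net.num_addresses for name, net in ranges.items()}


def overlapping_pairs(ranges):
    """Return the names of any two ranges that share addresses."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(ranges.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    for name, count in address_counts(RANGES).items():
        print(f"{name}: {RANGES[name]} ({count} addresses)")
    print("overlaps:", overlapping_pairs(RANGES))
```

The node subnet holds 256 addresses, the pod range 65,536, and the service range 4,096, and none of the three overlap.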
Bootstrap: the app-of-apps pattern
Argo CD is installed by Terraform (infrastructure/gcp/argocd.tf). Once it is up, a single bootstrap manifest is applied manually and everything else flows from git:
kubectl apply -f k8s/envs/bootstrap.yaml
k8s/envs/bootstrap.yaml creates three Argo Applications:
- platform: recurses into k8s/envs/platform/ and brings up cert-manager, traefik, external-dns, external-secrets, oauth2-proxy, monitoring (Prometheus + Grafana + Alertmanager), loki, tempo, blackbox-exporter, and Photon (geocoder + indexer). Required; everything else depends on it.
- dev: recurses into k8s/envs/dev/ and brings up postgres-dev (CNPG, 1 Gi PVC, no backup), redis-dev (256 Mi), pgAdmin, Redis Insight, and the tomoda dev overlay (1 replica each; see Tier sizing below). Optional. Delete the dev Argo Application to free dev's footprint.
- prod: recurses into k8s/envs/prod/ and brings up prod-postgres (10 Gi PVC, backups + retention), prod-redis (10 Gi), pgAdmin, Redis Insight, and the tomoda prod overlay (HPA-driven). Optional. Skip if you only want a dev environment on this cluster (rare).
flowchart LR
BS[envs/bootstrap.yaml] --> P[Application: platform]
BS --> D[Application: dev]
BS --> R[Application: prod]
P --> P1[cert-manager / traefik / external-dns / ESO]
P --> P2[monitoring / loki / tempo / alertmanager]
P --> P3[oauth2-proxy / blackbox-exporter / argocd-ingress]
P --> P4[photon + photon-indexer]
D --> D1[postgres-dev — 1 Gi, no backup]
D --> D2[redis-dev — 256 Mi]
D --> D3[tomoda dev overlay — 1 replica each]
R --> R1[prod-postgres — 10 Gi, full backups]
R --> R2[prod-redis — 10 Gi]
R --> R3[tomoda prod overlay — HPA]
Bootstrapping a fresh cluster
Apply k8s/envs/bootstrap.yaml once. Argo CD owns the world afterward. Don't kubectl apply anything else by hand — change git and let Argo reconcile.
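To make the shape of the three bootstrap Applications concrete, the sketch below builds them as plain Python dictionaries and prints them as JSON. It is not a copy of k8s/envs/bootstrap.yaml: the repository URL, target revision, project name, sync policy, and the use of `directory.recurse` are all assumptions chosen for illustration, and only the Application names and the k8s/envs/ paths come from this page.

```python
import json

# Hypothetical values: the real repository URL and revision are not stated here.
REPO_URL = "https://example.com/your-org/your-repo.git"
TARGET_REVISION = "main"


def argo_application(name, path):
    """Build an Argo CD Application that recurses into one env directory."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",  # assumption
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": TARGET_REVISION,
                "path": path,
                "directory": {"recurse": True},  # assumption: plain-directory source
            },
            "destination": {
                "server": "https://kubernetes.default.svc",  # the local cluster
                "namespace": "argocd",
            },
            # Assumption: automated sync, so git stays the only source of truth.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def bootstrap_applications():
    """Return the platform, dev, and prod Applications, in that order."""
    return [
        argo_application(tier, f"k8s/envs/{tier}")
        for tier in ("platform", "dev", "prod")
    ]


if __name__ == "__main__":
    print(json.dumps(bootstrap_applications(), indent=2))
```

Dropping "dev" or "prod" from the tuple mirrors the optional tiers described above; "platform" has to stay because the other two depend on it.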
What's shared vs per-environment
Shared (single instance in the cluster, lives in k8s/envs/platform/):
- All system services: cert-manager, traefik, external-dns, external-secrets, monitoring (Prometheus + Grafana + Alertmanager + Loki + Tempo), oauth2-proxy, argocd-ingress, blackbox-exporter, the alerting rules + dashboards.
- Photon geocoder + indexer: singleton in the platform namespace. Both dev and prod backends hit photon.platform.svc.cluster.local:2322. Sharing makes sense because the multilingual planet index is read-only, takes hours and ~$500 of compute to rebuild, and gives identical answers regardless of which tenant queries it.
- The Argo CD control plane.
- Workload Identity bindings, Artifact Registry connectivity, GCP Secret Manager access via ESO.
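The shared Photon address in the list above follows the usual Kubernetes pattern of service, namespace, then cluster domain. A minimal sketch of how a backend might build a request URL from it is shown below; the function names are hypothetical, and it assumes plain HTTP inside the cluster, the default `cluster.local` domain, and Photon's standard `/api` search endpoint.

```python
from urllib.parse import urlencode


def service_host(service, namespace, cluster_domain="cluster.local"):
    """Return the in-cluster DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def photon_search_url(query, lang=None):
    """Build a geocoding URL against the shared Photon Service.

    Assumes plain HTTP and Photon's /api endpoint on port 2322.
    """
    params = {"q": query}
    if lang:
        params["lang"] = lang
    host = service_host("photon", "platform")
    return f"http://{host}:2322/api?{urlencode(params)}"


if __name__ == "__main__":
    print(photon_search_url("Taipei 101", lang="en"))
```

Because the host name contains no tenant-specific part, the same URL works unchanged from both the tomoda and prod namespaces.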
Per-environment (two instances in the cluster, separated by name):
- Postgres Cluster (CNPG): postgres-dev (tomoda_dev DB, 1 Gi, no backups) vs prod-postgres (tomoda_prod DB, 10 Gi, daily backups with 30d retention). Both in the data namespace, distinguished by cluster name. See k8s/envs/{dev,prod}/postgres/manifests/cluster.yaml.
- Redis (Bitnami chart): redis (dev, 256 Mi memory cap) vs prod-redis (10 Gi). Both in data. Separate per env so a runaway dev workload can't degrade prod's queue depth.
- The tomoda app: Application/tomoda (dev overlay → tomoda namespace, 1 replica each, HPA pinned at maxReplicas=1) vs Application/prod-tomoda (prod overlay → prod namespace, HPA-driven scale-out).
- pgAdmin and Redis Insight management UIs.
Per-environment but outside the cluster:
- Artifact Registry repos: tomoda-dev-repo and tomoda-prod-repo in asia-east1.
- Cloud Build triggers: dev (branch main, manual approval) vs prod (tag ^v[0-9]+\.[0-9]+\.[0-9]+$, auto-approved).
- S3 + CloudFront: tomoda-assets-dev / assets-dev.tomoda.life vs tomoda-assets-prod / assets.tomoda.life. AWS uses Terraform workspaces (default and dev).
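The Cloud Build trigger rules in the list above can be checked locally before pushing a ref. The sketch below reuses the prod tag pattern verbatim; `pipeline_for` and its arguments are illustrative names, and Cloud Build evaluates the real triggers itself, so this only models which refs would match.

```python
import re

# Tag pattern quoted from the prod Cloud Build trigger above.
PROD_TAG = re.compile(r"^v[0-9]+\.[0-9]+\.[0-9]+$")


def pipeline_for(ref_type, ref_name):
    """Return 'dev', 'prod', or None for a pushed git ref.

    ref_type is 'branch' or 'tag'.
    """
    if ref_type == "branch" and ref_name == "main":
        return "dev"
    # fullmatch rejects a trailing newline that '$' alone would tolerate.
    if ref_type == "tag" and PROD_TAG.fullmatch(ref_name):
        return "prod"
    return None


if __name__ == "__main__":
    for ref in [("branch", "main"), ("tag", "v1.4.0"), ("tag", "v1.4.0-rc1")]:
        print(ref, "->", pipeline_for(*ref))
```

Note that pre-release tags such as v1.4.0-rc1 and two-part tags such as v1.4 do not match the pattern, so they would not start a prod build.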
Tier sizing: dev is deliberately small
Dev exists to validate changes before they reach prod, not to handle real load. The dev tier is sized to make "is this code broken" answerable cheaply:
| Resource | Dev | Prod |
|---|---|---|
| Postgres PVC | 1 Gi | 10 Gi |
| Postgres memory limit | 256 Mi | 512 Mi+ |
| Postgres backups | Off | Daily, 30d retention |
| Redis PVC | 1 Gi | 10 Gi |
| Redis memory limit | 256 Mi | 1 Gi |
| Tomoda api / async replicas | 1 (HPA pinned at maxReplicas=1) | 2+ (HPA scales up) |
| Total dev footprint | ~600 Mi RAM, ~1 Gi disk | dominant share of cluster |
A runaway dev workload can't OOM prod neighbours because the memory limits cap it. If you outgrow the dev sizing (e.g., need to test a migration against a realistic dataset), bump the values temporarily; the Terraform doesn't have to change.
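To see why pinning the dev HPA at maxReplicas=1 keeps dev small regardless of load, here is a simplified model of the scaling rule documented for the Kubernetes HorizontalPodAutoscaler (desired = ceil(current × currentMetric / targetMetric), clamped to the configured bounds). It ignores the controller's tolerance band and stabilization behaviour, and the metric values and the prod maximum of 6 are hypothetical, since this page does not state them.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Core HPA scaling rule, clamped to the configured bounds.

    Simplified: ignores tolerance, stabilization windows, and pod readiness.
    """
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Dev: load is three times the target, but the HPA is pinned at 1.
    print("dev:", desired_replicas(1, 240, 80, min_replicas=1, max_replicas=1))
    # Prod: load is twice the target; hypothetical bounds of 2 to 6 replicas.
    print("prod:", desired_replicas(2, 160, 80, min_replicas=2, max_replicas=6))
```

With these inputs dev stays at 1 replica while prod scales from 2 to 4, which is the asymmetry the table describes.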
Code promotion: dev → prod on the same cluster
A change ships to prod through dev first:
PR merges to main
│
├─► Cloud Build pushes asia-east1-docker.pkg.dev/.../tomoda-{backend,frontend}:dev-<sha>
│ AND asia-east1-docker.pkg.dev/.../tomoda-{backend,frontend}:<git-sha>
│
▼
Argo CD Image Updater watches the dev-* tags
│
▼ writes the new tag into k8s/apps/tomoda/overlays/dev/kustomization.yaml
│
▼
Argo CD reconciles → dev pods (namespace `tomoda`) pull the new image (~2-5 min)
│
▼
[Verify in dev: smoke test, watch Sentry / Grafana for regressions]
│
▼ promote manually
│
▼ scripts/release.sh → cuts semver tag → updates overlays/prod/kustomization.yaml → push
│
▼
Argo CD reconciles → prod pods (namespace `prod`) pull the new image
Prod's Application/prod-tomoda deliberately has no argocd-image-updater.argoproj.io/* annotations — prod tag bumps require a manual scripts/release.sh invocation. This is the gate that makes the dev → prod flow explicit.
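The tagging side of the flow above can be modelled in a few lines: a merge to main yields two tags, and only the dev-prefixed one is in the family that Image Updater watches. The function names are hypothetical, and the glob pattern is a stand-in for the real filter, which is configured through Image Updater annotations that this sketch does not reproduce.

```python
from fnmatch import fnmatchcase

# The tag family Image Updater watches for dev, per the flow above.
DEV_TAG_PATTERN = "dev-*"


def tags_for_main_commit(sha):
    """Tags pushed for a merge to main: dev-<sha> and the bare git sha."""
    return [f"dev-{sha}", sha]


def auto_deploy_candidates(tags):
    """Keep only the tags that the dev Image Updater would consider."""
    return [tag for tag in tags if fnmatchcase(tag, DEV_TAG_PATTERN)]


if __name__ == "__main__":
    pushed = tags_for_main_commit("3f9c2ab")
    print("pushed:", pushed)
    print("dev candidates:", auto_deploy_candidates(pushed + ["v1.4.0"]))
```

Nothing in this model ever selects a tag for prod automatically, which matches the manual scripts/release.sh gate.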
Trade-offs
The single-cluster model keeps cost and operational complexity low — one VPC, one Argo CD UI, one set of IAM bindings, one Photon install. The cost is a shared blast radius: a node-level incident, an Argo CD outage, or a buggy ClusterRole affects both tenants simultaneously. Production traffic and dev traffic share the same Traefik LoadBalancer.
If we ever outgrow this, the migration path is to add a second GKE cluster (gke-tomoda-staging?) and move dev there, leaving the current cluster as prod-only. The kustomize overlays already separate everything that would need to move; the bootstrap pattern would just be pointed at the new cluster.