Environments

Tomoda runs on a production cluster with a dev tenant — one physical GKE cluster, one GCP project, and three logical tiers: platform (cluster-wide shared services), dev (a deliberately small tenant for testing changes before they hit prod), and prod (the canonical environment, sized for real traffic).

One project, one cluster, three tiers

| Layer | Value |
| --- | --- |
| GCP project | development-485000 |
| Region / zone | asia-east1 / asia-east1-a |
| GKE cluster | gke-tomoda (note: no env suffix; single cluster serves both tenants) |
| VPC | gke-tomoda-vpc (subnet 10.0.0.0/24, pods 10.1.0.0/16, services 10.2.0.0/20) |
| Dev app namespace | tomoda |
| Prod app namespace | prod |
| Shared data namespace | data (both postgres-dev/prod-postgres, redis/prod-redis) |
| Shared platform namespace | platform (Photon geocoder + indexer, ARC runners later) |
| Shared system namespaces | argocd, cert-manager, external-secrets, traefik-system, monitoring, etc. |
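
As a quick sanity check on the address plan, the three ranges in the VPC row can be compared with Python's standard ipaddress module. This is only an illustrative sketch: the dictionary keys are labels chosen here, not names taken from infrastructure/gcp/vpc.tf.

```python
import ipaddress
from itertools import combinations

# CIDR ranges copied from the VPC row above; the keys are illustrative labels.
RANGES = {
    "subnet": ipaddress.ip_network("10.0.0.0/24"),
    "pods": ipaddress.ip_network("10.1.0.0/16"),
    "services": ipaddress.ip_network("10.2.0.0/20"),
}


def overlapping_pairs(ranges):
    """Return the label pairs whose networks share at least one address."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(ranges.items(), 2)
        if net_a.overlaps(net_b)
    ]


def capacities(ranges):
    """Return the raw address count of each range, before any reservations."""
    return {name: net.num_addresses for name, net in ranges.items()}


if __name__ == "__main__":
    print(overlapping_pairs(RANGES))
    print(capacities(RANGES))
```

The raw counts are upper bounds only; how many nodes, pods, and services actually fit depends on per-node pod ranges and reserved addresses, which this sketch does not model.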

The cluster shape is defined in infrastructure/gcp/gke.tf and the VPC in infrastructure/gcp/vpc.tf. Workload Identity is enabled, so pods can impersonate GCP service accounts without static credentials — this is how CNPG reaches GCS for backups, for example.

Bootstrap: the app-of-apps pattern

Argo CD is installed by Terraform (infrastructure/gcp/argocd.tf). Once it is up, a single bootstrap manifest is applied manually and everything else flows from git:

kubectl apply -f k8s/envs/bootstrap.yaml

k8s/envs/bootstrap.yaml creates three Argo Applications:

  • platform: recurses into k8s/envs/platform/ and brings up cert-manager, traefik, external-dns, external-secrets, oauth2-proxy, monitoring (Prometheus + Grafana + Alertmanager), loki, tempo, blackbox-exporter, and Photon (geocoder + indexer). Required; everything else depends on it.
  • dev: recurses into k8s/envs/dev/ and brings up postgres-dev (CNPG, 1 Gi PVC, no backup), redis-dev (256 Mi memory limit), pgAdmin, Redis Insight, and the tomoda dev overlay (1 replica each; see Tier sizing below). Optional. Delete the dev Argo Application to free dev's footprint.
  • prod: recurses into k8s/envs/prod/ and brings up prod-postgres (10 Gi PVC, backups + retention), prod-redis (10 Gi PVC), pgAdmin, Redis Insight, and the tomoda prod overlay (HPA-driven). Optional. Skip it only if you want a dev-only environment on this cluster (rare).
flowchart LR
    BS[envs/bootstrap.yaml] --> P[Application: platform]
    BS --> D[Application: dev]
    BS --> R[Application: prod]

    P --> P1[cert-manager / traefik / external-dns / ESO]
    P --> P2[monitoring / loki / tempo / alertmanager]
    P --> P3[oauth2-proxy / blackbox-exporter / argocd-ingress]
    P --> P4[photon + photon-indexer]

    D --> D1[postgres-dev — 1 Gi, no backup]
    D --> D2[redis-dev — 256 Mi]
    D --> D3[tomoda dev overlay — 1 replica each]

    R --> R1[prod-postgres — 10 Gi, full backups]
    R --> R2[prod-redis — 10 Gi]
    R --> R3[tomoda prod overlay — HPA]
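
To make the app-of-apps shape concrete, here is a rough sketch of what the three Applications might look like, built as Python dictionaries and printed as a JSON List. The contents of k8s/envs/bootstrap.yaml are not shown on this page, so the repository URL, target revision, project name, and sync policy below are all assumptions; only the Application names, the k8s/envs/ paths, and the recursive directory behaviour come from the description above.

```python
import json

# Hypothetical values: the real repo URL and revision are not given on this page.
REPO_URL = "https://example.com/tomoda/infra.git"
TARGET_REVISION = "HEAD"


def application(name, path):
    """Build an Argo CD Application that recurses into one env directory."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",  # assumed project name
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": TARGET_REVISION,
                "path": path,
                "directory": {"recurse": True},
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
            # Assumed policy: automated sync with pruning and self-healing.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def bootstrap():
    """Return one Application per tier, in the order the page lists them."""
    return [application(tier, f"k8s/envs/{tier}") for tier in ("platform", "dev", "prod")]


if __name__ == "__main__":
    manifest = {"apiVersion": "v1", "kind": "List", "items": bootstrap()}
    print(json.dumps(manifest, indent=2))
```

Dropping "dev" or "prod" from the tuple mirrors the "Optional" notes above: the platform Application stays, and the tenant simply never gets created.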

Bootstrapping a fresh cluster

Apply k8s/envs/bootstrap.yaml once. Argo CD owns the world afterward. Don't kubectl apply anything else by hand — change git and let Argo reconcile.

What's shared vs per-environment

Shared (single instance in the cluster, lives in k8s/envs/platform/):

  • All system services: cert-manager, traefik, external-dns, external-secrets, monitoring (Prometheus + Grafana + Alertmanager + Loki + Tempo), oauth2-proxy, argocd-ingress, blackbox-exporter, the alerting rules + dashboards.
  • Photon geocoder + indexer: a singleton in the platform namespace. Both dev and prod backends hit photon.platform.svc.cluster.local:2322. Sharing makes sense because the multilingual planet index is read-only, takes hours and roughly $500 of compute to rebuild, and gives identical answers regardless of which tenant queries it.
  • The Argo CD control plane.
  • Workload Identity bindings, Artifact Registry connectivity, GCP Secret Manager access via ESO.
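
The shared Photon address follows the standard Kubernetes in-cluster DNS form, service.namespace.svc.cluster-domain. A small sketch of how a backend could assemble it is shown below; the helper name and its parameters are made up for illustration, and cluster.local is assumed to be the cluster domain, as the address above suggests.

```python
def service_address(service, namespace, port, cluster_domain="cluster.local"):
    """Return host:port for a Service using the in-cluster DNS naming form."""
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


# Values taken from the Photon bullet above.
PHOTON = service_address("photon", "platform", 2322)

if __name__ == "__main__":
    print(PHOTON)
```

Because the name is fully qualified with the platform namespace, it resolves the same way from pods in the tomoda namespace and from pods in the prod namespace.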

Per-environment (two instances in the cluster, separated by name):

  • Postgres Cluster (CNPG): postgres-dev (tomoda_dev DB, 1 Gi, no backups) vs prod-postgres (tomoda_prod DB, 10 Gi, daily backups with 30d retention). Both live in the data namespace and are distinguished by cluster name. See k8s/envs/{dev,prod}/postgres/manifests/cluster.yaml.
  • Redis (Bitnami chart): redis (dev, 256 Mi memory limit) vs prod-redis (1 Gi memory limit, 10 Gi PVC). Both live in the data namespace. They are kept separate per environment so that a runaway dev workload can't degrade prod's queue depth.
  • The tomoda app: Application/tomoda (dev overlay → tomoda namespace, 1 replica each, HPA pinned at maxReplicas=1) vs Application/prod-tomoda (prod overlay → prod namespace, HPA-driven scale-out).
  • pgAdmin and Redis Insight management UIs.

Per-environment but outside the cluster:

  • Artifact Registry repos: tomoda-dev-repo and tomoda-prod-repo in asia-east1.
  • Cloud Build triggers: dev (branch main, manual approval) vs prod (tag ^v[0-9]+\.[0-9]+\.[0-9]+$, auto-approved).
  • S3 + CloudFront: tomoda-assets-dev / assets-dev.tomoda.life vs tomoda-assets-prod / assets.tomoda.life. AWS uses Terraform workspaces (default and dev).
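
To see which git tags the prod trigger pattern accepts, the expression can be tried out locally. This is a sketch using Python's re module; Cloud Build evaluates the pattern with its own regex engine, so treat it as a way to reason about the pattern rather than a reproduction of the trigger. The function name is hypothetical.

```python
import re

# Pattern copied from the prod Cloud Build trigger above.
PROD_TAG = re.compile(r"^v[0-9]+\.[0-9]+\.[0-9]+$")


def triggers_prod_build(tag):
    """Return True when a git tag matches the prod trigger's pattern."""
    # fullmatch sidesteps Python's quirk where "$" also matches before a
    # trailing newline.
    return PROD_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ("v1.2.3", "v10.0.12", "v1.2", "1.2.3", "v1.2.3-rc1", "dev-abc1234"):
        print(tag, triggers_prod_build(tag))
```

Only plain three-part versions with a leading v match; pre-release suffixes such as -rc1 and the dev-<sha> image tags do not.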

Tier sizing — dev is deliberately small

Dev exists to validate changes before they reach prod, not to handle real load. The dev tier is sized to make "is this code broken" answerable cheaply:

| Resource | Dev | Prod |
| --- | --- | --- |
| Postgres PVC | 1 Gi | 10 Gi |
| Postgres memory limit | 256 Mi | 512 Mi+ |
| Postgres backups | Off | Daily, 30d retention |
| Redis PVC | 1 Gi | 10 Gi |
| Redis memory limit | 256 Mi | 1 Gi |
| Tomoda api / async replicas | 1 (HPA pinned at maxReplicas=1) | 2+ (HPA scales up) |
| Total dev footprint | ~600 Mi RAM, ~1 Gi disk | dominant share of cluster |

A runaway dev workload can't OOM prod neighbours because its memory limits cap it. If you outgrow the dev sizing (e.g., you need to test a migration against a realistic dataset), bump the values temporarily; the Terraform doesn't have to change.

Code promotion: dev → prod on the same cluster

A change ships to prod through dev first:

PR merges to main
    │
    ├─► Cloud Build pushes asia-east1-docker.pkg.dev/.../tomoda-{backend,frontend}:dev-<sha>
    │   AND asia-east1-docker.pkg.dev/.../tomoda-{backend,frontend}:<git-sha>
    │
    ▼
Argo CD Image Updater watches the dev-* tags
    │
    ▼ writes the new tag into k8s/apps/tomoda/overlays/dev/kustomization.yaml
    │
    ▼
Argo CD reconciles → dev pods (namespace `tomoda`) pull the new image (~2-5 min)
    │
    ▼
[Verify in dev: smoke test, watch Sentry / Grafana for regressions]
    │
    ▼ promote manually
    │
    ▼  scripts/release.sh → cuts semver tag → updates overlays/prod/kustomization.yaml → push
    │
    ▼
Argo CD reconciles → prod pods (namespace `prod`) pull the new image

Prod's Application/prod-tomoda deliberately has no argocd-image-updater.argoproj.io/* annotations — prod tag bumps require a manual scripts/release.sh invocation. This is the gate that makes the dev → prod flow explicit.
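
The tagging side of the flow above can be sketched in a few lines. This assumes the dev filter is simply "any tag starting with dev-", which is all this page says about it; the function names are hypothetical, and how Image Updater picks the newest of several matching tags is deliberately not modelled.

```python
import re

# Assumed filter: the page says only that Image Updater "watches the dev-* tags".
DEV_TAG = re.compile(r"dev-.+")


def tags_for_commit(sha):
    """Return the two tags Cloud Build pushes for one merge to main."""
    return [f"dev-{sha}", sha]


def dev_candidates(tags):
    """Keep only the tags the dev Image Updater would consider."""
    return [tag for tag in tags if DEV_TAG.fullmatch(tag)]


if __name__ == "__main__":
    pushed = tags_for_commit("abc1234") + ["v1.2.3"]
    print(dev_candidates(pushed))
```

The bare <git-sha> tag and any semver release tag fall outside the dev filter, which is consistent with prod only moving when scripts/release.sh updates overlays/prod/kustomization.yaml.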

Trade-offs

The single-cluster model keeps cost and operational complexity low — one VPC, one Argo CD UI, one set of IAM bindings, one Photon install. The cost is a shared blast radius: a node-level incident, an Argo CD outage, or a buggy ClusterRole affects both tenants simultaneously. Production traffic and dev traffic share the same Traefik LoadBalancer.

If we ever outgrow this, the migration path is to add a second GKE cluster (gke-tomoda-staging?) and move dev there, leaving the current cluster as prod-only. The kustomize overlays already separate everything that would need to move; the bootstrap pattern would just be pointed at the new cluster.