Decisions

This page is an ADR-lite log: the load-bearing choices that shape the rest of the platform, why we made them, and what we gave up.

Each entry follows the same shape: Decision / Context / Rationale / Trade-offs.

GKE + Argo CD + Kustomize for the control plane

Decision. All workloads run on GKE. Every workload is described declaratively in this git repository. Argo CD reconciles cluster state to git, with Kustomize as the overlay layer for our own manifests and Helm for upstream charts.

Context. The alternatives were raw kubectl apply from CI, Helm-only with a release controller (e.g. Flux + Helm), or running everything on managed services (Cloud Run + Cloud SQL + Memorystore).

Rationale. GitOps gives us an audit trail by default — every change is a git commit, and the live cluster state can be diffed against main at any moment. Kustomize keeps the overlay model simple (base + per-env patches) without the variable-substitution complexity Helm introduces. Argo's UI is a strong operational asset for a small team.
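
As a rough illustration of the base + per-env patches model, a dev overlay might look like the sketch below. The file path, the backend Deployment name, and the patch file name are assumptions made for illustration, not paths taken from this repository.

```yaml
# k8s/apps/backend/overlays/dev/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Everything rendered by this overlay lands in the dev namespace.
namespace: dev

# Shared manifests live in the base; the overlay only references them.
resources:
  - ../../base

# Per-env differences are expressed as patches, not template variables.
patches:
  - path: replicas-patch.yaml   # hypothetical patch file
    target:
      kind: Deployment
      name: backend             # hypothetical workload name
```

Running kustomize build against the overlay directory prints the fully rendered manifests, which is a convenient way to review what Argo CD will apply.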

Trade-offs. Two more systems to learn (Argo and Kustomize). With automated sync and selfHeal: true enabled, a bad commit to main is immediately a production change. Drift correction also cuts both ways: when someone changes the cluster by hand, Argo reverts the change to match git rather than raising an alert, so the underlying problem can go unnoticed.
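
For reference, the behaviour described above comes from the sync policy on each Argo CD Application. The sketch below is a minimal example under assumed names: the application name, repository URL, and path are placeholders, not values from this repository.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: backend                 # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform.git   # placeholder URL
    targetRevision: main
    path: k8s/apps/backend/overlays/dev                     # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: dev
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from git
      selfHeal: true   # revert manual changes made directly in the cluster
```

argocd app diff backend shows the difference between the live state and git before a sync, which is the quickest check when a commit looks risky.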

In-cluster Postgres via CloudNativePG

Decision. Postgres is a CNPG Cluster running inside GKE with the ghcr.io/cloudnative-pg/postgis:17-3.5 image. There is no Cloud SQL instance.

Context. Cloud SQL was the obvious default. CNPG was selected after evaluating the cost and operational model.

Rationale. A db-custom-1-3840 Cloud SQL instance is roughly an order of magnitude more expensive than the equivalent CNPG pod on a spot node. The CNPG PostGIS image bundles the extension we need for geo features. Continuous WAL archiving via Barman writes to a GCS bucket (tomoda-db-backups-*), so we still get point-in-time recovery from storage outside the cluster. The same manifests work on any Kubernetes cluster, so there is no lock-in to Cloud SQL's API.

Trade-offs. We own backup verification, version upgrades, and failover. The underlying spot node can be reclaimed at any time, and because we run a single instance (HA is on the roadmap), each reclaim means Postgres is unavailable until the pod is rescheduled and has recovered. Restoring from Barman is also slower than creating a Cloud SQL clone.
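
A minimal sketch of what such a CNPG Cluster can look like follows. Only the image and the single-instance setup come from this page; the resource name, namespace, storage size, retention period, and bucket placeholder are assumptions. Depending on the CNPG version, object-store backups may instead be configured through the Barman Cloud plugin rather than the in-tree barmanObjectStore field shown here.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres            # hypothetical name
  namespace: data           # assumed namespace
spec:
  instances: 1              # single instance; HA would raise this
  imageName: ghcr.io/cloudnative-pg/postgis:17-3.5
  storage:
    size: 20Gi              # hypothetical size
  backup:
    retentionPolicy: "14d"  # hypothetical retention window
    barmanObjectStore:
      # Placeholder; substitute the real backup bucket.
      destinationPath: gs://BACKUP_BUCKET/postgres
      googleCredentials:
        # Authenticate through GKE Workload Identity, not a key file.
        gkeEnvironment: true
```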

In-cluster Redis via Bitnami Helm chart

Decision. Redis runs as a Bitnami standalone Helm release in the data namespace, no auth, ClusterIP only.

Context. Memorystore would have been the managed alternative.

Rationale. Same cost/portability story as Postgres. Because Redis is never exposed outside the cluster — only backend pods reach it — auth adds operational friction (rotating secrets, syncing them via ESO) without a meaningful security improvement. NetworkPolicy in the data namespace is the real boundary.
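
To make the boundary concrete, a NetworkPolicy along the following lines would admit only backend pods to Redis. This is a sketch: the policy name, the backend namespace (dev), and the app: backend label are assumptions, and the policy only has an effect if the cluster enforces NetworkPolicy (for example with GKE Dataplane V2).

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-allow-backend        # hypothetical name
  namespace: data
spec:
  # Applies to the Redis pods created by the Bitnami chart.
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Both selectors sit in one list item, so a peer must match both.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: dev   # assumed backend namespace
          podSelector:
            matchLabels:
              app: backend                       # hypothetical pod label
      ports:
        - protocol: TCP
          port: 6379
```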

Trade-offs. No managed failover, no automatic patching. The disk is a PersistentVolumeClaim; size has to be increased manually when it fills up.
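
The decision above maps onto a handful of values in the Bitnami redis chart. The fragment below is a sketch; the persistence size is a placeholder rather than the value we actually run.

```yaml
# values.yaml fragment for the Bitnami redis chart
architecture: standalone   # single master, no replicas or sentinel
auth:
  enabled: false           # no password; NetworkPolicy is the boundary
master:
  service:
    type: ClusterIP        # reachable only from inside the cluster
  persistence:
    enabled: true
    size: 8Gi              # placeholder; grown by hand when needed
```

Growing the volume means raising the requested storage on the PersistentVolumeClaim, which only works if the StorageClass allows volume expansion.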

Traefik for cluster ingress

Decision. Ingress is Traefik, deployed via the upstream Helm chart, exposed as a GCP external L4 LoadBalancer.

Context. GKE's bundled GCE Ingress (using a GCLB at L7) was the obvious GCP-native alternative, as was the GKE Gateway API implementation.

Rationale. Traefik works identically on any Kubernetes cluster, which keeps the door open to multi-cloud or a future move. Its middleware ecosystem (oauth2-proxy chains, rate limit, basic-auth) and CRD-based config map cleanly onto the Argo-managed manifests pattern. cert-manager + Let's Encrypt for TLS is straightforward against Traefik. Because the load balancer works at L4 and passes connections through, TLS terminates inside the cluster at Traefik, where it is easier to inspect.

Trade-offs. We do not get GCLB's Cloud Armor / WAF integration for free. We pay for one external L4 LB instead of using GCE Ingress's HTTPS LB, and we maintain Traefik's config ourselves.
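
As an illustration of the CRD-based config, a route with a rate-limit middleware might look like the sketch below. The hostname, namespace, service name, port, limits, and TLS secret name are all assumptions; the traefik.io/v1alpha1 API group applies to recent Traefik releases, while older ones use traefik.containo.us.

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: api-rate-limit     # hypothetical name
  namespace: dev
spec:
  rateLimit:
    average: 50            # hypothetical sustained requests per second
    burst: 100             # hypothetical burst allowance
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: api                # hypothetical name
  namespace: dev
spec:
  entryPoints:
    - websecure            # HTTPS entry point in the upstream chart
  routes:
    - match: Host(`api.example.com`)   # placeholder hostname
      kind: Rule
      middlewares:
        - name: api-rate-limit
      services:
        - name: backend    # hypothetical Service
          port: 8080
  tls:
    secretName: api-tls    # assumed to be issued by cert-manager
```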

AWS S3 + CloudFront for static assets

Decision. Static assets live in S3 buckets (tomoda-assets-{env}) in us-east-2 and are served by CloudFront with assets.tomoda.life / assets-dev.tomoda.life as aliases.

Context. GCS + Cloud CDN sits next to all our other GCP infrastructure and would have been the consistent choice.

Rationale. This is largely historical — the assets pipeline predates the GKE setup and was already battle-tested when GCP became the primary cloud. CloudFront's caching, OAC, and ACM integration are mature. There has been no operational reason to migrate.

Trade-offs. A second cloud account to administer for what is, structurally, a small surface. Two IAM systems, two billing dashboards. The split is documented and stable, so the ongoing tax is mostly cognitive.

Cloudflare DNS-only (no orange-cloud proxy)

Decision. Cloudflare hosts the tomoda.life zone. Every record is set to proxied = false (DNS-only). The Cloudflare CDN and WAF are not used.

Context. Cloudflare's proxy mode would give us free DDoS protection and a global edge.

Rationale. We already have a CDN on the asset path (CloudFront). Putting Cloudflare in front of CloudFront double-caches and complicates invalidation. Putting Cloudflare in front of the API path adds latency and hides the real client IP unless we configure the ingress to trust Cloudflare's forwarded headers (for example CF-Connecting-IP). Keeping Cloudflare as DNS-only means external-dns and Terraform can both manage records cleanly, and the origin sees exactly the traffic it would see without Cloudflare.

Trade-offs. We lose Cloudflare's L7 protection. If we ever want it back, we add it on a per-host basis — leave assets.* unproxied (CloudFront already protects it) and proxy only api.* / app.*.
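
For records managed by external-dns, the proxy setting can be controlled per resource through an annotation. The sketch below shows the idea on a LoadBalancer Service; the namespace, hostname, and selector are placeholders, and in practice these annotations would be set through the Traefik chart values rather than a hand-written Service.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: traefik       # hypothetical namespace
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.example.com   # placeholder
    # "false" keeps the record DNS-only; "true" would opt this host
    # into the Cloudflare proxy.
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: traefik
  ports:
    - name: websecure
      port: 443
      targetPort: websecure
```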

Single GCP project for dev and prod

Decision. Everything runs in GCP project development-485000. Dev and prod are separated by Kubernetes namespace and by per-env Artifact Registry repos, not by project.

Context. A per-environment project (or dedicated tomoda-prod) is the conventional cloud organization pattern.

Rationale. A small team, one cluster, one VPC, one set of IAM bindings — operations are dramatically simpler. The Terraform stays in a single state file. Cross-env queries (cost reports, audit logs, billing alerts) work without federation.

Trade-offs. Bigger blast radius. A bad IAM grant or a mistaken gcloud config set project is potentially production-affecting. Cost attribution between dev and prod relies on labels rather than project boundaries.

Spot instances everywhere

Decision. Both GKE node pools are spot. The active pool is e2-standard-2 (spot: true, node_count: 1); a second e2-medium spot pool exists with node_count: 0 for future scale-out. See infrastructure/gcp/gke.tf.

Context. On-demand nodes would be the obvious choice for a production cluster.

Rationale. Spot is 60-90% cheaper than on-demand. The workload tolerates restarts (Argo will re-reconcile, CNPG will replay WAL, Redis reloads its dataset from disk). For a small team running a single-tenant platform, spending the cost difference on on-demand reliability would be poor value.

Trade-offs. A node can vanish with about 30 seconds' notice. This is acceptable but visible: expect occasional pod restarts in kubectl get events.
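
On the workload side, tolerating preemption mostly means shutting down quickly. The Deployment sketch below uses assumed names throughout (workload, namespace, image); the cloud.google.com/gke-spot label is the one GKE applies to spot nodes, and the grace period is an assumption chosen to fit inside the preemption notice.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend            # hypothetical workload
  namespace: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      # Redundant while every pool is spot, but makes the dependency explicit.
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      # Assumption: short enough to finish before the node disappears.
      terminationGracePeriodSeconds: 15
      containers:
        - name: backend
          image: registry.example.com/backend:latest   # placeholder image
```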

Dex + Google OAuth for Argo CD SSO

Decision. Argo CD's bundled Dex is configured with a Google OAuth connector, restricted to the tomoda.life hosted domain. Default RBAC policy is role:admin. See infrastructure/gcp/argocd.tf.

Context. Argo CD ships with local user auth; SSO is opt-in.

Rationale. Tying access to the company Google Workspace means joiners and leavers are handled by HR processes, not by Argo CD configuration. The hostedDomains: ["tomoda.life"] constraint enforces that only company accounts can sign in.

Trade-offs. A Google outage takes down Argo CD login. The default role:admin policy means everyone with a tomoda.life Google account has full Argo permissions today; this should be tightened with explicit RBAC once we have more than a handful of users.
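
The repository configures this through Terraform, but the resulting Argo CD settings are roughly equivalent to the ConfigMaps sketched below. The hostname, the secret key names, and the example user are assumptions. The second ConfigMap shows one way the RBAC could be tightened later; it does not reflect the current role:admin default.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  url: https://argocd.example.com          # placeholder hostname
  dex.config: |
    connectors:
      - type: google
        id: google
        name: Google
        config:
          # $-prefixed values are read from the argocd-secret Secret
          # (key names here are hypothetical).
          clientID: $dex.google.clientID
          clientSecret: $dex.google.clientSecret
          redirectURI: https://argocd.example.com/api/dex/callback
          hostedDomains:
            - tomoda.life
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Possible tightening: read-only by default, admin by explicit grant.
  policy.default: role:readonly
  scopes: '[groups, email]'
  policy.csv: |
    g, alice@tomoda.life, role:admin
```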

External Secrets Operator for secret distribution

Decision. Secrets live in GCP Secret Manager (and a smaller set in AWS Secrets Manager). External Secrets Operator runs in-cluster and materializes them as native Secret objects via ExternalSecret and ClusterSecretStore.

Context. Alternatives: Bitnami Sealed Secrets (encrypt, then commit to git), a HashiCorp Vault-based secrets operator, or kubectl create secret by hand.

Rationale. Secrets are managed where they belong, in the cloud provider's secret manager, and never enter git, encrypted or otherwise. Rotation in Secret Manager is picked up by ESO automatically within one refresh cycle (refreshInterval: 1h). Pods reference secrets the same way they would for any other K8s Secret. Workload Identity removes the bootstrap problem of how the secret operator itself authenticates.

Trade-offs. ESO is one more controller to monitor. A bug in ClusterSecretStore configuration can stall secret materialization across the whole cluster.
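
A minimal sketch of the two resources involved is shown below. The project ID and the refresh interval come from this page; the store name, cluster name and location, service account, and secret names are assumptions. Newer ESO releases serve these kinds under external-secrets.io/v1 rather than v1beta1.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: gcp-secret-manager          # hypothetical store name
spec:
  provider:
    gcpsm:
      projectID: development-485000
      auth:
        workloadIdentity:
          clusterLocation: us-central1     # hypothetical
          clusterName: platform            # hypothetical
          serviceAccountRef:
            name: external-secrets         # hypothetical KSA
            namespace: external-secrets
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: backend-env                 # hypothetical
  namespace: dev
spec:
  refreshInterval: 1h               # how often ESO re-reads Secret Manager
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager
  target:
    name: backend-env               # name of the resulting K8s Secret
  data:
    - secretKey: DATABASE_URL       # key inside the K8s Secret
      remoteRef:
        key: dev-database-url       # hypothetical Secret Manager secret
```

Note that ESO updates the Secret object only. Pods that read it as environment variables see the new value only after a restart.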

App-of-apps GitOps pattern

Decision. Each environment has a single bootstrap Application manifest (k8s/envs/{dev,prod}/bootstrap.yaml) that, with recurse: true, picks up every other Application in k8s/envs/<env>/{sys,middleware}/. There is no flat manifest list anywhere.

Context. The simpler model is to list every Application explicitly in one file.

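Rationale. Adding a new app is one application.yaml in the right directory, with no edits to a central registry. The directory structure itself is the registry. Pruning follows the same path: with automated pruning enabled, deleting the directory makes Argo remove the corresponding Application, and the resources it manages go with it when that Application carries Argo's resources finalizer.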

Trade-offs. "What is deployed?" is answered by walking the tree, not by reading one file. A typo in a sub-directory's application.yaml can cause Argo to silently skip that Application, so argocd app list is the source of truth, not the filesystem.
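
For orientation, a bootstrap Application of this kind can be sketched as follows. The path and recurse: true follow the description above; the application name, repository URL, and sync policy are assumptions.

```yaml
# k8s/envs/dev/bootstrap.yaml (sketch, not the file as committed)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: bootstrap-dev       # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform.git   # placeholder URL
    targetRevision: main
    path: k8s/envs/dev
    directory:
      recurse: true         # pick up manifests in every sub-directory
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd       # child Application objects live here
  syncPolicy:
    automated:
      prune: true           # assumed; removes Applications deleted from git
      selfHeal: true
```

Comparing the output of argocd app list with the directory tree is a quick way to spot an application.yaml that Argo did not pick up.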