Observability Manual Setup

One-time, human-only setup needed before the observability stack can run end-to-end. Work through this list in order: Terraform, ESO, and the Worker runtime each fail when a secret they depend on is missing.

Why this page exists

Several of these accounts (Sentry, Cloudflare, Discord) are managed outside the repo and outside Terraform. If their setup isn't written down, the system breaks silently when a teammate rotates a key or someone new tries to reproduce the setup. Treat this page as the single checklist for "what humans need to click."

Checklist at a glance

Sentry — org/project, DSN, auth token, Discord integration
Cloudflare — API token (new scope), record account ID + zone ID
Cloudflare Tunnel — for the synthetic Worker → Loki HTTP push path
Discord — alerts webhook URL
GCP Secret Manager — create 9 new secrets (values pasted from the steps above; scripts/setup-gcp-secrets.sh is the canonical creator)
Synthetic test user — backend seed migration writes the row; the password lives in GCP SM

The full step-by-step is below. Each section ends with the secret(s) you should have produced; the GCP Secret Manager section is where you push them in.

Step 1 — Sentry

For frontend crash reporting (@sentry/react-native) wired in tomoda-labs/tomoda PR #301.

Go to sentry.io and sign in (or sign up). Use a @tomoda.life Workspace account so org membership stays with the company, not a personal email.
Create an organization — current slug is tomoda-platforms-inc (hardcoded as the default in frontend/app.config.js and in the Grafana data source jsonData.orgSlug). Pick a different slug only if you're setting up an isolated org for a new environment; override it via SENTRY_ORG=<slug> at build time and update the Grafana data source's orgSlug to match.
Create a new project:
- Platform: React Native
- Project slug: tomoda-frontend (or override with SENTRY_PROJECT=<slug>)
- Default alert rules: skip (we create our own pointing at #alerts-frontend).
Copy the DSN from project Settings → Client Keys (DSN). Looks like https://<key>@<id>.ingest.sentry.io/<project>.
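Before you paste the DSN anywhere, a quick shape check catches copy errors such as a truncated key or a stray trailing space. The sketch below assumes the `https://<key>@<host>/<numeric project id>` form shown above. `is_sentry_dsn` is a hypothetical helper, not part of the repo, and it checks shape only. It does not contact Sentry.

```shell
# Requires bash (uses [[ =~ ]]). Hypothetical helper.
# is_sentry_dsn VALUE: succeed if VALUE looks like
#   https://<key>@<host>/<numeric project id>
# This is a shape check only; it does not validate the key with Sentry.
is_sentry_dsn() {
  local dsn="$1"
  local re='^https://[A-Za-z0-9]+@[A-Za-z0-9.-]+/[0-9]+$'
  [[ $dsn =~ $re ]]
}
```

Usage: `is_sentry_dsn "$dsn" || echo "that doesn't look like a DSN"`.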

Create two Sentry auth tokens at Settings → Account → Auth Tokens (or org-level). Why two: source-map upload needs write scopes and rotates rarely; the in-cluster Grafana data source (grafana-sentry-datasource plugin) needs only read scopes and may rotate independently. Least-privilege + decoupled cadence.

#	Name	Scopes	Used by
1	`sourcemap-upload`	`project:releases`, `org:read`	`frontend/scripts/build-with-secrets.sh` at native + web release-build time
2	`grafana-readonly`	`event:read`, `org:read`, `project:read`, `member:read`	Grafana's `grafana-sentry-datasource` plugin in the `monitoring` namespace

Copy each value as you create it, because Sentry shows each one only once. Both go into GCP SM in Step 5.

Install the Discord integration. Sentry routes frontend crash alerts directly to the #alerts-frontend channel via OAuth. This route needs no webhook URL and no GCP SM secret, because Sentry's servers post through the installed bot.
- Project Settings → Integrations → Discord → Install.
- Authorize on Discord, pick the Tomoda guild, then pick the #alerts-frontend channel. This is separate from #monitoring, which is reserved for Alertmanager + synthetic webhook traffic (see Step 4).
- Back in Sentry: Project → Alerts → Create Alert → Issue Alert → "When a new issue is created" → Action: send a Discord notification to #alerts-frontend. Sentry groups events into issues server-side, so one bug producing thousands of events still produces one Discord message. No further throttling needed.
Decide retention + sampling in Project Settings → Performance/Replay. Defaults are fine for v1.

Produces three values that go into GCP Secret Manager (in Step 5 below):

Value	GCP SM key	Used for
DSN	`tomoda-sentry-dsn`	Baked into the JS bundle at build time; tells the SDK where to send events
`sourcemap-upload` token	`tomoda-sentry-auth-token`	Used by the Sentry CLI during the native + web release build to upload source maps
`grafana-readonly` token	`tomoda-sentry-grafana-token`	Projected via ESO into `monitoring/sentry-grafana-credentials` so Grafana's Sentry data source can query frontend metrics — see Sentry data source

GCP Secret Manager is the source of truth. Native release builds pull the DSN + upload token via frontend/scripts/build-with-secrets.sh using your gcloud auth login session. The Grafana token is read by ESO at sync time (every hour) and projected into the K8s Secret the Grafana pod mounts as SENTRY_AUTH_TOKEN. No values touch disk in either path.
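The "no values touch disk" property comes from reading each secret straight into a process environment variable instead of a file. The sketch below shows that pattern and assumes an active `gcloud auth login` session. The helper names are hypothetical, and this is not the contents of `frontend/scripts/build-with-secrets.sh`. The variable names in the usage line are also illustrative.

```shell
# Requires bash. Hypothetical helpers illustrating the pattern only.
GCP_PROJECT="${GCP_PROJECT:-development-485000}"

# fetch_secret NAME: print the latest version of a Secret Manager secret.
fetch_secret() {
  gcloud secrets versions access latest --secret="$1" --project="$GCP_PROJECT"
}

# export_secret VAR NAME: export VAR with the value of secret NAME.
# The value lives only in this shell's environment; nothing is written to disk.
export_secret() {
  local var="$1" name="$2" value
  value="$(fetch_secret "$name")" || return 1
  export "$var=$value"
}
```

Usage: `export_secret SENTRY_AUTH_TOKEN tomoda-sentry-auth-token`, then run the build in the same shell. Child processes inherit the variable, and it disappears when the shell exits.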

Org slug (tomoda-platforms-inc) and project slug (tomoda-frontend) are hardcoded as defaults in frontend/app.config.js, and the org slug also appears in the Grafana data source jsonData.orgSlug. They're configuration, not secrets.

Step 2 — Cloudflare API token (expanded scope)

The existing external-dns-cloudflare-secret in-cluster is for external-dns and stays untouched. We're creating a new token that Terraform uses for DNS, Workers, and Tunnel management.

Sign in to the Cloudflare dashboard.
Profile (top right) → API Tokens → Create Token → Custom Token.
Permissions (add all four rows):

Resource type	Resource	Permission
Zone	DNS	Edit
Account	Workers Scripts	Edit
Account	Cloudflare Tunnel	Edit
Account	Account Settings	Read
Zone resources: Include → Specific zone → tomoda.life (do NOT pick "all zones").
Account resources: Include → Specific account → your Tomoda account.
TTL: optional. Leave blank (never expires) or set to a year and put a rotation reminder on the calendar.
Create token → copy the value (shown once).
While in the dashboard, record:
- Cloudflare Account ID (visible on the right sidebar of any zone overview).
- tomoda.life Zone ID (visible on the tomoda.life zone overview page).

Produces:

Cloudflare API token → goes into GCP SM as tomoda-cloudflare-api-token in Step 5.
Account ID + Zone ID → not secret; will be put in Terraform variables (we'll commit those as part of the synthetic / tunnel PR).
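Before storing the token, it's worth confirming that Cloudflare considers it active. The sketch below uses Cloudflare's token-verify endpoint (`GET /client/v4/user/tokens/verify`), which applies to user-owned tokens like the one created above. The helper name is hypothetical. It also only greps the response for `"status":"active"` rather than parsing the JSON, which assumes the API returns compact JSON. The token is read from stdin so it stays out of shell history.

```shell
# Requires bash and curl. Hypothetical helper.
# verify_cf_token: read a Cloudflare API token from stdin and ask Cloudflare
# whether it is valid. Succeeds only if the response reports status "active".
verify_cf_token() {
  local token response
  IFS= read -r token || [[ -n $token ]] || return 1
  response="$(curl -fsS \
    -H "Authorization: Bearer ${token}" \
    https://api.cloudflare.com/client/v4/user/tokens/verify)" || return 1
  grep -q '"status":"active"' <<<"$response"
}
```

Usage, once the token is in GCP SM: `gcloud secrets versions access latest --secret=tomoda-cloudflare-api-token | verify_cf_token && echo "token OK"`.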

Step 3 — Cloudflare Tunnel (for Worker → Loki push)

The synthetic Worker pushes probe results to Loki. Loki is in-cluster-only, so we expose a /loki/api/v1/push endpoint through a Cloudflare Tunnel protected by Cloudflare Access (service token auth, no humans involved).

Cloudflare dashboard → Zero Trust → Networks → Tunnels → Create a tunnel.
Connector type: Cloudflared (default).
Name: tomoda-prod-tunnel.
Copy the tunnel token (a long base64-ish string) — shown once. This is what cloudflared running in-cluster uses to authenticate to the tunnel.
Skip the "install connector" step — we'll run it as a Kubernetes Deployment via Argo CD; no need to install on a VM.
Public hostname tab → Add a public hostname:
- Subdomain: loki-push
- Domain: tomoda.life
- Service: HTTP → loki.monitoring.svc.cluster.local:3100
- Save.
Cloudflare Access (still in Zero Trust): Applications → Add an application → Self-hosted.
- Application name: loki-push
- Subdomain/domain: loki-push.tomoda.life
- Identity providers: leave default (we won't use SSO; service-token-only).
- Add a policy: name synthetic-worker, action Service Auth, include Service token → we'll generate the token next.
Zero Trust → Access → Service Auth → Create Service Token.
- Name: synthetic-worker
- Duration: never expire (or 1 year if you prefer + put a rotation reminder).
- Copy Client ID and Client Secret — shown once.

Produces three secrets:

Tunnel token → GCP SM as tomoda-cloudflare-tunnel-token
Access Client ID → GCP SM as tomoda-cloudflare-access-client-id
Access Client Secret → GCP SM as tomoda-cloudflare-access-client-secret
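Once the tunnel, the Access application, and the service token exist, an end-to-end smoke test is a single Loki push that carries the two Access headers (`CF-Access-Client-Id`, `CF-Access-Client-Secret`). The sketch below assumes the `loki-push.tomoda.life` hostname configured above and Loki's JSON push format. The helper names and the `job` label value in the usage line are hypothetical. Loki normally answers a successful push with HTTP 204 and an empty body.

```shell
# Requires bash and curl. Hypothetical helpers for a one-off smoke test.
LOKI_PUSH_URL="${LOKI_PUSH_URL:-https://loki-push.tomoda.life/loki/api/v1/push}"

# loki_payload JOB LINE TS_NS: print a one-line Loki push body.
# Loki expects the timestamp as a string of Unix epoch nanoseconds.
# LINE is not JSON-escaped, so keep it free of quotes and backslashes.
loki_payload() {
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}' "$1" "$3" "$2"
}

# loki_push BODY: POST BODY through the tunnel with the Access service token.
loki_push() {
  curl -fsS -X POST "$LOKI_PUSH_URL" \
    -H "Content-Type: application/json" \
    -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID:?set CF_ACCESS_CLIENT_ID}" \
    -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET:?set CF_ACCESS_CLIENT_SECRET}" \
    --data "$1"
}
```

Usage: `loki_push "$(loki_payload manual-smoke-test hello "$(( $(date +%s) * 1000000000 ))")"`, then look for `{job="manual-smoke-test"}` in Grafana Explore. Building the nanosecond timestamp from `date +%s` avoids `%N`, which BSD/macOS `date` doesn't support.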

Step 4 — Discord alerts webhook

The webhook URL is shared by Alertmanager and the synthetic Worker. Treat it as one shared "primary alerts pipe." Sentry doesn't use it: Sentry posts to #alerts-frontend through its Discord OAuth integration (Step 1).

In Discord, create the #monitoring channel if it doesn't exist (reserved for Alertmanager + Cloudflare synthetic webhook traffic — separate from #alerts-frontend which Sentry posts to via OAuth).
Channel name → Edit Channel → Integrations → Webhooks → New Webhook.
Name: Tomoda Alerts Primary. Avatar: optional.
Copy Webhook URL — this is the value you'll store. Format: https://discord.com/api/webhooks/<id>/<token>.

Produces:

Discord webhook URL → GCP SM as tomoda-alert-webhook-primary (provider-portable name; the value can later be a Slack/PagerDuty/Opsgenie URL without renaming the secret).
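A quick way to confirm the URL works before storing it is to post one test message. Discord webhooks accept a JSON body with a `content` field and, by default, reply with an empty 204. The sketch below uses hypothetical helper names. Its format check mirrors the URL shape given above and assumes the `<id>` segment is numeric.

```shell
# Requires bash and curl. Hypothetical helpers.
# is_discord_webhook URL: shape check for https://discord.com/api/webhooks/<id>/<token>.
is_discord_webhook() {
  local re='^https://discord\.com/api/webhooks/[0-9]+/[A-Za-z0-9_-]+$'
  [[ $1 =~ $re ]]
}

# post_test_message URL: send one test message to the webhook's channel.
post_test_message() {
  is_discord_webhook "$1" || { echo "not a Discord webhook URL" >&2; return 1; }
  curl -fsS -X POST "$1" \
    -H "Content-Type: application/json" \
    --data '{"content":"Tomoda Alerts Primary: test message from manual setup"}'
}
```

The message should appear in #monitoring. If it lands somewhere else, the webhook was created on the wrong channel.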

Why generic name

Naming the secret tomoda-alert-webhook-primary instead of tomoda-discord-webhook lets us swap providers (Slack, PagerDuty, Opsgenie) by changing the value only — no Terraform / manifest renames cascade through the cluster.

If you want a low-priority channel later, add a separate tomoda-alert-webhook-secondary secret pointing at the low-priority channel. Not needed for v1.

Step 5 — GCP Secret Manager: create the 9 new secrets

The canonical creator is scripts/setup-gcp-secrets.sh, the same script that bootstraps the rest of the app's secrets. Pass the new values via env vars and the script creates each secret idempotently (or adds a new version to one that already exists). The tomoda-synthetic-probe-password auto-generates if you leave it unset.

```shell
gcloud auth login
gcloud config set project development-485000

# Pass the values gathered from Steps 1-4. Anything you leave out either
# stays at its current value (if already set) or remains empty (the script
# prints "Empty:" so you know what's outstanding).
TOMODA_SENTRY_DSN="<from Step 1>" \
TOMODA_SENTRY_AUTH_TOKEN="<from Step 1, sourcemap-upload token>" \
TOMODA_SENTRY_GRAFANA_TOKEN="<from Step 1, grafana-readonly token>" \
TOMODA_CLOUDFLARE_API_TOKEN="<from Step 2>" \
TOMODA_CLOUDFLARE_TUNNEL_TOKEN="<from Step 3>" \
TOMODA_CLOUDFLARE_ACCESS_CLIENT_ID="<from Step 3>" \
TOMODA_CLOUDFLARE_ACCESS_CLIENT_SECRET="<from Step 3>" \
TOMODA_ALERT_WEBHOOK_PRIMARY="<from Step 4>" \
./scripts/setup-gcp-secrets.sh
# tomoda-synthetic-probe-password auto-generates via openssl rand -base64 32.

# Verify all nine
gcloud secrets list --filter="name~tomoda-(sentry|cloudflare|alert|synthetic)"
```

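If you'd rather skip the script and create each secret by hand (one at a time, no env vars), each command below reads its value from stdin (`--data-file=-`), so the value never appears in gcloud's process arguments. Note that, as with the env-var invocation above, a literal value typed on the command line still lands in your shell history. To avoid that, start each line with a space (bash skips such lines when `HISTCONTROL` includes `ignorespace`), or read the value into a variable with `read -rs` first.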

```shell
echo -n "<sentry-dsn>"                | gcloud secrets create tomoda-sentry-dsn                       --replication-policy=automatic --data-file=-
echo -n "<sentry-auth-token>"         | gcloud secrets create tomoda-sentry-auth-token                --replication-policy=automatic --data-file=-
echo -n "<sentry-grafana-token>"      | gcloud secrets create tomoda-sentry-grafana-token             --replication-policy=automatic --data-file=-
echo -n "<cloudflare-api-token>"      | gcloud secrets create tomoda-cloudflare-api-token             --replication-policy=automatic --data-file=-
echo -n "<discord-webhook-url>"       | gcloud secrets create tomoda-alert-webhook-primary            --replication-policy=automatic --data-file=-
echo -n "<cloudflare-tunnel-token>"   | gcloud secrets create tomoda-cloudflare-tunnel-token          --replication-policy=automatic --data-file=-
echo -n "<cf-access-client-id>"       | gcloud secrets create tomoda-cloudflare-access-client-id      --replication-policy=automatic --data-file=-
echo -n "<cf-access-client-secret>"   | gcloud secrets create tomoda-cloudflare-access-client-secret  --replication-policy=automatic --data-file=-
echo -n "$(openssl rand -base64 32)"  | gcloud secrets create tomoda-synthetic-probe-password         --replication-policy=automatic --data-file=-
```
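For the by-hand route, a small wrapper can prompt for the value with echo disabled and pipe it to gcloud. That keeps the value out of both shell history and the command's arguments. The sketch below uses a hypothetical helper, `put_secret`. It creates the secret if it doesn't exist yet and otherwise adds a new version, which is how Secret Manager represents an updated value.

```shell
# Requires bash. Hypothetical helper.
GCP_PROJECT="${GCP_PROJECT:-development-485000}"

# put_secret NAME: prompt for a value (not echoed) and store it in Secret Manager.
# Creates the secret on first use; afterwards adds a new version.
put_secret() {
  local name="$1" value
  IFS= read -rs -p "Value for ${name}: " value || [[ -n $value ]] || return 1
  echo >&2  # move past the silent prompt
  if gcloud secrets describe "$name" --project="$GCP_PROJECT" >/dev/null 2>&1; then
    printf '%s' "$value" | gcloud secrets versions add "$name" \
      --data-file=- --project="$GCP_PROJECT"
  else
    printf '%s' "$value" | gcloud secrets create "$name" \
      --replication-policy=automatic --data-file=- --project="$GCP_PROJECT"
  fi
}
```

Usage: run `put_secret tomoda-alert-webhook-primary` and paste the URL at the prompt. `printf '%s'` plays the same role as `echo -n` above, with no trailing newline in the stored value.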

After the secrets exist, the per-area Terraform / K8s code (committed in the corresponding PRs) wires them up via ESO into the right namespaces — you don't need to touch any namespace bindings manually.

Step 6 — Synthetic test user (backend seed)

A dedicated user account exists in the database solely so the synthetic Worker can exercise the real POST /api/v1/auth/login endpoint end-to-end.

The password lives in GCP SM as tomoda-synthetic-probe-password, the ninth secret in Step 5 (the setup script generates it if you don't supply your own). A backend migration writes the hashed value to the DB on next deploy. The Worker reads the cleartext password from a Worker secret binding, sourced from the GCP SM secret at Terraform apply time.

Pick an email address you control. Recommend synthetic+probe@tomoda.life (the +probe lets you filter it out in any analytics later). The email must accept mail (deliverability is verified in tests), but the account never logs in interactively.
Generate a strong password and stash it in your password manager:
```
openssl rand -base64 32
```
Store it in GCP SM as tomoda-synthetic-probe-password (already covered in Step 5).
On next deploy, the migration in backend/internal/database/migrations/seed_synthetic_user.go (added in the synthetics PR) reads the password from the env (via ESO) and either inserts the user or updates the password hash to match. Idempotent — safe to run repeatedly.
MFA stance: the synthetic account is automatically flagged account_type = 'synthetic' in the DB and is excluded from MFA enforcement (the migration adds this filter to the MFA-required check). No human action needed.
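To check the seeded account by hand, you can send the same kind of request the synthetic Worker makes. This is a sketch only, and several parts are assumptions: the API host (`https://api.tomoda.life`) and the JSON field names (`email`, `password`) are not taken from the backend code, and `login_body` / `probe_login` are hypothetical helpers. Like a real probe, this writes a login-history row.

```shell
# Requires bash and curl. API_BASE and the JSON field names are ASSUMPTIONS.
API_BASE="${API_BASE:-https://api.tomoda.life}"

# login_body EMAIL PASSWORD: JSON body for the login request.
# Safe for base64 passwords (no quotes or backslashes); not a general JSON encoder.
login_body() {
  printf '{"email":"%s","password":"%s"}' "$1" "$2"
}

# probe_login EMAIL: read the password from stdin and POST to the login endpoint.
# Prints only the HTTP status code, so no session token reaches the terminal.
probe_login() {
  local email="$1" password
  IFS= read -r password || [[ -n $password ]] || return 1
  curl -sS -o /dev/null -w '%{http_code}\n' -X POST "${API_BASE}/api/v1/auth/login" \
    -H "Content-Type: application/json" \
    --data "$(login_body "$email" "$password")"
}
```

Usage: `gcloud secrets versions access latest --secret=tomoda-synthetic-probe-password | probe_login synthetic+probe@tomoda.life`. After the seed migration has run, expect a 2xx.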

Don't reuse a real user account

Tempting to use a personal account — don't. The synthetic account writes login history rows on every probe (one per minute), would pollute that user's session list, and would inflate "active users" metrics. The dedicated account also lets us suppress it from analytics, leaderboards, and anti-abuse heuristics in one place.

Alert routing & throttling

Two Discord channels, one per source category. Each source has its own throttling story — the channels themselves don't enforce anything.

Channel	Source	Mechanism	Throttling owner
`#alerts-frontend`	Sentry — every new React Native issue	Sentry Discord OAuth integration (no webhook URL)	Sentry server-side dedup: events → issues, one Discord msg per new issue, no further config
`#monitoring`	Alertmanager (Prometheus rules — cluster, DB, API health) + Cloudflare synthetic Worker (probe failures)	Webhook from `tomoda-alert-webhook-primary`	Alertmanager grouping + inhibition + Worker state-change-only notifications (details below)

Sentry → `#alerts-frontend`

Defaults: the alert rule fires on "A new issue is created" only. Sentry's server-side fingerprinting collapses thousands of events into a single issue, so one bug produces one Discord message regardless of event volume. No tuning needed for v1.

If #alerts-frontend ever feels noisy later, the right knob is Settings → Notifications → Per-issue rate limit: at most once per N hours at the workspace level. Don't add rate-limiting at the alert-rule level — that drops new issues, not duplicates of existing ones.

Alertmanager → `#monitoring` (implemented in Phase 5)

Standard throttling profile that the Phase 5 PR will ship:

Setting	Effect
`group_by: [alertname, cluster, service]`	Related alerts batched into one Discord message
`group_wait: 30s`	Wait 30s after first alert before firing; gathers correlated alerts
`group_interval: 5m`	Within a group, batched update at most every 5 min
`repeat_interval: 4h` (warning severity)	Don't re-fire the same alert for 4h
`repeat_interval: 1h` (critical severity)	Critical re-fires every 1h, not 4h
Inhibition: `ClusterDown` → all warnings for that cluster	When cluster is down, suppress 50× "service unreachable" cascade

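Outage budget: a group only sends a new message when its set of alerts changes (at most once every 5 minutes) or when its repeat interval elapses, which is hourly for critical alerts. A 6-hour outage therefore produces roughly 12-15 messages (the initial alert, a handful of group updates while the incident evolves, and the hourly critical reminders) instead of thousands.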
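The arithmetic behind that budget can be checked with a simplified model of a single Alertmanager group whose alert set never changes during the outage. In this model, the first notification goes out after group_wait, the group is re-evaluated every group_interval, and a repeat is sent only once repeat_interval has passed since the last send. This is a sketch, not Alertmanager's scheduler. It ignores alerts joining or leaving the group (which add messages) and assumes one resolved notification at the end. `estimate_am_messages` is a hypothetical helper.

```shell
# Requires bash. Hypothetical helper; all durations in seconds.
# estimate_am_messages OUTAGE GROUP_WAIT GROUP_INTERVAL REPEAT_INTERVAL
estimate_am_messages() {
  local outage="$1" group_wait="$2" interval="$3" repeat="$4"
  local t="$group_wait" last="$group_wait" count=1  # first notification
  # Walk the group_interval ticks that fall inside the outage.
  while (( t + interval < outage )); do
    t=$(( t + interval ))
    if (( t - last >= repeat )); then  # repeat_interval elapsed: re-send
      count=$(( count + 1 ))
      last=$t
    fi
  done
  echo $(( count + 1 ))  # + the resolved notification
}
```

`estimate_am_messages 21600 30 300 3600` prints 7 for a 6-hour critical outage (initial + 5 hourly repeats + resolved). With the 4h warning cadence (`14400`), it prints 3. Anything above that floor comes from alerts joining or leaving the group.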

Cloudflare synthetic Worker → `#monitoring` (implemented in the synthetics PR)

State-change-only notifications, stored in Cloudflare Workers KV:

Probe goes healthy → failed: one Discord message with the failure detail.
Probe stays failed → failed: silent.
Probe goes failed → healthy: one "recovered" Discord message with downtime duration.
Still-down reminder: every 1 hour while a probe remains failed (so a 6h outage produces "DOWN" + 5× "STILL DOWN" + "RECOVERED" = 7 messages per probe, not 360).

Per-probe override possible if some endpoints are flappier than others (e.g., raise reminder cadence to 4h for a probe that's known to false-positive).
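The state-change rules above can be modeled in a few lines. The real logic runs inside the Worker, with state in Workers KV. This shell version is only a sketch for reasoning about message counts. The function names are hypothetical, time is measured in minutes, and the model assumes one probe per minute as described above.

```shell
# Requires bash. Hypothetical model of the Worker's notification rules.
# next_notification PREV CURR NOW LAST_NOTIFIED REMINDER (times in minutes)
# Prints DOWN, RECOVERED, STILL_DOWN, or nothing.
next_notification() {
  local prev="$1" curr="$2" now="$3" last="$4" reminder="$5"
  if [[ $prev == healthy && $curr == failed ]]; then
    echo DOWN
  elif [[ $prev == failed && $curr == healthy ]]; then
    echo RECOVERED
  elif [[ $curr == failed ]] && (( now - last >= reminder )); then
    echo STILL_DOWN
  fi
}

# simulate_outage MINUTES REMINDER: probe fails for MINUTES one-minute probes,
# then recovers; prints how many Discord messages would be sent.
simulate_outage() {
  local minutes="$1" reminder="$2"
  local prev=healthy state msg last=0 count=0 now
  for (( now = 0; now <= minutes; now++ )); do
    if (( now < minutes )); then state=failed; else state=healthy; fi
    msg="$(next_notification "$prev" "$state" "$now" "$last" "$reminder")"
    if [[ -n $msg ]]; then
      count=$(( count + 1 ))
      last=$now
    fi
    prev=$state
  done
  echo "$count"
}
```

`simulate_outage 360 60` prints 7 (DOWN, five STILL_DOWN reminders, RECOVERED), matching the 6-hour example. `simulate_outage 360 240` prints 3, which shows the effect of a 4h per-probe override.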

Verification

After you've completed all six steps and confirmed the secrets exist:

```shell
# Confirm all 9 new secrets are present in GCP SM
gcloud secrets list --project=development-485000 \
  --filter="name~tomoda-(sentry|cloudflare|alert|synthetic)" \
  --format="table(name,createTime)"

# Should show:
#   tomoda-sentry-dsn
#   tomoda-sentry-auth-token
#   tomoda-sentry-grafana-token
#   tomoda-cloudflare-api-token
#   tomoda-alert-webhook-primary
#   tomoda-cloudflare-tunnel-token
#   tomoda-cloudflare-access-client-id
#   tomoda-cloudflare-access-client-secret
#   tomoda-synthetic-probe-password
```
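`gcloud secrets list` shows only that the secret resource exists, and a secret created without a value has no versions. The sketch below checks each of the nine for a `latest` version without printing any value. `check_observability_secrets` is a hypothetical helper, and its list is copied from the expected output above.

```shell
# Requires bash. Hypothetical helper.
REQUIRED_SECRETS=(
  tomoda-sentry-dsn
  tomoda-sentry-auth-token
  tomoda-sentry-grafana-token
  tomoda-cloudflare-api-token
  tomoda-alert-webhook-primary
  tomoda-cloudflare-tunnel-token
  tomoda-cloudflare-access-client-id
  tomoda-cloudflare-access-client-secret
  tomoda-synthetic-probe-password
)

# check_observability_secrets [PROJECT]: report each secret and return the
# number with no "latest" version (0 means all nine have a value).
check_observability_secrets() {
  local project="${1:-development-485000}" missing=0 name
  for name in "${REQUIRED_SECRETS[@]}"; do
    if gcloud secrets versions describe latest --secret="$name" \
        --project="$project" >/dev/null 2>&1; then
      echo "ok       $name"
    else
      echo "MISSING  $name"
      missing=$(( missing + 1 ))
    fi
  done
  return "$missing"
}
```

A zero exit status means every secret has a value. Any `MISSING` line points back to the step that produces that secret.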

Confirm to the team that manual setup is complete, then the code PRs that depend on these values (Alertmanager Discord routing, synthetic Worker, Loki tunnel) can land.

Rotation

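These secrets rotate independently. When you rotate one, only the consumers that read it need a restart or redeploy. ESO re-syncs the projected Kubernetes Secrets every hour, and a pod that reads a secret through an environment variable only sees the new value after it restarts. Once the Secret has refreshed, run `kubectl rollout restart` if you need to verify immediately.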

Secret	When to rotate	Consumers to restart
`tomoda-cloudflare-api-token`	If leaked, or annually	Re-run `terraform plan/apply` — Cloudflare provider uses the new value automatically
`tomoda-alert-webhook-primary`	If leaked, or if changing alert provider	Restart `alertmanager`; trigger Worker redeploy so its secret binding refreshes
`tomoda-cloudflare-tunnel-token`	If leaked, or annually	Restart the `cloudflared` Deployment in `monitoring`
`tomoda-cloudflare-access-client-*`	If leaked	Worker redeploy (it embeds the value at deploy time)
`tomoda-synthetic-probe-password`	Quarterly + on team turnover	Re-run the seed migration (it updates the hash) + Worker redeploy
`tomoda-sentry-dsn`	Rarely — when rotating the Sentry project DSN	Next native release build picks up the new value (it's pulled by `build-with-secrets.sh` at build time)
`tomoda-sentry-auth-token`	Annually, or if leaked	Same — next native release build pulls the new value
`tomoda-sentry-grafana-token`	Annually, or if leaked	ESO syncs within 1h; `kubectl rollout restart deploy/monitoring-grafana -n monitoring` to verify immediately
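To script a rotation runbook, you can transcribe the table above into a lookup. The sketch below uses a hypothetical helper, `restart_hint`, that prints the follow-up action for a given secret name. The Alertmanager and Worker steps stay as prose because this page doesn't specify their exact restart or redeploy commands.

```shell
# Requires bash. Hypothetical helper transcribing the rotation table.
# restart_hint SECRET_NAME: print what to do after rotating SECRET_NAME.
restart_hint() {
  case "$1" in
    tomoda-cloudflare-api-token)
      echo "re-run terraform plan/apply (the Cloudflare provider picks up the new value)" ;;
    tomoda-alert-webhook-primary)
      echo "restart alertmanager; redeploy the synthetic Worker" ;;
    tomoda-cloudflare-tunnel-token)
      echo "kubectl rollout restart deployment/cloudflared -n monitoring" ;;
    tomoda-cloudflare-access-client-*)
      echo "redeploy the synthetic Worker" ;;
    tomoda-synthetic-probe-password)
      echo "re-run the seed migration; redeploy the synthetic Worker" ;;
    tomoda-sentry-dsn|tomoda-sentry-auth-token)
      echo "cut a new native release build (build-with-secrets.sh pulls the new value)" ;;
    tomoda-sentry-grafana-token)
      echo "kubectl rollout restart deploy/monitoring-grafana -n monitoring" ;;
    *)
      echo "unknown secret: $1" >&2
      return 1 ;;
  esac
}
```

Usage: `restart_hint tomoda-cloudflare-access-client-secret`. Unknown names fail loudly rather than silently skipping a consumer.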

Secrets management — overall secret-store architecture, ESO bridging, env-var reference
Tempo, Prometheus, Loki — what these secrets feed into
