
Observability Manual Setup

One-time, human-only setup needed before the observability stack can run end-to-end. Work through this list in order — code that depends on these values is rejected by Terraform / ESO / the Worker runtime if a secret is missing.

Why this page exists

A lot of these accounts (Sentry, Cloudflare, Discord) are managed outside the repo and outside Terraform. If they aren't recorded, the system breaks silently when a teammate rotates a key or someone new tries to repro the setup. Treat this page as the single checklist for "what humans need to click."

Checklist at a glance

  • Sentry — org/project, DSN, auth token, Discord integration
  • Cloudflare — API token (new scope), record account ID + zone ID
  • Cloudflare Tunnel — for the synthetic Worker → Loki HTTP push path
  • Discord — alerts webhook URL
  • GCP Secret Manager — create 9 new secrets (values pasted from the steps above; scripts/setup-gcp-secrets.sh is the canonical creator)
  • Synthetic test user — backend seed migration writes the row; the password lives in GCP SM

The full step-by-step is below. Each section ends with the secret(s) you should have produced; the GCP Secret Manager section is where you push them in.


Step 1 — Sentry

For frontend crash reporting (@sentry/react-native) wired in tomoda-labs/tomoda PR #301.

  1. Go to sentry.io and sign in (or sign up). Use a @tomoda.life Workspace account so org membership stays with the company, not a personal email.
  2. Create an organization — current slug is tomoda-platforms-inc (hardcoded as the default in frontend/app.config.js and in the Grafana data source jsonData.orgSlug). Pick a different slug only if you're setting up an isolated org for a new environment; override it via SENTRY_ORG=<slug> at build time and update the Grafana data source's orgSlug to match.
  3. Create a new project:
    • Platform: React Native
    • Project slug: tomoda-frontend (or override with SENTRY_PROJECT=<slug>)
    • Default alert rules: skip (we create our own pointing at #alerts-frontend).
  4. Copy the DSN from project Settings → Client Keys (DSN). Looks like https://<key>@<id>.ingest.sentry.io/<project>.
  5. Create two Sentry auth tokens at Settings → Account → Auth Tokens (or org-level). Why two: source-map upload needs write scopes and rotates rarely; the in-cluster Grafana data source (grafana-sentry-datasource plugin) needs only read scopes and may rotate independently. Least-privilege + decoupled cadence.

    # | Name | Scopes | Used by
    1 | sourcemap-upload | project:releases, org:read | frontend/scripts/build-with-secrets.sh at native + web release-build time
    2 | grafana-readonly | event:read, org:read, project:read, member:read | Grafana's grafana-sentry-datasource plugin in the monitoring namespace

    Copy each value as you create it; Sentry shows each only once. Both go into GCP SM in Step 5.
  6. Install the Discord integration. Sentry routes frontend crash alerts directly to the #alerts-frontend channel via OAuth, so there's no webhook URL and no GCP SM secret needed for this route. Sentry's servers post via the installed bot.
    • Project Settings → Integrations → Discord → Install.
    • Authorize on Discord, pick the Tomoda guild, pick the #alerts-frontend channel (separate from #monitoring, which is reserved for Alertmanager + synthetic webhook traffic; see Step 4).
    • Back in Sentry: Project → Alerts → Create Alert → Issue Alert → "When a new issue is created" → Action: send a Discord notification to #alerts-frontend. Sentry groups events into issues server-side, so one bug producing thousands of events still produces one Discord message. No further throttling needed.
  7. Decide retention + sampling in Project Settings → Performance/Replay. Defaults are fine for v1.

Produces three values that go into GCP Secret Manager (in Step 5 below):

Value | GCP SM key | Used for
DSN | tomoda-sentry-dsn | Baked into the JS bundle at build time; tells the SDK where to send events
sourcemap-upload token | tomoda-sentry-auth-token | Used by the Sentry CLI during the native + web release build to upload source maps
grafana-readonly token | tomoda-sentry-grafana-token | Projected via ESO into monitoring/sentry-grafana-credentials so Grafana's Sentry data source can query frontend metrics (see Sentry data source)

GCP Secret Manager is the source of truth. Native release builds pull the DSN + upload token via frontend/scripts/build-with-secrets.sh using your gcloud auth login session. The Grafana token is read by ESO at sync time (every hour) and projected into the K8s Secret the Grafana pod mounts as SENTRY_AUTH_TOKEN. No values touch disk in either path.
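
As a rough illustration of the "pull at build time, nothing on disk" pattern described above, a build step could read the DSN into an environment variable and sanity-check its shape before continuing. This is a sketch only, not the contents of frontend/scripts/build-with-secrets.sh; the function names, the SENTRY_DSN variable name, and the assumption that the DSN key is alphanumeric and the project ID numeric are all illustrative.

```shell
# Sketch only: NOT the contents of frontend/scripts/build-with-secrets.sh.

# Print the latest version of a GCP SM secret to stdout (no file is written).
fetch_secret() {
  gcloud secrets versions access latest --secret="$1"
}

# Return 0 if the argument has the DSN shape from Step 1:
#   https://<key>@<host under sentry.io>/<project>
# Assumption: <key> is alphanumeric and <project> is a numeric ID.
is_sentry_dsn() {
  printf '%s\n' "$1" |
    grep -Eq '^https://[A-Za-z0-9]+@[A-Za-z0-9.-]+\.sentry\.io/[0-9]+$'
}

# Load the DSN into the environment; stop if it is missing or malformed.
load_sentry_dsn() {
  SENTRY_DSN=$(fetch_secret tomoda-sentry-dsn) || return 1
  is_sentry_dsn "$SENTRY_DSN" || {
    echo "tomoda-sentry-dsn does not look like a Sentry DSN" >&2
    return 1
  }
  export SENTRY_DSN
}
```

Calling load_sentry_dsn before a release build fails fast when the secret was pasted with a stray newline or the wrong value, which is cheaper to catch here than in a shipped bundle.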

Org slug (tomoda-platforms-inc) and project slug (tomoda-frontend) are hardcoded as defaults in frontend/app.config.js and in the Grafana data source jsonData.orgSlug — they're configuration, not secrets.


Step 2 — Cloudflare API token (expanded scope)

The existing external-dns-cloudflare-secret in-cluster is for external-dns and stays untouched. We're creating a new token that Terraform uses to manage DNS, Workers, and the Tunnel.

  1. Sign in to the Cloudflare dashboard.
  2. Profile (top right) → API Tokens → Create Token → Custom Token.
  3. Permissions (add all four rows):

    Resource type | Resource | Permission
    Zone | DNS | Edit
    Account | Workers Scripts | Edit
    Account | Cloudflare Tunnel | Edit
    Account | Account Settings | Read
  4. Zone resources: Include → Specific zone → tomoda.life (do NOT pick "all zones").

  5. Account resources: Include → Specific account → your Tomoda account.
  6. TTL: optional. Leave blank (never expires) or set to a year and put a rotation reminder on the calendar.
  7. Create token → copy the value (shown once).
  8. While in the dashboard, record:
    • Cloudflare Account ID (visible on the right sidebar of any zone overview).
    • tomoda.life Zone ID (visible on the tomoda.life zone overview page).

Produces:

  • Cloudflare API token → goes into GCP SM as tomoda-cloudflare-api-token in Step 5.
  • Account ID + Zone ID → not secret; will be put in Terraform variables (we'll commit those as part of the synthetic / tunnel PR).
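
Before storing the token, it can be worth checking that it is active and that the recorded IDs were copied whole. The sketch below assumes the token is exported as CLOUDFLARE_API_TOKEN (a name chosen here for illustration) and that account and zone IDs appear as 32-character lowercase hex strings, as they do in the dashboard; the function names are hypothetical.

```shell
# Return 0 if the argument is a 32-character lowercase hex string, the shape
# the dashboard shows for account IDs and zone IDs.
is_cf_id() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{32}$'
}

# Ask Cloudflare whether the token in CLOUDFLARE_API_TOKEN is valid.
# Prints the JSON response; curl -f makes the function fail on HTTP errors.
# Note: the token is visible in the local process list while curl runs.
cf_verify_token() {
  curl -fsS \
    -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
    https://api.cloudflare.com/client/v4/user/tokens/verify
}
```

A truncated paste is the usual failure here, so `is_cf_id "$ZONE_ID" || echo "zone ID looks wrong"` catches it before the value reaches Terraform variables.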

Step 3 — Cloudflare Tunnel (for Worker → Loki push)

The synthetic Worker pushes probe results to Loki. Loki is in-cluster-only, so we expose a /loki/api/v1/push endpoint through a Cloudflare Tunnel protected by Cloudflare Access (service token auth, no humans involved).

  1. Cloudflare dashboard → Zero Trust → Networks → Tunnels → Create a tunnel.
  2. Connector type: Cloudflared (default).
  3. Name: tomoda-prod-tunnel.
  4. Copy the tunnel token (a long base64-ish string) — shown once. This is what cloudflared running in-cluster uses to authenticate to the tunnel.
  5. Skip the "install connector" step — we'll run it as a Kubernetes Deployment via Argo CD; no need to install on a VM.
  6. Public hostname tab → Add a public hostname:
    • Subdomain: loki-push
    • Domain: tomoda.life
    • Service: HTTP → loki.monitoring.svc.cluster.local:3100
    • Save.
  7. Cloudflare Access (still in Zero Trust): Applications → Add an application → Self-hosted.
    • Application name: loki-push
    • Subdomain/domain: loki-push.tomoda.life
    • Identity providers: leave default (we won't use SSO; service-token-only).
    • Add a policy: name synthetic-worker, action Service Auth, include Service Token → synthetic-worker (the token is generated in the next step, so finish this policy once it exists).
  8. Zero Trust → Access → Service Auth → Create Service Token.
    • Name: synthetic-worker
    • Duration: never expire (or 1 year if you prefer + put a rotation reminder).
    • Copy Client ID and Client Secret — shown once.

Produces three secrets:

  • Tunnel token → GCP SM as tomoda-cloudflare-tunnel-token
  • Access Client ID → GCP SM as tomoda-cloudflare-access-client-id
  • Access Client Secret → GCP SM as tomoda-cloudflare-access-client-secret
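
To smoke-test the tunnel and the Access policy once cloudflared is running, a push can be made by hand with the service token. The sketch below is an assumption-laden illustration, not the Worker's code: the job label, the function names, and the CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET variable names are made up here, and it assumes Loki accepts unauthenticated single-tenant pushes behind the tunnel. The CF-Access-Client-Id and CF-Access-Client-Secret headers are the ones Cloudflare Access expects for service tokens.

```shell
# Build a Loki push payload with one stream and one log line.
#   $1 = timestamp in nanoseconds since the epoch
#   $2 = log line
# The label job="synthetic-probe" is an illustrative assumption.
loki_payload() {
  # Escape backslashes and double quotes so the line stays valid JSON.
  # (Control characters and newlines are not handled in this sketch.)
  escaped=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"streams":[{"stream":{"job":"synthetic-probe"},"values":[["%s","%s"]]}]}' \
    "$1" "$escaped"
}

# Push one line through the tunnel, authenticating with the service token.
loki_push() {
  loki_payload "$(date +%s)000000000" "$1" |
    curl -fsS -X POST https://loki-push.tomoda.life/loki/api/v1/push \
      -H 'Content-Type: application/json' \
      -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
      -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}" \
      --data-binary @-
}
```

If `loki_push "manual smoke test"` succeeds with the headers and the same request without them is rejected by Access, both the tunnel route and the service-token policy are doing their jobs.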

Step 4 — Discord alerts webhook

The webhook URL is shared by Alertmanager and the synthetic Worker. Sentry does not use it: it posts to #alerts-frontend through its OAuth integration (Step 1). Treat the webhook as one shared "primary alerts pipe."

  1. In Discord, create the #monitoring channel if it doesn't exist (reserved for Alertmanager + Cloudflare synthetic webhook traffic — separate from #alerts-frontend which Sentry posts to via OAuth).
  2. Channel name → Edit Channel → Integrations → Webhooks → New Webhook.
  3. Name: Tomoda Alerts Primary. Avatar: optional.
  4. Copy Webhook URL — this is the value you'll store. Format: https://discord.com/api/webhooks/<id>/<token>.

Produces:

  • Discord webhook URL → GCP SM as tomoda-alert-webhook-primary (provider-portable name; the value can later be a Slack/PagerDuty/Opsgenie URL without renaming the secret).
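
A quick way to confirm the webhook works before storing it is to post a test message. The following is a minimal sketch with hypothetical function names; the shape check is Discord-specific, so it would need relaxing if the secret later holds a Slack, PagerDuty, or Opsgenie URL.

```shell
# Return 0 if the argument matches the webhook format from step 4:
#   https://discord.com/api/webhooks/<id>/<token>
is_discord_webhook_url() {
  printf '%s\n' "$1" |
    grep -Eq '^https://discord\.com/api/webhooks/[0-9]+/[A-Za-z0-9_-]+$'
}

# Post a short test message to the webhook passed as $1.
discord_test_message() {
  is_discord_webhook_url "$1" || {
    echo "not a Discord webhook URL" >&2
    return 1
  }
  curl -fsS -H 'Content-Type: application/json' \
    -d '{"content":"Test message from observability manual setup"}' \
    "$1"
}
```

The message should appear in #monitoring under the webhook's name (Tomoda Alerts Primary).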

Why generic name

Naming the secret tomoda-alert-webhook-primary instead of tomoda-discord-webhook lets us swap providers (Slack, PagerDuty, Opsgenie) by changing the value only — no Terraform / manifest renames cascade through the cluster.

If you want a low-priority channel later, add a separate tomoda-alert-webhook-secondary secret pointing at the low-priority channel. Not needed for v1.


Step 5 — GCP Secret Manager: create the 9 new secrets

The canonical creator is scripts/setup-gcp-secrets.sh — same script that bootstraps the rest of the app's secrets. Pass the new values via env vars and the script creates them (or updates the existing version) idempotently. The tomoda-synthetic-probe-password auto-generates if you leave it unset.

gcloud auth login
gcloud config set project development-485000

# Pass the values gathered from Steps 1-4. Anything you leave out either
# stays at its current value (if already set) or remains empty (the script
# prints "Empty:" so you know what's outstanding).
TOMODA_SENTRY_DSN="<from Step 1>" \
TOMODA_SENTRY_AUTH_TOKEN="<from Step 1, sourcemap-upload token>" \
TOMODA_SENTRY_GRAFANA_TOKEN="<from Step 1, grafana-readonly token>" \
TOMODA_CLOUDFLARE_API_TOKEN="<from Step 2>" \
TOMODA_CLOUDFLARE_TUNNEL_TOKEN="<from Step 3>" \
TOMODA_CLOUDFLARE_ACCESS_CLIENT_ID="<from Step 3>" \
TOMODA_CLOUDFLARE_ACCESS_CLIENT_SECRET="<from Step 3>" \
TOMODA_ALERT_WEBHOOK_PRIMARY="<from Step 4>" \
./scripts/setup-gcp-secrets.sh
# tomoda-synthetic-probe-password auto-generates via openssl rand -base64 32.

# Verify all nine
gcloud secrets list --filter="name~tomoda-(sentry|cloudflare|alert|synthetic)"

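If you'd rather skip the script and create each secret by hand (one at a time, no env vars), gcloud reads each value from stdin via --data-file=-, so no temporary file is needed. A literal value typed into the echo command is still recorded in shell history, so either delete the history entry afterwards or, in bash with HISTCONTROL=ignorespace (or ignoreboth), start each command with a space: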

echo -n "<sentry-dsn>"                | gcloud secrets create tomoda-sentry-dsn                       --replication-policy=automatic --data-file=-
echo -n "<sentry-auth-token>"         | gcloud secrets create tomoda-sentry-auth-token                --replication-policy=automatic --data-file=-
echo -n "<sentry-grafana-token>"      | gcloud secrets create tomoda-sentry-grafana-token             --replication-policy=automatic --data-file=-
echo -n "<cloudflare-api-token>"      | gcloud secrets create tomoda-cloudflare-api-token             --replication-policy=automatic --data-file=-
echo -n "<discord-webhook-url>"       | gcloud secrets create tomoda-alert-webhook-primary            --replication-policy=automatic --data-file=-
echo -n "<cloudflare-tunnel-token>"   | gcloud secrets create tomoda-cloudflare-tunnel-token          --replication-policy=automatic --data-file=-
echo -n "<cf-access-client-id>"       | gcloud secrets create tomoda-cloudflare-access-client-id      --replication-policy=automatic --data-file=-
echo -n "<cf-access-client-secret>"   | gcloud secrets create tomoda-cloudflare-access-client-secret  --replication-policy=automatic --data-file=-
echo -n "$(openssl rand -base64 32)"  | gcloud secrets create tomoda-synthetic-probe-password         --replication-policy=automatic --data-file=-
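
One way to keep values out of shell history entirely is to type them at a no-echo prompt and pipe them straight into gcloud. The helper below is a sketch with hypothetical function names, not part of scripts/setup-gcp-secrets.sh; it also adds a new version instead of failing when the secret already exists.

```shell
# Create the secret if it is missing, otherwise add a new version.
# The value is read from stdin, so it never appears in argv or history.
put_secret() {
  if gcloud secrets describe "$1" >/dev/null 2>&1 </dev/null; then
    gcloud secrets versions add "$1" --data-file=-
  else
    gcloud secrets create "$1" --replication-policy=automatic --data-file=-
  fi
}

# Prompt for the value with terminal echo off, then hand it to put_secret.
# If interrupted at the prompt, run `stty echo` to restore the terminal.
prompt_and_put_secret() {
  printf 'Value for %s: ' "$1" >&2
  stty -echo
  IFS= read -r secret_value
  stty echo
  printf '\n' >&2
  printf '%s' "$secret_value" | put_secret "$1"
  unset secret_value
}
```

Usage would be `prompt_and_put_secret tomoda-sentry-dsn`, repeated for each name, with `openssl rand -base64 32 | tr -d '\n' | put_secret tomoda-synthetic-probe-password` for the generated one.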

After the secrets exist, the per-area Terraform / K8s code (committed in the corresponding PRs) wires them up via ESO into the right namespaces — you don't need to touch any namespace bindings manually.


Step 6 — Synthetic test user (backend seed)

A dedicated user account exists in the database solely so the synthetic Worker can exercise the real POST /api/v1/auth/login endpoint end-to-end.

You decide the password now (or let the Step 5 script auto-generate it). It gets stored in GCP SM as tomoda-synthetic-probe-password (the ninth secret in Step 5), and a backend migration writes the hashed value to the DB on next deploy. The Worker reads the cleartext password from a Worker secret binding (sourced from the GCP SM secret at Terraform apply time).

  1. Pick an email address you control. Recommend synthetic+probe@tomoda.life (the +probe lets you filter it out in any analytics later). The email must accept mail (deliverability is verified in tests), but the account never logs in interactively.
  2. Generate a strong password and stash it in your password manager:

    openssl rand -base64 32
    
  3. Store it in GCP SM as tomoda-synthetic-probe-password (already covered in Step 5).

  4. On next deploy, the migration in backend/internal/database/migrations/seed_synthetic_user.go (added in the synthetics PR) reads the password from the env (via ESO) and either inserts the user or updates the password hash to match. Idempotent — safe to run repeatedly.
  5. MFA stance: the synthetic account is automatically flagged account_type = 'synthetic' in the DB and is excluded from MFA enforcement (the migration adds this filter to the MFA-required check). No human action needed.
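
After the next deploy, the seeded account can be checked by hand with the same request the Worker will make. This is a hedged sketch: the JSON field names email and password, the PROBE_EMAIL / PROBE_PASSWORD / API_BASE_URL variables, and the function names are assumptions for illustration, not the backend's documented contract.

```shell
# Build the JSON login body.
# Assumption: the endpoint takes "email" and "password" fields.
# A password from `openssl rand -base64 32` contains only A-Z, a-z, 0-9,
# '+', '/' and '=', so no JSON escaping is needed for it.
login_body() {
  printf '{"email":"%s","password":"%s"}' "$1" "$2"
}

# POST the login request and print only the HTTP status code.
# API_BASE_URL, PROBE_EMAIL and PROBE_PASSWORD are hypothetical variables.
probe_login() {
  login_body "$PROBE_EMAIL" "$PROBE_PASSWORD" |
    curl -sS -o /dev/null -w '%{http_code}' -X POST \
      -H 'Content-Type: application/json' \
      --data-binary @- \
      "${API_BASE_URL}/api/v1/auth/login"
}
```

A success status here, without an MFA challenge, suggests both the seed migration and the MFA exclusion took effect.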

Don't reuse a real user account

Tempting to use a personal account — don't. The synthetic account writes login history rows on every probe (one per minute), would pollute that user's session list, and would inflate "active users" metrics. The dedicated account also lets us suppress it from analytics, leaderboards, and anti-abuse heuristics in one place.


Alert routing & throttling

Two Discord channels, one per source category. Each source has its own throttling story — the channels themselves don't enforce anything.

Channel | Source | Mechanism | Throttling owner
#alerts-frontend | Sentry (every new React Native issue) | Sentry Discord OAuth integration (no webhook URL) | Sentry server-side dedup: events → issues, one Discord msg per new issue, no further config
#monitoring | Alertmanager (Prometheus rules: cluster, DB, API health) + Cloudflare synthetic Worker (probe failures) | Webhook from tomoda-alert-webhook-primary | Alertmanager grouping + inhibition + Worker state-change-only notifications (details below)

Sentry → #alerts-frontend

Defaults. The alert rule fires on "A new issue is created" only. Sentry's server-side fingerprinting collapses thousands of events into a single Issue, so a single bug ships one Discord message regardless of event volume. No tuning needed for v1.

If #alerts-frontend ever feels noisy later, the right knob is Settings → Notifications → Per-issue rate limit: at most once per N hours at the workspace level. Don't add rate-limiting at the alert-rule level — that drops new issues, not duplicates of existing ones.

Alertmanager → #monitoring (implemented in Phase 5)

Standard throttling profile that the Phase 5 PR will ship:

Setting | Default | Effect
group_by | [alertname, cluster, service] | Related alerts batched into one Discord message
group_wait | 30s | Wait 30s after first alert before firing; gathers correlated alerts
group_interval | 5m | Within a group, batched update at most every 5 min
repeat_interval | 4h | Don't re-fire the same alert for 4h (warning severity)
repeat_interval (critical severity) | 1h | Critical re-fires every 1h, not 4h
Inhibition | ClusterDown → all warnings for that cluster | When cluster is down, suppress 50× "service unreachable" cascade

Outage budget: roughly one Discord message every 5-15 minutes while an incident is still developing, plus hourly reminders for critical alerts. A 6-hour outage produces ~12-15 messages instead of thousands.
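
The throttling profile in the table above might translate into an Alertmanager routing fragment along these lines. It is a sketch of what the Phase 5 PR could contain, not the shipped config: the receiver name discord-monitoring and the severity label name are assumptions, and the receiver definition itself is omitted. The YAML is wrapped in a shell function only so it can be printed and diffed against the real file.

```shell
# Print a sketch of the Alertmanager route + inhibition settings.
alertmanager_route_sketch() {
  cat <<'YAML'
route:
  receiver: discord-monitoring        # hypothetical receiver name
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h                 # default; applies to warning severity
  routes:
    - matchers:
        - severity="critical"
      repeat_interval: 1h             # critical alerts re-fire hourly
inhibit_rules:
  - source_matchers:
      - alertname="ClusterDown"
    target_matchers:
      - severity="warning"
    equal: [cluster]                  # only mute warnings from the same cluster
YAML
}
```

The equal: [cluster] line is what scopes the inhibition to "all warnings for that cluster", so a ClusterDown in one cluster does not mute warnings from another.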

Cloudflare synthetic Worker → #monitoring (implemented in the synthetics PR)

State-change-only notifications, stored in Cloudflare Workers KV:

  • Probe goes healthy → failed: one Discord message with the failure detail.
  • Probe stays failed → failed: silent.
  • Probe goes failed → healthy: one "recovered" Discord message with downtime duration.
  • Still-down reminder: every 1 hour while a probe remains failed (so a 6h outage produces "DOWN" + 5× "STILL DOWN" + "RECOVERED" = 7 messages per probe, not 360).

A per-probe override is possible if some endpoints are flappier than others (e.g., lengthen the reminder interval to 4h for a probe that's known to false-positive).
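
The message count for a given outage can be checked with a small simulation of the state-change rules listed above. This is an illustrative model, not the Worker's implementation: it keeps state in shell variables where the Worker would use Workers KV, and it assumes one probe per minute.

```shell
# Count Discord messages for one probe under state-change-only notification.
#   $1 = outage length in minutes
#   $2 = "still down" reminder interval in minutes
count_messages() {
  outage=$1
  reminder=$2
  state=healthy
  messages=0
  last_notified=0
  minute=0
  # The probe fails at minutes 0 .. outage-1 and succeeds at minute `outage`.
  while [ "$minute" -le "$outage" ]; do
    if [ "$minute" -lt "$outage" ]; then result=failed; else result=healthy; fi
    if [ "$state" = healthy ] && [ "$result" = failed ]; then
      messages=$((messages + 1))          # DOWN
      last_notified=$minute
    elif [ "$state" = failed ] && [ "$result" = failed ]; then
      if [ $((minute - last_notified)) -ge "$reminder" ]; then
        messages=$((messages + 1))        # STILL DOWN reminder
        last_notified=$minute
      fi
    elif [ "$state" = failed ] && [ "$result" = healthy ]; then
      messages=$((messages + 1))          # RECOVERED
    fi
    state=$result
    minute=$((minute + 1))
  done
  echo "$messages"
}

count_messages 360 60    # 6h outage, hourly reminders: prints 7
count_messages 360 240   # same outage, 4h reminder override: prints 3
```

The first call reproduces the figure above (1 DOWN + 5 STILL DOWN + 1 RECOVERED), and the second shows how much a 4h override quiets a flappy probe.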


Verification

After you've completed all six steps and confirmed the secrets exist:

# Confirm all 9 new secrets are present in GCP SM
gcloud secrets list --project=development-485000 \
  --filter="name~tomoda-(sentry|cloudflare|alert|synthetic)" \
  --format="table(name,createTime)"

# Should show:
#   tomoda-sentry-dsn
#   tomoda-sentry-auth-token
#   tomoda-sentry-grafana-token
#   tomoda-cloudflare-api-token
#   tomoda-alert-webhook-primary
#   tomoda-cloudflare-tunnel-token
#   tomoda-cloudflare-access-client-id
#   tomoda-cloudflare-access-client-secret
#   tomoda-synthetic-probe-password

Confirm to the team that manual setup is complete; the code PRs that depend on these values (Alertmanager Discord routing, synthetic Worker, Loki tunnel) can then land.
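
The visual comparison against the expected list can also be turned into a pass/fail check. A minimal sketch, with a hypothetical function name, that reads secret names on stdin and prints whichever of the nine are missing:

```shell
# The nine secrets this page creates.
EXPECTED_SECRETS='tomoda-sentry-dsn
tomoda-sentry-auth-token
tomoda-sentry-grafana-token
tomoda-cloudflare-api-token
tomoda-alert-webhook-primary
tomoda-cloudflare-tunnel-token
tomoda-cloudflare-access-client-id
tomoda-cloudflare-access-client-secret
tomoda-synthetic-probe-password'

# Read existing secret names on stdin (short names or full resource paths)
# and print every expected name that is absent. Prints nothing if all exist.
missing_secrets() {
  existing=$(sed 's|.*/||')    # keep only the part after the last slash
  for name in $EXPECTED_SECRETS; do
    printf '%s\n' "$existing" | grep -Fqx "$name" || printf '%s\n' "$name"
  done
}
```

Usage: `gcloud secrets list --project=development-485000 --format='value(name)' | missing_secrets`. Empty output means all nine are present.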

Rotation

These secrets rotate independently. When you rotate one, only the consumers that read it need to be restarted. ESO syncs every hour; once it has synced, a kubectl rollout restart makes the affected pods pick up the new value if you need to verify immediately.

Secret | When to rotate | Consumers to restart
tomoda-cloudflare-api-token | If leaked, or annually | Re-run terraform plan/apply; the Cloudflare provider uses the new value automatically
tomoda-alert-webhook-primary | If leaked, or if changing alert provider | Restart alertmanager; trigger Worker redeploy so its secret binding refreshes
tomoda-cloudflare-tunnel-token | If leaked, or annually | Restart the cloudflared Deployment in monitoring
tomoda-cloudflare-access-client-* | If leaked | Worker redeploy (it embeds the value at deploy time)
tomoda-synthetic-probe-password | Quarterly + on team turnover | Re-run the seed migration (it updates the hash) + Worker redeploy
tomoda-sentry-dsn | Rarely, when rotating the Sentry project DSN | Next native release build picks up the new value (it's pulled by build-with-secrets.sh at build time)
tomoda-sentry-auth-token | Annually, or if leaked | Same: next native release build pulls the new value
tomoda-sentry-grafana-token | Annually, or if leaked | ESO syncs within 1h; kubectl rollout restart deploy/monitoring-grafana -n monitoring to verify immediately
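
A rotation itself is one new secret version plus the follow-up from the table. The sketch below, with hypothetical function names, reads the new value from stdin and then prints a reminder of what to restart; it does not perform the restart.

```shell
# Print the follow-up action from the rotation table for a secret name.
# Returns 1 for names that are not in the table.
rotation_followup() {
  case "$1" in
    tomoda-cloudflare-api-token)
      echo 're-run terraform plan/apply' ;;
    tomoda-alert-webhook-primary)
      echo 'restart alertmanager; redeploy Worker' ;;
    tomoda-cloudflare-tunnel-token)
      echo 'restart the cloudflared Deployment in monitoring' ;;
    tomoda-cloudflare-access-client-*)
      echo 'redeploy Worker' ;;
    tomoda-synthetic-probe-password)
      echo 're-run the seed migration; redeploy Worker' ;;
    tomoda-sentry-dsn|tomoda-sentry-auth-token)
      echo 'next native release build picks it up' ;;
    tomoda-sentry-grafana-token)
      echo 'kubectl rollout restart deploy/monitoring-grafana -n monitoring' ;;
    *)
      return 1 ;;
  esac
}

# Add a new version of the secret (value on stdin), then print the reminder.
rotate_secret() {
  rotation_followup "$1" >/dev/null || {
    echo "unknown secret: $1" >&2
    return 1
  }
  gcloud secrets versions add "$1" --data-file=- || return 1
  echo "Next: $(rotation_followup "$1")" >&2
}
```

Refusing unknown names before calling gcloud guards against a typo creating a version on the wrong secret.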