Alerting

Prometheus Alertmanager posts to the Discord #monitoring channel: one unified incident pipe for backend, database, cluster, certificate, and business-metric alerts. Messages are posted by the Tomodachi webhook. Sentry's frontend alerts route separately to #alerts-frontend (see Sentry data source; Sentry's Discord integration uses OAuth, not a webhook).

Pipeline

Prometheus ──► fires rule (PrometheusRule CRD)
   │
   ▼
Alertmanager ──► group_by [alertname, cluster, service]
   │            ──► group_wait 30s, group_interval 5m
   │            ──► inhibition rules (e.g. PostgresDown silences DB warnings)
   ▼
Discord webhook (Cloudflare-fronted) ──► #monitoring channel

Webhook URL is sourced from GCP SM tomoda-alert-webhook-primary via ESO → K8s Secret alertmanager-discord-credentials in monitoring/. The Discord receiver's apiURL reads the webhook-url key of that secret directly, so the URL is never inlined in git.
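The sync described above could be declared with an ExternalSecret along the following lines. This is a minimal sketch, not the repository's actual manifest: the store name `gcp-secret-manager`, its `ClusterSecretStore` kind, and the `v1beta1` API version are assumptions (newer ESO releases serve `external-secrets.io/v1`), and the 1h refresh interval is taken from the sync window mentioned in the rotation steps.

```yaml
# Sketch only: store name, store kind, and apiVersion are assumptions.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-discord-credentials
  namespace: monitoring
spec:
  refreshInterval: 1h                # ESO re-reads GCP SM at this interval
  secretStoreRef:
    kind: ClusterSecretStore         # assumption: could also be a namespaced SecretStore
    name: gcp-secret-manager         # hypothetical store name
  target:
    name: alertmanager-discord-credentials   # K8s Secret that ESO creates
  data:
    - secretKey: webhook-url         # key read by the Discord receiver
      remoteRef:
        key: tomoda-alert-webhook-primary    # GCP Secret Manager secret name
```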

AlertmanagerConfig CRD, not raw config

Routing + the Discord receiver live in an AlertmanagerConfig CR (k8s/envs/platform/manifests/alertmanager-config.yaml), not inline in the chart values. The prometheus-operator's raw-config validator rejects discord webhook_url_file (#7159); the CRD's discordConfigs.apiURL reads the webhook from a secret key instead.
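As an illustration of the secret-key reference, a receiver in an AlertmanagerConfig might look roughly like this. The resource name `discord-alerts` and receiver name `discord` are hypothetical; the secret name and key are the ones named earlier on this page.

```yaml
# Sketch only: metadata.name and the receiver name are hypothetical.
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: discord-alerts
  namespace: monitoring
spec:
  route:
    receiver: discord
  receivers:
    - name: discord
      discordConfigs:
        - apiURL:                    # SecretKeySelector, resolved by the operator
            name: alertmanager-discord-credentials
            key: webhook-url
          sendResolved: true         # assumption: also post when alerts resolve
```

Because `apiURL` is a reference to a secret key rather than a literal string, the manifest can be committed without exposing the webhook.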

Routing + throttling defaults

These live in the AlertmanagerConfig CR (k8s/envs/platform/manifests/alertmanager-config.yaml) under spec.route:

Setting	Default	Effect
`group_by`	`[alertname, cluster, service]`	Related alerts batched into one Discord message
`group_wait`	`30s`	Wait 30s after the first alert in a group before sending the first notification; collects correlated alerts into the first batch
`group_interval`	`5m`	Within a group, send a batched update at most every 5 min
`repeat_interval`	`4h` (warning), `1h` (critical)	Re-send a still-firing alert only after this interval has elapsed
`severity = critical` route override	`group_wait: 10s`, `repeat_interval: 1h`	Critical paging incidents stay loud during long outages
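Expressed in the CRD's camelCase fields, the defaults in the table might look like the fragment below. It is a sketch of `spec.route` only, assuming a receiver named `discord` (hypothetical) is defined elsewhere in the same CR.

```yaml
# Fragment of an AlertmanagerConfig; the receiver name is hypothetical.
spec:
  route:
    receiver: discord
    groupBy: [alertname, cluster, service]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 4h               # default, applies to warnings
    routes:
      - receiver: discord
        matchers:
          - name: severity
            value: critical
            matchType: "="
        groupWait: 10s               # critical override
        repeatInterval: 1h           # critical override
```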

Inhibition rules

When a root-cause alert fires, downstream-symptom warnings are suppressed:

Source (firing)	Suppresses	Matched on
`PostgresDown` (critical)	All `severity=warning` alerts for the same `cluster`	`cluster` label
`BackendDown` (critical)	All `severity=warning` alerts for the same `service`	`service` label
`RedisDown` (critical)	All `severity=warning` alerts for the same `instance`	`instance` label

Result: a 6-hour incident with 50 correlated warnings fires ~12-15 Discord messages instead of thousands.
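The first row of the inhibition table could be written in the AlertmanagerConfig CR roughly as follows. This is a sketch under the assumption that the rules sit under `spec.inhibitRules`; the other two rows would follow the same shape with `service` and `instance` in `equal`.

```yaml
# Fragment of an AlertmanagerConfig; shows the PostgresDown row only.
spec:
  inhibitRules:
    - sourceMatch:                   # the root-cause alert that must be firing
        - name: alertname
          value: PostgresDown
          matchType: "="
        - name: severity
          value: critical
          matchType: "="
      targetMatch:                   # the alerts that get suppressed
        - name: severity
          value: warning
          matchType: "="
      equal: [cluster]               # only suppress when this label matches
```

The `equal` list is what scopes the suppression: a warning is muted only if its `cluster` label has the same value as the firing `PostgresDown` alert.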

Alert rule inventory

Rules live in k8s/envs/platform/manifests/alerting-rules.yaml as a single PrometheusRule resource named tomoda-alerts in monitoring/.

Group	Rule	Severity	Trigger
application	`BackendDown`	critical	`up{job="backend-service"} == 0` for 2m
	`BackendHighErrorRate`	warning	5xx rate > 5% over 5m
	`BackendHighLatency`	warning	p95 > 2s over 5m
database	`PostgresDown`	critical	`cnpg_collector_up == 0` for 1m
	`PostgresReplicationLag`	warning	Lag > 10s for 5m
	`PostgresHighConnections`	warning	> 80% of `max_connections` for 5m
	`PostgresBackupStale`	warning	Last WAL archive > 1h ago
	`PostgresDiskUsageHigh`	warning	DB size > 15 GB
redis	`RedisDown`	critical	`redis_up == 0` for 1m
	`RedisHighMemory`	warning	> 90% of max memory for 5m
resources	`PodCrashLooping`	warning	Restart rate > 0 over 15m
	`PVCUsageHigh`	warning	PVC > 85% full for 10m
	`NodeMemoryHigh`	warning	Node memory > 90% for 10m
business	`AsyncQueueBacklog`	warning	Asynq pending > 500 for 10m
	`WSHubConnectionDrop`	warning	WS hub dropping > 1 conn/sec for 5m
certificates	`CertificateExpiringSoon`	warning	Cert expires in < 14d
	`CertificateRenewalFailed`	critical	cert-manager `Ready=False` for > 1h
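For orientation, two rows of the inventory might be declared like this. The expressions, durations, and severities come from the table; the `service` label value and the annotation text are assumptions added for illustration (the inhibition rules match on a `service` label, so some such label would need to be present).

```yaml
# Sketch only: label values other than severity, and all annotations, are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tomoda-alerts
  namespace: monitoring
spec:
  groups:
    - name: application
      rules:
        - alert: BackendDown
          expr: up{job="backend-service"} == 0
          for: 2m
          labels:
            severity: critical
            service: backend-service   # hypothetical value
          annotations:
            summary: Backend service is down   # hypothetical text
    - name: database
      rules:
        - alert: PostgresDown
          expr: cnpg_collector_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: Postgres collector reports down   # hypothetical text
```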

Cost alerts (AWSCostMTDHigh, GCPCostMTDHigh, DeepSeekBalanceLow, SerperCreditsLow) live in a separate cost-alerts PrometheusRule and route through this same pipe. See Cost monitoring.

Verification

# 1. ESO has projected the webhook secret
kubectl get secret alertmanager-discord-credentials -n monitoring
#   NAME                                  TYPE     DATA   AGE
#   alertmanager-discord-credentials      Opaque   1      ...

# 2. Alertmanager pod sees the mounted secret file
kubectl exec -n monitoring statefulset/alertmanager-monitoring -- \
  cat /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url
#   https://discord.com/api/webhooks/...

# 3. Alertmanager config loaded
kubectl exec -n monitoring statefulset/alertmanager-monitoring -- \
  amtool config show | head
#   route: ...
#   receivers: ...

# 4. Send a test alert (port-forward first, in a separate terminal)
kubectl port-forward -n monitoring svc/alertmanager-operated 9093
amtool alert add alertname=AlertmanagerTestPing severity=warning \
  --alertmanager.url=http://localhost:9093 \
  --annotation=summary="Test ping from amtool"

A real Discord message should appear in #monitoring within ~30s of the test ping.

Rotation

tomoda-alert-webhook-primary rotates independently:

# 1. Generate a new Discord webhook on the same #monitoring channel
#    (Discord Channel Settings → Integrations → Webhooks → New Webhook)
# 2. Push to GCP SM as a new version
echo -n "<new-webhook-url>" | gcloud secrets versions add tomoda-alert-webhook-primary \
  --project=development-485000 --data-file=-
# 3. ESO syncs within 1h; force-restart Alertmanager to verify immediately
kubectl rollout restart statefulset/alertmanager-monitoring -n monitoring
# 4. Delete the old webhook in Discord once messages arrive via the new one
#    (otherwise the old URL remains valid and can still post to the channel)

Provider swap (Slack / PagerDuty / Opsgenie)

The generic-named tomoda-alert-webhook-primary secret was designed for portable provider swaps. To switch from Discord to Slack:

1. In Slack, create an Incoming Webhook for the equivalent of #monitoring.
2. Push the new URL to GCP SM: echo -n "<slack-url>" | gcloud secrets versions add tomoda-alert-webhook-primary --data-file=-.
3. Edit the AlertmanagerConfig CR (k8s/envs/platform/manifests/alertmanager-config.yaml): swap the discordConfigs receiver for slackConfigs and update the message template to Slack's markdown subset.
4. Open a PR and let Argo CD sync.

No K8s Secret renames cascade through the cluster.
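After the swap, the receiver section of the CR might look roughly like this. It is a sketch: the receiver name `slack`, the channel, and the templates are assumptions, while the secret name and key are deliberately unchanged from the Discord setup.

```yaml
# Fragment of an AlertmanagerConfig; receiver name, channel, and templates are hypothetical.
spec:
  route:
    receiver: slack
  receivers:
    - name: slack
      slackConfigs:
        - apiURL:                    # same K8s Secret, now holding the Slack URL
            name: alertmanager-discord-credentials
            key: webhook-url
          channel: "#monitoring"
          sendResolved: true
          title: '{{ .CommonLabels.alertname }}'
          text: '{{ range .Alerts }}{{ .Annotations.summary }} {{ end }}'
```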

Manual setup — provisioning the webhook secret in GCP SM
Prometheus — what's being scraped that the rules query
Sentry data source — frontend alerts (separate channel via OAuth)