Alerting

Tomodachi — the Discord webhook avatar that posts alerts to #monitoring

Prometheus Alertmanager → Discord #monitoring channel. One unified incident pipe for backend / database / cluster / cert / business-metric alerts. Messages are posted by the Tomodachi webhook. Sentry's frontend alerts route separately to #alerts-frontend (see Sentry data source — Sentry's Discord integration is OAuth, not webhook).

Pipeline

Prometheus ──► fires rule (PrometheusRule CRD)
   │
   ▼
Alertmanager ──► group_by [alertname, cluster, service]
   │            ──► group_wait 30s, group_interval 5m
   │            ──► inhibition rules (e.g. PostgresDown silences DB warnings)
   ▼
Discord webhook (Cloudflare-fronted) ──► #monitoring channel

Webhook URL is sourced from GCP SM tomoda-alert-webhook-primary via ESO → K8s Secret alertmanager-discord-credentials in monitoring/ → mounted into the Alertmanager pod at /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url. The Alertmanager config references it via webhook_url_file: /etc/.../webhook-url so the URL is never inlined in values.yaml.
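The secret hop described above could be expressed with an ExternalSecret roughly like the one below. This is a sketch, not the repository's actual manifest: the ClusterSecretStore name `gcp-secret-manager` is hypothetical, the `apiVersion` depends on the installed ESO release, and the 1h refresh interval is inferred from the "ESO syncs within 1h" note under Rotation.

```yaml
# Sketch only: store name and apiVersion are assumptions, not taken from the repo.
apiVersion: external-secrets.io/v1beta1   # newer ESO releases serve external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: alertmanager-discord-credentials
  namespace: monitoring
spec:
  refreshInterval: 1h                     # assumed from the rotation note
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager              # hypothetical store name
  target:
    name: alertmanager-discord-credentials   # K8s Secret mounted by Alertmanager
  data:
    - secretKey: webhook-url              # becomes the file name under the mount path
      remoteRef:
        key: tomoda-alert-webhook-primary # GCP Secret Manager secret
```

The `secretKey` matters because the Secret key is what appears as the file name `webhook-url` inside the pod.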

Routing + throttling defaults

These live in k8s/envs/dev/sys/monitoring/values.yaml under alertmanager.config.route:

| Setting | Default | Effect |
| --- | --- | --- |
| group_by | [alertname, cluster, service] | Related alerts batched into one Discord message |
| group_wait | 30s | Wait 30s after the first alert before firing; collects correlated alerts into the first batch |
| group_interval | 5m | Within a group, batched update at most every 5 min |
| repeat_interval | 4h (warning), 1h (critical) | Don't re-fire the same alert for N hours |
| severity = critical route override | group_wait: 10s, repeat_interval: 1h | Critical paging incidents stay loud during long outages |
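Put together, these defaults might look like the following fragment of values.yaml. It is a minimal sketch that assumes a kube-prometheus-stack style chart (where `alertmanagerSpec.secrets` mounts a Secret under /etc/alertmanager/secrets/) and an Alertmanager version that supports `webhook_url_file` for Discord. The receiver name `discord-monitoring` is hypothetical.

```yaml
# Sketch only: chart layout and receiver name are assumptions.
alertmanager:
  alertmanagerSpec:
    secrets:
      - alertmanager-discord-credentials   # mounted at /etc/alertmanager/secrets/<name>/
  config:
    route:
      receiver: discord-monitoring         # hypothetical receiver name
      group_by: [alertname, cluster, service]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h                  # default, applies to warnings
      routes:
        - matchers:
            - severity = "critical"
          group_wait: 10s                  # critical override
          repeat_interval: 1h
    receivers:
      - name: discord-monitoring
        discord_configs:
          - webhook_url_file: /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url
```

The child route sets only the two values it overrides; receiver, group_by and group_interval are inherited from the parent route.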

Inhibition rules

When a root-cause alert fires, downstream-symptom warnings are suppressed:

| Source (firing) | Suppresses | Matched on |
| --- | --- | --- |
| PostgresDown (critical) | All severity=warning alerts for the same cluster | cluster label |
| BackendDown (critical) | All severity=warning alerts for the same service | service label |
| RedisDown (critical) | All severity=warning alerts for the same instance | instance label |

Result: a 6-hour incident with 50 correlated warnings fires ~12-15 Discord messages instead of thousands.
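In Alertmanager configuration terms, the three inhibitions listed above could be sketched as follows. The fragment assumes the same values.yaml layout as the routing defaults; the actual file may express the matchers differently.

```yaml
# Sketch only: placement under alertmanager.config is an assumption.
alertmanager:
  config:
    inhibit_rules:
      - source_matchers:
          - alertname = "PostgresDown"
          - severity = "critical"
        target_matchers:
          - severity = "warning"
        equal: [cluster]      # only suppress warnings from the same cluster
      - source_matchers:
          - alertname = "BackendDown"
          - severity = "critical"
        target_matchers:
          - severity = "warning"
        equal: [service]      # only suppress warnings from the same service
      - source_matchers:
          - alertname = "RedisDown"
          - severity = "critical"
        target_matchers:
          - severity = "warning"
        equal: [instance]     # only suppress warnings from the same instance
```

The `equal` list is what scopes each rule: a warning is suppressed only when it carries the same value for that label as the firing source alert.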

Alert rule inventory

Rules live in k8s/envs/dev/sys/manifests/alerting-rules.yaml as a single PrometheusRule resource named tomoda-alerts in monitoring/.

| Group | Rule | Severity | Trigger |
| --- | --- | --- | --- |
| application | BackendDown | critical | up{job="backend-service"} == 0 for 2m |
| | BackendHighErrorRate | warning | 5xx rate > 5% over 5m |
| | BackendHighLatency | warning | p95 > 2s over 5m |
| database | PostgresDown | critical | cnpg_collector_up == 0 for 1m |
| | PostgresReplicationLag | warning | Lag > 10s for 5m |
| | PostgresHighConnections | warning | > 80% of max_connections for 5m |
| | PostgresBackupStale | warning | Last WAL archive > 1h ago |
| | PostgresDiskUsageHigh | warning | DB size > 15 GB |
| redis | RedisDown | critical | redis_up == 0 for 1m |
| | RedisHighMemory | warning | > 90% of max memory for 5m |
| resources | PodCrashLooping | warning | Restart rate > 0 over 15m |
| | PVCUsageHigh | warning | PVC > 85% full for 10m |
| | NodeMemoryHigh | warning | Node memory > 90% for 10m |
| business | AsyncQueueBacklog | warning | Asynq pending > 500 for 10m |
| | WSHubConnectionDrop | warning | WS hub dropping > 1 conn/sec for 5m |
| certificates | CertificateExpiringSoon | warning | Cert expires in < 14d |
| | CertificateRenewalFailed | critical | cert-manager Ready=False for > 1h |
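For orientation, two of the rules above could be written in the PrometheusRule resource roughly as shown here. Only the rules whose expressions are spelled out in the inventory are included. The annotation text is illustrative, and the real manifest may carry extra labels that the Prometheus rule selector requires.

```yaml
# Sketch only: annotations are assumed wording, not copied from the repo.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tomoda-alerts
  namespace: monitoring
spec:
  groups:
    - name: application
      rules:
        - alert: BackendDown
          expr: up{job="backend-service"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Backend service is down     # assumed wording
    - name: database
      rules:
        - alert: PostgresDown
          expr: cnpg_collector_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: Postgres collector is down  # assumed wording
```

The `severity` label is what the critical route override and the inhibition rules match on, so every rule needs to set it.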

Verification

# 1. ESO has projected the webhook secret
kubectl get secret alertmanager-discord-credentials -n monitoring
#   NAME                                  TYPE     DATA   AGE
#   alertmanager-discord-credentials      Opaque   1      ...

# 2. Alertmanager pod sees the mounted secret file
kubectl exec -n monitoring statefulset/alertmanager-monitoring -- \
  cat /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url
#   https://discord.com/api/webhooks/...

# 3. Alertmanager config loaded
kubectl exec -n monitoring statefulset/alertmanager-monitoring -- \
  amtool config show --alertmanager.url=http://localhost:9093 | head
#   route: ...
#   receivers: ...

# 4. Send a test alert (port-forward first, in a separate terminal:
#    kubectl port-forward -n monitoring svc/alertmanager-operated 9093)
amtool alert add alertname=AlertmanagerTestPing severity=warning \
  --alertmanager.url=http://localhost:9093 \
  --annotation=summary="Test ping from amtool"

A real Discord message should appear in #monitoring roughly 30s after the test ping, once the default group_wait has elapsed.

Rotation

tomoda-alert-webhook-primary rotates independently:

# 1. Generate a new Discord webhook on the same #monitoring channel
#    (Discord Channel Settings → Integrations → Webhooks → New Webhook)
# 2. Push to GCP SM as a new version
echo -n "<new-webhook-url>" | gcloud secrets versions add tomoda-alert-webhook-primary \
  --project=development-485000 --data-file=-
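# 3. ESO syncs within 1h. Restarting Alertmanager does not trigger that sync; it only
#    makes the pod re-read the mounted Secret once ESO has updated it.
kubectl rollout restart statefulset/alertmanager-monitoring -n monitoring
# 4. Delete the old webhook in Discord (otherwise the old URL stays valid and can still post)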

Provider swap (Slack / PagerDuty / Opsgenie)

The generic-named tomoda-alert-webhook-primary secret was designed for portable provider swaps. To switch from Discord to Slack:

  1. In Slack, create an Incoming Webhook for the equivalent of #monitoring.
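  2. Push the new URL to GCP SM: echo -n "<slack-url>" | gcloud secrets versions add tomoda-alert-webhook-primary --project=development-485000 --data-file=-.
  3. Edit k8s/envs/dev/sys/monitoring/values.yaml: change discord_configs: to slack_configs:, rename webhook_url_file to api_url_file (the Slack receiver's name for the same setting), and update the template format to Slack's markdown subset (mrkdwn).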
  4. PR + Argo CD sync.

Because the secret name is provider-neutral, no K8s Secret has to be renamed anywhere in the cluster.
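After step 3, the receiver block might look something like the fragment below. This is a sketch under assumptions: the receiver name `slack-monitoring` and the title/text templates are illustrative, and the route's `receiver:` would need to point at whatever name is chosen. Note that the Slack receiver reads its URL from `api_url_file`, while the mount path stays the same because the K8s Secret keeps its name.

```yaml
# Sketch only: receiver name and templates are assumptions.
alertmanager:
  config:
    receivers:
      - name: slack-monitoring             # hypothetical receiver name
        slack_configs:
          - api_url_file: /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url
            channel: "#monitoring"
            title: '{{ .CommonLabels.alertname }}'
            text: '{{ .CommonAnnotations.summary }}'
```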