Alerting¶
Prometheus Alertmanager → Discord #monitoring channel. This is the single incident pipeline for backend, database, cluster, certificate, and business-metric alerts. Messages are posted by the Tomodachi webhook. Sentry's frontend alerts route separately to #alerts-frontend (see Sentry data source; Sentry's Discord integration uses OAuth, not a webhook).
Pipeline¶
Prometheus ──► fires rule (PrometheusRule CRD)
    │
    ▼
Alertmanager ──► group_by [alertname, cluster, service]
    │        ──► group_wait 30s, group_interval 5m
    │        ──► inhibition rules (e.g. PostgresDown silences DB warnings)
    ▼
Discord webhook (Cloudflare-fronted) ──► #monitoring channel
The webhook URL is sourced from the GCP SM secret tomoda-alert-webhook-primary. ESO syncs it into the K8s Secret alertmanager-discord-credentials in monitoring/, which is mounted into the Alertmanager pod at /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url. The Alertmanager config references it via webhook_url_file: /etc/.../webhook-url, so the URL is never inlined in values.yaml.
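As an illustration of the mount-and-reference pattern described above, here is a minimal sketch of the relevant values.yaml keys. It assumes the kube-prometheus-stack chart layout (alertmanager.alertmanagerSpec.secrets for the mount, alertmanager.config for the Alertmanager config). The receiver name discord-monitoring is hypothetical and not taken from the real file.

```yaml
alertmanager:
  alertmanagerSpec:
    # Prometheus Operator mounts each Secret listed here at
    # /etc/alertmanager/secrets/<secret-name>/ inside the Alertmanager pod.
    secrets:
      - alertmanager-discord-credentials
  config:
    receivers:
      - name: discord-monitoring   # hypothetical receiver name
        discord_configs:
          # The URL is read from the mounted file, so it never appears in Git.
          - webhook_url_file: /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url
```

The file name webhook-url corresponds to the key inside the K8s Secret, because each Secret key is projected as one file in the mount directory.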
Routing + throttling defaults¶
These live in k8s/envs/dev/sys/monitoring/values.yaml under alertmanager.config.route:
| Setting | Default | Effect |
|---|---|---|
| group_by | [alertname, cluster, service] | Related alerts are batched into one Discord message |
| group_wait | 30s | Wait 30s after the first alert before sending, so correlated alerts land in the first batch |
| group_interval | 5m | Send batched updates for an existing group at most every 5 minutes |
| repeat_interval | 4h (warning), 1h (critical) | Re-send a still-firing alert only after this interval |
| severity = critical route override | group_wait: 10s, repeat_interval: 1h | Critical paging incidents stay loud during long outages |
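The defaults in the table map onto an Alertmanager route tree roughly as follows. This is a sketch rather than a copy of the real file: the receiver name is hypothetical, and the critical override is written as a child route with a matcher, which is one common way to express it.

```yaml
alertmanager:
  config:
    route:
      receiver: discord-monitoring        # hypothetical receiver name
      group_by: [alertname, cluster, service]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h                 # default, applies to warnings
      routes:
        # Child route: critical alerts notify sooner and repeat more often.
        - matchers:
            - 'severity = "critical"'
          group_wait: 10s
          repeat_interval: 1h
```

Child routes inherit any setting they do not override, so group_by and group_interval from the parent still apply to critical alerts.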
Inhibition rules¶
When a root-cause alert fires, downstream-symptom warnings are suppressed:
| Source (firing) | Suppresses | Matched on |
|---|---|---|
| PostgresDown (critical) | All severity=warning alerts for the same cluster | cluster label |
| BackendDown (critical) | All severity=warning alerts for the same service | service label |
| RedisDown (critical) | All severity=warning alerts for the same instance | instance label |
Result: a 6-hour incident with 50 correlated warnings fires ~12-15 Discord messages instead of thousands.
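The three inhibition rules above could be expressed with Alertmanager's inhibit_rules block along these lines. This is a sketch derived from the table, not the actual config; it assumes the source alerts carry the severity=critical label and that source and target alerts share the label listed under equal.

```yaml
alertmanager:
  config:
    inhibit_rules:
      # PostgresDown mutes warnings that share the same cluster label.
      - source_matchers:
          - 'alertname = "PostgresDown"'
          - 'severity = "critical"'
        target_matchers:
          - 'severity = "warning"'
        equal: [cluster]
      # BackendDown mutes warnings that share the same service label.
      - source_matchers:
          - 'alertname = "BackendDown"'
          - 'severity = "critical"'
        target_matchers:
          - 'severity = "warning"'
        equal: [service]
      # RedisDown mutes warnings that share the same instance label.
      - source_matchers:
          - 'alertname = "RedisDown"'
          - 'severity = "critical"'
        target_matchers:
          - 'severity = "warning"'
        equal: [instance]
```

An inhibition only takes effect while the source alert is firing; once it resolves, the suppressed warnings notify again if they are still active.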
Alert rule inventory¶
Rules live in k8s/envs/dev/sys/manifests/alerting-rules.yaml as a single PrometheusRule resource named tomoda-alerts in monitoring/.
| Group | Rule | Severity | Trigger |
|---|---|---|---|
| application | BackendDown | critical | up{job="backend-service"} == 0 for 2m |
| | BackendHighErrorRate | warning | 5xx rate > 5% over 5m |
| | BackendHighLatency | warning | p95 > 2s over 5m |
| database | PostgresDown | critical | cnpg_collector_up == 0 for 1m |
| | PostgresReplicationLag | warning | Lag > 10s for 5m |
| | PostgresHighConnections | warning | > 80% of max_connections for 5m |
| | PostgresBackupStale | warning | Last WAL archive > 1h ago |
| | PostgresDiskUsageHigh | warning | DB size > 15 GB |
| redis | RedisDown | critical | redis_up == 0 for 1m |
| | RedisHighMemory | warning | > 90% of max memory for 5m |
| resources | PodCrashLooping | warning | Restart rate > 0 over 15m |
| | PVCUsageHigh | warning | PVC > 85% full for 10m |
| | NodeMemoryHigh | warning | Node memory > 90% for 10m |
| business | AsyncQueueBacklog | warning | Asynq pending > 500 for 10m |
| | WSHubConnectionDrop | warning | WS hub dropping > 1 conn/sec for 5m |
| certificates | CertificateExpiringSoon | warning | Cert expires in < 14d |
| | CertificateRenewalFailed | critical | cert-manager Ready=False for > 1h |
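To show how a row of the inventory translates into the PrometheusRule resource, here is a sketch of two of the rules whose expressions are given verbatim in the table. The annotation texts are hypothetical. Depending on how the Prometheus resource's ruleSelector is configured, the manifest may also need extra metadata labels, which are omitted here.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tomoda-alerts
  namespace: monitoring
spec:
  groups:
    - name: application
      rules:
        - alert: BackendDown
          expr: up{job="backend-service"} == 0
          for: 2m                      # must hold for 2 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: Backend service is not being scraped   # hypothetical text
    - name: database
      rules:
        - alert: PostgresDown
          expr: cnpg_collector_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: CloudNativePG collector reports down   # hypothetical text
```

The severity label set here is what the route override and the inhibition rules match on, so it has to agree with the values used in the Alertmanager config.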
Verification¶
# 1. ESO has projected the webhook secret
kubectl get secret alertmanager-discord-credentials -n monitoring
# NAME                               TYPE     DATA   AGE
# alertmanager-discord-credentials   Opaque   1      ...

# 2. Alertmanager pod sees the mounted secret file (this prints the webhook URL)
kubectl exec -n monitoring statefulset/alertmanager-monitoring -- \
  cat /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url
# https://discord.com/api/webhooks/...

# 3. Alertmanager config loaded
kubectl exec -n monitoring statefulset/alertmanager-monitoring -- \
  amtool config show | head
# route: ...
# receivers: ...

# 4. Send a test alert
# (port-forward first: kubectl port-forward -n monitoring svc/alertmanager-operated 9093)
amtool alert add alertname=AlertmanagerTestPing severity=warning \
  --alertmanager.url=http://localhost:9093 \
  --annotation=summary="Test ping from amtool"
A real Discord message should appear in #monitoring within ~30s of the test ping.
Rotation¶
tomoda-alert-webhook-primary rotates independently:
# 1. Generate a new Discord webhook on the same #monitoring channel
#    (Discord Channel Settings → Integrations → Webhooks → New Webhook)

# 2. Push to GCP SM as a new version
echo -n "<new-webhook-url>" | gcloud secrets versions add tomoda-alert-webhook-primary \
  --project=development-485000 --data-file=-

# 3. ESO syncs within 1h. Once the K8s Secret holds the new URL,
#    force-restart Alertmanager to verify immediately
kubectl rollout restart statefulset/alertmanager-monitoring -n monitoring

# 4. Delete the old webhook in Discord (otherwise the old URL stays valid)
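The "syncs within 1h" window in step 3 corresponds to the refresh interval on the ExternalSecret that projects the GCP SM secret. The manifest below is a sketch of what such a resource typically looks like, not the one deployed here. The store name gcp-secret-manager, the ClusterSecretStore kind, and the v1beta1 API version are assumptions and may differ in this cluster.

```yaml
apiVersion: external-secrets.io/v1beta1   # assumed API version
kind: ExternalSecret
metadata:
  name: alertmanager-discord-credentials
  namespace: monitoring
spec:
  refreshInterval: 1h                # upper bound on how stale the K8s Secret can be
  secretStoreRef:
    kind: ClusterSecretStore         # assumed store kind
    name: gcp-secret-manager         # hypothetical store name
  target:
    name: alertmanager-discord-credentials
  data:
    - secretKey: webhook-url         # becomes the file name in the pod mount
      remoteRef:
        key: tomoda-alert-webhook-primary
```

With no version pinned on remoteRef, the latest enabled secret version is fetched, which is why adding a new version in step 2 is enough to rotate the URL.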
Provider swap (Slack / PagerDuty / Opsgenie)¶
The generically named tomoda-alert-webhook-primary secret was designed for portable provider swaps. To switch from Discord to Slack:
- In Slack, create an Incoming Webhook for the equivalent of #monitoring.
- Push the new URL to GCP SM: echo -n "<slack-url>" | gcloud secrets versions add tomoda-alert-webhook-primary --data-file=-
- Edit k8s/envs/dev/sys/monitoring/values.yaml: change discord_configs: to slack_configs:, rename webhook_url_file to api_url_file (the field Slack receivers use), and update the template format to Slack's markdown subset.
- Open a PR and let Argo CD sync it.
No K8s Secret renames cascade through the cluster.
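After the swap, the receiver block might look roughly like the sketch below. The receiver name and the channel value are hypothetical. The mount path is unchanged because the K8s Secret keeps its name. Note that Slack receivers read the URL from api_url_file rather than webhook_url_file.

```yaml
alertmanager:
  config:
    receivers:
      - name: slack-monitoring       # hypothetical receiver name
        slack_configs:
          # Same mounted file as before; only the field name differs from Discord.
          - api_url_file: /etc/alertmanager/secrets/alertmanager-discord-credentials/webhook-url
            channel: "#monitoring"   # hypothetical channel name
            send_resolved: true      # also post when the alert resolves
```

Any route that referenced the old receiver name would need to point at the new one in the same PR.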
Related docs¶
- Manual setup — provisioning the webhook secret in GCP SM
- Prometheus — what's being scraped that the rules query
- Sentry data source — frontend alerts (separate channel via OAuth)