Monitoring

Prometheus, Alertmanager, Grafana, the operator, node-exporter, kube-state-metrics, and the default dashboards — all from one Helm chart, one Argo CD Application, one namespace.

Installed by k8s/envs/platform/monitoring/application.yaml, configured by k8s/envs/platform/monitoring/values.yaml. Custom alerting rules live in k8s/envs/platform/manifests/alerting-rules.yaml and are deployed by the sys-resources Application.

Chart and source

Field	Value
Helm chart	`kube-prometheus-stack`
Repository	`https://prometheus-community.github.io/helm-charts`
Version	`87.17.0` (operator v0.92, Prometheus v3.13)
Destination namespace	`monitoring` (created by Argo CD)
Argo CD Application	`monitoring`
Server-side apply	Yes (`ServerSideApply=true` — the chart's CRDs are large enough to exceed the client-side annotation size limit)
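
As a rough sketch, the fields in the table map onto an Argo CD Application along these lines. The Argo CD namespace, project name, and destination server are assumptions rather than values taken from application.yaml, and the way values.yaml is attached to the chart is not shown.

```yaml
# Sketch only: fields marked "assumed" are not taken from the source.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd                # assumed: namespace where Argo CD runs
spec:
  project: default                 # assumed
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 87.17.0
  destination:
    server: https://kubernetes.default.svc   # assumed: in-cluster destination
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - CreateNamespace=true       # lets Argo CD create the monitoring namespace
      - ServerSideApply=true       # avoids the client-side annotation size limit on large CRDs
```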

What you get from this single chart:

Prometheus — central scrape and TSDB. 3-day retention, 5Gi PVC, scrapes ServiceMonitor and PodMonitor resources from every namespace.
Alertmanager — alert routing. 5Gi PVC.
Grafana — dashboards and exploration UI.
node-exporter — per-node host metrics (CPU, memory, disk, network).
kube-state-metrics — cluster object metrics (deployments, pods, PVCs, etc.).
prometheus-operator — reconciles the CR types above.
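
The retention and PVC sizes listed above would typically be expressed in values.yaml roughly as follows. This is a sketch using the chart's standard keys; the access mode is an assumption, and the storage class is left to the cluster default.

```yaml
prometheus:
  prometheusSpec:
    retention: 3d                  # three days of TSDB retention
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumed
          resources:
            requests:
              storage: 5Gi
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumed
          resources:
            requests:
              storage: 5Gi
```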

Scrape configuration

By default the chart's selectors restrict ServiceMonitor and PodMonitor discovery to resources labelled release: monitoring. That's disabled here:

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    podMonitorSelectorNilUsesHelmValues: false
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}

Empty selectors = "match everything". Any ServiceMonitor or PodMonitor in any namespace is scraped. That's how the Traefik ServiceMonitor (in traefik-system), the Loki ServiceMonitor (in monitoring), and the CNPG PodMonitor (in cnpg-system) all light up automatically.
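
To illustrate what "match everything" means in practice, a ServiceMonitor like the one below would be picked up without any release label. The names, namespace, labels, and port are hypothetical and do not come from this repository.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                # hypothetical
  namespace: example               # hypothetical; any namespace is discovered
  # no "release: monitoring" label is needed with empty selectors
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app   # must match the labels on the target Service
  endpoints:
    - port: metrics                # name of the Service port exposing metrics
      path: /metrics
      interval: 30s
```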

Retention is short — three days — because the storage allocation is 5Gi. For longer-term metrics, the next step would be a remote-write target. Not configured today.
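
If remote-write were added later, a minimal sketch might look like this. The endpoint URL is a placeholder, and any authentication settings the target needs are omitted.

```yaml
prometheus:
  prometheusSpec:
    retention: 3d                  # local retention stays short
    remoteWrite:
      - url: https://metrics.example.com/api/v1/write   # hypothetical long-term store
```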

Grafana

Grafana is exposed at https://grafana.tomoda.life through Traefik, gated by the same two oauth2-proxy middlewares as Argo CD (see oauth2-proxy):

ingress:
  ingressClassName: traefik
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.middlewares: sys-oauth2-proxy-errors@kubernetescrd,sys-oauth2-proxy-auth@kubernetescrd
  hosts:
    - grafana.tomoda.life

Two layers of auth:

oauth2-proxy stops unauthenticated traffic at Traefik.
Grafana's auth.proxy trusts the X-Auth-Request-Email header that oauth2-proxy injects, and auto-creates a Grafana user from it (auto_sign_up: true).

A separate auth.google block is configured but disabled (enabled: false); the proxy header path is the active one. The Google client secret for that disabled flow is still wired in from a K8s Secret named google-oauth-credentials, which External Secrets populates from the GCP Secret Manager key oauth-client-secret.
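
The auth settings described above correspond roughly to the following grafana.ini section of the chart values. This is a sketch: header_property is an assumption, and the Google client ID and secret wiring is left out.

```yaml
grafana:
  grafana.ini:
    auth.proxy:
      enabled: true
      header_name: X-Auth-Request-Email   # header injected by oauth2-proxy
      header_property: email              # assumed: treat the header value as the user's email
      auto_sign_up: true                  # create the Grafana user on first login
    auth.google:
      enabled: false                      # configured but inactive
```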

A 2Gi PVC backs /var/lib/grafana, so dashboards and users survive restarts.
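
In the Grafana subchart, that persistence is usually switched on with a fragment like this (a sketch; the storage class is left to the cluster default):

```yaml
grafana:
  persistence:
    enabled: true
    size: 2Gi                      # backs /var/lib/grafana
```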

Default dashboards

Three dashboards are auto-provisioned via the chart's dashboards.default block:

Dashboard	grafana.com ID	Datasource
Traefik metrics	11462 (rev 6)	Prometheus
Traefik logs	13702 (rev 1)	Loki
CNPG (Postgres)	20417 (rev 3)	Prometheus
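
A provisioning block for the three dashboards in the table could look roughly like this. The dashboard keys (traefik-metrics, traefik-logs, cnpg) and the provider settings are assumptions; only the IDs, revisions, and datasources come from the table.

```yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      traefik-metrics:             # hypothetical key
        gnetId: 11462
        revision: 6
        datasource: Prometheus
      traefik-logs:                # hypothetical key
        gnetId: 13702
        revision: 1
        datasource: Loki
      cnpg:                        # hypothetical key
        gnetId: 20417
        revision: 3
        datasource: Prometheus
```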

Beyond that, the kube-prometheus-stack chart bundles its own Kubernetes dashboards (cluster overview, namespace overview, node usage, etc.), which appear under the default "Kubernetes" folder.

A Loki datasource is added through additionalDataSources (http://loki:3100), so logs and metrics share the same Grafana (see Loki). A Tempo datasource (http://tempo:3100) is also wired in with bidirectional jumps to Loki (see operations/observability/tempo and Prometheus scrape coverage).
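
One way such a pair of datasources is commonly provisioned is sketched below. The uid values, the derived-field name, and the trace ID regex are assumptions about the setup and log format, not values taken from values.yaml.

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      uid: loki                            # assumed uid
      type: loki
      access: proxy
      url: http://loki:3100
      jsonData:
        derivedFields:                     # log line -> trace jump
          - name: TraceID                  # assumed field name
            matcherRegex: 'trace_id=(\w+)' # assumed log format
            datasourceUid: tempo
            url: '$${__value.raw}'         # "$$" escapes env-var expansion in provisioning
    - name: Tempo
      uid: tempo                           # assumed uid
      type: tempo
      access: proxy
      url: http://tempo:3100
      jsonData:
        tracesToLogsV2:                    # trace span -> logs jump
          datasourceUid: loki
```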

Custom alerting rules

k8s/envs/platform/manifests/alerting-rules.yaml defines a PrometheusRule named tomoda-alerts with seven groups:

Group	Sample alerts
`tomoda.application`	`BackendDown`, `BackendHighErrorRate` (>5% 5xx), `BackendHighLatency` (p95 > 2s)
`tomoda.database`	`PostgresDown`, `PostgresReplicationLag`, `PostgresHighConnections`, `PostgresBackupStale`, `PostgresDiskUsageHigh`
`tomoda.redis`	`RedisDown`, `RedisHighMemory`
`tomoda.resources`	`PodCrashLooping`, `PVCUsageHigh`, `NodeMemoryHigh`
`tomoda.business`	`AsyncQueueBacklog`, `WSHubConnectionDrop`
`tomoda.semantic`	`SemanticHighErrorRate`, `SemanticAuthFailing`, `SemanticRateLimited`, `SemanticP95Slow`, `SemanticDailySpendHigh`, `SemanticCacheHitLow`, `SearchEnrichmentHighErrorRate`, `SearchEnrichmentAuthFailing`, `SearchEnrichmentRateLimited`
`tomoda.certificates`	`CertificateExpiringSoon`, `CertificateRenewalFailed`

The release: monitoring label on the rule is what Prometheus matches through the kube-prometheus-stack's default ruleSelector; without it the rule would not be loaded. Routing to Discord is defined in an AlertmanagerConfig CR (see Alerting).
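
For orientation, a trimmed-down PrometheusRule carrying the label that the default ruleSelector matches might look like this. The namespace, expression, duration, severity, and annotation are illustrative assumptions, not the contents of alerting-rules.yaml.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tomoda-alerts
  namespace: monitoring            # assumed
  labels:
    release: monitoring            # matched by the chart's default ruleSelector
spec:
  groups:
    - name: tomoda.application
      rules:
        - alert: BackendDown
          expr: up{job="backend"} == 0   # assumed expression and job label
          for: 2m                        # assumed
          labels:
            severity: critical           # assumed
          annotations:
            summary: Backend scrape target is down
```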

Operational notes

Default Grafana admin password. values.yaml sets adminPassword: "admin", so anyone who reaches Grafana directly (for example through a port-forward, which bypasses oauth2-proxy) can log in with it. The public hostname is protected by the header-auth path; the admin password only matters for direct access.
Storage is the binding constraint. Three days of retention on 5Gi works because the cluster is small. As scrape targets grow, expect the PVC to fill — either bump it or wire remote-write.
Disabling the chart loses CRDs. As with cert-manager, the kube-prometheus-stack ships its CRDs. Uninstalling cascades to every ServiceMonitor, PrometheusRule, etc. in the cluster.
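
On the admin password note: one possible mitigation, not configured today, is to point the Grafana subchart at an existing Secret instead of an inline password. The Secret name below is hypothetical, and it would have to be created separately (for example by External Secrets).

```yaml
grafana:
  admin:
    existingSecret: grafana-admin-credentials   # hypothetical Secret name
    userKey: admin-user                         # key in the Secret holding the username
    passwordKey: admin-password                 # key in the Secret holding the password
```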