Monitoring
Prometheus, Alertmanager, Grafana, the operator, node-exporter, kube-state-metrics, and the default dashboards — all from one Helm chart, one Argo CD Application, one namespace.
Installed by k8s/envs/dev/sys/monitoring/application.yaml, configured by k8s/envs/dev/sys/monitoring/values.yaml. Custom alerting rules live in k8s/envs/dev/sys/manifests/alerting-rules.yaml and are deployed by the sys-resources Application.
Chart and source
| Field | Value |
|---|---|
| Helm chart | kube-prometheus-stack |
| Repository | https://prometheus-community.github.io/helm-charts |
| Version | 61.3.2 |
| Destination namespace | monitoring (created by Argo CD) |
| Argo CD Application | monitoring |
| Server-side apply | Yes (ServerSideApply=true — the chart's CRDs are large enough to exceed the client-side annotation size limit) |
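The table above maps onto an Argo CD Application roughly as follows. This is a minimal sketch rather than a copy of application.yaml: the argocd namespace, the default project, and the in-cluster destination server are assumptions, and the way values.yaml is attached to the chart is left out.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd              # assumption: namespace where Argo CD runs
spec:
  project: default               # assumption
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 61.3.2
    # How values.yaml is passed to the chart is intentionally not shown.
  destination:
    server: https://kubernetes.default.svc   # assumption: in-cluster target
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - CreateNamespace=true     # lets Argo CD create the monitoring namespace
      - ServerSideApply=true     # avoids the client-side annotation size limit
```

Unless a release name is set explicitly, Argo CD uses the Application name as the Helm release name, which is consistent with the release: monitoring label mentioned elsewhere on this page.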
What you get from this single chart:
- Prometheus: central scrape and TSDB. 3-day retention, 5Gi PVC, scrapes ServiceMonitor and PodMonitor resources from every namespace.
- Alertmanager: alert routing. 5Gi PVC.
- Grafana: dashboards and exploration UI.
- node-exporter: per-node host metrics (CPU, memory, disk, network).
- kube-state-metrics: cluster object metrics (deployments, pods, PVCs, etc.).
- prometheus-operator: reconciles the CR types above.
Scrape configuration
By default the chart restricts ServiceMonitor and PodMonitor discovery to resources labelled with its own Helm release name (here, release: monitoring). That restriction is disabled here:
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    podMonitorSelectorNilUsesHelmValues: false
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}
Empty selectors = "match everything". Any ServiceMonitor or PodMonitor in any namespace is scraped. That's how the Traefik ServiceMonitor (in traefik-system), the Loki ServiceMonitor (in monitoring), and the CNPG PodMonitor (in cnpg-system) all light up automatically.
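To illustrate what "match everything" means in practice, here is a sketch of a ServiceMonitor that would be scraped without carrying any release label. The application name, namespace, Service label, and port name are all hypothetical and not taken from this repo.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app              # hypothetical
  namespace: example             # hypothetical; any namespace is discovered
  # No "release: monitoring" label is needed with the empty selectors.
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app   # must match labels on the target Service
  endpoints:
    - port: metrics              # name of the Service port that serves metrics
      path: /metrics
      interval: 30s
```

The Service selected by matchLabels must live in the same namespace as the ServiceMonitor unless the resource sets its own namespaceSelector.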
Retention is short — three days — because the storage allocation is 5Gi. For longer-term metrics, the next step would be a remote-write target. Not configured today.
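The retention and storage figures would typically be expressed in values.yaml along these lines. This is a sketch under assumptions: the access mode is a guess, and the commented-out remoteWrite entry uses a hypothetical endpoint purely to show where such a target would go.

```yaml
prometheus:
  prometheusSpec:
    retention: 3d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumption
          resources:
            requests:
              storage: 5Gi
    # Not configured today; shown only as a possible next step.
    # remoteWrite:
    #   - url: https://metrics.example.com/api/v1/write   # hypothetical endpoint
```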
Grafana
Grafana is exposed at https://grafana.tomoda.life through Traefik, gated by the same two oauth2-proxy middlewares as Argo CD (see oauth2-proxy):
ingress:
  ingressClassName: traefik
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.middlewares: sys-oauth2-proxy-errors@kubernetescrd,sys-oauth2-proxy-auth@kubernetescrd
  hosts:
    - grafana.tomoda.life
Two layers of auth:
- oauth2-proxy stops unauthenticated traffic at Traefik.
- Grafana's auth.proxy trusts the X-Auth-Request-Email header that oauth2-proxy injects, and auto-creates a Grafana user from it (auto_sign_up: true).
A separate auth.google block is configured but disabled (enabled: false); the proxy header path is the active one. The Google client secret for that disabled flow is still wired in from a Kubernetes Secret named google-oauth-credentials, which External Secrets populates from the GCP Secret Manager key oauth-client-secret.
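The two auth blocks described above could look roughly like this under the chart's grafana.ini passthrough. It is a sketch, not the repo's actual values: header_property is an assumption, and the wiring of the Google client secret is omitted.

```yaml
grafana:
  grafana.ini:
    auth.proxy:
      enabled: true
      header_name: X-Auth-Request-Email   # header injected by oauth2-proxy
      header_property: email              # assumption: header value is treated as the user's email
      auto_sign_up: true                  # create the Grafana user on first login
    auth.google:
      enabled: false                      # configured but inactive
      # Client secret comes from the google-oauth-credentials Secret (wiring omitted).
```

Because Grafana trusts the header outright, this setup is only safe while every public request reaches Grafana through oauth2-proxy.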
A 2Gi PVC backs /var/lib/grafana, so dashboards and users survive restarts.
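In the Grafana subchart, a PVC of that size is normally requested with the persistence block. A minimal sketch, assuming no storage class or access mode overrides:

```yaml
grafana:
  persistence:
    enabled: true
    size: 2Gi        # backs /var/lib/grafana
```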
Default dashboards
Three dashboards are auto-provisioned via the chart's dashboards.default block:
| Dashboard | grafana.com ID | Datasource |
|---|---|---|
| Traefik metrics | 11462 (rev 6) | Prometheus |
| Traefik logs | 13702 (rev 1) | Loki |
| CNPG (Postgres) | 20417 (rev 3) | Prometheus |
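A sketch of how the three rows above could be declared in the chart's dashboards.default block. The IDs, revisions, and datasources come from the table; the dashboard keys (traefik-metrics, traefik-logs, cnpg) and the provider settings are hypothetical.

```yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default            # must match the key under "dashboards"
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      traefik-metrics:             # hypothetical key
        gnetId: 11462
        revision: 6
        datasource: Prometheus
      traefik-logs:                # hypothetical key
        gnetId: 13702
        revision: 1
        datasource: Loki
      cnpg:                        # hypothetical key
        gnetId: 20417
        revision: 3
        datasource: Prometheus
```

Each entry is downloaded from grafana.com at startup, so the Grafana pod needs outbound access for the dashboards to appear.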
Beyond that, the kube-prometheus-stack chart bundles its own Kubernetes dashboards (cluster overview, namespace overview, node usage, etc.), which appear under the default "Kubernetes" folder.
A Loki datasource (http://loki:3100) is added through additionalDataSources, so logs and metrics share the same Grafana (see Loki). A Tempo datasource (http://tempo:3100) is also wired in, with bidirectional jumps to Loki (see operations/observability/tempo and prometheus scrape coverage).
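The two extra datasources and their cross-links might be declared as below. The URLs are from the text; the uid values, the trace-ID regex, and the exact linking options are assumptions chosen to show one way of getting jumps in both directions.

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki                          # assumption
      access: proxy
      url: http://loki:3100
      jsonData:
        derivedFields:
          # Logs -> traces: turn a trace ID found in a log line into a Tempo link.
          - name: TraceID
            datasourceUid: tempo
            matcherRegex: "trace_id=(\\w+)"   # assumption about the log format
            url: "$${__value.raw}"            # "$$" escapes "$" in provisioning files
    - name: Tempo
      type: tempo
      uid: tempo                         # assumption
      access: proxy
      url: http://tempo:3100
      jsonData:
        tracesToLogsV2:
          datasourceUid: loki            # traces -> logs
```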
Custom alerting rules
k8s/envs/dev/sys/manifests/alerting-rules.yaml defines a PrometheusRule named tomoda-alerts with four groups:
| Group | Sample alerts |
|---|---|
| tomoda.application | BackendDown, BackendHighErrorRate (>5% 5xx), BackendHighLatency (p95 > 2s) |
| tomoda.database | PostgresDown, PostgresReplicationLag, PostgresHighConnections, PostgresBackupStale, PostgresDiskUsageHigh |
| tomoda.redis | RedisDown, RedisHighMemory |
| tomoda.resources | PodCrashLooping, PVCUsageHigh, NodeMemoryHigh |
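For orientation, a PrometheusRule of this shape looks like the sketch below. Only the resource name, the release label, the group name, and the alert name come from this page; the expression, threshold, duration, severity, and annotation text are illustrative assumptions, not the contents of alerting-rules.yaml.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tomoda-alerts
  labels:
    release: monitoring          # label the chart's default ruleSelector matches on
spec:
  groups:
    - name: tomoda.resources
      rules:
        - alert: PodCrashLooping
          # Assumed expression: more than 3 container restarts within 15 minutes.
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m                # assumed hold time before firing
          labels:
            severity: warning    # assumed
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```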
The release: monitoring label on the rule is what Prometheus picks up via the kube-prometheus-stack's default ruleSelector; Alertmanager only receives the alerts those rules fire. There is no separate Alertmanager routing config in this repo, so alerts go to the chart's default route (a receiver named "null", which discards them) unless customised.
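If routing is customised later, the chart accepts an Alertmanager configuration under alertmanager.config. The sketch below is hypothetical throughout: the team-webhook receiver, its URL, and the grouping labels are placeholders, not anything defined in this repo.

```yaml
alertmanager:
  config:
    route:
      receiver: team-webhook                 # hypothetical default receiver
      group_by: ["alertname", "namespace"]
      routes:
        - receiver: "null"                   # keep the always-firing Watchdog alert silent
          matchers:
            - alertname = "Watchdog"
    receivers:
      - name: "null"                         # receiver with no integrations: drops alerts
      - name: team-webhook
        webhook_configs:
          - url: http://alert-webhook.example.svc:8080/alerts   # hypothetical endpoint
```

Keeping a "null" receiver in the list matters because any route that still references it would otherwise make the configuration invalid.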
Operational notes
- Default Grafana admin password. values.yaml sets adminPassword: "admin". Anyone hitting Grafana via direct port-forward (bypassing oauth2-proxy) can log in. The header-auth path is what matters for the public hostname; the admin password matters if you ever port-forward.
- Storage is the binding constraint. Three days of retention on 5Gi works because the cluster is small. As scrape targets grow, expect the PVC to fill; either bump it or wire remote-write.
- Disabling the chart loses CRDs. As with cert-manager, the kube-prometheus-stack ships its CRDs. Uninstalling cascades to every ServiceMonitor, PrometheusRule, etc. in the cluster.
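On the first note, one possible hardening is to read the admin credentials from a Secret instead of a literal in values.yaml. This is a sketch only: the Secret name is hypothetical and nothing like it exists in the repo as described here.

```yaml
grafana:
  admin:
    existingSecret: grafana-admin-credentials   # hypothetical Secret name
    userKey: admin-user                          # key holding the username
    passwordKey: admin-password                  # key holding the password
```

Since Grafana's database is persisted on the PVC, an admin password that has already been initialised there would likely need to be reset in Grafana itself; changing the values alone may not take effect.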