Prometheus scrape coverage

This page tracks what the cluster's Prometheus actually scrapes. The deployment itself is documented in kubernetes/system/monitoring.md; this page focuses on who is scraped, how, and where the ServiceMonitor / PodMonitor / Probe CR lives.

kube-prometheus-stack's serviceMonitorSelector is set to {} (match everything in any namespace), so any of the resources below light up automatically once they exist.
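A minimal sketch of the chart values that would produce this behaviour, assuming the usual prometheus.prometheusSpec layout of kube-prometheus-stack. Only serviceMonitorSelector: {} is stated on this page; the namespace selectors, the PodMonitor/Probe selectors, and the NilUsesHelmValues flags are assumptions about how the rest is wired, not a copy of the cluster's values file.

```yaml
# Sketch only: keys other than serviceMonitorSelector are assumptions.
prometheus:
  prometheusSpec:
    # Empty label selector: match every ServiceMonitor regardless of labels.
    serviceMonitorSelector: {}
    # Empty namespace selector: look in every namespace.
    serviceMonitorNamespaceSelector: {}
    # With the chart default (true), an empty selector is rendered as
    # "release: <release name>" instead of "match everything".
    serviceMonitorSelectorNilUsesHelmValues: false

    # Analogous settings for the PodMonitor (CNPG) and Probe (Photon) resources.
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}
    podMonitorSelectorNilUsesHelmValues: false
    probeSelector: {}
    probeNamespaceSelector: {}
    probeSelectorNilUsesHelmValues: false
```

If the NilUsesHelmValues flags were left at their defaults, the release: monitoring label would stop being a convention and become a requirement, so it is worth checking the rendered Prometheus CR before relying on either behaviour.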

ServiceMonitor inventory

| Target | Resource type | Owned by | Namespace |
|---|---|---|---|
| Tomoda backend (tomoda-api) | ServiceMonitor tomoda-backend | k8s/apps/tomoda/base/servicemonitor.yaml | tomoda, prod |
| Tempo (own internal metrics) | ServiceMonitor (via Helm serviceMonitor.enabled: true) | k8s/envs/dev/sys/tempo/values.yaml | monitoring |
| Loki | ServiceMonitor (via Helm) | k8s/envs/dev/sys/loki/values.yaml | monitoring |
| Traefik | ServiceMonitor (via Helm) | k8s/envs/dev/sys/traefik/values.yaml | monitoring |
| Redis (dev) | ServiceMonitor (via Bitnami metrics.serviceMonitor.enabled: true) | k8s/envs/dev/middleware/redis/values.yaml | monitoring (cross-ns selector → data) |
| Redis (prod) | ServiceMonitor (via Bitnami) | k8s/envs/prod/middleware/redis/values.yaml | monitoring (cross-ns selector → data) |
| Argo CD (controller, server, repo-server, applicationSet, notifications) | ServiceMonitor (via chart, set in TF) | infrastructure/gcp/argocd.tf | argocd |
| Postgres (CNPG primary + standby) | PodMonitor (CNPG built-in) | k8s/envs/{dev,prod}/middleware/postgres/manifests/cluster.yaml | cnpg-system |
| Blackbox-exporter (own internal metrics) | ServiceMonitor (via Helm) | k8s/envs/dev/sys/blackbox-exporter/values.yaml | monitoring |
| Photon (/status via blackbox) | Probe photon | k8s/envs/dev/sys/manifests/photon-probe.yaml | monitoring |

The release: monitoring label is set on each of these — purely convention, since the operator's selector is empty.

Tomoda backend specifics

The backend exposes /metrics on the same listener as /health (port 8080) when OBSERVABILITY_METRICS_ENABLED=true. That env var is set in both the dev and prod overlay kustomizations:

- name: OBSERVABILITY_METRICS_ENABLED
  value: "true"
- name: OBSERVABILITY_TRACING_ENABLED
  value: "true"
- name: OBSERVABILITY_OTLP_ENDPOINT
  value: "tempo.monitoring.svc.cluster.local:4317"
- name: OBSERVABILITY_OTLP_INSECURE
  value: "true"
- name: OBSERVABILITY_SAMPLE_RATE
  value: "0.1"

The base ServiceMonitor selects app: tomoda-api. The dev/prod overlays apply commonLabels: env: dev|prod, which kustomize propagates to both the resource labels and the selector — so the same base manifest correctly scopes to the right Service in each namespace.
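For orientation, a base ServiceMonitor of the kind described above might look roughly like this. The resource name, the app: tomoda-api selector, and the /metrics path come from this page; the port name http and the 30s interval are hypothetical placeholders, not values read from servicemonitor.yaml.

```yaml
# Sketch of k8s/apps/tomoda/base/servicemonitor.yaml; port name and interval are assumed.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tomoda-backend
  labels:
    release: monitoring        # convention only, per the inventory note
spec:
  selector:
    matchLabels:
      app: tomoda-api          # overlays are expected to add env: dev|prod here
  endpoints:
    - port: http               # hypothetical; must equal the Service's ports[].name
      path: /metrics
      interval: 30s            # assumed scrape interval
```

Because ServiceMonitor is a custom resource, it is worth running kustomize build on each overlay and confirming that env actually appears under spec.selector.matchLabels and not only under metadata.labels; whether it does may depend on the label transformer configuration in use.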

NetworkPolicy

tomoda-api-policy and tomoda-async-policy (k8s/apps/tomoda/base/network-policy.yaml) allow inbound :8080 from the monitoring namespace, alongside Traefik. Without that rule, scrape attempts would be silently dropped — up{job="backend-service"} would be zero with no obvious cause.
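The scrape rule can be sketched as follows. This is an illustration, not the contents of network-policy.yaml: the pod selector label is assumed to be app: tomoda-api, the Traefik rule is omitted, and the namespace is matched through the kubernetes.io/metadata.name label that recent Kubernetes versions set automatically.

```yaml
# Sketch only: podSelector label is an assumption; the Traefik ingress rule is left out.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tomoda-api-policy
spec:
  podSelector:
    matchLabels:
      app: tomoda-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Allow any pod in the monitoring namespace (i.e. Prometheus) to connect.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080           # shared /metrics and /health listener
```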

Argo CD specifics

The Helm release in infrastructure/gcp/argocd.tf now enables per-component metrics. Five ServiceMonitors land in the argocd namespace — one each for:

  • argocd-application-controller — sync queue depth, app health counts
  • argocd-server — API request rates
  • argocd-repo-server — repo cache stats, manifest generation latency
  • argocd-applicationset-controller
  • argocd-notifications-controller

These metrics are the ones consumed by the standard Argo CD Grafana dashboard (grafana.com dashboard ID 14584), should that dashboard be added later.
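As a rough guide to what the Terraform helm_release passes to the chart, the equivalent values might look like the sketch below. It assumes the community argo-cd Helm chart's per-component metrics keys; the actual settings live in infrastructure/gcp/argocd.tf and may be expressed as set blocks in Terraform, not as a values document.

```yaml
# Sketch only: assumes the argo-cd chart's <component>.metrics.serviceMonitor layout.
controller:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
server:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
repoServer:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
applicationSet:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
notifications:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
```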

Redis prod

Prior to this change, dev had metrics.enabled: true but prod did not — Bitnami's chart simply didn't render the redis-exporter sidecar or the ServiceMonitor in prod. Prod values now mirror dev:

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: monitoring
    namespaceSelector:
      matchNames:
        - data
    additionalLabels:
      release: monitoring

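prod-redis-master now appears alongside redis-master under up{job=~"(prod-)?redis.*"}. Prometheus regex matchers are fully anchored, so job=~"prod-redis.*" on its own returns only the prod series.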

Photon — blackbox probe

Photon doesn't expose Prometheus metrics. The prometheus-blackbox-exporter Helm chart in k8s/envs/dev/sys/blackbox-exporter/ deploys an HTTP prober, and a Probe CR in k8s/envs/dev/sys/manifests/photon-probe.yaml tells Prometheus to ask blackbox to hit http://photon.data.svc.cluster.local:2322/status every 30s.
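A Probe CR along these lines would express that setup. The target URL, the 30s interval, the resource name, and the namespace come from this page; the prober address, the http_2xx module name, and the service: photon static label are assumptions and should be checked against photon-probe.yaml and the blackbox-exporter release.

```yaml
# Sketch of a Probe for Photon; prober.url, module, and labels are assumed values.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: photon
  namespace: monitoring
  labels:
    release: monitoring
spec:
  jobName: photon
  interval: 30s
  module: http_2xx             # assumed blackbox module; must exist in the exporter config
  prober:
    # Hypothetical in-cluster address of the blackbox-exporter Service.
    url: blackbox-exporter.monitoring.svc.cluster.local:9115
  targets:
    staticConfig:
      static:
        - http://photon.data.svc.cluster.local:2322/status
      labels:
        service: photon        # assumed; gives alerts a stable label to match on
```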

The resulting metrics (with target=http://photon.data.svc.cluster.local:2322/status):

  • probe_success — 1 if the request succeeded, 0 otherwise
  • probe_duration_seconds — how long the request took
  • probe_http_status_code — last response code

A "Photon down" alert keys off probe_success{service="photon"} == 0 for 2m.
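Expressed as a PrometheusRule, that alert could be sketched like this. The expression and the 2m hold time are taken from the sentence above; the rule name, group name, severity, and annotation text are illustrative placeholders.

```yaml
# Sketch only: names, severity, and annotations are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: photon-alerts
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: photon
      rules:
        - alert: PhotonDown
          # Fires once the blackbox probe has failed continuously for two minutes.
          expr: probe_success{service="photon"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: Photon /status probe has been failing for 2 minutes
```

Note that this expression only fires while the probe series exists; if the Probe CR itself disappears, probe_success goes absent and the alert does not fire.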

Debugging "where are my metrics?"

# Does the ServiceMonitor exist? (Confirms the CR is present, not that Prometheus picked it up.)
kubectl get servicemonitor -A | grep <name>

# Does Prometheus actually have the target?
# port-forward blocks, so run it in a separate terminal.
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets → look for the job

# Is the pod's /metrics endpoint working? (Again, port-forward in a separate terminal.)
kubectl port-forward -n <ns> <pod> 8080:8080
curl -s localhost:8080/metrics | head -20

# Is a NetworkPolicy blocking the scrape?
kubectl get networkpolicy -n <ns>

If a ServiceMonitor exists but the target is missing from Prometheus, the most common causes are: (a) wrong port name (endpoints[].port must match the Service's ports[].name, not the number), (b) the Service's selector doesn't actually hit any pods, (c) a NetworkPolicy blocks ingress from monitoring.
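Cause (a) is the easiest to miss, so here is a stripped-down pair showing which fields have to agree. The resource names, labels, and the port name metrics are placeholders for illustration only.

```yaml
# Illustrative pair: names and labels are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: example-api
  labels:
    app: example-api           # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: example-api           # must match at least one running pod (cause b)
  ports:
    - name: metrics            # the ServiceMonitor refers to this name
      port: 8080
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-api
spec:
  selector:
    matchLabels:
      app: example-api
  endpoints:
    - port: metrics            # a port name, not 8080; a mismatch yields no target
      path: /metrics
```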