Prometheus scrape coverage

This page tracks what the cluster's Prometheus actually scrapes. The deployment itself is documented in kubernetes/system/monitoring.md; this page focuses on who is scraped, how, and where the ServiceMonitor / PodMonitor / Probe CR lives.

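kube-prometheus-stack's serviceMonitorSelector is set to {}, which matches every ServiceMonitor regardless of its labels. Namespace scope is governed separately by serviceMonitorNamespaceSelector, and PodMonitors and Probes have their own selectors; this page assumes all of them are equally permissive, so any of the resources below is picked up automatically once it exists.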
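
As a rough illustration of what "match everything" can look like in the chart values, the fragment below is a sketch rather than a copy of the cluster's actual values file. The field names are those of the kube-prometheus-stack chart; whether the `*NilUsesHelmValues` flags are needed depends on the chart version in use, so treat them as an assumption to verify.

```yaml
# Sketch only: selector-related values for kube-prometheus-stack.
prometheus:
  prometheusSpec:
    # In chart versions that have these flags, leaving them at true makes an
    # unset selector fall back to matching the Helm release label.
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false

    # Empty label selectors match every object of that kind.
    serviceMonitorSelector: {}
    podMonitorSelector: {}
    probeSelector: {}

    # Empty namespace selectors search every namespace.
    serviceMonitorNamespaceSelector: {}
    podMonitorNamespaceSelector: {}
    probeNamespaceSelector: {}
```

The rendered result can be checked on the Prometheus custom resource itself, for example with `kubectl get prometheus -n monitoring -o yaml`.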

ServiceMonitor inventory

| Target | Resource type | Owned by | Namespace |
| --- | --- | --- | --- |
| Tomoda backend (`tomoda-api`) | ServiceMonitor `tomoda-backend` | `k8s/apps/tomoda/base/servicemonitor.yaml` | `tomoda`, `prod` |
| Tempo (own internal metrics) | ServiceMonitor (via Helm `serviceMonitor.enabled: true`) | `k8s/envs/platform/tempo/values.yaml` | `monitoring` |
| Loki | ServiceMonitor (via Helm) | `k8s/envs/platform/loki/values.yaml` | `monitoring` |
| Traefik | ServiceMonitor (via Helm) | `k8s/envs/platform/traefik/values.yaml` | `monitoring` |
| Redis (dev) | ServiceMonitor (via Bitnami `metrics.serviceMonitor.enabled: true`) | `k8s/envs/dev/redis/values.yaml` | `monitoring` (cross-ns selector → `data`) |
| Redis (prod) | ServiceMonitor (via Bitnami) | `k8s/envs/prod/redis/values.yaml` | `monitoring` (cross-ns selector → `data`) |
| Argo CD (controller, server, repo-server, applicationSet, notifications) | ServiceMonitor (via chart, set in TF) | `infrastructure/gcp/argocd.tf` | `argocd` |
| Postgres (CNPG primary + standby) | PodMonitor (CNPG built-in) | `k8s/envs/{dev,prod}/postgres/manifests/cluster.yaml` | `cnpg-system` |
| Blackbox-exporter (own internal metrics) | ServiceMonitor (via Helm) | `k8s/envs/platform/blackbox-exporter/values.yaml` | `monitoring` |
| Photon (`/status` via blackbox) | Probe `photon` | `k8s/envs/platform/manifests/photon-probe.yaml` | `monitoring` |

The release: monitoring label is set on each of these purely by convention; the operator's selector is empty, so the label is not required for discovery.

Tomoda backend specifics

The backend exposes /metrics on the same listener as /health (port 8080) when OBSERVABILITY_METRICS_ENABLED=true. That env var is set in both the dev and prod overlay kustomizations:

- name: OBSERVABILITY_METRICS_ENABLED
  value: "true"
- name: OBSERVABILITY_TRACING_ENABLED
  value: "true"
- name: OBSERVABILITY_OTLP_ENDPOINT
  value: "tempo.monitoring.svc.cluster.local:4317"
- name: OBSERVABILITY_OTLP_INSECURE
  value: "true"
- name: OBSERVABILITY_SAMPLE_RATE
  value: "0.1"

The base ServiceMonitor selects app: tomoda-api. The dev/prod overlays apply commonLabels: env: dev|prod, which kustomize propagates to both the resource labels and the selector — so the same base manifest correctly scopes to the right Service in each namespace.
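
The interplay between the base manifest and the overlay label can be sketched as follows. This is not the content of the real files: the port name `http`, the scrape interval, and the relative path to the base are assumptions made for illustration.

```yaml
# Sketch of k8s/apps/tomoda/base/servicemonitor.yaml (not the actual file).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tomoda-backend
  labels:
    release: monitoring
spec:
  # With no namespaceSelector, only Services in the ServiceMonitor's own
  # namespace are considered.
  selector:
    matchLabels:
      app: tomoda-api
  endpoints:
    - port: http        # assumed name of the Service port that maps to 8080
      path: /metrics
      interval: 30s     # assumed
---
# Sketch of a prod overlay kustomization.yaml.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # hypothetical relative path
commonLabels:
  env: prod             # dev overlay would use env: dev
```

Because ServiceMonitor is a custom resource, whether kustomize also rewrites `spec.selector.matchLabels` can depend on the label transformer configuration in use. Inspecting the output of `kustomize build` for the overlay is a quick way to confirm that the selector ends up as `app: tomoda-api` plus the `env` label.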

NetworkPolicy

tomoda-api-policy and tomoda-async-policy (k8s/apps/tomoda/base/network-policy.yaml) allow inbound :8080 from the monitoring namespace, alongside Traefik. Without that rule, scrape attempts would be silently dropped — up{job="backend-service"} would be zero with no obvious cause.
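
A minimal sketch of the kind of ingress rule described above is shown here. It is not the actual policy: the pod selector label is an assumption, the Traefik rule is omitted, and the namespace is matched through the `kubernetes.io/metadata.name` label that Kubernetes sets on namespaces automatically.

```yaml
# Sketch only: allow Prometheus (in the monitoring namespace) to scrape :8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tomoda-api-policy
spec:
  podSelector:
    matchLabels:
      app: tomoda-api          # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080
```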

Argo CD specifics

The Helm release in infrastructure/gcp/argocd.tf now enables per-component metrics. Five ServiceMonitors land in the argocd namespace — one each for:

argocd-application-controller — sync queue depth, app health counts
argocd-server — API request rates
argocd-repo-server — repo cache stats, manifest generation latency
argocd-applicationset-controller
argocd-notifications-controller

These metrics are what the standard Argo CD Grafana dashboard (grafana.com ID 14584) consumes, should it be added later.
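
For reference, the per-component switches could look like the following when written as chart values. This assumes the argo-helm `argo-cd` chart and shows plain YAML values only; how the Terraform release in infrastructure/gcp/argocd.tf actually passes them (a values document or individual set blocks) is not shown here.

```yaml
# Sketch only: argo-cd chart values enabling metrics and a ServiceMonitor
# for each of the five components.
controller:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
server:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
repoServer:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
applicationSet:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
notifications:
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
```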

Redis prod

Prior to this change, dev had metrics.enabled: true but prod did not — Bitnami's chart simply didn't render the redis-exporter sidecar or the ServiceMonitor in prod. Prod values now mirror dev:

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    namespace: monitoring
    namespaceSelector:
      matchNames:
        - data
    additionalLabels:
      release: monitoring

redis-prod-master now appears alongside redis-master under up{job=~"redis.*"}.

Photon — blackbox probe

Photon doesn't expose Prometheus metrics. The prometheus-blackbox-exporter Helm chart in k8s/envs/platform/blackbox-exporter/ deploys an HTTP prober, and a Probe CR in k8s/envs/platform/manifests/photon-probe.yaml tells Prometheus to ask blackbox to hit http://photon.platform.svc.cluster.local:2322/status every 30s.
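
A Probe of this shape might look like the sketch below. It is not the content of photon-probe.yaml: the prober address, the `http_2xx` module name, and the `service: photon` target label are assumptions, the last one chosen so that the resulting series carry a `service` label.

```yaml
# Sketch only: Probe asking blackbox-exporter to check Photon's /status.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: photon
  namespace: monitoring
  labels:
    release: monitoring
spec:
  interval: 30s
  module: http_2xx             # assumed blackbox module name
  prober:
    # Hypothetical in-cluster address of the blackbox-exporter Service.
    url: blackbox-exporter.monitoring.svc.cluster.local:9115
  targets:
    staticConfig:
      static:
        - http://photon.platform.svc.cluster.local:2322/status
      labels:
        service: photon        # assumed extra label on the probe series
```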

The resulting metrics (with target=http://photon.platform.svc.cluster.local:2322/status):

probe_success — 1 if the request succeeded, 0 otherwise
probe_duration_seconds — how long the request took
probe_http_status_code — last response code

A "Photon down" alert keys off probe_success{service="photon"} == 0 for 2m.
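
Expressed as a PrometheusRule, the alert could be sketched like this. The resource name, alert name, severity, and annotation text are hypothetical; only the expression and the 2m hold time come from the description above.

```yaml
# Sketch only: alert when the Photon probe has failed for two minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: photon-alerts          # hypothetical
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: photon
      rules:
        - alert: PhotonDown    # hypothetical alert name
          expr: probe_success{service="photon"} == 0
          for: 2m
          labels:
            severity: critical # assumed
          annotations:
            summary: Photon /status probe has been failing for 2 minutes
```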

Debugging "where are my metrics?"

# Is the ServiceMonitor visible to Prometheus?
kubectl get servicemonitor -A | grep <name>

# Does Prometheus actually have the target?
# (port-forward blocks, so run it in its own terminal)
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090/targets → look for the job

# Is the pod's /metrics endpoint working?
kubectl port-forward -n <ns> pod/<pod> 8080:8080
# In a second terminal:
curl -s localhost:8080/metrics | head -20

# Is a NetworkPolicy blocking the scrape?
kubectl get networkpolicy -n <ns>

If a ServiceMonitor exists but the target is missing from Prometheus, the most common causes are: (a) wrong port name (endpoints[].port must match the Service's ports[].name, not the number), (b) the Service's selector doesn't actually hit any pods, or (c) a NetworkPolicy blocks ingress from monitoring.
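
Cause (a) is the easiest to miss, so here is a small sketch of a Service and ServiceMonitor pair that line up correctly. All names (`example-api`, the `http` port name) are hypothetical and not taken from this repository.

```yaml
# Sketch only: the ServiceMonitor endpoint refers to the Service port by name.
apiVersion: v1
kind: Service
metadata:
  name: example-api
  labels:
    app: example-api           # what the ServiceMonitor selector matches
spec:
  selector:
    app: example-api           # must match the pod labels, see cause (b)
  ports:
    - name: http               # the name the ServiceMonitor refers to
      port: 8080
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-api
spec:
  selector:
    matchLabels:
      app: example-api
  endpoints:
    - port: http               # port name, not 8080
      path: /metrics
```

For cause (b), `kubectl get endpoints -n <ns> <service>` shows whether the Service currently resolves to any pod addresses.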