Tempo
Distributed-tracing backend. The tomoda backend emits OpenTelemetry spans (Go OTel SDK + OTLP gRPC exporter), Tempo stores them in GCS, and Grafana queries Tempo as a data source. Spans link back to Loki logs via trace_id.
Installed by k8s/envs/dev/sys/tempo/application.yaml, configured by k8s/envs/dev/sys/tempo/values.yaml. The GCS bucket and Workload Identity binding come from infrastructure/gcp/tempo.tf.
Chart and source
| Field | Value |
|---|---|
| Helm chart | tempo (monolithic single-binary mode) |
| Repository | https://grafana.github.io/helm-charts |
| Version | 1.10.1 |
| Destination namespace | monitoring (shared with kube-prometheus-stack and Loki) |
| Argo CD Application | tempo |
We picked the monolithic chart over tempo-distributed deliberately: trace volume is low (10% sampling on a small backend), and a single replica with GCS storage is plenty. If write throughput grows past a few thousand spans/sec, switch to tempo-distributed (distributors + ingesters + queriers split).
Storage: GCS backend
Tempo stores trace blocks in ${project_id}-tomoda-traces (provisioned by infrastructure/gcp/tempo.tf). The bucket has a 30-day lifecycle rule that deletes old objects; Tempo's compactor block_retention is aligned to the same 720h window so the index doesn't reference deleted blocks.
resource "google_storage_bucket" "tomoda_traces" {
  name     = "${var.project_id}-tomoda-traces"
  location = var.region

  # Delete trace blocks after 30 days; keep in sync with Tempo's block_retention (720h).
  lifecycle_rule {
    condition { age = 30 }
    action { type = "Delete" }
  }
}
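Because the lifecycle age is expressed in days and block_retention in hours, the two can drift apart when only one is edited. The following is a minimal sketch of a check that could run in CI; the function name retentionAligned is hypothetical, and how the two values are read from tempo.tf and values.yaml is left out.

```go
package main

import (
	"fmt"
	"time"
)

// retentionAligned reports whether a GCS lifecycle age (whole days) covers
// exactly the same window as a Tempo block_retention duration string.
// Hypothetical helper for illustration; not part of the repository.
func retentionAligned(lifecycleDays int, blockRetention string) (bool, error) {
	d, err := time.ParseDuration(blockRetention)
	if err != nil {
		return false, fmt.Errorf("parse block_retention %q: %w", blockRetention, err)
	}
	return d == time.Duration(lifecycleDays)*24*time.Hour, nil
}

func main() {
	ok, err := retentionAligned(30, "720h")
	fmt.Println(ok, err) // prints "true <nil>"
}
```

Note that Go's duration parser has no day unit, so a value such as "30d" is rejected rather than silently misread.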
Authentication is via Workload Identity — no service account key is mounted. The GCP SA tempo@${project_id}.iam.gserviceaccount.com is bound to the K8s SA monitoring/tempo, which is what the Tempo Helm chart creates via serviceAccount.name: tempo. The link is the annotation on the KSA:
serviceAccount:
  create: true
  name: tempo
  annotations:
    iam.gke.io/gcp-service-account: tempo@development-485000.iam.gserviceaccount.com
No secrets in GCP Secret Manager are required for Tempo.
Ingestion: OTLP gRPC
Tempo's receivers config exposes OTLP on both gRPC (:4317) and HTTP (:4318). The tomoda backend uses gRPC:
OBSERVABILITY_OTLP_ENDPOINT=tempo.monitoring.svc.cluster.local:4317
OBSERVABILITY_OTLP_INSECURE=true
OBSERVABILITY_SAMPLE_RATE=0.1
Apps in any namespace can reach Tempo via that FQDN. Insecure (plain gRPC, no mTLS) is fine inside the cluster — NetworkPolicies and the trust boundary already restrict who can connect.
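As an illustration of how an application might consume these three variables, here is a stdlib-only sketch. The TracingConfig type, the loadTracingConfig function, and the defaults (insecure off, sample rate 1.0 when unset) are assumptions made for the example, not the tomoda backend's actual code; wiring the result into the OTLP exporter is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// TracingConfig mirrors the three environment variables listed above.
// Hypothetical type for illustration.
type TracingConfig struct {
	Endpoint   string
	Insecure   bool
	SampleRate float64
}

// loadTracingConfig reads the variables through getenv so that callers can
// pass os.Getenv in production and a map-backed function in tests.
func loadTracingConfig(getenv func(string) string) (TracingConfig, error) {
	cfg := TracingConfig{
		Endpoint:   getenv("OBSERVABILITY_OTLP_ENDPOINT"),
		SampleRate: 1.0, // assumed default: sample everything when unset
	}
	if cfg.Endpoint == "" {
		return cfg, errors.New("OBSERVABILITY_OTLP_ENDPOINT is required")
	}
	if v := getenv("OBSERVABILITY_OTLP_INSECURE"); v != "" {
		insecure, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("OBSERVABILITY_OTLP_INSECURE: %w", err)
		}
		cfg.Insecure = insecure
	}
	if v := getenv("OBSERVABILITY_SAMPLE_RATE"); v != "" {
		rate, err := strconv.ParseFloat(v, 64)
		if err != nil {
			return cfg, fmt.Errorf("OBSERVABILITY_SAMPLE_RATE: %w", err)
		}
		// Written as a negated range check so that NaN is rejected too.
		if !(rate >= 0 && rate <= 1) {
			return cfg, fmt.Errorf("OBSERVABILITY_SAMPLE_RATE %v is outside [0, 1]", rate)
		}
		cfg.SampleRate = rate
	}
	return cfg, nil
}

func main() {
	cfg, err := loadTracingConfig(os.Getenv)
	if err != nil {
		fmt.Println("tracing disabled:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Failing fast on a malformed sample rate is a design choice of the sketch; an application could equally fall back to a default and log a warning.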
The sampling rate is set on the application side (head sampling — Go SDK decides per-trace). Tempo itself doesn't sub-sample.
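To make "decides per-trace" concrete, the sketch below shows a deterministic ratio-based decision derived from the trace ID alone, so every service that sees the same trace reaches the same verdict. It is similar in spirit to the Go SDK's TraceIDRatioBased sampler, but it is a stdlib-only illustration; whether the tomoda backend uses that sampler is an assumption, and shouldSample is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// shouldSample keeps a trace when the low 8 bytes of its ID, read as a
// 63-bit number, fall below rate * 2^63. The decision depends only on the
// trace ID, so it is stable across spans and services.
func shouldSample(traceID [16]byte, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	upperBound := uint64(rate * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

func main() {
	const n = 1000
	step := uint64(math.MaxUint64 / n)
	kept := 0
	for i := 0; i < n; i++ {
		// Spread synthetic trace IDs evenly over the ID space.
		var id [16]byte
		binary.BigEndian.PutUint64(id[8:], uint64(i)*step)
		if shouldSample(id, 0.1) {
			kept++
		}
	}
	fmt.Printf("kept %d of %d traces\n", kept, n)
}
```

With evenly spread IDs and a rate of 0.1, the count printed should be close to 100 of 1000. Real trace IDs are random, so the kept fraction only approaches the configured rate over many traces.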
Grafana integration
The Tempo data source is added in k8s/envs/dev/sys/monitoring/values.yaml alongside Loki:
- name: Tempo
  type: tempo
  uid: tempo
  url: http://tempo:3100
  jsonData:
    tracesToLogsV2:
      datasourceUid: loki
      filterByTraceID: true
      query: '{namespace="$${__span.tags["k8s.namespace.name"]}"} | json | trace_id="$${__span.traceId}"'
Two cross-data-source jumps are wired up:
- Trace → logs: click a span in the Tempo UI to jump to a Loki query filtered to the same trace_id in a ±5min window. The query uses LogQL's | json stage to extract trace_id from the JSON log body at query time (Loki 2.x doesn't support structured metadata yet; see loki.md).
- Logs → trace: the Loki data source has a derivedFields rule that matches "trace_id":"..." in JSON log lines and renders it as a clickable Tempo link.
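Both jumps boil down to string handling on the trace ID. The sketch below illustrates them in Go: a regular expression that pulls trace_id out of a JSON log line (what a derivedFields matcher does), and a function that assembles the LogQL query for the trace → logs direction. The exact regex and both function names are assumptions for illustration; the real rule lives in the Loki data source config.

```go
package main

import (
	"fmt"
	"regexp"
)

// traceIDPattern is an assumed matcher for the derivedFields rule: it
// captures the hex value of a "trace_id" key in a compact JSON log line.
var traceIDPattern = regexp.MustCompile(`"trace_id":"([0-9a-f]+)"`)

// extractTraceID returns the captured trace ID, or false if the line has none.
func extractTraceID(line string) (string, bool) {
	m := traceIDPattern.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// traceToLogsQuery builds the LogQL query for the trace → logs jump:
// select the namespace stream, parse the JSON body, filter on trace_id.
func traceToLogsQuery(namespace, traceID string) string {
	return fmt.Sprintf(`{namespace=%q} | json | trace_id=%q`, namespace, traceID)
}

func main() {
	line := `{"level":"info","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","msg":"order created"}`
	if id, ok := extractTraceID(line); ok {
		fmt.Println(traceToLogsQuery("tomoda", id))
	}
}
```

The namespace value "tomoda" and the log line are made up for the example. Because the filter runs after | json, Loki has to parse every line in the selected streams, which is why the time window around the span is kept narrow.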
Operational notes
- Single replica. Monolithic mode means one Tempo pod handles ingest, query, and compaction. Restarts cause brief ingest gaps: OTel exporters retry, but the SDK's queue is bounded, so bursts during a restart can drop spans. This is acceptable at current volume; if you see drops in exporter metrics (for example otelcol_exporter_send_failed_spans_total, which is reported by an OpenTelemetry Collector if one sits in the path), bump replicas or switch to tempo-distributed.
- GCS is the source of truth. The 5Gi PVC holds only the WAL; actual trace data lives in GCS. PVC loss is recoverable; bucket loss is not.
- No multi-tenant mode. multitenancy_enabled: false means all spans share one tenant ID (single-tenant). If we ever onboard a second team, flip it on and add tenant headers to the OTel exporter config.
- 30-day retention. Driven by the GCS lifecycle rule. Bump the lifecycle age and Tempo's block_retention together if longer trace history is needed.
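The span-drop behavior in the single-replica note follows from how a bounded, non-blocking queue works: when the consumer stalls (here, Tempo restarting), new items are rejected instead of blocking the application. The sketch below is a simplified stdlib model of that idea, not the OTel SDK's implementation; the type, its capacity of 3, and the span names are made up for the example.

```go
package main

import "fmt"

// boundedQueue rejects new items when full instead of blocking the caller.
// The dropped counter is not goroutine-safe; the sketch uses one goroutine.
type boundedQueue struct {
	items   chan string
	dropped int
}

func newBoundedQueue(capacity int) *boundedQueue {
	return &boundedQueue{items: make(chan string, capacity)}
}

// enqueue returns false and counts a drop when the queue is full.
func (q *boundedQueue) enqueue(span string) bool {
	select {
	case q.items <- span:
		return true
	default:
		q.dropped++
		return false
	}
}

func main() {
	// Nothing drains the queue, which models an exporter stuck retrying.
	q := newBoundedQueue(3)
	for i := 0; i < 5; i++ {
		q.enqueue(fmt.Sprintf("span-%d", i))
	}
	fmt.Println(len(q.items), q.dropped) // prints "3 2"
}
```

Dropping rather than blocking keeps request latency unaffected by tracing outages, at the cost of gaps in traces during the outage.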
Debugging
# Verify Tempo is healthy
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo

# Check it's accepting traces (look for "received OTLP traces" lines)
# Note: the monolithic chart runs Tempo as a StatefulSet, hence statefulset/tempo
kubectl logs -n monitoring statefulset/tempo --tail=200 | grep -i otlp

# Confirm Workload Identity is working (assumes curl is available in the Tempo image)
kubectl exec -n monitoring statefulset/tempo -- \
  curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
# Expected: tempo@development-485000.iam.gserviceaccount.com

# List recent trace blocks in GCS
gsutil ls -lh gs://development-485000-tomoda-traces/single-tenant/ | tail -20

# Tempo's /ready endpoint (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n monitoring svc/tempo 3100:3100
curl localhost:3100/ready
If the backend logs "failed to upload span batch: ... PermissionDenied", the Workload Identity link is broken; check the KSA annotation and the IAM binding.