Tempo

Distributed-tracing backend. The tomoda backend emits OpenTelemetry spans (Go OTel SDK + OTLP gRPC exporter), Tempo stores them in GCS, and Grafana queries Tempo as a data source. Spans link back to Loki logs via trace_id.

Installed by k8s/envs/platform/tempo/application.yaml, configured by k8s/envs/platform/tempo/values.yaml. The GCS bucket and Workload Identity binding come from infrastructure/gcp/tempo.tf.

Chart and source

Field	Value
Helm chart	`tempo` (monolithic single-binary mode)
Repository	`https://grafana.github.io/helm-charts`
Version	`1.10.1`
Destination namespace	`monitoring` (shared with kube-prometheus-stack and Loki)
Argo CD Application	`tempo`

We picked the monolithic chart over tempo-distributed deliberately: trace volume is low (10% sampling on a small backend), and a single replica with GCS storage is plenty. If write throughput grows past a few thousand spans/sec, switch to tempo-distributed (distributors + ingesters + queriers split).

Storage: shared GCS bucket

Tempo stores trace blocks in tomoda-observability-${project_id} (provisioned by infrastructure/gcp/tempo.tf). Naming follows the same tomoda-{name}-{project_id} convention as tomoda-db-backups-${project_id}. The bucket is shared with Loki — Tempo writes under prefix tempo/, Loki under loki/ — so the two writers can't collide and per-product blast radius is preserved. The bucket has a 14-day lifecycle rule that deletes old objects across both prefixes; Tempo's compactor block_retention is aligned to the same 336h window so the index doesn't reference deleted blocks.

resource "google_storage_bucket" "tomoda_observability" {
  name     = "tomoda-observability-${var.project_id}"
  location = var.region
  lifecycle_rule {
    condition { age = 14 }
    action    { type = "Delete" }
  }
}
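
The lifecycle rule is expressed in days and block_retention in hours, so it can help to check one against the other before changing either. A minimal shell sketch; days_to_hours and retention_aligned are hypothetical helpers for illustration, not part of the repo:

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): convert a GCS lifecycle age
# in days to the hour-based duration string used for block_retention.
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

# Succeeds when the lifecycle age (days) and block_retention (e.g. "336h") agree.
retention_aligned() {
  [ "$(days_to_hours "$1")" = "$2" ]
}

days_to_hours 14   # prints 336h
retention_aligned 14 336h && echo "aligned" || echo "mismatch"
```

If the lifecycle age is ever raised, the same check shows which block_retention value has to change with it.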

Authentication is via Workload Identity — no service account key is mounted. A single GCP SA observability@${project_id}.iam.gserviceaccount.com is bound to both monitoring/tempo and monitoring/loki KSAs (one binding each, same SA). The Tempo Helm chart creates its KSA via serviceAccount.name: tempo with this annotation:

serviceAccount:
  create: true
  name: tempo
  annotations:
    iam.gke.io/gcp-service-account: observability@development-485000.iam.gserviceaccount.com

The chart values also pin the bucket name and prefix:

tempo:
  storage:
    trace:
      backend: gcs
      gcs:
        bucket_name: tomoda-observability-development-485000
        prefix: tempo

No secrets in GCP Secret Manager are required for Tempo. See loki.md for the matching Loki side of the same SA.
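
For reference, a Workload Identity binding is normally a grant of roles/iam.workloadIdentityUser on the GCP SA to a member of the form serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA]. The sketch below builds that member string for both KSAs. wi_member is a hypothetical helper, and the commented gcloud call is only one way to compare the result with what tempo.tf actually applied:

```shell
#!/bin/sh
# Hypothetical helper: build the Workload Identity member string for a KSA.
# Arguments: project ID, Kubernetes namespace, Kubernetes service account name.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

PROJECT_ID=development-485000   # project ID as it appears in the values above

wi_member "$PROJECT_ID" monitoring tempo
wi_member "$PROJECT_ID" monitoring loki

# To compare with the live bindings (needs gcloud and IAM read access):
# gcloud iam service-accounts get-iam-policy \
#   "observability@${PROJECT_ID}.iam.gserviceaccount.com" --format=json
```

Both printed members should appear in the SA's IAM policy if the two bindings are in place.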

Ingestion: OTLP gRPC

Tempo's receivers config exposes OTLP on both gRPC (:4317) and HTTP (:4318). The tomoda backend uses gRPC:

OBSERVABILITY_OTLP_ENDPOINT=tempo.monitoring.svc.cluster.local:4317
OBSERVABILITY_OTLP_INSECURE=true
OBSERVABILITY_SAMPLE_RATE=0.1

Apps in any namespace can reach Tempo via that FQDN. Insecure (plain gRPC, no mTLS) is fine inside the cluster — NetworkPolicies and the trust boundary already restrict who can connect.

The sampling rate is set on the application side (head sampling — Go SDK decides per-trace). Tempo itself doesn't sub-sample.
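
Because sampling happens at the head, the span rate Tempo receives is roughly the emitted rate multiplied by OBSERVABILITY_SAMPLE_RATE (approximate, since the decision is per trace and traces differ in span count). A back-of-envelope sketch; sampled_rate is a hypothetical helper and the 20000 spans/sec input is an illustrative number, not a measurement:

```shell
#!/bin/sh
# Hypothetical helper: estimate spans/sec reaching Tempo after head sampling.
# Arguments: spans/sec emitted before sampling, sample rate between 0 and 1.
sampled_rate() {
  awk -v rate="$1" -v sample="$2" 'BEGIN { printf "%.0f\n", rate * sample }'
}

# Illustrative input only: 20000 spans/sec before sampling, sampled at 0.1.
sampled_rate 20000 0.1
```

An estimate like this is what to compare against the "few thousand spans/sec" threshold when deciding whether the monolithic chart is still enough.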

Grafana integration

The Tempo data source is added in k8s/envs/platform/monitoring/values.yaml alongside Loki:

- name: Tempo
  type: tempo
  uid: tempo
  url: http://tempo:3100
  jsonData:
    tracesToLogsV2:
      datasourceUid: loki
      filterByTraceID: true
      query: '{namespace="$${__span.tags["k8s.namespace.name"]}"} | json | trace_id="$${__span.traceId}"'

Two cross-data-source jumps are wired up:

Trace → logs: click a span in Grafana's trace view to jump to a Loki query filtered to the same trace_id in a ±5min window. The query uses LogQL's | json stage to extract trace_id from the JSON log body at query time (Loki 2.x doesn't support structured metadata yet; see loki.md).
Logs → trace: the Loki data source has a derivedFields rule that matches "trace_id":"..." in JSON log lines and renders it as a clickable Tempo link.
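
The logs → trace link only works when trace_id appears as a JSON string field in the log line. A small shell sketch for checking a sample line locally; the regex is an assumption modeled on the description above, not the rule copied from the Loki data source, and the log line is made up:

```shell
#!/bin/sh
# Hypothetical check: extract trace_id from a JSON log line on stdin, using a
# regex similar in spirit to a derivedFields matcher.
# Prints nothing when the field is absent.
extract_trace_id() {
  sed -nE 's/.*"trace_id":"([0-9a-f]+)".*/\1/p'
}

# Made-up log line for illustration.
line='{"level":"info","msg":"request done","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}'
printf '%s\n' "$line" | extract_trace_id
```

If this prints nothing for a real log line, the derived field is unlikely to render a link for it either, for example because the logger emits a different key or adds whitespace after the colon.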

Operational notes

Single replica. Monolithic mode means one Tempo pod handles ingest + query + compaction. Restarts cause brief ingest gaps: OTel exporters retry, but the SDK's queue is bounded, so bursts during a restart can drop spans. Acceptable at current volume. If you see export failures, bump replicas or switch chart. Note that otelcol_exporter_send_failed_spans_total is an OpenTelemetry Collector metric, so it only exists if a Collector sits between the app and Tempo; otherwise check the app's exporter error logs.
GCS is the source of truth. The 5Gi PVC is just for WAL — actual trace data lives in GCS. PVC loss is recoverable; bucket loss is not.
No multi-tenant mode. multitenancy_enabled: false, so all spans share one tenant ID (single-tenant). If we ever onboard a second team, flip it on and add a tenant header (X-Scope-OrgID) to the OTel exporter config.
14-day retention. Driven by the GCS lifecycle rule. Bump the lifecycle age + Tempo's block_retention together if longer trace history is needed.

Debugging

# Verify Tempo is healthy
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo

# Check it's accepting traces (look for "received OTLP traces" lines)
# Tempo owns the WAL PVC, so the chart is expected to run it as a StatefulSet
kubectl logs -n monitoring statefulset/tempo --tail=200 | grep -i otlp

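# Confirm Workload Identity is working
# (if the Tempo image has no curl, run the same request from a debug pod that uses the tempo KSA)
kubectl exec -n monitoring statefulset/tempo -- \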
  curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email
# Expected: observability@development-485000.iam.gserviceaccount.com

# List recent trace blocks in GCS (shared bucket, tempo/ prefix, default tenant)
gsutil ls -lh gs://tomoda-observability-development-485000/tempo/single-tenant/ | tail -20

# Tempo's /ready endpoint (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n monitoring svc/tempo 3100:3100
curl localhost:3100/ready

If the backend logs "failed to upload span batch: ... PermissionDenied", the Workload Identity link is broken: check the KSA annotation and the IAM binding.