
Debugging

A field guide for common production issues. Start with kubectl describe, escalate to logs, then dashboards.

First moves

# What's broken?
kubectl get pods -A | grep -v Running

# Recent events across the namespace
kubectl get events -n prod --sort-by='.lastTimestamp' | tail -30

# Argo CD app health
argocd app list
argocd app get prod-tomoda
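
The grep -v Running filter above still lists Completed Job pods and misses pods that report Running while failing readiness. A minimal sketch of a stricter filter, assuming the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...); unhealthy_pods is a hypothetical helper name, not part of any tool:

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints pods that are not fully healthy.
# Assumes columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '{
    # Finished Job pods are expected, skip them.
    if ($4 == "Completed") { next }
    # READY looks like "1/2": ready containers / total containers.
    split($3, ready, "/")
    if ($4 != "Running" || ready[1] != ready[2]) {
      print $1 "/" $2, $4, $3
    }
  }'
}
```

Usage: kubectl get pods -A --no-headers | unhealthy_pods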

Pod-level diagnosis

describe first

kubectl describe pod <pod-name> -n prod

Look for:

  • Events at the bottom — FailedScheduling, BackOff, Unhealthy, OOM kills
  • Conditions (Ready: False, with a reason)
  • Container statuses — restart counts, last termination reason
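
When only the restart counts and last termination reasons are needed, the same fields can be read directly from the pod status instead of scanning the describe output. A small sketch using kubectl's jsonpath output; container_restarts is a hypothetical helper name, and the namespace defaults to prod as in the rest of this page:

```shell
# Hypothetical helper: prints one line per container with
# name, restart count, and the reason for the last termination
# (the last field is empty if the container has never terminated).
# $1 = pod name, $2 = namespace (defaults to prod).
container_restarts() {
  kubectl get pod "$1" -n "${2:-prod}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}
```

Usage: container_restarts <pod-name>. A reason of OOMKilled here points at memory limits rather than application errors.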

Logs

# Current container
kubectl logs <pod> -n prod -c backend

# Previous (after a crash)
kubectl logs <pod> -n prod -c backend --previous

# Last 5 minutes, follow
kubectl logs -f <pod> -n prod -c backend --since=5m

Exec into the pod

kubectl exec -it <pod> -n prod -- /bin/sh

# Inside the container
env | grep -v -i password   # confirm env vars are loaded
nslookup postgres-postgresql.data.svc.cluster.local
curl -v http://localhost:8080/health

Logs at scale (Loki)

For multi-pod or historical queries, use Grafana's Explore tab against the Loki datasource:

# All errors in prod backend in the last hour
{namespace="prod", app="backend"} |= "error" | json

# Just 5xx responses
{namespace="prod", app="backend"} | json | status_code >= 500

# Slow requests
{namespace="prod", app="backend"} | json | duration_ms > 1000

Loki labels available: namespace, pod, app, container.
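
To see which pod is producing the errors, the line filter above can be wrapped in a metric query grouped by the pod label. A sketch that only builds the LogQL string; errors_by_pod_query is a hypothetical helper name, and the 5m default window is an assumption:

```shell
# Hypothetical helper: prints a LogQL metric query that counts log lines
# containing "error" per pod over a time window.
# $1 = namespace, $2 = app label, $3 = window (defaults to 5m).
errors_by_pod_query() {
  printf 'sum by (pod) (count_over_time({namespace="%s", app="%s"} |= "error" [%s]))\n' \
    "$1" "$2" "${3:-5m}"
}
```

Paste the output of errors_by_pod_query prod backend into Grafana Explore. If logcli is installed and LOKI_ADDR points at Loki (both assumptions), logcli instant-query "$(errors_by_pod_query prod backend)" should return the same result.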

Metrics and alerts

Open Grafana and load the relevant dashboard:

  • Backend (Go) — request rate, p50/p95/p99 latency, error rate, goroutines, GC
  • CNPG Postgres — connections, replication lag, query duration, WAL archive lag
  • Redis — ops/sec, memory, hit rate
  • Cluster / Nodes — CPU, memory, pod count, scheduler events

Prometheus scrapes CNPG through a PodMonitor and the backend through its standard endpoints. Alertmanager fires on:

  • Pod CrashLoopBackOff for >5 min
  • 5xx rate >1% over 5 min
  • Postgres replication lag >10s
  • WAL archive falling behind

Check active alerts:

kubectl port-forward svc/prometheus-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093
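
With the port-forward running, active alerts can also be listed from the terminal through Alertmanager's v2 HTTP API. A sketch assuming jq is installed (it is already used for the Photon check on this page); active_alerts is a hypothetical helper name, and the severity label is an assumption about how the alert rules are labelled:

```shell
# Hypothetical helper: reads the JSON array returned by
# GET /api/v2/alerts on stdin and prints, tab-separated,
# alert name, severity ("-" if the label is absent), and start time
# for alerts whose state is "active" (not silenced or inhibited).
active_alerts() {
  jq -r '.[]
    | select(.status.state == "active")
    | [.labels.alertname, (.labels.severity // "-"), .startsAt]
    | @tsv'
}
```

Usage: curl -s http://localhost:9093/api/v2/alerts | active_alerts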

Argo CD sync failures

# What's out of sync?
argocd app diff prod-tomoda

# Recent sync history
argocd app history prod-tomoda

# Events on the application
kubectl describe application prod-tomoda -n argocd

Common causes:

  • ComparisonError — manifest syntax error in the latest commit. The error message appears under the conditions in the output of argocd app get prod-tomoda.
  • OutOfSync with no auto-sync — auto-sync was disabled (Option 3 in Rollback). Re-enable: argocd app set prod-tomoda --sync-policy automated.
  • SyncFailed — Argo CD applied the manifest but the resource controller rejected it (PDB violation, ResourceQuota, immutable field). The error message is in the sync details.
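
The three cases above can usually be told apart from the Application's status fields alone. A sketch that prints them in one call, assuming the Application lives in the argocd namespace as in the describe command above; app_sync_summary is a hypothetical helper name:

```shell
# Hypothetical helper: prints sync status, health, the last sync
# operation's phase and message, and any conditions (for example
# ComparisonError) for an Argo CD Application.
# $1 = application name, $2 = namespace (defaults to argocd).
app_sync_summary() {
  kubectl get application "$1" -n "${2:-argocd}" \
    -o jsonpath='sync={.status.sync.status}{"\n"}health={.status.health.status}{"\n"}operation={.status.operationState.phase}: {.status.operationState.message}{"\n"}{range .status.conditions[*]}{.type}: {.message}{"\n"}{end}'
}
```

Usage: app_sync_summary prod-tomoda. A populated conditions line suggests a comparison problem; a Failed operation phase with a message suggests the resource was rejected at apply time.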

External Secrets

If a pod's env vars are stale or missing, check the ExternalSecret chain:

# Is it syncing?
kubectl get externalsecret -n prod
# STATUS should be SecretSynced; READY should be True

# Why isn't it syncing?
kubectl describe externalsecret backend-secrets-prod -n prod
# Look at Status.Conditions for the failure reason

# Is the ClusterSecretStore healthy?
kubectl get clustersecretstore
kubectl describe clustersecretstore gsm-tomoda
kubectl describe clustersecretstore aws-sm-tomoda
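
In a namespace with many ExternalSecrets, it can help to list only the ones that are not ready. A sketch that filters the table output, assuming READY is the last column of kubectl get externalsecret (column layout may differ between ESO versions); not_ready_externalsecrets is a hypothetical helper name:

```shell
# Hypothetical helper: reads `kubectl get externalsecret --no-headers`
# output on stdin and prints the name and last column of every row
# whose last column (assumed to be READY) is not "True".
not_ready_externalsecrets() {
  awk '$NF != "True" { print $1, "READY=" $NF }'
}
```

Usage: kubectl get externalsecret -n prod --no-headers | not_ready_externalsecrets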

Common causes:

  • Workload Identity binding broken — the ESO controller's K8s SA isn't bound to the GCP SA. Check the iam.gke.io/gcp-service-account annotation.
  • GCP SM secret missing — the remoteRef.key doesn't exist. Create it via scripts/setup-gcp-secrets.sh.
  • AWS reader credentials expired/rotated — the aws-eso-credentials K8s secret in the external-secrets namespace is wrong. Re-create per Secrets Management.

Force a refresh (instead of waiting for the 1h interval):

kubectl delete externalsecret backend-secrets-prod -n prod
# Argo CD recreates it within seconds
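
Deleting the ExternalSecret has a side effect: with creationPolicy: Owner (the ESO default), the generated Kubernetes Secret is owned by the ExternalSecret and is garbage-collected with it, so the Secret is briefly absent until the ExternalSecret is recreated and synced. A less disruptive sketch, assuming the installed ESO version re-syncs when an annotation on the ExternalSecret changes, as the ESO documentation describes; force_sync_externalsecret is a hypothetical helper name:

```shell
# Hypothetical helper: triggers a re-sync by changing an annotation on the
# ExternalSecret instead of deleting it. The annotation value is the
# current Unix timestamp, so every call produces a change.
# $1 = ExternalSecret name, $2 = namespace (defaults to prod).
force_sync_externalsecret() {
  kubectl annotate externalsecret "$1" -n "${2:-prod}" \
    "force-sync=$(date +%s)" --overwrite
}
```

Usage: force_sync_externalsecret backend-secrets-prod, then re-run kubectl get externalsecret -n prod to confirm the refresh.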

Photon /status health check

Photon is the geocoder. If localized lookups stop working:

kubectl port-forward -n data svc/photon 2322:2322 &
curl -s http://localhost:2322/status | jq
# Expect: { "status": "Ok", ... }

If status is anything else, check the pod logs — most often it is mid-download of a new index from GCS.

kubectl logs -n data deployment/photon --tail=100
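
Because Photon can stay unavailable for a while during an index download, and the backgrounded port-forward may not be listening yet when the first curl runs, polling /status is often more useful than a single request. A sketch assuming the response contains "status": "Ok" as shown above; wait_for_photon and its defaults (30 attempts, 2 seconds apart) are hypothetical:

```shell
# Hypothetical helper: polls a Photon /status URL until it reports "Ok".
# $1 = URL (defaults to the port-forwarded address),
# $2 = max attempts (default 30), $3 = seconds between attempts (default 2).
# The function body runs in a subshell so its variables do not leak.
wait_for_photon() (
  url=${1:-http://localhost:2322/status}
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; connection errors are silenced
    # because they are expected while the port-forward is starting.
    if curl -fsS "$url" 2>/dev/null |
      grep -Eq '"status"[[:space:]]*:[[:space:]]*"Ok"'; then
      echo "photon ready after $i attempt(s)"
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "photon not ready after $attempts attempts" >&2
  exit 1
)
```

Usage: wait_for_photon || kubectl logs -n data deployment/photon --tail=100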

See Photon Multilang Rollout for the index refresh procedure.

Postgres

For DB-specific diagnostics (cluster status, replication, backup health, slow queries), see Postgres Operations.