Debugging¶

A field guide for common production issues. Start with kubectl describe, escalate to logs, then dashboards.

First moves¶

# What's broken?
kubectl get pods -A | grep -Ev 'Running|Completed'

# Recent events across the namespace
kubectl get events -n prod --sort-by='.lastTimestamp' | tail -30

# Argo CD app health
argocd app list
argocd app get tomoda-prod
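Filtering on the STATUS text alone misses pods that report Running while some of their containers are not ready (for example READY 1/2). A minimal sketch of a stricter filter, assuming the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...); the function name unhealthy_pods is illustrative:

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin and prints rows that
# are not Completed and are either not Running or have fewer ready containers
# than expected (READY column such as 1/2).
unhealthy_pods() {
  awk '$4 == "Completed" { next }
       { split($3, r, "/") }
       $4 != "Running" || r[1] != r[2] { print }'
}

# Usage (needs cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```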

Pod-level diagnosis¶

`describe` first¶

kubectl describe pod <pod-name> -n prod

Look for:

Events at the bottom — FailedScheduling, BackOff, Unhealthy, OOM kills
Conditions — Ready: False with a reason
Container statuses — restart counts, last termination reason
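When the describe output is long, the restart counts and last termination reasons can be pulled out directly with a jsonpath query. A sketch, assuming the pod lives in prod unless another namespace is passed; container_status is a hypothetical helper name:

```shell
# Prints one line per container: name, restart count, last termination reason.
# The reason column is empty for containers that have never terminated.
container_status() {
  # $1 = pod name, $2 = namespace (defaults to prod)
  kubectl get pod "$1" -n "${2:-prod}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Usage (needs cluster access):
#   container_status <pod-name>
```

A reason of OOMKilled here points at memory limits; Error usually means the logs from the previous container are the next stop.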

Logs¶

# Current container
kubectl logs <pod> -n prod -c backend

# Previous (after a crash)
kubectl logs <pod> -n prod -c backend --previous

# Last 5 minutes, follow
kubectl logs -f <pod> -n prod -c backend --since=5m
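To read logs from every backend replica at once without leaving the terminal, kubectl can select pods by label. A sketch, assuming the backend pods carry the Kubernetes label app=backend (the guide only confirms app as a Loki label, so verify with kubectl get pods --show-labels); backend_logs is an illustrative name:

```shell
# Prints recent logs from all pods matching app=backend, each line prefixed
# with its pod and container name.
backend_logs() {
  # $1 = time window (defaults to 5m)
  # --tail=-1 is needed because kubectl limits output to 10 lines per pod
  # when a label selector is used.
  kubectl logs -n prod -l app=backend -c backend \
    --since="${1:-5m}" --prefix --max-log-requests=10 --tail=-1
}

# Usage (needs cluster access):
#   backend_logs 15m
```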

Exec into the pod¶

kubectl exec -it <pod> -n prod -- /bin/sh

# Inside the container
env | grep -v -i password   # confirm env vars are loaded
nslookup postgres-postgresql.data.svc.cluster.local
curl -v http://localhost:8080/health

Logs at scale (Loki)¶

For multi-pod or historical queries, use Grafana's Explore tab against the Loki datasource:

# All errors in prod backend in the last hour
{namespace="prod", app="backend"} |= "error" | json

# Just 5xx responses
{namespace="prod", app="backend"} | json | status_code >= 500

# Slow requests
{namespace="prod", app="backend"} | json | duration_ms > 1000

Loki labels available: namespace, pod, app, container.
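The same LogQL expressions can be run from a terminal against Loki's HTTP API, which helps when Grafana itself is unavailable. A sketch only: LOKI_URL is an assumption (this guide does not name the Loki service, so point it at wherever Loki is reachable, such as a local port-forward), and loki_query is a hypothetical helper:

```shell
# Runs a LogQL query over the last hour via Loki's query_range endpoint and
# prints the raw JSON response.
loki_query() {
  # $1 = LogQL expression, $2 = max lines returned (defaults to 100)
  # Loki accepts start as a Unix timestamp in nanoseconds.
  start_ns="$(( $(date +%s) - 3600 ))000000000"
  curl -sG "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "start=$start_ns" \
    --data-urlencode "limit=${2:-100}"
}

# Usage (needs a reachable Loki):
#   loki_query '{namespace="prod", app="backend"} |= "error"' | jq '.data.result'
```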

Metrics and alerts¶

Open Grafana and load the relevant dashboard:

Backend (Go) — request rate, p50/p95/p99 latency, error rate, goroutines, GC
CNPG Postgres — connections, replication lag, query duration, WAL archive lag
Redis — ops/sec, memory, hit rate
Cluster / Nodes — CPU, memory, pod count, scheduler events

Prometheus scrapes CNPG through a PodMonitor and the backend through its standard endpoints. Alertmanager fires on:

Pod CrashLoopBackOff for >5 min
5xx rate >1% over 5 min
Postgres replication lag >10s
WAL archive falling behind

Check active alerts:

kubectl port-forward svc/prometheus-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093
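With the port-forward running, the active alerts can also be listed without a browser through Alertmanager's v2 API. A sketch that assumes jq is installed; format_alerts is an illustrative name, and the fields shown are a small subset of what the API returns:

```shell
# Reads the JSON array returned by Alertmanager's /api/v2/alerts on stdin and
# prints one tab-separated line per alert: name, state, start time.
format_alerts() {
  jq -r '.[] | [.labels.alertname, .status.state, .startsAt] | @tsv'
}

# Usage (with the port-forward to localhost:9093 running):
#   curl -s http://localhost:9093/api/v2/alerts | format_alerts
```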

Argo CD sync failures¶

# What's out of sync?
argocd app diff tomoda-prod

# Recent sync history
argocd app history tomoda-prod

# Events on the application
kubectl describe application tomoda-prod -n argocd

Common causes:

ComparisonError — manifest syntax error in the latest commit. Check argocd app get for the line.
OutOfSync with no auto-sync — auto-sync was disabled (Option 3 in Rollback). Re-enable: argocd app set tomoda-prod --sync-policy automated.
SyncFailed — Argo CD tried to apply the manifest but the API server or an admission check rejected it (PDB violation, ResourceQuota, immutable field). The error message is in the sync details.
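To tell these causes apart quickly, the sync status, health status, and any condition messages can be read straight from the Application resource. A sketch, assuming the Application lives in the argocd namespace as in the describe command above; app_status is a hypothetical helper:

```shell
# Prints the sync and health status of an Argo CD Application, followed by one
# line per condition (for example ComparisonError or SyncError) with its message.
app_status() {
  # $1 = Application name (defaults to tomoda-prod)
  kubectl get application "${1:-tomoda-prod}" -n argocd \
    -o jsonpath='sync={.status.sync.status} health={.status.health.status}{"\n"}{range .status.conditions[*]}{.type}: {.message}{"\n"}{end}'
}

# Usage (needs cluster access):
#   app_status
```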

External Secrets¶

If a pod's env vars are stale or missing, check the ExternalSecret chain:

# Is it syncing?
kubectl get externalsecret -n prod
# STATUS should be SecretSynced; READY should be True

# Why isn't it syncing?
kubectl describe externalsecret backend-secrets-prod -n prod
# Look at Status.Conditions for the failure reason

# Are the secret stores healthy?
kubectl get clustersecretstore
kubectl describe clustersecretstore gsm-tomoda
# AWS store is namespace-scoped (one per tenant)
kubectl describe secretstore aws-sm-dev  -n tomoda-dev
kubectl describe secretstore aws-sm-prod -n tomoda-prod

Common causes:

Workload Identity binding broken — the ESO controller's K8s SA isn't bound to the GCP SA. Check the iam.gke.io/gcp-service-account annotation.
GCP SM secret missing — the remoteRef.key doesn't exist. Create it via scripts/setup-gcp-secrets.sh.
AWS reader credentials expired/rotated — the aws-eso-credentials K8s secret in the external-secrets namespace is wrong. Re-create per Secrets Management.
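For the Workload Identity case, the annotation can be read directly from the controller's ServiceAccount. A sketch with two assumptions: the ServiceAccount name is not given in this guide and must be supplied, and the controller is assumed to run in the external-secrets namespace; wi_annotation is an illustrative name:

```shell
# Prints the GCP service account that a Kubernetes ServiceAccount is annotated
# with. An empty line means the Workload Identity annotation is missing.
wi_annotation() {
  # $1 = ServiceAccount name (assumption: look it up in your ESO install)
  # $2 = namespace (assumed to be external-secrets)
  kubectl get serviceaccount "$1" -n "${2:-external-secrets}" \
    -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}{"\n"}'
}

# Usage (needs cluster access):
#   wi_annotation <eso-service-account>
```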

Force a refresh (instead of waiting for the 1h interval):

kubectl delete externalsecret backend-secrets-prod -n prod
# Argo CD recreates it within seconds, provided auto-sync with self-heal is enabled
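Deleting the ExternalSecret can also remove the Secret it owns, depending on its creationPolicy, so a gentler option is worth knowing. External Secrets Operator reconciles an ExternalSecret when its annotations change, which allows a refresh without deleting anything. A sketch, assuming the installed ESO version behaves this way; the annotation key force-sync is arbitrary and force_refresh is a hypothetical helper:

```shell
# Triggers an immediate reconcile by writing a changing annotation value
# (the current Unix time) onto the ExternalSecret.
force_refresh() {
  # $1 = ExternalSecret name, $2 = namespace (defaults to prod)
  kubectl annotate externalsecret "$1" -n "${2:-prod}" \
    force-sync="$(date +%s)" --overwrite
}

# Usage (needs cluster access):
#   force_refresh backend-secrets-prod
```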

Photon `/status` health check¶

Photon is the geocoder. If localized lookups stop working:

kubectl port-forward -n data svc/photon 2322:2322 &
sleep 2   # give the port-forward a moment to bind before querying
curl -s http://localhost:2322/status | jq
# Expect: { "status": "Ok", ... }

If status is anything else, check the pod logs — most often it is mid-download of a new index from GCS.

kubectl logs -n data deployment/photon --tail=100
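If Photon is mid-download, polling /status saves re-running curl by hand. A minimal sketch, assuming the port-forward to localhost:2322 is already running; wait_for_photon and its default timings are illustrative, not part of Photon:

```shell
# Polls Photon's /status until it reports "Ok" or the attempts run out.
wait_for_photon() {
  # $1 = max attempts (defaults to 30), $2 = seconds between attempts (defaults to 10)
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -s http://localhost:2322/status \
        | grep -Eq '"status"[[:space:]]*:[[:space:]]*"Ok"'; then
      echo "photon ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "photon not ready after $attempts attempts" >&2
  return 1
}

# Usage (with the port-forward running):
#   wait_for_photon 60 5
```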

See Photon indexing for the index refresh procedure.

Postgres¶

For DB-specific diagnostics (cluster status, replication, backup health, slow queries), see Postgres Operations.