Debugging
A field guide for common production issues. Start with kubectl describe, escalate to logs, then dashboards.
First moves
# What's broken?
kubectl get pods -A | grep -v Running
# Recent events across the namespace
kubectl get events -n prod --sort-by='.lastTimestamp' | tail -30
# Argo CD app health
argocd app list
argocd app get prod-tomoda
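Note that grep -v Running also prints Completed Job pods and hides pods that report Running while a container is not ready. A minimal sketch of a stricter filter follows, assuming the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...). The helper name unhealthy_pods is hypothetical.

```shell
# unhealthy_pods: hypothetical helper; reads `kubectl get pods -A` output on stdin.
# Keeps the header line, drops Completed pods, and keeps every pod that is
# either not Running or Running with fewer ready containers than total.
unhealthy_pods() {
  awk 'NR == 1 { print; next }
       $4 == "Completed" { next }
       { split($3, r, "/"); if ($4 != "Running" || r[1] != r[2]) print }'
}

# Usage (needs cluster access):
#   kubectl get pods -A | unhealthy_pods
```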
Pod-level diagnosis
describe first
kubectl describe pod <pod-name> -n prod
Look for:
- Events at the bottom: FailedScheduling, BackOff, Unhealthy, OOM kills
- Conditions: Ready: False with a reason
- Container statuses: restart counts, last termination reason
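When only the container statuses matter, the same fields can be pulled out directly rather than read from the full describe output. This is a sketch using kubectl's jsonpath output. The helper name container_status and the default namespace are assumptions made for illustration.

```shell
# container_status: hypothetical helper. Prints one tab-separated line per
# container: name, restart count, last termination reason (empty when the
# container has never terminated).
container_status() {
  pod="$1"
  ns="${2:-prod}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Usage (needs cluster access):
#   container_status <pod-name> prod
```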
Logs
# Current container
kubectl logs <pod> -n prod -c backend
# Previous (after a crash)
kubectl logs <pod> -n prod -c backend --previous
# Last 5 minutes, follow
kubectl logs -f <pod> -n prod --since=5m
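To read logs from every replica at once without going to Loki, kubectl logs accepts a label selector. This sketch assumes the backend pods carry the label app=backend (inferred from the Loki label of the same name, not confirmed for the pod spec). The helper name backend_logs is hypothetical.

```shell
# backend_logs: hypothetical helper. Streams recent logs from all pods matching
# the (assumed) app=backend label.
#   --prefix            tags each line with its pod and container
#   --tail=-1           lifts the 10-line default that applies when a selector is used
#   --max-log-requests  raises the default cap of 5 concurrent log streams
backend_logs() {
  since="${1:-5m}"
  kubectl logs -n prod -l app=backend -c backend \
    --since="$since" --prefix --tail=-1 --max-log-requests=10
}

# Usage (needs cluster access):
#   backend_logs 15m | grep -i error
```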
Exec into the pod
kubectl exec -it <pod> -n prod -- /bin/sh
# Inside the container
env | grep -v -i password # confirm env vars are loaded
nslookup postgres-postgresql.data.svc.cluster.local
curl -v http://localhost:8080/health
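If the container image ships without a shell, kubectl exec fails. An ephemeral debug container is one way around that. The sketch below assumes the cluster supports ephemeral containers and that pulling busybox:1.36 is allowed. The helper name debug_shell is hypothetical.

```shell
# debug_shell: hypothetical helper. Attaches a busybox ephemeral container to the
# pod, targeting the given container's process namespace. The debug container
# shares the pod's network, so localhost:8080 is still the backend.
debug_shell() {
  pod="$1"
  target="${2:-backend}"
  kubectl debug -it "$pod" -n prod --image=busybox:1.36 --target="$target" -- sh
}

# Usage (needs cluster access):
#   debug_shell <pod>
#   # busybox has no curl; inside the debug container use:
#   #   wget -qO- http://localhost:8080/health
```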
Logs at scale (Loki)
For multi-pod or historical queries, use Grafana's Explore tab against the Loki datasource:
# All errors in prod backend in the last hour
{namespace="prod", app="backend"} |= "error" | json
# Just 5xx responses
{namespace="prod", app="backend"} | json | status_code >= 500
# Slow requests
{namespace="prod", app="backend"} | json | duration_ms > 1000
Loki labels available: namespace, pod, app, container.
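The same LogQL can be run from a terminal through Loki's HTTP API, which helps when Grafana itself is unavailable. This sketch assumes Loki is reachable at LOKI_URL (hypothetical; for example through a port-forward to the Loki service on its default port 3100). The helper name loki_query is also hypothetical.

```shell
# loki_query: hypothetical helper. Sends a LogQL query to Loki's query_range
# endpoint. With no start/end given, Loki should default to roughly the last
# hour; pass explicit start/end parameters for other windows.
loki_query() {
  query="$1"
  limit="${2:-100}"
  curl -sG "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "limit=${limit}"
}

# Usage (needs network access to Loki):
#   loki_query '{namespace="prod", app="backend"} | json | status_code >= 500'
```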
Metrics and alerts
Open Grafana and load the relevant dashboard:
- Backend (Go) — request rate, p50/p95/p99 latency, error rate, goroutines, GC
- CNPG Postgres — connections, replication lag, query duration, WAL archive lag
- Redis — ops/sec, memory, hit rate
- Cluster / Nodes — CPU, memory, pod count, scheduler events
Prometheus scrapes CNPG via a PodMonitor and the backend via standard endpoints. Alertmanager fires on:
- Pod CrashLoopBackOff for >5 min
- 5xx rate >1% over 5 min
- Postgres replication lag >10s
- WAL archive falling behind
Check active alerts:
kubectl port-forward svc/prometheus-alertmanager -n monitoring 9093:9093
# Open http://localhost:9093
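With the port-forward running, active alerts can also be listed from the terminal through Alertmanager's v2 API. This is a sketch that assumes jq is installed and that alerts carry a severity label (shown as "-" when absent). The helper name summarize_alerts is hypothetical.

```shell
# summarize_alerts: hypothetical helper. Reads the JSON array returned by
# Alertmanager's /api/v2/alerts on stdin and prints one tab-separated line per
# alert: alert name, severity (or "-"), start time.
summarize_alerts() {
  jq -r '.[] | [.labels.alertname, (.labels.severity // "-"), .startsAt] | @tsv'
}

# Usage (needs the port-forward above):
#   curl -s 'http://localhost:9093/api/v2/alerts?active=true' | summarize_alerts
```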
Argo CD sync failures
# What's out of sync?
argocd app diff prod-tomoda
# Recent sync history
argocd app history prod-tomoda
# Events on the application
kubectl describe application prod-tomoda -n argocd
Common causes:
- ComparisonError: manifest syntax error in the latest commit. Check argocd app get for the line.
- OutOfSync with no auto-sync: auto-sync was disabled (Option 3 in Rollback). Re-enable: argocd app set prod-tomoda --sync-policy automated.
- SyncFailed: Argo CD applied the manifest but the resource controller rejected it (PDB violation, ResourceQuota, immutable field). The error message is in the sync details.
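To see which of these causes applies without opening the UI, the relevant status fields can be read from the Application resource. This sketch uses kubectl's jsonpath output, and the helper name app_status is hypothetical.

```shell
# app_status: hypothetical helper. Prints sync status, health status, and the
# phase and message of the last sync operation, then one line per condition
# (for example a ComparisonError with its message).
app_status() {
  app="${1:-prod-tomoda}"
  kubectl get application "$app" -n argocd \
    -o jsonpath='{.status.sync.status}{"\t"}{.status.health.status}{"\t"}{.status.operationState.phase}{"\t"}{.status.operationState.message}{"\n"}{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
}

# Usage (needs cluster access):
#   app_status prod-tomoda
```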
External Secrets
If a pod's env vars are stale or missing, check the ExternalSecret chain:
# Is it syncing?
kubectl get externalsecret -n prod
# STATUS should be SecretSynced; READY should be True
# Why isn't it syncing?
kubectl describe externalsecret backend-secrets-prod -n prod
# Look at Status.Conditions for the failure reason
# Is the ClusterSecretStore healthy?
kubectl get clustersecretstore
kubectl describe clustersecretstore gsm-tomoda
kubectl describe clustersecretstore aws-sm-tomoda
Common causes:
- Workload Identity binding broken: the ESO controller's K8s SA isn't bound to the GCP SA. Check the iam.gke.io/gcp-service-account annotation.
- GCP SM secret missing: the remoteRef.key doesn't exist. Create it via scripts/setup-gcp-secrets.sh.
- AWS reader credentials expired/rotated: the aws-eso-credentials K8s secret in the external-secrets namespace is wrong. Re-create per Secrets Management.
Force a refresh (instead of waiting for the 1h interval):
kubectl delete externalsecret backend-secrets-prod -n prod
# Argo CD recreates it within seconds
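A gentler alternative is to change an annotation on the ExternalSecret, which the External Secrets Operator documentation describes as a way to trigger an immediate reconcile. It avoids the window in which the resource does not exist; with an owner-style creation policy, the generated Secret is typically garbage-collected along with it. This is a sketch, worth verifying against the ESO version deployed here. The helper name force_sync is hypothetical.

```shell
# force_sync: hypothetical helper. Sets a changing annotation value on the
# ExternalSecret so the operator reconciles it now instead of waiting for the
# refresh interval.
force_sync() {
  name="$1"
  ns="${2:-prod}"
  kubectl annotate externalsecret "$name" -n "$ns" \
    force-sync="$(date +%s)" --overwrite
}

# Usage (needs cluster access):
#   force_sync backend-secrets-prod prod
#   kubectl get externalsecret backend-secrets-prod -n prod   # check READY is True
```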
Photon /status health check
Photon is the geocoder. If localized lookups stop working:
kubectl port-forward -n data svc/photon 2322:2322 &
curl -s http://localhost:2322/status | jq
# Expect: { "status": "Ok", ... }
If status is anything else, check the pod logs. Most often Photon is mid-download of a new index from GCS.
kubectl logs -n data deployment/photon --tail=100
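While an index download is in progress, a small polling loop saves re-running curl by hand. This sketch assumes the port-forward to svc/photon is still running and that a healthy response contains "status": "Ok" as shown earlier. The helper name wait_for_photon and its defaults are hypothetical.

```shell
# wait_for_photon: hypothetical helper. Polls the /status endpoint until it
# reports "Ok" or the attempt budget is used up.
#   $1 = status URL, $2 = max attempts, $3 = seconds between attempts
wait_for_photon() {
  url="${1:-http://localhost:2322/status}"
  tries="${2:-30}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -s "$url" | grep -q '"status" *: *"Ok"'; then
      echo "photon ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "photon not ready after $tries attempts" >&2
  return 1
}

# Usage (needs the port-forward above):
#   wait_for_photon http://localhost:2322/status 30 10
```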
See Photon Multilang Rollout for the index refresh procedure.
Postgres
For DB-specific diagnostics (cluster status, replication, backup health, slow queries), see Postgres Operations.