Runbook

Common day-to-day operational procedures. Each section is self-contained — skim the headings and jump.

Start local dev

# Terminal 1 — backend + infra
task dev

# Terminal 2 — frontend
task dev:frontend

task dev brings up Postgres, Redis, MinIO, and Photon via Docker Compose, runs migrations, regenerates Swagger docs, pulls GCP secrets (if available), and starts the backend on :8080 with Air for hot reload. The frontend runs on :8081 via the Expo dev server. See Docker Compose for the underlying stack.

Reset local DB

When migrations are in a weird state or you want to start over:

task db:reset           # docker:clean -> docker:up -> migrate
task db:test:setup      # re-seed ~20+ users, events, chats

db:reset runs task docker:clean (which is docker-compose down -v --remove-orphans) so it wipes the named volumes too — your local Postgres data is gone.

Pull GCP secrets

If your task dev is starting without real secrets (you see "GCP credentials not available — running with local defaults only" in the output), auth with gcloud:

gcloud auth login
gcloud config set project development-485000
task dev   # re-run — secrets will now be exported into the shell

The full mechanism is documented in Secrets.

Tail backend logs

Locally

The backend logs to stdout via Zap. In structured-logging mode this is JSON; in ENV=local it's the colored console encoder. Just look at the terminal where task dev is running.

To pipe through jq for readability:

task dev 2>&1 | jq -R 'fromjson? // .'
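When the backend is emitting JSON, the same pipe can be narrowed to failures only. This is a minimal sketch, assuming Zap's default production field name `level` with lowercase values; the project's encoder config may rename it. `errors_only` is a hypothetical helper name, not something defined in the repo.

```shell
# Hypothetical helper: keep only error-or-worse entries from a Zap JSON stream.
# Assumes the default Zap production key "level"; adjust if the project's
# encoder config uses a different key or casing.
errors_only() {
  # -R reads raw lines; `fromjson?` silently drops non-JSON lines;
  # `objects` skips bare JSON scalars so `.level` never raises an error.
  jq -cR 'fromjson? | objects
          | select(.level == "error" or .level == "panic" or .level == "fatal")'
}

# Usage (not run here):
#   task dev 2>&1 | errors_only
```

Unlike the pass-through filter above, this drops non-JSON lines (Docker and Air output), so use it only when you want the error entries alone.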

Production

Backend logs go to Cloud Logging (each Pod's stdout is collected by the GKE logging agent). Use the Logs Explorer in the GCP Console with a query like:

resource.type="k8s_container"
resource.labels.namespace_name="<dev-or-prod>"
resource.labels.container_name="backend"

For a faster CLI workflow:

gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.container_name="backend"' \
  --limit 100 --format json --project development-485000
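The CLI query above does not restrict the namespace the way the Logs Explorer query does. A small sketch of building the filter string with the namespace included follows; `backend_log_filter` is a hypothetical helper, and the `jsonPayload.level` clause in the usage comment is an assumption about how the backend's JSON fields are ingested.

```shell
# Hypothetical helper: build a Cloud Logging filter for the backend container.
#   $1 = namespace (required)
#   $2 = optional extra clause, appended verbatim
backend_log_filter() {
  filter="resource.type=\"k8s_container\""
  filter="$filter AND resource.labels.namespace_name=\"$1\""
  filter="$filter AND resource.labels.container_name=\"backend\""
  if [ -n "${2:-}" ]; then
    filter="$filter AND $2"
  fi
  printf '%s\n' "$filter"
}

# Usage (not run here; replace <dev-or-prod> with the real namespace, and
# treat jsonPayload.level as an assumed field name):
#   gcloud logging read "$(backend_log_filter '<dev-or-prod>' 'jsonPayload.level="error"')" \
#     --freshness 1h --limit 100 --format json --project development-485000
```

`--freshness` limits how far back the read goes, which keeps the query fast when you only care about the last few minutes.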

Deploy a backend hotfix

For a small, urgent fix on main:

git checkout main && git pull
# make the fix, commit, push
./scripts/release.sh    # bump VERSION, tag, push tag, optional GH release

The tag push triggers Cloud Build → image lands in Artifact Registry → ArgoCD picks it up → rolling update on GKE. See Deployment for the full chain.

If you need to ship a fix without bumping a tag (rare and discouraged), bump only the image reference in the devops/ repo and let ArgoCD reconcile.

Rollback

Two paths:

  1. ArgoCD UI — open the tomoda Application, click History and Rollback, select the previous sync, and confirm. ArgoCD re-applies the previous manifest set, including the previous image SHA. Fastest option.
  2. Git revert in the devops repo: git revert the commit that bumped the image (in devops/k8s/apps/tomoda/) and push. ArgoCD reconciles on its next sync.

Avoid kubectl rollout undo unless ArgoCD itself is broken — it'll drift from the Git source of truth.

Mobile app rollback

Mobile is a separate beast because rollback means different things depending on the release vehicle.

| Type of release | Rollback options |
| --- | --- |
| EAS Update (OTA, JS-only) | Republish the previous bundle with eas update --branch production --message "rollback", or use eas update:rollback to revert the channel pointer. |
| Native release (new binary submitted to App Store / Play Store) | You can't unship a binary that's been approved. Submit a hotfix build with an incremented version. For Android you can halt the staged rollout in Play Console; for iOS you can pull from sale (App Store Connect → Pricing) but that's a sledgehammer. |

See Native Release for the full release flow.

Add a new env var

A new backend env var is a multi-step change:

  1. Add the field to the config struct (backend/internal/config/config.go or equivalent — see ../backend/infrastructure/config.md).
  2. Add the default / per-env value in backend/config.local.yaml and backend/config.{dev,prod}.yaml.
  3. Update backend/env.example.local so other developers see it.
  4. Add to scripts/pull-secrets.sh if the value is sensitive (so task dev picks it up).
  5. Create the secret in GCP Secret Manager (gcloud secrets create ...) for non-local environments.
  6. Add a key to the K8s backend-secrets Secret in the devops/ repo so it gets injected into the Pod.

Skipping step 6 will leave the Pod running without the value — usually a silent failure in the affected code path.
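Because a missed step 6 fails silently, it can help to check the live Secret after ArgoCD syncs. The sketch below assumes `jq` is installed; `secret_has_key` and `MY_NEW_VAR` are hypothetical names used for illustration.

```shell
# Hypothetical helper: succeed only if a Secret object (JSON on stdin)
# defines the given key under .data.
secret_has_key() {
  # -e makes jq exit non-zero when the result is false or null.
  jq -e --arg key "$1" '(.data // {}) | has($key)' >/dev/null
}

# Usage (not run here; MY_NEW_VAR is a placeholder):
#   kubectl -n <ns> get secret backend-secrets -o json | secret_has_key MY_NEW_VAR \
#     && echo "present" || echo "MISSING from backend-secrets"
```

The check only confirms that the key exists in the Secret, not that its value is correct or that the Pod was restarted to pick it up.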

Add a new scheduled task

Scheduled jobs are not K8s CronJobs — they're handled by the in-process scheduler. See ../backend/infrastructure/workers-scheduler.md for how to register a new periodic task.

Investigate a slow query

task db:console       # opens psql against the local Postgres

Then:

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

For production, use Cloud SQL Insights in the GCP Console — it tracks the top queries by total time and exposes execution plans without needing a direct DB shell.

Common culprits in this codebase:

  • Missing index on a column added via a recent migration. Check backend/migrations/ and compare against pg_indexes.
  • An H3 spatial query missing the idx_*_h3_* index. Discovery and presence queries lean on these.
  • Sequential scan on a JSONB column. Add a GIN index or pull the field out into a column.

Investigate WebSocket flapping

Symptoms: users report repeated reconnects, real-time updates stop arriving, presence shows online users as offline.

The Hub is in-memory per pod; cross-pod fanout flows through Redis pub/sub on chat:event:*. Pod restarts drop the connections owned by that pod (clients reconnect, but it's user-visible); a Redis outage breaks cross-pod fanout specifically — same-pod chat still works.

Triage steps:

  1. Check Pod restart count. kubectl -n <ns> get pods -l app.kubernetes.io/name=tomoda-api — if RESTARTS is climbing, the Pod is OOMing or panicking. Look at logs (kubectl logs ... --previous).
  2. Check liveness probe. A slow /health (>30s with periodSeconds: 30) will get the Pod killed. The probe handler is at backend/internal/handlers/health.go.
  3. Check ping/pong intervals. The Hub sends pings; the client must pong within the deadline. Mismatched timeouts between the proxy in front (Cloudflare / Ingress) and the Hub will look like flapping. See ../backend/infrastructure/websocket.md.
  4. Check Ingress idle timeout. GKE Ingress defaults can drop idle WS connections — verify in devops/k8s/apps/tomoda/base/ingress.yaml.
  5. Check cross-pod fanout. If users on one pod see messages but users on a sibling pod don't, the issue is the Hub subscriber. On Redis pub/sub failures the logs include "Hub subscriber loop error; reconnecting". If needed, run MONITOR against Redis briefly to confirm chat:event:* traffic.
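For step 1, the RESTARTS column says how often a container restarted but not why. A rough sketch that pulls the last termination reason out of the Pod status is shown here; it assumes `jq` is available, and `restart_reasons` is a hypothetical helper name.

```shell
# Hypothetical helper: from `kubectl get pods -o json` on stdin, print one
# tab-separated line per restarted container:
#   pod  container  restartCount  last-termination-reason
restart_reasons() {
  jq -r '.items[]
         | .metadata.name as $pod
         | .status.containerStatuses[]?
         | select(.restartCount > 0)
         | [$pod, .name, .restartCount, (.lastState.terminated.reason // "unknown")]
         | @tsv'
}

# Usage (not run here):
#   kubectl -n <ns> get pods -l app.kubernetes.io/name=tomoda-api -o json | restart_reasons
```

A reason of OOMKilled points at memory limits. A panic typically shows up as Error with a non-zero exit code, in which case the `--previous` logs are the next stop.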

If the Pod is healthy and connections are still flapping, the issue is client-side or network-side, not on the server.
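For steps 3 and 4, the relationship between the timeouts matters more than any single value. The sketch below encodes two rules of thumb, assuming the Hub resets a read deadline each time a pong arrives (a common pattern, not confirmed for this codebase). `check_ws_timeouts` is a hypothetical helper and the numbers in the usage comment are illustrative only; take the real values from websocket.md and ingress.yaml.

```shell
# Hypothetical helper: sanity-check WebSocket timeouts, all in whole seconds.
#   $1 = Hub ping interval, $2 = Hub pong deadline, $3 = proxy idle timeout
check_ws_timeouts() {
  ping="$1"; pong="$2"; idle="$3"; status=0
  # The next ping must go out before the previous pong deadline expires,
  # otherwise healthy clients get disconnected.
  if [ "$ping" -ge "$pong" ]; then
    echo "ping interval ${ping}s >= pong deadline ${pong}s"
    status=1
  fi
  # On a quiet connection pings are the only traffic, so they must arrive
  # more often than the proxy's idle timeout.
  if [ "$ping" -ge "$idle" ]; then
    echo "ping interval ${ping}s >= proxy idle timeout ${idle}s"
    status=1
  fi
  return "$status"
}

# Usage (illustrative values, not this project's real settings):
#   check_ws_timeouts 54 60 30
```

Any line of output means the proxy or the Hub can drop a connection that both ends consider healthy, which looks exactly like flapping from the client's side.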