Runbook¶
Common day-to-day operational procedures. Each section is self-contained — skim the headings and jump.
Start local dev¶
# Terminal 1 — backend + infra
task dev
# Terminal 2 — frontend
task dev:frontend
task dev brings up Postgres, Redis, MinIO, and Photon via Docker Compose, runs migrations, regenerates Swagger docs, pulls GCP secrets (if available), and starts the backend on :8080 with Air for hot reload. The frontend runs on :8081 via the Expo dev server. See Docker Compose for the underlying stack.
Reset local DB¶
When migrations are in a weird state or you want to start over:
task db:reset # docker:clean -> docker:up -> migrate
task db:test:setup # re-seed ~20+ users, events, chats
db:reset runs task docker:clean (which is docker-compose down -v --remove-orphans) so it wipes the named volumes too — your local Postgres data is gone.
Pull GCP secrets¶
If your task dev is starting without real secrets (you see "GCP credentials not available — running with local defaults only" in the output), auth with gcloud:
gcloud auth login
gcloud config set project development-485000
task dev # re-run — secrets will now be exported into the shell
The full mechanism is documented in Secrets.
Tail backend logs¶
Locally¶
The backend logs to stdout via Zap. In structured-logging mode this is JSON; in ENV=local it's the colored console encoder. Just look at the terminal where task dev is running.
To pipe through jq for readability:
task dev 2>&1 | jq -R 'fromjson? // .'
Production¶
Backend logs go to Cloud Logging (each Pod's stdout is collected by the GKE logging agent). Use the Logs Explorer in the GCP Console with a query like:
resource.type="k8s_container"
resource.labels.namespace_name="<dev-or-prod>"
resource.labels.container_name="backend"
For a faster CLI workflow:
gcloud logging read \
'resource.type="k8s_container" AND resource.labels.container_name="backend"' \
--limit 100 --format json --project development-485000
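The CLI example above omits the namespace clause that the Logs Explorer query includes. A minimal sketch of a helper that builds the full three-clause filter follows; the function name `backend_log_filter` is hypothetical, and the namespace value you pass is whatever your cluster actually uses.

```shell
#!/bin/sh
# backend_log_filter NAMESPACE
# Prints a Cloud Logging filter for the backend container in one namespace.
# The function name is illustrative; the three clauses come from the
# Logs Explorer query shown earlier in this section.
backend_log_filter() {
  printf 'resource.type="k8s_container" AND resource.labels.namespace_name="%s" AND resource.labels.container_name="backend"\n' "$1"
}

# Usage (needs an authenticated gcloud; NS is your namespace name):
#   gcloud logging read "$(backend_log_filter "$NS")" \
#     --limit 100 --freshness 1h --format json --project development-485000
```

`--freshness` limits how far back the read goes, which keeps the query quick when you only care about the last hour.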
Deploy a backend hotfix¶
For a small, urgent fix on main:
git checkout main && git pull
# make the fix, commit, push
./scripts/release.sh # bump VERSION, tag, push tag, optional GH release
The tag push triggers Cloud Build → image lands in Artifact Registry → ArgoCD picks it up → rolling update on GKE. See Deployment for the full chain.
If you need to ship a fix without bumping a tag (rare and discouraged), bump only the image reference in the devops/ repo and let ArgoCD reconcile.
Rollback¶
Two paths:
- ArgoCD UI: open the tomoda Application, click History and Rollback, select the previous sync, and confirm. ArgoCD re-applies the previous manifest set, including the previous image SHA. Fastest option.
- Git revert in the devops repo: git revert the commit that bumped the image (in devops/k8s/apps/tomoda/) and push. ArgoCD reconciles on its next sync.
Avoid kubectl rollout undo unless ArgoCD itself is broken — it'll drift from the Git source of truth.
Mobile app rollback¶
Mobile is a separate beast because rollback means different things depending on the release vehicle.
| Type of release | Rollback options |
|---|---|
| EAS Update (OTA, JS-only) | Republish the previous bundle with eas update --branch production --message "rollback" (run from the last known-good commit), or use eas update:rollback to revert the channel pointer. |
| Native release (new binary submitted to App Store / Play Store) | You can't unship a binary that's been approved. Submit a hotfix build with an incremented version. For Android you can halt the staged rollout in Play Console; for iOS you can pull from sale (App Store Connect → Pricing) but that's a sledgehammer. |
See Native Release for the full release flow.
Add a new env var¶
A new backend env var is a multi-step change:
1. Add the field to the config struct (backend/internal/config/config.go or equivalent; see ../backend/infrastructure/config.md).
2. Add the default / per-env value in backend/config.local.yaml and backend/config.{dev,prod}.yaml.
3. Update backend/env.example.local so other developers see it.
4. Add to scripts/pull-secrets.sh if the value is sensitive (so task dev picks it up).
5. Create the secret in GCP Secret Manager (gcloud secrets create ...) for non-local environments.
6. Add a key to the K8s backend-secrets Secret in the devops/ repo so it gets injected into the Pod.
Skipping step 6 will leave the Pod running without the value — usually a silent failure in the affected code path.
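Because the checklist spans several files, a quick grep can catch a forgotten one before review. The sketch below covers only the files named in steps 2 and 3. The function name `check_env_var` is hypothetical, and it assumes the variable name appears literally in each file, which may not hold if the YAML keys use a different spelling. Steps 4 to 6 live outside what a local grep can verify.

```shell
#!/bin/sh
# check_env_var NAME [REPO_ROOT]
# Reports whether NAME appears in each config file from steps 2 and 3.
# Exits non-zero if any file is missing the name.
# Assumption: NAME is spelled the same way in every file.
check_env_var() {
  name="$1"
  root="${2:-.}"
  missing=0
  for f in \
    backend/config.local.yaml \
    backend/config.dev.yaml \
    backend/config.prod.yaml \
    backend/env.example.local
  do
    # -F treats NAME as a fixed string, not a regex.
    if grep -qF -- "$name" "$root/$f" 2>/dev/null; then
      echo "ok       $f"
    else
      echo "missing  $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage, from the repo root (MY_NEW_VAR is a placeholder):
#   check_env_var MY_NEW_VAR
```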
Add a new scheduled task¶
Scheduled jobs are not K8s CronJobs — they're handled by the in-process scheduler. See ../backend/infrastructure/workers-scheduler.md for how to register a new periodic task.
Investigate a slow query¶
task db:console # opens psql against the local Postgres
Then:
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
For production, use Cloud SQL Insights in the GCP Console — it tracks the top queries by total time and exposes execution plans without needing a direct DB shell.
Common culprits in this codebase:
- Missing index on a column added via a recent migration. Check backend/migrations/ and compare against pg_indexes.
- An H3 spatial query missing the idx_*_h3_* index. Discovery and presence queries lean on these.
- Sequential scan on a JSONB column. Add a GIN index or pull the field out into a column.
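To compare a migration against pg_indexes, it helps to list the indexes that actually exist on the table in question. The following is a small sketch that prints such a query for pasting into the psql session opened by task db:console. The function name `index_query` and the table name in the usage line are hypothetical.

```shell
#!/bin/sh
# index_query TABLE
# Prints a SQL statement listing the indexes defined on TABLE, using the
# standard pg_indexes view (columns indexname and indexdef).
# Rejects anything that is not a plain identifier, since the name is
# interpolated directly into the SQL text.
index_query() {
  table="$1"
  case "$table" in
    ''|*[!A-Za-z0-9_]*)
      echo "invalid table name: $table" >&2
      return 1
      ;;
  esac
  printf "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = '%s' ORDER BY indexname;\n" "$table"
}

# Usage ("events" is a placeholder table name):
#   index_query events
```

If the index you expected from backend/migrations/ is absent from the result, the migration either did not run or did not create it.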
Investigate WebSocket flapping¶
Symptoms: users report repeated reconnects, real-time updates stop arriving, presence shows online users as offline.
The Hub is in-memory per pod; cross-pod fanout flows through Redis pub/sub on
chat:event:*. Pod restarts drop the connections owned by that pod (clients
reconnect, but it's user-visible); a Redis outage breaks cross-pod fanout
specifically — same-pod chat still works.
Triage steps:
- Check Pod restart count. Run kubectl -n <ns> get pods -l app.kubernetes.io/name=tomoda-api. If RESTARTS is climbing, the Pod is OOMing or panicking. Look at logs (kubectl logs ... --previous).
- Check liveness probe. A slow /health (>30s with periodSeconds: 30) will get the Pod killed. Probe is at backend/internal/handlers/health.go.
- Check ping/pong intervals. The Hub sends pings; the client must pong within the deadline. Mismatched timeouts between the proxy in front (Cloudflare / Ingress) and the Hub will look like flapping. See ../backend/infrastructure/websocket.md.
- Check Ingress idle timeout. GKE Ingress defaults can drop idle WS connections; verify in devops/k8s/apps/tomoda/base/ingress.yaml.
- Check cross-pod fanout. If users on one pod see messages but users on a sibling pod don't, the issue is the Hub subscriber. Logs include "Hub subscriber loop error; reconnecting" on Redis pub/sub failures. Run MONITOR against Redis briefly to confirm chat:event:* traffic if needed.
If the Pod is healthy and connections are still flapping, the issue is client-side or network-side, not on the server.
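For the restart-count check in the triage steps, a small filter over the kubectl output makes climbing counts easier to spot across many Pods. This is a sketch under the assumption that the output uses the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE); the function name `flag_restarts` is hypothetical.

```shell
#!/bin/sh
# flag_restarts [THRESHOLD]
# Reads `kubectl get pods` table output on stdin and prints the name and
# restart count of every Pod whose RESTARTS value exceeds THRESHOLD
# (default 0). Assumes RESTARTS is the fourth column; a trailing
# "(2m ago)" annotation does not affect the count.
flag_restarts() {
  threshold="${1:-0}"
  awk -v t="$threshold" 'NR > 1 && $4 + 0 > t { print $1, $4 }'
}

# Usage (NS is your namespace):
#   kubectl -n "$NS" get pods -l app.kubernetes.io/name=tomoda-api | flag_restarts 3
```

Any Pod it prints is a candidate for kubectl logs ... --previous.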