Rollback

When a deploy breaks prod, pick the option that gets you back to known-good fastest, then clean up the GitOps state.

Option 1 — Image rollback (fastest)

Point the running deployment at the previous image SHA. The app rolls back in ~30s once the new pod is healthy.

  1. Open https://argo-app.tomoda.life, select prod-tomoda.
  2. Click History and Rollback (clock icon, top toolbar).
  3. Pick the previous Sync revision (look for the green "Synced" badge before the bad one).
  4. Click Rollback.

Argo CD applies the previous manifests, including the previous image SHA. The backend rolls.
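
The same rollback can be done from the CLI, which helps when the UI is slow to load mid-incident. This is a minimal sketch, assuming `argocd app history prod-tomoda -o id` prints one history ID per line with the oldest first, and that the entry just before the current one is the known-good deploy. `previous_history_id` is a hypothetical helper, not part of the argocd CLI.

```shell
# Hypothetical helper: reads history IDs on stdin (one per line, oldest first)
# and prints the second-to-last one, i.e. the deploy before the current one.
# Exits with status 1 if there are fewer than two entries.
previous_history_id() {
  awk 'NF { prev = cur; cur = $1; n++ }
       END { if (n < 2) exit 1; print prev }'
}

# Usage (assumes you are already logged in to Argo CD):
#   id=$(argocd app history prod-tomoda -o id | previous_history_id) &&
#     argocd app rollback prod-tomoda "$id"
```

Check the chosen ID against the full `argocd app history prod-tomoda` output before rolling back, since the previous entry is not always the last known-good one.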

Argo CD rollback leaves Git ahead

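Rolling back in the UI does not revert the Git commit that Image Updater wrote, so Git still points at the bad image. Argo CD only allows a rollback while auto-sync is disabled; as soon as auto-sync is back on, the next reconcile will sync to Git and redeploy the bad image. Follow up with Option 2 (Git revert) or manually edit the Kustomize overlay back to the known-good SHA.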

Via Kustomize edit + commit

cd devops
# Find the last good SHA
git log -- k8s/apps/tomoda/overlays/prod/kustomization.yaml

# Edit the image override
$EDITOR k8s/apps/tomoda/overlays/prod/kustomization.yaml
# Change the `images:` newTag to the known-good SHA

git commit -am "rollback: prod backend to <SHA>"
git push origin main

# Argo CD picks it up on next poll, or force it:
argocd app sync prod-tomoda
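
If you would rather not hand-edit YAML during an incident, the edit step can be scripted. This is a rough sketch, assuming the prod overlay pins exactly one image under `images:` and that the tag is the image SHA. `set_new_tag` is a hypothetical helper; it rewrites every `newTag:` line it finds, so review `git diff` before committing.

```shell
# Hypothetical helper: set every `newTag:` value in a kustomization file.
# Assumes a single pinned image; the tag is quoted so that an all-digit SHA
# is still read as a YAML string.
set_new_tag() {
  file=$1
  sha=$2
  tmp=$(mktemp) || return 1
  # Keep the indentation (and an optional leading "- "), replace the value.
  sed -E "s|^([[:space:]]*(-[[:space:]]+)?newTag:[[:space:]]*).*|\1\"${sha}\"|" \
    "$file" > "$tmp" &&
    cat "$tmp" > "$file" &&   # write in place so file permissions are kept
    rm -f "$tmp"
}

# Usage:
#   set_new_tag k8s/apps/tomoda/overlays/prod/kustomization.yaml <SHA>
#   git diff   # confirm only the intended line changed
```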

Option 2 — Git revert (cleanest)

Revert the offending commit in the app repo (tomoda). Cloud Build builds the reverted code, Image Updater promotes the resulting image, Argo CD rolls.

cd tomoda
git revert <bad-commit-sha>
git push origin main

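# For prod, also cut a new patch tag (the current version with the patch number bumped by one)
git tag v<X.Y.Z+1>
# Push only the new tag; `git push --tags` would push every local tag
git push origin v<X.Y.Z+1>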

This is the right answer when the bug is in app code and you don't want to keep building old SHAs. Total wall-clock: 10-15 min (build + Image Updater poll + Argo CD sync).
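
The patch bump in the commands above is easy to get wrong by hand. This small sketch computes the next tag, assuming tags follow a plain `vX.Y.Z` scheme with no pre-release suffix or leading zeros. `next_patch_tag` is a hypothetical helper.

```shell
# Hypothetical helper: print the next patch tag for a plain vX.Y.Z tag.
next_patch_tag() {
  ver=${1#v}            # strip the leading "v"
  major=${ver%%.*}      # text before the first dot
  rest=${ver#*.}        # text after the first dot
  minor=${rest%%.*}
  patch=${rest#*.}
  echo "v${major}.${minor}.$((patch + 1))"
}

# Usage (assumes the most recent reachable tag is the current release):
#   new_tag=$(next_patch_tag "$(git describe --tags --abbrev=0)")
#   git tag "$new_tag" && git push origin "$new_tag"
```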

Option 3 — Manual scale-down (emergency only)

If the app is actively causing data corruption or runaway costs and you need it off right now:

# Stop the API pool — kills inbound traffic immediately.
kubectl scale deployment/tomoda-api -n prod --replicas=0

# Optionally stop async too if the incident involves runaway jobs.
kubectl scale deployment/tomoda-async -n prod --replicas=0

Because selfHeal: true is configured, Argo CD will scale the deployments back up shortly after you scale them down. Disable auto-sync for the duration of the incident, ideally before running the scale commands:

# Temporarily disable auto-sync on the Argo CD app
argocd app set prod-tomoda --sync-policy none

Last resort

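A scaled-down app leaves the environment in a non-GitOps state. Re-enable sync (argocd app set prod-tomoda --sync-policy automated) as soon as the incident is contained, then confirm that selfHeal is still on, since switching the policy to none may have cleared it (argocd app set prod-tomoda --self-heal turns it back on). Follow up with Option 1 or 2 so the manifests match what is actually running.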
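
The order of operations matters here, so it can help to wrap the emergency commands in one function. This is a minimal sketch using the same app and deployment names as above. `run`, `emergency_stop`, `DRY_RUN` and `STOP_ASYNC` are hypothetical names introduced for illustration.

```shell
# With DRY_RUN=1, print each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

emergency_stop() {
  # Pause auto-sync first so selfHeal cannot undo the scale-down.
  run argocd app set prod-tomoda --sync-policy none
  run kubectl scale deployment/tomoda-api -n prod --replicas=0
  # Only stop the async workers when the incident involves runaway jobs.
  if [ "${STOP_ASYNC:-0}" = "1" ]; then
    run kubectl scale deployment/tomoda-async -n prod --replicas=0
  fi
}

# Usage:
#   DRY_RUN=1 emergency_stop      # preview the commands
#   STOP_ASYNC=1 emergency_stop   # stop API and async workers
```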

Option 4 — Database rollback

App rollbacks are cheap; DB rollbacks are not. If the bad deploy ran a destructive migration or wrote corrupt data, you need point-in-time recovery.

See Disaster Recovery for the full procedure. Summary:

./scripts/disaster-recovery.sh --env prod \
  --mode pitr \
  --target-time "2026-04-27T09:55:00Z"

PITR creates a new CNPG cluster restored to the requested timestamp. After verification you switch the ExternalName service to point at it, then update the Git manifests.
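
Passing a local-time or malformed timestamp to a restore is an easy mistake under pressure. The guard below assumes the script expects the same UTC `YYYY-MM-DDTHH:MM:SSZ` form shown in the summary, which is not confirmed here. `is_utc_timestamp` is a hypothetical helper and checks only the shape of the string, not whether the date is valid.

```shell
# Hypothetical guard: succeed only if $1 looks like YYYY-MM-DDTHH:MM:SSZ.
is_utc_timestamp() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Usage:
#   target="2026-04-27T09:55:00Z"
#   is_utc_timestamp "$target" || { echo "bad --target-time: $target" >&2; exit 1; }
```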

Verifying the rollback

# Image SHA matches the known-good
# (check the deployment you rolled back; tomoda-api is the API deployment named in Option 3)
kubectl get deploy tomoda-api -n prod \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Health endpoint
curl -fsS https://api.tomoda.life/health

# Argo CD reports Synced + Healthy
argocd app get prod-tomoda
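
Comparing the image by eye is error-prone with long SHAs. This sketch of an automated check assumes the image tag is exactly the known-good SHA. `image_has_tag` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the image reference ends in ":<expected>".
image_has_tag() {
  image=$1
  expected=$2
  # ${image##*:} strips everything up to and including the last colon.
  [ "${image##*:}" = "$expected" ]
}

# Usage (substitute the deployment you rolled back):
#   image=$(kubectl get deploy <deployment> -n prod \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   image_has_tag "$image" "<SHA>" && echo "image OK" || echo "image MISMATCH: $image"
```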

After verification, write a brief post-mortem and add the failure pattern to Debugging so the next on-call recognises it faster.