Loki

Cluster-wide log aggregation. Promtail tails every pod's logs and ships them to Loki; Grafana queries Loki using its built-in Loki datasource. There is no Elasticsearch, no Cloud Logging sink, no Fluent Bit — just loki-stack from Grafana's Helm chart.

Installed by k8s/envs/platform/loki/application.yaml, configured by k8s/envs/platform/loki/values.yaml.

Chart and source

Field	Value
Helm chart	`loki-stack`
Repository	`https://grafana.github.io/helm-charts`
Version	`2.10.2`
Destination namespace	`monitoring` (shared with kube-prometheus-stack)
Argo CD Application	`loki`

The loki-stack chart bundles both Loki (the log store) and Promtail (the per-node log shipper) — they're enabled in values.yaml:

loki:
  enabled: true
  persistence:
    enabled: true
    size: 20Gi
promtail:
  enabled: true

Promtail

Promtail runs as a DaemonSet, one pod per node, and tails /var/log/pods/*/*/*.log (one directory per pod, one subdirectory per container), which covers every container on that node. Workloads do not need to opt in: every pod's stdout/stderr lands in Loki by default.

The pipeline in values.yaml parses the Traefik JSON access log specifically:

promtail:
  config:
    snippets:
      pipelineStages:
        - cri: {}
        - match:
            selector: '{container="traefik"}'
            stages:
              - replace:
                  # Promtail substitutes each capture group, not the whole match, with `replace`.
                  expression: '[?&]token=([^&"\s]+)'
                  replace: 'REDACTED'
              - json:
                  expressions:
                    entryPointName: entryPointName
                    request_Host: RequestHost
                    request_Path: RequestPath
                    status: DownstreamStatus
                    method: RequestMethod
                    msg: msg
                    level: level
              - labels:
                  entryPointName:
                  request_Host:
                  status:
                  method:
                  level:

cri: {} strips the container-runtime envelope. The replace stage then redacts WebSocket-handshake JWTs before any downstream stage sees the line; the tokens travel as ?token=... because the browser WebSocket constructor can't set headers. RequestPath therefore lands in Loki as /ws/chats/abc?token=REDACTED, with host, method, status, and msg untouched.

The json stage extracts the Traefik fields, and labels promotes five of them (entryPointName, request_Host, status, method, level) into Loki labels, which is what makes the Traefik logs Grafana dashboard (gnetId: 13702) work out of the box. request_Path and msg are extracted but not promoted to labels.

Redaction applies only to what Promtail ships. The unredacted line stays in the pod's stdout on the node (Promtail typically picks it up within a second), so operators with kubectl logs access can still see live tokens there.
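The stage order described above can be approximated in plain Python to see what a Traefik line looks like at each step. This is a minimal sketch, not Promtail's implementation: the sample log line and the host app.example.test are invented for illustration, and the regex is written for Python's re.sub, which rewrites the whole match, whereas Promtail's replace stage applies its own capture-group substitution rules.

```python
import json
import re

# CRI log envelope: "<timestamp> <stream> <P|F> <content>".
CRI_LINE = re.compile(r"^(\S+) (stdout|stderr) ([PF]) (.*)$", re.DOTALL)

# re.sub replaces the whole match, so the leading "?" or "&" is captured
# and re-emitted in the replacement.
TOKEN = re.compile(r'([?&])token=[^&"\s]+')

# Extracted-field name -> Traefik JSON key, mirroring the json stage.
JSON_EXPRESSIONS = {
    "entryPointName": "entryPointName",
    "request_Host": "RequestHost",
    "request_Path": "RequestPath",
    "status": "DownstreamStatus",
    "method": "RequestMethod",
    "msg": "msg",
    "level": "level",
}

# Only these extracted fields become labels, mirroring the labels stage.
LABEL_KEYS = ("entryPointName", "request_Host", "status", "method", "level")


def process(raw_line, container):
    """Return (shipped_line, labels) for one raw container log line."""
    match = CRI_LINE.match(raw_line)
    content = match.group(4) if match else raw_line  # cri stage
    if container != "traefik":                       # match stage selector
        return content, {}
    content = TOKEN.sub(r"\1token=REDACTED", content)  # replace stage
    try:
        doc = json.loads(content)                    # json stage
    except ValueError:
        return content, {}
    if not isinstance(doc, dict):
        return content, {}
    extracted = {name: doc.get(key) for name, key in JSON_EXPRESSIONS.items()}
    labels = {k: str(extracted[k]) for k in LABEL_KEYS if extracted[k] is not None}
    return content, labels


# Hypothetical Traefik access-log line wrapped in a CRI envelope.
sample = (
    "2024-05-01T12:00:00.000000000Z stdout F "
    '{"entryPointName":"websecure","RequestHost":"app.example.test",'
    '"RequestPath":"/ws/chats/abc?token=eyJhbGciOi.fake.jwt",'
    '"DownstreamStatus":101,"RequestMethod":"GET","level":"info","msg":""}'
)

if __name__ == "__main__":
    line, labels = process(sample, container="traefik")
    print(line)
    print(labels)
```

Because redaction runs before JSON parsing, the token is already gone from RequestPath by the time any field is extracted, which is the ordering the pipeline relies on.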

Non-JSON logs (everything that isn't Traefik) still get ingested, just without the structured labels. They're queryable by {namespace="..."} / {app="..."} / similar.

Loki

Loki runs as a single StatefulSet with a 20Gi PVC for chunk storage and a 7-day compactor-enforced retention. It exposes:

http://loki:3100 — inside the cluster, used as a Grafana datasource.
A ServiceMonitor (label release: monitoring) — so Prometheus scrapes Loki's own metrics (loki_request_duration_seconds, etc.).

There is no S3 / GCS backend configured — chunks live on the PVC. That keeps the deployment simple but caps long-term log volume at the disk size.
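A back-of-envelope version of the disk-size cap, using the 20Gi PVC, the 7-day retention, and the ~13× headroom figure quoted under Operational notes. It assumes "headroom" means PVC capacity divided by steady-state on-disk usage and ignores index and WAL overhead; that definition and the implied write rate are inferences for illustration, not figures taken from operations/observability/loki.md.

```python
PVC_GIB = 20.0        # PVC size from this page
RETENTION_DAYS = 7    # compactor-enforced retention from this page
HEADROOM = 13.0       # "~13x" from the operational notes (approximate)


def headroom(pvc_gib, daily_gib, retention_days):
    """Capacity divided by steady-state usage (assumed definition)."""
    return pvc_gib / (daily_gib * retention_days)


# Steady-state usage implied by the quoted headroom.
steady_state_gib = PVC_GIB / HEADROOM

# Average stored volume per day implied by that usage.
daily_gib = steady_state_gib / RETENTION_DAYS

# Daily volume at which a full retention window would fill the PVC.
max_daily_gib = PVC_GIB / RETENTION_DAYS

if __name__ == "__main__":
    print(f"implied steady-state usage: {steady_state_gib:.2f} GiB")
    print(f"implied daily volume:       {daily_gib:.2f} GiB/day")
    print(f"daily volume that fills it: {max_daily_gib:.2f} GiB/day")
```

Under these assumptions the implied steady state is roughly 1.5 GiB on disk, and the PVC would only fill if stored volume grew to a little under 3 GiB per day.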

Querying

Logs are queried in Grafana via the Loki datasource (added in k8s/envs/platform/monitoring/values.yaml):

grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki:3100
      access: proxy
      isDefault: false

Open Grafana at https://grafana.tomoda.life, switch to the Explore tab, pick the Loki datasource, and use LogQL:

{namespace="tomoda"} |= "error"
{namespace="traefik-system", status=~"5.."}

The Traefik-logs dashboard provides a pre-built view over the structured labels.
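The same LogQL can be sent straight to Loki's HTTP API from inside the cluster, which is handy for scripted checks. The sketch below builds a query_range request against http://loki:3100; the helper names, the one-hour default window, and the limit of 100 are illustrative choices, and fetch() only works from a pod that can resolve the loki service.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

LOKI_URL = "http://loki:3100"  # in-cluster service address from this page


def query_range_url(logql, minutes=60, limit=100, now_ns=None):
    """Build a /loki/api/v1/query_range URL covering the last `minutes`."""
    end = now_ns if now_ns is not None else time.time_ns()
    start = end - minutes * 60 * 1_000_000_000  # timestamps are in nanoseconds
    params = {
        "query": logql,
        "start": start,
        "end": end,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


def fetch(logql, **kwargs):
    """Run the query and yield (labels, timestamp_ns, line) per log entry."""
    with urlopen(query_range_url(logql, **kwargs), timeout=10) as resp:
        body = json.load(resp)
    for stream in body["data"]["result"]:
        for ts, line in stream["values"]:
            yield stream["stream"], ts, line


if __name__ == "__main__":
    # Prod-only errors, per the namespace-filtering note on this page.
    print(query_range_url('{namespace="tomoda-prod"} |= "error"', minutes=15))
```

Filtering on namespace in the stream selector, as in the example call, keeps dev and prod traffic apart even though both land in the same Loki.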

Operational notes

20Gi PVC + 7-day compactor retention. Sized for the current write rate with ~13× headroom. The retention math and the compactor config live in operations/observability/loki.md.
No object-storage backend. Switching Loki to a GCS-backed schema is the next scaling step, not a current capability.
Promtail consumes CPU on busy nodes. Resource limits in values.yaml cap it at 200m CPU / 128Mi memory per node — usually fine, but visible in node-pressure events if log volume spikes.
One Loki across all environments. Both the tomoda (dev) and tomoda-prod namespaces ship logs to the same Loki. Filter by namespace in queries when looking at prod traffic only.
Backend Zap JSON parsing. The Promtail pipeline parses tomoda backend logs and promotes level as a label; trace_id stays in the body for trace→log navigation from Tempo. See the observability/loki page for the full label strategy.