Traces
Logs and traces checkpoint #
The secrets checkpoint gave the platform a real secrets path and used it for central Prometheus metrics. The policies checkpoint then added a basic admission boundary. Now the platform can continue with observability.
Metrics answer the aggregate question: are requests happening, are stages rendering, and how long are they taking? The next operational question is narrower and more practical:
What happened to this request?
Metrics do not answer that by themselves. For one slow or broken render, I need the logs for that request and the trace that shows how it moved through AWS, GCP, and Azure.
So we add two more observability signals:
- logs from Kubernetes pods
- distributed traces from the Mandelbrot request path
The shape is still intentionally small. Each cluster keeps local fallbacks, and Grafana Cloud becomes the cross-cluster view.
GitOps ownership #
The same Argo CD observability application now owns the expanded stack. The base already had Prometheus and Grafana. This branch adds Loki, Promtail, Jaeger, and the OpenTelemetry Collector:
platform/observability/base/
  loki-config.yaml
  loki-deployment.yaml
  loki-service.yaml
  promtail-service-account.yaml
  promtail-rbac.yaml
  promtail-config.yaml
  promtail-daemonset.yaml
  jaeger-deployment.yaml
  jaeger-service.yaml
  otel-collector-config.yaml
  otel-collector-deployment.yaml
  otel-collector-service.yaml
The cloud overlays add the provider-backed credentials:
platform/observability/overlays/aws/
  grafana-cloud-logs-external-secret.yaml
  grafana-cloud-traces-external-secret.yaml
platform/observability/overlays/gcp/
  grafana-cloud-logs-external-secret.yaml
  grafana-cloud-traces-external-secret.yaml
platform/observability/overlays/azure/
  grafana-cloud-logs-external-secret.yaml
  grafana-cloud-traces-external-secret.yaml
That keeps the secrets model established earlier. Git declares the workload shape and the secret references. The cloud secret backend stores the actual Grafana Cloud endpoints and tokens. External Secrets materializes those values into Kubernetes.
Logs #
Loki is the local log store. It is deliberately small here: one in-cluster service that can answer recent log queries for the local cluster. Promtail is the log shipper. It runs as a DaemonSet, reads pod logs from the node, attaches Kubernetes labels, parses JSON log fields, and pushes the result to two places:
Promtail -> local Loki
Promtail -> Grafana Cloud Logs
The Promtail clients show that split:
clients:
  - url: http://loki.observability.svc.cluster.local:3100/loki/api/v1/push
  - url: ${GRAFANA_CLOUD_LOKI_URL}
    basic_auth:
      username: ${GRAFANA_CLOUD_LOKI_USERNAME}
      password: ${GRAFANA_CLOUD_LOKI_PASSWORD}
The useful part is the pipeline. Promtail first decodes the container runtime log format, then parses the Mandelbrot JSON payload:
pipeline_stages:
  - cri: {}
  - json:
      expressions:
        level:
        message:
        traceId:
        spanId:
        platform:
        service:
        cloud:
        region:
  - labels:
      level:
      platform:
      service:
      cloud:
      region:
  - output:
      source: message
That gives log queries stable labels and keeps the visible message readable.
The Mandelbrot service now writes structured JSON logs. A render completion log carries the request identifiers and platform context:
{
  "time": "2026-05-23T08:15:00.000Z",
  "level": "info",
  "message": "mandelbrot render completed",
  "platform": "trinity",
  "service": "mandelbrot",
  "cloud": "aws",
  "region": "us-east-1",
  "traceId": "f3d75d28dfae23d5abea1b500ac3431e",
  "spanId": "56c89d3fde9c645e",
  "jobId": "m_mp84cacn_5kgis0t",
  "status": "complete"
}
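A minimal sketch of how a Node service could emit a line in that shape. This is an illustration, not the Mandelbrot service's actual logger: the `buildLogLine` and `log` helpers and the `PLATFORM`, `CLOUD`, and `REGION` variable names are hypothetical. Only `OTEL_SERVICE_NAME` appears in this checkpoint's deployment.

```javascript
// Hypothetical helper: builds one JSON log line with the fields Promtail parses.
// PLATFORM, CLOUD and REGION are assumed variable names for this sketch.
function buildLogLine(level, message, fields = {}, env = process.env, now = new Date()) {
  const record = {
    time: now.toISOString(),
    level,
    message,
    platform: env.PLATFORM || "unknown",
    service: env.OTEL_SERVICE_NAME || "unknown",
    cloud: env.CLOUD || "unknown",
    region: env.REGION || "unknown",
    ...fields, // request context: traceId, spanId, jobId, status, ...
  };
  // JSON.stringify without an indent argument emits a single compact line.
  return JSON.stringify(record);
}

// Writes one log line to stdout, where the container runtime picks it up.
function log(level, message, fields) {
  process.stdout.write(buildLogLine(level, message, fields) + "\n");
}

log("info", "mandelbrot render completed", {
  traceId: "f3d75d28dfae23d5abea1b500ac3431e",
  spanId: "56c89d3fde9c645e",
  status: "complete",
});
```

Writing one compact JSON object per line to stdout is what lets the node-level shipper parse each line on its own.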
The first useful LogQL query is simple:
{namespace="mandelbrot", app="mandelbrot", platform="trinity"} | json | traceId != ""
That query says: show Mandelbrot logs, parse the JSON payload, and keep only lines that can be correlated to a trace.
Traces #
A trace is the end-to-end story of one request. A span is one timed operation inside that story. For Mandelbrot, the trace follows the render request as it fans out into stage renders:
POST /api/render
  render stage aws
    POST /internal/render-stage
      render tile
  render stage gcp
    POST /internal/render-stage
      render tile
  render stage azure
    POST /internal/render-stage
      render tile
This branch keeps the application implementation small. It does not bring in a full OpenTelemetry SDK yet. Instead, the Node service creates Zipkin-format spans directly, propagates B3 headers across stage calls, and exports each span to two in-cluster endpoints:
env:
  - name: OTEL_SERVICE_NAME
    value: mandelbrot
  - name: ZIPKIN_ENDPOINTS
    value: http://jaeger.observability.svc.cluster.local:9411/api/v2/spans,http://otel-collector.observability.svc.cluster.local:9411/api/v2/spans
  - name: TRACE_SAMPLE_RATE
    value: "1"
The two endpoints have different jobs:
Jaeger:
  local trace fallback inside the cluster
OpenTelemetry Collector:
  receives Zipkin spans and exports them to Grafana Cloud Traces
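To make the span and header mechanics concrete, here is a rough sketch of hand-rolled Zipkin spans with B3 propagation. The function names (`randomHex`, `startSpan`, `b3Headers`, `toZipkinSpan`) are hypothetical and not taken from the Mandelbrot code. The sketch assumes lowercase header names, as Node's HTTP server presents them, and it creates a child span on the callee side instead of sharing the caller's span ID.

```javascript
// Random lowercase hex string; 16 bytes for a trace ID, 8 bytes for a span ID.
function randomHex(bytes) {
  let out = "";
  for (let i = 0; i < bytes; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

// Start a span. If B3 headers are present, join that trace as a child span.
function startSpan(name, incomingHeaders = {}) {
  const traceId = incomingHeaders["x-b3-traceid"] || randomHex(16);
  const parentId = incomingHeaders["x-b3-spanid"]; // caller's span becomes the parent
  return { traceId, id: randomHex(8), parentId, name, startMs: Date.now() };
}

// Headers to attach to an outgoing stage call so the callee joins the trace.
function b3Headers(span) {
  return {
    "x-b3-traceid": span.traceId,
    "x-b3-spanid": span.id,
    "x-b3-sampled": "1",
  };
}

// Convert to the Zipkin v2 JSON shape; timestamp and duration are microseconds.
function toZipkinSpan(span, serviceName, endMs = Date.now(), tags = {}) {
  const zipkinSpan = {
    traceId: span.traceId,
    id: span.id,
    name: span.name,
    timestamp: span.startMs * 1000,
    duration: Math.max(1, (endMs - span.startMs) * 1000),
    localEndpoint: { serviceName },
    tags,
  };
  if (span.parentId) zipkinSpan.parentId = span.parentId;
  return zipkinSpan;
}
```

In this sketch the entry handler would call `startSpan("POST /api/render", req.headers)`, pass `b3Headers(span)` on each stage call, and the stage handler would do the same with its own incoming headers, which is what keeps all three clouds inside one trace.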
The span export path in the app is explicit. The service posts a Zipkin span to every configured endpoint and logs export failures:
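for (const endpoint of zipkinEndpoints) {
  // Not awaited, so span export never blocks the request path.
  fetch(endpoint, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify([zipkinSpan]),
  })
    .then((res) => {
      // fetch resolves on HTTP error statuses, so check the status explicitly.
      if (!res.ok) console.error(`zipkin export to ${endpoint} failed: HTTP ${res.status}`);
    })
    .catch((err) => console.error(`zipkin export to ${endpoint} failed: ${err.message}`));
}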
Silent telemetry failure is worse than no telemetry in a checkpoint like this, because it makes the platform look observable while data is being dropped.
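The two settings that feed this export path, ZIPKIN_ENDPOINTS and TRACE_SAMPLE_RATE, could be read along these lines. This is a sketch under assumptions: the helper names are hypothetical, and treating TRACE_SAMPLE_RATE as a probability between 0 and 1 is my reading of the value "1", not something the deployment manifest states.

```javascript
// Split the comma-separated ZIPKIN_ENDPOINTS value, dropping blanks.
function parseEndpoints(value = "") {
  return value
    .split(",")
    .map((entry) => entry.trim())
    .filter(Boolean);
}

// Assumption: the rate is a probability in [0, 1].
// Missing or unparseable values fall back to 1 (sample everything).
function parseSampleRate(value) {
  const rate = Number(value);
  if (value === undefined || value === "" || Number.isNaN(rate)) return 1;
  return Math.min(1, Math.max(0, rate));
}

// Decide once per trace, at the root span, whether to record it.
// Math.random() returns values in [0, 1), so a rate of 1 always samples.
function shouldSample(rate, random = Math.random) {
  return random() < rate;
}

const zipkinEndpoints = parseEndpoints(process.env.ZIPKIN_ENDPOINTS);
const sampleRate = parseSampleRate(process.env.TRACE_SAMPLE_RATE);
```

Falling back to full sampling on a bad value is a choice that suits a demo checkpoint; a busy service would more likely fail closed or log the misconfiguration.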
Collector #
The OpenTelemetry Collector is the bridge between the local cluster and Grafana Cloud Traces. It listens for Zipkin spans on port 9411, batches them, and exports them over OTLP HTTP with basic auth:
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411
processors:
  batch: {}
exporters:
  otlphttp/grafana_cloud:
    endpoint: ${env:GRAFANA_CLOUD_TEMPO_OTLP_HTTP_ENDPOINT}
    auth:
      authenticator: basicauth/grafana_cloud
service:
  pipelines:
    traces:
      receivers:
        - zipkin
      processors:
        - batch
      exporters:
        - otlphttp/grafana_cloud
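For orientation, HTTP basic auth puts one header on each export request. The sketch below shows how that header value is formed from a username and password; `basicAuthHeader` is a hypothetical helper for illustration, and the collector builds the header itself from its authenticator configuration.

```javascript
// HTTP basic auth: "Basic " followed by base64("username:password").
function basicAuthHeader(username, password) {
  const credentials = Buffer.from(`${username}:${password}`, "utf8");
  return "Basic " + credentials.toString("base64");
}

console.log(basicAuthHeader("user", "pass")); // Basic dXNlcjpwYXNz
```

Because the header is just an encoding of both values, a wrong username produces an authentication failure just as a wrong token does.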
The collector credentials come from the grafana-cloud-traces Kubernetes Secret. That secret is created by External Secrets from the provider backend.
The AWS overlay shows the shape:
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: grafana-cloud-traces
  namespace: observability
spec:
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: grafana-cloud-traces
    creationPolicy: Owner
    deletionPolicy: Retain
  data:
    - secretKey: endpoint
      remoteRef:
        key: trinity-dev-aws-grafana-cloud-traces-endpoint-mg
    - secretKey: username
      remoteRef:
        key: trinity-dev-aws-grafana-cloud-traces-username-mg
    - secretKey: password
      remoteRef:
        key: trinity-dev-aws-grafana-cloud-traces-password-mg
The -mg suffix is not a new concept in the platform. It is the branch's current AWS backend-name suffix, used to avoid names that were blocked by AWS Secrets Manager scheduled deletion. The important part is that the ExternalSecret keys match the actual backend names created for that cluster.
Grafana correlation #
Grafana is still provisioned from Git. It now gets three in-cluster data sources:
Prometheus -> metrics
Loki -> logs
Jaeger -> traces
The Loki data source includes a derived field that extracts traceId from JSON logs and links it to Jaeger:
- name: Loki
  uid: Loki
  type: loki
  url: http://loki.observability.svc.cluster.local:3100
  jsonData:
    derivedFields:
      - datasourceUid: Jaeger
        matcherRegex: '"traceId":"([a-f0-9]{16,32})"'
        name: TraceID
        url: '$${__value.raw}'
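The matcher is an ordinary regular expression, so it is easy to check outside Grafana. The sketch below applies the same pattern to two sample lines; `extractTraceId` is a hypothetical helper, and the sample lines are made up for illustration.

```javascript
// Same pattern as matcherRegex in the Loki data source.
const matcher = /"traceId":"([a-f0-9]{16,32})"/;

// Returns the captured trace ID, or null when the line does not match.
function extractTraceId(logLine) {
  const match = matcher.exec(logLine);
  return match ? match[1] : null;
}

// Compact JSON, as JSON.stringify emits it.
const compact = JSON.stringify({
  message: "mandelbrot render completed",
  traceId: "f3d75d28dfae23d5abea1b500ac3431e",
});

// Same field, but with a space after the colon.
const spaced = '{"traceId": "f3d75d28dfae23d5abea1b500ac3431e"}';

console.log(extractTraceId(compact)); // f3d75d28dfae23d5abea1b500ac3431e
console.log(extractTraceId(spaced)); // null
```

The pattern expects compact JSON with lowercase hex IDs. A logger that pretty-prints or uppercases the ID would silently lose the link to Jaeger.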
That gives the local workflow:
dashboard -> logs -> trace
Grafana Cloud gives the same kind of workflow across all three clusters. Prometheus remote-writes metrics, Promtail pushes logs, and the OpenTelemetry Collector exports traces. The common traceId ties the signals together.
Validation #
After Argo CD syncs the observability application, first check the local components:
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get deployment loki jaeger otel-collector
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get daemonset promtail
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get service loki jaeger otel-collector
Then check that External Secrets materialized the remote credentials:
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get externalsecret grafana-cloud-logs grafana-cloud-traces
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get secret grafana-cloud-logs grafana-cloud-traces
Generate a few renders through Front Door and query Grafana Cloud Logs:
{namespace="mandelbrot", app="mandelbrot", platform="trinity"} | json | traceId != ""
The useful labels should include:
app=mandelbrot
container=mandelbrot
namespace=mandelbrot
platform=trinity
The useful messages are application events, not only platform noise:
request completed
mandelbrot render completed
mandelbrot stage rendered
Then query Grafana Cloud Traces for the mandelbrot service. The trace table should show recent POST /api/render spans with durations in the hundreds of milliseconds:
service     name              duration
mandelbrot  POST /api/render  155 ms
mandelbrot  POST /api/render  209 ms
mandelbrot  POST /api/render  274 ms
mandelbrot  POST /api/render  305 ms
The local Jaeger fallback is checked through port-forwarding:
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability port-forward service/jaeger 16686:16686
Open http://localhost:16686, select the mandelbrot service, and search for recent traces.
Troubleshooting #
The first run exposed two useful failure modes.
Promtail came up cleanly, but Grafana Cloud rejected the first log batches. The first error was 405 Method Not Allowed. That was a bad Loki URL: the Grafana Cloud host was right, but Promtail must post to the full push endpoint:
/loki/api/v1/push
After that, the error changed to 401 Unauthorized with invalid scope requested. That was a token and username problem, not a Kubernetes problem. The logs username must be the Grafana Cloud Logs user ID, and the token needs write access to that Loki instance.
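The URL half of that failure is cheap to catch before deploying. Here is a small sketch of such a preflight check; `checkLokiClient` is a hypothetical helper and `logs.example.invalid` is a placeholder host. It cannot detect the second failure: whether a token has write scope is only known once Grafana Cloud answers.

```javascript
const PUSH_PATH = "/loki/api/v1/push";

// Returns a list of problems with Promtail client settings; empty means plausible.
function checkLokiClient({ url, username, password }) {
  let parsed;
  try {
    parsed = new URL(url);
  } catch {
    return ["url is not a valid absolute URL"];
  }
  const problems = [];
  // Ignore trailing slashes, then require the full push path.
  if (parsed.pathname.replace(/\/+$/, "") !== PUSH_PATH) {
    problems.push(`url path should be ${PUSH_PATH}, got ${parsed.pathname}`);
  }
  if (!username) problems.push("username is empty");
  if (!password) problems.push("password is empty");
  return problems;
}

// A host-only URL is flagged, which is the mistake behind the 405.
console.log(checkLokiClient({
  url: "https://logs.example.invalid",
  username: "123456",
  password: "placeholder-token",
}));
```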
Traces had a different failure mode. The OpenTelemetry Collector was healthy, and a synthetic Zipkin span posted from inside the cluster was accepted and exported. Real Mandelbrot renders still did not move the collector counters. That narrowed the problem to the application pod.
The deployment had the new ZIPKIN_ENDPOINTS environment variable, but GCP was still serving the old pod with only the local Jaeger endpoint. The replacement pod was pending because the one-node GKE cluster did not have enough free CPU for a rolling-update surge. This branch changes the Mandelbrot deployment strategy to Recreate:
spec:
  replicas: 1
  strategy:
    type: Recreate
For a real service, I would usually scale capacity or tune rollout settings more carefully. For this single-replica demo app, Recreate is acceptable because the goal is to avoid a tiny node getting stuck with both old and new pods during the same rollout.
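The arithmetic behind the stuck rollout is short. With the Kubernetes defaults of 25% for both maxSurge and maxUnavailable, the surge percentage rounds up and the unavailable percentage rounds down. The sketch below works that out for one replica; the function names are hypothetical.

```javascript
// Resolve an absolute count or a percentage string against the replica count.
function resolveCount(value, replicas, roundUp) {
  if (typeof value === "number") return value;
  const scaled = (parseFloat(value) / 100) * replicas;
  return roundUp ? Math.ceil(scaled) : Math.floor(scaled);
}

// Bounds a RollingUpdate must respect while old and new pods coexist.
function rollingUpdateBounds(replicas, maxSurge = "25%", maxUnavailable = "25%") {
  const surge = resolveCount(maxSurge, replicas, true); // rounds up
  const unavailable = resolveCount(maxUnavailable, replicas, false); // rounds down
  return { maxPods: replicas + surge, minAvailable: replicas - unavailable };
}

console.log(rollingUpdateBounds(1)); // { maxPods: 2, minAvailable: 1 }
```

With one replica, the rollout must keep the old pod available until the new one is ready, so the node has to fit both at once. Recreate removes the old pod first, trading a short outage for a rollout that fits.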
The collector counters are the fastest trace check:
KUBECONFIG=./kubeconfig.gcp.yaml \
kubectl -n observability port-forward deployment/otel-collector 8888:8888
Then in another shell:
curl -s localhost:8888/metrics | grep -iE \
'otelcol_receiver_.*spans|otelcol_exporter_.*spans'
The expected direction is:
otelcol_receiver_accepted_spans increases
otelcol_exporter_sent_spans increases
otelcol_exporter_send_failed_spans stays at 0
If receiver counters stay at zero after a render, the app is not reaching the collector. If receiver counters increase but exporter failures increase too, the issue is between the collector and Grafana Cloud.
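That decision rule can be written down as a small sketch over the metrics text. The `sumMetric` and `diagnose` helpers are hypothetical, and the sample label sets are illustrative. The sketch matches metric names by prefix in case the collector version in use appends a suffix such as `_total`, and it reads absolute values, so it only suits a freshly started collector; on a long-running one, compare two scrapes instead.

```javascript
// Sum every sample of a metric in Prometheus text format, across label sets.
function sumMetric(metricsText, namePrefix) {
  let total = 0;
  for (const line of metricsText.split("\n")) {
    if (line.startsWith("#") || !line.startsWith(namePrefix)) continue;
    // The sample value is the last whitespace-separated token on the line.
    const value = Number(line.trim().split(/\s+/).pop());
    if (!Number.isNaN(value)) total += value;
  }
  return total;
}

// Map the three counters to the failure modes described above.
function diagnose(metricsText) {
  const accepted = sumMetric(metricsText, "otelcol_receiver_accepted_spans");
  const sent = sumMetric(metricsText, "otelcol_exporter_sent_spans");
  const failed = sumMetric(metricsText, "otelcol_exporter_send_failed_spans");
  if (accepted === 0) return "app is not reaching the collector";
  if (failed > 0) return "collector cannot deliver to Grafana Cloud";
  if (sent === 0) return "spans received but not exported yet";
  return "ok";
}

const sample = [
  "# TYPE otelcol_receiver_accepted_spans counter",
  'otelcol_receiver_accepted_spans{receiver="zipkin"} 12',
  'otelcol_exporter_sent_spans{exporter="otlphttp/grafana_cloud"} 12',
  'otelcol_exporter_send_failed_spans{exporter="otlphttp/grafana_cloud"} 0',
].join("\n");

console.log(diagnose(sample)); // ok
```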
Exit #
This checkpoint completes the first full observability loop. The platform now has local metrics, logs, and traces in every cluster, plus a centralized Grafana Cloud view across AWS, GCP, and Azure.
It is still intentionally lightweight. Prometheus, Loki, and Jaeger use in-cluster ephemeral storage as local fallbacks. The durable platform view is the remote one: metrics are remote-written, logs are pushed by Promtail, and traces are exported by the OpenTelemetry Collector.
The next platform gap is release safety. Now that the platform has application delivery, traffic, secrets, policies, metrics, logs, and traces, the next checkpoint changes Mandelbrot from ordinary deployment to controlled progressive delivery.