Observability

[Figure: Observability flow]

Observability checkpoint #

The Mandelbrot service already exposes a Prometheus-compatible /metrics endpoint. That makes the first observability step small enough to keep inside the GitOps model: deploy Prometheus and Grafana into an observability namespace in each cluster, scrape the local Mandelbrot service, and provision a dashboard from Git.

The goal is deliberately narrow:

  • prove that the application exposes useful runtime signals
  • collect those signals inside each cluster
  • show them in a dashboard without manual Grafana setup
  • keep the manifests portable across AWS, GCP, and Azure

This checkpoint does not try to centralize metrics yet. It builds the local shape first.

GitOps ownership #

Argo CD owns the observability stack in each cluster:

trinity-observability-aws
trinity-observability-gcp
trinity-observability-azure

Each application points at the matching cloud overlay under platform/observability. In this branch the observability application syncs in wave 1, after the Argo CD project and base application wiring are already in place:

metadata:
  name: trinity-observability-aws
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  source:
    repoURL: https://github.com/maxgherman/trinity.git
    targetRevision: main
    path: platform/observability/overlays/aws
  destination:
    server: https://kubernetes.default.svc
    namespace: observability

The repository now has a shared observability base and one overlay per cloud:

platform/observability/
  base/
    namespace.yaml
    prometheus-deployment.yaml
    prometheus-service.yaml
    grafana-config.yaml
    grafana-dashboard.yaml
    grafana-deployment.yaml
    grafana-service.yaml
  overlays/
    aws/
    gcp/
    azure/

The base contains the common namespace, deployments, services, Prometheus configuration, Grafana provisioning, and dashboard. The overlays only change the cloud label in the Prometheus configuration. That is the same pattern as the application layer: one shared shape, small provider-specific overlays, and no manual cluster edits.

Prometheus #

Prometheus is the metrics collector. It periodically calls configured HTTP endpoints, reads numeric time-series samples, stores them locally, and lets you query them with PromQL. In this checkpoint it has one job: scrape the Mandelbrot service and keep enough recent data to inspect the application while testing the platform.

The Prometheus configuration is intentionally static. It scrapes mandelbrot.mandelbrot.svc.cluster.local:80 in the same cluster and attaches a cloud label through the overlay. That avoids Kubernetes service discovery and the cluster-wide RBAC it would need, while still proving the workflow: the metrics configuration is reconciled from Git, metrics are collected inside each cluster, and they are shown through a pre-provisioned Grafana dashboard.

The deployment is deliberately small:

  • one Prometheus pod with a six-hour local retention window
  • a ClusterIP service
  • a ConfigMap mounted as prometheus.yml
  • an emptyDir data volume

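That keeps the first metrics slice cheap and easy to inspect. The tradeoff is clear: Prometheus stores data on an emptyDir volume, which lives only as long as the pod. A container restart inside the same pod keeps the data, but deleting or rescheduling the pod (including a rollout of the deployment) loses local history. For this checkpoint that is acceptable. A later version can add persistent volumes or remote write once the platform needs durable metrics.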

The relevant part of the deployment manifest covers the arguments and the two volumes:

args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.retention.time=6h
volumeMounts:
  - name: prometheus-config
    mountPath: /etc/prometheus/prometheus.yml
    subPath: prometheus.yml
    readOnly: true
  - name: prometheus-data
    mountPath: /prometheus
volumes:
  - name: prometheus-config
    configMap:
      name: prometheus-config
  - name: prometheus-data
    emptyDir: {}

The Prometheus overlays differ only in the value of the cloud external label. The AWS overlay renders this configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cloud: aws
    platform: trinity
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - job_name: mandelbrot
    metrics_path: /metrics
    static_configs:
      - targets:
          - mandelbrot.mandelbrot.svc.cluster.local:80
        labels:
          service: mandelbrot

The metric names are application-level signals:

mandelbrot_render_requests_total
mandelbrot_stage_renders_total
mandelbrot_stage_render_seconds_count
mandelbrot_stage_render_seconds_sum

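That is enough to answer the first operational questions: is the app receiving render requests, are stage renders happening, and how long are they taking? The last answer comes from dividing the _seconds_sum series by the _seconds_count series, which gives the average stage duration.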
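
The same questions can be asked from a terminal through the Prometheus HTTP query API. The sketch below is illustrative rather than part of the repository: it assumes Prometheus is reachable on localhost:9090 (for example through the port-forward used during validation), that the request counter carries a status label, and that a five-minute rate window is appropriate. The function names are hypothetical.

```shell
# Sketch: ask Prometheus the three operational questions via its HTTP query API.
# PROM_URL, the 5m window, and the `status` label are assumptions.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run one instant query; curl -G sends the URL-encoded expression as a GET parameter.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Is the app receiving render requests? Per-second request rate, grouped by status.
render_request_rate() {
  prom_query 'sum by (status) (rate(mandelbrot_render_requests_total[5m]))'
}

# Are stage renders happening? Per-second stage render rate.
stage_render_rate() {
  prom_query 'sum(rate(mandelbrot_stage_renders_total[5m]))'
}

# How long are they taking? Average seconds per stage render: rate(sum) / rate(count).
avg_stage_seconds() {
  prom_query 'sum(rate(mandelbrot_stage_render_seconds_sum[5m])) / sum(rate(mandelbrot_stage_render_seconds_count[5m]))'
}
```

Each function prints the raw JSON response, so the output can be piped into a JSON tool of choice. Dividing the two rates, rather than the raw counters, gives the average over the recent window instead of over the whole lifetime of the process.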

Grafana #

Grafana is also Git-provisioned. The datasource points at the in-cluster Prometheus service:

http://prometheus.observability.svc.cluster.local:9090

The dashboard tracks the first useful Mandelbrot signals:

  • render request rate by status
  • stage render rate by cloud and status
  • average stage duration
  • total stage samples

Grafana runs with anonymous viewer access for this checkpoint. That is not a production access model. It is a local inspection surface for the exercise, and it keeps the first metrics phase focused on collection and dashboard provisioning instead of authentication.

CI check #

After the manifests were added, CI had to understand the new platform slice. The manifest checker now allows ConfigMap, and the workflows render the observability overlays alongside the application overlays:

kubectl kustomize platform/observability/overlays/aws
kubectl kustomize platform/observability/overlays/gcp
kubectl kustomize platform/observability/overlays/azure

That catches the basic failure modes before Argo CD sees the change: invalid YAML, manifest kinds the exercise checker does not allow, or a broken Kustomize overlay.
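
As a rough sketch of how such a step could be written, the three render commands can be folded into one loop that reports every broken overlay before failing. The function name and the default root path argument are illustrative and not taken from the actual workflow files.

```shell
# Sketch of a CI step: render every observability overlay and fail if any does not build.
# The function name and its optional root argument are hypothetical.
render_observability_overlays() {
  local root="${1:-platform/observability}"
  local cloud failed=0
  for cloud in aws gcp azure; do
    # Discard the rendered YAML; only the exit status matters here.
    if kubectl kustomize "${root}/overlays/${cloud}" > /dev/null; then
      echo "ok: ${cloud}"
    else
      echo "FAILED: ${cloud}" >&2
      failed=1
    fi
  done
  return "${failed}"
}
```

Calling `render_observability_overlays || exit 1` from the repository root would stop the job on a broken overlay while still listing the state of all three.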

Validation #

The live validation path is the same in every cluster. First check that Argo CD created the observability application and reconciled the namespace:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n argocd get application trinity-observability-aws
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get deployment,service,pods
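
Because the path is the same in every cluster, the Argo CD check can be looped over the three clouds. This is a sketch under assumptions: it presumes the GCP and Azure kubeconfig files follow the same ./kubeconfig.<cloud>.yaml naming as the AWS one, and the function names are hypothetical.

```shell
# Sketch: report Argo CD sync and health status for the observability app in each cloud.
# Assumes kubeconfig files named ./kubeconfig.<cloud>.yaml, mirroring the AWS example.
observability_app_status() {
  local cloud="$1"
  # status.sync.status and status.health.status are fields of the Argo CD Application resource.
  KUBECONFIG="./kubeconfig.${cloud}.yaml" kubectl -n argocd get application \
    "trinity-observability-${cloud}" \
    -o jsonpath='{.status.sync.status}/{.status.health.status}'
}

# Print one line per cloud and return non-zero unless all three are Synced and Healthy.
check_all_clouds() {
  local cloud status rc=0
  for cloud in aws gcp azure; do
    status="$(observability_app_status "${cloud}")"
    echo "${cloud}: ${status}"
    [ "${status}" = "Synced/Healthy" ] || rc=1
  done
  return "${rc}"
}
```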

Then generate a few renders through Front Door and inspect Prometheus:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability port-forward service/prometheus 9090:9090

The most useful Prometheus page is /targets (http://localhost:9090/targets while the port-forward is running). It should show the mandelbrot scrape target as UP. From the query page, these expressions should return data after a few renders:

mandelbrot_render_requests_total
mandelbrot_stage_renders_total
mandelbrot_stage_render_seconds_count
mandelbrot_stage_render_seconds_sum
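
The target check can also be scripted instead of read off the /targets page. The sketch below assumes the Prometheus port-forward above is still running and queries the built-in up series for the mandelbrot job. The grep pattern is a deliberately crude match on the JSON response, not a real parser, and the function name is hypothetical.

```shell
# Sketch: confirm the mandelbrot scrape target is up without opening the browser.
# Assumes Prometheus is reachable on PROM_URL; the grep is a crude check, not a JSON parser.
PROM_URL="${PROM_URL:-http://localhost:9090}"

mandelbrot_target_up() {
  # `up` is recorded by Prometheus for every scrape target: 1 means the last scrape succeeded.
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode 'query=up{job="mandelbrot"}' \
    | grep -Eq '"value":[[:space:]]*\[[^]]*,[[:space:]]*"1"]'
}
```

The function returns zero only when the response contains a sample whose value is "1", so it can gate the rest of a validation script.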

Grafana is checked the same way:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability port-forward service/grafana 3000:3000

Then open the provisioned dashboard:

http://localhost:3000/d/trinity-mandelbrot/trinity-mandelbrot
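
The same check can be made without a browser through the Grafana HTTP API. This sketch assumes the Grafana port-forward above is running and that anonymous viewer access is enough to read the dashboard. The dashboard UID trinity-mandelbrot is read from the /d/<uid>/<slug> URL, and the function names are hypothetical.

```shell
# Sketch: check Grafana and the provisioned dashboard over HTTP.
# Assumes Grafana is reachable on GRAFANA_URL with anonymous viewer access.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# Print only the HTTP status code for a Grafana path.
grafana_status() {
  curl -s -o /dev/null -w '%{http_code}' "${GRAFANA_URL}$1"
}

grafana_dashboard_ready() {
  # /api/health reports whether Grafana itself is up.
  [ "$(grafana_status /api/health)" = "200" ] || return 1
  # /api/dashboards/uid/<uid> returns the dashboard JSON if it was provisioned.
  [ "$(grafana_status /api/dashboards/uid/trinity-mandelbrot)" = "200" ]
}
```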

The same verification passed in AWS, GCP, and Azure. Prometheus comes up, scrapes the local Mandelbrot service, and Grafana loads the checked-in dashboard without manual setup.

Exit #

This version of the observability platform does not yet centralize metrics across clouds, collect logs, or emit traces. It is the first checkpoint: the application exposes useful signals, each cluster can collect them, and the dashboard definition is versioned with the rest of the platform. The next checkpoint takes the first step toward central metrics. Writing to a shared backend requires provider credentials, and that means the platform needs a proper secrets path before the observability stack can publish beyond the local cluster.

Source code #

Reference implementation