Observability
Observability checkpoint
The Mandelbrot service already exposes a Prometheus-compatible /metrics endpoint.
That makes the first observability step small enough to keep inside the GitOps model: deploy Prometheus and Grafana into an observability namespace in each cluster, scrape the local Mandelbrot service, and provision a dashboard from Git.
The goal is deliberately narrow:
- prove that the application exposes useful runtime signals
- collect those signals inside each cluster
- show them in a dashboard without manual Grafana setup
- keep the manifests portable across AWS, GCP, and Azure
This checkpoint does not try to centralize metrics yet. It builds the local shape first.
GitOps ownership
Argo CD owns the observability stack in each cluster:
trinity-observability-aws
trinity-observability-gcp
trinity-observability-azure
Each application points at the matching cloud overlay under platform/observability.
In this branch the observability application syncs in wave 1, after the Argo CD project and base application wiring are already in place:
metadata:
  name: trinity-observability-aws
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  source:
    repoURL: https://github.com/maxgherman/trinity.git
    targetRevision: main
    path: platform/observability/overlays/aws
  destination:
    server: https://kubernetes.default.svc
    namespace: observability
The repository now has a shared observability base and one overlay per cloud:
platform/observability/
  base/
    namespace.yaml
    prometheus-deployment.yaml
    prometheus-service.yaml
    grafana-config.yaml
    grafana-dashboard.yaml
    grafana-deployment.yaml
    grafana-service.yaml
  overlays/
    aws/
    gcp/
    azure/
The base contains the common namespace, deployments, services, Prometheus configuration, Grafana provisioning, and dashboard. The overlays only change the cloud label in the Prometheus configuration. That is the same pattern as the application layer: one shared shape, small provider-specific overlays, and no manual cluster edits.
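As a sketch of what such an overlay could contain, the Kustomize file below pulls in the shared base and applies a single patch. The patch file name is hypothetical, and the sketch assumes the base defines the prometheus-config ConfigMap that the overlay then overrides; the repository may organize this differently.

```yaml
# Hypothetical platform/observability/overlays/aws/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                           # shared namespace, deployments, services, dashboard
patches:
  - path: prometheus-config-patch.yaml   # hypothetical file; sets cloud: aws in prometheus.yml
```

Under the same assumptions, the gcp and azure overlays would be identical apart from the label value in the patch.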
Prometheus
Prometheus is the metrics collector. It periodically calls configured HTTP endpoints, reads numeric time-series samples, stores them locally, and lets you query them with PromQL. In this checkpoint it has one job: scrape the Mandelbrot service and keep enough recent data to inspect the application while testing the platform.
The Prometheus configuration is intentionally static.
It scrapes mandelbrot.mandelbrot.svc.cluster.local:80 in the same cluster and attaches a cloud label through the overlay.
That avoids cluster-wide Kubernetes discovery and RBAC while still proving the operating motion: metrics are reconciled from Git, collected inside each cluster, and shown through a pre-provisioned Grafana dashboard.
The deployment is deliberately small:
- one Prometheus pod with a six-hour local retention window
- a ClusterIP service
- a ConfigMap mounted as prometheus.yml
- an emptyDir data volume
That keeps the first metrics slice cheap and easy to inspect.
The tradeoff is clear: Prometheus stores data on emptyDir, so local history is lost whenever the pod is deleted or rescheduled.
For this checkpoint that is acceptable.
A later version can add persistent volumes or remote write once the platform needs durable metrics.
The relevant part of the deployment is small:
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.retention.time=6h
volumeMounts:
  - name: prometheus-config
    mountPath: /etc/prometheus/prometheus.yml
    subPath: prometheus.yml
    readOnly: true
  - name: prometheus-data
    mountPath: /prometheus
volumes:
  - name: prometheus-config
    configMap:
      name: prometheus-config
  - name: prometheus-data
    emptyDir: {}
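If durable history is needed later, the persistent-volume option mentioned above could look roughly like this. It is a minimal sketch, assuming a claim named prometheus-data (hypothetical) and a placeholder size: a PersistentVolumeClaim is added, and the existing data volume points at it instead of emptyDir.

```yaml
# Hypothetical claim; the name and size are placeholders, not values from the repository.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: observability
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
# Deployment fragment (not a complete manifest): the data volume references the claim.
volumes:
  - name: prometheus-data
    persistentVolumeClaim:
      claimName: prometheus-data
```

The six-hour retention flag would probably be raised at the same time, since a durable disk adds little with such a short window.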
The Prometheus overlays differ only in the value of the cloud external label:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cloud: aws
    platform: trinity
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
  - job_name: mandelbrot
    metrics_path: /metrics
    static_configs:
      - targets:
          - mandelbrot.mandelbrot.svc.cluster.local:80
        labels:
          service: mandelbrot
The metric names are application-level signals:
mandelbrot_render_requests_total
mandelbrot_stage_renders_total
mandelbrot_stage_render_seconds_count
mandelbrot_stage_render_seconds_sum
That is enough to answer the first operational questions: is the app receiving render requests, are stage renders happening, and how long are they taking?
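Those questions map onto a few PromQL expressions. The sketch below writes them as a Prometheus rule group purely to keep them in YAML. Recording rules are not part of this checkpoint's configuration, the group and rule names are hypothetical, and the same expressions can be pasted directly into the Prometheus query page. The status label is assumed from the dashboard panels listed in the Grafana section.

```yaml
groups:
  - name: mandelbrot-sketch            # hypothetical group name
    rules:
      # Render requests per second over the last 5 minutes, split by status.
      - record: status:mandelbrot_render_requests:rate5m
        expr: sum by (status) (rate(mandelbrot_render_requests_total[5m]))
      # Stage renders per second over the last 5 minutes, split by status.
      - record: status:mandelbrot_stage_renders:rate5m
        expr: sum by (status) (rate(mandelbrot_stage_renders_total[5m]))
      # Average stage duration in seconds: time spent divided by renders observed.
      - record: mandelbrot:stage_render_seconds:avg5m
        expr: >-
          sum(rate(mandelbrot_stage_render_seconds_sum[5m]))
          /
          sum(rate(mandelbrot_stage_render_seconds_count[5m]))
```

Dividing the rate of the _sum series by the rate of the _count series gives the mean duration over the window rather than since process start, which is usually the more useful number while testing.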
Grafana
Grafana is also Git-provisioned. The datasource points at the in-cluster Prometheus service:
http://prometheus.observability.svc.cluster.local:9090
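A Grafana datasource provisioning file for that URL could look like the sketch below. It follows Grafana's datasource provisioning format; the datasource name and the isDefault flag are assumptions, not values taken from grafana-config.yaml.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus          # assumed display name
    type: prometheus
    access: proxy             # Grafana's backend issues the queries, not the browser
    url: http://prometheus.observability.svc.cluster.local:9090
    isDefault: true           # assumption
```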
The dashboard tracks the first useful Mandelbrot signals:
- render request rate by status
- stage render rate by cloud and status
- average stage duration
- total stage samples
Grafana runs with anonymous viewer access for this checkpoint. That is not a production access model. It is a local inspection surface for the exercise, and it keeps the first metrics phase focused on collection and dashboard provisioning instead of authentication.
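One common way to get anonymous viewer access is through Grafana's environment variable overrides for the auth.anonymous settings. The fragment below is a sketch of that approach for the Grafana container; the repository may set the same options in a grafana.ini file instead.

```yaml
# Container fragment for the Grafana deployment (not a complete manifest).
env:
  - name: GF_AUTH_ANONYMOUS_ENABLED
    value: "true"             # allow access without login
  - name: GF_AUTH_ANONYMOUS_ORG_ROLE
    value: Viewer             # anonymous users can view but not edit
```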
CI check
After the manifests were added, CI had to understand the new platform slice.
The manifest checker now allows ConfigMap, and the workflows render the observability overlays alongside the application overlays:
kubectl kustomize platform/observability/overlays/aws
kubectl kustomize platform/observability/overlays/gcp
kubectl kustomize platform/observability/overlays/azure
That catches the basic failure modes before Argo CD sees the change: invalid YAML, manifest kinds the exercise checker does not support, or a broken Kustomize overlay.
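In a GitHub Actions workflow, that render check could be expressed as a single step. This is a sketch that assumes GitHub Actions and a runner with kubectl available; the step name is hypothetical.

```yaml
- name: Render observability overlays
  run: |
    set -euo pipefail
    for cloud in aws gcp azure; do
      # A non-zero exit from any overlay fails the job.
      kubectl kustomize "platform/observability/overlays/${cloud}" > /dev/null
    done
```

Looping over the cloud names keeps the step in sync with the overlay directories as long as the three names stay the same.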
Validation
The live validation path is the same in every cluster. First check that Argo CD created the observability application and reconciled the namespace:
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n argocd get application trinity-observability-aws
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability get deployment,service,pods
Then generate a few renders through Front Door and inspect Prometheus:
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability port-forward service/prometheus 9090:9090
The useful Prometheus page is /targets.
It should show the mandelbrot scrape target as UP.
From the query page, these expressions should return data after a few renders:
mandelbrot_render_requests_total
mandelbrot_stage_renders_total
mandelbrot_stage_render_seconds_count
mandelbrot_stage_render_seconds_sum
Grafana is checked the same way:
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n observability port-forward service/grafana 3000:3000
Then open the provisioned dashboard:
http://localhost:3000/d/trinity-mandelbrot/trinity-mandelbrot
The same verification passed in AWS, GCP, and Azure. Prometheus comes up, scrapes the local Mandelbrot service, and Grafana loads the checked-in dashboard without manual setup.
Exit
This version of the observability platform does not yet centralize metrics across clouds, collect logs, or emit traces. It is the first checkpoint: the application exposes useful signals, each cluster can collect them, and the dashboard definition is versioned with the rest of the platform. The next checkpoint takes that step toward central metrics. Writing to a shared backend requires provider credentials, and that means the platform needs a proper secrets path before the observability stack can publish beyond the local cluster.