Secrets

Secret flow

Secrets checkpoint

The local observability checkpoint proved that each cluster could collect its own metrics. The next step is a shared backend, and that needs credentials. Grafana Cloud remote_write and any real SaaS dependency create the same question: where does the secret live, and how does it reach Kubernetes without being committed to Git?

So the next checkpoint is secrets. A platform needs a repeatable path from a real secret backend into each cluster. Secret values cannot be stored in Git, and creating them by hand with kubectl create secret is not acceptable either, because it is not repeatable.

External Secrets Operator is the right next move. It keeps the Kubernetes desired state in Git while keeping secret values in a real secret backend. Git declares that a Kubernetes Secret should exist. The operator reads the value from a provider such as AWS Secrets Manager, Google Secret Manager, or Azure Key Vault and materializes the Kubernetes Secret inside the cluster.

That gives the platform the split it needs:

Git:
  secret shape, target name, provider reference

Cloud secret backend:
  secret value

External Secrets Operator:
  reconciliation between the two

Operator first

The first implementation step installs the operator in each cluster through Argo CD:

trinity-secrets-aws
trinity-secrets-gcp
trinity-secrets-azure

The AWS application is typical:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: trinity-secrets-aws
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  project: trinity
  source:
    repoURL: https://charts.external-secrets.io
    chart: external-secrets
    targetRevision: 2.4.1
    helm:
      releaseName: external-secrets
      values: |
        installCRDs: true
        serviceAccount:
          create: false
          name: external-secrets
  destination:
    server: https://kubernetes.default.svc
    namespace: external-secrets

Those applications use the upstream External Secrets Helm chart, pinned to a specific chart version, and deploy into the external-secrets namespace. Because External Secrets Operator is installed from an external Helm chart and creates cluster-wide Kubernetes resources, the Argo CD project had to allow both the chart repository and those cluster-scoped resource types.

The sync order also changes slightly:

wave -1: Argo CD project
wave  0: applications
wave  1: secrets operator
wave  2: secrets demo and observability

The operator should be present before any platform component starts asking for synced secrets. That matters for the observability application in particular, because it will mount Grafana Cloud credentials that External Secrets syncs into the cluster.

The live check is intentionally small:

KUBECONFIG=./kubeconfig.aws.yaml kubectl -n argocd get application trinity-secrets-aws
KUBECONFIG=./kubeconfig.aws.yaml kubectl -n external-secrets get deployment,pods
KUBECONFIG=./kubeconfig.aws.yaml kubectl get crd | grep external-secrets

After syncing the clusters, the AWS root application showed the new secrets app alongside the existing workloads:

NAME                        SYNC STATUS   HEALTH STATUS
trinity-dev-aws-root        Synced        Healthy
trinity-hello-aws           Synced        Healthy
trinity-mandelbrot-aws      Synced        Healthy
trinity-observability-aws   Synced        Healthy
trinity-secrets-aws         Synced        Healthy

The same operator check passed in the other clusters.
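The three checks above can be wrapped in one function and repeated per cluster. This is a minimal sketch, not the repository's script: check_operator is a hypothetical helper name, the kubeconfig file naming follows the commands above, and the deployment name external-secrets and the CRD name externalsecrets.external-secrets.io are assumptions based on the release name and the upstream chart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the operator checks against one cluster.
# Assumes kubeconfig files are named ./kubeconfig.<cloud>.yaml as above.
check_operator() {
  local cloud="$1"
  local kubeconfig="./kubeconfig.${cloud}.yaml"

  # The Argo CD application for the operator must exist.
  KUBECONFIG="$kubeconfig" kubectl -n argocd \
    get application "trinity-secrets-${cloud}" || return 1

  # The operator deployment must finish rolling out (name is an assumption).
  KUBECONFIG="$kubeconfig" kubectl -n external-secrets \
    rollout status deployment/external-secrets --timeout=120s || return 1

  # The ExternalSecret CRD must be registered and established.
  KUBECONFIG="$kubeconfig" kubectl wait --for=condition=Established \
    crd/externalsecrets.external-secrets.io --timeout=60s || return 1
}

# Usage:
#   for cloud in aws gcp azure; do
#     check_operator "$cloud" || echo "operator check failed on ${cloud}" >&2
#   done
```

Returning on the first failing step keeps the output short and points at the layer that is actually broken: the Argo CD application, the deployment, or the CRDs.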

But installing the operator is only half the checkpoint. The important part is whether the operator can reach a real provider backend without static credentials in Git or Kubernetes.

Provider-backed proof

So the next slice adds one harmless test secret per cloud:

hello-from-aws
hello-from-gcp
hello-from-azure

Those are just proof values. The point is to prove the path:

cloud secret backend -> External Secrets Operator -> Kubernetes Secret

Pulumi owns the cloud-side identity and backend resources. Argo CD owns the Kubernetes declaration that asks for a secret to be synced.

The split looks like this:

  • AWS: Secrets Manager, an EKS OIDC provider, an IAM role for service accounts, and an external-secrets service account annotated with the role ARN
  • GCP: Google Secret Manager, GKE Workload Identity, a Google service account, an IAM binding from the Kubernetes service account to that Google service account, and Secret Manager accessor permissions
  • Azure: Key Vault, AKS OIDC and workload identity, a user-assigned managed identity, a federated identity credential, and an external-secrets service account annotated with the Azure client and tenant IDs

The secret values come from Pulumi secret config:

pulumi -C infra/pulumi/aws config set --secret trinity:secretsDemoValue hello-from-aws
pulumi -C infra/pulumi/gcp config set --secret trinity:secretsDemoValue hello-from-gcp
pulumi -C infra/pulumi/azure config set --secret trinity:secretsDemoValue hello-from-azure

That keeps the repository clean. Git contains the desired shape of the sync, not the value.

Cloud identity

The GCP leg exposed a useful platform detail. The deploy identity could create the cluster, but the secrets checkpoint needed more project-level permissions: service account administration, Secret Manager administration, and service usage administration. Secret Manager also needs the relevant project APIs enabled before the stack can create the backend resources. That is a cloud-project bootstrap boundary.

It is also the common pattern in this checkpoint: the Kubernetes manifest is small, but the cloud identity work behind it is provider-specific. Each cluster needs a native way for the external-secrets service account to read from its cloud secret backend without a long-lived credential:

AWS:
  Kubernetes service account
  -> projected service account token
  -> EKS OIDC provider
  -> IAM role
  -> AWS Secrets Manager

GCP:
  Kubernetes service account
  -> GKE Workload Identity
  -> Google service account
  -> Google Secret Manager

Azure:
  Kubernetes service account
  -> AKS OIDC issuer
  -> federated identity credential
  -> user-assigned managed identity
  -> Azure Key Vault

The important part is that each provider uses short-lived, identity-based access instead of static credentials copied into the cluster.
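One quick way to confirm the Kubernetes end of each chain is to check that the external-secrets service account carries the annotation its provider's workload identity integration reads. The sketch below is an illustration under assumptions: both function names are hypothetical, the annotation keys are the conventional upstream ones (the GCP key in particular is assumed, since the text above only describes the IAM binding), and the check is a plain text match on the YAML output.

```shell
#!/usr/bin/env bash
# Print the service account annotation key(s) that each provider's workload
# identity integration conventionally reads. Assumption: the repository
# follows these upstream conventions.
identity_annotation_keys() {
  case "$1" in
    aws)   echo "eks.amazonaws.com/role-arn" ;;
    gcp)   echo "iam.gke.io/gcp-service-account" ;;
    azure) echo "azure.workload.identity/client-id"
           echo "azure.workload.identity/tenant-id" ;;
    *)     echo "unknown cloud: $1" >&2; return 2 ;;
  esac
}

# Hypothetical check: fail if the external-secrets service account in the
# given cluster is missing any expected annotation key.
check_identity_annotations() {
  local cloud="$1" keys sa_yaml key
  keys="$(identity_annotation_keys "$cloud")" || return 2
  sa_yaml="$(KUBECONFIG="./kubeconfig.${cloud}.yaml" \
    kubectl -n external-secrets get serviceaccount external-secrets -o yaml)" \
    || return 1
  for key in $keys; do
    if ! printf '%s\n' "$sa_yaml" | grep -qF "${key}:"; then
      echo "missing annotation ${key} on ${cloud}" >&2
      return 1
    fi
  done
}
```

A passing check only shows that the annotation exists. It does not prove the cloud-side trust policy or role binding is correct, which is what the ExternalSecret status ultimately confirms.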

GitOps shape

The GitOps side lives under platform/secrets-demo:

platform/secrets-demo/
  base/
    namespace.yaml
  overlays/
    aws/
    gcp/
    azure/

Each overlay adds a provider-specific ClusterSecretStore and one ExternalSecret. The AWS store, for example, uses the service account token to assume the IAM role:

apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets

The ExternalSecret is deliberately small:

apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: provider-test-secret
  namespace: secrets-demo
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: provider-test-secret
    creationPolicy: Owner
    deletionPolicy: Retain
  data:
    - secretKey: message
      remoteRef:
        key: trinity-dev-aws-secrets-demo
        conversionStrategy: Default
        decodingStrategy: None
        metadataPolicy: None
        nullBytePolicy: Ignore

The default-looking fields at the bottom are intentional. External Secrets defaults them into the live object:

conversionStrategy: Default
decodingStrategy: None
metadataPolicy: None
nullBytePolicy: Ignore

Without pinning those fields in Git, Argo CD keeps seeing the ExternalSecret as OutOfSync even though the secret has synced. That is the kind of small reconciler mismatch that is easy to dismiss, but it matters in a GitOps platform. Healthy but permanently out-of-sync resources train operators to ignore drift.
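A small lint can keep those four fields from being dropped in a later edit. This is a sketch under assumptions, not part of the repository: check_pinned_defaults is a hypothetical helper, and it does a plain text match rather than parsing YAML, so it only suits manifests with a single remoteRef like the one above.

```shell
#!/usr/bin/env bash
# Hypothetical lint: verify that an ExternalSecret manifest pins the
# remoteRef fields External Secrets would otherwise default into the
# live object. Text match only; it does not parse YAML.
check_pinned_defaults() {
  local manifest="$1" field missing=0
  for field in \
    "conversionStrategy: Default" \
    "decodingStrategy: None" \
    "metadataPolicy: None" \
    "nullBytePolicy: Ignore"; do
    if ! grep -qF "$field" "$manifest"; then
      echo "${manifest}: missing '${field}'" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_pinned_defaults path/to/externalsecret.yaml
```

The function reports every missing field before returning, so one run lists everything that would show up as drift in Argo CD.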

The root Argo CD application now creates one backend checkpoint application per cloud:

trinity-secrets-demo-aws
trinity-secrets-demo-gcp
trinity-secrets-demo-azure

Validation

After the stacks and Argo CD applications reconciled, each cluster reported the same state:

for cloud in aws gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n secrets-demo get externalsecret provider-test-secret
done
NAME                   STORETYPE            STORE                 REFRESH INTERVAL   STATUS         READY
provider-test-secret   ClusterSecretStore   aws-secrets-manager   1h                 SecretSynced   True
provider-test-secret   ClusterSecretStore   gcp-secret-manager    1h                 SecretSynced   True
provider-test-secret   ClusterSecretStore   azure-key-vault       1h                 SecretSynced   True

And the synced Kubernetes secrets contained the expected harmless values:

for cloud in aws gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n secrets-demo get secret provider-test-secret \
    -o jsonpath='{.data.message}' | base64 -d
  echo
done
hello-from-aws
hello-from-gcp
hello-from-azure

That completes the secrets backend checkpoint. The platform now has a working provider-backed path from AWS Secrets Manager, Google Secret Manager, and Azure Key Vault into Kubernetes without committing secret values to Git.

The final validation pass still showed the provider-backed test secret synced in every cluster:

cloud  store                external secret  kubernetes secret
aws    aws-secrets-manager  SecretSynced     present
gcp    gcp-secret-manager   SecretSynced     present
azure  azure-key-vault      SecretSynced     present
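The manual comparison can be turned into a pass/fail check. The sketch below assumes the proof values and kubeconfig naming shown above; verify_demo_secret is a hypothetical helper, and comparing decoded values like this is only reasonable because these test secrets are harmless.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the synced Kubernetes Secret against the
# expected proof value for one cloud. Only suitable for harmless test values.
verify_demo_secret() {
  local cloud="$1" expected="hello-from-$1" encoded actual
  encoded="$(KUBECONFIG="./kubeconfig.${cloud}.yaml" \
    kubectl -n secrets-demo get secret provider-test-secret \
    -o jsonpath='{.data.message}')" || return 1
  actual="$(printf '%s' "$encoded" | base64 -d)" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "${cloud}: secret does not match the expected proof value" >&2
    return 1
  fi
  echo "${cloud}: ok"
}

# Usage:
#   for cloud in aws gcp azure; do verify_demo_secret "$cloud"; done
```

The failure message deliberately omits the decoded value, which is a habit worth keeping once the same pattern is pointed at real credentials.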

Central metrics credentials

The secrets path is useful on its own, but the real payoff is using it for platform credentials. The observability checkpoint left each cluster with its own local Prometheus and Grafana. That proved collection and dashboard provisioning, but it did not answer the cross-cluster question. To see the platform as one system, the three Prometheus instances need to send metrics to a shared backend.

For this exercise I used Grafana Cloud as that backend. Each cluster still runs its local Prometheus. The difference is that Prometheus now remote-writes the same metrics to Grafana Cloud with external labels attached:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cloud: aws
    cluster: trinity-dev-aws
    platform: trinity
remote_write:
  - url: __GRAFANA_CLOUD_REMOTE_WRITE_URL__
    basic_auth:
      username_file: /etc/grafana-cloud/username
      password_file: /etc/grafana-cloud/password

The labels are the important part for the shared view. Grafana Cloud receives samples from AWS, GCP, and Azure into one metrics backend, and the queries can group or filter by cloud, cluster, and platform.

The credentials are deliberately not stored in the Prometheus ConfigMap. Grafana Cloud gives three pieces of information for Prometheus remote write:

  • remote-write URL
  • Prometheus username or instance ID
  • access policy token with metrics:write

Those values are set as Pulumi secret config for each cloud stack:

for cloud in aws gcp azure; do
  pulumi -C infra/pulumi/${cloud} config set --secret trinity:grafanaCloudRemoteWriteUrl "https://<grafana-cloud-prometheus-host>/api/prom/push"
  pulumi -C infra/pulumi/${cloud} config set --secret trinity:grafanaCloudPrometheusUsername "<prometheus-user-id>"
  pulumi -C infra/pulumi/${cloud} config set --secret trinity:grafanaCloudPrometheusPassword "<grafana-cloud-access-policy-token>"
done

Pulumi then writes them into the cloud secret backend for that cluster:

trinity-dev-aws-grafana-cloud-remote-write-url
trinity-dev-aws-grafana-cloud-remote-write-username
trinity-dev-aws-grafana-cloud-remote-write-password

The same naming pattern exists for GCP and Azure. External Secrets syncs those backend values into the observability namespace as one Kubernetes Secret:

grafana-cloud-remote-write
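Because the backend names follow one pattern across clouds, they can be derived rather than listed by hand. This is a small sketch of that pattern. The function name is hypothetical, and the GCP and Azure names are inferred from the AWS names shown above.

```shell
#!/usr/bin/env bash
# Derive the three backend secret names for one cloud from the pattern
# trinity-dev-<cloud>-grafana-cloud-remote-write-<field>.
# Hypothetical helper; the non-AWS names are inferred, not copied.
grafana_cloud_backend_names() {
  local cloud="$1" field
  for field in url username password; do
    echo "trinity-dev-${cloud}-grafana-cloud-remote-write-${field}"
  done
}

# Usage: grafana_cloud_backend_names gcp
```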

Prometheus reads the username and password from mounted files. The URL needs a small extra step because Prometheus supports username_file and password_file, but not a matching url_file. The deployment uses a tiny init container to render the final prometheus.yml from a checked-in template and the synced URL:

remote_write_url="$(cat /etc/grafana-cloud/url)"
sed "s#__GRAFANA_CLOUD_REMOTE_WRITE_URL__#${remote_write_url}#g" \
  /etc/prometheus-template/prometheus.yml.template \
  > /etc/prometheus-generated/prometheus.yml

That keeps the full remote-write configuration out of Git while still letting Argo CD own the workload shape.
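One caveat with the sed approach is that the replacement text is interpreted by sed: an ampersand or a # in the URL would corrupt the output. A typical push URL contains neither, but a more defensive variant is cheap. The sketch below is an assumption about how the init script could be hardened, not the script in the repository, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Escape characters that are special in a sed replacement when '#' is the
# delimiter: backslash, ampersand, and the delimiter itself.
escape_sed_replacement() {
  printf '%s' "$1" | sed -e 's/[\\&#]/\\&/g'
}

# Render the Prometheus config from a template and a file holding the URL.
# Fails instead of writing a config with an empty remote-write URL.
render_prometheus_config() {
  local url_file="$1" template="$2" output="$3" url escaped
  url="$(cat "$url_file")" || return 1
  if [ -z "$url" ]; then
    echo "remote-write URL is empty" >&2
    return 1
  fi
  escaped="$(escape_sed_replacement "$url")"
  sed "s#__GRAFANA_CLOUD_REMOTE_WRITE_URL__#${escaped}#g" \
    "$template" > "$output"
}

# Usage (paths as in the init container above):
#   render_prometheus_config /etc/grafana-cloud/url \
#     /etc/prometheus-template/prometheus.yml.template \
#     /etc/prometheus-generated/prometheus.yml
```

Failing on an empty URL makes the init container exit non-zero, so a missing secret value shows up as a failed init step rather than as a Prometheus instance started with a broken remote_write entry.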

The validation path has three layers. First, check that External Secrets has materialized the Grafana Cloud secret:

for cloud in aws gcp azure; do
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n observability get externalsecret grafana-cloud-remote-write
  KUBECONFIG=./kubeconfig.${cloud}.yaml \
    kubectl -n observability get secret grafana-cloud-remote-write
done

Then check that Prometheus has started with the rendered configuration:

KUBECONFIG=./kubeconfig.aws.yaml \
  kubectl -n observability logs deployment/prometheus -c render-prometheus-config

KUBECONFIG=./kubeconfig.aws.yaml \
  kubectl -n observability logs deployment/prometheus -c prometheus --tail=80

Finally, generate Mandelbrot renders and query Grafana Cloud. These are the first useful cross-cluster expressions:

mandelbrot_render_requests_total{platform="trinity"}
sum by (cloud, cluster) (
  mandelbrot_render_requests_total{platform="trinity"}
)
sum by (cloud, cluster) (
  rate(mandelbrot_stage_renders_total{platform="trinity"}[5m])
)

The combined Grafana Cloud view now shows metrics from all three clusters in one place. The local Grafana instances are still useful for cluster-local inspection, but Grafana Cloud is the platform view.

The final reconciliation pass found two practical issues.

First, External Secrets can fail for different reasons that look similar from the app list. A missing ClusterSecretStore turned out to be a sync-ordering symptom: the ExternalSecret had been applied while the operator and store were still coming up. A "Secret does not exist" error after the store was ready meant the cloud backend value was actually missing. Checking the ExternalSecret events and the cloud secret versions was the fastest way to separate those cases.
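The Kubernetes half of that triage can be collected in one place. The following is a sketch with a hypothetical helper name. It prints the Ready condition, the events for the object, and the cluster secret stores, and leaves the interpretation to the reader.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper for one ExternalSecret. It prints the Ready
# condition, recent events, and the cluster secret stores, so a
# store-ordering problem can be told apart from a missing backend value.
triage_external_secret() {
  local cloud="$1" namespace="$2" name="$3"
  local kubeconfig="./kubeconfig.${cloud}.yaml"

  echo "== ${cloud}: ${namespace}/${name} Ready condition"
  KUBECONFIG="$kubeconfig" kubectl -n "$namespace" get externalsecret "$name" \
    -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{" "}{.reason}{": "}{.message}{"\n"}{end}'

  echo "== ${cloud}: events for ${name}"
  KUBECONFIG="$kubeconfig" kubectl -n "$namespace" get events \
    --field-selector "involvedObject.name=${name}" --sort-by=.lastTimestamp

  echo "== ${cloud}: cluster secret stores"
  KUBECONFIG="$kubeconfig" kubectl get clustersecretstore
}

# Usage: triage_external_secret gcp secrets-demo provider-test-secret
```

If the store is listed and ready but the condition still reports a missing secret, the next stop is the cloud side, for example listing the secret's versions with the provider CLI.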

Second, the original GCP cluster was too small for the full platform. With Argo CD, External Secrets, observability, and Mandelbrot all running, Prometheus stayed Pending with "Insufficient cpu" on the single GKE node. The follow-up infrastructure fix is to raise the GCP node pool to two nodes. That change is later than the first secrets branch snapshot, but it is the practical fix for the run described here. After that, AWS, GCP, and Azure all settled with the same shape: the secret apps healthy, the synced Kubernetes secrets present, and every observability pod running.

This closes the centralized metrics part of the observability goal. Before adding more observability signals, the platform needs one more boundary: admission control. The next checkpoint is about what the cluster should reject before a workload is allowed to run.

Lifecycle boundary

There was one important correction after the first end-to-end run. This landed later than the first secrets branch snapshot, but it belongs with the secrets checkpoint because it changes the operational boundary.

I initially treated secret backend creation as part of the normal infrastructure deploy. That works until a backend object is deleted. Azure Key Vault soft-delete keeps the vault name reserved, and AWS Secrets Manager keeps a deleted secret name in a scheduled-deletion state. Re-running the normal CI deployment then fails with "already exists in deleted state" or "scheduled for deletion" errors.

The fix was to make secret backend lifecycle explicit. Normal cluster deployment now grants access to existing backend names, but it does not purge or recreate tombstoned secrets as a side effect. A separate manually triggered GitHub Actions workflow manages the secret backends:

Secret Backends
  apply   -> create/update backend containers and secret versions
  delete  -> remove them only with an explicit confirmation input

That keeps routine deploys from doing destructive secret cleanup while still making the secret lifecycle repeatable from CI. It also made a real bug visible in the helper script: pulumi config get --show-secrets is not a valid command for the Pulumi CLI version I was using. The script now reads stack config with pulumi config --show-secrets --json and parses the values from JSON. Before that fix, GCP Secret Manager had secret containers but no enabled versions, so External Secrets reported SecretSyncedError even though the names existed.
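A helper along these lines shows the JSON-based read together with a guard against the failure described above, where a backend container is created without a usable value. It is a sketch, not the repository's script: both function names are hypothetical, it requires jq, and it assumes the JSON output is an object keyed by the full config key with the plaintext under a "value" field.

```shell
#!/usr/bin/env bash
# Read one value from JSON in the shape printed by
# `pulumi config --show-secrets --json`, supplied on stdin.
# Assumption: entries look like {"<ns>:<key>": {"value": "...", ...}}.
config_value_from_json() {
  local key="$1"
  jq -r --arg k "$key" '.[$k].value // empty'
}

# Hypothetical wrapper: print the value, or fail loudly when it is missing
# or empty, so a backend secret is never created without a real version.
require_stack_secret() {
  local stack_dir="$1" key="$2" value
  value="$(pulumi -C "$stack_dir" config --show-secrets --json \
    | config_value_from_json "$key")" || return 1
  if [ -z "$value" ]; then
    echo "config ${key} is missing or empty in ${stack_dir}" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Usage: require_stack_secret infra/pulumi/gcp trinity:secretsDemoValue
```

Because the wrapper prints a decrypted secret to stdout, its output should be captured into a variable or piped straight to the backend CLI, never echoed into CI logs.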

That lesson is bigger than this demo secret. Secrets have lifecycle semantics that are different from ordinary infrastructure. A deleted secret may still have recovery behavior, reserved names, disabled versions, or purge windows. The normal deploy path should grant access to the current backend objects and sync the Kubernetes shape. Creating, deleting, and purging backend secrets deserves a separate, explicit operational path.
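The explicit-confirmation idea can be illustrated with a small guard function. Everything here is hypothetical: the function name, the confirmation format, and the echo lines, which stand in for the real create and delete steps. It is meant as a sketch of the shape, not the workflow in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical entry point for a manually triggered secret-backend job.
# "apply" is always allowed. "delete" requires the caller to pass
# "delete-<stack>" as an explicit confirmation, so a routine deploy
# cannot trigger destructive cleanup by accident.
secret_backends() {
  local action="$1" stack="$2" confirm="${3:-}"
  case "$action" in
    apply)
      # Placeholder for creating or updating containers and versions.
      echo "apply secret backends for ${stack}"
      ;;
    delete)
      if [ "$confirm" != "delete-${stack}" ]; then
        echo "refusing to delete: confirmation must be 'delete-${stack}'" >&2
        return 1
      fi
      # Placeholder for the destructive path.
      echo "delete secret backends for ${stack}"
      ;;
    *)
      echo "usage: secret_backends apply|delete <stack> [confirmation]" >&2
      return 2
      ;;
  esac
}

# Usage:
#   secret_backends apply trinity-dev-aws
#   secret_backends delete trinity-dev-aws delete-trinity-dev-aws
```

Tying the confirmation string to the stack name means a confirmation typed for one stack cannot be replayed against another.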

Source code

Reference implementation