Rollouts
Progressive delivery checkpoint
By now we have GitOps delivery, global traffic, secrets, policies, metrics, logs, and traces. That gives enough visibility to decide whether a change is healthy. The missing piece is a controlled way to pause, inspect, promote, or abort a workload change.
Before this checkpoint, Mandelbrot was a normal Kubernetes Deployment. A bad change could still be fixed through Git: revert the commit and let Argo CD reconcile. That is a valid basic rollback path, but it does not show progressive delivery.
This checkpoint adds Argo Rollouts and changes Mandelbrot from a Deployment to a Rollout. The goal is deliberately small:
- install the Argo Rollouts controller in every cluster
- run Mandelbrot with two replicas
- pause a canary at 50 percent
- inspect the live system with the signals already built
- promote or abort from a controlled operator workflow
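No service mesh is involved. Without a traffic router, Argo Rollouts approximates the canary weight through the ratio of canary pods to stable pods behind the ordinary Kubernetes Service, so this is replica-based balancing rather than provider-specific traffic splitting. That is enough to prove the operating motion.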
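The arithmetic behind a replica-based canary can be sketched in a few lines. This is a minimal illustration, assuming the controller rounds the canary pod count up from the requested weight, which is how the Argo Rollouts documentation describes canaries without traffic routing. The function name canary_replicas is hypothetical, and the real controller also honors maxSurge and maxUnavailable, which this sketch ignores.

```shell
# Hypothetical helper: estimate how many canary pods a setWeight step asks for
# when the weight is realised through pod counts rather than a traffic router.
canary_replicas() {
  local replicas="$1" weight="$2"
  # Integer ceiling of replicas * weight / 100.
  echo $(( (replicas * weight + 99) / 100 ))
}
```

With two replicas and a weight of 50, `canary_replicas 2 50` yields one canary pod, which leaves one stable pod for comparison.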
GitOps ownership
Argo CD installs Argo Rollouts through one application per cluster:
trinity-rollouts-aws
trinity-rollouts-gcp
trinity-rollouts-azure
The AWS application is representative:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: trinity-rollouts-aws
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
spec:
  project: trinity
  source:
    repoURL: https://argoproj.github.io/argo-helm
    chart: argo-rollouts
    targetRevision: 2.40.9
    helm:
      releaseName: argo-rollouts
      values: |
        controller:
          replicas: 1
        dashboard:
          enabled: false
  destination:
    server: https://kubernetes.default.svc
    namespace: argo-rollouts
The real manifest also pins controller resources. The dashboard is disabled because this checkpoint uses the CLI workflow rather than exposing another UI. The Argo CD project had to allow the Argo Helm repository and the Rollouts analysis kinds:
sourceRepos:
  - https://github.com/maxgherman/trinity.git
  - https://argoproj.github.io/argo-helm
clusterResourceWhitelist:
  - group: argoproj.io
    kind: AnalysisTemplate
  - group: argoproj.io
    kind: ClusterAnalysisTemplate
The Mandelbrot application also gets one important sync option:
syncOptions:
  - CreateNamespace=true
  - ApplyOutOfSyncOnly=true
  - SkipDryRunOnMissingResource=true
SkipDryRunOnMissingResource=true avoids a dry-run failure when Argo CD sees the Rollout manifest before the Rollouts CRD is established. The workflow still syncs the root application and waits for the CRD before it asks the Mandelbrot application to sync.
Mandelbrot Rollout
The workload changes from Deployment to Rollout:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: mandelbrot
  namespace: mandelbrot
spec:
  replicas: 2
  strategy:
    canary:
      steps:
        - setWeight: 50
        - pause: {}
        - setWeight: 100
The canary is intentionally simple. With two replicas, the rollout can hold one stable pod and one canary pod while an operator checks the system. The checks are the tools the previous chapters already built:
- UI behavior through Front Door
- /api/meta
- Prometheus metrics
- Loki logs
- Jaeger or Grafana Cloud traces
- Argo CD and Rollout status
The pod template also gets a release marker:
env:
  - name: MANDELBROT_RELEASE
    value: continuous
The app exposes that value through /api/meta:
{
  "cloud": "aws",
  "region": "us-east-1",
  "release": "continuous",
  "pod": "mandelbrot-69d45cf95f-khkkh",
  "route": ["aws", "gcp", "azure"]
}
That detail matters because the app source is mounted from a ConfigMap. A ConfigMap-only change does not alter the pod template, so it does not create a new ReplicaSet. Changing MANDELBROT_RELEASE, or another pod-template field, gives the Rollouts controller a new ReplicaSet to roll out.
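During a paused canary it helps to see which release a given response came from. The following is a minimal sketch that pulls the release field out of an /api/meta response on standard input. The function name release_from_meta and the FRONT_DOOR_HOST variable in the usage comment are hypothetical, and the sed expression is a shortcut that assumes the flat JSON shape shown above rather than a general JSON parser.

```shell
# Hypothetical helper: print the "release" value from an /api/meta response.
# Reads JSON on stdin; works for both the pretty-printed and compact forms.
release_from_meta() {
  sed -n 's/.*"release"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (FRONT_DOOR_HOST is a placeholder, not a value from this platform):
#   curl -fsS "https://${FRONT_DOOR_HOST}/api/meta" | release_from_meta
```

Repeating the request a few times while the rollout is paused should show both the stable and the canary release, since the Service balances across both pods.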
Operator workflow
The operator path moves into GitHub Actions. The Mandelbrot Rollout workflow authenticates to AWS, GCP, and Azure with the same GitHub OIDC identities as the deployment workflow. It installs the kubectl argo rollouts plugin and supports these operations:
status
sync
promote
promote-full
abort
undo
restart
The workflow is manually triggered and gated by the infra-deploy-approval environment. That keeps rollout promotion in an explicit operator path instead of hiding it in an automatic post-merge job. The workflow configures the selected cluster context, syncs the root application when needed, and waits for the Rollouts CRD:
kubectl --context "${context}" -n argocd patch application "trinity-${TRINITY_ENVIRONMENT}-${cloud}-root" \
  --type merge \
  -p '{"operation":{"sync":{"syncStrategy":{"hook":{}}}}}'
kubectl --context "${context}" wait crd/rollouts.argoproj.io \
  --for=condition=Established \
  --timeout=30s
Then sync patches the Mandelbrot Argo CD application and waits until the Rollout is either paused or healthy:
kubectl --context "${context}" -n argocd patch application "trinity-mandelbrot-${cloud}" \
  --type merge \
  -p '{"operation":{"sync":{"syncStrategy":{"hook":{}}}}}'
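The wait that follows the patch is described but not shown. One way to sketch it is a polling loop over the Rollout's reported phase. This assumes the Rollout exposes its state in .status.phase and that a simple poll is acceptable. The function names, attempt count, and delay are hypothetical and are not taken from the actual workflow.

```shell
# Hypothetical helper: read the current phase of the mandelbrot Rollout.
rollout_phase() {
  kubectl --context "$1" -n mandelbrot get rollouts.argoproj.io mandelbrot \
    -o jsonpath='{.status.phase}'
}

# Poll until the Rollout is Paused or Healthy. Fails on Degraded or timeout.
wait_paused_or_healthy() {
  local context="$1" attempts="${2:-30}" delay="${3:-10}"
  local i=0 phase
  while [ "$i" -lt "$attempts" ]; do
    phase="$(rollout_phase "$context")" || phase=""
    case "$phase" in
      Paused|Healthy) echo "$phase"; return 0 ;;
      Degraded) echo "rollout is Degraded" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for rollout" >&2
  return 1
}
```

Treating Paused as success is what lets the sync operation return while the canary is still waiting for an operator decision.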
Promotion is explicit:
kubectl argo rollouts --context "${context}" promote mandelbrot \
  --namespace mandelbrot
There was one real workflow bug during the first pass. The kubectl plugin does not accept the context flag before the plugin name:
kubectl --context aws argo rollouts get rollout mandelbrot
That fails with:
flags cannot be placed before plugin name: --context
The fixed form passes --context after the plugin command:
kubectl argo rollouts --context aws get rollout mandelbrot \
  --namespace mandelbrot
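A small wrapper is one way to keep the flag order from regressing. This is a sketch rather than the workflow's actual code. The function name rollouts is hypothetical, and it assumes every call targets the mandelbrot namespace.

```shell
# Hypothetical wrapper: always place --context after the plugin name.
# Usage: rollouts <context> <plugin subcommand and arguments...>
rollouts() {
  local context="$1"
  shift
  kubectl argo rollouts --context "$context" "$@" --namespace mandelbrot
}
```

A call such as `rollouts aws get rollout mandelbrot` then expands to the fixed form shown above, so individual workflow steps never assemble the flag order by hand.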
Release drill
A normal release is still Git-first:
- Merge the application change.
- Run Mandelbrot Rollout with operation: sync and cloud: all.
- Inspect the paused 50 percent canary.
- Run operation: promote when the canary is good.
The first adoption of the Rollout may go straight to healthy. There is no previous stable ReplicaSet to canary against. The pause becomes visible on the next pod-template change.
That mattered in the first real test. A mounted ConfigMap change alone would not prove the canary machinery, so the branch changed both user-visible behavior and the pod template. The UI now renders continuously instead of drawing one sample, and MANDELBROT_RELEASE changed from stable to continuous.
The validated drill was:
- Merge the continuous-render change to main.
- Run operation: sync, cloud: all.
- Confirm each rollout pauses with two desired replicas, one stable pod and one canary pod.
- Open the app through Front Door and verify continuous rendering while the rollout is paused.
- Run operation: promote, cloud: all.
- Confirm the rollouts and Argo CD applications return to Healthy.
One later follow-up reduced the idle delay between continuous render cycles. That changed MANDELBROT_RELEASE to faster-cycles, and /api/meta through Front Door reported the new release from a live pod:
{"cloud":"aws","region":"us-east-1","release":"faster-cycles","pod":"mandelbrot-69d45cf95f-khkkh","route":["aws","gcp","azure"]}
After promotion, all three clusters reached a healthy rollout state:
cloud  status   step  weight  desired  ready  stable replica set
aws    Healthy  3/3   100     2        2      mandelbrot-69d45cf95f
gcp    Healthy  3/3   100     2        2      mandelbrot-bbb879c59
azure  Healthy  3/3   100     2        2      mandelbrot-7c45d9dfdb
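A post-promotion check across all three clusters could be scripted along these lines. The sketch assumes the kubeconfig contexts are named after the clouds, as in the --context aws example earlier, and that the Rollout reports its state in .status.phase. Both function names are hypothetical.

```shell
# Hypothetical helper: read the current phase of the mandelbrot Rollout.
rollout_phase() {
  kubectl --context "$1" -n mandelbrot get rollouts.argoproj.io mandelbrot \
    -o jsonpath='{.status.phase}'
}

# Print one line per cloud and fail if any Rollout is not Healthy.
all_healthy() {
  local cloud phase failed=0
  for cloud in "$@"; do
    phase="$(rollout_phase "$cloud")" || phase=""
    printf '%-6s %s\n' "$cloud" "${phase:-unknown}"
    [ "$phase" = "Healthy" ] || failed=1
  done
  return "$failed"
}

# Usage: all_healthy aws gcp azure
```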
Rollback
If the canary is bad, use the operator workflow with:
operation: abort
cloud: all
That stops the in-progress Rollout and leaves the stable ReplicaSet serving traffic.
For a durable GitOps rollback, revert the bad Git commit and run:
operation: sync
cloud: all
The workflow also exposes undo as an emergency escape hatch:
operation: undo
cloud: all
to_revision: <optional-rollout-revision>
That can move the live Rollout back to a previous revision, but it does not replace the need to fix Git. Argo CD self-heal will keep reconciling the declared revision from the repository. For the platform's normal operating model, Git remains the durable source of truth.
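How the abort and undo operations might map onto the plugin can be sketched as follows. This is an assumption about the shape of the workflow, not its actual code. The function name rollback_op is hypothetical, while abort, undo, and --to-revision are existing subcommands and flags of the kubectl argo rollouts plugin.

```shell
# Hypothetical helper: translate a rollback operation into a plugin call.
# Usage: rollback_op <context> <abort|undo> [to_revision]
rollback_op() {
  local context="$1" operation="$2" to_revision="${3:-}"
  case "$operation" in
    abort)
      kubectl argo rollouts --context "$context" abort mandelbrot \
        --namespace mandelbrot
      ;;
    undo)
      if [ -n "$to_revision" ]; then
        kubectl argo rollouts --context "$context" undo mandelbrot \
          --namespace mandelbrot --to-revision="$to_revision"
      else
        # Without a revision, undo targets the previous revision.
        kubectl argo rollouts --context "$context" undo mandelbrot \
          --namespace mandelbrot
      fi
      ;;
    *)
      echo "unsupported rollback operation: $operation" >&2
      return 2
      ;;
  esac
}
```

Neither branch touches Git, which is why both remain stopgaps until the bad commit is reverted and synced.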
CI check
CI had to understand the new manifest kinds. The manifest checker now allows the Argo Rollouts kinds:
AnalysisTemplate
ClusterAnalysisTemplate
Rollout
The Mandelbrot overlays still render through Kustomize:
kubectl kustomize apps/mandelbrot/overlays/aws
kubectl kustomize apps/mandelbrot/overlays/gcp
kubectl kustomize apps/mandelbrot/overlays/azure
That catches the basic shape before Argo CD or the rollout controller sees the change.
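A kind allowlist of this sort can be approximated with a few lines of shell. The sketch below is not the project's manifest checker. It assumes rendered manifests arrive on standard input and the allowed kinds are passed as arguments, and the function name check_kinds is hypothetical.

```shell
# Hypothetical checker: fail if any top-level kind is not in the allowlist.
# Usage: <rendered manifests> | check_kinds Kind1 Kind2 ...
check_kinds() {
  local allowed=" $* " kind status=0
  # Only unindented "kind:" lines are document-level kinds.
  for kind in $(sed -n 's/^kind:[[:space:]]*\([A-Za-z]*\).*$/\1/p'); do
    case "$allowed" in
      *" $kind "*) ;;
      *) echo "disallowed kind: $kind" >&2; status=1 ;;
    esac
  done
  return "$status"
}

# Usage (Service and ConfigMap stand in for the rest of the real allowlist):
#   kubectl kustomize apps/mandelbrot/overlays/aws |
#     check_kinds Rollout AnalysisTemplate ClusterAnalysisTemplate Service ConfigMap
```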
Exit
This closes the original delivery loop. The platform can now deploy application changes through Git, route traffic globally, observe behavior across clusters, enforce a small policy baseline, and promote or abort a canary.
The rollout setup is intentionally modest. It does not use mesh traffic splitting or automated metric analysis yet. Those would be natural next steps, but the platform now has the essential operator motion: change, pause, inspect, promote, abort, and recover through Git.