Async processing

SQS, DLQ

Publish at:

Day Four #

Not every system should finish its work while the caller waits. Some tasks are better accepted quickly, buffered safely, and processed in the background when compute is ready.

That is where queues become useful. They absorb bursts, decouple components, and give failures somewhere deliberate to go instead of disappearing into logs.

Goal #

For every pull request:

  • keep the static preview site
  • keep the API Gateway + Lambda path
  • add an SQS queue for async jobs
  • add a DLQ for messages that keep failing
  • expose a POST /jobs endpoint that enqueues work
  • process queue messages in a worker Lambda

What would the infrastructure look like for async processing?

What already exists #

At this point, you already have:

  • a per-PR preview stack deployed from bin/preview.ts
  • a private S3 bucket fronted by CloudFront
  • GitHub Actions assuming AWS access through OIDC
  • a preview page that can call an API via generated site/config.js
  • deterministic stack naming with Stack-PR<number>
  • teardown on PR close

Why a queue? #

An HTTP request is a poor place to do slow or failure-prone work.

If the caller has to wait for every image resize, document conversion, outbound webhook, or background calculation, the system becomes fragile quickly.

A queue changes the shape:

  • the API accepts the job fast
  • the queue holds it durably
  • the worker processes it when ready
  • repeated failures go to a dead-letter queue for inspection

That is a very common AWS pattern, and it is materially different from a synchronous API.

The shape of the app #

Developer pushes to PR
        |
        v
+----------------------+
| GitHub Actions       |
| deploy preview stack |
+----------------------+
        |
        v
+----------------------+
| CDK                  |
| Stack-PR<number>     |
+----------------------+
        |
        +-------------------------------+
        |                               |
        v                               v
+----------------------+      +----------------------+
| CloudFront + S3      |      | API Gateway          |
| static preview site  |      | POST /jobs           |
+----------------------+      +----------------------+
        |                               |
        |                               v
        |                      +----------------------+
        |                      | SubmitJob Lambda     |
        |                      | send to SQS          |
        |                      +----------------------+
        |                               |
        |                               v
        |                      +----------------------+
        |                      | SQS queue            |
        |                      +----------------------+
        |                               |
        |                               v
        |                      +----------------------+
        |                      | Worker Lambda        |
        |                      | async processing     |
        |                      +----------------------+
        |                               |
        |                               v
        |                      +----------------------+
        |                      | DLQ                  |
        |                      | after retries        |
        |                      +----------------------+
        |
     Browser

CDK #

We will evolve the same preview stack again. The new pieces are:

  • one dead-letter queue
  • one main queue with a redrive policy
  • one Lambda that accepts POST /jobs and sends messages to the queue
  • one worker Lambda subscribed to the queue
  • a few new stack outputs for verification and runtime config

lib/stack.ts

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as apigwv2 from 'aws-cdk-lib/aws-apigatewayv2';
import * as integrations from 'aws-cdk-lib/aws-apigatewayv2-integrations';
import * as eventsources from 'aws-cdk-lib/aws-lambda-event-sources';
import { Tags } from 'aws-cdk-lib';

export class Stack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const pr = String(this.node.tryGetContext('pr') ?? 'local');
    const repo = this.node.tryGetContext('repo');
    const sha = this.node.tryGetContext('sha');
    const run = this.node.tryGetContext('run');
    const acct = this.node.tryGetContext('acct');
    const reg = this.node.tryGetContext('reg');

    const bucket = new s3.Bucket(this, 'SiteBucket', {
      autoDeleteObjects: true,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      encryption: s3.BucketEncryption.S3_MANAGED,
      ...(acct && reg
        ? {
            bucketName: `pr-${String(pr)}-${String(acct)}-${String(reg)}`
              .toLowerCase()
              .replace(/[^a-z0-9.-]/g, '')
              .slice(0, 63)
              .replace(/[.-]+$/g, ''),
          }
        : {}),
    });

    const distribution = new cloudfront.Distribution(this, 'SiteDistribution', {
      defaultRootObject: 'index.html',
      defaultBehavior: {
        origin: origins.S3BucketOrigin.withOriginAccessControl(bucket),
        viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
        cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
      },
      errorResponses: [
        { httpStatus: 403, responseHttpStatus: 200, responsePagePath: '/index.html' },
        { httpStatus: 404, responseHttpStatus: 200, responsePagePath: '/index.html' },
      ],
    });

    new s3deploy.BucketDeployment(this, 'DeploySite', {
      destinationBucket: bucket,
      sources: [s3deploy.Source.asset('site')],
      distribution,
      distributionPaths: ['/*'],
    });

    const deadLetterQueue = new sqs.Queue(this, 'JobsDeadLetterQueue', {
      encryption: sqs.QueueEncryption.SQS_MANAGED,
      retentionPeriod: cdk.Duration.days(14),
    });

    const jobsQueue = new sqs.Queue(this, 'JobsQueue', {
      encryption: sqs.QueueEncryption.SQS_MANAGED,
      visibilityTimeout: cdk.Duration.seconds(60),
      deadLetterQueue: {
        queue: deadLetterQueue,
        maxReceiveCount: 3,
      },
    });

    const submitJob = new lambda.Function(this, 'SubmitJobFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/submit-job'),
      timeout: cdk.Duration.seconds(5),
      memorySize: 256,
      environment: {
        PR_NUMBER: pr,
        QUEUE_URL: jobsQueue.queueUrl,
      },
    });

    jobsQueue.grantSendMessages(submitJob);

    const worker = new lambda.Function(this, 'WorkerFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/worker'),
      timeout: cdk.Duration.seconds(30),
      memorySize: 256,
      environment: {
        PR_NUMBER: pr,
      },
    });

    worker.addEventSource(
      new eventsources.SqsEventSource(jobsQueue, {
        batchSize: 5,
        reportBatchItemFailures: true,
      }),
    );

    const hello = new lambda.Function(this, 'HelloFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/hello'),
      timeout: cdk.Duration.seconds(5),
      memorySize: 256,
      environment: {
        PR_NUMBER: pr,
      },
    });

    const api = new apigwv2.HttpApi(this, 'PreviewApi', {
      corsPreflight: {
        allowHeaders: ['content-type'],
        allowMethods: [apigwv2.CorsHttpMethod.GET, apigwv2.CorsHttpMethod.POST],
        allowOrigins: ['*'],
      },
    });

    api.addRoutes({
      path: '/hello',
      methods: [apigwv2.HttpMethod.GET],
      integration: new integrations.HttpLambdaIntegration('HelloIntegration', hello),
    });

    api.addRoutes({
      path: '/jobs',
      methods: [apigwv2.HttpMethod.POST],
      integration: new integrations.HttpLambdaIntegration('SubmitJobIntegration', submitJob),
    });

    Tags.of(this).add('managed-by', 'cdk');
    Tags.of(this).add('preview', 'true');
    Tags.of(this).add('pr', pr);
    if (repo) Tags.of(this).add('repo', String(repo));
    if (sha) Tags.of(this).add('sha', String(sha));
    if (run) Tags.of(this).add('run-id', String(run));
    Tags.of(bucket).add('resource', 'preview-bucket');
    Tags.of(hello).add('resource', 'preview-api-function');
    Tags.of(submitJob).add('resource', 'preview-job-submit-function');
    Tags.of(worker).add('resource', 'preview-worker-function');
    Tags.of(jobsQueue).add('resource', 'preview-jobs-queue');
    Tags.of(deadLetterQueue).add('resource', 'preview-jobs-dlq');

    new cdk.CfnOutput(this, 'BucketName', { value: bucket.bucketName });
    new cdk.CfnOutput(this, 'PreviewUrl', {
      value: `https://${distribution.domainName}/?pr=${encodeURIComponent(pr)}`,
    });
    new cdk.CfnOutput(this, 'DistributionId', { value: distribution.distributionId });
    new cdk.CfnOutput(this, 'ApiUrl', { value: `${api.url}hello` });
    new cdk.CfnOutput(this, 'JobUrl', { value: `${api.url}jobs` });
    new cdk.CfnOutput(this, 'JobsQueueName', { value: jobsQueue.queueName });
    new cdk.CfnOutput(this, 'JobsDeadLetterQueueName', { value: deadLetterQueue.queueName });
  }
}

There are two small but important details here:

  • maxReceiveCount: 3 means a message gets three delivery attempts before moving to the DLQ
  • visibilityTimeout: 60 gives the worker time to finish before the message becomes visible again

The producer Lambda #

The API only validates the request, shapes a message, and sends it to SQS.

lambda/submit-job/index.mjs
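The handler itself stays short. Here is a minimal sketch of what it might look like, assuming the only required field is a task name and that `QUEUE_URL` comes from the CDK environment above; the exact validation and message shape are illustrative, not prescriptive:

```javascript
import { randomUUID } from 'node:crypto';

export const handler = async (event) => {
  // Parse and validate the request body before touching SQS.
  let body;
  try {
    body = JSON.parse(event.body ?? '{}');
  } catch {
    return { statusCode: 400, body: JSON.stringify({ error: 'body must be valid JSON' }) };
  }
  if (typeof body.task !== 'string' || body.task.length === 0) {
    return { statusCode: 400, body: JSON.stringify({ error: 'task is required' }) };
  }

  // Shape the message: a stable id makes retries traceable in the worker logs.
  const job = {
    id: randomUUID(),
    task: body.task,
    fail: body.fail === true,
    submittedAt: new Date().toISOString(),
  };

  // Import the SDK lazily so the validation path stays cheap.
  const { SQSClient, SendMessageCommand } = await import('@aws-sdk/client-sqs');
  const sqs = new SQSClient({});
  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.QUEUE_URL,
    MessageBody: JSON.stringify(job),
  }));

  // 202 Accepted: the work is queued, not done.
  return { statusCode: 202, body: JSON.stringify({ queued: true, id: job.id }) };
};
```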

This is the whole point of the producer: return fast, queue work, move on.

The worker Lambda #

The worker is subscribed to the queue. Most messages should succeed. Some messages should fail.

For a tutorial, it helps if failure is easy to trigger on purpose, so the worker treats {"fail": true} as a deliberate failure and returns a partial batch failure for that message.

lambda/worker/index.mjs
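A sketch of what the worker might look like. The one contract that matters is the partial batch failure response: because the event source sets `reportBatchItemFailures`, returning a record's `messageId` in `batchItemFailures` makes SQS redeliver only that message while the rest of the batch is deleted. The `{"fail": true}` trigger is this tutorial's convention, not an SQS feature:

```javascript
export const handler = async (event) => {
  const batchItemFailures = [];

  for (const record of event.Records) {
    try {
      const job = JSON.parse(record.body);

      // Tutorial convention: {"fail": true} simulates a processing error.
      if (job.fail === true) {
        throw new Error(`job ${job.id ?? record.messageId} failed on purpose`);
      }

      // Real work would happen here; for the demo, a log line is enough.
      console.log('processed job', job.task ?? '(unnamed)', record.messageId);
    } catch (err) {
      console.error('job failed', record.messageId, err.message);
      // Report only this record as failed so SQS redelivers just this message.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
};
```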

That gives you a useful demo:

  • normal messages are processed successfully
  • failing messages are retried
  • after the retry limit, they land in the DLQ

The page #

The preview page reads a generated runtime config file and calls the deployed API directly from the browser.

This time the runtime value is JobUrl, not ApiUrl.

site/index.html

The generated site/config.js is intentionally tiny:

window.__PREVIEW_JOB_URL__ = 'https://example.execute-api.us-east-1.amazonaws.com/jobs';
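On the page side, the two buttons only need a small fetch helper around that value. A sketch, assuming the page has buttons and a result element with the ids shown here (`buildJobRequest` and the element ids are illustrative, not from the reference implementation):

```javascript
// Hypothetical helper: shape the POST request for the jobs endpoint.
function buildJobRequest(task, fail = false) {
  return {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(fail ? { task, fail: true } : { task }),
  };
}

// window.__PREVIEW_JOB_URL__ is provided by the generated config.js.
async function submitJob(task, fail) {
  const response = await fetch(window.__PREVIEW_JOB_URL__, buildJobRequest(task, fail));
  document.querySelector('#result').textContent = `HTTP ${response.status}: ${await response.text()}`;
}

// Wire the buttons when running in a browser.
if (typeof document !== 'undefined') {
  document.querySelector('#submit-job')?.addEventListener('click', () => submitJob('resize-image'));
  document.querySelector('#submit-failing-job')?.addEventListener('click', () => submitJob('broken-job', true));
}
```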

CI #

The page needs a runtime config file generated after deploy.

This workflow step reads the deployed outputs, writes site/config.js, uploads it to the preview bucket, and invalidates just that file in CloudFront:

- name: Generate config.js for the preview page
  run: |
    STACK="Stack-PR${PR_NUMBER}"
    JOB_URL=$(jq -r --arg stack "$STACK" '.[$stack].JobUrl' cdk-outputs.json)
    BUCKET_NAME=$(jq -r --arg stack "$STACK" '.[$stack].BucketName' cdk-outputs.json)
    DISTRIBUTION_ID=$(jq -r --arg stack "$STACK" '.[$stack].DistributionId' cdk-outputs.json)

    test -n "$JOB_URL" && test "$JOB_URL" != "null"
    test -n "$BUCKET_NAME" && test "$BUCKET_NAME" != "null"
    test -n "$DISTRIBUTION_ID" && test "$DISTRIBUTION_ID" != "null"

    cat > site/config.js <<EOF
    window.__PREVIEW_JOB_URL__ = '${JOB_URL}';
    EOF

    aws s3 cp site/config.js "s3://${BUCKET_NAME}/config.js" --content-type application/javascript
    aws cloudfront create-invalidation \
      --distribution-id "$DISTRIBUTION_ID" \
      --paths /config.js

The PR comment can now show all three useful URLs:

- name: Comment preview URLs
  uses: actions/github-script@v7
  env:
    PR_NUMBER: ${{ github.event.pull_request.number }}
  with:
    github-token: ${{ secrets.PR_COMMENT_TOKEN || github.token }}
    script: |
      const fs = require('fs');
      const marker = '<!-- pr-preview-urls -->';
      const stackName = `Stack-PR${process.env.PR_NUMBER}`;
      const outputs = JSON.parse(fs.readFileSync('cdk-outputs.json', 'utf8'));
      const previewUrl = outputs?.[stackName]?.PreviewUrl;
      const apiUrl = outputs?.[stackName]?.ApiUrl;
      const jobUrl = outputs?.[stackName]?.JobUrl;
      if (!previewUrl) throw new Error(`Missing PreviewUrl output for stack "${stackName}"`);
      if (!apiUrl) throw new Error(`Missing ApiUrl output for stack "${stackName}"`);
      if (!jobUrl) throw new Error(`Missing JobUrl output for stack "${stackName}"`);

      const body = `${marker}\nPreview: ${previewUrl}\nAPI: ${apiUrl}\nJobs: ${jobUrl}`;
      const { data: comments } = await github.rest.issues.listComments({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        per_page: 100,
      });

      const existing = comments.find((c) =>
        c.user?.type === 'Bot' && typeof c.body === 'string' && c.body.includes(marker),
      );

      if (existing) {
        await github.rest.issues.updateComment({
          owner: context.repo.owner,
          repo: context.repo.repo,
          comment_id: existing.id,
          body,
        });
      } else {
        await github.rest.issues.createComment({
          owner: context.repo.owner,
          repo: context.repo.repo,
          issue_number: context.issue.number,
          body,
        });
      }

Nothing about teardown changes. Closing the pull request still destroys the same stack, which now happens to contain a website, an API, a queue, and a DLQ.

What you should see #

  • opening a PR deploys a preview stack with a site, API, queue, and DLQ
  • the PR comment shows PreviewUrl, ApiUrl, and JobUrl
  • visiting the site still works over HTTPS through CloudFront
  • clicking Submit Job queues a successful async task
  • clicking Submit Failing Job queues a message that is retried and then sent to the DLQ
  • closing the PR removes the stack and all of those resources

Verify the stack #

After deploy, start with the stack outputs:

STACK=Stack-PR123
REGION=us-east-1

aws cloudformation describe-stacks \
  --stack-name "$STACK" \
  --region "$REGION" \
  --query 'Stacks[0].Outputs'

You want these outputs:

  • PreviewUrl
  • ApiUrl
  • JobUrl
  • JobsQueueName
  • JobsDeadLetterQueueName

Resolve the API and job endpoints:

API_URL=$(aws cloudformation describe-stacks \
  --stack-name "$STACK" \
  --region "$REGION" \
  --query "Stacks[0].Outputs[?OutputKey=='ApiUrl'].OutputValue" \
  --output text)

JOB_URL=$(aws cloudformation describe-stacks \
  --stack-name "$STACK" \
  --region "$REGION" \
  --query "Stacks[0].Outputs[?OutputKey=='JobUrl'].OutputValue" \
  --output text)

Verify the synchronous API still works:

curl "$API_URL"

That should hit /hello and return JSON.

Now test the async success path:

curl -X POST "$JOB_URL" \
  -H "content-type: application/json" \
  -d '{"task":"resize-image"}'

That should return 202 from the submit Lambda, and the worker should process the message asynchronously.

Check worker logs:

aws logs tail "/aws/lambda/${STACK}-WorkerFunction" \
  --region "$REGION" \
  --since 10m

If the exact log group name differs in your account, list matching groups first:

aws logs describe-log-groups \
  --region "$REGION" \
  --query "logGroups[?contains(logGroupName, 'WorkerFunction')].logGroupName" \
  --output text

Now test the DLQ path:

curl -X POST "$JOB_URL" \
  -H "content-type: application/json" \
  -d '{"task":"broken-job","fail":true}'

That job is designed to fail on purpose. After retries, it should land in the dead-letter queue.

Resolve the queue names:

JOBS_QUEUE=$(aws cloudformation describe-stacks \
  --stack-name "$STACK" \
  --region "$REGION" \
  --query "Stacks[0].Outputs[?OutputKey=='JobsQueueName'].OutputValue" \
  --output text)

DLQ=$(aws cloudformation describe-stacks \
  --stack-name "$STACK" \
  --region "$REGION" \
  --query "Stacks[0].Outputs[?OutputKey=='JobsDeadLetterQueueName'].OutputValue" \
  --output text)

Resolve queue URLs:

aws sqs get-queue-url --queue-name "$JOBS_QUEUE" --region "$REGION"
aws sqs get-queue-url --queue-name "$DLQ" --region "$REGION"

Then inspect DLQ depth:

DLQ_URL=$(aws sqs get-queue-url \
  --queue-name "$DLQ" \
  --region "$REGION" \
  --query 'QueueUrl' \
  --output text)

aws sqs get-queue-attributes \
  --queue-url "$DLQ_URL" \
  --region "$REGION" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

If everything is working:

  • GET /hello returns JSON
  • POST /jobs returns 202
  • successful jobs appear in worker logs
  • failing jobs eventually show up in the DLQ

Notes and hardening #

  • Keep CORS broad only for the tutorial. Once the pattern works, restrict origins to the preview site domain.
  • The queue visibilityTimeout should stay comfortably above the worker Lambda timeout.
  • The worker currently fails on purpose when fail: true. That is deliberate for a tutorial; real failure rules would come from actual business conditions.
  • A DLQ is a holding area, not a fix. In a real system, add alarms, dashboards, and a runbook for replay or remediation.
  • SQS decouples the public API from background work, but it does not guarantee that your worker code is idempotent. Design for retries.
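Idempotency is worth a concrete shape. Standard SQS queues deliver at-least-once, so a retried message can reach the worker twice; one common guard is to record processed job ids and skip duplicates. The in-memory `Set` below is an illustration only (it does not survive across Lambda invocations); a real worker would use a durable store such as DynamoDB with a conditional put on the job id:

```javascript
// Illustration only: in-memory state is per-process and is lost between
// Lambda invocations. Swap the Set for a durable, conditional write in
// a real system.
const processed = new Set();

function processOnce(job, doWork) {
  if (processed.has(job.id)) {
    // Duplicate delivery: acknowledge without repeating side effects.
    return { duplicate: true };
  }
  doWork(job);
  processed.add(job.id);
  return { duplicate: false };
}
```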

Well-Architected Framework #

  • Security: OIDC still removes long-lived AWS keys from CI, the preview site origin remains private behind CloudFront, and both queues use server-side encryption.
  • Operational Excellence: this chapter introduces a new infrastructure pattern without changing the overall deployment model; the same PR workflow now provisions async infrastructure repeatedly and predictably.
  • Reliability: the queue buffers work, the worker can retry failed messages, and the DLQ prevents poison messages from being retried forever.
  • Cost Optimization: SQS and Lambda are inexpensive at tutorial scale, and the preview stack still disappears automatically when the pull request closes.
  • Performance Efficiency: the HTTP layer stays fast because it only accepts work; the slower part happens asynchronously behind the queue.
  • Sustainability: short-lived preview stacks avoid idle resources, and the queue-backed pattern lets compute happen only when there is actual work to process.

Source code #

Reference implementation