Async processing
SQS, DLQ
Day Four #
Not every useful system should finish its work while the caller waits. Some tasks are better accepted quickly, buffered safely, and processed in the background when compute is ready.
That is where queues become useful. They absorb bursts, decouple components, and give failures somewhere deliberate to go instead of disappearing into logs.
Goal #
For every pull request:
- keep the static preview site
- keep the API Gateway + Lambda path
- add an `SQS` queue for async jobs
- add a `DLQ` for messages that keep failing
- expose a `POST /jobs` endpoint that enqueues work
- process queue messages in a worker Lambda
What would the infrastructure look like for async processing?
What already exists #
At this point, you already have:
- a per-PR preview stack deployed from `bin/preview.ts`
- a private `S3` bucket fronted by `CloudFront`
- `GitHub Actions` assuming AWS access through OIDC
- a preview page that can call an API via generated `site/config.js`
- deterministic stack naming with `Stack-PR<number>`
- teardown on PR close
Why a queue? #
An HTTP request is a poor place to do slow or failure-prone work.
If the caller has to wait for every image resize, document conversion, outbound webhook, or background calculation, the system becomes fragile quickly.
A queue changes the shape:
- the API accepts the job fast
- the queue holds it durably
- the worker processes it when ready
- repeated failures go to a dead-letter queue for inspection
That is a very common AWS pattern, and it is materially different from a synchronous API.
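To make the redrive behavior concrete before touching any infrastructure, here is a tiny model of it. This is not part of the stack, just an illustration of how repeated failed receives end in the DLQ:

```typescript
// Minimal model of SQS redrive: a message is delivered to the worker
// until its receive count exceeds maxReceiveCount, after which SQS
// moves it to the dead-letter queue instead of redelivering it.
type Destination = 'worker' | 'dlq';

function routeAttempt(receiveCount: number, maxReceiveCount: number): Destination {
  // receiveCount is 1-based: the first delivery is receive 1.
  return receiveCount <= maxReceiveCount ? 'worker' : 'dlq';
}

// With maxReceiveCount: 3 (as configured later in this chapter),
// attempts 1-3 reach the worker; attempt 4 lands in the DLQ.
const attempts = [1, 2, 3, 4].map((n) => routeAttempt(n, 3));
console.log(attempts); // ['worker', 'worker', 'worker', 'dlq']
```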
The shape of the app #
Developer pushes to PR
|
v
+----------------------+
| GitHub Actions |
| deploy preview stack |
+----------------------+
|
v
+----------------------+
| CDK |
| Stack-PR<number> |
+----------------------+
|
+-------------------------------+
| |
v v
+----------------------+ +----------------------+
| CloudFront + S3 | | API Gateway |
| static preview site | | POST /jobs |
+----------------------+ +----------------------+
| |
| v
| +----------------------+
| | SubmitJob Lambda |
| | send to SQS |
| +----------------------+
| |
| v
| +----------------------+
| | SQS queue |
| +----------------------+
| |
| v
| +----------------------+
| | Worker Lambda |
| | async processing |
| +----------------------+
| |
| v
| +----------------------+
| | DLQ |
| | after retries |
| +----------------------+
|
Browser
CDK #
We will evolve the same preview stack again. The new pieces are:
- one dead-letter queue
- one main queue with a redrive policy
- one Lambda that accepts `POST /jobs` and sends messages to the queue
- one worker Lambda subscribed to the queue
- a few new stack outputs for verification and runtime config
lib/stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as apigwv2 from 'aws-cdk-lib/aws-apigatewayv2';
import * as integrations from 'aws-cdk-lib/aws-apigatewayv2-integrations';
import * as eventsources from 'aws-cdk-lib/aws-lambda-event-sources';
import { Tags } from 'aws-cdk-lib';
export class Stack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
const pr = String(this.node.tryGetContext('pr') ?? 'local');
const repo = this.node.tryGetContext('repo');
const sha = this.node.tryGetContext('sha');
const run = this.node.tryGetContext('run');
const acct = this.node.tryGetContext('acct');
const reg = this.node.tryGetContext('reg');
const bucket = new s3.Bucket(this, 'SiteBucket', {
autoDeleteObjects: true,
removalPolicy: cdk.RemovalPolicy.DESTROY,
blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
encryption: s3.BucketEncryption.S3_MANAGED,
...(acct && reg
? {
bucketName: `pr-${String(pr)}-${String(acct)}-${String(reg)}`
.toLowerCase()
.replace(/[^a-z0-9.-]/g, '')
.slice(0, 63)
.replace(/[.-]+$/g, ''),
}
: {}),
});
const distribution = new cloudfront.Distribution(this, 'SiteDistribution', {
defaultRootObject: 'index.html',
defaultBehavior: {
origin: origins.S3BucketOrigin.withOriginAccessControl(bucket),
viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
},
errorResponses: [
{ httpStatus: 403, responseHttpStatus: 200, responsePagePath: '/index.html' },
{ httpStatus: 404, responseHttpStatus: 200, responsePagePath: '/index.html' },
],
});
new s3deploy.BucketDeployment(this, 'DeploySite', {
destinationBucket: bucket,
sources: [s3deploy.Source.asset('site')],
distribution,
distributionPaths: ['/*'],
});
const deadLetterQueue = new sqs.Queue(this, 'JobsDeadLetterQueue', {
encryption: sqs.QueueEncryption.SQS_MANAGED,
retentionPeriod: cdk.Duration.days(14),
});
const jobsQueue = new sqs.Queue(this, 'JobsQueue', {
encryption: sqs.QueueEncryption.SQS_MANAGED,
visibilityTimeout: cdk.Duration.seconds(60),
deadLetterQueue: {
queue: deadLetterQueue,
maxReceiveCount: 3,
},
});
const submitJob = new lambda.Function(this, 'SubmitJobFunction', {
runtime: lambda.Runtime.NODEJS_20_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda/submit-job'),
timeout: cdk.Duration.seconds(5),
memorySize: 256,
environment: {
PR_NUMBER: pr,
QUEUE_URL: jobsQueue.queueUrl,
},
});
jobsQueue.grantSendMessages(submitJob);
const worker = new lambda.Function(this, 'WorkerFunction', {
runtime: lambda.Runtime.NODEJS_20_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda/worker'),
timeout: cdk.Duration.seconds(30),
memorySize: 256,
environment: {
PR_NUMBER: pr,
},
});
worker.addEventSource(
new eventsources.SqsEventSource(jobsQueue, {
batchSize: 5,
reportBatchItemFailures: true,
}),
);
const hello = new lambda.Function(this, 'HelloFunction', {
runtime: lambda.Runtime.NODEJS_20_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda/hello'),
timeout: cdk.Duration.seconds(5),
memorySize: 256,
environment: {
PR_NUMBER: pr,
},
});
const api = new apigwv2.HttpApi(this, 'PreviewApi', {
corsPreflight: {
allowHeaders: ['content-type'],
allowMethods: [apigwv2.CorsHttpMethod.GET, apigwv2.CorsHttpMethod.POST],
allowOrigins: ['*'],
},
});
api.addRoutes({
path: '/hello',
methods: [apigwv2.HttpMethod.GET],
integration: new integrations.HttpLambdaIntegration('HelloIntegration', hello),
});
api.addRoutes({
path: '/jobs',
methods: [apigwv2.HttpMethod.POST],
integration: new integrations.HttpLambdaIntegration('SubmitJobIntegration', submitJob),
});
Tags.of(this).add('managed-by', 'cdk');
Tags.of(this).add('preview', 'true');
Tags.of(this).add('pr', pr);
if (repo) Tags.of(this).add('repo', String(repo));
if (sha) Tags.of(this).add('sha', String(sha));
if (run) Tags.of(this).add('run-id', String(run));
Tags.of(bucket).add('resource', 'preview-bucket');
Tags.of(hello).add('resource', 'preview-api-function');
Tags.of(submitJob).add('resource', 'preview-job-submit-function');
Tags.of(worker).add('resource', 'preview-worker-function');
Tags.of(jobsQueue).add('resource', 'preview-jobs-queue');
Tags.of(deadLetterQueue).add('resource', 'preview-jobs-dlq');
new cdk.CfnOutput(this, 'BucketName', { value: bucket.bucketName });
new cdk.CfnOutput(this, 'PreviewUrl', {
value: `https://${distribution.domainName}/?pr=${encodeURIComponent(pr)}`,
});
new cdk.CfnOutput(this, 'DistributionId', { value: distribution.distributionId });
new cdk.CfnOutput(this, 'ApiUrl', { value: `${api.url}hello` });
new cdk.CfnOutput(this, 'JobUrl', { value: `${api.url}jobs` });
new cdk.CfnOutput(this, 'JobsQueueName', { value: jobsQueue.queueName });
new cdk.CfnOutput(this, 'JobsDeadLetterQueueName', { value: deadLetterQueue.queueName });
}
}
There are two small but important details here:
- `maxReceiveCount: 3` means a message gets three delivery attempts before moving to the DLQ
- `visibilityTimeout: 60` gives the worker time to finish before the message becomes visible again
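The relationship between those two timeouts is worth checking explicitly. A visibility timeout shorter than the worker timeout risks the same in-flight message being delivered twice; AWS guidance for Lambda event sources suggests a visibility timeout of a multiple of the function timeout. A small illustrative check, not part of the stack:

```typescript
// Returns true when the queue's visibility timeout leaves the worker
// enough room: at least the function timeout, ideally a multiple of it.
function visibilityTimeoutIsSafe(
  visibilityTimeoutSeconds: number,
  workerTimeoutSeconds: number,
): boolean {
  return visibilityTimeoutSeconds >= workerTimeoutSeconds;
}

// The stack above uses a 60s visibility timeout with a 30s worker timeout.
console.log(visibilityTimeoutIsSafe(60, 30)); // true
console.log(visibilityTimeoutIsSafe(20, 30)); // false: risks double delivery
```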
The producer Lambda #
The API only validates the request, shapes a message, and sends it to SQS.
lambda/submit-job/index.mjs
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
const sqs = new SQSClient({});
const headers = {
'content-type': 'application/json',
'access-control-allow-origin': '*',
};
export const handler = async (event) => {
const queueUrl = process.env.QUEUE_URL;
const prNumber = process.env.PR_NUMBER ?? 'local';
if (!queueUrl) {
return {
statusCode: 500,
headers,
body: JSON.stringify({ ok: false, error: 'QUEUE_URL is not configured' }),
};
}
let payload;
try {
payload = event?.body ? JSON.parse(event.body) : {};
} catch {
return {
statusCode: 400,
headers,
body: JSON.stringify({ ok: false, error: 'Invalid JSON body' }),
};
}
const task = typeof payload?.task === 'string' && payload.task.trim()
? payload.task.trim()
: 'demo-job';
const shouldFail = payload?.fail === true;
const body = {
task,
fail: shouldFail,
prNumber,
requestedAt: new Date().toISOString(),
};
const response = await sqs.send(
new SendMessageCommand({
QueueUrl: queueUrl,
MessageBody: JSON.stringify(body),
}),
);
return {
statusCode: 202,
headers,
body: JSON.stringify({
ok: true,
queued: true,
messageId: response.MessageId,
task,
fail: shouldFail,
note: shouldFail
? 'This message will be retried and then moved to the DLQ.'
: 'The worker Lambda will process this message asynchronously.',
}),
};
};
This is the whole point of the producer: return fast, queue work, move on.
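The payload-shaping part of that handler is pure logic, and factoring it out makes it easy to unit-test without SQS. A sketch mirroring the producer's rules (the `shapeJob` name is mine, not from the handler):

```typescript
// Mirrors the producer: trim the task name, fall back to 'demo-job',
// and only treat a strict boolean true as the failure flag.
interface JobMessage {
  task: string;
  fail: boolean;
}

function shapeJob(payload: unknown): JobMessage {
  const p = (payload ?? {}) as { task?: unknown; fail?: unknown };
  const task =
    typeof p.task === 'string' && p.task.trim() ? p.task.trim() : 'demo-job';
  return { task, fail: p.fail === true };
}

console.log(shapeJob({ task: '  resize-image ' })); // { task: 'resize-image', fail: false }
console.log(shapeJob({ fail: true }));              // { task: 'demo-job', fail: true }
console.log(shapeJob({ fail: 'yes' }));             // { task: 'demo-job', fail: false }
```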
The worker Lambda #
The worker is subscribed to the queue. Most messages should succeed. Some messages should fail.
For a tutorial, it helps if failure is easy to trigger on purpose, so the worker treats {"fail": true} as a deliberate failure and returns a partial batch failure for that message.
lambda/worker/index.mjs
export const handler = async (event) => {
const failures = [];
for (const record of event.Records ?? []) {
let payload;
try {
payload = JSON.parse(record.body);
} catch (error) {
console.error('Failed to parse SQS message body', {
messageId: record.messageId,
body: record.body,
error,
});
failures.push({ itemIdentifier: record.messageId });
continue;
}
if (payload.fail) {
console.error('Worker rejected message on purpose to demonstrate the DLQ path', {
messageId: record.messageId,
payload,
});
failures.push({ itemIdentifier: record.messageId });
continue;
}
console.log('Processed async job', {
messageId: record.messageId,
payload,
processedAt: new Date().toISOString(),
});
}
return {
batchItemFailures: failures,
};
};
That gives you a useful demo:
- normal messages are processed successfully
- failing messages are retried
- after the retry limit, they land in the DLQ
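The partial-batch-failure contract is easy to reason about in isolation: Lambda retries only the messages listed in `batchItemFailures`. The worker's routing logic, restated as a pure function for clarity (the names here are illustrative):

```typescript
// Given SQS-like records, return the batchItemFailures entries that
// tell Lambda which messages to retry (reportBatchItemFailures: true).
interface SqsRecordLike {
  messageId: string;
  body: string;
}

function collectFailures(records: SqsRecordLike[]): { itemIdentifier: string }[] {
  const failures: { itemIdentifier: string }[] = [];
  for (const record of records) {
    try {
      const payload = JSON.parse(record.body);
      // Matches the worker: fail: true is a deliberate failure.
      if (payload.fail) failures.push({ itemIdentifier: record.messageId });
    } catch {
      // Unparseable bodies are also reported for retry.
      failures.push({ itemIdentifier: record.messageId });
    }
  }
  return failures;
}

const failureReport = collectFailures([
  { messageId: 'a', body: '{"task":"ok","fail":false}' },
  { messageId: 'b', body: '{"task":"bad","fail":true}' },
  { messageId: 'c', body: 'not json' },
]);
console.log(failureReport); // [{ itemIdentifier: 'b' }, { itemIdentifier: 'c' }]
```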
The page #
The preview page reads a generated runtime config file and calls the deployed API directly from the browser.
This time the runtime value is JobUrl, not ApiUrl.
site/index.html
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>PR Preview</title>
<style>
:root {
color-scheme: light dark;
}
body {
font-family: system-ui, -apple-system, Segoe UI, Roboto, sans-serif;
margin: 0;
padding: 3rem 1.5rem;
}
.card {
max-width: 52rem;
margin: 0 auto;
border: 1px solid rgba(127, 127, 127, 0.35);
border-radius: 14px;
padding: 1.25rem;
}
.row {
display: flex;
gap: 0.75rem;
flex-wrap: wrap;
margin-top: 1rem;
}
button {
border: 0;
border-radius: 999px;
padding: 0.65rem 1rem;
cursor: pointer;
}
pre {
white-space: pre-wrap;
overflow-wrap: anywhere;
padding: 0.9rem;
border-radius: 10px;
background: rgba(127, 127, 127, 0.12);
}
input {
width: 100%;
box-sizing: border-box;
padding: 0.7rem 0.85rem;
border: 1px solid rgba(127, 127, 127, 0.35);
border-radius: 10px;
background: transparent;
}
</style>
</head>
<body>
<div class="card">
<h1>PR Preview</h1>
<p>This version adds an async path with API Gateway, SQS, Lambda, and a DLQ.</p>
<label for="task-name">Task name</label>
<input id="task-name" name="task-name" type="text" value="resize-image" />
<div class="row">
<button id="submit-job" type="button">Submit Job</button>
<button id="submit-fail-job" type="button">Submit Failing Job</button>
</div>
<p id="status">Ready.</p>
<pre id="output" hidden></pre>
</div>
<script src="./config.js"></script>
<script>
window.__PREVIEW_JOB_URL__ = window.__PREVIEW_JOB_URL__ || '';
const status = document.getElementById('status');
const output = document.getElementById('output');
const taskName = document.getElementById('task-name');
const submitJobButton = document.getElementById('submit-job');
const submitFailJobButton = document.getElementById('submit-fail-job');
const jobUrl = window.__PREVIEW_JOB_URL__;
if (!jobUrl) {
submitJobButton.disabled = true;
submitFailJobButton.disabled = true;
status.textContent = 'Job URL is not configured.';
}
const submitJob = async ({ fail }) => {
status.textContent = fail ? 'Submitting failing job...' : 'Submitting job...';
output.hidden = true;
try {
const response = await fetch(jobUrl, {
method: 'POST',
headers: {
'content-type': 'application/json',
accept: 'application/json',
},
body: JSON.stringify({
task: taskName.value.trim() || 'demo-job',
fail,
}),
});
const data = await response.json();
if (!response.ok) {
throw new Error(data?.error || `HTTP ${response.status}`);
}
status.textContent = fail
? 'Failing job queued. Watch retries and the DLQ.'
: 'Job queued successfully.';
output.textContent = JSON.stringify(data, null, 2);
output.hidden = false;
} catch (error) {
status.textContent = `Request failed: ${error.message}`;
output.textContent = '';
output.hidden = true;
}
};
submitJobButton.addEventListener('click', () => submitJob({ fail: false }));
submitFailJobButton.addEventListener('click', () => submitJob({ fail: true }));
</script>
</body>
</html>
The generated site/config.js is intentionally tiny:
window.__PREVIEW_JOB_URL__ = 'https://example.execute-api.us-east-1.amazonaws.com/jobs';
CI #
The page needs a runtime config file generated after deploy.
This workflow step reads the deployed outputs, writes site/config.js, uploads it to the preview bucket, and invalidates just that file in CloudFront:
- name: Generate config.js for the preview page
run: |
STACK="Stack-PR${PR_NUMBER}"
JOB_URL=$(jq -r --arg stack "$STACK" '.[$stack].JobUrl' cdk-outputs.json)
BUCKET_NAME=$(jq -r --arg stack "$STACK" '.[$stack].BucketName' cdk-outputs.json)
DISTRIBUTION_ID=$(jq -r --arg stack "$STACK" '.[$stack].DistributionId' cdk-outputs.json)
test -n "$JOB_URL" && test "$JOB_URL" != "null"
test -n "$BUCKET_NAME" && test "$BUCKET_NAME" != "null"
test -n "$DISTRIBUTION_ID" && test "$DISTRIBUTION_ID" != "null"
cat > site/config.js <<EOF
window.__PREVIEW_JOB_URL__ = '${JOB_URL}';
EOF
aws s3 cp site/config.js "s3://${BUCKET_NAME}/config.js" --content-type application/javascript
aws cloudfront create-invalidation \
--distribution-id "$DISTRIBUTION_ID" \
--paths /config.js
The PR comment can now show all three useful URLs:
- name: Comment preview URLs
uses: actions/github-script@v7
env:
PR_NUMBER: ${{ github.event.pull_request.number }}
with:
github-token: ${{ secrets.PR_COMMENT_TOKEN || github.token }}
script: |
const fs = require('fs');
const marker = '<!-- pr-preview-urls -->';
const stackName = `Stack-PR${process.env.PR_NUMBER}`;
const outputs = JSON.parse(fs.readFileSync('cdk-outputs.json', 'utf8'));
const previewUrl = outputs?.[stackName]?.PreviewUrl;
const apiUrl = outputs?.[stackName]?.ApiUrl;
const jobUrl = outputs?.[stackName]?.JobUrl;
if (!previewUrl) throw new Error(`Missing PreviewUrl output for stack "${stackName}"`);
if (!apiUrl) throw new Error(`Missing ApiUrl output for stack "${stackName}"`);
if (!jobUrl) throw new Error(`Missing JobUrl output for stack "${stackName}"`);
const body = `${marker}\nPreview: ${previewUrl}\nAPI: ${apiUrl}\nJobs: ${jobUrl}`;
const { data: comments } = await github.rest.issues.listComments({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
per_page: 100,
});
const existing = comments.find((c) =>
c.user?.type === 'Bot' && typeof c.body === 'string' && c.body.includes(marker),
);
if (existing) {
await github.rest.issues.updateComment({
owner: context.repo.owner,
repo: context.repo.repo,
comment_id: existing.id,
body,
});
} else {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body,
});
}
Nothing about teardown changes. Closing the pull request still destroys the same stack, which now happens to contain a website, an API, a queue, and a DLQ.
What you should see #
- opening a PR deploys a preview stack with a site, API, queue, and DLQ
- the PR comment shows `PreviewUrl`, `ApiUrl`, and `JobUrl`
- visiting the site still works over HTTPS through CloudFront
- clicking `Submit Job` queues a successful async task
- clicking `Submit Failing Job` queues a message that is retried and then sent to the DLQ
- closing the PR removes the stack and all of those resources
Verify the stack #
After deploy, start with the stack outputs:
STACK=Stack-PR123
REGION=us-east-1
aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query 'Stacks[0].Outputs'
You want these outputs:
- `PreviewUrl`
- `ApiUrl`
- `JobUrl`
- `JobsQueueName`
- `JobsDeadLetterQueueName`
Resolve the API and job endpoints:
API_URL=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='ApiUrl'].OutputValue" \
--output text)
JOB_URL=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='JobUrl'].OutputValue" \
--output text)
Verify the synchronous API still works:
curl "$API_URL"
That should hit /hello and return JSON.
Now test the async success path:
curl -X POST "$JOB_URL" \
-H "content-type: application/json" \
-d '{"task":"resize-image"}'
That should return 202 from the submit Lambda, and the worker should process the message asynchronously.
Check worker logs:
aws logs tail "/aws/lambda/${STACK}-WorkerFunction" \
--region "$REGION" \
--since 10m
If the exact log group name differs in your account, list matching groups first:
aws logs describe-log-groups \
--region "$REGION" \
--query "logGroups[?contains(logGroupName, 'WorkerFunction')].logGroupName" \
--output text
Now test the DLQ path:
curl -X POST "$JOB_URL" \
-H "content-type: application/json" \
-d '{"task":"broken-job","fail":true}'
That job is designed to fail on purpose. After retries, it should land in the dead-letter queue.
Resolve the queue names:
JOBS_QUEUE=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='JobsQueueName'].OutputValue" \
--output text)
DLQ=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='JobsDeadLetterQueueName'].OutputValue" \
--output text)
Resolve queue URLs:
aws sqs get-queue-url --queue-name "$JOBS_QUEUE" --region "$REGION"
aws sqs get-queue-url --queue-name "$DLQ" --region "$REGION"
Then inspect DLQ depth:
DLQ_URL=$(aws sqs get-queue-url \
--queue-name "$DLQ" \
--region "$REGION" \
--query 'QueueUrl' \
--output text)
aws sqs get-queue-attributes \
--queue-url "$DLQ_URL" \
--region "$REGION" \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
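Note that SQS returns those attribute values as strings inside an `Attributes` map. If you read the response programmatically, a tiny helper (the `dlqDepth` name is mine) avoids string/number confusion:

```typescript
// get-queue-attributes responses look like:
// { "Attributes": { "ApproximateNumberOfMessages": "2", ... } }
// Values are strings, so convert before comparing.
interface GetQueueAttributesResult {
  Attributes?: Record<string, string>;
}

function dlqDepth(result: GetQueueAttributesResult): number {
  return Number(result.Attributes?.ApproximateNumberOfMessages ?? '0');
}

console.log(dlqDepth({ Attributes: { ApproximateNumberOfMessages: '2' } })); // 2
console.log(dlqDepth({})); // 0
```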
If everything is working:
- `GET /hello` returns JSON
- `POST /jobs` returns `202`
- successful jobs appear in worker logs
- failing jobs eventually show up in the DLQ
Notes and hardening #
- Keep CORS broad only for the tutorial. Once the pattern works, restrict origins to the preview site domain.
- The queue `visibilityTimeout` should stay comfortably above the worker Lambda timeout.
- The worker currently fails on purpose when `fail: true`. That is deliberate for a tutorial; real failure rules would come from actual business conditions.
- A DLQ is a holding area, not a fix. In a real system, add alarms, dashboards, and a runbook for replay or remediation.
- `SQS` decouples the public API from background work, but it does not guarantee that your worker code is idempotent. Design for retries.
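As a sketch of what "design for retries" can mean in practice, here is a minimal idempotency guard keyed by message id. It is in-memory only, so it is purely illustrative; a real system would use a durable store such as DynamoDB, since Lambda containers are recycled:

```typescript
// Skip work that has already been done for a given message id,
// so a redelivered message does not repeat its side effects.
const processed = new Set<string>();

function processOnce(messageId: string, work: () => void): boolean {
  if (processed.has(messageId)) return false; // duplicate delivery: skip
  work();
  processed.add(messageId);
  return true;
}

let runs = 0;
processOnce('msg-1', () => { runs += 1; });
processOnce('msg-1', () => { runs += 1; }); // redelivery is a no-op
console.log(runs); // 1
```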
Well-Architected Framework #
- Security: OIDC still removes long-lived AWS keys from CI, the preview site origin remains private behind CloudFront, and both queues use server-side encryption.
- Operational Excellence: this chapter introduces a new infrastructure pattern without changing the overall deployment model; the same PR workflow now provisions async infrastructure repeatedly and predictably.
- Reliability: the queue buffers work, the worker can retry failed messages, and the DLQ prevents poison messages from being retried forever.
- Cost Optimization: `SQS` and Lambda are inexpensive at tutorial scale, and the preview stack still disappears automatically when the pull request closes.
- Performance Efficiency: the HTTP layer stays fast because it only accepts work; the slower part happens asynchronously behind the queue.
- Sustainability: short-lived preview stacks avoid idle resources, and the queue-backed pattern lets compute happen only when there is actual work to process.