Async processing
SQS, DLQ
Day Four #
Not every useful system should finish its work while the caller waits. Some tasks are better accepted quickly, buffered safely, and processed in the background when compute is ready.
That is where queues become useful. They absorb bursts, decouple components, and give failures somewhere deliberate to go instead of disappearing into logs.
Goal #
For every pull request:
- keep the static preview site
- keep the API Gateway + Lambda path
- add an `SQS` queue for async jobs
- add a `DLQ` for messages that keep failing
- expose a `POST /jobs` endpoint that enqueues work
- process queue messages in a worker Lambda
What would the infrastructure look like for async processing?
What already exists #
At this point, you already have:
- a per-PR preview stack deployed from `bin/preview.ts`
- a private `S3` bucket fronted by `CloudFront`
- `GitHub Actions` assuming AWS access through OIDC
- a preview page that can call an API via generated `site/config.js`
- deterministic stack naming with `Stack-PR<number>`
- teardown on PR close
Why a queue? #
An HTTP request is a poor place to do slow or failure-prone work.
If the caller has to wait for every image resize, document conversion, outbound webhook, or background calculation, the system becomes fragile quickly.
A queue changes the shape:
- the API accepts the job fast
- the queue holds it durably
- the worker processes it when ready
- repeated failures go to a dead-letter queue for inspection
That is a very common AWS pattern, and it is materially different from a synchronous API.
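To make the redrive behavior concrete before touching any infrastructure, here is a tiny model of it. This is not part of the stack, just an illustration of how repeated failed receives end in the DLQ:

```typescript
// Minimal model of SQS redrive: a message is delivered to the worker
// until its receive count exceeds maxReceiveCount, after which SQS
// moves it to the dead-letter queue instead of redelivering it.
type Destination = 'worker' | 'dlq';

function routeAttempt(receiveCount: number, maxReceiveCount: number): Destination {
  // receiveCount is 1-based: the first delivery is receive 1.
  return receiveCount <= maxReceiveCount ? 'worker' : 'dlq';
}

// With maxReceiveCount: 3 (as configured later in this chapter),
// attempts 1-3 reach the worker; attempt 4 lands in the DLQ.
const attempts = [1, 2, 3, 4].map((n) => routeAttempt(n, 3));
console.log(attempts); // ['worker', 'worker', 'worker', 'dlq']
```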
The shape of the app #
Developer pushes to PR
|
v
+----------------------+
| GitHub Actions |
| deploy preview stack |
+----------------------+
|
v
+----------------------+
| CDK |
| Stack-PR<number> |
+----------------------+
|
+-------------------------------+
| |
v v
+----------------------+ +----------------------+
| CloudFront + S3 | | API Gateway |
| static preview site | | POST /jobs |
+----------------------+ +----------------------+
| |
| v
| +----------------------+
| | SubmitJob Lambda |
| | send to SQS |
| +----------------------+
| |
| v
| +----------------------+
| | SQS queue |
| +----------------------+
| |
| v
| +----------------------+
| | Worker Lambda |
| | async processing |
| +----------------------+
| |
| v
| +----------------------+
| | DLQ |
| | after retries |
| +----------------------+
|
Browser
CDK #
We will evolve the same preview stack again. The new pieces are:
- one dead-letter queue
- one main queue with a redrive policy
- one Lambda that accepts `POST /jobs` and sends messages to the queue
- one worker Lambda subscribed to the queue
- a few new stack outputs for verification and runtime config
lib/stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';
import * as s3deploy from 'aws-cdk-lib/aws-s3-deployment';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as apigwv2 from 'aws-cdk-lib/aws-apigatewayv2';
import * as integrations from 'aws-cdk-lib/aws-apigatewayv2-integrations';
import * as eventsources from 'aws-cdk-lib/aws-lambda-event-sources';
import { Tags } from 'aws-cdk-lib';
export class Stack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
const pr = String(this.node.tryGetContext('pr') ?? 'local');
const repo = this.node.tryGetContext('repo');
const sha = this.node.tryGetContext('sha');
const run = this.node.tryGetContext('run');
const acct = this.node.tryGetContext('acct');
const reg = this.node.tryGetContext('reg');
const bucket = new s3.Bucket(this, 'SiteBucket', {
autoDeleteObjects: true,
removalPolicy: cdk.RemovalPolicy.DESTROY,
blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
encryption: s3.BucketEncryption.S3_MANAGED,
...(acct && reg
? {
bucketName: `pr-${String(pr)}-${String(acct)}-${String(reg)}`
.toLowerCase()
.replace(/[^a-z0-9.-]/g, '')
.slice(0, 63)
.replace(/[.-]+$/g, ''),
}
: {}),
});
const distribution = new cloudfront.Distribution(this, 'SiteDistribution', {
defaultRootObject: 'index.html',
defaultBehavior: {
origin: origins.S3BucketOrigin.withOriginAccessControl(bucket),
viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
},
errorResponses: [
{ httpStatus: 403, responseHttpStatus: 200, responsePagePath: '/index.html' },
{ httpStatus: 404, responseHttpStatus: 200, responsePagePath: '/index.html' },
],
});
new s3deploy.BucketDeployment(this, 'DeploySite', {
destinationBucket: bucket,
sources: [s3deploy.Source.asset('site')],
distribution,
distributionPaths: ['/*'],
});
const deadLetterQueue = new sqs.Queue(this, 'JobsDeadLetterQueue', {
encryption: sqs.QueueEncryption.SQS_MANAGED,
retentionPeriod: cdk.Duration.days(14),
});
const jobsQueue = new sqs.Queue(this, 'JobsQueue', {
encryption: sqs.QueueEncryption.SQS_MANAGED,
visibilityTimeout: cdk.Duration.seconds(60),
deadLetterQueue: {
queue: deadLetterQueue,
maxReceiveCount: 3,
},
});
const submitJob = new lambda.Function(this, 'SubmitJobFunction', {
runtime: lambda.Runtime.NODEJS_20_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda/submit-job'),
timeout: cdk.Duration.seconds(5),
memorySize: 256,
environment: {
PR_NUMBER: pr,
QUEUE_URL: jobsQueue.queueUrl,
},
});
jobsQueue.grantSendMessages(submitJob);
const worker = new lambda.Function(this, 'WorkerFunction', {
runtime: lambda.Runtime.NODEJS_20_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda/worker'),
timeout: cdk.Duration.seconds(30),
memorySize: 256,
environment: {
PR_NUMBER: pr,
},
});
worker.addEventSource(
new eventsources.SqsEventSource(jobsQueue, {
batchSize: 5,
reportBatchItemFailures: true,
}),
);
const hello = new lambda.Function(this, 'HelloFunction', {
runtime: lambda.Runtime.NODEJS_20_X,
handler: 'index.handler',
code: lambda.Code.fromAsset('lambda/hello'),
timeout: cdk.Duration.seconds(5),
memorySize: 256,
environment: {
PR_NUMBER: pr,
},
});
const api = new apigwv2.HttpApi(this, 'PreviewApi', {
corsPreflight: {
allowHeaders: ['content-type'],
allowMethods: [apigwv2.CorsHttpMethod.GET, apigwv2.CorsHttpMethod.POST],
allowOrigins: ['*'],
},
});
api.addRoutes({
path: '/hello',
methods: [apigwv2.HttpMethod.GET],
integration: new integrations.HttpLambdaIntegration('HelloIntegration', hello),
});
api.addRoutes({
path: '/jobs',
methods: [apigwv2.HttpMethod.POST],
integration: new integrations.HttpLambdaIntegration('SubmitJobIntegration', submitJob),
});
Tags.of(this).add('managed-by', 'cdk');
Tags.of(this).add('preview', 'true');
Tags.of(this).add('pr', pr);
if (repo) Tags.of(this).add('repo', String(repo));
if (sha) Tags.of(this).add('sha', String(sha));
if (run) Tags.of(this).add('run-id', String(run));
Tags.of(bucket).add('resource', 'preview-bucket');
Tags.of(hello).add('resource', 'preview-api-function');
Tags.of(submitJob).add('resource', 'preview-job-submit-function');
Tags.of(worker).add('resource', 'preview-worker-function');
Tags.of(jobsQueue).add('resource', 'preview-jobs-queue');
Tags.of(deadLetterQueue).add('resource', 'preview-jobs-dlq');
new cdk.CfnOutput(this, 'BucketName', { value: bucket.bucketName });
new cdk.CfnOutput(this, 'PreviewUrl', {
value: `https://${distribution.domainName}/?pr=${encodeURIComponent(pr)}`,
});
new cdk.CfnOutput(this, 'DistributionId', { value: distribution.distributionId });
new cdk.CfnOutput(this, 'ApiUrl', { value: `${api.url}hello` });
new cdk.CfnOutput(this, 'JobUrl', { value: `${api.url}jobs` });
new cdk.CfnOutput(this, 'JobsQueueName', { value: jobsQueue.queueName });
new cdk.CfnOutput(this, 'JobsDeadLetterQueueName', { value: deadLetterQueue.queueName });
}
}
There are two small but important details here:
- `maxReceiveCount: 3` means a message gets three delivery attempts before moving to the DLQ
- `visibilityTimeout: 60` gives the worker time to finish before the message becomes visible again
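The relationship between those two timeouts is worth checking explicitly. A visibility timeout shorter than the worker timeout risks the same in-flight message being delivered twice; AWS guidance for Lambda event sources suggests a visibility timeout of a multiple of the function timeout. A small illustrative check, not part of the stack:

```typescript
// Returns true when the queue's visibility timeout leaves the worker
// enough room: at least the function timeout, ideally a multiple of it.
function visibilityTimeoutIsSafe(
  visibilityTimeoutSeconds: number,
  workerTimeoutSeconds: number,
): boolean {
  return visibilityTimeoutSeconds >= workerTimeoutSeconds;
}

// The stack above uses a 60s visibility timeout with a 30s worker timeout.
console.log(visibilityTimeoutIsSafe(60, 30)); // true
console.log(visibilityTimeoutIsSafe(20, 30)); // false: risks double delivery
```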
The producer Lambda #
The API only validates the request, shapes a message, and sends it to SQS.
lambda/submit-job/index.mjs
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
const sqs = new SQSClient({});
const headers = {
'content-type': 'application/json',
'access-control-allow-origin': '*',
};
export const handler = async (event) => {
const queueUrl = process.env.QUEUE_URL;
const prNumber = process.env.PR_NUMBER ?? 'local';
if (!queueUrl) {
return {
statusCode: 500,
headers,
body: JSON.stringify({ ok: false, error: 'QUEUE_URL is not configured' }),
};
}
let payload;
try {
payload = event?.body ? JSON.parse(event.body) : {};
} catch {
return {
statusCode: 400,
headers,
body: JSON.stringify({ ok: false, error: 'Invalid JSON body' }),
};
}
const task = typeof payload?.task === 'string' && payload.task.trim()
? payload.task.trim()
: 'demo-job';
const shouldFail = payload?.fail === true;
const body = {
task,
fail: shouldFail,
prNumber,
requestedAt: new Date().toISOString(),
};
const response = await sqs.send(
new SendMessageCommand({
QueueUrl: queueUrl,
MessageBody: JSON.stringify(body),
}),
);
return {
statusCode: 202,
headers,
body: JSON.stringify({
ok: true,
queued: true,
messageId: response.MessageId,
task,
fail: shouldFail,
note: shouldFail
? 'This message will be retried and then moved to the DLQ.'
: 'The worker Lambda will process this message asynchronously.',
}),
};
};
This is the whole point of the producer: return fast, queue work, move on.
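The payload-shaping part of that handler is pure logic, and factoring it out makes it easy to unit-test without SQS. A sketch mirroring the producer's rules (the `shapeJob` name is mine, not from the handler):

```typescript
// Mirrors the producer: trim the task name, fall back to 'demo-job',
// and only treat a strict boolean true as the failure flag.
interface JobMessage {
  task: string;
  fail: boolean;
}

function shapeJob(payload: unknown): JobMessage {
  const p = (payload ?? {}) as { task?: unknown; fail?: unknown };
  const task =
    typeof p.task === 'string' && p.task.trim() ? p.task.trim() : 'demo-job';
  return { task, fail: p.fail === true };
}

console.log(shapeJob({ task: '  resize-image ' })); // { task: 'resize-image', fail: false }
console.log(shapeJob({ fail: true }));              // { task: 'demo-job', fail: true }
console.log(shapeJob({ fail: 'yes' }));             // { task: 'demo-job', fail: false }
```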
The worker Lambda #
The worker is subscribed to the queue. Most messages should succeed. Some messages should fail.
For a tutorial, it helps if failure is easy to trigger on purpose, so the worker treats {"fail": true} as a deliberate failure and returns a partial batch failure for that message.
lambda/worker/index.mjs
export const handler = async (event) => {
const failures = [];
for (const record of event.Records ?? []) {
let payload;
try {
payload = JSON.parse(record.body);
} catch (error) {
console.error('Failed to parse SQS message body', {
messageId: record.messageId,
body: record.body,
error,
});
failures.push({ itemIdentifier: record.messageId });
continue;
}
if (payload.fail) {
console.error('Worker rejected message on purpose to demonstrate the DLQ path', {
messageId: record.messageId,
payload,
});
failures.push({ itemIdentifier: record.messageId });
continue;
}
console.log('Processed async job', {
messageId: record.messageId,
payload,
processedAt: new Date().toISOString(),
});
}
return {
batchItemFailures: failures,
};
};
That gives you a useful demo:
- normal messages are processed successfully
- failing messages are retried
- after the retry limit, they land in the DLQ
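The partial-batch-failure contract is easy to reason about in isolation: Lambda retries only the messages listed in `batchItemFailures`. The worker's routing logic, restated as a pure function for clarity (the names here are illustrative):

```typescript
// Given SQS-like records, return the batchItemFailures entries that
// tell Lambda which messages to retry (reportBatchItemFailures: true).
interface SqsRecordLike {
  messageId: string;
  body: string;
}

function collectFailures(records: SqsRecordLike[]): { itemIdentifier: string }[] {
  const failures: { itemIdentifier: string }[] = [];
  for (const record of records) {
    try {
      const payload = JSON.parse(record.body);
      // Matches the worker: fail: true is a deliberate failure.
      if (payload.fail) failures.push({ itemIdentifier: record.messageId });
    } catch {
      // Unparseable bodies are also reported for retry.
      failures.push({ itemIdentifier: record.messageId });
    }
  }
  return failures;
}

const failureReport = collectFailures([
  { messageId: 'a', body: '{"task":"ok","fail":false}' },
  { messageId: 'b', body: '{"task":"bad","fail":true}' },
  { messageId: 'c', body: 'not json' },
]);
console.log(failureReport); // [{ itemIdentifier: 'b' }, { itemIdentifier: 'c' }]
```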
The page #
The preview page reads a generated runtime config file and calls the deployed API directly from the browser.
This time the runtime value is JobUrl, not ApiUrl.
site/index.html
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>PR Preview</title>
<style>
:root {
color-scheme: light dark;
}
body {
font-family: system-ui, -apple-system, Segoe UI, Roboto, sans-serif;
margin: 0;
padding: 3rem 1.5rem;
}
.card {
max-width: 52rem;
margin: 0 auto;
border: 1px solid rgba(127, 127, 127, 0.35);
border-radius: 14px;
padding: 1.25rem;
}
.row {
display: flex;
gap: 0.75rem;
flex-wrap: wrap;
margin-top: 1rem;
}
button {
border: 0;
border-radius: 999px;
padding: 0.65rem 1rem;
cursor: pointer;
}
pre {
white-space: pre-wrap;
overflow-wrap: anywhere;
padding: 0.9rem;
border-radius: 10px;
background: rgba(127, 127, 127, 0.12);
}
input {
width: 100%;
box-sizing: border-box;
padding: 0.7rem 0.85rem;
border: 1px solid rgba(127, 127, 127, 0.35);
border-radius: 10px;
background: transparent;
}
</style>
</head>
<body>
<div class="card">
<h1>PR Preview</h1>
<p>This version adds an async path with API Gateway, SQS, Lambda, and a DLQ.</p>
<label for="task-name">Task name</label>
<input id="task-name" name="task-name" type="text" value="resize-image" />
<div class="row">
<button id="submit-job" type="button">Submit Job</button>
<button id="submit-fail-job" type="button">Submit Failing Job</button>
</div>
<p id="status">Ready.</p>
<pre id="output" hidden></pre>
</div>
<script src="./config.js"></script>
<script>
window.__PREVIEW_JOB_URL__ = window.__PREVIEW_JOB_URL__ || '';
const status = document.getElementById('status');
const output = document.getElementById('output');
const taskName = document.getElementById('task-name');
const submitJobButton = document.getElementById('submit-job');
const submitFailJobButton = document.getElementById('submit-fail-job');
const jobUrl = window.__PREVIEW_JOB_URL__;
if (!jobUrl) {
submitJobButton.disabled = true;
submitFailJobButton.disabled = true;
status.textContent = 'Job URL is not configured.';
}
const submitJob = async ({ fail }) => {
status.textContent = fail ? 'Submitting failing job...' : 'Submitting job...';
output.hidden = true;
try {
const response = await fetch(jobUrl, {
method: 'POST',
headers: {
'content-type': 'application/json',
accept: 'application/json',
},
body: JSON.stringify({
task: taskName.value.trim() || 'demo-job',
fail,
}),
});
const data = await response.json();
if (!response.ok) {
throw new Error(data?.error || `HTTP ${response.status}`);
}
status.textContent = fail
? 'Failing job queued. Watch retries and the DLQ.'
: 'Job queued successfully.';
output.textContent = JSON.stringify(data, null, 2);
output.hidden = false;
} catch (error) {
status.textContent = `Request failed: ${error.message}`;
output.textContent = '';
output.hidden = true;
}
};
submitJobButton.addEventListener('click', () => submitJob({ fail: false }));
submitFailJobButton.addEventListener('click', () => submitJob({ fail: true }));
</script>
</body>
</html>
The generated site/config.js is intentionally tiny:
window.__PREVIEW_JOB_URL__ = 'https://example.execute-api.us-east-1.amazonaws.com/jobs';
CI #
The page needs a runtime config file generated after deploy.
This workflow step reads the deployed outputs, writes site/config.js, uploads it to the preview bucket, and invalidates just that file in CloudFront:
- name: Generate config.js for the preview page
run: |
STACK="Stack-PR${PR_NUMBER}"
JOB_URL=$(jq -r --arg stack "$STACK" '.[$stack].JobUrl' cdk-outputs.json)
BUCKET_NAME=$(jq -r --arg stack "$STACK" '.[$stack].BucketName' cdk-outputs.json)
DISTRIBUTION_ID=$(jq -r --arg stack "$STACK" '.[$stack].DistributionId' cdk-outputs.json)
test -n "$JOB_URL" && test "$JOB_URL" != "null"
test -n "$BUCKET_NAME" && test "$BUCKET_NAME" != "null"
test -n "$DISTRIBUTION_ID" && test "$DISTRIBUTION_ID" != "null"
cat > site/config.js <<EOF
window.__PREVIEW_JOB_URL__ = '${JOB_URL}';
EOF
aws s3 cp site/config.js "s3://${BUCKET_NAME}/config.js" --content-type application/javascript
aws cloudfront create-invalidation \
--distribution-id "$DISTRIBUTION_ID" \
--paths /config.js
The PR comment can now show all three useful URLs:
- name: Comment preview URLs
uses: actions/github-script@v7
env:
PR_NUMBER: ${{ github.event.pull_request.number }}
with:
github-token: ${{ secrets.PR_COMMENT_TOKEN || github.token }}
script: |
const fs = require('fs');
const marker = '<!-- pr-preview-urls -->';
const stackName = `Stack-PR${process.env.PR_NUMBER}`;
const outputs = JSON.parse(fs.readFileSync('cdk-outputs.json', 'utf8'));
const previewUrl = outputs?.[stackName]?.PreviewUrl;
const apiUrl = outputs?.[stackName]?.ApiUrl;
const jobUrl = outputs?.[stackName]?.JobUrl;
if (!previewUrl) throw new Error(`Missing PreviewUrl output for stack "${stackName}"`);
if (!apiUrl) throw new Error(`Missing ApiUrl output for stack "${stackName}"`);
if (!jobUrl) throw new Error(`Missing JobUrl output for stack "${stackName}"`);
const body = `${marker}\nPreview: ${previewUrl}\nAPI: ${apiUrl}\nJobs: ${jobUrl}`;
const { data: comments } = await github.rest.issues.listComments({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
per_page: 100,
});
const existing = comments.find((c) =>
c.user?.type === 'Bot' && typeof c.body === 'string' && c.body.includes(marker),
);
if (existing) {
await github.rest.issues.updateComment({
owner: context.repo.owner,
repo: context.repo.repo,
comment_id: existing.id,
body,
});
} else {
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body,
});
}
Nothing about teardown changes. Closing the pull request still destroys the same stack, which now happens to contain a website, an API, a queue, and a DLQ.
What you should see #
- opening a PR deploys a preview stack with a site, API, queue, and DLQ
- the PR comment shows `PreviewUrl`, `ApiUrl`, and `JobUrl`
- visiting the site still works over HTTPS through CloudFront
- clicking `Submit Job` queues a successful async task
- clicking `Submit Failing Job` queues a message that is retried and then sent to the DLQ
- closing the PR removes the stack and all of those resources
Verify the stack #
After deploy, start with the stack outputs:
STACK=Stack-PR123
REGION=us-east-1
aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query 'Stacks[0].Outputs'
You want these outputs:
- `PreviewUrl`
- `ApiUrl`
- `JobUrl`
- `JobsQueueName`
- `JobsDeadLetterQueueName`
Resolve the API and job endpoints:
API_URL=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='ApiUrl'].OutputValue" \
--output text)
JOB_URL=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='JobUrl'].OutputValue" \
--output text)
Verify the synchronous API still works:
curl "$API_URL"
That should hit /hello and return JSON.
Now test the async success path:
curl -X POST "$JOB_URL" \
-H "content-type: application/json" \
-d '{"task":"resize-image"}'
That should return 202 from the submit Lambda, and the worker should process the message asynchronously.
Check worker logs:
aws logs tail "/aws/lambda/${STACK}-WorkerFunction" \
--region "$REGION" \
--since 10m
If the exact log group name differs in your account, list matching groups first:
aws logs describe-log-groups \
--region "$REGION" \
--query "logGroups[?contains(logGroupName, 'WorkerFunction')].logGroupName" \
--output text
Now test the DLQ path:
curl -X POST "$JOB_URL" \
-H "content-type: application/json" \
-d '{"task":"broken-job","fail":true}'
That job is designed to fail on purpose. After retries, it should land in the dead-letter queue.
Resolve the queue names:
JOBS_QUEUE=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='JobsQueueName'].OutputValue" \
--output text)
DLQ=$(aws cloudformation describe-stacks \
--stack-name "$STACK" \
--region "$REGION" \
--query "Stacks[0].Outputs[?OutputKey=='JobsDeadLetterQueueName'].OutputValue" \
--output text)
Resolve queue URLs:
aws sqs get-queue-url --queue-name "$JOBS_QUEUE" --region "$REGION"
aws sqs get-queue-url --queue-name "$DLQ" --region "$REGION"
Then inspect DLQ depth:
DLQ_URL=$(aws sqs get-queue-url \
--queue-name "$DLQ" \
--region "$REGION" \
--query 'QueueUrl' \
--output text)
aws sqs get-queue-attributes \
--queue-url "$DLQ_URL" \
--region "$REGION" \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
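Note that SQS returns those attribute values as strings inside an `Attributes` map. If you read the response programmatically, a tiny helper (the `dlqDepth` name is mine) avoids string/number confusion:

```typescript
// get-queue-attributes responses look like:
// { "Attributes": { "ApproximateNumberOfMessages": "2", ... } }
// Values are strings, so convert before comparing.
interface GetQueueAttributesResult {
  Attributes?: Record<string, string>;
}

function dlqDepth(result: GetQueueAttributesResult): number {
  return Number(result.Attributes?.ApproximateNumberOfMessages ?? '0');
}

console.log(dlqDepth({ Attributes: { ApproximateNumberOfMessages: '2' } })); // 2
console.log(dlqDepth({})); // 0
```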
If everything is working:
- `GET /hello` returns JSON
- `POST /jobs` returns `202`
- successful jobs appear in worker logs
- failing jobs eventually show up in the DLQ
Notes and hardening #
- Keep CORS broad only for the tutorial. Once the pattern works, restrict origins to the preview site domain.
- The queue `visibilityTimeout` should stay comfortably above the worker Lambda timeout.
- The worker currently fails on purpose when `fail: true`. That is deliberate for a tutorial; real failure rules would come from actual business conditions.
- A DLQ is a holding area, not a fix. In a real system, add alarms, dashboards, and a runbook for replay or remediation.
- `SQS` decouples the public API from background work, but it does not guarantee that your worker code is idempotent. Design for retries.
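As a sketch of what "design for retries" can mean in practice, here is a minimal idempotency guard keyed by message id. It is in-memory only, so it is purely illustrative; a real system would use a durable store such as DynamoDB, since Lambda containers are recycled:

```typescript
// Skip work that has already been done for a given message id,
// so a redelivered message does not repeat its side effects.
const processed = new Set<string>();

function processOnce(messageId: string, work: () => void): boolean {
  if (processed.has(messageId)) return false; // duplicate delivery: skip
  work();
  processed.add(messageId);
  return true;
}

let runs = 0;
processOnce('msg-1', () => { runs += 1; });
processOnce('msg-1', () => { runs += 1; }); // redelivery is a no-op
console.log(runs); // 1
```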
Well-Architected Framework #
- Security: OIDC still removes long-lived AWS keys from CI, the preview site origin remains private behind CloudFront, and both queues use server-side encryption.
- Operational Excellence: this chapter introduces a new infrastructure pattern without changing the overall deployment model; the same PR workflow now provisions async infrastructure repeatedly and predictably.
- Reliability: the queue buffers work, the worker can retry failed messages, and the DLQ prevents poison messages from being retried forever.
- Cost Optimization: `SQS` and Lambda are inexpensive at tutorial scale, and the preview stack still disappears automatically when the pull request closes.
- Performance Efficiency: the HTTP layer stays fast because it only accepts work; the slower part happens asynchronously behind the queue.
- Sustainability: short-lived preview stacks avoid idle resources, and the queue-backed pattern lets compute happen only when there is actual work to process.