AWS Lambda Best Practices for Production
By Oleksandr Andrushchenko
AWS Lambda is easy to start with, but production Lambda systems require more than writing a handler function. You need to think about idempotency, retries, timeouts, IAM permissions, database connections, observability, deployment strategy, and downstream limits.
This article covers practical AWS Lambda best practices for production. The goal is not to list every AWS feature, but to explain how to build Lambda functions that are reliable, secure, observable, cost-aware, and easier to maintain over time.
Table of Contents
- Write Small, Single-Purpose Functions
- Design for Idempotency
- Choose the Right Event Source
- Keep Functions Stateless
- Optimize Initialization
- Manage Database Connections Correctly
- Handle Errors Properly
- Control Concurrency
- Secure Your Lambda Functions
- Build for Observability
- Monitor Costs
- Deploy Safely
- Testing Lambda Functions
- Production Readiness Checklist
- Common Production Mistakes
- Conclusion
Write Small, Single-Purpose Functions
A good production Lambda function should have a clear responsibility. It should be easy to understand what event it handles, what side effects it performs, and what success or failure means.
One Responsibility per Function
One responsibility per function does not mean every function must contain only ten lines of code. It means the function should represent one clear unit of work.
| Good Function Boundary | Bad Function Boundary |
|---|---|
| Process one SQS order message | Process orders, send reports, sync users, and update analytics |
| Resize uploaded image | Handle every possible S3 file workflow in one function |
| Validate webhook signature and enqueue event | Receive webhook, process payment, send email, update CRM, and generate PDF |
Avoid Monolithic Lambdas
A monolithic Lambda usually starts as a convenient shortcut and later becomes hard to test, deploy, observe, and debug. If one function handles too many unrelated event types, every change becomes risky.
# Bad: one function handles unrelated workflows
def lambda_handler(event, context):
    if event["type"] == "user_created":
        create_user_profile(event)
    elif event["type"] == "order_created":
        process_order(event)
    elif event["type"] == "image_uploaded":
        resize_image(event)
    elif event["type"] == "daily_report":
        generate_report(event)
Better:
user-created-handler
order-created-handler
image-uploaded-handler
daily-report-handler
Rule of thumb: if a Lambda function needs many unrelated if event["type"] == ... branches, it is probably doing too much.
Design for Idempotency
Idempotency means the same event can be processed more than once without producing incorrect results. In production Lambda systems, this is not optional. Retries, duplicate deliveries, client retries, and network failures can all cause the same logical event to appear more than once.
Why Duplicate Events Happen
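- Queue retries: SQS standard queues deliver at least once, and a message whose processing fails becomes visible again after the visibility timeout.
- Async retries: by default, Lambda retries a failed asynchronous invocation up to two more times.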
- Client retries: API clients may resend requests after timeouts.
- Stream retries: a failed batch may be retried from the stream.
- Network uncertainty: a caller may not know whether a previous request succeeded.
Safe Retry Strategies
Retries are useful only when repeating the operation is safe. Retrying a temporary network failure is good. Retrying a payment charge without idempotency is dangerous.
| Operation | Retry Risk | Required Protection |
|---|---|---|
| Read user profile | Low | Normal retry |
| Send email | Medium | Message ID or send log |
| Charge payment | High | Idempotency key |
| Create order | High | Unique request ID or conditional write |
Idempotency Keys
An idempotency key is a unique identifier for one logical operation. Before performing the side effect, the function checks whether the operation was already processed.
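def process_payment(event):
    payment_id = event["paymentId"]
    # NOTE: check-then-act is not atomic. Two concurrent invocations can both
    # pass this check, so back it with a conditional write, or pass the key
    # to the payment provider as well.
    if payment_already_processed(payment_id):
        return {
            "status": "already_processed"
        }
    charge_customer(event)
    mark_payment_processed(payment_id)
    return {
        "status": "processed"
    }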
Rule of thumb: every production Lambda that performs external side effects should be designed as if the same event may arrive twice.
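One way to close the gap between "check" and "mark" is to claim the key first and only then perform the side effect. The sketch below is illustrative, not a definitive implementation: InMemoryIdempotencyStore and process_payment_once are hypothetical names, and the in-memory dictionary stands in for a durable store that supports atomic conditional writes (for example, a DynamoDB put with an attribute_not_exists condition).

```python
import time


class InMemoryIdempotencyStore:
    """Hypothetical stand-in for a table that supports atomic conditional writes."""

    def __init__(self):
        self._records = {}

    def claim(self, key):
        """Record the key; return False if it was already claimed."""
        if key in self._records:
            return False
        self._records[key] = {"status": "in_progress", "claimed_at": time.time()}
        return True

    def complete(self, key, result):
        self._records[key] = {"status": "completed", "result": result}

    def release(self, key):
        self._records.pop(key, None)

    def get(self, key):
        return self._records.get(key)


def process_payment_once(store, payment_id, charge):
    """Run charge() only if this call is the first to claim payment_id."""
    if not store.claim(payment_id):
        return {"status": "already_processed", "record": store.get(payment_id)}
    try:
        result = charge()
    except Exception:
        # Assumption: a raised error means no charge happened, so a retry is safe.
        # If the outcome is ambiguous, keep the claim and reconcile instead.
        store.release(payment_id)
        raise
    store.complete(payment_id, result)
    return {"status": "processed", "result": result}
```

Claiming before charging means a duplicate delivery sees the existing record and returns early, even if the first invocation is still running.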
Choose the Right Event Source
The event source defines how Lambda receives data, how retries work, whether batching exists, how failures are handled, and how the function scales. Choosing the wrong event source creates production problems that cannot be solved only inside the handler code.
HTTP APIs
Use API Gateway, Function URLs, or Application Load Balancer when the function must respond to an HTTP request.
- API Gateway: production APIs, routing, auth, throttling, custom domains.
- Function URLs: simple single-function HTTP endpoints.
- ALB: hybrid architectures where Lambda is one target behind a load balancer.
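For HTTP event sources that use the Lambda proxy response format, the handler returns a status code, headers, and a string body. The sketch below shows that shape with defensive body parsing. The orderId field and the http_response helper are illustrative assumptions, and base64-encoded bodies are not handled here.

```python
import json


def http_response(status_code, payload):
    """Build a proxy-style HTTP response with a JSON body."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    # The request body arrives as a string (or None), so parse it defensively.
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return http_response(400, {"error": "invalid JSON body"})
    if not isinstance(body, dict) or "orderId" not in body:
        return http_response(400, {"error": "orderId is required"})
    return http_response(200, {"orderId": body["orderId"], "status": "accepted"})
```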
Queues
Use SQS when work can be processed asynchronously and should survive temporary failures. Queues are useful for background jobs, buffering, retry handling, and protecting downstream systems.
Streams
Use Kinesis, DynamoDB Streams, or Kafka when records arrive as an ordered stream and must be processed in batches.
Scheduled Jobs
Use EventBridge Scheduler for cron-like tasks such as cleanup jobs, reports, synchronization, and periodic checks.
Workflows
Use Step Functions when the business process has multiple steps, branches, retries, waits, or compensation logic.
| Need | Recommended Service |
|---|---|
| REST API | API Gateway |
| Simple webhook | Function URL |
| Background job | SQS |
| Broadcast event | SNS or EventBridge |
| File processing | S3 Event |
| Database change reaction | DynamoDB Streams |
| Multi-step workflow | Step Functions |
Keep Functions Stateless
Lambda functions should be designed as stateless units of work. The execution environment may be reused, but it can also disappear at any time. Do not treat local memory or local files as the source of truth.
Do Not Depend on Local Files
Lambda provides temporary storage in /tmp, but it is not durable application storage. It can be useful for temporary files, downloads, generated reports, or intermediate processing.
def lambda_handler(event, context):
    temp_path = "/tmp/report.csv"
    generate_report(temp_path)
    upload_to_s3(temp_path)
    return {
        "status": "uploaded"
    }
Important: /tmp is temporary. Store durable data in S3, DynamoDB, RDS, ElastiCache, or another external system.
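Because a warm execution environment can keep files in /tmp between invocations, a fixed file name may pick up leftovers from an earlier run. A minimal sketch, assuming a hypothetical build_report helper and an upload callable supplied by the caller, that writes to a unique temporary file and always cleans it up:

```python
import csv
import os
import tempfile


def build_report(rows, upload):
    """Write rows to a unique temp file, pass its path to upload(), then delete it."""
    fd, path = tempfile.mkstemp(suffix=".csv", dir=tempfile.gettempdir())
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        upload(path)
    finally:
        # Remove the file so it does not accumulate across warm invocations.
        os.remove(path)
    return path
```

The upload callable is where a real function would copy the file to durable storage such as S3.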
Reuse Execution Environment Carefully
Warm invocations may reuse global variables, clients, and cached configuration. This is useful for performance, but dangerous if you store request-specific data globally.
# Good: reusable client
import boto3

s3_client = boto3.client("s3")

def lambda_handler(event, context):
    # Return only JSON-serializable data; the raw response contains datetime objects.
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response["Buckets"]]

# Bad: request-specific mutable state
current_user_id = None

def lambda_handler(event, context):
    global current_user_id
    current_user_id = event["userId"]
    return process_user(current_user_id)
Store State Externally
- S3: files, reports, exports, images, documents.
- DynamoDB: key-value access, metadata, idempotency records.
- RDS / Aurora: relational data and transactions.
- ElastiCache: low-latency cached data.
- SQS: durable pending work.
Rule of thumb: use global state for reusable infrastructure clients, not for business state.
Optimize Initialization
Initialization affects cold starts. Code outside the handler runs when Lambda creates a new execution environment. Keep that code small, useful, and predictable.
Reuse SDK Clients
Create AWS SDK clients outside the handler so they can be reused during warm invocations.
import boto3

# Created once per execution environment and reused by warm invocations.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

def lambda_handler(event, context):
    response = table.get_item(
        Key={"id": event["userId"]}
    )
    return response.get("Item")
Lazy Loading
If a heavy dependency is used only in rare cases, load it only when needed.
def lambda_handler(event, context):
    if event.get("generatePdf"):
        # Imported only on the rare path that needs it.
        import reportlab
        return generate_pdf(reportlab, event)
    return {
        "message": "No PDF needed"
    }
Reduce Package Size
Large packages increase deployment complexity and can increase cold start time. Remove unused dependencies, tests, documentation, local caches, and development-only files.
Common package bloat:
- tests
- docs
- local virtual environments
- unused libraries
- example files
- development tools
- large generated artifacts
Avoid Heavy Frameworks When Not Needed
A simple Lambda function does not always need a full web framework. Use the simplest structure that solves the problem.
| Situation | Good Choice |
|---|---|
| Single webhook | Plain Lambda handler |
| Few simple background jobs | Plain handlers |
| Many routes and shared middleware | Small framework may help |
| Existing large application | Framework may reduce migration effort |
Manage Database Connections Correctly
Database connections are one of the most common production problems with Lambda. Lambda can scale quickly, but relational databases have connection limits.
Reuse Connections
Reuse database connections carefully across warm invocations. Always handle stale or closed connections.
import os
import psycopg2

# Kept at module level so warm invocations can reuse it.
connection = None

def get_connection():
    global connection
    # `closed` reflects client-side state only; a connection dropped by the
    # server is detected on the next query, so callers should handle that error.
    if connection is None or connection.closed:
        connection = psycopg2.connect(
            host=os.environ["DB_HOST"],
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"]
        )
    return connection

def lambda_handler(event, context):
    conn = get_connection()
    with conn.cursor() as cursor:
        cursor.execute("SELECT now()")
        row = cursor.fetchone()
    return {
        "databaseTime": str(row[0])
    }
Use RDS Proxy
RDS Proxy helps pool and manage database connections between Lambda and relational databases such as RDS or Aurora.
Problem:
1,000 Lambda invocations
-> 1,000 direct database connections
-> database connection exhaustion
Better:
1,000 Lambda invocations
-> RDS Proxy
-> managed database connection pool
Protect Downstream Databases
- Limit concurrency for functions that write to relational databases.
- Use SQS to buffer write-heavy workloads.
- Keep transactions short.
- Avoid opening new connections for every invocation.
- Use DynamoDB when the access pattern fits key-value or document-style reads/writes.
Handle Errors Properly
Error handling depends heavily on the event source. An API request, an SQS message, and a stream record should not all be handled the same way.
Retries
Retries are useful for temporary failures, but dangerous for permanent failures. Invalid input will not become valid after ten retries.
| Error Type | Retry? | Example |
|---|---|---|
| Temporary network issue | Yes, with backoff | Timeout calling external API |
| Rate limit | Yes, carefully | HTTP 429 |
| Invalid input | No | Missing required field |
| Business rule failure | Usually no | Payment method rejected |
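The distinction in the table can be made explicit in code by retrying only errors that are marked as transient. The sketch below uses exponential backoff with full jitter. TransientError, PermanentError, and call_with_backoff are hypothetical names, and the delay values are illustrative defaults rather than recommendations.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 429)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.2, max_delay=5.0,
                      sleep=time.sleep):
    """Retry operation() on TransientError with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Double the ceiling each attempt, cap it, then pick a random delay below it.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
        # Any other exception, including PermanentError, propagates immediately.
```

Keep the total retry time well below the function timeout, otherwise the invocation may be terminated mid-retry.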
Dead Letter Queues
A dead-letter queue stores messages that failed repeatedly. This is useful for debugging and manual recovery.
SQS Queue
-> Lambda Worker
   -> success: message deleted
   -> failure: retry
   -> repeated failure: move to DLQ
Lambda Destinations
Lambda Destinations can route successful or failed asynchronous invocation results to another service. This is useful when you need to react to success or failure outcomes.
Poison Messages
A poison message is a message that always fails. If not handled, it can be retried repeatedly and block useful processing.
- Validate messages early.
- Separate temporary and permanent errors.
- Use partial batch failures when supported.
- Send bad messages to a DLQ.
- Log enough context to debug the message later.
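For SQS, partial batch failures are reported by returning the IDs of the failed messages in batchItemFailures, so that only those messages are retried. This requires ReportBatchItemFailures to be enabled on the event source mapping. In the sketch below, process_order and InvalidMessage are hypothetical placeholders for real business logic.

```python
import json
import logging

logger = logging.getLogger(__name__)


class InvalidMessage(Exception):
    """Hypothetical marker for messages that can never succeed."""


def process_order(order):
    # Placeholder business logic: validate early, then do the real work.
    if not isinstance(order, dict) or "orderId" not in order:
        raise InvalidMessage("orderId is required")


def lambda_handler(event, context):
    failures = []
    for record in event.get("Records", []):
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Log enough context to debug later, then report only this message.
            logger.exception("Failed to process message %s", record["messageId"])
            failures.append({"itemIdentifier": record["messageId"]})
    # Messages not listed here are treated as processed and removed from the queue.
    return {"batchItemFailures": failures}
```

A message that keeps failing is moved to the DLQ by the queue's redrive policy once it exceeds the configured receive count.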
Control Concurrency
Lambda can scale quickly. That is useful, but it can also overload databases, APIs, queues, and legacy systems. Production Lambda systems should control concurrency intentionally.
Reserved Concurrency
Reserved concurrency can reserve capacity for a function and also limit its maximum concurrency. This is useful for protecting downstream systems.
Without limit:
SQS has 50,000 messages
-> Lambda scales aggressively
-> database becomes overloaded
With reserved concurrency:
SQS has 50,000 messages
-> Lambda processes at controlled speed
-> database remains healthy
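Limiting concurrency trades throughput for safety, so it helps to estimate how long a backlog will take to drain. A rough sketch, assuming each concurrent execution processes one batch at a time and that the batch size and duration below are hypothetical measurements:

```python
def drain_time_seconds(queue_depth, reserved_concurrency, batch_size, seconds_per_batch):
    """Estimate how long a backlog takes to drain at a fixed concurrency."""
    messages_per_second = reserved_concurrency * batch_size / seconds_per_batch
    return queue_depth / messages_per_second


# 50,000 messages, 10 concurrent executions, 10 messages per batch, 2 s per batch:
# 10 * 10 / 2 = 50 messages per second, so 50,000 / 50 = 1,000 seconds.
estimate = drain_time_seconds(50_000, 10, 10, 2)
```

With one database connection per execution environment, the same setting also bounds this function's connections at roughly the reserved concurrency value.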
Provisioned Concurrency
Provisioned concurrency keeps execution environments initialized and ready. This is useful for latency-sensitive APIs where cold starts are not acceptable.
| Use Reserved Concurrency | Use Provisioned Concurrency |
|---|---|
| To limit or reserve capacity | To reduce cold starts |
| To protect downstream systems | To improve latency predictability |
| For queue workers and database writers | For important synchronous APIs |
Protect External Systems
- Use queues to buffer spikes.
- Set reserved concurrency for database-heavy functions.
- Use backoff for external API retries.
- Set timeouts so slow downstream calls do not consume all function time.
- Use circuit-breaker behavior for repeated downstream failures.
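One way to apply the timeout advice is to derive each downstream timeout from the time the invocation has left, using the context object's get_remaining_time_in_millis() method. The helper name and the reserve and cap values below are illustrative assumptions.

```python
def downstream_timeout_seconds(context, reserve_ms=500, max_timeout_s=5.0):
    """Pick a timeout that leaves reserve_ms for cleanup before the Lambda deadline."""
    remaining_ms = context.get_remaining_time_in_millis()
    budget_s = (remaining_ms - reserve_ms) / 1000
    if budget_s <= 0:
        # Fail fast instead of starting a call that cannot finish in time.
        raise TimeoutError("not enough time left to call downstream")
    return min(max_timeout_s, budget_s)
```

The returned value can be passed as the timeout argument of whatever HTTP or database client the function uses.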
Secure Your Lambda Functions
Production Lambda functions should follow the same security principles as any other backend system: least privilege, secure secret handling, controlled network access, and safe input validation.
Least-Privilege IAM
Give each Lambda function only the permissions it needs. Avoid broad permissions such as * unless there is a strong reason.
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject"
  ],
  "Resource": [
    "arn:aws:s3:::example-bucket/uploads/*"
  ]
}
Rule of thumb: permissions should be specific to the service, action, and resource.
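A simple review aid is a script that flags wildcard actions and resources in Allow statements before a policy is deployed. This is a heuristic sketch with a hypothetical find_wildcards helper, not a replacement for a proper policy analysis tool: it only flags a bare * and values ending in :*.

```python
def find_wildcards(policy):
    """Return (statement_index, field, value) for wildcards found in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may contain a single statement object instead of a list.
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            values = statement.get(field, [])
            if isinstance(values, str):
                values = [values]
            for value in values:
                if value == "*" or value.endswith(":*"):
                    findings.append((index, field, value))
    return findings
```

A path-scoped resource such as arn:aws:s3:::example-bucket/uploads/* is not flagged, while s3:* or a bare * is.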
Secrets Manager and Parameter Store
Do not hardcode secrets in source code. Use AWS Secrets Manager or SSM Parameter Store for passwords, API keys, tokens, and sensitive configuration.
import boto3
import json

secrets_client = boto3.client("secretsmanager")

def get_secret(secret_id):
    response = secrets_client.get_secret_value(
        SecretId=secret_id
    )
    return json.loads(response["SecretString"])
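Fetching a secret on every invocation adds latency and API calls, so a common approach is to cache the value in the execution environment for a limited time. The sketch below assumes a hypothetical get_cached_secret helper that takes the fetch function as a parameter; the 300-second TTL is an arbitrary example, and a shorter TTL picks up rotated secrets sooner.

```python
import time

# Module-level cache: survives across warm invocations of one execution environment.
_cache = {}


def get_cached_secret(secret_id, fetch, ttl_seconds=300, now=time.monotonic):
    """Return a cached secret, calling fetch(secret_id) only when the entry has expired."""
    entry = _cache.get(secret_id)
    current = now()
    if entry is None or current - entry["fetched_at"] >= ttl_seconds:
        entry = {"value": fetch(secret_id), "fetched_at": current}
        _cache[secret_id] = entry
    return entry["value"]
```

Usage would look like get_cached_secret("my-secret-id", get_secret), where get_secret is a function that reads from Secrets Manager and "my-secret-id" is a placeholder.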
Environment Variables
Environment variables are useful for non-secret configuration such as table names, bucket names, feature flags, and API endpoints.
import os

TABLE_NAME = os.environ["TABLE_NAME"]

def lambda_handler(event, context):
    return {
        "table": TABLE_NAME
    }
Important: environment variables are configuration, not a replacement for a secret management strategy.
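Reading and validating configuration once, at initialization, makes a misconfigured deployment fail immediately instead of on some later code path. A minimal sketch follows; TABLE_NAME comes from the example above, while LOG_LEVEL, BATCH_LIMIT, and the Settings class are hypothetical additions for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    table_name: str
    log_level: str
    batch_limit: int


def load_settings(env=os.environ):
    """Read and validate configuration once, failing fast on missing or bad values."""
    missing = [name for name in ("TABLE_NAME",) if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        table_name=env["TABLE_NAME"],
        log_level=env.get("LOG_LEVEL", "INFO"),
        # int() raises ValueError if BATCH_LIMIT is not a number.
        batch_limit=int(env.get("BATCH_LIMIT", "10")),
    )
```

Calling load_settings() at module level, outside the handler, surfaces configuration errors during initialization.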
VPC Considerations
Put Lambda in a VPC only when it needs private network access, such as private RDS, ElastiCache, or internal services. VPC configuration adds networking complexity and requires correct subnet, security group, and NAT design for outbound internet access.
Build for Observability
Production Lambda systems are distributed by nature. One user action can pass through API Gateway, Lambda, SQS, another Lambda, DynamoDB, and EventBridge. Without good observability, debugging becomes painful.
Structured Logging
Prefer structured JSON logs over random text messages. Include request IDs, event IDs, user IDs when safe, correlation IDs, and important business context.
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    logger.info(json.dumps({
        "message": "Processing event",
        "requestId": context.aws_request_id,
        "eventType": event.get("type"),
        "orderId": event.get("orderId")
    }))
    return {
        "status": "ok"
    }
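Instead of calling json.dumps at every log site, the JSON rendering can live in a logging formatter. The sketch below uses only the standard logging module. JsonFormatter and the "context" key passed through extra are naming choices made for this example, not an AWS convention.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the output.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps logging from failing on values such as datetimes.
        return json.dumps(payload, default=str)
```

To use it, set the formatter on the root logger's handlers with handler.setFormatter(JsonFormatter()), then log with logger.info("Processing event", extra={"context": {"orderId": "o-1"}}).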
CloudWatch Metrics
Track metrics that show whether the function is healthy and whether the system is falling behind.
| Metric | Why It Matters |
|---|---|
| Errors | Function failures |
| Duration | Performance and cost |
| Throttles | Concurrency limits |
| ConcurrentExecutions | Scaling behavior |
| IteratorAge | Stream consumer lag |
| Dead-letter queue depth (SQS ApproximateNumberOfMessagesVisible on the DLQ) | Failed async processing |
Tracing
Use tracing when requests cross multiple services. Tracing helps identify slow database calls, external API bottlenecks, retries, and service-to-service latency.
Correlation IDs
A correlation ID connects logs from different services that belong to the same workflow.
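import json
import logging

logger = logging.getLogger()

def lambda_handler(event, context):
    # Reuse the caller's correlation ID when present; otherwise start a new one.
    correlation_id = (
        event.get("correlationId")
        or context.aws_request_id
    )
    logger.info(json.dumps({
        "message": "Start processing",
        "correlationId": correlation_id
    }))
    # Include correlation_id in any messages or requests this function emits.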
Monitor Costs
Lambda can be inexpensive, but it can also become costly when functions run too long, retry too often, process unnecessary events, or trigger each other in loops.
Memory vs Duration
Lambda compute cost depends on configured memory and billed duration (GB-seconds), plus a per-request charge. Lambda allocates CPU in proportion to memory, so more memory can reduce duration, and the cheapest configuration is not always the smallest memory setting.
| Configuration | Duration | Result |
|---|---|---|
| Low memory | Long duration | May be slow and not actually cheaper |
| Balanced memory | Shorter duration | Often best trade-off |
| Too much memory | Small additional improvement | Diminishing returns |
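The trade-off in the table can be checked with simple arithmetic: duration charges are GB-seconds consumed multiplied by a unit price. In the sketch below, the durations are hypothetical measurements and the price is a placeholder parameter, so check current AWS pricing for your region and architecture before relying on absolute numbers.

```python
def compute_cost(memory_mb, avg_duration_ms, invocations, price_per_gb_second):
    """Estimate duration charges: GB-seconds consumed times the unit price."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * price_per_gb_second


PRICE = 0.0000166667  # placeholder unit price per GB-second, for illustration only

# Hypothetical measurements for one million invocations:
# 512 MB at 800 ms  -> 0.5 GB * 0.80 s * 1,000,000 = 400,000 GB-seconds
# 1024 MB at 350 ms -> 1.0 GB * 0.35 s * 1,000,000 = 350,000 GB-seconds
small = compute_cost(512, 800, 1_000_000, PRICE)
large = compute_cost(1024, 350, 1_000_000, PRICE)
```

With these example numbers the larger memory setting is both faster and cheaper, which is why memory should be tuned against measured durations rather than guessed.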
Unused Invocations
Filter events before they invoke Lambda when possible. Use EventBridge rules, S3 prefix/suffix filters, and event source filtering for streams or queues.
Infinite Loops
Recursive triggers can create unexpected cost and system load.
Bad:
S3 upload -> Lambda -> writes to same S3 prefix -> Lambda runs again
Bad:
DynamoDB update -> Stream -> Lambda -> updates same item -> Stream runs again
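The most reliable fix for the S3 loop is structural: write results to a different bucket, or to a prefix the trigger does not match. A guard in the handler can serve as a second line of defense. In the sketch below, the processed/ prefix and the keys_to_process helper are assumptions made for this example.

```python
from urllib.parse import unquote_plus

OUTPUT_PREFIX = "processed/"  # hypothetical prefix this function writes to


def keys_to_process(event):
    """Return uploaded object keys, skipping anything under the function's own output prefix."""
    keys = []
    for record in event.get("Records", []):
        # S3 event notifications URL-encode object keys.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(OUTPUT_PREFIX):
            continue
        keys.append(key)
    return keys
```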
Cost Alarms
Use billing alarms and service-level alarms to detect unusual usage early. Monitor invocations, errors, retries, DLQ size, and concurrency spikes.
Deploy Safely
Production Lambda deployment should support safe release, rollback, and validation. Do not treat every deployment as a manual overwrite of the current function.
Versions
Lambda versions are immutable snapshots of function code and configuration. They allow you to point production traffic to a specific known version.
Aliases
Aliases are named pointers to versions, such as dev, staging, or prod.
prod alias -> version 12
staging alias -> version 13
Canary Deployments
A canary deployment sends a small percentage of traffic to a new version before shifting all traffic.
95% traffic -> version 12
5% traffic -> version 13
If metrics are healthy:
100% traffic -> version 13
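With aliases, the traffic split above corresponds to weighted routing on the alias. The sketch below only builds the arguments for the Lambda UpdateAlias API (boto3's update_alias), so it can be reviewed and tested without AWS access. The function name, alias, and version numbers are example values, and canary_alias_params is a hypothetical helper.

```python
def canary_alias_params(function_name, alias, stable_version, canary_version, canary_weight):
    """Build keyword arguments for UpdateAlias with weighted routing."""
    if not 0 <= canary_weight <= 1:
        raise ValueError("canary_weight must be between 0 and 1")
    # An empty AdditionalVersionWeights mapping removes any existing traffic split.
    weights = {canary_version: canary_weight} if canary_weight > 0 else {}
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": weights},
    }


# 95% of traffic stays on version 12, 5% goes to version 13.
params = canary_alias_params("order-created-handler", "prod", "12", "13", 0.05)
```

The result would be applied with boto3.client("lambda").update_alias(**params). Promoting means calling it again with "13" as the stable version and a weight of 0; rolling back means keeping "12" with a weight of 0.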
Rollback Strategy
A rollback should be fast and predictable. If version 13 fails, move the production alias back to version 12.
Rule of thumb: every production Lambda should have a deployment and rollback strategy, not only a deploy button.
Testing Lambda Functions
Lambda functions should be tested like any other backend code. The handler should be thin, and business logic should be testable without invoking AWS for every unit test.
Unit Tests
Move business logic into normal functions or classes and test them directly.
def calculate_total(order):
    return sum(item["price"] * item["quantity"] for item in order["items"])

def lambda_handler(event, context):
    total = calculate_total(event["order"])
    return {
        "total": total
    }
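Because calculate_total is a plain function, it can be tested without any AWS dependency. A minimal sketch using the standard unittest module follows; the function is repeated so the example is self-contained, and the order values are made up.

```python
import unittest


def calculate_total(order):
    return sum(item["price"] * item["quantity"] for item in order["items"])


class CalculateTotalTest(unittest.TestCase):
    def test_sums_price_times_quantity(self):
        order = {"items": [{"price": 10, "quantity": 2}, {"price": 5, "quantity": 1}]}
        self.assertEqual(calculate_total(order), 25)

    def test_empty_order_totals_zero(self):
        self.assertEqual(calculate_total({"items": []}), 0)
```

Run it with python -m unittest. For real monetary amounts, integer cents or Decimal values avoid floating-point rounding surprises.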
Integration Tests
Integration tests should validate real interactions with AWS services such as DynamoDB, SQS, S3, or API Gateway. These tests catch IAM, event format, serialization, and infrastructure problems.
Local Testing
Local testing is useful for fast feedback, but it does not perfectly reproduce AWS behavior. Always validate important workflows in a real AWS environment before production.
Production Validation
After deployment, validate logs, metrics, alarms, traces, DLQ behavior, and business outcomes. A function that deploys successfully can still fail at runtime because of permissions, event shape changes, or downstream dependencies.
Production Readiness Checklist
- Function has one clear responsibility.
- Handler is idempotent.
- Event source is appropriate for the workload.
- Timeout is configured intentionally.
- Memory is tested with realistic payloads.
- SDK clients are reused when appropriate.
- Database connections are managed safely.
- Retries and failure handling are understood.
- Dead-letter queue or failure destination exists where needed.
- Reserved concurrency protects downstream systems when necessary.
- IAM permissions follow least privilege.
- Secrets are stored in Secrets Manager or Parameter Store.
- Logs are structured and useful.
- Metrics and alarms exist for errors, throttles, duration, and DLQ depth.
- Tracing or correlation IDs exist for distributed flows.
- Deployment uses versions, aliases, or safe rollout strategy.
- Rollback path is known.
- Unit and integration tests cover critical behavior.
- Cost alarms or usage monitoring are configured.
Common Production Mistakes
- Putting too much logic into one Lambda function.
- Assuming events are delivered exactly once.
- Opening database connections on every invocation.
- Using Lambda for long-running workloads that need another compute model (a single invocation is capped at 15 minutes).
- Ignoring downstream limits.
- Using broad IAM permissions.
- Hardcoding secrets in code or environment variables.
- Not configuring DLQs or failure destinations.
- Logging too little to debug production issues.
- Logging sensitive data accidentally.
- Deploying without rollback strategy.
- Optimizing without measuring real metrics.
Conclusion
Production AWS Lambda is not just about serverless code. It is about designing reliable event-driven systems with safe retries, clear ownership, correct permissions, strong observability, controlled concurrency, and predictable deployment.
The best Lambda functions are usually small, stateless, idempotent, observable, and secure by default. They use the right event source, protect downstream systems, handle errors intentionally, and expose enough metrics and logs to debug real production incidents.
Key takeaway: a Lambda function is production-ready when you understand how it is triggered, how it fails, how it retries, how it scales, how it is secured, how it is monitored, and how it can be rolled back safely.