AWS Lambda Best Practices for Production

By Oleksandr Andrushchenko

AWS Lambda is easy to start with, but production Lambda systems require more than writing a handler function. You need to think about idempotency, retries, timeouts, IAM permissions, database connections, observability, deployment strategy, and downstream limits.

This article covers practical AWS Lambda best practices for production. The goal is not to list every AWS feature, but to explain how to build Lambda functions that are reliable, secure, observable, cost-aware, and easier to maintain over time.

Write Small, Single-Purpose Functions

A good production Lambda function should have a clear responsibility. It should be easy to understand what event it handles, what side effects it performs, and what success or failure means.

One Responsibility per Function

One responsibility per function does not mean every function must contain only ten lines of code. It means the function should represent one clear unit of work.

Good Function Boundary | Bad Function Boundary
Process one SQS order message | Process orders, send reports, sync users, and update analytics
Resize uploaded image | Handle every possible S3 file workflow in one function
Validate webhook signature and enqueue event | Receive webhook, process payment, send email, update CRM, and generate PDF

Avoid Monolithic Lambdas

A monolithic Lambda usually starts as a convenient shortcut and later becomes hard to test, deploy, observe, and debug. If one function handles too many unrelated event types, every change becomes risky.

# Bad: one function handles unrelated workflows
def lambda_handler(event, context):
    if event["type"] == "user_created":
        create_user_profile(event)

    elif event["type"] == "order_created":
        process_order(event)

    elif event["type"] == "image_uploaded":
        resize_image(event)

    elif event["type"] == "daily_report":
        generate_report(event)

Better:
user-created-handler
order-created-handler
image-uploaded-handler
daily-report-handler

Rule of thumb: if a Lambda function needs many unrelated if event_type == ... branches, it may be doing too much.

Design for Idempotency

Idempotency means the same event can be processed more than once without producing incorrect results. In production Lambda systems, this is not optional. Retries, duplicate deliveries, client retries, and network failures can all cause the same logical event to appear more than once.

Why Duplicate Events Happen

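  • Queue retries: an SQS message that is not deleted becomes visible again after its visibility timeout and is delivered again. Standard queues are also at-least-once by design.
  • Async retries: Lambda retries failed asynchronous invocations, twice by default.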
  • Client retries: API clients may resend requests after timeouts.
  • Stream retries: a failed batch may be retried from the stream.
  • Network uncertainty: a caller may not know whether a previous request succeeded.

Safe Retry Strategies

Retries are useful only when repeating the operation is safe. Retrying a temporary network failure is good. Retrying a payment charge without idempotency is dangerous.

Operation | Retry Risk | Required Protection
Read user profile | Low | Normal retry
Send email | Medium | Message ID or send log
Charge payment | High | Idempotency key
Create order | High | Unique request ID or conditional write

Idempotency Keys

An idempotency key is a unique identifier for one logical operation. Before performing the side effect, the function checks whether the operation was already processed.

def process_payment(event):
    payment_id = event["paymentId"]

    if payment_already_processed(payment_id):
        return {
            "status": "already_processed"
        }

    charge_customer(event)
    mark_payment_processed(payment_id)

    return {
        "status": "processed"
    }

Rule of thumb: every production Lambda that performs external side effects should be designed as if the same event may arrive twice.
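
The check-then-act flow above leaves a small window in which two concurrent deliveries can both pass the check. One way to close it is to claim the key atomically before the side effect runs. The sketch below uses an in-memory store as a stand-in for a table that supports conditional writes; IdempotencyStore, claim, release, and the charges list are hypothetical names used only for illustration.

```python
import threading


class IdempotencyStore:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._keys = set()
        self._lock = threading.Lock()

    def claim(self, key):
        """Atomically record the key. Return True only for the first caller."""
        with self._lock:
            if key in self._keys:
                return False
            self._keys.add(key)
            return True

    def release(self, key):
        """Forget the key so a later retry can claim it again."""
        with self._lock:
            self._keys.discard(key)


store = IdempotencyStore()
charges = []  # records side effects so the example can be inspected


def charge_customer(event):
    # Hypothetical side effect; a real function would call a payment API.
    charges.append(event["paymentId"])


def process_payment(event):
    payment_id = event["paymentId"]

    # Claim first: only one delivery of the same event gets past this line.
    if not store.claim(payment_id):
        return {"status": "already_processed"}

    try:
        charge_customer(event)
    except Exception:
        # Release the claim so a retry can attempt the charge again.
        store.release(payment_id)
        raise

    return {"status": "processed"}
```

Releasing the claim on failure is only safe when the side effect itself is known not to have happened. For payments it usually helps to pass the same key to the payment provider as well, so that an ambiguous failure cannot produce a second charge.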

Choose the Right Event Source

The event source defines how Lambda receives data, how retries work, whether batching exists, how failures are handled, and how the function scales. Choosing the wrong event source creates production problems that cannot be solved inside the handler code alone.

HTTP APIs

Use API Gateway, Function URLs, or Application Load Balancer when the function must respond to an HTTP request.

  • API Gateway: production APIs, routing, auth, throttling, custom domains.
  • Function URLs: simple single-function HTTP endpoints.
  • ALB: hybrid architectures where Lambda is one target behind a load balancer.

Queues

Use SQS when work can be processed asynchronously and should survive temporary failures. Queues are useful for background jobs, buffering, retry handling, and protecting downstream systems.

Streams

Use Kinesis, DynamoDB Streams, or Kafka when records arrive as an ordered stream and must be processed in batches.

Scheduled Jobs

Use EventBridge Scheduler for cron-like tasks such as cleanup jobs, reports, synchronization, and periodic checks.

Workflows

Use Step Functions when the business process has multiple steps, branches, retries, waits, or compensation logic.

Need | Recommended Service
REST API | API Gateway
Simple webhook | Function URL
Background job | SQS
Broadcast event | SNS or EventBridge
File processing | S3 Event
Database change reaction | DynamoDB Streams
Multi-step workflow | Step Functions

Keep Functions Stateless

Lambda functions should be designed as stateless units of work. The execution environment may be reused, but it can also disappear at any time. Do not treat local memory or local files as the source of truth.

Do Not Depend on Local Files

Lambda provides temporary storage in /tmp (512 MB by default, configurable up to 10 GB), but it is not durable application storage. It is useful for temporary files, downloads, generated reports, or intermediate processing.

def lambda_handler(event, context):
    temp_path = "/tmp/report.csv"

    generate_report(temp_path)
    upload_to_s3(temp_path)

    return {
        "status": "uploaded"
    }

Important: /tmp is temporary. Store durable data in S3, DynamoDB, RDS, ElastiCache, or another external system.
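
Because a warm execution environment keeps its /tmp contents, a fixed file name can pick up leftovers from an earlier invocation and slowly fill the ephemeral storage. A minimal sketch of one way to avoid that, assuming a hypothetical upload callback that wraps the real S3 upload:

```python
import csv
import os
import tempfile


def build_report(rows, upload):
    """Write rows to a unique temp file, hand it to `upload`, then delete it."""
    # mkstemp creates a uniquely named file, so a reused execution
    # environment never picks up a leftover file from an earlier invocation.
    fd, path = tempfile.mkstemp(suffix=".csv", dir=tempfile.gettempdir())
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        upload(path)  # hypothetical callback, e.g. a wrapper around an S3 upload
    finally:
        # Ephemeral storage is limited and survives warm invocations.
        os.remove(path)
    return path
```

Inside Lambda, tempfile.gettempdir() resolves to /tmp, so the same code also runs unchanged on a developer machine.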

Reuse Execution Environment Carefully

Warm invocations may reuse global variables, clients, and cached configuration. This is useful for performance, but dangerous if you store request-specific data globally.

# Good: reusable client
import boto3

s3_client = boto3.client("s3")

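def lambda_handler(event, context):
    # Return plain names: the raw response contains datetime values,
    # which Lambda cannot serialize to JSON.
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response["Buckets"]]

# Bad: request-specific mutable state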
current_user_id = None

def lambda_handler(event, context):
    global current_user_id
    current_user_id = event["userId"]

    return process_user(current_user_id)

Store State Externally

  • S3: files, reports, exports, images, documents.
  • DynamoDB: key-value access, metadata, idempotency records.
  • RDS / Aurora: relational data and transactions.
  • ElastiCache: low-latency cached data.
  • SQS: durable pending work.

Rule of thumb: use global state for reusable infrastructure clients, not for business state.

Optimize Initialization

Initialization affects cold starts. Code outside the handler runs when Lambda creates a new execution environment. Keep that code small, useful, and predictable.

Reuse SDK Clients

Create AWS SDK clients outside the handler so they can be reused during warm invocations.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

def lambda_handler(event, context):
    response = table.get_item(
        Key={"id": event["userId"]}
    )

    return response.get("Item")

Lazy Loading

If a heavy dependency is used only in rare cases, load it only when needed.

def lambda_handler(event, context):
    if event.get("generatePdf"):
        import reportlab
        return generate_pdf(reportlab, event)

    return {
        "message": "No PDF needed"
    }

Reduce Package Size

Large packages increase deployment complexity and can increase cold start time. Remove unused dependencies, tests, documentation, local caches, and development-only files.

Common package bloat:
- tests
- docs
- local virtual environments
- unused libraries
- example files
- development tools
- large generated artifacts

Avoid Heavy Frameworks When Not Needed

A simple Lambda function does not always need a full web framework. Use the simplest structure that solves the problem.

Situation | Good Choice
Single webhook | Plain Lambda handler
Few simple background jobs | Plain handlers
Many routes and shared middleware | Small framework may help
Existing large application | Framework may reduce migration effort

Manage Database Connections Correctly

Database connections are one of the most common production problems with Lambda. Lambda can scale quickly, but relational databases have connection limits.

Reuse Connections

Reuse database connections carefully across warm invocations. Always handle stale or closed connections.

import os
import psycopg2

connection = None

def get_connection():
    global connection

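    # `closed` only reflects a connection closed on the client side. A
    # connection dropped by the server still raises on the next query.
    if connection is None or connection.closed:
        connection = psycopg2.connect(
            host=os.environ["DB_HOST"],
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            # For brevity only; load real credentials from a secret store.
            password=os.environ["DB_PASSWORD"]
        )
        # Autocommit avoids leaving a transaction open between invocations.
        connection.autocommit = True

    return connection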

def lambda_handler(event, context):
    conn = get_connection()

    with conn.cursor() as cursor:
        cursor.execute("SELECT now()")
        row = cursor.fetchone()

    return {
        "databaseTime": str(row[0])
    }

Use RDS Proxy

RDS Proxy helps pool and manage database connections between Lambda and relational databases such as RDS or Aurora.

Problem:
1,000 Lambda invocations
  -> 1,000 direct database connections
  -> database connection exhaustion

Better:
1,000 Lambda invocations
  -> RDS Proxy
  -> managed database connection pool

Protect Downstream Databases

  • Limit concurrency for functions that write to relational databases.
  • Use SQS to buffer write-heavy workloads.
  • Keep transactions short.
  • Avoid opening new connections for every invocation.
  • Use DynamoDB when the access pattern fits key-value or document-style reads/writes.

Handle Errors Properly

Error handling depends heavily on the event source. An API request, an SQS message, and a stream record should not all be handled the same way.

Retries

Retries are useful for temporary failures, but dangerous for permanent failures. Invalid input will not become valid after ten retries.

Error Type | Retry? | Example
Temporary network issue | Yes, with backoff | Timeout calling external API
Rate limit | Yes, carefully | HTTP 429
Invalid input | No | Missing required field
Business rule failure | Usually no | Payment method rejected
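
The distinction in the table can be expressed in code by giving temporary and permanent failures different exception types and retrying only the first kind. This is a minimal sketch under that assumption; TransientError, PermanentError, and call_with_retries are illustrative names, not part of any AWS SDK.

```python
import random
import time


class TransientError(Exception):
    """Failure that may succeed on retry, such as a timeout or HTTP 429."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, such as invalid input."""


def call_with_retries(operation, max_attempts=4, base_delay=0.2, sleep=time.sleep):
    """Run `operation`, retrying only TransientError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the event source or DLQ take over
            # Full jitter: wait a random time up to the exponential cap.
            cap = base_delay * (2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
        # PermanentError is not caught, so it fails immediately.
```

The sleep parameter exists so tests can skip the real delay. In a Lambda function, keep the total retry time well below the function timeout.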

Dead Letter Queues

A dead-letter queue stores messages that failed repeatedly. This is useful for debugging and manual recovery.

SQS Queue
  -> Lambda Worker
      -> success: message deleted
      -> failure: retry
      -> repeated failure: move to DLQ

Lambda Destinations

Lambda Destinations can route successful or failed asynchronous invocation results to another service. This is useful when you need to react to success or failure outcomes.

Poison Messages

A poison message is a message that always fails. If not handled, it can be retried repeatedly and block useful processing.

  • Validate messages early.
  • Separate temporary and permanent errors.
  • Use partial batch failures when supported.
  • Send bad messages to a DLQ.
  • Log enough context to debug the message later.
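
For SQS event sources, partial batch failures work by returning the IDs of the messages that failed, so that only those are retried. A minimal sketch, assuming the event source mapping has ReportBatchItemFailures enabled and using a hypothetical process_order function:

```python
import json


def process_order(order):
    # Hypothetical business logic: reject messages without an order ID.
    if "orderId" not in order:
        raise ValueError("orderId is required")


def lambda_handler(event, context):
    failures = []

    for record in event["Records"]:
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Report only this message; the others in the batch are deleted.
            failures.append({"itemIdentifier": record["messageId"]})

    return {"batchItemFailures": failures}
```

Without this response shape, one bad message fails the whole batch and every message in it is retried. A message that keeps failing still needs a redrive policy on the queue to reach the DLQ.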

Control Concurrency

Lambda can scale quickly. That is useful, but it can also overload databases, APIs, queues, and legacy systems. Production Lambda systems should control concurrency intentionally.

Reserved Concurrency

Reserved concurrency can reserve capacity for a function and also limit its maximum concurrency. This is useful for protecting downstream systems.

Without limit:
SQS has 50,000 messages
  -> Lambda scales aggressively
  -> database becomes overloaded

With reserved concurrency:
SQS has 50,000 messages
  -> Lambda processes at controlled speed
  -> database remains healthy
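
One way to pick the limit is to work backwards from the database connection budget. The sketch below is an assumption-laden example: the 50% headroom and both function names are hypothetical, and lambda_client is expected to be a boto3 Lambda client passed in by the caller.

```python
def reserved_concurrency_for(db_max_connections, connections_per_environment=1, headroom=0.5):
    """Return a concurrency cap that uses at most `headroom` of the DB connections."""
    budget = int(db_max_connections * headroom)
    return max(1, budget // connections_per_environment)


def apply_limit(lambda_client, function_name, limit):
    # `lambda_client` is expected to be boto3.client("lambda").
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

For a database that allows 200 connections, this sketch caps the function at 100 concurrent executions and leaves the rest for other clients and administrative access.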

Provisioned Concurrency

Provisioned concurrency keeps execution environments initialized and ready. This is useful for latency-sensitive APIs where cold starts are not acceptable.

Use Reserved Concurrency | Use Provisioned Concurrency
To limit or reserve capacity | To reduce cold starts
To protect downstream systems | To improve latency predictability
For queue workers and database writers | For important synchronous APIs

Protect External Systems

  • Use queues to buffer spikes.
  • Set reserved concurrency for database-heavy functions.
  • Use backoff for external API retries.
  • Set timeouts so slow downstream calls do not consume all function time.
  • Use circuit-breaker behavior for repeated downstream failures.
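
The timeout point can be made concrete with the context object, which reports how much time the invocation has left. A minimal sketch, where the 500 ms reserve and the 5 second ceiling are assumed values and downstream_timeout_seconds is a hypothetical helper:

```python
def downstream_timeout_seconds(context, reserve_ms=500, max_seconds=5.0):
    """Pick a timeout for a downstream call that fits in the remaining time."""
    remaining_ms = context.get_remaining_time_in_millis()
    # Keep `reserve_ms` back for logging, cleanup, and returning a response.
    budget = (remaining_ms - reserve_ms) / 1000
    if budget <= 0:
        raise TimeoutError("not enough time left to call the downstream service")
    return min(budget, max_seconds)
```

The returned value can be passed as the timeout of an HTTP or database client, so the function fails with a clear error instead of being killed by the Lambda timeout in the middle of a call.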

Secure Your Lambda Functions

Production Lambda functions should follow the same security principles as any other backend system: least privilege, secure secret handling, controlled network access, and safe input validation.

Least-Privilege IAM

Give each Lambda function only the permissions it needs. Avoid broad permissions such as * unless there is a strong reason.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::example-bucket/uploads/*"
      ]
    }
  ]
}

Rule of thumb: permissions should be specific to the service, action, and resource.

Secrets Manager and Parameter Store

Do not hardcode secrets in source code. Use AWS Secrets Manager or SSM Parameter Store for passwords, API keys, tokens, and sensitive configuration.

import boto3
import json

secrets_client = boto3.client("secretsmanager")

def get_secret(secret_id):
    response = secrets_client.get_secret_value(
        SecretId=secret_id
    )

    return json.loads(response["SecretString"])
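
Fetching the secret on every invocation adds latency and API calls. A common pattern is to cache it in the execution environment for a limited time. The sketch below assumes a five-minute TTL and takes the fetch function as a parameter; get_cached_secret and _cache are hypothetical names.

```python
import time

_cache = {}  # secret_id -> (expires_at, value); lives as long as the environment


def get_cached_secret(secret_id, fetch, ttl_seconds=300, now=time.monotonic):
    """Return a cached secret, calling `fetch(secret_id)` only when it expired."""
    entry = _cache.get(secret_id)
    if entry and entry[0] > now():
        return entry[1]

    value = fetch(secret_id)  # e.g. the get_secret function above
    _cache[secret_id] = (now() + ttl_seconds, value)
    return value
```

A short TTL keeps rotated secrets from being served for long. If a cached credential is rejected by the database or API, clearing the cache entry and fetching again is a reasonable recovery step.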

Environment Variables

Environment variables are useful for non-secret configuration such as table names, bucket names, feature flags, and API endpoints.

import os

TABLE_NAME = os.environ["TABLE_NAME"]

def lambda_handler(event, context):
    return {
        "table": TABLE_NAME
    }

Important: environment variables are configuration, not a replacement for a secret management strategy.

VPC Considerations

Put Lambda in a VPC only when it needs private network access, such as private RDS, ElastiCache, or internal services. VPC configuration adds networking complexity and requires correct subnet, security group, and NAT design for outbound internet access.

Build for Observability

Production Lambda systems are distributed by nature. One user action can pass through API Gateway, Lambda, SQS, another Lambda, DynamoDB, and EventBridge. Without good observability, debugging becomes painful.

Structured Logging

Prefer structured JSON logs over free-form text messages. Include request IDs, event IDs, user IDs when safe, correlation IDs, and important business context.

import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    logger.info(json.dumps({
        "message": "Processing event",
        "requestId": context.aws_request_id,
        "eventType": event.get("type"),
        "orderId": event.get("orderId")
    }))

    return {
        "status": "ok"
    }

CloudWatch Metrics

Track metrics that show whether the function is healthy and whether the system is falling behind.

Metric | Why It Matters
Errors | Function failures
Duration | Performance and cost
Throttles | Concurrency limits
ConcurrentExecutions | Scaling behavior
IteratorAge | Stream consumer lag
Dead-letter queue depth | Failed async processing
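
The metrics in the table are emitted by AWS. Business-level metrics, such as failed orders, have to come from the function itself. One option is the CloudWatch embedded metric format, where a specially shaped JSON log line is turned into a metric. This is a minimal sketch: the Orders namespace, the OrdersFailed metric, and the Service dimension are made-up names.

```python
import json
import time


def emf_record(namespace, metric, value, unit="Count", dimensions=None):
    """Build one CloudWatch embedded metric format (EMF) log record."""
    dimensions = dimensions or {}
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        metric: value,
    }
    record.update(dimensions)  # dimension values live at the top level
    return record


def lambda_handler(event, context):
    # Printing the JSON line to stdout is enough inside Lambda.
    record = emf_record("Orders", "OrdersFailed", 1, dimensions={"Service": "checkout"})
    print(json.dumps(record))
    return {"status": "ok"}
```

Compared with calling the metrics API directly, this keeps metric publishing off the request path, since the record is just another log line.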

Tracing

Use tracing when requests cross multiple services. Tracing helps identify slow database calls, external API bottlenecks, retries, and service-to-service latency.

Correlation IDs

A correlation ID connects logs from different services that belong to the same workflow.

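import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Reuse the caller's ID when present so logs across services line up.
    correlation_id = (
        event.get("correlationId")
        or context.aws_request_id
    )

    logger.info(json.dumps({
        "message": "Start processing",
        "correlationId": correlation_id
    }))

    return {"correlationId": correlation_id}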

Monitor Costs

Lambda can be inexpensive, but it can also become costly when functions run too long, retry too often, process unnecessary events, or trigger each other in loops.

Memory vs Duration

Lambda bills per request and for compute time in GB-seconds, which is configured memory multiplied by duration. More memory also brings proportionally more CPU and can reduce duration, so the cheapest configuration is not always the smallest memory setting.

Configuration | Duration | Result
Low memory | Long duration | May be slow and not actually cheaper
Balanced memory | Shorter duration | Often best trade-off
Too much memory | Small additional improvement | Diminishing returns
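
The trade-off in the table is easy to check with arithmetic. The sketch below uses assumed example prices and made-up duration measurements, so treat the numbers as an illustration and check current pricing for your region and architecture.

```python
GB_SECOND_PRICE = 0.0000166667    # assumed example price in USD per GB-second
REQUEST_PRICE = 0.20 / 1_000_000  # assumed example price in USD per request


def monthly_cost(memory_mb, avg_duration_ms, invocations):
    """Estimate compute plus request cost, ignoring any free tier."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * GB_SECOND_PRICE + invocations * REQUEST_PRICE


# Hypothetical measurements for the same function at three memory sizes.
for memory_mb, duration_ms in [(256, 1200), (1024, 280), (3008, 240)]:
    cost = monthly_cost(memory_mb, duration_ms, 10_000_000)
    print(memory_mb, "MB:", round(cost, 2), "USD")
```

With these made-up durations, the 1024 MB setting uses 0.28 GB-seconds per invocation against 0.30 at 256 MB, so it is both faster and slightly cheaper. The 3008 MB setting uses about 0.71 GB-seconds for only a small further speedup.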

Unused Invocations

Filter events before they invoke Lambda when possible. Use EventBridge rules, S3 prefix/suffix filters, and event source filtering for streams or queues.

Infinite Loops

Recursive triggers can create unexpected cost and system load.

Bad:
S3 upload -> Lambda -> writes to same S3 prefix -> Lambda runs again

Bad:
DynamoDB update -> Stream -> Lambda -> updates same item -> Stream runs again
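
For the S3 case, a cheap safeguard is to skip objects that the function itself produced. A minimal sketch, assuming the function writes its output under a hypothetical resized/ prefix:

```python
from urllib.parse import unquote_plus

OUTPUT_PREFIX = "resized/"  # hypothetical prefix this function writes to


def keys_to_process(event):
    """Return uploaded object keys, skipping objects this function produced."""
    keys = []
    for record in event.get("Records", []):
        # S3 event notifications URL-encode the object key.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(OUTPUT_PREFIX):
            continue  # our own output: processing it would trigger a loop
        keys.append(key)
    return keys
```

This guard is a second line of defense. Writing output to a separate bucket, or restricting the trigger to an input prefix, prevents the extra invocations from happening at all.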

Cost Alarms

Use billing alarms and service-level alarms to detect unusual usage early. Monitor invocations, errors, retries, DLQ size, and concurrency spikes.

Deploy Safely

Production Lambda deployment should support safe release, rollback, and validation. Do not treat every deployment as a manual overwrite of the current function.

Versions

Lambda versions are immutable snapshots of function code and configuration. They allow you to point production traffic to a specific known version.

Aliases

Aliases are named pointers to versions, such as dev, staging, or prod.

prod alias -> version 12
staging alias -> version 13

Canary Deployments

A canary deployment sends a small percentage of traffic to a new version before shifting all traffic.

95% traffic -> version 12
5% traffic  -> version 13

If metrics are healthy:
100% traffic -> version 13

Rollback Strategy

A rollback should be fast and predictable. If version 13 fails, move the production alias back to version 12.

Rule of thumb: every production Lambda should have a deployment and rollback strategy, not only a deploy button.
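
The canary, promotion, and rollback steps all come down to updating the alias. The sketch below shows one way to script them, assuming lambda_client is a boto3 Lambda client passed in by the caller; start_canary and point_alias_at are hypothetical helper names.

```python
def start_canary(lambda_client, function_name, alias, new_version, weight=0.05):
    """Send `weight` of the alias traffic to `new_version`; the rest stays put."""
    lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias,
        RoutingConfig={"AdditionalVersionWeights": {new_version: weight}},
    )


def point_alias_at(lambda_client, function_name, alias, version):
    """Send all alias traffic to `version` and clear any canary weights.

    Use it with the new version to promote, or the old version to roll back.
    """
    lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=version,
        RoutingConfig={"AdditionalVersionWeights": {}},
    )
```

For the example above, start_canary(client, "orders", "prod", "13") begins the 5% rollout, point_alias_at(client, "orders", "prod", "13") promotes it, and point_alias_at(client, "orders", "prod", "12") rolls back. The function name orders is made up for this example.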

Testing Lambda Functions

Lambda functions should be tested like any other backend code. The handler should be thin, and business logic should be testable without invoking AWS for every unit test.

Unit Tests

Move business logic into normal functions or classes and test them directly.

def calculate_total(order):
    return sum(item["price"] * item["quantity"] for item in order["items"])

def lambda_handler(event, context):
    total = calculate_total(event["order"])

    return {
        "total": total
    }
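
With the logic separated like this, a test needs no AWS resources or mocks. A minimal sketch using the standard unittest module, with calculate_total repeated so the example is self-contained:

```python
import unittest


def calculate_total(order):
    return sum(item["price"] * item["quantity"] for item in order["items"])


class CalculateTotalTest(unittest.TestCase):
    def test_sums_price_times_quantity(self):
        order = {"items": [{"price": 5, "quantity": 2}, {"price": 3, "quantity": 1}]}
        self.assertEqual(calculate_total(order), 13)

    def test_empty_order_is_zero(self):
        self.assertEqual(calculate_total({"items": []}), 0)
```

Saved in a test file, this runs with python -m unittest.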

Integration Tests

Integration tests should validate real interactions with AWS services such as DynamoDB, SQS, S3, or API Gateway. These tests catch IAM, event format, serialization, and infrastructure problems.

Local Testing

Local testing is useful for fast feedback, but it does not perfectly reproduce AWS behavior. Always validate important workflows in a real AWS environment before production.

Production Validation

After deployment, validate logs, metrics, alarms, traces, DLQ behavior, and business outcomes. A function that deploys successfully can still fail at runtime because of permissions, event shape changes, or downstream dependencies.

Production Readiness Checklist

  • Function has one clear responsibility.
  • Handler is idempotent.
  • Event source is appropriate for the workload.
  • Timeout is configured intentionally.
  • Memory is tested with realistic payloads.
  • SDK clients are reused when appropriate.
  • Database connections are managed safely.
  • Retries and failure handling are understood.
  • Dead-letter queue or failure destination exists where needed.
  • Reserved concurrency protects downstream systems when necessary.
  • IAM permissions follow least privilege.
  • Secrets are stored in Secrets Manager or Parameter Store.
  • Logs are structured and useful.
  • Metrics and alarms exist for errors, throttles, duration, and DLQ depth.
  • Tracing or correlation IDs exist for distributed flows.
  • Deployment uses versions, aliases, or safe rollout strategy.
  • Rollback path is known.
  • Unit and integration tests cover critical behavior.
  • Cost alarms or usage monitoring are configured.

Common Production Mistakes

  • Putting too much logic into one Lambda function.
  • Assuming events are delivered exactly once.
  • Opening database connections on every invocation.
  • Using Lambda for long-running workloads that need another compute model.
  • Ignoring downstream limits.
  • Using broad IAM permissions.
  • Hardcoding secrets in code or environment variables.
  • Not configuring DLQs or failure destinations.
  • Logging too little to debug production issues.
  • Logging sensitive data accidentally.
  • Deploying without rollback strategy.
  • Optimizing without measuring real metrics.

Conclusion

Production AWS Lambda is not just about serverless code. It is about designing reliable event-driven systems with safe retries, clear ownership, correct permissions, strong observability, controlled concurrency, and predictable deployment.

The best Lambda functions are usually small, stateless, idempotent, observable, and secure by default. They use the right event source, protect downstream systems, handle errors intentionally, and expose enough metrics and logs to debug real production incidents.

Key takeaway: a Lambda function is production-ready when you understand how it is triggered, how it fails, how it retries, how it scales, how it is secured, how it is monitored, and how it can be rolled back safely.
