AWS Lambda Concurrency and Scaling

By Oleksandr Andrushchenko — Published on Jun 30 — Modified on Jul 01

AWS Lambda concurrency is one of the most important concepts to understand before running serverless applications in production. Lambda can scale quickly, but that does not mean every downstream system can handle unlimited parallel requests.

This article explains how Lambda concurrency and scaling work, how to calculate concurrency, when to use reserved concurrency, when to use provisioned concurrency, how event sources scale differently, and how to protect databases, queues, APIs, and third-party systems from overload.

What Is AWS Lambda Concurrency?
How AWS Lambda Scales
Reserved Concurrency
Provisioned Concurrency
How Event Sources Affect Scaling
Concurrency and Batch Processing
Protecting Downstream Systems
Lambda Concurrency and Database Connections
Monitoring Lambda Concurrency
Common Concurrency Mistakes
Production Checklist
Conclusion

What Is AWS Lambda Concurrency?

Concurrency means how many Lambda invocations are running at the same time. If one function is processing 50 requests at the same moment, it has 50 concurrent executions.

This is different from total requests. A function can receive 10,000 requests per minute, but its concurrency depends on how long each request takes to finish.

Concurrency Definition

In Lambda, each concurrent invocation uses a separate execution environment. If requests arrive while all existing environments are busy, Lambda may create more environments until concurrency limits are reached.

Term	Meaning
Invocation	One execution of a Lambda function
Concurrency	Number of invocations running at the same time
Execution environment	Runtime environment used to run one or more invocations over time
Throttling	What happens when Lambda cannot accept more concurrent invocations

Concurrency Formula

A practical way to estimate Lambda concurrency is:

Concurrency = requests per second × average duration in seconds

This formula is simple but useful for planning: if your function receives more traffic or takes longer to run, concurrency increases. It gives an average, so bursty traffic can push real peaks higher than the estimate.

Simple Concurrency Example

Imagine an API Lambda receives 100 requests per second, and each request takes 500 ms.

100 requests per second × 0.5 seconds = 50 concurrent executions

If the same function becomes slower and now takes 2 seconds, concurrency increases:

100 requests per second × 2 seconds = 200 concurrent executions

Key takeaway: improving function duration reduces concurrency pressure.
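The arithmetic above can be wrapped in a small helper for capacity planning. This is a minimal sketch; the function name estimate_concurrency is an illustrative assumption and not part of any AWS SDK.

```python
def estimate_concurrency(requests_per_second: float, avg_duration_seconds: float) -> float:
    """Estimate average concurrent executions: requests/s x duration in seconds."""
    if requests_per_second < 0 or avg_duration_seconds < 0:
        raise ValueError("rate and duration must be non-negative")
    return requests_per_second * avg_duration_seconds


# The two scenarios from the example above: 500 ms and 2 s per request.
fast = estimate_concurrency(100, 0.5)
slow = estimate_concurrency(100, 2)
print(fast, slow)
```

Because the result is an average, it is reasonable to treat it as a lower bound and compare it with measured concurrency before setting limits.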

How AWS Lambda Scales

Lambda scales by creating more execution environments when more invocations arrive. This is one of Lambda's biggest benefits, but also one of the biggest production risks if downstream systems are not ready.

Execution Environments

Each active concurrent invocation needs an execution environment. When an invocation finishes, the environment may stay warm and be reused by a later invocation.

Request 1 -> Execution Environment A
Request 2 -> Execution Environment B
Request 3 -> Execution Environment C

If three requests are running at the same time, Lambda needs three concurrent environments.

Scaling Up

When more requests arrive, Lambda creates more execution environments. For asynchronous, queue, and stream-based workloads, scaling behavior also depends on the event source.

Workload	Scaling Driver
API Gateway	Incoming HTTP request rate and function duration
SQS	Queue depth, batch size, event source mapping, and concurrency settings
Kinesis	Shard count, records per shard, batch size, and processing duration
DynamoDB Streams	Stream shards, records, batch size, and processing duration
EventBridge / SNS	Published event volume and Lambda concurrency limits

Scaling Down

When traffic decreases, Lambda eventually removes unused execution environments. You do not pay for idle execution environments unless you use features such as provisioned concurrency.

Throttling

Throttling happens when Lambda cannot run more concurrent invocations because a concurrency limit has been reached. For synchronous invocations, Lambda rejects the request with a 429 error (TooManyRequestsException), which can surface as an error to the client unless the caller retries. For queues or streams, records may wait and be retried depending on the event source.

Traffic spike
  -> Lambda reaches concurrency limit
  -> Additional invocations are throttled
  -> Caller receives error or event source retries later

Rule of thumb: throttling is not always bad. Sometimes it is a deliberate safety mechanism to protect downstream systems.

Reserved Concurrency

Reserved concurrency sets both a reservation and a maximum concurrency limit for a function. It guarantees that the function has a certain amount of concurrency available, and it also prevents the function from scaling beyond that value.

Why Use Reserved Concurrency?

Protect downstream systems from too many parallel requests.
Reserve capacity for critical functions.
Prevent one noisy function from consuming all account concurrency.
Throttle intentionally instead of letting the whole system overload.

Reserved Concurrency Example

Imagine a Lambda function writes to a relational database that can safely handle only 50 concurrent writes. If the function can scale to 500 concurrent executions, the database may fail.

Without reserved concurrency:
500 Lambda executions -> database overload

With reserved concurrency = 50:
50 Lambda executions -> database stays healthy
remaining work waits, retries, or throttles

Setting Reserved Concurrency to Zero

Setting reserved concurrency to 0 stops a function from processing events, because every invocation is throttled. This can be useful as an emergency brake during incidents.

Incident:
Function is creating bad writes or recursive events

Emergency action:
Set reserved concurrency to 0

Result:
Every new invocation is throttled, so the function stops running

Rule of thumb: use reserved concurrency for safety and capacity control.
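Both the database limit and the emergency brake described above can be applied with the same API call. The sketch below assumes a boto3 Lambda client is passed in; the helper name set_reserved_concurrency and the function name "orders-db-writer" are hypothetical.

```python
def set_reserved_concurrency(lambda_client, function_name: str, limit: int) -> dict:
    """Reserve and cap concurrency for one function.

    `lambda_client` is assumed to be a boto3 Lambda client, for example
    boto3.client("lambda"). A limit of 0 throttles every invocation.
    """
    if limit < 0:
        raise ValueError("limit must be >= 0")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("lambda")
#   set_reserved_concurrency(client, "orders-db-writer", 50)  # protect the database
#   set_reserved_concurrency(client, "orders-db-writer", 0)   # emergency brake
```

To lift the limit again after an incident, the same client exposes delete_function_concurrency, which returns the function to the unreserved pool.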

Provisioned Concurrency

Provisioned concurrency keeps Lambda execution environments initialized and ready before requests arrive. It is mainly used to reduce cold starts for latency-sensitive workloads.

Why Use Provisioned Concurrency?

Reduce cold start latency for user-facing APIs.
Improve latency predictability for important synchronous workloads.
Pre-initialize heavy runtimes or frameworks.
Prepare for predictable traffic spikes such as business hours or scheduled launches.

Provisioned vs Reserved Concurrency

Feature	Reserved Concurrency	Provisioned Concurrency
Main purpose	Limit and reserve capacity	Keep environments initialized
Helps with cold starts	No direct cold start reduction	Yes
Protects downstream systems	Yes	Not by itself
Typical use case	Queue workers, database writers, critical functions	Latency-sensitive APIs

Provisioned Concurrency Example

If an API normally needs around 100 concurrent executions during peak time, provisioned concurrency can keep environments ready before traffic arrives.

Expected peak concurrency: 100
Provisioned concurrency: 110

Reason:
Add a buffer so environments are ready during normal peak traffic.

Rule of thumb: use provisioned concurrency for predictable low-latency APIs, not as a general scaling solution.
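The sizing example above (peak of 100 plus a buffer) can be sketched as follows. The helper names and the 10 percent default buffer are assumptions for illustration. Provisioned concurrency is configured on a published version or alias, so the call takes a qualifier rather than targeting $LATEST.

```python
import math


def provisioned_target(expected_peak: int, buffer_percent: int = 10) -> int:
    """Expected peak concurrency plus a percentage buffer, rounded up."""
    if expected_peak < 0 or buffer_percent < 0:
        raise ValueError("values must be non-negative")
    return expected_peak + math.ceil(expected_peak * buffer_percent / 100)


def apply_provisioned_concurrency(lambda_client, function_name: str,
                                  qualifier: str, target: int) -> dict:
    """Set provisioned concurrency on an alias or version (the qualifier).

    `lambda_client` is assumed to be a boto3 Lambda client.
    """
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=target,
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   target = provisioned_target(100)  # 100 peak + 10% buffer
#   apply_provisioned_concurrency(boto3.client("lambda"), "checkout-api", "live", target)
```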

How Event Sources Affect Scaling

Lambda scaling is not identical for every trigger. HTTP requests, queues, streams, and event buses invoke Lambda differently. Understanding the event source is essential for production design.

API Gateway

With API Gateway, each incoming request can invoke Lambda synchronously. If many users call the API at the same time, concurrency grows based on request rate and function duration.

API concurrency depends on:

requests per second
× average Lambda duration

Best practice: keep API Lambdas fast. Move slow work to SQS or Step Functions instead of making users wait.
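One way to keep an API Lambda fast is to accept the request, enqueue the slow work, and return immediately. The sketch below assumes an API Gateway proxy-style event with a "body" string and a boto3 SQS client passed in; make_handler and the queue URL are hypothetical.

```python
import json


def make_handler(sqs_client, queue_url: str):
    """Build a handler that enqueues the request body instead of processing it inline."""

    def lambda_handler(event, context):
        # Forward the raw request body; a separate worker Lambda consumes the queue.
        body = event.get("body") or "{}"
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
        # 202 Accepted: the work is queued, not finished.
        return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}

    return lambda_handler


# Hypothetical wiring at module load (requires boto3 and AWS credentials):
#   import boto3, os
#   lambda_handler = make_handler(boto3.client("sqs"), os.environ["QUEUE_URL"])
```

The API function's duration then depends mostly on one SQS call, which keeps its concurrency low even when the background work is slow.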

Amazon SQS

With SQS, Lambda polls messages from the queue and processes them in batches. Scaling depends on queue depth, batch size, processing duration, and event source mapping configuration.

def lambda_handler(event, context):
    for record in event["Records"]:
        process_message(record["body"])

    return {
        "processed": len(event["Records"])
    }

SQS is useful because it can buffer traffic spikes. Instead of forcing all work to happen immediately, messages can wait in the queue and be processed at a controlled rate.

Useful SQS Controls

Batch size: how many messages one invocation receives.
Maximum concurrency: event source-level limit for SQS processing.
Reserved concurrency: function-level concurrency limit.
Visibility timeout: how long a message stays hidden while being processed.
Dead-letter queue: where repeatedly failing messages go.
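Several of the controls listed above live on the event source mapping. The sketch below builds the arguments for boto3's create_event_source_mapping; the helper name and default values are assumptions, and the 2 to 1000 range check reflects the documented bounds for the SQS maximum concurrency setting.

```python
def sqs_mapping_config(queue_arn: str, function_name: str,
                       batch_size: int = 10, max_concurrency: int = 10) -> dict:
    """Build keyword arguments for create_event_source_mapping on an SQS queue."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("max_concurrency must be between 2 and 1000")
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,
        # Event source-level cap on concurrent invocations from this queue.
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
        # Lets the handler report only the failed messages in a batch.
        "FunctionResponseTypes": ["ReportBatchItemFailures"],
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   config = sqs_mapping_config("arn:aws:sqs:...:orders", "orders-worker")
#   boto3.client("lambda").create_event_source_mapping(**config)
```

Visibility timeout and the dead-letter queue are configured on the queue itself rather than on the mapping, so they are not part of this sketch.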

Kinesis Data Streams

With Kinesis, Lambda processes records from stream shards. Scaling is tied to shard count and stream processing behavior.

Kinesis stream
  -> shard 1 -> Lambda batches
  -> shard 2 -> Lambda batches
  -> shard 3 -> Lambda batches

If processing is slow or records fail repeatedly, stream lag can increase. For streams, one bad batch can block progress for affected shards until the issue is resolved or failure handling moves it forward.
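For capacity planning, the upper bound on stream-driven concurrency is the shard count multiplied by the mapping's parallelization factor (1 to 10). The sketch below also builds the failure-handling settings that let a bad batch move forward; the helper names and the default of 3 retries are assumptions.

```python
from typing import Optional


def max_stream_concurrency(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent invocations for one stream event source mapping."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return shard_count * parallelization_factor


def stream_failure_settings(max_retries: int = 3,
                            on_failure_arn: Optional[str] = None) -> dict:
    """Event source mapping settings that stop one bad batch blocking a shard forever."""
    settings = {
        # Split a failing batch in two and retry each half to isolate the bad record.
        "BisectBatchOnFunctionError": True,
        "MaximumRetryAttempts": max_retries,
    }
    if on_failure_arn:
        # Send details of discarded batches to an SQS queue or SNS topic.
        settings["DestinationConfig"] = {"OnFailure": {"Destination": on_failure_arn}}
    return settings


print(max_stream_concurrency(3))      # the three-shard stream sketched above
print(max_stream_concurrency(3, 10))  # same stream with the highest factor
```

These settings would be passed to create_event_source_mapping or update_event_source_mapping together with the stream ARN and function name.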

DynamoDB Streams

DynamoDB Streams work similarly to Kinesis: Lambda reads change records from the stream's shards and invokes your function with batches of records.

def lambda_handler(event, context):
    for record in event["Records"]:
        event_name = record["eventName"]

        if event_name == "INSERT":
            process_insert(record)

        elif event_name == "MODIFY":
            process_update(record)

        elif event_name == "REMOVE":
            process_delete(record)

    return {
        "processed": len(event["Records"])
    }

Important: avoid infinite loops where a stream-triggered Lambda writes back to the same table and triggers itself again.
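One possible guard against such loops is to mark items the function writes and skip stream records that carry the mark. The sketch below assumes the stream includes new images and that the function sets a hypothetical string attribute updatedBy on its own writes; both the attribute name and the marker value are illustrative.

```python
PROCESSOR_MARKER = "stream-processor"  # hypothetical value written by this function


def written_by_processor(record: dict) -> bool:
    """True if the change was made by this function (based on the assumed marker)."""
    new_image = record.get("dynamodb", {}).get("NewImage", {})
    # Stream images use DynamoDB's typed format, e.g. {"S": "text"} for strings.
    return new_image.get("updatedBy", {}).get("S") == PROCESSOR_MARKER


def records_to_process(event: dict) -> list:
    """Drop records that would cause the function to react to its own writes."""
    return [r for r in event.get("Records", []) if not written_by_processor(r)]
```

REMOVE records have no new image, so they always pass through this filter. A function that deletes items from the same table would need a different guard.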

EventBridge and SNS

EventBridge and SNS invoke Lambda asynchronously. They are useful for event-driven systems and fan-out patterns.

Service	Scaling Behavior	Common Risk
SNS	Fan-out to subscribers	Many subscribers can multiply downstream work
EventBridge	Routes events to matching targets	Poor event filtering can create unnecessary invocations

Rule of thumb: the event source is part of the scaling model. Do not tune Lambda without understanding how it is invoked.

Concurrency and Batch Processing

Batch processing changes how concurrency behaves. One Lambda invocation can process multiple records, so increasing batch size can increase throughput without increasing invocation count.

Batch Size	Benefit	Risk
Small batch	Simpler failure handling	More invocations
Medium batch	Balanced throughput and reliability	Needs monitoring
Large batch	Higher throughput	More memory usage and harder retries

Partial Batch Failures

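If one record fails, you usually do not want to retry the entire batch. Use partial batch failure handling when supported by the event source. The example below uses the SQS record shape (messageId) and only takes effect when ReportBatchItemFailures is enabled on the event source mapping.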

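def lambda_handler(event, context):
    # Collect identifiers of failed messages so only those are retried.
    failed_items = []

    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            # For SQS, the identifier is the record's messageId.
            failed_items.append({
                "itemIdentifier": record["messageId"]
            })

    # An empty list tells Lambda the whole batch succeeded.
    return {
        "batchItemFailures": failed_items
    }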

Rule of thumb: larger batches improve throughput, but only if your failure handling is correct.

Protecting Downstream Systems

The biggest Lambda scaling problem is usually not Lambda itself. It is what Lambda calls: databases, third-party APIs, internal services, queues, caches, and legacy systems.

Lambda can scale quickly.
Your database may not.
Your third-party API may not.
Your internal service may not.

Common Downstream Risks

Database connection exhaustion
API rate limits
Cache overload
Too many parallel writes
Legacy service saturation
Retry storms

Protection Strategies

Problem	Protection Strategy
Too many database connections	RDS Proxy, connection reuse, reserved concurrency
External API rate limits	SQS buffering, backoff, limited concurrency
Slow background processing	SQS, batch tuning, DLQ
Large event bursts	Queue buffering, event filtering, throttling
Retry storms	Backoff, circuit breaker behavior, DLQ
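Backoff appears in several rows of the table above. A minimal sketch of exponential backoff with full jitter follows; the helper names, the 0.2 second base, and the 10 second cap are assumptions, and a real caller would usually retry only on errors known to be transient.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 10.0) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(func, max_attempts: int = 4, sleep=time.sleep):
    """Call func(); on failure wait with backoff and retry, re-raising after the last attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Randomized delays stop many callers from retrying in lockstep.
            sleep(backoff_delay(attempt))
```

The sleep parameter is injectable so the helper can be tested without waiting. Inside a Lambda function, total retry time should stay well below the function timeout.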

Lambda Concurrency and Database Connections

Relational databases are one of the most common bottlenecks in Lambda applications. If each concurrent Lambda invocation opens a database connection, concurrency can quickly turn into connection exhaustion.

200 concurrent Lambda executions
× 1 database connection per invocation
= 200 database connections
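The same arithmetic can be run in reverse to derive a concurrency limit from the database's connection budget. This is a rough sketch; the helper name and the idea of holding back connections for other clients are assumptions.

```python
def max_safe_concurrency(db_max_connections: int,
                         connections_per_environment: int = 1,
                         reserved_for_other_clients: int = 0) -> int:
    """Largest Lambda concurrency that fits within the database connection budget."""
    if connections_per_environment < 1:
        raise ValueError("connections_per_environment must be at least 1")
    available = db_max_connections - reserved_for_other_clients
    return max(0, available // connections_per_environment)


# A database allowing 100 connections, with 20 held back for admin tools and other services.
print(max_safe_concurrency(100, reserved_for_other_clients=20))
```

The result is a candidate value for the function's reserved concurrency, to be checked against measured connection usage.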

Bad Pattern

import psycopg2
import os

def lambda_handler(event, context):
    connection = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"]
    )

    with connection.cursor() as cursor:
        cursor.execute("SELECT now()")
        row = cursor.fetchone()

    connection.close()

    return {
        "databaseTime": str(row[0])
    }

This creates a new connection for every invocation. Under high concurrency, it can overload the database.

Better Pattern

import psycopg2
import os

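# Created once per execution environment and reused across warm invocations.
connection = None

def get_connection():
    global connection

    # psycopg2 sets `closed` to a non-zero value once the connection is closed.
    if connection is None or connection.closed:
        connection = psycopg2.connect(
            host=os.environ["DB_HOST"],
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"]
        )
        # Avoid leaving a transaction open between invocations.
        connection.autocommit = True

    return connection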

def lambda_handler(event, context):
    conn = get_connection()

    with conn.cursor() as cursor:
        cursor.execute("SELECT now()")
        row = cursor.fetchone()

    return {
        "databaseTime": str(row[0])
    }

Production Pattern

Reuse connections across warm invocations when safe.
Use RDS Proxy for relational database connection pooling.
Limit concurrency for database-heavy functions.
Keep transactions short.
Use SQS to buffer write-heavy workloads.

Rule of thumb: Lambda concurrency must be designed around database connection limits.

Monitoring Lambda Concurrency

You cannot manage Lambda scaling safely without monitoring. CloudWatch metrics help you understand whether functions are scaling, throttling, falling behind, or overloading downstream systems.

Metric	What It Shows
ConcurrentExecutions	How many invocations are running at the same time
UnreservedConcurrentExecutions	Concurrency used by functions without reserved concurrency
Throttles	How often invocations are rejected due to concurrency limits
Duration	How long invocations run; longer duration increases concurrency pressure
Errors	Function failures that may cause retries
IteratorAge	Whether stream processing is falling behind
ApproximateAgeOfOldestMessage	Whether SQS processing is falling behind

Use Alarms

Alarm on throttles for critical functions.
Alarm on DLQ depth for async and queue-based workloads.
Alarm on IteratorAge for streams.
Alarm on SQS oldest message age for queue workers.
Alarm on database connection usage for Lambda + RDS workloads.
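As an illustration of the first alarm in the list, the sketch below builds the arguments for CloudWatch's put_metric_alarm on the Lambda Throttles metric. The helper name, the alarm naming scheme, and the choice to alarm on any throttle within one minute are assumptions to adjust per workload.

```python
def throttle_alarm_config(function_name: str, sns_topic_arn: str) -> dict:
    """Build keyword arguments for a CloudWatch alarm on Lambda throttles."""
    return {
        "AlarmName": f"{function_name}-throttles",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,              # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points simply means no throttles occurred.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   config = throttle_alarm_config("orders-worker", "arn:aws:sns:...:alerts")
#   boto3.client("cloudwatch").put_metric_alarm(**config)
```

For functions where throttling is deliberate, a higher threshold or a longer evaluation window may be more appropriate than alarming on the first throttle.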

Common Concurrency Mistakes

Assuming Lambda scaling means the whole system can scale.
Opening a new database connection per invocation.
Not setting reserved concurrency for database-heavy functions.
Using provisioned concurrency when the real problem is downstream overload.
Ignoring queue age and stream lag.
Using large batches without partial failure handling.
Letting one noisy function consume all available concurrency.
Not monitoring throttles.
Retrying failures too aggressively.

Production Checklist

Estimate concurrency using request rate and average duration.
Measure real concurrency with CloudWatch metrics.
Use reserved concurrency to protect downstream systems.
Use provisioned concurrency only when cold start latency matters.
Keep API Lambdas fast and move slow work to queues.
Tune SQS batch size and visibility timeout.
Use partial batch failures for supported batch event sources.
Watch stream lag for Kinesis and DynamoDB Streams.
Reuse database connections carefully.
Use RDS Proxy for relational database workloads.
Set alarms for throttles, errors, queue age, stream age, and DLQ depth.
Protect external APIs with rate limiting, backoff, queues, or concurrency limits.

Conclusion

AWS Lambda concurrency and scaling are powerful, but they must be controlled intentionally. Lambda can create many parallel execution environments, but your database, cache, queue, third-party API, or internal service may not be able to handle the same level of traffic.

Use reserved concurrency to limit and protect, provisioned concurrency to reduce cold starts, SQS to buffer spikes, batch tuning to improve throughput, and CloudWatch metrics to understand real behavior.

Key takeaway: good Lambda scaling is not unlimited scaling. Good Lambda scaling means processing work fast enough while protecting downstream systems, controlling retries, monitoring lag, and keeping the whole architecture stable.
