AWS Lambda Concurrency and Scaling
By Oleksandr Andrushchenko
AWS Lambda concurrency is one of the most important concepts to understand before running serverless applications in production. Lambda can scale quickly, but that does not mean every downstream system can handle unlimited parallel requests.
This article explains how Lambda concurrency and scaling work, how to calculate concurrency, when to use reserved concurrency, when to use provisioned concurrency, how event sources scale differently, and how to protect databases, queues, APIs, and third-party systems from overload.
Table of Contents
- What Is AWS Lambda Concurrency?
- How AWS Lambda Scales
- Reserved Concurrency
- Provisioned Concurrency
- How Event Sources Affect Scaling
- Concurrency and Batch Processing
- Protecting Downstream Systems
- Lambda Concurrency and Database Connections
- Monitoring Lambda Concurrency
- Common Concurrency Mistakes
- Production Checklist
- Conclusion
What Is AWS Lambda Concurrency?
Concurrency is the number of Lambda invocations running at the same time. If one function is processing 50 requests at the same moment, it has 50 concurrent executions.
This is different from total requests. A function can receive 10,000 requests per minute, but its concurrency depends on how long each request takes to finish.
Concurrency Definition
In Lambda, each concurrent invocation uses a separate execution environment. If requests arrive while all existing environments are busy, Lambda may create more environments until concurrency limits are reached.
| Term | Meaning |
|---|---|
| Invocation | One execution of a Lambda function |
| Concurrency | Number of invocations running at the same time |
| Execution environment | Runtime environment used to run one or more invocations over time |
| Throttling | Rejection of an invocation because a concurrency limit has been reached |
Concurrency Formula
A practical way to estimate Lambda concurrency is:
Concurrency = requests per second × average duration in seconds
This formula is simple, but very useful for planning. If your function receives more traffic or takes longer to run, concurrency increases.
Simple Concurrency Example
Imagine an API Lambda receives 100 requests per second, and each request takes 500 ms.
100 requests per second × 0.5 seconds = 50 concurrent executions
If the same function becomes slower and now takes 2 seconds, concurrency increases:
100 requests per second × 2 seconds = 200 concurrent executions
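The two calculations above can be reproduced with a small helper. This is a minimal sketch for planning purposes; the function name estimate_concurrency is an illustrative choice, and the result is a steady-state average, not a guarantee about peaks.

```python
def estimate_concurrency(requests_per_second: float, avg_duration_seconds: float) -> float:
    """Estimate steady-state concurrency as arrival rate times average duration."""
    if requests_per_second < 0 or avg_duration_seconds < 0:
        raise ValueError("rate and duration must be non-negative")
    return requests_per_second * avg_duration_seconds


# The two scenarios from the text: 500 ms and 2 s average duration.
fast = estimate_concurrency(100, 0.5)
slow = estimate_concurrency(100, 2.0)
print(f"fast: {fast:.0f} concurrent, slow: {slow:.0f} concurrent")
```

Real traffic is bursty, so it is usually worth comparing this estimate with measured concurrency before relying on it.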
Key takeaway: reducing function duration reduces concurrency pressure.
How AWS Lambda Scales
Lambda scales by creating more execution environments when more invocations arrive. This is one of Lambda's biggest benefits, but also one of the biggest production risks if downstream systems are not ready.
Execution Environments
Each active concurrent invocation needs an execution environment. When an invocation finishes, the environment may stay warm and be reused by a later invocation.
Request 1 -> Execution Environment A
Request 2 -> Execution Environment B
Request 3 -> Execution Environment C
If three requests are running at the same time, Lambda needs three concurrent environments.
Scaling Up
When more requests arrive, Lambda creates more execution environments. For asynchronous, queue, and stream-based workloads, scaling behavior also depends on the event source.
| Workload | Scaling Driver |
|---|---|
| API Gateway | Incoming HTTP request rate and function duration |
| SQS | Queue depth, batch size, event source mapping, and concurrency settings |
| Kinesis | Shard count, records per shard, batch size, and processing duration |
| DynamoDB Streams | Stream shards, records, batch size, and processing duration |
| EventBridge / SNS | Published event volume and Lambda concurrency limits |
Scaling Down
When traffic decreases, Lambda eventually removes unused execution environments. You do not pay for idle on-demand execution environments. Provisioned concurrency is the exception: it is billed for as long as it is configured, whether or not it serves requests.
Throttling
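Throttling happens when Lambda cannot run more concurrent invocations because a concurrency limit has been reached. For synchronous invocations, Lambda rejects the request with a throttling error (HTTP 429, TooManyRequestsException), which typically reaches the client as a failed request. For asynchronous invocations, queues, and streams, the event is usually kept and retried later; the exact behavior depends on the event source.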
Traffic spike
-> Lambda reaches concurrency limit
-> Additional invocations are throttled
-> Caller receives error or event source retries later
Rule of thumb: throttling is not always bad. Sometimes it is a deliberate safety mechanism to protect downstream systems.
Reserved Concurrency
Reserved concurrency sets both a reservation and a maximum concurrency limit for a function. It guarantees that the function has a certain amount of concurrency available, and it also prevents the function from scaling beyond that value.
Why Use Reserved Concurrency?
- Protect downstream systems from too many parallel requests.
- Reserve capacity for critical functions.
- Prevent one noisy function from consuming all account concurrency.
- Throttle intentionally instead of letting the whole system overload.
Reserved Concurrency Example
Imagine a Lambda function writes to a relational database that can safely handle only 50 concurrent writes. If the function can scale to 500 concurrent executions, the database may fail.
Without reserved concurrency:
500 Lambda executions -> database overload
With reserved concurrency = 50:
50 Lambda executions -> database stays healthy
remaining work waits, retries, or throttles
Setting Reserved Concurrency to Zero
Setting reserved concurrency to 0 throttles every invocation, so the function stops processing events. This can be useful as an emergency brake during incidents.
Incident:
Function is creating bad writes or recursive events
Emergency action:
Set reserved concurrency to 0
Result:
Every new invocation is throttled until the limit is raised or removed
Rule of thumb: use reserved concurrency for safety and capacity control.
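As a rough sketch, reserved concurrency can be set from Python with the boto3 Lambda client calls put_function_concurrency and delete_function_concurrency. The helper names and the function name "orders-db-writer" are hypothetical, and the client is passed in so the helpers can be exercised without AWS access.

```python
def set_reserved_concurrency(lambda_client, function_name: str, limit: int) -> dict:
    """Reserve and cap concurrency for one function. A limit of 0 throttles everything."""
    if limit < 0:
        raise ValueError("limit must be >= 0")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


def clear_reserved_concurrency(lambda_client, function_name: str) -> dict:
    """Remove the limit so the function uses the unreserved account pool again."""
    return lambda_client.delete_function_concurrency(FunctionName=function_name)


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   client = boto3.client("lambda")
#   set_reserved_concurrency(client, "orders-db-writer", 50)  # protect the database
#   set_reserved_concurrency(client, "orders-db-writer", 0)   # emergency brake
#   clear_reserved_concurrency(client, "orders-db-writer")    # back to normal
```

Passing the client as a parameter is a design choice for testability; calling boto3 directly inside the helpers would work equally well.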
Provisioned Concurrency
Provisioned concurrency keeps Lambda execution environments initialized and ready before requests arrive. It is mainly used to reduce cold starts for latency-sensitive workloads.
Why Use Provisioned Concurrency?
- Reduce cold start latency for user-facing APIs.
- Improve latency predictability for important synchronous workloads.
- Pre-initialize heavy runtimes or frameworks.
- Prepare for predictable traffic spikes such as business hours or scheduled launches.
Provisioned vs Reserved Concurrency
| Feature | Reserved Concurrency | Provisioned Concurrency |
|---|---|---|
| Main purpose | Limit and reserve capacity | Keep environments initialized |
| Helps with cold starts | No direct cold start reduction | Yes |
| Protects downstream systems | Yes | Not by itself |
| Typical use case | Queue workers, database writers, critical functions | Latency-sensitive APIs |
Provisioned Concurrency Example
If an API normally needs around 100 concurrent executions during peak time, provisioned concurrency can keep environments ready before traffic arrives.
Expected peak concurrency: 100
Provisioned concurrency: 110
Reason:
Add a buffer so environments are ready during normal peak traffic.
Rule of thumb: use provisioned concurrency for predictable low-latency APIs, not as a general scaling solution.
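The sizing example above (peak of 100 plus a buffer, giving 110) can be sketched as follows. The helper names and the 10 percent default are assumptions for illustration. The boto3 call used is put_provisioned_concurrency_config, which targets a published version or alias (the Qualifier) rather than $LATEST.

```python
def provisioned_target(expected_peak: int, buffer_percent: int = 10) -> int:
    """Expected peak plus a percentage buffer, rounded up, using integer math."""
    if expected_peak < 0 or buffer_percent < 0:
        raise ValueError("inputs must be non-negative")
    # Integer ceiling division avoids floating-point surprises such as 110.00000000000001.
    return expected_peak + (expected_peak * buffer_percent + 99) // 100


def apply_provisioned_concurrency(lambda_client, function_name: str, qualifier: str,
                                  expected_peak: int, buffer_percent: int = 10) -> int:
    """Configure provisioned concurrency on a version or alias and return the value used."""
    target = provisioned_target(expected_peak, buffer_percent)
    lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=target,
    )
    return target


# Usage, assuming boto3 and a hypothetical alias named "live":
#   import boto3
#   apply_provisioned_concurrency(boto3.client("lambda"), "checkout-api", "live", 100)
```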
How Event Sources Affect Scaling
Lambda scaling is not identical for every trigger. HTTP requests, queues, streams, and event buses invoke Lambda differently. Understanding the event source is essential for production design.
API Gateway
With API Gateway, each incoming request can invoke Lambda synchronously. If many users call the API at the same time, concurrency grows based on request rate and function duration.
API concurrency depends on:
requests per second
× average Lambda duration
Best practice: keep API Lambdas fast. Move slow work to SQS or Step Functions instead of making users wait.
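One way to keep an API Lambda fast is to enqueue the slow work and return immediately. The sketch below assumes an API Gateway proxy-style event with a body field and uses the boto3 SQS call send_message. The factory name make_enqueue_handler and the queue URL are hypothetical.

```python
import json


def make_enqueue_handler(sqs_client, queue_url: str):
    """Build a handler that hands the request body to SQS and answers 202 Accepted."""
    def lambda_handler(event, context):
        # The body can be missing or None on requests without a payload.
        body = event.get("body") or "{}"
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)
        return {
            "statusCode": 202,
            "body": json.dumps({"status": "accepted"}),
        }
    return lambda_handler


# Usage, assuming boto3 and an existing queue (URL is a placeholder):
#   import boto3
#   lambda_handler = make_enqueue_handler(boto3.client("sqs"), "https://sqs.example/queue-url")
```

The client is created once and captured by the handler, so warm invocations reuse it instead of building a new client per request.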
Amazon SQS
With SQS, Lambda polls messages from the queue and processes them in batches. Scaling depends on queue depth, batch size, processing duration, and event source mapping configuration.
def lambda_handler(event, context):
    for record in event["Records"]:
        process_message(record["body"])
    return {
        "processed": len(event["Records"])
    }
SQS is useful because it can buffer traffic spikes. Instead of forcing all work to happen immediately, messages can wait in the queue and be processed at a controlled rate.
Useful SQS Controls
- Batch size: how many messages one invocation receives.
- Maximum concurrency: event source-level limit for SQS processing.
- Reserved concurrency: function-level concurrency limit.
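- Visibility timeout: how long a message stays hidden while being processed. It should be comfortably longer than the function timeout (AWS recommends at least six times the function timeout), otherwise messages can reappear while they are still being processed.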
- Dead-letter queue: where repeatedly failing messages go.
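The controls above map onto event source mapping settings. The sketch below builds the keyword arguments for the boto3 Lambda call create_event_source_mapping. The helper names and default values are assumptions, and the 2 to 1000 range check reflects the documented bounds for the SQS maximum concurrency setting.

```python
def sqs_mapping_settings(queue_arn: str, function_name: str,
                         batch_size: int = 10, max_concurrency: int = 50) -> dict:
    """Keyword arguments for create_event_source_mapping on an SQS queue."""
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("SQS maximum concurrency must be between 2 and 1000")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,
        # Event source-level cap on concurrent invocations driven by this queue.
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
        # Lets the handler report individual failed messages instead of the whole batch.
        "FunctionResponseTypes": ["ReportBatchItemFailures"],
    }


def suggested_visibility_timeout(function_timeout_seconds: int, factor: int = 6) -> int:
    """Queue visibility timeout as a multiple of the function timeout."""
    return function_timeout_seconds * factor


# Usage, assuming boto3 and placeholder identifiers:
#   import boto3
#   settings = sqs_mapping_settings("arn:aws:sqs:...", "queue-worker")
#   boto3.client("lambda").create_event_source_mapping(**settings)
```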
Kinesis Data Streams
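With Kinesis, Lambda processes records from stream shards. By default it runs one concurrent invocation per shard, so concurrency is bounded by shard count rather than by raw record volume. The parallelization factor of the event source mapping can raise this to as many as 10 concurrent batches per shard.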
Kinesis stream
-> shard 1 -> Lambda batches
-> shard 2 -> Lambda batches
-> shard 3 -> Lambda batches
If processing is slow or records fail repeatedly, stream lag can increase. For streams, one bad batch can block progress for affected shards until the issue is resolved or failure handling moves it forward.
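For a single event source mapping, the upper bound on stream-driven concurrency can be estimated from shard count. This sketch assumes the default of one concurrent batch per shard and the optional ParallelizationFactor setting (1 to 10). The function name is illustrative.

```python
def max_stream_concurrency(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent invocations for one stream event source mapping."""
    if shard_count < 0:
        raise ValueError("shard_count must be non-negative")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return shard_count * parallelization_factor


# Three shards as in the diagram above, first with defaults, then with a factor of 4.
print(max_stream_concurrency(3))
print(max_stream_concurrency(3, parallelization_factor=4))
```

This is a ceiling, not a forecast: actual concurrency also depends on how many shards have records waiting.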
DynamoDB Streams
DynamoDB Streams work similarly to Kinesis. Lambda reads changes from the stream and invokes your function with batches of records.
def lambda_handler(event, context):
    for record in event["Records"]:
        event_name = record["eventName"]
        if event_name == "INSERT":
            process_insert(record)
        elif event_name == "MODIFY":
            process_update(record)
        elif event_name == "REMOVE":
            process_delete(record)
    return {
        "processed": len(event["Records"])
    }
Important: avoid infinite loops where a stream-triggered Lambda writes back to the same table and triggers itself again.
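One possible guard against such loops is to ignore MODIFY events where only the attributes written by the function itself have changed. This is a sketch under assumptions: the stream must be configured to include both old and new images, and the attribute names processedAt and summary are hypothetical.

```python
# Hypothetical attributes that the stream-triggered function writes back to the table.
SELF_WRITTEN_ATTRIBUTES = frozenset({"processedAt", "summary"})


def has_relevant_change(record: dict, ignored=SELF_WRITTEN_ATTRIBUTES) -> bool:
    """Return False for MODIFY events that differ only in self-written attributes."""
    if record.get("eventName") != "MODIFY":
        return True
    images = record.get("dynamodb", {})
    old = {k: v for k, v in images.get("OldImage", {}).items() if k not in ignored}
    new = {k: v for k, v in images.get("NewImage", {}).items() if k not in ignored}
    return old != new


def relevant_records(event: dict) -> list:
    """Filter a stream batch down to the records worth processing."""
    return [r for r in event["Records"] if has_relevant_change(r)]
```

Other designs work too, such as writing results to a separate table or using event filtering on the event source mapping. The point is that the function's own writes must not look like new work.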
EventBridge and SNS
EventBridge and SNS invoke Lambda asynchronously. They are useful for event-driven systems and fan-out patterns.
| Service | Scaling Behavior | Common Risk |
|---|---|---|
| SNS | Fan-out to subscribers | Many subscribers can multiply downstream work |
| EventBridge | Routes events to matching targets | Poor event filtering can create unnecessary invocations |
Rule of thumb: the event source is part of the scaling model. Do not tune Lambda without understanding how it is invoked.
Concurrency and Batch Processing
Batch processing changes how concurrency behaves. One Lambda invocation can process multiple records, so increasing batch size can increase throughput without increasing invocation count.
| Batch Size | Benefit | Risk |
|---|---|---|
| Small batch | Simpler failure handling | More invocations |
| Medium batch | Balanced throughput and reliability | Needs monitoring |
| Large batch | Higher throughput | More memory usage and harder retries |
Partial Batch Failures
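If one record fails, you usually do not want to retry the entire batch. Use partial batch failure handling when supported by the event source. For SQS, Kinesis, and DynamoDB Streams this means enabling ReportBatchItemFailures on the event source mapping and returning the identifiers of the failed records.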
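def lambda_handler(event, context):
    # SQS example: each failed message is reported by its messageId.
    failed_items = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            failed_items.append({
                "itemIdentifier": record["messageId"]
            })
    # Only the listed messages are retried; the rest are deleted from the queue.
    return {
        "batchItemFailures": failed_items
    }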
Rule of thumb: larger batches improve throughput, but only if your failure handling is correct.
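For Kinesis, the same response shape is used, but the identifier is the record's sequence number and processing normally stops at the first failure so that ordering within the shard is preserved. The sketch below assumes ReportBatchItemFailures is enabled on the event source mapping. The function name handle_stream_batch and the injected process_record callable are illustrative.

```python
def handle_stream_batch(event: dict, process_record) -> dict:
    """Process a Kinesis batch in order and report the first failed record, if any."""
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            # Lambda retries from this sequence number onward; earlier records are done.
            return {
                "batchItemFailures": [
                    {"itemIdentifier": record["kinesis"]["sequenceNumber"]}
                ]
            }
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": []}


# Usage inside a Lambda module, with your own process_record function:
#   def lambda_handler(event, context):
#       return handle_stream_batch(event, process_record)
```

For DynamoDB Streams the sequence number is found under record["dynamodb"]["SequenceNumber"] instead.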
Protecting Downstream Systems
The biggest Lambda scaling problem is usually not Lambda itself. It is what Lambda calls: databases, third-party APIs, internal services, queues, caches, and legacy systems.
Lambda can scale quickly.
Your database may not.
Your third-party API may not.
Your internal service may not.
Common Downstream Risks
- Database connection exhaustion
- API rate limits
- Cache overload
- Too many parallel writes
- Legacy service saturation
- Retry storms
Protection Strategies
| Problem | Protection Strategy |
|---|---|
| Too many database connections | RDS Proxy, connection reuse, reserved concurrency |
| External API rate limits | SQS buffering, backoff, limited concurrency |
| Slow background processing | SQS, batch tuning, DLQ |
| Large event bursts | Queue buffering, event filtering, throttling |
| Retry storms | Backoff, circuit breaker behavior, DLQ |
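The backoff entry in the table can be illustrated with exponential backoff plus full jitter, a common way to keep many concurrent Lambdas from retrying in lockstep. This is a minimal sketch; the function names, the base of 0.5 seconds, and the 30 second cap are assumptions to tune per dependency.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Random delay between 0 and min(cap, base * 2**attempt) seconds (full jitter)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(func, max_attempts: int = 5, sleep=time.sleep):
    """Call func, retrying on any exception with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller or a DLQ handle it
            sleep(backoff_delay(attempt))
```

In a real function the except clause would usually be narrowed to retryable errors such as timeouts or rate-limit responses, and the total retry time kept below the function timeout.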
Lambda Concurrency and Database Connections
Relational databases are one of the most common bottlenecks in Lambda applications. If each concurrent Lambda invocation opens a database connection, concurrency can quickly turn into connection exhaustion.
200 concurrent Lambda executions
× 1 database connection per invocation
= 200 database connections
Bad Pattern
import psycopg2
import os

def lambda_handler(event, context):
    connection = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"]
    )
    with connection.cursor() as cursor:
        cursor.execute("SELECT now()")
        row = cursor.fetchone()
    connection.close()
    return {
        "databaseTime": str(row[0])
    }
This creates a new connection for every invocation. Under high concurrency, it can overload the database.
Better Pattern
import os

import psycopg2

# Created once per execution environment and reused across warm invocations.
connection = None

def get_connection():
    global connection
    # connection.closed only detects connections closed on the client side.
    if connection is None or connection.closed:
        connection = psycopg2.connect(
            host=os.environ["DB_HOST"],
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"]
        )
        # Avoid leaving a transaction open between invocations.
        connection.autocommit = True
    return connection

def lambda_handler(event, context):
    conn = get_connection()
    with conn.cursor() as cursor:
        cursor.execute("SELECT now()")
        row = cursor.fetchone()
    return {
        "databaseTime": str(row[0])
    }
Production Pattern
- Reuse connections across warm invocations when safe.
- Use RDS Proxy for relational database connection pooling.
- Limit concurrency for database-heavy functions.
- Keep transactions short.
- Use SQS to buffer write-heavy workloads.
Rule of thumb: Lambda concurrency must be designed around database connection limits.
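A simple way to apply this rule is to derive a concurrency limit from the database connection budget. The sketch below is a planning aid under stated assumptions: the function and parameter names are illustrative, and it assumes every execution environment holds a fixed number of connections with no pooling proxy in between.

```python
def max_safe_concurrency(db_max_connections: int,
                         reserved_for_other_clients: int = 0,
                         connections_per_environment: int = 1) -> int:
    """Largest Lambda concurrency that stays within the database connection budget."""
    if connections_per_environment < 1:
        raise ValueError("connections_per_environment must be at least 1")
    available = db_max_connections - reserved_for_other_clients
    if available <= 0:
        return 0
    return available // connections_per_environment


# A database allowing 200 connections, with 50 kept for admin tools and other services.
limit = max_safe_concurrency(200, reserved_for_other_clients=50)
print(f"reserved concurrency should not exceed {limit}")
```

The result is a natural candidate for the function's reserved concurrency value.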
Monitoring Lambda Concurrency
You cannot manage Lambda scaling safely without monitoring. CloudWatch metrics help you understand whether functions are scaling, throttling, falling behind, or overloading downstream systems.
| Metric | What It Shows |
|---|---|
| ConcurrentExecutions | How many invocations are running at the same time |
| UnreservedConcurrentExecutions | Concurrency used by functions without reserved concurrency |
| Throttles | How often invocations are rejected due to concurrency limits |
| Duration | How long invocations take; longer duration increases concurrency pressure |
| Errors | Function failures that may cause retries |
| IteratorAge | Whether stream processing is falling behind |
| ApproximateAgeOfOldestMessage | Whether SQS processing is falling behind |
Use Alarms
- Alarm on throttles for critical functions.
- Alarm on DLQ depth for async and queue-based workloads.
- Alarm on IteratorAge for streams.
- Alarm on SQS oldest message age for queue workers.
- Alarm on database connection usage for Lambda + RDS workloads.
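As an example of the first alarm in the list, the sketch below builds the keyword arguments for the boto3 CloudWatch call put_metric_alarm on the Throttles metric of one function. The helper name, the alarm naming scheme, and the thresholds are assumptions to adapt.

```python
def throttle_alarm_settings(function_name: str, threshold: int = 0,
                            period_seconds: int = 60) -> dict:
    """Keyword arguments for put_metric_alarm on a function's Throttles metric."""
    return {
        "AlarmName": f"{function_name}-throttles",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points means no throttles, so do not alarm on missing data.
        "TreatMissingData": "notBreaching",
    }


# Usage, assuming boto3 and a hypothetical function name:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm_settings("checkout-api"))
```

In practice an AlarmActions entry pointing at an SNS topic is usually added so that someone is notified.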
Common Concurrency Mistakes
- Assuming Lambda scaling means the whole system can scale.
- Opening a new database connection per invocation.
- Not setting reserved concurrency for database-heavy functions.
- Using provisioned concurrency when the real problem is downstream overload.
- Ignoring queue age and stream lag.
- Using large batches without partial failure handling.
- Letting one noisy function consume all available concurrency.
- Not monitoring throttles.
- Retrying failures too aggressively.
Production Checklist
- Estimate concurrency using request rate and average duration.
- Measure real concurrency with CloudWatch metrics.
- Use reserved concurrency to protect downstream systems.
- Use provisioned concurrency only when cold start latency matters.
- Keep API Lambdas fast and move slow work to queues.
- Tune SQS batch size and visibility timeout.
- Use partial batch failures for supported batch event sources.
- Watch stream lag for Kinesis and DynamoDB Streams.
- Reuse database connections carefully.
- Use RDS Proxy for relational database workloads.
- Set alarms for throttles, errors, queue age, stream age, and DLQ depth.
- Protect external APIs with rate limiting, backoff, queues, or concurrency limits.
Conclusion
AWS Lambda concurrency and scaling are powerful, but they must be controlled intentionally. Lambda can create many parallel execution environments, but your database, cache, queue, third-party API, or internal service may not be able to handle the same level of traffic.
Use reserved concurrency to limit and protect, provisioned concurrency to reduce cold starts, SQS to buffer spikes, batch tuning to improve throughput, and CloudWatch metrics to understand real behavior.
Key takeaway: good Lambda scaling is not unlimited scaling. Good Lambda scaling means processing work fast enough while protecting downstream systems, controlling retries, monitoring lag, and keeping the whole architecture stable.