Parallel Processing in AWS Lambda with Python: What Actually Works (And What Doesn’t)


TL;DR: We tried every “obvious” way to parallelize CPU-bound work in AWS Lambda — and most of them either did nothing or flat-out broke.

ThreadPoolExecutor helps I/O, not CPU. Standard multiprocessing crashes due to Lambda’s /dev/shm limitations. What actually works is lambda-multiprocessing plus the right memory allocation.


The Problem

We had a Lambda function processing ZIP files containing 80–150 complex XML documents. Each XML required schema validation, namespace resolution, and deep tree traversal, taking ~200ms to parse — resulting in 16–25 seconds of total processing time.

The obvious solution? Parallelize the parsing. But this turned out to be surprisingly tricky in AWS Lambda.

Attempt #1: ThreadPoolExecutor (Failed)

Our first instinct was to use Python’s ThreadPoolExecutor — it’s simple, built-in, and widely recommended.

from concurrent.futures import ThreadPoolExecutor  
  
def parse_all_xmls(xml_files):  
    with ThreadPoolExecutor(max_workers=4) as executor:  
        results = list(executor.map(parse_single_xml, xml_files))  
    return results

Result: No improvement whatsoever. Same 16–25 seconds.

Why It Failed: The GIL

Python’s Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. For CPU-bound work, threads simply take turns on the same core — no true parallelism.

What ThreadPoolExecutor actually does for CPU-bound work:

 ThreadPoolExecutor (CPU-bound)  
┌─────────────────────────────────┐  
│ Python Process (single GIL)     │  
│ ┌──────┐ ┌──────┐ ┌──────┐      │  
│ │ T1   │ │ T2   │ │ T3   │      │  
│ └──┬───┘ └──┬───┘ └──┬───┘      │  
│    └────────┼────────┘          │  
│             ▼                   │  
│            GIL                  │  
│   Only one runs at once         │  
└─────────────────────────────────┘

ThreadPoolExecutor does work well for I/O-bound tasks (S3, APIs, databases), where threads release the GIL while waiting on network responses. XML parsing is pure CPU work — the GIL is never released.
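For contrast, here’s the kind of job where ThreadPoolExecutor does shine. A minimal sketch of parallel S3 downloads (the bucket name and keys are placeholders, not from our system):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def fetch_object(key):
    # The thread releases the GIL while waiting on the network,
    # so downloads genuinely overlap.
    return s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

def fetch_all(keys):
    with ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(fetch_object, keys))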

Attempt #2: multiprocessing.Pool (Crashed)

Since threads didn’t help, we tried true multiprocessing:

from multiprocessing import Pool  
  
def parse_all_xmls(xml_files):  
    with Pool(processes=3) as pool:  
        results = pool.map(parse_single_xml, xml_files)  
    return results

Result: Crashes or hangs in Lambda.

Why It Failed: No /dev/shm

Standard Python multiprocessing backs Pool and Queue with locks in shared memory (/dev/shm). Lambda’s execution environment doesn’t provide a usable /dev/shm, so Pool creation crashes, hangs, or behaves unpredictably.

At this point, it was clear: parallelism in Lambda isn’t a Python problem — it’s an environment problem.
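The giveaway is that multiprocessing.Process and multiprocessing.Pipe still work in Lambda, since pipes don’t touch /dev/shm; only the shared-memory primitives break. As an illustration, here’s a minimal sketch of the pipe-based pattern (do_work stands in for any CPU-bound function; one process per item, no batching, illustration only):

from multiprocessing import Pipe, Process

def _worker(conn, payload):
    # Send the result back over the pipe instead of shared memory.
    conn.send(do_work(payload))
    conn.close()

def run_parallel(payloads):
    jobs = []
    for payload in payloads:
        parent_conn, child_conn = Pipe()
        proc = Process(target=_worker, args=(child_conn, payload))
        proc.start()
        jobs.append((parent_conn, proc))
    results = [conn.recv() for conn, _ in jobs]
    for _, proc in jobs:
        proc.join()
    return results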

Attempt #3: lambda-multiprocessing (Success!)

The [lambda-multiprocessing](https://pypi.org/project/lambda-multiprocessing/) library avoids shared memory by using pipes for inter-process communication:

from lambda_multiprocessing import Pool  
  
def _parse_single_xml(item):  
    """Must be module-level for pickling."""  
    filename, xml_content = item  
    return parse_xml(xml_content)  
  
def parse_all_xmls(xml_files):  
    with Pool(processes=3) as pool:  
        results = pool.map(_parse_single_xml, xml_files)  
    return results

Why multiprocessing bypasses the GIL:

 Multiprocessing (true parallelism)  
  
Process 1           Process 2           Process 3  
┌─────────────┐     ┌─────────────┐     ┌─────────────┐  
│ Python +    │     │ Python +    │     │ Python +    │  
│ GIL #1      │     │ GIL #2      │     │ GIL #3      │  
│ parse XML   │     │ parse XML   │     │ parse XML   │  
└─────┬───────┘     └─────┬───────┘     └─────┬───────┘  
      ▼                   ▼                   ▼  
    vCPU 1              vCPU 2              vCPU 3  
  
→ Each process has its own GIL  
→ True CPU parallelism

Important: If you’re processing fewer than ~5 items, multiprocessing is often slower due to process startup overhead. Parallelism only pays off once the workload is large enough.
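A minimal guard for that threshold, reusing the functions above (tune the cutoff against your own benchmarks):

from lambda_multiprocessing import Pool

SMALL_BATCH = 5  # below this, process startup outweighs the parallel win

def parse_all_xmls(xml_files):
    if len(xml_files) < SMALL_BATCH:
        return [_parse_single_xml(item) for item in xml_files]
    with Pool(processes=3) as pool:
        return pool.map(_parse_single_xml, xml_files)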

The Hidden Gotcha: Lambda Memory = vCPUs

Our first test with lambda-multiprocessing at 2048MB memory showed only 1.13× speedup — barely better than sequential. This was confusing at first: we had multiple worker processes, but almost no real performance gain.

The reason is simple but non-obvious: Lambda allocates vCPUs proportionally to memory (roughly one vCPU per 1,769MB, per AWS documentation).

What happens at 2048MB (≈1.15 vCPUs):

 2048MB Lambda (≈1.15 vCPUs)  
┌──────────────────────────────┐  
│ Single partial CPU           │  
│ ┌──────┐ ┌──────┐ ┌──────┐   │  
│ │ W1   │ │ W2   │ │ W3   │   │  
│ └──┬───┘ └──┬───┘ └──┬───┘   │  
│    └────────┴────────┘       │  
│        (Time-slicing)        │  
└──────────────────────────────┘

Real parallelism only starts around 4096MB, where Lambda provides ~2.3 vCPUs.
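If you’d rather not hardcode the worker count, Lambda exposes the configured memory through the AWS_LAMBDA_FUNCTION_MEMORY_SIZE environment variable, and AWS documents roughly one vCPU per 1,769MB. A sketch of a sizing heuristic built on that:

import math
import os

def estimate_workers():
    # AWS allocates roughly one vCPU per 1,769MB of configured memory.
    memory_mb = int(os.environ.get("AWS_LAMBDA_FUNCTION_MEMORY_SIZE", "1769"))
    return max(1, math.ceil(memory_mb / 1769))  # 4096MB -> ceil(2.32) = 3

Rounding up lets a partial vCPU keep a worker busy; at 4096MB this yields the same 3 workers we settled on.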

The Final Solution

4096MB memory + 3 workers consistently delivered real parallelism:

4096MB Lambda (≈2.3 vCPUs)  
┌──────────────────────────────┐  
│ vCPU 1        │ vCPU 2       │  
│ ┌──────┐      │ ┌──────┐     │  
│ │ W1   │      │ │ W2   │     │  
│ └──────┘      │ └──────┘     │  
│               │ ┌──────┐     │  
│               │ │ W3   │     │  
│               │ └──────┘     │  
└──────────────────────────────┘

This reduced total parsing time from 16–25s down to 7–11s.
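Putting it all together, here’s a sketch of the full handler. The event shape, bucket/key fields, and parse_xml are placeholders, not our production code:

import io
import zipfile

import boto3
from lambda_multiprocessing import Pool

s3 = boto3.client("s3")

def _parse_single_xml(item):
    filename, xml_content = item
    return parse_xml(xml_content)  # placeholder for the real parser

def handler(event, context):
    # Hypothetical event: {"bucket": "...", "key": "..."}
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    archive = zipfile.ZipFile(io.BytesIO(obj["Body"].read()))
    xml_files = [(name, archive.read(name)) for name in archive.namelist()]
    with Pool(processes=3) as pool:
        results = pool.map(_parse_single_xml, xml_files)
    return {"parsed": len(results)}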

Benchmark Tables

[Speedup table: image in the original post]

[Cost table: image in the original post]
Benchmarks validate our vCPU + worker strategy.

Summary: Decision Tree

Who this is for: Engineers running CPU-heavy workloads inside AWS Lambda who are surprised that “just add threads” didn’t work.

Is your Lambda task slow?  
│  
├─► I/O-bound?  
│   └─► ThreadPoolExecutor ✅  
│  
└─► CPU-bound?  
    ├─► <5 items → stay sequential  
    └─► 5+ items → lambda-multiprocessing + 4096MB+ memory

Key Takeaways

  1. ThreadPoolExecutor is for I/O, not CPU
  2. Standard multiprocessing breaks in Lambda
  3. lambda-multiprocessing works — with enough memory
  4. Memory size controls CPU parallelism
  5. Always benchmark; overhead matters

The real lesson: In Lambda, performance tuning is never just about Python. It’s about understanding how AWS slices CPU, memory, and process isolation — and designing with those constraints, not against them.

Have you run into similar surprises with parallelism in Lambda? I’d love to hear what worked — or didn’t — in the comments.