TL;DR: We tried every “obvious” way to parallelize CPU-bound work in AWS Lambda — and most of them either did nothing or flat-out broke.
ThreadPoolExecutor helps I/O, not CPU. Standard multiprocessing crashes due to Lambda's /dev/shm limitations. What actually works is lambda-multiprocessing plus the right memory allocation.
The Problem
We had a Lambda function processing ZIP files containing 80–150 complex XML documents. Each XML required schema validation, namespace resolution, and deep tree traversal, taking ~200ms to parse — resulting in 16–25 seconds of total processing time.
The obvious solution? Parallelize the parsing. But this turned out to be surprisingly tricky in AWS Lambda.
Attempt #1: ThreadPoolExecutor (Failed)
Our first instinct was to use Python’s ThreadPoolExecutor — it’s simple, built-in, and widely recommended.
```python
from concurrent.futures import ThreadPoolExecutor

def parse_all_xmls(xml_files):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(parse_single_xml, xml_files))
    return results
```
Result: No improvement whatsoever. Same 16–25 seconds.
Why It Failed: The GIL
Python’s Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. For CPU-bound work, threads simply take turns on the same core — no true parallelism.
What ThreadPoolExecutor actually does for CPU-bound work:
ThreadPoolExecutor (CPU-bound)
┌─────────────────────────────────┐
│ Python Process (single GIL) │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ T1 │ │ T2 │ │ T3 │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ └────────┼────────┘ │
│ ▼ │
│ GIL │
│ Only one runs at once │
└─────────────────────────────────┘
ThreadPoolExecutor does work well for I/O-bound tasks (S3, APIs, databases), where threads release the GIL while waiting on network responses. XML parsing is pure CPU work — the GIL is never released.
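For contrast, here is the kind of workload where ThreadPoolExecutor does shine. A minimal sketch of fetching objects from S3 in parallel (the bucket and key names are illustrative, not from our system):

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def fetch_object(key):
    # The thread blocks on the network here and releases the GIL,
    # so the other threads can issue their requests concurrently.
    return s3.get_object(Bucket="example-bucket", Key=key)["Body"].read()

def fetch_all(keys):
    with ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(fetch_object, keys))
```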
Attempt #2: multiprocessing.Pool (Crashed)
Since threads didn’t help, we tried true multiprocessing:
```python
from multiprocessing import Pool

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(parse_single_xml, xml_files)
    return results
```
Result: Crashes or hangs in Lambda.
Why It Failed: No /dev/shm
Standard Python multiprocessing backs its pools, queues, and locks with shared memory under /dev/shm. AWS Lambda's execution environment doesn't provide /dev/shm, so multiprocessing.Pool and multiprocessing.Queue fail outright: crashes, hangs, or unpredictable behavior.
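You don't need a full Pool to reproduce it. Even the smallest use of multiprocessing's semaphore-backed primitives fails, because those semaphores live in /dev/shm (a sketch; the exact error text can vary by runtime version):

```python
import multiprocessing

def handler(event, context):
    # Queue, like Pool, is backed by a semaphore in /dev/shm.
    # On Lambda this typically raises:
    #   OSError: [Errno 38] Function not implemented
    q = multiprocessing.Queue()
    q.put("hello")
    return q.get()
```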
At this point, it was clear: parallelism in Lambda isn’t a Python problem — it’s an environment problem.
Attempt #3: lambda-multiprocessing (Success!)
The [lambda-multiprocessing](https://pypi.org/project/lambda-multiprocessing/) library avoids shared memory by using pipes for inter-process communication:
```python
from lambda_multiprocessing import Pool

def _parse_single_xml(item):
    """Must be module-level for pickling."""
    filename, xml_content = item
    return parse_xml(xml_content)

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(_parse_single_xml, xml_files)
    return results
```
Why multiprocessing bypasses the GIL:
Multiprocessing (true parallelism)
Process 1 Process 2 Process 3
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Python + │ │ Python + │ │ Python + │
│ GIL #1 │ │ GIL #2 │ │ GIL #3 │
│ parse XML │ │ parse XML │ │ parse XML │
└─────┬───────┘ └─────┬───────┘ └─────┬───────┘
▼ ▼ ▼
vCPU 1 vCPU 2 vCPU 3
→ Each process has its own GIL
→ True CPU parallelism
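lambda-multiprocessing handles this for you, but if you'd rather not take a dependency, the same pipe-based pattern can be hand-rolled with the standard library, since multiprocessing.Process and Pipe don't touch /dev/shm. A minimal sketch, assuming the same parse_xml and (filename, xml_content) tuples as above:

```python
from multiprocessing import Pipe, Process

def _worker(conn, xml_content):
    # Send the result back over a pipe instead of a shared-memory queue.
    conn.send(parse_xml(xml_content))
    conn.close()

def parse_all_xmls(xml_files):
    results = []
    for i in range(0, len(xml_files), 3):  # batches of 3, matching our worker count
        jobs = []
        for _, xml_content in xml_files[i:i + 3]:
            parent_conn, child_conn = Pipe()
            process = Process(target=_worker, args=(child_conn, xml_content))
            process.start()
            jobs.append((process, parent_conn))
        for process, conn in jobs:
            results.append(conn.recv())  # recv before join, or large payloads can deadlock
            process.join()
    return results
```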
Important: If you’re processing fewer than ~5 items, multiprocessing is often slower due to process startup overhead. Parallelism only pays off once the workload is large enough.
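In practice that's a one-line guard. A sketch, treating our ~5-item break-even point as a tunable constant:

```python
from lambda_multiprocessing import Pool

PARALLEL_THRESHOLD = 5  # below this, process startup costs more than it saves

def parse_all_xmls(xml_files):
    if len(xml_files) < PARALLEL_THRESHOLD:
        return [_parse_single_xml(item) for item in xml_files]
    with Pool(processes=3) as pool:
        return pool.map(_parse_single_xml, xml_files)
```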
The Hidden Gotcha: Lambda Memory = vCPUs
Our first test with lambda-multiprocessing at 2048MB memory showed only 1.13× speedup — barely better than sequential. This was confusing at first: we had multiple worker processes, but almost no real performance gain.
The reason is simple but non-obvious: Lambda allocates vCPUs proportionally to memory.
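AWS documents the ratio as roughly one full vCPU per 1,769MB of memory, which is where the estimates below come from. A quick back-of-the-envelope helper (the function name is ours):

```python
def approx_vcpus(memory_mb):
    # AWS: at 1,769MB, a function has the equivalent of one vCPU
    return memory_mb / 1769

approx_vcpus(2048)  # ~1.16 vCPUs
approx_vcpus(4096)  # ~2.32 vCPUs
```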
What happens at 2048MB (≈1.15 vCPUs):
2048MB Lambda (≈1.15 vCPUs)
┌──────────────────────────────┐
│ Single partial CPU │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ W1 │ │ W2 │ │ W3 │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ └────────┴────────┘ │
│ (Time-slicing) │
└──────────────────────────────┘
Real parallelism only starts around 4096MB, where Lambda provides ~2.3 vCPUs.
The Final Solution
4096MB memory + 3 workers consistently delivered real parallelism:
4096MB Lambda (≈2.3 vCPUs)
┌──────────────────────────────┐
│ vCPU 1 │ vCPU 2 │
│ ┌──────┐ │ ┌──────┐ │
│ │ W1 │ │ │ W2 │ │
│ └──────┘ │ └──────┘ │
│ │ ┌──────┐ │
│ │ │ W3 │ │
│ │ └──────┘ │
└──────────────────────────────┘
This reduced total parsing time from 16–25s down to 7–11s.
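Memory is just a function-configuration setting, so the change is a single call (or one line in your IaC template). A sketch using boto3, with a placeholder function name:

```python
import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="xml-zip-processor",  # hypothetical name
    MemorySize=4096,                   # in MB; this also raises the vCPU allocation
)
```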
Benchmark Tables
[Speedup table]
[Cost table]
Benchmarks validate our vCPU + worker strategy.
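To generate numbers like these for your own workload, a minimal harness is enough. A sketch that times the sequential and parallel paths and logs the ratio:

```python
import time

def benchmark(xml_files):
    t0 = time.perf_counter()
    [_parse_single_xml(item) for item in xml_files]  # sequential baseline
    t1 = time.perf_counter()
    parse_all_xmls(xml_files)                        # parallel version
    t2 = time.perf_counter()
    print(f"sequential: {t1 - t0:.2f}s, parallel: {t2 - t1:.2f}s, "
          f"speedup: {(t1 - t0) / (t2 - t1):.2f}x")
```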
Summary: Decision Tree
Who this is for: Engineers running CPU-heavy workloads inside AWS Lambda who are surprised that “just add threads” didn’t work.
Is your Lambda task slow?
│
├─► I/O-bound?
│ └─► ThreadPoolExecutor ✅
│
└─► CPU-bound?
├─► <5 items → stay sequential
└─► 5+ items → lambda-multiprocessing + 4096MB+ memory
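As code, the same tree is a small dispatcher. A sketch, where is_io_bound is a flag you'd set per workload and worker_fn must be module-level so it can be pickled:

```python
def run(items, worker_fn, is_io_bound=False):
    if is_io_bound:
        from concurrent.futures import ThreadPoolExecutor
        with ThreadPoolExecutor(max_workers=8) as executor:
            return list(executor.map(worker_fn, items))
    if len(items) < 5:  # too few items: startup overhead beats parallelism
        return [worker_fn(item) for item in items]
    from lambda_multiprocessing import Pool  # remember: needs ~4096MB for real gains
    with Pool(processes=3) as pool:
        return pool.map(worker_fn, items)
```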
Key Takeaways
- ThreadPoolExecutor is for I/O, not CPU
- Standard multiprocessing breaks in Lambda
- lambda-multiprocessing works — with enough memory
- Memory size controls CPU parallelism
- Always benchmark; overhead matters
The real lesson: In Lambda, performance tuning is never just about Python. It’s about understanding how AWS slices CPU, memory, and process isolation — and designing with those constraints, not against them.
Have you run into similar surprises with parallelism in Lambda? I’d love to hear what worked — or didn’t — in the comments.