
## The Problem
We had a Lambda function processing ZIP files containing 80–150 complex XML documents. Each XML required schema validation, namespace resolution, and deep tree traversal, taking ~200ms to parse, resulting in 16–25 seconds of total processing time.
The obvious solution? Parallelize the parsing. But this turned out to be surprisingly tricky in AWS Lambda.
## Attempt #1: ThreadPoolExecutor (Failed)
Our first instinct was to use Python’s ThreadPoolExecutor: it’s simple, built-in, and widely recommended.
```python
from concurrent.futures import ThreadPoolExecutor

def parse_all_xmls(xml_files):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(parse_single_xml, xml_files))
    return results
```
Result: No improvement whatsoever. Same 16–25 seconds.
## Why It Failed: The GIL
Python’s Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. For CPU-bound work, threads simply take turns on the same core with no true parallelism.
What ThreadPoolExecutor actually does for CPU-bound work:
```text
ThreadPoolExecutor (CPU-bound)
┌─────────────────────────────────┐
│   Python Process (single GIL)   │
│  ┌──────┐  ┌──────┐  ┌──────┐   │
│  │  T1  │  │  T2  │  │  T3  │   │
│  └──┬───┘  └──┬───┘  └──┬───┘   │
│     └─────────┼─────────┘       │
│               ▼                 │
│              GIL                │
│     Only one runs at once       │
└─────────────────────────────────┘
```
ThreadPoolExecutor does work well for I/O-bound tasks (S3, APIs, databases), where threads release the GIL while waiting on network responses. XML parsing is pure CPU work, so the GIL is never released.
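You can see the GIL's effect with a minimal, self-contained benchmark. This is illustrative only: it uses a toy arithmetic loop standing in for our XML parsing, and the function names are ours.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n):
    # Pure-Python CPU work: holds the GIL for the entire loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_sequential(items):
    return [cpu_task(n) for n in items]

def run_threaded(items, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(cpu_task, items))

if __name__ == "__main__":
    items = [200_000] * 8

    t0 = time.perf_counter()
    run_sequential(items)
    seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    run_threaded(items)
    thr = time.perf_counter() - t0

    # On CPython the two timings come out roughly equal: the threads
    # take turns on the GIL instead of running in parallel.
    print(f"sequential: {seq:.3f}s  threaded: {thr:.3f}s")
```

Swap `cpu_task` for a `time.sleep` call (I/O-like work that releases the GIL) and the threaded version suddenly wins, which is exactly the CPU-bound vs I/O-bound split described above.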
## Attempt #2: multiprocessing.Pool (Crashed)
Since threads didn’t help, we tried true multiprocessing:
```python
from multiprocessing import Pool

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(parse_single_xml, xml_files)
    return results
```
Result: Crashes or hangs in Lambda.
## Why It Failed: No /dev/shm
Standard Python multiprocessing backs its queues and locks with POSIX semaphores that live in shared memory (`/dev/shm`). AWS Lambda does not provide a usable `/dev/shm`, so creating those primitives fails (typically with `OSError: [Errno 38] Function not implemented`), which shows up as crashes, hangs, or unpredictable behavior.
At this point, it was clear: parallelism in Lambda isn’t a Python problem; it’s an environment problem.
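One way to surface the problem early, instead of hitting it mid-request, is to probe for working multiprocessing primitives at startup. This is a sketch; the function name is ours.

```python
import multiprocessing

def standard_mp_available() -> bool:
    """Probe whether standard multiprocessing primitives work here.

    multiprocessing.Lock() is backed by a POSIX semaphore in shared
    memory; in environments without a usable /dev/shm (such as AWS
    Lambda), the call raises OSError instead of returning a lock.
    """
    try:
        multiprocessing.Lock()
        return True
    except OSError:
        return False
```

On a normal Linux box this returns `True`; inside the Lambda runtime it returns `False`, so a handler can fall back to sequential parsing rather than crash.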
## Attempt #3: lambda-multiprocessing (Success!)
The [lambda-multiprocessing](https://pypi.org/project/lambda-multiprocessing/) library avoids shared memory by using pipes for inter-process communication:
```python
from lambda_multiprocessing import Pool

def _parse_single_xml(item):
    """Must be module-level for pickling."""
    filename, xml_content = item
    return parse_xml(xml_content)

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(_parse_single_xml, xml_files)
    return results
```
Why multiprocessing bypasses the GIL:
```text
Multiprocessing (true parallelism)

 Process 1         Process 2         Process 3
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Python +   │   │  Python +   │   │  Python +   │
│   GIL #1    │   │   GIL #2    │   │   GIL #3    │
│  parse XML  │   │  parse XML  │   │  parse XML  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       ▼                 ▼                 ▼
    vCPU 1            vCPU 2            vCPU 3
```
→ Each process has its own GIL
→ True CPU parallelism
Important: If you’re processing fewer than ~5 items, multiprocessing is often slower due to process startup overhead. Parallelism only pays off once the workload is large enough.
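That rule of thumb is easy to encode in a small dispatcher. A sketch, with two assumptions of ours: the threshold of 5 comes from our workload, and the pool class is injected so the same code runs locally with `multiprocessing.Pool` and on Lambda with `lambda_multiprocessing.Pool` (both expose the same `map()` API).

```python
PARALLEL_THRESHOLD = 5  # below this, process startup overhead dominates

def parse_many(items, worker_fn, pool_cls=None, processes=3):
    """Stay sequential for tiny batches, fan out otherwise.

    pool_cls: lambda_multiprocessing.Pool on Lambda, or
    multiprocessing.Pool locally. worker_fn must be a module-level
    function so it can be pickled and sent to worker processes.
    """
    if pool_cls is None or len(items) < PARALLEL_THRESHOLD:
        return [worker_fn(x) for x in items]
    with pool_cls(processes=processes) as pool:
        return pool.map(worker_fn, items)
```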
## The Hidden Gotcha: Lambda Memory = vCPUs
Our first test with lambda-multiprocessing at 2048MB memory showed only 1.13x speedup, barely better than sequential. This was confusing at first: we had multiple worker processes, but almost no real performance gain.
The reason is simple but non-obvious: Lambda allocates vCPUs proportionally to memory.
What happens at 2048MB (≈1.15 vCPUs):
```text
2048MB Lambda (≈1.15 vCPUs)
┌──────────────────────────────┐
│      Single partial CPU      │
│  ┌──────┐ ┌──────┐ ┌──────┐  │
│  │  W1  │ │  W2  │ │  W3  │  │
│  └──┬───┘ └──┬───┘ └──┬───┘  │
│     └────────┴────────┘      │
│        (time-slicing)        │
└──────────────────────────────┘
```
Real parallelism only starts around 4096MB, where Lambda provides ~2.3 vCPUs.
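The proportionality can be made concrete: AWS documents roughly one full vCPU per 1,769MB of configured memory, which reproduces the estimates above. A sketch; the function name is ours.

```python
MB_PER_VCPU = 1769  # AWS: ~1 full vCPU per 1,769MB of configured memory

def estimated_vcpus(memory_mb: int) -> float:
    """Approximate vCPU share for a given Lambda memory setting."""
    return memory_mb / MB_PER_VCPU

print(f"{estimated_vcpus(2048):.2f}")  # ≈ 1.16
print(f"{estimated_vcpus(4096):.2f}")  # ≈ 2.32
```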
## The Final Solution
4096MB memory + 3 workers consistently delivered real parallelism:
```text
4096MB Lambda (≈2.3 vCPUs)
┌──────────────────────────────┐
│    vCPU 1    │    vCPU 2     │
│  ┌──────┐    │  ┌──────┐     │
│  │  W1  │    │  │  W2  │     │
│  └──────┘    │  └──────┘     │
│              │  ┌──────┐     │
│              │  │  W3  │     │
│              │  └──────┘     │
└──────────────────────────────┘
```
This reduced total parsing time from 16–25s down to 7–11s.
## Benchmark Tables
### Speedup Table
| Memory | Workers | Sequential | Parallel | Speedup |
|---|---|---|---|---|
| 2048MB | 2 | 16.19s | 14.32s | 1.13x |
| 4096MB | 2 | 16.14s | 8.19s | 1.97x |
| 4096MB | 3 | 16.15s | 7.12s | 2.27x |
| 4096MB | 3 | 20.03s (152 files) | 8.82s | 2.27x |
### Cost Table
| Config | Duration | Price per ms (relative) | Total cost (relative) |
|---|---|---|---|
| 2048MB | 16s | 1x | 16 units |
| 4096MB | 7s | 2x | 14 units |
The benchmarks confirm the strategy: speedup only appears once memory buys enough vCPUs, and 4096MB with 3 workers was our sweet spot, faster and cheaper than the 2048MB baseline.
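The cost units above are just memory × duration, since Lambda bills compute in GB-seconds. A quick sanity check of the table's arithmetic:

```python
def gb_seconds(memory_mb: int, duration_s: float) -> float:
    """Lambda compute cost scales with configured memory × duration."""
    return (memory_mb / 1024) * duration_s

# 2048MB × 16s = 32 GB-s vs 4096MB × 7s = 28 GB-s:
# doubling the memory more than pays for itself via the shorter run.
print(gb_seconds(2048, 16))  # 32.0
print(gb_seconds(4096, 7))   # 28.0
```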
## Summary: Decision Tree
Who this is for: Engineers running CPU-heavy workloads inside AWS Lambda who are surprised that “just add threads” didn’t work.
```text
Is your Lambda task slow?
│
├─► I/O-bound?
│     └─► ThreadPoolExecutor ✅
│
└─► CPU-bound?
      ├─► <5 items → stay sequential
      └─► 5+ items → lambda-multiprocessing + 4096MB+ memory
```
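The same tree, as a function you could drop into a handler. A sketch: the returned strings are labels for the strategies discussed above, not APIs, and the threshold of 5 is our rule of thumb.

```python
def choose_strategy(cpu_bound: bool, n_items: int, threshold: int = 5) -> str:
    """Encode the decision tree for parallelizing work in Lambda."""
    if not cpu_bound:
        # I/O-bound: threads release the GIL while waiting on the network.
        return "ThreadPoolExecutor"
    if n_items < threshold:
        # Too few items: process startup overhead outweighs the speedup.
        return "sequential"
    return "lambda-multiprocessing + 4096MB+ memory"
```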
## Key Takeaways
- ThreadPoolExecutor is for I/O, not CPU
- Standard multiprocessing breaks in Lambda
- lambda-multiprocessing works, with enough memory
- Memory size controls CPU parallelism
- Always benchmark; overhead matters
The real lesson: In Lambda, performance tuning is never just about Python. It’s about understanding how AWS slices CPU, memory, and process isolation, and designing with those constraints, not against them.
Have you run into similar surprises with parallelism in Lambda? I’d love to hear what worked (or didn’t) in the comments.