Parallel Processing in AWS Lambda with Python: What Actually Works (And What Doesn’t)


TL;DR: We tried every “obvious” way to parallelize CPU-bound work in AWS Lambda — and most of them either did nothing or flat-out broke.

ThreadPoolExecutor helps I/O, not CPU. Standard multiprocessing crashes due to Lambda’s /dev/shm limitations. What actually works is lambda-multiprocessing plus the right memory allocation.


The Problem

We had a Lambda function processing ZIP files containing 80–150 complex XML documents. Each XML required schema validation, namespace resolution, and deep tree traversal, taking ~200ms to parse — resulting in 16–25 seconds of total processing time.

The obvious solution? Parallelize the parsing. But this turned out to be surprisingly tricky in AWS Lambda.

Attempt #1: ThreadPoolExecutor (Failed)

Our first instinct was to use Python’s ThreadPoolExecutor — it’s simple, built-in, and widely recommended.

from concurrent.futures import ThreadPoolExecutor  
  
def parse_all_xmls(xml_files):  
    with ThreadPoolExecutor(max_workers=4) as executor:  
        results = list(executor.map(parse_single_xml, xml_files))  
    return results

Result: No improvement whatsoever. Same 16–25 seconds.

Why It Failed: The GIL

Python’s Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. For CPU-bound work, threads simply take turns on the same core — no true parallelism.

What ThreadPoolExecutor actually does for CPU-bound work:

 ThreadPoolExecutor (CPU-bound)  
┌─────────────────────────────────┐  
│ Python Process (single GIL)     │  
│ ┌──────┐ ┌──────┐ ┌──────┐      │  
│ │ T1   │ │ T2   │ │ T3   │      │  
│ └──┬───┘ └──┬───┘ └──┬───┘      │  
│    └────────┼────────┘          │  
│             ▼                   │  
│            GIL                  │  
│   Only one runs at once         │  
└─────────────────────────────────┘

ThreadPoolExecutor does work well for I/O-bound tasks (S3, APIs, databases), where threads release the GIL while waiting on network responses. XML parsing is pure CPU work — the GIL is never released.
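For contrast, here’s the kind of job where ThreadPoolExecutor does shine. A minimal sketch of parallel S3 downloads (the bucket name and keys are placeholders, not from our system):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def fetch_object(key):
    # The thread releases the GIL while waiting on the network,
    # so downloads genuinely overlap.
    return s3.get_object(Bucket="my-bucket", Key=key)["Body"].read()

def fetch_all(keys):
    with ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(fetch_object, keys))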

Attempt #2: multiprocessing.Pool (Crashed)

Since threads didn’t help, we tried true multiprocessing:

from multiprocessing import Pool  
  
def parse_all_xmls(xml_files):  
    with Pool(processes=3) as pool:  
        results = pool.map(parse_single_xml, xml_files)  
    return results

Result: Crashes or hangs in Lambda.

Why It Failed: No /dev/shm

Standard Python multiprocessing backs Pool and Queue with locks in shared memory (/dev/shm). Lambda’s execution environment doesn’t provide a usable /dev/shm, so Pool creation crashes, hangs, or behaves unpredictably.

At this point, it was clear: parallelism in Lambda isn’t a Python problem — it’s an environment problem.
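The giveaway is that multiprocessing.Process and multiprocessing.Pipe still work in Lambda, since pipes don’t touch /dev/shm; only the shared-memory primitives break. As an illustration, here’s a minimal sketch of the pipe-based pattern (do_work stands in for any CPU-bound function; one process per item, no batching, illustration only):

from multiprocessing import Pipe, Process

def _worker(conn, payload):
    # Send the result back over the pipe instead of shared memory.
    conn.send(do_work(payload))
    conn.close()

def run_parallel(payloads):
    jobs = []
    for payload in payloads:
        parent_conn, child_conn = Pipe()
        proc = Process(target=_worker, args=(child_conn, payload))
        proc.start()
        jobs.append((parent_conn, proc))
    results = [conn.recv() for conn, _ in jobs]
    for _, proc in jobs:
        proc.join()
    return results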

Attempt #3: lambda-multiprocessing (Success!)

The [lambda-multiprocessing](https://pypi.org/project/lambda-multiprocessing/) library avoids shared memory by using pipes for inter-process communication:

from lambda_multiprocessing import Pool  
  
def _parse_single_xml(item):  
    """Must be module-level for pickling."""  
    filename, xml_content = item  
    return parse_xml(xml_content)  
  
def parse_all_xmls(xml_files):  
    with Pool(processes=3) as pool:  
        results = pool.map(_parse_single_xml, xml_files)  
    return results

Why multiprocessing bypasses the GIL:

 Multiprocessing (true parallelism)  
  
Process 1           Process 2           Process 3  
┌─────────────┐     ┌─────────────┐     ┌─────────────┐  
│ Python +    │     │ Python +    │     │ Python +    │  
│ GIL #1      │     │ GIL #2      │     │ GIL #3      │  
│ parse XML   │     │ parse XML   │     │ parse XML   │  
└─────┬───────┘     └─────┬───────┘     └─────┬───────┘  
      ▼                   ▼                   ▼  
    vCPU 1              vCPU 2              vCPU 3  
  
→ Each process has its own GIL  
→ True CPU parallelism

Important: If you’re processing fewer than ~5 items, multiprocessing is often slower due to process startup overhead. Parallelism only pays off once the workload is large enough.
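A minimal guard for that threshold, reusing the functions above (tune the cutoff against your own benchmarks):

from lambda_multiprocessing import Pool

SMALL_BATCH = 5  # below this, process startup outweighs the parallel win

def parse_all_xmls(xml_files):
    if len(xml_files) < SMALL_BATCH:
        return [_parse_single_xml(item) for item in xml_files]
    with Pool(processes=3) as pool:
        return pool.map(_parse_single_xml, xml_files)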

The Hidden Gotcha: Lambda Memory = vCPUs

Our first test with lambda-multiprocessing at 2048MB memory showed only 1.13× speedup — barely better than sequential. This was confusing at first: we had multiple worker processes, but almost no real performance gain.

The reason is simple but non-obvious: Lambda allocates vCPUs proportionally to memory (roughly one vCPU per 1,769MB, per AWS documentation).

What happens at 2048MB (≈1.15 vCPUs):

 2048MB Lambda (≈1.15 vCPUs)  
┌──────────────────────────────┐  
│ Single partial CPU           │  
│ ┌──────┐ ┌──────┐ ┌──────┐   │  
│ │ W1   │ │ W2   │ │ W3   │   │  
│ └──┬───┘ └──┬───┘ └──┬───┘   │  
│    └────────┴────────┘       │  
│        (Time-slicing)        │  
└──────────────────────────────┘

Real parallelism only starts around 4096MB, where Lambda provides ~2.3 vCPUs.
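If you’d rather not hardcode the worker count, Lambda exposes the configured memory through the AWS_LAMBDA_FUNCTION_MEMORY_SIZE environment variable, and AWS documents roughly one vCPU per 1,769MB. A sketch of a sizing heuristic built on that:

import math
import os

def estimate_workers():
    # AWS allocates roughly one vCPU per 1,769MB of configured memory.
    memory_mb = int(os.environ.get("AWS_LAMBDA_FUNCTION_MEMORY_SIZE", "1769"))
    return max(1, math.ceil(memory_mb / 1769))  # 4096MB -> ceil(2.32) = 3

Rounding up lets a partial vCPU keep a worker busy; at 4096MB this yields the same 3 workers we settled on.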

The Final Solution

4096MB memory + 3 workers consistently delivered real parallelism:

4096MB Lambda (≈2.3 vCPUs)  
┌──────────────────────────────┐  
│ vCPU 1        │ vCPU 2       │  
│ ┌──────┐      │ ┌──────┐     │  
│ │ W1   │      │ │ W2   │     │  
│ └──────┘      │ └──────┘     │  
│               │ ┌──────┐     │  
│               │ │ W3   │     │  
│               │ └──────┘     │  
└──────────────────────────────┘

This reduced total parsing time from 16–25s down to 7–11s.
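Putting it all together, here’s a sketch of the full handler. The event shape, bucket/key fields, and parse_xml are placeholders, not our production code:

import io
import zipfile

import boto3
from lambda_multiprocessing import Pool

s3 = boto3.client("s3")

def _parse_single_xml(item):
    filename, xml_content = item
    return parse_xml(xml_content)  # placeholder for the real parser

def handler(event, context):
    # Hypothetical event: {"bucket": "...", "key": "..."}
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    archive = zipfile.ZipFile(io.BytesIO(obj["Body"].read()))
    xml_files = [(name, archive.read(name)) for name in archive.namelist()]
    with Pool(processes=3) as pool:
        results = pool.map(_parse_single_xml, xml_files)
    return {"parsed": len(results)}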

Benchmark Tables

[Speedup table: image in the original post]

[Cost table: image in the original post]
Benchmarks validate our vCPU + worker strategy.

Summary: Decision Tree

Who this is for: Engineers running CPU-heavy workloads inside AWS Lambda who are surprised that “just add threads” didn’t work.

Is your Lambda task slow?  
│  
├─► I/O-bound?  
│   └─► ThreadPoolExecutor ✅  
│  
└─► CPU-bound?  
    ├─► <5 items → stay sequential  
    └─► 5+ items → lambda-multiprocessing + 4096MB+ memory

Key Takeaways

  1. ThreadPoolExecutor is for I/O, not CPU
  2. Standard multiprocessing breaks in Lambda
  3. lambda-multiprocessing works — with enough memory
  4. Memory size controls CPU parallelism
  5. Always benchmark; overhead matters

The real lesson: In Lambda, performance tuning is never just about Python. It’s about understanding how AWS slices CPU, memory, and process isolation — and designing with those constraints, not against them.

Have you run into similar surprises with parallelism in Lambda? I’d love to hear what worked — or didn’t — in the comments.