Parallel Processing in AWS Lambda with Python: What Actually Works (And What Doesn't)


The Problem

We had a Lambda function processing ZIP files containing 80–150 complex XML documents. Each XML required schema validation, namespace resolution, and deep tree traversal, taking ~200ms to parse, resulting in 16–25 seconds of total processing time.
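For context, the sequential baseline looked roughly like this (a simplified sketch using the stdlib parser; the real `parse_single_xml` also did schema validation, namespace resolution, and deep traversal):

```python
import xml.etree.ElementTree as ET

def parse_single_xml(item):
    """Parse one (filename, xml_bytes) pair.

    Sketch only: the production version also validated against a schema
    and walked the full tree, which is where the ~200ms per file went.
    """
    filename, xml_content = item
    root = ET.fromstring(xml_content)
    return filename, root.tag

def parse_all_sequential(xml_files):
    # ~200ms per document x 80-150 documents = 16-25s total
    return [parse_single_xml(item) for item in xml_files]
```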

The obvious solution? Parallelize the parsing. But this turned out to be surprisingly tricky in AWS Lambda.

Attempt #1: ThreadPoolExecutor (Failed)

Our first instinct was to use Python’s ThreadPoolExecutor: it’s simple, built-in, and widely recommended.

from concurrent.futures import ThreadPoolExecutor

def parse_all_xmls(xml_files):
    # parse_single_xml is the ~200ms CPU-bound parse described above
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(parse_single_xml, xml_files))
    return results

Result: No improvement whatsoever. Same 16–25 seconds.

Why It Failed: The GIL

Python’s Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. For CPU-bound work, threads simply take turns on the same core with no true parallelism.

What ThreadPoolExecutor actually does for CPU-bound work:

 ThreadPoolExecutor (CPU-bound)
┌─────────────────────────────────┐
│ Python Process (single GIL)     │
│ ┌──────┐ ┌──────┐ ┌──────┐      │
│ │ T1   │ │ T2   │ │ T3   │      │
│ └──┬───┘ └──┬───┘ └──┬───┘      │
│    └────────┼────────┘          │
│             ▼                   │
│            GIL                  │
│   Only one runs at once         │
└─────────────────────────────────┘

ThreadPoolExecutor does work well for I/O-bound tasks (S3, APIs, databases), where threads release the GIL while waiting on network responses. XML parsing is pure CPU work, so the GIL is never released.
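You can see the distinction with a stdlib-only sketch: `time.sleep` releases the GIL the same way a blocked socket does, so four simulated I/O waits overlap under threads, while the same waits stack up sequentially:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_call(_):
    # time.sleep releases the GIL, just like waiting on a network response
    time.sleep(0.2)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(fake_io_call, range(4)))
threaded = time.perf_counter() - start  # ~0.2s: the four waits overlap

start = time.perf_counter()
for i in range(4):
    fake_io_call(i)
sequential = time.perf_counter() - start  # ~0.8s: the waits run back-to-back
```

Swap the sleep for a tight CPU loop and the threaded time collapses back to the sequential time, which is exactly what we saw with XML parsing.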

Attempt #2: multiprocessing.Pool (Crashed)

Since threads didn’t help, we tried true multiprocessing:

from multiprocessing import Pool

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(parse_single_xml, xml_files)
    return results

Result: Crashes or hangs in Lambda.

Why It Failed: No /dev/shm

Standard Python multiprocessing backs its locks and queues with POSIX shared memory in /dev/shm. AWS Lambda doesn't mount /dev/shm at all, so creating a `Pool` or `Queue` typically fails outright with `OSError: [Errno 38] Function not implemented`, or the workers simply hang.
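A defensive sketch of the distinction (the `/dev/shm` probe is a simplification, but it captures why the same code works locally and dies in Lambda):

```python
import os

def has_shared_memory():
    """AWS Lambda does not mount /dev/shm, which the standard
    multiprocessing.Pool needs for its SemLock synchronization primitives."""
    return os.path.isdir("/dev/shm")

def pick_pool():
    if has_shared_memory():
        from multiprocessing import Pool  # regular Linux hosts and containers
    else:
        from lambda_multiprocessing import Pool  # pipe-based, Lambda-safe
    return Pool
```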

At this point, it was clear: parallelism in Lambda isn’t a Python problem; it’s an environment problem.

Attempt #3: lambda-multiprocessing (Success!)

The [lambda-multiprocessing](https://pypi.org/project/lambda-multiprocessing/) library avoids shared memory by using pipes for inter-process communication:

from lambda_multiprocessing import Pool

def _parse_single_xml(item):
    """Must be module-level for pickling."""
    filename, xml_content = item
    return parse_xml(xml_content)

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(_parse_single_xml, xml_files)
    return results

Why multiprocessing bypasses the GIL:

 Multiprocessing (true parallelism)

Process 1           Process 2           Process 3
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Python +    │     │ Python +    │     │ Python +    │
│ GIL #1      │     │ GIL #2      │     │ GIL #3      │
│ parse XML   │     │ parse XML   │     │ parse XML   │
└─────┬───────┘     └─────┬───────┘     └─────┬───────┘
      ▼                   ▼                   ▼
    vCPU 1              vCPU 2              vCPU 3

→ Each process has its own GIL
→ True CPU parallelism

Important: If you’re processing fewer than ~5 items, multiprocessing is often slower due to process startup overhead. Parallelism only pays off once the workload is large enough.
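A sketch of the pattern we settled on: fall back to a plain loop below a size threshold. The cutoff of 5 is an assumption from our workload, and `square` is a stand-in for the real parse function; benchmark your own numbers:

```python
from multiprocessing import Pool  # in Lambda, use lambda_multiprocessing.Pool instead

PARALLEL_THRESHOLD = 5  # assumed cutoff; tune against your own timings

def square(x):
    """Stand-in for parse_single_xml; must be module-level for pickling."""
    return x * x

def process_all(items, workers=3):
    if len(items) < PARALLEL_THRESHOLD:
        # process startup costs more than it saves on tiny batches
        return [square(item) for item in items]
    with Pool(processes=workers) as pool:
        return pool.map(square, items)
```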

The Hidden Gotcha: Lambda Memory = vCPUs

Our first test with lambda-multiprocessing at 2048MB memory showed only 1.13x speedup, barely better than sequential. This was confusing at first: we had multiple worker processes, but almost no real performance gain.

The reason is simple but non-obvious: Lambda allocates vCPUs proportionally to memory.

What happens at 2048MB (≈1.15 vCPUs):

 2048MB Lambda (≈1.15 vCPUs)
┌──────────────────────────────┐
│ Single partial CPU           │
│ ┌──────┐ ┌──────┐ ┌──────┐   │
│ │ W1   │ │ W2   │ │ W3   │   │
│ └──┬───┘ └──┬───┘ └──┬───┘   │
│    └────────┴────────┘       │
│        (Time-slicing)        │
└──────────────────────────────┘

Real parallelism only starts around 4096MB, where Lambda provides ~2.3 vCPUs.
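The proportionality is easy to estimate: AWS documents that a function gets the equivalent of one full vCPU at 1,769MB, scaling linearly. A small helper (the worker-cap heuristic is ours, not an AWS rule):

```python
import math

FULL_VCPU_MB = 1769  # AWS: ~1,769MB corresponds to one full vCPU

def approx_vcpus(memory_mb):
    return memory_mb / FULL_VCPU_MB

def useful_workers(memory_mb, max_workers=6):
    # CPU-bound workers beyond the vCPU count just time-slice each other
    return max(1, min(max_workers, math.ceil(approx_vcpus(memory_mb))))
```

This gives `approx_vcpus(2048) ≈ 1.16` and `approx_vcpus(4096) ≈ 2.32`, matching the diagrams.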

The Final Solution

4096MB memory + 3 workers consistently delivered real parallelism:

4096MB Lambda (≈2.3 vCPUs)
┌──────────────────────────────┐
│ vCPU 1        │ vCPU 2       │
│ ┌──────┐      │ ┌──────┐     │
│ │ W1   │      │ │ W2   │     │
│ └──────┘      │ └──────┘     │
│               │ ┌──────┐     │
│               │ │ W3   │     │
│               │ └──────┘     │
└──────────────────────────────┘

This reduced total parsing time from 16–25s down to 7–11s.

Benchmark Tables

Speedup Table

| Memory | Workers | Sequential | Parallel | Speedup |
|--------|---------|------------|----------|---------|
| 2048MB | 2 | 16.19s | 14.32s | 1.13x |
| 4096MB | 2 | 16.14s | 8.19s | 1.97x |
| 4096MB | 3 | 16.15s | 7.12s | 2.27x |
| 4096MB | 3 | 20.03s (152 files) | 8.82s | 2.27x |

Cost Table

| Config | Duration | Memory Cost/ms | Total Cost |
|--------|----------|----------------|------------|
| 2048MB | 16s | 1x | 16 units |
| 4096MB | 7s | 2x | 14 units |

The benchmarks validate the vCPU-plus-workers strategy, and the cost table shows the counterintuitive part: 4096MB costs twice as much per millisecond but finishes in less than half the time, so each invocation is actually cheaper.
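If you want to reproduce numbers like these, a minimal harness is enough; the key is to run it inside the actual Lambda memory configuration you plan to ship, since local vCPU counts will mislead you:

```python
import time

def bench(fn, *args, repeats=3):
    """Return the best-of-N wall-clock time for fn(*args).

    Best-of-N filters out cold-start and scheduling noise; for Lambda,
    invoke the deployed function rather than timing on a laptop.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```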

Summary: Decision Tree

Who this is for: Engineers running CPU-heavy workloads inside AWS Lambda who are surprised that “just add threads” didn’t work.

Is your Lambda task slow?
│
├─► I/O-bound?
│   └─► ThreadPoolExecutor ✅
│
└─► CPU-bound?
    ├─► <5 items → stay sequential
    └─► 5+ items → lambda-multiprocessing + 4096MB+ memory

Key Takeaways

  1. ThreadPoolExecutor is for I/O, not CPU
  2. Standard multiprocessing breaks in Lambda
  3. lambda-multiprocessing works, with enough memory
  4. Memory size controls CPU parallelism
  5. Always benchmark; overhead matters

The real lesson: In Lambda, performance tuning is never just about Python. It’s about understanding how AWS slices CPU, memory, and process isolation, and designing with those constraints, not against them.

Have you run into similar surprises with parallelism in Lambda? I’d love to hear what worked (or didn’t) in the comments.

Written by Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.
