
TL;DR: Python’s GIL blocks ThreadPoolExecutor from parallelizing CPU-bound work, and standard multiprocessing crashes in Lambda due to /dev/shm restrictions. The fix: lambda-multiprocessing (pipe-based IPC) with 4096MB+ memory to get enough vCPUs for real parallelism. This cut XML parsing from 16–25s to 7–11s.

The Problem

We had a Lambda function processing ZIP files containing 80–150 complex XML documents. Each document required schema validation, namespace resolution, and deep tree traversal, and took ~200ms to parse. Processed one after another, that added up to 16–25 seconds of total processing time.

The obvious solution? Parallelize the parsing. But this turned out to be surprisingly tricky in AWS Lambda.

Attempt #1: ThreadPoolExecutor (Failed)

Our first instinct was to use Python’s ThreadPoolExecutor: it’s simple, built-in, and widely recommended.

from concurrent.futures import ThreadPoolExecutor

def parse_all_xmls(xml_files):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(parse_single_xml, xml_files))
    return results

Result: No improvement whatsoever. Same 16–25 seconds.

Why It Failed: The GIL

In standard CPython, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time. For CPU-bound work, threads simply take turns holding the GIL, so adding threads (or cores) yields no true parallelism.

What ThreadPoolExecutor actually does for CPU-bound work:

[Diagram: three threads inside one Python process funneling through a single GIL onto one vCPU]

ThreadPoolExecutor does work well for I/O-bound tasks (S3, APIs, databases), where threads release the GIL while waiting on network responses. Our XML parsing was CPU-bound with nothing to wait on, so the threads spent their time contending for the GIL instead of running in parallel.
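If you want to see the effect for yourself, here is a minimal sketch that times the same CPU-bound work sequentially and through a thread pool. The function `cpu_bound_task` is a hypothetical stand-in for XML parsing (a pure-Python loop), not our actual parser, and the sketch assumes a standard GIL-enabled CPython build. On such a build the two timings typically come out close to each other.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound_task(n: int) -> int:
    """Hypothetical stand-in for parsing: a pure-Python loop that holds the GIL."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_sequential(jobs):
    """Run every job in the calling thread; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [cpu_bound_task(n) for n in jobs]
    return results, time.perf_counter() - start


def time_threaded(jobs, max_workers=4):
    """Run the same jobs through a thread pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(cpu_bound_task, jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [300_000] * 8
    _, sequential_s = time_sequential(jobs)
    _, threaded_s = time_threaded(jobs)
    print(f"sequential: {sequential_s:.2f}s, threaded: {threaded_s:.2f}s")
```

Swap `cpu_bound_task` for something that sleeps or waits on the network and the threaded version pulls ahead, which is the I/O-bound case described above.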

Attempt #2: multiprocessing.Pool (Crashed)

Since threads didn’t help, we tried true multiprocessing:

from multiprocessing import Pool

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(parse_single_xml, xml_files)
    return results

Result: Crashes or hangs in Lambda.

Why It Failed: No /dev/shm

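Standard multiprocessing.Pool (like multiprocessing.Queue) relies on semaphores backed by shared memory in /dev/shm. The Lambda execution environment does not provide a usable /dev/shm, so creating a Pool typically fails outright (commonly reported as OSError: [Errno 38] Function not implemented), hangs, or behaves unpredictably.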
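A quick way to check whether a given environment has this limitation is to try creating one of the primitives that Pool depends on. The sketch below is a diagnostic under assumptions: the helper names `semaphores_available` and `shared_memory_report` are made up for illustration, and it only probes `multiprocessing.Lock()`, which on Linux is backed by a POSIX semaphore.

```python
import multiprocessing
import os


def semaphores_available() -> bool:
    """Return True if multiprocessing can create a semaphore-backed lock.

    Pool and Queue need these primitives; where they are unsupported,
    creating one raises OSError (or ImportError on some platforms).
    """
    try:
        multiprocessing.Lock()
    except (OSError, ImportError):
        return False
    return True


def shared_memory_report() -> dict:
    """Collect the two facts that matter before reaching for Pool."""
    return {
        "dev_shm_exists": os.path.isdir("/dev/shm"),
        "semaphores_available": semaphores_available(),
    }


if __name__ == "__main__":
    print(shared_memory_report())
```

Logging this report once from inside the function is cheaper than discovering the problem through a hung invocation.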

At this point, it was clear: parallelism in Lambda isn’t a Python problem; it’s an environment problem.

Attempt #3: lambda-multiprocessing (Success!)

The [lambda-multiprocessing](https://pypi.org/project/lambda-multiprocessing/) library avoids shared memory by using pipes for inter-process communication:

from lambda_multiprocessing import Pool

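def _parse_single_xml(item):
    """Parse one (filename, xml_content) pair. Must be module-level for pickling."""
    _filename, xml_content = item  # the filename is not needed for parsing
    return parse_xml(xml_content)  # parse_xml is the existing per-document parser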

def parse_all_xmls(xml_files):
    with Pool(processes=3) as pool:
        results = pool.map(_parse_single_xml, xml_files)
    return results
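To make the pipe-based idea concrete, here is a rough sketch of an order-preserving parallel map built only from `multiprocessing.Process` and `multiprocessing.Pipe`, neither of which needs /dev/shm. This is not the library's actual implementation. The names `pipe_map` and `_run_chunk` are hypothetical, and the sketch assumes the "fork" start method (available on Linux, which is what Lambda runs), so the worker function is inherited by the child instead of being pickled.

```python
import multiprocessing as mp


def _run_chunk(func, chunk, send_conn):
    """Child process: apply func to each item and send the results back."""
    try:
        send_conn.send([func(item) for item in chunk])
    finally:
        send_conn.close()


def pipe_map(func, items, processes=3):
    """Order-preserving parallel map using one Process and one Pipe per worker."""
    items = list(items)
    # Assumption: "fork" start method, so func and chunk are inherited, not pickled.
    ctx = mp.get_context("fork")
    jobs = []
    for offset in range(processes):
        chunk = items[offset::processes]  # items offset, offset+processes, ...
        if not chunk:
            continue
        recv_conn, send_conn = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_run_chunk, args=(func, chunk, send_conn))
        proc.start()
        send_conn.close()  # the parent keeps only the receiving end
        jobs.append((offset, proc, recv_conn))

    results = [None] * len(items)
    for offset, proc, recv_conn in jobs:
        # Receive before join(): a child blocked on a full pipe never exits.
        results[offset::processes] = recv_conn.recv()
        recv_conn.close()
        proc.join()
    return results
```

Calling `pipe_map(some_function, items, processes=3)` returns results in input order. If a worker raises, the parent sees an `EOFError` from `recv()`. A real library has to handle that, along with timeouts and cleanup, which is a good reason to use lambda-multiprocessing over a hand-rolled version.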

Why multiprocessing bypasses the GIL:

[Diagram: three independent Python processes, each with its own GIL, parsing XML on three separate vCPUs in parallel]

Important: If you’re processing fewer than ~5 items, multiprocessing is often slower due to process startup overhead. Parallelism only pays off once the workload is large enough.

The Hidden Gotcha: Lambda Memory = vCPUs

Our first test with lambda-multiprocessing at 2048MB memory showed only a 1.13x speedup, barely better than sequential. This was confusing at first: we had multiple worker processes, but almost no real performance gain.

The reason is simple but non-obvious: Lambda allocates vCPUs proportionally to memory.

What happens at 2048MB (≈1.15 vCPUs):

[Diagram: three workers in a 2048MB Lambda all funnel onto a single partial vCPU and have to time-slice]

Real parallelism only starts around 4096MB, where Lambda provides ~2.3 vCPUs.
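The vCPU figures above follow from a simple proportional model. AWS documents that a function has the equivalent of one vCPU at 1,769 MB; the sketch below assumes the allocation scales linearly from there, which matches the ≈1.15 and ~2.3 figures we observed. The worker-count rule (round the vCPU estimate up) is our own heuristic, not an AWS recommendation, and the function names are illustrative.

```python
import math
import os

MB_PER_VCPU = 1769  # AWS docs: 1,769 MB is the equivalent of one vCPU


def estimated_vcpus(memory_mb: int) -> float:
    """Estimate vCPUs for a memory setting, assuming linear scaling."""
    return memory_mb / MB_PER_VCPU


def suggested_workers(memory_mb: int) -> int:
    """Heuristic (our assumption): one worker per started vCPU, at least one."""
    return max(1, math.ceil(estimated_vcpus(memory_mb)))


def configured_memory_mb(default: int = 1024) -> int:
    """Read the memory setting Lambda exposes to the function at runtime."""
    return int(os.environ.get("AWS_LAMBDA_FUNCTION_MEMORY_SIZE", default))


if __name__ == "__main__":
    for memory_mb in (2048, 4096):
        print(memory_mb, round(estimated_vcpus(memory_mb), 2), suggested_workers(memory_mb))
```

Inside a handler, `suggested_workers(configured_memory_mb())` gives a starting point for the pool size; benchmark before trusting it.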

The Final Solution

4096MB memory + 3 workers consistently delivered real parallelism:

[Diagram: a 4096MB Lambda splits three workers across two vCPUs: worker 1 on its own core, workers 2 and 3 sharing the second]

This reduced total parsing time from 16–25s down to 7–11s.

Speedup Benchmark

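| Memory | Workers | Sequential | Parallel | Speedup |
|--------|---------|------------|----------|---------|
| 2048MB | 2 | 16.19s | 14.32s | 1.13x |
| 4096MB | 2 | 16.14s | 8.19s | 1.97x |
| 4096MB | 3 | 16.15s | 7.12s | 2.27x |
| 4096MB | 3 | 20.03s (152 files) | 8.82s | 2.27x |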
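The speedup column is sequential time divided by parallel time. As a sanity check, the sketch below recomputes it from the measured timings and adds a per-worker efficiency figure (speedup divided by worker count). Efficiency is a derived metric we did not report originally; it is included only to show how much of each worker was actually put to use.

```python
# (memory_mb, workers, sequential_s, parallel_s), copied from the benchmark table
RUNS = [
    (2048, 2, 16.19, 14.32),
    (4096, 2, 16.14, 8.19),
    (4096, 3, 16.15, 7.12),
    (4096, 3, 20.03, 8.82),
]


def speedup(sequential_s: float, parallel_s: float) -> float:
    """How many times faster the parallel run was."""
    return sequential_s / parallel_s


def efficiency(sequential_s: float, parallel_s: float, workers: int) -> float:
    """Speedup per worker; 1.0 would mean every worker ran at full speed."""
    return speedup(sequential_s, parallel_s) / workers


if __name__ == "__main__":
    for memory_mb, workers, seq, par in RUNS:
        print(
            f"{memory_mb}MB x{workers}: "
            f"speedup {speedup(seq, par):.2f}x, "
            f"efficiency {efficiency(seq, par, workers):.0%}"
        )
```

Two workers at 4096MB run at close to full efficiency, while the third worker adds less, which is consistent with having roughly 2.3 vCPUs to share.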

Cost Benchmark

| Config | Duration | Cost per ms (relative) | Total cost |
|--------|----------|------------------------|------------|
| 2048MB | 16s | 1x | 16 units |
| 4096MB | 7s | 2x | 14 units |

Benchmarks validate our vCPU + worker strategy.

The cost table shows that doubling memory can actually reduce total cost when it cuts duration by more than half. For a deeper dive into Lambda pricing levers beyond memory sizing, see 10 Practical Tips to Reduce AWS Lambda Costs.
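The arithmetic behind the cost table is short enough to write down. Lambda's compute charge scales with memory multiplied by duration (GB-seconds), so the sketch below compares configurations in GB-seconds and computes the break-even duration for a larger memory setting. It deliberately ignores per-request charges and actual prices, and the function names are illustrative.

```python
def gb_seconds(memory_mb: float, duration_s: float) -> float:
    """Billable compute for one invocation, in GB-seconds."""
    return memory_mb / 1024 * duration_s


def break_even_duration(old_memory_mb: float, old_duration_s: float,
                        new_memory_mb: float) -> float:
    """Duration at which the new memory setting costs the same as the old one."""
    return old_duration_s * old_memory_mb / new_memory_mb


if __name__ == "__main__":
    before = gb_seconds(2048, 16)
    after = gb_seconds(4096, 7)
    limit = break_even_duration(2048, 16, 4096)
    print(f"before: {before} GB-s, after: {after} GB-s, break-even at {limit}s")
```

For our numbers, the 4096MB configuration breaks even at 8 seconds, so a 7-second run comes out slightly cheaper as well as faster.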

Summary: When to Use What

Who this is for: Engineers running CPU-heavy workloads inside AWS Lambda who are surprised that “just add threads” didn’t work.

| Workload | Items per call | Use | Why |
|----------|----------------|-----|-----|
| I/O-bound | any | ThreadPoolExecutor | Threads release the GIL while waiting on I/O |
| CPU-bound | < 5 | Stay sequential | Process startup overhead beats the speedup |
| CPU-bound | 5+ | lambda-multiprocessing | Pair with 4096MB+ memory for real parallelism |
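The table above can be expressed as a small dispatch helper. This is a sketch only: the function name `choose_strategy`, the string labels, and the default threshold of 5 items are illustrative, and the threshold in particular is a rough figure from our workload that you should re-measure for yours.

```python
def choose_strategy(workload: str, item_count: int, min_parallel_items: int = 5) -> str:
    """Pick an execution strategy from the workload type and batch size.

    workload: "io" for I/O-bound work, "cpu" for CPU-bound work.
    """
    if workload == "io":
        return "ThreadPoolExecutor"
    if workload == "cpu":
        if item_count < min_parallel_items:
            return "sequential"  # process startup would outweigh the gain
        return "lambda-multiprocessing"
    raise ValueError(f"unknown workload type: {workload!r}")
```

A handler could call `choose_strategy("cpu", len(xml_files))` once per invocation and branch on the result.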

Key Takeaways

  1. ThreadPoolExecutor is for I/O, not CPU
  2. Standard multiprocessing breaks in Lambda
  3. lambda-multiprocessing works, with enough memory
  4. Memory size controls CPU parallelism
  5. Always benchmark; overhead matters

The real lesson: In Lambda, performance tuning is never just about Python. It’s about understanding how AWS slices CPU, memory, and process isolation, and designing with those constraints, not against them.

If you’re still deciding whether Lambda is the right fit for your workload, see Serverless vs Containers: A Decision Framework for a quick way to evaluate the tradeoffs.

Have you run into similar surprises with parallelism in Lambda? I’d love to hear what worked (or didn’t) in the comments.

Kevin Tan

Cloud Solutions Architect and Engineering Leader based in Singapore. I write about AWS, distributed systems, and building reliable software at scale.