
I was evaluating GitHub’s Copilot SDK for a latency-sensitive workflow—something that needed to feel instant. The responses were correct, but they didn’t feel fast. Before blaming the model, I decided to profile it.
Across 27 runs and 9 models, a pattern emerged: most latency wasn’t where I expected—and some of it was entirely avoidable.
TL;DR
- 66% of latency is model inference — you can’t optimize this
- 33% is client lifecycle overhead (~2.5s) — this is entirely avoidable
- Reusing clients and sessions saves ~1.4s per request
- gpt-4.1 was faster than every “mini” model I tested
Methodology (Read This First)
To keep the results grounded and reproducible, all measurements were taken under the following conditions:
- SDK: GitHub Copilot SDK for Python (technical preview)
- Environment: Python 3.14, macOS
- Prompt: "Say hello in 5 words"
- Runs: 3 iterations per model (27 total runs)
- Workload: Single-turn request, no tools, short output
These numbers are directional, not absolute. Backend load, region, prompt complexity, and SDK changes will affect results — but the relative patterns were consistent across runs.
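If you want to produce a per-phase breakdown like the one below yourself, here is a minimal timing sketch built only from the SDK calls used later in this post (CopilotClient, start, create_session, send_and_wait, stop). It is illustrative rather than the exact harness behind these numbers, and it lumps time-to-first-token together with token generation; separating those requires the streaming events shown in Tip 3.

```python
import time
from copilot import CopilotClient

async def profile_once(model: str, prompt: str = "Say hello in 5 words") -> dict:
    """Rough per-phase timings for a single-turn request (illustrative sketch)."""
    timings = {}

    t0 = time.perf_counter()
    client = CopilotClient()
    await client.start()                                   # spawn the CLI process
    timings["cli_start"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    session = await client.create_session({"model": model})
    timings["session_creation"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    await session.send_and_wait({"prompt": prompt})        # TTFT + token generation combined
    timings["response"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    await client.stop()                                    # tear the CLI back down
    timings["client_stop"] = time.perf_counter() - t0

    timings["total"] = sum(timings.values())
    return timings
```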
Where Does the Time Actually Go?
After aggregating results across all models, latency broke down as follows:
| Phase | Avg Time | % of Total |
|---|---|---|
| Time to First Token | 5.1s | 66% |
| Session Creation | 1.2s | 15% |
| Client Stop | 1.0s | 13% |
| CLI Start | 0.4s | 5% |
| Token Generation | <0.1s | ~1% |
Key insight: model inference dominates end‑to‑end latency — but client lifecycle management alone accounts for ~2.5 seconds of overhead per request.
That overhead is optional.
Tip 1: Reuse Your Client and Sessions
Every await client.start() spawns a CLI process (~0.4s). Every await client.stop() tears it down (~1s). Doing this per request quietly burns time.
```python
# Avoid this pattern
for query in queries:
    client = CopilotClient()
    await client.start()
    session = await client.create_session()
    await session.send_and_wait({"prompt": query})
    await client.stop()  # ~1.4s overhead per query
```

```python
# Prefer this
from copilot import CopilotClient

client = CopilotClient()
await client.start()
session = await client.create_session({
    "model": "gpt-4.1"
})

for query in queries:
    await session.send_and_wait({"prompt": query})

await client.stop()  # only once
```
Transferable principle: Lifecycle management often matters more than request‑level optimization. This echoes one of the key lessons from why AI agents fail in production — the architecture matters more than the model.
What about session creation (1.2s)? That overhead is harder to avoid for single-turn requests—each new conversation needs a fresh session. But for multi-turn workflows, reusing the same session eliminates it entirely.
The SDK doesn’t currently expose session pooling or pre-warming — sessions are created on-demand. For high-throughput scenarios, you’d need to implement pooling yourself or design workflows that maximize turns per session.
(Assumes a single user or trust boundary. Short‑lived CLI tools or strict tenant isolation may require different trade‑offs.)
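If you do need something pool-like today, a rough sketch of rolling your own is below. It assumes only the CopilotClient, create_session, and send_and_wait calls used elsewhere in this post; the SessionPool class and its method names are hypothetical. Note that handing one session to multiple requests means those turns share conversation history, so this only fits workflows where that is acceptable.

```python
import asyncio
from copilot import CopilotClient

class SessionPool:
    """Hypothetical pool: one long-lived client, a fixed set of pre-created sessions."""

    def __init__(self, size: int = 4, model: str = "gpt-4.1"):
        self._client = CopilotClient()
        self._model = model
        self._size = size
        self._sessions: asyncio.Queue = asyncio.Queue()

    async def start(self) -> None:
        # Pay the CLI start (~0.4s) and per-session creation (~1.2s) costs once, up front.
        await self._client.start()
        for _ in range(self._size):
            session = await self._client.create_session({"model": self._model})
            await self._sessions.put(session)

    async def ask(self, prompt: str):
        # Borrow a session, run one turn, return it. Turns on the same session
        # share conversation history; acceptable here only by assumption.
        session = await self._sessions.get()
        try:
            return await session.send_and_wait({"prompt": prompt})
        finally:
            await self._sessions.put(session)

    async def stop(self) -> None:
        await self._client.stop()
```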
Tip 2: Benchmark Models — Don’t Trust Labels
Surprise:
gpt-4.1 was faster than every “mini” model I tested.
Average latency across models (3 runs each):
| Model | Avg Total Latency | Time to First Token | Tier |
|---|---|---|---|
| gpt-4.1 (fastest) | 5.9s | 3.4s | Standard |
| claude-opus-4.5 | 6.7s | 4.2s | Top-tier |
| claude-sonnet-4 | 7.1s | 4.7s | Standard |
| gemini-3-pro | 7.9s | 5.3s | Top-tier |
| claude-haiku-4 | 8.2s | 5.6s | Fast |
| o3-mini | 8.5s | 5.9s | Fast |
| gpt-4.1-mini | 8.6s | 6.0s | Fast |
| gpt-4o | 8.7s | 5.9s | Standard |
| gpt-5 | 9.7s | 6.9s | Top-tier |
Notable observations:
- gpt-4.1 was the fastest overall, faster than several “mini” models
- claude-opus-4.5 outperformed other top-tier models on latency
- Model tiers did not reliably predict speed
Transferable principle: Model class names are marketing abstractions, not performance guarantees. Always measure against your own workload.
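A minimal version of that measurement loop might look like the sketch below. It reuses one client across models (per Tip 1) and times only send_and_wait, so session creation is kept out of the measured window. The compare_models helper and the model list are placeholders, not part of the SDK; swap in your own candidates and prompt.

```python
import time
from copilot import CopilotClient

CANDIDATE_MODELS = ["gpt-4.1", "gpt-4.1-mini", "claude-haiku-4"]  # substitute your own

async def compare_models(prompt: str, runs: int = 3) -> dict[str, float]:
    """Average send_and_wait latency per model, reusing a single client throughout."""
    client = CopilotClient()
    await client.start()
    results: dict[str, float] = {}
    try:
        for model in CANDIDATE_MODELS:
            samples = []
            for _ in range(runs):
                # Fresh session per run so no conversation history accumulates
                session = await client.create_session({"model": model})
                t0 = time.perf_counter()
                await session.send_and_wait({"prompt": prompt})
                samples.append(time.perf_counter() - t0)
            results[model] = sum(samples) / len(samples)
    finally:
        await client.stop()
    return results
```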
Tip 3: Stream for Perceived Performance
Once generation starts, tokens arrive quickly. The problem is the silence beforehand.
Streaming doesn’t reduce total latency — but it dramatically improves perceived responsiveness. With gpt-4.1, the first token appears in ~3.4s instead of waiting ~5.9s for the full response.
```python
from copilot import CopilotClient
from copilot.generated.session_events import SessionEventType

client = CopilotClient()
await client.start()

session = await client.create_session({
    "model": "gpt-4.1",
    "streaming": True
})

def handle_event(event):
    # Print each streamed chunk as soon as it arrives
    if event.type == SessionEventType.ASSISTANT_MESSAGE_DELTA:
        print(event.data.delta_content, end="", flush=True)

session.on(handle_event)
await session.send_and_wait({"prompt": "Explain async/await"})
```
Transferable principle: Perceived latency is a first‑order UX metric — treat it as such.
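TTFT is also easy to measure directly: a small variation of the handler above can timestamp the first delta. This is a sketch; measure_ttft is a hypothetical helper, and it assumes a session created with streaming enabled as shown above.

```python
import time
from copilot.generated.session_events import SessionEventType

async def measure_ttft(session, prompt: str) -> float:
    """Seconds from sending the prompt to the first streamed token (rough sketch)."""
    first_token_at: list[float] = []

    def handle_event(event):
        # Record only the very first delta
        if event.type == SessionEventType.ASSISTANT_MESSAGE_DELTA and not first_token_at:
            first_token_at.append(time.perf_counter())

    session.on(handle_event)
    start = time.perf_counter()
    await session.send_and_wait({"prompt": prompt})
    return (first_token_at[0] - start) if first_token_at else float("nan")
```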
What This Data Does Not Say
To avoid over‑interpreting the results:
- This is not a throughput benchmark
- This is not cost‑normalized
- This does not measure tool‑calling, long‑context, or multi‑turn sessions
- This does not account for multi‑region variance
The focus here is single‑turn, latency‑sensitive user interactions.
Decision Checklist
Use this as a quick reference:
- Building an interactive UI? Enable streaming.
- Sending multiple requests per user? Reuse clients and sessions.
- Choosing a “fast” model? Benchmark — don’t assume.
- Optimizing perceived speed? Focus on TTFT, not token rate.
What Surprised Me—and What to Take Away
I expected “mini” models to win on latency. They didn’t. gpt-4.1 consistently outperformed gpt-4.1-mini, o3-mini, and even claude-haiku-4. Meanwhile, gpt-5 was the slowest of all nine.
The other surprise: lifecycle overhead. I assumed start() and stop() were negligible. They weren’t: together they cost ~1.4 seconds per request.
The fixes:
- Reuse clients and sessions — save ~1.4s per request
- Benchmark models yourself — labels lie
- Stream responses — better UX without changing total time
Combined impact: Picking gpt-4.1 over gpt-5 saves ~40% on latency. Add lifecycle reuse and you’re looking at sub-5s round trips instead of 10s+.
SDK: github.com/github/copilot-sdk
For a framework that gets lifecycle and state management right out of the box, see my walkthrough of the Strands Agents SDK.
If you’ve profiled Copilot or other agent SDKs in production, I’d love to compare notes.