
I was evaluating GitHub’s Copilot SDK for a latency-sensitive workflow—something that needed to feel instant. The responses were correct, but they didn’t feel fast. Before blaming the model, I decided to profile it.
Across 27 runs and 9 models, a pattern emerged: most latency wasn’t where I expected—and some of it was entirely avoidable.
TL;DR
- 66% of latency is model inference — you can’t optimize this
- 33% is client lifecycle overhead (~2.5s) — this is entirely avoidable
- Reusing clients and sessions saves ~1.4s per request
- gpt-4.1 was faster than every “mini” model I tested
Methodology (Read This First)
To keep the results grounded and reproducible, all measurements were taken under the following conditions:
- SDK: GitHub Copilot SDK for Python (technical preview)
- Environment: Python 3.14, macOS
- Prompt: "Say hello in 5 words"
- Runs: 3 iterations per model (27 total runs)
- Workload: Single-turn request, no tools, short output
These numbers are directional, not absolute. Backend load, region, prompt complexity, and SDK changes will affect results — but the relative patterns were consistent across runs.
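If you want to produce a per-phase breakdown like the one below yourself, here is a minimal timing sketch built only from the SDK calls used later in this post (CopilotClient, start, create_session, send_and_wait, stop). It is illustrative rather than the exact harness behind these numbers, and it lumps time-to-first-token together with token generation; separating those requires the streaming events shown in Tip 3.

```python
import time
from copilot import CopilotClient

async def profile_once(model: str, prompt: str = "Say hello in 5 words") -> dict:
    """Rough per-phase timings for a single-turn request (illustrative sketch)."""
    timings = {}

    t0 = time.perf_counter()
    client = CopilotClient()
    await client.start()                                   # spawn the CLI process
    timings["cli_start"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    session = await client.create_session({"model": model})
    timings["session_creation"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    await session.send_and_wait({"prompt": prompt})        # TTFT + token generation combined
    timings["response"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    await client.stop()                                    # tear the CLI back down
    timings["client_stop"] = time.perf_counter() - t0

    timings["total"] = sum(timings.values())
    return timings
```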
Where Does the Time Actually Go?
After aggregating results across all models, latency broke down as follows:
| Phase | Avg Time | % of Total |
|---|---|---|
| Time to First Token | 5.1s | 66% |
| Session Creation | 1.2s | 15% |
| Client Stop | 1.0s | 13% |
| CLI Start | 0.4s | 5% |
| Token Generation | <0.1s | ~1% |
Key insight: model inference dominates end‑to‑end latency — but client lifecycle management alone accounts for ~2.5 seconds of overhead per request.
That overhead is optional.
Tip 1: Reuse Your Client and Sessions
Every await client.start() spawns a CLI process (~0.4s). Every await client.stop() tears it down (~1s). Doing this per request quietly burns time.
```python
# Avoid this pattern
for query in queries:
    client = CopilotClient()
    await client.start()
    session = await client.create_session()
    await session.send_and_wait({"prompt": query})
    await client.stop()  # ~1.4s overhead per query
```

```python
# Prefer this
from copilot import CopilotClient

client = CopilotClient()
await client.start()
session = await client.create_session({
    "model": "gpt-4.1"
})

for query in queries:
    await session.send_and_wait({"prompt": query})

await client.stop()  # only once
```
Transferable principle: Lifecycle management often matters more than request‑level optimization. This echoes one of the key lessons from why AI agents fail in production — the architecture matters more than the model.
What about session creation (1.2s)? That overhead is harder to avoid for single-turn requests—each new conversation needs a fresh session. But for multi-turn workflows, reusing the same session eliminates it entirely.
The SDK doesn’t currently expose session pooling or pre-warming — sessions are created on-demand. For high-throughput scenarios, you’d need to implement pooling yourself or design workflows that maximize turns per session.
(Assumes a single user or trust boundary. Short‑lived CLI tools or strict tenant isolation may require different trade‑offs.)
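If you do need something pool-like today, a rough sketch of rolling your own is below. It assumes only the CopilotClient, create_session, and send_and_wait calls used elsewhere in this post; the SessionPool class and its method names are hypothetical. Note that handing one session to multiple requests means those turns share conversation history, so this only fits workflows where that is acceptable.

```python
import asyncio
from copilot import CopilotClient

class SessionPool:
    """Hypothetical pool: one long-lived client, a fixed set of pre-created sessions."""

    def __init__(self, size: int = 4, model: str = "gpt-4.1"):
        self._client = CopilotClient()
        self._model = model
        self._size = size
        self._sessions: asyncio.Queue = asyncio.Queue()

    async def start(self) -> None:
        # Pay the CLI start (~0.4s) and per-session creation (~1.2s) costs once, up front.
        await self._client.start()
        for _ in range(self._size):
            session = await self._client.create_session({"model": self._model})
            await self._sessions.put(session)

    async def ask(self, prompt: str):
        # Borrow a session, run one turn, return it. Turns on the same session
        # share conversation history; acceptable here only by assumption.
        session = await self._sessions.get()
        try:
            return await session.send_and_wait({"prompt": prompt})
        finally:
            await self._sessions.put(session)

    async def stop(self) -> None:
        await self._client.stop()
```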
Tip 2: Benchmark Models — Don’t Trust Labels
Surprise:
gpt-4.1 was faster than every “mini” model I tested.
Average latency across models (3 runs each):
| Model | Avg Total Latency | Time to First Token | Tier |
|---|---|---|---|
| gpt-4.1 (fastest) | 5.9s | 3.4s | Standard |
| claude-opus-4.5 | 6.7s | 4.2s | Top-tier |
| claude-sonnet-4 | 7.1s | 4.7s | Standard |
| gemini-3-pro | 7.9s | 5.3s | Top-tier |
| claude-haiku-4 | 8.2s | 5.6s | Fast |
| o3-mini | 8.5s | 5.9s | Fast |
| gpt-4.1-mini | 8.6s | 6.0s | Fast |
| gpt-4o | 8.7s | 5.9s | Standard |
| gpt-5 | 9.7s | 6.9s | Top-tier |
Notable observations:
- gpt-4.1 was the fastest overall, faster than several “mini” models
- claude-opus-4.5 outperformed other top-tier models on latency
- Model tiers did not reliably predict speed
Transferable principle: Model class names are marketing abstractions, not performance guarantees. Always measure against your own workload.
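A minimal version of that measurement loop might look like the sketch below. It reuses one client across models (per Tip 1) and times only send_and_wait, so session creation is kept out of the measured window. The compare_models helper and the model list are placeholders, not part of the SDK; swap in your own candidates and prompt.

```python
import time
from copilot import CopilotClient

CANDIDATE_MODELS = ["gpt-4.1", "gpt-4.1-mini", "claude-haiku-4"]  # substitute your own

async def compare_models(prompt: str, runs: int = 3) -> dict[str, float]:
    """Average send_and_wait latency per model, reusing a single client throughout."""
    client = CopilotClient()
    await client.start()
    results: dict[str, float] = {}
    try:
        for model in CANDIDATE_MODELS:
            samples = []
            for _ in range(runs):
                # Fresh session per run so no conversation history accumulates
                session = await client.create_session({"model": model})
                t0 = time.perf_counter()
                await session.send_and_wait({"prompt": prompt})
                samples.append(time.perf_counter() - t0)
            results[model] = sum(samples) / len(samples)
    finally:
        await client.stop()
    return results
```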
Tip 3: Stream for Perceived Performance
Once generation starts, tokens arrive quickly. The problem is the silence beforehand.
Streaming doesn’t reduce total latency — but it dramatically improves perceived responsiveness. With gpt-4.1, the first token appears in ~3.4s instead of waiting ~5.9s for the full response.
```python
from copilot import CopilotClient
from copilot.generated.session_events import SessionEventType

client = CopilotClient()
await client.start()

session = await client.create_session({
    "model": "gpt-4.1",
    "streaming": True
})

def handle_event(event):
    # Print each streamed chunk as soon as it arrives
    if event.type == SessionEventType.ASSISTANT_MESSAGE_DELTA:
        print(event.data.delta_content, end="", flush=True)

session.on(handle_event)
await session.send_and_wait({"prompt": "Explain async/await"})
```
Transferable principle: Perceived latency is a first‑order UX metric — treat it as such.
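TTFT is also easy to measure directly: a small variation of the handler above can timestamp the first delta. This is a sketch; measure_ttft is a hypothetical helper, and it assumes a session created with streaming enabled as shown above.

```python
import time
from copilot.generated.session_events import SessionEventType

async def measure_ttft(session, prompt: str) -> float:
    """Seconds from sending the prompt to the first streamed token (rough sketch)."""
    first_token_at: list[float] = []

    def handle_event(event):
        # Record only the very first delta
        if event.type == SessionEventType.ASSISTANT_MESSAGE_DELTA and not first_token_at:
            first_token_at.append(time.perf_counter())

    session.on(handle_event)
    start = time.perf_counter()
    await session.send_and_wait({"prompt": prompt})
    return (first_token_at[0] - start) if first_token_at else float("nan")
```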
What This Data Does Not Say
To avoid over‑interpreting the results:
- This is not a throughput benchmark
- This is not cost‑normalized
- This does not measure tool‑calling, long‑context, or multi‑turn sessions
- This does not account for multi‑region variance
The focus here is single‑turn, latency‑sensitive user interactions.
Decision Checklist
Use this as a quick reference:
- Building an interactive UI? Enable streaming.
- Sending multiple requests per user? Reuse clients and sessions.
- Choosing a “fast” model? Benchmark — don’t assume.
- Optimizing perceived speed? Focus on TTFT, not token rate.
What Surprised Me—and What to Take Away
I expected “mini” models to win on latency. They didn’t. gpt-4.1 consistently outperformed gpt-4.1-mini, o3-mini, and even claude-haiku-4. Meanwhile, gpt-5 was the slowest of all nine.
The other surprise: lifecycle overhead. I assumed start() and stop() were negligible. They weren’t: together they cost ~1.4 seconds per request.
The fixes:
- Reuse clients and sessions — save ~1.4s per request
- Benchmark models yourself — labels lie
- Stream responses — better UX without changing total time
Combined impact: Picking gpt-4.1 over gpt-5 saves ~40% on latency. Add lifecycle reuse and you’re looking at sub-5s round trips instead of 10s+.
SDK: github.com/github/copilot-sdk
For a framework that gets lifecycle and state management right out of the box, see my walkthrough of the Strands Agents SDK.
If you’ve profiled Copilot or other agent SDKs in production, I’d love to compare notes.