In our last post in the series, we looked at how the GitHub Copilot SDK allows us to spin up a fully stateful AI agent loop in under 20 lines of code. It feels like magic. By abstracting away conversation history, tool definitions, and orchestration loops, the SDK lets you focus entirely on building.
But in software engineering, magic always comes with a bill. In the world of AI agents, that bill is paid in tokens.
When you wrap your LLM inside a high-level framework, you subject yourself to what I call the Abstraction Tax: hidden prompt context and infrastructure bloat that accumulate entirely under the hood. Today, we are going to look at how to audit your agent's token efficiency using event handling, peek at what the Copilot SDK is actually whispering to your model, and learn how to slash your input token usage by over 99%.
Peeking Under the Hood: Event Handling
To understand what our agent is doing behind our backs, we need visibility. Fortunately, the GitHub Copilot SDK features a robust event-driven architecture. By registering a listener via session.on(), we can intercept real-time telemetry like system message composition and precise token consumption metrics.
Here is the setup we will use to audit our agent's efficiency:
import json

from copilot.generated.session_events import AssistantUsageData, SystemMessageData


def handle_usage(event):
    # Usage events carry the token accounting for a model call.
    if isinstance(event.data, AssistantUsageData):
        print("Tokens used:", {
            "input": event.data.input_tokens,
            "output": event.data.output_tokens,
            "cache_read": event.data.cache_read_tokens,
            "cache_write": event.data.cache_write_tokens,
        })
    # System message events expose the prompt the SDK actually sent.
    elif isinstance(event.data, SystemMessageData):
        print("System message:", json.dumps({
            "first_line": event.data.content.split('\n')[0],
            "content_length": len(event.data.content)
        }, indent=2))
In the session context, we register the event handler:
# Attach our event handler to audit the session
session.on(handle_usage)
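The handler above prints one line per event, which is fine for a single question. For a longer conversation, a running total is often more useful. Here is a minimal sketch of that idea. `UsageSnapshot` and `FakeEvent` are hypothetical stand-ins so the example runs without the SDK installed; they only mirror the field names used in the handler above. In a real session you would presumably check for `AssistantUsageData` instead and register the tally with `session.on(tally)`.

```python
from dataclasses import dataclass, field


@dataclass
class UsageSnapshot:
    """Hypothetical stand-in for the SDK's usage payload (same field names)."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a session event: an object with a .data payload."""
    data: object


@dataclass
class UsageTally:
    """Callable handler that accumulates token counts across usage events."""
    turns: int = 0
    totals: dict = field(default_factory=lambda: {
        "input": 0, "output": 0, "cache_read": 0, "cache_write": 0,
    })

    def __call__(self, event):
        data = event.data
        # Skip anything that is not a usage payload (e.g. system messages).
        if not isinstance(data, UsageSnapshot):
            return
        self.turns += 1
        # "or 0" guards against fields that might be reported as None.
        self.totals["input"] += data.input_tokens or 0
        self.totals["output"] += data.output_tokens or 0
        self.totals["cache_read"] += data.cache_read_tokens or 0
        self.totals["cache_write"] += data.cache_write_tokens or 0


# Simulate three events: two usage reports and one unrelated payload.
tally = UsageTally()
tally(FakeEvent(UsageSnapshot(13500, 8, 12192, 0)))
tally(FakeEvent("not a usage payload"))
tally(FakeEvent(UsageSnapshot(56, 8, 0, 0)))

print(f"Turns: {tally.turns}")
print(f"Totals: {tally.totals}")
```

Keeping the tally in a small callable object rather than in module-level globals makes it easy to attach one tally per session and compare them afterwards.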
The Default State: Paying the Full Tax
When we run the code exactly as written above, asking a simple math question ("What is 2 + 2?"), look at what the SDK actually outputs before giving us the answer:
System message: {
  "first_line": "You are the GitHub Copilot CLI, a terminal assistant built by GitHub. You are an interactive CLI tool that helps users with software engineering tasks.",
  "content_length": 26220
}
Tokens used: {'input': 13500, 'output': 8, 'cache_read': 12192, 'cache_write': 0}
2 + 2 is 4.
The Breakdown
- System Prompt Length: 26,220 characters.
- Input Tokens: 13,500 tokens.
- The Reality Check: To answer a 5-token question, the SDK processed 13,500 tokens.
Because GitHub Copilot was natively engineered as a coding assistant, the SDK automatically injects a massive, coding-centric system persona and tool environment. While prompt caching (noted by the cache_read tokens) helps mitigate the latency and cost, carrying a 26k-character background system prompt for a non-coding persona is incredibly inefficient.
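To put numbers on how much the cache helps, here is a quick back-of-the-envelope calculation using the figures from the default run. It assumes that cache_read tokens are counted as part of the reported input tokens, which is a common accounting convention but is not confirmed here for the SDK.

```python
# Figures reported by the default session in the run above.
input_tokens = 13_500
cache_read_tokens = 12_192
question_tokens = 5  # the rough size of "What is 2 + 2?" used in this post

# Assumption: cache_read_tokens is a subset of input_tokens.
uncached_tokens = input_tokens - cache_read_tokens
cache_hit_ratio = cache_read_tokens / input_tokens

# How many tokens the model processed per token of the actual question.
amplification = input_tokens / question_tokens

print(f"Uncached input tokens: {uncached_tokens}")
print(f"Cache hit ratio: {cache_hit_ratio:.1%}")
print(f"Tokens processed per question token: {amplification:.0f}x")
```

Under that assumption, roughly nine in ten input tokens were served from cache, yet the model still had to work through well over a thousand uncached tokens for a five-token question.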
Phase 1: Reclaiming the Persona
If your agent is meant to be a customer support bot, a creative writer, or a simple calculator, it shouldn't be masquerading as a terminal assistant. We can strip away this default background context by passing the system_message configuration parameter to client.create_session() and replacing the default with a lean, custom prompt:
system_message=SystemMessageReplaceConfig(
    mode="replace",
    content="You are a helpful assistant."
)
Running the code now gives us a drastically different profile:
System message: {
  "first_line": "You are a helpful assistant.",
  "content_length": 28
}
Tokens used: {'input': 7149, 'output': 8, 'cache_read': 0, 'cache_write': 0}
2 + 2 is 4.
The Breakdown
- System Prompt Length: Dropped from 26,220 characters to just 28 characters.
- Input Tokens: Cut nearly in half (about 47%), down to 7,149 tokens.
This is a massive step forward, but 7,149 tokens for a simple arithmetic question is still an incredibly steep abstraction tax. Where is the remaining bulk coming from if our system message is only 28 characters long?
The answer lies in the default tool definitions that the SDK implicitly injects to give your agent its autonomy.
Phase 2: Eliminating the Tool Bloat
To achieve true token efficiency, we must explicitly manage the tools available to the agent. If an agent does not require external terminal utilities or file management capabilities to fulfill its task, we should strip them out entirely.
By adding the available_tools=[] parameter, we pass an empty list to the session, telling the SDK to leave its default toolkits at home.
system_message=SystemMessageReplaceConfig(
    mode="replace",
    content="You are a helpful assistant."
),
available_tools=[]
Let's look at the optimized output:
System message: {
  "first_line": "You are a helpful assistant.",
  "content_length": 28
}
Tokens used: {'input': 56, 'output': 8, 'cache_read': 0, 'cache_write': 0}
2 + 2 is 4.
The Breakdown
- System Prompt Length: 28 characters.
- Input Tokens: 56 tokens.
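With all three runs in hand, we can attribute the overhead by simple differencing. This is only a rough sketch: it assumes the three runs differ solely in the system prompt and the tool list, and that everything else (the question, message framing) costs the same each time.

```python
# Input tokens measured in the three runs above.
default_input = 13_500       # default persona + default tools
persona_only_input = 7_149   # custom persona, default tools still attached
no_tools_input = 56          # custom persona, empty tool list

# System prompt sizes in characters.
default_prompt_chars = 26_220
custom_prompt_chars = 28

# Tokens saved by swapping the system prompt.
system_prompt_tokens = default_input - persona_only_input
# Tokens saved by dropping the default tool definitions.
tool_definition_tokens = persona_only_input - no_tools_input
# Whatever is left: the question, the 28-character prompt, and framing.
residual_tokens = no_tools_input

# Rough characters-per-token ratio for the removed system prompt text.
chars_per_token = (default_prompt_chars - custom_prompt_chars) / system_prompt_tokens

for label, tokens in [
    ("Default system prompt", system_prompt_tokens),
    ("Default tool definitions", tool_definition_tokens),
    ("Everything else", residual_tokens),
]:
    print(f"{label}: {tokens} tokens ({tokens / default_input:.2%} of the default run)")
print(f"Approx. characters per token in the removed prompt: {chars_per_token:.2f}")
```

On these numbers, the tool definitions account for slightly more of the default input than the system prompt does, which is why replacing the persona alone only got us about halfway.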
The Verdict: Optimization Payoff
By strategically taking control of our prompt composition and tool alignment, we achieved a staggering reduction in overhead:
| Configuration | System Prompt Length | Input Tokens | Token Reduction |
| --- | --- | --- | --- |
| Default SDK Behavior | 26,220 chars | 13,500 | Baseline |
| Custom Persona Only | 28 chars | 7,149 | ~47% |
| Custom Persona + Explicit Tools | 28 chars | 56 | ~99.59% |
Key Takeaway: High-level frameworks like the GitHub Copilot SDK are incredibly accelerative, but they make heavy assumptions about your agent's use case. If you don't audit your agent's event loop, you risk burning millions of unnecessary tokens on default workflows that don't match your intended application.
Always tailor your agent's environment to its specific mission: define an explicit system persona and provision only the exact tools it needs to get the job done.
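To see how quickly this adds up, here is a small projection of input tokens at volume. The request count is a hypothetical figure chosen for illustration, and the sketch assumes every request is a fresh single-turn session with the same per-request input size measured in this post. It ignores caching discounts and pricing entirely.

```python
# Per-request input tokens measured in this post.
INPUT_TOKENS_PER_REQUEST = {
    "default": 13_500,
    "custom_persona": 7_149,
    "custom_persona_no_tools": 56,
}

# Hypothetical workload size, for illustration only.
REQUESTS = 10_000


def project(requests, per_request):
    """Return total input tokens per configuration for a given request count."""
    return {name: tokens * requests for name, tokens in per_request.items()}


projected = project(REQUESTS, INPUT_TOKENS_PER_REQUEST)
leanest = min(projected.values())

for name, total in projected.items():
    overhead = total - leanest
    print(f"{name}: {total:,} input tokens ({overhead:,} more than the leanest setup)")
```

Even at this modest hypothetical volume, the default configuration processes over a hundred million more input tokens than the stripped-down one.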
Try It Yourself
Want to see these metrics adjust live? You don't need to configure a local Python environment to test this out. Click the badge below to jump straight into our interactive Google Colab notebook, plug in your API key, and test the token optimization scripts yourself!