Wednesday, June 10, 2026

GitHub Copilot SDK Tutorial 04 - Custom Tools and Asynchronous Event Streaming

In our last post, we observed how the GitHub Copilot SDK provides out-of-the-box reasoning loop handling, automatically modifying search queries to track down data within localized files.

However, local files only get you so far. To make your AI assistant truly powerful, it needs to interact with external APIs, databases, or production environments.

Today, we will transition our agent from using built-in system tools to using a completely custom tool, a web search tool powered by Firecrawl. Additionally, we will re-architect our telemetry pipeline into an asynchronous event stream. Instead of blocking the execution thread while waiting for a complete answer, we will decouple the message processing loop, formatting real-time updates exactly like a production-ready application streaming data to a user interface.

Part 1: Defining Custom Agent Skills

The GitHub Copilot SDK makes adding custom tools straightforward through the @define_tool decorator. You can declare inputs using Pydantic schemas, which the SDK uses to generate JSON schemas behind the scenes. This allows the LLM to understand what your tool does and what parameters it requires.

We'll build a search wrapper around the Firecrawl API to give our agent real-time access to the live internet:

from pydantic import BaseModel, Field
from copilot import define_tool
import httpx

# Define the schema the LLM will analyze
class FirecrawlSearchParams(BaseModel):
    query: str = Field(..., description="The search query to find information on the web")
    limit: int = Field(default=5, description="Maximum number of results to return")

# Implement the API call logic
async def fetch_firecrawl_results(params: FirecrawlSearchParams) -> dict[str, Any]:
    endpoint = "https://api.firecrawl.dev/v2/search"
    headers = {
        "Authorization": f"Bearer {firecrawl_api_key}",
        "Content-Type": "application/json"
    }
    payload = {"query": params.query, "limit": params.limit}

    async with httpx.AsyncClient() as client:
        response = await client.post(endpoint, json=payload, headers=headers, timeout=30.0)
        response.raise_for_status()
        return response.json()

# Bind the function to the SDK as a custom tool
@define_tool("web_search", description="Search the web")
async def firecrawl_search_tool(params: FirecrawlSearchParams) -> dict[str, Any]:
    return await fetch_firecrawl_results(params)

By exposing firecrawl_search_tool to the model, we give it the capacity to request internet details autonomously whenever its internal training weights fall short.

Part 2: Moving to Asynchronous Event Streaming

In our previous implementations, we called .send_and_wait(), which blocks your application code until the entire multi-turn tool loop completes. This approach doesn't scale well for user-facing applications. If an agent executes three consecutive API queries, your user shouldn't be left staring at a static loading spinner.

To address this, we will build an asynchronous generator loop.

First, we establish an asyncio.Queue to trap events. Our handle_event listener acts as a fast, thread-safe producer that immediately returns control back to the agent:

event_queue = asyncio.Queue()

def handle_event(event):
    # Quickly push the raw event into the queue without blocking
    asyncio.get_event_loop().call_soon_threadsafe(event_queue.put_nowait, event)

Next, we write a consumer generator (event_generator) that pulls raw data out of the queue, sanitizes it, and normalizes it into standardized, UI-friendly dictionaries. When the session sends a SessionIdleData event, the generator gracefully terminates:

async def event_generator():
    while True:
        event = await event_queue.get()
        try:
            if isinstance(event.data, AssistantUsageData):
                yield {"title": "Tokens used", "content": {"input": event.data.input_tokens, "output": event.data.output_tokens}}
            elif isinstance(event.data, ToolExecutionStartData):
                yield {"title": "Tool execution", "content": {"tool_name": event.data.tool_name, "arguments": event.data.arguments}}
            elif isinstance(event.data, AssistantMessageData):
                if event.data.content:
                    yield {"title": "Assistant message", "content": event.data.content}
            elif isinstance(event.data, SessionIdleData):
                break  # The agent is done processing
        finally:
            event_queue.task_done()

Part 3: Running the Async Loop

With our custom web search tool registered and our event queue waiting, we can execute the call using .send(). By pairing this with asyncio.create_task, the agent runs concurrently alongside our consumer loop:

async def main():
    question = "what is Qwen 3.7?"
    async with CopilotClient() as client:
        async with await client.create_session(
            on_permission_request=PermissionHandler.approve_all,
            model=model,
            provider=provider,
            system_message=SystemMessageReplaceConfig(
                mode="replace",
                content="You are a helpful assistant. Use web_search for queries."
            ),
            tools=[firecrawl_search_tool],     # Injecting our tool logic
            available_tools=['web_search']     # Whitelisting the execution capability
        ) as session:
            session.on(handle_event)

            # Fire off the question asynchronously without blocking
            send_task = asyncio.create_task(session.send(question))

            # Stream finalized results cleanly to our console/frontend as they happen
            async map info in event_generator():
                print(f"{info['title']}: {info['content']}")

            await send_task

The Output: Real-Time Telemetry

When we ask about Qwen 3.7, look at the clean, decoupled logs streamed out of our event_generator:

System message: {
  "first_line": "You are a helpful assistant. Use web_search for queries.",
  "content_length": 56
}
Tokens used: {'input': 161, 'output': 24, 'cache_read': 0, 'cache_write': 0}
Tool execution: {'tool_name': 'web_search', 'arguments': {'query': 'what is Qwen 3.7?'}}
Tokens used: {'input': 764, 'output': 266, 'cache_read': 0, 'cache_write': 0}
Assistant message: **Qwen 3.7** is a series of large language models released by Alibaba in May 2026. It is marketed as a significant advancement in AI, particularly regarding agentic workflows, reasoning, and multimodal capabilities. ...

Why This Design Matters

By separating the producer (session.send) from the consumer (event_generator), you get complete control over data streaming:

  • UI Compatibility: You can map the output of event_generator directly to WebSockets or an SSE (Server-Sent Events) web endpoint.
  • Component Tracking: You don't have to wait for the final text block to know if your system worked. The UI can immediately render a specialized component showing exactly what external tool arguments the agent invoked ({'query': 'what is Qwen 3.7?'}) while the tool is running.

In our next guide, we will explore how to put the guardrails to the tools.

Try It Yourself

Want to check out this asynchronous data stream yourself? Click the badge below to jump directly into our interactive Google Colab notebook, swap in your own custom tool configurations, and watch your streaming components update live!

Open In Colab

Tuesday, June 09, 2026

GitHub Copilot SDK Tutorial 03 - Agentic RAG

In our last post of the series, we audited the GitHub Copilot SDK's token usage. We learned how to strip away the "Abstraction Tax" by setting a custom persona and disabling default toolkits, bringing our input overhead down by over 99%.

Now that we have a lean, hyper-efficient engine, it’s time to give our agent some real work.

Traditionally, if you wanted an AI to answer questions based on a corporate document like an employee handbook, you would build a RAG (Retrieval-Augmented Generation) pipeline. You would chunk the text, generate vector embeddings, store them in a database, and perform a mathematical similarity search.

Today, we are going to bypass traditional RAG completely. By introducing agentic tool execution, we will watch our agent autonomously reason, hunt for data, fail, pivot, and ultimately find the right answer all on its own.

The Setup: Armed with File Tools

Instead of serving pre-chewed text chunks to our LLM, we are going to hand our agent raw text files and a couple of command-line tools: grep (for pattern searching) and view (for reading files).

First, let's configure our environment, point our CopilotClient back to our budget-friendly Gemini 3.1 Flash Lite endpoint, and restrict its toolkit to just those two utilities.

async def main():
    question = "What should I do if I am pregnant?"
    async with CopilotClient() as client:
        async with await client.create_session(
            on_permission_request=PermissionHandler.approve_all,
            model=model,
            provider=provider,
            system_message=SystemMessageReplaceConfig(
                mode="replace",
                content="You are a helpful assistant. Answer queries using /content/data/EmployeeHandbook.txt"
            ),
            available_tools=['grep', 'view']  # Giving the agent its tools
        ) as session:

            session.on(handle_usage)
            response = await session.send_and_wait(question, timeout=300)

By adding available_tools=['grep', 'view'] and setting PermissionHandler.approve_all, we are handing the agent keys to the workspace. We aren't telling it how to use them; we are just providing the instructions and standing back.

Watching the Agent Reason in Real-Time

When we run this code against a standard HR Employee Handbook, something fascinating happens under the hood. Let's look at the telemetry logs generated by our event handler:

System message: {
  "first_line": "You are a helpful assistant. Answer queries using `/content/data/EmployeeHandbook.txt`",
  "content_length": 86
}
Tokens used: {'input': 973, 'output': 15, 'cache_read': 0, 'cache_write': 0}
Tool execution: {'tool_name': 'grep', 'arguments': {'pattern': 'pregnan'}}

Tokens used: {'input': 1003, 'output': 15, 'cache_read': 0, 'cache_write': 0}
Tool execution: {'tool_name': 'grep', 'arguments': {'pattern': 'maternity'}}

Tokens used: {'input': 1036, 'output': 22, 'cache_read': 0, 'cache_write': 0}
Tool execution: {'tool_name': 'view', 'arguments': {'path': '/content/data/EmployeeHandbook.txt'}}

Tokens used: {'input': 1105, 'output': 23, 'cache_read': 0, 'cache_write': 0}
Tool execution: {'tool_name': 'grep', 'arguments': {'pattern': 'maternity', 'output_mode': 'content'}}

Tokens used: {'input': 1346, 'output': 170, 'cache_read': 0, 'cache_write': 0}
Assistant message: If you are pregnant, you should notify the company of your intent to take maternity leave **no later than 12 weeks before the expected date of confinement**. ...

The Autonomous Loop Broken Down

Look closely at how the agent reacted when the user asked, "What should I do if I am pregnant?"

  1. The First Attempt (Grep 'pregnan'): The agent automatically uses grep to look for variations of the word "pregnant". However, our specific corporate handbook doesn't use that word in its headings. The search returns nothing.
  2. The Pivot (Query Rewriting): In a traditional keyword search, the loop would end here with an unhelpful "I couldn't find anything." But an agent possesses reasoning capabilities. It realizes "pregnant" relates to "maternity leave," so it autonomously rewrites its search intent and fires a second grep for 'maternity'.
  3. The Deep Dive (View File): After finding hits, it calls the view tool to read the specific sections of /content/data/EmployeeHandbook.txt.
  4. The Final Synthesis: It pulls the relevant paragraphs, calculates the parameters, and delivers a beautifully structured answer outlining the 12-week notice requirement and confinement details.

Agentic Workflows vs. Traditional RAG

This highlights a fundamental shift in how we handle unstructured data:

The Core Difference: Traditional RAG relies heavily on embeddings to find semantic similarities between terms. If your vector math or chunking strategies are slightly off, relevant context gets missed. An Agentic Workflow solves this via iterative reasoning: it can notice a tool returned a blank result, rethink its strategy, and try alternative terms autonomously.

The Agentic Trade-Off

This incredible intelligence isn't completely free. You will notice two clear trade-offs when shifting from standard API bots to true agents:

  • Higher Token Consumption: Because the agent is running multiple back-and-forth loops, evaluating tool outputs, and re-submitting context, it uses significantly more input tokens per query.
  • Increased Latency: Waiting for multiple execution loops means responses take seconds rather than milliseconds.

The Good News: In production environments, Prompt Caching heavily mitigates these costs. While the free-tier Gemini API doesn't have it enabled by default, production endpoints allow cached system instructions and document states to be read at roughly one-tenth of the standard input token cost. Given the massive leap in answering reliability, it's an incredibly good deal.

In our next tutorial, we will take this a step further and look at how to build and register our own completely custom Python functions as agent skills.

Try It Yourself

Want to watch the agent hunt through the employee handbook live? Click the badge below to jump directly into the interactive Google Colab notebook, run the telemetry audits, and try changing the questions to see how the agent adapts its tool usage!

Open In Colab

Monday, June 08, 2026

GitHub Copilot SDK Tutorial 02 - The Abstraction Tax

In our last post in the series, we looked at how the GitHub Copilot SDK allows us to spin up a fully stateful AI agent loop in under 20 lines of code. It feels like magic. By abstracting away conversation history, tool definitions, and orchestration loops, the SDK lets you focus entirely on building.

But in software engineering, magic always comes with a bill. In the world of AI agents, that bill is paid in tokens.

When you wrap your LLM inside a high-level framework, you subject yourself to what I call the Abstraction Tax - hidden prompt context and infrastructure bloat that happens entirely under the hood. Today, we are going to look at how to audit your agent's token efficiency using event handling, peek at what the Copilot SDK is actually whispering to your model, and learn how to slash your token usage by over 99%.

Peeking Under the Hood: Event Handling

To understand what our agent is doing behind our backs, we need visibility. Fortunately, the GitHub Copilot SDK features a robust event-driven architecture. By registering a listener via session.on(), we can intercept real-time telemetry like system message composition and precise token consumption metrics.

Here is the setup we will use to audit our agent's efficiency:

from copilot.generated.session_events import AssistantUsageData, SystemMessageData

def handle_usage(event):  
    if isinstance(event.data, AssistantUsageData):  
        print("Tokens used:", {  
            "input": event.data.input_tokens,  
            "output": event.data.output_tokens,  
            "cache_read": event.data.cache_read_tokens,  
            "cache_write": event.data.cache_write_tokens,  
        })
    elif isinstance(event.data, SystemMessageData):
        print("System message:", json.dumps({
            "first_line": event.data.content.split('\n')[0],
            "content_length": len(event.data.content)
        }, indent=2))
In the session context, we register the event handler:
# Attach our event handler to audit the session
session.on(handle_usage)  

The Default State: Paying the Full Tax

When we run the code exactly as written above—asking a simple math question ("What is 2 + 2?")—look at what the SDK actually outputs before giving us the answer:

System message: {
  "first_line": "You are the GitHub Copilot CLI, a terminal assistant built by GitHub. You are an interactive CLI tool that helps users with software engineering tasks.",
  "content_length": 26220
}
Tokens used: {'input': 13500, 'output': 8, 'cache_read': 12192, 'cache_write': 0}
2 + 2 is 4.

The Breakdown

  • System Prompt Length: 26,220 characters.

  • Input Tokens: 13,500 tokens.

  • The Reality Check: To answer a 5-token question, the SDK processed 13,500 tokens.

Because GitHub Copilot was natively engineered as a coding assistant, the SDK automatically injects a massive, coding-centric system persona and tool environment. While prompt caching (noted by the cache_read tokens) helps mitigate the latency and cost, carrying a 26k-character background system prompt for a non-coding persona is incredibly inefficient.

Phase 1: Reclaiming the Persona

If your agent is meant to be a customer support bot, a creative writer, or a simple calculator, it shouldn't be masquerading as a terminal assistant. We can strip away this default background context adding the system_message configuration parameter in client.create_session() function and replacing it with a lean, custom prompt:

system_message=SystemMessageReplaceConfig(
    mode="replace",
    content="You are a helpful assistant."
)

Running the code now gives us a drastically different profile:

System message: {
  "first_line": "You are a helpful assistant.",
  "content_length": 28
}
Tokens used: {'input': 7149, 'output': 8, 'cache_read': 0, 'cache_write': 0}
2 + 2 is 4.

The Breakdown

  • System Prompt Length: Dropped from 26,220 characters to just 28 characters.

  • Input Tokens: Cut in half, down to 7,149 tokens.

This is a massive step forward, but 7,149 tokens for a simple arithmetic question is still an incredibly steep abstraction tax. Where is the remaining bulk coming from if our system message is only 28 characters long?

The answer lies in the default tool definitions that the SDK implicitly injects to give your agent its autonomy.

Phase 2: Eliminating the Tool Bloat

To achieve true token efficiency, we must explicitly manage the tools available to the agent. If an agent does not require external terminal utilities or file management capabilities to fulfill its task, we should strip them out entirely.

Adding available_tools=[] parameter, we pass an empty array to the session, telling the SDK to leave its default toolkits at home.

system_message=SystemMessageReplaceConfig(
    mode="replace",
    content="You are a helpful assistant."
),
available_tools=[]

Let's look at the optimized output:

System message: {
  "first_line": "You are a helpful assistant.",
  "content_length": 28
}
Tokens used: {'input': 56, 'output': 8, 'cache_read': 0, 'cache_write': 0}
2 + 2 is 4.

The Breakdown

  • System Prompt Length: 28 characters.

  • Input Tokens: 56 tokens.

The Verdict: Optimization Payoff

By strategically taking control of our prompt composition and tool alignment, we achieved a staggering reduction in overhead:

Configuration System Prompt Length Input Tokens Token Reduction
Default SDK Behavior 26,220 chars 13,500 Baseline
Custom Persona Only 28 chars 7,149 ~47%
Custom Persona + Explicit Tools 28 chars 56 99.58%

Key Takeaway: High-level frameworks like the GitHub Copilot SDK are incredibly accelerative, but they make heavy assumptions about your agent's use case. If you don't audit your agent's event loop, you risk burning millions of unnecessary tokens on default workflows that don't match your intended application.

Always tailor your agent's environment to its specific mission: define an explicit system persona and provision only the exact tools it needs to get the job done.

Try It Yourself

Want to see these metrics adjust live? You don't need to configure a local Python environment to test this out. Click the badge below to jump straight into our interactive Google Colab notebook, plug in your API key, and test the token optimization scripts yourself!

Open In Colab

Sunday, June 07, 2026

How 20-Year-Old Software Principles Govern Today's AI Agents

In our previous post, we explored the mathematical realities of agentic LLM architectures, detailing how deterministic harnesses and human-in-the-loop (HITL) evaluations prevent cascading failures. But if we step back from the probability formulas, a profound realization emerges: the architectural challenges we face with AI agents today are almost identical to the human management challenges we faced two decades ago.

In 2006, I co-authored a paper titled "Degree of Freedom - Experience of Applying Software Framework". Back then, we were not managing Large Language Models; we were managing global virtual system development teams. Yet, the core software engineering principle we applied then is exactly what we rely on now. We just use a different vocabulary.

Whether dealing with offshore human developers in 2006 or autonomous digital colleagues in 2026, unchecked freedom in a complex system inevitably leads to entropy.

The 2006 Problem: The Design-Implementation Gap

Twenty years ago, a major challenge in programming-in-the-large (PITL) was managing the control between upstream design teams and downstream implementation teams. Ideally, an implementation team would take a user-requirement-based design and execute it closely. In practice, this was incredibly difficult to enforce across different geographical locations and time zones.

We realized that the primary cause of this design-implementation gap was the degree of freedom granted to the implementation developers. Given system design, developers simply had too many ways to implement it.

When developers are given too much freedom, human nature and individual coding habits take over:

  • Developers would frequently bypass the structure of the original design by adding new classes to make their own coding easier.
  • Junior developers, failing to recognize necessary architectural layers, would mix persistence operations, presentation logic, and business operations all into a single server page.
  • Developers would bring in their own pre-existing habits, causing the implementation to drift further from the intended design.

Object-oriented programming was supposed to impose discipline, but we discovered that OO without strict discipline was just as chaotic as unstructured programming with goto statements.

The 2006 Solution: Limiting Freedom via Frameworks

To bridge this gap, we adopted and eventually open-sourced a software frameworks (Struts, Hibernate,  MVC pattern, etc). Our original goal was to standardize development and reuse major components. However, the most powerful benefit came as a surprise: the framework allowed us to efficiently manage the design-implementation gap.

The frameworks solved our problems by fundamentally limiting the developers' choices:

  • It constrained developers, forcing them to follow the framework's structure rather than improvising their own designs to make things work.
  • It homogenized the highly diversified coding styles and habits of developers, which was especially crucial in development centers with high turnover rates.
  • It made any deviations highly visible; if a developer tried to overload a single page with too much logic, the framework violation became immediately obvious during code reviews.

By limiting the degree of freedom, we standardized the process and ensured reliability.

Fast Forward to Today: Harness Engineering for Digital Colleagues

Today, we are building agentic pipelines where LLMs act as our "downstream implementation team." And once again, we are facing a massive design-implementation gap.

An LLM has a near-infinite degree of freedom. If you give an open-ended prompt to an LLM, it has billions of latent pathways it can take to generate an answer. Without constraints, it will "improvise"—which we now call hallucinating. It will mix logic, ignore the intended architecture, and output unstructured data, much like a junior developer in 2006 trying to cram all business and persistence logic into a single script.

In our previous post, we discussed how to fix this using Harness Engineering. A harness is simply the 2026 term for a framework.

When we build a deterministic harness around an LLM (such as enforcing strict JSON schemas, unit test validation, or routing the agent through a rigid LangGraph state machine), we are doing exactly what the frameworks did two decades ago. We are removing the agent's degree of freedom.

  • Instead of hoping the AI formats correctly, a structural harness forces it into predefined operational layers.
  • Instead of relying on the AI's "creativity," a deterministic evaluator acts as the ultimate code review, instantly catching any deviation from the required schema.
  • Instead of unconstrained generation, we limit its choices so that its output is homogenized, predictable, and manageable.

The Enduring Principle of Software Engineering

Technology evolves at a blinding pace, but the fundamental laws of systems engineering do not. The transition from human offshore teams to digital LLM colleagues has not changed the rules of the game.

Discipline is the bedrock of production software. Twenty years ago, we instilled that discipline by building frameworks to constrain human developers. Today, we instill it by building deterministic harnesses to constrain AI agents. The names have changed, but the secret to reliable architecture remains exactly the same: carefully control the degree of freedom.

Saturday, June 06, 2026

Beyond p^n: The Mathematics of Agentic Reliability and Harness Engineering

The transition from a linear Retrieval-Augmented Generation (RAG) pipeline to an agentic architecture is often framed as an upgrade in "intelligence." But under the hood, it is fundamentally an upgrade in probability.
If you have built a naive, multi-step LLM pipeline, you have likely encountered cascading failure. This post explores the exact mathematics of why linear pipelines collapse, how agentic retry loops fix them, and why your system is ultimately only as strong as your evaluation harness.

1. The Open-Loop Trap: The Math of "AND"

In a standard, fixed pipeline (e.g., Retrieve -> Rerank -> Synthesize -> Format), the architecture is an open loop. To get a successful final output, the LLM must succeed at Step 1, AND Step 2, AND Step 3.
If your underlying model has a zero-shot accuracy of p for any given task, the probability of successfully navigating an n-step pipeline is the product of those probabilities:
$$P(\text{success}) = p^n$$
Because p is a fraction, p^n decays exponentially. If your model gets it right 80% of the time (p = 0.8), a 4-step pipeline has a success rate of just 40.9%. The system is mathematically guaranteed to degrade as it scales.

2. The Agentic Loop: The Math of "OR"

Agentic architectures replace the "AND" with an "OR" by introducing a feedback loop. Instead of blindly passing data to the next node, an agent evaluates the result. If the result is flawed, it tries again, up to a limit of k retries.
Assuming the system can perfectly detect when an error occurs, the probability of failing a step is no longer (1-p). It is the probability of failing every single retry: (1-p)^k.
Subtracting this from 100% gives us the probability of succeeding at least once during those k attempts:
$$P(\text{step success}) = 1 - (1-p)^k$$
With p = 0.8 and k = 3, the success rate for a single step jumps from 80% to 99.2%. When you multiply these fortified nodes together, the pipeline stabilizes.

3. The Evaluator Bottleneck: Enter q

The previous equation makes a massive, often fatal assumption: that the agent perfectly recognizes its own failures. In reality, the evaluation step has its own probability of correctness, which we call q. When the agent generates an answer, the evaluator might falsely accept a hallucination (False Positive) or falsely reject a correct answer (False Negative).
When we account for an imperfect evaluator q, the math shifts from basic probability into a stochastic Markov chain. The master equation for a single agentic step becomes:
$$P(\text{step success}) = \left( \frac{p \cdot q}{p \cdot q + (1-p)(1-q)} \right) \left( 1 - (p(1-q) + (1-p)q)^k \right)$$

The False Positive Ceiling

Notice the fraction on the left side of the equation: $$\frac{p \cdot q}{p \cdot q + (1-p)(1-q)}$$.
This represents the absolute mathematical ceiling of your agent. Even if you give the agent infinite retries (k = infinity), it can never exceed this success rate. If both your generator and evaluator have an accuracy of 70% (p = 0.7, q = 0.7), your step's maximum possible success rate is capped at roughly 84%. False positives will eventually let bad data leak through the loop.

4. Harness Engineering: Breaking the Ceiling

To maximize the master equation, we must maximize q. This is the domain of Harness Engineering - the design of the environment that catches, tests, and evaluates the agent's output.
Harnesses generally fall into two categories:

Indeterministic Harness (LLM-as-a-Judge)

This relies on another prompt (or a separate model) to evaluate the output. "Does this summary look accurate?"
  • The Problem: q remains fractional. You are fighting probability with probability.
  • The Mitigation: If you must use an LLM judge, the task must be highly asymmetrical. Evaluating is easier than generating, so q is naturally higher than p. Furthermore, developers should route evaluations to heavier, highly capable models to push q as close to 0.99 as economically feasible.

Deterministic Harness (The Golden Rule)

The most robust agents do not use LLMs to check LLMs. They offload evaluation to strict, true/false code.
  • Format Checking: Did the model output pure, valid JSON, or did it break the code by adding conversational filler like "Here is your JSON:"?
  • Execution Checking: If the agent wrote a block of code, did it actually run successfully without crashing?
  • Rule Counting: If the prompt asked for exactly five product recommendations, does len(list) == 5?
By using code as the harness, q becomes 1.0.
When q = 1.0, the mathematical ceiling shatters. The left side of our master equation becomes 1, and the formula collapses back to the ideal 1 - (1-p)^k. A deterministic harness allows an agent to safely iterate until it hits a guaranteed success.

5. Human-in-the-Loop (HITL) as the Ultimate Harness

There are scenarios where a deterministic harness is impossible to build, and an indeterministic harness is too risky. If an agent is writing a sensitive client email or executing a live database mutation, a false positive is catastrophic.
This is where Human-in-the-Loop (HITL) architecture becomes a mathematical necessity.
HITL should not be viewed as a failure of automation; it is a dynamic routing strategy. When an agentic step involves high-stakes subjectivity, the retry loop routes the evaluation to a human operator.
  1. The human acts as an evaluator with a functional q ≈ 1.0.
  2. The human can either accept the output (Terminal Success), or reject it and provide targeted feedback.
  3. The LLM processes the human feedback on the next iteration, artificially boosting its generation probability (p) for the subsequent attempt.

Conclusion

Building a production-ready agentic system is not about choosing a smarter foundational model. It is about actively managing the mathematical probabilities of your pipeline. By keeping steps short (n), leveraging retry loops (k), strictly enforcing deterministic harnesses wherever possible (q = 1.0), and utilizing HITL for high-risk evaluations, you can engineer a system that naturally self-corrects and vastly outperforms traditional linear RAG.

Friday, June 05, 2026

GitHub Copilot SDK Tutorial 01 - Agent 101

Historically, building an AI agent meant manually writing custom orchestration loops and stitching tools together from scratch. However, with the rise of sophisticated coding assistants, we can now leverage their production-ready SDKs to build our own custom agents. This allows us to tap into the exact same underlying infrastructure as your favorite AI tools—complete with native support for agent skills, memory, and robust tool execution.

The GitHub Copilot SDK is a powerful framework designed for this exact purpose. This step-by-step tutorial series will guide you from building a basic, foundational agent all the way to deploying an enterprise-class AI assistant.

Understanding the Architecture: Chatbots vs. Agents

When building a traditional chatbot, you typically make direct, stateless API calls to a LLM. This leaves you responsible for manually managing conversation history and writing custom loops to sustain a continuous dialogue.

An agent framework like the GitHub Copilot SDK abstracts this complexity away by structuring interactions into two core components:

  • The Client (CopilotClient): Act as the bridge between your local environment and the AI infrastructure.
  • The Session (via client.create_session): Represents a continuous, stateful interaction. Unlike a one-off API call, a session automatically retains context, manages conversation history, and tracks ongoing sub-tasks.

Step 1: Configure Your LLM Provider

Before spinning up our agent loop, we need to define our model and provider configuration. A major benefit of this setup is flexibility: you do not even need an active GitHub Copilot subscription to get started.

For this demonstration, we will connect to Gemini’s OpenAI-compatible endpoint using the lightweight and fast Gemini 3.1 Flash Lite model.

provider = {
    "type": "openai",
    "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "wire_api": "completions",
    "api_key": os.getenv("GEMINI_API_KEY")
}

model="gemini-3.1-flash-lite"

Step 2: Initialize the Agent Loop

With our provider configured, we can now initialize the CopilotClient, establish a stateful session, and send our first asynchronous query.

question = "What is 2 + 2?"
async with CopilotClient() as client:
    async with await client.create_session(
        on_permission_request=PermissionHandler.approve_all,
        model=model,
        provider=provider
    ) as session:
        response = await session.send_and_wait(question)
        print(response.data.content)

Note: By setting PermissionHandler.approve_all, we are giving the agent permission to execute tools autonomously within this session context—a fundamental trait of true AI agency.

We have successfully configured and executed the very first stateful AI agent using the GitHub Copilot SDK in less than 20 lines of code. While a basic query is a great starting point, the real magic happens when we begin giving our agent actual autonomy.

If you want to see this code run live without setting up a local environment, click the badge below to jump straight into the interactive notebook and test it out.

Open In Colab

Monday, February 24, 2025

The Two-Hour Prompt: Why Good AI Instructions Take Time

I need a script to build a graph of some Wikipedia pages. Instead of coding it myself, I experimented with using an LLM to generate it for me. After spending two hours in front of my computer, here is the prompt I have written:

It is a short prompt if you remove the JSON sample. You might think it shouldn't take two hours to write—but think again.

Programming is the process of instructing a computer on exactly what to do. So, when I was writing the prompt, I was designing how the script would work. At the same time, I had to figure out how to gather the right data by studying the best way to retrieve it.

The resulting prompt is a pseudocode-level specification. The time invested was worth it because it worked right away. There were minor bugs, but they were very easy to fix.

Here is what Grok 3 commented:

In short, the prompt’s apparent simplicity hides a dense web of interlocking steps, each requiring careful thought, validation, and articulation. Two hours is reasonable for distilling such a process into a coherent set of instructions, especially if you were simultaneously designing the workflow and documenting it. It’s a bit like writing code and its documentation at the same time—except you’re doing it in natural language, which adds an extra layer of effort to keep it intuitive yet precise.

Yes, you can do Vibe Coding—blindly accepting AI suggestions, copying error messages, and hoping for the best for hobby projects. If you are working on real projects with deadlines, it is better to learn software engineering properly.