Sunday, June 07, 2026

How 20-Year-Old Software Principles Govern Today's AI Agents

In our previous post, we explored the mathematical realities of agentic LLM architectures, detailing how deterministic harnesses and human-in-the-loop (HITL) evaluations prevent cascading failures. But if we step back from the probability formulas, a profound realization emerges: the architectural challenges we face with AI agents today are almost identical to the human management challenges we faced two decades ago.

In 2006, I co-authored a paper titled "Degree of Freedom - Experience of Applying Software Framework". Back then, we were not managing Large Language Models; we were managing global virtual system development teams. Yet, the core software engineering principle we applied then is exactly what we rely on now. We just use a different vocabulary.

Whether dealing with offshore human developers in 2006 or autonomous digital colleagues in 2026, unchecked freedom in a complex system inevitably leads to entropy.

The 2006 Problem: The Design-Implementation Gap

Twenty years ago, a major challenge in programming-in-the-large (PITL) was managing the control between upstream design teams and downstream implementation teams. Ideally, an implementation team would take a user-requirement-based design and execute it closely. In practice, this was incredibly difficult to enforce across different geographical locations and time zones.

We realized that the primary cause of this design-implementation gap was the degree of freedom granted to the implementation developers. Given a system design, developers simply had too many ways to implement it.

When developers are given too much freedom, human nature and individual coding habits take over:

  • Developers would frequently bypass the structure of the original design by adding new classes to make their own coding easier.
  • Junior developers, failing to recognize necessary architectural layers, would mix persistence operations, presentation logic, and business operations all into a single server page.
  • Developers would bring in their own pre-existing habits, causing the implementation to drift further from the intended design.

Object-oriented programming was supposed to impose discipline, but we discovered that OO without strict discipline was just as chaotic as unstructured programming with goto statements.

The 2006 Solution: Limiting Freedom via Frameworks

To bridge this gap, we adopted and eventually open-sourced a software framework (Struts, Hibernate, the MVC pattern, etc.). Our original goal was to standardize development and reuse major components. However, the most powerful benefit came as a surprise: the framework allowed us to efficiently manage the design-implementation gap.

The framework solved our problems by fundamentally limiting the developers' choices:

  • It constrained developers, forcing them to follow the framework's structure rather than improvising their own designs to make things work.
  • It homogenized the highly diversified coding styles and habits of developers, which was especially crucial in development centers with high turnover rates.
  • It made any deviations highly visible; if a developer tried to overload a single page with too much logic, the framework violation became immediately obvious during code reviews.

By limiting the degree of freedom, we standardized the process and ensured reliability.

Fast Forward to Today: Harness Engineering for Digital Colleagues

Today, we are building agentic pipelines where LLMs act as our "downstream implementation team." And once again, we are facing a massive design-implementation gap.

An LLM has a near-infinite degree of freedom. If you give an open-ended prompt to an LLM, it has billions of latent pathways it can take to generate an answer. Without constraints, it will "improvise"—which we now call hallucinating. It will mix logic, ignore the intended architecture, and output unstructured data, much like a junior developer in 2006 trying to cram all business and persistence logic into a single script.

In our previous post, we discussed how to fix this using Harness Engineering. A harness is simply the 2026 term for a framework.

When we build a deterministic harness around an LLM (such as enforcing strict JSON schemas, unit test validation, or routing the agent through a rigid LangGraph state machine), we are doing exactly what the frameworks did two decades ago. We are removing the agent's degree of freedom.

  • Instead of hoping the AI formats correctly, a structural harness forces it into predefined operational layers.
  • Instead of relying on the AI's "creativity," a deterministic evaluator acts as the ultimate code review, instantly catching any deviation from the required schema.
  • Instead of unconstrained generation, we limit its choices so that its output is homogenized, predictable, and manageable.
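As a rough illustration of "removing degrees of freedom", the sketch below models a harness as a plain-Python state machine. It does not use LangGraph, and the stage names and the `ALLOWED_TRANSITIONS` table are hypothetical; the point is only that the agent may propose a next step, but the harness decides which steps are legal.

```python
# Hypothetical stage names; a real harness would define its own.
ALLOWED_TRANSITIONS = {
    "plan": {"retrieve"},
    "retrieve": {"draft"},
    "draft": {"validate"},
    "validate": {"draft", "done"},  # a failed validation loops back to drafting
    "done": set(),
}


def next_state(current: str, proposed: str) -> str:
    """Return the proposed state if the harness allows it, otherwise raise."""
    if proposed not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"transition {current!r} -> {proposed!r} is not allowed")
    return proposed


def run(proposals: list[str], start: str = "plan") -> str:
    """Replay a list of agent-proposed states through the harness."""
    state = start
    for proposed in proposals:
        state = next_state(state, proposed)
    return state


if __name__ == "__main__":
    print(run(["retrieve", "draft", "validate", "done"]))  # a permitted path
    try:
        run(["draft"])  # skipping retrieval is rejected immediately
    except ValueError as err:
        print("rejected:", err)
```

Much like the framework violations that stood out in a 2006 code review, an illegal transition here fails loudly instead of drifting silently into the output.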

The Enduring Principle of Software Engineering

Technology evolves at a blinding pace, but the fundamental laws of systems engineering do not. The transition from human offshore teams to digital LLM colleagues has not changed the rules of the game.

Discipline is the bedrock of production software. Twenty years ago, we instilled that discipline by building frameworks to constrain human developers. Today, we instill it by building deterministic harnesses to constrain AI agents. The names have changed, but the secret to reliable architecture remains exactly the same: carefully control the degree of freedom.

Saturday, June 06, 2026

Beyond p^n: The Mathematics of Agentic Reliability and Harness Engineering

The transition from a linear Retrieval-Augmented Generation (RAG) pipeline to an agentic architecture is often framed as an upgrade in "intelligence." But under the hood, it is fundamentally an upgrade in probability.
If you have built a naive, multi-step LLM pipeline, you have likely encountered cascading failure. This post explores the exact mathematics of why linear pipelines collapse, how agentic retry loops fix them, and why your system is ultimately only as strong as your evaluation harness.

1. The Open-Loop Trap: The Math of "AND"

In a standard, fixed pipeline (e.g., Retrieve -> Rerank -> Synthesize -> Format), the architecture is an open loop. To get a successful final output, the LLM must succeed at Step 1, AND Step 2, AND Step 3, AND every step after that.
If your underlying model has a zero-shot accuracy of p for any given task, the probability of successfully navigating an n-step pipeline is the product of those probabilities:
$$P(\text{success}) = p^n$$
Because p is a fraction, p^n decays exponentially. If your model gets it right 80% of the time (p = 0.8), a 4-step pipeline has a success rate of just 0.8^4 ≈ 41%. The system is mathematically guaranteed to degrade as it scales.
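To make the decay concrete, here is a minimal sketch that evaluates p^n for a few pipeline lengths. The function name `pipeline_success` is ours, and the model assumes every step succeeds independently with the same probability p.

```python
def pipeline_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with accuracy p."""
    return p ** n


if __name__ == "__main__":
    # Print the success rate of an open-loop pipeline as it grows longer.
    for n in (1, 2, 4, 8):
        print(f"n={n}: {pipeline_success(0.8, n):.4f}")
```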

2. The Agentic Loop: The Math of "OR"

Agentic architectures replace the "AND" with an "OR" by introducing a feedback loop. Instead of blindly passing data to the next node, an agent evaluates the result. If the result is flawed, it tries again, up to a limit of k attempts.
Assuming the system can perfectly detect when an error occurs, the probability of failing a step is no longer (1-p). It is the probability of failing every single attempt: (1-p)^k.
Subtracting this from 100% gives us the probability of succeeding at least once during those k attempts:
$$P(\text{step success}) = 1 - (1-p)^k$$
With p = 0.8 and k = 3, the success rate for a single step jumps from 80% to 99.2%. When you multiply these fortified nodes together, the pipeline stabilizes.
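The same arithmetic can be checked in a few lines. This is a sketch under the section's assumption of a perfect evaluator; the function names are illustrative.

```python
def step_success_with_retries(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k


def pipeline_success_with_retries(p: float, n: int, k: int) -> float:
    """Probability that all n steps succeed when each step gets k attempts."""
    return step_success_with_retries(p, k) ** n


if __name__ == "__main__":
    print(f"single step:     {step_success_with_retries(0.8, 3):.4f}")
    print(f"4-step pipeline: {pipeline_success_with_retries(0.8, 4, 3):.4f}")
```

With p = 0.8, k = 3, and n = 4, the whole pipeline lands at roughly 97%, compared with about 41% for the open-loop version.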

3. The Evaluator Bottleneck: Enter q

The previous equation makes a massive, often fatal assumption: that the agent perfectly recognizes its own failures. In reality, the evaluation step has its own probability of correctness, which we call q. When the agent generates an answer, the evaluator might falsely accept a hallucination (False Positive) or falsely reject a correct answer (False Negative).
When we account for an imperfect evaluator q, the math shifts from basic probability into a stochastic Markov chain. The master equation for a single agentic step becomes:
$$P(\text{step success}) = \left( \frac{p \cdot q}{p \cdot q + (1-p)(1-q)} \right) \left( 1 - (p(1-q) + (1-p)q)^k \right)$$
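For readers who want to see where this comes from, here is a sketch of the derivation. It assumes attempts are independent and that the evaluator is right with the same probability q on correct and incorrect outputs; both are modeling assumptions. On each attempt, the agent accepts a correct answer with probability $pq$, accepts an incorrect one with probability $(1-p)(1-q)$, and otherwise rejects and retries with probability $r = p(1-q) + (1-p)q$. Success means accepting a correct answer on some attempt after rejecting on all earlier ones:

$$P(\text{step success}) = \sum_{i=0}^{k-1} r^{i} \, p q = p q \cdot \frac{1 - r^{k}}{1 - r}, \qquad 1 - r = p \cdot q + (1-p)(1-q)$$

Substituting $r$ and $1 - r$ gives the master equation.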

The False Positive Ceiling

Notice the fraction on the left side of the equation: $$\frac{p \cdot q}{p \cdot q + (1-p)(1-q)}$$.
This represents the absolute mathematical ceiling of your agent. Even if you give the agent infinite retries (k = infinity), it can never exceed this success rate. If both your generator and evaluator have an accuracy of 70% (p = 0.7, q = 0.7), your step's maximum possible success rate is capped at roughly 84%. False positives will eventually let bad data leak through the loop.
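The ceiling is easy to verify numerically. The sketch below implements the closed form and cross-checks it with a small Monte Carlo simulation. The function names are ours, and the simulation assumes independent attempts and an evaluator that is right with probability q regardless of whether the answer is correct.

```python
import random


def ceiling(p: float, q: float) -> float:
    """Limit of the step success rate as k grows (requires a nonzero denominator)."""
    return (p * q) / (p * q + (1 - p) * (1 - q))


def step_success(p: float, q: float, k: int) -> float:
    """Closed-form master equation for one agentic step with k attempts."""
    reject = p * (1 - q) + (1 - p) * q  # probability an attempt is rejected
    return ceiling(p, q) * (1 - reject ** k)


def simulate(p: float, q: float, k: int, trials: int = 200_000, seed: int = 0) -> float:
    """Estimate the step success rate by simulating the retry loop."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        for _ in range(k):
            correct = rng.random() < p
            evaluator_right = rng.random() < q
            # A right evaluator accepts correct answers; a wrong one accepts bad ones.
            accepted = correct if evaluator_right else not correct
            if accepted:
                successes += correct  # only an accepted correct answer counts
                break
    return successes / trials


if __name__ == "__main__":
    print(f"ceiling(0.7, 0.7) = {ceiling(0.7, 0.7):.4f}")
    for k in (1, 3, 10):
        print(k, f"{step_success(0.7, 0.7, k):.4f}", f"{simulate(0.7, 0.7, k):.4f}")
```

However large k gets, the closed-form value and the simulated value both stay below the ceiling.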

4. Harness Engineering: Breaking the Ceiling

To maximize the master equation, we must maximize q. This is the domain of Harness Engineering - the design of the environment that catches, tests, and evaluates the agent's output.
Harnesses generally fall into two categories:

Indeterministic Harness (LLM-as-a-Judge)

This relies on another prompt (or a separate model) to evaluate the output. "Does this summary look accurate?"
  • The Problem: q remains fractional. You are fighting probability with probability.
  • The Mitigation: If you must use an LLM judge, pick tasks that are highly asymmetrical, where evaluating an answer is easier than generating it, so that q is naturally higher than p. Furthermore, route evaluations to heavier, highly capable models to push q as close to 0.99 as is economically feasible.

Deterministic Harness (The Golden Rule)

The most robust agents do not use LLMs to check LLMs. They offload evaluation to strict, true/false code.
  • Format Checking: Did the model output pure, valid JSON, or did it break the code by adding conversational filler like "Here is your JSON:"?
  • Execution Checking: If the agent wrote a block of code, did it actually run successfully without crashing?
  • Rule Counting: If the prompt asked for exactly five product recommendations, does len(list) == 5?
By using code as the harness, q for the property being checked becomes effectively 1.0.
When q = 1.0, the mathematical ceiling shatters. The left side of our master equation becomes 1, and the formula collapses back to the ideal 1 - (1-p)^k. A deterministic harness allows an agent to iterate safely until an output passes the check or the retry budget k runs out.
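A minimal sketch of such a harness, combining the format check and the rule count from the list above with a bounded retry loop. The `generate` callable stands in for a model call and is purely hypothetical; here it is replaced by canned strings so the example runs offline.

```python
import json


def check_recommendations(raw: str, expected_count: int = 5) -> bool:
    """Deterministic check: valid JSON, a list, and exactly expected_count items."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # e.g. conversational filler around the JSON
    return isinstance(data, list) and len(data) == expected_count


def run_with_retries(generate, check, k: int):
    """Call generate(attempt) up to k times; return the first output that passes."""
    for attempt in range(1, k + 1):
        output = generate(attempt)
        if check(output):
            return output
    return None  # retry budget exhausted; the caller decides what happens next


if __name__ == "__main__":
    # Canned outputs standing in for three model attempts.
    canned = [
        'Here is your JSON: ["a", "b", "c", "d", "e"]',  # filler breaks parsing
        '["a", "b", "c"]',                               # wrong item count
        '["a", "b", "c", "d", "e"]',                     # passes both checks
    ]
    print(run_with_retries(lambda attempt: canned[attempt - 1], check_recommendations, k=3))
```

Returning `None` when the budget runs out keeps the failure explicit, so the caller can escalate rather than pass bad data downstream.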

5. Human-in-the-Loop (HITL) as the Ultimate Harness

There are scenarios where a deterministic harness is impossible to build, and an indeterministic harness is too risky. If an agent is writing a sensitive client email or executing a live database mutation, a false positive is catastrophic.
This is where Human-in-the-Loop (HITL) architecture becomes a mathematical necessity.
HITL should not be viewed as a failure of automation; it is a dynamic routing strategy. When an agentic step involves high-stakes subjectivity, the retry loop routes the evaluation to a human operator.
  1. The human acts as an evaluator with a functional q ≈ 1.0.
  2. The human can either accept the output (Terminal Success), or reject it and provide targeted feedback.
  3. The LLM processes the human feedback on the next iteration, artificially boosting its generation probability (p) for the subsequent attempt.
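The three steps above can be sketched as a small loop. Everything here is illustrative: `generate` stands in for the LLM call, `review` for the human operator, and the demo uses canned functions rather than a real model or a real person.

```python
from typing import Callable, Optional, Tuple

Review = Tuple[bool, str]  # (approved, feedback)


def hitl_loop(
    generate: Callable[[str], str],
    review: Callable[[str], Review],
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate a draft, ask the reviewer, and feed any feedback into the next draft."""
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(feedback)
        approved, feedback = review(draft)
        if approved:
            return draft  # terminal success
    return None  # no approved draft within the round limit


if __name__ == "__main__":
    def fake_generate(feedback: str) -> str:
        # Stand-in for an LLM that improves once it receives feedback.
        if "formal" in feedback:
            return "Dear client, thank you for your patience."
        return "hey!"

    def fake_review(draft: str) -> Review:
        # Stand-in for a human operator.
        if draft.startswith("Dear"):
            return True, ""
        return False, "Please use a formal greeting."

    print(hitl_loop(fake_generate, fake_review))
```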

Conclusion

Building a production-ready agentic system is not about choosing a smarter foundational model. It is about actively managing the mathematical probabilities of your pipeline. By keeping the number of steps small (n), leveraging retry loops (k), strictly enforcing deterministic harnesses wherever possible (q = 1.0), and utilizing HITL for high-risk evaluations, you can engineer a system that naturally self-corrects and vastly outperforms traditional linear RAG.

Friday, June 05, 2026

GitHub Copilot SDK Tutorial 01 - Agent 101

Historically, building an AI agent meant manually writing custom orchestration loops and stitching tools together from scratch. However, with the rise of sophisticated coding assistants, we can now leverage their production-ready SDKs to build our own custom agents. This allows us to tap into the exact same underlying infrastructure as your favorite AI tools—complete with native support for agent skills, memory, and robust tool execution.

The GitHub Copilot SDK is a powerful framework designed for this exact purpose. This step-by-step tutorial series will guide you from building a basic, foundational agent all the way to deploying an enterprise-class AI assistant.

Understanding the Architecture: Chatbots vs. Agents

When building a traditional chatbot, you typically make direct, stateless API calls to an LLM. This leaves you responsible for manually managing conversation history and writing custom loops to sustain a continuous dialogue.

An agent framework like the GitHub Copilot SDK abstracts this complexity away by structuring interactions into two core components:

  • The Client (CopilotClient): Acts as the bridge between your local environment and the AI infrastructure.
  • The Session (via client.create_session): Represents a continuous, stateful interaction. Unlike a one-off API call, a session automatically retains context, manages conversation history, and tracks ongoing sub-tasks.

Step 1: Configure Your LLM Provider

Before spinning up our agent loop, we need to define our model and provider configuration. A major benefit of this setup is flexibility: you do not even need an active GitHub Copilot subscription to get started.

For this demonstration, we will connect to Gemini’s OpenAI-compatible endpoint using the lightweight and fast Gemini 3.1 Flash Lite model.

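import os

# Provider settings for Gemini's OpenAI-compatible endpoint.
provider = {
    "type": "openai",
    "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "wire_api": "completions",
    # Read the key from the environment rather than hard-coding it.
    "api_key": os.getenv("GEMINI_API_KEY")
}

# Model name passed to create_session in the next step.
model = "gemini-3.1-flash-lite"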

Step 2: Initialize the Agent Loop

With our provider configured, we can now initialize the CopilotClient, establish a stateful session, and send our first asynchronous query.

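# Assumed import path (not shown in the original snippet); verify it against your installed SDK version.
from copilot import CopilotClient, PermissionHandler

question = "What is 2 + 2?"

# Top-level "async with" works in a notebook; in a plain script, wrap this
# in an async function and run it with asyncio.run().
async with CopilotClient() as client:
    async with await client.create_session(
        on_permission_request=PermissionHandler.approve_all,  # auto-approve tool calls
        model=model,
        provider=provider
    ) as session:
        # Send the prompt and wait for the complete reply.
        response = await session.send_and_wait(question)
        print(response.data.content)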

Note: By setting PermissionHandler.approve_all, we are giving the agent permission to execute tools autonomously within this session context—a fundamental trait of true AI agency.

We have successfully configured and executed our first stateful AI agent using the GitHub Copilot SDK in fewer than 20 lines of code. While a basic query is a great starting point, the real magic happens when we begin giving our agent actual autonomy.

If you want to see this code run live without setting up a local environment, click the badge below to jump straight into the interactive notebook and test it out.

Open In Colab

Monday, February 24, 2025

The Two-Hour Prompt: Why Good AI Instructions Take Time

I needed a script to build a graph of some Wikipedia pages. Instead of coding it myself, I experimented with using an LLM to generate it for me. After two hours in front of my computer, here is the prompt I wrote:

It is a short prompt if you remove the JSON sample. You might think it shouldn't take two hours to write—but think again.

Programming is the process of instructing a computer on exactly what to do. So, when I was writing the prompt, I was designing how the script would work. At the same time, I had to figure out how to gather the right data by studying the best way to retrieve it.

The resulting prompt is a pseudocode-level specification. The time invested was worth it because it worked right away. There were minor bugs, but they were very easy to fix.

Here is what Grok 3 commented:

In short, the prompt’s apparent simplicity hides a dense web of interlocking steps, each requiring careful thought, validation, and articulation. Two hours is reasonable for distilling such a process into a coherent set of instructions, especially if you were simultaneously designing the workflow and documenting it. It’s a bit like writing code and its documentation at the same time—except you’re doing it in natural language, which adds an extra layer of effort to keep it intuitive yet precise.

Yes, you can do vibe coding (blindly accepting AI suggestions, pasting in error messages, and hoping for the best) on hobby projects. But if you are working on real projects with deadlines, it is better to learn software engineering properly.

Monday, January 20, 2025

The Reality Check on AI Coding Agents: Lessons from Devin

A recent blog post from Answer.AI titled "Thoughts On A Month With Devin" has sparked intense discussion in the development community, particularly on Hacker News. Their detailed experiment with Devin, an AI coding agent, offers valuable insights into the current state of autonomous AI development tools.

The Promise vs. Reality

When Devin first appeared on the scene, it seemed to represent a breakthrough in AI coding assistance. The Answer.AI team was initially impressed by Devin's capabilities: pulling data between platforms, creating applications from scratch, and even navigating API documentation – all while maintaining natural communication through Slack. It appeared to be the autonomous coding assistant many had dreamed of.

However, their extensive testing revealed a more complex reality. Out of 20 real-world tasks, Devin succeeded in only three cases, with 14 outright failures and three inconclusive results. More concerning was the unpredictability – there seemed to be no clear pattern to predict which tasks would succeed or fail.

The Autonomy Paradox

Perhaps the most interesting insight from Answer.AI's experiment is how Devin's supposed strength – its autonomy – became its greatest weakness. As one Hacker News commenter, davedx, pointedly asked: "Why doesn't Devin have an 'ask for help' escape hatch when it gets stuck?" Another commenter, rsynnott, noted that Devin embodied "the worst stereotype" of a junior developer – one who won't admit when they're stuck.

The Tools vs. Agents Debate

The Hacker News discussion revealed a clear divide in approaches to AI coding assistance. As CGamesPlay pointed out, tools like GitHub Copilot (which they described as "better tab complete") and Aider (for more advanced edits) are proving more practical because they assist developers rather than try to replace them. This aligns with Answer.AI's conclusion that developer-guided tools are currently more effective than autonomous agents.

Current Sweet Spots for AI in Development

Despite the setbacks, both the original blog post and Hacker News comments helped identify where AI truly shines. As rbren, the creator of OpenHands, noted, about 20% of their commits come from AI agents handling routine tasks like fixing merge conflicts. Other commenters highlighted success with:

  1. Generating boilerplate code and repetitive patterns
  2. Assisting with specific, well-defined problems like complex SQL queries
  3. Supporting documentation and test writing
  4. Helping developers learn new technologies

The Future Outlook

The Hacker News discussion revealed interesting perspectives on AI's future in development. As commenter bufferoverflow suggested, LLMs might reach mid-level developer capabilities in 2-3 years and senior-level in 4-5 years. However, Zanfa countered that progress isn't linear, drawing parallels to self-driving cars. As npilk noted, the situation resembles AI image generation in 2022 – showing obvious flaws but with potential for rapid improvement.

The Human Element Remains Critical

Both the Answer.AI blog post and subsequent discussion emphasize that successful software development isn't just about writing code. As jboggan pointed out in the Hacker News thread, if humans can't learn to use the tool effectively and discern patterns of best practices, then it isn't really a useful tool. This highlights the continuing importance of human judgment and oversight.

Economic Implications

The Hacker News discussion raised important points about the economic impact of AI in development. While some commenters like the_af expressed concerns about job displacement and salary deflation, others like lolinder drew parallels to previous fears about offshoring, which didn't lead to the predicted negative outcomes. This debate reflects the broader uncertainty about AI's impact on the software development profession.

Conclusion

The Answer.AI team's experiment with Devin, and the subsequent Hacker News discussion, serve as both a reality check and a roadmap. While fully autonomous coding agents may not be ready for prime time, the experiment has helped clarify where AI can most effectively support development work. The future of software development likely lies not in replacement but in synergy – finding the sweet spot where AI amplifies human capabilities rather than attempting to supplant them entirely.

As we move forward, the focus should be on developing tools that maintain this balance, keeping developers in the driver's seat while leveraging AI's strengths in handling routine tasks and generating initial solutions. This approach promises to enhance productivity while maintaining the quality and reliability that professional software development demands.

Monday, December 02, 2024

Beyond Mimicry: The Quest for Reasoning in Large Language Models

Large Language Models (LLMs) have captivated the world with their ability to generate human-like text, translate languages, answer questions, and even write code. However, beneath the surface of impressive fluency lies a fundamental limitation: a struggle with true generalization and logical reasoning. While LLMs can mimic reasoning processes, they often fall short when confronted with tasks requiring genuine understanding, extrapolation beyond observed patterns, or the application of logical principles. This article delves into the architectural and training-related reasons behind these limitations.

The Autoregressive Bottleneck: A Word-by-Word Worldview

At the heart of most current LLMs lies the autoregressive architecture. These models predict the next word in a sequence based on the preceding words. This approach, while effective at generating fluent and grammatically correct text, inherently fosters a local, rather than global, optimization process.

  • Local Optimization vs. Global Reasoning: The autoregressive model excels at identifying and replicating statistical patterns within its training data. Each word is chosen to minimize immediate prediction error, akin to a myopic traveler always selecting the next closest city without considering the overall journey. This leads to difficulties in tasks requiring holistic understanding or logical coherence across an entire text. The nearest-neighbor heuristic for the traveling salesman problem illustrates this well: it minimizes the cost at each step but does not necessarily find the globally optimal route.

  • The Chinese Room Argument Reimagined: Philosopher John Searle's Chinese Room argument challenges the notion that manipulating symbols according to rules equates to genuine understanding. LLMs, in their current form, operate much like the person in the Chinese Room. They can process and generate text by following statistically derived rules (encoded in their massive weight matrices), but this doesn't necessarily mean they comprehend the meaning or possess the ability to reason about the information.

  • Error Propagation and the Fragility of Reasoning Chains: The sequential nature of autoregressive models makes them highly susceptible to error propagation. A single incorrect word prediction can cascade into a series of errors, derailing the entire generation process. This is particularly problematic in tasks requiring multi-step reasoning, where a flawed premise can invalidate the entire chain of thought, even if subsequent steps are logically sound. While techniques like "chain-of-thought" prompting encourage LLMs to articulate intermediate reasoning steps, they remain vulnerable to this cascading effect. A wrong "thought" leads to an incorrect overall conclusion.
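The "myopic traveler" analogy from the first bullet can be made concrete. The sketch below compares the nearest-neighbor heuristic with a brute-force optimal tour on four hypothetical points chosen so that the greedy choice backfires. It illustrates local versus global optimization in general, not how an LLM computes anything.

```python
from itertools import permutations
from math import dist


def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    return sum(
        dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def nearest_neighbor_tour(points, start=0):
    """Greedy tour: always move to the closest unvisited point."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        here = points[order[-1]]
        nxt = min(unvisited, key=lambda j: dist(here, points[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def optimal_tour(points, start=0):
    """Brute-force search over all tours; only feasible for a handful of points."""
    others = [i for i in range(len(points)) if i != start]
    candidates = ([start, *perm] for perm in permutations(others))
    return min(candidates, key=lambda order: tour_length(points, order))


if __name__ == "__main__":
    # Hypothetical cities on a line, picked so that greedy zigzags.
    cities = [(0.0, 0.0), (1.0, 0.0), (-1.9, 0.0), (4.0, 0.0)]
    greedy = nearest_neighbor_tour(cities)
    best = optimal_tour(cities)
    print("greedy :", greedy, round(tour_length(cities, greedy), 2))
    print("optimal:", best, round(tour_length(cities, best), 2))
```

Each greedy step is locally the cheapest move, yet the finished tour is longer than the optimal one.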

Training Limitations: Statistical Patterns vs. Logical Principles

The training methodology of LLMs also contributes significantly to their limitations in generalization and reasoning.

  • Self-Supervised Pretraining: Learning Correlations, Not Causation: LLMs are typically pretrained on massive text corpora using self-supervised learning, where the model learns to predict masked or subsequent words. While this allows them to acquire a vast amount of linguistic and factual knowledge, it primarily captures statistical correlations between words, not causal relationships or logical principles. The model learns what words tend to co-occur, but not necessarily why they co-occur or the logical connections between them. This explains why early GPT models, while fluent, often produced nonsensical or factually incorrect outputs.

  • The Specialization Tradeoff of Supervised Fine-tuning: Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) refine LLMs to better align with human expectations and follow instructions. However, this fine-tuning process introduces a form of specialization, akin to training a skilled craftsman in one particular area. While this enhances performance on specific tasks seen during training, it can hinder generalization to novel or unseen scenarios. The model becomes adept at solving problems similar to those it has encountered before, but struggles with tasks outside its "comfort zone," as illustrated by failures on simple tasks such as counting the characters in a word when such tasks are uncommon in the training data.

  • The Long Tail Problem: Skewed Performance and Unseen Scenarios: Even with multi-task training, LLMs face the "long tail" problem. They perform well on tasks that are well-represented in the training data but often fail on rare or unusual tasks. This is because statistical learning models are fundamentally limited by the distribution of the data they are trained on. They can interpolate within the bounds of observed patterns, but struggle to extrapolate to tasks that deviate significantly from those patterns.

Reasoning Tokens: A Superficial Facade?

Recent efforts have focused on incorporating "reasoning tokens" or prompting LLMs to generate "chain-of-thought" explanations. While these approaches can improve performance on certain reasoning tasks, they often represent a superficial mimicry of reasoning rather than genuine cognitive understanding.

  • Imitating System 2 Thinking without the Underlying Mechanisms: The goal is to simulate "System 2" thinking, characterized by deliberate and logical reasoning, as opposed to the intuitive "System 1" thinking. However, LLMs achieve this by generating text that resembles step-by-step reasoning, not by actually engaging in logical deduction, induction, or abduction. The model is still fundamentally predicting the next token based on statistical patterns; it's simply conditioned on a prompt that encourages a more verbose and structured output.

  • Vulnerability to Surface-Level Cues and Biases: LLMs remain susceptible to surface-level cues and biases present in the training data. They can be easily misled by irrelevant information or subtle changes in phrasing, leading to illogical or incorrect conclusions, even when they appear to be "reasoning" correctly. This highlights the lack of deep understanding and robust reasoning capabilities.

Conclusion

Large Language Models have made remarkable strides in natural language processing, but their current limitations in generalization and reasoning highlight the need for a fundamental shift in approach. While statistical pattern recognition remains a powerful tool, it is insufficient on its own to achieve true cognitive understanding. The quest for reasoning in LLMs is a challenging but crucial endeavor that promises to unlock the full potential of artificial intelligence and transform the way we interact with information and knowledge.

Tuesday, November 12, 2024

A Glimpse into the Future: Large Reasoning Models

Let's dive deeper into how large language models might evolve into large reasoning models:

1. Baseline Auto-Regressive Model: The Foundation – Predicting the Next Word with Context

At its core, the baseline autoregressive model is a sophisticated "next word prediction" engine. It doesn't just guess randomly; it uses the context of preceding words to make informed predictions. This context is captured through contextual embeddings. Imagine it like this: the model reads a sentence word by word, and with each word, it builds an understanding of the overall meaning and relationships between the words. This understanding is encoded in the contextual embeddings. These embeddings are then used to predict the most likely next word.

Here's a breakdown of the process:

  • Tokenization: The input text is broken down into individual units – tokens. These can be words, subwords (parts of words), or even characters.

  • Contextual Embedding Layer: This is where the magic happens. Each token is converted into a vector (a list of numbers) called a contextual embedding. Crucially, this embedding is not fixed; it depends on the surrounding words. So, the same word can have different embeddings depending on the context it appears in. For example, the word "bank" will have a different embedding in the sentence "I sat by the river bank" compared to "I went to the bank to deposit money." This context-sensitive embedding is what allows the model to understand nuances in language.

  • Decoder Block: This part of the model takes the contextual embeddings as input and uses them to predict the probability of each possible next word/token. It considers all the tokens in its vocabulary and assigns a probability to each one, based on how well it fits the current context. In the simplest (greedy) decoding scheme, the token with the highest probability is selected as the next token in the sequence; in practice, the next token is often sampled from this distribution instead.
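The last step of that breakdown, turning scores into probabilities and picking a token, can be sketched in a few lines. The scores below are made-up numbers for a hypothetical three-token vocabulary, not output from a real model.

```python
import math


def softmax(logits: dict) -> dict:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def greedy_pick(probs: dict) -> str:
    """Greedy decoding: return the token with the highest probability."""
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # Made-up scores for the context "I went to the bank to deposit ..."
    logits = {"money": 4.0, "water": 1.0, "sand": 0.5}
    probs = softmax(logits)
    print({token: round(p, 3) for token, p in probs.items()})
    print("next token:", greedy_pick(probs))
```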

Therefore, the baseline autoregressive model is fundamentally a context-driven next-word prediction engine. The contextual embeddings are central to this process, as they represent the model's understanding of the meaning and relationships between words in a given sequence.

2. Unrolled Auto-Regressive Model (Figure 2): The Sequential Process

This diagram illustrates the iterative nature of text generation. The model predicts one token at a time, and each prediction becomes the input for the next step. This "unrolling" visualizes how the model builds up a sequence token by token. The key takeaway here is that the model's understanding of the context evolves with each prediction. Early predictions can significantly influence later ones.
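A toy version of the unrolled loop is sketched below. The `BIGRAMS` table is a made-up stand-in for a model, and it conditions only on the last token, whereas a real LLM conditions on the whole prefix. The loop structure is the part that carries over: predict, append, repeat.

```python
# Made-up next-token probabilities, keyed by the previous token only.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "bank": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "bank": {"</s>": 1.0},
    "sat": {"</s>": 1.0},
}


def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive loop: each prediction becomes the next input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_probs = BIGRAMS[tokens[-1]]
        next_token = max(next_probs, key=next_probs.get)
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    print(generate(["<s>"]))
    print(generate(["<s>", "a"]))  # a different early token changes what follows
```

Forcing a different second token changes every token after it, which is the sense in which early predictions influence later ones.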


3. Auto-Regressive Model with Reasoning Tokens (Chain-of-Thought): Thinking Step-by-Step

This introduces the concept of explicit reasoning. By providing examples with intermediate reasoning steps during training, the model learns to generate its own reasoning steps before arriving at the final answer.

  • Reasoning Tokens: These special tokens act as prompts to guide the model's thinking process. They can be natural language phrases or specific symbols that signal reasoning steps. For instance, reasoning tokens might start with "Therefore," "Because," or "Step 1:".

  • Benefits of Chain-of-Thought: This approach improves performance on complex reasoning tasks by forcing the model to decompose the problem into smaller, more manageable steps. It also makes the model's reasoning more transparent and interpretable.

OpenAI's o1 model is one such model trained to produce chain-of-thought reasoning.

4. Auto-Regressive Model with Reasoning Embedding: Implicit Reasoning

Here is the interesting part. Instead of generating the reasoning tokens one by one, the model could possibly be trained to produce the contextual embedding of those reasoning tokens directly. Since the same embedding yields the same next token, such a model, if it were trained, could predict the next token efficiently without the overhead of generating explicit reasoning tokens.

  • Reasoning Embedding Layer: This new layer learns to encode the essence of the reasoning process directly into the embeddings. Instead of explicitly generating reasoning steps, the model incorporates the learned reasoning patterns into its prediction process.

  • Efficiency Gains: By eliminating the need to generate intermediate tokens, this approach reduces computational cost and speeds up text generation.


As large language models evolve into powerful reasoning engines, we stand on the brink of a new era in AI capabilities. From foundational autoregressive models to innovative reasoning embeddings, each step forward enhances the efficiency, interpretability, and complexity of what these models can achieve. By integrating explicit reasoning (reasoning tokens) and implicit reasoning (reasoning embeddings) mechanisms, the future promises not only faster and more accurate text generation but also models capable of deeper understanding and problem-solving.