Monday, February 24, 2025

The Two-Hour Prompt: Why Good AI Instructions Take Time

I need a script to build a graph of some Wikipedia pages. Instead of coding it myself, I experimented with using an LLM to generate it for me. After spending two hours in front of my computer, here is the prompt I have written:

It is a short prompt if you remove the JSON sample. You might think it shouldn't take two hours to write—but think again.

Programming is the process of instructing a computer on exactly what to do. So, when I was writing the prompt, I was designing how the script would work. At the same time, I had to figure out how to gather the right data by studying the best way to retrieve it.

The resulting prompt is a pseudocode-level specification. The time invested was worth it because it worked right away. There were minor bugs, but they were very easy to fix.

Here is what Grok 3 commented:

In short, the prompt’s apparent simplicity hides a dense web of interlocking steps, each requiring careful thought, validation, and articulation. Two hours is reasonable for distilling such a process into a coherent set of instructions, especially if you were simultaneously designing the workflow and documenting it. It’s a bit like writing code and its documentation at the same time—except you’re doing it in natural language, which adds an extra layer of effort to keep it intuitive yet precise.

Yes, you can do Vibe Coding—blindly accepting AI suggestions, copying error messages, and hoping for the best for hobby projects. If you are working on real projects with deadlines, it is better to learn software engineering properly.

Monday, January 20, 2025

The Reality Check on AI Coding Agents: Lessons from Devin

A recent blog post from Answer.AI titled "Thoughts On A Month With Devin" has sparked intense discussion in the development community, particularly on Hacker News. Their detailed experiment with Devin, an AI coding agent, offers valuable insights into the current state of autonomous AI development tools.

The Promise vs. Reality

When Devin first appeared on the scene, it seemed to represent a breakthrough in AI coding assistance. The Answer.AI team was initially impressed by Devin's capabilities: pulling data between platforms, creating applications from scratch, and even navigating API documentation – all while maintaining natural communication through Slack. It appeared to be the autonomous coding assistant many had dreamed of.

However, their extensive testing revealed a more complex reality. Out of 20 real-world tasks, Devin succeeded in only three cases, with 14 outright failures and three inconclusive results. More concerning was the unpredictability – there seemed to be no clear pattern to predict which tasks would succeed or fail.

The Autonomy Paradox

Perhaps the most interesting insight from Answer.AI's experiment is how Devin's supposed strength – its autonomy – became its greatest weakness. As one Hacker News commenter, davedx, pointedly asked: "Why doesn't Devin have an 'ask for help' escape hatch when it gets stuck?" Another commenter, rsynnott, noted that Devin embodied "the worst stereotype" of a junior developer – one who won't admit when they're stuck.

The Tools vs. Agents Debate

The Hacker News discussion revealed a clear divide in approaches to AI coding assistance. As CGamesPlay pointed out, tools like GitHub Copilot (which they described as "better tab complete") and Aider (for more advanced edits) are proving more practical because they assist developers rather than try to replace them. This aligns with Answer.AI's conclusion that developer-guided tools are currently more effective than autonomous agents.

Current Sweet Spots for AI in Development

Despite the setbacks, both the original blog post and Hacker News comments helped identify where AI truly shines. As rbren, the creator of OpenHands, noted, about 20% of their commits come from AI agents handling routine tasks like fixing merge conflicts. Other commenters highlighted success with:

Generating boilerplate code and repetitive patterns
Assisting with specific, well-defined problems like complex SQL queries
Supporting documentation and test writing
Helping developers learn new technologies

The Future Outlook

The Hacker News discussion revealed interesting perspectives on AI's future in development. As commenter bufferoverflow suggested, LLMs might reach mid-level developer capabilities in 2-3 years and senior-level in 4-5 years. However, Zanfa countered that progress isn't linear, drawing parallels to self-driving cars. As npilk noted, the situation resembles AI image generation in 2022 – showing obvious flaws but with potential for rapid improvement.

The Human Element Remains Critical

Both the Answer.AI blog post and subsequent discussion emphasize that successful software development isn't just about writing code. As jboggan pointed out in the Hacker News thread, if humans can't learn to use the tool effectively and discern patterns of best practices, then it isn't really a useful tool. This highlights the continuing importance of human judgment and oversight.

Economic Implications

The Hacker News discussion raised important points about the economic impact of AI in development. While some commenters like the_af expressed concerns about job displacement and salary deflation, others like lolinder drew parallels to previous fears about offshoring, which didn't lead to the predicted negative outcomes. This debate reflects the broader uncertainty about AI's impact on the software development profession.

Conclusion

The Answer.AI team's experiment with Devin, and the subsequent Hacker News discussion, serve as both a reality check and a roadmap. While fully autonomous coding agents may not be ready for prime time, the experiment has helped clarify where AI can most effectively support development work. The future of software development likely lies not in replacement but in synergy – finding the sweet spot where AI amplifies human capabilities rather than attempting to supplant them entirely.

As we move forward, the focus should be on developing tools that maintain this balance, keeping developers in the driver's seat while leveraging AI's strengths in handling routine tasks and generating initial solutions. This approach promises to enhance productivity while maintaining the quality and reliability that professional software development demands.

Monday, December 02, 2024

Beyond Mimicry: The Quest for Reasoning in Large Language Models

Large Language Models (LLMs) have captivated the world with their ability to generate human-like text, translate languages, answer questions, and even write code. However, beneath the surface of impressive fluency lies a fundamental limitation: a struggle with true generalization and logical reasoning. While LLMs can mimic reasoning processes, they often fall short when confronted with tasks requiring genuine understanding, extrapolation beyond observed patterns, or the application of logical principles. This article delves into the architectural and training-related reasons behind these limitations.

The Autoregressive Bottleneck: A Word-by-Word Worldview

At the heart of most current LLMs lies the autoregressive architecture. These models predict the next word in a sequence based on the preceding words. This approach, while effective at generating fluent and grammatically correct text, inherently fosters a local, rather than global, optimization process.

Local Optimization vs. Global Reasoning: The autoregressive model excels at identifying and replicating statistical patterns within its training data. Each word is chosen to minimize immediate prediction error, akin to a myopic traveler always selecting the next closest city without considering the overall journey. This leads to difficulties in tasks requiring holistic understanding or logical coherence across an entire text. The analogy of the traveling salesman problem perfectly illustrates this; the algorithm minimizes the cost at each step but doesn't necessarily find the globally optimal route.

The Chinese Room Argument Reimagined: Philosopher John Searle's Chinese Room argument challenges the notion that manipulating symbols according to rules equates to genuine understanding. LLMs, in their current form, operate much like the person in the Chinese Room. They can process and generate text by following statistically derived rules (encoded in their massive weight matrices), but this doesn't necessarily mean they comprehend the meaning or possess the ability to reason about the information.

Error Propagation and the Fragility of Reasoning Chains: The sequential nature of autoregressive models makes them highly susceptible to error propagation. A single incorrect word prediction can cascade into a series of errors, derailing the entire generation process. This is particularly problematic in tasks requiring multi-step reasoning, where a flawed premise can invalidate the entire chain of thought, even if subsequent steps are logically sound. While techniques like "chain-of-thought" prompting encourage LLMs to articulate intermediate reasoning steps, they remain vulnerable to this cascading effect. A wrong "thought" leads to an incorrect overall conclusion.

Training Limitations: Statistical Patterns vs. Logical Principles

The training methodology of LLMs also contributes significantly to their limitations in generalization and reasoning.

Self-Supervised Pretraining: Learning Correlations, Not Causation: LLMs are typically pretrained on massive text corpora using self-supervised learning, where the model learns to predict masked or subsequent words. While this allows them to acquire a vast amount of linguistic and factual knowledge, it primarily captures statistical correlations between words, not causal relationships or logical principles. The model learns what words tend to co-occur, but not necessarily why they co-occur or the logical connections between them. This explains why early GPT models, while fluent, often produced nonsensical or factually incorrect outputs.

The Specialization Tradeoff of Supervised Fine-tuning: Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) refine LLMs to better align with human expectations and follow instructions. However, this supervised learning process introduces a form of specialization, akin to training a skilled craftsman in one particular area. While this enhances performance on specific tasks seen during training, it can hinder generalization to novel or unseen scenarios. The model becomes adept at solving problems similar to those it has encountered before, but struggles with tasks outside its "comfort zone," as evidenced by failures on simple tasks like counting characters in a word if such tasks are uncommon in training data.

The Long Tail Problem: Skewed Performance and Unseen Scenarios: Even with multi-task training, LLMs face the "long tail" problem. They perform well on tasks that are well-represented in the training data but often fail on rare or unusual tasks. This is because statistical learning models are fundamentally limited by the distribution of the data they are trained on. They can interpolate and extrapolate within the bounds of observed patterns, but struggle with tasks that deviate significantly from those patterns.

Reasoning Tokens: A Superficial Facade?

Recent efforts have focused on incorporating "reasoning tokens" or prompting LLMs to generate "chain-of-thought" explanations. While these approaches can improve performance on certain reasoning tasks, they often represent a superficial mimicry of reasoning rather than genuine cognitive understanding.

Imitating System 2 Thinking without the Underlying Mechanisms: The goal is to simulate "System 2" thinking, characterized by deliberate and logical reasoning, as opposed to the intuitive "System 1" thinking. However, LLMs achieve this by generating text that resembles step-by-step reasoning, not by actually engaging in logical deduction, induction, or abduction. The model is still fundamentally predicting the next token based on statistical patterns; it's simply conditioned on a prompt that encourages a more verbose and structured output.

Vulnerability to Surface-Level Cues and Biases: LLMs remain susceptible to surface-level cues and biases present in the training data. They can be easily misled by irrelevant information or subtle changes in phrasing, leading to illogical or incorrect conclusions, even when they appear to be "reasoning" correctly. This highlights the lack of deep understanding and robust reasoning capabilities.

Conclusion

Large Language Models have made remarkable strides in natural language processing, but their current limitations in generalization and reasoning highlight the need for a fundamental shift in approach. While statistical pattern recognition remains a powerful tool, it is insufficient on its own to achieve true cognitive understanding. The quest for reasoning in LLMs is a challenging but crucial endeavor that promises to unlock the full potential of artificial intelligence and transform the way we interact with information and knowledge.

Tuesday, November 12, 2024

A Glimpse to the Future Large Reasoning Models

Let's dive deeper into how large language models might evolve to large reasoning model:

1. Baseline Auto-Regressive Model: The Foundation – Predicting the Next Word with Context

At its core, the baseline autoregressive model is a sophisticated "next word prediction" engine. It doesn't just guess randomly; it uses the context of preceding words to make informed predictions. This context is captured through contextual embeddings. Imagine it like this: the model reads a sentence word by word, and with each word, it builds an understanding of the overall meaning and relationships between the words. This understanding is encoded in the contextual embeddings. These embeddings are then used to predict the most likely next word.

Here's a breakdown of the process:

Tokenization: The input text is broken down into individual units – tokens. These can be words, subwords (parts of words), or even characters.

Contextual Embedding Layer: This is where the magic happens. Each token is converted into a vector (a list of numbers) called a contextual embedding. Crucially, this embedding is not fixed; it depends on the surrounding words. So, the same word can have different embeddings depending on the context it appears in. For example, the word "bank" will have a different embedding in the sentence "I sat by the river bank" compared to "I went to the bank to deposit money." This context-sensitive embedding is what allows the model to understand nuances in language.

Decoder Block: This part of the model takes the contextual embeddings as input and uses them to predict the probability of each possible next word/token. It considers all the words in its vocabulary and assigns a probability to each one, based on how well it fits the current context. The word with the highest probability is selected as the next word in the sequence.

Therefore, the baseline autoregressive model is fundamentally a context-driven next-word prediction engine. The contextual embeddings are central to this process, as they represent the model's understanding of the meaning and relationships between words in a given sequence.

2. Unrolled Auto-Regressive Model (Figure 2): The Sequential Process

This diagram illustrates the iterative nature of text generation. The model predicts one token at a time, and each prediction becomes the input for the next step. This "unrolling" visualizes how the model builds up a sequence token by token. The key takeaway here is that the model's understanding of the context evolves with each prediction. Early predictions can significantly influence later ones.

3. Auto-Regressive Model with Reasoning Tokens (Chain-of-Thought): Thinking Step-by-Step

This introduces the concept of explicit reasoning. By providing examples with intermediate reasoning steps during training, the model learns to generate its own reasoning steps before arriving at the final answer.

Reasoning Tokens: These special tokens act as prompts to guide the model's thinking process. They can be natural language phrases or specific symbols that signal reasoning steps. For instance, reasoning tokens might start with "Therefore," "Because," or "Step 1:".

Benefits of Chain-of-Thought: This approach improves performance on complex reasoning tasks by forcing the model to decompose the problem into smaller, more manageable steps. It also makes the model's reasoning more transparent and interpretable.

OpenAI's o1 model is one of those model trained with chain-of-thought reasoning.

4. Auto-Regressive Model with Reasoning Embedding: Implicit Reasoning

Here is the interesting part. Instead of having the reasoning tokens generated one by one, the context embedding of the reasoning token could possibly trained. So, Given the same embedding will generate the same token. If such model was trained, we can predict the next token efficiently without the overhead of generating explicit reasoning tokens.

Reasoning Embedding Layer: This new layer learns to encode the essence of the reasoning process directly into the embeddings. Instead of explicitly generating reasoning steps, the model incorporates the learned reasoning patterns into its prediction process.

Efficiency Gains: By eliminating the need to generate intermediate tokens, this approach reduces computational cost and speeds up text generation.

As large language models evolve into powerful reasoning engines, we stand on the brink of a new era in AI capabilities. From foundational autoregressive models to innovative reasoning embeddings, each step forward enhances the efficiency, interpretability, and complexity of what these models can achieve. By integrating explicit reasoning (reasoning tokens) and implicit reasoning (reasoning embeddings) mechanisms, the future promises not only faster and more accurate text generation but also models capable of deeper understanding and problem-solving.

Monday, October 21, 2024

Efficient Multilingual Control of Robotic Dog Using LLM

Introduction

As the world of robotics continues to advance, the integration of artificial intelligence (AI) in robotic systems has become essential for making these machines smarter, more intuitive, and easier to control. One exciting area of development is the use of large language models (LLMs) to enhance the interaction between humans and robots. Recently, a question was raised in an LLM group about how to implement this integration.

The Challenge

The objective was to enable a robotic dog to understand and execute commands given in both English and Cantonese. However, there were key limitations to consider:

Multilingual Capability: The model needed to understand and process commands in both languages accurately.
Edge Device Compatibility: Given that the onboard GPU was a Jetson GPU with only 8GB of VRAM, the model had to be small and efficient enough to run effectively within this limited hardware capacity.
Fast Response Time: The robotic dog should be able to interpret commands and respond almost instantaneously, maintaining a natural interaction experience with users.

To address these challenges, we implement a PoC utilized a quantized version of the Qwen 2.5 1.5B model, which provided a balance between size, multilingual capabilities, and performance.

Why Use a Quantized Version of Qwen 2.5 1.5B Model?

The Qwen 2.5 1.5B model was chosen for several reasons:

Multilingual Capability: The model supports multiple languages, including English and Cantonese. This feature allowed the robotic dog to interpret commands accurately, regardless of the language used.
Efficient Edge Computing: A smaller model was preferred to fit within the constraints of the onboard Jetson GPU. The Qwen 2.5 1.5B model was quantized, reducing its memory footprint, making it lightweight and compatible with the edge device. Quantization reduces the model size by converting the weights from 32-bit floating points to smaller data types, such as 4-bit, without significantly sacrificing performance.
Optimized for Performance: Despite its smaller size, the model remained powerful enough to handle the command interpretation. By using the quantized version (Qwen2.5-1.5b-instruct-q4_k_m.gguf), it managed to provide a fast response time while consuming minimal VRAM.

Proof of Concept

We can quickly build a proof of concept (PoC) using llama.cpp to load the Qwen model.

The Prompt

You are the command center for an advanced robotic dog. Your role is to interpret user inputs and generate appropriate commands for the dog to execute. The available commands are:
- turn right
- turn left
- move forward
- move backward
- dance
- bark

Based on the user's input, create a list of one or more commands for the robotic dog. Output the commands as a JSON array.

Sample results

Two different Cantonese phrases are listed here (English translations are provided in brackets). The first one is straightforward, while the second requires understanding the user's intention to generate a command list.

Sample 1:
In this case, the model accurately interpreted the user's straightforward command, providing a sequence of actions for the robotic dog.

轉右向前行兩步，再吠兩聲嚟聽吓 (Turn right, move forward two steps, and then bark twice)

["turn right", "move forward", "move forward", "bark", "bark"]

Sample 2:
In the following case, the model was able to understand the user’s intention and interpreted "cheering up" as asking the robotic dog to perform an action that would be entertaining, like dancing. This showcases the model’s ability to grasp user sentiment and respond creatively.

我今日好唔開心，可以氹吓我嗎? (I'm feeling very sad today, can you cheer me up?)

["dance", "jump"]

Performance Summary

With llama.cpp and the quantized Qwen model, it leads to the following performance results:

Response Time: ~700 milliseconds in average on a Nvidia T4 Card. This means the model processed the input and generated commands in well under a second, ensuring a fluid interaction between the user and the robotic dog.
VRAM Usage: 2.8GB with default settings. By setting the maximum context length to only 500 tokens, the VRAM usage was reduced to 1.4GB, which is well within the 8GB limit of the Jetson GPU.

The efficient use of memory and fast response time demonstrated the feasibility of running LLMs on edge devices, even for multilingual applications.

Key Takeaways

The PoC demonstrated that it is possible to use a quantized version of a multilingual language model for real-time robotic control on edge devices. The key benefits included:

Multilingual Support: The ability to understand commands in both English and Cantonese expanded the usability and flexibility of the robotic dog.
Edge Device Compatibility: By using a smaller, quantized model, the AI was able to run efficiently on limited hardware without compromising performance.
Real-Time Performance: Fast response times ensured that the robotic dog could react promptly, making interactions feel natural and engaging.

This proof of concept paves the way for more advanced, language-based control systems for robots that can be deployed on edge devices, making them more accessible and practical for various real-world applications.

Sunday, September 01, 2024

Llama 3 and 3.1: Advancements in AI Translation Capabilities

In the rapidly evolving field of artificial intelligence, Llama 3.1 has garnered significant attention since its release six weeks ago. This model builds upon the foundation laid by Llama 3, which debuted three months prior. While Llama 3 introduced impressive capabilities that hinted at the future potential of AI language models, it also revealed certain limitations that have been further explored and addressed with Llama 3.1.

Multilingual Challenges in AI

One of the critical areas where these limitations have come to light is in multilingual tasks, particularly translation. As AI continues to integrate more deeply into global communications, the ability to accurately and seamlessly translate between languages is paramount. However, recent tests with Llama 3 have shown that this is an area where even the most advanced models can face challenges.

Observations with Llama 3

Initial experiments with Llama 3, specifically using the 8B instruct model, revealed an interesting phenomenon. While the model performed admirably when asked to answer questions, its translation capabilities were less reliable. When prompted in English to translate a passage into Chinese, the model often returned incoherent output, far from an accurate translation.

What is red-tailed hawk?

The red-tailed hawk (Buteo jamaicensis) is a bird of prey that breeds throughout most of North America, from the interior of Alaska and northern Canada to as far south as Panama and the West Indies. It occupies a wide range of habitats and altitudes, including deserts, grasslands, forests, agricultural fields, and urban areas. Red-tailed hawks are opportunistic generalists and their diet is variable. In North America, they are most often predators of small mammals such as rodents. Prey that is terrestrial and at least partially diurnal is preferred, so types such as ground squirrels are preferred where they naturally occur. The bird is legally protected in Canada and Mexico, and in the United States by the Migratory Bird Treaty Act of 1918. This red-tailed hawk in flight was photographed at the Sacramento National Wildlife Refuge in California.

Please translate the answer to Chinese

(red téng sū chōu) (red téng sū chōu)

: , , ; , , , , . , , , .

RED-TAILED HAWK(Buteo jamaicensis) : , , , , . , , , .

To further investigate, tests were conducted with the 70B model, anticipating that its larger parameter count might overcome the translation issues observed with the 8B version. Surprisingly, the results were similar. Despite the significant increase in model size, the 70B model also struggled with translation when the prompt was given in English.

Key Insight: Language of Instruction

A crucial discovery was made when the prompt language was adjusted to Chinese. Both the 8B and 70B models were able to translate the passage correctly when instructions were given in the target language itself. This finding suggests that the models' translation capabilities are more effectively activated when the context is set in the language they are translating into.

Advancements with Llama 3.1

Further testing with Llama 3.1 has shown marked improvements. Both the 8B and 70B models performed well with English prompts, demonstrating enhanced translation capabilities compared to their predecessors.

Implications for Current Users

Despite the release of Llama 3.1 over a month ago, many organizations are still utilizing Llama 3. For these users, it is recommended to implement programmatic mitigation strategies:

Detect translation prompts within the system
Translate the prompts to the target language before passing them to Llama 3
Implement these processes to optimize translation outcomes

By adopting these strategies, organizations can maximize the potential of their current AI translation tools, even if they have not yet upgraded to the latest model.

Monday, December 18, 2023

The State of Generative AI: A Survey of 2023 and Outlook for 2024

Introduction

The year 2023 witnessed significant advancements in the field of Artificial Intelligence (AI), particularly in the realm of generative AI. This article provides an in-depth survey of the key trends in AI development in 2023 and offers insights into the outlook for 2024, shedding light on the transformative effects of generative AI.

Key Trends in AI Development in 2023

Generative AI Advancements The year 2023 marked a period of remarkable progress in generative AI, with substantial implications for industries and workforces. The State of AI in 2023 report by McKinsey highlighted the transformative effects of generative AI, emphasizing its potential to revolutionize various sectors. OpenAI's GPT-4 emerged as a groundbreaking generative AI model, revolutionizing natural language processing and creative content generation. The explosive growth of generative AI (gen AI) tools was confirmed by the latest annual McKinsey Global Survey, with one-third of survey respondents using gen AI regularly in at least one business function.

Multi-modal Learning Another notable trend in AI development was the emergence of multi-modal learning, which brought forth new possibilities for AI models and their applications. This approach enabled AI systems to process and understand information from multiple modalities, leading to enhanced performance and versatility.

Economic Impact of AI The economic impact of AI became increasingly pronounced in 2023, influencing global economies and driving discussions on the future of work and productivity. AI could contribute up to $15.7 trillion to the global economy in 2030, more than the current output of China and India combined. Of this, $6.6 trillion is likely to come from increased productivity and $9.1 trillion is likely to come from consumption-side effects.

Outlook for 2024

Monumental Leaps in AI Capabilities The outlook for 2024 foresees monumental leaps in AI capabilities, particularly in areas demanding complex problem-solving, fueled by quantum advancements. These advancements are expected to redefine the boundaries of AI applications and unlock new frontiers in technology.

Rapid Transformations Across Industries 2024 is poised to witness rapid transformations across industries, with generative AI expected to play a pivotal role in reshaping business operations and driving regulatory discourse. This paper examines the real-world application of AI in multiple sectors, including healthcare, finance, agriculture, retail, energy, and automotive. Several case studies are described to illustrate the impact of AI on these industries. The AI Dossier highlights the most compelling use cases of AI in six major industries, providing insights into the practical applications of AI across various sectors.

Challenges and Potential Failures in Generative AI Initiatives Despite the promising outlook, 2024 also brings forth challenges and potential failures in generative AI initiatives. Predictions suggest that the majority of such initiatives may face obstacles and encounter failures, underscoring the complexities and uncertainties associated with generative AI.

Industry Outlook for Generative AI The industry outlook for generative AI reflects early promise, potential payoffs, and uncertainties. AI is helping organizations in the energy, resources, and industrials industry to rapidly innovate, reduce their climate impact, and increase business productivity. The impact of AI on a hospitality company has been studied, providing insights into the transformative effects of AI in the hospitality sector.

Impact of AI on Worldwide IT Spending Projections for 2024 indicate a significant impact of AI on worldwide IT spending, with expectations of substantial growth driven by the integration of AI technologies across various sectors. The influence of AI on global IT spending is set to shape the landscape of technology investments and strategies.

Conclusion

The year 2023 marked a pivotal phase in the evolution of generative AI, setting the stage for transformative developments in the year ahead. As we look towards 2024, the landscape of AI is poised for monumental leaps, rapid transformations, and the navigation of challenges and uncertainties. The journey of generative AI continues to unfold, shaping the future of technology and innovation on a global scale.