Showing posts with label ai. Show all posts

Monday, December 02, 2024

Beyond Mimicry: The Quest for Reasoning in Large Language Models

Large Language Models (LLMs) have captivated the world with their ability to generate human-like text, translate languages, answer questions, and even write code. However, beneath the surface of impressive fluency lies a fundamental limitation: a struggle with true generalization and logical reasoning. While LLMs can mimic reasoning processes, they often fall short when confronted with tasks requiring genuine understanding, extrapolation beyond observed patterns, or the application of logical principles. This article delves into the architectural and training-related reasons behind these limitations.

The Autoregressive Bottleneck: A Word-by-Word Worldview

At the heart of most current LLMs lies the autoregressive architecture. These models predict the next word in a sequence based on the preceding words. This approach, while effective at generating fluent and grammatically correct text, inherently fosters a local, rather than global, optimization process.

Local Optimization vs. Global Reasoning: The autoregressive model excels at identifying and replicating statistical patterns within its training data. Each word is chosen to minimize immediate prediction error, akin to a myopic traveler always selecting the next closest city without considering the overall journey. This leads to difficulties in tasks requiring holistic understanding or logical coherence across an entire text. The analogy of the traveling salesman problem perfectly illustrates this; the algorithm minimizes the cost at each step but doesn't necessarily find the globally optimal route.

The Chinese Room Argument Reimagined: Philosopher John Searle's Chinese Room argument challenges the notion that manipulating symbols according to rules equates to genuine understanding. LLMs, in their current form, operate much like the person in the Chinese Room. They can process and generate text by following statistically derived rules (encoded in their massive weight matrices), but this doesn't necessarily mean they comprehend the meaning or possess the ability to reason about the information.

Error Propagation and the Fragility of Reasoning Chains: The sequential nature of autoregressive models makes them highly susceptible to error propagation. A single incorrect word prediction can cascade into a series of errors, derailing the entire generation process. This is particularly problematic in tasks requiring multi-step reasoning, where a flawed premise can invalidate the entire chain of thought, even if subsequent steps are logically sound. While techniques like "chain-of-thought" prompting encourage LLMs to articulate intermediate reasoning steps, they remain vulnerable to this cascading effect. A wrong "thought" leads to an incorrect overall conclusion.

Training Limitations: Statistical Patterns vs. Logical Principles

The training methodology of LLMs also contributes significantly to their limitations in generalization and reasoning.

Self-Supervised Pretraining: Learning Correlations, Not Causation: LLMs are typically pretrained on massive text corpora using self-supervised learning, where the model learns to predict masked or subsequent words. While this allows them to acquire a vast amount of linguistic and factual knowledge, it primarily captures statistical correlations between words, not causal relationships or logical principles. The model learns what words tend to co-occur, but not necessarily why they co-occur or the logical connections between them. This explains why early GPT models, while fluent, often produced nonsensical or factually incorrect outputs.

The Specialization Tradeoff of Supervised Fine-tuning: Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) refine LLMs to better align with human expectations and follow instructions. However, this supervised learning process introduces a form of specialization, akin to training a skilled craftsman in one particular area. While this enhances performance on specific tasks seen during training, it can hinder generalization to novel or unseen scenarios. The model becomes adept at solving problems similar to those it has encountered before, but struggles with tasks outside its "comfort zone," as evidenced by failures on simple tasks like counting characters in a word if such tasks are uncommon in training data.

The Long Tail Problem: Skewed Performance and Unseen Scenarios: Even with multi-task training, LLMs face the "long tail" problem. They perform well on tasks that are well-represented in the training data but often fail on rare or unusual tasks. This is because statistical learning models are fundamentally limited by the distribution of the data they are trained on. They can interpolate and extrapolate within the bounds of observed patterns, but struggle with tasks that deviate significantly from those patterns.

Reasoning Tokens: A Superficial Facade?

Recent efforts have focused on incorporating "reasoning tokens" or prompting LLMs to generate "chain-of-thought" explanations. While these approaches can improve performance on certain reasoning tasks, they often represent a superficial mimicry of reasoning rather than genuine cognitive understanding.

Imitating System 2 Thinking without the Underlying Mechanisms: The goal is to simulate "System 2" thinking, characterized by deliberate and logical reasoning, as opposed to the intuitive "System 1" thinking. However, LLMs achieve this by generating text that resembles step-by-step reasoning, not by actually engaging in logical deduction, induction, or abduction. The model is still fundamentally predicting the next token based on statistical patterns; it's simply conditioned on a prompt that encourages a more verbose and structured output.

Vulnerability to Surface-Level Cues and Biases: LLMs remain susceptible to surface-level cues and biases present in the training data. They can be easily misled by irrelevant information or subtle changes in phrasing, leading to illogical or incorrect conclusions, even when they appear to be "reasoning" correctly. This highlights the lack of deep understanding and robust reasoning capabilities.

Conclusion

Large Language Models have made remarkable strides in natural language processing, but their current limitations in generalization and reasoning highlight the need for a fundamental shift in approach. While statistical pattern recognition remains a powerful tool, it is insufficient on its own to achieve true cognitive understanding. The quest for reasoning in LLMs is a challenging but crucial endeavor that promises to unlock the full potential of artificial intelligence and transform the way we interact with information and knowledge.

Tuesday, November 12, 2024

A Glimpse to the Future Large Reasoning Models

Let's dive deeper into how large language models might evolve to large reasoning model:

1. Baseline Auto-Regressive Model: The Foundation – Predicting the Next Word with Context

At its core, the baseline autoregressive model is a sophisticated "next word prediction" engine. It doesn't just guess randomly; it uses the context of preceding words to make informed predictions. This context is captured through contextual embeddings. Imagine it like this: the model reads a sentence word by word, and with each word, it builds an understanding of the overall meaning and relationships between the words. This understanding is encoded in the contextual embeddings. These embeddings are then used to predict the most likely next word.

Here's a breakdown of the process:

Tokenization: The input text is broken down into individual units – tokens. These can be words, subwords (parts of words), or even characters.

Contextual Embedding Layer: This is where the magic happens. Each token is converted into a vector (a list of numbers) called a contextual embedding. Crucially, this embedding is not fixed; it depends on the surrounding words. So, the same word can have different embeddings depending on the context it appears in. For example, the word "bank" will have a different embedding in the sentence "I sat by the river bank" compared to "I went to the bank to deposit money." This context-sensitive embedding is what allows the model to understand nuances in language.

Decoder Block: This part of the model takes the contextual embeddings as input and uses them to predict the probability of each possible next word/token. It considers all the words in its vocabulary and assigns a probability to each one, based on how well it fits the current context. The word with the highest probability is selected as the next word in the sequence.

Therefore, the baseline autoregressive model is fundamentally a context-driven next-word prediction engine. The contextual embeddings are central to this process, as they represent the model's understanding of the meaning and relationships between words in a given sequence.

2. Unrolled Auto-Regressive Model (Figure 2): The Sequential Process

This diagram illustrates the iterative nature of text generation. The model predicts one token at a time, and each prediction becomes the input for the next step. This "unrolling" visualizes how the model builds up a sequence token by token. The key takeaway here is that the model's understanding of the context evolves with each prediction. Early predictions can significantly influence later ones.

3. Auto-Regressive Model with Reasoning Tokens (Chain-of-Thought): Thinking Step-by-Step

This introduces the concept of explicit reasoning. By providing examples with intermediate reasoning steps during training, the model learns to generate its own reasoning steps before arriving at the final answer.

Reasoning Tokens: These special tokens act as prompts to guide the model's thinking process. They can be natural language phrases or specific symbols that signal reasoning steps. For instance, reasoning tokens might start with "Therefore," "Because," or "Step 1:".

Benefits of Chain-of-Thought: This approach improves performance on complex reasoning tasks by forcing the model to decompose the problem into smaller, more manageable steps. It also makes the model's reasoning more transparent and interpretable.

OpenAI's o1 model is one of those model trained with chain-of-thought reasoning.

4. Auto-Regressive Model with Reasoning Embedding: Implicit Reasoning

Here is the interesting part. Instead of having the reasoning tokens generated one by one, the context embedding of the reasoning token could possibly trained. So, Given the same embedding will generate the same token. If such model was trained, we can predict the next token efficiently without the overhead of generating explicit reasoning tokens.

Reasoning Embedding Layer: This new layer learns to encode the essence of the reasoning process directly into the embeddings. Instead of explicitly generating reasoning steps, the model incorporates the learned reasoning patterns into its prediction process.

Efficiency Gains: By eliminating the need to generate intermediate tokens, this approach reduces computational cost and speeds up text generation.

As large language models evolve into powerful reasoning engines, we stand on the brink of a new era in AI capabilities. From foundational autoregressive models to innovative reasoning embeddings, each step forward enhances the efficiency, interpretability, and complexity of what these models can achieve. By integrating explicit reasoning (reasoning tokens) and implicit reasoning (reasoning embeddings) mechanisms, the future promises not only faster and more accurate text generation but also models capable of deeper understanding and problem-solving.

Monday, October 21, 2024

Efficient Multilingual Control of Robotic Dog Using LLM

Introduction

As the world of robotics continues to advance, the integration of artificial intelligence (AI) in robotic systems has become essential for making these machines smarter, more intuitive, and easier to control. One exciting area of development is the use of large language models (LLMs) to enhance the interaction between humans and robots. Recently, a question was raised in an LLM group about how to implement this integration.

The Challenge

The objective was to enable a robotic dog to understand and execute commands given in both English and Cantonese. However, there were key limitations to consider:

Multilingual Capability: The model needed to understand and process commands in both languages accurately.
Edge Device Compatibility: Given that the onboard GPU was a Jetson GPU with only 8GB of VRAM, the model had to be small and efficient enough to run effectively within this limited hardware capacity.
Fast Response Time: The robotic dog should be able to interpret commands and respond almost instantaneously, maintaining a natural interaction experience with users.

To address these challenges, we implement a PoC utilized a quantized version of the Qwen 2.5 1.5B model, which provided a balance between size, multilingual capabilities, and performance.

Why Use a Quantized Version of Qwen 2.5 1.5B Model?

The Qwen 2.5 1.5B model was chosen for several reasons:

Multilingual Capability: The model supports multiple languages, including English and Cantonese. This feature allowed the robotic dog to interpret commands accurately, regardless of the language used.
Efficient Edge Computing: A smaller model was preferred to fit within the constraints of the onboard Jetson GPU. The Qwen 2.5 1.5B model was quantized, reducing its memory footprint, making it lightweight and compatible with the edge device. Quantization reduces the model size by converting the weights from 32-bit floating points to smaller data types, such as 4-bit, without significantly sacrificing performance.
Optimized for Performance: Despite its smaller size, the model remained powerful enough to handle the command interpretation. By using the quantized version (Qwen2.5-1.5b-instruct-q4_k_m.gguf), it managed to provide a fast response time while consuming minimal VRAM.

Proof of Concept

We can quickly build a proof of concept (PoC) using llama.cpp to load the Qwen model.

The Prompt

You are the command center for an advanced robotic dog. Your role is to interpret user inputs and generate appropriate commands for the dog to execute. The available commands are:
- turn right
- turn left
- move forward
- move backward
- dance
- bark

Based on the user's input, create a list of one or more commands for the robotic dog. Output the commands as a JSON array.

Sample results

Two different Cantonese phrases are listed here (English translations are provided in brackets). The first one is straightforward, while the second requires understanding the user's intention to generate a command list.

Sample 1:
In this case, the model accurately interpreted the user's straightforward command, providing a sequence of actions for the robotic dog.

轉右向前行兩步，再吠兩聲嚟聽吓 (Turn right, move forward two steps, and then bark twice)

["turn right", "move forward", "move forward", "bark", "bark"]

Sample 2:
In the following case, the model was able to understand the user’s intention and interpreted "cheering up" as asking the robotic dog to perform an action that would be entertaining, like dancing. This showcases the model’s ability to grasp user sentiment and respond creatively.

我今日好唔開心，可以氹吓我嗎? (I'm feeling very sad today, can you cheer me up?)

["dance", "jump"]

Performance Summary

With llama.cpp and the quantized Qwen model, it leads to the following performance results:

Response Time: ~700 milliseconds in average on a Nvidia T4 Card. This means the model processed the input and generated commands in well under a second, ensuring a fluid interaction between the user and the robotic dog.
VRAM Usage: 2.8GB with default settings. By setting the maximum context length to only 500 tokens, the VRAM usage was reduced to 1.4GB, which is well within the 8GB limit of the Jetson GPU.

The efficient use of memory and fast response time demonstrated the feasibility of running LLMs on edge devices, even for multilingual applications.

Key Takeaways

The PoC demonstrated that it is possible to use a quantized version of a multilingual language model for real-time robotic control on edge devices. The key benefits included:

Multilingual Support: The ability to understand commands in both English and Cantonese expanded the usability and flexibility of the robotic dog.
Edge Device Compatibility: By using a smaller, quantized model, the AI was able to run efficiently on limited hardware without compromising performance.
Real-Time Performance: Fast response times ensured that the robotic dog could react promptly, making interactions feel natural and engaging.

This proof of concept paves the way for more advanced, language-based control systems for robots that can be deployed on edge devices, making them more accessible and practical for various real-world applications.

Thursday, January 26, 2023

Transforming Document Exploration: Using GPT-3 to Create a Chatbot that Answers Your Questions

GPT-3 is the underlying model of ChatGPT. It is an advanced language model that can generate text but it can't find information that is not already in the model. In additon, there are limitation to feed long text to GPT-3. In this article, we will show you how to use a technique called conditional generation to build a chatbot that can help you explore long documents.

Conditional generation is a way of training a language model to understand the context of a document and generate answers to questions about it. We will take the following steps to build the chatbot:

Create a corpus from the document.
Match the user's question to the relavant passages in the corpus.
Create a prompt that includes the context of the question.
Send the prompt to GPT-3 to generate an answer.
Create a function that combines steps 2-4.
Test the chatbot by creating a question-answer loop.
Use Gradio to create an interface for the chatbot.

In this exercise, we will build a chatbot to answer the question in an employee handbook.

Create a corpus from the document

First we have prepared a file HKIHRMEmployeeHandbook.txt which is text version of a sample employee handbook created by Hong Kong Institute of Human Resource Management.

To start, we are using a python library called "sentence-transformers" to create a corpus of embeddings from the text document "HKIHRMEmployeeHandbook.txt". The sentence-transformers library is an NLP library that makes use of pre-trained transformer-based models, like BERT, RoBERTa, DistilBERT, and, in this case, multi-qa-mpnet-base-cos-v1, to generate embeddings for each sentence in the corpus. This is useful for our chatbot application as it allows us to find the most relevant passage in our document for the user's query. This will help to provide more accurate and useful answers for the user.

Embeddings are a way to represent text in a numerical form that can be used for comparison and similarity calculation by the model. In this case, we are using sentence-transformers library to convert the text into embeddings. These embeddings are a compact and dense representation of the text, which can be used to compare and measure similarity between different texts.

There are several advantages of using embeddings instead of keyword search when building a chatbot for document exploration:

Semantic Similarity: Embeddings capture the semantic similarity between sentences, which allows for more accurate and relevant matches between the user's query and the passages in the corpus.
Handling synonyms and variations: Embeddings can handle synonyms and variations in the query, which are difficult to handle with traditional keyword search. For example, if the user asks "what are the benefits of working overtime?" and the document talks about "compensation for working beyond regular hours", the embeddings will be able to match the query with the relevant passage, even though the words in the query and document are different.
Handling context: Embeddings can also capture the context in which a word is used, which allows for more accurate matches between the query and the passages in the corpus. For example, "what are the benefits of working overtime?" and "what are the drawbacks of working overtime?" have similar keywords but opposite meaning, embeddings can help to understand the context and give the correct answer.
Scalability: Embeddings can handle large datasets and can be used to match queries against a large corpus of documents efficiently.

Overall, using embeddings instead of keyword search improves the accuracy and relevance of the chatbot's answers, and provides better user experience.

Match the user's question to the relavant passages in the corpus

Once we have the embeddings, we can find the most similar passages in the document to the given query. The util.semantic_search method to find the most similar passages in the corpus based on the query embedding and the corpus embeddings. The top_k variable controls how many passages to retrieve. After getting the passages from the corpus, they are joined as a whole piece together with linebreak to form the context. The result is as follows:

2.2 Working Hours 3.2 Overtime Compensation Prior approval must be obtained from respective supervisor for working overtime. Overtime for employees at grade 7-10 will be compensated by pay at the following rates: Category 1 a) For non-shift employees, Monday to Friday starting from an hour after normal working hour to 12:00 midnight and Saturday after 2:00 p.m. b) For shift employees, overtime worked beyond his shift 1.5 times hourly rate Overtime for employees at grade 4-6 will be compensated by way of time off. Such compensation leave will only be granted when their department’ s workload permits. Any overtime not compensated in the form of leave by the end of the calendar year will be paid in the following February at the rate of 1.5 times the employee's hourly rate. 3.3 Annual Bonus Employees who work overtime in the following conditions are entitled to claim for meal allowance: - overtime for 3 consecutive hours from Monday to Saturday; - overtime for 6 consecutive hours on Sunday or public holidays Employees who are required to report for duty outside office hours in case of emergency will be entitled to an emergency allowance and compensation leave 4.5 Compensation Leave or Time-off for Overtime Work Employees at grade 4-6 who have worked overtime will be compensated by way of time off. Such compensation leave will only be granted when their department’ s workload permits. Please also refer to “Overtime Compensation” under Section 3.2 of this handbook for details.

You can observed that all passages retrieved are more or less relavant to the overtime question.

Create a prompt that includes the context of the question

Then, we can creating a prompt template. The prompt will be created and send for GPT-3 to generate an answer for the query based on the context. The following prompt will be created

Context: 2.2 Working Hours 3.2 Overtime Compensation Prior approval must be obtained from respective supervisor for working overtime. Overtime for employees at grade 7-10 will be compensated by pay at the following rates: Category 1 a) For non-shift employees, Monday to Friday starting from an hour after normal working hour to 12:00 midnight and Saturday after 2:00 p.m. b) For shift employees, overtime worked beyond his shift 1.5 times hourly rate Overtime for employees at grade 4-6 will be compensated by way of time off. Such compensation leave will only be granted when their department’ s workload permits. Any overtime not compensated in the form of leave by the end of the calendar year will be paid in the following February at the rate of 1.5 times the employee's hourly rate. 3.3 Annual Bonus Employees who work overtime in the following conditions are entitled to claim for meal allowance: - overtime for 3 consecutive hours from Monday to Saturday; - overtime for 6 consecutive hours on Sunday or public holidays Employees who are required to report for duty outside office hours in case of emergency will be entitled to an emergency allowance and compensation leave 4.5 Compensation Leave or Time-off for Overtime Work Employees at grade 4-6 who have worked overtime will be compensated by way of time off. Such compensation leave will only be granted when their department’ s workload permits. Please also refer to “Overtime Compensation” under Section 3.2 of this handbook for details. Answer the following question: Q: what should I do if I worked overtime? A:

You can try a few other forms like:

Answer the following question. If the answer cannot be found in the context, write "I don't know"
Answer the following question. If the answer cannot be found in the context, rewrite the question that is more related to the context by starting with "Do you want to ask" and then elaborate the answer

Send the prompt to GPT-3 to generate an answer

The following code is using the OpenAI API to generate a response for the prompt using GPT-3. GPT-3 will come up with an answer that fits the context:

If you are an employee at grade 7-10, you should obtain prior approval from your supervisor and you will be compensated by pay at the rate of 1.5 times your hourly rate. If you are an employee at grade 4-6, you will be compensated by way of time off. Such compensation leave will only be granted when your department’s workload permits. Any overtime not compensated in the form of leave by the end of the calendar year will be paid in the following February at the rate of 1.5 times your hourly rate.

Create a function that wraps up everything

Wrapping the code into a function called answer(query) allows for easy reuse of the code and inputting of different queries. This function takes in a query string, and return the response of the API. This makes it easy to input different queries and get an answer without having to repeat the previous steps every time.

Test the chatbot by creating a question-answer loop

Here, we have an infinite loop that repeatedly prompts the user for a query, and exits the loop when the user enters "xxx". You can input multiple queries and get answers without having to restart the code. It allows easy testing of the function by allowing the user to input different queries and see the answers generated by GPT-3. The output will be as follows:

Q: do I get paid if I got sick? A: Yes, employees who are not able to report for duty due to illness are entitled to full pay sick leave, subject to a maximum of the lesser of 120 days or the number of sickness days that the employee has accumulated according to provisions in the Employment Ordinance. Additionally, employees who suffer from injury arising out of and in the course of employment are entitled to compensation in accordance with the Employee Compensation Ordinance. ====================================================================== Q: what should I do if I got sick? A: If I become ill, I should notify my immediate supervisor as soon as possible and provide a medical certificate if necessary. I am entitled to full pay sick leave, subject to a maximum of 120 days or the number of sickness days I have accumulated according to the Employment Ordinance. If I have to leave during office hours due to sickness or personal reasons, I must obtain prior approval from either my supervisor, department manager or the Human Resources Department. ====================================================================== Q: xxx

Use Gradio to create an interface for the chatbot

Finally, let's use gradio library to create a (GUI) for the answer(query) function defined earlier: Now you have a more user friendly GUI to ask the questions:

Try it out in Colab:

Monday, July 09, 2018

Testing with Intel Movidius Neural Compute Stick

One day, I came across this USB stick.

It is an USB stick with Myriad VPU 2 which is a chip specialized for convolution neural network. It is generally a good news as not all computers come with an expensive graphics card. Some device is not even capable for a graphics card like Raspberry Pi. So, a USB stick seems to be a perfect solution.

It has a FaceNet example for Tensorflow so I decided to try it out. It is target for Ubuntu on PC at the moment and specifically supports only version 16.04. Luckily, Virtual PC is supported as well so it is OK to run on my Windows 10 PC.

The NCSDK v2 installation is pretty painless. Then the NC App Zoo for the FaceNet sample. The run.py in the sample is pretty useless. It is just taking an image without getting the face and resampling it to a 180x180 image and feed it into the model. Obviously, it is not how FaceNet works.

Since the original TensorFlow implementation from David already got a compare.py. So, I copied it and the MTCNN dependencies over and renamed it to compare_tf.py. Then made a copy and updated it with the NCSDK as compare_nc.py. The results as as follows:

compare_nc.py

Images:
0: elvis-presley-401920_640.jpg
1: neal_2017-12-19-155037.jpg
2: president-67550_640.jpg
3: trump.jpg
4: valid.jpg

Distance matrix
        0         1         2         3         4
0    0.0000    0.6212    0.6725    0.7981    0.5387
1    0.6212    0.0000    0.8101    0.7106    0.5050
2    0.6725    0.8101    0.0000    0.6509    0.6273
3    0.7981    0.7106    0.6509    0.0000    0.6946
4    0.5387    0.5050    0.6273    0.6946    0.0000

compare_tf.py

Images:
0: elvis-presley-401920_640.jpg
1: neal_2017-12-19-155037.jpg
2: president-67550_640.jpg
3: trump.jpg
4: valid.jpg

Distance matrix
        0         1         2         3         4
0    0.0000    1.4255    1.3354    1.3078    1.4498
1    1.4255    0.0000    1.5454    1.4255    0.6372
2    1.3354    1.5454    0.0000    1.2032    1.4949
3    1.3078    1.4255    1.2032    0.0000    1.4904
4    1.4498    0.6372    1.4949    1.4904    0.0000

Hmmm, not the same result expected. I think it shouldn't be the hardware failure. It could be there are some conversion problem when converting TensorFlow model to the NCSDK model.

Conclusion: it is quite primitive at the moment. For common models like inception or mobilenet, it might work well. For custom models, good luck.

GitHub Link: https://github.com/compustar/ncappzoo/tree/ncsdk2/tensorflow/facenet