Monday, October 21, 2024

Efficient Multilingual Control of Robotic Dog Using LLM

Introduction

As the world of robotics continues to advance, the integration of artificial intelligence (AI) in robotic systems has become essential for making these machines smarter, more intuitive, and easier to control. One exciting area of development is the use of large language models (LLMs) to enhance the interaction between humans and robots. Recently, a question was raised in an LLM group about how to implement this integration.

The Challenge

The objective was to enable a robotic dog to understand and execute commands given in both English and Cantonese. However, there were key limitations to consider:

  1. Multilingual Capability: The model needed to understand and process commands in both languages accurately.
  2. Edge Device Compatibility: Given that the onboard GPU was a Jetson GPU with only 8GB of VRAM, the model had to be small and efficient enough to run effectively within this limited hardware capacity.
  3. Fast Response Time: The robotic dog should be able to interpret commands and respond almost instantaneously, maintaining a natural interaction experience with users.

To address these challenges, we implement a PoC utilized a quantized version of the Qwen 2.5 1.5B model, which provided a balance between size, multilingual capabilities, and performance.


Why Use a Quantized Version of Qwen 2.5 1.5B Model?

The Qwen 2.5 1.5B model was chosen for several reasons:

  1. Multilingual Capability: The model supports multiple languages, including English and Cantonese. This feature allowed the robotic dog to interpret commands accurately, regardless of the language used.

  2. Efficient Edge Computing: A smaller model was preferred to fit within the constraints of the onboard Jetson GPU. The Qwen 2.5 1.5B model was quantized, reducing its memory footprint, making it lightweight and compatible with the edge device. Quantization reduces the model size by converting the weights from 32-bit floating points to smaller data types, such as 4-bit, without significantly sacrificing performance.

  3. Optimized for Performance: Despite its smaller size, the model remained powerful enough to handle the command interpretation. By using the quantized version (Qwen2.5-1.5b-instruct-q4_k_m.gguf), it managed to provide a fast response time while consuming minimal VRAM.


Proof of Concept

We can quickly build a proof of concept (PoC) using llama.cpp to load the Qwen model.

The Prompt
You are the command center for an advanced robotic dog. Your role is to interpret user inputs and generate appropriate commands for the dog to execute. The available commands are:
- turn right
- turn left
- move forward
- move backward
- dance
- bark

Based on the user's input, create a list of one or more commands for the robotic dog. Output the commands as a JSON array.
Sample results

Two different Cantonese phrases are listed here (English translations are provided in brackets). The first one is straightforward, while the second requires understanding the user's intention to generate a command list.

Sample 1:
In this case, the model accurately interpreted the user's straightforward command, providing a sequence of actions for the robotic dog.

轉右向前行兩步,再吠兩聲嚟聽吓 (Turn right, move forward two steps, and then bark twice)

["turn right", "move forward", "move forward", "bark", "bark"]

Sample 2:
In the following case, the model was able to understand the user’s intention and interpreted "cheering up" as asking the robotic dog to perform an action that would be entertaining, like dancing. This showcases the model’s ability to grasp user sentiment and respond creatively.

我今日好唔開心,可以氹吓我嗎? (I'm feeling very sad today, can you cheer me up?)

["dance", "jump"]


Performance Summary

With llama.cpp and the quantized Qwen model, it leads to the following performance results:

  • Response Time: ~700 milliseconds in average on a Nvidia T4 Card. This means the model processed the input and generated commands in well under a second, ensuring a fluid interaction between the user and the robotic dog.
  • VRAM Usage: 2.8GB with default settings. By setting the maximum context length to only 500 tokens, the VRAM usage was reduced to 1.4GB, which is well within the 8GB limit of the Jetson GPU.

The efficient use of memory and fast response time demonstrated the feasibility of running LLMs on edge devices, even for multilingual applications.


Key Takeaways

The PoC demonstrated that it is possible to use a quantized version of a multilingual language model for real-time robotic control on edge devices. The key benefits included:

  1. Multilingual Support: The ability to understand commands in both English and Cantonese expanded the usability and flexibility of the robotic dog.
  2. Edge Device Compatibility: By using a smaller, quantized model, the AI was able to run efficiently on limited hardware without compromising performance.
  3. Real-Time Performance: Fast response times ensured that the robotic dog could react promptly, making interactions feel natural and engaging.

This proof of concept paves the way for more advanced, language-based control systems for robots that can be deployed on edge devices, making them more accessible and practical for various real-world applications.