Background

Background The paper discusses the challenges of deploying large-scale language models on edge devices like PCs and smartphones, such as limited storage capacity and lower inference speeds, and points out that current research focuses on using smaller LLMs and the LoRA method to overcome these issues.
Existing Work Although there have been attempts to deploy smaller LLMs to edge devices, these efforts primarily focus on enhancing the inference speeds of models on devices. Still, there is a bottleneck in the language model's ability to call functions efficiently.

Core Contributions

Introduced an effective method to enhance accuracy and latency for language model function calling on devices, achieving state-of-the-art results
- Challenge 1: How to improve function calling accuracy and reduce latency on edge devices This study showcases how the enhancement of a 2B parameter model, through tokenizing core function names and fine-tuning, addresses the challenge by outperforming GPT-4 in function calling, saving over 95% of context length, allowing 37 times more function calling capacity with the same battery, and reducing latency by approximately 35 times per function call.
- Challenge 2: How to train a language model to understand the meaning associated with specific functional tokens By integrating the function descriptions into the training dataset, the model learns the importance of these special tokens. Using the special token <nexa_end> as the early stopping criterion allows for rapid and accurate function calling.

Implementation and Deployment

The study employs the Google Gemma-2B model as the pre-trained model and uses two training methodologies: full model training and LoRA training. Training is conducted using an AdamW optimizer and a linear learning rate scheduler. The evaluation consists of extensive benchmarking to compare the model's function calling accuracy and response time and make comparisons with leading models such as GPT-4 and GPT-3.5. The efficacy of the RAG technique, along with Meta’s FAISS for semantic search, has been employed to improve the retrieval process of function call descriptions, significantly increasing accuracy and decreasing latency. The evaluation studies the impact of training dataset size and model training method on performance metrics, finding that even APIs with only 100 data points can achieve 98.095% accuracy, indicating that high performance can be maintained even in resource-constrained situations.

Summary

This paper addresses the deployment and function call efficiency issues of LLMs on edge devices. By introducing specialized training methods and reducing the amount of context that needs to be processed during inference, the paper significantly improves the accuracy of function calls and reduces latency on devices. The experimental results demonstrate a significant impact on the performance of function calling tasks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2404.01744.md

2404.01744.md

Background

Core Contributions

Implementation and Deployment

Summary

Files

2404.01744.md

Latest commit

History

2404.01744.md

File metadata and controls

Background

Core Contributions

Implementation and Deployment

Summary