- Background: The paper addresses a key challenge in Chinese instruction fine-tuning of large language models (LLMs): raw data is often labeled and used directly in instruction datasets without screening, which frequently leaves instructions and responses mismatched.
- Existing Work: The open-source community currently lacks high-quality Chinese instruction fine-tuning datasets. Existing datasets such as Firefly, COIG, BELLE, MOSS, and OL-CC suffer from limited scope, poor quality, commercial restrictions, or insufficient coverage. This gap hinders progress toward LLMs that can effectively process and execute complex Chinese instructions.
- The Kun Strategy
- Challenge 1: How to improve the consistency between instructions and responses in the dataset? Kun addresses this with a process called AP (Answer Polishment), which refines raw data so that each response corresponds more closely to its instruction. Unlike methods that depend on LLMs, Kun provides an independent and scalable training approach.
- Challenge 2: How to avoid costly and time-consuming manual annotation? Kun's algorithmic improvements enhance data retention and clarity, and its data generation approach substantially reduces reliance on manual annotation.
- Kun's training method requires a base model, high-quality seed data, and a large amount of primary data. The primary data consists mainly of a large volume of unlabeled data, distinct from the labeled data. The authors used the 6B-parameter Yi model as the base model and optimized the match between data and instructions in the dataset. To assess the model, they conducted a human evaluation on 500 prompts from ShareGPT-zh, comparing its responses across various tasks with those of other models; the Kun-52k variant demonstrated superior performance. They also evaluated dataset quality on 1000 instruction-output pairs drawn from sources such as Wudao, Wanjuan, and SkyPile, rating clarity, feasibility, practicality, and alignment.
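The flow described above (unlabeled primary data in, polished and screened instruction-response pairs out) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the step order, the `Pair` type, the function `build_instruction_dataset`, and the three `toy_*` callables are all hypothetical stand-ins. In particular, the sketch assumes that an instruction is generated for each unlabeled text by a model trained on the seed data, which this summary does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    """One instruction-response example (hypothetical container)."""
    instruction: str
    response: str


def build_instruction_dataset(
    primary_texts: Iterable[str],
    generate_instruction: Callable[[str], str],
    polish_answer: Callable[[Pair], str],
    is_consistent: Callable[[Pair], bool],
) -> List[Pair]:
    """Turn unlabeled texts into screened instruction-response pairs.

    Assumed steps (not confirmed by the summary):
      1. generate an instruction for each unlabeled text,
      2. polish the raw text so it reads as an answer to that instruction,
      3. keep only pairs that pass a consistency check.
    """
    dataset: List[Pair] = []
    for text in primary_texts:
        instruction = generate_instruction(text)
        draft = Pair(instruction, text)
        polished = Pair(instruction, polish_answer(draft))
        if is_consistent(polished):
            dataset.append(polished)
    return dataset


# Toy stand-ins so the sketch runs without any model.
def toy_generate_instruction(text: str) -> str:
    # Use the first sentence of the whitespace-normalized text as the topic.
    topic = " ".join(text.split()).split(".")[0]
    return f"Explain the following topic: {topic}"


def toy_polish_answer(pair: Pair) -> str:
    # Placeholder "polishing": collapse runs of whitespace.
    return " ".join(pair.response.split())


def toy_is_consistent(pair: Pair) -> bool:
    # Placeholder screen: drop responses shorter than 20 characters.
    return len(pair.response) >= 20


demo = build_instruction_dataset(
    ["Kun  is a data strategy.   It polishes answers.", "Too short."],
    toy_generate_instruction,
    toy_polish_answer,
    toy_is_consistent,
)
```

In a real setting, each `toy_*` function would be replaced by a call to a fine-tuned model or a scoring routine. Passing the steps in as callables keeps the loop independent of whichever model is used.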
The paper presents the Kun strategy, which addresses the data consistency problem in instruction fine-tuning of Chinese LLMs and reduces dependence on manual annotation through the AP process and new data generation methods. The evaluation results indicate that Kun has a significant advantage in creating high-quality datasets.