Background

  • Background The paper addresses a critical challenge in instruction fine-tuning large language models (LLMs) in Chinese: instruction datasets are often built by labeling raw data directly, and because the raw data is unscreened, the resulting instructions and responses frequently do not match.

  • Existing Work The open-source community currently lacks high-quality Chinese instruction fine-tuning datasets. Existing datasets such as Firefly, COIG, BELLE, MOSS, and OL-CC suffer from limited scope, poor quality, commercial restrictions, or insufficient coverage. This gap hinders progress on LLMs that can process and execute complex Chinese instructions.

Core Contributions

  • The Kun Strategy
    • Challenge 1: How can consistency between instructions and responses in the dataset be improved? The Kun strategy addresses this through a process called AP (Answer Polishment), which refines raw data so that instructions and responses correlate more tightly (see the sketch after this list). Unlike methods that depend on external LLMs, Kun provides an independent and scalable training approach.

    • Challenge 2: How can costly and time-consuming manual annotation be avoided? Kun's algorithmic refinements improve data retention and clarity, and its data generation approach substantially reduces reliance on expensive, slow manual annotation.
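To make the AP idea concrete, here is a minimal sketch of an Answer Polishment style refinement loop. The prompt wording, the injected `generate` and `keep` callables, and the filtering step are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of an Answer Polishment (AP) style refinement loop.
# The prompt text, model call, and consistency filter below are
# hypothetical stand-ins for the paper's pipeline.

from dataclasses import dataclass


@dataclass
class Pair:
    instruction: str
    response: str


POLISH_PROMPT = (
    "Rewrite the response so it directly and completely answers the "
    "instruction, removing any content unrelated to it.\n"
    "Instruction: {instruction}\nResponse: {response}\nPolished response:"
)


def polish(pair: Pair, generate) -> Pair:
    """Refine a raw response so it aligns tightly with its instruction.

    `generate` is any text-generation callable (e.g. a wrapper around a
    locally hosted base model); AP does not need a proprietary LLM API.
    """
    polished = generate(POLISH_PROMPT.format(
        instruction=pair.instruction, response=pair.response))
    return Pair(pair.instruction, polished.strip())


def build_dataset(raw_pairs, generate, keep) -> list[Pair]:
    """Polish every raw pair, then keep only pairs that pass a
    consistency check (`keep`), improving instruction-response match."""
    return [p for p in (polish(rp, generate) for rp in raw_pairs) if keep(p)]
```

Because both the generator and the filter are pluggable, a loop like this can run over large volumes of unlabeled primary data without per-example human annotation, which is the point of Challenge 2.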

Implementation and Deployment

Kun's training method requires a base model, high-quality seed data, and a large volume of primary data; the primary data consists mainly of unlabeled text, as distinct from the labeled seed data. The authors use Yi-6B as the base model and optimize the matching between data and instructions in the dataset.

To assess the model's effectiveness, the researchers conducted a comprehensive human evaluation on 500 prompts from ShareGPT-zh, comparing the model's responses across a variety of tasks against those of other models; their Kun-52k variant proved superior. They further audited dataset quality on 1,000 instruction-output pairs drawn from sources such as Wudao, Wanjuan, and SkyPile, scoring clarity, feasibility, practicality, and alignment to ensure high dataset quality.
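A hedged sketch of that dataset-quality audit: sample instruction-output pairs and score each on the four criteria above. The 1-5 scale, the sampling logic, and the `rate_fn` annotator stand-in are assumptions for illustration only.

```python
# Sketch of the dataset-quality audit: score a sample of pairs on the
# four criteria named in the paper summary. Scale and sampling are
# assumed details, not taken from the paper.

import random

CRITERIA = ("clarity", "feasibility", "practicality", "alignment")


def audit(dataset, rate_fn, sample_size=1000, seed=0):
    """Return the mean score per criterion over a random sample.

    `rate_fn(pair, criterion)` stands in for a human annotator and is
    assumed to return a score in [1, 5].
    """
    rng = random.Random(seed)
    sample = rng.sample(dataset, min(sample_size, len(dataset)))
    totals = {c: 0.0 for c in CRITERIA}
    for pair in sample:
        for c in CRITERIA:
            totals[c] += rate_fn(pair, c)
    return {c: totals[c] / len(sample) for c in CRITERIA}
```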

Summary

The paper presents the Kun strategy, which addresses the instruction-response consistency problem in Chinese LLM instruction fine-tuning and reduces dependence on manual annotation through the AP process and new data generation methods. The evaluation results indicate that Kun offers a significant advantage in building high-quality datasets.