2403.18058.md
Background

  • Background The paper notes the significant advances in large language models (LLMs), particularly for English instruction tuning, while highlighting a gap in Chinese instruction tuning caused by the unique linguistic features and cultural depth of the Chinese language. Existing datasets for Chinese instruction tuning are either derived from English-focused LLMs or fail to align with the interaction patterns of Chinese-speaking users.

  • Existing Work Existing Chinese instruction tuning datasets are generally small in scale or lacking in quality. They include translated English instruction datasets, datasets generated by LLMs, and self-constructed datasets, which suffer from misalignment with natural Chinese communication patterns, a lack of genuine Chinese linguistic data, numerous problematic data points, and limited scale.

Core Contributions

  • Introduced COIG-CQIA, a high-quality Chinese instruction fine-tuning dataset
    • Challenge 1: Aligning the dataset with real-world Chinese user interactions. A high-quality human-written corpus was collected from various Chinese internet sources, including Q&A communities, wikis, examinations, and existing NLP datasets, then rigorously filtered and processed.

    • Challenge 2: Demonstrating improved performance of models trained on the dataset. Extensive experiments explored the impact of data quality, source, and mixing ratio. Models fine-tuned on COIG-CQIA achieved competitive results in human assessments as well as on knowledge and safety benchmarks.

Implementation and Deployment

Models of various scales were trained on different subsets of the COIG-CQIA dataset and then evaluated and analyzed in depth. The findings confirmed competitive results on human assessments as well as knowledge and safety benchmarks, establishing COIG-CQIA as a valuable resource for the Chinese NLP community.
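As a rough illustration (not code from the paper), instruction-tuning records are commonly stored as instruction/input/output triples and flattened into prompt–response pairs before supervised fine-tuning; the field names below follow that common convention and are assumptions, not the dataset's documented schema:

```python
# Minimal sketch: flatten instruction-tuning records into prompt/response
# pairs for supervised fine-tuning. The field names ("instruction",
# "input", "output") are a common convention, assumed here for
# illustration only.

def build_example(record: dict) -> dict:
    """Turn one record into a single prompt/response pair."""
    prompt = record["instruction"]
    if record.get("input"):  # optional extra context for the instruction
        prompt += "\n" + record["input"]
    return {"prompt": prompt, "response": record["output"]}

# Two toy records in the assumed schema.
records = [
    {"instruction": "Translate to English:", "input": "你好", "output": "Hello"},
    {"instruction": "Name a prime number.", "input": "", "output": "7"},
]
examples = [build_example(r) for r in records]
```

Pairs in this form can then be tokenized and fed to any standard supervised fine-tuning loop.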

Summary

This paper presents COIG-CQIA, a high-quality Chinese instruction fine-tuning dataset designed to align closely with real human interactions. The research emphasizes the importance of high-quality data sources for fine-tuning and demonstrates experimentally how dataset-construction strategies and fine-tuning methods significantly affect model performance.