feat: Spellcheck #345

jeremyarancio · 2024-07-08T07:28:59Z

What

LLM development with QLoRA.

Description

Fine-tune LLMs on the Spellcheck task
Re-evaluate foundational models on the benchmark
Normalize the evaluation algorithm so that some types of errors are not counted
Add a data processing pipeline
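
To illustrate the evaluation normalization, here is a minimal sketch. It is not the code from this PR: the function names and the variant table are hypothetical, and reading "oe" as the ligature "œ" is an assumption. It applies the rules named in the "Normalize evaluation algorithm" commit ("flavour" -> "flavor", "ï" -> "i", "â" -> "a", "oe") before two strings are compared, so that these differences are not counted as errors.

```python
import unicodedata

# Hypothetical variant table: the PR only names "flavour" -> "flavor".
SPELLING_VARIANTS = {"flavour": "flavor"}


def normalize(text: str) -> str:
    """Return a copy of text with tolerated differences removed."""
    # Assumption: "oe" in the commit message refers to the ligature.
    # Unicode decomposition does not split it, so it is replaced explicitly.
    text = text.replace("œ", "oe").replace("Œ", "Oe")
    # Decompose accented characters, then drop the combining marks:
    # "ï" -> "i", "â" -> "a".
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map regional spelling variants to a single form.
    for variant, canonical in SPELLING_VARIANTS.items():
        text = text.replace(variant, canonical)
    return text


def same_after_normalization(reference: str, prediction: str) -> bool:
    """True when both strings are equal once tolerated differences are removed."""
    return normalize(reference) == normalize(prediction)
```

With this sketch, `same_after_normalization("natural flavour, cœur", "natural flavor, coeur")` treats the two strings as equal, so a model would not be penalized for either spelling.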

Part of

Spellcheck

jeremyarancio added 18 commits June 18, 2024 20:59

feat(spellcheck): ⚡ Add feature: push to Argilla from an HF dataset

4ad41de

fix(spellcheck): 🐛 Add fixes to T5 script

128e056

fix(spellcheck): 🐛 Add guardrail to prevent computing metrics with e…

3217a89

…mpty strings

feat(spellcheck): 🎨 Training pipeline using Metaflow

8bd97f2

feat(spellcheck): 🎨 LLM QLoRA TRL training script - Mistral - 7B - In…

a858bc8

…struct

perf(spellcheck): 🧪 Normalize evaluation algorithm

634b050

"flavour" -> "flavor" - "ï" -> "i" - "â" -> "a" - "oe"

feat(spellcheck): 🎨 Implement LLM training with Sagemaker & Metaflow

e16c6a2

Scripts are customized to handle training in the cloud using SageMaker Training Jobs

feat(spellcheck): ⚡ Mistral 7b instruct v3 trained

2f6f75b

feat(spellcheck): 🎨 Update guidelines: accents

db2780c

refactor(spellcheck): ✨ Update Logging to consider script and src cod…

a6bc377

…e for defining the level

feat(spellcheck): ✨ Dataset processing methods & pipeline created

94b17d7

build(spellcheck): ✨ Dataset processing (oe, percentage alignment): v3.1

8a3e9eb

feat(spellcheck): ♻️ Training Mistral-7B-Instruct: instruction + unid…

fd0460b

…ecode normalization + scheduler linear

Delete previous LLM training DAG

1559740

feat(spellcheck): ⚡ Add eval normalization: remove "\n"

a612752

refactor(spellcheck): 👷 Foundational LLMs re-evaluated on the benchmark

cb393c6

The prompt was intentionally overfitted on the benchmark in order to later create the synthetic training dataset. Examples from the benchmark are removed from the prompt.

Modify overfitted prompt

edce9c5

Merge branch 'develop' into spellcheck

1645a83

jeremyarancio added the spellcheck label Jul 8, 2024

jeremyarancio requested a review from raphael0202 July 8, 2024 07:28

jeremyarancio self-assigned this Jul 8, 2024

jeremyarancio added 9 commits July 8, 2024 13:37

refactor(spellcheck): 🏷️ Refactor Argilla extraction: modules + unite…

6170159

…sts + dag

refactor(spellcheck): ⬆️ Refactor training job: add parameters to arg…

d2f63c7

…parse + add cometML logs

feat(spellcheck): 🎨 Fine-tune Mistral-7b with guidelines + training s…

b9c1dc2

…cripts improvements

feat(spellcheck): ✨ Add args for training + Train Mistral-7b-base

6ddf189

fix(spellcheck): 🐛 Correction error in Mistral-7B-Base fine-tuning sc…

33e9691

…ript

feat(spellcheck): 🎨 DPO dataset extraction and push

a3d310d

fix(spellcheck): 🔖 small fixes

fea5716

feat(spellcheck): ⚡ DPO training script

7092b69

refactor(spellcheck): 🚧 Refactor training pipeline: WIP

7448e10

jeremyarancio added 13 commits July 18, 2024 16:16

Update get_logger for Metaflow logging

cdcc825

feat(spellcheck): ✨ Double the benchmark size: extraction and push to…

7cf82a9

… Argilla pipeline (WIP)

docs(spellcheck): 📝 Document benchmark generation pipeline

e28a470

feat(spellcheck): 🐛 Remove legacy metadata in Argilla

b6b5bf8

refactor(spellcheck): 🚧 Refactor training pipeline (WIP)

a4f295a

refactor(spellcheck): 🚧 Refactor training script (WIP)

c6aaee5

refactor(spellcheck): 🚧 Refactor training pipeline

4401966

chore(spellcheck): ✨ Update Python from 3.9 to 3.10

e870a24

refactor(spellcheck): ⚡ LLM training pipeline refactored

08a097a

feat(spellcheck): ✨ Pretraining before fine-tuning (WIP)

b563be1

feat(spellcheck): 🚑 Pretraining + Finetuning Mistral-7B

f0d924d

refactor(spellcheck): ✨ Refactor training

ac4f496

feat(spellcheck): 🚧 Batch processing (WIP)

1600a65

teolemon changed the title from "Spellcheck" to "feat: Spellcheck" Aug 6, 2024

jeremyarancio added 4 commits August 14, 2024 20:43

feat(spellcheck): 🚧 Batch job with vllm and GCP (WIP)

f28dbc5

feat(spellcheck): ⚡ Batch job operational on GCP

c7b0709

refactor(spellcheck): ✨ Clean code and add logging to batch job

0ef32f8

fix(spellcheck): 📦 Forgot to add batch dep requirements

70a33e1

raphael0202 approved these changes Aug 23, 2024

raphael0202 merged commit 24adbb2 into develop Aug 23, 2024
1 of 2 checks passed

raphael0202 deleted the spellcheck branch August 23, 2024 09:52
