Peft gaudi (#1178)
* enable mpt peft LORA finetune in Gaudi1

* update README

* mpt model change due to DL1 lack of support for torch.roll
sywangyi authored Jul 18, 2023
1 parent 4675d42 commit 3dc184e
Showing 14 changed files with 2,205 additions and 28 deletions.
85 changes: 80 additions & 5 deletions workflows/chatbot/fine_tuning/README.md
@@ -36,7 +36,7 @@ The instruction-following dataset is needed for the finetuning. We select two ki

We employ the [LoRA approach](https://arxiv.org/pdf/2106.09685.pdf) to finetune the LLM efficiently. Currently, FLAN-T5 and LLaMA are supported for finetuning.

## 1. Single Node Fine-tuning
## 1. Single Node Fine-tuning in Xeon SPR

For FLAN-T5, use the below command line for finetuning on the Alpaca dataset.

@@ -83,6 +83,7 @@ python finetune_clm.py \
--output_dir ./llama_peft_finetuned_model \
--peft lora \
--use_fast_tokenizer false \
--no_cuda \
```

For [MPT](https://huggingface.co/mosaicml/mpt-7b), use the below command line for finetuning on the Alpaca dataset. From the PEFT perspective, only LoRA is supported for MPT. MPT uses the gpt-neox-20b tokenizer, so you need to specify it explicitly on the command line. This model also requires that `trust_remote_code=True` be passed to the `from_pretrained` method, because we use a custom MPT model architecture that is not yet part of the Hugging Face transformers package.
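For reference, these two requirements map onto standard `transformers` loading calls; a minimal sketch (the model and tokenizer names are the ones referenced in this section, and this is illustrative only rather than what `finetune_clm.py` does internally):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# MPT ships a custom architecture, so remote code must be trusted explicitly.
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,  # required for the custom MPT model classes
)

# MPT has no tokenizer of its own; it reuses the gpt-neox-20b tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```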
Expand All @@ -108,17 +109,18 @@ python finetune_clm.py \
--peft lora \
--trust_remote_code True \
--tokenizer_name "EleutherAI/gpt-neox-20b" \
--no_cuda \
```

The `--dataset_concatenation` argument is a way to vastly accelerate the finetuning process through training-sample concatenation: several tokenized sentences are packed into one longer, denser training sample instead of many samples of different lengths, which is more efficient because the denser samples expose more parallelism.
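A rough sketch of the packing idea (illustrative only; the exact logic and names in `finetune_clm.py` may differ):

```python
def concatenate_examples(tokenized_examples, max_seq_length):
    """Pack many short tokenized samples into fixed-length training samples."""
    # Flatten all token ids into one long stream ...
    stream = [tok for ex in tokenized_examples for tok in ex["input_ids"]]
    # ... then cut the stream into completely filled, equal-length chunks.
    total = (len(stream) // max_seq_length) * max_seq_length
    return [
        {"input_ids": stream[i : i + max_seq_length]}
        for i in range(0, total, max_seq_length)
    ]
```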

For finetuning on SPR, adding the `--bf16` argument speeds up the finetuning process without degrading the model's performance.
You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA;
see https://github.com/huggingface/peft. Note for FLAN-T5, only LoRA is supported.
see https://github.com/huggingface/peft. Note for FLAN-T5/MPT, only LoRA is supported.
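Under the hood, `--peft lora` wraps the base model with a LoRA adapter via the `peft` library; a minimal sketch (hyperparameters are illustrative, not the script's defaults):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,           # causal-LM finetuning
    r=8, lora_alpha=16, lora_dropout=0.05,  # illustrative hyperparameters
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```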

Add the option **"--use_fast_tokenizer False"** when using the latest transformers if you encounter a failure with the LLaMA fast tokenizer. The `tokenizer_class` in `tokenizer_config.json` should also be changed from `LLaMATokenizer` to `LlamaTokenizer`.
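In code, `--use_fast_tokenizer False` corresponds to loading the slow tokenizer; a minimal sketch (model name taken from the command above):

```python
from transformers import AutoTokenizer

# Force the slow (SentencePiece-based) LLaMA tokenizer to sidestep the fast-tokenizer failure.
tokenizer = AutoTokenizer.from_pretrained(
    "decapoda-research/llama-7b-hf", use_fast=False
)
```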

## 2. Multi-node Fine-tuning
## 2. Multi-node Fine-tuning in Xeon SPR

We also support Distributed Data Parallel (DDP) finetuning in both single-node and multi-node settings. To use Distributed Data Parallel to speed up training, the bash command needs a small adjustment.
<br>
Expand All @@ -132,7 +134,7 @@ For example, to finetune FLAN-T5 through Distributed Data Parallel training, bas
<br>
*`<NODE_RANK>`* is the rank of the current node; ranks range from 0 to *`<NUM_NODES>`*`-1`.
<br>
> Also please note that to use CPU for training on each node in a multi-node setting, the argument `--no_cuda` is mandatory, and `--xpu_backend ccl` is required if ccl is used as the distributed backend. In a multi-node setting, the following command needs to be launched on each node, and all the commands should be identical except for *`<NODE_RANK>`*, which should be an integer from 0 to *`<NUM_NODES>`*`-1` assigned to each node.
> Also please note that to use CPU for training on each node in a multi-node setting, the argument `--no_cuda` is mandatory, and `--ddp_backend ccl` is required if ccl is used as the distributed backend. In a multi-node setting, the following command needs to be launched on each node, and all the commands should be identical except for *`<NODE_RANK>`*, which should be an integer from 0 to *`<NUM_NODES>`*`-1` assigned to each node.
``` bash
python -m torch.distributed.launch --master_addr=<MASTER_ADDRESS> --nproc_per_node=<NUM_PROCESSES_PER_NODE> --nnodes=<NUM_NODES> --node_rank=<NODE_RANK> \
Expand All @@ -153,7 +155,9 @@ python -m torch.distributed.launch --master_addr=<MASTER_ADDRESS> --nproc_per_no
--save_total_limit 2 \
--overwrite_output_dir \
--output_dir ./flan-t5-xl_peft_finetuned_model \
--peft lora
--peft lora \
--no_cuda \
--ddp_backend ccl \
```
If you have enabled passwordless SSH in the CPU cluster, you can also use mpirun on the master node to start the DDP finetuning. Take the LLaMA Alpaca finetuning as an example: follow the [Hugging Face guide](https://huggingface.co/docs/transformers/perf_train_cpu_many) to install Intel® oneCCL Bindings for PyTorch and IPEX.
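Once those packages are installed, the `ccl` backend becomes available to `torch.distributed`, which is what `--ddp_backend ccl` selects; a minimal sketch (assuming Intel® oneCCL Bindings for PyTorch are installed; the import name may differ between versions):

```python
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (assumed import name; registers the "ccl" backend)

# Rank and world size are normally provided by the launcher (torch.distributed.launch or mpirun).
dist.init_process_group(
    backend="ccl",
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
)
```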

@@ -206,6 +210,8 @@ mpirun -f nodefile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
--dataset_concatenation \
--use_fast_tokenizer false \
--do_train \
--no_cuda \
--ddp_backend ccl \

## for DDP LORA for MPT
mpirun -f nodefile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
Expand All @@ -229,6 +235,75 @@ mpirun -f nodefile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py
--do_train \
--trust_remote_code True \
--tokenizer_name "EleutherAI/gpt-neox-20b" \
--no_cuda \
--ddp_backend ccl \
```
You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA;
see https://github.com/huggingface/peft.
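The `--peft` choices correspond roughly to these `peft` configuration classes (a hedged mapping; the exact option strings accepted by `finetune_clm.py` may differ):

```python
from peft import (
    AdaptionPromptConfig,  # LLaMA Adapter
    LoraConfig,            # LoRA
    PrefixTuningConfig,    # Prefix tuning
    PromptEncoderConfig,   # P-tuning
    PromptTuningConfig,    # Prompt tuning
)

# Hypothetical option-to-config mapping, for illustration only.
PEFT_CONFIGS = {
    "lora": LoraConfig,
    "ptun": PromptEncoderConfig,
    "prefix": PrefixTuningConfig,
    "prompt": PromptTuningConfig,
    "llama_adapter": AdaptionPromptConfig,
}
```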

## 1. Single Node Fine-tuning in Habana DL1

Follow the installation guidance in [optimum-habana](https://github.com/huggingface/optimum-habana).

For LLaMA, use the below command line for finetuning on the Alpaca dataset.

```bash
python finetune_clm.py \
--model_name_or_path "decapoda-research/llama-7b-hf" \
--bf16 True \
--train_file "/path/to/alpaca_data.json" \
--dataset_concatenation \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 4 \
--do_train \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./llama_peft_finetuned_model \
--peft lora \
--use_fast_tokenizer false \
--habana \
--use_habana \
--use_lazy_mode \
```
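The `--habana`, `--use_habana`, and `--use_lazy_mode` flags route training through optimum-habana instead of the stock `Trainer`; a minimal sketch mirroring the `finetune_clm.py` change in this commit (argument values are illustrative, and `model`, `train_dataset`, and `tokenizer` stand for objects prepared earlier in the script):

```python
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./llama_peft_finetuned_model",
    bf16=True,
    use_habana=True,     # run on HPU
    use_lazy_mode=True,  # lazy-mode graph execution
)

gaudi_config = GaudiConfig()
gaudi_config.use_fused_adam = True
gaudi_config.use_fused_clip_norm = True

trainer = GaudiTrainer(
    model=model,                  # placeholder: the PEFT-wrapped model
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: the tokenized, concatenated dataset
    tokenizer=tokenizer,          # placeholder: the matching tokenizer
)
trainer.train()
```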

For [MPT](https://huggingface.co/mosaicml/mpt-7b), use the below command line for finetuning on the Alpaca dataset. From the PEFT perspective, only LoRA is supported for MPT. MPT uses the gpt-neox-20b tokenizer, so you need to specify it explicitly on the command line. This model also requires that `trust_remote_code=True` be passed to the `from_pretrained` method, because we use a custom MPT model architecture that is not yet part of the Hugging Face transformers package.

```bash
python finetune_clm.py \
--model_name_or_path "mosaicml/mpt-7b" \
--bf16 True \
--train_file "/path/to/alpaca_data.json" \
--dataset_concatenation \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 4 \
--do_train \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./mpt_peft_finetuned_model \
--peft lora \
--trust_remote_code True \
--tokenizer_name "EleutherAI/gpt-neox-20b" \
--habana \
--use_habana \
--use_lazy_mode \
```

The `--dataset_concatenation` argument is a way to vastly accelerate the finetuning process through training-sample concatenation: several tokenized sentences are packed into one longer, denser training sample instead of many samples of different lengths, which is more efficient because the denser samples expose more parallelism.

Adding the `--bf16` argument speeds up the finetuning process without degrading the model's performance.
You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA;
see https://github.com/huggingface/peft. Note that for MPT, only LoRA is supported.

Add the option **"--use_fast_tokenizer False"** when using the latest transformers if you encounter a failure with the LLaMA fast tokenizer. The `tokenizer_class` in `tokenizer_config.json` should also be changed from `LLaMATokenizer` to `LlamaTokenizer`.
@@ -50,6 +50,8 @@
import copy
import re
import torch
import importlib.util
from transformers.utils.import_utils import is_optimum_available

IGNORE_INDEX = -100

Expand All @@ -58,6 +60,10 @@
logger = logging.getLogger(__name__)


def is_optimum_habana_available():
    return is_optimum_available() and importlib.util.find_spec("optimum.habana") is not None


@dataclass
class ModelArguments:
"""
@@ -115,6 +121,7 @@ class ModelArguments:
        },
    )


@dataclass
class DataArguments:
"""
@@ -257,6 +264,10 @@ class FinetuneArguments:
        default=True,
        metadata={"help": "if False, masks out inputs in loss"},
    )
    habana: bool = field(
        default=False,
        metadata={"help": "whether to run finetuning on Habana Gaudi (HPU)"},
    )


PROMPT_DICT = {
@@ -293,10 +304,16 @@ def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.
    if not is_optimum_habana_available():
        parser = HfArgumentParser(
            (ModelArguments, DataArguments, TrainingArguments, FinetuneArguments)
        )
    else:
        from optimum.habana import GaudiTrainingArguments

        parser = HfArgumentParser(
            (ModelArguments, DataArguments, GaudiTrainingArguments, FinetuneArguments)
        )
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
Expand All @@ -311,6 +328,11 @@ def main():
finetune_args,
) = parser.parse_args_into_dataclasses()

if finetune_args.habana:
if not is_optimum_habana_available():
raise ImportError(
"optimum habana is not installed. refer https://github.com/huggingface/optimum-habana"
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
@@ -470,17 +492,32 @@ def main():
    # Load model
    if model_args.model_name_or_path:
        model_dtype = torch.bfloat16 if training_args.bf16 else None
        if re.search("mpt", model_args.model_name_or_path, re.IGNORECASE):
            from models.mpt.modeling_mpt import MPTForCausalLM

            model = MPTForCausalLM.from_pretrained(
                model_args.model_name_or_path,
                from_tf=bool(".ckpt" in model_args.model_name_or_path),
                config=config,
                cache_dir=model_args.cache_dir,
                revision=model_args.model_revision,
                use_auth_token=True if model_args.use_auth_token else None,
                trust_remote_code=True if model_args.trust_remote_code else None,
                torch_dtype=model_dtype,
                low_cpu_mem_usage=True,
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_args.model_name_or_path,
                from_tf=bool(".ckpt" in model_args.model_name_or_path),
                config=config,
                cache_dir=model_args.cache_dir,
                revision=model_args.model_revision,
                use_auth_token=True if model_args.use_auth_token else None,
                trust_remote_code=True if model_args.trust_remote_code else None,
                torch_dtype=model_dtype,
                low_cpu_mem_usage=True,
            )
    else:
        raise ValueError(
            "Must provide model_name_or_path to load a pretrained CausalLM model."
@@ -642,15 +679,33 @@ def concatenate_data(dataset, max_seq_length):
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    if not finetune_args.habana:
        # Initialize our Trainer
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )
    else:
        from optimum.habana import GaudiConfig, GaudiTrainer

        gaudi_config = GaudiConfig()
        gaudi_config.use_fused_adam = True
        gaudi_config.use_fused_clip_norm = True
        # Initialize our Trainer
        trainer = GaudiTrainer(
            model=model,
            gaudi_config=gaudi_config,
            args=training_args,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

    trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
    with training_args.main_process_first(desc="save model"):
        if is_main_process(training_args.local_rank):
Empty file.
Empty file.
@@ -0,0 +1,41 @@
from typing import Union

from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
NUM_SENTINEL_TOKENS: int = 100

def adapt_tokenizer_for_denoising(tokenizer: Tokenizer):
    """Adds sentinel tokens and padding token (if missing).
    Expands the tokenizer vocabulary to include sentinel tokens
    used in mixture-of-denoiser tasks as well as a padding token.
    All added tokens are added as special tokens. No tokens are
    added if sentinel tokens and padding token already exist.
    """
    sentinels_to_add = [f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)]
    tokenizer.add_tokens(sentinels_to_add, special_tokens=True)
    if tokenizer.pad_token is None:
        tokenizer.add_tokens('<pad>', special_tokens=True)
        tokenizer.pad_token = '<pad>'
        assert tokenizer.pad_token_id is not None
    sentinels = ''.join([f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)])
    _sentinel_token_ids = tokenizer(sentinels, add_special_tokens=False).input_ids
    tokenizer.sentinel_token_ids = _sentinel_token_ids

class AutoTokenizerForMOD(AutoTokenizer):
    """AutoTokenizer + Adaptation for MOD.
    A simple wrapper around AutoTokenizer to make instantiating
    an MOD-adapted tokenizer a bit easier.
    MOD-adapted tokenizers have sentinel tokens (e.g., <extra_id_0>),
    a padding token, and a property to get the token ids of the
    sentinel tokens.
    """

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        """See `AutoTokenizer.from_pretrained` docstring."""
        tokenizer = super().from_pretrained(*args, **kwargs)
        adapt_tokenizer_for_denoising(tokenizer)
        return tokenizer
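A short usage sketch for the helper above (the import path is assumed, matching the `models.mpt` package referenced in `finetune_clm.py`):

```python
from transformers import AutoTokenizer

# Assumed location of the file added by this commit.
from models.mpt.adapt_tokenizer import adapt_tokenizer_for_denoising

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
adapt_tokenizer_for_denoising(tokenizer)
print(len(tokenizer.sentinel_token_ids))  # 100 sentinel token ids were added
```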
