Peft gaudi (#1178)
* enable mpt peft LORA finetune in Gaudi1

* update README

* mpt model change due to DL1 lack of support for torch.roll
sywangyi authored Jul 18, 2023
1 parent 4675d42 commit 3dc184e
Showing 14 changed files with 2,205 additions and 28 deletions.
85 changes: 80 additions & 5 deletions workflows/chatbot/fine_tuning/README.md
@@ -36,7 +36,7 @@ The instruction-following dataset is needed for the finetuning. We select two ki

We employ the [LoRA approach](https://arxiv.org/pdf/2106.09685.pdf) to finetune the LLM efficiently. Currently, FLAN-T5 and LLaMA are supported for finetuning.

## 1. Single Node Fine-tuning
## 1. Single Node Fine-tuning in Xeon SPR

For FLAN-T5, use the below command line for finetuning on the Alpaca dataset.

@@ -83,6 +83,7 @@ python finetune_clm.py \
--output_dir ./llama_peft_finetuned_model \
--peft lora \
--use_fast_tokenizer false \
--no_cuda \
```

For [MPT](https://huggingface.co/mosaicml/mpt-7b), use the below command line for finetuning on the Alpaca dataset. From the PEFT perspective, only LoRA is supported for MPT. MPT uses the gpt-neox-20b tokenizer, so you need to specify it explicitly on the command line. This model also requires that `trust_remote_code=True` be passed to the `from_pretrained` method, because we use a custom MPT model architecture that is not yet part of the Hugging Face transformers package.
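For reference, these two requirements map onto standard `transformers` loading calls; a minimal sketch (the model and tokenizer names are the ones referenced in this section, and this is illustrative only rather than what `finetune_clm.py` does internally):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# MPT ships a custom architecture, so remote code must be trusted explicitly.
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    trust_remote_code=True,  # required for the custom MPT model classes
)

# MPT has no tokenizer of its own; it reuses the gpt-neox-20b tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```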
Expand All @@ -108,17 +109,18 @@ python finetune_clm.py \
--peft lora \
--trust_remote_code True \
--tokenizer_name "EleutherAI/gpt-neox-20b" \
--no_cuda \
```

The `--dataset_concatenation` argument is a way to vastly accelerate the finetuning process through training-sample concatenation: several tokenized sentences are packed into one longer, denser training sample instead of many samples of different lengths, which is more efficient because the denser samples expose more parallelism.
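A rough sketch of the packing idea (illustrative only; the exact logic and names in `finetune_clm.py` may differ):

```python
def concatenate_examples(tokenized_examples, max_seq_length):
    """Pack many short tokenized samples into fixed-length training samples."""
    # Flatten all token ids into one long stream ...
    stream = [tok for ex in tokenized_examples for tok in ex["input_ids"]]
    # ... then cut the stream into completely filled, equal-length chunks.
    total = (len(stream) // max_seq_length) * max_seq_length
    return [
        {"input_ids": stream[i : i + max_seq_length]}
        for i in range(0, total, max_seq_length)
    ]
```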

For finetuning on SPR, adding the `--bf16` argument speeds up the finetuning process without degrading the model's performance.
You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA;
see https://github.com/huggingface/peft. Note for FLAN-T5, only LoRA is supported.
see https://github.com/huggingface/peft. Note for FLAN-T5/MPT, only LoRA is supported.
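Under the hood, `--peft lora` wraps the base model with a LoRA adapter via the `peft` library; a minimal sketch (hyperparameters are illustrative, not the script's defaults):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,           # causal-LM finetuning
    r=8, lora_alpha=16, lora_dropout=0.05,  # illustrative hyperparameters
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```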

Add the option **"--use_fast_tokenizer False"** when using the latest transformers if you encounter a failure with the LLaMA fast tokenizer. The `tokenizer_class` in `tokenizer_config.json` should also be changed from `LLaMATokenizer` to `LlamaTokenizer`.
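In code, `--use_fast_tokenizer False` corresponds to loading the slow tokenizer; a minimal sketch (model name taken from the command above):

```python
from transformers import AutoTokenizer

# Force the slow (SentencePiece-based) LLaMA tokenizer to sidestep the fast-tokenizer failure.
tokenizer = AutoTokenizer.from_pretrained(
    "decapoda-research/llama-7b-hf", use_fast=False
)
```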

## 2. Multi-node Fine-tuning
## 2. Multi-node Fine-tuning in Xeon SPR

We also support Distributed Data Parallel (DDP) finetuning in both single-node and multi-node settings. To use Distributed Data Parallel to speed up training, the bash command needs a small adjustment.
<br>
Expand All @@ -132,7 +134,7 @@ For example, to finetune FLAN-T5 through Distributed Data Parallel training, bas
<br>
*`<NODE_RANK>`* is the rank of the current node; ranks range from 0 to *`<NUM_NODES>`*`-1`.
<br>
> Also please note that to use CPU for training on each node in a multi-node setting, the argument `--no_cuda` is mandatory, and `--xpu_backend ccl` is required if ccl is used as the distributed backend. In a multi-node setting, the following command needs to be launched on each node, and all the commands should be identical except for *`<NODE_RANK>`*, which should be an integer from 0 to *`<NUM_NODES>`*`-1` assigned to each node.
> Also please note that to use CPU for training on each node in a multi-node setting, the argument `--no_cuda` is mandatory, and `--ddp_backend ccl` is required if ccl is used as the distributed backend. In a multi-node setting, the following command needs to be launched on each node, and all the commands should be identical except for *`<NODE_RANK>`*, which should be an integer from 0 to *`<NUM_NODES>`*`-1` assigned to each node.
``` bash
python -m torch.distributed.launch --master_addr=<MASTER_ADDRESS> --nproc_per_node=<NUM_PROCESSES_PER_NODE> --nnodes=<NUM_NODES> --node_rank=<NODE_RANK> \
Expand All @@ -153,7 +155,9 @@ python -m torch.distributed.launch --master_addr=<MASTER_ADDRESS> --nproc_per_no
--save_total_limit 2 \
--overwrite_output_dir \
--output_dir ./flan-t5-xl_peft_finetuned_model \
--peft lora
--peft lora \
--no_cuda \
--ddp_backend ccl \
```
If you have enabled passwordless SSH in the CPU cluster, you can also use mpirun on the master node to start the DDP finetuning. Take the LLaMA Alpaca finetuning as an example: follow the [Hugging Face guide](https://huggingface.co/docs/transformers/perf_train_cpu_many) to install Intel® oneCCL Bindings for PyTorch and IPEX.
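Once those packages are installed, the `ccl` backend becomes available to `torch.distributed`, which is what `--ddp_backend ccl` selects; a minimal sketch (assuming Intel® oneCCL Bindings for PyTorch are installed; the import name may differ between versions):

```python
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (assumed import name; registers the "ccl" backend)

# Rank and world size are normally provided by the launcher (torch.distributed.launch or mpirun).
dist.init_process_group(
    backend="ccl",
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
)
```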

@@ -206,6 +210,8 @@ mpirun -f nodefile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
--dataset_concatenation \
--use_fast_tokenizer false \
--do_train \
--no_cuda \
--ddp_backend ccl \

## for DDP LORA for MPT
mpirun -f nodefile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
Expand All @@ -229,6 +235,75 @@ mpirun -f nodefile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py
--do_train \
--trust_remote_code True \
--tokenizer_name "EleutherAI/gpt-neox-20b" \
--no_cuda \
--ddp_backend ccl \
```
You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA;
see https://github.com/huggingface/peft.
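The `--peft` choices correspond roughly to these `peft` configuration classes (a hedged mapping; the exact option strings accepted by `finetune_clm.py` may differ):

```python
from peft import (
    AdaptionPromptConfig,  # LLaMA Adapter
    LoraConfig,            # LoRA
    PrefixTuningConfig,    # Prefix tuning
    PromptEncoderConfig,   # P-tuning
    PromptTuningConfig,    # Prompt tuning
)

# Hypothetical option-to-config mapping, for illustration only.
PEFT_CONFIGS = {
    "lora": LoraConfig,
    "ptun": PromptEncoderConfig,
    "prefix": PrefixTuningConfig,
    "prompt": PromptTuningConfig,
    "llama_adapter": AdaptionPromptConfig,
}
```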

## 1. Single Node Fine-tuning in Habana DL1

Follow the installation guidance in [optimum-habana](https://github.com/huggingface/optimum-habana).

For LLaMA, use the below command line for finetuning on the Alpaca dataset.

```bash
python finetune_clm.py \
--model_name_or_path "decapoda-research/llama-7b-hf" \
--bf16 True \
--train_file "/path/to/alpaca_data.json" \
--dataset_concatenation \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 4 \
--do_train \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./llama_peft_finetuned_model \
--peft lora \
--use_fast_tokenizer false \
--habana \
--use_habana \
--use_lazy_mode \
```
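The `--habana`, `--use_habana`, and `--use_lazy_mode` flags route training through optimum-habana instead of the stock `Trainer`; a minimal sketch mirroring the `finetune_clm.py` change in this commit (argument values are illustrative, and `model`, `train_dataset`, and `tokenizer` stand for objects prepared earlier in the script):

```python
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./llama_peft_finetuned_model",
    bf16=True,
    use_habana=True,     # run on HPU
    use_lazy_mode=True,  # lazy-mode graph execution
)

gaudi_config = GaudiConfig()
gaudi_config.use_fused_adam = True
gaudi_config.use_fused_clip_norm = True

trainer = GaudiTrainer(
    model=model,                  # placeholder: the PEFT-wrapped model
    gaudi_config=gaudi_config,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: the tokenized, concatenated dataset
    tokenizer=tokenizer,          # placeholder: the matching tokenizer
)
trainer.train()
```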

For [MPT](https://huggingface.co/mosaicml/mpt-7b), use the below command line for finetuning on the Alpaca dataset. From the PEFT perspective, only LoRA is supported for MPT. MPT uses the gpt-neox-20b tokenizer, so you need to specify it explicitly on the command line. This model also requires that `trust_remote_code=True` be passed to the `from_pretrained` method, because we use a custom MPT model architecture that is not yet part of the Hugging Face transformers package.

```bash
python finetune_clm.py \
--model_name_or_path "mosaicml/mpt-7b" \
--bf16 True \
--train_file "/path/to/alpaca_data.json" \
--dataset_concatenation \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 4 \
--do_train \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./mpt_peft_finetuned_model \
--peft lora \
--trust_remote_code True \
--tokenizer_name "EleutherAI/gpt-neox-20b" \
--habana \
--use_habana \
--use_lazy_mode \
```

The `--dataset_concatenation` argument is a way to vastly accelerate the finetuning process through training-sample concatenation: several tokenized sentences are packed into one longer, denser training sample instead of many samples of different lengths, which is more efficient because the denser samples expose more parallelism.

Adding the `--bf16` argument speeds up the finetuning process without degrading the model's performance.
You can also use `--peft` to switch the PEFT method among P-tuning, Prefix tuning, Prompt tuning, LLaMA Adapter, and LoRA;
see https://github.com/huggingface/peft. Note that for MPT, only LoRA is supported.

Add the option **"--use_fast_tokenizer False"** when using the latest transformers if you encounter a failure with the LLaMA fast tokenizer. The `tokenizer_class` in `tokenizer_config.json` should also be changed from `LLaMATokenizer` to `LlamaTokenizer`.
@@ -50,6 +50,8 @@
import copy
import re
import torch
import importlib.util
from transformers.utils.import_utils import is_optimum_available

IGNORE_INDEX = -100

Expand All @@ -58,6 +60,10 @@
logger = logging.getLogger(__name__)


def is_optimum_habana_available():
    return is_optimum_available() and importlib.util.find_spec("optimum.habana") is not None


@dataclass
class ModelArguments:
"""
@@ -115,6 +121,7 @@ class ModelArguments:
        },
    )


@dataclass
class DataArguments:
"""
@@ -257,6 +264,10 @@ class FinetuneArguments:
        default=True,
        metadata={"help": "if False, masks out inputs in loss"},
    )
    habana: bool = field(
        default=False,
        metadata={"help": "whether to run finetuning on Habana Gaudi (HPU)"},
    )


PROMPT_DICT = {
@@ -293,10 +304,16 @@ def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.
    if not is_optimum_habana_available():
        parser = HfArgumentParser(
            (ModelArguments, DataArguments, TrainingArguments, FinetuneArguments)
        )
    else:
        from optimum.habana import GaudiTrainingArguments

        parser = HfArgumentParser(
            (ModelArguments, DataArguments, GaudiTrainingArguments, FinetuneArguments)
        )
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
Expand All @@ -311,6 +328,11 @@ def main():
finetune_args,
) = parser.parse_args_into_dataclasses()

if finetune_args.habana:
if not is_optimum_habana_available():
raise ImportError(
"optimum habana is not installed. refer https://github.com/huggingface/optimum-habana"
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
@@ -470,17 +492,32 @@ def main():
    # Load model
    if model_args.model_name_or_path:
        model_dtype = torch.bfloat16 if training_args.bf16 else None
        if re.search("mpt", model_args.model_name_or_path, re.IGNORECASE):
            from models.mpt.modeling_mpt import MPTForCausalLM

            model = MPTForCausalLM.from_pretrained(
                model_args.model_name_or_path,
                from_tf=bool(".ckpt" in model_args.model_name_or_path),
                config=config,
                cache_dir=model_args.cache_dir,
                revision=model_args.model_revision,
                use_auth_token=True if model_args.use_auth_token else None,
                trust_remote_code=True if model_args.trust_remote_code else None,
                torch_dtype=model_dtype,
                low_cpu_mem_usage=True,
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                model_args.model_name_or_path,
                from_tf=bool(".ckpt" in model_args.model_name_or_path),
                config=config,
                cache_dir=model_args.cache_dir,
                revision=model_args.model_revision,
                use_auth_token=True if model_args.use_auth_token else None,
                trust_remote_code=True if model_args.trust_remote_code else None,
                torch_dtype=model_dtype,
                low_cpu_mem_usage=True,
            )
    else:
        raise ValueError(
            "Must provide model_name_or_path to load a pretrained CausalLM model."
@@ -642,15 +679,33 @@ def concatenate_data(dataset, max_seq_length):
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    if not finetune_args.habana:
        # Initialize our Trainer
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )
    else:
        from optimum.habana import GaudiConfig, GaudiTrainer

        gaudi_config = GaudiConfig()
        gaudi_config.use_fused_adam = True
        gaudi_config.use_fused_clip_norm = True
        # Initialize our Trainer
        trainer = GaudiTrainer(
            model=model,
            gaudi_config=gaudi_config,
            args=training_args,
            train_dataset=train_dataset if training_args.do_train else None,
            eval_dataset=eval_dataset if training_args.do_eval else None,
            tokenizer=tokenizer,
            data_collator=data_collator,
        )

    trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
    with training_args.main_process_first(desc="save model"):
        if is_main_process(training_args.local_rank):
Empty file.
Empty file.
@@ -0,0 +1,41 @@
from typing import Union

from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast

Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
NUM_SENTINEL_TOKENS: int = 100

def adapt_tokenizer_for_denoising(tokenizer: Tokenizer):
    """Adds sentinel tokens and padding token (if missing).
    Expands the tokenizer vocabulary to include sentinel tokens
    used in mixture-of-denoiser tasks as well as a padding token.
    All added tokens are added as special tokens. No tokens are
    added if sentinel tokens and padding token already exist.
    """
    sentinels_to_add = [f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)]
    tokenizer.add_tokens(sentinels_to_add, special_tokens=True)
    if tokenizer.pad_token is None:
        tokenizer.add_tokens('<pad>', special_tokens=True)
        tokenizer.pad_token = '<pad>'
        assert tokenizer.pad_token_id is not None
    sentinels = ''.join([f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)])
    _sentinel_token_ids = tokenizer(sentinels, add_special_tokens=False).input_ids
    tokenizer.sentinel_token_ids = _sentinel_token_ids

class AutoTokenizerForMOD(AutoTokenizer):
    """AutoTokenizer + Adaptation for MOD.
    A simple wrapper around AutoTokenizer to make instantiating
    an MOD-adapted tokenizer a bit easier.
    MOD-adapted tokenizers have sentinel tokens (e.g., <extra_id_0>),
    a padding token, and a property to get the token ids of the
    sentinel tokens.
    """

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        """See `AutoTokenizer.from_pretrained` docstring."""
        tokenizer = super().from_pretrained(*args, **kwargs)
        adapt_tokenizer_for_denoising(tokenizer)
        return tokenizer
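A short usage sketch for the helper above (the import path is assumed, matching the `models.mpt` package referenced in `finetune_clm.py`):

```python
from transformers import AutoTokenizer

# Assumed location of the file added by this commit.
from models.mpt.adapt_tokenizer import adapt_tokenizer_for_denoising

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
adapt_tokenizer_for_denoising(tokenizer)
print(len(tokenizer.sentinel_token_ids))  # 100 sentinel token ids were added
```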
