modify Dockerfile for finetuning (#1243)
* modify Dockerfile for finetuning

* install peft from source code

* update readme and dockerfile

* delete real file path

---------

Co-authored-by: Lv, Liang1 <liang1.lv@intel.com>
letonghan and lvliang-intel committed Jul 28, 2023
1 parent 9694e55 commit 797aa28
Showing 4 changed files with 30 additions and 38 deletions.
34 changes: 10 additions & 24 deletions workflows/chatbot/fine_tuning/docker/Dockerfile
@@ -60,20 +60,16 @@ RUN conda init bash && \
    conda config --add channels intel && \
    conda create -yn chatbot-finetuning python=3.9 && \
    echo "conda activate chatbot-finetuning" >> ~/.bashrc && \
-    source ~/.bashrc && \
-    wget https://intel-extension-for-pytorch.s3.amazonaws.com/torch_ccl/cpu/oneccl_bind_pt-1.13.0%2Bcpu-cp39-cp39-linux_x86_64.whl && \
-    pip install datasets torch accelerate SentencePiece git+https://github.com/huggingface/peft.git evaluate nltk rouge_score protobuf==3.20.1 tokenizers einops
+    source ~/.bashrc

-# Build ITREX
-RUN cd /itrex && pip install -v . && \
+SHELL ["/bin/bash", "--login", "-c", "conda", "run", "-n", "chatbot-finetuning"]
+RUN wget https://intel-extension-for-pytorch.s3.amazonaws.com/torch_ccl/cpu/oneccl_bind_pt-1.13.0%2Bcpu-cp39-cp39-linux_x86_64.whl && \
+    pip install datasets torch accelerate SentencePiece evaluate nltk rouge_score protobuf==3.20.1 tokenizers einops && \
+    git clone https://github.com/huggingface/peft.git && cd peft && python setup.py install && \
+    cd /itrex && pip install -v . && \
    cd workflows/chatbot/fine_tuning && pip install -r requirements.txt

-# Copy the model files to the Docker image
-# Please update the local model path here with the actual path
-COPY flan-t5-xl /flan/
-
-COPY alpaca_data.json /dataset/
-WORKDIR /itrex/workflows/chatbot
+WORKDIR /itrex/workflows/chatbot/fine_tuning


# HABANA environment
@@ -91,10 +87,10 @@ RUN git clone https://github.com/huggingface/optimum-habana.git && \
    git-lfs install

RUN pip install git+https://github.com/huggingface/optimum-habana.git && \
-    pip install peft && \
    pip install einops && \
    pip install datasets && \
-    pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.10.0
+    pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.10.0 && \
+    git clone https://github.com/huggingface/peft.git && cd peft && python setup.py install

# Download ITREX code
ARG ITREX_VER=main
@@ -108,14 +104,4 @@ RUN git clone --single-branch --branch=${ITREX_VER} ${REPO} itrex && \
RUN cd /itrex && pip install -v . && \
    pip install transformers==4.28.1

-# Copy the model files to the Docker image
-# Please update the local model path here with the actual path
-COPY flan-t5-xl /flan/
-
-COPY alpaca_data.json /dataset/
-
-# Copy peft gaudi branch to the docker image
-COPY nlp-toolkit-peft-gaudi /nlp-toolkit/
-WORKDIR /nlp-toolkit/workflows/chatbot/fine_tuning
-
-# WORKDIR /itrex/workflows/chatbot/fine_tuning
+WORKDIR /itrex/workflows/chatbot/fine_tuning
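
The `SHELL` instruction added above is what makes the subsequent `RUN` layer execute inside the `chatbot-finetuning` conda environment. As a minimal sketch of the underlying pattern (separate from this diff, assuming `conda activate` is appended to `~/.bashrc` as in the lines above):

```dockerfile
# Run every later RUN step under a login shell, so ~/.bashrc is sourced
# and the conda environment activated there takes effect.
SHELL ["/bin/bash", "--login", "-c"]
RUN python -c "import sys; print(sys.executable)"  # resolves to the env's python
```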
27 changes: 13 additions & 14 deletions workflows/chatbot/fine_tuning/docker/README.md
@@ -27,15 +27,11 @@ Please clone an ITREX repo to this path.
```bash
git clone https://github.com/intel-innersource/frameworks.ai.nlp-toolkit.intel-nlp-toolkit.git
```
-You can modify the model path at line 70 if you are going to run other models:
-```bash
-vim /path/to/workspace/frameworks.ai.nlp-toolkit.intel-nlp-toolkit/workflows/chatbot/fine_tuning/docker/Dockerfile
-```
-```
-COPY flan-t5-xl /flan/
-```


## 4. Build Docker Image
| Note: If your Docker build context is too large and makes the image build slow, you can create a `.dockerignore` file listing unneeded files to shrink the context.
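
For illustration, a minimal `.dockerignore` might look like the sketch below; the actual entries depend on what sits in your build context.

```
# Keep large or irrelevant local files out of the build context
.git
models/
datasets/
*.log
```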

### On Xeon SPR Environment
```bash
docker build --build-arg UBUNTU_VER=22.04 --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f /path/to/workspace/frameworks.ai.nlp-toolkit.intel-nlp-toolkit/workflows/chatbot/fine_tuning/docker/Dockerfile -t chatbot_finetune . --target cpu
@@ -45,13 +41,16 @@ docker build --build-arg UBUNTU_VER=22.04 --build-arg https_proxy=$https_proxy -
DOCKER_BUILDKIT=1 docker build --network=host --tag chatbot_finetuning:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy ./ -f Dockerfile --target hpu
```
## 5. Create Docker Container
+Before creating your Docker container, make sure the model has been downloaded locally.
+
+Then mount the `model files` and `alpaca_data.json` into the Docker container using `-v`. Make sure to use the `absolute path` for local files.
### On Xeon SPR Environment
```bash
-docker run -it --disable-content-trust --privileged --name="chatbot" --hostname="chatbot-container" --network=host -e https_proxy -e http_proxy -e HTTPS_PROXY -e HTTP_PROXY -e no_proxy -e NO_PROXY -v /dev/shm:/dev/shm "chatbot_finetune"
+docker run -it --disable-content-trust --privileged --name="chatbot" --hostname="chatbot-container" --network=host -e https_proxy -e http_proxy -e HTTPS_PROXY -e HTTP_PROXY -e no_proxy -e NO_PROXY -v /dev/shm:/dev/shm -v /absolute/path/to/flan-t5-xl:/flan -v /absolute/path/to/alpaca_data.json:/dataset/alpaca_data.json "chatbot_finetune"
```
### On Habana Gaudi Environment
```bash
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e https_proxy -e http_proxy -e HTTPS_PROXY -e HTTP_PROXY -e no_proxy -e NO_PROXY -v /dev/shm:/dev/shm --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e https_proxy -e http_proxy -e HTTPS_PROXY -e HTTP_PROXY -e no_proxy -e NO_PROXY -v /dev/shm:/dev/shm -v /absolute/path/to/flan-t5-xl:/flan -v /absolute/path/to/alpaca_data.json:/dataset/alpaca_data.json --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest
```
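
As a quick sanity check (assuming the mount paths used above), you can confirm inside the container that the model and dataset are visible:

```bash
# Both paths should list successfully if the -v mounts are correct
ls /flan
ls /dataset/alpaca_data.json
```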

# Finetune
@@ -60,10 +59,10 @@ We employ the [LoRA approach](https://arxiv.org/pdf/2106.09685.pdf) to finetune

## 1. Single Node Fine-tuning in Xeon SPR

-For FLAN-T5, use the below command line for finetuning on the Alpaca dataset.
+For FLAN-T5, use the below command line for finetuning on the Alpaca dataset. Please make sure the file paths are consistent with the paths mounted into the Docker container.

```bash
-python fine_tuning/instruction_tuning_pipeline/finetune_seq2seq.py \
+python instruction_tuning_pipeline/finetune_seq2seq.py \
    --model_name_or_path "/flan" \
    --train_file "/dataset/alpaca_data.json" \
    --per_device_train_batch_size 2 \
@@ -85,7 +84,7 @@ python fine_tuning/instruction_tuning_pipeline/finetune_seq2seq.py \
For LLaMA, use the below command line for finetuning on the Alpaca dataset.

```bash
-python fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
+python instruction_tuning_pipeline/finetune_clm.py \
    --model_name_or_path "/llama_7b" \
    --train_file "/dataset/alpaca_data.json" \
    --dataset_concatenation \
@@ -107,7 +106,7 @@ python fine_tuning/instruction_tuning_pipeline/finetune_clm.py \
For [MPT](https://huggingface.co/mosaicml/mpt-7b), use the below command line for finetuning on the Alpaca dataset. From the PEFT perspective, only LoRA supports MPT. MPT uses the gpt-neox-20b tokenizer, so you need to specify it explicitly on the command line. This model also requires that `trust_remote_code=True` be passed to the `from_pretrained` method, because we use a custom MPT model architecture that is not yet part of the Hugging Face transformers package.
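
For reference, a minimal sketch of how those two requirements look when loading MPT directly with transformers, using the model and tokenizer names from the paragraph above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# MPT uses a custom architecture, so remote code must be trusted explicitly
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

# MPT ships no tokenizer of its own; it reuses the gpt-neox-20b tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```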

```bash
-python finetune_clm.py \
+python instruction_tuning_pipeline/finetune_clm.py \
    --model_name_or_path "mosaicml/mpt-7b" \
    --bf16 True \
    --train_file "/path/to/alpaca_data.json" \
@@ -266,7 +265,7 @@ Follow install guidance in [optimum-habana](https://github.com/huggingface/optimum-habana)
For LLaMA, use the below command line for finetuning on the Alpaca dataset.

```bash
-python finetune_clm.py \
+python instruction_tuning_pipeline/finetune_clm.py \
    --model_name_or_path "decapoda-research/llama-7b-hf" \
    --bf16 True \
    --train_file "/path/to/alpaca_data.json" \
@@ -357,6 +357,10 @@ def main():
        )
    else:
        raise ValueError("Please provide value for model_name_or_path or config_name.")

+    # set use_fast_tokenizer to False for Llama series models
+    if "llama" in model.config.model_type:
+        model_args.use_fast_tokenizer = False

    tokenizer_kwargs = {
        "cache_dir": model_args.cache_dir,
@@ -374,6 +374,9 @@ def main():
        **dataset_args,
    )

+    # set use_fast_tokenizer to False for Llama series models
+    if "llama" in model.config.model_type:
+        model_args.use_fast_tokenizer = False

    tokenizer_kwargs = {
        "cache_dir": model_args.cache_dir,
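
The two hunks above add the same guard to both fine-tuning scripts. Standalone, the effect is equivalent to forcing the slow SentencePiece tokenizer whenever a Llama checkpoint is loaded; a sketch, with an illustrative checkpoint name:

```python
from transformers import AutoTokenizer

# Fast Llama tokenizers were unreliable at the time of this commit,
# so fall back to the slow SentencePiece implementation.
tokenizer = AutoTokenizer.from_pretrained(
    "decapoda-research/llama-7b-hf",  # illustrative Llama-family checkpoint
    use_fast=False,
)
```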
