root@dc53c9f6e164:/workspace/axolotl# accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml --deepspeed deepspeed_configs/zero1.json
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `4`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
WARNING: BNB_CUDA_VERSION=118 environment variable detected; loading libbitsandbytes_cuda118.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:
[INFO] [datasets.:58] [PID:814] PyTorch version 2.1.2+cu118 available.
[2024-05-29 07:53:04,534] [INFO] [datasets.:58] [PID:812] PyTorch version 2.1.2+cu118 available.
[2024-05-29 07:53:04,534] [INFO] [datasets.:58] [PID:813] PyTorch version 2.1.2+cu118 available.
[2024-05-29 07:53:04,549] [INFO] [datasets.:58] [PID:815] PyTorch version 2.1.2+cu118 available.
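To silence the defaults warning, the same run could be launched with the parameters spelled out explicitly. This is a sketch built only from the flags named in the warning above, using the default values accelerate chose for this box; it is not a fix for anything, just the explicit form of the same invocation:

```shell
accelerate launch \
  --num_processes 4 \
  --num_machines 1 \
  --mixed_precision no \
  --dynamo_backend no \
  -m axolotl.cli.train examples/openllama-3b/lora.yml \
  --deepspeed deepspeed_configs/zero1.json
```

Alternatively, running `accelerate config` once persists these choices so the warning never fires.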
[2024-05-29 07:53:05,265] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-29 07:53:05,267] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-29 07:53:05,269] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-29 07:53:05,276] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-29 07:53:05,322] [INFO] [root.spawn:38] [PID:814] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmp5_ya9bz5/test.c -o /tmp/tmp5_ya9bz5/test.o
[2024-05-29 07:53:05,324] [INFO] [root.spawn:38] [PID:812] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmpscip52pk/test.c -o /tmp/tmpscip52pk/test.o
[2024-05-29 07:53:05,326] [INFO] [root.spawn:38] [PID:813] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmptcck0du1/test.c -o /tmp/tmptcck0du1/test.o
[2024-05-29 07:53:05,332] [INFO] [root.spawn:38] [PID:815] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.10/include -fPIC -c /tmp/tmp4zaefw69/test.c -o /tmp/tmp4zaefw69/test.o
[2024-05-29 07:53:05,337] [INFO] [root.spawn:38] [PID:812] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat /tmp/tmpscip52pk/test.o -laio -o /tmp/tmpscip52pk/a.out
[2024-05-29 07:53:05,338] [INFO] [root.spawn:38] [PID:814] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat /tmp/tmp5_ya9bz5/test.o -laio -o /tmp/tmp5_ya9bz5/a.out
[2024-05-29 07:53:05,340] [INFO] [root.spawn:38] [PID:813] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat /tmp/tmptcck0du1/test.o -laio -o /tmp/tmptcck0du1/a.out
[2024-05-29 07:53:05,344] [INFO] [root.spawn:38] [PID:815] gcc -pthread -B /root/miniconda3/envs/py3.10/compiler_compat /tmp/tmp4zaefw69/test.o -laio -o /tmp/tmp4zaefw69/a.out
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
[2024-05-29 07:53:06,219] [WARNING] [axolotl.utils.config.models.input.hint_sample_packing_padding:747] [PID:813] [RANK:1] `pad_to_sequence_len: true` is recommended when using sample_packing
[2024-05-29 07:53:06,219] [INFO] [axolotl.utils.config.models.input.check_bf16:1102] [PID:813] [RANK:1] bf16 support detected, but not enabled for this configuration.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-29 07:53:06,220] [WARNING] [axolotl.utils.config.models.input.hint_sample_packing_padding:747] [PID:815] [RANK:3] `pad_to_sequence_len: true` is recommended when using sample_packing
[2024-05-29 07:53:06,221] [INFO] [axolotl.utils.config.models.input.check_bf16:1102] [PID:815] [RANK:3] bf16 support detected, but not enabled for this configuration.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-29 07:53:06,223] [WARNING] [axolotl.utils.config.models.input.hint_sample_packing_padding:747] [PID:814] [RANK:2] `pad_to_sequence_len: true` is recommended when using sample_packing
[2024-05-29 07:53:06,223] [INFO] [axolotl.utils.config.models.input.check_bf16:1102] [PID:814] [RANK:2] bf16 support detected, but not enabled for this configuration.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-29 07:53:06,231] [WARNING] [axolotl.utils.config.models.input.hint_sample_packing_padding:747] [PID:812] [RANK:0] `pad_to_sequence_len: true` is recommended when using sample_packing
[2024-05-29 07:53:06,232] [INFO] [axolotl.utils.config.models.input.check_bf16:1102] [PID:812] [RANK:0] bf16 support detected, but not enabled for this configuration.
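The two config hints above (`pad_to_sequence_len` and unused bf16 support) both point at keys in examples/openllama-3b/lora.yml. A sketch of the relevant edits, assuming the hinted behaviour is actually wanted for this run; key names are axolotl's own:

```yaml
sample_packing: true        # already on, per the warning above
pad_to_sequence_len: true   # recommended when sample_packing is enabled
bf16: true                  # optional: hardware supports bf16 but this config leaves it off
```

Enabling bf16 changes training precision, so it is a judgment call rather than a required fix.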
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-29 07:53:06,398] [INFO] [axolotl.normalize_config:182] [PID:815] [RANK:3] GPU memory usage baseline: 0.000GB (+0.320GB misc)
[2024-05-29 07:53:06,398] [INFO] [axolotl.normalize_config:182] [PID:813] [RANK:1] GPU memory usage baseline: 0.000GB (+0.322GB misc)
[2024-05-29 07:53:06,400] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-29 07:53:06,401] [INFO] [comm.py:637:init_distributed] cdb=None
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-29 07:53:06,626] [INFO] [axolotl.normalize_config:182] [PID:814] [RANK:2] GPU memory usage baseline: 0.000GB (+0.320GB misc)
[2024-05-29 07:53:06,629] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-29 07:53:06,632] [INFO] [axolotl.normalize_config:182] [PID:812] [RANK:0] GPU memory usage baseline: 0.000GB (+0.320GB misc)
[2024-05-29 07:53:06,636] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-29 07:53:06,636] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
                             dP            dP   dP
                             88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.30.1
        peft: 0.11.1
transformers: 4.41.1
         trl: 0.8.6
       torch: 2.1.2+cu118
bitsandbytes: 0.43.1
****************************************
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-29 07:53:07,029] [DEBUG] [axolotl.load_tokenizer:280] [PID:813] [RANK:1] EOS: 2 /
[2024-05-29 07:53:07,029] [DEBUG] [axolotl.load_tokenizer:281] [PID:813] [RANK:1] BOS: 1 /
[2024-05-29 07:53:07,029] [DEBUG] [axolotl.load_tokenizer:282] [PID:813] [RANK:1] PAD: 2 /
[2024-05-29 07:53:07,029] [DEBUG] [axolotl.load_tokenizer:283] [PID:813] [RANK:1] UNK: 0 /
[2024-05-29 07:53:07,029] [INFO] [axolotl.load_tokenizer:294] [PID:813] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[2024-05-29 07:53:07,225] [DEBUG] [axolotl.load_tokenizer:280] [PID:815] [RANK:3] EOS: 2 /
[2024-05-29 07:53:07,225] [DEBUG] [axolotl.load_tokenizer:281] [PID:815] [RANK:3] BOS: 1 /
[2024-05-29 07:53:07,225] [DEBUG] [axolotl.load_tokenizer:282] [PID:815] [RANK:3] PAD: 2 /
[2024-05-29 07:53:07,225] [DEBUG] [axolotl.load_tokenizer:283] [PID:815] [RANK:3] UNK: 0 /
[2024-05-29 07:53:07,225] [INFO] [axolotl.load_tokenizer:294] [PID:815] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-29 07:53:07,345] [DEBUG] [axolotl.load_tokenizer:280] [PID:814] [RANK:2] EOS: 2 /
[2024-05-29 07:53:07,345] [DEBUG] [axolotl.load_tokenizer:281] [PID:814] [RANK:2] BOS: 1 /
[2024-05-29 07:53:07,345] [DEBUG] [axolotl.load_tokenizer:282] [PID:814] [RANK:2] PAD: 2 /
[2024-05-29 07:53:07,345] [DEBUG] [axolotl.load_tokenizer:283] [PID:814] [RANK:2] UNK: 0 /
[2024-05-29 07:53:07,345] [INFO] [axolotl.load_tokenizer:294] [PID:814] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference.
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
[2024-05-29 07:53:07,683] [DEBUG] [axolotl.load_tokenizer:280] [PID:812] [RANK:0] EOS: 2 /
[2024-05-29 07:53:07,684] [DEBUG] [axolotl.load_tokenizer:281] [PID:812] [RANK:0] BOS: 1 /
[2024-05-29 07:53:07,684] [DEBUG] [axolotl.load_tokenizer:282] [PID:812] [RANK:0] PAD: 2 /
[2024-05-29 07:53:07,684] [DEBUG] [axolotl.load_tokenizer:283] [PID:812] [RANK:0] UNK: 0 /
[2024-05-29 07:53:07,684] [INFO] [axolotl.load_tokenizer:294] [PID:812] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-29 07:53:07,684] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:812] [RANK:0] Unable to find prepared dataset in last_run_prepared/8cc35674c453a287d7de953d7084a596
[2024-05-29 07:53:07,684] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:812] [RANK:0] Loading raw datasets...
[2024-05-29 07:53:07,684] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:812] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-05-29 07:53:07,684] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:812] [RANK:0] No seed provided, using default seed of 42
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:08,910] [WARNING] [huggingface_hub.repocard.content:107] [PID:812] Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:11,504] [WARNING] [huggingface_hub.repocard.content:107] [PID:812] Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:12,819] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:812] [RANK:0] merging datasets
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:815] [RANK:3] Unable to find prepared dataset in last_run_prepared/8cc35674c453a287d7de953d7084a596
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:813] [RANK:1] Unable to find prepared dataset in last_run_prepared/8cc35674c453a287d7de953d7084a596
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:812] [RANK:0] Saving merged prepared dataset to disk... last_run_prepared/8cc35674c453a287d7de953d7084a596
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:815] [RANK:3] Loading raw datasets...
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:813] [RANK:1] Loading raw datasets...
[2024-05-29 07:53:13,551] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:815] [RANK:3] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:814] [RANK:2] Unable to find prepared dataset in last_run_prepared/8cc35674c453a287d7de953d7084a596
[2024-05-29 07:53:13,551] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:813] [RANK:1] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:815] [RANK:3] No seed provided, using default seed of 42
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:814] [RANK:2] Loading raw datasets...
[2024-05-29 07:53:13,551] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:813] [RANK:1] No seed provided, using default seed of 42
[2024-05-29 07:53:13,552] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:814] [RANK:2] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-05-29 07:53:13,552] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:814] [RANK:2] No seed provided, using default seed of 42
Saving the dataset (1/1 shards): 100%|███████████████| 54568/54568 [00:00<00:00, 191917.43 examples/s]
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:14,297] [WARNING] [huggingface_hub.repocard.content:107] [PID:814] Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:14,302] [WARNING] [huggingface_hub.repocard.content:107] [PID:813] Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:14,610] [WARNING] [huggingface_hub.repocard.content:107] [PID:815] Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:16,621] [WARNING] [huggingface_hub.repocard.content:107] [PID:813] Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:16,940] [WARNING] [huggingface_hub.repocard.content:107] [PID:814] Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:17,264] [WARNING] [huggingface_hub.repocard.content:107] [PID:815] Repo card metadata block was not found. Setting CardData to empty.
[2024-05-29 07:53:18,048] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:813] [RANK:1] merging datasets
[2024-05-29 07:53:18,332] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:815] [RANK:3] merging datasets
[2024-05-29 07:53:18,418] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:814] [RANK:2] merging datasets
[2024-05-29 07:53:18,426] [DEBUG] [axolotl.calculate_total_num_steps:299] [PID:812] [RANK:0] total_num_tokens: 182_913
[2024-05-29 07:53:18,435] [DEBUG] [axolotl.calculate_total_num_steps:312] [PID:812] [RANK:0] `total_supervised_tokens: 38_104`
[2024-05-29 07:53:21,655] [DEBUG] [axolotl.calculate_total_num_steps:364] [PID:812] [RANK:0] data_loader_len: 22
[2024-05-29 07:53:21,981] [INFO] [axolotl.calc_sample_packing_eff_est:370] [PID:812] [RANK:0] sample_packing_eff_est across ranks: [0.9923665523529053, 0.9923665523529053, 0.9923665523529053, 0.9923665523529053]
[2024-05-29 07:53:21,986] [DEBUG] [axolotl.calculate_total_num_steps:382] [PID:812] [RANK:0] sample_packing_eff_est: None
[2024-05-29 07:53:21,986] [DEBUG] [axolotl.calculate_total_num_steps:390] [PID:812] [RANK:0] total_num_steps: 88
[2024-05-29 07:53:22,044] [DEBUG] [axolotl.calculate_total_num_steps:299] [PID:812] [RANK:0] total_num_tokens: 10_466_111
[2024-05-29 07:53:22,508] [DEBUG] [axolotl.calculate_total_num_steps:312] [PID:812] [RANK:0] `total_supervised_tokens: 6_735_490`
[2024-05-29 07:53:26,106] [DEBUG] [axolotl.calculate_total_num_steps:364] [PID:812] [RANK:0] data_loader_len: 1281
[2024-05-29 07:53:26,110] [INFO] [axolotl.calc_sample_packing_eff_est:370] [PID:812] [RANK:0] sample_packing_eff_est across ranks: [0.9973469376564026, 0.9973469376564026, 0.9973469376564026, 0.9973469376564026]
[2024-05-29 07:53:26,118] [DEBUG] [axolotl.calculate_total_num_steps:382] [PID:812] [RANK:0] sample_packing_eff_est: 1.0
[2024-05-29 07:53:26,118] [DEBUG] [axolotl.calculate_total_num_steps:390] [PID:812] [RANK:0] total_num_steps: 5124
[2024-05-29 07:53:26,123] [DEBUG] [axolotl.train.train:56] [PID:812] [RANK:0] loading tokenizer... openlm-research/open_llama_3b_v2
[2024-05-29 07:53:26,564] [DEBUG] [axolotl.load_tokenizer:280] [PID:815] [RANK:3] EOS: 2 /
[2024-05-29 07:53:26,564] [DEBUG] [axolotl.load_tokenizer:281] [PID:815] [RANK:3] BOS: 1 /
[2024-05-29 07:53:26,564] [DEBUG] [axolotl.load_tokenizer:282] [PID:815] [RANK:3] PAD: 2 /
[2024-05-29 07:53:26,564] [DEBUG] [axolotl.load_tokenizer:283] [PID:815] [RANK:3] UNK: 0 /
[2024-05-29 07:53:26,564] [INFO] [axolotl.load_tokenizer:294] [PID:815] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-29 07:53:26,571] [DEBUG] [axolotl.load_tokenizer:280] [PID:813] [RANK:1] EOS: 2 /
[2024-05-29 07:53:26,571] [DEBUG] [axolotl.load_tokenizer:281] [PID:813] [RANK:1] BOS: 1 /
[2024-05-29 07:53:26,571] [DEBUG] [axolotl.load_tokenizer:282] [PID:813] [RANK:1] PAD: 2 /
[2024-05-29 07:53:26,571] [DEBUG] [axolotl.load_tokenizer:283] [PID:813] [RANK:1] UNK: 0 /
[2024-05-29 07:53:26,571] [INFO] [axolotl.load_tokenizer:294] [PID:813] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-29 07:53:26,575] [DEBUG] [axolotl.load_tokenizer:280] [PID:814] [RANK:2] EOS: 2 /
[2024-05-29 07:53:26,576] [DEBUG] [axolotl.load_tokenizer:281] [PID:814] [RANK:2] BOS: 1 /
[2024-05-29 07:53:26,576] [DEBUG] [axolotl.load_tokenizer:282] [PID:814] [RANK:2] PAD: 2 /
[2024-05-29 07:53:26,576] [DEBUG] [axolotl.load_tokenizer:283] [PID:814] [RANK:2] UNK: 0 /
[2024-05-29 07:53:26,576] [INFO] [axolotl.load_tokenizer:294] [PID:814] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference.
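The two `total_num_steps` values logged above are consistent with `data_loader_len` multiplied by an epoch count of 4 (22 → 88 and 1281 → 5124). The epoch count is an inference from the arithmetic, not something shown in this log; a quick sanity check:

```shell
# Hypothetical sanity check: total_num_steps == data_loader_len * num_epochs.
# num_epochs=4 is inferred from the logged step counts, not read from the config.
num_epochs=4
echo $(( 22   * num_epochs ))   # should match total_num_steps: 88 for the smaller split
echo $(( 1281 * num_epochs ))   # should match total_num_steps: 5124 for the larger split
```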
[2024-05-29 07:53:26,608] [DEBUG] [axolotl.load_tokenizer:280] [PID:812] [RANK:0] EOS: 2 /
[2024-05-29 07:53:26,608] [DEBUG] [axolotl.load_tokenizer:281] [PID:812] [RANK:0] BOS: 1 /
[2024-05-29 07:53:26,608] [DEBUG] [axolotl.load_tokenizer:282] [PID:812] [RANK:0] PAD: 2 /
[2024-05-29 07:53:26,608] [DEBUG] [axolotl.load_tokenizer:283] [PID:812] [RANK:0] UNK: 0 /
[2024-05-29 07:53:26,608] [INFO] [axolotl.load_tokenizer:294] [PID:812] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-29 07:53:26,608] [DEBUG] [axolotl.train.train:85] [PID:812] [RANK:0] loading model and peft_config...
[2024-05-29 07:53:26,751] [INFO] [axolotl.load_model:360] [PID:815] [RANK:3] patching with flash attention for sample packing
[2024-05-29 07:53:26,751] [INFO] [axolotl.load_model:419] [PID:815] [RANK:3] patching _expand_mask
[2024-05-29 07:53:26,758] [INFO] [axolotl.load_model:360] [PID:814] [RANK:2] patching with flash attention for sample packing
[2024-05-29 07:53:26,759] [INFO] [axolotl.load_model:419] [PID:814] [RANK:2] patching _expand_mask
[2024-05-29 07:53:26,760] [INFO] [axolotl.load_model:360] [PID:813] [RANK:1] patching with flash attention for sample packing
[2024-05-29 07:53:26,760] [INFO] [axolotl.load_model:419] [PID:813] [RANK:1] patching _expand_mask
[2024-05-29 07:53:26,876] [INFO] [axolotl.load_model:360] [PID:812] [RANK:0] patching with flash attention for sample packing
[2024-05-29 07:53:26,877] [INFO] [axolotl.load_model:419] [PID:812] [RANK:0] patching _expand_mask
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
[2024-05-29 07:53:28,744] [INFO] [axolotl.load_model:734] [PID:815] [RANK:3] GPU memory usage after model load: 3.410GB (+0.144GB cache, +0.744GB misc)
[2024-05-29 07:53:28,748] [INFO] [axolotl.load_model:785] [PID:815] [RANK:3] converting PEFT model w/ prepare_model_for_kbit_training
[2024-05-29 07:53:28,750] [INFO] [axolotl.load_model:794] [PID:815] [RANK:3] converting modules to torch.float16 for flash attention
[2024-05-29 07:53:28,824] [INFO] [axolotl.load_model:734] [PID:813] [RANK:1] GPU memory usage after model load: 3.410GB (+0.144GB cache, +0.747GB misc)
[2024-05-29 07:53:28,828] [INFO] [axolotl.load_model:785] [PID:813] [RANK:1] converting PEFT model w/ prepare_model_for_kbit_training
[2024-05-29 07:53:28,830] [INFO] [axolotl.load_model:794] [PID:813] [RANK:1] converting modules to torch.float16 for flash attention
[2024-05-29 07:53:28,936] [INFO] [axolotl.load_model:734] [PID:812] [RANK:0] GPU memory usage after model load: 3.410GB (+0.144GB cache, +0.744GB misc)
[2024-05-29 07:53:28,940] [INFO] [axolotl.load_model:785] [PID:812] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2024-05-29 07:53:28,941] [INFO] [axolotl.load_model:734] [PID:814] [RANK:2] GPU memory usage after model load: 3.410GB (+0.144GB cache, +0.744GB misc)
[2024-05-29 07:53:28,942] [INFO] [axolotl.load_model:794] [PID:812] [RANK:0] converting modules to torch.float16 for flash attention
[2024-05-29 07:53:28,945] [INFO] [axolotl.load_model:785] [PID:814] [RANK:2] converting PEFT model w/ prepare_model_for_kbit_training
[2024-05-29 07:53:28,947] [INFO] [axolotl.load_model:794] [PID:814] [RANK:2] converting modules to torch.float16 for flash attention
[2024-05-29 07:53:28,976] [INFO] [axolotl.load_model:843] [PID:815] [RANK:3] GPU memory usage after adapters: 3.458GB (+0.911GB cache, +0.744GB misc)
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-29 07:53:29,055] [INFO] [axolotl.load_model:843] [PID:813] [RANK:1] GPU memory usage after adapters: 3.458GB (+0.911GB cache, +0.747GB misc)
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
trainable params: 12,712,960 || all params: 3,439,186,560 || trainable%: 0.3697
[2024-05-29 07:53:29,172] [INFO] [axolotl.load_model:843] [PID:814] [RANK:2] GPU memory usage after adapters: 3.458GB (+0.911GB cache, +0.744GB misc)
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-29 07:53:29,174] [INFO] [axolotl.load_model:843] [PID:812] [RANK:0] GPU memory usage after adapters: 3.458GB (+0.911GB cache, +0.744GB misc)
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-29 07:53:29,217] [INFO] [axolotl.train.train:119] [PID:812] [RANK:0] Pre-saving adapter config to ./outputs/lora-out
[2024-05-29 07:53:29,220] [INFO] [axolotl.train.train:156] [PID:812] [RANK:0] Starting trainer...
[2024-05-29 07:53:33,247] [WARNING] [engine.py:1188:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
  0%|          | 0/5124 [00:00<?, ?it/s]
Traceback (most recent call last):
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
axolotl.cli.train FAILED
----------------------------------------------------
Failures:
[1]:
  time       : 2024-05-29_08:24:13
  host       : dc53c9f6e164
  rank       : 2 (local_rank: 2)
  exitcode   : -6 (pid: 814)
  error_file :
  traceback  : Signal 6 (SIGABRT) received by PID 814
[2]:
  time       : 2024-05-29_08:24:13
  host       : dc53c9f6e164
  rank       : 3 (local_rank: 3)
  exitcode   : -6 (pid: 815)
  error_file :
  traceback  : Signal 6 (SIGABRT) received by PID 815
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-05-29_08:24:13
  host       : dc53c9f6e164
  rank       : 1 (local_rank: 1)
  exitcode   : -6 (pid: 813)
  error_file :
  traceback  : Signal 6 (SIGABRT) received by PID 813
====================================================
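The summary reports SIGABRT (exit code -6) on ranks 1-3 at the very first training step, but the underlying abort message is not captured in this log. A hedged next step rather than a fix: re-run with standard NCCL/PyTorch diagnostic environment variables so the worker that aborts prints more context (whether these surface the root cause here is an assumption; the variables themselves are standard knobs):

```shell
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL CUDA_LAUNCH_BLOCKING=1 \
  accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml \
  --deepspeed deepspeed_configs/zero1.json
```

Re-running with `--num_processes=1`, as the earlier accelerate hint suggests, would also show whether the abort is specific to multi-GPU/NCCL setup.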