example config llama-2/lora.yml fails when load_in_8bit is set to False #456

Closed
radekosmulski opened this issue Aug 22, 2023 · 8 comments · Fixed by #609
Labels
bug Something isn't working

Comments

@radekosmulski

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I am inside the winglian/axolotl:main-py3.10-cu118-2.0.1 docker container. GPUs are visible with torch.cuda.device_count().

I start with the examples/llama-2/lora.yml config file. I am able to run it.

I want to do full fine-tuning, so I change load_in_8bit to false. I expect to still be able to train the model.

Current behaviour

Currently, the training fails with the following error:

                           dP            dP   dP
                           88            88   88
.d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
`88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP

[2023-08-22 01:11:11,159] [WARNING] [axolotl.validate_config:120] [PID:142] We recommend setting `load_in_8bit: true` for LORA finetuning
[2023-08-22 01:11:11,160] [INFO] [axolotl.normalize_config:65] [PID:142] GPU memory usage baseline: 0.000GB (+1.281GB misc)
[2023-08-22 01:11:11,160] [INFO] [axolotl.scripts.train:189] [PID:142] loading tokenizer... meta-llama/Llama-2-7b-hf
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:63] [PID:142] EOS: 2 / </s>
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:64] [PID:142] BOS: 1 / <s>
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:65] [PID:142] PAD: 0 / [PAD]
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:66] [PID:142] UNK: 0 / <unk>
[2023-08-22 01:11:11,776] [INFO] [axolotl.load_tokenized_prepared_datasets:122] [PID:142] Loading prepared dataset from disk at last_run_prepared/ad149256d2226c66eef84cba1806c06f...
[2023-08-22 01:11:11,782] [INFO] [axolotl.load_tokenized_prepared_datasets:124] [PID:142] Prepared dataset loaded from disk...
Filter (num_proc=96): 100%|██████████| 1980/1980 [00:01<00:00, 1533.47 examples/s]
Filter (num_proc=20): 100%|██████████| 20/20 [00:00<00:00, 42.22 examples/s]
Map (num_proc=96): 100%|██████████| 1980/1980 [00:00<00:00, 3432.67 examples/s]
Map (num_proc=20): 100%|██████████| 20/20 [00:00<00:00, 97.78 examples/s]
[2023-08-22 01:11:18,931] [INFO] [axolotl.calculate_total_num_steps:304] [PID:142] calculating total_num_tokens
[2023-08-22 01:11:18,936] [INFO] [axolotl.calculate_total_num_steps:311] [PID:142] 📝 UPDATE CONFIG WITH: `total_num_tokens: 445919`
[2023-08-22 01:11:18,945] [INFO] [axolotl.utils.dataloader.generate_batches:181] [PID:142] generating packed batches
[2023-08-22 01:11:18,948] [INFO] [axolotl.utils.dataloader.generate_batches:187] [PID:142] 39895f637f2764542fc4ec0a7600a1dda209d03c18db8369ff8c61a03881d503
[2023-08-22 01:11:23,371] [INFO] [axolotl.utils.dataloader.len_w_stats:281] [PID:142] packing_efficiency_estimate: 1.0 actual packing efficiency: 0.9720262799944196
[2023-08-22 01:11:23,371] [INFO] [axolotl.utils.dataloader._len_est:250] [PID:142] packing_efficiency_estimate: 1.0 total_num_tokens per device: 445919
[2023-08-22 01:11:23,371] [INFO] [axolotl.calculate_total_num_steps:351] [PID:142] data_loader_len: 52
[2023-08-22 01:11:23,371] [INFO] [axolotl.calculate_total_num_steps:360] [PID:142] 📝 UPDATE CONFIG WITH: `sample_packing_eff_est: 0.98`
[2023-08-22 01:11:23,371] [INFO] [axolotl.calculate_total_num_steps:368] [PID:142] total_num_steps: 39
[2023-08-22 01:11:23,371] [INFO] [axolotl.scripts.train:211] [PID:142] loading model and (optionally) peft_config...
[2023-08-22 01:11:23,382] [INFO] [axolotl.load_model:105] [PID:142] patching with flash attention
[2023-08-22 01:11:23,384] [INFO] [axolotl.load_model:146] [PID:142] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.78s/it]
[2023-08-22 01:12:23,049] [WARNING] [axolotl.load_model:342] [PID:142] increasing model.config.max_position_embeddings to 4096
[2023-08-22 01:12:23,049] [INFO] [axolotl.load_lora:488] [PID:142] found linear modules: ['q_proj', 'gate_proj', 'o_proj', 'up_proj', 'down_proj', 'k_proj', 'v_proj']
trainable params: 79,953,920 || all params: 6,818,369,536 || trainable%: 1.172625208678628
Traceback (most recent call last):
  File "/workspace/axolotl/scripts/finetune.py", line 315, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/scripts/finetune.py", line 212, in train
    model, peft_config = load_model(cfg, tokenizer)
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 409, in load_model
    log_gpu_memory_usage(LOG, "after adapters", model.device)
  File "/workspace/axolotl/src/axolotl/utils/bench.py", line 34, in log_gpu_memory_usage
    usage, cache, misc = gpu_memory_usage_all(device)
  File "/workspace/axolotl/src/axolotl/utils/bench.py", line 12, in gpu_memory_usage_all
    usage = torch.cuda.memory_allocated(device) / 1024.0**3
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 351, in memory_allocated
    return memory_stats(device=device).get("allocated_bytes.all.current", 0)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 230, in memory_stats
    stats = memory_stats_as_nested_dict(device=device)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 241, in memory_stats_as_nested_dict
    device = _get_device_index(device, optional=True)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/_utils.py", line 32, in _get_device_index
    raise ValueError('Expected a cuda device, but got: {}'.format(device))
ValueError: Expected a cuda device, but got: cpu
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python', 'axolotl/scripts/finetune.py', 'axolotl/examples/llama-2/lora.yml']' returned non-zero exit status 1.
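
For context, the failure comes from the GPU memory logging that runs right after the LoRA adapter is attached: with load_in_8bit: false the model apparently has not been moved to the GPU at that point, so model.device is cpu, and torch.cuda.memory_allocated() raises ValueError for non-CUDA devices. Below is a minimal sketch of the kind of guard that would avoid the crash (a hypothetical helper for illustration only; the underlying issue may rather be that the model should already be on the GPU here, and this is not necessarily what the eventual fix does):

```python
import torch

def log_gpu_memory_usage_safe(log, msg, device):
    # Hypothetical helper, not axolotl's actual API: torch.cuda.memory_allocated()
    # raises ValueError for anything that is not a CUDA device, so skip the
    # report instead of crashing when the model is still on the CPU.
    if not (isinstance(device, torch.device) and device.type == "cuda"):
        log.info("%s: model is on %s, skipping GPU memory report", msg, device)
        return
    allocated_gb = torch.cuda.memory_allocated(device) / 1024.0**3
    reserved_gb = torch.cuda.memory_reserved(device) / 1024.0**3
    log.info("%s: %.3fGB allocated, %.3fGB reserved", msg, allocated_gb, reserved_gb)
```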

Steps to reproduce

  1. Start docker container: winglian/axolotl:main-py3.10-cu118-2.0.1
  2. Modify the examples/llama-2/lora.yml config file (set load_in_8bit to false, as shown below).
  3. Run fine-tuning using modified config.
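
For reference, the relevant portion of the modified config looks roughly like this (keys other than load_in_8bit kept at their example values; the example file may have changed since):

```yaml
base_model: meta-llama/Llama-2-7b-hf
load_in_8bit: false   # changed from true
load_in_4bit: false
adapter: lora
```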

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

the one in the official docker container

axolotl branch-commit

main/50682a3c068f723de154950b03c3f86bf673e688

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@radekosmulski radekosmulski added the bug Something isn't working label Aug 22, 2023
@radekosmulski radekosmulski changed the title example config llama-2/lora.yml fails when load_in_8bit is set to `False to example config llama-2/lora.yml fails when load_in_8bit is set to False Aug 22, 2023
@winglian
Collaborator

If you want to do a full fine-tune, you should leave adapter: empty (remove lora).
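
Roughly, in the config:

```yaml
# full fine-tune: leave the adapter unset (was: adapter: lora)
adapter:
```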

@radekosmulski
Author

Thank you for your answer, @winglian! And yes, sorry, I didn't express myself well.

I wanted to train with LoRA attached to a full, non-quantized model to compare against some runs I did with HF directly.

I just followed your suggestion (removed the LoRA adapter), and it seems I am able to do full fine-tuning on a single 80GB A100?! I am using the adamw_torch optimizer. I didn't realize that was possible; I thought that for a 7B model you always needed sharding or some sort of parallelism, even with a microbatch size of 1.

And on top of that, we are training here on packed examples of length up to 4096, which is way beyond what I expected 🙂 I definitely need to study the code of this library; amazing.

Thank you very much for your answer 🙏

Hmm, BTW, full fine-tuning seems to have stopped (0% volatile GPU utilization), but the training didn't crash; it just seems to have frozen. Oh well, maybe the examples don't always get packed to the full 4096 and it just hit a particularly tricky one 🤔

Anyhow, I also rechecked the original issue I raised this bug report for: it seems I didn't mess anything up, and the problem still occurs when training with LoRA but without quantization. We can close this if you feel it is something that doesn't need to be supported; please let me know.

Extremely grateful for your help!

@mhenrichsen
Collaborator

@radekosmulski you're probably on the very edge of what you can do with 80GB. A full finetune usually requires around 12x the model size in VRAM.
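(For a 7B model that rule of thumb works out to roughly 12 × 7 ≈ 84 GB, so 80 GB is right at the margin.)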

@radekosmulski
Author

@mhenrichsen thank you very much for your comment, that is very useful to know! 🙂🙏

@mhenrichsen
Collaborator

@radekosmulski is this resolved? Can we close it?

@radekosmulski
Author

@mhenrichsen it is resolved in the sense that I learned something new and very useful 🙂 so I am extremely grateful for this 🙂

But fine-tuning on unquantized weights with LoRA still gives the error above (somehow the model is not getting moved to the GPU), so assuming one should be able to use LoRA without loading the model in 4-bit or 8-bit, this is still broken.

@Napuh
Contributor

Napuh commented Sep 1, 2023

Happens with btlm-3b-8k (Cerebras) too. Is it not possible to attach a LoRA to an fp16 model?

@Napuh
Contributor

Napuh commented Sep 20, 2023

As a workaround, if you only have one GPU, you can run the script directly with python instead of launching it through accelerate, and it should work.
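
Based on the failing command in the traceback above, that would be something along these lines (adjust paths to your setup):

```
python axolotl/scripts/finetune.py axolotl/examples/llama-2/lora.yml
```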
