example config llama-2/lora.yml fails when load_in_8bit is set to False #456

Closed
radekosmulski opened this issue Aug 22, 2023 · 8 comments · Fixed by #609
Labels
bug Something isn't working

Comments

@radekosmulski

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I am inside the winglian/axolotl:main-py3.10-cu118-2.0.1 docker container. GPUs are visible with torch.cuda.device_count().

I start with the examples/llama-2/lora.yml config file. I am able to run it.

I want to do full fine-tuning, so I change load_in_8bit to false. I expect to still be able to train the model.

Current behaviour

Currently, the training fails with the following error:

                           dP            dP   dP
                           88            88   88
.d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
`88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP

[2023-08-22 01:11:11,159] [WARNING] [axolotl.validate_config:120] [PID:142] We recommend setting `load_in_8bit: true` for LORA finetuning
[2023-08-22 01:11:11,160] [INFO] [axolotl.normalize_config:65] [PID:142] GPU memory usage baseline: 0.000GB (+1.281GB misc)
[2023-08-22 01:11:11,160] [INFO] [axolotl.scripts.train:189] [PID:142] loading tokenizer... meta-llama/Llama-2-7b-hf
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:63] [PID:142] EOS: 2 / </s>
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:64] [PID:142] BOS: 1 / <s>
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:65] [PID:142] PAD: 0 / [PAD]
[2023-08-22 01:11:11,773] [DEBUG] [axolotl.load_tokenizer:66] [PID:142] UNK: 0 / <unk>
[2023-08-22 01:11:11,776] [INFO] [axolotl.load_tokenized_prepared_datasets:122] [PID:142] Loading prepared dataset from disk at last_run_prepared/ad149256d2226c66eef84cba1806c06f...
[2023-08-22 01:11:11,782] [INFO] [axolotl.load_tokenized_prepared_datasets:124] [PID:142] Prepared dataset loaded from disk...
Filter (num_proc=96): 100%|██████████| 1980/1980 [00:01<00:00, 1533.47 examples/s]
Filter (num_proc=20): 100%|██████████| 20/20 [00:00<00:00, 42.22 examples/s]
Map (num_proc=96): 100%|██████████| 1980/1980 [00:00<00:00, 3432.67 examples/s]
Map (num_proc=20): 100%|██████████| 20/20 [00:00<00:00, 97.78 examples/s]
[2023-08-22 01:11:18,931] [INFO] [axolotl.calculate_total_num_steps:304] [PID:142] calculating total_num_tokens
[2023-08-22 01:11:18,936] [INFO] [axolotl.calculate_total_num_steps:311] [PID:142] 📝 UPDATE CONFIG WITH: `total_num_tokens: 445919`
[2023-08-22 01:11:18,945] [INFO] [axolotl.utils.dataloader.generate_batches:181] [PID:142] generating packed batches
[2023-08-22 01:11:18,948] [INFO] [axolotl.utils.dataloader.generate_batches:187] [PID:142] 39895f637f2764542fc4ec0a7600a1dda209d03c18db8369ff8c61a03881d503
[2023-08-22 01:11:23,371] [INFO] [axolotl.utils.dataloader.len_w_stats:281] [PID:142] packing_efficiency_estimate: 1.0 actual packing efficiency: 0.9720262799944196
[2023-08-22 01:11:23,371] [INFO] [axolotl.utils.dataloader._len_est:250] [PID:142] packing_efficiency_estimate: 1.0 total_num_tokens per device: 445919
[2023-08-22 01:11:23,371] [INFO] [axolotl.calculate_total_num_steps:351] [PID:142] data_loader_len: 52
[2023-08-22 01:11:23,371] [INFO] [axolotl.calculate_total_num_steps:360] [PID:142] 📝 UPDATE CONFIG WITH: `sample_packing_eff_est: 0.98`
[2023-08-22 01:11:23,371] [INFO] [axolotl.calculate_total_num_steps:368] [PID:142] total_num_steps: 39
[2023-08-22 01:11:23,371] [INFO] [axolotl.scripts.train:211] [PID:142] loading model and (optionally) peft_config...
[2023-08-22 01:11:23,382] [INFO] [axolotl.load_model:105] [PID:142] patching with flash attention
[2023-08-22 01:11:23,384] [INFO] [axolotl.load_model:146] [PID:142] patching _expand_mask
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.78s/it]
[2023-08-22 01:12:23,049] [WARNING] [axolotl.load_model:342] [PID:142] increasing model.config.max_position_embeddings to 4096
[2023-08-22 01:12:23,049] [INFO] [axolotl.load_lora:488] [PID:142] found linear modules: ['q_proj', 'gate_proj', 'o_proj', 'up_proj', 'down_proj', 'k_proj', 'v_proj']
trainable params: 79,953,920 || all params: 6,818,369,536 || trainable%: 1.172625208678628
Traceback (most recent call last):
  File "/workspace/axolotl/scripts/finetune.py", line 315, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/scripts/finetune.py", line 212, in train
    model, peft_config = load_model(cfg, tokenizer)
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 409, in load_model
    log_gpu_memory_usage(LOG, "after adapters", model.device)
  File "/workspace/axolotl/src/axolotl/utils/bench.py", line 34, in log_gpu_memory_usage
    usage, cache, misc = gpu_memory_usage_all(device)
  File "/workspace/axolotl/src/axolotl/utils/bench.py", line 12, in gpu_memory_usage_all
    usage = torch.cuda.memory_allocated(device) / 1024.0**3
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 351, in memory_allocated
    return memory_stats(device=device).get("allocated_bytes.all.current", 0)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 230, in memory_stats
    stats = memory_stats_as_nested_dict(device=device)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 241, in memory_stats_as_nested_dict
    device = _get_device_index(device, optional=True)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/_utils.py", line 32, in _get_device_index
    raise ValueError('Expected a cuda device, but got: {}'.format(device))
ValueError: Expected a cuda device, but got: cpu
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.10/bin/python', 'axolotl/scripts/finetune.py', 'axolotl/examples/llama-2/lora.yml']' returned non-zero exit status 1.
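
For context, the failure comes from the GPU memory logging that runs right after the LoRA adapter is attached: with load_in_8bit: false the model apparently has not been moved to the GPU at that point, so model.device is cpu, and torch.cuda.memory_allocated() raises ValueError for non-CUDA devices. Below is a minimal sketch of the kind of guard that would avoid the crash (a hypothetical helper for illustration only; the underlying issue may rather be that the model should already be on the GPU here, and this is not necessarily what the eventual fix does):

```python
import torch

def log_gpu_memory_usage_safe(log, msg, device):
    # Hypothetical helper, not axolotl's actual API: torch.cuda.memory_allocated()
    # raises ValueError for anything that is not a CUDA device, so skip the
    # report instead of crashing when the model is still on the CPU.
    if not (isinstance(device, torch.device) and device.type == "cuda"):
        log.info("%s: model is on %s, skipping GPU memory report", msg, device)
        return
    allocated_gb = torch.cuda.memory_allocated(device) / 1024.0**3
    reserved_gb = torch.cuda.memory_reserved(device) / 1024.0**3
    log.info("%s: %.3fGB allocated, %.3fGB reserved", msg, allocated_gb, reserved_gb)
```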

Steps to reproduce

  1. Start docker container: winglian/axolotl:main-py3.10-cu118-2.0.1
  2. Modify the examples/llama-2/lora.yml config file (set load_in_8bit to false, as shown below).
  3. Run fine-tuning using modified config.
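
For reference, the relevant portion of the modified config looks roughly like this (keys other than load_in_8bit kept at their example values; the example file may have changed since):

```yaml
base_model: meta-llama/Llama-2-7b-hf
load_in_8bit: false   # changed from true
load_in_4bit: false
adapter: lora
```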

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

the one in the official docker container

axolotl branch-commit

main/50682a3c068f723de154950b03c3f86bf673e688

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@radekosmulski radekosmulski added the bug Something isn't working label Aug 22, 2023
@radekosmulski radekosmulski changed the title example config llama-2/lora.yml fails when load_in_8bit is set to `False to example config llama-2/lora.yml fails when load_in_8bit is set to False Aug 22, 2023
@winglian
Collaborator

If you want to do a full fine-tune, you should leave adapter: empty (remove lora).
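
Roughly, in the config:

```yaml
# full fine-tune: leave the adapter unset (was: adapter: lora)
adapter:
```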

@radekosmulski
Author

Thank you for your answer, @winglian! And yes, sorry, I didn't express myself well.

I wanted to train with LoRA attached to a full, non-quantized model to compare against some runs I did with HF directly.

I just followed your suggestion (removed the LoRA adapter), and it seems I am able to do full fine-tuning on a single 80GB A100?! I am using the adamw_torch optimizer. I didn't realize that was possible; I thought that for a 7B model you always needed sharding or some sort of parallelism, even with a microbatch size of 1.

And on top of that, we are training here on packed examples of length up to 4096, which is way beyond what I expected 🙂 I definitely need to study the code of this library; amazing.

Thank you very much for your answer 🙏

Hmm, BTW, full fine-tuning seems to have stopped (0% volatile GPU utilization), but the training didn't crash; it just seems to have frozen. Oh well, maybe the examples don't always get packed to the full 4096 and it just hit a particularly tricky one 🤔

Anyhow, I also rechecked the original issue I raised this bug report for: it seems I didn't mess anything up, and the problem still occurs when training with LoRA but without quantization. We can close this if you feel it is something that doesn't need to be supported; please let me know.

Extremely grateful for your help!

@mhenrichsen
Collaborator

@radekosmulski you're probably on the very edge of what you can do with 80GB. A full finetune usually requires around 12x the model size in VRAM.
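(For a 7B model that rule of thumb works out to roughly 12 × 7 ≈ 84 GB, so 80 GB is right at the margin.)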

@radekosmulski
Author

@mhenrichsen thank you very much for your comment, that is very useful to know! 🙂🙏

@mhenrichsen
Collaborator

@radekosmulski is this resolved? Can we close it?

@radekosmulski
Author

@mhenrichsen it is resolved in the sense that I learned something new and very useful 🙂 so I am extremely grateful for this 🙂

But fine-tuning on unquantized weights with LoRA still gives the error above (somehow the model is not getting moved to the GPU), so assuming one should be able to use LoRA without loading the model in 4-bit or 8-bit, this is still broken.

@Napuh
Contributor

Napuh commented Sep 1, 2023

Happens with btlm-3b-8k (Cerebras) too. Is it not possible to attach a LoRA to an fp16 model?

@Napuh
Contributor

Napuh commented Sep 20, 2023

As a workaround, if you only have one GPU, you can run the script directly with python instead of launching it through accelerate, and it should work.
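
Based on the failing command in the traceback above, that would be something along these lines (adjust paths to your setup):

```
python axolotl/scripts/finetune.py axolotl/examples/llama-2/lora.yml
```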
