Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I expect similar loss and grad_norm when training a model with the same settings, regardless of whether flash attention is enabled or not.
Current behaviour
Currently, during training steps (right from the start), I see messages like

```
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6.545084971874738e-06, 'epoch': 0.4}
```

for a few steps, before the following error appears and training stops:

```
  File "/home/huada524/ondemand/data/sys/myjobs/projects/default/1/huada524-prune-env/lib/python3.10/site-packages/flash_attn/bert_padding.py", line 110, in unpad_input
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
RuntimeError: CUDA error: an illegal memory access was encountered
```

However, if flash attention is disabled with `flash_attention: false`, the network trains normally:

```
{'loss': 3.0972, 'grad_norm': 0.76171875, 'learning_rate': 3.4549150281252635e-06, 'epoch': 0.6}
```
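For context, the failing line in `bert_padding.py` only runs `torch.nonzero` over the flattened attention mask. A standalone sketch of that step (shapes and dtype below are my assumptions, not the actual training batch) succeeds on a healthy mask, which suggests the mask tensor or an earlier kernel is already corrupt by the time `unpad_input` runs:

```python
# Minimal sketch of the step that fails inside flash_attn.bert_padding.unpad_input.
# Batch size, sequence length, and dtype are made up for illustration.
import torch

attention_mask = torch.ones(2, 8, dtype=torch.int32, device="cuda")
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
print(indices)  # indices of all non-padding positions; works fine on a healthy mask
```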
Steps to reproduce
I installed axolotl on a remote cluster with 3x L40 GPUs managed by Slurm, using the following script:
```bash
module load python
module load cuda
echo "Setting up python venv..."
python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install -U wheel
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124 -I
pip install ninja
export TORCH_CUDA_ARCH_LIST="8.6;8.9"
export CUDA_VISIBLE_DEVICES=2
export LD_LIBRARY_PATH=/home/huada524/ondemand/data/sys/myjobs/projects/default/1/venv/lib64/python3.10/site-packages/nvidia/nvjitlink/lib:$LD_LIBRARY_PATH
pip install -v -U "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers"
cd axolotl
git pull
pip install packaging
pip install -e '.[flash-attn,deepspeed]'
# I manually disabled the xformers install in axolotl/requirements.txt so it won't override the one I just compiled.
# I also had to apply this patch https://github.com/microsoft/DeepSpeed/issues/5603 so axolotl would launch.
cd ..
```
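As a sanity check after the install (a hypothetical snippet I'm adding for illustration, not part of the script above), the compiled packages can be verified like this:

```python
# Hypothetical post-install check: confirm the nightly torch, the compiled
# xformers, and flash-attn all import cleanly and that CUDA is visible.
import torch
import flash_attn
import xformers

print(torch.__version__, torch.version.cuda)  # e.g. 2.4.0.dev20240610+cu124, 12.4
print(flash_attn.__version__)
print(xformers.__version__)
print(torch.cuda.is_available(), torch.cuda.device_count())
```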
Note that the model I am training the LoRA on is Meta's Llama-3-70B with some of its layers removed.
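For reference, here is a minimal sketch of how such a pruned checkpoint could be produced (the issue doesn't say how the layers were actually removed; the indices below are hypothetical):

```python
# Hypothetical layer-pruning sketch: drop a few decoder layers from Llama-3-70B.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B")
layers_to_drop = {40, 41, 42, 43}  # made-up indices for illustration
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in layers_to_drop
)
model.config.num_hidden_layers = len(model.model.layers)
model.save_pretrained("llama-3-70b-pruned")
```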
The GPUs report a CUDA driver version of 12.5, while the loaded CUDA module is 12.3.
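This mismatch can be confirmed with a quick check (a sketch; it assumes `nvidia-smi` is on PATH):

```python
# Compare the CUDA version torch was built against with the driver-side version.
import subprocess
import torch

print(torch.version.cuda)       # 12.4 for the cu124 nightly wheel
subprocess.run(["nvidia-smi"])  # header reports the driver's CUDA version (12.5 here)
```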
```
**** Axolotl Dependency Versions *****
accelerate: 0.30.1
peft: 0.11.1
transformers: 4.41.1
trl: 0.8.7.dev0
torch: 2.4.0.dev20240610+cu124
bitsandbytes: 0.43.1
```
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Linux
Python Version
3.10
axolotl branch-commit
5783839
Acknowledgements