
Monkeypatch flash attention in for llama #520

Merged: 16 commits into mosaicml:main on Aug 15, 2023

Conversation

@dakinggg (Collaborator) commented Aug 11, 2023

This PR adds a monkeypatch of triton flash attention for llama2 models. If we start doing this for more models, we can try to generalize it into a common mechanism, but until then I think this monkeypatch is good enough.
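For readers unfamiliar with the technique: a monkeypatch like this rebinds a method on an existing class at runtime, so callers pick up the new implementation without the upstream library changing. The sketch below is illustrative only and is not this PR's code; `LlamaAttention` here is a hypothetical stand-in for the Hugging Face class, and `triton_flash_forward` / `apply_llama_patch` are made-up names.

```python
# Illustrative sketch of the monkeypatching technique (NOT this PR's code).
# `LlamaAttention` is a toy stand-in for the real attention class in
# transformers; the names below are hypothetical.

class LlamaAttention:
    """Stand-in for the upstream attention module."""
    def forward(self, hidden_states):
        # Pretend this is the stock attention implementation.
        return ("torch_attention", hidden_states)

def triton_flash_forward(self, hidden_states):
    # Pretend this dispatches to a triton flash attention kernel instead.
    return ("triton_flash_attention", hidden_states)

def apply_llama_patch():
    # The patch: rebind the class attribute, so every existing and
    # future instance resolves `forward` to the new function.
    LlamaAttention.forward = triton_flash_forward

attn = LlamaAttention()
assert attn.forward("x")[0] == "torch_attention"
apply_llama_patch()
# The already-constructed instance now uses the patched forward,
# because Python looks up methods on the class at call time.
assert attn.forward("x")[0] == "triton_flash_attention"
```

Patching the class attribute (rather than each instance) is what makes this work for models that were instantiated before the patch was applied.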

TODO:

  • Copy profiling results for 7b
  • Copy profiling results for 70b
  • Paste in evidence that the test passes

Results at 7b scale comparing implementations:
[Three screenshots of profiling results, 2023-08-10]

Test results:

tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-7b-hf-True-torch] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:17 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:23 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [ 12%]
tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-7b-hf-True-triton] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:23 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:24 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [ 25%]
tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-7b-hf-False-torch] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:25 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:26 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [ 37%]
tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-7b-hf-False-triton] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:27 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:27 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [ 50%]
tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-70b-hf-True-torch] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:28 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:29 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [ 62%]
tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-70b-hf-True-triton] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:30 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:32 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [ 75%]
tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-70b-hf-False-torch] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:33 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:34 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [ 87%]
tests/test_llama_patch.py::test_patch_equivalence[meta-llama/Llama-2-70b-hf-False-triton] 
------------------------------------------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------------------------------------------
2023-08-11 22:18:36 [    INFO] Setting seed to 42 (reproducibility.py:159)
2023-08-11 22:18:37 [    INFO] Setting seed to 42 (reproducibility.py:159)
PASSED                                                                                                                                                                                                                                        [100%]

================================================================================================================= warnings summary ==================================================================================================================
../../miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:29
  /mnt/workdisk/danielking/miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:29: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
    from distutils.util import strtobool

../../miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py:15
  /mnt/workdisk/danielking/miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py:15: DeprecationWarning: The distutils.sysconfig module is deprecated, use sysconfig instead
    import distutils.sysconfig

../../miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/jupyter_client/connect.py:20
  /mnt/workdisk/danielking/miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/jupyter_client/connect.py:20: DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
  given by the platformdirs library.  To remove this warning and
  see the appropriate new directories, set the environment variable
  `JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
  The use of platformdirs will be the default in `jupyter_core` v6
    from jupyter_core.paths import jupyter_data_dir, jupyter_runtime_dir, secure_write

../../miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/comet_ml/monkey_patching.py:19
  /mnt/workdisk/danielking/miniconda3/envs/foundry-3.10/lib/python3.10/site-packages/comet_ml/monkey_patching.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib and slated for removal in Python 3.12; see the module's documentation for alternative uses
    import imp

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================== 8 passed, 4 warnings in 22.00s ===========================================================================================================
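The test IDs above parametrize over model config (7b vs. 70b), a boolean flag, and attention implementation (torch vs. triton), asserting that the patched and unpatched paths produce matching outputs. Below is a minimal, framework-free sketch of that equivalence-test pattern; the two attention functions are toy stand-ins, not the PR's actual test code.

```python
import math

# Toy sketch of an equivalence test in the spirit of
# tests/test_llama_patch.py: run a reference implementation and an
# alternative implementation on the same input and assert the outputs
# agree within tolerance. Both functions below are illustrative.

def attention_reference(q, keys, values):
    """Naive two-pass attention: scores -> softmax -> weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    weights = [e / denom for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def attention_online(q, keys, values):
    """One-pass online-softmax attention: the accumulation trick
    flash-attention kernels use to avoid materializing all scores."""
    m, denom = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        corr = math.exp(m - m_new)  # rescale previous accumulators
        w = math.exp(s - m_new)
        denom = denom * corr + w
        acc = [a * corr + w * vd for a, vd in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
ref = attention_reference(q, keys, values)
onl = attention_online(q, keys, values)
assert all(abs(a - b) < 1e-9 for a, b in zip(ref, onl))
```

The real test presumably does the analogous comparison on tensors from actual model forward passes, with a looser tolerance to absorb kernel-level floating-point differences.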

70b running on 16 and 32 GPUs:
[Four screenshots of profiling results, 2023-08-11]

@dakinggg dakinggg marked this pull request as ready for review August 12, 2023 00:49
3 review threads on tests/test_llama_patch.py (outdated, resolved)
@vchiley (Contributor) left a comment

left a few comments

@germanjke

Hi guys! @vchiley @dakinggg, when are you planning to release this? Thanks

dakinggg and others added 5 commits August 14, 2023 10:40
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
@dakinggg dakinggg requested a review from vchiley August 14, 2023 21:06
@vchiley (Contributor) left a comment

lgtm

@dakinggg dakinggg merged commit aff3eaa into mosaicml:main Aug 15, 2023
9 checks passed
@dakinggg dakinggg deleted the llama2-2 branch September 9, 2023 22:50
3 participants