Add shifted sparse attention #973
Conversation
(Force-pushed from 42a0645 to a6be9cb)
@joecummings do you have time to rebase this onto main? If not, I can take a stab at rebasing later this week.

yep I'll do this later today!
(Force-pushed from cb08b2d to 7628056)
(Force-pushed from 0412089 to 4135039)
Looks good to me. @NanoCode012 can you take a quick look please to make sure I didn't miss anything? Thanks
```python
# Modify all llama derived models in one block
if cfg.is_llama_derived_model:
```
@joecummings this should work with mistral and mixtral too, right?
```python
        cross_entropy=cfg.flash_attn_cross_entropy,
        rms_norm=cfg.flash_attn_rms_norm,
    )
if cfg.sample_packing:
    if cfg.device not in ["mps", "cpu"] and not inference:
```
Hm, does this mean FA won't be enabled in inference mode now?
I don't think FA was ever enabled for inference. Here's the original code:
```python
if cfg.is_llama_derived_model and cfg.flash_attention and cfg.sample_packing:
    if cfg.device not in ["mps", "cpu"] and not inference:
        from axolotl.monkeypatch.llama_attn_hijack_flash import (
            replace_llama_attn_with_flash_attn,
        )

        LOG.info("patching with flash attention for sample packing")
        replace_llama_attn_with_flash_attn(
            packed=cfg.sample_packing,
            cross_entropy=cfg.flash_attn_cross_entropy,
            rms_norm=cfg.flash_attn_rms_norm,
        )
```
Thanks for all your work on this @joecummings !
Summary
Add shifted sparse attention (w/ flash attention) to enable longer context training w/ less memory overhead.
Paper: https://arxiv.org/pdf/2309.12307.pdf
Code: https://github.com/dvlab-research/LongLoRA/tree/main
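For background, the core trick of shifted sparse attention (S2-Attn) in the paper is to compute attention within fixed-size token groups, while half of the attention heads operate on a sequence shifted by half a group, so information still flows across group boundaries. A toy illustration of the grouping pattern (not this PR's implementation):

```python
def shift(tokens, group_size):
    """Roll the sequence by half a group so group boundaries overlap."""
    half = group_size // 2
    return tokens[half:] + tokens[:half]

def grouped(tokens, group_size):
    """Partition a sequence into fixed-size attention groups."""
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]

seq = list(range(8))  # token positions 0..7
g = 4                 # group size

# Half the heads attend within plain groups...
plain = grouped(seq, g)              # [[0, 1, 2, 3], [4, 5, 6, 7]]
# ...the other half attend within shifted groups, bridging the boundary.
shifted = grouped(shift(seq, g), g)  # [[2, 3, 4, 5], [6, 7, 0, 1]]
```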
Testing
- Added a test to check for a raised `ValueError` if `sample_packing = True` and `s2_attention = True`:
  `pytest tests/utils/test_models.py::ModelsUtilsTest::test_cfg_throws_error_with_s2_attention_and_sample_packing`
- Ran `accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml` with the following config changes: [INSERT WANDB LOG HERE]
Follow-ups
- Train `embed` and `norm` during LoRA, which improves performance according to the above paper (e.g. LoRA+)
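In axolotl-config terms, that follow-up would presumably amount to marking the embedding and norm modules as fully trainable alongside the LoRA adapters, e.g. via `lora_modules_to_save` (a hedged sketch; the exact module names vary by model and are assumptions here):

```yaml
adapter: lora
lora_r: 8
lora_alpha: 16
# Hypothetical: train these modules at full rank alongside the adapters,
# as the LongLoRA paper recommends for long-context fine-tuning.
lora_modules_to_save:
  - embed_tokens
  - norm
```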