
feat: Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels #280

Merged

Conversation

@achew010 (Contributor) commented on Aug 2, 2024

Description of the change

This PR adds two dataclass arguments to enable padding free and multipack for sft_trainer.py via the new fms-acceleration attention-and-distributed-packing plugin, and extends the existing --fast_kernels dataclass to support optimized full finetuning:

  • --padding_free: technique to process multiple examples in a single batch without adding padding tokens that waste compute.
  • --multipack: technique for multi-GPU training that balances the number of tokens processed on each device, to minimize waiting time.
  • --fast_kernels: previously limited to QPEFT (it would raise an error if not activated together with --fast_lora); now also supports optimized full finetuning and standard LoRA finetuning.

These are extremely effective methods for improving training throughput:

  • See the benchmarks section below. Currently, padding free is either used alone or together with multipack; we do not yet support using multipack alone.
  • Padding free and multipack were used in the InstructLab (ILAB) work; see the section below on the early version of this plugin. For general release, we have greatly simplified the user interface.

NOTE: adhering to the design of fms-acceleration, the new plugin is optional, and separately installed.
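
For orientation, the sketch below shows roughly what the new dataclass arguments look like. It is illustrative only: apart from MultiPack.num_processes, which is quoted in the review discussion below, the field names and defaults are assumptions rather than the plugin's exact definitions.

```python
from dataclasses import dataclass

# Illustrative sketch only; not the exact plugin dataclasses.
@dataclass
class PaddingFree:
    # assumed field: selects how the padding-free collation is injected
    method: str = "huggingface"

@dataclass
class MultiPack:
    # worker processes used by the packing step (quoted in the review below)
    num_processes: int = 16
```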

Notes on Padding Free

Notes on Multipack

  • Works only for multi-GPU training; it balances the number of tokens processed on each device (a toy balancing sketch follows below).
  • Currently includes only the version of multipack optimized for linear attention implementations like flash-attn.
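
To illustrate the idea of balancing tokens across devices, here is a toy greedy sketch; it is not the plugin's actual multipack algorithm, only a minimal illustration of the goal.

```python
import heapq

def balance_tokens(sample_lengths, num_devices):
    """Greedily assign sample indices to devices so token totals stay roughly equal."""
    heap = [(0, d) for d in range(num_devices)]  # (total_tokens, device_id)
    heapq.heapify(heap)
    assignment = {d: [] for d in range(num_devices)}
    # place the longest samples first so the greedy choice balances well
    for idx, length in sorted(enumerate(sample_lengths), key=lambda x: -x[1]):
        total, device = heapq.heappop(heap)
        assignment[device].append(idx)
        heapq.heappush(heap, (total + length, device))
    return assignment

# balance_tokens([512, 128, 900, 256, 64, 700], num_devices=2)
# -> {0: [2, 3, 1], 1: [5, 0, 4]}, i.e. 1284 vs 1276 tokens per device
```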

Notes on FastKernels

  • Currently supports FastCrossEntropyLoss, FastRoPE and FastRMSLayerNorm; SwiGLU and Liger kernels (e.g. FusedCrossEntropyLoss) will be added in the future.
  • Works for full finetuning, LoRA and QPEFT:
    • pass --fast_kernels True True True for full finetuning/LoRA runs
    • pass --fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True for GPTQ-LoRA
    • pass --fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True for QLoRA
  • FastRoPE currently does not accept position ids; this will be addressed in the future.

Benchmarks

PaddingFree and Multipack Benchmarks for Mistral 7B

Notes:

  • Shown below are the runtimes for running a subset of 6000 FLAN samples.
  • Tested per-device batch sizes of 4 and 8, varying the number of GPUs from 2 to 8.
  • Verified that an untokenized dataset produces the same improvements for padding free and multipack.

Per Device Batch Size 4

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 4 | 1537 | baseline |
| padding-free | 2 | 4 | 859 | 1.79x |
| padding-free + multipack | 2 | 4 | 751 | 2.05x |
| full-FT | 4 | 4 | 932 | baseline |
| padding-free | 4 | 4 | 483 | 1.93x |
| padding-free + multipack | 4 | 4 | 342 | 2.75x |
| full-FT | 8 | 4 | 551 | baseline |
| padding-free | 8 | 4 | 275 | 2.00x |
| padding-free + multipack | 8 | 4 | 163 | 3.38x |

Per Device Batch Size 8

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 8 | 1722 | baseline |
| padding-free | 2 | 8 | 678 | 2.54x |
| padding-free + multipack | 2 | 8 | 603 | 2.86x |
| full-FT | 4 | 8 | 1025 | baseline |
| padding-free | 4 | 8 | 380 | 2.70x |
| padding-free + multipack | 4 | 8 | 289 | 3.55x |
| full-FT | 8 | 8 | 611 | baseline |
| padding-free | 8 | 8 | 215 | 2.84x |
| padding-free + multipack | 8 | 8 | 140 | 4.36x |

Verified Similar Improvements for Untokenized Dataset

| Framework Config | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup |
| --- | --- | --- | --- | --- |
| full-FT | 2 | 4 | 1516 | baseline |
| padding-free | 2 | 4 | 848 | 1.78x |
| padding-free + multipack | 2 | 4 | 747 | 2.02x |

Full Finetuning Benchmarks for Mistral 7B

Early Version Of This Plugin

We have an unofficial version with more features than the present release, which @kmehant is currently using for the ILAB work. In addition to padding free and multipack, it also includes two additional plugins.

To use the early version, a quick hack of sft_trainer that works with a pretokenized dataset and a custom tokenizer is available at https://github.com/fabianlim/fms-hf-tuning/tree/attn-plugin. It will be superseded by this PR in the near future.

Use it with these command line arguments:

```
  --padding_free huggingface-injected \
  --loss_across_gpus mean token \
```

How to verify the PR

Additional checks/tests were added to:

  1. Ensures parsing of --padding_free and --multipack is correct in test_dataclass_parse_successfully (see the sketch after this list)
  2. Ensures wrong arguments to --padding_free are caught in test_dataclass_will_fail_to_accept_illegal_args
  3. Ensures the plugin is successfully instantiated from the dataclass in test_framework_initialize_and_trains_with_aadp
  4. Ensures that --padding_free must be used with flash-attn, otherwise an error is raised
  5. Ensures that --multipack must be used with --padding_free, otherwise an error is raised
  6. Ensures that --packing True together with --padding_free raises an error
  7. Ensures that --fast_kernels works with full finetuning
  8. Ensures that --fast_lora without either --auto_gptq or --bitsandbytes raises an error
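
For illustration, here is a minimal, self-contained sketch of the kind of parsing check described in item 1. The real tests live under tests/acceleration/ and use the project's own dataclasses; PaddingFreeArgs here is a hypothetical stand-in.

```python
from dataclasses import dataclass

from transformers import HfArgumentParser

@dataclass
class PaddingFreeArgs:
    # hypothetical stand-in; the real --padding_free argument carries more structure
    padding_free: bool = False

def test_dataclass_parse_successfully_sketch():
    parser = HfArgumentParser(PaddingFreeArgs)
    (args,) = parser.parse_args_into_dataclasses(["--padding_free", "True"])
    assert args.padding_free is True
```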

Ran the full suite of acceleration checks to verify that all fms-acceleration unit tests pass:

```
pytest tests/acceleration/
```


Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@achew010 force-pushed the args-for-padding-free-plugin branch 2 times, most recently from e350ae7 to 193ab9d on August 2, 2024 06:20
@achew010 marked this pull request as ready for review on August 2, 2024 06:25
@achew010 force-pushed the args-for-padding-free-plugin branch 2 times, most recently from 2310f32 to f9d046f on August 6, 2024 05:38
@kmehant (Collaborator) left a comment:

@achew010 let's gracefully handle the case where use_flash_attn is set to False and padding free is being used.

`use_flash_attn: bool = field(`

@achew010 marked this pull request as draft on August 6, 2024 13:55
@fabianlim changed the title from "Add DataClass Arguments to Activate Padding-Free Plugin" to "Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin" on Aug 28, 2024
@achew010 force-pushed the args-for-padding-free-plugin branch 2 times, most recently from 29362a4 to 00d17e7 on August 29, 2024 09:41
@fabianlim force-pushed the args-for-padding-free-plugin branch 6 times, most recently from 3b20f22 to 53d1a8c on August 29, 2024 10:59
@fabianlim marked this pull request as ready for review on August 29, 2024 10:59
@fabianlim changed the title from "Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin" to "Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels" on Sep 12, 2024
(review thread on tuning/sft_trainer.py, outdated and resolved)
@kmehant requested review from kmehant and removed request for kmehant on September 16, 2024 06:48
@anhuong (Collaborator) left a comment:

Thank you for the excellent change and description! I had a few questions... I am also wondering: should the plugin be installed by default so users can utilize these new parameters? Looks like a very useful addition.

Also, please add some of the great description from this PR to the README.

```python
@dataclass
class MultiPack:

    num_processes: int = 16
```
Collaborator:

Is there any guidance on what this number should be set to?

Collaborator:

This number is reasonable for most datasets of reasonable size (e.g., under a million examples). The packing algorithm is relatively fast, but in the event the dataset is too large, our plugin will raise a warning
https://github.com/foundation-model-stack/fms-acceleration/blob/4e81c64453ec5d2b06a8d14a2a72374cc736098a/plugins/attention-and-distributed-packing/src/fms_acceleration_aadp/framework_plugin_multipack.py#L117-L123
that advises the user to increase this number if the process times out.

Comment on lines 199 to +206:

```diff
  framework = AccelerationFrameworkConfig.from_dataclasses(
-     quantized_lora_config, fusedops_kernels_config
+     quantized_lora_config,
+     fusedops_kernels_config,
+     attention_and_distributed_packing_config,
```
Collaborator:

Just for my understanding: these are all model-loader augmentors that change how the model is loaded based on the acceleration framework configurations, but padding free and multipack are dataset augmentors, so how does setting the acceleration framework here affect the dataset loading?

Collaborator:

You are right that padding free and multipack affect the data loading, but more specifically:

  • padding free only requires modifications to the data collator.
  • multipack requires modifications to the dataloader.

Both are handled by our AccelerationPatcher, a component we wrote to allow controlled replacements of the data collator and data loader.
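
As a rough illustration of the collator half, a conceptual sketch is shown below. This is not the AccelerationPatcher code; it assumes a transformers release that ships DataCollatorWithFlattening and an already-constructed Trainer instance named trainer.

```python
# Conceptual sketch only; the plugin performs this replacement through AccelerationPatcher.
from transformers import DataCollatorWithFlattening

def apply_padding_free(trainer):
    # Padding free only needs the collator swapped: examples in a batch are concatenated
    # into one flattened sequence (with position ids marking boundaries) instead of padded.
    trainer.data_collator = DataCollatorWithFlattening()
    return trainer
```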

Collaborator:

thank you for the explanation

"ensure `use_flash_attn = True` to use padding-free flash attention"
)

if train_args.packing is True:
Collaborator:

nit: can simplify to `if train_args.packing`

@fabianlim (Collaborator) commented:

@anhuong thanks for the review. For making this default, I drafted out various possibilities in issue #334. We can discuss offline.

@kmehant self-requested a review on September 18, 2024 10:48
@anhuong (Collaborator) left a comment:

Small additional comments

(two review threads on README.md, outdated and resolved)
Comment on lines +467 to +636
* `fused_ops_and_kernels` works for full-finetuning, LoRA, QLoRA and GPTQ-LORA,
- pass `--fast_kernels True True True` for full finetuning/LoRA
- pass `--fast_kernels True True True --auto_gptq triton_v2 --fused_lora auto_gptq True` for GPTQ-LoRA
- pass `--fast_kernels True True True --bitsandbytes nf4 --fused_lora bitsandbytes True` for QLoRA
Collaborator:

I'm wondering, for fast-kernels, whether there is a better way to understand what is being set to true.
--fast_kernels True True True feels unclear about what is being set to True. Could the user instead pass in the types of kernels to use, like --fast_kernels FastCrossEntropyLoss FastRoPE FastRMSLayerNorm? If they only want one, they would currently have to set --fast_kernels False True False, whereas setting --fast_kernels FastRoPE would be easier.

@fabianlim (Collaborator) commented on Sep 19, 2024:

Yes, that is correct, but unfortunately that would be more complicated than the current implementation.

  • Consider the plugin dataclass (e.g., FusedOpsAndKernelsConfig), see here.
  • The plugin dataclass is a nested dataclass; this is because it has dataclasses as members.
  • Each member dataclass (e.g., FastKernelsConfig) needs to be parsable by HfArgumentParser, which does not actually support parsing a dataclass type.
  • Hence, we made it possible with our parsable_dataclass decorator, which
    • masquerades the member dataclass as a List, since HfArgumentParser does support lists of a uniform type.
    • allows the member dataclass to contain mixed types via the casting logic implemented in parsable_dataclass through EnsureTypes.

All this logic is needed just to parse --fast_kernels False True False into the dataclass FastKernelsConfig(fast_loss=False, fast_rms_layernorm=True, fast_rope_embeddings=False).

To support parsing of the kind --fast_kernels FastRoPE, we would need to handle:

  • different value types: "FastRoPE" clearly maps to a boolean flag, but we also need to handle str inputs, float inputs, etc., where it would need to be a key=value pair
  • different orders: we need to be able to parse --dataclass_key key_a=a key_b and --dataclass_key key_b key_a=a equivalently.
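
For concreteness, here is a tiny sketch of the cast that the list-style parsing ultimately performs. It is illustrative only: the field names mirror the example above, and the real parsable_dataclass/EnsureTypes machinery does considerably more.

```python
from dataclasses import dataclass

@dataclass
class FastKernelsConfigSketch:
    # illustrative stand-in mirroring the example above
    fast_loss: bool = False
    fast_rms_layernorm: bool = False
    fast_rope_embeddings: bool = False

def cast_bool_triple(values):
    """Cast CLI strings such as ['False', 'True', 'False'] into the config dataclass."""
    def to_bool(s):
        return str(s).lower() in ("true", "1", "yes")
    return FastKernelsConfigSketch(*(to_bool(v) for v in values))

# `--fast_kernels False True False` arrives as a list of strings:
cfg = cast_bool_triple(["False", "True", "False"])
# -> FastKernelsConfigSketch(fast_loss=False, fast_rms_layernorm=True, fast_rope_embeddings=False)
```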

Collaborator:

I suggest we merge this PR first; we can come back to this when we do the fused cross entropy. BTW, I left a comment for @achew010 to help upgrade FusedOpsAndKernelsConfig to non-experimental status in this PR by deleting these experimental=True entries, see here.

Collaborator:

Thank you for the details. I agree that getting this merged and thinking about improvements for fast_kernels later makes sense. Is just FusedOpsAndKernelsConfig ready to move out of experimental, or can this also be done for PaddingFree and MultiPack?

@anhuong (Collaborator) commented on Sep 18, 2024:

Also, we added new automation that ensures PRs follow conventional commits, which you can see is failing: https://github.com/foundation-model-stack/fms-hf-tuning/actions/runs/10920573842/job/30310716778?pr=280. Please address the change.

@kmehant changed the title from "Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels" to "feat: Add DataClass Arguments to Activate Padding-Free and MultiPack Plugin and FastKernels" on Sep 18, 2024
@github-actions bot added the feat label on Sep 18, 2024
@anhuong (Collaborator) commented on Sep 19, 2024:

Please update the branch with the new changes from main, and once the experimental fields are updated this is good to merge to me 👍

@anhuong (Collaborator) commented on Sep 19, 2024:

Note @kmehant I think since you requested changes, an approval is needed from your side as well before this can merge

achew010 and others added 14 commits September 20, 2024 02:30
@kmehant (Collaborator) left a comment:

@anhuong thanks for letting me know. It's really annoying that I am not able to dismiss my review in some way so that I do not stand as a blocker :( forcing me to push an approval.

Nonetheless, I have used most of these features as part of iLab and undoubtedly vouch for the changes. Thanks.

@anhuong (Collaborator) left a comment:

We can also mark PaddingFree and MultiPack as not experimental, but LGTM.

@anhuong merged commit 926fb9b into foundation-model-stack:main on Sep 20, 2024 (8 checks passed)