
SparseML Compression Pt 3: SparseAutoModel interface & inferring params #2190

Merged — 62 commits merged into main from compression_ui, Mar 27, 2024

Conversation

@Satrat (Contributor) commented Mar 20, 2024

This final PR for SparseML sparsity compression adds compression support to save_pretrained and from_pretrained. It also adds support for inferring the global sparsity and sparsity structure parameters from the model. See the corresponding internal docs PR for design details; there have been some minor changes since, which are reflected in README.md.
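Global sparsity here is just the percentage of zero-valued entries across the model's parameters. A minimal sketch of the idea (illustrative only, not the exact SparseML helper):

import torch

def infer_global_sparsity(model: torch.nn.Module) -> float:
    # Percentage of zero-valued entries across all parameters.
    total, zeros = 0, 0
    for param in model.parameters():
        total += param.numel()
        zeros += (param == 0).sum().item()
    return 100.0 * zeros / total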

Callouts

  • Overriding model.save_pretrained() is tricky because the model is initialized from SparseAutoModelForCausalLM.from_pretrained, but the returned model is a child of PreTrainedModel (for instance, LlamaForCausalLM inherits from LlamaPreTrainedModel, which inherits from PreTrainedModel). To override PreTrainedModel.save_pretrained we need to do a bit of "class instance surgery"; see compression_save.py for implementation details and the first sketch after this list
  • Ended up not storing the sparsity config in the safetensors metadata; storing it in the HF config seemed sufficient, and duplicating it would have complicated the code without much added gain
  • How we're inferring the sparsity structure is definitely hacky: we determine it by looking at the applied modifiers and checking for a mask_structure attribute (see the second sketch below). I think this is sufficient for now, but I'm open to other ideas here
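A minimal sketch of the instance-level override from the first callout, assuming a hypothetical compress-then-delegate hook (the actual logic lives in compression_save.py):

import types

def modify_save_pretrained(model):
    # Keep a handle to the original bound save_pretrained before replacing it.
    original_save = model.save_pretrained

    def save_pretrained_wrapper(self, save_directory, **kwargs):
        # Hypothetical hook: compress the state dict, then delegate to the
        # stock PreTrainedModel.save_pretrained with the compressed weights.
        state_dict = kwargs.pop("state_dict", None) or self.state_dict()
        # ... run compression over state_dict here ...
        original_save(save_directory, state_dict=state_dict, **kwargs)

    # Rebind on this instance only; the class (and every other instance)
    # keeps the unmodified PreTrainedModel.save_pretrained.
    model.save_pretrained = types.MethodType(save_pretrained_wrapper, model)

And the modifier-based structure inference from the last callout could look roughly like this, where modifiers stands in for however the applied modifiers are enumerated:

def infer_sparsity_structure(modifiers) -> str:
    # Scan the applied modifiers for a mask_structure attribute, e.g. "2:4".
    for modifier in modifiers:
        structure = getattr(modifier, "mask_structure", None)
        if structure is not None:
            return structure
    return "unstructured"  # fallback when no modifier declares a structure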

Example

from sparseml.transformers import SparseAutoModelForCausalLM
from sparseml.utils.pytorch.utils import measure_cuda_memory
import torch

MODEL_PATH = "zoo:llama2-7b-open_platypus_orca_llama2_pretrain-pruned60"
OUTPUT_PATH = "./test_compress_output"
RECIPE = "zoo:llama2-7b-open_platypus_orca_llama2_pretrain-pruned60"

torch.cuda.set_device(0)
with measure_cuda_memory() as m:
    model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
print(f"Load dense model peak GPU {m.overall_peak_memory / float(2**30):.4f} GB")

sparsity_config = getattr(model,"sparsity_config", None)
print(f"Sparsity config before compression: {sparsity_config}")
with measure_cuda_memory() as m:
    model.save_compressed(OUTPUT_PATH)
print(f"Save compressed model peak GPU {m.overall_peak_memory / float(2**30):.4f} GB")

torch.cuda.set_device(1)
with measure_cuda_memory() as m:
    model_again = SparseAutoModelForCausalLM.from_pretrained(
        OUTPUT_PATH, device_map="cuda:1"
    )
print(f"Load compressed model peak GPU {m.overall_peak_memory / float(2**30):.4f} GB")
sparsity_config = getattr(model,"sparsity_config", None)
print(f"Sparsity config after compression: {sparsity_config}")

# verify the compression round trip is lossless
og_state_dict = model.state_dict()
reconstructed_state_dict = model_again.state_dict()
assert len(og_state_dict) == len(reconstructed_state_dict)
for key in og_state_dict.keys():
    dense_tensor = og_state_dict[key]
    reconstructed_tensor = reconstructed_state_dict[key]
    assert torch.equal(dense_tensor.cpu(), reconstructed_tensor.cpu())

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00, 1.55it/s]
Load dense model peak GPU 25.2276 GB
Sparsity config before compression: None
Compressing model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 291/291 [01:22<00:00, 3.51it/s]
Save compressed model peak GPU 26.3272 GB
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7.87it/s]
Decompressing model: 291it [00:23, 12.31it/s]
Load compressed model peak GPU 25.7159 GB
Sparsity config after compression: format='sparse_bitmask' global_sparsity=57.66354988216862 sparsity_structure='unstructured'
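The exact round trip in the assertions above works because sparse_bitmask only changes the storage layout: conceptually it keeps each tensor's nonzero values plus a mask of their positions. A toy sketch of that idea (the real format packs the mask into bits; this uses a boolean mask for clarity):

import torch

def bitmask_compress(tensor: torch.Tensor):
    # Keep the nonzero values and a mask marking where they belong.
    mask = tensor != 0
    return tensor[mask], mask

def bitmask_decompress(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Scatter the stored values back into a dense tensor of zeros.
    dense = torch.zeros(mask.shape, dtype=values.dtype)
    dense[mask] = values
    return dense

x = torch.tensor([[0.0, 1.5], [0.0, 0.0]])
values, mask = bitmask_compress(x)
assert torch.equal(bitmask_decompress(values, mask), x)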

Testing

Unit tests are included for from_pretrained, save_pretrained, and save_compressed. In addition, I manually tested the following scenarios (see the README for instructions on enabling compression during finetuning/oneshot):

  • load dense -> apply SparseGPT -> save compressed -> reload compressed
  • FSDP: load dense -> apply SparseGPT -> save compressed -> reload compressed
  • load sparse -> sparse finetuning -> save compressed -> reload compressed
  • FSDP: load sparse -> sparse finetuning -> save compressed -> reload compressed

mgoin dismissed dbogunowicz’s stale review March 20, 2024 21:09

The base branch was changed.

@mgoin (Member) left a comment

I tried using this to compress an existing model on HF. I chose the TinyLlama here: https://huggingface.co/neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4/tree/main

There are a few issues I noticed:

  • It seems like float32 is always used when saving, whereas we should preserve the existing dtype of the model
  • SparseAutoModelForCausalLM.save_compressed seems to only produce the model weights and config file in the output directory, whereas we should also be including other needed files like the tokenizer's. This may be addressed by inheriting instead of using a static function

Code:

from sparseml.transformers import SparseAutoModelForCausalLM

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output"

model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
SparseAutoModelForCausalLM.save_compressed(model, OUTPUT_PATH)

Output:

config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 672/672 [00:00<00:00, 7.52MB/s]
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.20G/2.20G [01:51<00:00, 19.8MB/s]
/home/mgoin/venvs/nm/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 1.47MB/s]
2024-03-20 19:29:24 sparseml.transformers.utils.helpers INFO     model_path is a huggingface model id. Attempting to download recipe from https://huggingface.co/
2024-03-20 19:29:24 sparseml.transformers.utils.helpers INFO     Found recipe: recipe.yaml for model id: neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4. Downloading...
recipe.yaml: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:00<00:00, 2.64MB/s]
Logging all SparseML modifier-level logs to sparse_logs/20-03-2024_19.29.24.log
2024-03-20 19:29:24 sparseml.core.logger.logger INFO     Logging all SparseML modifier-level logs to sparse_logs/20-03-2024_19.29.24.log
2024-03-20 19:29:24 sparseml.core.recipe.recipe INFO     Loading recipe from file /home/mgoin/.cache/huggingface/hub/models--neuralmagic--TinyLlama-1.1B-Chat-v1.0-pruned2.4/snapshots/22ff818572f6fb2bd02110dd0b40c0169533c6da/recipe.yaml
manager stage: Model structure initialized
2024-03-20 19:29:24 sparseml.pytorch.model_load.helpers INFO     Applied an unstaged recipe to the model at neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4
2024-03-20 19:29:24 sparseml.pytorch.model_load.helpers WARNING  Model state was not reloaded for SparseML: could not find model weights for neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4
Compressing model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [00:04<00:00, 43.20it/s]

Looking at the output of save_compressed shows a larger file size, but I think this is because it is saving as float32 even though the model is originally float16:

ll test_compress_output 
total 2.5G
-rw-r--r-- 1 mgoin mgoin  906 Mar 20 19:29 config.json
-rw-r--r-- 1 mgoin mgoin  124 Mar 20 19:29 generation_config.json
-rw-r--r-- 1 mgoin mgoin 2.5G Mar 20 19:29 model.safetensors

Here is the section of the config.json in that directory that mentions the sparsity level and dtype:

    "sparsity_config": {
        "format": "sparse_bitmask",
        "global_sparsity": 44.0587486922757,
        "sparsity_structure": "2:4"
    },
    "tie_word_embeddings": false,
    "torch_dtype": "float32",

@Satrat (Contributor, Author) commented Mar 21, 2024

AutoModelForCausalLM behaves the same for both of these points (the dtype and the missing tokenizer files):

from transformers import AutoModelForCausalLM

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output_tiny_llama"

model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
model.save_pretrained(OUTPUT_PATH)

Output:

(.venv) sadkins@gpuserver6:/nm/drive0/sadkins/sparseml$ ls test_compress_output_tiny_llama/
config.json  generation_config.json  pytorch_model.bin

config.json shows "torch_dtype": "float32", and file size is 4GB

Adding the torch_dtype="auto" argument to from_pretrained fixes the dtype issue for both the sparse and non-sparse classes. As for the tokenizer not being saved, isn't it expected that we would have to save it separately?

from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output_tiny_llama"

model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0", torch_dtype="auto")
tokenizer = SparseAutoTokenizer.from_pretrained(MODEL_PATH)
SparseAutoModelForCausalLM.save_compressed(model, OUTPUT_PATH)
tokenizer.save_pretrained(OUTPUT_PATH)

Output:

(.venv) sadkins@gpuserver6:/nm/drive0/sadkins/sparseml$ ls test_compress_output_tiny_llama/
config.json  generation_config.json  model.safetensors  special_tokens_map.json  tokenizer_config.json  tokenizer.json

config.json shows "torch_dtype": "float16", and file size is 1.28GB

We could default to torch_dtype="auto" rather than None in SparseAutoModelForCausalLM; I agree it's a more intuitive default, but it would change the behavior from the Hugging Face parent class.
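If we did want that default, a hypothetical sketch of what it could look like (not part of this PR; shown only to illustrate the divergence from the parent class):

from transformers import AutoModelForCausalLM

class SparseAutoModelForCausalLM(AutoModelForCausalLM):
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
        # Default to the checkpoint's stored dtype instead of float32,
        # intentionally diverging from the Hugging Face parent's default.
        kwargs.setdefault("torch_dtype", "auto")
        return super().from_pretrained(pretrained_model_name_or_path, **kwargs)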

Commits added:

  • working implementation
  • remove unneeded file
  • update README
  • clean up and docstrings
  • finetuning and one-shot interface
  • update README
  • update save
  • update README

@Satrat (Contributor, Author) commented Mar 22, 2024

@mgoin @robertgshaw2-neuralmagic the latest commit now has the change from SparseAutoModelForCausalLM.save_pretrained(model, ...) to model.save_pretrained(...). I updated the PR description with notes on the new implementation, and the README is updated with the new interface.

@mgoin (Member) commented Mar 23, 2024

@Satrat thanks for figuring out those dtype and saving confusions; I think you're right on all of it, so nothing to change there. I'll look a bit more into HF saving/uploading flows to make sure I'm testing it right.

dbogunowicz previously approved these changes Mar 26, 2024
mgoin merged commit 85b0e72 into main Mar 27, 2024
11 of 15 checks passed
mgoin deleted the compression_ui branch March 27, 2024 19:14
mgoin mentioned this pull request Apr 3, 2024