
SparseML Compression Pt 3: SparseAutoModel interface & inferring params #2190

Merged — 62 commits merged into main from compression_ui, Mar 27, 2024

Conversation

@Satrat (Contributor) commented Mar 20, 2024

This final PR for SparseML sparsity compression adds compression support to save_pretrained and from_pretrained. It also adds support for inferring the global sparsity and sparsity structure parameters from the model. See the corresponding internal docs PR for design details; there have been some minor changes since, which are reflected in README.md.
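Global sparsity here is just the percentage of zero-valued entries across the model's parameters. A minimal sketch of the idea (illustrative only, not the exact SparseML helper):

import torch

def infer_global_sparsity(model: torch.nn.Module) -> float:
    # Percentage of zero-valued entries across all parameters.
    total, zeros = 0, 0
    for param in model.parameters():
        total += param.numel()
        zeros += (param == 0).sum().item()
    return 100.0 * zeros / total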

Callouts

  • Overriding model.save_pretrained() is tricky because the model is initialized from SparseAutoModelForCausalLM.from_pretrained, but the returned model is a child of PreTrainedModel (for instance, LlamaForCausalLM inherits from LlamaPreTrainedModel, which inherits from PreTrainedModel). To override PreTrainedModel.save_pretrained we need to do a bit of "class instance surgery"; see compression_save.py for implementation details and the first sketch after this list
  • Ended up not storing the sparsity config in the safetensors metadata; storing it in the HF config seemed sufficient, and duplicating it would have complicated the code without much added gain
  • How we're inferring the sparsity structure is definitely hacky: we determine it by looking at the applied modifiers and checking for a mask_structure attribute (see the second sketch below). I think this is sufficient for now, but I'm open to other ideas here
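A minimal sketch of the instance-level override from the first callout, assuming a hypothetical compress-then-delegate hook (the actual logic lives in compression_save.py):

import types

def modify_save_pretrained(model):
    # Keep a handle to the original bound save_pretrained before replacing it.
    original_save = model.save_pretrained

    def save_pretrained_wrapper(self, save_directory, **kwargs):
        # Hypothetical hook: compress the state dict, then delegate to the
        # stock PreTrainedModel.save_pretrained with the compressed weights.
        state_dict = kwargs.pop("state_dict", None) or self.state_dict()
        # ... run compression over state_dict here ...
        original_save(save_directory, state_dict=state_dict, **kwargs)

    # Rebind on this instance only; the class (and every other instance)
    # keeps the unmodified PreTrainedModel.save_pretrained.
    model.save_pretrained = types.MethodType(save_pretrained_wrapper, model)

And the modifier-based structure inference from the last callout could look roughly like this, where modifiers stands in for however the applied modifiers are enumerated:

def infer_sparsity_structure(modifiers) -> str:
    # Scan the applied modifiers for a mask_structure attribute, e.g. "2:4".
    for modifier in modifiers:
        structure = getattr(modifier, "mask_structure", None)
        if structure is not None:
            return structure
    return "unstructured"  # fallback when no modifier declares a structure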

Example

from sparseml.transformers import SparseAutoModelForCausalLM
from sparseml.utils.pytorch.utils import measure_cuda_memory
import torch

MODEL_PATH = "zoo:llama2-7b-open_platypus_orca_llama2_pretrain-pruned60"
OUTPUT_PATH = "./test_compress_output"
RECIPE = "zoo:llama2-7b-open_platypus_orca_llama2_pretrain-pruned60"

torch.cuda.set_device(0)
with measure_cuda_memory() as m:
    model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
print(f"Load dense model peak GPU {m.overall_peak_memory / float(2**30):.4f} GB")

sparsity_config = getattr(model,"sparsity_config", None)
print(f"Sparsity config before compression: {sparsity_config}")
with measure_cuda_memory() as m:
    model.save_compressed(OUTPUT_PATH)
print(f"Save compressed model peak GPU {m.overall_peak_memory / float(2**30):.4f} GB")

torch.cuda.set_device(1)
with measure_cuda_memory() as m:
    model_again = SparseAutoModelForCausalLM.from_pretrained(
        OUTPUT_PATH, device_map="cuda:1"
    )
print(f"Load compressed model peak GPU {m.overall_peak_memory / float(2**30):.4f} GB")
sparsity_config = getattr(model,"sparsity_config", None)
print(f"Sparsity config after compression: {sparsity_config}")

# verify the compression round trip is lossless
og_state_dict = model.state_dict()
reconstructed_state_dict = model_again.state_dict()
assert len(og_state_dict) == len(reconstructed_state_dict)
for key in og_state_dict.keys():
    dense_tensor = og_state_dict[key]
    reconstructed_tensor = reconstructed_state_dict[key]
    assert torch.equal(dense_tensor.cpu(), reconstructed_tensor.cpu())

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00, 1.55it/s]
Load dense model peak GPU 25.2276 GB
Sparsity config before compression: None
Compressing model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 291/291 [01:22<00:00, 3.51it/s]
Save compressed model peak GPU 26.3272 GB
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7.87it/s]
Decompressing model: 291it [00:23, 12.31it/s]
Load compressed model peak GPU 25.7159 GB
Sparsity config after compression: format='sparse_bitmask' global_sparsity=57.66354988216862 sparsity_structure='unstructured'
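The exact round trip in the assertions above works because sparse_bitmask only changes the storage layout: conceptually it keeps each tensor's nonzero values plus a mask of their positions. A toy sketch of that idea (the real format packs the mask into bits; this uses a boolean mask for clarity):

import torch

def bitmask_compress(tensor: torch.Tensor):
    # Keep the nonzero values and a mask marking where they belong.
    mask = tensor != 0
    return tensor[mask], mask

def bitmask_decompress(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Scatter the stored values back into a dense tensor of zeros.
    dense = torch.zeros(mask.shape, dtype=values.dtype)
    dense[mask] = values
    return dense

x = torch.tensor([[0.0, 1.5], [0.0, 0.0]])
values, mask = bitmask_compress(x)
assert torch.equal(bitmask_decompress(values, mask), x)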

Testing

Unit tests are included for from_pretrained, save_pretrained, and save_compressed. In addition, I manually tested the following scenarios (see the README for instructions on enabling compression during finetuning/oneshot):

  • load dense -> apply SparseGPT -> save compressed -> reload compressed
  • FSDP: load dense -> apply SparseGPT -> save compressed -> reload compressed
  • load sparse -> sparse finetuning -> save compressed -> reload compressed
  • FSDP: load sparse -> sparse finetuning -> save compressed -> reload compressed

mgoin dismissed dbogunowicz’s stale review March 20, 2024 21:09

The base branch was changed.

@mgoin (Member) left a comment

I tried using this to compress an existing model on HF. I chose the TinyLlama here: https://huggingface.co/neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4/tree/main

There are a few issues I noticed:

  • It seems like float32 is always used when saving, whereas we should preserve the existing dtype of the model
  • SparseAutoModelForCausalLM.save_compressed seems to only produce the model weights and config file in the output directory, whereas we should also be including other needed files like the tokenizer's. This may be addressed by inheriting instead of using a static function

Code:

from sparseml.transformers import SparseAutoModelForCausalLM

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output"

model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
SparseAutoModelForCausalLM.save_compressed(model, OUTPUT_PATH)

Output:

config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 672/672 [00:00<00:00, 7.52MB/s]
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.20G/2.20G [01:51<00:00, 19.8MB/s]
/home/mgoin/venvs/nm/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 1.47MB/s]
2024-03-20 19:29:24 sparseml.transformers.utils.helpers INFO     model_path is a huggingface model id. Attempting to download recipe from https://huggingface.co/
2024-03-20 19:29:24 sparseml.transformers.utils.helpers INFO     Found recipe: recipe.yaml for model id: neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4. Downloading...
recipe.yaml: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:00<00:00, 2.64MB/s]
Logging all SparseML modifier-level logs to sparse_logs/20-03-2024_19.29.24.log
2024-03-20 19:29:24 sparseml.core.logger.logger INFO     Logging all SparseML modifier-level logs to sparse_logs/20-03-2024_19.29.24.log
2024-03-20 19:29:24 sparseml.core.recipe.recipe INFO     Loading recipe from file /home/mgoin/.cache/huggingface/hub/models--neuralmagic--TinyLlama-1.1B-Chat-v1.0-pruned2.4/snapshots/22ff818572f6fb2bd02110dd0b40c0169533c6da/recipe.yaml
manager stage: Model structure initialized
2024-03-20 19:29:24 sparseml.pytorch.model_load.helpers INFO     Applied an unstaged recipe to the model at neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4
2024-03-20 19:29:24 sparseml.pytorch.model_load.helpers WARNING  Model state was not reloaded for SparseML: could not find model weights for neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4
Compressing model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [00:04<00:00, 43.20it/s]

Looking at the output of save_compressed shows a larger file size, but I think this is because it is saving as float32 even though the model is originally float16:

ll test_compress_output 
total 2.5G
-rw-r--r-- 1 mgoin mgoin  906 Mar 20 19:29 config.json
-rw-r--r-- 1 mgoin mgoin  124 Mar 20 19:29 generation_config.json
-rw-r--r-- 1 mgoin mgoin 2.5G Mar 20 19:29 model.safetensors

Here is the section of the config.json in that directory that mentions the sparsity level and dtype:

    "sparsity_config": {
        "format": "sparse_bitmask",
        "global_sparsity": 44.0587486922757,
        "sparsity_structure": "2:4"
    },
    "tie_word_embeddings": false,
    "torch_dtype": "float32",

@Satrat (Contributor, Author) commented Mar 21, 2024

AutoModelForCausalLM behaves the same for both of these points (the dtype and the missing tokenizer files):

from transformers import AutoModelForCausalLM

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output_tiny_llama"

model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
model.save_pretrained(OUTPUT_PATH)

Output:

(.venv) sadkins@gpuserver6:/nm/drive0/sadkins/sparseml$ ls test_compress_output_tiny_llama/
config.json  generation_config.json  pytorch_model.bin

config.json shows "torch_dtype": "float32", and file size is 4GB

Adding the torch_dtype="auto" argument to from_pretrained fixes the dtype issue for both the sparse and non-sparse classes. As for the tokenizer not being saved, isn't it expected that we would have to save it separately?

from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output_tiny_llama"

model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0", torch_dtype="auto")
tokenizer = SparseAutoTokenizer.from_pretrained(MODEL_PATH)
SparseAutoModelForCausalLM.save_compressed(model, OUTPUT_PATH)
tokenizer.save_pretrained(OUTPUT_PATH)

Output:

(.venv) sadkins@gpuserver6:/nm/drive0/sadkins/sparseml$ ls test_compress_output_tiny_llama/
config.json  generation_config.json  model.safetensors  special_tokens_map.json  tokenizer_config.json  tokenizer.json

config.json shows "torch_dtype": "float16", and file size is 1.28GB

We could default to torch_dtype="auto" rather than None in SparseAutoModelForCausalLM; I agree it's a more intuitive default, but it would change the behavior from the Hugging Face parent class.
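If we did want that default, a hypothetical sketch of what it could look like (not part of this PR; shown only to illustrate the divergence from the parent class):

from transformers import AutoModelForCausalLM

class SparseAutoModelForCausalLM(AutoModelForCausalLM):
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
        # Default to the checkpoint's stored dtype instead of float32,
        # intentionally diverging from the Hugging Face parent's default.
        kwargs.setdefault("torch_dtype", "auto")
        return super().from_pretrained(pretrained_model_name_or_path, **kwargs)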

Commits added:

  • working implementation
  • remove unneeded file
  • update README
  • clean up and docstrings
  • finetuning and one-shot interface
  • update README
  • update save
  • update README

@Satrat (Contributor, Author) commented Mar 22, 2024

@mgoin @robertgshaw2-neuralmagic the latest commit now has the change from SparseAutoModelForCausalLM.save_pretrained(model, ...) to model.save_pretrained(...). I updated the PR description with notes on the new implementation, and the README is updated with the new interface.

@mgoin (Member) commented Mar 23, 2024

@Satrat thanks for figuring out those dtype and saving confusions; I think you're right on all of it, so nothing to change there. I'll look a bit more into HF saving/uploading flows to make sure I'm testing it right.

dbogunowicz previously approved these changes Mar 26, 2024
mgoin merged commit 85b0e72 into main Mar 27, 2024
11 of 15 checks passed
mgoin deleted the compression_ui branch March 27, 2024 19:14
mgoin mentioned this pull request Apr 3, 2024