SparseML Compression Pt 3: SparseAutoModel interface & inferring params #2190
Conversation
I tried using this to compress an existing model on HF. I chose the TinyLlama here: https://huggingface.co/neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4/tree/main
There are a few issues I noticed:
- It seems like float32 is always used when saving, where we should preserve the existing dtype of the model.
- `SparseAutoModelForCausalLM.save_compressed` seems to only produce the model weights and config file in the output directory, where we should also include other needed files like the tokenizer. This may be addressed by inheriting instead of using a static function.
Code:

```python
from sparseml.transformers import SparseAutoModelForCausalLM

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output"

model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
SparseAutoModelForCausalLM.save_compressed(model, OUTPUT_PATH)
```
Output:
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 672/672 [00:00<00:00, 7.52MB/s]
pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.20G/2.20G [01:51<00:00, 19.8MB/s]
/home/mgoin/venvs/nm/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 1.47MB/s]
2024-03-20 19:29:24 sparseml.transformers.utils.helpers INFO model_path is a huggingface model id. Attempting to download recipe from https://huggingface.co/
2024-03-20 19:29:24 sparseml.transformers.utils.helpers INFO Found recipe: recipe.yaml for model id: neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4. Downloading...
recipe.yaml: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 237/237 [00:00<00:00, 2.64MB/s]
Logging all SparseML modifier-level logs to sparse_logs/20-03-2024_19.29.24.log
2024-03-20 19:29:24 sparseml.core.logger.logger INFO Logging all SparseML modifier-level logs to sparse_logs/20-03-2024_19.29.24.log
2024-03-20 19:29:24 sparseml.core.recipe.recipe INFO Loading recipe from file /home/mgoin/.cache/huggingface/hub/models--neuralmagic--TinyLlama-1.1B-Chat-v1.0-pruned2.4/snapshots/22ff818572f6fb2bd02110dd0b40c0169533c6da/recipe.yaml
manager stage: Model structure initialized
2024-03-20 19:29:24 sparseml.pytorch.model_load.helpers INFO Applied an unstaged recipe to the model at neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4
2024-03-20 19:29:24 sparseml.pytorch.model_load.helpers WARNING Model state was not reloaded for SparseML: could not find model weights for neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4
Compressing model: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 201/201 [00:04<00:00, 43.20it/s]
Looking at the output of save_compressed shows a larger file size, but I think this is because it saves as float32 even though the model is originally float16:
ll test_compress_output
total 2.5G
-rw-r--r-- 1 mgoin mgoin 906 Mar 20 19:29 config.json
-rw-r--r-- 1 mgoin mgoin 124 Mar 20 19:29 generation_config.json
-rw-r--r-- 1 mgoin mgoin 2.5G Mar 20 19:29 model.safetensors
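To confirm where the extra size comes from, one way is to inspect the dtypes actually stored in the saved checkpoint. A minimal sketch using the safetensors API, assuming the output directory above:

```python
from safetensors import safe_open

# Collect the set of parameter dtypes in the saved checkpoint; a result of
# {torch.float32} would confirm the upcast from the original float16.
with safe_open("test_compress_output/model.safetensors", framework="pt") as f:
    dtypes = {f.get_tensor(name).dtype for name in f.keys()}
print(dtypes)
```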
Here is the section of the config.json in that directory that mentions the sparsity level and dtype:
"sparsity_config": {
"format": "sparse_bitmask",
"global_sparsity": 44.0587486922757,
"sparsity_structure": "2:4"
},
"tie_word_embeddings": false,
"torch_dtype": "float32",
For comparison, saving the same model through the stock transformers interface:

```python
from transformers import AutoModelForCausalLM

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output_tiny_llama"

model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
model.save_pretrained(OUTPUT_PATH)
```

Output:
config.json shows …

Adding in the tokenizer and `torch_dtype="auto"`:

```python
from sparseml.transformers import SparseAutoModelForCausalLM, SparseAutoTokenizer

MODEL_PATH = "neuralmagic/TinyLlama-1.1B-Chat-v1.0-pruned2.4"
OUTPUT_PATH = "./test_compress_output_tiny_llama"

model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0", torch_dtype="auto")
tokenizer = SparseAutoTokenizer.from_pretrained(MODEL_PATH)

SparseAutoModelForCausalLM.save_compressed(model, OUTPUT_PATH)
tokenizer.save_pretrained(OUTPUT_PATH)
```

Output:
config.json shows …

We could default to …
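The truncated suggestion above points at a smarter dtype default. One hypothetical way it could look (an assumption, not this PR's implementation) is to default `torch_dtype` to `"auto"` so the checkpoint's stored dtype is preserved unless the caller overrides it:

```python
from sparseml.transformers import SparseAutoModelForCausalLM

# Hypothetical wrapper: default torch_dtype to "auto" so the dtype stored in
# the checkpoint (float16 for this model) is kept instead of upcasting to
# float32. Illustrative only, not the code in this PR.
def from_pretrained_preserving_dtype(model_path, **kwargs):
    kwargs.setdefault("torch_dtype", "auto")
    return SparseAutoModelForCausalLM.from_pretrained(model_path, **kwargs)
```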
* working implementation
* remove unneeded file
* update README
* clean up and docstrings
* finetuning and one-shot interface
* update README
* update save
* update README
@mgoin @robertgshaw2-neuralmagic latest commit now has the change from …
@Satrat thanks for figuring out those dtype and saving confusions; I think you're right on all of those, so nothing to change there. I'll look a bit more at the HF saving/uploading flows to make sure I'm testing them right.
This final PR for SparseML sparsity compression adds compression support to `save_pretrained` and `from_pretrained`. It also adds support for inferring global sparsity and sparsity structure params from the model. See the corresponding internal docs PR for design details, but there have been some minor changes that are reflected in README.md.

Callouts
Overriding `model.save_pretrained()` is tricky, because the model is initialized from `SparseAutoModelForCausalLM.from_pretrained`, but the returned model is a child of `PreTrainedModel` (for instance `LlamaForCausalLM` inherits from `LlamaPreTrainedModel`, which inherits from `PreTrainedModel`). To override `PreTrainedModel.save_pretrained` we need to do a bit of "class instance surgery". See `compression_save.py` for implementation details.
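A rough sketch of the pattern (a generic illustration, not the actual code in `compression_save.py`):

```python
import types

from transformers import PreTrainedModel

def modify_save_pretrained(model: PreTrainedModel):
    # Rebind save_pretrained on this *instance*, so the override applies no
    # matter which PreTrainedModel subclass from_pretrained returned.
    original_save = model.save_pretrained

    def save_pretrained_wrapper(self, save_directory, **kwargs):
        # ... compress weights and attach the sparsity_config here ...
        return original_save(save_directory, **kwargs)

    model.save_pretrained = types.MethodType(save_pretrained_wrapper, model)
```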
Example
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00, 1.55it/s]
Load dense model peak GPU 25.2276 GB
Sparsity config before compression: None
Compressing model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 291/291 [01:22<00:00, 3.51it/s]
Save compressed model peak GPU 26.3272 GB
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7.87it/s]
Decompressing model: 291it [00:23, 12.31it/s]
Load compressed model peak GPU 25.7159 GB
Sparsity config after compression: format='sparse_bitmask' global_sparsity=57.66354988216862 sparsity_structure='unstructured'
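The script behind this output is not shown; a sketch of what it could look like, assuming `save_pretrained` accepts a `save_compressed` flag (an assumption based on the description above) and using `torch.cuda.max_memory_allocated` for the peak-GPU numbers:

```python
import torch
from sparseml.transformers import SparseAutoModelForCausalLM

MODEL_PATH = "..."  # dense model to compress (placeholder)
OUTPUT_PATH = "./compressed_output"

model = SparseAutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="cuda:0")
print(f"Load dense model peak GPU {torch.cuda.max_memory_allocated() / 1e9:.4f} GB")

# Hypothetical save_compressed kwarg on the overridden save_pretrained
torch.cuda.reset_peak_memory_stats()
model.save_pretrained(OUTPUT_PATH, save_compressed=True)
print(f"Save compressed model peak GPU {torch.cuda.max_memory_allocated() / 1e9:.4f} GB")

# Reloading transparently decompresses the bitmask-compressed weights
torch.cuda.reset_peak_memory_stats()
model = SparseAutoModelForCausalLM.from_pretrained(OUTPUT_PATH, device_map="cuda:0")
print(f"Load compressed model peak GPU {torch.cuda.max_memory_allocated() / 1e9:.4f} GB")
```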
Testing
Unit tests are included for `from_pretrained`, `save_pretrained`, and `save_compressed`. In addition, I manually tested the following scenarios. See the README for instructions on turning on compression during finetuning/oneshot.
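For reference, a sketch of what a round-trip unit test could look like, assuming a small sparse test checkpoint (placeholder path) and the `save_compressed` interface shown earlier:

```python
import os
import tempfile

from sparseml.transformers import SparseAutoModelForCausalLM

def test_save_compressed_round_trip():
    model = SparseAutoModelForCausalLM.from_pretrained("...")  # placeholder id
    with tempfile.TemporaryDirectory() as output_dir:
        SparseAutoModelForCausalLM.save_compressed(model, output_dir)
        # The compressed weights should be written as safetensors
        assert os.path.exists(os.path.join(output_dir, "model.safetensors"))
        # Reloading should transparently decompress the weights
        reloaded = SparseAutoModelForCausalLM.from_pretrained(output_dir)
        assert reloaded is not None
```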