
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16 #844

Closed
6 of 8 tasks
griff4692 opened this issue Nov 10, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@griff4692

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I fine-tuned Mistral with axolotl using bf16 precision.

I want to generate from this fine-tuned model: /path-to-my-fined-tuned-checkpoint/checkpoint-500

Generation should run without a dtype error.

Current behaviour

There is a dtype mismatch:

  File "/home/ga2530/axolotl-bhc/scripts/sent_inference_utils.py", line 570, in run_prompt
    generated = model.generate(
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/transformers/generation/utils.py", line 1652, in generate
    return self.sample(
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/transformers/generation/utils.py", line 2734, in sample
    outputs = self(
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 1045, in forward
    outputs = self.model(
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 932, in forward
    layer_outputs = decoder_layer(
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 621, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/transformers/models/mistral/modeling_mistral.py", line 342, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/ga2530/miniconda3/envs/ax/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16

Steps to reproduce

    # Imports assumed by this snippet; the axolotl module paths are a best
    # guess for this commit and may differ by version.
    import os
    from pathlib import Path

    import torch
    import transformers
    from transformers import GenerationConfig

    from axolotl.cli import load_cfg
    from axolotl.common.cli import TrainerCliArgs, load_model_and_tokenizer

    config = Path(os.path.expanduser('path-to-my-config.yml'))
    parsed_cfg = load_cfg(config)
    parsed_cfg.sample_packing = False
    # My fine-tuned checkpoint
    parsed_cfg.base_model_config = '/path-to-my-fined-tuned-checkpoint'
    parsed_cfg.base_model = '/path-to-my-fined-tuned-checkpoint/checkpoint-500'
    parser = transformers.HfArgumentParser((TrainerCliArgs))
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(
        return_remaining_strings=True
    )
    parsed_cli_args.inference = True
    model, tokenizer = load_model_and_tokenizer(cfg=parsed_cfg, cli_args=parsed_cli_args)

    prompt = "DEBUG"
    batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

    model.eval()
    with torch.no_grad():
        generation_config = GenerationConfig(
            repetition_penalty=1.1,
            max_new_tokens=1024,
            temperature=0.9,
            top_p=0.95,
            top_k=40,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=True,
            use_cache=True,
            return_dict_in_generate=True,
            output_attentions=False,
            output_hidden_states=False,
            output_scores=False,
        )
        generated = model.generate(
            inputs=batch["input_ids"].to(parsed_cfg.device),
            generation_config=generation_config,
        )

Config yaml

base_model: mistralai/Mistral-7B-Instruct-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: /nlp/projects/summarization/bhc_data_cleanup/prompt_sent_frost_instruct.jsonl
    type: summarizetldr

dataset_prepared_path:
val_set_size: 0.005
output_dir: /nlp/projects/summarization/bhc_data_cleanup/mistral_weights/sent_frost_instruct

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true

wandb_project: mistral
wandb_entity: griffinadams
wandb_watch:
wandb_run_id: sent_frost_instruct
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 100
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 100
save_strategy: steps
debug:
deepspeed: /home/ga2530/axolotl/deepspeed/zero2.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

I tried wrapping generation in with torch.cuda.amp.autocast(), but that did not work.
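For reference, the attempt looked roughly like the sketch below; the exact autocast arguments weren't recorded, so this is one plausible form using the same generate call as in the repro above.

    # Sketch of the attempted workaround; the autocast dtype is an assumption.
    with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        generated = model.generate(
            inputs=batch["input_ids"].to(parsed_cfg.device),
            generation_config=generation_config,
        )
    # The same RuntimeError was still raised from F.linear.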

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.9

axolotl branch-commit

main/f544ab2bed513bef269e6887d35c8aa12a852473

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@griff4692 griff4692 added the bug Something isn't working label Nov 10, 2023
@griff4692
Author

I'm able to resolve the issue by casting the model to bf16:

model = model.to(torch.bfloat16)

but I'm not sure if this is the best way to do it in this codebase.

@winglian
Collaborator

@griff4692 My guess is that the issue is in the DeepSpeed JSON configuration used during training.
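For context, the bf16-related parts of a DeepSpeed ZeRO-2 config typically look like the sketch below. The actual zero2.json used for this run isn't shown here, so this is illustrative only, but it is the section worth checking against the bf16: true setting in the axolotl yaml.

    {
      "bf16": {
        "enabled": "auto"
      },
      "zero_optimization": {
        "stage": 2
      }
    }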

@mathiasesn

mathiasesn commented Nov 22, 2023

This is also a problem for me with NO deepspeed.json configuration used. A simple fix would be:

if cfg.bf16:
    model = model.to(torch.bfloat16)

For example here and here.

@timothylimyl
Contributor

Hi,

Tested on 13/12/23; the same issue still appears (tested with Mistral):

RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16

The error is raised in the linear layer, torch.nn.Linear:

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)

Basically, there's a mismatch here: the self.weight dtype is bf16 while the input dtype is torch.float32.

I think the fix needs to be done here:
https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/utils/models.py#L174

We can add a cast to the appropriate dtype here based on the model config. Let me know what you think; I can make a PR.
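Something along these lines, as a hypothetical sketch rather than the actual models.py code (the helper name is a placeholder; cfg.bf16 and cfg.fp16 are the flags already present in the config yaml above):

    import torch

    def cast_model_to_cfg_dtype(model, cfg):
        # Hypothetical helper: after loading, cast the model to the dtype
        # implied by the axolotl config so inference inputs and weights match.
        if cfg.bf16:
            return model.to(torch.bfloat16)
        if cfg.fp16:
            return model.to(torch.float16)
        return model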

@NanoCode012
Collaborator

Closed thanks to @taziksh
