
Implement PyTorch support for float8 types (F8_E5M2 and F8_E4M3) #404

Merged: 2 commits merged into huggingface:main on Jan 18, 2024
Conversation

@zeux (Contributor) commented Dec 15, 2023

This PR completes support for float8 types by making them available when using safetensors from Python with PyTorch; float8 types have been supported by PyTorch since July 2023 (pytorch/pytorch#104242).

Note that the PyTorch name for the e4m3 type has an extra "fn" suffix to match MLIR, but the format should be the same ("fn" means "finite").

The added test checks that -0.5 round-trips in both formats: both types are single-byte and share the same representation for zero, but they encode -0.5 differently.
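For illustration, here is a minimal sketch of that round-trip check, assuming PyTorch >= 2.1; the file name and tensor names are placeholders, not the PR's actual test code:

```python
import torch
from safetensors.torch import load_file, save_file

# -0.5 is exactly representable in both fp8 formats, but its bit pattern
# differs between them, so a faithful round-trip implies the dtype and the
# raw bytes were preserved.
tensors = {
    "e5m2": torch.tensor([-0.5], dtype=torch.float32).to(torch.float8_e5m2),
    "e4m3": torch.tensor([-0.5], dtype=torch.float32).to(torch.float8_e4m3fn),
}
save_file(tensors, "fp8_roundtrip.safetensors")

loaded = load_file("fp8_roundtrip.safetensors")
for name, original in tensors.items():
    assert loaded[name].dtype == original.dtype
    # Compare in float32: many elementwise ops are not implemented for fp8 dtypes.
    assert torch.equal(loaded[name].to(torch.float32), original.to(torch.float32))
```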

@zeux (Contributor, Author) commented Dec 15, 2023

This would need to bump the minimum PyTorch version (I think it's currently listed as >=1.10, but float8 formats are only supported in 2.1). I am not sure what the versioning restrictions are here, but I can probably change the code to only use float8 et al. when it is present / when requested; this will complicate the Python wrapper a little bit.

Update: the code now handles PyTorch 2.0 and earlier automatically (without support for float8 tensors).
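As an illustration of that fallback (a sketch under the assumption that the wrapper keeps a name-to-dtype table; the table and names below are not the PR's actual code), the fp8 entries can simply be registered only when the running PyTorch defines them:

```python
import torch

# Map safetensors dtype names to torch dtypes; advertise the fp8 entries only
# when the installed PyTorch (>= 2.1) actually defines them, so older versions
# keep working without float8 support.
_NAME_TO_DTYPE = {
    "F16": torch.float16,
    "BF16": torch.bfloat16,
    "F32": torch.float32,
}
if hasattr(torch, "float8_e5m2"):  # float8 dtypes were introduced in PyTorch 2.1
    _NAME_TO_DTYPE["F8_E5M2"] = torch.float8_e5m2
    _NAME_TO_DTYPE["F8_E4M3"] = torch.float8_e4m3fn
```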

@KohakuBlueleaf commented

Really need this!

zeux added a commit to zeux/calm that referenced this pull request Dec 29, 2023
NVIDIA GPUs support two fp8 types: e5m2 and e4m3. PyTorch supports both
from version 2.1; note that safetensors currently does not support these
fully, but it will once this PR gets merged:
huggingface/safetensors#404

This change implements initial support for e5m2. e4m3 should be a better
fit in general, but:

- It has a smaller exponent range so it requires weight adjustment to
  fit into this range; Llama2 works fine without it but Mistral breaks
  due to small weights that get rounded to zero.
- More critically, NV GPUs only support fp8 to half/float conversion
  natively since Hopper (SM9.0). fp8e5m2 has a fast emulation path because
  it has the same exponent range as fp16 (similarly to bfloat16,
  conversion just requires zero padding), but fp8e4m3 emulation is
  impractically slow.

We currently just use the built-in PyTorch conversion, which results in an
aggregate ~0.5% perplexity drop. This can probably be improved in the
future.

Warp-parallel matmul needs to process 4 elements at a time now so that
each thread keeps loading 4 bytes, maximizing effective bandwidth.
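The zero-padding equivalence mentioned in the commit message above can be shown in a few lines of NumPy (a sketch of the idea only, not the kernel used in calm): e5m2 shares fp16's sign bit, 5-bit exponent, and exponent bias, so widening just appends zero mantissa bits, much like the bfloat16 to float32 trick.

```python
import numpy as np

def f8e5m2_to_f16(raw: np.ndarray) -> np.ndarray:
    """Reinterpret raw fp8 e5m2 bytes as fp16 by zero-padding the mantissa."""
    # Shifting each byte into the top half of a 16-bit word leaves the sign and
    # exponent fields in place and fills the extra mantissa bits with zeros,
    # which reproduces the exact same value (including subnormals, inf and NaN).
    return (raw.astype(np.uint16) << np.uint16(8)).view(np.float16)

# 0xB8 encodes -0.5 in e5m2 (sign=1, exponent=01110, mantissa=00).
assert f8e5m2_to_f16(np.array([0xB8], dtype=np.uint8))[0] == np.float16(-0.5)
```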
@Narsil (Collaborator) commented Jan 18, 2024

Thanks a lot for this PR, sorry I missed it when you published it.

Narsil merged commit e19f55e into huggingface:main on Jan 18, 2024