Add a document for leveraging Advanced Matrix Extensions #2439

Merged · 18 commits · Jun 13, 2023

Conversation

@CaoE (Contributor) commented Jun 7, 2023

Fixes #2355

Description

Add a document about how to leverage AMX with PyTorch on the 4th Gen of Xeon.

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ZailiWang @ZhaoqiongZ @leslie-fang-intel @Xia-Weiwen @sekahler2 @zhuhaozhe @Valentine233

@pytorch-bot bot commented Jun 7, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2439

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 52c96e2, one job has failed (see the HUD link above for details).

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions bot added the docathon-h1-2023, advanced, and intel labels and removed the cla signed label Jun 7, 2023
netlify bot commented Jun 7, 2023

Deploy Preview for pytorch-tutorials-preview ready!

🔨 Latest commit 52c96e2
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/6488c820e404990008cb834f
😎 Deploy Preview https://deploy-preview-2439--pytorch-tutorials-preview.netlify.app

@CaoE changed the title from "Add amx doc" to "Add a document for leveraging Advanced Matrix Extensions" Jun 7, 2023
For more detailed information about oneDNN, see `oneDNN`_.

The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
No manual operations are required to enable this feature.
Contributor

Suggested change
No manual operations are required to enable this feature.
Since oneDNN is the default acceleration library for CPU, no manual operations are required to enable the AMX support.

Contributor Author

Applied the change.

``conv_transpose1d``,
``conv_transpose2d``,
``conv_transpose3d``,
``linear``
Contributor

I guess we need a special note for quantized linear here: whether the AMX kernel is chosen also depends on the policy of the quantization backend. Currently, the x86 quant backend uses fbgemm, not onednn, while users can use the onednn backend to turn on AMX for the linear op. cc @Xia-Weiwen

In general, it is also true that whether to dispatch to AMX kernels is a backend/library choice; the backend/library will choose the optimal kernels. It is worth noting in this tutorial.

Contributor Author

Added a note.

Contributor

Yes. However, I am not sure if it's OK to give such details in the tutorial. 🤔


Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
AMX supports two data types, INT8 and BFloat16; compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively. See figure 6 of `Accelerate AI Workloads with Intel® AMX`_.
Member

Are the speedups only on some particular newer hardware? Is the hardware consumer or enterprise centric?

Contributor

AMX is only available from the 4th Gen of Xeon (codename Sapphire Rapids); it is enterprise centric.
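
As a side note for readers (an illustrative addition, not text from the PR): on Linux you can check whether the CPU exposes the AMX feature flags by inspecting ``/proc/cpuinfo``; the helper name ``has_amx`` below is hypothetical.

::

    import pathlib

    def has_amx() -> bool:
        # Sapphire Rapids CPUs expose the amx_tile, amx_bf16, and amx_int8 flags.
        try:
            flags = pathlib.Path("/proc/cpuinfo").read_text()
        except OSError:
            return False
        return all(f in flags for f in ("amx_tile", "amx_bf16", "amx_int8"))

    print("AMX available:", has_amx())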

Confirm AMX is being utilized
------------------------------

Set environment variable ``export ONEDNN_VERBOSE=1`` to get oneDNN verbose output at runtime.
Member

It would be nice to have some Python function like is_x_available().

Contributor Author

Added the Python function torch.backends.mkldnn.verbose.
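
For illustration, a minimal sketch of using that function as a context manager, assuming a PyTorch build where ``torch.backends.mkldnn.verbose`` and its ``VERBOSE_ON`` level are available:

::

    import torch

    model = torch.nn.Linear(64, 64).eval()
    x = torch.randn(8, 64)

    # Dump oneDNN verbose messages only around this region; on AMX-capable CPUs
    # the reported kernel/ISA names should mention AMX when AMX kernels are picked.
    with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
        with torch.cpu.amp.autocast():
            model(x)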

Note: For quantized linear, whether to leverage AMX depends on which quantization backend is chosen.
At present, the x86 quantization backend is used by default for quantized linear and relies on fbgemm, while users can specify the onednn backend to turn on AMX for quantized linear.
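
A small sketch of switching the quantization backend (hedged: the set of available engines depends on the PyTorch build and platform):

::

    import torch

    # Engines typically include 'x86' and 'fbgemm', and 'onednn' on recent builds.
    print(torch.backends.quantized.supported_engines)

    # Select the onednn backend so quantized linear can be dispatched to AMX
    # kernels on supported hardware.
    torch.backends.quantized.engine = 'onednn'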

Guidelines of leveraging AMX with workloads
Member

I would start with this section on how to use it and have the supported ops show up at the bottom.

Contributor Author

Thanks for your suggestion. Do you mean to move this section above the supported ops? Like this:
AMX in PyTorch
Guidelines of leveraging AMX with workloads
List supported ops
...

Introduction
============

Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
Member

I realize AMX is lower level than other Intel technologies but it's still worth rationalizing to an end user in a few lines why it's interesting for them to know about AMX vs Intel compiler technologies

Contributor Author

Added more introduction to AMX and the benefits it can bring.


Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
AMX supports two data types, INT8 and BFloat16; compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively. See figure 6 of `Accelerate AI Workloads with Intel® AMX`_.
Contributor

we can directly copy the wording from

Compared to 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Neural Network Instructions (Intel® AVX-512 VNNI), 4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle, rather than 256 INT8 operations per cycle. They can also perform 1,024 BF16 operations per cycle, as compared to 64 FP32 operations per cycle.

which is a quote from https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html

Contributor Author

Quoted this.

``conv1d``,
``conv2d``,
``conv3d``,
``conv1d``,
Contributor

Why do we have 2 sets of conv1d, conv2d, conv3d here?

Contributor Author

Fixed typos.

``addbmm``,
``linear``,
``matmul``,
``_convolution``
Contributor

_convolution is not intended to be used directly; it starts with a _.

Contributor Author

Removed _convolution.

@jgong5 (Contributor) left a comment

Please add a "summary" or "conclusion" section to summarize the document.

@CaoE (Contributor Author) commented Jun 8, 2023

Please add a "summary" or "conclusion" section to summarize the document.

Added conclusion section.

@CaoE CaoE marked this pull request as ready for review June 8, 2023 15:11
@CaoE (Contributor Author) commented Jun 9, 2023

@msaroufim Could you please review this doc? Thanks.

@CaoE (Contributor Author) commented Jun 9, 2023

@ngimel Could you please review this doc? Thank you.

@CaoE (Contributor Author) commented Jun 10, 2023

@msaroufim Could you please review this doc? Thank you.

@CaoE (Contributor Author) commented Jun 10, 2023

@kit1980 Could you please review this doc? Thank you.

@svekars (Contributor) left a comment

A couple of editorial fixes for proper HTML rendering.

to get higher performance out of the box on x86 CPUs with AMX support.
For more detailed information about oneDNN, see `oneDNN`_.

The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
Contributor

Suggested change
The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
The operation is fully handled by oneDNN according to the execution code path generated. For example, when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.

Contributor Author

Thanks for your comments.
Fixed.

with torch.cpu.amp.autocast():
    output = model(input)

Note: Use channels last format to get better performance.
Contributor

Suggested change
Note: Use channels last format to get better performance.
.. note:: Use channels' last format to get better performance.

Contributor Author

Fixed


When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.

Note: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.
Contributor

Suggested change
Note: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.
.. note:: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.

Contributor Author

Fixed

Comment on lines 61 to 75
- BF16 CPU ops that can leverage AMX:

``conv1d``,
``conv2d``,
``conv3d``,
``conv_transpose1d``,
``conv_transpose2d``,
``conv_transpose3d``,
``bmm``,
``mm``,
``baddbmm``,
``addmm``,
``addbmm``,
``linear``,
``matmul``,
Contributor

Suggested change
- BF16 CPU ops that can leverage AMX:
``conv1d``,
``conv2d``,
``conv3d``,
``conv_transpose1d``,
``conv_transpose2d``,
``conv_transpose3d``,
``bmm``,
``mm``,
``baddbmm``,
``addmm``,
``addbmm``,
``linear``,
``matmul``,
BF16 CPU ops that can leverage AMX:
- ``conv1d``
- ``conv2d``
- ``conv3d``
- ``conv_transpose1d``
- ``conv_transpose2d``
- ``conv_transpose3d``
- ``bmm``
- ``mm``
- ``baddbmm``
- ``addmm``
- ``addbmm``
- ``linear``
- ``matmul``

Contributor Author

Fixed

``linear``,
``matmul``,

- Quantization CPU ops that can leverage AMX:
Contributor

Suggested change
- Quantization CPU ops that can leverage AMX:
Quantization CPU ops that can leverage AMX:

Contributor Author

Fixed

Confirm AMX is being utilized
------------------------------

Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.
Contributor

Suggested change
Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.
Set environment variable to ``export ONEDNN_VERBOSE=1`` or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.

Contributor

I think "to" should not be added here, because the specific environment variable we want to use here is ONEDNN_VERBOSE, whose value we set to 1.

Contributor Author

Thanks for your comments. I will keep the original version for this sentence.

recipes_source/amx.rst
Comment on lines 31 to 39
- BFloat16 data type:

Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.

::

model = model.to(memory_format=torch.channels_last)
with torch.cpu.amp.autocast():
    output = model(input)
Contributor

Suggested change
- BFloat16 data type:
Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
::
model = model.to(memory_format=torch.channels_last)
with torch.cpu.amp.autocast():
    output = model(input)
- BFloat16 data type:
- Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
::
model = model.to(memory_format=torch.channels_last)
with torch.cpu.amp.autocast():
    output = model(input)

Contributor Author

Fixed
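
For completeness, a self-contained variant of that snippet (assuming ``torchvision`` is installed; any model and input would do):

::

    import torch
    import torchvision.models as models

    model = models.resnet50(weights=None).eval()
    x = torch.randn(1, 3, 224, 224)

    # channels_last usually helps convolution-heavy models on CPU.
    model = model.to(memory_format=torch.channels_last)
    x = x.to(memory_format=torch.channels_last)

    with torch.no_grad(), torch.cpu.amp.autocast():
        out = model(x)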


- Quantization:

Applying quantization would utilize AMX acceleration for supported operators.
Contributor

Suggested change
Applying quantization would utilize AMX acceleration for supported operators.
- Applying quantization would utilize AMX acceleration for supported operators.

Contributor Author

Fixed
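
As an illustrative sketch only: dynamic quantization is one of several quantization workflows, and whether AMX kernels are actually used depends on the chosen quantization backend (see the note about the x86 and onednn backends above) and on the hardware:

::

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).eval()

    # Quantize the Linear layers to int8 with dynamic quantization; the kernels
    # used at runtime depend on torch.backends.quantized.engine.
    qmodel = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    out = qmodel(torch.randn(8, 256))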


- torch.compile:

When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
Contributor

Suggested change
When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
- When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.

Contributor Author

Fixed
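
A minimal hedged sketch of that flow, assuming PyTorch 2.x where ``torch.compile`` is available:

::

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
    x = torch.randn(16, 64)

    compiled = torch.compile(model)

    # Combine with bfloat16 autocast so supported ops can reach oneDNN
    # implementations that may use AMX BF16 kernels on capable CPUs.
    with torch.no_grad(), torch.cpu.amp.autocast():
        out = compiled(x)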

@svekars merged commit f87d5aa into pytorch:main Jun 13, 2023
11 of 12 checks passed
Labels: advanced, cla signed, docathon-h1-2023, intel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

💡 [REQUEST] - Write a tutorial about how to leverage AMX with PyTorch on the 4th Gen of Xeon
8 participants