Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP #14442
Conversation
…t Tensors. Remove CPU copies in Sequence operators. Register CPU SequenceOps as DirectML ops. Implement ConcatFromSequence on DirectML
Resolved review threads on:
- onnxruntime/core/providers/dml/DmlExecutionProvider/src/ExecutionProvider.cpp
- orttraining/orttraining/training_ops/cpu/optimizer/clip_grad_norm/clip_grad_norm.cc
@yuslepukhin Thank you for the review. I'll just need a final approval if everything looks good.
We also need to re-work the places where …
Enable Opset11 Sequence Ops on DirectML, and make the CPU implementations agnostic to backend EP (#14442)

Opset 11 introduced the following sequence-related operators:
- SequenceAt
- SequenceConstruct
- SequenceEmpty
- SequenceLength
- SequenceErase
- SequenceInsert
- ConcatFromSequence

With the exception of ConcatFromSequence, all of the above operators were implemented with CPU kernels that a) required all of the contained tensors to also be on CPU, and b) would clone each tensor into a new sequence as a side effect of each operator. The implementation of sequences is backend agnostic, since sequences don't affect the actual tensor layout or manipulate the contents of the tensors. In addition, with the exception of SequenceAt, the operators need not make copies of the underlying referenced tensors.

Consequently, this change does the following:
1) Sequence* operators (except SequenceAt) no longer copy the contents of a sequence of tensors on every kernel execution.
2) SequenceAt uses the DataTransferManager to copy tensors in a backend-agnostic way.
3) The internal container implemented by TensorSeq has changed from onnxruntime::Tensor to OrtValue. This is because onnxruntime::Tensor does not support copy or assignment construction, so it must have a single owner. If the same tensor participated in multiple containers, it would have multiple container "owners", which would not be possible.
4) Other code that accessed values from TensorSeq has been changed accordingly to extract Tensors from OrtValues.

In addition, DirectML execution was very slow when the above Sequence operators were added to a graph, because this caused MemcpyToHost and MemcpyFromHost kernels to be inserted between the graph and the sequence operators. To optimize DirectML:
1) The CPU implementations for the Sequence* ops were registered as DML implementations. Since the above changes also make the CPU kernel implementations EP agnostic, the CPU kernels can be added as is.
2) The ConcatFromSequence operator needed to be implemented on DirectML. However, there was little DirectML EP operator framework support for operators that accept/output sequences of tensors. This change modifies the internal COM interfaces with new APIs to interrogate sequence shapes and extract the needed tensors from TensorSeq.

---------
Co-authored-by: Patrice Vignola <vignola.patrice@gmail.com>