Minimal Build for On-Device Training #16326

baijumeswani · 2023-06-12T17:33:56Z

🛠️ Changes in this pull request:

This pull request introduces two significant changes to the project:

Changing the on-device training checkpoint format: The current implementation stores the on-device training checkpoint as a sequence of tensors in multiple files inside a checkpoint folder, which can be inefficient in terms of storage and performance. In this PR, I have modified the checkpoint format to use a flatbuffer table and save the checkpoint to a single file, providing a more compact and efficient representation. The changes around this are twofold:
- Add the checkpoint flatbuffer schema from which the necessary checkpoint source files are generated.
- Update the checkpoint saving and loading functionality to use the new format.
Adding support for the onnxruntime minimal build: To support scenarios where binary size is a constraint, I made changes to ensure that the training build works with the minimal build.
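
To illustrate the single-file idea, here is a minimal sketch of packing several named parameters into one binary stream and reading them back. It does not use FlatBuffers or the schema added in this PR: the layout is a hand-rolled stand-in built with Python's `struct` module, and `MAGIC`, `save_checkpoint`, `load_checkpoint`, and the field order are all hypothetical names and choices made for illustration. The `requires_grad` flag appears in the review discussion as part of the schema; how it is stored here is an assumption.

```python
import io
import struct

MAGIC = b"CKPT"  # hypothetical file identifier, not the one ONNX Runtime uses


def save_checkpoint(params, fileobj):
    """Write every parameter into a single binary stream.

    params maps a name to a (values, requires_grad) pair, where values is a
    list of floats stored as little-endian float32.
    """
    fileobj.write(MAGIC)
    fileobj.write(struct.pack("<I", len(params)))
    for name, (values, requires_grad) in params.items():
        encoded_name = name.encode("utf-8")
        fileobj.write(struct.pack("<I", len(encoded_name)))
        fileobj.write(encoded_name)
        # One bool for requires_grad, then the element count.
        fileobj.write(struct.pack("<?I", requires_grad, len(values)))
        fileobj.write(struct.pack(f"<{len(values)}f", *values))


def load_checkpoint(fileobj):
    """Read back what save_checkpoint wrote and return the same mapping."""
    if fileobj.read(4) != MAGIC:
        raise ValueError("not a checkpoint stream in the sketched format")
    (count,) = struct.unpack("<I", fileobj.read(4))
    params = {}
    for _ in range(count):
        (name_length,) = struct.unpack("<I", fileobj.read(4))
        name = fileobj.read(name_length).decode("utf-8")
        requires_grad, num_values = struct.unpack("<?I", fileobj.read(5))
        values = list(struct.unpack(f"<{num_values}f", fileobj.read(4 * num_values)))
        params[name] = (values, requires_grad)
    return params


# Round trip through an in-memory buffer; a real caller would open one file
# in binary mode instead of writing one file per tensor into a folder.
buffer = io.BytesIO()
save_checkpoint(
    {"fc1.weight": ([0.5, -1.25, 2.0], True), "fc1.bias": ([0.0], False)},
    buffer,
)
buffer.seek(0)
restored = load_checkpoint(buffer)
```

The point of the sketch is only that one self-describing file replaces a folder of per-tensor files. A flatbuffer table additionally gives schema-generated accessors and lets the loader read fields without a separate parsing pass, which this stand-in does not attempt.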

🔍 Open Issues:

- To extract the optimizer type, the existing implementation reloaded the onnx optimizer model and parsed it. This is no longer possible, since the model format can now be either onnx or ort. One idea is to do the same for the ort format optimizer model. This needs some investigation.
- Change the offline tooling to generate ort format training artifacts.
- Add an end-to-end training example showcasing the use of the minimal training build.
- Add support for exporting the model for inferencing in a minimal build.

onnxruntime/core/flatbuffers/schema/ort_training.fbs.h

onnxruntime/core/flatbuffers/schema/ort.fbs.h

onnxruntime/core/flatbuffers/schema/compile_schema.py

onnxruntime/core/graph/graph_flatbuffers_utils.cc

orttraining/orttraining/training_api/checkpoint.cc

onnxruntime/core/flatbuffers/schema/ort_training.fbs

onnxruntime/core/flatbuffers/schema/README.md

onnxruntime/core/graph/graph_flatbuffers_utils.cc

pengwa

Not very familiar with the flatbuffer format, so I just have a few general comments.

onnxruntime/core/flatbuffers/schema/README.md

onnxruntime/core/flatbuffers/checkpoint_version.h

orttraining/orttraining/training_api/checkpoint.cc

orttraining/orttraining/training_api/checkpoint_property.h

orttraining/orttraining/training_api/checkpoint.cc

orttraining/orttraining/core/optimizer/graph_transformer_utils.cc

onnxruntime/core/flatbuffers/schema/README.md

pengwa

Minors.

orttraining/orttraining/training_api/checkpoint.h

onnxruntime/core/flatbuffers/schema/README.md

onnxruntime/core/flatbuffers/schema/compile_schema.py

onnxruntime/core/graph/graph_flatbuffers_utils.cc

orttraining/orttraining/training_api/checkpoint.cc

.lintrunner.toml

orttraining/orttraining/training_api/checkpoint.cc

pengwa

I suggest renaming "requires_grad" to "requires_grad_params", which describes the field better.

Since it is part of the schema, if we want to do that, we should probably do it now rather than bump the version for it later. Any thoughts?

Besides that, LGTM.

orttraining/orttraining/training_api/checkpoint.cc

edgchen1

a few comments, looks good overall

onnxruntime/core/flatbuffers/schema/README.md

orttraining/orttraining/training_api/checkpoint.cc

orttraining/orttraining/training_api/checkpoint.cc

baijumeswani · 2023-06-22T20:25:07Z

Thank you for the valuable feedback @edgchen1 @pengwa @skottmckay @askhade 😄

baijumeswani added 3 commits June 12, 2023 17:31

Flatbuffer schema for the training checkpoint

ada56a0

Load and Save to a flatbuffer checkpoint file instead of to a folder with tensor proto sequence files

1a023ce

Support minimal build for training

871cdf8

baijumeswani added the training label (issues related to ONNX Runtime training; typically submitted using template) Jun 12, 2023

baijumeswani requested review from skottmckay, askhade, pengwa and edgchen1 June 12, 2023 17:33

github-advanced-security bot found potential problems Jun 12, 2023

onnxruntime/core/flatbuffers/schema/ort_training.fbs.h Fixed

baijumeswani commented Jun 12, 2023

onnxruntime/core/flatbuffers/schema/ort.fbs.h Resolved

baijumeswani added 2 commits June 12, 2023 17:47

Disable lintrunner on the training auto generated flatbuffer header

9e5228b

Address lint issue

bdd0df3

baijumeswani changed the title ~~Baijumeswani/training minimal build~~ Minimal Build for On-Device Training Jun 12, 2023

filesystem::exists does not work on macOS < 10.15. Remove it for now

1cd5a76

baijumeswani closed this Jun 12, 2023

baijumeswani reopened this Jun 12, 2023

Java test pass in a path to a file instead of to a directory

8d1e263

skottmckay reviewed Jun 13, 2023

baijumeswani added 4 commits June 13, 2023 15:29

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

28a30d9

Address pull request review comments

233a757

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

9c2b81a

Remove unnecessary namespace from flatbuffer schema

f3ddea4

baijumeswani requested a review from a team as a code owner June 13, 2023 22:32

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

fa828ed

baijumeswani mentioned this pull request Jun 14, 2023

Add ability to create ort format models from training offline utility #16360

Merged

edgchen1 reviewed Jun 15, 2023

baijumeswani added 4 commits June 15, 2023 19:34

Address pull request review comments

2b2eb63

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

4a70bfc

Address pull request review comments

b45f7a9

Update golden numbers

b8c46f9

baijumeswani added 2 commits June 20, 2023 17:09

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

1e54582

remove const

f4241b5

pengwa reviewed Jun 21, 2023

orttraining/orttraining/training_api/checkpoint.h Outdated, resolved

onnxruntime/core/flatbuffers/schema/README.md Outdated, resolved

edgchen1 reviewed Jun 21, 2023

orttraining/orttraining/training_api/checkpoint.cc Resolved

edgchen1 reviewed Jun 21, 2023

.lintrunner.toml Outdated, resolved

baijumeswani added 4 commits June 22, 2023 00:10

Address pull request review comments

10f9933

commit to revert

d3c0004

Address pull request review comments

8a1d072

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

366859e

pengwa reviewed Jun 22, 2023

orttraining/orttraining/training_api/checkpoint.cc Resolved

baijumeswani added 2 commits June 22, 2023 02:15

Fix pipeline failure

ca9aada

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

0c3f791

pengwa reviewed Jun 22, 2023

orttraining/orttraining/training_api/checkpoint.cc Resolved

orttraining/orttraining/training_api/checkpoint.cc Outdated, resolved

edgchen1 previously approved these changes Jun 22, 2023

baijumeswani added 2 commits June 22, 2023 05:13

Address pull request review comments

ae5e1e0

Merge branch 'main' of https://github.com/microsoft/onnxruntime into baijumeswani/training-minimal-build

bc5e18d

baijumeswani dismissed edgchen1’s stale review via bc5e18d June 22, 2023 05:13

pengwa previously approved these changes Jun 22, 2023

github-advanced-security bot found potential problems Jun 22, 2023

orttraining/orttraining/training_api/checkpoint.cc Fixed

orttraining/orttraining/training_api/checkpoint.cc Fixed

Fix pipeline failure

9c54e96

baijumeswani dismissed pengwa’s stale review via 9c54e96 June 22, 2023 14:42

askhade approved these changes Jun 22, 2023

YUNQIUGUO approved these changes Jun 22, 2023

baijumeswani merged commit 10ba1e2 into main Jun 22, 2023
87 of 91 checks passed

baijumeswani deleted the baijumeswani/training-minimal-build branch June 22, 2023 19:27

baijumeswani mentioned this pull request Jun 27, 2023

Fix bug that accidentally disabled training op tests #16488

Merged
