
Enable pickle of model with TensorFlow 2.11 #1040

Merged

Conversation

@oliverholworthy oliverholworthy commented Mar 28, 2023

Supports #1016

Goals ⚽

Enable pickle of model with TensorFlow 2.11
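For context, a minimal usage sketch of what this enables. The schema, dataset, and block layout below are placeholders for illustration, not code from this PR:

```python
import pickle

import merlin.models.tf as mm

# `schema` and `train_ds` are placeholders for a real Merlin schema/dataset.
model = mm.Model(
    mm.InputBlock(schema),
    mm.MLPBlock([64, 32]),
    mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")
model.fit(train_ds, batch_size=128, epochs=1)

# This round-trip is what fails on TensorFlow/Keras 2.11 without this change.
restored = pickle.loads(pickle.dumps(model))
```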

Implementation Details 🚧

  • TensorFlow/Keras 2.11 enabled the v3 saving format, and it now appears to be expected that the from_config method builds the model, so that every layer is instantiated with the variables it needs.
  • Moved the creation of the should_compute_train_metrics_for_batch variable to the __init__ method of Model so that the variable is created correctly when reloading a model.

Testing Details 🔍

  • Existing test test_pickle was failing with TensorFlow 2.11

This is required for model reloading to work correctly. Otherwise there
is a mismatch between the reloaded model and the variables it expects.
@oliverholworthy oliverholworthy added the chore Maintenance for the repository label Mar 28, 2023
@oliverholworthy oliverholworthy added this to the Merlin 23.03 milestone Mar 28, 2023
@oliverholworthy oliverholworthy self-assigned this Mar 28, 2023
@github-actions

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1040

@edknv edknv mentioned this pull request Mar 28, 2023
super(BaseModel, self).__init__(**kwargs)

# Initializing model control flags controlled by MetricsComputeCallback()
self._should_compute_train_metrics_for_batch = tf.Variable(
Member Author

This needs to be moved to the __init__ method (from the compile method) so that this variable is created when reloading the model.
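Roughly the shape of the change described here; this is a sketch only, and the exact tf.Variable arguments are assumed rather than copied from the diff:

```python
import tensorflow as tf


class BaseModel(tf.keras.Model):
    def __init__(self, **kwargs):
        super(BaseModel, self).__init__(**kwargs)
        # Created in __init__ rather than in compile, so the variable already
        # exists when Keras rebuilds the model from its config during reload.
        self._should_compute_train_metrics_for_batch = tf.Variable(
            True, trainable=False, name="should_compute_train_metrics_for_batch"
        )
```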

@@ -1343,6 +1347,9 @@ def fit(
x = _maybe_convert_merlin_dataset(x, batch_size, **kwargs)
self._maybe_set_schema(x)

if hasattr(x, "batch_size"):
self._batch_size = x.batch_size
Member Author

We need to save the batch size used during fit so that when we reload the model we pass it inputs of the same shape. Some tests of the two-tower model required this, for example. This might indicate something to investigate further; I'm not sure the batch size we use when re-loading the model should actually matter.
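For illustration, a rough sketch of how a helper like get_sample_inputs could turn the input schema plus the recorded batch size into dummy inputs. The helper name and the column handling below are simplified assumptions, not the actual implementation:

```python
import tensorflow as tf


def sample_inputs_from_schema(input_schema, batch_size):
    """Build zero-valued sample inputs matching an input schema (sketch)."""
    inputs = {}
    for column in input_schema:
        # Scalar columns get shape (batch_size, 1); list/sequence columns
        # would also need a sequence length (omitted here for brevity).
        inputs[column.name] = tf.zeros((batch_size, 1), dtype=tf.float32)
    return inputs
```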


inputs = model.get_sample_inputs(batch_size=batch_size)
if inputs:
model(inputs)
Member Author

This is the important part of this PR. We're calling the model with some sample data that matches the input schema of the model. This has the side effect of building all the layers (creating all the relevant variables), so that the variables can be reloaded correctly with the new v3 Keras saving_lib.
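Conceptually, the from_config flow being described looks something like the sketch below (simplified; the saving_lib condition discussed later in this thread is omitted, and the stand-in class is only illustrative):

```python
import tensorflow as tf


class Model(tf.keras.Model):  # simplified stand-in for the Merlin Model class
    @classmethod
    def from_config(cls, config, custom_objects=None):
        model = super().from_config(config, custom_objects=custom_objects)

        # Calling the model once on sample data matching its input schema
        # builds every layer (creating all variables), so the new v3 Keras
        # saving_lib can restore weights into variables of the right shape.
        inputs = model.get_sample_inputs(batch_size=getattr(model, "_batch_size", None))
        if inputs:
            model(inputs)

        return model
```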

@@ -63,6 +63,8 @@ def compute_output_shape(self, input_shape):
col_schema_shape = self.schema[name].shape
if col_schema_shape.is_list:
max_seq_length = col_schema_shape.dims[1].max
if max_seq_length is not None:
max_seq_length = int(max_seq_length)
Member Author

When serializing/deserializing the schema, we get back a float value for the shape dimension. If/when we fix that in core, this line can be removed.

output = mm.ContrastiveOutput(
DotProduct(),
post=ContrastiveSampleWeight(
pos_class_weight=tf.random.uniform(shape=(1000,)),
neg_class_weight=tf.random.uniform(shape=(1000,)),
pos_class_weight=tf.random.uniform(shape=(item_id_cardinality,)),
Member Author

This test was failing intermittently: if you were unlucky enough to draw an item id equal to the max value of 1000, it was out of range for the hard-coded weight tensors of shape (1000,).

@@ -251,7 +251,7 @@ def test_with_model(self, run_eagerly, music_streaming_data):
layer,
tf.keras.layers.Dense(1),
mm.BinaryClassificationTask("click"),
schema=music_streaming_data.schema,
schema=music_streaming_data.schema.select_by_name("item_recency"),
Member Author

This test started failing after the change to Model.from_config because we were passing the model a schema that was inconsistent with what it actually expected.


@oliverholworthy
Member Author

I've got the tests passing locally now. The GitHub Actions jobs are stuck in a queued state, though.

@karlhigley
Contributor

I think it's just taking time to catch up with all the actions that got queued by previous pushes. You can go into the Actions tab and cancel those to speed it up.

@oliverholworthy
Member Author

> I've got the tests passing locally now.

I said I'd got this working. However, this test in tests/unit/tf/examples/test_06_advanced_own_architecture.py is now failing again. I realised that the notebook test was fixed accidentally, as a result of model.save somehow swallowing errors raised in Model.from_config. After the last 2 commits to correctly create the inputs, the call to model(inputs) no longer raises, which causes the reload to fail later when it finds variables with different shapes.

Comment on lines +1879 to +1881
if self.input_schema is not None:
inputs = {}
for column in self.input_schema:
Contributor

Maybe the issue is schema filtering/propagation? If, for example, I change these lines to go through self.schema instead of input_schema, i.e.,

        if self.schema is not None:
            inputs = {}
            for column in self.schema:

and change the notebook to pass sub_schema to the model:

model = mm.Model(deep_dlrm_interaction, binary_task, schema=sub_schema)

I can get all tests to succeed.

Member Author

Unfortunately, the test passing is not equivalent to it running correctly. Using self.schema instead of self.input_schema results in an exception being thrown inside the model's from_config method where we call model(inputs), because we end up passing the target rating_binary as an input, which the model does not accept. That exception is caught somewhere in TensorFlow/Keras, which seems to result in essentially skipping this model call and reverting to the previous mechanism for building the model during load.

Member Author

It turns out that this new v3 saving lib in Keras seems to be used only during pickle, and not for the regular SavedModel export we do with model.save(). I've added a condition to apply this model build in from_config, based on a threadlocal flag that is set in the saving lib.

8e55be1
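For reference, a sketch of the kind of check being described; _SAVING_V3_ENABLED is a private Keras detail, so its exact shape here is an assumption rather than a stable API:

```python
def _is_keras_v3_saving_in_progress():
    """Best-effort check for whether the new Keras v3 saving path is active."""
    try:
        from keras.saving.experimental import saving_lib  # Keras 2.11
    except ImportError:
        return False
    flag = getattr(saving_lib, "_SAVING_V3_ENABLED", None)
    # Assumed to be a threading.local whose `value` attribute is set while
    # saving_lib is serializing/deserializing a model (e.g. during pickle).
    return bool(getattr(flag, "value", False))
```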

Member Author

This means that most models will be pickleable, although we only explicitly test this in one test, test_pickle. However, there is at least one kind of model that isn't pickleable (the one in the "define your own architecture" notebook).

Member Author

This same saving_lib that is being used for pickle is now the main saving lib in 2.12, so I think we'll have some more work to do to get it working reliably for all models.

Contributor

This seems like a good solution. I believe it's the same with upstream Keras models, i.e., most but not all Keras models are picklable, but the recommended way is to use the save() method rather than pickle.

Contributor

It looks like test_pickle is failing after 8e55be1 and a916914 :(

Member Author

Looks like the GPU CI was testing against TensorFlow/Keras 2.12, where the saving_lib moved from keras.saving.experimental.saving_lib to keras.saving.saving_lib. I've added an extra import for this and it seems to work. This feels somewhat brittle, in the sense that it could break easily in future versions when the temporary _SAVING_V3_ENABLED variable is removed, unless we have some way to detect whether the model is being saved with the new Keras format (used for pickle) or with the SavedModel format.
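A minimal sketch of the version-tolerant import described above:

```python
try:
    # Keras 2.11: the v3 saving lib lives under the experimental namespace.
    from keras.saving.experimental import saving_lib
except ImportError:
    # Keras 2.12+: it moved to keras.saving.saving_lib.
    from keras.saving import saving_lib
```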

Member Author

Perhaps we can figure out whether we'd like to try supporting the Keras native format properly. It does seem to have uncovered some bugs in various model layers, so from that perspective it seems worthwhile.

@oliverholworthy oliverholworthy merged commit 210aade into NVIDIA-Merlin:main Mar 31, 2023