
Add dataloader pre-trained embeddings support to Merlin Models #1083

Merged
16 commits merged into main on May 17, 2023

Conversation

@gabrielspmoreira (Member) commented May 4, 2023

Fixes #1070, Fixes #1071, Fixes #1068, Fixes #1072, Fixes #1073


Implementation Details 🚧

  • Creates the PretrainedEmbeddings block, which takes pre-trained embedding features, optionally projects them to a target dim with a linear layer, applies a sequence aggregator, and normalizes (e.g., with L2-norm). All of these options are configurable (a usage sketch follows this list).
  • InputBlockV2 was changed to accept an optional pretrained_embeddings argument, which by default selects the features tagged with the EMBEDDING tag.
  • PrepareListFeatures was changed to define the shape of the last dim of the pre-trained embeddings provided by the Loader, as that dim is None in graph mode.
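A minimal usage sketch, assuming a schema whose pre-trained embedding column is tagged with Tags.EMBEDDING (the InputBlockV2 default). The parameter names follow this PR's description and docstring excerpts below, so treat the exact signature as illustrative rather than authoritative:

    import merlin.models.tf as mm
    from merlin.schema import Tags

    def build_input_block(schema):
        pretrained = mm.PretrainedEmbeddings(
            schema.select_by_tag(Tags.EMBEDDING),
            output_dims=32,            # optional linear projection to dim 32
            sequence_combiner="mean",  # aggregate over the sequence dimension
            normalizer="l2-norm",      # normalize the (projected) embeddings
        )
        # InputBlockV2 now accepts the pre-trained embeddings alongside
        # the regular (trainable) embedding tables.
        return mm.InputBlockV2(schema, pretrained_embeddings=pretrained)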

Testing Details 🔍

  • Created many tests demonstrating how pre-trained embeddings can be used with models like DLRM and DCN, and with sequential Transformer models via BroadcastFeatures and the causal and masked language modeling SequenceMasking classes.
  • I also took this opportunity to speed up many tests by drastically reducing the cardinality of categorical features in some dataset schemas used for synthetic data generation. Many tests had to be updated to match the new cardinalities.

@gabrielspmoreira gabrielspmoreira self-assigned this May 4, 2023
@gabrielspmoreira gabrielspmoreira added the enhancement New feature or request label May 4, 2023
@gabrielspmoreira gabrielspmoreira added this to the Merlin 22.05 milestone May 4, 2023
@gabrielspmoreira gabrielspmoreira marked this pull request as draft May 4, 2023 03:36
github-actions bot commented May 4, 2023

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1083

@sararb (Contributor) left a comment

Thank you for the PR @gabrielspmoreira! This looks good to me; I've just left some minor comments.

@@ -96,7 +96,12 @@ def MLPBlock(

    for idx, dim in enumerate(dimensions):
        dropout_layer = None
        activation_idx = activation if isinstance(activation, str) else activation[idx]
        activation = activation or "linear"
Contributor:
I understand that if activation is None, we will use "linear" by default. If so, can we set the default value of activation in the method's args to None?
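A hypothetical sketch of that suggestion, with activation defaulting to None in the signature and resolved to "linear" inside the function:

    from typing import List, Optional, Union

    def MLPBlock(dimensions: List[int], activation: Optional[Union[str, List[str]]] = None):
        activation = activation or "linear"  # None -> "linear", as the PR does
        for idx, dim in enumerate(dimensions):
            # A single str applies to every layer; a list gives one activation per layer
            activation_idx = activation if isinstance(activation, str) else activation[idx]
            ...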

aggregation: Optional[TabularAggregationType], optional
    Transformation block to apply for aggregating the inputs, by default None
block_name: str, optional
    Name of the block, by default "embeddings"
Contributor:

Suggested change:
- Name of the block, by default "embeddings"
+ Name of the block, by default "pretrained_embeddings"

@@ -1238,6 +1335,8 @@ def process_str_sequence_combiner(
        combiner = tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1))
    elif combiner == "sum":
        combiner = tf.keras.layers.Lambda(lambda x: tf.reduce_sum(x, axis=1))
    elif combiner == "max":
        combiner = tf.keras.layers.Lambda(lambda x: tf.reduce_max(x, axis=1))
    else:
        raise ValueError(
            "Only 'mean' and 'sum' str combiners is implemented for dense"
Contributor:

Suggested change:
- "Only 'mean' and 'sum' str combiners is implemented for dense"
+ "Only 'mean', 'sum', and 'max' str combiners is implemented for dense"

    else:
-       inputs = tf.linalg.l2_normalize(inputs, axis=axis)
+       inputs = self._l2_norm(inputs)
Contributor:
We might need to specify the axis parameter in this line.

) -> Union[tf.Tensor, tf.SparseTensor, tf.RaggedTensor]:
    """Computes L2-norm for a given axis, typically axis = -1.
    Equivalent to tf.linalg.l2_normalize(), but that function
    does not support tf.RaggedTensor
Contributor:
Good to know, thanks for the fix!
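A minimal sketch (not necessarily the PR's exact implementation) of an L2-norm that also covers tf.RaggedTensor, valid when normalizing over the innermost, uniform axis:

    import tensorflow as tf

    def l2_norm(inputs, axis=-1):
        if isinstance(inputs, tf.RaggedTensor):
            # For a ragged [batch, None, dim] tensor, flat_values is a dense
            # [total_rows, dim] tensor, so axis=-1 normalization applies row by row.
            return tf.ragged.map_flat_values(tf.linalg.l2_normalize, inputs, axis=axis)
        return tf.linalg.l2_normalize(inputs, axis=axis)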

    for col in schema:
        table_name = col.name

        tables[table_name] = NoOp()
Contributor:
Should this be configurable to allow users to do things like normalization?
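One hypothetical way to make it configurable (build_tables and custom_blocks are illustrative names, not the PR's API):

    import tensorflow as tf

    class NoOp(tf.keras.layers.Layer):
        # Identity block, standing in for the NoOp used in the snippet above
        def call(self, inputs):
            return inputs

    def build_tables(schema, custom_blocks=None):
        custom_blocks = custom_blocks or {}
        # Fall back to the identity when no custom block is supplied for a column
        return {col.name: custom_blocks.get(col.name, NoOp()) for col in schema}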

@gabrielspmoreira gabrielspmoreira merged commit 8efbd36 into main May 17, 2023