
[FEA] Add options to reserve categorical indices in the Categorify() op #1074

Closed
gabrielspmoreira opened this issue Aug 27, 2021 · 8 comments · Fixed by #1093

@gabrielspmoreira
Member

Is your feature request related to a problem? Please describe.
Some pipelines trust that categorical features will have some reserved values after preprocessing.
For example, for sequential features (e.g. session-based recommendation, time series, text data) it is important to reserve a value for padding (e.g. 0) in the model side, as not all sequences might have the same length.
It is also important to reserve a value (e.g. 1) for mapping Out-Of-Vocabulary (OOV) categorical values that might appear during transform() (e.g. during evaluation or inference).

Describe the solution you'd like
Create one argument for the Categorify() op to set the desired oov_index (to which all OOV values will be mapped during preprocessing) and another argument, start_index, to set the first value used to encode known values (e.g. 2). We could also have a similar option for null values (null_index), so that they are mapped to a known index.
In this case, if we call Categorify(oov_index=1, null_index=2, start_index=3), index 0 could be safely used for padding, for example.
I don't think we need a padding_index, because padding is specific to sequential features. It is better to give users the flexibility to reserve a number of categorical values by using start_index.
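As a rough sketch of the proposed semantics in plain Python (the oov_index, null_index, and start_index arguments here are the proposal under discussion, not the shipped Categorify API):

```python
def build_vocab(values, start_index=3):
    """Map each unique non-null value to an index, starting at start_index."""
    vocab = {}
    for v in values:
        if v is not None and v not in vocab:
            vocab[v] = start_index + len(vocab)
    return vocab

def encode(values, vocab, oov_index=1, null_index=2):
    """Encode values, reserving fixed indices for OOV and null entries.

    With oov_index=1, null_index=2, start_index=3, index 0 stays free
    for padding on the model side.
    """
    out = []
    for v in values:
        if v is None:
            out.append(null_index)
        else:
            out.append(vocab.get(v, oov_index))
    return out

vocab = build_vocab(["a", "b", "a"])        # {"a": 3, "b": 4}
encoded = encode(["a", None, "c"], vocab)   # [3, 2, 1] -- "c" is OOV
```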

@rnyak
Contributor

rnyak commented Aug 27, 2021

Related to NVIDIA-Merlin/Transformers4Rec#160

@EvenOldridge
Member

@gabrielspmoreira Do we really need that level of complexity? Can't we just specify the offset? Or does Categorify work differently if you specify oov, null, and start_index ranges?

Also, do we want to handle the case where people specify ridiculous values? What if I say oov = 100000000?

I think a simple reserved offset is more straightforward and less error prone.
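A minimal illustration of the offset-only alternative (plain Python; the function name and offset parameter are assumptions for discussion, not the actual implementation):

```python
def categorify_with_offset(values, offset=0):
    """Encode uniques in order of first appearance, shifted by a fixed offset.

    All indices in [0, offset) are left free for the user, e.g. with
    offset=3 one can reserve 0 for padding, 1 for OOV, 2 for nulls
    by convention on the model side.
    """
    vocab = {}
    out = []
    for v in values:
        if v not in vocab:
            vocab[v] = offset + len(vocab)
        out.append(vocab[v])
    return out

codes = categorify_with_offset(["x", "y", "x"], offset=3)  # [3, 4, 3]
```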

@gabrielspmoreira
Member Author

@EvenOldridge I agree we can simplify the solution.
With the offset (start_index) we solve the issue of reserving 0 for padding.
And we could reserve the first two indices after start_index / offset for null and OOV during transform() (inference).
What do you think @benfred ?

@benfred
Member

benfred commented Aug 31, 2021

I think we should simplify, and just specify a start_index with a default value of 0.

The implementation of Categorify has gotten pretty complex - so I thought it might help to put some pointers on where to get started with this:

@viswa-nvidia viswa-nvidia added this to the NVTabular-v21.09 milestone Aug 31, 2021
@lesnikow lesnikow self-assigned this Sep 1, 2021
@lesnikow
Contributor

lesnikow commented Sep 2, 2021

I assigned this issue to myself; following CONTRIBUTING.md, and to make it explicit for tracking who is working on what: I am currently working on this issue.

@lesnikow
Contributor

lesnikow commented Sep 2, 2021

@benfred @rjzamora Would one of you have some time today to meet with me about how to implement this additional argument? I have read over the current implementation of the class, but I do not fully understand what the particular methods Ben outlined are supposed to do. I tagged you two since it looks like you have spent the most time working on this module. I would appreciate it.

@lesnikow
Contributor

lesnikow commented Sep 2, 2021

@gabrielspmoreira, or someone else: would you have a test case available for the desired functionality? I found this test in tests/unit/ops/test_ops.py that exercises Categorify and looks like a helpful template.

def test_categorify_lists(tmpdir, freq_threshold, cpu, dtype, vocabs):
    df = dispatch._make_df(
        {
            "Authors": [["User_A"], ["User_A", "User_E"], ["User_B", "User_C"], ["User_C"]],
            "Engaging User": ["User_B", "User_B", "User_A", "User_D"],
            "Post": [1, 2, 3, 4],
        }
    ) 

    cat_names = ["Authors", "Engaging User"]
    label_name = ["Post"]

    cat_features = cat_names >> ops.Categorify(
        out_path=str(tmpdir), freq_threshold=freq_threshold, dtype=dtype, vocabs=vocabs
    )

    workflow = nvt.Workflow(cat_features + label_name)
    df_out = workflow.fit_transform(nvt.Dataset(df, cpu=cpu)).to_ddf().compute()

    # Columns are encoded independently
    if cpu:
        assert df_out["Authors"][0].dtype == (np.dtype(dtype) if dtype else np.dtype("int64"))
        compare = [list(row) for row in df_out["Authors"].tolist()]
    else:
        assert df_out["Authors"].dtype == cudf.core.dtypes.ListDtype(dtype if dtype else "int64")
        compare = df_out["Authors"].to_arrow().to_pylist()

    if freq_threshold < 2 or vocabs is not None:
        assert compare == [[1], [1, 4], [3, 2], [2]]
    else:
        assert compare == [[1], [1, 0], [0, 2], [2]]

I would like to implement and test this feature, but I am not sure how to write a test for the desired addition of a new start_index argument to Categorify.
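One way to structure such a test is to assert the key property of start_index against a small reference encoder, independent of the real op (the reference encoder and the property below are an illustrative sketch, not NVTabular code):

```python
def reference_encode(values, start_index=0):
    """Reference: uniques in first-appearance order, offset by start_index."""
    vocab = {}
    return [vocab.setdefault(v, start_index + len(vocab)) for v in values]

def test_start_index_offsets_all_codes():
    # The key property: changing start_index shifts every code uniformly,
    # so all indices below start_index stay free for padding/OOV/nulls.
    base = reference_encode(["a", "b", "a", "c"], start_index=0)
    for start_index in (0, 1, 2, 16):
        shifted = reference_encode(["a", "b", "a", "c"], start_index=start_index)
        assert shifted == [code + start_index for code in base]
        assert min(shifted) == start_index

test_start_index_offsets_all_codes()
```

The same property could then be checked against the real Categorify output by parametrizing the existing test over several start_index values.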

lesnikow pushed a commit that referenced this issue Sep 2, 2021
Add tests for issue #1074 on adding a start_index argument to
the Categorify op. The motivation for this issue is to allow an
offset for encoding out-of-vocabulary and other special values.

In this commit, we add a test function exercising various values of our
new start_index arg in tests/unit/ops/text_ops.py, add a
start_index arg to the class signature of Categorify(),
document its intended function in Categorify()'s
docstring, add a start_index arg to FitOptions(), and document
this new argument in FitOptions()'s docstring.
@lesnikow
Contributor

lesnikow commented Sep 2, 2021

@gabrielspmoreira I did not see how to add you as a reviewer on my draft PR, but when you have a chance, could you check whether the implemented tests cover your intended functionality? This draft PR consists of commit 642e5f5.

lesnikow pushed a commit that referenced this issue Sep 3, 2021
This commit implements the feature in issue #1074. This issue
asks to add an argument start_index to Categorify to give an offset
for translating vocabulary items to categorical values.

We update nvtabular/ops/categorify.py to add a start_index arg in the
implementation of Categorify(). This update touches the categorify.py module
in various places. We also add docstrings to the _encode() and
_write_uniques() methods for improved readability in categorify.py.

We also update the test_categorify_lists_with_start_index() test
method in tests/unit/test_ops.py to test various start_index
values.
lesnikow pushed a commit that referenced this issue Sep 11, 2021
Change start_index's default value to 0 from 1. The motivation
for this is to have more Pythonic code, and for increased consistency
with the rest of the updated modules. This implements the suggestion
made by @benfred towards merging the implementation of
issue #1074's feature request.

This change includes modifying what Categorify()'s _encode()
method does, default arg value updates in some other methods,
docstring updating, and updating Categorify()'s tests that
test start_index's behavior.
lesnikow pushed a commit that referenced this issue Sep 13, 2021
Adjust _get_embeddings_dask() to add start_index in output embedding.
This is towards implementing issue #1074.
lesnikow pushed a commit that referenced this issue Sep 13, 2021
Add unit test for start_index and get_embedding_size() for
the change made in commit fa172a.

This is for implementing the feature request of issue #1074.