Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up the way shapes are computed and specified #1760

Merged
merged 12 commits into from
Feb 15, 2023

Conversation

karlhigley
Copy link
Contributor

@karlhigley karlhigley commented Feb 13, 2023

This replaces shapes specified implicitly via the is_list/is_ragged flag or via the value_count property with more explicitly computed and specified shapes.

Depends on NVIDIA-Merlin/core#215

@@ -18,6 +18,7 @@
from dask.dataframe.utils import meta_nonempty

from merlin.core.dispatch import DataFrameType, annotate
from merlin.dtypes.shape import DefaultShapes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be a new feature of dtypes in core?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shapes are implemented as a subfield of dtypes in Core, but this isn't really intended to be a feature of dtypes in particular. We might want to hide that implementation detail a bit more thoroughly by adjusting the imports.

As far as the defaults go, we just thought it was easier to read and understand than needing to remember which shapes mean what.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DefaultShapes.LIST object seems like it could be useful, but the tests are failing to find this in core ImportError: cannot import name 'DefaultShapes' from 'merlin.dtypes.shape

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expected since there is (or was) an outstanding Core PR that adds it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's still open: NVIDIA-Merlin/core#215

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged now

@@ -129,7 +129,7 @@ def transform(self, col_selector: ColumnSelector, df: DataFrameType) -> DataFram

def _compute_dtype(self, col_schema, input_schema):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method required to be overriden with this change? or could it be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be removed, but that breaks some of the tests even though the functionality works without it. (There are actually two places like this.)

sparse_names=None,
sparse_max=None,
sparse_as_dense=False,
padded_cols=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renaming these arguments could be out-of-scope for this PR? Since it may be a breaking change for something that uses this function. It may be clearer to separate this into a different PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We couldn't really figure out what this function was supposed to do without the renames, so we went for it. The function name starts with an _ so we've appropriately signaled that external code shouldn't depend on its stability. The two places in the NVTabular that use it call it via argument order instead of specifying the names, so I understand the caution but I think we're okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it will be a safe change, and the names make things clearer :). We may consider removing this functon along with the other loader code to follow-up on the promise here in one of the upcoming releases

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And by that point we'll hopefully have updated the dataloader API to a point where this augment_schema function is no longer required either here or the copies in Transformers4Rec and in Merlin Models. The existence of this function suggests to me that there's something missing from the dataloader API as it currently exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the something that's missing is "transforms implemented as operators that provide schema tracking" 😺

agg_is_lists = {"list": True}

agg = self._find_agg(col_schema, input_schema)
is_list = agg_is_lists.get(agg, col_schema.is_list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where we have fallback to the second argument of the .get here and check the col_schema.is_list attribute. If we have a fixed list, would we like to preserve that info in the shape later on instead of turning in into a ragged list which is presumably default shape corresponding to DefaultShapes.LIST

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure. Are there GroupBy aggregations that don't change the shape of list columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From looking at the list of aggregations, I think everything changes the shape, either from list to scalar or from scalar to list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the default there should actually be False 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice it seems that it may not matter whether it's col_schema.is_list or False. I tried running Groupby with a an agg that is not "list" on a dataframe with list feautres and we get errors in both cudf and pandas

from merlin.io import Dataset
import nvtabular as nvt
import cudf

df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
workflow = nvt.Workflow(["a", "b"] >> nvt.ops.Groupby(groupby_cols=["a"], aggs=["sum"]))
workflow.fit_transform(Dataset(df)).compute()
# => Raises DataError: All requested aggregations are unsupported.

Some of these aggs, like sum or mean could in theory work on list features if we wanted them to.

Pandas for example, handles sum across lists as concatenation.

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
df.groupby("a").sum()
# =>
          b
a          
1  [10, 20]
2      [20]

or if numpy arrays, then as an element-wise sum

df = pd.DataFrame({"a": [1, 1, 2], "b": [np.array([10]), np.array([20]), np.array([20])]})
df.groupby("a").sum()
# =>
      b
a      
1  [30]
2  [20]

var/std/mean/median also work and in this example return scalars. If the element type contained an array of more then 1 dimension, then mean could start returning a list type too

since cudf doesn't appear to support aggregating across list columns this is probably something we don't need to be concerned about for now.

import cudf
df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
df.groupby("a").sum()
# => Raises DataError: All requested aggregations are unsupported.

df["b"].sum()
# => Raises TypeError: cannot perform sum with type list

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think False could be fine as the default here, and as far as this PR goes, it's no less clear than it was before.

If we need groupby aggregations for list columns as a future feature of NVTabular this will need to be revisited. I suppose even if cudf and pandas don't natively support this we could implement this ourselves by extracting the cupy/numpy arrays from the series in our own agg function to handle list column aggregations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently it seems that the only agg that is supported for list columns is list which will result in adding one additional dimension to the shape of the original list

from merlin.io import Dataset
import nvtabular as nvt
import cudf

df = cudf.DataFrame({"a": [1, 1, 2], "b": [[10], [20], [20]]})
workflow = nvt.Workflow(["a", "b"] >> nvt.ops.Groupby(groupby_cols=["a"], aggs=["list"]))
workflow.fit_transform(Dataset(df)).compute()
# =>
   a        b_list
0  1  [[10], [20]]
1  2        [[20]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great to know; I really appreciate your thoroughness in testing this out. This probably warrants a further update to the shapes here, I'll open a separate issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked here: #1763

@karlhigley karlhigley modified the milestones: Merlin 23.03, Merlin 23.02 Feb 14, 2023
properties["value_count"] = {"max": sparse_max[col]}
if sparse_as_dense:
properties["value_count"]["min"] = properties["value_count"]["max"]
dims = Shape(((1, batch_size), None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the batch_size required for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required for anything specific, just following the principle that the schema should always accurately reflect the data to the greatest extent possible. Here we have shape information since we know the batch size, so we fill it in in case that helps something downstream. I don't know if it actually will, but it seemed like the right thing to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants