Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve GroupBy JIT error handling #13854

Conversation

brandon-b-miller
Copy link
Contributor

This PR implements general error handling for features that are missing dtype overloads in JIT GroupBy. Before, if a feature was not supported for a certain dtype, a somewhat confusing numba error is raised. With this PR however, a clear error is raised, for instance in the case of missing float dtypes for the corr aggregation:

import pandas as pd
import cudf
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2],
    'b': range(6),
    'c': range(6)
})
df['b'] = df['b'].astype('float64')
df['c'] = df['c'].astype('float64')


gdf = cudf.from_pandas(df, nan_as_null=False)
got = gdf.groupby('a').apply(lambda x: x['b'].corr(x['c']), engine='jit')

Now prints

Series.corr is not supported between float64 and float64 within JIT GroupBy apply.
Please file an issue requesting support for this feature at cuDF's GitHub page.

@brandon-b-miller brandon-b-miller added feature request New feature or request numba Numba issue Python Affects Python cuDF API. non-breaking Non-breaking change labels Aug 11, 2023
@brandon-b-miller brandon-b-miller self-assigned this Aug 11, 2023
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner August 11, 2023 17:03
Comment on lines 164 to 165
"apply.\nPlease file an issue "
"requesting support for this feature at cuDF's GitHub page."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend to support all possible inputs that can raise this error? It seems like we're asking for a lot of small issues and some of them might even be "wontfix."

Copy link
Contributor Author

@brandon-b-miller brandon-b-miller Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great question and worth discussion. From my point of view, the API is trying to convince the user as best as possible that their function is simply being run in a loop over groups, as in the iterative approach. In an ideal world then, every function we support in JIT groupby apply would support the same set of dtypes that the real Series.api supports. We have big holes in that right now, for example corr which is int only and the rest of the functions which don't support int16, int8 or any unsigned types yet. Each one of these missing overloads is a missing feature IMO and it will take some time to close the gap.

In raising as such I mainly intend to clarify what is going wrong, however in suggesting the issue I was aiming to create a feedback mechanism that might aid in our prioritization. However maybe as you suggest this is clumsy as we'd be waiting for users to file issues we already know exist.

What would you think of creating an umbrella issue that tracks the gap between supported dtypes and linking to it in the error instead, perhaps suggesting a github reply? We could track everything that's wontfix there. Or, if pointing to github seems unnecessary in its entirety I can remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A single issue would be better! Maybe you can start it and populate it with known gaps. But generally it’s unusual for cudf to point to GitHub issues in other circumstances such as when calling unsupported keyword arguments from pandas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina What are your thoughts here?

Copy link
Contributor

@shwina shwina Aug 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think perhaps pointing a single issue in the error message so folks can provide input as comments is maybe a good compromise? Also it would be a real clickable link in the error message, which is nice. Perhaps something along:

The <func>() function is not supported for the input types <...> by cuDF's JIT engine. To see what's currently supported or request support for a new function, go to https://github.com/rapidsai/cudf/issues/42

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 31, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty close, couple of small things left that are worth doing though.

python/cudf/cudf/core/udf/groupby_typing.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/groupby_typing.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/groupby_typing.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/udf/groupby_typing.py Show resolved Hide resolved
python/cudf/cudf/core/udf/groupby_typing.py Outdated Show resolved Hide resolved
brandon-b-miller and others added 2 commits October 2, 2023 08:37
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
@brandon-b-miller brandon-b-miller changed the base branch from branch-23.10 to branch-23.12 October 2, 2023 14:28
@brandon-b-miller
Copy link
Contributor Author

/ok to test

@vyasr
Copy link
Contributor

vyasr commented Oct 9, 2023

@brandon-b-miller could you fix style here? Then we can finalize.

@brandon-b-miller
Copy link
Contributor Author

/ok to test

@brandon-b-miller
Copy link
Contributor Author

/ok to test

@brandon-b-miller brandon-b-miller changed the base branch from branch-23.12 to branch-24.02 November 30, 2023 19:43
@brandon-b-miller
Copy link
Contributor Author

/ok to test

@brandon-b-miller
Copy link
Contributor Author

/ok to test

@brandon-b-miller
Copy link
Contributor Author

/ok to test

@brandon-b-miller
Copy link
Contributor Author

/ok to test

@brandon-b-miller
Copy link
Contributor Author

I think this is ready to merge - any last thoughts anyone?

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested simplifying the error message text. Otherwise I don't have strong opinions on this PR and we should merge it to unblock further work on migration to Numba 0.58.

python/cudf/cudf/core/udf/groupby_typing.py Outdated Show resolved Hide resolved
@brandon-b-miller
Copy link
Contributor Author

/ok to test

@brandon-b-miller
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 0fa80ec into rapidsai:branch-24.02 Dec 12, 2023
68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change numba Numba issue Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants