Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch agg defaults to numeric_only=None #270

Merged
merged 4 commits into from
Sep 22, 2020

Conversation

V1NAY8
Copy link
Contributor

@V1NAY8 V1NAY8 commented Sep 3, 2020

Closes #254

  • Added numeric_only = True | False | None for all agg and aggs (except nunique)
  • Added logic where booleans are not supported by median_absolute_deviation
  • Added tests, doctests for every change made, also modified existing tests to work with these changes.
  • Nox and pytest sessions are successful.

@elasticmachine
Copy link

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Sep 3, 2020

@sethmlarson Please review this and let me know any changes are required. I am happy to do it.
One query for the below example:
I know we shouldnt return NaN for numeric_only=True, But for mad we have to fill with NaN or exclude the column.

>>>import eland as ed
>>>from eland.tests.common import TestData
>>>funcs = ["max", "min", "mean", "sum", "median", "mad", "var", "std"]
>>>filter_data = ["AvgTicketPrice","Cancelled","dayOfWeek", "timestamp","DestCountry"]
>>>data = TestData()
>>>ed_flights = data.ed_flights().filter(filter_data)
>>>ed_flights.agg(funcs,numeric_only=True)
        AvgTicketPrice    Cancelled     dayOfWeek
max       1.199729e+03     1.000000      6.000000
min       1.000205e+02     0.000000      0.000000
mean      6.282537e+02     0.128494      2.835975
sum       8.204365e+06  1678.000000  37035.000000
median    6.403873e+02     0.000000      3.000000
mad       2.134435e+02          NaN      2.000000
var       7.096457e+04     0.111987      3.761279
std       2.664071e+02     0.334664      1.939513

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this, it's a ton of work and very much appreciated :)

The above comment with NaN being returned for mad on bool is all good, agree with you there.

I also forgot to mention sum in the related issue as it has special behavior as well. sum should always be the same dtype as the input except for bool where the sum of bools should be an int64. I recommend this:

# These aggregations maintain the column datatype
elif pd_agg in {"max", "min", "median", "sum"}:
    # 'sum' isn't representable with bool, use int64
    if pd_agg == "sum" and field.is_bool:
        agg_value = np.int64(agg_value)
    else:
        agg_value = field.np_dtype.type(agg_value)

In general make sure we're following this ruleset:

  • numeric_only=True: Returns all values as float64, NaN/NaT values are removed in single-agg situations
  • numeric_only=None: Returns all values as the same dtype where possible, NaN/NaT are removed in single-agg situations
  • numeric_only=False: Returns all values as the same dtype where possible, NaN/NaT are preserved
  • For data types that are lossy (ie if I throw anything that's not a 0 into bool() it'll return True) make sure we're only converting to bool for lossless aggs like min, max, and median.

eland/dataframe.py Outdated Show resolved Hide resolved
eland/field_mappings.py Outdated Show resolved Hide resolved
eland/operations.py Outdated Show resolved Hide resolved
eland/ndframe.py Outdated Show resolved Hide resolved
eland/ndframe.py Outdated Show resolved Hide resolved
Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes. Here's some comments and questions for you:

.gitattributes Outdated Show resolved Hide resolved
eland/dataframe.py Outdated Show resolved Hide resolved
eland/ndframe.py Outdated Show resolved Hide resolved
eland/operations.py Outdated Show resolved Hide resolved
eland/operations.py Outdated Show resolved Hide resolved
eland/tests/dataframe/test_metrics_pytest.py Show resolved Hide resolved
eland/tests/dataframe/test_metrics_pytest.py Outdated Show resolved Hide resolved
eland/tests/dataframe/test_metrics_pytest.py Outdated Show resolved Hide resolved
eland/tests/dataframe/test_metrics_pytest.py Outdated Show resolved Hide resolved
eland/tests/dataframe/test_metrics_pytest.py Outdated Show resolved Hide resolved
@sethmlarson
Copy link
Contributor

sethmlarson commented Sep 10, 2020

@V1NAY8 Btw I really appreciate everything you're doing here! Recently we announced a program for rewarding community contributors. You should check it out :)

https://www.elastic.co/community/contributor

@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Sep 12, 2020

@sethmlarson
For the following:

elif pd_agg in {"max", "min", "median", "sum"}:
    # 'sum' isn't representable with bool, use int64
    if pd_agg == "sum" and field.is_bool:
        agg_value = np.int64(agg_value)
    else:
        agg_value = field.np_dtype.type(agg_value)

Should this happen for numeric_only=True
or only for False and None?

@sethmlarson
Copy link
Contributor

@V1NAY8 I think this is only needed for numeric_only=None/False cases.

@sethmlarson
Copy link
Contributor

@V1NAY8 Sorry I haven't reviewed this yet, other things are eating up all my time. It is still on my list, thank you for the patience :)

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is my last review! Thanks for all this work :)

eland/ndframe.py Outdated Show resolved Hide resolved
eland/tests/dataframe/test_metrics_pytest.py Show resolved Hide resolved
eland/tests/dataframe/test_metrics_pytest.py Show resolved Hide resolved
@sethmlarson
Copy link
Contributor

jenkins test this please

@sethmlarson
Copy link
Contributor

CI is passing btw :)

@sethmlarson
Copy link
Contributor

jenkins test this please

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@V1NAY8
Copy link
Contributor Author

V1NAY8 commented Sep 22, 2020

Thanks for the Patience and reviewing this. It came back and forth. Ill make my contributions better to reduce multiple reviews. 😄

@sethmlarson
Copy link
Contributor

@V1NAY8 No worries! This was a really big change, it's tough to get things like this through on round one.
Thank you also for your patience with me not getting back to reviews as fast as I wanted to :)

@sethmlarson sethmlarson merged commit 4d96ad3 into elastic:master Sep 22, 2020
@sethmlarson
Copy link
Contributor

Boom! 🎉

@V1NAY8 V1NAY8 deleted the issue/254 branch September 22, 2020 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switch agg defaults to numeric_only=None
3 participants