groupby task should test NAs in grouping columns (distinct result for NA group), and NAs in value columns too #40

Closed
jangorecki opened this issue Oct 5, 2018 · 12 comments

Comments

@jangorecki
Contributor

No description provided.

@jangorecki
Contributor Author

jangorecki commented Nov 4, 2018

This will probably need to wait for pandas-dev/pandas#3729;
otherwise we would need to split the pandas benchmark script into two separate cases, one having NAs and one not.
If pandas doesn't make it for 0.24.0 then we can skip that test for pandas, as we currently skip 1e9 due to lack of memory.

@mattdowle
Contributor

If a solution like pandas can't handle a test, then that is not a good reason to disable the test. It should be shown as a fail. Similarly, for example, data.table should be shown as a fail for 500GB whereas pydatatable and Spark will work.
To be clear: db-bench is not supposed to only test tasks that work in all the products. It's supposed to show which products fail too (whether that's because of size or supported-features as the case may be).

@jangorecki
Contributor Author

They can handle it, but it requires different syntax, so either two test scripts for the same test or an if branch before each query. They are working on it in pandas-dev/pandas#3729, but that issue dates back to 2013. For now we should probably mark it as "not supported with the same syntax, requires extra nafill overhead"; I will do that if it isn't resolved for pandas 0.24.0. I don't want to waste time now adding exceptions if it could be solved soon.
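As an illustration, a minimal sketch of the extra nafill step implied above (toy data and an arbitrary "__NA__" sentinel, not benchmark code): as long as pandas cannot keep NA group keys natively, the grouping column has to be filled first, which is pure overhead compared to solutions that keep the NA group directly.

# Sketch only: toy data; "__NA__" is an arbitrary sentinel value.
import pandas as pd

df = pd.DataFrame({"id1": ["a", "a", None, "b"], "v1": [1, 2, 3, 4]})

# Default pandas behaviour: rows whose grouping key is missing are dropped.
print(df.groupby("id1")["v1"].sum())        # groups 'a' and 'b' only

# Workaround with measurable overhead: fill the key column before grouping.
filled = df.assign(id1=df["id1"].fillna("__NA__"))
print(filled.groupby("id1")["v1"].sum())    # the '__NA__' group is kept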

@mattdowle
Contributor

mattdowle commented Nov 12, 2018

Yes it's a tough tradeoff.

I don't want to waste time now adding exceptions if it could be solved soon.

But there'll be exceptions for other tests, not just this one, so the capability to deal with exceptions needs to be added anyway. Although, in this case, I'm not sure what you mean by exceptions or why an exception is needed: "two test scripts for the same test or an if branch before each query".
On this particular nafill task, adding it as working and including the overhead seems fair to me. Then, when the overhead is eliminated in pandas, the result will speed up for pandas. Perhaps one of the reasons it has been low priority is that they haven't measured how large the overhead is. Measuring things like that is one of the points of db-bench. Maybe the nafill overhead is not that bad and the result will look perfectly reasonable.
In general, if a product can do something (even if with overhead or more complex syntax) then db-bench should use that. db-bench displays the syntax too, since the idea is to measure both speed and syntax. If the syntax is too long to show in the plot, then it could be a link to the underlying code, with a summary of the number of lines/characters needed to achieve the result (where shortest is not necessarily best, as dplyr advocates have proved more popular).

@mattdowle
Contributor

mattdowle commented Dec 12, 2018

To be clear, this issue is about NAs in the grouping columns, right? i.e. a dup of #54. If so, can the title be changed?
It would be really good and fair if db-bench highlighted that data.table treats NA in the grouping columns as a distinct group while other products don't, at least not easily. There was a related discussion on the Arrow mailing list that I took part in.
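A minimal illustration of that behaviour difference (toy data, shown in Python only; data.table itself would return an additional NA group here):

# Toy data, not from the benchmark. Default pandas groupby drops rows whose
# grouping key is missing, so the NA group silently disappears from the answer,
# whereas data.table reports it as a group of its own.
import pandas as pd

df = pd.DataFrame({"id1": ["a", None, "a", None], "v1": [1, 2, 3, 4]})
print(df.groupby("id1")["v1"].sum())
# id1
# a    4
# Name: v1, dtype: int64   <- the rows with a missing key (v1 = 2 and 4) are gone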

@jangorecki
Contributor Author

The pandas issue is about NA in grouping columns, but we want to test NAs in all columns.

@mattdowle changed the title from "groupby task should test NAs also" to "groupby task should test NAs in grouping columns (distinct result for NA group), and NAs in value columns too" on Dec 12, 2018
@jangorecki
Contributor Author

jangorecki commented May 9, 2020

pandas has finally implemented NA support in groupby: pandas-dev/pandas#3729 (comment).
It will be good to add this to the benchmark then.
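A short sketch of what that enables (toy data; the dropna argument was added to groupby in pandas 1.1):

# Toy data. With pandas >= 1.1, dropna=False keeps missing keys as their own
# group (pandas-dev/pandas#3729), matching the NA-as-distinct-group behaviour
# discussed earlier in this thread.
import pandas as pd

df = pd.DataFrame({"id1": ["a", None, "a", None], "v1": [1, 2, 3, 4]})
print(df.groupby("id1", dropna=False)["v1"].sum())
# id1
# a      4
# NaN    6
# Name: v1, dtype: int64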

@jangorecki jangorecki added this to the 2.1.0 milestone May 13, 2020
@jangorecki
Contributor Author

According to the information in https://stackoverflow.com/questions/63057886/clickhouse-string-field-disk-usage-null-vs-empty
it looks like ClickHouse will have a significant overhead caused by NAs.
We could eventually branch the script and create nullable columns only for the data cases where NAs are present, but then we should do the same for the other tools, because they also pay an overhead, e.g. mean(v1, na.rm=TRUE) in R. Therefore no script branching for NAs is planned.
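For reference, the kind of per-aggregation NA handling that tools already pay for (toy data, shown in Python; mean(v1, na.rm=TRUE) is the R counterpart):

# Toy data. pandas aggregations skip NAs by default (skipna=True), which is the
# overhead referred to above; skipna=False shows the cheaper non-NA-aware path.
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())               # 2.0, the NaN is skipped (skipna=True by default)
print(s.mean(skipna=False))   # nan, no NA handling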

@jangorecki
Contributor Author

jangorecki commented Dec 13, 2020

Interesting that it still doesn't work for pandas. Until pandas-dev/pandas#36327 is resolved we will have to skip this script for pandas, because it would produce incorrect results. Dask inherits this issue from pandas as well.

@jangorecki
Contributor Author

The groupby benchmark scripts have been amended, wherever possible, to provide NA-aware processing. Where that was not possible (pandas and dask), this data case will be skipped. Once this capability is implemented there, we will remove the skip and enable the data case.

Below is the dummy data we will need for q8 validation, to ensure it behaves as expected (described in 5bc7113#comments):

$ cat data/G0_1e7_1e2_5_0.csv 
id1,id2,id3,id4,id5,id6,v1,v2,v3
a,a,a,1,1,3,0,0,3
a,a,a,1,1,3,0,0,4
a,a,a,1,1,3,0,0,3.5
a,a,a,1,1,3,0,0,
a,a,a,1,1,2,0,0,
a,a,a,1,1,2,0,0,
a,a,a,1,1,2,0,0,
a,a,a,1,1,1,0,0,5
a,a,a,1,1,1,0,0,
a,a,a,1,1,1,0,0,

What is left now is to run everything and tweak the reports to handle the new change.
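A hedged validation sketch for the dummy data above, assuming q8 is the "largest two v3 by id6" query from the benchmark and NA-aware semantics (NAs never enter the top 2; a group whose v3 values are all NA is dropped):

# Sketch only; the q8 definition and the expected NA semantics are assumptions.
import pandas as pd

df = pd.read_csv("data/G0_1e7_1e2_5_0.csv")
ans = (df.dropna(subset=["v3"])          # drop NA v3 so all-NA groups vanish
         .groupby("id6")["v3"]
         .nlargest(2)                    # top two v3 per id6
         .reset_index(level="id6"))
print(ans)
# expected: id6=1 -> 5.0; id6=3 -> 4.0, 3.5; id6=2 absent (its v3 values are all NA)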

@jangorecki
Contributor Author

This has been implemented.

Function calls are now NA-aware, in all data cases.
One new data case having NAs has been added for each size (1e7, 1e8, 1e9). It can be accessed from the report page via the link in the table at the bottom of the report, in the "Explore more data cases" section.

Over the course of development, the following issues were encountered:

Rdatatable/data.table#4849
h2oai/datatable#2808
h2oai/datatable#2809
pandas-dev/pandas#36327
dask/dask#6986
dask/dask#6990

They block the data case having NAs for pandas and dask, and, in the case of python datatable, only the 1e9-size data case having NAs.

The issue is now resolved.

I would like to acknowledge the valuable feedback I got from the Julia developers (@bkamins and @nalimilan) during implementation.

@jangorecki
Contributor Author

jangorecki commented Jan 4, 2021

In 73647e5 I removed the initial filtering-out of all-NA variables in q9. This is to align answers between solutions and to keep all groups (if a group were made up entirely of NAs, it would have been filtered out).
In the case of q8 we did want to filter out such groups, because that question is about "top n by group", whereas correlation is a standard aggregate and should not filter out groups, just like the sum or mean functions.
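A small sketch of that distinction (toy data; it assumes q9 computes r^2 = cor(v1, v2)^2 by group, as in the benchmark):

# Toy data; the q9 definition (r^2 of v1 and v2 by group) is an assumption here.
# Correlation behaves as a standard aggregate: incomplete (v1, v2) pairs are
# skipped, and an all-NA group stays in the answer as NaN instead of being
# filtered out, unlike the top-n logic used for q8.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id2": ["a", "a", "a", "b", "b"],
    "v1":  [1.0, 2.0, np.nan, np.nan, np.nan],
    "v2":  [2.0, 4.0, 5.0,    np.nan, np.nan],
})
r2 = df.groupby("id2").apply(lambda g: g["v1"].corr(g["v2"]) ** 2)
print(r2)
# a    1.0   <- the incomplete pair is skipped
# b    NaN   <- the all-NA group is kept, with NaN as its result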

Tmonster added a commit to Tmonster/db-benchmark that referenced this issue Oct 16, 2023
Fix groupby setup and join scripts for duckdb and duckdb latest