groupby task should test NAs in grouping columns (distinct result for NA group), and NAs in value columns too #40

Closed
jangorecki opened this issue Oct 5, 2018 · 12 comments

Comments

@jangorecki
Contributor

No description provided.

@jangorecki
Contributor Author

jangorecki commented Nov 4, 2018

This will probably need to wait for pandas-dev/pandas#3729;
otherwise we would need to split the pandas benchmark script into two separate cases, one having NAs and one not.
If pandas doesn't make it for 0.24.0 then we can skip that test for pandas, as we currently skip 1e9 due to lack of memory.

@mattdowle
Contributor

If a solution like pandas can't handle a test, then that is not a good reason to disable the test. It should be shown as a fail. Similarly, for example, data.table should be shown as a fail for 500GB whereas pydatatable and Spark will work.
To be clear: db-bench is not supposed to only test tasks that work in all the products. It's supposed to show which products fail too (whether that's because of size or supported-features as the case may be).

@jangorecki
Contributor Author

They can handle it, but it requires different syntax, so either two test scripts for the same test or an if branch before each query. They are working on it in pandas-dev/pandas#3729, but that issue dates back to 2013. For now we should probably mark it as "not supported with the same syntax, requires extra nafill overhead"; I will do that if it isn't resolved for pandas 0.24.0. I don't want to waste time now adding exceptions if it could be solved soon.
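As an illustration, a minimal sketch of the extra nafill step implied above (toy data and an arbitrary "__NA__" sentinel, not benchmark code): as long as pandas cannot keep NA group keys natively, the grouping column has to be filled first, which is pure overhead compared to solutions that keep the NA group directly.

# Sketch only: toy data; "__NA__" is an arbitrary sentinel value.
import pandas as pd

df = pd.DataFrame({"id1": ["a", "a", None, "b"], "v1": [1, 2, 3, 4]})

# Default pandas behaviour: rows whose grouping key is missing are dropped.
print(df.groupby("id1")["v1"].sum())        # groups 'a' and 'b' only

# Workaround with measurable overhead: fill the key column before grouping.
filled = df.assign(id1=df["id1"].fillna("__NA__"))
print(filled.groupby("id1")["v1"].sum())    # the '__NA__' group is kept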

@mattdowle
Contributor

mattdowle commented Nov 12, 2018

Yes it's a tough tradeoff.

I don't want to waste time now adding exceptions if it could be solved soon.

But there'll be exceptions for other tests, not just this one, so the capability to deal with exceptions needs to be added anyway. Although, in this case, I'm not sure what you mean by exceptions or why an exception is needed: "two test scripts for the same test or an if branch before each query".
On this particular nafill task, adding it as working and including the overhead seems fair to me. Then, when the overhead is eliminated in pandas, the result will speed up for pandas. Perhaps one of the reasons it has been low priority is that they haven't measured how large the overhead is. Measuring things like that is one of the points of db-bench. Maybe the nafill overhead is not that bad and the result will look perfectly reasonable.
In general, if a product can do something (even if with overhead or more complex syntax) then db-bench should use that. db-bench displays the syntax too, since the idea is to measure both speed and syntax. If the syntax is too long to show in the plot, then it could be a link to the underlying code, with a summary of the number of lines/characters needed to achieve the result (where shortest is not necessarily best, as dplyr advocates have proved more popular).

@mattdowle
Contributor

mattdowle commented Dec 12, 2018

To be clear, this issue is about NAs in the grouping columns, right? i.e. a dup of #54. If so, can the title be changed?
It would be really good and fair if db-bench highlighted that data.table treats NA in the grouping columns as a distinct group while other products don't, at least not easily. There was a related discussion on the Arrow mailing list that I took part in.
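A minimal illustration of that behaviour difference (toy data, shown in Python only; data.table itself would return an additional NA group here):

# Toy data, not from the benchmark. Default pandas groupby drops rows whose
# grouping key is missing, so the NA group silently disappears from the answer,
# whereas data.table reports it as a group of its own.
import pandas as pd

df = pd.DataFrame({"id1": ["a", None, "a", None], "v1": [1, 2, 3, 4]})
print(df.groupby("id1")["v1"].sum())
# id1
# a    4
# Name: v1, dtype: int64   <- the rows with a missing key (v1 = 2 and 4) are gone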

@jangorecki
Contributor Author

The pandas issue is about NA in grouping columns, but we want to test NAs in all columns.

@mattdowle changed the title from "groupby task should test NAs also" to "groupby task should test NAs in grouping columns (distinct result for NA group), and NAs in value columns too" on Dec 12, 2018
@jangorecki
Contributor Author

jangorecki commented May 9, 2020

pandas has finally implemented NA support in groupby: pandas-dev/pandas#3729 (comment).
It will be good to add this to the benchmark then.
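A short sketch of what that enables (toy data; the dropna argument was added to groupby in pandas 1.1):

# Toy data. With pandas >= 1.1, dropna=False keeps missing keys as their own
# group (pandas-dev/pandas#3729), matching the NA-as-distinct-group behaviour
# discussed earlier in this thread.
import pandas as pd

df = pd.DataFrame({"id1": ["a", None, "a", None], "v1": [1, 2, 3, 4]})
print(df.groupby("id1", dropna=False)["v1"].sum())
# id1
# a      4
# NaN    6
# Name: v1, dtype: int64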

@jangorecki jangorecki added this to the 2.1.0 milestone May 13, 2020
@jangorecki
Contributor Author

According to the information in https://stackoverflow.com/questions/63057886/clickhouse-string-field-disk-usage-null-vs-empty
it looks like ClickHouse will have a significant overhead caused by NAs.
We could eventually branch the script and create nullable columns only for the data cases where NAs are present, but then we should do the same for the other tools, because they also pay an overhead, e.g. mean(v1, na.rm=TRUE) in R. Therefore no script branching for NAs is planned.
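For reference, the kind of per-aggregation NA handling that tools already pay for (toy data, shown in Python; mean(v1, na.rm=TRUE) is the R counterpart):

# Toy data. pandas aggregations skip NAs by default (skipna=True), which is the
# overhead referred to above; skipna=False shows the cheaper non-NA-aware path.
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())               # 2.0, the NaN is skipped (skipna=True by default)
print(s.mean(skipna=False))   # nan, no NA handling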

@jangorecki
Contributor Author

jangorecki commented Dec 13, 2020

Interesting that it still doesn't work for pandas. Until pandas-dev/pandas#36327 is resolved we will have to skip this script for pandas, because it would produce incorrect results. Dask inherits this issue from pandas as well.

@jangorecki
Contributor Author

The groupby benchmark scripts have been amended, wherever possible, to provide NA-aware processing. Where that was not possible (pandas and dask), this data case will be skipped. Once this capability is implemented there, we will remove the skip and enable the data case.

Below is the dummy data we will need for q8 validation, to ensure it behaves as expected (described in 5bc7113#comments):

$ cat data/G0_1e7_1e2_5_0.csv 
id1,id2,id3,id4,id5,id6,v1,v2,v3
a,a,a,1,1,3,0,0,3
a,a,a,1,1,3,0,0,4
a,a,a,1,1,3,0,0,3.5
a,a,a,1,1,3,0,0,
a,a,a,1,1,2,0,0,
a,a,a,1,1,2,0,0,
a,a,a,1,1,2,0,0,
a,a,a,1,1,1,0,0,5
a,a,a,1,1,1,0,0,
a,a,a,1,1,1,0,0,

What is left now is to run everything and tweak the reports to handle the new change.
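A hedged validation sketch for the dummy data above, assuming q8 is the "largest two v3 by id6" query from the benchmark and NA-aware semantics (NAs never enter the top 2; a group whose v3 values are all NA is dropped):

# Sketch only; the q8 definition and the expected NA semantics are assumptions.
import pandas as pd

df = pd.read_csv("data/G0_1e7_1e2_5_0.csv")
ans = (df.dropna(subset=["v3"])          # drop NA v3 so all-NA groups vanish
         .groupby("id6")["v3"]
         .nlargest(2)                    # top two v3 per id6
         .reset_index(level="id6"))
print(ans)
# expected: id6=1 -> 5.0; id6=3 -> 4.0, 3.5; id6=2 absent (its v3 values are all NA)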

@jangorecki
Contributor Author

This has been implemented.

Function calls are now NA-aware, in all data cases.
One new data case having NAs has been added for each size (1e7, 1e8, 1e9). It can be accessed from the report page via the link in the table at the bottom of the report, in the "Explore more data cases" section.

Over the course of development, the following issues were encountered:

Rdatatable/data.table#4849
h2oai/datatable#2808
h2oai/datatable#2809
pandas-dev/pandas#36327
dask/dask#6986
dask/dask#6990

They block the data case having NAs for pandas and dask, and, in the case of python datatable, only the 1e9-size data case having NAs.

The issue is now resolved.

I would like to acknowledge the valuable feedback I got from the Julia developers (@bkamins and @nalimilan) during implementation.

@jangorecki
Contributor Author

jangorecki commented Jan 4, 2021

In 73647e5 I removed the initial filtering-out of all-NA variables in q9. This is to align answers between solutions and to keep all groups (if a group were made up entirely of NAs, it would have been filtered out).
In the case of q8 we did want to filter out such groups, because that question is about "top n by group", whereas correlation is a standard aggregate and should not filter out groups, just like the sum or mean functions.
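A small sketch of that distinction (toy data; it assumes q9 computes r^2 = cor(v1, v2)^2 by group, as in the benchmark):

# Toy data; the q9 definition (r^2 of v1 and v2 by group) is an assumption here.
# Correlation behaves as a standard aggregate: incomplete (v1, v2) pairs are
# skipped, and an all-NA group stays in the answer as NaN instead of being
# filtered out, unlike the top-n logic used for q8.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id2": ["a", "a", "a", "b", "b"],
    "v1":  [1.0, 2.0, np.nan, np.nan, np.nan],
    "v2":  [2.0, 4.0, 5.0,    np.nan, np.nan],
})
r2 = df.groupby("id2").apply(lambda g: g["v1"].corr(g["v2"]) ** 2)
print(r2)
# a    1.0   <- the incomplete pair is skipped
# b    NaN   <- the all-NA group is kept, with NaN as its result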

Tmonster added a commit to Tmonster/db-benchmark that referenced this issue Oct 16, 2023
Fix groupby setup and join scripts for duckdb and duckdb latest