Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent empty-set filtering behavior on multi-value columns #2750

Closed
gianm opened this issue Mar 28, 2016 · 21 comments
Closed

Inconsistent empty-set filtering behavior on multi-value columns #2750

gianm opened this issue Mar 28, 2016 · 21 comments
Milestone

Comments

@gianm
Copy link
Contributor

gianm commented Mar 28, 2016

Right now filtering on nulls on multi-value columns with a filter like {"type": "selector", "dimension": "foo", "value": null} sometimes matches empty sets and sometimes doesn't.

Empty sets occur when writing a row in a multi-value column where the underlying input row's field was either missing, null, or [].

Query-level filters on dim = null:

  • … on IncrementalIndex do match empty sets (IncrementalIndex translates null and [] to null when indexing, and its ValueMatcherFactory considers that representation a match for null)
  • …on segments written by IndexMerger do not match empty sets (it doesn't include them in the bitmap for null)
  • …on segments written by IndexMergerV9 do match empty sets (it includes them in the bitmap for null)

FilteredAggregator filters on dim = null:

  • …on IncrementalIndex do match empty sets (aggregators get a dimension selector that returns empty sets as [null])
  • …on segments written by IndexMerger do not match empty sets (the dimension selector returns empty sets as [], which FilteredAggregator does not consider a match for null)
  • …on segments written by IndexMergerV9 do not match empty sets (the dimension selector returns empty sets as [], which FilteredAggregator does not consider a match for null)
@gianm gianm added this to the 0.9.1 milestone Mar 28, 2016
@gianm
Copy link
Contributor Author

gianm commented Mar 28, 2016

Looking for thoughts on what makes sense.

IMO: filters are supposed to be filtering on values, not the entire set, so from that standpoint, it makes sense for a null selector to not match an empty set.

But, this behavior sorta conflicts with what happens in single-value columns, where missing fields and fields equal to [] are "lifted" to the single value null. It's a bit weird for the same underlying JSON object to sometimes end up in a Druid row that matches null for a particular dimension, and sometimes doesn't, just based on whether that dimension is going to end up single- or multi-value.

So I guess I'm conflicted on what should happen here.

@gianm gianm changed the title Inconsistent empty-row filtering behavior on multi-value columns Inconsistent empty-set filtering behavior on multi-value columns Mar 28, 2016
@gianm
Copy link
Contributor Author

gianm commented Mar 28, 2016

One possibility is to make it so empty sets in multi-value columns on disk don't match null (don't include those rows in that bitmap), and then make IncrementalIndex behave that way too with some logic like this:

  • if we ever saw multi-values for a dimension, don't allow empty sets to match null (mimic multi-value on-disk column behavior)
  • if we never saw multi-values for a dimension, do allow empty sets to match null (mimic single-value on-disk column behavior)

I believe this behavior makes sense if you think "null" is an actual value rather than meaning "not present".

@xvrl
Copy link
Member

xvrl commented Mar 28, 2016

#665 and #995 may provide some clues

@gianm
Copy link
Contributor Author

gianm commented Mar 28, 2016

Another possibility is to make it so empty sets always match null, as sort of a special case (it's special since they don't actually contain a null value). We could document that as something like "a filter on null will match any rows containing a null value, or any rows containing no values at all".

I believe this behavior makes sense if you think "null" means "not present".

@xvrl
Copy link
Member

xvrl commented Mar 28, 2016

@cheddar you may have some opinion on this given your work on #995

@drcrallen
Copy link
Contributor

@gianm I think that approach makes the most sense for the same example I give in #995 (comment) :

If you have a dimension... let's say "cake"... which never actually have a value for (all null) and you say "give me all events where cake is not cheesecake", then I would expect it to return all events.

A counter -example is: If there is a sequence of events for which "cake" is absent, and another sequence of events where "cake" is empty [], would it be reasonable to optimize them away such that "cake" is not stored at all in either case?

@gianm
Copy link
Contributor Author

gianm commented Mar 28, 2016

@drcrallen I don't follow -- are you suggesting a particular approach?

@drcrallen
Copy link
Contributor

@gianm I propose a selector on null should match any of the following:

  1. A dimension is missing
  2. A dimension is explicitly null
  3. A multi-value is zero-length []
  4. A multi-value explicitly contains null ["foo","bar",""]

@gianm
Copy link
Contributor Author

gianm commented Mar 28, 2016

Since we treat columns that aren't present as if they match null (and we treat missing fields and empty arrays in JSON the same as null fields in JSON), it seems to me like Druid already has a lot of built-in bias towards treating null as "not present" rather than as an actual value.

So I'm leaning towards thinking that filters on null should match [] in multi-value columns just like they match null in single-value columns and like they match all rows if there is no column.

@gianm
Copy link
Contributor Author

gianm commented Mar 28, 2016

@drcrallen I think I agree with you

@xvrl
Copy link
Member

xvrl commented Mar 28, 2016

@drcrallen "" should not be part of the spec, "" is an optimization detail at the column storage level. The only thing we should tell the user about "" is that they get mapped to null

@vogievetsky
Copy link
Contributor

I agree with @xvrl regarding not focusing on "" and just call it null

@vogievetsky
Copy link
Contributor

I feel like having the selector match both ["foo","bar",""] and [] is really strange. Would anyone ever want that on purpose?

@drcrallen
Copy link
Contributor

@vogievetsky IMHO if they don't want to match on null they shouldn't put null in the multi-value set. The only reason it would be put there is because they want to use it in filtering.

@vogievetsky
Copy link
Contributor

@drcrallen I actually agree that ["foo","bar",""] should match (by your reasoning). I am thinking that maybe [] should not match. (I know crazy right?).

I guess ideally it would be great to have some way to tell ["foo","bar",""] apart from [].

Maybe a byRow flag?

@drcrallen
Copy link
Contributor

@vogievetsky does that also mean [] should be able to be treated differently than [""]?

@gianm
Copy link
Contributor Author

gianm commented Mar 28, 2016

@vogievetsky byRow filters will probably be added at some point (#2217 proposes them, for example) so let's assume we are only talking about non-byRow filters right now.

@vogievetsky
Copy link
Contributor

BTW also going with @drcrallen suggestion and just telling people not to put nulls in MV dimension sets unless they know what they are doing is fine I think.

@gianm
Copy link
Contributor Author

gianm commented Mar 29, 2016

PR #2753 makes things behave such that filtering on null DOES match empties.

@vogievetsky
Copy link
Contributor

@drcrallen yes I assumed that [] and [""] would be treated differently. I am guessing that is not straightforward given how Druid stores data?

@gianm
Copy link
Contributor Author

gianm commented Mar 29, 2016

[] and [""] could be treated differently in multi-value columns, as they are stored differently (first is [] and second is [""]). In single-value columns they are stored the same way (as a single "" value).

gianm added a commit to gianm/druid that referenced this issue Mar 29, 2016
The behavior is now that filters on "null" will match rows with no
values. The behavior in the past was inconsistent; sometimes these
filters would match and sometimes they wouldn't.

Adds tests for this behavior to SelectorFilterTest and
BoundFilterTest, for query-level filters and filtered aggregates.

Fixes apache#2750.
@fjy fjy closed this as completed in #2753 Mar 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants