Inconsistent empty-set filtering behavior on multi-value columns #2750

gianm · 2016-03-28T20:46:44Z

Right now filtering on nulls on multi-value columns with a filter like {"type": "selector", "dimension": "foo", "value": null} sometimes matches empty sets and sometimes doesn't.

Empty sets occur when writing a row in a multi-value column where the underlying input row's field was either missing, null, or [].

Query-level filters on dim = null:

…on IncrementalIndex do match empty sets (IncrementalIndex translates null and [] to null when indexing, and its ValueMatcherFactory considers that representation a match for null)
…on segments written by IndexMerger do not match empty sets (it doesn't include them in the bitmap for null)
…on segments written by IndexMergerV9 do match empty sets (it includes them in the bitmap for null)

FilteredAggregator filters on dim = null:

…on IncrementalIndex do match empty sets (aggregators get a dimension selector that returns empty sets as [null])
…on segments written by IndexMerger do not match empty sets (the dimension selector returns empty sets as [], which FilteredAggregator does not consider a match for null)
…on segments written by IndexMergerV9 do not match empty sets (the dimension selector returns empty sets as [], which FilteredAggregator does not consider a match for null)
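
To make the FilteredAggregator discrepancy above concrete, here is a minimal Python sketch. It is not Druid code: the `matches_null` function and the list-based stand-ins for what a dimension selector returns are hypothetical, chosen only to model the two representations described in the list ([null] versus []).

```python
def matches_null(selector_values):
    """Hypothetical per-value matcher: true if any returned value is null."""
    return any(v is None for v in selector_values)


# IncrementalIndex-style selector: an empty set is surfaced as [None].
incremental_row = [None]
# On-disk segment selector: an empty set is surfaced as [].
segment_row = []

print(matches_null(incremental_row))  # True
print(matches_null(segment_row))      # False
```

A matcher that only inspects individual values has nothing to inspect in [], which is one plausible reading of why the same row matches in one place and not in the other.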


gianm · 2016-03-28T20:56:00Z

Looking for thoughts on what makes sense.

IMO: filters are supposed to be filtering on values, not the entire set, so from that standpoint, it makes sense for a null selector to not match an empty set.

But, this behavior sorta conflicts with what happens in single-value columns, where missing fields and fields equal to [] are "lifted" to the single value null. It's a bit weird for the same underlying JSON object to sometimes end up in a Druid row that matches null for a particular dimension, and sometimes doesn't, just based on whether that dimension is going to end up single- or multi-value.

So I guess I'm conflicted on what should happen here.

gianm · 2016-03-28T21:01:52Z

One possibility is to make it so empty sets in multi-value columns on disk don't match null (don't include those rows in that bitmap), and then make IncrementalIndex behave that way too with some logic like this:

if we ever saw multi-values for a dimension, don't allow empty sets to match null (mimic multi-value on-disk column behavior)
if we never saw multi-values for a dimension, do allow empty sets to match null (mimic single-value on-disk column behavior)
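
The two rules above could be sketched roughly as follows. This is only an illustration of the proposal, assuming a hypothetical per-dimension state object; `DimensionState`, `observe`, and `empty_matches_null` are made-up names, not IncrementalIndex APIs.

```python
class DimensionState:
    """Hypothetical per-dimension bookkeeping for an in-memory index."""

    def __init__(self):
        self.saw_multi_value = False

    def observe(self, values):
        # Record whether any ingested row carried more than one value.
        if len(values) > 1:
            self.saw_multi_value = True

    def empty_matches_null(self):
        # Seen multi-values: mimic the multi-value on-disk column, [] does not match null.
        # Never seen multi-values: mimic the single-value column, [] is lifted to null.
        return not self.saw_multi_value
```

One consequence of this rule is that the answer for an empty row can change after a later row with several values is ingested.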

I believe this behavior makes sense if you think "null" is an actual value rather than meaning "not present".

xvrl · 2016-03-28T21:17:08Z

#665 and #995 may provide some clues

gianm · 2016-03-28T21:28:07Z

Another possibility is to make it so empty sets always match null, as sort of a special case (it's special since they don't actually contain a null value). We could document that as something like "a filter on null will match any rows containing a null value, or any rows containing no values at all".

I believe this behavior makes sense if you think "null" means "not present".

xvrl · 2016-03-28T21:32:27Z

@cheddar you may have some opinion on this given your work on #995

drcrallen · 2016-03-28T21:32:52Z

@gianm I think that approach makes the most sense for the same example I give in #995 (comment):

If you have a dimension... let's say "cake"... which you never actually have a value for (all null) and you say "give me all events where cake is not cheesecake", then I would expect it to return all events.

A counter-example is: If there is a sequence of events for which "cake" is absent, and another sequence of events where "cake" is empty [], would it be reasonable to optimize them away such that "cake" is not stored at all in either case?

gianm · 2016-03-28T21:38:13Z

@drcrallen I don't follow -- are you suggesting a particular approach?

drcrallen · 2016-03-28T21:44:09Z

@gianm I propose a selector on null should match any of the following:

A dimension is missing
A dimension is explicitly null
A multi-value is zero-length []
A multi-value explicitly contains null ["foo","bar",""]
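
As a rough sketch of the four cases above (not Druid code), the proposed semantics could be modeled like this. The `MISSING` sentinel, the dict-based row, and the function name are hypothetical, and treating "" as null is an assumption taken from how "" is used in the last case.

```python
MISSING = object()  # hypothetical sentinel for "field absent from the row"


def selector_matches_null(row, dimension):
    """Model of the proposed rule for a selector filter with value null."""
    value = row.get(dimension, MISSING)
    # Cases 1 and 2: the dimension is missing or explicitly null.
    if value is MISSING or value is None:
        return True
    if isinstance(value, list):
        # Case 3: zero-length multi-value. Case 4: multi-value containing null.
        # Assumption: "" is treated the same as null.
        return len(value) == 0 or any(v is None or v == "" for v in value)
    # Single value: only the null-equivalent "" matches.
    return value == ""
```

For example, `selector_matches_null({"cake": []}, "cake")` and `selector_matches_null({}, "cake")` both return True, while a row with `["foo", "bar"]` does not match.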

gianm · 2016-03-28T21:44:16Z

Since we treat columns that aren't present as if they match null (and we treat missing fields and empty arrays in JSON the same as null fields in JSON), it seems to me like Druid already has a lot of built-in bias towards treating null as "not present" rather than as an actual value.

So I'm leaning towards thinking that filters on null should match [] in multi-value columns just like they match null in single-value columns and like they match all rows if there is no column.

gianm · 2016-03-28T21:44:35Z

@drcrallen I think I agree with you

xvrl · 2016-03-28T21:57:06Z

@drcrallen "" should not be part of the spec, "" is an optimization detail at the column storage level. The only thing we should tell the user about "" is that they get mapped to null

vogievetsky · 2016-03-28T22:01:06Z

I agree with @xvrl regarding not focusing on "" and just call it null

vogievetsky · 2016-03-28T22:35:05Z

I feel like having the selector match both ["foo","bar",""] and [] is really strange. Would anyone ever want that on purpose?

drcrallen · 2016-03-28T22:41:48Z

@vogievetsky IMHO if they don't want to match on null they shouldn't put null in the multi-value set. The only reason it would be put there is because they want to use it in filtering.

vogievetsky · 2016-03-28T23:37:43Z

@drcrallen I actually agree that ["foo","bar",""] should match (by your reasoning). I am thinking that maybe [] should not match. (I know crazy right?).

I guess ideally it would be great to have some way to tell ["foo","bar",""] apart from [].

Maybe a byRow flag?

drcrallen · 2016-03-28T23:40:01Z

@vogievetsky does that also mean [] should be able to be treated differently than [""]?

gianm · 2016-03-28T23:41:39Z

@vogievetsky byRow filters will probably be added at some point (#2217 proposes them, for example) so let's assume we are only talking about non-byRow filters right now.

vogievetsky · 2016-03-28T23:42:41Z

BTW also going with @drcrallen's suggestion and just telling people not to put nulls in MV dimension sets unless they know what they are doing is fine I think.

gianm · 2016-03-29T01:31:52Z

PR #2753 makes things behave such that filtering on null DOES match empties.

vogievetsky · 2016-03-29T02:33:43Z

@drcrallen yes I assumed that [] and [""] would be treated differently. I am guessing that is not straightforward given how Druid stores data?

gianm · 2016-03-29T03:21:41Z

[] and [""] could be treated differently in multi-value columns, as they are stored differently (first is [] and second is [""]). In single-value columns they are stored the same way (as a single "" value).
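
A simplified model of that storage difference, for illustration only: the two functions below are hypothetical and do not reflect Druid's actual column format, only the point that the multi-value form keeps [] and [""] apart while the single-value form collapses both to "".

```python
def store_multi_value(values):
    # Multi-value column: the set is kept as-is, so [] and [""] stay distinct.
    return list(values)


def store_single_value(values):
    # Single-value column: no values and one empty value both collapse to "".
    return values[0] if values else ""
```

Under this model `store_multi_value([])` and `store_multi_value([""])` differ, whereas `store_single_value([])` and `store_single_value([""])` are the same value.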

The behavior is now that filters on "null" will match rows with no values. The behavior in the past was inconsistent; sometimes these filters would match and sometimes they wouldn't. Adds tests for this behavior to SelectorFilterTest and BoundFilterTest, for query-level filters and filtered aggregates. Fixes apache#2750.

gianm added the Improvement label Mar 28, 2016

gianm added this to the 0.9.1 milestone Mar 28, 2016

gianm changed the title ~~Inconsistent empty-row filtering behavior on multi-value columns~~ Inconsistent empty-set filtering behavior on multi-value columns Mar 28, 2016

gianm mentioned this issue Mar 29, 2016

More consistent empty-set filtering behavior on multi-value columns. #2753

Merged

fjy closed this as completed in #2753 Mar 30, 2016
