Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL: Plan non-equijoin conditions as cross join followed by filter #15302

Merged
merged 18 commits into from
Nov 29, 2023
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 22 additions & 16 deletions docs/querying/datasource.md
Original file line number Diff line number Diff line change
Expand Up @@ -320,12 +320,16 @@ Join datasources allow you to do a SQL-style join of two datasources. Stacking j
you to join arbitrarily many datasources.

In Druid {{DRUIDVERSION}}, joins in native queries are implemented with a broadcast hash-join algorithm. This means
that all datasources other than the leftmost "base" datasource must fit in memory. It also means that the join condition
must be an equality. This feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup),
[inline](#inline), and [query](#query) datasources.
that all datasources other than the leftmost "base" datasource must fit in memory. In native queries, the join condition
must be an equality. In SQL, any join condition is accepted, but only equalities of a certain form
(see [Joins in SQL](#joins-in-sql)) execute as part of a native join. For other kinds of conditions, planner will try to re-arrange
condition such that some of the sub-conditions are evaluated as a filter on top of join and other sub-conditions are left
out in the join condition. In worst case scenario, SQL will execute the join condition as a cross join
(cartesian product) plus a filter.

Refer to the [Query execution](query-execution.md#join) page for more details on how queries are executed when you
use join datasources.
This feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
[query](#query) datasources. Refer to the [Query execution](query-execution.md#join) page for more details on how
queries are executed when you use join datasources.

#### Joins in SQL

Expand All @@ -335,21 +339,23 @@ SQL joins take the form:
<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
```

The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
like `t1.x <> t2.x` cannot currently be handled.
Any condition is accepted, but only certain kinds of conditions execute as part of a native join. To execute efficiently
as part of a native join, a condition must be a single clause like the following, or an `AND` of clauses involving at
lease one of the following:
abhishekagarwal87 marked this conversation as resolved.
Show resolved Hide resolved

Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
SQL layer will generate subqueries include:
- Equality between fields of the same type on each side, like `t1 JOIN t2 ON t1.x = t2.x`.
- Equality between a function call on one side, and a field on the other side, like `t1 JOIN t2 ON LOWER(t1.x) = t2.x`.
- The equality operator may be `=` (which does not match nulls) or `IS NOT DISTINCT FROM` (which does match nulls).

- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
a table on the left-hand side, but not the right, so a subquery is needed.
In other cases, Druid will either insert a subquery below the join, or will use a cross join (cartesian product)
followed by a filter. Joins executed in these ways may run into resource or performance constraints. To determine
if your query is using one of these execution paths, run `EXPLAIN PLAN FOR <query>` and look for the following:

- Join conditions where the expressions on either side are of different types.
- `query` type datasources under the `left` or `right` key of your `join` datasource.
- `join` type datasource with `condition` set to `"1"` (cartesian product) followed by a `filter` that encodes the
condition you provided.

- Join conditions where the right-hand expression is not a direct column access.
In these cases, you may be able to improve the performance of your query by rewriting it.

For more information about how Druid translates SQL to native queries, refer to the
[Druid SQL](sql-translation.md) documentation.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -772,7 +772,7 @@ public void testJoinWithLookup()
DruidExpression.ofColumn(ColumnType.STRING, "dim2"),
DruidExpression.ofColumn(ColumnType.STRING, "j0.k")
),
JoinType.LEFT
JoinType.INNER
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial query has a left join here and we end up planning it as in inner. Can there be a case where the user does not expect the results which have a pair on both left and right ? Also that seems to be case where the result changed for the not sql compatible mode

)
)
.setInterval(querySegmentSpec(Filtration.eternity()))
Expand All @@ -797,12 +797,7 @@ public void testJoinWithLookup()
.build())
.setExpectedRowSignature(rowSignature)
.setExpectedResultRows(
NullHandling.sqlCompatible()
? ImmutableList.of(
new Object[]{"xabc", 1L}
)
: ImmutableList.of(
new Object[]{NullHandling.defaultStringValue(), 3L},
ImmutableList.of(
new Object[]{"xabc", 1L}
)
)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ public void testExactCountDistinctWithFilter()

@Ignore
@Override
public void testUnplannableQueries()
public void testUnplannableScanOrderByNonTime()
{

}
Expand Down
Loading
Loading