Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-233] Incorrect results for several joins in a single stage #748

Merged
merged 2 commits into from
Jan 20, 2023

Conversation

umartin
Copy link
Contributor

@umartin umartin commented Jan 18, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

This patch changes how the deduplication gets it partition id. The previous method of getting it from TaskContext was unreliable. Now it uses mapPartitionsWithIndex. The documentation clearly states that is uses the original partition id. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html#mapPartitionsWithIndex[U](f:(Int,Iterator[T])=%3EIterator[U],preservesPartitioning:Boolean)(implicitevidence$9:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]

Deduplication is refactored out of the join judgement into a separate DuplicatesFilter.

Deduplication code that is used in sedona-flink is moved to common.

How was this patch tested?

Unit test added

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the docs.

@@ -0,0 +1,48 @@
package org.apache.sedona.core.joinJudgement;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Apache License header here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@@ -0,0 +1,75 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add another test in Sedona Spark SQL to verify that this bug is eliminated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Hopefully I have time to add it tomorrow

@jiayuasu jiayuasu added the bug label Jan 19, 2023
@jiayuasu jiayuasu added this to the sedona-1.4.0 milestone Jan 19, 2023
Added test in sedona-sql.
Added missing license header.
@jiayuasu jiayuasu merged commit 43e1d79 into apache:master Jan 20, 2023
@jiayuasu
Copy link
Member

@yitao-li

Dear Yitao, Sedona R build started to fail since this PR. But the PR is not relevant to the R side, could you please take a look?

Thanks,
Jia

@yitao-li
Copy link
Contributor

yitao-li commented Jan 26, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants