-
Notifications
You must be signed in to change notification settings - Fork 664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-625] Add ST_GeneratePoints #1520
Conversation
@Kontinuation The result of this function is non-deterministic. Will this create a problem to the catalyst optimizer? |
The correct way to implement non-deterministic functions is to extend the Unfortunately, |
@furqaankhan can you solve the issue using Kristin's method? |
Yes, I am working on it. |
This PR should also add SRID to the generated points following the style of this PR: #1521 |
common/src/main/java/org/apache/sedona/common/utils/RandomPointsBuilderSeed.java
Outdated
Show resolved
Hide resolved
common/src/main/java/org/apache/sedona/common/utils/RandomPointsBuilderSeed.java
Outdated
Show resolved
Hide resolved
override protected def initializeInternal(partitionIndex: Int): Unit = random = new Random( | ||
randomSeed.getOrElse(0L) + partitionIndex) | ||
|
||
override protected def evalInternal(input: InternalRow): Any = { | ||
val geom = children.head.toGeometry(input) | ||
val numPoints = children(1).eval(input).asInstanceOf[Int] | ||
if (nArgs == 3) { | ||
val seed = children(2).eval(input).asInstanceOf[Int] | ||
return GeometrySerializer.serialize(Functions.generatePoints(geom, numPoints, seed)) | ||
} | ||
GeometrySerializer.serialize(Functions.generatePoints(geom, numPoints)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should construct an instance of RandomPointsBuilderSeed
in the initializeInternal
function, and call setExtent
, setNumPoints
, getGeometry
methods using that instance in evalInternal
.
case class ST_GeneratePoints(inputExpressions: Seq[Expression], randomSeed: Option[Long] = None) | ||
extends Expression | ||
with Nondeterministic | ||
with ExpectsInputTypes | ||
with CodegenFallback | ||
with ExpressionWithRandomSeed { | ||
|
||
def this(inputExpressions: Seq[Expression]) = this(inputExpressions, Some(0L)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExpressionWithRandomSeed
has binary incompatible changes since Spark 3.0, so we cannot use it to write binary compatible code for Spark 3.0 ~ 3.5. Maintaining compatibility with old Spark versions (<= 3.2) is really a burden for us.
We can initialize ST_GeneratePoints
in the old Spark 2.3 way:
case class ST_GeneratePoints(inputExpressions: Seq[Expression], randomSeed: Long) extends RDG {
def this(inputExpressions: Seq[Expression]) = this(inputExpressions, Utils.random.nextLong())
The consequence is that the random seed will be fixed for multiple minibatches when running streaming jobs. However, that's the best we can do to maintain compatibility with various Spark versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using RDG doesn't work, as it extends UnaryExpression
. Hence, I used all the traits that RDG was using but replaced the UnaryExpression
with Expression
. Please let me know if you think otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I made a mistake when writing the example code. It should be
case class ST_GeneratePoints(inputExpressions: Seq[Expression], randomSeed: Long) extends Expression
with Nondeterministic
with ExpectsInputTypes
with CodegenFallback {
def this(inputExpressions: Seq[Expression]) = this(inputExpressions, Utils.random.nextLong())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to push this changes by removing the ExpressionWithRandomSeed
trait and it's overridden members, to test it.
I am using the factory of the input geometry. that will preserve the SRID. Please let me know if you think otherwise. |
@furqaankhan Can you create a PR on snowflake-tester? |
I did create one, I closed it as it had passed. I reopened it as I added a new function signature. The PR. |
3c9d48b
to
9d47bad
Compare
return generatePoints(geom, numPoints, 0); | ||
} | ||
|
||
public static Geometry generatePoints(Geometry geom, int numPoints, Random random) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there another overloaded function that takes Random
as input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for using the random
member of ST_GeneratePoints
to generate consistent random streams
if (seed > 0) { | ||
Functions.generatePoints(geom, numPoints, seed) | ||
} else { | ||
Functions.generatePoints(geom, numPoints, random) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? This if condition seems to be repeating the same condition in Functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for generating consistent random streams even if seed is 0.
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-XXX] my subject
.What changes were proposed in this PR?
How was this patch tested?
Did this PR include necessary documentation updates?