-
Notifications
You must be signed in to change notification settings - Fork 654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-133] Allow user-defined schemas in Adapter.toDf() #655
Conversation
There is no way to distinguish between having no user data (in which case it is set to "null") and having one column of user data that may take on null values (in which case the string representation would also be "null"). But in the second case, we want to preserve it and not drop the null values.
if (data == "null") { | ||
return desiredType match { | ||
case _: ByteType => null.asInstanceOf[Byte] | ||
case _: ShortType => null.asInstanceOf[Short] | ||
case _: IntegerType => null.asInstanceOf[Integer] | ||
case _: LongType => null.asInstanceOf[Long] | ||
case _: FloatType => null.asInstanceOf[Float] | ||
case _: DoubleType => null.asInstanceOf[Double] | ||
case _: DateType => null.asInstanceOf[Date] | ||
case _: TimestampType => null.asInstanceOf[Timestamp] | ||
case _: BooleanType => null.asInstanceOf[Boolean] | ||
case _: StringType => null.asInstanceOf[String] | ||
} | ||
} | ||
|
||
desiredType match { | ||
case _: ByteType => data.toByte | ||
case _: ShortType => data.toShort | ||
case _: IntegerType => data.toInt | ||
case _: LongType => data.toLong | ||
case _: FloatType => data.toFloat | ||
case _: DoubleType => data.toDouble | ||
case _: DateType => Date.valueOf(data) | ||
case _: TimestampType => Timestamp.valueOf(data) | ||
case _: BooleanType => data.toBoolean | ||
case _: StringType => data | ||
case _: StructType => | ||
val desiredStructSchema = desiredType.asInstanceOf[StructType] | ||
new GenericRowWithSchema(parseStruct(data, desiredStructSchema), desiredStructSchema) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find an elegant way to perform the null-safe conversion, so I have these long match
statements
|
|
Did you read the Contributor Guide?
Yes, I have read Contributor Rules and Contributor Development Guide
Is this PR related to a JIRA ticket?
Yes, the URL of the assoicated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format
[SEDONA-XXX] my subject
.Link to original ticket.
What changes were proposed in this PR?
This expands the Adapter API to allow for users to convert to DataFrames with a given schema (for both SpatialRDD and JavaPairRDD).
User data is still stored in String format, so these new methods parse/cast the strings to whichever new data type is requested. This is similar to Spark's UnivocityParser, which is used to parse CSV files, but unfortunately that functionality is not exposed publicly so I created a barebones version here. I didn't cover all the data types, but tried to cover the key ones. This page details the encoders that Spark uses and may be helpful to understand the appropriate data types for conversion. We could expand to cover more data types later, or I'm open to it now if you request it.
This also adds some helper private methods and refactors a few common operations into their own functions.
How was this patch tested?
Added unit tests to confirm that SpatialRDD/JavaPairRDD -> DataFrame conversion works as expected.
Note: It doesn't seem to be common practice to test private methods, so I didn't add unit tests for the private methods I introduced. Their behavior is tested implicitly by the public functions.
Did this PR include necessary documentation updates?
Yes, I am adding a new API. I am using the current SNAPSHOT version number in since
vX.Y.Z
format.Note: I don't see another appropriate place to change documentation. Please let me know if I missed this!
Questions
1.
Is the following behavior intentional? I don't have a strong geospatial background.
In the JavaPairRDD -> DataFrame test case (called "can convert JavaPairRDD to DataFrame with user-supplied schema"), you may notice that the left and right dataframes get switched. The SpatialJoinQuery has
pointRDD
on the left andpolygonRDD
on the right, but the final output hasleftGeometry
of type POLYGON, followed by thepolygonRDD
user data fields, andrightGeometry
of type POINT, followed by thepointRDD
user data (null).