
[SEDONA-133] Allow user-defined schemas in Adapter.toDf() #655

Merged

merged 10 commits into apache:master on Aug 16, 2022

Conversation

brianrice2 (Contributor)

Did you read the Contributor Guide?

Yes, I have read Contributor Rules and Contributor Development Guide

Is this PR related to a JIRA ticket?

Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format [SEDONA-XXX] my subject.

Link to original ticket.

What changes were proposed in this PR?

This expands the Adapter API to allow for users to convert to DataFrames with a given schema (for both SpatialRDD and JavaPairRDD).

User data is still stored in String format, so these new methods parse/cast the strings to whichever new data type is requested. This is similar to Spark's UnivocityParser, which is used to parse CSV files, but unfortunately that functionality is not exposed publicly so I created a barebones version here. I didn't cover all the data types, but tried to cover the key ones. This page details the encoders that Spark uses and may be helpful to understand the appropriate data types for conversion. We could expand to cover more data types later, or I'm open to it now if you request it.
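The parse/cast idea can be sketched independently of Spark (a minimal standalone analogue for illustration; the `FieldType` tags below are hypothetical stand-ins for Spark's `DataType` classes, not the PR's actual code):

```scala
// Hypothetical stand-ins for Spark's DataType hierarchy.
sealed trait FieldType
case object IntField extends FieldType
case object DoubleField extends FieldType
case object BooleanField extends FieldType
case object StringField extends FieldType

object FieldParser {
  // Parse a raw string (as stored in SpatialRDD user data) into the
  // JVM value matching the requested field type.
  def parse(data: String, t: FieldType): Any = t match {
    case IntField     => data.toInt
    case DoubleField  => data.toDouble
    case BooleanField => data.toBoolean
    case StringField  => data
  }
}
```

The real implementation dispatches on Spark's `DataType` instances in the same way, which is why it resembles a pared-down `UnivocityParser`.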

This also adds some helper private methods and refactors a few common operations into their own functions.

How was this patch tested?

Added unit tests to confirm that SpatialRDD/JavaPairRDD -> DataFrame conversion works as expected.

Note: It doesn't seem to be common practice to test private methods, so I didn't add unit tests for the private methods I introduced. Their behavior is tested implicitly by the public functions.

Did this PR include necessary documentation updates?

Yes, I am adding a new API. I am using the current SNAPSHOT version number in the "since vX.Y.Z" format.

Note: I don't see another appropriate place to change documentation. Please let me know if I missed this!

Questions

1. Is the following behavior intentional? I don't have a strong geospatial background.

In the JavaPairRDD -> DataFrame test case (called "can convert JavaPairRDD to DataFrame with user-supplied schema"), you may notice that the left and right DataFrames get switched. The SpatialJoinQuery has pointRDD on the left and polygonRDD on the right, but the final output has leftGeometry of type POLYGON, followed by the polygonRDD user data fields, and rightGeometry of type POINT, followed by the pointRDD user data (null).

There is no way to distinguish between having no user data (in which case it is set to "null") and having one column of user data that may take on null values (in which case the string representation would also be "null"). But in the second case, we want to preserve it and not drop the null values.
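The ambiguity can be demonstrated with a plain-Scala sketch of how user data round-trips through its string form (the tab-joined serialization and `serialize` helper here are assumptions for illustration, not Sedona's actual code):

```scala
object UserDataDemo {
  // Hypothetical serialization: user-data fields joined into one string,
  // with null fields rendered as the literal text "null".
  def serialize(fields: Seq[Any]): String =
    if (fields.isEmpty) "null" // no user data at all
    else fields.map(f => if (f == null) "null" else f.toString).mkString("\t")
}
```

Both `serialize(Seq.empty)` (no user data) and `serialize(Seq(null))` (one nullable column) produce the identical string "null", so a reader of the serialized form cannot tell the cases apart.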
Comment on lines +286 to +315:

```scala
if (data == "null") {
  return desiredType match {
    case _: ByteType => null.asInstanceOf[Byte]
    case _: ShortType => null.asInstanceOf[Short]
    case _: IntegerType => null.asInstanceOf[Integer]
    case _: LongType => null.asInstanceOf[Long]
    case _: FloatType => null.asInstanceOf[Float]
    case _: DoubleType => null.asInstanceOf[Double]
    case _: DateType => null.asInstanceOf[Date]
    case _: TimestampType => null.asInstanceOf[Timestamp]
    case _: BooleanType => null.asInstanceOf[Boolean]
    case _: StringType => null.asInstanceOf[String]
  }
}

desiredType match {
  case _: ByteType => data.toByte
  case _: ShortType => data.toShort
  case _: IntegerType => data.toInt
  case _: LongType => data.toLong
  case _: FloatType => data.toFloat
  case _: DoubleType => data.toDouble
  case _: DateType => Date.valueOf(data)
  case _: TimestampType => Timestamp.valueOf(data)
  case _: BooleanType => data.toBoolean
  case _: StringType => data
  case _: StructType =>
    val desiredStructSchema = desiredType.asInstanceOf[StructType]
    new GenericRowWithSchema(parseStruct(data, desiredStructSchema), desiredStructSchema)
}
```
brianrice2 (Contributor Author):
I couldn't find an elegant way to perform the null-safe conversion, so I have these long match statements.
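For what it's worth, one way the repetition could be reduced (a sketch of an alternative, not what the PR does) is to funnel the "null" check through `Option` and return boxed types, since Scala's primitive `Int`/`Double` cannot hold null on the JVM; the helper names below are hypothetical:

```scala
object NullSafeParse {
  // Map the literal string "null" to a JVM null, otherwise convert.
  // T must be a reference (boxed) type for orNull to be usable.
  def parseNullable[T >: Null](data: String)(convert: String => T): T =
    Option(data).filter(_ != "null").map(convert).orNull

  def toInteger(data: String): java.lang.Integer =
    parseNullable(data)(s => java.lang.Integer.valueOf(s))

  def toDoubleObj(data: String): java.lang.Double =
    parseNullable(data)(s => java.lang.Double.valueOf(s))
}
```

Note this changes the null representation: the original `null.asInstanceOf[Byte]` branches actually yield the primitive defaults (0, false, etc.) rather than null, which is exactly the kind of subtlety that makes the boxed approach worth considering.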

jiayuasu (Member) commented Aug 7, 2022

@brianrice2

  1. For your question Is the following behavior intentional? I don't have a strong geospatial background., yes, it is intentional but it was due to my old silly design in GeoSpark... I didn't change it back due to backwards compatibility... Maybe we should change it back at some point...

  2. Please add some documentation in https://github.com/apache/incubator-sedona/blob/master/docs/tutorial/sql.md#convert-between-dataframe-and-spatialrdd to explain your API. Thank u!

brianrice2 (Contributor Author) commented Aug 7, 2022

  1. Thanks for providing that context! I'll create a follow-up Jira issue to discuss/prioritize separately from this one. It would require extra care to be slotted into a major version update and communicated to users, so it may be more trouble than it's worth. But at my work we do lots of join queries and primarily work with DataFrames, so I do come across this.
  2. Thanks for pointing that out! Added some notes on converting with a schema.

@jiayuasu jiayuasu merged commit da7bbbc into apache:master Aug 16, 2022
@brianrice2 brianrice2 deleted the sedona-133-custom-schema branch August 16, 2022 13:18