[SEDONA-269] Add Raster data source and RS_AsGeoTiff and RS_AsArcGrid #828

Merged: 12 commits, May 14, 2023

jiayuasu
Member

@jiayuasu jiayuasu commented May 11, 2023

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

  1. Add a new built-in data source called raster, designed for writing rasters.
  2. This raster data source takes 3 options:
    • optional fileExtension: any file extension string. If not given, it defaults to .tiff.
    • optional pathField: the field that indicates each raster file name. If not given, a UUID is used as the name of each raster.
    • optional rasterField: the name of the raster column, needed if you have multiple binary type columns in a DataFrame.
    • The raster name is quite an important field, as many libraries need raster file names for additional information.
  3. Add two raster output functions that convert a raster to a binary byte array:
    • RS_AsGeoTiff
    • RS_AsArcGrid
  4. Move the raster logic from io to io.raster.
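
To make the naming rule in item 2 concrete, here is a minimal Python sketch. It is not the code in this PR: `raster_file_name` is a hypothetical helper, and appending the extension directly to the pathField value is an assumption made for illustration.

```python
import uuid
from typing import Optional


def raster_file_name(path_value: Optional[str] = None,
                     file_extension: str = ".tiff") -> str:
    """Hypothetical helper: pick an output file name for one raster row.

    path_value stands for the value read from the pathField column, if any.
    file_extension stands for the fileExtension option.
    """
    # Fall back to a random UUID when no pathField value is available.
    stem = path_value if path_value else str(uuid.uuid4())
    # Assumption: the extension is appended as-is to the chosen name.
    return f"{stem}{file_extension}"


print(raster_file_name("tile_001"))          # name taken from pathField
print(raster_file_name("tile_001", ".asc"))  # custom fileExtension
print(raster_file_name())                    # UUID-based name
```

The UUID fallback keeps file names unique, but it drops the original name that downstream libraries may rely on, which is why pathField matters.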

The future plan of Sedona raster data sources:

  1. We will deprecate the current Sedona built-in geotiff data source in 1.4.1 (around 1 month from now) as it is not convenient for our upcoming raster operations. I plan to completely remove it in 1.5.0 (I guess around 4 months from now). To make this happen, a few other PRs that bridge Array[Double] and RasterUDT must be made. My idea is to completely get rid of Array[Double] and migrate Array[Double] based RS functions to RasterUDT environment. But let's see.
  2. With this PR in place, we will start to use (1) the binaryFile data source and RS_FromXXX for reading rasters, and (2) RS_AsXXX and the raster data source for writing rasters. This gives us much more flexibility and introduces opportunities for bringing in a flurry of new raster functions.
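
A rough PySpark sketch of the read and write path from item 2, under stated assumptions: `rewrite_geotiffs` is a hypothetical helper, `spark` is assumed to be a session with Sedona's SQL functions registered, and `RS_FromGeoTiff` is assumed as the concrete RS_FromXXX function. The option names are the ones listed in this PR.

```python
def rewrite_geotiffs(spark, input_dir, output_dir):
    """Hypothetical round trip: binary files -> rasters -> binary files.

    Assumes `spark` is a SparkSession with Sedona SQL functions registered.
    """
    # (1) Read: Spark's binaryFile source yields `path` and binary `content` columns.
    binary_df = spark.read.format("binaryFile").load(input_dir)

    # Decode the bytes into a raster column (RS_FromGeoTiff is an assumed name
    # standing in for RS_FromXXX).
    raster_df = binary_df.selectExpr("path", "RS_FromGeoTiff(content) AS raster")

    # (2) Write: encode the raster back to bytes with RS_AsGeoTiff ...
    encoded_df = raster_df.selectExpr("path", "RS_AsGeoTiff(raster) AS raster_binary")

    # ... then hand the binary column to the raster data source.
    (encoded_df.write.format("raster")
        .option("rasterField", "raster_binary")
        .option("pathField", "path")
        .option("fileExtension", ".tiff")
        .mode("overwrite")
        .save(output_dir))
    return encoded_df
```

Any raster operations would go between the decode and encode steps. How the pathField values map to output file names is up to the data source and is not shown here.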

TODO:

Sedona R needs a corresponding data_interface function, spark_write_raster.

How was this patch tested?

Added new unit tests.

Did this PR include necessary documentation updates?

@jiayuasu
Member Author

@umartin Hi Martin, the design of this PR is slightly different from your initial proposal in https://issues.apache.org/jira/browse/SEDONA-269 . This design does not need RS_AsXXX functions. It directly takes RasterUDT and generates raster images. Do you see any potential risk here?

@gregleleu
Contributor

@jiayuasu I can take a look at the R part

@umartin
Contributor

umartin commented May 11, 2023

Hi, I would prefer a plain binary data source and separate RS_AsXXX functions.

There are several benefits to separating the raster formats from the data source:

  • The raster formats are useful for transferring rasters between different systems. For example, I could use RS_AsXXX and write the raster to PostGIS, Parquet, or Kafka for further processing outside Sedona and Spark. That means we would need the RS_AsXXX functions anyway.
  • Different formats require different parameters. Since ArcGrid is single-layered, we might want a parameter to select which layer of a multi-layer raster to convert. With GeoTIFF, we might have parameters for compression level and compression codec. Doing both the conversion and the binary file writing in the data source will lead to a messy API when there are many formats with wildly different parameters.
  • The data source API in Spark is constantly evolving. We might want to minimize the exposure to the API by keeping the data source simple.
  • We can create Flink bindings for the RS_AsXXX functions as well.

@jiayuasu
Member Author

@umartin Thank you for the great suggestion. I will make changes accordingly and add RS_AsXXX as well.

@gregleleu Thank you for your help! Please wait for now. I will make some changes based on Martin's comment, then you can add new functions to Sedona R :-)

@jiayuasu changed the title from "[SEDONA-269] Add data source for writing raster files" to "[SEDONA-269] Add Raster data source and RS_AsGeoTiff and RS_AsArcGrid" on May 14, 2023
@jiayuasu merged commit 43f8624 into master on May 14, 2023
@jiayuasu deleted the geotiff-enhance branch on May 14, 2023 21:13