Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

try 1-N-N performance tuning with LATERAL subquery #1280

Open
MyqueWooMiddo opened this issue Mar 21, 2024 · 2 comments
Open

try 1-N-N performance tuning with LATERAL subquery #1280

MyqueWooMiddo opened this issue Mar 21, 2024 · 2 comments

Comments

@MyqueWooMiddo
Copy link

Expected behavior

reference to https://postgis.net/workshops/postgis-intro/knn.html

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-subquery.html

I upgrade spark to 3.5.1 , try LATERAL to calculate 1-N-N (1-Nearest-Neighbour)

I'll get point's 1-N-N inside the same table : data_points(id,longitude,latitude) ,use sedona

Actual behavior

spark do not support this type LATERAL

Steps to reproduce the problem

with t_data as (
select id ,st_point(longitude,latitude) as point from data_points order by 1 limit 1000
)
select * from t_data t1, lateral (
select t2.id,ST_DistanceSpheroid(t1.point,t2.point) as distance from t_data t2
where t1.id!=t2.id order by 2 limit 1
)

Spark throws :
"org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationProject"

I just want to know How can optimize 1-N-N in a large dataset rather than row_number(order by distance) = 1

Settings

Sedona version = 1.5.1

Apache Spark version = 3.5.1

API type = Scala

Scala version = 2.12

JRE version = 1.8

Environment = Standalone

@jiayuasu
Copy link
Member

All NN join or KNN join is not currently supported in Apache Sedona. We will add the support in one or two months.

@MyqueWooMiddo
Copy link
Author

All NN join or KNN join is not currently supported in Apache Sedona. We will add the support in one or two months.

I think solution with iteral H3 of databricks Mosaic is a good idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants