
GeometryType(geom) triggered an exception #1263

Open · ruanqizhen opened this issue Mar 1, 2024 · 6 comments

ruanqizhen commented Mar 1, 2024

Expected behavior

When I call "GeometryType(geom)" to my table, it triggered a java.lang.NullPointerException exception. But "ST_GeometryType(geom)" works as expected.

It just crashed, so I couldn't find out which row caused the problem.
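For reference, the call pattern is roughly the following (a minimal sketch; spark, my_table, and geom are placeholder names, not my real ones):

```python
# Hypothetical minimal reproduction; assumes Sedona's SQL functions are
# already registered on the SparkSession (they are in my environment).
df = spark.sql("SELECT GeometryType(geom) AS gtype FROM my_table")
df.show()  # the .show() action is what raises the Py4JJavaError / NPE
```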

The stack trace:

```

  File "<stdin>", line 1, in <module>
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 423, in show
    print(self._jdf.showString(n, int_truncate, vertical))
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o128.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 23.0 failed 4 times, most recent failure: Lost task 3.3 in stage 23.0 (TID 71) ([2600:1f13:b65:2706:d7b:30e2:5cb1:5cfa] executor 19): java.lang.NullPointerException

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2610)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2559)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2558)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2558)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1200)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1200)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1200)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2798)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2740)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2729)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:154)
	at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
	at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
	at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:241)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:240)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:509)
	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:471)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3779)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2769)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3770)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3768)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2769)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2976)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:289)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:328)
	at sun.reflect.GeneratedMethodAccessor157.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
```

Settings

Sedona version = 1.5
Apache Spark version = 3.2
Apache Flink version = ?
API type = Python
Python version = 3.10
Environment = Amazon Athena

ruanqizhen changed the title from "How to ignore the errors?" to "GeometryType(geom) triggered an exception" on Mar 2, 2024
jiayuasu (Member) commented Mar 2, 2024

@ruanqizhen can you show me the full stack trace? The one after Caused by: java.lang.NullPointerException?

ruanqizhen (Author) commented

> @ruanqizhen can you show me the full stack trace? The one after Caused by: java.lang.NullPointerException?

This is the entire stack trace that was returned; Caused by: java.lang.NullPointerException is the last line.

jiayuasu (Member) commented Mar 3, 2024

@ruanqizhen I think GeometryType might have nondeterministic behavior on invalid geometries. We need to investigate this issue further.

Can you run ST_IsValid on your geometry column? Are there any invalid geometries? If you remove those and run GeometryType again, do you still see the same problem?
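Something along these lines would surface the suspect rows (a sketch only; my_table and geom are placeholders, and the geom IS NULL clause is an extra guard, since ST_IsValid returns NULL for NULL input):

```python
# Sketch: surface NULL or invalid geometries before calling GeometryType.
invalid = spark.sql("""
    SELECT *
    FROM my_table
    WHERE geom IS NULL OR NOT ST_IsValid(geom)
""")
print(invalid.count())        # how many suspect rows there are
invalid.show(truncate=False)  # inspect a few of them
```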

ruanqizhen (Author) commented

> @ruanqizhen I think GeometryType might have nondeterministic behavior on invalid geometries. We need to investigate this issue further.
>
> Can you run ST_IsValid on your geometry column? Are there any invalid geometries? If you remove those and run GeometryType again, do you still see the same problem?

It is not likely caused by invalid geometries, because I've already called ST_MakeValid() on all of them.
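For context, the repair step looks roughly like this (a sketch with placeholder names):

```python
from pyspark.sql.functions import expr

# Sketch: repair every geometry in place before any further processing.
df = df.withColumn("geom", expr("ST_MakeValid(geom)"))
```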

jiayuasu (Member) commented Mar 4, 2024

@ruanqizhen ST_MakeValid might not work for all cases. Another way to fix geometries is ST_Buffer(geom, 0).

If both methods still cannot fix the issue, I think you can stick with ST_GeometryType. If you don't want the ST_ prefix in the result, you can easily write a PySpark UDF that splits the string on _ and keeps the second half.
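A sketch of such a UDF (untested; strip_st_prefix and the column names are made-up placeholders):

```python
from pyspark.sql.functions import expr, udf
from pyspark.sql.types import StringType

# Sketch: turn Sedona's "ST_Polygon"-style output into "Polygon".
@udf(returnType=StringType())
def strip_st_prefix(geometry_type):
    if geometry_type is None:
        return None
    return geometry_type.split("_", 1)[-1]

df = df.withColumn("gtype", strip_st_prefix(expr("ST_GeometryType(geom)")))
```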

ruanqizhen (Author) commented

> @ruanqizhen ST_MakeValid might not work for all cases. Another way to fix geometries is ST_Buffer(geom, 0).
>
> If both methods still cannot fix the issue, I think you can stick with ST_GeometryType. If you don't want the ST_ prefix in the result, you can easily write a PySpark UDF that splits the string on _ and keeps the second half.

I'm using ST_GeometryType now, but I'm wondering whether there is a way to find out which row caused the problem. Is there a way to "try", or to skip the error, so the query can continue processing the other rows and I can then check which rows were skipped?
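In the meantime, this is the kind of filter I'm experimenting with to narrow down the bad rows (a sketch only; whether NULL or empty geometries are actually the cause here is unconfirmed):

```python
# Sketch: pull out the usual NullPointerException suspects
# (NULL or empty geometries) for manual inspection.
suspects = spark.sql("""
    SELECT *
    FROM my_table
    WHERE geom IS NULL OR ST_IsEmpty(geom)
""")
suspects.show(truncate=False)
```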
