[Bug] Spark Loader example schema and struct mismatch #501

Closed
1 task done
liuxiaocs7 opened this issue Aug 3, 2023 · 1 comment · Fixed by #504
Labels
bug Something isn't working

liuxiaocs7 commented Aug 3, 2023

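Bug Type

exception / error

The current Spark Loader example does not run as shipped: schema.groovy defines the 'person' and 'software' vertex labels with primary keys, but struct.json also maps an "id" field for both, so vertex loading fails with the exception shown under "Expected & Actual behavior".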

Before submitting

  • I have searched the issues and found no similar ones.

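Environment

Expected & Actual behavior

Expected: the Spark example loads the sample data. Actual: the vertex tasks fail on the executor with the following exceptions, one per vertex label: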

java.lang.IllegalStateException: The id field must be empty or null when id strategy is 'PRIMARY_KEY' for vertex label 'software'
        at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:544)
        at org.apache.hugegraph.util.E.checkState(E.java:64)
        at org.apache.hugegraph.loader.builder.VertexBuilder.checkIdField(VertexBuilder.java:98)
        at org.apache.hugegraph.loader.builder.VertexBuilder.<init>(VertexBuilder.java:46)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.initPartition(HugeGraphSparkLoader.java:201)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.lambda$null$18e75a97$1(HugeGraphSparkLoader.java:155)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2(Dataset.scala:2923)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2$adapted(Dataset.scala:2923)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
23/08/03 23:36:07 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalStateException: The id field must be empty or null when id strategy is 'PRIMARY_KEY' for vertex label 'person'
        at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:544)
        at org.apache.hugegraph.util.E.checkState(E.java:64)
        at org.apache.hugegraph.loader.builder.VertexBuilder.checkIdField(VertexBuilder.java:98)
        at org.apache.hugegraph.loader.builder.VertexBuilder.<init>(VertexBuilder.java:46)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.initPartition(HugeGraphSparkLoader.java:201)
        at org.apache.hugegraph.loader.spark.HugeGraphSparkLoader.lambda$null$18e75a97$1(HugeGraphSparkLoader.java:155)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2(Dataset.scala:2923)
        at org.apache.spark.sql.Dataset.$anonfun$foreachPartition$2$adapted(Dataset.scala:2923)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
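
The trace points at a consistency check between the vertex label's ID strategy and the "id" entry of the mapping. The following is a minimal sketch of such a check, not the loader's actual code: the class IdFieldCheck, its nested enum, and the method signature are hypothetical, and the CUSTOMIZE_STRING branch is an assumption. Only the PRIMARY_KEY rule and its message are taken from the trace above.

```java
final class IdFieldCheck {

    // Hypothetical stand-in for the ID strategies; only two are modeled.
    enum IdStrategy { PRIMARY_KEY, CUSTOMIZE_STRING }

    private IdFieldCheck() {}

    /**
     * Rejects mappings whose "id" entry contradicts the label's ID strategy.
     *
     * @param label    vertex label name, e.g. "person"
     * @param strategy ID strategy declared for the label in the schema
     * @param idField  value of "id" in the mapping, or null if absent
     */
    static void checkIdField(String label, IdStrategy strategy, String idField) {
        boolean hasIdField = idField != null && !idField.isEmpty();
        // Rule reported in the trace: primary-key labels derive the vertex ID
        // themselves, so the mapping must not name an id field.
        if (strategy == IdStrategy.PRIMARY_KEY && hasIdField) {
            throw new IllegalStateException(String.format(
                    "The id field must be empty or null when id strategy "
                    + "is '%s' for vertex label '%s'", strategy, label));
        }
        // Assumption: a customized ID strategy needs an explicit id field.
        if (strategy == IdStrategy.CUSTOMIZE_STRING && !hasIdField) {
            throw new IllegalStateException(String.format(
                    "The id field can't be empty or null when id strategy "
                    + "is '%s' for vertex label '%s'", strategy, label));
        }
    }

    public static void main(String[] args) {
        // Same combination as the example: primaryKeys("name") plus "id": "name".
        try {
            checkIdField("person", IdStrategy.PRIMARY_KEY, "name");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // Without the "id" entry the check passes silently.
        checkIdField("person", IdStrategy.PRIMARY_KEY, null);
    }
}
```

Under this reading, both exceptions above have the same cause: each vertex mapping sets "id": "name" for a label whose schema uses primaryKeys("name").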

Vertex/Edge example

No response

Schema [VertexLabel, EdgeLabel, IndexLabel]

From this file, executed through the client: https://github.com/apache/incubator-hugegraph-toolchain/blob/master/hugegraph-loader/assembly/static/example/spark/schema.groovy

  // Define schema
  schema.propertyKey("name").asText().ifNotExist().create();
  schema.propertyKey("age").asInt().ifNotExist().create();
  schema.propertyKey("city").asText().ifNotExist().create();
  schema.propertyKey("weight").asDouble().ifNotExist().create();
  schema.propertyKey("lang").asText().ifNotExist().create();
  schema.propertyKey("date").asText().ifNotExist().create();
  schema.propertyKey("price").asDouble().ifNotExist().create();

  schema.vertexLabel("person")
          .properties("name", "age", "city")
          .primaryKeys("name")
          .nullableKeys("age", "city")
          .ifNotExist()
          .create();

  schema.vertexLabel("software")
          .properties("name", "lang", "price")
          .primaryKeys("name")
          .ifNotExist()
          .create();

  schema.edgeLabel("knows")
          .sourceLabel("person")
          .targetLabel("person")
          .properties("date", "weight")
          .ifNotExist()
          .create();

  schema.edgeLabel("created")
          .sourceLabel("person")
          .targetLabel("software")
          .properties("date", "weight")
          .ifNotExist()
          .create();

Input source mapping from this file: https://github.com/apache/incubator-hugegraph-toolchain/blob/master/hugegraph-loader/assembly/static/example/spark/struct.json

backendStoreInfo was removed so that the RocksDB backend running in Docker is used.

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "example/spark/vertex_person.json",
        "format": "JSON",
        "header": ["name", "age", "city"],
        "charset": "UTF-8",
        "skipped_line": {
          "regex": "(^#|^//).*"
        }
      },
      "id": "name",
      "null_values": ["NULL", "null", ""]
    },
    {
      "label": "software",
      "input": {
        "type": "file",
        "path": "example/spark/vertex_software.json",
        "format": "JSON",
        "header": ["id","name", "lang", "price","ISBN"],
        "charset": "GBK"
      },
      "id": "name",
      "ignored": ["ISBN"]
    }
  ],
  "edges": [
    {
      "label": "knows",
      "source": ["source_name"],
      "target": ["target_name"],
      "input": {
        "type": "file",
        "path": "example/spark/edge_knows.json",
        "format": "JSON",
        "date_format": "yyyyMMdd",
        "header": ["source_name","target_name", "date", "weight"]
      },
      "field_mapping": {
        "source_name": "name",
        "target_name": "name"
      }
    }
  ]
}
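
Because the failure only surfaces inside foreachPartition on the executors, a driver-side pre-flight check could report the mismatch before any Spark task starts. The sketch below is an illustration under stated assumptions: StructPreflight and conflictingLabels are hypothetical names, and the struct and schema are reduced by hand to a label-to-"id" map and a set of primary-key labels rather than parsed from the real files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StructPreflight {

    private StructPreflight() {}

    /**
     * Returns the labels whose mapping sets an "id" field although the
     * schema declares primary keys for them.
     *
     * @param idFieldByLabel   label -> value of "id" in the struct (null if absent)
     * @param primaryKeyLabels labels created with primaryKeys(...) in the schema
     */
    static List<String> conflictingLabels(Map<String, String> idFieldByLabel,
                                          Set<String> primaryKeyLabels) {
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, String> entry : idFieldByLabel.entrySet()) {
            String idField = entry.getValue();
            boolean hasIdField = idField != null && !idField.isEmpty();
            if (hasIdField && primaryKeyLabels.contains(entry.getKey())) {
                conflicts.add(entry.getKey());
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        // Hand-copied from the struct.json above: both vertices set "id": "name".
        Map<String, String> struct = new LinkedHashMap<>();
        struct.put("person", "name");
        struct.put("software", "name");

        // Hand-copied from schema.groovy: both labels use primaryKeys("name").
        Set<String> primaryKeyLabels =
                new HashSet<>(Arrays.asList("person", "software"));

        System.out.println(conflictingLabels(struct, primaryKeyLabels));
        // prints [person, software]
    }
}
```

Two repairs would make the files agree: delete the "id" entries from the two vertex mappings, or keep them and declare the labels with useCustomizeStringId() instead of primaryKeys("name") in the schema. Which of these the fix in #504 adopted is not stated in this issue.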