chore: improve spark parallel #450
Conversation
LOG.info("\n Start to load data using spark bulkload \n");
// gen-hfile
HBaseDirectLoader directLoader = new HBaseDirectLoader(loadOptions, struct,
                                                       loadDistributeMetrics);
The loadDistributeMetrics here looks strange to me: this code runs inside an operator, so my understanding is that the operator only gets a copy of this object. How does Spark get the value back into the driver?
LoadDistributeMetrics uses Spark accumulators internally, which aggregate the executors' values back to the driver:
https://github.com/apache/incubator-hugegraph-toolchain/blob/master/hugegraph-loader/src/main/java/org/apache/hugegraph/loader/metrics/LoadDistributeMetrics.java#L54
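The accumulator pattern described above can be sketched without Spark (a minimal analogy, assuming a `LongAdder` standing in for Spark's `AccumulatorV2`): parallel workers play the role of executors and add locally, while the caller, like the driver, reads only the merged total.

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

// Simplified analogy for how LoadDistributeMetrics aggregates values:
// each parallel worker ("executor") increments locally, and the caller
// ("driver") reads the merged sum afterwards.
public class AccumulatorSketch {
    public static long countRows(List<List<String>> partitions) {
        LongAdder loaded = new LongAdder();           // plays the role of the accumulator
        partitions.parallelStream()                   // workers stand in for executors
                  .forEach(p -> p.forEach(row -> loaded.increment()));
        return loaded.sum();                          // the "driver" reads the merged value
    }

    public static void main(String[] args) {
        long total = countRows(List.of(
                List.of("a", "b"), List.of("c"), List.of("d", "e", "f")));
        System.out.println("rows loaded: " + total);  // prints 6
    }
}
```

In real Spark the driver registers the accumulator, serialized copies travel with the tasks, and the executors' partial sums are merged back on task completion; the `LongAdder` above only mirrors that add-locally/merge-centrally contract.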
Codecov Report
@@ Coverage Diff @@
## master #450 +/- ##
============================================
- Coverage 62.57% 62.52% -0.05%
+ Complexity 1867 894 -973
============================================
Files 260 91 -169
Lines 9418 4395 -5023
Branches 872 516 -356
============================================
- Hits 5893 2748 -3145
+ Misses 3143 1444 -1699
+ Partials 382 203 -179
... and 169 files with indirect coverage changes
LoadContext context = initPartition(this.loadOptions, struct);
p.forEachRemaining((Row row) -> {
    loadRow(struct, row, p, context);
Future<?> future = Executors.newCachedThreadPool().submit(() -> {
As I understand it, the concurrency here should no longer have thread-safety issues.
The cached thread pool here is used, as I understand it, to load multiple files in parallel: each file is submitted as its own job and generates its own DAG, and Spark then schedules the concrete tasks, so I did not bound the pool size here.
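The one-task-per-file pattern described above can be sketched as follows (a minimal sketch, assuming a hypothetical loadStruct method standing in for the real per-file bulk-load logic): a cached pool grows to match the number of files, so all jobs are submitted concurrently and the downstream scheduler, not the pool size, bounds the actual work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of submitting one load job per input file on a cached thread pool.
// loadStruct is a hypothetical placeholder for the real bulk-load of one file.
public class ParallelLoadSketch {
    static String loadStruct(String file) {
        return "loaded:" + file;                      // placeholder for per-file load work
    }

    public static List<String> loadAll(List<String> files) {
        ExecutorService pool = Executors.newCachedThreadPool();
        List<Future<String>> futures = new ArrayList<>();
        for (String f : files) {
            futures.add(pool.submit(() -> loadStruct(f)));  // one job (DAG) per file
        }
        List<String> results = new ArrayList<>();
        for (Future<String> fut : futures) {
            try {
                results.add(fut.get());               // block until every file finishes
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) {
        System.out.println(loadAll(List.of("a.csv", "b.csv")));
    }
}
```

One caveat with the pattern as written in the diff: calling `Executors.newCachedThreadPool()` inside the submit site creates a fresh pool per call, so a single shared pool (as above) is usually preferable even when the pool size is intentionally unbounded.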
No description provided.