Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock problem when using elasticsearch-client-7.0.0 #5775

Merged
merged 8 commits into from
Nov 3, 2020
Merged

Fix deadlock problem when using elasticsearch-client-7.0.0 #5775

merged 8 commits into from
Nov 3, 2020

Conversation

zifeihan
Copy link
Member

@zifeihan zifeihan commented Nov 2, 2020

Fix deadlock problem when using es

  • Add a unit test to verify that the fix works.
  • Explain briefly about why the bug exists and how to fix it.

Hello, a few days ago, there was a problem that es data could not be written in our production environment. We dump the thread stack of the production environment. After analysis, it is found that es client has a deadlock problem when writing data.thread dump thread dump. In order to solve the problem, I checked many related materials, and found that this problem has been fixed in elasticsearch-client version 7.5.0. Prevent deadlock by using separate schedulers
so i update the elasticsearch client 7.0.0 to elasticsearch client 7.5.0.
In order to avoid es compatibility issues, I use elasticsearch server 7.0.0 as data storage. Tested whether the ui function is normal, and checked whether the OAP is abnormal during operation. The test results show that everything is normal. and the current elasticsearch server in our production environment is 7.2.0, and we are also using elasticsearch client 7.5.0. It has been running stably for 3 days without any compatibility issues.

Comment on lines +40 to +43
lang-mustache-client-7.5.0.jar
mapper-extras-client-7.5.0.jar
parent-join-client-7.5.0.jar
rank-eval-client-7.5.0.jar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to update LICENSE including

  1. Versions of lang-mustache-client and parent-join-client
  2. Add mapper-extras-client with license as we missed before
  3. Add rank-eval-client with license as we missed before, both 6.3.2(es6 package) and 7.5.0(you updated)

@wu-sheng wu-sheng added the backend OAP backend related. label Nov 2, 2020
@wu-sheng wu-sheng added this to the 8.3.0 milestone Nov 2, 2020
@wu-sheng
Copy link
Member

wu-sheng commented Nov 2, 2020

@kezhenxu94 What ES version do we use to test?

Copy link
Member

@wu-sheng wu-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you missed the change log update.

@codecov
Copy link

codecov bot commented Nov 2, 2020

Codecov Report

Merging #5775 into master will decrease coverage by 6.68%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #5775      +/-   ##
============================================
- Coverage     51.51%   44.83%   -6.69%     
+ Complexity     3465     2205    -1260     
============================================
  Files          1642     1632      -10     
  Lines         35112    34820     -292     
  Branches       3833     3985     +152     
============================================
- Hits          18089    15611    -2478     
- Misses        16125    18338    +2213     
+ Partials        898      871      -27     
Impacted Files Coverage Δ Complexity Δ
...ywalking/apm/agent/core/plugin/EnhanceContext.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-5.00%)
...walking/apm/agent/core/plugin/match/NameMatch.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-3.00%)
...ement/ui/template/UITemplateManagementService.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-6.00%)
...telemetry/prometheus/PrometheusMetricsCreator.java 0.00% <0.00%> (-100.00%) 0.00% <0.00%> (-4.00%)
.../apm/network/trace/component/ComponentsDefine.java 0.00% <0.00%> (-98.79%) 0.00% <0.00%> (-1.00%)
...g/oap/server/telemetry/prometheus/BaseMetrics.java 0.00% <0.00%> (-90.33%) 0.00% <0.00%> (-9.00%)
...analyzer/prometheus/PrometheusMetricConverter.java 0.00% <0.00%> (-90.15%) 0.00% <0.00%> (-20.00%)
...ing/oap/server/library/util/ProtoBufJsonUtils.java 0.00% <0.00%> (-90.00%) 0.00% <0.00%> (-2.00%)
...tworkalias/NetworkAddressAliasSetupDispatcher.java 11.11% <0.00%> (-88.89%) 1.00% <0.00%> (-1.00%)
...nt/kafka/provider/handler/MeterServiceHandler.java 0.00% <0.00%> (-86.67%) 0.00% <0.00%> (-6.00%)
... and 219 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b041ecb...1024e7e. Read the comment docs.

@zifeihan
Copy link
Member Author

zifeihan commented Nov 2, 2020

And you missed the change log update.

done.

@kezhenxu94
Copy link
Member

@kezhenxu94 What ES version do we use to test?

6.3.2 and 7.4.2

@wu-sheng
Copy link
Member

wu-sheng commented Nov 2, 2020

@zifeihan CI logs show, the dependency changes are not accurate, please recheck. You could use the same scripts we used in the CI, https://github.com/apache/skywalking/tree/master/tools/dependencies.

@zifeihan
Copy link
Member Author

zifeihan commented Nov 3, 2020

@zifeihan CI logs show, the dependency changes are not accurate, please recheck. You could use the same scripts we used in the CI, https://github.com/apache/skywalking/tree/master/tools/dependencies.

done.

kezhenxu94
kezhenxu94 previously approved these changes Nov 3, 2020
Copy link
Member

@kezhenxu94 kezhenxu94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wu-sheng
Copy link
Member

wu-sheng commented Nov 3, 2020

@zifeihan I have asked @kezhenxu94 to adjust e2e to text 7.0.0 and higher version, like 7.9.3(latest) to verify the compatibility. Let's wait for that.

@zifeihan
Copy link
Member Author

zifeihan commented Nov 3, 2020

@zifeihan I have asked @kezhenxu94 to adjust e2e to text 7.0.0 and higher version, like 7.9.3(latest) to verify the compatibility. Let's wait for that.

ok, got it.

Copy link
Member

@wu-sheng wu-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several other libs updated in this according to known-oap-backend-dependencies-es7.txt change log.
So, the LICENSE still needs some polish. In the current, we missed ES7 dependency version, such as hppc 0.7.1, which is ES6 depending.

@zifeihan Could you help on adding the ES6 dependency versions to the LICENSE file like I commented for the ES client.

dist-material/release-docs/LICENSE Show resolved Hide resolved
Copy link
Member

@wu-sheng wu-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is much better now.

@wu-sheng wu-sheng merged commit a9d756c into apache:master Nov 3, 2020
@zifeihan zifeihan deleted the es_client7_update branch November 4, 2020 02:42
@libinglong
Copy link
Contributor

libinglong commented Mar 3, 2021

Notice this issue may cause the log "Grpc server thread pool is full, rejecting the task".

@zifeihan
Copy link
Member Author

zifeihan commented Mar 3, 2021

"Grpc server thread pool is full, rejecting the task

Thanks submit this, why it cause this error log? our production environment processes tens of billions of data every day, and we have never found such logs.

@libinglong
Copy link
Contributor

libinglong commented Mar 3, 2021

"Grpc server thread pool is full, rejecting the task

Thanks submit this, why it cause this error log? our production environment processes tens of billions of data every day, and we have never found such logs.

I found a lot of such logs in the oap log.
The thread stacks show all threads of grpcServerPool are waiting for a lock.

"grpcServerPool-1-thread-14" #360 prio=5 os_prio=0 tid=0x00007f1bb8041800 nid=0x104d6 waiting for monitor entry [0x00007f219d2f1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:319)
	- waiting to lock <0x00000000e2c38d28> (a org.elasticsearch.action.bulk.BulkProcessor)
	at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:304)
	at org.elasticsearch.action.bulk.BulkProcessor.add(BulkProcessor.java:290)
	at org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.BatchProcessEsDAO.asynchronous(BatchProcessEsDAO.java:56)

@libinglong
Copy link
Contributor

libinglong commented Mar 3, 2021

@zifeihan I don't mean our scene is the same. The bug of the es client also cause other thread-block problems which this merge will fix.

@zifeihan
Copy link
Member Author

zifeihan commented Mar 3, 2021

@libinglong

@zifeihan I don't mean our scene is the same. The bug of the es client also cause other thread-block problems which this merge will fix.

Have you upgraded to version 8.3.0 or higher? If so, please dump thread to analyze the root cause of the problem. If not, I think this problem may be caused by the deadlock of es-client.

@libinglong
Copy link
Contributor

libinglong commented Mar 3, 2021

@libinglong

@zifeihan I don't mean our scene is the same. The bug of the es client also cause other thread-block problems which this merge will fix.

Have you upgraded to version 8.3.0 or higher? If so, please dump thread to analyze the root cause of the problem. If not, I think this problem may be caused by the deadlock of es-client.

I will upgrade to 8.3.0 soon. Yes, I am sure It's the problem of es-client.
I add the moment above as many people will search the warn log in the issues. This will help them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend OAP backend related.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants