HDFS-17580. Change the default value of dfs.datanode.lock.fair to false due to potential hang #6943

Open
wants to merge 1 commit into trunk

Conversation

hfutatzhanghb
Contributor

…se due to potential hang

Description of PR

How was this patch tested?

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?
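
For context, the setting this PR changes is the dfs.datanode.lock.fair property named in the title. A minimal sketch (not part of this PR) of how a cluster could pin the fairness behaviour explicitly in hdfs-site.xml, independent of whichever default ships:

```xml
<!-- Sketch only: explicitly pins the DataNode dataset lock fairness,
     so the behaviour does not depend on the shipped default. -->
<property>
  <name>dfs.datanode.lock.fair</name>
  <value>false</value>
</property>
```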

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------|:--------|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | docker | 0m 25s | | Docker failed to build run-specific yetus/hadoop:tp-28288}. |

| Subsystem | Report/Notes |
|----------:|:-------------|
| GITHUB PR | #6943 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6943/1/console |
| versions | git=2.34.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@Hexiaoqiao
Contributor

What happened here?

@hfutatzhanghb
Contributor Author

What happened here?

@Hexiaoqiao Sir, we hit a corner case last week where a DataNode hung because of one abnormal NVMe SSD disk.

One thread got stuck in the stack below because of the NVMe SSD exception.

"DataXceiver for client DFSClient_NONMAPREDUCE_1772448723_85 at /x.x.x.x:62528 [Receiving block BP-1169917699-x.x.x.x-1678688680604:blk_18858524775_17785113843]" #46490764 daemon prio=5 os_prio=0 tid=0x00007f79602ad800 nid=0xb692
 runnable [0x00007f79239c0000]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
        at java.io.File.exists(File.java:819)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.exists(FileIoProvider.java:805)
        at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:62)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:389)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:946)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbw(FsVolumeImpl.java:1228)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1500)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:221)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1372)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:176)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:110)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:314)
        at java.lang.Thread.run(Thread.java:748)

This caused the read/write block throughput of this DataNode to drop to zero:

[screenshot: DataNode read/write block throughput dropping to zero]

After diving into the code, we found that if one thread gets stuck while holding the dataset lock (even a BP read lock), it can cause other threads to wait forever in the AQS queue to acquire the lock.

We can refer to the java.util.concurrent.locks.ReentrantReadWriteLock.FairSync#readerShouldBlock method.
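
To make the queuing behaviour concrete, here is a minimal standalone sketch using plain JDK classes (not DataNode code; the thread names and timings are illustrative): one reader never releases the read lock, a writer queues behind it, and with a fair ReentrantReadWriteLock a later read-lock attempt also parks in the AQS queue, because FairSync#readerShouldBlock returns true whenever any thread is queued ahead (hasQueuedPredecessors).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairReadLockQueueDemo {
  public static void main(String[] args) throws Exception {
    // fair = true roughly corresponds to dfs.datanode.lock.fair=true (the current default).
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // Simulates the DataXceiver stuck in disk I/O while holding the read lock.
    Thread stuckReader = new Thread(() -> {
      lock.readLock().lock();
      try {
        Thread.sleep(Long.MAX_VALUE); // never returns, so the read lock is never released
      } catch (InterruptedException ignored) {
      } finally {
        lock.readLock().unlock();
      }
    }, "stuck-reader");
    stuckReader.setDaemon(true);
    stuckReader.start();
    Thread.sleep(200);

    // A writer (e.g. some dataset-mutating operation) queues behind the stuck reader.
    Thread writer = new Thread(() -> lock.writeLock().lock(), "queued-writer");
    writer.setDaemon(true);
    writer.start();
    Thread.sleep(200);

    // A new reader honours fairness: FairSync#readerShouldBlock sees the queued
    // writer (hasQueuedPredecessors) and the timed tryLock parks, then gives up.
    boolean acquired = lock.readLock().tryLock(2, TimeUnit.SECONDS);
    System.out.println("new reader acquired read lock: " + acquired); // prints false
  }
}
```

With the non-fair variant, readerShouldBlock only checks whether a writer sits at the head of the wait queue, so readers can otherwise still barge in; that difference is presumably what the proposed default change is after.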

@Hexiaoqiao
Contributor

@hfutatzhanghb Thanks for the additional information. From the stack and screenshot we cannot be sure that it is blocked by the RWLock here; do you have any direct supporting evidence? Thanks again.

@ayushtkn
Member

I am not able to work out what the issue is here. If there is an issue that leads to a potential hang, we should fix that rather than change the default value.
