HDFS-17580. Change the default value of dfs.datanode.lock.fair to false due to potential hang #6943

Open
wants to merge 1 commit into trunk

Conversation

hfutatzhanghb
Contributor

…se due to potential hang

Description of PR

How was this patch tested?

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?
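
For context, the setting this PR changes is the dfs.datanode.lock.fair property named in the title. A minimal sketch (not part of this PR) of how a cluster could pin the fairness behaviour explicitly in hdfs-site.xml, independent of whichever default ships:

```xml
<!-- Sketch only: explicitly pins the DataNode dataset lock fairness,
     so the behaviour does not depend on the shipped default. -->
<property>
  <name>dfs.datanode.lock.fair</name>
  <value>false</value>
</property>
```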

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------|:--------|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | docker | 0m 25s | | Docker failed to build run-specific yetus/hadoop:tp-28288}. |

| Subsystem | Report/Notes |
|----------:|:-------------|
| GITHUB PR | #6943 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6943/1/console |
| versions | git=2.34.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@Hexiaoqiao
Contributor

What happened here?

@hfutatzhanghb
Contributor Author

What happened here?

@Hexiaoqiao Sir, we hit a corner case last week where a DataNode hung because of one abnormal NVMe SSD disk.

One thread got stuck in the stack below because of the NVMe SSD exception.

"DataXceiver for client DFSClient_NONMAPREDUCE_1772448723_85 at /x.x.x.x:62528 [Receiving block BP-1169917699-x.x.x.x-1678688680604:blk_18858524775_17785113843]" #46490764 daemon prio=5 os_prio=0 tid=0x00007f79602ad800 nid=0xb692
 runnable [0x00007f79239c0000]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
        at java.io.File.exists(File.java:819)
        at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.exists(FileIoProvider.java:805)
        at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:62)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:389)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:946)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbw(FsVolumeImpl.java:1228)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1500)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:221)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1372)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:176)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:110)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:314)
        at java.lang.Thread.run(Thread.java:748)

This caused the read/write block throughput of this DataNode to drop to zero:

[screenshot: DataNode read/write block throughput dropping to zero]

After diving into the code, we found that if one thread gets stuck while holding the dataset lock (even a BP read lock), it can cause other threads to wait forever in the AQS queue to acquire the lock.

We can refer to the java.util.concurrent.locks.ReentrantReadWriteLock.FairSync#readerShouldBlock method.
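
To make the queuing behaviour concrete, here is a minimal standalone sketch using plain JDK classes (not DataNode code; the thread names and timings are illustrative): one reader never releases the read lock, a writer queues behind it, and with a fair ReentrantReadWriteLock a later read-lock attempt also parks in the AQS queue, because FairSync#readerShouldBlock returns true whenever any thread is queued ahead (hasQueuedPredecessors).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairReadLockQueueDemo {
  public static void main(String[] args) throws Exception {
    // fair = true roughly corresponds to dfs.datanode.lock.fair=true (the current default).
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // Simulates the DataXceiver stuck in disk I/O while holding the read lock.
    Thread stuckReader = new Thread(() -> {
      lock.readLock().lock();
      try {
        Thread.sleep(Long.MAX_VALUE); // never returns, so the read lock is never released
      } catch (InterruptedException ignored) {
      } finally {
        lock.readLock().unlock();
      }
    }, "stuck-reader");
    stuckReader.setDaemon(true);
    stuckReader.start();
    Thread.sleep(200);

    // A writer (e.g. some dataset-mutating operation) queues behind the stuck reader.
    Thread writer = new Thread(() -> lock.writeLock().lock(), "queued-writer");
    writer.setDaemon(true);
    writer.start();
    Thread.sleep(200);

    // A new reader honours fairness: FairSync#readerShouldBlock sees the queued
    // writer (hasQueuedPredecessors) and the timed tryLock parks, then gives up.
    boolean acquired = lock.readLock().tryLock(2, TimeUnit.SECONDS);
    System.out.println("new reader acquired read lock: " + acquired); // prints false
  }
}
```

With the non-fair variant, readerShouldBlock only checks whether a writer sits at the head of the wait queue, so readers can otherwise still barge in; that difference is presumably what the proposed default change is after.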

@Hexiaoqiao
Contributor

@hfutatzhanghb Thanks for the additional information. From the stack and screenshot we cannot be sure that it is blocked by the RWLock here; do you have any direct supporting evidence? Thanks again.

@ayushtkn
Member

I am not able to work out what the issue is here. If there is an issue that leads to a potential hang, we should fix that rather than change the default value.
