[202311] [Mellanox] Disable SSD NCQ on Mellanox platforms #18040

Merged 1 commit on Feb 7, 2024

volodymyrsamotiy
Collaborator

Backport of PR #17567

Why I did it

Research indicates that some products may experience occasional I/O failures in communication between the CPU and the SSD when NCQ (Native Command Queuing) is enabled.
The problem appears to arise between certain kernel versions and certain SATA controllers.

Syslog error message examples:

  • Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
  • Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
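To check whether a system is hitting these errors, the log can be scanned for failed NCQ commands. The following is a minimal sketch, not part of this change: the helper name `count_ncq_failures` is hypothetical, and it matches only the "failed command: ... FPDMA QUEUED" text quoted above.

```shell
# count_ncq_failures (hypothetical helper): count log lines that report a
# failed NCQ command, i.e. READ or WRITE "FPDMA QUEUED".
# Log text is read from stdin.
count_ncq_failures() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # nothing matches, so "|| true" keeps the helper's exit status at zero.
    grep -c -E 'failed command: (READ|WRITE) FPDMA QUEUED' || true
}
```

Example usage, assuming the kernel messages are written to /var/log/syslog: `count_ncq_failures < /var/log/syslog`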

Some vendors have already disabled NCQ on their platforms in SONiC because of similar issues.

Similar issues have also been discussed on Debian and Ubuntu forums, where the suggested fix was to disable NCQ.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Add a kernel parameter to tell libata to disable NCQ
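The description does not name the parameter. The upstream kernel documents `libata.force=noncq` for this purpose, so the sketch below assumes that is the option in use. The helper name `has_noncq` is hypothetical; it only inspects a command-line string passed to it.

```shell
# has_noncq (hypothetical helper): succeed if the given kernel command line
# contains a libata.force entry that turns NCQ off.
# $1: kernel command line text (for example, the contents of /proc/cmdline)
has_noncq() {
    # Put each argument on its own line, then look for "noncq" as a whole
    # value in the comma-separated libata.force list. A value may carry a
    # port prefix such as "1.00:". "noncqtrim" is a different option and
    # must not match.
    printf '%s\n' "$1" | tr ' ' '\n' |
        grep -E -q '^libata\.force=([^ ]*[,:])?noncq(,|$)'
}
```

Example usage on a running system: `has_noncq "$(cat /proc/cmdline)" && echo "NCQ forced off"`. As a cross-check, /sys/block/sda/device/queue_depth (the device name sda is an assumption) normally reads 1 when NCQ is not in use.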

How to verify it

Run the FIO tool with the following options:

fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

Test results with NCQ enabled:

 READ: bw=128MiB/s (135MB/s), 128MiB/s-128MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1924-1924msec
WRITE: bw=131MiB/s (138MB/s), 131MiB/s-131MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1924-1924msec
…
 READ: bw=130MiB/s (136MB/s), 130MiB/s-130MiB/s (136MB/s-136MB/s), io=247MiB (259MB), run=1902-1902msec
WRITE: bw=133MiB/s (139MB/s), 133MiB/s-133MiB/s (139MB/s-139MB/s), io=253MiB (265MB), run=1902-1902msec
…
 READ: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=247MiB (259MB), run=1919-1919msec
WRITE: bw=132MiB/s (138MB/s), 132MiB/s-132MiB/s (138MB/s-138MB/s), io=253MiB (265MB), run=1919-1919msec

Test results with NCQ disabled:

 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2354-2354msec
WRITE: bw=107MiB/s (113MB/s), 107MiB/s-107MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2354-2354msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec
…
 READ: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=247MiB (259MB), run=2349-2349msec
WRITE: bw=108MiB/s (113MB/s), 108MiB/s-108MiB/s (113MB/s-113MB/s), io=253MiB (265MB), run=2349-2349msec
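Averaging the bandwidth figures above gives about 129 MiB/s read and 132 MiB/s write with NCQ enabled, against about 105 MiB/s read and 108 MiB/s write with NCQ disabled, a drop of roughly 18 to 19 percent in this test. A minimal sketch of that averaging follows; the helper name `mean_bw` is hypothetical, and it assumes every summary line reports bandwidth in MiB/s as the lines above do.

```shell
# mean_bw (hypothetical helper): average the "bw=<N>MiB/s" values of fio
# summary lines for one direction.
# $1: direction label, READ or WRITE; summary lines are read from stdin.
mean_bw() {
    awk -v dir="$1:" '
        $1 == dir {
            v = $2                  # e.g. "bw=128MiB/s"
            sub(/^bw=/, "", v)      # strip the key
            sub(/MiB\/s$/, "", v)   # strip the unit
            sum += v
            n++
        }
        END { if (n) printf "%.1f\n", sum / n }
    '
}
```

Example usage, assuming the fio output was saved to a file named ncq_on.txt: `mean_bw READ < ncq_on.txt`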

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305
  • 202311

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
@yxieca yxieca merged commit e13ef9d into sonic-net:202311 Feb 7, 2024
19 checks passed