Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add udev rule to create symlinks using EBS volumes' device names #3977

Merged
merged 4 commits into from
May 29, 2024

Conversation

arnaldo2792
Copy link
Contributor

@arnaldo2792 arnaldo2792 commented May 21, 2024

Issue number:

Closes #3975

Description of changes:


V2

The first two commits in the series add tools to interact with NVMe devices. They were used to query the device info instead of issuing ioctl syscalls, like in the first version of the PR.

The third commit extends ghostdog and adds the ebs-device-name command to emit the device name from the vendor data returned by nvme-cli.

The last commit adds two udev rules that use the value emitted by ghostdog to create the required symlinks.


V1

The first commit in the series extends ghostdog with a command to query the device using ioctl. The implementation was inspired by the script used in Amazon Linux and libnvme. Since the ioctl syscall is an actual C API, the structs used to interact with it include repr(C) so that the struct is in a compatible format for C. The Rustonomicon suggests the usage of rust-bindgen to generate rust bindings to interact with C APIs. However, adding the entire libnvme would be overkill so I added just the structures that are needed to query the device information we want.

For the created structures, I used kvm-bindings as a reference to understand how the structures should be configured, as well as how the Rust types are mapped to C types.

The second commit in the series updates the existing udev rules to create symlinks to the EBS volumes, for both the devices and the partitions, using the device name configured for the volume.

Testing done:

With aws-ecs-2, x86_64, I confirmed that the symlinks are created pointing at the correct locations. In my instance, xvda corresponds to the root volume (nvme0n1) and xvdb corresponds to the data volume (nvme1n1):

bash-5.1# ls -la /dev/xv*
lrwxrwxrwx. 1 root root  7 May 21 15:16 /dev/xvda -> nvme0n1
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda1 -> nvme0n1p1
lrwxrwxrwx. 1 root root 10 May 21 15:16 /dev/xvda10 -> nvme0n1p10
lrwxrwxrwx. 1 root root 10 May 21 15:16 /dev/xvda11 -> nvme0n1p11
lrwxrwxrwx. 1 root root 10 May 21 15:16 /dev/xvda12 -> nvme0n1p12
lrwxrwxrwx. 1 root root 10 May 21 15:16 /dev/xvda13 -> nvme0n1p13
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda2 -> nvme0n1p2
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda3 -> nvme0n1p3
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda4 -> nvme0n1p4
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda5 -> nvme0n1p5
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda6 -> nvme0n1p6
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda7 -> nvme0n1p7
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda8 -> nvme0n1p8
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvda9 -> nvme0n1p9
lrwxrwxrwx. 1 root root  7 May 21 15:16 /dev/xvdb -> nvme1n1
lrwxrwxrwx. 1 root root  9 May 21 15:16 /dev/xvdb1 -> nvme1n1p1

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.


# Create symlinks for the device name configured in EC2, for both block devices and partitions
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/bin/ghostdog ebs-device-name /dev/$kernel", SYMLINK+="$result"
KERNEL=="nvme[0-9]*n[0-9]*p[0-9]*", ENV{DEVTYPE}=="partition", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/bin/ghostdog ebs-device-name /dev/$kernel", SYMLINK+="$result$number"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set ENV{XVD_DEVICE_NAME} in the parent, you should be able to import it here via IMPORT{parent}="XVD_DEVICE_NAME" to avoid the second binary invocation.

ENV{DEVTYPE}!="disk", GOTO="ebs_volumes_end"

# Create symlinks for the device name configured in EC2, for both block devices and partitions
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", PROGRAM="/usr/bin/ghostdog ebs-device-name /dev/$kernel", SYMLINK+="$result"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd sort of prefer to do this as part of the ghostdog scan operation, but it's OK to have a different subcommand.

I'd like it to take similar arguments though. It should also be set up with similar guards, and run after the io_timeout rule:

# add these guards to the top, after ACTION=="remove"
KERNEL!="nvme*", GOTO="ebs_volumes_end"
ATTRS{model}!="Amazon Elastic Block Store", GOTO="ebs_volumes_end"
...

# Set XVD_DEVICE_NAME for disk
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/bin/ghostdog ebs-device-name $devnode"

# Set XVD_DEVICE_NAME for partition
KERNEL=="nvme[0-9]*n[0-9]*p[0-9]*", ENV{DEVTYPE}=="partition", IMPORT{parent}="{XVD_DEVICE_NAME}"

# Use XVD_DEVICE_NAME
XVD_DEVICE_NAME=="xvd*", SYMLINK+="$env{XVD_DEVICE_NAME}"

#[derive(Debug)]
/// Structure returned by the NVME_IDENTIFY_CTRL command
/// See: https://github.com/linux-nvme/libnvme/blob/1ec2432ad0f71a8fc7e37548405d52923dd3dc6e/src/nvme/types.h#L1365
struct NvmeIdCtrl {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooray, this struct has the same number of members as the types.h struct.


impl From<NvmeIdCtrl> for EBSNvmeDevice {
fn from(value: NvmeIdCtrl) -> Self {
// The device name is within the first 32 bytes of the vendor specific data, which
Copy link
Member

@larvacea larvacea May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the name start at byte 0 of vendor specific data? If the name is shorter than 32 bytes, are the remaining bytes guaranteed to be whitespace (including, as Rust helpfully points out, zero bytes)?
Nit: if the name starts at byte 0, one could use trim_end() rather than trim(). Is it worth testing for a non-blank device name here (just to make for a more informative error message)?

// This is required, otherwise the syscall always returns the same device; this
// value corresponds to what Amazon Linux does.
// See: https://github.com/amazonlinux/amazon-ec2-utils/blob/09b53f2a20c3e4d227acfb452138fdb672baaa1a/ebsnvme-id#L114
cdw10: 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined in NVME 1.1, Figure 81 (page 84). The value 1 means "The Identify Controller data structure is returned to
the host."

// Test that the size of the structure matches the expected size, in case the
// structure shape is changed. This is forces developers to update the expected
// size of the struct.
fn test_nvme_id_ctrl_size() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I was so relieved when I saw this test. Yes, by all means, let us validate that the huge lump of repr(C) definition actually added up to 4k bytes.

Copy link
Member

@larvacea larvacea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct to me.

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
Copy link
Member

@larvacea larvacea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might wish to update the lookaside cache and rerun the Github CI. If you're happy with appending the partition number to XVD_DEVICE_NAME, I'm happy (see comment on ebs-volumes.rules).

/// Print the device type in the environment key format udev expects.
fn emit_device_type(device_type: &str) {
println!("BOTTLEROCKET_DEVICE_TYPE={}", device_type);
}

/// Print the device name in the environment key format udev expects.
fn emit_device_name(device_name: &str) {
println!("XVD_DEVICE_NAME={}", device_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://docs.aws.amazon.com/ebs/latest/userguide/nvme-ebs-volumes.html I believe this is fine because:

  • Device names are at most 32 bytes long, and unused trailing bytes are space characters.
  • Device names do not contain any characters that would require quoting or escaping.

Therefore this will give us a valid environment key assignment of the form XVD_DEVICE_NAME=/some/stuff and udev will be happy.

KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/bin/ghostdog ebs-device-name $devnode", SYMLINK+="$env{XVD_DEVICE_NAME}"

# Add symlink for partition
KERNEL=="nvme[0-9]*n[0-9]*p[0-9]*", ENV{DEVTYPE}=="partition", IMPORT{parent}="XVD_DEVICE_NAME", SYMLINK+="$env{XVD_DEVICE_NAME}$number"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What form do you expect XVD_DEVICE_NAME to take? Is appending the partition number to that going to be unambiguous? I ask because if XVD_DEVICE_NAME is something like /dev/sda1, you could get a symlink for /dev/sda11 here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is exactly what AL2023 does, so That Will Be Fine (probably, not provably, but what do you want). The XVD_DEVICE_NAME from the controller is of the form xvda, and the partitions get named xvda1, xvda128, and so on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For attached-after-boot EBS, /dev/sdb and /dev/sdb1 appear. Also fine.

Copy link
Contributor

@bcressey bcressey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. The udev rules can be tweaked a bit for simplicity.

For readability, I prefer udev rules that bail out and skip to the end immediately if they don't match what the rule is trying to cover. And any conditions checked at top don't need to be checked again in the body of the rules file.

@@ -122,6 +122,7 @@ Requires: %{_cross_os}kexec-tools
Requires: %{_cross_os}keyutils
Requires: %{_cross_os}makedumpfile
Requires: %{_cross_os}netdog
Requires: %{_cross_os}nvme-cli
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can omit the dependency in "release" since it will come in via ghostdog

Comment on lines 117 to 118
let device_name =
&device_info[NVME_IDENTIFY_DATA_SIZE - 1024..NVME_IDENTIFY_DATA_SIZE - 1024 + 32];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let device_name =
&device_info[NVME_IDENTIFY_DATA_SIZE - 1024..NVME_IDENTIFY_DATA_SIZE - 1024 + 32];
let offset = NVME_IDENTIFY_DATA_SIZE - 1024;
let device_name = &device_info[offset..offset + 32];


# Follow AWS recommendation of never timing out IO on EBS volumes attached via NVMe
KERNEL=="nvme*", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/io_timeout}="4294967295"
KERNEL=="nvme*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/io_timeout}="4294967295"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: restore the previous guard and omit the redundant KERNEL and ATTRS{model} checks:

Suggested change
KERNEL=="nvme*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon Elastic Block Store", ATTR{queue/io_timeout}="4294967295"
SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ATTR{queue/io_timeout}="4294967295"

(Is it OK to keep the `SUBSYSTEM!="block" guard at the top? Or did that cause partitions to be skipped? Not sure why you removed it. If it works it's better to have it at the top and omit it here also.)

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
The new command uses nvme-cli to query the vendor data from the EBS
volumes, which contains the device name.

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
This adds two new rules to create symlinks for both the disks and the
partitions, using the device name configured for the EBS volume

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
@arnaldo2792
Copy link
Contributor Author

(Forced push addresses nits left in previous reviews.)

@arnaldo2792 arnaldo2792 merged commit c4d90b4 into bottlerocket-os:develop May 29, 2024
33 checks passed
@vigh-m vigh-m mentioned this pull request Aug 1, 2024
16 tasks
@ikevinc
Copy link

ikevinc commented Aug 8, 2024

Hey @bcressey @arnaldo2792,

Thanks for working on this!

Just so I fully understand, this introduces native support for instance store local NVME volumes without having to build a custom bootstrap container, right?

Will this work with Karpenter's RAID0 feature out of the box?
aws/karpenter-provider-aws#4735

Even with the bootstrap container workaround, I believe Karpenter had issues allocating the volumes for use, as described here:
#3060

@arnaldo2792
Copy link
Contributor Author

Hey @ikevinc thanks for the kind words! Just a clarification, this does not solve the problem you are pointing out. This PR introduced a udev rule that allows users to identify NVME devices using EBS device mappings (see this). The feature that you want is currently being worked in bottlerocket-os/bottlerocket-core-kit#15 and bottlerocket-os/bottlerocket-core-kit#62. With these two changes, you will be able to use ephemeral storage and configure a RAID 0 array, without requiring you to have bootstrap containers.

@kcao-nydig
Copy link

Hey @ikevinc thanks for the kind words! Just a clarification, this does not solve the problem you are pointing out. This PR introduced a udev rule that allows users to identify NVME devices using EBS device mappings (see this). The feature that you want is currently being worked in bottlerocket-os/bottlerocket-core-kit#15 and bottlerocket-os/bottlerocket-core-kit#62. With these two changes, you will be able to use ephemeral storage and configure a RAID 0 array, without requiring you to have bootstrap containers.

I see. Thanks for the clarifications. Will those make it into the 1.21.0 release?

@arnaldo2792 arnaldo2792 deleted the udev-rule branch September 5, 2024 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create symlinks to devices using the device name configured for EBS volumes
5 participants