Skip to content
This repository has been archived by the owner on Aug 25, 2021. It is now read-only.

multipath: racing condition #154

Closed
tuan-hoang1 opened this issue Mar 5, 2020 · 25 comments
Closed

multipath: racing condition #154

tuan-hoang1 opened this issue Mar 5, 2020 · 25 comments
Labels
jira for syncing to jira

Comments

@tuan-hoang1
Copy link
Contributor

tuan-hoang1 commented Mar 5, 2020

Ideally, on first boot, when systemd-udevd.service and multipathd.service are activated, finished, /dev/disk/by-label/boot should be pointing to respective - for example /dev/mpatha1, assuming friendly naming is used - rather than /dev/sda1 or /dev/sdb1. So that ignition-setup-user.service and other Ignition services should be able to pick up /dev/mpatha1.

But I have encountered that ignition-setup-user.service keeps picking /dev/sda1, thus failing because a multipath link is already formed for sda and sdb to be mpatha.

I have tried blindly adding (for the sake of testing)

Wants=multipathd.service systemd-udev-settle.service
Requires=multipathd.service systemd-udev-settle.service
After=multipathd.service systemd-udev-settle.service

to
dracut/30ignition/ignition-diskful.target
dracut/30ignition/ignition-disks.service
dracut/30ignition/coreos-gpt-setup@.service
dracut/30ignition/ignition-setup-user.service
dracut/30ignition/ignition-subsequent.target

but there seems to be a racing condition happening, that in rescue shell /dev/mpatha is formed but ignition-setup-user.service keeps looking for /dev/sdb. Running systemctl restart initrd.target allows a successful first boot.

My understanding is that another trigger of systemd-udev-trigger.service should be enough for all /dev/disk/by-label/ links to be populated, after multipath paths are formed. Or we have to call

/usr/bin/udevadm trigger --type=subsystems --action=add
/usr/bin/udevadm trigger --type=devices --action=add

inside dracut/30ignition/ignition-setup-user.sh before mounting /dev/disk/by-label/boot

Full log : https://tpaste.us/7KzB

@cgwalters
Copy link
Member

So effectively what's going on here is the meaning (target) of /dev/disk/by-label/boot (and root) change in the middle of startup.

One thing to verify: is multipathd synchronously waiting until these devices appear? From looking at the systemd unit, it appears to me that it tries to only wait for devices mentioned on the kernel cmdline or so?

[Service]
Type=notify
NotifyAccess=main
ExecStartPre=-/sbin/multipath -A
ExecStart=/sbin/multipathd -d -s

And from the man page it seems like -A is blocking on the kernel cmdline probing? Or maybe it's just propagating?

OK from a quick glance at the multipathd code it does seem like it only signals readiness once it's processed the main loop and presumably switched the targets of the default udev symlinks.

Hmm. That said, I feel like this is all still hacky; what we should probably do is use a kernel argument to signal that multipath is in use for the rootfs. It looks like dracut's multipath kernel module tries to probe the root= kernel argument to detect this, but that doesn't work with the CoreOS model. We could invent coreos.multipath=/dev/mpatha1 or so which would signal this?

But it does seem to me that your changes to add Before=multipathd should work. Can you try adding some debug prints? Maybe something like ExecStartPost=/bin/ls -al /dev/disk/by-label/boot in multipath.service? That will tell us whether the symlink is being changed correctly at the right time.

@cgwalters
Copy link
Member

In other words I think multipath falls into coreos/fedora-coreos-tracker#94 some at least - something like having coreos-installer write a kernel argument that we read.

@tuan-hoang1
Copy link
Contributor Author

tuan-hoang1 commented Mar 9, 2020

One thing to verify: is multipathd synchronously waiting until these devices appear? From looking at the systemd unit, it appears to me that it tries to only wait for devices mentioned on the kernel cmdline or so?

I think any of Wants= Requires= Before= After= only allows unit B to start after unit A starts, not after unit A finishes.

Hmm. That said, I feel like this is all still hacky; what we should probably do is use a kernel argument to signal that multipath is in use for the rootfs. It looks like dracut's multipath kernel module tries to probe the root= kernel argument to detect this, but that doesn't work with the CoreOS model. We could invent coreos.multipath=/dev/mpatha1 or so which would signal this?

Yeah that's the general idea. And this detection has to be done before ignition-user-setup service.
Point is, how do we find the synchronization point where /dev/disk/by-label/{root,boot} are populated to get this detection in play.

But it does seem to me that your changes to add Before=multipathd should work. Can you try adding some debug prints? Maybe something like ExecStartPost=/bin/ls -al /dev/disk/by-label/boot in multipath.service? That will tell us whether the symlink is being changed correctly at the right time.

That ls says there is no /dev/disk/by-label/* is generated.
As mentioned above, I think Before= also works in the same way as After= - sync point is service's starting point, not service's finishing point.

@tuan-hoang1
Copy link
Contributor Author

Not just running systemctl restart initrd.target allows a successful first boot, systemctl restart ignition-setup-user also works during rescue shell, where the symlinks are populated.

I'd go with re-triggering the udev coldplug path, even though not very sure it's a good idea.

@tuan-hoang1
Copy link
Contributor Author

tuan-hoang1 commented Mar 9, 2020

A dummy sleep in ignition-setup-user.sh would just do the job btw. Or a run of udevadm settle in coreos-gpt-setup.

@bgilbert
Copy link
Contributor

The design for openshift/enhancements#211 puts a config file in /boot that specifies any extra steps needed to access the root filesystem. The idea is to avoid system-specific kernel arguments. It also isn't much help here, where extra steps are needed to access /boot.

cc @arithx

@tuan-hoang1
Copy link
Contributor Author

Looks like having udevadm settle in coreos-gpt-setup@.service should work. IIRC that one is the very first FCOS's service to run, before all ignition-*.service services.
Also, we might need something similar in the real OSTree rootfs. I'm seeing rpm-ostreed.service failing due to boot.mount failing.

@cgwalters
Copy link
Member

Looks like having udevadm settle in coreos-gpt-setup@.service should work.

Can use:

Requires=systemd-udev-settle.service
After=systemd-udev-settle.service

instead.

@cgwalters
Copy link
Member

And to make sure I understand; we are using the existing dracut multipath module right? Is the rd.multipath karg in use?

@darkmuggle
Copy link
Contributor

@tuan-hoang1 So there's a few things going on here:

  • the configuration for multipath is not there yet (need to create /etc/multipath and populate a multipath.conf)
  • the OStree mount for /sysroot is happening before multipathd.service is run. As a result we can't use the mpath.
  • the code for finding the OStree use labels. It doesn't even bother to check the dm-* or /dev/mapper/* devices

So there's a few problems going on here.

@darkmuggle
Copy link
Contributor

And to make sure I understand; we are using the existing dracut multipath module right? Is the rd.multipath karg in use?

The existing rd.multipath code won't work for a few reasons:

  • no configuration (you'll get a message about needing to have an /etc/multipath.conf
  • we really like /dev/disk/by-label paths

The LUKS opener is triggered via:

# This is our luks root label
Requires=dev-disk-by\x2dlabel-crypt_rootfs.device
After=dev-disk-by\x2dlabel-crypt_rootfs.device

So we need to change /usr/libexec/coreos-cryptfs to be mpath aware. Right now it blows up with

:/dev/mapper# /usr/libexec/coreos-cryptfs open
coreos-cryptfs: /dev/sda4 uses the 'cipher_null-ecb'
coreos-cryptfs: /dev/sda4 is not encrypted. Device will be mounted as device-mapper linear target.
[ 3789.240112]  sda: sda1 sda2 sda3 sda4
[ 3789.258203]  sda: sda1 sda2 sda3 sda4
CHANGED: partition=4 start=1050624 old: size=6582272 end=7632896 new: size=32503775,end=33554399
[ 3789.277761] device-mapper: table: 253:5: linear: Device lookup failed
[ 3789.281050] device-mapper: ioctl: error adding target to table
device-mapper: reload ioctl on coreos-luks-root-nocrypt  failed: Device or resource busy

@cgwalters
Copy link
Member

OK I just looked at man multipath.conf and...wow. Not sure we're going to be able to express what people want as kernel arguments (which is also true of the iSCSI case).

Which brings us to fedora-silverblue/issue-tracker#3 and https://lists.freedesktop.org/archives/systemd-devel/2019-November/043754.html

So...I guess my 2¢ is for now to introduce /boot/etc/multipath.conf and teach our code how to copy that into the initramfs' /etc (and also get it into the upstream dracut source too). As opposed to regenerating/appending to the initramfs or using a secondary initramfs.

But, there's more work here because coreos-installer needs to know about it - there's a PR here which needs some co-design work with whatever we do here.

So we need to change /usr/libexec/coreos-cryptfs to be mpath aware.

Don't we just need to delay everything that is parsing the labels until after multipathd has taken ownership of them? In the end isn't the point of multipathd that there's a single virtual block device - we just need to ensure the multipath userspace takes ownership of the link.

(I wonder if multipathd has any support today for detecting the case of labels on the devices it owns, or whether we need to re-run udev - and does udev know to ignore the "underlying" block devices of a mpath device?)

@darkmuggle
Copy link
Contributor

Don't we just need to delay everything that is parsing the labels until after multipathd has taken ownership of them?

Sadly, no. The only rules that apply are for 11-dm-parts.rules which will create something like:

:/dev/disk/by-id# ls -lh
total 0
lrwxrwxrwx 1 root root 10 Mar 31 18:46 dm-name-coreos-luks-root-nocrypt -> ../../dm-5
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-name-mpatha -> ../../dm-0
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-name-mpatha1 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-name-mpatha2 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-name-mpatha3 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-name-mpatha4 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-uuid-mpath-0x5000c500d1d1d1d1 -> ../../dm-0
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-uuid-part1-mpath-0x5000c500d1d1d1d1 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-uuid-part2-mpath-0x5000c500d1d1d1d1 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-uuid-part3-mpath-0x5000c500d1d1d1d1 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Mar 31 18:34 dm-uuid-part4-mpath-0x5000c500d1d1d1d1 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Mar 31 18:34 scsi-0x5000c500d1d1d1d1 -> ../../dm-0
lrwxrwxrwx 1 root root  9 Mar 31 18:44 scsi-35000c500d1d1d1d1 -> ../../sda
lrwxrwxrwx 1 root root 10 Mar 31 18:44 scsi-35000c500d1d1d1d1-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Mar 31 18:44 scsi-35000c500d1d1d1d1-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Mar 31 18:44 scsi-35000c500d1d1d1d1-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Mar 31 18:44 scsi-35000c500d1d1d1d1-part4 -> ../../sda4
lrwxrwxrwx 1 root root  9 Mar 31 18:44 wwn-0x5000c500d1d1d1d1 -> ../../sda
lrwxrwxrwx 1 root root 10 Mar 31 18:44 wwn-0x5000c500d1d1d1d1-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Mar 31 18:44 wwn-0x5000c500d1d1d1d1-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Mar 31 18:44 wwn-0x5000c500d1d1d1d1-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Mar 31 18:44 wwn-0x5000c500d1d1d1d1-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 Mar 31 18:34 wwn-0xx5000c500d1d1d1d1 -> ../../dm-0

60-persistent-storage.rules is what creates the by-label, but it filters out dm-* devices. So the best case is a /dev/disk/by-id/dm-name-coreos-luks-root-nocrypt entry.

@darkmuggle
Copy link
Contributor

So...I guess my 2¢ is for now to introduce /boot/etc/multipath.conf and teach our code how to copy that into the initramfs' /etc (and also get it into the upstream dracut source too). As opposed to regenerating/appending to the initramfs or using a secondary initramfs.

Should we just cargo-cult in anything in /boot/etc?

@cgwalters
Copy link
Member

OK. Well, the labels I think are a sane default, but here we're not talking about the defaults. One path¹ then is to also have coreos-installer set root=/dev/dm-0 or whatever. Or, when multipath is configured we also frob the filesystem UUID². IOW it looks like how a lot of traditional systems work.

Should we just cargo-cult in anything in /boot/etc?

Seems reasonable to me - that said, the tradeoff is that /boot/etc would be our own invention. Probably worth discussing this with other upstreams (dracut, multipath?). And if we take anything there that includes /boot/etc/systemd/system/ which could be arbitrary code. So...maybe better to filter down to just /boot/etc/multipath.conf?

I think the value though in avoiding people mutating our initramfs is going to be useful down the line, not just for signing but also allowing us to change how things work more easily.

¹ 😉
² I am increasingly thinking we should probably default to frobbing the rootfs and /boot UUIDs anyways on general principle, and maybe even switch to using them post-firstboot. One downside of the label model is that things go south if e.g. one inserts a USB stick imaged with the metal image into an CoreOS system and reboot.

@darkmuggle
Copy link
Contributor

xref dracutdevs/dracut#792 for Dracut. Which was the implementation result of #154 (comment)

@darkmuggle
Copy link
Contributor

darkmuggle commented Apr 17, 2020

We have another interesting problem -- we don't have device selection logic. And udev rules are not helpful. If you rely on /dev/disk/by-{label,id,uuid} once the multipath device is activated, underlying devices are locked. The udev rules that create /dev/disk/by{label,id,uuid} exclude dm devices, so the by-label that we select points to something that is not mountable.

And then we have the problem with finding the active path. I've started to work on the problem. I imagine though any fix will be controversial given the dislike for udev.

@cgwalters
Copy link
Member

My thoughts on rootfs discovery for multipath are the same for the (new) Ignition LUKS: coreos/ignition#960 (comment)

@cgwalters
Copy link
Member

I imagine though any fix will be controversial given the dislike for udev.

I don't think anyone dislikes udev? I think we shouldn't have nontrivial logic in udev rules, but that's not "udev is bad". Or to restate this, let's use kernel arguments to find the root device in a predictable fashion - this is basically how everyone does it today.

@cgwalters
Copy link
Member

Filed coreos/fedora-coreos-tracker#465 which is related to this.

@darkmuggle
Copy link
Contributor

darkmuggle commented Apr 17, 2020

I imagine though any fix will be controversial given the dislike for udev.

The controversial part is that we'll need to use UDev and select based on device attributes. Consider:

:/dev/disk/by-id# ls -lh
total 0
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-name-mpatha -> ../../dm-0
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-name-mpatha1 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-name-mpatha2 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-name-mpatha3 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-name-mpatha4 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-uuid-mpath-0xe6d8b04a47ad45d3 -> ../../dm-0
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-uuid-part1-mpath-0xe6d8b04a47ad45d3 -> ../../dm-1
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-uuid-part2-mpath-0xe6d8b04a47ad45d3 -> ../../dm-2
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-uuid-part3-mpath-0xe6d8b04a47ad45d3 -> ../../dm-3
lrwxrwxrwx 1 root root 10 Apr 17 14:35 dm-uuid-part4-mpath-0xe6d8b04a47ad45d3 -> ../../dm-4
lrwxrwxrwx 1 root root 10 Apr 17 14:35 scsi-0xe6d8b04a47ad45d3 -> ../../dm-0
lrwxrwxrwx 1 root root  9 Apr 17 14:35 scsi-3e6d8b04a47ad45d3 -> ../../sda
lrwxrwxrwx 1 root root 10 Apr 17 14:35 scsi-3e6d8b04a47ad45d3-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Apr 17 14:35 scsi-3e6d8b04a47ad45d3-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Apr 17 14:35 scsi-3e6d8b04a47ad45d3-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Apr 17 14:35 scsi-3e6d8b04a47ad45d3-part4 -> ../../sda4
lrwxrwxrwx 1 root root  9 Apr 17 14:35 wwn-0xe6d8b04a47ad45d3 -> ../../sda
lrwxrwxrwx 1 root root 10 Apr 17 14:35 wwn-0xe6d8b04a47ad45d3-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Apr 17 14:35 wwn-0xe6d8b04a47ad45d3-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Apr 17 14:35 wwn-0xe6d8b04a47ad45d3-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Apr 17 14:35 wwn-0xe6d8b04a47ad45d3-part4 -> ../../sda4
lrwxrwxrwx 1 root root 10 Apr 17 14:35 wwn-0xxe6d8b04a47ad45d3 -> ../../dm-0

And then

:/dev/disk/by-label# ls -lh
total 0
lrwxrwxrwx 1 root root 10 Apr 17 14:35 EFI-SYSTEM -> ../../sda2
lrwxrwxrwx 1 root root 10 Apr 17 14:35 boot -> ../../sda1
lrwxrwxrwx 1 root root 10 Apr 17 14:35 root -> ../../sda4

The default rules for labels excludes device mapper targets #154 (comment) and the device mapper rules do not create symlinks for labels.

With dracutdevs/dracut@b8a92b7 booting with rd.mutlipath=default will use the general configuration. In the simple case, then the node will have /dev/mapper/mpatha4 as the root disk. But what about the more complicated setups? IMO, having an ergonomic way for discovery is really useful. The path I was considering was to add UDEV rule for /dev/disk/coreos/<fs labels> that provided for the active path. Then we can continue to use the ostree= convention (that is the controversial part, IMHO)

If we require users to do the Kargs game for device selection, we're recreating the painful user experience that we had with networking via ip=.

edit: if you use Dracut upstream as an override and boot with rd.multipath=default, booting with cosa run --qemu-multipath will let you play with this.

@cgwalters
Copy link
Member

If we require users to do the Kargs game for device selection, we're recreating the painful user experience that we had with networking via ip=.

OK it's good you mentioned user experience because we really need to be defining that - and then we can debate implementation details like udev rules etc. separately.

One high level uncertainty I have is whether coreos-install needs to be involved in multipath or not. I think it does for s390x. If it does, here's my proposal for the user flow:

$ coreos-installer --multipath-wwid <wwid> --device-multipath-config /path/to/multipath.conf

And basically that ends up taking the wwid and injecting it as a kernel argument rd.multipath.wwid=<wwid>, and drops the config file in /boot/etc/multipathd.conf.

From there, in the initramfs we know how to assemble the device before we go to look for the label=root.

@cgwalters
Copy link
Member

One high level uncertainty I have is whether coreos-install needs to be involved in multipath or not.

Will it work for us to target coreos-installer at just one path, and then we assemble the multipath on the Ignition boot (and subsequent)? The analogy here is like NIC bonding/teaming - at install time, you don't strictly need a high level of redundancy.

That seems like it would simplify a lot if true. But AIUI at least on s390x there is some sort of magic needed to "activate" a block device that will be used by a z/VM guest or so. See also coreos/coreos-installer#185

@darkmuggle
Copy link
Contributor

We've solved this issue in FCOS. So I'm calling this done. @tuan-hoang1 thank you for your patience as we all worked this out.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
jira for syncing to jira
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants