Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vif-plug-representor: fix support for plugging in PF ports #14

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

iulhaq
Copy link

@iulhaq iulhaq commented Jul 12, 2024

Allowing for plugging in PF ports also means user needs to set some more parameters from the Southbound.

Previously only lookup against VF ports was done.

Allowing for plugging in PF ports also means user needs to set some
more parameters from the Southbound.

Previously only lookup against VF ports was done.
Copy link
Member

@dshcherb dshcherb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I added a suggestion that I think would require less configuration and fit better with the isolation model in the original design.

return false;
}

const char *opt_bus_name = smap_get(&ctx_in->lport_options,
Copy link
Member

@dshcherb dshcherb Jul 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be to reuse the PF MAC for this: when a vf-num is not specified, we know that this is a request to plug a PF.

The issue with using the PCI info for plugging is that the main (hypervisor) host and the DPU host see two different PCIe topologies and PCI addresses don't match. Whereas the burned-in MAC can be seen from both the hypervisor host and the DPU host via devlink.

Requiring DPU-side information to be passed in vif-plug:representor fields makes an implicit requirement that the requester has privileged access to look that information up. However, the original design decision when writing this was to infer the DPU-side information based on the host-side information (e.g. both hosts can see the PF MAC, board serial, VF numbers). This allows you to have privilege separation between:

  1. the hypervisor service that picks a free PF/VF and is the source for the information to do the lookup of a representor
  2. the CMS that sets values for vif-plug:representor:<field>
  3. the DPU-side service (ovn-controller).

The hypervisor-facing MAC is possible to get via devlink from the DPU:

# Run from the DPU host to get a MAC of a PF via its representor: hw_addr is the MAC address of a PF facing the main host.

ubuntu@bf3:~$ devlink -j port show pci/0000:03:00.0/65536 | jq
{
  "port": {
    "pci/0000:03:00.0/65536": {
      "type": "eth",
      "netdev": "pf0hpf",
      "flavour": "pcipf",
      "controller": 1,
      "pfnum": 0,
      "external": true,
      "splittable": false,
      "function": {
        "hw_addr": "a0:88:c2:42:42:42"
      }
    }
  }
}
# Run from the main (hypervisor) host to get a PF MAC which matches the above hw_addr
ubuntu@hypervisor:~$ ip -j link show enp1s0np0 | jq .[].address
"a0:88:c2:42:42:42"

# If the main host-side MAC for a netdev has been changed programmatically you can still retrieve it via the permaddr (IFLA_PERM_ADDRESS). But this only matters if you collect this info at the main host.
# https://github.com/iproute2/iproute2/commit/2b8e699

In summary: I would add and initialize a separate index to the port table to do a lookup of a PF based on a PF MAC address passed in and in the flavor == DEVLINK_PORT_FLAVOUR_PCI_PF branch that you added use a function to look up a PF by a PF MAC from vif-plug:representor:pf-mac.

/* Port table.
*
* This data structure contains three indexes:
*
* mac_vf_table - port_node by PF MAC and VF number.
* ifindex_table - port_node by netdev ifindex.
* bus_dev_table - port_node by bus/dev name (only contains PHYSICAL and
* PCI_PF ports).
*
* There is a small number of PHYSICAL and PF flavoured ports per device. We
* will need to refer to these for every update we get to a VF in order to
* maintain the PF MAC+VF number index.
*
* Note that there is not really any association between PHYSICAL and PF
* representor ports from the devlink data structure point of view. However
* for systems running a kernel that does not provide the host facing MAC
* through devlink on the PF representor there is a compatibility interface in
* sysfs which is relative to a PHYSICAL ports netdev name (see the
* compat_get_host_pf_mac function).
*/
struct port_table {
struct hmap mac_vf_table; /* Hash table for lookups by mac+vf_num */
uint32_t mac_seed; /* We reuse the OVS mac+vlan hash functions for the
* PF MAC+VF number, and they require a uint32_t seed */
struct hmap ifindex_table; /* Hash table for lookups by ifindex */
struct hmap bus_dev_table; /* Hash table for lookup of PHYSICAL and PF
* ports by their bus_name/dev_name string.
* While there is a large number of VFs or SFs
* they will be associated with a small number
* of PFs */
};
static struct port_table *port_table;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to review @dshcherb much appreciated!

@iulhaq I tend to agree with the point @dshcherb puts forward about the two sides of the PCIe complex being enumerated independently. I have seen with my own eyes systems that expose a behavior where the PCI address for the device do not match.

You addition of a pf-num option is useful though as we do actually currently have a bug for systems that expose two PFs to the host and where the host put them into a bond. The reason we did not catch that is that we used the vendor recommended approach which was to bond at the DPU side and expose only one PF to the host (Example from NVIDIA BlueField documentation).

What would you think of plugging the PF representor on the basis of lack of vf-num option provided, avoiding the addition of the PCI and device name information?

@dshcherb dshcherb requested a review from fnordahl July 13, 2024 15:17
return false;
}

const char *opt_bus_name = smap_get(&ctx_in->lport_options,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to review @dshcherb much appreciated!

@iulhaq I tend to agree with the point @dshcherb puts forward about the two sides of the PCIe complex being enumerated independently. I have seen with my own eyes systems that expose a behavior where the PCI address for the device do not match.

You addition of a pf-num option is useful though as we do actually currently have a bug for systems that expose two PFs to the host and where the host put them into a bond. The reason we did not catch that is that we used the vendor recommended approach which was to bond at the DPU side and expose only one PF to the host (Example from NVIDIA BlueField documentation).

What would you think of plugging the PF representor on the basis of lack of vf-num option provided, avoiding the addition of the PCI and device name information?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants