Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

From: Alex Williamson
Date: Sat Mar 20 2021 - 00:41:28 EST


On Fri, 19 Mar 2021 19:59:43 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Fri, Mar 19, 2021 at 03:08:09PM -0600, Alex Williamson wrote:
> > On Fri, 19 Mar 2021 17:07:49 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > > On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:
> > > > On Fri, 19 Mar 2021 17:34:49 +0100
> > > > Christoph Hellwig <hch@xxxxxx> wrote:
> > > >
> > > > > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> > > > > > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > > > > > as a universal "default" within the driver core lazy bind scheme and
> > > > > > still have working module autoloading... I'm hoping to get some
> > > > > > research into this..
> > > >
> > > > What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> > > > driver, which would load all the known variants in order to influence
> > > > the match, and therefore probe ordering?
> > >
> > > The way the driver core works is to first match against the already
> > > loaded driver list, then trigger an event for module loading and when
> > > new drivers are registered they bind to unbound devices.
> >
> > The former is based on id_tables, the latter on MODULE_DEVICE_TABLE, we
> > don't have either of those.
>
> Well, today we don't, but Max here adds id_tables to the special
> devices and a MODULE_DEVICE_TABLE would come too if we do the flavours
> thing below.

I think the id_tables are the wrong approach for IGD and NVLink
variants.

> My starting thinking is that everything should have these tables and
> they should work properly..

id_tables require ongoing maintenance whereas the existing variants
require only vendor + device class and some platform feature, like a
firmware or fdt table. They're meant to only add extra regions to
vfio-pci base support, not extensively modify the device interface.
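
Something like this is what I have in mind on the probe side (a rough
sketch only; igd_opregion_present() is a made-up stand-in for whatever
firmware table check the variant actually needs):

#include <linux/pci.h>

static int vfio_pci_igd_probe(struct pci_dev *pdev,
			      const struct pci_device_id *id)
{
	/* Only Intel VGA class devices are candidates */
	if (pdev->vendor != PCI_VENDOR_ID_INTEL ||
	    (pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
		return -ENODEV;

	/* Hypothetical helper testing for the OpRegion firmware table */
	if (!igd_opregion_present(pdev))
		return -ENODEV;	/* fall back to plain vfio-pci */

	/* ... create the vfio_device exposing the extra region(s) ... */
	return 0;
}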

> > As noted to Christoph, the cases where we want a vfio driver to
> > bind to anything automatically is the exception.
>
> I agree vfio should not automatically claim devices, but once vfio is
> told to claim a device everything thereafter should be
> automatic.
>
> > > One answer is to have userspace udev have the "hook" here and when a
> > > vfio flavour mod alias is requested on a PCI device it swaps in
> > > vfio_pci if it can't find an alternative.
> > >
> > > The dream would be a system with no vfio modules loaded could do some
> > >
> > > echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> > >
> > > And a module would be loaded and a struct vfio_device is created for
> > > that device. Very easy for the user.
> >
> > This is like switching a device to a parallel universe where we do
> > want vfio drivers to bind automatically to devices.
>
> Yes.
>
> If we do this I'd probably suggest that driver_override be bumped down
> to some user compat and 'vfio > driver_override' would just set the
> flavour.
>
> As-is driver_override seems dangerous as overriding the matching table
> could surely allow root userspace to crash the machine. In situations
> with trusted boot/signed modules this shouldn't be possible.

When we're dealing with meta-drivers that can bind to anything, we
shouldn't rely on the match, but should instead verify the driver is
appropriate in the probe callback. Even without driver_override,
there's the new_id mechanism. Either method allows the root user to
break driver binding. Greg has previously stated something to the
effect that users get to keep all the pieces when they break something
by manipulating driver binding.
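
For instance, new_id lets the root user bind vfio-pci to an arbitrary
vendor:device pair with no match table involved at all (the IDs below
are just an example):

echo "8086 10d3" > /sys/bus/pci/drivers/vfio-pci/new_id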

> > > > If we coupled that with wildcard support in driver_override, ex.
> > > > "vfio_pci*", and used consistent module naming, I think we'd only need
> > > > to teach userspace about this wildcard and binding to a specific module
> > > > would come for free.
> > >
> > > What would the wildcard do?
> >
> > It allows a driver_override to match more than one driver, not too
> > dissimilar to your driver_flavor above. In this case it would match
> > all driver names starting with "vfio_pci". For example if we had:
> >
> > softdep vfio-pci pre: vfio-pci-foo vfio-pci-bar
> >
> > Then we'd pre-seed the condition that drivers foo and bar precede the
> > base vfio-pci driver, each will match the device to the driver and have
> > an opportunity in their probe function to either claim or skip the
> > device. Userspace could also set an exact driver_override, for
> > example if they want to force using the base vfio-pci driver or go
> > directly to a specific variant.
>
> Okay, I see. The problem is that this makes 'vfio-pci' monolithic; in
> normal situations it will load *everything*.
>
> While that might not seem too bad with these simple drivers, at least
> the mlx5 migration driver will have a large dependency tree and pull
> in lots of other modules. Even Max's sample from v1 pulls in mlx5_core.ko
> and a bunch of other stuff in its orbit.

Luckily the mlx5 driver doesn't need to be covered by compatibility
support, so we don't need to set a softdep for it, and the module could
be named such that a wildcard driver_override of "vfio_pci*" wouldn't
logically include it. Users can manually create their own modprobe.d
softdep entry if they'd like to include it. Otherwise userspace would
need to know to bind to it specifically.
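
Such a local opt-in could be nothing more than a one-line modprobe.d
entry, e.g. (module name purely hypothetical):

softdep vfio-pci pre: vfio-pci-mlx5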

> This is why I want to try for fine grained autoloading first. It
> really is the elegant solution if we can work it out.

I just don't see how we create a manageable change to userspace.

> > > Open coding a match table in probe() and returning failure feels hacky
> > > to me.
> >
> > How's it any different than Max's get_foo_vfio_pci_driver() that calls
> > pci_match_id() with an internal match table?
>
> Well, I think that is hacky too - but it is hacky only to service user
> space compatibility so let's put that aside

I don't see that dropping incompatible devices in the probe function
rather than the match via id_table is necessarily a hack. I think
driver-core explicitly supports this (see below).

> > It seems a better fit for the existing use cases; for example, the
> > IGD variant can use a single-line table to exclude all except Intel
> > VGA class devices in its probe callback, then test availability of
> > the extra regions we'd expose, otherwise returning -ENODEV.
>
> I don't think we should over-focus on these two firmware triggered
> examples. I looked at the Intel GPU driver and it already only reads
> the firmware thing for certain PCI ID's, we can absolutely generate a
> narrow match table for it. Same is true for the NVIDIA GPU.

I'm not sure we can make this assertion; both only care about the type
of device and the existence of associated firmware tables. No PCI IDs
are currently involved.

> The fact this is hard or whatever is beside the point - future drivers
> in this scheme should have exact match tables.
>
> The mlx5 sample is a good example, as it matches a very narrow NVMe
> device that is properly labeled with a subvendor ID. It does not match
> every NVMe device and then run code to figure it out. I think this is
> the right thing to do as it is the only thing that would give us fine
> grained module loading.

Sounds like the right thing to do for that device, if it's only designed
to run in this framework. That's more like the mdev device model.

> Even so, I'm not *so* worried about "over matching" - if IGD or the
> nvidia stuff load on a wide set of devices then they can just not
> enable their extended stuff. It wastes some kernel memory, but it is
> OK.

I'd rather they bind to the base vfio-pci driver if their extended
features are not available.

> And if some driver *really* gets stuck here the true answer is to
> improve the driver core match capability.
>
> > devices in the deny-list and non-endpoint devices. Many drivers
> > clearly place implicit trust in their id_table; others don't. In the
> > case of meta drivers, I think it's fair to make use of the latter
> > approach.
>
> Well, AFAIK, the driver core doesn't have a 'try probe, if it fails
> then try another driver' approach. One device, one driver. Am I
> missing something?

If the driver probe callback fails, really_probe() returns 0 with the
comment:

/*
 * Ignore errors returned by ->probe so that the next driver can try
 * its luck.
 */
ret = 0;

That allows bus_for_each_drv() to continue to iterate.
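
IIRC, just above that, the same error path even singles out
-ENODEV/-ENXIO from ->probe as a quiet "rejects match":

case -ENODEV:
case -ENXIO:
	pr_debug("%s: probe of %s rejects match %d\n",
		 drv->name, dev_name(dev), ret);
	break;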

>
> I would prefer not to propose to Greg such a radical change to how
> driver loading works..

Seems to be how it works already.

> I also think the softdep/implicit loading/ordering will not be
> welcomed, it feels weird to me.

AFAICT, it works within the existing driver-core; it's largely an
extension to pci-core driver_override support to enable wildcard
matching, ideally along with adding the same for all buses that support
driver_override. Thanks,

Alex