Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time

From: Bjorn Helgaas
Date: Thu Apr 28 2022 - 15:12:25 EST

On Tue, Apr 26, 2022 at 07:25:43PM +0000, Jake Oshins wrote:
> > -----Original Message-----
> > From: Dexuan Cui <decui@xxxxxxxxxxxxx>
> > Sent: Tuesday, April 26, 2022 11:32 AM
> > To: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>
> > Cc: Jake Oshins <jakeo@xxxxxxxxxxxxx>; Bjorn Helgaas <helgaas@xxxxxxxxxx>;
> > bhelgaas@xxxxxxxxxx; Alex Williamson <alex.williamson@xxxxxxxxxx>;
> > wei.liu@xxxxxxxxxx; KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang
> > <haiyangz@xxxxxxxxxxxxx>; Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>;
> > linux-hyperv@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux-
> > kernel@xxxxxxxxxxxxxxx; Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx>;
> > robh@xxxxxxxxxx; kw@xxxxxxxxx; kvm@xxxxxxxxxxxxxxx
> > Subject: RE: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce
> > VM boot time
> >
> > > From: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>
> > > Sent: Tuesday, April 26, 2022 9:45 AM
> > > > ...
> > > > Sorry I don't quite follow. pci-hyperv allocates MMIO for the bridge
> > > > window in hv_pci_allocate_bridge_windows() and registers the MMIO
> > > > ranges to the core PCI driver via pci_add_resource(), and later the
> > > > core PCI driver probes the bus/device(s), validates the BAR sizes
> > > > and the pre-initialized BAR values, and uses the BAR configuration.
> > > > IMO the whole process doesn't require the bit PCI_COMMAND_MEMORY to
> > > > be pre-set, and there should be no issue to delay setting the bit to
> > > > a PCI device device's .probe() -> pci_enable_device().
> > >
> > > IIUC you want to bootstrap devices with PCI_COMMAND_MEMORY clear
> > > (otherwise PCI core would toggle it on and off for eg BAR sizing).
> > >
> > > Is that correct ?
> >
> > Yes, that's the exact purpose of this patch.
> >
> > Do you see any potential architectural issue with the patch?
> > From my reading of the core PCI code, it looks like this should be safe.

I don't know much about Hyper-V, but in general I don't think the PCI
core should turn on PCI_COMMAND_MEMORY at all unless a driver requests
it. I assume that if a guest OS depends on PCI_COMMAND_MEMORY being
set, guest firmware would take care of setting it.

> > Jake has some concerns that I don't quite follow.
> > @Jake, could you please explain the concerns with more details?
> First, let me say that I really don't know whether this is an issue.
> I know it's an issue with other operating system kernels. I'm
> curious whether the Linux kernel / Linux PCI driver would behave in
> a way that has an issue here.
> The VM has a window of address space into which it chooses to put
> PCI device's BARs. The guest OS will generally pick the value that
> is within the BAR, by default, but it can theoretically place the
> device in any free address space. The subset of the VM's memory
> address space which can be populated by devices' BARs is finite, and
> generally not particularly large.
> Imagine a VM that is configured with 25 NVMe controllers, each of
> which requires 64KiB of address space. (This is just an example.)
> At first boot, all of these NVMe controllers are packed into address
> space, one after the other.
> While that VM is running, one of the 25 NVMe controllers fails and
> is replaced with an NVMe controller from a separate manufacturer,
> but this one requires 128KiB of memory, for some reason. Perhaps it
> implements the "controller buffer" feature of NVMe. It doesn't fit
> in the hole that was vacated by the failed NVMe controller, so it
> needs to be placed somewhere else in address space. This process
> continues over months, with several more failures and replacements.
> Eventually, the address space is very fragmented.
> At some point, there is an attempt to place an NVMe controller into
> the VM but there is no contiguous block of address space free which
> would allow that NVMe controller to operate. There is, however,
> enough total address space if the other, currently functioning, NVMe
> controllers are moved from the address space that they are using to
> other ranges, consolidating their usage and reducing fragmentation.
> Let's call this a rebalancing of memory resources.
> When the NVMe controllers are moved, a new value is written into
> their BAR. In general, the PCI spec would require that you clear
> the memory enable bit in the command register (PCI_COMMAND_MEMORY)
> during this move operation, both so that there's never a moment when
> two devices are occupying the same address space and because writing
> a 64-bit BAR atomically isn't possible. This is the reason that I
> originally wrote the code in this driver to unmap the device from
> the VM's address space when the memory enable bit is cleared.
> What I don't know is whether this sequence of operations can ever
> happen in Linux, or perhaps in a VM running Linux. Will it
> rebalance resources in order to consolidate address space? If it
> will, will this involve clearing the memory enable bit to ensure
> that two devices never overlap?

This sequence definitely can occur in Linux, but it hasn't yet become
a real priority. But we do already have issues with assigning space
for hot-added devices in general, especially if firmware hasn't
assigned large windows to things like Thunderbolt controllers. I
suspect that we have or will soon have issues where resource
assignment starts failing after a few hotplugs, e.g., dock/undock

There have been patches posted to rebalance resources (quiesce
drivers, reassign, restart drivers), but they haven't gone anywhere
yet for lack of interest and momentum. I do feel like we're the
tracks and the train is coming, though ;)