Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time

From: Bjorn Helgaas
Date: Thu Apr 28 2022 - 15:12:25 EST

On Tue, Apr 26, 2022 at 07:25:43PM +0000, Jake Oshins wrote:
> > From my reading of the core PCI code, it looks like this should be safe.

I don't know much about Hyper-V, but in general I don't think the PCI
core should turn on PCI_COMMAND_MEMORY at all unless a driver requests
it. I assume that if a guest OS depends on PCI_COMMAND_MEMORY being
set, guest firmware would take care of setting it.

> > Jake has some concerns that I don't quite follow.
> > @Jake, could you please explain the concerns with more details?
> First, let me say that I really don't know whether this is an issue.
> I know it's an issue with other operating system kernels. I'm
> curious whether the Linux kernel / Linux PCI driver would behave in
> a way that has an issue here.
> The VM has a window of address space into which it chooses to put
> PCI devices' BARs. The guest OS will generally pick the value that
> is within the BAR, by default, but it can theoretically place the
> device in any free address space. The subset of the VM's memory
> address space which can be populated by devices' BARs is finite, and
> generally not particularly large.
> Imagine a VM that is configured with 25 NVMe controllers, each of
> which requires 64KiB of address space. (This is just an example.)
> At first boot, all of these NVMe controllers are packed into address
> space, one after the other.
> While that VM is running, one of the 25 NVMe controllers fails and
> is replaced with an NVMe controller from a separate manufacturer,
> but this one requires 128KiB of memory, for some reason. Perhaps it
> implements the "controller buffer" feature of NVMe. It doesn't fit
> in the hole that was vacated by the failed NVMe controller, so it
> needs to be placed somewhere else in address space. This process
> continues over months, with several more failures and replacements.
> Eventually, the address space is very fragmented.
> At some point, there is an attempt to place an NVMe controller into
> the VM but there is no contiguous block of address space free which
> would allow that NVMe controller to operate. There is, however,
> enough total address space if the other, currently functioning, NVMe
> controllers are moved from the address space that they are using to
> other ranges, consolidating their usage and reducing fragmentation.
> Let's call this a rebalancing of memory resources.
> When the NVMe controllers are moved, a new value is written into
> their BAR. In general, the PCI spec would require that you clear
> the memory enable bit in the command register (PCI_COMMAND_MEMORY)
> during this move operation, both so that there's never a moment when
> two devices are occupying the same address space and because writing
> a 64-bit BAR atomically isn't possible. This is the reason that I
> originally wrote the code in this driver to unmap the device from
> the VM's address space when the memory enable bit is cleared.
> What I don't know is whether this sequence of operations can ever
> happen in Linux, or perhaps in a VM running Linux. Will it
> rebalance resources in order to consolidate address space? If it
> will, will this involve clearing the memory enable bit to ensure
> that two devices never overlap?

This sequence definitely can occur in Linux, but it hasn't yet become
a real priority. But we do already have issues with assigning space
for hot-added devices in general, especially if firmware hasn't
assigned large windows to things like Thunderbolt controllers. I
suspect that we have or will soon have issues where resource
assignment starts failing after a few hotplugs, e.g., dock/undock events.

There have been patches posted to rebalance resources (quiesce
drivers, reassign, restart drivers), but they haven't gone anywhere
yet for lack of interest and momentum. I do feel like we're on the
tracks and the train is coming, though ;)
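
For reference, the BAR move sequence Jake describes above (clear the
memory enable bit, rewrite the 64-bit BAR as two 32-bit config writes,
then restore the command register) can be sketched roughly as follows.
This is only an illustration against a fake in-memory config space, not
the pci-hyperv or PCI core code: struct fake_dev, the cfg_* helpers,
move_bar64() and bar64_addr() are hypothetical names made up for the
sketch. The register offsets and bit values are the standard ones from
pci_regs.h. The fake device has no read-only bits, so the sketch
preserves the low BAR flag bits by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Standard config-space offsets/bits (same values as in pci_regs.h). */
#define PCI_COMMAND                   0x04
#define PCI_COMMAND_MEMORY            0x2
#define PCI_BASE_ADDRESS_0            0x10
#define PCI_BASE_ADDRESS_MEM_TYPE_64  0x04

/* Hypothetical stand-in for a device's 256-byte config space. */
struct fake_dev {
	uint8_t cfg[256];
};

/* Hypothetical little-endian config accessors for the fake device. */
static uint16_t cfg_read16(const struct fake_dev *d, int off)
{
	return (uint16_t)(d->cfg[off] | (d->cfg[off + 1] << 8));
}

static void cfg_write16(struct fake_dev *d, int off, uint16_t v)
{
	d->cfg[off] = (uint8_t)(v & 0xff);
	d->cfg[off + 1] = (uint8_t)(v >> 8);
}

static uint32_t cfg_read32(const struct fake_dev *d, int off)
{
	return cfg_read16(d, off) | ((uint32_t)cfg_read16(d, off + 2) << 16);
}

static void cfg_write32(struct fake_dev *d, int off, uint32_t v)
{
	cfg_write16(d, off, (uint16_t)(v & 0xffff));
	cfg_write16(d, off + 2, (uint16_t)(v >> 16));
}

/* Move the 64-bit memory BAR held in BAR0/BAR1 to new_addr. */
static void move_bar64(struct fake_dev *d, uint64_t new_addr)
{
	uint16_t cmd = cfg_read16(d, PCI_COMMAND);
	uint32_t flags = cfg_read32(d, PCI_BASE_ADDRESS_0) & 0xfu;

	/* Low four bits of a memory BAR are flags, not address bits. */
	assert((new_addr & 0xf) == 0);

	/* 1. Stop decoding memory so a half-written BAR is never live. */
	cfg_write16(d, PCI_COMMAND, (uint16_t)(cmd & ~PCI_COMMAND_MEMORY));

	/* 2. A 64-bit BAR cannot be written atomically: two 32-bit writes. */
	cfg_write32(d, PCI_BASE_ADDRESS_0, (uint32_t)new_addr | flags);
	cfg_write32(d, PCI_BASE_ADDRESS_0 + 4, (uint32_t)(new_addr >> 32));

	/* 3. Restore the original command register value. */
	cfg_write16(d, PCI_COMMAND, cmd);
}

/* Read back the address currently programmed in BAR0/BAR1. */
static uint64_t bar64_addr(const struct fake_dev *d)
{
	uint64_t lo = cfg_read32(d, PCI_BASE_ADDRESS_0) & 0xfffffff0u;
	uint64_t hi = cfg_read32(d, PCI_BASE_ADDRESS_0 + 4);

	return (hi << 32) | lo;
}
```

In the Hyper-V case discussed in this thread, step 1 is the point at
which the host driver unmaps the device from the VM's address space,
which is why the memory enable bit matters for any future rebalancing.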