Re: drivers/pci: (and/or KVM): Slow PCI initialization during VM boot with passthrough of large BAR Nvidia GPUs on DGX H100

From: Alex Williamson
Date: Tue Dec 03 2024 - 17:06:44 EST


On Tue, 3 Dec 2024 14:33:10 -0600
Mitchell Augustin <mitchell.augustin@xxxxxxxxxxxxx> wrote:

> Thanks.
>
> I'm thinking about the cleanest way to accomplish this:
>
> 1. I'm wondering if replacing the pci_info() calls with equivalent
> printk_deferred() calls might be sufficient here. This works in my
> initial test, but I'm not sure if this is definitive proof that we
> wouldn't have any issues in all deployments, or if my configuration is
> just not impacted by this kind of deadlock.

Just switching to printk_deferred() alone seems like wishful thinking,
but if you were also to wrap the code in console_{un}lock(), that might
be a possible low-impact solution.

> 2. I did also draft a patch that would just eliminate the redundancy
> and disable the impacted logs by default, and allow them to be
> re-enabled with a new kernel command line option
> "pci=bar_logging_enabled" (at the cost of the performance gains due to
> reduced redundancy). This works well in all of my tests.

I suspect Bjorn would prefer not to add yet another pci command line
option and as we've seen here, the logs are useful by default.

> Do you think either of those approaches would work / be appropriate?
> Ultimately I am trying to avoid messy changes that would require
> actually propagating all of the info needed for these logs back up to
> pci_read_bases(), if at all possible, since there seems like no
> obvious way to do that without changing the signature of
> __pci_read_base() or tracking additional state.

The calling convention of __pci_read_base() is already changing if
we're having the caller disable decoding and it doesn't have a lot of
callers, so I don't think I'd worry about changing the signature.

I think maybe another alternative that doesn't hold off the console
would be to split the BAR sizing and resource processing into separate
steps. For example pci_read_bases() might pass arrays like:

	u32 bars[PCI_STD_NUM_BARS] = { 0 };
	u32 romsz = 0;

To a function like:

void __pci_read_bars(struct pci_dev *dev, u32 *bars, u32 *romsz,
		     int num_bars, int rom)
{
	u16 orig_cmd;
	u32 tmp;
	int i;

	if (!dev->mmio_always_on) {
		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
			pci_write_config_word(dev, PCI_COMMAND,
				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
		}
	}

	for (i = 0; i < num_bars; i++) {
		unsigned int pos = PCI_BASE_ADDRESS_0 + (i << 2);

		pci_read_config_dword(dev, pos, &tmp);
		pci_write_config_dword(dev, pos, ~0);
		pci_read_config_dword(dev, pos, &bars[i]);
		pci_write_config_dword(dev, pos, tmp);
	}

	if (rom) {
		pci_read_config_dword(dev, rom, &tmp);
		pci_write_config_dword(dev, rom, PCI_ROM_ADDRESS_MASK);
		pci_read_config_dword(dev, rom, romsz);
		pci_write_config_dword(dev, rom, tmp);
	}

	if (!dev->mmio_always_on && (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
		pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
}

pci_read_bases() would then iterate in a similar way that it does now,
passing pointers to the stashed data to __pci_read_base(), which would
then only do the resource processing and could freely print.
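
To illustrate the two-step split outside the kernel, here is a rough
userspace sketch. Everything in it is hypothetical: struct fake_dev and
the fake_* helpers stand in for config space and are not kernel APIs,
and it assumes only 32-bit memory BARs (no I/O or 64-bit BARs, no ROM).
The sizing pass stashes raw values without printing; the reporting pass
decodes the stashed values and is the only place that logs.

```c
#include <stdint.h>
#include <stdio.h>

#define FAKE_NUM_BARS 6

/* Hypothetical model of a device: current BAR values plus a mask of the
 * address bits that accept writes (bits below the BAR size read as 0). */
struct fake_dev {
	uint32_t bar[FAKE_NUM_BARS];
	uint32_t writable[FAKE_NUM_BARS];
};

static uint32_t fake_cfg_read(const struct fake_dev *d, int i)
{
	return d->bar[i];
}

static void fake_cfg_write(struct fake_dev *d, int i, uint32_t val)
{
	/* Only writable bits take the new value; the rest are preserved. */
	d->bar[i] = (val & d->writable[i]) | (d->bar[i] & ~d->writable[i]);
}

/* Step 1: size every BAR and stash the raw readback. No printing here. */
static void fake_size_bars(struct fake_dev *d, uint32_t *raw)
{
	for (int i = 0; i < FAKE_NUM_BARS; i++) {
		uint32_t tmp = fake_cfg_read(d, i);

		fake_cfg_write(d, i, ~0u);
		raw[i] = fake_cfg_read(d, i);
		fake_cfg_write(d, i, tmp);	/* restore original value */
	}
}

/* Decode one stashed value: drop the low flag bits, then the size is the
 * two's complement of the remaining mask. Returns 0 if unimplemented. */
static uint32_t fake_bar_size(uint32_t raw)
{
	uint32_t mask = raw & ~0xFu;

	return mask ? ~mask + 1 : 0;
}

/* Step 2: process the stashed values; free to print at this point. */
static void fake_report_bars(const uint32_t *raw)
{
	for (int i = 0; i < FAKE_NUM_BARS; i++) {
		uint32_t size = fake_bar_size(raw[i]);

		if (size)
			printf("BAR %d: size 0x%x\n", i, (unsigned int)size);
	}
}
```

A caller would run fake_size_bars() once and then fake_report_bars() on
the stashed array, so no logging happens while the BARs hold the sizing
pattern.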

To me that seems better than blocking the console... Maybe there are
other ideas on the list. Thanks,

Alex