Re: drivers/pci: (and/or KVM): Slow PCI initialization during VM boot with passthrough of large BAR Nvidia GPUs on DGX H100

From: Mitchell Augustin
Date: Tue Dec 03 2024 - 18:09:46 EST


Thanks for the suggestions!

> The calling convention of __pci_read_base() is already changing if we're having the caller disable decoding

The way I implemented that in my initial patch draft[0] still allows
for __pci_read_base() to be called independently, as it was
originally, since (as far as I understand) the encode disable/enable
is just a mask - so I didn't need to remove the disable/enable inside
__pci_read_base(), and instead just added an extra one in
pci_read_bases(), turning the __pci_read_base() disable/enable into a
no-op when called from pci_read_bases(). In any case...

> I think maybe another alternative that doesn't hold off the console would be to split the BAR sizing and resource processing into separate steps.

This seems like a potentially better option, so I'll dig into that approach.


Providing some additional info you requested last week, just for more context:

> Do you have similar logs from that [hotplug] operation

Attached [1] is the guest boot output (boot was quick, since no GPUs
were attached at boot time)

Here is what I see when I enable a single GPU (takes 1-3 seconds):
[Dec 3 22:53] pci 0000:09:00.0: [10de:2330] type 00 class 0x030200 PCIe Endpoint
[ +0.000526] pci 0000:09:00.0: BAR 0 [mem 0x00000000-0x00ffffff 64bit pref]
[ +0.000237] pci 0000:09:00.0: BAR 2 [mem 0x00000000-0x1fffffffff 64bit pref]
[ +0.000212] pci 0000:09:00.0: BAR 4 [mem 0x00000000-0x01ffffff 64bit pref]
[ +0.000275] pci 0000:09:00.0: Max Payload Size set to 128 (was 256, max 256)
[ +0.000453] pci 0000:09:00.0: Enabling HDA controller
[ +0.003052] pci 0000:09:00.0: 252.048 Gb/s available PCIe bandwidth,
limited by 16.0 GT/s PCIe x16 link at 0000:00:02.0 (capable of 504.112
Gb/s with 32.0 GT/s PCIe x16 link)
[ +0.003327] pci 0000:09:00.0: BAR 2 [mem
0x384000000000-0x385fffffffff 64bit pref]: assigned
[ +0.000258] pci 0000:09:00.0: BAR 4 [mem
0x386000000000-0x386001ffffff 64bit pref]: assigned
[ +0.000311] pci 0000:09:00.0: BAR 0 [mem
0x386002000000-0x386002ffffff 64bit pref]: assigned
[ +0.000993] nvidia 0000:09:00.0: enabling device (0140 -> 0142)

And below[2] is the output of /proc/iomem after I have done that
process 4 times (GPUs are 0000:08, 0000:09, 0000:0a, 0000:0b)

[0] https://raw.githubusercontent.com/MitchellAugustin/script-dump-public/refs/heads/main/slow-ovmf-case/patches/0001-Move-decode-disable-enable-up-one-level-and-add-kern.patch
[1] https://pastebin.ubuntu.com/p/T4myPSQSJ5/
[2] https://pastebin.ubuntu.com/p/GD2WtkW9Gq/

Other info:
Guest:
ubuntu@testbox02:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.12.1
root=UUID=fec1c9ae-0df3-419c-80dd-f3035049b845 ro console=tty1
console=ttyS0

Host (DGX H100):
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.12.0+
root=UUID=bb04f707-1c00-4806-8adb-cf9eb876786f ro
console=ttyS0,115200n8 iommu=pt init_on_alloc=0

- Mitchell Augustin

On Tue, Dec 3, 2024 at 4:06 PM Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
>
> On Tue, 3 Dec 2024 14:33:10 -0600
> Mitchell Augustin <mitchell.augustin@xxxxxxxxxxxxx> wrote:
>
> > Thanks.
> >
> > I'm thinking about the cleanest way to accomplish this:
> >
> > 1. I'm wondering if replacing the pci_info() calls with equivalent
> > printk_deferred() calls might be sufficient here. This works in my
> > initial test, but I'm not sure if this is definitive proof that we
> > wouldn't have any issues in all deployments, or if my configuration is
> > just not impacted by this kind of deadlock.
>
> Just switching to printk_deferred() alone seems like wishful thinking,
> but if you were also to wrap the code in console_{un}lock(), that might
> be a possible low-impact solution.
>
> > 2. I did also draft a patch that would just eliminate the redundancy
> > and disable the impacted logs by default, and allow them to be
> > re-enabled with a new kernel command line option
> > "pci=bar_logging_enabled" (at the cost of the performance gains due to
> > reduced redundancy). This works well in all of my tests.
>
> I suspect Bjorn would prefer not to add yet another pci command line
> option and as we've seen here, the logs are useful by default.
>
> > Do you think either of those approaches would work / be appropriate?
> > Ultimately I am trying to avoid messy changes that would require
> > actually propagating all of the info needed for these logs back up to
> > pci_read_bases(), if at all possible, since there seems like no
> > obvious way to do that without changing the signature of
> > __pci_read_base() or tracking additional state.
>
> The calling convention of __pci_read_base() is already changing if
> we're having the caller disable decoding and it doesn't have a lot of
> callers, so I don't think I'd worry about changing the signature.
>
> I think maybe another alternative that doesn't hold off the console
> would be to split the BAR sizing and resource processing into separate
> steps. For example pci_read_bases() might pass arrays like:
>
> u32 bars[PCI_STD_NUM_BARS] = { 0 };
> u32 romsz = 0;
>
> To a function like:
>
> void __pci_read_bars(struct pci_dev *dev, u32 *bars, u32 *romsz,
> int num_bars, int rom)
> {
> u16 orig_cmd;
> u32 tmp;
> int i;
>
> if (!dev->mmio_always_on) {
> pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
> pci_write_config_word(dev, PCI_COMMAND,
> orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> }
> }
>
> for (i = 0; i < num_bars; i++) {
> unsigned int pos = PCI_BASE_ADDRESS_0 + (i << 2);
>
> pci_read_config_dword(dev, pos, &tmp);
> pci_write_config_dword(dev, pos, ~0);
> pci_read_config_dword(dev, pos, &bars[i]);
> pci_write_config_dword(dev, pos, tmp);
> }
>
> if (rom) {
> pci_read_config_dword(dev, rom, &tmp);
> pci_write_config_dword(dev, rom, PCI_ROM_ADDRESS_MASK);
> pci_read_config_dword(dev, rom, romsz);
> pci_write_config_dword(dev, rom, tmp);
> }
>
> if (!dev->mmio_always_on && (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
> pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
> }
>
> pci_read_bases() would then iterate in a similar way that it does now,
> passing pointers to the stashed data to __pci_read_base(), which would
> then only do the resource processing and could freely print.
>
> To me that seems better than blocking the console... Maybe there are
> other ideas on the list. Thanks,
>
> Alex
>


--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering
Email:mitchell.augustin@xxxxxxxxxxxxx
Location:United States of America


canonical.com
ubuntu.com