RE: [PATCH] PCI/IOV: Decrease VF memory BAR size to save host memory occupied by PTEs
From: Deucher, Alexander
Date: Tue Oct 11 2022 - 10:06:03 EST
[AMD Official Use Only - General]
> -----Original Message-----
> From: Ma, Rui <Rui.Ma@xxxxxxx>
> Sent: Tuesday, October 11, 2022 7:19 AM
> To: Bjorn Helgaas <helgaas@xxxxxxxxxx>
> Cc: bhelgaas@xxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; Deucher, Alexander
> <Alexander.Deucher@xxxxxxx>
> Subject: RE: [PATCH] PCI/IOV: Decrease VF memory BAR size to save host
> memory occupied by PTEs
>
> [AMD Official Use Only - General]
>
> Hi Helgass:
> Thank you very much for your suggestions on my patch!
>
> The patch is a device-specific behavior. In our AMD device SR-IOV,
> the actual VF BAR size is dependent on NumVFs too. If only one VF is
> created, the VF BAR size will depend on BAR probing algorithm as described
> in Section 9.3.3.14, but when several VFs created our own driver will
> decrease BAR0 memory size according to NumVFs. So I want to add this
> quirk to keep Linux code and certain driver code consistent.
>
> > Except that this doesn't affect the *starting* address of each VF BAR,
> which I guess is what you mean by "BAR memory mapping is always based on
> the initial device physical device."
> Yes we should not change the starting address or the device cannot load
> well.
>
> > Well, I guess the device still describes its worst-case BAR size; the quirk
> basically just optimizes space usage. Right?
> Yes.
>
> > Aren't both virt and phys contiguous and nicely aligned for this case? It
> seems like the perfect application for huge pages.
> It cannot use huge page though address aligned.
+ KVM, Christian
I think this is just working around a bug in KVM. This is to avoid wasting huge amounts of memory for page tables due to KVM using 4K pages for some reason. I think we should figure out why KVM is not using huge pages for this rather than working around this in the PCI layer.
Alex
>
> >> + shift = 1;
> >> + shift = virtfn_get_shift(dev, iov->num_VFs, i);
> >Maybe more of the fiddling could be hidden in the quirk, e.g.,
> > size = quirk_vf_bar_size(dev, iov->num_VFs, i);
> > if (!size)
> > size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> If I hide get-shift func in the quirk is it concise to call pci_iov_resource_size()
> in quirk?
>
> And I solved other issues in the patch sent later. Thank you for your
> patience!
>
>
> Regards,
> Rui
>
> -----Original Message-----
> From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
> Sent: Wednesday, September 28, 2022 6:07 AM
> To: Ma, Rui <Rui.Ma@xxxxxxx>
> Cc: bhelgaas@xxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] PCI/IOV: Decrease VF memory BAR size to save host
> memory occupied by PTEs
>
> On Mon, Sep 26, 2022 at 04:05:42PM +0800, Rui Ma wrote:
> > In SR_IOV scene, when the device physical space(such as Video RAM)is
> > fixed, as the number of VFs increases, the actual BAR memory space
> > used by each VF decreases.
>
> s/SR_IOV/SR-IOV/ to match spec usage.
>
> I think this is device-specific behavior, right? I don't see anything in the PCIe
> spec about the BAR size being dependent on NumVFs. If it's device-specific,
> it shouldn't be presented as "for all SR-IOV devices, the actual BAR memory
> space decreases as number of VFs increases."
>
> > However, the BAR memory mapping is always based on the initial device
> > physical device. So do not map this unneeded memory can save host
> > memory occupied by PTEs. Although each PTE only occupies a few bytes
> > of space on its own, a large number of PTEs can still take up a lot of space.
>
> So IIUC this is basically a quirk to override the "VF BAR aperture"
> size, which PCIe r6.0, sec 9.2.1.1.1 says is "determined by the usual BAR
> probing algorithm as described in Section 9.3.3.14."
>
> Except that this doesn't affect the *starting* address of each VF BAR, which I
> guess is what you mean by "BAR memory mapping is always based on the
> initial device physical device."
>
> Hmm. This kind of breaks the "plug and play" model of PCI because the
> device is no longer self-describing. Well, I guess the device still describes its
> worst-case BAR size; the quirk basically just optimizes space usage. Right?
>
> It's a shame if we can't reduce PTE usage by using hugeTLB pages for this.
> Aren't both virt and phys contiguous and nicely aligned for this case? It
> seems like the perfect application for huge pages.
>
> > Signed-off-by: Rui Ma <Rui.Ma@xxxxxxx>
> > ---
> > drivers/pci/iov.c | 14 ++++++++++++--
> > drivers/pci/pci.h | 15 +++++++++++++++
> > drivers/pci/quirks.c | 37 +++++++++++++++++++++++++++++++++++++
> > 3 files changed, 64 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index
> > 952217572113..6b9f9b6b9be1 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -296,6 +296,14 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> > struct pci_sriov *iov = dev->sriov;
> > struct pci_bus *bus;
> >
> > + /*
> > + * Some SR-IOV device's BAR map range is larger than they can actually
> use.
> > + * This extra BAR space occupy too much reverse mapping size(physical
> page
> > + * back to the PTEs). So add a divisor shift parameter to resize the
> > + * request resource of VF.
> > + */
> > + u16 shift = 1;
> > +
> > bus = virtfn_add_bus(dev->bus, pci_iov_virtfn_bus(dev, id));
> > if (!bus)
> > goto failed;
> > @@ -328,8 +336,10 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int id)
> > virtfn->resource[i].name = pci_name(virtfn);
> > virtfn->resource[i].flags = res->flags;
> > size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
> > + shift = 1;
> > + shift = virtfn_get_shift(dev, iov->num_VFs, i);
>
> Maybe more of the fiddling could be hidden in the quirk, e.g.,
>
> size = quirk_vf_bar_size(dev, iov->num_VFs, i);
> if (!size)
> size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>
> > virtfn->resource[i].start = res->start + size * id;
> > - virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
> > + virtfn->resource[i].end = virtfn->resource[i].start + (size >>
> > +(shift - 1)) - 1;
> > rc = request_resource(res, &virtfn->resource[i]);
> > BUG_ON(rc);
> > }
> > @@ -680,12 +690,12 @@ static int sriov_enable(struct pci_dev *dev, int
> nr_virtfn)
> > msleep(100);
> > pci_cfg_access_unlock(dev);
> >
> > + iov->num_VFs = nr_virtfn;
> > rc = sriov_add_vfs(dev, initial);
> > if (rc)
> > goto err_pcibios;
> >
> > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> > - iov->num_VFs = nr_virtfn;
> >
> > return 0;
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index
> > 3d60cabde1a1..befc67a280eb 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -603,6 +603,21 @@ static inline int pci_dev_specific_reset(struct
> > pci_dev *dev, bool probe) } #endif
> >
> > +struct virtfn_get_shift_methods {
> > + u16 vendor;
> > + u16 device;
> > + u16 (*get_shift)(struct pci_dev *dev, u16 arg, int arg2); };
> > +
> > +#ifdef CONFIG_PCI_QUIRKS
> > +u16 virtfn_get_shift(struct pci_dev *dev, u16 arg1, int arg2); #else
> > +static inline u16 virtfn_get_shift(struct pci_dev *dev, u16 arg1, int
> > +arg2) {
> > + return (u16)1;
> > +}
> > +#endif
> > +
> > #if defined(CONFIG_PCI_QUIRKS) && defined(CONFIG_ARM64) int
> > acpi_get_rc_resources(struct device *dev, const char *hid, u16 segment,
> > struct resource *res);
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index
> > da829274fc66..add587919705 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4085,6 +4085,43 @@ int pci_dev_specific_reset(struct pci_dev *dev,
> bool probe)
> > return -ENOTTY;
> > }
> >
> > +static u16 divided_by_VF(struct pci_dev *dev, u16 num_VFs, int
> > +bar_num)
>
> This is clearly ATI specific or at the very least specific to devices that divvy up
> BAR0 in special ways, so the name is a bit too generic.
>
> > +{
> > + u16 shift = 1;
> > +
> > + if (bar_num == 0) {
> > + while ((1 << shift) <= num_VFs)
> > + shift += 1;
> > + }
> > + pci_info(dev, "BAR %d get shift: %d.\n", bar_num, shift);
>
> Drop the period at end. If we're changing the size, I think it would be useful
> to know num_VFs, BAR #, and the new size. IIUC, "dev" here is the PF, so
> this is the "VF BAR", not the BAR 0 of the PF.
>
> > + return shift;
> > +}
> > +
> > +static const struct virtfn_get_shift_methods virtfn_get_shift_methods[] =
> {
> > + { PCI_VENDOR_ID_ATI, 0x73a1, divided_by_VF},
> > + { 0 }
> > +};
> > +
> > +/*
> > + * Get shift num to calculate SR-IOV device BAR. Sometimes the BAR
> > +size for
> > + * SR-IOV device is too large and we want to calculate the size to
> > +define
> > + * the end of virtfn.
> > + */
> > +u16 virtfn_get_shift(struct pci_dev *dev, u16 arg1, int arg2) {
> > + const struct virtfn_get_shift_methods *i;
> > +
> > + for (i = virtfn_get_shift_methods; i->get_shift; i++) {
> > + if ((i->vendor == dev->vendor ||
> > + i->vendor == (u16)PCI_ANY_ID) &&
> > + (i->device == dev->device ||
> > + i->device == (u16)PCI_ANY_ID))
> > + return i->get_shift(dev, arg1, arg2);
> > + }
> > +
> > + return (u16)1;
> > +}
> > +
> > static void quirk_dma_func0_alias(struct pci_dev *dev) {
> > if (PCI_FUNC(dev->devfn) != 0)
> > --
> > 2.25.1
> >