Re: [PATCH v12 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Alex Williamson
Date: Wed Oct 25 2023 - 10:21:10 EST


On Wed, 25 Oct 2023 12:43:24 +0000
Ankit Agrawal <ankita@xxxxxxxxxx> wrote:

> While the physical BAR is present on the device, it is not being used on the host
> system. The access to the device memory region on the host occur through the
> C2C interconnect link (not the PCIe) and is present for access as a separate memory
> region in the physical address space on the host. The variant driver queries this range
> from the host ACPI DSD tables.

BTW, it's still never been answered why the latest QEMU series dropped
the _DSD support.

> Now, this device memory region on the host is exposed as a device BAR in the VM.
> So the device BAR in the VM is actually mapped to the device memory region in the
> physical address space (and not to the physical BAR) on the host. The config space
> accesses to the device however, are still going to the physical BAR on the host.
>
> > Does this BAR2 size match the size we're reporting for the region?  Now
> > I'm confused why we need to intercept the BAR2 region info if there's
> > physically a real BAR behind it.  Thanks,
>
> Yes, it does match the size being reported through region info. But the region
> info ioctl is still intercepted to provide additional cap to establish the sparse
> mapping. Why we do sparse mapping? The actual device memory size is not
> power-of-2 aligned (a requirement for a BAR). So we roundup to the next
> power-of-2 value and report the size as such. Then we utilize sparse mapping
> to show only the actual size of the device memory as mappable.

Yes, it's clear to me why we need the sparse mapping and why we
intercept the accesses, but the fact that there's an underlying
physical BAR of the same size in config space has been completely
absent in any previous discussions.

In light of that, I don't think we should be independently calculating
the BAR2 region size using roundup_pow_of_two(nvdev->memlength).
Instead we should be using pci_resource_len() of the physical BAR2 to
make it evident that this relationship exists.

The comments throughout should also be updated to reflect this as
currently they're written as if there is no physical BAR2 and we're
making a completely independent decision relative to BAR2 sizing. A
comment should also be added to nvgrace_gpu_vfio_pci_read/write()
explaining that the physical BAR2 provides the correct behavior
relative to config space accesses.

The probe function should also fail if pci_resource_len() for BAR2 is
not sufficient for the coherent memory region. Thanks,

Alex