RE: [PATCH v10 1/1] vfio/nvgpu: Add vfio pci variant module for grace hopper

From: Tian, Kevin
Date: Thu Sep 28 2023 - 02:33:42 EST


> From: ankita@xxxxxxxxxx <ankita@xxxxxxxxxx>
> Sent: Friday, September 15, 2023 10:54 AM
>
> PCI BARs are power-of-2 aligned, but the actual memory on the
> device may not be. A read or write access to the physical address from
> the last device PFN up to the next power-of-2 aligned physical address
> results in reads returning ~0 and in dropped writes.

Though the variant driver emulates accesses to offsets beyond the
available memory size, how does the userspace driver or the guest
learn the actual size and avoid placing valid data in the invalid
hole?
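
e.g. one option might be to advertise only the real memory range
through the sparse mmap capability in the GET_REGION_INFO path (the
mmap comment below already assumes the caller follows a sparse
layout). An untested sketch, assuming the standard region-info ioctl
plumbing provides 'caps':

	struct vfio_region_info_cap_sparse_mmap *sparse;
	size_t size = struct_size(sparse, areas, 1);
	int ret;

	sparse = kzalloc(size, GFP_KERNEL);
	if (!sparse)
		return -ENOMEM;

	sparse->nr_areas = 1;
	sparse->areas[0].offset = 0;
	/* report the actual memory size, not the power-of-2 BAR size */
	sparse->areas[0].size = nvdev->memlength;
	sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
	sparse->header.version = 1;

	ret = vfio_info_add_capability(&caps, &sparse->header, size);
	kfree(sparse);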

> +config NVGRACE_GPU_VFIO_PCI
> +	tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> +	depends on ARM64 || (COMPILE_TEST && 64BIT)
> +	select VFIO_PCI_CORE
> +	help
> +	  VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> +	  required to assign the GPU device to a VM using KVM/qemu/etc.

A VFIO driver is not only for VMs.

> +
> +static int nvgrace_gpu_vfio_pci_mmap(struct vfio_device *core_vdev,
> +				     struct vm_area_struct *vma)
> +{
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> +		core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> +		core_device.vdev);
> +
> +	unsigned long start_pfn;
> +	unsigned int index;
> +	u64 req_len, pgoff, end;
> +	int ret = 0;
> +
> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +	if (index != VFIO_PCI_BAR2_REGION_INDEX)
> +		return vfio_pci_core_mmap(core_vdev, vma);
> +
> +	/*
> +	 * Request to mmap the BAR. Map to the CPU accessible memory on the
> +	 * GPU using the memory information gathered from the system ACPI
> +	 * tables.
> +	 */
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> +	if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> +	    check_add_overflow(PHYS_PFN(nvdev->memphys), pgoff, &start_pfn) ||
> +	    check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> +		return -EOVERFLOW;
> +
> +	/*
> +	 * Check that the mapping request does not go beyond available device
> +	 * memory size
> +	 */
> +	if (end > nvdev->memlength)
> +		return -EINVAL;
> +
> +	/*
> +	 * Perform a PFN map to the memory and back the device BAR by the
> +	 * GPU memory.
> +	 *
> +	 * The available GPU memory size may not be power-of-2 aligned. Given
> +	 * that the memory is exposed as a BAR, the mapping request is of the
> +	 * power-of-2 aligned size. Map only up to the size of the GPU

Why would the mapping request be power-of-2 sized? The caller should
follow the sparse mapping layout when raising the request. Otherwise
the check on 'end' right above will always fail, e.g. mapping the whole
power-of-2 BAR gives end == bar_size > memlength and returns -EINVAL.

> +
> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing it to a user space
> + * driver.
> + *
> + * Read requests beyond the actual device size are filled with ~0, while
> + * those beyond the reported size are skipped.
> + *
> + * A read from a negative or a reported+ offset, or a negative count, is

It's hard to understand 'reported+'. Please replace it with a more
descriptive word.

> + * considered an error condition and returned with -EINVAL.
> + */
> +ssize_t nvgrace_gpu_read_mem(void __user *buf, size_t count, loff_t *ppos,
> +			     struct nvgrace_gpu_vfio_pci_core_device *nvdev)
> +{
> +	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> +	size_t mem_count, i, bar_size = roundup_pow_of_two(nvdev->memlength);
> +	u8 val = 0xFF;
> +
> +	if (offset >= bar_size)
> +		return -EINVAL;
> +
> +	/* Clip short the read request beyond reported BAR size */
> +	count = min(count, bar_size - (size_t)offset);
> +
> +	/*
> +	 * Determine how many bytes to be actually read from the device
> +	 * memory. Do not read from the offset beyond available size.
> +	 */
> +	if (offset >= nvdev->memlength)
> +		return 0;

This violates the comment:

 * Read requests beyond the actual device size are filled with ~0, while
 * those beyond the reported size are skipped.
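
i.e. don't bail out early here; read whatever is backed by device
memory and pad the tail of the request, along the lines of (untested):

	mem_count = 0;
	if (offset < nvdev->memlength)
		mem_count = min(count, nvdev->memlength - (size_t)offset);

	if (copy_to_user(buf, (u8 *)nvdev->memmap + offset, mem_count))
		return -EFAULT;

	/* pad the unbacked tail of the request with ~0 */
	for (i = mem_count; i < count; i++) {
		if (put_user(val, (u8 __user *)buf + i))
			return -EFAULT;
	}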

> +
> +	mem_count = min(count, nvdev->memlength - (size_t)offset);
> +
> +	/*
> +	 * Handle read on the BAR2 region. Map to the target device memory
> +	 * physical address and copy to the request read buffer.
> +	 */
> +	if (copy_to_user(buf, (u8 *)nvdev->memmap + offset, mem_count))
> +		return -EFAULT;
> +
> +	/*
> +	 * Only the device memory present on the hardware is mapped, which may
> +	 * not be power-of-2 aligned. A read to the BAR2 region implies an
> +	 * access outside the available device memory on the hardware. Fill

Why does a read to the BAR2 region imply an out-of-scope access?

> +static ssize_t nvgrace_gpu_vfio_pci_read(struct vfio_device *core_vdev,
> +					 char __user *buf, size_t count,
> +					 loff_t *ppos)
> +{
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> +		core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> +		core_device.vdev);
> +
> +	if (index == VFIO_PCI_BAR2_REGION_INDEX) {
> +		mutex_lock(&nvdev->memmap_lock);
> +		if (!nvdev->memmap) {
> +			nvdev->memmap = memremap(nvdev->memphys,
> +						 nvdev->memlength, MEMREMAP_WB);
> +			if (!nvdev->memmap) {
> +				mutex_unlock(&nvdev->memmap_lock);
> +				return -ENOMEM;
> +			}
> +		}
> +		mutex_unlock(&nvdev->memmap_lock);

Move the above into a helper and share it with the write fn.
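
Something like this (untested; the helper name is just a suggestion):

static int nvgrace_gpu_map_device_mem(
		struct nvgrace_gpu_vfio_pci_core_device *nvdev)
{
	int ret = 0;

	mutex_lock(&nvdev->memmap_lock);
	if (!nvdev->memmap) {
		nvdev->memmap = memremap(nvdev->memphys, nvdev->memlength,
					 MEMREMAP_WB);
		if (!nvdev->memmap)
			ret = -ENOMEM;
	}
	mutex_unlock(&nvdev->memmap_lock);

	return ret;
}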

> +ssize_t nvgrace_gpu_write_mem(size_t count, loff_t *ppos,
> +			      const void __user *buf,
> +			      struct nvgrace_gpu_vfio_pci_core_device *nvdev)
> +{
> +	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> +	size_t mem_count, bar_size = roundup_pow_of_two(nvdev->memlength);
> +
> +	if (offset >= bar_size)
> +		return -EINVAL;
> +
> +	/* Clip short the read request beyond reported BAR size */
> +	count = min(count, bar_size - (size_t)offset);

s/read/write/

> +
> +	/*
> +	 * Determine how many bytes to be actually written to the device
> +	 * memory. Do not write to the offset beyond available size.
> +	 */
> +	if (offset >= nvdev->memlength)
> +		return 0;

This still needs to return count, as a no-op write is still a
successful emulation of the write.

Ditto for the read fn.
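
i.e.:

	/* no memory backs this range; the dropped write still succeeds */
	if (offset >= nvdev->memlength)
		return count;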

> +
> +	mem_count = min(count, nvdev->memlength - (size_t)offset);
> +
> +	/*
> +	 * Only the device memory present on the hardware is mapped, which may
> +	 * not be power-of-2 aligned. A write to the BAR2 region implies an
> +	 * access outside the available device memory on the hardware. Drop
> +	 * those write requests.
> +	 */
> +	if (copy_from_user((u8 *)nvdev->memmap + offset, buf, mem_count))
> +		return -EFAULT;
> +
> +	return count;

*ppos is not adjusted. Ditto for the read fn.
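
i.e. before returning:

	*ppos += count;
	return count;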

> +
> +static ssize_t coherent_mem_show(struct device *dev,
> +				 struct device_attribute *attr, char *buf)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(dev);
> +	struct nvgrace_gpu_vfio_pci_core_device *nvdev =
> +		container_of(core_device,
> +			     struct nvgrace_gpu_vfio_pci_core_device,
> +			     core_device);
> +
> +	return sprintf(buf, "%u\n", nvdev->memlength ? 1 : 0);

If nvdev->memlength is 0 then probe will fail, so this can always
return 1.
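
e.g. (sysfs also prefers sysfs_emit() over sprintf()):

	return sysfs_emit(buf, "1\n");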