Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

From: Ankit Agrawal
Date: Thu Dec 21 2023 - 07:43:44 EST


Thanks Alex and Cedric for the review.

>> +/*
>> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +
>> +#include "nvgrace_gpu_vfio_pci.h"
>
> drivers/vfio/pci/nvgrace-gpu/main.c:6:10: fatal error: nvgrace_gpu_vfio_pci.h: No such file or directory
>    6 | #include "nvgrace_gpu_vfio_pci.h"

Yeah, managed to miss the file. Will fix that in the next version.

>> +
>> +static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
>> +{
>> +     return (index == RESMEM_REGION_INDEX ||
>> +         index == USEMEM_REGION_INDEX);
>> +}
>
> Presumably these macros are defined in the missing header, though we
> don't really need a header file just for that.  This doesn't need to be
> line wrapped, it's short enough with the macros as is.

Yeah that and the structures are moved to the header file.

>> +     info.flags = VFIO_REGION_INFO_FLAG_READ |
>> +             VFIO_REGION_INFO_FLAG_WRITE |
>> +             VFIO_REGION_INFO_FLAG_MMAP;
>
> Align these all:
>
>        info.flags = VFIO_REGION_INFO_FLAG_READ |
>                    VFIO_REGION_INFO_FLAG_WRITE |
>                     VFIO_REGION_INFO_FLAG_MMAP;

Ack.

>> +
>> +static bool range_intersect_range(loff_t range1_start, size_t count1,
>> +                               loff_t range2_start, size_t count2,
>> +                               loff_t *start_offset,
>> +                               size_t *intersect_count,
>> +                               size_t *register_offset)
>
> We should put this somewhere shared with virtio-vfio-pci.

Yeah, will move to vfio_pci_core.c

>> +
>> +     if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
>> +                               &copy_offset, &copy_count, &register_offset)) {
>> +             bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
>> +             nvdev->resmem.u64_reg &= ~(bar_size - 1);
>> +             nvdev->resmem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> +                                      PCI_BASE_ADDRESS_MEM_PREFETCH;
>> +             val64 = cpu_to_le64(nvdev->resmem.u64_reg);
>
> As suggested and implemented in virtio-vfio-pci, store the value as
> little endian, then the write function simply becomes a
> copy_from_user(), we only need a cpu native representation of the value
> on read.

Ack.

>> +
>> +     if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
>> +                               &copy_offset, &copy_count, &register_offset)) {
>> +             if (copy_from_user((void *)&val16, buf + copy_offset, copy_count))
>> +                     return -EFAULT;
>> +
>> +             if (le16_to_cpu(val16) & PCI_COMMAND_MEMORY)
>> +                     nvdev->bars_disabled = false;
>> +             else
>> +                     nvdev->bars_disabled = true;
>
> nvdev->bars_disabled = !(le16_to_cpu(val16) & PCI_COMMAND_MEMORY);
>
> But you're only tracking COMMAND_MEM relative to user writes, memory
> will be disabled on reset and should not initially be enabled.

I suppose you are suggesting we disable during reset and not enable until
VM driver does so through PCI_COMMAND?

> But then also, isn't this really just an empty token of pretending this
> is a PCI BAR if only the read/write and not mmap path honor the device
> memory enable bit?  We'd need to zap and restore vmas mapping these
> BARs if this was truly behaving as a PCI memory region.

I can do that change, but for my information, is this a requirement to be
PCI compliant?

> We discussed previously that access through the coherent memory is
> unaffected by device reset, is that also true of this non-cached BAR as
> well?

Actually, the coherent memory is not entirely unaffected during device reset.
It becomes unavailable (and read access returns ~0) for a brief time during
the reset. The non-cached BAR behaves in the same way as they are just
as it is just a carved out part of device memory.

> TBH, I'm still struggling with the justification to expose these memory
> ranges as BAR space versus attaching them as a device specific region
> where QEMU would map them into the VM address space and create ACPI
> tables to describe them to reflect the same mechanism in the VM as used
> on bare metal.  AFAICT the justification boils down to avoiding work in
> QEMU and we're sacrificing basic PCI semantics and creating a more
> complicated kernel driver to get there.  Let's have an on-list
> discussion why this is the correct approach.

Sorry it isn't clear to me how we are sacrificing PCI semantics here. What
features are we compromising (after we fix the ones you pointed out above)?

And if we managed to make these fake BARs PCI compliant, I suppose the
primary objection is the additional code that we added to make it compliant?

>> +     ret = nvgrace_gpu_map_device_mem(nvdev, index);
>> +     if (ret)
>> +             goto read_exit;
>
> We don't need a goto to simply return an error.

Yes, will fix that.

>> +     if (index == USEMEM_REGION_INDEX) {
>> +             if (copy_to_user(buf, (u8 *)nvdev->usemem.bar_remap.memaddr + offset, mem_count))
>> +                     ret = -EFAULT;
>> +     } else {
>> +             return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
>> +                             (char __user *) buf, offset, mem_count, 0, 0, false);
>
> The vfio_device_ops.read prototype defines buf as a char __user*, so
> maybe look at why it's being passed as a void __user* rather than
> casting.

True, will fix that.

>> +             /* Check if the bars are disabled, allow access otherwise */
>> +             down_read(&nvdev->core_device.memory_lock);
>> +             if (nvdev->bars_disabled) {
>> +                     up_read(&nvdev->core_device.memory_lock);
>> +                     return -EIO;
>> +             }
>
> Why do we need bars_disabled here, or at all?  If we let do_io_rw()
> test memory it would read the command register from vconfig and all of
> this is redundant.

Yes, and I will make use of the same flag to cover the
USEMEM_REGION_INDEX cacheable device memory accesses.

>> -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>> -                     void __iomem *io, char __user *buf,
>> -                     loff_t off, size_t count, size_t x_start,
>> -                     size_t x_end, bool iswrite)
>> +ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>> +               void __iomem *io, char __user *buf,
>> +               loff_t off, size_t count, size_t x_start,
>> +               size_t x_end, bool iswrite)
>
> This should be exported in a separate patch and renamed to be
> consistent with other vfio-pci-core functions.

Sure, and will rename with vfio_pci_core prefix.

>> @@ -199,6 +199,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>>
>>       return done;
>>  }
>> +EXPORT_SYMBOL(do_io_rw);
>
> NAK, _GPL.  Thanks,

Yes, will make the change.

>./scripts/checkpatch.pl --strict will give you some tips on how to
improve the changes furthermore.

Yes, will do that.