Re: [PATCH v9 0/6] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset
From: Alex Williamson
Date: Fri Nov 28 2025 - 14:45:00 EST
On Thu, 27 Nov 2025 17:06:26 +0000
<ankita@xxxxxxxxxx> wrote:
> From: Ankit Agrawal <ankita@xxxxxxxxxx>
>
> NVIDIA's Grace-based systems have large GPU device memory. The device
> memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
> module could make use of the huge PFNMAP support added in mm [1].
>
> To achieve this, the nvgrace-gpu module is updated to implement
> huge_fault ops. The implementation establishes the mapping according to
> the requested order. Note that if the PFN or the VMA address is unaligned
> to the order, the mapping falls back to the PTE level.
>
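The alignment rule described above can be illustrated with a small userspace sketch. This is not the driver code: the `sketch_*` names and the 4 KiB page size are assumptions for illustration, and the real fault path works with kernel types and returns a fallback code to the mm core rather than choosing an order itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u /* assumption: 4 KiB base pages */

/*
 * Hypothetical helper: true if both the address to be mapped and the
 * PFN backing it are aligned to a block of 2^order pages.
 */
static bool sketch_aligned_to_order(uint64_t addr, uint64_t pfn,
                                    unsigned int order)
{
    /* Guard against shifting past the width of uint64_t. */
    assert(order < 64u - SKETCH_PAGE_SHIFT);

    uint64_t pages = UINT64_C(1) << order;        /* pages per mapping */
    uint64_t bytes = pages << SKETCH_PAGE_SHIFT;  /* bytes per mapping */

    return (addr & (bytes - 1)) == 0 && (pfn & (pages - 1)) == 0;
}

/*
 * Hypothetical helper: keep the requested order when everything is
 * aligned, otherwise drop to order 0 (a single PTE-level mapping).
 */
static unsigned int sketch_effective_order(uint64_t addr, uint64_t pfn,
                                           unsigned int order)
{
    return sketch_aligned_to_order(addr, pfn, order) ? order : 0u;
}
```

With 4 KiB pages, order 9 corresponds to a 2 MiB mapping, so an address of 0x200000 backed by PFN 0x200 keeps order 9, while an address of 0x201000 or a PFN of 0x201 drops to order 0.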
> Secondly, it is expected that the mapping not be re-established until
> the GPU is ready post reset. Presence of the mappings during that time
> could potentially lead to harmless corrected RAS events being logged if
> the CPU attempts speculative reads of the GPU memory on the Grace
> systems.
>
> It can take several seconds for the GPU to be ready. So it is desirable
> that the time overlaps as much of the VM startup as possible to reduce
> impact on the VM bootup time. The GPU readiness state is thus checked
> on the first fault/huge_fault request, which amortizes the GPU readiness
> time. The GPU readiness is checked through BAR0 registers, as is done
> at the device probe.
>
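The deferred readiness check can be modelled roughly as follows. This is a userspace sketch under stated assumptions, not the driver implementation: `struct sketch_gpu`, the poll limit, and the callback standing in for the BAR0 register check are all hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state for the sketch. */
struct sketch_gpu {
    bool reset_done;              /* set on reset, cleared once ready */
    bool (*probe_ready)(void *);  /* stand-in for the BAR0 check */
    void *probe_ctx;
};

/* Reset handler: only records that readiness must be re-checked. */
static void sketch_reset_done(struct sketch_gpu *g)
{
    g->reset_done = true;
}

/*
 * Called at the start of a fault: returns 0 if the mapping may be
 * established, -1 if the device did not become ready within max_polls.
 */
static int sketch_fault_check_ready(struct sketch_gpu *g,
                                    unsigned int max_polls)
{
    assert(g != NULL && g->probe_ready != NULL);

    if (!g->reset_done)
        return 0; /* no reset pending, nothing to wait for */

    for (unsigned int i = 0; i < max_polls; i++) {
        if (g->probe_ready(g->probe_ctx)) {
            g->reset_done = false; /* later faults skip the check */
            return 0;
        }
    }
    return -1;
}

/* Stand-in probe: reports ready once the countdown in ctx hits zero. */
static bool sketch_countdown_probe(void *ctx)
{
    unsigned int *remaining = ctx;

    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

The point of the design is that the reset handler returns immediately and the wait is only paid by the first fault after reset, by which time part of the GPU's readiness delay has already elapsed during VM startup.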
> Patch 1 refactors vfio_pci_mmap_huge_fault and exports the code to map
> at the various levels.
>
> Patch 2 implements support for huge pfnmap.
>
> Patch 3 cleans up vfio_pci_core_mmap.
>
> Patch 4 splits out the code to check the device readiness.
>
> Patch 5 implements the reset_done handler.
>
> Patch 6 ensures that the GPU is ready before re-establishing the mapping
> after reset.
>
> Applied over 6.18-rc6.
>
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@xxxxxxxxxx/ [1]
>
> Changelog:
> [v9]
Applied to vfio next branch for v6.19. Thanks,
Alex