Re: [PATCH v7 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset
From: Alex Williamson
Date: Wed Nov 26 2025 - 10:24:13 EST
On Wed, 26 Nov 2025 05:26:26 +0000
<ankita@xxxxxxxxxx> wrote:
> From: Ankit Agrawal <ankita@xxxxxxxxxx>
>
> Introduce a new flag reset_done to notify that the GPU has just
> been reset and the mapping to the GPU memory is zapped.
>
> Implement the reset_done handler to set this new variable. It
> will be used later in the patches to wait for the GPU memory
> to be ready before doing any mapping or access.
>
> Cc: Jason Gunthorpe <jgg@xxxxxxxx>
> Suggested-by: Alex Williamson <alex@xxxxxxxxxxx>
> Signed-off-by: Ankit Agrawal <ankita@xxxxxxxxxx>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 26 +++++++++++++++++++++++++-
> 1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index f691deb8e43c..b46984e76be7 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -58,6 +58,8 @@ struct nvgrace_gpu_pci_core_device {
> /* Lock to control device memory kernel mapping */
> struct mutex remap_lock;
> bool has_mig_hw_bug;
> + /* GPU has just been reset */
> + bool reset_done;
> };
>
> static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
> @@ -1048,12 +1050,34 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
>
> MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
>
> +/*
> + * The GPU reset is required to be serialized against the *first* mapping
> + * faults and read/writes accesses to prevent potential RAS events logging.
> + *
> + * The reset_done implementation is triggered on every reset and is used
> + * set the reset_done variable that assists in achieving the serialization.
I can't parse this last sentence, "used _to_ set"?. Maybe something as
simple as:
/*
* First fault or access after a reset needs to poll device readiness,
* flag that a reset has occurred.
*/
Thanks,
Alex
> + */
> +static void nvgrace_gpu_vfio_pci_reset_done(struct pci_dev *pdev)
> +{
> + struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> + struct nvgrace_gpu_pci_core_device *nvdev =
> + container_of(core_device, struct nvgrace_gpu_pci_core_device,
> + core_device);
> +
> + nvdev->reset_done = true;
> +}
> +
> +static const struct pci_error_handlers nvgrace_gpu_vfio_pci_err_handlers = {
> + .reset_done = nvgrace_gpu_vfio_pci_reset_done,
> + .error_detected = vfio_pci_core_aer_err_detected,
> +};
> +
> static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
> .name = KBUILD_MODNAME,
> .id_table = nvgrace_gpu_vfio_pci_table,
> .probe = nvgrace_gpu_probe,
> .remove = nvgrace_gpu_remove,
> - .err_handler = &vfio_pci_core_err_handlers,
> + .err_handler = &nvgrace_gpu_vfio_pci_err_handlers,
> .driver_managed_dma = true,
> };
>