Re: [PATCH v8 6/6] vfio/nvgrace-gpu: wait for the GPU mem to be ready

From: Alex Williamson

Date: Wed Nov 26 2025 - 17:03:28 EST


On Wed, 26 Nov 2025 19:28:46 +0000
<ankita@xxxxxxxxxx> wrote:
> +/*
> + * If the GPU memory is accessed by the CPU while the GPU is not ready
> + * after reset, it can cause harmless corrected RAS events to be logged.
> + * Make sure the GPU is ready before establishing the mappings.
> + */
> +static int
> +nvgrace_gpu_check_device_ready(struct nvgrace_gpu_pci_core_device *nvdev)
> +{
> +	struct vfio_pci_core_device *vdev = &nvdev->core_device;
> +	int ret;
> +
> +	lockdep_assert_held_read(&vdev->memory_lock);
> +
> +	if (!nvdev->reset_done)
> +		return 0;
> +
> +	ret = nvgrace_gpu_wait_device_ready(vdev->barmap[0]);
> +	if (ret)
> +		return ret;
> +
> +	nvdev->reset_done = false;
> +
> +	return 0;
> +}

It seems like we can call wait_device_ready here, generating ioread
accesses to BAR0, without knowing the memory-enable state of the device
in the command register. Is there anything special about this device
relative to BAR0 accesses regardless of the memory-enable bit that
allows us to ignore that?

If not, do we need to test before wait_device_ready, such as:

	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
		return -EIO;

This opens a small can of worms, though: vfio-pci allows read/write
access regardless of pm_runtime_engaged by waking the device around
such accesses. This driver doesn't currently participate in
runtime PM beyond the vfio-pci-core code. Do we need to add runtime PM
wrappers in its read/write handlers and a separate wrapper here that
drops the pm_runtime_engaged test?
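
For illustration only, a wrapper in this driver's read path might look
roughly like the sketch below (untested, names approximate; it simply
wakes the device around the core call the way vfio-pci-core does for
config space):

	static ssize_t nvgrace_gpu_read(struct vfio_device *core_vdev,
					char __user *buf, size_t count,
					loff_t *ppos)
	{
		struct vfio_pci_core_device *vdev =
			container_of(core_vdev, struct vfio_pci_core_device,
				     vdev);
		ssize_t ret;

		/* Wake the device so BAR accesses are safe wrt runtime PM */
		ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
		if (ret)
			return ret;

		ret = vfio_pci_core_read(core_vdev, buf, count, ppos);

		pm_runtime_put(&vdev->pdev->dev);
		return ret;
	}

A matching wrapper for the write path, plus a check_device_ready variant
that tests only __vfio_pci_memory_enabled(), would then cover the cases
above.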

There's a comment in the driver indicating the device is tolerant of
certain accesses independent of the memory-enable bit, so I don't know
how much of this is actually required. Thanks,

Alex