Re: [PATCH v8 1/1] vfio/nvgrace-gpu: Add Blackwell-Next GPU readiness check via CXL DVSEC

From: Ankit Agrawal

Date: Tue Jun 02 2026 - 08:29:51 EST

Sashiko raised the review comments below [1].

> > @@ -712,6 +825,12 @@ nvgrace_gpu_read_mem(struct nvgrace_gpu_pci_core_device *nvdev,
> > else
> > mem_count = min(count, memregion->memlength - (size_t)offset);
> >
> > + if (nvdev->cxl_dvsec && READ_ONCE(nvdev->reset_done)) {
> > + ret = nvgrace_gpu_wait_device_ready_cxl(nvdev);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > scoped_guard(rwsem_read, &vdev->memory_lock) {
> > ret = nvgrace_gpu_check_device_ready(nvdev);
> > if (ret)
>
> [Severity: High]
> If a reset races between the lockless wait and the in-lock check, could this
> code return -EAGAIN to userspace?
> [...]
> Should nvgrace_gpu_read_mem() include the same retry loop around the wait
> and lock acquisition as the fault handler?
>
> > @@ -846,6 +965,12 @@ nvgrace_gpu_write_mem(struct nvgrace_gpu_pci_core_device *nvdev,
> > + if (nvdev->cxl_dvsec && READ_ONCE(nvdev->reset_done)) {
> > + ret = nvgrace_gpu_wait_device_ready_cxl(nvdev);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > scoped_guard(rwsem_read, &vdev->memory_lock) {
> > ret = nvgrace_gpu_check_device_ready(nvdev);
> > if (ret)
>
> [Severity: High]
> Does nvgrace_gpu_write_mem() have the same missing retry logic?
>
> Similar to the read path, a racing reset could cause this to return a
> spurious -EAGAIN to userspace.

This is the same point Alex addressed on v7 [2].

"Note that the read/write paths also have this gap where we can wait for
the device to be ready, but the check under memory_lock returns
-EAGAIN. The difference is that userspace will already automatically
handle the -EAGAIN vs the SIGBUS could be fatal."

So I'll leave the read/write paths as they are and not add a retry loop there.
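
For reference, the kind of userspace handling Alex refers to could look
roughly like the sketch below. It is only an illustration and is not taken
from any real VMM: the helper name pread_retry_eagain(), the max_retries
parameter and the 1 ms backoff are all assumptions made for the example.

```c
#include <errno.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical userspace helper (illustration only): retry pread() on a
 * device fd for as long as the kernel reports EAGAIN, up to max_retries
 * extra attempts. Any other error is returned to the caller unchanged.
 */
static ssize_t pread_retry_eagain(int fd, void *buf, size_t count,
				  off_t offset, int max_retries)
{
	/* Assumed backoff of 1 ms between attempts. */
	const struct timespec backoff = { .tv_sec = 0, .tv_nsec = 1000000 };
	ssize_t ret;
	int tries = 0;

	for (;;) {
		ret = pread(fd, buf, count, offset);
		if (ret >= 0 || errno != EAGAIN)
			return ret;
		if (tries++ >= max_retries)
			return ret;	/* still -1 with errno == EAGAIN */
		nanosleep(&backoff, NULL);
	}
}
```

With a wrapper like this, a read that loses the race against a reset is
simply reissued, whereas a SIGBUS raised from the fault handler cannot be
retried by the caller in the same way.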

[1] https://lore.kernel.org/all/20260528115613.63f1b178@xxxxxxxxxxx/
[2] https://lore.kernel.org/all/20260602065100.48B2D1F00893@xxxxxxxxxxxxxxx/