Re: [PATCH v2 16/34] drm/msm: Mark VM as unusable on faults
From: Rob Clark
Date: Wed Mar 19 2025 - 17:32:07 EST
On Wed, Mar 19, 2025 at 9:15 AM Connor Abbott <cwabbott0@xxxxxxxxx> wrote:
>
> On Wed, Mar 19, 2025 at 10:55 AM Rob Clark <robdclark@xxxxxxxxx> wrote:
> >
> > From: Rob Clark <robdclark@xxxxxxxxxxxx>
> >
> > If userspace has opted-in to VM_BIND, then GPU faults and VM_BIND errors
> > will mark the VM as unusable.
> >
> > Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> > ---
> > drivers/gpu/drm/msm/msm_gem.h | 17 +++++++++++++++++
> > drivers/gpu/drm/msm/msm_gem_submit.c | 3 +++
> > drivers/gpu/drm/msm/msm_gpu.c | 16 ++++++++++++++--
> > 3 files changed, 34 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> > index acb976722580..7cb720137548 100644
> > --- a/drivers/gpu/drm/msm/msm_gem.h
> > +++ b/drivers/gpu/drm/msm/msm_gem.h
> > @@ -82,6 +82,23 @@ struct msm_gem_vm {
> >
> > /** @managed: is this a kernel managed VM? */
> > bool managed;
> > +
> > + /**
> > + * @unusable: True if the VM has turned unusable because something
> > + * bad happened during an asynchronous request.
> > + *
> > + * We don't try to recover from such failures, because this implies
> > + * informing userspace about the specific operation that failed, and
> > + * hoping the userspace driver can replay things from there. This all
> > + * sounds very complicated for little gain.
> > + *
> > + * Instead, we should just flag the VM as unusable, and fail any
> > + * further request targeting this VM.
> > + *
> > + * As an analogy, this would be mapped to a VK_ERROR_DEVICE_LOST
> > + * situation, where the logical device needs to be re-created.
> > + */
> > + bool unusable;
> > };
> > #define to_msm_vm(x) container_of(x, struct msm_gem_vm, base)
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > index 9731ad7993cf..9cef308a0ad1 100644
> > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > @@ -668,6 +668,9 @@ int msm_ioctl_gem_submit(struct drm_device *dev, void *data,
> > if (args->pad)
> > return -EINVAL;
> >
> > + if (to_msm_vm(ctx->vm)->unusable)
> > + return UERR(EPIPE, dev, "context is unusable");
> > +
> > /* for now, we just have 3d pipe.. eventually this would need to
> > * be more clever to dispatch to appropriate gpu module:
> > */
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > index 503e4dcc5a6f..4831f4e42fd9 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > @@ -386,8 +386,20 @@ static void recover_worker(struct kthread_work *work)
> >
> > /* Increment the fault counts */
> > submit->queue->faults++;
> > - if (submit->vm)
> > - to_msm_vm(submit->vm)->faults++;
> > + if (submit->vm) {
> > + struct msm_gem_vm *vm = to_msm_vm(submit->vm);
> > +
> > + vm->faults++;
> > +
> > + /*
> > + * If userspace has opted-in to VM_BIND (and therefore userspace
> > + * management of the VM), faults mark the VM as unusuable. This
> > + * matches vulkan expectations (vulkan is the main target for
> > + * VM_BIND)
>
> The bit about this matching Vulkan expectations isn't exactly true.
> Some Vulkan implementations do do this, but many will also just ignore
> the fault and try to continue going, and the spec allows either. It's
> a choice that we're making.
As mentioned on IRC, this is actually about GPU hangs rather then smmu
faults. I guess the $subject is a bit misleading.
BR,
-R
> Connor
>
> > + */
> > + if (!vm->managed)
> > + vm->unusable = true;
> > + }
> >
> > get_comm_cmdline(submit, &comm, &cmd);
> >
> > --
> > 2.48.1
> >