Re: [PATCH drm-misc-next v5 4/6] drm/gpuvm: track/lock/validate external/evicted objects

From: Thomas Hellström
Date: Wed Oct 04 2023 - 13:57:30 EST


On Wed, 2023-10-04 at 19:17 +0200, Danilo Krummrich wrote:
> On 10/4/23 17:29, Thomas Hellström wrote:
> >
> > On Wed, 2023-10-04 at 14:57 +0200, Danilo Krummrich wrote:
> > > On 10/3/23 11:11, Thomas Hellström wrote:
> > >
> > > <snip>
> > >
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_gpuvm_bo_evict() - add / remove a &drm_gpuvm_bo to
> > > > > > /
> > > > > > from the &drm_gpuvms
> > > > > > + * evicted list
> > > > > > + * @vm_bo: the &drm_gpuvm_bo to add or remove
> > > > > > + * @evict: indicates whether the object is evicted
> > > > > > + *
> > > > > > + * Adds a &drm_gpuvm_bo to or removes it from the
> > > > > > &drm_gpuvms
> > > > > > evicted list.
> > > > > > + */
> > > > > > +void
> > > > > > +drm_gpuvm_bo_evict(struct drm_gpuvm_bo *vm_bo, bool evict)
> > > > > > +{
> > > > > > +    struct drm_gem_object *obj = vm_bo->obj;
> > > > > > +
> > > > > > +    dma_resv_assert_held(obj->resv);
> > > > > > +
> > > > > > +    /* Always lock list transactions, even if
> > > > > > DRM_GPUVM_RESV_PROTECTED is
> > > > > > +     * set. This is required to protect multiple
> > > > > > concurrent
> > > > > > calls to
> > > > > > +     * drm_gpuvm_bo_evict() with BOs with different
> > > > > > dma_resv.
> > > > > > +     */
> > > > >
> > > > > This doesn't work. The RESV_PROTECTED case requires the
> > > > > evicted
> > > > > flag we discussed before. The list is either protected by the
> > > > > spinlock or the resv. Otherwise a list add could race with a
> > > > > list
> > > > > removal elsewhere.
> > >
> > > I think it does unless I miss something, but it might be a bit
> > > subtle
> > > though.
> > >
> > > Concurrent drm_gpuvm_bo_evict() are protected by the spinlock.
> > > Additionally, when
> > > drm_gpuvm_bo_evict() is called we hold the dma-resv of the
> > > corresponding GEM object.
> > >
> > > In drm_gpuvm_validate() I assert that we hold *all* dma-resv,
> > > which
> > > implies that no
> > > one can call drm_gpuvm_bo_evict() on any of the VM's objects and
> > > no
> > > one can add a new
> > > one and directly call drm_gpuvm_bo_evict() on it either.
> >
> > But translated into how the data (the list in this case) is
> > protected
> > it becomes
> >
> > "Either the spinlock and the bo resv of a single list item OR the
> > bo
> > resvs of all bos that can potentially be on the list",
> >
> > while this is certainly possible to assert, any new / future code
> > that
> > manipulates the evict list will probably get this wrong and as a
> > result
> > the code becomes pretty fragile. I think drm_gpuvm_bo_destroy()
> > already
> > gets it wrong in that it, while holding a single resv, doesn't take
> > the
> > spinlock.
>
> That's true and I don't like it either. Unfortunately, with the dma-
> resv
> locking scheme we can't really protect the evict list without the
> drm_gpuvm_bo::evicted trick properly.
>
> But as pointed out in my other reply, I'm a bit worried about the
> drm_gpuvm_bo::evicted trick being too restrictive, but maybe it's
> fine
> doing it in the RESV_PROTECTED case.

Ah, indeed. I misread that as discussing the current code rather than
the drm_gpuvm_bo::evicted trick. If validating only a subset, or a
range, then with the drm_gpuvm_bo::evicted trick would be valid only
for that subset.

But the current code would break because the condition of locking "the
resvs of all bos that can potentially be on the list" doesn't hold
anymore, and you'd get list corruption.

What *would* work, though, is the solution currently in xe, The
original evict list, and a staging evict list whose items are copied
over on validation. The staging evict list being protected by the
spinlock, the original evict list by the resv, and they'd use separate
list heads in the drm_gpuvm_bo, but that is yet another complication.

But I think if this becomes an issue, those VMs (perhaps OpenGL UMD
VMs) only wanting to validate a subset, would simply initially rely on
the current non-RESV solution. It looks like it's only a matter of
flipping the flag on a per-vm basis.

/Thomas


>
> >
> > So I think that needs fixing, and if keeping that protection I
> > think it
> > needs to be documented with the list member and ideally an assert.
> > But
> > also note that lockdep_assert_held will typically give false true
> > for
> > dma_resv locks; as long as the first dma_resv lock locked in a
> > drm_exec
> > sequence  remains locked, lockdep thinks *all* dma_resv locks are
> > held.
> > (or something along those lines), so the resv lockdep asserts are
> > currently pretty useless.
> >
> > /Thomas
> >
> >
> >
> > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Thomas
> > > > >
> > > > >
> > > >
> > >
> >
>