Re: [PATCH drm-misc-next v5 4/6] drm/gpuvm: track/lock/validate external/evicted objects

From: Thomas Hellström
Date: Tue Oct 03 2023 - 14:57:39 EST



On 10/3/23 18:55, Danilo Krummrich wrote:
It seems like we're mostly aligned on this series, except for the key
controversy we're discussing for a few versions now: locking of the internal
lists. Hence, let's just re-iterate the options we have to get this out of the
way.

(1) The spinlock dance. This basically works for every use case, updating the VA
space from the IOCTL, from the fence signaling path or anywhere else.
However, it has the downside of requiring spin_lock() / spin_unlock() for
*each* list element when locking all external objects and validating all
evicted objects. Typically, the amount of extobjs and evicted objects
shouldn't be excessive, but there might be exceptions, e.g. Xe.

(2) The dma-resv lock dance. This is convinient for drivers updating the VA
space from a VM_BIND ioctl() and is especially efficient if such drivers
have a huge amount of external and/or evicted objects to manage. However,
the downsides are that it requires a few tricks in drivers updating the VA
space from the fence signaling path (e.g. job_run()). Design wise, I'm still
skeptical that it is a good idea to protect internal data structures with
external locks in a way that it's not clear to callers that a certain
function would access one of those resources and hence needs protection.
E.g. it is counter intuitive that drm_gpuvm_bo_put() would require both the
dma-resv lock of the corresponding object and the VM's dma-resv lock held.
(Additionally, there were some concerns from amdgpu regarding flexibility in
terms of using GPUVM for non-VM_BIND uAPIs and compute, however, AFAICS
those discussions did not complete and to me it's still unclear why it
wouldn't work.)

(3) Simply use an internal mutex per list. This adds a tiny (IMHO negligible)
overhead for drivers updating the VA space from a VM_BIND ioctl(), namely
a *single* mutex_lock()/mutex_unlock() when locking all external objects
and validating all evicted objects. And it still requires some tricks for
drivers updating the VA space from the fence signaling path. However, it's
as simple as it can be and hence way less error prone as well as
self-contained and hence easy to use. Additionally, it's flexible in a way
that we don't have any expections on drivers to already hold certain locks
that the driver in some situation might not be able to acquire in the first
place.

(4) Arbitrary combinations of the above. For instance, the current V5 implements
both (1) and (2) (as either one or the other). But also (1) and (3) (as in
(1) additionally to (3)) would be an option, where a driver could opt-in for
the spinlock dance in case it updates the VA space from the fence signaling
path.

I also considered a few other options as well, however, they don't seem to be
flexible enough. For instance, as by now we could use SRCU for the external
object list. However, this falls apart once a driver wants to remove and re-add
extobjs for the same VM_BO instance. (For the same reason it wouldn't work for
evicted objects.)

Personally, after seeing the weird implications of (1), (2) and a combination of
both, I tend to go with (3). Optionally, with an opt-in for (1). The reason for
the latter is that with (3) the weirdness of (1) by its own mostly disappears.

Please let me know what you think, and, of course, other ideas than the
mentioned ones above are still welcome.

- Danilo

Here are the locking principles Daniel put together and Dave once called out for us to be applying when reviewing DRM code. These were prompted by very fragile and hard to understand locking patterns in the i915 driver and I think the xe vm_bind locking design was made with these in mind, (not sure exactly who wrote what, though so can't say for sure).

https://blog.ffwll.ch/2022/07/locking-engineering.html
https://blog.ffwll.ch/2022/08/locking-hierarchy.html

At least to me, this motivates using the resv design unless we strictly need lower level locks that are taken in the eviction paths or userptr invalidation paths, but doesn't rule out spinlocks or lock dropping tricks where these are really necessary. But pretty much rules out RCU / SRCU from what I can tell.

It also calls for documenting how individual members of structs are protected when ever possible.

Thanks,
Thomas