Re: [PATCH drm-misc-next v3 6/7] drm/gpuvm: generalize dma_resv/extobj handling and GEM validation

From: Thomas Hellström
Date: Wed Sep 20 2023 - 04:30:02 EST



On 9/20/23 09:44, Thomas Hellström wrote:
Hi,

On 9/20/23 07:37, Christian König wrote:
On 9/19/23 17:23, Thomas Hellström wrote:

On 9/19/23 17:16, Danilo Krummrich wrote:
On 9/19/23 14:21, Thomas Hellström wrote:
Hi Christian

On 9/19/23 14:07, Christian König wrote:
On 9/13/23 17:46, Danilo Krummrich wrote:
On 9/13/23 17:33, Christian König wrote:
On 9/13/23 17:15, Danilo Krummrich wrote:
On 9/13/23 16:26, Christian König wrote:
On 9/13/23 14:16, Danilo Krummrich wrote:
As mentioned in a different mail thread, the reply is based on the assumption
that we don't support anything other than GPUVM updates from the IOCTL.

I think that this assumption is incorrect.

Well, more precisely I should have said "we don't support GPUVM updates from within
fence signaling critical sections". And looking at the code, that doesn't seem to be
what you're doing there.


Vulkan is just one specific use case, but this here should probably be able to handle other use cases as well.

Especially with HMM you get the requirement that you need to be able to invalidate GPUVM mappings without grabbing a reservation lock.

What do you mean by "invalidate GPUVM mappings" in this context? drm_gpuvm_bo_evict()
should only be called from a ttm_device_funcs::move callback, where we should already
hold the dma-resv lock.

Well, the question is: which dma-resv lock do we hold?

In the move callback we only hold the dma-resv lock of the BO being moved, but when that is a shared BO, that's not the same lock as the VM's.

Correct, Thomas' idea was to use the GEM's dma-resv lock to protect drm_gpuvm_bo::evicted
and then actually move the drm_gpuvm_bo to the VM's evicted list once we've grabbed all
dma-resv locks when locking the VM's BOs using drm_exec. We can remove them from the evicted
list on validate(). This way we never touch the evicted list without holding at least the VM's
dma-resv lock.
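
In code, roughly (untested sketch; the "evicted" bool and the drm_gpuvm_resv()
accessor are assumptions on top of v3, and drm_gpuvm_bo_list_add() is the
internal helper from drm_gpuvm.c):

/* Eviction path: only the GEM object's dma-resv is guaranteed held. */
static void sketch_mark_evicted(struct drm_gpuvm_bo *vm_bo)
{
	dma_resv_assert_held(vm_bo->obj->resv);
	vm_bo->evicted = true;
}

/* Exec path: drm_exec has locked all of the VM's BOs, including the
 * VM's own dma-resv, so the evicted list may be touched safely.
 */
static void sketch_collect_evicted(struct drm_gpuvm *gpuvm,
				   struct drm_gpuvm_bo *vm_bo)
{
	dma_resv_assert_held(drm_gpuvm_resv(gpuvm));

	if (vm_bo->evicted) {
		drm_gpuvm_bo_list_add(vm_bo, evict);
		vm_bo->evicted = false;
	}
}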

Do you have any concerns about that?

Scratching my head a bit how that is supposed to work.

This implies that you go over all the evicted BOs during validation and not just the ones mentioned in the CS.

That might work for Vulkan, but is pretty much a no-go for OpenGL.

See what the eviction lock in amdgpu is doing for example.

The eviction_lock seems to protect a VM state "evicting", i.e. whether any BO that
is associated with the VM is currently being evicted. At the same time amdgpu protects
the evicted list of the VM with a different lock. So this seems to be entirely
unrelated. Tracking a "currently evicting" state is not part of the GPUVM
implementation currently, and hence nothing would change for amdgpu there.

Sorry for the confusion; we use different terminology in amdgpu.

The eviction lock and evicted state are for the VM page tables, e.g. if the whole VM is currently not used and swapped out or even de-allocated.

This is necessary because we have cases where we need to access the VM data without holding the dma-resv lock of this VM. Especially figuring out which parts of an address space contain mappings and which don't.

I think this is fine; this has nothing to do with lists of evicted GEM objects or external GEM
objects, right? Marking mappings (drm_gpuva) as invalidated (DRM_GPUVA_INVALIDATED) or accessing
the VA space does not require any dma-resv locks.

I hope so, but I'm not 100% sure.

This is a requirement which comes with HMM handling; you won't see this with Vulkan (or OpenGL, VAAPI etc.).


The invalidation lock on the other hand is what in this discussion is called the eviction lock. This one is needed because of what I wrote above: during the move callback, only the dma-resv of the BO which is moved is locked, but not necessarily the dma-resv of the VM.

That's yet another thing, right? This is used to track whether *any* BO that belongs to the VM is
currently being evicted, correct? As mentioned, as of now this is not supported in GPUVM and hence
it would remain the same driver-specific code with the same driver-specific lock.

That is most likely a show stopper for using this with OpenGL-based workloads as far as I can see. For those you need to be able to figure out which non-VM BOs have been evicted and which parts of the VM need updates.

We identify those with a bool in the gpuvm_bo, and that bool is protected by the bo_resv. In essence, the "evicted" list must be made up-to-date with all relevant locks held before traversing it in the next exec.

What I'm still missing with this idea is how we find all the drm_gpuvm_bo structures with the evicted bool set to true. When doing the drm_exec dance we come across all external ones and can add them to the list if needed, but what about the BOs having the VM's dma-resv?

Oh, they can be added to the evicted list directly (no bool needed) in the eviction code, like in v3, since for those we already hold the VM's dma-resv, it being aliased with the object's dma-resv.

Yeah, I wanted to note what Danilo seems to be thinking about as well: how do we figure out which non-VM BOs were evicted?

We can't walk over the list of all non-VM BOs on every submission; that's too much overhead for cases with lots of non-VM BOs.

And we can't rely on userspace sending a list of all used non-VM BOs down to the kernel with each submission.

Regards,
Christian.

No, that's not needed; mechanism below:

1) We maintain an evicted list. Typically protected by the vm resv.
2) Each gpuvm_bo has a bool "evicted". Protected by the bo resv.

a) Evicting a vm bo: The vm resv is held by the eviction code. Just put it on the evicted list.
b) Evicting a shared/external bo: The bo resv is held by the eviction code. Set the "evicted" bool.
c) Validating the evicted list on exec: Loop through all *external/shared* bos. Lock them. After locking, check the "evicted" bool; if it's true, put the bo on the evicted list (we hold the VM resv at this point) and clear the "evicted" bool. Note that other VMs will have their own gpuvm_bo, which is marked evicted separately. (Rough sketch below.)
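
An untested sketch of c); list and field names approximate the series,
driver_validate_bo() is hypothetical, and the extobj-list spinlock /
refcount dance is omitted for brevity:

static int sketch_validate_after_lock(struct drm_gpuvm *gpuvm)
{
	struct drm_gpuvm_bo *vm_bo, *next;
	int ret;

	/* drm_exec holds all bo resvs plus the VM resv at this point. */

	/* c) Promote the "evicted" bool of external bos to the list. */
	list_for_each_entry(vm_bo, &gpuvm->extobj.list, list.entry.extobj) {
		if (vm_bo->evicted) {
			list_move_tail(&vm_bo->list.entry.evict,
				       &gpuvm->evict.list);
			vm_bo->evicted = false;
		}
	}

	/* Validate and drain the evicted list. */
	list_for_each_entry_safe(vm_bo, next, &gpuvm->evict.list,
				 list.entry.evict) {
		ret = driver_validate_bo(vm_bo->obj);
		if (ret)
			return ret;
		list_del_init(&vm_bo->list.entry.evict);
	}

	return 0;
}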

I have this coded up in a patch for Xe and it seems to be working properly.

/Thomas

Something along the lines of the attached patch.

From 12778a3f1b2ca055ff658864c538f944550c9adf Mon Sep 17 00:00:00 2001
From: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
Date: Thu, 14 Sep 2023 10:23:52 +0200
Subject: [PATCH] drm/gpuvm: Adjustment for extobj eviction.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
---
 drivers/gpu/drm/drm_gpuvm.c | 32 ++++++++++++++++++++++++--------
 include/drm/drm_gpuvm.h     |  7 ++++++-
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index 11a0aee1c038..029c38d7fa4d 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -956,6 +956,11 @@ drm_gpuvm_prepare_objects(struct drm_gpuvm *gpuvm,
 		ret = drm_exec_prepare_obj(exec, vm_bo->obj, num_fences);
 		if (ret)
 			break;
+
+		if (vm_bo->evicted) {
+			drm_gpuvm_bo_list_add(vm_bo, evict);
+			vm_bo->evicted = false;
+		}
 	}
 	/* Drop ref in case we break out of the loop. */
 	drm_gpuvm_bo_put(vm_bo);
@@ -1431,6 +1436,21 @@ drm_gpuvm_bo_extobj_add(struct drm_gpuvm_bo *vm_bo)
 }
 EXPORT_SYMBOL_GPL(drm_gpuvm_bo_extobj_add);
 
+void
+drm_gpuvm_bo_evict(struct drm_gpuvm_bo *vm_bo, bool evict)
+{
+	if (drm_gpuvm_is_extobj(vm_bo->vm, vm_bo->obj)) {
+		vm_bo->evicted = evict;
+		return;
+	}
+
+	if (evict)
+		drm_gpuvm_bo_list_add(vm_bo, evict);
+	else
+		drm_gpuvm_bo_list_del(vm_bo, evict);
+}
+EXPORT_SYMBOL(drm_gpuvm_bo_evict);
+
 /**
  * drm_gpuvm_bo_evict() - add / remove a &drm_gem_object to / from a
  * &drm_gpuvms evicted list
@@ -1441,18 +1461,14 @@ EXPORT_SYMBOL_GPL(drm_gpuvm_bo_extobj_add);
  * list containing a mapping of this &drm_gem_object.
  */
 void
-drm_gpuvm_bo_evict(struct drm_gem_object *obj, bool evict)
+drm_gpuvm_gem_evict(struct drm_gem_object *obj, bool evict)
 {
 	struct drm_gpuvm_bo *vm_bo;
 
-	drm_gem_for_each_gpuvm_bo(vm_bo, obj) {
-		if (evict)
-			drm_gpuvm_bo_list_add(vm_bo, evict);
-		else
-			drm_gpuvm_bo_list_del(vm_bo, evict);
-	}
+	drm_gem_for_each_gpuvm_bo(vm_bo, obj)
+		drm_gpuvm_bo_evict(vm_bo, evict);
 }
-EXPORT_SYMBOL_GPL(drm_gpuvm_bo_evict);
+EXPORT_SYMBOL_GPL(drm_gpuvm_gem_evict);
 
 static int
 __drm_gpuva_insert(struct drm_gpuvm *gpuvm,
diff --git a/include/drm/drm_gpuvm.h b/include/drm/drm_gpuvm.h
index dce26a923d5d..c2216f18243f 100644
--- a/include/drm/drm_gpuvm.h
+++ b/include/drm/drm_gpuvm.h
@@ -550,6 +550,9 @@ struct drm_gpuvm_bo {
 	 */
 	struct kref kref;
 
+	/** @evicted: Whether the bo needs revalidation and rebinding. */
+	bool evicted;
+
 	/**
 	 * @list: Structure containing all &list_heads.
 	 */
@@ -615,7 +618,9 @@ struct drm_gpuvm_bo *
 drm_gpuvm_bo_find(struct drm_gpuvm *gpuvm,
 		  struct drm_gem_object *obj);
 
-void drm_gpuvm_bo_evict(struct drm_gem_object *obj, bool evict);
+void drm_gpuvm_bo_evict(struct drm_gpuvm_bo *vm_bo, bool evict);
+
+void drm_gpuvm_gem_evict(struct drm_gem_object *obj, bool evict);
 void drm_gpuvm_bo_extobj_add(struct drm_gpuvm_bo *vm_bo);
 
 /**
--
2.41.0