On 14/07/2023 17.43, Christian König wrote:
Am 14.07.23 um 10:21 schrieb Asahi Lina:
A signaled scheduler fence can outlive its scheduler, since fences are
independencly reference counted. Therefore, we can't reference the
scheduler in the get_timeline_name() implementation.
Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
dma-bufs reference fences from GPU schedulers that no longer exist.
Signed-off-by: Asahi Lina <lina@xxxxxxxxxxxxx>
---
drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
drivers/gpu/drm/scheduler/sched_fence.c | 4 +++-
include/drm/gpu_scheduler.h | 5 +++++
3 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index b2bbc8a68b30..17f35b0b005a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
/*
* Fence is from the same scheduler, only need to wait for
- * it to be scheduled
+ * it to be scheduled.
+ *
+ * Note: s_fence->sched could have been freed and reallocated
+ * as another scheduler. This false positive case is okay, as if
+ * the old scheduler was freed all of its jobs must have
+ * signaled their completion fences.
This is outright nonsense. As long as an entity for a scheduler exists
it is not allowed to free up this scheduler.
So this function can't be called like this.
As I already explained, the fences can outlive their scheduler. That means *this* entity certainly exists for *this* scheduler, but the *dependency* fence might have come from a past scheduler that was already destroyed along with all of its entities, and its address reused.
Christian, I'm really getting tired of your tone. I don't appreciate being told my comments are "outright nonsense" when you don't even take the time to understand what the issue is and what I'm trying to do/document. If you aren't interested in working with me, I'm just going to give up on drm_sched, wait until Rust gets workqueue support, and reimplement it in Rust. You can keep your broken fence lifetime semantics and I'll do my own thing.
~~ Lina