Re: [RFC PATCH 2/4] rust: sync: Add dma_fence abstractions
From: Boris Brezillon
Date: Wed Feb 11 2026 - 06:12:44 EST
On Wed, 11 Feb 2026 12:00:30 +0100
"Danilo Krummrich" <dakr@xxxxxxxxxx> wrote:
> On Wed Feb 11, 2026 at 11:20 AM CET, Boris Brezillon wrote:
> > On Wed, 11 Feb 2026 10:57:27 +0100
> > "Danilo Krummrich" <dakr@xxxxxxxxxx> wrote:
> >
> >> (Cc: Xe maintainers)
> >>
> >> On Tue Feb 10, 2026 at 12:40 PM CET, Alice Ryhl wrote:
> >> > On Tue, Feb 10, 2026 at 11:46:44AM +0100, Christian König wrote:
> >> >> On 2/10/26 11:36, Danilo Krummrich wrote:
> >> >> > On Tue Feb 10, 2026 at 11:15 AM CET, Alice Ryhl wrote:
> >> >> >> One way you can see this is by looking at what we require of the
> >> >> >> workqueue. For all this to work, it's pretty important that we never
> >> >> >> schedule anything on the workqueue that's not signalling safe, since
> >> >> otherwise you could have a deadlock where the workqueue executes some
> >> >> >> random job calling kmalloc(GFP_KERNEL) and then blocks on our fence,
> >> >> >> meaning that the VM_BIND job never gets scheduled since the workqueue
> >> >> >> is never freed up. Deadlock.
> >> >> >
> >> >> > Yes, I also pointed this out multiple times in the past in the context of C GPU
> >> >> > scheduler discussions. It really depends on the workqueue and how it is used.
> >> >> >
> >> >> > In the C GPU scheduler the driver can pass its own workqueue to the scheduler,
> >> >> > which means that the driver has to ensure that at least one out of the
> >> >> > wq->max_active works is free for the scheduler to make progress on the
> >> >> > scheduler's run and free job work.
> >> >> >
> >> >> > Or in other words, there must be no more than wq->max_active - 1 works that
> >> >> > execute code violating the DMA fence signalling rules.
> >> >
> >> > Ouch, is that really the best way to do that? Why not two workqueues?
> >>
> >> Most drivers making use of this re-use the same workqueue for multiple GPU
> >> scheduler instances in firmware scheduling mode (i.e. 1:1 relationship between
> >> scheduler and entity). This is equivalent to the JobQ use-case.
> >>
> >> Note that we will have one JobQ instance per userspace queue, so sharing the
> >> workqueue between JobQ instances can make sense.
> >
> > Definitely, but I think that's orthogonal to allowing this common
> > workqueue to be used for work items that don't comply with the
> > dma-fence signalling rules, isn't it?
>
> Yes and no. If we allow passing around shared WQs without a corresponding type
> abstraction we open the door for drivers to abuse it to schedule their own
> work.
>
> I.e. sharing a workqueue between JobQs is fine, but we have to ensure they can't
> be used for anything else.
Totally agree with that, and that's where I was going with this special
DmaFenceWorkqueue wrapper/abstraction, which would only accept
MaySignalDmaFencesWorkItem objects for scheduling.
>
> >> Besides that, IIRC Xe was re-using the workqueue for something else, but that
> >> doesn't seem to be the case anymore. I can only find [1], which more seems like
> >> some custom GPU scheduler extension [2] to me...
> >
> > Yep, I think that can be the problematic case. It doesn't mean we can't
> > schedule work items that don't signal fences, but I think it'd be
> > simpler if we forced those to follow the same rules (no blocking
> > allocs, no locks taken that are also taken in other paths where blocking
> > allocs happen, etc) regardless of the wq->max_active value.
> >
> >>
> >> [1] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler.c#L40
> >> [2] https://elixir.bootlin.com/linux/v6.18.6/source/drivers/gpu/drm/xe/xe_gpu_scheduler_types.h#L28
>