Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name

From: Christian König
Date: Thu Nov 02 2023 - 06:48:41 EST


Am 01.11.23 um 09:13 schrieb Daniel Vetter:
On Wed, 1 Nov 2023 at 07:59, Dave Airlie <airlied@xxxxxxxxx> wrote:
Well, to make it clear once more: Signaling a dma_fence from the
destructor of a reference counted object is very problematic! This will
be rejected no matter if you do that in C or in Rust.

What we can do is to make it safe in the sense that you don't access
freed up memory by using the scheduler fences even more as wrapper
around the hardware fence as we do now. But this quite a change and
requires a bit more than just hacking around
drm_sched_fence_get_timeline_name().
I really think this needs to be documented if nothing else out of this thread.

Clearly nobody is going to get it right and hidden here in this
thread, this info isn't useful.

Can we have some sort of design document for the dma-fence/scheduler
interactions written and we can try and refine it with solutions on
the list, because I'm tired of people proposing things and NAK's
getting thrown around without anything to point people at.

The next NAK I see on the list will mean I block all patches from the
sender until they write a documentation patch, because seriously this
stuff is too hard for someone to just keep it in their head and expect
everyone else to understand from reading the code.
I very much like the idea that NAK replies are counted as "you've just
volunteered yourself for some documentation patches so that next time
around you can reply with a link to the docs instead of just a NAK".

Yeah, that sounds like a great idea to me as well :)

Especially when I can use it to convince managers that we need to have more work force on writing documentation.

I don't think we'll get out of these discussions otherwise, since
currently we have undocumented, but very tricky semantics of the
drm/sched codebase for ringbuffer scheduling which is extended to fw
scheduling in also very tricky ways, with not entirely clear impacts
on semantics of all the drm/sched things. And as a result we just pile
up enormous amounts of threads where I think the only thing assured is
that people talk past each another.

The scheduler is certainly the ugliest part, but it's unfortunately still only the tip of the iceberg.

I have seen at least halve a dozen approach in the last two years where people tried to signal a dma_fence from userspace or similar.

Fortunately it was mostly prototyping and I could jump in early enough to stop that, but basically this is a fight against windmills.

I was considering to change the dma_fence semantics so that dma_fence_signal() could only be called from the interrupt contexts of devices and then put a big fat WARN_ON(!in_interrupt()) in there.

It's a sledgehammer, but as far as I can see the only thing which might help. Opinions?

Thanks,
Christian.


Converting NAKs into doc patches should at least eventually get rid of
the worst confusions we're dealing with here.

Cheers, Sima