Re: [Nouveau] [PATCH 09/17] drm/radeon: use common fence implementation for fences

From: Christian KÃnig
Date: Tue Jul 22 2014 - 12:40:12 EST


Maybe I've mixed things up a bit in my description. There is
fence_signal which the implementor/exporter of a fence must call when
the fence is completed. If the exporter has an ->enable_signaling
callback it can delay that call to fence_signal for as long as it
wishes as long as enable_signaling isn't called yet. But that's just
the optimization to not required irqs to be turned on all the time.

The other function is fence_is_signaled, which is used by code that is
interested in the fence state, together with fence_wait if it wants to
block and not just wants to know the momentary fence state. All the
other functions (the stuff that adds callbacks and the various _locked
and other versions) are just for fancy special cases.
Well that's rather bad, cause IRQs aren't reliable enough on Radeon HW for such a thing. Especially on Prime systems and Macs.

That's why we have this fancy HZ/2 timeout on all fence wait operations to manually check if the fence is signaled or not.

To guarantee that a fence is signaled after enable_signaling is called we would need to fire up a kernel thread which periodically calls fence->signaled.

Christian.

Am 22.07.2014 18:21, schrieb Daniel Vetter:
On Tue, Jul 22, 2014 at 5:59 PM, Christian KÃnig
<deathsimple@xxxxxxxxxxx> wrote:
Am 22.07.2014 17:42, schrieb Daniel Vetter:

On Tue, Jul 22, 2014 at 5:35 PM, Christian KÃnig
<christian.koenig@xxxxxxx> wrote:
Drivers exporting fences need to provide a fence->signaled and a
fence->wait
function, everything else like fence->enable_signaling or calling
fence_signaled() from the driver is optional.

Drivers wanting to use exported fences don't call fence->signaled or
fence->wait in atomic or interrupt context, and not with holding any
global
locking primitives (like mmap_sem etc...). Holding locking primitives
local
to the driver is ok, as long as they don't conflict with anything
possible
used by their own fence implementation.
Well that's almost what we have right now with the exception that
drivers are allowed (actually must for correctness when updating
fences) the ww_mutexes for dma-bufs (or other buffer objects).

In this case sorry for so much noise. I really haven't looked in so much
detail into anything but Maarten's Radeon patches.

But how does that then work right now? My impression was that it's mandatory
for drivers to call fence_signaled()?
Maybe I've mixed things up a bit in my description. There is
fence_signal which the implementor/exporter of a fence must call when
the fence is completed. If the exporter has an ->enable_signaling
callback it can delay that call to fence_signal for as long as it
wishes as long as enable_signaling isn't called yet. But that's just
the optimization to not required irqs to be turned on all the time.

The other function is fence_is_signaled, which is used by code that is
interested in the fence state, together with fence_wait if it wants to
block and not just wants to know the momentary fence state. All the
other functions (the stuff that adds callbacks and the various _locked
and other versions) are just for fancy special cases.

Locking
correctness is enforced with some extremely nasty lockdep annotations
+ additional debugging infrastructure enabled with
CONFIG_DEBUG_WW_MUTEX_SLOWPATH. We really need to be able to hold
dma-buf ww_mutexes while updating fences or waiting for them. And
obviously for ->wait we need non-atomic context, not just
non-interrupt.

Sounds mostly reasonable, but for holding the dma-buf ww_mutex, wouldn't be
an RCU be more appropriate here? E.g. aren't we just interested that the
current assigned fence at some point is signaled?
Yeah, as an optimization you can get the set of currently attached
fences to a dma-buf with just rcu. But if you update the set of fences
attached to a dma-buf (e.g. radeon blits the newly rendered frame to a
dma-buf exported by i915 for scanout on i915) then you need a write
lock on that buffer. Which is what the ww_mutex is for, to make sure
that you don't deadlock with i915 doing concurrent ops on the same
underlying buffer.

Something like grab ww_mutexes, grab a reference to the current fence
object, release ww_mutex, wait for fence, release reference to the fence
object.
Yeah, if the only thing you want to do is wait for fences, then the
rcu-protected fence ref grabbing + lockless waiting is all you need.
But e.g. in an execbuf you also need to update fences and maybe deep
down in the reservation code you notice that you need to evict some
stuff and so need to wait on some other guy to finish, and it's too
complicated to drop and reacquire all the locks. Or you simply need to
do a blocking wait on other gpus (because there's no direct hw sync
mechanism) and again dropping locks would needlessly complicate the
code. So I think we should allow this just to avoid too hairy/brittle
(and almost definitely little tested code) in drivers.

Afaik this is also the same way ttm currently handles things wrt
buffer reservation and eviction.

Agreed that any shared locks are out of the way (especially stuff like
dev->struct_mutex or other non-strictly driver-private stuff, i915 is
really bad here still).

Yeah that's also an point I've wanted to note on Maartens patch. Radeon
grabs the read side of it's exclusive semaphore while waiting for fences
(because it assumes that the fence it waits for is a Radeon fence).

Assuming that we need to wait in both directions with Prime (e.g. Intel
driver needs to wait for Radeon to finish rendering and Radeon needs to wait
for Intel to finish displaying), this might become a perfect example of
locking inversion.
fence updates are atomic on a dma-buf, protected by ww_mutex. The neat
trick of ww_mutex is that they enforce a global ordering, so in your
scenario either i915 or radeon would be first and you can't deadlock.
There is no way to interleave anything even if you have lots of
buffers shared between i915/radeon. Wrt deadlocking it's exactly the
same guarantees as the magic ttm provides for just one driver with
concurrent command submission since it's the same idea.

So from the core fence framework I think we already have exactly this,
and we only need to adjust the radeon implementation a bit to make it
less risky and invasive to the radeon driver logic.

Agree. Well the biggest problem I see is that exclusive semaphore I need to
take when anything calls into the driver. For the fence code I need to move
that down into the fence->signaled handler, cause that now can be called
from outside the driver.

Maarten solved this by telling the driver in the lockup handler (where we
grab the write side of the exclusive lock) that all interrupts are already
enabled, so that fence->signaled hopefully wouldn't mess with the hardware
at all. While this probably works, it just leaves me with a feeling that we
are doing something wrong here.
I'm not versed on the details in readon, but on i915 we can attach a
memory location and cookie value to each fence and just do a memory
fetch to figure out whether the fence has passed or not. So no locking
needed at all. Of course the fence itself needs to lock a reference
onto that memory location, which is a neat piece of integration work
that we still need to tackle in some cases - there's conflicting patch
series all over this ;-)

But like I've said fence->signaled is optional so you don't need this
necessarily, as long as radeon eventually calls fence_signaled once
the fence has completed.
-Daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/