Re: DMA-buf and uncached system memory

From: Daniel Vetter
Date: Tue Feb 16 2021 - 04:26:41 EST


On Mon, Feb 15, 2021 at 09:58:08AM +0100, Christian König wrote:
> Hi guys,
>
> we are currently working an Freesync and direct scan out from system memory
> on AMD APUs in A+A laptops.
>
> On problem we stumbled over is that our display hardware needs to scan out
> from uncached system memory and we currently don't have a way to communicate
> that through DMA-buf.
>
> For our specific use case at hand we are going to implement something driver
> specific, but the question is should we have something more generic for
> this?
>
> After all the system memory access pattern is a PCIe extension and as such
> something generic.

Yes it's a problem, and it's a complete mess. So the defacto rules are:

1. importer has to do coherent transactions to snoop cpu caches.

This way both modes work:

- if the buffer is cached, we're fine

- if the buffer is not cached, but the exporter has flushed all the
caches, you're mostly just wasting time on inefficient bus cycles. Also
this doesn't work if your CPU doesn't just drop clean cachelines. Like I
thing some ARM are prone to do, not idea about AMD, Intel is guaranteed
to drop them which is why the uncached scanout for integrated Intel +
amd discrete "works".

2. exporters picks the mode freely, and can even change it at runtime
(i915 does this, since we don't have an "allocate for scanout" flag wired
through consistently). This doesn't work on arm, there the rule is "all
devices in the same system must use the same mode".

3. This should be solved at the dma-buf layer, but the dma-api refuses to
tell you this information (at least for dma_alloc_coherent). And I'm not
going to deal with the bikeshed that would bring into my inbox. Or at
least there's always been screaming that drivers shouldn't peek behind the
abstraction.

So I think if AMD also guarantees to drop clean cachelines just do the
same thing we do right now for intel integrated + discrete amd, but in
reserve. It's fragile, but it does work.

What we imo shouldn't do is driver private interfaces here, that's just
going to make the problem worse long term. Or at least driver-private
interfaces that spawn across drivers behind dma-buf, because imo this is
really a problem that dma-buf should solve.

If you do want to solve this at the dma-buf level I can try and point you
at the respective i915 and amdgpu code that makes the magic work - I've
had to fix it a few times in the past. I'm not sure whether we'd need to
pass the dynamic nature through though, i.e. whether we want to be able to
scan out imported dma-buf and hence request they be used in uncached mode.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch