Re: [PATCH v3 1/1] drm: msm: Replace dma_map_sg with dma_sync_sg*

From: Daniel Vetter
Date: Fri Nov 30 2018 - 04:35:36 EST

On Thu, Nov 29, 2018 at 09:24:17AM -0800, Tomasz Figa wrote:
> [CC Marek]
> On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >
> > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@xxxxxx> wrote:
> > > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote:
> > > > Just spend a bit of time reading through the implementations already
> > > > merged. Is the struct device *dev parameter actually needed anywhere?
> > > > dma-api definitely needs it, because we need that to pick the right iommu.
> > > > But for cache management from what I've seen the target device doesn't
> > > > matter, all the target specific stuff will be handled by the iommu.
> > >
> > > It looks like only mips every uses the dev argument, and even there
> > > the function it passes dev to ignores it. So it could probably be removed.
> > >
> Whether the cache maintenance operation needs to actually do anything
> or not is a function of `dev`. We can have some devices that are
> coherent with CPU caches, and some that are not, on the same system.

So the thing is that the gpu driver knows this too. It fairly often can
even decide at runtime (through surface state bits or gpu pte bits)
whether to use coherent or non-coherent transactions. dma-api assuming
that a given device is always coherent or non-coherent is one of the
fundamental mismatches we have.

If you otoh need dev because there's some strange bridge caches you need
to flush (I've never seen that, but who knows), that would be a diffeernt
thing. All the bridge flushing I've seen is attached to the iommu though,
so would be really a surprise if the cache management needs that too.

> > > >
> > > > Dropping the dev parameter would make this a perfect fit for coherency
> > > > management of buffers used by multiple devices. Right now there's all
> > > > kinds of nasty tricks for that use cases needed to avoid redundant
> > > > flushes.
> > >
> > > Note that one thing I'd like to avoid is exposing these funtions directly
> > > to drivers, as that will get us into all kinds of abuses.
> >
> > What kind of abuse do you expect? It could very well be that gpu folks
> > call that "standard use case" ... At least on x86 with the i915 driver
> > we pretty much rely on architectural guarantees for how cache flushes
> > work very much. Down to userspace doing the cache flushing for
> > mappings the kernel has set up.
> i915 is a very specific case of a fully contained,
> architecture-specific hardware subsystem, where you can just hardcode
> all integration details inside the driver, because nobody else would
> care.

Nah, it's not fully contained, it's just with your arm eyewear on you
can't see how badly it leaks all over the place. The problem is that GPUs
(in many cases at least) want to decide whether and when to do cache
maintenance. We need a combo of iommu and cache maintenance apis that
allows this, across multiple devices, because the dma-api doesn't cut it.

Aside: The cache maintenance api _must_ do the flush when asked to, even
if it believes no cache maintenance is necessary. Even if the default mode
is coherent for a platform/dev combo, the gpu driver might want to do a
non-coherent transaction.

> In ARM world, you can have the same IP blocks licensed by multiple SoC
> vendors with different integration details and that often includes the
> option of coherency.

My experience is that for soc gpus, you need to retune up/download code
for every soc. Or at least every family of socs.

This is painful. I guess thus far the arm soc gpu drivers we have merged
aren't the ones that needed widely different tuning on different soc
families/ranges. What's worse, it's userspace who will decide whether to
use the coherent or non-coherent paths in many cases (that's at least how
it worked since decades on the big gpus, small gpus just lag a few years
usually). The kerenl only provides the tools for the userspace
opengl/vulkan driver to do all the things.

Trying to hide coherent vs. non-coherent like it's done for everyone else
is probably not going to work for gpus. In fact, hasn't ever worked for
gpus thus far, and unlikely to change I think.

> > > So I'd much prefer if we could have iommu APIs wrapping these that are
> > > specific to actual uses cases that we understand well.
> > >
> > > As for the buffer sharing: at least for the DMA API side I want to
> > > move the current buffer sharing users away from dma_alloc_coherent
> > > (and coherent dma_alloc_attrs users) and the remapping done in there
> > > required for non-coherent architectures. Instead I'd like to allocate
> > > plain old pages, and then just dma map them for each device separately,
> > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map
> > > or last user to unmap. On the iommu side it could probably work
> > > similar.
> >
> > I think this is what's often done. Except then there's also the issue
> > of how to get at the cma allocator if your device needs something
> > contiguous. There's a lot of that still going on in graphics/media.
> I suppose one could just expose CMA with the default pool directly.
> It's just an allocator, so not sure why it would need any
> device-specific information.
> There is also the use case of using CMA with device-specific pools of
> memory reusable by the system when not used by the device and those
> would have to somehow get the pool to allocate from, but I wonder if
> struct device is the right way to pass such information. I'd see the
> pool given explicitly like cma_alloc(struct cma_pool *, size, flags)
> and perhaps a wrapper cma_alloc_default(size, flags) that is just a
> simple macro calling cma_alloc(&cma_pool_default, size, flags).
> Best regards,
> Tomasz

Daniel Vetter
Software Engineer, Intel Corporation