Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

From: Jason Gunthorpe
Date: Wed Jan 15 2025 - 12:10:02 EST


On Wed, Jan 15, 2025 at 05:34:23PM +0100, Christian König wrote:
> Granted, let me try to improve this.
> Here is a real world example of one of the issues we ran into and why
> CPU mappings of importers are redirected to the exporter.
> We have a good bunch of different exporters who track the CPU mappings
> of their backing store using address_space objects in one way or
> another and then uses unmap_mapping_range() to invalidate those CPU
> mappings.
> But when importers get the PFNs of the backing store they can look
> behind the curtain and directly insert this PFN into the CPU page
> tables.
> We had literally tons of cases like this where drivers developers cause
> access after free issues because the importer created a CPU mappings on
> their own without the exporter knowing about it.
> This is just one example of what we ran into. Additional to that
> basically the whole synchronization between drivers was overhauled as
> well because we found that we can't trust importers to always do the
> right thing.

But this, fundamentally, is importers creating attachments and then
*ignoring the lifetime rules of DMABUF*. If you created an attachment,
got a move and *ignored the move* because you put the PFN in your own
VMA, then you are not following the attachment lifetime rules!

To implement this safely the driver would need to use
unmap_mapping_range() on the driver VMA inside the move callback, and
hook into the VMA fault callback to re-attach the dmabuf.
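
To make that rule concrete, here is a toy userspace model of it. This
is not kernel code: every toy_* name is made up for illustration, the
"PTE" is a bool, and the comments only point at the kernel step each
function stands in for. It shows nothing more than the ordering:
the move callback zaps the importer's mapping before the backing store
moves, and the next CPU access faults and picks up the new PFN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these types or functions exist in the kernel. */
struct toy_exporter {
	unsigned long pfn;		/* current location of the backing store */
};

struct toy_importer {
	struct toy_exporter *exp;
	bool attached;			/* importer holds a valid attachment */
	bool cpu_mapped;		/* importer's own VMA has a PTE installed */
	unsigned long mapped_pfn;	/* PFN that PTE points at */
};

/* Stands in for the VMA fault callback: re-attach, then install the PFN. */
static void toy_importer_fault(struct toy_importer *imp)
{
	assert(imp->exp != NULL);
	imp->attached = true;
	imp->mapped_pfn = imp->exp->pfn;
	imp->cpu_mapped = true;
}

/* Stands in for the move callback: the unmap_mapping_range() step. */
static void toy_importer_move_notify(struct toy_importer *imp)
{
	imp->cpu_mapped = false;
	imp->attached = false;
}

/* Exporter moves the backing store; importers are notified first. */
static void toy_exporter_move(struct toy_exporter *exp,
			      struct toy_importer *imp,
			      unsigned long new_pfn)
{
	toy_importer_move_notify(imp);
	exp->pfn = new_pfn;
}

/* CPU access through the importer's VMA; faults when no PTE is present. */
static unsigned long toy_cpu_access(struct toy_importer *imp)
{
	if (!imp->cpu_mapped)
		toy_importer_fault(imp);
	return imp->mapped_pfn;
}
```

An importer that skips toy_importer_move_notify() keeps returning the
old PFN after the move, which is exactly the use-after-free pattern
described in the quoted text.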

This is where I get into trouble with your argument. It is not that
the API has an issue, or that the rules of the API are not logical and
functional.

You are arguing that even a logical and functional API will be
mis-used by some people and that reviewers will not catch it.

Honestly, I don't think that is consistent with the kernel philosophy.

We should do our best to make APIs that are hard to mis-use, but if we
can't achieve that it doesn't mean we stop and give up on problems.
We go into the world of APIs that can be mis-used, and we are supposed
to rely on the reviewer system to catch it.

> You can already turn both a TEE allocated buffer as well as a memfd
> into a DMA-buf. So basically TEE and memfd already provides different
> interfaces which go beyond what DMA-buf does and allows.

> In other words if you want to do things like direct I/O to block or
> network devices you can mmap() your memfd and do this while at the same
> time send your memfd as DMA-buf to your GPU, V4L or neural accelerator.
> Would this be a way you could work with as well?

I guess, but this still requires creating a dmabuf2 type thing with
very similar semantics and then shimming dmabuf2 to 1 for DRM consumers.

I don't see how it addresses your fundamental concern that the
semantics we want are an API that is too easy for drivers to abuse.

And being more functional and efficient we'd just see people wanting
to use dmabuf2 directly instead of bothering with 1.

> separate file descriptor representing the private MMIO which iommufd
> and KVM uses but you can turn it into a DMA-buf whenever you need to
> give it to a DMA-buf importer?

Well, it would end up just being used everywhere. I think one person
wanted to use this with DRM drivers for some reason, but RDMA would
use the dmabuf2 directly because it will be much more efficient than
using scatterlist.

Honestly, I'd much rather extend dmabuf and see DRM institute some
rule that DRM drivers may not use XYZ parts of the improvement. Like
maybe we could use some symbol namespaces to really enforce it,
e.g. MODULE_IMPORT_NS(DMABUF_NOT_FOR_DRM_USAGE)

Some of the improvements we want like the revoke rules for lifetime
seem to be agreeable.

Block the API that gives you the non-scatterlist attachment. Only
VFIO/RDMA/KVM/iommufd will get to implement it.

> In this case Xu is exporting MMIO from VFIO and importing to KVM and
> iommufd.
>
> So basically a portion of a PCIe BAR is imported into iommufd?

And KVM. We need to get the CPU address into KVM and IOMMU page
tables. It must go through a private FD path and not a VMA because of
the CC rules about machine check I mentioned earlier. The private FD
must have a lifetime model to ensure we don't UAF the PCIe BAR memory.

Someone else had some use case where they wanted to put the VFIO MMIO
PCIe BAR into a DMABUF and ship it into a GPU driver for
somethingsomething virtualization but I didn't understand it.

> Let's just say that both the ARM guys as well as the GPU people already
> have some pretty "interesting" ways of doing digital rights management
> and content protection.

Well, that is TEE stuff. TEE and CC are not the same thing, though
they have some high level conceptual overlap.

In a certain sense CC is a TEE that is built using KVM instead of the
TEE subsystem. Using KVM and integrating with the MM brings a whole
set of unique challenges that TEE got to avoid.

Jason