Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI
From: Xu Yilun
Date: Mon Jan 13 2025 - 05:12:12 EST
On Fri, Jan 10, 2025 at 04:38:38PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 08:34:55PM +0100, Simona Vetter wrote:
>
> > So if I'm getting this right, what you need from a functional pov is a
> > dma_buf_tdx_mmap? Because due to tdx restrictions, the normal dma_buf_mmap
I'm not sure 'mmap' is the right word here.
What we need is a mapping from (FD + offset) to the backend memory,
provided directly by the memory provider rather than via a VMA and
the CPU page table. The VMA and CPU page table exist so the host can
access the memory, but the VMM/host never touches most of the guest
memory, so why build them at all.
> > is not going to work I guess?
>
> Don't want something TDX specific!
>
> There is a general desire, and CC is one, but there are other
> motivations like performance, to stop using VMAs and mmaps as a way to
> exchange memory between two entities. Instead we want to use FDs.
Exactly.
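To illustrate what I mean by FD-based sharing (a sketch only; the
get_pfn signature below is illustrative, not necessarily what this
series ends up with):

	/*
	 * Resolve the FD to its dma-buf and ask the exporter for the
	 * pfn backing a given offset -- no mmap, VMA or CPU page table
	 * involved.
	 */
	struct dma_buf *dmabuf = dma_buf_get(fd);
	unsigned long pfn;
	int ret;

	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	ret = dma_buf_get_pfn_unlocked(dmabuf, offset, &pfn);
	if (!ret) {
		/* hand the pfn to the secondary MMU (EPT/IOMMU), not the CPU */
	}

	dma_buf_put(dmabuf);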
>
> We now have memfd and guestmemfd that are usable with
> memfd_pin_folios() - this covers pinnable CPU memory.
>
> And for a long time we had DMABUF which is for all the other wild
> stuff, and it supports movable memory too.
>
> So, the normal DMABUF semantics with reservation locking and move
> notifiers seem workable to me here. They are broadly similar enough to
> the mmu notifier locking that they can serve the same job of updating
> page tables.
Yes. With this new sharing model, the lifecycle of the shared
memory/pfn/page is controlled directly by the dma-buf exporter, not by
the CPU mapping. So I also think the reservation lock & move_notify
work well for lifecycle control, and there is no conflict with (indeed
nothing to do for) follow_pfn() & mmu_notifier.
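To make that parallel concrete, the importer side would look roughly
like this with the existing dynamic-attach interface (a sketch only;
the my_*() helpers are placeholders and error handling is omitted):

	static void my_move_notify(struct dma_buf_attachment *attach)
	{
		/*
		 * Called with the dma-buf's reservation lock held; tear
		 * down our page-table entries, much like mmu_notifier
		 * invalidation.
		 */
		my_unmap_pages(attach->importer_priv);
	}

	static const struct dma_buf_attach_ops my_attach_ops = {
		.allow_peer2peer = true,
		.move_notify = my_move_notify,
	};

	attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, priv);

	/* (re)map under the reservation lock, e.g. after a move */
	dma_resv_lock(dmabuf->resv, NULL);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	/* ... rebuild page tables from sgt ... */
	dma_resv_unlock(dmabuf->resv);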
>
> > Also another thing that's a bit tricky is that kvm kinda has a 3rd dma-buf
> > memory model:
> > - permanently pinned dma-buf, they never move
> > - dynamic dma-buf, they move through ->move_notify and importers can remap
> > - revocable dma-buf, which thus far only exist for pci mmio resources
>
> I would like to see the importers be able to discover which one is
> going to be used, because we have RDMA cases where we can support 1
> and 3 but not 2.
>
> revocable doesn't require page faulting as it is a terminal condition.
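Just to make the discovery idea concrete, a purely hypothetical sketch
(no such flag exists today): the exporter could declare its semantics
up front, e.g.

	/* hypothetical */
	enum dma_buf_move_semantics {
		DMA_BUF_MOVE_PINNED,	/* never moves */
		DMA_BUF_MOVE_DYNAMIC,	/* move_notify, then re-map works */
		DMA_BUF_MOVE_REVOCABLE,	/* move_notify is terminal */
	};

so that e.g. an RDMA importer could refuse DMA_BUF_MOVE_DYNAMIC at
attach time instead of finding out later.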
>
> > Since we're leaning even more on that 3rd model I'm wondering whether we
> > should make it something official. Because the existing dynamic importers
> > do very much assume that re-acquiring the memory after move_notify will
> > work. But for the revocable use-case the entire point is that it will
> > never work.
>
> > I feel like that's a concept we need to make explicit, so that dynamic
> > importers can reject such memory if necessary.
>
> It strikes me as strange that HW can do page faulting, so it can
> support #2, but it can't handle a non-present fault?
>
> > So yeah there's a bunch of tricky lifetime questions that need to be
> > sorted out with proper design I think, and the current "let's just use pfn
> > directly" proposal hides them all under the rug.
>
> I don't think these two things are connected. The lifetime model that
> KVM needs to work with the EPT, and that VFIO needs for its MMIO,
> definitely should be reviewed and evaluated.
>
> But it is completely orthogonal to allowing iommufd and kvm to access
> the CPU PFN to use in their mapping flows, instead of the
> dma_addr_t.
>
> What I want to get to is a replacement for scatter list in DMABUF that
> is an array of arrays, roughly like:
>
> struct memory_chunks {
>         struct memory_p2p_provider *provider;
>         struct bio_vec addrs[];
> };
>
> int (*dmabuf_get_memory)(struct memory_chunks **chunks, size_t *num_chunks);
Maybe we need to specify which object this API operates on:
struct dma_buf, struct dma_buf_attachment, or a new attachment type.
I think:

  int (*dmabuf_get_memory)(struct dma_buf_attachment *attach,
                           struct memory_chunks **chunks, size_t *num_chunks);

works, but maybe a new attachment type is conceptually clearer to
importers and harder to abuse?
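Whichever object it ends up taking, I'd expect the importer side to
consume it roughly like this (a sketch only; struct memory_chunks is
your proposal above, not an existing structure, and my_map_chunk()
is a placeholder):

	struct memory_chunks *chunks;
	size_t num_chunks, i;
	int ret;

	/* i.e. the op proposed above, called through the attachment */
	ret = dmabuf_get_memory(attach, &chunks, &num_chunks);
	if (ret)
		return ret;

	for (i = 0; i < num_chunks; i++)
		/*
		 * Each chunk carries its own provider, so P2P, private
		 * and CPU memory can be mixed in one buffer.  A per-chunk
		 * count for addrs[] would presumably be needed as well.
		 */
		my_map_chunk(chunks[i].provider, chunks[i].addrs);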
Thanks,
Yilun
>
> This can represent all forms of memory: P2P, private, CPU, etc and
> would be efficient with the new DMA API.
>
> This is similar to the structure BIO has, and it composes nicely with
> a future pin_user_pages() and memfd_pin_folios().
>
> Jason