Re: [PATCH v8 5/8] mm: Device exclusive memory access

From: Peter Xu
Date: Tue May 18 2021 - 13:27:50 EST

Next message: Andi Kleen: "Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO"
Previous message: Andi Kleen: "Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO"
In reply to: Alistair Popple: "Re: [PATCH v8 5/8] mm: Device exclusive memory access"
Next in thread: Jason Gunthorpe: "Re: [PATCH v8 5/8] mm: Device exclusive memory access"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Alistair,

While I got one reply below to your previous email, I also looked at the rest
code (majorly restore and fork sides) today and added a few more comments.

On Tue, May 18, 2021 at 11:19:05PM +1000, Alistair Popple wrote:

[...]

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3a5705cfc891..556ff396f2e9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > > *vma, unsigned long addr,>
> > > }
> > > #endif
> > >
> > > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > > + struct page *page, unsigned long address,
> > > + pte_t *ptep)
> > > +{
> > > + pte_t pte;
> > > + swp_entry_t entry;
> > > +
> > > + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > > + if (pte_swp_soft_dirty(*ptep))
> > > + pte = pte_mksoft_dirty(pte);
> > > +
> > > + entry = pte_to_swp_entry(*ptep);
> > > + if (pte_swp_uffd_wp(*ptep))
> > > + pte = pte_mkuffd_wp(pte);
> > > + else if (is_writable_device_exclusive_entry(entry))
> > > + pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +
> > > + set_pte_at(vma->vm_mm, address, ptep, pte);
> > > +
> > > + /*
> > > + * No need to take a page reference as one was already
> > > + * created when the swap entry was made.
> > > + */
> > > + if (PageAnon(page))
> > > + page_add_anon_rmap(page, vma, address, false);
> > > + else
> > > + page_add_file_rmap(page, false);

This seems to be another leftover; maybe WARN_ON_ONCE(!PageAnon(page))?

> > > +
> > > + if (vma->vm_flags & VM_LOCKED)
> > > + mlock_vma_page(page);
> > > +
> > > + /*
> > > + * No need to invalidate - it was non-present before. However
> > > + * secondary CPUs may have mappings that need invalidating.
> > > + */
> > > + update_mmu_cache(vma, address, ptep);
> > > +}

[...]

> > > /*
> > >
> > > * copy one vm_area from one task to the other. Assumes the page tables
> > > * already present in the new task to be cleared in the whole range
> > >
> > > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > > mm_struct *src_mm,>
> > > pte = pte_swp_mkuffd_wp(pte);
> > >
> > > set_pte_at(src_mm, addr, src_pte, pte);
> > >
> > > }
> > >
> > > + } else if (is_device_exclusive_entry(entry)) {
> > > + /* COW mappings should be dealt with by removing the entry
> > > */

Here the comment says "removing the entry" however I think it didn't remove the
pte, instead it keeps it (as there's no "return", so set_pte_at() will be
called below), so I got a bit confused.

> > > + VM_BUG_ON(is_cow_mapping(vm_flags));

Also here, if PageAnon() is the only case to support so far, doesn't that
easily satisfy is_cow_mapping()? Maybe I missed something..

I also have a pure and high level question regarding a process fork() when
there're device exclusive ptes: would the two processes then own the device
together? Is this a real usecase?

Indeed it'll be odd for a COW page since for COW page then it means after
parent/child writting to the page it'll clone into two, then it's a mistery on
which one will be the one that "exclusived owned" by the device..

> > > + page = pfn_swap_entry_to_page(entry);
> > > + get_page(page);
> > > + rss[mm_counter(page)]++;
> > >
> > > }
> > > set_pte_at(dst_mm, addr, dst_pte, pte);
> > > return 0;
> > >
> > > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct
> > > vm_area_struct *src_vma,>
> > > int rss[NR_MM_COUNTERS];
> > > swp_entry_t entry = (swp_entry_t){0};
> > > struct page *prealloc = NULL;
> > >
> > > + struct page *locked_page = NULL;
> > >
> > > again:
> > > progress = 0;
> > >
> > > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > > struct vm_area_struct *src_vma,>
> > > continue;
> > >
> > > }
> > > if (unlikely(!pte_present(*src_pte))) {
> > >
> > > - entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > - dst_pte, src_pte,
> > > - src_vma, addr, rss);
> > > - if (entry.val)
> > > - break;
> > > - progress += 8;
> > > - continue;
> > > + swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);

(Just a side note to all of us: this will be one more place that I'll need to
look after in my uffd-wp series if this series lands first, as after that
series we can only call pte_to_swp_entry after a pte_has_swap_entry check, as
sometimes non-present pte won't contain a swap entry at all)

> > > +
> > > + if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
> > > + is_device_exclusive_entry(swp_entry))) {
> > > + /*
> > > + * Normally this would require sending mmu
> > > + * notifiers, but copy_page_range() has
> > > already + * done that for COW mappings.
> > > + */
> > > + ret = try_restore_exclusive_pte(src_mm,
> > > src_pte, +
> > > src_vma, addr, +
> > > &locked_page); + if (ret == -EBUSY)
> > > + break;

Would it be possible that we put all the handling of device exclusive ptes into
copy_nonpresent_pte()? As IMHO all device exclusive ptes should still be one
kind of non-present pte. Splitting the cases really make it (at least to
me...) even harder to read.

Maybe you wanted to avoid the rework of copy_nonpresent_pte() as it currently
returns a entry.val which is indeed not straightforward already.. I wanted to
clean that up but not yet.

An easier option is perhaps failing the fork() directly when trylock_page()
failed when restoring the pte? So the userspace could try again the whole
fork(). However that'll also depend on my previous question on whether this is
a valid scenario after all. If "maintaining fork correctness" is the only
thing we persue for, maybe still worth to consider?

> > > + } else {
> > > + entry.val = copy_nonpresent_pte(dst_mm,
> > > src_mm, +
> > > dst_pte, src_pte, +
> > > src_vma, addr, +
> > > rss); + if (entry.val)
> > > + break;
> > > + progress += 8;
> > > + continue;
> > > + }
> > > + }
> > > + /* a non-present pte became present after dropping the ptl
> > > */
> > > + if (unlikely(locked_page)) {
> > > + unlock_page(locked_page);
> > > + put_page(locked_page);
> > > + locked_page = NULL;
> > >
> > > }
> > > /* copy_present_pte() will clear `*prealloc' if consumed */
> > > ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> > >

[...]

> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > +{
> > > + struct page *page = vmf->page;
> > > + struct vm_area_struct *vma = vmf->vma;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = vmf->address,
> > > + .flags = PVMW_SYNC,
> > > + };
> > > + vm_fault_t ret = 0;
> > > + struct mmu_notifier_range range;
> > > +
> > > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > > + return VM_FAULT_RETRY;
> > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma,
> > > vma->vm_mm,
> > > + vmf->address & PAGE_MASK,
> > > + (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > + mmu_notifier_invalidate_range_start(&range);
> >
> > I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> >
> > * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> > * madvise() or replacing a page by another one, ...).
> >
> > Does MMU_NOTIFIER_CLEAR suite for this case? Normally I think for such a
> > case (existing pte is invalid) we don't need to notify at all. However
> > from what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing atomic
> > operations from the device. Not sure whether we'd like a new notifier
> > type, or maybe at least comment on why to use CLEAR?
>
> Right, notifying the owner/driver when it no longer has exclusive access to
> the page and allowing it to stop atomic operations is the critical point and
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
>
> I did consider adding a new type, but in the driver implementation it ends up
> being treated the same as a CLEAR notification anyway so didn't think it was
> necessary. But I suppose adding a different type would allow other listening
> notifiers to filter these which might be worthwhile.

Sounds good to me.

[...]

> > > + /*
> > > + * Check that our target page is still mapped at the
> > > expected
> > > + * address.
> > > + */
> > > + if (ttp->mm == mm && ttp->address == address &&
> > > + pte_write(pteval))
> > > + ttp->valid = true;
> >
> > I think I get the point of doing this (as after GUP the pte could have been
> > changed to point to another page), however it smells a bit odd to me (or
> > it's also possible that I'm not familiar enough with the code base..).
> > IIUC this is the _only_ reason that we passed in "address" into
> > try_to_protect() too, and further into the ttp_args.
>
> Yes, this is why "address" is passed up to ttp_args.
>
> > The odd part is the remote GUP should have walked the page table already, so
> > since the target here is the vaddr to replace, the 1st page table walk
> > should be able to both trylock/lock the page, then modify the pte with
> > pgtable lock held, return the locked page, then walk the rmap again to
> > remove all the rest of the ptes that are mapping to this page. In that
> > case before we call the rmap_walk() we know this must be the page we want
> > to take care of, and no one will be able to restore the original mm pte
> > either (as we're with the page lock). Then we don't need this check,
> > neither do we need ttp->address.
>
> If I am understanding you correctly I think this would be similar to the
> approach that was taken in v2. However it pretty much ended up being just an
> open-coded version of gup which is useful anyway to fault the page in.

I see. For easier reference this is v2 patch 1:

https://lore.kernel.org/lkml/20210219020750.16444-2-apopple@xxxxxxxxxx/

Indeed that looks like it, it's just that instead of grabbing the page only in
hmm_exclusive_pmd() we can do the pte modification along the way to seal the
whole thing (address/pte & page). I saw Christoph and Jason commented in that
patch, but not regarding to this approach. So is there a reason that you
switched? Do you think it'll work?

I have no strong opinion either, it's just not crystal clear why we'd need that
ttp->address at all for a rmap walk along with that "valid" field. Meanwhile it
should be slightly less efficient too to go with current approach, especially
when the page array gets huge, I think: since there'll be longer time we do GUP
before doing the rmap walk, so higher possibility that the GUPed pages got
replaced for whatever reason. Then the call to make_device_exclusive_range()
will fail as a whole just for a single page replacement within the range.

Thanks,

--
Peter Xu

Next message: Andi Kleen: "Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO"
Previous message: Andi Kleen: "Re: [RFC v2-fix 1/1] x86/tdx: Handle in-kernel MMIO"
In reply to: Alistair Popple: "Re: [PATCH v8 5/8] mm: Device exclusive memory access"
Next in thread: Jason Gunthorpe: "Re: [PATCH v8 5/8] mm: Device exclusive memory access"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]