Re: [PATCH v10 07/10] mm: Device exclusive memory access

From: Peter Xu
Date: Wed Jun 09 2021 - 12:05:22 EST


On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> >
> > [...]
> >
> > > +static bool page_make_device_exclusive_one(struct page *page,
> > > +		struct vm_area_struct *vma, unsigned long address, void *priv)
> > > +{
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +	struct page_vma_mapped_walk pvmw = {
> > > +		.page = page,
> > > +		.vma = vma,
> > > +		.address = address,
> > > +	};
> > > +	struct make_exclusive_args *args = priv;
> > > +	pte_t pteval;
> > > +	struct page *subpage;
> > > +	bool ret = true;
> > > +	struct mmu_notifier_range range;
> > > +	swp_entry_t entry;
> > > +	pte_t swp_pte;
> > > +
> > > +	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > > +				      vma->vm_mm, address, min(vma->vm_end,
> > > +				      address + page_size(page)), args->owner);
> > > +	mmu_notifier_invalidate_range_start(&range);
> > > +
> > > +	while (page_vma_mapped_walk(&pvmw)) {
> > > +		/* Unexpected PMD-mapped THP? */
> > > +		VM_BUG_ON_PAGE(!pvmw.pte, page);
> >
> > [1]
> >
> > > +
> > > +		if (!pte_present(*pvmw.pte)) {
> > > +			ret = false;
> > > +			page_vma_mapped_walk_done(&pvmw);
> > > +			break;
> > > +		}
> > > +
> > > +		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > +		address = pvmw.address;
> >
> > I raised a question here previously and didn't get an answer...
> >
> > https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/
>
> Sorry, I had overlooked that. Will continue the discussion here.

No problem. I also didn't express myself very clearly last time; I'm happy we
can discuss this more thoroughly, even if it may only be a corner case.

>
> > I think I get your point now, and it does look possible that the split page
> > can still be mapped somewhere else as a thp, so some subpage maintenance
> > looks necessary. The confusing part is that above at [1] you've also got
> > that VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all...
>
> Going back, I thought your original question was whether subpage != page is
> possible. My main point was that it is possible if we get a thp head. In that
> case we need to replace all the PTEs with exclusive entries, because I haven't
> (yet) defined a pmd version of device exclusive entries, and rmap_walk() won't
> deal with tail pages either (see below).
>
> > Then I remembered this code largely comes from try_to_unmap(), so I looked
> > there. I _think_ what's missing here is something like:
> >
> > 	if (flags & TTU_SPLIT_HUGE_PMD)
> > 		split_huge_pmd_address(vma, address, false, page);
> >
> > at the entry of page_make_device_exclusive_one()?
> >
> > That !pte assertion in try_to_unmap() makes sense to me as long as it always
> > splits the thp page first. However, that seems not to be the case with
> > FOLL_SPLIT_PMD, as you previously mentioned.
>
> At present this is limited to PageAnon pages which have had CoW broken, which
> I think means there shouldn't be any other mappings, so I expect the PMD will
> always have been split into small PTEs mapping subpages by GUP, which is what
> that assertion [1] is checking. I could call split_huge_pmd_address()
> unconditionally as suggested, but see the discussion below.

Yes, I think calling that unconditionally should be enough.
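
For the archives, here is my understanding of why GUP only hands back pte-mapped
subpages in that case, condensed from the FOLL_SPLIT_PMD path of
follow_pmd_mask() in mm/gup.c (I'm quoting loosely from memory with the error
handling dropped, so take the exact lines with a grain of salt):

	/*
	 * FOLL_SPLIT_PMD: split the pmd-mapped thp into ptes first, then
	 * redo the lookup via the pte path, so the page handed back to
	 * the caller is always a pte-mapped subpage.
	 */
	if (flags & FOLL_SPLIT_PMD) {
		spin_unlock(ptl);
		split_huge_pmd(vma, pmd, address);
		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
	}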

>
> > Meanwhile, I also started to wonder whether it's even right to call rmap_walk()
> > with tail pages... Please see below.
> >
> > > +
> > > +		/* Nuke the page table entry. */
> > > +		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > > +		pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > +
> > > +		/* Move the dirty bit to the page. Now the pte is gone. */
> > > +		if (pte_dirty(pteval))
> > > +			set_page_dirty(page);
> > > +
> > > +		/*
> > > +		 * Check that our target page is still mapped at the expected
> > > +		 * address.
> > > +		 */
> > > +		if (args->mm == mm && args->address == address &&
> > > +		    pte_write(pteval))
> > > +			args->valid = true;
> > > +
> > > +		/*
> > > +		 * Store the pfn of the page in a special device exclusive
> > > +		 * pte. do_swap_page() will wait until the device exclusive
> > > +		 * pte is removed and then restart fault handling.
> > > +		 */
> > > +		if (pte_write(pteval))
> > > +			entry = make_writable_device_exclusive_entry(
> > > +							page_to_pfn(subpage));
> > > +		else
> > > +			entry = make_readable_device_exclusive_entry(
> > > +							page_to_pfn(subpage));
> > > +		swp_pte = swp_entry_to_pte(entry);
> > > +		if (pte_soft_dirty(pteval))
> > > +			swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > +		if (pte_uffd_wp(pteval))
> > > +			swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > > +
> > > +		set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > +
> > > +		/*
> > > +		 * There is a reference on the page for the swap entry which
> > > +		 * has been removed, so shouldn't take another.
> > > +		 */
> > > +		page_remove_rmap(subpage, false);
> > > +	}
> > > +
> > > +	mmu_notifier_invalidate_range_end(&range);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * page_make_device_exclusive - mark the page exclusively owned by a device
> > > + * @page: the page to replace page table entries for
> > > + * @mm: the mm_struct where the page is expected to be mapped
> > > + * @address: address where the page is expected to be mapped
> > > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > > + *
> > > + * Tries to remove all the page table entries which are mapping this page and
> > > + * replace them with special device exclusive swap entries to grant a device
> > > + * exclusive access to the page. Caller must hold the page lock.
> > > + *
> > > + * Returns false if the page is still mapped, or if it could not be unmapped
> > > + * from the expected address. Otherwise returns true (success).
> > > + */
> > > +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
> > > +		unsigned long address, void *owner)
> > > +{
> > > +	struct make_exclusive_args args = {
> > > +		.mm = mm,
> > > +		.address = address,
> > > +		.owner = owner,
> > > +		.valid = false,
> > > +	};
> > > +	struct rmap_walk_control rwc = {
> > > +		.rmap_one = page_make_device_exclusive_one,
> > > +		.done = page_not_mapped,
> > > +		.anon_lock = page_lock_anon_vma_read,
> > > +		.arg = &args,
> > > +	};
> > > +
> > > +	/*
> > > +	 * Restrict to anonymous pages for now to avoid potential writeback
> > > +	 * issues.
> > > +	 */
> > > +	if (!PageAnon(page))
> > > +		return false;
> > > +
> > > +	rmap_walk(page, &rwc);
> >
> > Here we call rmap_walk() on each page we've got. If it was a thp then IIUC
> > it'll be the tail pages we walk, as the outcome of the FOLL_SPLIT_PMD gup
> > (please refer to my last reply). However, now I'm uncertain whether we can
> > do rmap_walk() on a tail page at all... as rmap_walk_anon() calls
> > thp_nr_pages(), which has:
> >
> > 	VM_BUG_ON_PGFLAGS(PageTail(page), page);
>
> In either case (FOLL_SPLIT_PMD or not), my understanding is that GUP will
> return a sub/tail page (perhaps I mixed up some terminology in the last
> thread, but I think we're in agreement here).

Aha, I totally missed this line in follow_trans_huge_pmd() when I read it last
time:

	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;

Now I agree it'll always return the subpage, even if thp-mapped. And doing
FOLL_SPLIT_PMD makes sense too, to break CoW early as you said before.
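
For context, the surrounding logic (again paraphrased from mm/huge_memory.c
from memory, so the exact context lines may differ) looks up the thp head from
the pmd and then offsets to the subpage covering addr:

	page = pmd_page(*pmd);
	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
	/* Offset from the head to the subpage that maps addr */
	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;

So the head page itself is only handed back when addr happens to sit at the
start of the huge mapping.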

> For thp this means we could end up passing tail pages to rmap_walk(); however,
> it doesn't actually walk them.
>
> Based on previous testing I had assumed rmap_walk() filtered out tail pages.
> It does, and I didn't hit the BUG_ON above, but the filtering was not as
> deliberate as I had assumed.
>
> I've gone back and looked at what was happening in my earlier tests: the tail
> pages get filtered because the VMA is not getting locked in
> page_lock_anon_vma_read(), due to failing this check:
>
> 	anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> 		goto out;
>
> And now I'm not sure it makes sense to read page->mapping of a tail page at
> all. So it might be best if we explicitly ignore any tail pages returned from
> GUP, at least for now (a future series will improve thp support, such as
> adding a pmd version of the exclusive entries).

I feel like it's illegal to access page->mapping of a tail page; I looked at
what happens if we call page_anon_vma() on one:

struct anon_vma *page_anon_vma(struct page *page)
{
	unsigned long mapping;

	page = compound_head(page);
	mapping = (unsigned long)page->mapping;
	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;
	return __page_rmapping(page);
}

It'll just take the head's mapping instead. That makes sense, since a tail page
shouldn't have a mapping different from the head page's, afaiu.

It would be great if thp experts could chime in. Until that happens, I agree
with you that the safer approach is to explicitly not walk a tail page for its
rmap (and I think the rmap of a tail page will be the same as the head's
anyway... since they seem to share the anon_vma, as quoted above).
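
Something like the below would make the skip explicit (an untested sketch only;
I'm assuming the pages[]/npages naming of the GUP loop in your
make_device_exclusive_range()):

	for (i = 0; i < npages; i++) {
		/*
		 * Device exclusive entries only cover pte mappings so far,
		 * and rmap_walk() can't handle tail pages anyway; drop any
		 * tail page until proper thp support lands.
		 */
		if (pages[i] && PageTail(pages[i])) {
			put_page(pages[i]);
			pages[i] = NULL;
			continue;
		}
		/* ... lock the page and call page_make_device_exclusive() ... */
	}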

>
> > So... for thp mappings, I'm wondering whether we should do normal GUP
> > (without SPLIT), always pass normal or head pages into rmap_walk(), but
> > then unconditionally call split_huge_pmd_address() in
> > page_make_device_exclusive_one()?
>
> That could work (although I think GUP will still return tail pages - see
> follow_trans_huge_pmd(), which is called from follow_pmd_mask() in GUP).

Agreed.

> The main problem is that split_huge_pmd_address() unconditionally calls an
> mmu notifier, so I would need to plumb an owner argument through everywhere,
> which could get messy.

Could I ask why? split_huge_pmd_address() will notify with CLEAR, so I'm a bit
confused about why we would need to pass the owner there at all.

I thought plumbing the split in right before your EXCLUSIVE notifier init
would work?

---8<---
diff --git a/mm/rmap.c b/mm/rmap.c
index a94d9aed9d95..360ce86f3822 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page *page,
 	swp_entry_t entry;
 	pte_t swp_pte;
 
+	/*
+	 * Make sure thps are split, as device exclusive entries only
+	 * support pte level for now.
+	 */
+	split_huge_pmd_address(vma, address, false, page);
+
 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
 				      vma->vm_mm, address, min(vma->vm_end,
 				      address + page_size(page)), args->owner);
---8<---

Thanks,

--
Peter Xu