Re: [PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

From: Hugh Dickins
Date: Wed Dec 23 2020 - 23:10:35 EST


On Tue, 22 Dec 2020, Kirill A. Shutemov wrote:
>
> Updated patch is below.
>
> From 0ec1bc1fe95587350ac4f4c866d6482383740b36 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> Date: Sat, 19 Dec 2020 15:19:23 +0300
> Subject: [PATCH] mm: Cleanup faultaround and finish_fault() codepaths
>
> alloc_set_pte() has two users with different requirements: in the
> faultaround code, it called from an atomic context and PTE page table
> has to be preallocated. finish_fault() can sleep and allocate page table
> as needed.
>
> PTL locking rules are also strange, hard to follow and overkill for
> finish_fault().
>
> Let's untangle the mess. alloc_set_pte() has gone now. All locking is
> explicit.
>
> The price is some code duplication to handle huge pages in faultaround
> path, but it should be fine, having overall improvement in readability.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

It's not ready yet.

I won't pretend to have reviewed it, but I did try applying and running
with it: mostly it seems to work fine, but it turned out to be leaking
huge pages (with vmstat's thp_split_page_failed growing bigger and
bigger as page reclaim cannot get rid of them).

Aside from the actual bug, filemap_map_pmd() seems suboptimal at
present: comments below (plus one comment in do_anonymous_page()).

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 0b2067b3c328..f8fdbe079375 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2831,10 +2832,74 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page,
> + struct xa_state *xas)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct address_space *mapping = vma->vm_file->f_mapping;
> +
> + /* Huge page is mapped? No need to proceed. */
> + if (pmd_trans_huge(*vmf->pmd))
> + return true;
> +
> + if (xa_is_value(page))
> + goto nohuge;

I think it would be easier to follow if filemap_map_pages() never
passed this an xa_is_value(page): probably just skip them in its
initial xas_next_entry() loop.
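
A rough userspace sketch of that skipping, not kernel code: the array
stands in for the entries an xas_next_entry() walk would yield, and
mock_is_value() and first_page_entry() are hypothetical names. The only
thing borrowed from the kernel is that xa_is_value() tests the low bit
of the entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the low-bit tag test done by xa_is_value(); hypothetical name. */
static bool mock_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

/*
 * Return the index of the first entry at or after 'start' that is a
 * real pointer (neither NULL nor a value entry), or 'n' if there is none.
 * With this done up front, the per-page mapping helper never sees a
 * value entry and needs no check of its own.
 */
static size_t first_page_entry(void *const *entries, size_t n, size_t start)
{
	assert(entries != NULL);

	for (size_t i = start; i < n; i++) {
		if (entries[i] == NULL || mock_is_value(entries[i]))
			continue;	/* skip holes and value entries */
		return i;
	}
	return n;
}
```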

> +
> + if (!pmd_none(*vmf->pmd))
> + goto nohuge;

Then at nohuge it unconditionally takes pmd_lock(), finds !pmd_none,
and unlocks again: unnecessary overhead I believe we did not have before.
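
To illustrate the overhead, here is a userspace model only: struct
mock_pmd and the populate_*() helpers are made-up names, the "lock" is
just a counter, and real concurrency is ignored. It compares always
taking the lock with checking first and rechecking under the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pmd slot plus its lock. */
struct mock_pmd {
	bool populated;		/* models !pmd_none() */
	bool held;		/* lock currently held? */
	int lock_count;		/* how many times the lock was taken */
};

static void mock_pmd_lock(struct mock_pmd *pmd)
{
	assert(!pmd->held);
	pmd->held = true;
	pmd->lock_count++;
}

static void mock_pmd_unlock(struct mock_pmd *pmd)
{
	assert(pmd->held);
	pmd->held = false;
}

/* Shape of the posted nohuge path: lock, look, maybe populate, unlock. */
static void populate_always_lock(struct mock_pmd *pmd)
{
	mock_pmd_lock(pmd);
	if (!pmd->populated)
		pmd->populated = true;
	mock_pmd_unlock(pmd);
}

/* Alternative: cheap unlocked check first, recheck once the lock is held. */
static void populate_check_first(struct mock_pmd *pmd)
{
	if (pmd->populated)
		return;		/* already has a page table: no lock needed */

	mock_pmd_lock(pmd);
	if (!pmd->populated)
		pmd->populated = true;
	mock_pmd_unlock(pmd);
}
```

With an already populated slot, populate_always_lock() bumps lock_count
to 1 while populate_check_first() leaves it at 0.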

> +
> + if (!PageTransHuge(page) || PageLocked(page))
> + goto nohuge;

So if PageTransHuge, but someone else temporarily holds PageLocked,
we insert a page table at nohuge, sadly preventing it from being
mapped here later by huge pmd.

> +
> + if (!page_cache_get_speculative(page))
> + goto nohuge;
> +
> + if (page != xas_reload(xas))
> + goto unref;
> +
> + if (!PageTransHuge(page))
> + goto unref;
> +
> + if (!PageUptodate(page) || PageReadahead(page) || PageHWPoison(page))
> + goto unref;
> +
> + if (!trylock_page(page))
> + goto unref;
> +
> + if (page->mapping != mapping || !PageUptodate(page))
> + goto unlock;
> +
> + if (xas->xa_index >= DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE))
> + goto unlock;
> +
> + do_set_pmd(vmf, page);

Here is the source of the huge page leak: do_set_pmd() can fail
(and we would do better to have skipped most of its failure cases long
before getting this far). It worked without leaking once I patched it:

- do_set_pmd(vmf, page);
- unlock_page(page);
- return true;
+ if (do_set_pmd(vmf, page) == 0) {
+ unlock_page(page);
+ return true;
+ }
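
For anyone wanting to see the leak in isolation, a userspace model
follows. It is a sketch under assumptions: struct mock_page and the
mock_*() helpers are invented for illustration, and it assumes that a
successful do_set_pmd() keeps the page reference for the mapping while
a failed one leaves it with the caller to drop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: just a refcount and a lock bit. */
struct mock_page {
	int refcount;
	bool locked;
};

static void mock_get_page(struct mock_page *p)
{
	p->refcount++;
}

static void mock_put_page(struct mock_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/* Models do_set_pmd(): 0 on success, nonzero when it cannot map. */
static int mock_set_pmd(struct mock_page *p, bool can_map)
{
	(void)p;
	return can_map ? 0 : 1;
}

/* As posted: the return value is ignored, so a failure keeps the ref. */
static bool map_huge_leaky(struct mock_page *p, bool can_map)
{
	mock_get_page(p);
	p->locked = true;
	mock_set_pmd(p, can_map);
	p->locked = false;
	return true;
}

/* With the fix: on failure fall through to unlock and drop the ref. */
static bool map_huge_fixed(struct mock_page *p, bool can_map)
{
	mock_get_page(p);
	p->locked = true;
	if (mock_set_pmd(p, can_map) == 0) {
		p->locked = false;
		return true;	/* the mapping now owns the reference */
	}
	p->locked = false;	/* unlock: */
	mock_put_page(p);	/* unref: */
	return false;		/* caller carries on with small pages */
}
```

Starting from refcount 1 and a failing mock_set_pmd(), the leaky
variant leaves the count at 2 with nothing mapped; the fixed variant
returns it to 1.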

> + unlock_page(page);
> + return true;
> +unlock:
> + unlock_page(page);
> +unref:
> + put_page(page);
> +nohuge:
> + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> + if (likely(pmd_none(*vmf->pmd))) {
> + mm_inc_nr_ptes(vma->vm_mm);
> + pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
> + vmf->prealloc_pte = NULL;
> + }
> + spin_unlock(vmf->ptl);

I think it's a bit weird to hide this page table insertion inside
filemap_map_pmd() (I guess you're thinking that this function deals
with pmd level, but I'd find it easier to have a filemap_map_huge()
dealing with the huge mapping). Better to do it on return into
filemap_map_pages(); maybe filemap_map_pmd() or filemap_map_huge()
would then need to return vm_fault_t rather than bool; I didn't try.

> +
> + /* See comment in handle_pte_fault() */
> + if (pmd_devmap_trans_unstable(vmf->pmd))
> + return true;
> +
> + return false;
> +}
...
> diff --git a/mm/memory.c b/mm/memory.c
> index c48f8df6e502..96d62774096a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3490,7 +3490,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (pte_alloc(vma->vm_mm, vmf->pmd))
> return VM_FAULT_OOM;
>
> - /* See the comment in pte_alloc_one_map() */
> + /* See the comment in map_set_pte() */

No, no such function: probably should be like the others and say
/* See comment in handle_pte_fault() */

Hugh