Re: [PATCH v4 02/12] mm/rmap: Add try_to_unmap_hugetlb_one
From: David Hildenbrand (Arm)
Date: Tue Jun 09 2026 - 10:28:50 EST
On 5/26/26 08:36, Dev Jain wrote:
> Simplify try_to_unmap_one by separating out the hugetlb parts into
> try_to_unmap_hugetlb_one.
>
> To understand the correctness of the refactoring, the following points
> are noted:
>
> 1. try_to_unmap() is called for hugetlb folios only when they are
> hwpoisoned.
Ack
>
> 2. A hugetlb VMA cannot be mlocked.
Ack
>
> 3. The pvmw API sets pvmw.pte to the base of the hugetlb folio (pvmw.pmd
> is NULL).
Ack
>
> 4. We won't ever process a softleaf entry that encodes a hugetlb folio;
> hugetlb folios are never swapped out, migration entries will be skipped
> (PVMW_MIGRATION not passed) and device-exclusive does not work for
> hugetlb.
Ack, we should always find present entries.
>
> 5. uffd-wp bit is lost when converting pvmw.pte to hwpoison softleaf
> (therefore no need to call pte_install_uffd_wp_if_needed after
> clearing pvmw.pte)
Are you sure? And if so, can you elaborate why + document?
>
> 6. TTU_HWPOISON is always present; for it to not be present, either folio
> has to be in swapcache, or mapping_can_writeback() is true (see
> unmap_poisoned_folio), none of which is true for hugetlb folios.
Agreed. That's one very confusing bit in the existing code.
>
> 7. hugetlb uses separate counters from normal rss counters, therefore
> update_highwater_rss() need not be called.
Ack.
>
> While at it:
>
> - Change VM_BUG_* to VM_WARN_*.
>
> - Do not declare variables which are only used once
>
> - Assert that the subpage derived by the pvmw walk is exactly the head
> page. This is because try_to_unmap() does not remember the specific
> subpage which was hwpoisoned, and, since we cannot munmap/mremap
> across a hugetlb folio partially, the first pte mapping the hugetlb
> folio (in case of a contpte or contpmd mapped folio) cannot ever point
> to an intermediate page.
Might want to add a Suggested-by:, so people know directly whom to blame :)
I am very much in favor of decoupling both things cleanly, instead of hacking it
together and make the code all big and confusing.
>
> Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
> ---
> mm/rmap.c | 203 ++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 121 insertions(+), 82 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 430c91c8fe2ae..06ab1158d4cd1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1978,6 +1978,121 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
> }
>
> +static bool __try_to_unmap_hugetlb_one(struct folio *folio,
> + struct vm_area_struct *vma, struct page_vma_mapped_walk *pvmw,
> + struct mmu_notifier_range *range, enum ttu_flags flags)
> +{
> + unsigned long hsz = huge_page_size(hstate_vma(vma));
Can be const.
> + unsigned long address = pvmw->address;
Can be const. But do we even need the temporary variable?
> + struct mm_struct *mm = vma->vm_mm;
> + struct page *subpage;
a) subpages do not exist. It's just a folio page.
b) subpages are irrelevant for hugetlb, just drop anything that deals with them.
> + bool ret = true;
> + pte_t pteval;
> +
Worth adding a comment here, why we don't need a loop.
/* There is only a single mapping in a VMA. */
> + if (!page_vma_mapped_walk(pvmw))
> + return true;
> +
> + pteval = ptep_get(pvmw->pte);
Should we now properly use huge_ptep_get()?
> + VM_WARN_ON(!pte_present(pteval));
> + subpage = folio_page(folio, pte_pfn(pteval) - folio_pfn(folio));
No need for subpages.
> + VM_WARN_ON(folio_page(folio, 0) != subpage);
VM_WARN_ON(pte_pfn(pteval) != folio_pfn(folio));
I assume both can be VM_WARN_ON_ONCE()
> +
> + /*
> + * huge_pmd_unshare may unmap an entire PMD page. There is no way of
> + * knowing exactly which PMDs may be cached for this mm, so we must
> + * flush them all. start/end were already adjusted above to cover this
> + * range.
> + */
> + flush_cache_range(vma, range->start, range->end);
> +
> + /*
> + * To call huge_pmd_unshare, i_mmap_rwsem must be held in write mode.
> + * Caller needs to explicitly do this outside rmap routines.
> + *
> + * We also must hold hugetlb vma_lock in write mode. Lock order dictates
> + * acquiring vma_lock BEFORE i_mmap_rwsem. We can only try lock here and
> + * fail if unsuccessful.
> + */
> + if (!folio_test_anon(folio)) {
> + struct mmu_gather tlb;
> +
> + VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
> + if (!hugetlb_vma_trylock_write(vma)) {
> + ret = false;
> + goto walk_done;
> + }
> +
> + tlb_gather_mmu_vma(&tlb, vma);
> + if (huge_pmd_unshare(&tlb, vma, address, pvmw->pte)) {
> + hugetlb_vma_unlock_write(vma);
> + huge_pmd_unshare_flush(&tlb, vma);
> + tlb_finish_mmu(&tlb);
> + /*
> + * The PMD table was unmapped, consequently unmapping
> + * the folio.
> + */
> + goto walk_done;
> + }
> + hugetlb_vma_unlock_write(vma);
> + tlb_finish_mmu(&tlb);
> + }
> + pteval = huge_ptep_clear_flush(vma, address, pvmw->pte);
> + if (pte_dirty(pteval))
Should we now use huge_pte_dirty() for consistency?
> + folio_mark_dirty(folio);
> +
> + VM_WARN_ON(!(flags & TTU_HWPOISON));
This should be checked right at the very beginning of try_to_unmap_hugetlb_one().
> + pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> + hugetlb_count_sub(folio_nr_pages(folio), mm);
> + set_huge_pte_at(mm, address, pvmw->pte, pteval, hsz);
> + hugetlb_remove_rmap(folio);
> + folio_put_refs(folio, 1);
> +
> +walk_done:
> + page_vma_mapped_walk_done(pvmw);
> + return ret;
> +}
> +
> +static bool try_to_unmap_hugetlb_one(struct folio *folio,
> + struct vm_area_struct *vma, unsigned long address, void *arg)
> +{
> + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> + struct mmu_notifier_range range;
> + enum ttu_flags flags = (enum ttu_flags)(long)arg;
> + bool ret;
> +
> + /*
> + * The try_to_unmap() is only passed a hugetlb folio in the case
> + * where the hugetlb folio contains a poisoned page.
> + */
> + VM_WARN_ON_FOLIO(!folio_contain_hwpoisoned_page(folio), folio);
As discussed, I'd prefer to just have folio_test_hwpoison() here.
> +
> + /*
> + * When racing against e.g. zap_pte_range() on another cpu,
> + * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
> + * try_to_unmap() may return before folio_mapped() has become false,
> + * if page table locking is skipped: use TTU_SYNC to wait for that.
> + */
The comment should be updated to mention hugetlb primitives ... but
... it's a good question whether that is required *at all* ...
> + if (flags & TTU_SYNC)
> + pvmw.flags = PVMW_SYNC;
... and in fact, looking at page_vma_mapped_walk(), PVMW_SYNC is irrelevant.
> +
> + /*
> + * For hugetlb, it could be much worse than THP if we need pud
> + * invalidation in the case of pmd sharing.
I don't understand that comment. Is it really helpful?
> + *
> + * Note that the folio can not be freed in this function as call of
"cannot" ?
> + * try_to_unmap() must hold a reference on the folio.
> + */
Do we really need this comment entirely? We don't even look at the folio after
unmapping it.
> + range.end = vma_address_end(&pvmw);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
> + address, range.end);
> + adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> + mmu_notifier_invalidate_range_start(&range);
> + ret = __try_to_unmap_hugetlb_one(folio, vma, &pvmw, &range,
> + flags);
Is it really worth it moving that stuff to a separate function?
> + mmu_notifier_invalidate_range_end(&range);
> + return ret;
> +}
> +
--
Cheers,
David