Re: [PATCH v4 02/12] mm/rmap: Add try_to_unmap_hugetlb_one

From: Dev Jain

Date: Tue Jun 23 2026 - 04:40:51 EST

On 09/06/26 7:46 pm, David Hildenbrand (Arm) wrote:
> On 5/26/26 08:36, Dev Jain wrote:
>> Simplify try_to_unmap_one by separating out the hugetlb parts into
>> try_to_unmap_hugetlb_one.
>>
>> To understand the correctness of the refactoring, the following points
>> are noted:
>>
>> 1. try_to_unmap() is called for hugetlb folios only when they are
>> hwpoisoned.
>
> Ack
>
>>
>> 2. A hugetlb VMA cannot be mlocked.
>
> Ack
>
>>
>> 3. The pvmw API sets pvmw.pte to the base of the hugetlb folio (pvmw.pmd
>> is NULL).
>
> Ack
>
>>
>> 4. We won't ever process a softleaf entry that encodes a hugetlb folio;
>> hugetlb folios are never swapped out, migration entries will be skipped
>> (PVMW_MIGRATION not passed) and device-exclusive does not work for
>> hugetlb.
>
> Ack, we should always find present entries.
>
>>
>> 5. uffd-wp bit is lost when converting pvmw.pte to hwpoison softleaf
>> (therefore no need to call pte_install_uffd_wp_if_needed after
>> clearing pvmw.pte)
>
> Are you sure? And if so, can you elaborate why + document?

If you see the current code, we construct pteval from the subpage, losing any
bit state of the previous entry. Also it doesn't make sense anyways to keep
uffd-wp for an hwpoisoned entry.
>
>
>>
>> 6. TTU_HWPOISON is always present; for it to not be present, either folio
>> has to be in swapcache, or mapping_can_writeback() is true (see
>> unmap_poisoned_folio), none of which is true for hugetlb folios.
>
> Agreed. That's one very confusing bit in the existing code.
>
>>
>> 7. hugetlb uses separate counters from normal rss counters, therefore
>> update_highwater_rss() need not be called.
>
> Ack.
>
>>
>> While at it:
>>
>> - Change VM_BUG_* to VM_WARN_*.
>>
>> - Do not declare variables which are only used once
>>
>> - Assert that the subpage derived by the pvmw walk is exactly the head
>> page. This is because try_to_unmap() does not remember the specific
>> subpage which was hwpoisoned, and, since we cannot munmap/mremap
>> across a hugetlb folio partially, the first pte mapping the hugetlb
>> folio (in case of a contpte or contpmd mapped folio) cannot ever point
>> to an intermediate page.
>
> Might want to add a Suggested-by:, so people know directly whom to blame :)

Sure.

>
> I am very much in favor of decoupling both things cleanly, instead of hacking it
> together and make the code all big and confusing.
>
>>
>> Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
>> ---
>> mm/rmap.c | 203 ++++++++++++++++++++++++++++++++----------------------
>> 1 file changed, 121 insertions(+), 82 deletions(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 430c91c8fe2ae..06ab1158d4cd1 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1978,6 +1978,121 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>> }
>>
>> +static bool __try_to_unmap_hugetlb_one(struct folio *folio,
>> + struct vm_area_struct *vma, struct page_vma_mapped_walk *pvmw,
>> + struct mmu_notifier_range *range, enum ttu_flags flags)
>> +{
>> + unsigned long hsz = huge_page_size(hstate_vma(vma));
>
> Can be const.

Ok.

>
>> + unsigned long address = pvmw->address;
>
> Can be const. But do we even need the temporary variable?

It is being used 3 times in this function, so temporary variable is fine.

>
>> + struct mm_struct *mm = vma->vm_mm;
>> + struct page *subpage;
>
> a) subpages do not exist. It's just a folio page.
> b) subpages are irrelevant for hugetlb, just drop anything that deals with them.

Ok.

>
>> + bool ret = true;
>> + pte_t pteval;
>> +
>
> Worth adding a comment here, why we don't need a loop.
>
> /* There is only a single mapping in a VMA. */

Ok.

>
>> + if (!page_vma_mapped_walk(pvmw))
>> + return true;
>> +
>> + pteval = ptep_get(pvmw->pte);
>
> Should we now properly use huge_ptep_get()?

Sending a series to fix this soon : )

>
>> + VM_WARN_ON(!pte_present(pteval));
>> + subpage = folio_page(folio, pte_pfn(pteval) - folio_pfn(folio));
>
> No need for subpages.

Ok.

>
>> + VM_WARN_ON(folio_page(folio, 0) != subpage);
>
> VM_WARN_ON(pte_pfn(pteval) != folio_pfn(folio));
>
>
> I assume both can be VM_WARN_ON_ONCE()

Ok.

>
>> +
>> + /*
>> + * huge_pmd_unshare may unmap an entire PMD page. There is no way of
>> + * knowing exactly which PMDs may be cached for this mm, so we must
>> + * flush them all. start/end were already adjusted above to cover this
>> + * range.
>> + */
>> + flush_cache_range(vma, range->start, range->end);
>> +
>> + /*
>> + * To call huge_pmd_unshare, i_mmap_rwsem must be held in write mode.
>> + * Caller needs to explicitly do this outside rmap routines.
>> + *
>> + * We also must hold hugetlb vma_lock in write mode. Lock order dictates
>> + * acquiring vma_lock BEFORE i_mmap_rwsem. We can only try lock here and
>> + * fail if unsuccessful.
>> + */
>> + if (!folio_test_anon(folio)) {
>> + struct mmu_gather tlb;
>> +
>> + VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
>> + if (!hugetlb_vma_trylock_write(vma)) {
>> + ret = false;
>> + goto walk_done;
>> + }
>> +
>> + tlb_gather_mmu_vma(&tlb, vma);
>> + if (huge_pmd_unshare(&tlb, vma, address, pvmw->pte)) {
>> + hugetlb_vma_unlock_write(vma);
>> + huge_pmd_unshare_flush(&tlb, vma);
>> + tlb_finish_mmu(&tlb);
>> + /*
>> + * The PMD table was unmapped, consequently unmapping
>> + * the folio.
>> + */
>> + goto walk_done;
>> + }
>> + hugetlb_vma_unlock_write(vma);
>> + tlb_finish_mmu(&tlb);
>> + }
>> + pteval = huge_ptep_clear_flush(vma, address, pvmw->pte);
>> + if (pte_dirty(pteval))
>
> Should we now use huge_pte_dirty() for consistency?

Ok.

>
>> + folio_mark_dirty(folio);
>> +
>> + VM_WARN_ON(!(flags & TTU_HWPOISON));
>
> This should be checked right at the very beginning of try_to_unmap_hugetlb_one().

Ok.

>
>> + pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
>> + hugetlb_count_sub(folio_nr_pages(folio), mm);
>> + set_huge_pte_at(mm, address, pvmw->pte, pteval, hsz);
>> + hugetlb_remove_rmap(folio);
>> + folio_put_refs(folio, 1);
>> +
>> +walk_done:
>> + page_vma_mapped_walk_done(pvmw);
>> + return ret;
>> +}
>> +
>> +static bool try_to_unmap_hugetlb_one(struct folio *folio,
>> + struct vm_area_struct *vma, unsigned long address, void *arg)
>> +{
>> + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
>> + struct mmu_notifier_range range;
>> + enum ttu_flags flags = (enum ttu_flags)(long)arg;
>> + bool ret;
>> +
>> + /*
>> + * The try_to_unmap() is only passed a hugetlb folio in the case
>> + * where the hugetlb folio contains a poisoned page.
>> + */
>> + VM_WARN_ON_FOLIO(!folio_contain_hwpoisoned_page(folio), folio);
>
> As discussed, I'd prefer to just have folio_test_hwpoison() here.

Ok.

>
>> +
>> + /*
>> + * When racing against e.g. zap_pte_range() on another cpu,
>> + * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
>> + * try_to_unmap() may return before folio_mapped() has become false,
>> + * if page table locking is skipped: use TTU_SYNC to wait for that.
>> + */
>
> The comment should be updated to mention hugetlb primitives ... but
>
> ... it's a good question whether that is required *at all* ...
>
>> + if (flags & TTU_SYNC)
>> + pvmw.flags = PVMW_SYNC;
>
> ... and in fact, looking at page_vma_mapped_walk(), PVMW_SYNC is irrelevant.

You are right, I'll drop this.

>
>> +
>> + /*
>> + * For hugetlb, it could be much worse than THP if we need pud
>> + * invalidation in the case of pmd sharing.
>
> I don't understand that comment. Is it really helpful?
>
>> + *
>> + * Note that the folio can not be freed in this function as call of
>
> "cannot" ?
>
>> + * try_to_unmap() must hold a reference on the folio.
>> + */
>
> Do we really need this comment entirely? We don't even look at the folio after
> unmapping it.

Yeah I just blindly copied it, looks like historical baggage, I can't understand
that comment at all. I'll drop it entirely.

>
>> + range.end = vma_address_end(&pvmw);
>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
>> + address, range.end);
>> + adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
>> + mmu_notifier_invalidate_range_start(&range);
>> + ret = __try_to_unmap_hugetlb_one(folio, vma, &pvmw, &range,
>> + flags);
>
> Is it really worth it moving that stuff to a separate function?

I derive pleasure in a code style of:

foo():
lock/start
__foo()
unlock/end

Although, I think this is entirely pointless here since foo() has barely
anything apart from __foo(), and that __foo() has 5 function parameters.
So I'll collapse into a single function.
>
>
>> + mmu_notifier_invalidate_range_end(&range);
>> + return ret;
>> +}
>> +
>
>