Re: [PATCH mm-new v2 1/1] mm/khugepaged: abort collapse scan on non-swap entries

From: Lance Yang

Date: Tue Oct 07 2025 - 06:25:26 EST




On 2025/10/6 22:18, David Hildenbrand wrote:
On 05.10.25 04:12, Lance Yang wrote:


On 2025/10/5 09:05, Wei Yang wrote:
On Wed, Oct 01, 2025 at 06:05:57PM +0800, Lance Yang wrote:


On 2025/10/1 16:54, Wei Yang wrote:
On Wed, Oct 01, 2025 at 11:22:51AM +0800, Lance Yang wrote:
From: Lance Yang <lance.yang@xxxxxxxxx>

Currently, special non-swap entries (like migration, hwpoison, or PTE
markers) are not caught early in hpage_collapse_scan_pmd(), leading to
failures deep in the swap-in logic.

hpage_collapse_scan_pmd()
`- collapse_huge_page()
       `- __collapse_huge_page_swapin() -> fails!

As David suggested[1], this patch skips any such non-swap entries
early. If any one is found, the scan is aborted immediately with the
SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
work.

[1] https://lore.kernel.org/linux-mm/7840f68e-7580-42cb-a7c8-1ba64fd6df69@xxxxxxxxxx
[2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local

Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Signed-off-by: Lance Yang <lance.yang@xxxxxxxxx>
---
v1 -> v2:
- Skip all non-present entries except swap entries (per David) thanks!
- https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@xxxxxxxxx/

mm/khugepaged.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7ab2d1a42df3..d0957648db19 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1284,7 +1284,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
    for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
         _pte++, addr += PAGE_SIZE) {
        pte_t pteval = ptep_get(_pte);
-        if (is_swap_pte(pteval)) {

It looks like is_swap_pte() is misleading?

Hmm.. not to me, IMO. is_swap_pte() just means:

!pte_none(pte) && !pte_present(pte)


Maybe it has some reason.

I took another look into __collapse_huge_page_swapin(), which just checks
is_swap_pte() before do_swap_page().

Thanks for pointing that out.

A function that is called __collapse_huge_page_swapin() and documented to "Bring missing pages in from swap" will handle other types as well.

Unbelievably horrible.

So let's think this through so we can document it in the changelog properly.

We could have currently ended up in do_swap_page() with

(1) Migration entries. We would have waited.

-> Maybe worth it to wait, maybe not. I suspect we don't stumble into
   that frequently such that we don't care. We could always unlock this
   separately later.


(2) Device-exclusive entries. We would have converted to non-exclusive.

-> See make_device_exclusive(), we cannot tolerate PMD entries and have
   to split them through FOLL_SPLIT_PMD. As popped up during a recent
   discussion, collapsing here is actually counter-productive, because
   the next conversion will PTE-map it again. (until recently, it would
   not have worked with large folios at all IIRC).

-> Ok to not collapse.

(3) Device-private entries. We would have migrated to RAM.

-> Device-private still does not support THPs, so collapsing right now just means that the next device access would split the folio again.

-> Ok to not collapse.

(4) HWPoison entries

-> Cannot collapse

(5) Markers

-> Cannot collapse


I suggest we add that in some form to the patch description, stating that we can unlock later what we really need, and not account it towards max_swap_ptes.


We have filtered non-swap entries in hpage_collapse_scan_pmd(), but we drop
the mmap lock before isolation. It looks like we may still get a non-swap
entry there.

Thanks for pointing that out!

Yep, there is a theoretical window between dropping the mmap lock
after the initial scan and re-acquiring it for isolation.


Do you think it is reasonable to add a non_swap_entry() check before
do_swap_page()?

However, that seems unlikely in practice. IMHO, the early check in
hpage_collapse_scan_pmd() is sufficient for now, so I'd prefer to
keep it as-is :)

I think we really should add that check, as per reasoning above.

I was looking into some possible races with uffd-wp being set before we enter do_swap_page(), but I think it might be okay (although very confusing).

How about the version below?

```
Currently, special non-swap entries (like PTE markers) are not caught
early in hpage_collapse_scan_pmd(), leading to failures deep in the
swap-in logic.

A function that is called __collapse_huge_page_swapin() and documented
to "Bring missing pages in from swap" will handle other types as well.

As analyzed by David[1], we could have ended up with the following
entry types right before do_swap_page():

(1) Migration entries. We would have waited.
-> Maybe worth it to wait, maybe not. We suspect we don't stumble
into that frequently enough to care. We could always
unlock this separately later.

(2) Device-exclusive entries. We would have converted to non-exclusive.
-> See make_device_exclusive(), we cannot tolerate PMD entries and
have to split them through FOLL_SPLIT_PMD. As came up during
a recent discussion, collapsing here is actually
counter-productive, because the next conversion will PTE-map
it again.
-> Ok to not collapse.

(3) Device-private entries. We would have migrated to RAM.
-> Device-private still does not support THPs, so collapsing right
now just means that the next device access would split the
folio again.
-> Ok to not collapse.

(4) HWPoison entries
-> Cannot collapse

(5) Markers
-> Cannot collapse

First, this patch adds an early check for these non-swap entries. If
any one is found, the scan is aborted immediately with the
SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
work.

Second, as Wei pointed out[3], we may have a chance to get a non-swap
entry, since we will drop and re-acquire the mmap lock before
__collapse_huge_page_swapin(). To handle this, we also add a
non_swap_entry() check there.

Note that we can unlock support for the cases we really need later, and
these entries are not accounted towards max_swap_ptes.

[1] https://lore.kernel.org/linux-mm/09eaca7b-9988-41c7-8d6e-4802055b3f1e@xxxxxxxxxx
[2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
[3] https://lore.kernel.org/linux-mm/20251005010511.ysek2nqojebqngf3@master
```

I also think it makes sense to fold the change that adds the
non_swap_entry() check in __collapse_huge_page_swapin() into
this patch, rather than creating a new patch just for that :)

Hmmm... one thing I'm not sure about: regarding the uffd-wp
race you mentioned, is the pte_swp_uffd_wp() check needed
after non_swap_entry()? It seems like it might not be ...

```
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f4f57ba69d72..bec3e268dc76 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1020,6 +1020,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;
 
+		if (non_swap_entry(pte_to_swp_entry(vmf.orig_pte))) {
+			result = SCAN_PTE_NON_PRESENT;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
```

@David does that sound good to you?