Re: [PATCH v2] mm/rmap: fix incorrect pte restoration for lazyfree folios

From: Dev Jain

Date: Mon Mar 02 2026 - 04:08:38 EST




On 02/03/26 2:14 pm, David Hildenbrand (Arm) wrote:
> On 2/28/26 19:34, Barry Song wrote:
>> On Sun, Mar 1, 2026 at 3:06 AM Dev Jain <dev.jain@xxxxxxx> wrote:
>>>
>>> We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch.
>>> If the batch has a mix of writable and non-writable bits, we may end up
>>> setting the entire batch writable. Fix this by respecting writable bit
>>> during batching.
>>> Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
>>> lost, preserve it on pte restoration by respecting the bit during batching,
>>> to make the fix consistent w.r.t both writable bit and soft-dirty bit.
>>>
>>> I was able to write the below reproducer and crash the kernel.
>>> Explanation of reproducer (set 64K mTHP to always):
>>>
>>> Fault in a 64K large folio. Split the VMA at mid-point with MADV_DONTFORK.
>>> fork() - parent points to the folio with 8 writable ptes and 8 non-writable
>>> ptes. Merge the VMAs with MADV_DOFORK so that folio_unmap_pte_batch() can
>>> determine all the 16 ptes as a batch. Do MADV_FREE on the range to mark
>>> the folio as lazyfree. Write to the memory to dirty the pte, eventually
>>> rmap will dirty the folio. Then trigger reclaim, we will hit the pte
>>> restoration path, and the kernel will crash with the following trace:
>>>
>>> [ 21.134473] kernel BUG at mm/page_table_check.c:118!
>>> [ 21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
>>> [ 21.135917] Modules linked in:
>>> [ 21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
>>> [ 21.136858] Hardware name: linux,dummy-virt (DT)
>>> [ 21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
>>> [ 21.137308] pc : page_table_check_set+0x28c/0x2a8
>>> [ 21.137607] lr : page_table_check_set+0x134/0x2a8
>>> [ 21.137885] sp : ffff80008a3b3340
>>> [ 21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
>>> [ 21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
>>> [ 21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
>>> [ 21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
>>> [ 21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
>>> [ 21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
>>> [ 21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
>>> [ 21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
>>> [ 21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
>>> [ 21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
>>> [ 21.141991] Call trace:
>>> [ 21.142093] page_table_check_set+0x28c/0x2a8 (P)
>>> [ 21.142265] __page_table_check_ptes_set+0x144/0x1e8
>>> [ 21.142441] __set_ptes_anysz.constprop.0+0x160/0x1a8
>>> [ 21.142766] contpte_set_ptes+0xe8/0x140
>>> [ 21.142907] try_to_unmap_one+0x10c4/0x10d0
>>> [ 21.143177] rmap_walk_anon+0x100/0x250
>>> [ 21.143315] try_to_unmap+0xa0/0xc8
>>> [ 21.143441] shrink_folio_list+0x59c/0x18a8
>>> [ 21.143759] shrink_lruvec+0x664/0xbf0
>>> [ 21.144043] shrink_node+0x218/0x878
>>> [ 21.144285] __node_reclaim.constprop.0+0x98/0x338
>>> [ 21.144763] user_proactive_reclaim+0x2a4/0x340
>>> [ 21.145056] reclaim_store+0x3c/0x60
>>> [ 21.145216] dev_attr_store+0x20/0x40
>>> [ 21.145585] sysfs_kf_write+0x84/0xa8
>>> [ 21.145835] kernfs_fop_write_iter+0x130/0x1c8
>>> [ 21.145994] vfs_write+0x2b8/0x368
>>> [ 21.146119] ksys_write+0x70/0x110
>>> [ 21.146240] __arm64_sys_write+0x24/0x38
>>> [ 21.146380] invoke_syscall+0x50/0x120
>>> [ 21.146513] el0_svc_common.constprop.0+0x48/0xf8
>>> [ 21.146679] do_el0_svc+0x28/0x40
>>> [ 21.146798] el0_svc+0x34/0x110
>>> [ 21.146926] el0t_64_sync_handler+0xa0/0xe8
>>> [ 21.147074] el0t_64_sync+0x198/0x1a0
>>> [ 21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
>>> [ 21.147440] ---[ end trace 0000000000000000 ]---
>>>
>>>
>>> #define _GNU_SOURCE
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <stdlib.h>
>>> #include <sys/mman.h>
>>> #include <string.h>
>>> #include <sys/wait.h>
>>> #include <sched.h>
>>> #include <fcntl.h>
>>>
>>> void write_to_reclaim() {
>>> const char *path = "/sys/devices/system/node/node0/reclaim";
>>> const char *value = "409600000000";
>>> int fd = open(path, O_WRONLY);
>>> if (fd == -1) {
>>> perror("open");
>>> exit(EXIT_FAILURE);
>>> }
>>>
>>> if (write(fd, value, sizeof("409600000000") - 1) == -1) {
>>> perror("write");
>>> close(fd);
>>> exit(EXIT_FAILURE);
>>> }
>>>
>>> printf("Successfully wrote %s to %s\n", value, path);
>>> close(fd);
>>> }
>>>
>>> int main()
>>> {
>>> char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
>>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>> if ((unsigned long)ptr != (1UL << 30)) {
>>> perror("mmap");
>>> return 1;
>>> }
>>>
>>> /* a 64K folio gets faulted in */
>>> memset(ptr, 0, 1UL << 16);
>>>
>>> /* 32K half will not be shared into child */
>>> if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
>>> perror("madvise madv dontfork");
>>> return 1;
>>> }
>>>
>>> pid_t pid = fork();
>>>
>>> if (pid < 0) {
>>> perror("fork");
>>> return 1;
>>> } else if (pid == 0) {
>>> sleep(15);
>>> } else {
>>> /* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
>>> if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
>>> perror("madvise madv fork");
>>> return 1;
>>> }
>>> if (madvise(ptr, (1UL << 16), MADV_FREE)) {
>>> perror("madvise madv free");
>>> return 1;
>>> }
>>>
>>> /* dirty the large folio */
>>> (*ptr) += 10;
>>>
>>> write_to_reclaim();
>>> // sleep(10);
>>> waitpid(pid, NULL, 0);
>>>
>>> }
>>> }
>>>
>>> Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
>>> Cc: stable <stable@xxxxxxxxxx>
>>> Signed-off-by: Dev Jain <dev.jain@xxxxxxx>
>>> ---
>>> v1->v2:
>>> - Just respect the writable bit instead of hacking in a pte_wrprotect() in
>>> failure path
>>> - Also handle soft-dirty bit
>>>
>>> Based on mm-unstable (df9c51269a5e).
>>>
>>> mm/rmap.c | 12 +++++++++++-
>>> 1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index bff8f222004e4..fb64829913052 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1955,7 +1955,17 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>>> if (userfaultfd_wp(vma))
>>> return 1;
>>>
>>> - return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>>> + if (!folio_test_anon(folio))
>>> + return folio_pte_batch(folio, pvmw->pte, pte, max_nr);
>>> +
>>> + /*
>>> + * For anon folios, if unmap fails, we need to restore the ptes.
>>> + * To avoid accidentally upgrading write permissions for ptes that
>>> + * were not originally writable, and to avoid losing the soft-dirty
>>> + * bit, use the appropriate FPB flags.
>>> + */
>>> + return folio_pte_batch_flags(folio, vma, pvmw->pte, &pte, max_nr,
>>> + FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
>>
>> Do we really need to differentiate between file and anon?
>> I’d rather just return unconditionally by removing the
>> if (!folio_test_anon(folio)) check above.
>>
>> If we do want to keep two branches, why not use a flag variant instead?
>
> I suspect Dev's code might generate better code, as the compiler might
> not want to provide two variants of folio_pte_batch_flags() where it
> propagates all constants; and even if it does, we'd end up with two
> essentially identical functions in the kernel binary.

Interesting.

>
> So if we want to special-case anon folios, I think we should use Dev's
> variant.
>
> But I also wonder whether we just want to keep it simple for now and
> just unconditionally check FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY.
>
> I'd vote for simplicity at this point.

Fair enough. I'll do that.

>