[PATCH stable 5.10] mm: numa: preserve PMD write permissions in migrate_misplaced_transhuge_page

From: wang.yaxin

Date: Mon May 25 2026 - 00:50:02 EST


From: Chen Junlin <chen.junlin@xxxxxxxxxx>

When a process allocates a transparent huge page in its address space and
then enters a kernel driver via an ioctl system call, the driver (e.g.
ib_uverbs) calls pin_user_pages_fast to pin the process’s virtual
addresses to physical pages. When the process later accesses this pinned
memory from another NUMA node, NUMA balancing triggers a page fault and
the kernel enters do_huge_pmd_numa_page, which calls
migrate_misplaced_transhuge_page to migrate the transparent huge page.
Because memory within the huge page has been pinned by
pin_user_pages_fast, numamigrate_isolate_page returns 0 and
migrate_misplaced_transhuge_page takes the out_fail path, where
pmd_modify rewrites the PMD page table entry as write-protected.
If the process then forks, copy_huge_pmd is invoked. Because the memory
is pinned, __split_huge_pmd is called to split the PMD page table entry
into PTE page table entries, and these PTEs are write-protected as well.
Finally, when the process writes to this memory region, a copy-on-write
(COW) operation takes place and allocates a new physical page. This
breaks the binding between the process’s virtual address and the pinned
physical memory.
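
The sequence above can be replayed with a toy userspace model. This is
only a sketch of the reasoning, not kernel code: struct toy_mapping and
every toy_* helper are hypothetical names invented for illustration, and
the model assumes a private mapping whose vma->vm_page_prot does not
include write permission.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these types or helpers exist in the kernel. */
struct toy_mapping {
	bool writable;	/* write bit of the PMD (or of the PTEs after a split) */
	bool pinned;	/* page was pinned through pin_user_pages_fast() */
	bool split;	/* PMD has been split into PTEs */
	int frame;	/* physical frame currently backing the address */
};

/*
 * NUMA hinting fault whose migration fails because the page is pinned.
 * The unpatched out_fail path rebuilds the entry from the VMA protection,
 * which in this model carries no write bit.
 */
static void toy_numa_fault_migration_fails(struct toy_mapping *m)
{
	m->writable = false;
}

/* fork(): a pinned huge page is split; the PTEs keep the PMD's permission. */
static void toy_fork(struct toy_mapping *m)
{
	if (m->pinned)
		m->split = true;
}

/*
 * Write access. A write-protected entry takes the COW path and the address
 * ends up backed by new_frame. Returns true if a COW copy happened.
 */
static bool toy_write(struct toy_mapping *m, int new_frame)
{
	if (m->writable)
		return false;
	m->frame = new_frame;
	m->writable = true;
	return true;
}

/* Replays the reported sequence and returns the frame seen afterwards. */
static int toy_replay(void)
{
	struct toy_mapping m = { true, true, false, 100 };

	toy_numa_fault_migration_fails(&m);
	toy_fork(&m);
	assert(m.split);
	toy_write(&m, 200);
	return m.frame;	/* no longer the pinned frame 100 */
}
```

In this model the driver still holds frame 100 while the process now
reads and writes frame 200, which is the broken binding described above.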

Here is my userspace test code. /dev/test_gup is provided by a simple
kernel module that just calls pin_user_pages_fast on the virtual address
passed to its ioctl, so I do not include the module's code.

The test code runs on an x86 QEMU-KVM virtual machine with a specification
of 64 cores and 2 NUMA nodes (0-31:node0, 32-63:node1).
The kernel is 5.10.256 and the NUMA balancing parameters are:
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
/sys/kernel/mm/transparent_hugepage/enabled is always
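
Before running the reproducer it may help to confirm the THP setting
listed above. The following is a minimal sketch, assuming the usual sysfs
format in which the active mode is shown in square brackets (for example
"[always] madvise never"). Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the bracketed (active) mode from a line such
 * as "[always] madvise never" into out. Returns 0 on success, -1 if no
 * bracketed word is found or out is too small.
 */
static int thp_active_mode(const char *line, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	assert(line && out);
	open = strchr(line, '[');
	close = open ? strchr(open, ']') : NULL;
	if (!open || !close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read the first line of path and extract the active mode from it. */
static int thp_read_active_mode(const char *path, char *out, size_t outlen)
{
	char line[128];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = thp_active_mode(line, out, outlen);
	fclose(f);
	return ret;
}
```

Calling thp_read_active_mode() on
/sys/kernel/mm/transparent_hugepage/enabled should yield "always" on a
machine configured as described here.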

===
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <errno.h>
#include <sched.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <signal.h>
#include <linux/ioctl.h>
#include <linux/types.h>

#define HUGE_PAGE_SIZE (2 * 1024 * 1024)
#define ALIGNMENT HUGE_PAGE_SIZE
#define MEMORY_SIZE (HUGE_PAGE_SIZE * 16)
#define TEST_GUP_IOC_MAGIC 'G'
#define TEST_GUP_IOCTL_PIN_PAGES \
	_IOWR(TEST_GUP_IOC_MAGIC, 1, struct test_gup_request)
#define TEST_GUP_IOCTL_UNPIN_PAGES \
	_IOW(TEST_GUP_IOC_MAGIC, 2, struct test_gup_request)

struct test_gup_request {
	__u64 user_addr;
	__u64 size;
	__u64 page_count;
	__u32 flags;
};

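// Touch one byte per page: write a pattern if 'write' is set, else read
void touch_memory(void *addr, size_t length, int write)
{
	volatile char *ptr = (volatile char *)addr;
	char tmp = 0;
	size_t page_size = getpagesize();
	for (size_t i = 0; i < length; i += page_size) {
		if (write)
			ptr[i] = (char)(i % 256);
		else
			tmp = ptr[i];
	}
	(void)tmp; // the read itself is the point, the value is unused
}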

void init_sigchld(void)
{
	struct sigaction sa;
	sa.sa_handler = SIG_IGN;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	sigaction(SIGCHLD, &sa, NULL);
}

int main(void)
{
	init_sigchld();
	int fd;
	struct test_gup_request req;
	int ret;
	pid_t pid = getpid();
	int cpu = sched_getcpu();
	int cpu_af = 63;
	int i;

	void *memory = mmap(NULL, MEMORY_SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (memory == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	// fault the pages in on the current node
	touch_memory(memory, MEMORY_SIZE, 1);

	// move to a CPU on the other node (0-31: node0, 32-63: node1)
	if (cpu > 31)
		cpu_af = 0;
	cpu_set_t mask;
	CPU_ZERO(&mask);
	CPU_SET(cpu_af, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);

	// /dev/test_gup is provided by a test module
	fd = open("/dev/test_gup", O_RDWR);
	if (fd < 0) {
		perror("open /dev/test_gup");
		return 1;
	}
	memset(&req, 0, sizeof(req));
	req.user_addr = (__u64) memory;
	req.size = 4096 * 9;
	// the ioctl just calls pin_user_pages_fast in the kernel
	ret = ioctl(fd, TEST_GUP_IOCTL_PIN_PAGES, &req);
	if (ret < 0) {
		printf("IOCTL pin pages failed: %s\n", strerror(errno));
	} else {
		printf("Successfully pinned %llu pages at pid %d %llx\n",
		       (unsigned long long)req.page_count, pid,
		       (unsigned long long)req.user_addr);
	}
	getchar();
	// here you can see the original va <-> pa binding in crash by vtop

	// read from the remote node until NUMA balancing faults on the page
	i = 0;
	while (i < 100000) {
		touch_memory(memory, MEMORY_SIZE, 0);
		i++;
	}
	printf("numa balance done\n");
	getchar(); // pmd was write-protected

	pid_t t_pid = fork();
	if (t_pid == 0)
		_exit(0);
	sleep(1);
	printf("fork done\n");
	getchar(); // pte was write-protected

	memset((void *)req.user_addr, 9, 1);
	printf("write pinned mem done\n");
	getchar(); // cow was done, the binding of va <-> pa was broken

	return 0;
}
===

Commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
NUMA hinting fault") added write permission recovery in
do_huge_pmd_numa_page, but did not add the same recovery in
migrate_misplaced_transhuge_page. Later, commit d042035eaf5f ("mm/thp:
Split huge pmds/puds if they're pinned when fork()") enforced that
transparent huge pages with pinned memory must have their PMD page
tables split into PTE page tables in copy_huge_pmd. After that, this
issue started to appear.

So, the simplest way to fix this issue is to also perform the
corresponding write permission recovery in the out_fail code path of
migrate_misplaced_transhuge_page.
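
The difference between the current and the proposed out_fail handling
can be illustrated with a toy flag model. This is a sketch under stated
assumptions, not the kernel implementation: the TOY_* bits and toy_*
functions are hypothetical, the saved-write state is modelled as a
separate flag although its real encoding is architecture-specific, and
the VMA protection is assumed to carry no write bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy PMD entry; not real PMD bits. */
#define TOY_WRITE	(1u << 0)
#define TOY_SAVEDWRITE	(1u << 1)
#define TOY_PROTNONE	(1u << 2)

typedef uint32_t toy_pmd_t;

/* Arm a NUMA hint: make the entry inaccessible, remember writability. */
static toy_pmd_t toy_make_numa_hint(toy_pmd_t e)
{
	toy_pmd_t out = TOY_PROTNONE;

	if (e & TOY_WRITE)
		out |= TOY_SAVEDWRITE;
	return out;
}

/* Stand-in for pmd_modify(entry, vma->vm_page_prot): present, read-only. */
static toy_pmd_t toy_modify_to_vma_prot(toy_pmd_t e)
{
	(void)e;
	return 0;
}

/* out_fail without the patch: the saved write state is discarded. */
static toy_pmd_t toy_out_fail_unpatched(toy_pmd_t e)
{
	return toy_modify_to_vma_prot(e);
}

/* out_fail with the patch: sample the saved state first, then restore. */
static toy_pmd_t toy_out_fail_patched(toy_pmd_t e)
{
	bool was_writable = (e & TOY_SAVEDWRITE) != 0;

	e = toy_modify_to_vma_prot(e);
	if (was_writable)
		e |= TOY_WRITE;
	return e;
}

/* Runs one failed migration of a writable mapping through both paths. */
static void toy_compare_paths(void)
{
	toy_pmd_t hint = toy_make_numa_hint(TOY_WRITE);

	assert(!(toy_out_fail_unpatched(hint) & TOY_WRITE)); /* bit lost */
	assert(toy_out_fail_patched(hint) & TOY_WRITE);      /* bit kept */
}
```

The saved state has to be sampled before the entry is rebuilt, which is
why the patch reads pmd_savedwrite(entry) ahead of the pmd_modify call.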

Signed-off-by: Chen Junlin <chen.junlin@xxxxxxxxxx>
---
mm/migrate.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index bf59b09455ad..126b6ad675ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2143,6 +2143,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	struct page *new_page = NULL;
 	int page_lru = page_is_file_lru(page);
 	unsigned long start = address & HPAGE_PMD_MASK;
+	bool was_writable;
 
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
@@ -2247,7 +2248,10 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
+		was_writable = pmd_savedwrite(entry);
 		entry = pmd_modify(entry, vma->vm_page_prot);
+		if (was_writable)
+			entry = pmd_mkwrite(entry);
 		set_pmd_at(mm, start, pmd, entry);
 		update_mmu_cache_pmd(vma, address, &entry);
 	}
--
2.27.0