[PATCH 2/4] ras-tools: verify MC-safe recovery for hwpoison copy-on-write patches
From: Ruidong Tian
Date: Tue Jun 16 2026 - 21:52:55 EST
From: Ruidong Tian <ruidong.trd@xxxxxxxxxxxxxxxxx>
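Use EINJ hardware error injection together with bpftrace to verify that
each MC-safe page copy path recovers from an uncorrectable memory error
instead of panicking. Each test case targets one upstream commit. For
every case, the trace below shows the faulting instruction reported at
do_sea, the call stack that reached it, and the exception-table fixup
address where execution resumes.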
== cow_anon -> d302c2398ba2 ("mm, hwpoison: when copy-on-write hits poison, take page offline") ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffc4d88625b3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <copy_mc_user_highpage>
[do_sea ] <__wp_page_copy_user>
[do_sea ] <wp_page_copy>
[do_sea ] <do_wp_page>
[do_sea ] <handle_pte_fault>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffc4d88625b45c
== cow_hugetlb -> 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults") ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffc4d88625b3bc
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <copy_mc_user_highpage>
[do_sea ] <copy_user_large_folio>
[do_sea ] <hugetlb_wp>
[do_sea ] <hugetlb_fault>
[do_sea ] <handle_mm_fault>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffc4d88625b45c
== khugepaged_anon -> 98c76c9f1ef7 ("mm/khugepaged: recover from poisoned anonymous memory") ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffc4d88625b3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <copy_mc_user_highpage>
[do_sea ] <__collapse_huge_page_copy>
[do_sea ] <collapse_huge_page>
[do_sea ] <collapse_scan_pmd>
[do_sea ] <collapse_single_pmd>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffc4d88625b45c
== khugepaged_file -> 12904d953364 ("mm/khugepaged: recover from poisoned file-backed memory") ==
[do_sea ] Fault PC (faulting insn): <__pi_copy_mc_page> ffffa3447405b3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <collapse_file>
[do_sea ] <collapse_scan_file>
[do_sea ] <collapse_single_pmd>
[do_sea ] <madvise_collapse>
[do_sea ] <madvise_vma_behavior>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <__pi_copy_mc_page> ffffa3447405b45c
== cow_anon_pinned -> 658be46520ce ("mm: support poison recovery from copy_present_page()") ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffc4d88625b3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <copy_mc_user_highpage>
[do_sea ] <copy_present_page>
[do_sea ] <copy_present_ptes>
[do_sea ] <copy_pte_range>
[do_sea ] <copy_pmd_range>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffc4d88625b45c
== cow_private_filemap -> aa549f923f5e ("mm: support poison recovery from do_cow_fault()") ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffc4d88625b3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <copy_mc_user_highpage>
[do_sea ] <do_cow_fault>
[do_sea ] <do_fault>
[do_sea ] <handle_pte_fault>
[do_sea ] <__handle_mm_fault>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffc4d88625b45c
== migrate_hugetlb -> f00b295b9b61 ("fs: hugetlbfs: support poisoned recover from hugetlbfs_migrate_folio()") ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffb45742beb3bc
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <folio_mc_copy>
[do_sea ] <migrate_huge_page_move_mapping>
[do_sea ] <hugetlbfs_migrate_folio>
[do_sea ] <move_to_new_folio>
[do_sea ] <unmap_and_move_huge_page>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffb45742beb45c
== move_pages_numa -> 060913999d7a ("mm: migrate: support poisoned recover from migrate folio") ==
[do_sea ] Fault PC (faulting insn): <__pi_copy_mc_page> ffffa3447405b3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <folio_mc_copy>
[do_sea ] <__migrate_folio.constprop.0>
[do_sea ] <migrate_folio>
[do_sea ] <move_to_new_folio>
[do_sea ] <migrate_folio_move>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <__pi_copy_mc_page> ffffa3447405b45c
== migrate_pages_numa -> 060913999d7a (same patch, different syscall entry) ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffb45742beb3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <folio_mc_copy>
[do_sea ] <__migrate_folio.constprop.0>
[do_sea ] <migrate_folio>
[do_sea ] <move_to_new_folio>
[do_sea ] <migrate_folio_move>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffb45742beb45c
== mbind_move -> 060913999d7a (same patch, different syscall entry) ==
[do_sea ] Fault PC (faulting insn): <copy_mc_page> ffffb45742beb3ec
[do_sea ] Call stack:
[do_sea ] <copy_mc_highpage>
[do_sea ] <folio_mc_copy>
[do_sea ] <__migrate_folio.constprop.0>
[do_sea ] <migrate_folio>
[do_sea ] <move_to_new_folio>
[do_sea ] <migrate_folio_move>
[extable ] Fault type: unknown(5)
[extable ] Recovery addr (fixup): <copy_mc_page> ffffb45742beb45c
Signed-off-by: Ruidong Tian <tianruidong@xxxxxxxxxxxxxxxxx>
---
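The walk order that next_test() documents (tests[], then mm_tests[], then wrap back to tests[]) can be sketched in isolation. This is only a minimal sketch: `struct entry`, `base_tab`, `extra_tab`, and `next_entry` are hypothetical stand-ins, and the sketch finds the current entry by equality search rather than by the pointer range comparison used in einj_mem_uc.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct test; only the name matters here. */
struct entry {
	const char *name;
};

/* First table ends with a NULL-name sentinel, like tests[]. */
static struct entry base_tab[] = { { "single" }, { "double" }, { NULL } };

/* Second table is counted instead of terminated, like mm_tests[]. */
static struct entry extra_tab[] = { { "cow_anon" }, { "mbind_move" } };
static const size_t extra_count = sizeof(extra_tab) / sizeof(extra_tab[0]);

/* Return the entry after e, walking base_tab -> extra_tab -> base_tab. */
static struct entry *next_entry(struct entry *e)
{
	size_t i;

	/* Entries of the counted table are recognised by pointer equality. */
	for (i = 0; i < extra_count; i++)
		if (e == &extra_tab[i])
			return (i + 1 < extra_count) ? &extra_tab[i + 1] : base_tab;

	/* Otherwise e belongs to the sentinel-terminated table. */
	assert(e->name != NULL);	/* callers never pass the sentinel */
	e++;
	if (e->name)
		return e;
	return extra_count ? &extra_tab[0] : base_tab;
}
```

The case worth checking by hand is the last named entry of the first table: its successor should be the first entry of the counted table, not the start of the first table again.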
Makefile | 4 +-
einj_mem_uc.c | 141 ++++++++---
einj_mem_uc.h | 69 +++++
einj_mem_uc_mm.c | 637 +++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 815 insertions(+), 36 deletions(-)
create mode 100644 einj_mem_uc.h
create mode 100644 einj_mem_uc_mm.c
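The alignment arithmetic used by the two collapse allocators (map extra space, then round the raw address up to the next 2MB boundary) can be checked on its own. A minimal sketch, assuming hypothetical helper names `align_up`, `align_down`, and `window_fits` that are not part of ras-tools:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_COLLAPSE_SIZE ((uintptr_t)2 * 1024 * 1024)

/* Round addr up to the next multiple of align (align is a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Round addr down to a multiple of align, as done for page base addresses. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Check that an aligned window of `size` bytes starting at
 * align_up(raw, size) lies inside a mapping of map_len bytes at raw.
 */
static int window_fits(uintptr_t raw, uintptr_t map_len, uintptr_t size)
{
	uintptr_t base = align_up(raw, size);

	return base >= raw && base + size <= raw + map_len;
}
```

Because align_up() moves the address forward by less than one alignment unit, a mapping of twice the collapse size always contains one fully aligned window, which is why the allocators map more than COLLAPSE_SIZE bytes.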
diff --git a/Makefile b/Makefile
index 59c281c..2b5b78d 100644
--- a/Makefile
+++ b/Makefile
@@ -34,8 +34,8 @@ rep_ce_page: rep_ce_page.o proc_pagemap.o einj.o
hornet: hornet.o einj.o
$(CC) -o hornet $(CFLAGS) hornet.o einj.o
-einj_mem_uc: einj_mem_uc.o proc_cpuinfo.o proc_interrupt.o proc_pagemap.o do_memcpy.o einj.o
- $(CC) -o einj_mem_uc $(CFLAGS) einj_mem_uc.o proc_cpuinfo.o proc_interrupt.o proc_pagemap.o do_memcpy.o einj.o -pthread
+einj_mem_uc: einj_mem_uc.o einj_mem_uc_mm.o proc_cpuinfo.o proc_interrupt.o proc_pagemap.o do_memcpy.o einj.o
+ $(CC) -o einj_mem_uc $(CFLAGS) einj_mem_uc.o einj_mem_uc_mm.o proc_cpuinfo.o proc_interrupt.o proc_pagemap.o do_memcpy.o einj.o -pthread
lmce: proc_pagemap.o lmce.o
$(CC) -o lmce $(CFLAGS) proc_pagemap.o lmce.o -pthread
diff --git a/einj_mem_uc.c b/einj_mem_uc.c
index 86986b1..a6471af 100644
--- a/einj_mem_uc.c
+++ b/einj_mem_uc.c
@@ -23,11 +23,44 @@
#include <sys/wait.h>
#include <sys/mount.h>
#include "einj.h"
+#include "einj_mem_uc.h"
#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
+/* Cleanup registration used by auxiliary trigger files. */
+#define MAX_CLEANUPS 32
+static void (*cleanup_fns[MAX_CLEANUPS])(void);
+static int cleanup_count;
+
+static void run_cleanups(void)
+{
+ int i;
+
+ for (i = cleanup_count - 1; i >= 0; i--)
+ if (cleanup_fns[i])
+ cleanup_fns[i]();
+}
+
+void register_cleanup(void (*fn)(void))
+{
+ static int atexit_done;
+
+ if (!atexit_done) {
+ atexit(run_cleanups);
+ atexit_done = 1;
+ }
+ if (cleanup_count < MAX_CLEANUPS)
+ cleanup_fns[cleanup_count++] = fn;
+}
+
+void skip_test(const char *msg)
+{
+ printf("[SKIP] %s\n", msg ? msg : "unsupported");
+ exit(77);
+}
+
char *progname;
long pagesize;
int Sflag;
@@ -43,7 +76,7 @@ static int force_flag;
static int cmci_skip_flag;
static int all_flag;
static int *apicmap;
-static int child_process;
+int child_process;
#define CACHE_LINE_SIZE 64
#define DOUBLE_INJECT_OFFSET (pagesize / 4)
@@ -405,19 +438,19 @@ static void *data_alloc_common(int flag)
return p + pagesize / 4;
}
-static void *data_alloc(void)
+void *data_alloc(void)
{
return data_alloc_common(MAP_SHARED|MAP_ANON);
}
-static void *data_alloc_private(void)
+void *data_alloc_private(void)
{
return data_alloc_common(MAP_PRIVATE|MAP_ANON);
}
static FILE *pcfile;
-static void *map_file_alloc(void)
+void *map_file_alloc(void)
{
char c, *p;
int i;
@@ -849,22 +882,7 @@ int trigger_futex(char *addr)
}
/* attributes of the test and which events will follow our trigger */
-#define F_MCE 1
-#define F_CMCI 2
-#define F_SIGBUS 4
-#define F_FATAL 8
-#define F_EITHER 16
-#define F_LONGWAIT 32
-
-struct test {
- char *testname;
- char *testhelp;
- void *(*alloc)(void);
- void (*inject)(unsigned long long, void *, int);
- int notrigger;
- int (*trigger)(char *);
- int flags;
-} tests[] = {
+struct test tests[] = {
{
"single", "Single read in pipeline to target address, generates SRAR machine check",
data_alloc, inject_mem_uc, 1, trigger_single, F_MCE|F_CMCI|F_SIGBUS,
@@ -1032,36 +1050,91 @@ struct test {
{ NULL }
};
-static void show_help(void)
+static void show_test_array(struct test *arr, int count, const char *banner)
{
- struct test *t;
+ int i;
+
+ if (!count && (!arr || !arr->testname))
+ return;
+ if (banner)
+ printf(" --- %s ---\n", banner);
+ if (count) {
+ for (i = 0; i < count; i++)
+ printf(" %-24s %-5s %s\n", arr[i].testname,
+ (arr[i].flags & F_FATAL) ? "YES" : "no",
+ arr[i].testhelp);
+ } else {
+ struct test *t;
+
+ for (t = arr; t->testname; t++)
+ printf(" %-24s %-5s %s\n", t->testname,
+ (t->flags & F_FATAL) ? "YES" : "no",
+ t->testhelp);
+ }
+}
+static void show_help(void)
+{
printf("Usage: %s [-a][-c count][-d delay][-f][-i][j][k] [-m runup:size:align][testname]\n", progname);
- printf(" %-8s %-5s %s\n", "Testname", "Fatal", "Description");
- for (t = tests; t->testname; t++)
- printf(" %-8s %-5s %s\n", t->testname,
- (t->flags & F_FATAL) ? "YES" : "no",
- t->testhelp);
+ printf(" %-24s %-5s %s\n", "Testname", "Fatal", "Description");
+ show_test_array(tests, 0, NULL);
+ show_test_array(mm_tests, mm_tests_count, "MM subsystem (hwpoison recovery)");
exit(0);
}
+static struct test *lookup_in(struct test *arr, int count, const char *s)
+{
+ int i;
+
+ if (count) {
+ for (i = 0; i < count; i++)
+ if (strcmp(s, arr[i].testname) == 0)
+ return &arr[i];
+ } else {
+ struct test *t;
+
+ for (t = arr; t->testname; t++)
+ if (strcmp(s, t->testname) == 0)
+ return t;
+ }
+ return NULL;
+}
+
static struct test *lookup_test(char *s)
{
struct test *t;
- for (t = tests; t->testname; t++)
- if (strcmp(s, t->testname) == 0)
- return t;
+ t = lookup_in(tests, 0, s);
+ if (t)
+ return t;
+ t = lookup_in(mm_tests, mm_tests_count, s);
+ if (t)
+ return t;
fprintf(stderr, "%s: unknown test '%s'\n", progname, s);
exit(1);
}
static struct test *next_test(struct test *t)
{
- t++;
- if (t->testname == NULL)
- t = tests;
- return t;
+ /*
+ * Walk: tests[] (NULL-terminated) -> mm_tests[] (counted) ->
+ * wrap to tests[]. Any t outside mm_tests[] is an entry of tests[].
+ */
+ if (!(t >= mm_tests && t < mm_tests + mm_tests_count)) {
+ t++;
+ if (t->testname)
+ return t;
+ if (mm_tests_count)
+ return &mm_tests[0];
+ return tests;
+ }
+ if (mm_tests_count && t >= mm_tests && t < mm_tests + mm_tests_count) {
+ t++;
+ if (t < mm_tests + mm_tests_count)
+ return t;
+ return tests;
+ }
+ return tests;
}
static jmp_buf env;
diff --git a/einj_mem_uc.h b/einj_mem_uc.h
new file mode 100644
index 0000000..c2d6c32
--- /dev/null
+++ b/einj_mem_uc.h
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * einj_mem_uc.h - shared interface between einj_mem_uc.c (main driver)
+ * and the per-subsystem trigger files (einj_mem_uc_mm.c,
+ * einj_mem_uc_uaccess.c, ...).
+ *
+ * Anything declared here must remain ABI-stable across the auxiliary
+ * trigger files; if you need a private helper, keep it static inside
+ * the file that uses it.
+ */
+
+#ifndef EINJ_MEM_UC_H
+#define EINJ_MEM_UC_H
+
+#include "einj.h"
+
+/* Test flags (consumed by the main loop). */
+#define F_MCE 1
+#define F_CMCI 2
+#define F_SIGBUS 4
+#define F_FATAL 8
+#define F_EITHER 16
+#define F_LONGWAIT 32
+
+/*
+ * Description of one fault-injection test. Auxiliary files publish
+ * their own array of these; einj_mem_uc.c walks all of them.
+ */
+struct test {
+ char *testname;
+ char *testhelp;
+ void *(*alloc)(void);
+ void (*inject)(unsigned long long, void *, int);
+ int notrigger;
+ int (*trigger)(char *);
+ int flags;
+};
+
+/* Page allocators shared with auxiliary trigger files. */
+void *data_alloc(void);
+void *data_alloc_private(void);
+void *map_file_alloc(void);
+
+/* hugetlb page size discovered from /proc/meminfo (Hugepagesize:). */
+int get_huge_pagesize(void);
+
+/* Find the mount point of hugetlbfs from /proc/mounts. */
+int hugetlbfs_root(char *dir);
+/*
+ * Set to 1 by trigger functions running in a fork()ed child so the
+ * main loop can break out instead of running the post-trigger
+ * verification.
+ */
+extern int child_process;
+
+/*
+ * Helpers used by auxiliary trigger files. register_cleanup() lets a
+ * trigger queue a cleanup callback that the main driver invokes on
+ * exit; skip_test() bails out of a trigger that cannot run on the
+ * current kernel (e.g. missing CONFIG option).
+ */
+void register_cleanup(void (*fn)(void));
+void skip_test(const char *msg) __attribute__((noreturn));
+
+/* Per-subsystem test arrays (defined in their respective .c files). */
+extern struct test mm_tests[];
+extern int mm_tests_count;
+
+#endif /* EINJ_MEM_UC_H */
diff --git a/einj_mem_uc_mm.c b/einj_mem_uc_mm.c
new file mode 100644
index 0000000..3e23593
--- /dev/null
+++ b/einj_mem_uc_mm.c
@@ -0,0 +1,637 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * einj_mem_uc_mm.c - tests that exercise the kernel's MC-safe page
+ * copy primitives: copy_mc_user_highpage (COW / khugepaged / ksm) and
+ * copy_mc_highpage (page migration, file-backed collapse).
+ *
+ * Each test allocates a page in a way that lets the kernel reach one of
+ * those primitives, and its trigger drives the operation synchronously
+ * whenever possible (MADV_COLLAPSE, move_pages, migrate_pages, mbind MOVE).
+ */
+
+#define _GNU_SOURCE 1
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <signal.h>
+#include <dirent.h>
+#include <time.h>
+
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+
+#include <sys/uio.h>
+
+#include <linux/mempolicy.h>
+#include <linux/userfaultfd.h>
+#include <poll.h>
+#include <pthread.h>
+
+#include "einj.h"
+#include "einj_mem_uc.h"
+
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+#ifndef MADV_PAGEOUT
+#define MADV_PAGEOUT 21
+#endif
+#ifndef MADV_COLD
+#define MADV_COLD 20
+#endif
+#ifndef MPOL_MF_MOVE
+#define MPOL_MF_MOVE (1 << 1)
+#endif
+#ifndef MAP_HUGETLB
+#define MAP_HUGETLB 0x40000
+#endif
+#ifndef IORING_REGISTER_BUFFERS
+#define IORING_REGISTER_BUFFERS 0
+#endif
+#ifndef IORING_UNREGISTER_BUFFERS
+#define IORING_UNREGISTER_BUFFERS 1
+#endif
+
+/* ---------- shared helpers ---------- */
+
+extern int get_huge_pagesize(void); /* defined in einj_mem_uc.c */
+
+static int have_path(const char *path)
+{
+ return access(path, F_OK) == 0;
+}
+
+static int read_file_line(const char *path, char *buf, size_t len)
+{
+ int fd, n;
+
+ if (len == 0)
+ return -1;
+ fd = open(path, O_RDONLY);
+ if (fd < 0)
+ return -1;
+ n = read(fd, buf, len - 1);
+ close(fd);
+ if (n < 0)
+ return -1;
+ buf[n] = '\0';
+ if (n > 0 && buf[n - 1] == '\n')
+ buf[n - 1] = '\0';
+ return n;
+}
+
+static int write_file_str(const char *path, const char *val)
+{
+ int fd;
+ ssize_t n;
+
+ fd = open(path, O_WRONLY);
+ if (fd < 0)
+ return -1;
+ n = write(fd, val, strlen(val));
+ close(fd);
+ return (n < 0) ? -1 : 0;
+}
+
+static int read_int_file(const char *path)
+{
+ char buf[32];
+
+ if (read_file_line(path, buf, sizeof(buf)) < 0)
+ return -1;
+ return atoi(buf);
+}
+
+/* Return the first NUMA node id found as a nodeN entry under
+ * /sys/devices/system/node that differs from `cur`, or -1 if none exists.
+ */
+static int pick_other_numa_node(int cur)
+{
+ DIR *d = opendir("/sys/devices/system/node");
+ struct dirent *de;
+ int other = -1, n;
+
+ if (!d)
+ return -1;
+ while ((de = readdir(d))) {
+ if (sscanf(de->d_name, "node%d", &n) == 1 && n != cur) {
+ other = n;
+ break;
+ }
+ }
+ closedir(d);
+ return other;
+}
+
+static int numa_query_node(void *addr)
+{
+ int status = -1;
+ void *page = (void *)((unsigned long)addr & ~(pagesize - 1));
+
+ if (syscall(SYS_move_pages, 0, 1UL, &page, NULL, &status, 0) < 0)
+ return -1;
+ return status;
+}
+
+/* ---------- allocators used by this module ---------- */
+
+static void *mm_anon_private_alloc(void)
+{
+ char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANON, -1, 0);
+ int i;
+
+ if (p == MAP_FAILED)
+ skip_test("mmap MAP_PRIVATE|ANON failed");
+ for (i = 0; i < pagesize; i++)
+ p[i] = (char)(i ^ 0x5a);
+ return p + pagesize / 4;
+}
+
+static void *mm_anon_shared_alloc(void)
+{
+ char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANON, -1, 0);
+ int i;
+
+ if (p == MAP_FAILED)
+ skip_test("mmap MAP_SHARED|ANON failed");
+ for (i = 0; i < pagesize; i++)
+ p[i] = (char)(i ^ 0x33);
+ return p + pagesize / 4;
+}
+
+/* hugetlb MAP_PRIVATE | ANON page for COW-on-hugetlb test */
+static void *mm_hugetlb_priv_alloc(void)
+{
+ int hps;
+ char *p;
+ int i;
+
+ hps = get_huge_pagesize();
+ if (hps <= 0)
+ skip_test("no hugetlb page size");
+ p = mmap(NULL, hps, PROT_READ | PROT_WRITE,
+ MAP_HUGETLB | MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (p == MAP_FAILED)
+ skip_test("hugetlb private mmap failed (reserve hugepages?)");
+ for (i = 0; i < hps; i++)
+ p[i] = (char)(i ^ 0x77);
+ return p + hps / 4;
+}
+
+/* hugetlb file-backed (hugetlbfs) page for migration test.
+ * This ensures the migration goes through hugetlbfs_migrate_folio()
+ * -> migrate_huge_page_move_mapping() -> folio_mc_copy(), rather than
+ * the anonymous path (__migrate_folio -> folio_mc_copy).
+ */
+static int g_hugetlb_file_fd = -1;
+
+static void hugetlb_file_cleanup(void)
+{
+ if (g_hugetlb_file_fd >= 0) {
+ close(g_hugetlb_file_fd);
+ g_hugetlb_file_fd = -1;
+ }
+}
+
+static void *mm_hugetlb_file_alloc(void)
+{
+ char dir[256];
+ char path[256];
+ int hps;
+ char *p;
+ int i;
+
+ hps = get_huge_pagesize();
+ if (hps <= 0)
+ skip_test("no hugetlb page size");
+ if (!hugetlbfs_root(dir))
+ skip_test("hugetlbfs not mounted");
+ snprintf(path, sizeof(path), "%s/einj-migrate-XXXXXX", dir);
+ g_hugetlb_file_fd = mkstemp(path);
+ if (g_hugetlb_file_fd < 0)
+ skip_test("cannot create hugetlbfs temp file");
+ unlink(path);
+ register_cleanup(hugetlb_file_cleanup);
+
+ p = mmap(NULL, hps, PROT_READ | PROT_WRITE, MAP_SHARED,
+ g_hugetlb_file_fd, 0);
+ if (p == MAP_FAILED)
+ skip_test("hugetlbfs file mmap failed");
+ for (i = 0; i < hps; i++)
+ p[i] = (char)(i ^ 0x77);
+ return p + hps / 4;
+}
+
+/* MAP_PRIVATE on a seeded tmpfile -> first write triggers do_cow_fault */
+static int g_cow_filemap_fd = -1;
+static void cow_filemap_cleanup(void)
+{
+ if (g_cow_filemap_fd >= 0) {
+ close(g_cow_filemap_fd);
+ g_cow_filemap_fd = -1;
+ }
+}
+
+static void *mm_priv_filemap_alloc(void)
+{
+ char path[] = "/tmp/einj-cowfile-XXXXXX";
+ char *p;
+ char seed[4096];
+ int i;
+
+ g_cow_filemap_fd = mkstemp(path);
+ if (g_cow_filemap_fd < 0)
+ skip_test("mkstemp failed");
+ unlink(path);
+ register_cleanup(cow_filemap_cleanup);
+
+ memset(seed, 0x11, sizeof(seed));
+ for (i = 0; i < pagesize; i += sizeof(seed))
+ (void)write(g_cow_filemap_fd, seed, sizeof(seed));
+ /* Read-only PROT so first write faults via do_cow_fault path. */
+ p = mmap(NULL, pagesize, PROT_READ,
+ MAP_PRIVATE, g_cow_filemap_fd, 0);
+ if (p == MAP_FAILED)
+ skip_test("MAP_PRIVATE mmap of tmpfile failed");
+ /* Touch to fault-in read-side so poisoning can succeed */
+ (void)((volatile char *)p)[0];
+ return p + pagesize / 4;
+}
+
+/* 2MB-aligned anonymous region; inject into one of the sub-pages, then
+ * use MADV_COLLAPSE to run the khugepaged collapse code synchronously, so
+ * the sub-pages are copied via copy_mc_user_highpage.
+ */
+#define COLLAPSE_SIZE (2UL * 1024 * 1024)
+static char *g_collapse_base;
+
+static void *mm_collapse_anon_alloc(void)
+{
+ size_t len = COLLAPSE_SIZE * 2;
+ char *raw, *base;
+ size_t i;
+
+ /* Skip when the kernel exposes no transparent hugepage support. */
+ if (!have_path("/sys/kernel/mm/transparent_hugepage/enabled"))
+ skip_test("THP not configured");
+
+ raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (raw == MAP_FAILED)
+ skip_test("anon 2MB mmap failed");
+ base = (char *)(((unsigned long)raw + COLLAPSE_SIZE - 1) &
+ ~(COLLAPSE_SIZE - 1));
+ for (i = 0; i < COLLAPSE_SIZE; i++)
+ base[i] = (char)(i & 0xff);
+ g_collapse_base = base;
+ /* Return address in the middle sub-page so MADV_COLLAPSE has to pull
+ * it through copy_mc_user_highpage.
+ */
+ return base + COLLAPSE_SIZE / 2 + pagesize / 4;
+}
+
+static int g_collapse_file_fd = -1;
+static char *g_collapse_file_base;
+
+static void collapse_file_cleanup(void)
+{
+ if (g_collapse_file_fd >= 0) {
+ close(g_collapse_file_fd);
+ g_collapse_file_fd = -1;
+ }
+}
+
+static void *mm_collapse_file_alloc(void)
+{
+ char path[] = "/tmp/einj-collfile-XXXXXX";
+ size_t len = COLLAPSE_SIZE * 2;
+ char *raw, *base, *hint;
+ size_t i;
+
+ g_collapse_file_fd = mkstemp(path);
+ if (g_collapse_file_fd < 0)
+ skip_test("mkstemp failed");
+ unlink(path);
+ register_cleanup(collapse_file_cleanup);
+ if (ftruncate(g_collapse_file_fd, len) < 0)
+ skip_test("ftruncate failed");
+
+ /*
+ * MADV_COLLAPSE requires (vma->vm_start >> PAGE_SHIFT) - vm_pgoff
+ * to be PMD-aligned. Since we mmap at file offset 0, the VMA start
+ * itself must be 2MB-aligned. THP-disabled kernels won't give us
+ * an aligned address automatically, so reserve+remap at an aligned
+ * spot.
+ */
+ raw = mmap(NULL, len + COLLAPSE_SIZE, PROT_NONE,
+ MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (raw == MAP_FAILED)
+ skip_test("file collapse mmap (reservation) failed");
+ hint = (char *)(((unsigned long)raw + COLLAPSE_SIZE - 1) &
+ ~(COLLAPSE_SIZE - 1));
+ munmap(raw, len + COLLAPSE_SIZE);
+
+ base = mmap(hint, len, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_FIXED_NOREPLACE, g_collapse_file_fd, 0);
+ if (base == MAP_FAILED)
+ skip_test("file collapse mmap (aligned) failed");
+
+ for (i = 0; i < COLLAPSE_SIZE; i++)
+ base[i] = (char)((i >> 3) & 0xff);
+ g_collapse_file_base = base;
+ return base + COLLAPSE_SIZE / 2 + pagesize / 4;
+}
+
+/* ---------- triggers ---------- */
+
+/* Trigger COW on a regular anonymous page: fork() write-protects the page;
+ * the child's write faults via wp_page_copy() -> __wp_page_copy_user().
+ */
+static int trigger_cow_anon(char *addr)
+{
+ pid_t pid;
+
+ PRINT_TRIGGERING;
+ pid = fork();
+ if (pid == 0) {
+ child_process = 1;
+ *addr = 'W'; /* CoW on anon page */
+ _exit(0);
+ }
+ if (pid > 0) {
+ int st;
+ waitpid(pid, &st, 0);
+ }
+ return 0;
+}
+
+static int trigger_cow_hugetlb(char *addr)
+{
+ pid_t pid;
+
+ PRINT_TRIGGERING;
+ pid = fork();
+ if (pid == 0) {
+ child_process = 1;
+ *addr = '*'; /* CoW on hugetlb */
+ _exit(0);
+ }
+ if (pid > 0) {
+ int st;
+ waitpid(pid, &st, 0);
+ }
+ return 0;
+}
+
+static int trigger_cow_private_filemap(char *addr)
+{
+ pid_t pid;
+
+ PRINT_TRIGGERING;
+ /*
+ * The mapping is PROT_READ; a child that uses mprotect + write will
+ * trigger do_cow_fault (a fresh anon page is allocated and the file
+ * page is copied into it via copy_mc_user_highpage).
+ */
+ pid = fork();
+ if (pid == 0) {
+ child_process = 1;
+ void *page = (void *)((unsigned long)addr & ~(pagesize - 1));
+ mprotect(page, pagesize, PROT_READ | PROT_WRITE);
+ *addr = 'C';
+ _exit(0);
+ }
+ if (pid > 0) {
+ int st;
+ waitpid(pid, &st, 0);
+ }
+ return 0;
+}
+
+static int trigger_madv_collapse_anon(char *addr)
+{
+ PRINT_TRIGGERING;
+ if (madvise(g_collapse_base, COLLAPSE_SIZE, MADV_COLLAPSE) < 0)
+ perror("MADV_COLLAPSE");
+ (void)addr;
+ return 0;
+}
+
+static int trigger_madv_collapse_file(char *addr)
+{
+ PRINT_TRIGGERING;
+ if (madvise(g_collapse_file_base, COLLAPSE_SIZE, MADV_COLLAPSE) < 0)
+ perror("MADV_COLLAPSE (file)");
+ (void)addr;
+ return 0;
+}
+
+static int trigger_move_pages_numa(char *addr)
+{
+ int cur, other, status = 0;
+ void *page = (void *)((unsigned long)addr & ~(pagesize - 1));
+
+ PRINT_TRIGGERING;
+ cur = numa_query_node(addr);
+ if (cur < 0)
+ skip_test("move_pages(status) failed");
+ other = pick_other_numa_node(cur);
+ if (other < 0)
+ skip_test("single NUMA node, cannot migrate");
+ if (syscall(SYS_move_pages, 0, 1UL, &page, &other, &status,
+ MPOL_MF_MOVE) < 0)
+ perror("move_pages");
+ return 0;
+}
+
+/* Migrate a hugetlb page across NUMA nodes; exercises
+ * hugetlbfs_migrate_folio() -> copy_mc_highpage.
+ */
+static int trigger_move_pages_hugetlb(char *addr)
+{
+ int hps = get_huge_pagesize();
+ int cur, other, status = 0;
+ void *page = (void *)((unsigned long)addr & ~((unsigned long)hps - 1));
+
+ PRINT_TRIGGERING;
+ /* Query current node using the base of the huge page */
+ if (syscall(SYS_move_pages, 0, 1UL, &page, NULL, &status, 0) < 0)
+ skip_test("move_pages(query hugetlb) failed");
+ cur = status;
+ if (cur < 0)
+ skip_test("move_pages(status) returned error for hugetlb");
+ other = pick_other_numa_node(cur);
+ if (other < 0)
+ skip_test("single NUMA node, cannot migrate hugetlb");
+ if (syscall(SYS_move_pages, 0, 1UL, &page, &other, &status,
+ MPOL_MF_MOVE) < 0)
+ perror("move_pages(hugetlb)");
+ return 0;
+}
+
+static int trigger_migrate_pages_numa(char *addr)
+{
+ int cur, other;
+ unsigned long maxnode = 64;
+ unsigned long from[1] = {0}, to[1] = {0};
+
+ PRINT_TRIGGERING;
+ cur = numa_query_node(addr);
+ if (cur < 0)
+ skip_test("move_pages query failed");
+ other = pick_other_numa_node(cur);
+ if (other < 0)
+ skip_test("single NUMA node");
+ from[0] = 1UL << cur;
+ to[0] = 1UL << other;
+ if (syscall(SYS_migrate_pages, 0, maxnode, from, to) < 0)
+ perror("migrate_pages");
+ return 0;
+}
+
+static int trigger_mbind_move(char *addr)
+{
+ int cur, other;
+ unsigned long nodemask = 0;
+ void *page = (void *)((unsigned long)addr & ~(pagesize - 1));
+
+ PRINT_TRIGGERING;
+ cur = numa_query_node(addr);
+ if (cur < 0)
+ skip_test("move_pages query failed");
+ other = pick_other_numa_node(cur);
+ if (other < 0)
+ skip_test("single NUMA node");
+ nodemask = 1UL << other;
+ if (syscall(SYS_mbind, page, (unsigned long)pagesize, MPOL_BIND,
+ &nodemask, 64UL, MPOL_MF_MOVE) < 0)
+ perror("mbind(MPOL_MF_MOVE)");
+ return 0;
+}
+
+
+/* ---------- DMA-pinned page COW (copy_present_page) ---------- */
+
+static int g_uring_fd = -1;
+
+static void uring_pin_cleanup(void)
+{
+ if (g_uring_fd >= 0) {
+ syscall(SYS_io_uring_register, g_uring_fd,
+ IORING_UNREGISTER_BUFFERS, NULL, 0);
+ close(g_uring_fd);
+ g_uring_fd = -1;
+ }
+}
+
+/*
+ * trigger_cow_anon_pinned - exercise copy_present_page() during fork.
+ *
+ * Patch 658be46520ce ("mm: support poison recovery from copy_present_page()")
+ * made copy_present_page() MC-safe by using copy_mc_user_highpage().
+ *
+ * copy_present_page() is only called when the page is DMA-pinned
+ * (folio_needs_cow_for_dma() == true), which forces the kernel to
+ * physically copy the page during fork() instead of setting up COW.
+ *
+ * We pin the page via io_uring IORING_REGISTER_BUFFERS (internally uses
+ * pin_user_pages / FOLL_PIN), then fork(). The fork path sees the page
+ * is pinned and calls copy_present_page() to copy the (poisoned) source.
+ */
+static int trigger_cow_anon_pinned(char *addr)
+{
+ struct { __u32 d[40]; } params; /* opaque io_uring_params */
+ struct iovec iov;
+ void *page;
+ pid_t pid;
+ int ret;
+
+ PRINT_TRIGGERING;
+ memset(&params, 0, sizeof(params));
+ page = (void *)((unsigned long)addr & ~(pagesize - 1));
+
+ /* Create a minimal io_uring instance */
+ g_uring_fd = syscall(SYS_io_uring_setup, 8, &params);
+ if (g_uring_fd < 0)
+ skip_test("io_uring_setup not available");
+ register_cleanup(uring_pin_cleanup);
+
+ /* Pin the page via FOLL_PIN (internally pin_user_pages) */
+ iov.iov_base = page;
+ iov.iov_len = pagesize;
+ ret = syscall(SYS_io_uring_register, g_uring_fd,
+ IORING_REGISTER_BUFFERS, &iov, 1);
+ if (ret < 0)
+ skip_test("IORING_REGISTER_BUFFERS failed");
+
+ /*
+ * Page is now DMA-pinned. fork() will detect this via
+ * folio_needs_cow_for_dma() -> folio_maybe_dma_pinned() and
+ * call copy_present_page() to physically copy the page for
+ * the child, instead of deferring via COW.
+ *
+ * With the poisoned source page, copy_mc_user_highpage() in
+ * copy_present_page() will return -EHWPOISON. The kernel
+ * should NOT panic; fork() may return an error.
+ */
+ pid = fork();
+ if (pid < 0) {
+ /* fork() failed due to -EHWPOISON from copy_present_page.
+ * This is the expected recovery path: no panic.
+ */
+ return 0;
+ }
+ if (pid == 0) {
+ child_process = 1;
+ _exit(0);
+ }
+ if (pid > 0) {
+ int st;
+ waitpid(pid, &st, 0);
+ }
+ return 0;
+}
+
+
+/* ---------- test table ---------- */
+
+#define MM(name, help, alloc_fn, trig_fn, fl) \
+ { name, help, alloc_fn, inject_mem_uc, 1, trig_fn, fl }
+
+struct test mm_tests[] = {
+ MM("cow_anon", "Fork + write on MAP_PRIVATE anon page -> wp_page_copy",
+ mm_anon_private_alloc, trigger_cow_anon, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+ MM("cow_anon_pinned", "Fork with DMA-pinned page -> copy_present_page",
+ mm_anon_private_alloc, trigger_cow_anon_pinned, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+ MM("cow_hugetlb", "Fork + write on MAP_PRIVATE hugetlb page -> CoW",
+ mm_hugetlb_priv_alloc, trigger_cow_hugetlb, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+ MM("cow_private_filemap", "First write on MAP_PRIVATE file mapping -> do_cow_fault",
+ mm_priv_filemap_alloc, trigger_cow_private_filemap, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+
+ MM("khugepaged_anon", "MADV_COLLAPSE on 2MB anon -> copy_mc_user_highpage",
+ mm_collapse_anon_alloc, trigger_madv_collapse_anon, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+ MM("khugepaged_file", "MADV_COLLAPSE on 2MB file mapping -> copy_mc_highpage",
+ mm_collapse_file_alloc, trigger_madv_collapse_file, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+
+ MM("move_pages_numa", "move_pages(2) cross-node -> migrate_folio/copy_mc_highpage",
+ mm_anon_private_alloc, trigger_move_pages_numa, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+ MM("migrate_pages_numa", "migrate_pages(2) cross-node -> copy_mc_highpage",
+ mm_anon_private_alloc, trigger_migrate_pages_numa, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+ MM("mbind_move", "mbind(2) MPOL_MF_MOVE -> copy_mc_highpage",
+ mm_anon_private_alloc, trigger_mbind_move, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+ MM("migrate_hugetlb", "move_pages(2) hugetlb file cross-node -> hugetlbfs_migrate_folio",
+ mm_hugetlb_file_alloc, trigger_move_pages_hugetlb, F_MCE|F_CMCI|F_SIGBUS|F_FATAL),
+
+
+};
+
+int mm_tests_count = (int)(sizeof(mm_tests) / sizeof(mm_tests[0]));
--
2.39.3