[PATCH v3 8/8] mm: use memcpy_streaming() in zone-device template copies
From: Li Zhe
Date: Tue May 26 2026 - 23:41:53 EST
The template fast path still leaves the actual copy sequence up to the
compiler. Use the streaming-copy helpers introduced in the previous
patches for the ZONE_DEVICE template-copy path so common mm code can
request a write-once copy primitive without embedding arch-specific
store layout in the generic layer.
ZONE_DEVICE memmap initialization is a write-once path: each struct page
is populated once and is not expected to be reused from cache
immediately afterwards. A regular cached copy can therefore incur
write-allocate traffic and pollute the cache without much benefit.
Using memcpy_streaming() lets this path use an architecture-optimized
streaming copy where available, while still degrading to memcpy() on
architectures that do not provide a specialized implementation.
Keep pageblock-aligned PFNs on memcpy() so pageblock initialization can
immediately read back page metadata without introducing a
read-after-streaming dependency. For the remaining PFNs, use
memcpy_streaming() so the hot path can avoid write-allocate traffic
while still leaving unsupported or unsuitable cases to the fallback
implementation.
When the streaming backend uses non-temporal stores, order them before
entering memmap_init_compound(), before prep_compound_head() updates the
overlapping compound metadata, and before returning from
memmap_init_zone_device().
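
As a rough userspace illustration of the copy selection and drain placement described above (this is not the kernel implementation), the sketch below models a streaming copy with x86-64 non-temporal stores and falls back to memcpy() elsewhere. Every model_* name, the 64-byte struct model_page layout, and the pageblock size of 512 pages are assumptions made for the example; the real helpers are memcpy_streaming() and memcpy_streaming_drain() from the earlier patches in this series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define MODEL_HAVE_NT_STORES 1
#else
#define MODEL_HAVE_NT_STORES 0
#endif

/* Assumed pageblock size, for illustration only. */
#define MODEL_PAGEBLOCK_NR_PAGES 512UL

/* Hypothetical stand-in for struct page: 64 bytes, 8-byte aligned. */
struct model_page {
	unsigned long long words[8];
};

/* Userspace model of a streaming copy; not the kernel's memcpy_streaming(). */
static void model_memcpy_streaming(void *dst, const void *src, size_t len)
{
#if MODEL_HAVE_NT_STORES
	if (((uintptr_t)dst % 8) == 0 && (len % 8) == 0) {
		long long *d = dst;
		const unsigned char *s = src;
		size_t i;

		for (i = 0; i < len / 8; i++) {
			long long v;

			memcpy(&v, s + i * 8, sizeof(v));
			_mm_stream_si64(&d[i], v);	/* non-temporal 8-byte store */
		}
		return;
	}
#endif
	/* No specialized backend, or unsuitable alignment/length. */
	memcpy(dst, src, len);
}

/* Order earlier non-temporal stores before later stores. */
static void model_memcpy_streaming_drain(void)
{
#if MODEL_HAVE_NT_STORES
	_mm_sfence();
#endif
}

/* Populate 'nr' descriptors starting at 'start_pfn' from a template. */
static void model_init_range(struct model_page *map,
			     const struct model_page *tmpl,
			     unsigned long start_pfn, unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		unsigned long pfn = start_pfn + i;
		struct model_page image = *tmpl;

		image.words[0] = pfn;	/* stand-in for the PFN-dependent fields */

		/* Pageblock-aligned entries are read back soon: cached copy. */
		if (pfn % MODEL_PAGEBLOCK_NR_PAGES == 0)
			memcpy(&map[i], &image, sizeof(image));
		else
			model_memcpy_streaming(&map[i], &image, sizeof(image));
	}

	/* Drain before returning, mirroring the end of the init path. */
	model_memcpy_streaming_drain();
}
```

Calling model_init_range(map, &tmpl, 512, 1024) on an array of 1024 entries fills every entry with the template contents and its own PFN in words[0]. Only the entries for PFNs 512 and 1024 take the cached memcpy() branch.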
Keep sanitized builds on the slow path so KASAN/KMSAN retain their
instrumented stores.
Tested in a VM with a 100 GB fsdax namespace device configured with
map=dev and a 100 GB devdax namespace (align=2097152) on an Intel Ice Lake
server.
Test procedure:
Rebind the nd_pmem and dax_pmem drivers 30 times and collect the memmap
initialization time from the pr_debug() output of
memmap_init_zone_device().
Base (v7.1-rc3):
First binding for nd_pmem driver: 1486 ms
Average of subsequent rebinds: 273.52 ms
First binding for dax_pmem driver: 1515 ms
Average of subsequent rebinds: 313.45 ms
With this series:
First binding for nd_pmem driver: 1285 ms
Average of subsequent rebinds: 114.31 ms
First binding for dax_pmem driver: 1331 ms
Average of subsequent rebinds: 99.37 ms
This reduces the average rebind time by about 58.2% for nd_pmem and
68.3% for dax_pmem.
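
The percentages follow directly from the averages listed above. A minimal sketch of the arithmetic, where percent_reduction() is a hypothetical helper written only for this illustration:

```c
/* Relative reduction from base_ms to new_ms, in percent. */
static double percent_reduction(double base_ms, double new_ms)
{
	return (base_ms - new_ms) / base_ms * 100.0;
}
```

For nd_pmem this is (273.52 - 114.31) / 273.52, and for dax_pmem it is (313.45 - 99.37) / 313.45.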
Signed-off-by: Li Zhe <lizhe.67@xxxxxxxxxxxxx>
---
mm/mm_init.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 45 insertions(+), 2 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index d5ccb49a048f..1f56765b92e1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1070,11 +1070,21 @@ static void __ref zone_device_page_init_slow(struct page *page,
static inline bool zone_device_page_init_optimization_enabled(void)
{
+ /*
+ * Keep sanitized builds on the slow path so their stores stay
+ * instrumented.
+ */
+ if (IS_ENABLED(CONFIG_KASAN) || IS_ENABLED(CONFIG_KMSAN))
+ return false;
+
/*
* The template fast path copies a preinitialized struct page image.
* Skip it when the page_ref_set tracepoint is enabled.
*/
- return !page_ref_tracepoint_active(page_ref_set);
+ if (page_ref_tracepoint_active(page_ref_set))
+ return false;
+
+ return true;
}
static inline void zone_device_template_head_page_init(struct page *template,
@@ -1120,9 +1130,19 @@ static void zone_device_page_init_from_template(struct page *page,
* 'template' carries the invariant portion of a ZONE_DEVICE struct
* page. Update the PFN-dependent fields in place before copying it
* to the destination page.
+ *
+ * Pageblock-aligned pages immediately feed
+ * init_pageblock_migratetype(), which reads back page metadata via
+ * helpers like page_zone(page). Avoid a read-after-streaming
+ * dependency for these rare pages by using regular cached stores
+ * instead of non-temporal ones.
*/
zone_device_page_update_template(template, pfn);
- memcpy(page, template, sizeof(*page));
+ if (unlikely(pageblock_aligned(pfn)))
+ memcpy(page, template, sizeof(*page));
+ else
+ memcpy_streaming(page, template, sizeof(*page));
+
zone_device_page_init_pageblock(page, pfn);
}
@@ -1184,6 +1204,15 @@ static void __ref memmap_init_compound(struct page *head,
prep_compound_tail(page, head, order);
set_page_count(page, 0);
}
+
+ /*
+ * prep_compound_head() updates compound metadata in struct folio fields
+ * that alias the first tail-page descriptors. When the tail pages above
+ * were populated with non-temporal stores, order those writes before the
+ * overlapping metadata updates below.
+ */
+ if (use_template)
+ memcpy_streaming_drain();
prep_compound_head(head, order);
}
@@ -1232,10 +1261,24 @@ void __ref memmap_init_zone_device(struct zone *zone,
if (pfns_per_compound == 1)
continue;
+ /*
+ * memmap_init_compound() immediately updates compound-head
+ * metadata. If the head-page template copy above used
+ * non-temporal stores, order them before entering the
+ * compound setup path.
+ */
+ if (use_template)
+ memcpy_streaming_drain();
+
memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
compound_nr_pages(altmap, pgmap),
use_template);
}
+ /*
+ * Drain any remaining non-temporal stores before returning.
+ */
+ if (use_template)
+ memcpy_streaming_drain();
pr_debug("%s initialised %lu pages in %ums\n", __func__,
nr_pages, jiffies_to_msecs(jiffies - start));
--
2.20.1