Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

From: Mike Rapoport

Date: Wed May 20 2026 - 02:22:57 EST

On Mon, May 18, 2026 at 04:57:00PM +0800, Li Zhe wrote:
> On Mon, 18 May 2026 09:23:33 +0300, rppt@xxxxxxxxxx wrote:
> > On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > >
> > > Performance
> > > ===========
> > > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > >   Base (v7.1-rc3):
> > >     First binding: 1486 ms
> > >     Average of subsequent rebinds: 273.52 ms
> > >   Full series:
> > >     First binding: 1272 ms
> > >     Average of subsequent rebinds: 104.59 ms
> > >
> > > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > >   Base (v7.1-rc3):
> > >     First binding: 1515 ms
> > >     Average of subsequent rebinds: 313.45 ms
> > >   Full series:
> > >     First binding: 1286 ms
> > >     Average of subsequent rebinds: 116.93 ms
> >
> > This is a really good improvement!
> >
> > It would also be interesting to see how the template approach would improve
> > "normal" memory map initialization.
>
> I also experimented with this approach earlier. Unfortunately, in the
> normal memory map initialization path, functions such as
> deferred_free_pages() are invoked shortly after struct page
> initialization, and these functions both read and write members of
> struct page.
>
> Non-temporal stores via MOVNTI are primarily beneficial for streaming
> write operations, where the cache lines written are not expected to be
> reused by the CPU in the near future. In this case, however, data
> written using MOVNTI is immediately accessed again through regular load
> and store instructions. This results in an access pattern that resembles
> a write-then-reuse workload rather than a pure streaming store.
>
> Consequently, non-temporal stores do not deliver the expected reduction
> in cache pollution, and using MOVNTI provides no measurable performance
> benefit for this particular workload.

We can split initialization and freeing into separate loops if there is
an overall benefit, but this needs to be verified on other major
architectures as well.
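
Purely as an illustration of what such a split might look like, here is a
userspace sketch. struct fake_page, init_and_free_fused(), init_pass() and
free_pass() are hypothetical stand-ins, not kernel code, and the "freeing"
step only mimics the read-then-write access pattern described above. It
uses plain stores throughout; whether pass 1 would gain anything from
non-temporal stores is exactly what would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct page; the real layout differs. */
struct fake_page {
	unsigned long flags;
	unsigned long pfn;
	int refcount;
	int on_free_list;
};

/* Fused variant: each entry is initialized and then "freed" right away,
 * so the freeing step touches data that was written a moment ago. */
static size_t init_and_free_fused(struct fake_page *map, size_t n,
				  unsigned long start_pfn)
{
	size_t freed = 0;

	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		memset(&map[i], 0, sizeof(map[i]));
		map[i].pfn = start_pfn + i;
		map[i].refcount = 1;

		/* "Freeing" reads and rewrites the fields set above. */
		if (map[i].refcount == 1) {
			map[i].refcount = 0;
			map[i].on_free_list = 1;
			freed++;
		}
	}
	return freed;
}

/* Split variant, pass 1: write-only initialization of the whole range.
 * This is the loop where a streaming store could be tried. */
static void init_pass(struct fake_page *map, size_t n,
		      unsigned long start_pfn)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		memset(&map[i], 0, sizeof(map[i]));
		map[i].pfn = start_pfn + i;
		map[i].refcount = 1;
	}
}

/* Split variant, pass 2: freeing runs only after pass 1 has finished. */
static size_t free_pass(struct fake_page *map, size_t n)
{
	size_t freed = 0;

	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (map[i].refcount == 1) {
			map[i].refcount = 0;
			map[i].on_free_list = 1;
			freed++;
		}
	}
	return freed;
}
```

Both variants leave the array in the same final state; only the ordering of
the accesses differs, which is why the benefit has to be checked per
architecture rather than assumed.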

> That said, a template-based approach can still accelerate initialization.
> Based on measurements from this patchset, it should improve performance
> on the generic path by roughly 10%. I would appreciate feedback on
> whether such an optimization is still considered useful.

Improving the memory map initialization by 10% is valuable.
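
For readers following along, the general idea of a template-based
initialization can be sketched in userspace as below. This is an assumption
about the shape of the technique, not a copy of the patchset: struct
sketch_page, init_one(), init_range_plain() and init_range_template() are
hypothetical names, and the fields are placeholders for whatever is common
to every page in the range.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct page; the real layout differs. */
struct sketch_page {
	unsigned long flags;
	unsigned long private_data;
	int refcount;
	int mapcount;
};

/* Field-by-field initialization of a single entry. */
static void init_one(struct sketch_page *p, unsigned long flags)
{
	memset(p, 0, sizeof(*p));
	p->flags = flags;
	p->refcount = 1;
	p->mapcount = -1;
}

/* Plain variant: repeat the full field-by-field setup for every entry. */
static void init_range_plain(struct sketch_page *map, size_t n,
			     unsigned long flags)
{
	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		init_one(&map[i], flags);
}

/* Template variant: do the field-by-field setup once, then copy the
 * finished template into each entry with a single struct assignment. */
static void init_range_template(struct sketch_page *map, size_t n,
				unsigned long flags)
{
	struct sketch_page tmpl;

	assert(map != NULL || n == 0);
	init_one(&tmpl, flags);
	for (size_t i = 0; i < n; i++)
		map[i] = tmpl;
}
```

Any field that differs per page would still need to be patched after the
copy; the sketch leaves that out because the thread does not say which
fields those are.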

> Thanks,
> Zhe

--
Sincerely yours,
Mike.