Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization

From: Alistair Popple

Date: Wed May 20 2026 - 18:37:10 EST

On 2026-05-20 at 21:57 +1000, Li Zhe <lizhe.67@xxxxxxxxxxxxx> wrote...
> On Wed, 20 May 2026 09:20:18 +0300, rppt@xxxxxxxxxx wrote:
>
> > On Mon, May 18, 2026 at 04:57:00PM +0800, Li Zhe wrote:
> > > On Mon, 18 May 2026 09:23:33 +0300, rppt@xxxxxxxxxx wrote:
> > > > On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > > > >
> > > > > > Performance
> > > > > > ===========
> > > > > > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > > > > >   Base (v7.1-rc3):
> > > > > >     First binding:                 1486 ms
> > > > > >     Average of subsequent rebinds: 273.52 ms
> > > > > >   Full series:
> > > > > >     First binding:                 1272 ms
> > > > > >     Average of subsequent rebinds: 104.59 ms
> > > > > >
> > > > > > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > > > > >   Base (v7.1-rc3):
> > > > > >     First binding:                 1515 ms
> > > > > >     Average of subsequent rebinds: 313.45 ms
> > > > > >   Full series:
> > > > > >     First binding:                 1286 ms
> > > > > >     Average of subsequent rebinds: 116.93 ms
> > > >
> > > > This is a really good improvement!
> > > >
> > > > It would also be interesting to see how the template approach would
> > > > improve "normal" memory map initialization.
> > >
> > > I also experimented with this approach earlier. Unfortunately, in the
> > > normal memory map initialization path, functions such as
> > > deferred_free_pages() are invoked shortly after struct page
> > > initialization, and this function performs both read and write accesses
> > > to members of the struct page.
> > >
> > > Non-temporal stores via MOVNTI are primarily beneficial for streaming
> > > write operations, where the cache lines written are not expected to be
> > > reused by the CPU in the near future. In this case, however, data
> > > written using MOVNTI is immediately accessed again through regular load
> > > and store instructions. This results in an access pattern that resembles
> > > a write-then-reuse workload rather than a pure streaming store.
> > >
> > > Consequently, non-temporal stores do not deliver the expected reduction
> > > in cache pollution, and using MOVNTI provides no measurable performance
> > > benefit for this particular workload.
> >
> > We can split initialization and freeing into separate loops if there is
> > an overall benefit, but this needs to be verified on other major
> > architectures as well.
>
> I agree with your point.
>
> > > That said, a template-based approach can still accelerate initialization.
> > > Based on measurements from this patchset, it should improve performance
> > > on the generic path by roughly 10%. I would appreciate feedback on
> > > whether such an optimization is still considered useful.
> >
> > Improving the memory map initialization by 10% is valuable.

Agreed. GPUs have to hotplug hundreds of GB of ZONE_DEVICE memory, so any
improvement here is valuable. Thanks for looking at it.

- Alistair

> Thank you for your feedback. I will try the optimization after finishing
> the current patchset.
>
> Thanks,
> Zhe
>
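
For readers following the MOVNTI discussion quoted above, here is a minimal
userspace sketch, not the code from this series, of filling an array from a
template with regular stores versus non-temporal stores. struct fake_page,
init_cached() and init_streaming() are hypothetical names used only for
illustration. _mm_stream_si64() is the compiler intrinsic that emits MOVNTI
on x86-64; on other architectures the sketch falls back to plain stores.

```c
#include <stddef.h>
#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64() (MOVNTI), _mm_sfence() */
#endif

#define WORDS_PER_PAGE 8

/* Hypothetical 64-byte stand-in for struct page; not the kernel layout. */
struct fake_page {
	unsigned long long w[WORDS_PER_PAGE];
};

/* Regular stores: the written cache lines stay in the CPU caches. */
static void init_cached(struct fake_page *dst, size_t n,
			const struct fake_page *tmpl)
{
	for (size_t i = 0; i < n; i++)
		dst[i] = *tmpl;
}

/* Non-temporal stores: the written data is not kept in the caches. */
static void init_streaming(struct fake_page *dst, size_t n,
			   const struct fake_page *tmpl)
{
#if defined(__x86_64__)
	for (size_t i = 0; i < n; i++)
		for (size_t j = 0; j < WORDS_PER_PAGE; j++)
			_mm_stream_si64((long long *)&dst[i].w[j],
					(long long)tmpl->w[j]);
	/* Order the weakly-ordered streaming stores before later accesses. */
	_mm_sfence();
#else
	/* Assumption: no streaming-store path here, use plain stores. */
	init_cached(dst, n, tmpl);
#endif
}
```

Both functions leave the same bytes in memory; only the cache behaviour
differs. If the destination is read back right away, as the quoted text says
deferred_free_pages() does on the normal path, the streamed lines most likely
have to be fetched from memory again, which matches the write-then-reuse
pattern described there.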