Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism

From: David Hildenbrand (Red Hat)

Date: Thu Jan 15 2026 - 12:08:32 EST


On 1/15/26 12:57, Jonathan Cameron wrote:
On Thu, 15 Jan 2026 12:08:03 +0100
"David Hildenbrand (Red Hat)" <david@xxxxxxxxxx> wrote:

On 1/15/26 10:36, Li Zhe wrote:
On Wed, 14 Jan 2026 18:21:08 +0100, david@xxxxxxxxxx wrote:
But again, I think the main motivation here is "increase application
startup", not optimize that the zeroing happens at specific points in
time during system operation (e.g., when idle etc).

Framing this as "increase application startup" and merely shifting the
overhead to shutdown seems like gaming the problem statement to me.
The real problem is total real time spent on it while pages are
needed.

Support for background zeroing can give you more usable pages provided
the system has the CPU + RAM to do it. If it does not, you are in the
worst case in the same spot as with zeroing on free.

Let's take a look at some examples.

Say there are no free huge pages and you kill a vm + start a new one.
On top of that all CPUs are pegged as is. In this case total time is
the same for "zero on free" as it is for background zeroing.

Right. If the pages get freed to immediately get allocated again, it
doesn't really matter who does the freeing. There might be some details,
of course.

Say the system is freshly booted and you start up a vm. There are no
pre-zeroed pages available so it suffers at start time no matter what.
However, with some support for background zeroing, the machinery could
respond to demand and do it in parallel in some capacity, shortening
the real time needed.

Just like for init_on_free, I would start with zeroing these pages
during boot.

init_on_free assures that all pages in the buddy were zeroed out. Which
greatly simplifies the implementation, because there is no need to track
what was initialized and what was not.

It's a good question whether that initialization should be done in
parallel, possibly asynchronously, during boot. Reminds me a bit of
deferred page initialization during boot. But that is rather an
extension that could be added somewhat transparently on top later.

If ever required we could dynamically enable this setting for a running
system. Whoever would enable it (flips the magic toggle) would zero out
all hugetlb pages that are already in the hugetlb allocator as free, but
not initialized yet.

But again, these are extensions on top of the basic design of having all
free hugetlb folios be zeroed.

Say a little bit of real time passes and you start another vm. With
merely zeroing on free there are still no pre-zeroed pages available
so it again suffers the overhead. With background zeroing some of
that memory would already be sorted out, speeding up said startup.

The moment they end up in the hugetlb allocator as free folios they
would have to get initialized.

Now, I am sure there are downsides to this approach (e.g., how to speed
up process exit by parallelizing zeroing, if ever required?). But it
sounds a bit ... simpler, with no user space changes required. In
theory :)

I strongly agree that the init_on_free strategy effectively eliminates
the latency incurred during VM creation. However, it appears to
introduce two new issues.

First, the process that later allocates a page may not be the one that
freed it, raising the question of which process should bear the cost
of zeroing.

Right now the cost is paid by the process that allocates a page. If you
shift that to the freeing path, it's still the same process, just at a
different point in time.

Of course, there are exceptions to that: if you have a hugetlb file that
is shared by multiple processes (-> process that essentially truncates
the file). Or if someone (GUP-pin) holds a reference to a file even after
it was truncated (not common but possible).

With CoW it would be the process that last unmaps the folio. CoW with
hugetlb is fortunately something that is rare (and rather shaky :) ).


Second, put_page() can run in atomic context, making it inappropriate
to invoke clear_page() there; off-loading the zeroing to a workqueue
merely reopens the same accounting problem.

I thought about this as well. For init_on_free we always invoke it for
up to 4MiB folios during put_page() on x86-64.

See __folio_put()->__free_frozen_pages()->free_pages_prepare()

Where we call kernel_init_pages(page, 1 << order);

So surely, for 2 MiB folios (hugetlb) this is not a problem.

... but then, on arm64 with 64k base pages we have 512 MiB folios
(managed by the buddy!) where this is apparently not a problem? Or is
it and should be fixed?

So I would expect once we go up to 1 GiB, we might only reveal more
areas where we should have optimized in the first place by dropping
the reference outside the spin lock ... and these optimizations would
obviously (unless in hugetlb specific code ...) benefit init_on_free
setups as well (and page poisoning).

FWIW I'd be interested in seeing if we can do the zeroing async and allow
for hardware offloading. If it happens to be in CXL (and someone
built the fancy bits) we can ask the device to zero ranges of memory
for us. If they built the HDM-DB stuff it's coherent too (came up
in Davidlohr's LPC device-mem talk on HDM-DB + back-invalidate
support).
+CC linux-cxl and Davidlohr + a few others.

More locally this sounds like fun for DMA engines, though they are going
to rapidly eat bandwidth up and so we'll need QoS stuff in place
to stop them perturbing other workloads.

Give me a list of 1Gig pages and this stuff becomes much more efficient
than anything the CPU can do.

Right, and ideally we'd implement any such mechanisms in a way that more parts of the kernel can benefit, and not just an unloved in-memory file-system that most people just want to get rid of as soon as we can :)

--
Cheers

David