Re: [RFC PATCH 00/40] mm: reliable 1GB page allocation

From: Rik van Riel

Date: Mon Jun 29 2026 - 10:52:00 EST

On Mon, 2026-06-29 at 12:03 +0200, Vlastimil Babka (SUSE) wrote:
> On 6/29/26 11:29, Lorenzo Stoakes wrote:
> >
> > So to be concrete, if you send really rough code, Use [pre-RFC] or
> > [DO NOT
> > MERGE] (on the series as a whole) to make that clear and say so in
> > the
> > cover letter VERY VERY clearly.
>
> Yes please. [POC NOT-FOR-MERGE] perhaps?
>
> > Or, you can put it in a repo somewhere and link it in an email
> > discussing
> > the concepts (like I did with scalable CoW for instance).
>
> Indeed.

I'll do that for the next version.

I suspect it will take a while to beat this thing
into shape.
>
> > And _you have already done this_ in your reply here:
> >
> > * "How do people feel about splitting up the free lists, so each
> > gigabyte
> > (well, PUD sized) chunk of memory has its own free lists?"
>
> My immediate response is that now we'd need to search multiple sets
> of lists
> instead of a single one? What about the overhead?

The current code is clearly not good enough. It
has to try several gigablocks almost blindly,
because there is no efficient way to find the
right gigablock.

I have an idea on how to fix that with bitmaps.

We could have one bitmap per order, indicating which
gigablocks have order 0 pages, order 1 pages, etc.

Then a second set of bitmaps indicating which gigablocks
have unmovable / reclaimable pages.

At that point, finding a good gigablock to allocate
from can be done with a bitmap_and and a search.
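
As a rough userspace sketch of that lookup (not code from
the series; every name, the sizes, and the placement policy
are hypothetical): one bitmap per order plus one "has
unmovable" bitmap, ANDed word by word and searched for the
first set bit. In the kernel this would presumably be
bitmap_and() followed by find_first_bit(); the loop below
does the same thing by hand.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical sizes and names; a userspace model, not kernel code. */
#define NR_GIGABLOCKS 256
#define NR_ORDERS     11
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS      ((NR_GIGABLOCKS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct gigablock_maps {
	/* has_free[order]: bit N set if gigablock N has free pages of that order */
	unsigned long has_free[NR_ORDERS][NR_WORDS];
	/* bit N set if gigablock N already holds unmovable/reclaimable pages */
	unsigned long has_unmovable[NR_WORDS];
};

static void map_set(unsigned long *map, size_t bit)
{
	assert(bit < NR_GIGABLOCKS);
	map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/*
 * Assumed policy: unmovable allocations go to gigablocks that already
 * contain unmovable pages, movable ones to gigablocks that do not.
 * Returns the lowest matching gigablock index, or -1 if none matches.
 */
static long find_gigablock(const struct gigablock_maps *m,
			   unsigned int order, int want_unmovable)
{
	assert(order < NR_ORDERS);

	for (size_t w = 0; w < NR_WORDS; w++) {
		unsigned long kind = want_unmovable ? m->has_unmovable[w]
						    : ~m->has_unmovable[w];
		/* the "bitmap_and" step, one word at a time */
		unsigned long hit = m->has_free[order][w] & kind;

		/* the "search" step: lowest set bit in this word */
		for (size_t b = 0; hit && b < BITS_PER_WORD; b++) {
			if (hit & (1UL << b))
				return (long)(w * BITS_PER_WORD + b);
		}
	}
	return -1;
}
```

A caller would set bits with map_set() as gigablocks gain
free pages, then ask find_gigablock() for a candidate before
touching any free list.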

These bitmaps would only need to be changed when the
status of a gigablock changes, e.g. going from having
order 0 pages free, to not having any order 0 pages
free.
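
To illustrate the "only on status change" part, here is a
minimal userspace sketch, assuming a per-gigablock,
per-order free count (the counter, the names, and the
bitmap_writes statistic are all hypothetical). The bitmap is
written only when a count moves between zero and nonzero;
every other free or allocation leaves it alone.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical sizes and names; a userspace model, not kernel code. */
#define NR_GIGABLOCKS 256
#define NR_ORDERS     11
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS      ((NR_GIGABLOCKS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct gigablock_state {
	unsigned long nr_free[NR_GIGABLOCKS][NR_ORDERS]; /* assumed counters */
	unsigned long has_free[NR_ORDERS][NR_WORDS];     /* one bitmap per order */
	unsigned long bitmap_writes;                     /* for illustration only */
};

/* A page of 'order' became free in gigablock 'gb'. */
static void page_freed(struct gigablock_state *s, size_t gb, unsigned int order)
{
	assert(gb < NR_GIGABLOCKS && order < NR_ORDERS);

	/* 0 -> 1 transition: the gigablock now has pages of this order */
	if (s->nr_free[gb][order]++ == 0) {
		s->has_free[order][gb / BITS_PER_WORD] |=
			1UL << (gb % BITS_PER_WORD);
		s->bitmap_writes++;
	}
}

/* A free page of 'order' was taken from gigablock 'gb'. */
static void page_allocated(struct gigablock_state *s, size_t gb,
			   unsigned int order)
{
	assert(gb < NR_GIGABLOCKS && order < NR_ORDERS);
	assert(s->nr_free[gb][order] > 0);

	/* 1 -> 0 transition: no pages of this order are left */
	if (--s->nr_free[gb][order] == 0) {
		s->has_free[order][gb / BITS_PER_WORD] &=
			~(1UL << (gb % BITS_PER_WORD));
		s->bitmap_writes++;
	}
}

static int gigablock_has_free(const struct gigablock_state *s, size_t gb,
			      unsigned int order)
{
	return (int)((s->has_free[order][gb / BITS_PER_WORD] >>
		      (gb % BITS_PER_WORD)) & 1UL);
}
```

Freeing three order 0 pages into one gigablock and then
allocating them again touches the bitmap twice in this
model, not six times.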

Does that seem like a workable approach?

Once we can quickly pinpoint a gigablock for the
page allocator to grab pages from, we can also
split out the "pick a gigablock" code from the
"allocate a page" code.

>
> > * "How can we balance the desire for higher-order kernel
> > allocations,
> > against the desire to preserve gigabyte sized chunks of memory
> > that can
> > be used for user space?"
> >
> > * "How do we balance the desire to keep compaction overhead low
> > with the
> > desire to do higher order allocations almost everywhere?"
>
> How can we have a cake and eat it too? :)

Pretty much :/

I suspect it's going to require some fun interactions
between allocation, reclaim, and compaction.

However, with everybody from networking, to filesystems,
to anonymous memory wanting to use higher order allocations
of differing sizes, it seems like we're going to have to
tackle this somehow.

>
> > I'd also very strongly suggest (as I did in my original reply)
> > breaking out
> > parts that can be broken out as prerequisite series.
> >
> > If you're doing something good or useful _anyway_ then just send
> > that
> > separately first, and have later work rely on the earlier work.
>
That becomes cleaner with the "post a link to
a tree" thing, as well.

The pcpbuddy stuff is likely to go in separately.
Johannes is still working on that code.

The "make btrfs inode cache pages movable" thing
already went in.

I think I have a few more things in the tree that
can go in separately, but hopefully that will grow
as this code solidifies.

On the flip side, things like "making compaction
scale" may well end up depending on the gigablock
stuff, because the lack of targeting data seems like
a likely reason compaction has to try so hard.

I'll make sure to go over every point raised by
you guys before writing the next version of the
code, and again before posting a link to the
tree.

--
All Rights Reversed.