Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)

From: Gregory Price

Date: Fri Jun 12 2026 - 11:29:53 EST

On Wed, Jun 10, 2026 at 04:12:52PM -0400, Gregory Price wrote:
> On Wed, Jun 10, 2026 at 08:59:59PM +0200, David Hildenbrand (Arm) wrote:
> > >
> > > I understand this question in two ways:
> > >
> > > 1) Can we disallow PAGE allocation and limit this to FOLIO allocation
> >
> > Yes. Can we only allow folios to be allocated from private memory nodes. So let
> > me reply to that one below.
> >
> ... snip ...
> >
> > At LSF/MM we talked about how GFP flags are bad and how deriving stuff from the
> > context might be better. I think there was also talk about how the memalloc_*
> > interface might be a better way forward. Maybe we would start giving the
> > allocator more context ("we are allocating a folio").
> >
> > The following is incomplete (esp. hugetlb stuff I assume), just as some idea:
> >
>
> I will still probably send the next RFC version tomorrow or friday,
> as I want to get some eyes on the __GFP_PRIVATE-less pattern.
>
> Also, I made a new `anondax` driver which enables userland testing
> of this functionality without any specialty hardware.
>

(apologies for the length of this email: this will all be covered in
the coming cover letter, but I just wanted to share a bit of a preview)

===

Just another small update - I am planning to post the RFC today once I
get some mild cleanup done. It will be based on the dax atomic hotplug
work:

https://lore.kernel.org/linux-mm/20260605211911.2160954-1-gourry@xxxxxxxxxx/

But here are a couple of specific details about the memalloc pieces that
I've learned over the past couple of days of playing with them.

1) memalloc_folio is required to ensure non-folio allocations don't land
on the private node, even if it happens within a memalloc_private
context. Since memalloc_folio may be useful in contexts outside of
private nodes, I kept this as a separate flag.

If we think there will *never* be additional users of memalloc_folio,
then we could fold _folio into _private to save the flag for now and
add it back when we actually need it.

2) memalloc_private is needed to unlock private nodes, but in the
original NOFALLBACK-only design, you also needed __GFP_THISNODE.

This is *highly* restrictive. I found when playing with mbind that
MPOL_BIND + __GFP_THISNODE generates a WARN (valid WARN, it normally
implies a bug).

That leads me to #3

3) If a private node is opted into something like Demotion (the node is
a demotion target) or mbind(), such that normal kernel operation can
place memory there - it's *pseudo-private*, and should actually land
in its own FALLBACK list (reachable without __GFP_THISNODE, but not
reachable as a normal fallback allocation target).

I'm still playing with this, but I think we can even omit the
__GFP_THISNODE requirement (my initial feeling that __GFP_THISNODE
didn't buy us anything in particular seems to have panned out).
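
To make points 1-3 concrete, here is a rough userspace model of the
gating rule, not the actual patch. It assumes a private node is only
reachable when both the folio and private scopes are active, and that
private nodes sit on a separate fallback list. All names (SCOPE_*,
node_model, fallback_lists, pick_node) are hypothetical, and the sketch
makes no claim about what happens once the private list is exhausted
(it simply returns -1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scope bits; the names follow the email, the values are made up. */
#define SCOPE_FOLIO   (1u << 0)   /* caller is allocating a folio */
#define SCOPE_PRIVATE (1u << 1)   /* caller may use private nodes */
#define SCOPE_UNLOCK  (SCOPE_FOLIO | SCOPE_PRIVATE)

struct node_model {
	int  id;
	bool is_private;   /* private (or pseudo-private) node */
	bool has_free;     /* stand-in for "an allocation would succeed here" */
};

/* Two separate fallback lists: private nodes never appear on the normal one. */
struct fallback_lists {
	const struct node_model *normal;
	size_t n_normal;
	const struct node_model *priv;
	size_t n_priv;
};

/* Return the id of the first node on the list with free memory, or -1. */
static int first_free(const struct node_model *list, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (list[i].has_free)
			return list[i].id;
	return -1;
}

/*
 * Pick a node for an allocation made under the given scope bits.
 * Only a folio allocation inside a private scope walks the private list;
 * everything else walks the normal list and can never see a private node.
 */
static int pick_node(const struct fallback_lists *fl, unsigned int scope)
{
	/* Invariant of this model: the normal list holds no private nodes. */
	for (size_t i = 0; i < fl->n_normal; i++)
		assert(!fl->normal[i].is_private);

	if ((scope & SCOPE_UNLOCK) == SCOPE_UNLOCK)
		return first_free(fl->priv, fl->n_priv);
	return first_free(fl->normal, fl->n_normal);
}
```

Note that no THISNODE-style flag appears anywhere in the model: the
choice of list alone keeps ordinary allocations off the private node.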

At the end of the day, this makes the whole memalloc_private_save()
pattern a heck of a lot cleaner than trying to fiddle with GFP flags.
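
For readers unfamiliar with the scope pattern, here is a minimal
userspace sketch of how a save/restore pair of this kind behaves. It is
modeled on the shape of the existing memalloc_nofs_save() /
memalloc_nofs_restore() helpers; the real memalloc_private_save() may
differ. The model_* names, the flag value, and the thread-local
variable standing in for current->flags are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit; the real value and name may differ. */
#define PF_MODEL_MEMALLOC_PRIVATE (1u << 0)

/* Stand-in for current->flags in this userspace model. */
static _Thread_local unsigned int task_flags;

/* Enter a private scope and return the previous state of the bit. */
static unsigned int model_memalloc_private_save(void)
{
	unsigned int old = task_flags & PF_MODEL_MEMALLOC_PRIVATE;

	task_flags |= PF_MODEL_MEMALLOC_PRIVATE;
	return old;
}

/* Leave the scope by putting the bit back to its saved state. */
static void model_memalloc_private_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_MODEL_MEMALLOC_PRIVATE) | old;
}

/* What an allocator would consult instead of a GFP flag. */
static bool model_private_allowed(void)
{
	return task_flags & PF_MODEL_MEMALLOC_PRIVATE;
}
```

Because the old state is saved and restored rather than cleared, scopes
nest: leaving an inner scope does not close an outer one, and callers in
between never have to thread an extra flag through their call chain.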

I think you will all enjoy how clean the code ends up, and how easily
testable it is.

As a testbed I've implemented an anondax driver (we can discuss naming)
that adds some sample NODE_PRIVATE_OPT_* flags so you can do the following.

I'm including this in the next RFC - but we can hack the entire thing
off (including the OPT flags) if we prefer to just get the base set in
without a new driver as a start.

echo 1 > dax0.0/reclaim # kswapd and reclaim run normally on this node
echo 1 > dax0.0/demotion # it is a demotion target
echo 1 > dax0.0/mbind # mbind() can target this node for anon-vma's
echo 1 > dax0.0/madvise # allow madvise() to operate on its folios
echo 1 > dax0.0/numa_balance # allow numa balancing for this node
echo 1 > dax0.0/ltpin # allow GUP longterm pin to operate normally
echo * > dax0.0/adistance # set the adistance for hotplug time
echo * > dax0.0/hotplug # same as kmem/hotplug
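
One way to picture the boolean knobs above is as a per-node bitmask of
opt-ins. The sketch below is a userspace model only: the
NODE_PRIVATE_OPT_* suffixes, the values, and the helpers are guesses
made for illustration, not the flags from the patch set. adistance and
hotplug take values rather than 0/1, so they are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical opt-in bits, one per boolean attribute listed above. */
enum {
	NODE_PRIVATE_OPT_RECLAIM      = 1u << 0,
	NODE_PRIVATE_OPT_DEMOTION     = 1u << 1,
	NODE_PRIVATE_OPT_MBIND        = 1u << 2,
	NODE_PRIVATE_OPT_MADVISE      = 1u << 3,
	NODE_PRIVATE_OPT_NUMA_BALANCE = 1u << 4,
	NODE_PRIVATE_OPT_LTPIN        = 1u << 5,
};

struct opt_name {
	const char  *attr;   /* attribute file name, e.g. "demotion" */
	unsigned int bit;
};

static const struct opt_name opt_names[] = {
	{ "reclaim",      NODE_PRIVATE_OPT_RECLAIM },
	{ "demotion",     NODE_PRIVATE_OPT_DEMOTION },
	{ "mbind",        NODE_PRIVATE_OPT_MBIND },
	{ "madvise",      NODE_PRIVATE_OPT_MADVISE },
	{ "numa_balance", NODE_PRIVATE_OPT_NUMA_BALANCE },
	{ "ltpin",        NODE_PRIVATE_OPT_LTPIN },
};

/* Model of writing 1 or 0 to an attribute; false if the name is unknown. */
static bool node_opt_set(unsigned int *opts, const char *attr, bool on)
{
	for (size_t i = 0; i < sizeof(opt_names) / sizeof(opt_names[0]); i++) {
		if (strcmp(opt_names[i].attr, attr) != 0)
			continue;
		if (on)
			*opts |= opt_names[i].bit;
		else
			*opts &= ~opt_names[i].bit;
		return true;
	}
	return false;
}

/* A kernel path would check its own bit before touching the node. */
static bool node_opt_enabled(unsigned int opts, unsigned int bit)
{
	return (opts & bit) != 0;
}
```

With every bit clear the node is fully private; each write of 1 opts it
into exactly one piece of normal kernel behaviour.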

This also means *existing hardware* can leverage private nodes if
they're capable of generating a dax device.

I've even gotten it such that you can put a private node above dram in
the adistance hierarchy - which means demotion flows downward from
device to CPU, but allocations don't default or fallback there.
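
As a simplified illustration of the ordering, the sketch below picks a
demotion target as the node with the smallest adistance strictly greater
than the source's, assuming a smaller adistance means a higher tier.
Real demotion target selection is more involved; the struct, the helper,
and the adistance numbers used with it are made up for the example.

```c
#include <limits.h>
#include <stddef.h>

struct tier_node {
	int id;
	int adistance;   /* smaller value = higher tier in this model */
};

/*
 * Return the id of the next node down the hierarchy from nodes[src]:
 * the one with the smallest adistance strictly greater than the
 * source's. Returns -1 if the source is already at the bottom.
 */
static int demotion_target(const struct tier_node *nodes, size_t len,
			   size_t src)
{
	int best = -1;
	int best_dist = INT_MAX;

	for (size_t i = 0; i < len; i++) {
		if (i == src)
			continue;
		if (nodes[i].adistance > nodes[src].adistance &&
		    nodes[i].adistance < best_dist) {
			best = nodes[i].id;
			best_dist = nodes[i].adistance;
		}
	}
	return best;
}
```

With DRAM at an illustrative adistance of 500 and a private device node
at 100, the device node demotes to DRAM, and DRAM never demotes upward
into the device node.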

This seems *immediately* useful for a variety of use cases.

~Gregory