Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)

From: Gregory Price

Date: Thu Jun 04 2026 - 08:29:04 EST

On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
>
> My concern is that __GFP_PRIVATE is too wide, I wonder if we'll have a
> need to support N_MEMORY_PRIVATE may not be all homogeneous memory nodes.
> Very similar to how not all ZONE_DEVICE memory is homogenous.
>

Can you be more precise about your definition of homogeneous here?

Are you saying not all memory on a private node will be homogeneous?
While possible, I would argue that you should not do this and
should instead prefer to use multiple nodes - 1 per memory class.

Are you saying not all private nodes will be homogeneous?
I don't see the issue with this.

> >
> > Agreed, but also one which can be deferred and played with since it's
> > all kernel-internal. None of this should have UAPI implications, and we
> > need to accept that we're going to get it wrong on the first try.
> >
>
> Agreed that we might get the design wrong, until we fix it up. I feel
> that __GFP_PRIVATE should be an evolution of the design to that point.
>

Possibly. If we can't guarantee isolation without __GFP_PRIVATE, then
we probably can't merge the baseline without it.

> > Because pagecache pages are associated with potentially many VMAs.
> >
> > The fault can be a soft fault or a hard fault. On soft fault - the page
> > was already present, and will simply fault into VMA without being
> > migrated.
> >
>
> Let's split this into two:
>
> 1. unmapped page cache is never impacted by mempolicy and should not
> end up on private memory nodes
> 2. For shared pages, mempolicy would be hard, but it would need to
> be on a set of nodes backed by private memory, depending on mbind()
> policy
>
... snip ...
>
> I'd need to think more about this. For now, my basic requirement would
> be that unmapped page cache should not come from/to private nodes.
>

This does not fully describe the problem.

A file can be opened and cached as unmapped page cache, and then mapped
at a later time - at which point the mapped copy would share the filemap
page cache page.

Worse, because it's file-backed, you can have the memory faulted onto
your remote node - reclaimed - and then faulted back in via the process
accessing the file via unmapped operations (read/write), at which point
you've had a silent migration occur.

Basically consider

Process A:
    fd = open("myfile", ..., RO);
    read(fd, ...);  /* mm/filemap.c fills page cache */

Process B:
    fd = open("myfile", ...);
    mem = mmap(fd, ...);
    mbind(mem, ..., private_node);
    for page in mem:
        int tmp = mem[page];  /* fault into vma */

The result of Process A running first is Process B thinks it has faulted
the memory onto private_node, but in reality it's taking soft faults and
just getting the filemap folio mapped in.
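
To make the sequence concrete, here is a toy userspace model of it. This
is a sketch, not kernel code: every name in it (toy_folio, toy_read,
toy_fault, toy_reclaim, the NODE_* values) is made up for illustration,
and it collapses the page cache to a single folio. The only property it
models is that placement is decided when a folio is allocated, not when
an existing folio is mapped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum { NODE_NONE = -1, NODE_LOCAL = 0, NODE_PRIVATE = 1 };

struct toy_folio {
	bool cached;	/* present in the single-folio toy page cache */
	int node;	/* node the folio was allocated on */
};

/* Allocate on `node` only if nothing is cached. Returns true on a miss
 * (a new folio was allocated), false if the cached folio was reused. */
static bool toy_fill(struct toy_folio *f, int node)
{
	if (f->cached)
		return false;
	f->cached = true;
	f->node = node;
	return true;
}

/* Process A: read() goes through the page cache; there is no VMA, so no
 * mbind() policy applies and the folio lands on the local node. */
static bool toy_read(struct toy_folio *f)
{
	return toy_fill(f, NODE_LOCAL);
}

/* Process B: fault through a VMA bound to `policy_node`. The policy only
 * matters if a new folio has to be allocated. */
static bool toy_fault(struct toy_folio *f, int policy_node)
{
	return toy_fill(f, policy_node);
}

/* Reclaim drops the folio; the next access decides placement again. */
static void toy_reclaim(struct toy_folio *f)
{
	f->cached = false;
	f->node = NODE_NONE;
}

/* Walk through the scenario from the mail. */
static void pagecache_demo(void)
{
	struct toy_folio f = { .cached = false, .node = NODE_NONE };
	bool miss;

	miss = toy_read(&f);			/* A runs first */
	assert(miss && f.node == NODE_LOCAL);

	miss = toy_fault(&f, NODE_PRIVATE);	/* B: soft fault only */
	assert(!miss && f.node == NODE_LOCAL);

	toy_reclaim(&f);
	miss = toy_fault(&f, NODE_PRIVATE);	/* B refaults after reclaim */
	assert(miss && f.node == NODE_PRIVATE);

	toy_reclaim(&f);
	miss = toy_read(&f);			/* read() brings it back */
	assert(miss && f.node == NODE_LOCAL);	/* the silent migration */
}
```

In this model the mbind() target is honored only when Process B is the
one that triggers the allocation, which is the ordering dependence
described above.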

If you wanted mbind() support from the start, we would have to limit
applicability to anon memory only.

Shared anon memory is different, as there is an rbtree (struct
shared_policy) that deals with the shared mempolicy state.

>
> I am open to this, I was coming from the blueprint approach of:
> - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> what features to change or make specific to the implementation
>

N_MEMORY essentially states:
"This is normal memory, touch it however you like"

N_MEMORY_PRIVATE (_MANAGED, w/e) says
"This is NOT normal memory, there are special rules here"

So, no, let's not mimic N_MEMORY. This is a "closed by default" design,
while N_MEMORY is an "open by default" design. This design choice is
explicit to make reasoning about these nodes feasible.
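
As a rough illustration of the difference, here is a toy node-selection
model. It is a sketch under assumptions, not the patch set's
implementation: TOY_N_MEMORY, TOY_N_MEMORY_PRIVATE, TOY_GFP_PRIVATE,
toy_node_allowed and toy_pick_node are hypothetical stand-ins and do not
correspond to real kernel symbols or to how the allocator walks
zonelists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the kernel's node states or GFP flags. */
enum toy_node_state { TOY_N_MEMORY, TOY_N_MEMORY_PRIVATE };
#define TOY_GFP_PRIVATE 0x1u

struct toy_node {
	enum toy_node_state state;
	unsigned long free_pages;
};

/* Open by default: any caller may use a normal node.
 * Closed by default: a private node needs an explicit opt-in. */
static bool toy_node_allowed(const struct toy_node *n, unsigned int gfp)
{
	if (n->state == TOY_N_MEMORY_PRIVATE)
		return (gfp & TOY_GFP_PRIVATE) != 0;
	return true;
}

/* Return the index of the first node that is allowed for this request
 * and has free pages, or -1 if no node qualifies. */
static int toy_pick_node(const struct toy_node *nodes, int nr,
			 unsigned int gfp)
{
	for (int i = 0; i < nr; i++) {
		if (toy_node_allowed(&nodes[i], gfp) &&
		    nodes[i].free_pages > 0)
			return i;
	}
	return -1;
}

/* A private node with plenty of free memory is skipped unless asked for. */
static void node_policy_demo(void)
{
	const struct toy_node nodes[] = {
		{ TOY_N_MEMORY_PRIVATE, 1024 },
		{ TOY_N_MEMORY, 0 },
		{ TOY_N_MEMORY, 512 },
	};

	assert(toy_pick_node(nodes, 3, 0) == 2);
	assert(toy_pick_node(nodes, 3, TOY_GFP_PRIVATE) == 0);
}
```

The point of the model is that an ordinary request never lands on the
private node, even when every normal node is empty, so the only way onto
it is a caller that knows the rules and asks.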

> > This is informed by a single use case / device.
> >
> > There are users / devices that don't want any UAPI for their memory,
> > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > reclaim, etc).
> >
>
> But then, why do they need NUMA nodes? Do we have a list of use cases?
>

So far I have collected:

- Network accelerators carrying their own memory for message buffers
- GPUs with semi-general-purpose working memory across coherent links
- Exceptionally slow distributed memory that you do not want fallback
  allocations to (so you want to deliberately tier what lands there)
- Compressed memory (just another form of accelerator really) which
  has *special access rules* (i.e. writes need to be controlled)

In most if not all of these cases, the right abstraction to reason about
where memory *should come from* IS a NUMA node.

- the network stack can be taught to check if the target device has a
  node with memory and prefer that node over local memory

- accelerators can be given private nodes to manage memory using
  core mm/ components, without worrying that general kernel operation
  will put unrelated memory on those nodes or do things like migrate
  your pages out from under you (unless your driver/service requested
  that).

The tiering application should be somewhat obvious / trivial.

> >
> > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > operations access private nodes removed from fallback lists are reached
> > via something like the possible / online nodemask.
> >
> > I remember, maybe a year ago, there were per-node allocations happening
> > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > I'm trying to re-collect that data now.
> >
>
> Thanks, I look forward to the next set of patches. Let me know if I
> can help test what's on the list or if you want me to wait for the next
> round
>

Really I want to get the minimized set out the door so we can start
breaking this up by feature (reclaim, mempolicy, etc), because trying to
reason about it as a whole is infeasible - and I cannot be the single
arbiter of every use case (I simply do not have sufficient context).

I'm reworking it all as we speak.

~Gregory