Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)

From: Balbir Singh

Date: Wed Jun 10 2026 - 19:11:08 EST

On Thu, Jun 04, 2026 at 01:18:44PM +0100, Gregory Price wrote:
> On Thu, Jun 04, 2026 at 08:35:19PM +1000, Balbir Singh wrote:
> >
> > My concern is that __GFP_PRIVATE is too wide. I wonder if we'll need to
> > support N_MEMORY_PRIVATE nodes that are not all homogeneous, very similar
> > to how not all ZONE_DEVICE memory is homogeneous.
> >
>
> Can you be more precise about your definition of homogeneous here?
>
> Are you saying not all memory on a private node will be homogeneous?
> While possible, I would argue that you should not do this and
> should instead prefer to use multiple nodes - 1 per memory class.
>
> Are you saying not all private nodes will be homogeneous?
> I don't see the issue with this.

Yes, I meant that nodes might belong to different devices. These might
not want fallback allocations, for example a __GFP_PRIVATE allocation
falling back to unwanted nodes.

>
> > >
> > > Agreed, but also one which can be deferred and played with since it's
> > > all kernel-internal. None of this should have UAPI implications, and we
> > > need to accept that we're going to get it wrong on the first try.
> > >
> >
> > Agreed that we might get the design wrong, until we fix it up. I feel
> > that __GFP_PRIVATE should be an evolution of the design to that point.
> >
>
> Possibly. If we can't guarantee isolation without __GFP_PRIVATE, then
> we probably can't merge the baseline without it.
>

I'll rethink this, but I am concerned that __GFP_PRIVATE is too broad.
In fact, it breaks isolation by allowing allocation from any private
device. Again, this is a function of how fallback lists are organized.
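
To make the concern concrete, here is a toy userspace sketch (not kernel
code, and not how the patch set implements anything). It assumes that
private nodes owned by different devices can appear in the same fallback
order. struct toy_node, toy_alloc_private() and toy_alloc_private_masked()
are made-up names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model; none of these names exist in the kernel. */
#define TOY_MAX_NODES 8
#define TOY_NO_NODE   (-1)

struct toy_node {
	int is_private;   /* 1 if the node would be in N_MEMORY_PRIVATE */
	int owner;        /* hypothetical id of the owning device */
	long free_pages;  /* pages still available on this node */
};

/*
 * Walk a fallback order and return the first private node with free
 * pages. This models one global "private" flag: it cannot tell one
 * device's node from another's.
 */
int toy_alloc_private(const struct toy_node *nodes, const int *order,
		      size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int nid = order[i];

		assert(nid >= 0 && nid < TOY_MAX_NODES);
		if (nodes[nid].is_private && nodes[nid].free_pages > 0)
			return nid;
	}
	return TOY_NO_NODE;
}

/* Same walk, but restricted to a caller-supplied bitmask of nodes. */
int toy_alloc_private_masked(const struct toy_node *nodes, const int *order,
			     size_t n, unsigned int allowed)
{
	for (size_t i = 0; i < n; i++) {
		int nid = order[i];

		assert(nid >= 0 && nid < TOY_MAX_NODES);
		if (!(allowed & (1u << nid)))
			continue;
		if (nodes[nid].is_private && nodes[nid].free_pages > 0)
			return nid;
	}
	return TOY_NO_NODE;
}
```

If node 1 (device 1) is full and node 2 (device 2) is next in the order,
toy_alloc_private() hands back node 2, while the masked variant restricted
to node 1 fails rather than crossing over to the other device.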

> > > Because pagecache pages are associated with potentially many VMAs.
> > >
> > > The fault can be a soft fault or a hard fault. On soft fault - the page
> > > was already present, and will simply fault into VMA without being
> > > migrated.
> > >
> >
> > Let's split this into two:
> >
> > 1. unmapped page cache is never impacted by mempolicy and should not
> > end up on private memory nodes
> > 2. For shared pages, mempolicy would be hard, but it would need to
> > be on a set of nodes backed by private memory, depending on mbind()
> > policy
> >
> ... snip ...
> >
> > I'd need to think more about this. For now, my basic requirement would
> > be that unmapped page cache should not come from/to private nodes.
> >
>
> This does not fully describe the problem.
>
> A file can be opened and cached as unmapped page cache, and then mapped
> at a later time - at which point the mapped copy would share the filemap
> page cache page.
>
> Worse, because it's file-backed, you can have the memory faulted onto
> your remote node - reclaimed - and then faulted back in via the process
> accessing the file via unmapped operations (read/write), at which point
> you've had a silent migration occur.
>
> Basically consider
>
> Process A:
>     fd = open("myfile", ..., RO);
>     read(fd, ...);  /* mm/filemap.c fills page cache */
>
> Process B:
>     fd = open("myfile", ...);
>     mem = mmap(fd, ...);
>     mbind(mem, ..., private_node);
>     for page in mem:
>         int tmp = mem[page];  /* fault into vma */
>
> The result of Process A running first is Process B thinks it has faulted
> the memory onto private_node, but in reality it's taking soft faults and
> just getting the filemap folio mapped in.
>
> If you wanted mbind() support from the start, we would have to limit
> applicability to anon memory only.
>
> Shared anon memory is different, as there is a radix tree that deals
> with a shared mempolicy state.

Ack, need to think through this.
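
For my own notes, the sequence above reduces to a toy userspace model
like the one below. It is only a sketch of the behaviour described in the
quoted text, not kernel code. struct toy_file, toy_read() and toy_fault()
are hypothetical names, and the model assumes a page cache entry stays on
whichever node first filled it.

```c
#include <assert.h>

/* Toy userspace model; none of these names exist in the kernel. */
#define TOY_NPAGES 4
#define TOY_ABSENT (-1)

struct toy_file {
	int page_node[TOY_NPAGES];  /* node holding each cached page */
};

void toy_file_init(struct toy_file *f)
{
	for (int i = 0; i < TOY_NPAGES; i++)
		f->page_node[i] = TOY_ABSENT;
}

/* Fill the cache entry only if it is absent; return where it lives. */
static int toy_fill(struct toy_file *f, int idx, int node)
{
	assert(idx >= 0 && idx < TOY_NPAGES);
	if (f->page_node[idx] == TOY_ABSENT)
		f->page_node[idx] = node;
	return f->page_node[idx];
}

/* Unmapped read(): a missing page is filled on the reader's local node. */
int toy_read(struct toy_file *f, int idx, int local_node)
{
	return toy_fill(f, idx, local_node);
}

/*
 * Fault through a mapping bound to policy_node. If the page is already
 * cached this is a soft fault: the existing page is mapped in and the
 * policy node is never consulted.
 */
int toy_fault(struct toy_file *f, int idx, int policy_node)
{
	return toy_fill(f, idx, policy_node);
}
```

After toy_read() caches page 0 on node 0, toy_fault() on page 0 with a
policy node of 2 still reports node 0. Only a page that was not cached
yet lands on node 2.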

>
> >
> > I am open to this, I was coming from the blueprint approach of:
> > - Let's mimic N_MEMORY with N_MEMORY_PRIVATE and then pick and choose
> > what features to change or make specific to the implementation
> >
>
> N_MEMORY essentially states:
> "This is normal memory touch it however you like"
>
> N_MEMORY_PRIVATE (_MANAGED, w/e) says
> "This is NOT normal memory, there are special rules here"
>
> So, no, let's not mimic N_MEMORY. This is a "closed by default" design,
> while N_MEMORY is an "open by default" design. This design choice is
> explicit to make reasoning about these nodes feasible.
>
> > > This is informed by a single use case / device.
> > >
> > > There are users / devices that don't want any UAPI for their memory,
> > > but simply wish to re-utilize some subsection of mm/ (page_alloc,
> > > reclaim, etc).
> > >
> >
> > But then, why do they need NUMA nodes? Do we have a list of use cases?
> >
>
> So far I have collected:
>
> - Network accelerators carrying their own memory for message buffers
> - GPUs with semi-general-purpose working memory across coherent links
> - Exceptionally slow distributed memory that you do not want fallback
> allocations to (so you want to deliberately tier what lands there)
> - Compressed memory (just another form of accelerator really) which
> has *special access rules* (i.e. writes need to be controlled)
>
> In most if not all of these cases, the right abstraction to reason about
> where memory *should come from* IS a NUMA node.
>
> - the network stack can be taught to check if the target device has a
> node with memory and prefer that node over local memory
>
> - accelerators can be given private nodes to manage memory using
> core mm/ components, without worrying that general kernel operation
> will put unrelated memory on those nodes or do things like migrate
> your pages out from under you (unless your driver/service requested
> that).
>
> The tiering application should be somewhat obvious / trivial.
>
> > >
> > > I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
> > > operations reach private nodes that were removed from fallback lists,
> > > for example via something like the possible / online nodemask.
> > >
> > > I remember, maybe a year ago, there were per-node allocations happening
> > > during hotplug and that's why I originally proposed __GFP_PRIVATE, but
> > > I'm trying to re-collect that data now.
> > >
> >
> > Thanks, I look forward to the next set of patches. Let me know if I
> > can help test what's on the list or if you want me to wait for the next
> > round
> >
>
> Really I want to get the minimized set out the door so we can start
> breaking this up by feature (reclaim, mempolicy, etc), because trying to
> reason about it as a whole is infeasible - and I cannot be the single
> arbiter of every use case (I simply do not have sufficient context).
>
> I'm reworking it all as we speak.
>

Look forward to it

Balbir