Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)

From: Gregory Price

Date: Thu Jun 04 2026 - 04:45:14 EST

On Thu, Jun 04, 2026 at 11:43:14AM +1000, Balbir Singh wrote:
> On Wed, Jun 03, 2026 at 08:02:09AM +0100, Gregory Price wrote:
> >
> > Here is how the page allocator fallback lists and nodemasks interact:
> >
> > Fallbacks A: A B
> > Fallbacks B: B A
> > Fallbacks C: C A B (Private)
> > Fallbacks D: D B A (Private)
> >
>
> Do we want regular memory (N_MEMORY) in the fallback list of device private nodes?
> The assumption is that we have ATS translation enabled? Assuming A and
> B are N_MEMORY here, or am I misreading your illustration?
>

If we don't have __GFP_PRIVATE, then probably not. This is a holdover
from the current __GFP_PRIVATE branch so that if the preferred_nid=
value is a private node (which is a hint, but not a hard control),
there's a way for that allocation to land *somewhere*.

__GFP_PRIVATE would say "Only allow access to private nodes if this
flag is provided - otherwise treat that as unreachable and fall back".

(__GFP_PRIVATE | __GFP_THISNODE) then does exactly what you expect (only
allocate from specifically this private node and don't fall back).

This has the added benefit of not causing OOM on allocation failure.
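
To make the semantics above concrete, here is a rough userspace sketch
of the fallback walk. It is a toy model, not kernel code: toy_alloc(),
TOY_GFP_PRIVATE, TOY_GFP_THISNODE and the free_pages[] array are made
up for illustration, and the fallback lists are the A/B/C/D example
quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum { NODE_A, NODE_B, NODE_C, NODE_D, NR_NODES, NO_NODE = -1 };

#define TOY_GFP_PRIVATE  0x1u	/* caller may use private nodes */
#define TOY_GFP_THISNODE 0x2u	/* no fallback past the preferred node */

static const bool node_is_private[NR_NODES] = { false, false, true, true };

/* Fallback lists from the example, each terminated by NO_NODE. */
static const int fallback[NR_NODES][NR_NODES + 1] = {
	[NODE_A] = { NODE_A, NODE_B, NO_NODE },
	[NODE_B] = { NODE_B, NODE_A, NO_NODE },
	[NODE_C] = { NODE_C, NODE_A, NODE_B, NO_NODE },
	[NODE_D] = { NODE_D, NODE_B, NODE_A, NO_NODE },
};

/* free_pages[] is a caller-supplied count of free pages per node. */
static int toy_alloc(int preferred, unsigned int gfp, int free_pages[NR_NODES])
{
	assert(preferred >= 0 && preferred < NR_NODES);

	for (int i = 0; fallback[preferred][i] != NO_NODE; i++) {
		int nid = fallback[preferred][i];

		/* THISNODE: only the head of the list is eligible. */
		if ((gfp & TOY_GFP_THISNODE) && nid != preferred)
			break;
		/* Without the private flag, private nodes are unreachable. */
		if (node_is_private[nid] && !(gfp & TOY_GFP_PRIVATE))
			continue;
		if (free_pages[nid] > 0) {
			free_pages[nid]--;
			return nid;
		}
	}
	return NO_NODE;	/* plain failure; the toy model has no OOM path */
}
```

In this model, a request with preferred node C and no private flag
lands on A, while the private flag combined with the this-node flag
either gets C or fails outright.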

Some would consider such a request a bug (i.e. that caller has a bad
mask), but I find the premise of that statement to be flawed if only
because we do not have good controls over what ends up in a nodemask due
to the existence of things like possible_nodes.

> > If we wanted to change this behavior, realistically we'd be looking for
> > a way to add specific nodes to certain fallback lists - rather than
> > modify the nodemask interaction in some way.
>
> Yes, that is what we did with CDM, control the fallback for
> N_MEMORY_PRIVATE, but there is a design decision to be made here.
>

Agreed, but also one which can be deferred and played with since it's
all kernel-internal. None of this should have UAPI implications, and we
need to accept that we're going to get it wrong on the first try.

> > 2) full mempolicy support doesn't really make sense
> >
> > task mempolicy PROBABLY should never really touch private nodes,
> > while VMA policy certainly can. Assuming we're able to support
> > multi-private-node masks, none of the non-bind mempolicies even
> > make sense for most private nodes (interleave? weighted interleave?)
> >
>
> Yes, mostly, but is that baked into the design? If so, why?
>

"Baked in" in this case would mean:

set_mempolicy(..., private_node) -> -EINVAL
mbind(..., private_node) -> Success

With appropriate documentation.

This can be changed later if a reasonable design is agreed upon.
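
As a sketch of what that split could look like, assuming a simple
bitmask where bit N stands for node N: toy_validate_nodemask(), the
scope enum and TOY_PRIVATE_NODES are hypothetical names for
illustration, not anything in the series.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: bit N of a mask stands for node N. */
#define TOY_PRIVATE_NODES 0x0cu	/* assume nodes 2 and 3 are private */

enum toy_policy_scope { TOY_SCOPE_TASK, TOY_SCOPE_VMA };

/* Returns 0 if the nodemask is acceptable for the scope, else -EINVAL. */
static int toy_validate_nodemask(enum toy_policy_scope scope,
				 unsigned int nodemask)
{
	assert(scope == TOY_SCOPE_TASK || scope == TOY_SCOPE_VMA);

	/* Task-wide policy (set_mempolicy-like) rejects private nodes. */
	if (scope == TOY_SCOPE_TASK && (nodemask & TOY_PRIVATE_NODES))
		return -EINVAL;

	/* VMA policy (mbind-like) is allowed to target them. */
	return 0;
}
```

Keeping the check keyed on scope means relaxing the task-policy
restriction later would be a one-line change in this model.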

> > 4) File VMA interactions don't entirely make sense with mbind
> >
> > In theory you might want:
> >
> >     fd = open("somefile", ...);
> >     mem = mmap(fd, ...);
> >     mbind(mem, ..., private_node);
> >     for page in mem:
> >         mem[page_off] /* fault file into private memory */
> >
> > In reality: This does not work the way you want.
>
> Why not? Just curious about what you found?
>

Because pagecache pages are associated with potentially many VMAs.

The fault can be a soft fault or a hard fault. On soft fault - the page
was already present, and will simply fault into VMA without being
migrated.

You can imagine the following:

Process A:
    fd = open("somefile", ...);
    mem = mmap(fd, ...);
    mbind(mem, ..., private_node_A);
    for page in mem:
        mem[page_off] /* fault file into private memory */

Process B:
    fd = open("somefile", ...);
    mem = mmap(fd, ...);
    mbind(mem, ..., private_node_B);
    for page in mem:
        mem[page_off] /* fault file into private memory */

If process A runs first, and assuming VMA mempolicy is respected for
file backed allocation (note: it's not, see below) - then the second
process will think the memory now lives on node B when it's already
living on node A (pages are not migrated on fault).

Because of the filemap page cache, file-backed pages are effectively
global resources.
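
A toy model of the two-process case, under the same hypothetical
assumption that a hard fault honors the VMA's node. struct
toy_pagecache and toy_fault() are invented for illustration; the real
page cache looks nothing like an int array.

```c
#include <assert.h>

#define TOY_NR_PAGES   4
#define TOY_NOT_CACHED (-1)

/* One entry per file offset: the node holding the cached page. */
struct toy_pagecache {
	int node_of[TOY_NR_PAGES];
};

static void toy_pagecache_init(struct toy_pagecache *pc)
{
	for (int i = 0; i < TOY_NR_PAGES; i++)
		pc->node_of[i] = TOY_NOT_CACHED;
}

/*
 * Fault page 'idx' on behalf of a VMA bound to 'vma_node'.
 * Returns the node the page lives on after the fault.
 */
static int toy_fault(struct toy_pagecache *pc, int idx, int vma_node)
{
	assert(idx >= 0 && idx < TOY_NR_PAGES);

	/* Hard fault: not cached yet, so allocate on the VMA's node. */
	if (pc->node_of[idx] == TOY_NOT_CACHED)
		pc->node_of[idx] = vma_node;

	/* Soft fault: already cached, mapped in place, never migrated. */
	return pc->node_of[idx];
}
```

If process A faults page 0 with node 2 and process B then faults the
same page with node 3, B still gets the page on node 2.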

Re file-backed VMAs - see filemap_alloc_folio_noprof in mm/filemap.c

struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
{
	int n;
	struct folio *folio;

	if (cpuset_do_page_mem_spread()) {
		unsigned int cpuset_mems_cookie;
		do {
			cpuset_mems_cookie = read_mems_allowed_begin();
			n = cpuset_mem_spread_node();
			folio = __folio_alloc_node_noprof(gfp, order, n);
		} while (!folio && read_mems_allowed_retry(cpuset_mems_cookie));

		return folio;
	}
	return folio_alloc_noprof(gfp, order);
}

We'd have to hang a mempolicy off of the file and use fcntl or something
like it if we want a file to have a node preference.

> >
> > I went digging and we need a few mild extensions to allow
> > migration on mbind to work for pagecache pages, and the fault
> > path does not always respect the VMA mempolicy.
> >
> > You also start getting into the question of "what happens when
> > the node is out of memory and you don't have reclaim support?".
>
> Yes, we should discuss reclaim support, I think we should allow for
> reclaim. It allows you to overcommit private memory the way we can
> with regular memory.
>

Reclaim support is feasible, but again - crawl, walk, run.

If we get the base private node infrastructure in place, we can break
things like mempolicy and reclaim support into different work streams
to enable support for these features.

Different private node users will be interested in different
combinations of mm/ service support.

For example: compressed memory as a swap backend DOES NOT want explicit
reclaim support - it will need to manage its own shrinker. This comes
from requirements associated with that specific use case (which I do not
want to get into here).

That is why this series introduced the concept of NP_OPS_* - so that the
owner (driver) of a private node (such as a CXL-enabled accelerator
driver) can tell mm/ what services it should enable for that node.
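
Roughly, the idea is a per-node opt-in mask. The sketch below is only a
userspace illustration: the TOY_NP_OPS_* bits, toy_node_set_ops() and
toy_node_supports() are hypothetical stand-ins and do not reflect the
actual NP_OPS_* names or values in the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical service bits, not the real NP_OPS_* definitions. */
#define TOY_NP_OPS_MEMPOLICY 0x1u
#define TOY_NP_OPS_RECLAIM   0x2u
#define TOY_NP_OPS_MIGRATION 0x4u

#define TOY_MAX_NODES 8

/* Per-node opt-in mask; 0 means no mm/ services are enabled. */
static unsigned int toy_node_ops[TOY_MAX_NODES];

/* Driver side: declare which services mm/ may run on this node. */
static void toy_node_set_ops(int nid, unsigned int ops)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);
	toy_node_ops[nid] = ops;
}

/* mm/ side: check before running a service (e.g. reclaim) on a node. */
static bool toy_node_supports(int nid, unsigned int op)
{
	assert(nid >= 0 && nid < TOY_MAX_NODES);
	return (toy_node_ops[nid] & op) == op;
}
```

A node whose owner never sets the reclaim bit is simply skipped by
reclaim in this model, which is the behavior a driver managing its own
shrinker would want.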

> >
> > For all these reasons, I think mbind/mempolicy support with
> > private nodes needs to be brought in with follow up work - not
> > introduced as part of the baseline set.
> >
>
> I am not opposed to the follow up work, but I feel mbind() should
> be the fundamental work and user space API.
>

That view is informed by a single use case / device.

There are users / devices that don't want any UAPI for their memory,
but simply wish to re-utilize some subsection of mm/ (page_alloc,
reclaim, etc).

> >
> > I am arguing for #1 - the community has argued for #2 and "fixing
> > existing nodemask users". I think we can ship #2 and pivot to #1 if we
> > find fixing existing users is infeasible or too much of a maintenance
> > burden.
>
> Again happy to discuss this, I'd like to make sure we agree on the
> design. I am wondering if there is any experimental data to choose
> between 1 and 2.
>

I am trying to test whether, lacking __GFP_PRIVATE, any normal runtime
operations still reach private nodes that were removed from the fallback
lists, via something like the possible / online nodemask.

I remember, maybe a year ago, there were per-node allocations happening
during hotplug and that's why I originally proposed __GFP_PRIVATE, but
I'm trying to re-collect that data now.

~Gregory