Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Yiannis Nikolakopoulos
Date: Tue May 05 2026 - 18:21:51 EST
> On 22 Feb 2026, at 09:48, Gregory Price <gourry@xxxxxxxxxx> wrote:
>
> Topic type: MM
>
> Presenter: Gregory Price <gourry@xxxxxxxxxx>
>
> This series introduces N_MEMORY_PRIVATE, a NUMA node state for memory
> managed by the buddy allocator but excluded from normal allocations.
>
> I present it with an end-to-end Compressed RAM service (mm/cram.c)
> that would otherwise not be possible (or would be considerably more
> difficult, be device-specific, and add to the ZONE_DEVICE boondoggle).
>
>
> TL;DR
> ===
>
> N_MEMORY_PRIVATE is all about isolating NUMA nodes and then punching
> explicit holes in that isolation to do useful things we couldn't do
> before without re-implementing entire portions of mm/ in a driver.
>
>
> /* This is my memory. There are many like it, but this one is mine. */
> rc = add_private_memory_driver_managed(nid, start, size, name, flags,
> online_type, private_context);
>
> page = alloc_pages_node(nid, __GFP_PRIVATE, 0);
>
> /* Ok but I want to do something useful with it */
> static const struct node_private_ops ops = {
> .migrate_to = my_migrate_to,
> .folio_migrate = my_folio_migrate,
> .flags = NP_OPS_MIGRATION | NP_OPS_MEMPOLICY,
> };
> node_private_set_ops(nid, &ops);
>
> /* And now I can use mempolicy with my memory */
> buf = mmap(...);
> mbind(buf, len, mode, private_node, ...);
> buf[0] = 0xdeadbeef; /* Faults onto private node */
>
> /* And to be clear, no one else gets my memory */
> buf2 = malloc(4096); /* Standard allocation */
> buf2[0] = 0xdeadbeef; /* Can never land on private node */
>
> /* But I can choose to migrate it to the private node */
> move_pages(0, 1, &buf, &private_node, NULL, ...);
>
> /* And more fun things like this */
>
>
> Patchwork
> ===
> A fully working branch based on cxl/next can be found here:
> https://github.com/gourryinverse/linux/tree/private_compression
>
> A QEMU device which can inject high/low interrupts can be found here:
> https://github.com/gourryinverse/qemu/tree/compressed_cxl_clean
>
> The additional patches on these branches are CXL and DAX driver
> housecleaning only tangentially relevant to this RFC, so I've
> omitted them for the sake of trying to keep it somewhat clean
> here. Those patches should (hopefully) be going upstream anyway.
>
> Patches 1-22: Core Private Node Infrastructure
>
> Patch 1: Introduce N_MEMORY_PRIVATE scaffolding
> Patch 2: Introduce __GFP_PRIVATE
> Patch 3: Apply allocation isolation mechanisms
> Patch 4: Add N_MEMORY nodes to private fallback lists
> Patches 5-9: Filter operations not yet supported
> Patch 10: free_folio callback
> Patch 11: split_folio callback
> Patches 12-20: mm/ service opt-ins:
> Migration, Mempolicy, Demotion, Write Protect,
> Reclaim, OOM, NUMA Balancing, Compaction,
> LongTerm Pinning
> Patch 21: memory_failure callback
> Patch 22: Memory hotplug plumbing for private nodes
>
> Patch 23: mm/cram -- Compressed RAM Management
>
> Patches 24-27: CXL Driver examples
> Sysram Regions with Private node support
> Basic Driver Example: (MIGRATION | MEMPOLICY)
> Compression Driver Example (Generic)
>
Hi,
As this is about to be discussed at the conference, I thought I'd
share some high-level comments.
I have been testing this for some time on a device with compression
(after some fixes needed to get CXL RCD working, which Greg helped
me with).
Overall, the isolation property this provides is something I consider
necessary for this technology. Others are better placed to judge the
MM plumbing itself, but I wanted to say that from the device/use-case
side this functionality is an important piece of the puzzle.
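For concreteness, the kind of userspace check this enables looks
roughly like the sketch below. The private node id (2 here) is a
placeholder for whatever the driver onlines, and it assumes the
driver has opted in with NP_OPS_MEMPOLICY (build with -lnuma):

  /* Minimal sketch of the isolation behaviour from the TL;DR. */
  #define _GNU_SOURCE
  #include <numaif.h>        /* mbind(), move_pages() */
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          const int private_node = 2;   /* placeholder node id */
          unsigned long nodemask = 1UL << private_node;
          const size_t len = 4096;

          char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED)
                  return 1;

          /* Bind to the private node, then fault a page onto it. */
          if (mbind(buf, len, MPOL_BIND, &nodemask,
                    8 * sizeof(nodemask), 0))
                  perror("mbind");
          buf[0] = 1;

          /* With nodes == NULL, move_pages() reports placement. */
          int status = -1;
          void *pages[1] = { buf };
          if (move_pages(0, 1, pages, NULL, &status, 0))
                  perror("move_pages");
          printf("page landed on node %d\n", status);
          return 0;
  }

A plain malloc() in the same process should never report the private
node through the same move_pages() query.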
For cram itself, as it stands in this RFC, I think there is still
performance and value left on the table (as noted in the
description), but I fully understand Gregory's rationale for
approaching it this way.
<snip>
>
> Future CRAM : Loosening the read-only constraint
> ===
>
> The read-only model is safe but conservative. For workloads where
> compressed pages are occasionally written, the promotion fault adds
> latency. A future optimization could allow a tunable fraction of
> compressed pages to be mapped writable, accepting some risk of
> write-driven decompression in exchange for lower overhead.
>
> The private node ops make this straightforward:
>
> - Adjust fixup_migration_pte to selectively skip
> write-protection.
> - Use the backpressure system to either revoke writable mappings,
> deny additional demotions, or evict when device pressure rises.
I have some quick hacks playing with these ideas, but I haven't had
the time to test them thoroughly and get to something robust yet; the
rough shape of one such hack is sketched below. I also saw in another
thread that a follow-up is cooking, which looks interesting.
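To give an idea of the direction, a heavily hand-waved sketch: apart
from pte_wrprotect(), everything here is made up for illustration
(cram_writable_permille and folio_on_private_node() are hypothetical
knobs/helpers), and a real version would still need the backpressure
hooks to revoke writable mappings when device pressure rises:

  #include <linux/random.h>    /* get_random_u32_below() */

  /* Hypothetical tunable: permille of demoted pages left writable.
   * 0 keeps today's strict read-only model. */
  static unsigned int cram_writable_permille;

  static bool cram_allow_writable(void)
  {
          return get_random_u32_below(1000) < cram_writable_permille;
  }

  /* Hooked into fixup_migration_pte(): skip write-protection for a
   * sampled fraction of pages landing on the private node.
   * folio_on_private_node() is a made-up helper for the check. */
          if (folio_on_private_node(folio) && !cram_allow_writable())
                  pte = pte_wrprotect(pte);  /* read-only model */
          /* else leave the PTE writable and accept the occasional
           * write-driven decompression */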
Thanks, Greg, for pushing this, and I'm happy to test more on HW in
our lab.
Best,
/Yiannis