Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)

From: David Hildenbrand (Arm)

Date: Fri Apr 17 2026 - 05:49:29 EST


On 4/15/26 21:47, Frank van der Linden wrote:
> On Wed, Apr 15, 2026 at 8:18 AM Gregory Price <gourry@xxxxxxxxxx> wrote:
>>
>> On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
>>
>> As a preface - the current RFC was informed by ZONE_DEVICE patterns.
>>
>> I think that was useful as a way to find existing friction points - but
>> ultimately wrong for this new interface.
>>
>> I don't think an ops struct here is the right design, and I think there
>> are only a few patterns that actually make sense for device memory using
>> nodes this way.
>>
>> So there's going to be a *major* contraction in the complexity of this
>> patch series (hopefully I'll have something next week), and much of what
>> you point out below is already in-flight.
>>
>>>
>> ... snip ...
>>>
>>> A related series proposed some MEM_READ/WRITE backend requests [1]
>>>
>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-09/msg02693.html
>>>
>>
>> Oh interesting, thank you for the reference here.
>>
>>>
>>> Something else people were discussing in the past was to physically
>>> limit the area where virtio queues could be placed.
>>>
>>
>> That is functionally what I did - the idea was pretty simple, just have
>> a separate memfd/node dedicated for the queues:
>>
>> guest_memory = memfd(MAP_PRIVATE)
>> net_memory = memfd(MAP_SHARED)
>>
>> And boom, you get what you want.
>>
>> So yeah, "it works" - but there are likely other ways to do this too, and
>> as you note re: compatibility, I'm not sure virtio actually wants this,
>> but it's a nice proof-of-concept for a network device on the host that
>> carries its own memory.
>>
>> I'll try to post my hack as an example with the next RFC version, as I
>> think it's informative.
>>
>>>
>>> But that's a different "fallback" problem, no?
>>>
>>> You want allocations that target the "special node" to fallback to
>>> *other* nodes, but not other allocations to fallback to *this special* node.
>>>
>> ... snip - slight reordering to put thoughts together ...
>>>
>>> Needs a second thought regarding fallback logic I raised above.
>>>
>>> What I think would have to be audited is the usage of __GFP_THISNODE by
>>> kernel allocations, where we would not actually want to allocate from
>>> this private node.
>>>
>>
>> This is fair, and a re-visit is absolutely warranted.
>>
>> Re-examining the quick audit from my last response suggests I should
>> never have seen leakage in those cases, but the fallbacks are needed.
>>
>> So yes, this all requires a second look (and a third, and a ninth).
>>
>> I'm not married to __GFP_PRIVATE, but it has been reliable for me.
>>
>>> Maybe we could just outright refuse *any* non-user (movable) allocations
>>> that target the node, even with __GFP_THISNODE.
>>>
>>> Because, why would we want kernel allocations to even end up on a
>>> private node that is supposed to only be consumed by user space? Or
>>> which use cases are there where we would want to place kernel
>>> allocations on there?
>>>
>>
>> As a start, maybe? But as a permanent invariant? I'd wonder whether
>> deciding that now would lock us into a design.
>>
>> But then - this is all kernel internal, so I think it would be feasible
>> to change this out from under users without backward compatibility pain.
>>
>> So far I have done my best to avoid changing any userland interfaces in
>> a way that would fundamentally change the contracts. If anything
>> private-node-specific other than the node's `has_memory_private`
>> attribute leaks into userland, someone messed up.
>>
>> So... I think that's reasonable.
>>
>>>
>>> I assume you will be at LSF/MM? Would be good to discuss some of that in
>>> person.
>>>
>>
>> Yes, looking forward to it :]
>>
>>
>>>
>>>
>>> Again, I am not sure about compaction and khugepaged. All we want to
>>> guarantee is that our memory does not leave the private node.
>>>
>>> That doesn't require any __GFP_PRIVATE magic, just enlightening these
>>> subsystems that private nodes must use __GFP_THISNODE and must not leak
>>> to other nodes.
>>
>> This is where specific use-cases matter.
>>
>> In the compressed memory example - the device doesn't care about memory
>> leaving - but it cares about memory arriving *and being modified*.
>> (more on this in your next question)
>>
>> So I'm not convinced *all possible devices* would always want to support
>> move_pages(), mbind(), and set_mempolicy().
>>
>> But, I do want to give this serious thought, and I agree the absolute
>> minimal patch set could just be the fallback control mechanism and
>> mm/ component filters/audit on __GFP_*.
>>
>>
>>>
>>> I'm missing why these are even opt-in. What's the problem with allowing
>>> mbind and mempolicy to use these nodes in some of your drivers?
>>>
>>
>> First:
>>
>> In my latest working branch these two flags have been folded into just
>> _OPS_MEMPOLICY, and any other migration interaction is handled by
>> filtering on the GFP flag.
>>
>>
>> on always allowing mbind and mempolicy vs opt-in
>> ---
>>
>> A proper compressed memory solution should not allow mbind/mempolicy.
>>
>> Compressed memory is different from normal memory - the kernel can
>> perceive free memory (many unused struct pages in the buddy) while the
>> device knows there is none left (the physical capacity is actually full).
>>
>> Any form of write to a compressed memory device is essentially a
>> dangerous condition (OOMs = poison, not oom_kill()).
>>
>> So you need two controls: allocation and (userland) write protection.
>> I implemented these via:
>> - Demotion-only (allocations only happen in the reclaim path)
>> - Write-protecting the entire node
>>
>> (I fully accept that a write-protection extension here might be a bridge
>> too far, but please stick with me for the sake of exploration).
>>
>>
>> There's a serious argument to limit these devices to using an mbind
>> pattern, but I wanted to make a full-on attempt to integrate this device
>> into the demotion path as a transparent tier (kinda like zswap).
>>
>> I could not square write-protection with mempolicy, so I had to make
>> them both optional and mutually exclusive.
>>
>> If you limit the device to mbind interactions, you do limit what can
>> crash - but this forces userland software to be less portable by design:
>>
>> - am I running on a system where this device is present?
>> - is that device exposing its memory on a node?
>> - which node?
>> - what memory can I put on that node? (can you prevent a process from
>> putting libc on that node?)
>> - how much compression ratio is left on the device?
>> - can I safely write to this virtual address?
>> - should I write-protect compressed VMAs? Can I handle those faults?
>> - many more
>>
>> That sounds a lot like re-implementing a bunch of mm/ in userland, and
>> that's exactly where we ended up with DAX. We know this pattern failed.
>>
>> I'm very much trying to avoid repeating those mistakes, and to find a
>> good path forward here that results in transparent usage of this memory.
>>
>>
>>> I also have some questions about longterm pinnings, but that's better
>>> discussed in person :)
>>>
>>
>> The longterm pin extension came from auditing existing ZONE_DEVICE
>> filters.
>>
>> tl;dr: informative mechanism - but it probably should be dropped,
>> it makes no sense (it's device memory, pinnings mean nothing?).
>>
>>
>>>
>>> Right, that's rather invasive.
>>>
>>
>> Yeah, I'm trying to avoid it, and the answer may actually just exist in
>> the task-death and VMA cleanup path rather than the folio-free path.
>>
>> From what I've seen of accelerator drivers that implement this, when you
>> register a memory region for a task, the driver should have a mechanism
>> to take references on that VMA (or something like it) - so that when the
>> task dies, the driver is notified of the VMA being cleaned up.
>>
>> This probably exists - I just haven't gotten there yet.
>>
>> ~Gregory
>
> This has been a really great discussion. I just wanted to add a few
> points that I think I have mentioned in other forums, but not here.
>
> In essence, this is a discussion about memory properties and the level
> at which they should be dealt with. Right now there are basically 3
> levels: pageblocks, zones and nodes. While these levels exist for good
> reasons, they also sometimes lead to issues. There's duplication of
> functionality. MIGRATE_CMA and ZONE_MOVABLE both implement the same
> basic property, but at different levels (attempts have been made to
> merge them, but it didn't work out). There's also memory with clashing
> properties inhabiting the same data structure: LRUs. Having strictly
> movable memory on the same LRU as unmovable memory is a mismatch. It
> leads to the well-known problem that reclaim done on behalf of an
> unmovable allocation attempt can be entirely pointless in the face of
> large amounts of ZONE_MOVABLE or MIGRATE_CMA memory: the anon LRU will
> be chock full of movable-only pages. Reclaiming them is useless for
> your allocation, and skipping them leads to locking up the system
> because you're holding the LRU lock for a long time.
>
> So, looking at having some properties set at the node level makes
> sense to me even in the non-device case. But perhaps that is out of
> scope for the initial discussion.
>
> One use case that seems like a good match for private nodes is guest
> memory. Guest memory is special enough to want to allocate / maintain
> it separately, which is acknowledged by the introduction of
> guest_memfd.

Yes. There is now an interface to configure mbind() for guest_memfd. So
with that and some tweaks, maybe that ... would just work, if we get the
mbind() interaction right?

--
Cheers,

David