Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: David Hildenbrand (Arm)
Date: Fri Apr 17 2026 - 05:55:55 EST
On 4/15/26 17:17, Gregory Price wrote:
> On Wed, Apr 15, 2026 at 11:49:59AM +0200, David Hildenbrand (Arm) wrote:
>> On 4/13/26 19:05, Gregory Price wrote:
>
> As a preface - the current RFC was informed by ZONE_DEVICE patterns.
:)
>
> I think that was useful as a way to find existing friction points - but
> ultimately wrong for this new interface.
>
> I don't think an ops struct here is the right design, and I think there
> are only a few patterns that actually make sense for device memory using
> nodes this way.
>
> So there's going to be a *major* contraction in the complexity of this
> patch series (hopefully I'll have something next week), and much of what
> you point out below is already in-flight.
Sounds like this discussion was valuable. Sorry for not being that
responsive ... repeatedly :)
[...]
>>
>> Something else people were discussing in the past was to physically
>> limit the area where virtio queues could be placed.
>>
>
> That is functionally what I did - the idea was pretty simple, just have
> a separate memfd/node dedicated for the queues:
>
> guest_memory = memfd(MAP_PRIVATE)
> net_memory = memfd(MAP_SHARED)
>
> And boom, you get what you want.
>
> So yeah, "it works" - but there are likely other ways to do this too, and
> as you note re: compatibility, I'm not sure virtio actually wants this,
> but it's a nice proof-of-concept for a network device on the host that
> carries its own memory.
>
Jup.
[...]
>> Needs a second thought regarding fallback logic I raised above.
>>
>> What I think would have to be audited is the usage of __GFP_THISNODE by
>> kernel allocations, where we would not actually want to allocate from
>> this private node.
>>
>
> This is fair, and a re-visit is absolutely warranted.
>
> Re-examining the quick audit from my last response suggests I should
> never have seen leakage in those cases, but the fallbacks are still needed.
>
> So yes, this all requires a second look (and a third, and a ninth).
>
> I'm not married to __GFP_PRIVATE, but it has been reliable for me.
Yes, we should carefully describe which semantics we want to achieve, to
then figure out how we could achieve them.
>
>> Maybe we could just outright refuse *any* non-user (movable) allocations
>> that target the node, even with __GFP_THISNODE.
>>
>> Because, why would we want kernel allocations to even end up on a
>> private node that is supposed to only be consumed by user space? Or
>> which use cases are there where we would want to place kernel
>> allocations on there?
>>
>
> As a start, maybe? But as a permanent invariant? I would wonder whether
> the decision here would lock us into a design.
>
> But then - this is all kernel internal, so I think it would be feasible
> to change this out from under users without backward-compatibility pain.
Right. Was just an idea, whether it would currently even make sense to
allow any kernel allocations on there.
The handful of kernel allocations that would be allowed to end up on
there would likely be extremely special.
[...]
>> Again, I am not sure about compaction and khugepaged. All we want to
>> guarantee is that our memory does not leave the private node.
>>
>> That doesn't require any __GFP_PRIVATE magic, just enlightening these
>> subsystems that private nodes must use __GFP_THISNODE and must not leak
>> to other nodes.
>
> This is where specific use-cases matter.
>
> In the compressed memory example - the device doesn't care about memory
> leaving - but it cares about memory arriving and *and being modified*.
> (more on this in your next question)
Right, but naive me would say that that's a memory allocation problem,
right?
khugepaged wants to allocate a 2M page to collapse. It goes to the buddy
to allocate it.
Buddy has to say no if the device cannot support it.
So there are free pages but we just don't want to hand them out.
I am being very naive here about the compressed memory scenario, because
it's one of these extremely weird corner cases ;)
[...]
>>> If you want the mbind contract to stay intact:
>>>
>>> NP_OPS_MIGRATION (mbind can generate migrations)
>>> NP_OPS_MEMPOLICY (this just tells mempolicy.c to allow the node)
>>
>> I'm missing why these are even opt-in. What's the problem with allowing
>> mbind and mempolicy to use these nodes in some of your drivers?
>>
>
> First:
>
> In my latest working branch these two flags have been folded into just
> _OPS_MEMPOLICY and any other migration interaction is just handled by
> filtering with the GFP flag.
>
>
> on always allowing mbind and mempolicy vs opt-in
> ---
>
> A proper compressed memory solution should not allow mbind/mempolicy.
>
> Compressed memory is different from normal memory - the kernel can
> perceive free memory (many unused struct pages in the buddy) while the
> device knows there's none left (the physical capacity is actually full).
>
> Any form of write to a compressed memory device is essentially a
> dangerous condition (OOMs = poison, not oom_kill()).
>
> So you need two controls: allocation and (userland) write protection.
> I implemented this via:
> - Demotion-only (allocations only happen in reclaim path)
> - Write-protecting the entire node
>
> (I fully accept that a write-protection extension here might be a bridge
> too far, but please stick with me for the sake of exploration).
>
>
> There's a serious argument to limit these devices to using an mbind
> pattern, but I wanted to make a full-on attempt to integrate this device
> into the demotion path as a transparent tier (kinda like zswap).
>
> I could not square write-protection with mempolicy, so I had to make
> them both optional and mutually exclusive.
>
> If you limit the device to mbind interactions, you do limit what can
> crash - but this forces userland software to be less portable by design:
>
> - am I running on a system where this device is present?
> - is that device exposing its memory on a node?
> - which node?
> - what memory can I put on that node? (can you prevent a process from
> putting libc on that node?)
> - how much compression ratio is left on the device?
> - can I safely write to this virtual address?
> - should I write-protect compressed VMAs? Can I handle those faults?
> - many more
>
> That sounds a lot like re-implementing a bunch of mm/ in userland, and
> that's exactly where we were at with DAX. We know this pattern failed.
>
> I'm trying to very much avoid repeating these mistakes, and so I'm very
> much trying to find a good path forward here that results in transparent
> usage of this memory.
>
As stated above, maybe that's really just a memory allocation problem
for mbind/khugepaged etc., and the memory allocator would need hooks to
say "well, I do have that free memory, but sorry bro, you really cannot
have it right now because it's not actually free: -ENOMEM".
Devil is in the detail, I suppose.
(again, I consider such devices an extreme corner case; if it makes the
overall design way too complicated, we might just want to say "we
cannot reasonably support this without shittifying MM". But maybe there
are ways to handle this in a better way, as noted above)
>
>> I also have some questions about longterm pinnings, but that's better
>> discussed in person :)
>>
>
> The longterm pin extension came from auditing existing zone_device
> filters.
>
> tl;dr: informative mechanism - but it probably should be dropped,
> it makes no sense (it's device memory, pinnings mean nothing?).
What I was thinking: We still have different zone options for this memory.
Expose memory to ZONE_MOVABLE -> no longterm pinning allowed.
Expose memory to ZONE_NORMAL -> longterm pinning allowed.
And if we don't even allow arbitrary kernel allocations to end up in
ZONE_NORMAL of these special nodes, we can just start using ZONE_NORMAL
and let user space (using vfio/io_uring fixed buffers etc.) consume this
private memory with longterm pinning.
Just a random thought.
>
>
>>>
>>> The task dies and frees the pages back to the buddy - the question is
>>> whether the 4-5 free_folio paths (put_folio, put_unref_folios, etc) can
>>> all eat an ops.free_folio() callback to inform the driver the memory has
>>> been freed.
>>
>> Right, that's rather invasive.
>>
>
> Yeah, I'm trying to avoid it, and the answer may actually just exist in
> the task-death and VMA cleanup path rather than the folio-free path.
>
> From what I've seen of accelerator drivers that implement this, when you
> inform the driver of a memory region associated with a task, the driver
> should have a mechanism to take references on that VMA (or something
> like it) - so that when the task dies, the driver has a way to be
> notified of the VMA being cleaned up.
>
> This probably exists - I just haven't gotten there yet.
That sounds reasonable. Alternatively, maybe the buddy can just inform
the driver about pages getting freed?
Again, just another random thought. But if these nodes are already
special-private, then why not enlighten the buddy in some way?
That also aligns with my "buddy refuses to hand out free pages if the
device says no" case.
Something to think about.
--
Cheers,
David