Re: [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave
From: Joshua Hahn
Date: Mon Mar 16 2026 - 11:30:00 EST
Hello Rakie! I hope you have been doing well. Thank you for this
RFC, I think it is a very interesting idea.
[...snip...]
> Consider a dual-socket system:
>
> node0 node1
> +-------+ +-------+
> | CPU 0 |---------| CPU 1 |
> +-------+ +-------+
> | DRAM0 | | DRAM1 |
> +---+---+ +---+---+
> | |
> +---+---+ +---+---+
> | CXL 0 | | CXL 1 |
> +-------+ +-------+
> node2 node3
>
> Assuming local DRAM provides 300 GB/s and local CXL provides 100 GB/s,
> the effective bandwidth varies significantly from the perspective of
> each CPU due to inter-socket interconnect penalties.
>
> Local device capabilities (GB/s) vs. cross-socket effective bandwidth:
>
> 0 1 2 3
> CPU 0 300 150 100 50
> CPU 1 150 300 50 100
>
> A reasonable global weight vector reflecting the base capabilities is:
>
> node0=3 node1=3 node2=1 node3=1
>
> However, because these configured node weights do not account for
> interconnect degradation between sockets, applying them flatly to all
> sources yields the following effective map from each CPU's perspective:
>
> 0 1 2 3
> CPU 0 3 3 1 1
> CPU 1 3 3 1 1
>
> This does not account for the interconnect penalty (e.g., node0->node1
> drops 300->150, node0->node3 drops 100->50) and thus forces allocations
> that cause a mismatch with actual performance.
>
> This patch makes weighted interleave socket-aware. Before weighting is
> applied, the candidate nodes are restricted to the current socket; only
> if no eligible local nodes remain does the policy fall back to the
> wider set.
So when I first read this, I thought the idea was that we would attempt
an allocation with these socket-aware weights and, upon failure, fall
back to the configured global weights so that the allocation could be
fulfilled from cross-socket nodes.
However, reading the implementation in 4/4, it seems that "fallback"
here does not mean a fallback allocation, but rather "if there is a
misconfiguration and the intersection between the policy nodes and the
CPU's package is empty, use the global nodes instead".
Am I understanding this correctly?
It also seems to follow that, under sane configurations, there is no
more cross-socket memory allocation at all, since the policy will
always try to fulfill requests from nodes within the local package.
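To make sure I am reading 4/4 right, here is my mental model of the
node-selection step as a rough Python sketch (the names and structure
are mine, not from the patch):

```python
def effective_nodes(policy_nodes, package_nodes, global_nodes):
    """My reading of the 4/4 "fallback": restrict candidates to the
    current package first; the global set is only used when the
    intersection is empty (i.e. a misconfiguration), not as a second
    allocation attempt after local nodes fail under pressure."""
    local = policy_nodes & package_nodes
    return local if local else global_nodes

# Dual-socket example from the cover letter: CPU 0 sits in the
# package containing node0 (DRAM0) and node2 (CXL0).
policy_nodes = {0, 1, 2, 3}
package_of_cpu0 = {0, 2}

# Sane configuration: allocation never leaves the socket.
print(effective_nodes(policy_nodes, package_of_cpu0, policy_nodes))

# Misconfiguration (policy names only remote nodes): global set used.
print(effective_nodes({1, 3}, package_of_cpu0, policy_nodes))
```

If that sketch matches the intent, then local memory pressure alone
never routes an allocation off-socket.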
> Even if the configured global weights remain identically set:
>
> node0=3 node1=3 node2=1 node3=1
>
> The resulting effective map from the perspective of each CPU becomes:
>
> 0 1 2 3
> CPU 0 3 0 1 0
> CPU 1 0 3 0 1
> Now tasks running on node0 prefer DRAM0(3) and CXL0(1), while tasks on
> node1 prefer DRAM1(3) and CXL1(1). This aligns allocation with actual
> effective bandwidth, preserves NUMA locality, and reduces cross-socket
> traffic.
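For my own understanding, the effective map above can be reproduced by
simply zeroing the weight of every node outside the allocating CPU's
package. Here is an illustrative sketch (my own code, not the patch's;
the package membership below is assumed from the topology diagram):

```python
# Global weights from the cover letter: DRAM nodes get 3, CXL nodes 1.
global_weights = {0: 3, 1: 3, 2: 1, 3: 1}

# Assumed package membership for the dual-socket topology:
# socket 0 -> {node0, node2}, socket 1 -> {node1, node3}.
packages = {"CPU 0": {0, 2}, "CPU 1": {1, 3}}

def effective_map(weights, local_nodes):
    # Weights apply unchanged inside the package; everything else is
    # masked to 0, so interleaving never crosses the socket.
    return {n: (w if n in local_nodes else 0) for n, w in weights.items()}

for cpu, local in packages.items():
    print(cpu, effective_map(global_weights, local))
# CPU 0 {0: 3, 1: 0, 2: 1, 3: 0}
# CPU 1 {0: 0, 1: 3, 2: 0, 3: 1}
```

This is what leads to my question below about "prefer" vs. a hard
restriction.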
In that sense I thought the word "prefer" was a bit confusing, since to
me it implied that allocations would be fulfilled from within the local
package first, falling back to remote packages only if that failed.
(Or maybe I am just misunderstanding your explanation. Please do let me
know if that is the case :-) )
If my understanding is correct, I think this is the same thing as
simply restricting allocations to be socket-local. I also wonder
whether this idea applies to other mempolicies as well (e.g. regular
unweighted interleave).
I think we should consider what the expected and desirable behavior is
when one socket is fully saturated but the other socket is empty. In my
mind this is the classic local-vs-remote NUMA tradeoff: reclaim locally
and keep allocations local, vs. skip reclaim and consume remote free
memory while eating the cross-socket access latency, similar to
zone_reclaim_mode (package_reclaim_mode? ;-) )
In my mind (without having done any benchmarking or looked at the
numbers myself), I imagine there are scenarios where we actually do
want cross-socket allocations, like the example above where saturation
is very asymmetric across sockets. Is this something that could be
worth benchmarking as well?
I will end by saying that in the normal case (sockets under similar
saturation) I think this series is a definite win and a clear
improvement to weighted interleave. I was just curious whether we can
handle the worst-case scenarios.
Thank you again for the series. Have a great day!
Joshua