Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving

From: Gregory Price
Date: Fri Oct 20 2023 - 12:33:45 EST


On Fri, Oct 20, 2023 at 02:11:40PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
>
> >
[...snip...]
> > Example 2: A dual-socket system with 1 CXL device per socket
> > ===
> > CPU Nodes: node0, node1
> > CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)
> >
[...snip...]
> > This is similar to example #1, but with one difference: A task running
> > on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
[...snip...]
> > This leaves us with weights of:
> >
> > node0 - 57%
> > node1 - 26%
> > node2 - 12%
> > node3 - 5%
> >
>
> Does the workload run on CPU of node 0 only? This appears unreasonable.

Depends. If a user explicitly launches with `numactl --cpunodebind=0`,
then yes, you can force a task (and all its children) to run on node0.

If a workload is multi-threaded enough to run on both sockets, then you
are right that you'd want to limit cross-socket traffic by binding
individual threads to nodes on their own socket - if that is feasible
at all (it may not be).

But at that point, we're getting into the area of numa-aware software.
That's a bit beyond the scope of this - which is to enable a
coarse-grained interleaving solution that can easily be accessed with
something like `numactl --interleave` or `numactl --weighted-interleave`.
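For instance, `numactl --interleave=0,1,2,3 -- ./app` spreads pages
1:1:1:1 across the four nodes today; a weighted variant would instead
distribute them in proportion to per-node weights like the 57/26/12/5
split above.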

> If the memory bandwidth requirement of the workload is so large that CXL
> is used to expand bandwidth, why not run workload on CPU of node 1 and
> use the full memory bandwidth of node 1?

Settings are NOT one-size-fits-all. You can certainly come up with
another scenario in which these weights are not optimal.

If we're running enough threads that we need multiple sockets to run
them concurrently, then the memory distribution weights become much more
complex. Without more precise control over task placement and
preventing task migration, you can't really get an "optimal" placement.

What I'm really saying is: "Task placement is a better predictor of
performance than memory placement." However, user software would need
to implement a pseudo-scheduler and explicit data placement to get the
most out of that. Beyond this, there is only so much we can do from a
`numactl` perspective.

tl;dr: We can't get a perfect system here, because getting a best case
for all possible scenarios is probably an undecidable problem. You will
always be able to generate an example wherein the system is not optimal.

>
> If the workload run on CPU of node 0 and node 1, then the cross-socket
> traffic should be minimized if possible. That is, threads/processes on
> node 0 should interleave memory of node 0 and node 2, while that on node
> 1 should interleave memory of node 1 and node 3.

This can be done with set_mempolicy() with MPOL_INTERLEAVE and a
nodemask set to what you describe. Those tasks also need to prevent
themselves from being migrated, but this can absolutely be done.
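
As a rough sketch (assuming libnuma/numaif.h and the node numbering from
the example above, i.e. CPU node 0 and CXL node 2 on socket 0), a thread
group running on socket 0 could do something like:

#include <numa.h>       /* numa_available(), numa_run_on_node() */
#include <numaif.h>     /* set_mempolicy(), MPOL_INTERLEAVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        if (numa_available() < 0) {
                fprintf(stderr, "no NUMA support\n");
                return 1;
        }

        /* Keep this task on the CPUs of node 0 so it does not migrate
         * across the socket boundary. */
        if (numa_run_on_node(0) != 0) {
                perror("numa_run_on_node");
                return 1;
        }

        /* Interleave this task's future allocations across node 0
         * (local DRAM) and node 2 (local CXL) only. */
        unsigned long nodemask = (1UL << 0) | (1UL << 2);
        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          sizeof(nodemask) * 8) != 0) {
                perror("set_mempolicy");
                return 1;
        }

        /* Memory touched from here on alternates between nodes 0 and 2. */
        size_t sz = 64UL << 20;
        char *buf = malloc(sz);
        if (buf)
                memset(buf, 0, sz);

        return 0;
}

Build with -lnuma; the group running on socket 1 would do the same with
node 1 and a nodemask of {1,3}.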

In this scenario, the weights need to be re-calculated based on the
bandwidth of the nodes in the mempolicy nodemask, which is what I
described in the last email.
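
As a purely illustrative example (the bandwidth numbers are
hypothetical): if node 0 delivers ~250 GB/s and node 2 ~50 GB/s, a task
restricted to nodemask {0,2} would want weights of roughly 250/300 ~ 83%
for node 0 and 50/300 ~ 17% for node 2, rather than the whole-system
57%/12% figures from the earlier example.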

~Gregory