Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave

From: Huang, Ying
Date: Tue Oct 31 2023 - 22:36:26 EST


Johannes Weiner <hannes@xxxxxxxxxxx> writes:

> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:

[snip]

>>
>> > This hopefully also explains why it's a global setting. The usecase is
>> > different from conventional NUMA interleaving, which is used as a
>> > locality measure: spread shared data evenly between compute
>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>> > compute. Instead, the optimal spread is based on hardware parameters,
>> > which is a global property rather than a per-workload one.
>>
>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>> for this specific CXL usecase but it just doesn't fit into many others I
>> can think of - e.g. proportional use of those tiers based on the
>> workload - you get what you pay for.
>>
>> Is there any specific reason for not having a new interleave interface
>> which defines weights for the nodemask? Is this because the policy
>> itself is very dynamic or is this more driven by simplicity of use?
>
> A downside of *requiring* weights to be paired with the mempolicy is
> that it's then the application that would have to figure out the
> weights dynamically, instead of having a static host configuration. A
> policy of "I want to be spread for optimal bus bandwidth" translates
> between different hardware configurations, but optimal weights will
> vary depending on the type of machine a job runs on.
>
> That doesn't mean there couldn't be usecases for having weights as
> policy as well in other scenarios, like you allude to above. It's just
> so far such usecases haven't really materialized or spelled out
> concretely. Maybe we just want both - a global default, and the
> ability to override it locally.
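
To make the bandwidth-driven spread concrete, below is a minimal
userspace sketch of a weighted round-robin over nodes. The node IDs
and the 4:4:1 weights are made up for illustration; this is not the
patchset's actual kernel code.

#include <stdio.h>

/*
 * Toy model of weighted interleave: each node receives allocations
 * in proportion to its weight, e.g. two DRAM nodes and one CXL node
 * with less bandwidth. All values are hypothetical.
 */
#define NR_NODES 3

static const int node_weight[NR_NODES] = { 4, 4, 1 }; /* DRAM0, DRAM1, CXL */

/* Return the node for the nth page, walking the nodes weight by weight. */
static int weighted_interleave_node(unsigned long page_idx)
{
        unsigned long total = 0;
        int node;

        for (node = 0; node < NR_NODES; node++)
                total += node_weight[node];

        page_idx %= total;
        for (node = 0; node < NR_NODES; node++) {
                if (page_idx < (unsigned long)node_weight[node])
                        return node;
                page_idx -= node_weight[node];
        }
        return 0; /* unreachable with positive weights */
}

int main(void)
{
        int count[NR_NODES] = { 0 };
        unsigned long i;
        int node;

        /* Place 9000 pages; expect a 4000/4000/1000 split. */
        for (i = 0; i < 9000; i++)
                count[weighted_interleave_node(i)]++;

        for (node = 0; node < NR_NODES; node++)
                printf("node %d: %d pages\n", node, count[node]);
        return 0;
}

Running this splits the 9000 pages exactly 4000/4000/1000 across the
three nodes, mirroring the assumed bandwidth ratio.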

I think that combining a global default with the ability to override
it locally is a good idea. A system-wide configuration with
reasonable defaults makes applications' lives much easier. If more
control is needed, some kind of workload-specific configuration can
be added. And, instead of adding another memory policy, a per-cgroup
configuration may be easier to use: the per-workload weights may need
to be adjusted whenever a different combination of workloads is
deployed on the system.
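
As a rough sketch of how such a resolution could work: a per-cgroup
override that falls back to the system-wide default when a node's
weight is left unset. All names, structures, and values below are
hypothetical, not from the patchset.

#include <stdio.h>

#define NR_NODES     3
#define WEIGHT_UNSET -1

/* System-wide default weights (e.g. derived from hardware). Made up. */
static const int global_weight[NR_NODES] = { 4, 4, 1 };

/* Per-cgroup override; WEIGHT_UNSET means "inherit the global default". */
struct cgroup_weights {
        int weight[NR_NODES];
};

/* Effective weight for a node: cgroup override if set, else global. */
static int effective_weight(const struct cgroup_weights *cg, int node)
{
        if (cg && cg->weight[node] != WEIGHT_UNSET)
                return cg->weight[node];
        return global_weight[node];
}

int main(void)
{
        /* A workload that pays for more CXL bandwidth overrides node 2 only. */
        struct cgroup_weights cg = {
                .weight = { WEIGHT_UNSET, WEIGHT_UNSET, 3 },
        };
        int node;

        for (node = 0; node < NR_NODES; node++)
                printf("node %d: effective weight %d\n",
                       node, effective_weight(&cg, node));
        return 0;
}

With this shape, most workloads inherit the host configuration, and
only the ones that need different proportions carry an override.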

Another question is whether the weight should be per-memory-tier or
per-node. In this patchset, the weight is per source-target node
combination; that is, the weight becomes a matrix instead of a
vector. IIUC, this is used to control cross-socket memory access in
addition to per-memory-type memory access. Do you think the added
complexity is necessary?
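
To illustrate the two shapes, a toy comparison with hypothetical
nodes and weights (two sockets, each with one DRAM and one CXL node;
the numbers are invented):

#include <stdio.h>

#define NR_NODES 4 /* two sockets, each with one DRAM and one CXL node */

/* Per-node weights: one vector, the same for every allocating task. */
static const int weight_vec[NR_NODES] = { 4, 1, 4, 1 };

/*
 * Per source-target weights: a matrix, so a task on socket 0 can
 * favor its local DRAM/CXL over the remote socket's. Values invented.
 */
static const int weight_mat[NR_NODES][NR_NODES] = {
        /* targets: N0  N1  N2  N3 */
        {            4,  2,  2,  1 }, /* source N0 (socket 0 DRAM) */
        {            4,  2,  2,  1 }, /* source N1 (socket 0 CXL)  */
        {            2,  1,  4,  2 }, /* source N2 (socket 1 DRAM) */
        {            2,  1,  4,  2 }, /* source N3 (socket 1 CXL)  */
};

int main(void)
{
        int src = 0, dst;

        for (dst = 0; dst < NR_NODES; dst++)
                printf("target %d: vector weight %d, from node %d matrix weight %d\n",
                       dst, weight_vec[dst], src, weight_mat[src][dst]);
        return 0;
}

The vector answers "how attractive is this node" once for the whole
system; the matrix answers it per allocating socket, which is what
would allow penalizing cross-socket CXL accesses.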

> Could you elaborate on the 'get what you pay for' usecase you
> mentioned?

--
Best Regards,
Huang, Ying