Re: RFC: Memory Tiering Kernel Interfaces

From: Aneesh Kumar K.V
Date: Tue May 10 2022 - 07:45:31 EST


Wei Xu <weixugc@xxxxxxxxxx> writes:

> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
> <hesham.almatary@xxxxxxxxxx> wrote:
>>

....

> > nearest lower tier before demoting to lower lower tiers.
>> There might still be simple cases/topologies where we might want to "skip"
>> the very next lower tier. For example, assume we have a 3 tiered memory
>> system as follows:
>>
>> node 0 has a CPU and DDR memory in tier 0, node 1 has a GPU and DDR memory
>> in tier 0, node 2 has NVMM memory in tier 1, and node 3 has some sort of
>> bigger memory (could be a bigger DDR or something) in tier 2. The distances
>> are as follows:
>>
>> --------------      --------------
>> |   Node 0   |      |   Node 1   |
>> |  -------   |      |  -------   |
>> | |  DDR  |  |      | |  DDR  |  |
>> |  -------   |      |  -------   |
>> |            |      |            |
>> --------------      --------------
>>        | 20           | 120    |
>>        v              v        |
>> ----------------------------   |
>> |       Node 2 PMEM        |   | 100
>> ----------------------------   |
>>        | 100                   |
>>        v                       v
>> --------------------------------------
>> |          Node 3 Large mem          |
>> --------------------------------------
>>
>> node distances:
>> node    0    1    2    3
>>    0   10   20   20  120
>>    1   20   10  120  100
>>    2   20  120   10  100
>>    3  120  100  100   10
>>
>> /sys/devices/system/node/memory_tiers
>> 0-1
>> 2
>> 3
>>
>> N_TOPTIER_MEMORY: 0-1
>>
>>
>> In this case, we want to be able to "skip" the demotion path from Node 1
>> to Node 2 and make demotion go directly to Node 3, as it is closer
>> distance-wise. How can we accommodate this scenario (or at least not rule
>> it out as future work) with the current RFC?
>
> This is an interesting example. I think one way to support this is to
> allow all the lower tier nodes to be the demotion targets of a node in
> the higher tier. We can then use the allocation fallback order to
> select the best demotion target.
>
> For this example, we will have the demotion targets of each node as:
>
> node 0: allowed=2-3, order (based on allocation fallback order): 2, 3
> node 1: allowed=2-3, order (based on allocation fallback order): 3, 2
> node 2: allowed=3, order (based on allocation fallback order): 3
> node 3: allowed=empty
>
> What do you think?
>

Can we simplify this further with the following tier assignment?

tier 0 -> empty (no HBM/GPU)
tier 1 -> Node0, Node1
tier 2 -> Node2, Node3

Hence

node 0: allowed=2-3, order (based on allocation fallback order): 2, 3
node 1: allowed=2-3, order (based on allocation fallback order): 3, 2
node 2: allowed=empty
node 3: allowed=empty
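
As a rough illustration (a Python sketch, not kernel code), both layouts
can be derived from a tier assignment plus the node distance matrix quoted
above. The function name demotion_targets and its data layout are
hypothetical, and plain node distance is used here as a stand-in for the
allocation fallback order, which is a simplifying assumption.

```python
# Illustrative sketch only: not kernel code. All names are hypothetical.

# Node distance matrix from the example topology (rows/columns are nodes 0-3).
DISTANCE = [
    [10, 20, 20, 120],
    [20, 10, 120, 100],
    [20, 120, 10, 100],
    [120, 100, 100, 10],
]


def demotion_targets(tiers, distance):
    """Return {node: [targets]} with targets ordered by increasing distance.

    `tiers` lists node groups from the highest tier to the lowest. Every node
    in any lower tier is an allowed target; ties are broken by node id.
    Assumption: distance order approximates the allocation fallback order.
    """
    targets = {}
    for rank, nodes in enumerate(tiers):
        # All nodes in tiers below the current one are allowed targets.
        lower = [n for group in tiers[rank + 1:] for n in group]
        for node in nodes:
            targets[node] = sorted(lower, key=lambda t: (distance[node][t], t))
    return targets


# Three populated tiers, as in the memory_tiers listing: 0-1, 2, 3.
three_tiers = demotion_targets([[0, 1], [2], [3]], DISTANCE)
# Nodes 2 and 3 share one tier (the empty top tier is simply omitted).
two_tiers = demotion_targets([[0, 1], [2, 3]], DISTANCE)

for name, result in (("three tiers", three_tiers), ("two tiers", two_tiers)):
    print(name)
    for node in sorted(result):
        print(f"  node {node}: order={result[node]}")
```

In this sketch the two layouts give the same order for nodes 0 and 1 and
differ only in whether node 2 can demote to node 3.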

-aneesh