Re: RFC: Memory Tiering Kernel Interfaces

From: Aneesh Kumar K V
Date: Thu May 12 2022 - 00:41:05 EST


On 5/11/22 12:42 PM, Alistair Popple wrote:

Wei Xu <weixugc@xxxxxxxxxx> writes:

On Tue, May 10, 2022 at 5:10 AM Aneesh Kumar K V
<aneesh.kumar@xxxxxxxxxxxxx> wrote:

On 5/10/22 3:29 PM, Hesham Almatary wrote:
Hello Yang,

On 5/10/2022 4:24 AM, Yang Shi wrote:
On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
<hesham.almatary@xxxxxxxxxx> wrote:


...


Node 0 has a CPU and DDR memory in tier 0, node 1 has a GPU and DDR memory
in tier 0, node 2 has NVMM memory in tier 1, and node 3 has some sort of
bigger memory (could be a bigger DDR or something) in tier 2. The distances
are as follows:

--------------          --------------
|   Node 0   |          |   Node 1   |
|  -------   |          |  -------   |
| |  DDR  |  |          | |  DDR  |  |
|  -------   |          |  -------   |
|            |          |            |
--------------          --------------
      | 20               | 120    |
      v                  v        |
----------------------------      |
|       Node 2 PMEM        |      | 100
----------------------------      |
      | 100                       |
      v                           v
--------------------------------------
|          Node 3 Large mem          |
--------------------------------------

node distances:
node    0    1    2    3
   0   10   20   20  120
   1   20   10  120  100
   2   20  120   10  100
   3  120  100  100   10

/sys/devices/system/node/memory_tiers
0-1
2
3

N_TOPTIER_MEMORY: 0-1


In this case, we want to be able to "skip" the demotion path from Node 1
to Node 2, and make demotion go directly to Node 3 as it is closer,
distance wise. How can we accommodate this scenario (or at least not rule
it out as future work) with the current RFC?
If I remember correctly NUMA distance is hardcoded in SLIT by the
firmware, it is supposed to reflect the latency. So I suppose it is
the firmware's responsibility to have correct information. And the RFC
assumes higher tier memory has better performance than lower tier
memory (latency, bandwidth, throughput, etc), so it sounds like a
buggy firmware to have lower tier memory with shorter distance than
higher tier memory IMHO.

You are correct if you're assuming the topology is all hierarchically
symmetric, but unfortunately, in real hardware (e.g., my example above)
it is not. The distance/latency from two nodes in the same tier to a
third node can differ. The firmware still provides the correct latency,
but putting a node in a tier is up to the kernel/user, and is relative:
e.g., Node 3 could belong to tier 1 from Node 1's perspective, but to
tier 2 from Node 0's.
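
To make the "relative" point concrete, here is a minimal sketch (plain Python, not kernel code) that ranks the non-top-tier nodes by distance as seen from each top-tier node, using the distance table posted above. The helper name rank_targets is made up for illustration and is not part of the RFC.

```python
# Distance table from the example above (row = source node, column = target).
DISTANCE = [
    [10, 20, 20, 120],
    [20, 10, 120, 100],
    [20, 120, 10, 100],
    [120, 100, 100, 10],
]
TOP_TIER = [0, 1]      # N_TOPTIER_MEMORY in the example
LOWER_NODES = [2, 3]   # nodes outside the top tier


def rank_targets(src, candidates, distance=DISTANCE):
    """Hypothetical helper: order candidates nearest-first as seen from src."""
    return sorted(candidates, key=lambda n: (distance[src][n], n))


for src in TOP_TIER:
    print(f"node {src}: nearest-first order {rank_targets(src, LOWER_NODES)}")
```

With the posted distances, Node 0 sees Node 2 first and Node 3 second, while Node 1 sees Node 3 first and Node 2 second, which is the asymmetry described above.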


A more detailed example (building on my previous one) is when having
the GPU connected to a switch:

----------------------------
|       Node 2 PMEM        |
----------------------------
      ^
      |
--------------          --------------
|   Node 0   |          |   Node 1   |
|  -------   |          |  -------   |
| |  DDR  |  |          | |  DDR  |  |
|  -------   |          |  -------   |
|    CPU     |          |    GPU     |
--------------          --------------
      |                  |
      v                  v
----------------------------
|          Switch          |
----------------------------
      |
      v
--------------------------------------
|          Node 3 Large mem          |
--------------------------------------

Here, demoting from Node 1 to Node 3 directly would be faster as
it only has to go through one hub, compared to demoting from Node 1
to Node 2, where it goes through two hubs. I hope that example
clarifies things a little bit.


Alistair mentioned that we want to consider GPU memory to be expensive
and want to demote from GPU to regular DRAM. In that case, for the
topology above, we should end up with


tier 0 -> Node3
tier 1 -> Node0, Node1
tier 2 -> Node2

I'm a little bit confused by the tiering here as I don't think it's
quite what we want. As pointed out GPU memory is expensive and therefore
we don't want anything demoting to it. That implies it should be in the
top tier:



I didn't look closely at the topology and assumed that Node3 is the GPU connected to the switch. Hence all the confusion.


tier 0 -> Node1
tier 1 -> Node0, Node3
tier 2 -> Node2

Hence:

node 0: allowed=2
node 1: allowed=0,3,2
node 2: allowed=empty
node 3: allowed=2

looks like a good, simple default.


Alternatively Node3 could be put in tier 2 which would prevent demotion
to PMEM via the switch/CPU:

tier 0 -> Node1
tier 1 -> Node0
tier 2 -> Node2, Node3

node 0: allowed=2,3
node 1: allowed=0,3,2
node 2: allowed=empty
node 3: allowed=empty
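
As a rough illustration (plain Python, not kernel code), the allowed lists in both layouts above can be derived from the tier assignment alone. The policy here is an assumption: every node in a lower tier is allowed, listed tier by tier and nearest-first within a tier. The distance table is the one from the first example in this thread; reusing it for the switch topology is also an assumption. The name allowed_targets is hypothetical.

```python
# Distance table from the first example in this thread; applying it to the
# switch topology is an assumption made only for this sketch.
DISTANCE = [
    [10, 20, 20, 120],
    [20, 10, 120, 100],
    [20, 120, 10, 100],
    [120, 100, 100, 10],
]


def allowed_targets(tiers, distance=DISTANCE):
    """tiers: list of node lists, index 0 is the top tier.

    Returns {node: [targets]}, where targets are all nodes in lower tiers,
    ordered tier by tier and nearest-first within a tier (assumed policy).
    """
    allowed = {}
    for rank, nodes in enumerate(tiers):
        for src in nodes:
            targets = []
            for lower in tiers[rank + 1:]:
                targets += sorted(lower, key=lambda n: (distance[src][n], n))
            allowed[src] = targets
    return allowed


for tiers in ([[1], [0, 3], [2]], [[1], [0], [2, 3]]):
    print("tiers:", tiers)
    for node, targets in sorted(allowed_targets(tiers).items()):
        shown = ",".join(map(str, targets)) or "empty"
        print(f"  node {node}: allowed={shown}")
```

Under these assumptions the sketch reproduces both sets of allowed lists given above, including the 0,3,2 ordering for Node 1.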


And can this be configured via userspace?

Both of these would be an improvement over the current situation
upstream, which demotes everything to GPU memory and doesn't support
demoting from the GPU (meaning reclaim on GPU memory pages everything to
disk).


Hence

node 0: allowed=2
node 1: allowed=2
node 2: allowed=empty
node 3: allowed=0-1, based on fallback order 1,0

If we have 3 tiers as defined above, then we'd better have:

node 0: allowed = 2
node 1: allowed = 2
node 2: allowed = empty
node 3: allowed = 0-2, based on fallback order: 1,0,2

The firmware should provide the node distance values to reflect that
PMEM is slowest and should have the largest distance away from node 3.

Right. In my above example firmware would have to provide reasonable
distance values to ensure optimal fallback order.
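
A small sketch (plain Python, not kernel code) of why the distance values matter for the 1,0,2 fallback order from node 3. It assumes the fallback order is a plain nearest-first sort with ties broken by node id, which is a simplification. The first row is taken from the distance table earlier in the thread; the value 140 in the second row is purely hypothetical and only stands in for "PMEM has the largest distance from node 3".

```python
def fallback_order(dist_from_src):
    """Order target nodes nearest-first; ties broken by node id (assumed policy)."""
    return sorted(dist_from_src, key=lambda n: (dist_from_src[n], n))


# Distances from node 3 as posted earlier in the thread.
posted = {0: 120, 1: 100, 2: 100}

# Hypothetical adjustment: PMEM (node 2) reported as farthest from node 3.
adjusted = {0: 120, 1: 100, 2: 140}

print("posted  :", fallback_order(posted))
print("adjusted:", fallback_order(adjusted))
```

With the posted values node 2 ties with node 1 and sorts ahead of node 0; only when node 2 has the largest distance does a distance-based sort give 1,0,2.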

-aneesh