Hello Yang,
On 5/10/2022 4:24 AM, Yang Shi wrote:
On Mon, May 9, 2022 at 7:32 AM Hesham Almatary
<hesham.almatary@xxxxxxxxxx> wrote:
Node 0 has a CPU and DDR memory in tier 0, node 1 has a GPU and DDR
memory in tier 0, node 2 has NVMM memory in tier 1, and node 3 has some
sort of bigger memory (could be a bigger DDR or something) in tier 2.
The distances are as follows:

 --------------            --------------
|    Node 0    |          |    Node 1    |
|   -------    |          |   -------    |
|  |  DDR  |   |          |  |  DDR  |   |
|   -------    |          |   -------    |
|              |          |              |
 --------------            --------------
       | 20                | 120
       v                   v         |
 ----------------------------        |
|        Node 2 PMEM         |       | 100
 ----------------------------        |
       | 100                         |
       v                             v
 --------------------------------------
|          Node 3 Large mem            |
 --------------------------------------

node distances:
node   0    1    2    3
   0  10   20   20  120
   1  20   10  120  100
   2  20  120   10  100
   3 120  100  100   10

/sys/devices/system/node/memory_tiers
0-1
2
3
N_TOPTIER_MEMORY: 0-1

In this case, we want to be able to "skip" the demotion path from Node 1
to Node 2 and make demotion go directly to Node 3, as it is closer
distance-wise. How can we accommodate this scenario (or at least not
rule it out as future work) with the current RFC?

If I remember correctly, NUMA distance is hardcoded in SLIT by the
firmware, and it is supposed to reflect the latency. So I suppose it is
the firmware's responsibility to have correct information. And the RFC
assumes higher-tier memory has better performance than lower-tier
memory (latency, bandwidth, throughput, etc.), so it sounds like buggy
firmware to have lower-tier memory with a shorter distance than
higher-tier memory, IMHO.
You are correct if you're assuming the topology is all hierarchically
symmetric, but unfortunately, in real hardware (e.g., my example above)
it is not. The distance/latency between two nodes in the same tier and
a third node can differ. The firmware still provides the correct
latency, but placing a node in a tier is up to the kernel/user, and it
is relative: e.g., Node 3 could belong to tier 1 from Node 1's
perspective, but to tier 2 from Node 0's.
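To make that concrete, here is a quick sketch (plain Python for
illustration, not kernel code; the matrix and tier assignment are taken
from my example above) of how "closest lower-tier node" depends on the
source node:

```python
# Illustrative only: pick a demotion target purely by SLIT distance,
# using the node distances and tier assignment from the example above.
DISTANCE = [
    [10, 20, 20, 120],
    [20, 10, 120, 100],
    [20, 120, 10, 100],
    [120, 100, 100, 10],
]
TIER = {0: 0, 1: 0, 2: 1, 3: 2}  # nodes 0-1: tier 0, node 2: tier 1, node 3: tier 2

def nearest_demotion_target(node):
    """Return the closest node in any lower tier, by SLIT distance."""
    candidates = [n for n in TIER if TIER[n] > TIER[node]]
    if not candidates:
        return None  # nothing below this tier to demote to
    return min(candidates, key=lambda n: DISTANCE[node][n])
```

With these numbers, the closest lower-tier node is Node 2 (distance 20)
when demoting from Node 0, but Node 3 (distance 100, vs. 120 for Node 2)
when demoting from Node 1, i.e., the "next tier down" is relative to
where you demote from.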
A more detailed example (building on my previous one) is when the GPU
is connected to a switch:
 ----------------------------
|        Node 2 PMEM         |
 ----------------------------
       ^
       |
 --------------            --------------
|    Node 0    |          |    Node 1    |
|   -------    |          |   -------    |
|  |  DDR  |   |          |  |  DDR  |   |
|   -------    |          |   -------    |
|     CPU      |          |     GPU      |
 --------------            --------------
       |                   |
       v                   v
 ----------------------------
|           Switch           |
 ----------------------------
       |
       v
 --------------------------------------
|          Node 3 Large mem            |
 --------------------------------------
Here, demoting from Node 1 to Node 3 directly would be faster as
it only has to go through one hub, compared to demoting from Node 1
to Node 2, where it goes through two hubs. I hope that example
clarifies things a little bit.
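The same point, expressed as a sketch (plain Python, not kernel code;
the hop counts below are just my reading of the diagram above, not
values any firmware reports):

```python
# Illustrative only: choose a demotion target by the number of
# intermediate hubs rather than by tier order. Hop counts are
# assumed from the switch diagram in this email.
HOPS_FROM_NODE1 = {
    2: 2,  # Node 1 -> Node 2 crosses two hubs in this topology
    3: 1,  # Node 1 -> Node 3 goes through the switch only
}

def cheapest_target(hops):
    """Return the candidate node reachable through the fewest hubs."""
    return min(hops, key=hops.get)
```

A tier-ordered policy would demote from Node 1 to Node 2 first, while
picking by hops selects Node 3 directly.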