Re: [RFC] memory tiering: use small chunk size and more tiers

From: Aneesh Kumar K V
Date: Fri Oct 28 2022 - 01:06:35 EST


On 10/28/22 8:33 AM, Huang, Ying wrote:
> Hi, Aneesh,
>
> Aneesh Kumar K V <aneesh.kumar@xxxxxxxxxxxxx> writes:
>
>> On 10/27/22 12:29 PM, Huang Ying wrote:
>>> We need some way to override the system default memory tiers. For
>>> the example system as follows,
>>>
>>> type      abstract distance
>>> ----      -----------------
>>> HBM       300
>>> DRAM      1000
>>> CXL_MEM   5000
>>> PMEM      5100
>>>
>>> Given the memory tier chunk size is 100, the default memory tiers
>>> could be,
>>>
>>> tier   abstract distance   types
>>>        range
>>> ----   -----------------   -----
>>> 3      300-400             HBM
>>> 10     1000-1100           DRAM
>>> 50     5000-5100           CXL_MEM
>>> 51     5100-5200           PMEM
>>>
>>> If we want to group CXL MEM and PMEM into one tier, we have 2 choices.
>>>
>>> 1) Override the abstract distance of CXL_MEM or PMEM. For example, if
>>> we change the abstract distance of PMEM to 5050, the memory tiers
>>> become,
>>>
>>> tier   abstract distance   types
>>>        range
>>> ----   -----------------   -----
>>> 3      300-400             HBM
>>> 10     1000-1100           DRAM
>>> 50     5000-5100           CXL_MEM, PMEM
>>>
>>> 2) Override the memory tier chunk size. For example, if we change the
>>> memory tier chunk size to 200, the memory tiers become,
>>>
>>> tier   abstract distance   types
>>>        range
>>> ----   -----------------   -----
>>> 1      200-400             HBM
>>> 5      1000-1200           DRAM
>>> 25     5000-5200           CXL_MEM, PMEM
>>>
>>> But after some thought, I think choice 2) may not be good. The
>>> problem is that even if 2 abstract distances are almost the same, they
>>> may be put in 2 different tiers if they sit on different sides of a
>>> tier boundary. For example, suppose the abstract distance of CXL_MEM
>>> is 4990, while the abstract distance of PMEM is 5010. Although the
>>> difference between the abstract distances is only 20, CXL_MEM and PMEM
>>> will be put in different tiers if the tier chunk size is 50, 100, 200,
>>> 250, 500, .... This makes choice 2) hard to use; it may become tricky
>>> to find a tier chunk size that satisfies all requirements.
>>>
>>
>> Shouldn't we wait until we have gained experience with how devices with
>> different latencies and bandwidths end up being mapped before tuning these values?
>
> Just want to discuss the overall design.
>
>>> So I suggest abandoning choice 2) and using choice 1) only. This makes
>>> the overall design and the user space interface simpler and easier
>>> to use. The overall design of the abstract distance could be,
>>>
>>> 1. Use decimal for abstract distance and its chunk size. This makes
>>> them more user friendly.
>>>
>>> 2. Make the tier chunk size as small as possible. For example, 10.
>>> By default, this will put different memory types in one memory tier
>>> only if their performance is almost the same. And we will not provide
>>> an interface to override the chunk size.
>>>
>>
>> This could also mean we end up with lots of memory tiers with relatively
>> small performance differences between them. Again, it depends on how HMAT
>> attributes will be used to map to abstract distance.
>
> Per my understanding, there will not be many memory types in a system.
> So, there will not be many memory tiers either. In most systems, there
> are only 2 or 3 memory tiers, for example, HBM, DRAM, CXL, etc.

So we don't need the chunk size to be 10, because we don't foresee needing
to group devices into that many tiers.
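
To make this concrete, here is a quick sketch (Python, not kernel code). It
assumes tier = abstract distance / chunk size, which is what the tables quoted
above imply; tiers_for(), grouping() and EXAMPLE_TYPES are names made up for
illustration. For the example system, chunk sizes 10 and 100 only differ in
the tier numbers, not in which types end up together.

```python
from collections import defaultdict

# Abstract distances of the example system quoted earlier in this thread.
EXAMPLE_TYPES = {"HBM": 300, "DRAM": 1000, "CXL_MEM": 5000, "PMEM": 5100}


def tiers_for(types, chunk_size):
    """Group memory types by tier.

    Assumption: tier = abstract distance // chunk size, as in the quoted tables.
    """
    tiers = defaultdict(list)
    for name, adistance in types.items():
        tiers[adistance // chunk_size].append(name)
    return dict(sorted(tiers.items()))


def grouping(types, chunk_size):
    """Return only the groups of types, ignoring the tier numbers."""
    return list(tiers_for(types, chunk_size).values())


for chunk_size in (10, 100, 200):
    print(chunk_size, tiers_for(EXAMPLE_TYPES, chunk_size))
```

Only the chunk size of 200 changes the grouping here (CXL_MEM and PMEM share a
tier), which matches the third table above.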

> Do you know of systems with many memory types? The basic idea is to
> put different memory types in different memory tiers by default. If
> users want to group them, they can do that via overriding the abstract
> distance of some memory type.
>

With a small chunk size, and depending on how we are going to derive the
abstract distance, I am wondering whether we would end up with lots of memory
tiers with no real value. Hence my suggestion to hold off on a change like this
until we have code that maps HMAT/CDAT attributes to abstract distance.
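
A rough illustration of the concern, again assuming tier = abstract distance /
chunk size. The distances below are invented for this example and are not
derived from any real HMAT/CDAT data; count_tiers() is a hypothetical helper.
If the derivation yields slightly different values for devices that perform
about the same, a chunk size of 10 gives each of them its own tier.

```python
# Invented abstract distances for four devices whose derived values differ
# only slightly. These are NOT real HMAT/CDAT numbers.
HYPOTHETICAL_ADISTANCES = [1000, 1012, 1025, 1038]


def count_tiers(adistances, chunk_size):
    """Count distinct tiers, assuming tier = abstract distance // chunk size."""
    return len({adistance // chunk_size for adistance in adistances})


for chunk_size in (10, 100):
    print(chunk_size, count_tiers(HYPOTHETICAL_ADISTANCES, chunk_size))
```

With these made-up numbers, chunk size 10 yields 4 tiers and chunk size 100
yields 1. Whether real devices would look like this depends entirely on the
mapping code we do not have yet.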




>>
>>> 3. Make the abstract distance of normal DRAM large enough. For
>>> example, with 1000, 100 tiers can be defined below DRAM, which is
>>> more than enough in practice.
>>
>> Why 100? Will we really have that many tiers below/faster than DRAM? As of now
>> I see only HBM below it.
>
> Yes. 100 is more than enough. We just want to avoid grouping different
> memory types by default.
>
> Best Regards,
> Huang, Ying
>