Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management

From: Alistair Popple
Date: Fri Aug 25 2023 - 02:00:39 EST



"Huang, Ying" <ying.huang@xxxxxxxxx> writes:

> Alistair Popple <apopple@xxxxxxxxxx> writes:
>
>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes:
>>
>>> Alistair Popple <apopple@xxxxxxxxxx> writes:
>>>
>>>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes:
>>>>
>>>>> Alistair Popple <apopple@xxxxxxxxxx> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes:
>>>>>>
>>>>>>> Hi, Alistair,
>>>>>>>
>>>>>>> Sorry for late response. Just come back from vacation.
>>>>>>
>>>>>> Ditto for this response :-)
>>>>>>
>>>>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>>>>> getting around to following all this up sooner.
>>>>>>
>>>>>>> Alistair Popple <apopple@xxxxxxxxxx> writes:
>>>>>>>
>>>>>>>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes:
>>>>>>>>
>>>>>>>>> Alistair Popple <apopple@xxxxxxxxxx> writes:
>>>>>>>>>
>>>>>>>>>> "Huang, Ying" <ying.huang@xxxxxxxxx> writes:
>>>>>>>>>>
>>>>>>>>>>> Alistair Popple <apopple@xxxxxxxxxx> writes:
>>>>>>>>>>>
>>>>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>>>>> interface at the same time.
>>>>>>>>>>>>
>>>>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>>>>> or something else.
>>>>>>>>>>>
>>>>>>>>>>> Only if different algorithms follow the same basic principle. For
>>>>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>>>>> (MEMTIER_ADISTANCE_DRAM). The abstract distance of the memory device is
>>>>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>>>>> proportional to the memory bandwidth. Use the memory latency and
>>>>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>>>>
>>>>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth. If there are
>>>>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>>>>> can use them too.
>>>>>>>>>>
>>>>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>>>>> abstract distance is derived.
>>>>>>>>>
>>>>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>>>>> abstract distance from the NUMA node ID. In the future, if they need
>>>>>>>>> more interfaces, we can add them. For example, the interface you
>>>>>>>>> suggested above.
>>>>>>>>
>>>>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>>>>> is so they can pass it to alloc_memory_type():
>>>>>>>>
>>>>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>>>>
>>>>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>>>>> and the calculation of abstract distance should be done there. That
>>>>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>>>>> calculation code.
>>>>>>>
>>>>>>> In the current design, the abstract distance is the key concept of
>>>>>>> memory types and memory tiers. And it is used as interface to allocate
>>>>>>> memory types. This provides more flexibility than some other interfaces
>>>>>>> (e.g. read/write bandwidth/latency). For example, in current
>>>>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>>>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used. This is still useful
>>>>>>> to support some systems now. On a system without HMAT/CDAT, it's
>>>>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>>>>> quite limited. I'm not sure whether all systems will provide read/write
>>>>>>> bandwith/latency data for all memory devices.
>>>>>>>
>>>>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>>>>> bandwidth/latency data to be used to calculate abstract distance. For
>>>>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>>>>> from read/write bandwith/latency to the abstract distance. Can this
>>>>>>> solve your concerns about the consistency among algorithms? If so, we
>>>>>>> can do that when we add the second algorithm that needs that.
>>>>>>
>>>>>> I guess it would address my concerns if we did that now. I don't see why
>>>>>> we need to wait for a second implementation for that though - the whole
>>>>>> series seems to be built around adding a framework for supporting
>>>>>> multiple algorithms even though only one exists. So I think we should
>>>>>> support that fully, or simplfy the whole thing and just assume the only
>>>>>> thing that exists is HMAT and get rid of the general interface until a
>>>>>> second algorithm comes along.
>>>>>
>>>>> We will need a general interface even for one algorithm implementation.
>>>>> Because it's not good to make a dax subsystem driver (dax/kmem) to
>>>>> depend on a ACPI subsystem driver (acpi/hmat). We need some general
>>>>> interface at subsystem level (memory tier here) between them.
>>>>
>>>> I don't understand this argument. For a single algorithm it would be
>>>> simpler to just define acpi_hmat_calculate_adistance() and a static
>>>> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
>>>> a layer of indirection through notifier blocks. That breaks any
>>>> dependency on ACPI and there's plenty of precedent for this approach in
>>>> the kernel already.
>>>
>>> ACPI is a subsystem, so it's OK for dax/kmem to depends on CONFIG_ACPI.
>>> But HMAT is a driver of ACPI subsystem (controlled via
>>> CONFIG_ACPI_HMAT). It's not good for a driver of DAX subsystem
>>> (dax/kmem) to depend on a *driver* of ACPI subsystem.
>>>
>>> Yes. Technically, there's no hard wall to prevent this. But I think
>>> that a good design should make drivers depends on subsystems or drivers
>>> of the same subsystem, NOT drivers of other subsystems.
>>
>> Thanks, I wasn't really thinking of HMAT as an ACPI driver. I understand
>> where you're coming from but I really don't see the problem with using a
>> static inline. It doesn't create dependencies (you could still use
>> dax/kmem without ACPI) and results in smaller and easier to follow code.
>>
>> IMHO it's far more obvious that a call to acpi_hmat_calcaulte_adist()
>> returns either a default if ACPI HMAT isn't configured or a calculated
>> value than it is to figure out what notifiers may or may not be
>> registered at runtime and what priority they may be called in from
>> mt_calc_adistance().
>>
>> It appears you think that is a bad design, but I don't understand
>> why. What does this approach give us that a simpler approach wouldn't?
>
> Think about all these again. Finally I admit you are right. The
> general interface is better mainly if there are multiple implementations
> of the interface.
>
> In this series, we provide just one implementation: HMAT. And, the
> second one: CDAT will be implemented soon. And, CDAT will use the same
> method to translate from read/write bandwidth/latency to adistance. So,
> I suggest to:
>
> - Keep the general interface (and notifier chain), for HMAT and soon
> available CDAT
>
> - Move the code to translate from read/write bandwidth/latency to
> adistance to memory-tiers.c. Which is used by HMAT now and will be
> used by CDAT soon. And it can be used by other drivers.
>
> What do you think about that?

That sounds great. I had kinda assumed CDAT was around the corner, and
was "the other driver" I was thinking of hence the suggestion to make it
a bit more general (or alternatively ignore that and make it specific,
but you've already done the work to make it general so happy to keep
it).

BTW for what it's worth I still think it might be best to have the
notifiers return bandwidth/latency numbers. In practice though that
would actually make my life harder so I'm happy to see where the above
takes us.