Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy

From: Aneesh Kumar K . V
Date: Tue Feb 20 2024 - 02:47:12 EST


"Huang, Ying" <ying.huang@xxxxxxxxx> writes:

> "Aneesh Kumar K.V" <aneesh.kumar@xxxxxxxxxx> writes:
>
>> On 2/20/24 12:06 PM, Huang, Ying wrote:
>>> Donet Tom <donettom@xxxxxxxxxxxxx> writes:
>>>
>>>> On 2/19/24 17:37, Michal Hocko wrote:
>>>>> On Sat 17-02-24 01:31:35, Donet Tom wrote:
>>>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
>>>>>> nodes") added support for migrate on protnone reference with MPOL_BIND
>>>>>> memory policy. This allowed numa fault migration when the executing node
>>>>>> is part of the policy mask for MPOL_BIND. This patch extends migration
>>>>>> support to MPOL_PREFERRED_MANY policy.
>>>>>>
>>>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
>>>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use
>>>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
>>>>>> the kernel should not allocate pages from the slower memory tier via
>>>>>> allocation control zonelist fallback. Instead, we should move cold pages
>>>>>> from the faster memory node via memory demotion. For a page allocation,
>>>>>> kswapd is only woken up after we try to allocate pages from all nodes in
>>>>>> the allocation zone list. This implies that, without using memory
>>>>>> policies, we will end up allocating hot pages in the slower memory tier.
>>>>>>
>>>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
>>>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
>>>>>> allocation control when we have memory tiers in the system. With
>>>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
>>>>>> of faster memory nodes. When we fail to allocate pages from the faster
>>>>>> memory node, kswapd would be woken up, allowing demotion of cold pages
>>>>>> to slower memory nodes.
>>>>>>
>>>>>> With the current kernel, such usage of memory policies implies we can't
>>>>>> do page promotion from a slower memory tier to a faster memory tier
>>>>>> using numa fault. This patch fixes this issue.
>>>>>>
>>>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>>>>> mask, we allow numa migration to the executing node. If the executing
>>>>>> node is not in the policy node mask but the folio is already allocated
>>>>>> based on policy preference (the folio node is in the policy node mask),
>>>>>> we don't allow numa migration. If both the executing node and folio node
>>>>>> are outside the policy node mask, we allow numa migration to the
>>>>>> executing node.
>>>>> The feature makes sense to me. How has this been tested? Do you have any
>>>>> numbers to present?
>>>>
>>>> Hi Michal
>>>>
>>>> I have a test program which allocates memory on a specified node and
>>>> triggers promotion or migration (by repeatedly accessing the pages).
>>>>
>>>> Without this patch, if we set MPOL_PREFERRED_MANY, promotion or migration
>>>> does not happen. With this patch, I can see pages getting migrated or
>>>> promoted.
>>>>
>>>> My system has 2 CPU+DRAM nodes (Tier 1) and 1 PMEM node (Tier 2). Below
>>>> are my test results.
>>>>
>>>> In the tables below, N0 and N1 are Tier 1 nodes and N6 is the Tier 2 node.
>>>> Exec_Node is the execution node, Policy is the set of nodes in the
>>>> nodemask, and "Curr Location Pages" is the node where the pages reside
>>>> before migration or promotion starts.
>>>>
>>>> Test Results
>>>> ------------------
>>>> Scenario 1: The executing node is in the policy node mask
>>>> ================================================================================
>>>> Exec_Node    Policy       Curr Location Pages    Observations
>>>> ================================================================================
>>>> N0           N0 N1 N6     N1                     Pages Migrated from N1 to N0
>>>> N0           N0 N1 N6     N6                     Pages Promoted from N6 to N0
>>>> N0           N0 N1        N1                     Pages Migrated from N1 to N0
>>>> N0           N0 N1        N6                     Pages Promoted from N6 to N0
>>>>
>>>> Scenario 2: The folio node is in the policy node mask and the executing node is not
>>>> ================================================================================
>>>> Exec_Node    Policy       Curr Location Pages    Observations
>>>> ================================================================================
>>>> N0           N1 N6        N1                     Pages are not Migrating to N0
>>>> N0           N1 N6        N6                     Pages are not Migrating to N0
>>>> N0           N1           N1                     Pages are not Migrating to N0
>>>>
>>>> Scenario 3: Both the folio node and the executing node are outside the policy node mask
>>>> ================================================================================
>>>> Exec_Node    Policy       Curr Location Pages    Observations
>>>> ================================================================================
>>>> N0           N1           N6                     Pages Promoted from N6 to N0
>>>> N0           N6           N1                     Pages Migrated from N1 to N0
>>>>
>>>
>>> Please use some benchmarks (e.g., redis + memtier) and show the
>>> proc-vmstat stats and benchmark score.
>>
>>
>> Without this change, numa fault migration is not supported with the
>> MPOL_PREFERRED_MANY policy, so there is no performance comparison with and
>> without the patch. The effectiveness of numa fault migration is a different
>> topic from this patch.
>
> IIUC, the goal of the patch is to optimize performance, right? If so,
> the benchmark score will help justify the change.
>

The objective is to enable the use of the MPOL_PREFERRED_MANY policy,
which is essential for the correct functioning of memory demotion in
conjunction with memory promotion. Once we can use memory promotion, we
should be able to observe the same benefits as those provided by numa
fault memory promotion. The actual benefit of numa fault migration is
dependent on various factors such as the speed of the slower memory
device, the access pattern of the application, etc. We are discussing
its effectiveness and how to improve numa fault overhead in other
forums. However, we believe that this discussion should not hinder the
merging of this patch.

This change is similar to commit bda420b98505 ("numa balancing: migrate
on fault among multiple bound nodes").
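
For reference, the rule from the patch description can be modelled in a
few lines of userspace C. This is only an illustrative sketch, not the
kernel code: the nodemask is modelled as a 64-bit bitmap, and the names
nodemask_model_t, NODE(), node_in_mask(), preferred_many_allows_migration()
and check_reported_scenarios() are all hypothetical. The early return for
a folio that already sits on the executing node is an assumption added for
completeness. The checks at the end encode the rows of the test tables
quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit n of the mask stands for NUMA node n (0..63). */
typedef uint64_t nodemask_model_t;

#define NODE(n) ((nodemask_model_t)1 << (n))

static bool node_in_mask(int nid, nodemask_model_t mask)
{
	return (mask >> nid) & 1u;
}

/*
 * Returns true if a folio on folio_nid may be migrated to exec_nid on a
 * numa hint fault, following the three cases in the patch description.
 */
static bool preferred_many_allows_migration(int exec_nid, int folio_nid,
					    nodemask_model_t policy)
{
	/* Assumption: nothing to do if the folio is already local. */
	if (folio_nid == exec_nid)
		return false;

	/* Case 1: executing node is in the policy mask, migrate to it. */
	if (node_in_mask(exec_nid, policy))
		return true;

	/* Case 2: folio already placed according to the policy, keep it. */
	if (node_in_mask(folio_nid, policy))
		return false;

	/* Case 3: both nodes are outside the policy mask, migrate. */
	return true;
}

/* Encodes the reported observations (N0/N1 = Tier 1, N6 = Tier 2). */
static void check_reported_scenarios(void)
{
	/* Scenario 1: executing node in the policy mask. */
	assert(preferred_many_allows_migration(0, 1, NODE(0) | NODE(1) | NODE(6)));
	assert(preferred_many_allows_migration(0, 6, NODE(0) | NODE(1) | NODE(6)));
	assert(preferred_many_allows_migration(0, 1, NODE(0) | NODE(1)));
	assert(preferred_many_allows_migration(0, 6, NODE(0) | NODE(1)));

	/* Scenario 2: folio node in the mask, executing node not. */
	assert(!preferred_many_allows_migration(0, 1, NODE(1) | NODE(6)));
	assert(!preferred_many_allows_migration(0, 6, NODE(1) | NODE(6)));
	assert(!preferred_many_allows_migration(0, 1, NODE(1)));

	/* Scenario 3: both nodes outside the mask. */
	assert(preferred_many_allows_migration(0, 6, NODE(1)));
	assert(preferred_many_allows_migration(0, 1, NODE(6)));
}
```

Calling check_reported_scenarios() from a small main() is enough to
confirm that the model agrees with every row of the three tables.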

-aneesh