Re: [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot

From: Bharata B Rao

Date: Mon Mar 30 2026 - 00:46:36 EST


Replying to review comments from Shashiko
[https://sashiko.dev/#/patchset/20260323095104.238982-1-bharata%40amd.com]

>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index b298cba853ab..fe957ff91df9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>[ ... ]
>> @@ -2217,11 +2218,26 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>> goto out_map;
>>
>> nid = folio_nid(folio);
>> + needs_promotion = folio_is_promo_candidate(folio);
>>
>> target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
>> &last_cpupid);
>> if (target_nid == NUMA_NO_NODE)
>> goto out_map;

> Does this code inadvertently record pages forbidden from migrating in pghot?
> If numa_migrate_check() returns NUMA_NO_NODE due to memory policy, the code
> jumps to out_map. At that point, nid retains its initialized value of
> folio_nid(folio). Since nid != NUMA_NO_NODE evaluates to true, it looks like
> pghot_record_access() will still be called at the end of the function, which
> could pollute the tracking map and trigger useless daemon sweeps.

While for NUMAB1 we need to call task_numa_fault() just to record the fault
stats, the equivalent isn't necessary for NUMAB2. I will make the call to
pghot_record_access() conditional to handle this scenario.
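Roughly along these lines at the end of the fault handler (untested sketch;
`migrate_denied` is a hypothetical local tracking whether numa_migrate_check()
returned NUMA_NO_NODE, and pfn/src/now stand in for whatever the existing
call site already passes):

```
	/*
	 * Skip pghot recording when numa_migrate_check() denied
	 * migration (e.g. due to memory policy), so that
	 * non-migratable pages don't pollute the tracking map.
	 */
	if (nid != NUMA_NO_NODE && !migrate_denied)
		pghot_record_access(pfn, nid, src, now);
```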

>> +
>> + if (needs_promotion) {
>> + /*
>> + * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> + * Isolation and migration are handled by pghot.
>> + *
>> + * TODO: mode2 check
>> + */
>> + writable = false;

> Can this cause a regression involving spurious write-protect faults? By forcing
> writable = false here before jumping to out_map, the write permission appears
> to be stripped from the restored mapping even if the page was previously
> writable. If the application access was a write, it might immediately suffer a
> spurious write-protect fault when it retries the instruction.
>> + nid = target_nid;
>> + goto out_map;
>> + }

Since there is no migration from the fault context for NUMAB2, we don't
release the PTL at all. Hence I realize that I don't have to modify
writable and ignore_writable for the hot page promotion case.
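So the NUMAB2 branch would reduce to something like this (untested sketch):

```
	if (needs_promotion) {
		/*
		 * Hot page promotion: isolation and migration are
		 * handled by pghot. The PTL is never dropped on this
		 * path, so writable/ignore_writable stay untouched.
		 */
		nid = target_nid;
		goto out_map;
	}
```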

>[ ... ]
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2f815a34d924..289fa6c07a42 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>[ ... ]
>> @@ -6086,16 +6087,31 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>> goto out_map;
>>
>> nid = folio_nid(folio);
>> + needs_promotion = folio_is_promo_candidate(folio);
>> nr_pages = folio_nr_pages(folio);
>>
>> target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
>> writable, &last_cpupid);
>> if (target_nid == NUMA_NO_NODE)
>> goto out_map;

> Similarly to do_huge_pmd_numa_page(), could this also inadvertently record
> pages forbidden from migrating when jumping to out_map with an unmodified
> nid?

Same as above.

>> - if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +
>> + if (needs_promotion) {
>> + /*
>> + * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> + * Isolation and migration are handled by pghot.
>> + */
>> + writable = false;
>> + ignore_writable = true;

> Would forcing writable = false and ignore_writable = true here also strip
> permissions and cause a performance regression from spurious write-protect
> faults for previously writable pages?

Same as above.

>> + nid = target_nid;
>> + goto out_map;
>> + }
>> +
>> + /* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
>> + if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
>> flags |= TNF_MIGRATE_FAIL;
>> goto out_map;
>> }
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a5f48984ed3e..db6832b4b95b 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>[ ... ]
>> @@ -2796,6 +2808,8 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
>> #ifdef CONFIG_NUMA_BALANCING
>> count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
>> count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
>> +#endif
>> +#ifdef CONFIG_PGHOT
>> + mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
>> #endif
>> }

> Should this use mod_lruvec_state() instead of mod_node_page_state() to
> accurately update both the node and memcg counters? PGPROMOTE_SUCCESS is
> defined in memcg_node_stat_items, so it is exported to the cgroup
> memory.stat interface. Using mod_node_page_state() bypasses memcg accounting,
> which might cause the cgroup-level promotion statistics to silently underflow
> or remain at zero.

Replied to this in one of the previous threads.

>> diff --git a/mm/pghot.c b/mm/pghot.c
>> index 7d7ef0800ae2..3c0ba254ad4c 100644
>> --- a/mm/pghot.c
>> +++ b/mm/pghot.c
>[ ... ]
>> @@ -141,6 +166,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>> return 0;
>> }
>[ ... ]
>> +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
>> + unsigned long time)
>> +{
>> + struct pglist_data *pgdat;
>> + unsigned long rate_limit;
>> + unsigned int th, def_th;
>> + unsigned long now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
>> + unsigned long now = jiffies;

> Could this cause spurious threshold adjustments during millisecond
> wrap-arounds? jiffies_to_msecs() returns a 32-bit unsigned int which is
> zero-extended into the 64-bit unsigned long now_ms. When computing time
> differences like now_ms - start (where start is an unsigned int) in
> kmigrated_promotion_adjust_threshold(), the subtraction is performed in 64-bit arithmetic.
> When the 32-bit counter wraps around, the subtraction could yield a very large
> value that falsely satisfies conditions like > MSEC_PER_SEC or > th_period,
> which could cause a regression in rate-limiting.

It could wrap around, but since cmpxchg updates pgdat->nbp_rl_start to the
post-wrap value, it should be self-correcting. However, I can change now_ms
to unsigned int so the subtraction is done in 32-bit modular arithmetic.

>> +
>> + pgdat = NODE_DATA(nid);
>> + if (pgdat_free_space_enough(pgdat)) {
>[ ... ]
>> +
>> + return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
>> +}
>[ ... ]
>> @@ -218,6 +347,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
>> goto out_next;
>> }
>>
>> + if (!kmigrated_should_migrate_memory(nr, nid, time)) {
>> + folio_put(folio);
>> + goto out_next;
>> + }
>> +

> Does this correctly advance the PFN when encountering large folio tail pages?
> Looking at the rest of kmigrated_walk_zone(), the loop iterator pfn is
> advanced by nr = folio_nr_pages(folio) at the out_next label.
> If the loop lands on a tail page of a large folio (for example, if a
> previous iteration failed a check and incremented by 1), folio_nr_pages()
> returns the size of the entire large folio. Adding the full folio size to a
> tail page's PFN overshoots the end of the folio, potentially skipping valid
> pages of subsequent allocations.
> Would it be safer to advance by
> folio_nr_pages(folio) - folio_page_idx(folio, page)?

We could end up on tail pages, leading to some folios being skipped, but I
think they will be reached in the next pass. Anyway, I will check if your
suggestion can be incorporated without additional overhead.

Regards,
Bharata.