Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
From: Bharata B Rao
Date: Mon Oct 06 2025 - 01:57:40 EST
On 03-Oct-25 6:08 PM, Jonathan Cameron wrote:
> On Wed, 10 Sep 2025 20:16:53 +0530
> Bharata B Rao <bharata@xxxxxxx> wrote:
>
>> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
>> mode of NUMA Balancing) does hot page detection (via hint faults),
>> hot page classification and eventual promotion, all by itself and
>> sits within the scheduler.
>>
>> With the new hot page tracking and promotion mechanism being
>> available, NUMA Balancing can limit itself to detection of
>> hot pages (via hint faults) and off-load rest of the
>> functionality to the common hot page tracking system.
>>
>> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
>> hot page info. In addition, the migration rate limiting and
>> dynamic threshold logic are moved to kpromoted so that the same
>> can be used for hot pages reported by other sources too.
>>
>> Signed-off-by: Bharata B Rao <bharata@xxxxxxx>
>
> Making a direct replacement without any fallback to previous method
> is going to need a lot of data to show there are no important regressions.
>
> So bold move if that's the intent!

Firstly, I am only moving the existing hot page heuristics that are part of
NUMAB=2 to kpromoted so that the same heuristics can be applied to hot pages
identified by other sources. The hint fault mechanism that is inherent
to NUMAB=2 still remains.

In fact, the kscand effort started as a potential replacement for the existing
hot page promotion mechanism by getting rid of hint faults and moving the
page table scanning out of process context.

In any case, I will start including numbers from the next post.

>>
>> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
>>
>> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
>> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
>
> If the comment correlates with the value, this is 64 GiB/s? That seems
> unlikely if I guess possible.

IIUC, the existing logic tries to limit the promotion rate to 64 GiB/s by
capping the number of candidate pages that are promoted within the
1s observation interval.

Are you saying that achieving a rate of 64 GiB/s is not possible,
or just unlikely?
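
To make the arithmetic concrete, here is a minimal userspace sketch of how a
limit given in MB/s can be turned into a per-second page budget and checked
against a 1s window. It is a simplified model and not the kernel code: the
names rate_limit_mbps_to_pages(), struct promo_window and promo_rate_limited()
are hypothetical, MB is treated as MiB, and the page shift is passed in
explicitly (12 for 4 KiB pages). With those assumptions, 65536 MB/s works out
to 16777216 pages/s, i.e. 64 GiB/s.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; names do not exist in the kernel. */
#define MODEL_MSEC_PER_SEC 1000u

/* Convert a limit in MB/s (treated as MiB/s) into pages per second. */
static uint64_t rate_limit_mbps_to_pages(uint64_t mbps, unsigned int page_shift)
{
	assert(page_shift <= 20);	/* a page must not exceed 1 MiB here */
	return mbps << (20 - page_shift);
}

struct promo_window {
	uint64_t start_ms;	/* start of the current observation window */
	uint64_t nr_cand;	/* running total of candidate pages */
	uint64_t nr_cand_start;	/* snapshot of nr_cand at window start */
};

/*
 * Account nr new candidate pages and report whether promotion should be
 * throttled. Once more than 1s has passed, a new window is opened and the
 * snapshot is reset to the current total.
 */
static bool promo_rate_limited(struct promo_window *w, uint64_t now_ms,
			       uint64_t limit_pages, uint64_t nr)
{
	w->nr_cand += nr;
	if (now_ms - w->start_ms > MODEL_MSEC_PER_SEC) {
		w->start_ms = now_ms;
		w->nr_cand_start = w->nr_cand;
	}
	return w->nr_cand - w->nr_cand_start >= limit_pages;
}
```

In this model the limit only bounds how many candidates are accepted per
window; it says nothing about whether the hardware can actually sustain
that migration rate.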
>
>> +
>> #ifdef CONFIG_SYSCTL
>> static const struct ctl_table pghot_sysctls[] = {
>> 	{
>> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
>> 		.proc_handler	= proc_dointvec_minmax,
>> 		.extra1		= SYSCTL_ZERO,
>> 	},
>> +	{
>> +		.procname	= "pghot_promote_rate_limit_MBps",
>> +		.data		= &sysctl_pghot_promote_rate_limit,
>> +		.maxlen		= sizeof(unsigned int),
>> +		.mode		= 0644,
>> +		.proc_handler	= proc_dointvec_minmax,
>> +		.extra1		= SYSCTL_ZERO,
>> +	},
>> };
>> #endif
>> +
> Put that in earlier patch to reduce noise here.

This patch moves the hot page heuristics to kpromoted, so the related
sysctl is moved along with them in the same patch.
>
>> static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
>> {
>> return (*(struct pghot_info **)lhs)->frequency >
>> @@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
>> return true;
>> }
>>
>> +/*
>> + * For memory tiering mode, if there are enough free pages (more than
>> + * enough watermark defined here) in fast memory node, to take full
>
> I'd use enough_wmark Just because "more than enough" is a common
> English phrase and I at least tripped over that sentence as a result!

Ah, I see that. But as you note later, I am currently only moving the
existing code as-is.
>
>> + * advantage of fast memory capacity, all recently accessed slow
>> + * memory pages will be migrated to fast memory node without
>> + * considering hot threshold.
>> + */
>> +static bool pgdat_free_space_enough(struct pglist_data *pgdat)
>> +{
>> +	int z;
>> +	unsigned long enough_wmark;
>> +
>> +	enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
>> +			   pgdat->node_present_pages >> 4);
>> +	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>> +		struct zone *zone = pgdat->node_zones + z;
>> +
>> +		if (!populated_zone(zone))
>> +			continue;
>> +
>> +		if (zone_watermark_ok(zone, 0,
>> +				      promo_wmark_pages(zone) + enough_wmark,
>> +				      ZONE_MOVABLE, 0))
>> +			return true;
>> +	}
>> +	return false;
>> +}
>
>> +
>> +static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
>
> Needs documentation of the algorithm and the reasons for various choices.
>
> I see it is a code move though so maybe that's a job for another day.

Sure.

Regards,
Bharata.