Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
From: Jonathan Cameron
Date: Mon Oct 06 2025 - 05:53:51 EST
On Mon, 6 Oct 2025 11:27:21 +0530
Bharata B Rao <bharata@xxxxxxx> wrote:
> On 03-Oct-25 6:08 PM, Jonathan Cameron wrote:
> > On Wed, 10 Sep 2025 20:16:53 +0530
> > Bharata B Rao <bharata@xxxxxxx> wrote:
> >
> >> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
> >> mode of NUMA Balancing) does hot page detection (via hint faults),
> >> hot page classification and eventual promotion, all by itself and
> >> sits within the scheduler.
> >>
> >> With the new hot page tracking and promotion mechanism being
> >> available, NUMA Balancing can limit itself to detection of
> >> hot pages (via hint faults) and off-load rest of the
> >> functionality to the common hot page tracking system.
> >>
> >> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
> >> hot page info. In addition, the migration rate limiting and
> >> dynamic threshold logic are moved to kpromoted so that the same
> >> can be used for hot pages reported by other sources too.
> >>
> >> Signed-off-by: Bharata B Rao <bharata@xxxxxxx>
> >
> > Making a direct replacement without any fallback to previous method
> > is going to need a lot of data to show there are no important regressions.
> >
> > So bold move if that's the intent!
>
> Firstly, I am only moving the existing hot page heuristics that are part of
> NUMAB=2 to kpromoted so that the same can be applied to hot pages being
> identified by other sources. So the hint fault mechanism that is inherent
> to NUMAB=2 still remains.
That makes sense.
>
> In fact, kscand effort started as a potential replacement for the existing
> hot page promotion mechanism by getting rid of hint faults and moving the
> page table scanning out of process context.
Understood, and I'm in favor of that approach, but I'm not sure it will be
a fit for all workloads.
>
> In any case, I will start including numbers from the next post.
Great.
> >>
> >> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
> >>
> >> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
> >> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
> >
> > If the comment correlates with the value, this is 64 GiB/s? That seems
> > unlikely, though I guess possible.
>
> IIUC, the existing logic tries to limit promotion rate to 64 GiB/s by
> limiting the number of candidate pages that are promoted within the
> 1s observation interval.
>
> Are you saying that achieving the rate of 64 GiB/s is not possible
> or unlikely?
Seems rather too high to me, but maybe I just have the wrong mental model
of what we should be moving.
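
For reference, the arithmetic behind that number can be sketched as a small
standalone C snippet. The helper names below are hypothetical and not part of
the patch; the shift is modeled on what I believe the existing NUMAB=2 rate
limit does (MB/s << (20 - PAGE_SHIFT)), and the 4 KiB page size in the usage
note is an assumption:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: convert a promotion rate
 * limit given in MB/s (treated as MiB/s) into pages per second.
 * page_shift is log2 of the page size, e.g. 12 for 4 KiB pages.
 */
static uint64_t mbps_to_pages_per_sec(unsigned int limit_mbps,
				      unsigned int page_shift)
{
	/* 1 MiB is 2^20 bytes, so one MiB holds 2^(20 - page_shift) pages. */
	assert(page_shift <= 20);
	return (uint64_t)limit_mbps << (20 - page_shift);
}

/* Hypothetical helper: express the same limit in whole GiB/s. */
static unsigned int mbps_to_gib_per_sec(unsigned int limit_mbps)
{
	return limit_mbps >> 10;	/* 1024 MiB per GiB */
}
```

With the default of 65536 and 4 KiB pages (page_shift = 12), this works out to
64 GiB/s, or 16777216 candidate pages per 1s window for each target node.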
>
> >
> >> +
> >> #ifdef CONFIG_SYSCTL
> >> static const struct ctl_table pghot_sysctls[] = {
> >> {
> >> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
> >> .proc_handler = proc_dointvec_minmax,
> >> .extra1 = SYSCTL_ZERO,
> >> },
> >> + {
> >> + .procname = "pghot_promote_rate_limit_MBps",
> >> + .data = &sysctl_pghot_promote_rate_limit,
> >> + .maxlen = sizeof(unsigned int),
> >> + .mode = 0644,
> >> + .proc_handler = proc_dointvec_minmax,
> >> + .extra1 = SYSCTL_ZERO,
> >> + },
> >> };
> >> #endif
> >> +
> > Put that in earlier patch to reduce noise here.
>
> This patch moves the hot page heuristics to kpromoted and hence this
> related sysctl is also being moved in this patch.
I just mean the blank line - not the block above.
This is just a patch set tidy-up comment.
Jonathan