Re: [RFC PATCH v6 3/5] mm: Hot page tracking and promotion - pghot

From: Bharata B Rao

Date: Mon Apr 27 2026 - 01:25:26 EST


On 24-Apr-26 6:27 PM, Donet Tom wrote:
>> +int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>> +{
>> +    struct mem_section *ms;
>> +    struct folio *folio;
>> +    phi_t *phi, *hot_map;
>> +    struct page *page;
>> +
>> +    if (!kmigrated_started)
>> +        return 0;
>> +
>> +    if (!pghot_nid_valid(nid))
>> +        return -EINVAL;
>> +
>> +    switch (src) {
>> +    case PGHOT_HINTFAULTS:
>> +        if (!static_branch_unlikely(&pghot_src_hintfaults))
>> +            return 0;
>> +        count_vm_event(PGHOT_RECORDED_HINTFAULTS);
>> +        break;
>> +    case PGHOT_HWHINTS:
>> +        if (!static_branch_unlikely(&pghot_src_hwhints))
>> +            return 0;
>> +        count_vm_event(PGHOT_RECORDED_HWHINTS);
>> +        break;
>> +    default:
>> +        return -EINVAL;
>> +    }
>> +
>> +    /*
>> +     * Record only accesses from lower tiers.
>> +     */
>> +    if (node_is_toptier(pfn_to_nid(pfn)))
>> +        return 0;
>
>
> Just a thought—could we check this at the beginning of the function, before the
> switch case?

I am accumulating two stats here: How many hot page intimations pghot obtained
that are attributable to different sources

and

out of them, how many turned out to be actionable

>
>
>> +
>> +    /*
>> +     * Reject the non-migratable pages right away.
>> +     */
>> +    page = pfn_to_online_page(pfn);
>> +    if (!page || is_zone_device_page(page))
>> +        return 0;
>> +
>> +    folio = page_folio(page);
>> +    if (!folio_try_get(folio))
>> +        return 0;
>> +
>> +    if (unlikely(page_folio(page) != folio))
>> +        goto out;
>> +
>> +    if (!folio_test_lru(folio))
>> +        goto out;
>> +
>> +    /* Get the hotness slot corresponding to the 1st PFN of the folio */
>> +    pfn = folio_pfn(folio);
>> +    ms = __pfn_to_section(pfn);
>> +    if (!ms || !ms->hot_map)
>> +        goto out;
>> +
>> +    hot_map = (phi_t *)(((unsigned long)(ms->hot_map)) &
>> ~PGHOT_SECTION_HOT_MASK);
>> +    phi = &hot_map[pfn % PAGES_PER_SECTION];
>> +
>> +    count_vm_event(PGHOT_RECORDED_ACCESSES);

which is this ^

>> +static void kmigrated_do_work(pg_data_t *pgdat)
>> +{
>> +    unsigned long section_nr, s_begin, start_pfn;
>> +    struct mem_section *ms;
>> +    int nid;
>> +
>> +    clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
>> +    s_begin = next_present_section_nr(-1);
>> +    for_each_present_section_nr(s_begin, section_nr) {
>> +        start_pfn = section_nr_to_pfn(section_nr);
>
>
> I may be missing something, but in pghot_setup_hot_map() and kmigrated_do_work()
> we seem to iterate over all memory sections. On large memory systems, could this
> become a bottleneck right?
>
> Since hot_map is allocated only for lower-tier memory and the hotness
> information is primarily used there, would it make sense to skip scanning
> higher-tier sections?
>
> for_each_online_node(nid) {
>         if (node_is_toptier(nid))
>             continue;
>
>         start_pfn = node_start_pfn(nid);
>         end_pfn = node_end_pfn(nid);
>
>         s_begin = pfn_to_section_nr(start_pfn);
>         for_each_present_section_nr(s_begin, section_nr) {
>     }
> }
>
> Would this approach be reasonable, or am I overlooking something?

I didn't just yet optimize the walk. Since there is one kmigrated thread per
lower tier, this routine already is aware of which node to scan. We can limit
the section walk to that node instead. Something like this:

static void kmigrated_do_work(pg_data_t *pgdat)
{
unsigned long section_nr, s_begin, start_pfn, end_pfn;
struct mem_section *ms;
int nid = pgdat->node_id;

start_pfn = SECTION_ALIGN_DOWN(node_start_pfn(nid));
end_pfn = SECTION_ALIGN_UP(start_pfn + node_end_pfn(nid));

for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
section_nr = pfn_to_section_nr(pfn);

if (!present_section_nr(section_nr))
continue;

ms = __nr_to_section(section_nr);

...

kmigrated_walk_zone(pfn, pfn + PAGES_PER_SECTION, nid);
}
}

>> +static int pghot_online_sec_hotmap(unsigned long start_pfn,
>> +                   unsigned long nr_pages)
>> +{
>> +    int nid = pfn_to_nid(start_pfn);
>> +    unsigned long start, end, pfn;
>> +    struct mem_section *ms;
>> +    int fail = 0;
>> +
>> +    start = SECTION_ALIGN_DOWN(start_pfn);
>> +    end = SECTION_ALIGN_UP(start_pfn + nr_pages);
>> +
>> +    for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
>> +        ms = __pfn_to_section(pfn);
>> +        if (!ms || ms->hot_map)
>> +            continue;
>> +
>> +        fail = pghot_alloc_hot_map(ms, nid);
>
> I may be missing something, but after pghot_alloc_hot_map fails, we continue the
> loop. Would it make sense to break and go to the cleanup logic instead?

There is a !fail check in the for-loop due to which we break when alloc fails.

Regards,
Bharata.