Re: [RFC PATCH v6 4/5] mm: pghot: Precision mode for pghot

From: Bharata B Rao

Date: Thu Mar 26 2026 - 06:43:32 EST

Replies to review from Shashiko
[https://sashiko.dev/#/patchset/20260323095104.238982-1-bharata%40amd.com]

>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 4aeab6aee535..14383bb1d890 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1485,6 +1485,17 @@ config PGHOT
>> This adds 1 byte of metadata overhead per page in lower-tier
>> memory nodes.
>>
>> +config PGHOT_PRECISE
>> + bool "Hot page tracking precision mode"
>> + def_bool n

> This isn't a bug, but having both bool with a prompt and def_bool is
> redundant and can be confusing. Should this use default n instead?

Sure.

>> + depends on PGHOT
>> + help
>> + Enables precision mode for tracking hot pages with pghot sub-system.
>[ ... ]
>> diff --git a/mm/pghot-precise.c b/mm/pghot-precise.c
>> new file mode 100644
>> index 000000000000..9e8007adfff9
>> --- /dev/null
>> +++ b/mm/pghot-precise.c
>> @@ -0,0 +1,81 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * pghot: Precision mode
>> + *
>> + * 4 byte hotness record per PFN (u32)
>> + * NID, time and frequency tracked as part of the record.
>> + */
>> +
>> +#include <linux/pghot.h>
>> +#include <linux/jiffies.h>
>> +
>> +bool pghot_nid_valid(int nid)
>> +{
>> + /*
>> + * TODO: Add node_online() and node_is_toptier() checks?
>> + */
>> + if (nid != NUMA_NO_NODE && (nid < 0 || nid >= PGHOT_NID_MAX))
>> + return false;

> Could this validation allow node IDs that exceed the system's actual
> MAX_NUMNODES?

Unlikely as the nid gets passed down from other sub-systems like NUMA Balancing
etc.

> If PGHOT_NID_MAX is 1023 but the system supports fewer nodes, could passing
> a large node ID bypass this check and cause out-of-bounds accesses in NUMA
> node arrays like node_data or node_states?
> Also, is there an off-by-one regression here? PGHOT_NID_MAX is the maximum
> value that fits in the bitfield, so should this check be nid > PGHOT_NID_MAX
> to allow the maximum node ID?

Yes, will adjust this in the next iteration.

>> +
>> + return true;
>> +}
>> +
>> +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time)
>> +{
>> + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK);
>> +}

> Does this timestamp wraparound allow cold pages to be improperly promoted?
> If a page is accessed exactly at intervals matching the wraparound of the
> 14-bit time mask (e.g., every 16.384 seconds at HZ=1000), the latency would
> evaluate to 0. Would this bypass the reset of the frequency counter and
> falsely trigger migration?

The hotness timestamp is stored in a limited number of bits (14 bits in precise
mode, 5 bucketed bits in default mode), which means it can only represent a
finite window of time before wrapping around — approximately 16 seconds for
precise mode and 4 seconds for default mode at HZ=1000. Latency calculations
rely on modular subtraction of these truncated timestamps. This produces the
correct elapsed time as long as the true interval between two events is shorter
than the wrap period. When the interval exceeds the wrap period, there could be
approximations.

In practice, this should be acceptable because the system has two independent
gates that a page must pass through before migration actually occurs:

1. Access frequency gate (recording time): pghot_update_record() determines
whether consecutive accesses to a page fall within the same frequency-counting
window. A page must be accessed at least pghot_freq_threshold times within
sysctl_pghot_freq_window (default 5 seconds) to be marked migration-ready.

2. Recency gate (migration time): kmigrated_should_migrate_memory() checks
whether the page's last recorded access is recent enough relative to the current
time before actually initiating migration.

>> +
>> +bool pghot_update_record(phi_t *phi, int nid, unsigned long now)
>> +{
>> + phi_t freq, old_freq, hotness, old_hotness, old_time;
>> + phi_t time = now & PGHOT_TIME_MASK;
>> +
>> + nid = (nid == NUMA_NO_NODE) ? pghot_target_nid : nid;
>> + old_hotness = READ_ONCE(*phi);
>> +
>> + do {
>> + bool new_window = false;
>> +
>> + hotness = old_hotness;
>> + old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
>> + old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
>> +
>> + if (pghot_access_latency(old_time, time) > sysctl_pghot_freq_window)
>> + new_window = true;

> Is there a missing upper bound on the sysctl_pghot_freq_window configuration?
> The time is tracked in a 14-bit field, which represents a maximum latency of
> 16,383 jiffies. If a user configures a window larger than this maximum
> representable latency, could this condition permanently evaluate to false,
> causing sparse accesses to accumulate over arbitrarily long periods?

I can add an upper bound check.

Regards,
Bharata.