Re: [PATCH v2] mm/vmstat: fold stranded per-cpu node stats when a node comes online
From: Andrew Morton
Date: Sat Jun 27 2026 - 19:10:21 EST
On Sat, 27 Jun 2026 16:22:43 -0400 Gregory Price <gourry@xxxxxxxxxx> wrote:
> A per-node vmstat counter is pgdat->vm_stat[] plus per-cpu deltas.
> A balanced counter can sit split as global=+N / per-cpu=-N.
>
> The folds reconciling the split only walk online nodes, so when
> try_offline_node() marks a node offline the per-cpu deltas are stranded.
>
> A subsequent online resets the per-cpu area but not pgdat->vm_stat[],
> orphaning the +N permanently. All NR_VM_NODE_STAT_ITEMS are affected.
Geeze, simple mistake, been there ten years...
> The existing code zeroes the per-cpu counters and causes a permanent
> skew. Fold the stranded deltas instead, before the node rejoins the
> online set. The node is not online yet and the hotplug lock is held,
> so the remote access to per-cpu values is safe.
Oh. Shouldn't we be doing this during offlining?
> Discovered when node compaction hung for a nearly empty node, as the
> math to determine throttling broke. Reproduced by repeated memory
> hotplug/unplug cycles on a node under pressure: NR_ISOLATED_ANON
> ratchets up and never returns to zero.
>
> ...
>
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1536,7 +1536,7 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
> {
> int nid = pgdat->node_id;
> enum zone_type z;
> - int cpu;
> + int cpu, i;
>
> pgdat_init_internals(pgdat);
>
> @@ -1554,10 +1554,17 @@ void __ref free_area_init_core_hotplug(struct pglist_data *pgdat)
> pgdat->node_start_pfn = 0;
> pgdat->node_present_pages = 0;
>
> - for_each_online_cpu(cpu) {
> - struct per_cpu_nodestat *p;
> + /*
> + * Hot-unplug can leave per-cpu vmstat deltas unfolded (folders skip
> + * offline nodes) - reconcile this at online. Foreign access to counters
> + * is safe: the node is not online yet and we hold the hotplug lock.
> + */
> + for_each_possible_cpu(cpu) {
That's a lot of CPUs
> + struct per_cpu_nodestat *p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
>
> - p = per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);
> + for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
and that's a lot of items.
I guess the overall loop count won't be large enough to cause issues,
but it's large!
Perhaps there's some simple test we can do on the per_cpu_nodestat to
avoid the inner loop? Perhaps might need to add a field for this?
btw, "for(int i..." is allowed nowadays. It'll make this code nicer, IMO.
And... Sashiko seems to have found a pre-existing issue:
https://sashiko.dev/#/patchset/20260627202243.758289-1-gourry@xxxxxxxxxx
> + if (p->vm_node_stat_diff[i])
> + node_page_state_add(p->vm_node_stat_diff[i], pgdat, i);
> memset(p, 0, sizeof(*p));
> }