Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval
From: Usama Arif
Date: Wed Apr 01 2026 - 11:51:46 EST
On Wed, 01 Apr 2026 06:57:50 -0700 Breno Leitao <leitao@xxxxxxxxxx> wrote:
> vmstat_update uses round_jiffies_relative() when re-queuing itself,
> which aligns all CPUs' timers to the same second boundary. When many
> CPUs have pending PCP pages to drain, they all call decay_pcp_high() ->
> free_pcppages_bulk() simultaneously, serializing on zone->lock and
> hitting contention.
>
> Introduce vmstat_spread_delay() which distributes each CPU's
> vmstat_update evenly across the stat interval instead of aligning them.
>
> This does not increase the number of timer interrupts — each CPU still
> fires once per interval. The timers are simply staggered rather than
> aligned. Additionally, vmstat_work is DEFERRABLE_WORK, so it does not
> wake idle CPUs regardless of scheduling; the spread only affects CPUs
> that are already active.
>
> `perf lock contention` shows a 7.5x reduction in zone->lock contention
> (872 -> 117 contentions, 199ms -> 81ms total wait) on a 72-CPU aarch64
> system under memory pressure.
>
> Tested on a 72-CPU aarch64 system using stress-ng --vm to generate
> memory allocation bursts. Lock contention was measured with:
>
> perf lock contention -a -b -S free_pcppages_bulk
>
> Results with KASAN enabled:
>
> free_pcppages_bulk contention (KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 872 | 117 |
> | Total wait | 199.43ms | 80.76ms |
> | Max wait | 4.19ms | 35.76ms |
> +--------------+----------+----------+
>
> Results without KASAN:
>
> free_pcppages_bulk contention (no KASAN):
> +--------------+----------+----------+
> | Metric | No fix | With fix |
> +--------------+----------+----------+
> | Contentions | 240 | 133 |
> | Total wait | 34.01ms | 24.61ms |
> | Max wait | 965us | 1.35ms |
> +--------------+----------+----------+
>
> Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>
> ---
> mm/vmstat.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 2370c6fb1fcd..2e94bd765606 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2032,6 +2032,29 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> }
> #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu delay that spreads vmstat_update work across the stat
> + * interval. Without this, round_jiffies_relative() aligns every CPU's
> + * timer to the same second boundary, causing a thundering-herd on
> + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> + * decay_pcp_high() -> free_pcppages_bulk().
> + */
> +static unsigned long vmstat_spread_delay(void)
> +{
> +	unsigned long interval = sysctl_stat_interval;
> +	unsigned int nr_cpus = num_online_cpus();
> +
> +	if (nr_cpus <= 1)
> +		return round_jiffies_relative(interval);
> +
> +	/*
> +	 * Spread per-cpu vmstat work evenly across the interval. Don't
> +	 * use round_jiffies_relative() here -- it would snap every CPU
> +	 * back to the same second boundary, defeating the spread.
> +	 */
> +	return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
> +}
> +
> static void vmstat_update(struct work_struct *w)
> {
> 	if (refresh_cpu_vm_stats(true)) {
> @@ -2042,7 +2065,7 @@ static void vmstat_update(struct work_struct *w)
> 		 */
> 		queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> 				this_cpu_ptr(&vmstat_work),
> -				round_jiffies_relative(sysctl_stat_interval));
> +				vmstat_spread_delay());
This is awesome! Maybe the same needs to be done in vmstat_shepherd() as
well? vmstat_shepherd() still queues work with a delay of 0 on every CPU
for which need_update() returns true in its for_each_online_cpu() loop:

	if (!delayed_work_pending(dw) && need_update(cpu))
		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

So when the shepherd fires, it kicks all dormant CPUs' vmstat workers
simultaneously. Under sustained memory pressure on a large system, the
shepherd fires every sysctl_stat_interval, so I think it could
re-trigger the same zone->lock contention?
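Untested, and the cpu-based offset below is purely illustrative, but
something along these lines in the shepherd's for_each_online_cpu() loop
is what I had in mind:

	for_each_online_cpu(cpu) {
		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

		/*
		 * Stagger the kick for each dormant CPU across the stat
		 * interval instead of queueing everything with delay 0,
		 * so the resulting decay_pcp_high() ->
		 * free_pcppages_bulk() calls don't all pile up on
		 * zone->lock at once.
		 */
		if (!delayed_work_pending(dw) && need_update(cpu))
			queue_delayed_work_on(cpu, mm_percpu_wq, dw,
					(sysctl_stat_interval *
					 (cpu % num_online_cpus())) /
					num_online_cpus());
	}

That would keep the shepherd's own requeue rounded (still a single
aligned timer), while the per-CPU kicks it issues get spread out the
same way your vmstat_update() self-requeue does.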
> 	}
> }
>
>
> ---
> base-commit: cf7c3c02fdd0dfccf4d6611714273dcb538af2cb
> change-id: 20260401-vmstat-048e0feaf344
>
> Best regards,
> --
> Breno Leitao <leitao@xxxxxxxxxx>
>
>