Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval

From: Vlastimil Babka (SUSE)

Date: Wed Apr 08 2026 - 06:24:34 EST


On 4/7/26 17:39, Breno Leitao wrote:
> On Thu, Apr 02, 2026 at 06:33:17AM -0700, Breno Leitao wrote:
>> > >
>> > > Cool!
>> > >
>> > > I noticed __round_jiffies_relative() exists and the description looks like
>> > > it's meant for exactly this use case?
>> >
>> > On closer look, using round_jiffies_relative() as before your patch
>> > means it calls __round_jiffies_relative(j, raw_smp_processor_id()),
>> > so it's already doing this spread internally. You're also relying on
>> > smp_processor_id(), so it's not about using a different cpu id.
>> >
>> > But your patch has better results, why? I still think it's not doing
>> > what it intends - I think it gives every cpu a different interval
>> > length (up to twice the original length), not a skew. Is it that, or is
>> > the 3 jiffies skew per cpu used in round_jiffies_common()
>> > insufficient? Or is it a bug in its skew implementation?
>> >
>> > Ideally once that's clear, the findings could be used to improve
>> > round_jiffies_common() and hopefully there's nothing here that's vmstat
>> > specific.
>>
>> Excellent observation. I believe there are two key differences:
>>
>> 1) The interval duration now varies per CPU. Specifically, with my
>> proposed change vmstat_update() is scheduled at sysctl_stat_interval*2
>> for the highest CPU, rather than at a uniform sysctl_stat_interval
>> across all CPUs (as you raised in your first email).
>>
>> 2) round_jiffies_relative() applies a 3-jiffies shift per CPU, whereas
>> vmstat_spread_delay() distributes all CPUs across the full one-second
>> interval. (My tests were on HZ=1000)
>>
>> I'll investigate this further to provide more concrete data.
>
> After further investigation, I can confirm that both factors mentioned
> above contribute to the performance improvement.
>
> However, we certainly don't want scenario (1) where the delay varies per
> CPU, resulting in the last CPU having vmstat_update() scheduled every
> 2 seconds instead of 1 second.

Indeed.

> I've implemented a patch following Dmitry's suggestion, and the
> performance gains are measurable.
>
> Here's my testing methodology:
>
> 1) Use ftrace to measure the execution time of refresh_cpu_vm_stats()
> * Applied a custom instrumentation patch [1]
>
> 2) Execute stress-ng:
> * stress-ng --vm 72 --vm-bytes 11256M --vm-method all --timeout 60s ; cat /sys/kernel/debug/tracing/trace
>
> 3) Parse the output using a Python script [2]
>
> While the results are not as dramatic as initially reported (since
> approach (1) performed well but was incorrect), the improvement is
> still substantial:
>
>
> ┌─────────┬────────────┬────────────┬───────┐
> │ Metric │ upstream* │ fix** │ Delta │
> ├─────────┼────────────┼────────────┼───────┤
> │ samples │ 36,981 │ 37,267 │ ~same │
> ├─────────┼────────────┼────────────┼───────┤
> │ avg │ 31,511 ns │ 21,337 ns │ -32% │
> ├─────────┼────────────┼────────────┼───────┤
> │ p50 │ 2,644 ns │ 2,925 ns │ ~same │
> ├─────────┼────────────┼────────────┼───────┤
> │ p99 │ 382,083 ns │ 304,357 ns │ -20% │
> ├─────────┼────────────┼────────────┼───────┤
> │ max │ 72.6 ms │ 16.0 ms │ -78% │
> └─────────┴────────────┴────────────┴───────┘

So you have 72 cpus, the vmstat interval is 1s, and what's the CONFIG_HZ?
If it's 1000, an even spread means ~13 jiffies per cpu. Would changing the
round_jiffies_common() implementation to add 13 jiffies per cpu instead of 3
have the same effect?

>
> * Upstream is based on linux-next commit f3e6330d7fe42 ("Add linux-next specific files for 20260407")
> ** "fix" contains the patch below:
>
> Link: https://github.com/leitao/linux/commit/ac200164df1bda45ee8504cc3db5bff5b696245e [1]
> Link: https://github.com/leitao/linux/commit/baa2ea6ea4c4c2b1df689de6db0a2a6f119e51be [2]
>
>
> commit 41b7aaa1a51f07fc1f0db0614d140fbca78463d3
> Author: Breno Leitao <leitao@xxxxxxxxxx>
> Date: Tue Apr 7 07:56:35 2026 -0700
>
> mm/vmstat: spread per-cpu vmstat work to reduce zone->lock contention
>
> vmstat_shepherd() queues all per-cpu vmstat_update work with zero delay,
> and vmstat_update() re-queues itself with round_jiffies_relative(), which
> clusters timers near the same second boundary due to the small per-CPU
> spread in round_jiffies_common(). On many-CPU systems this causes
> thundering-herd contention on zone->lock when multiple CPUs
> simultaneously call refresh_cpu_vm_stats() -> decay_pcp_high() ->
> free_pcppages_bulk().
>
> Introduce vmstat_spread_delay() to assign each CPU a unique offset
> distributed evenly across sysctl_stat_interval. The shepherd uses this
> when initially queuing per-cpu work, and vmstat_update re-queues with a
> plain sysctl_stat_interval to preserve the spread (round_jiffies_relative
> would snap CPUs back to the same boundary).
>
> Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>

I think this approach could have the following problems:

- the initial spread of the delays can drift over time; there's nothing
keeping them in sync
- not using round_jiffies_relative() means firing at times different from
other timers that do use the rounding, so this could work against the
power-saving effect of the rounding
- it's a vmstat-specific workaround for a yet-unclear underlying
suboptimality that's likely not vmstat specific

>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 3704f6ca7a268..8d93eee3b1f75 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2040,6 +2040,22 @@ static int vmstat_refresh(const struct ctl_table *table, int write,
> }
> #endif /* CONFIG_PROC_FS */
>
> +/*
> + * Return a per-cpu initial delay that spreads vmstat_update work evenly
> + * across the stat interval, so that CPUs do not all fire at the same
> + * second boundary.
> + */
> +static unsigned long vmstat_spread_delay(int cpu)
> +{
> + unsigned long interval = sysctl_stat_interval;
> + unsigned int nr_cpus = num_online_cpus();
> +
> + if (nr_cpus <= 1)
> + return 0;
> +
> + return (interval * (cpu % nr_cpus)) / nr_cpus;
> +}
> +
> static void vmstat_update(struct work_struct *w)
> {
> if (refresh_cpu_vm_stats(true)) {
> @@ -2047,10 +2063,13 @@ static void vmstat_update(struct work_struct *w)
> * Counters were updated so we expect more updates
> * to occur in the future. Keep on running the
> * update worker thread.
> + * Avoid round_jiffies_relative() here -- it would snap
> + * every CPU back to the same second boundary, undoing
> + * the initial spread from vmstat_shepherd.
> */
> queue_delayed_work_on(smp_processor_id(), mm_percpu_wq,
> this_cpu_ptr(&vmstat_work),
> - round_jiffies_relative(sysctl_stat_interval));
> + sysctl_stat_interval);
> }
> }
>
> @@ -2148,7 +2167,8 @@ static void vmstat_shepherd(struct work_struct *w)
> continue;
>
> if (!delayed_work_pending(dw) && need_update(cpu))
> - queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> + queue_delayed_work_on(cpu, mm_percpu_wq, dw,
> + vmstat_spread_delay(cpu));
> }
>
> cond_resched();