Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval

In reply to: Vlastimil Babka (SUSE): "Re: [PATCH] mm/vmstat: spread vmstat_update requeue across the stat interval"

From: Dmitry Ilvokhin

Date: Thu Apr 02 2026 - 08:43:49 EST

On Wed, Apr 01, 2026 at 07:46:35PM +0200, Vlastimil Babka (SUSE) wrote:

[...]

> > +/*
> > + * Return a per-cpu delay that spreads vmstat_update work across the stat
> > + * interval. Without this, round_jiffies_relative() aligns every CPU's
> > + * timer to the same second boundary, causing a thundering-herd on
> > + * zone->lock when multiple CPUs drain PCP pages simultaneously via
> > + * decay_pcp_high() -> free_pcppages_bulk().
> > + */
> > +static unsigned long vmstat_spread_delay(void)
> > +{
> > + unsigned long interval = sysctl_stat_interval;
> > + unsigned int nr_cpus = num_online_cpus();
> > +
> > + if (nr_cpus <= 1)
> > + return round_jiffies_relative(interval);
> > +
> > + /*
> > + * Spread per-cpu vmstat work evenly across the interval. Don't
> > + * use round_jiffies_relative() here -- it would snap every CPU
> > + * back to the same second boundary, defeating the spread.
> > + */
> > + return interval + (interval * (smp_processor_id() % nr_cpus)) / nr_cpus;
>
> Hm doesn't this mean that lower id cpus will consistently fire in shorter
> intervals and higher id in longer intervals? What we want is same interval
> but differently offset, no?

Yes, I think that's a valid concern: because the per-cpu term is added
on every requeue, this effectively skews the interval rather than just
introducing a phase offset.

I initially thought this might explain the increase in max wait, but it
turns out the columns were just swapped.

Spreading the initial scheduling and then requeueing with a constant
interval sounds like a reasonable alternative, e.g. below.