Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)

From: Li Wang

Date: Mon Mar 23 2026 - 08:51:19 EST


On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
>
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs, because the threshold
> is now 64*128 = 8192 update events. With a 4k page size, that means
> changes spanning 32 MB of memory are needed before a synchronous
> flush. It will be even worse with a larger page size like 64k.
>
> To make the output of memory.stat more accurate, it is better to scale
> the threshold up more slowly than linearly with the number of CPUs. The
> int_sqrt() function is a good compromise, as suggested by Li Wang [1].
> An extra 2 is added to make sure that the threshold is doubled for a
> 2-core system. The increase will be slower after that.
>
> With the int_sqrt() scaling, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus(), which may change at
> run time.
>
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between successive runs
> can vary quite a bit. In fact, I have seen time lags of up to tens of
> seconds in some cases. So we cannot rely too much on the hope that
> there will be an asynchronous vmstats flush every 2 seconds. This may
> be something we need to look into.
>
> [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@xxxxxxxxxx/
>
> Suggested-by: Li Wang <liwang@xxxxxxxxxx>
> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> ---
> mm/memcontrol.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 772bac21d155..cc1fc0f5aeea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -548,20 +548,20 @@ struct memcg_vmstats {
> * rstat update tree grow unbounded.
> *
> * 2) Flush the stats synchronously on reader side only when there are more than
> - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> - * only for 2 seconds due to (1).
> + * (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. This
> + * optimization will let stats be out of sync by up to that amount,
> + * but only for up to 2 seconds due to (1).
> */
> static void flush_memcg_stats_dwork(struct work_struct *w);
> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> static u64 flush_last_time;
> +static int vmstats_flush_threshold __ro_after_init;
>
> #define FLUSH_TIME (2UL*HZ)
>
> static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
> {
> - return atomic_read(&vmstats->stats_updates) >
> - MEMCG_CHARGE_BATCH * num_online_cpus();
> + return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
> }
>
> static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
> @@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
>
> memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
> SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> + /*
> + * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
> + * 2 constant is to make sure that the threshold is double for a 2-core
> + * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
> + * number of CPUs reaches the next (n^2 - 2) value.
> + */
> + vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> + (int_sqrt(num_possible_cpus() + 2));
>
> return 0;
> }

Reviewed-by: Li Wang <liwang@xxxxxxxxxx>

--
Regards,
Li Wang