Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)

From: Li Wang

Date: Thu Apr 02 2026 - 05:32:41 EST

Hi Michal,

> Hello Waiman and Li.
> ...
> The explanation seems [1] to just pick a function because log seemed too
> slow.
>
> (We should add a BPF hook to calculate the threshold. Haha, Date:)
>
> The threshold has twofold role: to bound error and to preserve some
> performance thanks to laziness and these two go against each other when
> determining the threshold. The reasoning for linear scaling is that
> _each_ CPU contributes some updates so that preserves the laziness.
> Whereas error capping would hint to no dependency on nr_cpus.
>
> My idea is that a job associated to a selected memcg doesn't necessarily
> run on _all_ CPUs of (such big) machines but effectively cause updates
> on J CPUs. (Either they're artificially constrained or they simply are
> not-so-parallel jobs.)

> Hence the threshold should be based on that J and not actual nr_cpus.

I completely agree on this point.

> Now the question is what is expected (CPU) size of a job and for that
> I'd would consider a distribution like:
> - 1 job of size nr_cpus, // you'd overcommit your machine with bigger job
> - 2 jobs of size nr_cpus/2,
> - 3 jobs of size nr_cpus/3,
> - ...
> - nr_cpus jobs of size 1. // you'd underutilize the machine with fewer
>
> Note this is quite naïve and arbitrary deliberation of mine but it
> results in something like Pareto distribution which is IMO quite
> reasonable. With (only) that assumption, I can estimate the average size
> of jobs like
> nr_cpus / (log(nr_cpus) + 1)
> (it's natural logarithm from harmonic series and +1 is from that
> approximation too, it comes handy also on UP)
>
> log(x) = ilog2(x) * log(2)/log(e) ~ ilog2(x) * 0.69
> log(x) ~ 45426 * ilog2(x) / 65536
>
> or
> 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
>
>
> with kernel functions:
> var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
> var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
> var3 = roundup_pow_of_two(var2)
>
> I hope I don't need to present any more numbers at this moment because
> the parameter derivation is backed by solid theory ;-) [*]
> [*]

It is a elegant method but still not based on the J CPUs.

As you capture the core tension: bounding error wants the threshold
as small as possible, while preserving laziness wants it as large as
possible. Any scheme is a compromise between the two.

But there has several practical issues:

The threshold formula is system-wide, while each memcg has its own counter,
they all evaluate against the same MEMCG_CHARGE_BATCH * f(nr_cpu_ids),
with no awareness of how many CPUs are actually active for that particular
memcg. Small tasks with J=2 coexist with large services where J approaches
nr_cpus, yet they all face the same threshold. The ln-harmonic formula
optimizes for the average J, but workloads that most critically need
accurate memory.stat are precisely those spanning many CPUs, well above
average.

Moreover, the "average J" estimate assumes tasks are uniformly distributed
across CPUs, which rarely holds in practice with cpuset constraints, NUMA
affinity, and nested cgroup hierarchies. And even accepting that estimate,
the data shows ln-harmonic still yields 237MB of error at 2048 CPUs with
64K pages — still large enough to cause selftest failures.

In short: the theoretical analysis is sound, but the conclusion conflates
average case with worst case. Under the constraint of a single global
threshold, sqrt remains the more robust choice.

In future, if the J-sensory threshold per-memcg can be achieved, then your
ln-harmonic method is the most ideal formula.

To compare the three methods (linear, sqrt, ln-harmonic):

4K page size (BATCH=64):

CPUs linear sqrt ln-var3
--------------------------------
1 256KB 256KB 256KB
2 512KB 512KB 512KB
4 1MB 512KB 512KB
8 2MB 768KB 1MB
16 4MB 1MB 2MB
32 8MB 1.25MB 2MB
64 16MB 2MB 4MB
128 32MB 2.75MB 8MB
256 64MB 4MB 16MB
512 128MB 5.5MB 32MB
1024 256MB 8MB 64MB
2048 512MB 11.25MB 64MB

64K page size (BATCH=16):

CPUs linear sqrt ln-var3
--------------------------------
1 1MB 1MB 1MB
2 2MB 2MB 2MB
4 4MB 2MB 2MB
8 8MB 3MB 4MB
16 16MB 4MB 8MB
32 32MB 5MB 8MB
64 64MB 8MB 16MB
128 128MB 11MB 32MB
256 256MB 16MB 64MB
512 512MB 22MB 128MB
1024 1GB 32MB 256MB
2048 2GB 45MB 256MB

--
Regards,
Li Wang