Re: [RFC PATCH v2] Introduce Hierarchical Per-CPU Counters

From: Mathieu Desnoyers
Date: Tue Apr 08 2025 - 15:40:38 EST


On 2025-04-08 13:41, Liam R. Howlett wrote:
* Matthew Wilcox <willy@xxxxxxxxxxxxx> [250408 13:03]:
On Tue, Apr 08, 2025 at 09:37:18AM -0700, Christoph Lameter (Ampere) wrote:
The hierarchical per-CPU counters propagate a sum approximation through
a binary tree. When the batch size is reached, the carry is propagated
up the tree, which consists of log2(nr_cpu_ids) levels. The batch size
for each level is twice the batch size of the prior level.

A binary tree? Could we do this N-way? Otherwise the tree will be 8 levels
on a 512 cpu machine. Given the inflation of the number of cpus this
scheme better work up to 8K cpus.

I find that a fan-out somewhere between 8 and 16 works well in practice.
log16(512) gives a 3 level tree as does a log8 tree. log16(8192) is a 4
level tree whereas log8(8192) is a 5 level tree. Not a big difference
either way.

Somebody was trying to persuade me that a new tree type that maintained
additional information at each level of the tree to make some operations
log(log(N)) would be a better idea than a B-tree that is log(N). I
countered that a wider tree made the argument unsound at any size tree
up to 100k. And we don't tend to have _that_ many objects in a
data structure inside the kernel.

I still maintain vEB trees are super cool, but I am glad we didn't try
to implement an RCU safe version.


ceil(log14(100,000)) = 5
ceil(log2(log2(100,000))) = 5

at a million, there's actually a gap, 6 vs 5. But at that point constant
factors matter far more than the scalability argument.

In retrospect, it seems more of a math win than a practical win - and
really only in the big-O bounds. Beyond what Willy points out, writes
rippling up the tree should be a concern for most users, since they
force readers to restart and slow the writers down - but probably not
here (hot plug?).

This implementation of hierarchical per-cpu counters is lock-free
for increment/decrement *and* for precise/approximate sums.

The increment/decrement use:

- this_cpu_add_return on the fast-path,
- atomic_add_return_relaxed for intermediate levels carry propagation,
- atomic_add for approximate sum updates.

The precise sum iterates over all possible CPUs, loading their current
values. The approximate sum is simply a single load of the current
approximate-sum value.
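
To make that concrete, here is a rough, self-contained userspace sketch
of the scheme (illustrative only: the names, the fixed fan-out of 2 and
the carry bookkeeping are simplifications, not the API or exact
algorithm of the patch). It stands in for this_cpu_add_return() with a
relaxed fetch-add on a per-CPU slot, propagates batch-sized carries up a
binary tree with the batch doubling at each level, and shows why the
approximate sum is a single load while the precise sum walks the per-CPU
values:

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS  8    /* power of two, so the tree has log2(8) = 3 levels */
#define LEVELS   3
#define BATCH    32   /* level-0 (per-CPU) batch; doubled at each level above */

static _Atomic long pcpu[NR_CPUS];          /* stands in for the per-CPU counters */
static _Atomic long node[LEVELS][NR_CPUS];  /* tree nodes (array oversized for simplicity) */
static _Atomic long approx_sum;             /* approximate sum, updated in top-level batches */

/* Whole batches crossed when going from old to new (positive counts only). */
static long batches_crossed(long old, long new, long batch)
{
	return (new / batch - old / batch) * batch;
}

static void counter_add(int cpu, long v)
{
	/* Fast path: per-CPU add (this_cpu_add_return() in the kernel). */
	long new = atomic_fetch_add_explicit(&pcpu[cpu], v, memory_order_relaxed) + v;
	long carry = batches_crossed(new - v, new, BATCH);
	long batch = BATCH;
	int idx = cpu;

	for (int lvl = 0; lvl < LEVELS && carry; lvl++) {
		idx /= 2;        /* parent node in the binary tree */
		batch *= 2;      /* batch size doubles at each level */
		/* Intermediate levels: atomic_add_return_relaxed() in the kernel. */
		new = atomic_fetch_add_explicit(&node[lvl][idx], carry,
						memory_order_relaxed) + carry;
		carry = batches_crossed(new - carry, new, batch);
	}
	if (carry)               /* Root: atomic_add() in the kernel. */
		atomic_fetch_add_explicit(&approx_sum, carry, memory_order_relaxed);
}

/* Approximate sum: a single load. */
static long counter_read_approx(void)
{
	return atomic_load_explicit(&approx_sum, memory_order_relaxed);
}

/* Precise sum: iterate over all (possible) CPUs, loading their current values. */
static long counter_read_precise(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += atomic_load_explicit(&pcpu[cpu], memory_order_relaxed);
	return sum;
}

int main(void)
{
	for (int i = 0; i < 10000; i++)
		counter_add(i % NR_CPUS, 1);
	printf("approx=%ld precise=%ld\n",
	       counter_read_approx(), counter_read_precise());
	return 0;
}

The sketch assumes positive counts and a power-of-two number of CPUs
purely to keep the carry arithmetic trivial; increments and decrements
are both lock-free in the actual implementation, as noted above.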

So I'm unsure about your concern of writers restarting readers, because
this tree does not rely on mutual exclusion between updaters and
readers, nor does it rely on cmpxchg-based retry mechanisms in readers.

I agree with you that updates going all the way up the tree may
negatively affect the updater and approximate sum reader performance due
to bouncing of the counter cache line across CPUs.
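
Note, though, that the doubling of the batch size at each level bounds
how often those shared cache lines are written: a node near the top of
the tree only sees a carry once per many level-0 batches worth of
increments from its whole subtree, with the interval doubling at each
level. As a back-of-the-envelope example (assuming a level-0 batch of
32, purely for illustration), with 64 CPUs the top of the tree would
only be written roughly once per thousand counted events across the
machine, so the most widely shared cache lines are also the least
frequently written ones.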


Working in (multiples of) cacheline sized b-tree nodes makes the most
sense, in my experience.

I'm confused. Can you explain how this recommendation can practically
apply to the hierarchical counters ?

Thanks!

Mathieu


Thanks,
Liam



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com