Re: [RFC PATCH v2] Introduce Hierarchical Per-CPU Counters
From: Mathieu Desnoyers
Date: Tue Apr 08 2025 - 15:30:02 EST
On 2025-04-08 12:37, Christoph Lameter (Ampere) wrote:
> On Tue, 8 Apr 2025, Mathieu Desnoyers wrote:
>> - Minimize contention when incrementing and decrementing counters,
>> - Provide fast access to a sum approximation,
> In general I like this as an abstraction of the Zoned VM counters in
> vmstat.c that will make the scalable counters there useful elsewhere.
I'm glad you like it! :)
>> It aims at fixing the per-mm RSS tracking which has become too
>> inaccurate for OOM killer purposes on large many-core systems [1].
> There are numerous cases where these issues occur. I know of a few where
> I could use something like this.
I'm looking forward to seeing it used.
>> The hierarchical per-CPU counters propagate a sum approximation through
>> a binary tree. When reaching the batch size, the carry is propagated
>> through a binary tree which consists of log2(nr_cpu_ids) levels. The
>> batch size for each level is twice the batch size of the prior level.
> A binary tree? Could we do this N-way? Otherwise the tree will be 8 levels
> on a 512 cpu machine. Given the inflation of the number of cpus, this
> scheme had better work up to 8K cpus.
That's a good point. We should probably tune the tree n-arity to limit
the tree depth to about 5. Here is a table showing a possible n-arity
choice for each power-of-2 nr_cpu_ids:
nr_cpu_ids that can be represented at depth N for tree n-arity
2, 4, 8, and 16:
 N       2^N           4^N                8^N                    16^N
---------------------------------------------------------------------
 0         1             1                  1                       1
 1         2             4                  8                      16
 2         4            16                 64                     256
 3         8            64                512                    4096
 4        16           256               4096                 *65536*
 5        32          1024            *32768*                 1048576
 6        64          4096             262144                16777216
 7       128       *16384*            2097152               268435456
 8       256         65536           16777216              4294967296
 9       512        262144          134217728             68719476736
10      1024       1048576         1073741824           1099511627776
11      2048       4194304         8589934592          17592186044416
12      4096      16777216        68719476736         281474976710656
13    *8192*      67108864       549755813888        4503599627370496
14     16384     268435456      4398046511104       72057594037927936
15     32768    1073741824     35184372088832     1152921504606846976
16     65536    4294967296    281474976710656    18446744073709551616
nr_levels is the number of tree levels for carry propagation
(excludes the final accumulation counter).
nr_cpu_ids  n-arity  nr_levels
         2        2          1
         4        2          2
         8        2          3
        16        4          3
        32        4          3
        64        4          3
       128        8          3
       256        8          3
       512        8          3
      1024        8          4
      2048        8          4
      4096        8          4
      8192       16          4
     16384       16          4
     32768       16          4
     65536       16          4
    131072       16          5
    262144       16          5
    524288       16          5
   1048576       16          5
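
For reference, the number of carry propagation levels for a given
n-arity generalizes the log2(nr_cpu_ids) depth of the binary tree to
ceil(log_arity(nr_cpu_ids)). A quick sketch of that computation (the
helper name is just illustrative, this is not the proposed
implementation):

/*
 * Number of carry propagation levels needed for an n-ary tree to
 * cover nr_cpus counters, i.e. ceil(log_arity(nr_cpus)).
 */
static unsigned int tree_nr_levels(unsigned int nr_cpus, unsigned int arity)
{
        unsigned int nr_levels = 0;
        unsigned long long nodes = 1;

        while (nodes < nr_cpus) {
                nodes *= arity;
                nr_levels++;
        }
        return nr_levels;
}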
>> +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
>> +int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
>> +int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);
> Precise? Concurrent counter updates can occur while determining the global
> value. People may get confused.
If counters are incremented or decremented concurrently, of course a "sum"
reader will either observe the value before or after the increment.
This is inherent to reading a counter concurrently with its updates.
However, the same can be said of reading a global counter concurrently
with its updates.
If you introduce an external synchronization mechanism that makes the
updaters quiescent (e.g. RCU) before reading the counter value, then
you have a sum which is guaranteed to include all prior increments/decrements.
So using the term "precise" does not appear misleading to me from that
perspective.
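
For example (sketch only; percpu_counter_tree_add() below is a
stand-in name for the updater-side API, and RCU is just one possible
quiescence mechanism):

/* Updater side: perform updates within an RCU read-side critical section. */
static void tracked_inc(struct percpu_counter_tree *counter)
{
        rcu_read_lock();
        percpu_counter_tree_add(counter, 1);    /* stand-in updater API name */
        rcu_read_unlock();
}

/* Reader side: wait for in-flight updaters, then take the precise sum. */
static int settled_sum(struct percpu_counter_tree *counter)
{
        synchronize_rcu();
        return percpu_counter_tree_precise_sum(counter);
}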
> Also maybe there would be a need for a function to collapse the values into
> the global if e.g. a cpu goes offline, or in order to switch off OS
> activities on a cpu.
Currently percpu_counter_tree_precise_sum_unbiased() iterates on each
possible cpu, which does not require cpu hotplug integration.
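
In other words, roughly the following (purely illustrative: the field
names are made up and the intermediate levels are collapsed into a
single accumulator for brevity):

struct percpu_counter_tree_sketch {
        int __percpu *items;    /* per-cpu residual values */
        atomic_t carried;       /* carries already propagated upwards */
};

static int precise_sum_sketch(struct percpu_counter_tree_sketch *counter)
{
        int cpu, sum = atomic_read(&counter->carried);

        /*
         * Offline CPUs keep their residuals, so visiting every possible
         * CPU is sufficient without any hotplug notifier.
         */
        for_each_possible_cpu(cpu)
                sum += READ_ONCE(*per_cpu_ptr(counter->items, cpu));
        return sum;
}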
Indeed, as an improvement, we could integrate with cpu hotplug notifiers
and move the offlining CPU counters to a global accumulator, but this would
require the precise sum to synchronize against concurrent CPU hotplug. That
would allow the precise sum iteration to only consider online cpus.
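
Hypothetically, that integration could be hooked into the cpu hotplug
state machine along these lines (sketch only, not part of this series;
lib/percpu_counter.c does something similar for its counters with a
"dead" callback):

/* Fold the dead CPU's contribution into a global accumulator. */
static int percpu_counter_tree_cpu_dead(unsigned int cpu)
{
        /* Move @cpu's residual counts into a global accumulator here. */
        return 0;
}

static int __init percpu_counter_tree_hotplug_init(void)
{
        int ret;

        ret = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
                                        "lib/percpu_counter_tree:dead", NULL,
                                        percpu_counter_tree_cpu_dead);
        return ret < 0 ? ret : 0;
}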
Note that the hierarchical counter tree algorithms for both the
increment/decrement and for approximate/precise sums are conveniently
lock-free at the moment.
We'd have to consider whether it's preferable to keep lock-freedom or
introduce locking for the precise sum to iterate on fewer CPUs.
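
For reference, here is roughly what the lock-free add with carry
propagation looks like conceptually (hand-wavy sketch: the types, field
names and exact carry handling below are made up, not the actual code):

struct counter_node_sketch {
        atomic_long_t count;                    /* residual at this level */
        struct counter_node_sketch *parent;     /* NULL at the top level */
};

struct counter_tree_sketch {
        struct counter_node_sketch __percpu *level0;    /* per-cpu leaves */
        unsigned int batch;                             /* level-0 batch size */
        atomic_long_t approx_sum;                       /* root accumulation */
};

static void counter_tree_add_sketch(struct counter_tree_sketch *counter, long inc)
{
        unsigned int batch = counter->batch;
        struct counter_node_sketch *node;

        node = get_cpu_ptr(counter->level0);    /* this CPU's leaf */
        while (node) {
                long res = atomic_long_add_return(inc, &node->count);
                long carry = res / (long)batch; /* whole batches to push up */

                if (!carry)
                        break;
                /* Keep the remainder at this level, push the carry upwards. */
                atomic_long_sub(carry * (long)batch, &node->count);
                if (!node->parent) {
                        /* Top level: fold the carry into the approximate sum. */
                        atomic_long_add(carry * (long)batch, &counter->approx_sum);
                        break;
                }
                inc = carry * (long)batch;
                node = node->parent;
                batch <<= 1;                    /* batch doubles at each level */
        }
        put_cpu_ptr(counter->level0);
}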
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com