Re: [PATCH 2/2] lib/percpu_counter: fix dying cpu compare race

From: yebin (H)
Date: Tue Apr 04 2023 - 02:54:34 EST

On 2023/4/4 10:50, Yury Norov wrote:
On Tue, Apr 04, 2023 at 09:42:06AM +0800, Ye Bin wrote:
From: Ye Bin <yebin10@xxxxxxxxxx>

In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
condition between a cpu dying and percpu_counter_sum() iterating online
CPUs was identified.
Actually, the same race condition exists between a cpu dying and
__percpu_counter_compare(), which uses 'num_online_cpus()' for its quick
judgment. 'num_online_cpus()' is decreased before 'percpu_counter_cpu_dead()'
is called, so during that window the quick judgment may return an incorrect
result.
To solve this issue, also count dying CPUs when doing the quick judgment
in __percpu_counter_compare().
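
(For illustration, a rough sketch of the window being described, with a
hypothetical interleaving; the dying cpu's pending per-cpu delta can be up
to 'batch', but it is no longer covered by 'batch * num_online_cpus()':)

    cpu teardown                        __percpu_counter_compare()
    ------------                        --------------------------
    set_cpu_online(cpu, false)
                                        count = percpu_counter_read(fbc);
                                        /* tolerance shrinks, but cpu's
                                           per-cpu delta is still pending */
    percpu_counter_cpu_dead(cpu)
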
Not sure I completely understood the race you are describing. All CPU
accounting is protected with percpu_counters_lock. Is it a real race
that you've faced, or hypothetical? If it's real, can you share stack
traces?
Signed-off-by: Ye Bin <yebin10@xxxxxxxxxx>
---
lib/percpu_counter.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 5004463c4f9f..399840cb0012 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
return 0;
}
+static __always_inline unsigned int num_count_cpus(void)
This doesn't look like a good name. Maybe num_offline_cpus?

+{
+#ifdef CONFIG_HOTPLUG_CPU
+ return (num_online_cpus() + num_dying_cpus());
         ^                                    ^
'return' is not a function. Parentheses are not needed.

Generally speaking, a sequence of atomic operations is not an atomic
operation, so the above doesn't look correct. I don't think that it
would be possible to implement raceless accounting based on 2 separate
counters.

Yes, there is indeed a concurrency issue with doing so here. But I saw that
the hotplug path first sets the CPU in cpu_dying_mask and only then decreases
the number of online CPUs. So the total may be larger than the actual value,
which at worst makes the comparison fall back to the slow path. This won't
cause any problems.
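
(A rough sketch of that ordering, again a hypothetical interleaving:)

    cpu teardown                        __percpu_counter_compare()
    ------------                        --------------------------
    set_cpu_dying(cpu, true)
                                        num_online_cpus()  /* still counts cpu */
                                        num_dying_cpus()   /* counts cpu again */
    set_cpu_online(cpu, false)

The dying cpu is briefly counted twice, so 'batch * num_count_cpus()' can
only be larger than needed, and the worst case is an unnecessary fall back
to the exact slow path.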


Most probably, you'd have to use the same approach as in 8b57b11cca88:

lock();
for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask)
	cnt++;
unlock();

And if so, I'd suggest to implement cpumask_weight_or() for that.
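
(A minimal sketch of such a helper, using only the existing
for_each_cpu_or() iterator; a production version would more likely add a
bitmap-level primitive instead of looping:)

static inline unsigned int cpumask_weight_or(const struct cpumask *srcp1,
					     const struct cpumask *srcp2)
{
	unsigned int cpu, w = 0;

	/* Count each cpu present in either mask exactly once. */
	for_each_cpu_or(cpu, srcp1, srcp2)
		w++;
	return w;
}

The fast-path check would then become
'batch * cpumask_weight_or(cpu_online_mask, cpu_dying_mask)'.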

+#else
+ return num_online_cpus();
+#endif
+}
+
/*
* Compare counter against given value.
* Return 1 if greater, 0 if equal and -1 if less
@@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
count = percpu_counter_read(fbc);
/* Check to see if rough count will be sufficient for comparison */
- if (abs(count - rhs) > (batch * num_online_cpus())) {
+ if (abs(count - rhs) > (batch * num_count_cpus())) {
if (count > rhs)
return 1;
else
--
2.31.1