Re: [PATCH v3] percpu_counter: add a cmpxchg-based _add_batch variant

From: Mateusz Guzik
Date: Sun May 26 2024 - 08:45:37 EST


On Wed, May 22, 2024 at 04:52:00PM -0700, Dennis Zhou wrote:
> On Wed, May 22, 2024 at 06:59:02AM +0200, Mateusz Guzik wrote:
> > +	do {
> > +		if (unlikely(abs(count + amount) >= batch)) {
> > +			raw_spin_lock_irqsave(&fbc->lock, flags);
> > +			/*
> > +			 * Note: by now we might have migrated to another CPU
> > +			 * or the value might have changed.
> > +			 */
> > +			count = __this_cpu_read(*fbc->counters);
> > +			fbc->count += count + amount;
> > +			__this_cpu_sub(*fbc->counters, count);
> > +			raw_spin_unlock_irqrestore(&fbc->lock, flags);
> > +			return;
> > +		}
> > +	} while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));

For completeness I'll add that Vlastimil suggested creating an inline
variant of the fast path of the routine, which does look reasonable now
that it is 2 branches. Should it land, the reworked routine I posted
would serve as the slowpath.

I hacked it up (inlined below for reference).

This of course creates a tradeoff in terms of func call cost vs i-cache
footprint. bloat-o-meter on a debian-based kernel config says:
Total: Before=24584638, After=24592309, chg +0.03%

I don't know if that automatically disqualifies the thing.
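
For context on where the growth comes from: the convenience wrappers in
the header all funnel into percpu_counter_add_batch(), roughly like this
(quoting the existing header from memory):

static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
	percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
}

static inline void percpu_counter_inc(struct percpu_counter *fbc)
{
	percpu_counter_add(fbc, 1);
}

With the inline variant below every such call site carries its own copy
of the fast path, which is where the extra text comes from.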

At the moment the negative lookup test I used to bench the cmpxchg
change fails to register a win due to numerous artificial slowdowns in
that codepath (most notably memory alloc/free being incredibly slow).

That is to say at the moment I cannot justify submitting the inline
variant for inclusion. If someone has a test which bounces off of the
routine they are welcome to bench it and share if it helps.
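
A minimal sketch of the kind of negative lookup loop meant here (purely
an illustration, not the exact harness behind the remarks above; the
path name and iteration count are arbitrary):

#include <unistd.h>

int main(void)
{
	/* failing lookup in a tight loop; time it externally, e.g. with perf stat */
	for (long i = 0; i < 100000000L; i++)
		access("/tmp/does-not-exist", F_OK);	/* always ENOENT */
	return 0;
}

Run before and after the change, something along these lines should show
whether the inlined fast path is measurable.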

I do however intend to whack out some of the problems in the test I was
running; by the end of that process the inline variant may provide a
measurable win, in which case I'll send a proper patch.

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 3a44dd1e33d2..b0038811b69f 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -52,9 +52,21 @@ static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 	percpu_counter_destroy_many(fbc, 1);
 }
 
-void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+void percpu_counter_add_batch_slow(struct percpu_counter *fbc, s64 amount, s32 batch);
+static inline void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+	s64 count = this_cpu_read(*fbc->counters);
+	if (unlikely(abs(count + amount) >= batch ||
+		     !this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount)))
+		percpu_counter_add_batch_slow(fbc, amount, batch);
+}
+#else
 void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
+#endif
+
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
 bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit,
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index c3140276bb36..80fa1aa7a1bb 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -90,7 +90,7 @@ EXPORT_SYMBOL(percpu_counter_set);
  * 1. the fast path uses local cmpxchg (note: no lock prefix)
  * 2. the slow path operates with interrupts disabled
  */
-void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
+void percpu_counter_add_batch_slow(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
 	unsigned long flags;
@@ -111,6 +111,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 		}
 	} while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));
 }
+EXPORT_SYMBOL(percpu_counter_add_batch_slow);
 #else
 /*
  * local_irq_save() is used to make the function irq safe:
@@ -134,8 +135,8 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 	}
 	local_irq_restore(flags);
 }
-#endif
 EXPORT_SYMBOL(percpu_counter_add_batch);
+#endif
 
 /*
  * For percpu_counter with a big batch, the devication of its count could