On Wed, 07 Sep 2022 09:01:53 -0700 Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
On Thu, 2022-09-08 at 01:25 +0800, Jiebin Sun wrote:
The msg_bytes and msg_hdrs atomic counters are frequently
updated when the IPC msg queue is in heavy use, causing heavy
cache bouncing and overhead. Changing them to percpu_counter
greatly improves the performance. Since there is one percpu
struct per namespace, the additional memory cost is minimal.
Reading of the count is done in the msgctl call, which is
infrequent, so the need to sum up the per-CPU counts is infrequent.
Apply the patch and test with pts/stress-ng-1.4.0
-- system v message passing (160 threads).
Score gain: 3.17x
+/* large batch size could reduce the times to sum up percpu counter */
+#define MSG_PERCPU_COUNTER_BATCH 1024
+

Jiebin,
1024 is a small batch size (1/4 page).
The local per-CPU counter could overflow to the global count quickly
if it is limited to this size, since our count tracks msg size.
I'll suggest something larger, say 8*1024*1024, about
8MB, to accommodate about 2 large pages' worth of data. Maybe that
will further improve throughput on stress-ng by reducing contention
on adding to the global count.
I think this concept of a percpu_counter_add() which is massively
biased to the write side and with very rare reading is a legitimate
use-case. Perhaps it should become an addition to the formal interface.
Something like
/*
 * comment goes here
 */
static inline void percpu_counter_add_local(struct percpu_counter *fbc,
					    s64 amount)
{
	percpu_counter_add_batch(fbc, amount, INT_MAX);
}
and percpu_counter_sub_local(), I guess.
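Presumably the subtraction counterpart would just negate the amount and
reuse the same path. The sketch below exercises that pattern against a
minimal user-space mock of the interface (the mock types and the trivial
add_batch body are mine, only there to make the helper standalone); it is
not a tested kernel patch.

```c
#include <limits.h>

/* user-space mock of the kernel interface, just to exercise the helper */
typedef long long s64;
struct percpu_counter { s64 count; };

static void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
				     s64 batch)
{
	(void)batch;            /* the mock ignores batching entirely */
	fbc->count += amount;
}

/* the guessed-at counterpart: negate and reuse the add-side path */
static inline void percpu_counter_sub_local(struct percpu_counter *fbc,
					    s64 amount)
{
	percpu_counter_add_batch(fbc, -amount, INT_MAX);
}
```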
The only instance I can see is
block/blk-cgroup-rwstat.h:blkg_rwstat_add(), which uses INT_MAX/2
because it always uses percpu_counter_sum_positive() on the read side.
But that makes two!