On Wed, 07 Sep 2022 09:01:53 -0700 Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
On Thu, 2022-09-08 at 01:25 +0800, Jiebin Sun wrote:
The msg_bytes and msg_hdrs atomic counters are frequently
updated when the IPC msg queue is in heavy use, causing heavy
cache bouncing and overhead. Changing them to percpu_counter
greatly improves the performance. Since there is one percpu
struct per namespace, the additional memory cost is minimal.
Reading of the count is done in the msgctl call, which is
infrequent, so the need to sum up the per-CPU counts is infrequent.
Apply the patch and test with pts/stress-ng-1.4.0
-- system v message passing (160 threads).
Score gain: 3.17x
+/* large batch size could reduce the times to sum up percpu counter */
+#define MSG_PERCPU_COUNTER_BATCH 1024
+

Jiebin,
1024 is a small batch size (1/4 page).
The local per-CPU counter could overflow to the global count quickly
if it is limited to this size, since our count tracks msg size.
I'll suggest something larger, say 8*1024*1024, about
8MB, to accommodate about 2 large pages' worth of data. Maybe that
will further improve throughput on stress-ng by reducing contention
on adding to the global count.
I think this concept of a percpu_counter_add() which is massively
biased to the write side and with very rare reading is a legitimate
use-case. Perhaps it should become an addition to the formal interface.
Something like
/*
 * comment goes here
 */
static inline void percpu_counter_add_local(struct percpu_counter *fbc,
					    s64 amount)
{
	percpu_counter_add_batch(fbc, amount, INT_MAX);
}
and percpu_counter_sub_local(), I guess.
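Presumably the subtraction counterpart would just negate the amount and
reuse the same path. The sketch below exercises that pattern against a
minimal user-space mock of the interface (the mock types and the trivial
add_batch body are mine, only there to make the helper standalone); it is
not a tested kernel patch.

```c
#include <limits.h>

/* user-space mock of the kernel interface, just to exercise the helper */
typedef long long s64;
struct percpu_counter { s64 count; };

static void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
				     s64 batch)
{
	(void)batch;            /* the mock ignores batching entirely */
	fbc->count += amount;
}

/* the guessed-at counterpart: negate and reuse the add-side path */
static inline void percpu_counter_sub_local(struct percpu_counter *fbc,
					    s64 amount)
{
	percpu_counter_add_batch(fbc, -amount, INT_MAX);
}
```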
The only instance I can see is
block/blk-cgroup-rwstat.h:blkg_rwstat_add(), which uses INT_MAX/2
because it always uses percpu_counter_sum_positive() on the read side.
But that makes two!