Re: Question about cacheline bouncing with percpu-rwsem and rcu-sync

From: Joel Fernandes
Date: Sun Jun 09 2019 - 17:30:03 EST


On Sun, Jun 09, 2019 at 05:22:26AM -0700, Paul E. McKenney wrote:
> On Sat, Jun 08, 2019 at 08:24:36PM -0400, Joel Fernandes wrote:
> > On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > [snip]
> > > >
> > > > Either way, it would be good for you to just try it. Create a kernel
> > > > module or similar that hammers on percpu_down_read() and percpu_up_read(),
> > > > and empirically check the scalability on a largish system. Then compare
> > > > this to down_read() and up_read().
> > >
> > > Will do! thanks.
> >
> > I created a test for this and the results are quite amazing. It just
> > stresses read lock/unlock for rwsem vs percpu-rwsem.
> > The test was conducted on a dual-socket Intel x86_64 machine with 14
> > cores per socket.
> >
> > Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
> > https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424
>
> Interesting location, but looks functional. ;-)
>
> > Graphs/Results here:
> > https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing
> >
> > The completion time of the test goes up somewhat exponentially with
> > the number of threads for the rwsem case, whereas for percpu-rwsem
> > it stays the same. I could add this data to some of the documentation as
> > well.
>
> Actually, the completion time looks to be pretty close to linear in the
> number of CPUs. Which is still really bad, don't get me wrong.

Sure, yes, on second thought it is more linear than exponential :)
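
For anyone who wants to see the effect without building a kernel, here is a
minimal userspace sketch of the same idea. It is not the test from the commit
linked above and it does not use rwsem or percpu-rwsem at all: it only mimics
their reader fast paths, with one shared atomic counter standing in for the
rwsem reader count and one cache-line-padded counter per thread standing in
for the per-CPU counts. The names run_test(), worker(), struct padded_count,
CACHELINE and MAX_THREADS are made up for this illustration, and the 64-byte
line size is an assumption.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() under strict -std=c11 */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define CACHELINE   64  /* assumed cache line size in bytes */
#define MAX_THREADS 64  /* arbitrary cap for this sketch */

/* One reader count shared by every thread (stand-in for a plain rwsem). */
static atomic_long shared_readers;

/* One reader count per thread, each on its own cache line
 * (stand-in for the per-CPU counts of a percpu-rwsem). */
struct padded_count {
	_Alignas(CACHELINE) atomic_long readers;
};
static_assert(sizeof(struct padded_count) == CACHELINE,
	      "each counter must occupy exactly one cache line");
static struct padded_count local_readers[MAX_THREADS];

struct worker_arg {
	atomic_long *count;
	long loops;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;

	for (long i = 0; i < arg->loops; i++) {
		atomic_fetch_add(arg->count, 1);  /* mimics the read-lock fast path */
		atomic_fetch_sub(arg->count, 1);  /* mimics the read-unlock fast path */
	}
	return NULL;
}

/*
 * Run nthreads workers for `loops` iterations each.
 * per_thread == 0: all workers hit shared_readers.
 * per_thread != 0: worker i hits local_readers[i].
 * Returns elapsed nanoseconds (thread creation included), or -1 on error.
 */
static int64_t run_test(int nthreads, long loops, int per_thread)
{
	pthread_t tid[MAX_THREADS];
	struct worker_arg args[MAX_THREADS];
	struct timespec t0, t1;
	int created = 0, failed = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < nthreads; i++) {
		args[i].count = per_thread ? &local_readers[i].readers
					   : &shared_readers;
		args[i].loops = loops;
		if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0) {
			failed = 1;
			break;
		}
		created++;
	}
	for (int i = 0; i < created; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (failed)
		return -1;
	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
		+ (t1.tv_nsec - t0.tv_nsec);
}
```

Calling run_test(n, loops, 0) and run_test(n, loops, 1) from a small main()
for increasing n, and printing the two times side by side, should show the
same general shape on a multi-socket box, though the absolute numbers will
not match the kernel test.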

> Thank you for doing this, and it might be good to have some documentation
> on this. In perfbook, I use counters to make this point, and perhaps
> I need to emphasize more that it also applies to other algorithms,
> including locking. Me, I learned this lesson from a logic analyzer
> back in the very early 1990s. This was back in the days before on-CPU
> caches when a logic analyzer could actually tell you something about
> the detailed execution. ;-)
>
> The key point is that you can often closely approximate the performance
> of synchronization algorithms by counting the number of cache misses and
> the number of CPUs competing for each cache line.

Cool, thanks for that insight. It has been some years since I used a logic
analyzer for some bus protocol debugging, but those are fun!
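
To make the counting argument concrete for myself, here is a toy model. It
assumes that when several CPUs contend for one cache line, the line is handed
around fairly, so each operation waits for roughly one line transfer per
contending CPU, and that an uncontended operation costs one cache hit. The
function name model_completion_ns() and the parameters hit_ns and transfer_ns
are hypothetical, and any values passed in are placeholders, not measurements.

```c
/*
 * Toy estimate of the time for each thread to finish `loops` lock/unlock
 * pairs when `cpus_sharing_line` CPUs all write to the same cache line.
 *
 * Assumptions (not measured): an uncontended operation costs hit_ns;
 * a contended operation waits for about one line transfer (transfer_ns)
 * per CPU sharing the line.
 */
static double model_completion_ns(long loops, int cpus_sharing_line,
				  double hit_ns, double transfer_ns)
{
	double per_op;

	if (cpus_sharing_line <= 1)
		per_op = hit_ns;                          /* line stays local */
	else
		per_op = transfer_ns * cpus_sharing_line; /* line bounces around */

	return 2.0 * loops * per_op;  /* one lock plus one unlock per loop */
}
```

Under these assumptions the shared-line case grows linearly with the number
of CPUs, while the per-CPU case (cpus_sharing_line == 1) stays flat, which is
the shape discussed above.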

> If you want to get the microbenchmark test code itself upstream,
> one approach might be to have a kernel/locking/lockperf.c similar to
> kernel/rcu/rcuperf.c.
> Thoughts?

That sounds great to me; there are no other locking performance tests in the
kernel. There are the locking API selftests at boot (DEBUG_LOCKING_API_SELFTESTS),
which just test whether lockdep catches locking issues, and there's
locktorture, but I believe none of these test lock performance.

I think a lockperf.c could also test other things about locking mechanisms,
such as how they perform if the owner of the lock is currently running vs
sleeping while another thread is trying to acquire it, etc. What do you think?
I can add this to my to-do list. Right now I'm working on the list-RCU lockdep
checking I started on [1], and I want to post another series soon.

Thanks a lot,

- Joel

[1] https://lkml.org/lkml/2019/6/1/495
https://lore.kernel.org/patchwork/patch/1082846/
>
> Thanx, Paul
>