Re: [PATCH v8 0/9] rwsem performance optimizations

From: Tim Chen
Date: Wed Oct 16 2013 - 14:28:53 EST


On Wed, 2013-10-16 at 08:55 +0200, Ingo Molnar wrote:
> * Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>
> > On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > > * Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
> > >
> > > > The throughput of mmap with mutex vs pure mmap is below:
> > > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads    vanilla    all rwsem patches    without optspin
> > > > 1             3.0%          -1.0%               -1.7%
> > > > 5             7.2%         -26.8%                5.5%
> > > > 10            5.2%         -10.6%               22.1%
> > > > 20            6.8%          16.4%               12.5%
> > > > 40           -0.2%          32.7%                0.0%
> > > >
> > > > So with mutex, the vanilla kernel and the one without optspin both run
> > > > faster. This is consistent with what Peter reported. With optspin, the
> > > > picture is more mixed, with lower throughput at low to moderate number
> > > > of threads and higher throughput with high number of threads.
> > >
> > > So, going back to your original table:
> > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads vanilla all without optspin
> > > > 1 3.0% -1.0% -1.7%
> > > > 5 7.2% -26.8% 5.5%
> > > > 10 5.2% -10.6% 22.1%
> > > > 20 6.8% 16.4% 12.5%
> > > > 40 -0.2% 32.7% 0.0%
> > > >
> > > > In general, vanilla and no-optspin case perform better with
> > > > pthread-mutex. For the case with optspin, mmap with pthread-mutex is
> > > > worse at low to moderate contention and better at high contention.
> > >
> > > 'without optspin' appears to be a pretty good choice - if
> > > it wasn't for that '1 thread' number, which, if I correctly assume is the
> > > uncontended case, is one of the most common usecases ...
> > >
> > > How can the single-threaded case get slower? None of the patches should
> > > really cause noticeable overhead in the non-contended case. That looks
> > > weird.
> > >
> > > It would also be nice to see the 2, 3, 4 thread numbers - those are the
> > > most common contention scenarios in practice - where do we see the first
> > > improvement in performance?
> > >
> > > Also, it would be nice to include a noise/stddev figure, it's really hard
> > > to tell whether -1.7% is statistically significant.
> >
> > Ingo,
> >
> > I think that the optimistic spin changes to rwsem should improve
> > performance for real workloads after all.
> >
> > In my previous tests, I was doing mmap followed immediately by
> > munmap without doing anything to the memory. No real workload
> > will behave that way and it is not the scenario that we
> > should optimize for. A much better approximation of
> > real usage is doing mmap, then touching
> > the memory being mmapped, followed by munmap.
>
> That's why I asked for a working testcase to be posted ;-) Not just
> pseudocode - send the real .c thing please.

I was using a modified version of Anton's will-it-scale test. I'll try
to port the tests to perf bench to make it easier for other people to
run the tests.

>
> > This changes the dynamics of the rwsem as we are now dominated by read
> > acquisitions of mmap sem due to the page faults, instead of having only
> > write acquisitions from mmap. [...]
>
> Absolutely, the page fault read case is the #1 optimization target of
> rwsems.
>
> > [...] In this case, any delay in write acquisitions will be costly as we
> > will be blocking a lot of readers. This is where optimistic spinning on
> > write acquisitions of mmap sem can provide a very significant boost to
> > the throughput.
> >
> > I changed the test case to the following, with writes to
> > the mmapped memory:
> >
> > #include <assert.h>
> > #include <sys/mman.h>
> >
> > #define MEMSIZE (1 * 1024 * 1024)
> >
> > char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> >
> > void testcase(unsigned long long *iterations)
> > {
> >         int i;
> >
> >         while (1) {
> >                 /* Map 1MB of anonymous memory (mmap_sem write side). */
> >                 char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
> >                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >                 assert(c != MAP_FAILED);
> >                 /* Write to the mapping so that its pages are faulted in. */
> >                 for (i = 0; i < MEMSIZE; i += 8) {
> >                         c[i] = 0xa;
> >                 }
> >                 munmap(c, MEMSIZE);
> >
> >                 (*iterations)++;
> >         }
> > }
>
> It would be _really_ nice to stick this into tools/perf/bench/ as:
>
> perf bench mem pagefaults
>
> or so, with a number of parallelism and workload patterns. See
> tools/perf/bench/numa.c for a couple of workload generators - although
> those are not page fault intense.
>
> So that future generations can run all these tests too and such.

Okay, will do.

>
> > I compare the throughput where I have the complete rwsem patchset
> > against vanilla and the case where I take out the optimistic spin patch.
> > I have increased the run time by 10x from my previous experiments and do
> > 10 runs for each case. The standard deviation is ~1.5%, so any change
> > under 1.5% is not statistically significant.
> >
> > % change in throughput vs the vanilla kernel.
> > Threads    all        No-optspin
> > 1          +0.4%      -0.1%
> > 2          +2.0%      +0.2%
> > 3          +1.1%      +1.5%
> > 4          -0.5%      -1.4%
> > 5          -0.1%      -0.1%
> > 10         +2.2%      -1.2%
> > 20         +237.3%    -2.3%
> > 40         +548.1%    +0.3%
>
> The tail is impressive. The early parts are important as well, but it's
> really hard to tell the significance of the early portion without having
> an stddev column.

Here's the data with sdv column:

n     all        sdv     No-optspin   sdv
1     +0.4%      0.9%    -0.1%        0.8%
2     +2.0%      0.8%    +0.2%        1.2%
3     +1.1%      0.8%    +1.5%        0.6%
4     -0.5%      0.9%    -1.4%        1.1%
5     -0.1%      1.1%    -0.1%        1.1%
10    +2.2%      0.8%    -1.2%        1.0%
20    +237.3%    0.7%    -2.3%        1.3%
40    +548.1%    0.8%    +0.3%        1.2%
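
As a rough way to read the sdv column, a change only stands out when its
magnitude is larger than the run-to-run sdv. A minimal sketch in C of that
reading, assuming this crude rule rather than a proper significance test;
the helper names pct_change and exceeds_noise are made up for illustration
and are not part of the test harness:

```c
#include <assert.h>

/* Percent change of a patched throughput relative to the baseline. */
static double pct_change(double baseline, double patched)
{
    return 100.0 * (patched - baseline) / baseline;
}

/*
 * Crude screen (an assumption, not a statistical test): report a change
 * as above the noise only if its magnitude exceeds the sdv, both given
 * in percent.
 */
static int exceeds_noise(double change_pct, double sdv_pct)
{
    double magnitude = change_pct < 0.0 ? -change_pct : change_pct;

    return magnitude > sdv_pct;
}
```

With the numbers above, exceeds_noise(237.3, 0.7) is true for the 20-thread
"all" case, while exceeds_noise(0.4, 0.9) is false for the 1-thread case.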


> ( "perf stat --repeat N" will give you stddev output, in handy percentage
> form. )
>
> > Now when I test the case where we acquire mutex in the
> > user space before mmap, I got the following data versus
> > vanilla kernel. There's little contention on mmap sem
> > acquisition in this case.
> >
> > n all No-optspin
> > 1 +0.8% -1.2%
> > 2 +1.0% -0.5%
> > 3 +1.8% +0.2%
> > 4 +1.5% -0.4%
> > 5 +1.1% +0.4%
> > 10 +1.5% -0.3%
> > 20 +1.4% -0.2%
> > 40 +1.3% +0.4%

Adding std-dev to above data:

n     all      sdv     No-optspin   sdv
1     +0.8%    1.0%    -1.2%        1.2%
2     +1.0%    1.0%    -0.5%        1.0%
3     +1.8%    0.7%    +0.2%        0.8%
4     +1.5%    0.8%    -0.4%        0.7%
5     +1.1%    1.1%    +0.4%        0.3%
10    +1.5%    0.7%    -0.3%        0.7%
20    +1.4%    0.8%    -0.2%        1.0%
40    +1.3%    0.7%    +0.4%        0.5%
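
For reference, a minimal sketch of what the pthread-mutex variant of the
testcase could look like. This is an illustration under assumptions, not
the exact code that was run: the lock name mmap_lock, the function name
testcase_mutex and the bounded loop are made up here, and it assumes the
mutex covers only the mmap()/munmap() calls so that the page faults still
run in parallel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define MEMSIZE (1 * 1024 * 1024)

/* Hypothetical process-wide lock; the name is not from the original test. */
static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Bounded variant of the testcase loop (the original loops forever).
 * Assumption: the mutex is held only around mmap() and munmap(), which
 * serializes the mmap_sem write acquisitions in user space.
 */
static unsigned long long testcase_mutex(unsigned long long rounds)
{
    unsigned long long done = 0;

    while (done < rounds) {
        pthread_mutex_lock(&mmap_lock);
        char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_mutex_unlock(&mmap_lock);
        assert(c != MAP_FAILED);

        /* Touch the mapping outside the lock so the pages are faulted in. */
        for (size_t i = 0; i < MEMSIZE; i += 8)
            c[i] = 0xa;

        pthread_mutex_lock(&mmap_lock);
        munmap(c, MEMSIZE);
        pthread_mutex_unlock(&mmap_lock);

        done++;
    }
    return done;
}
```

Each worker thread would call testcase_mutex() in place of the plain
testcase loop; holding the lock across the page touching as well would
serialize the whole iteration and is a different experiment.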

> >
> > Thanks.
>
> A bit hard to see as there's no comparison _between_ the pthread_mutex and
> plain-parallel versions. No contention isn't a great result if performance
> suffers because it's all serialized.

Now the data for the pthread-mutex testcase vs the plain-parallel vanilla
testcase, with std-dev:

n     vanilla    sdv     Rwsem-all   sdv     No-optspin   sdv
1     +0.5%      0.9%    +1.4%       0.9%    -0.7%        1.0%
2     -39.3%     1.0%    -38.7%      1.1%    -39.6%       1.1%
3     -52.6%     1.2%    -51.8%      0.7%    -52.5%       0.7%
4     -59.8%     0.8%    -59.2%      1.0%    -59.9%       0.9%
5     -63.5%     1.4%    -63.1%      1.4%    -63.4%       1.0%
10    -66.1%     1.3%    -65.6%      1.3%    -66.2%       1.3%
20    +178.3%    0.9%    +182.3%     1.0%    +177.7%      1.1%
40    +604.8%    1.1%    +614.0%     1.0%    +607.9%      0.9%

The version with the full rwsem patchset performs best at every thread
count. Serialization actually hurts at smaller thread counts (2 to 10),
even on the current vanilla kernel.

I'll rerun the tests once I have ported them to perf bench. It may take
me a couple of days.

Thanks.

Tim
