Re: [RFC PATCH 0/3] restartable sequences benchmarks

From: Andy Lutomirski
Date: Thu Oct 22 2015 - 15:12:15 EST


On Thu, Oct 22, 2015 at 11:06 AM, Dave Watson <davejwatson@xxxxxx> wrote:
> We've been testing out restartable sequences + malloc changes for use
> at Facebook. Below are some test results, as well as some possible
> changes based on Paul Turner's original patches:

Thanks! I'll stare at this some time between now and Kernel Summit.

>
> https://lkml.org/lkml/2015/6/24/665
>
> I ran one service with several permutations of various mallocs. The
> service is CPU-bound, and hits the allocator quite hard. Requests/s
> are held constant at the source, so we use cpu idle time and latency
> as an indicator of service quality. These are average numbers over
> several hours. Machines were dual E5-2660, total 16 cores +
> hyperthreading. This service has ~400 total threads, 70-90 of which
> are doing work at any particular time.
>
>                                       RSS   CPUIDLE   LATENCYMS
> jemalloc 4.0.0                        31G   33%       390
> jemalloc + this patch                 25G   33%       390
> jemalloc + this patch using lsl       25G   30%       420
> jemalloc + PT's rseq patch            25G   32%       405
> glibc malloc 2.20                     27G   30%       420
> tcmalloc gperftools trunk (2.2)       21G   30%       480

Slightly confused. This is showing a space efficiency improvement but
not a performance improvement? Is the idea that percpu free lists are
more space efficient than per-thread free lists?
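
To make the question concrete, here is a rough back-of-the-envelope
sketch in C. The owner counts come from the description above (~400
threads; 16 cores + hyperthreading = 32 logical CPUs). The per-cache
size is a made-up parameter, not a measured jemalloc value, and
cache_footprint() is a hypothetical helper written only for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on memory parked in allocator caches when there is one
 * cache per owner (a thread or a CPU). bytes_per_cache is a
 * hypothetical tuning value, not a number taken from jemalloc. */
size_t cache_footprint(size_t owners, size_t bytes_per_cache)
{
    /* Refuse inputs whose product would overflow size_t. */
    assert(bytes_per_cache == 0 || owners <= (size_t)-1 / bytes_per_cache);
    return owners * bytes_per_cache;
}
```

For any fixed per-cache size, cache_footprint(400, size) is 12.5 times
cache_footprint(32, size), so per-thread caches could in principle hold
far more idle memory than per-CPU ones. Whether that accounts for the
RSS difference in the table is exactly the open question.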

>
> jemalloc rseq patch used for testing:
> https://github.com/djwatson/jemalloc
>
> lsl test - using lsl segment limit to get cpu (i.e. inlined vdso
> getcpu on x86) instead of using the thread caching as in this patch.
> There have been some suggestions to add the thread-cached getcpu()
> feature separately. It does seem to move the needle in a real service
> by about 3% to have a thread-cached getcpu vs. not. I don't think we
> can use restartable sequences in production without a faster getcpu.

If nothing else, I'd like to replace the thread-cached getcpu thing
with percpu gsbase, at least on x86. That doesn't necessarily have to
be exclusive with restartable sequences.
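
For readers following along, a minimal userspace sketch of what a
thread-cached getcpu looks like from the caller's side. This is not the
code from the patch: in the patch the kernel keeps the cached value
current, whereas this sketch has to refresh it by hand with glibc's
sched_getcpu(), so the cached value can go stale after a migration. The
names cached_cpu, refresh_cached_cpu() and cached_getcpu() are
hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Per-thread cache of the CPU number; -1 means "not read yet". */
static __thread int cached_cpu = -1;

/* Slow path: ask glibc for the current CPU and remember the answer. */
int refresh_cached_cpu(void)
{
    int cpu = sched_getcpu();
    cached_cpu = cpu < 0 ? 0 : cpu;   /* fall back to CPU 0 on error */
    return cached_cpu;
}

/* Fast path: a plain thread-local load once the cache is populated. */
int cached_getcpu(void)
{
    return cached_cpu >= 0 ? cached_cpu : refresh_cached_cpu();
}
```

The point of the cache is that the fast path is a single thread-local
load. Without kernel help, nothing tells the thread when the value has
become wrong, which is why the proposals discussed here involve the
kernel.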

>
> GS-segment / migration only tests
>
> There's been some interest in seeing if we can do this with only the
> gs segment; here are some numbers for that. This doesn't have to be
> gs: it could just as well be a migration signal sent to userspace, and
> the same approaches would apply.
>
> GS patch: https://lkml.org/lkml/2014/9/13/59
>
>                                       RSS   CPUIDLE   LATENCYMS
> jemalloc 4.0.0                        31G   33%       390
> jemalloc + percpu locking             25G   25%       420
> jemalloc + preempt lock / signal      25G   32%       415

Neat!

--Andy