Re: [RFC PATCH v8 1/9] Restartable sequences system call

From: Mathieu Desnoyers
Date: Thu Aug 25 2016 - 13:08:45 EST


----- On Aug 19, 2016, at 4:23 PM, Linus Torvalds torvalds@xxxxxxxxxxxxxxxxxxxx wrote:

> On Fri, Aug 19, 2016 at 1:07 PM, Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> Benchmarking various approaches for reading the current CPU number:
>
> So I'd like to see the benchmarks of something that actually *does* something.
>
> IOW, what's the bigger-picture "this is what it actually is useful
> for, and how it speeds things up".
>
> Nobody gets a cpu number just to get a cpu number - it's not a useful
> thing to benchmark. What does getcpu() so much that we care?
>
> We've had tons of clever features that nobody actually uses, because
> they aren't really portable enough. I'd like to be convinced that this
> is actually going to be used by real applications.

I completely agree with your request for real-life application numbers.

The most appealing application we have so far is Dave Watson's Facebook
services using jemalloc as a memory allocator. It would be nice if he
could re-run those benchmarks with my rseq implementation. The trade-offs
here are about speed and memory usage:

1) single process-wide pool:
- speed: does not scale well to many-cores,
+ efficient use of memory.
2) per-thread pools:
+ speed: scales really well to many-cores,
- inefficient use of memory.
3) per-cpu pools without rseq:
- speed: requires atomic instructions due to migration and preemption,
+ efficient use of memory.
4) per-cpu pools with rseq:
+ speed: no atomic instructions required,
+ efficient use of memory.

His benchmarks should confirm that we get the best of both speed
and memory use with (4).
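
To make the cost in (3) concrete, here is a minimal sketch of a per-cpu
counter updated without rseq. It is an illustration only, not jemalloc's
code or the rseq API: the names pool, pool_add(), pool_total() and the
MAX_CPUS bound are hypothetical. The thread can be preempted or migrated
between reading the CPU number and touching the per-cpu slot, so another
thread may be updating the same slot concurrently, and the update has to
be an atomic read-modify-write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

#define MAX_CPUS 1024 /* hypothetical upper bound chosen for this sketch */

/* One counter per CPU, aligned to 64 bytes to limit false sharing. */
struct percpu_count {
	_Alignas(64) atomic_long count;
};

static struct percpu_count pool[MAX_CPUS];

/*
 * Option (3): per-cpu data without rseq.
 * The CPU number may be stale by the time the slot is updated, because
 * the thread can be preempted or migrated right after sched_getcpu().
 * The atomic add keeps the slot consistent in that case.
 * Returns the CPU index whose slot was updated.
 */
static int pool_add(long n)
{
	int cpu = sched_getcpu();

	if (cpu < 0)
		cpu = 0; /* fall back to slot 0 if the CPU number is unavailable */
	assert(cpu < MAX_CPUS);
	atomic_fetch_add_explicit(&pool[cpu].count, n, memory_order_relaxed);
	return cpu;
}

/* Sum all per-cpu slots; the result is approximate while writers run. */
static long pool_total(void)
{
	long sum = 0;

	for (int i = 0; i < MAX_CPUS; i++)
		sum += atomic_load_explicit(&pool[i].count, memory_order_relaxed);
	return sum;
}
```

With (4), the idea is that the same update runs as a restartable
sequence, which the kernel restarts if the thread is preempted or
migrated partway through, so the atomic instruction in pool_add() can
be replaced by plain loads and stores.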

I plan to personally start working on integrating rseq with
the lttng-ust user-space tracer per-CPU ring buffer, but
I expect to mainly publish microbenchmarks, as most of
our heavy tracing users are proprietary applications, for
which it's tricky to publish numbers. I suspect that
microbenchmarks are not what you are looking for here.

Boqun Feng expressed interest in working on a
userspace RCU flavor that would implement per-CPU
(rather than per-thread) grace period tracking. I suspect
this will be a rather large undertaking. The benefits
should show up as reduced grace period overhead and better
speed in applications that have many more threads than cores.

Paul Turner from Google probably has interesting numbers too,
but I suspect he is busy on other projects at the moment.

Let's see if we can get Dave Watson to provide those numbers.

Thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com