Re: [RFC PATCH for 4.16 02/21] rseq: Introduce restartable sequences system call (v12)

From: Mathieu Desnoyers
Date: Thu Dec 14 2017 - 13:10:50 EST


----- On Dec 14, 2017, at 11:44 AM, Chris Lameter cl@xxxxxxxxx wrote:

> On Thu, 14 Dec 2017, Mathieu Desnoyers wrote:
>
>> On x86, yet another possible approach would be to use the gs segment
>> selector to point to user-space per-cpu data. This approach performs
>> similarly to the cpu id cache, but it has two disadvantages: it is
>> not portable, and it is incompatible with existing applications already
>> using the gs segment selector for other purposes.
>
> I think the proper way to think about gs and fs on x86 is as base
> registers. They are essentially values in registers added to the address
> generated in an instruction. As such the approach is transferable to other
> processor architectures. Many support base register and base register
> relative processing. If a processor can do RMW instructions base register
> relative then you have something similar.

How would you do it on ARM32?

>
> In a restartable sequence you could increase efficiency by avoiding full
> atomic instructions. This would be similar to the lockless RMW available
> on x86 then. And in that form it is portable.
>
> A context switch to another processor would mean that the value of the
> base register has changed and that we therefore are accessing another per
> cpu segment. Restarting the sequence will yield a correct result without
> any reloading of registers.

As a concrete example, let's try to apply your proposal to a common use-case:
a compare-and-store on user-space per-cpu data.

With my rseq proposal, the fast-path pseudo-code boils down to:

load TLS::cpu_id_start into reg_X
add reg_X offset to base to find target v
store pointer to the critical section descriptor into TLS::rseq_cs
compare reg_X against TLS::cpu_id
jne abort
cmp *v, value
jne cmpfail
store newval to *v
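
To make the control flow of the pseudo-code above easier to follow, here is a
minimal single-threaded model in C. It is only a sketch: struct model_rseq,
model_cmpeqv_storev() and the MODEL_* return codes are hypothetical names
invented for this illustration, and the struct is not the kernel ABI layout.
A real rseq fast path is written in inline assembly so the kernel can tell
from the instruction pointer whether the sequence was interrupted; plain C
like this cannot actually be restarted by the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread TLS area (not the kernel ABI). */
struct model_rseq {
	uint32_t cpu_id_start;	/* CPU number sampled at sequence start */
	uint32_t cpu_id;	/* CPU number the thread currently runs on */
	const void *rseq_cs;	/* descriptor of the active critical section */
};

enum { MODEL_OK = 0, MODEL_CMPFAIL = 1, MODEL_ABORT = -1 };

/*
 * Model of the compare-and-store fast path on per-cpu data.
 * "base" points at slot 0, "stride" is the distance between the
 * slots of two consecutive CPUs, counted in intptr_t elements.
 */
static int model_cmpeqv_storev(struct model_rseq *tls, intptr_t *base,
			       size_t stride, intptr_t expect,
			       intptr_t newval, const void *cs_descriptor)
{
	uint32_t cpu = tls->cpu_id_start;	/* load TLS::cpu_id_start */
	intptr_t *v = base + (size_t)cpu * stride; /* find target v */

	tls->rseq_cs = cs_descriptor;		/* publish the descriptor */
	if (cpu != tls->cpu_id)			/* compare against TLS::cpu_id */
		return MODEL_ABORT;		/* jne abort */
	if (*v != expect)			/* cmp *v, value */
		return MODEL_CMPFAIL;		/* jne cmpfail */
	*v = newval;				/* store newval to *v (commit) */
	return MODEL_OK;
}
```

In this model, a caller that gets MODEL_ABORT would re-read the CPU number
and retry, while MODEL_CMPFAIL means the slot did not hold the expected
value.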

My benchmark on Intel x86-64 E5-2630 shows that it takes 1.9 ns/iteration
for a test-case incrementing a counter with this rseq compare-and-store
sequence.

Let's assume we can reserve the gs segment selector for use in user-space,
and that the per-cpu data layout allows using this segment selector as offset.
The compare-and-store use-case would require a "cmpxchg" instruction with
a gs segment selector.

With a single-threaded test-case which uses non-lock-prefixed cmpxchg in a loop
on an E5-2630, I get 2.8 ns/iteration (no per-cpu data involved; done on a
single global value).
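
For reference, the kind of loop measured here can be sketched as follows. This
is an illustration, not the actual benchmark source: cmpxchg_nolock() and
count_up() are hypothetical names, the inline assembly assumes GCC or clang on
x86-64, and other targets fall back to a plain compare and store. No gs prefix
is used, to match the single-global-value test; in the per-cpu variant the
memory operand would be addressed relative to the gs base instead.

```c
#include <stdint.h>

/*
 * Compare-and-exchange without the lock prefix: a single instruction,
 * so it cannot be split by preemption on the local CPU, but it is NOT
 * atomic with respect to other CPUs. Returns 1 if the store happened.
 */
static inline int cmpxchg_nolock(intptr_t *v, intptr_t expect, intptr_t newval)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	intptr_t prev = expect;

	/* rax holds the expected value; on mismatch it receives *v. */
	__asm__ __volatile__("cmpxchgq %2, %1"
			     : "+a"(prev), "+m"(*v)
			     : "r"(newval)
			     : "cc", "memory");
	return prev == expect;
#else
	/* Fallback for other targets: same single-threaded semantics. */
	if (*v != expect)
		return 0;
	*v = newval;
	return 1;
#endif
}

/* Increment *v "iterations" times using the non-locked cmpxchg. */
static intptr_t count_up(intptr_t *v, long iterations)
{
	for (long i = 0; i < iterations; i++) {
		intptr_t old = *v;

		while (!cmpxchg_nolock(v, old, old + 1))
			old = *v;	/* value changed under us: reload */
	}
	return *v;
}
```

Timing count_up() over a large iteration count and dividing by that count
gives a ns/iteration figure comparable in spirit to the one quoted above.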

One benefit of your proposal is that it reduces the number of retired
instructions, but once we take the IPC into account, it is slower than rseq in
my benchmark. What benefits do you expect from using segment selectors and
non-lock-prefixed atomic instructions on the fast-path?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com