Re: [RFC PATCH for 4.15 v12 00/22] Restartable sequences and CPU op vector

From: Mathieu Desnoyers
Date: Wed Nov 22 2017 - 07:36:06 EST

----- On Nov 21, 2017, at 5:59 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:

> On Tue, 21 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 21, 2017, at 12:21 PM, Andi Kleen andi@xxxxxxxxxxxxxx wrote:
>> > On Tue, Nov 21, 2017 at 09:18:38AM -0500, Mathieu Desnoyers wrote:
>> >> Hi,
>> >>
>> >> Following changes based on a thorough coding style and patch changelog
>> >> review from Thomas Gleixner and Peter Zijlstra, I'm respinning this
>> >> series for another RFC.
>> >>
>> > My suggestion would be that you also split out the opv system call.
>> > That seems to be main contention point currently, and the restartable
>> > sequences should be useful without it.
>> I consider rseq to be incomplete and a pain to use in various scenarios
>> without cpu_opv.
>> About the contention point you refer to:
>> Using vDSO as an example of how things should be done is just wrong: the
>> vDSO interaction with debugger instruction single-stepping is broken,
>> as I detailed in my previous email.
> Let me turn that around. You're lamenting about a conditional branch in
> your rseq thing for performance reasons and at the same time you want to
> force extra code into the VDSO? clock_gettime() is one of the hottest
> vsyscalls in certain scenarions. So why would we want to have extra code
> there? Just to make debuggers happy. You really can't be serious about
> that.

There is *already* an existing branch in the clock_gettime vsyscall:
it's a loop. It won't hurt the fast-path to use that branch and
make it do something else instead. It could even help the vDSO fast-path
for some non-x86 architectures where branch prediction assumes that
backward branches are always taken (adding an unlikely() does not help
in those cases).

>> Thomas' proposal of handling single-stepping with a user-space locking
>> fallback, which is pretty much what I had in 2016, pushes a lot of
>> complexity to user-space, requires an extra branch in the fast-path,
>> as well as additional store-release/load-acquire semantics for consistency.
>> I don't plan going down that route.
>> Other than that, I have not received any concrete alternative proposal to
>> properly handle single-stepping.
> You provided the details today. Up to that point all we had was handwaving
> and inconsistent information.

I mistakenly presumed you took interest in the past 2 years discussions.
It appears I was wrong, and that information needed to be summarized in
my changelog. This was my mistake and I fixed it.

>> The only opposition against cpu_opv is that there *should* be an hypothetical
>> simpler solution. The rseq idea is not new: it's been presented by Paul Turner
>> in 2012 at LPC. And so far, cpu_opv is the overall simplest and most
>> efficient way I encountered to handle single-stepping, and it gives extra
>> benefits, as described in my changelog.
> That's how you define it and that does not make cpu_opv less complex and
> more debuggable. There is no way to debug that and still you claim that it
> removes compexity from user space.

So I should ask: what kind of observability within cpu_opv() do you want ?
I can add a tracepoint for each operation, which would technically take care
of your concern. You main counter-argument seems to be a tooling issue.

> That ops stuff comes from user space and
> is not magically constructed by the kernel. In some of your use cases it
> even has different semantics than the rseq section code. So how is that
> removing any complexity from user space? All it buys you is an extra branch
> less in your rseq hotpath and that's your justification to shove that
> thing into the kernel.

Actually, the cpu-op user-space library can hide this difference from the
user: I implemented the equivalent rseq algorithm using a compare-and-store:

int cpu_op_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
off_t voffp, intptr_t *load, int cpu)
intptr_t oldv = READ_ONCE(*v);
intptr_t *newp = (intptr_t *)(oldv + voffp);
int ret;

if (oldv == expectnot)
return 1;
ret = cpu_op_cmpeqv_storep_expect_fault(v, oldv, newp, cpu);
if (!ret) {
*load = oldv;
return 0;
if (ret > 0) {
errno = EAGAIN;
return -1;
return -1;

So from a library user perspective, the fast-path and slow-path are
exactly the same.

> The version I reviewed was just undigestable.

Thanks for the thorough coding style review by the way.

> I did not have time to look
> at the hastily cobbled together version of today. Aside of that the
> scheduler portion of it has not seen any review from scheduler folks
> either.

True. It appears that it really takes a merge window to get some
people's attention. That's OK, you guys are really busy on other
stuff. It's just unfortunate that the feedback about the cpu_opv
concept did not come sooner, e.g. during first rounds of patches
where the cpu_opv design was presented, or even at KS.

> AFAICT there is not a single reviewed-by tag on the sys_rseq and the
> sys_opv patches either.

Very good point! Anyone in CC who cares about getting this in can
find time to do some official review ?

> Are you seriously expecting that new syscalls of that kind are going to be
> merged without a deep and thorough review just based on your decision to
> declare them ready?

In my reply to Andi, I merely state that I'm not willing to push an
half-baked user-space ABI into the kernel, and rseq without cpu_opv
is only part of the solution.

Let's see if others find time to do an official review.



> Thanks,
> tglx

Mathieu Desnoyers
EfficiOS Inc.