Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections

From: Andy Lutomirski
Date: Thu Apr 07 2016 - 12:44:15 EST


On Thu, Apr 7, 2016 at 8:53 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Thu, Apr 07, 2016 at 08:44:38AM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 7, 2016 at 8:24 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> > On Thu, Apr 07, 2016 at 07:35:26AM -0700, Andy Lutomirski wrote:
>> >> What I meant was: rather than shoving individual values into the TLABI
>> >> thing, shove in a pointer:
>> >>
>> >> struct commit_info {
>> >> u64 post_commit_rip;
>> >> u32 cpu;
>> >> u64 *event;
>> >> // whatever else;
>> >> };
>> >>
>> >> and then put a commit_info* in TLABI.
>> >>
>> >> This would save some bytes in the TLABI structure.
>> >
>> > But would cost us extra indirections. The whole point was getting this
>> > stuff at a constant offset from the TLS segment register.
>>
>> I don't think the extra indirections would matter much. The kernel
>> would have to chase the pointer, but only in the very rare case where
>> it resumes userspace during a commit or on the immediately following
>> instruction.
>
> Its about userspace finding these values, not the kernel.

But userspace never reads post_commit_rip or the abort address AFAIK.

>
>> At the very least, post_commit_rip and the abort address (which I
>> forgot about) could both live in a static structure,
>
> Paul keeps the abort address in rcx.

So it's probably a wash. Paul's way is to load the abort address in
rcx (one instruction, no memory access) and to put post_commit_rip in
TLABI (one store). Mine is to put a pointer to a descriptor into
TLABI (one store). Should barely matter.

>
>> and shoving a
>> pointer to *that* into TLABI space is one store instead of two.
>
>> > Ah, so what happens if the signal happens before the commit but after
>> > the load of the seqcount?
>> >
>> > Then, even if the signal motifies the count, we'll not observe.
>> >
>>
>> Where exactly?
>>
>> In my scheme, nothing except the kernel ever loads the seqcount. The
>> user code generates a fresh value, writes it to memory, and then, just
>> before commit, writes that same value to the TLABI area and then
>> double-checks that the value it wrote at the beginning is still there.
>>
>> If the signal modifies the count, then the user code won't directly
>> notice, but prepare_exit_to_usermode on the way out of the signal will
>> notice that the (restored) TLABI state doesn't match the counter that
>> the signal handler changed and will just to the abort address.
>
>
> OK, you lost me.. commit looks like:
>
> + __asm__ __volatile__ goto (
> + "movq $%l[failed], %%rcx\n"
> + "movq $1f, %[commit_instr]\n"
> + "cmpq %[start_value], %[current_value]\n"
>
> If we get preempted/signaled here without the preemption/signal entry
> checking for the post_commit_instr, we'll fail hard.

Not if I'm thinking about this right. Because, in my scheme, we'd be
storing both the address of the counter [1] and the value that we
expect it to be (current_value? something's confused in the asm or
the variable naming, or I'm just not following exactly which variable
is which) in registers or TLABI, and the kernel would notice that
*counter != expected_counter when it resumes and would send us to the
abort label.

Presumably there would be one register hard-coded as the expected
counter value and another register hard-coded as the address of the
counter. That would avoid a bunch of stores to TLABI, and the kernel
could still find them. If the C optimizer did a good job, the
resulting code would be simple.

[1] In my scheme, there would be no reason for the counter to live in
TLABI. In contrast, there would be an excellent reason for it to
*not* live in TLABI -- user code could have more than one set of
counters, and they wouldn't interfere with each other. This would be
nice if one of them protects something that uses long enough code in
the critical sections that it's likely to be preempted.

>
> + "jnz %l[failed]\n"
> + "movq %[to_write], (%[target])\n"
> + "1: movq $0, %[commit_instr]\n"
> + : /* no outputs */
> + : [start_value]"d"(start_value.storage),
> + [current_value]"m"(__rseq_state),
> + [to_write]"r"(to_write),
> + [target]"r"(p),
> + [commit_instr]"m"(__rseq_state.post_commit_instr)
> + : "rcx", "memory"
> + : failed
> + );


More concretely, this looks like (using totally arbitrary register
assingments -- probably far from ideal, especially given how GCC's
constraints work):

enter the critical section:
1:
movq %[cpu], %%r12
movq {address of counter for our cpu}, %%r13
movq {some fresh value}, (%%r13)
cmpq %[cpu], %%r12
jne 1b

... do whatever setup or computation is needed...

movq $%l[failed], %%rcx
movq $1f, %[commit_instr]
cmpq {whatever counter we chose}, (%%r13)
jne %l[failed]
cmpq %[cpu], %%r12
jne %l[failed]

<-- a signal in here that conflicts with us would clobber (%%r13), and
the kernel would notice and send us to the failed label

movq %[to_write], (%[target])
1: movq $0, %[commit_instr]

In contrast to Paul's scheme, this has two additional (highly
predictable) branches and requires generation of a seqcount in
userspace. In its favor, though, it doesnt need preemption hooks,
it's inherently debuggable, and it allows multiple independent
rseq-protected things to coexist without forcing each other to abort.

The first cmpq might not be necessary. I put it there to ensure that
"some fresh value" is visible on the chosen CPU before any other loads
happen, but it's possible that the other cpu double-check is enough.


(Off topic: some of those movq instructions should probably be
rip-relative leaq instead.)

If my descriptor approach were used, we'd save the write to rcx.
Instead we'd do:

.pushsection .rodata
2:
.quad %l[failed] - . /* or similar */
.quad 1b - .
.popsection
leaq 2b(%rip), %[tlabi_rseq_desc]

That's the optimization I was talking about above, although I did a
bad job of describing it.

--Andy