Re: [RFC PATCH 0/3] restartable sequences v2: fast user-space percpu critical sections

From: Mathieu Desnoyers
Date: Fri Dec 11 2015 - 08:39:52 EST


----- On Dec 11, 2015, at 7:05 AM, Mathieu Desnoyers mathieu.desnoyers@xxxxxxxxxxxx wrote:

> ----- On Oct 27, 2015, at 7:56 PM, Paul Turner commonly@xxxxxxxxx wrote:
>
>> This is an update to the previously posted series at:
>> https://lkml.org/lkml/2015/6/24/665
>>
>> Dave Watson has posted a similar follow-up which allows additional critical
>> regions to be registered as well as single-step support at:
>> https://lkml.org/lkml/2015/10/22/588
>>
>> This series is a new approach which introduces an alternate ABI that does not
>> depend on open-coded assembly nor a central 'repository' of rseq sequences.
>> Sequences may now be inlined and the preparatory[*] work for the sequence can
>> be written in a higher level language.
>
> Hi Paul,
>
> Thanks for sending this updated series! A few questions:
>
> When you say that the preparatory work can be written in a higher
> level language, does it mean it still has to be inlined, or function
> calls are explicitly supported ? It would be good to clarify this in
> the changelog/documentation.
>
> Another point: in this changelog, it's unclear to me what the [*] refers
> to (after "preparatory"). Below there is a [*] before "A sequence essentially
> has 3 steps", and and plain '*' before "0. Userspace stores current event+cpu
> counter values".
>
>>
>> This new ABI has also been written to support debugger interaction in a way
>> that the previous ABI could not.
>>
>> [*] A sequence essentially has 3 steps:
>> 1) Determine which cpu the sequence is being run on
>> 2) Preparatory work specific to the state read in 1)
>> 3) A single commit instruction which finalizes any state updates.
>
> The link between those 3 steps and the following "progress" items
> below seems a bit confusing, because those are two numbered lists.
> Perhaps using the numbers above for the overall steps, and sub-numbers
> after a dot could help exposing the relation, e.g.:
>
> 1.1 Userspace stores current event+cpu counter values
> 2.1 Userspace loads the rip to move to at failure into cx
> 2.2 Userspace loads the rip of the instruction following
> the critical section into a registered TLS address.
> ....
>
>
>>
>> We require a single instruction for (3) so that if it is interrupted in any
>> way, we can proceed from (1) once more [or potentially bail].
>>
>> This new ABI can be described as:
>> Progress is ordered as follows:
>> *0. Userspace stores current event+cpu counter values
>
> This does not explain where those "current" values are
> read from ? Moreover, step [3] below seems to refer to values
> read at [0], but [0] is about storing values, not reading them.
> There is probably a very simple explanation to this, but in its
> current form, this is puzzling. Moreover, where is it stored ?
>
> I would suggest to change event+cpu -> (event, cpu), since they
> are not summed, but rather combined as a tuple.
>
>> 1. Userspace loads the rip to move to at failure into cx
>
> Is it really cx (16-bit), or ecx/rcx ?
>
> "the rip to move to at failure" could be defined as
> "failure address" or "restart address", and then referred
> to afterward. This would clarify point [4] below rather than
> saying "to the address from Â[1]".
>
>> 2. Userspace loads the rip of the instruction following
>> the critical section into a registered TLS address.
>
> The critical section is not clearly defined. What is the
> first instruction included in that c.s., and what is the
> first instruction after the c.s. ?
>
>> 3. Userspace loads the values read at [0] into a known
>> location.
>
> How can we "load the value ... into" ? We usually load from,
> and store to a location. What I think you mean here is that
> Userspace loads the values read at [0] into a known register.
> (I would expect location target a memory address).
>
>> 4. Userspace tests to see whether the current event and
>> cpu counter values match those stored at 0. Manually
>> jumping to the address from [1] in the case of a
>> mismatch.
>
> address from [1] -> "failure/restart address"
>
>>
>> Note that if we are preempted or otherwise interrupted
>> then the kernel can also now perform this comparison
>> and conditionally jump us to [1].
>
> Does it jump us to [1] or to the failure/restart address ?
>
>> 5. Our final instruction before [2] is then our commit.
>> The critical section is self-terminating. [2] must
>> also be cleared at this point.
>
> Is it the step [2] or the instruction following the critical
> section to you refer to here ?
>
>>
>> For x86_64:
>> [3] uses rdx to represent cpu and event counter as a
>> single 64-bit value.
>>
>> For i386:
>> [3] uses ax for cpu and dx for the event_counter.
>
> Having 16 bit only for CPU id limits us to 65536 CPUs.
> Can this be an issue ?
>
> Having 16 bit only for the event counter may start to become
> tricky for counter overflows.
>
> Are we strictly required to put both cpu id and seq counter
> into the same register, or those could be put into two regs ?
>
>>
>> Both:
>> Instruction after commit: rseq_state->post_commit_instr
>> Current event and cpu state: rseq_state->event_and_cpu
>>
>> Exactly, for x86_64 this looks like:
>> movq <failed>, rcx [1]
>> movq $1f, <commit_instr> [2]
>> cmpq <start value>, <current value> [3] (start is in rcx)
>
> As you stated yourself in a reply, rcx -> rdx.
>
>> jnz <failed> (4)
>> movq <to_write>, (<target>) (5)
>> 1: movq $0, <commit_instr>
>>
>> There has been some related discussion, which I am supportive of, in which
>> we use fs/gs instead of TLS. This maps naturally to the above and removes
>> the current requirement for per-thread initialization (this is a good thing!).
>
> It would be important to keep allowing other architectures to use TLS-based
> approaches though.
>
>>
>> On debugger interactions:
>>
>> There are some nice properties about this new style of API which allow it to
>> actually support safe interactions with a debugger:
>> a) The event counter is a per-cpu value. This means that we can not advance
>> it if no threads from the same process execute on that cpu. This
>> naturally allows basic single step support with thread-isolation.
>> b) Single-step can be augmented to evalute the ABI without incrementing the
>> event count.
>> c) A debugger can also be augmented to evaluate this ABI and push restarts
>> on the kernel's behalf.
>>
>> This is also compatible with David's approach of not single stepping between
>> 2-4 above. However, I think these are ultimately a little stronger since true
>> single-stepping and breakpoint support would be available. Which would be
>> nice to allow actual debugging of sequences.
>>
>> (Note that I haven't bothered implementing these in the current patchset as we
>> are still winnowing down on the ABI and they just add complexity. It's
>> important to note that they are possible however.)
>
> Ensuring these restartable critical sections can be used on shared memory
> across processes seems rather important. I wonder if the debuggability of the
> critical section really justifies to introduce a "PRIVATE" and a "SHARED" mode
> for the rseq (similar to sys_futex), or if we could just enforce that no
> breakpoint should be inserted within the critical section (except the very
> first instruction).

One more thought about breakpoints: I now understand that with your new approach,
the critical section ranges are only tracked momentarily during the critical
section. There is no registry kept of all critical sections, which could make
ptrace peek tweaks challenging.

An alternative approach might be to allow the debugger to put a breakpoint
within the critical section, but override the breakpoint handler in the kernel
so that it just resumes user-space execution, effectively ignoring the
breakpoint.

Thanks,

Mathieu

>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
>
>>
>> Thanks,
>>
>> - Paul
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/