Re: [RFC PATCH v7 1/7] Restartable sequences system call
From: Andy Lutomirski
Date: Mon Jul 25 2016 - 19:03:02 EST
On Thu, Jul 21, 2016 at 2:14 PM, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> Man page associated:
>
> RSEQ(2) Linux Programmer's Manual RSEQ(2)
>
> NAME
> rseq - Restartable sequences and cpu number cache
>
> SYNOPSIS
> #include <linux/rseq.h>
>
> int rseq(struct rseq * rseq, int flags);
>
> DESCRIPTION
> The rseq() ABI accelerates user-space operations on per-cpu
> data by defining a shared data structure ABI between each user-
> space thread and the kernel.
>
> The rseq argument is a pointer to the thread-local rseq strucâ
> ture to be shared between kernel and user-space. A NULL rseq
> value can be used to check whether rseq is registered for the
> current thread.
>
> The layout of struct rseq is as follows:
>
> Structure alignment
> This structure needs to be aligned on multiples of 64
> bytes.
>
> Structure size
> This structure has a fixed size of 128 bytes.
>
> Fields
>
> cpu_id
> Cache of the CPU number on which the calling thread is
> running.
>
> event_counter
> Restartable sequences event_counter field.
That's an unhelpful description.
>
> rseq_cs
> Restartable sequences rseq_cs field. Points to a struct
> rseq_cs.
Why is it a pointer?
>
> The layout of struct rseq_cs is as follows:
>
> Structure alignment
> This structure needs to be aligned on multiples of 64
> bytes.
>
> Structure size
> This structure has a fixed size of 192 bytes.
>
> Fields
>
> start_ip
> Instruction pointer address of the first instruction of
> the sequence of consecutive assembly instructions.
>
> post_commit_ip
> Instruction pointer address after the last instruction
> of the sequence of consecutive assembly instructions.
>
> abort_ip
> Instruction pointer address where to move the execution
> flow in case of abort of the sequence of consecutive
> assembly instructions.
>
> The flags argument is currently unused and must be specified as
> 0.
>
> Typically, a library or application will keep the rseq strucâ
> ture in a thread-local storage variable, or other memory areas
"variable or other memory area"
> belonging to each thread. It is recommended to perform volatile
> reads of the thread-local cache to prevent the compiler from
> doing load tearing. An alternative approach is to read each
> field from inline assembly.
I don't think the man page needs to tell people how to implement
correct atomic loads.
>
> Each thread is responsible for registering its rseq structure.
> Only one rseq structure address can be registered per thread.
> Once set, the rseq address is idempotent for a given thread.
"Idempotent" is a property that applies to an action, and the "rseq
address" is not an action. I don't know what you're trying to say.
>
> In a typical usage scenario, the thread registering the rseq
> structure will be performing loads and stores from/to that
> structure. It is however also allowed to read that structure
> from other threads. The rseq field updates performed by the
> kernel provide single-copy atomicity semantics, which guarantee
> that other threads performing single-copy atomic reads of the
> cpu number cache will always observe a consistent value.
s/single-copy/relaxed atomic/ perhaps?
>
> Memory registered as rseq structure should never be deallocated
> before the thread which registered it exits: specifically, it
> should not be freed, and the library containing the registered
> thread-local storage should not be dlclose'd. Violating this
> constraint may cause a SIGSEGV signal to be delivered to the
> thread.
That's an unfortunate constraint for threads that exit without help.
>
> Unregistration of associated rseq structure is implicitly perâ
> formed when a thread or process exit.
exits.
[...]
Can you please document what this thing does prior to giving an
example of how to use it.
Hmm, here are the docs, sort of:
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> new file mode 100644
> index 0000000..e1c847b
> --- /dev/null
> +++ b/kernel/rseq.c
> +/*
> + * Each restartable sequence assembly block defines a "struct rseq_cs"
> + * structure which describes the post_commit_ip address, and the
> + * abort_ip address where the kernel should move the thread instruction
> + * pointer if a rseq critical section assembly block is preempted or if
> + * a signal is delivered on top of a rseq critical section assembly
> + * block. It also contains a start_ip, which is the address of the start
> + * of the rseq assembly block, which is useful to debuggers.
> + *
> + * The algorithm for a restartable sequence assembly block is as
> + * follows:
> + *
> + * rseq_start()
> + *
> + * 0. Userspace loads the current event counter value from the
> + * event_counter field of the registered struct rseq TLS area,
> + *
> + * rseq_finish()
> + *
> + * Steps [1]-[3] (inclusive) need to be a sequence of instructions in
> + * userspace that can handle being moved to the abort_ip between any
> + * of those instructions.
> + *
> + * The abort_ip address needs to be equal or above the post_commit_ip.
> + * Step [4] and the failure code step [F1] need to be at addresses
> + * equal or above the post_commit_ip.
> + *
> + * 1. Userspace stores the address of the struct rseq cs rseq
"struct rseq cs rseq" contains a typo.
> + * assembly block descriptor into the rseq_cs field of the
> + * registered struct rseq TLS area.
> + *
> + * 2. Userspace tests to see whether the current event counter values
> + * match those loaded at [0]. Manually jumping to [F1] in case of
> + * a mismatch.
Grammar issues here. More importantly, you said "values", but you
only described one value.
> + *
> + * Note that if we are preempted or interrupted by a signal
> + * after [1] and before post_commit_ip, then the kernel also
> + * performs the comparison performed in [2], and conditionally
> + * clears rseq_cs, then jumps us to abort_ip.
This is the first I've heard of rseq_cs being something that gets
changed as a result of using this facility. What code sets it in the
first place?
I think you've also mentioned "preemption" and "migration". Which do you mean?
> + *
> + * 3. Userspace critical section final instruction before
> + * post_commit_ip is the commit. The critical section is
> + * self-terminating.
> + * [post_commit_ip]
> + *
> + * 4. Userspace clears the rseq_cs field of the struct rseq
> + * TLS area.
> + *
> + * 5. Return true.
> + *
> + * On failure at [2]:
> + *
A major issue I have with percpu critical sections or rseqs or
whatever you want to call them is that, every time I talk to someone
about them, there are a different set of requirements that they are
supposed to satisfy. So:
What problem does this solve?
What are its atomicity properties? Under what conditions does it
work? What assumptions does it make?
What real-world operations become faster as a result of rseq (as
opposed to just cpu number queries)?
Why is it important for the kernel to do something special on every preemption?
What "events" does "event_counter" count and why?
If I'm understanding the intent of this code correctly (which is a big
if), I think you're trying to do this:
start a critical section;
compute something;
commit;
if (commit worked)
return;
else
try again;
where "commit;" is a single instruction. The kernel guarantees that
if the thread is preempted (or migrated, perhaps?) between the start
and commit steps then commit will be forced to fail (or be skipped
entirely). Because I don't understand what you're doing with this
primitive, I can't really tell why you need to detect preemption as
opposed to just migration.
For example: would the following primitive solve the same problem?
begin_dont_migrate_me()
figure out what store to do to take the percpu lock;
do that store;
if (end_dont_migrate_me())
return;
// oops, the kernel migrated us. retry.
--Andy