Re: [RFC PATCH v7 1/7] Restartable sequences system call

From: Mathieu Desnoyers
Date: Mon Jul 25 2016 - 23:02:33 EST


----- On Jul 25, 2016, at 7:02 PM, Andy Lutomirski luto@xxxxxxxxxxxxxx wrote:

> On Thu, Jul 21, 2016 at 2:14 PM, Mathieu Desnoyers
> <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> Man page associated:
>>
>> RSEQ(2) Linux Programmer's Manual RSEQ(2)
>>
>> NAME
>> rseq - Restartable sequences and cpu number cache
>>
>> SYNOPSIS
>> #include <linux/rseq.h>
>>
>> int rseq(struct rseq * rseq, int flags);
>>
>> DESCRIPTION
>> The rseq() ABI accelerates user-space operations on per-cpu
>> data by defining a shared data structure ABI between each user-
>> space thread and the kernel.
>>
>> The rseq argument is a pointer to the thread-local rseq struc-
>> ture to be shared between kernel and user-space. A NULL rseq
>> value can be used to check whether rseq is registered for the
>> current thread.
>>
>> The layout of struct rseq is as follows:
>>
>> Structure alignment
>> This structure needs to be aligned on multiples of 64
>> bytes.
>>
>> Structure size
>> This structure has a fixed size of 128 bytes.
>>
>> Fields
>>
>> cpu_id
>> Cache of the CPU number on which the calling thread is
>> running.
>>
>> event_counter
>> Restartable sequences event_counter field.
>
> That's an unhelpful description.

Good point, how about:

event_counter
Counter guaranteed to be incremented when the current thread is
preempted or when a signal is delivered to the current thread.

Along the same lines, I would reword cpu_id as:

cpu_id
Cache of the CPU number on which the current thread is
running.
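
To make the layout concrete, here is a rough sketch of the registered
TLS area based only on the field descriptions above (field types,
ordering and trailing padding are assumptions for illustration; the
authoritative definition is struct rseq in <linux/rseq.h> from this
patch):

#include <stdint.h>

struct rseq_cs;                         /* critical section descriptor */

/* Illustrative only; see <linux/rseq.h> for the real struct rseq. */
struct rseq_sketch {
        int32_t cpu_id;                 /* cpu number cache */
        uint32_t event_counter;         /* incremented on preemption/signal delivery */
        struct rseq_cs *rseq_cs;        /* active critical section descriptor, or NULL */
        /* padding up to the fixed 128-byte size */
} __attribute__((aligned(64)));         /* man page: 64-byte alignment */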

>
>>
>> rseq_cs
>> Restartable sequences rseq_cs field. Points to a struct
>> rseq_cs.
>
> Why is it a pointer?

Rewording it like this should make it clearer:

rseq_cs
The rseq_cs field is a pointer to a struct rseq_cs. It is NULL when
no rseq assembly block critical section is active for the current
thread. Setting it to point to a critical section descriptor (struct
rseq_cs) marks the beginning of the critical section. It is cleared
after the end of the critical section.


>
>>
>> The layout of struct rseq_cs is as follows:
>>
>> Structure alignment
>> This structure needs to be aligned on multiples of 64
>> bytes.
>>
>> Structure size
>> This structure has a fixed size of 192 bytes.
>>
>> Fields
>>
>> start_ip
>> Instruction pointer address of the first instruction of
>> the sequence of consecutive assembly instructions.
>>
>> post_commit_ip
>> Instruction pointer address after the last instruction
>> of the sequence of consecutive assembly instructions.
>>
>> abort_ip
>> Instruction pointer address where to move the execution
>> flow in case of abort of the sequence of consecutive
>> assembly instructions.
>>
>> The flags argument is currently unused and must be specified as
>> 0.
>>
>> Typically, a library or application will keep the rseq struc-
>> ture in a thread-local storage variable, or other memory areas
>
> "variable or other memory area"

ok

>
>> belonging to each thread. It is recommended to perform volatile
>> reads of the thread-local cache to prevent the compiler from
>> doing load tearing. An alternative approach is to read each
>> field from inline assembly.
>
> I don't think the man page needs to tell people how to implement
> correct atomic loads.

ok, I can remove the two previous sentences.

>
>>
>> Each thread is responsible for registering its rseq structure.
>> Only one rseq structure address can be registered per thread.
>> Once set, the rseq address is idempotent for a given thread.
>
> "Idempotent" is a property that applies to an action, and the "rseq
> address" is not an action. I don't know what you're trying to say.

I mean there is only one address registered per thread, and it stays
registered for the lifetime of the thread. Perhaps I could say:

"Once set, the rseq address never changes for a given thread."

>
>>
>> In a typical usage scenario, the thread registering the rseq
>> structure will be performing loads and stores from/to that
>> structure. It is however also allowed to read that structure
>> from other threads. The rseq field updates performed by the
>> kernel provide single-copy atomicity semantics, which guarantee
>> that other threads performing single-copy atomic reads of the
>> cpu number cache will always observe a consistent value.
>
> s/single-copy/relaxed atomic/ perhaps?

ok

>
>>
>> Memory registered as rseq structure should never be deallocated
>> before the thread which registered it exits: specifically, it
>> should not be freed, and the library containing the registered
>> thread-local storage should not be dlclose'd. Violating this
>> constraint may cause a SIGSEGV signal to be delivered to the
>> thread.
>
> That's an unfortunate constraint for threads that exit without help.

I don't understand what you are pointing at here. I see this mostly as
a constraint on the lifetime of the library that holds the struct rseq
TLS rather than a constraint on the thread's lifetime.

>
>>
>> Unregistration of associated rseq structure is implicitly per-
>> formed when a thread or process exit.
>
> exits.

ok

>
> [...]
>
> Can you please document what this thing does prior to giving an
> example of how to use it.

Good point, will do. (More comments below on what can be added as
documentation.)

>
> Hmm, here are the docs, sort of:
>
>> diff --git a/kernel/rseq.c b/kernel/rseq.c
>> new file mode 100644
>> index 0000000..e1c847b
>> --- /dev/null
>> +++ b/kernel/rseq.c
>
>> +/*
>> + * Each restartable sequence assembly block defines a "struct rseq_cs"
>> + * structure which describes the post_commit_ip address, and the
>> + * abort_ip address where the kernel should move the thread instruction
>> + * pointer if a rseq critical section assembly block is preempted or if
>> + * a signal is delivered on top of a rseq critical section assembly
>> + * block. It also contains a start_ip, which is the address of the start
>> + * of the rseq assembly block, which is useful to debuggers.
>> + *
>> + * The algorithm for a restartable sequence assembly block is as
>> + * follows:
>> + *
>> + * rseq_start()
>> + *
>> + * 0. Userspace loads the current event counter value from the
>> + * event_counter field of the registered struct rseq TLS area,
>> + *
>> + * rseq_finish()
>> + *
>> + * Steps [1]-[3] (inclusive) need to be a sequence of instructions in
>> + * userspace that can handle being moved to the abort_ip between any
>> + * of those instructions.
>> + *
>> + * The abort_ip address needs to be equal or above the post_commit_ip.
>> + * Step [4] and the failure code step [F1] need to be at addresses
>> + * equal or above the post_commit_ip.
>> + *
>> + * 1. Userspace stores the address of the struct rseq cs rseq
>
> "struct rseq cs rseq" contains a typo.

should be "struct rseq_cs"

>
>> + * assembly block descriptor into the rseq_cs field of the
>> + * registered struct rseq TLS area.
>> + *
>> + * 2. Userspace tests to see whether the current event counter values
>> + * match those loaded at [0]. Manually jumping to [F1] in case of
>> + * a mismatch.
>
> Grammar issues here. More importantly, you said "values", but you
> only described one value.

Indeed, values -> value, and those -> the value

>
>> + *
>> + * Note that if we are preempted or interrupted by a signal
>> + * after [1] and before post_commit_ip, then the kernel also
>> + * performs the comparison performed in [2], and conditionally
>> + * clears rseq_cs, then jumps us to abort_ip.
>
> This is the first I've heard of rseq_cs being something that gets
> changed as a result of using this facility. What code sets it in the
> first place?

struct rseq_cs (the critical section descriptor) is statically declared
and never changes. What I should clarify above is that the rseq_cs field
of struct rseq gets cleared (not the struct rseq_cs per se).

The rseq_cs field of struct rseq is initially NULL, and is set to the
address of the struct rseq_cs descriptor when entering the critical
section. It is set back to NULL right after exiting the critical
section, through both the success and failure paths.
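
To illustrate that lifecycle, here is an informal sketch (not the
test-program code: in the real implementation the critical section is
an inline-assembly block so the three addresses can be taken from
assembly labels, and "__rseq_area" below is an illustrative name for
the registered struct rseq):

#include <stdint.h>

/* Illustrative descriptor layout, following the man page fields. */
struct rseq_cs_sketch {
        uint64_t start_ip;              /* first instruction of the asm block   */
        uint64_t post_commit_ip;        /* address right after the commit store */
        uint64_t abort_ip;              /* where the kernel moves us on abort   */
} __attribute__((aligned(64)));

/* Statically declared once per asm block; never modified afterwards. */
static struct rseq_cs_sketch my_cs;

/*
 * Ordering of the rseq_cs field updates around the critical section:
 *
 *      __rseq_area.rseq_cs = &my_cs;   enter: publish the descriptor
 *      ... compare event_counter, then the single commit store ...
 *      __rseq_area.rseq_cs = NULL;     leave, on the success path
 *
 * and the abort path at abort_ip also sets rseq_cs back to NULL.
 */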

>
> I think you've also mentioned "preemption" and "migration". Which do you mean?

We really care about preemption here. Every migration implies a
preemption from a user-space perspective. If we only cared about
keeping the CPU id up to date, hooking into migration would be
enough. But since we want atomicity guarantees for restartable
sequences, we need to hook into preemption.

I should update the changelog of patch 1/7 to specify that we really do
hook into preemption, even for the cpu_id update part.

>
>> + *
>> + * 3. Userspace critical section final instruction before
>> + * post_commit_ip is the commit. The critical section is
>> + * self-terminating.
>> + * [post_commit_ip]
>> + *
>> + * 4. Userspace clears the rseq_cs field of the struct rseq
>> + * TLS area.
>> + *
>> + * 5. Return true.
>> + *
>> + * On failure at [2]:
>> + *
>
> A major issue I have with percpu critical sections or rseqs or
> whatever you want to call them is that, every time I talk to someone
> about them, there are a different set of requirements that they are
> supposed to satisfy. So:
>
> What problem does this solve?

It allows user-space to perform update operations on per-cpu data without
requiring heavy-weight atomic operations.
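
As a concrete illustration, a per-cpu counter increment could look like
the sketch below. rseq_start() and rseq_finish() refer to the helpers
in the patch's user-space test code; their exact signatures and the
contents of struct rseq_state are assumptions here, made up for
illustration. The final store and the event_counter check live inside
rseq_finish()'s inline-assembly block, and the kernel aborts that block
if the thread was preempted or received a signal since rseq_start():

#include <stdint.h>

/* Assumed snapshot taken at the start of the sequence. */
struct rseq_state {
        int32_t cpu_id;                 /* cpu_id seen by rseq_start()        */
        uint32_t event_counter;         /* event_counter seen by rseq_start() */
};

/* Assumed prototypes standing in for the test program's helpers. */
struct rseq_state rseq_start(void);
int rseq_finish(intptr_t *p, intptr_t newval, struct rseq_state start);

/* Increment this thread's current cpu slot without atomic instructions. */
static void percpu_counter_inc(intptr_t *counters /* one slot per cpu */)
{
        struct rseq_state start;

        do {
                start = rseq_start();   /* snapshots event_counter and cpu_id */
        } while (!rseq_finish(&counters[start.cpu_id],
                              counters[start.cpu_id] + 1, start));
}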

>
> What are its atomicity properties? Under what conditions does it
> work? What assumptions does it make?

Restartable sequences are atomic with respect to preemption (which makes
them atomic with respect to other threads running on the same CPU), as
well as with respect to signal delivery (user-space execution contexts
nested over the same thread).

They are suited for update operations on per-cpu data.

They can be used on data structures shared between threads within a
process, and on data structures shared between threads across different
processes.

>
> What real-world operations become faster as a result of rseq (as
> opposed to just cpu number queries)?

A few examples of accelerated operations:

- incrementing per-cpu counters,
- per-cpu spin-lock,
- per-cpu linked-lists (including memory allocator free-list),
- per-cpu ring buffer,

Perhaps others will have other operations in mind?

Note that compared to Paul Turner's patchset, I removed the percpu_cmpxchg
and percpu_cmpxchg_check APIs from the test program rseq.h in user-space,
because I found out that it was difficult to guarantee progress with those
APIs. The do_rseq() approach, which does 2 attempts and falls back to
locking, does provide progress guarantees even in the face of (unlikely)
frequent migrations.
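
The shape of that progress guarantee is roughly the following (names
and signatures below are illustrative, not the exact do_rseq() API from
the test program):

#include <stdbool.h>

bool rseq_try_update(void);             /* one restartable attempt (assumed)         */
void locked_update(void);               /* same update under a per-cpu lock (assumed) */

static void do_update(void)
{
        if (rseq_try_update())          /* first attempt  */
                return;
        if (rseq_try_update())          /* second attempt */
                return;
        locked_update();                /* fallback: guaranteed progress */
}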

>
> Why is it important for the kernel to do something special on every preemption?

This is how we can ensure that the entire critical section,
consisting of both the C part and the assembly instruction
sequence, will issue the commit instruction only if executed
atomically with respect to other threads scheduled on the
same CPU.

>
> What "events" does "event_counter" count and why?

Technically, it increments each time a thread returns to
user-space with the NOTIFY_RESUME thread flag set. We make sure this
flag is set on preemption (when the task is scheduled out), as well as
on signal delivery, so it is guaranteed to increment when either of
those events takes place. It can, however, also increment due to other
kernel code setting TIF_NOTIFY_RESUME before returning to user-space.

It is meant to allow user-space to detect preemption and signal
delivery, not to count the exact number of such events.

>
>
> If I'm understanding the intent of this code correctly (which is a big
> if), I think you're trying to do this:
>
> start a critical section;
> compute something;
> commit;
> if (commit worked)
> return;
> else
> try again;
>
> where "commit;" is a single instruction. The kernel guarantees that
> if the thread is preempted (or migrated, perhaps?)

A thread needs to have been preempted in order to be migrated, so
from a user-space perspective, detecting preemption is a superset
of detecting migration. We track preemption and signal delivery here.

> between the start
> and commit steps then commit will be forced to fail (or be skipped
> entirely). Because I don't understand what you're doing with this
> primitive, I can't really tell why you need to detect preemption as
> opposed to just migration.
>
> For example: would the following primitive solve the same problem?
>
> begin_dont_migrate_me()
>
> figure out what store to do to take the percpu lock;
> do that store;
>
> if (end_dont_migrate_me())
> return;
>
> // oops, the kernel migrated us. retry.

First, prohibiting migration from user-space has been frowned upon
by scheduler developers for a long time, and I doubt this mindset will
change.

But if we look at it from the point of view of letting user-space
retry when it detects migration (rather than preemption), it would
require that we use an atomic instruction (although without the lock
prefix) as the commit instruction to ensure atomicity with respect
to other threads running on the same CPU. Detecting preemption
instead allows us to use a simple store instruction as the commit.
Simple store instructions (e.g. mov) are faster than atomic
instructions (e.g. xadd, cmpxchg...). Moreover, detecting
migration and using an atomic instruction as the commit is prone to
ABA problems (e.g. in the free-list use-case), which are prevented by
restarting on preemption or signal delivery.
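
To illustrate the difference (not code from the patch): with restart on
preemption and signal delivery the commit can be a plain store, whereas
detecting only migration would force an atomic read-modify-write to
stay atomic against other threads scheduled on the same CPU:

#include <stdint.h>

/* Commit under restart-on-preemption: a single plain store (mov). */
static void commit_with_rseq(intptr_t *slot, intptr_t newval)
{
        *slot = newval;
}

/*
 * Commit if only migration were detected: an atomic read-modify-write,
 * here via the GCC builtin (which emits a locked cmpxchg on x86, even
 * stronger than the lock-prefix-free RMW mentioned above). Slower than
 * a plain store, and still exposed to ABA reuse.
 */
static int commit_migration_only(intptr_t *slot, intptr_t expected, intptr_t newval)
{
        return __atomic_compare_exchange_n(slot, &expected, newval, 0,
                                           __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}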

Thanks for looking into it!

Mathieu


>
>
> --Andy

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com