Re: [PATCH man-pages] Add rseq manpage

From: Mathieu Desnoyers
Date: Mon Apr 27 2020 - 11:15:16 EST


----- On Mar 4, 2019, at 1:02 PM, Mathieu Desnoyers mathieu.desnoyers@xxxxxxxxxxxx wrote:

> ----- On Feb 28, 2019, at 3:42 AM, Michael Kerrisk mtk.manpages@xxxxxxxxx wrote:
>
>> On 12/6/18 3:42 PM, Mathieu Desnoyers wrote:
>>> [ Michael, rseq(2) was merged into 4.18. Can you have a look at this
>>> patch which adds rseq documentation to the man-pages project ? ]
>> Hi Mathieu
>>
>> Sorry for the long delay. I've merged this page into a private
>> branch and have done quite a lot of editing. I have many
>> questions :-).
>
> No worries, thanks for looking into it!
>
>>
>> In the first instance, I think it is probably best to have
>> a free-form text discussion rather than firing patches
>> back and forward. Could you take a look at the questions below
>> and respond?
>
> Sure,

Hi Michael,

Gentle bump of this email in your inbox, since I suspect you might have
forgotten about it altogether. A year ago you had a heavily edited
man page for rseq(2). I provided the requested feedback, but I have not
heard back from you since then.

We are now close to integrating rseq into glibc, and having an official
man page would be useful.

Thanks,

Mathieu


>
>>
>> Thanks,
>>
>> Michael
>>
>>
>> RSEQ(2) Linux Programmer's Manual RSEQ(2)
>>
>> NAME
>> rseq - Restartable sequences and CPU number cache
>>
>> SYNOPSIS
>> #include <linux/rseq.h>
>>
>> int rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig);
>>
>> DESCRIPTION
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | Imagine you are someone who is pretty new to this
>> | idea... What is notably lacking from this page is
>> | an overview explaining:
>> |
>> | * What a restartable sequence actually is.
>> |
>> | * An outline of the steps to perform when using
>> |   restartable sequences / rseq(2).
>> |
>> | I.e., something along the lines of Jon Corbet's
>> | https://lwn.net/Articles/697979/. Can you come up
>> | with something? (Part of it might be at the start of
>> | this page, and the rest in NOTES; it need not be all
>> | in one place.)
>> +------------------------------------------------------
>
> We recently published a blog post about rseq, which might contain just the
> right level of information we are looking for here:
>
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
>
> Could something along the following lines work ?
>
> "A restartable sequence is a sequence of instructions guaranteed to be
> executed atomically with respect to other threads and signal handlers on the
> current CPU. If its execution does not complete atomically, the kernel changes
> the execution flow by jumping to an abort handler defined by user-space for
> that restartable sequence.
>
> Using restartable sequences requires registering a __rseq_abi thread-local
> storage data structure (struct rseq) through the rseq(2) system call. Only
> one __rseq_abi can be registered per thread, so user-space libraries and
> applications must follow a user-space ABI defining how to share this
> resource. That ABI is defined by the C library.
>
> The __rseq_abi contains an rseq_cs field which points to the currently
> executing critical section. For each thread, a single rseq critical section
> can run at any given point. Each critical section needs to be implemented
> in assembly."
>
>
>> The rseq() ABI accelerates user-space operations on per-CPU data by
>> defining a shared data structure ABI between each user-space thread and
>> the kernel.
>>
>> It allows user-space to perform update operations on per-CPU data
>> without requiring heavy-weight atomic operations.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the following para: "a hardware execution context"?
>> | What is the contrast being drawn here? It would be
>> | good to state it more explicitly.
>> +------------------------------------------------------
>
> Here I'm trying to clarify what we mean by "CPU" in this document. We define
> a CPU as having its own number returned by sched_getcpu(), which I think is
> sometimes referred to as "logical cpu". This is the current hyperthread on
> the current core, on the current "physical CPU", in the current socket.
>
>
>> The term CPU used in this documentation refers to a hardware execution
>> context.
>>
>> Restartable sequences are atomic with respect to preemption (making it
>> atomic with respect to other threads running on the same CPU), as well
>> as signal delivery (user-space execution contexts nested over the same
>> thread). They either complete atomically with respect to preemption on
>> the current CPU and signal delivery, or they are aborted.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the preceding sentence, we need a definition of
>> | "current CPU".
>> +------------------------------------------------------
>
> Not sure how to word it. If a thread or signal handler execution context can
> possibly run and issue, for instance, "sched_getcpu()" between the beginning
> and the end of the critical section and get the same logical CPU number as the
> current thread, then we are guaranteed to abort. Of course, sched_getcpu() is
> just one way to get the CPU number, considering that we can also read it
> from the __rseq_abi cpu_id and cpu_id_start fields.
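To make the "logical CPU" notion above concrete, here is a minimal C sketch
that queries it through sched_getcpu(3). The helper name
current_logical_cpu() is hypothetical and only used for illustration. The
value it returns can already be stale when the caller looks at it, since the
thread may migrate at any time, which is the reason a restartable sequence
has to be aborted on migration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: returns the logical CPU number on which the calling
 * thread was running at the time of the call, or -1 on error.
 * The thread may migrate right after the call, so the result is only a hint.
 */
int current_logical_cpu(void)
{
    return sched_getcpu();
}
```

The number returned here is the same kind of value the kernel stores in the
cpu_id and cpu_id_start fields of a registered struct rseq.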
>
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the following, does "It is" mean "Restartable
>> | sequences are"?
>> +------------------------------------------------------
>> It is suited for update operations on per-CPU data.
>
> Yes.
>
>
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the following, does "It is" mean "Restartable
>> | sequences are"?
>> +------------------------------------------------------
>
> "Restartable sequences can be..."
>
>> It can be used on data structures shared between threads within a
>> process, and on data structures shared between threads across different
>> processes.
>>
>> Some examples of operations that can be accelerated or improved by this
>> ABI:
>>
>> * Memory allocator per-CPU free-lists
>>
>> * Querying the current CPU number
>>
>> * Incrementing per-CPU counters
>>
>> * Modifying data protected by per-CPU spinlocks
>>
>> * Inserting/removing elements in per-CPU linked-lists
>>
>> * Writing/reading per-CPU ring buffers content
>>
>> * Accurately reading performance monitoring unit counters with respect
>>   to thread migration
>>
>> Restartable sequences must not perform system calls. Doing so may
>> result in termination of the process by a segmentation fault.
>>
>> The rseq argument is a pointer to the thread-local rseq structure to be
>> shared between kernel and user-space. The layout of this structure is
>> shown below.
>>
>> The rseq_len argument is the size of the struct rseq to register.
>>
>> The flags argument is 0 for registration, or RSEQ_FLAG_UNREGISTER for
>> unregistration.
>>
>> The sig argument is the 32-bit signature to be expected before the
>> abort handler code.
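The four arguments described above can be sketched as a raw system call in C.
This is only an illustration under stated assumptions: struct rseq_sketch is
a hypothetical stand-in mirroring the abbreviated structure shown in this
thread, SKETCH_RSEQ_SIG is an assumed signature value (the real one is chosen
per architecture by whoever owns the registration), and
sketch_rseq_register() is a made-up helper name. If the C library has already
registered rseq for the thread, the call is expected to fail with one of the
errors listed under ERRORS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical mirror of the abbreviated struct rseq from this thread. */
struct rseq_sketch {
    uint32_t cpu_id_start;   /* set to a possible CPU number (0) by the caller */
    uint32_t cpu_id;         /* set to -1 by the caller before registration */
    uint64_t rseq_cs;        /* NULL: no critical section active */
    uint32_t flags;
} __attribute__((aligned(32)));

/* Assumed signature value, for illustration only. */
#define SKETCH_RSEQ_SIG 0x53053053U

/* One area per thread, since registration is per thread. */
static __thread struct rseq_sketch sketch_rseq_area = {
    .cpu_id = (uint32_t)-1,
};

/* Returns 0 on success, or -1 with errno set (e.g. EBUSY, EINVAL, ENOSYS). */
int sketch_rseq_register(void)
{
#ifdef __NR_rseq
    return (int)syscall(__NR_rseq, &sketch_rseq_area,
                        (uint32_t)sizeof(sketch_rseq_area),
                        0 /* flags: 0 means register */, SKETCH_RSEQ_SIG);
#else
    /* Kernel headers predate rseq: behave as if the call were missing. */
    errno = ENOSYS;
    return -1;
#endif
}
```

Unregistration would pass the same address, length, and signature, with
RSEQ_FLAG_UNREGISTER in the flags argument.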
>>
>> The rseq structure
>> The struct rseq is aligned on a 32-byte boundary. This structure is
>> extensible. Its size is passed as parameter to the rseq() system call.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | Below, I added the structure definition (in abbreviated
>> | form). Is there any reason not to do this?
>> +------------------------------------------------------
>
> It seems appropriate.
>
>>
>> struct rseq {
>>     __u32 cpu_id_start;
>>     __u32 cpu_id;
>>     union {
>>         __u64 ptr64;
>> #ifdef __LP64__
>>         __u64 ptr;
>> #else
>>         ....
>> #endif
>>     } rseq_cs;
>>     __u32 flags;
>> } __attribute__((aligned(4 * sizeof(__u64))));
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the text below, I think it would be helpful to
>> | explicitly note which of these fields are set by the
>> | kernel (on return from the rseq() call) and which
>> | are set by the caller (before calling rseq()). Is
>> | the following correct:
>> |
>> | cpu_id_start - initialized by caller to possible
>> |                CPU number (e.g., 0), updated by kernel
>> |                on return
>
> "initialized by caller to possible CPU number (e.g., 0), updated
> by the kernel on return, and updated by the kernel on return after
> thread migration to a different CPU"
>
>> |
>> | cpu_id - initialized to -1 by caller,
>> |          updated by kernel on return
>
> "initialized to -1 by caller, updated by the kernel on return, and
> updated by the kernel on return after thread migration to a different
> CPU"
>
>> |
>> | rseq_cs - initialized by caller, either to NULL
>> |           or a pointer to an 'rseq_cs' structure
>> |           that is initialized by the caller
>
> "initialized by caller to NULL, then, after returning from successful
> registration, updated to a pointer to an "rseq_cs" structure by user-space.
> Set to NULL by the kernel when it restarts a rseq critical section,
> when it preempts or delivers a signal outside of the range targeted by the
> rseq_cs. Set to NULL by user-space before reclaiming memory that
> contains the targeted struct rseq_cs."
>
>
>> |
>> | flags - initialized by caller, used by kernel
>> +------------------------------------------------------
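The layout and the caller-side initial values discussed above can be checked
with a small C sketch. It assumes a hypothetical struct rseq_sketch that
mirrors the abbreviated definition, with the rseq_cs union reduced to a
single 64-bit member because the 32-bit branch is elided in the text. The
helper rseq_sketch_init() is likewise made up for illustration and is not
part of any library.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical mirror of the abbreviated struct rseq. */
struct rseq_sketch {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;   /* stands in for the rseq_cs union (ptr64 member) */
    uint32_t flags;
} __attribute__((aligned(4 * sizeof(uint64_t))));

/* The text states a 32-byte alignment; padding brings the size to 32 too. */
static_assert(alignof(struct rseq_sketch) == 32, "32-byte alignment expected");
static_assert(sizeof(struct rseq_sketch) == 32, "padded size expected to be 32");

/* Caller-side values before registration, as described in the replies above. */
void rseq_sketch_init(struct rseq_sketch *r)
{
    r->cpu_id_start = 0;          /* a possible CPU number */
    r->cpu_id = (uint32_t)-1;     /* "uninitialized" marker */
    r->rseq_cs = 0;               /* NULL: no critical section active */
    r->flags = 0;
}
```

After a successful registration the kernel overwrites cpu_id_start and
cpu_id, so these values only describe the state before rseq() returns.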
>>
>> The structure fields are as follows:
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the following paragraph, and in later places, I
>> | changed "current thread" to "calling thread". Okay?
>> +------------------------------------------------------
>
> Yes.
>
>>
>> cpu_id_start
>> Optimistic cache of the CPU number on which the calling thread
>> is running. The value in this field is guaranteed to always be
>> a possible CPU number, even when rseq is not initialized. The
>> value it contains should always be confirmed by reading the
>> cpu_id field.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | What does the last sentence mean?
>> +------------------------------------------------------
>
> It means the calling thread can always use __rseq_abi.cpu_id_start to index
> an array of per-cpu data and this won't cause an out-of-bounds access on
> load, but it does not mean it really contains the current CPU number. For
> instance, if rseq registration failed, it will contain "0".
>
> Therefore, it's fine to use cpu_id_start to fetch per-cpu data, but the
> cpu_id field should be compared against the cpu_id_start value, so the case
> where rseq is not registered is caught. In that case, cpu_id_start=0, but
> cpu_id=-1 or -2, which differ, and therefore the critical section needs to
> jump to the abort handler.
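The index-then-validate logic described above can be sketched in plain C.
This is deliberately NOT a restartable sequence: a real one performs the
comparison and the final store inside an assembly critical section, so the
code below only illustrates the control flow. The names SKETCH_NR_CPUS,
struct rseq_view, per_cpu_counter, and sketch_percpu_inc() are all
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4   /* hypothetical number of possible CPUs */

/* Only the two CPU-number fields matter for this illustration. */
struct rseq_view {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
};

static long per_cpu_counter[SKETCH_NR_CPUS];

/*
 * Returns 1 if the increment was applied, 0 if the caller must fall back
 * (rseq not registered, or the two fields disagree).
 */
int sketch_percpu_inc(const volatile struct rseq_view *r)
{
    uint32_t cpu = r->cpu_id_start;       /* always a possible CPU number */
    long *slot = &per_cpu_counter[cpu];   /* safe to index without bounds check */

    if (cpu != r->cpu_id)                 /* unregistered: 0 vs (uint32_t)-1 */
        return 0;                         /* would jump to the abort handler */

    (*slot)++;                            /* the "commit" step */
    return 1;
}
```

With cpu_id_start=0 and cpu_id=-1 the comparison fails, so the per-CPU slot
is fetched but never modified, which matches the unregistered case above.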
>
>>
>> This field is an optimistic cache in the sense that it is always
>> guaranteed to hold a valid CPU number in the range
>> [0..(nr_possible_cpus - 1)]. It can therefore be loaded by user-space and
>> used as an offset in per-CPU data structures without having to
>> check whether its value is within the valid bounds compared to
>> the number of possible CPUs in the system.
>>
>> For user-space applications executed on a kernel without rseq
>> support, the cpu_id_start field stays initialized at 0, which is
>> indeed a valid CPU number. It is therefore valid to use it as
>> an offset in per-CPU data structures, and only validate whether
>> it's actually the current CPU number by comparing it with the
>> cpu_id field within the rseq critical section.
>>
>> If the kernel does not provide rseq support, that cpu_id field
>> stays initialized at -1, so the comparison always fails, as
>> intended. It is then up to user-space to use a fall-back
>> mechanism, considering that rseq is not available.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | The last sentence is rather difficult to grok. Can
>> | we say some more here?
>> +------------------------------------------------------
>
> Perhaps we could use the explanation I've written above in my reply ?
>
>>
>> cpu_id Cache of the CPU number on which the calling thread is running.
>> -1 if uninitialized.
>>
>> rseq_cs
>> The rseq_cs field is a pointer to a struct rseq_cs (described
>> below). It is NULL when no rseq assembly block critical section
>> is active for the calling thread. Setting it to point to a
>> critical section descriptor (struct rseq_cs) marks the beginning
>> of the critical section.
>>
>> flags Flags indicating the restart behavior for the calling thread.
>> This is mainly used for debugging purposes. Can be either:
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>
> Inhibit instruction sequence block restart on preemption for this thread.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>
> Inhibit instruction sequence block restart on migration for this thread.
>
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | Each of the above values needs an explanation.
>> |
>> | Is it correct that only one of the values may be
>> | specified in 'flags'? I ask because in the 'rseq_cs'
>> | structure below, the 'flags' field is a bit mask
>> | where any combination of these flags may be ORed
>> | together.
>> +------------------------------------------------------
>
> Those are also masks and can be ORed.
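Since the flags are masks, the rule stated above (restart on signal can only
be inhibited together with restart on preemption and on migration) can be
expressed as a small check. The SKETCH_* constants are local stand-ins; their
bit positions are an assumption believed to match <linux/rseq.h>, and
sketch_flags_ok() is a hypothetical helper, not a kernel or libc function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the RSEQ_CS_FLAG_* masks (bit positions assumed). */
#define SKETCH_NO_RESTART_ON_PREEMPT (1U << 0)
#define SKETCH_NO_RESTART_ON_SIGNAL  (1U << 1)
#define SKETCH_NO_RESTART_ON_MIGRATE (1U << 2)

/* Returns true if the combination respects the rule described above. */
bool sketch_flags_ok(uint32_t flags)
{
    const uint32_t all = SKETCH_NO_RESTART_ON_PREEMPT |
                         SKETCH_NO_RESTART_ON_SIGNAL |
                         SKETCH_NO_RESTART_ON_MIGRATE;

    if (flags & ~all)
        return false;                   /* unknown bit set */
    if (flags & SKETCH_NO_RESTART_ON_SIGNAL)
        return (flags & all) == all;    /* signal needs the other two */
    return true;
}
```

A combination such as preempt plus migrate is accepted, while the signal
flag on its own is the case that would end in a segmentation fault.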
>
>
>>
>> The rseq_cs structure
>> The struct rseq_cs is aligned on a 32-byte boundary and has a fixed
>> size of 32 bytes.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | Below, I added the structure definition (in abbreviated
>> | form). Is there any reason not to do this?
>> +------------------------------------------------------
>
> It's fine.
>
>>
>> struct rseq_cs {
>>     __u32 version;
>>     __u32 flags;
>>     __u64 start_ip;
>>     __u64 post_commit_offset;
>>     __u64 abort_ip;
>> } __attribute__((aligned(4 * sizeof(__u64))));
>>
>> The structure fields are as follows:
>>
>> version
>> Version of this structure.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | What does 'version' need to be initialized to?
>> +------------------------------------------------------
>
> Currently version needs to be 0. Eventually, if we implement support for new
> flags to rseq(), we could add feature flags which register support for newer
> versions of struct rseq_cs.
>
>>
>> flags Flags indicating the restart behavior of this structure. Can be
>> a combination of:
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
>
> Inhibit instruction sequence block restart on preemption for this thread.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>
> Inhibit instruction sequence block restart on signal delivery for this thread.
> Restart on signal can only be inhibited when restart on preemption and restart
> on migration are inhibited too, else it will terminate the offending process
> with a segmentation fault.
>
>>
>> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>
> Inhibit instruction sequence block restart on migration for this thread.
>
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | Each of the above values needs an explanation.
>> +------------------------------------------------------
>>
>> start_ip
>> Instruction pointer address of the first instruction of the
>> sequence of consecutive assembly instructions.
>>
>> post_commit_offset
>> Offset (from start_ip address) of the address after the last
>> instruction of the sequence of consecutive assembly
>> instructions.
>>
>> abort_ip
>> Instruction pointer address where to move the execution flow in
>> case of abort of the sequence of consecutive assembly
>> instructions.
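The relationship between start_ip and post_commit_offset can be illustrated
with a range check in C: an instruction pointer is inside the critical
section when its distance from start_ip is smaller than post_commit_offset.
The struct rseq_cs_sketch type and the sketch_ip_in_cs() helper are
hypothetical, and the unsigned-subtraction form of the test is an assumption
about how such a check is commonly written, not a statement about the kernel
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of struct rseq_cs as shown above. */
struct rseq_cs_sketch {
    uint32_t version;              /* currently 0 */
    uint32_t flags;
    uint64_t start_ip;             /* first instruction of the sequence */
    uint64_t post_commit_offset;   /* length of the sequence in bytes */
    uint64_t abort_ip;             /* where execution resumes on abort */
} __attribute__((aligned(4 * sizeof(uint64_t))));

static_assert(sizeof(struct rseq_cs_sketch) == 32, "fixed size of 32 bytes");

/*
 * True if ip lies in [start_ip, start_ip + post_commit_offset).
 * The unsigned subtraction wraps for ip < start_ip, so a single
 * comparison covers both bounds.
 */
bool sketch_ip_in_cs(const struct rseq_cs_sketch *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}
```

When a thread is preempted or receives a signal while its instruction
pointer is inside this range, execution resumes at abort_ip.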
>>
>> NOTES
>> A single library per process should keep the rseq structure in a
>> thread-local storage variable. The cpu_id field should be initialized
>> to -1, and the cpu_id_start field should be initialized to a possible
>> CPU value (typically 0).
>
> The part above is not quite right. All applications/libraries wishing to
> register rseq must follow the ABI specified by the C library. It can be
> defined within more than a single application/library, but in the end only
> one symbol will be chosen for the process's global symbol table.
>
>>
>> Each thread is responsible for registering and unregistering its rseq
>> structure. No more than one rseq structure address can be registered
>> per thread at a given time.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the following paragraph, what is the difference
>> | between "freed" and "reclaim"? I'm supposing they
>> | mean the same thing, but it's not clear. And if they
>> | do mean the same thing, then the first two sentences
>> | appear to contain contradictory information.
>> +------------------------------------------------------
>
> They mean the same thing, and they are subtly not contradictory.
>
> The first states that memory of a _registered_ rseq object must not
> be freed before the thread exits.
>
> The second states that memory of a rseq object must not be freed before
> it is unregistered or the thread exits.
>
> Do you have an alternative wording in mind to make this clearer ?
>
>>
>> Memory of a registered rseq object must not be freed before the thread
>> exits. Reclaim of rseq object's memory must only be done after either
>> an explicit rseq unregistration is performed or after the thread exits.
>> Keep in mind that the implementation of the Thread-Local Storage (C
>> language __thread) lifetime does not guarantee existence of the TLS
>> area up until the thread exits.
>>
>> In a typical usage scenario, the thread registering the rseq structure
>> will be performing loads and stores from/to that structure. It is
>> however also allowed to read that structure from other threads. The rseq
>> field updates performed by the kernel provide relaxed atomicity
>> semantics, which guarantee that other threads performing relaxed atomic
>> reads of the CPU number cache will always observe a consistent value.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | In the preceding paragraph, can we reasonably add
>> | some words to explain "relaxed atomicity semantics"
>> | and "relaxed atomic reads"?
>> +------------------------------------------------------
>
> Not sure how to word this exactly, but here it means the stores and loads need
> to be done atomically, but don't require nor provide any ordering guarantees
> with respect to other loads/stores (no memory barriers).
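A relaxed atomic read in the sense above can be sketched with the
__atomic_load_n builtin provided by GCC and Clang: the load is performed as
one indivisible access, but no memory barrier is implied. The helper name
sketch_read_cpu_id() is hypothetical, and the pointer is assumed to refer to
the cpu_id (or cpu_id_start) field of some thread's registered struct rseq.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: reads a CPU number cache field with relaxed
 * atomicity. The value is never torn, but the load is not ordered with
 * respect to any other load or store.
 */
uint32_t sketch_read_cpu_id(uint32_t *cpu_id)
{
    return __atomic_load_n(cpu_id, __ATOMIC_RELAXED);
}
```

Because no ordering is provided, a reader in another thread can observe a
CPU number that is already out of date, but always a complete value.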
>
>>
>> RETURN VALUE
>> A return value of 0 indicates success. On error, -1 is returned, and
>> errno is set appropriately.
>>
>> ERRORS
>> EBUSY Restartable sequence is already registered for this thread.
>>
>> EFAULT rseq is an invalid address.
>>
>> EINVAL Either flags contains an invalid value, or rseq contains an
>> address which is not appropriately aligned, or rseq_len contains
>> a size that does not match the size received on registration.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | The last case "rseq_len contains a size that does
>> | not match the size received on registration" can
>> | occur only on RSEQ_FLAG_UNREGISTER, right?
>> +------------------------------------------------------
>>
>> ENOSYS The rseq() system call is not implemented by this kernel.
>>
>> EPERM The sig argument on unregistration does not match the signature
>> received on registration.
>>
>> VERSIONS
>> The rseq() system call was added in Linux 4.18.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | What is the current state of library support?
>> +------------------------------------------------------
>
> After going through a few RFC rounds, it's been posted as non-rfc a
> few weeks ago. It is pending review from glibc maintainers. I currently
> aim for inclusion of the rseq TLS registration by glibc for glibc 2.30:
>
> https://sourceware.org/ml/libc-alpha/2019-02/msg00317.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00320.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00319.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00318.html
> https://sourceware.org/ml/libc-alpha/2019-02/msg00321.html
>
> Note that the C library will define a user-space ABI which states how
> applications/libraries wishing to register the rseq TLS need to behave so
> they are compatible with the C library when it gets updated to a new version
> providing rseq registration support. It seems like an important point to
> document, perhaps even here in the rseq(2) man page.
>
>
>>
>> CONFORMING TO
>> rseq() is Linux-specific.
>>
>> +------------------------------------------------------
>> | FIXME
>> +------------------------------------------------------
>> | Is there any example code that can reasonably be
>> | included in this manual page? Or some example code
>> | that can be referred to?
>> +------------------------------------------------------
>>
>
> The per-cpu counter example we have here seems compact enough:
>
> https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/
>
> Thanks,
>
> Mathieu
>
>
>> SEE ALSO
>> sched_getcpu(3), membarrier(2)
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com