Re: [RFC PATCH 1/2] rseq: Implement KTLS prototype for x86-64
From: Mathieu Desnoyers
Date: Mon Sep 28 2020 - 13:29:56 EST
----- On Sep 28, 2020, at 11:13 AM, Florian Weimer fweimer@xxxxxxxxxx wrote:
> * Mathieu Desnoyers:
>
>> Upstreaming efforts aiming to integrate rseq support into glibc led to
>> interesting discussions, where we identified a clear need to extend the
>> size of the per-thread structure shared between kernel and user-space
>> (struct rseq). This is something that is not possible with the current
>> rseq ABI. The fact that the current non-extensible rseq kernel ABI
>> would also prevent glibc's ABI from being extended prevents its integration
>> into glibc.
>>
>> Discussions with glibc maintainers led to the following design, which we
>> are calling "Kernel Thread Local Storage" or KTLS:
>>
>> - at glibc library init:
>>   - glibc queries the size and alignment of the KTLS area supported by the
>>     kernel,
>>   - glibc reserves the memory area required by the kernel for main
>>     thread,
>>   - glibc registers the offset from thread pointer where the KTLS area
>>     will be placed for all threads belonging to the threads group which
>>     are created with clone3 CLONE_RSEQ_KTLS,
>> - at nptl thread creation:
>>   - glibc reserves the memory area required by the kernel,
>> - application/libraries can query glibc for the offset/size of the
>>   KTLS area, and offset from the thread pointer to access that area.
>
> One remaining challenge I see is that we want to use vDSO functions to
> abstract away the exact layout of the KTLS area. For example, there are
> various implementation strategies for getuid optimizations, some of them
> exposing a shared struct cred in a thread group, and others not doing
> that.
>
> The vDSO has access to the thread pointer because it's ABI (something
> that we recently (and quite conveniently) clarified for x86). What it
> does not know is the offset of the KTLS area from the thread pointer.
> In the original rseq implementation, this offset could vary from thread
> to thread in a process, although the submitted glibc implementation did
> not use this level of flexibility and the offset is constant. The vDSO
> is not relocated by the run-time dynamic loader, so it can't use ELF TLS
> data.
In the context of this prototype, the KTLS offset is the same for all threads
belonging to a thread group.
>
> Furthermore, not all threads in a thread group may have an associated
> KTLS area. In a potential glibc implementation, only the threads
> created by pthread_create would have it; threads created directly using
> clone would lack it (and would not even run with a correctly set up
> userspace TCB).
Right.
>
> So we have a bootstrap issue here that needs to be solved, I think.
The one thing I'm not sure about is whether the vDSO interface is indeed
superior to KTLS, or if it is just the model we are used to.
AFAIU, the current use case for the vDSO is that an application calls into
glibc, which then calls the vDSO function exposed by the kernel. I wonder
whether the vDSO indirection is really needed if we typically have a glibc
function used as indirection? For an end user, what is the benefit of the vDSO
over accessing KTLS data directly from glibc?
If we decide that using KTLS from a vDSO function is indeed a requirement,
then, as you point out, the thread pointer is available as ABI, but we are
missing the KTLS offset.
Some ideas on how we could solve this: we could either make the KTLS
offset part of the ABI (fixed offset), or save the offset near the thread pointer
at a location that would become ABI. That location would have to already hold
a value that lets the vDSO detect that it is being called from a thread which
has not set up a KTLS area, though. Is that even remotely doable?
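To make the second idea more concrete, here is a rough sketch in C. Everything
in it is hypothetical and only for illustration: struct demo_tcb, the slot
position, the zero sentinel and the helper names are not part of the prototype
or of any existing ABI, and the thread pointer is passed in explicitly rather
than read from a register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: the slot reads 0 until a KTLS area is registered. */
#define KTLS_OFFSET_UNSET ((ptrdiff_t)0)

/* Hypothetical KTLS contents; the field is illustrative only. */
struct ktls_area {
	int32_t tid;
};

/*
 * Hypothetical thread control block. The first word, at a fixed position
 * relative to the thread pointer, holds the KTLS offset for this thread.
 */
struct demo_tcb {
	ptrdiff_t ktls_offset;
	char other_tcb_state[56];
	struct ktls_area ktls;
};

/* What the thread library would do when it sets up a thread. */
static void demo_register_ktls(struct demo_tcb *tcb)
{
	ptrdiff_t off = (ptrdiff_t)offsetof(struct demo_tcb, ktls);

	assert(off != KTLS_OFFSET_UNSET);	/* 0 is reserved for "no KTLS" */
	tcb->ktls_offset = off;
}

/*
 * What a vDSO function could do: read the slot through the thread pointer
 * and return NULL so the caller can fall back to a system call.
 */
static struct ktls_area *demo_ktls_from_tp(void *tp)
{
	ptrdiff_t off = *(const ptrdiff_t *)tp;

	if (off == KTLS_OFFSET_UNSET)
		return NULL;
	return (struct ktls_area *)((char *)tp + off);
}
```

Note that this only works if the slot is guaranteed to read as the sentinel in
threads that never registered a KTLS area. A thread created directly with clone
may not have a valid TCB at all, so the sketch does not solve that case by
itself.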
>
> In most cases, I would not be too eager to bypass the vDSO completely,
> and have the kernel expose a data-only interface. I could perhaps
> make an exception for the current TID because that's so convenient to
> use in mutex implementations, and errno.
Indeed, using a KTLS field to store errno is another use-case I forgot to
mention. That would make life easier for errno handling in vDSO as well.
> With the latter, we could
> directly expose the vDSO implementation to applications, assuming that
> we agree that the vDSO will not fail with ENOSYS to request fallback to
> the system call, but will itself perform the system call.
We should not forget the fields needed by rseq as well: the rseq_cs pointer and
the cpu_id fields need to be accessed directly from the rseq critical section,
without a function call. Those use-cases require that applications and libraries can
know the KTLS offset and size and use those fields directly. That being said,
there are certainly plenty of use-cases where it makes sense to use the KTLS
data through a vDSO, and only expose the vDSO interface, if the cost of the
extra vDSO call indirection is not prohibitive.
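As an illustration of the direct-access case, here is a minimal sketch in C of
an inline cpu_id read at a known offset from the thread pointer. The structure
layout is hypothetical (loosely modeled on the existing struct rseq fields),
struct demo_thread_block and ktls_read_cpu_id() are made-up names, and the
thread pointer and offset are plain parameters here. This is not an rseq
critical section, which would need to be written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the rseq-related KTLS fields. */
struct ktls_rseq_fields {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;	/* pointer to the critical section descriptor */
};

/* Inline code relies on the field position, so pin it at build time. */
static_assert(offsetof(struct ktls_rseq_fields, cpu_id) == 4,
	      "cpu_id must stay at a fixed offset");

/* Hypothetical per-thread block: TCB state followed by the KTLS area. */
struct demo_thread_block {
	char tcb[64];
	struct ktls_rseq_fields ktls;
};

/*
 * Read cpu_id with a single load relative to the thread pointer, with no
 * function call once inlined. The volatile access reflects that the kernel
 * may update the field at any time.
 */
static inline uint32_t ktls_read_cpu_id(const void *tp, ptrdiff_t ktls_offset)
{
	const volatile struct ktls_rseq_fields *k =
		(const volatile struct ktls_rseq_fields *)
			((const char *)tp + ktls_offset);

	return k->cpu_id;
}
```

A caller that has queried the offset once, for instance
offsetof(struct demo_thread_block, ktls) in this sketch, can then use the
result of ktls_read_cpu_id() to index per-CPU data.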
Thanks,
Mathieu
>
> Thanks,
> Florian
> --
> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com