Re: [RFC PATCH v6 1/5] Thread-local ABI system call: cache CPU number of running thread

From: Mathieu Desnoyers
Date: Mon Apr 04 2016 - 16:48:35 EST


----- On Apr 4, 2016, at 3:46 PM, Mathieu Desnoyers mathieu.desnoyers@xxxxxxxxxxxx wrote:

> ----- On Apr 4, 2016, at 1:11 PM, H. Peter Anvin hpa@xxxxxxxxx wrote:
>
>> On 04/04/16 10:01, Mathieu Desnoyers wrote:
>>>
>>> Changes since v5:
>>> - Rename "getcpu_cache" to "thread_local_abi", allowing to extend
>>> this system call to cover future features such as restartable critical
>>> sections. Generalizing this system call ensures that we can add
>>> features similar to the cpu_id field within the same cache-line
>>> without having to track one pointer per feature within the task
>>> struct.
>>> - Add a tlabi_nr parameter to the system call, thus allowing to extend
>>> the ABI beyond the initial 64-byte structure by registering structures
>>> with tlabi_nr greater than 0. The initial ABI structure is associated
>>> with tlabi_nr 0.
>>> - Rebased on kernel v4.5.
>>>
>>
>> This seems absolutely insanely complex, both for the kernel and for
>> userspace.
>>
>> A much saner way would be for userspace to query the kernel for the size
>> of the structure; userspace then allocates the maximum of what it knows
>> and what the kernel knows. That way, the kernel doesn't need to
>> conditionalize its accesses to user space, and libc doesn't need to
>> conditionalize its accesses either.
>
> If we go down the route of having user-space dynamically allocating
> the structure, my understanding is that we need to associate the
> user-space TLS symbol with a pointer to the structure, and test for
> NULL each time, thus requiring user-space to touch one more cache-line
> (read the pointer), and add one conditional per user-space fast-path,
> compared to a statically-sized definition approach. Or perhaps you have
> some clever trick in mind for "allocation by user-space" that I'm missing ?
>
> Besides the NULL pointer check, another issue is feature detection.
> As we extend the feature set, my proposal has a 32-bit features
> mask at the beginning of the TLS structure, within the same
> cache-line containing the structure fields, so user-space can quickly
> check whether the required feature is enabled (adds one conditional
> on the user-space fast path, but does not require to touch another
> cache-line). This allows adding new features without requiring to
> reserve the value "0" within each field of the structure to mean
> "feature unavailable", which I find terminally unaesthetic.
>
> I propose here a fixed-size 64 bytes layout for the first structure,
> for which a 32-bit feature mask should be enough. If we ever fill
> up these 64 bytes, we can then use the following tlabi_nr number (1),
> which will define its own structure size and feature mask. This
> seems like a good compromise between fast-path speed, feature detection
> flexibility, optimal use of cache-lines, and extensibility.

Moreover, the feature sets that the application, glibc, and the
kernel each know about are three different things. My intent here is
to have glibc stay out of the way as much as possible, since this is
really an interface between the various applications/libraries and
the kernel.

Even if glibc allocates a structure large enough for the union of
the features it knows about and the features the kernel implements,
the application could be built against kernel headers that expose
more features than glibc knows about. If the tlabi structure is
dynamically allocated, the application would therefore need a
structure length check, which adds a branch on the fast path.
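
To make that cost concrete, here is a rough sketch of what the
user-space fast path could look like with a dynamically allocated
area. Everything in it is hypothetical and only for illustration:
struct dyn_tlabi, its len and new_field members, dyn_tlabi_ptr and
dyn_read_new_field() are names I made up, not part of the patch or
of any glibc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical dynamically allocated layout (not from the patch).
 * "len" records how many bytes the allocator actually provided.
 */
struct dyn_tlabi {
	uint32_t len;		/* bytes allocated for this thread */
	int32_t cpu_id;		/* field every component knows about */
	uint64_t new_field;	/* field only newer headers know about */
};

static_assert(offsetof(struct dyn_tlabi, new_field) == 8,
	      "sketch assumes new_field starts at byte 8");

/* Per-thread pointer, NULL until some component registers an area. */
static _Thread_local struct dyn_tlabi *dyn_tlabi_ptr;

/*
 * Fast path for a field the allocator may not know about:
 * one extra pointer load, a NULL check, and a length check.
 * Returns 0 and stores the value in *out, or -1 if unavailable.
 */
static inline int dyn_read_new_field(uint64_t *out)
{
	struct dyn_tlabi *t = dyn_tlabi_ptr;	/* extra load */

	if (t == NULL)				/* branch 1 */
		return -1;
	if (t->len < offsetof(struct dyn_tlabi, new_field)
			+ sizeof(t->new_field))	/* branch 2 */
		return -1;
	*out = t->new_field;
	return 0;
}
```

The two early returns are the NULL check and the structure length
check discussed above. Both sit in front of every access to a field
that the allocating component might not know about.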

A statically-sized structure allows applications and libraries to
skip the pointer load, the NULL check, and the structure length check
on the user-space fast path.
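
As an illustration, a minimal sketch of the statically-sized variant
follows. It assumes a 64-byte TLS structure that starts with the
32-bit features mask, as described earlier in this thread. The names
struct tlabi_sketch, TLABI_SKETCH_FEATURE_CPU_ID, tlabi_area and
static_read_cpu_id() are placeholders of mine, and the exact field
layout is an assumption rather than the layout defined by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature flag name, for illustration only. */
#define TLABI_SKETCH_FEATURE_CPU_ID	(1U << 0)

/*
 * Assumed fixed-size layout: features mask first, then the fields,
 * all within one 64-byte, cache-line-aligned structure.
 */
struct tlabi_sketch {
	_Alignas(64) uint32_t features;	/* mask of enabled features */
	int32_t cpu_id;			/* cached CPU number */
	uint8_t reserved[56];		/* room for future fields */
};

static_assert(sizeof(struct tlabi_sketch) == 64,
	      "sketch assumes a fixed 64-byte structure");

/*
 * Statically allocated TLS area. In the real scheme the kernel would
 * fill it in after registration; here it simply starts zeroed.
 */
static _Thread_local struct tlabi_sketch tlabi_area;

/*
 * Fast path: a single feature test on the same cache line as the
 * field, with no pointer load, NULL check, or length check.
 * Returns the cached CPU number, or -1 if the feature is not enabled.
 */
static inline int32_t static_read_cpu_id(void)
{
	if (!(tlabi_area.features & TLABI_SKETCH_FEATURE_CPU_ID))
		return -1;
	return tlabi_area.cpu_id;
}
```

The only conditional left is the feature test, and the mask it reads
shares a cache line with cpu_id, so no additional line is touched.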

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com