Re: [RFC PATCH 1/2] thread_local_abi system call: caching current CPU number (x86)

From: Mathieu Desnoyers
Date: Sun Dec 13 2015 - 14:58:54 EST


----- On Dec 13, 2015, at 1:15 PM, Andi Kleen andi@xxxxxxxxxxxxxx wrote:

>> This getcpu cache is an alternative to the sched_getcpu() vdso which has
>> a few benefits:
>
>
> Note the first version of getcpu() I proposed had a cache. But it was
> rejected.
>
>> - It is faster to do a memory read than to call a vDSO,
>> - This cached value can be read from within an inline assembly, which
>> makes it a useful building block for restartable sequences.
>
> On x86 we already have the de-facto ABI of using LSL with the magic
> segment directly. While that is a few cycles slower than a memory load
> I question the difference is big enough to justify a new system call,
> and risk slow page fault in context switches.

In the context of restartable sequences [1] [2], the goal is to turn
atomic operations on per-cpu data into a sequence of simple load/store
operations. Therefore, improving getcpu from 12ns to 0.3ns will have a
significant impact there. Restartable sequences will be used in memory
allocators, the userspace RCU read side, and tracing fast paths, where we
can expect significant speedups even from saving those few cycles per call.

Moreover, AFAIU, restartable sequences cannot do the function call
required by the vDSO while within the critical section: the sequence
needs to fit entirely within an inline assembly. So this CPU number
caching actually enables restartable sequences, whereas the vDSO approach
cannot be used in that context.
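
To make the difference concrete, here is a minimal userspace sketch of
what reading the cached value could look like. It is only an illustration:
the struct layout, the names cpu_cache, read_current_cpu() and
fake_kernel_update() are made up for this example and are not the ABI
proposed by the patch. The point is that the fast path is a single load
from thread-local memory, which can be expressed inside an inline assembly,
whereas the fallback is a function call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical per-thread area that the kernel would keep up to date
 * once registered. The layout is an assumption for illustration only. */
struct cpu_cache {
    volatile int32_t cpu_id;    /* -1 means "not registered / unknown" */
};

static __thread struct cpu_cache cpu_cache = { .cpu_id = -1 };

/* Fast path: one load from thread-local memory.
 * Slow path: fall back to sched_getcpu() when no cached value exists. */
static inline int32_t read_current_cpu(void)
{
    int32_t cpu = cpu_cache.cpu_id;

    if (cpu >= 0)
        return cpu;
    return (int32_t)sched_getcpu();
}

/* Stand-in for the update the kernel would perform on the thread's
 * behalf; real code would never write this field from userspace. */
static inline void fake_kernel_update(int32_t cpu)
{
    assert(cpu >= 0);
    cpu_cache.cpu_id = cpu;
}
```

A caller would simply use read_current_cpu() wherever it calls
sched_getcpu() today; the fallback keeps it working on kernels without
the cache.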

Regarding your concern about slow page faults in context switches, this
updated patch takes care of it: the context switch is only setting
TIF_NOTIFY_RESUME, which lets the cache value update be performed on
return to userspace.

Finally, even if overall this new system call is not deemed sufficiently
interesting on x86, other popular architectures such as ARM32 don't have
any vDSO for getcpu at the moment, mainly because they don't have similar
segment selector tricks, and I'm not aware of any solution other than caching
the CPU value for those architectures. So we might very well end up having
to implement this system call for other architectures anyway.

>
> BTW the vdso could be also optimized I think. For example glibc today
> does some stupid (slow) things with it, like doing double indirect
> jumps.

I suspect that most of the difference between the vDSO approach and
CPU number caching is simply the function call required for the vDSO.
I doubt there is much to be done on this front.

Thanks,

Mathieu

[1] https://lwn.net/Articles/664645/
[2] https://lkml.org/lkml/2015/10/27/1095

>
> -Andi

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com