Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread

From: Mathieu Desnoyers
Date: Thu Feb 25 2016 - 11:55:37 EST


----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

> On Tue, Feb 23, 2016 at 06:28:36PM -0500, Mathieu Desnoyers wrote:
>> This approach is inspired by Paul Turner and Andrew Hunter's work
>> on percpu atomics, which lets the kernel handle restart of critical
>> sections. [1] [2]
>
> So I'd like a few extra words on the intersection with that work.
>
> Yes, that also needs a CPU number, but that needs a little extra as
> well. Can this work be extended to provide the little extra and is the
> getcpu name still sane in that case?
>
> Alternatively, could you not, at equal speed, get the CPU number from
> the restartable sequence data?
>
> That is, do explain why we want both.

Paul Turner's percpu atomics (restartable sequences) allow
replacing the atomic instructions (e.g. LOCK; cmpxchg on x86)
used to update user-space per-cpu data with a sequence of
instructions that ends with a single commit instruction. The
primary use-case is implementing efficient memory allocators
with per-cpu memory pools (rather than global or per-thread
pools).

This is made possible by collaboration between the kernel and
user-space: user-space marks the boundaries of the "rseq"
critical section, and the kernel moves the instruction pointer
to a restart address (also published by user-space) if it
preempts the thread, migrates it, or delivers a signal to it
within that critical section.
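
To make the restart rule concrete, here is a rough model of the
decision described above. It is only a sketch: the descriptor
layout, the field names and the function name are made up for
illustration and are not the ABI of any posted patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical descriptor published by user-space for one critical
 * section. Field names are illustrative, not an actual ABI.
 */
struct rseq_cs_sketch {
	uintptr_t start_ip;       /* first instruction of the section */
	uintptr_t post_commit_ip; /* first instruction after the commit */
	uintptr_t restart_ip;     /* where to resume after an interruption */
};

/*
 * Models the decision taken when the thread is preempted, migrated,
 * or about to receive a signal: if the interrupted instruction
 * pointer lies in [start_ip, post_commit_ip), the commit has not
 * executed yet, so execution resumes at restart_ip. Otherwise the
 * instruction pointer is left untouched.
 */
uintptr_t rseq_sketch_fixup_ip(const struct rseq_cs_sketch *cs, uintptr_t ip)
{
	if (cs == NULL)
		return ip;	/* no critical section registered */
	assert(cs->start_ip < cs->post_commit_ip);
	if (ip >= cs->start_ip && ip < cs->post_commit_ip)
		return cs->restart_ip;
	return ip;
}
```

An instruction pointer at or past post_commit_ip is left alone in
this model, because the commit instruction has already executed
and its store must not be replayed.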

The benefit of those restartable sequences over atomic instructions
is that it is much faster to execute a sequence of simple non-atomic
instructions (e.g. load, test, cond. branch, store) than a single
atomic instruction.
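
As an illustration of the two instruction shapes being compared,
the sketch below bumps a bounded per-cpu pool counter both ways.
The function names and the counter are hypothetical. Note that
the plain variant is only safe if the kernel restarts it on
preemption, migration or signal delivery, which this user-space
sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomic variant: a compare-and-swap loop (LOCK; cmpxchg on x86). */
int pool_push_count_atomic(_Atomic intptr_t *count, intptr_t limit)
{
	intptr_t old;

	assert(count != NULL);
	old = atomic_load_explicit(count, memory_order_relaxed);
	do {
		if (old >= limit)
			return -1;	/* pool full */
		/* On failure, old is reloaded and the limit is rechecked. */
	} while (!atomic_compare_exchange_weak_explicit(count, &old, old + 1,
			memory_order_relaxed, memory_order_relaxed));
	return 0;
}

/*
 * Restartable-sequence shape: simple non-atomic instructions ending
 * with a single store. ASSUMPTION: the kernel restarts this sequence
 * from the top if it is interrupted before the store.
 */
int pool_push_count_plain(intptr_t *count, intptr_t limit)
{
	intptr_t old;

	assert(count != NULL);
	old = *count;		/* load */
	if (old >= limit)	/* test + conditional branch */
		return -1;	/* pool full */
	*count = old + 1;	/* store: the single commit instruction */
	return 0;
}
```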

The restartable sequences are intrinsically designed to work
on per-cpu data, so they need to fetch the current CPU number
within the rseq critical section. This is where the getcpu_cache
system call becomes very useful when combined with rseq:
getcpu_cache allows reading the current CPU number in a
fraction of a cycle.
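
Roughly, the user-space side boils down to a plain load from a
thread-local variable, followed by indexing into per-cpu data.
The sketch below is an assumption-laden model, not the proposed
ABI: the variable, the helper names and SKETCH_NR_CPUS are made
up, and sketch_kernel_set_cpu() merely stands in for the kernel
updating the registered address on migration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 8	/* hypothetical upper bound */

/* Thread-local cache of the CPU number; -1 means "not set yet". */
static _Thread_local volatile int32_t sketch_cpu_cache = -1;

/* Hypothetical per-cpu data, one slot per CPU. */
intptr_t sketch_percpu_count[SKETCH_NR_CPUS];

/* Stand-in for the kernel writing the CPU number on migration. */
void sketch_kernel_set_cpu(int32_t cpu)
{
	assert(cpu >= 0);
	sketch_cpu_cache = cpu;
}

/* Fast path: a single load, no system call. */
int32_t sketch_read_cached_cpu(void)
{
	return sketch_cpu_cache;
}

/* Returns the slot of the current CPU, or NULL if it is unknown. */
intptr_t *sketch_this_cpu_slot(void)
{
	int32_t cpu = sketch_read_cached_cpu();

	if (cpu < 0 || cpu >= SKETCH_NR_CPUS)
		return NULL;
	return &sketch_percpu_count[cpu];
}
```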

However, there are other use-cases for having a fast mechanism
for reading the current CPU number, besides restartable sequences.
For instance, it can be used by glibc to implement a faster
sched_getcpu. Therefore, implementing getcpu_cache as its own
system call makes sense: an architecture could very well just
introduce getcpu_cache even if it cannot support restartable
sequences for some reason. Also, a kernel configuration can
enable getcpu_cache (since it adds no overhead to the scheduler
context-switch path, only to migration) without enabling
restartable sequences.
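
For the sched_getcpu case, a library could try the cache first
and fall back to its existing implementation when no cache is
available. This is only a sketch of that idea, not glibc code:
all names are hypothetical, and sketch_slow_getcpu() is a stub
standing in for the existing, slower path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts slow-path calls, so the fast path can be observed. */
int sketch_slow_calls;

/* Stub standing in for the existing (slower) implementation. */
int sketch_slow_getcpu(void)
{
	sketch_slow_calls++;
	return 0;	/* the stub always reports CPU 0 */
}

/*
 * Returns the cached CPU number when a cache is registered and
 * populated (ASSUMPTION: a negative value means "not populated"),
 * otherwise falls back to slow_path().
 */
int32_t sketch_fast_getcpu(const volatile int32_t *cache,
			   int (*slow_path)(void))
{
	assert(slow_path != NULL);
	if (cache != NULL) {
		int32_t cpu = *cache;	/* single load */

		if (cpu >= 0)
			return cpu;
	}
	return (int32_t)slow_path();
}
```

With this shape, an architecture or kernel configuration without
the cache simply passes NULL and keeps today's behavior.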

The main reason I decided to start working on getcpu_cache
is that I noticed that the restartable sequences system
call originally proposed by Paul Turner was trying to accomplish
too much at once: both handling of restartable sequences, and
quickly reading the current CPU number. My thinking is that
the issue of reading the current CPU number could be completely
taken out of the rseq picture by having rseq rely on the
address registered by getcpu_cache to read the CPU number.
This would therefore simplify the implementation of rseq,
and allow us to focus the rseq review discussions without
being side-tracked by the simpler problem of quickly reading
the current CPU number.

>
> (And remind Paul to keep pushing that)

Indeed, I look forward to Paul's feedback on my review of his
latest patchset round. Hopefully this getcpu_cache work will
allow us to better focus the discussions on the rseq work.

Is the explanation above OK with you? I'll add it to the
Changelog in v5 of the getcpu_cache series if so.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com