Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread

From: Peter Zijlstra
Date: Fri Feb 26 2016 - 06:33:18 EST

On Thu, Feb 25, 2016 at 05:17:51PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 25, 2016, at 12:04 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
> > On Thu, Feb 25, 2016 at 04:55:26PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Feb 25, 2016, at 4:56 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
> >> The restartable sequences are intrinsically designed to work
> >> on per-cpu data, so they need to fetch the current CPU number
> >> within the rseq critical section. This is where the getcpu_cache
> >> system call becomes very useful when combined with rseq:
> >> getcpu_cache allows reading the current CPU number in a
> >> fraction of a cycle.
> >
> > Yes yes, I know how restartable sequences work.
> >
> > But what I worry about is that they want a cpu number and a sequence
> > number, and for performance it would be very good if those live in the
> > same cacheline.
> >
> > That means either getcpu needs to grow a seq number, or restartable
> > sequences need to _also_ provide the cpu number.
>
> If we plan things well, we could have both the cpu number and the
> seqnum in the same cache line, registered by two different system
> calls. It's up to user-space to organize those two variables
> to fit within the same cache-line.

I feel this is more fragile than needed. Why not do a single system call
that does both?
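
To make the layout concern concrete, here is a minimal user-space
sketch of one per-thread area holding both values. The names
(percpu_thread_area, cpu_id, seqnum, get_thread_area) and the 64-byte
line size are assumptions for illustration only; they are not taken
from either patch set.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, typical on x86-64 but not universal. */
#define CACHE_LINE_SIZE 64

/*
 * Hypothetical layout: both values live in one cache-line-aligned
 * structure, so one registration covers both and they cannot end up
 * in different cache lines.
 */
struct percpu_thread_area {
	alignas(CACHE_LINE_SIZE) volatile int32_t cpu_id;
	volatile uint32_t seqnum;
};

/* The aligned first member pads the structure to one full line. */
static_assert(sizeof(struct percpu_thread_area) == CACHE_LINE_SIZE,
	      "area must occupy exactly one cache line");

/* One area per thread; its address is what would be registered. */
static _Thread_local struct percpu_thread_area thread_area;

static struct percpu_thread_area *get_thread_area(void)
{
	return &thread_area;
}

/* Return 1 if both addresses fall within the same cache line. */
static int same_cache_line(const volatile void *a, const volatile void *b)
{
	return (uintptr_t)a / CACHE_LINE_SIZE ==
	       (uintptr_t)b / CACHE_LINE_SIZE;
}
```

With a single area like this, same_cache_line(&area->cpu_id,
&area->seqnum) holds by construction instead of depending on how
user-space arranged two independently registered addresses.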

> getcpu_cache GETCPU_CACHE_SET operation takes the address where
> the CPU number should live as input.
> rseq system call could do the same for the seqnum address.

So I really don't like that; it means we have to track more kernel
state: we have to carry two pointers instead of one, we have to have
more update functions, etc.

That just increases the total overhead of all of this.
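
A rough model of the bookkeeping difference, written as plain
user-space C rather than kernel code. All names here (split_regs,
merged_reg, combined_area and the update helpers) are hypothetical,
and a real implementation would write through user pointers with the
usual user-access helpers instead of dereferencing them directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two independent registrations: two pointers carried per thread. */
struct split_regs {
	int32_t *cpu_ptr;	/* address given to the cpu-number call */
	uint32_t *seq_ptr;	/* address given to the seqnum call */
};

/* One registration: a single pointer to an area holding both fields. */
struct combined_area {
	int32_t cpu_id;
	uint32_t seqnum;
};

struct merged_reg {
	struct combined_area *area;
};

static_assert(sizeof(struct merged_reg) < sizeof(struct split_regs),
	      "merged registration carries less per-thread state");

/* Split case: each registration needs its own update function. */
static void split_update_cpu(struct split_regs *r, int32_t cpu)
{
	if (r->cpu_ptr)
		*r->cpu_ptr = cpu;
}

static void split_update_seq(struct split_regs *r)
{
	if (r->seq_ptr)
		(*r->seq_ptr)++;
}

/* Merged case: one update function touches one area. */
static void merged_update(struct merged_reg *r, int32_t cpu)
{
	if (!r->area)
		return;
	r->area->cpu_id = cpu;
	r->area->seqnum++;
}
```

In this model the split variant needs two NULL checks and two update
paths per event, where the merged variant needs one of each.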

> The question becomes: how do we introduce this to user-space,
> considering that only a single address per thread is allowed
> for each of getcpu_cache and rseq ?
> If both CPU number and seqnum are centralized in a TLS within
> e.g. glibc, that would be OK, but if we intend to allow libraries
> or applications to directly register their own getcpu_cache
> address and/or rseq, we may end up in situations where we have
> to fall back on using two different cache-lines. But how much
> should we care about performance in cases where non-generic
> libraries directly use those system calls ?
> Thoughts ?

Yeah, not sure, but that is a separate problem. Both your proposed code
and the rseq code have this. Having them as separate system calls just
increases the number of ways you can do it wrong.