Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus

From: Stanislaw Gruszka
Date: Thu Sep 01 2016 - 06:11:19 EST


On Thu, Sep 01, 2016 at 11:49:06AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 11:27:42AM +0200, Stanislaw Gruszka wrote:
> > My previous commit:
> >
> > a1eb1411b4e4 ("sched/cputime: Improve scalability by not accounting thread group tasks pending runtime")
> >
> > helped to achieve good performance of SYS_times() and
> > SYS_clock_gettimes(CLOCK_PROCESS_CPUTIME_ID) on 64 bit architectures.
> > However taking task_rq_lock() when reading t->se.sum_exec_runtime on
> > 32 bit architectures still make those syscalls slow.
> >
> > The reason why we take the lock is to make 64bit sum_exec_runtime
> > variable consistent. While a inconsistency scenario is very very unlike,
> > I assume it still may happen at least on some 32 bit architectures.
> >
> > To protect the variable I introduced new seqcount lock. Performance
> > improvements on machine with 32 cores (32-bit cpus) measured by
> > benchmarks described in commit:
>
> No,.. running 32bit kernels on a machine with 32 cores is insane, full
> stop.

I agree with that. But I also run this the benchmark on 4 cores
armv7l and see good improvements there too.

> You're now making rather hot paths slower to benefit a rather slow path,
> that too is backwards.

Ok, you have right, I made update_curr() slower (a bit I think, since
this new seqcount primitive should be in the same cache line as other
things).

But do we don't care about inconsistency of accessing of 64 bit variable
on 32 bit processors (see patch 3) ? I know this is unlikely scenario
to get inconsistency, but I assume it's still possible, or not?

If not, I can get rid of read_sum_exec_runtime() and just read
sum_exec_runtime without task_rq_lock() protection on
thread_group_cputime() . That would make the benchmark happy.

Stanislaw