Re: [PATCH 08/37] cputime: Convert task/group cputime to nsecs

From: Frederic Weisbecker
Date: Sat Jan 28 2017 - 10:37:15 EST


On Sat, Jan 28, 2017 at 12:57:40PM +0100, Stanislaw Gruszka wrote:
> Hi Frederic and sorry for late comment.
>
> On Sun, Jan 22, 2017 at 07:19:44PM +0100, Frederic Weisbecker wrote:
> > Now that most cputime readers use the transition API which return the
> > task cputime in old style cputime_t, we can safely store the cputime in
> > nsecs. This will eventually make cputime statistics less opaque and more
> > granular. Back and forth convertions between cputime_t and nsecs in order
> > to deal with cputime_t random granularity won't be needed anymore.
> <snip>
> > - cputime_t utime;
> > - cputime_t stime;
> > + u64 utime;
> > + u64 stime;
> > unsigned long long sum_exec_runtime;
> > };
> <snip>
> > @@ -134,7 +134,7 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
> > int index;
> >
> > /* Add user time to process. */
> > - p->utime += cputime;
> > + p->utime += cputime_to_nsecs(cputime);
> > account_group_user_time(p, cputime);
> <snip>
> > +void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
> > {
> > *ut = p->utime;
> > *st = p->stime;
> > o }
>
> On 32 bit architectures 64bit store/load is not atomic and if not
> protected - 64bit variables can be mangled. I do not see any protection
> (lock) between utime/stime store and load in the patch and seems that
> {u/s}time store/load can be performed at the same time. Though problem
> is very very improbable it still can happen at least theoretically when
> lower and upper 32 bits are changed at the same time i.e. process
> {u,s}time become near to multiple of 2**32 nsec (aprox: 4sec) and
> 64bit {u,s}time is stored and loaded at the same time on different
> cpus. As said this is very improbable situation, but eventually could
> be possible on long lived processes.

"Improbable situation" doesn't appply to Linux. With millions (billion?)
of machines using it, a rare issue in the core turns into likely to happen
somewhere in the planet every second.

So it's definetly a race we want to consider. Note it goes beyond the scope
of this patchset as the issue was already there before since cputime_t can already
map to u64 on 32 bits systems upstream. But this patchset definetly extends
the issue on all 32 bits configs.

kcpustat has the same issue upstream. It's is made of u64 on all configs.

>
> BTW we have already similar problem with sum_exec_runtime. I posted
> some patches to solve the problem, but non of them was good:
> - https://lkml.org/lkml/2016/9/1/172
> this one slow down scheduler hot path's and Peter hates it.

Note u64_stats_sync handles the ifdeffery for this solution.

>
> - https://lkml.org/lkml/2016/9/6/305
> this one was fine for Peter, but I dislike it for taking
> task_rq_lock() and do not push it forward.

Yeah taking rq lock looks too much.

>
> I considering fixing problem of sum_exec_runtime possible mangling
> by using prev_sum_exec_runtime:
>
> u64 read_sum_exec_runtime(struct task_struct *t)
> {
> u64 ns, prev_ns;
>
> do {
> prev_ns = READ_ONCE(t->se.prev_sum_exec_runtime);
> ns = READ_ONCE(t->se.sum_exec_runtime);
> } while (ns < prev_ns || ns > (prev_ns + U32_MAX));
>
> return ns;
> }
>
> This should work based on fact that prev_sum_exec_runtime and
> sum_exec_runtime are not modified and stored at the same time, so only
> one of those variabled can be mangled. Though I need to think about
> correctnes of that a bit more.

I'm not sure that would be enough. READ_ONCE prevents from reordering by the
compiler but not by the CPU. You'd need memory barriers between reads and
writes of prev_ns and ns.

WRITE ns READ prev_ns
smp_wmb() smp_rmb()
WRITE prev_ns READ ns
smp_wmb() smp_rmb()

It seems to be the only way to make sure that at least one of the reads
(prev_ns or ns) is correct.

But then I think that seqlock through u64_stats_sync would be more efficient.

It's indeed very complicated to solve such issue without bothering the fast path.
Irqtime accounting uses a per cpu u64_stats_sync which could be extended to the
whole cputime accounting. I fear that would only solve the issue for kcpustat
though as task cputime would need a per task u64_stats_sync.

Thanks.