Re: [PATCH 08/37] cputime: Convert task/group cputime to nsecs

From: Frederic Weisbecker
Date: Mon Jan 30 2017 - 10:39:39 EST


On Mon, Jan 30, 2017 at 02:54:51PM +0100, Stanislaw Gruszka wrote:
> On Sat, Jan 28, 2017 at 04:28:13PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 28, 2017 at 12:57:40PM +0100, Stanislaw Gruszka wrote:
> > > On 32-bit architectures a 64-bit store/load is not atomic, and if not
> > > protected (by a lock), 64-bit variables can be mangled. I do not see any
> > > protection between the utime/stime store and load in the patch, and it
> > > seems that the {u,s}time store and load can be performed at the same
> > > time. Though the problem is very improbable, it can still happen, at
> > > least theoretically, when the lower and upper 32 bits change at the
> > > same time, i.e. the process {u,s}time gets near a multiple of 2^32 nsec
> > > (approx. 4.3 sec) and the 64-bit {u,s}time is stored and loaded at the
> > > same time on different cpus. As said, this is a very improbable
> > > situation, but it could eventually happen on long-lived processes.
> >
> > "Improbable situation" doesn't appply to Linux. With millions (billion?)
> > of machines using it, a rare issue in the core turns into likely to happen
> > somewhere in the planet every second.
> >
> > So it's definetly a race we want to consider. Note it goes beyond the scope
> > of this patchset as the issue was already there before since cputime_t can already
> > map to u64 on 32 bits systems upstream. But this patchset definetly extends
> > the issue on all 32 bits configs.
> >
> > kcpustat has the same issue upstream. It's is made of u64 on all configs.
>
> I would like to add what the possible consequences are if a value gets
> mangled. For sum_exec_runtime, utime and stime we could get wrong values
> from cpu-clock related syscalls like clock_gettime() or clock_nanosleep(),
> and cpu-clock timers like timer_create(CLOCK_PROCESS_CPUTIME_ID) could be
> triggered before or long after the expected time. For kcpustat this seems
> to mean wrong values read by procfs and 3 drivers (cpufreq, appldata,
> macintosh).

Yep, all agreed.

>
> > > I'm considering fixing the problem of possible sum_exec_runtime
> > > mangling by using prev_sum_exec_runtime:
> > >
> > > u64 read_sum_exec_runtime(struct task_struct *t)
> > > {
> > > 	u64 ns, prev_ns;
> > >
> > > 	do {
> > > 		prev_ns = READ_ONCE(t->se.prev_sum_exec_runtime);
> > > 		ns = READ_ONCE(t->se.sum_exec_runtime);
> > > 	} while (ns < prev_ns || ns > (prev_ns + U32_MAX));
> > >
> > > 	return ns;
> > > }
> > >
> > > This should work based on the fact that prev_sum_exec_runtime and
> > > sum_exec_runtime are not modified and stored at the same time, so only
> > > one of those variables can be mangled. Though I need to think about
> > > the correctness of that a bit more.
> >
> > I'm not sure that would be enough. READ_ONCE() prevents reordering by
> > the compiler but not by the CPU. You'd need memory barriers between the
> > reads and writes of prev_ns and ns.
>
> It will not be enough. This is _supposed_ to work based on the fact that
> sum_exec_runtime and prev_sum_exec_runtime are not written at the same
> time, i.e. only one variable can be mangled while the other one already
> sits in memory. However, "not written at the same time" is the weak part
> of the reasoning. Even though those variables are stored in different
> parts of the code (sum_exec_runtime in update_curr() and
> prev_sum_exec_runtime in set_next_entity()), we cannot assume the store
> of one variable is finished before the other one starts.

Right.

>
> > CPU 0 (writer)     CPU 1 (reader)
> >
> > WRITE ns           READ prev_ns
> > smp_wmb()          smp_rmb()
> > WRITE prev_ns      READ ns
> > smp_wmb()          smp_rmb()
> >
> > It seems to be the only way to make sure that at least one of the reads
> > (prev_ns or ns) is correct.

Well, reading that again, I'm not 100% sure about the correctness guarantee.
But it might work.
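
To make the intended barrier pairing concrete, here's a rough sketch
(hypothetical helpers, not the actual scheduler code, and with the same
correctness caveat as above):

/* Writer side, sketching the two scheduler paths involved: */

static void sketch_update_curr(struct sched_entity *se, u64 delta)
{
	/* update_curr() analogue: advance the running counter. */
	WRITE_ONCE(se->sum_exec_runtime, se->sum_exec_runtime + delta);
	smp_wmb();	/* order this store before any later prev_ns store */
}

static void sketch_set_next_entity(struct sched_entity *se)
{
	/* set_next_entity() analogue: snapshot the counter. */
	WRITE_ONCE(se->prev_sum_exec_runtime, se->sum_exec_runtime);
	smp_wmb();	/* order this store before any later ns store */
}

/* Reader side: retry until the pair of reads looks consistent. */
static u64 sketch_read_runtime(struct sched_entity *se)
{
	u64 ns, prev_ns;

	do {
		prev_ns = READ_ONCE(se->prev_sum_exec_runtime);
		smp_rmb();	/* pairs with the writer's smp_wmb()s */
		ns = READ_ONCE(se->sum_exec_runtime);
		smp_rmb();
	} while (ns < prev_ns || ns > prev_ns + U32_MAX);

	return ns;
}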

>
> I think you are right, but it seems on many code paths we have this scenario:
>
> CPU 0 (writer)     CPU 1 (reader)
>
> WRITE ns           READ prev_ns
> smp_wmb()          smp_rmb()
> WRITE prev_ns      READ ns
>
> and we already have an smp_wmb() after the write of sum_exec_runtime in
> update_min_vruntime().

You still need a second barrier after the second write and read (or
before the first write and read, which is the same thing) to ensure that
if you read a mangled version of ns, prev_ns is ok.

Still, I think u64_stats_sync is less trouble and more reliable.
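
For reference, a minimal sketch of the u64_stats_sync approach (the
struct and helpers here are made up for illustration; the API itself is
in include/linux/u64_stats_sync.h and compiles to a seqcount on 32-bit
SMP and to nothing on 64-bit):

#include <linux/u64_stats_sync.h>

struct cputime_stats {
	u64 utime;
	u64 stime;
	struct u64_stats_sync sync;	/* u64_stats_init() at init time */
};

/* Writer side: assumes a single updater at a time. */
static void account_time(struct cputime_stats *s, u64 ut, u64 st)
{
	u64_stats_update_begin(&s->sync);
	s->utime += ut;
	s->stime += st;
	u64_stats_update_end(&s->sync);
}

/* Reader side: loops again if it raced with an updater. */
static void read_time(struct cputime_stats *s, u64 *ut, u64 *st)
{
	unsigned int seq;

	do {
		seq = u64_stats_fetch_begin(&s->sync);
		*ut = s->utime;
		*st = s->stime;
	} while (u64_stats_fetch_retry(&s->sync, seq));
}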

Thanks.