Re: [BUG nohz]: wrong user and system time accounting

From: Wanpeng Li
Date: Wed Mar 29 2017 - 21:58:50 EST


2017-03-30 4:08 GMT+08:00 Rik van Riel <riel@xxxxxxxxxx>:
> On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote:
>> On Tue, 28 Mar 2017 13:24:06 -0400
>> Luiz Capitulino <lcapitulino@xxxxxxxxxx> wrote:
>>
>> > 1. In my tracing I'm seeing that sometimes (always?) the
>> > time interval between two timer interrupts is less than 1ms
>>
>> I think that's the root cause.
>>
>> In this trace, we see the following:
>>
>> 1. On CPU15, we transition from user-space to kernel-space because
>> of a timer interrupt (it's the tick)
>>
>> 2. vtime_delta() returns 0, because jiffies didn't change since the
>> last accounting
>>
>> 3. While CPU15 is executing in kernel-space, jiffies is updated
>> by CPU0
>>
>> 4. When going back to user-space, vtime_delta() returns non-zero
>> and the whole time is accounted for system time (observe how
>> the cputime parameter in account_system_time() is less than 1ms)
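
To make the four steps above concrete, here is a small user-space model of
tick-granularity accounting. It is only a sketch: struct vtime_model, the
model_*() helpers and simulate_aligned_ticks() are hypothetical names made up
for illustration, not the kernel's code, and the model assumes the local tick
always fires just before the timekeeping CPU bumps jiffies.

```c
#include <stdint.h>

/* Hypothetical model of tick-granularity vtime accounting (not kernel code). */
struct vtime_model {
	uint64_t jiffies;   /* global counter, advanced only by the timekeeping CPU */
	uint64_t snapshot;  /* jiffies value at this CPU's last accounting point */
	uint64_t user;      /* ticks charged to user time */
	uint64_t system;    /* ticks charged to system time */
};

/* Whole ticks elapsed since the last accounting point; resets the snapshot. */
static uint64_t model_delta(struct vtime_model *m)
{
	uint64_t delta = m->jiffies - m->snapshot;

	m->snapshot = m->jiffies;
	return delta;
}

/* user -> kernel transition: charge the elapsed ticks to user time. */
static void model_user_exit(struct vtime_model *m)
{
	m->user += model_delta(m);
}

/* kernel -> user transition: charge the elapsed ticks to system time. */
static void model_user_enter(struct vtime_model *m)
{
	m->system += model_delta(m);
}

/*
 * Run `ticks` periods in which the task spends almost the whole period in
 * user space, but the local tick interrupt arrives just before jiffies is
 * incremented by the other CPU.
 */
static struct vtime_model simulate_aligned_ticks(unsigned int ticks)
{
	struct vtime_model m = { 0 };

	for (unsigned int i = 0; i < ticks; i++) {
		model_user_exit(&m);   /* steps 1-2: tick interrupt, delta is 0 */
		m.jiffies++;           /* step 3: jiffies updated while in kernel */
		model_user_enter(&m);  /* step 4: delta is 1, charged to system */
	}
	return m;
}
```

In this model every tick ends up in the system counter and none in the user
counter, even though the modelled task is almost always in user space.
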
>
> In other words, the tick on cpu0 is aligned
> with the tick on the nohz_full cpus, and
> jiffies is advanced while the nohz_full cpus
> with an active tick happen to be in kernel
> mode?
>
> Frederic, can you think of any reason why
> the tick on nohz_full CPUs would end up aligned
> with the tick on cpu0, instead of running at some
> random offset?
>
> A random offset, or better yet a somewhat randomized
> tick length to make sure that simultaneous ticks are
> fairly rare and the vtime sampling does not end up
> "in phase" with the jiffies incrementing, could make
> the accounting work right again.
>
> Of course, that assumes the above hypothesis is correct :)

A tick-offset feature along these lines already exists: skew_tick. See
commit 5307c9556bc (tick: add tick skew boot option). With the
skew_tick=1 boot parameter the bug disappears; however, the commit also
mentions that it will hurt power consumption. I will try Frederic's
proposal, which is similar to my original idea: "How bad would it be to
revert to sched_clock() instead of jiffies in vtime_delta()? We could use
nanosecond granularity to check deltas but only perform an actual
cputime update when that delta >= TICK_NSEC."
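
A rough user-space sketch of that idea follows. It is one possible reading,
not a patch: the per-mode pending accumulators are my assumption about how
to keep sub-tick deltas from being lost, the clock value is passed in
explicitly instead of calling sched_clock(), MODEL_TICK_NSEC assumes HZ=1000,
and all names (struct ns_vtime_model, ns_*(), simulate_ns_accounting()) are
hypothetical.

```c
#include <stdint.h>

#define MODEL_TICK_NSEC 1000000ULL /* assumes HZ=1000; illustrative only */

/* Hypothetical model of nanosecond-granularity vtime accounting. */
struct ns_vtime_model {
	uint64_t last_ns;           /* clock value at the last transition */
	uint64_t pending_user_ns;   /* user time seen but not yet flushed */
	uint64_t pending_system_ns; /* system time seen but not yet flushed */
	uint64_t user_ns;           /* flushed user cputime */
	uint64_t system_ns;         /* flushed system cputime */
};

/* Nanoseconds elapsed since the last transition; resets the timestamp. */
static uint64_t ns_delta(struct ns_vtime_model *m, uint64_t now_ns)
{
	uint64_t delta = now_ns - m->last_ns;

	m->last_ns = now_ns;
	return delta;
}

/* Only perform the cputime update once at least one tick has accumulated. */
static void flush_if_due(uint64_t *pending, uint64_t *total)
{
	if (*pending >= MODEL_TICK_NSEC) {
		*total += *pending;
		*pending = 0;
	}
}

/* user -> kernel transition: the elapsed time was spent in user space. */
static void ns_user_exit(struct ns_vtime_model *m, uint64_t now_ns)
{
	m->pending_user_ns += ns_delta(m, now_ns);
	flush_if_due(&m->pending_user_ns, &m->user_ns);
}

/* kernel -> user transition: the elapsed time was spent in the kernel. */
static void ns_user_enter(struct ns_vtime_model *m, uint64_t now_ns)
{
	m->pending_system_ns += ns_delta(m, now_ns);
	flush_if_due(&m->pending_system_ns, &m->system_ns);
}

/* Each period: 990us in user space, then 10us in the kernel for the tick. */
static struct ns_vtime_model simulate_ns_accounting(unsigned int periods)
{
	struct ns_vtime_model m = { 0 };
	uint64_t now_ns = 0;

	for (unsigned int i = 0; i < periods; i++) {
		now_ns += 990000;
		ns_user_exit(&m, now_ns);
		now_ns += 10000;
		ns_user_enter(&m, now_ns);
	}
	return m;
}
```

With 1000 such periods the model charges 990ms to user time and 10ms to
system time, independent of when jiffies happens to be updated.
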

Regards,
Wanpeng Li