Re: [BUG nohz]: wrong user and system time accounting

From: Wanpeng Li
Date: Wed Mar 29 2017 - 18:46:48 EST


2017-03-30 6:17 GMT+08:00 Frederic Weisbecker <fweisbec@xxxxxxxxx>:
> On Wed, Mar 29, 2017 at 01:16:56PM -0400, Luiz Capitulino wrote:
>> On Tue, 28 Mar 2017 13:24:06 -0400
>> Luiz Capitulino <lcapitulino@xxxxxxxxxx> wrote:
>>
>> > 1. In my tracing I'm seeing that sometimes (always?) the
>> > time interval between two timer interrupts is less than 1ms
>>
>> I think that's the root cause.
>>
>> I'm getting traces like this:
>>
>> hog-11980 [015] 341.494491: function: enter_from_user_mode <-- apic_timer_interrupt
>> <idle>-0 [000] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt
>> hog-11980 [015] 341.494492: function: __context_tracking_exit <-- enter_from_user_mode
>> <idle>-0 [000] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt
>> hog-11980 [015] 341.494492: bprint: vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339)
>> hog-11980 [015] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt
>> hog-11980 [015] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt
>> hog-11980 [015] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues
>> <idle>-0 [000] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues
>> <idle>-0 [000] 341.494493: function: tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
>> <idle>-0 [000] 341.494494: function: do_timer <-- tick_do_update_jiffies64.part.14
>> hog-11980 [015] 341.494494: function: irq_exit <-- smp_apic_timer_interrupt
>> <idle>-0 [000] 341.494494: bprint: do_timer: updated jiffies_64=4295008340 ticks=1
>> hog-11980 [015] 341.494494: function: __context_tracking_enter <-- prepare_exit_to_usermode
>> hog-11980 [015] 341.494494: function: vtime_user_enter <-- __context_tracking_enter
>> hog-11980 [015] 341.494495: bprint: vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339)
>> hog-11980 [015] 341.494495: function: __vtime_account_system <-- vtime_user_enter
>> hog-11980 [015] 341.494495: bprint: get_vtime_delta: vtime_snap=4295008339 now=4295008340
>> hog-11980 [015] 341.494495: function: account_system_time <-- __vtime_account_system
>> hog-11980 [015] 341.494495: bprint: account_system_time: cputime=995488
>> <idle>-0 [000] 341.494497: function: irq_exit <-- smp_apic_timer_interrupt
>>
>> In this trace, we see the following:
>>
>> 1. On CPU15, we transition from user-space to kernel-space because
>> of a timer interrupt (it's the tick)
>>
>> 2. vtimer_delta() returns 0, because jiffies didn't change since the
>> last accounting
>>
>> 3. While CPU15 is executing in kernel-space, jiffies is updated
>> by CPU0
>>
>> 4. When going back to user-space, vtime_delta() returns non-zero
>> and the whole time is accounted for system time (observe how
>> the cputime parameter in account_system_time() is less than 1ms)
>
> Aah, so the issue can indeed happen if all CPUs fire their ticks at the same time:
>
>
> CPU 0 CPU 1
> ----- -----
> exit_user() // no cputime update
> tick X update_jiffies
> enter_user() // cputime update
>
>
> exit_user() //no cputime update
> tick X+1 update_jiffies
> enter_user() // cputime update
>
>>
>> That's why my patch from yesterday fixed the issue, it increased the
>> tick period to more than 1ms. So vtime_delta() always evaluate to true
>> when transitioning from user-space to kernel-space (because we spend
>> more than 1ms in user-space between ticks). The patch below achieves
>> the same result by adding 10us to the tick period.
>>
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 7fe53be..00e46df 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -1165,7 +1165,7 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
>> if (unlikely(ts->tick_stopped))
>> return HRTIMER_NORESTART;
>>
>> - hrtimer_forward(timer, now, tick_period);
>> + hrtimer_forward(timer, now, tick_period + 10000);
>
> I'm surprised it works though. If the 10us shift was only applied to CPU 0 and not the
> others then yes, but if it is applied to all CPUs, the ticks stay synchronized and the
> problem should stay...
>
> Ah wait! It can work because the nohz_full CPUs have their ticks sometimes scheduled
> by tick_nohz_stop_sched_tick() or tick_nohz_restart_sched_tick() which don't have the
> 10us shift. So a drift happens everytime the nohz_full CPUs have their tick stopped.
>
>> Now, why is the tick ticking at less than 1ms? I think it's the time
>> difference between "now" (that we pass to hrtimer_forward()) and the
>> time the timer hardware is actually programmed. That should account
>> for a few microseconds.
>
> Right, that's my feeling. And if it is the case, then it shouldn't matter.
>
> So! Now we need to find a proper fix :o)
>
> Hmm, how bad would it be to revert to sched_clock() instead of jiffies in vtime_delta()?
> We could use nanosecond granularity to check deltas but only perform an actual cputime update
> when that delta >= TICK_NSEC. That should keep the load ok.

Yeah, I mentioned something similar before.
https://lkml.org/lkml/2017/3/26/138 However, Rik's commit optimized
syscalls by not utilize sched_clock(), so if we should distinguish
between syscalls/exceptions and irqs?

Regards,
Wanpeng Li