Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling

From: Luiz Capitulino
Date: Fri Apr 07 2017 - 12:58:16 EST


On Wed, 5 Apr 2017 23:36:24 -0700
Wanpeng Li <kernellwp@xxxxxxxxx> wrote:

> From: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
>
> This can be reproduced by setting up a CPU for nohz_full and pin two tasks
> that hog the CPU 100% of the time to that CPU, top reports 70% system time
> and 30% user time utilization sometimes. The trace like this:
>
> hog-11980 [015] 341.494491: function: enter_from_user_mode <-- apic_timer_interrupt
> <idle>-0 [000] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt
> hog-11980 [015] 341.494492: function: __context_tracking_exit <-- enter_from_user_mode
> <idle>-0 [000] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt
> hog-11980 [015] 341.494492: bprint: vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339)
> hog-11980 [015] 341.494492: function: smp_apic_timer_interrupt <-- apic_timer_interrupt
> hog-11980 [015] 341.494492: function: irq_enter <-- smp_apic_timer_interrupt
> hog-11980 [015] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0 [000] 341.494493: function: tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0 [000] 341.494493: function: tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
> <idle>-0 [000] 341.494494: function: do_timer <-- tick_do_update_jiffies64.part.14
> hog-11980 [015] 341.494494: function: irq_exit <-- smp_apic_timer_interrupt
> <idle>-0 [000] 341.494494: bprint: do_timer: updated jiffies_64=4295008340 ticks=1
> hog-11980 [015] 341.494494: function: __context_tracking_enter <-- prepare_exit_to_usermode
> hog-11980 [015] 341.494494: function: vtime_user_enter <-- __context_tracking_enter
> hog-11980 [015] 341.494495: bprint: vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339)
> hog-11980 [015] 341.494495: function: __vtime_account_system <-- vtime_user_enter
> hog-11980 [015] 341.494495: bprint: get_vtime_delta: vtime_snap=4295008339 now=4295008340
> hog-11980 [015] 341.494495: function: account_system_time <-- __vtime_account_system
> hog-11980 [015] 341.494495: bprint: account_system_time: cputime=995488
> <idle>-0 [000] 341.494497: function: irq_exit <-- smp_apic_timer_interrupt
>
> In this trace, we see the following:
>
> 1. On CPU15, we transition from user-space to kernel-space because
> of a timer interrupt (it's the tick)
>
> 2. vtimer_delta() returns 0, because jiffies didn't change since the
> last accounting
>
> 3. While CPU15 is executing in kernel-space, jiffies is updated
> by CPU0
>
> 4. When going back to user-space, vtime_delta() returns non-zero
> and the whole time is accounted for system time (observe how
> the cputime parameter in account_system_time() is less than 1ms)
>
>
> After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN
> to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to
> jiffy based sampling. So the issue can happen if all CPUs fire their ticks
> at the same time:
>
>
> CPU 0 CPU 1
> ----- -----
> exit_user() // no cputime update
> tick X update_jiffies
> enter_user() // cputime update
>
>
> exit_user() // no cputime update
> tick X+1 update_jiffies
> enter_user() // cputime update
>
> The nohz_full CPU receives an interrupt at the same time the timer
> interrupt fires on the housekeeping CPU.
>
> This patch offsets the tick to avert all ticks alignment in order
> that the vtime sampling does not end up "in phase" with the jiffies
> incrementing.
>
> Reported-by: Luiz Capitulino <lcapitulino@xxxxxxxxxx>
> Suggested-by: Rik van Riel <riel@xxxxxxxxxx>
> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> Cc: Rik van Riel <riel@xxxxxxxxxx>
> Cc: Mike Galbraith <efault@xxxxxx>
> Cc: Luiz Capitulino <lcapitulino@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>

Reviewed-and-Tested-by: Luiz Capitulino <lcapitulino@xxxxxxxxxx>

Thanks for everyone who participated on tracking this down
and fixing it!

> ---
> kernel/time/tick-sched.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..35d79f7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
> /* Get the next period (per-CPU) */
> hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>
> - /* Offset the tick to avert jiffies_lock contention. */
> - if (sched_skew_tick) {
> + /*
> + * Offset the tick to avert jiffies_lock contention, and all ticks
> + * alignment in order that the vtime sampling does not end up "in
> + * phase" with the jiffies incrementing.
> + */
> + if (sched_skew_tick || tick_nohz_full_enabled()) {
> u64 offset = ktime_to_ns(tick_period) >> 1;
> do_div(offset, num_possible_cpus());
> offset *= smp_processor_id();