Re: [PATCH] timekeeping: Fix __ktime_get_fast_ns() regression

From: Mathieu Desnoyers
Date: Wed Oct 05 2016 - 06:56:16 EST


----- On Oct 4, 2016, at 10:55 PM, John Stultz john.stultz@xxxxxxxxxx wrote:

> In commit 27727df240c7 ("Avoid taking lock in NMI path with
> CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
> the timekeeping_get_ns() function, but I forgot to include
> the unit conversion from cycles to nanoseconds, breaking the
> function's output, which impacts users like perf.

It also impacts LTTng. I've had reports of timestamp borkage from
LTTng end users yesterday with kernels 4.7.5 and 4.7.4.

Some stable branches sleuthing indicates that the following
kernel versions are affected: 4.8, 4.7.4+, 4.4.20+, 4.1.32+

I plan to add kernel version range checks in lttng-modules to
work-around this mess, but I need an upper bound. Is there any way
we can get this upstream and cherry-picked into stable branches
ASAP ?

Tested-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>

LTTng kernel traces with Linux 4.8:

[06:37:40.248222037] (+0.000001057) compudjdev power_cpu_idle: { cpu_id = 10 }, { state = 1, cpu_id = 10 }
[06:37:40.248222037] (+0.000000000) compudjdev x86_irq_vectors_local_timer_entry: { cpu_id = 10 }, { vector = 239 }
[06:37:40.248222037] (+0.000000000) compudjdev timer_hrtimer_cancel: { cpu_id = 10 }, { hrtimer = 0xFFFF880FC1692700 }
[06:37:40.248222037] (+0.000000000) compudjdev timer_hrtimer_expire_entry: { cpu_id = 10 }, { hrtimer = 0xFFFF880FC1692700, now = 211036043790, function = 0xFFFFFFFF8116FD40 }
[06:37:40.248222037] (+0.000000000) compudjdev irq_softirq_raise: { cpu_id = 10 }, { vec = 1 }
[06:37:40.248222037] (+0.000000000) compudjdev rcu_utilization: { cpu_id = 10 }, { s = "Start scheduler-tick" }
[06:37:40.248222037] (+0.000000000) compudjdev irq_softirq_raise: { cpu_id = 10 }, { vec = 9 }
[...]

% lttng view | grep "cpu_id = 0" | grep "(+0.000000000)" |wc -l
76156

LTTng kernel traces with Linux 4.8 + this fix applied:

[06:46:52.648061087] (+0.000001455) compudjdev power_cpu_idle: { cpu_id = 23 }, { state = 1, cpu_id = 23 }
[06:46:52.652011654] (+0.003950567) compudjdev x86_irq_vectors_local_timer_entry: { cpu_id = 9 }, { vector = 239 }
[06:46:52.652012324] (+0.000000670) compudjdev timer_hrtimer_cancel: { cpu_id = 9 }, { hrtimer = 0xFFFF880FC1652700 }
[06:46:52.652012654] (+0.000000330) compudjdev timer_hrtimer_expire_entry: { cpu_id = 9 }, { hrtimer = 0xFFFF880FC1652700, now = 91808012323, function = 0xFFFFFFFF8116FD80 }
[06:46:52.652013354] (+0.000000700) compudjdev irq_softirq_raise: { cpu_id = 9 }, { vec = 1 }
[06:46:52.652013784] (+0.000000430) compudjdev rcu_utilization: { cpu_id = 9 }, { s = "Start scheduler-tick" }

% lttng view | grep "cpu_id = 0" | grep "(+0.000000000)" |wc -l
2

Which confirms that this fix addresses the issue.

This issue seems to affect perf, ftrace "mono" clock source, and
eBPF in Linux.

Thanks,

Mathieu

>
> This would result in bogus perf timestamps like:
> swapper 0 [000] 253.427536: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.426573: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.426687: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.426800: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.426905: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.427022: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.427127: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.427239: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.427346: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 254.427463: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 255.426572: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
>
> Instead of more reasonable expected timestamps like:
> swapper 0 [000] 39.953768: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.064839: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.175956: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.287103: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.398217: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.509324: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.620437: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.731546: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.842654: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 40.953772: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
> swapper 0 [000] 41.064881: 111111111 cpu-clock: ffffffff810a0de6
> native_safe_halt+0x6 ([kernel.kallsyms])
>
> This patch adds the proper use of timekeeping_delta_to_ns()
> to convert the cycle delta to nanoseconds as needed.
>
> Thanks to Brendan and Alexei for finding this quickly after
> the v4.8 release. Unfortunately the problematic commit has
> landed in some -stable trees so they'll need this fix as
> well.
>
> Many apologies for this mistake. I'll be looking to add a
> perf-clock sanity test to the kselftest timers tests soon.
>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: stable <stable@xxxxxxxxxxxxxxx>
> Cc: Brendan Gregg <bgregg@xxxxxxxxxxx>
> Cc: Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx>
> Fixes: 27727df240c7 "timekeeping: Avoid taking lock in NMI path with
> CONFIG_DEBUG_TIMEKEEPING"
> Reported-by: Brendan Gregg <bgregg@xxxxxxxxxxx>
> Reported-by: Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx>
> Signed-off-by: John Stultz <john.stultz@xxxxxxxxxx>
> ---
> Thomas/Ingo:
> I've reproduced the issue and validated this fixes it, but given my limited perf
> usage so far, I'd appreciate waiting for a Tested-by: from one of the reporters
> before considering for tip/timers/urgent.
>
> kernel/time/timekeeping.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index e07fb09..37dec7e 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -403,8 +403,11 @@ static __always_inline u64 __ktime_get_fast_ns(struct
> tk_fast *tkf)
> tkr = tkf->base + (seq & 0x01);
> now = ktime_to_ns(tkr->base);
>
> - now += clocksource_delta(tkr->read(tkr->clock),
> - tkr->cycle_last, tkr->mask);
> + now += timekeeping_delta_to_ns(tkr,
> + clocksource_delta(
> + tkr->read(tkr->clock),
> + tkr->cycle_last,
> + tkr->mask));
> } while (read_seqcount_retry(&tkf->seq, seq));
>
> return now;
> --
> 1.9.1

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com