Re: [PATCH] tick: prefer a lower rating device only if it's CPU local device

From: Kevin Hilman
Date: Mon Jul 02 2018 - 19:44:51 EST


Hi Sudeep,

On Wed, May 9, 2018 at 9:02 AM Sudeep Holla <sudeep.holla@xxxxxxx> wrote:
>
> Checking the equality of cpumask for both new and old tick device doesn't
> ensure that it's CPU local device. This will cause issue if a low rating
> clockevent tick device is registered first followed by the registration
> of higher rating clockevent tick device.
>
> In such case, clockevents_released list will never get emptied as both
> the devices get selected as preferred one and we will loop forever in
> clockevents_notify_released.
>
> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Signed-off-by: Sudeep Holla <sudeep.holla@xxxxxxx>

I've got an arm32 board (meson8b-odroidc1) that's been failing in
kernelCI.org since the merge window (boot log[1]), and I finally got
around to bisecting it[2]. Unfortunately, the bisect pointed at a
merge commit, but with some trial and error (and a suggestion by Arnd)
I was able to confirm that reverting the $SUBJECT commit[3] makes my
problem go away.

Another interesting data point is that disabling SMP (either by
"nosmp" on the command-line or CONFIG_SMP=n) also makes the problem go
away, without needing to revert this patch.

AFAICT, this platform is using a single timer as a clocksource
("amlogic,meson6-timer"), which is not a per-CPU timer.

I ran out of time to keep digging on this issue, and I'm still not
sure exactly what's going on, but I wanted to report it in case anyone
else has any ideas, and so we can hopefully get it fixed during the
-rc cycle.

Kevin

[1] https://storage.kernelci.org/mainline/master/v4.18-rc2-357-gd3bc0e67f852/arm/multi_v7_defconfig/lab-baylibre-seattle/boot-meson8b-odroidc1.html
[2] http://termbin.com/mk07
[3] in mainline as: 1332a9055801 tick: Prefer a lower rating device only if it's CPU local device

> ---
> kernel/time/tick-common.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> Hi Thomas,
>
> I am seeing this issue on my Juno devboard, where system wide timers
> with rating 300 and 400 are registered in same order and we get stuck in
> a loop in clockevents_notify_released. Let me know if this looks sane or
> you have any suggestions that I can try out.
>
> Regards,
> Sudeep
>
> diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
> index 49edc1c4f3e6..78e598334007 100644
> --- a/kernel/time/tick-common.c
> +++ b/kernel/time/tick-common.c
> @@ -277,7 +277,8 @@ static bool tick_check_preferred(struct clock_event_device *curdev,
> */
> return !curdev ||
> newdev->rating > curdev->rating ||
> - !cpumask_equal(curdev->cpumask, newdev->cpumask);
> + (!cpumask_equal(curdev->cpumask, newdev->cpumask) &&
> + !tick_check_percpu(curdev, newdev, smp_processor_id()));
> }
>
> /*
> --
> 2.7.4
>
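
For anyone following along, the condition change in the hunk above can be
modelled in isolation. This is only a userspace sketch, not kernel code:
struct model_dev, preferred_before(), preferred_after() and sketch_selfcheck()
are made-up names, the cpumask is reduced to a plain bitmask, and the
percpu_result flag simply stands in for whatever tick_check_percpu() returns
in the kernel. It says nothing about why the odroidc1 fails to boot; it only
shows which inputs the two versions of the expression treat differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for struct clock_event_device: only the two
 * fields that the quoted hunk looks at. The cpumask is modelled as a
 * plain bitmask (assumption made for illustration only).
 */
struct model_dev {
    int rating;
    unsigned int cpumask;
};

/* Return expression as it reads on the '-' side of the hunk. */
static bool preferred_before(const struct model_dev *cur,
                             const struct model_dev *newdev)
{
    return !cur ||
           newdev->rating > cur->rating ||
           cur->cpumask != newdev->cpumask;
}

/*
 * Return expression as it reads on the '+' side of the hunk.
 * percpu_result is a placeholder for the value of
 * tick_check_percpu(curdev, newdev, smp_processor_id()).
 */
static bool preferred_after(const struct model_dev *cur,
                            const struct model_dev *newdev,
                            bool percpu_result)
{
    return !cur ||
           newdev->rating > cur->rating ||
           (cur->cpumask != newdev->cpumask && !percpu_result);
}

/* Walks the one case where the two expressions can disagree. */
void sketch_selfcheck(void)
{
    struct model_dev cur = { .rating = 400, .cpumask = 0x3 };
    struct model_dev low = { .rating = 300, .cpumask = 0x1 };

    /* Lower rating, different mask: the old expression always prefers it. */
    assert(preferred_before(&cur, &low));

    /* The new expression additionally depends on the per-CPU check. */
    assert(preferred_after(&cur, &low, false));
    assert(!preferred_after(&cur, &low, true));
}
```

With no current device, or with a higher-rated new device, both versions
return true regardless of the extra term, so the only behavioural difference
in this model is the lower-rating, different-cpumask case.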