[KERNEL BUG] do_timer/tick_handover_do_timer 3.10.17

From: pawandeep oza
Date: Wed May 06 2015 - 13:27:33 EST


Hi,

Linux version 3.10.17

Problem statement: timekeeping (do_timer) appears to have stopped, and
the core that is aborting (core0 in this case) is stuck in a loop that
relies on jiffies advancing.


Root cause:

We run a tickless kernel, so a CPU that enters a deep idle state stops
its sched tick (tick_nohz_stop_sched_tick).

tick_sched_do_timer should then take over: whichever CPU is still
running transfers the job of incrementing jiffies to itself.


But when, say, core0 has raised a BUG, ipi_cpu_stop makes the other
CPUs stop. clockevents_notify/tick_notify/hrtimer_notify all seem to be
connected, eventually, through cpu_chain.

But this code belongs to the hotplug path: when cpu_down happens, it can
successfully call tick_handover_do_timer, which takes the duty over
from the dying CPU and assigns it to one that is online.

static void tick_handover_do_timer(int *cpup)
{
	if (*cpup == tick_do_timer_cpu) {
		int cpu = cpumask_first(cpu_online_mask);

		tick_do_timer_cpu = (cpu < nr_cpu_ids) ? cpu :
			TICK_DO_TIMER_NONE;
	}
}


But since cpu_down is not called in the stop path, this handover does
not happen, and tick_do_timer_cpu is left pointing at a dead CPU
(1, 2 or 3).

As a result, core0 waits forever wherever its code relies on jiffies
being incremented.
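
To make the failure mode concrete, here is a minimal user-space model of
it, not kernel code. All model_* names are hypothetical and only stand in
for jiffies, tick_do_timer_cpu and the tick path; the iteration bound
exists purely so that the model terminates, whereas the real loop has no
such bound.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_DO_TIMER_NONE (-1)	/* stands in for TICK_DO_TIMER_NONE */

static unsigned long model_jiffies;	/* stands in for jiffies */
static int model_do_timer_cpu;		/* stands in for tick_do_timer_cpu */

/*
 * Tick on 'cpu': claim the do_timer duty if nobody owns it, then
 * advance jiffies only if this CPU is the owner.
 */
static void model_tick(int cpu)
{
	assert(cpu >= 0);

	if (model_do_timer_cpu == MODEL_DO_TIMER_NONE)
		model_do_timer_cpu = cpu;
	if (model_do_timer_cpu == cpu)
		model_jiffies++;
}

/*
 * Wait on 'cpu' until 'timeout' jiffies have elapsed, giving up after
 * 'max_ticks' ticks. Returns true if the timeout expired, false if
 * jiffies never advanced far enough.
 */
static bool model_wait_jiffies(int cpu, unsigned long timeout,
			       unsigned long max_ticks)
{
	unsigned long start = model_jiffies;
	unsigned long i;

	for (i = 0; i < max_ticks; i++) {
		if (model_jiffies - start >= timeout)
			return true;
		model_tick(cpu);
	}
	return model_jiffies - start >= timeout;
}
```

With the owner set to a stopped CPU (say 2), a wait on CPU 0 never
completes because CPU 0's ticks do not advance jiffies. With the owner
set to MODEL_DO_TIMER_NONE, CPU 0 claims the duty on its first tick and
the wait completes.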


What is the right way to approach this problem? At first glance, it
looks like the kernel should take care of handing the jiffies job over
to another online core, independent of hotplug.
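
The idea can be sketched in a small user-space model, again not kernel
code. The model_* names are hypothetical, and it assumes that the stop
path marks the stopped CPU offline. It contrasts a stop that leaves the
duty where it was with a stop that applies the same handover logic as
tick_handover_do_timer.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4			/* hypothetical: four cores */
#define MODEL_DO_TIMER_NONE (-1)	/* stands in for TICK_DO_TIMER_NONE */

static bool model_online[MODEL_NR_CPUS];		/* models cpu_online_mask */
static int model_do_timer_cpu = MODEL_DO_TIMER_NONE;	/* models tick_do_timer_cpu */

/* Lowest-numbered online CPU, or MODEL_NR_CPUS if none is online. */
static int model_first_online(void)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (model_online[cpu])
			return cpu;
	return MODEL_NR_CPUS;
}

/* Same logic as tick_handover_do_timer(), applied to the model state. */
static void model_handover_do_timer(int stopping_cpu)
{
	if (stopping_cpu == model_do_timer_cpu) {
		int cpu = model_first_online();

		model_do_timer_cpu = (cpu < MODEL_NR_CPUS) ? cpu :
			MODEL_DO_TIMER_NONE;
	}
}

/* Stop path as described above: the CPU goes away, the duty stays put. */
static void model_cpu_stop(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_online[cpu] = false;
}

/* Suggested variant: hand the duty over as part of stopping. */
static void model_cpu_stop_with_handover(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_online[cpu] = false;
	model_handover_do_timer(cpu);
}

/* Bring all CPUs online and give the duty to 'owner'. */
static void model_reset(int owner)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_online[cpu] = true;
	model_do_timer_cpu = owner;
}
```

Starting with CPU 2 as the owner and stopping CPUs 1, 2 and 3, the plain
stop leaves the duty on CPU 2, while the handover variant moves it to
CPU 0. The model ignores locking and the context the stop IPI runs in,
so whether such a handover is acceptable in the real stop path remains
the open question.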

Regards,
Oza.