Re: Timer refuses to expire
From: Paul E. McKenney
Date: Thu Dec 07 2017 - 10:06:48 EST
On Thu, Dec 07, 2017 at 03:03:50PM +0800, Boqun Feng wrote:
> Hi Paul,
>
> On Wed, Dec 06, 2017 at 02:04:21PM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 05, 2017 at 03:37:44PM -0800, Paul E. McKenney wrote:
> > > On Mon, Dec 04, 2017 at 09:42:08AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Dec 01, 2017 at 10:25:29AM -0800, Paul E. McKenney wrote:
> > > > > Hello, Anna-Maria,
> > > > >
> > > > > It turned out that one set of problems was due to NO_HZ_FULL issues,
> > > > > which are addressed with a pair of patches. However, there is still one
> > > > > case of a timer refusing to expire in rcutorture's TREE01 test scenario.
> > > > > The takes from two to four hours to trigger, and the most recent one
> > > > > gives the following diagnostic (see patch at the very end of this email):
> > > > >
> > > > > [13166.127458] schedule_timeout: Waylayed timer base->clk: 0x100c40004 jiffies: 0x100c4524e base->next_expiry: 0x100c40004 timer->flags: 0x1000003 timer->expires 0x100c40003 idx: 4 idx_now: ea base->pending_map 000000000000001000000000001080000000000000040000000002000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000
> > > > >
> > > > > The message appears any time a timer for less than five jiffies takes
> > > > > more than eight seconds to expire. In all cases, this expiry did not
> > > > > happen naturally, but rather via an unsolicited wakeup from the RCU CPU
> > > > > stall code. If I am interpreting this message correctly, base->clk
> > > > > has advanced past this timer, and the timer is correctly flagged in
> > > > > base->pending_map. This seems like part of the problem because the
> > > > > timer's timeout was only three jiffies. However, base->clk has not
> > > > > advanced for more than 20 seconds, which seems like another problem.
> > > > >
> > > > > What additional diagnostics would be helpful? I can capture data
> > > > > at the beginning of the schedule_timeout in the timer_list structure,
> > > > > and of course can print more information at the time of the wakeup.
> > > >
> > > > And on the off-chance that the following messages from this weekend's
> > > > runs are at all helpful. One interesting point is that starting at
> > > > time 4731.872311, there are repeated enqueues to CPU 5's timer wheel,
> > > > but CPU 5's ->clk does not advance. Things seem to advance at
> > > > time 4891.721190.
> > >
> > > Another layer on the onion... For at least some of the failures, there
> > > is a stalled CPU-hotplug operation. This of course can mean that the
> > > timers are stuck on the partially offlined CPU. So I dumped the stack
> > > of the task that is taking the CPU offline. Please see below for console
> > > output and patch.
> > >
> > > I am considering doing an unsolicited wakeup of the task doing the
> > > hotplug operation, but I am not convinced that the entirely of the CPU
> > > hotplug code is willing to put up with that sort of thing.
> >
> > What I did instead was to dump out the state of the task that
> > __cpuhp_kick_ap() waits on, please see the patch at the very end of this
> > email. This triggered as shown below, and you guessed it, that task is
> > waiting on a grace period. Which I am guessing won't happen until the
> > outgoing CPU reaches CPUHP_TIMERS_DEAD state and calls timers_dead_cpu().
> > Which will prevent RCU's grace-period kthread from ever awakening, which
> > will prevent the task that __cpuhp_kick_ap() waits on from ever awakening,
> > which will prevent the outgoing CPU from reaching CPUHP_TIMERS_DEAD state.
> >
> > Deadlock.
>
> There is one thing I'm confused here. Sure, this is a deadlock, but the
> timer should still work in such a deadlock, right? I mean, the timer of
> schedule_timeout() should be able to wake up rcu_gp_kthread() even in
> this case? And yes, the gp kthread will continue to wait due to the
> deadlock, but the deadlock can not explain the "Waylayed timer", right?
My belief is that the timer cannot fire because it is still on the
offlined CPU, and that CPU has not yet reached timers_dead_cpu().
But I might be missing something subtle in either the timers code or the
CPU-hotplug code, so please do check my reasoning here. (I am relying on
the "timer->flags: 0x40000007" and the "cpuhp/7" below, which I believe
means that the timer is on CPU 7 and that it is CPU 7 that is in the
process of going offline.)
The "Waylayed timer" happens because the RCU CPU stall warning code
wakes up the grace-period kthread. This is driven out of the
scheduling-clock tick, so is unaffected by timers, though it does
rely on the jiffies counter continuing to be incremented.
So what am I missing here?
Thanx, Paul
> Regards,
> Boqun
>
> > Does that make sense, or am I missing a trick here?
> >
> > I tried invoking timers_dead_cpu() from sched_cpu_deactivate(), but
> > that triggered the BUG_ON(cpu_online(cpu)). I removed this BUG_ON(),
> > and appear to have deadlocked on the timer ->lock.
> >
> > Any thoughts on what else to try?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > [ 2939.381210] schedule_timeout: Waylayed timer base->clk: 0x1002709c3 jiffies: 0x100284614 base->next_expiry: 0x1002709c3 timer->flags: 0x40000007 timer->expires 0x10027f3a2 idx: 100 idx_now: 105 base->pending_map 000000000000000800000000000000000002000000000000020080400002000868000000000000010000000001000000000000000000000000000000000000000000000000000000
> > [ 2939.381219] Torture onoff task state:
> > [ 2939.381221] torture_onoff D14584 842 2 0x80000000
> > [ 2939.381239] Call Trace:
> > [ 2939.381245] ? __schedule+0x33c/0x6f0
> > [ 2939.381248] ? preempt_count_add+0x51/0x90
> > [ 2939.381250] schedule+0x37/0x90
> > [ 2939.381270] schedule_timeout+0x20d/0x4c0
> > [ 2939.381276] wait_for_completion+0x95/0x100
> > [ 2939.381282] ? wake_up_q+0x60/0x60
> > [ 2939.381299] __cpuhp_kick_ap+0x49/0x60
> > [ 2939.381313] cpuhp_kick_ap+0x29/0x70
> > [ 2939.381316] cpuhp_kick_ap_work+0x34/0x130
> > [ 2939.381322] _cpu_down+0x17e/0x1f0
> > [ 2939.381326] do_cpu_down+0x30/0x50
> > [ 2939.381354] torture_offline+0x76/0x140
> > [ 2939.381368] torture_onoff+0xef/0x1c0
> > [ 2939.381371] kthread+0xf0/0x130
> > [ 2939.381374] ? torture_kthread_stopping+0x70/0x70
> > [ 2939.381376] ? kthread_destroy_worker+0x40/0x40
> > [ 2939.381378] ret_from_fork+0x1f/0x30
> > [ 2939.381381] __cpuhp_kick_ap task state:
> > [ 2939.381383] cpuhp/7 D14504 46 2 0x80000000
> > [ 2939.381399] Call Trace:
> > [ 2939.381403] ? __schedule+0x33c/0x6f0
> > [ 2939.381406] schedule+0x37/0x90
> > [ 2939.381408] schedule_timeout+0x20d/0x4c0
> > [ 2939.381414] ? enqueue_task_fair+0x556/0x12b0
> > [ 2939.381417] wait_for_completion+0x95/0x100
> > [ 2939.381419] ? wake_up_q+0x60/0x60
> > [ 2939.381431] __wait_rcu_gp+0x10d/0x140
> > [ 2939.381441] ? sched_cpu_activate+0xc0/0xc0
> > [ 2939.381443] sched_cpu_deactivate+0x36/0xa0
> > [ 2939.381453] ? kfree_call_rcu+0x20/0x20
> > [ 2939.381454] ? __call_rcu+0x530/0x530
> > [ 2939.381457] ? __rcu_read_unlock+0x50/0x50
> > [ 2939.381459] ? __rcu_read_unlock+0x50/0x50
> > [ 2939.381462] cpuhp_invoke_callback+0xa8/0x610
> > [ 2939.381466] cpuhp_thread_fun+0xc5/0x150
> > [ 2939.381469] smpboot_thread_fn+0x15c/0x220
> > [ 2939.381476] kthread+0xf0/0x130
> > [ 2939.381478] ? sort_range+0x20/0x20
> > [ 2939.381480] ? kthread_destroy_worker+0x40/0x40
> > [ 2939.381482] ret_from_fork+0x1f/0x30
> >
> > ------------------------------------------------------------------------
> >
> > commit aa2b802cb432a725589884d2ba47833de21e0871
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > Date: Wed Dec 6 09:46:30 2017 -0800
> >
> > EXP: timer: Dump __cpuhp_kick_ap task state
> >
> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> >
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 04892a82f6ac..dc9098946b3a 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -381,6 +381,8 @@ cpuhp_reset_state(struct cpuhp_cpu_state *st, enum cpuhp_state prev_state)
> > st->bringup = !st->bringup;
> > }
> >
> > +void schedule_timeout_set_task3dump(struct task_struct *t);
> > +
> > /* Regular hotplug invocation of the AP hotplug thread */
> > static void __cpuhp_kick_ap(struct cpuhp_cpu_state *st)
> > {
> > @@ -395,7 +397,9 @@ static void __cpuhp_kick_ap(struct cpuhp_cpu_state *st)
> > smp_mb();
> > st->should_run = true;
> > wake_up_process(st->thread);
> > + schedule_timeout_set_task3dump(st->thread);
> > wait_for_ap_thread(st, st->bringup);
> > + schedule_timeout_set_task3dump(NULL);
> > }
> >
> > static int cpuhp_kick_ap(struct cpuhp_cpu_state *st, enum cpuhp_state target)
> > diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> > index 0a67df4cdeb6..532b43d05219 100644
> > --- a/kernel/time/timer.c
> > +++ b/kernel/time/timer.c
> > @@ -1725,6 +1725,7 @@ static void process_timeout(struct timer_list *t)
> > }
> >
> > static struct task_struct *schedule_timeout_task2dump;
> > +static struct task_struct *schedule_timeout_task3dump;
> > static DEFINE_MUTEX(schedule_timeout_task2dump_mutex);
> > void schedule_timeout_set_task2dump(struct task_struct *t)
> > {
> > @@ -1732,7 +1733,12 @@ void schedule_timeout_set_task2dump(struct task_struct *t)
> > WRITE_ONCE(schedule_timeout_task2dump, t);
> > mutex_unlock(&schedule_timeout_task2dump_mutex);
> > }
> > -EXPORT_SYMBOL_GPL(schedule_timeout_set_task2dump);
> > +void schedule_timeout_set_task3dump(struct task_struct *t)
> > +{
> > + mutex_lock(&schedule_timeout_task2dump_mutex);
> > + WRITE_ONCE(schedule_timeout_task3dump, t);
> > + mutex_unlock(&schedule_timeout_task2dump_mutex);
> > +}
> >
> > /**
> > * schedule_timeout - sleep until timeout
> > @@ -1824,6 +1830,10 @@ signed long __sched schedule_timeout(signed long timeout)
> > pr_info("Torture onoff task state:\n");
> > sched_show_task(schedule_timeout_task2dump);
> > }
> > + if (schedule_timeout_task3dump) {
> > + pr_info("__cpuhp_kick_ap task state:\n");
> > + sched_show_task(schedule_timeout_task3dump);
> > + }
> > mutex_unlock(&schedule_timeout_task2dump_mutex);
> > }
> > }
> >