Re: [PATCH tip/core/rcu 1/2] rcu: Parallelize and economize NOCB kthread wakeups

From: Paul E. McKenney
Date: Fri Aug 22 2014 - 10:48:42 EST


On Fri, Aug 22, 2014 at 06:26:49PM +0530, Amit Shah wrote:
> On (Fri) 22 Aug 2014 [18:06:51], Amit Shah wrote:
> > On (Fri) 22 Aug 2014 [17:54:53], Amit Shah wrote:
> > > On (Mon) 18 Aug 2014 [21:01:49], Paul E. McKenney wrote:
> > >
> > > > The odds are low over the next few days. I am adding nastier rcutorture
> > > > testing, however. It would still be very good to get debug information
> > > > from your setup. One approach would be to convert the trace function
> > > > calls into printk(), if that would help.
> > >
> > > I added a few printks along the lines of the traces in cases where
> > > rcu_nocb_poll was checked -- since that reproduces the hang. Are the
> > > following traces sufficient, or should I keep adding more printks?
> > >
> > > In the case of rcu-trace-nopoll.txt, the messages stop after a while
> > > (when the guest locks up hard). That's when I kill the qemu process.
> >
> > And this is bt from gdb when the endless
> >
> > RCUDEBUG __call_rcu_nocb_enqueue 2146 rcu_preempt 0 WakeNot
> >
> > messages are being spewed.
> >
> > I can't time it, but hope it gives some indication along with the printks.
>
> ... and after the system 'locks up', this is the state it's in:
>
> ^C
> Program received signal SIGINT, Interrupt.
> native_safe_halt () at ./arch/x86/include/asm/irqflags.h:50
> 50 }
> (gdb) bt
> #0 native_safe_halt () at ./arch/x86/include/asm/irqflags.h:50
> #1 0xffffffff8100b9c1 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:111
> #2 default_idle () at arch/x86/kernel/process.c:311
> #3 0xffffffff8100c107 in arch_cpu_idle () at arch/x86/kernel/process.c:302
> #4 0xffffffff8106a25a in cpuidle_idle_call () at kernel/sched/idle.c:120
> #5 cpu_idle_loop () at kernel/sched/idle.c:220
> #6 cpu_startup_entry (state=<optimized out>) at kernel/sched/idle.c:268
> #7 0xffffffff813e068b in rest_init () at init/main.c:418
> #8 0xffffffff81a8cf5a in start_kernel () at init/main.c:680
> #9 0xffffffff81a8c4ba in x86_64_start_reservations (real_mode_data=<optimized out>) at arch/x86/kernel/head64.c:193
> #10 0xffffffff81a8c607 in x86_64_start_kernel (real_mode_data=0x13f90 <cpu_lock_stats+29184> <error: Cannot access memory at address 0x13f90>) at arch/x86/kernel/head64.c:182
> #11 0x0000000000000000 in ?? ()
>
>
> Wondering why it's doing this. Am stepping through
> cpu_startup_entry() to see if I get any clues.

This looks to me like normal behavior in the x86 ACPI idle loop.
My guess is that the lockup is caused by indefinite blocking, in
which case we would expect all the CPUs to be in the idle loop.

Of course, this all assumes that your system is using ACPI for idle.
(Is it?)

Thanx, Paul
