Re: Need your help on rcu_bh_state stall on CPU 0 (t=0 jiffies)

From: Paul E. McKenney
Date: Tue Feb 21 2017 - 19:40:09 EST


On Wed, Feb 22, 2017 at 05:38:00AM +0530, Sumit Gemini wrote:
> Okay Paul, i will try this. but one more thing. According to
> Documentation/stallwarn.txt, occurrence of this issue is rare and customer
> too not observing the same issue and yes this issue come at random.

You lost me on this one. I do understand that chasing down bugs can
be difficult and frustrating to all concerned, if that is what you
are getting at.

> So how to track all function just before stall. Because if issue does not
> come for 1 month but my ftracer is running continuously it will create huge
> log file and consume RAM.

Limit the size of the ftrace buffers. This will get you the last part
of the trace data, which is what you want in this case. Are you able
to modify the kernel? If so, you can use ftrace_dump() to cause the
trce log to be dumped at the very start of the stall-warning process.

> Because I discussed here in my office about ftrace, and they denied this
> option. According to them it is feasible at local end for debugging (
> reproducing issue) but not at customer end.

Perhaps your kernel is too old. Or maybe there is some confusion
about the exact option. Why not just try it?

> Could you please suggest what I do, I don't have idea how to proceed
> further.

I hope that you have tried reproducing this in your lab.

Another approach is to measure the actual overhead in the lab. Compute
the actual performance degradation, having asked beforehand what
degradation would be acceptable.

One last thing: The machine knows what is wrong. Your challenge is to
make it tell you what the problem is, using whatever tools are available.
My ability to help you is sharply limited by the fact that you are
running a very old kernel. I suggest reaching out to people who work
regularly with this kernel version for additional help with debugging.

Thanx, Paul

> Thanks & Regards
> ~Sumit Gemini
>
> On Feb 22, 2017 1:27 AM, "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> wrote:
>
> > Hello, Sumit,
> >
> > Instead of tracing RCU-bh, I suggest tracing upper-level function calls
> > in the kernel. This will allow you to see what the kernel was doing
> > before the stall warning. See the max_graph_depth parameter in
> > https://www.kernel.org/doc/Documentation/trace/ftrace.txt for how
> > to do this.
> >
> > Thanx, Paul
> >
> > On Tue, Feb 21, 2017 at 03:48:48PM +0530, Sumit Gemini wrote:
> > > Hi Paul,
> > >
> > > I need you help again, I'm using this link "
> > > http://opensourceforu.com/2010/11/kernel-tracing-with-ftrace-part-1/"; to
> > > understand how ftrace work
> > > but problem is, example here showing traces for all events when we on the
> > > tracing.
> > >
> > > Seeking help on google, i got to this video, in which you addressing
> > about
> > > tracing the rcu-bh event.
> > >
> > > https://www.youtube.com/watch?v=WWEz-uoyoPE
> > >
> > >
> > > Could you please describe the steps to collect logs whenever i got rcu_bh
> > > stall event (cpu stall problem) only?
> > >
> > >
> > > Thanks
> > > ~Sumit Gemini
> > >
> > >
> > >
> > > On Tue, Feb 21, 2017 at 1:35 PM, Paul E. McKenney <
> > > paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > > On Tue, Feb 21, 2017 at 11:36:42AM +0530, Sumit Gemini wrote:
> > > > > Thank Paul, you really save the day by giving your valuable input
> > but I
> > > > > still confused in 2 things :-
> > > > >
> > > > >
> > > > > 1. If i suppress the rcu_bh_state() warning, will NMI not trigger?
> > > > > Because the problem is i don't know how to reproduce this issue. I
> > > > observed
> > > > > this issue first time ever in my career and this issue occured at
> > > > customer
> > > > > end so i have to give him a workaround.
> > > >
> > > > Yes, suppressing the RCU CPU stall warning will also suppress the NMIs.
> > > >
> > > > > 2. and one important thing, i have kernel stack traces which was
> > > > > occurred when stall issue observed. Many developer got the almost
> > same
> > > > > stack-starce. similar this link "
> > > > > http://www.serverphorums.com/read.php?12,468996"; in which Ralf
> > and
> > > > you
> > > > > were discussing. So can we not get any suspicious function or
> > > > statement
> > > > > from this stack trace? or we need ftrace to debug this?
> > > >
> > > > As described in stallwarn.txt (please do read this if you have not
> > already
> > > > done so, and from your questions it does -not- seem like you have read
> > > > it), one cause of RCU CPU stall warnings is long loops in the kernel.
> > > > Adding cond_resched() in older kernels or (better)
> > cond_resched_rcu_qs()
> > > > in newer kernels can prevent the stall warnings and other problems.
> > > > But calls to these functions must be carefully placed, because
> > otherwise
> > > > you will get error messages such as "scheduling while atomic".
> > > >
> > > > Thanx, Paul
> > > >
> > > > > Thanks
> > > > > ~Sumit Gemini
> > > > >
> > > > > On Tue, Feb 21, 2017 at 8:31 AM, Paul E. McKenney <
> > > > > paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > > On Tue, Feb 21, 2017 at 06:36:34AM +0530, Sumit Gemini wrote:
> > > > > > > Hi Pual,
> > > > > > >
> > > > > > > Thanks for your time.
> > > > > > >
> > > > > > > we are using HA pair with TIPC and opensf functionality. Now
> > problem
> > > > is
> > > > > > on
> > > > > > > active machine TIPC link lost message was observed on 06:45:00
> > but
> > > > > > standby
> > > > > > > machine detected this link lost at 06:45:06 because at 06:45:00
> > > > standby
> > > > > > > machine having rcu_bh_state (CPU stall). Once standby machine
> > > > detected
> > > > > > link
> > > > > > > lost, it fired reboot command for active machine with the help of
> > > > opensf
> > > > > > at
> > > > > > > 06:45:07 to make itself active and standby to previously active
> > > > machine.
> > > > > > >
> > > > > > > I raised this matter before TIPC open source community, and they
> > told
> > > > > > > rcu_bh_state was the reason for delaying to get link down event
> > at
> > > > > > standby
> > > > > > > end.
> > > > > > >
> > > > > > > In normal case TIPC link broken detects in < 2 second. But in our
> > > > case
> > > > > > took
> > > > > > > 7 second because of that stall. It is first time, we observed
> > such
> > > > > > > situation.
> > > > > > >
> > > > > > > Now if we suppres this warning, will we observe this delay again?
> > > > > >
> > > > > > That depends. If the delay was due to the NMIs and stack-trace
> > > > printing,
> > > > > > suppressing the warning will make the delay go away. On the other
> > > > hand,
> > > > > > if there really is a stall independent of the warning, then the
> > delay
> > > > > > will still be there. Why not try it and find out?
> > > > > >
> > > > > > > Can you enlighten on this stall. Why does this stall happen at
> > ideal
> > > > > > > machine? And if possible to debug it, what we need to do?
> > > > > > >
> > > > > > > Any pointers will be highly appreciated.
> > > > > >
> > > > > > I suggest reading Documentation/stallwarn.txt, especially the list
> > > > > > of possible causes near the end:
> > > > > >
> > > > > > https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt
> > > > > >
> > > > > > If the delay persist when you suppress the warnings, then I suggest
> > > > that
> > > > > > you use something like ftrace to gather information on the delay.
> > > > > >
> > > > > > Thanx, Paul
> > > > > >
> > > > > > > Thanks & Regards
> > > > > > > ~Sumit Gemini
> > > > > > >
> > > > > > >
> > > > > > > On Feb 21, 2017 2:28 AM, "Paul E. McKenney" <
> > > > paulmck@xxxxxxxxxxxxxxxxxx>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Mon, Feb 20, 2017 at 03:26:32PM +0530, Sumit Gemini wrote:
> > > > > > > > > Hi Paul,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I got you while reading 'http://www.serverphorums.com/
> > > > > > read.php?12,468996".
> > > > > > > > I'm
> > > > > > > > > facing same problem but not able to detect what exectly the
> > > > cause of
> > > > > > > > > it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm observed weird problem, I got rcu_bh_state on cpu0 but my
> > > > system
> > > > > > > > > was in ideal state when i got rcu_bh_state (Cpu stall)
> > problem.
> > > > After
> > > > > > > > > 7 second my system recovered itself from stall to active.
> > But i
> > > > did
> > > > > > > > > not get what happen at that time.
> > > > > > > >
> > > > > > > > One thing that you might look into below...
> > > > > > > >
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO:
> > > > rcu_bh_state
> > > > > > > > detected stall on CPU 0 (t=0 jiffies)
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] sending
> > NMI
> > > > to all
> > > > > > > > CPUs:
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] NMI
> > backtrace
> > > > for
> > > > > > cpu
> > > > > > > > 0
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] CPU 0
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Modules
> > > > linked in:
> > > > > > > > nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit
> > > > > > > > ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK
> > > > ip_set_bitmap_port
> > > > > > > > xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle
> > > > > > > > ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit
> > > > > > xt_hashcounter
> > > > > > > > ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr
> > > > > > > > twofish_x86_64 twofish_common camellia serpent blowfish cast5
> > > > > > des_generic
> > > > > > > > cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5
> > > > > > crypto_null
> > > > > > > > af_key iptable_mangle ip_set nfnetlink arptable_filter
> > arp_tables
> > > > > > > > iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype
> > bonding
> > > > > > > > binfmt_misc iptable_filter ip6table_filter ip6_tables
> > nf_nat_ftp
> > > > nf_nat
> > > > > > > > nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables
> > > > x_tables
> > > > > > mperf
> > > > > > > > edd ipmi_devintf ipmi_si ipmi_msghandler
> > nf_conntrack_proto_sctp
> > > > > > > > nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma
> > > > ixgbe
> > > > > > uas
> > > > > > > > sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support
> > button
> > > > > > > > container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore
> > > > processor
> > > > > > > > thermal_sys [last unloaded: ipt_PORTMAP]
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Pid: 0,
> > comm:
> > > > > > swapper
> > > > > > > > Tainted: P 3.1.10-gb17-default #1 Intel Corporation
> > S2600CO/S2600CO
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RIP:
> > > > > > > > 0010:[<ffffffff81007f51>] [<ffffffff81007f51>]
> > > > native_read_tsc+0x2/0xf
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RSP:
> > > > > > > > 0018:ffff88043ee03db0 EFLAGS: 00000803
> > > > > > > > > Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RAX:
> > > > > > 0000000037185395
> > > > > > > > RBX: 00000000000003e9 RCX: 0000000000000001
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a osafimmd[5035]: WA IMMND DOWN on
> > active
> > > > > > > > controller f2 detected at standby immd!! f1. Possible failover
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Done Locking
> > > > applications
> > > > > > > > node id:2020f ret val:0
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a opensaf_recovery: Control interface
> > > > > > status:0
> > > > > > > > Role:STANDBY
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Current role:
> > STANDBY
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: Rebooting OpenSAF
> > > > NodeId =
> > > > > > > > 131599 EE Name = , Reason: Received Node Down for peer
> > controller,
> > > > >
> > > > > > > > OwnNodeId = 131343, SupervisionTime = 60
> > > > > > > >
> > > > > > > > Is this "WA IMMND DOWN" a normal occurrence? Perhaps it is
> > > > pointing
> > > > > > > > at a root cause.
> > > > > > > >
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RDX:
> > > > > > 0000000000bf0977
> > > > > > > > RSI: 0000000000000002 RDI: 0000000000032bd4
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RBP:
> > > > > > 0000000000032bd4
> > > > > > > > R08: 0000000000000000 R09: ffffffff819232b0
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R10:
> > > > > > 7fffffffffffffff
> > > > > > > > R11: 7fffffffffffffff R12: 0000000000000000
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R13:
> > > > > > ffffffff819232b0
> > > > > > > > R14: 0000000000000001 R15: ffffffff81065c28
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] FS:
> > > > > > > > 0000000000000000(0000) GS:ffff88043ee00000(0000)
> > > > knlGS:0000000000000000
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CS: 0010
> > DS:
> > > > 0000
> > > > > > ES:
> > > > > > > > 0000 CR0: 000000008005003b
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CR2:
> > > > > > 000000000069e034
> > > > > > > > CR3: 0000000001805000 CR4: 00000000000406f0
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR0:
> > > > > > 0000000000000000
> > > > > > > > DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR3:
> > > > > > 0000000000000000
> > > > > > > > DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Process
> > > > swapper
> > > > > > (pid:
> > > > > > > > 0, threadinfo ffffffff81800000, task ffffffff8180d020)
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Stack:
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > ffffffff81200eb5
> > > > > > > > ffffffff81200f44 00000000000003e9 0000000000001000
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > 0000000000000002
> > > > > > > > ffffffff819232b0 ffffffff81017698 7fffffffffffffff
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > 0000000000000002
> > > > > > > > 0000000000000002 ffffffff81017fdf 0000000000000001
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Call
> > Trace:
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81200eb5>]
> > > > > > > > paravirt_read_tsc+0x5/0x8
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81200f44>]
> > > > > > > > delay_tsc+0x1d/0x68
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81017698>]
> > > > > > > > native_safe_apic_wait_icr_idle+0x27/0x32
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81017fdf>]
> > > > > > > > default_send_IPI_dest_field.constprop.0+0x19/0x4d
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8101804b>]
> > > > > > > > default_send_IPI_mask_sequence_phys+0x38/0x6a
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8101815e>]
> > > > > > > > arch_trigger_all_cpu_backtrace+0x4d/0x7b
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8109801b>]
> > > > > > > > check_cpu_stall+0x66/0xdb
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff810980aa>]
> > > > > > > > rcu_pending+0x1a/0x10a
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8109852c>]
> > > > > > > > rcu_check_callbacks+0x9d/0xae
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8104c56c>]
> > > > > > > > update_process_times+0x31/0x63
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81065c92>]
> > > > > > > > tick_sched_timer+0x6a/0x90
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8105b872>]
> > > > > > > > __run_hrtimer+0xa4/0x148
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8105c08e>]
> > > > > > > > hrtimer_interrupt+0xdb/0x19a
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81017768>]
> > > > > > > > smp_apic_timer_interrupt+0x6e/0x80
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff813efcde>]
> > > > > > > > apic_timer_interrupt+0x6e/0x80
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81239655>]
> > > > > > > > intel_idle+0xdd/0x104
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81304773>]
> > > > > > > > cpuidle_idle_call+0xda/0x169
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff81001200>]
> > > > > > > > cpu_idle+0x51/0x95
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8193db0f>]
> > > > > > > > start_kernel+0x388/0x393
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042]
> > > > > > [<ffffffff8193d3af>]
> > > > > > > > x86_64_start_kernel+0xcf/0xdc
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Code: 74
> > 03
> > > > e6 80
> > > > > > c3
> > > > > > > > e6 ed c3 bf 8e 21 00 00 e9 ba 8f 1f 00 c3 90 90 90 40 88 f8 e6
> > 70
> > > > e4
> > > > > > 71 c3
> > > > > > > > 40 88 f0 e6 70 40 88 f8 e6 71 c3 0f 31
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel[3167216.524042]: c1 48 89
> > d0 48
> > > > c1
> > > > > > e0
> > > > > > > > 20 48 09 c8 c3 41 56 41 55 41 54 55 53
> > > > > > > > >
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > > .
> > > > > > > > >
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] NMI
> > backtrace
> > > > for
> > > > > > cpu
> > > > > > > > 31
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CPU 31
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Modules
> > > > linked in:
> > > > > > > > nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit
> > > > > > > > ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK
> > > > ip_set_bitmap_port
> > > > > > > > xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle
> > > > > > > > ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit
> > > > > > xt_hashcounter
> > > > > > > > ip_set_hash_ipip xt_set ip_set_hash_ip deflate ctr
> > twofish_x86_64
> > > > > > > > twofish_common camellia serpent > blowfish cast5 des_generic
> > cbc
> > > > xcbc
> > > > > > > > rmd160 sha512_generic sha256_generic sha1_generic md5
> > crypto_null
> > > > > > af_key
> > > > > > > > iptable_mangle ip_set nfnetlink arptable_filter arp_tables
> > > > iptable_raw
> > > > > > > > iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding
> > binfmt_misc
> > > > > > > > iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat
> > > > > > > > nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables
> > > > x_tables
> > > > > > mperf
> > > > > > > > edd ipmi_devintf ipmi_si ipmi_msghandler
> > nf_conntrack_proto_sctp
> > > > > > > > nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma
> > > > ixgbe
> > > > > > uas
> > > > > > > > sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support
> > button
> > > > > > > > container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore
> > > > processor
> > > > > > > > thermal_sys [last unloaded: ipt_PORTMAP]
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734]
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Pid: 0,
> > comm:
> > > > > > > > kworker/0:1 Tainted: P 3.1.10-gb17-default #1 Intel Corporation
> > > > > > > > S2600CO/S2600CO
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RIP:
> > > > > > > > 0010:[<ffffffff81239624>] [<ffffffff81239624>]
> > > > intel_idle+0xac/0x104
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RSP:
> > > > > > > > 0018:ffff880425e33eb8 EFLAGS: 00000046
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RAX:
> > > > > > 0000000000000030
> > > > > > > > RBX: 0000000000000010 RCX: 0000000000000001
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RDX:
> > > > > > 0000000000000000
> > > > > > > > RSI: ffff880425e33fd8 RDI: ffffffff81810500
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RBP:
> > > > > > 0000000000000030
> > > > > > > > R08: 000000000000006d R09: 0000000000034b10
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R10:
> > > > > > ffff88083eded830
> > > > > > > > R11: ffff88083eded830 R12: 149739342cb2ca49
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R13:
> > > > > > 0000000000000004
> > > > > > > > R14: 000000000000001f R15: 0000000000000000
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] FS:
> > > > > > > > 0000000000000000(0000) GS:ffff88083ede0000(0000)
> > > > knlGS:0000000000000000
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CS: 0010
> > DS:
> > > > 0000
> > > > > > ES:
> > > > > > > > 0000 CR0: 000000008005003b
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CR2:
> > > > > > 00007fe89df67120
> > > > > > > > CR3: 0000000001805000 CR4: 00000000000406e0
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR0:
> > > > > > 0000000000000000
> > > > > > > > DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR3:
> > > > > > 0000000000000000
> > > > > > > > DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Process
> > > > > > kworker/0:1
> > > > > > > > (pid: 0, threadinfo ffff880425e32000, task ffff880425e30580)
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack:
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC:
> > > > Resetting
> > > > > > link
> > > > > > > > <1.1.1:bond0-1.1.2:bond0>, requested by peer
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685487] TIPC: Lost
> > > > link
> > > > > > > > <1.1.1:bond0-1.1.2:bond0> on network plane A
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685491] TIPC: Lost
> > > > contact
> > > > > > > > with <1.1.2>
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214]
> > > > 0000000000000000
> > > > > > > > 000000000cdd3a47 0000000000000000 000000000cdd3a47
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214]
> > > > ffff880425e33fd8
> > > > > > > > 0000001f3edf8970 ffff88083edf8970 ffff88083edf8b00
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214]
> > > > 0000000000000000
> > > > > > > > ffffffff81304773 ffffffff819232b0 ffff880425e33fd8
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call
> > Trace:
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214]
> > > > > > [<ffffffff81304773>]
> > > > > > > > cpuidle_idle_call+0xda/0x169
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214]
> > > > > > [<ffffffff81001200>]
> > > > > > > > cpu_idle+0x51/0x95
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Code: 28
> > e0
> > > > ff ff
> > > > > > 80
> > > > > > > > e2 08 75 22 31 d2 48 83 c0 10 48 89 d1 0f 01 c8 0f ae f0 48 8b
> > 86
> > > > 38
> > > > > > e0 ff
> > > > > > > > ff a8 08 75 08 b1 01 48 89 e8 0f 01 c9 <e8> 3f 6e e2 ff 4c 29
> > e0
> > > > 48 89
> > > > > > c7
> > > > > > > > e8 10 ae e0 ff 48 69 e8 40 42
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call
> > Trace:
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214]
> > > > > > [<ffffffff81304773>]
> > > > > > > > cpuidle_idle_call+0xda/0x169
> > > > > > > > > Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214]
> > > > > > [<ffffffff81001200>]
> > > > > > > > cpu_idle+0x51/0x95
> > > > > > > >
> > > > > > > > Since NMIs were sent to all CPUs, it is not surprising that
> > one was
> > > > > > found
> > > > > > > > to be idle. Or is there something that I am missing in CPU
> > 31's
> > > > > > > > backtrace?
> > > > > > > >
> > > > > > > > > I'm using suse kernel 3.1.10
> > > > > > > >
> > > > > > > > That is going back a ways!
> > > > > > > >
> > > > > > > > There are quite a few patches to RCU CPU stall warnings that
> > could
> > > > > > > > be applied. It might be easier to move to a newer kernel.
> > > > > > > >
> > > > > > > > Alternatively, if you are convinced that these stall warnings
> > are
> > > > > > > > false positives, you could boot with the rcu_cpu_stall_suppress
> > > > > > > > kernel boot parameter set, or rebuild the kernel with a much
> > larger
> > > > > > > > value of the RCU_CPU_STALL_TIMEOUT Kconfig option.
> > > > > > > >
> > > > > > > > Thanx,
> > Paul
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> >
> >