Re: [bug] 6.5.7 - ixbe freezes and causes RCU deadlock?

From: Ian Kumlien
Date: Sun Oct 15 2023 - 08:02:11 EST


On Sun, Oct 15, 2023 at 5:38 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> On Sun, 15 Oct 2023 00:11:41 +0200 Ian Kumlien <ian.kumlien@xxxxxxxxx>
> > So, this keeps happening - it's happened for quite some time now...
> > I can't really reproduce it but it starts with a network adapter
> > freezing and ends with RCU errors
> > and watchdog reboot... :/
> >
> > cat bug.txt | ./scripts/decode_stacktrace.sh vmlinux
> > [185433.169006] ------------[ cut here ]------------
> > [185433.169018] NETDEV WATCHDOG: eno3 (ixgbe): transmit queue 2 timed out 9736 ms
> > [185433.169094] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525
> > dev_watchdog (net/sched/sch_generic.c:525 (discriminator 3))
>
> Watchdog reported eno3 tx hang.
> ...
> >
> > And in the IPMI console:
> > [185433.169621] ixgbe 0000:07:00.0 eno3: Reset adapter
> > [185444.166717] rcu: INFO: rcu_preempt self-detected stall on CPU
> > [185444.172665] rcu: 0-...!: (20999 ticks this GP)
> > idle=8d84/1/0x4000000000000000 softirq=1976223/1976223 fqs=2
> > [185444.182681] rcu: (t=21015 jiffies g=6787421 q=738 ncpus=12)
> > [185444.188523] rcu: rcu_preempt kthread timer wakeup didn't happen
> > for 21009 jiffies! g6787421 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> > [185444.200361] rcu: Possible timer handling issue on cpu=8 timer-softirq=1196063
>
> Timer on CPU8 is suspected to cause RCU stall.
>
> > [185444.207761] rcu: rcu_preempt kthread starved for 21032 jiffies!
> > g6787421 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=8
> > [185444.218639] rcu: Unless rcu_preempt kthread gets sufficient CPU
> > time, OOM is now expected behavior.
> > [185444.227946] rcu: RCU grace-period kthread stack dump:
> > [185444.233347] rcu: Stack dump where RCU GP kthread last ran:
> > [185507.243156] rcu: INFO: rcu_preempt self-detected stall on CPU
> > [185507.249098] rcu: 0-....: (84002 ticks this GP)
> > idle=8d84/1/0x4000000000000000 softirq=1976223/1976223 fqs=1559
> > [185507.259375] rcu: (t=84094 jiffies g=6787421 q=1213 ncpus=12)
> > [185570.265595] rcu: INFO: rcu_preempt self-detected stall on CPU
> > [185570.271532] rcu: 0-....: (147002 ticks this GP)
> > idle=8d84/1/0x4000000000000000 softirq=1976223/1976223 fqs=13844
> > [185570.282016] rcu: (t=147117 jiffies g=6787421 q=1273 ncpus=12)
> > [185570.288049] rcu: rcu_preempt kthread timer wakeup didn't happen
> > for 13787 jiffies! g6787421 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> > [185570.299914] rcu: Possible timer handling issue on cpu=9 timer-softirq=1211534
>
> Ditto on CPU9.
>
> No answer yet to why rcu stall was reported without any info about the timers
> on CPU8/9.

Well... I can't really give you any more information. All I can say is
that it leads to a complete deadlock and an eventual reboot by the
hardware watchdog...
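
A minimal sketch, assuming the IPMI console output has been saved as
plain text, of how the CPUs named in the "Possible timer handling issue"
lines could be collected, so that the timers on those CPUs can be looked
at more closely. The names flagged_timer_cpus and TIMER_ISSUE are made up
for illustration, and the pattern only covers the message format seen in
the log quoted above:

```python
import re

# Matches lines such as:
# [185444.200361] rcu: Possible timer handling issue on cpu=8 timer-softirq=1196063
# search() is used, so mail quote prefixes like "> > " are ignored.
TIMER_ISSUE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+rcu:\s+Possible timer handling issue on "
    r"cpu=(?P<cpu>\d+)\s+timer-softirq=(?P<count>\d+)"
)


def flagged_timer_cpus(log_text):
    """Return (timestamp, cpu, timer_softirq_count) tuples in log order."""
    hits = []
    for line in log_text.splitlines():
        m = TIMER_ISSUE.search(line)
        if m:
            hits.append((float(m["ts"]), int(m["cpu"]), int(m["count"])))
    return hits


if __name__ == "__main__":
    # The two relevant lines from the report above.
    sample = (
        "[185444.200361] rcu: Possible timer handling issue on cpu=8 "
        "timer-softirq=1196063\n"
        "[185570.299914] rcu: Possible timer handling issue on cpu=9 "
        "timer-softirq=1211534\n"
    )
    for ts, cpu, count in flagged_timer_cpus(sample):
        print(f"{ts:.6f} cpu={cpu} timer-softirq={count}")
```

This only extracts what the RCU stall warning already printed. It does
not add any information about the state of the timers themselves.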