Re: [bug] 6.5.7 - ixbe freezes and causes RCU deadlock?

From: Ian Kumlien
Date: Mon Oct 16 2023 - 06:57:15 EST


And again, no oops visible this time:

[135476.059611] ixgbe 0000:07:00.0 eno3: Reset adapter
[135483.747803] rcu: INFO: rcu_preempt self-detected stall on CPU
[135483.753749] rcu: 3-....: (20999 ticks this GP) idle=ddf4/1/0x4000000000000000 softirq=997198/997198 fqs=3594
[135483.763852] rcu: (t=21015 jiffies g=4687825 q=371 ncpus=12)
[135483.769694] rcu: rcu_preempt kthread timer wakeup didn't happen for 6637 jiffies! g4687825 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[135483.781436] rcu: Possible timer handling issue on cpu=8 timer-softirq=960866
[135483.788752] rcu: rcu_preempt kthread starved for 6660 jiffies! g4687825 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=8
[135483.799540] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[135483.808849] rcu: RCU grace-period kthread stack dump:
[135483.814249] rcu: Stack dump where RCU GP kthread last ran:
[135546.819253] rcu: INFO: rcu_preempt self-detected stall on CPU
[135546.825177] rcu: 3-....: (83999 ticks this GP) idle=ddf4/1/0x4000000000000000 softirq=997198/997198 fqs=3594
[135546.835276] rcu: (t=84088 jiffies g=4687825 q=802 ncpus=12)
[135546.841114] rcu: rcu_preempt kthread starved for 69713 jiffies! g4687825 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=8
[135546.851835] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[135546.861177] rcu: RCU grace-period kthread stack dump:
[135546.866576] rcu: Stack dump where RCU GP kthread last ran:
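
For comparing the stall reports above, here is a rough sketch (not part of
the original report) that pulls the "(t=... jiffies g=... )" summary lines
out of a log and estimates the tick rate from two reports of the same grace
period. The names STALL_RE, parse_stalls and estimate_hz are made up for
illustration, and the regex assumes the line format shown in this mail.

```python
import re

# Matches the stall summary line, e.g.
# "[135483.763852] rcu: (t=21015 jiffies g=4687825 q=371 ncpus=12)"
# Assumption: the format matches what is pasted in this thread.
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+rcu:\s+\(t=(\d+) jiffies g=(\d+) q=(\d+) ncpus=(\d+)\)"
)


def parse_stalls(log_text):
    """Return (timestamp_seconds, t_jiffies, grace_period) per summary line."""
    return [
        (float(m.group(1)), int(m.group(2)), int(m.group(3)))
        for m in STALL_RE.finditer(log_text)
    ]


def estimate_hz(stalls):
    """Estimate jiffies per second from two consecutive reports of the
    same grace period; return None if no such pair exists."""
    for (ts0, t0, g0), (ts1, t1, g1) in zip(stalls, stalls[1:]):
        if g0 == g1 and ts1 > ts0:
            return (t1 - t0) / (ts1 - ts0)
    return None


# The two summary lines from the log above.
LOG = """\
[135483.763852] rcu: (t=21015 jiffies g=4687825 q=371 ncpus=12)
[135546.835276] rcu: (t=84088 jiffies g=4687825 q=802 ncpus=12)
"""

stalls = parse_stalls(LOG)
hz = estimate_hz(stalls)
print(f"{len(stalls)} stall reports, estimated HZ ~ {hz:.0f}")
```

Both reports carry the same grace-period number (g=4687825), so the second
one is the same stall continuing, about 63 seconds later, not a new one.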


On Sun, Oct 15, 2023 at 2:01 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
>
> On Sun, Oct 15, 2023 at 5:38 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
> >
> > On Sun, 15 Oct 2023 00:11:41 +0200 Ian Kumlien <ian.kumlien@xxxxxxxxx>
> > > So, this keeps happening - it's happened for quite some time now...
> > > I can't really reproduce it but it starts with a network adapter
> > > freezing and ends with RCU errors
> > > and watchdog reboot... :/
> > >
> > > cat bug.txt | ./scripts/decode_stacktrace.sh vmlinux
> > > [185433.169006] ------------[ cut here ]------------
> > > [185433.169018] NETDEV WATCHDOG: eno3 (ixgbe): transmit queue 2 timed out 9736 ms
> > > [185433.169094] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog (net/sched/sch_generic.c:525 (discriminator 3))
> >
> > Watchdog reported eno3 tx hang.
> > ...
> > >
> > > And in the IPMI console:
> > > [185433.169621] ixgbe 0000:07:00.0 eno3: Reset adapter
> > > [185444.166717] rcu: INFO: rcu_preempt self-detected stall on CPU
> > > [185444.172665] rcu: 0-...!: (20999 ticks this GP) idle=8d84/1/0x4000000000000000 softirq=1976223/1976223 fqs=2
> > > [185444.182681] rcu: (t=21015 jiffies g=6787421 q=738 ncpus=12)
> > > [185444.188523] rcu: rcu_preempt kthread timer wakeup didn't happen for 21009 jiffies! g6787421 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> > > [185444.200361] rcu: Possible timer handling issue on cpu=8 timer-softirq=1196063
> >
> > Timer on CPU8 is suspected to cause RCU stall.
> >
> > > [185444.207761] rcu: rcu_preempt kthread starved for 21032 jiffies! g6787421 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=8
> > > [185444.218639] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> > > [185444.227946] rcu: RCU grace-period kthread stack dump:
> > > [185444.233347] rcu: Stack dump where RCU GP kthread last ran:
> > > [185507.243156] rcu: INFO: rcu_preempt self-detected stall on CPU
> > > [185507.249098] rcu: 0-....: (84002 ticks this GP) idle=8d84/1/0x4000000000000000 softirq=1976223/1976223 fqs=1559
> > > [185507.259375] rcu: (t=84094 jiffies g=6787421 q=1213 ncpus=12)
> > > [185570.265595] rcu: INFO: rcu_preempt self-detected stall on CPU
> > > [185570.271532] rcu: 0-....: (147002 ticks this GP) idle=8d84/1/0x4000000000000000 softirq=1976223/1976223 fqs=13844
> > > [185570.282016] rcu: (t=147117 jiffies g=6787421 q=1273 ncpus=12)
> > > [185570.288049] rcu: rcu_preempt kthread timer wakeup didn't happen for 13787 jiffies! g6787421 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> > > [185570.299914] rcu: Possible timer handling issue on cpu=9 timer-softirq=1211534
> >
> > Ditto on CPU9.
> >
> > No answer yet to why rcu stall was reported without any info about the timers
> > on CPU8/9.
>
> Well... I can't really give you any more information; all I can say is
> that it leads to a complete deadlock and an eventual reboot by the
> hardware watchdog...