[RFC PATCH] net: stmmac: Fix the problem about interrupt storm

From: Cathy Cai
Date: Wed Mar 27 2024 - 07:02:25 EST


After I do seven days of MSR test (monkey sleep reboot test) in Android,
I can encounter below netdev watchdog timeout issue. Tx queue timed out
then reset adapter. There is a probability that an interruption storm will
occur and the system will crash.

When we do MSR test, there is a NETDEV WATCHDOG WARNING:
[ 117.885804] ------------[ cut here ]------------
[ 117.885818] NETDEV WATCHDOG: eth0 (stmmaceth): transmit queue 0 timed
out
[ 117.885873] WARNING: CPU: 1 PID: 4169 at net/sched/sch_generic.c:473
dev_watchdog+0x2fc/0x41c
[ 117.886070] sprd_systimer sprd_sip_svc sprd_wdt_fiq sprd_wdt_pon
[ 117.886082] CPU: 1 PID: 4169 Comm: RenderThread Tainted: G S C O
5.4.147-ab41313 #1
[ 117.886085] Hardware name: Spreadtrum UIS6780 SoC (DT)
[ 117.886090] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 117.886094] pc : dev_watchdog+0x2fc/0x41c
[ 117.886098] lr : dev_watchdog+0x2fc/0x41c
[ 117.886100] sp : ffffffc01000bcf0
[ 117.886103] x29: ffffffc01000bcf0 x28: ffffffc011eafe28
[ 117.886107] x27: ffffff80f97a5c40 x26: 00000000ffffffff
[ 117.886111] x25: 0000000000000001 x24: 0000000000000008
[ 117.886114] x23: ffffffc011ea6000 x22: ffffffc011e73020
[ 117.886118] x21: 0000000000000000 x20: ffffff80f434841c
[ 117.886122] x19: ffffff80f4348000 x18: ffffffc01000d048
[ 117.886127] x17: ffffffc012050044 x16: 00000000000508d0
[ 117.886130] x15: 0000000000000006 x14: 0000000000000058
[ 117.886134] x13: 0000000000000008 x12: 0000000042d7d11b
[ 117.886138] x11: 0000000000000015 x10: 0000000000000001
[ 117.886141] x9 : a6fe08b7d867fd00 x8 : a6fe08b7d867fd00
[ 117.886145] x7 : 0000000000000000 x6 : ffffffc0120a0899
[ 117.886149] x5 : 0000000000000058 x4 : 0000000000000002
[ 117.886152] x3 : ffffffc01000b980 x2 : 0000000000000007
[ 117.886156] x1 : 0000000000000006 x0 : 000000000000003d
[ 117.886164]

[ 117.887028]
[ 117.887030] Call trace:
[ 117.887035] dev_watchdog+0x2fc/0x41c
[ 117.887043] call_timer_fn+0x5c/0x274
[ 117.887046] expire_timers+0x74/0x1b4
[ 117.887050] __run_timers+0x250/0x2b0
[ 117.887054] run_timer_softirq+0x28/0x4c
[ 117.887061] __do_softirq+0x128/0x4dc
[ 117.887067] irq_exit+0xf8/0xfc
[ 117.887072] __handle_domain_irq+0xb0/0x108
[ 117.887076] gic_handle_irq+0x6c/0x124
[ 117.887081] el0_irq_naked+0x64/0x74
[ 117.887084] ---[ end trace 1308772835db89f6 ]---
[ 117.887188] stmmaceth 32600000.ethernet eth0: Reset adapter.

Tx queue time out then reset adapter. When reset the adapter, stmmac driver
sets the state to STMMAC_DOWN and calls dev_close() function. If an interrupt
is triggered at this instant after setting state to STMMAC_DOWN, before the
dev_close() call.

The scene is as follows:
stmmac_reset_subtask()
set_bit(STMMAC_DOWN, &priv->state);
--->interrupt
stmmac_interrupt()
return IRQ_HANDLED
dev_close(priv->dev);

The interrupt handler stmmac_interrupt is executed, judging that the state is
STMMAC_DOWN and returning IRQ_HANDLED. Then the processing will not continue,
and it will not be able to clear the interrupt status.

Therefore, to avoid this, set STMMAC_DOWN after dev_close().

Signed-off-by: Cathy Cai <cathy.cai@xxxxxxxxxx>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 24cd80490d19..61690b68b6ad 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -7167,8 +7167,8 @@ static void stmmac_reset_subtask(struct stmmac_priv *priv)
while (test_and_set_bit(STMMAC_RESETING, &priv->state))
usleep_range(1000, 2000);

- set_bit(STMMAC_DOWN, &priv->state);
dev_close(priv->dev);
+ set_bit(STMMAC_DOWN, &priv->state);
dev_open(priv->dev, NULL);
clear_bit(STMMAC_DOWN, &priv->state);
clear_bit(STMMAC_RESETING, &priv->state);
--
2.34.1