Re:Re: [PATCH v2] net: stmmac: fix fatal bus error on resume by reinitializing RX buffers

From: Ding Hui

Date: Fri May 29 2026 - 03:46:59 EST

At 2026-05-28 22:57:38, "Jakub Raczynski" <j.raczynski@xxxxxxxxxxx> wrote:
>On Tue, May 26, 2026 at 10:26:17AM +0800, Ding Hui wrote:
>> From: Ding Hui <dinghui@xxxxxxxxxxx>
>> + } else {
>> + /* Theoretically unreachable: napi_disable() in
>> + * stmmac_suspend() ensures all initialized slots
>> + * have a valid page before we get here.
>> + * Defensive check only.
>> + */
>> + if (!buf->page)
>> + continue;
>> +
>> + stmmac_set_desc_addr(priv, p, buf->addr);
>> + stmmac_set_desc_sec_addr(priv, p, buf->sec_addr,
>> + priv->sph_active &&
>> + buf->sec_page);
>
>Is this generally sufficient? Or, in fact, isn't that overkill?
>stmmac_rx_refill() generally does a bit more preparation of descriptors.

You are right that stmmac_rx_refill() does more work — it allocates new
pages and maps them. The key difference here is that in v2 we intentionally
keep all RX buffers alive across suspend/resume, so no allocation is needed.
The only thing that needs to be restored is the buffer address fields in the
descriptors, which were overwritten by hardware write-back.
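To illustrate the idea, here is a minimal user-space sketch, not the driver
code. All names (toy_desc, toy_buf, toy_hw_writeback, toy_reinit_rx_ring) and
the two-word descriptor layout are assumptions made up for this model. It only
shows that write-back destroys the address field and that a full-ring walk can
restore it from the addresses the driver still holds. Setting OWN again at the
end is also an assumption of the model; the quoted hunk does not show that part.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8
#define TOY_OWN   (1u << 31)	/* hypothetical OWN bit position */

/* Toy stand-ins, NOT the real stmmac structures. */
struct toy_desc {
	uint32_t des0;	/* buffer address before RX, status after write-back */
	uint32_t des3;	/* bit 31 models the OWN bit */
};

struct toy_buf {
	uint32_t addr;	/* DMA address the driver kept; 0 means "no page" */
};

/* Model of hardware write-back: the address is lost and OWN is cleared. */
static void toy_hw_writeback(struct toy_desc *d)
{
	d->des0 = 0;
	d->des3 &= ~TOY_OWN;
}

/*
 * Walk the whole ring and reprogram every slot from the saved buffer
 * address. No allocation happens because the buffers stayed alive.
 * Returns the number of slots that were reprogrammed.
 */
static size_t toy_reinit_rx_ring(struct toy_desc *ring,
				 const struct toy_buf *bufs, size_t n)
{
	size_t restored = 0;

	for (size_t i = 0; i < n; i++) {
		if (!bufs[i].addr)	/* defensive check, as in the patch */
			continue;

		ring[i].des0 = bufs[i].addr;
		ring[i].des3 |= TOY_OWN;	/* hand the slot back to "hardware" */
		restored++;
	}

	return restored;
}
```

In this model a slot that went through toy_hw_writeback() has des0 == 0 until
toy_reinit_rx_ring() runs, which is the stale state the RX DMA would otherwise
dereference after resume.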

>The issue seems to be that during suspend there is mismatch,
>caused by writeback format, between rx_dirty and rx_cur pointers and
>there is bad handling of this case, since there is no verification
>of leftover stuff and there will be leftover bad address crashing platform.
>So stmmac needs to refill/reinit descriptors that were consumed but not
>refilled. So isn't going through whole dma_rx_size overkill?
>Wouldn't it be better to iterate over buffer from cur_rx as long as descriptors
>are 0 and only apply refill to those corrupted?

Actually, the hardware may consume additional descriptors in the window
between stmmac_disable_all_queues() and stmmac_stop_all_dma(), so cur_rx can lag
behind the hardware's actual position. The descriptors between the rx_dirty and
rx_cur pointers are therefore not necessarily the only ones that need refilling.
You are right that, ideally, only the consumed descriptors should be refilled. But
checking the OWN bit requires a new lightweight get_rx_owner() helper across all
descriptor variants (dwmac4, dwxgmac2, norm_desc, enh_desc), which adds complexity
for marginal gain.
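For comparison, the selective alternative could look roughly like the
user-space sketch below. Everything in it is hypothetical: the sel_* names, the
descriptor layout, the OWN bit position, and the ops table are invented for
illustration. A real implementation would need one get_rx_owner() per
descriptor variant, which is the extra complexity mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEL_OWN (1u << 31)	/* hypothetical OWN bit position */

/* Toy descriptor, NOT a real dwmac4/dwxgmac2/norm/enh layout. */
struct sel_desc {
	uint32_t des0;	/* buffer address, clobbered by write-back */
	uint32_t des3;	/* bit 31 models the OWN bit */
};

/* Hypothetical callback table; each variant would supply its own helper. */
struct sel_desc_ops {
	bool (*get_rx_owner)(const struct sel_desc *d);
};

static bool sel_get_rx_owner(const struct sel_desc *d)
{
	return (d->des3 & SEL_OWN) != 0;
}

static const struct sel_desc_ops sel_ops = {
	.get_rx_owner = sel_get_rx_owner,
};

/*
 * Reprogram only the slots the hardware has released (OWN clear).
 * Slots still owned by hardware keep their address and are skipped.
 * Returns the number of slots that were reprogrammed.
 */
static size_t sel_refill_consumed(const struct sel_desc_ops *ops,
				  struct sel_desc *ring,
				  const uint32_t *addrs, size_t n)
{
	size_t fixed = 0;

	for (size_t i = 0; i < n; i++) {
		if (ops->get_rx_owner(&ring[i]))
			continue;

		ring[i].des0 = addrs[i];
		ring[i].des3 |= SEL_OWN;
		fixed++;
	}

	return fixed;
}
```

Note that the loop still visits every slot to read the OWN bit, so the saving
over the unconditional walk is limited to the skipped descriptor writes.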

>Could you paste panic that occurs during this issue?
>You mention "fatal bus error" which I would assume is system panic?

Apologies for the misleading wording — this does not cause a kernel panic.
The issue manifests as a Fatal Bus Error interrupt on the DMA controller.
Taking XGMAC as an example, dwxgmac2_dma_interrupt() detects XGMAC_FBE,
increments fatal_bus_error_irq, and returns tx_hard_error, which triggers
stmmac_tx_err() to stop and reset the TX DMA channel. But this has no effect
on the RX DMA engine (maybe we should reset RX DMA here). The practical effect
is that the RX DMA engine halts after dereferencing the invalid buffer address,
and the network interface becomes non-functional after resume — no packets can be
received until the driver is reloaded or the device is re-probed.

To reproduce the issue on my platform:

1. Connect the DUT and a PC, configure IP addresses so they can ping
   each other (e.g. DUT: 192.168.1.1, PC: 192.168.1.100).

2. On the PC, start an iperf3 server:
     iperf3 -s

3. On the DUT, start a high-rate reverse UDP stream to keep the RX DMA
   busy during suspend:
     iperf3 -c 192.168.1.100 -u -b 900M -R -t 0

4. While iperf3 is running, trigger a suspend/resume cycle on the DUT.

5. After resume, check the fatal_bus_error_irq counter:
     ethtool -S <iface> | grep fatal_bus_error_irq

Without this fix the counter increments and the interface stops
receiving packets. With this fix the counter stays at zero and
normal operation resumes.

I will update the commit message to clarify "fatal bus error causing RX
DMA to stop".

Thanks for the review.

Ding Hui