Re: [BUG] wifi: rtw88: Hard system freeze on RTL8821CE when power_save is enabled (LPS/ASPM conflict)

From: LB F

Date: Fri Apr 03 2026 - 17:49:18 EST

Bitterblue Smith <rtl8821cerfe2@xxxxxxxxx> wrote:
> If we can't find the reason for these weird frames, maybe the best
> way to filter them out is to check RTW_RX_DESC_W0_DRV_INFO_SIZE.

Hi Bitterblue, Ping-Ke,

I have a new crash to report. It shows a different failure mode
caused by the garbage RX data, with some characteristics I haven't
seen before.

=== NEW INCIDENT: 2026-04-03 ===

The system froze approximately 1 second after Wi-Fi association
on a fresh cold boot (not resume from hibernation). Hard power-off
was required.

Timeline:
17:16:16 Cold boot (PM: Image not found — no hibernation image)
17:16:38 wlan0 associated with AP (6c:68:a4:1c:97:5b)
17:16:39 First "pci bus timeout" + mac80211 WARNING
17:16:39-17:17:00 System frozen, hard reset required

Kernel: 6.19.10-1-cachyos (PREEMPT full, Clang/LLVM)
Patches applied: DMI quirk (ASPM+LPS disabled), rate validation v2,
Bitterblue's diagnostic hex dump in query_phy_status.

=== THREE DIFFERENCES FROM PREVIOUS CRASHES ===

1) Zero "unused phy status page" events.

Every previous incident had a burst of these messages before
or during the crash. This time there were none at all. The
corrupted data appears to have gone straight to mac80211 without
triggering query_phy_status — likely because PHYST=0 in the
corrupted descriptors, so the diagnostic hex dump never fired.

2) Cold boot, 1 second after initial association.

All previous crashes occurred after minutes to hours of uptime
or shortly after hibernation resume. This one happened on a
fresh boot before any power-state transition. ASPM and LPS Deep
were already disabled by the DMI quirk.

3) Hang mechanism: infinite "pci bus timeout" loop.

Not the NULL dereference (Bug 221286) and not the ASPM deadlock
(Bug 221195). The loop produced 547 "pci bus timeout" messages
and 41 mac80211 WARNINGs over 21 seconds.

=== HANG MECHANISM (my reading of the code, please correct if wrong) ===

The crash appears to follow this sequence in rtw_pci_rx_napi():

    while (count--) {
        rtw_pci_dma_check(rtwdev, ring, cur_rp);          // [A]
        ...
        rtw_rx_query_rx_desc(rtwdev, rx_desc, ...);       // [B]
        ...
        ieee80211_rx_napi(rtwdev->hw, NULL, new, napi);   // [C]
    }

At [A], rtw_pci_dma_check() detects an RX tag mismatch and prints
the warning, but returns void and the loop continues. At [B], since
PHYST=0, query_phy_status is not called. At [C], the garbage frame
reaches ieee80211_rx_list(), triggering WARNING at rx.c:896.
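
To make my reading of [A] through [C] concrete, here is a small
user-space C model (not driver code). The names model_rx_state,
model_dma_check and model_rx_poll are made up for illustration, and
MODEL_RX_TAG_MAX = 8192 is my assumption about the wrap value. It
only shows that a void check counts a warning while the frame is
still delivered:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed wrap value for the RX tag; hypothetical in this model. */
#define MODEL_RX_TAG_MAX 8192u

/* Hypothetical stand-in for the driver state relevant to the check. */
struct model_rx_state {
    uint16_t rx_tag;     /* tag the host expects next */
    unsigned warnings;   /* number of "pci bus timeout" warnings */
    unsigned delivered;  /* frames handed to the upper layer */
};

/* Models a void-returning check: warn on mismatch, always advance. */
static void model_dma_check(struct model_rx_state *st, uint16_t hw_tag)
{
    if (hw_tag != st->rx_tag)
        st->warnings++;
    st->rx_tag = (uint16_t)((st->rx_tag + 1u) % MODEL_RX_TAG_MAX);
}

/* Models the poll loop: the check result cannot stop delivery. */
static void model_rx_poll(struct model_rx_state *st,
                          const uint16_t *hw_tags, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        model_dma_check(st, hw_tags[i]);  /* [A] */
        st->delivered++;                  /* [C] */
    }
}
```

Feeding this model the tags {0, 1, 0xdead, 0xbeef} gives 2 warnings
and 4 delivered frames, which is the pattern I think I am seeing.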

The RBP values across the 41 WARNING traces form a monotonically
increasing sequence from 0x55 to 0x1FF, which looks like cur_rp
cycling through the ring. Once exhausted, rtw_pci_get_hw_rx_ring_nr()
reads more entries from hardware (which is in a bad state), and the
loop restarts. The NAPI poll never returns.
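
The wraparound I have in mind can be sketched as below. This is a
user-space illustration only: sketch_next_rp is a made-up name, and
the ring length of 512 is my assumption, inferred only from 0x1FF
being the highest RBP value in the traces.

```c
#include <stdint.h>

/* Assumed ring length; inferred from the highest observed index 0x1FF. */
#define SKETCH_RING_LEN 512u

/* Advance the read pointer by one entry, wrapping at the ring length. */
static uint32_t sketch_next_rp(uint32_t cur_rp, uint32_t ring_len)
{
    if (++cur_rp >= ring_len)
        cur_rp = 0;
    return cur_rp;
}
```

Under that assumption, index 0x55 advances to 0x56 and index 0x1FF
wraps back to 0, which would fit the sequence ending at 0x1FF.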

The execution context migrated from irq/58-rtw_pci (PID 635,
170 traces) to ksoftirqd/1 (PID 26, 216 traces) as the softirq
was deferred, but the loop continued in both.

=== FIRST WARNING (full trace) ===

WARNING: net/mac80211/rx.c:896 at ieee80211_rx_list+0x1033/0x1040
[mac80211], CPU#1: irq/58-rtw_pci/635

RAX: 0000000000020100 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000000 RSI: ffff8e56c7bb2f18 RDI: 0000000000000000
RBP: 0000000000000055 R08: 0000000000000004 R09: 0000000000000000

Call Trace:
<IRQ>
ieee80211_rx_napi+0x51/0xe0 [mac80211]
rtw_pci_rx_napi+0x2fd/0x400 [rtw_pci]
rtw_pci_napi_poll+0x79/0x1d0 [rtw_pci]
net_rx_action+0x195/0x290
handle_softirqs+0x12d/0x1c0
do_softirq+0x56/0x70
</IRQ>
<TASK>
__local_bh_enable_ip.cold+0xc/0x11
rtw_pci_interrupt_threadfn+0x270/0x360 [rtw_pci]
irq_thread_fn+0x24/0x50
irq_thread+0xbc/0x160
kthread+0x205/0x280
</TASK>

=== NAIVE HARDENING IDEA (please ignore if this is wrong) ===

I am not a kernel developer and I may be misreading the code, but
I wondered if making rtw_pci_dma_check() return a value and
skipping the frame on tag mismatch might prevent the infinite loop,
independently of the DRV_INFO_SIZE filter. Something along these
lines:

--- a/drivers/net/wireless/realtek/rtw88/pci.c
+++ b/drivers/net/wireless/realtek/rtw88/pci.c
-static void rtw_pci_dma_check(struct rtw_dev *rtwdev,
+static bool rtw_pci_dma_check(struct rtw_dev *rtwdev,
                               struct rtw_pci_rx_ring *rx_ring,
                               u32 idx)
 {
-        if (total_pkt_size != rtwpci->rx_tag)
+        if (total_pkt_size != rtwpci->rx_tag) {
                 rtw_warn(rtwdev, "pci bus timeout, check dma status\n");
+                return false;
+        }
         rtwpci->rx_tag = (rtwpci->rx_tag + 1) % RX_TAG_MAX;
+        return true;
 }

         while (count--) {
-                rtw_pci_dma_check(rtwdev, ring, cur_rp);
+                if (!rtw_pci_dma_check(rtwdev, ring, cur_rp))
+                        goto next_rp;

I am sure there are considerations I am missing. Please treat this
only as a description of what I observed, not as a proposed patch.
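
For what it is worth, the same idea as a user-space C model (again
not driver code). All names here (sketch_rx_state, sketch_dma_check,
sketch_rx_poll) are hypothetical, SKETCH_RX_TAG_MAX = 8192 is an
assumed wrap value, and "continue" stands in for "goto next_rp":

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed wrap value for the RX tag; hypothetical in this model. */
#define SKETCH_RX_TAG_MAX 8192u

/* Hypothetical stand-in for the driver state relevant to the check. */
struct sketch_rx_state {
    uint16_t rx_tag;     /* tag the host expects next */
    unsigned skipped;    /* frames dropped on tag mismatch */
    unsigned delivered;  /* frames handed to the upper layer */
};

/* Returns false on mismatch; the expected tag advances only on match. */
static bool sketch_dma_check(struct sketch_rx_state *st, uint16_t hw_tag)
{
    if (hw_tag != st->rx_tag)
        return false;
    st->rx_tag = (uint16_t)((st->rx_tag + 1u) % SKETCH_RX_TAG_MAX);
    return true;
}

/* Models the poll loop with the skip-on-mismatch behaviour. */
static void sketch_rx_poll(struct sketch_rx_state *st,
                           const uint16_t *hw_tags, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!sketch_dma_check(st, hw_tags[i])) {
            st->skipped++;
            continue;  /* stands in for "goto next_rp" */
        }
        st->delivered++;
    }
}
```

With the tags {0, 1, 0xdead, 3} this model delivers 2 frames and
skips 2. The last frame is skipped because the expected tag stays
at 2 after the mismatch, so one bad descriptor can cause later ones
to be dropped as well. I can't tell whether that matches how the
hardware tag actually behaves.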

=== SUMMARY ===

The garbage RX data from this chip now appears to cause at least
three distinct failure modes:

1) Bug 221195: ASPM/LPS deadlock (fixed by DMI quirk)
2) Bug 221286: NULL dereference via C2H_ADAPTIVITY misinterpretation
3) This incident: infinite loop triggered by DMA tag mismatch

I wanted to report this new failure mode in case it is useful for
your work on the DRV_INFO_SIZE filter. I can provide the full dmesg
from this crash (7828 lines) if it would be helpful — just let me
know.

Best regards,
Oleksandr Havrylov