ath11k: WCN6855: possible ring buffer corruption

From: Johan Hovold
Date: Tue Apr 16 2024 - 11:40:51 EST


Hi Kalle and Jeff,

Over the past year I've received occasional reports from users of the
Lenovo ThinkPad X13s (aarch64) that the wifi sometimes stops working.
When this happens the kernel log is filled with errors like:

[ 1164.962227] ath11k_warn: 222 callbacks suppressed
[ 1164.962238] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
[ 1164.962309] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
[ 1164.962994] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1476, expected 1484
[ 1164.963405] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1488
[ 1164.963701] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1480, expected 1484
[ 1164.963852] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1468, expected 1480
[ 1164.964491] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
[ 1164.964733] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
[ 1165.198329] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1488
[ 1165.198470] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1476
[ 1166.266513] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 2699 at byte 348 (1132 bytes left, 64788 expected)
[ 1166.542803] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 4270 at byte 348 (1128 bytes left, 63772 expected)
[ 1166.768238] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 0 at byte 376 (1112 bytes left, 11730 expected)
[ 1166.900152] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 3 at byte 790 (694 bytes left, 16256 expected)
[ 1168.499073] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 1 at byte 62 (1426 bytes left, 3089 expected)
[ 1168.818086] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 63063 at byte 1466 (10 bytes left, 50467 expected)
[ 1169.032885] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 0 at byte 364 (1120 bytes left, 12483 expected)
[ 1169.308546] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 3092 at byte 348 (1128 bytes left, 64780 expected)
[ 1169.563928] ath11k_pci 0006:01:00.0: wmi tlv parse failure of tag 1 at byte 348 (1124 bytes left, 44062 expected)

which, after a quick look at the driver, seems to suggest that we may be
hitting some kind of ring buffer corruption.
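
For anyone comparing their own logs against the above, here is a minimal
sketch that extracts the "got" and "expected" lengths from the HTC Rx
warnings and reports how many bytes each buffer fell short by. It assumes
the message format is exactly as quoted above; the helper name
htc_shortfalls is hypothetical and not part of ath11k.

```python
import re

# Matches the "HTC Rx: insufficient length" warnings as quoted in this report.
HTC_RE = re.compile(r"HTC Rx: insufficient length, got (\d+), expected (\d+)")


def htc_shortfalls(log_text):
    """Return a list of (got, expected, expected - got) per matching line."""
    results = []
    for match in HTC_RE.finditer(log_text):
        got = int(match.group(1))
        expected = int(match.group(2))
        results.append((got, expected, expected - got))
    return results


# Two lines copied from the log excerpt above.
sample = """\
[ 1164.962238] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
[ 1164.962309] ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
"""

for got, expected, short in htc_shortfalls(sample):
    print(f"got={got} expected={expected} short_by={short}")
```

Feeding it the full output of dmesg instead of the sample should give a
quick overview of how the reported lengths are distributed.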

Rebinding the driver reportedly makes things work again in some cases,
but not always.

The issue has been confirmed with the 6.8 kernel and the latest firmware
WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37.

I've triggered this issue twice myself with 6.6 and .23 firmware, but
the reports date back to at least 6.2, likely with even older firmware.

An unconfirmed hypothesis is that we may be hitting this more often when
the GIC ITS is enabled, so that interrupt processing is spread out over
all cores (unlike when using the DWC controller's internal MSI
implementation). The change enabling the GIC ITS is now merged for 6.10.

Do you have any immediate theories about what could be causing this?
Does it look like a firmware or driver issue to you, for example? Is it
something you've seen before?

Note that I've previously reported this here:

https://bugzilla.kernel.org/show_bug.cgi?id=218623

Johan