On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:Good point!
On 3/11/2025 11:20 PM, Jeff Johnson wrote:
On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
On 3/10/2025 6:09 PM, Johan Hovold wrote:
I'm still waiting for feedback from one user that can reproduce the
ring-buffer corruption very easily, but another user mentioned seeing
multiple zero-length descriptor warnings over the weekend when running
with this patch:
ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
Are there ever any valid reasons for seeing a zero-length descriptor
(i.e. unrelated to the race at hand)? IIUC the warning would only be
printed when processing such descriptors a second time (i.e. when
is_desc_len0 is set).
That's exactly the logic, only can see the warning in a second time. For
the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
That didn't answer Johan's first question:
Are there ever any valid reasons for seeing a zero-length descriptor?
The events currently observed are all firmware logs. The discarded
packets will not affect normal operation. I will adjust the logging to
debug level.
That still does not answer the question whether there are ever any valid
reasons for seeing zero-length descriptors. I assume there are none?
We have an issue that there is a race condition where hardware updates theThanks for the addition.
pointer before it has filled all the data. The current solution is just to
read the data a second time. But if that second read also occurs before
hardware has updated the data, then the issue isn't fixed.
So should there be some forced delay before we read a second time?
Or should we attempt to read more times?
The initial fix was to keep waiting for the data to be ready. The
observed phenomenon is that when the second read fails, subsequent reads
will continue to fail until the firmware's CE2 ring is full and triggers
an assert after timeout. However, this situation is relatively rare, and
in most cases, the second read will succeed. Therefore, adding a delay
or multiple read attempts is not useful.
The proposed fix is broken since ath11k_hal_ce_dst_status_get_length()
not just reads the length but also sets it to zero so that the updated
length may indeed never be seen.
I've taken a closer look at the driver and it seems like we're missing aSure, the stress test is running.
read barrier to make sure that the updated descriptor is not read until
after the head pointer.
Miaoqing, could you try the below patch with your reproducer and see if
it is enough to fix the corruption?
If so I can resend with the warning removed and include a correspondingIf the DMA read barrier works, do you think my submitted patch series is still needed? Because the error handling is incorrect.
fix for ath12k (it looks like there are further places where barriers
are missing too).
Johan
From 656dbd0894741445aeb16ee8357e6fef51b6084c Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro@xxxxxxxxxx>
Date: Wed, 12 Mar 2025 16:49:20 +0100
Subject: [PATCH] wifi: ath11k: fix ring-buffer corruption
Users of the Lenovo ThinkPad X13s have reported that Wi-Fi sometimes
breaks and the log fills up with errors like:
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
which based on a quick look at the driver seemed to indicate some kind
of ring-buffer corruption.
Miaoqing Pan tracked it down to the host seeing the updated destination
ring head pointer before the updated descriptor, and the error handling
for that in turn leaves the ring buffer in an inconsistent state.
Add the missing the read barrier to make sure that the descriptor is
read after the head pointer to address the root cause of the corruption.
The error handling can be fixed separately in case there can ever be
actual zero-length descriptors.
FIXME: remove WARN_ON_ONCE() added for verification purposes
Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
Link: https://lore.kernel.org/20250310010217.3845141-3-quic_miaoqing@xxxxxxxxxxx
Cc: Miaoqing Pan <quic_miaoqing@xxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx # 5.6
Signed-off-by: Johan Hovold <johan+linaro@xxxxxxxxxx>
---
drivers/net/wireless/ath/ath11k/ce.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
index e66e86bdec20..423b970e288c 100644
--- a/drivers/net/wireless/ath/ath11k/ce.c
+++ b/drivers/net/wireless/ath/ath11k/ce.c
@@ -393,8 +393,12 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
goto err;
}
+ /* Make sure descriptor is read after the head pointer. */
+ dma_rmb();
+
*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
if (*nbytes == 0) {
+ WARN_ON_ONCE(1); // FIXME: remove
ret = -EIO;
goto err;
}