Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient length

From: Miaoqing Pan
Date: Tue Mar 18 2025 - 03:55:00 EST




On 3/17/2025 9:04 PM, Johan Hovold wrote:
On Mon, Mar 17, 2025 at 01:52:15PM +0800, Miaoqing Pan wrote:
On 3/14/2025 4:06 PM, Johan Hovold wrote:
On Fri, Mar 14, 2025 at 09:01:36AM +0800, Miaoqing Pan wrote:
I think the hardware has already ensured synchronization between
descriptor and head pointer, which isn't difficult to achieve. The issue
is likely caused by something else and requires further debugging.

Yeah, but you still need memory barriers on the kernel side.

It could be that we are looking at two different causes for those
zero-length descriptors.

The error handling for that obviously needs to be fixed either way, but
I haven't heard anyone hitting the corruption with the memory barriers
in place on the X13s yet (even if we'd need some more time to test
this).

After multiple and prolonged verifications, adding dma_rmb() did not
improve the issue at all. I think this Status Descriptor is updated by
hardware (Copy Engine) controlled by another system, not involving DMA
or out-of-order CPU access within the same system, so memory barriers do
not take effect.

Then it seems we are looking at two separate root causes for the
corruption as the memory barrier appears to be all that is needed to fix
the X13s issue.

A user who hit the corruption after 2 h without the fix has been running
over the weekend with the memory barrier without any problems. I'll ask
further users to test, but it certainly looks like it is working as
intended.

And the memory barrier is de-facto missing as the head pointer and
descriptor are accessed through (two separate) coherent mappings so
there are no ordering guarantees without explicit barriers.

This situation should occur when there is only one descriptor in the ring. If, as you mentioned, the CPU tries to load the descriptor first, but the descriptor fetch fails before the HP load because the ring returns empty, it won't trigger the current issue.

The Copy Engine hardware module copies the metadata to the Status Descriptor after the DMA is complete, then updates the HP to trigger an interrupt. I think there might be some issues in this process, such as the lack of a wmb instruction after the copy is complete, causing the HP to be updated first.


Now obviously there are further issues in your system, which we should
make sure we understand before adding workarounds to the driver.

Do you have a pointer to the downstream kernel sources you are testing
with? Or even better, can you reproduce the issue with mainline after
adding the PCIe patches that were posted to the lists for these
platforms?

https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb

Apparently the descriptors can also be passed in non-coherent memory
for some devices (.alloc_cacheable_memory / HAL_SRNG_FLAGS_CACHED). That
implementation looks suspicious and could possibly result in similar
problems. Are you using .alloc_cacheable_memory in your setup?

No.

Does it make any difference if you use a full rmb() barrier?

I've also tried rmb() and mb(), but they didn't work either.

And after modifying ath11k_hal_ce_dst_status_get_length() so that it
does not clear the length, how many times you need to retry? Does it> always work on the second try?


Yes, the test has been running continuously for over 48 hours, always work on the second try, updated in patch v4.