On Tue, Mar 18, 2025 at 03:53:39PM +0800, Miaoqing Pan wrote:
On 3/17/2025 9:04 PM, Johan Hovold wrote:
Then it seems we are looking at two separate root causes for the
corruption as the memory barrier appears to be all that is needed to fix
the X13s issue.
A user who hit the corruption after 2 h without the fix has been running
over the weekend with the memory barrier without any problems. I'll ask
further users to test, but it certainly looks like it is working as
intended.
And the memory barrier is de-facto missing as the head pointer and
descriptor are accessed through (two separate) coherent mappings so
there are no ordering guarantees without explicit barriers.
This situation should occur when there is only one descriptor in the
ring. If, as you mentioned, the CPU tries to load the descriptor first,
but the descriptor fetch fails before the HP load because the ring
returns empty, it won't trigger the current issue.
It could if the CPU observes the updates out of order due to the missing
barrier. The driver could be processing an earlier interrupt when the
new descriptor is added and head pointer updated. If for example the CPU
speculatively fetches the descriptor before the head pointer is updated,
then the descriptor length may be zero when the CPU sees the updated
head pointer.
This seems to be what is happening on the X13s since adding the memory
barrier makes the zero-length descriptors go away.
The Copy Engine hardware module copies the metadata to the Status
Descriptor after the DMA is complete, then updates the HP to trigger an
interrupt. I think there might be some issues in this process, such as
the lack of a wmb instruction after the copy is complete, causing the HP
to be updated first.
Yeah, possibly. At least it seems there are more issues than the missing
barrier on the machines you test.
Now obviously there are further issues in your system, which we shouldhttps://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb
make sure we understand before adding workarounds to the driver.
Do you have a pointer to the downstream kernel sources you are testing
with? Or even better, can you reproduce the issue with mainline after
adding the PCIe patches that were posted to the lists for these
platforms?
Thanks for the pointer. That's a lot of out-of-tree patches on top of
stable so not that easy to check the state of the resulting tree.
Does it make any difference if you use a full rmb() barrier?I've also tried rmb() and mb(), but they didn't work either.
Thanks for checking.
Just to be sure, you did add the barrier in the same place as my patch
(i.e. just before the descriptor read)?
And after modifying ath11k_hal_ce_dst_status_get_length() so that it
does not clear the length, how many times you need to retry? Does it
always work on the second try?
Yes, the test has been running continuously for over 48 hours, always
work on the second try, updated in patch v4.
Good, at least the descriptor-length-sometimes-never-updated issue is
solved.
Johan