On Wed, Mar 19, 2025 at 02:47:12PM +0800, Miaoqing Pan wrote:
On 3/19/2025 1:42 AM, Johan Hovold wrote:
It could if the CPU observes the updates out of order due to the missing
barrier. The driver could be processing an earlier interrupt when the
new descriptor is added and head pointer updated. If for example the CPU
speculatively fetches the descriptor before the head pointer is updated,
then the descriptor length may be zero when the CPU sees the updated
head pointer.
Sorry, I still think this situation won't happen. Please see the
following code.
ath11k_hal_srng_access_begin(ab, srng);
=> srng->u.dst_ring.cached_hp = *srng->u.dst_ring.hp_addr;
desc = ath11k_hal_srng_dst_get_next_entry(ab, srng);
=> if (srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp) return NULL;
//dma_rmb();
*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
If the condition 'srng->u.dst_ring.tp == srng->u.dst_ring.cached_hp' is
true, the descriptor retrieval fails.
The CPU can still speculate that this condition will be false and load
the descriptor.
If the speculation later turns out to be correct, then the descriptor
may have stale values from before the head pointer was updated.
This seems to be what is happening on the X13s since adding the memory
barrier makes the zero-length descriptors go away.
Hmm, it is indeed a bit strange. Could it be that dma_rmb() introduces
some delay ?
It's only expected since you must use memory barriers on weakly ordered
architectures like aarch64 to guarantee the ordering.
The Copy Engine hardware module copies the metadata to the Status
Descriptor after the DMA is complete, then updates the HP to trigger an
interrupt. I think there might be some issues in this process, such as
the lack of a wmb instruction after the copy is complete, causing the HP
to be updated first.
Yeah, possibly. At least it seems there are more issues than the missing
barrier on the machines you test.
Now obviously there are further issues in your system, which we shouldhttps://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base_6.6.bb
make sure we understand before adding workarounds to the driver.
Do you have a pointer to the downstream kernel sources you are testing
with? Or even better, can you reproduce the issue with mainline after
adding the PCIe patches that were posted to the lists for these
platforms?
Thanks for the pointer. That's a lot of out-of-tree patches on top of
stable so not that easy to check the state of the resulting tree.
Yes, but there are only a few patches for ath11k.
Sure, but there are other components that come into play here such as
the PCIe controller driver.
A colleague of yours recently submitted an updated patch that overrides
the no_snoop bit for qcs8300:
https://lore.kernel.org/lkml/20250318053836.tievnd5ohzl7bmox@thinkpad/
but that flag appears not to be set in your downstream tree:
https://github.com/qualcomm-linux/meta-qcom-hwe/blob/scarthgap/recipes-kernel/linux/linux-qcom-base-6.6/drivers/qcs8300/0004-PCI-qcom-Add-QCS8300-PCIe-support.patch
Something like that may prevent a cached descriptor from being
invalidated when the controller updates it.
Similarly, the PCIe controllers are marked as dma-coherent in your
devicetrees. A misconfiguration there could also cause problems.
Agreed, I previously mistakenly thought that the status descriptor was not updated by DMA.
I suggest we merge my fix that adds the missing memory barrier, and
which users have now been testing for a week without hitting the
corruption (which they used to see several times a day).
Then we can continue to track down why you are having coherency issues
on qcs615 and qcs8300. You really want to make sure that that is fixed
properly as it may lead to subtle bugs elsewhere too.