Re: [PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient length

From: Johan Hovold
Date: Mon Mar 10 2025 - 06:10:06 EST


On Mon, Mar 10, 2025 at 09:02:17AM +0800, Miaoqing Pan wrote:
> A relatively unusual race condition occurs between host software
> and hardware, where the host sees the updated destination ring head
> pointer before the hardware updates the corresponding descriptor.
> When this situation occurs, the length of the descriptor returns 0.

I still think this description is too vague and it doesn't explain how
this race is even possible. It sounds like there's a bug somewhere in
the driver or firmware, but if this really is an indication the hardware
is broken as your reply here seems to suggest:

https://lore.kernel.org/lkml/bc187777-588c-4fa0-ba8c-847e91c78d43@xxxxxxxxxxx/

then that too should be highlighted in the commit message (e.g. by
describing this as "working around broken hardware").

> The current error handling method is to increment descriptor tail
> pointer by 1, but 'sw_index' is not updated, causing descriptor and
> skb to not correspond one-to-one, resulting in the following error:
>
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1488, expected 1492
> ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
>
> To address this problem, temporarily skip processing the current
> descriptor and handle it again next time. However, to prevent this
> descriptor from continuously returning 0, use skb cb to set a flag.
> If the length returns 0 again, this descriptor will be discarded.

The ath12k ring-buffer handling looks very similar. Do you need a
corresponding workaround in ath12k_ce_completed_recv_next()? Or are you
sure that this (hardware) bug only affects ath11k devices?

> *nbytes = ath11k_hal_ce_dst_status_get_length(desc);
> - if (*nbytes == 0) {
> - ret = -EIO;
> - goto err;
> + if (unlikely(*nbytes == 0)) {
> + struct ath11k_skb_rxcb *rxcb =
> + ATH11K_SKB_RXCB(pipe->dest_ring->skb[sw_index]);
> +
> + /* A relatively unusual race condition occurs between host
> + * software and hardware, where the host sees the updated
> + * destination ring head pointer before the hardware updates
> + * the corresponding descriptor.
> + *
> + * Temporarily skip processing the current descriptor and handle
> + * it again next time. However, to prevent this descriptor from
> + * continuously returning 0, set 'is_desc_len0' flag. If the
> + * length returns 0 again, this descriptor will be discarded.
> + */
> + if (!rxcb->is_desc_len0) {
> + rxcb->is_desc_len0 = true;
> + ret = -EIO;
> + goto err;
> + }
> }

I'm still waiting for feedback from one user that can reproduce the
ring-buffer corruption very easily, but another user mentioned seeing
multiple zero-length descriptor warnings over the weekend when running
with this patch:

ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)

Are there ever any valid reasons for seeing a zero-length descriptor
(i.e. unrelated to the race at hand)? IIUC the warning would only be
printed when processing such descriptors a second time (i.e. when
is_desc_len0 is set).

Johan