Re: [Regression] rt2800usb - Wifi performance issues and connection drops

From: Linux regression tracking (Thorsten Leemhuis)
Date: Wed Mar 08 2023 - 07:23:16 EST


On 08.03.23 12:57, Felix Fietkau wrote:
> On 08.03.23 12:41, Alexander Wetzel wrote:
>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>> I'm also planning to provide some more debug patches, to figure out
>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>> above is correct the patch should not really fix/break anything for
>>>> you... With the findings above I would have expected your git bisect to
>>>> identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>> callback to drivers") as the first broken commit...
>>> I can't point to any specific series of events where it would go
>>> wrong, but I suspect that the problem might be the fact that you're
>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>> don't see how it's properly protected from potentially being called
>>> on different CPUs concurrently.
>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>> problems when tx scheduling could happen from multiple places. My
>>> solution was to have a single worker thread that handles tx, which is
>>> scheduled from the wake_tx_queue op.
>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>> I think it's already doing all of that:
>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>> to this handler.
> I know. The problem I see is that I can't find anything that guarantees
> that .wake_tx_queue_op is not being called concurrently from multiple
> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
> directly, instead of deferring it to a single workqueue/tasklet/thread,
> and multiple concurrent calls to it could potentially cause issues.
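
For illustration only, here is a minimal userspace sketch of the pattern
Felix describes: the wake op merely records that work is pending and kicks
a single worker, and only that worker does the scheduling. This is plain
pthreads code, not mac80211 or mt76 code. Every name in it (tx_ctx,
demo_wake_tx_queue, demo_tx_worker, demo_run) is hypothetical, and the
"scheduling" is reduced to counting frames.

```c
/* Userspace sketch, NOT kernel code; all names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct tx_ctx {
	pthread_mutex_t lock;
	pthread_cond_t kick;
	unsigned long queued;	/* frames waiting, protected by lock */
	unsigned long sent;	/* frames handled, touched only by the worker */
	bool stop;		/* protected by lock */
	atomic_int in_sched;	/* detects concurrent entry into scheduling */
};

/* Stand-in for a wake_tx_queue op: callable from any thread at any time.
 * It never schedules tx itself; it records work and kicks the worker. */
static void demo_wake_tx_queue(struct tx_ctx *c)
{
	pthread_mutex_lock(&c->lock);
	c->queued++;
	pthread_cond_signal(&c->kick);
	pthread_mutex_unlock(&c->lock);
}

/* The single worker: the only place where "tx scheduling" happens. */
static void *demo_tx_worker(void *arg)
{
	struct tx_ctx *c = arg;

	pthread_mutex_lock(&c->lock);
	for (;;) {
		while (!c->queued && !c->stop)
			pthread_cond_wait(&c->kick, &c->lock);
		if (!c->queued && c->stop)
			break;

		unsigned long batch = c->queued;
		c->queued = 0;
		pthread_mutex_unlock(&c->lock);

		/* Scheduling runs outside the lock. With a single worker
		 * it can never be entered twice at the same time. */
		int prev = atomic_fetch_add(&c->in_sched, 1);
		assert(prev == 0);
		c->sent += batch;
		atomic_fetch_sub(&c->in_sched, 1);

		pthread_mutex_lock(&c->lock);
	}
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

struct waker_arg {
	struct tx_ctx *c;
	unsigned long n;
};

/* Simulates one context (e.g. one CPU) calling the wake op n times. */
static void *demo_waker(void *arg)
{
	struct waker_arg *w = arg;

	for (unsigned long i = 0; i < w->n; i++)
		demo_wake_tx_queue(w->c);
	return NULL;
}

/* Runs up to 16 concurrent wakers against one worker and returns the
 * number of frames the worker handled. */
unsigned long demo_run(int wakers, unsigned long per_waker)
{
	struct tx_ctx c = { .queued = 0, .sent = 0, .stop = false };
	struct waker_arg w = { &c, per_waker };
	pthread_t worker, th[16];

	if (wakers < 0)
		wakers = 0;
	if (wakers > 16)
		wakers = 16;

	pthread_mutex_init(&c.lock, NULL);
	pthread_cond_init(&c.kick, NULL);
	atomic_init(&c.in_sched, 0);

	pthread_create(&worker, NULL, demo_tx_worker, &c);
	for (int i = 0; i < wakers; i++)
		pthread_create(&th[i], NULL, demo_waker, &w);
	for (int i = 0; i < wakers; i++)
		pthread_join(th[i], NULL);

	/* Ask the worker to drain what is left and exit. */
	pthread_mutex_lock(&c.lock);
	c.stop = true;
	pthread_cond_signal(&c.kick);
	pthread_mutex_unlock(&c.lock);
	pthread_join(worker, NULL);

	pthread_cond_destroy(&c.kick);
	pthread_mutex_destroy(&c.lock);
	return c.sent;
}
```

Calling demo_run(4, 1000) from a main() starts four concurrent wakers and
returns the number of frames the worker handled. The point of the design
is that concurrent callers only contend on a short critical section, while
the scheduling step itself stays single-threaded. A kernel version would
presumably use a workqueue, tasklet, or kthread instead, as Felix suggests.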

Alexander, Felix, many thx for looking into this.

This increasingly sounds like something that might take a while to fix,
which makes it harder to resolve within the time frames that
Documentation/process/handling-regressions.rst outlines. So please allow
me to ask:

Is reverting the culprit (and reapplying it later once the real cause is
found and fixed) an option, or would that cause other regressions?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.