On 08.03.23 12:57, Felix Fietkau wrote:
> On 08.03.23 12:41, Alexander Wetzel wrote:
>> On 08.03.23 08:52, Felix Fietkau wrote:
>>>> I'm also planning to provide some more debug patches, to figure out
>>>> which part of commit 4444bc2116ae ("wifi: mac80211: Proper mark iTXQs
>>>> for resumption") fixes the issue for you. Assuming my understanding
>>>> above is correct the patch should not really fix/break anything for
>>>> you... With the findings above I would have expected your git bisect
>>>> to identify commit a790cc3a4fad ("wifi: mac80211: add wake_tx_queue
>>>> callback to drivers") as the first broken commit...
>>>
>>> I can't point to any specific series of events where it would go
>>> wrong, but I suspect that the problem might be the fact that you're
>>> doing tx scheduling from within ieee80211_handle_wake_tx_queue. I
>>> don't see how it's properly protected from potentially being called
>>> on different CPUs concurrently.
>>>
>>> Back when I was debugging some iTXQ issues in mt76, I also had
>>> problems when tx scheduling could happen from multiple places. My
>>> solution was to have a single worker thread that handles tx, which is
>>> scheduled from the wake_tx_queue op.
>>> Maybe you could do something similar in mac80211 for non-iTXQ drivers.
>>
>> I think it's already doing all of that:
>> ieee80211_handle_wake_tx_queue() is the mac80211 implementation for the
>> wake_tx_queue op. The drivers without native iTXQ support simply link it
>> to this handler.
>
> I know. The problem I see is that I can't find anything that guarantees
> that .wake_tx_queue_op is not being called concurrently from multiple
> different places. ieee80211_handle_wake_tx_queue is doing the scheduling
> directly, instead of deferring it to a single workqueue/tasklet/thread,
> and multiple concurrent calls to it could potentially cause issues.

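
For readers following along, the deferral pattern described above can be
sketched in user space roughly as follows. This is only an illustration
using POSIX threads, not mac80211 or mt76 code: struct tx_ctx,
wake_tx_queue(), tx_worker() and run_demo() are hypothetical names made up
for the sketch. The point it tries to show is that the wake op, which may
be called from many threads at once, only records work and kicks the
worker, while the actual "tx scheduling" runs in exactly one thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_CALLERS 16

/* Hypothetical stand-in for per-device tx state (not a kernel struct). */
struct tx_ctx {
	pthread_mutex_t lock;
	pthread_cond_t kick;
	long queued;	/* pending wake requests, protected by lock */
	bool stop;	/* tells the worker to exit, protected by lock */
	long sent;	/* touched only by the worker thread */
};

struct caller_arg {
	struct tx_ctx *ctx;
	int wakes;
};

/* Models the wake op: safe to call concurrently, does no scheduling. */
static void wake_tx_queue(struct tx_ctx *c)
{
	pthread_mutex_lock(&c->lock);
	c->queued++;
	pthread_cond_signal(&c->kick);
	pthread_mutex_unlock(&c->lock);
}

/* The single place where "tx scheduling" happens. */
static void *tx_worker(void *data)
{
	struct tx_ctx *c = data;

	pthread_mutex_lock(&c->lock);
	for (;;) {
		while (c->queued == 0 && !c->stop)
			pthread_cond_wait(&c->kick, &c->lock);
		if (c->queued == 0)
			break;	/* stop requested and nothing left to do */

		long batch = c->queued;
		c->queued = 0;
		pthread_mutex_unlock(&c->lock);

		/* Scheduling runs unlocked; only this thread touches sent. */
		c->sent += batch;

		pthread_mutex_lock(&c->lock);
	}
	pthread_mutex_unlock(&c->lock);
	return NULL;
}

static void *caller(void *data)
{
	struct caller_arg *a = data;

	for (int i = 0; i < a->wakes; i++)
		wake_tx_queue(a->ctx);
	return NULL;
}

/* Starts one worker plus several concurrent callers; returns frames sent. */
long run_demo(int callers, int wakes_per_caller)
{
	struct tx_ctx c = { .queued = 0, .stop = false, .sent = 0 };
	struct caller_arg arg = { &c, wakes_per_caller };
	pthread_t worker, th[MAX_CALLERS];
	int rc;

	if (callers > MAX_CALLERS)
		callers = MAX_CALLERS;

	pthread_mutex_init(&c.lock, NULL);
	pthread_cond_init(&c.kick, NULL);

	rc = pthread_create(&worker, NULL, tx_worker, &c);
	assert(rc == 0);
	for (int i = 0; i < callers; i++) {
		rc = pthread_create(&th[i], NULL, caller, &arg);
		assert(rc == 0);
	}
	for (int i = 0; i < callers; i++)
		pthread_join(th[i], NULL);

	pthread_mutex_lock(&c.lock);
	c.stop = true;
	pthread_cond_signal(&c.kick);
	pthread_mutex_unlock(&c.lock);
	pthread_join(worker, NULL);

	pthread_cond_destroy(&c.kick);
	pthread_mutex_destroy(&c.lock);
	return c.sent;
}
```

Calling run_demo(4, 1000) from a main() (built with -pthread) drives four
concurrent callers through the wake op while the counting stays in the one
worker. In the kernel the equivalent would presumably be a workqueue,
tasklet or kthread rather than a pthread, as mentioned in the quoted text.
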
Alexander, Felix, many thx for looking into this.
This more and more sounds like something that might take a while to get
fixed, which makes it harder to get this fixed within those time-frames
Documentation/process/handling-regressions.rst outlines. So please allow
me to ask:
Is reverting the culprit (and reapplying it later once the real cause is
found and fixed) an option, or would that cause other regressions?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.