Serialization helps. A (crude and in multiple ways incorrect) patch
preventing two drv_wake_tx_queue() calls from running concurrently for
the same AC fixed the issue for Thomas:
https://bugzilla.kernel.org/show_bug.cgi?id=217119#c20
So it looks like we'll soon have a fix for the issue.
The driver often wakes the queue for IEEE80211_AC_BE for only a single
skb and then stops it again.
That short run time is insufficient for wake_txqs_tasklet to properly
wake all queues itself, and from time to time a new TX operation squeezes
in after IEEE80211_AC_BE has been unblocked but prior to
drv_wake_tx_queue being called from the wake_txqs_tasklet. When this
happens, drv_wake_tx_queue is called twice: once from the tasklet and
once from userspace.
ieee80211_handle_wake_tx_queue uses ieee80211_txq_schedule_start,
which has this documented requirement:
"The driver must not call multiple TXQ scheduling rounds concurrently."
Now I don't think that is causing the reported regression. Nevertheless
we should prevent concurrent calls of ieee80211_handle_wake_tx_queue for
that reason alone.
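To illustrate what serializing the scheduling rounds means, here is a
minimal userspace sketch. It is not the patch from the bugzilla entry and
not mac80211 code: struct ac_sched, handle_wake_tx_queue_model() and
run_two_callers() are hypothetical names, the AC count is assumed, and a
pthread mutex stands in for whatever lock a kernel fix would use. Two
threads play the tasklet and the userspace TX path, and the per-AC lock
guarantees that no scheduling round ever overlaps another one.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_ACS           4      /* assumed number of access categories */
#define ROUNDS_PER_CALLER 100000 /* arbitrary load for the demonstration */

/* Hypothetical stand-in for the per-AC scheduling state. */
struct ac_sched {
	pthread_mutex_t lock; /* serializes scheduling rounds for this AC */
	int in_round;         /* 1 while a scheduling round is active */
	long overlaps;        /* rounds that started while another was active */
	long rounds;          /* completed scheduling rounds */
};

static struct ac_sched acs[NUM_ACS];

static void sched_init(void)
{
	for (int i = 0; i < NUM_ACS; i++) {
		pthread_mutex_init(&acs[i].lock, NULL);
		acs[i].in_round = 0;
		acs[i].overlaps = 0;
		acs[i].rounds = 0;
	}
}

/* Model of one wake_tx_queue handler invocation for a given AC. */
static void handle_wake_tx_queue_model(int ac)
{
	struct ac_sched *s = &acs[ac];

	pthread_mutex_lock(&s->lock);   /* only one round per AC at a time */
	if (s->in_round)
		s->overlaps++;          /* would indicate a concurrent round */
	s->in_round = 1;                /* models the start of a round */
	s->rounds++;                    /* models dequeueing and sending frames */
	s->in_round = 0;                /* models the end of a round */
	pthread_mutex_unlock(&s->lock);
}

static void *caller(void *arg)
{
	int ac = *(int *)arg;

	for (int i = 0; i < ROUNDS_PER_CALLER; i++)
		handle_wake_tx_queue_model(ac);
	return NULL;
}

/* Runs a "tasklet" and a "userspace" caller against the same AC and
 * returns how many overlapping rounds were observed. */
static long run_two_callers(int ac)
{
	pthread_t tasklet, userspace;

	sched_init();
	pthread_create(&tasklet, NULL, caller, &ac);
	pthread_create(&userspace, NULL, caller, &ac);
	pthread_join(tasklet, NULL);
	pthread_join(userspace, NULL);
	return acs[ac].overlaps;
}
```

Calling run_two_callers() for one AC index should report zero overlaps,
since the second caller simply waits until the first has finished its
round. Whether a real fix should block, spin, or skip the second call is
a separate design question that this sketch does not answer.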
The real reason for the hangs is probably in the rt2800usb driver or
hardware. I don't see anything in the driver code, so the HW itself
probably has a problem with the two near-concurrent TX operations.
The real culprit of the regression should be commit a790cc3a4fad ("wifi:
mac80211: add wake_tx_queue callback to drivers"), which switched
rt2800usb over to iTXQs. But without the fix from commit 4444bc2116ae
("wifi: mac80211: Proper mark iTXQs for resumption") mac80211 failed to
schedule the required run of the wake_txqs_tasklet. Thus, instead of two
concurrent drv_wake_tx_queue calls we only got one, and the driver
continued to work.
I asked Thomas on bugzilla to test the "best" solution I came up with.
There seem to be multiple ways to fix this, but I can't find a simple,
low-risk and complete fix. So I compromised...
Once Thomas confirms the fix, we can discuss it on linux-wireless.