Re: [PATCH] rt2x00: fix rx queue hang

From: Soeren Moch
Date: Fri Jun 21 2019 - 07:30:50 EST


Hi!

On 18.06.19 11:34, Stanislaw Gruszka wrote:
> Hi
>
> On Mon, Jun 17, 2019 at 11:46:56AM +0200, Soeren Moch wrote:
>> Since commit ed194d136769 ("usb: core: remove local_irq_save() around
>> ->complete() handler") the handlers rt2x00usb_interrupt_rxdone() and
>> rt2x00usb_interrupt_txdone() are not running with interrupts disabled
>> anymore. So these handlers are not guaranteed to run completely before
>> workqueue processing starts. So only mark entries ready for workqueue
>> processing after proper accounting in the dma done queue.
> It was always the case on SMP machines that rt2x00usb_interrupt_{tx/rx}done
> can run concurrently with rt2x00_work_{rx,tx}done, so I do not
> understand how removing local_irq_save() around complete handler broke
> things.
I think because completion handlers can be interrupted now and scheduled
away
in the middle of processing.
> Have you reverted commit ed194d136769 and the revert does solve the problem ?
Yes, I already sent a patch for this, see [1]. But this was not considered
an acceptablesolution. Especially RT folks do not like code running with
interrupts disabled,particularly when trying to acquire spinlocks then.

[1] https://lkml.org/lkml/2019/5/31/863
> Between 4.19 and 4.20 we have some quite big changes in rt2x00 driver:
>
> 0240564430c0 rt2800: flush and txstatus rework for rt2800mmio
> adf26a356f13 rt2x00: use different txstatus timeouts when flushing
> 5022efb50f62 rt2x00: do not check for txstatus timeout every time on tasklet
> 0b0d556e0ebb rt2800mmio: use txdone/txstatus routines from lib
> 5c656c71b1bf rt2800: move usb specific txdone/txstatus routines to rt2800lib
>
> so I'm a bit afraid that one of those changes is real cause of
> the issue not ed194d136769 .
I tested 4.20 and 5.1 and see the exact same behavior. Reverting this
usb core patchsolves the problem.
4.19.x (before this usb core patch) is running fine.
>> Note that rt2x00usb_work_rxdone() processes all available entries, not
>> only such for which queue_work() was called.
>>
>> This fixes a regression on a RT5370 based wifi stick in AP mode, which
>> suddenly stopped data transmission after some period of heavy load. Also
>> stopping the hanging hostapd resulted in the error message "ieee80211
>> phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush".
>> Other operation modes are probably affected as well, this just was
>> the used testcase.
> Do you know what actually make the traffic stop,
> TX queue hung or RX queue hung?
I think RX queue hang, as stated in the patch title. "Queue 14" means QID_RX
(rt2x00queue.h, enum data_queue_qid).
I also tried to re-add local_irq_save() in only one of the handlers. Adding
this tort2x00usb_interrupt_rxdone() alone solved the issue, while doing so
for tx alonedid not.

Note that this doesn't mean there is no problem for tx, that's maybe
just more
difficult to trigger.
>> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>> index 1b08b01db27b..9c102a501ee6 100644
>> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>> @@ -263,9 +263,9 @@ EXPORT_SYMBOL_GPL(rt2x00lib_dmastart);
>>
>> void rt2x00lib_dmadone(struct queue_entry *entry)
>> {
>> - set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
>> clear_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags);
>> rt2x00queue_index_inc(entry, Q_INDEX_DMA_DONE);
>> + set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
> Unfortunately I do not understand how this suppose to fix the problem,
> could you elaborate more about this change?
>
Re-adding local_irq_save() around thisrt2x00lib_dmadone()solved
the issue. So I also tried to reverse the order of these calls.
It seems totally plausible to me, that the correct sequence is to
first clear the device assignment, then to set the status to dma_done,
then to trigger the workqueue processing for this entry. When the handler
is scheduled away in the middle of this sequence, now there is no
strange state where the entry can be processed by the workqueue while
not declared dma_done for it.
With this changed sequence there is no need anymore to disable interrupts
for solving the hang issue.

Regards,
Soeren