Re: [PATCH] btusb: fix zero BD address problem during stress test

From: Marcel Holtmann
Date: Wed Dec 07 2016 - 04:07:49 EST


Hi Amitkumar,

>>>>>>>> From: Amitkumar Karwar [mailto:akarwar@xxxxxxxxxxx]
>>>>>>>> Sent: Tuesday, October 18, 2016 6:27 PM
>>>>>>>> To: linux-bluetooth@xxxxxxxxxxxxxxx
>>>>>>>> Cc: marcel@xxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Cathy
>>> Luo;
>>>>>>>> Nishant Sarmukadam; Ganapathi Bhat; Amitkumar Karwar
>>>>>>>> Subject: [PATCH] btusb: fix zero BD address problem during
>>> stress
>>>>>>>> test
>>>>>>>>
>>>>>>>> From: Ganapathi Bhat <gbhat@xxxxxxxxxxx>
>>>>>>>>
>>>>>>>> We came across a corner case issue during reboot stress test
>> in
>>>>>> which
>>>>>>>> hciconfig shows BD address is all zero. Reason is we don't get
>>>>>>>> response for HCI RESET command during initialization
>>>>>>>>
>>>>>>>> The issue is tracked to a race where USB subsystem calls
>>>>>>>> btusb_intr_complete() to deliver a data(NOOP frame) received
>> on
>>>>>>>> interrupt endpoint. HCI_RUNNING flag is not yet set by
>>>>>>>> bluetooth subsystem. So we ignore that frame and return.
>>>>>>>>
>>>>>>>> As we missed to resubmit the buffer to interrupt endpoint in
>>> this
>>>>>>>> case, we don't get response for BT reset command downloaded
>>> after
>>>>>> this.
>>>>>>>>
>>>>>>>> This patch handles the corner case to resolve zero BD address
>>>>>> problem.
>>>>>>>>
>>>>>>>> Signed-off-by: Ganapathi Bhat <gbhat@xxxxxxxxxxx>
>>>>>>>> Signed-off-by: Amitkumar Karwar <akarwar@xxxxxxxxxxx>
>>>>>>>> ---
>>>>>>>> drivers/bluetooth/btusb.c | 5 +----
>>>>>>>> 1 file changed, 1 insertion(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/bluetooth/btusb.c
>>>>>>>> b/drivers/bluetooth/btusb.c index 811f9b9..b5596ac 100644
>>>>>>>> --- a/drivers/bluetooth/btusb.c
>>>>>>>> +++ b/drivers/bluetooth/btusb.c
>>>>>>>> @@ -607,10 +607,7 @@ static void btusb_intr_complete(struct
>> urb
>>>>>> *urb)
>>>>>>>> BT_DBG("%s urb %p status %d count %d", hdev->name, urb,
>>> urb-
>>>>>>>>> status,
>>>>>>>> urb->actual_length);
>>>>>>>>
>>>>>>>> - if (!test_bit(HCI_RUNNING, &hdev->flags))
>>>>>>>> - return;
>>>>>>>> -
>>>>>>>> - if (urb->status == 0) {
>>>>>>>> + if (urb->status == 0 && test_bit(HCI_RUNNING, &hdev-
>>>>> flags)) {
>>>>>>>> hdev->stat.byte_rx += urb->actual_length;
>>>>>>>>
>>>>>>>> if (btusb_recv_intr(data, urb->transfer_buffer,
>>>>>>>
>>>>>>> Did you get a chance to check this?
>>>>>>> Please let us know if you have any review comments.
>>>>>>
>>>>>> can you explain how this is correct and show me the HCI traces
>>>>>> for this.
>>>>>>
>>>>>
>>>>> I suppose HCI trace means hcidump logs here. As device hasn't yet
>>>> initialized, hcidump won't show anything.
>>>>> We had added debug info in btusb driver to trace the data
>> received
>>>>> on
>>>> all USB endpoints and also checked usbmon logs.
>>>>
>>>> use btmon and it will show it.
>>>>
>>>>> Here is the sequence of events we observed in a corner case while
>>>> running stress test.
>>>>> 1) Inside btusb_open() call ------ Thread 1
>>>>> 2) btusb_submit_intr_urb() submits the URB for receiving data on
>>>>> interrupt endpoint ---- Thread 1
>>>>> 3) btusb_intr_complete() gets called to deliver NOP frame from
>> HCI
>>>>> controller ---- Thread 2
>>>>> 4) HCI_RUNNING isn't set yet. So we return from
>>>>> btusb_intr_complete() without resubmitting the buffer --- Thread
>> 2
>>>>> 5) Exit btusb_open() ---- Thread 1
>>>>> 6) "set_bit(HCI_RUNNING, &hdev->flags)" done by bluetooth core
>>>>> ---- Thread 1
>>>>>
>>>>> Later HCI_RESET command gets timedout, as we haven't re-submitted
>>>> buffer for interrupt endpoint in step (4) above.
>>>>>
>>>>> Please find attached logs. usbmon log shows first frame received
>>>>> on
>>>> interrupt endpoint is NOP(Search for Marvell in log).
>>>>>
>>>>> Here is what bluetooth spec says about NOP frame.
>>>>>
>>>>> "To indicate to the Host that the Controller is ready to receive
>>> HCI
>>>>> command packets, the Controller generates a Command Status event
>>>>> with Status 0x00 and Command_Opcode 0x0000, and the
>>>> Num_HCI_Command_Packets
>>>>> event parameter is set to 1 or more. Command_Opcode, 0x0000 is a
>>> NOP
>>>> (No Operation)â
>>>>
>>>> So I wonder if we need to remove the HCI_RUNNING logic from the
>>>> drivers. It is only left in a few USB drivers and I removed the
>> rest
>>>> and moved it into the core. I am not in favour of papering over
>> this
>>>> issue. I need to understand what is wrong. And actually HCI_RUNNING
>>>> needs to be taken away from the drivers. So the question is if
>>> btusb.c
>>>> still needs it or if that is just a leftover. Meaning is it
>>> protecting
>>>> anything?
>>>>
>>>
>>> I can see HCI_RUNNING check being used at below places in btusb.c
>>>
>>> 1) btusb_intr_complete
>>> 2) btusb_bulk_complete
>>> 3) btusb_isoc_complete
>>> 4) btusb_tx_complete
>>> 5) btusb_isoc_tx_complete
>>> 6) btusb_bcm_set_diag
>>> 7) btusb_resume
>>>
>>> I think, it can be removed from (1), (2) and (3), as we already have
>>> below check inside hci_recv_frame()
>>>
>>> if (!hdev || (!test_bit(HCI_UP, &hdev->flags)
>>> && !test_bit(HCI_INIT, &hdev->flags))) {
>>> kfree_skb(skb);
>>> return -ENXIO;
>>> }
>>>
>>> For (4) and (5), it's just used to decide if we should update "hdev-
>>>> stat" or not.
>>>
>>
>> Please guide on how should we fix this issue. We observe timeout for
>> HCI_RESET command sometimes when system is booted. This happens due to
>> missing urb_submitt() when first frame(NOP) received on INT endpoint is
>> ignored. Below change would also help.
>>
>> --- a/net/bluetooth/hci_core.c
>> +++ b/net/bluetooth/hci_core.c
>> @@ -1318,12 +1318,13 @@ static int hci_dev_do_open(struct hci_dev
>> *hdev)
>> goto done;
>> }
>>
>> + set_bit(HCI_RUNNING, &hdev->flags);
>> if (hdev->open(hdev)) {
>> + clear_bit(HCI_RUNNING, &hdev->flags);
>> ret = -EIO;
>> goto done;
>> }
>>
>> - set_bit(HCI_RUNNING, &hdev->flags);
>> hci_sock_dev_event(hdev, HCI_DEV_OPEN);
>
> How can we proceed on this? I'm not much familiar with btusb driver, so not sure if HCI_RUNNING logic can be completely removed.
> Meanwhile could you accept proposed fix in this patch for the potential issue we spotted during stress tests.
>
> The change doesn't have any side-effects. During teardown, race between btusb_close() and btusb_intr_complete() is appropriately taken care of with the help of BTUSB_INTR_RUNNING flag.

so what is the downside of just replacing all HCI_RUNNING with the appropriate BTUSB_INTR_RUNNING, BTUSB_BULK_RUNNING etc.?

Regards

Marcel