RE: [PATCH] btusb: fix zero BD address problem during stress test

From: Amitkumar Karwar
Date: Wed Dec 07 2016 - 02:45:34 EST


Hi Marcel,

> From: Amitkumar Karwar
> Sent: Friday, November 25, 2016 4:51 PM
> To: 'Marcel Holtmann'
> Cc: 'linux-bluetooth@xxxxxxxxxxxxxxx'; 'linux-kernel@xxxxxxxxxxxxxxx';
> Cathy Luo; Nishant Sarmukadam; Ganapathi Bhat
> Subject: RE: [PATCH] btusb: fix zero BD address problem during stress
> test
>
> Hi Marcel,
>
> > From: Amitkumar Karwar
> > Sent: Wednesday, November 23, 2016 3:14 PM
> > To: 'Marcel Holtmann'
> > Cc: linux-bluetooth@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > Cathy Luo; Nishant Sarmukadam; Ganapathi Bhat
> > Subject: RE: [PATCH] btusb: fix zero BD address problem during stress
> > test
> >
> > Hi Marcel,
> >
> > > From: Marcel Holtmann [mailto:marcel@xxxxxxxxxxxx]
> > > Sent: Wednesday, November 23, 2016 1:46 PM
> > > To: Amitkumar Karwar
> > > Cc: linux-bluetooth@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > > Cathy Luo; Nishant Sarmukadam; Ganapathi Bhat
> > > Subject: Re: [PATCH] btusb: fix zero BD address problem during
> > > stress test
> > >
> > > Hi Amitkumar,
> > >
> > > >>>> From: Amitkumar Karwar [mailto:akarwar@xxxxxxxxxxx]
> > > >>>> Sent: Tuesday, October 18, 2016 6:27 PM
> > > >>>> To: linux-bluetooth@xxxxxxxxxxxxxxx
> > > >>>> Cc: marcel@xxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Cathy
> > Luo;
> > > >>>> Nishant Sarmukadam; Ganapathi Bhat; Amitkumar Karwar
> > > >>>> Subject: [PATCH] btusb: fix zero BD address problem during
> > stress
> > > >>>> test
> > > >>>>
> > > >>>> From: Ganapathi Bhat <gbhat@xxxxxxxxxxx>
> > > >>>>
> > > >>>> We came across a corner case issue during reboot stress test
> in
> > > >> which
> > > >>>> hciconfig shows BD address is all zero. Reason is we don't get
> > > >>>> response for HCI RESET command during initialization
> > > >>>>
> > > >>>> The issue is tracked to a race where USB subsystem calls
> > > >>>> btusb_intr_complete() to deliver a data(NOOP frame) received
> on
> > > >>>> interrupt endpoint. HCI_RUNNING flag is not yet set by
> > > >>>> bluetooth subsystem. So we ignore that frame and return.
> > > >>>>
> > > >>>> As we missed to resubmit the buffer to interrupt endpoint in
> > this
> > > >>>> case, we don't get response for BT reset command downloaded
> > after
> > > >> this.
> > > >>>>
> > > >>>> This patch handles the corner case to resolve zero BD address
> > > >> problem.
> > > >>>>
> > > >>>> Signed-off-by: Ganapathi Bhat <gbhat@xxxxxxxxxxx>
> > > >>>> Signed-off-by: Amitkumar Karwar <akarwar@xxxxxxxxxxx>
> > > >>>> ---
> > > >>>> drivers/bluetooth/btusb.c | 5 +----
> > > >>>> 1 file changed, 1 insertion(+), 4 deletions(-)
> > > >>>>
> > > >>>> diff --git a/drivers/bluetooth/btusb.c
> > > >>>> b/drivers/bluetooth/btusb.c index 811f9b9..b5596ac 100644
> > > >>>> --- a/drivers/bluetooth/btusb.c
> > > >>>> +++ b/drivers/bluetooth/btusb.c
> > > >>>> @@ -607,10 +607,7 @@ static void btusb_intr_complete(struct
> urb
> > > >> *urb)
> > > >>>> BT_DBG("%s urb %p status %d count %d", hdev->name, urb,
> > urb-
> > > >>>>> status,
> > > >>>> urb->actual_length);
> > > >>>>
> > > >>>> - if (!test_bit(HCI_RUNNING, &hdev->flags))
> > > >>>> - return;
> > > >>>> -
> > > >>>> - if (urb->status == 0) {
> > > >>>> + if (urb->status == 0 && test_bit(HCI_RUNNING, &hdev-
> > > >flags)) {
> > > >>>> hdev->stat.byte_rx += urb->actual_length;
> > > >>>>
> > > >>>> if (btusb_recv_intr(data, urb->transfer_buffer,
> > > >>>
> > > >>> Did you get a chance to check this?
> > > >>> Please let us know if you have any review comments.
> > > >>
> > > >> can you explain how this is correct and show me the HCI traces
> > > >> for this.
> > > >>
> > > >
> > > > I suppose HCI trace means hcidump logs here. As device hasn't yet
> > > initialized, hcidump won't show anything.
> > > > We had added debug info in btusb driver to trace the data
> received
> > > > on
> > > all USB endpoints and also checked usbmon logs.
> > >
> > > use btmon and it will show it.
> > >
> > > > Here is the sequence of events we observed in a corner case while
> > > running stress test.
> > > > 1) Inside btusb_open() call ------ Thread 1
> > > > 2) btusb_submit_intr_urb() submits the URB for receiving data on
> > > > interrupt endpoint ---- Thread 1
> > > > 3) btusb_intr_complete() gets called to deliver NOP frame from
> HCI
> > > > controller ---- Thread 2
> > > > 4) HCI_RUNNING isn't set yet. So we return from
> > > > btusb_intr_complete() without resubmitting the buffer --- Thread
> 2
> > > > 5) Exit btusb_open() ---- Thread 1
> > > > 6) "set_bit(HCI_RUNNING, &hdev->flags)" done by bluetooth core
> > > > ---- Thread 1
> > > >
> > > > Later HCI_RESET command gets timedout, as we haven't re-submitted
> > > buffer for interrupt endpoint in step (4) above.
> > > >
> > > > Please find attached logs. usbmon log shows first frame received
> > > > on
> > > interrupt endpoint is NOP(Search for Marvell in log).
> > > >
> > > > Here is what bluetooth spec says about NOP frame.
> > > >
> > > > "To indicate to the Host that the Controller is ready to receive
> > HCI
> > > > command packets, the Controller generates a Command Status event
> > > > with Status 0x00 and Command_Opcode 0x0000, and the
> > > Num_HCI_Command_Packets
> > > > event parameter is set to 1 or more. Command_Opcode, 0x0000 is a
> > NOP
> > > (No Operation)â
> > >
> > > So I wonder if we need to remove the HCI_RUNNING logic from the
> > > drivers. It is only left in a few USB drivers and I removed the
> rest
> > > and moved it into the core. I am not in favour of papering over
> this
> > > issue. I need to understand what is wrong. And actually HCI_RUNNING
> > > needs to be taken away from the drivers. So the question is if
> > btusb.c
> > > still needs it or if that is just a leftover. Meaning is it
> > protecting
> > > anything?
> > >
> >
> > I can see HCI_RUNNING check being used at below places in btusb.c
> >
> > 1) btusb_intr_complete
> > 2) btusb_bulk_complete
> > 3) btusb_isoc_complete
> > 4) btusb_tx_complete
> > 5) btusb_isoc_tx_complete
> > 6) btusb_bcm_set_diag
> > 7) btusb_resume
> >
> > I think, it can be removed from (1), (2) and (3), as we already have
> > below check inside hci_recv_frame()
> >
> > if (!hdev || (!test_bit(HCI_UP, &hdev->flags)
> > && !test_bit(HCI_INIT, &hdev->flags))) {
> > kfree_skb(skb);
> > return -ENXIO;
> > }
> >
> > For (4) and (5), it's just used to decide if we should update "hdev-
> > >stat" or not.
> >
>
> Please guide on how should we fix this issue. We observe timeout for
> HCI_RESET command sometimes when system is booted. This happens due to
> missing urb_submitt() when first frame(NOP) received on INT endpoint is
> ignored. Below change would also help.
>
> --- a/net/bluetooth/hci_core.c
> +++ b/net/bluetooth/hci_core.c
> @@ -1318,12 +1318,13 @@ static int hci_dev_do_open(struct hci_dev
> *hdev)
> goto done;
> }
>
> + set_bit(HCI_RUNNING, &hdev->flags);
> if (hdev->open(hdev)) {
> + clear_bit(HCI_RUNNING, &hdev->flags);
> ret = -EIO;
> goto done;
> }
>
> - set_bit(HCI_RUNNING, &hdev->flags);
> hci_sock_dev_event(hdev, HCI_DEV_OPEN);

How can we proceed on this? I'm not much familiar with btusb driver, so not sure if HCI_RUNNING logic can be completely removed.
Meanwhile could you accept proposed fix in this patch for the potential issue we spotted during stress tests.

The change doesn't have any side-effects. During teardown, race between btusb_close() and btusb_intr_complete() is appropriately taken care of with the help of BTUSB_INTR_RUNNING flag.

Regards,
Amitkumar Karwar