Re: [PATCH v2] Bluetooth: Shutdown controller after workqueues are flushed or cancelled

From: Kai-Heng Feng
Date: Wed Aug 04 2021 - 23:45:12 EST


On Thu, Aug 5, 2021 at 11:00 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
> On Wed, 4 Aug 2021 22:35:43 +0800 Kai-Heng Feng wrote:
> >On Tue, Aug 3, 2021 at 3:47 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
> >>
> >> On Tue, 3 Aug 2021 14:45:07 +0800 Kai-Heng Feng wrote:
> >> >On Mon, Aug 2, 2021 at 11:05 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
> >> >>
> >> >> Given the skb_get in hci_req_sync_complete makes it safe to free skb on
> >> >> driver side, I doubt this patch is the correct fix as it is.
> >> >
> >> >Some workqueues are still active.
> >> >The shutdown() should be called at least after hci_request_cancel_all().
> >>
> >> What is muddy then is how active workqueues prevent skb_get from protecting
> >> kfree_skb. Can you spot what workqueue it is?
> >
> >I managed to reproduce the issue with another kernel splat:
> >------------[ cut here ]------------
> >kernel BUG at mm/slub.c:321!
> >invalid opcode: 0000 [#1] SMP NOPTI
> >CPU: 2 PID: 2208 Comm: kworker/u9:3 Not tainted 5.14.0-rc4+ #16
> >Hardware name: HP HP ProBook 650 G8 Notebook PC/87ED, BIOS T74 Ver.
> >01.03.04 01/07/2021
> >Workqueue: hci0 discov_update [bluetooth]
> >RIP: 0010:__slab_free+0x20c/0x3a0
> >Code: 00 44 0f b6 54 24 1a 8b 74 24 14 44 0f b6 4c 24 1b 44 8b 44 24
> >1c 48 89 44 24 08 48 8b 54 24 20 48 8b 7c 24 28 e9 ad fe ff ff <0f> 0b
> >49 3b 54 24 28 0f 85 6b ff ff ff 49 89 5c 24 20 49 89 4c 24
> >RSP: 0018:ffffaa0e4164fc50 EFLAGS: 00010246
> >RAX: ffff9cc9a217e668 RBX: ffff9cc9a217e600 RCX: ffff9cc9a217e600
> >RDX: 000000008010000e RSI: ffffd09044885f80 RDI: ffff9cc980e96500
> >RBP: ffffaa0e4164fd00 R08: 0000000000000001 R09: ffffffff885b3a4e
> >R10: ffff9cc999aab800 R11: ffff9cc9a217e600 R12: ffffd09044885f80
> >R13: ffff9cc9a217e600 R14: ffff9cc980e96500 R15: ffff9cc9a217e600
> >FS: 0000000000000000(0000) GS:ffff9cca2b900000(0000) knlGS:0000000000000000
> >CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >CR2: 00007ffe164d5b98 CR3: 000000013f410002 CR4: 0000000000770ee0
> >PKRU: 55555554
> >Call Trace:
> > ? psi_task_switch+0xc3/0x1e0
> > ? __switch_to_asm+0x36/0x70
> > ? skb_free_head+0x67/0x80
> > kmem_cache_free+0x370/0x3d0
> > ? kfree_skbmem+0x4e/0x90
> > kfree_skbmem+0x4e/0x90
> > kfree_skb+0x47/0xb0
> > __hci_req_sync+0x134/0x2a0 [bluetooth]
> > ? wait_woken+0x70/0x70
> > discov_update+0x2ae/0x310 [bluetooth]
> > process_one_work+0x21d/0x3c0
> > worker_thread+0x53/0x420
> > ? process_one_work+0x3c0/0x3c0
> > kthread+0x127/0x150
> > ? set_kthread_struct+0x50/0x50
> > ret_from_fork+0x1f/0x30
>
> [...]
>
> >
> >So hci_request_cancel_all() -> cancel_work_sync(&hdev->discov_update)
> >and can prevent the race from happening.
>
> Given
>
> __hci_req_sync
> err = hci_req_run_skb(&req, hci_req_sync_complete);
> kfree_skb(hdev->req_skb);
>
> hci_req_sync_complete
> if (skb)
> hdev->req_skb = skb_get(skb);
>
> once more skb_get makes the above race hard to understand, though it should
> better check error before freeing skb there to avoid blind free.
>
> >
> >And the kernel splat is just one symptom of the issue, most of the
>
> It fails to support your reasoning so far.

How? Most of the time the BT controller can't be brought up again
after shutdown(), and we need to stop other activities before that.
What other reasoning is expected?

Kai-Heng

>
> >time it's just "Bluetooth: hci0: HCI reset during shutdown failed" in
> >dmesg.
> >
> >Kai-Heng
>
>
> +++ x/net/bluetooth/hci_request.c
> @@ -257,8 +257,10 @@ int __hci_req_sync(struct hci_dev *hdev,
> break;
> }
>
> - kfree_skb(hdev->req_skb);
> - hdev->req_skb = NULL;
> + if (!err) {
> + kfree_skb(hdev->req_skb);
> + hdev->req_skb = NULL;
> + }
> hdev->req_status = hdev->req_result = 0;
>
> bt_dev_dbg(hdev, "end: err %d", err);