Re: [PATCH] Bluetooth: hci_core: Don't queue tx_work while draining workqueue

From: Heitor Alves de Siqueira

Date: Mon May 18 2026 - 11:25:31 EST

Hi Luiz,

On Fri May 15, 2026 at 10:33 AM -03, Luiz Augusto von Dentz wrote:
> Hi Heitor,
>
> On Thu, May 14, 2026 at 10:54 AM Heitor Alves de Siqueira
> <halves@xxxxxxxxxx> wrote:
>>
>> On Wed May 13, 2026 at 11:04 PM -03, Hillf Danton wrote:
>> > On Wed, 13 May 2026 15:55:23 -0300 Heitor Alves de Siqueira wrote:
>> >> Syzbot reported a warning when L2CAP calls queue_work() on the hdev
>> >> workqueue while it's being drained. This can happen during device reset or
>> >> close paths for hci_send_acl(), hci_send_sco() and hci_send_iso().
>> >>
>> >> The workqueue is drained in hci_dev_do_reset() and in hci_dev_close_sync():
>> >> - hci_dev_close_sync() clears the HCI_UP bit before draining
>> >> - hci_dev_do_reset() sets HCI_CMD_DRAIN_WORKQUEUE before draining
>> >>
>> >> Add these checks before queuing tx_work, and free the SKB if it's not
>> >> queued for transmission.
>> >>
>> >> Fixes: 3eff45eaf817 ("Bluetooth: convert tx_task to workqueue")
>> >> Reported-by: syzbot+97721dd81f792e838ba0@xxxxxxxxxxxxxxxxxxxxxxxxx
>> >> Closes: https://syzkaller.appspot.com/bug?extid=97721dd81f792e838ba0
>> >> Signed-off-by: Heitor Alves de Siqueira <halves@xxxxxxxxxx>
>> >> ---
>> >> net/bluetooth/hci_core.c | 18 ++++++++++++++++++
>> >> 1 file changed, 18 insertions(+)
>> >>
>> >> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
>> >> index c46c1236ebfa..5d5f8ad7d1a8 100644
>> >> --- a/net/bluetooth/hci_core.c
>> >> +++ b/net/bluetooth/hci_core.c
>> >> @@ -3278,6 +3278,12 @@ void hci_send_acl(struct hci_chan *chan, struct sk_buff *skb, __u16 flags)
>> >>
>> >> BT_DBG("%s chan %p flags 0x%4.4x", hdev->name, chan, flags);
>> >>
>> >> + if (!test_bit(HCI_UP, &hdev->flags) ||
>> >> + hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE)) {
>> >> + kfree_skb(skb);
>> >> + return;
>> >> + }
>> >> +
>> >> hci_queue_acl(chan, &chan->data_q, skb, flags);
>> >>
>> >> queue_work(hdev->workqueue, &hdev->tx_work);
>> >>
>> > What you add is not enough, go and see how HCI_CMD_DRAIN_WORKQUEUE is
>> > checked in hci_cmd_work(), and in hci_dev_do_reset() for why.
>>
>> I see, I missed the RCU guards for the device flags. Sorry about that,
>> I'll add them to v2.
>> Thanks for the catch!
>
> Actually this whole thing might be because we try to clean the
> workqueue without actually closing the hdev, so I suspect that if we
> just remove all the code from hci_dev_do_reset with
> hci_dev_do_close+hci_dev_do_open, it might work better and align with
> how things work over the MGMT interface:
>
> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
> index c46c1236ebfa..6782bbc9b6a7 100644
> --- a/net/bluetooth/hci_core.c
> +++ b/net/bluetooth/hci_core.c
> @@ -539,47 +539,13 @@ static int hci_dev_do_reset(struct hci_dev *hdev)
>
> hci_req_sync_lock(hdev);
>
> - /* Drop queues */
> - skb_queue_purge(&hdev->rx_q);
> - skb_queue_purge(&hdev->cmd_q);
> -
> - /* Cancel these to avoid queueing non-chained pending work */
> - hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
> - /* Wait for
> - *
> - * if (!hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
> - * queue_delayed_work(&hdev->{cmd,ncmd}_timer)
> - *
> - * inside RCU section to see the flag or complete scheduling.
> - */
> - synchronize_rcu();
> - /* Explicitly cancel works in case scheduled after setting the flag. */
> - cancel_delayed_work(&hdev->cmd_timer);
> - cancel_delayed_work(&hdev->ncmd_timer);
> -
> - /* Avoid potential lockdep warnings from the *_flush() calls by
> - * ensuring the workqueue is empty up front.
> - */
> - drain_workqueue(hdev->workqueue);
> -
> - hci_dev_lock(hdev);
> - hci_inquiry_cache_flush(hdev);
> - hci_conn_hash_flush(hdev);
> - hci_dev_unlock(hdev);
> -
> - if (hdev->flush)
> - hdev->flush(hdev);
> -
> - hci_dev_clear_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
> -
> - atomic_set(&hdev->cmd_cnt, 1);
> - hdev->acl_cnt = 0;
> - hdev->sco_cnt = 0;
> - hdev->le_cnt = 0;
> - hdev->iso_cnt = 0;
> + ret = hci_dev_close_sync(hdev);
> + if (ret)
> + goto done;
>
> - ret = hci_reset_sync(hdev);
> + ret = hci_dev_open_sync(hdev);
>
> +done:
> hci_req_sync_unlock(hdev);
> return ret;
> }
>
> Seem to work here and as a side effect we get notified over the MGMT
> when a user uses hciconfig hci0 reset:
>
> # tools/hciconfig hci0 reset
> bluetoothd[34]: src/shared/mgmt.c:can_read_data() [0x0000] event 0x0006
> bluetoothd[34]: src/adapter.c:new_settings_callback() Settings: 0x01fc0ac0
> bluetoothd[34]: src/adapter.c:settings_changed() Changed settings: 0x00000001
> bluetoothd[34]: src/adapter.c:settings_changed() Pending settings: 0x00000000
> bluetoothd[34]: src/adapter.c:cancel_passive_scanning()
> bluetoothd[34]: src/adapter.c:adapter_stop() adapter /org/bluez/hci0
> has been disabled
> bluetoothd[34]: src/shared/mgmt.c:can_read_data() [0x0000] event 0x0008
> bluetoothd[34]: src/shared/mgmt.c:can_read_data() [0x0000] event 0x0006
> bluetoothd[34]: src/adapter.c:new_settings_callback() Settings: 0x01fc0ac1
> bluetoothd[34]: src/adapter.c:settings_changed() Changed settings: 0x00000001
> bluetoothd[34]: src/adapter.c:settings_changed() Pending settings: 0x00000000
> bluetoothd[34]: src/adapter.c:adapter_start() adapter /org/bluez/hci0
> has been enabled
> bluetoothd[34]: src/adapter.c:trigger_passive_scanning()
> bluetoothd[34]: [signal] org.freedesktop.DBus.Properties.PropertiesChanged

Thank you for the review, and for the suggestion! There seems to be a
number of similar syzbot reports for the hdev workqueue, so maybe the
close+open approach is a simpler solution indeed. I've originally
considered implementing a wrapper/helper for workqueue submission in
hci_core.c, but if we can eliminate the race condition altogether that'd
be even better.

My only concern is that there seem to be slight differences between
what hci_dev_do_reset() does now and what we'd get with
hci_dev_close_sync()+hci_dev_open_sync() (e.g. we wouldn't use
HCI_CMD_DRAIN_WORKQUEUE anymore). Would this alternative approach be
suitable for the stable kernels? And if not, do you think the checks
from v1 would be appropriate, in that case?