Re: [PATCH v1] Bluetooth: hci_sync: Fix disconnect complete event timeout issue
From: Cheng Jiang (IOE)
Date: Wed Dec 18 2024 - 03:24:01 EST
Hi Luiz,
On 12/17/2024 11:13 PM, Luiz Augusto von Dentz wrote:
> Hi Cheng,
>
> On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE)
> <quic_chejiang@xxxxxxxxxxx> wrote:
>>
>> Hi Luiz,
>>
>> On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote:
>>> Hi Cheng,
>>>
>>> On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
>>> <quic_chejiang@xxxxxxxxxxx> wrote:
>>>>
>>>> Hi Luiz,
>>>>
>>>> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
>>>>> Hi Cheng,
>>>>>
>>>>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
>>>>> <luiz.dentz@xxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi Cheng,
>>>>>>
>>>>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
>>>>>>> in time, requiring the controller to wait for the supervision timeout,
>>>>>>> which may exceed 2 seconds. In the current implementation, the
>>>>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
>>>>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
>>>>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
>>>>>>> indefinitely for the mgmt response to the disconnect. To recover,
>>>>>>> restarting bluetoothd is necessary.
>>>>>>>
>>>>>>> bluetoothctl log like this:
>>>>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
>>>>>>> Attempting to disconnect from D9:B5:6C:F2:51:91
>>>>>>> [Designer Mouse]#
>>>>>>> [Designer Mouse]# power off
>>>>>>> [Designer Mouse]#
>>>>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
>>>>>>>
>>>>>>> Signed-off-by: Cheng Jiang <quic_chejiang@xxxxxxxxxxx>
>>>>>>> ---
>>>>>>> include/net/bluetooth/hci_core.h | 2 ++
>>>>>>> net/bluetooth/hci_conn.c | 9 +++++++++
>>>>>>> net/bluetooth/hci_event.c | 9 +++++++++
>>>>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++
>>>>>>> 4 files changed, 38 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
>>>>>>> index 734cd50cd..2ab079dcf 100644
>>>>>>> --- a/include/net/bluetooth/hci_core.h
>>>>>>> +++ b/include/net/bluetooth/hci_core.h
>>>>>>> @@ -753,6 +753,8 @@ struct hci_conn {
>>>>>>>
>>>>>>> struct bt_codec codec;
>>>>>>>
>>>>>>> + struct completion disc_ev_comp;
>>>>>>> +
>>>>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status);
>>>>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
>>>>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason);
>>>>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
>>>>>>> index d097e308a..e0244e191 100644
>>>>>>> --- a/net/bluetooth/hci_conn.c
>>>>>>> +++ b/net/bluetooth/hci_conn.c
>>>>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
>>>>>>>
>>>>>>> hci_conn_init_sysfs(conn);
>>>>>>>
>>>>>>> + /* This disc_ev_comp is inited when we send a disconnect request to
>>>>>>> + * the remote device but fail to receive the disconnect complete
>>>>>>> + * event within the expected time (2 seconds). This occurs because
>>>>>>> + * the remote device doesn't ack the terminate indication, forcing
>>>>>>> + * the controller to wait for the supervision timeout.
>>>>>>> + */
>>>>>>> + init_completion(&conn->disc_ev_comp);
>>>>>>> + complete(&conn->disc_ev_comp);
>>>>>>> +
>>>>>>> return conn;
>>>>>>> }
>>>>>>>
>>>>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
>>>>>>> index 2cc7a9306..60ecb2b18 100644
>>>>>>> --- a/net/bluetooth/hci_event.c
>>>>>>> +++ b/net/bluetooth/hci_event.c
>>>>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>>>>>>> if (!conn)
>>>>>>> goto unlock;
>>>>>>>
>>>>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
>>>>>>> + * hci_abort_conn_sync will wait hdev lock release to continue.
>>>>>>> + */
>>>>>>> + if (!completion_done(&conn->disc_ev_comp)) {
>>>>>>> + complete(&conn->disc_ev_comp);
>>>>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */
>>>>>>> + usleep_range(100, 1000);
>>>>>>> + }
>>>>>>> +
>>>>>>> if (ev->status) {
>>>>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>>>> conn->dst_type, ev->status);
>>>>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
>>>>>>> index 0badec712..783d04b57 100644
>>>>>>> --- a/net/bluetooth/hci_sync.c
>>>>>>> +++ b/net/bluetooth/hci_sync.c
>>>>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
>>>>>>> break;
>>>>>>> }
>>>>>>>
>>>>>>> + /* Check whether the connection is successfully disconnected.
>>>>>>> + * Sometimes the remote device doesn't acknowledge the
>>>>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait
>>>>>>> + * for the supervision timeout, which may exceed 2 seconds. In
>>>>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
>>>>>>> + * event before cleaning up the connection.
>>>>>>> + */
>>>>>>> + if (err == -ETIMEDOUT) {
>>>>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
>>>>>>> +
>>>>>>> + reinit_completion(&conn->disc_ev_comp);
>>>>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
>>>>>>> + bt_dev_warn(hdev, "Failed to get complete");
>>>>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>>>> + conn->dst_type, conn->abort_reason);
>>>>>>> + }
>>>>>>> + }
>>>>>>
>>>>>> Why don't we just set the supervision timeout as timeout then? If we
>>>>>> will have to wait for it anyway just change hci_disconnect_sync to use
>>>>>> 10 * conn->le_supv_timeout as timeout instead.
>>>>>>
>>>> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
>>>> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
>>>> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
>>>> for the event. So even in hci_disconnect_sync we can set the timeout to
>>>> supervision timeout, hci_disconnect_sync still timeout after 2s.
>>>
>>> Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
>>> the supervision timeout instead? If it still timeout after to 2
>>> seconds then there is something still forcing HCI_CMD_TIMEOUT which
>>> shouldn't happen.
>>>
>> Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so
>> even the upper layer set to a larger timeout value, like supervision timeout,
>> it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is:
>> hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) ->
>> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the
>> wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout.
>>
>> So even if we set a large timeout value in hci_disconnect_sync, it doesn't work
>> since it's waked up by other event, not the real timeout.
>>
>> What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits
>> for the disconnect complete event rather than command status event. According to
>> BT core spec, disconnect complete event has to wait for remote's ack or wait until
>> supervision timeout. It's a valid case that the disconnect complete event taking
>> more than 2s.
>
> You seems to be confusing the role of 2 different timers:
>
> err = wait_event_interruptible_timeout(hdev->req_wait_q,
> hdev->req_status != HCI_REQ_PEND,
> timeout);
>
> and
>
> queue_delayed_work(hdev->workqueue, &hdev->cmd_timer,
> HCI_CMD_TIMEOUT);
>
> The former waits for a specific event, while the later waits for
> handle_cmd_cnt_and_timer, each can have a distinct timeout as in the
> code bellow:
>
> return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN,
> plen, data,
> HCI_EV_LE_ENHANCED_CONN_COMPLETE,
> conn->conn_timeout, NULL);
>
> The reason there are 2 timers is that we need to track the number of
> commands outstanding in the controller, and no you can't delay Command
> Status:
>
> When the Controller receives the HCI_Disconnect command, it _shall_ send the
> HCI_Command_Status event to the Host.
>
> So even if HCI_Disconnection_Complete is delayed the following shall
> still work provided the HCI_Command_Status is still being generated
> accordingly:
>
> index 0badec7120ab..0ab607fb6433 100644
> --- a/net/bluetooth/hci_sync.c
> +++ b/net/bluetooth/hci_sync.c
> @@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev
> *hdev, struct hci_conn *conn,
> return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT,
> sizeof(cp), &cp,
> HCI_EV_DISCONN_COMPLETE,
> - HCI_CMD_TIMEOUT, NULL);
> + conn->disc_timeout, NULL);
>
> return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp,
> HCI_CMD_TIMEOUT);
>
> Then we need to adjust conn->disc_timeout according to supervision
> timeout, that said it appears disc_timeout is actually acting as idle
> timeout, so perhaps we need a separate field, also it doesn't look
> like we track the supervision timeout for non-LE links.
>
Yes, you are right. I mixed up the two timers. Thank you for the detail
explanation.
I just found you have fixed this issue by 227a0cdf4a028a73dc256d0f5144b4808d718893.
Link: https://github.com/bluez/bluez/issues/932
We are using an old codebase. Sorry for this.
BTW, do you think we should change the timeout value here? Set the timeout value
based on the link type.
@@ -5419,6 +5419,7 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn,
u8 reason)
{
struct hci_cp_disconnect cp;
+ u32 disc_timeout;
if (test_bit(HCI_CONN_BIG_CREATED, &conn->flags)) {
/* This is a BIS connection, hci_conn_del will
@@ -5440,11 +5441,18 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn,
* used when suspending or powering off, where we don't want to wait
* for the peer's response.
*/
- if (reason != HCI_ERROR_REMOTE_POWER_OFF)
+
+ if (reason != HCI_ERROR_REMOTE_POWER_OFF) {
+ if (conn->type == LE_LINK)
+ disc_timeout = msecs_to_jiffies(10 * conn->le_supv_timeout);
+ else
+ disc_timeout = conn->disc_timeout;
+
return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT,
sizeof(cp), &cp,
HCI_EV_DISCONN_COMPLETE,
- HCI_CMD_TIMEOUT, NULL);
+ disc_timeout, NULL);
+ }
>>>>>> That said, we really need to fix bluetoothd if it is not able to be
>>>>>> cleaned up if SET_POWERED command fails, but it looks like it is
>>>>>> handling errors correctly so it sounds like something else is at play.
>>>>>
>>>> The issue arises after a 2-second timeout of hci_disconnect_sync. During
>>>> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
>>>> after supervision timeout, the disconnect complete event arrives, but
>>>> it returns at line 3370 since the connection has already been removed.
>>>> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
>>>> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
>>>> and unable to perform other mgmt commands.
>>>
>>> The command shall have completed regardless if disconnect complete has
>>> been received or not, the is the whole point of having a timeout, so
>>> this makes no sense to me, or you are not describing what is happening
>>> here. Also there is no MGMT_OP_DISCONNECT pending, it is
>>> MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
>>> that is a totally different thing and perhaps that is the source of
>>> the problem because if we do cleanup hci_conn even in case the command
>>> fails/times out then we should be generating a disconnected event as
>>> well.
>>>
>> Here is the flow describe the issue. For normal case:
>> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐
>> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
>> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘
>> │ disconnect cmd │ │ │ │
>> │─────────────────────>│ │ │ │
>> │ │ │ │ │
>> │ │ MGMT_OP │ │ │
>> │ │ _DISCONNECT │ │ │
>> │ │───────────────────>│ │ │
>> │ │ │ │ │
>> │ │ │ HCI_Disconnect │ │
>> │ │ │───────────────────>│ │
>> │ │ │ │ │
>> │ │ │ │ LL_TERMINATE │
>> │ │ │ │ _IND │
>> │ │ │ │───────────────────>│
>> │ │ │ │ │
>> │ │ │ │ ACK │
>> │ │ │ │<───────────────────│
>> │ │ │ │ │
>> │ │ │ Disc_Comp_Evt │ │
>> │ │ │<───────────────────│ │
>> │ │ │ │ │
>> │ │ MGMT Response │ │ │
>> │ │<───────────────────│ │ │
>> │ │ │ │ │
>> │ disc succ │ │ │ │
>> │<─────────────────────│ │ │ │
>> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐
>> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
>> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘
>>
>>
>> The failure case like this:
>>
>> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐
>> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
>> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘
>> │ disconnect │ │ │ │
>> │ cmd │ │ │ │
>> │─────────────────────>│ │ │ │
>> │ │ │ │ │
>> │ │ MGMT_OP_ │ │ │
>> │ │ DISCONNECT │ │ │
>> │ │───────────────────>│ │ │
>> │ │ │ │ │
>> │ │ ┌┴┐ HCI_ │ │
>> │ │ │ │ Disconnect │ │
>> │ │ │ │ ──────────────────>│ │
>> │ │ │ │ │ │
>> │ │ │ │ │ LL_TERMINATE ┌┴┐
>> │ │ │ │ │ _IND │ │
>> │ │ │ │ │─────────────────> │ │
>> │ │ │ │ │ │ │
>> │ │ │ │ 2s │ │ │
>> │ │ │ │ │ │ │
>> │ │ │ │ │ │ │ More
>> │ │ │ │ │ │ │ than
>> │ │ │ │ │ │ │ 2s
>> │ │ │ │ │ │ │
>> │ │ │ │ │ │ │
>> │ │ │ │ │ │ │
>> │ │ └┬┘ │ │ │
>> │ │ │────┐ │ │ │
>> │ │ │ │ hci_disconnect │ │ │
>> │ │ │<───┘ sync timeout, │ │ │
>> │ │ │ del 'con' by │ │ │
>> │ │ │ hci_conn_failed│ │ │
>> │ │ │ │ └┬┘
>> │ │ │ │ ACK │
>> │ │ │ │<───────────────────│
>> │ │ │ │ │
>> │ │ │ Disc_Comp_Evt │ │
>> │ │ │<────────────────────│ │
>> │ │ │ │ │
>> │ │ │────┐ │ │
>> │ \│/ │ │ ignore the │ │
>> │ X │<───┘ event since │ │
>> │ /│\ │ 'con' has been │ │
>> │ │ MGMT │ deleted │ │
>> │ │ Response │ │ │
>> │ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │
>> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐
>> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
>> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘
>>
>>
>> So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From
>> our test, bluetoothd can't accept any command related to mgmt from bluetothctl.
>>
>>>>
>>>> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>>>> 3356 struct sk_buff *skb)
>>>> 3357 {
>>>> 3358 struct hci_ev_disconn_complete *ev = data;
>>>> 3359 u8 reason;
>>>> 3360 struct hci_conn_params *params;
>>>> 3361 struct hci_conn *conn;
>>>> 3362 bool mgmt_connected;
>>>> 3363
>>>> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
>>>> 3365
>>>> 3366 hci_dev_lock(hdev);
>>>> 3367
>>>> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
>>>> 3369 if (!conn)
>>>> 3370 goto unlock;
>>>> 3371
>>>> 3372 if (ev->status) {
>>>> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>> 3374 conn->dst_type, ev->status);
>>>> 3375 goto unlock;
>>>> 3376 }
>>>> 3377
>>>> 3378 conn->state = BT_CLOSED;
>>>> 3379
>>>> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
>>>> 3381
>>>>
>>>>> I double checked this and apparently this could no longer fail:
>>>>>
>>>>> + /* Disregard possible errors since hci_conn_del shall have been
>>>>> + * called even in case of errors had occurred since it would
>>>>> + * then cause hci_conn_failed to be called which calls
>>>>> + * hci_conn_del internally.
>>>>> + */
>>>>> + hci_abort_conn_sync(hdev, conn, reason);
>>>>>
>>>>> So it will clean up the hci_conn no matter what is the timeout, so
>>>>> either you don't have this change or it is not working for some
>>>>> reason.
>>>>>
>>>> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
>>>> blocked. It does not happen during the power off stage. It happened when disconnect
>>>> a BLE device, but the disconnect complete event sent from controller to host 2s later.
>>>> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
>>>
>>> There is a difference about a MGMT command, initiated by bluetoothd,
>>> versus a MGMT event initiated by the kernel, so the daemon is not
>>> blocked it just don't get a disconnection event, which is different
>>> than a command complete.
>>>
>>> What is the command sequence that you use to reproduce the problem?
>> Here is the command log:
>> [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes
>> [bluetooth]# power on
>> Changing power on succeeded
>> [bluetooth]# connect CF:90:67:3C:7A:56
>> Attempting to connect to CF:90:67:3C:7A:56
>> [CHG] Device CF:90:67:3C:7A:56 Connected: yes
>> Connection successful
>> [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes
>> [Designer Mouse]#
>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
>> Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response
>> [Designer Mouse]#
>> [Designer Mouse]# power off
>> [Designer Mouse]#
>> Failed to set power off: org.freedesktop.DBus.Error.NoReply
>>
>> To easily reproduce this issue, we use a firmware which always send the disconnect
>> complete event more than 2s. Then the issue occurred 100%.
>>
>> Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it
>> relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s.
>>
>>>
>>>>>>> hci_dev_lock(hdev);
>>>>>>>
>>>>>>> /* Check if the connection has been cleaned up concurrently */
>>>>>>>
>>>>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
>>>>>>> --
>>>>>>> 2.34.1
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Luiz Augusto von Dentz
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>