Re: [PATCH v1] Bluetooth: hci_sync: Fix disconnect complete event timeout issue
From: Cheng Jiang (IOE)
Date: Tue Dec 17 2024 - 00:51:38 EST
Hi Luiz,
On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote:
> Hi Cheng,
>
> On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
> <quic_chejiang@xxxxxxxxxxx> wrote:
>>
>> Hi Luiz,
>>
>> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
>>> Hi Cheng,
>>>
>>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
>>> <luiz.dentz@xxxxxxxxx> wrote:
>>>>
>>>> Hi Cheng,
>>>>
>>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@xxxxxxxxxxx> wrote:
>>>>>
>>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
>>>>> in time, requiring the controller to wait for the supervision timeout,
>>>>> which may exceed 2 seconds. In the current implementation, the
>>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
>>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
>>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
>>>>> indefinitely for the mgmt response to the disconnect. To recover,
>>>>> restarting bluetoothd is necessary.
>>>>>
>>>>> bluetoothctl log like this:
>>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
>>>>> Attempting to disconnect from D9:B5:6C:F2:51:91
>>>>> [Designer Mouse]#
>>>>> [Designer Mouse]# power off
>>>>> [Designer Mouse]#
>>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
>>>>>
>>>>> Signed-off-by: Cheng Jiang <quic_chejiang@xxxxxxxxxxx>
>>>>> ---
>>>>> include/net/bluetooth/hci_core.h | 2 ++
>>>>> net/bluetooth/hci_conn.c | 9 +++++++++
>>>>> net/bluetooth/hci_event.c | 9 +++++++++
>>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++
>>>>> 4 files changed, 38 insertions(+)
>>>>>
>>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
>>>>> index 734cd50cd..2ab079dcf 100644
>>>>> --- a/include/net/bluetooth/hci_core.h
>>>>> +++ b/include/net/bluetooth/hci_core.h
>>>>> @@ -753,6 +753,8 @@ struct hci_conn {
>>>>>
>>>>> struct bt_codec codec;
>>>>>
>>>>> + struct completion disc_ev_comp;
>>>>> +
>>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status);
>>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
>>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason);
>>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
>>>>> index d097e308a..e0244e191 100644
>>>>> --- a/net/bluetooth/hci_conn.c
>>>>> +++ b/net/bluetooth/hci_conn.c
>>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
>>>>>
>>>>> hci_conn_init_sysfs(conn);
>>>>>
>>>>> + /* This disc_ev_comp is inited when we send a disconnect request to
>>>>> + * the remote device but fail to receive the disconnect complete
>>>>> + * event within the expected time (2 seconds). This occurs because
>>>>> + * the remote device doesn't ack the terminate indication, forcing
>>>>> + * the controller to wait for the supervision timeout.
>>>>> + */
>>>>> + init_completion(&conn->disc_ev_comp);
>>>>> + complete(&conn->disc_ev_comp);
>>>>> +
>>>>> return conn;
>>>>> }
>>>>>
>>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
>>>>> index 2cc7a9306..60ecb2b18 100644
>>>>> --- a/net/bluetooth/hci_event.c
>>>>> +++ b/net/bluetooth/hci_event.c
>>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>>>>> if (!conn)
>>>>> goto unlock;
>>>>>
>>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
>>>>> + * hci_abort_conn_sync will wait hdev lock release to continue.
>>>>> + */
>>>>> + if (!completion_done(&conn->disc_ev_comp)) {
>>>>> + complete(&conn->disc_ev_comp);
>>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */
>>>>> + usleep_range(100, 1000);
>>>>> + }
>>>>> +
>>>>> if (ev->status) {
>>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>> conn->dst_type, ev->status);
>>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
>>>>> index 0badec712..783d04b57 100644
>>>>> --- a/net/bluetooth/hci_sync.c
>>>>> +++ b/net/bluetooth/hci_sync.c
>>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
>>>>> break;
>>>>> }
>>>>>
>>>>> + /* Check whether the connection is successfully disconnected.
>>>>> + * Sometimes the remote device doesn't acknowledge the
>>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait
>>>>> + * for the supervision timeout, which may exceed 2 seconds. In
>>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
>>>>> + * event before cleaning up the connection.
>>>>> + */
>>>>> + if (err == -ETIMEDOUT) {
>>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
>>>>> +
>>>>> + reinit_completion(&conn->disc_ev_comp);
>>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
>>>>> + bt_dev_warn(hdev, "Failed to get complete");
>>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>>>>> + conn->dst_type, conn->abort_reason);
>>>>> + }
>>>>> + }
>>>>
>>>> Why don't we just set the supervision timeout as timeout then? If we
>>>> will have to wait for it anyway just change hci_disconnect_sync to use
>>>> 10 * conn->le_supv_timeout as timeout instead.
>>>>
>> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
>> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
>> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
>> for the event. So even in hci_disconnect_sync we can set the timeout to
>> supervision timeout, hci_disconnect_sync still timeout after 2s.
>
> Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
> the supervision timeout instead? If it still timeout after to 2
> seconds then there is something still forcing HCI_CMD_TIMEOUT which
> shouldn't happen.
>
Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so
even the upper layer set to a larger timeout value, like supervision timeout,
it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is:
hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) ->
hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the
wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout.
So even if we set a large timeout value in hci_disconnect_sync, it doesn't work
since it's waked up by other event, not the real timeout.
What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits
for the disconnect complete event rather than command status event. According to
BT core spec, disconnect complete event has to wait for remote's ack or wait until
supervision timeout. It's a valid case that the disconnect complete event taking
more than 2s.
>>>> That said, we really need to fix bluetoothd if it is not able to be
>>>> cleaned up if SET_POWERED command fails, but it looks like it is
>>>> handling errors correctly so it sounds like something else is at play.
>>>
>> The issue arises after a 2-second timeout of hci_disconnect_sync. During
>> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
>> after supervision timeout, the disconnect complete event arrives, but
>> it returns at line 3370 since the connection has already been removed.
>> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
>> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
>> and unable to perform other mgmt commands.
>
> The command shall have completed regardless if disconnect complete has
> been received or not, the is the whole point of having a timeout, so
> this makes no sense to me, or you are not describing what is happening
> here. Also there is no MGMT_OP_DISCONNECT pending, it is
> MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
> that is a totally different thing and perhaps that is the source of
> the problem because if we do cleanup hci_conn even in case the command
> fails/times out then we should be generating a disconnected event as
> well.
>
Here is the flow describe the issue. For normal case:
┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐
│bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
└─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘
│ disconnect cmd │ │ │ │
│─────────────────────>│ │ │ │
│ │ │ │ │
│ │ MGMT_OP │ │ │
│ │ _DISCONNECT │ │ │
│ │───────────────────>│ │ │
│ │ │ │ │
│ │ │ HCI_Disconnect │ │
│ │ │───────────────────>│ │
│ │ │ │ │
│ │ │ │ LL_TERMINATE │
│ │ │ │ _IND │
│ │ │ │───────────────────>│
│ │ │ │ │
│ │ │ │ ACK │
│ │ │ │<───────────────────│
│ │ │ │ │
│ │ │ Disc_Comp_Evt │ │
│ │ │<───────────────────│ │
│ │ │ │ │
│ │ MGMT Response │ │ │
│ │<───────────────────│ │ │
│ │ │ │ │
│ disc succ │ │ │ │
│<─────────────────────│ │ │ │
┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐
│bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
└───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘
The failure case like this:
┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐
│bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
└─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘
│ disconnect │ │ │ │
│ cmd │ │ │ │
│─────────────────────>│ │ │ │
│ │ │ │ │
│ │ MGMT_OP_ │ │ │
│ │ DISCONNECT │ │ │
│ │───────────────────>│ │ │
│ │ │ │ │
│ │ ┌┴┐ HCI_ │ │
│ │ │ │ Disconnect │ │
│ │ │ │ ──────────────────>│ │
│ │ │ │ │ │
│ │ │ │ │ LL_TERMINATE ┌┴┐
│ │ │ │ │ _IND │ │
│ │ │ │ │─────────────────> │ │
│ │ │ │ │ │ │
│ │ │ │ 2s │ │ │
│ │ │ │ │ │ │
│ │ │ │ │ │ │ More
│ │ │ │ │ │ │ than
│ │ │ │ │ │ │ 2s
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ │ └┬┘ │ │ │
│ │ │────┐ │ │ │
│ │ │ │ hci_disconnect │ │ │
│ │ │<───┘ sync timeout, │ │ │
│ │ │ del 'con' by │ │ │
│ │ │ hci_conn_failed│ │ │
│ │ │ │ └┬┘
│ │ │ │ ACK │
│ │ │ │<───────────────────│
│ │ │ │ │
│ │ │ Disc_Comp_Evt │ │
│ │ │<────────────────────│ │
│ │ │ │ │
│ │ │────┐ │ │
│ \│/ │ │ ignore the │ │
│ X │<───┘ event since │ │
│ /│\ │ 'con' has been │ │
│ │ MGMT │ deleted │ │
│ │ Response │ │ │
│ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │
┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐
│bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│
└───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘
So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From
our test, bluetoothd can't accept any command related to mgmt from bluetothctl.
>>
>> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
>> 3356 struct sk_buff *skb)
>> 3357 {
>> 3358 struct hci_ev_disconn_complete *ev = data;
>> 3359 u8 reason;
>> 3360 struct hci_conn_params *params;
>> 3361 struct hci_conn *conn;
>> 3362 bool mgmt_connected;
>> 3363
>> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
>> 3365
>> 3366 hci_dev_lock(hdev);
>> 3367
>> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
>> 3369 if (!conn)
>> 3370 goto unlock;
>> 3371
>> 3372 if (ev->status) {
>> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
>> 3374 conn->dst_type, ev->status);
>> 3375 goto unlock;
>> 3376 }
>> 3377
>> 3378 conn->state = BT_CLOSED;
>> 3379
>> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
>> 3381
>>
>>> I double checked this and apparently this could no longer fail:
>>>
>>> + /* Disregard possible errors since hci_conn_del shall have been
>>> + * called even in case of errors had occurred since it would
>>> + * then cause hci_conn_failed to be called which calls
>>> + * hci_conn_del internally.
>>> + */
>>> + hci_abort_conn_sync(hdev, conn, reason);
>>>
>>> So it will clean up the hci_conn no matter what is the timeout, so
>>> either you don't have this change or it is not working for some
>>> reason.
>>>
>> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
>> blocked. It does not happen during the power off stage. It happened when disconnect
>> a BLE device, but the disconnect complete event sent from controller to host 2s later.
>> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
>
> There is a difference about a MGMT command, initiated by bluetoothd,
> versus a MGMT event initiated by the kernel, so the daemon is not
> blocked it just don't get a disconnection event, which is different
> than a command complete.
>
> What is the command sequence that you use to reproduce the problem?
Here is the command log:
[CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes
[bluetooth]# power on
Changing power on succeeded
[bluetooth]# connect CF:90:67:3C:7A:56
Attempting to connect to CF:90:67:3C:7A:56
[CHG] Device CF:90:67:3C:7A:56 Connected: yes
Connection successful
[CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes
[Designer Mouse]#
[Designer Mouse]# disconnect D9:B5:6C:F2:51:91
Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response
[Designer Mouse]#
[Designer Mouse]# power off
[Designer Mouse]#
Failed to set power off: org.freedesktop.DBus.Error.NoReply
To easily reproduce this issue, we use a firmware which always send the disconnect
complete event more than 2s. Then the issue occurred 100%.
Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it
relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s.
>
>>>>> hci_dev_lock(hdev);
>>>>>
>>>>> /* Check if the connection has been cleaned up concurrently */
>>>>>
>>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
>>>>> --
>>>>> 2.34.1
>>>>>
>>>>
>>>>
>>>> --
>>>> Luiz Augusto von Dentz
>>>
>>>
>>>
>>
>
>