Re: bluetooth: fix deadlock on device reset and power down

From: David Herrmann
Date: Mon Apr 02 2012 - 04:44:41 EST


Hi Andrei and Alexander

On Mon, Apr 2, 2012 at 10:29 AM, Alexander Holler <holler@xxxxxxxxxxxxx> wrote:
> Am 02.04.2012 08:55, schrieb Andrei Emeltchenko:
>> Hi Alexander,
>>
>> On Sat, Mar 31, 2012 at 03:23:38PM +0200, Alexander Holler wrote:
>>> I've experienced a deadlock on shutdown using kernel 3.3 and tracked
>>> it down. Because I'm not very familiar with the bluetooth stack I'm
>>> not sure if the below patch is correct, but it fixed the problem
>>> here.
>>
>> Could you please attach deadlock dump?
>>
>>>
>>> Commit 09fd0de5bd8f8ef3317e5365f92f1a13dcd89aa9 introduced a deadlock:
>>>
>>> bluetoothd calls ioctl HCIDEVDOWN
>>>     hci_sock_ioctl()
>>>         hci_dev_close()
>>>             hci_dev_do_close()
>>>                 hci_dev_lock(hdev);
>>>                 inquiry_cache_flush();
>>>                 hci_conn_hash_flush();
>>>                     hci_conn_del()
>>>                         cancel_delayed_work_sync()
>>>                             hci_conn_timeout()
>>>                                 hci_dev_lock(hdev); /* DEADLOCK */
>>
>> I am actually not sure that hci_conn_timeout locks hdev. Why do you think
>> so?
>
> By reading the source, using printk and suffering through the deadlock.
> It's especially painful when using a bt-keyboard and systemd, because
> systemd tries 4 times (~ some minutes) to kill bluetoothd before it
> marks the service as failed and finally continues to shut down.

hci_conn_timeout does lock the device; see the source. But the problem
here is also a race condition. The do_close() code locks the device and
then cancels the pending work synchronously. However, the
hci_conn_timeout work might get started just before
cancel_delayed_work_sync() is called. It then blocks on the device lock
while cancel_delayed_work_sync() waits for it to finish. The proper fix
would probably be to release the lock before calling
cancel_delayed_work_sync(). However, we then need to make sure that the
work is not restarted while we do not hold the lock.

I think we recently introduced some flag that is set while closing a
device. How about checking that in hci_conn_timeout before acquiring
the lock?
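
To make the idea concrete, here is a rough userspace sketch, not kernel
code. It assumes a pthread mutex stands in for hci_dev_lock(), a thread
for the delayed work, and pthread_join() for cancel_delayed_work_sync().
All names (fake_dev, fake_conn_timeout, fake_dev_close, the closing
flag) are made up for illustration and are not taken from the bluetooth
stack.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct hci_dev; not kernel API. */
struct fake_dev {
    pthread_mutex_t lock;     /* plays the role of hci_dev_lock() */
    atomic_bool closing;      /* assumed "device is closing" flag */
    atomic_int timeouts_run;  /* counts timeout bodies that really ran */
};

/* Plays the role of hci_conn_timeout(): check the flag before locking. */
static void *fake_conn_timeout(void *arg)
{
    struct fake_dev *dev = arg;

    if (atomic_load(&dev->closing))
        return NULL;          /* bail out without touching dev->lock */

    pthread_mutex_lock(&dev->lock);
    atomic_fetch_add(&dev->timeouts_run, 1);
    pthread_mutex_unlock(&dev->lock);
    return NULL;
}

/*
 * Plays the role of hci_dev_do_close(). Returns the number of timeout
 * bodies that ran under the lock, or -1 if the thread could not start.
 */
static int fake_dev_close(void)
{
    struct fake_dev dev;
    pthread_t worker;
    int ran;

    pthread_mutex_init(&dev.lock, NULL);
    atomic_init(&dev.closing, false);
    atomic_init(&dev.timeouts_run, 0);

    atomic_store(&dev.closing, true);   /* set the flag first */
    pthread_mutex_lock(&dev.lock);      /* then take the device lock */

    /* The work fires while the lock is held. */
    if (pthread_create(&worker, NULL, fake_conn_timeout, &dev) != 0) {
        pthread_mutex_unlock(&dev.lock);
        pthread_mutex_destroy(&dev.lock);
        return -1;
    }

    /* Wait for the work to finish, still holding the lock. */
    pthread_join(worker, NULL);

    ran = atomic_load(&dev.timeouts_run);
    pthread_mutex_unlock(&dev.lock);
    pthread_mutex_destroy(&dev.lock);
    return ran;
}
```

Without the flag check, the pthread_join() above would never return,
because the worker would sleep on the mutex held by the closing thread.
Note that the check alone still leaves a window: work that has already
passed the flag test before the flag is set will block on the lock all
the same. The sketch does not try to close that window.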

> Just try to kill bluetoothd while a bt-mouse or bt-keyboard is connected.

Reproducible, indeed.

> But I have to admit that my patch is likely the wrong solution, as I
> think it will introduce some race conditions. Anyway, I prefer to live
> with them (the race conditions) instead of the deadlock. So for
> inclusion into the kernel a proper solution is needed.
> But as I already said, I'm not familiar with the bt-stack and don't
> know about the locking strategies inside the stack, so it's hard for me
> to find my way through the source.

Yes, your fix introduces races. We need to hold the lock there!
Applying your fix would introduce harder-to-trace bugs even during
runtime, so we need to fix this properly.

> Regards,
>
> Alexander

Thanks
David