Re: [PATCH v2] blktrace: Fix potentail deadlock between delete & sysfs ops

From: Waiman Long
Date: Thu Aug 17 2017 - 17:27:31 EST


On 08/17/2017 05:10 PM, Steven Rostedt wrote:
> On Thu, 17 Aug 2017 12:24:39 -0400
> Waiman Long <longman@xxxxxxxxxx> wrote:
>
>>>> + * sysfs file and then acquiring the bd_mutex. Deleting a block device
>>>> + * requires acquiring the bd_mutex and then waiting for all the sysfs
>>>> + * references to be gone. This can lead to deadlock if both operations
>>>> + * happen simultaneously. To avoid this problem, read/write to the
>>>> + * the tracing sysfs files can now fail if the bd_mutex cannot be
>>>> + * acquired while a deletion operation is in progress.
>>>> + *
>>>> + * A mutex trylock loop is used assuming that tracing sysfs operations
>>> A mutex trylock loop is not enough to stop a deadlock. But I'm guessing
>>> the undocumented bd_deleting may prevent that.
>> Yes, that is what the bd_deleting flag is for.
>>
>> I was thinking about failing the sysfs operation after a certain number
>> of trylock attempts, but then it will require changes to user space code
>> to handle the occasional failures. Finally I decided to fail it only
>> when a delete operation is in progress as all hopes are lost in this case.
> Actually, why not just fail the attr read on deletion? If it is being
> deleted, and one is reading the attribute, perhaps -ENODEV is the
> proper return value?
>
>> The root cause is the lock inversion under this circumstance. I think
>> modifying the blk_trace code has the least impact overall. I agree that
>> the code is ugly. If you have a better suggestion, I will certainly like
>> to hear it.
> Instead of playing games with taking the lock, the only way this race
> is hit, is if the partition is being deleted and the sysfs attribute is
> being read at the same time, correct? In that case, just return
> -ENODEV, and be done with it.
>
> -- Steve

It is actually what the patch is trying to do by checking for the
deletion flag in the mutex_trylock loop. Please note that mutex does not
guarantee FIFO ordering of lock acquisition. As a result, cpu1 may call
mutex_lock() and wait for it while cpu2 can set the deletion flag later
and get the mutex first before cpu1. So checking for the deletion flag
before taking the mutex isn't enough.

Cheers,
Longman