Re: deadlock in synchronize_srcu() in debugfs?

From: Nicolai Stange
Date: Thu Mar 30 2017 - 06:27:21 EST


So, please correct me if I'm wrong, there are two problems with
indefinitely blocking debugfs files' fops:

1. The one which actually hung your system:
An indefinitely blocking debugfs_remove() while holding a lock.
Other tasks attempting to grab that same lock get stuck as well.

2. The other one you've found, namely that the locking granularity is
too coarse: a debugfs_remove() would get blocked by unrelated files'
pending fops.

AFAICS, the first one can't get resolved by simply refining the blocking
granularity: a debugfs_remove() on the indefinitely blocking file would
still block as well.

But:

On Thu, Mar 30 2017, Johannes Berg wrote:

> On Thu, 2017-03-30 at 09:32 +0200, Nicolai Stange wrote:
>>
>> I wonder if holding the RTNL during the debugfs file removal is
>> really needed. I'll try to have a look in the next couple of days.
>
> Yes, I'm pretty much convinced that it is. I considered doing a
> deferred debugfs_remove() by holding the object around, but then I
> can't be sure that I can later re-add a new object with the same
> directory name,

Ah, I see. What about making debugfs provide separate

- debugfs_remove_start(): unlink file (c.f. __debugfs_remove()), can be
called under a lock
and
- debugfs_remove_wait(): the synchronize_srcu(), must not get called
under any lock

?

This would solve 1.). It would still be nice to detect those situations,
i.e. calls to debugfs_remove() with some lock being held, though.

In short, I'd make calling debugfs_remove() with any lock being
held illegal.

What do you think?


> Half the thread here was about that - it's not easily doable because
> you'd have to teach lockdep about the special SRCU semantics first.
> Since it doesn't even seem to do read/write locks properly that's
> probably a massive undertaking.

I haven't looked into lockdep yet. So there is no way to ask lockdep
"is there any lock held?" from debugfs_remove() before doing the
synchonize_srcu()?


> I also doubt that it's useful, because even if we did flag this sort of
> situation, it can occur across very different drivers - for example the
> netronome driver using rtnl_lock() inside its debugfs files, and
> mac80211 removing a completely unrelated debugfs file within
> rtnl_lock().

I'm proposing to convert the latter to a
debugfs_remove_start()/debugfs_remove_wait() pair.


>> > Similarly, nobody should be blocking in debugfs files, like we did
>> > in ours, but also smsdvb_stats_read(), crtc_crc_open() look like
>> > they could block for quite a while.
>>
>> Blocking in the debugfs files' fops shall be fine by itself, that's
>> why SRCU is used for the removal stuff.
>
> No, it isn't fine at all now!

"Shall" in the sense of "it's a requirement" and if it isn't fulfilled,
it must be fixed. So I do agree with you here.


> If I have a debugfs file - like the one I
> had - that could block for external events (firmware, in my case), then
> *any* other debugfs_remove() in the whole system would also block
> indefinitely. That's a major problem!

Indeed.


> Ultimately, I'm not sure I see why one couldn't just have a
> reader/writer lock per *file*, which would be the ultimate granularity
> to solve this. Obviously, a blocking file has to be aborted before
> being removed itself, but there's nothing that says that you can't
> remove any other file - even from the same directory - while this one
> is in a blocking read.

When I did this, per-file reader/writer locks actuallt came to my mind
first. The problem here is that debugfs_use_file_start() must grab the
lock first and check whether the file has been deleted in the meanwhile.
But as it stands, there's nothing that would guarantee the existence of
the lock at the time it's to be taken.

Anyways, I'll have to think a little about possible solutions to mitigate
problem 2.)

Thanks,

Nicolai