Re: INFO: task hung in fuse_reverse_inval_entry

From: Dmitry Vyukov
Date: Mon Jul 23 2018 - 08:47:09 EST


On Mon, Jul 23, 2018 at 2:33 PM, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>>>> On Mon, Jul 23, 2018 at 9:59 AM, syzbot
>>>> <syzbot+bb6d800770577a083f8c@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>> Hello,
>>>>>
>>>>> syzbot found the following crash on:
>>>>>
>>>>> HEAD commit: d72e90f33aa4 Linux 4.18-rc6
>>>>> git tree: upstream
>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=1324f794400000
>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=68af3495408deac5
>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bb6d800770577a083f8c
>>>>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>>>>> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=11564d1c400000
>>>>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=16fc570c400000
>>>>
>>>>
>>>> Hi fuse maintainers,
>>>>
>>>> We are seeing a bunch of such deadlocks in fuse on syzbot. As far as I
>>>> understand this is mostly working-as-intended (parts about deadlocks
>>>> in Documentation/filesystems/fuse.txt). The intended way to resolve
>>>> this is aborting connections via fusectl, right?
>>>
>>> Yes. Alternative is with "umount -f".
>>>
>>>> The doc says "Under
>>>> the fuse control filesystem each connection has a directory named by a
>>>> unique number". The question is: if I start a process and this process
>>>> can mount fuse, how do I kill it? I mean: totally and certainly get
>>>> rid of it right away? How do I find these unique numbers for the
>>>> mounts it created?
>>>
>>> It is the device number found in st_dev for the mount. Other than
>>> doing stat(2) it is possible to find out the device number by reading
>>> /proc/$PID/mountinfo (third field).
>>
>> Thanks. I will try to figure out fusectl connection numbers and see if
>> it's possible to integrate aborting into syzkaller.
>>
>>>> Taking into account that there is usually no
>>>> operator attached to each server, I wonder if kernel could somehow
>>>> auto-abort fuse on kill?
>>>
>>> Depends on what the fuse server is sleeping on. If it's trying to
>>> acquire an inode lock (e.g. unlink(2)), which is classical way to
>>> deadlock a fuse filesystem, then it will go into an uninterruptible
>>> sleep. There's no way in which that process can be killed except to
>>> force a release of the offending lock, which can only be done by
>>> aborting the request that is being performed while holding that lock.
>>
>> I understand that it is not killed today, but I am asking if we can
>> make it killable. It's all code that we can change, and if a human
>> operator can do it, it can be done pure programmatically on kill too,
>> right?
>
> Hmm, you mean if a process is in an uninterruptible sleep trying to
> acquire a lock on a fuse filesystem and is killed, then the fuse
> filesystem should be aborted?
>
> Even if we'd manage to implement that, it's a large backward
> incompatibility risk.
>
> I don't argue that it can be done, but I would definitely argue *if*
> it should be done.


I understand that we should abort only if we are sure that it's
actually deadlocked and there is no other way.
So if a fuse-user process is blocked on a fuse lock, then we probably
should do nothing. However, if the fuse-server is killed, then perhaps
we could abort the connection at that point. Namely, if a process that
has a fuse fd open is killed and it is the only process that shared
this fd, then we could abort the connection on arrival of the kill
signal (rather than wait until all its threads finish and then start
closing all fds; this is where we get the deadlock -- some of its
threads won't finish). I don't know if such a synchronous kill hook is
available, though. If several processes shared the same fuse fd, then
we could close the fd in each process on SIGKILL arrival, then when
all of these processes are killed, fuse fd will be closed and we can
abort the connection, which will un-deadlock all of these processes.
Does this look at all reasonable?
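
For reference, the manual abort path discussed above (device number
from the third field of /proc/$PID/mountinfo, then the matching fusectl
directory) could be sketched roughly as below. This is only a sketch
under assumptions: fusectl is mounted at /sys/fs/fuse/connections, the
connection directory name equals the minor number of the mount's device
(fuse mounts normally sit on anonymous devices with major 0), and the
helper names are made up for illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Parse the third field ("major:minor") of one /proc/$PID/mountinfo line.
 * Returns 0 on success, -1 if the line does not have the expected shape.
 */
int mountinfo_dev(const char *line, unsigned *maj, unsigned *min)
{
	/* Fields: mount ID, parent ID, major:minor, root, mount point, ... */
	return sscanf(line, "%*d %*d %u:%u", maj, min) == 2 ? 0 : -1;
}

/*
 * Build the path of the fusectl "abort" file for a connection.
 * ASSUMPTION: fusectl is mounted at /sys/fs/fuse/connections and the
 * directory name is the minor number of the mount's device.
 */
int fuse_abort_path(unsigned conn, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/fs/fuse/connections/%u/abort", conn);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/*
 * Abort the connection. Documentation/filesystems/fuse.txt says that
 * writing anything into the "abort" file aborts the connection.
 */
int fuse_abort(unsigned conn)
{
	char path[64];
	ssize_t r;
	int fd;

	if (fuse_abort_path(conn, path, sizeof(path)))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	r = write(fd, "1", 1);
	close(fd);
	return r == 1 ? 0 : -1;
}
```

A caller would walk /proc/$PID/mountinfo for the process in question,
pick the lines whose filesystem type is fuse, and call fuse_abort() with
the minor number of each. It still needs somebody to run it, which is
exactly the operator problem above.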