Re: fsnotify_mark_srcu wtf?

From: Amir Goldstein
Date: Wed Nov 09 2016 - 13:29:10 EST


On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
>> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@xxxxxxx> wrote:
>> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
>> >> We've got a report where a fanotify daemon that implements permission checks
>> >> screws up and doesn't send a reply. This then causes widespread hangs due to
>> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
>> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
>> >> fsnotify_mark_destroy_list() to block.
>> >
>> > Yes. But if a program implementing permission checks does not reply, your
>> > system is likely hosed anyway. We can only try to somewhat limit the
>> > damage...
>> >
>>
>> That was my initial thought as well, but at least with the sample code
>> Miklos sent
>> the only thing that gets hosed is the one process watching that one file.
>> You could think of a use case of fanotify being used to watch over files
>> in a specific user directory, where the damage on the entire system
>> should/could be limited. No?
>
> Yes, the damage could at least theoretically be limited only to those files
> / dirs watched by the buggy process.
>
>> >> Below program demonstrates the issue. It should output a single line:
>> >>
>> >> close(inotify_fd): success
>> >>
>> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
>> >> the waiting permission event.
>> >>
>> >> Wouldn't making the srcu per-group fix this? Would that be too expensive?
>> >
>> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
>> > not sure srcu would scale to that. Furthermore the SRCU protects the list
>> > of groups that need to get notification so it would not even be easily
>> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
>> > to his patch. I'll try to find something to improve the situation but so
>> > far I have no good idea...
>> >
>>
>> Yes, very much buggy indeed :/
>> Anyway, the reason I drafted it quickly was to highlight the fact that the
>> marks only need to live to the point of decision whether or not the event
>> should be sent to the group and afterwards, its sufficient to grab the
>> group reference, without having impact on the entire system.
>
> Yes, fanotify code as such does not need the marks anymore. But the core
> fsnotify code does...
>
>> Yet another possible ugly (but less buggy) solution would be
>> to iterate all marks under SRCU read protection.
>> If any group is about to block (either by suggested return value
>> EAGAIN or another
>> by using a new op should_handle_event_deferred), defer event handling to post
>> marks iteration, by keeping a few group references on stack.
>
> And this does not work as well... Fanotify must notify groups by their
> priority so you cannot arbitrarily reorder ordering in which groups get
> notified. I'm currently pondering on using mark refcount to pin it when
> processing permission event but there are still some details to check.
>

All right, mark refcount sound like the proper solution.

In case this approach doesn't work out for some reason, you may want
to consider fsnotify_mark_srcu (and destroy_list) per priority.
Or just 2 srcu, one for for priority 0 and one for other.
Because priority > 0 may block and priority 0 may not block.

The priority set on an inode/vfsmount can be easily encoded into the
i_fsnotify_mask/mnt_fsnotify_mask, e.g.:
#define FS_GROUP_PRIO_1 0x00040000 /* fanotify content
based access control */
#define FS_GROUP_PRIO_2 0x00080000 /* fanotify
pre-content access */

in fsnotify(), only need to take the read side srcu of relevant priority groups,
but need to take extra care to set the priority bit in the inode/mnt
mask *before*
adding the mark to the list, with a relevant memory barrier before checking
the priority bits in fsnotify().

Amir.