Re: fsnotify_mark_srcu wtf?

From: Amir Goldstein
Date: Thu Nov 10 2016 - 15:02:39 EST


On Thu, Nov 10, 2016 at 9:46 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
>> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@xxxxxxx> wrote:
>> > On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
>> >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@xxxxxxx> wrote:
>> >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
>> >> >> We've got a report where a fanotify daemon that implements permission checks
>> >> >> screws up and doesn't send a reply. This then causes widespread hangs due to
>> >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
>> >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
>> >> >> fsnotify_mark_destroy_list() to block.
>> >> >
>> >> > Yes. But if a program implementing permission checks does not reply, your
>> >> > system is likely hosed anyway. We can only try to somewhat limit the
>> >> > damage...
>> >> >
>> >>
>> >> That was my initial thought as well, but at least with the sample code
>> >> Miklos sent
>> >> the only thing that gets hosed is the one process watching that one file.
>> >> You could think of a use case of fanotify being used to watch over files
>> >> in a specific user directory, where the damage on the entire system
>> >> should/could be limited. No?
>> >
>> > Yes, the damage could at least theoretically be limited only to those files
>> > / dirs watched by the buggy process.
>> >
>> >> >> Below program demonstrates the issue. It should output a single line:
>> >> >>
>> >> >> close(inotify_fd): success
>> >> >>
>> >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
>> >> >> the waiting permission event.
>> >> >>
>> >> >> Wouldn't making the srcu per-group fix this? Would that be too expensive?
>> >> >
>> >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
>> >> > not sure srcu would scale to that. Furthermore the SRCU protects the list
>> >> > of groups that need to get notification so it would not even be easily
>> >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
>> >> > to his patch. I'll try to find something to improve the situation but so
>> >> > far I have no good idea...
>> >> >
>> >>
>> >> Yes, very much buggy indeed :/
>> >> Anyway, the reason I drafted it quickly was to highlight the fact that the
>> >> marks only need to live to the point of decision whether or not the event
>> >> should be sent to the group and afterwards, its sufficient to grab the
>> >> group reference, without having impact on the entire system.
>> >
>> > Yes, fanotify code as such does not need the marks anymore. But the core
>> > fsnotify code does...
>> >
>> >> Yet another possible ugly (but less buggy) solution would be
>> >> to iterate all marks under SRCU read protection.
>> >> If any group is about to block (either by suggested return value
>> >> EAGAIN or another
>> >> by using a new op should_handle_event_deferred), defer event handling to post
>> >> marks iteration, by keeping a few group references on stack.
>> >
>> > And this does not work as well... Fanotify must notify groups by their
>> > priority so you cannot arbitrarily reorder ordering in which groups get
>> > notified. I'm currently pondering on using mark refcount to pin it when
>> > processing permission event but there are still some details to check.
>> >
>>
>> All right, mark refcount sound like the proper solution.
>
> Except it doesn't quite work. We can pin the current marks by a refcount
> but they can still be removed from the list so after we regain srcu lock,
> we are not sure their ->next pointers still point to still allocated marks
> :-| Sadly I realized this only after implementing all this.
>
>> In case this approach doesn't work out for some reason, you may want
>> to consider fsnotify_mark_srcu (and destroy_list) per priority.
>> Or just 2 srcu, one for for priority 0 and one for other.
>> Because priority > 0 may block and priority 0 may not block.
>>
>> The priority set on an inode/vfsmount can be easily encoded into the
>> i_fsnotify_mask/mnt_fsnotify_mask, e.g.:
>> #define FS_GROUP_PRIO_1 0x00040000 /* fanotify content
>> based access control */
>> #define FS_GROUP_PRIO_2 0x00080000 /* fanotify
>> pre-content access */
>>
>> in fsnotify(), only need to take the read side srcu of relevant priority groups,
>> but need to take extra care to set the priority bit in the inode/mnt
>> mask *before*
>> adding the mark to the list, with a relevant memory barrier before checking
>> the priority bits in fsnotify().
>
> Well but how would you like to protect the mark list hanging off the inode
> / mountpoint with two SRCUs? You'd need two lists hanging off the inode &
> mountpoint (for different priorities) and that's too big cost to pay to
> accomodate broken userspace...
>

My idea was to avoid another list on the inode and use the fact that the inode
mark mask can indicate which is the largest priority on the list.
Theoretically speaking, and I know this is playing close to fire,
if you manage find a way to start traversing the first element of the
list with the
correct set of srcu held, say (0 and 1), then you should not be bothered with
reaching higher priority marks (say 2) anymore, because those would be added to
the head of the list and never after lower priority marks.

Also, with the approach of SRCU per priority, one misbehaved fanotify permission
checker will block all other permission checkers, so it's not the best outcome,
but at least it would save all the inotify and fanotify passive observers.

Amir.