Re: fsnotify_mark_srcu wtf?
From: Jan Kara
Date: Thu Nov 10 2016 - 14:46:39 EST
On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@xxxxxxx> wrote:
> > On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
> >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@xxxxxxx> wrote:
> >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
> >> >> We've got a report where a fanotify daemon that implements permission checks
> >> >> screws up and doesn't send a reply. This then causes widespread hangs due to
> >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
> >> >> fsnotify_mark_destroy_list() to block.
> >> >
> >> > Yes. But if a program implementing permission checks does not reply, your
> >> > system is likely hosed anyway. We can only try to somewhat limit the
> >> > damage...
> >> >
> >>
> >> That was my initial thought as well, but at least with the sample code
> >> Miklos sent
> >> the only thing that gets hosed is the one process watching that one file.
> >> You could think of a use case of fanotify being used to watch over files
> >> in a specific user directory, where the damage on the entire system
> >> should/could be limited. No?
> >
> > Yes, the damage could at least theoretically be limited only to those files
> > / dirs watched by the buggy process.
> >
> >> >> Below program demonstrates the issue. It should output a single line:
> >> >>
> >> >> close(inotify_fd): success
> >> >>
> >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> >> >> the waiting permission event.
> >> >>
> >> >> Wouldn't making the srcu per-group fix this? Would that be too expensive?
> >> >
> >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
> >> > not sure srcu would scale to that. Furthermore the SRCU protects the list
> >> > of groups that need to get notification so it would not even be easily
> >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
> >> > to his patch. I'll try to find something to improve the situation but so
> >> > far I have no good idea...
> >> >
> >>
> >> Yes, very much buggy indeed :/
> >> Anyway, the reason I drafted it quickly was to highlight the fact that the
> >> marks only need to live to the point of decision whether or not the event
> >> should be sent to the group and afterwards, its sufficient to grab the
> >> group reference, without having impact on the entire system.
> >
> > Yes, fanotify code as such does not need the marks anymore. But the core
> > fsnotify code does...
> >
> >> Yet another possible ugly (but less buggy) solution would be
> >> to iterate all marks under SRCU read protection.
> >> If any group is about to block (either by suggested return value
> >> EAGAIN or another
> >> by using a new op should_handle_event_deferred), defer event handling to post
> >> marks iteration, by keeping a few group references on stack.
> >
> > And this does not work as well... Fanotify must notify groups by their
> > priority so you cannot arbitrarily reorder ordering in which groups get
> > notified. I'm currently pondering on using mark refcount to pin it when
> > processing permission event but there are still some details to check.
> >
>
> All right, mark refcount sound like the proper solution.
Except it doesn't quite work. We can pin the current marks by a refcount
but they can still be removed from the list so after we regain srcu lock,
we are not sure their ->next pointers still point to still allocated marks
:-| Sadly I realized this only after implementing all this.
> In case this approach doesn't work out for some reason, you may want
> to consider fsnotify_mark_srcu (and destroy_list) per priority.
> Or just 2 srcu, one for for priority 0 and one for other.
> Because priority > 0 may block and priority 0 may not block.
>
> The priority set on an inode/vfsmount can be easily encoded into the
> i_fsnotify_mask/mnt_fsnotify_mask, e.g.:
> #define FS_GROUP_PRIO_1 0x00040000 /* fanotify content
> based access control */
> #define FS_GROUP_PRIO_2 0x00080000 /* fanotify
> pre-content access */
>
> in fsnotify(), only need to take the read side srcu of relevant priority groups,
> but need to take extra care to set the priority bit in the inode/mnt
> mask *before*
> adding the mark to the list, with a relevant memory barrier before checking
> the priority bits in fsnotify().
Well but how would you like to protect the mark list hanging off the inode
/ mountpoint with two SRCUs? You'd need two lists hanging off the inode &
mountpoint (for different priorities) and that's too big cost to pay to
accomodate broken userspace...
Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR