Re: [PATCH] fanotify: remove redundant capable(CAP_SYS_ADMIN)s

From: Amir Goldstein
Date: Thu May 23 2019 - 09:19:28 EST


On Thu, May 23, 2019 at 2:58 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
>
> On Thu, May 23, 2019 at 02:40:39PM +0300, Amir Goldstein wrote:
> > On Thu, May 23, 2019 at 1:42 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> > >
> > > On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:
> > > > On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > > > > > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > > > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > > > > > ><christian@xxxxxxxxxx> wrote:
> > > > > > > >>
> > > > > > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > > > > > >> fanotify_init().
> > > > > > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > > > > > >at the
> > > > > > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > > > > > >needed.
> > > > > > > >
> > > > > > > >It's intentional:
> > > > > > > >
> > > > > > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > > > > > >Author: Eric Paris <eparis@xxxxxxxxxx>
> > > > > > > >Date: Thu Oct 28 17:21:57 2010 -0400
> > > > > > > >
> > > > > > > > fanotify: limit the number of marks in a single fanotify group
> > > > > > > >
> > > > > > > >There is currently no limit on the number of marks a given fanotify
> > > > > > > >group
> > > > > > > >can have. Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > > > > > >as
> > > > > > > >a serious DoS threat. This patch implements a default of 8192, the
> > > > > > > >same as
> > > > > > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > > > > > >eliminating
> > > > > > > > the default DoS'able status.
> > > > > > > >
> > > > > > > > Signed-off-by: Eric Paris <eparis@xxxxxxxxxx>
> > > > > > > >
> > > > > > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > > > > > >There is no reason that fanotify could not be used by unprivileged
> > > > > > > >users
> > > > > > > >to setup inotify style watch on an inode or directories children, see:
> > > > > > > >https://patchwork.kernel.org/patch/10668299/
> > > > > > > >
> > > > > > > >>
> > > > > > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > > > > > >depth")
> > > > > > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > > > > > >marks")
> > > > > > > >
> > > > > > > >Fixes is used to tag bug fixes for stable.
> > > > > > > >There is no bug.
> > > > > > > >
> > > > > > > >Thanks,
> > > > > > > >Amir.
> > > > > > >
> > > > > > > Interesting. When do you think the gate can be removed?
> > > > > >
> > > > > > Nobody is working on this AFAIK.
> > > > > > What I posted was a simple POC, but I have no use case for this.
> > > > > > In the patchwork link above, Jan has listed the prerequisites for
> > > > > > removing the gate.
> > > > > >
> > > > > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > > > > When events gets reported with fid instead of fd, unprivileged user
> > > > > > (hopefully) cannot use fid for privilege escalation.
> > > > > >
> > > > > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > > > > non-initial userns it's a no-no
> > > > > > > since we support nested workloads.
> > > > > >
> > > > > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > > > > fanotify vs. using inotify.
> > > > > > So what was the reason you were looking into switching from inotify to fanotify?
> > > > > > Is it because of mount/filesystem watch? Because making those available for
> > > > >
> > > > > Yeah. Well, I would need to look but you could probably do it safely for
> > > > > filesystems mountable in user namespaces (which are few).
> > > > > Can you do a bind-mount and then place a watch on the bind-mount or is
> > > > > this superblock based?
> > > > >
> > > >
> > > > Either.
> > > > FAN_MARK_MOUNT was there from day 1 of fanotify.
> > > > FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> > > >
> > > > But directory modification events that are supported since v5.1 are
> > > > not available
> > > > with FAN_MARK_MOUNT, see:
> > >
> > > Because you're worried about unprivileged users spying on events? Or
> > > something else?
> >
> > Something else. The current fsnotify_move/create/delete() VFS hooks
> > have no path/mount information, so it is not possible to filter them by
> > mount only by inode/sb.
> > Fixing that would not be trivial, but first a strong use case would need
> > to be presented.
> >
> > > Because if you can do a bind-mount there's nothing preventing an
> > > unprivileged user to do a hand-rolled recursive inotify that would
> > > amount to the same thing anyway.
> >
> > There is. unprivileged user cannot traverse into directories it is not
> > allowed to read/search.
>
> Right, I should've mentioned: when you're userns root and you have
> access to all files. The part that is interesting to me is getting rid
> of capable(CAP_SYS_ADMIN).

Indeed. so part of removing the gated capable(CAP_SYS_ADMIN)
is figuring out the permission checks needed for individual features.
I agree that for FAN_MARK_MOUNT/FILESYSTEM,
capabale(CAP_SYS_ADMIN) is too strong.
ns_capable(sb->s_user_ns, CAP_DAC_READ_SEARCH)
is probably enough.

>
> >
> > > (And btw, v5.1 really is a major step forward and I would really like to
> > > use this api tbh.)
> > >
> >
> > You haven't answered my question. What is the reason you are interested
> > in the new API? What does it provide that the old API does not?
> > I know the 2 APIs differ. I just want to know which difference interests *you*,
> > because without a strong use case, it will be hard for me to make progress
> > upstream.
> >
> > Is what you want really a "bind-mount" watch or a "subtree watch"?
> > The distinction is important. I am thinking about solutions for the latter,
> > although there is no immediate solution in the horizon - only ideas.
>
> Both cases would be interesting. But subtree watch is what would
> probably help a lot already. So let me explain.
> For LXD - not sure if you know what that is -
I do

> we allow user to "hotplug"
> mounts or certain whitelisted devices into a user namespace container.
> One of the nifty features is that we let users specify a "required"
> property. When "required" is "false" the user can give us a path, e.g.
> /bla/bla/bla/target and then we place a watch on the closest existing
> ancestor of my-device. When the target shows up we hotplug it for the
> user. Now, as you imagine maintaining that cache until "target" shows up
> is a royal pain.

You lost me there. Are you looking for notifications when device files appear?
When directory is created? Please give a concrete example.
What part of /bla/bla/bla/target appears, when and how.
fanotify does not give notifications when mounts are mounted.
I have seen a proposal by David Howells for mount change notifications.

Thanks,
Amir.