Re: [PATCH] fanotify: remove redundant capable(CAP_SYS_ADMIN)s

From: Christian Brauner
Date: Thu May 23 2019 - 09:38:18 EST


On Thu, May 23, 2019 at 04:16:24PM +0300, Amir Goldstein wrote:
> On Thu, May 23, 2019 at 2:58 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> >
> > On Thu, May 23, 2019 at 02:40:39PM +0300, Amir Goldstein wrote:
> > > On Thu, May 23, 2019 at 1:42 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, May 23, 2019 at 01:25:08PM +0300, Amir Goldstein wrote:
> > > > > On Thu, May 23, 2019 at 12:55 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, May 22, 2019 at 11:00:22PM +0300, Amir Goldstein wrote:
> > > > > > > On Wed, May 22, 2019 at 9:57 PM Christian Brauner <christian@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On May 22, 2019 8:29:37 PM GMT+02:00, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > > > > > > > >On Wed, May 22, 2019 at 7:32 PM Christian Brauner
> > > > > > > > ><christian@xxxxxxxxxx> wrote:
> > > > > > > > >>
> > > > > > > > >> This removes two redundant capable(CAP_SYS_ADMIN) checks from
> > > > > > > > >> fanotify_init().
> > > > > > > > >> fanotify_init() guards the whole syscall with capable(CAP_SYS_ADMIN)
> > > > > > > > >at the
> > > > > > > > >> beginning. So the other two capable(CAP_SYS_ADMIN) checks are not
> > > > > > > > >needed.
> > > > > > > > >
> > > > > > > > >It's intentional:
> > > > > > > > >
> > > > > > > > >commit e7099d8a5a34d2876908a9fab4952dabdcfc5909
> > > > > > > > >Author: Eric Paris <eparis@xxxxxxxxxx>
> > > > > > > > >Date: Thu Oct 28 17:21:57 2010 -0400
> > > > > > > > >
> > > > > > > > > fanotify: limit the number of marks in a single fanotify group
> > > > > > > > >
> > > > > > > > >There is currently no limit on the number of marks a given fanotify
> > > > > > > > >group
> > > > > > > > >can have. Since fanotify is gated on CAP_SYS_ADMIN this was not seen
> > > > > > > > >as
> > > > > > > > >a serious DoS threat. This patch implements a default of 8192, the
> > > > > > > > >same as
> > > > > > > > >inotify to work towards removing the CAP_SYS_ADMIN gating and
> > > > > > > > >eliminating
> > > > > > > > > the default DoS'able status.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Eric Paris <eparis@xxxxxxxxxx>
> > > > > > > > >
> > > > > > > > >There idea is to eventually remove the gated CAP_SYS_ADMIN.
> > > > > > > > >There is no reason that fanotify could not be used by unprivileged
> > > > > > > > >users
> > > > > > > > >to setup inotify style watch on an inode or directories children, see:
> > > > > > > > >https://patchwork.kernel.org/patch/10668299/
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> Fixes: 5dd03f55fd2 ("fanotify: allow userspace to override max queue
> > > > > > > > >depth")
> > > > > > > > >> Fixes: ac7e22dcfaf ("fanotify: allow userspace to override max
> > > > > > > > >marks")
> > > > > > > > >
> > > > > > > > >Fixes is used to tag bug fixes for stable.
> > > > > > > > >There is no bug.
> > > > > > > > >
> > > > > > > > >Thanks,
> > > > > > > > >Amir.
> > > > > > > >
> > > > > > > > Interesting. When do you think the gate can be removed?
> > > > > > >
> > > > > > > Nobody is working on this AFAIK.
> > > > > > > What I posted was a simple POC, but I have no use case for this.
> > > > > > > In the patchwork link above, Jan has listed the prerequisites for
> > > > > > > removing the gate.
> > > > > > >
> > > > > > > One of the prerequisites is FAN_REPORT_FID, which is now merged.
> > > > > > > When events gets reported with fid instead of fd, unprivileged user
> > > > > > > (hopefully) cannot use fid for privilege escalation.
> > > > > > >
> > > > > > > > I was looking into switching from inotify to fanotify but since it's not usable from
> > > > > > > > non-initial userns it's a no-no
> > > > > > > > since we support nested workloads.
> > > > > > >
> > > > > > > One of Jan's questions was what is the benefit of using inotify-compatible
> > > > > > > fanotify vs. using inotify.
> > > > > > > So what was the reason you were looking into switching from inotify to fanotify?
> > > > > > > Is it because of mount/filesystem watch? Because making those available for
> > > > > >
> > > > > > Yeah. Well, I would need to look but you could probably do it safely for
> > > > > > filesystems mountable in user namespaces (which are few).
> > > > > > Can you do a bind-mount and then place a watch on the bind-mount or is
> > > > > > this superblock based?
> > > > > >
> > > > >
> > > > > Either.
> > > > > FAN_MARK_MOUNT was there from day 1 of fanotify.
> > > > > FAN_MARK_FILESYSTEM was merged to Linux Linux 4.20.
> > > > >
> > > > > But directory modification events that are supported since v5.1 are
> > > > > not available
> > > > > with FAN_MARK_MOUNT, see:
> > > >
> > > > Because you're worried about unprivileged users spying on events? Or
> > > > something else?
> > >
> > > Something else. The current fsnotify_move/create/delete() VFS hooks
> > > have no path/mount information, so it is not possible to filter them by
> > > mount only by inode/sb.
> > > Fixing that would not be trivial, but first a strong use case would need
> > > to be presented.
> > >
> > > > Because if you can do a bind-mount there's nothing preventing an
> > > > unprivileged user to do a hand-rolled recursive inotify that would
> > > > amount to the same thing anyway.
> > >
> > > There is. unprivileged user cannot traverse into directories it is not
> > > allowed to read/search.
> >
> > Right, I should've mentioned: when you're userns root and you have
> > access to all files. The part that is interesting to me is getting rid
> > of capable(CAP_SYS_ADMIN).
>
> Indeed. so part of removing the gated capable(CAP_SYS_ADMIN)
> is figuring out the permission checks needed for individual features.
> I agree that for FAN_MARK_MOUNT/FILESYSTEM,
> capabale(CAP_SYS_ADMIN) is too strong.
> ns_capable(sb->s_user_ns, CAP_DAC_READ_SEARCH)
> is probably enough.
>
> >
> > >
> > > > (And btw, v5.1 really is a major step forward and I would really like to
> > > > use this api tbh.)
> > > >
> > >
> > > You haven't answered my question. What is the reason you are interested
> > > in the new API? What does it provide that the old API does not?
> > > I know the 2 APIs differ. I just want to know which difference interests *you*,
> > > because without a strong use case, it will be hard for me to make progress
> > > upstream.
> > >
> > > Is what you want really a "bind-mount" watch or a "subtree watch"?
> > > The distinction is important. I am thinking about solutions for the latter,
> > > although there is no immediate solution in the horizon - only ideas.
> >
> > Both cases would be interesting. But subtree watch is what would
> > probably help a lot already. So let me explain.
> > For LXD - not sure if you know what that is -
> I do
>
> > we allow user to "hotplug"
> > mounts or certain whitelisted devices into a user namespace container.
> > One of the nifty features is that we let users specify a "required"
> > property. When "required" is "false" the user can give us a path, e.g.
> > /bla/bla/bla/target and then we place a watch on the closest existing
> > ancestor of my-device. When the target shows up we hotplug it for the
> > user. Now, as you imagine maintaining that cache until "target" shows up
> > is a royal pain.
>
> You lost me there. Are you looking for notifications when device files appear?

Just when that file is created. fanotify doesn't need to do anything
special for device files. As long as a file (device or
otherwise) and directory creation/deletion triggers an event we're good.
I can then just stat the file and check that it's the device I expect.

> When directory is created? Please give a concrete example.
> What part of /bla/bla/bla/target appears, when and how.

So let's say the user tells me:
- When the "/A/B/C/target" file appears on the host filesystem,
please give me access to "target" in the container at a path I tell
you.
What I do right now is listen for the creation of the "target" file.
But at the time the user gives me instructions to listen for
"/A/B/C/target" only /A might exist and so I currently add a watch on A/
and then wait for the creation of B/, then wait for the creation of C/
and finally for the creation of "target" (Of course, I also need to
handle B/ and C/ being removed again an recreated and so on.). It would
be helpful, if I could specify, give me notifications, recursively for
e.g. A/ without me having to place extra watches on B/ and C/ when they
appear. Maybe that's out of scope...

> fanotify does not give notifications when mounts are mounted.
> I have seen a proposal by David Howells for mount change notifications.

David's going to send this out soon, I think.

Thanks!
Christian