Re: [PATCH 1/2] seccomp: notify user trap about unused filter

From: Kees Cook
Date: Wed May 27 2020 - 17:44:03 EST

On Wed, May 27, 2020 at 01:19:01PM +0200, Christian Brauner wrote:
> loop. But this is problematic since we don't get a notification when the
> seccomp filter has become unused and so we currently never remove the
> seccomp notifier fd from the event loop and just keep accumulating fds
> in the event loop. We've had this issue for a while but it has recently
> become more pressing as more and larger users make use of this.

I had been under the (seemingly very wrong) impression that when
all the tasks associated with a filter cease to exist, the notif fd is
effectively closed. But I see now where I got confused: this is only
half implemented. If the userspace end of the fd is closed, it'll get
cleaned up in the kernel, but we have nothing going the other direction
except the general object lifetime management on the filter itself.

So, yes, I accept the basic problem statement, "we have fds hanging
around that will never be used again, we need to notice that". :)

Why is EPOLLHUP needed? Can't the fd just get closed on the kernel end?
I would expect that to be visible as EPOLLHUP internally (though I
haven't looked through the VFS yet). And I haven't found how to
close/detach an anon file from the listener task. It strikes me that this
would actually be much cleaner: then we actually don't need the
additional __get_seccomp_filter() in init_listener() -- we simply
invalidate the file during __put_seccomp_filter().
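
To make the behaviour I'm expecting concrete, here is a minimal userspace
sketch. It is only an analogy: a pipe stands in for the notifier fd, and
closing the write end stands in for "the kernel end going away". The
function name is made up for illustration. A seccomp notifier is an anon
inode file with no peer end, so whether the VFS can give us the same
effect there is exactly the open question.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns 1 if closing the peer end of a pipe makes epoll report EPOLLHUP
 * on the read end, 0 if it does not, -1 on setup error.
 * Assumption: the pipe is only a stand-in for the seccomp notifier fd.
 */
static int hup_seen_after_peer_close(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, got;
	int p[2], epfd, n, ret = -1;

	if (pipe(p) < 0)
		return -1;
	epfd = epoll_create1(0);
	if (epfd < 0)
		goto out_pipe;
	ev.data.fd = p[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
		goto out_ep;

	close(p[1]);	/* the "other end" goes away */
	p[1] = -1;

	/* EPOLLHUP is always reported, even though only EPOLLIN was requested. */
	n = epoll_wait(epfd, &got, 1, 1000);
	if (n == 1 && (got.events & EPOLLHUP)) {
		/* What the event loop wants to do: drop the dead fd. */
		epoll_ctl(epfd, EPOLL_CTL_DEL, p[0], NULL);
		ret = 1;
	} else if (n >= 0) {
		ret = 0;
	}
out_ep:
	close(epfd);
out_pipe:
	close(p[0]);
	if (p[1] >= 0)
		close(p[1]);
	return ret;
}
```

If the notifier fd behaved like this, the listener could treat EPOLLHUP
as "remove me from the loop" without any seccomp-specific signalling.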

(While I'm here -- why can there be only one listener per task? The
notifications are filter-specific, not task-specific?)

> To fix this, we introduce a new "live" reference counter that tracks the
> live tasks making use of a given filter and when a notifier is
> registered waiting tasks will be notified that the filter is now empty
> by receiving a (E)POLLHUP event.
> The concept this patch introduces is the same as for signal_struct,
> i.e. reference counting for life-cycle management is decoupled from
> reference counting live tasks using the object.

I will need convincing that life-cycle ref-counting needs to be decoupled
from usage ref-counting. I see what you're saying here and in the other
reply about where the notification is coming from (release vs put, etc),
but I think it'd be better if the EPOLLHUP was handled internally to the
VFS due to the kernel end of the file being closed.
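
For reference, this is how I read the two-counter proposal, as a toy
userspace model. All names here (toy_filter, refs, users, hup) are
hypothetical and not taken from the patch. The "hup" flag just stands in
for waking pollers with EPOLLHUP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names come from the patch. */
struct toy_filter {
	int refs;	/* lifetime references: object is freed at zero */
	int users;	/* live tasks using the filter */
	bool hup;	/* stands in for waking pollers with EPOLLHUP */
};

/* A task installs the filter: one lifetime ref, one live user. */
static struct toy_filter *toy_alloc(void)
{
	struct toy_filter *f = calloc(1, sizeof(*f));

	if (f) {
		f->refs = 1;
		f->users = 1;
	}
	return f;
}

/* A listener fd pins the object without counting as a live user. */
static void toy_get(struct toy_filter *f)
{
	f->refs++;
}

/* Drop a lifetime ref; returns true if the object was freed. */
static bool toy_put(struct toy_filter *f)
{
	assert(f->refs > 0);
	if (--f->refs)
		return false;
	free(f);
	return true;
}

/* A task exits: drop its user count first, then its lifetime ref. */
static bool toy_task_exit(struct toy_filter *f)
{
	assert(f->users > 0);
	if (--f->users == 0)
		f->hup = true;	/* last user gone: notify the listener */
	return toy_put(f);
}
```

With one task and one listener, toy_task_exit() sets hup while the
listener's reference keeps the object alive, and the listener's later
toy_put() frees it. My question is whether the single lifetime count plus
closing the file could get the same result.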

> There's probably some trickery possible but the second counter is just
> the correct way of doing this imho and has precedent. The patch also
> lifts the waitqueue from struct notification into struct
> seccomp_filter. This is cleaner overall and lets us avoid having to
> take the notifier mutex since we neither need to read nor modify the
> notifier specific aspects of the seccomp filter. In the exit path I'd
> very much like to avoid having to take the notifier mutex for each
> filter in the task's filter hierarchy.

I guess this is a minor size/speed trade-off (every seccomp_filter
struct grows by 1 pointer regardless of the presence of USER_NOTIF
rules attached...). But I think this is an optimization detail, and I
need to understand why we can't just close the file on filter free.

Kees Cook