Re: [PATCH 1/2] seccomp: notify user trap about unused filter

From: Christian Brauner
Date: Thu May 28 2020 - 10:17:06 EST

On Wed, May 27, 2020 at 06:59:54PM -0700, Kees Cook wrote:
> On Thu, May 28, 2020 at 01:16:46AM +0200, Christian Brauner wrote:
> > I'm also starting to think this isn't even possible or currently doable
> > safely.
> > The fdtable in the kernel would end up with a dangling pointer, I would
> > think. Unless you backtrack all fds that still have a reference into the
> > fdtable and refer to that file and close them all in the kernel which I
> > don't think is possible and also sounds very dodgy. This also really
> > seems like we would be breaking a major contract, namely that fds stay
> > valid until userspace calls close, execve(), or exits.
> Right, I think I was just using the wrong words? I was looking at it
> like a pipe, or a socket, where you still have an fd, but reads return
> 0, you might get SIGPIPE, etc. The VFS clearly knows what a
> "disconnected" fd is, and I had assumed there was general logic for it
> to indicate "I'm not here any more".
> I recently did something very similar to the pstore filesystem, but I got
> to cheat with some massive subsystem locks. In that case I needed to clear
> all the inodes out of the tmpfs, so I unlink them all and manage the data
> lifetimes pointing back into the (waiting to be unloaded) backend module
> by NULLing the pointer back, which is safe because of the how the locking
> there happens to work. Any open readers, when they close, will have the
> last ref count dropped, at which point the record itself is released too.
> Back to the seccomp subject: should "all tasks died" be distinguishable
> from "I can't find that notification" in the ioctl()? (i.e. is ENOENT
> sufficient, or does there need to be an EIO or ESRCH there?)

I personally think it's fine as it is but as it might help users if we
reported ESRCH something like the patch below might do.
Actual cleanup of the notifier should still happen in
seccomp_notify_release() imho, and not in __poll_t both conceptually and
also because f_op->release() happens on finaly fput() which punts it to
task_work() which finishes when the task returns from kernel mode (or
exits) - or - if the task is not alive anymore just puts it on the
kernel global workqueue which is perfect for non-high-priority cleanup
stuff. It's better than making __poll_t heavier than it needs to be.
Unless there's an obvious reason not to.

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 9fa642d6d549..e6fa03cc4840 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1221,6 +1221,9 @@ static __poll_t seccomp_notify_poll(struct file *file,

poll_wait(file, &filter->wqh, poll_tab);

+ if (refcount_read(&filter->live) == 0)
+ ret |= EPOLLHUP;
if (mutex_lock_interruptible(&filter->notify_lock) < 0)
return EPOLLERR;

@@ -1231,13 +1234,17 @@ static __poll_t seccomp_notify_poll(struct file *file,
if ((ret & EPOLLIN) && (ret & EPOLLOUT))
+ if ((ret & EPOLLHUP) && cur->state != SECCOMP_NOTIFY_REPLIED) {
+ knotif->state = SECCOMP_NOTIFY_REPLIED;
+ knotif->error = -ESRCH;
+ knotif->val = 0;
+ complete(&knotif->ready);
+ }


- if (refcount_read(&filter->live) == 0)
- ret |= EPOLLHUP;
return ret;