Re: [DRAFT][PATCH] fanotify: lift pidfd reporting restrictions

From: Amir Goldstein

Date: Fri Jun 05 2026 - 04:34:35 EST

On Thu, Jun 4, 2026 at 9:47 PM AnonymeMeow <anonymemeow@xxxxxxxxx> wrote:
>
> The FAN_REPORT_PIDFD and FAN_REPORT_TID flags used to be mutually
> exclusive because by the time the pidfd support was introduced to
> fanotify, pidfds could only be created for thread group leaders. Now
> that the pidfd API supports thread-specific pidfds via PIDFD_THREAD,
> this restriction can be lifted.
>
> Fanotify used to refuse to report pidfds for reaped tasks by applying a
> pid_has_task() check before calling pidfd_prepare(). This prevented
> userspace from obtaining information about the task.
>
> Register the event pid with pidfs when creating the fanotify event if
> pidfd reporting was requested, so pidfd_prepare() can later create a
> pidfd for the reaped task.
>
> Signed-off-by: AnonymeMeow <anonymemeow@xxxxxxxxx>
> ---
>
> On 2026-06-03 12:00 +0200, Amir Goldstein wrote:
> >
> > Thanks for noticing the oddity and explaining the options.
> >
> > We could fail the event in case on ENOMEM because the system
> > is likely is a bad shape anyway and other events are likely to fail
> > allocation themselves, so I am not objecting to your patch as is,
>
> The existing behavior of fanotify is that, for groups with unlimited
> event queues, event allocation is performed with __GFP_NOFAIL and
> therefore must not fail. For groups with limited queues, event allocation
> uses __GFP_RETRY_MAYFAIL, which avoids entering the OOM path.
>
> Currently, pidfs_register_pid() uses hardcoded GFP_KERNEL. For non-costly
> allocations(pid attr falls into this category, am I right?), this is
> almost no-fail, but it will not avoid the OOM path. That means it can
> bypass fanotify's OOM-killer avoidance mitigation for limited queues
> and potentially trigger the OOM killer in the monitoring memcg under
> memory pressure, breaking the intended security isolation, as sashiko
> mentioned. In the meanwhile, GFP_KERNEL does not guarantee no-fail in
> all corner cases, so for unlimited queues, it can still cause events
> to be dropped under memory pressure, this is also reported by sashiko.
>
> Since this patch set is meant to be a functional enhancement, I think
> dropping events from an unlimited queue is not acceptable and can be
> a regression.
>
> The cleanest solution I can come up with is something like this patch.
> (I squashed the commits so that sashiko can apply the patch correctly)
> It adds a pidfs registration helper that takes a gfp_t argument, and
> fanotify registers the pid before allocating the event. With an unlimited
> queue, both pid registration and event allocation are no-fail, the
> existing behavior is preserved. With a limited queue, if pid registration
> fails with ENOMEM, fanotify simply returns NULL and drops the event,
> just like it would do if event allocation itself failed with
> __GFP_RETRY_MAYFAIL.
>
> This also avoids having to destroy an already allocated event on pid
> registration failure. Since the pid reference is only taken when
> assigning the event->pid field, there is no extra put_pid() path needed
> either. And this guarantees that whenever event allocation succeeds,
> the pid has already been registered with pidfs, so userspace can
> reliably get a pidfd later. The API semantics are no longer ambiguous.
>
> I also originally thought pidfs_register_pid() did not need to run in
> the target memcg because it does not explicitly use __GFP_ACCOUNT.
> However, pidfs_attr_cachep is created with SLAB_ACCOUNT, so the
> registration should also happen inside the set_active_memcg() scope.
> This version does that.
>
> But I don't know whether this amount of churn is justified for a
> relatively small feature enhancement. For the next revision, I am
> still inclined to send the simpler best-effort registration approach,
> unless reviewers think this stricter version is preferable.

For the record, this is not "churn":
> fs/pidfs.c | 23 +++++++++++++++++----
"churn" is a patch that touches many files/lines, clutters git blame history
and impares backport of fixes.

It's a new interface which the fanotify caller needs to operate correctly.
The only question is whether it is worth the energy of discussing it ;)
but since we did that already, if Christian is on board with some form
of pidfs_register_pid_gfp() (maybe just change the few callers), then
I think this patch represents the most clean and correct way of dealing
with FAN_REPORT_PIDFD.

Thanks,
Amir

>
> In the best-effort version, fanotify would allocate the event first
> and then try to register the pid with pidfs using the existing
> GFP_KERNEL flag. If registration fails, the event is still queued.
> This does mean that pidfd creation at read time is not guarenteed to
> succeed if the task has already been reaped, and it also means the
> registration allocation may enter the OOM path even though the event
> allocation used __GFP_RETRY_MAYFAIL. But I think accepting that very
> small corner cases may be preferable if it avoids most of the
> additional code churn.
>
> With Best Regards,
> Anonymemeow
>
> >
> > but I am slightly leaning towards the semantically ambiguous API.
> >
> > I don't think that we are going to change the man page to say that
> > FAN_REPORT_PIDFD is guaranteed to return a pidfd, maybe we just
> > add a NOTE about increased likelihood of getting a pidfd since kernel XXX
> > and mention the known cases of pid namespace and ENOMEM.
> >
> > Thanks,
> > Amir.
> >
>
> ---
> fs/notify/fanotify/fanotify.c | 17 +++++++++------
> fs/notify/fanotify/fanotify_user.c | 33 +++++++-----------------------
> fs/pidfs.c | 23 +++++++++++++++++----
> include/linux/pidfs.h | 3 +++
> 4 files changed, 40 insertions(+), 36 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 38290b9c07f7..ece9523e775b 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -14,6 +14,7 @@
> #include <linux/sched/mm.h>
> #include <linux/statfs.h>
> #include <linux/stringhash.h>
> +#include <linux/pidfs.h>
>
> #include "fanotify.h"
>
> @@ -842,6 +843,15 @@ static struct fanotify_event *fanotify_alloc_event(
> /* Whoever is interested in the event, pays for the allocation. */
> old_memcg = set_active_memcg(group->memcg);
>
> + if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
> + pid = task_pid(current);
> + else
> + pid = task_tgid(current);
> +
> + if (FAN_GROUP_FLAG(group, FAN_REPORT_PIDFD) &&
> + pidfs_register_pid_gfp(pid, gfp))
> + goto out;
> +
> if (fanotify_is_perm_event(mask)) {
> event = fanotify_alloc_perm_event(data, data_type, gfp);
> } else if (fanotify_is_error_event(mask)) {
> @@ -863,15 +873,10 @@ static struct fanotify_event *fanotify_alloc_event(
> if (!event)
> goto out;
>
> - if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
> - pid = get_pid(task_pid(current));
> - else
> - pid = get_pid(task_tgid(current));
> -
> /* Mix event info, FAN_ONDIR flag and pid into event merge key */
> hash ^= hash_long((unsigned long)pid | ondir, FANOTIFY_EVENT_HASH_BITS);
> fanotify_init_event(event, hash, mask);
> - event->pid = pid;
> + event->pid = get_pid(pid);
>
> out:
> set_active_memcg(old_memcg);
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index ae904451dfc0..b604e3da58ad 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -19,6 +19,7 @@
> #include <linux/memcontrol.h>
> #include <linux/statfs.h>
> #include <linux/exportfs.h>
> +#include <linux/pidfd.h>
>
> #include <asm/ioctls.h>
>
> @@ -903,25 +904,13 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> metadata.fd = fd >= 0 ? fd : FAN_NOFD;
>
> if (pidfd_mode) {
> - /*
> - * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> - * exclusion is ever lifted. At the time of incoporating pidfd
> - * support within fanotify, the pidfd API only supported the
> - * creation of pidfds for thread-group leaders.
> - */
> - WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> + unsigned int pidfd_flags = PIDFD_STALE;
>
> - /*
> - * The PIDTYPE_TGID check for an event->pid is performed
> - * preemptively in an attempt to catch out cases where the event
> - * listener reads events after the event generating process has
> - * already terminated. Depending on flag FAN_REPORT_FD_ERROR,
> - * report either -ESRCH or FAN_NOPIDFD to the event listener in
> - * those cases with all other pidfd creation errors reported as
> - * the error code itself or as FAN_EPIDFD.
> - */
> - if (metadata.pid && pid_has_task(event->pid, PIDTYPE_TGID))
> - pidfd = pidfd_prepare(event->pid, 0, &pidfd_file);
> + if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
> + pidfd_flags |= PIDFD_THREAD;
> +
> + if (metadata.pid)
> + pidfd = pidfd_prepare(event->pid, pidfd_flags, &pidfd_file);
>
> if (!FAN_GROUP_FLAG(group, FAN_REPORT_FD_ERROR) && pidfd < 0)
> pidfd = pidfd == -ESRCH ? FAN_NOPIDFD : FAN_EPIDFD;
> @@ -1628,14 +1617,6 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
> #endif
> return -EINVAL;
>
> - /*
> - * A pidfd can only be returned for a thread-group leader; thus
> - * FAN_REPORT_PIDFD and FAN_REPORT_TID need to remain mutually
> - * exclusive.
> - */
> - if ((flags & FAN_REPORT_PIDFD) && (flags & FAN_REPORT_TID))
> - return -EINVAL;
> -
> /* Don't allow mixing mnt events with inode events for now */
> if (flags & FAN_REPORT_MNT) {
> if (class != FAN_CLASS_NOTIF)
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 1cce4f34a051..e9f6113efc5b 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -991,14 +991,16 @@ static void pidfs_put_data(void *data)
> }
>
> /**
> - * pidfs_register_pid - register a struct pid in pidfs
> + * pidfs_register_pid_gfp - register a struct pid in pidfs with custom GFP
> + * flags
> * @pid: pid to pin
> + * @gfp: GFP flags for memory allocation
> *
> - * Register a struct pid in pidfs.
> + * Register a struct pid in pidfs with custom GFP flags.
> *
> * Return: On success zero, on error a negative error code is returned.
> */
> -int pidfs_register_pid(struct pid *pid)
> +int pidfs_register_pid_gfp(struct pid *pid, gfp_t gfp)
> {
> struct pidfs_attr *new_attr __free(kfree) = NULL;
> struct pidfs_attr *attr;
> @@ -1014,7 +1016,7 @@ int pidfs_register_pid(struct pid *pid)
> if (attr)
> return 0;
>
> - new_attr = kmem_cache_zalloc(pidfs_attr_cachep, GFP_KERNEL);
> + new_attr = kmem_cache_zalloc(pidfs_attr_cachep, gfp);
> if (!new_attr)
> return -ENOMEM;
>
> @@ -1031,6 +1033,19 @@ int pidfs_register_pid(struct pid *pid)
> return 0;
> }
>
> +/**
> + * pidfs_register_pid - register a struct pid in pidfs
> + * @pid: pid to pin
> + *
> + * Register a struct pid in pidfs.
> + *
> + * Return: On success zero, on error a negative error code is returned.
> + */
> +int pidfs_register_pid(struct pid *pid)
> +{
> + return pidfs_register_pid_gfp(pid, GFP_KERNEL);
> +}
> +
> static struct dentry *pidfs_stash_dentry(struct dentry **stashed,
> struct dentry *dentry)
> {
> diff --git a/include/linux/pidfs.h b/include/linux/pidfs.h
> index 416bdff4d6ce..8054f49bc139 100644
> --- a/include/linux/pidfs.h
> +++ b/include/linux/pidfs.h
> @@ -2,6 +2,8 @@
> #ifndef _LINUX_PID_FS_H
> #define _LINUX_PID_FS_H
>
> +#include <linux/types.h>
> +
> struct coredump_params;
>
> struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags);
> @@ -14,6 +16,7 @@ void pidfs_exit(struct task_struct *tsk);
> void pidfs_coredump(const struct coredump_params *cprm);
> #endif
> extern const struct dentry_operations pidfs_dentry_operations;
> +int pidfs_register_pid_gfp(struct pid *pid, gfp_t gfp);
> int pidfs_register_pid(struct pid *pid);
> void pidfs_free_pid(struct pid *pid);
>
> --
> 2.54.0
>